“Selfie”: Novel Method Improves Image Model Accuracy Through Self-Supervised Pre-training

Researchers from Google Brain have proposed a novel pre-training technique called Selfie, which applies the concept of masked language modeling to images.

Arguing that language modeling and language model pre-training have been revolutionized by BERT and its bidirectional embeddings learned through masked language modeling, the researchers generalized this concept to learn image embeddings.

In their proposed method, they introduce a self-supervised pre-training approach for generating image embeddings. The method masks out patches in an image and learns to pick the correct patch for each masked location from among distractor patches taken from the same image.

The method uses a convolutional neural network to extract features from the unmasked patches, which an attention module then summarizes before predicting the masked patches.
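The objective can be sketched at toy scale. The snippet below is our own illustration, not the authors' implementation: patch features (which in the real model come from the shared CNN encoder) are pooled into a context summary (a simple average standing in for the attention module), and the candidate whose features best match that summary is selected. Training would apply a cross-entropy loss over these scores.

```python
# Toy sketch of a Selfie-style pretraining objective (illustration only):
# summarize the visible patches, then pick the true patch for a masked
# position from a set of distractor patches.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def summarize(visible_patches):
    # Stand-in for the attention pooling network: average the patch features.
    n = len(visible_patches)
    return [sum(p[i] for p in visible_patches) / n
            for i in range(len(visible_patches[0]))]

def pick_patch(visible_patches, candidates):
    # Score each candidate against the context summary; the training loss
    # would be a cross-entropy over these scores.
    context = summarize(visible_patches)
    scores = [dot(context, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

# Visible patch features (hypothetical encoder outputs).
visible = [[1.0, 0.0], [0.9, 0.1]]
# Candidate fillers: a distractor and the true patch from the same image.
candidates = [[-1.0, 0.5], [1.0, 0.05]]
print(pick_patch(visible, candidates))  # 1: the best-matching patch
```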

Researchers showed that these embeddings drastically improve classification accuracy. They conducted a number of experiments evaluating how the proposed embedding method affects the performance of different models and with different amounts of labeled data.


How Selfie pre-training works.

They showed that Selfie provides consistent improvements across all datasets compared to training in a standard supervised setting. CIFAR-10, ImageNet 32 and ImageNet 224 were the datasets used, and the researchers designed specific experiments varying the portion of labeled data (from 5% all the way to 100%).

Additionally, researchers observed that the proposed pre-training method improves the training stability of ResNet-50.

This method shows how an idea that was a breakthrough in Natural Language Processing can also be applied in Computer Vision and significantly improve model performance. More about Selfie can be read in the pre-print paper published on arXiv.

Mesh R-CNN Detects Objects and Estimates 3D Shape From Images

Researchers from Facebook AI Research have proposed Mesh R-CNN – a neural network model that detects objects in images and outputs a 3D shape in the form of triangle mesh for each object.

The proposed system combines advances from several computer vision tasks. Mesh R-CNN is based on the Mask R-CNN object detection model, which it augments with the ability to produce 3D shapes for the detected objects.



The input to the proposed method is a single RGB image, just as in Mask R-CNN. The architecture of the object detection branch is also the same, containing a backbone network and a region-proposal network. The novelty is the voxel branch, which takes the aligned features and estimates a coarse 3D voxelization of a detected object.

The coarse cubified mesh from the voxel branch is passed through a graph convolution network acting as a mesh refinement branch, which outputs the final, more precise triangle mesh.
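The voxel-to-mesh conversion ("cubify") can be illustrated with a minimal sketch. This is our simplification, not the paper's implementation: each occupied voxel contributes a cube, and faces shared between two occupied voxels are dropped, so only the outer surface survives.

```python
# Minimal sketch of the "cubify" idea (illustration only): turn an occupancy
# grid into cube faces, keeping only faces not shared by two occupied voxels.

def cubify_faces(occupied):
    """occupied: set of (x, y, z) voxel coordinates.
    Returns the exposed faces, each as (voxel, outward normal)."""
    normals = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
               (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    faces = []
    for (x, y, z) in occupied:
        for (dx, dy, dz) in normals:
            # A face is exposed if the neighboring voxel is empty.
            if (x + dx, y + dy, z + dz) not in occupied:
                faces.append(((x, y, z), (dx, dy, dz)))
    return faces

# Two adjacent voxels hide one internal face each: 12 - 2 = 10 faces remain.
print(len(cubify_faces({(0, 0, 0), (1, 0, 0)})))  # 10
```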

The architecture of Mesh R-CNN.


Researchers evaluated the mesh predictor separately on ShapeNet, and the full Mesh R-CNN on the Pix3D dataset, for 3D shape prediction from natural images. They report that Mesh R-CNN shows promising results in 3D shape estimation from images.

Mesh R-CNN represents the first model that jointly performs object detection and 3D shape estimation from in-the-wild images. More details about the method can be found in the paper published on arXiv.

Google’s Neural Network Model Generates Realistic High-Fidelity Videos

Researchers from Google Research proposed a novel method for generating realistic, high-fidelity natural videos.

In the past several years, we have witnessed the progress of generative models like GANs (Generative Adversarial Networks) and VAEs (variational autoencoders) towards generating realistic images.

However, due to the increased complexity of video data, those models have struggled to generate realistic videos. Several problems arise when trying to make generative models work well for video, and the complex methods researchers have designed to tackle them have fallen short most of the time.

In a novel paper, named “Scaling Autoregressive Video Models”, researchers from Google propose an autoregressive model which can generate realistic videos.

The proposed model is a simple autoregressive video generation model based on a 3D self-attention mechanism. In essence, the researchers generalized the (originally) one-dimensional Transformer model: they represent videos as three-dimensional spatio-temporal volumes and apply self-attention over them.
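The key bookkeeping in such a model is turning the 3D volume into an ordered token sequence so that autoregressive masking applies. A small sketch (our illustration of the general idea, not the paper's exact ordering or architecture):

```python
# Treat a video as a spatio-temporal volume flattened into a token sequence
# in time-major raster order, so an autoregressive Transformer can predict
# each position from the earlier ones.

def flatten_index(t, h, w, H, W):
    # Position of voxel (t, h, w) in the flattened sequence.
    return (t * H + h) * W + w

def causal_mask_allows(query_pos, key_pos):
    # Autoregressive self-attention: a position may only attend to
    # positions at or before it in the flattened order.
    return key_pos <= query_pos

T, H, W = 2, 2, 2                     # a tiny 2-frame, 2x2 video
pos = flatten_index(1, 0, 1, H, W)    # frame 1, row 0, col 1
print(pos)                            # 5
print(causal_mask_allows(pos, flatten_index(0, 1, 1, H, W)))  # True
```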


The proposed 3D Transformer model.

Researchers evaluated the proposed model using two datasets: the BAIR Robot Pushing dataset and the YouTube Kinetics dataset. The results showed that the scaled autoregressive model is able to produce diverse and realistic frames as continuations of videos from the Kinetics dataset. According to the researchers, this work represents the first application of video generation models to high-complexity (natural) videos.

Google Research Football: New Reinforcement Learning Environment for Training Football Playing Agents

Google AI announced the release of a new reinforcement learning environment – Google Research Football.

Following the successes of deep reinforcement learning in playing computer games, researchers have focused on creating virtual environments for training RL agents. Many of these games, such as Atari, Dota 2 or StarCraft II, are quite challenging, and RL environments for them have allowed new ideas and algorithms in reinforcement learning to be developed quickly.

Now Google AI is releasing Google Research Football Environment. The new environment allows training Reinforcement Learning agents to play the world’s most popular sport – football.

The new RL environment has a physics-based 3D engine and is modeled after popular football video games such as FIFA and Pro Evolution Soccer. Within the environment, researchers created a set of problems called Football Benchmarks and a “Football Academy” containing a set of RL scenarios.



The code of the Google Research Football Environment was open-sourced and is available on GitHub. Researchers also released results of two state-of-the-art reinforcement learning algorithms to serve as baselines and references in the environment.

Does Object Recognition Work for Everyone?

Researchers from Facebook AI Research (FAIR) have performed an interesting experiment analyzing how object recognition systems perform for people coming from different countries and with different income levels.

To perform such a study, the researchers used the Dollar Street image dataset, which contains images of common household items from all over the world. The dataset was collected by photographers to emphasize the differences in everyday life among people with different income levels.

Five publicly available and powerful object recognition systems were included in this study: Microsoft Azure, Clarifai, Google Cloud Vision, Amazon Rekognition and IBM Watson.

Within the study, researchers tried to find a correlation between the accuracy of object recognition and the income levels and they tried to compare the accuracy across different regions of the world.

They found that all the systems perform poorly on household items in countries with low household income. In fact, for all systems, they found that the accuracy increases as the monthly consumption income increases.

The accuracy as a function of the household monthly income.

To explain this, researchers analyzed the results of all models and concluded that the object recognition systems fail for two main reasons: differences in object appearance and differences in context. That is, the systems fail because common objects look different across regions and the contexts in which they appear differ as well.

More fundamentally, object recognition systems fail because the datasets they were trained on are biased, containing images mostly from Western countries. The researchers plot the geo-distribution of popular object recognition datasets to demonstrate this.

Images of household items from different places in the world and predictions from object recognition systems.

This study is part of a bigger project on fairness in machine learning, and it suggests that further work is still needed to make object recognition work for everyone.

TensorNetwork: Google AI Released Open-source Library for Efficient Tensor Computations

Google AI introduced TensorNetwork, an open-source library for efficient tensor network calculations. In the blog post, Chase Roberts, Research Engineer at Google AI, and Stefan Leichenauer, Research Scientist at X, write about the need for such a library to bridge the gap between tensor networks and machine learning.

Tensor networks, a data structure less well known in the machine learning community, allow efficient computations in quantum physics, where quantum states can become exponentially large.

Arguing that tensor networks can be used in machine learning, and that this data structure has already found some applications within ML, researchers point out the need for a tensor network library.

Engineers and researchers from Google as well as the Perimeter Institute for Theoretical Physics and X, have developed an efficient library that will allow running tensor network algorithms at scale.

The novel library, named TensorNetwork, uses TensorFlow as a backend and provides significant speedups, especially because it allows the use of GPUs.
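At its core, a tensor network is a graph whose edges are summed-over shared indices. A minimal illustration of a single edge contraction in plain Python (our sketch of the concept, not the TensorNetwork library API; the library performs such contractions through TensorFlow, which is where the GPU speedups come from):

```python
# Contract two tensors A[i, k] and B[k, j] over their shared index k,
# the basic operation that tensor network algorithms repeat along the
# edges of the network graph.

def contract(A, B):
    n, m = len(A), len(B[0])
    shared = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(shared))
             for j in range(m)] for i in range(n)]

A = [[1.0, 2.0]]        # shape (1, 2)
B = [[3.0], [4.0]]      # shape (2, 1)
print(contract(A, B))   # [[11.0]]: the shared index k has been summed out
```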

Together with the library, researchers are planning to release a series of papers describing the library and providing example use-cases within physics as well as within machine learning. In the first paper, which has already been released, the researchers introduce the library and its API and give an overview of tensor networks for a non-physics audience.

Going Beyond GANs? DeepMind Researchers Proposed VAE Model for Generating Realistic Images

In a recent paper, researchers from DeepMind propose a variation of VQ-VAE (Vector Quantized Variational Autoencoder) model that rivals Generative Adversarial Networks in high-fidelity image generation.

The new model, called VQ-VAE-2, introduces several improvements over the original vector quantized VAE model. Researchers explored the capabilities of the new model and showed that it is able to generate realistic and coherent results.

In the past few years, Generative Adversarial Networks were considered the most powerful generative models for generating high-fidelity images. Despite having several problems such as training instability, mode collapse, etc., GANs have shown superior results compared to other generative models.

Variational Autoencoders, on the other hand, have had a major problem when trained to learn a distribution of realistic, high-fidelity images: the approximation of the posterior distribution is usually oversimplified, and the model cannot capture the true distribution of the images. GANs were able to solve this, at the expense of introducing mode collapse and reducing sample diversity.

In the new paper, researchers improve Variational Autoencoders or more specifically VQ-Variational Autoencoders to be able to generate high-fidelity images comparable with those generated from GANs.

Researchers proposed two main improvements: a multi-scale hierarchical organization of the autoencoder model and learning powerful priors over the latent codes which are used for sampling.
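The vector quantization step at the heart of VQ-VAE can be shown at toy scale (our illustration, not DeepMind's implementation): each encoder output is snapped to its nearest codebook vector, and the discrete code indices are what the learned priors model. VQ-VAE-2 applies this at multiple scales of the hierarchy.

```python
# Toy vector quantization: map a continuous encoder output to the index
# and value of its nearest codebook entry (squared Euclidean distance).

def quantize(vector, codebook):
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return idx, codebook[idx]

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
idx, code = quantize([0.9, 1.2], codebook)
print(idx, code)  # 1 [1.0, 1.0]
```

The prior network then only has to model a distribution over these discrete indices, which is what makes sampling from the trained model tractable.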

The architecture of the proposed VQ-VAE 2.

The evaluations show that the proposed method, trained on the popular ImageNet dataset, is able to generate realistic images. According to the researchers, the results are comparable to those generated by GANs, and the novel method does not suffer from the mode collapse issue.

Deep Drone Racing: Novel Method Uses CNN and Imitation Learning To Control A Drone

Researchers from the University of Zurich, ETH Zurich and Intel propose a vision-based method for drone racing using a path-planning system and a convolutional neural network.

Arguing that robot (and/or drone) navigation in complex and dynamic environments represents a fundamental challenge, they combine the perceptual capabilities of CNNs with state-of-the-art path-planning methods to tackle the problem.

In their proposed approach, the convolutional neural network directly maps raw image data to waypoints and speed values. The path planner takes those values and generates a minimum-jerk trajectory. As part of the modular drone-racing system, the drone controller receives the generated trajectory and moves the drone accordingly.
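A minimum-jerk trajectory between two waypoints that starts and ends at rest has a well-known closed form: a quintic polynomial in normalized time. The sketch below illustrates this planning step for a single axis; it is our illustration of the standard formula, not the authors' planner.

```python
# Minimum-jerk position profile for one axis: move from x0 to xf in time T,
# with zero velocity and acceleration at both endpoints.

def min_jerk(x0, xf, T, t):
    tau = t / T                                   # normalized time in [0, 1]
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5    # smooth 0 -> 1 profile
    return x0 + (xf - x0) * s

# Move from 0 m to 2 m in 1 s: the drone is exactly halfway at mid-time.
print(min_jerk(0.0, 2.0, 1.0, 0.0))  # 0.0
print(min_jerk(0.0, 2.0, 1.0, 0.5))  # 1.0
print(min_jerk(0.0, 2.0, 1.0, 1.0))  # 2.0
```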

Researchers mention that a big advantage of their modular design, which separates perception and control, is that perception can be trained exclusively in simulation and the system tested in real life without changing the control part. They show that simulation provides enough diverse training data to make the method more robust.

The experiments conducted showed that the method outperforms current state-of-the-art drone control systems in both simulated and real-world flight.

Researchers released a video showing the performance of the proposed method. The paper is available on arXiv.


AI System Can Provide Orthodontic Diagnosis and Treatment Plans

A group of researchers from Osaka University in Japan have proposed an automated orthodontic diagnosis system based on Natural Language Processing (NLP).

In a recently published paper, Tomoyuki Kajiwara and his colleagues describe an artificial intelligence system that takes various patient data and outputs a treatment plan. The proposed system starts with imaging and model data from the patients (X-rays, facial photos, 3D models, etc.) as the learning data. Computer vision and, mostly, Natural Language Processing techniques are then applied to this data to extract diagnoses and, from them, to define treatment plans.

The system is designed to take multi-modal data as input and provide a list of diagnoses ordered by priority, along with several treatment plans based on the diagnoses. In the end, the system produces human-readable output containing the detected problems and the treatment plan.

Researchers tried to make the method comprehensive, so that it takes into account many diagnoses as well as treatment plans and organizes each medical condition within a treatment protocol. To do this, they formulate the task as a multi-label classification problem over 400 types of medical conditions.

To develop and evaluate the proposed method, 900 hand-written certificates by dentists were used in the experiments. Researchers report a correlation coefficient of 0.584 with human rankings on the treatment prioritization task.

A diagram of the proposed system.

In conclusion, researchers propose an interesting system for automatic orthodontic diagnosis based on machine learning. They showed that it is possible to extract features from free-form doctor certificates to build powerful machine learning models that, in turn, can provide diagnoses and treatments.

More about the proposed automated orthodontic diagnosis system can be read in the official paper.

Detect-to-Retrieve: Novel Method Improves Image Search by Detecting Landmarks

Researchers from the University of Cambridge and Google AI have proposed a new image retrieval method that leverages region detection to provide improved image similarity estimates.

In the novel paper, “Detect-to-Retrieve: Efficient Regional Aggregation for Image Search”, accepted at CVPR 2019, researchers propose to use a trained landmark detector to index image regions and support image retrieval with improved accuracy.

In fact, researchers argue that a better aggregated image representation can be obtained by detecting specific regions and extracting local visual words within those regions. To tackle the problem with this approach, researchers first created a new dataset of landmark bounding boxes based on the Google Landmarks dataset.

Then, they used the new dataset and a trained landmark detector to extract information from local regions which is later aggregated into an image representation.

In the paper, the authors also propose a regional aggregated selective match kernel (R-ASMK) that combines the information from different regions and yields a better image representation from an image retrieval perspective.

The proposed Detect-to-Retrieve approach.

The proposed image retrieval method based on object semantics outperforms current state-of-the-art methods by a large margin on the Revisited Oxford and Paris datasets. Researchers show that regional search improves the accuracy of image retrieval systems, especially in cluttered scenes.

The new dataset and the implementation of the proposed method have been open-sourced and are available on GitHub.


Now “Visual Question Answering” Models Can Read Text In Images

In a paper accepted to CVPR 2019, researchers from Facebook AI Research (FAIR) and Georgia Institute of Technology introduce TextVQA – Visual Question Answering models that can understand text and reason based on both the image and the text.

Researchers argue that existing VQA models are not able to read at all. In fact, all previous efforts in this field have been aimed towards scene understanding and object detection in images to provide reasoning and answers to natural questions.

Prior work has largely overlooked the necessity of text understanding: text is present in many images, and answers to natural questions can, in fact, emerge from simply reading that text.

Therefore, to tackle the problem, researchers first created the TextVQA dataset, which contains images with text in them along with ground truth labels for that text. The dataset contains more than 45,000 questions on more than 28,000 images.

Then, they introduced a new model architecture that takes into account text understanding and used the created dataset to train the VQA model to provide answers based on both the visual content and the text.


The proposed TextVQA approach.

The novel framework was named LoRRA – Look, Read, Reason and Answer. Researchers showed that the proposed model outperforms existing VQA models on the novel TextVQA dataset since the latter are not able to provide answers based on text understanding.

The implementation of the proposed method was open-sourced and is available on GitHub as part of Facebook’s Pythia framework. The pre-print paper was published on arXiv.

EfficientNet: Achieving State-of-The-Art Results By Systematically Scaling Deep Neural Networks

Researchers from Google AI have found a way to systematically scale deep neural networks and improve accuracy as well as efficiency.

Deep neural networks have shown unprecedented performance on a broad range of problems coming from a variety of different fields. It has been shown in the past that accuracy grows with the amount of training data, but also with the depth of the network.

Increasing the number of layers has proven to increase accuracy and researchers have been building larger and larger models in the past years.

In their paper “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, researchers from Google argue that the scaling of deep neural networks should not be done arbitrarily and propose a method for structured scaling of deep networks.

They proposed an approach where different dimensions of the model can be scaled in a different manner using scaling coefficients.

To obtain those scaling coefficients, they perform a grid search to find the relationship between those scaling dimensions and how they affect the overall performance. Then, they apply those scaling coefficients to a baseline network model to achieve maximum performance given a (computational) constraint.
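Concretely, the paper's compound scaling rule ties all three dimensions to a single coefficient phi, using base factors found by grid search (alpha = 1.2, beta = 1.1, gamma = 1.15 in the paper, chosen so that alpha * beta**2 * gamma**2 is approximately 2, i.e. each unit increase of phi roughly doubles the FLOPS budget):

```python
# EfficientNet compound scaling: depth, width and input resolution are
# scaled together by a single coefficient phi.

ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # grid-searched base factors

def compound_scale(phi):
    depth = ALPHA ** phi        # multiplier on number of layers
    width = BETA ** phi         # multiplier on number of channels
    resolution = GAMMA ** phi   # multiplier on input image size
    return depth, width, resolution

d, w, r = compound_scale(2)     # roughly a 4x FLOPS budget vs. the baseline
print(round(d, 2), round(w, 2), round(r, 2))  # 1.44 1.21 1.32
```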

Using this approach, researchers developed a family of models, called EfficientNets, which surpass state-of-the-art accuracy with up to 10x better efficiency.


Model Size vs. Accuracy Comparison.

More about the proposed scaling method and the family of models called EfficientNets can be read in the official blog post. The paper was accepted as a conference paper at ICML 2019 and is available on arXiv.

“Now You See It, Now You Don’t”: Deep Flow-Guided Video Inpainting

A group of researchers from the Chinese University of Hong Kong and the Nanyang Technological University has proposed a novel flow-guided video inpainting method that achieves state-of-the-art results.

Video inpainting is a challenging Computer Vision problem which aims at filling in missing parts of a video. Many approaches have been proposed in the past, but video inpainting still remains a very difficult task.

To overcome the problem of preserving the spatio-temporal coherence of the video contents, researchers propose a neural network model based on optical flow. They reformulate the problem of video inpainting from “filling missing pixels at each frame” into a pixel propagation problem. Then, they designed a neural network model that is able to perform video inpainting.

In their method, they compute the optical flow field across video frames using the Deep Flow Completion Network. This flow field is then used to propagate pixels across frames in the video.
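The propagation step can be shown at toy scale. The snippet below is our own simplification, not the authors' implementation: a missing pixel in one frame is filled by following its flow vector to a known pixel in the neighboring frame.

```python
# Toy flow-guided pixel propagation: fill a missing pixel in frame t by
# following its optical flow vector into frame t+1.

def propagate(frame_next, flow, x, y):
    # flow[(x, y)] says where pixel (x, y) moves in the next frame;
    # if that location is known there, copy its value back.
    dx, dy = flow[(x, y)]
    return frame_next.get((x + dx, y + dy))

frame_next = {(3, 2): 128}    # known pixel values in frame t+1
flow = {(2, 2): (1, 0)}       # pixel (2, 2) moves one step right
print(propagate(frame_next, flow, 2, 2))  # 128
```

In the real method, the flow field itself must first be completed inside the missing region, which is exactly what the Deep Flow Completion Network learns to do.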

The proposed method was evaluated on the DAVIS and YouTube-VOS datasets, and researchers report state-of-the-art results in terms of both video inpainting quality and speed.



More details about the proposed method and the whole project can be found on the project’s website. Researchers released a video showing the performance of the method in tasks such as object removal from video, watermark removal, etc.

They also open-sourced the implementation of the method, which is available on GitHub. The paper is available on arXiv.

SANet: Flexible Neural Network Model for Style Transfer

Researchers from the Artificial Intelligence Research Institute in Korea have proposed a novel neural network method for arbitrary style transfer.

Style transfer, as a technique of recomposing images in the style of other images, has become very popular especially with the rise of convolutional neural networks in the past years. Many different methods have been proposed since then and neural networks were able to solve the problem of style transfer with sufficiently good results.

However, many existing approaches and algorithms are not able to balance both the style patterns and the content structure of the image. To overcome this problem, researchers proposed a new neural network model called SANet.

SANet (which stands for style-attentional network), is able to integrate style patterns in the content image in an efficient and flexible manner. The proposed neural network is based on the self-attention mechanism and learns a mapping between content features and style features by modifying the self-attention mechanism.

The proposed style transfer method takes as input a content image and a “style” pattern image to be used for the composition. Both the content image and the style image are encoded into latent representations using an encoder network and fed into two separate style-attentional networks. The outputs are then concatenated and passed through a decoder, which produces the final output image.
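The attention step can be sketched at toy scale. This is our illustration of the general style-attention idea, not the SANet implementation: each content feature attends over the style features and receives a softmax-weighted mix of them.

```python
import math

# Toy style attention: for each content feature, compute similarity scores
# against all style features, softmax them, and mix the style features.

def style_attention(content, style):
    out = []
    for c in content:
        scores = [sum(ci * si for ci, si in zip(c, s)) for s in style]
        m = max(scores)                          # for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        mixed = [sum(w * s[i] for w, s in zip(weights, style))
                 for i in range(len(style[0]))]
        out.append(mixed)
    return out

content = [[1.0, 0.0]]
style = [[2.0, 0.0], [0.0, 2.0]]
mixed = style_attention(content, style)
print(mixed)  # the first style feature dominates for this content vector
```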

The architecture of the proposed method.

Researchers use an identity loss, which measures the difference between the original image and the generated one. For training and evaluating the proposed method, they used the WikiArt and MS-COCO datasets.

In their paper, researchers report that their method is both effective and efficient. According to them, SANet is able to perform style transfer in a flexible manner using the loss function that combines traditional style reconstruction loss and identity loss.

Researchers released a small online demo where users can upload a photo and see the results of the method. More details about SANet can be read in the pre-print paper, which was accepted as a conference paper at CVPR 2019 and is available on arXiv.

Researchers Expanded the Popular MNIST Dataset With 50 000 New Images

Researchers from New York University and Facebook AI Research have restored and expanded the widely known MNIST dataset. Their idea was to recover the original MNIST data (assumed to be lost forever) by reconstructing the missing part of it.

MNIST is one of the most popular and most used datasets for building and testing image processing systems. A lot of research work in the past decades has developed methods using MNIST and the dataset itself has become a baseline for many image processing problems.

Arguing that the official MNIST test set, with only 10,000 images, is too small to provide meaningful confidence intervals, they tried to recreate the MNIST preprocessing algorithms.

Through an iterative process, researchers generated an additional 50,000 images of MNIST-like data. They started with the reconstruction process described in the paper introducing MNIST and used the Hungarian algorithm to find the best matches between the original MNIST samples and their reconstructed samples.

After many iterations of improvements in the reconstruction algorithm trying to extract the best matches between the generated and the original samples, researchers improved the samples and generated a dataset of an additional 50 000 digit images.
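The matching step can be shown at toy scale (our illustration): find the assignment of reconstructed samples to original samples that minimizes the total distance. The paper uses the Hungarian algorithm for this; for a handful of samples, brute force over permutations gives the same answer.

```python
from itertools import permutations

# Find the minimum-total-distance assignment of reconstructions to originals.
# Brute force stands in for the Hungarian algorithm at this tiny scale.

def best_matching(originals, reconstructions):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(permutations(range(len(reconstructions))),
               key=lambda p: sum(dist(originals[i], reconstructions[p[i]])
                                 for i in range(len(originals))))
    return list(best)

originals = [[0.0, 0.0], [1.0, 1.0]]
reconstructions = [[0.9, 1.1], [0.1, -0.1]]
print(best_matching(originals, reconstructions))  # [1, 0]
```

The Hungarian algorithm computes the same optimum in polynomial time, which is what makes matching tens of thousands of digit images feasible.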

The new expanded MNIST dataset will allow examining the existing methods and investigating their generalization capabilities since many of them might have been overfitting on the small MNIST official testing set.

The dataset, as well as a detailed explanation of the reconstruction process, can be found on GitHub. The pre-print paper is available on arXiv.

Two Cameras Can Provide LiDAR-like Object Detection for Self-Driving

Researchers from Cornell University, led by Yan Wang, have come up with an interesting idea for bridging the gap in accuracy between 2D and 3D object detection methods.

In their paper, titled “Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving”, they argue that it is not the difference in data that makes 3D-based methods better but rather its representation. To demonstrate this, they propose a method for converting image-based depth maps into a pseudo-LiDAR representation, one that to some extent resembles the 3D point cloud coming from a LiDAR sensor.

In fact, researchers tried to mimic the LiDAR signal using only camera data and depth estimation models. They propose a transformation from image depth maps to a pseudo-LiDAR representation and use depth estimation together with a 3D detection model to outperform existing 3D detectors by a large margin.


The proposed pipeline for 3D object detection.

The novelty they introduce is the transformation from a depth image to a point cloud by projecting each pixel into 3D coordinates. Instead of adding the depth as an additional channel to the image, they derive each pixel's 3D location relative to the camera position.


Instead of incorporating the depth D as multiple additional channels to the RGB images, as is typically done, the 3D location (x, y, z) of each pixel (u, v) is derived in the left camera’s coordinate system.
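Under the standard pinhole camera model, this back-projection has a simple closed form using the camera intrinsics, the focal lengths (fx, fy) and the principal point (cx, cy):

```python
# Back-project a pixel (u, v) with estimated depth z into a 3D point in the
# camera coordinate system. Applying this to every pixel of the depth map
# yields the pseudo-LiDAR point cloud.

def pixel_to_3d(u, v, z, fx, fy, cx, cy):
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)

# A pixel 100 px right of the principal point, 10 m away, with fx = 500 px:
print(pixel_to_3d(600, 300, 10.0, 500.0, 500.0, 500.0, 300.0))
# (2.0, 0.0, 10.0): two meters to the right, on the optical axis height
```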

Researchers showed that the generated pseudo-LiDAR point cloud aligns well with the original LiDAR signal. The evaluations conducted on the popular KITTI benchmark show that this approach yields large improvements over existing state-of-the-art methods. The algorithm is currently the highest entry in the KITTI 3D object detection benchmark among stereo image-based methods.

Speech2Face: Neural Network Predicts the Face Behind a Voice

In a recently published paper, researchers from MIT’s Computer Science & Artificial Intelligence Laboratory have proposed a method for reconstructing a person’s face from audio recordings of that person speaking.

The goal of the project was to investigate how much information about a person’s looks can be inferred from the way they speak. Researchers proposed a neural network architecture designed specifically to perform the task of facial reconstruction from audio.

They used natural videos of people speaking collected from YouTube and other internet sources. The proposed approach is self-supervised: researchers exploit the natural synchronization of faces and speech in videos to learn the reconstruction of a person’s face from speech segments.

Several results produced by the Speech2Face model.


In their architecture, researchers utilize pre-trained facial recognition models as well as a face decoder model which takes a latent vector as input and outputs a reconstructed face image.


The proposed self-supervised learning approach.


From the videos, they extract speech-face pairs, which are fed into the two branches of the architecture. The images are encoded into a latent vector using the pre-trained face recognition model, while the waveform is fed into a voice encoder in the form of a spectrogram, in order to utilize the power of convolutional architectures. The encoded vector from the voice encoder is fed into the face decoder to obtain the final face reconstruction.

The evaluations show that the method is able to predict plausible faces whose facial features are consistent with those in real images. Researchers created a page with supplementary material where sample outputs of the method can be found. The paper was accepted as a conference paper at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019.

Deepfake Videos: GAN Synthesizes a Video From a Single Photo

Researchers from Samsung AI and Skolkovo Institute of Science and Technology have produced a system that can create realistic fake videos of a person talking, given only a few images of that person.

In their paper, named “Few-Shot Adversarial Learning of Realistic Neural Talking Head Models”, researchers propose a method which is able to generate a personalized talking head model without needing a large number of images of a single person.

Arguing that in practical scenarios such videos need to be generated from only a few image samples, or even a single one, they designed a few-shot learning scheme.

The proposed architecture contains three modules: a generator network, an embedder network and a discriminator network. The architecture is designed to disentangle the pose from the person’s facial features and exploits adversarial learning to generate realistic videos.

The proposed GAN architecture.


The embedder network is the module that extracts pose-independent features of the person in a given frame. It is supposed to learn the person’s identity and generate low-dimensional embeddings. These embeddings are then fed into the generator network as AdaIN (Adaptive Instance Normalization) parameters, which allows the convolutional layers to be modulated with latent embeddings containing person-specific information.
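AdaIN itself has a standard closed form: the features are normalized with their own statistics and then re-scaled with a scale and shift predicted from the embedding. A one-dimensional sketch (the real operation runs per channel of a convolutional feature map):

```python
# Adaptive Instance Normalization: normalize features with their own mean
# and standard deviation, then apply a learned scale and shift (which, in
# the talking-head model, are predicted from the person embedding).

def adain(features, scale, shift, eps=1e-5):
    n = len(features)
    mean = sum(features) / n
    var = sum((f - mean) ** 2 for f in features) / n
    std = (var + eps) ** 0.5
    return [scale * (f - mean) / std + shift for f in features]

# The output is re-centered on `shift` and re-scaled by `scale`.
out = adain([1.0, 2.0, 3.0], scale=1.0, shift=2.0)
print([round(x, 3) for x in out])  # [0.775, 2.0, 3.225]
```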

The generator network takes facial landmarks as input (as well as the embeddings and the ground truth image) and it is supposed to produce a synthetic image sample of a person as a video frame.

Finally, the discriminator network should learn to discriminate the distributions and force the generator to produce samples from the realistic distribution.

Researchers trained the system in a supervised manner using two datasets of talking head videos: VoxCeleb1 and VoxCeleb2. The evaluation showed that the proposed method is able to learn to generate a talking head video from as little as a single image. However, the best results are reported for the model trained using 32 images.


PyTorch’s torchvision 0.3 Comes With Segmentation and Detection Models, New Datasets and More

The new release 0.3 of PyTorch’s torchvision library brings several new features and improvements. The newest version of torchvision includes models for semantic segmentation, instance segmentation, object detection, and person keypoint detection.

Torchvision developers added reference training and evaluation scripts for several computer-vision tasks. These scripts are meant to provide flexibility and convenience when tackling common computer-vision problems, and they serve as a basis for training specific models and producing evaluations and baselines.

The new release also contains torchvision ops – custom C++/CUDA operators that are used in computer vision. Some examples of torchvision ops include roi_pool, box_area, roi_align, etc.
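These operators are implemented in C++/CUDA for speed, but the underlying computation is straightforward. As a rough, pure-NumPy illustration of what an op like nms computes (an illustrative sketch, not the library’s implementation):

```python
import numpy as np

def nms(boxes, scores, iou_threshold):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes.
    Returns indices of kept boxes, highest score first."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with each remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Drop boxes that overlap the kept box too much.
        order = rest[iou <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores, iou_threshold=0.5))  # [0, 2]
```

Here the second box overlaps the first with IoU of roughly 0.68 and is suppressed, while the third, disjoint box is kept; torchvision’s op performs the same computation far faster on the GPU.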

As we mentioned above, torchvision 0.3 includes many popular models for segmentation, detection, and classification. Among the models available in the new release are FCN and DeepLabV3 (with a ResNet backbone) for segmentation; Faster R-CNN, Mask R-CNN, and Keypoint R-CNN for detection; and GoogLeNet, MobileNetV2, ShuffleNet V2, and ResNeXt for classification.

Five new datasets are also part of the torchvision 0.3 release: Caltech101, Caltech256, CelebA, ImageNet, and Semantic Boundaries are now available in the torchvision package. The developers also added a superclass, VisionDataset, which serves as a base class for all computer-vision datasets.

The full release notes are available here. Torchvision developers also added a tutorial as a Google Colab notebook that shows how to fine-tune a segmentation model on a custom dataset in torchvision.

“Moving Camera, Moving People”: Novel Method for Depth Estimation from Google AI

Researchers from Google AI have proposed a new method for depth estimation of moving people. The new method is capable of handling videos where both the people and the camera are moving.

A particularly challenging problem, depth prediction of moving objects has not been explored much in the past. Depth estimation methods generally filter out or simply ignore dynamic objects in a scene, producing an incomplete depth map that contains only the background and static objects.

In this approach, researchers took advantage of the popular “Mannequin Challenge”, in which people all over the world uploaded videos of themselves imitating mannequins by freezing in a natural pose. Because the camera in these videos (uploaded by users on YouTube) moves around an entirely static scene, researchers were able to construct a large dataset of depth maps corresponding to the videos.



To acquire accurate depth maps as labels, they used triangulation-based methods such as multi-view stereo (MVS) and collected a dataset of around 200 videos. The dataset is highly varied, since the videos come from the Mannequin Challenge, where people posed naturally in a wide variety of everyday situations.

Researchers proposed a neural network model for depth estimation, trained in a supervised manner on the collected Mannequin Challenge dataset. They frame the task as a regression problem in which the input to the network is the RGB frame, a depth map of the background, and a binary mask of the people in the image. According to the researchers, the depth from parallax (the background depth map) provides strong depth cues and helps solve the problem more efficiently. This depth map is derived from the 2D optical flow between two frames.


The architecture of the proposed method.
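A minimal sketch of how such a network input could be assembled, channel-wise, from the three cues described above. The shapes, channel ordering, and variable names here are hypothetical; the paper’s exact preprocessing may differ:

```python
import numpy as np

H, W = 64, 96
rng = np.random.default_rng(1)

rgb = rng.random((3, H, W))             # reference video frame
bg_depth = rng.random((1, H, W))        # parallax depth of the static scene
human_mask = np.zeros((1, H, W))        # 1 where a person is detected
human_mask[:, 20:40, 30:60] = 1.0
bg_depth = bg_depth * (1 - human_mask)  # depth is unknown inside the mask

# The network sees all cues stacked channel-wise and regresses full depth.
net_input = np.concatenate([rgb, bg_depth, human_mask], axis=0)
print(net_input.shape)  # (5, 64, 96)
```

The key point is that the background depth channel is zeroed out wherever a person appears, so the network must fill in those values from the remaining cues.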

So, the task the network solves is essentially “inpainting” the depth values in the regions masked out by people. Results show that the network learns to estimate the depth of moving people, and the researchers note that it generalizes to natural videos with arbitrary human motion.
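As a toy illustration of this masked-region framing, a regression loss restricted to the human mask might look like the following NumPy sketch (the paper’s actual training losses are more elaborate; this only shows the idea of scoring the “inpainted” pixels):

```python
import numpy as np

def masked_l1(pred, target, mask):
    """Mean absolute depth error restricted to the human regions,
    i.e. the pixels whose depth the network has to 'inpaint'."""
    m = mask.astype(bool)
    return np.abs(pred[m] - target[m]).mean()

pred = np.full((4, 4), 2.0)     # predicted depth
target = np.full((4, 4), 3.0)   # MVS ground-truth depth
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1              # a 2x2 "person" region
print(masked_l1(pred, target, mask))  # 1.0
```

Evaluating the error only inside the mask isolates how well the network recovers depth for the moving people, as opposed to the static background it is given as input.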


Facebook Open-sourced Pythia: Deep Learning Framework for Image and Language Models

Facebook AI Research (FAIR) has released Pythia – an open-source framework for multimodal learning. Built on top of PyTorch, Pythia supports multitask learning in the vision and language domains.

Multimodal learning addresses the challenges of learning from different modalities in data. In the recent past, many approaches have been proposed that learn from multi-modal data, especially models combining visual and textual data.

For this reason, researchers at Facebook AI Research developed a modular framework that enables easy and quick building, reproduction, and evaluation of multimodal AI models. The framework supports distributed training on top of PyTorch, as well as a number of multimodal datasets, custom metrics, loss functions, and optimizers. It also contains the most commonly used vision and language layers for deep neural networks. Additionally, the researchers included FAIR’s winning entries from recent AI competitions such as the VQA Challenge 2018 and the VizWiz Challenge 2018.

The team developing Pythia expects that it will help researchers and engineers speed up research at the intersection of computer vision and natural language processing. Pythia is available on GitHub, and more details about the framework can be found in the official blog post or in the official documentation.