“Selfie”: Novel Method Improves Image Model Accuracy Through Self-Supervised Pre-training

Researchers from Google Brain have proposed a novel pre-training technique called Selfie, which applies the concept of masked language modeling to images.

Arguing that BERT, with its bidirectional embeddings learned through masked language modeling, has revolutionized language model pre-training, the researchers generalized the concept to learn image embeddings.

The researchers introduce a self-supervised pre-training approach for generating image embeddings. The method masks out patches in an image and learns to pick the correct patch for each empty location from among distractor patches taken from the same image.

A convolutional neural network extracts features from the unmasked patches, and an attention module summarizes those features before the model predicts the masked patches.
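The patch-selection task described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`split_into_patches`, `make_pretext_example`, `selection_loss`) are hypothetical, the convolutional encoder and attention module are left out, and the dot-product scoring with a softmax cross-entropy loss is one plausible way to pick the correct patch among distractors.

```python
import numpy as np


def split_into_patches(image, patch_size):
    """Split an (H, W, C) image into non-overlapping square patches, row-major."""
    h, w, c = image.shape
    rows, cols = h // patch_size, w // patch_size
    cropped = image[: rows * patch_size, : cols * patch_size]
    return (
        cropped.reshape(rows, patch_size, cols, patch_size, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(rows * cols, patch_size, patch_size, c)
    )


def make_pretext_example(image, patch_size, num_masked, rng):
    """Build one training example: visible patches, candidates, and a target.

    The masked patches double as the candidate set, so for any one masked
    location the other masked patches act as distractors (assumption).
    """
    patches = split_into_patches(image, patch_size)
    order = rng.permutation(len(patches))
    masked_idx = np.sort(order[:num_masked])
    visible_idx = np.sort(order[num_masked:])
    target = int(rng.integers(num_masked))  # which masked location to fill
    return {
        "visible": patches[visible_idx],
        "visible_positions": visible_idx,
        "candidates": patches[masked_idx],
        "query_position": int(masked_idx[target]),
        "label": target,
    }


def selection_loss(query, candidate_features, label):
    """Cross-entropy over dot-product scores between a query and candidates.

    `query` stands in for the attention module's summary of the visible
    patches; `candidate_features` stands in for encoder outputs.
    """
    logits = candidate_features @ query
    logits = logits - logits.max()  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]
```

In a full model, the query vector would come from the attention module and the candidate features from the patch encoder, with both trained jointly to minimize this loss.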

The researchers showed that these embeddings improve classification accuracy. They conducted a number of experiments to evaluate how the proposed pre-training affects the performance of different models trained with different amounts of labeled data.


How Selfie pre-training works.

They showed that Selfie provides consistent improvements on all datasets compared to training in a standard supervised setting. The datasets used were CIFAR-10, ImageNet 32×32, and ImageNet 224×224, and the researchers designed experiments that vary the portion of labeled data from 5% all the way to 100%.
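An experiment that varies the labeled portion needs a way to draw a labeled subset of the training set. The sketch below is one way to do that, under assumptions: the article does not say how the subsets were chosen, so the class-balanced sampling and the helper name `labeled_subset_indices` are illustrative choices only.

```python
import numpy as np


def labeled_subset_indices(labels, fraction, seed=0):
    """Return sorted indices keeping roughly `fraction` of each class.

    Class-balanced sampling is an assumption made for this sketch.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    chosen = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        keep = max(1, int(round(fraction * len(cls_idx))))  # at least one per class
        chosen.append(rng.choice(cls_idx, size=keep, replace=False))
    return np.sort(np.concatenate(chosen))
```

Calling the helper with `fraction=0.05` and again with `fraction=1.0` on the same label array gives the two ends of the range mentioned above. The remaining images can still be used without labels during pre-training.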

Additionally, the researchers observed that the proposed pre-training method improves the training stability of ResNet-50.

This method shows how an idea that was a breakthrough in Natural Language Processing can also be applied in Computer Vision to improve model performance. More about Selfie can be found in the pre-print paper published on arXiv.
