Multimodal learning addresses the challenge of learning from data that spans multiple modalities. In recent years, many approaches have been proposed that learn from multi-modal data, especially models that combine visual and textual inputs.
To support this line of work, researchers at Facebook AI Research (FAIR) developed Pythia, a modular framework that enables quick and easy building, reproduction, and evaluation of multimodal AI models. The framework supports distributed training on top of PyTorch and ships with a number of multimodal datasets, custom metrics, loss functions, optimizers, and more. It also contains the most commonly used vision and language layers for deep neural networks. Additionally, the researchers included FAIR's winning entries from recent AI competitions such as the VQA Challenge 2018 and the VizWiz Challenge 2018.
The team developing Pythia expects that it will help researchers and engineers speed up research at the intersection of computer vision and natural language processing. Pythia is available on GitHub, and more details about the framework can be found in the official blog post or in the official documentation.