Knowledge Transfer from Text Annotations towards More Effective Learning for Computer Vision
Computer vision models have achieved human-level accuracy on certain tasks, such as classification and localization, by leveraging large annotated datasets, leading to widespread adoption across several domains. However, in fields such as medical diagnostics, adoption is still hampered by the scarcity and/or cost of annotated data. Recently, several works in few-shot learning and self-supervised learning have tried to learn from limited amounts of annotated data, but with limited success. A recent analysis of few-shot algorithms (Chen et al., 2019) shows that a simple baseline that fine-tunes a deep model matches current state-of-the-art few-shot learning algorithms and fares better in the realistic scenario of a non-negligible domain shift between the training and test sets. A similar analysis of self-supervised learning methods (Asano et al., 2020) suggests that unlabelled images aid only in learning the low-level features of the initial layers and are not sufficient to learn discriminative mid-level or high-level features. Both analyses suggest that visual information alone is not enough to perform well on computer vision tasks when annotations are scarce.

In contrast to deep learning-based models, humans can learn to recognize new objects, or point them out in images, from just a handful of labeled examples. One possible reason humans can understand objects and concepts from a few examples is the existence of an external representation of information about the world, accumulated from prior experience. Inspired by this, this research project aims to explore how such prior knowledge can be modeled and how it can be used to improve the performance of vision models in a limited-annotation scenario. The objectives of this research project are:

1. Develop a knowledge model of the world from a text corpus and already-annotated images. Natural language text is a rich source of knowledge, and semantic relationships between objects can be modeled from language to produce a knowledge representation (Miller, 1995); a small illustrative sketch follows this list. Here we intend to explore how annotated images can be jointly modeled with natural language to produce a knowledge prior.

2. Explore how information can flow from this knowledge model to the vision model to improve performance in few-shot learning (see the second sketch below for one possible form this could take).

3. Explore how information from this knowledge model can aid in learning more discriminative feature representations in self-supervised learning.
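As a minimal illustration of objective 1, the sketch below queries WordNet (Miller, 1995) through NLTK to derive a crude semantic-similarity prior over a handful of class names. The specific class names, the choice of path similarity, and the dictionary form of the prior are assumptions made purely for illustration; they are not the proposed knowledge model.

```python
# Illustrative only: a crude semantic prior over class names built from
# WordNet via NLTK (requires a prior `nltk.download('wordnet')`).
# The class names and the path-similarity measure are placeholder choices.
from nltk.corpus import wordnet as wn

def class_similarity(name_a, name_b):
    """Best path similarity over all noun synsets of two class names."""
    scores = [
        s_a.path_similarity(s_b)
        for s_a in wn.synsets(name_a, pos=wn.NOUN)
        for s_b in wn.synsets(name_b, pos=wn.NOUN)
        if s_a.path_similarity(s_b) is not None
    ]
    return max(scores, default=0.0)

# A toy prior over assumed class names; in the project such a prior would be
# modeled jointly with already-annotated images rather than used on its own.
classes = ["cat", "dog", "airplane"]
prior = {(a, b): class_similarity(a, b) for a in classes for b in classes}
print(prior)
```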
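For objective 2, one hypothetical way information could flow from the knowledge model into a few-shot vision model is to mix class-level semantic embeddings with visual prototypes. The sketch below is a toy illustration under that assumption; the convex-combination fusion, the mixing weight alpha, the feature dimensionality, and the random toy data are all placeholders rather than the proposed design.

```python
# Illustrative only: fusing a semantic prior with few-shot visual features.
# The fusion rule, alpha, and dimensions are placeholder assumptions.
import numpy as np

def build_prototypes(support_features, semantic_embeddings, alpha=0.5):
    """Per-class prototypes as a convex combination of the mean visual
    feature of the support images and a class-level semantic embedding."""
    prototypes = {}
    for cls, feats in support_features.items():
        visual_proto = np.mean(feats, axis=0)       # mean of few-shot support features
        semantic_proto = semantic_embeddings[cls]   # embedding from the knowledge model
        prototypes[cls] = alpha * visual_proto + (1.0 - alpha) * semantic_proto
    return prototypes

def classify(query_feature, prototypes):
    """Nearest-prototype classification of a query image feature."""
    return min(prototypes, key=lambda c: np.linalg.norm(query_feature - prototypes[c]))

# Toy usage with random features of matching dimensionality (assumed d=64).
rng = np.random.default_rng(0)
support = {"cat": rng.normal(size=(5, 64)), "dog": rng.normal(size=(5, 64))}
semantic = {"cat": rng.normal(size=64), "dog": rng.normal(size=64)}
protos = build_prototypes(support, semantic)
print(classify(rng.normal(size=64), protos))
```

In this toy setup the semantic embeddings must live in the same space as the visual features; in practice a learned projection between the two spaces would be one of the design questions the project explores.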