
Nils Hohing

Student

Project Title

Grounded Language Understanding

Project Description

Natural language understanding systems rely on word statistics captured by language models. These approaches approximate language semantics, but they exhibit many known failure cases, such as poor understanding of causal relationships, sensitivity to rephrased sentences, and syntactically convincing but semantically questionable outputs.

Many of these issues can be attributed to language models’ lack of grounding in reality. Humans know which concepts in the world the words of a text correspond to: for the word “tree”, we know what a tree looks like, which sounds it produces and which tactile impressions are associated with it. This knowledge gives us an edge over current language understanding systems in reasoning, which is implicitly required for all language understanding tasks.

Existing research has contributed a variety of benchmarks that measure the alignment between modalities such as vision, language and audio. The most reliable way to test for alignment is through retrieval benchmarks like Winoground, where the model is tasked with retrieving the items (e.g. images) from a large database that best match a key (e.g. a given text). Image or text generation benchmarks, in contrast, suffer from unreliable automatic evaluation, because defining a sensible distance metric between a ground-truth image or text and a generated one is very hard.
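As an illustration only (not part of the project itself), the following is a minimal sketch of how such a retrieval benchmark is typically scored with recall@k, assuming hypothetical precomputed text and image embeddings from a dual-encoder model; the function name and array shapes are placeholders.

```python
import numpy as np

def recall_at_k(text_emb: np.ndarray, image_emb: np.ndarray, k: int = 1) -> float:
    """text_emb: (N, D) query embeddings; image_emb: (N, D) candidate embeddings.
    The i-th text is assumed to match the i-th image. Returns the fraction of
    queries whose correct image appears among the top-k retrieved candidates."""
    # Normalize so the dot product equals cosine similarity.
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = text_emb @ image_emb.T                      # (N, N) similarity matrix
    top_k = np.argsort(-sims, axis=1)[:, :k]           # indices of the k best images per text
    hits = (top_k == np.arange(len(text_emb))[:, None]).any(axis=1)
    return float(hits.mean())
```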

Learning this alignment currently works very well for higher-level concepts, for example understanding the visual differences between a wolf and a bear, but it fails in the details. Simple spatial understanding, such as discerning between left and right, surprisingly often does not work. Unusual compositions are also rarely understood well: for the prompt “a cup on a spoon”, the image generation model DALL-E 2 generates only spoons in cups. This reveals serious deficits in the model’s language understanding.

This project aims to overcome the failure cases of existing solutions by improving models that understand the relationship between words and images (and possibly videos), as measured by existing benchmarks. Additionally, new benchmarks will be created to measure performance in these areas more precisely. Finally, the goal is to demonstrate that these image-text multimodal models can outperform language models even in purely textual domains, i.e. when no visual information is available at inference time.

As a first step, three possible sources of the aforementioned problems with image-text models are considered: the model architecture, the data and the learning strategy.

- Initial experiments have shown that the image and text processing architectures are capable of learning basic physical relations like “left” and “right”.

- Since the datasets used in this domain contain millions to billions of image-text pairs, a lack of data also seems unlikely.

- Therefore, either the quality of the data or the learning strategy must be the problem.

The initial goal of this project is therefore to examine the data quality for these specific purposes and to improve image-text alignment via novel curriculum learning strategies.
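To make the idea of a curriculum concrete, here is a minimal sketch of one possible strategy: order image-text pairs by a difficulty score and gradually grow the training pool from easy to hard. The difficulty score (e.g. caption length or compositional complexity) and all names are illustrative assumptions, not the project’s actual method.

```python
from typing import Iterator, List, Tuple

def curriculum_batches(pairs: List[Tuple[str, str, float]],
                       num_stages: int,
                       batch_size: int) -> Iterator[List[Tuple[str, str]]]:
    """pairs: (image_path, caption, difficulty) triples.
    Training starts on the easiest fraction of the data; the pool grows
    stage by stage until all pairs are included."""
    ordered = sorted(pairs, key=lambda p: p[2])        # easiest pairs first
    for stage in range(1, num_stages + 1):
        # Expand the pool to the easiest stage/num_stages fraction of the data.
        pool = ordered[: max(batch_size, len(ordered) * stage // num_stages)]
        for start in range(0, len(pool), batch_size):
            yield [(img, cap) for img, cap, _ in pool[start:start + batch_size]]
```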

The main challenges will be working with very large datasets and performing meaningful evaluation.