14: Multimodality: How To Use LLMs For Visual & Audio Data

Google PaLM-E: An embodied multimodal language model (Mar 2023)

(link)

Simple idea: this is a generalist robotics model that solves robotics tasks using visual information - but it relies entirely on an LLM for the reasoning.

  • Simple trick: to feed non-language information into the LLM, the paper trains various models that convert any kind of input information into the same embeddings space that words are located in.

  • Then all kinds of observations and information can be injected into the LLM prompts. For example: “What happened between <img_1> and <img_2>?”, and those are two images that are run through a model that converts an image into embeddings space.

  • PaLM-E offers a new paradigm for training a generalist model, which is achieved by framing robot tasks and vision-language tasks together through a common representation: taking images and text as input, and outputting text. A key result is that PaLM-E attains significant positive knowledge transfer from both the vision and language domains, improving the effectiveness of robot learning.

    • The visual-language data actually significantly improves the performance of the robot tasks

Here are the input transformations for various input types (besides language):

  • Vision Transformer (ViT). ViT (Dosovitskiy et al., 2020) is a transformer architecture mapping an image I into a number of token embeddings. We consider several variants, including the 4 billion parameter model from Chen et al. (2022), which we refer to as ViT-4B, and a similar 22 billion parameter model, ViT-22B (Dehghani et al., 2023), both of which have been pretrained on image classification.

  • State estimation vectors. State vectors, e.g. from a robot or a state estimate for objects, are perhaps the simplest to input into PaLM-E. Let s ∈ R^S be a vector describing the state of the objects in a scene. For example, s could contain the pose, size, color etc. of those objects. An MLP then maps s into the language embedding space (a minimal sketch of such encoders follows this list).

  • Entity referrals. PaLM-E must be able to reference objects in its generated plan. In many cases, including the majority of our experiments, objects in a scene can be identified in natural language by some of their unique properties. However, there also exist settings where objects are not easily identifiable by language in a few words, e.g. if there are multiple blocks of the same color at different locations on a table. For object-centric representations, we label the multi-modal tokens corresponding to an object in the input prompt as follows: “Object 1 is <obj 1>. ... Object j is <obj j>.” This enables PaLM-E to reference objects via special tokens of the form <obj j> in its generated output sentences.
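The details differ per modality, but the pattern is the same: each input type gets its own encoder whose output lands in the LLM's word-embedding space. Here is a minimal sketch of that pattern (dimensions, module names, and the two-layer MLP are assumptions for illustration, not PaLM-E's actual code):

```python
import torch
import torch.nn as nn

D_MODEL = 4096          # assumed width of the LLM's word embeddings
VIT_FEATURE_DIM = 1408  # assumed ViT output feature size
STATE_DIM = 7           # assumed state vector size (e.g. pose, size, color id)

class VisionProjector(nn.Module):
    """Maps a sequence of ViT patch features into the LLM embedding space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(VIT_FEATURE_DIM, D_MODEL)

    def forward(self, vit_features):        # (num_patches, VIT_FEATURE_DIM)
        return self.proj(vit_features)      # (num_patches, D_MODEL)

class StateEncoder(nn.Module):
    """Maps a state vector s in R^S to a single token embedding via an MLP."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(STATE_DIM, D_MODEL), nn.GELU(), nn.Linear(D_MODEL, D_MODEL)
        )

    def forward(self, s):                   # (STATE_DIM,)
        return self.mlp(s).unsqueeze(0)     # (1, D_MODEL): one "state" token

# Toy usage: both outputs now live in the same space as word embeddings.
image_tokens = VisionProjector()(torch.randn(256, VIT_FEATURE_DIM))
state_token = StateEncoder()(torch.randn(STATE_DIM))
```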

In training: to form the multi-modal sentences within the model, special tokens in the text get replaced by the embedding vectors of the encoders at the locations of those tokens.
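Here is a minimal sketch of that replacement step, assuming a single placeholder token id and a standard embedding table (an illustration, not the paper's implementation):

```python
import torch

def build_multimodal_sequence(token_ids, word_embedding, placeholder_id, encoder_embeddings):
    """Splices encoder outputs into the text embedding sequence.

    token_ids:          (seq_len,) int tensor; placeholder_id marks image/state slots
    word_embedding:     the LLM's nn.Embedding table
    encoder_embeddings: list of (n_i, d_model) tensors, one per placeholder, in order
    Returns a (new_seq_len, d_model) tensor that is fed into the transformer.
    """
    pieces, encoders = [], iter(encoder_embeddings)
    for tok in token_ids.tolist():
        if tok == placeholder_id:
            pieces.append(next(encoders))                        # inject multimodal tokens
        else:
            pieces.append(word_embedding(torch.tensor([tok])))   # ordinary word embedding
    return torch.cat(pieces, dim=0)

# Toy usage with a made-up vocabulary size and a single image placeholder (id 99).
emb = torch.nn.Embedding(32000, 4096)
seq = build_multimodal_sequence(torch.tensor([5, 99, 7]), emb, 99, [torch.randn(256, 4096)])
```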

Here is a fascinating insight: catastrophic forgetting. By training PaLM-E on all of those other embeddings, it can get worse at the tasks the original PaLM model did well, i.e., natural language tasks. But the bigger the model gets, the smaller the danger of “catastrophic forgetting”: the 562B model shows almost no forgetting on natural language generation (NLG) tasks.


Language Is Not All You Need: Aligning Perception with Language Models (Mar 2023)

(link)

The paper trains a multimodal LLM. The core idea is that all the processing happens through the LLM, but images (and in the future even sounds) are part of the prompt through embeddings.

  • The input is all text using tags: “<s> document </s>” is a text input, and “<s> paragraph <image> Image Embedding </image> paragraph </s>” is an interleaved image-text input.

  • The training data has text corpora, image-caption pairs and interleaved image-text data (e.g., website text with pictures).

  • There are two main networks:

    • The LLM component, a transformer with 24 layers, 32 attention heads, and 1.3B parameters in total.

    • The image embedding network, which is a pretrained CLIP ViT-L/14 (vision transformer) model with 1,024 feature dimensions, and images preprocessed into 224x224 resolution during training.

    • During training, the parameters of the CLIP model are frozen - except for the last layer (which probably lets the image encoder adapt to the LLM during joint training). See the sketch after this list.

    • The total # of parameters in the entire model is just 1.6B.

    • They’re using the AdamW optimizer with a batch size of 1.2 million tokens (0.5 million tokens from text corpora, 0.5 million tokens from image-caption pairs, and 0.2 million tokens from interleaved data), and train KOSMOS-1 for 300k steps, corresponding to about 360 billion tokens.

    • In order to better align KOSMOS-1 with human instructions, we perform language-only instruction tuning. Specifically, we continue-train the model with instruction data in the format of (instructions, inputs, and outputs). The instruction data is language-only and comes from two publicly available datasets.
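Here is a minimal sketch of the image-embedding setup, assuming the HuggingFace transformers CLIP checkpoint and an assumed LLM width; it illustrates the idea, not the released KOSMOS-1 code:

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel  # assumes the HuggingFace transformers library

# Pretrained CLIP ViT-L/14 vision tower (1,024-dim features, 224x224 inputs).
clip = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

# Freeze everything except the last transformer block (index 23 for ViT-L/14; the
# parameter-name pattern is an assumption about the HF implementation).
for name, p in clip.named_parameters():
    p.requires_grad = "encoder.layers.23" in name

d_model = 2048                       # assumed hidden size of the 1.3B LLM
img_proj = nn.Linear(1024, d_model)  # projects CLIP features into the LLM embedding space

def embed_image(pixel_values):
    """pixel_values: (1, 3, 224, 224) preprocessed image tensor."""
    feats = clip(pixel_values=pixel_values).last_hidden_state  # (1, num_patches + 1, 1024)
    return img_proj(feats)                                     # (1, num_patches + 1, d_model)

# The resulting embeddings take the place of the span between <image> and </image>
# in the interleaved input "<s> paragraph <image> ... </image> paragraph </s>".
```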

The model performs very well compared to other architectures:

  • It beats the Flamingo-3B and -9B models (which have far more parameters) at image captioning.

  • For a visual IQ test (Raven), it performs at 22% (vs. 17% random chance), and at 26% without language-only instruction tuning.

  • In OCR-free language understanding (reading an image without doing explicit OCR), it performs about the same as CLIP ViT-L/14 on the HatefulMemes task.

  • For web page question answering, there is an interesting comparison between pure LLMs and Kosmos-1: when feeding just the text from a website into an LLM, you get performance of 7.6. Kosmos-1 gets to 15.8. When Kosmos-1 gets no extracted text, it only scores at 3.8.

  • It is able to do multimodal chain-of-thought processing.

  • It is a remarkable result that image recognition in the combined language-image model gets better when the LLM does chain-of-thought reasoning: this suggests that the LLM’s extra world knowledge makes a big difference in understanding an image.

  • Also interesting: if you give it image descriptions along with the images, it gets much better.


Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models (Mar 2023)

(link)

The paper tries to augment ChatGPT with visual capabilities. It does this in a very simple way: it simply instructs ChatGPT to invoke various external tools when dealing with images. ChatGPT itself has no visual understanding beyond the textual information that those tools derive from the images. So this is conceptually similar to the Toolformer paper.


MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action (Mar 2023)

(link)

Idea: Some advanced vision tasks exceed the capabilities of existing vision and vision-language models. The paper introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos - so you can get vision information into ChatGPT. Here is what it does:

  • Use image and video file names as the input to ChatGPT

  • Set up various vision expert models that ChatGPT can call

  • Add instructions to ChatGPT prompt to make the model call those vision experts

Example use cases: associate information from multiple uploaded receipts and calculate the total travel cost (“MultiImage Reasoning”), recognize and answer questions about the “morel mushrooms” (“Open-World Concept Understanding”), and condense a long video into representative thumbnails (“Video Summarization and Event Localization”).

This is how it works (the paper illustrates it as a flowchart; a minimal sketch of the loop follows the list):

  • Instruct ChatGPT to emit specific watchwords in its action request if a vision expert is required.

  • Regex matching is applied to parse the expert’s name and the file path, which are then used to call the vision expert.

  • The expert’s output is combined with the chat history and fed back into ChatGPT.
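A minimal sketch of that loop (the watchword syntax and the expert registry are made up for illustration; MM-REACT's actual prompts differ):

```python
import re

# Hypothetical registry of vision experts; each maps a file path to a textual observation.
VISION_EXPERTS = {
    "image_captioning": lambda path: f"caption produced for {path}",
    "ocr":              lambda path: f"text read from {path}",
}

# Assumed watchword format: "... invoke <expert_name> on <file_path> ..."
ACTION_PATTERN = re.compile(r"invoke\s+(\w+)\s+on\s+(\S+)")

def handle_model_output(model_output, chat_history):
    """If ChatGPT asked for a vision expert, run it and feed the result back as history."""
    match = ACTION_PATTERN.search(model_output)
    if match:
        expert_name, file_path = match.group(1), match.group(2)
        observation = VISION_EXPERTS[expert_name](file_path)
        chat_history.append(f"Observation from {expert_name}: {observation}")
    else:
        chat_history.append(model_output)   # no expert needed: treat as the final answer
    return chat_history

history = handle_model_output("I need to invoke ocr on receipts/receipt_01.png", [])
```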


IMAGEBIND: One Embedding Space To Bind Them All (May 2023, added 5/21/23)

(link)

When you’re using any kind of input data in any kind of neural network, you generally operate in embeddings space - meaning, you first train a model to map an input token into a vector/position in that abstract embeddings space, where each position has real semantic meaning. For example, if you do that for text, these embedding vectors can even be added and subtracted, and you can do math like embedding(“queen”) ≈ embedding(“king”) - embedding(“male”) + embedding(“female”).

  • This paper shows that you can build such an embeddings space out of multimodal data, with six modalities (image/video, text, audio, depth, thermal images, and IMU [inertial measurement unit] data). This is cool because those six modalities suddenly exist in the same embeddings space, and you can treat them completely on equal footing! So you can do things like ‘show me an image that corresponds to embedding(“bird”) + embedding(“sound of waves crashing”)’.

  • The paper’s novelty is to show that you don’t need a dataset in which every training element binds all six modalities together - it’s enough for the model to see some pairwise relationships, and it can then generalize from there. In other words, the model learns which image information the “sound of waves crashing” corresponds to, without ever having been shown that pairing explicitly. (The toy sketch after this list illustrates the kind of cross-modal query this enables.)
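For illustration, here is a toy sketch of such a composed cross-modal query; the vectors are random stand-ins, not embeddings from a real IMAGEBIND model:

```python
import torch
import torch.nn.functional as F

d = 512                                                 # assumed embedding dimension
image_bank = F.normalize(torch.randn(1000, d), dim=-1)  # pretend gallery of image embeddings

text_emb  = F.normalize(torch.randn(d), dim=-1)         # stand-in for embedding("bird")
audio_emb = F.normalize(torch.randn(d), dim=-1)         # stand-in for embedding("sound of waves crashing")

query = F.normalize(text_emb + audio_emb, dim=-1)       # combine modalities by simple addition
scores = image_bank @ query                             # cosine similarity against the gallery
best_image_idx = scores.argmax().item()                 # index of the best-matching image
```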

How does the paper do it?

  • First, contrastive learning is a general technique for learning an embedding space by using pairs of related examples (positives) and unrelated examples (negatives). Using pairs of aligned observations, contrastive learning can align pairs of modalities such as (image, text), (audio, text), (image, depth), (video, audio) etc. However, in each case, the joint embeddings are trained and evaluated using the same pairs of modalities. Thus, (video, audio) embeddings are not directly applicable for text-based tasks while (image, text) embeddings cannot be applied for audio tasks.

  • The paper grounds everything in training around images, and for each image, it finds information in other modalities. Given an image I and its corresponding observation in the other modality M, we encode them into normalized embeddings: q = f(I) and k = g(M) where f, g are deep networks. They use a transformer for all the modality encoders. So: a) you use a separate deep-network encoder for each modality, and b) you only train on image-modality pairs (no pairs between any two other modalities). A minimal sketch of this contrastive setup appears at the end of this section.

  • Then what happens is this: We observe an emergent behavior in the embedding space that aligns two pairs of modalities (M1,M2) even though we only train using the pairs (I,M1) and (I,M2). This behavior allows us to perform a wide variety of zero-shot and cross-modal retrieval tasks without training for them.

  • Here are examples of what the new embeddings space can do: cross-modal retrieval across all six modalities, composing semantics by adding embeddings from different modalities, and upgrading existing text-based detection or image-generation models to audio-based ones simply by swapping in audio embeddings.
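And here is the promised minimal sketch of the image-grounded contrastive setup described above (the dimensions and the linear stand-in encoders are assumptions; the paper uses transformer encoders and an InfoNCE-style loss):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 512  # assumed size of the shared embedding space

# Stand-ins for the modality encoders f (images) and g (e.g. audio).
f = nn.Linear(2048, d)   # pretend image features -> shared space
g = nn.Linear(128, d)    # pretend audio features -> shared space

def contrastive_loss(image_feats, other_feats, temperature=0.07):
    """Aligns (image, other-modality) pairs; other examples in the batch act as negatives."""
    q = F.normalize(f(image_feats), dim=-1)   # q = f(I), shape (batch, d)
    k = F.normalize(g(other_feats), dim=-1)   # k = g(M), shape (batch, d)
    logits = q @ k.t() / temperature          # pairwise similarities
    targets = torch.arange(q.size(0))         # matching pairs sit on the diagonal
    # Symmetric loss: image -> other modality and other modality -> image.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 2048), torch.randn(8, 128))
```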