Pretrained Transformers as Universal Computation Engines

In this work (arXiv), we found that pretraining on language data can lead to nontrivially better performance on non-language tasks – an early sign of the multimodality inherent in large language models.

Multimodal transfer

Our main result was that you can take GPT-2 (~100M parameters) and adapt it to a new modality by finetuning only the input, output, and layer-norm parameters. This meant that we could transfer the bulk of the model (the attention and feedforward layers) to the new modality completely frozen and still achieve good performance!
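
As a rough sketch of this recipe (assuming the HuggingFace transformers GPT-2 implementation; the class and projection names below are illustrative, not the paper's code), the frozen pretrained transformer can be set up by freezing every GPT-2 parameter except the layer norms and positional embeddings, then attaching new trainable input and output projections for the target modality:

```python
import torch.nn as nn
from transformers import GPT2Model


class FrozenPretrainedTransformer(nn.Module):
    """GPT-2 with frozen attention/feedforward blocks; only the layer norms,
    positional embeddings, and new input/output projections are trained."""

    def __init__(self, input_dim: int, output_dim: int, model_name: str = "gpt2"):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained(model_name)
        hidden = self.gpt2.config.n_embd

        # Freeze everything, then re-enable the small parameter groups that
        # are finetuned: layer norms (ln_1, ln_2, ln_f) and positional
        # embeddings (wpe). Attention and MLP weights stay frozen.
        for name, param in self.gpt2.named_parameters():
            param.requires_grad = ("ln" in name) or ("wpe" in name)

        # New, trainable projections into and out of the transformer,
        # specific to the target (non-language) modality.
        self.input_proj = nn.Linear(input_dim, hidden)
        self.output_proj = nn.Linear(hidden, output_dim)

    def forward(self, x):
        # x: (batch, seq_len, input_dim) sequence from the new modality
        h = self.gpt2(inputs_embeds=self.input_proj(x)).last_hidden_state
        return self.output_proj(h)
```

Training then only updates the layer norms, positional embeddings, and the two projections, which together are a small fraction of the ~100M total parameters.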

One implication of this is that you might think of the language model as having learned capabilities similar to the hippocampus: acting as a general, multimodal sequence processor. Then, perhaps you could use the language model as the base for future adapters on top, performing reasoning in the space of language. Our initial result was far from this, but later work has pushed this capability, with LLMs such as GPT-4 and Gemini serving as the base for powerful multimodal capabilities.

Follow-up work

  • LLaVA (Apr 2023): Imbuing a language model with highly performant vision capabilities
  • PaLM-E (Mar 2023): Turning a large language model into an embodied multimodal model
  • Frozen (Jun 2021): No longer necessary to finetune the layer-norm parameters
