Decision Transformer

In this work (arXiv), we studied how offline reinforcement learning could be performed using conditional sequence modeling, the same approach behind language models such as GPT.
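
Concretely, each trajectory is flattened into a sequence of (return-to-go, state, action) tokens, and a GPT-style model is trained to predict the action at every step. Below is a minimal sketch of that data layout; the helper names (`returns_to_go`, `make_sequence`) are illustrative, not from the released code:

```python
import numpy as np

def returns_to_go(rewards):
    """Suffix sums of the rewards: R_t = r_t + r_{t+1} + ... + r_T."""
    return np.cumsum(rewards[::-1])[::-1]

def make_sequence(states, actions, rewards):
    """Flatten one trajectory into the (R_1, s_1, a_1, R_2, s_2, a_2, ...)
    stream the transformer is trained on; at every step the model is asked
    to predict a_t from everything that came before."""
    rtg = returns_to_go(np.asarray(rewards, dtype=np.float32))
    return [tok for R, s, a in zip(rtg, states, actions)
            for tok in (("return_to_go", R), ("state", s), ("action", a))]
```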

Example

The simplest way to think of Decision Transformer now (in the ChatGPT era) is as a chatbot:

  1. You start by saying: “You are an expert in X” or “You have an IQ of 140” (the target return).
  2. You then provide context to the situation, or give it a question (its observation).
  3. The model then responds to your question (its action).
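
In Decision Transformer terms, this "conversation" is a rollout: prompt the model with the return you want, feed it the current observation, take its action, and after each environment step decrement the target return by the reward received. A hedged sketch, assuming a classic gym-style `env` and a hypothetical `model.predict_action` interface:

```python
def rollout(env, model, target_return, max_steps=1000):
    """Roll out a trained Decision Transformer by conditioning on a desired return.

    `model.predict_action` is a stand-in for whatever forward pass your
    implementation exposes: it consumes the (return-to-go, state, action)
    history and returns the next action. `env` is assumed to follow the
    classic gym interface.
    """
    state = env.reset()
    returns_to_go, states, actions = [target_return], [state], []
    for _ in range(max_steps):
        # 1. the target return ("you are an expert"), 2. the observation
        # (the "question"), 3. the model's action (its "response")
        action = model.predict_action(returns_to_go, states, actions)
        state, reward, done, _ = env.step(action)
        states.append(state)
        actions.append(action)
        # Condition the next step on the return still left to achieve.
        returns_to_go.append(returns_to_go[-1] - reward)
        if done:
            break
    return target_return - returns_to_go[-1]  # return achieved so far
```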

This turned out to be a general emergent phenomenon of large models trained on internet data:

Here, we have:

  • Target return: “very” ^ n
  • Observation: the text prompt
  • Action: the model’s generated image

Then for n=22: “A very very very very very very very very very very very very very very very very very very very very very very beautiful painting of a mountain next to a waterfall.”

Comparison to imitation learning

One of the crucial questions for us was whether to train on all the data (and condition on expertise), or only on the high-quality expert subset; we named the latter approach “percentile behavior cloning” (%BC).

It turns out that on the popular D4RL robotics benchmark, %BC was sufficient to achieve state-of-the-art performance! However, in data-scarce settings such as the Atari benchmark, performance suffered drastically when data was thrown out.
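
As a rough sketch of what %BC means in code (assuming each trajectory is stored as a dict with a "rewards" array; the helper name is made up for illustration), you keep only the top X% of trajectories by episode return and run ordinary behavior cloning on that subset:

```python
import numpy as np

def percentile_bc_filter(trajectories, percentile=10.0):
    """Keep only the top `percentile`% of trajectories, ranked by episode return.

    Each trajectory is assumed to be a dict with a "rewards" array; the kept
    subset is what %BC then trains an ordinary behavior-cloning policy on
    (supervised prediction of the logged actions).
    """
    episode_returns = np.array([np.sum(t["rewards"]) for t in trajectories])
    cutoff = np.percentile(episode_returns, 100.0 - percentile)
    return [t for t, ret in zip(trajectories, episode_returns) if ret >= cutoff]
```

With `percentile=10`, this keeps roughly the best decile of the dataset, which is the kind of filtering that works well on D4RL but hurts badly on Atari, where expert-level data is scarce.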

I think this is still a really interesting question, and it remains an active area of research, as large foundation models are pretrained on vast amounts of low-quality data and then finetuned on high-quality subsets.

Follow-up work

There have been a lot of interesting papers in this area, but just to highlight a few:

Unaffiliated links:
