# Decision Transformer: Reinforcement Learning via Sequence Modeling
---
> Related Papers:
>
> [1] (Main Reading Material) Lili Chen, et al. Decision Transformer: Reinforcement Learning via Sequence Modeling, 2021.
>
> [2] Jürgen Schmidhuber. Reinforcement Learning Upside Down: Don't Predict Rewards - Just Map Them to Actions, 2019.
>
> [3] Michael Janner, et al. Offline Reinforcement Learning as One Big Sequence Modeling Problem, 2021.
- Conservative Q-learning
- Transformer: credit assignment
- How to get the ground-truth step-wise credit assignment?
- Contrastive learning for the long-middle-context problem
- Discount-free?
- Achievability in the inference phase?
- Active learning & %BC
- Hyper-parameter tuning?
## Paper Overview
In this paper, the authors propose a transformer-based framework that formulates the reinforcement learning problem as an auto-regressive sequence generation problem. The idea of modelling offline RL as a sequence modelling task over historical experience was first brought up by Jürgen Schmidhuber [2]. However, his *Upside-Down RL* model only targets the highest-rewarding behavior, while the *Decision Transformer* [1] discussed here is trained on random-walk sequences, aiming to learn the agent's general behavior pattern while being more data-efficient.
## Main Insights
In this section, I summarize several main insights according to my own understanding. Some of them might not be mentioned in the original paper [1]. Please correct me if anything is wrong here :)
### 1. Step-wise Credit Assignment by Self-Attention
In the traditional reinforcement learning setting, the reward is often a one-time gain obtained at the last step, so the contributions of earlier steps are prone to be ignored. The self-attention mechanism in the transformer architecture can assign credit to each step by learning a *similarity*-based scoring scheme.
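To make this concrete, here is a minimal sketch (my own illustration, not the authors' code) of how a single causal self-attention head produces a per-step weighting over trajectory tokens; the attention row of the final token can be read as a soft credit assigned to every earlier step. All shapes and variable names are assumptions.

```python
import numpy as np

def self_attention_weights(tokens: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention weights.

    tokens: (T, d) array of trajectory token embeddings
            (e.g. interleaved return / state / action embeddings).
    Returns the (T, T) attention matrix; row t can be read as the
    soft credit that step t assigns to every step <= t (causal mask).
    """
    T, d = tokens.shape
    # For illustration the tokens are reused as queries/keys directly;
    # a real model would first apply learned projections W_Q, W_K.
    scores = tokens @ tokens.T / np.sqrt(d)           # (T, T) similarity scores
    mask = np.tril(np.ones((T, T), dtype=bool))       # causal mask: no peeking ahead
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over attended steps
    return weights

# Toy usage: credit the last token assigns to all earlier steps.
rng = np.random.default_rng(0)
traj = rng.normal(size=(6, 8))                        # 6 tokens, embedding dim 8
print(self_attention_weights(traj)[-1])
```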
### 2. Parallel Context Learning instead of Long Backpropagation
From my perspective, the conventional RL approach learns each state-action step by propagating value backwards from the last step of the whole sequence, with future rewards discounted when calculating the *return* at every earlier step. This can make learning suffer from a long-context problem similar to that of the vanilla RNN model. With the transformer, the whole context is considered at the same time: the causal relationships can be preserved and better learnt, and the tedious dynamic-programming-style backup is avoided. Besides, the previous one-step Markov assumption can be extended to longer contexts, which helps to avoid the undesirable short-sighted behaviors of the TD learning mechanism.
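As a concrete, discount-free illustration of this point (my own sketch, not code from the paper): the Decision Transformer conditions on undiscounted returns-to-go and interleaves them with states and actions into one context window, rather than backing up a gamma-discounted value step by step.

```python
import numpy as np

def returns_to_go(rewards):
    """Undiscounted return-to-go: R_hat_t = sum of r_t' for all t' >= t."""
    return np.cumsum(rewards[::-1])[::-1]

def interleave_tokens(returns, states, actions):
    """Build the (R_hat_1, s_1, a_1, R_hat_2, s_2, a_2, ...) token order that
    the transformer consumes as a single parallel context window."""
    return [tok for triple in zip(returns, states, actions) for tok in triple]

rewards = np.array([0.0, 0.0, 1.0, 0.0, 5.0])
states  = ["s1", "s2", "s3", "s4", "s5"]
actions = ["a1", "a2", "a3", "a4", "a5"]

print(returns_to_go(rewards))                                    # [6. 6. 6. 5. 5.]
print(interleave_tokens(returns_to_go(rewards), states, actions)[:6])
# Every step sees the full undiscounted future return directly, instead of a
# gamma-discounted value propagated backwards step by step.
```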

## Potential Pitfalls
Based on my own understanding, I list some critiques and concerns about the authors' approach, covering the dataset, evaluation tasks, and experiment design.
### 1. Data Availability: Is the step-wise return always available?
For some RL tasks, a binary reward is assigned only at the last step, while the per-step contributions are identical or manually assigned. In that case, it might be hard to obtain a convincing step-wise return for training the Decision Transformer.
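A tiny illustration of this concern (my own example, not from the paper): with a terminal-only binary reward, the undiscounted return-to-go collapses to the same value at every step, so the per-step conditioning signal is uninformative.

```python
# Terminal-only binary reward: 1 at the last step, 0 everywhere else.
rewards = [0, 0, 0, 0, 1]

# Undiscounted return-to-go at each step (sum of all future rewards).
rtg = [sum(rewards[t:]) for t in range(len(rewards))]
print(rtg)  # [1, 1, 1, 1, 1] -> identical at every step, no step-wise signal
```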
### 2. Evaluation Task: Shortest Path Problem
The authors design the shortest-path problem as one of the evaluation tasks. As illustrated in Figure 1 of the paper, the final cost is 0 if the goal state is reached and infinity otherwise, and each preceding step adds one point to the total cost. In the training phase, the *return* values at the initial state are used as the supervision signal to train the model. *(The right-most panel of Figure 1 in the paper might have mislabelled -2 as -3.)*
In this setting, my main concerns are two-fold:
1. **During the inference phase, the achievability of the task has not been checked.** For example, in the original right-most mini-figure in Figure 1, if the initial return value is assigned as -3, the learnt sequence is very likely to be three steps with descending returns [-3, -2, -1], which fails to find the shortest path. *<u>However, pre-checking achievability for the shortest-path problem **still requires a DP-style algorithm**, which might weaken the alleged effort-saving property (see the sketch after this list).</u>*
2. **The invalid paths with infinite cost are useless.** Every step in such a trajectory is labelled with an infinite value, so there is nothing for the model to learn from it. However, we cannot control the number of these invalid routes in the *random-walk* data collection process, which is the source of this concern.
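Regarding point 1, here is a minimal sketch (my own hypothetical example; the toy graph and all names are assumptions) of the kind of pre-check that would be needed: a BFS/DP pass gives the true shortest cost, any conditioned return larger (less negative) than that optimum is unachievable, and any smaller one leads to a suboptimal path.

```python
from collections import deque

def shortest_cost(graph, start, goal):
    """BFS shortest-path length on an unweighted graph; this is exactly the
    DP-style computation whose avoidance is part of DT's appeal."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return dist[node]
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return float("inf")  # goal unreachable: the "invalid path" case

# Hypothetical toy graph (adjacency lists); each step costs -1.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["B", "D"], "D": []}
optimal_return = -shortest_cost(graph, "A", "D")   # shortest path has return -2

for target_return in (-1, -2, -3):
    if target_return > optimal_return:
        status = "unachievable (fewer steps than the shortest path allows)"
    elif target_return == optimal_return:
        status = "optimal"
    else:
        status = "achievable here, but suboptimal (e.g. A -> C -> B -> D)"
    print(target_return, status)
```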

### 3. Experiment: Ablation Study
My concerns here mainly arise from the ablation experiments on context length.
- **Ablation experiment:** The authors study the influence of context length only by comparing DT's performance with context length K=50/30 against no context at all (K=1). Thus, the question they actually study is <u>*whether the model needs context*</u>, rather than <u>*how much context would optimize the performance*</u>.

## Potential Future Extension
### 1. Contrastive Learning to Avoid *futile mid-context* Problem
The authors raise one concern in the paper, which I call the "futile mid-context" problem here. When the final state depends mostly on the initial state, while the long context in the middle contributes very little, it might be hard for the transformer to identify the relevant tokens.
***<u>[Proposal]</u>***: Intuitively, if we can collect a sufficient number of negative samples, we can try to integrate a contrastive loss to mitigate this problem.
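A rough sketch of what such a contrastive term could look like (entirely my own proposal sketch, not anything from the paper; the embedding shapes and hyper-parameters are assumptions): an InfoNCE-style loss that pulls a trajectory-context embedding towards a positive counterpart and pushes it away from futile mid-context negatives.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss.

    anchor:    (d,)   embedding of a trajectory context
    positive:  (d,)   embedding to pull close (e.g. a context leading to the
                      same final state)
    negatives: (N, d) embeddings to push away (e.g. futile mid-contexts)
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = (anchor * positive).sum(-1, keepdim=True)      # (1,)
    neg_sim = negatives @ anchor                              # (N,)
    logits = torch.cat([pos_sim, neg_sim]) / temperature      # (1+N,)
    # The positive pair sits at index 0 of the logits.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

# Toy usage with random embeddings (dimension 16, 8 negatives).
d = 16
loss = info_nce_loss(torch.randn(d), torch.randn(d), torch.randn(8, d))
print(loss.item())
```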
### 2. Active Learning for %BC methodology
The authors experiment with how the chosen percentile of samples affects the performance of Percentile Behavior Cloning (%BC). However, they do not describe the sampling methodology (or whether they simply sample at random).
***<u>[Proposal]</u>*** We can try active learning to determine which part of the data yields the largest performance gain for the %BC method.
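A rough sketch of this proposal (entirely my own assumption, not the paper's method; the uncertainty scores are taken as given): plain %BC keeps the top return percentile, while an active-learning variant would spend a fixed cloning budget on the trajectories where the current policy is most uncertain.

```python
import numpy as np

def percentile_bc_subset(returns, top_percent=10.0):
    """Indices of the top `top_percent`% trajectories by return (plain %BC)."""
    threshold = np.percentile(returns, 100.0 - top_percent)
    return np.nonzero(returns >= threshold)[0]

def active_bc_subset(returns, uncertainty, budget, top_percent=50.0):
    """Hypothetical active-learning variant: within a broad top-return pool,
    spend the cloning budget on the trajectories where the current policy is
    most uncertain (uncertainty scores are assumed to be provided)."""
    pool = percentile_bc_subset(returns, top_percent)
    ranked = pool[np.argsort(-uncertainty[pool])]    # most uncertain first
    return ranked[:budget]

# Toy usage with random per-trajectory returns and uncertainty scores.
rng = np.random.default_rng(0)
returns = rng.normal(size=100)
uncertainty = rng.random(100)
print(percentile_bc_subset(returns, 10.0))           # ~10 top trajectories
print(active_bc_subset(returns, uncertainty, budget=5))
```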

---
> This is the end of the doc, thanks for reading me! :)
###### tags: `Paper Reading Notes`