> [Paper link](https://openreview.net/pdf?id=qaxhBG1UUaS) | ICLR 2022 | RL Group 2023/02/07
## Abstract
Task-oriented dialogue is very challenging for RL since the natural language action space is astronomical, while feasible (syntactically and semantically correct) actions are very sparse.
This paper introduces GPT-Critic, an offline RL method for task-oriented dialogue.
GPT-Critic is **built upon GPT-2** and fine-tunes the language model through behavior cloning of critic-guided self-generated sentences.
## Introduction
The current best-performing end-to-end conversational agents for task-oriented dialogue utilize pre-training on a large-scale corpus and fine-tuning on downstream tasks.
This combination of **pre-training** and **fine-tuning** significantly improves overall performance on task-oriented dialogues.
Since supervised fine-tuning (i.e. imitation learning of the dialogue corpus) alone may not be sufficient to learn an optimal dialogue strategy, this paper argues that **goal-oriented training (i.e. reinforcement learning)** is an essential and promising direction to pursue.
For offline RL, **weighted behavior cloning (BC)** is one of the representative algorithms: it is free from the issue of diverging from human language because it filters out bad actions and imitates good actions using task-specific information.
> GPT-Critic can be adopted for any generative pre-trained language model
**GPT-Critic aims to revise unsuccessful dialogues into successful ones**, rather than removing them as done in weighted BC.
* It starts with **fine-tuning the GPT-2 model** and **learning the action-value function (critic)** using the dialogue corpus
* Then, GPT-Critic **generates a strategically promising action** that is selected **based on** the value estimated by the **critic**.
* GPT-Critic **updates the policy through behavior cloning** of the critic-guided self-generated responses.
This actor-critic-style procedure, with updates performed via behavior cloning rather than policy gradients, inherits GPT-2's ability to generate human-like responses.
<center>
<img src = "https://i.imgur.com/lRVgXef.png">
</center>
## Background
### Offline RL for task-oriented dialogue
A task-oriented dialogue system can be modeled as a partially observable Markov decision process (POMDP) defined by the tuple $\langle S, A, O, T, Z, R, \gamma \rangle$:
- $S$: the set of environment states $s = \langle g, h \rangle$, with $g$ the user goal and $h$ the dialogue history, where $h_t = \{ o_0, a_0, \ldots, o_{t-1}, a_{t-1}, o_t\}$
- $A$: the set of actions $a$ (a sequence of tokens which represents `dialogue act` and `system response`)
- $O$: the set of observations $o$ (user utterance)
- $T$: the transition function $T(s' \ | \ s, a) = \mathrm{Pr} (s_{t+1} = s' \ | \ s_t = s, a_t = a)$
- $Z$: the observation probability $Z(o \ | \ s', a) = \mathrm{Pr} (o_{t+1} = o \ | \ s_{t+1} = s', a_t = a)$
- $R$: the reward function $R(g, h, a)$
- $\gamma$: the discount factor
The policy $\pi(a_t \mid h_t)$ is a mapping from the history $h_t$ to a probability distribution over $A$.
The goal is to find an optimal policy $\pi^{*}$ that maximizes the expected cumulative rewards
$$
\pi^*=\arg \max _\pi \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R\left(g, h_t, a_t\right)\right]
$$
The action-value function of policy $\pi$ is defined as
$$
Q^\pi(h, a):=\mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R\left(g, h_t, a_t\right) \mid h_0=h, a_0=a\right]
$$
where $Q^{\pi}$ is the unique solution of the Bellman equation:
$$
Q^\pi(h, a)=\mathbb{E}_g[R(g, h, a)]+\gamma \mathbb{E}_\pi\left[Q^\pi\left(h^{\prime}, a^{\prime}\right)\right]
$$
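As a quick sanity check on the definitions above, here is a minimal Python sketch (not from the paper) that computes the discounted return of a single dialogue, i.e. one Monte-Carlo sample of $Q^\pi(h_0, a_0)$:
```python
# Minimal sketch: discounted return of one dialogue given its per-turn rewards.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

# Example: a 3-turn dialogue that only succeeds (reward 1) at the final turn.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.99))  # 0.9801
```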
:::info
**Previous work: offline RL for dialogue policy optimization**
The agent optimizes the policy from the pre-collected dataset
$$
\mathcal{D}=\left\{\left\{\left(g^j, h_t^j, a_t^j, r_t^j, h_{t+1}^j\right)_{t=0}^T\right\}_{j=1}^N\right\}
$$
without online environment interaction during the intermediate stages of training.
The algorithm relies on an off-policy actor-critic method, where the critic network is trained by minimizing the temporal difference error with respect to the target policy $\pi$:
$$
\tag{1}\underset{\phi}{\arg \min } \ \mathbb{E}_{\left(h_t, a_t, r_t, h_{t+1}\right) \sim \mathcal{D}}\left[\left(r_t+\gamma \mathbb{E}_{{\color{yellow}a_{t+1} \sim \pi\left(h_{t+1}\right)}}\left[Q_{\bar{\phi}}\left(h_{t+1}, {\color{yellow}a_{t+1}}\right)\right]-Q_\phi\left(h_t, a_t\right)\right)^2\right]
$$
where $\bar\phi$ is the parameters of the target network.
---
**Challenge**
It is hard to optimize this loss in the **offline RL setting** because of the overestimation issue in the bootstrapping process: the value of the next state is evaluated with **out-of-distribution (OOD) actions** sampled from the target policy.
:::
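To make the OOD issue concrete, here is a toy PyTorch sketch of the loss in equation $(1)$. The feature-based critic and policy are illustrative stand-ins (not the paper's GPT-2-based networks); the point is that the bootstrap action $a_{t+1}$ is produced by the target policy rather than taken from $\mathcal{D}$.
```python
import torch
import torch.nn as nn

# Toy illustration of Eq. (1): the bootstrap target evaluates the target critic on an
# action produced by the current policy, which may be out-of-distribution w.r.t. D.
feat_dim, act_dim, gamma = 16, 8, 0.99
critic = nn.Linear(feat_dim + act_dim, 1)         # stand-in for Q_phi(h, a)
target_critic = nn.Linear(feat_dim + act_dim, 1)  # stand-in for Q_phi_bar (target network)
target_critic.load_state_dict(critic.state_dict())
policy = nn.Linear(feat_dim, act_dim)             # stand-in for pi(. | h), outputs an action embedding

def td_loss_eq1(h, a, r, h_next):
    with torch.no_grad():
        a_next = policy(h_next)                   # a_{t+1} "sampled" from pi: possibly OOD
        q_next = target_critic(torch.cat([h_next, a_next], dim=-1)).squeeze(-1)
        target = r + gamma * q_next
    q = critic(torch.cat([h, a], dim=-1)).squeeze(-1)
    return ((target - q) ** 2).mean()

# Dummy batch of 4 transitions drawn from D.
loss = td_loss_eq1(torch.randn(4, feat_dim), torch.randn(4, act_dim),
                   torch.zeros(4), torch.randn(4, feat_dim))
loss.backward()
```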
### End-to-end task-oriented dialogue system
This paper focuses on the MultiWOZ 2.0 dataset, which is a representative benchmark for task-oriented dialogue.
The MultiWOZ dataset is a fully-annotated corpus of human-human task-oriented conversations.
In this paper, **the algorithm is built upon UBAR**, which is based on GPT-2 and currently the state-of-the-art end-to-end dialogue agent for the MultiWOZ domain.
:::info
**UBAR: Towards Fully End-to-End Task-Oriented Dialog System with GPT-2**
The model is acquired by fine-tuning the large pre-trained unidirectional language model GPT-2 on the sequence of the entire dialog session which is composed of *user utterance*, *belief state*, *database result*, *system act*, and *system response of every dialog turn*.
<center>
<img src = "https://i.imgur.com/vpEYT60.png)">
</center>
:::
## Method
Since a corpus collected from human-human conversations inevitably contains unsuccessful dialogues in terms of task completion, this paper aims to *revise* unsuccessful dialogues into successful ones, preventing the agent from repeating past failures while improving task performance.
GPT-Critic is analogous to the actor-critic method:
* GPT (Actor) decides which action to take **(not using policy gradient)**
* Critic informs how good the action was and provides a signal for policy improvement.
The method samples a set of action candidates using GPT-2 and picks the best one with the critic, which constitutes a *revised* dialogue corpus. Then, supervised fine-tuning of GPT-2 is performed on the revised dialogue corpus.
:::info
**Benefits**
1. The learning procedure of GPT-Critic does not hurt the agent's capability to generate human-like sentences.
2. The method can be adopted for any generative pre-trained language model.
:::
### Policy Evaluation
GPT-Critic starts by training the action-value function.
The architecture of the critic network basically **follows GPT-2, with a different last layer to compute the Q-value** (the critic network $Q_{\phi}$ shares the parameters of the Transformer layers of GPT-2).
The parameters of the shared Transformer layers are updated only during the policy improvement step.
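One possible way to realize such a critic on top of HuggingFace Transformers is sketched below; the exact head design is an assumption (the paper only states that the critic follows GPT-2 with a different last layer and shares its Transformer parameters with the policy).
```python
import torch
import torch.nn as nn
from transformers import GPT2Model

# Sketch of a GPT-2-based critic with a scalar Q-head on top of shared Transformer layers.
class GPT2Critic(nn.Module):
    def __init__(self, shared_transformer: GPT2Model):
        super().__init__()
        self.transformer = shared_transformer                       # shared with the GPT-2 policy
        self.q_head = nn.Linear(self.transformer.config.n_embd, 1)  # critic-specific last layer

    def forward(self, input_ids, attention_mask=None):
        hidden = self.transformer(input_ids, attention_mask=attention_mask).last_hidden_state
        # Read out the representation of the last token of the (h_t, a_t) sequence.
        return self.q_head(hidden[:, -1, :]).squeeze(-1)

critic = GPT2Critic(GPT2Model.from_pretrained("distilgpt2"))
q_value = critic(torch.tensor([[50256, 318, 262, 7072, 1695, 30]]))  # toy token ids
```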
The critic network is trained by minimizing the temporal difference error with respect to the dataset $\mathcal{D}$
$$
\tag{2}\underset{\phi}{\arg \min } \mathbb{E}_{\left(h_t, a_t, r_t, h_{t+1}, \color{yellow}{a_{t+1}}\right) \color{yellow}{\sim \mathcal{D}}}\left[\left(r_t+\gamma Q_{\bar{\phi}}\left(h_{t+1}, \color{yellow}{a_{t+1}}\right)-Q_\phi\left(h_t, a_t\right)\right)^2\right]
$$
where $\bar\phi$ is the parameters of the target network.
This is an *on-policy* evaluation on the dataset $\mathcal{D}$, which can be optimized very stably since every $a_{t+1}$ is always an **in-distribution sample** from $\mathcal{D}$, unlike equation $(1)$, which uses out-of-distribution actions sampled from the target policy $\pi$.
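Compared with the earlier toy sketch of equation $(1)$, the only change in equation $(2)$ is where the bootstrap action comes from; under the same illustrative feature-based setup:
```python
import torch
import torch.nn as nn

# Toy illustration of Eq. (2): the bootstrap action a_{t+1} is the one stored in D,
# so the target critic is never queried on out-of-distribution actions.
feat_dim, act_dim, gamma = 16, 8, 0.99
critic = nn.Linear(feat_dim + act_dim, 1)
target_critic = nn.Linear(feat_dim + act_dim, 1)
target_critic.load_state_dict(critic.state_dict())

def td_loss_eq2(h, a, r, h_next, a_next_from_data):
    with torch.no_grad():
        q_next = target_critic(torch.cat([h_next, a_next_from_data], dim=-1)).squeeze(-1)
        target = r + gamma * q_next
    q = critic(torch.cat([h, a], dim=-1)).squeeze(-1)
    return ((target - q) ** 2).mean()

loss = td_loss_eq2(torch.randn(4, feat_dim), torch.randn(4, act_dim), torch.zeros(4),
                   torch.randn(4, feat_dim), torch.randn(4, act_dim))
loss.backward()
```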
---
Previously, this kind of on-policy evaluation was limited to only *one-step* policy improvement
(further policy iteration would require off-policy evaluation).
GPT-Critic performs policy improvement by generating an improved dataset based on the learned critic, where we can perform on-policy evaluation on the new dataset again.
It thus enjoys stable *multi-step* policy iteration by alternating between on-policy evaluation and policy improvement via dataset revision.
### Policy Improvement via Dataset Revision
In the task-oriented dialogues, the reward is given by the external program provided as a part of the dataset, which checks whether the user goal is satisfied by examining the dialogue history.
GPT-Critic generates a new dataset containing *revised* responses by:
$$
\begin{equation}
\tag{3}\mathcal{D}_{i+1}=\left\{\left(g, h_t, a_t^*, r_t^*, h_{t+1}^*\right) \mid a_t^*=\underset{\substack{a \in\left\{a_k\right\}^N \\\left\{a_k\right\}^N \sim \pi_\theta^i\left(h_t\right)}}{\arg \max } Q_\phi\left(h_t, a\right) \text { where } h_t \in \mathcal{D}_i\right\}
\end{equation}
$$
where $\{ a_k \}^N$ is a set of $N$ response candidates generated from the policy $\pi_\theta^i$ (i.e. the fine-tuned GPT-2) and $\mathcal{D}_i$ is the dataset at the $i$-th iteration.
The revised reward is $r_t^{*} = R(g, h_t, a_t^{*})$, where $a_t^*$ is the revised system action.
The dialogue history is a sequence of all previous observations and actions, thus the revised history $h_{t+1}^*=\left\{o_0, a_0, \ldots, o_t, a_t^*, o_{t+1}\right\}$
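A hedged sketch of this revision step for a single turn, using the HuggingFace `generate` API; `critic_score` is a hypothetical stand-in for the learned critic $Q_\phi$.
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Sketch of Eq. (3) for one dialogue turn: sample N candidate responses from the
# policy, then keep the one the critic scores highest.
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
policy = GPT2LMHeadModel.from_pretrained("distilgpt2")

def revise_turn(history_text, critic_score, num_candidates=5, max_new_tokens=40):
    inputs = tokenizer(history_text, return_tensors="pt")
    outputs = policy.generate(
        **inputs,
        do_sample=True, top_k=0, top_p=1.0,        # vanilla softmax sampling
        max_new_tokens=max_new_tokens,
        num_return_sequences=num_candidates,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    candidates = [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]
    return max(candidates, key=lambda a: critic_score(history_text, a))  # a_t^*
```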
---
GPT-Critic then performs behavior cloning of critic-guided self-generated dialogues:
$$
\begin{equation}
\tag{4}\underset{\theta}{\arg \min } \mathbb{E}_{\left(h_t, a_t\right) \sim \mathcal{D}_{i+1}}\left[-\log \pi_\theta\left(a_t \mid h_t\right)\right]
\end{equation}
$$
where $\theta$ is the parameters of GPT-2.
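A minimal sketch of this behavior-cloning update with HuggingFace (illustrative, not the paper's exact UBAR-style training script; following UBAR's session-level training, the language-modeling loss here is taken over the whole concatenated sequence).
```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Sketch of Eq. (4): negative log-likelihood fine-tuning of GPT-2 on a revised turn.
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
model = GPT2LMHeadModel.from_pretrained("distilgpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def bc_step(history_text, revised_response_text):
    enc = tokenizer(history_text + " " + revised_response_text, return_tensors="pt")
    out = model(**enc, labels=enc["input_ids"])  # shifted cross-entropy = -log pi_theta
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```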
:::info
**The paper proves that the above policy improvement step yields a policy whose value is no lower than that of the old policy**
Theorem 1. (Policy Improvement)
Given a policy $\pi$ and the number of sampling actions $N \geq 1$.
If we update the new policy $\pi_N^{\text {new }}$ by
$$
\forall s, \pi_N^{\text {new }}(\cdot \mid s)=\underset{\substack{a \in\left\{a_k\right\}^N \\\left\{a_k\right\}^N \sim \pi(s)}}{\arg \max } Q^\pi(s, a)
$$
then $Q^{\pi_N^{\text {new }}}(s, a) \geq Q^\pi(s, a)$ always holds for all $s, a$.
Furthermore, for any $N, M$ such that $N \geq M \geq 1$, $Q^{\pi_N^{\text {new }}}(s, a) \geq Q^{\pi_M^{\text {new }}}(s, a)$ always holds for all $s, a$. (Proof in Appendix A.)
:::
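A rough intuition for the proof (the full argument is in Appendix A of the paper): for any history, the best of $N \geq 1$ candidates sampled from $\pi$ is, in expectation, at least as good as a single sample from $\pi$,
$$
\mathbb{E}_{\{a_k\}^N \sim \pi(s)}\Big[\max_{a \in \{a_k\}^N} Q^{\pi}(s, a)\Big] \;\geq\; \mathbb{E}_{a \sim \pi(s)}\big[Q^{\pi}(s, a)\big] = V^{\pi}(s),
$$
so acting with $\pi_N^{\text{new}}$ for one step and following $\pi$ afterwards is no worse than $\pi$; the standard policy improvement argument extends this to every step, and monotonicity in $N$ holds because the maximum over a larger candidate set can only increase.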
<center>
<img src="https://i.imgur.com/9E2f3sM.png">
</center>
## Related work
* End-to-End Task-Oriented Dialogue
* Modular pipeline (traditional)
* Pre-trained LM-based
* GPT-2-based end-to-end task-oriented dialogue agents
* Reinforcement Learning for Task-Oriented Dialogue Systems
* Latent representation models (addresses aforementioned problem)
* KL-control to restrict the policy
(applied to large-scale PTLMs with discrete latent variables)
* Bayes-adaptive Monte-Carlo
(negotiation dialogue then use it as a policy improvement operator)
* Offline Reinforcement Learning
* Weighted behavior cloning
* Decision Transformer
## Experiments
1. Dataset-based automatic evaluation on MultiWOZ 2.0 (compared with baseline methods including offline RL algorithms)
2. ConvLab framework for more realistic evaluation
3. Human evaluation (the quality of generated responses)
4. Qualitative analysis on the training dataset of MultiWOZ 2.0
### Experiment Setup
GPT-Critic is implemented on top of the HuggingFace Transformers library and the codebase of UBAR, using the MultiWOZ 2.0 dataset.
DistilGPT2, a distilled version of GPT-2, is used as the pre-trained language model.
For the hyperparameters of fine-tuning the GPT-2 model, they follow the setting in the public code of UBAR.
$N=5$ for the number of candidate actions $\{a_k\}^N$, and the set of candidate actions is constructed by vanilla softmax sampling from the policy.
For each behavior cloning iteration, all models are fine-tuned from the pre-trained GPT-2 on the training dataset, with early stopping according to the loss on the validation set.
:::info
:bulb: What is vanilla softmax sampling ?
Vanilla softmax sampling refers to sampling the next token directly from the probability distribution produced by applying a softmax to the language model's output logits, rather than always picking the single highest-probability token (greedy decoding). Every token keeps a chance proportional to its probability, which makes the generated candidates diverse. It can be contrasted with restricted sampling methods such as top-k or nucleus (top-p) sampling, which truncate the distribution before sampling.
:::
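A small, self-contained illustration of the difference between greedy decoding, vanilla softmax sampling, and top-k sampling over one step of (hypothetical) next-token logits:
```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])       # toy next-token logits from an LM head
probs = torch.softmax(logits, dim=-1)

greedy = torch.argmax(probs)                        # greedy decoding: always the top token
vanilla = torch.multinomial(probs, num_samples=1)   # vanilla softmax sampling: draw from the full distribution

# Top-k sampling (k=2): keep only the k most likely tokens, renormalize, then sample.
values, indices = torch.topk(probs, k=2)
topk_probs = torch.zeros_like(probs)
topk_probs[indices] = values / values.sum()
topk = torch.multinomial(topk_probs, num_samples=1)
```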
### Evaluation on the MultiWOZ dataset


:::info
**Automatic evaluation metrics**
1) Inform: evaluates whether the system provides an appropriate entity
2) Success: evaluates whether the system answers all the requested information
3) BLEU: measures the fluency of the generated response
:::
Combined Score as an overall quality measure (Combined = (Inform + Success) × 0.5 + BLEU)
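For example, with illustrative (not reported) numbers Inform = 90, Success = 80, BLEU = 17:
```python
def combined_score(inform, success, bleu):
    return (inform + success) * 0.5 + bleu

print(combined_score(90.0, 80.0, 17.0))  # (90 + 80) * 0.5 + 17 = 102.0
```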

### Evaluation on ConvLab evaluator
:::info
**Automatic evaluation metrics**
1) Complete: evaluates whether the system completes the goal
2) Success: evaluates whether all the user requests have been informed and the booked entities satisfy the constraints
3) Book: evaluates how many booked entities satisfy the user constraints
4) Inform (Precision / Recall / F1): evaluates how many user requests have been informed
5) Turn (success / all): evaluates the average turn number for successful/all dialogues
:::


### Human Evaluation

:::info
**Evaluation metrics on a Likert scale (1-5)**
1) Appropriateness: evaluates whether the generated responses are appropriate for the given context
2) Fluency: evaluates whether the generated responses are comprehensible and human-like.
:::
## Conclusion
GPT-Critic is an offline RL algorithm for task-oriented dialogue systems that aims to learn an end-to-end dialogue agent without the issue of diverging from human language.
GPT-Critic starts with fine-tuning the GPT-2 model and learning the critic using the dialogue corpus.
Then, GPT-Critic updates the policy through behavior cloning of the critic-guided self-generated responses, and is thus essentially free from the issue of diverging from human language.
## Appendix
* POLICY IMPROVEMENT THEOREM
* QUALITATIVE ANALYSIS OF SELF-GENERATED DIALOGUES
* QUALITATIVE EXAMPLES OF STANDARD REINFORCEMENT LEARNING ALGORITHM
* EVALUATION FOR THE QUALITY OF GENERATED DIALOGUE STATES AND DIALOGUE ACTS