---
title : "Deep Reinforcement Learning for Dialogue Generation"
tags : "IvLabs, RL"
---
# Deep Reinforcement Learning for Dialogue Generation
Link to the [Research Paper](https://arxiv.org/abs/1606.01541)
{% pdf https://arxiv.org/pdf/1606.01541.pdf %}
## Abstract
Neural models - short-sighted, predicting utterances one at a time while ignoring their influence on future outcomes
Conversational Properties - informativity, coherence and ease of answering
Evaluation - Diversity, length, Human judges
## Introduction
Seq2Seq
- Maximizes the probability of generating a response given the previous dialogue turn
- Trained to predict the next dialogue turn in a given conversational context using maximum-likelihood estimation (MLE)
- The system becomes stuck in an infinite loop of repetitive responses
Framework should have
- Ability to integrate developer-defined rewards that better mimic the true goal of chatbot development
- Ability to model the long-term influence of a generated response in an ongoing dialogue
The parameters of an encoder-decoder RNN define a policy over an infinite action space consisting of all possible utterances
## Related Work
Efforts to build statistical dialog systems fall into two major categories
- Treating dialogue generation as a source-to-target transduction problem, learning mapping rules between input messages and responses from a massive amount of training data
- Task-oriented dialogue systems to solve domain-specific tasks
## Reinforcement Learning for Open-Domain Dialogue
The learning system consists of two agents
- p - sentences generated from the first agent
- q - sentences generated from the second agent
Generated sentences - actions taken according to a policy defined by an encoder-decoder RNN language model
Policy Gradient is more appropriate than Q-learning
- We can initialize the encoder-decoder RNN using MLE parameters that already produce plausible responses, before changing the objective and tuning towards a policy that maximizes long-term reward
- Q-Learning - directly estimates the future expected reward of each action, which can differ from the MLE objective by orders of magnitude, making MLE parameters inappropriate for initialization
Components of Sequential decision process
- Action
- State
- Policy
- Reward
### Action
It is the dialogue utterance to generate - infinite action space - arbitrary-length sequences can be generated
### State
A state is denoted by the previous two dialogue turns $[p_i,q_i]$
The dialogue history is further transformed to a vector representation by feeding the concatenation of $p_i$ and $q_i$ into an LSTM encoder model
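A minimal sketch of this state encoding, assuming a PyTorch LSTM; the class name, embedding size, and hidden size are illustrative, not taken from the paper's implementation.
```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Encodes the previous two dialogue turns [p_i, q_i] into a state vector."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, p_i, q_i):
        # Concatenate the two turns along the time axis, then run the encoder.
        turns = torch.cat([p_i, q_i], dim=1)   # (batch, len_p + len_q) token ids
        embedded = self.embedding(turns)       # (batch, T, emb_dim)
        _, (h_n, _) = self.lstm(embedded)
        return h_n[-1]                         # (batch, hidden_dim) state representation
```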
### Policy
A policy takes the form of an LSTM encoder-decoder and is defined by its parameters
Stochastic Policy - a probability distribution over actions given states
Deterministic Policy - would result in a discontinuous objective that is difficult to optimize using gradient-based methods
### Reward
r denotes the reward obtained for each action
#### Ease of answering
- A turn generated by a machine should be easy to respond to - a forward-looking function that measures the constraints a turn places on the next turn
- Negative Log likelihood of responding to that utterance with a dull response
$r_1 = \displaystyle -\frac 1{N_\Bbb S}\underset{s\in\Bbb S}{\sum}\frac{1}{N_s}\text{log }p_{\text{seq2seq}}(s|a)$
$N_\Bbb S$ - Cardinality of $\Bbb S$, the manually constructed list of dull responses
$N_s$ - Number of tokens in the dull response $s$
$p_{\text{seq2seq}}$ - Likelihood output by SEQ2SEQ models
A system less likely to generate utterances in the list is thus also less likely to generate other dull responses
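A minimal sketch of $r_1$, assuming a pre-trained SEQ2SEQ model wrapped so that `seq2seq.log_prob(target, source)` returns the summed token log-likelihood; this helper and the token-id representation are assumptions, not the paper's API.
```python
def ease_of_answering_reward(seq2seq, action_ids, dull_responses):
    """r_1: negative average length-normalized log-likelihood of a hand-picked
    list S of dull responses, conditioned on the generated action a."""
    total = 0.0
    for s in dull_responses:                    # each s is a list of token ids
        log_p = seq2seq.log_prob(target=s, source=action_ids)  # log p_seq2seq(s | a)
        total += log_p / len(s)                 # normalize by N_s
    return -total / len(dull_responses)         # average over |S| and negate
```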
#### Information Flow
- Agent should contribute new information at each turn
- Penalizing semantic similarity between consecutive turns from the same agent
$\displaystyle r_2 = -\text{log cos}(h_{p_i}, h_{p_{i+1}}) = -\text{log }\frac{h_{p_i} \cdot h_{p_{i+1}}}{\|h_{p_i}\|\,\|h_{p_{i+1}}\|}$
$h_{p_i}$, $h_{p_{i+1}}$ - Representations obtained from the encoder for two consecutive turns $p_i$ and $p_{i+1}$
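A minimal sketch of $r_2$ using PyTorch's cosine similarity; the clamp on the similarity is an added safeguard so the log stays defined, not something specified in the paper.
```python
import torch
import torch.nn.functional as F

def information_flow_reward(h_prev, h_curr, eps=1e-8):
    """r_2: penalize semantic similarity between consecutive turns of the same agent.
    h_prev, h_curr are the encoder states h_{p_i} and h_{p_{i+1}}."""
    cos_sim = F.cosine_similarity(h_prev, h_curr, dim=-1)
    # Higher similarity -> larger penalty (more negative reward).
    return -torch.log(cos_sim.clamp(min=eps))
```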
#### Semantic Coherence
- Measure of the adequacy of responses, to avoid situations in which the generated replies are highly rewarded but are ungrammatical or not coherent
- Mutual information between the action $a$ and previous turns in the history to ensure the generated responses are coherent and appropriate
$r_3 = \displaystyle\frac1{N_a}\text{log }p_{\text{seq2seq}}(a|q_i,p_i) + \frac1{N_{q_i}}\text{log }p_{\text{seq2seq}}^{\text{backward}}(q_i|a)$
$p_{\text{seq2seq}}(a|p_i,q_i)$ - Probability of generating response $a$ given the previous dialogue utterances $[p_i, q_i]$
$p_{\text{seq2seq}}^{\text{backward}}(q_i|a)$ - Backward probability of generating the previous dialogue utterance $q_i$ based on response $a$ - trained in the same way as standard SEQ2SEQ models, with sources and targets swapped
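A minimal sketch of $r_3$ under the same assumed `log_prob` helper; `p_i`, `q_i`, and `action_ids` are token-id lists, and concatenating `p_i + q_i` stands in for the encoded dialogue history.
```python
def semantic_coherence_reward(seq2seq, backward_seq2seq, action_ids, p_i, q_i):
    """r_3: mutual information between the action a and the previous turns [p_i, q_i],
    estimated with a forward and a backward SEQ2SEQ model."""
    source = p_i + q_i                                                    # concatenated history
    forward = seq2seq.log_prob(target=action_ids, source=source)          # log p(a | p_i, q_i)
    backward = backward_seq2seq.log_prob(target=q_i, source=action_ids)   # log p(q_i | a)
    return forward / len(action_ids) + backward / len(q_i)                # length-normalized terms
```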
Final Reward
$r(a,[p_i,q_i]) = \lambda_1r_1+\lambda_2r_2+\lambda_3r_3$
where $\lambda_1+\lambda_2+\lambda_3 = 1$
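Combining the three terms is then a weighted sum; the default weights below follow the paper's reported setting ($\lambda_1 = 0.25$, $\lambda_2 = 0.25$, $\lambda_3 = 0.5$).
```python
def total_reward(r1, r2, r3, lambdas=(0.25, 0.25, 0.5)):
    """r(a, [p_i, q_i]) = lambda_1*r_1 + lambda_2*r_2 + lambda_3*r_3, weights summing to 1."""
    l1, l2, l3 = lambdas
    return l1 * r1 + l2 * r2 + l3 * r3
```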
## Simulation
The state-action space is explored by the two virtual agents taking turns talking with each other
Policy - $p_{\text{RL}}(p_{i+1}|p_i,q_i)$
The RL system is initialized with a general response-generation policy learned in a fully supervised setting
### Supervised Learning
First stage - build on prior work of predicting a generated target sequence given dialogue history using the supervised SEQ2SEQ model
SEQ2SEQ model with attention
### Mutual Information
We do not want to initialize the policy model using the pre-trained SEQ2SEQ models because this would lead to a lack of diversity in the RL models' experiences
Modeling mutual information between sources and targets significantly decreases the chance of generating dull responses
$r_3 = \displaystyle\frac1{N_a}\text{log }p_{\text{seq2seq}}(a|q_i,p_i) + \frac1{N_{q_i}}\text{log }p_{\text{seq2seq}}^{\text{backward}}(q_i|a)$
The second term of this equation requires the target sentence to be completely generated
This is handled by treating it as a reinforcement learning problem in which a reward, the mutual information value, is observed when the model arrives at the end of a sequence
Policy Gradient methods for optimization
Initialize the policy model $p_{\text{RL}}$ using a pre-trained $p_{\text{seq2seq}}(a|p_i,q_i)$ model
Given an input source $[p_i,q_i]$, we generate a candidate list $A = \{\hat a|\hat a \sim p_{\text{RL}}\}$
For each candidate $\hat a$ we obtain the mutual information score $m(\hat a,[p_i,q_i])$ from the pre-trained $p_{\text{seq2seq}}(a|p_i,q_i)$ and $p_{\text{seq2seq}}^{\text{backward}}(q_i|a)$
This mutual information score will be used as a reward and back-propagated to the encoder-decoder model
The expected reward for a sequence is given by
$\qquad J(\theta) = \Bbb E[m(\hat a, [p_i,q_i])]$
The gradient is estimated using the likelihood ratio trick
$\qquad \nabla J(\theta) = m(\hat a, [p_i,q_i])\nabla\text{log }p_{\text{RL}}(\hat a|[p_i,q_i])$
Parameters are updated in the encoder-decoder using stochastic gradient descent
Curriculum Learning strategy - for every sequence of length T, the MLE loss is used for the first L tokens and the reinforcement algorithm for the remaining T - L tokens
L is gradually annealed towards 0
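A minimal sketch of that per-token mixing, assuming the MLE and RL contributions are already available as per-token loss tensors; the slicing index L is the curriculum variable that is annealed toward zero.
```python
def mixed_sequence_loss(mle_loss_per_token, rl_loss_per_token, L):
    """Curriculum mixing: supervised (MLE) loss on the first L tokens,
    REINFORCE-style loss on the remaining T - L tokens of the sequence."""
    return mle_loss_per_token[:L].sum() + rl_loss_per_token[L:].sum()
```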
To decrease the learning variance - an additional neural model takes as input the generated target and the initial source and outputs a baseline value
$\qquad \nabla J(\theta) = \nabla\text{log }p_{\text{RL}}(\hat a|[p_i,q_i])[m(\hat a, [p_i,q_i]) - b]$
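A minimal sketch of this baseline-adjusted update; `log_probs` is assumed to be the per-token log-probabilities of the sampled response under $p_{\text{RL}}$, and the baseline is assumed to come from the separate value network described above.
```python
def policy_gradient_loss(log_probs, reward, baseline):
    """REINFORCE with a baseline: minimizing this loss follows the gradient
    (m(a_hat, [p_i, q_i]) - b) * grad log p_RL(a_hat | [p_i, q_i])."""
    advantage = float(reward) - float(baseline)   # treated as a constant w.r.t. the policy
    return -advantage * log_probs.sum()           # negative sign: the optimizer minimizes
```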
### Dialogue Simulation between Two Agents
- A message from the training set is fed to the first agent
- The agent encodes the input message to a vector representation and starts decoding to generate a response output
- The immediate output from the first agent is combined with the dialogue history and fed to the second agent
- The second agent updates the state by encoding the dialogue history into a representation and uses the decoder RNN to generate responses, as sketched below
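A minimal sketch of this two-agent simulation loop; `agent.respond(history)` is an assumed helper that encodes the recent history and decodes a reply, and the turn limit mirrors the 5-turn cap mentioned in the curriculum section below.
```python
def simulate_dialogue(agent_p, agent_q, opening_message, max_turns=5):
    """Let the two agents take turns replying, starting from a real opening
    message drawn from the training set; returns the simulated turns."""
    history = [opening_message]
    agents = [agent_p, agent_q]
    for turn in range(max_turns):
        speaker = agents[turn % 2]                # agents alternate turns
        response = speaker.respond(history[-2:])  # state: previous (up to) two turns
        history.append(response)
    return history
```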
Optimization
- Initialize the policy model $p_{\text{RL}}$ with parameters from the mutual information model
- The objective to maximize is the expected future reward
$\qquad\displaystyle J_{\text{RL}}(\theta) = \Bbb E_{p_{\text{RL}}(a_{1:T})}\left[\underset{i = 1}{\overset{T}{\sum}}R(a_i,[p_i,q_i])\right]$
R - Reward resulting from action a
- The likelihood ratio trick is used for gradient updates
$\qquad\displaystyle \nabla J_{\text{RL}}(\theta) \approx \underset i\sum\nabla\text{log }p(a_i|p_i,q_i)\underset{i = 1}{\overset{T}{\sum}}R(a_i,[p_i,q_i])$
### Curriculum Learning
- The curriculum learning strategy is applied again - begin by simulating the dialogue for 2 turns and gradually increase the number of simulated turns
- Generate 5 turns at most, as the number of candidates to examine grows exponentially in the size of the candidate list
## Experimental Results
### Automatic Evaluation
Length of the dialogue
- A dialogue ends when one of the agents starts generating dull responses or two consecutive utterances from the same user are highly overlapping
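A minimal sketch of this stopping rule; the word-overlap threshold and the exact dull-response check are illustrative choices, not values taken from the paper.
```python
def dialogue_ended(history, dull_responses, overlap_threshold=0.8):
    """Stop when the latest turn is a dull response, or when two consecutive
    turns from the same agent overlap heavily (agents alternate, so the
    previous turn by the same agent is history[-3])."""
    latest = history[-1]
    if latest in dull_responses:
        return True
    if len(history) >= 3:
        previous = set(history[-3].split())
        current = set(latest.split())
        overlap = len(current & previous) / max(len(current), 1)
        return overlap > overlap_threshold
    return False
```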
Diversity
- Degree of diversity - calculating the number of distinct unigrams and bigrams in generated responses
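A minimal sketch of the distinct-n computation; scaling by the total number of generated tokens is the usual normalization so that longer outputs are not favored.
```python
def distinct_n(responses, n):
    """Diversity: number of distinct n-grams in the generated responses,
    divided by the total number of generated tokens (distinct-1, distinct-2)."""
    ngrams, total_tokens = set(), 0
    for response in responses:
        tokens = response.split()
        total_tokens += len(tokens)
        ngrams.update(zip(*[tokens[i:] for i in range(n)]))
    return len(ngrams) / max(total_tokens, 1)
```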
### Human Evaluation
- Crowdsourced Judges to evaluate a random sample of 500 items