---
title : "Deep Reinforcement Learning for Dialogue Generation"
tags : "IvLabs, RL"
---
# Deep Reinforcement Learning for Dialogue Generation
Link to the [Research Paper](https://arxiv.org/abs/1606.01541)
{% pdf https://arxiv.org/pdf/1606.01541.pdf %}
## Abstract
Neural models - short-sighted, predicting utterances one at a time while ignoring their influence on future outcomes
Conversational Properties - informativity, coherence and ease of answering
Evaluation - Diversity, length, Human judges
## Introduction
Seq2Seq
- Maximizes the probability of generating a response given the previous dialogue turn
- Trained to predict the next dialogue turn in a given conversational context using maximum-likelihood estimation (MLE)
- The system becomes stuck in an infinite loop of repetitive responses
Framework should have
- Ability to integrate developer-defined rewards that better mimic the true goal of chatbot development
- Ability to model the long-term influence of a generated response in an ongoing dialogue
The parameters of an encoder-decoder RNN define a policy over an infinite action space consisting of all possible utterances
## Related Work
Efforts to build statistical dialog systems fall into two major categories
- Treating dialogue generation as a source-to-target transduction problem, learning mapping rules between input messages and responses from a massive amount of training data
- Task-oriented dialogue systems to solve domain-specific tasks
## Reinforcement Learning for Open-Domain Dialogue
The learning system consists of two agents
- p - sentences generated from the first agent
- q - sentences generated from the second agent
Generated sentences - actions taken according to a policy defined by an encoder-decoder RNN language model
Policy Gradient is more appropriate than Q-learning
- We can initialize the encoder-decoder RNN using MLE parameters that already produce plausible responses, before changing the objective and tuning towards a policy that maximizes long-term reward
- Q-Learning - directly estimates the future expected reward of each action, which can differ from the MLE objective by orders of magnitude, making MLE parameters inappropriate for initialization
Components of Sequential decision process
- Action
- State
- Policy
- Reward
### Action
It is the dialogue utterance to generate - infinite action space - arbitrary-length sequences can be generated
### State
A state is denoted by the previous two dialogue turns $[p_i,q_i]$
The dialogue history is further transformed to a vector representation by feeding the concatenation of $p_i$ and $q_i$ into an LSTM encoder model
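A minimal sketch of this state encoding, assuming a PyTorch LSTM; the class name, embedding size, and hidden size are illustrative, not taken from the paper's implementation.
```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Encodes the previous two dialogue turns [p_i, q_i] into a state vector."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, p_i, q_i):
        # Concatenate the two turns along the time axis, then run the encoder.
        turns = torch.cat([p_i, q_i], dim=1)   # (batch, len_p + len_q) token ids
        embedded = self.embedding(turns)       # (batch, T, emb_dim)
        _, (h_n, _) = self.lstm(embedded)
        return h_n[-1]                         # (batch, hidden_dim) state representation
```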
### Policy
A policy takes the form of an LSTM encoder-decoder and is defined by its parameters
Stochastic Policy - a probability distribution over actions given states
Deterministic Policy - would result in a discontinuous objective that is difficult to optimize using gradient-based methods
### Reward
r denotes the reward obtained for each action
#### Ease of answering
- A turn generated by a machine should be easy to respond to - a forward-looking function that measures the constraints a turn places on the next turn
- Negative Log likelihood of responding to that utterance with a dull response
$r_1 = \displaystyle -\frac 1{N_\Bbb S}\underset{s\in\Bbb S}{\sum}\frac{1}{N_s}\text{log }p_{\text{seq2seq}}(s|a)$
$N_\Bbb S$ - Cardinality of $\Bbb S$, the manually constructed list of dull responses
$N_s$ - Number of tokens in the dull response $s$
$p_{\text{seq2seq}}$ - Likelihood output by SEQ2SEQ models
A system less likely to generate utterances in the list is thus also less likely to generate other dull responses
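A minimal sketch of $r_1$, assuming a pre-trained SEQ2SEQ model wrapped so that `seq2seq.log_prob(target, source)` returns the summed token log-likelihood; this helper and the token-id representation are assumptions, not the paper's API.
```python
def ease_of_answering_reward(seq2seq, action_ids, dull_responses):
    """r_1: negative average length-normalized log-likelihood of a hand-picked
    list S of dull responses, conditioned on the generated action a."""
    total = 0.0
    for s in dull_responses:                    # each s is a list of token ids
        log_p = seq2seq.log_prob(target=s, source=action_ids)  # log p_seq2seq(s | a)
        total += log_p / len(s)                 # normalize by N_s
    return -total / len(dull_responses)         # average over |S| and negate
```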
#### Information Flow
- Agent should contribute new information at each turn
- Penalizing semantic similarity between consecutive turns from the same agent
$\displaystyle r_2 = -\text{log cos}(h_{p_i}, h_{p_{i+1}}) = -\text{log }\frac{h_{p_i} \cdot h_{p_{i+1}}}{\|h_{p_i}\|\,\|h_{p_{i+1}}\|}$
$h_{p_i}$, $h_{p_{i+1}}$ - Representations obtained from the encoder for two consecutive turns $p_i$ and $p_{i+1}$
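A minimal sketch of $r_2$ using PyTorch's cosine similarity; the clamp on the similarity is an added safeguard so the log stays defined, not something specified in the paper.
```python
import torch
import torch.nn.functional as F

def information_flow_reward(h_prev, h_curr, eps=1e-8):
    """r_2: penalize semantic similarity between consecutive turns of the same agent.
    h_prev, h_curr are the encoder states h_{p_i} and h_{p_{i+1}}."""
    cos_sim = F.cosine_similarity(h_prev, h_curr, dim=-1)
    # Higher similarity -> larger penalty (more negative reward).
    return -torch.log(cos_sim.clamp(min=eps))
```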
#### Semantic Coherence
- Measure of the adequacy of responses, to avoid situations in which the generated replies are highly rewarded but are ungrammatical or not coherent
- Mutual information between the action $a$ and previous turns in the history to ensure the generated responses are coherent and appropriate
$r_3 = \displaystyle\frac1{N_a}\text{log }p_{\text{seq2seq}}(a|q_i,p_i) + \frac1{N_{q_i}}\text{log }p_{\text{seq2seq}}^{\text{backward}}(q_i|a)$
$p_{\text{seq2seq}}(a|p_i,q_i)$ - Probability of generating response $a$ given the previous dialogue utterances $[p_i, q_i]$
$p_{\text{seq2seq}}^{\text{backward}}(q_i|a)$ - Backward probability of generating the previous dialogue utterance $q_i$ based on response $a$ - trained in the same way as standard SEQ2SEQ models, with sources and targets swapped
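A minimal sketch of $r_3$ under the same assumed `log_prob` helper; `p_i`, `q_i`, and `action_ids` are token-id lists, and concatenating `p_i + q_i` stands in for the encoded dialogue history.
```python
def semantic_coherence_reward(seq2seq, backward_seq2seq, action_ids, p_i, q_i):
    """r_3: mutual information between the action a and the previous turns [p_i, q_i],
    estimated with a forward and a backward SEQ2SEQ model."""
    source = p_i + q_i                                                    # concatenated history
    forward = seq2seq.log_prob(target=action_ids, source=source)          # log p(a | p_i, q_i)
    backward = backward_seq2seq.log_prob(target=q_i, source=action_ids)   # log p(q_i | a)
    return forward / len(action_ids) + backward / len(q_i)                # length-normalized terms
```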
Final Reward
$r(a,[p_i,q_i]) = \lambda_1r_1+\lambda_2r_2+\lambda_3r_3$
where $\lambda_1+\lambda_2+\lambda_3 = 1$
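Combining the three terms is then a weighted sum; the default weights below follow the paper's reported setting ($\lambda_1 = 0.25$, $\lambda_2 = 0.25$, $\lambda_3 = 0.5$).
```python
def total_reward(r1, r2, r3, lambdas=(0.25, 0.25, 0.5)):
    """r(a, [p_i, q_i]) = lambda_1*r_1 + lambda_2*r_2 + lambda_3*r_3, weights summing to 1."""
    l1, l2, l3 = lambdas
    return l1 * r1 + l2 * r2 + l3 * r3
```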
## Simulation
The state-action space is explored by the two virtual agents taking turns talking with each other
Policy - $p_{\text{RL}}(p_{i+1}|p_i,q_i)$
The RL system is initialized with a general response-generation policy learned in a fully supervised setting
### Supervised Learning
First stage - build on prior work of predicting a generated target sequence given dialogue history using the supervised SEQ2SEQ model
SEQ2SEQ model with attention
### Mutual Information
We do not want to initialize the policy model using the pre-trained SEQ2SEQ models because this would lead to a lack of diversity in the RL models' experiences
Modeling mutual information between sources and targets significantly decreases the chance of generating dull responses
$r_3 = \displaystyle\frac1{N_a}\text{log }p_{\text{seq2seq}}(a|q_i,p_i) + \frac1{N_{q_i}}\text{log }p_{\text{seq2seq}}^{\text{backward}}(q_i|a)$
The second term of this equation requires the target sentence to be completely generated
This is handled by treating it as a reinforcement learning problem in which a reward, the mutual information value, is observed when the model arrives at the end of a sequence
Policy Gradient methods for optimization
Initialize the policy model $p_{\text{RL}}$ using a pre-trained $p_{\text{seq2seq}}(a|p_i,q_i)$ model
Given an input source $[p_i,q_i]$, we generate a candidate list $A = \{\hat a|\hat a \sim p_{\text{RL}}\}$
For each candidate $\hat a$ we obtain the mutual information score $m(\hat a,[p_i,q_i])$ from the pre-trained $p_{\text{seq2seq}}(a|p_i,q_i)$ and $p_{\text{seq2seq}}^{\text{backward}}(q_i|a)$
This mutual information score will be used as a reward and back-propagated to the encoder-decoder model
The expected reward for a sequence is given by
$\qquad J(\theta) = \Bbb E[m(\hat a, [p_i,q_i])]$
The gradient is estimated using the likelihood ratio trick
$\qquad \nabla J(\theta) = m(\hat a, [p_i,q_i])\nabla\text{log }p_{\text{RL}}(\hat a|[p_i,q_i])$
Parameters are updated in the encoder-decoder using stochastic gradient descent
Curriculum Learning strategy - for every sequence of length T, the MLE loss is used for the first L tokens and the reinforcement algorithm for the remaining T - L tokens
L is gradually annealed towards 0
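A minimal sketch of that per-token mixing, assuming the MLE and RL contributions are already available as per-token loss tensors; the slicing index L is the curriculum variable that is annealed toward zero.
```python
def mixed_sequence_loss(mle_loss_per_token, rl_loss_per_token, L):
    """Curriculum mixing: supervised (MLE) loss on the first L tokens,
    REINFORCE-style loss on the remaining T - L tokens of the sequence."""
    return mle_loss_per_token[:L].sum() + rl_loss_per_token[L:].sum()
```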
To decrease the learning variance - an additional neural model takes as input the generated target and the initial source and outputs a baseline value
$\qquad \nabla J(\theta) = \nabla\text{log }p_{\text{RL}}(\hat a|[p_i,q_i])[m(\hat a, [p_i,q_i]) - b]$
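A minimal sketch of this baseline-adjusted update; `log_probs` is assumed to be the per-token log-probabilities of the sampled response under $p_{\text{RL}}$, and the baseline is assumed to come from the separate value network described above.
```python
def policy_gradient_loss(log_probs, reward, baseline):
    """REINFORCE with a baseline: minimizing this loss follows the gradient
    (m(a_hat, [p_i, q_i]) - b) * grad log p_RL(a_hat | [p_i, q_i])."""
    advantage = float(reward) - float(baseline)   # treated as a constant w.r.t. the policy
    return -advantage * log_probs.sum()           # negative sign: the optimizer minimizes
```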
### Dialogue Simulation between Two Agents
- A message from the training set is fed to the first agent
- The agent encodes the input message to a vector representation and starts decoding to generate a response output
- The immediate output from the first agent is combined with the dialogue history and fed to the second agent
- The second agent updates the state by encoding the dialogue history into a representation and uses the decoder RNN to generate responses, as sketched below
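A minimal sketch of this two-agent simulation loop; `agent.respond(history)` is an assumed helper that encodes the recent history and decodes a reply, and the turn limit mirrors the 5-turn cap mentioned in the curriculum section below.
```python
def simulate_dialogue(agent_p, agent_q, opening_message, max_turns=5):
    """Let the two agents take turns replying, starting from a real opening
    message drawn from the training set; returns the simulated turns."""
    history = [opening_message]
    agents = [agent_p, agent_q]
    for turn in range(max_turns):
        speaker = agents[turn % 2]                # agents alternate turns
        response = speaker.respond(history[-2:])  # state: previous (up to) two turns
        history.append(response)
    return history
```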
Optimization
- Initialize the policy model $p_{\text{RL}}$ with parameters from the mutual information model
- The objective to maximize is the expected future reward
$\qquad\displaystyle J_{\text{RL}}(\theta) = \Bbb E_{p_{\text{RL}}(a_{1:T})}\left[\underset{i = 1}{\overset{T}{\sum}}R(a_i,[p_i,q_i])\right]$
R - Reward resulting from action a
- The likelihood ratio trick is used for gradient updates
$\qquad\displaystyle \nabla J_{\text{RL}}(\theta) \approx \underset i\sum\nabla\text{log }p(a_i|p_i,q_i)\underset{i = 1}{\overset{T}{\sum}}R(a_i,[p_i,q_i])$
### Curriculum Learning
- The curriculum learning strategy is applied again - begin by simulating the dialogue for 2 turns and gradually increase the number of simulated turns
- Generate 5 turns at most, as the number of candidates to examine grows exponentially in the size of the candidate list
## Experimental Results
### Automatic Evaluation
Length of the dialogue
- A dialogue ends when one of the agents starts generating dull responses or two consecutive utterances from the same user are highly overlapping
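A minimal sketch of this stopping rule; the word-overlap threshold and the exact dull-response check are illustrative choices, not values taken from the paper.
```python
def dialogue_ended(history, dull_responses, overlap_threshold=0.8):
    """Stop when the latest turn is a dull response, or when two consecutive
    turns from the same agent overlap heavily (agents alternate, so the
    previous turn by the same agent is history[-3])."""
    latest = history[-1]
    if latest in dull_responses:
        return True
    if len(history) >= 3:
        previous = set(history[-3].split())
        current = set(latest.split())
        overlap = len(current & previous) / max(len(current), 1)
        return overlap > overlap_threshold
    return False
```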
Diversity
- Degree of diversity - calculating the number of distinct unigrams and bigrams in generated responses
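A minimal sketch of the distinct-n computation; scaling by the total number of generated tokens is the usual normalization so that longer outputs are not favored.
```python
def distinct_n(responses, n):
    """Diversity: number of distinct n-grams in the generated responses,
    divided by the total number of generated tokens (distinct-1, distinct-2)."""
    ngrams, total_tokens = set(), 0
    for response in responses:
        tokens = response.split()
        total_tokens += len(tokens)
        ngrams.update(zip(*[tokens[i:] for i in range(n)]))
    return len(ngrams) / max(total_tokens, 1)
```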
### Human Evaluation
- Crowdsourced Judges to evaluate a random sample of 500 items