
Deep Reinforcement Learning for Dialogue Generation

Link to the Research Paper

Abstract

Neural models - short-sighted, predicting utterances one at a time while ignoring their influence on future outcomes
Conversational properties - informativity, coherence, and ease of answering
Evaluation - Diversity, length, Human judges

Introduction

Seq2Seq

  • Maximizes the probability of generating a response given the previous dialogue turn
  • Trained by predicting the next dialogue turn in a given conversational context using maximum-likelihood estimation
  • The system tends to get stuck in an infinite loop of repetitive responses

Framework should have

  • Ability to integrate developer-defined rewards that better mimic the true goal of chatbot development
  • Ability to model the long-term influence of a generated response in an ongoing dialogue

The parameters of an encoder-decoder RNN define a policy over an infinite action space consisting of all possible utterances

Efforts to build statistical dialog systems fall into two major categories

  • Treating dialogue generation as a source-to-target transduction problem, learning mapping rules between input messages and responses from a massive amount of training data
  • Task-oriented dialogue systems to solve domain-specific tasks

Reinforcement Learning for Open-Domain Dialogue

The learning system consists of two agents

  • p - sentences generated from the first agent
  • q - sentences generated from the second agent

Generated sentences - actions taken according to a policy defined by an encoder-decoder RNN language model

Policy Gradient is more appropriate than Q-learning

  • We can initialize the encoder-decoder RNN using MLE parameters that already produce plausible responses, before changing the objective and tuning towards a policy that maximizes long-term reward
  • Q-Learning - directly estimates the future expected reward of each action - differ from the MLE objective by orders of magnitude - making MLE parameters inappropriate for initialization

Components of Sequential decision process

  • Action
  • State
  • Policy
  • Reward

Action

It is the dialogue utterance to generate - infinite action space - arbitrary-length sequences can be generated

State

A state is denoted by the previous two dialogue turns $[p_i, q_i]$
The dialogue history is further transformed into a vector representation by feeding the concatenation of $p_i$ and $q_i$ into an LSTM encoder model
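
A minimal PyTorch sketch of this state encoding, assuming hypothetical token-id tensors for $p_i$ and $q_i$ and illustrative dimensions (not the paper's actual architecture):

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Encodes the state [p_i, q_i] (previous two dialogue turns) into a vector."""
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, p_i, q_i):
        # Concatenate the two turns along the time axis, then run the LSTM encoder.
        turns = torch.cat([p_i, q_i], dim=1)      # (batch, len_p + len_q)
        embedded = self.embed(turns)              # (batch, len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)
        return h_n[-1]                            # final hidden state used as the state vector

# Usage with dummy token ids
p_i = torch.randint(0, 10000, (1, 8))
q_i = torch.randint(0, 10000, (1, 6))
state = StateEncoder()(p_i, q_i)                  # shape: (1, 512)
```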

Policy

A policy takes the form of an LSTM encoder-decoder and is defined by its parameters
Stochastic policy - used here; a deterministic policy would result in a discontinuous objective that is difficult to optimize using gradient-based methods

Reward

r denotes the reward obtained for each action

Ease of answering

  • A turn generated by a machine should be easy to respond to - forward-looking function - the constraints a turn places on the next turn
  • Negative Log likelihood of responding to that utterance with a dull response

$$r_1 = -\frac{1}{N_S}\sum_{s \in S}\frac{1}{N_s}\log p_{\text{seq2seq}}(s \mid a)$$

$N_S$ - cardinality of the dull-response list $S$
$N_s$ - number of tokens in the dull response $s$
$p_{\text{seq2seq}}$ - likelihood output by SEQ2SEQ models

A system less likely to generate utterances in the list is thus also less likely to generate other dull responses
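
A minimal sketch of how $r_1$ could be computed, assuming a hypothetical `seq2seq_log_prob(target, source)` helper that returns the log-likelihood from a pre-trained SEQ2SEQ model; the dull-response list shown is only an abbreviated illustration:

```python
DULL_RESPONSES = [
    "i don't know what you're talking about",
    "i don't know",
    "i have no idea",
]  # abbreviated illustration of the hand-built dull-response list S

def ease_of_answering_reward(action, seq2seq_log_prob):
    """r1 = -(1/N_S) * sum over s in S of (1/N_s) * log p_seq2seq(s | a)."""
    total = 0.0
    for s in DULL_RESPONSES:
        n_s = len(s.split())                                    # N_s: tokens in dull response s
        total += seq2seq_log_prob(target=s, source=action) / n_s
    return -total / len(DULL_RESPONSES)                         # N_S: size of the dull-response list
```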

Information Flow

  • Agent should contribute new information at each turn
  • Penalizing semantic similarity between consecutive turns from the same agent

$$r_2 = -\log \cos(h_{p_i}, h_{p_{i+1}}) = -\log \frac{h_{p_i} \cdot h_{p_{i+1}}}{\lVert h_{p_i} \rVert \, \lVert h_{p_{i+1}} \rVert}$$

$h_{p_i}$, $h_{p_{i+1}}$ - representations obtained from the encoder for two consecutive turns $p_i$ and $p_{i+1}$
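
A sketch of this information-flow penalty, assuming the encoder representations are available as NumPy vectors; the clamp before the log is an added numerical guard, not from the paper:

```python
import numpy as np

def information_flow_reward(h_p_i, h_p_next, eps=1e-8):
    """r2 = -log cos(h_{p_i}, h_{p_{i+1}}): penalize semantic similarity
    between two consecutive turns produced by the same agent."""
    cos_sim = float(np.dot(h_p_i, h_p_next) /
                    (np.linalg.norm(h_p_i) * np.linalg.norm(h_p_next)))
    cos_sim = max(cos_sim, eps)   # guard against non-positive similarity before taking the log
    return -np.log(cos_sim)       # less similar consecutive turns -> larger reward
```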

Semantic Coherence

  • Measure of the adequacy of responses, to avoid situations in which the generated replies are highly rewarded but are ungrammatical or not coherent
  • Mutual information between the action a and previous turns in the history is used to ensure the generated responses are coherent and appropriate

$$r_3 = \frac{1}{N_a}\log p_{\text{seq2seq}}(a \mid q_i, p_i) + \frac{1}{N_{q_i}}\log p^{\text{backward}}_{\text{seq2seq}}(q_i \mid a)$$

$p_{\text{seq2seq}}(a \mid p_i, q_i)$ - probability of generating response $a$ given the previous dialogue utterances $[p_i, q_i]$

$p^{\text{backward}}_{\text{seq2seq}}(q_i \mid a)$ - the backward probability of generating the previous dialogue utterance $q_i$ based on response $a$; it is trained in the same way as a standard SEQ2SEQ model with sources and targets swapped
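
A sketch of $r_3$, assuming hypothetical `forward_log_prob` and `backward_log_prob` helpers wrapping the forward and backward SEQ2SEQ models:

```python
def semantic_coherence_reward(action, p_i, q_i, forward_log_prob, backward_log_prob):
    """r3 = (1/N_a) log p_seq2seq(a | q_i, p_i) + (1/N_{q_i}) log p_backward(q_i | a),
    with both terms length-normalized so short sequences are not favored."""
    n_a = len(action.split())        # N_a: tokens in the response a
    n_q = len(q_i.split())           # N_{q_i}: tokens in the previous turn q_i
    forward = forward_log_prob(target=action, source=(p_i, q_i)) / n_a
    backward = backward_log_prob(target=q_i, source=action) / n_q
    return forward + backward
```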

Final Reward

$$r(a, [p_i, q_i]) = \lambda_1 r_1 + \lambda_2 r_2 + \lambda_3 r_3$$

where

$$\lambda_1 + \lambda_2 + \lambda_3 = 1$$
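
The weighted combination is a one-liner; the weights below are illustrative placeholders that satisfy the constraint, not values prescribed by these notes:

```python
def total_reward(r1, r2, r3, lambdas=(0.25, 0.25, 0.5)):
    """r(a, [p_i, q_i]) = λ1*r1 + λ2*r2 + λ3*r3, with the weights summing to 1."""
    assert abs(sum(lambdas) - 1.0) < 1e-6
    return lambdas[0] * r1 + lambdas[1] * r2 + lambdas[2] * r3
```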

Simulation

The state-action space is explored by the two virtual agents taking turns talking with each other

Policy - $p_{RL}(p_{i+1} \mid p_i, q_i)$

The RL system is initialized using a general response-generation policy learned in a fully supervised setting

Supervised Learning

First stage - build on prior work of predicting a generated target sequence given dialogue history using the supervised SEQ2SEQ model

SEQ2SEQ model with attention

Mutual Information

We do not want to initialize the policy model using the pre-trained SEQ2SEQ model because this will lead to a lack of diversity in the RL model's experiences
Modeling mutual information between sources and targets significantly decreases the chance of generating dull responses

$$r_3 = \frac{1}{N_a}\log p_{\text{seq2seq}}(a \mid q_i, p_i) + \frac{1}{N_{q_i}}\log p^{\text{backward}}_{\text{seq2seq}}(q_i \mid a)$$

The second term of this equation requires the target sentence to be completely generated before it can be computed
This is handled by treating generation as a reinforcement learning problem in which a reward equal to the mutual information value is observed once the model reaches the end of a sequence

Policy Gradient methods for optimization

Initialize the policy model $p_{RL}$ using a pre-trained $p_{\text{seq2seq}}(a \mid p_i, q_i)$ model

Given an input source $[p_i, q_i]$, we generate a candidate list $A = \{\hat{a} \mid \hat{a} \sim p_{RL}\}$

For each candidate $\hat{a}$ we obtain the mutual information score $m(\hat{a}, [p_i, q_i])$ from the pre-trained $p_{\text{seq2seq}}(a \mid p_i, q_i)$ and $p^{\text{backward}}_{\text{seq2seq}}(q_i \mid \hat{a})$

This mutual information score will be used as a reward and back-propagated to the encoder-decoder model
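
A sketch of the candidate-scoring loop, assuming a hypothetical `policy_sample` that draws $\hat{a} \sim p_{RL}$ and reusing the length-normalized forward/backward scoring sketched earlier as the mutual-information score (the candidate count is illustrative):

```python
def score_candidates(p_i, q_i, policy_sample, forward_log_prob,
                     backward_log_prob, num_candidates=16):
    """Sample candidate responses â ~ p_RL and attach the mutual-information
    score m(â, [p_i, q_i]) that is later back-propagated as a reward."""
    scored = []
    for _ in range(num_candidates):
        a_hat = policy_sample(p_i, q_i)                     # â ~ p_RL
        m = semantic_coherence_reward(a_hat, p_i, q_i,
                                      forward_log_prob, backward_log_prob)
        scored.append((a_hat, m))
    return scored
```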

The expected reward for a sequence is given by

$$J(\theta) = \mathbb{E}\left[m(\hat{a}, [p_i, q_i])\right]$$

The gradient is estimated using the likelihood ratio trick

$$\nabla J(\theta) = m(\hat{a}, [p_i, q_i]) \, \nabla \log p_{RL}(\hat{a} \mid [p_i, q_i])$$

Parameters are updated in the encoder-decoder using stochastic gradient descent

Curriculum learning strategy - for every sequence of length T we use the MLE loss for the first L tokens and the reinforcement algorithm for the remaining T - L tokens
L is gradually annealed to zero (L → 0) as training proceeds
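
A sketch of the token-level mixing implied by this curriculum, assuming per-token MLE and RL loss terms are already available:

```python
def mixed_sequence_loss(mle_token_losses, rl_token_losses, L):
    """For a sequence of length T, use the MLE loss for the first L tokens and the
    reinforcement (policy-gradient) loss for the remaining T - L tokens.
    L is annealed toward 0 as training proceeds."""
    T = len(mle_token_losses)
    assert len(rl_token_losses) == T and 0 <= L <= T
    return sum(mle_token_losses[:L]) + sum(rl_token_losses[L:])
```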

To decrease the learning variance, an additional neural model takes as input the generated target and the initial source and outputs a baseline value b

$$\nabla J(\theta) = \nabla \log p_{RL}(\hat{a} \mid [p_i, q_i]) \, \big[ m(\hat{a}, [p_i, q_i]) - b \big]$$
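
A PyTorch-style sketch of one update with this baseline-subtracted likelihood-ratio estimator, assuming `log_prob_a_hat` is the differentiable sum of token log-probabilities of $\hat{a}$ under $p_{RL}$ and that the reward and baseline are plain scalars:

```python
def policy_gradient_step(log_prob_a_hat, reward, baseline, optimizer):
    """One update: grad J(θ) = grad log p_RL(â | [p_i, q_i]) * (m(â, [p_i, q_i]) - b)."""
    advantage = reward - baseline              # both treated as constants w.r.t. the policy
    loss = -log_prob_a_hat * advantage         # negate: optimizers minimize, we maximize reward
    optimizer.zero_grad()
    loss.backward()                            # gradients flow into the encoder-decoder parameters
    optimizer.step()
```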

Dialogue Simulation between Two Agents

  • A message from the training set is fed to the first agent
  • The agent encodes the input message to a vector representation and starts decoding to generate a response output
  • The immediate output from the first agent is combined with the dialogue history
  • The second agent updates the state by encoding the dialogue history into a representation and uses the decoder RNN to generate responses
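
A sketch of this turn-taking loop, assuming a hypothetical `respond` method on each agent that encodes the given history and decodes a reply:

```python
def simulate_dialogue(seed_message, agent_p, agent_q, max_turns=5):
    """The two agents take turns talking: each turn is generated from the current
    state (the previous two turns) and then appended to the dialogue history."""
    history = [seed_message]                      # a message sampled from the training set
    agents = [agent_p, agent_q]
    for turn in range(max_turns):
        agent = agents[turn % 2]                  # alternate between the two agents
        response = agent.respond(history[-2:])    # encode the state, decode a reply
        history.append(response)
    return history
```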

Optimization

  • Initialize the policy model $p_{RL}$ with parameters from the mutual information model
  • The objective to maximize is the expected future reward
    $$J_{RL}(\theta) = \mathbb{E}_{p_{RL}(a_{1:T})}\left[\sum_{i=1}^{T} R(a_i, [p_i, q_i])\right]$$

    R - reward resulting from action $a_i$
  • The likelihood ratio trick is used for gradient updates
    $$\nabla J_{RL}(\theta) \approx \sum_i \nabla \log p(a_i \mid p_i, q_i) \sum_{i=1}^{T} R(a_i, [p_i, q_i])$$

Curriculum Learning

  • The CL strategy is applied again - begin by simulating the dialogue for 2 turns and gradually increase the number of simulated turns
  • At most 5 turns are generated, as the number of candidates to examine grows exponentially in the size of the candidate list

Experimental Results

Automatic Evaluation

Length of the dialogue

  • A dialogue ends when one of the agents starts generating dull responses or two consecutive utterances from the same user are highly overlapping
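
A sketch of such a stopping check; the word-overlap criterion and the threshold here are illustrative assumptions, since the exact overlap test is not reproduced in these notes:

```python
def dialogue_should_end(history, dull_responses, overlap_threshold=0.8):
    """Stop simulating when an agent produces a dull response, or when two
    consecutive utterances from the same agent overlap heavily."""
    last = history[-1]
    if last in dull_responses:
        return True
    if len(history) >= 3:                         # history[-3] is the same agent's previous turn
        prev_tokens = set(history[-3].split())
        last_tokens = set(last.split())
        overlap = len(prev_tokens & last_tokens) / max(len(last_tokens), 1)
        if overlap > overlap_threshold:
            return True
    return False
```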

Diversity

  • Degree of diversity - calculating the number of distinct unigrams and bigrams in generated responses
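
A sketch of the distinct-n metric (distinct unigrams/bigrams normalized by the total number of n-grams generated):

```python
def distinct_n(responses, n):
    """Number of distinct n-grams in the generated responses, divided by the
    total number of n-grams generated."""
    ngrams = []
    for response in responses:
        tokens = response.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

# distinct-1 (unigrams) and distinct-2 (bigrams):
# diversity = distinct_n(generated_responses, 1), distinct_n(generated_responses, 2)
```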

Human Evaluation

  • Crowdsourced Judges to evaluate a random sample of 500 items