Neural models - short-sighted, predicting utterances one at a time while ignoring their influence on future outcomes
Conversational properties - informativity, coherence and ease of answering
Evaluation - diversity, length, human judges
Introduction
Seq2Seq
Maximizes the probability of generating a response given the previous dialogue turn
Trained by predicting the next dialogue turn in a given conversational context using maximum-likelihood estimation (MLE)
The system becomes stuck in an infinite loop of repetitive responses
Framework should have
Ability to integrate developer-defined rewards that better mimic the true goal of chatbot development
Ability to model the long-term influence of a generated response in an ongoing dialogue
The parameters of an encoder-decoder RNN define a policy over an infinite action space consisting of all possible utterances
Related Work
Efforts to build statistical dialog systems fall into two major categories
Treats dialogue generation as a source-to-target transduction problem and learns mapping rules between input messages and responses from a massive amount of training data
Task-oriented dialogue systems to solve domain-specific tasks
Reinforcement Learning for Open-Domain Dialogue
The learning system consists of two agents
p - sentences generated from the first agent
q - sentences generated from the second agent
Generated sentences - actions taken according to a policy defined by an encoder-decoder RNN language model
Policy Gradient is more appropriate than Q-learning
We can initialize the encoder-decoder RNN using MLE parameters that already produce plausible responses, before changing the objective and tuning towards a policy that maximizes long-term reward
Q-Learning - directly estimates the future expected reward of each action - these Q-values differ from the MLE objective by orders of magnitude, making MLE parameters inappropriate for initialization
Components of the sequential decision process
Action
State
Policy
Reward
Action
It is the dialogue utterance to generate - the action space is infinite, as arbitrary-length sequences can be generated
State
A state is denoted by the previous two dialogue turns $[p_i, q_i]$. The dialogue history is further transformed to a vector representation by feeding the concatenation of $p_i$ and $q_i$ into an LSTM encoder model
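A minimal sketch of this state encoding, assuming PyTorch; the class name and dimensions are illustrative, not from the paper:

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Sketch: encode the concatenation of the previous two turns [p_i, q_i] into a vector."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, p_i, q_i):
        # p_i, q_i: LongTensors of token ids, shape (batch, length)
        turns = torch.cat([p_i, q_i], dim=1)         # concatenate the two turns
        _, (h_n, _) = self.lstm(self.embed(turns))   # final hidden state as the state vector
        return h_n[-1]                               # shape (batch, hidden_dim)
```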
Policy
A policy $p_{RL}(p_{i+1} \mid p_i, q_i)$ takes the form of an LSTM encoder-decoder and is defined by its parameters. A stochastic policy is used; a deterministic policy would result in a discontinuous objective that is difficult to optimize using gradient-based methods
Reward
r denotes the reward obtained for each action
Ease of answering
A turn generated by a machine should be easy to respond to - a forward-looking function capturing the constraints a turn places on the next turn
Negative Log likelihood of responding to that utterance with a dull response
$r_1 = -\frac{1}{N_S}\sum_{s \in S}\frac{1}{N_s}\log p_{seq2seq}(s \mid a)$, where $N_S$ is the cardinality of the dull-response list $S$, $N_s$ is the number of tokens in the dull response $s$, and $p_{seq2seq}$ is the likelihood output by SEQ2SEQ models
A system less likely to generate utterances in the list is thus also less likely to generate other dull responses
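A hedged sketch of this reward; `seq2seq_log_prob(target_tokens, source_tokens)` is a hypothetical helper standing in for the pre-trained SEQ2SEQ likelihood, and the dull-response entries are only examples:

```python
# Sketch of the ease-of-answering reward r1.
DULL_RESPONSES = [
    "i don't know what you're talking about",   # example entries; the paper uses a hand-built list S
    "i don't know",
]

def ease_of_answering_reward(action_tokens, seq2seq_log_prob):
    total = 0.0
    for s in DULL_RESPONSES:
        s_tokens = s.split()
        # per-token log-likelihood of producing the dull response s in reply to action a
        total += seq2seq_log_prob(s_tokens, action_tokens) / len(s_tokens)
    # negative average over the dull-response list: higher reward = harder to answer dully
    return -total / len(DULL_RESPONSES)
```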
Information Flow
Agent should contribute new information at each turn
Penalizing semantic similarity between consecutive turns from the same agent
$r_2 = -\log \cos(h_{p_i}, h_{p_{i+1}}) = -\log \frac{h_{p_i} \cdot h_{p_{i+1}}}{\lVert h_{p_i}\rVert \, \lVert h_{p_{i+1}}\rVert}$, where $h_{p_i}$ and $h_{p_{i+1}}$ are the representations obtained from the encoder for two consecutive turns $p_i$ and $p_{i+1}$
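A minimal NumPy sketch of this penalty; the small epsilon guard is an assumption to keep the log defined:

```python
import numpy as np

def information_flow_reward(h_prev, h_curr, eps=1e-8):
    """r2 sketch: negative log cosine similarity between the encoder representations
    of two consecutive turns from the same agent (h_prev, h_curr are 1-D arrays)."""
    cos = np.dot(h_prev, h_curr) / (np.linalg.norm(h_prev) * np.linalg.norm(h_curr))
    # eps guard (assumption): the log is undefined for non-positive similarities
    return -np.log(max(cos, eps))
```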
Semantic Coherence
A measure of the adequacy of responses, to avoid situations in which the generated replies are highly rewarded but are ungrammatical or not coherent
Mutual information between the action a and previous turns in the history is used to ensure the generated responses are coherent and appropriate
$r_3 = \frac{1}{N_a}\log p_{seq2seq}(a \mid q_i, p_i) + \frac{1}{N_{q_i}}\log p^{backward}_{seq2seq}(q_i \mid a)$, where $p_{seq2seq}(a \mid q_i, p_i)$ is the probability of generating response a given the previous dialogue utterances, and $p^{backward}_{seq2seq}(q_i \mid a)$ is the backward probability of generating the previous dialogue utterance $q_i$ based on response a; the backward model is trained in the same way as standard SEQ2SEQ models with sources and targets swapped
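A sketch of the coherence reward; `forward_log_prob` and `backward_log_prob` are hypothetical wrappers around the pre-trained SEQ2SEQ and backward SEQ2SEQ models:

```python
def semantic_coherence_reward(action_tokens, p_i_tokens, q_i_tokens,
                              forward_log_prob, backward_log_prob):
    """r3 sketch: length-normalized forward and backward log-likelihoods."""
    # forward: probability of the response a given the previous two turns
    fwd = forward_log_prob(action_tokens, p_i_tokens + q_i_tokens) / len(action_tokens)
    # backward: probability of regenerating the previous utterance q_i from a
    bwd = backward_log_prob(q_i_tokens, action_tokens) / len(q_i_tokens)
    return fwd + bwd
```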
Final Reward
$r(a, [p_i, q_i]) = \lambda_1 r_1 + \lambda_2 r_2 + \lambda_3 r_3$, where $\lambda_1 + \lambda_2 + \lambda_3 = 1$
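Combining the three terms is then a weighted sum; the weights below are assumed values (they only need to sum to 1):

```python
def total_reward(r1, r2, r3, weights=(0.25, 0.25, 0.5)):
    """Final reward sketch: weighted combination of the three rewards.
    The weights are assumptions, not prescribed values."""
    l1, l2, l3 = weights
    return l1 * r1 + l2 * r2 + l3 * r3
```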
Simulation
State action space is explored by the two virtual agents taking turns talking with each other
Policy - $p_{RL}(p_{i+1} \mid p_i, q_i)$
Initialize the RL system using a general response generation policy learned in a fully supervised setting
Supervised Learning
First stage - build on prior work of predicting a generated target sequence given dialogue history using the supervised SEQ2SEQ model
SEQ2SEQ model with attention
Mutual Information
We do not want to initialize the policy model directly with the pre-trained SEQ2SEQ model because this will lead to a lack of diversity in the RL model's experiences. Modeling mutual information between sources and targets significantly decreases the chance of generating dull responses
The second term of the mutual-information objective (the backward probability) requires the target sentence to be completely generated. This is treated as a reinforcement learning problem in which a reward given by the mutual information value is observed when the model arrives at the end of a sequence
Policy Gradient methods for optimization
Initialize the policy model $p_{RL}$ using the pre-trained SEQ2SEQ model. Given an input source $[p_i, q_i]$, we generate a candidate list $A = \{\hat{a}_1, \ldots, \hat{a}_n\}$
For each candidate $\hat{a} \in A$ we obtain the mutual information score $m(\hat{a}, [p_i, q_i])$ from the pre-trained $p_{seq2seq}(a \mid p_i, q_i)$ and $p^{backward}_{seq2seq}(q_i \mid a)$
This mutual information score will be used as a reward and back-propagated to the encoder-decoder model
The expected reward for a sequence is given by $J_{RL}(\theta) = \mathbb{E}\big[m(\hat{a}; [p_i, q_i])\big]$
The gradient is estimated using the likelihood ratio trick: $\nabla J_{RL}(\theta) = m(\hat{a}, [p_i, q_i]) \, \nabla \log p_{RL}(\hat{a} \mid [p_i, q_i])$
Parameters are updated in the encoder-decoder using stochastic gradient descent
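A hedged PyTorch-style sketch of this update; `policy.log_prob` and `mutual_info_score` are hypothetical interfaces, not the paper's code:

```python
def mutual_info_policy_step(policy, optimizer, source, candidates, mutual_info_score):
    """One REINFORCE-style step (sketch): the mutual-information score of each sampled
    candidate is used as the reward in the likelihood-ratio gradient estimate."""
    optimizer.zero_grad()
    loss = 0.0
    for a in candidates:
        log_p = policy.log_prob(a, source)        # log p_RL(a | source), differentiable
        reward = mutual_info_score(a, source)     # scalar reward, treated as a constant
        loss = loss - reward * log_p              # likelihood ratio trick: -m * log p
    (loss / len(candidates)).backward()
    optimizer.step()
```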
Curriculum learning strategy - for every sequence of length T we use the MLE loss for the first L tokens and the reinforcement algorithm for the remaining T - L tokens; the value of L is gradually annealed to zero (L -> 0)
To decrease the learning variance - an additional neural model takes as input the generated target and the initial source and outputs a baseline value, which is subtracted from the reward
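A small sketch of the curriculum mix and the baseline subtraction; the per-token loss lists and the baseline value are assumed interfaces:

```python
def mixed_sequence_loss(mle_token_losses, rl_token_losses, L):
    """Curriculum sketch: for a sequence of length T, apply the MLE loss to the first
    L tokens and the reinforcement loss to the remaining T - L tokens; L is annealed
    towards 0 as training progresses."""
    return sum(mle_token_losses[:L]) + sum(rl_token_losses[L:])

def advantage(reward, baseline_value):
    """Variance-reduction sketch: subtract the baseline predicted by an extra neural
    model (from the source and the generated target) from the observed reward."""
    return reward - baseline_value
```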
Dialogue Simulation between Two Agents
A message from the training set is fed to the first agent
The agent encodes the input message to a vector representation and starts decoding to generate a response output
Combining the immediate output from the first agent with the dialogue history
The second agent updates the state by encoding the dialogue history into a representation and uses the decoder RNN to generate responses
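A sketch of the simulation loop; `agent.generate` is a hypothetical method on the policy model:

```python
def simulate_dialogue(agent_a, agent_b, first_message, max_turns=5):
    """Two-agent simulation sketch: the agents alternately encode the dialogue history
    and decode a response, yielding a trajectory of (state, action) pairs."""
    history = [first_message]
    trajectory = []
    agents = (agent_a, agent_b)
    for turn in range(max_turns):
        agent = agents[turn % 2]
        state = history[-2:]                  # the previous two turns [p_i, q_i]
        response = agent.generate(state)      # hypothetical decode step
        trajectory.append((state, response))
        history.append(response)
    return trajectory
```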
Optimization
Initialize the policy model with parameters from the mutual information model
The objective to maximize is the expected future reward: $J_{RL}(\theta) = \mathbb{E}\Big[\sum_{i=1}^{T} R(a_i, [p_i, q_i])\Big]$, where $R(a_i, [p_i, q_i])$ is the reward resulting from action $a_i$
The likelihood ratio trick is used for gradient updates
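A sketch of one update over a simulated dialogue, again using the likelihood-ratio trick; `reward_fn` stands for the combined reward defined above:

```python
def dialogue_policy_step(policy, optimizer, trajectory, reward_fn):
    """Sketch: approximate the gradient of the expected future reward by summing
    reward-weighted log-probabilities of each simulated action."""
    optimizer.zero_grad()
    loss = 0.0
    for state, action in trajectory:
        loss = loss - reward_fn(action, state) * policy.log_prob(action, state)
    loss.backward()
    optimizer.step()
```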
Curriculum Learning
The CL strategy is applied again - begin by simulating the dialogue for 2 turns and gradually increase the number of simulated turns
We generate 5 turns at most, as the number of candidates to examine grows exponentially in the size of the candidate list
Experimental Results
Automatic Evaluation
Length of the dialogue
A dialogue ends when one of the agents starts generating dull responses or two consecutive utterances from the same user are highly overlapping
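A sketch of this stopping rule; the dull-response set and the 80% word-overlap threshold are assumptions:

```python
def dialogue_ended(utterances, dull_set, overlap_threshold=0.8):
    """Stopping-rule sketch for measuring dialogue length: end the dialogue when an
    agent emits a dull response or two consecutive turns from the same agent overlap
    heavily (the overlap threshold is an assumed value)."""
    last = utterances[-1]
    if last in dull_set:
        return True
    if len(utterances) >= 3:
        prev_same_agent = set(utterances[-3].split())   # previous turn by the same agent
        curr = set(last.split())
        overlap = len(prev_same_agent & curr) / max(len(curr), 1)
        return overlap > overlap_threshold
    return False
```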
Diversity
Degree of diversity - calculated as the number of distinct unigrams and bigrams in generated responses, scaled by the total number of generated tokens
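A minimal sketch of the distinct-1 / distinct-2 computation, scaling by the total number of generated tokens:

```python
def distinct_n(responses, n):
    """Diversity sketch: number of distinct n-grams across the generated responses,
    scaled by the total number of generated tokens."""
    ngrams, total_tokens = set(), 0
    for r in responses:
        tokens = r.split()
        total_tokens += len(tokens)
        ngrams.update(zip(*[tokens[i:] for i in range(n)]))
    return len(ngrams) / max(total_tokens, 1)

# usage (hypothetical): distinct_n(generated, 1), distinct_n(generated, 2)
```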
Human Evaluation
Crowdsourced judges were used to evaluate a random sample of 500 items