--- title : "Deep Reinforcement Learning for Dialogue Generation" tags : "IvLabs, RL" --- # Deep Reinforcement Learning for Dialogue Generation Link to the [Research Paper](https://arxiv.org/abs/1606.01541) {%pdfhttps://arxiv.org/pdf/1606.01541.pdf%} ## Abstract Neural models - short sighted, predicting utterances one at a time while ignoring their influence on future outcomes Conversational Porperties - informativity, coherence and ease of answering Evaluation - Diversity, length, Human judges ## Introduction Seq2Seq - Maximizes the probabiltity of generating a response given the previous dialogue turn - Trained by predicting dialogue turn in a given conversational context using the maximum-likelihood estimation - System becomes stuck in an infinite loop of repeatitive responses Framework should have - Ability to integrate developer-defined rewards that better mimic the true goal of chatbot development - Ability to model the long-term influence of a generated response in an ongoing dialogue The parameters of an encoder-decoder RNN define a policy over an infinite action space consisting of all possible utterances ## Related Work Efforts to build statistical dialog systems fall into two major categories - Dialogue generation as a source-to-target transduction problem and learns mapping rules between input messages and responses from a massive amount of training data - Task-oriented dialogue systems to solve domain-specific tasks ## Reinforcement Learning for Open-Domain Dialogue The learning system consists of two agents - p - sentences generated from the first agent - q - sentences generated from the second agent Generated sentences - actions taken according to a policy defined by an encoder-decoder RNN language model Policy Gradient is more appropriate than Q-learning - We can initialize the encoder-decoder RNN using MLE parameters that already produce plausible responses, before changing the objective and tuning towards a policy that maximizes long-term reward - Q-Learning - directly estimates the future expected reward of each action - differ from the MLE objective by orders of magnitude - making MLE parameters inappropriate for initialization Components of Sequential decision process - Action - State - Policy - Reward ### Action It is the dialogue utterance to generate - infinite action space - arbitrary-length sequences can be generated ### State A state is denoted by the previous two dialogue turns $[p_i,q_i]$ The dialogue history is further transformed to a vector representation by feeding the concatenation of $p_i$ and $q_i$ into an LSTM encoder model ### Policy A policy takes the form of an LSTM encoder-decoder and is defined by its parameters Stochastic Policy Deterministic Policy - would result in a discontinous objective that is difficult to optimize using gradient-based methods ### Reward r denotes the reward obtained for each action #### Ease of answering - A turn generated by a machine should be easy to respond to - forward-looking function - the constraints a turn places on the next turn - Negative Log likelihood of responding to that utterance with a dull response $r_1 = \displaystyle -\frac 1{N_\Bbb S}\underset{s\in\Bbb S}{\sum}\frac{1}{N_s}\text{log }p_{\text{seq2seq2}}(s|a)$ $N_\Bbb S$ - Cardinality of $N_\Bbb S$ N~s~ - number of tokens in the dull response s $p_{\text{seq2seq2}}$ - Likelihood output by SEQ2SEQ models A system less likely to generate utterances in the list is thus also less likely to generate other dull responses #### Information Flow - Agent should 
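A minimal sketch of how $r_1$ could be computed is shown below. It assumes a hypothetical helper `seq2seq_token_log_probs(target, source)` that returns the per-token log-likelihoods from a pretrained SEQ2SEQ model, and the dull responses listed are illustrative placeholders rather than the paper's exact list $\Bbb S$.

```python
# Illustrative placeholders for the dull-response list S (not the paper's exact list).
DULL_RESPONSES = [
    "i don't know what you are talking about",
    "i don't know",
    "i have no idea",
]

def ease_of_answering_reward(action, seq2seq_token_log_probs):
    """r_1 = -(1/|S|) * sum_{s in S} (1/N_s) * log p_seq2seq(s | a).

    `seq2seq_token_log_probs(target, source)` is a hypothetical helper returning a
    list of per-token log-likelihoods from a pretrained SEQ2SEQ model.
    """
    total = 0.0
    for s in DULL_RESPONSES:
        log_probs = seq2seq_token_log_probs(target=s, source=action)
        total += sum(log_probs) / len(log_probs)  # length-normalized log-likelihood of s given a
    return -total / len(DULL_RESPONSES)
```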
#### Information Flow

- The agent should contribute new information at each turn
- Penalize semantic similarity between consecutive turns from the same agent

$\displaystyle r_2 = -\log\cos(h_{p_i}, h_{p_{i+1}}) = -\log\frac{h_{p_i} \cdot h_{p_{i+1}}}{||h_{p_i}||\,||h_{p_{i+1}}||}$

- $h_{p_i}$, $h_{p_{i+1}}$ - representations obtained from the encoder for two consecutive turns $p_i$ and $p_{i+1}$

#### Semantic Coherence

- A measure of the adequacy of responses, to avoid situations in which the generated replies are highly rewarded but are ungrammatical or incoherent
- The mutual information between the action $a$ and the previous turns in the history ensures the generated responses are coherent and appropriate

$r_3 = \displaystyle\frac1{N_a}\log p_{\text{seq2seq}}(a|q_i,p_i) + \frac1{N_{q_i}}\log p_{\text{seq2seq}}^{\text{backward}}(q_i|a)$

- $p_{\text{seq2seq}}(a|p_i,q_i)$ - probability of generating response $a$ given the previous dialogue utterances $[p_i, q_i]$
- $p_{\text{seq2seq}}^{\text{backward}}(q_i|a)$ - the backward probability of generating the previous dialogue utterance $q_i$ given response $a$; it is trained in the same way as a standard SEQ2SEQ model with sources and targets swapped

Final reward

$r(a,[p_i,q_i]) = \lambda_1r_1+\lambda_2r_2+\lambda_3r_3$, where $\lambda_1+\lambda_2+\lambda_3 = 1$

## Simulation

The state-action space is explored by the two virtual agents taking turns talking with each other

Policy - $p_{\text{RL}}(p_{i+1}|p_i,q_i)$

The RL system is initialized with a general response generation policy learned in a fully supervised setting

### Supervised Learning

- First stage - build on prior work of predicting a generated target sequence given the dialogue history using a supervised SEQ2SEQ model
- SEQ2SEQ model with attention

### Mutual Information

We do not want to initialize the policy model using the pre-trained SEQ2SEQ model, because this will lead to a lack of diversity in the RL model's experiences

Modeling mutual information between sources and targets significantly decreases the chance of generating dull responses

$r_3 = \displaystyle\frac1{N_a}\log p_{\text{seq2seq}}(a|q_i,p_i) + \frac1{N_{q_i}}\log p_{\text{seq2seq}}^{\text{backward}}(q_i|a)$

The second term of this equation requires the target sentence to be completely generated. This is treated as a reinforcement learning problem in which a reward of the mutual information value is observed when the model arrives at the end of a sequence, and policy gradient methods are used for optimization

- Initialize the policy model $p_{\text{RL}}$ using a pre-trained $p_{\text{seq2seq}}(a|p_i,q_i)$ model
- Given an input source $[p_i,q_i]$, generate a candidate list $A = \{\hat a \mid \hat a \sim p_{\text{RL}}\}$
- For each candidate $\hat a$, obtain the mutual information score $m(\hat a,[p_i,q_i])$ from the pre-trained $p_{\text{seq2seq}}(a|p_i,q_i)$ and $p_{\text{seq2seq}}^{\text{backward}}(q_i|a)$
- This mutual information score is used as a reward and back-propagated to the encoder-decoder model

The expected reward for a sequence is given by

$\qquad J(\theta) = \Bbb E[m(\hat a, [p_i,q_i])]$

The gradient is estimated using the likelihood ratio trick

$\qquad \nabla J(\theta) = m(\hat a, [p_i,q_i])\nabla\log p_{\text{RL}}(\hat a|[p_i,q_i])$

The encoder-decoder parameters are updated using stochastic gradient descent

Curriculum learning strategy - for every sequence of length $T$, the MLE loss is used for the first $L$ tokens and the reinforcement algorithm for the remaining $T - L$ tokens, with $L$ gradually annealed to zero ($L \to 0$)

To decrease the learning variance, an additional neural model takes as input the generated target and the initial source and outputs a baseline value $b$

$\qquad \nabla J(\theta) = \nabla\log p_{\text{RL}}(\hat a|[p_i,q_i])[m(\hat a, [p_i,q_i]) - b]$
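A minimal sketch of this REINFORCE-style update with the learned baseline, written with PyTorch, is given below. The callables `policy_sample`, `mutual_info_score` and `baseline_net` are hypothetical placeholders for the policy sampler, the reward computed from the forward and backward SEQ2SEQ models, and the baseline network; the squared-error term used to fit the baseline is an assumption rather than a detail taken from the paper.

```python
import torch

def reinforce_step(policy_sample, mutual_info_score, baseline_net, optimizer, p_i, q_i):
    """One stochastic-gradient update of p_RL on a single source [p_i, q_i].

    Hypothetical placeholders:
      policy_sample(source)          -> (a_hat, log_prob), log_prob a torch scalar with gradients
      mutual_info_score(a, p_i, q_i) -> float reward m(a, [p_i, q_i]) from the pretrained models
      baseline_net(source, a)        -> torch scalar baseline b
    """
    source = p_i + " " + q_i                        # state: previous two dialogue turns
    a_hat, log_prob = policy_sample(source)         # sample a candidate response from p_RL
    m = mutual_info_score(a_hat, p_i, q_i)          # mutual-information reward (no gradient)
    b = baseline_net(source, a_hat)                 # learned baseline to reduce variance

    policy_loss = -(m - b.detach()) * log_prob      # likelihood-ratio (REINFORCE) estimator
    baseline_loss = (b - m) ** 2                    # assumption: regress the baseline toward the reward
    optimizer.zero_grad()
    (policy_loss + baseline_loss).backward()
    optimizer.step()
```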
### Dialogue Simulation between Two Agents

- A message from the training set is fed to the first agent
- The agent encodes the input message into a vector representation and starts decoding to generate a response
- The immediate output from the first agent is combined with the dialogue history
- The second agent updates the state by encoding the dialogue history into a representation and uses its decoder RNN to generate a response

Optimization
- Initialize the policy model $p_{\text{RL}}$ with parameters from the mutual information model
- The objective to maximize is the expected future reward

$\qquad\displaystyle J_{\text{RL}}(\theta) = \Bbb E_{p_{\text{RL}}(a_{1:T})}\left[\underset{i = 1}{\overset{i = T}{\sum}}R(a_i,[p_i,q_i])\right]$

$R(a_i,[p_i,q_i])$ - reward resulting from action $a_i$

- The likelihood ratio trick is used for gradient updates

$\qquad\displaystyle \nabla J_{\text{RL}}(\theta) = \underset i\sum\nabla\log p(a_i|p_i,q_i)\underset{i = 1}{\overset{i = T}{\sum}}R(a_i,[p_i,q_i])$

### Curriculum Learning

- The CL strategy is applied again - begin by simulating the dialogue for 2 turns and gradually increase the number of simulated turns
- At most 5 turns are generated, as the number of candidates to examine grows exponentially with the size of the candidate list

## Experimental Results

### Automatic Evaluation

Length of the dialogue
- A dialogue ends when one of the agents starts generating dull responses or when two consecutive utterances from the same user are highly overlapping

Diversity
- The degree of diversity is measured by calculating the number of distinct unigrams and bigrams in the generated responses (a small sketch follows after the Human Evaluation note)

### Human Evaluation

- Crowdsourced judges evaluate a random sample of 500 items
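As a small illustration of the diversity metric mentioned above, the distinct-n statistic can be computed as below. Normalizing by the total number of generated tokens is a common convention and an assumption here; the summary above only mentions counting distinct unigrams and bigrams.

```python
def distinct_n(responses, n):
    """Fraction of distinct n-grams among all n-grams in the generated responses."""
    total, distinct = 0, set()
    for response in responses:
        tokens = response.split()
        ngrams = list(zip(*[tokens[i:] for i in range(n)]))  # all n-grams of this response
        total += len(ngrams)
        distinct.update(ngrams)
    return len(distinct) / total if total else 0.0

# Toy usage: distinct-1 and distinct-2 over a handful of generated responses.
responses = ["i don't know", "how old are you", "i don't know"]
print(distinct_n(responses, 1), distinct_n(responses, 2))
```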