# [Learning to Copy Coherent Knowledge for Response Generation (AAAI-21)](https://ojs.aaai.org/index.php/AAAI/article/view/17486)
# Outline
- Introduction
- Approach
- Experiments
- Conclusion
# Introduction
Previous works fall into the paradigm of non-goal-oriented knowledge-driven dialog.
They are **prone to ignore the effect of the dialog goal**, which has a potential impact on knowledge exploitation and response generation.
The Goal-Oriented Knowledge Copy network (GOKC) introduces a goal-oriented knowledge discernment mechanism that helps the model discern the knowledge facts relevant to the dialog goal.
A context manager is devised to copy facts:
- from the **discerned knowledge**
- from the **dialog goal** and the **dialog context**
which allows the model to accurately restate the facts in the generated response.

Contribution:
1. The model can generate responses with more **knowledge-coherent** facts.
2. Not only maintains the **accuracy** of knowledge discernment but also alleviates the problem noted above (ignoring the dialog goal).
3. More **coherent** and **fluent** responses.
4. Results on both **human evaluation and automatic evaluation** show that their model **outperforms** several competitive baseline models.
---------------
# Main figure

---------------
# Approach
## Problem Formalization
$D = \left\{\left(U_{i}, K_{i}, G_{i}, Y_{i}\right)\right\}_{i=1}^{N}$, where each sample consists of a dialog history $U_i$, a related knowledge set $K_i$, a dialog goal $G_i$, and a response $Y_i$.
The goal is to **learn a response generation model $P(Y \mid U, K, G)$ from $D$** such that, given a new dialog history $U$ paired with a related knowledge set $K$, the model can generate an appropriate response $Y$ that achieves the given dialog goal $G$.
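For concreteness, one training sample in $D$ could be represented as below; the field names are illustrative assumptions, not the authors' data schema.
```python
from dataclasses import dataclass
from typing import List

@dataclass
class DialogSample:
    history: List[str]    # dialog history U (previous utterances)
    knowledge: List[str]  # knowledge facts K = {k_1, ..., k_{N_K}}
    goal: str             # dialog goal G
    response: str         # gold response Y
```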
## Encoder
The encoders for the dialog context, knowledge, and goal are all built upon a bi-directional recurrent neural network (Bi-RNN) with gated recurrent units (GRU).
$h_{t}=\left[h_{t}^{f w} ; h_{t}^{b w}\right]=\left[\overrightarrow{\operatorname{GRU}}\left(x_{t}, h_{t-1}^{f w}\right) ; \overleftarrow{GRU}\left(x_{t}, h_{t-1}^{b w}\right)\right]$
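A minimal PyTorch sketch of such an encoder; the hyperparameter values are illustrative assumptions.
```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Bi-RNN encoder with GRU cells; h_t = [h_t^fw ; h_t^bw]."""
    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                          bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> outputs: (batch, seq_len, 2 * hidden_dim)
        outputs, _ = self.gru(self.embedding(token_ids))
        return outputs
```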
## Knowledge Discernment
The prior knowledge discernment:
$$
P\left(k_{j} \mid U, G\right)=\frac{\exp \left(s_{K, j} \cdot d_{\text {prior }}\right)}{\sum_{i=1}^{N_{K}} \exp \left(s_{K, i} \cdot d_{\text {prior }}\right)}
$$
where $d_{\text {prior }}=\tanh \left(\beta \odot s_U+(1-\beta) \odot s_G\right)$, with $\beta$ gating between the context representation $s_U$ and the goal representation $s_G$.
The posterior knowledge discernment:
$$
P\left(k_{i} \mid Y\right)=\frac{\exp \left(s_{K, i} \cdot d_{\text {post }}\right)}{\sum_{j=1}^{N_{K}} \exp \left(s_{K, j} \cdot d_{\text {post }}\right)}
$$
where $d_{\text {post }}=\tanh \left(W_{\text {post }} s_Y\right)$
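A minimal sketch of both discernment steps, assuming `s_U`, `s_G`, `s_Y` are sentence vectors of shape `(batch, dim)`, `s_K` stacks the $N_K$ fact vectors as `(batch, N_K, dim)`, and `beta` / `W_post` stand in for the paper's learned parameters.
```python
import torch
import torch.nn.functional as F

def prior_distribution(s_K, s_U, s_G, beta):
    # d_prior = tanh(beta ⊙ s_U + (1 - beta) ⊙ s_G)
    d_prior = torch.tanh(beta * s_U + (1.0 - beta) * s_G)        # (batch, dim)
    scores = torch.bmm(s_K, d_prior.unsqueeze(-1)).squeeze(-1)   # (batch, N_K)
    return F.softmax(scores, dim=-1)                             # P(k_i | U, G)

def posterior_distribution(s_K, s_Y, W_post):
    # d_post = tanh(W_post s_Y)
    d_post = torch.tanh(s_Y @ W_post.T)                          # (batch, dim)
    scores = torch.bmm(s_K, d_post.unsqueeze(-1)).squeeze(-1)    # (batch, N_K)
    return F.softmax(scores, dim=-1)                             # P(k_i | Y)
```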
## KLDiv Loss
The KL divergence loss pushes the prior distribution toward the posterior distribution:
$$
L_{K L}(\theta)=\sum_{i=1}^{N_{K}} P\left(k_{i} \mid Y\right) \log \frac{P\left(k_{i} \mid Y\right)}{P\left(k_{i} \mid U, G\right)}
$$
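A minimal sketch of this term; treating the posterior as a fixed target (the `detach`) is an assumption, since the summary does not say where gradients are blocked.
```python
import torch

def kl_loss(prior, posterior, eps: float = 1e-10):
    # prior, posterior: (batch, N_K) distributions over knowledge facts.
    # The posterior acts as the target (detached here as an assumption).
    posterior = posterior.detach()
    return (posterior * (torch.log(posterior + eps)
                         - torch.log(prior + eps))).sum(dim=-1).mean()
```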
## BOW Loss
The BOW loss is used to ensure the relevance between the estimated knowledge distribution and the response:
$$
L_{B O W}(\theta)=-\frac{\sum_{y_{t} \in \mathrm{B}} \log \varphi\left(y_{t} \mid s_{k}\right)}{|\mathrm{B}|}
$$
where $\mathrm{B}$ is the bag of words of the gold response $Y$ and $\varphi$ estimates word probabilities from the discerned knowledge representation $s_{k}$.
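A minimal sketch, assuming `phi` is an MLP that maps the fused knowledge vector $s_k$ to vocabulary logits; the paper's exact parameterization of $\varphi$ may differ.
```python
import torch
import torch.nn.functional as F

def bow_loss(s_k, bow_ids, phi):
    # s_k: (batch, dim); bow_ids: (batch, |B|) ids of the response's words.
    log_probs = F.log_softmax(phi(s_k), dim=-1)   # log φ(y_t | s_k), (batch, vocab)
    picked = log_probs.gather(1, bow_ids)         # (batch, |B|)
    return -picked.mean()                         # average over B and the batch
```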
## Decoder
The decoder is a forward (unidirectional) RNN with gated recurrent units.
$h_{t}=\operatorname{GRU}\left(y_{t-1}, h_{t-1}\right)$
where $h_t$ is used to obtain the generation probability $P_{vocab}(w_t)$ over the fixed vocabulary built from the training set:
$P_{v o c a b}\left(w_{t}\right)=M L P\left(h_{t}\right)$
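A minimal sketch of one decoding step; the softmax at the end of the MLP head is an assumption, so that $P_{vocab}$ is a proper distribution.
```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, embed_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.cell = nn.GRUCell(embed_dim, hidden_dim)
        self.head = nn.Sequential(nn.Linear(hidden_dim, vocab_size),
                                  nn.Softmax(dim=-1))

    def forward(self, y_prev_embed, h_prev):
        h_t = self.cell(y_prev_embed, h_prev)  # h_t = GRU(y_{t-1}, h_{t-1})
        return h_t, self.head(h_t)             # P_vocab(w_t) = MLP(h_t)
```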
## Context Manager
**Copying from Multi-Sources**
For each source $\Phi \in \{U, K, G\}$, attention over its encoder outputs $o^{\Phi}$ yields an attention distribution $d_{t}^{\Phi}$ and a context vector $c_{t}^{\Phi}$:
$d_{t}^{\Phi}, c_{t}^{\Phi}=\operatorname{Attention}\left(o^{\Phi}, h_{t}\right)$
$P_{\Phi}\left(w_{t}\right)=\sum_{\left\{l: \Phi_{l}=w_{t}\right\}} d_{t, l}^{\Phi}$
$$
P_{K}\left(w_{t}\right)=\sum_{i=1}^{N_{K}} P\left(k_{i}\right) \cdot P\left(w_{t} \mid k_{i}\right)
\quad=\sum_{i=1}^{N_{K}} \delta_{k, i} \cdot \sum_{\left\{l: k_{i}^{l}=w_{t}\right\}} d_{t, l}^{k, i}
$$
where $\delta_{k, i}$ is the probability that the knowledge discernment assigns to fact $k_{i}$, i.e., $P(k_{i})$ above.
**Sources Fusion**
$c_{t}^{K}=\sum_{i=1}^{N_{K}} \delta_{k, i} \cdot c_{t}^{k, i}$
$\alpha_{t}, c_{t}=\text { Attention }\left(\left[c_{t}^{U}, c_{t}^{K}, c_{t}^{G}\right]^{T}, h_{t}\right)$
$$
P\left(w_{t}\right)=p_{t}^{\text {gen }} P_{\text {vocab }}\left(w_{t}\right)+\left(1-p_{t}^{\text {gen }}\right)\cdot\sum_{\{\phi: \phi \in\{U, K, G\}\}} \alpha_t^{(\phi)} P_\phi\left(w_t\right)
$$
$p_t^{g e n}=\sigma\left(W_{g e n}\left[y_{t-1} ; h_t ; c_t\right]\right)$
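A minimal sketch of the copy-and-fuse step; the dot-product attention is an assumption (the paper's `Attention` may be MLP-based), and the copy distributions are built by scattering attention mass onto source token ids.
```python
import torch
import torch.nn.functional as F

def attention(memory, h_t):
    # memory: (batch, L, dim); h_t: (batch, dim).
    scores = torch.bmm(memory, h_t.unsqueeze(-1)).squeeze(-1)  # (batch, L)
    d_t = F.softmax(scores, dim=-1)                            # attention weights
    c_t = torch.bmm(d_t.unsqueeze(1), memory).squeeze(1)       # context, (batch, dim)
    return d_t, c_t

def copy_distribution(d_t, source_ids, vocab_size):
    # P_Phi(w_t): sum the attention weights of all positions whose token is w_t.
    p_copy = torch.zeros(d_t.size(0), vocab_size, device=d_t.device)
    return p_copy.scatter_add(1, source_ids, d_t)

def mix(p_vocab, p_U, p_K, p_G, alpha_t, p_gen):
    # alpha_t: (batch, 3) source-fusion weights; p_gen: (batch, 1) gate.
    p_copy = (alpha_t[:, 0:1] * p_U + alpha_t[:, 1:2] * p_K
              + alpha_t[:, 2:3] * p_G)
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy  # final P(w_t)
```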
## Training
The negative log-likelihood (**NLL**) loss captures word-order information:
$$
L_{N L L}(\theta)=-\frac{1}{|Y|} \sum_{t=1}^{|Y|} \log \left(P\left(y_{t} \mid y_{1: t-1}, U, K, G\right)\right)
$$
Final loss:
$$
L(\theta)=L_{N L L}(\theta)+L_{B O W}(\theta)+L_{K L}(\theta)
$$
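A minimal sketch of the joint objective, reusing the `kl_loss` and `bow_loss` sketches above; `log_p_words` holds the per-step log of $P(w_t)$.
```python
import torch.nn.functional as F

def total_loss(log_p_words, target_ids, prior, posterior, s_k, bow_ids, phi):
    # log_p_words: (batch, T, vocab) log P(w_t); target_ids: (batch, T).
    nll = F.nll_loss(log_p_words.transpose(1, 2), target_ids)  # mean over tokens
    # The three terms are summed without weights, as in the final loss above.
    return nll + bow_loss(s_k, bow_ids, phi) + kl_loss(prior, posterior)
```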
---
# Experiments
Datasets:
- DuConv -- a proactive conversation dataset
- DuRecDial -- a goal-oriented knowledge-driven conversational recommendation dataset

---
# Conclusion and Future Work
They propose a **goal-oriented** knowledge copy network that copies tokens from multiple input sources and discerns coherent knowledge for response generation.
They intend to incorporate **transfer learning** into the dialog system and to further enhance the quality of generated responses by alleviating the knowledge repetition problem.
---
# Appendix
[Knowledge-Interaction and knowledge Copy (KIC)](https://aclanthology.org/2020.acl-main.6.pdf)
