# [Learning to Copy Coherent Knowledge for Response Generation (AAAI-21)](https://ojs.aaai.org/index.php/AAAI/article/view/17486)
# Outline
- Introduction
- Approach
- Experiments
- Conclusion
# Introduction
Previous works fall into the paradigm of non-goal-oriented knowledge-driven dialog.
They are **prone to ignore the effect of the dialog goal**, which has a potential impact on knowledge exploitation and response generation.
The Goal-Oriented Knowledge Copy network (GOKC) introduces a goal-oriented knowledge discernment mechanism that helps the model discern the knowledge facts relevant to the dialog goal.
A context manager is devised to copy facts:
- from the **discerned knowledge**
- from the **dialog goal** and the **dialog context**
which allows the model to accurately restate the facts in the generated response.

Contribution:
1. The model can generate responses with more **knowledge-coherent** facts.
2. Not only maintains the **accuracy** of knowledge discernment but also alleviates the problem noted above (ignoring the dialog goal).
3. More **coherent** and **fluent** responses.
4. Results on both **human evaluation and automatic evaluation** show that their model **outperforms** several competitive baseline models.
---------------
# Main figure

---------------
# Approach
## Problem Formalization
$D = \left\{\left(U_{i}, K_{i}, G_{i}, Y_{i}\right)\right\}_{i=1}^{N}$, where each sample consists of a dialog history $U_i$, a related knowledge set $K_i$, a dialog goal $G_i$, and a response $Y_i$.
The goal is to **learn a response generation model $P(Y \mid U, K, G)$ from $D$** such that, given a new dialog history $U$ paired with a related knowledge set $K$, the model can generate an appropriate response $Y$ that achieves the given dialog goal $G$.
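For concreteness, one training sample in $D$ could be represented as below; the field names are illustrative assumptions, not the authors' data schema.
```python
from dataclasses import dataclass
from typing import List

@dataclass
class DialogSample:
    history: List[str]    # dialog history U (previous utterances)
    knowledge: List[str]  # knowledge facts K = {k_1, ..., k_{N_K}}
    goal: str             # dialog goal G
    response: str         # gold response Y
```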
## Encoder
The encoders for the dialog context, knowledge, and goal are all built upon a bi-directional recurrent neural network (Bi-RNN) with gated recurrent units (GRU).
$h_{t}=\left[h_{t}^{f w} ; h_{t}^{b w}\right]=\left[\overrightarrow{\operatorname{GRU}}\left(x_{t}, h_{t-1}^{f w}\right) ; \overleftarrow{GRU}\left(x_{t}, h_{t-1}^{b w}\right)\right]$
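A minimal PyTorch sketch of such an encoder; the hyperparameter values are illustrative assumptions.
```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Bi-RNN encoder with GRU cells; h_t = [h_t^fw ; h_t^bw]."""
    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                          bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> outputs: (batch, seq_len, 2 * hidden_dim)
        outputs, _ = self.gru(self.embedding(token_ids))
        return outputs
```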
## Knowledge Discernment
The prior knowledge discernment:
$$
P\left(k_{j} \mid U, G\right)=\frac{\exp \left(s_{K, j} \cdot d_{\text {prior }}\right)}{\sum_{i=1}^{N_{K}} \exp \left(s_{K, i} \cdot d_{\text {prior }}\right)}
$$
where $d_{\text {prior }}=\tanh \left(\beta \odot s_U+(1-\beta) \odot s_G\right)$, with $\beta$ gating between the context representation $s_U$ and the goal representation $s_G$.
The posterior knowledge discernment:
$$
P\left(k_{i} \mid Y\right)=\frac{\exp \left(s_{K, i} \cdot d_{\text {post }}\right)}{\sum_{j=1}^{N_{K}} \exp \left(s_{K, j} \cdot d_{\text {post }}\right)}
$$
where $d_{\text {post }}=\tanh \left(W_{\text {post }} s_Y\right)$
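A minimal sketch of both discernment steps, assuming `s_U`, `s_G`, `s_Y` are sentence vectors of shape `(batch, dim)`, `s_K` stacks the $N_K$ fact vectors as `(batch, N_K, dim)`, and `beta` / `W_post` stand in for the paper's learned parameters.
```python
import torch
import torch.nn.functional as F

def prior_distribution(s_K, s_U, s_G, beta):
    # d_prior = tanh(beta ⊙ s_U + (1 - beta) ⊙ s_G)
    d_prior = torch.tanh(beta * s_U + (1.0 - beta) * s_G)        # (batch, dim)
    scores = torch.bmm(s_K, d_prior.unsqueeze(-1)).squeeze(-1)   # (batch, N_K)
    return F.softmax(scores, dim=-1)                             # P(k_i | U, G)

def posterior_distribution(s_K, s_Y, W_post):
    # d_post = tanh(W_post s_Y)
    d_post = torch.tanh(s_Y @ W_post.T)                          # (batch, dim)
    scores = torch.bmm(s_K, d_post.unsqueeze(-1)).squeeze(-1)    # (batch, N_K)
    return F.softmax(scores, dim=-1)                             # P(k_i | Y)
```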
## KLDiv Loss
The KL divergence loss pushes the prior distribution toward the posterior distribution:
$$
L_{K L}(\theta)=\sum_{i=1}^{N_{K}} P\left(k_{i} \mid Y\right) \log \frac{P\left(k_{i} \mid Y\right)}{P\left(k_{i} \mid U, G\right)}
$$
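A minimal sketch of this term; treating the posterior as a fixed target (the `detach`) is an assumption, since the summary does not say where gradients are blocked.
```python
import torch

def kl_loss(prior, posterior, eps: float = 1e-10):
    # prior, posterior: (batch, N_K) distributions over knowledge facts.
    # The posterior acts as the target (detached here as an assumption).
    posterior = posterior.detach()
    return (posterior * (torch.log(posterior + eps)
                         - torch.log(prior + eps))).sum(dim=-1).mean()
```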
## BOW Loss
The BOW loss is used to ensure the relevance between the estimated knowledge distribution and the response:
$$
L_{B O W}(\theta)=-\frac{\sum_{y_{t} \in \mathrm{B}} \log \varphi\left(y_{t} \mid s_{k}\right)}{|\mathrm{B}|}
$$
where $\mathrm{B}$ is the bag of words of the gold response $Y$ and $\varphi$ estimates word probabilities from the discerned knowledge representation $s_{k}$.
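A minimal sketch, assuming `phi` is an MLP that maps the fused knowledge vector $s_k$ to vocabulary logits; the paper's exact parameterization of $\varphi$ may differ.
```python
import torch
import torch.nn.functional as F

def bow_loss(s_k, bow_ids, phi):
    # s_k: (batch, dim); bow_ids: (batch, |B|) ids of the response's words.
    log_probs = F.log_softmax(phi(s_k), dim=-1)   # log φ(y_t | s_k), (batch, vocab)
    picked = log_probs.gather(1, bow_ids)         # (batch, |B|)
    return -picked.mean()                         # average over B and the batch
```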
## Decoder
The decoder is a forward (unidirectional) RNN with gated recurrent units.
$h_{t}=\operatorname{GRU}\left(y_{t-1}, h_{t-1}\right)$
where $h_t$ is used to obtain the generation probability $P_{vocab}(w_t)$ over the fixed vocabulary built from the training set:
$P_{v o c a b}\left(w_{t}\right)=M L P\left(h_{t}\right)$
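A minimal sketch of one decoding step; the softmax at the end of the MLP head is an assumption, so that $P_{vocab}$ is a proper distribution.
```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, embed_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.cell = nn.GRUCell(embed_dim, hidden_dim)
        self.head = nn.Sequential(nn.Linear(hidden_dim, vocab_size),
                                  nn.Softmax(dim=-1))

    def forward(self, y_prev_embed, h_prev):
        h_t = self.cell(y_prev_embed, h_prev)  # h_t = GRU(y_{t-1}, h_{t-1})
        return h_t, self.head(h_t)             # P_vocab(w_t) = MLP(h_t)
```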
## Context Manager
**Copying from Multi-Sources**
For each source $\Phi \in \{U, K, G\}$, attention over its encoder outputs $o^{\Phi}$ yields an attention distribution $d_{t}^{\Phi}$ and a context vector $c_{t}^{\Phi}$:
$d_{t}^{\Phi}, c_{t}^{\Phi}=\operatorname{Attention}\left(o^{\Phi}, h_{t}\right)$
$P_{\Phi}\left(w_{t}\right)=\sum_{\left\{l: \Phi_{l}=w_{t}\right\}} d_{t, l}^{\Phi}$
$$
P_{K}\left(w_{t}\right)=\sum_{i=1}^{N_{K}} P\left(k_{i}\right) \cdot P\left(w_{t} \mid k_{i}\right)
\quad=\sum_{i=1}^{N_{K}} \delta_{k, i} \cdot \sum_{\left\{l: k_{i}^{l}=w_{t}\right\}} d_{t, l}^{k, i}
$$
where $\delta_{k, i}$ is the probability that the knowledge discernment assigns to fact $k_{i}$, i.e., $P(k_{i})$ above.
**Sources Fusion**
$c_{t}^{K}=\sum_{i=1}^{N_{K}} \delta_{k, i} \cdot c_{t}^{k, i}$
$\alpha_{t}, c_{t}=\text { Attention }\left(\left[c_{t}^{U}, c_{t}^{K}, c_{t}^{G}\right]^{T}, h_{t}\right)$
$$
P\left(w_{t}\right)=p_{t}^{\text {gen }} P_{\text {vocab }}\left(w_{t}\right)+\left(1-p_{t}^{\text {gen }}\right)\cdot\sum_{\{\phi: \phi \in\{U, K, G\}\}} \alpha_t^{(\phi)} P_\phi\left(w_t\right)
$$
$p_t^{g e n}=\sigma\left(W_{g e n}\left[y_{t-1} ; h_t ; c_t\right]\right)$
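A minimal sketch of the copy-and-fuse step; the dot-product attention is an assumption (the paper's `Attention` may be MLP-based), and the copy distributions are built by scattering attention mass onto source token ids.
```python
import torch
import torch.nn.functional as F

def attention(memory, h_t):
    # memory: (batch, L, dim); h_t: (batch, dim).
    scores = torch.bmm(memory, h_t.unsqueeze(-1)).squeeze(-1)  # (batch, L)
    d_t = F.softmax(scores, dim=-1)                            # attention weights
    c_t = torch.bmm(d_t.unsqueeze(1), memory).squeeze(1)       # context, (batch, dim)
    return d_t, c_t

def copy_distribution(d_t, source_ids, vocab_size):
    # P_Phi(w_t): sum the attention weights of all positions whose token is w_t.
    p_copy = torch.zeros(d_t.size(0), vocab_size, device=d_t.device)
    return p_copy.scatter_add(1, source_ids, d_t)

def mix(p_vocab, p_U, p_K, p_G, alpha_t, p_gen):
    # alpha_t: (batch, 3) source-fusion weights; p_gen: (batch, 1) gate.
    p_copy = (alpha_t[:, 0:1] * p_U + alpha_t[:, 1:2] * p_K
              + alpha_t[:, 2:3] * p_G)
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy  # final P(w_t)
```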
## Training
The negative log-likelihood (**NLL**) loss captures word-order information:
$$
L_{N L L}(\theta)=-\frac{1}{|Y|} \sum_{t=1}^{|Y|} \log \left(P\left(y_{t} \mid y_{1: t-1}, U, K, G\right)\right)
$$
Final loss:
$$
L(\theta)=L_{N L L}(\theta)+L_{B O W}(\theta)+L_{K L}(\theta)
$$
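A minimal sketch of the joint objective, reusing the `kl_loss` and `bow_loss` sketches above; `log_p_words` holds the per-step log of $P(w_t)$.
```python
import torch.nn.functional as F

def total_loss(log_p_words, target_ids, prior, posterior, s_k, bow_ids, phi):
    # log_p_words: (batch, T, vocab) log P(w_t); target_ids: (batch, T).
    nll = F.nll_loss(log_p_words.transpose(1, 2), target_ids)  # mean over tokens
    # The three terms are summed without weights, as in the final loss above.
    return nll + bow_loss(s_k, bow_ids, phi) + kl_loss(prior, posterior)
```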
---
# Experiments
Datasets:
- DuConv -- a proactive conversation dataset
- DuRecDial -- a goal-oriented knowledge-driven conversational recommendation dataset

---
# Conclusion and Future Work
They propose a **goal-oriented** knowledge copy network that copies tokens from multiple input sources and discerns coherent knowledge for response generation.
They intend to incorporate **transfer learning** into the dialog system and to further enhance the quality of generated responses by alleviating the knowledge repetition problem.
---
# Appendix
[Knowledge-Interaction and knowledge Copy (KIC)](https://aclanthology.org/2020.acl-main.6.pdf)
