<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/pdf/2009.09708v1.pdf) | arXiv 2020
:::success
**Thoughts**
They propose Know-EDG
* Knowledge-Enhanced Context Encoder: enrich knowledge with context graph
* Emotion Identifier: distill the emotion information from the dialogue context and predict the emotion signal
* Emotion-Focused Response Generator: use an emotion-focused attention mechanism to learn the emotional dependencies from the dialogue context and generate the final response
In the experiments, they obtain the best emotion accuracy when the number of external concepts is set to 8.
:::
## Abstract
The task of empathetic dialogue generation is proposed so that dialogue systems can perceive the user's feelings and respond with appropriate emotion.
Two challenges still exist in this task:
- Perceiving nuanced emotions implied in the dialogue context
- Modelling emotional dependencies
Lacking useful external knowledge makes it challenging to perceive implicit fine-grained emotions.
Missing the emotional interactions among interlocutors also restricts the performance of empathetic dialogue generation.
---
This paper proposes a knowledge-enhanced framework named Know-EDG.
1. They enrich dialogue context by bunches of emotion-related concepts and construct a **knowledge-enhanced context graph**.
2. Also, they introduce a **graph-aware Transformer encoder** to learn the graph's semantic and emotional representations, which are the prerequisites for the emotion identifier to predict the target emotion signal.
3. Finally, they propose an **emotion-focused attention mechanism** to exploit the emotional dependencies between the dialogue context and the target empathetic response.
## Introduction
Humans usually rely on experience and external knowledge to reason about and express implicit emotions, which suggests that external knowledge may play a crucial role in understanding and reasoning about emotion information.

To illustrate this phenomenon concretely, they statistically investigate the role of external knowledge in emotion understanding on the EmpatheticDialogues dataset; the conclusions are illustrated in Figure 2.

Moreover, during the investigations, they observe an interesting phenomenon that emotional dependency and emotional inertia commonly appear in many real-world conversations.

In Figure 3, the darker diagonal grids show that listeners tend to mirror the emotion of their interlocutors to build rapport.
Therefore, intuitively, modelling emotional dependencies between interlocutors is crucial and helpful to improve the empathetic dialogue performance.
---
They propose a Knowledge-enhanced framework, **Know-EDG**.
- Knowledge-enhanced context encoder
- Emotion identifier: jointly modeling the dynamic emotion transition mechanism
- Emotion-focused response generator
## Preliminaries
**ConceptNet** is a large-scale knowledge graph that describes general human knowledge in natural language.
It comprises 5.9M tuples, 3.1M concepts, and 38 relations.
They denote each tuple (head concept, relation, tail concept, confidence score) as $\tau=(h,r,t,s)$ where the confidence score $s \in [1, 10]$.
They use *min-max* normalization to scale $s$ between 0 and 1:
$$
\tag{1} \textit{min-max}(s) = \frac{s - \min_s}{\max_s - \min_s}
$$
where $\min_s$ and $\max_s$ are 1 and 10, respectively.
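A minimal Python sketch of Eq. 1 (the bounds follow the stated range of ConceptNet confidence scores; the function name and example tuple are my own):

```python
def min_max(s: float, s_min: float = 1.0, s_max: float = 10.0) -> float:
    """Scale a ConceptNet confidence score from [1, 10] down to [0, 1] (Eq. 1)."""
    return (s - s_min) / (s_max - s_min)

# Illustrative tuple (head, relation, tail, confidence): ("lonely", "RelatedTo", "sad", 3.46)
print(min_max(3.46))  # ~0.27
```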
**NRC_VAD** is a list of VAD (Valence-Arousal-Dominance) vectors with dimensions ($V_a, A_r, D_o$) for 20k English words.

They adopt NRC_VAD to compute the emotion intensity for words in dialogue context and concepts from ConceptNet.
$$
\tag{2} \eta_i = \textit{min-max} (\left\| V_a(x_i) -\frac{1}{2}, \frac{A_r(x_i)}{2} \right\|_2)
$$
If $x_i$ is not in NRC_VAD, $V_a(x_i)$ and $A_r(x_i)$ are set to 0.5.
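A small Python sketch of Eq. 2, assuming a dict mapping words to $(V_a, A_r, D_o)$ values; the lexicon entries and the min-max bounds ($0$ to $\|0.5, 0.5\|_2$) are my own assumptions, not taken from the paper:

```python
import math

# Illustrative lexicon snippet: word -> (valence, arousal, dominance); not real NRC_VAD values
NRC_VAD = {"delighted": (0.95, 0.75, 0.61), "table": (0.52, 0.30, 0.45)}

def emotion_intensity(word: str) -> float:
    """Emotion intensity per Eq. 2, with the 0.5 neutral fallback for OOV words."""
    v, a, _ = NRC_VAD.get(word, (0.5, 0.5, 0.5))
    norm = math.hypot(v - 0.5, a / 2.0)   # || V_a - 1/2, A_r / 2 ||_2
    return norm / math.hypot(0.5, 0.5)    # assumed min-max bounds: [0, ||0.5, 0.5||_2]

print(emotion_intensity("delighted"))  # high intensity
print(emotion_intensity("unknown"))    # neutral fallback
```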
## Model
They propose to ground an open-ended empathetic conversational system in **commonsense** and **emotion** knowledge sources.
### Overview
They are given a dialogue history of turns $[X_1, \dots , X_M]$ where $X_i$ is the $i$-th utterance, the commonsense knowledge graph ConceptNet, and the list of VAD vectors VADs.
The system needs to generate an empathetic response $Y = \{y_1, \dots , y_n\}$ that is both emotionally appropriate and meaningful with the content.

**Knowledge-Enhanced Context Encoder**
They construct a knowledge-enriched context graph $G = (V, A)$ as the dialogue context, where $V$ is a set of nodes (a root node CLS, utterance token nodes, and knowledge concept nodes) and $A$ denotes an adjacency matrix describing the directed edges.
**Emotion Identifier**
They **distill the emotion information from the dialogue context** and predict the emotion signal for the response generation.
**Emotion-Focused Response Generator**
They **design an emotion-focused attention mechanism to learn the emotional dependencies** from the dialogue context and generate the final response.
### Knowledge-Enhanced Context Encoder
#### Dialogue Context Graph Construction
They flatten the $M$ utterances into a long token sequence $S$. For each non-stopword token $x_i$ in $S$, they retrieve a set of candidate tuples $\tau^i_k = (x_i, r^i_k, c^i_k, s^i_k)_{k=1,\dots,K}$.
To refine the emotion-related knowledge from the commonsense knowledge, they filter the retrieved tuples in two steps:
- **Relation Filtering**: They create an excluded relation list $L_{ex}$ to filter the set of retrieved tuples and keep only the tuples with $r_k^i \notin L_{ex}$ and $s^i_k \gt \alpha$, where $\alpha$ is a pre-defined threshold.
- **Concept Ranking**: They compute a matching score for each $\tau_k^i$ according to three aspects (emotion, semantics, and confidence), as sketched in the code below:
$$
\tag{3} \text{Score}(\tau_k^i) = f_e(c_k^i) + \cos (x_i, c_k^i) + s_k^i,
$$
where $f_e(c_k^i) = \textit{min-max} (\| V_a(c_k^i) -\frac{1}{2}, \frac{A_r(c_k^i)}{2} \|_2)$ and $c_k^i$ is the tail concept.
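A hedged Python sketch of the two filtering steps and the Eq. 3 score; the excluded-relation list, the threshold, and the embeddings are illustrative placeholders, and the intensity function is assumed to be the Eq. 2 one:

```python
import numpy as np

EXCLUDED_RELATIONS = {"Antonym", "NotDesires"}   # illustrative L_ex, not the paper's list
ALPHA = 0.1                                      # illustrative confidence threshold

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def select_concepts(token_vec, candidates, concept_emb, emotion_intensity, top_k=8):
    """candidates: (relation, tail_concept, normalized confidence) tuples for one token."""
    kept = [(r, c, s) for r, c, s in candidates
            if r not in EXCLUDED_RELATIONS and s > ALPHA]           # relation filtering
    scored = [(emotion_intensity(c) + cosine(token_vec, concept_emb[c]) + s, c)
              for r, c, s in kept]                                  # Eq. 3 matching score
    return [c for _, c in sorted(scored, reverse=True)[:top_k]]

# Toy usage with random embeddings and a dummy intensity function
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=8) for w in ["lonely", "sad", "alone", "happy"]}
cands = [("RelatedTo", "sad", 0.27), ("RelatedTo", "alone", 0.46), ("Antonym", "happy", 0.9)]
print(select_concepts(emb["lonely"], cands, emb, lambda w: 0.5))
```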
For the context graph $G = (V, A)$, the node set $V$ consists of the dialogue-history tokens $x$, the selected tail concepts $c$ from the emotional knowledge, and the root node $\text{CLS}$.
$G$ contains 3 directed relation types:
1. *sequence*: $x_i \rightarrow x_{i+1}$
2. *emotion*: $c_k^i \rightarrow x_i$
3. *globality*: relation between $\text{CLS}$ node and other nodes
Edges of all three directed relation types are set to 1 in the adjacency matrix $A$.
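A minimal sketch of the adjacency construction under these three relation types, assuming the node order [CLS, utterance tokens, concepts]; whether the globality edges are uni- or bi-directional is my own assumption:

```python
import numpy as np

def build_adjacency(num_tokens: int, concept_to_token):
    """concept_to_token: for each concept node, the index of the utterance token it modifies."""
    num_concepts = len(concept_to_token)
    n = 1 + num_tokens + num_concepts               # CLS + token nodes + concept nodes
    A = np.zeros((n, n), dtype=np.float32)
    A[0, :] = A[:, 0] = 1.0                         # globality: CLS connected to all nodes
    for i in range(num_tokens - 1):                 # sequence: x_i -> x_{i+1}
        A[1 + i, 2 + i] = 1.0
    for k, t in enumerate(concept_to_token):        # emotion: c_k^i -> x_i
        A[1 + num_tokens + k, 1 + t] = 1.0
    return A

A = build_adjacency(num_tokens=4, concept_to_token=[1, 3])
print(A.shape)  # (7, 7)
```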
#### Dialogue Context Graph Encoding
They use a **word embedding layer** and a **positional embedding layer** to convert each node $v_i$ into a vector, and also incorporate a **dialogue state embedding**:
$$
\tag{4} \textbf{v}_i = \textbf{E}_w (v_i) + \textbf{E}_d (v_i) + \textbf{E}_p (v_i)
$$
Then they apply a multi-head graph-attention mechanism to update the node representations with emotional knowledge.
$$
\tag{5} \hat{\textbf{v}}_i = \textbf{v}_i + \|_{n=1}^H \sum_{j \in \mathcal{A}_i} \alpha_{ij}^n \textbf{W}_v^n \textbf{v}_j
$$
where $\alpha_{ij}^n = a^n(\textbf{v}_i, \textbf{v}_j)$ and $a^n(\cdot)$ is the self-attention function of the $n$-th head.
$$
\tag{6} a^n(\textbf{q}_i, \textbf{k}_j) = \frac{\exp((\textbf{W}_q^n \textbf{q}_i)^\top \textbf{W}^n_k \textbf{k}_j)}{\sum_{\mathcal{z} \in \mathcal{A}_i} \exp((\textbf{W}_q^n \textbf{q}_i)^\top \textbf{W}^n_k \textbf{k}_\mathcal{z})}
$$
where $\textbf{W}_q^n, \textbf{W}_k^n$ are linear transformations applied to the query node $\textbf{q}$ and the key node $\textbf{k}$, respectively.
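A PyTorch sketch of the neighbour-masked multi-head attention update (Eqs. 5-6); the unscaled dot-product follows Eq. 6, and the class name is my own:

```python
import torch
import torch.nn as nn

class GraphMultiHeadAttention(nn.Module):
    """Masked multi-head self-attention over graph neighbours (Eqs. 5-6); a sketch."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, v: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # v: (n, d_model); adj: (n, n) with 1 where j is a neighbour of i.
        # Assumes every node has at least one neighbour (the globality edges ensure this).
        n, d = v.shape
        q = self.W_q(v).view(n, self.h, self.d_k).transpose(0, 1)   # (h, n, d_k)
        k = self.W_k(v).view(n, self.h, self.d_k).transpose(0, 1)
        val = self.W_v(v).view(n, self.h, self.d_k).transpose(0, 1)
        scores = q @ k.transpose(-2, -1)                             # (h, n, n), unscaled as in Eq. 6
        scores = scores.masked_fill(adj.unsqueeze(0) == 0, float("-inf"))
        alpha = scores.softmax(dim=-1)                               # Eq. 6
        out = (alpha @ val).transpose(0, 1).reshape(n, d)            # concatenate heads
        return v + out                                               # Eq. 5 residual

att = GraphMultiHeadAttention(d_model=16, num_heads=4)
out = att(torch.randn(7, 16), torch.ones(7, 7))   # toy fully connected graph
```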
Since the previous operations only attend to the local context (neighbours), they apply stacked Transformer layers (the $l$-th layer is shown below) to encode global information for all nodes $\{ \hat{\textbf{v}}_i \}_{i=1, \dots, m}$.
$$
\tag{7} \textbf{h}_i^l = \text{LayerNorm} (\hat{\textbf{v}}_i^{l-1} + \text{MHAtt}(\hat{\textbf{v}}_i^{l-1}))
$$
$$
\tag{8} \tilde{\textbf{v}}_i^{l} = \text{LayerNorm} (\textbf{h}_i^l + \text{FFN} (\textbf{h}_i^l))
$$
The obtained final context representations are denoted as $\textbf{C} = \{ \tilde{\textbf{v}}_i \}_{i=1, \dots, m}$.
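A sketch of one such global encoder layer (Eqs. 7-8) using standard PyTorch building blocks:

```python
import torch
import torch.nn as nn

class GlobalEncoderLayer(nn.Module):
    """One standard Transformer encoder layer matching Eqs. 7-8 (a sketch)."""
    def __init__(self, d_model: int, num_heads: int, d_ff: int):
        super().__init__()
        self.mhatt = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, v_hat: torch.Tensor) -> torch.Tensor:
        # v_hat: (batch, n, d_model) node representations from the graph attention
        attn_out, _ = self.mhatt(v_hat, v_hat, v_hat)
        h = self.norm1(v_hat + attn_out)          # Eq. 7
        return self.norm2(h + self.ffn(h))        # Eq. 8
```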
### Emotion Identifier
This component predicts the emotion signal for the target response.
They use weighted summation to derive the emotion-focused context representation $\textbf{E}_e$:
$$
\tag{9} \textbf{E}_e = \sum_i \frac{\exp (\eta_i)}{\sum_j \exp(\eta_j)} \tilde{\textbf{v}}_i
$$
Then they get the emotion category distribution $\textbf{e}_p$:
$$
\tag{10} \textbf{E}_p = \textbf{W}_e \textbf{E}_e
$$
where $\textbf{W}_e \in \mathbb{R}^{d \times q}$ and $q$ is the number of emotion categories.
$$
\tag{11} \textbf{e}_p = \text{softmax}(\textbf{E}_p)
$$
At training time, $\textbf{e} \in \mathbb{R}^q$ is the ground-truth one-hot emotion label, and the loss function is:
$$
\tag{12} \mathcal{L}_{emo} = - \sum_i \textbf{e}^i \log (\textbf{e}_p^i)
$$
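Putting Eqs. 9-12 together, a possible PyTorch sketch of the emotion identifier (module name mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionIdentifier(nn.Module):
    """Intensity-weighted pooling plus emotion classification (Eqs. 9-12); a sketch."""
    def __init__(self, d_model: int, num_emotions: int):
        super().__init__()
        self.W_e = nn.Linear(d_model, num_emotions, bias=False)

    def forward(self, C: torch.Tensor, eta: torch.Tensor):
        # C: (n, d_model) context representations; eta: (n,) emotion intensities
        weights = F.softmax(eta, dim=-1)      # Eq. 9 weighting
        E_e = weights @ C                     # emotion-focused context vector
        E_p = self.W_e(E_e)                   # Eq. 10 logits
        e_p = F.softmax(E_p, dim=-1)          # Eq. 11 distribution
        return E_e, E_p, e_p

# Cross-entropy against the gold label index implements Eq. 12:
# loss = F.cross_entropy(E_p.unsqueeze(0), gold_label.unsqueeze(0))
```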
### Emotion-Focused Response Generator
The emotion signal $\textbf{E}_p \in \mathbb{R}^{1 \times q}$ is first transformed by a linear layer into $\textbf{E}^\prime_p \in \mathbb{R}^{1 \times d}$.
They feed $\textbf{Y}_{emb}$ into the response generator, where $\textbf{Y}_{emb} = \{ \textbf{y}_t \}_{t = 0, \dots, j-1}$ and $\textbf{y}_0 = \textbf{E}^\prime_p$.
Let $\textbf{Y} = \{ \textbf{y}_t \}_{t=0, \dots, j-1}$ denote the new vector representations computed from the inputs $\textbf{Y}_{emb}$.
They design an **emotion-focused cross attention mechanism** $\text{E-MHAtt}$ to replace the cross-attention sub-layer in the original Transformer decoder layer, so they can obtain emotion dependency vectors:
$$
\tag{13} \textbf{D} = \textbf{Y} + \textbf{W}_m \text{E-MHAtt} (\textbf{Y}, \textbf{C})
$$
where $\text{E-MHAtt} (\textbf{Y}, \textbf{C}) = [\textbf{MHAtt}(\textbf{Y}, \textbf{C}) \| \textbf{E}_e]$.
Layer normalization is then applied:
$$
\tag{14} \hat{\textbf{D}} = \text{LayerNorm}(\textbf{D})
$$
$$
\tag{15} \hat{\textbf{Y}} = \text{LayerNorm}(\hat{\textbf{D}} + \text{FFN}(\hat{\textbf{D}}))
$$
where $\hat{\textbf{Y}} = [\hat{\textbf{y}}_1, \dots, \hat{\textbf{y}}_j]$.
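A PyTorch sketch of a decoder layer with the emotion-focused cross-attention of Eqs. 13-15; the self-attention sub-layer over $\textbf{Y}$ is omitted for brevity and the projection shape of $\textbf{W}_m$ is my own assumption:

```python
import torch
import torch.nn as nn

class EmotionFocusedDecoderLayer(nn.Module):
    """Cross-attention with the emotion context vector appended (Eqs. 13-15); a sketch."""
    def __init__(self, d_model: int, num_heads: int, d_ff: int):
        super().__init__()
        self.cross_att = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.W_m = nn.Linear(2 * d_model, d_model, bias=False)   # maps [MHAtt || E_e] back to d
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, Y, C, E_e):
        # Y: (b, j, d) partial response; C: (b, n, d) context nodes; E_e: (b, d)
        att, _ = self.cross_att(Y, C, C)                          # MHAtt(Y, C)
        e = E_e.unsqueeze(1).expand_as(att)                       # broadcast E_e to each step
        D = Y + self.W_m(torch.cat([att, e], dim=-1))             # Eq. 13
        D_hat = self.norm1(D)                                     # Eq. 14
        return self.norm2(D_hat + self.ffn(D_hat))                # Eq. 15
```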
The response generator then yields the distribution over the vocabulary for the $j$-th token:
$$
\tag{16} \alpha^g (y_j \mid \textbf{C}, y_{0:j-1}) = \text{softmax} (\textbf{W}_o \hat{\textbf{y}}_{\textbf{j}})
$$
They compute a probability $p_g$ of copying from nodes $\{ v_i \}_{i=1, \dots, m}$ and derive the final next-token probability distribution $p(y_j)$:
$$
\tag{17} p_g = \sigma (\textbf{W}_c \hat{\textbf{y}}_j + b_c)
$$
$$
\tag{18} p(y_j) = (1 - p_g) \ast \alpha^c + p_g \ast \alpha^g
$$
where the copy distribution $\alpha^c$ is obtained from the attention weights $\{ \alpha_i^c = \alpha(\hat{\textbf{y}}_j, \tilde{\textbf{v}}_i) \mid \tilde{\textbf{v}}_i \in \textbf{C} \}$ over the graph nodes.
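A sketch of the copy-vs-generate mixture (Eqs. 16-18); the dot-product copy attention and the node-to-vocabulary scattering are my own illustrative choices, since the exact form of $\alpha(\cdot,\cdot)$ is not specified here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CopyGenerator(nn.Module):
    """Mixes the vocabulary distribution with a copy distribution over graph nodes (Eqs. 16-18)."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.W_o = nn.Linear(d_model, vocab_size, bias=False)   # Eq. 16
        self.W_c = nn.Linear(d_model, 1)                        # Eq. 17 (bias plays the role of b_c)

    def forward(self, y_hat, C, node_vocab_ids):
        # y_hat: (d,) decoder state; C: (m, d) node reps; node_vocab_ids: (m,) LongTensor of vocab ids
        alpha_g = F.softmax(self.W_o(y_hat), dim=-1)            # generation distribution
        p_g = torch.sigmoid(self.W_c(y_hat))                    # copy-vs-generate gate
        alpha_att = F.softmax(C @ y_hat, dim=-1)                # illustrative copy weights over nodes
        alpha_c = torch.zeros_like(alpha_g).index_add_(0, node_vocab_ids, alpha_att)
        return (1 - p_g) * alpha_c + p_g * alpha_g              # Eq. 18
```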
For most dialogue generation tasks, the optimization objective is the negative log-likelihood:
$$
\tag{19} \mathcal{L}_{gen} = - \log p(y_j \mid y_{<j}, \textbf{C})
$$
Here, they instead optimize a joint loss:
$$
\tag{20} \mathcal{L} = \gamma_1 \mathcal{L}_{emo} + \gamma_2 \mathcal{L}_{gen}
$$
where $\gamma_1 = \gamma_2 = 1$.
## Experimental Settings
### Data Preparation
They use the EmpatheticDialogues dataset.
### Baselines
- **Transformer**
- **EmoPrepend-1**: Transformer model which incorporates an additional supervised emotion classifier.
- **MoEL**: Transformer model which softly combines the response representations from different transformer decoders.
Ablation studies:
- **w/o CE**: The Know-EDG model without the Knowledge-Enhanced Context Encoder.
- **w/o EFAM**: The Know-EDG model without the Emotion-Focused Attention Mechanism.
### Implementation Details
### Evaluation Metrics
**Automatic Evaluations**
They adopt *Emotion Accuracy*, the agreement between the ground-truth emotion labels and the emotion labels predicted by the emotion identifier.
Perplexity measures the high-level general quality of the generation model. Distinct-1 / Distinct-2 is the proportion of the distinct unigrams / bigrams in all the generated results to indicate the diversity.
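Distinct-n can be computed with a few lines of Python (a common formulation; the corpus-level pooling of n-grams is my assumption):

```python
def distinct_n(responses, n: int) -> float:
    """Ratio of unique n-grams to total n-grams across all generated responses."""
    ngrams, total = set(), 0
    for resp in responses:
        tokens = resp.split()
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
            total += 1
    return len(ngrams) / total if total else 0.0

responses = ["i am so sorry to hear that", "that sounds great , congratulations !"]
print(distinct_n(responses, 1), distinct_n(responses, 2))
```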
**Human Evaluations**
## Results and Analysis


### External Knowledge Analysis

### Emotion-Focused Attention Analysis

### Case Study

## Conclusion
They propose a knowledge-enhanced framework, Know-EDG, to enhance the performance of empathetic dialogue generation.
They design an emotion-focused attention mechanism to exploit the emotional dependencies between the dialogue context and the target empathetic response.