<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/abs/1811.01063) | [Code link](https://github.com/nouhadziri/THRED) | ACL 2019
:::success
**Thoughts**
They use both the conversation utterances and topic information to generate responses.
Contributions:
- A new topical hierarchical approach (THRED)
- Two new metrics for evaluating the quality of the responses
- A clean dataset that improves generated responses
:::
## Abstract
This paper introduces a Topical Hierarchical Recurrent Encoder Decoder (THRED), a novel, fully data-driven, multi-turn response generation system intended to produce contextual and topic-aware responses.
Their hierarchical joint attention mechanism incorporates topical concepts and previous interactions into the response generation.
## Introduction
The Sequence-to-Sequence (Seq2Seq) neural network model has brought substantial breakthroughs in the performance of conversational agents. Such a model succeeds in learning the backbone of the conversation but lacks any aptitude for producing context-sensitive and diverse responses.
They speculate that incorporating conversation history and topic information through their novel model will improve the generated conversational responses.
**Their model builds upon the basic Seq2Seq model by combining conversational data and external knowledge information trained through a hierarchical joint attention neural model.**
---
The key contributions of this work are:
- A fully data-driven neural conversational model that leverages conversation history and topic information
- Two novel automated metrics:
- Semantic Similarity
- Response Echo Index
- A conversational dataset built from Reddit comments
## Topical Hierarchical Recurrent Encoder Decoder
Topical Hierarchical Recurrent Encoder Decoder (THRED) can be viewed as a hybrid model that conditions the response generation on conversation history captured from **previous utterances** and on **topic words** acquired from a Latent Dirichlet Allocation (LDA) model.
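The topic words are obtained from a pretrained LDA model run over the conversation history. The gensim-based sketch below is only an illustration of that step, assuming the LDA model has already been trained; the dictionary, corpus, and number of topics are placeholders, not the paper's settings.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def topic_words_for_history(history_tokens, lda: LdaModel, dictionary: Dictionary, n: int = 100):
    """Infer the dominant topic of the conversation history and return its n most probable words."""
    bow = dictionary.doc2bow([w for utt in history_tokens for w in utt])
    topic_id, _ = max(lda.get_document_topics(bow), key=lambda pair: pair[1])
    return [word for word, _ in lda.show_topic(topic_id, topn=n)]

# Illustrative setup (tokenized_docs would be the documents used to fit the LDA model):
# dictionary = Dictionary(tokenized_docs)
# corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
# lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=100)
```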

### Message Encoder
Let a dialogue $D = \{ U_1, \dots, U_N \}$ be a sequence of $N$ utterances. Every utterance $U_i = \{ w_{i,1}, \dots, w_{i, L_i} \}$ contains a variable number $L_i$ of words, where $w_{i, k}$ represents the word embedding vector at position $k$ in utterance $U_i$.
The message encoder, a bidirectional GRU-RNN, updates its hidden state at every time step $t$ according to:
$$
\tag{1} h_{i, t} = GRU(h_{i, t-1}, w_{i, t}), \ \forall t \in \{ 1, \dots, L_i \}
$$
where $h_{i, t-1}$ represents the previous hidden state.
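Below is a minimal PyTorch sketch of this word-level (message) encoder, assuming pre-computed word-embedding inputs; the class name and dimensions are illustrative and not taken from the THRED codebase.

```python
import torch
import torch.nn as nn

class MessageEncoder(nn.Module):
    """Bidirectional GRU over the words of a single utterance (Eq. 1)."""

    def __init__(self, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, word_embeddings: torch.Tensor) -> torch.Tensor:
        # word_embeddings: (batch, L_i, embed_dim) for one utterance U_i
        # returns h_{i,1..L_i}: (batch, L_i, 2 * hidden_dim)
        hidden_states, _ = self.gru(word_embeddings)
        return hidden_states
```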
### Message Attention
The message attention in THRED operates by putting more focus on the salient input words with regard to the output.
$$
\tag{2} m_{i, t} = \sum_{j=1}^{L_i} \alpha_{i, j, t} h_{i, j}, \ \forall i \in \{ 1, \dots, N \}
$$
where $\alpha_{i, j, t}$ is computed as:
$$
\alpha_{i, j, t} = \frac{\exp(e_{i, j, t})}{\sum_{k=1}^{L_i} \exp(e_{i, k, t})}
$$
$$
e_{i, j, t} = \eta (s_{t-1}, h_{i, j}, c_{i-1, t})
$$
where $c_{i-1, t}$ is the hidden state of the context-level encoder from Eq. $(3)$ and $\eta$ is a multi-layer perceptron with tanh activation.
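A hedged PyTorch sketch of this word-level attention follows; realizing $\eta$ as a two-layer tanh network is one plausible reading of the paper's description, not its exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MessageAttention(nn.Module):
    """Attention over the words of utterance U_i (Eq. 2), scored by an MLP eta."""

    def __init__(self, enc_dim: int, dec_dim: int, ctx_dim: int, attn_dim: int):
        super().__init__()
        # eta(s_{t-1}, h_{i,j}, c_{i-1,t}) -> scalar score e_{i,j,t}
        self.eta = nn.Sequential(
            nn.Linear(dec_dim + ctx_dim + enc_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, s_prev, h_i, c_prev):
        # s_prev: (batch, dec_dim)      decoder state s_{t-1}
        # h_i:    (batch, L_i, enc_dim) word-level hidden states of U_i
        # c_prev: (batch, ctx_dim)      context-level state c_{i-1,t}
        L_i = h_i.size(1)
        query = torch.cat([s_prev, c_prev], dim=-1).unsqueeze(1).expand(-1, L_i, -1)
        scores = self.eta(torch.cat([query, h_i], dim=-1)).squeeze(-1)  # e_{i,j,t}
        alpha = F.softmax(scores, dim=-1)                               # alpha_{i,j,t}
        m_i = torch.bmm(alpha.unsqueeze(1), h_i).squeeze(1)             # m_{i,t} (Eq. 2)
        return m_i, alpha
```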
### Context-Level Encoder
The context-level encoder takes as input each utterance representation ($m_{1,t}, \dots, m_{N,t}$) and calculates the sequence of recurrent hidden states as:
$$
\tag{3} c_{i, t} = GRU(c_{i-1, t}, m_{i,t}), \ \forall i \in \{ 1, \dots, N \}
$$
### Context-Topic Joint Attention
**Context Attention**
$$
\tag{4} r_t = \sum_{j=1}^N \gamma_{j, t} c_{j, t}
$$
where:
$$
\tag{5} \gamma_{j, t} = \frac{\exp (e^\prime_{j, t})}{\sum_{i=1}^N \exp (e^\prime_{i, t})}, \quad e_{j, t}^\prime = \eta(s_{t-1}, c_{j, t})
$$
**Topic Attention**
After acquiring topic words for the entire conversation history, they pick the $n$ most probable words under the inferred topic $T$ (here $n = 100$). The topic words $\{ t_1, \dots, t_n \}$ are then linearly combined to form a fixed-length vector $k$.
The weight values are calculated as:
$$
\tag{6} \beta_{i, t} = \frac{\exp (\eta (s_{t-1}, t_i, c_{N, t}))}{\sum_{j=1}^n \exp (\eta(s_{t-1}, t_j, c_{N, t}))}
$$
where $i \in \{ 1, \dots, n \}$ and $c_{N, t}$ is the last hidden state of the context-level encoder.
In short, the topic words are combined into a topic vector $k$ that serves as prior knowledge for response generation.
The key idea of this approach is to steer the generation process so that the model does not need to learn the same conversational pattern for every utterance; instead, it enriches responses with topic words related to the subject of the message, even if those words never appeared in the conversation.
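The sketch below combines the context attention (Eqs. 4-5) and topic attention (Eq. 6) in one module; the two MLP scorers and their dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextTopicJointAttention(nn.Module):
    """Context attention (Eqs. 4-5) and topic attention (Eq. 6)."""

    def __init__(self, ctx_dim: int, dec_dim: int, topic_dim: int, attn_dim: int):
        super().__init__()
        # eta(s_{t-1}, c_{i,t}) for the context scores e'_{i,t}
        self.eta_ctx = nn.Sequential(
            nn.Linear(dec_dim + ctx_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1))
        # eta(s_{t-1}, t_i, c_{N,t}) for the topic scores
        self.eta_topic = nn.Sequential(
            nn.Linear(dec_dim + ctx_dim + topic_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1))

    def forward(self, s_prev, context_states, topic_words):
        # s_prev:         (batch, dec_dim)       decoder state s_{t-1}
        # context_states: (batch, N, ctx_dim)    c_{1,t} ... c_{N,t}
        # topic_words:    (batch, n, topic_dim)  embeddings of the n topic words
        N, n = context_states.size(1), topic_words.size(1)

        # Context vector r_t (Eqs. 4-5)
        q_ctx = s_prev.unsqueeze(1).expand(-1, N, -1)
        gamma = F.softmax(self.eta_ctx(torch.cat([q_ctx, context_states], dim=-1)).squeeze(-1), dim=-1)
        r_t = torch.bmm(gamma.unsqueeze(1), context_states).squeeze(1)

        # Topic vector k (Eq. 6), conditioned on the last context state c_{N,t}
        c_last = context_states[:, -1, :]
        q_topic = torch.cat([s_prev, c_last], dim=-1).unsqueeze(1).expand(-1, n, -1)
        beta = F.softmax(self.eta_topic(torch.cat([q_topic, topic_words], dim=-1)).squeeze(-1), dim=-1)
        k = torch.bmm(beta.unsqueeze(1), topic_words).squeeze(1)
        return r_t, k
```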
### Decoder
They add an extra probability term to the standard generation probability, forcing the model to account for topical tokens.
$$
\tag{7} p(w_i) = p_V(w_i) + p_K(w_i)
$$
where $K$ and $V$ represent respectively topic vocabulary and response vocabulary.
$$
p_V(w_i) = \frac{1}{M} \exp (\sigma_V (s_i, w_{i-1}))
$$
$$
p_K(w_i) = \frac{1}{M} \exp (\sigma_K (s_i, w_{i-1}, r_i))
$$
where $s_i = GRU(w_{i-1}, s_{i-1}, r_i, k)$ is the decoder hidden state and $\sigma$ is the tanh activation.
$$
M = \sum_{v \in V} \exp(\sigma_V (s_i, w_{i-1})) + \sum_{v^\prime \in K} \exp(\sigma_K (s_i, w_{i-1}, r_i))
$$

## Datasets
One of the main weaknesses of dialogue systems stems from the paucity of high-quality conversational datasets. To enable the study of high-quality, large-scale dialogue modeling, they collected a corpus of 35M conversations drawn from Reddit data.
## Experiments
They compare THRED against three open-source baselines:
- Standard Seq2Seq with attention mechanism
- HRED
- Topic-Aware (TA) Seq2Seq

For Standard Seq2Seq and TA-Seq2Seq, the dialogue history is concatenated into a single utterance to account for context in a multi-turn conversation.
All experiments are conducted on two datasets (i.e., Reddit and OpenSubtitles).
They use word perplexity (PPL) as a metric; it captures how likely the responses are under the generation probability distribution, but it does not measure the diversity or engagingness of the responses.

### Semantic Similarity
Semantic Similarity (SS) metric estimates the correspondence between the utterances in the context and the generated response.
To render the semantic representation of an utterance, they leverage Universal Sentence Encoder wherein a sentence is projected to a fixed dimensional embedding vector.
The penalized Semantic Similarity score is therefore defined as:
$$
SS(utt_{i, j}, resp_i) = P \times (1 - \cos(\vec{utt}_{i, j}, \vec{resp}_{i}))
$$
where $i$ indexes the dialogue in the test set and $j$ indexes the utterance in the conversation history. The penalty factor is $P = 1 + \log \frac{2 + L^\prime}{2 + L^{\prime \prime}}$, where $L^\prime$ is the length of the response after dropping stop words and punctuation, and $L^{\prime \prime}$ is the length of the non-dull part of the response after dropping stop words.
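A hedged sketch of the penalized score; the `embed` function (e.g. the Universal Sentence Encoder), the stop-word set, and the dull-phrase heuristic for $L^{\prime\prime}$ are all stand-ins, not the paper's exact implementation.

```python
import numpy as np

# Illustrative stop-word and dull-phrase lists; the paper's exact lists are not reproduced here.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "i", "you"}
DULL_PHRASES = ("i don't know", "i am not sure", "i'm not sure")

def semantic_similarity(utt: str, resp: str, embed) -> float:
    """Penalized Semantic Similarity between a context utterance and a generated response.

    `embed` maps a sentence to a fixed-length vector (e.g. Universal Sentence Encoder).
    """
    content = [w for w in resp.lower().split() if w.strip(".,!?") not in STOP_WORDS]
    L1 = len(content)                                  # L': content length of the response
    dull = any(p in resp.lower() for p in DULL_PHRASES)
    L2 = 0 if dull else L1                             # L'': non-dull content length (crude proxy)
    penalty = 1.0 + np.log((2 + L1) / (2 + L2))        # P

    u, r = np.asarray(embed(utt)), np.asarray(embed(resp))
    cosine = float(u @ r / (np.linalg.norm(u) * np.linalg.norm(r)))
    return penalty * (1.0 - cosine)
```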

### Response Echo Index
The goal of the Response Echo Index (REI) metric is to detect overfitting to the training dataset.
$$
REI (resp_i) = \max_{utt_m \in \mathbb{T}_{0.1}} \mathcal{J} (\overline{resp_i}, \overline{utt_m})
$$
where $\overline{t}$ is the normalized form of text $t$, $\mathbb{T}_{0.1}$ denotes the sampled training data, and $\mathcal{J}$ is the Jaccard similarity function.
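A sketch of REI, assuming simple whitespace tokenization as the normalization step and a uniform 10% sample of the training utterances.

```python
import random

def normalize(text: str) -> set:
    """Normalized token set: lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,!?\"'") for w in text.lower().split() if w.strip(".,!?\"'")}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def response_echo_index(resp: str, train_utterances: list, sample_ratio: float = 0.1) -> float:
    """REI: maximum Jaccard overlap between the response and sampled training utterances."""
    k = max(1, int(sample_ratio * len(train_utterances)))
    sample = random.sample(train_utterances, k)
    resp_tokens = normalize(resp)
    return max(jaccard(resp_tokens, normalize(u)) for u in sample)
```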

### Human Evaluation

### Comparing Datasets

## Conclusion
In this work, they introduce the Topical Hierarchical Recurrent Encoder Decoder (THRED) model for generating topically consistent responses in multi-turn open conversations.
Additionally, they evaluate their new model and existing models with two new metrics, which prove to be good measures for automatically evaluating the quality of the responses.
Finally, they present a parsed and cleaned dataset based on conversations from Reddit which improves generated responses.