###### tags: `PaperReview`
[Paper Link](https://www.isca-speech.org/archive/pdfs/interspeech_2022/maekaku22_interspeech.pdf)
# Attention Weight Smoothing Using Prior Distributions
> Takashi Maekaku, Yuya Fujita, Yifan Peng, Shinji Watanabe
>
> Yahoo Japan Corporation, Tokyo, Japan
> Carnegie Mellon University, PA, USA
>
> *Interspeech 2022*
## Introduction
- Transformer-based encoder-decoder models have been widely used for end-to-end automatic speech recognition.
- However, it has been found that the self-attention weight matrix can be too peaky and biased toward the diagonal component.

- They propose two attention weight smoothing methods, based on the hypothesis that an attention weight matrix whose diagonal components are not peaky can capture more contextual information.
## Related work
### 1. Transformer-based End-to-End ASR
#### Architecture
- Predict token IDs $Y = (y_l \in V \mid l = 1, \dots, L)$ with length $L$, where $V$ is a set of distinct tokens, given a sequence of speech features $X = (x_t \in \mathbb{R}^d \mid t = 1, \dots, T')$ with length $T'$, where $d$ is the dimension of an acoustic feature.
- The model consists of an encoder and a decoder network.
- First, $X$ is fed to a CNN to obtain a subsampled sequence $X' = (x'_t \mid t = 1, \dots, T)$, where $T\,(<T')$ is the length after subsampling.
- Then the encoder converts $X'$ into a sequence of latent representations $X_e$, and the decoder predicts the token $y_{(l+1)}$ given $X_e$ and the prefix tokens $(y_1, \dots, y_l)$.
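As a rough illustration of this data flow only (not the paper's actual model; the Conv1d front-end, layer sizes, and head counts below are arbitrary choices for the sketch), a minimal PyTorch skeleton might look like:

```python
import torch
import torch.nn as nn

class ToyEncoderDecoderASR(nn.Module):
    """Sketch of the pipeline: CNN subsampling -> encoder -> autoregressive decoder."""
    def __init__(self, d=80, d_att=256, n_heads=4, vocab=100):
        super().__init__()
        # Two stride-2 convolutions give roughly T = T'/4 subsampled frames.
        self.subsample = nn.Sequential(
            nn.Conv1d(d, d_att, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(d_att, d_att, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_att, n_heads, batch_first=True), num_layers=2)
        self.embed = nn.Embedding(vocab, d_att)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_att, n_heads, batch_first=True), num_layers=2)
        self.out = nn.Linear(d_att, vocab)

    def forward(self, x, y_prefix):
        # x: (B, T', d) acoustic features; y_prefix: (B, l) token IDs y_1..y_l
        x_sub = self.subsample(x.transpose(1, 2)).transpose(1, 2)   # (B, T, d_att)
        x_e = self.encoder(x_sub)                                   # latent sequence X_e
        L = y_prefix.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.decoder(self.embed(y_prefix), x_e, tgt_mask=causal)  # self + source-target attention
        return self.out(h)                                          # logits for the next tokens
```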
#### Self-attention mechanism
- The output of self-attention with input $X'$ is defined as:
$$\begin{aligned}
\operatorname{Att}\left(\mathbf{X}^{\prime}\right) & =\operatorname{softmax}\left(\frac{\left(\mathbf{X}^{\prime} \mathbf{W}^{\mathrm{q}}\right)\left(\mathbf{X}^{\prime} \mathbf{W}^{\mathrm{k}}\right)^{\mathrm{T}}}{\sqrt{d^{\mathrm{att}}}}\right) \mathbf{X}^{\prime} \mathbf{W}^{\mathrm{v}} \\
& =\mathbf{A} \mathbf{X}^{\prime} \mathbf{W}^{\mathrm{v}}
\end{aligned}
$$
where $\mathbf{W}^{\mathrm{q}}, \mathbf{W}^{\mathrm{k}}, \mathbf{W}^{\mathrm{v}} \in \mathbb{R}^{d^{\mathrm{att}} \times d^{\mathrm{att}}}$ are linear transformations that produce the sequences of queries, keys, and values, respectively, $d^{\mathrm{att}}$ is the dimension of the attention, $(\cdot)^{\mathrm{T}}$ denotes transposition, and $\mathbf{A} \in \mathbb{R}^{T \times T}$ is called the attention weight matrix.
- In the decoder, source-target attention is used in addition to self-attention. The only difference between these two attentions is that for source-target attention, the output of the previous layer of the decoder is used as the query input, and the output of the encoder $X_e$ is used for the key and value inputs.
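A minimal single-head sketch of this attention in PyTorch (names and shapes are my own; multi-head splitting and masking are omitted, and this is not ESPnet's actual implementation):

```python
import torch

def attention(query_in, kv_in, w_q, w_k, w_v):
    """Scaled dot-product attention; returns the output and the weight matrix A.

    For self-attention, query_in and kv_in are both X' (shape (T, d_att));
    for source-target attention, query_in is the previous decoder layer's
    output and kv_in is the encoder output X_e.
    """
    d_att = w_q.size(-1)
    q, k, v = query_in @ w_q, kv_in @ w_k, kv_in @ w_v
    a = torch.softmax(q @ k.transpose(-2, -1) / d_att ** 0.5, dim=-1)  # attention weight A
    return a @ v, a
```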
#### Training and decoding
- The objective function is defined as
$$\begin{aligned}
L & =-\log p\left(Y \mid X_{\mathrm{e}}\right) \\
& =-\log \prod_{l=1}^{L-1} p\left(y_{(l+1)} \mid y_{1: l}, X_{\mathrm{e}}\right),
\end{aligned}
$$
where $p(Y \mid X_{\mathrm{e}})$ is decomposed into the product of the decoder's emission probabilities at each time step.
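Concretely, this is just the summed negative log-probability of each next token under teacher forcing; a toy sketch with hypothetical shapes (not the actual training code, which additionally uses a CTC branch and label smoothing):

```python
import torch
import torch.nn.functional as F

# Hypothetical decoder logits for predicting y_2..y_L given the gold prefix (teacher forcing).
vocab, L = 100, 8
logits = torch.randn(L - 1, vocab)           # one row of logits per prediction step
targets = torch.randint(0, vocab, (L - 1,))  # the reference tokens y_2..y_L
loss = F.cross_entropy(logits, targets, reduction="sum")  # = -log p(Y | X_e)
```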
- In the decoding stage, the decoder network outputs each token in an autoregressive manner; beam search decoding is often adopted to find the most likely hypothesis $\hat{Y}$ as follows:
$$\hat{Y}=\underset{Y \in \mathcal{Y}^*}{\arg \max } \log p\left(Y \mid X_{\mathrm{e}}\right)
$$
where $\mathcal{Y}^*$ is the set of output hypotheses.
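A toy beam-search sketch over a hypothetical scoring callback (`step_log_probs` is an assumption of this sketch, not ESPnet's decoding API):

```python
def beam_search(step_log_probs, sos_id, eos_id, beam_size=10, max_len=100):
    """step_log_probs(prefix) is assumed to return {token_id: log p(token | prefix, X_e)}."""
    beams = [([sos_id], 0.0)]     # (hypothesis, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_log_probs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:             # every surviving hypothesis has emitted <eos>
            break
    return max(finished + beams, key=lambda c: c[1])  # the most likely hypothesis Y-hat
```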
## Uniform Smoothing and Proposed Smoothing Methods
### 1. Attention smoothing using uniform prior
- [Relaxed attention](https://arxiv.org/pdf/2107.01275) introduced a uniform distribution as a prior for source-target attention, with the intention of preventing the attention from being overly focused on the encoder outputs.
$$\mathbf{A}_{(l)}^{\text {uni }}=(1-\gamma) \mathbf{A}_{(l)}+\gamma \frac{1}{T}
$$
where $A_{(l)}$ is the $l$-th layer's attention weight, $\gamma$ is a tunable interpolation hyperparameter, and $T$ is the length of the subsampled sequence.
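As a sketch, this interpolation is a one-liner (assuming the attention weight tensor has shape `(..., T, T)`); since both $A_{(l)}$ and the uniform prior are row-stochastic, the smoothed weights still sum to 1 along each row:

```python
def uniform_smoothing(a, gamma):
    """A^uni_(l) = (1 - gamma) * A_(l) + gamma * (1 / T), applied row-wise over the key axis."""
    t = a.shape[-1]   # number of key positions (subsampled length T)
    return (1.0 - gamma) * a + gamma / t
```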
### 2. Attention smoothing using truncated prior
- Another possible approach is to learn the prior distribution from data, but this is difficult because the size of the attention map differs for each utterance.
- Hence, they introduce a band matrix $B \in \mathbb{R}^{T \times T}$ to realize a truncated prior for the self-attention weights.
- The matrix $B$ is generated from a $1 \times k$ tensor as follows: first, create a zero matrix $\mathbf{0} \in \mathbb{R}^{T \times (T+k-1)}$; then, in each row $t$ of $\mathbf{0}$, add the $1 \times k$ tensor to columns $t$ through $(t + k - 1)$; finally, the desired band matrix is obtained by extracting $T$ columns starting from the $\lceil k/2 \rceil$-th column of this matrix. The row-wise softmax of $B$ can then be regarded as a truncated prior distribution, and linear interpolation with $A_{(l)}$ is performed as follows:
$$\mathbf{A}_{(l)}^{\mathrm{bm}}=(1-\gamma) \mathbf{A}_{(l)}+\gamma \cdot \operatorname{softmax}\left(\mathbf{B}_{(l)}\right)
$$
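A sketch of this construction in PyTorch, following the steps above (whether the $1 \times k$ band values are learnable parameters is not restated in this summary, so they are passed in as a plain tensor here):

```python
import torch

def truncated_prior(band, T):
    """Build the T x T band matrix B from a length-k band tensor and return its row-wise softmax."""
    k = band.numel()
    padded = torch.zeros(T, T + k - 1)
    for t in range(T):                    # add the band to columns t .. t+k-1 of row t
        padded[t, t:t + k] = band
    start = (k + 1) // 2 - 1              # the ceil(k/2)-th column, 0-based
    b = padded[:, start:start + T]        # extract T columns -> band matrix B
    return torch.softmax(b, dim=-1)       # truncated prior distribution

def truncated_prior_smoothing(a, prior, gamma):
    """A^bm_(l) = (1 - gamma) * A_(l) + gamma * softmax(B_(l))."""
    return (1.0 - gamma) * a + gamma * prior
```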

### 3. Attention smoothing using previous layer’s attention
- The above method has a disadvantage in that it is less flexible because it uses the same prior distribution for all utterances.
- Hence, they propose an alternative smoothing method to use the attention weights of the previous layer as a prior.
- They investigate three variants of this smoothing technique:
#### 3.1. Non-recursive Smoothing:
- Non-recursive attention smoothing in the $l$-th layer simply performs a linear interpolation between the original $l$-th layer attention weight and the $(l-1)$-th layer attention weight (a code sketch covering both the non-recursive and recursive variants is given after 3.2):
$$
\mathbf{A}_{(l)}^{\text {nonrec }}=(1-\gamma) \mathbf{A}_{(l)}+\gamma \cdot \mathbf{A}_{(l-1)}
$$
#### 3.2. Recursive Smoothing:
- Recursive smoothing instead applies the linear interpolation recursively to the attention weights as follows:
$$
\left\{\begin{array}{l}
\mathbf{A}_{(1)}^{\text {rec }}=(1-\gamma) \mathbf{A}_{(1)}+\gamma \cdot \mathbf{A}_{(0)} \\
\mathbf{A}_{(l)}^{\text {rec }}=(1-\gamma) \mathbf{A}_{(l)}+\gamma \cdot \mathbf{A}_{(l-1)}^{\text {rec }} \text { for } l>1
\end{array}\right.$$
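A sketch covering both variants (the choice of $A_{(0)}$, i.e., the prior used for the first layer, is not restated in this summary, so it is passed in explicitly):

```python
def smooth_across_layers(attn_per_layer, a0, gamma, recursive=True):
    """Non-recursive / recursive smoothing of per-layer attention weights.

    attn_per_layer: list [A_(1), ..., A_(N)] of (T, T) attention weight matrices.
    a0: the prior A_(0) used for the first layer.
    """
    smoothed, prev = [], a0
    for a in attn_per_layer:
        a_s = (1.0 - gamma) * a + gamma * prev
        smoothed.append(a_s)
        # Recursive: the next layer interpolates with the already-smoothed weight;
        # non-recursive: it interpolates with the raw previous-layer weight.
        prev = a_s if recursive else a
    return smoothed
```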
#### 3.3. Prediction of Interpolation Coefficient:
- Instead of tuning $\gamma$ as a hyperparameter, they propose to predict it depending on the input as follows:
$$
\left\{\begin{aligned}
\mathbf{g}_{(l)} & =\sigma\left(\mathbf{X}_{(l)}^{\prime} \mathbf{W}_{(l)}^{\mathrm{q}} \mathbf{c}_{(l)}\right), \\
\mathbf{A}_{(l)}^{\mathrm{rec}} & =\left(\mathbf{1}-\mathbf{g}_{(l)}\right) \odot \mathbf{A}_{(l)}+\mathbf{g}_{(l)} \odot \mathbf{A}_{(l-1)}^{\mathrm{rec}},
\end{aligned}\right.$$
where $\sigma(\cdot)$ denotes the sigmoid function and $\mathbf{c}_{(l)} \in \mathbb{R}^{d^{\mathrm{att}}}$ is a learnable vector. $\odot$ represents a row-wise product between $g_{(l)}(i)$ and the $i$-th row of the multiplied matrix. Making $\gamma$ predictable in this way not only eliminates the need to tune it but also allows the value to be adjusted at each time step. Note that $\mathbf{g}$ is different for each head, though this is omitted for simplicity.
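A sketch of this gating, assuming a single head and $X'_{(l)}$ of shape `(T, d_att)`:

```python
import torch

def gated_recursive_smoothing(x_l, w_q, c, a_l, a_prev_rec):
    """Predict a per-time-step coefficient g_(l) and mix the attention weights row-wise.

    x_l: (T, d_att) layer input; w_q: (d_att, d_att) query projection;
    c: (d_att,) learnable vector; a_l, a_prev_rec: (T, T) attention weights.
    """
    g = torch.sigmoid(x_l @ w_q @ c)        # (T,): one coefficient per time step (row)
    g = g.unsqueeze(-1)                     # broadcast over each row of A
    return (1.0 - g) * a_l + g * a_prev_rec
```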
## Experiments
### Experimental setup
- Trained E2E ASR models using the **100-hour subset of clean audio from LibriSpeech and the 81-hour training set of Wall Street Journal (WSJ)**.
- All models are based on the **Transformer architecture implemented in the ESPnet toolkit**.
- All models were trained with the **joint CTC and attention objectives** with a **multi-task loss weight of 0.3**.
- The **hyperparameters** for the Transformer-related architecture are shown in **Table 1**. They followed the **same setup as the *librispeech_100h* and *wsj* recipes in ESPnet** for the regularization hyperparameters (e.g., dropout rate, learning rate, label-smoothing weight, and optimizer).

- For data augmentation, they used **speed perturbation** with ratios of **0.9, 1.0, and 1.1**.
- **SpecAugment** was applied only when training on **LibriSpeech**. They trained the models for **70 and 100 epochs** on **LibriSpeech and WSJ**, respectively.
- During inference, **model averaging** was performed over the models from the **last 10 epochs**, the **CTC weight** was set to **0.3**, the **beam size** was set to **10 for both corpora**, and **no external language model (LM)** was used during decoding, to simplify the experimental investigation.
### Results and discussion
#### Non-recursive vs. recursive smoothing

- From Table 2, we can see that recursive smoothing brings a particularly large improvement on "test-other".
- This indicates that a more reliable prior distribution is obtained by incorporating information from all layers while emphasizing information from the previous layer. Therefore, recursive smoothing is used in the subsequent experiments.
#### Comparison of baseline and proposed methods

- Relaxed attention applies smoothing only during training, whereas the proposed methods apply smoothing during both training and inference.
- Recursive smoothing is effective when applied to source-target attention as well as to self-attention.
- In the case of WSJ, making $\gamma$ predictable and applying smoothing to all attentions in the encoder and decoder was effective for improving performance.
- The smoothing with the truncated prior outperformed the two baseline systems in some cases, even though the improvements were small.
- Therefore, it is confirmed that both of the proposed methods contributed to improving ASR performance.
#### Attention analysis

- Fig. 3 (a) shows extremely sharp attention weights concentrated on the diagonal component, which indicates that little useful context information is captured.
- The attention weights in Fig. 3 (b) and (c) have diagonal components that are not as sharp as those in (a).
- This result suggests that recursive smoothing keeps the diagonal components of the attention weight matrix from becoming peaky.
## Conclusion
- The authors proposed two novel attention weight smoothing methods using prior distributions for Transformer-based end-to-end automatic speech recognition.
- Experimental results showed relative improvements of up to 2.9% and 7.9% on LibriSpeech and Wall Street Journal, respectively, compared to a vanilla Transformer model.
- Evaluating the performance with an external LM is left as future work, since relaxed attention has been reported to improve performance when combined with an LM.