###### tags: `PaperReview`
[Paper Link](https://www.isca-speech.org/archive/pdfs/interspeech_2022/maekaku22_interspeech.pdf)
# Attention Weight Smoothing Using Prior Distributions
> Takashi Maekaku, Yuya Fujita, Yifan Peng, Shinji Watanabe
>
> Yahoo Japan Corporation, Tokyo, Japan
> Carnegie Mellon University, PA, USA
>
> *Interspeech 2022*
## Introduction
- Transformer-based encoder-decoder models have been widely used for end-to-end automatic speech recognition.
- However, it has been found that the self-attention weight matrix can be too peaky and biased toward the diagonal component.

- They propose two attention weight smoothing methods, based on the hypothesis that an attention weight matrix whose diagonal components are not peaky can capture more contextual information.
## Related work
### 1. Transformer-based End-to-End ASR
#### Architecture
- Predict token IDs $Y = (y_l \in V \mid l = 1, \dots, L)$ with length $L$, where $V$ is a set of distinct tokens, given a sequence of speech features $X = (x_t \in \mathbb{R}^d \mid t = 1, \dots, T')$ with length $T'$, where $d$ is the dimension of an acoustic feature.
- The model consists of an encoder and a decoder network.
- First, $X$ is fed to a CNN to obtain a subsampled sequence $X' = (x'_t \mid t = 1, \dots, T)$, where $T\,(<T')$ is the length after subsampling.
- Then the encoder converts $X'$ into a sequence of latent representations $X_e$, and the decoder predicts the token $y_{(l+1)}$ given $X_e$ and the prefix tokens $(y_1, \dots, y_l)$.
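As a rough illustration of this data flow only (not the paper's actual model; the Conv1d front-end, layer sizes, and head counts below are arbitrary choices for the sketch), a minimal PyTorch skeleton might look like:

```python
import torch
import torch.nn as nn

class ToyEncoderDecoderASR(nn.Module):
    """Sketch of the pipeline: CNN subsampling -> encoder -> autoregressive decoder."""
    def __init__(self, d=80, d_att=256, n_heads=4, vocab=100):
        super().__init__()
        # Two stride-2 convolutions give roughly T = T'/4 subsampled frames.
        self.subsample = nn.Sequential(
            nn.Conv1d(d, d_att, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(d_att, d_att, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_att, n_heads, batch_first=True), num_layers=2)
        self.embed = nn.Embedding(vocab, d_att)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_att, n_heads, batch_first=True), num_layers=2)
        self.out = nn.Linear(d_att, vocab)

    def forward(self, x, y_prefix):
        # x: (B, T', d) acoustic features; y_prefix: (B, l) token IDs y_1..y_l
        x_sub = self.subsample(x.transpose(1, 2)).transpose(1, 2)   # (B, T, d_att)
        x_e = self.encoder(x_sub)                                   # latent sequence X_e
        L = y_prefix.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.decoder(self.embed(y_prefix), x_e, tgt_mask=causal)  # self + source-target attention
        return self.out(h)                                          # logits for the next tokens
```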
#### Self-attention mechanism
- The output of self-attention with input $X'$ is defined as:
$$\begin{aligned}
\operatorname{Att}\left(\mathbf{X}^{\prime}\right) & =\operatorname{softmax}\left(\frac{\left(\mathbf{X}^{\prime} \mathbf{W}^{\mathrm{q}}\right)\left(\mathbf{X}^{\prime} \mathbf{W}^{\mathrm{k}}\right)^{\mathrm{T}}}{\sqrt{d^{\mathrm{att}}}}\right) \mathbf{X}^{\prime} \mathbf{W}^{\mathrm{v}} \\
& =\mathbf{A} \mathbf{X}^{\prime} \mathbf{W}^{\mathrm{v}}
\end{aligned}
$$
where $\mathbf{W}^{\mathrm{q}}, \mathbf{W}^{\mathrm{k}}, \mathbf{W}^{\mathrm{v}} \in \mathbb{R}^{d^{\mathrm{att}} \times d^{\mathrm{att}}}$ are linear transformations that produce the sequences of queries, keys, and values, respectively, $d^{\mathrm{att}}$ is the dimension of the attention, $(\cdot)^{\mathrm{T}}$ denotes transposition, and $\mathbf{A} \in \mathbb{R}^{T \times T}$ is called the attention weight matrix.
- In the decoder, source-target attention is used in addition to self-attention. The only difference between these two attentions is that for source-target attention, the output of the previous layer of the decoder is used as the query input, and the output of the encoder $X_e$ is used for the key and value inputs.
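A minimal single-head sketch of this attention in PyTorch (names and shapes are my own; multi-head splitting and masking are omitted, and this is not ESPnet's actual implementation):

```python
import torch

def attention(query_in, kv_in, w_q, w_k, w_v):
    """Scaled dot-product attention; returns the output and the weight matrix A.

    For self-attention, query_in and kv_in are both X' (shape (T, d_att));
    for source-target attention, query_in is the previous decoder layer's
    output and kv_in is the encoder output X_e.
    """
    d_att = w_q.size(-1)
    q, k, v = query_in @ w_q, kv_in @ w_k, kv_in @ w_v
    a = torch.softmax(q @ k.transpose(-2, -1) / d_att ** 0.5, dim=-1)  # attention weight A
    return a @ v, a
```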
#### Training and decoding
- The objective function is defined as
$$\begin{aligned}
L & =-\log p\left(Y \mid X_{\mathrm{e}}\right) \\
& =-\log \prod_{l=1}^{L-1} p\left(y_{(l+1)} \mid y_{1: l}, X_{\mathrm{e}}\right),
\end{aligned}
$$
where $p(Y \mid X_{\mathrm{e}})$ is decomposed into the product of the decoder's emission probabilities at each time step.
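Concretely, this is just the summed negative log-probability of each next token under teacher forcing; a toy sketch with hypothetical shapes (not the actual training code, which additionally uses a CTC branch and label smoothing):

```python
import torch
import torch.nn.functional as F

# Hypothetical decoder logits for predicting y_2..y_L given the gold prefix (teacher forcing).
vocab, L = 100, 8
logits = torch.randn(L - 1, vocab)           # one row of logits per prediction step
targets = torch.randint(0, vocab, (L - 1,))  # the reference tokens y_2..y_L
loss = F.cross_entropy(logits, targets, reduction="sum")  # = -log p(Y | X_e)
```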
- In the decoding stage, the decoder network outputs each token in an autoregressive manner; beam search decoding is often adopted to find the most likely hypothesis $\hat{Y}$ as follows:
$$\hat{Y}=\underset{Y \in \mathcal{Y}^*}{\arg \max } \log p\left(Y \mid X_{\mathrm{e}}\right)
$$
where $\mathcal{Y}^*$ is the set of output hypotheses.
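A toy beam-search sketch over a hypothetical scoring callback (`step_log_probs` is an assumption of this sketch, not ESPnet's decoding API):

```python
def beam_search(step_log_probs, sos_id, eos_id, beam_size=10, max_len=100):
    """step_log_probs(prefix) is assumed to return {token_id: log p(token | prefix, X_e)}."""
    beams = [([sos_id], 0.0)]     # (hypothesis, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_log_probs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:             # every surviving hypothesis has emitted <eos>
            break
    return max(finished + beams, key=lambda c: c[1])  # the most likely hypothesis Y-hat
```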
## Uniform Smoothing and Proposed Smoothing Methods
### 1. Attention smoothing using uniform prior
- [Relaxed attention](https://arxiv.org/pdf/2107.01275) introduced a uniform distribution as a prior for source-target attention, with the intention of preventing the attention from being overly focused on the encoder outputs.
$$\mathbf{A}_{(l)}^{\text {uni }}=(1-\gamma) \mathbf{A}_{(l)}+\gamma \frac{1}{T}
$$
where $A_{(l)}$ is the $l$-th layer's attention weight, $\gamma$ is a tunable interpolation hyperparameter, and $T$ is the length of the subsampled sequence.
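As a sketch, this interpolation is a one-liner (assuming the attention weight tensor has shape `(..., T, T)`); since both $A_{(l)}$ and the uniform prior are row-stochastic, the smoothed weights still sum to 1 along each row:

```python
def uniform_smoothing(a, gamma):
    """A^uni_(l) = (1 - gamma) * A_(l) + gamma * (1 / T), applied row-wise over the key axis."""
    t = a.shape[-1]   # number of key positions (subsampled length T)
    return (1.0 - gamma) * a + gamma / t
```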
### 2. Attention smoothing using truncated prior
- Another possible approach is to learn the prior distribution from data, but this is difficult because the size of the attention map differs for each utterance.
- Hence, they introduce a band matrix $B \in \mathbb{R}^{T \times T}$ to realize a truncated prior for the self-attention weights.
- The matrix $B$ is generated from a $1 \times k$ tensor as follows: first, create a zero matrix $\mathbf{0} \in \mathbb{R}^{T \times (T+k-1)}$; then, in each row $t$ of $\mathbf{0}$, add the $1 \times k$ tensor to columns $t$ through $(t + k - 1)$; finally, the desired band matrix is obtained by extracting $T$ columns starting from the $\lceil k/2 \rceil$-th column of this matrix. The row-wise softmax of $B$ can then be regarded as a truncated prior distribution, and linear interpolation with $A_{(l)}$ is performed as follows:
$$\mathbf{A}_{(l)}^{\mathrm{bm}}=(1-\gamma) \mathbf{A}_{(l)}+\gamma \cdot \operatorname{softmax}\left(\mathbf{B}_{(l)}\right)
$$
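A sketch of this construction in PyTorch, following the steps above (whether the $1 \times k$ band values are learnable parameters is not restated in this summary, so they are passed in as a plain tensor here):

```python
import torch

def truncated_prior(band, T):
    """Build the T x T band matrix B from a length-k band tensor and return its row-wise softmax."""
    k = band.numel()
    padded = torch.zeros(T, T + k - 1)
    for t in range(T):                    # add the band to columns t .. t+k-1 of row t
        padded[t, t:t + k] = band
    start = (k + 1) // 2 - 1              # the ceil(k/2)-th column, 0-based
    b = padded[:, start:start + T]        # extract T columns -> band matrix B
    return torch.softmax(b, dim=-1)       # truncated prior distribution

def truncated_prior_smoothing(a, prior, gamma):
    """A^bm_(l) = (1 - gamma) * A_(l) + gamma * softmax(B_(l))."""
    return (1.0 - gamma) * a + gamma * prior
```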

### 3. Attention smoothing using previous layer’s attention
- The above method has a disadvantage in that it is less flexible because it uses the same prior distribution for all utterances.
- Hence, they propose an alternative smoothing method to use the attention weights of the previous layer as a prior.
- They investigate three variants of this smoothing technique:
#### 3.1. Non-recursive Smoothing:
- Non-recursive attention smoothing in the $l$-th layer simply performs a linear interpolation between the original $l$-th layer attention weight and the $(l-1)$-th layer attention weight (a code sketch covering both the non-recursive and recursive variants is given after 3.2):
$$
\mathbf{A}_{(l)}^{\text {nonrec }}=(1-\gamma) \mathbf{A}_{(l)}+\gamma \cdot \mathbf{A}_{(l-1)}
$$
#### 3.2. Recursive Smoothing:
- Recursive smoothing instead applies the linear interpolation recursively to the attention weights as follows:
$$
\left\{\begin{array}{l}
\mathbf{A}_{(1)}^{\text {rec }}=(1-\gamma) \mathbf{A}_{(1)}+\gamma \cdot \mathbf{A}_{(0)} \\
\mathbf{A}_{(l)}^{\text {rec }}=(1-\gamma) \mathbf{A}_{(l)}+\gamma \cdot \mathbf{A}_{(l-1)}^{\text {rec }} \text { for } l>1
\end{array}\right.$$
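A sketch covering both variants (the choice of $A_{(0)}$, i.e., the prior used for the first layer, is not restated in this summary, so it is passed in explicitly):

```python
def smooth_across_layers(attn_per_layer, a0, gamma, recursive=True):
    """Non-recursive / recursive smoothing of per-layer attention weights.

    attn_per_layer: list [A_(1), ..., A_(N)] of (T, T) attention weight matrices.
    a0: the prior A_(0) used for the first layer.
    """
    smoothed, prev = [], a0
    for a in attn_per_layer:
        a_s = (1.0 - gamma) * a + gamma * prev
        smoothed.append(a_s)
        # Recursive: the next layer interpolates with the already-smoothed weight;
        # non-recursive: it interpolates with the raw previous-layer weight.
        prev = a_s if recursive else a
    return smoothed
```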
#### 3.3. Prediction of Interpolation Coefficient:
- Instead of tuning $\gamma$ as a hyperparameter, they propose to predict it depending on the input as follows:
$$
\left\{\begin{aligned}
\mathbf{g}_{(l)} & =\sigma\left(\mathbf{X}_{(l)}^{\prime} \mathbf{W}_{(l)}^{\mathrm{q}} \mathbf{c}_{(l)}\right), \\
\mathbf{A}_{(l)}^{\mathrm{rec}} & =\left(\mathbf{1}-\mathbf{g}_{(l)}\right) \odot \mathbf{A}_{(l)}+\mathbf{g}_{(l)} \odot \mathbf{A}_{(l-1)}^{\mathrm{rec}},
\end{aligned}\right.$$
where $\sigma(\cdot)$ denotes the sigmoid function and $\mathbf{c}_{(l)} \in \mathbb{R}^{d^{\mathrm{att}}}$ is a learnable vector. $\odot$ represents a row-wise product between $g_{(l)}(i)$ and the $i$-th row of the multiplied matrix. Making $\gamma$ predictable in this way not only eliminates the need to tune it but also allows the value to be adjusted at each time step. Note that $\mathbf{g}$ is different for each head, though this is omitted for simplicity.
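A sketch of this gating, assuming a single head and $X'_{(l)}$ of shape `(T, d_att)`:

```python
import torch

def gated_recursive_smoothing(x_l, w_q, c, a_l, a_prev_rec):
    """Predict a per-time-step coefficient g_(l) and mix the attention weights row-wise.

    x_l: (T, d_att) layer input; w_q: (d_att, d_att) query projection;
    c: (d_att,) learnable vector; a_l, a_prev_rec: (T, T) attention weights.
    """
    g = torch.sigmoid(x_l @ w_q @ c)        # (T,): one coefficient per time step (row)
    g = g.unsqueeze(-1)                     # broadcast over each row of A
    return (1.0 - g) * a_l + g * a_prev_rec
```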
## Experiments
### Experimental setup
- Trained E2E ASR models using the **100-hour subset of clean audio from LibriSpeech and the 81-hour training set of Wall Street Journal (WSJ)**.
- All models are based on the **Transformer architecture implemented in the ESPnet toolkit**.
- All models were trained with the **joint CTC and attention objectives** with a **multi-task loss weight of 0.3**.
- The **hyperparameters** for the Transformer-related architecture are shown in **Table 1**. They followed the **same setup as the *librispeech_100h* and *wsj* recipes in ESPnet** for the regularization hyperparameters (e.g., dropout rate, learning rate, label-smoothing weight, and optimizer).

- For data augmentation, they used **speed perturbation** with ratios of **0.9, 1.0, and 1.1**.
- **SpecAugment** was applied only when training on **LibriSpeech**. They trained the models for **70 and 100 epochs** on **LibriSpeech and WSJ**, respectively.
- During inference, **model averaging** was performed over the models from the **last 10 epochs**, the **CTC weight** was set to **0.3**, the **beam size** was set to **10 for both corpora**, and **no external language model (LM)** was used during decoding, to simplify the experimental investigation.
### Results and discussion
#### Non-recursive vs. recursive smoothing

- From Table 2, we can see that recursive smoothing brings a particularly large improvement on "test-other".
- This indicates that a more reliable prior distribution is obtained by incorporating information from all layers while emphasizing information from the previous layer. Therefore, recursive smoothing is used in the subsequent experiments.
#### Comparison of baseline and proposed methods

- Relaxed attention applies smoothing only during training, whereas the proposed methods apply smoothing during both training and inference.
- Recursive smoothing is effective when applied to source-target attention as well as to self-attention.
- In the case of WSJ, making $\gamma$ predictable and applying smoothing to all attentions in the encoder and decoder was effective for improving performance.
- The smoothing with the truncated prior outperformed the two baseline systems in some cases, even though the improvements were small.
- Therefore, it is confirmed that both of the proposed methods contributed to improving ASR performance.
#### Attention analysis

- Fig. 3 (a) shows extremely sharp attention weights concentrated on the diagonal component, which indicates that little useful context information is captured.
- The attention weights in Fig. 3 (b) and (c) have diagonal components that are not as sharp as those in (a).
- This result suggests that recursive smoothing keeps the diagonal components of the attention weight matrix from becoming peaky.
## Conclusion
- The authors proposed two novel attention weight smoothing methods using prior distributions for Transformer-based end-to-end automatic speech recognition.
- Experimental results showed relative improvements of up to 2.9% and 7.9% on LibriSpeech and Wall Street Journal, respectively, compared to a vanilla Transformer model.
- Evaluating the performance with an external LM is left as future work, since relaxed attention has been reported to improve performance when combined with an LM.