###### tags: `PaperReview`
# E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition
> ASAPP Inc., Mountain View, CA, USA
> Carnegie Mellon University, Pittsburgh, PA, USA
> SLT 2022
## Introduction
- **Combining convolution and self-attention** to capture both local and global information has shown remarkable performance for ASR encoders.
- **Conformer**, which combines convolution and self-attention **sequentially**, and **Branchformer**, which performs the combination **in parallel branches**, both outperform the Transformer by modeling local and global context.
- This paper proposes **E-Branchformer**, a descendant of Branchformer that enhances the merging mechanism between the convolution and self-attention branches.
## Related work
### Combining Self-Attention with Convolution
#### Sequentially
- QANet and Evolved Transformer **add additional convolution block(s) BEFORE the self-attention layer**.

- For ASR models, however, experiments from another paper show that **adding a convolution block AFTER the self-attention block achieves the best performance** compared to applying it before or in parallel with the self-attention.
#### In Parallel
- A previously proposed method, **Lite Transformer with Long-Short Range Attention (LSRA)**, applies **multi-head attention and dynamic convolution in parallel and concatenates their outputs**; it has shown superior performance on machine translation and summarization tasks.

- The input embedding is split into two halves, one for each branch.
- ConvBERT **combines multi-head attention** and its newly proposed **span-based dynamic convolution with shared queries and values** in two branches.
- In speech, Branchformer **combines self-attention with a convolutional branch built on the Convolutional Spatial Gating Unit (CSGU)**, achieving performance comparable to the Conformer.
#### Hybrid - both Sequentially and in Parallel
- Inception Transformer, which has **three branches** (average pooling, convolution, and self-attention) **fused with a depth-wise convolution**, achieves impressive performance on several vision tasks.

- E-Branchformer shares a similar concept with this method.
## Preliminary: Branchformer

- Branchformer has three main components: the **Global Extractor, the Local Extractor, and the Merge module.**
- The **global extractor** is the conventional self-attention block from the Transformer.
$$
\mathbf{\mathit{Y_G}} = Dropout(MHSA(LN(\mathbf{\mathit{X}})))
$$
where $\mathbf{\mathit{X}}, \mathbf{\mathit{Y_G}} \in \mathbb{R}^{T \times d}$ denote the input and the global-extractor branch output, with length $T$ and hidden dimension $d$.
- Similar to Conformer, Branchformer also uses **relative positional embeddings**, which generally perform better than absolute positional embeddings.
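
A minimal PyTorch sketch of this branch, assuming the standard `nn.MultiheadAttention`; the relative positional embeddings mentioned above are omitted for brevity, and the hyperparameter defaults are illustrative.
```python
import torch
import torch.nn as nn

class GlobalExtractor(nn.Module):
    """Pre-norm MHSA branch: Y_G = Dropout(MHSA(LN(X)))."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, dropout: float = 0.1):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d)
        h = self.ln(x)
        y, _ = self.mhsa(h, h, h, need_weights=False)
        return self.dropout(y)
```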





- The **local extractor** is a convolutional gating MLP (cgMLP) built around the Convolutional Spatial Gating Unit (CSGU):

- The first linear layer **projects the input to a higher dimension**, usually $d_{inter} = 6d$, and the second linear layer **projects it back to the original dimension** $d$.
$$
\begin{aligned}
& \boldsymbol{Z}=\operatorname{GELU}(\mathrm{LN}(\boldsymbol{X}) \boldsymbol{U}), \\
& {[\boldsymbol{A} \; \boldsymbol{B}]=\boldsymbol{Z},} \\
& \tilde{\boldsymbol{Z}}=\operatorname{CSGU}(\boldsymbol{Z})=\boldsymbol{A} \odot \operatorname{DwConv}(\operatorname{LN}(\boldsymbol{B})), \\
& \boldsymbol{Y}_{\boldsymbol{L}}=\operatorname{Dropout}(\tilde{\boldsymbol{Z}} \boldsymbol{V}),
\end{aligned}
$$
where $Z \in \mathbb{R}^{T \times d_{\text {inter }}}$ and $A, B, \tilde{Z} \in \mathbb{R}^{T \times d_{\text {inter }} / 2}$ are **intermediate hidden features**, $\odot$ is the element-wise product, and $\boldsymbol{U} \in \mathbb{R}^{d \times d_{\text {inter }}}, \boldsymbol{V} \in \mathbb{R}^{d_{\text {inter }} / 2 \times d}$ denote the trainable **weights of the two linear projections**. The local-extractor branch output is $Y_L \in \mathbb{R}^{T \times d}$.
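
A minimal sketch of this branch following the equations above; `d_inter = 6 * d` comes from the note, while the depth-wise kernel size of 31 is an assumption for illustration.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalExtractor(nn.Module):
    """cgMLP branch with a Convolutional Spatial Gating Unit (CSGU)."""
    def __init__(self, d_model: int = 256, d_inter: int = 6 * 256,
                 kernel_size: int = 31, dropout: float = 0.1):
        super().__init__()
        self.ln_in = nn.LayerNorm(d_model)
        self.proj_up = nn.Linear(d_model, d_inter)              # U: d -> d_inter
        self.ln_gate = nn.LayerNorm(d_inter // 2)
        self.dw_conv = nn.Conv1d(d_inter // 2, d_inter // 2, kernel_size,
                                 padding=kernel_size // 2, groups=d_inter // 2)
        self.proj_down = nn.Linear(d_inter // 2, d_model)        # V: d_inter/2 -> d
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d)
        z = F.gelu(self.proj_up(self.ln_in(x)))                  # Z = GELU(LN(X) U)
        a, b = z.chunk(2, dim=-1)                                # [A  B] = Z
        b = self.dw_conv(self.ln_gate(b).transpose(1, 2)).transpose(1, 2)
        z_tilde = a * b                                          # A (*) DwConv(LN(B))
        return self.dropout(self.proj_down(z_tilde))             # Y_L = Dropout(Z~ V)
```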
- The **Merge module** has several options to merge the two representations:
$$
\begin{aligned}
& Y_{Merge} = \text{Concat}(Y_G, Y_L)W \\
& Y_{Merge} = w_gY_G + w_lY_L
\end{aligned}
$$
but experiments show that the **concatenation method is simpler and more accurate**.
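
For concreteness, a toy sketch of the two baseline merge options, assuming both branch outputs are `(batch, T, d)` tensors; in the original Branchformer the scalar weights are learned, and plain constants are used here only to keep the snippet short.
```python
import torch
import torch.nn as nn

d = 256
y_g = torch.randn(8, 100, d)                     # toy global-branch output
y_l = torch.randn(8, 100, d)                     # toy local-branch output

# Option 1: concatenate along channels, then project 2d -> d
proj = nn.Linear(2 * d, d)
y_merge_concat = proj(torch.cat([y_g, y_l], dim=-1))

# Option 2: weighted sum of the two branches
w_g, w_l = 0.5, 0.5                              # learned in practice; constants here
y_merge_weighted = w_g * y_g + w_l * y_l
```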
## E-Branchformer
### Enhanced Merge Module
- The authors argue that **combining the outputs of the two branches in a point-wise, linear fashion may be sub-optimal**.
#### Depth-wise Convolution
- Introduce a **depth-wise convolution in the merge module**, allowing it to **take adjacent features into account** when combining information from the two branches.
- Depth-wise convolution requires **little computation** and has **a negligible effect on the speed of the model**.
$$
\begin{aligned}
& Y_{C} = \text{Concat}(Y_G, Y_L) \\
& Y_D = \text{DwConv}(Y_C) \\
& Y_{Merge} = (Y_C + Y_D)W
\end{aligned}
$$
where *DwConv* stands for depth-wise convolution and $W \in \mathbb{R}^{2d \times d}$ denotes the trainable weights of the linear projection.
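
A minimal sketch of the enhanced merge following the equations above; the kernel size of 3 is an assumption for illustration.
```python
import torch
import torch.nn as nn

class EnhancedMerge(nn.Module):
    """Concat -> depth-wise conv -> residual -> linear projection."""
    def __init__(self, d_model: int = 256, kernel_size: int = 3):
        super().__init__()
        self.dw_conv = nn.Conv1d(2 * d_model, 2 * d_model, kernel_size,
                                 padding=kernel_size // 2, groups=2 * d_model)
        self.proj = nn.Linear(2 * d_model, d_model)              # W: 2d -> d

    def forward(self, y_g: torch.Tensor, y_l: torch.Tensor) -> torch.Tensor:
        y_c = torch.cat([y_g, y_l], dim=-1)                      # Y_C: (batch, T, 2d)
        y_d = self.dw_conv(y_c.transpose(1, 2)).transpose(1, 2)  # Y_D = DwConv(Y_C)
        return self.proj(y_c + y_d)                              # (Y_C + Y_D) W
```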
#### Squeeze-and-Excitation
- The SE block applies **global average pooling** over the temporal dimension and feeds the result to a **tiny two-layer Feed-Forward Network (FFN)** to produce a channel-wise gate.
$$
\begin{aligned}
& \bar{y}_D=\frac{1}{T} \sum_{t=1}^T \boldsymbol{Y}_{\boldsymbol{D}_t}, \\
& \boldsymbol{g}=\sigma\left(\operatorname{MLP}\left(\bar{y}_D\right)\right)=\sigma\left(\operatorname{Swish}\left(\bar{y}_D \boldsymbol{W}_{\mathbf{1}}\right) \boldsymbol{W}_{\mathbf{2}}\right), \\
& \boldsymbol{Y}_{\boldsymbol{D}_i}^{\prime}=\boldsymbol{g}_i \odot \boldsymbol{Y}_{\boldsymbol{D}_i} \quad \forall i \in\{1, \ldots, d\}, \\
& \boldsymbol{Y}_{\text {Merge }}=\left(\boldsymbol{Y}_{\boldsymbol{C}}+\boldsymbol{Y}_{\boldsymbol{D}}^{\prime}\right) \boldsymbol{W},
\end{aligned}
$$
where $Y_D$ is the same as above, Swish $\left(\text{Swish}(x) = \frac{x}{1 + e^{-\beta x}}\right)$ and $\sigma$ denote the Swish and sigmoid non-linearities, respectively, $\odot$ denotes channel-wise multiplication, and $W_1 \in \mathbb{R}^{d \times d / 8}$ and $W_2 \in \mathbb{R}^{d / 8 \times d}$ are the trainable weights of the two-layer MLP.
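
A minimal sketch of the SE gate applied to $Y_D$ as in the equations above, assuming Swish with $\beta = 1$ (i.e. SiLU); the reduction factor of 8 follows the dimensions given in the text, while the channel count is passed in by the caller.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEGate(nn.Module):
    """Squeeze-and-excitation: global average pool -> 2-layer MLP -> sigmoid gate."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)     # W_1
        self.fc2 = nn.Linear(channels // reduction, channels)     # W_2

    def forward(self, y_d: torch.Tensor) -> torch.Tensor:
        # y_d: (batch, T, channels)
        pooled = y_d.mean(dim=1)                                   # average over time
        gate = torch.sigmoid(self.fc2(F.silu(self.fc1(pooled))))  # g = sigma(Swish(y W1) W2)
        return gate.unsqueeze(1) * y_d                             # Y'_D = g (*) Y_D
```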

#### Revisiting the Point-Wise Feed-Forward Network
- Because Branchformer already has two projection layers inside its cgMLP, **it does not include FFN blocks** like the Transformer and Conformer do.
- However, the **role** of the linear projections inside the cgMLP **may differ** from that of the FFN in the Transformer and Conformer.
- It is therefore possible that **stacking E-Branchformer blocks together with FFN modules performs better** than stacking only E-Branchformer blocks.
- In this direction, **FFN modules are interleaved with E-Branchformer blocks** to increase model capacity; a rough sketch follows below.
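
A rough sketch of how one E-Branchformer layer could interleave macaron-style FFN modules around the two-branch block, reusing the `GlobalExtractor`, `LocalExtractor`, and `EnhancedMerge` sketches above; the half-step FFN residual, SiLU activation, and normalization placement are assumptions in the spirit of macaron/Conformer designs, not the paper's exact layer definition.
```python
import torch
import torch.nn as nn

class MacaronFFN(nn.Module):
    """Point-wise FFN with a half-step residual (macaron style)."""
    def __init__(self, d_model: int = 256, expansion: int = 4, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, expansion * d_model),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(expansion * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + 0.5 * self.net(x)

class EBranchformerLayer(nn.Module):
    """FFN -> (global branch || local branch) -> enhanced merge -> FFN."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.ffn1 = MacaronFFN(d_model)
        self.global_branch = GlobalExtractor(d_model, n_heads)
        self.local_branch = LocalExtractor(d_model)
        self.merge = EnhancedMerge(d_model)
        self.ffn2 = MacaronFFN(d_model)
        self.ln_out = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.ffn1(x)
        x = x + self.merge(self.global_branch(x), self.local_branch(x))
        x = self.ffn2(x)
        return self.ln_out(x)
```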
## Experiments
### Experimental Setups
- Experiments on LibriSpeech dataset.
- Employ an attention-based encoder-decoder (AED) model with 80-dimensional log-Mel features (32 ms window, 10 ms stride) and 5K BPE sub-word units as output tokens.
- The subsampling module consists of two 2D convolution layers, a ReLU, and a linear layer.
- Consider two main models: BASE, which has 16 layers with $d = 256$, and LARGE, which has 17 layers with $d = 512$; the number of self-attention heads is $d/64$.
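
A quick summary of the two configurations as plain Python, just to make the derived head counts explicit; nothing beyond the numbers stated above is implied.
```python
# Head count is d / 64 for both configurations.
base_cfg  = {"num_layers": 16, "d_model": 256, "n_heads": 256 // 64}   # 4 heads
large_cfg = {"num_layers": 17, "d_model": 512, "n_heads": 512 // 64}   # 8 heads
```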
### Inference
- Employ joint CTC-attention decoding with tuned weight.
- Use an external LM for shallow fusion (a Transformer LM with 16 layers, 128-dimensional embeddings, 512-dimensional attention, and 8 attention heads; 53.17M parameters in total).
- Apply [Internal Language Model Estimation](https://arxiv.org/pdf/2011.01991.pdf) (ILME) to the model.
- Estimating the internal language model distribution allows explicitly interpolating the internal and external LMs with tunable weights, resulting in more effective decoding.
- For each hypothesis, the score is given by:
$$
\begin{aligned}
\log (P(Y))= & \log \left(P\left(Y \mid X ; \theta^{\mathrm{AED}}\right)\right) -\lambda_{\mathrm{ilm}} \log \left(P\left(Y ; \theta^{\mathrm{AED}}\right)\right)+\lambda_{\text {elm }} \log \left(P\left(Y ; \theta^{\mathrm{LM}}\right)\right)
\end{aligned}
$$
where $\lambda_{\text{ilm}}$ and $\lambda_{\text{elm}}$ are the interpolation weights for the internal and external language models, respectively. $P\left(Y \mid X ; \theta^{\mathrm{AED}}\right)$ is the probability of the hypothesis $Y$ yielded by the ASR model given the input acoustic features $X$. $P\left(Y ; \theta^{\mathrm{AED}}\right)$ and $P\left(Y ; \theta^{\mathrm{LM}}\right)$ represent the **internal language model estimate and the external language model probability over the hypothesis**. In the experiments, $\lambda_{\text{ilm}}$ and $\lambda_{\text{elm}}$ **were set to 0.2 and 0.6**, respectively.
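
A toy sketch of the per-hypothesis scoring formula above, assuming the three log-probabilities for a hypothesis $Y$ have already been computed; the function name and example values are purely illustrative.
```python
def hypothesis_score(log_p_aed: float, log_p_ilm: float, log_p_elm: float,
                     lambda_ilm: float = 0.2, lambda_elm: float = 0.6) -> float:
    """log P(Y) = log P(Y|X; AED) - lambda_ilm * log P(Y; AED) + lambda_elm * log P(Y; LM)."""
    return log_p_aed - lambda_ilm * log_p_ilm + lambda_elm * log_p_elm

# During beam search, the hypothesis with the highest combined score is kept.
score = hypothesis_score(log_p_aed=-12.3, log_p_ilm=-20.1, log_p_elm=-15.4)
```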
## Result
### Main Result

### Ablation Studies
#### FFN Module

[origin of "macaron-style FFN"](https://arxiv.org/pdf/1906.02762.pdf)

- Interleaving FFN modules is a reasonable way to achieve better accuracy than simply stacking Branchformer blocks more deeply.
#### Kernel Sizes of the Depth-wise Convolution

#### Merge Module


- Simply adding the depth-wise convolution is effective in terms of performance and also efficient in terms of the parameter size and the computational complexity.
## Conclusion
- The proposed E-Branchformer features an enhanced merging mechanism that enables a hybrid application of self-attention and convolution in both sequential and parallel manners.
- E-Branchformer demonstrates superior performance compared to existing models, specifically outperforming Conformer and Branchformer.
- E-Branchformer achieves a new state-of-the-art on LibriSpeech test sets without the use of external data, showcasing its effectiveness in ASR tasks.