###### tags: `PaperReview`
# E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition
> ASAPP Inc., Mountain View, CA, USA
> Carnegie Mellon University, Pittsburgh, PA, USA
> SLT 2022
## Introduction
- **Combining convolution and self-attention** to capture both local and global information has shown remarkable performance for ASR encoders.
- **Conformer**, which combines convolution and self-attention **sequentially**, and **Branchformer**, which performs the combination **in parallel branches**, both outperform the Transformer by modeling local and global context.
- This paper proposes **E-Branchformer**, a descendant of Branchformer that enhances the merging mechanism between the convolution and self-attention branches.
## Related work
### Combining Self-Attention with Convolution
#### Sequentially
- QANet and Evolved Transformer **add additional convolution block(s) BEFORE the self-attention layer**.

- For ASR models, however, experiments from another paper show that **adding a convolution block AFTER the self-attention block achieves the best performance** compared to applying it before or in parallel with the self-attention.
#### In Parallel
- A previously proposed method, **Lite Transformer with Long-Short Range Attention (LSRA)**, applies **multi-head attention and dynamic convolution in parallel and concatenates their outputs**; it has shown superior performance on machine translation and summarization tasks.

- The input embedding is split into two halves, one for each branch.
- ConvBERT **combines multi-head attention** and its newly proposed **span-based dynamic convolution with shared queries and values** in two branches.
- In speech, Branchformer **combines self-attention with a convolutional branch built on the Convolutional Spatial Gating Unit (CSGU)**, achieving performance comparable to the Conformer.
#### Hybrid - both Sequentially and in Parallel
- Inception Transformer, which has **three branches** (average pooling, convolution, and self-attention) **fused with a depth-wise convolution**, achieves impressive performance on several vision tasks.

- E-Branchformer shares a similar concept with this method.
## Preliminary: Branchformer

- Branchformer has three main components: the **Global Extractor, the Local Extractor, and the Merge module.**
- The **global extractor** is the conventional self-attention block from the Transformer.
$$
\mathbf{\mathit{Y_G}} = Dropout(MHSA(LN(\mathbf{\mathit{X}})))
$$
where $\mathbf{\mathit{X}}, \mathbf{\mathit{Y_G}} \in \mathbb{R}^{T \times d}$ denote the input and the global-extractor branch output, with length $T$ and hidden dimension $d$.
- Similar to Conformer, Branchformer also uses **relative positional embeddings**, which generally perform better than absolute positional embeddings.
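
A minimal PyTorch sketch of this branch, assuming the standard `nn.MultiheadAttention`; the relative positional embeddings mentioned above are omitted for brevity, and the hyperparameter defaults are illustrative.
```python
import torch
import torch.nn as nn

class GlobalExtractor(nn.Module):
    """Pre-norm MHSA branch: Y_G = Dropout(MHSA(LN(X)))."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, dropout: float = 0.1):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d)
        h = self.ln(x)
        y, _ = self.mhsa(h, h, h, need_weights=False)
        return self.dropout(y)
```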





- The **local extractor** is a convolutional gating MLP (cgMLP) built around the Convolutional Spatial Gating Unit (CSGU):

- The first linear layer **projects the input to a higher dimension**, usually $d_{inter} = 6d$, and the second linear layer **projects it back to the original dimension** $d$.
$$
\begin{aligned}
& \boldsymbol{Z}=\operatorname{GELU}(\mathrm{LN}(\boldsymbol{X}) \boldsymbol{U}), \\
& {[\boldsymbol{A} \; \boldsymbol{B}]=\boldsymbol{Z},} \\
& \tilde{\boldsymbol{Z}}=\operatorname{CSGU}(\boldsymbol{Z})=\boldsymbol{A} \odot \operatorname{DwConv}(\operatorname{LN}(\boldsymbol{B})), \\
& \boldsymbol{Y}_{\boldsymbol{L}}=\operatorname{Dropout}(\tilde{\boldsymbol{Z}} \boldsymbol{V}),
\end{aligned}
$$
where $Z \in \mathbb{R}^{T \times d_{\text {inter }}}$ and $A, B, \tilde{Z} \in \mathbb{R}^{T \times d_{\text {inter }} / 2}$ are **intermediate hidden features**, $\odot$ is the element-wise product, and $\boldsymbol{U} \in \mathbb{R}^{d \times d_{\text {inter }}}, \boldsymbol{V} \in \mathbb{R}^{d_{\text {inter }} / 2 \times d}$ denote the trainable **weights of the two linear projections**. The local-extractor branch output is $Y_L \in \mathbb{R}^{T \times d}$.
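
A minimal sketch of this branch following the equations above; `d_inter = 6 * d` comes from the note, while the depth-wise kernel size of 31 is an assumption for illustration.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalExtractor(nn.Module):
    """cgMLP branch with a Convolutional Spatial Gating Unit (CSGU)."""
    def __init__(self, d_model: int = 256, d_inter: int = 6 * 256,
                 kernel_size: int = 31, dropout: float = 0.1):
        super().__init__()
        self.ln_in = nn.LayerNorm(d_model)
        self.proj_up = nn.Linear(d_model, d_inter)              # U: d -> d_inter
        self.ln_gate = nn.LayerNorm(d_inter // 2)
        self.dw_conv = nn.Conv1d(d_inter // 2, d_inter // 2, kernel_size,
                                 padding=kernel_size // 2, groups=d_inter // 2)
        self.proj_down = nn.Linear(d_inter // 2, d_model)        # V: d_inter/2 -> d
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d)
        z = F.gelu(self.proj_up(self.ln_in(x)))                  # Z = GELU(LN(X) U)
        a, b = z.chunk(2, dim=-1)                                # [A  B] = Z
        b = self.dw_conv(self.ln_gate(b).transpose(1, 2)).transpose(1, 2)
        z_tilde = a * b                                          # A (*) DwConv(LN(B))
        return self.dropout(self.proj_down(z_tilde))             # Y_L = Dropout(Z~ V)
```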
- The **Merge module** has several options to merge the two representations:
$$
\begin{aligned}
& Y_{Merge} = \text{Concat}(Y_G, Y_L)W \\
& Y_{Merge} = w_gY_G + w_lY_L
\end{aligned}
$$
but experiments show that the **concatenation method is simpler and more accurate**.
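
For concreteness, a toy sketch of the two baseline merge options, assuming both branch outputs are `(batch, T, d)` tensors; in the original Branchformer the scalar weights are learned, and plain constants are used here only to keep the snippet short.
```python
import torch
import torch.nn as nn

d = 256
y_g = torch.randn(8, 100, d)                     # toy global-branch output
y_l = torch.randn(8, 100, d)                     # toy local-branch output

# Option 1: concatenate along channels, then project 2d -> d
proj = nn.Linear(2 * d, d)
y_merge_concat = proj(torch.cat([y_g, y_l], dim=-1))

# Option 2: weighted sum of the two branches
w_g, w_l = 0.5, 0.5                              # learned in practice; constants here
y_merge_weighted = w_g * y_g + w_l * y_l
```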
## E-Branchformer
### Enhanced Merge Module
- The authors argue that **combining the outputs of the two branches in a point-wise, linear fashion may be sub-optimal**.
#### Depth-wise Convolution
- Introduce a **depth-wise convolution in the merge module**, allowing it to **take adjacent features into account** when combining information from the two branches.
- Depth-wise convolution requires **little computation** and has **a negligible effect on the speed of the model**.
$$
\begin{aligned}
& Y_{C} = \text{Concat}(Y_G, Y_L) \\
& Y_D = \text{DwConv}(Y_C) \\
& Y_{Merge} = (Y_C + Y_D)W
\end{aligned}
$$
where *DwConv* stands for depth-wise convolution and $W \in \mathbb{R}^{2d \times d}$ denotes the trainable weights of the linear projection.
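
A minimal sketch of the enhanced merge following the equations above; the kernel size of 3 is an assumption for illustration.
```python
import torch
import torch.nn as nn

class EnhancedMerge(nn.Module):
    """Concat -> depth-wise conv -> residual -> linear projection."""
    def __init__(self, d_model: int = 256, kernel_size: int = 3):
        super().__init__()
        self.dw_conv = nn.Conv1d(2 * d_model, 2 * d_model, kernel_size,
                                 padding=kernel_size // 2, groups=2 * d_model)
        self.proj = nn.Linear(2 * d_model, d_model)              # W: 2d -> d

    def forward(self, y_g: torch.Tensor, y_l: torch.Tensor) -> torch.Tensor:
        y_c = torch.cat([y_g, y_l], dim=-1)                      # Y_C: (batch, T, 2d)
        y_d = self.dw_conv(y_c.transpose(1, 2)).transpose(1, 2)  # Y_D = DwConv(Y_C)
        return self.proj(y_c + y_d)                              # (Y_C + Y_D) W
```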
#### Squeeze-and-Excitation
- The SE block applies **global average pooling** over the temporal dimension and feeds the result to a **tiny two-layer Feed-Forward Network (FFN)** to produce a channel-wise gate.
$$
\begin{aligned}
& \bar{y}_D=\frac{1}{T} \sum_{t=1}^T \boldsymbol{Y}_{\boldsymbol{D}_t}, \\
& \boldsymbol{g}=\sigma\left(\operatorname{MLP}\left(\bar{y}_D\right)\right)=\sigma\left(\operatorname{Swish}\left(\bar{y}_D \boldsymbol{W}_{\mathbf{1}}\right) \boldsymbol{W}_{\mathbf{2}}\right), \\
& \boldsymbol{Y}_{\boldsymbol{D}_i}^{\prime}=\boldsymbol{g}_i \odot \boldsymbol{Y}_{\boldsymbol{D}_i} \quad \forall i \in\{1, \ldots, d\}, \\
& \boldsymbol{Y}_{\text {Merge }}=\left(\boldsymbol{Y}_{\boldsymbol{C}}+\boldsymbol{Y}_{\boldsymbol{D}}^{\prime}\right) \boldsymbol{W},
\end{aligned}
$$
where $Y_D$ is the same as above, Swish $\left(\text{Swish}(x) = \frac{x}{1 + e^{-\beta x}}\right)$ and $\sigma$ denote the Swish and sigmoid non-linearities, respectively, $\odot$ denotes channel-wise multiplication, and $W_1 \in \mathbb{R}^{d \times d / 8}$ and $W_2 \in \mathbb{R}^{d / 8 \times d}$ are the trainable weights of the two-layer MLP.
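
A minimal sketch of the SE gate applied to $Y_D$ as in the equations above, assuming Swish with $\beta = 1$ (i.e. SiLU); the reduction factor of 8 follows the dimensions given in the text, while the channel count is passed in by the caller.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEGate(nn.Module):
    """Squeeze-and-excitation: global average pool -> 2-layer MLP -> sigmoid gate."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)     # W_1
        self.fc2 = nn.Linear(channels // reduction, channels)     # W_2

    def forward(self, y_d: torch.Tensor) -> torch.Tensor:
        # y_d: (batch, T, channels)
        pooled = y_d.mean(dim=1)                                   # average over time
        gate = torch.sigmoid(self.fc2(F.silu(self.fc1(pooled))))  # g = sigma(Swish(y W1) W2)
        return gate.unsqueeze(1) * y_d                             # Y'_D = g (*) Y_D
```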

#### Revisiting the Point-Wise Feed-Forward Network
- Because Branchformer already has two projection layers inside its cgMLP, **it does not include FFN blocks** like the Transformer and Conformer do.
- However, the **role** of the linear projections inside the cgMLP **may differ** from that of the FFN in the Transformer and Conformer.
- It is therefore possible that **stacking E-Branchformer blocks together with FFN modules performs better** than stacking only E-Branchformer blocks.
- In this direction, **FFN modules are interleaved with E-Branchformer blocks** to increase model capacity; a rough sketch follows below.
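
A rough sketch of how one E-Branchformer layer could interleave macaron-style FFN modules around the two-branch block, reusing the `GlobalExtractor`, `LocalExtractor`, and `EnhancedMerge` sketches above; the half-step FFN residual, SiLU activation, and normalization placement are assumptions in the spirit of macaron/Conformer designs, not the paper's exact layer definition.
```python
import torch
import torch.nn as nn

class MacaronFFN(nn.Module):
    """Point-wise FFN with a half-step residual (macaron style)."""
    def __init__(self, d_model: int = 256, expansion: int = 4, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, expansion * d_model),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(expansion * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + 0.5 * self.net(x)

class EBranchformerLayer(nn.Module):
    """FFN -> (global branch || local branch) -> enhanced merge -> FFN."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.ffn1 = MacaronFFN(d_model)
        self.global_branch = GlobalExtractor(d_model, n_heads)
        self.local_branch = LocalExtractor(d_model)
        self.merge = EnhancedMerge(d_model)
        self.ffn2 = MacaronFFN(d_model)
        self.ln_out = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.ffn1(x)
        x = x + self.merge(self.global_branch(x), self.local_branch(x))
        x = self.ffn2(x)
        return self.ln_out(x)
```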
## Experiments
### Experimental Setups
- Experiments on LibriSpeech dataset.
- Employ an attention-based encoder-decoder (AED) model with 80-dimensional log-Mel features (32 ms window, 10 ms stride) and 5K BPE sub-word units as output tokens.
- The subsampling module consists of two 2D convolution layers, a ReLU, and a linear layer.
- Consider two main models: BASE, which has 16 layers with $d = 256$, and LARGE, which has 17 layers with $d = 512$; the number of self-attention heads is $d/64$.
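
A quick summary of the two configurations as plain Python, just to make the derived head counts explicit; nothing beyond the numbers stated above is implied.
```python
# Head count is d / 64 for both configurations.
base_cfg  = {"num_layers": 16, "d_model": 256, "n_heads": 256 // 64}   # 4 heads
large_cfg = {"num_layers": 17, "d_model": 512, "n_heads": 512 // 64}   # 8 heads
```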
### Inference
- Employ joint CTC-attention decoding with tuned weight.
- Use an external LM for shallow fusion (a Transformer LM with 16 layers, 128-dimensional embeddings, 512-dimensional attention, and 8 attention heads; 53.17M parameters in total).
- Apply [Internal Language Model Estimation](https://arxiv.org/pdf/2011.01991.pdf) (ILME) to the model.
- Estimating the internal language model distribution allows explicitly interpolating the internal and external LMs with tunable weights, resulting in more effective decoding.
- For each hypothesis, the score is given by:
$$
\begin{aligned}
\log (P(Y))= & \log \left(P\left(Y \mid X ; \theta^{\mathrm{AED}}\right)\right) -\lambda_{\mathrm{ilm}} \log \left(P\left(Y ; \theta^{\mathrm{AED}}\right)\right)+\lambda_{\text {elm }} \log \left(P\left(Y ; \theta^{\mathrm{LM}}\right)\right)
\end{aligned}
$$
where $\lambda_{\text{ilm}}$ and $\lambda_{\text{elm}}$ are the interpolation weights for the internal and external language models, respectively. $P\left(Y \mid X ; \theta^{\mathrm{AED}}\right)$ is the probability of the hypothesis $Y$ yielded by the ASR model given the input acoustic features $X$. $P\left(Y ; \theta^{\mathrm{AED}}\right)$ and $P\left(Y ; \theta^{\mathrm{LM}}\right)$ represent the **internal language model estimate and the external language model probability over the hypothesis**. In the experiments, $\lambda_{\text{ilm}}$ and $\lambda_{\text{elm}}$ **were set to 0.2 and 0.6**, respectively.
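
A toy sketch of the per-hypothesis scoring formula above, assuming the three log-probabilities for a hypothesis $Y$ have already been computed; the function name and example values are purely illustrative.
```python
def hypothesis_score(log_p_aed: float, log_p_ilm: float, log_p_elm: float,
                     lambda_ilm: float = 0.2, lambda_elm: float = 0.6) -> float:
    """log P(Y) = log P(Y|X; AED) - lambda_ilm * log P(Y; AED) + lambda_elm * log P(Y; LM)."""
    return log_p_aed - lambda_ilm * log_p_ilm + lambda_elm * log_p_elm

# During beam search, the hypothesis with the highest combined score is kept.
score = hypothesis_score(log_p_aed=-12.3, log_p_ilm=-20.1, log_p_elm=-15.4)
```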
## Result
### Main Result

### Ablation Studies
#### FFN Module

[origin of "macaron-style FFN"](https://arxiv.org/pdf/1906.02762.pdf)

- Interleaving FFN modules is a reasonable way to achieve better accuracy than simply stacking Branchformer blocks more deeply.
#### Kernel Sizes of the Depth-wise Convolution

#### Merge Module


- Simply adding the depth-wise convolution is effective in terms of performance and also efficient in terms of the parameter size and the computational complexity.
## Conclusion
- The proposed E-Branchformer features an enhanced merging mechanism that enables a hybrid application of self-attention and convolution in both sequential and parallel manners.
- E-Branchformer demonstrates superior performance compared to existing models, specifically outperforming Conformer and Branchformer.
- E-Branchformer achieves a new state-of-the-art on LibriSpeech test sets without the use of external data, showcasing its effectiveness in ASR tasks.