# Pay Better Attention to Attention: Head Selection in Multilingual and Multi-Domain Sequence Modeling 5/15
###### tags: `Yang`
NeurIPS 2021, Facebook AI Research
### 1. Introduction
The attention mechanism allows a model to focus on informative parts of the input, and multi-head attention computes attention over the inputs with multiple heads independently. With each head attending to different information, **multi-head attention can potentially capture more complex data patterns and extract more sophisticated knowledge.**
In this work, we bring the mitigation of language and domain interference under a common umbrella and tackle it by improving parameter sharing within multi-head attention. **We propose strategies to select attention heads for different languages or domains.** Instead of sharing everything across languages or domains, our model automatically learns to share heads among a subset of languages or domains. **This encourages positive transfer within the subset and preserves its specificity without interference from outside the subset.** The major contributions of this work are summarized below:
1. We propose attention head selection to mitigate language or domain interference;
2. The parameter sharing strategies are lightweight and preserve inference efficiency;
3. We extensively evaluate attention-sharing strategies on various sequence modeling tasks including speech recognition, text-to-text translation, and speech-to-text translation. **Consistent gains are achieved across multiple benchmark datasets.**
### 2. Model
- Latent Variable for Head Selection
For conditional sequence modeling tasks such as machine translation, the conditional probability $p(y\mid x)$ can be computed by marginalizing over a latent variable $z$, which modulates the parameters $\Theta$ of the standard Transformer architecture:
$$
p(y \mid x, \Theta)=\mathbf{E}_{p(z \mid \Theta)}[p(y \mid x, z)]=\int p(y \mid x, z) p(z \mid \Theta) \mathrm{d} z
$$
**Parameterization of $z_t$**. In this work, we define $z_{t}$ as modulating the selection of attention heads by task $t$. We have $z_{t}=\left\{z_{t}^{(h)}\right\}_{h}$, where $z_{t}^{(h)}$ is a discrete latent variable drawn from a Bernoulli distribution indicating **whether task $t$ selects attention head $h$.**
Marginalizing over $z_{t}$ is intractable given the large number of heads in neural models. Therefore, we use variational inference to derive an approximate solution. Suppose that $\left(x_{t}, y_{t}\right)$ is from task $t$. Specifically, we learn an inference network $q_{\phi}\left(z_{t}\right)$, parameterized by $\phi$, to approximate the true distribution $p\left(z_{t}\right)$ and optimize the evidence lower bound (ELBO) of $p(y \mid x)$:
$$
\log p(y \mid x) \geq \sum_{t}\left(\mathbf{E}_{q_{\phi}\left(z_{t}\right)}\left[\log p_{\theta}\left(y_{t} \mid x_{t}, z_{t}\right)\right]-\operatorname{KL}\left(q_{\phi}\left(z_{t}\right) \| p\left(z_{t}\right)\right)\right).
$$
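To make the objective concrete, here is a minimal sketch of how the per-task ELBO loss could be computed in PyTorch, assuming the per-head posteriors are stored as logits and the prior $p(z_t^{(h)})$ is a fixed Bernoulli; `head_logits`, `nll`, and `prior=0.5` are illustrative assumptions, not taken from the authors' code:
```python
import torch
import torch.nn.functional as F

def elbo_loss(nll, head_logits, prior=0.5):
    """ELBO-style objective for one task: reconstruction NLL plus the KL term
    that keeps the per-head posterior q_phi(z_t^(h)=1) close to a Bernoulli prior.

    nll:         scalar tensor, -log p(y_t | x_t, z_t)
    head_logits: tensor of shape (num_heads, 2) holding phi_t^(h)(0) and phi_t^(h)(1)
    """
    q = F.softmax(head_logits, dim=-1)[:, 1]            # q_phi(z_t^(h) = 1)
    q = q.clamp(1e-6, 1 - 1e-6)                          # numerical stability
    kl = q * torch.log(q / prior) + (1 - q) * torch.log((1 - q) / (1 - prior))
    return nll + kl.sum()                                # minimizing this maximizes the ELBO
```
The KL term regularizes head selection toward the prior, while the first term is the usual sequence-level negative log-likelihood.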
**Training and inference.** We use the Gumbel-Softmax reparameterization to draw samples of $z_{t}^{(h)}$ from the posterior $q_{\phi}\left(z_{t}^{(h)}\right)$. This makes the model end-to-end differentiable while learning discrete head-selection policies without resorting to policy gradients. We adopt a lightweight estimator of $q_{\phi}\left(z_{t}^{(h)}\right)$ by directly learning the logit parameters $\left\{\phi_{t}^{(h)}\right\}$:
$$
q_{\phi}\left(z_{t}^{(h)}\right)=\frac{\exp \left(\left(\phi_{t}^{(h)}(1)+\epsilon(1)\right) / \tau\right)}{\sum_{j \in\{0,1\}} \exp \left(\left(\phi_{t}^{(h)}(j)+\epsilon(j)\right) / \tau\right)}, \epsilon \sim \mathcal{G}(0,1)
$$
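Here $\mathcal{G}(0,1)$ is the standard Gumbel distribution and $\tau$ is the softmax temperature. A hedged sketch of this sampler using PyTorch's built-in `gumbel_softmax`; the function and variable names are illustrative, and the straight-through `hard=True` option is an assumption rather than the authors' confirmed setting:
```python
import torch.nn.functional as F

def sample_head_selection(head_logits, tau=1.0, hard=True):
    """Draw a relaxed Bernoulli sample z_t^(h) for every attention head.

    head_logits: (num_heads, 2) logits phi_t^(h)(0), phi_t^(h)(1) for task t.
    Returns a (num_heads,) tensor of (approximately) binary selection weights
    that remain differentiable w.r.t. head_logits.
    """
    # Add Gumbel noise to the logits and apply a temperature-controlled softmax;
    # hard=True returns one-hot samples with a straight-through gradient.
    samples = F.gumbel_softmax(head_logits, tau=tau, hard=hard, dim=-1)
    return samples[:, 1]                                 # mass on the "selected" outcome
```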
- Attention Selection Strategies

**Subset strategy.** The subset strategy is straightforward: we compare the posteriors $\left\{q_{\phi}\left(z_{t}^{(h)}\right): h \in\left[1, H^{\prime}\right]\right\}$ of all $H^{\prime}$ candidate heads given a task $t$, and the $H$ heads with the highest posterior are selected by the task. The subset strategy is illustrated in Fig. 1(a). The binary mask $s_{t}^{(h)}$ indicates whether attention head $h$ is assigned to task $t$:
$$
s_{t}^{(h)}= \begin{cases}1, & h \in \operatorname{TopH}\left(\left\{q_{\phi}\left(z_{t}^{(h)}\right)\right\}\right), \\ 0, & \text { otherwise }\end{cases}
$$
where $\operatorname{TopH}(\cdot)$ returns the $H$ heads with the highest posterior values.
The outputs of the selected heads are concatenated as the attention output. Note that the subset strategy **does not consider the order of the attention heads**.
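A minimal sketch of the subset strategy at inference time, reusing the `head_logits` convention from the sampler above (`subset_select` and `num_keep` are illustrative names):
```python
import torch

def subset_select(head_logits, num_keep):
    """Subset strategy: keep the num_keep heads with the highest posterior
    q_phi(z_t^(h) = 1); head order is not preserved.

    head_logits: (num_heads, 2) logits for task t.
    Returns the selected head indices and the binary mask s_t^(h).
    """
    q = torch.softmax(head_logits, dim=-1)[:, 1]   # posterior per head
    top = torch.topk(q, k=num_keep).indices        # TopH(.)
    mask = torch.zeros_like(q)
    mask[top] = 1.0                                # s_t^(h) = 1 for selected heads
    return top, mask
```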
**Group strategy.** We further propose the group strategy to preserve the order of attention heads during head selection. Unlike the subset strategy, the group strategy first divides the $H^{\prime}$ heads into $H$ groups. As shown in Fig. 1(b), each group contains $r=\frac{H^{\prime}}{H}$ candidates. Each task chooses one attention head from each group and thus has access to $H$ heads per layer. The group strategy preserves the head order in that heads from group $g$ only contribute to $g$'s corresponding dimensions in the attention output. The head with the highest posterior in its group is selected by a given task $t$. We use the binary mask $s_{t}^{(h)}$ to indicate the selection of head $h$ in group $g$:
$$
s_{t}^{(h)}= \begin{cases}1, & h=\operatorname{argmax}\left(\left\{q_{\phi}\left(z_{t}^{\left(h^{\prime}\right)}\right): h^{\prime} \in g\right\}\right), \\ 0, & \text { otherwise. }\end{cases}
$$
The output of group $g$ is:
$$
\mathbf{x}^{(g)}=\sum_{h \in g} s_{t}^{(h)} \cdot \mathbf{x}^{(h)}
$$
The outputs of the $H$ groups are concatenated as the output of the attention module for task $t$.
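Under the same assumptions, a sketch of the group strategy: within-group argmax selection followed by the masked sum and concatenation from the equations above (shapes and names are illustrative):
```python
import torch

def group_select(head_logits, num_groups):
    """Group strategy: split the H' candidate heads into num_groups groups and,
    within each group, select the head with the highest posterior.

    head_logits: (num_heads, 2) logits for task t, where num_heads = H'.
    Returns a (num_heads,) binary mask s_t^(h) with exactly one 1 per group.
    """
    q = torch.softmax(head_logits, dim=-1)[:, 1]        # posterior per head
    per_group = q.view(num_groups, -1)                  # (H, r) candidates per group
    winners = per_group.argmax(dim=-1)                  # argmax within each group
    mask = torch.zeros_like(per_group)
    mask[torch.arange(num_groups), winners] = 1.0
    return mask.view(-1)

def group_output(head_outputs, mask, num_groups):
    """x^(g) = sum_{h in g} s_t^(h) * x^(h), then concatenate the H group outputs.

    head_outputs: (num_heads, batch, time, d_head) per-head outputs x^(h).
    """
    weighted = mask.view(-1, 1, 1, 1) * head_outputs                    # apply s_t^(h)
    grouped = weighted.view(num_groups, -1, *head_outputs.shape[1:]).sum(dim=1)
    return torch.cat(list(grouped), dim=-1)                             # (batch, time, H * d_head)
```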
### 3. Experiment
- **Baselines:**
1. Transformer
2. S2T Transformer: a variant of the Transformer for speech processing.
3. Adapter model: adapters have been shown to be an effective approach to language and domain adaptation. A typical adapter layer consists of two feed-forward sub-layers.
4. Static strategy of head selection: a static strategy assigns each task a fixed subset of attention heads based on task similarity.
- In the multilingual setting, we group languages into linguistic families, and each family is assigned an exclusive set of heads.
For example, we group the languages into 5 linguistic families, and each family is assigned 4 attention heads (see the sketch after this list):
(1) Indo-European family: cs, de, es, fr, gu, lt, lv, ro and ru; (2) Estonian family: et; (3) Uralic family: fi; (4) Turkic family: kk and tr; (5) Sino-Tibetan family: zh
- In the multi-domain setting, each domain has its own set of attention heads.
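For illustration, a minimal sketch of what this static multilingual assignment could look like, assuming 20 candidate heads per layer split into five contiguous blocks of four; the concrete indices and dictionary names are hypothetical, not the authors' exact mapping:
```python
# Hypothetical static assignment: 5 families x 4 heads = 20 candidate heads per layer.
FAMILY_HEADS = {
    "indo_european": [0, 1, 2, 3],      # cs, de, es, fr, gu, lt, lv, ro, ru
    "estonian":      [4, 5, 6, 7],      # et
    "uralic":        [8, 9, 10, 11],    # fi
    "turkic":        [12, 13, 14, 15],  # kk, tr
    "sino_tibetan":  [16, 17, 18, 19],  # zh
}

LANG_TO_FAMILY = {
    "cs": "indo_european", "de": "indo_european", "es": "indo_european",
    "fr": "indo_european", "gu": "indo_european", "lt": "indo_european",
    "lv": "indo_european", "ro": "indo_european", "ru": "indo_european",
    "et": "estonian", "fi": "uralic", "kk": "turkic", "tr": "turkic",
    "zh": "sino_tibetan",
}

def static_heads(lang):
    """Return the fixed set of attention head indices assigned to a language."""
    return FAMILY_HEADS[LANG_TO_FAMILY[lang]]
```
Unlike the learned strategies in Section 2, this mapping is fixed before training and never adapted.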
- **Machine Translation**

- **Speech Recognition**
- Multilingual Speech Recognition

- Multi-Domain Speech Recognition

- **Speech Translation**

### 4. Discussion

We observe that the group strategy shows comparable or better performance than the subset strategy across tasks. One possible explanation is that the group strategy preserves head-order information while the subset strategy does not. In addition, with a larger pool of head candidates, there is less sharing among tasks.
### 5. Conclusion
- In this work, we propose head selection strategies to allow attention heads to be shared or specialized for different languages or domains.
- It effectively mitigates interference within multi-head attention, a core component of strong sequence models, and demonstrates consistent empirical gains on various sequence modeling tasks.