# Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation
###### tags: `Yang`
## 1. Introduction
Recently, representation learning for text and speech has successfully improved many language-related tasks. However, existing methods suffer from two limitations:
- (a) they learn from only one input modality, while a unified representation of both speech and text is needed by tasks such as end-to-end speech translation; and as a result,
- (b) they cannot exploit the various large-scale text and speech corpora, so their performance is limited by the scarcity of parallel speech translation data.
In recent years, **task-agnostic text representation learning** has attracted much attention in the NLP community due to its strong performance on many downstream tasks.
More recently, unsupervised speech representation learning has also successfully improved many speech-related tasks, such as speech recognition and speech translation.
However, all these existing methods can only handle one modality, either text or speech, while a joint acoustic and text representation is desired for many end-to-end spoken language processing tasks.
The translation quality of end-to-end speech translation (ST) is limited by the scarcity of large-scale parallel speech translation data, whereas abundant data exists for speech recognition and text machine translation. It would therefore be helpful if source speech and bilingual text could be encoded into a unified representation using this abundant speech recognition and machine translation data.

**We make the following contributions:**
- We propose the Fused Acoustic and Text Masked Language Model (FAT-MLM), which can learn a unified acoustic and text representation.
- Based on FAT-MLM, we propose the Fused Acoustic and Text Speech Translation model (FAT-ST), **which can do speech recognition and machine translation in a single encoder-decoder framework**.

## 2. Fused Acoustic and Text Masked Language Model (FAT-MLM)
We propose the Fused Acoustic and Text Masked Language Model (FAT-MLM), a multimodal pretraining model which encodes acoustic and text input into a unified representation. In the following sections, we first introduce the monolingual FAT-MLM and then show how to extend it to the translation scenario.
### 2.1. Monolingual FAT-MLM
The monolingual FAT-MLM takes speech and transcription tuples as input, denoted as $D_{\mathbf{s}, \mathbf{x}}=\{(\mathbf{s}, \mathbf{x})\}$, where $\mathbf{s}=\left(s_{1}, \ldots, s_{|\mathbf{s}|}\right)$ is a sequence of acoustic features $s_{i} \in \mathbb{R}^{d_{s}}$ (e.g., the spectrogram or mel-spectrogram of the speech audio), each $s_{i}$ being a frame-level speech feature, and $\mathbf{x}=\left(x_{1}, \ldots, x_{|\mathbf{x}|}\right)$ is the corresponding transcription sequence.
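Below is a minimal sketch (not the authors' code) of how one $(\mathbf{s}, \mathbf{x})$ training tuple might be prepared: frame-level log-mel features for $\mathbf{s}$ and token ids for $\mathbf{x}$. The 80-dim mel setting, the whitespace tokenization, and the unknown-token id are illustrative assumptions, not values reported in the paper.

```python
import torch
import torchaudio

def make_tuple(wav_path: str, transcription: str, vocab: dict):
    """Build one (s, x) tuple: frame-level acoustic features and token ids."""
    waveform, sample_rate = torchaudio.load(wav_path)            # (channels, samples)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_mels=80)(waveform)             # (channels, 80, frames)
    s = mel.squeeze(0).transpose(0, 1)                            # (|s|, d_s) frame-level features
    # Whitespace split with unk id 0 is purely illustrative.
    x = torch.tensor([vocab.get(w, 0) for w in transcription.split()])
    return s, x
```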
We first randomly mask several spans of $\mathbf{s}$ by a random masking function over the input $\mathbf{s}$:
$$
\hat{\mathbf{s}} \sim \operatorname{Mask}_{\text {span }}(\mathbf{s}, \lambda)$$
Similarly, we randomly mask tokens in $\mathbf{x}$ by a random masking function over the input $\mathbf{x}$:
$$\hat{\mathbf{x}} \sim \operatorname{Mask}_{\text {token }}(\mathbf{x}, \lambda)$$
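The following is a minimal sketch of the two random masking functions, assuming PyTorch tensors; the masking ratio $\lambda$, the fixed span length, and the mask-token id are illustrative choices rather than the paper's settings.

```python
import torch

def mask_span(s: torch.Tensor, lam: float = 0.15, span: int = 10) -> torch.Tensor:
    """Zero out random contiguous spans of frames until ~lam of them are masked."""
    s_hat, n_frames = s.clone(), s.size(0)
    n_masked = 0
    while n_masked < lam * n_frames:
        start = torch.randint(0, max(1, n_frames - span), (1,)).item()
        s_hat[start:start + span] = 0.0        # masked frames are replaced by zeros
        n_masked += span
    return s_hat

def mask_token(x: torch.Tensor, lam: float = 0.15, mask_id: int = 3) -> torch.Tensor:
    """Replace ~lam of the tokens with a [MASK] id."""
    x_hat = x.clone()
    mask = torch.rand(x.size(0)) < lam
    x_hat[mask] = mask_id
    return x_hat
```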
We then concatenate the acoustic embeddings and the source text embeddings into $\left[e_{\hat{\mathbf{s}}} ; \hat{\mathbf{x}}\right]$ and obtain the latent representation $f\left(\left[e_{\hat{\mathbf{s}}} ; \hat{\mathbf{x}}\right]\right)$ using another Transformer encoder, denoted as $f$.
The training objective of the monolingual FAT-MLM includes a speech reconstruction loss $\ell_{\mathbf{s}}\left(D_{\mathbf{s}, \mathbf{x}}\right)$ and a text reconstruction loss $\ell_{\mathbf{x}}\left(D_{\mathbf{s}, \mathbf{x}}\right)$. For the speech input $\mathbf{s}$, we use the following objective to reconstruct the original speech signal from the surrounding context, where $g$ maps the latent representation back to the speech feature space:
$$
\ell_{\mathbf{s}}\left(D_{\mathbf{s}, \mathbf{x}}\right)=\sum_{(\mathbf{s}, \mathbf{x}) \in D_{\mathbf{s}, \mathbf{x}}} \left\| \mathbf{s}-g\left(f\left(\left[e_{\hat{\mathbf{s}}} ; \hat{\mathbf{x}}\right]\right)\right) \right\|_{2}^{2}
$$
For the transcription input $\mathbf{x}$, we use a cross-entropy loss to reconstruct the masked tokens:
$$
\ell_{\mathbf{x}}\left(D_{\mathbf{s}, \mathbf{x}}\right)=-\sum_{(\mathbf{s}, \mathbf{x}) \in D_{\mathbf{s}, \mathbf{x}}} \log p\left(\mathbf{x} \mid\left[\boldsymbol{e}_{\hat{\mathbf{s}}} ; \hat{\mathbf{x}}\right]\right)
$$
The final loss for monolingual FAT-MLM is:
$$
\ell_{\text {FAT-MLM }}\left(D_{\mathbf{s}, \mathbf{x}}\right)=\ell_{\mathbf{s}}\left(D_{\mathbf{s}, \mathbf{x}}\right)+\ell_{\mathbf{x}}\left(D_{\mathbf{s}, \mathbf{x}}\right)
$$
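A minimal sketch of the monolingual FAT-MLM objective follows, assuming an acoustic embedding module `e_s`, a text embedding module `embed_x`, the shared Transformer encoder `f`, a spectrogram reconstruction head `g`, and a token prediction head; all module names are placeholders for illustration rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fat_mlm_loss(s, x, s_hat, x_hat, e_s, embed_x, f, g, token_head):
    """Speech L2 reconstruction plus text cross-entropy on the fused input."""
    h = f(torch.cat([e_s(s_hat), embed_x(x_hat)], dim=0))   # encode [e_s_hat ; x_hat]
    h_s, h_x = h[: s.size(0)], h[s.size(0):]                # split back into the two streams
    loss_s = F.mse_loss(g(h_s), s, reduction="sum")          # ||s - g(f([e_s_hat ; x_hat]))||_2^2
    loss_x = F.cross_entropy(token_head(h_x), x)             # -log p(x | [e_s_hat ; x_hat])
    return loss_s + loss_x
```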
### 2.2. Translation FAT-MLM
To support multimodal cross-lingual tasks such as speech translation, we propose **Translation FAT-MLM**, which extends the monolingual FAT-MLM by additionally taking the target-language translation of the source-language transcription as input. Formally, Translation FAT-MLM takes $D_{\mathbf{s}, \mathbf{x}, \mathbf{y}}=\{(\mathbf{s}, \mathbf{x}, \mathbf{y})\}$ as input, where $\mathbf{y}=\left(y_{1}, \ldots, y_{|\mathbf{y}|}\right)$ denotes the target-language translation sequence. This kind of triplet input is very common in **speech translation corpora**.
We incorporate a source language embedding $e_{\text{src}}$ and a target language embedding $e_{\text{tgt}}$ to distinguish the two languages. Similar to the monolingual FAT-MLM, Translation FAT-MLM randomly masks the translation input, $\hat{\mathbf{y}} \sim \operatorname{Mask}_{\text{token}}(\mathbf{y}, \lambda)$, and concatenates it with the other two embeddings:
$$
\mathbf{h}_{\mathbf{s}, \mathbf{x}, \mathbf{y}}=\left[e_{\hat{\mathbf{s}}}+e_{\mathrm{src}} ; \hat{\mathbf{x}}+e_{\mathrm{src}} ; \hat{\mathbf{y}}+e_{\mathrm{tgt}}\right]
$$
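A minimal sketch of building $\mathbf{h}_{\mathbf{s}, \mathbf{x}, \mathbf{y}}$: each stream receives its masked embedding plus a language embedding (source for speech and transcription, target for the translation). Shapes and argument names are assumptions for illustration.

```python
import torch

def fuse_streams(e_s_hat, e_x_hat, e_y_hat, e_src, e_tgt):
    # e_s_hat: (|s|, d), e_x_hat: (|x|, d), e_y_hat: (|y|, d)
    # e_src, e_tgt: (d,) learned language embeddings, broadcast over time
    return torch.cat([e_s_hat + e_src,    # speech stream tagged as source language
                      e_x_hat + e_src,    # transcription stream tagged as source language
                      e_y_hat + e_tgt],   # translation stream tagged as target language
                     dim=0)
```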
Then we reconstruct the masked inputs from the concatenated embeddings $\mathbf{h}_{\mathbf{s}, \mathbf{x}, \mathbf{y}}$ via a Transformer encoder. The reconstruction losses for the different masked inputs are:
$$
\begin{aligned}
&\ell_{\mathbf{s}}\left(D_{\mathbf{s}, \mathbf{x}, \mathbf{y}}\right)=\sum_{(\mathbf{s}, \mathbf{x}, \mathbf{y}) \in D_{\mathbf{s}, \mathbf{x}, \mathbf{y}}} \left\| \mathbf{s}-g\left(f\left(\mathbf{h}_{\mathbf{s}, \mathbf{x}, \mathbf{y}}\right)\right) \right\|_{2}^{2} \\
&\ell_{\mathbf{x}}\left(D_{\mathbf{s}, \mathbf{x}, \mathbf{y}}\right)=-\sum_{(\mathbf{s}, \mathbf{x}, \mathbf{y}) \in D_{\mathbf{s}, \mathbf{x}, \mathbf{y}}} \log p\left(\mathbf{x} \mid \mathbf{h}_{\mathbf{s}, \mathbf{x}, \mathbf{y}}\right) \\
&\ell_{\mathbf{y}}\left(D_{\mathbf{s}, \mathbf{x}, \mathbf{y}}\right)=-\sum_{(\mathbf{s}, \mathbf{x}, \mathbf{y}) \in D_{\mathbf{s}, \mathbf{x}, \mathbf{y}}} \log p\left(\mathbf{y} \mid \mathbf{h}_{\mathbf{s}, \mathbf{x}, \mathbf{y}}\right)
\end{aligned}
$$
We sum these losses to obtain the final loss of Translation FAT-MLM:
$$
\ell_{\text {FAT-MLM }}\left(D_{\mathbf{s}, \mathbf{x}, \mathbf{y}}\right)=\ell_{\mathbf{s}}\left(D_{\mathbf{s}, \mathbf{x}, \mathbf{y}}\right)+\ell_{\mathbf{x}}\left(D_{\mathbf{s}, \mathbf{x}, \mathbf{y}}\right)+\ell_{\mathbf{y}}\left(D_{\mathbf{s}, \mathbf{x}, \mathbf{y}}\right)
$$
To fully utilize corpora from different tasks, FAT-MLM can take any combination of speech, transcription, and translation as input, denoted $D_{2^{\{\mathbf{s}, \mathbf{x}, \mathbf{y}\}}}$. These combinations include speech-only data $\{\mathbf{s}\}$; monolingual text data $\{\mathbf{x}\}$ or $\{\mathbf{y}\}$; speech-transcription tuples $\{(\mathbf{s}, \mathbf{x})\}$ from speech recognition; transcription-translation tuples $\{(\mathbf{x}, \mathbf{y})\}$ from machine translation; speech-translation tuples $\{(\mathbf{s}, \mathbf{y})\}$ from direct speech translation; and speech-transcription-translation triplets $\{(\mathbf{s}, \mathbf{x}, \mathbf{y})\}$. For each combination of inputs, FAT-MLM encodes the full concatenation of their embeddings and recovers the masked portions. The loss function is:
$$
\ell_{\text {FAT-MLM }}\left(D_{2^{\{\mathbf{s}, \mathbf{x}, \mathbf{y}\}}}\right)=\ell_{\mathbf{s}}\left(D_{\mathbf{s} \star}\right)+\ell_{\mathbf{x}}\left(D_{\mathbf{x} \star}\right)+\ell_{\mathbf{y}}\left(D_{\mathbf{y} \star}\right)
$$
where $D_{\mathbf{s} \star}, D_{\mathbf{x} \star}, D_{\mathbf{y} \star}$ denote any inputs that include speech, source-language text, and target-language text, respectively. Note that within this framework, MLM can be written as $\ell_{\mathbf{x}}\left(D_{\mathbf{x}}\right)$, TLM as $\ell_{\mathbf{x}, \mathbf{y}}\left(D_{\mathbf{x}, \mathbf{y}}\right)$, and MAM as $\ell_{\mathbf{s}}\left(D_{\mathbf{s}}\right)$.
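A minimal sketch of this flexible objective over any subset of $(\mathbf{s}, \mathbf{x}, \mathbf{y})$: each modality present in an example contributes its own reconstruction loss, so speech-only, ASR, MT, and triplet corpora can all train the same model. The per-modality loss helpers are hypothetical, in the spirit of the sketches above.

```python
def flexible_fat_mlm_loss(batch: dict, loss_s, loss_x, loss_y):
    """Sum the reconstruction losses of whichever modalities the batch contains."""
    total = 0.0
    if "s" in batch:          # any example containing speech
        total = total + loss_s(batch)
    if "x" in batch:          # any example containing source-language text
        total = total + loss_x(batch)
    if "y" in batch:          # any example containing target-language text
        total = total + loss_y(batch)
    return total
```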
## 3. Fused Acoustic and Text Speech Translation (FAT-ST)

In this section, we show how to adapt FAT-MLM to speech translation, enabling speech translation models to learn from speech recognition and text machine translation data.
To boost the performance of end-to-end speech translation, we enable the speech translation model to encode both acoustic and text features as input by adapting the architecture of the monolingual FAT-MLM into a Fused Acoustic and Text Speech Translation model (FAT-ST).
The FAT-ST encoder shares the same architecture as the monolingual FAT-MLM, so it can encode either acoustic or text features, and the FAT-ST model can be optimized with a speech translation loss $\ell_{\mathrm{ST}}$, a machine translation loss $\ell_{\mathrm{MT}}$, and the FAT-MLM loss $\ell_{\text{FAT-MLM}}$. For a speech translation dataset $D_{\mathbf{s}, \mathbf{x}, \mathbf{y}}$, we decouple the triplets into three parts: $D_{\mathbf{s}, \mathbf{y}}$ for $\ell_{\mathrm{ST}}$, $D_{\mathbf{s}, \mathbf{x}}$ for $\ell_{\text{FAT-MLM}}$, and $D_{\mathbf{x}, \mathbf{y}}$ for $\ell_{\mathrm{MT}}$. The loss function of FAT-ST is:
$$
\ell_{\text{FAT-ST}}\left(D_{\mathbf{s}, \mathbf{y}} \cup D_{\mathbf{s}, \mathbf{x}} \cup D_{\mathbf{x}, \mathbf{y}}\right)=\ell_{\mathrm{ST}}\left(D_{\mathbf{s}, \mathbf{y}}\right)+\ell_{\mathrm{MT}}\left(D_{\mathbf{x}, \mathbf{y}}\right)+\ell_{\text{FAT-MLM}}\left(D_{\mathbf{s}, \mathbf{x}}\right)
$$
where the machine translation and speech translation losses are the standard cross-entropy losses:
$$
\ell_{\mathrm{MT}}\left(D_{\mathbf{x}, \mathbf{y}}\right)=-\sum_{(\mathbf{x}, \mathbf{y}) \in D_{\mathbf{x}, \mathbf{y}}} \log p(\mathbf{y} \mid \mathbf{x})
$$
$$\ell_{\mathrm{ST}}\left(D_{\mathbf{s}, \mathbf{y}}\right)=-\sum_{(\mathbf{s}, \mathbf{y}) \in D_{\mathbf{s}, \mathbf{y}}} \log p(\mathbf{y} \mid \mathbf{s})$$
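A minimal sketch of one FAT-ST training step: the $(\mathbf{s}, \mathbf{x}, \mathbf{y})$ triplet is decoupled into $(\mathbf{s}, \mathbf{y})$ for the ST loss, $(\mathbf{x}, \mathbf{y})$ for the MT loss, and $(\mathbf{s}, \mathbf{x})$ for the FAT-MLM loss, and the three are summed. `model` is a hypothetical encoder-decoder exposing these three losses; it is not the authors' API.

```python
def fat_st_step(model, s, x, y, optimizer) -> float:
    """One multi-task update combining ST, MT, and FAT-MLM losses."""
    loss = (model.st_loss(s, y)           # -log p(y | s)
            + model.mt_loss(x, y)         # -log p(y | x)
            + model.fat_mlm_loss(s, x))   # masked reconstruction on (s, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```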
### 3.1. Finetuning FAT-ST from Translation FAT-MLM
Since the FAT-ST decoder predicts text only, we initialize it from the shared acoustic-and-text Transformer encoder of FAT-MLM. Although the Transformer decoder is unidirectional, unlike the bidirectional FAT-MLM, it still benefits from FAT-MLM initialization in our experiments.
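A minimal sketch of this finetuning initialization, assuming a checkpoint layout and parameter naming that are purely illustrative: the FAT-ST encoder takes the pretrained FAT-MLM encoder weights directly, while the decoder reuses whatever encoder parameters match in name and shape.

```python
import torch

def init_from_fat_mlm(fat_st_model, fat_mlm_ckpt_path: str):
    """Initialize FAT-ST from a pretrained FAT-MLM checkpoint (illustrative layout)."""
    pretrained = torch.load(fat_mlm_ckpt_path, map_location="cpu")["encoder"]
    # Encoder: same architecture as FAT-MLM, so weights transfer directly.
    fat_st_model.encoder.load_state_dict(pretrained)
    # Decoder: copy only parameters whose names and shapes match (e.g. self-attention
    # and feed-forward blocks); decoder-only parameters such as cross-attention
    # keep their random initialization.
    dec_state = fat_st_model.decoder.state_dict()
    transferable = {k: v for k, v in pretrained.items()
                    if k in dec_state and v.shape == dec_state[k].shape}
    dec_state.update(transferable)
    fat_st_model.decoder.load_state_dict(dec_state)
    return fat_st_model
```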
## 4. Experiments
### 4.1 Dataset

### 4.2 Translation Quality Comparisons


## 5. Conclusion
In this paper, we propose the Fused Acoustic and Text Masked Language Model (FAT-MLM), which **learns a unified representation for text and speech from any data that combines speech and text**.
We further extend this framework to a sequence-to-sequence speech translation model, which enables learning from speech recognition and text machine translation data for the first time. Our results show significant improvements on three translation directions of the MuST-C dataset, outperforming the cascaded baseline.