# A3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing - ICML 2022
###### tags: `Yang`
### 1. Introduction
- Recently, speech representation learning has attracted much attention in the speech community due to its strong performance on many speech-related downstream tasks, such as speech recognition, speech classification, and speech translation.
- However, all these efforts only support **speech understanding tasks**, which take speech as input; for the inverse direction, speech synthesis, which produces speech as output, the potential of representation learning is yet to be realized.
- To address this problem, we propose **our framework, Alignment-Aware Acoustic-Text Pretraining (A3T)**, where we introduce cross-modal alignment embeddings which make it easier for the model to learn the alignment between the acoustic and phoneme inputs during multi-modal pretraining, and which significantly improve the quality of the reconstructed acoustic signals.
#### Contributions
- Propose A3T, a BERT-style pretraining model that can reconstruct masked spectrograms with high quality without finetuning and uses the same framework for decoding.
- Show that the proposed A3T model has the ability to do speech editing and outperforms the current SOTA.
- We propose the prompt-based decoding method. We show that our A3T model has the ability to do **speech synthesis for unseen speakers** and outperforms the speaker-embedding-based multi-speaker TTS system.
### 2. Alignment-Aware Acoustic-Text Pretraining


#### 2.1 $\text{A}^3\text{T}$
- $\text{A}^3\text{T}$ takes speech and transcription tuples as input, denoted as $D_{\mathbf{s}, \mathbf{x}}=\left\{\langle\mathbf{s}, \mathbf{x}\rangle^{(n)}\right\}_{n=1}^{|D|}$, where $\mathbf{s}=\left(s_1, \ldots, s_{|\mathbf{s}|}\right)$ is a sequence of acoustic features $s_i \in \mathbb{R}^{d_s}$, and $\mathbf{x}=\left(x_1, \ldots, x_{|\mathbf{x}|}\right)$ is the corresponding transcription sequence.
#### 2.2 Cross-modal Alignment Embedding
- To strengthen the interaction between the speech and text inputs, we introduce the cross-modal alignment embedding as one input to the encoder: we sum the $i$-th acoustic embedding $e_{s_i}$ (or text embedding $e_{x_i}$) with its positional embedding $e_{\mathrm{pos}_i}$ and alignment embedding $e_{\mathrm{aln}_i}$, i.e., $e_{s_i}+e_{\mathrm{pos}_i}+e_{\mathrm{aln}_i}$.
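A minimal PyTorch sketch of this input composition (the module and argument names are hypothetical, not the authors' released code):

```python
import torch
import torch.nn as nn

class AlignmentAwareEmbedding(nn.Module):
    """Sketch of the summed input e_s + e_pos + e_aln (names are illustrative)."""
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)  # positional embedding e_pos
        self.aln_emb = nn.Embedding(max_len, d_model)  # alignment embedding e_aln

    def forward(self, content_emb: torch.Tensor, aligned_phone_idx: torch.Tensor) -> torch.Tensor:
        # content_emb:       (T, d_model) acoustic embedding e_s or phoneme embedding
        # aligned_phone_idx: (T,) index of the phoneme each position is aligned to,
        #                    so frames aligned to the same phoneme share one e_aln
        positions = torch.arange(content_emb.size(0), device=content_emb.device)
        return content_emb + self.pos_emb(positions) + self.aln_emb(aligned_phone_idx)
```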
#### 2.3 Conformer
- Given the recent success of Convolution-augmented Transformer (Conformer) on various speech tasks, we adopt Conformer as the backbone of our encoder and decoder.
- Compared with Transformer, Conformer introduces a convolution module and an additional feedforward module.
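A simplified sketch of a Conformer block in PyTorch (dropout, relative positional attention, and other details of the original Conformer are omitted):

```python
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Macaron-style block: half-step FFN -> self-attention -> conv module -> half-step FFN."""
    def __init__(self, d: int = 256, heads: int = 4, kernel: int = 31, d_ff: int = 1024):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d_ff), nn.SiLU(), nn.Linear(d_ff, d))
        self.attn_norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d)
        self.conv = nn.Sequential(
            nn.Conv1d(d, 2 * d, 1), nn.GLU(dim=1),                   # pointwise conv + GLU
            nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d),  # depthwise conv over time
            nn.BatchNorm1d(d), nn.SiLU(),
            nn.Conv1d(d, d, 1),                                      # pointwise conv
        )
        self.ffn2 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d_ff), nn.SiLU(), nn.Linear(d_ff, d))
        self.final_norm = nn.LayerNorm(d)

    def forward(self, x):                                            # x: (B, T, d)
        x = x + 0.5 * self.ffn1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.conv(self.conv_norm(x).transpose(1, 2)).transpose(1, 2)
        x = x + 0.5 * self.ffn2(x)
        return self.final_norm(x)
```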

#### 2.4 Post-Net and Loss Function
- We follow Tacotron 2 to use Post-Net to refine the generated spectrogram.
- The training objective of the multi-modal $\text{A}^3\text{T}$ is a speech reconstruction loss $\ell_{\mathbf{s}}\left(D_{\mathbf{s}, \mathbf{x}}\right)$, which takes a spectrogram $\mathbf{s}$ and a text sequence $\mathbf{x}$ as input:
$$
\begin{aligned}
\ell_{\mathbf{s}}\left(D_{\mathbf{s}, \mathbf{x}}\right)=\sum_{\langle\mathbf{s}, \mathbf{x}\rangle \in D_{\mathbf{s}, \mathbf{x}}} &\|\underbrace{f\left(\left[e_{\mathbf{s}} ; \mathbf{x}\right]\right)+g\left(f\left(\left[e_{\mathbf{s}} ; \mathbf{x}\right]\right)\right)}_{\text {refined spectrogram }}-\mathbf{s}\|_1 \\
&+\|\underbrace{f\left(\left[e_{\mathbf{s}}; \mathbf{x}\right]\right)}_{\text {reconstructed spectrogram }}-\mathbf{s}\|_1
\end{aligned}
$$
where $g$ is a Post-Net which tries to recover a better original signal from the encoded representation $f\left(\left[e_{\mathbf{s}} ; \mathbf{x}\right]\right)$. We use the mean absolute error (MAE) to measure the difference between $\mathbf{s}$ and the reconstructed spectrogram.
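The two L1 terms map directly to code; a minimal sketch, assuming `f_out` is the decoder output $f([e_{\mathbf{s}};\mathbf{x}])$ and `postnet` is $g$:

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(f_out: torch.Tensor, postnet: torch.nn.Module,
                        target: torch.Tensor) -> torch.Tensor:
    """MAE on the raw reconstruction plus MAE on the Post-Net-refined spectrogram."""
    refined = f_out + postnet(f_out)   # Post-Net g predicts a residual correction
    return F.l1_loss(refined, target) + F.l1_loss(f_out, target)
```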
#### 2.5 $\text{A}^3\text{T}$ for Speech Editing
- Once A3T finishes the pretraining process, it can be used as a speech editing system directly with an external duration predictor.
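A rough sketch of this editing procedure under assumed interfaces (`a3t.infill`, `dur_pred`, and `vocoder` are placeholders for illustration, not the released API):

```python
import torch

def edit_speech(a3t, dur_pred, vocoder, mel, phones, frame_span, phone_span, new_phones):
    """Mask the edited region, splice in the new phonemes, infill, and re-synthesize."""
    fs, fe = frame_span                       # acoustic frames covering the words to replace
    ps, pe = phone_span                       # phonemes covering the words to replace
    n = dur_pred(new_phones)                  # external duration predictor -> number of masked frames
    masked = torch.zeros(n, mel.size(1))      # n [MASK] frames for the edited region
    mel_in = torch.cat([mel[:fs], masked, mel[fe:]], dim=0)
    phones_in = phones[:ps] + new_phones + phones[pe:]
    mel_out = a3t.infill(mel_in, phones_in)   # pretrained model reconstructs the masked span
    edited = torch.cat([mel[:fs], mel_out[fs:fs + n], mel[fe:]], dim=0)
    return vocoder(edited)                    # waveform where only the edited region is regenerated
```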
#### 2.6 $\text{A}^3\text{T}$ for Multi-speaker TTS
- In addition to speech editing, we find our model has the potential for unseen-speaker TTS.
- In this work, we find our model can achieve naturalness comparable to that of models with speaker embeddings on the unseen-speaker TTS task; what's more, our generations are more similar to the unseen speaker's reference speech.
- The key idea is to concatenate the prompt and the target into one new utterance input, where the target speech consists of $n$ $\text{[MASK]}$ frames and $n$ is predicted by a duration predictor. Given the concatenated speech and text as input, the A3T model predicts the spectrogram of these masked frames.
- The role of the reference text and speech in our model is similar to that of prompts in language models, hence we call it prompt-based decoding/generation.
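Under the same assumed interfaces as the editing sketch in Section 2.5, prompt-based decoding can be outlined as:

```python
import torch

def prompt_tts(a3t, dur_pred, vocoder, ref_mel, ref_phones, target_phones):
    """Use the reference utterance as a prompt and generate the fully masked target."""
    n = dur_pred(target_phones)               # predicted length of the target speech
    masked = torch.zeros(n, ref_mel.size(1))  # the target speech is n [MASK] frames
    mel_in = torch.cat([ref_mel, masked], dim=0)   # prompt speech + masked target
    phones_in = ref_phones + target_phones         # prompt text + target text
    mel_out = a3t.infill(mel_in, phones_in)
    return vocoder(mel_out[ref_mel.size(0):])      # keep only the generated target frames
```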
### 3. Experiments
In this section, we introduce our experiments on the spectrogram reconstruction pretraining task, the speech editing task, and multi-speaker TTS.
1. Ablation Study with Spectrogram Reconstruction



2. Speech Editing
- Baseline 1: This is a TTS system regenerating a complete waveform from the whole sentence to be edited.
- Baseline 2: This system generates the modified region with a TTS model and inserts the generation back into the original waveform with a forced aligner.
- Baseline 3: This system is similar to Baseline 1, but we cut the modified region from the generation and insert it back into the original waveform with a forced aligner.


3. Prompt-based Multi-speaker TTS

### 4. Conclusions
- In this paper, we propose Alignment-Aware Acoustic-Text Pretraining (A3T) which can reconstruct masked acoustic signals with high quality.
- We show that our proposed A3T model has the ability to do speech editing and outperforms the current SOTA models, and also improves unseen-speaker speech synthesis with our proposed prompt-based decoding.