# A3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing - ICML 2022
###### tags: `Yang`
### 1. Introduction
- Recently, speech representation learning has attracted much attention in the speech community due to its strong performance on many speech-related downstream tasks, such as speech recognition, speech classification, and speech translation.
- However, all these efforts only support **speech understanding tasks**, which take speech as input; for the inverse direction, speech synthesis, which produces speech as output, the potential of representation learning is yet to be realized.
- To address this problem, we propose **our framework, Alignment-Aware Acoustic-Text Pretraining (A3T)**, where we introduce cross-modal alignment embeddings which make it easier for the model to learn the alignment between the acoustic and phoneme inputs during multi-modal pretraining, and which significantly improve the quality of the reconstructed acoustic signals.
#### Contributions
- Propose A3T, a BERT-style pretraining model that can reconstruct masked spectrograms with high quality without finetuning and uses the same framework for decoding.
- Show that the proposed A3T model has the ability to do speech editing and outperforms the current SOTA.
- We propose the prompt-based decoding method. We show that our A3T model has the ability to do **speech synthesis for unseen speakers** and outperforms the speaker-embedding-based multi-speaker TTS system.
### 2. Alignment-Aware Acoustic-Text Pretraining


#### 2.1 $\text{A}^3\text{T}$
- $\text{A}^3\text{T}$ takes speech and transcription tuples as input, denoted as $D_{\mathbf{s}, \mathbf{x}}=\left\{\langle\mathbf{s}, \mathbf{x}\rangle^{(n)}\right\}_{n=1}^{|D|}$, where $\mathbf{s}=\left(s_1, \ldots, s_{|\mathbf{s}|}\right)$ is a sequence of acoustic features $s_i \in \mathbb{R}^{d_s}$, and $\mathbf{x}=\left(x_1, \ldots, x_{|\mathbf{x}|}\right)$ is the corresponding transcription sequence.
#### 2.2 Cross-modal Alignment Embedding
- To strengthen the interaction between the speech and text inputs, we introduce the cross-modal alignment embedding as one input to the encoder: we sum the $i$-th acoustic embedding $e_{s_i}$ (or text embedding $e_{x_i}$) with its positional embedding $e_{\mathrm{pos}_i}$ and alignment embedding $e_{\mathrm{aln}_i}$, i.e., $e_{s_i}+e_{\mathrm{pos}_i}+e_{\mathrm{aln}_i}$.
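A minimal PyTorch sketch of this input composition (the module and argument names are hypothetical, not the authors' released code):

```python
import torch
import torch.nn as nn

class AlignmentAwareEmbedding(nn.Module):
    """Sketch of the summed input e_s + e_pos + e_aln (names are illustrative)."""
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)  # positional embedding e_pos
        self.aln_emb = nn.Embedding(max_len, d_model)  # alignment embedding e_aln

    def forward(self, content_emb: torch.Tensor, aligned_phone_idx: torch.Tensor) -> torch.Tensor:
        # content_emb:       (T, d_model) acoustic embedding e_s or phoneme embedding
        # aligned_phone_idx: (T,) index of the phoneme each position is aligned to,
        #                    so frames aligned to the same phoneme share one e_aln
        positions = torch.arange(content_emb.size(0), device=content_emb.device)
        return content_emb + self.pos_emb(positions) + self.aln_emb(aligned_phone_idx)
```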
#### 2.3 Conformer
- Given the recent success of Convolution-augmented Transformer (Conformer) on various speech tasks, we adopt Conformer as the backbone of our encoder and decoder.
- Compared with Transformer, Conformer introduces a convolution module and an additional feedforward module.
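A simplified sketch of a Conformer block in PyTorch (dropout, relative positional attention, and other details of the original Conformer are omitted):

```python
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Macaron-style block: half-step FFN -> self-attention -> conv module -> half-step FFN."""
    def __init__(self, d: int = 256, heads: int = 4, kernel: int = 31, d_ff: int = 1024):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d_ff), nn.SiLU(), nn.Linear(d_ff, d))
        self.attn_norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d)
        self.conv = nn.Sequential(
            nn.Conv1d(d, 2 * d, 1), nn.GLU(dim=1),                   # pointwise conv + GLU
            nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d),  # depthwise conv over time
            nn.BatchNorm1d(d), nn.SiLU(),
            nn.Conv1d(d, d, 1),                                      # pointwise conv
        )
        self.ffn2 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d_ff), nn.SiLU(), nn.Linear(d_ff, d))
        self.final_norm = nn.LayerNorm(d)

    def forward(self, x):                                            # x: (B, T, d)
        x = x + 0.5 * self.ffn1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.conv(self.conv_norm(x).transpose(1, 2)).transpose(1, 2)
        x = x + 0.5 * self.ffn2(x)
        return self.final_norm(x)
```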

#### 2.4 Post-Net and Loss Function
- We follow Tacotron 2 to use Post-Net to refine the generated spectrogram.
- The training objective of the multi-modal $\text{A}^3\text{T}$ is a speech reconstruction loss $\ell_{\mathbf{s}}\left(D_{\mathbf{s}, \mathbf{x}}\right)$, which takes a spectrogram $\mathbf{s}$ and a text sequence $\mathbf{x}$ as input:
$$
\begin{aligned}
\ell_{\mathbf{s}}\left(D_{\mathbf{s}, \mathbf{x}}\right)=\sum_{\langle\mathbf{s}, \mathbf{x}\rangle \in D_{\mathbf{s}, \mathbf{x}}} &\|\underbrace{f\left(\left[e_{\mathbf{s}} ; \mathbf{x}\right]\right)+g\left(f\left(\left[e_{\mathbf{s}} ; \mathbf{x}\right]\right)\right)}_{\text {refined spectrogram }}-\mathbf{s}\|_1 \\
&+\|\underbrace{f\left(\left[e_{\mathbf{s}}; \mathbf{x}\right]\right)}_{\text {reconstructed spectrogram }}-\mathbf{s}\|_1
\end{aligned}
$$
where $g$ is a Post-Net which tries to recover a better original signal from the encoded representation $f\left(\left[e_{\mathbf{s}} ; \mathbf{x}\right]\right)$. We use the mean absolute error (MAE) to measure the difference between $\mathbf{s}$ and the reconstructed spectrogram.
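The two L1 terms map directly to code; a minimal sketch, assuming `f_out` is the decoder output $f([e_{\mathbf{s}};\mathbf{x}])$ and `postnet` is $g$:

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(f_out: torch.Tensor, postnet: torch.nn.Module,
                        target: torch.Tensor) -> torch.Tensor:
    """MAE on the raw reconstruction plus MAE on the Post-Net-refined spectrogram."""
    refined = f_out + postnet(f_out)   # Post-Net g predicts a residual correction
    return F.l1_loss(refined, target) + F.l1_loss(f_out, target)
```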
#### 2.5 $\text{A}^3\text{T}$ for Speech Editing
- Once A3T finishes the pretraining process, it can be used as a speech editing system directly with an external duration predictor.
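A rough sketch of this editing procedure under assumed interfaces (`a3t.infill`, `dur_pred`, and `vocoder` are placeholders for illustration, not the released API):

```python
import torch

def edit_speech(a3t, dur_pred, vocoder, mel, phones, frame_span, phone_span, new_phones):
    """Mask the edited region, splice in the new phonemes, infill, and re-synthesize."""
    fs, fe = frame_span                       # acoustic frames covering the words to replace
    ps, pe = phone_span                       # phonemes covering the words to replace
    n = dur_pred(new_phones)                  # external duration predictor -> number of masked frames
    masked = torch.zeros(n, mel.size(1))      # n [MASK] frames for the edited region
    mel_in = torch.cat([mel[:fs], masked, mel[fe:]], dim=0)
    phones_in = phones[:ps] + new_phones + phones[pe:]
    mel_out = a3t.infill(mel_in, phones_in)   # pretrained model reconstructs the masked span
    edited = torch.cat([mel[:fs], mel_out[fs:fs + n], mel[fe:]], dim=0)
    return vocoder(edited)                    # waveform where only the edited region is regenerated
```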
#### 2.6 $\text{A}^3\text{T}$ for Multi-speaker TTS
- In addition to speech editing, we find our model has the potential for unseen-speaker TTS.
- In this work, we find our model can achieve naturalness comparable to that of models with speaker embeddings on the unseen-speaker TTS task; what's more, our generations are more similar to the unseen speaker's reference speech.
- The key idea is to concatenate the prompt and the target into one new utterance input, where the target speech consists of $n$ $\text{[MASK]}$ frames and $n$ is predicted by a duration predictor. Given the concatenated speech and text as input, the A3T model predicts the spectrogram of these masked frames.
- The role of the reference text and speech in our model is similar to that of prompts in language models, hence we call it prompt-based decoding/generation.
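Under the same assumed interfaces as the editing sketch in Section 2.5, prompt-based decoding can be outlined as:

```python
import torch

def prompt_tts(a3t, dur_pred, vocoder, ref_mel, ref_phones, target_phones):
    """Use the reference utterance as a prompt and generate the fully masked target."""
    n = dur_pred(target_phones)               # predicted length of the target speech
    masked = torch.zeros(n, ref_mel.size(1))  # the target speech is n [MASK] frames
    mel_in = torch.cat([ref_mel, masked], dim=0)   # prompt speech + masked target
    phones_in = ref_phones + target_phones         # prompt text + target text
    mel_out = a3t.infill(mel_in, phones_in)
    return vocoder(mel_out[ref_mel.size(0):])      # keep only the generated target frames
```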
### 3. Experiments
In this section, we introduce our experiments on the spectrogram reconstruction pretraining task, the speech editing task, and multi-speaker TTS.
1. Ablation Study with Spectrogram Reconstruction



2. Speech Editing
- Baseline 1: This is a TTS system regenerating a complete waveform from the whole sentence to be edited.
- Baseline 2: This system generates the modified region with a TTS model and inserts the generation back into the original waveform with a forced aligner.
- Baseline 3: This system is similar to Baseline 1, but we cut the modified region from the generation and insert it back into the original waveform with a forced aligner.


3. Prompt-based Multi-speaker TTS

### 4. Conclusions
- In this paper, we propose Alignment-Aware Acoustic-Text Pretraining (A3T) which can reconstruct masked acoustic signals with high quality.
- We show that our proposed A3T model has the ability to do speech editing and outperforms the current SOTA models, and also improves unseen-speaker speech synthesis with our proposed prompt-based decoding.