# FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
ICLR 2021
[Link to paper](https://arxiv.org/pdf/2006.04558.pdf)
## Introduction
- Non-autoregressive TTS models can synthesize speech significantly faster than autoregressive TTS models
- FastSpeech is one of the most successful non-autoregressive TTS models, producing quality comparable to autoregressive models. It:
  - Reduces data variance on the target side by using the mel-spectrogram generated by an autoregressive teacher model as the training target
  - Introduces duration information (extracted from the attention maps of the teacher model) to expand the text sequence to match the length of the mel-spectrogram sequence
- However, FastSpeech suffers from several issues:
  - As a one-to-many, non-autoregressive model with only text as input, it is prone to overfitting to the variation of the target speech in the training set, which can lead to poor generalization
  - FastSpeech uses a two-stage teacher-student training pipeline, which makes training complicated
  - Because the training targets are distilled from the teacher model, they suffer information loss compared to the ground-truth speech
  - The duration extracted from the attention maps of the teacher model is not accurate enough
- Instead, FastSpeech 2:
  - Directly trains the model on the ground-truth target instead of the simplified output of a teacher
  - Introduces variation information of speech, including pitch, energy, and more accurate duration:
    - In training, duration, pitch, and energy are extracted from the target speech waveform and taken directly as conditional inputs
    - In inference, the values predicted by predictors jointly trained with the FastSpeech 2 model are used
- They also introduce FastSpeech 2s, which does not use mel-spectrograms as an intermediate output and directly generates the speech waveform from text in inference
## Model


### Overview
1. Input Encoding: The input text is first converted into a sequence of phonemes, which is passed through an embedding layer to produce a sequence of continuous embeddings.
2. Encoder Network: The encoded sequence of embeddings from the input encoding stage is then fed into a multi-layer transformer encoder network, which processes the input and produces a sequence of encoded representations.
3. Variance Adaptor: A variance adaptor network is used to modify the encoded representations in order to control the prosody and speaking style of the generated speech. The variance adaptor can adjust the timing, pitch, and energy of the speech, among other characteristics.
4. Decoder Network: A decoder network is then used to generate the spectrogram of the synthesized speech. The decoder takes as input the modified encoded representations, and generates a sequence of spectrogram frames.
5. Mel-Spectrogram: The spectrogram frames are then converted to a mel-spectrogram, which is a more compact and perceptually-relevant representation of the speech signal.
- Both the encoder and decoder networks use the **feed-forward Transformer block** as their basic structure; a minimal sketch of the full pipeline is given below.
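
The overview above maps naturally onto a small forward pass. The sketch below is a minimal, hedged illustration and not the official implementation: `FastSpeech2Sketch`, its sub-modules, and all sizes are assumptions, the standard `nn.TransformerEncoder` stands in for the paper's feed-forward Transformer blocks, and the variance adaptor is reduced to length regulation by phoneme durations.

```python
import torch
import torch.nn as nn

class FastSpeech2Sketch(nn.Module):
    """Minimal sketch of the FastSpeech 2 pipeline (hypothetical, simplified)."""

    def __init__(self, n_phonemes=80, hidden=256, n_mels=80):
        super().__init__()
        self.phoneme_embedding = nn.Embedding(n_phonemes, hidden)
        # Stand-ins for the feed-forward Transformer blocks described in the paper.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=2, batch_first=True), num_layers=4)
        self.mel_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=2, batch_first=True), num_layers=4)
        self.mel_linear = nn.Linear(hidden, n_mels)

    def forward(self, phonemes, durations):
        # 1)-2) Embed the phoneme sequence and encode it.
        x = self.encoder(self.phoneme_embedding(phonemes))      # (1, T_phon, H)
        # 3) Variance adaptor, reduced here to length regulation: repeat each
        #    phoneme hidden state by its duration so the sequence length
        #    matches the mel-spectrogram length (assumes batch size 1).
        x = torch.repeat_interleave(x, durations[0], dim=1)     # (1, T_mel, H)
        # 4)-5) Decode the expanded sequence into mel-spectrogram frames.
        return self.mel_linear(self.mel_decoder(x))             # (1, T_mel, n_mels)

# Toy usage: 5 phonemes, each lasting a few mel frames.
model = FastSpeech2Sketch()
phonemes = torch.randint(0, 80, (1, 5))
durations = torch.tensor([[3, 5, 2, 4, 6]])
mel = model(phonemes, durations)   # shape (1, 20, 80)
```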
### Variance Adaptor
The variance adaptor's main purpose is to add variance information to the phoneme hidden sequence, which helps with the one-to-many mapping problem in TTS:
1) Phoneme duration, which represents how long the speech voice sounds
2) Pitch, which is a key feature to convey emotions and greatly affects the speech prosody
3) Energy, which indicates frame level magnitude of mel-spectrograms and directly affects the volume and prosody of speech
- In training, the ground-truth duration, pitch, and energy extracted from the recordings are taken as inputs to the hidden sequence to predict the target speech; at the same time, the duration, pitch, and energy predictors are trained, and these predictors supply the values used in inference to synthesize the target speech (a minimal sketch of this switch is given below).
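
As a hedged illustration of this training/inference switch, the sketch below uses hypothetical placeholder callables (`duration_predictor`, `pitch_predictor`, `energy_predictor`, `pitch_to_embedding`, `energy_to_embedding`); the paper additionally quantizes pitch and energy into embeddings, which is glossed over here.

```python
import torch

def variance_adaptor_step(hidden, duration_predictor, pitch_predictor, energy_predictor,
                          pitch_to_embedding, energy_to_embedding,
                          gt_duration=None, gt_pitch=None, gt_energy=None):
    """Hedged sketch of the variance adaptor's training/inference switch.
    All arguments are hypothetical placeholders, not the paper's module names."""
    # Training: ground-truth values are used as conditional inputs.
    # Inference: fall back to the jointly trained predictors.
    if gt_duration is not None:
        duration = gt_duration
    else:
        # Duration is predicted in the log domain; undo the log and round to frames.
        duration = torch.clamp(torch.round(torch.exp(duration_predictor(hidden)) - 1), min=0).long()
    pitch = gt_pitch if gt_pitch is not None else pitch_predictor(hidden)
    energy = gt_energy if gt_energy is not None else energy_predictor(hidden)
    # Condition the hidden sequence on pitch and energy, then expand it to the
    # mel-spectrogram length using the durations (assumes batch size 1).
    hidden = hidden + pitch_to_embedding(pitch) + energy_to_embedding(energy)
    return torch.repeat_interleave(hidden, duration[0], dim=1)
```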
#### Duration Predictor
- The duration predictor takes the phoneme hidden sequence as input and predicts the duration of each phoneme, i.e., how many mel frames correspond to that phoneme; the duration is converted into the logarithmic domain for ease of prediction.
- The ground-truth phoneme durations are extracted with the Montreal Forced Aligner (MFA), and the predictor is optimized with MSE loss (see the sketch below for the shared predictor structure).
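
The paper describes a common structure shared by the duration, pitch, and energy predictors: a 2-layer 1D convolutional network with ReLU, each layer followed by layer normalization and dropout, plus a linear layer projecting to a scalar per phoneme. The sketch below is a rough rendering of that description rather than the official code; the sizes and the `log(1 + d)` target convention are assumptions.

```python
import torch
import torch.nn as nn

class VariancePredictorSketch(nn.Module):
    """Rough sketch of the shared duration/pitch/energy predictor structure:
    2-layer 1D conv with ReLU, layer norm and dropout, then a linear projection."""

    def __init__(self, hidden=256, kernel_size=3, dropout=0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x):                                   # x: (B, T_phon, H)
        y = self.conv1(x.transpose(1, 2)).transpose(1, 2)   # convolve over the time axis
        y = self.dropout(self.norm1(torch.relu(y)))
        y = self.conv2(y.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(self.norm2(torch.relu(y)))
        return self.proj(y).squeeze(-1)                     # (B, T_phon), e.g. log-duration

# For duration: target = log(1 + MFA_duration_in_frames) (one common convention),
# trained with nn.MSELoss() against the predictor output.
```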
#### Pitch Predictor
- Previous models that consider pitch prediction often predict pitch contour directly.
- However, the ground-truth pitch contour has very high variation, which makes direct prediction perform poorly.
- Instead, they convert the pitch contour into a pitch spectrogram using the Continuous Wavelet Transform (CWT), so the predictor predicts the spectrogram rather than the contour; at inference, the prediction is converted back to a pitch contour with the inverse Continuous Wavelet Transform (iCWT). A hedged code sketch of this round trip follows.
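
The snippet below illustrates the CWT round trip on a toy normalized pitch contour using PyWavelets (`pywt`) as a stand-in; the mother wavelet, the scales, and the crude inverse recombination are assumptions and differ from the paper's exact iCWT reconstruction.

```python
import numpy as np
import pywt  # PyWavelets, used here only as a stand-in for the CWT/iCWT in the paper

def pitch_to_spectrogram(f0, n_scales=10, wavelet="mexh"):
    """Decompose a (log, interpolated, normalized) pitch contour into a
    multi-scale "pitch spectrogram" via the continuous wavelet transform."""
    scales = 2.0 ** np.arange(1, n_scales + 1)       # assumed dyadic scales
    coefs, _ = pywt.cwt(f0, scales, wavelet)         # coefs: (n_scales, T)
    return coefs, scales

def spectrogram_to_pitch(coefs, scales):
    """Very rough approximate inverse: recombine scales with a 1/sqrt(scale)
    weighting. Illustrative only; the paper's iCWT reconstruction differs."""
    return (coefs / np.sqrt(scales)[:, None]).sum(axis=0)

# Toy example: a smooth normalized contour of 400 frames.
f0 = np.sin(np.linspace(0, 6 * np.pi, 400))
pitch_spec, scales = pitch_to_spectrogram(f0)        # the predictor's target in the paper
f0_recon = spectrogram_to_pitch(pitch_spec, scales)  # back to a contour at inference
```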
#### Energy Predictor
- They compute the L2-norm of the amplitude of each short-time Fourier transform (STFT) frame as the energy
- The energy predictor is optimized with MSE loss (a small extraction sketch follows)
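
A minimal sketch of the energy feature, assuming `librosa` for loading and the STFT; the STFT parameters are assumptions, and the paper's quantization of energy into an embedding is omitted.

```python
import numpy as np
import librosa  # assumed here only for audio loading and the STFT

def frame_energy(wav_path, sr=22050, n_fft=1024, hop_length=256):
    """Energy per frame: the L2-norm of each STFT frame's magnitude."""
    y, _ = librosa.load(wav_path, sr=sr)
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))  # (1 + n_fft/2, T)
    return np.linalg.norm(mag, axis=0)                                 # (T,) energy per frame
```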
### FastSpeech 2s
- FastSpeech 2s directly generates waveform from text without cascaded mel-spectrogram generation (acoustic model) and waveform generation (vocoder)
- It performs much faster in inference due to the removal of the mel-spectrogram decoder, while having performance comparable to a cascaded system
- However, non-autoregressive text-to-waveform generation faces challenges:
  - Since the waveform contains more variance information (e.g., phase) than mel-spectrograms, the information gap between the input and output is larger than in text-to-spectrogram generation
    - Phase can be thought of as the "position" of a particular frequency component within the overall sound waveform. If two frequency components have the same amplitude but different phases, they interfere with each other differently and produce a different overall waveform.
  - It is difficult to train on the audio clip corresponding to the full text sequence because of the extremely long waveform samples and limited GPU memory
- FastSpeech 2s addresses these as follows (a hedged sketch of the adversarial objective follows this list):
  - It introduces adversarial training in the waveform decoder to force it to implicitly recover the phase information by itself, since phase is very difficult to learn. The discriminator in the adversarial training adopts the same structure as in Parallel WaveGAN
  - It leverages the mel-spectrogram decoder of FastSpeech 2, which is trained on the full text sequence, to help with text feature extraction
  - The waveform decoder is based on the structure of WaveNet
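
As a hedged sketch of the adversarial training on short waveform slices, the snippet below uses an LSGAN-style objective; `waveform_decoder` and `discriminator` are hypothetical stand-ins for the WaveNet-based decoder and the Parallel WaveGAN-style discriminator, and the paper's full objective (e.g., its multi-resolution STFT loss) is not reproduced here.

```python
import torch

def adversarial_losses(waveform_decoder, discriminator, hidden_slice, real_wave_slice):
    """Illustrative LSGAN-style losses on sliced audio (clips, not full utterances).
    The exact loss formulation used in the paper may differ."""
    fake_wave = waveform_decoder(hidden_slice)          # generated waveform slice
    # Discriminator: push real slices toward 1 and generated slices toward 0.
    d_loss = ((discriminator(real_wave_slice) - 1) ** 2).mean() \
           + (discriminator(fake_wave.detach()) ** 2).mean()
    # Generator (waveform decoder): fool the discriminator into outputting 1.
    g_loss = ((discriminator(fake_wave) - 1) ** 2).mean()
    return d_loss, g_loss
```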
## Experiments
### Dataset
- Experiments are done on the LJSpeech dataset, which contains 13,100 English audio clips (about 24 hours) and corresponding text transcripts. The dataset is split into three sets: 12,228 samples for training, 349 samples for validation, and 523 samples for testing
- For subjective evaluation, they randomly choose 100 samples from the test set. To alleviate mispronunciation problems, the text sequence is converted into a phoneme sequence with an open-source grapheme-to-phoneme tool.
### Results

- MOS stands for Mean Opinion Score; GT means ground truth; Mel + PWG means the ground-truth audio is first converted into mel-spectrograms and then converted back to audio with Parallel WaveGAN




### Ablation Study
