## MultiSpeech: Multi-Speaker Text to Speech with Transformer
### Introduction
To reduce deployment and serving cost in commercial applications, building a TTS system that supports multiple speakers has attracted much attention in both industry and academia. While it is affordable to record high-quality and clean voice in a professional studio for a single speaker, it is costly to do so for the hundreds or thousands of speakers needed to build a multi-speaker TTS system. **Thus, multi-speaker TTS systems are usually built using multi-speaker data recorded for automatic speech recognition (ASR) or voice conversion**.
When applying Transformer to multi-speaker TTS, the text-to-speech alignment between the encoder and decoder is harder to learn than in RNN-based models. In single-speaker TTS, the text and speech data are usually of high quality and the text-to-speech alignments are easy to learn. **However, as mentioned above, the speech data for multi-speaker TTS is usually noisy, which makes the alignments much more difficult to learn.**
In order to bring the advantages of Transformer into multi-speaker TTS modeling, in this paper we develop a robust and high-quality multi-speaker TTS system called MultiSpeech, which greatly improves the text-to-speech alignment in Transformer. **Specifically, we introduce several techniques to improve the alignments, based on empirical observations and insights:**
- Diagonal constraint on the weight matrix
- Layer normalization step on phoneme embeddings
- Bottleneck structure in the decoder pre-net

### Improving Text-to-Speech Alignment
#### Diagonal Constraint in Attention
We first formulate the diagonal attention rate $r$ as $$r=\frac{\sum_{t=1}^{T} \sum_{s=kt-b}^{kt+b}A_{t,s}}{S}$$
where $S$ is the length of the speech mel-spectrogram and $T$ is the length of the text (phonemes or characters). $k = \frac{S}{T}$ is the slope for each training sample and $b$ is a hyperparameter for the bandwidth, both of which determine the shape of the diagonal area. $A_{t,s}$ is the element in the $t$-th row and $s$-th column of the attention weight matrix $A$. The diagonal constraint loss $L_{DC}$ encourages larger attention weights in the diagonal area as shown in Figure 2, and is defined as $L_{DC} = -r$. $L_{DC}$ is added to the original TTS loss with a weight λ to adjust the strength of the constraint.
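A minimal sketch of how this constraint could be computed, assuming a PyTorch attention matrix $A$ of shape $(T, S)$ following the formulation above; the function and argument names (`diagonal_constraint_loss`, `bandwidth`) are illustrative and not taken from the paper's implementation.

```python
import torch

def diagonal_constraint_loss(A: torch.Tensor, bandwidth: int) -> torch.Tensor:
    """Return L_DC = -r, where r is the diagonal attention rate.

    A: attention weights of shape (T, S) (text tokens x mel frames).
    bandwidth: the hyperparameter b that widens the diagonal band.
    """
    T, S = A.shape
    k = S / T                                    # slope of the diagonal for this sample
    t = torch.arange(T, device=A.device)         # text-token indices
    s = torch.arange(S, device=A.device)         # mel-frame indices
    center = k * t.unsqueeze(1)                  # (T, 1) expected frame position for each token
    mask = (s.unsqueeze(0) >= center - bandwidth) & (s.unsqueeze(0) <= center + bandwidth)
    r = A[mask].sum() / S                        # diagonal attention rate
    return -r                                    # L_DC
```

As stated above, this loss term is added to the original TTS loss with weight λ; how it is aggregated across attention heads and layers is left open in this sketch.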
#### Position Information in Encoder
The encoder of Transformer based TTS model usually takes $x + p$ as input, where $x$ is the embedding of phoneme/character token and $p$ is positional embedding to give the Transformer model a sense of token order.
To preserve the position information properly in $x + p$, we first apply layer normalization to $x$ and then add the result to $p$, i.e., $LN(x) + p$, as shown in Figure 1. $LN(x)$ is defined as $$ LN(x) = γ\frac{x − µ}{σ}+β $$
where $µ$ and $σ$ are the mean and standard deviation of vector $x$, and $γ$ and $β$ are the learnable scale and bias parameters. In this way, the scale of the phoneme embedding $x$ is restricted to a limited range by the learned scale and bias parameters in layer normalization.
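A minimal sketch of this encoder input, assuming PyTorch and standard sinusoidal positional embeddings of the same dimension as the phoneme embedding; the class name and hyperparameter values are illustrative.

```python
import math
import torch
import torch.nn as nn

class NormalizedEncoderInput(nn.Module):
    """Compute LN(x) + p so the learned scale/bias keep x on a scale comparable to p."""
    def __init__(self, vocab_size: int, d_model: int, max_len: int = 2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.norm = nn.LayerNorm(d_model)        # learnable gamma (scale) and beta (bias)
        # Standard sinusoidal positional embeddings p.
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)                              # (B, T, d_model) phoneme embeddings
        return self.norm(x) + self.pe[: tokens.size(1)]     # LN(x) + p
```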
#### Pre-Net Bottleneck in Decoder
The adjacent frames of a mel-spectrogram are usually very similar, since the hop size is usually much smaller than the window size, which means two adjacent frames have large information overlap. As a consequence, **when predicting the next frame given the current frame as input in autoregressive training, the model is prone to directly copy some information from the input frame** instead of extracting information from the text side for a meaningful prediction.
As shown in Figure 2, we further reduce the bottleneck hidden size to as small as 1/8 of the original hidden size (e.g., 32 vs. the original hidden size of 256), together with a dropout ratio of 0.5, so that the pre-net structure becomes 80-32-32-256. We found this small bottleneck size is essential to learn meaningful alignments and to avoid directly copying the input frame.
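The bottleneck pre-net can be sketched as follows, assuming 80-dimensional mel frames and a 256-dimensional decoder hidden size as in the numbers above (PyTorch; class and argument names are illustrative).

```python
import torch.nn as nn

class BottleneckPreNet(nn.Module):
    """Decoder pre-net with the 80-32-32-256 structure and 0.5 dropout."""
    def __init__(self, n_mels: int = 80, bottleneck: int = 32, d_model: int = 256, p: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, bottleneck), nn.ReLU(), nn.Dropout(p),      # 80 -> 32
            nn.Linear(bottleneck, bottleneck), nn.ReLU(), nn.Dropout(p),  # 32 -> 32
            nn.Linear(bottleneck, d_model),                               # 32 -> 256 projection
        )

    def forward(self, mel_frames):
        # The narrow bottleneck plus heavy dropout limits how much of the previous
        # frame can be copied, pushing the decoder to rely on the text-side attention.
        return self.net(mel_frames)
```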

### Experiments and Results
- The Quality of MultiSpeech

- Ablation Study

### Conclusions
In this paper, we developed MultiSpeech, a multi-speaker Transformer TTS system that leverages three techniques, namely the diagonal constraint in attention, layer normalization in the encoder, and the pre-net bottleneck in the decoder, to improve the text-to-speech alignments in the multi-speaker scenario. Experiments on the VCTK and LibriTTS multi-speaker datasets demonstrate the effectiveness of MultiSpeech:
1) it generates much higher-quality and more stable voice compared with Transformer TTS baseline.
2) using MultiSpeech as a teacher, we obtain a strong multi-speaker FastSpeech model that enjoys extremely fast inference speed.
In the future, we will continue to improve the voice quality of MultiSpeech and multi-speaker FastSpeech model to deliver better multi-speaker TTS solutions.
## ADASPEECH: ADAPTIVE TEXT TO SPEECH FOR CUSTOM VOICE
### Introduction
Nowadays, custom voice has attracted increasing interest in different application scenarios such as personal assistants, news broadcast and audio navigation, and has been widely supported in commercial speech platforms. **In custom voice, a source TTS model is usually adapted to a personalized voice with few adaptation data, since the users of custom voice prefer to record as little adaptation data as possible (several minutes or even seconds) for convenience**. Such limited adaptation data presents great challenges to the naturalness and similarity of the adapted voice.
Furthermore, there are also several distinctive challenges in custom voice:
1) The recordings of the custom users **are usually of different acoustic conditions from the source speech data** (the data used to train the source TTS model).
2) When adapting the source TTS model to a new voice, there is a **trade-off between the number of fine-tuned parameters and voice quality**. Generally speaking, more adaptation parameters usually result in better voice quality, which, in turn, increases the memory storage and serving cost.
In this paper, we propose AdaSpeech, an adaptive TTS model for high-quality and efficient customization of new voices. AdaSpeech employs a three-stage pipeline for custom voice:
- Pre-training
- Fine-tuning
- Inference
### ADASPEECH
The model structure of AdaSpeech is shown in Figure 1. We adopt FastSpeech 2 (Ren et al., 2021) as the model backbone, considering that the FastSpeech (Ren et al., 2019; 2021) series is one of the most popular model families in non-autoregressive TTS.

- Acoustic condition modeling
In text to speech, since the input text lacks enough acoustic conditions (such as speaker timbre, prosody and recording environment) to predict the target speech, the model tends to memorize and overfit on the training data, and generalizes poorly during adaptation. A natural way to solve this problem is to provide the corresponding acoustic conditions as input, so that the model learns a reasonable text-to-speech mapping with better generalization instead of memorizing.
To better model the acoustic conditions with different granularities, we categorize the acoustic conditions in different levels as shown in Figure 2a:
1) Speaker level: the coarse-grained acoustic conditions to capture the overall characteristics of a speaker;
2) Utterance level: the fine-grained acoustic conditions in each utterance of a speaker;
3) Phoneme level: the more fine-grained acoustic conditions in each phoneme of an utterance, such as accents on specific phonemes, pitches, prosody and temporal environmental noises (phoneme boundaries can be obtained with a forced aligner such as the Montreal Forced Aligner).
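As one illustration of the phoneme-level branch, the sketch below averages the mel frames aligned to each phoneme to obtain one acoustic-condition vector per phoneme, assuming phoneme durations from a forced aligner; the function name and exact interface are illustrative rather than the paper's implementation.

```python
import torch

def phoneme_level_acoustic_conditions(mel: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """mel: (S, n_mels) mel-spectrogram; durations: (T,) frames per phoneme, sum(durations) == S.

    Returns (T, n_mels): one averaged acoustic vector per phoneme (assumes every
    duration > 0), which a small phoneme-level acoustic encoder can then project
    and add to the corresponding phoneme hidden states.
    """
    chunks = torch.split(mel, durations.tolist(), dim=0)   # one chunk of frames per phoneme
    return torch.stack([c.mean(dim=0) for c in chunks])    # average frames within each phoneme
```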

- Conditional layer normalization

We find that layer normalization is adopted in each self-attention and feed-forward network in the decoder, which can greatly influence the hidden activation and final prediction with a lightweight learnable scale vector $γ$ and bias vector $β$: $LN(x) = γ\frac{x−µ}{σ}+β$, where $µ$ and $σ$ are the mean and standard deviation of the hidden vector $x$.
As shown in Figure 3, the conditional network consists of two simple linear layers $W^γ_c$ and $W^β_c$ that take the speaker embedding $E^s$ as input and output the scale and bias vectors, respectively: $$γ^s_c = E^s \cdot W^γ_c, \quad β^s_c = E^s \cdot W^β_c$$ where $s$ denotes the speaker ID, and $c \in [C]$ denotes that there are $C$ conditional layer normalizations in the decoder.
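A minimal sketch of conditional layer normalization following the formulas above (PyTorch; module and parameter names are illustrative). During adaptation, only these two linear layers (plus the speaker embedding) would need fine-tuning, which keeps the number of speaker-specific parameters small.

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """Layer normalization whose scale and bias are predicted from a speaker embedding."""
    def __init__(self, d_model: int, speaker_dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Linear(speaker_dim, d_model)   # W^gamma_c: speaker embedding -> scale
        self.bias = nn.Linear(speaker_dim, d_model)    # W^beta_c:  speaker embedding -> bias

    def forward(self, x: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model); speaker_emb: (B, speaker_dim)
        mu = x.mean(dim=-1, keepdim=True)
        sigma = x.std(dim=-1, keepdim=True)
        gamma = self.scale(speaker_emb).unsqueeze(1)   # (B, 1, d_model)
        beta = self.bias(speaker_emb).unsqueeze(1)     # (B, 1, d_model)
        return gamma * (x - mu) / (sigma + self.eps) + beta
```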
- Pipeline of ADASPEECH

### Results
- The quality of adaptation voice

- Ablation study

- Analyses on acoustic condition modeling

- Analyses on Conditional Layer Normalization

### Conclusions
In this paper, we have developed AdaSpeech, an adaptive TTS system that supports the distinctive requirements of custom voice. **We propose acoustic condition modeling to make the source TTS model more adaptable to custom voices with various acoustic conditions**. We further design conditional layer normalization to improve adaptation efficiency: fine-tuning only a few model parameters while achieving high voice quality.