More recent TTS work often uses phonemes as input to achieve better stability and to generalize better beyond the training set. However, phoneme-based models may suffer from the ambiguity of the representation, for example with homophones. Consider the sentence:
“To cancel the payment, press one; or to continue, two.”
In the phoneme representation, the trailing “…, two.” is easily confused with “…, too.”, which occurs more frequently in English.
However, in natural speech, different prosody is expected at the comma in these two patterns: the former (two) calls for a moderate pause at the comma, whereas no pause sounds more natural for the latter (too).
This paper introduces PnG BERT, an augmented BERT model that can be used as a drop-in replacement for the encoder in typical neural TTS models.
PnG BERT can benefit neural TTS models by taking advantage of both phoneme and grapheme representations, as well as by self-supervised pre-training on large text corpora to better understand the natural language in its input.
Input
The input to PnG BERT is composed of two segments, which are two different representations of the same content: a phoneme sequence and a grapheme sequence.
The special tokens CLS and SEP are added in the same way as in the original BERT. All tokens share the same ID space for embedding lookup and MLM classification.
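As a rough illustration (not the paper's actual code), the combined input could be assembled as follows; the phoneme symbols and the toy vocabulary below are assumptions made only for the example:

```python
# Hypothetical sketch of assembling a PnG BERT style input. The summary only
# specifies two segments sharing one ID space, with CLS and SEP added as in BERT.
CLS, SEP = "[CLS]", "[SEP]"

def build_png_bert_input(phonemes, graphemes, vocab):
    """[CLS] <phoneme segment> [SEP] <grapheme segment> [SEP]."""
    tokens = [CLS] + phonemes + [SEP] + graphemes + [SEP]
    token_ids = [vocab[t] for t in tokens]  # one shared embedding/ID space for all tokens
    # Segment IDs distinguish the phoneme segment (0) from the grapheme segment (1).
    segment_ids = [0] * (len(phonemes) + 2) + [1] * (len(graphemes) + 1)
    return token_ids, segment_ids

# Toy example using the homophone discussed above.
vocab = {t: i for i, t in enumerate(["[CLS]", "[SEP]", "T", "UW1", "t", "w", "o"])}
ids, segs = build_png_bert_input(["T", "UW1"], ["t", "w", "o"], vocab)
```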
Pre-training
Similar to the original BERT model, PnG BERT can be pre-trained on a plain text corpus in a self-supervised manner.
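A minimal sketch of what one such self-supervised (masked-language-model) step could look like, assuming BERT's usual 15% masking rate and a single [MASK] token in the shared vocabulary (both are assumptions, not details given in this summary):

```python
import random

MASK_ID = 3        # assumed ID of a [MASK] token in the shared vocabulary
MASK_PROB = 0.15   # BERT's usual masking rate, used here as an assumption

def mask_for_mlm(token_ids, special_ids):
    """Randomly mask tokens; the model learns to predict the original IDs
    at the masked positions (the MLM classification mentioned above)."""
    inputs, targets = [], []
    for tid in token_ids:
        if tid not in special_ids and random.random() < MASK_PROB:
            inputs.append(MASK_ID)
            targets.append(tid)    # predict the original token here
        else:
            inputs.append(tid)
            targets.append(-100)   # position ignored by the MLM loss
    return inputs, targets
```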
Fine-tuning
The PnG BERT model can be used as an input encoder for typical neural TTS models. Its weights can be initialized from a pre-trained model, and further fine-tuned during TTS training.
We freeze the weights of the lower Transformer layers and fine-tune only the higher layers, to prevent degradation due to the smaller TTS training set and to preserve the generalization ability of the trained TTS model.
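In a PyTorch-style implementation, the freezing could look roughly like this; the attribute names (`embeddings`, `encoder.layers`) and the split point are placeholders rather than the actual model definition:

```python
def freeze_lower_layers(png_bert, num_frozen):
    """Freeze the embeddings and the first `num_frozen` Transformer layers,
    so only the higher layers are updated during TTS fine-tuning."""
    for p in png_bert.embeddings.parameters():
        p.requires_grad = False
    for layer in png_bert.encoder.layers[:num_frozen]:
        for p in layer.parameters():
            p.requires_grad = False
    # The remaining higher layers (and the TTS decoder) stay trainable.
```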
Pre-training performance
TTS performance
Standard TTS systems are commonly developed under the assumption that they operate in ideal, clean environments. However, in reality, many modern applications, such as digital assistants, require TTS to communicate with users in noisy places.
Inspired by the human speech chain mechanism, a machine speech chain establishes a closed feedback loop between the listening component (ASR) and the speaking component (TTS) so that the two components can assist each other in semi-supervised learning on unpaired data.
In this work, we propose an advanced version of the machine speech chain that utilizes a feedback mechanism not only during training but also during inference. Simulating the Lombard effect\(^1\), we implement a machine speech chain for an end-to-end neural TTS in noisy environments that enables the TTS to speak louder in noisy conditions given the auditory feedback.
\(^1\)The Lombard effect is the tendency of speakers to increase their vocal effort when speaking in loud noise.
TTS with SNR feedback
The TTS generates a speech waveform based on the text input and on feedback from an SNR embedding. The SNR feedback represents a power or intensity measurement of how well the TTS speech can be heard in a noisy environment.
\[ Z_{SNR} = \operatorname{SNR~Embedding}\left(y^{\text{noisy}}\right) \]
\[ \operatorname{Loss}_{SNR}\left(l, p_l\right) = -\sum_{c_l=1}^{C_l} \mathbb{1}\left(l = c_l\right) \log p_l\left[c_l\right] \]
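One plausible reading of this feedback path in PyTorch-style code, assuming the SNR of the noisy signal is quantized into \(C_l\) discrete levels before the embedding lookup (the pooling, layer sizes, and quantization are illustrative assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class SNRFeedback(nn.Module):
    """Classify the SNR level of the noisy TTS speech and embed it as Z_SNR."""
    def __init__(self, num_levels, feat_dim, emb_dim):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_levels)   # produces p_l over the C_l levels
        self.embedding = nn.Embedding(num_levels, emb_dim)   # SNR Embedding(.)

    def forward(self, noisy_feats, snr_label=None):
        # noisy_feats: (batch, time, feat_dim) features of y^noisy (assumed shape)
        logits = self.classifier(noisy_feats.mean(dim=1))     # pool over time
        z_snr = self.embedding(logits.argmax(dim=-1))         # Z_SNR fed back to the TTS
        loss = None
        if snr_label is not None:
            # Loss_SNR: categorical cross-entropy over the SNR levels l
            loss = F.cross_entropy(logits, snr_label)
        return z_snr, loss
```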
TTS with SNR-ASR feedback
The second version of the TTS generates a speech waveform based on the text input and on feedback from both the SNR embedding and an ASR-loss embedding. The ASR-loss embedding represents a speech-intelligibility measurement of how well the noisy TTS speech can be recognized.
\[ Z_{ASR} = \operatorname{ASR~Loss~Embedding}\left(\operatorname{Loss}_{ASR}\left(x, \tilde{p}_x\right)\right) \]
\[ \tilde{p}_x = p\left(x \mid y^{\text{noisy}}\right) \]
\[ \operatorname{Loss}_{ASR}\left(x_s, p_{x_s}\right) = -\sum_{c} \mathbb{1}\left(x_s = c\right) \log p_{x_s}\left[c\right] \]
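A similarly hedged sketch for the ASR-loss feedback, assuming the scalar ASR loss is bucketed into discrete bins before the embedding lookup (the bucketing scheme and the `max_loss` bound are assumptions made for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASRLossFeedback(nn.Module):
    """Embed the ASR loss computed on the noisy TTS speech as Z_ASR."""
    def __init__(self, num_buckets, emb_dim, max_loss=10.0):
        super().__init__()
        self.embedding = nn.Embedding(num_buckets, emb_dim)
        self.num_buckets = num_buckets
        self.max_loss = max_loss  # assumed upper bound used for bucketing

    def forward(self, asr_logits, text_targets):
        # asr_logits: (batch, time, vocab) ASR logits over the text given y^noisy
        # text_targets: (batch, time) target text token IDs x
        asr_loss = F.cross_entropy(asr_logits.transpose(1, 2), text_targets)  # Loss_ASR
        # Quantize the scalar loss into a bucket index before the lookup (assumption).
        bucket = torch.clamp((asr_loss / self.max_loss * self.num_buckets).long(),
                             max=self.num_buckets - 1)
        z_asr = self.embedding(bucket.unsqueeze(0))  # Z_ASR fed back to the TTS
        return z_asr, asr_loss
```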
TTS with SNR-ASR feedback and variance adaptor
Humans tend to increase their speech intensity and pitch in noisy environments, and also to speak more slowly. Therefore, in addition to the SNR and ASR-loss embedding feedback, we apply a variance adaptor module within the proposed TTS, following an approach similar to FastSpeech2, to guide the prosody adaptation.
\[
\begin{aligned}
\operatorname{Loss}_{TTS}(\boldsymbol{y}, \hat{\boldsymbol{y}}) = {}& \frac{1}{T} \sum_{t=1}^{T}\left(\left(y_t-\hat{y}_t\right)^2-\left(b_t \log \hat{b}_t+\left(1-b_t\right) \log \left(1-\hat{b}_t\right)\right)\right) \\
&+\operatorname{Loss}_{\text{pred}}\left(v^P, \hat{v}^P\right)+\operatorname{Loss}_{\text{pred}}\left(v^G, \hat{v}^G\right)+\operatorname{Loss}_{\text{pred}}\left(v^D, \hat{v}^D\right)
\end{aligned}
\]
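Read literally, the objective combines a per-frame spectrogram error, a stop-token term on \(b_t\), and the variance-predictor losses. A hedged sketch in code, assuming \(\operatorname{Loss}_{\text{pred}}\) is a mean-squared error on the pitch, gain (energy), and duration targets \(v^P, v^G, v^D\) in the FastSpeech2 style (an assumption, since \(\operatorname{Loss}_{\text{pred}}\) is not defined here):

```python
import torch.nn.functional as F

def tts_loss(y, y_hat, b, b_hat, variances, variances_hat):
    """Sketch of Loss_TTS: frame-level MSE on the spectrogram, binary
    cross-entropy on the stop token b_t, plus the variance-predictor losses.
    `variances` / `variances_hat` hold the (v^P, v^G, v^D) targets and predictions;
    Loss_pred is taken to be MSE here, which is an assumption."""
    frame_loss = F.mse_loss(y_hat, y)                 # (y_t - y_hat_t)^2, averaged over T
    stop_loss = F.binary_cross_entropy(b_hat, b)      # -(b_t log b_hat_t + (1-b_t) log(1-b_hat_t))
    pred_loss = sum(F.mse_loss(v_hat, v)
                    for v, v_hat in zip(variances, variances_hat))
    return frame_loss + stop_loss + pred_loss
```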