PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS

Introduction

Recent TTS systems often use phonemes as input to achieve better stability and to generalize better beyond their training sets. However, phoneme-based models can suffer from the ambiguity of the representation, for example on homophones.

“To cancel the payment, press one; or to continue, two.”
In the phoneme representation, the trailing “, two.” can easily be confused with “, too.”, which occurs more frequently in English.

However, in natural speech, different prosody is expected at the comma position in these two patterns: the former (two) calls for a moderate pause at the comma, while no pause sounds more natural for the latter (too).

This paper introduces PnG BERT, an augmented BERT model that can be used as a drop-in replacement for the encoder in typical neural TTS models.
PnG BERT can benefit neural TTS models by taking advantage of both phoneme and grapheme representations, as well as by self-supervised pre-training on large text corpora to better understand the natural language in its input.

PnG BERT


  • Input
The input to PnG BERT is composed of two segments, which are two different representations of the same content.

    1. Phoneme sequence, containing individual IPA phonemes as tokens.
    2. Grapheme sequence, containing subword units from text as tokens.

    The special tokens CLS and SEP are added in the same way as in the original BERT. All tokens share the same ID space for embedding lookup and MLM classification.
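The two-segment input described above can be sketched as follows; the toy phoneme and subword tokens, the `CLS`/`SEP` strings, and the segment-ID convention (0 for phonemes, 1 for graphemes) are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: assemble a PnG BERT input from a phoneme segment and a grapheme
# segment of the same content, with CLS/SEP added as in the original BERT.

def build_png_bert_input(phonemes, graphemes):
    """Concatenate the two segments and return (tokens, segment_ids)."""
    tokens = ["CLS"] + phonemes + ["SEP"] + graphemes + ["SEP"]
    # Segment IDs: 0 for the phoneme half (incl. CLS and first SEP),
    # 1 for the grapheme half (incl. final SEP) -- an assumed convention.
    segment_ids = [0] * (len(phonemes) + 2) + [1] * (len(graphemes) + 1)
    return tokens, segment_ids

# "hello" -> toy IPA phonemes and subword graphemes
tokens, seg = build_png_bert_input(["h", "ə", "l", "oʊ"], ["hel", "##lo"])
print(tokens)
print(seg)
```

Because all tokens share one ID space, a single embedding table (and a single MLM softmax) serves both segments.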

  • Pretraining
    Similar to the original BERT model, PnG BERT can be pretrained on a plain text corpus in a self-supervised manner.

    • Masking strategy
      The input to PnG BERT consists of two distinct sequences representing essentially the same content. If random masking were applied independently as in the original BERT, a masked phoneme token could often be recovered from the corresponding unmasked grapheme tokens (and vice versa), making the MLM prediction significantly easier and reducing the effectiveness of the pre-training.
      To avoid this issue, we apply random masking at the word level, consistently masking out the phoneme and grapheme tokens belonging to the same word.
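A minimal sketch of this word-level masking, assuming each token is annotated with the index of the word it belongs to (special tokens get `None`); the helper name and data layout are made up for illustration:

```python
import random

def word_level_mask(tokens, word_ids, mask_prob=0.15, mask_token="MSK", seed=0):
    """Mask whole words: a word's phoneme tokens and grapheme tokens share a
    word index, so they are always masked (or kept) together."""
    rng = random.Random(seed)
    words = sorted({w for w in word_ids if w is not None})
    masked_words = {w for w in words if rng.random() < mask_prob}
    return [mask_token if w in masked_words else t
            for t, w in zip(tokens, word_ids)]

# Two words; word 0 appears in both the phoneme and the grapheme segment.
tokens   = ["CLS", "p1", "p2", "p3", "SEP", "g1", "g2", "SEP"]
word_ids = [None,   0,    0,    1,   None,   0,    1,   None]
print(word_level_mask(tokens, word_ids, mask_prob=0.5, seed=3))
```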
  • Fine-tuning
    The PnG BERT model can be used as an input encoder for typical neural TTS models. Its weights can be initialized from a pre-trained model, and further fine-tuned during TTS training.

    We freeze the weights of the lower Transformer layers and fine-tune only the higher layers, to prevent degradation due to the smaller TTS training set and to improve the generalization of the trained TTS model.
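The layer-freezing scheme can be sketched framework-agnostically; the 12-layer depth and the 6/6 split below are illustrative assumptions, not the paper's reported configuration:

```python
# Sketch: mark the lower encoder layers as frozen and only the top
# layers as trainable for TTS fine-tuning.

def freeze_lower_layers(num_layers=12, num_trainable_top=6):
    """Return a per-layer trainable flag (False = frozen lower layer)."""
    split = num_layers - num_trainable_top
    return [i >= split for i in range(num_layers)]

print(freeze_lower_layers())  # lower half frozen, upper half trainable
```

In a typical framework this flag would gate `requires_grad` (or the optimizer's parameter list) for each layer's weights.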

Experiments

  • Pre-training performance


  • TTS performance


Conclusions

  • Experimental results showed that PnG BERT can significantly improve the performance of NAT (Non-Attentive Tacotron), a state-of-the-art neural TTS model.
  • Subjective side-by-side preference evaluation showed that raters had no statistically significant preference between the speech synthesized using PnG BERT and the ground truth recordings from professional speakers.

Dynamically Adaptive Machine Speech Chain Inference for TTS in Noisy Environment: Listen and Speak Louder

1. Introduction

Standard TTS systems are commonly developed by assuming they are operating in ideal clean environments. However, in reality, many modern applications, such as digital assistants, require TTS to communicate with users in noisy places.

Inspired by the human speech chain mechanism, a machine speech chain establishes a closed feedback loop between the listening component (ASR) and the speaking component (TTS), so that both components can assist each other in semi-supervised learning given unpaired data.


In this work, we propose an advanced version of a machine speech chain that utilizes a feedback mechanism, not only during training but also during inference. Simulating the Lombard effect\(^1\), we implement a machine speech chain for an end-to-end neural TTS in noisy environments that enables TTS to speak louder in noisy conditions given the auditory feedback.

\(^1\)The Lombard effect is the tendency of speakers to increase their vocal effort when speaking in loud noise.

2. Proposed TTS in Speech Chain Framework


  • TTS with SNR feedback
    TTS generates a speech waveform based on the text input and feedback from an SNR embedding. The SNR feedback is a power (intensity) measurement of how well the TTS speech can be heard in a noisy environment.
    \[ Z_{SNR} = \operatorname{SNR\ Embedding}\left(y^{\text{noisy}}\right) \]
    \[ \operatorname{Loss}_{SNR}\left(l, p_l\right) = -\sum_{c_l=1}^{C_l} \mathbb{1}\left(l = c_l\right) \cdot \log p_l\left[c_l\right] \]
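Numerically, the SNR loss above is a plain cross-entropy over the \(C_l\) discrete SNR classes; the class count and probabilities below are toy values for illustration:

```python
import math

def snr_loss(label, probs):
    """-sum_c 1(label == c) * log probs[c], i.e. simply -log probs[label]."""
    return -math.log(probs[label])

# Toy distribution over 3 SNR classes; the true class is 1.
print(round(snr_loss(1, [0.1, 0.7, 0.2]), 4))
```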

  • TTS with SNR-ASR feedback
    The second version of TTS generates a speech waveform based on the text input and feedback from both the SNR and ASR-loss embeddings. The ASR-loss embedding is a speech-intelligibility measurement of how well the noisy TTS speech can be recognized.
    \[ Z_{ASR} = \operatorname{ASR\ Loss\ Embedding}\left(\operatorname{Loss}_{ASR}\left(x, \tilde{p}_x\right)\right) \]
    \[ \tilde{p}_x = p\left(x \mid y^{\text{noisy}}\right) \]
    \[ \operatorname{Loss}_{ASR}\left(x_s, p_{x_s}\right) = -\sum_{c} \mathbb{1}\left(x_s = c\right) \cdot \log p_{x_s}[c] \]

  • TTS with SNR-ASR feedback and variance adaptor
    Humans tend to increase their speech intensity and pitch in noisy environments, and also to speak more slowly. Therefore, in addition to the SNR and ASR-loss embedding feedback, we apply a variance adaptor module within the proposed TTS, following an approach similar to FastSpeech 2, to guide the prosody adaptation.

    \[\begin{aligned} \operatorname{Loss}_{TTS}(\boldsymbol{y}, \hat{\boldsymbol{y}}) = {} & \frac{1}{T} \sum_{t=1}^{T}\left(\left(y_t-\hat{y}_t\right)^2-\left(b_t \log \hat{b}_t+\left(1-b_t\right) \log \left(1-\hat{b}_t\right)\right)\right) \\ & + \operatorname{Loss}_{\text{pred}}\left(v^P, \hat{v}^P\right)+\operatorname{Loss}_{\text{pred}}\left(v^G, \hat{v}^G\right)+\operatorname{Loss}_{\text{pred}}\left(v^D, \hat{v}^D\right) \end{aligned}\]
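The combined loss can be checked with a small numeric sketch: frame-level spectrogram MSE plus stop-token binary cross-entropy, plus predictor losses for the three variance targets (\(v^P, v^G, v^D\)). Treating each \(\operatorname{Loss}_{\text{pred}}\) as an MSE is an assumption for illustration:

```python
import math

def loss_pred(v, v_hat):
    # Assumed MSE form for the variance-predictor losses.
    return sum((a - b) ** 2 for a, b in zip(v, v_hat)) / len(v)

def tts_loss(y, y_hat, b, b_hat, vP, vP_hat, vG, vG_hat, vD, vD_hat):
    T = len(y)
    # Spectrogram MSE plus stop-token BCE, averaged over T frames.
    frame = sum((yt - yh) ** 2
                - (bt * math.log(bh) + (1 - bt) * math.log(1 - bh))
                for yt, yh, bt, bh in zip(y, y_hat, b, b_hat)) / T
    return (frame + loss_pred(vP, vP_hat)
                  + loss_pred(vG, vG_hat)
                  + loss_pred(vD, vD_hat))

# Perfect spectrogram and variance predictions: only the stop-token BCE remains.
print(tts_loss([0.0, 1.0], [0.0, 1.0], [0, 1], [0.2, 0.9],
               [1.0], [1.0], [2.0], [2.0], [3.0], [3.0]))
```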

3. Experiments

  • Dataset: Wall Street Journal (WSJ) corpus
  • Natural Lombard speech recorded by a single male speaker, who read the WSJ development and test sets under noise conditions

4. Conclusions

  • We constructed a dynamically adaptive machine speech chain inference framework to support TTS in noisy conditions.
  • Our proposed systems with auditory feedback and a variance adaptor produced highly intelligible speech, surpassing a standard TTS with fine-tuning and coming closer to human performance.