More recent TTS work often uses phonemes as input to achieve better stability and to generalize better beyond the training set. However, phoneme-based models may suffer from the ambiguity of the representation, for example with homophones. Consider the sentence:
“To cancel the payment, press one; or to continue, two.”
In the phoneme representation, the trailing “…, two.” is easily confused with “…, too.”, which occurs more frequently in English.
However, in natural speech, different prosody is expected at the comma in these two patterns: the former (two) calls for a moderate pause at the comma, whereas no pause sounds more natural for the latter (too).
This paper introduces PnG BERT, an augmented BERT model that can be used as a drop-in replacement for the encoder in typical neural TTS models.
PnG BERT can benefit neural TTS models by taking advantage of both phoneme and grapheme representations, as well as by self-supervised pre-training on large text corpora to better understand the natural language in its input.
Input
The input to PnG BERT is composed of two segments, which are two different representations of the same content: a phoneme sequence and a grapheme sequence.
The special tokens CLS and SEP are added in the same way as in the original BERT. All tokens share the same ID space for embedding lookup and MLM classification.
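As a rough illustration (not the paper's actual code), the combined input could be assembled as follows; the phoneme symbols and the toy vocabulary below are assumptions made only for the example:

```python
# Hypothetical sketch of assembling a PnG BERT style input. The summary only
# specifies two segments sharing one ID space, with CLS and SEP added as in BERT.
CLS, SEP = "[CLS]", "[SEP]"

def build_png_bert_input(phonemes, graphemes, vocab):
    """[CLS] <phoneme segment> [SEP] <grapheme segment> [SEP]."""
    tokens = [CLS] + phonemes + [SEP] + graphemes + [SEP]
    token_ids = [vocab[t] for t in tokens]  # one shared embedding/ID space for all tokens
    # Segment IDs distinguish the phoneme segment (0) from the grapheme segment (1).
    segment_ids = [0] * (len(phonemes) + 2) + [1] * (len(graphemes) + 1)
    return token_ids, segment_ids

# Toy example using the homophone discussed above.
vocab = {t: i for i, t in enumerate(["[CLS]", "[SEP]", "T", "UW1", "t", "w", "o"])}
ids, segs = build_png_bert_input(["T", "UW1"], ["t", "w", "o"], vocab)
```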
Pre-training
Similar to the original BERT model, PnG BERT can be pre-trained on a plain text corpus in a self-supervised manner.
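A minimal sketch of what one such self-supervised (masked-language-model) step could look like, assuming BERT's usual 15% masking rate and a single [MASK] token in the shared vocabulary (both are assumptions, not details given in this summary):

```python
import random

MASK_ID = 3        # assumed ID of a [MASK] token in the shared vocabulary
MASK_PROB = 0.15   # BERT's usual masking rate, used here as an assumption

def mask_for_mlm(token_ids, special_ids):
    """Randomly mask tokens; the model learns to predict the original IDs
    at the masked positions (the MLM classification mentioned above)."""
    inputs, targets = [], []
    for tid in token_ids:
        if tid not in special_ids and random.random() < MASK_PROB:
            inputs.append(MASK_ID)
            targets.append(tid)    # predict the original token here
        else:
            inputs.append(tid)
            targets.append(-100)   # position ignored by the MLM loss
    return inputs, targets
```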
Fine-tuning
The PnG BERT model can be used as an input encoder for typical neural TTS models. Its weights can be initialized from a pre-trained model, and further fine-tuned during TTS training.
We freeze the weights of the lower Transformer layers and fine-tune only the higher layers, to prevent degradation due to the smaller TTS training set and to preserve the generalization ability of the trained TTS model.
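In a PyTorch-style implementation, the freezing could look roughly like this; the attribute names (`embeddings`, `encoder.layers`) and the split point are placeholders rather than the actual model definition:

```python
def freeze_lower_layers(png_bert, num_frozen):
    """Freeze the embeddings and the first `num_frozen` Transformer layers,
    so only the higher layers are updated during TTS fine-tuning."""
    for p in png_bert.embeddings.parameters():
        p.requires_grad = False
    for layer in png_bert.encoder.layers[:num_frozen]:
        for p in layer.parameters():
            p.requires_grad = False
    # The remaining higher layers (and the TTS decoder) stay trainable.
```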
Pre-training performance
TTS performance
Standard TTS systems are commonly developed under the assumption that they operate in ideal, clean environments. However, in reality, many modern applications, such as digital assistants, require TTS to communicate with users in noisy places.
Inspired by the human speech chain mechanism, a machine speech chain establishes a closed feedback loop between the listening component (ASR) and the speaking component (TTS) so that the two components can assist each other in semi-supervised learning on unpaired data.
In this work, we propose an advanced version of the machine speech chain that utilizes a feedback mechanism not only during training but also during inference. Simulating the Lombard effect\(^1\), we implement a machine speech chain for an end-to-end neural TTS in noisy environments that enables the TTS to speak louder in noisy conditions given the auditory feedback.
\(^1\)The Lombard effect is the tendency of speakers to increase their vocal effort when speaking in loud noise.
TTS with SNR feedback
The TTS generates a speech waveform based on the text input and on feedback from an SNR embedding. The SNR feedback represents a power or intensity measurement of how well the TTS speech can be heard in a noisy environment.
\[ Z_{SNR} = \operatorname{SNR~Embedding}\left(y^{\text{noisy}}\right) \]
\[ \operatorname{Loss}_{SNR}\left(l, p_l\right) = -\sum_{c_l=1}^{C_l} \mathbb{1}\left(l = c_l\right) \log p_l\left[c_l\right] \]
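One plausible reading of this feedback path in PyTorch-style code, assuming the SNR of the noisy signal is quantized into \(C_l\) discrete levels before the embedding lookup (the pooling, layer sizes, and quantization are illustrative assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class SNRFeedback(nn.Module):
    """Classify the SNR level of the noisy TTS speech and embed it as Z_SNR."""
    def __init__(self, num_levels, feat_dim, emb_dim):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_levels)   # produces p_l over the C_l levels
        self.embedding = nn.Embedding(num_levels, emb_dim)   # SNR Embedding(.)

    def forward(self, noisy_feats, snr_label=None):
        # noisy_feats: (batch, time, feat_dim) features of y^noisy (assumed shape)
        logits = self.classifier(noisy_feats.mean(dim=1))     # pool over time
        z_snr = self.embedding(logits.argmax(dim=-1))         # Z_SNR fed back to the TTS
        loss = None
        if snr_label is not None:
            # Loss_SNR: categorical cross-entropy over the SNR levels l
            loss = F.cross_entropy(logits, snr_label)
        return z_snr, loss
```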
TTS with SNR-ASR feedback
The second version of the TTS generates a speech waveform based on the text input and on feedback from both the SNR embedding and an ASR-loss embedding. The ASR-loss embedding represents a speech-intelligibility measurement of how well the noisy TTS speech can be recognized.
\[ Z_{ASR} = \operatorname{ASR~Loss~Embedding}\left(\operatorname{Loss}_{ASR}\left(x, \tilde{p}_x\right)\right) \]
\[ \tilde{p}_x = p\left(x \mid y^{\text{noisy}}\right) \]
\[ \operatorname{Loss}_{ASR}\left(x_s, p_{x_s}\right) = -\sum_{c} \mathbb{1}\left(x_s = c\right) \log p_{x_s}\left[c\right] \]
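A similarly hedged sketch for the ASR-loss feedback, assuming the scalar ASR loss is bucketed into discrete bins before the embedding lookup (the bucketing scheme and the `max_loss` bound are assumptions made for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASRLossFeedback(nn.Module):
    """Embed the ASR loss computed on the noisy TTS speech as Z_ASR."""
    def __init__(self, num_buckets, emb_dim, max_loss=10.0):
        super().__init__()
        self.embedding = nn.Embedding(num_buckets, emb_dim)
        self.num_buckets = num_buckets
        self.max_loss = max_loss  # assumed upper bound used for bucketing

    def forward(self, asr_logits, text_targets):
        # asr_logits: (batch, time, vocab) ASR logits over the text given y^noisy
        # text_targets: (batch, time) target text token IDs x
        asr_loss = F.cross_entropy(asr_logits.transpose(1, 2), text_targets)  # Loss_ASR
        # Quantize the scalar loss into a bucket index before the lookup (assumption).
        bucket = torch.clamp((asr_loss / self.max_loss * self.num_buckets).long(),
                             max=self.num_buckets - 1)
        z_asr = self.embedding(bucket.unsqueeze(0))  # Z_ASR fed back to the TTS
        return z_asr, asr_loss
```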
TTS with SNR-ASR feedback and variance adaptor
Humans tend to increase their speech intensity and pitch in noisy environments, and also to speak more slowly. Therefore, in addition to the SNR and ASR-loss embedding feedback, we apply a variance adaptor module within the proposed TTS, following an approach similar to FastSpeech2, to guide the prosody adaptation.
\[
\begin{aligned}
\operatorname{Loss}_{TTS}(\boldsymbol{y}, \hat{\boldsymbol{y}}) = {}& \frac{1}{T} \sum_{t=1}^{T}\left(\left(y_t-\hat{y}_t\right)^2-\left(b_t \log \hat{b}_t+\left(1-b_t\right) \log \left(1-\hat{b}_t\right)\right)\right) \\
&+\operatorname{Loss}_{\text{pred}}\left(v^P, \hat{v}^P\right)+\operatorname{Loss}_{\text{pred}}\left(v^G, \hat{v}^G\right)+\operatorname{Loss}_{\text{pred}}\left(v^D, \hat{v}^D\right)
\end{aligned}
\]
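Read literally, the objective combines a per-frame spectrogram error, a stop-token term on \(b_t\), and the variance-predictor losses. A hedged sketch in code, assuming \(\operatorname{Loss}_{\text{pred}}\) is a mean-squared error on the pitch, gain (energy), and duration targets \(v^P, v^G, v^D\) in the FastSpeech2 style (an assumption, since \(\operatorname{Loss}_{\text{pred}}\) is not defined here):

```python
import torch.nn.functional as F

def tts_loss(y, y_hat, b, b_hat, variances, variances_hat):
    """Sketch of Loss_TTS: frame-level MSE on the spectrogram, binary
    cross-entropy on the stop token b_t, plus the variance-predictor losses.
    `variances` / `variances_hat` hold the (v^P, v^G, v^D) targets and predictions;
    Loss_pred is taken to be MSE here, which is an assumption."""
    frame_loss = F.mse_loss(y_hat, y)                 # (y_t - y_hat_t)^2, averaged over T
    stop_loss = F.binary_cross_entropy(b_hat, b)      # -(b_t log b_hat_t + (1-b_t) log(1-b_hat_t))
    pred_loss = sum(F.mse_loss(v_hat, v)
                    for v, v_hat in zip(variances, variances_hat))
    return frame_loss + stop_loss + pred_loss
```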