# How to prepare a dataset for neural Text-To-Speech — Part 1: Text Preparation

In recent years, deep learning-based Text-To-Speech systems have outperformed other approaches in terms of speech quality and naturalness. In 2017, Google proposed [Tacotron2](https://arxiv.org/abs/1712.05884), an end-to-end system capable of generating high-quality speech approaching the human voice. Since then, deep learning synthesizers have become a hot topic, and many researchers and companies have published follow-up architectures such as the [FastPitch](https://arxiv.org/abs/2006.06873) spectrogram generator and the [HiFi-GAN](https://arxiv.org/abs/2010.05646) vocoder.

Despite ongoing research on few-shot TTS models, top architectures still require relatively large datasets for training. Moreover, both spectrogram generators and vocoders are sensitive to errors and imperfections in the training data. There are datasets in the public domain, such as [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) and [M-AILABS](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/). However, only a few of them provide good enough quality, especially in languages other than English.

In this article, we present our approach to building a custom dataset for a deep learning-based text-to-speech model. We explain our methodology and provide tips on how to ensure high quality of both transcriptions and audio. We split the article into two parts. The first one focuses on text: where to get transcriptions and how to prepare them for recording. The [second part](link_part_2) mainly covers recording and audio processing.

*Keep in mind that there are differences between languages. Some of our findings might not apply to all of them.*

## Source selection

First things first, we need a source of text data. The easiest to obtain, and probably the most commonly used, sources of text are public domain books. While we agree that they are a great source, there are a few things to consider:

1) Many public domain books are old, which might make transcriptions archaic, especially in languages that have changed over the years.
2) They may lack difficult words and offer little variety (e.g. children's books).
3) They might have insufficient punctuation (e.g. there are very few questions in nature books).

Since TTS models learn to map n-grams to sounds, if the dataset lacks some of them (which might be caused by 1 or 2), the model will have problems with their pronunciation. This will be especially noticeable in words with difficult or unique pronunciation (this property of a dataset is often called phoneme coverage). The third issue might cause problems such as failing to stress questions, or strange behaviour when faced with a lot of punctuation, e.g. in enumerations.

Therefore, to ensure good phoneme coverage and a sufficient amount of punctuation, we recommend diversifying the sources of transcriptions. Each source and genre has its own characteristics that we should consider when building a dataset. For example, interviews on average contain significantly more questions than news outlets or articles about nature do. Dialogue-heavy stories have more punctuation. Scientific papers use more advanced vocabulary. Press articles will probably require more normalization (more on that later), some sources might contain a lot of foreign words (which we want to avoid), and so on.
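Phoneme coverage is hard to measure exactly without a phonemizer, but character bigram coverage is a rough proxy that is easy to compute and already reveals gaps between sources. Below is a minimal sketch that compares letter-bigram coverage across candidate sources; the directory layout and the `*.txt` extension are assumptions for illustration, not requirements of any particular pipeline.

```python
from collections import Counter
from pathlib import Path

# Hypothetical layout: one subdirectory per candidate source, each with plain-text files.
SOURCES_DIR = Path('/path/to/candidate/sources')

def character_bigrams(text: str) -> Counter:
    """Count letter bigrams, ignoring case, digits and punctuation."""
    letters = [c.lower() for c in text if c.isalpha()]
    return Counter(''.join(pair) for pair in zip(letters, letters[1:]))

coverage = {}
for source_dir in sorted(p for p in SOURCES_DIR.iterdir() if p.is_dir()):
    counts = Counter()
    for text_file in source_dir.rglob('*.txt'):
        counts += character_bigrams(text_file.read_text(encoding='utf-8'))
    coverage[source_dir.name] = counts

# Bigrams present in some sources but absent from others hint at gaps
# a single-source dataset would have.
all_bigrams = set().union(*coverage.values())
for name, counts in coverage.items():
    missing = [bigram for bigram in all_bigrams if counts[bigram] == 0]
    print(f"{name}: {len(counts)} distinct bigrams, {len(missing)} missing that appear elsewhere")
```

For languages with near-phonemic spelling, a comparison like this is usually enough to spot a source that would leave whole classes of sounds underrepresented; for English, running the same counts over phonemized text gives a more faithful picture.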
If we intend to use our TTS model for something more specific, adding transcriptions that contain domain-specific words may also improve the result.

## How much data is enough?

Like most things, it depends, and it's hard to find a precise answer. Many papers recommend 15-20 hours of audio when training from scratch. For fine-tuning, it depends on the architecture. For example, according to the [HiFi-GAN GitHub issues](https://github.com/jik876/hifi-gan/issues) page, 10 hours of data should be enough for fine-tuning.

During our research on a Polish TTS model, we found that for a FastPitch spectrogram generator trained with an [external aligner](https://arxiv.org/abs/2108.10447), even one hour was enough to get good-sounding results when training from scratch. We think that the simplicity of Polish pronunciation improves the results: in Polish, graphemes map almost directly onto phonemes. Fine-tuning the spectrogram generator from a model pre-trained on a larger dataset did not improve the results in our case; we only tried cross-language fine-tuning.

We found that the HiFi-GAN vocoder can be fine-tuned using less than an hour of recorded speech. The main factor influencing quality is voice similarity. In our case, we took a model pre-trained on LJSpeech and, after only a few thousand iterations on samples from our female speaker with a similar pitch, we got sharp-sounding results. It is worth mentioning that flow-based vocoders like [WaveGlow](https://arxiv.org/abs/1811.00002) are more universal and can even be used without fine-tuning on our data. However, they are slower and, in our case, gave worse results.

## How to estimate text dataset length?

How can we find out how many hours of audio we will get from the collected text? For that we need the speech rate: the number of words spoken in one minute. Average speech rates depend on context and language.

* Presentations: between 100 - 150 wpm for a comfortable pace
* Conversational: between 120 - 150 wpm
* Audiobooks: between 150 - 160 wpm, which is the upper range for people to comfortably hear and vocalise words
* Radio hosts and podcasters: between 150 - 160 wpm
* Auctioneers: can speak at about 250 wpm
* Commentators: between 250 - 400 wpm

*Source: https://virtualspeech.com/blog/average-speaking-rate-words-per-minute*

As you can see, the best speech rate for a general-use TTS system is around 150 words per minute; at that pace, one hour of speech corresponds to roughly 9,000 words of text. The LJSpeech speech rate is approx. 140 wpm. Of course, you should consider your speakers' natural speech rate to ensure they sound authentic. After settling on a speech rate, the dataset length can be estimated with the following Python script.
```python
import os
import re
import string

DIR_PATH = '/path/to/transcriptions/'  # Path to the directory with our transcriptions
SPEECH_RATE = 120  # Speech rate (words per minute)

transcriptions = {}
word_count = 0

for current_path, folders, files in os.walk(DIR_PATH):
    for file_name in files:
        path = os.path.join(current_path, file_name)
        with open(path) as transcription_file:
            transcription = transcription_file.read()
        for punct in string.punctuation:
            transcription = transcription.replace(punct, ' ')  # Replace punctuation with whitespaces
        transcription = re.sub(' {2,}', ' ', transcription)  # Delete repeated whitespaces
        transcriptions[path] = len(transcription.split())
        word_count += transcriptions[path]

print(f"There are {word_count} words in total, which is around {word_count / SPEECH_RATE} minutes.")
```

It simply sums the number of words in every transcript and divides that total by your speaker's estimated speech rate.

## Preprocessing transcriptions

It is useful to think of text-to-speech synthesis as a many-to-many mapping problem. We want our model to learn how to turn letters into sounds, hence the size of the space of all possible mappings between phonemes/letters and sounds matters. As mentioned before, a spectrogram generator without a phonemizer may perform better on a Polish dataset than on an English one of the same size. It seems easier to train a TTS model for languages like Polish or Spanish, whose orthographic systems come closer to being consistent [phonemic representations](https://en.wikipedia.org/wiki/Phonemic_orthography). Below is a visualisation of the many-to-many problem for English TTS.

![diagram](https://i.imgur.com/f9vtSPr.png)

To minimize the number of mappings between characters and sounds, we should normalize our dataset. After this step, it should contain only the words and punctuation we need: digits are verbalized (numbers written out as words) and abbreviations are expanded into their full forms.

| Raw | Normalized |
|----------|-------------|
| David Bowie was born on **8 Jan 1947.** | David Bowie was born on **eighth January nineteen forty-seven.** |
| Bakery is on **47** Old Brompton **Rd**. | Bakery is on **forty-seven** Old Brompton **Road**. |

Moreover, we need to take care of borrowings: words from other languages that may appear in our texts. Borrowings should be written as you hear them.

| Raw | Normalized |
|----------|-------------|
| She yelled – **Garçon**! | She yelled – **Garsawn**! |
| Casual greeting in Polish is "**[cześć](https://en.wiktionary.org/wiki/cze%C5%9B%C4%87)**". | Casual greeting in Polish is "**cheshch**". |

Of course, you can refrain from doing that if you don't have any borrowings in your dataset or if you enforce a specific way of reading on your speakers, but for their comfort and simplicity, it is better to just write borrowings as you hear them.

Another problematic part of language normalization is acronyms. In our dataset, we decided that acronyms that are read as letters will be split with '-'.

| Raw | Normalized |
|----------|-------------|
| **UK** | **U-K** |
| **OPEC** | **OPEC** |

As in the example above, OPEC is read as a word, not as letters, therefore we don't need to split it.

The last tip on normalization is to choose a symbol set and filter out unwanted symbols, as this will make the further stages of processing easier and simpler.

After collecting and processing the text, it's time to record and form our brand-new dataset.
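To make these rules concrete, here is a minimal sketch of a normalization pass that expands abbreviations, hyphenates letter-read acronyms, and flags every character outside the chosen symbol set for manual verbalization. The abbreviation dictionary, acronym lists, and symbol set below are illustrative placeholders, not the ones we actually used.

```python
import re

# Illustrative lists; a real dataset needs hand-curated, language-specific ones.
ABBREVIATIONS = {"Rd.": "Road", "St.": "Street", "Dr.": "Doctor"}
SPELLED_ACRONYMS = {"UK", "USA", "BBC"}   # read letter by letter -> split with '-'
WORD_ACRONYMS = {"OPEC", "NASA"}          # read as words -> leave as-is
ALLOWED_SYMBOLS = set("abcdefghijklmnopqrstuvwxyz"
                      "ABCDEFGHIJKLMNOPQRSTUVWXYZ '-,.?!\"")

def normalize(text: str) -> str:
    """Expand known abbreviations and hyphenate acronyms that are read as letters."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    tokens = []
    for token in text.split():
        bare = token.strip(",.?!\"")
        if bare in SPELLED_ACRONYMS:
            token = token.replace(bare, "-".join(bare))
        tokens.append(token)
    return " ".join(tokens)

def leftover_symbols(text: str) -> set:
    """Characters (digits, foreign letters, stray symbols) still needing manual verbalization."""
    return {c for c in text if c not in ALLOWED_SYMBOLS}

sentence = "Bakery is on 47 Old Brompton Rd. in the UK."
normalized = normalize(sentence)
print(normalized)                    # Bakery is on 47 Old Brompton Road in the U-K.
print(leftover_symbols(normalized))  # e.g. {'4', '7'} -> write out as 'forty-seven' by hand
```

In practice this step is language-specific: numbers, dates, and units each need their own verbalization rules, and every new text source tends to introduce a few symbols you have not handled yet, which is why fixing the symbol set early pays off.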
Next in the series: **[How to prepare a dataset for neural Text-To-Speech — Part 2: Recordings & Audio Processing](link_part_2)**