[INGA]
Generally, all TTS frameworks require the training data to be in a certain form: sentence-long .wav and .txt pairs. The files should not vary too much in length.
Make sure the long .wav and .txt file pairs are identically named (apart from the extension)
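A quick way to verify the pairing is to compare basenames on both sides. A minimal sketch, assuming all long files sit in a single folder (the data/long_recordings path is a placeholder):

```python
# Minimal sketch: report long .wav files without a matching .txt transcript
# and vice versa, so naming mismatches are caught before running the aligner.
# The data/long_recordings path is a placeholder.
from pathlib import Path

data_dir = Path("data/long_recordings")

wav_stems = {p.stem for p in data_dir.glob("*.wav")}
txt_stems = {p.stem for p in data_dir.glob("*.txt")}

print("wav without txt:", sorted(wav_stems - txt_stems))
print("txt without wav:", sorted(txt_stems - wav_stems))
```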
Run the WebMAUS Basic forced aligner on each .wav/.txt pair; it produces a TextGrid with a word boundary tier
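WebMAUS Basic can be run through the BAS web interface, or via the BAS WebServices REST API. The sketch below assumes the runMAUSBasic endpoint and its SIGNAL/TEXT/LANGUAGE/OUTFORMAT parameters; verify these and the language code against the current BAS WebServices documentation before use:

```python
# Hedged sketch: calling WebMAUS Basic through the BAS WebServices REST API.
# Endpoint, parameter names and the language code are assumptions to verify
# against the current BAS WebServices documentation.
import requests

URL = "https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runMAUSBasic"

with open("speaker1.wav", "rb") as wav, open("speaker1.txt", "rb") as txt:
    response = requests.post(
        URL,
        files={"SIGNAL": wav, "TEXT": txt},
        data={"LANGUAGE": "nor-NO", "OUTFORMAT": "TextGrid"},  # placeholder language code
        timeout=600,
    )

# The reply is XML containing, among other things, a download link for the TextGrid.
print(response.text)
```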
Next, the word boundary tier is converted to SENTENCE level based on the silence duration between sentences. Finding a suitable threshold may require some fine-tuning of the duration variable for each speaker [SCRIPT: scripts/concatenate_webmaus_word_tier.py].
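The conversion boils down to merging adjacent word intervals and inserting a sentence boundary wherever the pause exceeds the threshold. This is an illustrative sketch on plain (start, end, label) tuples, not the actual script; the 0.4 s threshold is only an example value:

```python
# Illustrative sketch: merge word-level intervals into sentence-level intervals
# whenever the silence between consecutive words exceeds a threshold.
# (start, end, label) tuples stand in for the TextGrid word tier here;
# the real script reads and writes Praat TextGrids.

def words_to_sentences(words, min_pause=0.4):
    """Merge word intervals separated by less than `min_pause` seconds of silence."""
    sentences = []
    if not words:
        return sentences
    sent_start, sent_end, sent_text = words[0][0], words[0][1], [words[0][2]]
    for start, end, label in words[1:]:
        if start - sent_end >= min_pause:      # long silence -> sentence boundary
            sentences.append((sent_start, sent_end, " ".join(sent_text)))
            sent_start, sent_text = start, []
        sent_end = end
        sent_text.append(label)
    sentences.append((sent_start, sent_end, " ".join(sent_text)))
    return sentences

words = [(0.10, 0.35, "hello"), (0.40, 0.80, "there"), (1.90, 2.30, "goodbye")]
print(words_to_sentences(words, min_pause=0.4))
# -> [(0.1, 0.8, 'hello there'), (1.9, 2.3, 'goodbye')]
```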
Run the splitter script (Python) – [SCRIPT: split_sound_by_labeled_intervals_from_tgs_in_a_folder.py] saves each labeled interval (defined in the script) as indexed short .wav and .txt files in a folder
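For illustration, cutting a long recording by labeled intervals can look roughly like this with the soundfile library; the sketch assumes the sentence intervals are already available as (start, end, text) tuples, while the real script reads them from the TextGrids:

```python
# Sketch: write each labeled interval of a long recording to an indexed short
# .wav plus a .txt file with the matching text. `sentences` is example data;
# in practice the intervals come from the sentence tier of the TextGrid.
from pathlib import Path
import soundfile as sf

audio, sr = sf.read("speaker1.wav")
out_dir = Path("clips")            # placeholder output folder
out_dir.mkdir(exist_ok=True)

sentences = [(0.10, 0.80, "hello there"), (1.90, 2.30, "goodbye")]

for i, (start, end, text) in enumerate(sentences, start=1):
    clip = audio[int(start * sr):int(end * sr)]
    sf.write(str(out_dir / f"speaker1_{i:04d}.wav"), clip, sr)
    (out_dir / f"speaker1_{i:04d}.txt").write_text(text + "\n", encoding="utf-8")
```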
Gather the .wav filenames and the transcripts from the corresponding .txt files into a table [SCRIPT: scripts/extract_filenames.py]. Fill in the paths carefully!
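A sketch of that gathering step, assuming the clips and transcripts share a folder and that a pipe-separated two-column table (LJSpeech-style metadata.csv) is wanted; adapt the columns to whatever the TTS framework expects:

```python
# Sketch: collect (filename, transcript) pairs from a folder of short clips
# into a pipe-separated table. Paths and column layout are placeholders.
import csv
from pathlib import Path

clip_dir = Path("clips")           # fill in the real path

with open("metadata.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out, delimiter="|")
    for wav in sorted(clip_dir.glob("*.wav")):
        transcript = wav.with_suffix(".txt").read_text(encoding="utf-8").strip()
        writer.writerow([wav.stem, transcript])
```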
Check the table manually to make sure everything is correct and that there are no unnecessary characters
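The manual check can be sped up by scanning for characters outside an expected set; a minimal sketch, where the allowed character set is only an example and should be matched to the target language and the TTS frontend:

```python
# Sketch: flag rows that contain characters outside an expected set, to speed
# up the manual check. The allowed set below is only an example and should be
# adjusted to the target language and the TTS frontend.
import re

UNEXPECTED = re.compile(r"[^A-Za-zæøåÆØÅ0-9 .,!?'\-|]")

with open("metadata.csv", encoding="utf-8") as table:
    for lineno, line in enumerate(table, start=1):
        found = sorted(set(UNEXPECTED.findall(line.rstrip("\n"))))
        if found:
            print(f"line {lineno}: unexpected characters {found}")
```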
[TO DO: add to Cristin]