# smj TTS project technical documentation
## General
* speech-sme, speech-smj [LINKS LATER]
* The [SCRIPTS] referred to here will be documented in detail in the script files themselves
* Almost all scripts mentioned here are written in Python. Make sure to (pip) install Python packages such as [sox](https://pypi.org/project/sox/) and [tgt tools](https://textgridtools.readthedocs.io/en/stable/)
## Text Corpus
* Texts that have a CC-BY licence, or other texts that we have permission to use and also to publish as a TTS text corpus, which is a part of the project
* [Freecorpus](https://raw.githubusercontent.com/giellalt/lang-smj/main/corp/freecorpus.txt)
* We have also used parts of texts in [Boundcorpus]; these are texts without a CC-BY licence. How and why? [INGA?]
* Styles in the smj TTS text corpus:
* ![](https://hackmd.io/_uploads/rkJgAIwUh.png)
* Minimum amount of text: roughly corresponding to 10 hours of speech; for smj, ~74 000 words, over 12 hours, depending on the speech rate
* Corpus statistics were done using an R script: aggregate_corpus_data.R -- it counts words for each text style according to a style label
* Proofreading and checking the texts for grammar, formatting etc.
* Checking trigrams [SCRIPT: save_trigram_freqs_for_entire_corpus_nltk.py] and gradation patterns and their coverage [SCRIPT: check_gradation/gradation_checker_progress_bar.py], adding missing/scarce ones (a trigram-counting sketch follows at the end of this section)
* Texts also in English, Swedish, Norwegian
* Open-source texts from Wikipedia, LJSpeech sentences etc...
* TODO: also Finnish names?
* Praat text prompter and logger [SCRIPT: Praat_Prompter/smj_text_prompt_6.praat]
* It's good to use a prompter that shows one paragraph at a time, to make the task comfortable for the reader
* The reader can control the pace themselves using the arrow keys (the keys cannot be pressed while speaking, though)
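Relating to the trigram check above, a minimal sketch of the counting step (the actual logic lives in save_trigram_freqs_for_entire_corpus_nltk.py; the folder path and the choice of character-level trigrams within words are assumptions here):

```python
# Minimal sketch of a trigram coverage check with nltk.
# The real script is save_trigram_freqs_for_entire_corpus_nltk.py; the paths and the
# character-level (rather than word-level) trigrams are illustrative assumptions.
from collections import Counter
from pathlib import Path

from nltk.util import ngrams

corpus_dir = Path("corpus_texts")  # hypothetical folder with the corpus .txt files
trigram_counts = Counter()

for txt_file in corpus_dir.glob("*.txt"):
    text = txt_file.read_text(encoding="utf-8").lower()
    for word in text.split():
        trigram_counts.update("".join(g) for g in ngrams(word, 3))

# The rarest trigrams point to patterns that are missing or scarce in the corpus
for trigram, count in sorted(trigram_counts.items(), key=lambda kv: kv[1])[:50]:
    print(f"{trigram}\t{count}")
```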
## Recording speech corpus
* Voice talents
[INGA]
* Finding suitable voice talents
* Taking care of voice talents, scheduling of the work
* Instructions for voice talents (see Praat prompter script): reading style, voice quality, reading quality
* Recording setup: microphone, sound card, sampling frequency etc.
* Microphone: a DPA headset microphone, to ensure the microphone stays at the same distance from the mouth throughout the recordings and thus picks up less room acoustics; it also does not block the reader's view the way a large-diaphragm microphone would
* Use of 2 microphones, one of them as backup. We used a Zoom H2n handheld recorder for the male smj voice; for the female voice, an AKG C414 was used with the omni polar pattern
* Room acoustics -- minimize noise sources like grounding noise, lamps, air conditioning etc. Try to avoid echo with acoustic material, curtains, or anything else that prevents room reverb.
* Digital audio workstation (DAW), monitoring, test recordings
* Audacity works best, but at NRK (male smj voice) we used the broadcasting software NRK uses (Digas) -- this did not allow proper monitoring of the recordings
* File back-ups, file naming conventions
* After each session the recordings were backed up to an external hard drive and to a private UiT OneDrive folder
## Audio processing
* Find the right text files that were actually read from the Praat script log file
* Name the .wav and .txt files identically
* Also: split very long files into 2-3 parts to make the automatic pre-processing easier
* First cleaning of the audio: cut long pauses, noise, anything not suitable for synthesis
* Filters applied to the long audio files (before splitting; see folder 'audio_processing')
* Echo removal (if needed) - Cubase (NB! [A commercial software!](https://www.steinberg.net/cubase/))
* ![](https://hackmd.io/_uploads/B1QWSwvI2.png)
* High pass filter - [Audacity](https://www.audacityteam.org/)
![](https://hackmd.io/_uploads/rkCLrPvIh.png)
* Noise gate/noise reduction - [Audacity](https://www.audacityteam.org/)
* ![](https://hackmd.io/_uploads/Sk1jrPvLn.png)
* Level normalization (bring all sound files in the corpus to the same volume level) - [sox](https://pypi.org/project/sox/) & [STL](https://github.com/openitu/STL)
* copy the [SCRIPT: norm_file_noconv.sh] to the folder containing your target files, open a terminal and `cd` to that folder with the script and the target files. Make a separate `output/` subfolder
* remember to export the STL path before running the command: `export PATH=$PATH:/home/hiovain/STL/bin`
* run this command (an example; fill in your own folder paths): `ls -1 /home/hiovain/Bodø/Sander_mdgn/*.wav | xargs -P 8 -I {} bash norm_file_noconv.sh {} /home/hiovain/Bodø/Sander_mdgn/output`
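As an alternative sketch of the same step, the batch normalization can be written in Python with the sox wrapper. Note that this performs simple peak normalization, not the ITU-T STL loudness normalization that norm_file_noconv.sh uses; the folder paths are just the example paths from above:

```python
# Rough pysox sketch of batch level normalization (peak normalization to -3 dBFS).
# This is NOT the ITU-T STL-based normalization that norm_file_noconv.sh performs.
from pathlib import Path

import sox

in_dir = Path("/home/hiovain/Bodø/Sander_mdgn")  # example folder from the command above
out_dir = in_dir / "output"
out_dir.mkdir(exist_ok=True)

for wav in sorted(in_dir.glob("*.wav")):
    tfm = sox.Transformer()
    tfm.norm(db_level=-3.0)  # peak-normalize to -3 dBFS
    tfm.build(input_filepath=str(wav), output_filepath=str(out_dir / wav.name))
```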
## Text processing
[INGA]
* Editing texts to match the audio/read speech accurately
* Also "light" mistakes, corrections, repetitions are kept and written out in the text transcript
* Numbers will be written out!
* There is a lot of variation in how numbers are read out, so these need to be checked manually
* Second cleaning after careful listening to the recordings: cutting out the remaining unusable parts (bad mistakes, coughing etc.)
## Splitting the data to sentences
* Generally, all TTS frameworks require the training data to be in a certain form: sentence-long .wav and .txt pairs. The files should not vary too much in length.
* Make sure the long .wav and .txt file pairs are identically named
* Run [WebMAUS basic force-aligner](https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface/WebMAUSBasic)
* Audio files over 200 MB / 30 minutes in size should be split into smaller chunks first or the aligner won't work
* TIP for long files: [use Pipeline without ASR](https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface/Pipeline) with G2P -> Chunker -> MAUS options
* There is no Sámi model -> the FINNISH model works for Sámi, but note that any remaining numbers etc. would be normalized in Finnish, so make sure numbers are normalized BEFORE "MAUSing"!
* NOTE that WebMAUS removes commas -- these need to be re-added!!! [TO DO: SCRIPT]
* WebMAUS automatically outputs a Praat .TextGrid annotation file with 4 annotation layers and boundaries on phoneme and word levels/tiers
* Next, the word boundary tier is converted to SENTENCE level based on silence duration between the sentences. It might require some fine-tuning of the duration variable to find a suitable threshold for each speaker [SCRIPT: scripts/concatenate_webmaus_word_tier.py]; see the sketch below the figure.
* The resulting sentence tier is manually checked and fixed in Praat
![](https://hackmd.io/_uploads/SyBSzIVvh.png)
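A minimal sketch of the word-to-sentence tier conversion with tgt (the actual implementation is scripts/concatenate_webmaus_word_tier.py; the tier name, pause handling and threshold value below are assumptions):

```python
# Sketch: merge the WebMAUS word tier into sentence-level intervals based on pause length.
# Tier name, pause representation and the threshold are assumptions; tune per speaker.
import tgt

PAUSE_THRESHOLD = 0.35  # seconds of silence that ends a sentence

tg = tgt.io.read_textgrid("recording.TextGrid")
words = tg.get_tier_by_name("ORT-MAU")  # WebMAUS word tier (the name may differ)

sentences = tgt.core.IntervalTier(name="sentences")
current_words, sent_start, sent_end = [], None, None

for w in words.intervals:
    if not w.text.strip():  # skip empty/pause intervals
        continue
    if sent_start is None:
        sent_start, sent_end, current_words = w.start_time, w.end_time, [w.text]
        continue
    if w.start_time - sent_end > PAUSE_THRESHOLD:
        sentences.add_interval(tgt.core.Interval(sent_start, sent_end, " ".join(current_words)))
        sent_start, current_words = w.start_time, []
    current_words.append(w.text)
    sent_end = w.end_time

if current_words:
    sentences.add_interval(tgt.core.Interval(sent_start, sent_end, " ".join(current_words)))

tg.add_tier(sentences)
tgt.io.write_to_file(tg, "recording_sentences.TextGrid", format="long")
```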
* Run the splitter script (Python) -- the [SCRIPT: split_sound_by_labeled_intervals_from_tgs_in_a_folder.py] saves each labeled interval (as defined in the script) as indexed short .wav and .txt files in a folder
* Gather the .wav filenames and transcripts from the corresponding txt files into a table [SCRIPT: scripts/extract_filenames.py]. Fill in the paths carefully! A sketch follows after this list.
* Check the table manually that everything is correct and that there are no unnecessary characters
* Remember to check COMMAS, too! Add commas to the transcriptions whenever the reader makes a (breathing) pause in the speech. This is especially important in lists; otherwise the prosody will not be natural.
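Relating to the table step above, a minimal sketch of gathering the .wav/.txt pairs into a pipe-separated filelist (the real script is scripts/extract_filenames.py; the folder, output name and LJSpeech-style wav|text format are assumptions):

```python
# Sketch: collect wav paths and transcripts into a pipe-separated training filelist.
from pathlib import Path

data_dir = Path("split_sentences")  # placeholder folder with matching .wav/.txt pairs
rows = []
for wav in sorted(data_dir.glob("*.wav")):
    transcript = wav.with_suffix(".txt").read_text(encoding="utf-8").strip()
    rows.append(f"wavs/{wav.name}|{transcript}")

Path("filelist_all.txt").write_text("\n".join(rows) + "\n", encoding="utf-8")
```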
## TTS frameworks
### Tacotron2 based setup
* see [lang-sme-speech-ml](https://github.com/divvun/lang-sme-ml-speech)
* Liliia's thesis: https://gupea.ub.gu.se/bitstream/handle/2077/69692/gupea_2077_69692_1.pdf?sequence=1
* While this setup worked well as an experiment, TTS development moves very fast, and shortly after Tacotron 2 a better framework with faster and more effective training became available.
### Fastpitch setup and training
* [Fastpitch on GitHub](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch)
* [Add our setup as a repo to speech-sme?]
* Why is Fastpitch better than Tacotron2?
* Faster & lighter to train, trains even on a laptop if necessary
* 3 days is enough for 1000 epochs (the usual number of epochs for a good model) – but this depends on the data size
* Better and more natural prosody because of explicit pitch prediction: pitch and duration values are modeled for every segment
* No Tacotron "artefacts"
* Very easy to "combine" voices/putting 2 datasets together, for example (sme)
* Reference: [Adrian Łańcucki, 2021](https://arxiv.org/abs/2006.06873)
* Define symbol set of the language carefully [SCRIPT, for example: /home/hiovain/DeepLearningExamples/PyTorch/SpeechSynthesis/FastPitch/common/text/symbols.py]
* This is EXTREMELY important to get right! If the symbol list is not correct, the training will not work correctly. It's better to include "too many" symbols than too few.
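A quick way to verify the coverage is to collect every character that actually occurs in the training transcripts and compare the result against symbols.py; a minimal sketch (the transcript folder path is a placeholder):

```python
# Collect every character occurring in the transcripts so the symbol list in
# common/text/symbols.py can be checked against it.
from pathlib import Path

symbols_in_corpus = set()
for txt in Path("split_sentences").glob("*.txt"):  # placeholder transcript folder
    symbols_in_corpus.update(txt.read_text(encoding="utf-8").strip())

print(sorted(symbols_in_corpus))
```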
* Fastpitch pre-processing [SCRIPT: scripts/prepare_dataset.sh]
* define data table/filelist
* extract mel spectrograms and pitch for each file; remember to CONVERT the audio files to 22 kHz (downsample)
* add pitch file paths to the data table [SCRIPT: save_pitch_filenames.py]
* Then: add speaker numbers!
* using the *shuf* command, shuffle the table
* make a test/validation set: ~100 sentences from the corpus are left out from training
* If you face an error like "Loss is NaN", run [SCRIPT: check_pitch_1.py] on the pitch folder (e.g. acapela_f_new/pitch) to see if there are pitch files without any analyzed pitch (due to creaky voice, for example); a sketch of the check follows below
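A minimal sketch of what such a check does, assuming the pitch files are PyTorch tensors (.pt) as saved by prepare_dataset.sh:

```python
# Find pitch files that are all zeros or contain NaNs, which can lead to "Loss is NaN".
from pathlib import Path

import torch

for pt in sorted(Path("acapela_f_new/pitch").glob("*.pt")):
    pitch = torch.load(pt)
    if torch.isnan(pitch).any() or not torch.any(pitch != 0):
        print(f"suspicious pitch file: {pt}")
```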
* Training
* training [SCRIPT], parameters
* training on cluster [SCRIPT]
* training script with debug plotting: draw a spectrogram plot of each epoch [SCRIPT]
* running inference [SCRIPTS]
* VOCODERS: HiFi-GAN, NeMo UnivNet
* UnivNet generally performs better: it generalizes more easily to unseen speakers and has no audible vocoder noise. Installation: `pip install nemo_toolkit[tts]`
* Reference: Jang, W., Lim, D., Yoon, J., Kim, B., & Kim, J. (2021). UnivNet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation. arXiv preprint arXiv:2106.07889.
* Usage in the inference script:
```python
...
from nemo.collections.tts.models import UnivNetModel
...
self.vocoder = UnivNetModel.from_pretrained(model_name="tts_en_libritts_univnet")
...
```
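Converting the FastPitch mel output to a waveform then goes through the vocoder; a self-contained sketch assuming NeMo's standard vocoder interface (the dummy mel tensor only illustrates the expected shape):

```python
# Sketch: mel spectrogram -> waveform with the pretrained UnivNet vocoder.
# Assumes NeMo's generic vocoder interface; `mel` would come from the FastPitch generator.
import torch
from nemo.collections.tts.models import UnivNetModel

vocoder = UnivNetModel.from_pretrained(model_name="tts_en_libritts_univnet")
vocoder.eval()

mel = torch.zeros(1, 80, 200)  # dummy (batch, n_mels, frames) tensor, just to show the call
with torch.no_grad():
    audio = vocoder.convert_spectrogram_to_audio(spec=mel)
print(audio.shape)  # (batch, samples)
```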
## Exporting TorchScript for C++ integration
* Command: `python export_torchscript.py --generator-name FastPitch --generator-checkpoint /output_sme_af/FastPitch_checkpoint_1000.pt --output /output_sme_af/torchscript_sme_f.pt --amp`
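A quick sanity check that the exported module loads outside the training repo (loading only; actually running inference would additionally need the repo's text processing to produce the input ids):

```python
# Load the exported TorchScript FastPitch generator and inspect it.
import torch

model = torch.jit.load("output_sme_af/torchscript_sme_f.pt")
model.eval()
print(model)  # prints the scripted module structure
```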
## User interface for TTS
* Huggingface demo user interface [SCRIPT]:
* voice: male/female
* speed should be decided by the user in the interface
* also possibility to choose the (Sami) language from the same demo page: sme, smj, sma...
* TO DO: run on a server?
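Hugging Face demos of this kind are typically built with Gradio; a minimal sketch of the intended interface, where `synthesize` is only a placeholder for the actual FastPitch + vocoder inference:

```python
# Minimal Gradio sketch of the planned demo interface (voice, speed and language selectors).
import gradio as gr

def synthesize(text, language, voice, speed):
    # ... run FastPitch + UnivNet here and return (sample_rate, waveform) ...
    raise NotImplementedError

demo = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Textbox(label="Text"),
        gr.Dropdown(["sme", "smj", "sma"], label="Language"),
        gr.Radio(["female", "male"], label="Voice"),
        gr.Slider(0.5, 2.0, value=1.0, label="Speed"),
    ],
    outputs=gr.Audio(label="Synthesized speech"),
)

demo.launch()
```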
## Publications & presentations
[TO DO: add to Cristin]
* PLM 2021: [Building open-source Text-to-Speech synthesis for minority languages with ternary quantity: Lule Saami as an example ](http://ifa.amu.edu.pl/plm_old/2020/files/plm2021_abstracts/Hiovain_PLM2021_Abstract.pdf)
* SIGUL 2022: [Building Open-source Speech Technology for Low-resource Minority Languages with Sámi as an Example – Tools, Methods and Experiments](https://aclanthology.org/2022.sigul-1.22/)
* LREC 2022: [Unmasking the Myth of Effortless Big Data - Making an Open Source Multi-lingual Infrastructure and Building Language Resources from Scratch](https://aclanthology.org/2022.lrec-1.125/)
* SAALS5 2022: [Building open-source speech technology for low-resource minority languages with Sámi as an example – tools, methods and experiments](https://keel.ut.ee/en/saals5-program)
* INTERSPEECH 2023
* SIGUL 2023
* [AI4LAM April 2023 - Language Technology for the Sámi population](https://docs.google.com/presentation/d/14gMCzF04dJeRBJ7UPpqno8wOje_xmS5UhP7fZWWRmbg/edit?usp=sharing)
* Fonetiikan päivät 2024
## Publishing the materials and models
* Where to publish/store data and models?
* Restrictions for specific use cases, such as commercial, academic etc...?
## To do...
* train fastpitch model with orthography + IPA/phonemes
* text normalization using fst output!
* evaluation: Fonetiikan päivät 2024
* training models using orthographic texts + phonemes together -- will it help, and maybe reduce the amount of training data required?
* Speaker adaptation, voice cloning: combining voice models
* essentially: use a base model and record your own voice for 5 minutes and make the model sound like you?
* [Voice fixer](https://github.com/haoheliu/voicefixer) for making even old, low-quality archive material usable for speech technology. It uses a TTS vocoder to clean audio from noise and make it clearer and more intelligible
* [Resemble Enhance](https://github.com/resemble-ai/resemble-enhance) – even better than voice fixer, possible to run on cluster for the entire data set
* Using pronunciation lexicon for Sámi for words with unusual pronunciation, see for Swedish: https://www.openslr.org/29/
* language identification from speech as a part of an ASR system (like facebook fairseq: https://github.com/facebookresearch/fairseq)
* style embeddings: change the "style" of the output, like formal, informal, adjust pitch... (Helsinki group)
* other minority languages examples with speech technology projects -- more ideas, comparative studies etc.:
* Võro
* Gaelic (Dewi Jones)
* Maori
* Irish?
* Frisian (Phat Do)
* REPEAT FOR ALL SÁMI LANGUAGES!
* Voices for different dialects/areal varieties; especially the Finnish North Sámi variety?
## Bibliography & other useful links
* https://tts.readthedocs.io/en/latest/what_makes_a_good_dataset.html
* LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition -- https://arxiv.org/abs/2008.03687
* coqui-ai TTS: https://github.com/coqui-ai/TTS
* What makes a good TTS dataset: <https://docs.coqui.ai/en/latest/what_makes_a_good_dataset.html>