# smj TTS project technical documentation

## General

* speech-sme, speech-smj [LINKS LATER]
* [SCRIPTS] that are referred to are going to be documented in detail in the script files
* Almost all scripts mentioned here are written in Python. Make sure to (pip) install Python packages like [sox](https://pypi.org/project/sox/) and [tgt tools](https://textgridtools.readthedocs.io/en/stable/)

## Text Corpus

* Texts that have a CC-BY licence, or other texts that we have permission to use and also to publish as a TTS text corpus, which is a part of the project
  * [Freecorpus](https://raw.githubusercontent.com/giellalt/lang-smj/main/corp/freecorpus.txt)
* We have also used parts of texts in [Boundcorpus]; these are texts without a CC-BY licence. How and why? [INGA?]
* Styles in the smj TTS text corpus:
  * ![](https://hackmd.io/_uploads/rkJgAIwUh.png)
* Minimum amount of text corresponds to about 10 hours of speech; for smj, ~74 000 words, i.e. 12+ hours depending on the speech rate
* Corpus statistics were done using an R script, aggregate_corpus_data.R -- it counts words for each text style according to a style label
* Proofreading and checking the texts for grammar, formatting etc.
* Checking trigrams [SCRIPT: save_trigram_freqs_for_entire_corpus_nltk.py] and gradation patterns and their coverage [SCRIPT: check_gradation/gradation_checker_progress_bar.py], adding missing/scarce ones (a minimal trigram-counting sketch follows at the end of this section)
* Texts also in English, Swedish and Norwegian
* Open-source texts from Wikipedia, LJSpeech sentences etc.
* TODO: also Finnish names?
* Praat text prompter and logger [SCRIPT: Praat_Prompter/smj_text_prompt_6.praat]
  * It's good to use a prompter that shows one paragraph at a time, to make the task comfortable for the reader
  * The reader can control the pace themselves by using the arrow keys (keys cannot be pressed while speaking, though)
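A minimal sketch of the trigram check idea (an assumption of how save_trigram_freqs_for_entire_corpus_nltk.py works -- the actual script may differ): count character trigram frequencies over all corpus text files with NLTK, so that rare or missing trigrams can be spotted and covered.

```python
# Sketch only, not the actual script: count character trigram frequencies
# over all .txt files in a corpus folder using NLTK's ngrams helper.
from collections import Counter
from pathlib import Path

from nltk import ngrams


def trigram_freqs(corpus_dir: str) -> Counter:
    counts = Counter()
    for txt in Path(corpus_dir).glob("*.txt"):
        for word in txt.read_text(encoding="utf-8").lower().split():
            counts.update(ngrams(word, 3))
    return counts


if __name__ == "__main__":
    # "corpus_texts" is a placeholder folder name
    for trigram, n in trigram_freqs("corpus_texts").most_common(20):
        print("".join(trigram), n)
```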
## Recording speech corpus

* Voice talents [INGA]
  * Finding suitable voice talents
  * Taking care of voice talents, scheduling the work
  * Instructions for voice talents (see the Praat prompter script): reading style, voice quality, reading quality
* Recording setup: microphone, sound card, sampling frequency etc.
  * Microphone: a DPA headset microphone, to keep the microphone at the same distance from the mouth throughout the recordings (so it picks up less room acoustics) and because it does not block the reader's sight like a large-diaphragm microphone would
  * Use two microphones, one as a backup. We used a Zoom H2n handheld recorder for the male smj voice; for the female voice, an AKG C414 with the omni polar pattern was used
  * Room acoustics -- minimize noise sources such as grounding noise, lamps and air conditioning. Try to avoid echo with acoustic material, curtains, anything that prevents room reverb.
* Digital audio workstation (DAW), monitoring, test recordings
  * Audacity works best, but at NRK, with the male smj voice, we used the broadcasting software NRK uses (Digas) -- this did not allow proper monitoring of the recordings
* File back-ups, file naming conventions
  * After each session the recordings were backed up to an external hard drive and to a private UiT OneDrive folder

## Audio processing

* Find the right text files, i.e. the ones that were actually read, from the Praat script's log file
* Name files: wavs and txts identically
  * Also: split very long files into 2-3 parts to make the automatic pre-processing easier
* First cleaning of the audio: cut long pauses, noise, anything not suitable for synthesis
* Filters for the long audio files (before splitting, see folder 'audio_processing'):
  * Echo removal (if needed) -- Cubase (NB! [A commercial software!](https://www.steinberg.net/cubase/))
    * ![](https://hackmd.io/_uploads/B1QWSwvI2.png)
  * High-pass filter -- [Audacity](https://www.audacityteam.org/)
    * ![](https://hackmd.io/_uploads/rkCLrPvIh.png)
  * Noise gate/noise reduction -- [Audacity](https://www.audacityteam.org/)
    * ![](https://hackmd.io/_uploads/Sk1jrPvLn.png)
  * Level normalization (bring all sound files in the corpus to the same volume level) -- [sox](https://pypi.org/project/sox/) & [STL](https://github.com/openitu/STL)
    * Copy the [SCRIPT: norm_file_noconv.sh] to the folder where you have your target files, open a terminal and cd to that folder. Make a separate /output subfolder
    * Remember to export the path before running the command: `export PATH=$PATH:/home/hiovain/STL/bin`
    * Run this command (an example; fill in your own folder paths): `ls -1 /home/hiovain/Bodø/Sander_mdgn/*.wav | xargs -P 8 -I {} bash norm_file_noconv.sh {} /home/hiovain/Bodø/Sander_mdgn/output`

## Text processing [INGA]

* Editing texts to match the audio/read speech accurately
  * "Light" mistakes, corrections and repetitions are also kept and written out in the text transcript
* Numbers will be written out!
  * There is a lot of variation in how numbers are read out, so these need to be checked manually
* Second cleaning after careful listening of the recordings: cutting out the remaining unusable parts (bad mistakes, coughing etc.)

## Splitting the data into sentences

* Generally, all TTS frameworks require the training data to be in a certain form: sentence-long .wav and .txt pairs. The files should not vary too much in length.
* Make sure the long .wav and .txt file pairs are identically named
* Run the [WebMAUS basic force-aligner](https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface/WebMAUSBasic)
  * Audio files over 200 MB/30 min should be split into smaller chunks first or the aligner won't work
  * TIP for long files: [use the Pipeline without ASR](https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface/Pipeline) with the G2P -> Chunker -> MAUS options
  * There is no Sámi model; the FINNISH model works for Sámi, but note that any numbers are normalized in Finnish, so make sure numbers are normalized BEFORE "MAUSING"!
  * NOTE that WebMAUS removes commas -- these need to be re-added!!! [TO DO: SCRIPT]
* WebMAUS automatically outputs a Praat .TextGrid annotation file with 4 annotation tiers, with boundaries on the phoneme and word levels
* Next, the word boundary tier is converted to the SENTENCE level based on the silence duration between sentences. It might require some fine-tuning of the duration variable to find a suitable threshold for each speaker [SCRIPT: scripts/concatenate_webmaus_word_tier.py] (a sketch of the idea follows at the end of this section)
* The resulting sentence tier is manually checked and fixed in Praat
  * ![](https://hackmd.io/_uploads/SyBSzIVvh.png)
* Run the splitter script (Python) -- the [SCRIPT: split_sound_by_labeled_intervals_from_tgs_in_a_folder.py] saves each labeled interval (defined in the script) into indexed short .wav and .txt files in a folder
* Gather the .wav filenames and the transcripts from the corresponding txt files into a table [SCRIPT: scripts/extract_filenames.py]. Fill in the paths carefully!
* Check the table manually: everything should be correct and there should be no unnecessary characters
* Remember to check COMMAS, too! Add commas to the transcriptions whenever the reader makes a (breathing) pause in the speech -- important especially in lists. Otherwise the prosody will not be natural.
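A minimal sketch of the word-to-sentence tier conversion idea using the tgt package (the actual scripts/concatenate_webmaus_word_tier.py may differ; the tier name and threshold below are assumptions to be tuned per speaker):

```python
# Sketch only: build a sentence tier from a WebMAUS word tier by merging
# consecutive words whose gap is shorter than a silence threshold.
import tgt

SILENCE_THRESHOLD = 0.4  # seconds -- an assumed starting value, tune per speaker
WORD_TIER = "ORT-MAU"    # assumed name of the WebMAUS word tier

tg = tgt.read_textgrid("session01.TextGrid")  # placeholder file name
# drop empty/pause labels, keep real words
words = [i for i in tg.get_tier_by_name(WORD_TIER).intervals if i.text.strip()]

sentences = tgt.IntervalTier(name="sentences")
start, end, texts = words[0].start_time, words[0].end_time, [words[0].text]
for w in words[1:]:
    if w.start_time - end < SILENCE_THRESHOLD:
        # short gap: the word continues the current sentence
        end, texts = w.end_time, texts + [w.text]
    else:
        # long pause: close the current sentence and start a new one
        sentences.add_interval(tgt.Interval(start, end, " ".join(texts)))
        start, end, texts = w.start_time, w.end_time, [w.text]
sentences.add_interval(tgt.Interval(start, end, " ".join(texts)))

tg.add_tier(sentences)
tgt.write_to_file(tg, "session01_sentences.TextGrid", format="long")
```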
## TTS frameworks

### Tacotron2-based setup

* See [lang-sme-speech-ml](https://github.com/divvun/lang-sme-ml-speech)
* Liliia's thesis: https://gupea.ub.gu.se/bitstream/handle/2077/69692/gupea_2077_69692_1.pdf?sequence=1
* While this setup worked well as an experiment, the development of TTS systems is very fast, and shortly after Tacotron 2, a better framework with faster and more effective training became available.

### FastPitch setup and training

* [FastPitch on GitHub](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch)
* [Add our setup as a repo to speech-sme?]
* Why is FastPitch better than Tacotron 2?
  * Faster and lighter to train; trains even on a laptop if necessary
  * 3 days is enough for 1000 epochs (a usual amount of epochs for a good model) -- but this depends on the data size
  * Better and more natural prosody because of explicit pitch prediction: pitch and duration values are modeled for every segment
  * No Tacotron "artefacts"
  * Very easy to "combine" voices, e.g. putting 2 datasets together (sme)
  * Reference: [Adrian Łańcucki, 2021](https://arxiv.org/abs/2006.06873)
* Define the symbol set of the language carefully [SCRIPT, for example: /home/hiovain/DeepLearningExamples/PyTorch/SpeechSynthesis/FastPitch/common/text/symbols.py]
  * This is EXTREMELY important to get right! If the symbol list is not correct, the training will not work correctly. It's better to include "too many" symbols than too few.
* FastPitch pre-processing [SCRIPT: scripts/prepare_dataset.sh]
  * Define the data table/filelist
  * Extract mel spectrograms and pitch for each file; remember to CONVERT (downsample) the audio files to 22 kHz
  * Add the pitch file paths to the data table [SCRIPT: save_pitch_filenames.py]
  * Then: add speaker numbers!
  * Shuffle the table using the *shuf* command
  * Make a test/validation set: ~100 sentences from the corpus are left out of training
  * If you face an error like "Loss is NaN", run [SCRIPT: check_pitch_1.py] on the pitch folder (e.g. `check_pitch_1.py acapela_f_new/pitch`) to see if there are pitch files without any pitch analyzed (due to creaky voice, for example) -- see the sketch below
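A hypothetical reconstruction of what check_pitch_1.py looks for (the actual script may differ): pitch files in which nothing was voiced, which later makes the loss go NaN. This assumes the pitch values were saved as torch tensors (.pt files) during pre-processing.

```python
# Sketch only: flag pitch files that contain no usable pitch values
# (all zeros or NaNs), e.g. because of creaky voice.
import sys
from pathlib import Path

import torch

pitch_dir = Path(sys.argv[1])  # e.g. acapela_f_new/pitch
for f in sorted(pitch_dir.glob("*.pt")):
    pitch = torch.load(f)
    if torch.isnan(pitch).any() or not (pitch != 0).any():
        print(f"No pitch analyzed in: {f}")
```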
* Training
  * Training [SCRIPT], parameters
  * Training on a cluster [SCRIPT]
  * Training script with debug plotting: draws a spectrogram plot for each epoch [SCRIPT]
  * Running inference [SCRIPTS]
* VOCODERS!!! HiFi-GAN, NeMo UnivNet
  * UnivNet generally performs better: it is easier to generalize to unseen speakers, and there is no audible vocoder noise. Installation: `pip install nemo_toolkit[tts]`
  * Reference: Jang, W., Lim, D., Yoon, J., Kim, B., & Kim, J. (2021). UnivNet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation. arXiv preprint arXiv:2106.07889.
  * Usage in the inference script:
    ```python
    ...
    from nemo.collections.tts.models import UnivNetModel
    ...
    self.vocoder = UnivNetModel.from_pretrained(model_name="tts_en_libritts_univnet")
    ...
    ```

## Exporting TorchScript for C++ integration

* Command: `python export_torchscript.py --generator-name FastPitch --generator-checkpoint /output_sme_af/FastPitch_checkpoint_1000.pt --output /output_sme_af/torchscript_sme_f.pt --amp`

## User interface for TTS

* Huggingface demo user interface [SCRIPT] (a hypothetical sketch follows below):
  * Voice: male/female
  * The speed should be decided by the user in the interface
  * Also the possibility to choose the (Sámi) language on the same demo page: sme, smj, sma...
  * TO DO: run on a server?
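A minimal sketch of what such a demo could look like with Gradio (an assumption -- the actual demo script, model loading and option values may differ):

```python
# Sketch only: a Gradio UI with language, voice and speed controls.
# synthesize() is a placeholder; it should run FastPitch + the vocoder.
import gradio as gr


def synthesize(text, language, voice, speed):
    # Placeholder: load the FastPitch model and vocoder for the chosen
    # language/voice, run inference with the requested pace, and return
    # (sample_rate, waveform) for the Audio output.
    raise NotImplementedError


demo = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Textbox(label="Text to synthesize"),
        gr.Radio(["sme", "smj", "sma"], label="Language"),
        gr.Radio(["female", "male"], label="Voice"),
        gr.Slider(0.5, 2.0, value=1.0, label="Speed"),
    ],
    outputs=gr.Audio(label="Synthesized speech"),
)

if __name__ == "__main__":
    demo.launch()
```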
## Publications & presentations [TO DO: add to Cristin]

* PLM 2021: [Building open-source Text-to-Speech synthesis for minority languages with ternary quantity: Lule Saami as an example](http://ifa.amu.edu.pl/plm_old/2020/files/plm2021_abstracts/Hiovain_PLM2021_Abstract.pdf)
* SIGUL 2022: [Building Open-source Speech Technology for Low-resource Minority Languages with Sámi as an Example – Tools, Methods and Experiments](https://aclanthology.org/2022.sigul-1.22/)
* LREC 2022: [Unmasking the Myth of Effortless Big Data - Making an Open Source Multi-lingual Infrastructure and Building Language Resources from Scratch](https://aclanthology.org/2022.lrec-1.125/)
* SAALS5 2022: [Building open-source speech technology for low-resource minority languages with Sámi as an example – tools, methods and experiments](https://keel.ut.ee/en/saals5-program)
* INTERSPEECH 2023
* SIGUL 2023
* [AI4LAM April 2023 - Language Technology for the Sámi population](https://docs.google.com/presentation/d/14gMCzF04dJeRBJ7UPpqno8wOje_xmS5UhP7fZWWRmbg/edit?usp=sharing)
* Fonetiikan päivät 2024

## Publishing the materials and models

* Where to publish/store the data and models?
* Restrictions for specific use cases, such as commercial, academic etc.?

## To do...

* Train a FastPitch model with orthography + IPA/phonemes
* Text normalization using the FST output!
* Evaluation: Fonetiikan päivät 2024
* Training models using orthographic texts + phonemes together -- will it help and maybe reduce the amount of training data required?
* Speaker adaptation, voice cloning: combining voice models
* Essentially: use a base model, record your own voice for 5 minutes and make the model sound like you?
* [Voice fixer](https://github.com/haoheliu/voicefixer) for making old, low-quality archive material usable for speech technology. It uses a TTS vocoder to clean the audio of noise and make it clearer and more intelligible
* [Resemble Enhance](https://github.com/resemble-ai/resemble-enhance) – even better than Voice fixer; possible to run on a cluster for the entire dataset
* Using a pronunciation lexicon for Sámi words with unusual pronunciation; see for Swedish: https://www.openslr.org/29/
* Language identification from speech as part of an ASR system (like Facebook fairseq: https://github.com/facebookresearch/fairseq)
* Style embeddings: change the "style" of the output, e.g. formal or informal, adjust pitch... (Helsinki group)
* Other minority languages with speech technology projects -- more ideas, comparative studies etc.:
  * Võro
  * Gaelic (Dewi Jones)
  * Maori
  * Irish?
  * Frisian (Phat Do)
* REPEAT FOR ALL SÁMI LANGUAGES!
* Voices for different dialect/area varieties; especially the Finnish North Sámi variety?

## Bibliography & other useful links

* https://tts.readthedocs.io/en/latest/what_makes_a_good_dataset.html
* LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition: https://arxiv.org/abs/2008.03687
* coqui-ai TTS: https://github.com/coqui-ai/TTS
* What makes a good TTS dataset: https://docs.coqui.ai/en/latest/what_makes_a_good_dataset.html
