# SoPro: Adapt Voice

**Group members:** Annegret Janzo (aj), Urs Peter (up), Joanna Dietinger (jd), Anastasia Borisenkov (ab)

> [Overleaf Template: Adapt voice](https://www.overleaf.com/2215996661byfhpmzmqkhn)

## Open To-Do's

- [ ] How do we account for accent?
- [ ] continue writing on the project proposal in Overleaf
- [ ] research text data for Louisa to read out (jd)
  - [ ] [Harvard Sentences](https://github.com/dariusk/corpora/pull/201/commits/2a5ea77a18e3fc78d787fbbc3a395b7df70492c9)
- [ ] update Overleaf document on Friday (ab, aj, jd, up)
- [x] verify sound studio constraints with Louisa:
  - timeframe for recording; dates
- [ ] research presets of the Android voice app (up)
- [ ] research phonetic details for the voice recording (jd)
  - how should the audio data be recorded?
  - which audio data can we use?
  - how/what should be annotated about the audio?

### To-Do's till Friday Meeting

- [x] Do pre-recorded audio files of the Harvard sentences exist? (jd)
  - [ ] if yes, is the audio data in the right format?
  - [ ] convert the audio data if the format is not correct
  - [ ] manually convert some Harvard sentences to the technical specifications named in the paper
  - [ ] find and test software for converting audio files
- [x] research and hopefully fix the bug with checkpoint loading of models in Flowtron (ab, up)
  - [ ] evaluate the run model?
- [x] make the model converge (1 million epochs is far too many)
- [ ] adapt the read-in txt file for model training (up)
- [ ] how to call the trained model for usage & evaluate it (ab)
- [x] How to integrate our adapted TTS model into the Android TTS app (aj)
  - the demo app uses the Android-internal TTS system
    - [Android TTS System](https://developer.android.com/reference/android/speech/tts/TextToSpeech)
    - [Android Voice Adjustments](https://developer.android.com/reference/android/speech/tts/Voice#getFeatures())
    - [Android Speech Recognition and TTS](https://gotev.github.io/android-speech/)
  - [x] need to find out
    - [x] a) how to adjust these settings to get the voice we need, or
    - [x] b) how to replace this model so that it still fits the existing code
  - to run a PyTorch model on Android, we can probably use [PyTorch Mobile](https://pytorch.org/mobile/android/)
    - Ernie tries to find some built-in tools to make it even easier
  - completely integrating the TTS model into the (final) app needs to be discussed with the Chinese company that wrote it
    - send the model with notes on input/output to the Chinese company for integration into the app

## 08-06-2021: QA Meeting with Eran

- read the Flowtron paper: https://arxiv.org/pdf/2005.05957.pdf
- Louisa: Harvard sentences are phonetically balanced
- additional data:
  - TIMIT corpus as additional data; we probably need at least 1 h of total speech
- roughly 4 days of training
  - fine-tuning is faster than training from scratch
- training
  - speaker embedding layer: switched off, or a single speaker
  - phonetic: normal SAMPA
    - phonemes are normally better than graphemes
    - converter: train on both graphemes and phonemes
    - make sure that there are mixed sentences
- speech style:
  - variation in the speaker ("speech variation" in the paper)
    - they have the values written in the paper
  - in the first recording, don't tell her to vary her speech
  - don't get rid of punctuation
    - we have to talk to the generation group regarding punctuation
  - we have to make sure the data is of good quality
  - record her with variation in her speech and treat this as a different speaker (happy speaker, angry speaker; for example 70% happy, 30% sad)
  - radio interviews with her?
- leave 50 ms at the end before cutting the phrase
- librosa for audio processing in Python
  - bash libraries are also available
  - Praat and Audacity are possible (but very work-intensive)
- use saved checkpoints and check whether training goes in the right direction
  - the graphs show alignment; it should aim for a diagonal
- evaluation methods in the paper
  - evaluation on similarity
  - papers on voice conversion
  - MOS
  - mosh-prow?!
- write documentation on what we did

## ...
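The "50 ms at the end before cutting the phrase" advice can be sketched numerically. A minimal NumPy sketch of the padding step; in practice `librosa.load` would supply `audio` and `sr`, and the tone below is only placeholder input:

```python
import numpy as np

def pad_trailing_silence(audio, sr, tail_ms=50):
    """Append tail_ms of silence so each phrase keeps a short
    pause before the cut, as suggested in the meeting."""
    n_tail = int(sr * tail_ms / 1000)  # e.g. 1102 samples at 22050 Hz
    return np.concatenate([audio, np.zeros(n_tail, dtype=audio.dtype)])

# Placeholder input: 1 s of a 440 Hz tone at 22050 Hz.
sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
tone = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
padded = pad_trailing_silence(tone, sr)
print(len(padded) - len(tone))  # extra samples appended (about 50 ms)
```

The same idea applies symmetrically if a leading pause is wanted as well.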
## 17-05-2021

### morning

- distribution of tasks between all group members
- continued bug-fixing to make the model run

### afternoon

- fixed the run command
  - COMMAND:
  - USEFUL LINK: https://github.com/NVIDIA/flowtron/issues/41
- fixed the paths to the audio files with sed:
  - COMMAND:
- made the model run with the LJS audio files
- **To-Do: substitute with dummy audio data**

## 10-05-2021

### Flowtron

- fine-tuning (*data of 15 to 30 minutes*): adaptation of speech **variance** and **style** is possible
  - not sure whether the voice tone is adaptable as well
- training from scratch (*10+ hours of data might be needed*)
- more research is needed on training from scratch and/or fine-tuning

### Tacotron2

- training the model from scratch
- example data: [LJ Speech Dataset](https://keithito.com/LJ-Speech-Dataset/)
  - ca. 24 hours of data in total
  - over 13,000 audio clips (1-10 seconds) + transcripts
- neither the training [tutorial](https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/) nor the step-by-step approach on [GitHub](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2) worked on my machine
  - tutorial: problems with loading the model using PyTorch
  - GitHub: "Clone succeeded, but checkout failed."
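The sed one-liner for fixing the audio paths is not recorded above. As a sketch of the same fix in Python, assuming the usual Flowtron filelist layout of `wav_path|transcript|speaker_id` (the entry below is made up; check against the actual filelists in the repo):

```python
def fix_paths(lines, audio_root):
    """Prefix each relative wav path in a Flowtron-style filelist
    (assumed format: wav_path|transcript|speaker_id) with audio_root."""
    fixed = []
    for line in lines:
        wav, text, sid = line.split("|")
        fixed.append(f"{audio_root}/{wav}|{text}|{sid}")
    return fixed

# Hypothetical filelist entry; the real filelists live in the flowtron repo.
entries = ["wavs/LJ001-0001.wav|Printing, in the only sense ...|0"]
print(fix_paths(entries, "/data/LJSpeech-1.1"))
```

Reading and rewriting the filelist file around this function reproduces what the sed command did in place.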
## 07-05-2021

- next meeting

## 03-05-2021

- aj + ab updated the Overleaf document
- explained to Ernie & Vera which models we might want to use (Tacotron2, Flowtron)
  - don't forget to read about and test the two models
- talked about the audio data we might need and what to look out for
- we also added new to-do's

## 30-04-2021

- Annegret talks about the meeting with Eran
- fulfill the to-do's for this date (see section "Old To-Do's")
- talk about the next tasks, which should be done (if possible) until the next meeting (see section "Open To-Do's")

## 27-04-2021: Meeting with Eran

- Eran will be available for technical questions and can also join some group meetings
- we need at least a few hours of voice recordings
  - the existing data that can be used for testing is about 24 h
  - especially when training from scratch; less when fine-tuning an existing model
  - possibly send some example recordings to Louisa (already including the pauses before and after each sentence)
- important: a good GPU
- preprocessing: focus on the phoneme level
  - around 15 ms before and after each sentence
  - eliminate possible background noise
  - don't exclude punctuation (especially if it leads to pauses)
- Tacotron2: speech model by Google, can be trained on own data
  - https://github.com/NVIDIA/tacotron2
  - takes the most time: acoustic training
  - Python code
  - can create co-alignments in contexts
  - best to try it out with generic data as soon as possible
    - to test training and get used to it
    - leads to a similar voice (not necessarily exactly the same)
  - save temporary models every few epochs (to see improvements)
  - create an own model / use an existing model
- WaveGlow
  - connected to Tacotron2
  - vocoder, "faster than real time"
  - can be used to visualize the audio representation
  - takes the representation from the model and turns it into sound
- How do we know what to read?
  - search for papers about phonetically balanced texts
  - text recommended in the tacotron2 repo
  - maybe: texts about Louisa's own work (because this is the topic of interest)
  - texts (and recordings) need to be on sentence level
- additional ideas when there is time for that:
  - if more time: the additional challenge of mixing our new model with an existing model
  - if even more data is available (in two or more different accents): train multiple models (one per accent)

## 26-04-2021

- general setup

### Approach

- Ideas
  - Adapt voice
  - Alignment
  - Train from scratch

#### First Goal

- research how to adapt the voice in a TTS system

#### Tools

- Google TTS
- help from Eran
  - meeting tomorrow with Eran at 10:15; Annegret joins

## Funny To-Do's

- [ ] Create an accent in the voice model (hihi)

## Old To-Do's

### ...

- [x] research Flowtron (ab): [Flowtron paper](https://arxiv.org/pdf/2005.05957.pdf)
  - technicalities: understand what is meant by *speech variation* and *speech style adaptation*
    - *speech variation*: makes the voice non-monotonic
    - *speech style adaptation*: self-explanatory
  - data needed
  - try to make it run
- [x] research Tacotron2 (aj)
  - technicalities
  - data needed
  - try to make it run
- [x] set up the working environment on the Coli servers
  - see if tony-1 or tony-2 is available

### ...
- [x] read about Tacotron2 and Flowtron (until Monday)
  - [x] [Tacotron2](https://ai.googleblog.com/2017/12/tacotron-2-generating-human-like-speech.html)
    - [Article 2](https://analyticsindiamag.com/tacotron-2-google-ai-text-to-speech-system/)
  - [x] [Flowtron](https://developer.nvidia.com/blog/training-your-own-voice-font-using-flowtron/)
    - [Colab](https://colab.research.google.com/drive/1bSiFze0jb6_thZL57hYTrprpNue1MiUc?usp=sharing)
- [x] make the code in Android Studio run (until Sunday evening)
  - [Android App](https://gitlab.com/erniecyc/android-voice-app)

### 10-05-2021

- [x] integrate new question-answer pairs (ab, until Monday)

### 30-04-2021

- [x] create the Overleaf document until 2.5.
- [x] adapt the Overleaf document (~15 min)
- [x] research TTS voice adaptation (talk about research, ~20-30 min)
- [x] try out PyCharm Shared (~20 min)
- [x] create a GitLab branch? (Urs)
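The open to-do "is the audio data in the right format?" (Harvard sentence recordings) could start from a stdlib check like the sketch below. The 22050 Hz / 16-bit / mono target is an assumption borrowed from common Tacotron2/Flowtron configs, not something confirmed for our setup; the test tone only exists so the check has a file to inspect:

```python
import math
import struct
import wave

# Assumed target format (typical for Tacotron2/Flowtron recipes --
# confirm against the config actually used): 22050 Hz, 16-bit, mono.
TARGET = (22050, 2, 1)  # (framerate, sample width in bytes, channels)

def needs_conversion(path):
    """True if the wav file deviates from the target format."""
    with wave.open(path, "rb") as w:
        return (w.getframerate(), w.getsampwidth(), w.getnchannels()) != TARGET

def write_test_tone(path, rate=22050, seconds=0.1):
    """Write a short 440 Hz tone so the check has something to inspect."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(12000 * math.sin(2 * math.pi * 440 * i / rate)))
            for i in range(int(rate * seconds))
        )
        w.writeframes(frames)

write_test_tone("tone.wav")
print(needs_conversion("tone.wav"))  # False: the tone matches the target
```

Files flagged by `needs_conversion` could then be resampled with librosa (mentioned in the QA meeting) or a command-line tool, whichever the "find and test software for converting audio files" to-do settles on.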