# SoPro: Adapt Voice
**Group members:**
Annegret Janzo (aj), Urs Peter (up), Joanna Dietinger (jd), Anastasia Borisenkov (ab)
> [Overleaf Template: Adapt voice](https://www.overleaf.com/2215996661byfhpmzmqkhn)
## Open To-Do's
- [ ] How do we account for accent?
- [ ] continue writing on project proposal in overleaf
- [ ] Research text data for Louisa to read out (jd)
- [ ] [Harvard Sentences](https://github.com/dariusk/corpora/pull/201/commits/2a5ea77a18e3fc78d787fbbc3a395b7df70492c9)
- [ ] update Overleaf document on Friday (ab, aj, jd, up)
- [x] verify sound studio constraints with Louisa:
- Timeframe for recording; dates
- [ ] research presets of the android-voice app (up)
- [ ] research phonetic details for voice recording (jd)
- how should the audio data be recorded
- which audio data can we use
- how/what should be annotated about the audio
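Relevant for the question of how the audio should be recorded: the NVIDIA Tacotron2/Flowtron recipes train on LJ Speech-style audio, i.e. 22050 Hz, mono, 16-bit PCM WAV. A minimal format check with Python's stdlib `wave` module (the target values are an assumption taken from that format, not a confirmed project decision):

```python
import wave

# Assumed target format, following the LJ Speech / NVIDIA recipes:
# 22050 Hz, mono, 16-bit PCM
TARGET_RATE = 22050
TARGET_CHANNELS = 1
TARGET_SAMPWIDTH = 2  # bytes, i.e. 16 bit

def check_recording(path):
    """Return a list of format problems for a WAV file (empty list = OK)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != TARGET_RATE:
            problems.append(f"sample rate {wav.getframerate()} != {TARGET_RATE}")
        if wav.getnchannels() != TARGET_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getsampwidth() != TARGET_SAMPWIDTH:
            problems.append(f"{wav.getsampwidth() * 8} bit, expected 16 bit")
    return problems
```

This only checks container parameters; resampling itself would still be done with a tool like sox or ffmpeg.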
### To-Do's till Friday Meeting
- [x] Do pre-recorded audio data from Harvard sentences exist? (jd)
- [ ] if yes, is the audio data in the right format?
- [ ] convert the audio data if the format is not correct
- [ ] manually convert some Harvard sentences to the technical specifications named in the paper
- [ ] find and test software for converting audio files
- [x] research and (hopefully) fix the bug with checkpoint loading of models in Flowtron (ab, up)
- [ ] evaluate the trained model?
- [x] make the model converge (1 million epochs is way too much)
- [ ] adapt the read-in txt file for model training (up)
- [ ] how to call the trained model for inference & evaluate it (ab)
- [x] How to integrate our adapted TTS model to the android TTS app (aj)
- the demo app uses Android's built-in TTS system
- [Android TTS System](https://developer.android.com/reference/android/speech/tts/TextToSpeech)
- [Android Voice Adjustments](https://developer.android.com/reference/android/speech/tts/Voice#getFeatures())
- [Android Speech Recognition and TTS](https://gotev.github.io/android-speech/)
- [x] need to find out
- [x] a) how to adjust these settings to get the voice we need or
- [x] b) how to replace this model so that it still fits the existing code
- to run a PyTorch model on Android, we can probably use [PyTorch Mobile](https://pytorch.org/mobile/android/)
- Ernie will try to find some built-in tools to make it even easier
- fully integrating the TTS model into the (final) app needs to be discussed with the Chinese company that wrote it
- send the model, with notes on its input/output, to the Chinese company for integration into the app
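For the task of adapting the read-in txt file for training: the Flowtron repo reads pipe-separated filelists, one utterance per line, of the form `audio_path|transcript|speaker_id` (the same convention as NVIDIA's Tacotron2 filelists). A small sketch with a made-up path:

```python
def make_filelist_line(wav_path, transcript, speaker_id=0):
    """Build one line of a Flowtron-style filelist: audio_path|transcript|speaker_id."""
    # '|' is the field separator, so it must not occur in the transcript
    assert "|" not in transcript, "transcript must not contain '|'"
    return f"{wav_path}|{transcript}|{speaker_id}"

# Hypothetical entry for one of Louisa's Harvard-sentence recordings
line = make_filelist_line("data/louisa/harvard_001.wav",
                          "The birch canoe slid on the smooth planks.", 0)
```

Since we train on a single speaker, the speaker id can stay constant (or the speaker embedding is switched off entirely, as Eran suggested below).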
## 08-06-2021: QA-Meeting with Eran
- read flowtron paper
- https://arxiv.org/pdf/2005.05957.pdf
- Louisa: Harvard sentences - phonetically balanced
- additional data:
- TIMIT corpus as additional data; we probably need at least 1 h of total speech
- Roughly 4 days of training
- Fine-tuning faster than training from scratch
- Training
- speaker embedding layer - switched off or one speaker
- phonetic transcription, normal SAMPA
- phonemes normally work better than graphemes
- converter
- train on both graphemes and phonemes
- make sure that there are mixed sentences
- Speech Style:
- variation in the speaker (called speech variation in the paper)
- the values are given in the paper
- in the first recording don't tell her to vary her speech
- don't get rid of punctuation
- have to talk to the generation group regarding punctuation
- Have to make sure the data is in good quality
- record her with variation in her speech and treat this as a different speaker (happy speaker, angry speaker; for example 70% happy, 30% sad)
- radio interviews with her?
- 50 ms at the end before cutting the phrase
- librosa for audio processing in python
- bash libraries also available
- Praat and Audacity possible (but very work-intensive)
- use saved checkpoints and check whether training goes in the right direction
- the graphs show the alignment; it should converge to a diagonal
- Evaluation methods in the Paper
- evaluation on similarity
- papers on voice conversion
- MOS (mean opinion score)
- mosh-prow?!
- write documentation on what we did
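The "50 ms at the end before cutting the phrase" advice amounts to keeping `int(sr * 0.050)` samples of audio after the last voiced sample. A plain-Python sketch of that arithmetic (in practice `librosa.effects.trim` would do the silence detection; the threshold below is an arbitrary placeholder):

```python
def cut_with_padding(samples, sr=22050, threshold=0.01, keep_ms=50):
    """Cut trailing silence, keeping keep_ms milliseconds of audio after
    the last sample whose absolute amplitude exceeds threshold."""
    last_voiced = 0
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            last_voiced = i
    pad = int(sr * keep_ms / 1000)  # 50 ms at 22050 Hz -> 1102 samples
    return samples[: last_voiced + 1 + pad]
```

The same idea applies symmetrically to leading silence if we keep a short pause before each sentence as well.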
## ...
## 17-05-2021
### morning
- distribution of tasks between all group members
- continued bug-fixing in making model run
### afternoon
- fixed the run command
- COMMAND:
- USEFUL LINK: https://github.com/NVIDIA/flowtron/issues/41
- fixed the paths to the audio files with sed:
- COMMAND:
- made the model run with LJS audio files
- **To-Do: substitute with dummy audio data**
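The exact sed command was not noted down above, but the general shape of the path fix (the `DUMMY/` placeholder prefix is what the NVIDIA filelists ship with; the target path here is a made-up example) would be along these lines:

```shell
# Create an example filelist line with the repo's DUMMY/ placeholder prefix
# (made-up entry; the real filelists ship with the Flowtron repo)
printf 'DUMMY/LJ001-0001.wav|some transcript|0\n' > train_filelist.txt

# Rewrite the placeholder to the actual audio directory; the target path is
# an assumption -- substitute the real location of the wav files.
# -i.bak edits in place and keeps a backup copy.
sed -i.bak 's|^DUMMY/|/path/to/LJSpeech-1.1/wavs/|' train_filelist.txt
```

Using `|` as the sed delimiter avoids escaping the slashes in the paths; since `|` is also the filelist field separator, the pattern must stay anchored to the path field at the start of the line.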
## 10-05-2021
### Flowtron
- Fine-tuning (*data of 15 to 30 minutes*): adaptation of speech **variance** and **style** possible
- not sure if voice tone is adaptable as well
- Train from scratch (*data of 10+ hours might be needed*)
- more things need to be researched about training from scratch and/or fine-tuning
### Tacotron2
- training model from scratch
- example data: [LJ Speech Dataset](https://keithito.com/LJ-Speech-Dataset/)
- total of ca. 24 hours of data
- over 13000 audio clips (1-10 seconds) + transcripts
- neither the [tutorial](https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/) on training with this data nor the step-by-step approach in the [GitHub repo](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2) worked on my machine
- Tutorial: problems with loading the model using pytorch
- Github: "Clone succeeded, but checkout failed."
## 07-05-2021
- next meeting
## 03-05-2021
- aj + ab updated overleaf document
- explained to Ernie & Vera which models we might want to use (Tacotron2, Flowtron)
- Don't forget to read and test the two models
- talked about audio data we might need and what to look out for
- we also added new to-do's
## 30-04-2021
- Annegret talks about meeting with Eran
- fulfill to-do's for this date (see section "Old To-Do's")
- talk about next tasks which should be done (if possible) until the next meeting (see section "Open To-Do's")
## 27-04-2021: Meeting with Eran
- Eran will be available for technical questions and can also join some group meetings
- we need at least a few hours of voice recordings
- existing data that can be used for testing is about 24h
- especially when training from scratch
- less when taking an existing model
- possibly sending some example recordings to Louisa (already including the pauses before and after each sentence)
- important: good GPU
- preprocessing: focus on phoneme-level
- around 15ms before and after each sentence
- eliminating possible background noise
- not excluding punctuation (especially if it leads to pauses)
- Tacotron2: speech model by Google; can be trained on our own data
- https://github.com/NVIDIA/tacotron2
- takes most time
- acoustic training
- python code
- can create co-alignments in contexts
- best to even try out with generic data to test as soon as possible
- to test training and getting used to it
- leads to a similar voice (not necessarily exactly the same)
- save temporal models every few epochs (to see improvements)
- create own model/use existing model
- WaveGlow
- connected to tacotron2
- vocoder (not an encoder)
- "faster than real time"
- can be used to visualize audio representation
- takes representation from model and turns it to sound
- How to know what to read?
- searching for papers about phonetically balanced texts
- text recommended in the tacotron2-repo
- maybe: texts about Louisa's own work (because this is the topic of interest)
- text (and recordings) need to be on sentence-level
- Additional ideas when there is time for that
- if more time: additional challenge of mixing our new model with an existing model
- if even more data available (in two or more different accents)
- train multiple models (one per accent)
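On WaveGlow being "faster than real time" (noted in the meeting above): the usual measure is the real-time factor, compute time divided by the duration of the generated audio. The numbers below are illustrative, not measured:

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = compute time / audio duration; RTF < 1 means faster than real time."""
    return synthesis_seconds / audio_seconds

# Illustrative (made-up) numbers: 10 s of audio generated in 2 s of compute
rtf = real_time_factor(2.0, 10.0)  # 0.2, i.e. five times faster than real time
```

An RTF below 1 is what makes on-device or interactive synthesis in the Android app plausible at all.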
## 26-04-2021
- general setup
### Approach - Ideas
- Ideas
- Adapt voice
- Alignment
- Train from scratch
#### First Goal
- research how to adapt voice in TTS system
#### Tools:
- Google TTS
- Help by Eran
- Meeting Tomorrow with Eran at 10:15; Annegret joins
## Funny To-Do's
- [ ] Create accent in voice model (hihi)
## Old To-Do's
### ...
- [x] research about flowtron (ab) [Paper flowtron](https://arxiv.org/pdf/2005.05957.pdf)
- technicalities: understand what is meant with *speech variation* and *speech style adaptation*
- *speech variation*: makes voice non-monotonic
- *speech style adaptation*: self-explanatory
- data needed
- try to make it run
- [x] research about tacotron2 (aj)
- technicalities
- data needed
- try to make it run
- [x] set-up working environment on Coli-servers
- see if tony-1 or tony-2 is available
### ...
- [x] Read about tacotron2 and flowtron (until monday)
- [x] [tacotron2](https://ai.googleblog.com/2017/12/tacotron-2-generating-human-like-speech.html)
- [Article 2](https://analyticsindiamag.com/tacotron-2-google-ai-text-to-speech-system/)
- [x] [flowtron](https://developer.nvidia.com/blog/training-your-own-voice-font-using-flowtron/)
- [Colab](https://colab.research.google.com/drive/1bSiFze0jb6_thZL57hYTrprpNue1MiUc?usp=sharing)
- [x] Make the code in AndroidStudio run (make run until Sunday evening)
- [Android App](https://gitlab.com/erniecyc/android-voice-app)
### 10-05-2021
- [x] Integrate new question-answer pairs (ab until monday)
### 30-04-2021
- [x] Create Overleaf Document by 02-05-2021
- [x] Adapt Overleaf Document (~15 min)
- [x] Research TTS voice adaption (talk about research ~20-30 min)
- [x] PyCharm Shared: try-out (~20 min)
- [x] create GitLab branch? (Urs)