# Data Cleanup Pipeline

## Getting Started

**The source audio files for these are [here](https://we.tl/t-PPnTiaRJsD)**

The audio sources and transcriptions are extracted from the following videos:

https://www.youtube.com/watch?v=UOoJe35T2GY
https://www.youtube.com/watch?v=8-BLWt_GdE8
https://www.youtube.com/watch?v=i5Ioa6t47lM
https://www.youtube.com/watch?v=1gqpHZoc3es
https://www.youtube.com/watch?v=5DAwY85_gYQ
https://www.youtube.com/watch?v=fjy5Gk9-ulg
https://www.youtube.com/watch?v=OgiDY_AIFMg
https://www.youtube.com/watch?v=-iPMwOLWoE4
https://www.youtube.com/watch?v=bxBE-DYF-3c
https://www.youtube.com/watch?v=wA3Ta_lPUmM
https://www.youtube.com/watch?v=_H8ru45qmO0
https://www.youtube.com/watch?v=7sJlNObrBRo
https://www.youtube.com/watch?v=oI-l9Ss1aYM

## Audio Pre-Processing

We have a system to generate transcriptions, and these transcriptions will later need human review. However, to make sure the transcriptions are accurate and don't include other speakers, we need to pre-process the data first. Illustrative Python sketches for each step appear in the appendix further below.

### Step One - Remove all audio that isn't the speaker

We want to load the full WAV files and delete any noise, leaving only the speaker's audio in place. The final output of this step should be an audio file of equal length to the input file, but with every part that is not the speaker removed or replaced with silence.

### Step Two - Normalize and equalize all audio tracks together

Speech synthesis works best on audio that is clear and consistent. We want to normalize and equalize the tracks so that they are all at the same level and, ideally, share a similar frequency range / tonality. Feel free to add compression, limiting, etc.

**If there are any audio files which don't sound like the others or don't fit the corpus well, feel free to exclude them.**

In particular, watch out for noise, reverb, music, etc., as these negatively impact training.

### Step Three - Remove Noise

Any kind of modern audio noise-cleaning tool (iZotope, etc.) that can remove external noise is helpful here.

### Step Four - Export as 22 kHz mono

We will be training on mono sources and want to work exclusively in mono once the audio is prepared, since the rest of the process will go faster.

### Step Five - Zip and prepare for transcription

Please send the content back to The Nexus for transcription, along with any notes.

# Transcription and Timing

The Nexus will run the transcription from our repo here:
https://github.com/AtlasFoundation/AutomaticTikTalk/tree/main/youtube

If the audio engineer knows Python, they can run this themselves. However, we expect that The Nexus will need to run it. Feel free to try it, though!

# Transcription Review

Once transcription is finished, The Nexus will send the metadata.txt and split WAV files containing the transcriptions and timings back to the audio engineer.

The audio engineer will need to listen through all of the audio and review the transcripts to make sure they are valid.

For uncommon words or names, please spell them out phonetically as best you can hear. For example, for the character Kaname Date, we would use 'dah tay' whenever a character said the name.

Here is the phonetic pronunciation dictionary that we use for all words:
http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b

All expletives will need to be added back if the transcription has removed them. For particularly offensive words, we leave it to the audio engineer's discretion to either transcribe them or delete all uses from the source audio.
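# Appendix: Scripted Pre-Processing Sketches

The snippets below are minimal Python sketches of Steps One through Five from the Audio Pre-Processing section. They assume the pydub, noisereduce, and soundfile packages; none of these tools is mandated by the pipeline, and the thresholds, rates, and paths are illustrative assumptions rather than fixed requirements.

For Step One, simple loudness-based silence gating can serve as a rough automated first pass, though removing other speakers will still require manual editing:

```python
# Sketch of Step One: gate everything below a loudness threshold to silence
# while preserving the file's exact length. pydub and the -40 dBFS threshold
# are assumptions, and the result still needs a manual pass for other voices.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

def gate_to_speech(in_path: str, out_path: str) -> None:
    audio = AudioSegment.from_wav(in_path)
    # Regions louder than the threshold are kept; everything else is silenced.
    keep = detect_nonsilent(audio, min_silence_len=500, silence_thresh=-40)
    # Start from a silent canvas of identical length so timing is preserved.
    out = AudioSegment.silent(duration=len(audio), frame_rate=audio.frame_rate)
    for start_ms, end_ms in keep:
        out = out.overlay(audio[start_ms:end_ms], position=start_ms)
    out.export(out_path, format="wav")
```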
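For Step Two, level matching can be scripted, while EQ and compression are better done by ear in a DAW. The -20 dBFS target here is an assumption:

```python
# Sketch of Step Two's level matching: peak-normalize each track, then nudge
# it toward a common average loudness so the corpus sits at one level.
from pydub import AudioSegment
from pydub.effects import normalize

def match_levels(in_path: str, out_path: str, target_dbfs: float = -20.0) -> None:
    audio = normalize(AudioSegment.from_wav(in_path))   # peak-normalize first
    audio = audio.apply_gain(target_dbfs - audio.dBFS)  # then match loudness
    audio.export(out_path, format="wav")
```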
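For Step Three, the open-source noisereduce package is one scriptable alternative to iZotope-style tools; the library choice is an assumption, and results should still be checked by ear:

```python
# Sketch of Step Three: spectral-gating noise reduction on a mono WAV file.
import noisereduce as nr
import soundfile as sf

def denoise(in_path: str, out_path: str) -> None:
    data, rate = sf.read(in_path)          # mono input assumed
    reduced = nr.reduce_noise(y=data, sr=rate)
    sf.write(out_path, reduced, rate)
```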
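For Step Four, the downmix and resample are mechanical:

```python
# Sketch of Step Four: downmix to mono and resample to 22,050 Hz (the usual
# sample rate meant by "22 kHz" in TTS pipelines) before export.
from pydub import AudioSegment

def export_22k_mono(in_path: str, out_path: str) -> None:
    audio = AudioSegment.from_wav(in_path)
    audio.set_channels(1).set_frame_rate(22050).export(out_path, format="wav")
```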
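For Step Five, the standard library can produce the zip for hand-off; the "processed" directory name is an assumption about local layout:

```python
# Sketch of Step Five: bundle the processed folder into a single zip archive
# to send back to The Nexus for transcription.
import shutil

shutil.make_archive("dataset_for_transcription", "zip", root_dir="processed")
```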
# Final Review

The final dataset that is fed into the training system will be the metadata.csv file and a collection of split audio files, each a few words in length. The audio engineer will need to review the final audio and transcriptions to make sure everything lines up, matches, and sounds relatively consistent across the corpus.
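As a last sanity check before training, a short script can confirm that every row of metadata.csv points at an existing split WAV file. This sketch assumes the common LJSpeech-style pipe-delimited layout (file ID, then transcription) and a wavs/ directory; both are assumptions about the layout, not something confirmed by the repo, so adjust to the actual format:

```python
# Hedged sketch: verify that metadata.csv rows line up with split WAV files.
# Assumes "id|transcription" rows and a wavs/ directory (both assumptions).
import csv
from pathlib import Path

def check_dataset(metadata_path: str = "metadata.csv", wav_dir: str = "wavs") -> None:
    missing = []
    with open(metadata_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="|"):
            if not row:
                continue  # skip blank lines
            if not (Path(wav_dir) / f"{row[0]}.wav").exists():
                missing.append(row[0])
    if missing:
        print(f"{len(missing)} transcriptions have no matching WAV:", missing[:10])
    else:
        print("All metadata rows have matching audio files.")

check_dataset()
```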