# Multi sample rate codec

## Step 1 : Resynthesize speech

1. Create a on HuggingFace
2. Get speech data from xxx
3. Clone [Codec-SUPERB](https://github.com/dlion168/AudioDecBenchmark/tree/superb_main) from this branch to resynthesize speech with each codec. You need to write a dataset loading script [here](https://github.com/voidful/Codec-SUPERB/tree/main/SoundCodec/dataset) to load the data from disk. You may need to use [audiofolder](https://huggingface.co/docs/datasets/audio_load) to load the data into the HuggingFace dataset format. [Example 1](https://github.com/dlion168/AudioDecBenchmark/blob/superb_main/SoundCodec/dataset/PODCAST.py). [Example 2](https://github.com/dlion168/AudioDecBenchmark/blob/superb_main/SoundCodec/dataset/IMPROV.py). Note that some datasets with "train" or "test" in their filenames may not be loaded correctly.
4. Useful scripts

   ```
   # install the benchmark requirements and the codec packages
   pip install -r requirements.txt
   pip install git+https://github.com/voidful/AudioDec.git
   pip install git+https://github.com/voidful/descript-audio-codec.git
   pip install encodec
   pip install git+https://github.com/voidful/FunCodec.git

   # resynthesize the PODCAST dataset on GPU 2 and push the result to the Hub
   CUDA_VISIBLE_DEVICES=2 python dataset_creator.py --dataset PODCAST --push_to_hub

   # check that the number of loaded files matches the number of files in the folder
   ls -1 | wc -l
   ```
5. Check the quality of the resynthesized speech by listening to some samples on [Emo-Codec](https://huggingface.co/Emo-Codec).
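The file-count check in step 4 can also be done from Python, which makes it easy to compare against the size of the loaded dataset. The following is a minimal sketch, not part of Codec-SUPERB: the helper names `count_audio_files` and `check_loaded_count` and the list of audio extensions are assumptions for illustration. Unlike `ls -1 | wc -l`, it counts only files with an audio extension.

```python
from pathlib import Path

# Assumed set of audio extensions; adjust to match the data on disk.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3"}


def count_audio_files(folder, recursive=False):
    """Count audio files in `folder`, optionally including subfolders."""
    root = Path(folder)
    paths = root.rglob("*") if recursive else root.iterdir()
    return sum(
        1 for p in paths
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


def check_loaded_count(folder, num_loaded, recursive=False):
    """Compare the number of audio files on disk with the number loaded.

    Returns a pair (counts_match, num_on_disk).
    """
    num_on_disk = count_audio_files(folder, recursive=recursive)
    return num_on_disk == num_loaded, num_on_disk
```

For example, after loading a dataset one might call `check_loaded_count("path/to/audio", len(dataset))`, where the path is a placeholder, and inspect the returned count if the two numbers differ.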

**Codecs to run**:

* SpeechTokenizer
* DAC 16k, 24k, 44k
* Encodec 24k all 5 models
* Funcodec en_16k_nq32ds320 & en_16k_nq32ds640, zh_en_16k_nq32ds320 & zh_en_16k_nq32ds640
* AudioDec 24k, 48k_uni
* AcademiCodec 16k_320d_large_uni, 24k
* LanguageCodec chinese_8nq, paper_8nq
* FAcodec

20 models in total

```
from datasets import load_dataset, Audio, Dataset
from functools import partial


def gen_from_iterable_dataset(iterable_ds):
    # Yield examples one by one from the streaming dataset.
    yield from iterable_ds


def load_data():
    # Stream the "original" split instead of downloading everything up front.
    dataset = load_dataset("Codec-SUPERB/fsd50k_synth", split="original", streaming=True)
    # Materialize the stream as a regular Dataset, keeping the same features.
    dataset = Dataset.from_generator(partial(gen_from_iterable_dataset, dataset), features=dataset.features)
    # Resample the audio column to 16 kHz when it is decoded.
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
    return dataset
```

## Step 2 : Evaluate performance of resynthesized speech

1. Run [benchmarking.py](https://github.com/dlion168/AudioDecBenchmark/blob/main/benchmarking.py) for all metrics.

Note:

* Skip PESQ, STOI, F0corr for audio data
* Skip F0corr for speech data

## Paper writing

* When you want to modify the current content, please preserve the original version and comment it out
* For each paragraph, write your name above it
* After completing each line, press 'Enter' to start a new line
* For consistency, use Table~\ref{} and Figure~\ref{}
* Watch out for duplicated references
* Check the \label when making a table and avoid duplicated \labels
* When drawing a table, you can use ChatGPT with detailed prompts
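
Duplicated `\label`s are easy to miss by eye, especially when original paragraphs are kept as comments. The following is a small sketch of one way to check for them; the helper name `find_duplicate_labels` is an assumption for illustration. It ignores anything after an unescaped `%`, so labels inside commented-out originals are not counted.

```python
import re
from collections import Counter

# Matches \label{...} and captures the label name.
LABEL_PATTERN = re.compile(r"\\label\{([^}]*)\}")
# Matches a LaTeX comment: an unescaped % up to the end of the line.
COMMENT_PATTERN = re.compile(r"(?<!\\)%.*")


def find_duplicate_labels(tex_source):
    """Return a sorted list of labels defined more than once."""
    # Strip comments first so preserved original versions are ignored.
    lines = [COMMENT_PATTERN.sub("", line) for line in tex_source.splitlines()]
    counts = Counter(LABEL_PATTERN.findall("\n".join(lines)))
    return sorted(label for label, n in counts.items() if n > 1)
```

To use it, read the paper source into a string (for example from a hypothetical `paper.tex`) and pass it to `find_duplicate_labels`; an empty list means no label is defined twice.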