# Deep Learning HW 02@NYCU, 2023

## Environment & Tools

* [ESPnet](https://github.com/espnet/espnet): end-to-end speech processing toolkit
* [Pytorch](https://pytorch.org/docs/stable/index.html): optimized tensor library for deep learning
* [Sox](https://sox.sourceforge.net/): converts audio files between various formats

## Data Format

* Taiwanese audio (female speaker, Kaohsiung, 3 hrs), provided by the NYCU Speech AI Research Center
* Input: .wav, 22 kHz, mono, 32-bit
* Output: Taiwanese romanized script (pinyin-style scheme)

### Testing Data

* random-noisy-test-7

### Training Data

* clean
* noise1: same settings as random-noisy-test-7
* noise2: higher probability of adding noise, plus wider ranges of time stretching and pitch shifting (see the augmentation sketch at the end of the Training Process section)

## Model

### 1. wav2vec2

vq-wav2vec converts the encoder output into discrete units, which is closer to the way humans learn language. wav2vec 2.0 then replaces the original context network (most of the time a simple CNN) with a Transformer. Previously, learning was a two-step process: the encoder first learned representations from the speech, and a CNN then learned context from those representations. wav2vec 2.0 learns both jointly, which yields better performance.

![](https://hackmd.io/_uploads/SJkjwNU4h.png)
wav2vec

![](https://hackmd.io/_uploads/SkeA84LNn.png)
wav2vec 2.0

### 2. hubert

In wav2vec 2.0, discretization is applied after the encoder (i.e., to the representations); HuBERT moves it after the context network, so the discretized targets cover both representation and context.

![](https://hackmd.io/_uploads/HyNTLVIE2.png)

### 3. wavlm

WavLM additionally takes denoising into account: target labels are generated from clean audio, and training on noise-augmented inputs pushes the outputs to stay as close as possible to those targets.

![](https://hackmd.io/_uploads/ryxqINL42.png)

## Training Process

Size of the pre-trained models (a feature-extraction sketch follows the experiment list below):

* wav2vec2_base_s2st_en_librilight: 380.3 MB
* wavlm_base_plus: 377.6 MB
* wavlm_large: 1.3 GB
* hubert_large_ll60k: 1.3 GB

Eight experiments were carried out in total; their settings are as follows:

### 1. w2v2_960_35epoch

* front-end: wav2vec2_base_s2st_en_librilight (w2v2_960)
* epoch: 35
* data: clean

### 2. w2v2_960_80epoch

* front-end: wav2vec2_base_s2st_en_librilight (w2v2_960)
* epoch: 80
* data: clean
* <font color = red>batch bins: 100000 (originally 40000000)</font>

### 3. w2v2_960_80epoch

* front-end: wav2vec2_base_s2st_en_librilight (w2v2_960)
* epoch: 80
* data: clean

### 4. HuBERT

* front-end: hubert_large_ll60k
* epoch: 35
* data: clean

### 5. WavLM_base

* front-end: wavlm_base_plus
* epoch: 80
* data: clean

### 6. HuBERT + Noise

* front-end: hubert_large_ll60k
* epoch: 35
* data: clean + noise1

### 7. HuBERT + DoubleNoise

* front-end: hubert_large_ll60k
* epoch: 35
* data: clean + noise1 + noise2

### 8. WavLM_large + DoubleNoise

* front-end: wavlm_large
* epoch: 35
* data: clean + noise1 + noise2
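As a concrete illustration of how these front-ends consume audio, here is a minimal feature-extraction sketch using torchaudio's public SSL bundles. `WAVLM_BASE_PLUS` is torchaudio's own checkpoint, used only as a stand-in for the exact ESPnet front-ends listed above; `sample.wav` is a placeholder path, and torchaudio >= 0.12 is assumed.

```python
import torch
import torchaudio

# torchaudio's public bundle, a stand-in for the wavlm_base_plus front-end above
bundle = torchaudio.pipelines.WAVLM_BASE_PLUS
model = bundle.get_model().eval()

# the raw data is 22 kHz / mono / 32-bit; SSL front-ends expect 16 kHz,
# so resample first (sox can do the same conversion offline)
waveform, sr = torchaudio.load("sample.wav")  # placeholder path
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    # one frame-level feature tensor per transformer layer
    features, _ = model.extract_features(waveform)

print(features[-1].shape)  # (batch, num_frames, hidden_dim)
```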
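The report describes noise1/noise2 only by their settings (probability of adding noise, time stretch, pitch shift), so the sketch below is just a plausible reconstruction of such an augmentation pipeline using Sox effects through torchaudio. The tempo/pitch ranges, the SNR interval, and the white-noise choice are all illustrative assumptions; noise2 would correspond to a higher `noise_prob` and wider ranges than noise1.

```python
import random
import torch
import torchaudio

def augment(waveform: torch.Tensor, sr: int,
            noise_prob: float = 0.5,
            tempo_range=(0.9, 1.1),
            pitch_range=(-200, 200)) -> torch.Tensor:
    """Randomly time-stretch, pitch-shift, and add noise to a waveform.

    All ranges and probabilities are illustrative, not the values actually
    used; "noise2" raises noise_prob and widens the ranges relative to "noise1".
    """
    effects = [
        ["tempo", f"{random.uniform(*tempo_range):.2f}"],  # time stretch
        ["pitch", f"{random.uniform(*pitch_range):.0f}"],  # pitch shift (cents)
        ["rate", str(sr)],                                 # keep sample rate fixed
    ]
    waveform, sr = torchaudio.sox_effects.apply_effects_tensor(waveform, sr, effects)

    if random.random() < noise_prob:
        snr_db = random.uniform(5, 20)
        noise = torch.randn_like(waveform)
        # scale the white noise so the mixture hits the sampled SNR
        sig_pow = waveform.pow(2).mean()
        noise_pow = noise.pow(2).mean()
        scale = torch.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
        waveform = waveform + scale * noise
    return waveform
```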
## Experiment Result & Review

* Evaluation: because of Kaggle's limitations, the evaluation metric for this competition is the mean Levenshtein distance, computed per sentence at the character level (the closer to 0 the better). The usual scoring standard for speech recognition is word error rate (WER):

  $$WER = \frac{D + S + I}{N} \times 100\%$$

  where $N$ is the total number of labels (total number of words), $D$ the deletion errors, $S$ the substitution errors, and $I$ the insertion errors. A sketch of both metrics is given at the end of this report.

* Score (baseline = 20):
    1. w2v2_960_35epoch - $16.44$
    2. w2v2_960_80epoch - $16.05$
    3. w2v2_960_80epoch - $16.52$
    4. HuBERT - $13.18$
    5. WavLM_base - $14.62$
    6. HuBERT + Noise - $8.01$
    7. HuBERT + DoubleNoise - $7.13$
    8. WavLM_large + DoubleNoise - $6.00$

## Summarization

### 1. Using a larger pre-trained model often gets better performance.

In these experiments, four models were used in total: wav2vec2_base, wavlm_base_plus, wavlm_large, and hubert_large_ll60k. The first two are relatively small, around 380 MB, while the latter two are larger at 1.3 GB. Due to hardware limitations (3080, 12000 MiB), the batch size had to be reduced to 1/100 of the original setting for all models.

Another interesting phenomenon is that WavLM (base, large) seems to perform even better than wav2vec2 (base) and HuBERT (large). This can also be inferred from the original papers: both WavLM and HuBERT predict masked hidden units, but WavLM additionally puts effort into denoising.

### 2. Adding noisy data raises a problem: a dramatic drop in accuracy.

The last three experiments added noisy data. Without exception, accuracy dropped sharply midway through training, and the models overfit afterwards. Adjusting the learning rate was tried, but it left the model unable to extract useful information from the very beginning of training.

![](https://hackmd.io/_uploads/rkJ1tNLV2.png)
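Finally, a small self-contained sketch of the two metrics discussed above: the competition's per-sentence character-level Levenshtein distance, and WER computed from the same edit distance at word level (the function names are illustrative):

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences
    (minimum number of deletions, substitutions, and insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def mean_char_levenshtein(refs, hyps) -> float:
    """Competition metric: char-level edit distance averaged over sentences."""
    return sum(edit_distance(r, h) for r, h in zip(refs, hyps)) / len(refs)

def wer(ref: str, hyp: str) -> float:
    """WER = (D + S + I) / N * 100%, with N = number of reference words."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words) * 100

print(mean_char_levenshtein(["lí hó"], ["li ho"]))  # 2.0 (two substituted chars)
```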