# HW01 Taiwanese-ASR
Model training for this assignment was performed on a **GeForce RTX 2080 Ti**.
## Environment Setup
The environment is built following the [2022 Tutorial at CMU](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_recipe_tutorial_CMU_11751_18781_Fall2022.ipynb#scrollTo=S85-3X82kWbm).
### Install packages for espnet
```
sudo apt-get install cmake
sudo apt-get install sox
sudo apt-get install libsndfile1-dev
sudo apt-get install ffmpeg
sudo apt-get install flac
git clone https://github.com/espnet/espnet # Download ESPnet
```
### Build conda environment
```
conda create --name espnet python=3.9.16
conda activate espnet
```
### Setup conda environment
```
cd espnet/tools
CONDA_TOOLS_DIR=/home/s112065522/miniconda3 # Change it to your actual path
./setup_anaconda.sh ${CONDA_TOOLS_DIR} espnet 3.9.16
make TH_VERSION=1.12.1 CUDA_VERSION=11.3 -j32 # Install ESPnet
```
After that, install the modules you need. For example, to use SSLRs, install **s3prl** with ```. ./activate_python.sh && ./installers/install_s3prl.sh```.
### Create a new recipe
```
cd espnet/egs2
./TEMPLATE/asr1/setup.sh ./taiwanese/asr1
```
---
## Dataset preprocessing
To run a new dataset, we need to create a new folder (```mkdir downloads```) under `espnet/egs2/taiwanese/asr1` and move the Kaggle dataset into it.
After that, we also need to specify the absolute path to the dataset in `db.sh` as follows:
```
echo "" >> db.sh
echo "TAIWANESE=/espnet/egs2/taiwanese/asr1/downloads" >> db.sh
```
### Setup the script for data preparation
1. Create a file `local/data.sh`.
```
#!/usr/bin/env bash
set -e
set -u
set -o pipefail

log() {
    local fname=${BASH_SOURCE[1]##*/}
    echo -e "$(date '+%Y-%m-%dT%H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}
SECONDS=0

stage=1
stop_stage=100

log "$0 $*"
. utils/parse_options.sh

. ./db.sh
. ./path.sh
. ./cmd.sh

if [ $# -ne 0 ]; then
    log "Error: No positional arguments are required."
    exit 2
fi

if [ -z "${TAIWANESE}" ]; then
    log "Fill the value of 'TAIWANESE' of db.sh"
    exit 1
fi

train_set="train_nodev"
train_dev="train_dev"
test_set="test"
ndev_utt=2000  # validation set size

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    log "stage 1: Data preparation"
    mkdir -p data/{train,test}

    # Prepare data in the Kaldi format, including three files:
    # text, wav.scp, utt2spk
    python3 local/data_prep.py ${TAIWANESE} sph2pipe

    for x in train test; do
        for f in text wav.scp utt2spk; do
            sort data/${x}/${f} -o data/${x}/${f}
        done
        utils/utt2spk_to_spk2utt.pl data/${x}/utt2spk > "data/${x}/spk2utt"
    done

    # Make a dev set: first ${ndev_utt} utterances -> dev, the rest -> train_nodev
    utils/subset_data_dir.sh --first data/train "${ndev_utt}" "data/${train_dev}"
    n=$(($(wc -l < data/train/text) - ndev_utt))
    utils/subset_data_dir.sh --last data/train "${n}" "data/${train_set}"
fi

log "Successfully finished. [elapsed=${SECONDS}s]"
```
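The dev split at the end of stage 1 takes the first `ndev_utt` utterances as the validation set and leaves the rest as `train_nodev`. The logic of the two `subset_data_dir.sh` calls can be sketched as (a toy illustration with made-up utterance IDs):

```python
def split_dev(utt_ids, ndev_utt=2000):
    """Mirror of the two subset_data_dir.sh calls:
    --first keeps the first ndev_utt utterances (the dev set),
    --last keeps the remaining n = total - ndev_utt (train_nodev)."""
    train_dev = utt_ids[:ndev_utt]
    train_nodev = utt_ids[ndev_utt:]
    return train_dev, train_nodev

# Toy example with 10 hypothetical utterance IDs
ids = [f"utt{i:04d}" for i in range(10)]
dev, train = split_dev(ids, ndev_utt=3)
print(len(dev), len(train))  # 3 7
```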
2. Create a Python script `local/data_prep.py`, which is used by the shell script above.
```
import os
import sys

if len(sys.argv) != 3:
    print("Usage: python data_prep.py [root] [sph2pipe]")
    sys.exit(1)

root = sys.argv[1]
sph2pipe = sys.argv[2]  # kept for interface compatibility; not used here

for x in ["train", "test"]:
    with open(os.path.join(root, x, "text.txt")) as transcript_f, \
         open(os.path.join("data", x, "text"), "w") as text_f, \
         open(os.path.join("data", x, "wav.scp"), "w") as wav_scp_f, \
         open(os.path.join("data", x, "utt2spk"), "w") as utt2spk_f:
        # Skip the header line
        next(transcript_f)
        for line in transcript_f:
            line = line.strip()
            if not line:
                continue
            utt_id = line.split()[0]
            spk_id = "global"  # no speaker labels, so use one dummy speaker
            words = " ".join(line.split()[1:])
            text_f.write(utt_id + " " + words + "\n")
            wav_scp_f.write(utt_id + " " + os.path.join(root, x, utt_id + ".wav") + "\n")
            utt2spk_f.write(utt_id + " " + spk_id + "\n")
```
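To make the expected Kaldi-format output concrete, here is a small illustration (with a hypothetical utterance ID `TA0001` and a made-up transcript) of the three entries the script writes for one line of `text.txt`:

```python
import os

def kaldi_entries(root, split, line):
    """Turn one transcript line ("<utt_id> <words...>") into the three
    Kaldi-format entries written by data_prep.py."""
    parts = line.strip().split()
    utt_id, words = parts[0], " ".join(parts[1:])
    text = f"{utt_id} {words}"
    wav_scp = f"{utt_id} {os.path.join(root, split, utt_id + '.wav')}"
    utt2spk = f"{utt_id} global"  # single dummy speaker "global"
    return text, wav_scp, utt2spk

# Hypothetical transcript line from downloads/train/text.txt
text, wav_scp, utt2spk = kaldi_entries("downloads", "train", "TA0001 li2 ho2")
print(text)     # TA0001 li2 ho2
print(wav_scp)  # TA0001 downloads/train/TA0001.wav
print(utt2spk)  # TA0001 global
```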
### Run Task
All the files under ```egs2/taiwanese/asr1```:
```
1. conf
* train_asr_transformer.yaml
* decode_asr_transformer.yaml
* train_lm_transformer.yaml # if use_lm is true
2. local
* data.sh
* data_prep.py
3. steps
4. utils
5. pyscripts
6. scripts
7. cmd.sh
8. path.sh
9. asr.sh
10. db.sh
11. downloads
12. run.sh
```
If all the files are in the folder, we can run the task with ```run.sh```.
(Task #3 should be run with `run_whisper_finetune.sh`.)
```
#!/usr/bin/env bash
set -e
set -u
set -o pipefail
train_set="train_nodev"
valid_set="train_dev"
test_sets="test"
asr_config=conf/tuning/train_asr_conformer7_wavlm_base.yaml
inference_config=conf/decode_asr_transformer.yaml
use_lm=false
use_wordlm=false
speed_perturb_factors="0.9 1.0 1.1"
./asr.sh \
    --nj 32 \
    --inference_nj 32 \
    --ngpu 1 \
    --lang zh \
    --audio_format wav \
    --feats_type raw \
    --token_type char \
    --use_lm "${use_lm}" \
    --use_word_lm "${use_wordlm}" \
    --asr_config "${asr_config}" \
    --inference_config "${inference_config}" \
    --feats_normalize utt_mvn \
    --train_set "${train_set}" \
    --valid_set "${valid_set}" \
    --test_sets "${test_sets}" \
    --speed_perturb_factors "${speed_perturb_factors}" \
    --asr_speech_fold_length 512 \
    --asr_text_fold_length 150 \
    --lm_fold_length 150 \
    --lm_train_text "data/${train_set}/text" "$@"
```
---
The test utterance IDs also need to be added under `downloads/test/text.txt`.

## Data Augmentation
During the experiments, we found that the original training set was probably too small, causing larger models (e.g. Whisper) to overfit.
**My solution** -> **Data Augmentation**: add roughly 7 hours of data from [台灣媠聲 2.0](https://suisiann-dataset.ithuan.tw/) and about 5 hours from [Common Voice Corpus 17.0](https://commonvoice.mozilla.org/nan-tw/datasets) (**using only validated clips with no down_votes**) to the training set. [臺灣言語工具](https://i3thuan5.github.io/tai5-uan5_gian5-gi2_kang1-ku7/index.html) is needed to normalize the text data.
-> **Caution: this may cause data leakage, since some test-set utterances may come from 台灣媠聲 2.0.**
> * Normalization: convert the Tâi-lô romanization into the Taiwanese Language Phonetic Alphabet (TLPA) and remove the tones
> Before:
> 
> After:
> 
> * To match the training setup, convert the `.wav` audio files to 16 kHz with sox
> data split (3119 -> 14528):
> * train -> 10528
> * validation -> 4000
> * test -> 346
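
The tone-removal part of the normalization can be sketched as follows. This is a minimal illustration that assumes tones are written as trailing digits on each syllable; the actual processing in this assignment was done with 臺灣言語工具:

```python
import re

def remove_tones(romanized: str) -> str:
    """Strip trailing tone digits from each syllable,
    e.g. "Tai5-uan5" -> "tai-uan". Minimal sketch only."""
    return re.sub(r"(?<=[a-z])[0-9]", "", romanized.lower())

print(remove_tones("Tai5-uan5 sui2-siann1"))  # tai-uan sui-siann
```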
**Result** -> after adding the extra datasets, the overfitting problem indeed improved!
A comparison of the results before and after augmentation:
---
## Task #1: ESPnet Transformer
**Under 50 epochs**
| Accuracy | Loss |
| -------- | -------- |
||
| CER | WER |
| -------- | -------- |
||
**-> Result (WER): 0.31405**
:::info
### After Augmentation
| Accuracy | Loss |
| -------- | -------- |
||
| CER | WER |
| -------- | -------- |
||
**-> Result (WER): 0.23015**
:::
---
## Task #2: Combine ESPnet with S3PRL
### WavLM_base_plus

**Under 50 epochs**
| Accuracy | Loss |
| -------- | -------- |
||
| CER | WER |
| -------- | -------- |
||
**-> Result (WER): 0.31559**
:::info
### After Augmentation
| Accuracy | Loss |
| -------- | -------- |
||
| CER | WER |
| -------- | -------- |
||
**-> Result (WER): 0.21968**
:::
---
## Task #3: OpenAI Whisper Finetuning

### Whisper_small
**Under 15 epochs**
| Accuracy | Loss |
| -------- | -------- |
||
| CER | WER |
| -------- | -------- |
||
**-> Result (WER): 0.99046**
:::info
### After Augmentation
|Accuracy | Loss|
|------------ | -------------|
|||
|CER|WER|
| -------- | -------- |
||
:::
**-> Result (WER): 0.14267**
---
## Best Result
### whisper_small under 35 epochs (Best)
|Accuracy | Loss|
|------------ | -------------|
||
|CER|WER|
| -------- | -------- |
||

## Takeaways
1. If the dataset is not large enough, larger models may suffer from **overfitting**.
2. If the validation set is too small, it may also affect the model's results.
3. As the models grow across tasks, training time keeps increasing (**Task 3 took the longest**).
4. With a self-supervised model as the frontend, WavLM_base_plus outperformed xls_r_300m and the other models, possibly because WavLM was trained with a denoising task.
5. Across the three tasks, **whisper_small performed best on Taiwanese ASR**.
6. whisper_small configuration:
* batch: 100000
* epoch: 35
* lr: 5.0e-04
* grad_clip: 1.0
* ...