# HW01 Taiwanese-ASR
Model training for this assignment was performed on a **GeForce RTX 2080 Ti**.
## Environment Setup
The environment is built following the [2022 Tutorial at CMU](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_recipe_tutorial_CMU_11751_18781_Fall2022.ipynb#scrollTo=S85-3X82kWbm).
### Install packages for espnet
```
sudo apt-get install cmake
sudo apt-get install sox
sudo apt-get install libsndfile1-dev
sudo apt-get install ffmpeg
sudo apt-get install flac
git clone https://github.com/espnet/espnet # Download ESPnet
```
### Build conda environment
```
conda create --name espnet python=3.9.16
conda activate espnet
```
### Setup conda environment
```
cd espnet/tools
CONDA_TOOLS_DIR=/home/s112065522/miniconda3 # Change it to your actual path
./setup_anaconda.sh ${CONDA_TOOLS_DIR} espnet 3.9.16
make TH_VERSION=1.12.1 CUDA_VERSION=11.3 -j32 # Install ESPnet
```
After that, install the modules you need. For example, to use SSLRs, install **s3prl** with ```. ./activate_python.sh && ./installers/install_s3prl.sh```.
### Create a new recipe
```
cd espnet/egs2
./TEMPLATE/asr1/setup.sh ./taiwanese/asr1
```
---
## Dataset preprocessing
To run a new dataset, we need to create a new folder (```mkdir downloads```) under `espnet/egs2/taiwanese/asr1` and move the Kaggle dataset into it.
After that, we also need to specify the absolute path to the dataset in `db.sh` as follows:
```
echo "" >> db.sh
echo "TAIWANESE=/espnet/egs2/taiwanese/asr1/downloads" >> db.sh
```
### Setup the script for data preparation
1. Create a file `local/data.sh`.
```
#!/usr/bin/env bash
set -e
set -u
set -o pipefail

log() {
    local fname=${BASH_SOURCE[1]##*/}
    echo -e "$(date '+%Y-%m-%dT%H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}
SECONDS=0

stage=1
stop_stage=100

log "$0 $*"
. utils/parse_options.sh

. ./db.sh
. ./path.sh
. ./cmd.sh

if [ $# -ne 0 ]; then
    log "Error: No positional arguments are required."
    exit 2
fi

if [ -z "${TAIWANESE}" ]; then
    log "Fill the value of 'TAIWANESE' of db.sh"
    exit 1
fi

train_set="train_nodev"
train_dev="train_dev"
test_set="test"
ndev_utt=2000  # validation set size

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    log "stage 1: Data preparation"
    mkdir -p data/{train,test}

    # Prepare data in the Kaldi format, including three files:
    # text, wav.scp, utt2spk
    python3 local/data_prep.py ${TAIWANESE} sph2pipe

    for x in train test; do
        for f in text wav.scp utt2spk; do
            sort data/${x}/${f} -o data/${x}/${f}
        done
        utils/utt2spk_to_spk2utt.pl data/${x}/utt2spk > "data/${x}/spk2utt"
    done

    # Make a dev set: first ${ndev_utt} utterances -> dev, the rest -> train_nodev
    utils/subset_data_dir.sh --first data/train "${ndev_utt}" "data/${train_dev}"
    n=$(($(wc -l < data/train/text) - ndev_utt))
    utils/subset_data_dir.sh --last data/train "${n}" "data/${train_set}"
fi

log "Successfully finished. [elapsed=${SECONDS}s]"
```
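The dev split at the end of stage 1 takes the first `ndev_utt` utterances as the validation set and leaves the rest as `train_nodev`. The logic of the two `subset_data_dir.sh` calls can be sketched as (a toy illustration with made-up utterance IDs):

```python
def split_dev(utt_ids, ndev_utt=2000):
    """Mirror of the two subset_data_dir.sh calls:
    --first keeps the first ndev_utt utterances (the dev set),
    --last keeps the remaining n = total - ndev_utt (train_nodev)."""
    train_dev = utt_ids[:ndev_utt]
    train_nodev = utt_ids[ndev_utt:]
    return train_dev, train_nodev

# Toy example with 10 hypothetical utterance IDs
ids = [f"utt{i:04d}" for i in range(10)]
dev, train = split_dev(ids, ndev_utt=3)
print(len(dev), len(train))  # 3 7
```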
2. Create a Python script `local/data_prep.py`, which is used by the shell script above.
```
import os
import sys

if len(sys.argv) != 3:
    print("Usage: python data_prep.py [root] [sph2pipe]")
    sys.exit(1)

root = sys.argv[1]
sph2pipe = sys.argv[2]  # kept for interface compatibility; not used here

for x in ["train", "test"]:
    with open(os.path.join(root, x, "text.txt")) as transcript_f, \
         open(os.path.join("data", x, "text"), "w") as text_f, \
         open(os.path.join("data", x, "wav.scp"), "w") as wav_scp_f, \
         open(os.path.join("data", x, "utt2spk"), "w") as utt2spk_f:
        # Skip the header line
        next(transcript_f)
        for line in transcript_f:
            line = line.strip()
            if not line:
                continue
            utt_id = line.split()[0]
            spk_id = "global"  # no speaker labels, so use one dummy speaker
            words = " ".join(line.split()[1:])
            text_f.write(utt_id + " " + words + "\n")
            wav_scp_f.write(utt_id + " " + os.path.join(root, x, utt_id + ".wav") + "\n")
            utt2spk_f.write(utt_id + " " + spk_id + "\n")
```
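To make the expected Kaldi-format output concrete, here is a small illustration (with a hypothetical utterance ID `TA0001` and a made-up transcript) of the three entries the script writes for one line of `text.txt`:

```python
import os

def kaldi_entries(root, split, line):
    """Turn one transcript line ("<utt_id> <words...>") into the three
    Kaldi-format entries written by data_prep.py."""
    parts = line.strip().split()
    utt_id, words = parts[0], " ".join(parts[1:])
    text = f"{utt_id} {words}"
    wav_scp = f"{utt_id} {os.path.join(root, split, utt_id + '.wav')}"
    utt2spk = f"{utt_id} global"  # single dummy speaker "global"
    return text, wav_scp, utt2spk

# Hypothetical transcript line from downloads/train/text.txt
text, wav_scp, utt2spk = kaldi_entries("downloads", "train", "TA0001 li2 ho2")
print(text)     # TA0001 li2 ho2
print(wav_scp)  # TA0001 downloads/train/TA0001.wav
print(utt2spk)  # TA0001 global
```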
### Run Task
All the files under ```egs2/taiwanese/asr1```:
```
1. conf
* train_asr_transformer.yaml
* decode_asr_transformer.yaml
* train_lm_transformer.yaml # if use_lm is true
2. local
* data.sh
* data_prep.py
3. steps
4. utils
5. pyscripts
6. scripts
7. cmd.sh
8. path.sh
9. asr.sh
10. db.sh
11. downloads
12. run.sh
```
If all the files are in the folder, we can run the task with ```run.sh```.
(Task #3 should be run with `run_whisper_finetune.sh`.)
```
#!/usr/bin/env bash
set -e
set -u
set -o pipefail
train_set="train_nodev"
valid_set="train_dev"
test_sets="test"
asr_config=conf/tuning/train_asr_conformer7_wavlm_base.yaml
inference_config=conf/decode_asr_transformer.yaml
use_lm=false
use_wordlm=false
speed_perturb_factors="0.9 1.0 1.1"
./asr.sh \
    --nj 32 \
    --inference_nj 32 \
    --ngpu 1 \
    --lang zh \
    --audio_format wav \
    --feats_type raw \
    --token_type char \
    --use_lm "${use_lm}" \
    --use_word_lm "${use_wordlm}" \
    --asr_config "${asr_config}" \
    --inference_config "${inference_config}" \
    --feats_normalize utt_mvn \
    --train_set "${train_set}" \
    --valid_set "${valid_set}" \
    --test_sets "${test_sets}" \
    --speed_perturb_factors "${speed_perturb_factors}" \
    --asr_speech_fold_length 512 \
    --asr_text_fold_length 150 \
    --lm_fold_length 150 \
    --lm_train_text "data/${train_set}/text" "$@"
```
---
The test utterance IDs also need to be added under `downloads/test/text.txt`.

## Data Augmentation
During the experiments, we found that the original training set was probably too small, causing larger models (e.g. Whisper) to overfit.
**My solution** -> **Data Augmentation**: add roughly 7 hours of data from [台灣媠聲 2.0](https://suisiann-dataset.ithuan.tw/) and about 5 hours from [Common Voice Corpus 17.0](https://commonvoice.mozilla.org/nan-tw/datasets) (**using only validated clips with no down_votes**) to the training set. [臺灣言語工具](https://i3thuan5.github.io/tai5-uan5_gian5-gi2_kang1-ku7/index.html) is needed to normalize the text data.
-> **Caution: this may cause data leakage, since some test-set utterances may come from 台灣媠聲 2.0.**
> * Normalization: convert the Tâi-lô romanization into the Taiwanese Language Phonetic Alphabet (TLPA) and remove the tones
> Before:
> 
> After:
> 
> * To match the training setup, convert the `.wav` audio files to 16 kHz with sox
> data split (3119 -> 14528):
> * train -> 10528
> * validation -> 4000
> * test -> 346
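
The tone-removal part of the normalization can be sketched as follows. This is a minimal illustration that assumes tones are written as trailing digits on each syllable; the actual processing in this assignment was done with 臺灣言語工具:

```python
import re

def remove_tones(romanized: str) -> str:
    """Strip trailing tone digits from each syllable,
    e.g. "Tai5-uan5" -> "tai-uan". Minimal sketch only."""
    return re.sub(r"(?<=[a-z])[0-9]", "", romanized.lower())

print(remove_tones("Tai5-uan5 sui2-siann1"))  # tai-uan sui-siann
```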
**Result** -> after adding the extra datasets, the overfitting problem indeed improved!
A comparison of the results before and after augmentation:
---
## Task #1: ESPnet Transformer
**Under 50 epochs**
| Accuracy | Loss |
| -------- | -------- |
||
| CER | WER |
| -------- | -------- |
||
**-> Result (WER): 0.31405**
:::info
### After Augmentation
| Accuracy | Loss |
| -------- | -------- |
||
| CER | WER |
| -------- | -------- |
||
**-> Result (WER): 0.23015**
:::
---
## Task #2: Combine ESPnet with S3PRL
### WavLM_base_plus

**Under 50 epochs**
| Accuracy | Loss |
| -------- | -------- |
||
| CER | WER |
| -------- | -------- |
||
**-> Result (WER): 0.31559**
:::info
### After Augmentation
| Accuracy | Loss |
| -------- | -------- |
||
| CER | WER |
| -------- | -------- |
||
**-> Result (WER): 0.21968**
:::
---
## Task #3: OpenAI Whisper Finetuning

### Whisper_small
**Under 15 epochs**
| Accuracy | Loss |
| -------- | -------- |
||
| CER | WER |
| -------- | -------- |
||
**-> Result (WER): 0.99046**
:::info
### After Augmentation
|Accuracy | Loss|
|------------ | -------------|
|||
|CER|WER|
| -------- | -------- |
||
:::
**-> Result (WER): 0.14267**
---
## Best Result
### whisper_small under 35 epochs (Best)
|Accuracy | Loss|
|------------ | -------------|
||
|CER|WER|
| -------- | -------- |
||

## Takeaways
1. If the dataset is not large enough, larger models may suffer from **overfitting**.
2. If the validation set is too small, it may also affect the model's results.
3. As the models grow across tasks, training time keeps increasing (**Task 3 took the longest**).
4. With a self-supervised model as the frontend, WavLM_base_plus outperformed xls_r_300m and the other models, possibly because WavLM was trained with a denoising task.
5. Across the three tasks, **whisper_small performed best on Taiwanese ASR**.
6. whisper_small configuration:
* batch: 100000
* epoch: 35
* lr: 5.0e-04
* grad_clip: 1.0
* ...