# Training Tips

## Corpus Splitting

Data size | Small corpus | Large corpus
--|--|--
training | 80.00% |
dev | 10.00% | 15–30 minutes or 1000 sentences
testing | 10.00% | 15–30 minutes or 1000 sentences

> [name=ricer] Keep the speakers in train, dev, and testing as disjoint as possible.
> dev/test: at most 30 minutes, but at least 1000 sentences, so that the decimal places of the error rate are meaningful.

##### [ Best Data Splitting Strategies ](https://groups.google.com/d/msg/kaldi-help/_0PLIoQWXVc/FiIU9ltABwAJ)

If you have 1000 hours, dev and test sets of 4 hours each are recommended.

## Training Workflow

> [name=ricer] Work incrementally as much as possible; things can suddenly break.

1. Train a separate model on each corpus, run the test, and check whether any corpus performs noticeably worse. If so, analyze that corpus.
2. Take the strongest corpus as the base, add one corpus at a time, and analyze the results.

##### [Using custom data/lang](https://groups.google.com/d/msg/kaldi-help/fO8KHjk27uA/vah784IZAQAJ)

A useful reference for how they describe their experimental setup.

### Diagnosing Recognition Problems

First determine whether the problem lies in the acoustic model, so start by removing the influence of the language model.

#### Experiment 1

For Mandarin, an `inside 3 gram 語言模型` (inside 3-gram language model) paired with `outside` audio should normally give a `Character error rate` of roughly 5%.

#### Experiment 2

Build a free-syllable language model from tone-stripped syllables and check the recognition result; accuracy should land somewhere around 20–30%.

## Language Model Analysis

##### Perplexity

> [name=呂] The exact numbers depend on the number of *distinct* linguistic units in the text corpus used.

```
Unit     |Ngram| Perplexity
---------|-----| ----------
文 (Ch)  | 0g  | 3426
文 (Ch)  | 7g  | 37
音 (Ts)  | 0g  | 2441
音 (Ts)  | 7g  | 45
```

The results above come from TwESC01 (140 read-aloud articles), roughly 100,000 characters. 3426 is exactly the number of distinct Han-Lo characters among those 100,000 characters; 2441 is exactly the number of distinct toned syllables.

##### Case 1

Testing perplexity gives this result:

```
$ ngram -lm data/local/lm/語言模型.lm -ppl text_tshi3.txt
data/local/lm/語言模型.lm: line 7: warning: non-zero probability for <unk> in closed-vocabulary LM
file text_tshi3.txt: 277 sentences, 1135 words, 0 OOVs
0 zeroprobs, logprob= -5321.24 ppl= 5869.24 ppl1= 48788.2
```

English `ppl` is typically around `100`, and Mandarin around `200`; Taiwanese at `5000` is far too high.

### Training Parameters: jobs and thread Settings

* https://groups.google.com/forum/#!msg/kaldi-help/kTDwa48u2PM/M5qMYGPVCwAJ

### Debugging Script Errors

* data/lang/G.fst is not ilabel sorted
  * https://sourceforge.net/p/kaldi/discussion/1355348/thread/c99fe7a6/?limit=25
* Audio files that are too short
  * https://sourceforge.net/p/kaldi/discussion/1355348/thread/476965d5/
* Updating the lexicon.txt dictionary
  * https://sourceforge.net/p/kaldi/discussion/1355347/thread/33098413/?page=0

##### [ General Advice for Small Datasets ](https://groups.google.com/d/msg/kaldi-help/09OUK1-grT8/U5jugNUbBQAJ)

mini_librispeech (5 hours training) and heroico (10 hours training).
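As a sanity check on SRILM output like the above, `ppl` and `ppl1` can be recomputed from the reported log10 probability: `ppl` divides by sentences plus words (counting the end-of-sentence tokens), while `ppl1` divides by words only. A minimal sketch using the numbers reported above:

```python
import math  # only 10 ** x is needed; math shown for clarity

# Numbers reported by `ngram -ppl` in the Case 1 output above.
sentences, words, oovs = 277, 1135, 0
logprob = -5321.24  # total log10 probability of the text

# ppl includes the </s> tokens in the denominator; ppl1 does not.
ppl = 10 ** (-logprob / (sentences + words - oovs))
ppl1 = 10 ** (-logprob / (words - oovs))

print(ppl, ppl1)  # close to the reported 5869.24 and 48788.2
```

The tiny discrepancy from the reported values comes only from `logprob` being printed with two decimal places.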
Right now the recipes use resnet-style factorized TDNN.

##### [Please help, something was wrong when decoding TDNN chain model](https://groups.google.com/d/msg/kaldi-help/m34wm8hoILU/eVUWItVTAgAJ)

- With a few thousand sentences of training data, you can refer to the mini_librispeech TDNN setup.

##### [ What is the best percentage for each training data and test data from whole data? ](https://groups.google.com/d/msg/kaldi-help/PKnkVQu_iOE/GMbKdbwKBAAJ)

About 3 hours of test data is recommended for more stable results; keep noise < 0.2 if possible.

##### [ Error increase when perplexity decrease ](https://groups.google.com/d/msg/kaldi-help/8pmAx3PlANc/YIFPupnCBwAJ)

https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/nnet3/report/generate_plots.py `--is-chain`

##### [](https://groups.google.com/d/msg/kaldi-help/LFxfyxNqNAo/X8oKFONuBQAJ)

With 4 hours of data you can use `SpecAugment`.
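SpecAugment itself is just masking applied to the input features. A minimal NumPy sketch of the idea, for intuition only (the function name and default widths here are illustrative, not Kaldi's actual SpecAugment component or its options):

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_width=10,
                 num_time_masks=2, time_width=50, rng=None):
    """Frequency and time masking on a (time, freq) log-mel spectrogram.

    Illustrative sketch: each mask picks a random band and overwrites it
    with the global mean. Widths are sampled in [1, max_width].
    """
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()  # do not modify the caller's array
    n_frames, n_bins = spec.shape
    fill = spec.mean()
    for _ in range(num_freq_masks):       # mask random frequency bands
        w = int(rng.integers(1, freq_width + 1))
        f0 = int(rng.integers(0, max(1, n_bins - w + 1)))
        spec[:, f0:f0 + w] = fill
    for _ in range(num_time_masks):       # mask random time spans
        w = int(rng.integers(1, time_width + 1))
        t0 = int(rng.integers(0, max(1, n_frames - w + 1)))
        spec[t0:t0 + w, :] = fill
    return spec

# Usage: augment a fake 200-frame, 80-bin spectrogram.
spec = np.random.default_rng(0).normal(size=(200, 80))
augmented = spec_augment(spec, rng=np.random.default_rng(1))
```

On small datasets this kind of masking acts as cheap regularization, which is why it is suggested for a 4-hour corpus.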