# kaldi Q&A

[toc]

###### tags: `kaldi` `Q&A` `ASR`

## 1. Which operations during decoding, or on lattices, involve the language model?

>In my experience, what a Kaldi acoustic model emits is, just as with an ordinary neural network, a vector with the same dimensionality as the output layer. So I went looking online for some easier-to-follow explanations:
>[reference](https://shiweipku.gitbooks.io/chinese-doc-of-kaldi/content/lattice.html) This site also covers many other lattice operations we may run into later; worth keeping for reference.

* A lattice already contains the language-model information, so the decoding results we produce with the latgen family of binaries already include the language-model contribution.
```
A lattice is an FST whose weights contain two floating-point values (the graph cost and the acoustic cost).
Its input symbols are transition-ids (roughly, context-dependent HMM states) and its output symbols are words
(generally speaking, they correspond to the output symbols of the decoding graph).
```
* There is also a passage about LM rescoring:
```
Because the "graph part" (the first component) of the LatticeWeight contains the language-model score mixed
together with the transition-model score and any pronunciation or silence probabilities, we cannot simply
replace it with the new LM score, or we would lose the latter two. Instead, we first subtract the old LM
scores and then add the new LM scores. The core operation in both phases is composition
(plus some scaling of weights, and so on).
```
* This may not quite match the scenario we care about, so I searched for a thread where someone asked Dan Povey directly. Below are Dan's reply and some suggestions from other users:
[reference](https://sourceforge.net/p/kaldi/discussion/1355348/thread/52ec0caf/?limit=50)
```
Q: I would like to know how I can do the decoding based only on the acoustic models, where there is no effect of language models?

Dan: You could set a large acoustic scale in decoding, e.g. --acoustic-scale=100, but you'd have to increase the beam accordingly (make it 10 times larger than the default).

User 1: You can build a 0-gram LM - basically a word list as G (grammar) as part of the HCLG decoding graph. Probably the easiest way is to create an arpa model and convert it to G.fst.

User 2: If you don't need the transition costs you can remove the costs from the search graph and use the unweighted machine during decoding:
fstmap --map_type=rmweight HCLG.fst > HCLG.no.weights.fst
```

## 2. What do H, C, L and G each stand for?

[reference](https://blog.csdn.net/qq_36782366/article/details/102847110) This reference is fairly easy to follow and explains the lattice format and operations clearly.
```
G: the language-model WFST. Its input and output symbols are identical, so it is really a WFSA (an acceptor);
   it is treated as a WFST with equal input/output symbols so that it can be composed with the other three WFSTs.
L: the pronunciation-lexicon WFST. Input symbols: monophones; output symbols: words.
C: the context-dependency WFST. Input symbols: triphones (context-dependent); output symbols: monophones.
H: the HMM acoustic-model WFST. Input symbols: HMM transition-ids; output symbols: triphones.

Composing the four layer by layer yields the final graph:
HCLG = asl(min(rds(det(H' o min(det(C o min(det(L o G))))))))
```
Here det = determinize, min = minimize, rds = remove disambiguation symbols, asl = add self-loops, and H' is H without the HMM self-loops.
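To make the formula concrete, here is a condensed sketch of what utils/mkgraph.sh runs for a standard triphone system (context size 3, central position 1). The lang/, tree and final.mdl paths are placeholders for your own directories; consult mkgraph.sh itself for the authoritative sequence:
```
# Sketch of the mkgraph.sh pipeline behind the formula above (paths are placeholders).
# LG = min(det(L o G)):
fsttablecompose lang/L_disambig.fst lang/G.fst | fstdeterminizestar --use-log=true \
  | fstminimizeencoded | fstpushspecial > LG.fst
# CLG = min(det(C o LG)); "ilabels" records the triphone-label mapping:
fstcomposecontext --context-size=3 --central-position=1 \
  --read-disambig-syms=lang/phones/disambig.int \
  --write-disambig-syms=disambig_ilabels.int ilabels LG.fst > CLG.fst
# Ha is the H transducer without self-loops (the H' in the formula):
make-h-transducer --disambig-syms-out=disambig_tid.int ilabels tree final.mdl > Ha.fst
# min(rds(det(Ha o CLG))): compose, determinize, remove disambig symbols, minimize:
fsttablecompose Ha.fst CLG.fst | fstdeterminizestar --use-log=true \
  | fstrmsymbols disambig_tid.int | fstrmepslocal | fstminimizeencoded > HCLGa.fst
# asl(...): finally add the HMM self-loops back:
add-self-loops --self-loop-scale=0.1 --reorder=true final.mdl < HCLGa.fst > HCLG.fst
```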
## 3. How are MFCCs obtained? What does Kaldi's high-resolution MFCC do?

[basic MFCC introduction](https://blog.csdn.net/Magical_Bubble/article/details/90295814), [kaldi MFCC pipeline](https://blog.csdn.net/robingao1994/article/details/80018415)

Summary from the two sites above: MFCC extraction first passes the signal through a bank of mel filters and then applies a DCT to obtain the cepstrum. Kaldi's high-resolution MFCC simply uses more mel filters than the default (--num-mel-bins=40) and keeps more cepstral dimensions (--num-ceps=40).

* Does it capture higher-frequency information?
It does: mfcc_hires.conf additionally sets --high-freq=-400, and a negative value is interpreted relative to the Nyquist frequency, i.e. the upper cutoff becomes Nyquist - 400 Hz (I am still not sure how this particular value was chosen).

## 4. Why does Kaldi train for fewer epochs than typical neural networks?

[Here](https://groups.google.com/g/kaldi-help/c/7OrqJI2Szvg/m/vk3P8qKWAwAJ) is an excerpt from a reply Dan gave to someone else; the gist is:

1. Kaldi uses frame-subsampling-factor=3 and trains separately on the three audio shifts -1, 0 and 1; only after all three passes does it count one epoch, so the data has in fact been seen nine times per "epoch".
2. The natural-gradient algorithm can use a larger learning rate than plain SGD for the same effect, so fewer epochs are needed to begin with.
3. At the end of training Kaldi averages and combines models; together with natural gradient this lets the model absorb more training information without being hurt by the added noise.
4. The lattices Kaldi starts from are produced by a GMM system, so the information learned in GMM training is also part of what the subsequent DNN learns from.

>To summarize Dan's view: after 5 Kaldi training epochs the data has already been seen about 50 times. However, attention-based models, or other data-augmentation training schemes (such as the later spec-augment layer), may still need more epochs.
```
- We actually count epochs *after* augmentation, and with a system that has frame-subsampling-factor of 3 we separately train on the data shifted by -1, 0 and 1 and count that all as one epoch. So for 3-fold augmentation and frame-subsampling-factor=3, each "epoch" actually ends up seeing the data 9 times.
- Kaldi uses natural gradient, which has better convergence properties than regular SGD and allows you to train with larger learning rates; this might allow you to reduce the num-epochs by at least a factor of 1.5 or 2 versus what you'd use with normal SGD.
- We do model averaging at the end-- averaging over the last few iterations of training (an iteration is an interval of usually a couple minutes' training time). This allows us to use relatively large learning rates at the end and not worry too much about the added noise, which further decreases the training time. This wouldn't work without the natural gradient; the natural gradient stops the model from moving too far in the more important directions within parameter space.
- We start with alignments learned from a GMM system, so the nnet doesn't have to do all the work of figuring out the alignments-- i.e. it's not training from a completely uninformed start.

So supposing we say we are using 5 epochs, we are really seeing the data more like 50 times, and if we didn't have those tricks (NG, model averaging) that might have to be more like 100 or 150 epochs, and without knowing the alignments, maybe 200 or 300 epochs. Also it's likely that attention-based models take longer to train than the more standard models that we use.
```

## 5. How do I get phone/word alignments from a Kaldi lattice?

[reference](https://groups.google.com/g/kaldi-help/c/RlH_noG1FRw)

sol1 (assembled by hand; see the sketch below):
&#8195;&#8195;linear-to-nbest | lattice-align-words | nbest-to-ctm
sol2 (native Kaldi):
&#8195;&#8195;lattice-arc-post (the ids in its output still have to be mapped back to words/phones)
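A runnable sketch of sol1, modeled on what steps/get_ctm.sh does; since we start from already-decoded lattices, lattice-1best takes the place of linear-to-nbest here. The exp/ and data/ paths are placeholders for your own model, lang and decode directories:
```
# Word-level CTM from the best path of each lattice
# (needs word_boundary.int, i.e. position-dependent phones; cf. Q8).
lattice-1best --acoustic-scale=0.1 "ark:gunzip -c exp/decode/lat.1.gz |" ark:- \
  | lattice-align-words data/lang/phones/word_boundary.int exp/final.mdl ark:- ark:- \
  | nbest-to-ctm ark:- - \
  | utils/int2sym.pl -f 5 data/lang/words.txt > word.ctm

# Phone-level CTM: align on phone boundaries instead of word boundaries.
lattice-1best --acoustic-scale=0.1 "ark:gunzip -c exp/decode/lat.1.gz |" ark:- \
  | lattice-align-phones --replace-output-symbols=true exp/final.mdl ark:- ark:- \
  | nbest-to-ctm ark:- - \
  | utils/int2sym.pl -f 5 data/lang/phones.txt > phone.ctm
```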
## 6. How does the acoustic-model output differ from the lattice-to-post output?

[reference](https://groups.google.com/g/kaldi-help/c/iBgeXN4diSY)

Not yet organized... the forward-backward result can perhaps be obtained directly with nnet3-am-compute.

## 7. How does Kaldi online decoding produce i-vectors?

[reference](https://www.funcwj.cn/2017/08/02/kaldi-online-decoder/#more)
https://sourceforge.net/p/kaldi/discussion/1355348/thread/4e0dc1d5/?limit=25

### 7.1 Follow-up: how do Kaldi online decoding and offline decoding differ?

Not yet surveyed.

### 7.2 Follow-up: how do make_mfcc_pitch_online.sh and make_mfcc_pitch.sh differ?

Not yet surveyed.

## 8. What is the word_boundary file?

[reference](https://groups.google.com/g/kaldi-help/c/ecmv0Mipwy0)
```
The word_boundary.int file is only created if you used position-dependent phones (it's an option to prepare_lang.sh). It's used in word alignment.
However, it's possible to do the word alignment by using lattice-align-words-lexicon, using align_lexicon.int. The make_index.sh script could be extended to support non-word-position-dependent phones, by adding a branch in the script (search for lattice-align-words-lexicon in the scripts to see examples). I just created an issue to track this. You or anyone else could extend the script in this way.
Dan
```
In short: a file that assists word alignment when position-dependent phones are in use.

## 9. Python hash problem during the cleanup step on large data sets

`Fatal Python error: _Py_HashRandomization_Init: failed to get random numbers to initialize Python`
※ only occurs with python3

Fix: set `export PYTHONHASHSEED=1` in path.sh to pin the hash seed. [ref](https://my.oschina.net/u/221/blog/3000969)

Cause: see https://docs.python.org/3.3/using/cmdline.html
- Kept for compatibility. On Python 3.3 and greater, hash randomization is turned on by default.
- On previous versions of Python, this option turns on hash randomization, so that the __hash__() values of str, bytes and datetime are "salted" with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.
- Hash randomization is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict construction, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.
- PYTHONHASHSEED allows you to set a fixed value for the hash seed secret.

## 10. validate data dir hits a nonprintable-character problem

※ fix_data_dir.sh does not help here

1. Use utils/validate_data_dir.sh --non-print \<data-dir\> to find which utterance is problematic. It is usually caused by unusual simplified-Chinese characters or special whitespace characters; fix them and validate again.
2. Some special whitespace characters are not caught by validate_data_dir.sh --non-print; they can be found with grep '[^[:print:]]' \<data-dir\>/text instead.
3. If you have confirmed the culprit is a backspace (\b) or a Windows line ending (\r), strip it with cat \<data-dir\>/text | tr -d '\b' > text.new (use '\r' in the tr call for the Windows case).

### Appendix 1: how to prepare Chinese pinyin?

- table of pinyin [github](https://github.com/kfcd/pinyin)
- table of pronounceable pinyin [github](https://github.com/kawanet/Lingua-ZH-Romanize-Pinyin/blob/master/cxterm/dict/big5/PY.tit)
- tables for Mandarin, Cantonese, Hokkien and Hakka [github](https://github.com/phlinhng/hwa-pinyin/tree/master/name)

## 11. OOV recovery (not yet organized)

subword lm: https://groups.google.com/g/kaldi-help/c/t483gvXiEIE/m/NSqdCHS0DAAJ
rnnlm: https://www.danielpovey.com/files/2020_icassp_oov_recovery.pdf
unk lm: https://github.com/kaldi-asr/kaldi/blob/master/egs/tedlium/s5_r2/local/run_unk_model.sh or utils/lang/make_unk_lm.sh (see the sketch below)
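For the unk-lm route, tedlium's run_unk_model.sh essentially boils down to the two calls below. This is a sketch only: the directory layout and the `<unk>` spelling are assumptions that must match your own dict and lang setup:
```
# Train a phone-level LM over the pronunciations in the lexicon;
# among other files this writes unk_fst.txt into the output directory.
utils/lang/make_unk_lm.sh data/local/dict exp/unk_lang_model

# Rebuild the lang directory so that <unk> expands into that phone-level LM
# rather than a single filler phone.
utils/prepare_lang.sh --unk-fst exp/unk_lang_model/unk_fst.txt \
  data/local/dict "<unk>" data/local/lang_tmp data/lang_unk
```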