# kaldi reading
Some Kaldi-related links and conceptual notes.
## basic knowledge
### history
- Speech recognition: early attempts -> probabilistic models (Frederick Jelinek) -> hybrid architecture (GMM-HMM) -> neural networks -> BLSTM+CTC
- Industry: Dragon Systems [Dragon Dictate] -> [Dragon NaturallySpeaking] -> Nuance Communications -> Google/Apple/Amazon/....
- Kaldi: early functionality was close to HTK, but it adds extensive linear-algebra libraries, integrates FST technology, and ships a large set of training scripts
### terms
- Feature analysis: waveform, frame shift, MFCC, PLP, smoothing, backoff
- End-to-end speech recognition: BLSTM, CTC, attention
- Signal processing: sampling, quantization, echo cancellation, noise reduction (spatial domain, frequency domain, beamforming), volume (absolute audio level) vs. gain (normalization applied because levels vary with microphone distance), decoding
- Language and pronunciation: phoneme, pronunciation dictionary (lexicon), tri-phone
- Performance evaluation: WER, CER, RTF (decoding time / audio duration), insertion, deletion, substitution
- Data/corpus examples: read speech, telephone recordings, broadcast, microphone arrays, multi-speaker, multilingual, language models, noise, music, question answering, keyword tasks, ...
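The evaluation terms above fit together as WER = (S + D + I) / N, where S, D, I are substitution, deletion, and insertion counts against a reference of N words. A minimal sketch via edit distance (illustration only, not Kaldi's own `compute-wer` tool):

```python
# Minimal WER computation via Levenshtein distance (illustration only).
# WER = (substitutions + deletions + insertions) / reference length.
def wer(ref, hyp):
    ref, hyp = ref.split(), hyp.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # match/substitution
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))      # 0.0
print(wer("the cat sat", "a cat sat down"))   # 1 sub + 1 ins over 3 words
```

CER is the same computation over characters instead of words.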
### classical acoustic modeling techniques
- monophone and transition model
- tri-phone and contextual acoustic model
- supervised/unsupervised training
- discriminative training [ref](https://blog.csdn.net/qq_35742630/article/details/89004890)
### graph making and decoding
### viterbi and token-passing
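A toy sketch of the token-passing view of Viterbi decoding (a hypothetical two-state HMM, not Kaldi's actual decoder): each state keeps only its best-scoring token, and tokens are propagated frame by frame, which is what a real decoder does over the HCLG graph with beam pruning added.

```python
import math

# Token-passing Viterbi over a tiny hypothetical HMM (illustration only).
# A "token" is (log_score, state_path); each state keeps its best token.
def viterbi(obs, states, log_init, log_trans, log_emit):
    tokens = {s: (log_init[s] + log_emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_tokens = {}
        for s in states:
            # pass every surviving token into state s, keep the best one
            score, prev = max(
                (tokens[p][0] + log_trans[p][s], p) for p in states
            )
            new_tokens[s] = (score + log_emit[s][o], tokens[prev][1] + [s])
        tokens = new_tokens
    return max(tokens.values())  # best (log_score, path)

log = math.log
states = ["sil", "speech"]
log_init = {"sil": log(0.8), "speech": log(0.2)}
log_trans = {"sil": {"sil": log(0.7), "speech": log(0.3)},
             "speech": {"sil": log(0.2), "speech": log(0.8)}}
log_emit = {"sil": {"lo": log(0.9), "hi": log(0.1)},
            "speech": {"lo": log(0.2), "hi": log(0.8)}}
score, path = viterbi(["lo", "hi", "hi"], states, log_init, log_trans, log_emit)
print(path)   # ['sil', 'speech', 'speech']
```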
### on-the-fly Composition
- difference between dynamic graph(HCLr.fst, Gr.fst) and static graph(HCLG.fst)
- **conceptual overview** **[ref](https://zhuanlan.zhihu.com/p/338163396)**
- **practical usage** **[ref](https://github.com/jpuigcerver/kaldi-decoders/issues/1)**
## data preparation
### file format
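The core files in a Kaldi data directory are plain text with a `<key> <value>` line format, sorted by key (paths and IDs below are hypothetical):

```
# wav.scp: <utterance-id> <wav path or pipe command>
utt001 /data/wav/utt001.wav
# text: <utterance-id> <transcript>
utt001 HELLO WORLD
# utt2spk: <utterance-id> <speaker-id>
utt001 spk01
# spk2utt (generated from utt2spk by utils/utt2spk_to_spk2utt.pl)
spk01 utt001
```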
### lexicon
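A `lexicon.txt` maps each word to its phone sequence, one pronunciation per line; a word may appear on multiple lines for pronunciation variants (entries below are illustrative):

```
!SIL sil
<UNK> spn
hello hh ah l ow
world w er l d
read r eh d
read r iy d
```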
### language model
### topo
### spec-augment
- specaugment-layer: built into Kaldi; reports online suggest it needs a few extra epochs before it clearly helps
- masking the features first, then substituting the full lattice: experiments at 賽微 showed gains without extra epochs, but 玉山 would first need to implement the lattice-change method
### hclg and fst [concepts / lattice operations](https://blog.csdn.net/qq_36782366/article/details/102847110)
## training argument
### TDNN layer descriptors ([sum and append](https://desh2608.github.io/2019-03-27-kaldi-tricks/#sum-append), [specaugment layer](https://github.com/kaldi-asr/kaldi/blob/master/egs/mini_librispeech/s5/local/chain/tuning/run_cnn_tdnn_1b.sh))
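The `Append`/`Sum` descriptors appear in xconfig network definitions along these lines (a sketch in the style of the mini_librispeech recipes; layer names and dims are illustrative):

```
# Append splices the layer's input at the given time offsets into one vector;
# Sum adds the outputs of two descriptors elementwise.
relu-batchnorm-layer name=tdnn1 input=Append(-1,0,1) dim=512
relu-batchnorm-layer name=tdnn2 input=Append(-1,0,1) dim=512
relu-batchnorm-layer name=tdnn3 input=Sum(tdnn1, tdnn2) dim=512
```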
### change-lattice effect [related paper: impact on far-field speech](https://www.danielpovey.com/files/2016_interspeech_ami.pdf)
## metadata processing
### scp, ark
- scp: script (list) file; ark: archive file; when writing both (`ark,scp:...`), the ark must come first
- b: binary, t: text
- f/nf: flush / no flush; p/np: permissive / non-permissive (whether a missing entry raises an error); s/ns: sorted / unsorted table; cs/ncs: accessed in sorted order / not (use LC_COLLATE=C via `export LC_ALL=C`)
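Each `.scp` line is `<key> <extended-filename>`, where the extended filename is typically an ark path plus a byte offset, e.g. `raw_mfcc.1.ark:17`. A minimal parser sketch (illustration only; real I/O goes through Kaldi's Table code or a binding such as kaldi_io):

```python
# Parse a Kaldi .scp line of the form "<key> <ark-path>:<byte-offset>".
# Lines without an offset (e.g. plain wav paths in wav.scp) are also handled.
def parse_scp_line(line):
    key, rxfilename = line.strip().split(None, 1)
    if ":" in rxfilename:
        path, offset = rxfilename.rsplit(":", 1)
        return key, path, int(offset)
    return key, rxfilename, None

print(parse_scp_line("utt001 raw_mfcc.1.ark:17"))
# ('utt001', 'raw_mfcc.1.ark', 17)
```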
### lattice [lattice concepts/operations](https://shiweipku.gitbooks.io/chinese-doc-of-kaldi/content/lattice.html), [removing the LM contribution when decoding](https://sourceforge.net/p/kaldi/discussion/1355348/thread/52ec0caf/?limit=50)
### alignment
### ctm
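A CTM file lists one recognized word per line with its timing, typically produced from a lattice or alignment (values below are hypothetical):

```
# <utterance-or-recording-id> <channel> <start-sec> <duration-sec> <word> [confidence]
utt001 1 0.32 0.18 hello 0.97
utt001 1 0.50 0.25 world 0.92
```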
## other settings
- [kaldi epoch](https://groups.google.com/g/kaldi-help/c/7OrqJI2Szvg/m/vk3P8qKWAwAJ)
- mfcc (note 1): [mfcc](https://blog.csdn.net/Magical_Bubble/article/details/90295814), [kaldi-mfcc-hires](https://blog.csdn.net/robingao1994/article/details/80018415)
``mfcc: MFCC first passes the signal through a set of filters, then applies a
DCT to obtain the cepstrum. Kaldi's high-resolution MFCC simply uses more
filters than the default (--num-mel-bins=40) and keeps more cepstral
dimensions (--num-ceps=40). Does it capture information at higher frequencies?
Yes, because mfcc_hires.conf additionally sets --high-freq=-400
(how this value should be chosen is still unclear to me)``
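The note above corresponds to a config along these lines (values follow typical `mfcc_hires.conf` recipes; treat as a sketch and check your own recipe):

```
--use-energy=false
--sample-frequency=16000
--num-mel-bins=40     # more mel filters than the default 23
--num-ceps=40         # keep more cepstral coefficients than the default 13
--low-freq=20
--high-freq=-400      # negative means Nyquist minus 400 Hz (7600 Hz at 16 kHz)
```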
- back propagation
## related topic
### keyword spotting and keyword searching
### speaker identification and speaker diarization
- ivector
- xvector
### voice activity detection
### srilm for language model
- [reference commands from 陳柏琳 (Berlin Chen)'s lab, 2009 version](http://berlin.csie.ntnu.edu.tw/Courses/Speech%20Recognition/Lectures2009/SP2009F_Lecture11_SRILM%20Tutorial.pdf)
- [reference commands from 陳柏琳 (Berlin Chen)'s lab, 2004 version](http://berlin.csie.ntnu.edu.tw/Courses/2004F-SpeechRecognition/Slides/SP2004F_Lecture06-02_SRILM%20Toolkit.pdf)
- [srilm ppl scripts](http://www.speech.sri.com/projects/srilm/manpages/ppl-scripts.1.html)
- [lm training and concept](http://fancyerii.github.io/dev287x/lm/)
- [official doc](http://www.speech.sri.com/projects/srilm/manpages/ngram.1.html)
- [build-big-lm](https://joshua.incubator.apache.org/6.0/large-lms.html) Building large LMs with SRILM
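A typical SRILM workflow using the tools covered in the links above (a command sketch using standard `ngram-count`/`ngram` options, assuming SRILM is on PATH):

```shell
# Train a trigram LM with modified Kneser-Ney smoothing and interpolation
ngram-count -order 3 -text train.txt -lm lm.arpa -kndiscount -interpolate

# Evaluate perplexity on held-out text
ngram -order 3 -lm lm.arpa -ppl heldout.txt

# Prune the LM to shrink it for decoding
ngram -lm lm.arpa -prune 1e-8 -write-lm lm_pruned.arpa
```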
### others
- [ESPnet](https://espnet.github.io/espnet/notebook/asr_cli.html)
- k2 (presented by Dan Povey at MIDC): [overview](https://zhuanlan.zhihu.com/p/284008844), some details in this [PPT](https://codingnote.cc/zh-tw/p/183186/)
## FAQ
### [kaldi Q&A](/MSoPeszmRvOTrcZBsAV24Q)
### kaldi clean up
- discount 0.3 -> 0.9: leans toward trusting the AM more than the LM
- soft count (1-best)
## installation
- download from git
- install cuda
- tools/extras/check_dependencies.sh
- compile with g++, Apple LLVM, or Clang
- libraries: zlib, Intel MKL, OpenBLAS, ATLAS
- libraries supporting compilation: libtool, automake, autoconf, patch, bzip2, gzip, wget, subversion
- tools for scripts: python, gawk, perl
- tools: OpenFst, CUB, Sclite, Sph2pipe, IRSTLM/SRILM(lbfgs)/Kaldi_lm, OpenBLAS/MKL, CLAPACK
- compile Kaldi (src/configure)
- make->make test->make valgrind->make cudavalgrind->make clean->make depend->make
- set parallel environment
- check for NFS/NIS, or use SGE or SLURM for larger systems
- applications:
- vosk: [doc](https://alphacephei.com/vosk) [server](https://github.com/alphacep/vosk-server) [api](https://github.com/alphacep/vosk-api)
- nvidia accelerated version: [doc](https://developer.nvidia.com/blog/gpu-accelerated-speech-to-text-with-kaldi-a-tutorial-on-getting-started/)
- audacity: [official](https://www.audacityteam.org/)
###### tags: `kaldi` `ASR` `concept` `general`