# kaldi reading

Some Kaldi-related links and conceptual notes.

## basic knowledge

### history
- Speech recognition: early attempts -> probabilistic models (Frederick Jelinek) -> hybrid architectures (GMM-HMM) -> neural networks -> BLSTM + CTC
- Industry: Dragon Systems [DragonDictate] -> [Dragon NaturallySpeaking] -> Nuance Communications -> Google/Apple/Amazon/...
- Kaldi: early functionality was close to HTK, but it adds extensive linear-algebra libraries, integrates FST techniques, and ships a large set of training recipes

### terms
- Feature analysis: waveform, frame shift, MFCC, PLP, smoothing, backoff
- End-to-end speech recognition: BLSTM, CTC, attention
- Signal processing: sampling, quantization, echo cancellation, noise reduction (spatial domain, frequency domain, beamforming), volume (raw signal level) vs. gain (normalizing levels that vary with microphone distance), decoding
- Language and pronunciation: phoneme, pronunciation dictionary (lexicon), triphone
- Evaluation: WER, CER, RTF (decoding time / audio duration), insertion, deletion, substitution
- Data examples: read speech, telephone recordings, broadcast, microphone arrays, multi-speaker, multilingual, language models, noise, music, question answering, keywords, ...

### classical acoustic modeling techniques
- monophone and transition model
- triphone and contextual acoustic model
- supervised/unsupervised training
- discriminative training [ref](https://blog.csdn.net/qq_35742630/article/details/89004890)

### graph making and decoding

### viterbi and token-passing

### on-the-fly composition
- difference between dynamic graphs (HCLr.fst, Gr.fst) and static graphs (HCLG.fst)
  - **concept** **[ref](https://zhuanlan.zhihu.com/p/338163396)**
  - **practical usage** **[ref](https://github.com/jpuigcerver/kaldi-decoders/issues/1)**

## data preparation

### file format

### lexicon

### language model

### topo

### spec-augment
- specaugment-layer: native to Kaldi; reports online suggest it needs a few extra epochs before the gains show
- Masking the features first, then substituting the full lattice: experiments at 賽微 showed gains without extra epochs, but 玉山 would first need to implement the change-lattice method

### hclg and fst
[concepts / lattice operations](https://blog.csdn.net/qq_36782366/article/details/102847110)

## training arguments

### TDNN layer descriptors
([sum and append](https://desh2608.github.io/2019-03-27-kaldi-tricks/#sum-append), [specaugment layer](https://github.com/kaldi-asr/kaldi/blob/master/egs/mini_librispeech/s5/local/chain/tuning/run_cnn_tdnn_1b.sh))

### change-lattice effects
[related paper: impact on far-field recognition](https://www.danielpovey.com/files/2016_interspeech_ami.pdf)

## metadata processing

### scp, ark
- scp: script/index table (utterance id -> location), ark: archive holding the actual data; when writing both at once, the ark comes first
- b: binary, t: text
- f/nf: flush/no-flush after each write; p/np: permissive/non-permissive (whether a missing entry raises an error); s/ns: sorted/unsorted table; cs/ncs: sorted/unsorted access (requires C sorting order, e.g. `export LC_ALL=C`)

### lattice
[lattice concepts/operations](https://shiweipku.gitbooks.io/chinese-doc-of-kaldi/content/lattice.html), [removing the LM influence when decoding](https://sourceforge.net/p/kaldi/discussion/1355348/thread/52ec0caf/?limit=50)

### alignment

### ctm

## other settings
- [kaldi epoch](https://groups.google.com/g/kaldi-help/c/7OrqJI2Szvg/m/vk3P8qKWAwAJ)
- MFCC (note 1): [mfcc](https://blog.csdn.net/Magical_Bubble/article/details/90295814), [kaldi-mfcc-hires](https://blog.csdn.net/robingao1994/article/details/80018415)
  ``MFCC first passes the signal through a bank of filters, then applies a DCT to obtain the cepstrum. Kaldi's high-resolution MFCC simply uses more mel filters than the default (--num-mel-bins=40) and keeps more cepstral dimensions (--num-ceps=40). Does it capture higher-frequency information? Yes, because mfcc_hires.conf also sets --high-freq=-400 (how this value should be chosen is still unclear to me).``
- back propagation

## related topics

### keyword spotting and keyword searching

### speaker identification and speaker diarization
- i-vector
- x-vector

### voice activity detection

### srilm for language modeling
- [reference commands from Berlin Chen's lab - 2009 version](http://berlin.csie.ntnu.edu.tw/Courses/Speech%20Recognition/Lectures2009/SP2009F_Lecture11_SRILM%20Tutorial.pdf)
- [reference commands from Berlin Chen's lab - 2004 version](http://berlin.csie.ntnu.edu.tw/Courses/2004F-SpeechRecognition/Slides/SP2004F_Lecture06-02_SRILM%20Toolkit.pdf)
- [srilm ppl scripts](http://www.speech.sri.com/projects/srilm/manpages/ppl-scripts.1.html)
- [lm training and concepts](http://fancyerii.github.io/dev287x/lm/)
- [official doc](http://www.speech.sri.com/projects/srilm/manpages/ngram.1.html)
- [build-big-lm](https://joshua.incubator.apache.org/6.0/large-lms.html): building large LMs with SRILM

### others
- [ESPnet](https://espnet.github.io/espnet/notebook/asr_cli.html)
- k2 (Dan Povey in MIDC): https://zhuanlan.zhihu.com/p/284008844, some details in [PPT](https://codingnote.cc/zh-tw/p/183186/)

## FAQ

### [kaldi Q&A](/MSoPeszmRvOTrcZBsAV24Q)

### kaldi clean up
- discount 0.3 -> 0.9: leans further toward trusting the AM rather than the LM
- soft count (1-best)

## installation
- download from git
- install CUDA
- tools/extras/check_dependencies.sh
  - compilers: G++, Apple LLVM, or Clang
  - libraries: zlib, Intel MKL, OpenBLAS, ATLAS
  - libraries for compiling: libtool, automake, autoconf, patch, bzip2, gzip, wget, subversion
  - tools for scripts: python, gawk, perl
- tools: OpenFst, CUB, Sclite, Sph2pipe, IRSTLM/SRILM(lbfgs)/Kaldi_lm, OpenBLAS/MKL, CLAPACK
- compile Kaldi (src/configure)
  - make -> make test -> make valgrind -> make cudavalgrind -> make clean -> make depend -> make
- set up the parallel environment
  - check for NFS, NIS, or use SGE or SLURM for larger systems
- applications:
  - vosk: [doc](https://alphacephei.com/vosk) [server](https://github.com/alphacep/vosk-server) [api](https://github.com/alphacep/vosk-api)
  - NVIDIA-accelerated version: [doc](https://developer.nvidia.com/blog/gpu-accelerated-speech-to-text-with-kaldi-a-tutorial-on-getting-started/)
  - audacity: [official](https://www.audacityteam.org/)

###### tags: `kaldi` `ASR` `concept` `general`
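The evaluation terms listed under `terms` (WER with its insertion/deletion/substitution errors, and RTF) can be made concrete with a small sketch. This is a toy edit-distance implementation for illustration only, not Kaldi's `compute-wer` tool; the function names are my own:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate = (substitutions + deletions + insertions) / ref length,
    computed via a Levenshtein alignment between word sequences."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimal edit cost aligning r[:i] with h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # match / substitution
                dp[i - 1][j] + 1,                           # deletion
                dp[i][j - 1] + 1,                           # insertion
            )
    return dp[len(r)][len(h)] / max(len(r), 1)


def rtf(decode_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: decoding time divided by audio duration."""
    return decode_seconds / audio_seconds
```

For example, `wer("a b c", "a x c")` gives 1/3 (one substitution over three reference words), and an RTF below 1.0 means decoding runs faster than real time.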
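To make the `scp, ark` notes more concrete: Kaldi's own table I/O handles these formats, but the text variant of a feature archive (`t:` mode, as produced by e.g. `copy-feats ark:... ark,t:...`) is simple enough to sketch a reader for. This is a hypothetical minimal parser assuming well-formed input of the shape `utt_id  [ row ... \n row ... ]`, not Kaldi's actual I/O code:

```python
def read_text_ark(text: str) -> dict:
    """Parse a Kaldi text-format archive of float matrices into
    {utterance_id: list of rows}. Each entry looks like:
        utt1  [
          1.0 2.0
          3.0 4.0 ]
    """
    feats = {}
    for entry in text.split("]"):
        if "[" not in entry:
            continue  # trailing whitespace after the last matrix
        head, body = entry.split("[", 1)
        utt = head.split()[0]
        feats[utt] = [
            [float(x) for x in line.split()]
            for line in body.strip().splitlines()
            if line.strip()
        ]
    return feats
```

An scp file, by contrast, holds no data itself: each line is just `utt_id path/to/file.ark:byte_offset`, which is why the s/cs sorting flags matter when two tables are iterated together.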
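For the `viterbi and token-passing` section: the token-passing view is that each state holds one token (best cost so far plus its history), and per frame every token is propagated along all transitions, with only the cheapest token per state surviving. The sketch below is a toy over an explicit state set with a hypothetical HMM, not Kaldi's lattice decoder; costs are negative log-probabilities:

```python
import math


def viterbi(obs, states, trans, emit, init):
    """Token passing over an HMM: returns (best state path, its cost).
    trans[p][s], emit[s][o], init[s] are probabilities."""
    # one token per state: (accumulated cost, state history)
    tokens = {
        s: (-math.log(init[s]) - math.log(emit[s][obs[0]]), [s]) for s in states
    }
    for o in obs[1:]:
        new_tokens = {}
        for s in states:
            # propagate every predecessor token into s, keep the cheapest
            new_tokens[s] = min(
                (
                    tokens[p][0] - math.log(trans[p][s]) - math.log(emit[s][o]),
                    tokens[p][1] + [s],
                )
                for p in states
            )
        tokens = new_tokens
    cost, path = min(tokens.values())
    return path, cost
```

In a real decoder the "states" are arcs of the HCLG graph and beam pruning discards expensive tokens, but the propagate-and-keep-minimum step is the same.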
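The `smoothing, backoff` terms and the SRILM links can be illustrated with a toy bigram model. Real toolkits like SRILM use properly normalized schemes (Katz, Kneser-Ney); the sketch below uses the simpler unnormalized "stupid backoff" score, and all names here are my own:

```python
from collections import Counter


def train_bigram(corpus):
    """Count unigrams and bigrams from a list of tokenized sentences,
    adding sentence-boundary markers."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi


def score(uni, bi, prev, word, alpha=0.4):
    """Stupid backoff: use the bigram relative frequency when the bigram
    was seen, otherwise back off to a scaled unigram frequency."""
    if bi[(prev, word)] > 0:
        return bi[(prev, word)] / uni[prev]
    return alpha * uni[word] / sum(uni.values())
```

The point of the backoff is that an unseen bigram still gets a nonzero score instead of zero probability, which is exactly what smoothing in the feature-analysis term list refers to on the LM side.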