# How to tune Kaldi parameters
> tag: `ASR`
>
First, when it comes to tuning, I recommend working directly with the NN models. The standard workflow is: (1) try a large number of different architectures first; (2) once you have identified a good architecture, fine-tune its parameters. Many architecture recipes have already been developed; they live in `kaldi/egs/swbd/s5c/local/chain/tuning/`. Each file name tells you which architecture that script uses, and the suffix at the end of the name identifies a specific tuning experiment under that architecture. For example, **run_tdnn_3k.sh** means experiment `3k` of the TDNN architecture. Each experiment is essentially just adding layers or tweaking a few numbers. Because the NN models all depend on the alignments produced by the earlier HMM models, it is best to run `run.sh` all the way through first and confirm it completes correctly before starting NN training.
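As an illustration of this naming convention, here is a small Python sketch that lists the tuning scripts and groups them by architecture. It is not part of any Kaldi recipe; the directory path is the one mentioned above, and the file-name pattern is my own guess at the convention:
```
import re
from pathlib import Path

# Hypothetical: parse names like run_tdnn_3k.sh -> architecture "tdnn", experiment "3k".
TUNING_DIR = Path("kaldi/egs/swbd/s5c/local/chain/tuning")
NAME_PATTERN = re.compile(r"^run_(?P<arch>[a-z_]+)_(?P<exp>\d+[a-z]*)\.sh$")

experiments = {}
for script in sorted(TUNING_DIR.glob("run_*.sh")):
    match = NAME_PATTERN.match(script.name)
    if match is None:
        continue  # skip names that do not follow the convention
    experiments.setdefault(match.group("arch"), []).append(match.group("exp"))

# Print how many tuning experiments exist for each architecture.
for arch, exps in sorted(experiments.items()):
    print(f"{arch}: {len(exps)} experiments -> {', '.join(exps)}")
```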
## The NN model family
For simplicity we will stick to the newest nnet3 family of models^[http://kaldi-asr.org/doc/dnn3.html]. In every recipe, the nnet stage begins after tri5, and the recipe author usually provides both a run_nnet3 and a chain model for the user to choose from. Since chain models outperform plain nnet3 models in most reported results, this article only covers the chain model.
### Chain model
The chain model is a DNN-HMM hybrid: during training the final stage of the model is an HMM, so the probabilities over the lattice are computed with the forward-backward algorithm. The documentation mentions that recent chain models use LF-MMI (lattice-free maximum mutual information) to compute the quantities needed in the forward-backward pass^[http://kaldi-asr.org/doc/chain.html#chain_model].
> **The training procedure for 'chain' models**
> The training procedure for chain models is a lattice-free version of MMI, where the denominator state posteriors are obtained by the forward-backward algorithm over a HMM formed from a phone-level decoding graph, and the numerator state posteriors are obtained by a similar forward-backward algorithm but limited to sequences corresponding to the transcript.
> For each output index of the neural net (i.e. for each pdf-id), we compute a derivative of the form (numerator occupation probability - denominator occupation probability), and these are propagated back to the network.
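To make the quoted description a little more concrete, the objective behind it can be sketched as follows (this is the standard MMI formulation written in my own notation, not Kaldi's exact code):

$$
\mathcal{F}_{\text{LF-MMI}}=\sum_{u}\Big(\log p\big(\mathbf{O}_u \mid G^{\text{num}}_u\big)-\log p\big(\mathbf{O}_u \mid G^{\text{den}}\big)\Big)
$$

where $\mathbf{O}_u$ is the acoustic feature sequence of utterance $u$, $G^{\text{num}}_u$ is the numerator HMM built from that utterance's transcript, and $G^{\text{den}}$ is the denominator HMM built from the phone-level decoding graph. Differentiating with respect to the network output for pdf-id $j$ at frame $t$ gives exactly the quoted form, $\gamma^{\text{num}}_{u,t}(j)-\gamma^{\text{den}}_{u,t}(j)$, where the two $\gamma$ terms are the occupation probabilities from the numerator and denominator forward-backward passes.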
Let's now take `formosa/s5/local/chain/run_tdnn.sh` as an example.
<img src="https://i.imgur.com/xgxVnzb.png" width="30%" style="display: block;margin-left: auto;margin-right: auto;">
The figure above is my attempt at drawing the architecture; roughly speaking, it corresponds to the following config:
```
input dim=100 name=ivector
input dim=43 name=input
# please note that it is important to have input layer with the name=input
# as the layer immediately preceding the fixed-affine-layer to enable
# the use of short notation for the descriptor
fixed-affine-layer name=lda input=Append(-1,0,1,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat
# the first splicing is moved before the lda layer, so no splicing here
relu-batchnorm-layer name=tdnn1 dim=625
relu-batchnorm-layer name=tdnn2 input=Append(-1,0,1) dim=625
relu-batchnorm-layer name=tdnn3 input=Append(-1,0,1) dim=625
relu-batchnorm-layer name=tdnn4 input=Append(-3,0,3) dim=625
relu-batchnorm-layer name=tdnn5 input=Append(-3,0,3) dim=625
relu-batchnorm-layer name=tdnn6 input=Append(-3,0,3) dim=625
## adding the layers for chain branch
relu-batchnorm-layer name=prefinal-chain input=tdnn6 dim=625 target-rms=0.5
output-layer name=output include-log-softmax=false dim=$num_targets max-change=1.5
# adding the layers for xent branch
# This block prints the configs for a separate output that will be
# trained with a cross-entropy objective in the 'chain' models... this
# has the effect of regularizing the hidden parts of the model. we use
# 0.5 / args.xent_regularize as the learning rate factor- the factor of
# 0.5 / args.xent_regularize is suitable as it means the xent
# final-layer learns at a rate independent of the regularization
# constant; and the 0.5 was tuned so as to make the relative progress
# similar in the xent and regular final layers.
relu-batchnorm-layer name=prefinal-xent input=tdnn6 dim=625 target-rms=0.5
output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor max-change=1.5
```
The tunable parameters in this architecture include the dimension and depth of the TDNN (time-delay neural network) layers. In addition, `Append(-3,0,3)` means that a layer's input is the previous layer's output spliced at time offsets -3, 0, and +3, i.e. the frame three steps earlier, the current frame, and the frame three steps later (a small sketch of this splicing is given below). Look at the line `relu-batchnorm-layer name=tdnn5 input=Append(-3,0,3) dim=625`: in Dan's nnet3 system, adding a new NN layer simply means adding one line with one of their predefined layer types. For example, a `relu-batchnorm-layer` applies a ReLU followed by batch normalization. If you browse the other recipes you will find that Dan's team has implemented LSTM, BLSTM, CNN, TDNN-F, attention, and other layer types. For choosing an NN architecture, I recommend looking at other people's [papers](https://sites.google.com/speech.ntut.edu.tw/fsw/home/workshop?fbclid=IwAR2k_TGwgms_rz579Qfo3_z-GHPyKbC6xFULXmy0QOd9tmCJjbBVIxBiXJk).
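As an illustration only, the following NumPy sketch shows what the splicing described by `Append(-3,0,3)` amounts to: for each frame, the previous layer's outputs at offsets -3, 0, and +3 are concatenated. This is not how nnet3 is actually implemented, and the edge handling (clamping to the first/last frame) is my own simplification:
```
import numpy as np

def splice(frames, offsets=(-3, 0, 3)):
    """frames: (num_frames, dim) -> (num_frames, dim * len(offsets))."""
    num_frames, dim = frames.shape
    spliced = np.zeros((num_frames, dim * len(offsets)), dtype=frames.dtype)
    for t in range(num_frames):
        pieces = []
        for off in offsets:
            # Clamp at the utterance boundaries (a simplification of nnet3's
            # real context handling).
            idx = min(max(t + off, 0), num_frames - 1)
            pieces.append(frames[idx])
        spliced[t] = np.concatenate(pieces)
    return spliced

hidden = np.random.randn(200, 625).astype(np.float32)  # e.g. tdnn4's output
print(splice(hidden).shape)  # (200, 1875): the input dimension seen by tdnn5
```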
## Experimental results

These experiments were run on the Formosa data; an introduction to the corpus is available at [formosa](https://sites.google.com/speech.ntut.edu.tw/fsw/home/challenge). Chinese ASR models are usually evaluated with word error rate (WER) and character error rate (CER), but since WER also depends on the quality of the Chinese word segmentation, most of the time it is enough to look at CER alone.

In our experiments the TDNN-F model was the best, so following the Kaldi tuning logic described above, my next step would be to fine-tune within the TDNN-F architecture. Next comes the Seq2Seq 0.5/CTC model, implemented with [espnet](https://github.com/espnet/espnet); it is a purely neural end-to-end architecture, unlike Dan's DNN-HMM. The result at the very bottom, Google ASR, was produced with Google's API, the [Google Cloud Speech API](https://cloud.google.com/speech-to-text/?hl=zh-tw&utm_source=google&utm_medium=cpc&utm_campaign=japac-TW-all-en-dr-bkws-all-all-trial-b-dr-1003987&utm_content=text-ad-none-none-DEV_c-CRE_252375342940-ADGP_Hybrid+%7C+AW+SEM+%7C+BKWS+~+T1+%7C+BMM+%7C+ML+%7C+M:1+%7C+TW+%7C+en+%7C+Speech+%7C+API-KWID_43700029827988136-kwd-309378380810&userloc_1012810&utm_term=KW_%2Bgoogle%20%2Basr&gclid=Cj0KCQiAzKnjBRDPARIsAKxfTRBLgVIWwnvlFxU3ZORfcJ7C0FN_nb7q7w8e68jEKZFg6GjVTxihCw0aAgeAEALw_wcB). It is not hard to use: it is a paid service, but a new account is given a USD 300 quota, and each account must be tied to a credit card. You use it by calling the API from your own program; I wrote a Python [example script](https://github.com/JackingChen/Kaldi_tutorials/blob/master/GOOGLE_ASR_baseline.py), the core of which is shown below:
```
import io

from google.cloud import speech
from google.cloud.speech import enums, types

Fs = 16000  # sampling rate of the audio file (16 kHz LINEAR16 here)
client = speech.SpeechClient()

# Load the audio into memory (Path is the audio file's path, set earlier in the full script)
with io.open(Path, 'rb') as audio_file:
    content = audio_file.read()
audio = types.RecognitionAudio(content=content)

config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=Fs,
    language_code='cmn-Hant-TW')

# Detect speech in the audio file (synchronous recognition, for audio under ~1 minute)
response = client.recognize(config, audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```
Utterances longer than one minute need to be handled differently (not covered in this article).
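Since CER is the metric reported throughout, here is a minimal sketch of how it can be computed, using a plain Levenshtein (edit-distance) dynamic program; the example strings at the bottom are made up for illustration:
```
def cer(reference, hypothesis):
    """CER = (substitutions + deletions + insertions) / number of reference characters."""
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(cer("今天天氣很好", "今天天器很好"))  # 1 substitution / 6 characters ≈ 0.167
```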
Finally, a note on why our experiments beat Google ASR: Google ASR is a general-purpose system that has to handle audio from all kinds of sources, whereas our models are trained entirely on our own corpus and tested on data from the same environment. So although our system looks better, it may well not transfer to other datasets.