# How to tune Kaldi parameters
> tag: `ASR`
>
First, when it comes to tuning, I recommend working directly with the NN models. The standard workflow is: (1) try a large number of different architectures first; (2) once you have identified a good architecture, fine-tune its parameters. Many architecture recipes have already been developed; they live in `kaldi/egs/swbd/s5c/local/chain/tuning/`. Each file name tells you which architecture that script uses, and the suffix at the end of the name identifies a specific tuning experiment under that architecture. For example, **run_tdnn_3k.sh** means experiment `3k` of the TDNN architecture. Each experiment is essentially just adding layers or tweaking a few numbers. Because the NN models all depend on the alignments produced by the earlier HMM models, it is best to run `run.sh` all the way through first and confirm it completes correctly before starting NN training.
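As an illustration of this naming convention, here is a small Python sketch that lists the tuning scripts and groups them by architecture. It is not part of any Kaldi recipe; the directory path is the one mentioned above, and the file-name pattern is my own guess at the convention:
```
import re
from pathlib import Path

# Hypothetical: parse names like run_tdnn_3k.sh -> architecture "tdnn", experiment "3k".
TUNING_DIR = Path("kaldi/egs/swbd/s5c/local/chain/tuning")
NAME_PATTERN = re.compile(r"^run_(?P<arch>[a-z_]+)_(?P<exp>\d+[a-z]*)\.sh$")

experiments = {}
for script in sorted(TUNING_DIR.glob("run_*.sh")):
    match = NAME_PATTERN.match(script.name)
    if match is None:
        continue  # skip names that do not follow the convention
    experiments.setdefault(match.group("arch"), []).append(match.group("exp"))

# Print how many tuning experiments exist for each architecture.
for arch, exps in sorted(experiments.items()):
    print(f"{arch}: {len(exps)} experiments -> {', '.join(exps)}")
```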
## The NN model family
For simplicity we will stick to the newest nnet3 family of models^[http://kaldi-asr.org/doc/dnn3.html]. In every recipe, the nnet stage begins after tri5, and the recipe author usually provides both a run_nnet3 and a chain model for the user to choose from. Since chain models outperform plain nnet3 models in most reported results, this article only covers the chain model.
### Chain model
The chain model is a DNN-HMM hybrid: during training the final stage of the model is an HMM, so the probabilities over the lattice are computed with the forward-backward algorithm. The documentation mentions that recent chain models use LF-MMI (lattice-free maximum mutual information) to compute the quantities needed in the forward-backward pass^[http://kaldi-asr.org/doc/chain.html#chain_model].
> **The training procedure for 'chain' models**
> The training procedure for chain models is a lattice-free version of MMI, where the denominator state posteriors are obtained by the forward-backward algorithm over a HMM formed from a phone-level decoding graph, and the numerator state posteriors are obtained by a similar forward-backward algorithm but limited to sequences corresponding to the transcript.
> For each output index of the neural net (i.e. for each pdf-id), we compute a derivative of the form (numerator occupation probability - denominator occupation probability), and these are propagated back to the network.
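To make the quoted description a little more concrete, the objective behind it can be sketched as follows (this is the standard MMI formulation written in my own notation, not Kaldi's exact code):

$$
\mathcal{F}_{\text{LF-MMI}}=\sum_{u}\Big(\log p\big(\mathbf{O}_u \mid G^{\text{num}}_u\big)-\log p\big(\mathbf{O}_u \mid G^{\text{den}}\big)\Big)
$$

where $\mathbf{O}_u$ is the acoustic feature sequence of utterance $u$, $G^{\text{num}}_u$ is the numerator HMM built from that utterance's transcript, and $G^{\text{den}}$ is the denominator HMM built from the phone-level decoding graph. Differentiating with respect to the network output for pdf-id $j$ at frame $t$ gives exactly the quoted form, $\gamma^{\text{num}}_{u,t}(j)-\gamma^{\text{den}}_{u,t}(j)$, where the two $\gamma$ terms are the occupation probabilities from the numerator and denominator forward-backward passes.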
Let's now take `formosa/s5/local/chain/run_tdnn.sh` as an example.
<img src="https://i.imgur.com/xgxVnzb.png" width="30%" style="display: block;margin-left: auto;margin-right: auto;">
The figure above is my attempt at drawing the architecture; roughly speaking, it corresponds to the following config:
```
input dim=100 name=ivector
input dim=43 name=input
# please note that it is important to have input layer with the name=input
# as the layer immediately preceding the fixed-affine-layer to enable
# the use of short notation for the descriptor
fixed-affine-layer name=lda input=Append(-1,0,1,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat
# the first splicing is moved before the lda layer, so no splicing here
relu-batchnorm-layer name=tdnn1 dim=625
relu-batchnorm-layer name=tdnn2 input=Append(-1,0,1) dim=625
relu-batchnorm-layer name=tdnn3 input=Append(-1,0,1) dim=625
relu-batchnorm-layer name=tdnn4 input=Append(-3,0,3) dim=625
relu-batchnorm-layer name=tdnn5 input=Append(-3,0,3) dim=625
relu-batchnorm-layer name=tdnn6 input=Append(-3,0,3) dim=625
## adding the layers for chain branch
relu-batchnorm-layer name=prefinal-chain input=tdnn6 dim=625 target-rms=0.5
output-layer name=output include-log-softmax=false dim=$num_targets max-change=1.5
# adding the layers for xent branch
# This block prints the configs for a separate output that will be
# trained with a cross-entropy objective in the 'chain' models... this
# has the effect of regularizing the hidden parts of the model. we use
# 0.5 / args.xent_regularize as the learning rate factor- the factor of
# 0.5 / args.xent_regularize is suitable as it means the xent
# final-layer learns at a rate independent of the regularization
# constant; and the 0.5 was tuned so as to make the relative progress
# similar in the xent and regular final layers.
relu-batchnorm-layer name=prefinal-xent input=tdnn6 dim=625 target-rms=0.5
output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor max-change=1.5
```
The tunable parameters in this architecture include the dimension and depth of the TDNN (time-delay neural network) layers. In addition, `Append(-3,0,3)` means that a layer's input is the previous layer's output spliced at time offsets -3, 0, and +3, i.e. the frame three steps earlier, the current frame, and the frame three steps later (a small sketch of this splicing is given below). Look at the line `relu-batchnorm-layer name=tdnn5 input=Append(-3,0,3) dim=625`: in Dan's nnet3 system, adding a new NN layer simply means adding one line with one of their predefined layer types. For example, a `relu-batchnorm-layer` applies a ReLU followed by batch normalization. If you browse the other recipes you will find that Dan's team has implemented LSTM, BLSTM, CNN, TDNN-F, attention, and other layer types. For choosing an NN architecture, I recommend looking at other people's [papers](https://sites.google.com/speech.ntut.edu.tw/fsw/home/workshop?fbclid=IwAR2k_TGwgms_rz579Qfo3_z-GHPyKbC6xFULXmy0QOd9tmCJjbBVIxBiXJk).
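As an illustration only, the following NumPy sketch shows what the splicing described by `Append(-3,0,3)` amounts to: for each frame, the previous layer's outputs at offsets -3, 0, and +3 are concatenated. This is not how nnet3 is actually implemented, and the edge handling (clamping to the first/last frame) is my own simplification:
```
import numpy as np

def splice(frames, offsets=(-3, 0, 3)):
    """frames: (num_frames, dim) -> (num_frames, dim * len(offsets))."""
    num_frames, dim = frames.shape
    spliced = np.zeros((num_frames, dim * len(offsets)), dtype=frames.dtype)
    for t in range(num_frames):
        pieces = []
        for off in offsets:
            # Clamp at the utterance boundaries (a simplification of nnet3's
            # real context handling).
            idx = min(max(t + off, 0), num_frames - 1)
            pieces.append(frames[idx])
        spliced[t] = np.concatenate(pieces)
    return spliced

hidden = np.random.randn(200, 625).astype(np.float32)  # e.g. tdnn4's output
print(splice(hidden).shape)  # (200, 1875): the input dimension seen by tdnn5
```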
## Experimental results

These experiments were run on the Formosa data; an introduction to the corpus is available at [formosa](https://sites.google.com/speech.ntut.edu.tw/fsw/home/challenge). Chinese ASR models are usually evaluated with word error rate (WER) and character error rate (CER), but since WER also depends on the quality of the Chinese word segmentation, most of the time it is enough to look at CER alone.

In our experiments the TDNN-F model was the best, so following the Kaldi tuning logic described above, my next step would be to fine-tune within the TDNN-F architecture. Next comes the Seq2Seq 0.5/CTC model, implemented with [espnet](https://github.com/espnet/espnet); it is a purely neural end-to-end architecture, unlike Dan's DNN-HMM. The result at the very bottom, Google ASR, was produced with Google's API, the [Google Cloud Speech API](https://cloud.google.com/speech-to-text/?hl=zh-tw&utm_source=google&utm_medium=cpc&utm_campaign=japac-TW-all-en-dr-bkws-all-all-trial-b-dr-1003987&utm_content=text-ad-none-none-DEV_c-CRE_252375342940-ADGP_Hybrid+%7C+AW+SEM+%7C+BKWS+~+T1+%7C+BMM+%7C+ML+%7C+M:1+%7C+TW+%7C+en+%7C+Speech+%7C+API-KWID_43700029827988136-kwd-309378380810&userloc_1012810&utm_term=KW_%2Bgoogle%20%2Basr&gclid=Cj0KCQiAzKnjBRDPARIsAKxfTRBLgVIWwnvlFxU3ZORfcJ7C0FN_nb7q7w8e68jEKZFg6GjVTxihCw0aAgeAEALw_wcB). It is not hard to use: it is a paid service, but a new account is given a USD 300 quota, and each account must be tied to a credit card. You use it by calling the API from your own program; I wrote a Python [example script](https://github.com/JackingChen/Kaldi_tutorials/blob/master/GOOGLE_ASR_baseline.py), the core of which is shown below:
```
import io

from google.cloud import speech
from google.cloud.speech import enums, types

Fs = 16000  # sampling rate of the audio file (16 kHz LINEAR16 here)
client = speech.SpeechClient()

# Load the audio into memory (Path is the audio file's path, set earlier in the full script)
with io.open(Path, 'rb') as audio_file:
    content = audio_file.read()
audio = types.RecognitionAudio(content=content)

config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=Fs,
    language_code='cmn-Hant-TW')

# Detect speech in the audio file (synchronous recognition, for audio under ~1 minute)
response = client.recognize(config, audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```
Utterances longer than one minute need to be handled differently (not covered in this article).
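Since CER is the metric reported throughout, here is a minimal sketch of how it can be computed, using a plain Levenshtein (edit-distance) dynamic program; the example strings at the bottom are made up for illustration:
```
def cer(reference, hypothesis):
    """CER = (substitutions + deletions + insertions) / number of reference characters."""
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(cer("今天天氣很好", "今天天器很好"))  # 1 substitution / 6 characters ≈ 0.167
```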
Finally, a note on why our experiments beat Google ASR: Google ASR is a general-purpose system that has to handle audio from all kinds of sources, whereas our models are trained entirely on our own corpus and tested on data from the same environment. So although our system looks better, it may well not transfer to other datasets.