---
title: 0. wenet
---

# 0. wenet


> Production First and Production Ready End-to-End Speech Recognition Toolkit

- Why start with wenet
    - easy to develop with, and the recipes are easy to follow as references, cf.:
        - espnet: environment is hard to build (on the R&D cloud); lots of boilerplate, arguably too much
        - k2 (next-gen kaldi): environment is hard to build (on the R&D cloud); steeper learning curve
        - speechbrain: research oriented
        - nemo: relatively few examples
        - fairseq/flashlight: very hardcore; focused on large-scale problems
        - ![](https://i.imgur.com/qkDIs5U.png)
    - approachable community
    - more runtime support available for reference
- (2021 interspeech) WeNet: Production Oriented Streaming and Non-streaming End-to-End Speech Recognition Toolkit
    - Production first and production ready (JIT)
    - Unified solution for streaming and non-streaming ASR (U2)
    - Portable runtime
    - lightweight
- (2022 interspeech) WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit
    - U2++
    - Production language model solution
    - Context biasing
    - Unified IO (UIO)




# 1. ASR: review

## 1.1 Some seq2seq models

- Main categories
    1. Transformer
    2. CTC
    3. RNN-T
    4. Neural transducer
- some references
    - https://speech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR%20(v12).pdf
    - https://lorenlugosch.github.io/posts/2020/11/transducer/
    - https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/intro.
    - https://www.youtube.com/watch?v=N6aRv06iv2g&t=15s&ab_channel=Hung-yiLee


### 1.1.1 Attention model

- (2016 icassp) "Listen, Attend and Spell" Chan, et al.
- ![](https://i.imgur.com/SwAl4Ro.png)
    - https://github.com/Alexander-H-Liu/End-to-end-ASR-Pytorch/blob/master/tests/sample_data/demo.png

<!-- #### 1.1.1.1 Listen

![](https://i.imgur.com/OU0lF9C.png)

#### 1.1.1.2 Attention

![](https://i.imgur.com/dLP8WVd.png)

#### 1.1.1.3 Spell

![](https://i.imgur.com/OD0Nazx.png) -->


### 1.1.2 CTC model

- (2006 icml) "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks". Alex Graves, et al.

#### 1.1.2.1 CTC

![](https://i.imgur.com/ZMkkori.png)

#### 1.1.2.2 CTC prediction

![](https://i.imgur.com/SIZtylA.png)
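
For reference, the CTC objective from the Graves et al. paper can be condensed as follows (my notation, not the paper's exact form); $\mathcal{B}$ removes blanks and repeated labels, and the product over frames is the per-frame independence assumption discussed in 1.1.4:

$$
P(\mathbf{y}\mid\mathbf{x})=\sum_{\pi\in\mathcal{B}^{-1}(\mathbf{y})}\prod_{t=1}^{T}p(\pi_t\mid\mathbf{x}),\qquad
\mathcal{L}_{\mathrm{CTC}}=-\log P(\mathbf{y}\mid\mathbf{x})
$$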


### 1.1.3 Transducer model

- (2012 icml) "Sequence Transduction with Recurrent Neural Networks". Alex Graves, et al.
- (2016 nips) "A Neural Transducer". Navdeep Jaitly, et al.

#### 1.1.3.1 RNN transducer

![](https://i.imgur.com/iEYmyCZ.png)

#### 1.1.3.2 Neural transducer

![](https://i.imgur.com/BQBj5go.png)
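
A rough restatement of the transducer formulation (again my notation): the encoder output $f_t$ and the prediction-network output $g_u$ (conditioned on previously emitted labels) are combined by a joint network, and the loss sums over all blank-augmented alignment paths; the $T \times U \times V$ joint output is what drives the $O(BTUV)$ training memory noted in 1.1.4:

$$
z_{t,u}=\mathrm{Joint}(f_t,\,g_u),\qquad
P(\mathbf{y}\mid\mathbf{x})=\sum_{a\in\mathcal{B}^{-1}(\mathbf{y})}\ \prod_{(t,u)\in a}P\!\left(a_{t,u}\mid z_{t,u}\right)
$$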

### 1.1.4 Comparison

![](https://i.imgur.com/oQKwT4y.png)

- Pros and cons
    - attention: sees the full context; cannot run online (streaming); attention over long utterances is memory-hungry (O(TU)); harder to converge
    - CTC: simple architecture; assumes len(T) > len(U); the per-frame independence assumption is not very realistic
    - transducers: the architecture naturally removes both CTC's frame-independence assumption and attention's need for the full context (which blocks online use); training takes more memory (O(BTUV))


## 1.2 streaming vs non-streaming

- streaming: recognize while the audio signal is still arriving
- non-streaming: recognize after the whole audio signal is available




# 2. e2e ASR: wenet

- model architecture
- decoding
- unified model
- language model
- contextual biasing
- UIO
- runtime

## 2.1 model architecture

- shared encoder:
    - transformer
    - conformer
- decoder:
    - ctc
    - transformer
    - rnn-t
- mainly uses the joint CTC/AED architecture, which speeds up and stabilizes training

### 2.1.1 joint CTC/attention training

- (2017 icassp) "Joint CTC-Attention Based End-to-End Speech Recognition Using Multi-Task Learning". Kim, et al.
- the shared encoder can be a plain transformer or conformer
- architecture of joint CTC/AED
    - ![](https://i.imgur.com/6ScOqU3.png)
- loss of joint CTC/AED 
    - ![](https://i.imgur.com/hEehJcB.png)
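
Written out, the loss in the figure is the standard weighted combination, with $\lambda$ the CTC weight hyper-parameter:

$$
\mathcal{L}_{\mathrm{combined}}(\mathbf{x},\mathbf{y})=\lambda\,\mathcal{L}_{\mathrm{CTC}}(\mathbf{x},\mathbf{y})+(1-\lambda)\,\mathcal{L}_{\mathrm{AED}}(\mathbf{x},\mathbf{y})
$$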


## 2.2 decoding

- modes:
    - attention: transformer decoding
    - ctc_greedy_search: CTC 1-best
    - ctc_prefix_beam_search: beam search over the CTC output
    - attention_rescoring: attention rescoring of the ctc_prefix_beam_search results against the encoder output (see the sketch below)
    - rnn-t
- causal convolution can be used to reduce the right-context dependency
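
A minimal sketch of the attention_rescoring idea (illustrative only, not WeNet's actual implementation; `decoder.score` is a hypothetical helper): take the n-best hypotheses from ctc_prefix_beam_search, score each one with the attention decoder against the shared encoder output, and combine the two scores.

```python
import torch

def attention_rescoring(encoder_out, nbest, ctc_scores, decoder, ctc_weight=0.5):
    """Rescore CTC n-best hypotheses with the attention decoder.

    encoder_out: (1, T, D) shared encoder output
    nbest:       list of token-id lists from ctc_prefix_beam_search
    ctc_scores:  CTC log-probability of each hypothesis
    decoder:     attention decoder exposing a (hypothetical) score() that
                 returns the summed log-prob of a teacher-forced hypothesis
    """
    best_hyp, best_score = None, float("-inf")
    for hyp, ctc_score in zip(nbest, ctc_scores):
        att_score = decoder.score(encoder_out, torch.tensor([hyp]))
        # One common weighting: attention score plus a weighted CTC score.
        score = att_score + ctc_weight * ctc_score
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp, best_score
```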


## 2.3 U2 model

- U: unify streaming and non-streaming training, so that one model can serve both streaming and non-streaming use cases. Dynamic chunk training:
    - > We adopt a dynamic chunk training technique to unify the non-streaming and streaming model. Firstly, the input is split into several chunks by a fixed chunk size C with inputs [t + 1, t + 2, ..., t + C] and every chunk attends on itself and all the previous chunks
- latency is controlled by the chunk size: chunk size = 1 is fully streaming, while an unrestricted chunk size is equivalent to non-streaming
    - ![](https://i.imgur.com/lUzQIlE.png)
- trade speed against accuracy by adjusting the chunk size (see the mask sketch after this list)
    - chunk size = -1 : non-streaming
    - chunk size = 4/8/16 : streaming
- ![](https://i.imgur.com/sVxIBi0.png)
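
A minimal sketch of the chunk-based attention mask implied above (illustrative, not WeNet's exact code): position i may attend to everything up to the end of its own chunk, i.e. its own chunk plus all previous chunks, and a negative chunk size degenerates to full (non-streaming) attention.

```python
import torch

def chunk_attention_mask(size: int, chunk_size: int) -> torch.Tensor:
    """Boolean (size, size) mask: mask[i, j] = True means i may attend to j."""
    if chunk_size < 0:                 # chunk size = -1: full, non-streaming attention
        return torch.ones(size, size, dtype=torch.bool)
    mask = torch.zeros(size, size, dtype=torch.bool)
    for i in range(size):
        end = ((i // chunk_size) + 1) * chunk_size   # end of i's own chunk
        mask[i, :min(end, size)] = True              # own chunk + all previous chunks
    return mask
```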

### 2.3.1 U2++

- the decoder becomes bidirectional: a right-to-left attention decoder is added
- ![](https://i.imgur.com/EBF4PJO.png)
- ![](https://i.imgur.com/NuKMD5I.png)
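
As I read the U2++ design, the AED part of the loss becomes a weighted sum of the left-to-right and right-to-left decoders ($\alpha$ is the reverse-decoder weight); treat this as an approximation of the paper's exact equation:

$$
\mathcal{L}=\lambda\,\mathcal{L}_{\mathrm{CTC}}+(1-\lambda)\bigl[(1-\alpha)\,\mathcal{L}_{\mathrm{L2R}}+\alpha\,\mathcal{L}_{\mathrm{R2L}}\bigr]
$$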


## 2.4 Language model

- n-gram LM for fast domain adaptation
    - when in-domain training data is available
    - cf. RNN-T LM
- decoding graph: TLG
    - cf. kaldi HCLG
- ![](https://i.imgur.com/vcESJ61.png)
- ![](https://i.imgur.com/v4wX7FJ.png)
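
As I recall from the WeNet 2.0 paper, the TLG graph composes the e2e modeling-unit topology T (CTC units including blank), the lexicon L, and the n-gram G, in the same spirit as kaldi's HCLG but without H and C:

$$
\mathrm{TLG}=T\circ\min\bigl(\det(L\circ G)\bigr)
$$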


## 2.5 Contextual biasing

- much faster domain adaptation
    - improves the recognition rate of specific terms
- on-the-fly WFST
    - w/o LM: decompose the phrase list into decoder output units
    - w/ LM: decompose the phrase list into the LM vocabulary
- ![](https://i.imgur.com/9qxgxBB.png)
- ![](https://i.imgur.com/nGJKEn7.png)
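
As I understand the biasing mechanism, a boost score from the contextual WFST is simply added to the model score during beam search ($\lambda$ is a tunable boost weight); roughly:

$$
\hat{\mathbf{y}}=\arg\max_{\mathbf{y}}\ \log P(\mathbf{y}\mid\mathbf{x})+\lambda\,\log P_{C}(\mathbf{y})
$$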


## 2.6 UIO

- Unified IO
- tackles the training-data access problem at production scale
    - random access x an index of 1,000,000+ small files = OOM
    - even without OOM it is still slow
- a unified IO interface whose backend can be the local file system or cloud storage (S3/OSS/HDFS/...)
- all files are loaded and packed, together with their metadata, into shards (see the sketch after this list)
- which is actually done with GNU tar
- the development environment needs enough disk space to store the shards
- on-the-fly feature extraction
- ![](https://i.imgur.com/nbeFwjJ.png)
- ![](https://i.imgur.com/P2C3T0v.png)
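
A minimal sketch of the shard idea (the helper name and file layout are illustrative, not WeNet's actual tools): pack each wav together with its transcript into one tar so that training can read a shard sequentially instead of random-accessing millions of small files.

```python
import io
import tarfile

def write_shard(samples, shard_path):
    """samples: iterable of (utt_id, wav_path, text); pack them into one tar shard."""
    with tarfile.open(shard_path, "w") as tar:
        for utt_id, wav_path, text in samples:
            # Audio is added as-is; the transcript is stored next to it, so a
            # sequential read of the tar yields (audio, label) pairs.
            tar.add(wav_path, arcname=f"{utt_id}.wav")
            txt = text.encode("utf-8")
            info = tarfile.TarInfo(name=f"{utt_id}.txt")
            info.size = len(txt)
            tar.addfile(info, io.BytesIO(txt))
```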


## 2.7 runtime

- system design:
    - ![](https://i.imgur.com/kT8JyYO.png)
- support is provided at several levels
- supported architectures: arm (android) / intel x86
    - library
- supported runtime platforms: onnx / libtorch
    - model
- supported languages: python / c++
    - python: wenetruntime
        - i.e. pip install wenetruntime (usage sketch below)
- multiple examples are provided
    - python
    - command line
    - websocket (python/c++)
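
A usage sketch for the pip package mentioned above; the exact API (class and argument names, and the language code) should be checked against the wenetruntime README for your version:

```python
import sys
import wenetruntime as wenet

# Load a pretrained runtime model; 'chs' is assumed here, see the README
# for the supported language options.
decoder = wenet.Decoder(lang='chs')
print(decoder.decode_wav(sys.argv[1]))  # expects a 16 kHz, 16-bit mono wav
```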


### 2.7.1 quantization performance

- ![](https://i.imgur.com/PqRRWoN.png)


### 2.7.2 runtime rtf

- ![](https://i.imgur.com/0LYwv3L.png)


### 2.7.3 runtime latency

- ![](https://i.imgur.com/ce9iNgH.png)
    - L1: model latency, waiting time introduced by model structure
    - L2: rescoring latency, time spent on attention rescoring
    - L3: final latency, user perceived latency
- part of L3 comes from L2


## 2.8 conclusion

- the joint CTC/AED architecture improves both speed and accuracy
- dynamic chunk training lets a single model handle both streaming and non-streaming
- several runtime options are provided for use and reference
- LM and contextual biasing offer quick solutions to production ASR domain adaptation
- UIO addresses the problems that arise when training on large amounts of data




# 3. others

## 3.1 future works: wenet 3.0

- ![](https://i.imgur.com/JNdizrg.png)

## 3.2 Other features

- tts
- speaker
- kws


# 4. wenet in esun

|model|MER (%, attention_rescoring)|
|:--:|:--:|
|cnn-tdnnf| 11.32 |
|cnn-tdnnf-a-lstm| 10.97 |
|conformer CTC/AED| 9.83 |
|conformer CTC/AED U2++| 10.34 |

- the training corpora are not fully aligned across models
- time per training epoch (v100 16GB * 8)
    - conformer ctc/aed: ~45 min 
    - u2++: ~55 min

# 5. Post-talk Q&A

#### Q: What are the labels for dynamic chunk training?

A:
See section 3, question 3 of
http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html#chunk-based-mask .
The attention in conformer CTC/AED comes in three kinds:
1. self-attention in the encoder
2. self-attention in the decoder
3. cross-attention between the encoder and the decoder

Dynamic chunking is applied to the self-attention in the encoder.
The key figure from the link:
![](http://placebokkk.github.io/assets/images/wenet/mask-encoder-attention.png)

Dynamic chunk corresponds to rows 2, 3 and 4. Taking the 8 inputs in the figure as an example:
1. Row 1 is full attention: every time step attends over all inputs when computing attention, i.e.
	- the encoder output at t1 sees t1~t8
	- the encoder output at t2 sees t1~t8
	- the encoder output at t5 sees t1~t8
	- the encoder output at t8 sees t1~t8
2. Row 2 is chunk-based attention: every time step only sees the inputs inside its own chunk, i.e.
	- the encoder output at t1 sees t1~t2
	- the encoder output at t2 sees t1~t2
	- the encoder output at t5 sees t5~t6
	- the encoder output at t8 sees t7~t8
The same reasoning applies to row 3 (no right context) and row 4 (only a limited left context).
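
A tiny sketch that reproduces the within-chunk visibility described above for 8 inputs and chunk size 2 (row 2 of the figure):

```python
# Within-chunk visibility, 8 inputs, chunk size 2 (row 2 of the figure):
# each time step only attends to the inputs inside its own chunk.
chunk_size, T = 2, 8
for t in range(T):                              # t = 0..7  <->  t1..t8
    start = (t // chunk_size) * chunk_size      # first input of t's chunk
    end = start + chunk_size                    # last input of t's chunk
    print(f"t{t + 1}: sees t{start + 1}~t{end}")
# prints e.g. "t1: sees t1~t2", "t5: sees t5~t6", "t8: sees t7~t8"
```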

#### Q: At what point does context biasing happen? Is it applied to the embeddings?

#### Q: chunk size 跟 latency 的關係？