# Speech Recognition

## I/O of a speech recognition system

![](https://i.imgur.com/cvWcMtM.png)

## Feature extraction before entering the NN

![](https://i.imgur.com/0ved76u.png)

## Acoustic features used in papers

![](https://i.imgur.com/TnA3Jo4.png)

## How much data is needed

![](https://i.imgur.com/BRAhraZ.png)

## Famous structures

![](https://i.imgur.com/mZB079A.png)

## Audio signal NN pooling

![](https://i.imgur.com/87sO7hd.png)

![](https://i.imgur.com/1TtskHu.png)

## Listen, Attend, and Spell (LAS)

![](https://i.imgur.com/mTjIwvH.png)

Inference: Beam Search

- Greedy search and its limit

  ![](https://i.imgur.com/IzG6qTW.png)

- Beam search

  ![](https://i.imgur.com/cfN6PVZ.png)

Training: Teacher forcing

![](https://i.imgur.com/rQohWnb.png)

Attention

![](https://i.imgur.com/DoFiQMu.png)

![](https://i.imgur.com/6GWMnEF.png)

## Connectionist Temporal Classification (CTC)

![](https://i.imgur.com/inrNwHP.png)

* Input: $T$ acoustic features; output: $T$ tokens (ignoring downsampling)
* The output tokens include the blank symbol $\phi$; to get the final transcript, first merge duplicate tokens, then remove $\phi$

![](https://i.imgur.com/o8Io1G1.png)

![](https://i.imgur.com/edBEOHh.png)

## Model summary

![](https://i.imgur.com/DwqGhti.png)
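
As a small aside on the CTC section above: the post-processing rule stated there (merge duplicate tokens, then remove $\phi$) can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the lecture; the function name `ctc_collapse` and the use of the character `"φ"` as the blank token are assumptions made for the example.

```python
from itertools import groupby

BLANK = "φ"  # assumed stand-in for the CTC blank token


def ctc_collapse(tokens, blank=BLANK):
    """Collapse a frame-level CTC output into the final token sequence."""
    # Step 1: merge runs of consecutive identical tokens into one token.
    merged = [tok for tok, _ in groupby(tokens)]
    # Step 2: drop every blank token.
    return [tok for tok in merged if tok != blank]


# One token per acoustic frame; the blank separates the two real "l"s.
frames = list("hhφeφllφloo")
print("".join(ctc_collapse(frames)))  # prints: hello
```

Note the order of the two steps: because duplicates are merged before blanks are removed, a blank between two identical tokens (the `l φ l` above) is what allows a repeated letter to survive in the output.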