# Automatic Speech Recognition Program
https://hackmd.io/8_C9xjpbQgaFkCFpdRei4g?both
Deep Learning: Theory and Applications (深度學習理論與應用)
https://github.com/MarkWuNLP/SemanticMask
https://awesomeopensource.com/
`ssh yh@140.113.170.49 -p 20203`
---
## Content Table
* [Presented and Preparing Papers](#Presented-and-Preparing-Papers)
* [Dataset preprocessing by Sean](#Dataset-preprocessing-by-Sean)
* [OpenASR (low resource)](#OpenASR-low-resource)
* [Lenny & Sean & Huang's Model](#Model)
* [State-of-the-art papers (Papers with Code)](#State-Of-The-Art)
* [ASR Frameworks](#ASR-Frameworks)
* [ASR Papers](#ASR-Papers)
* [ASR Conferences](#ASR-Conferences)
* [ASR Paper Summary](#ASR-Paper-Summary)
* [Links](#Links)
---
## Presented and Preparing Papers
**毓涵 Presented**
| Paper |
| --------------------------------------------------------------------------------------------------------------------------------- |
| [vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations (ICLR 2020)](https://openreview.net/forum?id=rylwJxrYDS)|
|[wav2vec: Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/1904.05862)|
|[wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477)|
|[Wav2Letter: an End-to-End ConvNet-based Speech Recognition System](https://arxiv.org/abs/1609.03193)|
|[Word-level Speech Recognition with a Letter to Word Encoder (ICML 2020)](https://arxiv.org/abs/1906.04323)|
**毓涵 Preparing**
|Paper|
|:----|
|Sparse Sinkhorn Attention (ICML 2020)|
|Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes|
|Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture|
|Training ASR models by generation of contextual information|
|Effectiveness of self-supervised pre-training for speech recognition|
|An attention-based joint acoustic and text on-device end-to-end model|
<!-- https://hackmd.io/oW_shU5-TPqt661UkYCPpw -->
___
**Shang-En Presented**
|Paper|
|:----|
| [Deep Speech 1: Scaling up end-to-end Speech Recognition](https://arxiv.org/pdf/1412.5567.pdf) |
**Shang-En preparing**
|Paper|
|:----|
| [End-to-End Multi-Lingual Multi-Speaker Speech Recognition](https://openreview.net/pdf?id=HJxqMhC5YQ) |
| [English Conversational Telephone Speech Recognition by Humans and Machines](https://arxiv.org/pdf/1703.02136.pdf) |
| [Deep Speech 2: End-to-End Speech Recognition in English and Mandarin](http://proceedings.mlr.press/v48/amodei16.pdf) [code](https://github.com/tensorflow/models/tree/master/research/deep_speech)|
| [First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs](https://arxiv.org/pdf/1408.2873.pdf)|
|[Sequence Transduction with Recurrent Neural Networks](https://arxiv.org/pdf/1211.3711.pdf)|
|[TRANSFORMER TRANSDUCER: A STREAMABLE SPEECH RECOGNITION MODEL WITH TRANSFORMER ENCODERS AND RNN-T LOSS](https://arxiv.org/pdf/2002.02562.pdf)|
|[ESP on CHiME6](https://github.com/espnet/espnet/blob/master/egs/chime6/asr1/RESULTS.md?fbclid=IwAR0WHUQJS6MpAmt_P9UHluu7MscIt8yC2YiVhKIjxOtV3ML_t0Vq8LVQStY)|
___
**意麟 Presented**
|Paper|
|:----|
| [CP-GAN: Context Pyramid Generative Adversarial Network for Speech Enhancement](https://ieeexplore.ieee.org/document/9054060) |
**意麟 preparing**
|Paper|
|:----|
| [Improving Voice Separation by Incorporating End-To-End Speech Recognition](https://ieeexplore.ieee.org/document/9053845) |
| SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition |
| Transformer with Bidirectional Decoder for Speech Recognition |
| Very Deep Self-Attention Networks for End-to-End Speech Recognition |
| Two-pass End-to-End Speech Recognition |
| END-TO-END ASR: FROM SUPERVISED TO SEMI-SUPERVISED LEARNING WITH MODERN ARCHITECTURES |
| IMPROVING CTC USING STIMULATED LEARNING FOR SEQUENCE MODELING |
| "Very Deep Self-Attention Networks for End-to-End Speech Recognition |
| Vectorized beam search for CTC-attention-based speech recognition |
---
<!-- **ASR框架整理** -->
<!-- |Framework|Language|Institution/Developer|
|:-------:|:------:|:-------------------:|
|[wav2letter++](https://github.com/facebookresearch/wav2letter/tree/v0.2)|C++|Facebook Research|
|[Espnet 2](https://espnet.github.io/espnet/espnet2_tutorial.html) |Pytorch|Espnet|
|[End-to-end-ASR-Pytorch](https://github.com/Alexander-H-Liu/End-to-end-ASR-Pytorch)|pytorch|XenderLiu|
|[LAS_Mandarin_PyTorch](https://github.com/jackaduma/LAS_Mandarin_PyTorch)|Pytorch|Kun Ma|
|[vq-wav2vec](https://github.com/pytorch/fairseq/tree/master/examples/wav2vec)|Pytorch|Facebook AI Research| -->
<!-- |[ASRT Speech Recognition](https://github.com/nl8590687/ASRT_SpeechRecognition)|Python Tensorflow (pure)|AI柠檬|
|[Lingvo](https://github.com/tensorflow/lingvo)|Python Tensorflow (pure)|Tensorflow|
|[Pytorch Kaldi](https://github.com/mravanelli/pytorch-kaldi)| Python Pytorch M. Ravanelli, T. Parcollet, Y. Bengio |
|[OpenSeq2Seq](https://github.com/NVIDIA/OpenSeq2Seq)|Python Tensorflow (pure)|NVIDIA|
|[chinese-ASR工业落地的中文语音识别系统](https://taorui-plus.github.io/chinese-ASR/)|毓涵幫忙看QQ (看起來還沒完成)|意麟找的|
|[PYCHAIN](https://arxiv.org/pdf/2005.09824v1.pdf)| [github點我](https://github.com/YiwenShaoStephen)|
|[Espresso: A Fast End-to-end Neural Speech Recognition Toolkit]((https://arxiv.org/abs/1909.08723))|[github點我](https://github.com/freewym/espresso)|意麟找的|
|[一个超容易上手的端到端开源语音识别项目--TensorflowASR](https://zhuanlan.zhihu.com/p/182848910?fbclid=IwAR0rOKSybuwqLeyCWx6I3cdM9hnt7VNZIm4X83_32kHb-LvhGC-j7Cf3-UI) |[github](https://github.com/Z-yq/TensorflowASR)|
|[Awesome End-to-End Speech Recognition](https://github.com/charlesliucn/awesome-end2end-speech-recognition#toolkits)|||
|[NeuralSP: Neural network based Speech Processing]x(https://github.com/hirofumi0810/neural_sp)||意麟找的|
|https://github.com/gentaiscool/end2end-asr-pytorch|||
|https://github.com/jackjhliu/End-to-End-Mandarin-ASR||| -->
> Reference: [Github Topic (Speech Recognition)](https://github.com/topics/speech-recognition?l=python)
> Reference: [Papers with code (End-To-End Speech Recognition)](https://paperswithcode.com/task/end-to-end-speech-recognition)
### [Dataset preprocessing by Sean](https://hackmd.io/CrfPBDVaT22voNajEVvqnA)
### OpenASR (low resource)
[OpenASR Challenge](https://www.nist.gov/itl/iad/mig/openasr-challenge)
* Task

* Metrics
> [Word error rate (WER)](https://en.wikipedia.org/wiki/Word_error_rate) (see the sketch after this list)
> Time and memory resources
> Training conditions
>> Constrained
>> Unconstrained
> Languages
> ...
* Schedule
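For reference, WER counts word-level substitutions, deletions, and insertions against the reference transcript. Below is a minimal sketch using a plain word-level Levenshtein distance; it is only an illustration, not the official NIST scoring tools used in the challenge.

```python
import numpy as np

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / #reference words,
    computed with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=np.int32)
    d[:, 0] = np.arange(len(ref) + 1)   # cost of deleting every reference word
    d[0, :] = np.arange(len(hyp) + 1)   # cost of inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(ref), len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```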

## Model
### Wang
[LAS](https://hackmd.io/qPxQax_7RIi5g3O_7xRSzA)
[paper](https://hackmd.io/0BayeTpCSyqzLvx6g4byAg)
### Lee
[hyperlink]()
### Huang
[Adaptive Sparse Transformer (hackmd)](https://hackmd.io/@AndyHuang/SkfIX__Wv)
[Adaptive Sparse Transformer (overleaf)](https://www.overleaf.com/read/mtvvjzwtttzs)
## State Of The Art
https://paperswithcode.com/task/speech-recognition
## ASR Frameworks
|Framework|Language|Institution/Developer|
|:-------:|:------:|:-------------------:|
|[wav2letter++](https://github.com/facebookresearch/wav2letter/tree/v0.2)|C++|Facebook Research|
|[Espnet](https://github.com/espnet/espnet) |Pytorch|Espnet|
|[End-to-end-ASR-Pytorch](https://github.com/Alexander-H-Liu/End-to-end-ASR-Pytorch)|Pytorch|XenderLiu|
|[LAS_Mandarin_PyTorch](https://github.com/jackaduma/LAS_Mandarin_PyTorch)|Pytorch|Kun Ma|
|[vq-wav2vec](https://github.com/pytorch/fairseq/tree/master/examples/wav2vec)|Pytorch|Facebook AI Research|
## ASR Papers
| | Paper | Conference | Authors |
|:-----:|:-----------------------------------------------------------------------------------------------------------------------:|:----------------:|:------------:|
| Huang | [Streaming End-to-end Speech Recognition for Mobile Devices](https://arxiv.org/abs/1811.06621) | ICASSP 2019 | Google |
| Huang | [Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes](https://arxiv.org/abs/1811.09021) | ICASSP 2019 | Google |
| 3 | Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition | ICASSP 2019 | Amazon |
| Huang | [Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model](https://arxiv.org/abs/1909.05330) | Interspeech 2019 | Google |
| Sean | [Self-Training for End-to-End Speech Recognition](https://arxiv.org/pdf/1909.09116.pdf) | ICASSP 2020 | Facebook AI |
| Sean | [Transformer-based Acoustic Modeling for Hybrid Speech Recognition](https://arxiv.org/pdf/1910.09799.pdf) | ICASSP 2020 | Facebook AI |
| 7 | Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture | ICASSP 2020 | |
| 8 | Multi-task self-supervised learning for Robust Speech Recognition | ICASSP 2020 | |
| 9 | End-to-End Multi-speaker Speech Recognition with Transformer | ICASSP 2020 | |
| Huang | [Word-level Speech Recognition with a Letter to Word Encoder](https://arxiv.org/abs/1906.04323) | ICML 2020 | |
| Huang | [vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations](https://arxiv.org/abs/1910.05453v1) | ICLR 2020 | |
| Huang | [wav2vec: Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/1904.05862) | Interspeech 2019 | |
| 13 | Contextnet: Improving convolutional neural networks for automatic speech recognition with global context | Interspeech 2020 | |
| Sean | [Conformer: Convolution-augmented transformer for speech recognition](https://arxiv.org/pdf/2005.08100.pdf) | Interspeech 2020 | Google |
| 15 | Specaugment on large scale datasets | ICASSP 2020 | |
| 16 | Speech sentiment analysis via pre-trained features from end-to-end asr models | ICASSP 2020 | |
| 17 | An attention-based joint acoustic and text on-device end-to-end model | ICASSP 2020 | |
| 18 | Two-pass end-to-end speech recognition | Interspeech 2019 | |
| 19 | Semi-supervised training for end-to-end models via weak distillation | ICASSP 2019 | |
| 20 | Imperceptible, robust, and targeted adversarial examples for automatic speech recognition | ICML 2020 | |
| 21 | A spelling correction model for end-to-end speech recognition | ICASSP 2019 | |
| lenny | [Compression of End-to-End Models](https://pdfs.semanticscholar.org/7e05/eb04d83b07014e7b2018666358ff5b9432a7.pdf) | Interspeech 2018 | |
| 23 | Improving the Performance of Online Neural Transducer Models | ICASSP 2018 | |
| 24 | Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions | Interspeech 2019 | Facebook |
| Sean | [Who Needs Words? Lexicon-Free Speech Recognition](https://arxiv.org/pdf/1904.04479.pdf) | Interspeech 2019 | Facebook |
| 26 | Libri-Light: A (large) data set for ASR with limited or no supervision | ICASSP 2020 | Facebook |
| Huang | [Effectiveness of self-supervised pre-training for speech recognition](https://arxiv.org/abs/1911.03912) | ICASSP 2020 | Facebook |
| 28 | OOV recovery with efficient 2nd pass decoding and open-vocabulary word-level RNNLM rescoring for ASR | ICASSP 2020 | Facebook |
| 29 | Spatial attention for far-field speech recognition with deep beamforming neural networks | ICASSP 2020 | Facebook |
| 30 | Training ASR models by generation of contextual information | ICASSP 2020 | Facebook |
| 31 | Improving Voice Separation by Incorporating End-To-End Speech Recognition | ICASSP 2020 | ? |
| lenny | [Transformer with Bidirectional Decoder for Speech Recognition](https://arxiv.org/abs/2008.04481) | Interspeech 2020 | Tsinghua (CN) |
| 33 | End-to-End ASR: From Supervised to Semi-Supervised Learning with Modern Architectures | ICML 2020 | Facebook |
| 34 | Improving CTC Using Stimulated Learning for Sequence Modeling | ICASSP 2019 | ? |
| 35 | Vectorized beam search for CTC-attention-based speech recognition | ? | ? |
| lenny | [Iterative Compression of End-to-End ASR Model using AutoML](https://arxiv.org/abs/2008.02897) | Interspeech 2020 | Samsung |
| lenny | [Automatic Compression of Subtitles with Neural Networks and its Effect on User Experience](https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1750.pdf) | Interspeech 2019 | |
| lenny | [SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/pdf/1904.08779.pdf) | Interspeech 2019 | Google |
> [Reference (Tensorflow/lingvo)](https://github.com/tensorflow/lingvo/blob/master/PUBLICATIONS.md)
## ASR Conferences
1. ASRU
2. ASLT
3. ICASSP
4. InterSpeech
## ASR Paper Summary
### Summary Table
|Type|Counts|
|:--:|:----:|
|Real-time and Model Compression|4|
|Self- and Semi-supervised Training|4|
|Multilingual|2|
|Output Encoding|2|
|Transformer-based|2|
|CNN-based|1|
---
### Example
1. **Paper Title** (Conference)
* **Architecture:** <br/>
* Descriptions
    * **Methodology or Advantage:** <br/>
* Descriptions
* **Dataset (optional):**<br/>
* Descriptions
* **Keyword:** <br/>
* Descriptions
---
### Real-time and Model Compression
1. **Streaming End-to-end Speech Recognition for Mobile Devices** (ICASSP 2019)
* **Architecture:**<br/>
* Recurrent neural network transducer.
* LSTM
    * **Methodology or Advantage:**<br/>
        * Outperforms a conventional CTC-based model in terms of both latency and accuracy, achieving 0.51 RT90 (real-time factor at the 90th percentile).
        * Add a time-reduction layer in the encoder to speed up training and inference.
        * Train with word-piece subword units, which outperform graphemes in their experiments.
        * Use different threads for the encoder and the prediction network to enable pipelining through asynchrony in order to save time.
        * Quantize parameters from 32-bit floating-point precision to 8-bit fixed-point (see the quantization sketch at the end of this subsection).
* **Dataset:**<br/>
* 14.8K voice search (VS) utterances extracted from Google traffic.
* 15.7K dictation utterances, which refer to as the IME test set.
* **Keyword:** <br/>
* Real-time, RNN-T, Speech Recognition
2. **Compression of End-to-End Models** (Interspeech 2018)
    * **Methodology or Advantage:** <br/>
* This work explores the problem of compressing end-to-end models with the goal of satisfying device constraints without sacrificing model accuracy.
* **Keyword:** <br/>
* Matrix factorization, knowledge distillation, parameter , Speech Recognition
3. **Iterative Compression of End-to-End ASR Model using AutoML** (Interspeech 2020)
    * **Methodology or Advantage:** <br/>
* Increasing demand for on-device Automatic Speech Recognition (ASR) systems has resulted in renewed interests in developing automatic model compression techniques.
* AutoML-based Low Rank Factorization (LRF) technique
        * Proposes an iterative AutoML-based LRF approach that achieves over 5× compression without degrading the WER, thereby advancing the state of the art in ASR compression (see the factorization sketch at the end of this subsection).
* **Keyword:** <br/>
* On-device, AutoML, Compression, Speech Recognition
4. **Automatic Compression of Subtitles with Neural Networks** (Interspeech 2019)
* **Architecture:**<br/>
* LSTM
* Encoder+Decoder
    * **Methodology or Advantage:** <br/>
* Automatic sentence compression
* for fast speech or limited screen size, it might be advantageous to compress the subtitles to their most relevant content.
* **Keyword:** <br/>
* LSTM, Compression, Speech Recognition
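For item 1 above, the 32-bit-float to 8-bit step can be pictured with a generic per-tensor symmetric quantizer. This is only an illustration; the paper's actual fixed-point scheme may differ.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric post-training quantization: map float32 weights to int8
    with a single per-tensor scale (a rough illustration of 32-bit -> 8-bit)."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())  # small reconstruction error, 4x less memory
```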
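For item 3, Low Rank Factorization (LRF) replaces a weight matrix with two thin factors. A sketch via truncated SVD follows; the AutoML search over per-layer ranks described in the paper is not shown.

```python
import numpy as np

def low_rank_factorize(w: np.ndarray, rank: int):
    """Replace an (m x n) weight matrix by two factors (m x r)(r x n) via truncated SVD.
    Parameter count drops from m*n to r*(m+n) when r is small."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]        # (m, r)
    b = vt[:rank, :]                  # (r, n)
    return a, b

w = np.random.randn(512, 512)
a, b = low_rank_factorize(w, rank=64)
print(w.size, a.size + b.size)                        # 262144 vs 65536 parameters (4x smaller)
print(np.linalg.norm(w - a @ b) / np.linalg.norm(w))  # relative approximation error
```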
---
### Multilingual
1. **Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes** (ICASSP 2019)
* **Architecture:**<br/>
* Audio-to-Byte model is based on the Listen, Attend and Spell (LAS).
* Byte-to-Audio model is based on Tacotron 2.
    * **Methodology or Advantage:** <br/>
* Output target changed from graphemes to Unicode bytes.
* Generates the text sequence one Unicode byte at a time.
        * Any script of any language representable by Unicode can be represented as a byte sequence, with no need to change the existing model structure (see the byte-target sketch at the end of this subsection).
* **Dataset:**<br/>
* English + Japanese + Spanish + Korean
* **Keyword:** <br/>
* Multilingual, Speech Recognition, LAS, Tacotron 2, Byte-encoding
2. **Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model** (Interspeech 2019)
* **Architecture:** <br/>
* RNN-T
    * **Methodology or Advantage:** <br/>
* E2E multilingual system which is equipped to operate in low-latency interactive applications.
* Deal with imbalanced multilingual data by
* Data sampling
            * Conditioning on a language vector (see the sketch at the end of this subsection)
* Adapter modules
* **Dataset:**<br/>
* 9 languages. Hindi, Marathi, ...
* **Keyword:** <br/>
        * Multilingual, Speech Recognition, RNN-T, Imbalanced data, Low-latency
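For item 1 above (Bytes are All You Need), the output targets are plain UTF-8 bytes, so a single 256-way softmax covers every language. A tiny illustration:

```python
def text_to_byte_targets(text: str):
    """Represent any Unicode text as a sequence of UTF-8 byte IDs (0-255),
    so one 256-way output layer covers every script."""
    return list(text.encode("utf-8"))

def byte_targets_to_text(ids):
    return bytes(ids).decode("utf-8", errors="replace")

print(text_to_byte_targets("cat"))            # [99, 97, 116]  one byte per ASCII letter
print(text_to_byte_targets("猫"))              # [231, 140, 171]  three bytes for one CJK character
print(byte_targets_to_text([231, 140, 171]))  # back to the original character
```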
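For item 2, one way to picture "conditioning on a language vector" is appending a learned language embedding to every input frame. The dimensions below are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

num_languages, feat_dim, lang_dim = 9, 80, 16
lang_embedding = nn.Embedding(num_languages, lang_dim)   # learned per-language vector

feats = torch.randn(4, 300, feat_dim)                    # (batch, frames, log-mel features)
lang_id = torch.tensor([0, 3, 3, 7])                     # per-utterance language index
lang_vec = lang_embedding(lang_id)                       # (batch, lang_dim)
# broadcast the language vector over time and concatenate with every frame
conditioned = torch.cat([feats, lang_vec[:, None, :].expand(-1, feats.size(1), -1)], dim=-1)
print(conditioned.shape)                                 # torch.Size([4, 300, 96])
```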
---
### Output Encoding
1. **Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes** (ICASSP 2019)
* Same paper described in multilingual section.
2. **Word-level Speech Recognition with a Letter to Word Encoder** (ICML 2020)
* **Architecture:** <br/>
        * Transformer, CTC
    * **Methodology or Advantage:** <br/>
* Propose a direct-to-word sequence model which uses a word network to learn word embeddings from letters.
* The word network can be integrated seamlessly with arbitrary sequence models including Connectionist Temporal Classification and encoder-decoder models with attention.
        * Decoding uses beam search that combines the acoustic model score, a language model score, and a word insertion weight (see the scoring sketch after this list).
* **Dataset:**<br/>
* LibriSpeech
* **Keyword:** <br/>
* Transformer, CTC, Attention, Word Embedding, Speech Recognition
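The "acoustic model + language model + insertion weight" decoding in item 2 can be summarized as a per-hypothesis beam-search score. The weights below are illustrative hyper-parameters, not values from the paper.

```python
def beam_score(log_p_am: float, log_p_lm: float, num_words: int,
               lm_weight: float = 0.8, word_insertion: float = 0.5) -> float:
    """Hypothesis score during beam search: acoustic log-probability plus a weighted
    language-model log-probability plus a per-word insertion bonus (illustrative weights)."""
    return log_p_am + lm_weight * log_p_lm + word_insertion * num_words

# rank two competing hypotheses for the same audio (made-up log-probabilities)
hyps = {
    "the cat sat": (-12.0, -5.1),
    "the cats at": (-11.5, -8.7),
}
best = max(hyps, key=lambda h: beam_score(hyps[h][0], hyps[h][1], len(h.split())))
print(best)  # "the cat sat"
```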
---
### Self- and Semi-supervised Training
1. **vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations** (ICLR 2020)
* **Architecture:** <br/>
* wav2letter
* BERT
    * **Methodology or Advantage:** <br/>
* Learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task.
        * The algorithm uses either a Gumbel softmax or online k-means clustering to quantize the dense representations (see the Gumbel-softmax sketch at the end of this subsection).
* Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
* **Dataset:**<br/>
* TIMIT
* WSJ
* **Keyword:** <br/>
* BERT, wav2letter, Self-supervised, Discrete, Speech Recognition
2. **wav2vec: Unsupervised Pre-training for Speech Recognition** (InterSpeech 2019)
* **Architecture:** <br/>
* CNN
    * **Methodology or Advantage:** <br/>
* Explore unsupervised pre-training for speech recognition by learning representations of raw audio.
        * We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task (see the contrastive-loss sketch at the end of this subsection).
* Experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data is available.
* Outperforms Deep Speech 2, the best reported character-based system in the literature while using two orders of magnitude less labeled training data.
* **Dataset:**<br/>
* WSJ
* Librispeech
* TIMIT
* **Keyword:** <br/>
* Unsupervised, pre-training, CNN, Speech Recognition
<!-- 3. **Semi-supervised training for end-to-end models via weak distillation** (ICASSP 2019)
* **Architecture:** <br/>
* Descriptions
* Descriptions
* **Methodology or Adavantage:** <br/>
* Descriptions
* **Dataset:**<br/>
* Descriptions
* **Keyword:** <br/>
* Descriptions -->
3. **Effectiveness of self-supervised pre-training for speech recognition** (ICASSP 2020)
* **Architecture:** <br/>
* wav2vec
* Discrete BERT
* Continuous BERT
    * **Methodology or Advantage:** <br/>
        * Directly fine-tune the pre-trained BERT models on transcribed speech using a Connectionist Temporal Classification (CTC) loss instead of feeding the representations into a task-specific model (see the CTC fine-tuning sketch at the end of this subsection).
* Also propose a BERT-style model learning directly from the continuous audio data and compare pre-training on raw audio to spectral features.
* **Dataset:**<br/>
* Librispeech
* **Keyword:** <br/>
* BERT, fine-tune, self-supervised, Speech Recognition
4. **Self-Training for End-to-End Speech Recognition** (ICASSP 2020)
* **Architecture:** <br/>

    * **Methodology or Advantage:** <br/>
        * Train a language model (LM) on unlabeled text Y, then use the LM together with the acoustic model (AM) to generate pseudo labels for unlabeled acoustic input X. Finally, retrain the AM on the resulting (X, pseudo-label) pairs (see the sketch at the end of this subsection).
* **Keyword:** <br/>
* Self-training, CNN, RNN, Multi-head Attention, Speech Recognition
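For item 1 (vq-wav2vec), a minimal sketch of Gumbel-softmax quantization with a single codebook and fixed temperature; the paper itself uses multiple codebook groups and temperature annealing.

```python
import torch
import torch.nn.functional as F

class GumbelQuantizer(torch.nn.Module):
    """Minimal Gumbel-softmax vector quantizer: pick one codebook entry per frame,
    with a straight-through estimator so gradients flow back to the logits."""
    def __init__(self, dim: int, num_codes: int = 320):
        super().__init__()
        self.to_logits = torch.nn.Linear(dim, num_codes)
        self.codebook = torch.nn.Embedding(num_codes, dim)

    def forward(self, z, tau: float = 2.0):
        logits = self.to_logits(z)                         # (batch, time, num_codes)
        one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
        return one_hot @ self.codebook.weight              # quantized features

q = GumbelQuantizer(dim=512)
z = torch.randn(4, 100, 512)
print(q(z).shape)   # torch.Size([4, 100, 512])
```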
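For item 2 (wav2vec), the noise-contrastive binary classification task roughly looks like the sketch below: true (context, future) pairs are scored against negatives sampled from other time steps. The real objective predicts several steps ahead with step-specific projections, which this simplification omits.

```python
import torch

def contrastive_loss(context, future, num_negatives: int = 10):
    """Simplified wav2vec-style objective: classify the true future representation
    for each context vector against negatives drawn from other time steps."""
    b, t, d = future.shape
    pos = torch.sigmoid((context * future).sum(-1))               # (b, t) true pairs
    neg_idx = torch.randint(0, t, (b, t, num_negatives))
    negs = future[torch.arange(b)[:, None, None], neg_idx]        # (b, t, K, d) distractors
    neg = torch.sigmoid((context.unsqueeze(2) * negs).sum(-1))    # (b, t, K)
    return -(torch.log(pos + 1e-7).mean() + torch.log(1 - neg + 1e-7).mean())

c = torch.randn(2, 50, 256)   # context network outputs
f = torch.randn(2, 50, 256)   # future latent representations
print(contrastive_loss(c, f))
```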
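For item 3, fine-tuning a pre-trained encoder directly with a CTC loss can be sketched as follows; the encoder, vocabulary size, and tensor shapes are placeholders, not the paper's models.

```python
import torch
import torch.nn as nn

# stand-in for the pre-trained (discrete/continuous) BERT-style encoder from the paper
pretrained_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True), num_layers=2)
ctc_head = nn.Linear(256, 32)            # 31 output letters + blank (index 0)
ctc_loss = nn.CTCLoss(blank=0)

features = torch.randn(8, 200, 256)      # (batch, frames, dim) pre-trained representations
targets = torch.randint(1, 32, (8, 40))  # letter IDs (no blanks)
log_probs = ctc_head(pretrained_encoder(features)).log_softmax(-1).transpose(0, 1)  # (T, B, C)
loss = ctc_loss(log_probs, targets,
                input_lengths=torch.full((8,), 200, dtype=torch.long),
                target_lengths=torch.full((8,), 40, dtype=torch.long))
loss.backward()
print(loss.item())
```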
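For item 4, the self-training recipe as a compact sketch; `train_lm`, `train_am`, and `decode` are hypothetical callables supplied by the caller, not functions from the paper's code.

```python
def self_training(labeled_pairs, unlabeled_text, unlabeled_audio,
                  train_lm, train_am, decode):
    """Sketch of the self-training recipe described above (hypothetical interfaces)."""
    lm = train_lm(unlabeled_text)                               # 1. LM on unpaired text Y
    am = train_am(labeled_pairs)                                # 2. seed AM on the labeled set
    pseudo = [(x, decode(am, lm, x)) for x in unlabeled_audio]  # 3. pseudo-label unpaired audio X
    return train_am(labeled_pairs + pseudo)                     # 4. retrain AM on (X, pseudo-label)
```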
---
### Transformer-based
1. **Transformer-based Acoustic Modeling for Hybrid Speech Recognition** (ICASSP 2020)
* **Architecture:** <br/>
        * Deep Transformer; comparison between Transformer and BiLSTM
    * **Methodology or Advantage:** <br/>
* Add LayerNorm into Transformer Architecture.
* **Keyword:** <br/>
* Transformer, BiLSTM, Layer Norm, Speech Recognition
2. **Transformer with Bidirectional Decoder for Speech Recognition** (Interspeech 2020)
* **Architecture:**<br/>
* Speech transformer
        * End-to-end ASR
* Bidirectional decoder
    * **Methodology or Advantage:** <br/>
        * Conventional transformer-based approaches usually generate the output sequence token by token from left to right; this work introduces a bidirectional speech transformer that exploits both directional contexts simultaneously.
* We propose a novel speech transformer with bidirectional decoder to exploit the bidirectional context information for ASR task.
* We investigate the effectiveness of the bidirectional target and bidirectional decoding with comprehensive experiments.
        * The method achieves a 3.6% relative CER reduction; the best model, STBD-big, reaches 6.64% CER on the AISHELL-1 dataset by a large margin.
* **Dataset:**<br/>
* AISHELL-1
* **Keyword:** <br/>
* Speech Transformer, Bidirectional, Attention, End2end, Speech Recognition
---
### CNN-based
1. **Conformer: Convolution-augmented transformer for speech recognition** (Interspeech 2020)
* **Architecture:** <br/>
* CNN
* Transformer
    * **Methodology or Advantage:** <br/>
        * Uses CNNs to extract local information inside each Conformer (Transformer) block.
        * The convolution block uses depthwise convolution and the Swish activation, together with extensive residual connections (see the sketch below).
* **Keyword:** <br/>
* CNN, Transformer, Depthwise Conv, Speech Recognition
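A rough sketch of the Conformer convolution module described above; the real module also includes an initial LayerNorm and dropout, which are omitted here.

```python
import torch
import torch.nn as nn

class ConformerConvModule(nn.Module):
    """Sketch of the Conformer convolution module: pointwise conv + GLU, depthwise conv,
    BatchNorm, Swish (SiLU), pointwise conv, wrapped in a residual connection."""
    def __init__(self, dim: int = 256, kernel_size: int = 31):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, 2 * dim, 1),
            nn.GLU(dim=1),
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),  # depthwise
            nn.BatchNorm1d(dim),
            nn.SiLU(),                    # Swish activation
            nn.Conv1d(dim, dim, 1),
        )

    def forward(self, x):                 # x: (batch, time, dim)
        return x + self.net(x.transpose(1, 2)).transpose(1, 2)   # residual connection

m = ConformerConvModule()
print(m(torch.randn(2, 100, 256)).shape)  # torch.Size([2, 100, 256])
```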
---
## Links
[Linguistic Data Consortium](https://www.ldc.upenn.edu)