# Automatic Speech Recognition Program
https://hackmd.io/8_C9xjpbQgaFkCFpdRei4g?both
Deep Learning: Theory and Applications (深度學習理論與應用)
https://github.com/MarkWuNLP/SemanticMask
https://awesomeopensource.com/
`ssh yh@140.113.170.49 -p 20203`
---
## Content Table
* [Presented and Preparing Papers](#Presented-and-Preparing-Papers)
* [Dataset preprocessing by Sean](#Dataset-preprocessing-by-Sean)
* [OpenASR (low resource)](#OpenASR-low-resource)
* [Lenny & Sean & Huang's Model](#Model)
* [State-of-the-art papers (Papers with Code)](#State-Of-The-Art)
* [ASR Frameworks](#ASR-Frameworks)
* [ASR Papers](#ASR-Papers)
* [ASR Conferences](#ASR-Conferences)
* [ASR Paper Summary](#ASR-Paper-Summary)
* [Links](#Links)
---
## Presented and Preparing Papers
**毓涵 Presented**
| Paper |
| --------------------------------------------------------------------------------------------------------------------------------- |
| [vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations (ICLR 2020)](https://openreview.net/forum?id=rylwJxrYDS)|
|[wav2vec: Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/1904.05862)|
|[wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477)|
|[Wav2Letter: an End-to-End ConvNet-based Speech Recognition System](https://arxiv.org/abs/1609.03193)|
|[Word-level Speech Recognition with a Letter to Word Encoder (ICML 2020)](https://arxiv.org/abs/1906.04323)|
**毓涵 Preparing**
|Paper|
|:----|
|Sparse Sinkhorn Attention (ICML 2020)|
|Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes|
|Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture|
|Training ASR models by generation of contextual information|
|Effectiveness of self-supervised pre-training for speech recognition|
|An attention-based joint acoustic and text on-device end-to-end model|
<!-- https://hackmd.io/oW_shU5-TPqt661UkYCPpw -->
___
**Shang-En Presented**
|Paper|
|:----|
| [Deep Speech 1: Scaling up end-to-end Speech Recognition](https://arxiv.org/pdf/1412.5567.pdf) |
**Shang-En preparing**
|Paper|
|:----|
| [End-to-End Multi-Lingual Multi-Speaker Speech Recognition](https://openreview.net/pdf?id=HJxqMhC5YQ) |
| [English Conversational Telephone Speech Recognition by Humans and Machines](https://arxiv.org/pdf/1703.02136.pdf) |
| [Deep Speech 2: End-to-End Speech Recognition in English and Mandarin](http://proceedings.mlr.press/v48/amodei16.pdf) [code](https://github.com/tensorflow/models/tree/master/research/deep_speech)|
| [First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs](https://arxiv.org/pdf/1408.2873.pdf)|
|[Sequence Transduction with Recurrent Neural Networks](https://arxiv.org/pdf/1211.3711.pdf)|
|[TRANSFORMER TRANSDUCER: A STREAMABLE SPEECH RECOGNITION MODEL WITH TRANSFORMER ENCODERS AND RNN-T LOSS](https://arxiv.org/pdf/2002.02562.pdf)|
|[ESP on CHiME6](https://github.com/espnet/espnet/blob/master/egs/chime6/asr1/RESULTS.md?fbclid=IwAR0WHUQJS6MpAmt_P9UHluu7MscIt8yC2YiVhKIjxOtV3ML_t0Vq8LVQStY)|
___
**意麟 Presented**
|Paper|
|:----|
| [CP-GAN: Context Pyramid Generative Adversarial Network for Speech Enhancement](https://ieeexplore.ieee.org/document/9054060) |
**意麟 preparing**
|Paper|
|:----|
| [Improving Voice Separation by Incorporating End-To-End Speech Recognition](https://ieeexplore.ieee.org/document/9053845) |
| SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition |
| Transformer with Bidirectional Decoder for Speech Recognition |
| Very Deep Self-Attention Networks for End-to-End Speech Recognition |
| Two-pass End-to-End Speech Recognition |
| END-TO-END ASR: FROM SUPERVISED TO SEMI-SUPERVISED LEARNING WITH MODERN ARCHITECTURES |
| IMPROVING CTC USING STIMULATED LEARNING FOR SEQUENCE MODELING |
| "Very Deep Self-Attention Networks for End-to-End Speech Recognition |
| Vectorized beam search for CTC-attention-based speech recognition |
---
<!-- **ASR框架整理** -->
<!-- |Framework|Language|Institution/Developer|
|:-------:|:------:|:-------------------:|
|[wav2letter++](https://github.com/facebookresearch/wav2letter/tree/v0.2)|C++|Facebook Research|
|[Espnet 2](https://espnet.github.io/espnet/espnet2_tutorial.html) |Pytorch|Espnet|
|[End-to-end-ASR-Pytorch](https://github.com/Alexander-H-Liu/End-to-end-ASR-Pytorch)|pytorch|XenderLiu|
|[LAS_Mandarin_PyTorch](https://github.com/jackaduma/LAS_Mandarin_PyTorch)|Pytorch|Kun Ma|
|[vq-wav2vec](https://github.com/pytorch/fairseq/tree/master/examples/wav2vec)|Pytorch|Facebook AI Research| -->
<!-- |[ASRT Speech Recognition](https://github.com/nl8590687/ASRT_SpeechRecognition)|Python Tensorflow (pure)|AI柠檬|
|[Lingvo](https://github.com/tensorflow/lingvo)|Python Tensorflow (pure)|Tensorflow|
|[Pytorch Kaldi](https://github.com/mravanelli/pytorch-kaldi)| Python Pytorch M. Ravanelli, T. Parcollet, Y. Bengio |
|[OpenSeq2Seq](https://github.com/NVIDIA/OpenSeq2Seq)|Python Tensorflow (pure)|NVIDIA|
|[chinese-ASR工业落地的中文语音识别系统](https://taorui-plus.github.io/chinese-ASR/)|毓涵幫忙看QQ (看起來還沒完成)|意麟找的|
|[PYCHAIN](https://arxiv.org/pdf/2005.09824v1.pdf)| [github點我](https://github.com/YiwenShaoStephen)|
|[Espresso: A Fast End-to-end Neural Speech Recognition Toolkit]((https://arxiv.org/abs/1909.08723))|[github點我](https://github.com/freewym/espresso)|意麟找的|
|[一个超容易上手的端到端开源语音识别项目--TensorflowASR](https://zhuanlan.zhihu.com/p/182848910?fbclid=IwAR0rOKSybuwqLeyCWx6I3cdM9hnt7VNZIm4X83_32kHb-LvhGC-j7Cf3-UI) |[github](https://github.com/Z-yq/TensorflowASR)|
|[Awesome End-to-End Speech Recognition](https://github.com/charlesliucn/awesome-end2end-speech-recognition#toolkits)|||
|[NeuralSP: Neural network based Speech Processing]x(https://github.com/hirofumi0810/neural_sp)||意麟找的|
|https://github.com/gentaiscool/end2end-asr-pytorch|||
|https://github.com/jackjhliu/End-to-End-Mandarin-ASR||| -->
> Reference: [Github Topic (Speech Recognition)](https://github.com/topics/speech-recognition?l=python)
> Reference: [Papers with code (End-To-End Speech Recognition)](https://paperswithcode.com/task/end-to-end-speech-recognition)
### [Dataset preprocessing by Sean](https://hackmd.io/CrfPBDVaT22voNajEVvqnA)
### OpenASR (low resource)
[OpenASR Challenge](https://www.nist.gov/itl/iad/mig/openasr-challenge)
* Task

* Metrics
> [Word error rate (WER)](https://en.wikipedia.org/wiki/Word_error_rate) (see the sketch after this list)
> Time and memory resources
> Training conditions
>> Constrained
>> Unconstrained
> Languages
> ...
* Schedule
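For reference, WER counts word-level substitutions, deletions, and insertions against the reference transcript. Below is a minimal sketch using a plain word-level Levenshtein distance; it is only an illustration, not the official NIST scoring tools used in the challenge.

```python
import numpy as np

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / #reference words,
    computed with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=np.int32)
    d[:, 0] = np.arange(len(ref) + 1)   # cost of deleting every reference word
    d[0, :] = np.arange(len(hyp) + 1)   # cost of inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(ref), len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```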

## Model
### Wang
[LAS](https://hackmd.io/qPxQax_7RIi5g3O_7xRSzA)
[paper](https://hackmd.io/0BayeTpCSyqzLvx6g4byAg)
### Lee
[hyperlink]()
### Huang
[Adaptive Sparse Transformer (hackmd)](https://hackmd.io/@AndyHuang/SkfIX__Wv)
[Adaptive Sparse Transformer (overleaf)](https://www.overleaf.com/read/mtvvjzwtttzs)
## State Of The Art
https://paperswithcode.com/task/speech-recognition
## ASR Frameworks
|Framework|Language|Institution/Developer|
|:-------:|:------:|:-------------------:|
|[wav2letter++](https://github.com/facebookresearch/wav2letter/tree/v0.2)|C++|Facebook Research|
|[Espnet](https://github.com/espnet/espnet) |Pytorch|Espnet|
|[End-to-end-ASR-Pytorch](https://github.com/Alexander-H-Liu/End-to-end-ASR-Pytorch)|Pytorch|XenderLiu|
|[LAS_Mandarin_PyTorch](https://github.com/jackaduma/LAS_Mandarin_PyTorch)|Pytorch|Kun Ma|
|[vq-wav2vec](https://github.com/pytorch/fairseq/tree/master/examples/wav2vec)|Pytorch|Facebook AI Research|
## ASR Papers
| | Paper | Conference | Authors |
|:-----:|:-----------------------------------------------------------------------------------------------------------------------:|:----------------:|:------------:|
| Huang | [Streaming End-to-end Speech Recognition for Mobile Devices](https://arxiv.org/abs/1811.06621) | ICASSP 2019 | Google |
| Huang | [Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes](https://arxiv.org/abs/1811.09021) | ICASSP 2019 | Google |
| 3 | Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition | ICASSP 2019 | Amazon |
| Huang | [Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model](https://arxiv.org/abs/1909.05330) | Interspeech 2019 | Google |
| Sean | [Self-Training for End-to-End Speech Recognition](https://arxiv.org/pdf/1909.09116.pdf) | ICASSP 2020 | Facebook AI |
| Sean | [Transformer-based Acoustic Modeling for Hybrid Speech Recognition](https://arxiv.org/pdf/1910.09799.pdf) | ICASSP 2020 | Facebook AI |
| 7 | Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture | ICASSP 2020 | |
| 8 | Multi-task self-supervised learning for Robust Speech Recognition | ICASSP 2020 | |
| 9 | End-to-End Multi-speaker Speech Recognition with Transformer | ICASSP 2020 | |
| Huang | [Word-level Speech Recognition with a Letter to Word Encoder](https://arxiv.org/abs/1906.04323) | ICML 2020 | |
| Huang | [vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations](https://arxiv.org/abs/1910.05453v1) | ICLR 2020 | |
| Huang | [wav2vec: Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/1904.05862) | Interspeech 2019 | |
| 13 | Contextnet: Improving convolutional neural networks for automatic speech recognition with global context | Interspeech 2020 | |
| Sean | [Conformer: Convolution-augmented transformer for speech recognition](https://arxiv.org/pdf/2005.08100.pdf) | Interspeech 2020 | Google |
| 15 | Specaugment on large scale datasets | ICASSP 2020 | |
| 16 | Speech sentiment analysis via pre-trained features from end-to-end asr models | ICASSP 2020 | |
| 17 | An attention-based joint acoustic and text on-device end-to-end model | ICASSP 2020 | |
| 18 | Two-pass end-to-end speech recognition | Interspeech 2019 | |
| 19 | Semi-supervised training for end-to-end models via weak distillation | ICASSP 2019 | |
| 20 | Imperceptible, robust, and targeted adversarial examples for automatic speech recognition | ICML 2020 | |
| 21 | A spelling correction model for end-to-end speech recognition | ICASSP 2019 | |
| lenny | [Compression of End-to-End Models](https://pdfs.semanticscholar.org/7e05/eb04d83b07014e7b2018666358ff5b9432a7.pdf) | Interspeech 2018 | |
| 23 | Improving the Performance of Online Neural Transducer Models | ICASSP 2018 | |
| 24 | Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions | Interspeech 2019 | Facebook |
| Sean | [Who Needs Words? Lexicon-Free Speech Recognition](https://arxiv.org/pdf/1904.04479.pdf) | Interspeech 2019 | Facebook |
| 26 | Libri-Light: A (large) data set for ASR with limited or no supervision | ICASSP 2020 | Facebook |
| Huang | [Effectiveness of self-supervised pre-training for speech recognition](https://arxiv.org/abs/1911.03912) | ICASSP 2020 | Facebook |
| 28 | OOV recovery with efficient 2nd pass decoding and open-vocabulary word-level RNNLM rescoring for ASR | ICASSP 2020 | Facebook |
| 29 | Spatial attention for far-field speech recognition with deep beamforming neural networks | ICASSP 2020 | Facebook |
| 30 | Training ASR models by generation of contextual information | ICASSP 2020 | Facebook |
| 31 | Improving Voice Separation by Incorporating End-To-End Speech Recognition | ICASSP 2020 | ? |
| lenny | [Transformer with Bidirectional Decoder for Speech Recognition](https://arxiv.org/abs/2008.04481) | Interspeech 2020 | Tsinghua (CN) |
| 33 | End-to-End ASR: From Supervised to Semi-Supervised Learning with Modern Architectures | ICML 2020 | Facebook |
| 34 | Improving CTC Using Stimulated Learning for Sequence Modeling | ICASSP 2019 | ? |
| 35 | Vectorized beam search for CTC-attention-based speech recognition | ? | ? |
| lenny | [Iterative Compression of End-to-End ASR Model using AutoML](https://arxiv.org/abs/2008.02897) | Interspeech 2020 | Samsung |
| lenny | [Automatic Compression of Subtitles with Neural Networks and its Effect on User Experience](https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1750.pdf) | Interspeech 2019 | |
| lenny | [SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/pdf/1904.08779.pdf) | Interspeech 2019 | Google |
> [Reference (Tensorflow/lingvo)](https://github.com/tensorflow/lingvo/blob/master/PUBLICATIONS.md)
## ASR Conferences
1. ASRU
2. ASLT
3. ICASSP
4. InterSpeech
## ASR Paper Summary
### Summary Table
|Type|Counts|
|:--:|:----:|
|Real-time and Model Compression|4|
|Self- and Semi-supervised Training|4|
|Multilingual|2|
|Output Encoding|2|
|Transformer-based|2|
|CNN-based|1|
---
### Example
1. **Paper Title** (Conference)
* **Architecture:** <br/>
* Descriptions
    * **Methodology or Advantage:** <br/>
* Descriptions
* **Dataset (optional):**<br/>
* Descriptions
* **Keyword:** <br/>
* Descriptions
---
### Real-time and Model Compression
1. **Streaming End-to-end Speech Recognition for Mobile Devices** (ICASSP 2019)
* **Architecture:**<br/>
* Recurrent neural network transducer.
* LSTM
    * **Methodology or Advantage:**<br/>
        * Outperforms a conventional CTC-based model in terms of both latency and accuracy, achieving 0.51 RT90 (real-time factor at the 90th percentile).
        * Add a time-reduction layer in the encoder to speed up training and inference.
        * Train with word-piece subword units, which outperform graphemes in their experiments.
        * Use different threads for the encoder and the prediction network to enable pipelining through asynchrony in order to save time.
        * Quantize parameters from 32-bit floating-point precision to 8-bit fixed-point (see the quantization sketch at the end of this subsection).
* **Dataset:**<br/>
* 14.8K voice search (VS) utterances extracted from Google traffic.
* 15.7K dictation utterances, which refer to as the IME test set.
* **Keyword:** <br/>
* Real-time, RNN-T, Speech Recognition
2. **Compression of End-to-End Models** (Interspeech 2018)
    * **Methodology or Advantage:** <br/>
* This work explores the problem of compressing end-to-end models with the goal of satisfying device constraints without sacrificing model accuracy.
* **Keyword:** <br/>
* Matrix factorization, knowledge distillation, parameter , Speech Recognition
3. **Iterative Compression of End-to-End ASR Model using AutoML** (Interspeech 2020)
    * **Methodology or Advantage:** <br/>
* Increasing demand for on-device Automatic Speech Recognition (ASR) systems has resulted in renewed interests in developing automatic model compression techniques.
* AutoML-based Low Rank Factorization (LRF) technique
        * Proposes an iterative AutoML-based LRF approach that achieves over 5× compression without degrading the WER, thereby advancing the state of the art in ASR compression (see the factorization sketch at the end of this subsection).
* **Keyword:** <br/>
* On-device, AutoML, Compression, Speech Recognition
4. **Automatic Compression of Subtitles with Neural Networks** (Interspeech 2019)
* **Architecture:**<br/>
* LSTM
* Encoder+Decoder
    * **Methodology or Advantage:** <br/>
* Automatic sentence compression
* for fast speech or limited screen size, it might be advantageous to compress the subtitles to their most relevant content.
* **Keyword:** <br/>
* LSTM, Compression, Speech Recognition
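For item 1 above, the 32-bit-float to 8-bit step can be pictured with a generic per-tensor symmetric quantizer. This is only an illustration; the paper's actual fixed-point scheme may differ.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric post-training quantization: map float32 weights to int8
    with a single per-tensor scale (a rough illustration of 32-bit -> 8-bit)."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())  # small reconstruction error, 4x less memory
```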
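For item 3, Low Rank Factorization (LRF) replaces a weight matrix with two thin factors. A sketch via truncated SVD follows; the AutoML search over per-layer ranks described in the paper is not shown.

```python
import numpy as np

def low_rank_factorize(w: np.ndarray, rank: int):
    """Replace an (m x n) weight matrix by two factors (m x r)(r x n) via truncated SVD.
    Parameter count drops from m*n to r*(m+n) when r is small."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]        # (m, r)
    b = vt[:rank, :]                  # (r, n)
    return a, b

w = np.random.randn(512, 512)
a, b = low_rank_factorize(w, rank=64)
print(w.size, a.size + b.size)                        # 262144 vs 65536 parameters (4x smaller)
print(np.linalg.norm(w - a @ b) / np.linalg.norm(w))  # relative approximation error
```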
---
### Multilingual
1. **Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes** (ICASSP 2019)
* **Architecture:**<br/>
* Audio-to-Byte model is based on the Listen, Attend and Spell (LAS).
* Byte-to-Audio model is based on Tacotron 2.
    * **Methodology or Advantage:** <br/>
* Output target changed from graphemes to Unicode bytes.
* Generates the text sequence one Unicode byte at a time.
        * Any script of any language representable by Unicode can be represented as a byte sequence, with no need to change the existing model structure (see the byte-target sketch at the end of this subsection).
* **Dataset:**<br/>
* English + Japanese + Spanish + Korean
* **Keyword:** <br/>
* Multilingual, Speech Recognition, LAS, Tacotron 2, Byte-encoding
2. **Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model** (Interspeech 2019)
* **Architecture:** <br/>
* RNN-T
    * **Methodology or Advantage:** <br/>
* E2E multilingual system which is equipped to operate in low-latency interactive applications.
* Deal with imbalanced multilingual data by
* Data sampling
            * Conditioning on a language vector (see the sketch at the end of this subsection)
* Adapter modules
* **Dataset:**<br/>
* 9 languages. Hindi, Marathi, ...
* **Keyword:** <br/>
        * Multilingual, Speech Recognition, RNN-T, Imbalanced data, Low-latency
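For item 1 above (Bytes are All You Need), the output targets are plain UTF-8 bytes, so a single 256-way softmax covers every language. A tiny illustration:

```python
def text_to_byte_targets(text: str):
    """Represent any Unicode text as a sequence of UTF-8 byte IDs (0-255),
    so one 256-way output layer covers every script."""
    return list(text.encode("utf-8"))

def byte_targets_to_text(ids):
    return bytes(ids).decode("utf-8", errors="replace")

print(text_to_byte_targets("cat"))            # [99, 97, 116]  one byte per ASCII letter
print(text_to_byte_targets("猫"))              # [231, 140, 171]  three bytes for one CJK character
print(byte_targets_to_text([231, 140, 171]))  # back to the original character
```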
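For item 2, one way to picture "conditioning on a language vector" is appending a learned language embedding to every input frame. The dimensions below are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

num_languages, feat_dim, lang_dim = 9, 80, 16
lang_embedding = nn.Embedding(num_languages, lang_dim)   # learned per-language vector

feats = torch.randn(4, 300, feat_dim)                    # (batch, frames, log-mel features)
lang_id = torch.tensor([0, 3, 3, 7])                     # per-utterance language index
lang_vec = lang_embedding(lang_id)                       # (batch, lang_dim)
# broadcast the language vector over time and concatenate with every frame
conditioned = torch.cat([feats, lang_vec[:, None, :].expand(-1, feats.size(1), -1)], dim=-1)
print(conditioned.shape)                                 # torch.Size([4, 300, 96])
```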
---
### Output Encoding
1. **Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes** (ICASSP 2019)
* Same paper described in multilingual section.
2. **Word-level Speech Recognition with a Letter to Word Encoder** (ICML 2020)
* **Architecture:** <br/>
        * Transformer, CTC
    * **Methodology or Advantage:** <br/>
* Propose a direct-to-word sequence model which uses a word network to learn word embeddings from letters.
* The word network can be integrated seamlessly with arbitrary sequence models including Connectionist Temporal Classification and encoder-decoder models with attention.
        * Decoding uses beam search that combines the acoustic model score, a language model score, and a word insertion weight (see the scoring sketch after this list).
* **Dataset:**<br/>
* LibriSpeech
* **Keyword:** <br/>
* Transformer, CTC, Attention, Word Embedding, Speech Recognition
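The "acoustic model + language model + insertion weight" decoding in item 2 can be summarized as a per-hypothesis beam-search score. The weights below are illustrative hyper-parameters, not values from the paper.

```python
def beam_score(log_p_am: float, log_p_lm: float, num_words: int,
               lm_weight: float = 0.8, word_insertion: float = 0.5) -> float:
    """Hypothesis score during beam search: acoustic log-probability plus a weighted
    language-model log-probability plus a per-word insertion bonus (illustrative weights)."""
    return log_p_am + lm_weight * log_p_lm + word_insertion * num_words

# rank two competing hypotheses for the same audio (made-up log-probabilities)
hyps = {
    "the cat sat": (-12.0, -5.1),
    "the cats at": (-11.5, -8.7),
}
best = max(hyps, key=lambda h: beam_score(hyps[h][0], hyps[h][1], len(h.split())))
print(best)  # "the cat sat"
```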
---
### Self- and Semi-supervised Training
1. **vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations** (ICLR 2020)
* **Architecture:** <br/>
* wav2letter
* BERT
    * **Methodology or Advantage:** <br/>
* Learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task.
        * The algorithm uses either a Gumbel softmax or online k-means clustering to quantize the dense representations (see the Gumbel-softmax sketch at the end of this subsection).
* Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
* **Dataset:**<br/>
* TIMIT
* WSJ
* **Keyword:** <br/>
* BERT, wav2letter, Self-supervised, Discrete, Speech Recognition
2. **wav2vec: Unsupervised Pre-training for Speech Recognition** (InterSpeech 2019)
* **Architecture:** <br/>
* CNN
    * **Methodology or Advantage:** <br/>
* Explore unsupervised pre-training for speech recognition by learning representations of raw audio.
        * We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task (see the contrastive-loss sketch at the end of this subsection).
* Experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data is available.
* Outperforms Deep Speech 2, the best reported character-based system in the literature while using two orders of magnitude less labeled training data.
* **Dataset:**<br/>
* WSJ
* Librispeech
* TIMIT
* **Keyword:** <br/>
* Unsupervised, pre-training, CNN, Speech Recognition
<!-- 3. **Semi-supervised training for end-to-end models via weak distillation** (ICASSP 2019)
* **Architecture:** <br/>
* Descriptions
* Descriptions
* **Methodology or Adavantage:** <br/>
* Descriptions
* **Dataset:**<br/>
* Descriptions
* **Keyword:** <br/>
* Descriptions -->
3. **Effectiveness of self-supervised pre-training for speech recognition** (ICASSP 2020)
* **Architecture:** <br/>
* wav2vec
* Discrete BERT
* Continuous BERT
    * **Methodology or Advantage:** <br/>
        * Directly fine-tune the pre-trained BERT models on transcribed speech using a Connectionist Temporal Classification (CTC) loss instead of feeding the representations into a task-specific model (see the CTC fine-tuning sketch at the end of this subsection).
* Also propose a BERT-style model learning directly from the continuous audio data and compare pre-training on raw audio to spectral features.
* **Dataset:**<br/>
* Librispeech
* **Keyword:** <br/>
* BERT, fine-tune, self-supervised, Speech Recognition
4. **Self-Training for End-to-End Speech Recognition** (ICASSP 2020)
* **Architecture:** <br/>

    * **Methodology or Advantage:** <br/>
        * Train a language model (LM) on unlabeled text Y, then use the LM together with the acoustic model (AM) to generate pseudo labels for unlabeled acoustic input X. Finally, retrain the AM on the resulting (X, pseudo-label) pairs (see the sketch at the end of this subsection).
* **Keyword:** <br/>
* Self-training, CNN, RNN, Multi-head Attention, Speech Recognition
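For item 1 (vq-wav2vec), a minimal sketch of Gumbel-softmax quantization with a single codebook and fixed temperature; the paper itself uses multiple codebook groups and temperature annealing.

```python
import torch
import torch.nn.functional as F

class GumbelQuantizer(torch.nn.Module):
    """Minimal Gumbel-softmax vector quantizer: pick one codebook entry per frame,
    with a straight-through estimator so gradients flow back to the logits."""
    def __init__(self, dim: int, num_codes: int = 320):
        super().__init__()
        self.to_logits = torch.nn.Linear(dim, num_codes)
        self.codebook = torch.nn.Embedding(num_codes, dim)

    def forward(self, z, tau: float = 2.0):
        logits = self.to_logits(z)                         # (batch, time, num_codes)
        one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
        return one_hot @ self.codebook.weight              # quantized features

q = GumbelQuantizer(dim=512)
z = torch.randn(4, 100, 512)
print(q(z).shape)   # torch.Size([4, 100, 512])
```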
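For item 2 (wav2vec), the noise-contrastive binary classification task roughly looks like the sketch below: true (context, future) pairs are scored against negatives sampled from other time steps. The real objective predicts several steps ahead with step-specific projections, which this simplification omits.

```python
import torch

def contrastive_loss(context, future, num_negatives: int = 10):
    """Simplified wav2vec-style objective: classify the true future representation
    for each context vector against negatives drawn from other time steps."""
    b, t, d = future.shape
    pos = torch.sigmoid((context * future).sum(-1))               # (b, t) true pairs
    neg_idx = torch.randint(0, t, (b, t, num_negatives))
    negs = future[torch.arange(b)[:, None, None], neg_idx]        # (b, t, K, d) distractors
    neg = torch.sigmoid((context.unsqueeze(2) * negs).sum(-1))    # (b, t, K)
    return -(torch.log(pos + 1e-7).mean() + torch.log(1 - neg + 1e-7).mean())

c = torch.randn(2, 50, 256)   # context network outputs
f = torch.randn(2, 50, 256)   # future latent representations
print(contrastive_loss(c, f))
```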
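For item 3, fine-tuning a pre-trained encoder directly with a CTC loss can be sketched as follows; the encoder, vocabulary size, and tensor shapes are placeholders, not the paper's models.

```python
import torch
import torch.nn as nn

# stand-in for the pre-trained (discrete/continuous) BERT-style encoder from the paper
pretrained_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True), num_layers=2)
ctc_head = nn.Linear(256, 32)            # 31 output letters + blank (index 0)
ctc_loss = nn.CTCLoss(blank=0)

features = torch.randn(8, 200, 256)      # (batch, frames, dim) pre-trained representations
targets = torch.randint(1, 32, (8, 40))  # letter IDs (no blanks)
log_probs = ctc_head(pretrained_encoder(features)).log_softmax(-1).transpose(0, 1)  # (T, B, C)
loss = ctc_loss(log_probs, targets,
                input_lengths=torch.full((8,), 200, dtype=torch.long),
                target_lengths=torch.full((8,), 40, dtype=torch.long))
loss.backward()
print(loss.item())
```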
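For item 4, the self-training recipe as a compact sketch; `train_lm`, `train_am`, and `decode` are hypothetical callables supplied by the caller, not functions from the paper's code.

```python
def self_training(labeled_pairs, unlabeled_text, unlabeled_audio,
                  train_lm, train_am, decode):
    """Sketch of the self-training recipe described above (hypothetical interfaces)."""
    lm = train_lm(unlabeled_text)                               # 1. LM on unpaired text Y
    am = train_am(labeled_pairs)                                # 2. seed AM on the labeled set
    pseudo = [(x, decode(am, lm, x)) for x in unlabeled_audio]  # 3. pseudo-label unpaired audio X
    return train_am(labeled_pairs + pseudo)                     # 4. retrain AM on (X, pseudo-label)
```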
---
### Transformer-based
1. **Transformer-based Acoustic Modeling for Hybrid Speech Recognition** (ICASSP 2020)
* **Architecture:** <br/>
        * Deep Transformer; comparison between Transformer and BiLSTM
    * **Methodology or Advantage:** <br/>
* Add LayerNorm into Transformer Architecture.
* **Keyword:** <br/>
* Transformer, BiLSTM, Layer Norm, Speech Recognition
2. **Transformer with Bidirectional Decoder for Speech Recognition** (Interspeech 2020)
* **Architecture:**<br/>
* Speech transformer
        * End-to-end ASR
* Bidirectional decoder
    * **Methodology or Advantage:** <br/>
        * Conventional transformer-based approaches usually generate the output sequence token by token from left to right; this work introduces a bidirectional speech transformer that exploits both directional contexts simultaneously.
* We propose a novel speech transformer with bidirectional decoder to exploit the bidirectional context information for ASR task.
* We investigate the effectiveness of the bidirectional target and bidirectional decoding with comprehensive experiments.
        * The method achieves a 3.6% relative CER reduction; the best model, STBD-big, reaches 6.64% CER on the AISHELL-1 dataset by a large margin.
* **Dataset:**<br/>
* AISHELL-1
* **Keyword:** <br/>
* Speech Transformer, Bidirectional, Attention, End2end, Speech Recognition
---
### CNN-based
1. **Conformer: Convolution-augmented transformer for speech recognition** (Interspeech 2020)
* **Architecture:** <br/>
* CNN
* Transformer
    * **Methodology or Advantage:** <br/>
        * Uses CNNs to extract local information inside each Conformer (Transformer) block.
        * The convolution block uses depthwise convolution and the Swish activation, together with extensive residual connections (see the sketch below).
* **Keyword:** <br/>
* CNN, Transformer, Depthwise Conv, Speech Recognition
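A rough sketch of the Conformer convolution module described above; the real module also includes an initial LayerNorm and dropout, which are omitted here.

```python
import torch
import torch.nn as nn

class ConformerConvModule(nn.Module):
    """Sketch of the Conformer convolution module: pointwise conv + GLU, depthwise conv,
    BatchNorm, Swish (SiLU), pointwise conv, wrapped in a residual connection."""
    def __init__(self, dim: int = 256, kernel_size: int = 31):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, 2 * dim, 1),
            nn.GLU(dim=1),
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),  # depthwise
            nn.BatchNorm1d(dim),
            nn.SiLU(),                    # Swish activation
            nn.Conv1d(dim, dim, 1),
        )

    def forward(self, x):                 # x: (batch, time, dim)
        return x + self.net(x.transpose(1, 2)).transpose(1, 2)   # residual connection

m = ConformerConvModule()
print(m(torch.randn(2, 100, 256)).shape)  # torch.Size([2, 100, 256])
```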
---
## Links
[Linguistic Data Consortium](https://www.ldc.upenn.edu)