# Automatic Speech Recognition Program

https://hackmd.io/8_C9xjpbQgaFkCFpdRei4g?both
深度學習理論與應用 (Deep Learning Theory and Applications)
https://github.com/MarkWuNLP/SemanticMask
https://awesomeopensource.com/

`ssh yh@140.113.170.49 -p 20203`

---

## Content Table

* [Presented and Preparing Papers](#Presented-and-Preparing-Papers)
* [Dataset preprocessing by Sean](#Dataset-preprocessing-by-Sean)
* [OpenASR (low resource)](#OpenASR-(low-resource))
* [Lenny & Sean & Huang's Model](#Model)
* [State of the art papers (papers with code)](#State-Of-The-Art)
* [ASR Frameworks](#ASR-Frameworks)
* [ASR Papers](#ASR-Papers)
* [ASR Conferences](#ASR-Conferences)
* [ASR Paper Summary](#ASR-Paper-Summary)
* [Links](#Links)

---

## Presented and Preparing Papers

**毓涵 Presented**

| Paper |
| :---- |
| [vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations (ICLR 2020)](https://openreview.net/forum?id=rylwJxrYDS) |
| [wav2vec: Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/1904.05862) |
| [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) |
| [Wav2Letter: an End-to-End ConvNet-based Speech Recognition System](https://arxiv.org/abs/1609.03193) |
| [Word-level Speech Recognition with a Letter to Word Encoder (ICML 2020)](https://arxiv.org/abs/1906.04323) |

**毓涵 Preparing**

| Paper |
| :---- |
| Sparse Sinkhorn Attention (ICML 2020) |
| Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes |
| Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture |
| Training ASR models by generation of contextual information |
| Effectiveness of self-supervised pre-training for speech recognition |
| An attention-based joint acoustic and text on-device end-to-end model |

<!-- https://hackmd.io/oW_shU5-TPqt661UkYCPpw -->

___

**Shang-En Presented**

| Paper |
| :---- |
| [Deep Speech 1: Scaling up end-to-end Speech Recognition](https://arxiv.org/pdf/1412.5567.pdf) |

**Shang-En Preparing**

| Paper |
| :---- |
| [End-to-End Multi-Lingual Multi-Speaker Speech Recognition](https://openreview.net/pdf?id=HJxqMhC5YQ) |
| [English Conversational Telephone Speech Recognition by Humans and Machines](https://arxiv.org/pdf/1703.02136.pdf) |
| [Deep Speech 2: End-to-End Speech Recognition in English and Mandarin](http://proceedings.mlr.press/v48/amodei16.pdf) ([code](https://github.com/tensorflow/models/tree/master/research/deep_speech)) |
| [First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs](https://arxiv.org/pdf/1408.2873.pdf) |
| [Sequence Transduction with Recurrent Neural Networks](https://arxiv.org/pdf/1211.3711.pdf) |
| [Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss](https://arxiv.org/pdf/2002.02562.pdf) |
| [ESPnet on CHiME6](https://github.com/espnet/espnet/blob/master/egs/chime6/asr1/RESULTS.md?fbclid=IwAR0WHUQJS6MpAmt_P9UHluu7MscIt8yC2YiVhKIjxOtV3ML_t0Vq8LVQStY) |

___

**意麟 Presented**

| Paper |
| :---- |
| [CP-GAN: Context Pyramid Generative Adversarial Network for Speech Enhancement](https://ieeexplore.ieee.org/document/9054060) |

**意麟 Preparing**

| Paper |
| :---- |
| [Improving Voice Separation by Incorporating End-To-End Speech Recognition](https://ieeexplore.ieee.org/document/9053845) |
| SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition |
| Transformer with Bidirectional Decoder for Speech Recognition |
| Very Deep Self-Attention Networks for End-to-End Speech Recognition |
| Two-pass End-to-End Speech Recognition |
| End-to-End ASR: From Supervised to Semi-Supervised Learning with Modern Architectures |
| Improving CTC Using Stimulated Learning for Sequence Modeling |
| Vectorized beam search for CTC-attention-based speech recognition |
---

<!-- **ASR Framework Summary** -->
<!--
|Framework|Language|Institution/Developer|
|:-------:|:------:|:-------------------:|
|[wav2letter++](https://github.com/facebookresearch/wav2letter/tree/v0.2)|C++|Facebook Research|
|[Espnet 2](https://espnet.github.io/espnet/espnet2_tutorial.html)|Pytorch|Espnet|
|[End-to-end-ASR-Pytorch](https://github.com/Alexander-H-Liu/End-to-end-ASR-Pytorch)|Pytorch|XenderLiu|
|[LAS_Mandarin_PyTorch](https://github.com/jackaduma/LAS_Mandarin_PyTorch)|Pytorch|Kun Ma|
|[vq-wav2vec](https://github.com/pytorch/fairseq/tree/master/examples/wav2vec)|Pytorch|Facebook AI Research|
-->
<!--
|[ASRT Speech Recognition](https://github.com/nl8590687/ASRT_SpeechRecognition)|Python Tensorflow (pure)|AI柠檬|
|[Lingvo](https://github.com/tensorflow/lingvo)|Python Tensorflow (pure)|Tensorflow|
|[Pytorch Kaldi](https://github.com/mravanelli/pytorch-kaldi)|Python Pytorch|M. Ravanelli, T. Parcollet, Y. Bengio|
|[OpenSeq2Seq](https://github.com/NVIDIA/OpenSeq2Seq)|Python Tensorflow (pure)|NVIDIA|
|[chinese-ASR: a Chinese speech recognition system for industrial deployment](https://taorui-plus.github.io/chinese-ASR/)|毓涵 please take a look (looks unfinished)|found by 意麟|
|[PYCHAIN](https://arxiv.org/pdf/2005.09824v1.pdf)|[GitHub](https://github.com/YiwenShaoStephen)|
|[Espresso: A Fast End-to-end Neural Speech Recognition Toolkit](https://arxiv.org/abs/1909.08723)|[GitHub](https://github.com/freewym/espresso)|found by 意麟|
|[TensorflowASR: a very approachable end-to-end open-source speech recognition project](https://zhuanlan.zhihu.com/p/182848910?fbclid=IwAR0rOKSybuwqLeyCWx6I3cdM9hnt7VNZIm4X83_32kHb-LvhGC-j7Cf3-UI)|[GitHub](https://github.com/Z-yq/TensorflowASR)|
|[Awesome End-to-End Speech Recognition](https://github.com/charlesliucn/awesome-end2end-speech-recognition#toolkits)|||
|[NeuralSP: Neural network based Speech Processing](https://github.com/hirofumi0810/neural_sp)||found by 意麟|
|https://github.com/gentaiscool/end2end-asr-pytorch|||
|https://github.com/jackjhliu/End-to-End-Mandarin-ASR|||
-->

> Reference: [Github Topic (Speech Recognition)](https://github.com/topics/speech-recognition?l=python)
> Reference: [Papers with code (End-To-End Speech Recognition)](https://paperswithcode.com/task/end-to-end-speech-recognition)

### [Dataset preprocessing by Sean](https://hackmd.io/CrfPBDVaT22voNajEVvqnA)

### OpenASR (low resource)

[OpenASR Challenge](https://www.nist.gov/itl/iad/mig/openasr-challenge)

* Task
  ![](https://i.imgur.com/7KSmN5x.png)
* Metrics (a minimal WER sketch follows this list)
    > [Word error rate (WER)](https://en.wikipedia.org/wiki/Word_error_rate)
    > Time and memory resources
    > Training conditions
    >> Constrained
    >> Unconstrained
    > Languages ....
* Schedule
  ![](https://i.imgur.com/i012vgQ.png)
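Since WER is the challenge's ranking metric, here is a minimal sketch of how it is computed: the word-level Levenshtein (edit) distance between hypothesis and reference, normalized by reference length. This is a generic illustration, not the challenge's scoring code; the `wer` function and example strings are ours.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.33
```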
## Model

### Wang

[LAS](https://hackmd.io/qPxQax_7RIi5g3O_7xRSzA)
[paper](https://hackmd.io/0BayeTpCSyqzLvx6g4byAg)

### Lee

[hyperlink]()

### Huang

[Adaptive Sparse Transformer (hackmd)](https://hackmd.io/@AndyHuang/SkfIX__Wv)
[Adaptive Sparse Transformer (overleaf)](https://www.overleaf.com/read/mtvvjzwtttzs)

## State Of The Art

https://paperswithcode.com/task/speech-recognition

## ASR Frameworks

|Framework|Language|Institution/Developer|
|:-------:|:------:|:-------------------:|
|[wav2letter++](https://github.com/facebookresearch/wav2letter/tree/v0.2)|C++|Facebook Research|
|[Espnet]()|Pytorch|Espnet|
|[End-to-end-ASR-Pytorch](https://github.com/Alexander-H-Liu/End-to-end-ASR-Pytorch)|Pytorch|XenderLiu|
|[LAS_Mandarin_PyTorch](https://github.com/jackaduma/LAS_Mandarin_PyTorch)|Pytorch|Kun Ma|
|[vq-wav2vec](https://github.com/pytorch/fairseq/tree/master/examples/wav2vec)|Pytorch|Facebook AI Research|
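To make the table concrete, here is a hedged sketch of the simplest end-to-end pipeline these toolkits provide: a pretrained wav2vec 2.0 CTC model plus greedy decoding. It uses torchaudio (not one of the frameworks listed above) purely for convenience; `sample.wav` is a placeholder path.

```python
import torch
import torchaudio

# Pretrained wav2vec 2.0 model fine-tuned for CTC on LibriSpeech 960h.
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()   # ('-', '|', 'E', 'T', ...); '-' is the CTC blank

waveform, sr = torchaudio.load("sample.wav")  # placeholder path; assumed mono
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emission, _ = model(waveform)             # (batch, time, num_labels)

# Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.
ids = emission[0].argmax(dim=-1).tolist()
chars, prev = [], None
for i in ids:
    if i != prev and labels[i] != "-":
        chars.append(labels[i])
    prev = i
print("".join(chars).replace("|", " "))       # '|' is the word delimiter
```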
## ASR Papers

|       | Paper | Conference | Authors |
|:-----:|:-----:|:----------:|:-------:|
| Huang | [Streaming End-to-end Speech Recognition for Mobile Devices](https://arxiv.org/abs/1811.06621) | ICASSP 2019 | Google |
| Huang | [Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes](https://arxiv.org/abs/1811.09021) | ICASSP 2019 | Google |
| 3 | Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition | ICASSP 2019 | Amazon |
| Huang | [Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model](https://arxiv.org/abs/1909.05330) | Interspeech 2019 | Google |
| Sean | [Self-Training for End-to-End Speech Recognition](https://arxiv.org/pdf/1909.09116.pdf) | ICASSP 2020 | Facebook AI |
| Sean | [Transformer-based Acoustic Modeling for Hybrid Speech Recognition](https://arxiv.org/pdf/1910.09799.pdf) | ICASSP 2020 | Facebook AI |
| 7 | Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture | ICASSP 2020 | |
| 8 | Multi-task self-supervised learning for Robust Speech Recognition | ICASSP 2020 | |
| 9 | End-to-End Multi-speaker Speech Recognition with Transformer | ICASSP 2020 | |
| Huang | [Word-level Speech Recognition with a Letter to Word Encoder](https://arxiv.org/abs/1906.04323) | ICML 2020 | |
| Huang | [vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations](https://arxiv.org/abs/1910.05453v1) | ICLR 2020 | |
| Huang | [wav2vec: Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/1904.05862) | Interspeech 2019 | |
| 13 | Contextnet: Improving convolutional neural networks for automatic speech recognition with global context | Interspeech 2020 | |
| Sean | [Conformer: Convolution-augmented transformer for speech recognition](https://arxiv.org/pdf/2005.08100.pdf) | Interspeech 2020 | Google |
| 15 | Specaugment on large scale datasets | ICASSP 2020 | |
| 16 | Speech sentiment analysis via pre-trained features from end-to-end ASR models | ICASSP 2020 | |
| 17 | An attention-based joint acoustic and text on-device end-to-end model | ICASSP 2020 | |
| 18 | Two-pass end-to-end speech recognition | Interspeech 2019 | |
| 19 | Semi-supervised training for end-to-end models via weak distillation | ICASSP 2019 | |
| 20 | Imperceptible, robust, and targeted adversarial examples for automatic speech recognition | ICML 2020 | |
| 21 | A spelling correction model for end-to-end speech recognition | ICASSP 2019 | |
| lenny | [Compression of End-to-End Models](https://pdfs.semanticscholar.org/7e05/eb04d83b07014e7b2018666358ff5b9432a7.pdf) | Interspeech 2018 | |
| 23 | Improving the Performance of Online Neural Transducer Models | ICASSP 2018 | |
| 24 | Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions | Interspeech 2019 | Facebook |
| Sean | [Who Needs Words? Lexicon-Free Speech Recognition](https://arxiv.org/pdf/1904.04479.pdf) | Interspeech 2019 | Facebook |
| 26 | Libri-Light: A (large) data set for ASR with limited or no supervision | ICASSP 2020 | Facebook |
| Huang | [Effectiveness of self-supervised pre-training for speech recognition](https://arxiv.org/abs/1911.03912) | ICASSP 2020 | Facebook |
| 28 | OOV recovery with efficient 2nd pass decoding and open-vocabulary word-level RNNLM rescoring for ASR | ICASSP 2020 | Facebook |
| 29 | Spatial attention for far-field speech recognition with deep beamforming neural networks | ICASSP 2020 | Facebook |
| 30 | Training ASR models by generation of contextual information | ICASSP 2020 | Facebook |
| 31 | Improving Voice Separation by Incorporating End-To-End Speech Recognition | ICASSP 2020 | ? |
| lenny | [Transformer with Bidirectional Decoder for Speech Recognition](https://arxiv.org/abs/2008.04481) | Interspeech 2020 | Tsinghua (CN) |
| 33 | End-to-End ASR: From Supervised to Semi-Supervised | ICML 2020 | Facebook |
| 34 | Improving CTC Using Stimulated Learning for Sequence Modeling | ICASSP 2019 | ? |
| 35 | Vectorized beam search for CTC-attention-based speech recognition | ? | ? |
| lenny | [Iterative Compression of End-to-End ASR Model using AutoML](https://arxiv.org/abs/2008.02897) | Interspeech 2020 | Samsung |
| lenny | [Automatic Compression of Subtitles with Neural Networks and its Effect on User Experience](https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1750.pdf) | Interspeech 2019 | |
| lenny | [SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/pdf/1904.08779.pdf) | Interspeech 2019 | Google |

> [Reference (Tensorflow/lingvo)](https://github.com/tensorflow/lingvo/blob/master/PUBLICATIONS.md)

## ASR Conferences

1. ASRU
2. SLT
3. ICASSP
4. InterSpeech

## ASR Paper Summary

### Summary Table

|Type|Counts|
|:--:|:----:|
|Real-time and Model Compression|4|
|Self- and Semi-supervised Training|4|
|Multilingual|2|
|Output Encoding|2|
|Transformer-based|2|
|CNN-based|1|

---

### Example

1. **Paper Title** (Conference)
    * **Architecture:** <br/>
        * Descriptions
    * **Methodology or Advantage:** <br/>
        * Descriptions
    * **Dataset (optional):** <br/>
        * Descriptions
    * **Keyword:** <br/>
        * Descriptions

---

### Real-time and Model Compression

1. **Streaming End-to-end Speech Recognition for Mobile Devices** (ICASSP 2019)
    * **Architecture:** <br/>
        * Recurrent neural network transducer (RNN-T).
        * LSTM
    * **Methodology or Advantage:** <br/>
        * Outperforms a conventional CTC-based model in terms of both latency and accuracy, achieving 0.51 RT90 (real-time factor at the 90th percentile).
        * Adds a time-reduction layer in the encoder to speed up training and inference.
        * Trains with word-piece subword units, which outperform graphemes in their experiments.
        * Runs the encoder and the prediction network on different threads, enabling pipelining through asynchrony to save time.
        * Quantizes parameters from 32-bit floating-point precision to 8-bit fixed-point (see the sketch after this section).
    * **Dataset:** <br/>
        * 14.8K voice search (VS) utterances extracted from Google traffic.
        * 15.7K dictation utterances, referred to as the IME test set.
    * **Keyword:** <br/>
        * Real-time, RNN-T, Speech Recognition

2. **Compression of End-to-End Models** (Interspeech 2018)
    * **Methodology or Advantage:** <br/>
        * Explores compressing end-to-end models with the goal of satisfying device constraints without sacrificing model accuracy.
    * **Keyword:** <br/>
        * Matrix factorization, knowledge distillation, parameter, Speech Recognition

3. **Iterative Compression of End-to-End ASR Model using AutoML** (Interspeech 2020)
    * **Methodology or Advantage:** <br/>
        * Increasing demand for on-device Automatic Speech Recognition (ASR) systems has renewed interest in automatic model compression techniques.
        * AutoML-based Low-Rank Factorization (LRF) technique.
        * Proposes an iterative AutoML-based LRF approach that achieves over 5x compression without degrading WER, advancing the state of the art in ASR compression.
    * **Keyword:** <br/>
        * On-device, AutoML, Compression, Speech Recognition

4. **Automatic Compression of Subtitles with Neural Networks** (Interspeech 2019)
    * **Architecture:** <br/>
        * LSTM
        * Encoder + Decoder
    * **Methodology or Advantage:** <br/>
        * Automatic sentence compression.
        * For fast speech or limited screen size, it can be advantageous to compress subtitles to their most relevant content.
    * **Keyword:** <br/>
        * LSTM, Compression, Speech Recognition
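Paper 1's float32-to-8-bit step can be illustrated with PyTorch's post-training dynamic quantization. A hedged sketch on a toy model: `TinyEncoder` is ours, not the paper's network, and the paper uses fixed-point quantization in its own runtime rather than this PyTorch path.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy stand-in for an ASR encoder (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=80, hidden_size=320, num_layers=2)
        self.proj = nn.Linear(320, 512)

    def forward(self, x):          # x: (time, batch, 80) acoustic features
        out, _ = self.lstm(x)
        return self.proj(out)

fp32 = TinyEncoder()

# Post-training dynamic quantization: weights stored as int8,
# activations quantized on the fly at inference time.
int8 = torch.quantization.quantize_dynamic(
    fp32, {nn.LSTM, nn.Linear}, dtype=torch.qint8
)

x = torch.randn(50, 1, 80)         # 50 frames of 80-dim features
print(int8(x).shape)               # torch.Size([50, 1, 512])
```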
---

### Multilingual

1. **Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes** (ICASSP 2019)
    * **Architecture:** <br/>
        * The Audio-to-Byte model is based on Listen, Attend and Spell (LAS).
        * The Byte-to-Audio model is based on Tacotron 2.
    * **Methodology or Advantage:** <br/>
        * The output target is changed from graphemes to Unicode bytes.
        * Generates the text sequence one Unicode byte at a time.
        * Any script of any language representable in Unicode can be represented by a byte sequence, with no change to the existing model structure.
    * **Dataset:** <br/>
        * English + Japanese + Spanish + Korean
    * **Keyword:** <br/>
        * Multilingual, Speech Recognition, LAS, Tacotron 2, Byte-encoding

2. **Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model** (Interspeech 2019)
    * **Architecture:** <br/>
        * RNN-T
    * **Methodology or Advantage:** <br/>
        * An E2E multilingual system equipped to operate in low-latency interactive applications.
        * Deals with imbalanced multilingual data through:
            * Data sampling
            * Conditioning on a language vector
            * Adapter modules
    * **Dataset:** <br/>
        * 9 languages: Hindi, Marathi, ...
    * **Keyword:** <br/>
        * Multilingual, Speech Recognition, RNN-T, Imbalanced data, Low-latency

---

### Output Encoding

1. **Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes** (ICASSP 2019)
    * Same paper as described in the Multilingual section; a minimal byte-target illustration follows after this section.

2. **Word-level Speech Recognition with a Letter to Word Encoder** (ICML 2020)
    * **Architecture:** <br/>
        * Transformer, CTC
    * **Methodology or Advantage:** <br/>
        * Proposes a direct-to-word sequence model which uses a word network to learn word embeddings from letters.
        * The word network can be integrated seamlessly with arbitrary sequence models, including Connectionist Temporal Classification and encoder-decoder models with attention.
        * With beam search: acoustic model + language model + insertion weight.
    * **Dataset:** <br/>
        * LibriSpeech
    * **Keyword:** <br/>
        * Transformer, CTC, Attention, Word Embedding, Speech Recognition
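A minimal illustration of the byte-level output targets both "Bytes are All You Need" entries rely on (the example string is ours): any Unicode text maps onto a fixed 256-symbol vocabulary, so a single output layer covers every script.

```python
# Byte-level output targets: encode text as UTF-8 byte IDs.
text = "speech 音声 음성"
targets = list(text.encode("utf-8"))            # e.g. [115, 112, 101, ...]
assert all(0 <= t < 256 for t in targets)       # vocabulary size is always 256
assert bytes(targets).decode("utf-8") == text   # decoding is lossless
print(len(text), "characters ->", len(targets), "byte targets")
```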
---

### Self- and Semi-supervised Training

1. **vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations** (ICLR 2020)
    * **Architecture:** <br/>
        * wav2letter
        * BERT
    * **Methodology or Advantage:** <br/>
        * Learns discrete representations of audio segments through a wav2vec-style self-supervised context-prediction task.
        * Uses either a Gumbel softmax or online k-means clustering to quantize the dense representations.
        * Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
    * **Dataset:** <br/>
        * TIMIT
        * WSJ
    * **Keyword:** <br/>
        * BERT, wav2letter, Self-supervised, Discrete, Speech Recognition

2. **wav2vec: Unsupervised Pre-training for Speech Recognition** (Interspeech 2019)
    * **Architecture:** <br/>
        * CNN
    * **Methodology or Advantage:** <br/>
        * Explores unsupervised pre-training for speech recognition by learning representations of raw audio.
        * Pre-trains a simple multi-layer convolutional neural network optimized via a noise-contrastive binary classification task.
        * Experiments on WSJ reduce the WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data are available.
        * Outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.
    * **Dataset:** <br/>
        * WSJ
        * Librispeech
        * TIMIT
    * **Keyword:** <br/>
        * Unsupervised, Pre-training, CNN, Speech Recognition

<!-- 3. **Semi-supervised training for end-to-end models via weak distillation** (ICASSP 2019)
    * **Architecture:** <br/>
        * Descriptions
    * **Methodology or Advantage:** <br/>
        * Descriptions
    * **Dataset:** <br/>
        * Descriptions
    * **Keyword:** <br/>
        * Descriptions -->

3. **Effectiveness of self-supervised pre-training for speech recognition** (ICASSP 2020)
    * **Architecture:** <br/>
        * wav2vec
        * Discrete BERT
        * Continuous BERT
    * **Methodology or Advantage:** <br/>
        * Directly fine-tunes the pre-trained BERT models on transcribed speech using a Connectionist Temporal Classification (CTC) loss instead of feeding the representations into a task-specific model.
        * Also proposes a BERT-style model learning directly from continuous audio data, and compares pre-training on raw audio against spectral features.
    * **Dataset:** <br/>
        * Librispeech
    * **Keyword:** <br/>
        * BERT, Fine-tune, Self-supervised, Speech Recognition

4. **Self-Training for End-to-End Speech Recognition** (ICASSP 2020)
    * **Architecture:** <br/>
        ![](https://i.imgur.com/afsdjbV.png)
    * **Methodology or Advantage:** <br/>
        * Uses unlabeled text Y to train an LM, then uses the LM and AM together to generate pseudo-labels for unlabeled acoustic input X. Finally, the pairs (X, pseudo-label) are used to train the AM (see the sketch below).
    * **Keyword:** <br/>
        * Self-training, CNN, RNN, Multi-head Attention, Speech Recognition
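The pseudo-labeling loop from paper 4 above, as a hedged skeleton; `train_am`, `train_lm`, and `decode` are illustrative placeholders, not the paper's code.

```python
# Self-training skeleton: seed AM on labeled data, build an LM from unpaired
# text, decode unlabeled audio into pseudo-labels, retrain AM on the union.
def self_train(labeled_pairs, unlabeled_audio, unpaired_text,
               train_am, train_lm, decode):
    am = train_am(labeled_pairs)             # 1. seed acoustic model
    lm = train_lm(unpaired_text)             # 2. LM from unpaired text
    pseudo = [(x, decode(am, lm, x))         # 3. beam-decode pseudo-labels
              for x in unlabeled_audio]
    return train_am(labeled_pairs + pseudo)  # 4. retrain on labeled + pseudo
```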
---

### Transformer-based

1. **Transformer-based Acoustic Modeling for Hybrid Speech Recognition** (ICASSP 2020)
    * **Architecture:** <br/>
        * Deep Transformer; comparison between Transformer and BiLSTM.
    * **Methodology or Advantage:** <br/>
        * Adds LayerNorm into the Transformer architecture.
    * **Keyword:** <br/>
        * Transformer, BiLSTM, Layer Norm, Speech Recognition

2. **Transformer with Bidirectional Decoder for Speech Recognition** (Interspeech 2020)
    * **Architecture:** <br/>
        * Speech Transformer
        * End-to-end ASR
        * Bidirectional decoder
    * **Methodology or Advantage:** <br/>
        * Conventional transformer-based approaches usually generate the output sequence token by token from left to right; this work introduces a bidirectional speech transformer that exploits contexts from both directions simultaneously.
        * Proposes a novel speech transformer with a bidirectional decoder to exploit bidirectional context information for the ASR task.
        * Investigates the effectiveness of the bidirectional target and bidirectional decoding with comprehensive experiments.
        * Achieves a 3.6% relative CER reduction; the best model, STBD-big, achieves 6.64% CER, a large-margin improvement on the AISHELL-1 dataset.
    * **Dataset:** <br/>
        * AISHELL-1
    * **Keyword:** <br/>
        * Speech Transformer, Bidirectional, Attention, End2end, Speech Recognition

---

### CNN-based

1. **Conformer: Convolution-augmented transformer for speech recognition** (Interspeech 2020)
    * **Architecture:** <br/>
        * CNN
        * Transformer
    * **Methodology or Advantage:** <br/>
        * Uses a CNN to extract local information inside the (Conformer) Transformer block (a minimal sketch of the convolution module appears after the Links section).
        * Uses depthwise convolution and Swish activation in the CNN block, plus extensive residual connections.
    * **Keyword:** <br/>
        * CNN, Transformer, Depthwise Conv, Speech Recognition

---

## Links

[Linguistic Data Consortium](https://www.ldc.upenn.edu)
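Returning to the Conformer entry above: a minimal PyTorch sketch of its convolution module (pointwise conv → GLU → depthwise conv → BatchNorm → Swish → pointwise conv, wrapped in a residual connection). Dimensions and kernel size are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvModule(nn.Module):
    """Conformer-style convolution module (hedged sketch)."""
    def __init__(self, d_model: int = 256, kernel_size: int = 31):
        super().__init__()
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.norm = nn.BatchNorm1d(d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); Conv1d expects (batch, channels, time)
        y = x.transpose(1, 2)
        y = F.glu(self.pointwise1(y), dim=1)       # gating halves channels back
        y = F.silu(self.norm(self.depthwise(y)))   # Swish == SiLU in PyTorch
        y = self.pointwise2(y)
        return x + y.transpose(1, 2)               # residual connection

x = torch.randn(2, 100, 256)                       # 2 utterances, 100 frames
print(ConvModule()(x).shape)                       # torch.Size([2, 100, 256])
```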
