# Willianto Sulaiman
## Progress August 9^th^ 2023

- Whisper (especially on Taiwanese-accented speech)
  - Whisper does not perform well on monolingual Chinese after fine-tuning on code-switching ASR data.
  - Utilize a parameter-efficient method to decrease the WER.
  - Introduce the model to unlabeled speech and/or unlabeled transcriptions, hoping to further improve accuracy (MMS paper).
  - Datasets can be obtained from websites and also YouTube videos.
  - Slow inference time
    - Inference is slower when ladder tuning is added (Bob's research).
    - Find a way to speed up inference, either by simplifying the ladder or by skipping some connections when necessary.
- MMS
  - MMS was trained on far more languages than Whisper.
  - Hence, its representation should capture more general features than Whisper's.
  - Introduce MMS to code-switching and hope to get a better MER than Whisper.
- SLU
  - Use a speech encoder model (wav2vec, MMS) to extract emotion from a given speech input.
  - This can be useful for preparing the right reply during a conversation (AI Project).
  - Datasets:
    - [Chinese speech emotion dataset](https://github.com/AkishinoShiame/Chinese-Speech-Emotion-Datasets)
    - [Chinese dataset](https://www.twine.net/blog/mandarin-chinese-langauge-datasets/)
    - [EMOVIE](https://www.isca-speech.org/archive/pdfs/interspeech_2021/cui21c_interspeech.pdf)
- Code-switch TTS
  - Current TTS (especially in the AI Project), although good enough, still does not produce natural-sounding speech.
  - Create a TTS model that can speak two languages naturally (EN-ZH).

## Progress August 30^th^ 2023

- Whisper is vulnerable to adversarial attacks [Paper link](https://arxiv.org/abs/2210.17316)
  - Example: heavily accented speech **(not yet proven, needs experiments)**
- A proper way to fine-tune efficiently is required for the downstream task ![](https://hackmd.io/_uploads/HkR9ruhp3.png) src: Bobbi
- After fine-tuning, Whisper prioritizes LID over the task (self-observation).
  - Given a zh LID token, the **transcribe task**, and an English utterance, the output is the utterance translated to zh.
- Whisper ASR output can be either Simplified or Traditional Chinese.
- Solutions:
  - Optimize the encoder to be more robust.
  - Modify the prompt or the training process to focus on the transcribe task.

## Progress September 19^th^ 2023

- Use [LST](https://arxiv.org/abs/2206.06522) (Ladder Side Tuning) to fine-tune the Whisper model, since LST looks promising for memory- and parameter-efficient tuning (a rough structural sketch follows below).
- Implement [Bayesian Deep Learning](https://arxiv.org/abs/2002.08791) (BDL) on Whisper.
  - The reference paper **applied it to ResNet18** (conv layers, rather small); need to understand it further and try to port it to Transformers and Whisper.
  - **MAYBE** a Whisper adapter can use this idea to get a more robust encoder.
- Apply [MEFT](https://arxiv.org/abs/2306.00477) (Memory-Efficient Fine-Tuning) to get a reversible model, so the **intermediate activations do not need to be cached and can be recomputed**.
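As a reference for the LST idea mentioned above, here is a minimal structural sketch under my own simplified reading: the backbone stays frozen, and a small side network consumes down-projected hidden states from every backbone layer through learned gates. The class name, `d_side`, and the gating scheme are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LadderSideNetwork(nn.Module):
    """Rough ladder-side-tuning sketch: frozen backbone, trainable side path."""

    def __init__(self, backbone_layers: nn.ModuleList, d_model=512, d_side=64):
        super().__init__()
        self.backbone_layers = backbone_layers
        for p in self.backbone_layers.parameters():
            p.requires_grad = False                 # backbone is frozen
        n = len(backbone_layers)
        self.down = nn.ModuleList(nn.Linear(d_model, d_side) for _ in range(n))
        self.side = nn.ModuleList(nn.Linear(d_side, d_side) for _ in range(n))
        self.gate = nn.Parameter(torch.zeros(n))    # learned mixing gates
        self.out = nn.Linear(d_side, d_model)

    def forward(self, x):
        s = x.new_zeros(*x.shape[:-1], self.out.in_features)
        for i, layer in enumerate(self.backbone_layers):
            x = layer(x)                            # frozen backbone path
            g = torch.sigmoid(self.gate[i])
            # side path mixes the down-projected backbone state with its own state
            s = torch.relu(self.side[i](g * self.down[i](x) + (1 - g) * s))
        return self.out(s)                          # only the side path is trained
```

Only the side modules and gates receive gradients, which is the memory/parameter saving LST is after.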
## Progress October 11^th^ 2023

- [A Simple Baseline for Bayesian Uncertainty in Deep Learning](https://proceedings.neurips.cc/paper/2019/hash/118921efba23fc329e6560b27861f0c2-Abstract.html)
  - Uses the information contained in the SGD trajectory to efficiently approximate the posterior distribution over the weights of the neural network.
  - SGD updates the model with the following update rule: ![](https://hackmd.io/_uploads/SJRhcj7Wa.png)
  - Looking for a way to combine the formula above with the formula below ([source](https://arxiv.org/pdf/2211.15583)), since we are doing parameter-efficient fine-tuning ![](https://hackmd.io/_uploads/r1Ceiomb6.png)
  - Bayesian Model Averaging (BMA) with Stochastic Weight Averaging Gaussian (SWAG) ![](https://hackmd.io/_uploads/HktenjXba.png) (a rough sketch follows this entry)
- [Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles](https://arxiv.org/abs/1612.01474)
  - Proposes an alternative to Bayesian NNs that is simple to implement, readily parallelizable, requires very little hyperparameter tuning, and yields high-quality predictive uncertainty estimates.
  - Training criterion for regression: ![](https://hackmd.io/_uploads/H1d0F3QWa.png)
  - The overall training procedure is as follows: ![](https://hackmd.io/_uploads/B1Dec3Xb6.png)
- [Bayesian Deep Learning and a Probabilistic Perspective of Generalization](https://arxiv.org/abs/2002.08791)
  - Proposes the combination of Deep Ensembles and SWAG, called MultiSWAG.
  - Argues that generalization depends largely on the **support and inductive biases** of a model.
    - Defines support as the range of datasets for which $p(\mathcal{D} \mid \mathcal{M}) > 0$.
    - Defines inductive biases as the relative prior probabilities of different datasets, i.e. the distribution of support given by $p(\mathcal{D} \mid \mathcal{M})$.
- [Uncertainty Estimation in Deterministic Vision Transformers](https://charliezhaoyinpeng.github.io/UDM-AAAI23/assets/papers/3/CameraReady/UDM_AAAI_23_camera_ready.pdf)
  - Replaces the dot-product similarity with a distance in a Banach space and normalizes the term by a theoretical lower bound of the Lipschitz constant. ![](https://hackmd.io/_uploads/H1wbg2Qb6.png) ![](https://hackmd.io/_uploads/Bkh383XbT.png)
  - Uncertainty information can be added inside the attention module.
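A minimal sketch of the SWAG idea summarized above: run SGD, keep running first and second moments of the flattened weights, and treat them as a diagonal Gaussian posterior to sample from. This is a simplification (no low-rank covariance term, per-step rather than per-epoch snapshots); `model`, `optimizer`, `loader`, and `loss_fn` are placeholders.

```python
import torch

@torch.no_grad()
def collect_swag_moments(model, optimizer, loader, loss_fn, n_epochs=5):
    """Diagonal-SWAG sketch: track running mean/second moment of the weights."""
    flat = lambda: torch.cat([p.detach().reshape(-1) for p in model.parameters()])
    mean, sq_mean, n = torch.zeros_like(flat()), torch.zeros_like(flat()), 0
    for _ in range(n_epochs):
        for x, y in loader:
            with torch.enable_grad():            # ordinary SGD step
                optimizer.zero_grad()
                loss_fn(model(x), y).backward()
                optimizer.step()
            w = flat()                           # snapshot after the step
            mean = (n * mean + w) / (n + 1)
            sq_mean = (n * sq_mean + w ** 2) / (n + 1)
            n += 1
    var = (sq_mean - mean ** 2).clamp_min(1e-12)
    return mean, var                             # sample weights as mean + var.sqrt() * eps
```

Bayesian model averaging then amounts to sampling several weight vectors from this Gaussian and averaging the predictions.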
## Progress October 24^th^ 2023

- Use the OWSM model as the backbone model [Link](https://paperswithcode.com/paper/reproducing-whisper-style-training-using-an/review/) ![](https://hackmd.io/_uploads/HkT1n-Sz6.png)
- OWSM performs better for **English, Chinese, and Japanese**.
- **Trained on Simplified Chinese datasets**, so fine-tuning on Traditional Chinese will be hard; it is easier to fine-tune on Simplified and then convert the text to Traditional.
- OWSM V2 result before fine-tuning:
  - Sound: [Drive](https://drive.google.com/file/d/1UTZDpnuUBhuXIEUOAcz_OQyjaZrSnqTt/view?usp=sharing)
  - Original text: 去使用这个 app 就直接捅一捅一下就下去所以不会很仔细地看这些东西
  - Output: \<en\>\<asr\>\<notimestamps\> to to使用这个艾普就自己同意同意同意同意下一步下一步下一步 所以伯父很自己看这些东西 所以伯父很自己看这些东西 就自己同意同意同意同意下一步下一步下一步下一步 所以伯父很自己看这些东西 所以伯父很自己看这些东西 就自己同意同意同意同意下一步下一步下一步下一步下一步 所以伯父很自己看这些东西 就自己同意同意同意同意
- Tried to fine-tune OWSM V1 with a code-switching dataset
  - **Tokenizer problem** (all words are tokenized as \<UNK\>)
- Weird output from OWSM V2 even though experiments show that OWSM V2 is the best.
- Still analyzing the training process of the OWSM model (very new, and the training process differs from Whisper's).

## Progress November 21^st^

- Several problems were faced when fine-tuning OWSM.
  - Solved by following the correct fine-tuning procedure given by the OWSM authors.
- **Direct inference** on ASCEND (Hong Kong English-Mandarin code-switching dataset) ![image](https://hackmd.io/_uploads/rJ-HqNiVp.png)
- ASCEND with fine-tuning ![image](https://hackmd.io/_uploads/rJj35NoV6.png)
- SEAME result ![image](https://hackmd.io/_uploads/SyXjASsNp.png)

## Progress December 13^th^

- Overall structure of OWSM: ![image](https://hackmd.io/_uploads/B1RMX3Trp.png)
- Proposed methods:
  - Apply contrastive learning in the encoder based on the LID from the CTC (a small implementation sketch follows at the end of this entry).
    - Possible title: **Contrastive learning for LID separation in CTC-based code-switching ASR**
    - ![image](https://hackmd.io/_uploads/H1997npra.png)
    - $$ \mathcal{L}_{c} = - \sum_{(\mathbf{i}, y_i) \in \mathbf{x}}\log \frac{\frac{1}{\sum \mathbb{1}(y_i = y_j)}\sum_{(\mathbf{j}, y_j) \in \{\mathbb{x}, y_i=y_j\}}\exp(\text{sim}(i, j))}{\sum_{(k,y_k) \in (y_i \neq y_k)}\exp(\text{sim}(i, k))} $$
    $$ \mathcal{L}_{\text{in}}^{\text{sup}}=\sum_{i \in I} \mathcal{L}_{\text{in}, i}^{\text{sup}}=\sum_{i \in I}-\log \left\{\frac{1}{|P(i)|} \sum_{p \in P(i)} \frac{\exp \left(\boldsymbol{z}_i \cdot \boldsymbol{z}_p / \tau\right)}{\sum_{a \in A(i)} \exp \left(\boldsymbol{z}_i \cdot \boldsymbol{z}_a / \tau\right)}\right\} $$
  - Use a side adapter to make the output "aware" of uncertainty.
    - Possible title: **Introducing uncertainty to a deterministic model via a side adapter with parameter-efficient fine-tuning**
    - Inspired by the [paper Pin-Yen presented](https://arxiv.org/pdf/2303.02444.pdf)
    - ![image](https://hackmd.io/_uploads/SyJB_nL8p.png)
    - Standard SGPA computes the following mean and variance: $$ m_d=K_{q k}[\mathrm{v}]_{:, d}, \quad \Sigma_d=K_{q q}+K_{q k}\left(K_{k k}^{-1}[S]_{:,:, d} K_{k k}^{-1}-K_{k k}^{-1}\right) K_{k q} $$ where $\mathbf{S} \in \mathbb{R}^{T \times T \times d_v}$ is a set of variational covariance parameters, $T$ is the number of inducing points, and $\mathbf{K}$ is a kernel Gram matrix.
    - $K$ can be defined as $K(\cdot,\cdot) = K_{\text{base}}(h_\theta(\cdot), h_\theta(\cdot))$, where $h_\theta$ is a feature representation output by a DNN.
    - The decoupled version computes the following mean and variance: $$ \begin{aligned} \boldsymbol{m}_d & =\boldsymbol{K}_{\boldsymbol{q} \boldsymbol{k}_a}\left[\mathbf{v}_a\right]_{:, d}-\boldsymbol{K}_{\boldsymbol{q} \boldsymbol{k}_g} \boldsymbol{K}_{\boldsymbol{k}_g \boldsymbol{k}_a}\left[\mathbf{v}_a\right]_{:, d}+\boldsymbol{K}_{\boldsymbol{q} \boldsymbol{k}_g}\left[\mathbf{v}_g\right]_{:, d} \\ \boldsymbol{\Sigma}_d & =\boldsymbol{K}_{\boldsymbol{q} \boldsymbol{q}}+\boldsymbol{K}_{\boldsymbol{q} \boldsymbol{k}_g} \boldsymbol{K}_{\boldsymbol{k}_g \boldsymbol{k}_g}^{-1}\left(\left[\boldsymbol{S}_g\right]_{:,:, d}-\boldsymbol{K}_{\boldsymbol{k}_g \boldsymbol{k}_g}\right) \boldsymbol{K}_{\boldsymbol{k}_g \boldsymbol{k}_g}^{-1} \boldsymbol{K}_{\boldsymbol{k}_g \boldsymbol{q}} \end{aligned} $$
    - `In this work, we are not able to consider pretraining due to the high computational cost, and since SGPA replaces scaled dot-product with a valid kernel, there is no existing pre-trained backbone that can be directly used for the downstream fine-tuning tasks.`
    - ![image](https://hackmd.io/_uploads/rkWHLyDI6.png)
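As promised above, a minimal sketch of a supervised contrastive loss over encoder frames grouped by LID labels, in the spirit of the $\mathcal{L}^{\text{sup}}$ formula in this entry. Frame-level LID labels, the temperature, and the use of all non-anchor frames in the denominator (a common SupCon variant) are assumptions on my part.

```python
import torch
import torch.nn.functional as F

def lid_supcon_loss(frames, lid, tau=0.07):
    """Pull together encoder frames of the same language, push apart the rest.

    frames: (N, D) encoder states; lid: (N,) integer language id per frame."""
    z = F.normalize(frames, dim=-1)
    sim = z @ z.t() / tau                                    # pairwise similarities
    not_self = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (lid[:, None] == lid[None, :]) & not_self          # same-language pairs
    # log-softmax over all non-self frames for each anchor
    log_prob = sim - torch.logsumexp(sim.masked_fill(~not_self, float('-inf')),
                                     dim=1, keepdim=True)
    n_pos = pos.sum(1).clamp_min(1)
    loss = -(log_prob * pos).sum(1) / n_pos                  # mean over positives
    return loss[pos.sum(1) > 0].mean()                       # skip anchors with no positives
```

This term would be added to the CTC/attention loss with a small weight, so the encoder separates languages without changing the ASR objective itself.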
## Progress Jan 2^nd^ 2024

- Found that **OWSM is actually not that great** in performance compared to Whisper.
  - ![image](https://hackmd.io/_uploads/ryG2fr-Op.png)
  - ![image](https://hackmd.io/_uploads/B1iTTNWOa.png)
  - OWSM is only better in its **transparency**, which is not what our task needs.
  - Decided to **go back to Whisper**, following Bob's work.
- Problem 1:
  - A model capable of code-switching is required for a truly multilingual system.
  - But code-switching data is limited, while monolingual datasets for multiple languages are much easier to obtain.
  - Propose a method to introduce code-switching to an ASR model without introducing code-switching data.
  - Possible title: **Introducing code-switch capability with uncertainty fine-tuning for ASR models**
- Problem 2:
  - According to [There is more than one kind of robustness: Fooling Whisper with adversarial examples](https://arxiv.org/abs/2210.17316), **a small noise with an SNR of 35-40 dB** that fools the language detector can **degrade Whisper's performance dramatically.**
  - The language-detector weakness was confirmed by Bob's work ![image](https://hackmd.io/_uploads/Sy-8msGua.png)
  - Trying to create **a robust code-switching ASR** with a proper fine-tuning method.
  - Possible title: Towards Robust Code-Switching ASR with Parameter-Efficient Fine-Tuning
- Following [Bayesian low-rank adaptation for large language models](https://arxiv.org/pdf/2308.13111.pdf):
  - This paper applies a **Laplace approximation to the posterior** over the **LoRA parameters**.
  $$ \mathbf{F}(\boldsymbol{\theta})=\sum_{n=1}^N \mathbb{E}_{\mathrm{P}\left(y \mid f_\theta\left(\mathbf{x}_n\right)\right)}\left[\nabla_\theta \mathrm{P}\left(y \mid f_\theta\left(\mathbf{x}_n\right)\right)\left(\nabla_\theta \mathrm{P}\left(y \mid f_\theta\left(\mathbf{x}_n\right)\right)\right)^T\right] $$
  $$ f_\theta\left(\mathbf{x}_*\right) \sim \mathcal{N}\left(f_{\theta_{\text{MAP}}}\left(\mathbf{x}_*\right), \boldsymbol{\Lambda}\right) $$
  $$ \boldsymbol{\Lambda}=\left(\left.\nabla_\theta f_\theta\left(\mathbf{x}_*\right)\right|_{\theta_{\mathrm{MAP}}} ^T\right) \boldsymbol{\Sigma}\left(\left.\nabla_\theta f_\theta\left(\mathbf{x}_*\right)\right|_{\theta_{\mathrm{MAP}}}\right) $$
  - Have to read this paper carefully because it describes how to **carefully sample the weights**.
  - **LoRA is a good side adapter** in my case because **its output initially starts from 0** and slowly grows to adapt the model.

## Progress Jan 17^th^

- Ran Whisper with LoRA as the adapter (r=6) ![image](https://hackmd.io/_uploads/r1iCmwNFa.png)
- The other results are still ongoing.
- Currently still using \<en\> only instead of \<zh\>\<en\> like Bob does; still ongoing (problem with the Whisper version) ![image](https://hackmd.io/_uploads/rk22aDNF6.png)

## Progress Jan 22^nd^

![image](https://hackmd.io/_uploads/BkpkIjoKp.png)

## Progress Jan 24^th^

- Comparing the attention maps between the adapter and LoRA methods.
  - **Forgot to add LayerNorm for the QK map**
  - Layer 2, head 8 ![image](https://hackmd.io/_uploads/HJh5a40t6.png) ![image](https://hackmd.io/_uploads/HyVspVCFp.png)
  - Layer 8, head 11 ![image](https://hackmd.io/_uploads/SJgaTV0ta.png) ![image](https://hackmd.io/_uploads/By4T64Rtp.png)
- There are still **some problems with the attention maps**; will fix soon.
- Running the adapter and LoRA for 30 epochs to see the convergence of the model ![image](https://hackmd.io/_uploads/B1ZkWH0Y6.png)

## Progress Jan 29^th^

![image](https://hackmd.io/_uploads/BJ17EyB9T.png)

- Found a normal fine-tuning process that is **"comparable"** with Bob's method.
- Looking for methods to add uncertainty to the model:
  - [Bayesian Attention Modules](https://arxiv.org/abs/2010.10604) (NeurIPS 2020)
  - [Bayesian low-rank adaptation for large language models](https://arxiv.org/abs/2308.13111) (under review, ICLR 2024)
  - [LoRA ensembles for large language model fine-tuning](https://openreview.net/forum?id=X5VElAKt2s) (under review, ICLR 2024)
- Several metrics to evaluate the model (a small ECE sketch follows below):
  - Negative Log-Likelihood (NLL)
  - Expected Calibration Error (ECE)
  - OOD datasets (LibriSpeech, ASCEND, NTUT)
- Possible titles:
  - **KUNAM: Knowledge-based UNcertainty in Attention Module for Robust Code-Switch ASR**
  - **Whispering Skepticism: Uncertainty-Aware Whisper for Accurate Code-Switch Speech Recognition**
  - **Whisper with a Grain of Salt: Whisper with Uncertainty Awareness for Precise Code-Switch Speech Recognition**
  - **Whispering in Shades of Uncertainty: A Bayesian Framework for Whisper in Code-Switch ASR**
  - **Whispering with Caution: Bayesian Fine-tuning for Whisper in Code-Switch ASR**
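For reference, a minimal sketch of equal-width-bin ECE, one of the calibration metrics listed above. It assumes we already have per-decision confidences and correctness flags (array names are illustrative); how to define these for autoregressive ASR is exactly the open question discussed later under Feb 21.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Equal-width-bin ECE: weighted average gap between confidence and accuracy.

    confidence: predicted probability of each decision; correct: 0/1 outcomes."""
    confidence = np.asarray(confidence)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap          # bin weight = fraction of samples
    return ece
```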
## Progress Feb 21^st^

Overleaf: https://www.overleaf.com/read/dbwscrgchpsc#7f61a1

Title: **KUNAM: Knowledge-based UNcertainty in Attention Module for Accent-Robust ASR**

- Changed the title to "accent-robust" ASR.
- Had a discussion with Mahdin and decided to focus on heavy-accent speech robustness and out-of-domain data.
  - In-domain data: AISHELL (zh), still unknown for (en), ASCEND (cs, additionally if time allows)
  - Out-of-domain data: SEAME (zh), SEAME (en), SEAME (cs, additionally if time allows) ![image](https://hackmd.io/_uploads/BJh4umm36.png)
- Tried to report the NLL and ECE of the model, but according to the authors, NLL and ECE are not suitable metrics for an autoregressive transformer:
  > Given that this is the first investigation of Laplace for LoRA fine-tuning, we chose to focus on multiple-choice QA because that allowed us to use robust, well-understood calibrations metrics like ECE and NLL.
  > As a next step, we are indeed excited by the possibility of investigating free-text generation: Laplace-LoRA certainly could be applied in this setting. However, **the development of robust, well-understood calibration metrics for free-text generation remains a challenging and open research question**. Given the complexity of evaluating calibration in the free-text setting, this extension is out-of-scope here, and we leave it for future work.
- So instead, just use the usual MER/WER/CER to see whether uncertainty in ASR helps decrease the error rate.
- Currently running experiments; the results are not satisfying yet.

## Progress March 4^th^

- Changed topic to: **Enhancing Code-Switching ASR with Accent-Robust Variational Neural Networks**
- Abstract: Code-switching has been a hot topic in Automatic Speech Recognition (ASR), and many research papers achieve satisfying results on their own datasets. However, their performance has not been explored on other accents, and unseen accents have been shown to degrade the performance of ASR models. In this paper, a Variational Neural Network (VNN) is used to obtain robust results on code-switching with unseen accents. Experiments show that the proposed method achieves higher performance than previous methods on unseen accents while achieving comparable results on seen accents.
- Proposed method:
  - Current method ![image](https://hackmd.io/_uploads/r1L33ymaa.png)
  - Apply variational inference on the attention output ![image](https://hackmd.io/_uploads/BytLbPGTa.png)
  - Apply variational inference on the attention weights ![image](https://hackmd.io/_uploads/By4WOCzaT.png)
  - Prior: $p(z) = \mathcal{N}(\mathbf{\mu}, \textbf{0}) = z_{\text{linear}} \text{ or } \textbf{QK}^\top$
  - Posterior approximation: $q(z) = \mathcal{N}(\mu, \mathbf{\Sigma}) = \mu_i + \sigma_i * \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, \mathbf{I}), \sigma = \sqrt{\Sigma}$
  - The adapter output is the covariance/variance, which needs to be non-negative, so **adding a ReLU to the output is a good solution (thanks to Mahdin)**; a rough sketch of this reparameterization is given below.
  - Add a new loss: $$ \operatorname{KL}\left[q\left(\boldsymbol{z}_t\right) \| p_{\mathrm{r}}\left(\boldsymbol{z}_t\right)\right]= \sum_{i=1}^M\left\{\log \frac{\sigma_{t, i}^{\mathrm{r}}}{\sigma_{t, i}}+\frac{\sigma_{t, i}^2+\left(\mu_{t, i}-\mu_{t, i}^{\mathrm{r}}\right)^2}{2 \sigma_{t, i}^{\mathrm{r}}{ }^2}-\frac{1}{2}\right\} $$
  - Need to find a way to control the covariance.
- Datasets used:
  - SEAME (Singaporean and Malaysian accent, code-switching)
  - ASCEND (Hong Kong accent, code-switching)
  - NTUT (Taiwanese accent, code-switching)

  All datasets will be converted to Simplified Chinese for consistency.
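A minimal sketch of the reparameterization described above: the attention output is treated as the mean, a small bottleneck adapter predicts a per-dimension sigma through a ReLU, and a sample is drawn with the reparameterization trick. The module name and sizes are illustrative, and I use a unit-variance reference prior in the KL so it stays finite; the notes leave the exact prior choice open (see April 17).

```python
import torch
import torch.nn as nn

class VariationalOutputAdapter(nn.Module):
    """Sketch: Gaussian posterior over the attention output with adapter-predicted sigma."""

    def __init__(self, d_model=512, r=32):
        super().__init__()
        self.sigma_net = nn.Sequential(
            nn.Linear(d_model, r), nn.GELU(),
            nn.Linear(r, d_model), nn.ReLU(),    # ReLU keeps sigma non-negative
        )

    def forward(self, attn_out):
        mu = attn_out                             # deterministic output acts as the mean
        sigma = self.sigma_net(attn_out)          # adapter-predicted std, >= 0
        z = mu + sigma * torch.randn_like(mu)     # reparameterization trick
        # KL to a unit-variance prior centered at mu (the mean term cancels here)
        kl = 0.5 * (sigma ** 2 - 2 * torch.log(sigma + 1e-8) - 1).sum(-1).mean()
        return z, kl
```

At training time `kl` is added to the ASR loss with a weight beta, which is the beta swept later in the April 17 entry.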
## Progress March 7^th^

BASELINE ![image](https://hackmd.io/_uploads/rkbabGP66.png) ![image](https://hackmd.io/_uploads/B1bHWMDaa.png)

- Parameters trained: 1.47 M (0.6%)
- Need more parameters to adapt.

## Progress March 11^th^

- Background formula (please check whether this formula holds):
  - We assume that a previously robust pretrained model can successfully model $P(\mathbf{Y} \mid \mathbf{X})$, and given a new dataset $\mathbf{X}'$, we fine-tune the model to fit $P(\mathbf{Y} \mid \mathbf{X}')$.
  - But this results in forgetting: since $\mathbf{X}' \not\subset \mathbf{X}$, the model fails to model the original $P(\mathbf{Y} \mid \mathbf{X})$; we call $\mathbf{X}'$ the OOD dataset.
  - We want the model to still model $P(\mathbf{Y} \mid \mathbf{X})$, yet handle the input $\mathbf{X}'$, by introducing a module that models $P(\mathbf{X} \mid \mathbf{X}')$.
  - Here we assume that $P(\mathbf{X}') = \mathcal{N}(\mathbf{X}, \mathbf{\sigma I})$ (expand the distribution of $\mathbf{X}$).
  - Obtain $\sigma$ from $cov_{\phi}(\mathbf{x}')$, where $f_\theta(\cdot)$ is the pretrained model and $cov_\phi$ is a module parameterized by $\phi$ that produces $\sigma$ given $\mathbf{x}'$.
  - Then, to obtain $P(\mathbf{X})$, we can use an MC approximation for the mean: $$ \mathbf{x} \approx \frac{1}{N}\sum_n^N \mathbf{x'}, \quad \mathbf{x}' \sim \mathcal{N}(\mathbf{X}, \mathbf{\sigma I}) $$ when $N$ is large enough (a small numeric check appears at the end of this entry).
- Baseline (Bob's code-switching fine-tuning method): ![image](https://hackmd.io/_uploads/H1MjPss6T.png)
- Proposed method:
  - $\mathbf{x}'$ obtained from the attention **input**
    - ![image](https://hackmd.io/_uploads/rJiTDsspp.png)
    - ![image](https://hackmd.io/_uploads/rJFlOij66.png =300x300)
  - $\mathbf{x}'$ obtained from the attention **output**
    - ![image](https://hackmd.io/_uploads/S1dDuoo6T.png)
    - ![image](https://hackmd.io/_uploads/HyLT_ooTa.png =300x300)
- Bob's model result:
  - ![image](https://hackmd.io/_uploads/BkY8nX3TT.png)
  - ![image](https://hackmd.io/_uploads/S1DtT7hpa.png)
- Proposed model result:
  - ![image](https://hackmd.io/_uploads/BkKDh73TT.png)
  - ![image](https://hackmd.io/_uploads/HyG3pX3TT.png)
  - <audio src="https://drive.google.com/file/d/1DvXGO3Pv71PRX8l40alaOZqRdKlZ0FZA/view?usp=sharing" controls audio type="audio/wav"></audio>
  - <audio src="https://drive.google.com/file/d/1S-JQ9s0VtlVWb-9vpCibNf0Ak7EpAQ_Y/view?usp=sharing" controls audio type="audio/wav"></audio>
  - https://drive.google.com/file/d/1DvXGO3Pv71PRX8l40alaOZqRdKlZ0FZA/view?usp=sharing
  - https://drive.google.com/file/d/1S-JQ9s0VtlVWb-9vpCibNf0Ak7EpAQ_Y/view?usp=sharing
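A tiny numeric check of the MC averaging assumption in the background formula above: drawing noisy copies $\mathbf{x}' \sim \mathcal{N}(\mathbf{x}, \sigma^2\mathbf{I})$ and averaging recovers $\mathbf{x}$ with error shrinking like $\sigma/\sqrt{N}$. The vector size and sigma here are just illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(80)          # "clean" feature vector (illustrative)
sigma = 0.5                          # assumed perturbation scale

for n in (1, 10, 100, 1000):
    noisy = x + sigma * rng.standard_normal((n, x.size))       # x' ~ N(x, sigma^2 I)
    err = np.linalg.norm(noisy.mean(axis=0) - x) / np.sqrt(x.size)
    print(f"N={n:4d}  RMSE of MC mean vs x: {err:.3f}  (~ sigma/sqrt(N) = {sigma/np.sqrt(n):.3f})")
```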
## Progress March 27^th^

- [Variational information bottleneck for effective low-resource fine-tuning](https://arxiv.org/abs/2106.05469.pdf) ![image](https://hackmd.io/_uploads/HyF178Zy0.png)
  - ![image](https://hackmd.io/_uploads/ryktD8bJA.png)
  - ![image](https://hackmd.io/_uploads/HJ1nvLbyC.png)
- [Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck](https://arxiv.org/abs/2204.01387.pdf)
- [Towards Diverse, Relevant and Coherent Open-Domain Dialogue Generation via Hybrid Latent Variables](https://arxiv.org/abs/2212.01145)
- [Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space](https://arxiv.org/abs/1711.07068)
- [The Expressive Power of Low-Rank Adaptation](https://arxiv.org/abs/2310.17513)

## Progress April 2^nd^

<!--
### The Expressive Power of Low-Rank Adaptation
- Expressive Power of Transformer Layers with LoRA
  - A Transformer network, denoted $\mathrm{TFN}_{L, D}$, is a composition of $L$ Transformer blocks and an output layer, parameterized by weight $\boldsymbol{W}_o \in \mathbb{R}^{D \times D}$.
  - Each Transformer block comprises an $H$-head self-attention layer, parameterized by weights $\left(\left(\boldsymbol{W}_{O l}^h, \boldsymbol{W}_{V l}^h, \boldsymbol{W}_{K l}^h, \boldsymbol{W}_{Q l}^h\right)_{h=1}^H\right)_{l=1}^L$, followed by a token-wise feedforward layer, parameterized by weights $\left(\boldsymbol{W}_{1 l}, \boldsymbol{W}_{2 l}\right)_{l=1}^L$ and biases $\left(\boldsymbol{b}_{1 l}, \boldsymbol{b}_{2 l}\right)_{l=1}^L$.
  - Assume that all weight matrices have dimension $D \times D$, while the bias vectors are of dimension $D$.
  ![image](https://hackmd.io/_uploads/BJOaeAP1R.png) ![image](https://hackmd.io/_uploads/rJ1KZCPJA.png) ![image](https://hackmd.io/_uploads/ryFxZ0wk0.png)
  - $\mathrm{LR}_r(\cdot)$: best rank-$r$ approximation of a square matrix in Frobenius norm and spectral norm. The subscript $r$ may be omitted to indicate a general low-rank approximation without specifying the rank.
  ![image](https://hackmd.io/_uploads/r1VrORvJR.png)
-->

![image](https://hackmd.io/_uploads/SyoaTCDy0.png)

$$
KL(p, q)=\log \frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+\left(\mu_1-\mu_2\right)^2}{2 \sigma_2^2}-\frac{1}{2}
$$

<!--
$$ \begin{aligned} & \log P(\boldsymbol{Y} \mid \boldsymbol{X})=\log \prod_{t=1}^n P\left(\boldsymbol{y}_t \mid \boldsymbol{y}_0, \ldots, \boldsymbol{y}_{t-1}, \boldsymbol{X}\right) \\ & \approx \log \prod_{t=1}^n \int P\left(\boldsymbol{y}_t \mid \boldsymbol{y}_0, \ldots, \boldsymbol{y}_{t-1}, \boldsymbol{z}_t\right) p\left(\boldsymbol{z}_t \mid \boldsymbol{h}_t, \boldsymbol{X}\right) \mathrm{d} \boldsymbol{z}_t \end{aligned} $$
-->

## Progress April 17^th^

- Title: **Visper: parameter-efficient VIB fine-tuning for accent-robust ASR with Whisper**
- **Red and green** mean **losing and winning** compared to Bob's method.
- Comparing the best beta ![image](https://hackmd.io/_uploads/rkW4bMpxC.png)
  - Smaller beta = better (but 0 can perform worse)
  - Need to set a smaller beta
- Comparing which prior to use ![image](https://hackmd.io/_uploads/BJjYZfal0.png)
  - Using noise as the prior performs better
  - Currently training using parameters for the mean and variance
- Considering a two-stage setup ![image](https://hackmd.io/_uploads/SJiiZMaxA.png)
  - Performance is not good

## Progress April 29^th^

- Title candidates:
  - Irrelevant Information Filtering Through Layer-Wise Variational Fine-Tuning
  - Layer-Wise LoRA-VIB for Efficient Code-Switching Speech Recognition
  - Advancing Whisper Code-Switching Capability with Layer-Wise LoRA-VIB Adaptation
  - Balancing Monolingual and Code-Switching Capabilities in Whisper Using Variational-LoRA
  - Layer-Wise LoRA-VIB in Whisper: An Approach for Balancing Monolingual and Code-Switching Capabilities
- Background:
  - Today's speech models achieve satisfying results on **monolingual ASR**.
  - But experiments show that these models **cannot perform well on code-switching**, even though the languages in the sentence were learned by the speech models.
  - Fine-tuning on a code-switching dataset is the popular way to tackle this issue.
  - But by fine-tuning on that dataset, the **generalization of the speech model is perturbed by the fine-tuning process**.
  - This research wants to **introduce only the code-switching capability to speech models while retaining their generality** during the fine-tuning process.
- Experiments ![image](https://hackmd.io/_uploads/ryoMDRhZR.png)
  - Changing the KL term into an **ordinary MSE loss on the mean** performs better (see the comparison sketch below).
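A minimal sketch contrasting the closed-form diagonal-Gaussian KL noted under April 2 with the MSE-on-the-mean variant mentioned just above. Tensor shapes and the unit-variance reference are assumptions for illustration.

```python
import torch

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ), per the April 2 formula."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    kl = torch.log(sigma_p / sigma_q) + (var_q + (mu_q - mu_p) ** 2) / (2 * var_p) - 0.5
    return kl.sum(-1).mean()

def mse_on_mean(mu_q, mu_p):
    """The simpler April 29 variant: only match the means, ignore the variances."""
    return ((mu_q - mu_p) ** 2).mean()

# illustrative usage with random posterior/prior statistics
mu_q, mu_p = torch.randn(4, 512), torch.randn(4, 512)
sigma_q = torch.rand(4, 512) + 1e-3
sigma_p = torch.ones_like(sigma_q)
print(gaussian_kl(mu_q, sigma_q, mu_p, sigma_p), mse_on_mean(mu_q, mu_p))
```

The MSE drops the variance-matching pressure of the KL, which is one plausible reading of why it behaves differently during fine-tuning.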
## Progress May 6^th^

Title: **Balancing Monolingual and Code-Switching Capabilities in Whisper Using Variational-LoRA**

<!-- ![image](https://hackmd.io/_uploads/BJileM8fC.png) ![image](https://hackmd.io/_uploads/rkrreGIfC.png) ![image](https://hackmd.io/_uploads/SkZJbzUzC.png) -->

- Currently searching for a method to prevent overfitting to the target dataset.
- Introduce code-switching capability $\rightarrow$ introduce only the language transition $\rightarrow$ should be related to the **decoder part**.
- [Sparsely Shared LoRA on Whisper for Child Speech Recognition](https://arxiv.org/abs/2309.11756) (ICASSP 2024)
  - Adapting with AdaLoRA for child speech recognition (low-resource speech & zero-shot)
  - LoRA shared across each block type (Enc-SAM, Enc-FFM, Dec-SAM, Dec-CAM, and Dec-FFM)
- [An Effective Mixture-of-Experts Approach for Code-Switching Speech Recognition Leveraging Encoder Disentanglement](https://arxiv.org/abs/2402.17189) (ICASSP 2024) ![image](https://hackmd.io/_uploads/BJiAQMIzR.png)
- [Cross-Modal Parallel Training for Improving end-to-end Accented Speech Recognition](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10447979) (ICASSP 2024) ![image](https://hackmd.io/_uploads/SyS7rMUG0.png)
- [Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps](https://openreview.net/pdf?id=mYWsyTuiRp) (ICLR 2024)
- [Incorporating Residual and Normalization Layers into Analysis of Masked Language Models](https://arxiv.org/abs/2109.07152) (EMNLP 2021)
- [Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning](https://openreview.net/forum?id=YR3ETaElNK) (ICLR 2024)

## Progress May 13^th^

- Recent trend: decoding strategies for hallucination
  - [DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models](https://arxiv.org/pdf/2309.03883) ![image](https://hackmd.io/_uploads/HynTCmymR.png) ![image](https://hackmd.io/_uploads/ByJb1E1mR.png) ![image](https://hackmd.io/_uploads/BJh-kNk7R.png)
  - [Constrained Decoding for Cross-lingual Label Projection](https://arxiv.org/pdf/2402.03131) ![image](https://hackmd.io/_uploads/rJqY8VJQA.png)
- Many papers modify the decoding process to mitigate hallucination.
- Hallucination can be associated with the wrong language in CS.

## Progress May 15^th^

Title candidates:

- **Unlocking the Power of Authentic Bilingual Speech Recognition: Revolutionizing Robustness and Accuracy**
- **Towards Authentic Robust Bilingual Speech Recognition**
- **Towards Robust Bilingual Speech Recognition**

![image](https://hackmd.io/_uploads/SyxIZyfXC.png)

- [Large Language Models are Efficient Learners of Noise-Robust Speech Recognition](https://arxiv.org/abs/2401.10446) (ICLR 2024)
  - ![image](https://hackmd.io/_uploads/ryE43JzX0.png)
- [It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition](https://arxiv.org/abs/2402.05457) (ICLR 2024)
  - ![image](https://hackmd.io/_uploads/BJVih1GmA.png)
## Progress May 20^th^ & May 22^nd^

Title: **Probabilistic Language-Aware Speech Recognition** ~~little modification into: **Probabilistic Speech Recognition with Language-Aware Enhancements**~~

- Introduce LID into the probability:
$$
\begin{split}
\log P(\textbf{y} \mid \textbf{x}) & =\sum_i \log P(y_i \mid \textbf{x}, y_{<i}) \\
\text{Under the language-conditional assumption: } \\
&= \int P(y_i \mid \mathbf{x}, y_{<i}, L_i) P(L_i \mid \mathbf{x}, y_{<i}) dL_i\\
& = \mathbb{E}_{L_i \sim P(L_i \mid \mathbf{x}, y_{<i})}\left[\sum_i \log P(y_i \mid \textbf{x}, y_{<i}, L_i)\right] \\
& = \sum_i \mathbb{E}_{L_i \sim P(L_i \mid \mathbf{x}, y_{<i})}\left[\log P(y_i \mid \textbf{x}, y_{<i}, L_i)\right] \\
\text{by Jensen's inequality: } \\
& \geq \sum_i \log \mathbb{E}_{L_i \sim P(L_i \mid \mathbf{x}, y_{<i})} \left[ P(y_i \mid \textbf{x}, y_{<i}, L_i)\right] \\
\end{split}
$$
- Assuming the fine-tuning data has some perturbation in $\textbf{x}$: $P(\textbf{x}_{noise}) = \mathcal{N}(\textbf{x},\mathbf{\sigma I}) = \textbf{x} + \epsilon * \mathbf{\sigma}$ where $\epsilon \sim \mathcal{N}(\mathbf{0, I})$
$$
\begin{aligned}
& \sum_i \log P\left(y_i \mid \textbf{x}_{noise}, y_{<i}\right)= \\
& \sum_{LID_i \in\{\text{en}, \text{zh}\}} \sum_i \log \left(\int P\left(y_i \mid \textbf{x}+\epsilon * \mathbf{\sigma}, y_{<i}, LID_i\right) P(LID_i \mid \textbf{x}+\epsilon * \mathbf{\sigma}, y_{<i}) P(\epsilon * \mathbf{\sigma}) d (\epsilon * \mathbf{\sigma})\right)
\end{aligned}
$$
- This shows why **many models cannot handle perturbed (OOD) data**.
- To maximize the modeling, we need to **make $\sigma$ as close to zero** as possible by:
  1. Creating a noise approximator that estimates $\sigma$ given $\textbf{x}_{noise}$
  2. Approximating $\textbf{x}$ by $\textbf{x} \approx \frac{1}{N} \sum_N \textbf{x}_{noise} - \epsilon *\sigma, \ \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
  3. The objective then becomes **minimizing $\sigma$ while maximizing the two probabilities** mentioned previously

## Progress May 29^th^

Derivation from prof:

$$
\begin{split}
\log p(\mathbf{y} \mid \mathbf{x}) & = \log \sum_\ell p(\mathbf{y}, \ell \mid \mathbf{x}) \\
& = \log \sum_\ell \frac{p(\mathbf{y}, \ell \mid \mathbf{x})\, q(\ell)}{q(\ell)} \\
& = \log \mathbb{E}_{q(\ell)}\left[\frac{p(\mathbf{y}, \ell \mid \mathbf{x})}{q(\ell)} \right] \\
\text{by Jensen's inequality:} \\
& \geq \mathbb{E}_{q(\ell)}\left[\log\frac{p(\mathbf{y}, \ell \mid \mathbf{x})}{q(\ell)} \right] \\
& = \mathbb{E}_{q(\ell)}\left[\log p(\mathbf{y}, \ell \mid \mathbf{x}) - \log q(\ell) \right] \\
\text{with } p(\mathbf{y}, \ell \mid \mathbf{x}) = p(\mathbf{y} \mid \ell, \mathbf{x}) \cdot p(\ell \mid \mathbf{x}): \\
& = \mathbb{E}_{q(\ell)}\left[\log p_\theta(\mathbf{y} \mid \mathbf{x}, \ell)\right] - \mathcal{D}_{KL}\left(q(\ell)\, ||\, p(\ell \mid \mathbf{x})\right)
\end{split}
$$

- Question: can I change $q(\mathbf{\ell})$ into $q(\ell \mid \mathbf{x})$,
  so that the last equation becomes:
$$
\mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\log\left[p_\theta(\mathbf{y} \mid \mathbf{x, \ell})\right]- \mathcal{D}_{KL}\left(q(\mathbf{\ell} \mid \mathbf{x}) || p(\mathbf{\ell} \mid \mathbf{x})\right)
$$
- Doing some experiments right now applying the above equation.
  - Still running.
- [LAE: Language-Aware Encoder for Monolingual and Multilingual ASR](https://arxiv.org/abs/2206.02093) ![image](https://hackmd.io/_uploads/BJdWQFNVR.png)
- [LAE-ST-MoE: Boosted Language-Aware Encoder Using Speech Translation Auxiliary Task for E2E Code-switching ASR](https://arxiv.org/abs/2309.16178) (ASRU 2023) ![image](https://hackmd.io/_uploads/HJNYDYV4A.png)
  - Assume that once Z^Man^ and Z^En^ are given, no more information from X is needed.
  - Also assume that Z^Man^ and Z^En^ are **independent**. ![image](https://hackmd.io/_uploads/S1D4QtV4R.png)

## Progress June 3^rd^

$$
\begin{split}
\log p(\mathbf{y} \mid \mathbf{x}) & = \log \sum_\ell p(\mathbf{y, \ell} \mid \mathbf{x}) \\
& = \log \sum_\ell \frac{p(\mathbf{y, \ell} \mid \mathbf{x}) \mathcal{A}}{\mathcal{A}} \\
& = \log \mathbb{E}_{\mathcal{A}}\left[\frac{p(\mathbf{y, \ell} \mid \mathbf{x})}{\mathcal{A}} \right] \\
\text{by Jensen's inequality:} \\
& \geq \mathbb{E}_{\mathcal{A}}\left[\log\frac{p(\mathbf{y, \ell} \mid \mathbf{x})}{\mathcal{A}} \right] \\
& = \mathbb{E}_{\mathcal{A}}\left[\log p(\mathbf{y, \ell} \mid \mathbf{x}) - \log \mathcal{A} \right] \\
\text{with } p(\mathbf{y, \ell} \mid \mathbf{x}) = p(\mathbf{y} \mid \mathbf{\ell, x}) \cdot p(\ell \mid x): \\
& = \mathbb{E}_{\mathcal{A}}\left[\log p(\mathbf{y} \mid \mathbf{x, \ell})\right] - \mathcal{D}_{KL}\left(\mathcal{A} || p(\mathbf{\ell} \mid \mathbf{x})\right)
\end{split}
$$

Here $\mathcal{A}$ can be defined in several ways:

- $\mathcal{A} = q(\mathbf{\ell}) \rightarrow \mathbb{E}_{q(\mathbf{\ell})}\left[\log p(\mathbf{y} \mid \mathbf{x, \ell})\right] - \mathcal{D}_{KL}\left(q(\mathbf{\ell}) || p(\mathbf{\ell} \mid \mathbf{x})\right)$
  - Hard to decide the prior $q(\mathbf{\ell})$
- $\mathcal{A} = p(\mathbf{\ell} \mid \mathbf{x}) \rightarrow \mathbb{E}_{p(\mathbf{\ell} \mid \mathbf{x})}\left[\log p(\mathbf{y} \mid \mathbf{x, \ell})\right]$ ![image](https://hackmd.io/_uploads/BJQHuI9E0.png =500x)
- $\mathcal{A} = q(\mathbf{\ell} \mid \mathbf{x}) \rightarrow \mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\left[\log p(\mathbf{y} \mid \mathbf{x, \ell})\right] - \mathcal{D}_{KL}\left(q(\mathbf{\ell} \mid \mathbf{x}) || p(\mathbf{\ell} \mid \mathbf{x})\right)$ ![image](https://hackmd.io/_uploads/r1VQu8qV0.png =500x)

$$
\mathcal{L} = \mathcal{L}_{CE} + \alpha\mathcal{L}_{\ell} + \beta\mathcal{L}_{\text{var}}
$$

$\mathcal{L}_{\text{var}}$ is not yet defined in this formulation.

| | devman | devsge | ntut | ascend | aishell |
|---------------------|--------|--------|------|--------|---------|
| previous method | 14.2 | 20.8 | 42.4 | 29.9 | 42.5 |
| rescore ($q(\ell \mid \mathbf{x})$) | 14.1 | 20.9 | 32.6 | 26.6 | 32.5 |
| rescore ($p(\ell \mid x)$) | **13.7** | 21.9 | 35.2 | 27.7 | 31.9 |
| rescore ($q_\phi(\ell)$) + AR | 14.8 | 21.6 | 30.1 | **25.8** | **22.2** |
| rescore ($p(\ell \mid x)$) + AR | 14.4 | 21.7 | **29.6** | 23.5 | 23.1 |

<!-- | baseline | 13.8 | **20.2** | 33.1 | 27.2 | 33.2 | -->
## Progress June 12^th^

Title: **Probabilistic Language-Aware Speech Recognition**

![image](https://hackmd.io/_uploads/SyxIZyfXC.png) ![image](https://hackmd.io/_uploads/r1VQu8qV0.png =500x)

| | devman | devsge | ntut | ascend | aishell |
|---------------------|--------|--------|------|--------|---------|
| previous method (bob) | 14.2 | 20.8 | 42.4 | 29.9 | 42.5 |
| rescore ($q(\ell \mid \mathbf{x})$) | **13.7** | **20.0** | **32.9** | 27.7 | **32.3** |

<!-- | baseline | 13.8 | 20.2 | 33.1 | **27.2** | 33.2 | -->

## Progress June 17^th^

- Title: **Probabilistic Language-Aware Speech Recognition**
- Thesis outline ![image](https://hackmd.io/_uploads/H1Zqb6hr0.png)
- Background problem
  - Language confusion:
    - Speech: 我要吃牛肉麵 (all Chinese)
    - Pretrained Whisper: 我要吃牛肉麵
    - After CS FT: 我要吃 new Roman
- Proposed method:
  - Apply an additional module to **enhance the awareness of language** for each token prediction.
- Overall structure ![image](https://hackmd.io/_uploads/S17BsA2B0.png) <!-- ![image](https://hackmd.io/_uploads/SJ-c8kyLA.png) -->
- Language-aware derivation:
$$
\begin{split}
\log p(\mathbf{y} \mid \mathbf{x}) & = \log \sum_\ell p(\mathbf{y, \ell} \mid \mathbf{x}) \\
& = \log \sum_\ell \frac{p(\mathbf{y, \ell} \mid \mathbf{x}) \mathcal{A}}{\mathcal{A}} \\
& = \log \mathbb{E}_{\mathcal{A}}\left[\frac{p(\mathbf{y, \ell} \mid \mathbf{x})}{\mathcal{A}} \right] \\
\text{by Jensen's inequality:} \\
& \geq \mathbb{E}_{\mathcal{A}}\left[\log\frac{p(\mathbf{y, \ell} \mid \mathbf{x})}{\mathcal{A}} \right] \\
& = \mathbb{E}_{\mathcal{A}}\left[\log p(\mathbf{y, \ell} \mid \mathbf{x}) - \log \mathcal{A} \right] \\
\text{with } p(\mathbf{y, \ell} \mid \mathbf{x}) = p(\mathbf{y} \mid \mathbf{\ell, x}) \cdot p(\ell \mid x): \\
& = \mathbb{E}_{\mathcal{A}}\left[\log p(\mathbf{y} \mid \mathbf{x, \ell})\right] - \mathcal{D}_{KL}\left(\mathcal{A} || p(\mathbf{\ell} \mid \mathbf{x})\right)
\end{split}
$$
- Here $\mathcal{A}$ can be defined in several ways:
  - If awareness is assumed to be learned **within** the base model: $\mathcal{A} = p(\mathbf{\ell} \mid \mathbf{x}) \rightarrow \mathbb{E}_{p(\mathbf{\ell} \mid \mathbf{x})}\left[\log p(\mathbf{y} \mid \mathbf{x, \ell})\right]$
    - $p(\ell \mid x) = \int p(\ell \mid y) p(y \mid x) dy$
    - $p(y \mid x) \text{ if } \ell_i == \ell(y_i) \text{ else } 0$
  - If awareness is assumed to be learned **outside** the base model: $\mathcal{A} = q(\mathbf{\ell} \mid \mathbf{x}) \rightarrow \mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\left[\log p(\mathbf{y} \mid \mathbf{x, \ell})\right] - \mathcal{D}_{KL}\left(q(\mathbf{\ell} \mid \mathbf{x}) || p(\mathbf{\ell} \mid \mathbf{x})\right)$
- Language rescoring module methods:
  1. Language awareness introduced **within** the base model $(p_\theta(\ell \mid \mathbf{x}))$ ![image](https://hackmd.io/_uploads/Sy4G3A2rC.png)
  2. Language awareness introduced **outside** the base model $(q_\phi(\ell \mid \mathbf{x}))$ ![image](https://hackmd.io/_uploads/BJOU3R2HR.png)
- Result

| | devman | devsge | ntut | ascend | aishell |
|---------------------|--------|--------|------|--------|---------|
| Previous method (bob) | 14.2 | 20.8 | 42.4 | 29.9 | 42.5 |
| Standard fine-tuning | 14.7 | 21.2 | - | - | - |
| rescore ($q_\phi(\ell \mid \mathbf{x})$) | **13.7** | **20.0** | **32.9** | **27.7** | 32.3 |
| rescore ($p_\theta(\ell \mid \mathbf{x})$) | **13.7** | 21.9 | 35.2 | **27.7** | **31.9** |

<!-- | SOTA | 16.7 | 26.9 | - | - | - | -->

## Progress June 26^th^

https://www.overleaf.com/read/dbwscrgchpsc#7f61a1

![image](https://hackmd.io/_uploads/S1KFdrFIA.png)

## Progress July 8^th^

Overall structure: ![image](https://hackmd.io/_uploads/H1ho4SYPC.png)

External language-aware: ![image](https://hackmd.io/_uploads/SkQ0NSKPA.png)

Internal language-aware: ![image](https://hackmd.io/_uploads/SJ81BrKDR.png)

Overleaf link: https://www.overleaf.com/read/dbwscrgchpsc#7f61a1

Trying a new method: ![image](https://hackmd.io/_uploads/BJPyUrKPR.png)

## Progress July 9^th^

- Why $p(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell})$, while previous papers all use $p(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$?
  - The reason is that many previous works on code-switching **indirectly** introduce language awareness into the model.
  - Attention-Guided Adaptation for Code-Switching Speech Recognition (ICASSP 2024)
    - To predict the next token, the LID attention map also needs to attend to the correct LID token.
    - This shows that LID is a condition for predicting the next token.
  - BA-MoE: Boundary-Aware Mixture-of-Experts Adapter for Code-Switching Speech Recognition (ASRU 2023) ![image](https://hackmd.io/_uploads/By1DXUcwC.png) ![image](https://hackmd.io/_uploads/H1cwmUcD0.png)
    - This method trains the model by separating English and Chinese into different encoders.
    - It forces the model to distinguish between English and Chinese speech and to extract features from the encoder of the correct language.
    - Thus, the prediction of the next token is heavily affected by language information.
  - Adapting the adapters for code-switching in multilingual ASR (ICASSP 2024) ![image](https://hackmd.io/_uploads/rkulrU5P0.png)
    - This method proposes an adapter to detect the switch points in the sentence and to use the correct adapter accordingly.
    - The prediction is heavily affected by the switching pattern (language awareness).
  - An Effective Mixture-Of-Experts Approach For Code-Switching Speech Recognition Leveraging Encoder Disentanglement (ICASSP 2024) ![image](https://hackmd.io/_uploads/Syq_rLqw0.png)
    - As before, the information for the English and Chinese adapters is disentangled, so language awareness is introduced within the model.
  - Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition (Interspeech 2023) ![image](https://hackmd.io/_uploads/H1-uULcP0.png)
    - Mixture-of-experts over languages (language awareness)
  - It is clear that the language of each token is a condition for predicting the next token, especially in code-switching ASR.
- Why Whisper as the base model?
  - Because Whisper itself is already a multilingual model that takes language as a condition.
  - Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization (Interspeech 2023)
    - This paper shows that changing the LID prompt can improve CS performance.
- Why "Probabilistic"?
  - Previous works do not apply language awareness in a **probabilistic way**.
  - Previous works introduce language awareness indirectly through manual engineering (LID attention labels, masking), which **might be suboptimal** compared to directly modeling language awareness.
  - By directly applying language awareness, we expect the model to be more discriminative towards different languages, so language confusion might be reduced.
- Where is the probabilistic language awareness?
  - The probabilistic language awareness is located in the language calibrator.
  - Given the previous assumption that code-switching ASR needs the language $\mathbf{\ell}$ as an additional condition, we expand the original ASR log-probability $\log p(\mathbf{y} \mid \mathbf{x})$ into:
$$
\begin{split}
\log p(\mathbf{y} \mid \mathbf{x}) & = \log \sum_\mathbf{\ell} p(\mathbf{y}, \mathbf{\ell} \mid \mathbf{x}) \\
& = \log \sum_\mathbf{\ell} \frac{p(\mathbf{y}, \mathbf{\ell} \mid \mathbf{x}) q(\mathbf{\ell} \mid \mathbf{x})}{q(\mathbf{\ell} \mid \mathbf{x})} \\
& = \log \mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\left[\frac{p(\mathbf{y}, \mathbf{\ell} \mid \mathbf{x})}{q(\mathbf{\ell} \mid \mathbf{x})} \right] \\
\end{split}
$$
by Jensen's inequality:
$$
\begin{split}
& \geq \mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\left[\log\frac{p(\mathbf{y}, \mathbf{\ell} \mid \mathbf{x})}{q(\mathbf{\ell} \mid \mathbf{x})} \right] \\
& = \mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\left[\log p(\mathbf{y}, \mathbf{\ell} \mid \mathbf{x}) - \log q(\mathbf{\ell} \mid \mathbf{x}) \right] \\
\end{split}
$$
with $p(\mathbf{y}, \mathbf{\ell} \mid \mathbf{x}) = p(\mathbf{y} \mid \mathbf{\ell}, \mathbf{x}) \cdot p(\mathbf{\ell} \mid \mathbf{x})$:
$$
\begin{split}
& = \mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\left[\log p(\mathbf{y} \mid \mathbf{x}, \mathbf{\ell})\right] - \mathcal{D}_{\text{KL}}\left(q(\mathbf{\ell} \mid \mathbf{x}) || p(\mathbf{\ell} \mid \mathbf{x})\right) \\
& = \sum_t \mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\left[\log p(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell}) \right] - \mathcal{D}_{\text{KL}}\left(q(\mathbf{\ell} \mid \mathbf{x}) || p(\mathbf{\ell} \mid \mathbf{x})\right) \\
\end{split}
$$

#### Objectives

- Maximizing $\sum_t \mathbb{E}_{q_{\phi}(\mathbf{\ell} \mid \mathbf{x})}\left[\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell}) \right]$
$$
\begin{split}
& \max_{\theta, \phi} \sum_t \mathbb{E}_{q_{\phi}(\mathbf{\ell} \mid \mathbf{x})}\left[\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell}) \right] \\
& = \max_{\theta, \phi} \sum_t \sum_\mathbf{\ell} q_{\phi}(\mathbf{\ell} \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell})\\
\end{split}
$$
  The term $\sum_\mathbf{\ell} q_{\phi}(\mathbf{\ell} \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell})$ can be expanded into:
$$
\begin{split}
\sum_\mathbf{\ell} q_{\phi}(\mathbf{\ell} \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell}) & = q_{\phi}(\text{en} \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \text{en}) \\
& + q_{\phi}(\text{zh} \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \text{zh}) \\
& + q_{\phi}(\text{other} \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \text{other}) \\
\end{split}
$$
  Suppose that $y_t = \text{"我"}$; then we know that $q_{\phi}(\mathbf{\ell} \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell}) = 0$ for all $\mathbf{\ell} \neq \text{zh}$.
  Then, defining the language of $y_t$ as $\ell_t$, we can simplify the objective into:
$$
\max_{\theta, \phi} \sum_t q_\phi(\ell_{t} \mid \mathbf{x})\log p_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \ell_t)
$$
- Minimizing $\mathcal{D}_{\text{KL}}\left(q(\mathbf{\ell} \mid \mathbf{x}) \,||\, p(\mathbf{\ell} \mid \mathbf{x})\right)$
$$
\min_\phi \left\| p(\mathbf{\ell} \mid \mathbf{x}) - q_\phi(\mathbf{\ell} \mid \mathbf{x}) \right\|^2_2
$$

![image](https://hackmd.io/_uploads/Bk2gNK9PA.png) ![image](https://hackmd.io/_uploads/Hkoczq9P0.png)

## Progress July 12^th^

- Why "Probabilistic"?
  - Previous works do not apply language awareness in a **probabilistic way**.
  - Previous works introduce language awareness indirectly through manual engineering (LID attention labels, masking), which **might be suboptimal** compared to directly modeling language awareness.
  - By directly applying language awareness, we expect the model to be **more discriminative towards different languages**, so language confusion might be reduced.

#### Objectives

$$
\text{ELBO} \triangleq \sum_t \mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\left[\log p(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell}) \right] - \mathcal{D}_{\text{KL}}\left(q(\mathbf{\ell} \mid \mathbf{x}) || p(\mathbf{\ell} \mid \mathbf{x})\right)
$$

- Maximizing $\sum_t \mathbb{E}_{q_{\phi}(\mathbf{\ell} \mid \mathbf{x})}\left[\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell}) \right]$, assuming $\mathbf{\ell}=\{l_k\}_{k=1}^{K}$
$$
\begin{split}
& \max_{\theta, \phi} \sum_t \mathbb{E}_{q_{\phi}(\mathbf{\ell} \mid \mathbf{x})}\left[\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell}) \right] \\
& = \max_{\theta, \phi} \sum_t \sum_k q_{\phi}(l_k \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, l_k)\\
\end{split}
$$
  The term $\sum_k q_{\phi}(l_k \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, l_k)$ can be expanded into:
$$
\begin{split}
\sum_k q_{\phi}(l_k \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, l_k) & = q_{\phi}(l_1 \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, l_1) \\
& + q_{\phi}(l_2 \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, l_2) \\
& + \cdots \\
& + q_{\phi}(l_K \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, l_K) \\
\end{split}
$$
  Suppose that $y_t = \text{"我"}$; then we define $q_{\phi}(l_k \mid \mathbf{x}) = 0$ for all $l_k \neq \text{zh}$.
  Then, defining the language of $y_t$ as $l_t$, we can simplify the objective into:
$$
\max_{\theta, \phi} \sum_t q_\phi(l_t \mid \mathbf{x})\log p_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, l_t)
$$
- Minimizing $\mathcal{D}_{\text{KL}}\left(q(\mathbf{\ell} \mid \mathbf{x}) \,||\, p(\mathbf{\ell} \mid \mathbf{x})\right)$
$$
\min_\phi \left\| p(\mathbf{\ell} \mid \mathbf{x}) - q_\phi(\mathbf{\ell} \mid \mathbf{x}) \right\|^2_2
$$

![image](https://hackmd.io/_uploads/SJ4T4uCDR.png)

A small end-to-end sketch of this combined objective follows below.
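A minimal sketch of how the per-token objective above could be wired up in training code: the token cross-entropy is weighted by the calibrator's probability for the token's own language, and the calibrator is pulled towards the base model's language posterior with the MSE surrogate. All module and tensor names are illustrative, not the actual implementation.

```python
import torch
import torch.nn.functional as F

def language_aware_loss(token_logits, targets, token_lang, q_lang_logits, p_lang, alpha=1.0):
    """Sketch of the July 12 objective.

    token_logits: (B, T, V) decoder logits p_theta(y_t | x, y_<t, .)
    targets:      (B, T)    reference token ids
    token_lang:   (B, T)    language id of each reference token (l_t)
    q_lang_logits:(B, T, K) calibrator logits for q_phi(l | x) per step
    p_lang:       (B, T, K) base model's language posterior p(l | x), detached
    """
    log_p_y = F.log_softmax(token_logits, dim=-1)
    log_p_yt = log_p_y.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # log p(y_t | ...)
    q = F.softmax(q_lang_logits, dim=-1)
    q_lt = q.gather(-1, token_lang.unsqueeze(-1)).squeeze(-1)          # q_phi(l_t | x)
    nll = -(q_lt * log_p_yt).mean()                                    # language-weighted token NLL
    lang_match = F.mse_loss(q, p_lang)                                 # MSE surrogate for the KL term
    return nll + alpha * lang_match
```

At decode time the same $q_\phi(\ell \mid \mathbf{x})$ can be used to rescore hypotheses, which is what the "rescore" rows in the result tables refer to.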
## Progress July 17^th^

![image](https://hackmd.io/_uploads/BkZUBDf_R.png) ![image](https://hackmd.io/_uploads/B14aHPG_C.png)

- Minimizing $\mathcal{D}_{\text{KL}}\left(q(\mathbf{\ell} \mid \mathbf{x}) \,||\, p(\mathbf{\ell} \mid \mathbf{x})\right)$
$$
\min_\phi \sum_k p(l_k \mid \mathbf{x}) \log \frac{p(l_k \mid \mathbf{x})}{q_\phi(l_k \mid \mathbf{x})}
$$
- But this can go to $\infty$ when $q_\phi(l_k \mid \mathbf{x}) \rightarrow 0$, so we add a stabilizing term to the denominator:
$$
\min_\phi \sum_k p(l_k \mid \mathbf{x}) \log \frac{p(l_k \mid \mathbf{x})}{q_\phi(l_k \mid \mathbf{x}) + 10^{-8}}
$$
- Assuming $p(l_k \mid \mathbf{x})$ is (close to) one-hot at the reference language $l_t$, this simplifies to:
$$
\min_\phi \log \frac{1}{q_\phi(l_t \mid \mathbf{x})}
$$

## Progress August 8^th^
