## Progress August 9^th^ 2023
- Whisper (performance is especially poor on Taiwanese-accented speech)
- Whisper does not perform well on monolingual Chinese after fine-tuning on code-switching ASR data.
- Utilize parameter-efficient methods to decrease the WER.
- Introduce unlabeled speech and/or unlabeled transcriptions to the model, hoping to further improve accuracy (MMS paper).
- Datasets can be obtained from websites and YouTube videos.
- Slow inference time
- Inference becomes slower when ladder tuning is added (Bob's research).
- Find a way to speed up inference by either simplifying the ladder or skipping some connections when necessary.
- MMS
- MMS was trained on many more languages than Whisper.
- Hence, its representations should capture more general features than Whisper's.
- Introduce MMS to code-switching, hoping to get a better MER than Whisper.
- SLU
- Utilize a speech encoder model (wav2vec, MMS) to extract emotion from a given speech input.
- This can be useful for preparing the right reply during a conversation (AI Project).
- Dataset:
- [Chinese speech emotion dataset](https://github.com/AkishinoShiame/Chinese-Speech-Emotion-Datasets)
- [Chinese dataset](https://www.twine.net/blog/mandarin-chinese-langauge-datasets/)
- [EMOVIE](https://www.isca-speech.org/archive/pdfs/interspeech_2021/cui21c_interspeech.pdf)
- Code Switch TTS
- Current TTS (especially in the AI Project), although good enough, still does not produce natural-sounding speech.
- Create a TTS model that can speak two languages naturally (EN-ZH).
## Progress August 30^th^ 2023
- Whisper is vulnerable to adversarial attacks [Paper link](https://arxiv.org/abs/2210.17316)
- Example: heavily accented speech **(not yet proven, needs experiments)**
- A proper way to fine-tune efficiently is required for the downstream task (src: Bobbi)
- After fine-tuning, Whisper prioritizes LID over the task token. (Self-observation)
- Given a zh LID token, the **transcribe task** token, and an English utterance, the output will be the utterance translated into zh.
- Whisper's ASR output can be either Simplified Chinese or Traditional Chinese
- Solution:
- Optimize encoder to be more robust
- Modifying prompt or training process to focus on the transcribe task.
## Progress September 19^th^ 2023
- Utilizing [LST](https://arxiv.org/abs/2206.06522) (Ladder Side Tuning) to fine-tune Whisper model
- LST looks promising for memory- and parameter-efficient fine-tuning (a rough sketch appears at the end of this section).
- Implement [Bayesian Deep Learning](https://arxiv.org/abs/2002.08791) (BDL) in Whisper
- But the reference paper **applied it to ResNet18** (convolutional and rather small); need to understand it further and try to apply it to Transformers and Whisper
- **MAYBE** Whisper's adapter can use this idea to obtain a more robust encoder.
- Apply [MEFT](https://arxiv.org/abs/2306.00477) (Memory-Efficient Fine-Tuning), which makes the model reversible so that the **intermediate activations do not need to be cached and can be recomputed**.
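As a rough sketch of the LST idea mentioned above (frozen backbone, small gated side network fed by downsampled intermediate activations); the dimensions and gating form here are placeholders, not the paper's exact recipe:

```python
import torch
import torch.nn as nn

class LadderSideNetwork(nn.Module):
    """Sketch of ladder side tuning: the backbone stays frozen and a small side
    network consumes downsampled intermediate activations through gated lateral
    connections; only the side network (and a task head) is trained."""
    def __init__(self, n_layers, d_model=512, d_side=64, n_heads=4):
        super().__init__()
        self.downsample = nn.ModuleList(nn.Linear(d_model, d_side) for _ in range(n_layers))
        self.side_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_side, n_heads, dim_feedforward=2 * d_side, batch_first=True)
            for _ in range(n_layers))
        self.gates = nn.Parameter(torch.zeros(n_layers))  # learned mix between side state and lateral input
        self.up = nn.Linear(d_side, d_model)

    def forward(self, backbone_hiddens):
        # backbone_hiddens: list of (B, T, d_model) activations from the frozen backbone
        h = torch.zeros_like(self.downsample[0](backbone_hiddens[0].detach()))
        for i, (x, down, blk) in enumerate(zip(backbone_hiddens, self.downsample, self.side_blocks)):
            g = torch.sigmoid(self.gates[i])
            h = blk(g * h + (1 - g) * down(x.detach()))  # detached: no gradient into the backbone
        return self.up(h)
```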
## Progress October 11^th^ 2023
- [A Simple Baseline for Bayesian Uncertainty in Deep Learning](https://proceedings.neurips.cc/paper/2019/hash/118921efba23fc329e6560b27861f0c2-Abstract.html)
- Use the information contained in the SGD trajectory to efficiently approximate the posterior distribution over the weights of the neural network.
- SGD updates the model with the following update rule:

Looking for a way to combine the formula above with the formula below ([source](https://arxiv.org/pdf/2211.15583)), since we are doing parameter-efficient fine-tuning

- Bayesian Model Averaging (BMA) with Stochastic Weight Average Gaussian (SWAG)
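A rough sketch of the SWAG idea described above, in its diagonal form (collect weight iterates along the SGD trajectory, fit a Gaussian to them, then sample weights for Bayesian model averaging); `model`, `loader`, and `loss_fn` are placeholders:

```python
import copy
import torch

def swag_collect(model, loader, loss_fn, optimizer, n_epochs=5, collect_every=50):
    """Run SGD and keep running first/second moments of the flattened weights (diagonal SWAG)."""
    mean = torch.zeros(sum(p.numel() for p in model.parameters()))
    sq_mean = torch.zeros_like(mean)
    n_collected, step = 0, 0
    for _ in range(n_epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
            step += 1
            if step % collect_every == 0:
                w = torch.cat([p.detach().flatten() for p in model.parameters()])
                n_collected += 1
                mean += (w - mean) / n_collected            # running mean of the iterates
                sq_mean += (w * w - sq_mean) / n_collected  # running mean of the squares
    var = (sq_mean - mean ** 2).clamp_min(1e-8)             # diagonal posterior variance
    return mean, var

def swag_sample(model, mean, var):
    """Draw one weight sample from N(mean, diag(var)) and load it into a copy of the model."""
    sample = mean + var.sqrt() * torch.randn_like(mean)
    sampled_model = copy.deepcopy(model)
    offset = 0
    for p in sampled_model.parameters():
        p.data.copy_(sample[offset:offset + p.numel()].view_as(p))
        offset += p.numel()
    return sampled_model

# BMA: average the predictive distributions of several sampled models, e.g.
# probs = torch.stack([swag_sample(model, mean, var)(x).softmax(-1) for _ in range(S)]).mean(0)
```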

- [Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles](https://arxiv.org/abs/1612.01474)
- Proposes an alternative to Bayesian NNs that is simple to implement, readily parallelizable, requires very little hyperparameter tuning, and yields high-quality predictive uncertainty estimates.
- Training criterion for regression:

- The overall training procedure is as follows:
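As a rough sketch of the deep-ensembles recipe (Gaussian NLL training criterion, M independently trained members, mixture prediction); the paper's adversarial-training step is omitted and all names here are placeholders:

```python
import torch
import torch.nn as nn

class GaussianMLP(nn.Module):
    """Each ensemble member predicts a mean and a (log-)variance for regression."""
    def __init__(self, d_in, d_hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mu = nn.Linear(d_hidden, 1)
        self.log_var = nn.Linear(d_hidden, 1)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

def gaussian_nll(mu, log_var, y):
    # Training criterion: -log N(y | mu, sigma^2), up to an additive constant.
    return 0.5 * (log_var + (y - mu) ** 2 / log_var.exp()).mean()

def train_ensemble(make_loader, d_in, n_members=5, epochs=10, lr=1e-3):
    """Train M members independently (different init / data order); no Bayesian machinery needed."""
    members = []
    for seed in range(n_members):
        torch.manual_seed(seed)
        net = GaussianMLP(d_in)
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in make_loader(seed):   # make_loader: placeholder returning a DataLoader
                opt.zero_grad()
                mu, log_var = net(x)
                gaussian_nll(mu, log_var, y).backward()
                opt.step()
        members.append(net)
    return members

def predict(members, x):
    # Combine the members as a mixture of Gaussians: mean of the means,
    # and total variance = average predicted variance + variance of the means.
    mus, vars_ = zip(*[(m(x)[0], m(x)[1].exp()) for m in members])
    mus, vars_ = torch.stack(mus), torch.stack(vars_)
    mean = mus.mean(0)
    var = (vars_ + mus ** 2).mean(0) - mean ** 2
    return mean, var
```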

- [Bayesian Deep Learning and a Probabilistic Perspective of Generalization](https://arxiv.org/abs/2002.08791)
- Proposes the combination of Deep Ensembles and SWAG, called MultiSWAG.
- They argue that generalization depends largely on the **support and inductive biases** of a model.
- Define support as the range of datasets for which $p(\mathcal{D|M}) > 0$
- Define inductive biases as the relative prior probabilities of different datasets --- the distribution of support given by $p(\mathcal{D|M})$
- [Uncertainty Estimation in Deterministic Vision Transformer](https://charliezhaoyinpeng.github.io/UDM-AAAI23/assets/papers/3/CameraReady/UDM_AAAI_23_camera_ready.pdf)
- Replace the dot product similarity with the distance within Banach Space and also normalize the term by a theoretical lower bound of the Lipschitz constant.


- Uncertainty information can be added inside the attention module.
## Progress October 24^th^ 2023
- Use OWSM model as backbone model [Link](https://paperswithcode.com/paper/reproducing-whisper-style-training-using-an/review/)

- OWSM performs better for **English, Chinese, and Japanese**
- **Trained on Simplified Chinese datasets**, so fine-tuning with Traditional Chinese will be hard; it is easier to fine-tune with Simplified Chinese and then convert the text to Traditional
- OWSM V2 result before fine-tuning:
- Sound: [Drive](https://drive.google.com/file/d/1UTZDpnuUBhuXIEUOAcz_OQyjaZrSnqTt/view?usp=sharing)
- Original text: 去使用这个 app 就直接捅一捅一下就下去所以不会很仔细地看这些东西
- Output: \<en\>\<asr\>\<notimestamps\> to to使用这个艾普就自己同意同意同意同意下一步下一步下一步 所以伯父很自己看这些东西 所以伯父很自己看这些东西 就自己同意同意同意同意下一步下一步下一步下一步 所以伯父很自己看这些东西 所以伯父很自己看这些东西 就自己同意同意同意同意下一步下一步下一步下一步下一步 所以伯父很自己看这些东西 就自己同意同意同意同意
- Tried to fine-tune OWSM V1 with a code-switching dataset
- **Tokenizer problem** (all words are tokenized as \<UNK\>)
- Weird output from OWSM V2 even though experiments show that OWSM V2 is the best
- Still analyzing the training process of the OWSM model (it is very new, and the training process differs from Whisper's)
## Progress November 21^st^ 2023
- Several problems were faced when fine-tuning OWSM
- Solved by following the correct fine-tuning procedure given by the OWSM authors
- **Direct inference** on ASCEND (Hong Kong English-Mandarin code-switching dataset)

- ASCEND With fine-tuning

- SEAME result

## Progress December 13^th^ 2023
- Overall structure of OWSM:

- proposed method:
- Apply contrastive learning in the encoder part based on the LID from the CTC
- Possible title: **Contrastive learning for LID separation in CTC-based code switching ASR**
$$
\mathcal{L}_{c} = - \sum_{(i, y_i) \in \mathbf{x}}\log \frac{\frac{1}{\sum_j \mathbb{1}(y_i = y_j)}\sum_{(j, y_j):\, y_j = y_i}\exp(\text{sim}(i, j))}{\sum_{(k, y_k):\, y_k \neq y_i}\exp(\text{sim}(i, k))}
$$
$$
\mathcal{L}_{\text {in }}^{\text {sup }}=\sum_{i \in I} \mathcal{L}_{\text {in }, i}^{\text {sup }}=\sum_{i \in I}-\log \left\{\frac{1}{|P(i)|} \sum_{p \in P(i)} \frac{\exp \left(\boldsymbol{z}_i \cdot \boldsymbol{z}_p / \tau\right)}{\sum_{a \in A(i)} \exp \left(\boldsymbol{z}_i \cdot \boldsymbol{z}_a / \tau\right)}\right\}
$$
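A minimal PyTorch sketch of the supervised contrastive term above (the variant with the average inside the log), treating each frame embedding as an anchor and frames with the same LID label as positives; `embeddings` and `lid_labels` are hypothetical tensors, e.g. encoder frames paired with CTC LID labels:

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_lid_loss(embeddings, lid_labels, temperature=0.1):
    """embeddings: (N, D) frame embeddings; lid_labels: (N,) integer LID per frame."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.T / temperature                                   # pairwise similarities sim(i, j)
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))               # exclude i == j from the denominator
    prob = torch.softmax(sim, dim=1)                              # exp(sim(i,a)) / sum over a != i
    pos_mask = (lid_labels[:, None] == lid_labels[None, :]) & ~self_mask
    # Average the positive (same-LID) probabilities inside the log, as in the second formula above.
    mean_pos_prob = (pos_mask * prob).sum(1) / pos_mask.sum(1).clamp_min(1)
    return -torch.log(mean_pos_prob.clamp_min(1e-12)).mean()
```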
- Using side adapter to make the output "aware" of uncertainty
- Possible title: **Introducing uncertainty to deterministic model via side-adapter by parameter-efficient fine-tuning**
- Inspired by the [paper Pin-Yen presented](https://arxiv.org/pdf/2303.02444.pdf)
- Standard SGPA computes the mean and variance as:
$$
m_d=K_{q k}[\mathrm{v}]_{:, d}, \quad \Sigma_d=K_{q q}+K_{q k}\left(K_{k k}^{-1}[S]_{:,:, d} K_{k k}^{-1}-K_{k k}^{-1}\right) K_{k q}
$$
where $\mathbf{S} \in \mathbb{R}^{T \times T \times d_v}$ is a set of variational covariance parameters, $T$ is the number of inducing points, and $\mathbf{K}$ is a kernel Gram matrix.
- $K$ can be defined as $K(\cdot,\cdot) = K_{base}(h_\theta(\cdot), h_\theta(\cdot))$ where $h_\theta$ is a feature representation output by a DNN
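As a sanity check of the standard SGPA moments above, a toy computation with random placeholder Gram matrices; the sizes and the SPD construction are assumptions, not the paper's setup:

```python
import torch

Tq, Tk, dv = 5, 7, 4                   # query length, number of inducing points/keys, value dim (toy)
Kqq = torch.eye(Tq)                    # placeholder Gram matrices; in SGPA these come from
Kqk = torch.randn(Tq, Tk)              # K_base(h_theta(.), h_theta(.)) on DNN features
A = torch.randn(Tk, Tk)
Kkk = A @ A.T + 1e-3 * torch.eye(Tk)   # symmetric positive definite
S = torch.randn(Tk, Tk, dv)
S = torch.einsum('ijd,kjd->ikd', S, S) # per-dimension SPD variational covariances
v = torch.randn(Tk, dv)                # variational mean parameters
Kkk_inv = torch.linalg.inv(Kkk)

mean = Kqk @ v                         # m_d = K_qk [v]_{:,d}, stacked over d
cov = torch.stack([
    Kqq + Kqk @ (Kkk_inv @ S[..., d] @ Kkk_inv - Kkk_inv) @ Kqk.T  # Sigma_d, with K_kq = K_qk^T
    for d in range(dv)
], dim=-1)
print(mean.shape, cov.shape)           # (Tq, dv), (Tq, Tq, dv)
```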
- The decoupled variant computes the mean and variance as:
$$
\begin{aligned}
\boldsymbol{m}_d & =\boldsymbol{K}_{\boldsymbol{q} \boldsymbol{k}_a}\left[\mathbf{v}_a\right]_{:, d}-\boldsymbol{K}_{\boldsymbol{q} \boldsymbol{k}_g} \boldsymbol{K}_{\boldsymbol{k}_g \boldsymbol{k}_a}\left[\mathbf{v}_a\right]_{:, d}+\boldsymbol{K}_{\boldsymbol{q} \boldsymbol{k}_g}\left[\mathbf{v}_g\right]_{:, d} \\
\boldsymbol{\Sigma}_d & =\boldsymbol{K}_{\boldsymbol{q} \boldsymbol{q}}+\boldsymbol{K}_{\boldsymbol{q} \boldsymbol{k}_g} \boldsymbol{K}_{\boldsymbol{k}_g \boldsymbol{k}_g}^{-1}\left(\left[\boldsymbol{S}_g\right]_{:,:, d}-\boldsymbol{K}_{\boldsymbol{k}_g \boldsymbol{k}_g}\right) \boldsymbol{K}_{\boldsymbol{k}_g \boldsymbol{k}_g}^{-1} \boldsymbol{K}_{\boldsymbol{k}_g \boldsymbol{q}},
\end{aligned}
$$
- `In this work, we are not able to consider pretraining due to the high computational cost, and since SGPA replaces scaled dot-product with a valid kernel, there is no existing pre-trained backbone that can be directly used for the downstream fine-tuning tasks.`
## Progress Jan 2^nd^ 2024
- Found that **OWSM is actually not that great** in performance compared to Whisper
- OWSM is only better in its **transparency**, which is not relevant to our task
- Decide to **go back to Whisper** following Bob's work
- Problem 1:
- A model capable of code-switching is required for a truly multilingual ASR model
- But code-switching datasets are limited, while monolingual datasets for multiple languages can be obtained much more easily
- Propose a method to introduce code-switching capability to an ASR model without introducing code-switching data
- Possible title: **Introducing code-switch capability with uncertainty fine-tuning for ASR model**
- Problem 2:
- Referencing [There is more than one kind of robustness: Fooling Whisper with adversarial examples](https://arxiv.org/abs/2210.17316): **a small noise with an SNR of 35-40 dB** that fools the language detector can **degrade Whisper's performance dramatically.**
- The language-detector weakness was confirmed by Bob's work

- Trying to create **a robust code-switch ASR** with proper fine-tuning method.
- Possible title: Towards Robust Code-Switching ASR with Parameter Efficient Fine Tuning
- Following [Bayesian low-rank adaptation for large language models](https://arxiv.org/pdf/2308.13111.pdf):
- This paper applies a **Laplace approximation to the posterior** over the **LoRA parameters**.
$$
\mathbf{F}(\boldsymbol{\theta})=\sum_{n=1}^N \mathbb{E}_{\mathrm{P}\left(y \mid f_\theta\left(\mathbf{x}_n\right)\right)}\left[\nabla_\theta \mathrm{P}\left(y \mid f_\theta\left(\mathbf{x}_n\right)\right)\left(\nabla_\theta \mathrm{P}\left(y \mid f_\theta\left(\mathbf{x}_n\right)\right)\right)^T\right]
$$
$$
f_\theta\left(\mathbf{x}_*\right) \sim \mathcal{N}\left(f_{\theta_{\text {MAP }}}\left(\mathbf{x}_*\right), \boldsymbol{\Lambda}\right),
$$
$$
\boldsymbol{\Lambda}=\left(\left.\nabla_\theta f_\theta\left(\mathbf{x}_*\right)\right|_{\theta_{\mathrm{MAP}}} ^T\right) \boldsymbol{\Sigma}\left(\left.\nabla_\theta f_\theta\left(\mathbf{x}_*\right)\right|_{\theta_{\mathrm{MAP}}}\right)
$$
- Have to read this paper carefully because it explains how to **carefully sample the weights**.
- **LoRA is a good side adapter** in my case because **its output initially starts from 0** and slowly grows to adapt the model.
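A rough sketch of the Laplace step above restricted to the LoRA parameters, using a diagonal Fisher simplification (the paper itself treats the Fisher more carefully); `model`, `loader`, and `lora_params` are placeholders and the forward pass is assumed to return classification-style logits:

```python
import torch

def diagonal_fisher(model, lora_params, loader, n_samples=1):
    """Diagonal Fisher over the LoRA parameters only:
    F ≈ E_x E_{y~p(y|x)} [(d/dθ log p(y|x))^2], accumulated per parameter."""
    fisher = [torch.zeros_like(p) for p in lora_params]
    n = 0
    for x, _ in loader:                       # labels are resampled from the model's own predictive
        logits = model(x)
        dist = torch.distributions.Categorical(logits=logits)
        for _ in range(n_samples):
            y = dist.sample()
            log_prob = dist.log_prob(y).sum()
            grads = torch.autograd.grad(log_prob, lora_params, retain_graph=True)
            for f, g in zip(fisher, grads):
                f += g.detach() ** 2
            n += 1
    return [f / n for f in fisher]

def posterior_variance(fisher, prior_precision=1.0):
    # Diagonal Laplace posterior over LoRA weights: Sigma ≈ (Fisher + prior precision)^{-1},
    # which is what enters the linearized predictive f(x*) ~ N(f_MAP(x*), Lambda) above.
    return [1.0 / (f + prior_precision) for f in fisher]
```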
## Progress Jan 17^th^
- Run Whisper with LoRA as the adapter (r=6); a sketch of the setup appears at the end of this section

- The other results are still in progress
- Currently still using \<en\> only instead of \<zh\>\<en\> like Bob does; still ongoing (problem with the Whisper version)
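A minimal sketch of the LoRA (r=6) run above with the Hugging Face `transformers`/`peft` stack; the target modules and hyperparameters other than r are assumptions, not necessarily the exact configuration used here:

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# r=6 as in the experiment above; targeting the attention projections is an assumption.
lora_config = LoraConfig(
    r=6,
    lora_alpha=12,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices remain trainable
```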

## Progress Jan 22^nd^

## Progress Jan 24^th^
- Comparing the attention maps between the adapter and LoRA methods
- **Forgot to add LayerNorm for the QK map**
- Layer 2, head 8


- Layer 8, head 11


- There are still **some problems related to the attention maps**; will fix soon
- Running adapter and LoRA for 30 epochs to see the convergence of the model

## Progress Jan 29^th^

- Found a normal fine-tuning setup that is **"comparable"** to Bob's method
- Find a method to apply uncertainty to the model
- [Bayesian Attention Modules](https://arxiv.org/abs/2010.10604) (NIPS 2020)
- [Bayesian low-rank adaptation for large language models](https://arxiv.org/abs/2308.13111) (under review, ICLR 2024)
- [LoRA ensembles for large language model fine-tuning](https://openreview.net/forum?id=X5VElAKt2s) (under review, ICLR 2024)
- Several methods to evaluate the model (an ECE sketch appears at the end of this section)
- Negative Log-Likelihood (NLL)
- ECE (Expected Calibration Error)
- OOD datasets (LibriSpeech, ASCEND, NTUT)
- Possible title:
- **KUNAM: Knowledge-based UNcertainty in Attention Module for Robust Code-Switch ASR**
- **Whispering Skepticism: Uncertainty-Aware Whisper for Accurate Code-Switch Speech Recognition**
- **Whisper with a Grain of Salt: Whisper with Uncertainty Awareness for Precise Code-Switch Speech Recognition**
- **Whispering in Shades of Uncertainty: A Bayesian Framework for Whisper in Code-Switch ASR**
- **Whispering with Caution: Bayesian Fine-tuning for Whisper in Code-Switch ASR**
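Since ECE is listed as an evaluation metric above, a minimal sketch of a generic binned ECE over per-token confidences (not tied to any particular toolkit):

```python
import torch

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare the average confidence
    with the empirical accuracy in each bin, weighted by bin size."""
    confidences = torch.as_tensor(confidences, dtype=torch.float)
    correct = torch.as_tensor(correct, dtype=torch.float)
    bins = torch.linspace(0, 1, n_bins + 1)
    ece = torch.tensor(0.0)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.float().mean() * (correct[mask].mean() - confidences[mask].mean()).abs()
    return ece.item()

# Example: confidences = predicted token probabilities, correct = 1/0 per token.
# expected_calibration_error([0.9, 0.6, 0.8], [1, 0, 1])
```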
## Progress Feb 21^st^
Overleaf: https://www.overleaf.com/read/dbwscrgchpsc#7f61a1
Title: **KUNAM: Knowledge-based UNcertainty in Attention Module for Accent-Robust ASR**
- Change title to "accent-robust" ASR
- Had a discussion with Mahdin and decided to focus on heavy-accent speech robustness and out-of-domain data.
- In-domain data: AISHELL (zh), still undecided for (en), ASCEND (cs, additionally, if time allows)
- Out-of-domain data: SEAME (zh), SEAME (en), SEAME (cs, additionally, if time allows)

- Trying to print out the NLL and ECE of the model, but according to the authors, NLL and ECE are not suitable metrics for an autoregressive Transformer:
>Given that this is the first investigation of Laplace for LoRA fine-tuning, we chose to focus on multiple-choice QA because that allowed us to use robust, well-understood calibrations metrics like ECE and NLL.
>As a next step, we are indeed excited by the possibility of investigating free-text generation: Laplace-LoRA certainly could be applied in this setting. However, **the development of robust, well-understood calibration metrics for free-text generation remains a challenging and open research question**. Given the complexity of evaluating calibration in the free-text setting, this extension is out-of-scope here, and we leave it for future work.
- So instead, just use the usual MER/WER/CER to see whether uncertainty in ASR helps decrease the error rate.
- Currently running experiments; the results are not satisfying yet.
## Progress March 4^th^
- Change topic to: **Enhancing Code-Switching ASR with Accent-Robust Variational Neural Networks**
- Abstract:
Code-switching has been a hot topic in Automatic Speech Recognition (ASR), and many research papers achieve satisfying results on their own datasets. However, their performance has yet to be explored on other accents, and unseen accents have been shown to degrade the performance of ASR models. In this paper, a Variational Neural Network (VNN) is utilized to obtain robust code-switching results under unseen accents. Experiments show that the proposed method achieves higher performance than previous methods on unseen accents while achieving comparable results on seen accents.
- Proposed Method:
- Current method

- Apply Variational Inference on the attention output

- Apply Variational Inference on the attention weight

- Prior: $p(z) = \mathcal{N}(\boldsymbol{\mu}, \mathbf{0}) = z_{linear} \text{ or } \mathbf{QK}^\top$ (a degenerate Gaussian, i.e., the original deterministic activation)
- Posterior approximation: $q(z) = \mathcal{N}(\mu, \mathbf{\Sigma})$, sampled as $\mu_i + \sigma_i \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, \mathbf{I})$ and $\sigma = \sqrt{\Sigma}$
- The adapter output is the covariance/variance, which needs to be non-negative, so adding a **ReLU to the output is a good solution (thanks to Mahdin)**; a sketch appears at the end of this section
- Add new loss
$$
\operatorname{KL}\left[q\left(\boldsymbol{z}_t\right) \| p_{\mathrm{r}}\left(\boldsymbol{z}_t\right)\right]= \quad \sum_{i=1}^M\left\{\log \frac{\sigma_{t, i}^{\mathrm{r}}}{\sigma_{t, i}}+\frac{\sigma_{t, i}^2+\left(\mu_{t, i}-\mu_{t, i}^{\mathrm{r}}\right)^2}{2 \sigma_{t, i}^{\mathrm{r}}{ }^2}-\frac{1}{2}\right\}
$$
- Need to find a way to control the covariance.
- Dataset used:
- SEAME (Singaporean and Malaysian accented code-switching)
- ASCEND (Hong Kong accented code-switching)
- NTUT (Taiwanese accented code-switching)
All the datasets will be converted to Simplified Chinese for consistency.
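A rough sketch of the variational-adapter idea above (reparameterized sampling around the deterministic attention output, ReLU-constrained variance, and the Gaussian KL as the extra loss); module sizes and the non-degenerate prior variance are placeholders:

```python
import torch
import torch.nn as nn

class VariationalAdapter(nn.Module):
    """The deterministic attention output is treated as the mean; a small adapter
    predicts a per-dimension variance and we sample with the reparameterization trick."""
    def __init__(self, d_model, bottleneck=32):
        super().__init__()
        self.var_head = nn.Sequential(
            nn.Linear(d_model, bottleneck), nn.GELU(),
            nn.Linear(bottleneck, d_model), nn.ReLU(),   # ReLU keeps the predicted variance >= 0
        )

    def forward(self, attn_out):
        mu = attn_out                                    # mean = the original activation
        var = self.var_head(attn_out) + 1e-6             # small floor so sigma > 0
        z = mu + var.sqrt() * torch.randn_like(mu)       # z = mu + sigma * eps
        return z, mu, var

def kl_to_prior(mu, var, mu_prior, var_prior=1.0):
    # KL[ N(mu, var) || N(mu_prior, var_prior) ] per dimension, summed over the feature axis,
    # matching the KL loss written above.
    var_prior = torch.as_tensor(var_prior)
    return 0.5 * (var_prior.log() - var.log() + (var + (mu - mu_prior) ** 2) / var_prior - 1.0).sum(-1)
```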
## Progress March 7^th^
BASELINE


- Parameter trained: 1.47 M (0.6%)
- Need more parameters to adapt.
## Progress March 11^th^
- Background formula (please double-check the validity of this reasoning):
- We assume that a previously robust pretrained model can successfully model $P(\mathbf{Y} \mid \mathbf{X})$, and given a new dataset $\mathbf{X}'$, we fine-tune the model to fit $P(\mathbf{Y} \mid \mathbf{X}')$.
- But this results in forgetting: since $\mathbf{X}' \not\subset \mathbf{X}$, the model fails to model the original $P(\mathbf{Y} \mid \mathbf{X})$; we call $\mathbf{X}'$ the OOD dataset.
- We want the model to keep modeling $P(\mathbf{Y} \mid \mathbf{X})$, but to handle the input $\mathbf{X}'$ by introducing a module that models $P(\mathbf{X} \mid \mathbf{X}')$
- Here we assume that $P(\mathbf{X}') = \mathcal{N}(\mathbf{X}, \sigma\mathbf{I})$ (expanding the distribution of $\mathbf{X}$)
- Obtain $\sigma$ from $cov_{\phi}(\mathbf{x}')$, where $f_\theta(\cdot)$ is the pretrained model and $cov_\phi$ is a module parameterized by $\phi$ that produces $\sigma$ given $\mathbf{x}'$
- Then, to recover $\mathbf{x}$, we can use an MC approximation of the mean (a quick numeric check follows the formula):
$$
\mathbf{x} \approx \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}'_n, \qquad \mathbf{x}'_n \sim \mathcal{N}(\mathbf{x}, \sigma\mathbf{I})
$$
when $N$ is large enough.
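A quick numeric check of the MC approximation above: averaging independent draws $\mathbf{x}'_n \sim \mathcal{N}(\mathbf{x}, \sigma\mathbf{I})$ recovers $\mathbf{x}$ as $N$ grows (in practice only one $\mathbf{x}'$ is observed per utterance, so this relies on the $cov_\phi$ module and repeated sampling):

```python
import torch

d, N = 8, 10000
x = torch.randn(d)                        # "clean" representation
sigma = 0.5 * torch.ones(d)               # assumed noise scale
x_noisy = x + sigma * torch.randn(N, d)   # N independent draws x'_n ~ N(x, sigma I)
x_hat = x_noisy.mean(0)                   # MC estimate of x; error shrinks like sigma / sqrt(N)
print((x_hat - x).abs().max())
```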
- Baseline (Bob's fine-tuning method for code-switching):

- Proposed method:
- $\mathbf{x}'$ obtained from attention **input**
- $\mathbf{x}'$ obtained from attention **output**
- Bob's model result:
- Proposed model result:
- <audio src="https://drive.google.com/file/d/1DvXGO3Pv71PRX8l40alaOZqRdKlZ0FZA/view?usp=sharing" controls audio type="audio/wav"></audio>
- <audio src="https://drive.google.com/file/d/1S-JQ9s0VtlVWb-9vpCibNf0Ak7EpAQ_Y/view?usp=sharing" controls audio type="audio/wav"></audio>
## Progress March 27^th^
- [Variational information bottleneck for effective low-resource fine-tuning](https://arxiv.org/abs/2106.05469.pdf)

- [Variational information bottleneck for effective low-resource fine-tuning](https://arxiv.org/abs/2106.05469.pdf)
- [Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck](https://arxiv.org/abs/2204.01387.pdf)
- [Towards Diverse, Relevant and Coherent Open-Domain Dialogue Generation via Hybrid Latent Variables](https://arxiv.org/abs/2212.01145)
- [Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space](https://arxiv.org/abs/1711.07068)
- [The Expressive Power of Low-Rank Adaptation](https://arxiv.org/abs/2310.17513)
## Progress April 2^nd^
<!-- ### The Expressive Power of Low-Rank Adaptation
- Expressive Power of Transformer Layers with LoRA
- Transformer network, denoted as $\mathrm{TFN}_{L, D}$, is a composition of $L$ Transformer blocks and an output layer, parameterized by weight $\boldsymbol{W}_o \in \mathbb{R}^{D \times D}$.
- Each transformer block comprises a $H$-head selfattention layer, parameterized by weight $\left(\left(\boldsymbol{W}_{O l}^h, \boldsymbol{W}_{V l}^h, \boldsymbol{W}_{K l}^h, \boldsymbol{W}_{Q l}^h\right)_{h=1}^H\right)_{l=1}^L$, followed by a tokenwise feedforward layer, parameterized by weight $\left(\boldsymbol{W}_{1 l}, \boldsymbol{W}_{2 l}\right)_{l=1}^L$ and bias $\left(\boldsymbol{b}_{1 l}, \boldsymbol{b}_{2 l}\right)_{l=1}^L$.
- Assume that all weight matrices have a dimension of $D \times D$, while the bias vectors are of dimension $D$.



$\mathrm{LR}_r(\cdot)$ : best rank- $r$ approximation of a square matrix in Frobenuis norm and spectral norm. The subscript $r$ may be omitted to indicate a general low-rank approximation without specifying the rank.
 -->

$$
K L(p, q)=\log \frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+\left(\mu_1-\mu_2\right)^2}{2 \sigma_2^2}-\frac{1}{2}
$$
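A quick numeric check of the univariate Gaussian KL above against `torch.distributions` (arbitrary example values):

```python
import torch
from torch.distributions import Normal, kl_divergence

mu1, sigma1 = torch.tensor(0.3), torch.tensor(0.8)
mu2, sigma2 = torch.tensor(0.0), torch.tensor(1.0)

# Closed form from the formula above.
closed_form = (sigma2 / sigma1).log() + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5
print(closed_form, kl_divergence(Normal(mu1, sigma1), Normal(mu2, sigma2)))  # should match
```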
<!-- $$
\begin{aligned}
& \log P(\boldsymbol{Y} \mid \boldsymbol{X})=\log \prod_{t=1}^n P\left(\boldsymbol{y}_t \mid \boldsymbol{y}_0, \ldots, \boldsymbol{y}_{t-1}, \boldsymbol{X}\right) \\
& \approx \log \prod_{t=1}^n \int P\left(\boldsymbol{y}_t \mid \boldsymbol{y}_0, \ldots, \boldsymbol{y}_{t-1}, \boldsymbol{z}_t\right) p\left(\boldsymbol{z}_t \mid \boldsymbol{h}_t, \boldsymbol{X}\right) \mathrm{d} \boldsymbol{z}_t
\end{aligned}
$$ -->
## Progress April 17^th^
- Title: **Visper: parameter-efficient VIB fine-tuning for accent robust ASR with Whisper**
- **Red and green** mean **lose and win**, respectively, compared to Bob's method
- Comparing the best beta

- A smaller beta is better (but 0 can perform worse)
- Need to set a smaller beta
- Comparing which prior to use

- Using noise as the prior performs better
- Currently training using parameters for the mean and variance
- Considering a 2-stage approach

- Performance is not good
## Progress April 29^th^
- Title:
- Irrelevant Information Filtering Through Layer-Wise Variational Fine-Tuning
- Layer-Wise LoRA-VIB for Efficient Code-Switching Speech Recognition
- Advancing Whisper Code-Switching Capability with Layer-Wise LoRA-VIB Adaptation
- Balancing Monolingual and Code-Switching Capabilities in Whisper Using Variational-LoRA
- Layer-Wise LoRA-VIB in Whisper: An Approach for Balancing Monolingual and Code-Switching Capabilities
- Background:
- Speech models nowadays achieve satisfying results in **monolingual ASR**
- But experiments show that these models **cannot perform well on code-switching**, even though the languages in the sentence were learned by the speech model
- Fine-tuning on a code-switching dataset is a popular way to tackle this issue
- But by fine-tuning on that dataset, the **generalization of the speech model is perturbed by the fine-tuning process**
- This research aims to **introduce only the code-switching capability to the speech model while retaining its generality** during the fine-tuning process
- Experiments

- Changing the KL function into an **ordinary MSE loss on the mean** performs better
## Progress May 6^th^
Title: **Balancing Monolingual and Code-Switching Capabilities in Whisper Using Variational-LoRA**
<!-- 

 -->
- Currently searching for a method to prevent overfitting to the target dataset
- Introduce code-switching capability $\rightarrow$ introduce only the language transition $\rightarrow$ should be related to the **decoder part**
- [Sparsely Shared LoRA on Whisper for Child Speech Recognition](https://arxiv.org/abs/2309.11756) (ICASSP 2024)
- Adapting with AdaLoRA for child speech recognition (low-resource speech & zero-shot)
- LoRA shared for each blocks (Enc-SAM, Enc-FFM, Dec-SAM, Dec-CAM, and Dec-FFM)
- [An Effective Mixture-of-Experts Approach for Code-Switching Speech Recognition Leveraging Encoder Disentanglement](https://arxiv.org/abs/2402.17189) (ICASSP 2024)

- [Cross-Modal Parallel Training for Improving end-to-end Accented Speech Recognition](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10447979) (ICASSP 2024)

- [Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps](https://openreview.net/pdf?id=mYWsyTuiRp) (ICLR 2024)
- [Incorporating Residual and Normalization Layers into Analysis of Masked Language Models](https://arxiv.org/abs/2109.07152) (EMNLP 2021)
- [Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning](https://openreview.net/forum?id=YR3ETaElNK) (ICLR 2024)
## Progress May 13^th^
- Recent trend: modifying decoding to mitigate hallucination
- [DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models](https://arxiv.org/pdf/2309.03883)



- [Constrained Decoding for Cross-lingual Label Projection](https://arxiv.org/pdf/2402.03131)

- Many papers modify the decoding process to mitigate hallucination
- Hallucination can be associated with the wrong language in CS
## Progress May 15^th^
Title:
- **Unlocking the Power of Authentic Bilingual Speech Recognition: Revolutionizing Robustness and Accuracy**
- **Towards Authentic Robust Bilingual Speech Recognition**
- **Towards Robust Bilingual Speech Recognition**

- [Large Language Models are Efficient Learners of Noise-Robust Speech Recognition](https://arxiv.org/abs/2401.10446) (ICLR 2024)
- [It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition](https://arxiv.org/abs/2402.05457) (ICLR 2024)
## Progress May 20^th^ & May 22^nd^
Title: **Probabilistic Language-Aware Speech Recognition**
~~little modification into: **Probabilistic Speech Recognition with Language-Aware Enhancements**~~
- Introduce LID to the probability
$$
\begin{split}
\log P(\textbf{y} \mid \textbf{x}) & =\sum_i \log P(y_i \mid \textbf{x}, y_{<i}) \\
\text{Under the language conditional assumption: } \\
&= \sum_i \log \int P(y_i \mid \mathbf{x}, y_{<i}, L_i) P(L_i \mid \mathbf{x}, y_{<i})\, dL_i\\
& = \sum_i \log \mathbb{E}_{L_i \sim P(L_i \mid \mathbf{x}, y_{<i})}\left[P(y_i \mid \textbf{x}, y_{<i}, L_i)\right] \\
\text{by Jensen's Inequality: } \\
& \geq \sum_i \mathbb{E}_{L_i \sim P(L_i \mid \mathbf{x}, y_{<i})}\left[\log P(y_i \mid \textbf{x}, y_{<i}, L_i)\right] \\
\end{split}
$$
- Assuming the fine-tuning data has some perturbation in $\textbf{x}$: $P(\textbf{x}_{noise}) = \mathcal{N}(\textbf{x},\sigma\mathbf{I})$, i.e., $\textbf{x}_{noise} = \textbf{x} + \epsilon\sigma$ where $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
$$
\begin{aligned}
& \sum_i \log P\left(y_i \mid \textbf{x}_{noise}, y_{<i}\right)= \\
& \sum_i \log \left(\sum_{\text{LID}_i \in\{\text{en}, \text{zh}\}} \int P\left(y_i \mid \textbf{x}+\epsilon\sigma, y_{<i}, \text{LID}_i\right) P(\text{LID}_i \mid \textbf{x}+\epsilon\sigma, y_{<i})\, P(\epsilon\sigma)\, d(\epsilon\sigma)\right)
\end{aligned}
$$
- This shows why **many models cannot handle perturbed data (OOD)**.
- To maximize the likelihood, we need to **make $\sigma$ as close to zero** as possible by:
1. Creating a noise approximator that approximates $\sigma$ given $\textbf{x}_{noise}$
2. Approximating $\textbf{x}$ by $\textbf{x} \approx \frac{1}{N} \sum_{n=1}^{N} (\textbf{x}_{noise} - \epsilon_n\sigma), \; \epsilon_n \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
3. The objective then turns into **minimizing $\sigma$ while maximizing the two probabilities** mentioned previously
## Progress May 29^th^
Derivation from prof:
$$
\begin{split}
\log p(\mathbf{y} \mid \mathbf{x}) & = \log \sum_\ell p(\mathbf{y, \ell} \mid \mathbf{x}) \\
& = \log \sum_\ell \frac{p(\mathbf{y, \ell} \mid \mathbf{x})\, q(\mathbf{\ell})}{q(\mathbf{\ell})} \\
& = \log \mathbb{E}_{q(\mathbf{\ell})}\left[\frac{p(\mathbf{y, \ell} \mid \mathbf{x})}{q(\mathbf{\ell})} \right] \\
\text{by Jensen's Inequality:} \\
& \geq \mathbb{E}_{q(\mathbf{\ell})}\left[\log\frac{p(\mathbf{y, \ell} \mid \mathbf{x})}{q(\mathbf{\ell})} \right] \\
& = \mathbb{E}_{q(\mathbf{\ell})}\left[\log p(\mathbf{y, \ell} \mid \mathbf{x}) - \log q(\mathbf{\ell}) \right] \\
\text{with } p(\mathbf{y, \ell} \mid \mathbf{x}) = p(\mathbf{y} \mid \mathbf{\ell, x}) \cdot p(\ell \mid x): \\
& = \mathbb{E}_{q(\mathbf{\ell})}\left[\log p_\theta(\mathbf{y} \mid \mathbf{x, \ell})\right] - \mathcal{D}_{KL}\left(q(\mathbf{\ell}) || p(\mathbf{\ell} \mid \mathbf{x})\right)
\end{split}
$$
- Question:
- Can I change $q(\mathbf{\ell})$ into $q(\mathbf{\ell} \mid \mathbf{x})$, so that the last equation becomes:
$$
\mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\log\left[p_\theta(\mathbf{y} \mid \mathbf{x, \ell})\right]- \mathcal{D}_{KL}\left(q(\mathbf{\ell \mid \mathbf{x}}) || p(\mathbf{\ell} \mid \mathbf{x})\right)
$$
- Doing some experiments right now applying the above equation
- Still running
- [LAE: Language-Aware Encoder for Monolingual and Multilingual ASR](https://arxiv.org/abs/2206.02093)

- [LAE-ST-MoE: Boosted Language-Aware Encoder Using Speech Translation Auxiliary Task for E2E Code-switching ASR](https://arxiv.org/abs/2309.16178) (ASRU 2023)

- Assume that once Z^Man^ and Z^En^ are given, no more information from X is needed
- And assume that Z^Man^ and Z^En^ are **independent**

## Progress June 3^rd^
$$
\begin{split}
\log p(\mathbf{y} \mid \mathbf{x}) & = \log \sum_\ell p(\mathbf{y, \ell} \mid \mathbf{x}) \\
& = \log \sum_\ell \frac{p(\mathbf{y, \ell} \mid \mathbf{x}) \mathcal{A}}{\mathcal{A}} \\
& = \log \mathbb{E}_{\mathcal{A}}\left[\frac{p(\mathbf{y, \ell} \mid \mathbf{x})}{\mathcal{A}} \right] \\
\text{by Jensen's Inequality:} \\
& \geq \mathbb{E}_{\mathcal{A}}\left[\log\frac{p(\mathbf{y, \ell} \mid \mathbf{x})}{\mathcal{A}} \right] \\
& = \mathbb{E}_{\mathcal{A}}\left[\log p(\mathbf{y, \ell} \mid \mathbf{x}) - \log \mathcal{A} \right] \\
\text{with } p(\mathbf{y, \ell} \mid \mathbf{x}) = p(\mathbf{y} \mid \mathbf{\ell, x}) \cdot p(\ell \mid x): \\
& = \mathbb{E}_{\mathcal{A}}\left[\log p(\mathbf{y} \mid \mathbf{x, \ell})\right] - \mathcal{D}_{KL}\left(\mathcal{A} || p(\mathbf{\ell} \mid \mathbf{x})\right)
\end{split}
$$
Here $\mathcal{A}$ can be defined in several ways:
- $\mathcal{A} = q(\mathbf{\ell}) \rightarrow \mathbb{E}_{q(\mathbf{\ell} )}\left[\log p(\mathbf{y} \mid \mathbf{x, \ell})\right] - \mathcal{D}_{KL}\left(q(\mathbf{\ell}) || p(\mathbf{\ell} \mid \mathbf{x})\right)$
- Hard to decide the prior $q(\mathbf{\ell})$
- $\mathcal{A} = p(\mathbf{\ell} \mid \mathbf{x}) \rightarrow \mathbb{E}_{p(\mathbf{\ell} \mid \mathbf{x})}\left[\log p(\mathbf{y} \mid \mathbf{x, \ell})\right]$

- $\mathcal{A} = q(\mathbf{\ell} \mid \mathbf{x}) \rightarrow \mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\left[\log p(\mathbf{y} \mid \mathbf{x, \ell})\right] - \mathcal{D}_{KL}\left(q(\mathbf{\ell} \mid \mathbf{x}) || p(\mathbf{\ell} \mid \mathbf{x})\right)$

$$
\mathcal{L} = \mathcal{L}_{CE} + \alpha\mathcal{L}_{\ell} + \beta\mathcal{L}_{\text{var}}
$$
$\mathcal{L}_{\text{var}}$ is not yet defined in this formulation
| | devman | devsge | ntut | ascend | aishell |
|---------------------|--------|--------|------|--------|---------|
| previous method | 14.2 | 20.8 | 42.4 | 29.9 | 42.5 |
| rescore ($q(\ell \mid \mathbf{x})$) | 14.1 | 20.9 | 32.6 | 26.6 | 32.5 |
| rescore ($p(\ell \mid x)$) | **13.7** | 21.9 | 35.2 | 27.7 | 31.9 |
| rescore ($q_\phi(\ell)$) + AR | 14.8 | 21.6 | 30.1 | **25.8** | **22.2** |
| rescore ($p(\ell \mid x)$) + AR | 14.4 | 21.7 | **29.6** | 23.5 | 23.1 |
<!-- | baseline | 13.8 | **20.2** | 33.1 | 27.2 | 33.2 | -->
## Progress June 12^th^
Title: **Probabilistic Language-Aware Speech Recognition**


| | devman | devsge | ntut | ascend | aishell |
|---------------------|--------|--------|------|--------|---------|
| previous method (bob) | 14.2 | 20.8 | 42.4 | 29.9 | 42.5 |
| rescore ($q(\ell \mid \mathbf{x})$) | **13.7** | **20.0** | **32.9** | 27.7 | **32.3** |
<!-- | baseline | 13.8 | 20.2 | 33.1 | **27.2** | 33.2 | -->
## Progress June 17^th^
- Title: **Probabilistic Language-Aware Speech Recognition**
- Thesis outline

- Background Problem
- Language Confusion:
- Speech: 我要吃牛肉麵 (all Chinese; "I want to eat beef noodles")
- Pretrained Whisper: 我要吃牛肉麵
- After CS FT: 我要吃 new Roman (language confusion: the Mandarin 牛肉麵 is decoded as English-like tokens)
- Proposed Method:
- Apply an additional module to **enhance the awareness of language** for each token prediction.
- Overall Structure

<!--  -->
- Language aware derivation:
$$
\begin{split}
\log p(\mathbf{y} \mid \mathbf{x}) & = \log \sum_\ell p(\mathbf{y, \ell} \mid \mathbf{x}) \\
& = \log \sum_\ell \frac{p(\mathbf{y, \ell} \mid \mathbf{x}) \mathcal{A}}{\mathcal{A}} \\
& = \log \mathbb{E}_{\mathcal{A}}\left[\frac{p(\mathbf{y, \ell} \mid \mathbf{x})}{\mathcal{A}} \right] \\
\text{by Jensen's Inequality:} \\
& \geq \mathbb{E}_{\mathcal{A}}\left[\log\frac{p(\mathbf{y, \ell} \mid \mathbf{x})}{\mathcal{A}} \right] \\
& = \mathbb{E}_{\mathcal{A}}\left[\log p(\mathbf{y, \ell} \mid \mathbf{x}) - \log \mathcal{A} \right] \\
\text{with } p(\mathbf{y, \ell} \mid \mathbf{x}) = p(\mathbf{y} \mid \mathbf{\ell, x}) \cdot p(\ell \mid x): \\
& = \mathbb{E}_{\mathcal{A}}\left[\log p(\mathbf{y} \mid \mathbf{x, \ell})\right] - \mathcal{D}_{KL}\left(\mathcal{A} || p(\mathbf{\ell} \mid \mathbf{x})\right)
\end{split}
$$
- Here $\mathcal{A}$ can be defined in several ways:
- If awareness is assumed to be learned **within** the base model: $\mathcal{A} = p(\mathbf{\ell} \mid \mathbf{x}) \rightarrow \mathbb{E}_{p(\mathbf{\ell} \mid \mathbf{x})}\left[\log p(\mathbf{y} \mid \mathbf{x, \ell})\right]$
- $p(\ell \mid \mathbf{x}) = \int p(\ell \mid y)\, p(y \mid \mathbf{x})\, dy$
- where $p(\ell \mid y) = 1$ if $\ell = \ell(y)$ and $0$ otherwise, so this reduces to summing $p(y \mid \mathbf{x})$ over the tokens $y$ of language $\ell$ (a sketch of this marginalization appears at the end of this section)
- If awareness is assumed to be learned **outside** the base model: $\mathcal{A} = q(\mathbf{\ell} \mid \mathbf{x}) \rightarrow \mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\left[\log p(\mathbf{y} \mid \mathbf{x, \ell})\right] - \mathcal{D}_{KL}\left(q(\mathbf{\ell} \mid \mathbf{x}) || p(\mathbf{\ell} \mid \mathbf{x})\right)$
- Language rescoring module methods:
1. Language Awareness introduced **within** the base model $(p_\theta(\ell \mid \mathbf{x}))$

2. Language awareness introduced **outside** the base model $(q_\phi(\ell \mid \mathbf{x}))$

- Result
| | devman | devsge | ntut | ascend | aishell |
|---------------------|--------|--------|------|--------|---------|
| Previous method (bob) | 14.2 | 20.8 | 42.4 | 29.9 | 42.5 |
| Standard Finetuning | 14.7 | 21.2 | - | - | - |
| rescore ($q_\phi(\ell \mid \mathbf{x})$) | **13.7** | **20.0** | **32.9** | **27.7** | 32.3 |
|rescore ($p_\theta(\ell \mid \mathbf{x})$) | **13.7** | 21.9 | 35.2 | **27.7** | **31.9** |
<!-- | SOTA | 16.7 | 26.9 | - | - | - | -->
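A rough sketch of how the **internal** language-awareness $p_\theta(\ell \mid \mathbf{x})$ above could be computed by marginalizing the per-step token distribution over a token-to-language map, and one possible form of the rescoring; whether this matches the actual implementation is an assumption:

```python
import torch

def language_posterior(token_logprobs, token_language):
    """token_logprobs: (T, V) decoder log-probs per step; token_language: (V,) long tensor
    mapping each vocab entry to a language id. Returns (T, n_lang) language probabilities."""
    probs = token_logprobs.exp()
    n_lang = int(token_language.max()) + 1
    lang_probs = torch.zeros(token_logprobs.size(0), n_lang)
    lang_probs.index_add_(1, token_language, probs)   # sum p(y|x) over tokens of each language
    return lang_probs                                 # rows sum to 1 if token_logprobs is normalized

def rescore(token_logprobs, token_language, lang_weight=1.0):
    """Add the log-probability of each candidate token's language to its step log-prob."""
    lang_probs = language_posterior(token_logprobs, token_language)
    per_token_lang = lang_probs.gather(1, token_language.expand(token_logprobs.size(0), -1))
    return token_logprobs + lang_weight * per_token_lang.clamp_min(1e-8).log()
```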
## Progress June 26^th^
https://www.overleaf.com/read/dbwscrgchpsc#7f61a1

## Progress July 8^th^
Overall Structure:

External Language-Aware:

Internal Language-Aware

Overleaf Link: https://www.overleaf.com/read/dbwscrgchpsc#7f61a1
Trying new method:

## Progress July 9^th^
- Why $p(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell})$, while previous papers all use $p(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$?
- The reason is that many previous works on code-switching **indirectly** introduce language-awareness into the model.
- Attention-Guided Adaptation for Code-Switching Speech Recognition (ICASSP 2024)
- To predict the next token, the LID attention map also needs to attend to the correct LID token.
- This shows that LID is a condition for predicting the next token.
- BA-MoE: Boundary-Aware Mixture-of-Experts Adapter for Code-Switching Speech Recognition (ASRU 2023)


- This method trains the model by separating English and Chinese into different encoders
- This forces the model to distinguish between English and Chinese speech and to extract features from the encoder of the correct language.
- Thus, the prediction of the next token is heavily affected by language information
- Adapting the adapters for code-switching in multilingual ASR (ICASSP 2024)

- This method proposes an adapter to detect the switching in the sentence and to use the correct adapter accordingly
- The prediction is heavily affected by the switching pattern (language-awareness)
- An Effective Mixture-Of-Experts Approach For Code-Switching Speech Recognition Leveraging Encoder Disentanglement (ICASSP 2024)

- Same as before: the information is disentangled into English and Chinese adapters, so language-awareness is introduced within the model.
- Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition (Interspeech 2023)

- Mixture-of-Experts for language (language-awareness)
- It is clear that the language of each token is a condition for predicting the next token, especially in code-switching ASR.
- Why Whisper as base model?
- Because Whisper itself is already a multilingual model that takes language as a condition
- Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization (Interspeech 2023)
- This paper shows that changing the LID prompt can improve CS performance
- Why "Probabilistic"?
- Previous works do not apply language-awareness in a **probabilistic way**
- Previous works introduce language-awareness indirectly via manual engineering (LID attention labels, masking), which **might be suboptimal** compared to directly modeling language-awareness.
- By directly applying language-awareness, we expect the model to be more discriminative towards different languages, so language confusion might be reduced.
- Where is the probabilistic language-awareness?
- The probabilistic language-awareness is located in the language calibrator
- Given the previous assumption that code-switching ASR needs the language $\mathbf{\ell}$ as an additional condition, we expand the original ASR log-probability $\log p(\mathbf{y} \mid \mathbf{x})$ into:
$$
\begin{split}
\log p(\mathbf{y} \mid \mathbf{x}) & = \log \sum_\mathbf{\ell} p(\mathbf{y}, \mathbf{\ell} \mid \mathbf{x}) \\
& = \log \sum_\mathbf{\ell} \frac{p(\mathbf{y}, \mathbf{\ell} \mid \mathbf{x}) q(\mathbf{\ell} \mid \mathbf{x})}{q(\mathbf{\ell} \mid \mathbf{x})} \\
& = \log \mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\left[\frac{p(\mathbf{y}, \mathbf{\ell} \mid \mathbf{x})}{q(\mathbf{\ell} \mid \mathbf{x})} \right] \\
\end{split}
$$
by Jensen's Inequality:
$$
\begin{split}
& \geq \mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\left[\log\frac{p(\mathbf{y}, \mathbf{\ell} \mid \mathbf{x})}{q(\mathbf{\ell} \mid \mathbf{x})} \right] \\
& = \mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\left[\log p(\mathbf{y}, \mathbf{\ell} \mid \mathbf{x}) - \log q(\mathbf{\ell} \mid \mathbf{x}) \right] \\
\end{split}
$$
with $p(\mathbf{y}, \mathbf{\ell} \mid \mathbf{x}) = p(\mathbf{y} \mid \mathbf{\ell}, \mathbf{x}) \cdot p(\mathbf{\ell} \mid \mathbf{x})$:
$$
\begin{split}
& = \mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\left[\log p(\mathbf{y} \mid \mathbf{x}, \mathbf{\ell})\right] - \mathcal{D}_{\text{KL}}\left(q(\mathbf{\ell} \mid \mathbf{x}) || p(\mathbf{\ell} \mid \mathbf{x})\right) \\
& = \sum_t \mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\left[\log p(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell}) \right] - \mathcal{D}_{\text{KL}}\left(q(\mathbf{\ell} \mid \mathbf{x}) || p(\mathbf{\ell} \mid \mathbf{x})\right) \\
\end{split}
$$
#### Objectives
- Maximizing $\sum_t \mathbb{E}_{q_{\phi}(\mathbf{\ell} \mid \mathbf{x})}\left[\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell}) \right]$
$$
\begin{split}
& \max_{\theta, \phi} \sum_t \mathbb{E}_{q_{\phi}(\mathbf{\ell} \mid \mathbf{x})}\left[\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell}) \right] \\
& = \max_{\theta, \phi} \sum_t \sum_\mathbf{\ell} q_{\phi}(\mathbf{\ell} \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell})\\
\end{split}
$$
The term $\sum_\mathbf{\ell} q_{\phi}(\mathbf{\ell} \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell})$ can be expanded into:
$$
\begin{split}
\sum_\mathbf{\ell} q_{\phi}(\mathbf{\ell} \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell}) & = q_{\phi}(\text{en} \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \text{en}) \\
& + q_{\phi}(\text{zh} \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \text{zh}) \\
& + q_{\phi}(\text{other} \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \text{other}) \\
\end{split}
$$
Suppose that $y_t = \text{"我"}$; then we know that $q_{\phi}(\mathbf{\ell} \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell}) = 0$ for all $\mathbf{\ell} \neq \text{zh}$. Defining the language of $y_t$ as $\ell_t$, we can simplify the objective to:
$$
\max_{\theta, \phi} \sum_t q_\phi(\ell_{t} \mid \mathbf{x})\log p_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \ell_t)
$$
- Minimizing $\mathcal{D}_{\text{KL}}\left(q(\mathbf{\ell} \mid \mathbf{x}) || p(\mathbf{\ell} \mid \mathbf{x})\right)$
$$
\min_\phi \left\| p(\mathbf{\ell} \mid \mathbf{x}) - q_\phi(\mathbf{\ell} \mid \mathbf{x}) \right\|^2_2
$$
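A minimal sketch of the two objectives above combined into a single training loss (gold-token cross-entropy weighted by the calibrator probability of the token's language, plus the L2 matching term); the tensor layout and the weighting $\beta$ are assumptions:

```python
import torch
import torch.nn.functional as F

def language_aware_loss(token_logits, targets, q_lang, p_lang, target_lang, beta=1.0):
    """token_logits: (T, V) decoder logits under the chosen language condition;
    targets: (T,) gold tokens; target_lang: (T,) language id of each gold token;
    q_lang: (T, K) calibrator output q_phi(l | x); p_lang: (T, K) reference p(l | x)."""
    # sum_t q_phi(l_t | x) * log p_theta(y_t | x, y_<t, l_t), negated for minimization
    token_logprob = F.log_softmax(token_logits, dim=-1).gather(1, targets[:, None]).squeeze(1)
    q_true = q_lang.gather(1, target_lang[:, None]).squeeze(1)
    nll = -(q_true * token_logprob).mean()
    # KL term relaxed to the L2 distance between q_phi(l|x) and p(l|x), as in the last equation
    match = F.mse_loss(q_lang, p_lang)
    return nll + beta * match
```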


## Progress July 12^th^
- Why "Probabilistic"?
- Previous works do not apply language-awareness in a **probabilistic way**
- Previous works introduce language-awareness indirectly via manual engineering (LID attention labels, masking), which **might be suboptimal** compared to directly modeling language-awareness.
- By directly applying language-awareness, we expect the model to be **more discriminative towards different languages**, so language confusion might be reduced.
#### Objectives
$$
\text{ELBO} \triangleq \sum_t \mathbb{E}_{q(\mathbf{\ell} \mid \mathbf{x})}\left[\log p(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell}) \right] - \mathcal{D}_{\text{KL}}\left(q(\mathbf{\ell} \mid \mathbf{x}) || p(\mathbf{\ell} \mid \mathbf{x})\right) \\
$$
- Maximizing $\sum_t \mathbb{E}_{q_{\phi}(\mathbf{\ell} \mid \mathbf{x})}\left[\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell}) \right]$ assuming $\mathbf{\ell}=\{l_k\}_{k=1}^{K}$
$$
\begin{split}
& \max_{\theta, \phi} \sum_t \mathbb{E}_{q_{\phi}(\mathbf{\ell} \mid \mathbf{x})}\left[\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, \mathbf{\ell}) \right] \\
& = \max_{\theta, \phi} \sum_t \sum_k q_{\phi}(l_k \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, l_k)\\
\end{split}
$$
The term $\sum_\mathbf{k} q_{\phi}(l_k \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, l_k)$ can be expanded into:
$$
\begin{split}
\sum_k q_{\phi}(l_k \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, l_k) & = q_{\phi}(l_1 \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, l_1) \\
& + q_{\phi}(l_2 \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t},l_2) \\
& \cdots \\
& + q_{\phi}(l_k \mid \mathbf{x})\log p_{\theta}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, l_k) \\
\end{split}
$$
Suppose that $y_t = \text{"我"}$; then we define $q_{\phi}(l_k \mid \mathbf{x}) = 0$ for all $l_k \neq \text{zh}$. Defining the language of $y_t$ as $l_t$, we can simplify the objective to:
$$
\max_{\theta, \phi} \sum_t q_\phi(l_t \mid \mathbf{x})\log p_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{<t}, l_t)
$$
- Minimizing $\mathcal{D}_{\text{KL}}\left(q(\mathbf{\ell} \mid \mathbf{x}) || p(\mathbf{\ell} \mid \mathbf{x})\right)$
$$
\min_\phi \left\| p(\mathbf{\ell} \mid \mathbf{x}) - q_\phi(\mathbf{\ell} \mid \mathbf{x}) \right\|^2_2
$$

## Progress July 17^th^


- Minimizing $\mathcal{D}_{\text{KL}}\left(q(\mathbf{\ell} \mid \mathbf{x}) || p(\mathbf{\ell} \mid \mathbf{x})\right)$
$$
\min_\phi \sum_k p(l_k \mid \mathbf{x}) \log \frac{p(l_k \mid \mathbf{x})}{q_\phi(l_k \mid \mathbf{x})}
$$
- But this could lead to $\infty$, so we add a small smoothing term to the denominator:
$$
\min_\phi \sum_k p(l_k \mid \mathbf{x}) \log \frac{p(l_k \mid \mathbf{x})}{q_\phi(l_k \mid \mathbf{x}) + 10^{-8}}
$$
- Since $p(l_k \mid \mathbf{x})$ is one-hot at the gold token's language $l_t$, this simplifies to:
$$
\min_\phi \log \frac{1}{q_\phi(l_t \mid \mathbf{x})}
$$
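With a one-hot $p(\ell \mid \mathbf{x})$, the simplified term above is just a cross-entropy over per-token language labels; a minimal sketch (tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def calibrator_loss(lang_logits, target_lang):
    """-log q_phi(l_t | x) averaged over steps, i.e. ordinary cross-entropy on language labels.
    lang_logits: (T, K) calibrator logits; target_lang: (T,) language id of each gold token."""
    return F.cross_entropy(lang_logits, target_lang)
```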
## Progress August 8^th^