## Title: **Probabilistic Language-Aware Speech Recognition**
### Background
- **Many multilingual ASR models** nowadays show **exceptional results on both English and Chinese** datasets (Meta's MMS, Google's USM, OpenAI's Whisper).
- But these models are still **incapable of transcribing Chinese-English code-switching** (for example: **我喜歡 beef burger 但是我不喜歡 burger king**, "I like beef burgers but I don't like Burger King").
- **Fine-tuning** is one option for teaching the model code-switching, but empirically this may result in **performance degradation on monolingual Chinese and English**.
- A plausible explanation is that during prediction, the **language is given as a constraint**, **not as a condition** for predicting the next token.
- So I'm going to **introduce the language of each token**, $\ell$, as an additional condition for predicting the next token (see the sketch after this list).
- My target is to achieve a **similar result in the code-switching setting** but a **better result in the monolingual settings**, compared to plain fine-tuning on a code-switching dataset.
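To make the conditioning concrete, here is a minimal sketch (not Whisper's actual decoder; the class and all names are hypothetical) of adding a per-token language embedding to the decoder input, so the model predicts $p_\theta(y_t \mid y_{<t}, \mathbf{x}, \ell_t)$ instead of $p_\theta(y_t \mid y_{<t}, \mathbf{x})$:

```python
import torch
import torch.nn as nn

class LanguageConditionedDecoderStep(nn.Module):
    """Hypothetical sketch: add a per-token language embedding so the
    decoder models p(y_t | y_<t, x, l_t) rather than p(y_t | y_<t, x)."""

    def __init__(self, vocab_size: int, d_model: int, num_languages: int = 2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.lang_emb = nn.Embedding(num_languages, d_model)  # assume 0=zh, 1=en
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, langs, dec_state):
        # tokens, langs: (batch, time); dec_state: (batch, time, d_model),
        # standing in for the usual attention over the acoustic encoder.
        h = self.token_emb(tokens) + self.lang_emb(langs) + dec_state
        return self.out(h)  # next-token logits, conditioned on l_t
```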
## Base Settings:
- Using OpenAI's whisper-small model as the base model (a loading sketch follows this list).
- Train dataset:
- SEAME (South-East Asia Mandarin-English code-switching dataset)
- Test dataset:
- SEAME (in-domain, code-switching)
- ASCEND (out-of-domain, Hong Kong-accented code-switching dataset)
- AISHELL (out-of-domain, monolingual Chinese dataset)
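As a concrete starting point, the base model can be loaded with Hugging Face `transformers` (the SEAME/ASCEND/AISHELL corpora require separate downloads and licensing, so their loading is not shown):

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# whisper-small: the base checkpoint to be fine-tuned with language conditioning.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
```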
## Overall Architecture:

- This is the derivation obtained from Prof. Chien:
$$
\begin{split}
\log p(\mathbf{y} \mid \mathbf{x}) & = \log \sum_{\boldsymbol{\ell}} p(\mathbf{y}, \boldsymbol{\ell} \mid \mathbf{x}) \\
& = \log \sum_{\boldsymbol{\ell}} \frac{p(\mathbf{y}, \boldsymbol{\ell} \mid \mathbf{x})\, q(\boldsymbol{\ell})}{q(\boldsymbol{\ell})} \\
& = \log \mathbb{E}_{q(\boldsymbol{\ell})}\left[\frac{p(\mathbf{y}, \boldsymbol{\ell} \mid \mathbf{x})}{q(\boldsymbol{\ell})} \right] \\
\text{by Jensen's inequality:} \\
& \geq \mathbb{E}_{q(\boldsymbol{\ell})}\left[\log \frac{p(\mathbf{y}, \boldsymbol{\ell} \mid \mathbf{x})}{q(\boldsymbol{\ell})} \right] \\
& = \mathbb{E}_{q(\boldsymbol{\ell})}\left[\log p(\mathbf{y}, \boldsymbol{\ell} \mid \mathbf{x}) - \log q(\boldsymbol{\ell}) \right] \\
\text{with } p(\mathbf{y}, \boldsymbol{\ell} \mid \mathbf{x}) = p_\theta(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\ell}) \cdot p(\boldsymbol{\ell} \mid \mathbf{x}): \\
& = \mathbb{E}_{q(\boldsymbol{\ell})}\left[\log p_\theta(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\ell})\right] - \mathcal{D}_{KL}\left(q(\boldsymbol{\ell}) \,\|\, p(\boldsymbol{\ell} \mid \mathbf{x})\right)
\end{split}
$$
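As a sanity check on the bound, here is a toy numeric example (two languages, made-up probabilities, not real model outputs) verifying that the ELBO never exceeds the log-marginal for any choice of $q(\boldsymbol{\ell})$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete setting: two languages, one fixed utterance (x, y).
p_l_given_x = np.array([0.7, 0.3])     # p(l | x)    (made-up numbers)
p_y_given_xl = np.array([0.10, 0.40])  # p(y | x, l) for the fixed y

log_marginal = np.log((p_y_given_xl * p_l_given_x).sum())  # log p(y | x)

for _ in range(1000):
    q = rng.dirichlet(np.ones(2))  # an arbitrary variational q(l)
    # ELBO = E_q[log p(y | x, l)] - KL(q || p(l | x))
    elbo = (q * np.log(p_y_given_xl)).sum() - (q * np.log(q / p_l_given_x)).sum()
    assert elbo <= log_marginal + 1e-12  # Jensen: ELBO lower-bounds log p(y|x)
```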
- (My idea) choose $q(\boldsymbol{\ell}) = p(\boldsymbol{\ell} \mid \mathbf{x})$, so the KL term vanishes:
$$
\begin{split}
\log p(\mathbf{y} \mid \mathbf{x}) & = \log \sum_{\boldsymbol{\ell}} p(\mathbf{y}, \boldsymbol{\ell} \mid \mathbf{x}) \\
& = \log \sum_{\boldsymbol{\ell}} \frac{p(\mathbf{y}, \boldsymbol{\ell} \mid \mathbf{x})\, p(\boldsymbol{\ell} \mid \mathbf{x})}{p(\boldsymbol{\ell} \mid \mathbf{x})} \\
\text{following the previous steps:} \\
& \geq \mathbb{E}_{p(\boldsymbol{\ell} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\ell})\right] - \mathcal{D}_{KL}\left(p(\boldsymbol{\ell} \mid \mathbf{x}) \,\|\, p(\boldsymbol{\ell} \mid \mathbf{x})\right) \\
& = \mathbb{E}_{p(\boldsymbol{\ell} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\ell})\right]
\end{split}
$$
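Continuing the toy check above (same made-up numbers): with $q(\boldsymbol{\ell}) = p(\boldsymbol{\ell} \mid \mathbf{x})$ the KL term is exactly zero, so the objective reduces to the expected language-conditioned log-likelihood, which remains a valid lower bound:

```python
import numpy as np

p_l_given_x = np.array([0.7, 0.3])     # p(l | x)    (same toy numbers)
p_y_given_xl = np.array([0.10, 0.40])  # p(y | x, l) for the fixed y

log_marginal = np.log((p_y_given_xl * p_l_given_x).sum())     # log p(y | x)
kl = (p_l_given_x * np.log(p_l_given_x / p_l_given_x)).sum()  # KL(p || p) = 0
bound = (p_l_given_x * np.log(p_y_given_xl)).sum()            # E[log p(y|x,l)]

assert np.isclose(kl, 0.0) and bound <= log_marginal
print(f"log p(y|x) = {log_marginal:.4f}, bound = {bound:.4f}")  # -1.6607, -1.8867
```

In training, this objective amounts to ordinary cross-entropy on $\mathbf{y}$ with the decoder conditioned on the per-token language sequence $\boldsymbol{\ell}$ drawn from (or predicted by) $p(\boldsymbol{\ell} \mid \mathbf{x})$.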