## Title: **Probabilistic Language-Aware Speech Recognition**
### Background
- **Many multilingual ASR models** nowadays show **exceptional results on both English and Chinese** datasets (Meta's MMS, Google's USM, OpenAI's Whisper).
- But these models are still **incapable of transcribing Chinese-English code-switching** (for example: **我喜歡 beef burger 但是我不喜歡 burger king**, "I like beef burgers but I don't like Burger King").
- **Fine-tuning** is one option for teaching the model code-switching, but empirically this may result in **performance degradation on monolingual Chinese and English**.
- A plausible explanation is that during prediction, the **language is given as a constraint**, **not as a condition** for predicting the next token.
- So I'm going to **introduce the language of each token**, $\ell$, as an additional condition for predicting the next token (see the sketch after this list).
- My target is to achieve a **similar result in the code-switching setting** but a **better result in the monolingual settings**, compared to plain fine-tuning on a code-switching dataset.
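To make the conditioning concrete, here is a minimal sketch (not Whisper's actual decoder; the class and all names are hypothetical) of adding a per-token language embedding to the decoder input, so the model predicts $p_\theta(y_t \mid y_{<t}, \mathbf{x}, \ell_t)$ instead of $p_\theta(y_t \mid y_{<t}, \mathbf{x})$:

```python
import torch
import torch.nn as nn

class LanguageConditionedDecoderStep(nn.Module):
    """Hypothetical sketch: add a per-token language embedding so the
    decoder models p(y_t | y_<t, x, l_t) rather than p(y_t | y_<t, x)."""

    def __init__(self, vocab_size: int, d_model: int, num_languages: int = 2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.lang_emb = nn.Embedding(num_languages, d_model)  # assume 0=zh, 1=en
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, langs, dec_state):
        # tokens, langs: (batch, time); dec_state: (batch, time, d_model),
        # standing in for the usual attention over the acoustic encoder.
        h = self.token_emb(tokens) + self.lang_emb(langs) + dec_state
        return self.out(h)  # next-token logits, conditioned on l_t
```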
## Base Settings:
- Using OpenAI's whisper-small model as the base model (a loading sketch follows this list).
- Train dataset:
- SEAME (South-East Asia Mandarin-English code-switching dataset)
- Test dataset:
- SEAME (in-domain, code-switching)
- ASCEND (out-of-domain, Hong Kong-accented code-switching dataset)
- AISHELL (out-of-domain, monolingual Chinese dataset)
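As a concrete starting point, the base model can be loaded with Hugging Face `transformers` (the SEAME/ASCEND/AISHELL corpora require separate downloads and licensing, so their loading is not shown):

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# whisper-small: the base checkpoint to be fine-tuned with language conditioning.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
```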
## Overall Architecture:

- This is the derivation obtained from Prof. Chien:
$$
\begin{split}
\log p(\mathbf{y} \mid \mathbf{x}) & = \log \sum_{\boldsymbol{\ell}} p(\mathbf{y}, \boldsymbol{\ell} \mid \mathbf{x}) \\
& = \log \sum_{\boldsymbol{\ell}} \frac{p(\mathbf{y}, \boldsymbol{\ell} \mid \mathbf{x})\, q(\boldsymbol{\ell})}{q(\boldsymbol{\ell})} \\
& = \log \mathbb{E}_{q(\boldsymbol{\ell})}\left[\frac{p(\mathbf{y}, \boldsymbol{\ell} \mid \mathbf{x})}{q(\boldsymbol{\ell})} \right] \\
\text{by Jensen's inequality:} \\
& \geq \mathbb{E}_{q(\boldsymbol{\ell})}\left[\log \frac{p(\mathbf{y}, \boldsymbol{\ell} \mid \mathbf{x})}{q(\boldsymbol{\ell})} \right] \\
& = \mathbb{E}_{q(\boldsymbol{\ell})}\left[\log p(\mathbf{y}, \boldsymbol{\ell} \mid \mathbf{x}) - \log q(\boldsymbol{\ell}) \right] \\
\text{with } p(\mathbf{y}, \boldsymbol{\ell} \mid \mathbf{x}) = p_\theta(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\ell}) \cdot p(\boldsymbol{\ell} \mid \mathbf{x}): \\
& = \mathbb{E}_{q(\boldsymbol{\ell})}\left[\log p_\theta(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\ell})\right] - \mathcal{D}_{KL}\left(q(\boldsymbol{\ell}) \,\|\, p(\boldsymbol{\ell} \mid \mathbf{x})\right)
\end{split}
$$
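As a sanity check on the bound, here is a toy numeric example (two languages, made-up probabilities, not real model outputs) verifying that the ELBO never exceeds the log-marginal for any choice of $q(\boldsymbol{\ell})$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete setting: two languages, one fixed utterance (x, y).
p_l_given_x = np.array([0.7, 0.3])     # p(l | x)    (made-up numbers)
p_y_given_xl = np.array([0.10, 0.40])  # p(y | x, l) for the fixed y

log_marginal = np.log((p_y_given_xl * p_l_given_x).sum())  # log p(y | x)

for _ in range(1000):
    q = rng.dirichlet(np.ones(2))  # an arbitrary variational q(l)
    # ELBO = E_q[log p(y | x, l)] - KL(q || p(l | x))
    elbo = (q * np.log(p_y_given_xl)).sum() - (q * np.log(q / p_l_given_x)).sum()
    assert elbo <= log_marginal + 1e-12  # Jensen: ELBO lower-bounds log p(y|x)
```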
- (My idea) choose $q(\boldsymbol{\ell}) = p(\boldsymbol{\ell} \mid \mathbf{x})$, so the KL term vanishes:
$$
\begin{split}
\log p(\mathbf{y} \mid \mathbf{x}) & = \log \sum_{\boldsymbol{\ell}} p(\mathbf{y}, \boldsymbol{\ell} \mid \mathbf{x}) \\
& = \log \sum_{\boldsymbol{\ell}} \frac{p(\mathbf{y}, \boldsymbol{\ell} \mid \mathbf{x})\, p(\boldsymbol{\ell} \mid \mathbf{x})}{p(\boldsymbol{\ell} \mid \mathbf{x})} \\
\text{following the previous steps:} \\
& \geq \mathbb{E}_{p(\boldsymbol{\ell} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\ell})\right] - \mathcal{D}_{KL}\left(p(\boldsymbol{\ell} \mid \mathbf{x}) \,\|\, p(\boldsymbol{\ell} \mid \mathbf{x})\right) \\
& = \mathbb{E}_{p(\boldsymbol{\ell} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\ell})\right]
\end{split}
$$
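Continuing the toy check above (same made-up numbers): with $q(\boldsymbol{\ell}) = p(\boldsymbol{\ell} \mid \mathbf{x})$ the KL term is exactly zero, so the objective reduces to the expected language-conditioned log-likelihood, which remains a valid lower bound:

```python
import numpy as np

p_l_given_x = np.array([0.7, 0.3])     # p(l | x)    (same toy numbers)
p_y_given_xl = np.array([0.10, 0.40])  # p(y | x, l) for the fixed y

log_marginal = np.log((p_y_given_xl * p_l_given_x).sum())     # log p(y | x)
kl = (p_l_given_x * np.log(p_l_given_x / p_l_given_x)).sum()  # KL(p || p) = 0
bound = (p_l_given_x * np.log(p_y_given_xl)).sum()            # E[log p(y|x,l)]

assert np.isclose(kl, 0.0) and bound <= log_marginal
print(f"log p(y|x) = {log_marginal:.4f}, bound = {bound:.4f}")  # -1.6607, -1.8867
```

In training, this objective amounts to ordinary cross-entropy on $\mathbf{y}$ with the decoder conditioned on the per-token language sequence $\boldsymbol{\ell}$ drawn from (or predicted by) $p(\boldsymbol{\ell} \mid \mathbf{x})$.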