# Self-supervised pretraining for phoneme recognition on foreign languages
王斾頤, first-year master's student, Institute of Applied Informatics
## Introduction
### Background
* Data scarcity and cost limitations of annotated data
* Rise of self-supervised learning methods
### Phoneme Recognition
* Process a raw audio recording and predict the corresponding sequence of phonemes pronounced by the speaker
* The project compares different self-supervised models **pretrained on an English speech corpus** for phoneme recognition across languages:
* Wav2vec 2.0
* HuBERT
* WavLM
### Research Questions
> *What is the impact of choosing English as a pretrained language, especially for languages vastly different from English?*
> *What is the influence of the abundance of training data on model performance?*
> *Which method allows extracting the best features for phoneme recognition?*
---
## Methods
### Network Architecture
The overall network architectures of the 3 models share a common design.

They are made of a CNN encoder, a feature projector, a Transformer encoder, and a linear layer used as the language-modeling head.
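As an illustration of this shared layout, a small inspection sketch using the Hugging Face implementation of Wav2vec2 (the submodule names are those of `Wav2Vec2ForCTC`; HuBERT and WavLM expose analogous modules):
```python=
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
print(model.wav2vec2.feature_extractor)   # CNN encoder on the raw waveform
print(model.wav2vec2.feature_projection)  # feature projector (512 -> 768)
print(model.wav2vec2.encoder)             # Transformer encoder
print(model.lm_head)                      # linear language-modeling (CTC) head
```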
---
### Dataset
* Based on the Mozilla CommonVoice dataset available on [HuggingFace](https://huggingface.co/datasets/common_voice)
* The dataset is **powered by the voices of volunteer contributors** around the world.
* You are free to choose any dataset available on HuggingFace, together with the phoneme dictionaries cited below, to run your models.
```python=
from datasets import load_dataset

# Load the Turkish subset of Common Voice
dataset = load_dataset("common_voice", "tr")
```
* Provides a **train/validation/test split**
### Phoneme Conversion
* Transform the ground-truth sentences into phonemes using the [phonemizer](https://github.com/bootphon/phonemizer) library combined with the **espeak-ng** backend, which supports over 100 languages.

#### Example 1. (English)
```python=
from phonemizer import phonemize
texts = ['hello, my name is david']
# Convert the English text to its phoneme sequence
phonemized = phonemize(texts, language='en-us')
phonemized
```
```python=
# output
['həloʊ maɪ neɪm ɪz deɪvɪd']
```
#### Example 2. (Mandarin)
```python=
text_zhs = ['你好我的名字是大衛']
# Convert the Mandarin text to its phoneme sequence
phonemized_zhs = phonemize(text_zhs, language='zh')
phonemized_zhs
```
```python=
# output
['ni2 xɑu2 wo2 tə1 miɜŋ tsi̪5 s.i.5 tɑ5 wei5']
```
### Tokenization
To compute the loss, we need to tokenize each phoneme.
For this we used Hugging Face's phoneme tokenizer `Wav2Vec2PhonemeCTCTokenizer`, which handles encoding the targets and decoding the predictions into and from tokens.

Hugging Face's tokenizer requires a dictionary listing all the phonemes of a language (a small usage sketch follows the note below).
:::info
The PHOIBLE set of characters was different from the one used by the phonemizer.
→ ==many `<unk>` tokens appeared==
→ The phoneme dictionaries were instead taken from [*Unsupervised pretraining transfers well across languages*](https://arxiv.org/abs/2002.02848), which provides dictionaries for 10 languages.
:::
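A minimal usage sketch of the tokenizer, assuming a toy `vocab.json` standing in for one of the per-language phoneme dictionaries cited above (the real vocabulary lists every phoneme of the target language):
```python=
import json
from transformers import Wav2Vec2PhonemeCTCTokenizer

# Toy vocabulary; the real one lists all phonemes of the target language
vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3,
         "h": 4, "ə": 5, "l": 6, "oʊ": 7}
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

# do_phonemize=False: the targets are already phonemized (space-separated)
tokenizer = Wav2Vec2PhonemeCTCTokenizer("vocab.json", do_phonemize=False)

ids = tokenizer("h ə l oʊ").input_ids   # encode phonemes into token ids
print(ids)
print(tokenizer.decode(ids))            # decode token ids back to phonemes
```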
---
### Language


* Indo-European:
    * Romance: Italian
    * East Slavic: Russian
    * West Germanic: Dutch
    * North Germanic: Swedish
* Turkic: Turkish

:::info
**More info**
* Sino-Tibetan: Chinese
* Japonic: Japanese
* Koreanic: Korean
:::
---
## Pretrained Methods’ Descriptions
> Used models pretrained on 960 hours of English audio data from the **Librispeech** dataset.
> Librispeech: a collection of approximately 1,000 hours of audiobooks.
* Wav2vec2 Base: [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h)
* WavLM Base: [microsoft/wavlm-base](https://huggingface.co/microsoft/wavlm-base)
* ==WavLM Large: [microsoft/wavlm-large](https://huggingface.co/microsoft/wavlm-large)==
* HuBERT Large: [facebook/hubert-large-ls960-ft](https://huggingface.co/facebook/hubert-large-ls960-ft)
:::danger
**Exception:**
**WavLM Large** was pretrained on MIX-94K (60K hours of Libri-Light, 10K hours of GigaSpeech, and 24K hours of VoxPopuli)
:::
:::info
1. Model Feature Extractor Size Difference
    * The CNN encoder of every model extracts features of size 512
    * The pretrained HuBERT model is larger: its Transformer processes features of size 1024, versus 768 for Wav2vec2 and WavLM (see the configuration check below)
    * The HuBERT model is twice as deep as the other two models (24 layers)
2. Challenges in Comparing Model Feature Extraction Capacity
    * To fairly compare the feature extraction capacity of the pretrained models, **it would be best to have Transformers of the same size**
    * **No pretrained weights for a HuBERT Base model** are available on Hugging Face
:::
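These size differences can be verified directly from the checkpoint configurations (only the configs are downloaded; `conv_dim[-1]` is the CNN feature size, `hidden_size` the Transformer feature size, and `num_hidden_layers` the depth):
```python=
from transformers import AutoConfig

checkpoints = [
    "facebook/wav2vec2-base-960h",    # 768-dim, 12 layers
    "microsoft/wavlm-base",           # 768-dim, 12 layers
    "microsoft/wavlm-large",          # 1024-dim, 24 layers
    "facebook/hubert-large-ls960-ft", # 1024-dim, 24 layers
]
for name in checkpoints:
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.conv_dim[-1], cfg.hidden_size, cfg.num_hidden_layers)
```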
---
### Contrastive Models

#### Wav2Vec 2.0

:::info
#### XLS-R (supplementary)


:::
---
### Predictive Models

#### HuBERT

#### WavLM

---
## Experiments
### Audio pre-processing
* Re-sampled the audio clips to match the sampling rate of the pretraining data (16 kHz).
* Removed ==blank recordings== and ==long audios with a duration greater than 5 seconds==.
* Padded the audios to the same length (see the pre-processing sketch after this list).
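A minimal pre-processing sketch, assuming the Common Voice column names `audio` and `sentence` and the `dataset` loaded earlier; padding itself is typically left to the feature extractor or data collator at batch time:
```python=
from datasets import Audio

# Resample every clip to 16 kHz, the rate the models were pretrained on
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Drop blank transcripts and clips longer than 5 seconds
def keep(example):
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    return len(example["sentence"].strip()) > 0 and duration <= 5.0

dataset = dataset.filter(keep)
```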
### Evaluation
Use the **Phoneme Error Rate (PER)** to evaluate the models.
$PER = \cfrac{S+I+D}{N}$
* $S$ stands for substitutions (replacing a phoneme).
* $I$ stands for insertions (inserting a phoneme).
* $D$ stands for deletions (omitting a phoneme).
* $N$ is the number of phonemes in the reference (a worked example follows).
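As an illustration (an assumption about tooling, not the authors' evaluation script), PER can be obtained from the standard WER metric by treating each space-separated phoneme as a "word":
```python=
import evaluate

# Word-level edit distance over space-separated phonemes = PER
per_metric = evaluate.load("wer")
per = per_metric.compute(
    predictions=["n eɪ m ə z"],   # one substituted phoneme
    references=["n eɪ m ɪ z"],    # 5 reference phonemes
)
print(per)  # (S + I + D) / N = 1 / 5 = 0.2
```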
### Fine-tuning

* Performance seems to vary with the amount of available training data (e.g., Italian) and the language's proximity to English (e.g., Dutch); a per-language fine-tuning setup sketch follows this list.
* Exception (**Turkish**): despite having very **low training data** and **being the farthest from English**, Turkish performs as the second-best language after fine-tuning the models.
Unfortunately, there isn't a clear explanation for this phenomenon, but the discrepancy disappears when using frozen features.
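A minimal setup sketch for per-language CTC fine-tuning (an assumption about the workflow, not the authors' exact code); it reuses the phoneme `tokenizer` from the tokenization sketch above and attaches a fresh CTC head sized to its vocabulary:
```python=
from transformers import Wav2Vec2ForCTC

# Replace the English CTC head by a new one sized to the phoneme vocabulary
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base-960h",
    vocab_size=len(tokenizer),
    pad_token_id=tokenizer.pad_token_id,
    ignore_mismatched_sizes=True,   # the original English head is discarded
)
model.freeze_feature_encoder()      # the CNN encoder is commonly kept frozen
```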
### Frozen features

* Utilize only the features learned during pretraining on the English corpus (the phoneme feature extraction is not adapted to the different languages; see the freezing sketch after this list)
* Languages with a strong similarity to English tend to exhibit better results (e.g., Dutch compared to Russian)
* The results for **Swedish** and **Russian** are quite similar, as the proximity of Swedish to English is ==counterbalanced== by the extremely limited amount of training data available.
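For the frozen-features setting, a sketch (continuing the fine-tuning block above) that keeps the whole pretrained network fixed so that only the linear CTC head is trained:
```python=
# Freeze the CNN encoder, feature projector, and Transformer encoder
for param in model.wav2vec2.parameters():
    param.requires_grad = False

# Only the lm_head parameters should remain trainable
print([name for name, p in model.named_parameters() if p.requires_grad])
```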
---
#### Comparison of Wav2vec2 Base and WavLM Base
* ***WavLM Base*** performs better than Wav2vec2 Base across all languages
#### Comparison of WavLM Large and HuBERT Large
* ***WavLM Large*** outperforms HuBERT Large for most languages, with better results on average (28.31% vs. 30.75% PER on the test set).
