# Self-supervised pretraining for phoneme recognition on foreign languages

王斾頤, first-year master's student, 資應所

## Introduction

### Background

* Data scarcity and the cost of annotated data
* Rise of self-supervised learning methods

### Phoneme Recognition

* Process a raw audio recording and predict the corresponding sequence of phonemes pronounced by the speaker.
* Compare different self-supervised models **pretrained on an English speech corpus** for phoneme recognition across languages:
    * Wav2vec 2.0
    * HuBERT
    * WavLM

### Research Questions

> *What is the impact of choosing English as the pretraining language, especially for languages vastly different from English?*
> *What is the influence of the amount of training data on model performance?*
> *Which method extracts the best features for phoneme recognition?*

---

## Methods

### Network Architecture

The three models share a common overall design.

![image](https://hackmd.io/_uploads/HJVisLUh6.png)

Each consists of a CNN encoder, a feature projector, a Transformer encoder, and a linear layer used as the language-modeling head.

---

### Dataset

* Based on the Mozilla CommonVoice dataset available on [HuggingFace](https://huggingface.co/datasets/common_voice).
* The dataset is **powered by the voices of volunteer contributors** around the world.
* You are free to choose any dataset available on HuggingFace that is covered by the phoneme dictionaries cited in the Tokenization section.

```python=
from datasets import load_dataset

# Load the Turkish subset of CommonVoice
dataset = load_dataset("common_voice", "tr")
```

* Provides a **train/validation/test split**.

### Phonemes Conversion

* Transform ground-truth sentences into phonemes using the [phonemizer](https://github.com/bootphon/phonemizer) library combined with the **espeak-ng** backend, which supports over 100 languages.

![image](https://hackmd.io/_uploads/ByCHMHw36.png)

#### Example 1.
(English)

```python=
from phonemizer import phonemize

texts = ['hello, my name is david']
phonemized = phonemize(texts, language='en-us')
phonemized
```
```python=
# output
['həloʊ maɪ neɪm ɪz deɪvɪd']
```

#### Example 2. (Mandarin)

```python=
text_zhs = ['你好我的名字是大衛']
phonemized_zhs = phonemize(text_zhs, language='zh')
phonemized_zhs
```
```python=
# output
['ni2 xɑu2 wo2 tə1 miɜŋ tsi̪5 s.i.5 tɑ5 wei5']
```

### Tokenization

To compute the loss, each phoneme must be tokenized. For this we used Hugging Face's phoneme tokenizer ```Wav2Vec2PhonemeCTCTokenizer```, which encodes and decodes the targets and the predictions into tokens.

![image](https://hackmd.io/_uploads/Synt4wD3p.png)

Hugging Face's tokenizer requires a dictionary listing all the phonemes of a language.

:::info
The PHOIBLE character set differs from the one used by the phonemizer, so ==many `<unk>` tokens appeared==. We therefore used the phoneme dictionaries for 10 languages provided by [*Unsupervised pretraining transfers well across languages*](https://arxiv.org/abs/2002.02848).
:::

---

### Language

![image](https://hackmd.io/_uploads/SyO0ISDha.png)
![image](https://hackmd.io/_uploads/HJih9PD26.png =500x)

* Indo-European family:
    * Romance: Italian
    * East Slavic: Russian
    * West Germanic: Dutch
    * North Germanic: Swedish
* Turkic family: Turkish

![image](https://hackmd.io/_uploads/r16Xa2Pna.png =500x)

:::info
**More info**
* Sino-Tibetan family: Chinese
* Japonic family: Japanese
* Koreanic family: Korean
:::

---

## Pretrained Methods' Descriptions

> We used models pretrained on 960 hours of English audio from the **Librispeech** dataset.
> Librispeech: a collection of approximately 1,000 hours of audiobooks.
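The shared pipeline described under Network Architecture (CNN encoder → feature projector → Transformer encoder → linear head) can be sketched with toy `numpy` operations. This is a minimal illustration, not any of the real pretrained models: the feature sizes 512 and 768 match the Base configurations, while the single attention layer, random weights, and stride of 320 are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_cnn, d_model, n_phonemes, stride = 512, 768, 40, 320

# 1) "CNN encoder": chop the waveform into stride-sized frames and
#    project each frame to a 512-d feature vector (one matmul per frame).
wave = rng.standard_normal(3200)              # 0.2 s of 16 kHz audio
frames = wave.reshape(-1, stride)             # (10, 320)
W_cnn = rng.standard_normal((stride, d_cnn)) * 0.01
feats = frames @ W_cnn                        # (10, 512)

# 2) feature projector: 512 -> 768
W_proj = rng.standard_normal((d_cnn, d_model)) * 0.01
h = feats @ W_proj                            # (10, 768)

# 3) "Transformer encoder": a single self-attention layer
#    (no feed-forward block, LayerNorm, or positional encoding).
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.01 for _ in range(3))
attn = softmax((h @ Wq) @ (h @ Wk).T / np.sqrt(d_model))
h = attn @ (h @ Wv)                           # (10, 768)

# 4) linear head: one logit vector per frame over the phoneme vocabulary (CTC-style)
W_out = rng.standard_normal((d_model, n_phonemes)) * 0.01
logits = h @ W_out
print(logits.shape)                           # (10, 40)
```

A real checkpoint stacks several temporal convolutions and 12 or 24 Transformer blocks; the point here is only the shape of the data as it flows through the four stages.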
* Wav2vec2 Base: [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h)
* WavLM Base: [microsoft/wavlm-base](https://huggingface.co/microsoft/wavlm-base)
* ==WavLM Large: [microsoft/wavlm-large](https://huggingface.co/microsoft/wavlm-large)==
* HuBERT Large: [facebook/hubert-large-ls960-ft](https://huggingface.co/facebook/hubert-large-ls960-ft)

:::danger
**Exception:** **WavLM Large** was pretrained on MIX-94K (60K hours of Libri-Light, 10K hours of GigaSpeech, and 24K hours of VoxPopuli).
:::

:::info
1. Model feature extractor size differences
    * The CNN encoder of every model extracts 512-dimensional features.
    * The pretrained HuBERT model is larger: its Transformer processes 1024-dimensional features, versus 768 for Wav2vec2 and WavLM.
    * The HuBERT model is twice as deep as the other two models (24 layers).
2. Challenges in comparing model feature extraction capacity
    * To compare the feature extraction capacity of the pretrained models fairly, **Transformers of the same size would be ideal**.
    * However, **no pretrained weights for a HuBERT Base model are available on Hugging Face**.
:::

---

### Contrastive Models

![Screenshot 2024-02-24 22.42.33](https://hackmd.io/_uploads/B1BKeFv2T.png)

#### Wav2Vec 2.0

![image](https://hackmd.io/_uploads/H1EWaNw3a.png)

:::info
#### XLS-R (supplementary)
![image](https://hackmd.io/_uploads/H1V_ANP26.png)
![image](https://hackmd.io/_uploads/SJjs9KvhT.png)
:::

---

### Predictive Models

![image](https://hackmd.io/_uploads/H1TRlYDhT.png)

#### HuBERT

![image](https://hackmd.io/_uploads/HyI4eydn6.png)

#### WavLM

![image](https://hackmd.io/_uploads/By5t-yu36.png)

---

## Experiments

### Audio pre-processing

* Re-sampled the audio to match the sampling rate of the pretraining dataset (16 kHz).
* Removed ==blanks== and ==long audio clips with a duration greater than 5 seconds==.
* Padded the audio clips to the same length.

### Evaluation

We use the **Phoneme Error Rate (PER)** to evaluate the models.

$PER = \cfrac{S+I+D}{N}$

* $S$ stands for substitutions (replacing a phoneme).
* $I$ stands for insertions (inserting a phoneme).
* $D$ stands for deletions (omitting a phoneme).
* $N$ is the number of phonemes in the reference.

### Fine-tuning

![image](https://hackmd.io/_uploads/HkKBanD26.png)

* Performance varies with the amount of available training data (e.g., Italian) and the language's proximity to English (e.g., Dutch).
* Exception (**Turkish**): despite having very **little training data** and **being the farthest from English**, it is the second-best language after fine-tuning the models. There is no clear explanation for this phenomenon, but the discrepancy disappears when using frozen features.

### Frozen features

![image](https://hackmd.io/_uploads/SJ6Lphw3a.png)

* Use only the features learned during pretraining on the English corpus (the phoneme feature extraction is not adapted to the different languages).
* Languages with a strong similarity to English tend to show better results (Dutch compared to Russian).
* The results for **Swedish** and **Russian** are quite similar: the proximity of Swedish to English is ==counterbalanced== by the extremely limited amount of training data available.

---

#### Comparison of Wav2vec2 Base and WavLM Base

* ***WavLM Base*** performs better than Wav2vec2 Base across all languages.

#### Comparison of WavLM Large and HuBERT Large

* ***WavLM Large*** outperforms HuBERT Large on most languages, with better results on average (28.31% vs. 30.75% PER on the test set).

![image](https://hackmd.io/_uploads/r18XMy_na.png)
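The PER values reported in these comparisons can be reproduced with a small Levenshtein routine over phoneme sequences. A minimal sketch, assuming phonemes are whitespace-separated strings (the example strings are illustrative; real evaluations typically rely on a tested library such as `jiwer`):

```python
def phoneme_error_rate(ref, hyp):
    """PER = (S + I + D) / N via edit distance over phoneme tokens."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1   # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(r)][len(h)] / len(r)

print(phoneme_error_rate('h ə l oʊ', 'h ə l l o'))  # 1 sub + 1 ins over 4 phonemes = 0.5
```

Substitutions, insertions, and deletions are exactly the three moves of the edit-distance recurrence, so the distance divided by the reference length $N$ gives the PER from the formula above.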