# HuBERT training with the fairseq library
*Prerequisites: fairseq installation, LibriSpeech dataset*
To install fairseq, follow the official installation instructions on the fairseq GitHub page.
The tutorial below uses LibriSpeech train-clean-100 (training set) and dev-clean (validation set).
* learned k-means (HuBERT large / layer 20 / libri-train-100): https://github.com/voidful/hubert-cluster-code
* HuBERT cluster-ID ASR: https://huggingface.co/voidful/asr_hubert_cluster_bart_base
## Label preparation
In `./simple_kmeans`, run the preprocessing steps below for both the training set and the validation set.
* feature extraction
`.tsv` manifest files contain a list of audio files: the first line is the root directory, and each following line is the relative path of one audio file:
```
<root-dir>
<audio-path-1>
<audio-path-2>
...
```
Generate the manifest with:
```
$ python ${FAIRSEQ_ROOT}/examples/wav2vec/wav2vec_manifest.py \
/dir/to/save/audio/files --ext flac \
--dest /path/to/new/train.tsv --valid-percent 0
```
Then dump the MFCC features used for k-means:
- nshard: the list of filenames is split into nshard segments
- rank: index of the segment to process, where rank ranges over [0, nshard - 1]
```
$ python dump_mfcc_feature.py ${tsv_dir} ${split} ${nshard} ${rank} ${feat_dir}
```
Later we will merge these shards, so every segment (every rank) has to be dumped here. The resulting files are saved at `${feat_dir}/${split}_${rank}_${nshard}.{npy,len}`.
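The `.npy` file holds the concatenated frame-level features of a shard and the `.len` file holds one frame count per utterance, so a shard can be inspected roughly as follows (a sketch; paths and shard indices are placeholders):
```python
# Inspect one dumped feature shard (sketch; paths/indices are placeholders).
import numpy as np

feat_dir, split, rank, nshard = "/path/to/feat", "train", 0, 3
feats = np.load(f"{feat_dir}/{split}_{rank}_{nshard}.npy")            # (total_frames, feat_dim)
lengths = [int(l) for l in open(f"{feat_dir}/{split}_{rank}_{nshard}.len")]
assert feats.shape[0] == sum(lengths)                                 # per-utterance lengths add up
offsets = np.cumsum([0] + lengths)
first_utt = feats[offsets[0]:offsets[1]]                              # features of the first utterance
print(feats.shape, len(lengths), first_utt.shape)
```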
* k-means clustering
```
python learn_kmeans.py ${feat_dir} ${split} ${nshard} ${km_path} ${n_cluster} --percent 0.1
```
You can use more CPUs to speed this up, e.g. on battleship:
```
hrun -c 56 -m 24 python learn_kmeans.py \
/home/daniel094144/data/LibriSpeech/Hubert100/ \
train \
3 \
/home/daniel094144/data/LibriSpeech/Hubert100/km.pkl \
100 \
--percent 1
```
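Roughly speaking, `learn_kmeans.py` fits a scikit-learn MiniBatchKMeans model on a subsample of the dumped features (controlled by `--percent`) and saves it to `${km_path}`. A stripped-down sketch of the idea, not the actual fairseq script (paths are placeholders):
```python
# Stripped-down sketch of the k-means step (illustrative, not the actual fairseq script).
import numpy as np
import joblib
from sklearn.cluster import MiniBatchKMeans

feats = np.load("/path/to/feat/train_0_3.npy")    # placeholder: one MFCC feature shard
percent = 0.1                                     # fraction of frames used for fitting
idx = np.random.choice(len(feats), int(len(feats) * percent), replace=False)

km = MiniBatchKMeans(n_clusters=100, batch_size=10000, max_no_improvement=100, n_init=20)
km.fit(feats[idx])
joblib.dump(km, "/path/to/feat/km.pkl")           # later loaded for k-means application
```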
* k-means application
```
python dump_km_label.py ${feat_dir} \
${split} ${km_path} ${nshard} ${rank} ${lab_dir}
```
**(run this command for every rank, i.e. for all shards)**
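Conceptually, this step just loads the saved k-means model and assigns each frame of a feature shard to its nearest centroid, writing one line of cluster IDs per utterance (a sketch of the idea, not the actual script; paths are placeholders):
```python
# Sketch of k-means application to one shard (illustrative, not the actual script).
import numpy as np
import joblib

km = joblib.load("/path/to/feat/km.pkl")                 # model produced in the previous step
feats = np.load("/path/to/feat/train_0_3.npy")           # placeholder: one feature shard
lengths = [int(l) for l in open("/path/to/feat/train_0_3.len")]

labels = km.predict(feats)                               # one cluster ID per frame
offsets = np.cumsum([0] + lengths)
with open("/path/to/labels/train_0_3.km", "w") as f:
    for s, e in zip(offsets[:-1], offsets[1:]):          # one line of IDs per utterance
        f.write(" ".join(map(str, labels[s:e])) + "\n")
```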
- merge all the shards for ${split} by running:
```
for rank in $(seq 0 $((nshard - 1))); do
cat $lab_dir/${split}_${rank}_${nshard}.km
done > $lab_dir/${split}.km
```
e.g.: nshard = 3; lab_dir = /home/daniel094144/data/LibriSpeech/Hubert100/; split = train
```
for rank in $(seq 0 $((3 - 1))); do
cat /home/daniel094144/data/LibriSpeech/Hubert100/train_${rank}_3.km
done > /home/daniel094144/data/LibriSpeech/Hubert100/train.km
```
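A quick check after merging is that the merged label file has exactly one line per audio entry in the manifest (a sketch; paths are placeholders):
```python
# Verify that labels and manifest are aligned (sketch; paths are placeholders).
tsv_lines = open("/path/to/data/train.tsv").read().splitlines()
km_lines = open("/path/to/labels/train.km").read().splitlines()
# the tsv has one extra header line (the root directory)
assert len(tsv_lines) - 1 == len(km_lines), "expect one .km line per audio file"
print("frames in first utterance:", len(km_lines[0].split()))
```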
Each line of the resulting `train.km` holds the space-separated frame-level cluster IDs for the corresponding audio file in `train.tsv`.

- **prepare the `dict.km.txt` file**
```
python preprocess.py -s km --trainpref /home/daniel094144/data/LibriSpeech/Hubert100/train --destdir /home/daniel094144/data/LibriSpeech/Hubert100/ --only-source
```
See the fairseq command-line tools documentation: https://fairseq.readthedocs.io/en/latest/command_line_tools.html
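`dict.km.txt` is a plain fairseq dictionary with one `<symbol> <count>` pair per line; a dictionary file in the same format can also be built directly from the merged labels (a sketch; paths are placeholders):
```python
# Build dict.km.txt directly from the merged labels (sketch; paths are placeholders).
from collections import Counter

counts = Counter()
with open("/path/to/labels/train.km") as f:
    for line in f:
        counts.update(line.split())

with open("/path/to/labels/dict.km.txt", "w") as f:
    # fairseq dictionary format: "<symbol> <count>" per line, most frequent first
    for sym, cnt in sorted(counts.items(), key=lambda kv: -kv[1]):
        f.write(f"{sym} {cnt}\n")
```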
**[NOTE]** Finish the above preparation for both the {train} and {valid} splits before model pretraining.
- Before pretraining, modify the config file:
    - number of GPUs used in distributed training: `distributed_world_size`
    - `task.labels`: change from `???` to `["km"]`; this refers to the `dict.km.txt` dictionary file

- HuBERT pretraining can be run with the command below:
```
$ python fairseq_cli/hydra_train.py \
--config-dir /path/to/fairseq-py/examples/hubert/config/pretrain \
--config-name hubert_base_librispeech \
task.data=/path/to/data task.label_dir=/path/to/labels model.label_rate=100
```
e.g.:
```
$ python fairseq_cli/hydra_train.py \
--config-dir /home/daniel094144/fairseq/examples/hubert/config/pretrain \
--config-name hubert_base_librispeech \
task.data=/home/daniel094144/data/LibriSpeech/LS100_fairseq task.label_dir=/home/daniel094144/data/LibriSpeech/Hubert100 model.label_rate=100
```
e.g. (the custom `disentangle_hubert` variant, which takes an extra `task.feat_dir` override):
```
$ python fairseq_cli/hydra_train.py \
--config-dir /home/daniel094144/fairseq/examples/disentangle_hubert/config/pretrain \
--config-name hubert_base_librispeech \
task.data=/home/daniel094144/data/LibriSpeech/LS100_fairseq task.label_dir=/home/daniel094144/data/LibriSpeech/Hubert100 \
task.feat_dir=/home/daniel094144/data/LibriSpeech/Hubert100
```
## Code Architecture
### Training code flow chart
```
'fairseq/fairseq_cli/hydra_train.py': read config via hydra
|
|__ distributed_utils.call_main(): Assign distributed training setup and call main function
|__ 'fairseq/fairseq_cli/train.py'
|__ main(): build trainer and training procedure
| |__ task: in 'fairseq/fairseq/tasks/', each .py file defines
| | the task-dependent code, e.g. HuBERT pretraining
| | is defined in 'fairseq/fairseq/tasks/hubert_pretraining.py'
| |
| |__ class HubertPretrainingTask(FairseqTask)
| |
| |__ HubertDataset(FairseqDataset): 'fairseq/fairseq/data/audio/hubert_dataset.py'
| | |__ FairseqDataset
| | |__ load_label(label_path)
| | |__ load_audio(manifest_path): return root, names, inds, tot, sizes
| | |__ get_audio: by soundfile
| | |__ get_label
| | |__ *get_pitch*
| | |__ *get_speaker*
| |
| |__ *FairseqTask: 'fairseq/fairseq/tasks/fairseq_task.py'.
| Tasks store dictionaries and provide helpers for loading/iterating over Datasets,
| initializing the Model/Criterion and calculating the loss.
| |
| |__ build_model
| | |__ models: 'fairseq/fairseq/models/'
| | |__ Hubert: 'fairseq/fairseq/models/hubert/hubert.py'
| | |__ HubertModel(BaseFairseqModel)
| | |__ forward: return unnormalized logits from nce loss
| | |__ forward_features
| | |__ **GRL layer**
| | |__ **disentangler output**
| | |__ apply_mask
| | |__ compute_pred: call compute_nce
| | |__ compute_nce: using cosine similarity to calculate NCE loss (return unnormalized logits)
| | |
| | |__ encoder:
| | | |__ TransformerEncoder: 'fairseq.models.wav2vec.wav2vec2'
| | | |__ TransformerSentenceEncoderLayer
| | |
| | |__ feature_extractor
| | |__ ConvFeatureExtractionModel:'fairseq.models.wav2vec.wav2vec2'
| |
| |__ build_criterion
| | |__ 'fairseq/fairseq/criterions/'
| | |__ HubertCriterion(FairseqCriterion):'fairseq/fairseq/criterions/hubert_criterion.py'
| | |__ forward: input model, forward model, return loss
| | |__ model.get_logits: get masked or unmasked logits
| | |__ model.get_targets: return zeros tensor
| | |__ F.cross_entropy: (log_softmax + NLL_loss) input is unnormalized logits
| |__ train_step: Do forward and backward, and return the loss as computed by *criterion* for the given *model* and *sample*.
| |__ valid_step
| |__ optimizer_step
| |__ AMPOptimizer: support AMP (automatic mixed precision) training
| 'fairseq/fairseq/optim/amp_optimizer.py'
|
|__ trainer(cfg, task, model, criterion): at 'fairseq/fairseq/trainer.py':
| handle distributed training and logging, using HubertPretrainingTask to do training
| |__ self.task = HubertPretrainingTask
| |__ setup lr
| |__ lr_scheduler: 'fairseq/fairseq/optim/lr_scheduler/'
|
|__ train(): 'progress' support for wandb, logging
```
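To make the `compute_nce` step above concrete: the logits are cosine similarities between the prediction and one positive plus several negative code embeddings (positive first, hence `get_targets` returning zeros), scaled by a temperature and fed to `F.cross_entropy`. A simplified, illustrative sketch (not the exact fairseq code; tensor shapes are assumptions):
```python
# Simplified sketch of HuBERT-style NCE (illustrative, not the exact fairseq code).
import torch
import torch.nn.functional as F

def compute_nce(x, pos, negs, logit_temp=0.1):
    """x: (T, D) predictions, pos: (T, D) positive targets, negs: (N, T, D) negatives."""
    targets = torch.cat([pos.unsqueeze(0), negs], dim=0)               # (N+1, T, D), positive at index 0
    logits = F.cosine_similarity(x.float(), targets.float(), dim=-1)   # (N+1, T)
    return (logits / logit_temp).transpose(0, 1)                       # (T, N+1), unnormalized

T, D, N = 4, 8, 5
logits = compute_nce(torch.randn(T, D), torch.randn(T, D), torch.randn(N, T, D))
target = logits.new_zeros(logits.size(0), dtype=torch.long)            # positive is always class 0
loss = F.cross_entropy(logits, target)                                 # log_softmax + NLL on raw logits
```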