# HuBERT training with the fairseq library
*Prerequisites: fairseq installation, LibriSpeech dataset*
To install fairseq, follow the official installation instructions on the fairseq GitHub page.
The tutorial below uses LibriSpeech train-clean-100 (training set) and dev-clean (validation set).
* learned k-means (HuBERT large / layer 20 / libri-train-100): https://github.com/voidful/hubert-cluster-code
* HuBERT cluster-ID ASR: https://huggingface.co/voidful/asr_hubert_cluster_bart_base
## Label preparation
In `./simple_kmeans`, run the preprocessing steps below for both the training set and the validation set.
* feature extraction
`.tsv` manifest files contain a list of audio files: the first line is the root directory, and each following line is the relative path of one audio file:
```
<root-dir>
<audio-path-1>
<audio-path-2>
...
```
Generate the manifest with:
```
$ python ${FAIRSEQ_ROOT}/examples/wav2vec/wav2vec_manifest.py \
/dir/to/save/audio/files --ext flac \
--dest /path/to/new/train.tsv --valid-percent 0
```
Then dump the MFCC features used for k-means:
- nshard: the list of filenames is split into nshard segments
- rank: index of the segment to process, where rank ranges over [0, nshard - 1]
```
$ python dump_mfcc_feature.py ${tsv_dir} ${split} ${nshard} ${rank} ${feat_dir}
```
Later we will merge these shards, so every segment (every rank) has to be dumped here. The resulting files are saved at `${feat_dir}/${split}_${rank}_${nshard}.{npy,len}`.
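The `.npy` file holds the concatenated frame-level features of a shard and the `.len` file holds one frame count per utterance, so a shard can be inspected roughly as follows (a sketch; paths and shard indices are placeholders):
```python
# Inspect one dumped feature shard (sketch; paths/indices are placeholders).
import numpy as np

feat_dir, split, rank, nshard = "/path/to/feat", "train", 0, 3
feats = np.load(f"{feat_dir}/{split}_{rank}_{nshard}.npy")            # (total_frames, feat_dim)
lengths = [int(l) for l in open(f"{feat_dir}/{split}_{rank}_{nshard}.len")]
assert feats.shape[0] == sum(lengths)                                 # per-utterance lengths add up
offsets = np.cumsum([0] + lengths)
first_utt = feats[offsets[0]:offsets[1]]                              # features of the first utterance
print(feats.shape, len(lengths), first_utt.shape)
```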
* k-means clustering
```
python learn_kmeans.py ${feat_dir} ${split} ${nshard} ${km_path} ${n_cluster} --percent 0.1
```
You can use more CPUs to speed this up, e.g. on battleship:
```
hrun -c 56 -m 24 python learn_kmeans.py \
/home/daniel094144/data/LibriSpeech/Hubert100/ \
train \
3 \
/home/daniel094144/data/LibriSpeech/Hubert100/km.pkl \
100 \
--percent 1
```
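Roughly speaking, `learn_kmeans.py` fits a scikit-learn MiniBatchKMeans model on a subsample of the dumped features (controlled by `--percent`) and saves it to `${km_path}`. A stripped-down sketch of the idea, not the actual fairseq script (paths are placeholders):
```python
# Stripped-down sketch of the k-means step (illustrative, not the actual fairseq script).
import numpy as np
import joblib
from sklearn.cluster import MiniBatchKMeans

feats = np.load("/path/to/feat/train_0_3.npy")    # placeholder: one MFCC feature shard
percent = 0.1                                     # fraction of frames used for fitting
idx = np.random.choice(len(feats), int(len(feats) * percent), replace=False)

km = MiniBatchKMeans(n_clusters=100, batch_size=10000, max_no_improvement=100, n_init=20)
km.fit(feats[idx])
joblib.dump(km, "/path/to/feat/km.pkl")           # later loaded for k-means application
```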
* k-means application
```
python dump_km_label.py ${feat_dir} \
${split} ${km_path} ${nshard} ${rank} ${lab_dir}
```
**(run this command for every rank, i.e. for all shards)**
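Conceptually, this step just loads the saved k-means model and assigns each frame of a feature shard to its nearest centroid, writing one line of cluster IDs per utterance (a sketch of the idea, not the actual script; paths are placeholders):
```python
# Sketch of k-means application to one shard (illustrative, not the actual script).
import numpy as np
import joblib

km = joblib.load("/path/to/feat/km.pkl")                 # model produced in the previous step
feats = np.load("/path/to/feat/train_0_3.npy")           # placeholder: one feature shard
lengths = [int(l) for l in open("/path/to/feat/train_0_3.len")]

labels = km.predict(feats)                               # one cluster ID per frame
offsets = np.cumsum([0] + lengths)
with open("/path/to/labels/train_0_3.km", "w") as f:
    for s, e in zip(offsets[:-1], offsets[1:]):          # one line of IDs per utterance
        f.write(" ".join(map(str, labels[s:e])) + "\n")
```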
- merge all the shards for ${split} by running:
```
for rank in $(seq 0 $((nshard - 1))); do
cat $lab_dir/${split}_${rank}_${nshard}.km
done > $lab_dir/${split}.km
```
e.g.: nshard = 3; lab_dir = /home/daniel094144/data/LibriSpeech/Hubert100/; split = train
```
for rank in $(seq 0 $((3 - 1))); do
cat /home/daniel094144/data/LibriSpeech/Hubert100/train_${rank}_3.km
done > /home/daniel094144/data/LibriSpeech/Hubert100/train.km
```
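A quick check after merging is that the merged label file has exactly one line per audio entry in the manifest (a sketch; paths are placeholders):
```python
# Verify that labels and manifest are aligned (sketch; paths are placeholders).
tsv_lines = open("/path/to/data/train.tsv").read().splitlines()
km_lines = open("/path/to/labels/train.km").read().splitlines()
# the tsv has one extra header line (the root directory)
assert len(tsv_lines) - 1 == len(km_lines), "expect one .km line per audio file"
print("frames in first utterance:", len(km_lines[0].split()))
```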
Each line of the resulting `train.km` holds the space-separated frame-level cluster IDs for the corresponding audio file in `train.tsv`.

- **prepare the `dict.km.txt` file**
```
python preprocess.py -s km --trainpref /home/daniel094144/data/LibriSpeech/Hubert100/train --destdir /home/daniel094144/data/LibriSpeech/Hubert100/ --only-source
```
See the fairseq command-line tools documentation: https://fairseq.readthedocs.io/en/latest/command_line_tools.html
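`dict.km.txt` is a plain fairseq dictionary with one `<symbol> <count>` pair per line; a dictionary file in the same format can also be built directly from the merged labels (a sketch; paths are placeholders):
```python
# Build dict.km.txt directly from the merged labels (sketch; paths are placeholders).
from collections import Counter

counts = Counter()
with open("/path/to/labels/train.km") as f:
    for line in f:
        counts.update(line.split())

with open("/path/to/labels/dict.km.txt", "w") as f:
    # fairseq dictionary format: "<symbol> <count>" per line, most frequent first
    for sym, cnt in sorted(counts.items(), key=lambda kv: -kv[1]):
        f.write(f"{sym} {cnt}\n")
```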
**[NOTE]** Finish the above preparation for both the {train} and {valid} splits before model pretraining.
- Before pretraining, modify the config file:
    - number of GPUs used in distributed training: `distributed_world_size`
    - `task.labels`: change from `???` to `["km"]`; this refers to the `dict.km.txt` dictionary file

- HuBERT pretraining can be run with the command below:
```
$ python fairseq_cli/hydra_train.py \
--config-dir /path/to/fairseq-py/examples/hubert/config/pretrain \
--config-name hubert_base_librispeech \
task.data=/path/to/data task.label_dir=/path/to/labels model.label_rate=100
```
e.g.:
```
$ python fairseq_cli/hydra_train.py \
--config-dir /home/daniel094144/fairseq/examples/hubert/config/pretrain \
--config-name hubert_base_librispeech \
task.data=/home/daniel094144/data/LibriSpeech/LS100_fairseq task.label_dir=/home/daniel094144/data/LibriSpeech/Hubert100 model.label_rate=100
```
e.g. (the custom `disentangle_hubert` variant, which takes an extra `task.feat_dir` override):
```
$ python fairseq_cli/hydra_train.py \
--config-dir /home/daniel094144/fairseq/examples/disentangle_hubert/config/pretrain \
--config-name hubert_base_librispeech \
task.data=/home/daniel094144/data/LibriSpeech/LS100_fairseq task.label_dir=/home/daniel094144/data/LibriSpeech/Hubert100 \
task.feat_dir=/home/daniel094144/data/LibriSpeech/Hubert100
```
## Code Architecture
### Training code flow chart
```
'fairseq/fairseq_cli/hydra_train.py': read config via hydra
|
|__ distributed_utils.call_main(): Assign distributed training setup and call main function
|__ 'fairseq/fairseq_cli/train.py'
|__ main(): build trainer and training procedure
| |__ task: in 'fairseq/fairseq/tasks/', each .py file defines
| | the task-dependent code, e.g. HuBERT pretraining
| | is defined in 'fairseq/fairseq/tasks/hubert_pretraining.py'
| |
| |__ class HubertPretrainingTask(FairseqTask)
| |
| |__ HubertDataset(FairseqDataset): 'fairseq/fairseq/data/audio/hubert_dataset.py'
| | |__ FairseqDataset
| | |__ load_label(label_path)
| | |__ load_audio(manifest_path): return root, names, inds, tot, sizes
| | |__ get_audio: by soundfile
| | |__ get_label
| | |__ *get_pitch*
| | |__ *get_speaker*
| |
| |__ *FairseqTask: 'fairseq/fairseq/tasks/fairseq_task.py'.
| Tasks store dictionaries and provide helpers for loading/iterating over Datasets,
| initializing the Model/Criterion and calculating the loss.
| |
| |__ build_model
| | |__ models: 'fairseq/fairseq/models/'
| | |__ Hubert: 'fairseq/fairseq/models/hubert/hubert.py'
| | |__ HubertModel(BaseFairseqModel)
| | |__ forward: return unnormalized logits from nce loss
| | |__ forward_features
| | |__ **GRL layer**
| | |__ **disentangler output**
| | |__ apply_mask
| | |__ compute_pred: call compute_nce
| | |__ compute_nce: using cosine similarity to calculate NCE loss (return unnormalized logits)
| | |
| | |__ encoder:
| | | |__ TransformerEncoder: 'fairseq.models.wav2vec.wav2vec2'
| | | |__ TransformerSentenceEncoderLayer
| | |
| | |__ feature_extractor
| | |__ ConvFeatureExtractionModel:'fairseq.models.wav2vec.wav2vec2'
| |
| |__ build_criterion
| | |__ 'fairseq/fairseq/criterions/'
| | |__ HubertCriterion(FairseqCriterion):'fairseq/fairseq/criterions/hubert_criterion.py'
| | |__ forward: input model, forward model, return loss
| | |__ model.get_logits: get masked or unmasked logits
| | |__ model.get_targets: return zeros tensor
| | |__ F.cross_entropy: (log_softmax + NLL_loss) input is unnormalized logits
| |__ train_step: Do forward and backward, and return the loss as computed by *criterion* for the given *model* and *sample*.
| |__ valid_step
| |__ optimizer_step
| |__ AMPOptimizer: support AMP (automatic mixed precision) training
| 'fairseq/fairseq/optim/amp_optimizer.py'
|
|__ trainer(cfg, task, model, criterion): at 'fairseq/fairseq/trainer.py':
| handle distributed training and logging, using HubertPretrainingTask to do training
| |__ self.task = HubertPretrainingTask
| |__ setup lr
| |__ lr_scheduler: 'fairseq/fairseq/optim/lr_scheduler/'
|
|__ train(): 'progress' support for wandb, logging
```
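To make the `compute_nce` step above concrete: the logits are cosine similarities between the prediction and one positive plus several negative code embeddings (positive first, hence `get_targets` returning zeros), scaled by a temperature and fed to `F.cross_entropy`. A simplified, illustrative sketch (not the exact fairseq code; tensor shapes are assumptions):
```python
# Simplified sketch of HuBERT-style NCE (illustrative, not the exact fairseq code).
import torch
import torch.nn.functional as F

def compute_nce(x, pos, negs, logit_temp=0.1):
    """x: (T, D) predictions, pos: (T, D) positive targets, negs: (N, T, D) negatives."""
    targets = torch.cat([pos.unsqueeze(0), negs], dim=0)               # (N+1, T, D), positive at index 0
    logits = F.cosine_similarity(x.float(), targets.float(), dim=-1)   # (N+1, T)
    return (logits / logit_temp).transpose(0, 1)                       # (T, N+1), unnormalized

T, D, N = 4, 8, 5
logits = compute_nce(torch.randn(T, D), torch.randn(T, D), torch.randn(N, T, D))
target = logits.new_zeros(logits.size(0), dtype=torch.long)            # positive is always class 0
loss = F.cross_entropy(logits, target)                                 # log_softmax + NLL on raw logits
```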