Masked Autoencoders (MAE) Model Training Notes

# Masked Autoencoders (MAE) Model Training Notes

###### tags: `Training_Log` `ebird` `Masked Autoencoders`

### About this document
- Records the training process based on the officially released MAE scripts
- Main modifications:
    1. Changed the multi-node, multi-GPU setup to a budget single-machine, single-GPU setup
    2. Switched to mixed precision (amp) and explicitly enabled gradient clipping (clip_grad) to prevent vanishing/exploding gradients during training (loss NaN)
    3. Fixed the module version incompatibilities introduced by timm==0.3.2

---

#### [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)
- [Official GitHub](https://github.com/facebookresearch/mae)

[Masked Autoencoders (MAE) paper notes](https://hackmd.io/lTqNcOmQQLiwzkAwVySh8Q)
[XAI literature search notes: XAI for Vision Transformer](https://hackmd.io/HsbIfCq_RhG5dX3vGvs-wg)

## Official MAE architecture diagram
![](https://hackmd.io/_uploads/r179dGlXc.png)

# I. Commands and Parameter Settings
## [Pretrain]
### Visualizing training results
#### [Demo]
- [Demo](https://colab.research.google.com/github/facebookresearch/mae/blob/main/demo/mae_visualize.ipynb)
- It can reconstruct just about any photo, bird or not

:::spoiler
- Hornbill
![](https://hackmd.io/_uploads/SklR_fxmq.png =600x)
- Roger the kangaroo
![](https://hackmd.io/_uploads/Hk3loGxX9.png =600x)
- Shanghai night view
![](https://hackmd.io/_uploads/rkmwifxX5.png =600x)
:::

#### Effect of `norm_pix_loss` on training results
:::spoiler
- Top image: Ebirzation [Demo](https://colab.research.google.com/github/facebookresearch/mae/blob/main/demo/mae_visualize.ipynb)
- Texture
![](https://i.imgur.com/VdJXMhP.png =500x)
- Large numbers of same-colored patches
![](https://i.imgur.com/qnV7fqa.png =500x)
- The Ebird-pretrained model performs poorly on uniform-color patches (var=0)
- Discussion
    - ~~Data problem~~
        - ~~Insufficient data; the model has more parameters than there are samples~~
            - data size:
                - Ebird : 0.89M
                - ImageNet 10M
            - para. of ViT base : 1.14M
    - Data bias + per-patch normalization before the loss is computed (`--norm_pix_loss` on)
        - Bird photos probably contain many patches of a single uniform color (sky, sea); `--norm_pix_loss` may wipe out the features of such patches so that they cannot be learned
:::

### Training data
#### Training on the eBird dataset
:::spoiler
- More than 890k bird photos from around the world, crawled from the eBird/MacaulayLibrary database
- input size: 224 * 224 * 3
- 14 * 14 patches per image, each patch 16 * 16 pixels
- Following the official configuration, 75% of the patches are masked during training
- [ebird data processing notes](https://hackmd.io/-FrCog1WSf2Z6xFjY6gOdA)
:::

#### Pretrain training script and hyperparameter settings
##### Pretrain training script
###### Official MAE pretrain settings
![](https://i.imgur.com/QA1Nl5t.png =350x)

###### main_pretrain.py
- Run with mixed precision enabled

:::spoiler code
```bash=
python main_pretrain.py \
    --batch_size 400 \
    --accum_iter 10 \
    --model mae_vit_base_patch16 \
    --mask_ratio 0.75 \
    --epochs 1600 \
    --warmup_epochs 40 \
    --blr 1.5e-4 \
    --weight_decay 0.05 \
    --data_path /home/esslab/AI_projects/shared/birds \
    --clip_grad=5 \
    --norm_pix_loss \
    --amp \
    --resume=output_dir/checkpoint-xxx.pth \
    --start_epoch=xxx
```
:::

##### Pretrain hyperparameter settings
:::spoiler
###### Learning rate (lr)
- The initial lr is computed with the [linear scaling rule](https://arxiv.org/abs/1706.02677)
    - lr = blr * effective batch size / 256.

:::spoiler Half-cycle cosine learning rate (lr) in detail
- The lr schedule decays with a half-cycle cosine
    - Decay the learning rate with half-cycle cosine after warmup
    - $lr \times 0.5 \times (1 + \cos(\pi \times epoch_{curr} / epoch_{total}))$
        - epoch_curr = epoch_curr - epoch_warmup
        - epoch_total = epoch_total - epoch_warmup
- Controlled through the cosine function
    - The cosine runs from cos(0)=1 to cos(pi)=-1
    - 0.5x(1+cos(x)) => starts at 1 and approaches 0
    - i.e. lr varies between lr x 1 and lr x 0, and the total number of epochs determines how fast it decays

![](https://hackmd.io/_uploads/BJ7elWUx9.png =300x)
- [`lr_sched.py`](https://github.com/facebookresearch/mae/blob/main/util/lr_sched.py)

:::spoiler
```python=
import math


def adjust_learning_rate(optimizer, epoch, args):
    """Decay the learning rate with half-cycle cosine after warmup"""
    if epoch < args.warmup_epochs:
        # linear warmup from 0 to args.lr
        lr = args.lr * epoch / args.warmup_epochs
    else:
        # half-cycle cosine decay from args.lr down to args.min_lr
        lr = args.min_lr + (args.lr - args.min_lr) * 0.5 * \
            (1. + math.cos(math.pi * (epoch - args.warmup_epochs) / (args.epochs - args.warmup_epochs)))
    # write the new lr into every parameter group (layer-wise scale if present)
    for param_group in optimizer.param_groups:
        if "lr_scale" in param_group:
            param_group["lr"] = lr * param_group["lr_scale"]
        else:
            param_group["lr"] = lr
    return lr
```
:::
:::

###### Finetune: layer-wise lr decay
###### Effective batch size
- effective batch size = 4,096 = 64(batch)x8(node)x8(gpu)x1(accum_iter)

:::spoiler Effective batch size settings in detail
- The official pretraining uses ImageNet-1k, about 1.28M images
    - 1000 object classes
    - 1,281,167 training images, 50,000 validation images and 100,000 test images.
- We use 200(batch)x20(accum_iter)
    - Images are stored as 256x256 png
    - Preprocessing them directly into 224x224 jpg might save more memory
- Memory usage
    - max mem: 40607
    - 45307MiB / 46068MiB
:::

###### Estimating the number of training epochs
- The official setting is 800
    - Parameter updates per epoch = 1.28M / 4,096 ~= 312
    - Total updates = 800 epochs x 312 = 250k (250,227)

:::spoiler Epoch estimate
- The moth dataset currently used for testing has 12.5k images (n=12,515)
- `batch=200, accum_iter=20`, effective batch size = 4,000
- Parameter updates per epoch = 12.5k / 4,000 ~= 3.128
- To reach the same total number of updates (250k), epochs = 1.28M / 12.5k x 800 epochs ~= 82,000 epochs
- For the test phase on the moth dataset, epochs is set to 2,400
:::
:::

# II. Training Results
#### Model architecture
- All ViT models used are the base architecture
- ImageNet pretrain: `--norm_pix_loss` on
- Ebird pretrain: `--norm_pix_loss` on, off

#### cls_token or gap
:::spoiler Should the final aggregated abstract feature (shape: embedding_dim) come from cls_token or global average pooling?
- cls_token
    - The class embedding specific to the ViT architecture; a learnable weight
    - It is prepended to the patch embeddings and goes through the MAE Encoder (ViT) with them, so during training it picks up information from all patches (via matrix multiplication)
- gap (global average pooling)
    - The method ResNet uses in its last layer to aggregate all information into the final latent vectors
    - Pools the embeddings of all patches (excluding cls_token) into a high-dimensional vector of shape (, embedding_dim)
- Latent vectors produced by the MAE Encoder (ViT)
    - Shape: (batch, patch_num, embedding_dim)
- In MAE the cls_token is an embedding kept only to match the ViT architecture; before transfer learning on a downstream task, its weights carry no information
:::

### Data used in the Fine-tune stage
:::info
To evaluate model performance, we looked for a bird image dataset with a similar data distribution to validate the training results
:::

#### Evaluation on the bird images in the iNaturalist 2021 dataset
:::spoiler
- Data Source
    - [iNaturalist 2021 Dataset](https://github.com/visipedia/inat_comp/tree/master/2021)
        - There is a total of 10,000 species in the dataset. The full training dataset contains nearly 2.7M images.
        - To make the dataset more accessible we have also created a "mini" training dataset with 50 examples per species for a total of 500K images.
        - Each species has 10 validation images. There are a total of 500,000 test images.
- Extracted the data for 1486 bird species
    - train: n = 279 /per sp.(ave.)
    - val : n=10 /per sp.

![](https://i.imgur.com/LnbfZtA.png =600x)
- For the extraction process see [Pipeline to get and preprocess](https://github.com/YunghuiHsu/ebird_project/tree/main/iNaturalist_2021_dataset)
:::

## Training performance
### Linear Probing performance
#### Linear Probing settings and top-1 accuracy on the classification task

| Pretrain Dataset | NP^#^  | classifier    | Effective batch size | Server | Fc_head | Accu_top1 |
| ---------------- |:------:| ------------- |:--------------------:|:------:|:-------:|:---------:|
| Ebird            |  off   | --cls_token   |   10x1638=16,380     |  3090  |   fc2   |  36.62%   |
| Ebird            |   on   | --cls_token   |   10x1638=16,380     |  3090  |   fc2   |  49.77%   |
| Ebird            |   on   | --global_pool |   10x1638=16,380     |  3090  |   fc1   |  39.87%   |
| ImageNet         |   on   | --cls_token   |   10x1638=16,380     |  3090  |   fc1   |  24.18%   |

- Model Architecture: MAE ViT base
- ^#^ norm_pix_loss
- fc_head:
    - fc1: BatchNorm > fc
    - fc2: BatchNorm > fc > LeakyReLU > fc > LeakyReLU

![](https://hackmd.io/_uploads/Hk66iH4dc.png =500x)

### Fine-tune performance
#### Performance in 'Finetune' at different hyperparameters
- Dataset for Finetune :
    - iNaturalist 2021
    - n_classes: 1486
- norm_pix_loss: on > off
- token: cls > global_pool

| Dataset(PT) | NP^#^  | classifier      | Accu_top1  |
|:-----------:|:------:| --------------- |:----------:|
|    eBird    |  off   | --cls_token     |   83.31%   |
|    eBird    |  off   | --global_pool   |   82.62%   |
|    eBird    | ==on== | ==--cls_token== | ==85.47%== |
|    eBird    |   on   | --global_pool   |   84.92%   |
|  ImageNet   |   on   | --cls_token     |   83.83%   |

:::spoiler table note:
- Effective batch size = 3x342=1,026 on A40
- Model Architecture: MAE ViT base
- ^#^ norm_pix_loss
:::

![](https://i.imgur.com/JVDylrA.png =500x)

#### Performance in 'Finetune' at different Dataset
- Dataset(PT) : eBird; norm_pix_loss : on

|   Dataset(FT)    | Data Size(Train) | Data Size(Val) | n_classes | classifier  | Accu_top1  |
|:----------------:|:----------------:|:--------------:|:---------:|:-----------:|:----------:|
| `ebird_finetune` |     ~ 513.5k     |    ~ 17.2k     |   9484    | --cls_token | ==89.45%== |
|   `iNat 2021`    |     ~ 414.8k     |    1486x10     |   1486    | --cls_token |   85.47%   |

![](https://i.imgur.com/tGYkJXB.png =500x)

#### Performance in 'Evaluation' at different Dataset
- Pretrain: on eBird Dataset

|   Dataset(FT)    |  Dataset evaluate   | n_classes | Accu_top1 |
|:----------------:|:-------------------:|:---------:|:---------:|
| `ebird_finetune` |    iNat 2021/val    |   1486    |     -     |
|   `iNat 2021`    | iNat 2021/val_eBird |   1486    |   95.7%   |
|   `iNat 2021`    |    iNat 2021/val    |   1486    |   85.5%   |

- On inspection, the val_eBird data is of better quality than iNat 2021/val: the subject is clear and prominent

### Comparison with a classic CNN model (ResNet50)
#### ResNet50 settings and top-1 accuracy on the classification task

| Pretrain Dataset | Data Aug  | Effective batch size | Server |    blr    | epoch       | Accu_top1 |
| ---------------- | --------- |:--------------------:| ------ |:---------:| ----------- |:---------:|
| Ebird            | follow LP |     3x340=1,020      | A40    |   5e-4    | 100         |  75.85%   |
| Ebird            | follow FT |     3x344=1,032      | 3090   |   5e-4    | 100         |  78.91%   |
| Ebird            | follow FT |     3x344=1,032      | 3090   | 1.4888e-4 | 300(resume) |  80.12%   |

- Model Architecture: ResNet50
- Effective batch size and blr follow FT
    - blr = 5e-4, lr = 2e-3

![](https://i.imgur.com/7frhVsQ.png =500x)
- Data Aug:
    - FT : Augmentation follow Fine-tune
    - LP : Augmentation follow Linear Probing

### MAE FT vs MAE LP vs CNN (ResNet50)
![](https://i.imgur.com/jG0PHT5.png)
- The accuracy curve of Resnet50_FT still has a slope, i.e. it has not yet saturated

## Parameter settings explained
### Linear Probing training parameters
#### The LP head
- The head at the end of the MAE encoder exists to serve the downstream classification task
- , and it also cannot be matched directly with the original decoder?
- Original ViT structure : ['patch_embed', 'pos_drop', 'blocks', 'head', 'fc_norm'].
    - shape of head : (n_class, dim)
    - shape of fc_norm : (dim)
- In LP the order of norm and head is swapped : ['patch_embed', 'pos_drop', 'blocks', 'norm', 'head']. shape : (n_class, dim)

#### ==Added an option to change the Linear head from one layer to two==
:::spoiler
- We want to use the original MAE encoder for the downstream classification task while still being able to pair the original encoder with the decoder to restore and reconstruct images when exploring the image representations the model has learned. Fine-tuning gives excellent results on the downstream classification task, but after transfer learning the encoder weights no longer match the original decoder.
- So, to keep the original structure as far as possible, improve classification performance and still avoid overfitting, we increased the linear fully connected layer of the classifier head at the end of Linear Probing from one layer to two
:::

:::spoiler multi_fc in main_linprobe.py
```python=
# Excerpt from main_linprobe.py; `model`, `args` and timm's `trunc_normal_`
# are already defined/imported at this point in the script.
if args.multi_fc:
    # manually initialize fc layer: following MoCo v3
    fc1 = torch.nn.Linear(model.head.in_features, 1024, bias=False)
    fc2 = torch.nn.Linear(1024, args.nb_classes, bias=False)
    for fc_ in [fc1, fc2]:
        trunc_normal_(fc_.weight, std=0.01)

    # BatchNorm > fc > LeakyReLU > fc > LeakyReLU (the "fc2" head in the results table)
    model.head = torch.nn.Sequential(
        torch.nn.BatchNorm1d(model.head.in_features, affine=False, eps=1e-6),
        fc1,
        torch.nn.LeakyReLU(0.2),
        fc2,
        torch.nn.LeakyReLU(0.2),
    )
```
:::

#### Linear Probing training script and hyperparameter settings
:::spoiler Linear Probing hyperparameter settings
##### Learning rate (lr)
- The initial lr is computed with the [linear scaling rule](https://arxiv.org/abs/1706.02677)
    - lr = blr * effective batch size / 256.
    - Initial blr = 0.1

##### Effective batch size
- effective batch size = 16,384

:::spoiler Effective batch size settings in detail
- The 3090 currently has about 19G available, so the upper limit for the batch size is around 2048
- Taking system out-of-memory, CPU load and GPU capacity into account together
    - batch size = 1024(16,384/16) - 2,048(16,384/8)
    - In tests, with batch size lowered to 1638, accum_iter=10 and num_workers set to at most 3, GPU memory is filled as far as possible while system memory runs stably without OOM
        - When the batch is too large, each worker also takes more memory, and swapping virtual memory becomes less stable
        - Keeping available memory > swap space makes it less likely that the system runs out of memory and the process gets killed
        - System memory via `top` :
            - ![](https://i.imgur.com/p2rT3Jk.png)
        - GPU memory usage via `nvidia-smi`: 17219 MiB
        - `free -m`
            - ![](https://i.imgur.com/aacsFxA.png)
:::
:::

##### Linear Probing training script
###### Official MAE Linear Probing settings
![](https://i.imgur.com/q2fjc4m.png =400x)

###### Official documentation
:::spoiler Official documentation
> - Here the effective batch size is 512 (batch_size per gpu) * 4 (nodes) * 8 (gpus per node) = 16384.
> - blr is the base learning rate. The actual lr is computed by the linear scaling rule: lr = blr * effective batch size / 256.
> - Training time is ~2h20m for 90 epochs in 32 V100 GPUs.
> - To run single-node training, follow the instruction in fine-tuning.
:::

###### Linear Probing training script
:::spoiler
- For single-machine runs use main_linprobe.py directly; submitit.py is for scheduling multi-node distributed runs
- Linear Probing uses very large batches to speed up training; for related literature see:
    - [Large batch training of convolutional networks](https://arxiv.org/abs/1708.03888)
- We compare the model pretrained on ebird with the officially provided ImageNet-pretrained model
    - model = [ImageNetPretrain.pth, EbirdPretrain.pth]
:::

:::spoiler --global_pool
```bash=
python main_linprobe.py \
    --batch_size 1638 \
    --accum_iter 10 \
    --model vit_base_patch16 --global_pool \
    --nb_classes 1486 \
    --finetune ${model} \
    --epochs 90 \
    --blr 0.1 \
    --weight_decay 0.0 \
    --data_path '../../share/iNaturalist_2021' \
    --logfile 'log.txt' \
    --num_workers 3
```
:::

:::spoiler --cls_token and 2fc
```bash=
python main_linprobe.py \
    --batch_size 1638 \
    --accum_iter 10 \
    --model vit_base_patch16 --multi_fc \
    --nb_classes 1486 \
    --finetune ebird_pretrain_1600_vit_base.pth \
    --epochs 90 \
    --blr 0.1 \
    --weight_decay 0.0 \
    --data_path '../../share/iNaturalist_2021' \
    --logfile 'log_log_Ebird_LP_cls_2fc.txt' \
    --num_workers 4
```
:::

### Fine-tune training parameters
#### Official MAE Fine-tune settings
![](https://i.imgur.com/GWyOiLK.png =300x)

#### Fine-tune training script and hyperparameter settings
:::spoiler Hyperparameter settings
- effective batch size = 1024
- epochs 100.
- warmup from 5 up to 10
:::

:::spoiler Training script
```bash=
python3 main_finetune.py \
    --accum_iter 3 \
    --batch_size 342 \
    --model vit_base_patch16 \
    --finetune ebird_pretrain_1600_vit_base.pth \
    --nb_classes 1486 \
    --data_path '../../shared/iNaturalist_2021' \
    --epochs 100 --warmup_epochs 10 \
    --blr 5e-4 --layer_decay 0.65 \
    --weight_decay 0.05 --drop_path 0.1 --mixup 0.8 --cutmix 1.0 --reprob 0.25 \
    --logfile 'log_Ebird_finetune.txt' \
    --num_workers 4
```
:::

### ResNet training parameters
#### ResNet training script and hyperparameter settings
:::spoiler Hyperparameter settings
- ==data augmentation==
    - Strong (same as Fine-tune)
    - Weak (same as Linear Probing)
- effective batch size = 1024
- epochs 100.
- warmup from 5 up to 10
- blr = 5e-4 (Follow Fine-tune)
- model : torchvision pretrained
:::

:::spoiler Training script
- Data aug follow Linear Probing
```bash=
python main_resnet50_linprobe.py \
    --batch_size 340 --accum_iter 3 \
    --nb_classes 1486 \
    --epochs 100 --warmup_epochs 5 \
    --blr 5e-4 --weight_decay 0.05 \
    --data_path '../../share/iNaturalist_2021' \
    --logfile 'log_resnet50_LR.txt' \
    --num_workers 6
```
- Data aug follow Fine-tune
```bash=
python main_resnet50_finetune.py \
    --batch_size 344 --accum_iter 3 \
    --nb_classes 1486 \
    --epochs 100 --warmup_epochs 5 \
    --blr 5e-4 --weight_decay 0.05 \
    --data_path '../../share/iNaturalist_2021' \
    --drop_path 0.1 --reprob 0.25 --mixup 0.8 --cutmix 1.0 \
    --logfile 'log_resnet50_FT.txt' \
    --num_workers 6
```
:::

#### Extending the ResNet training epochs
:::spoiler
- Looking at the two MAE indicators (ViT FT, ViT LP) against the CNN model to be compared (Resnet50)
    - The slope of the CNN accuracy curve clearly has not saturated, so training longer should give a better model

![](https://i.imgur.com/bHRVQmf.png =400x)
- How training was extended
    - Switched the lr schedule to early-stopping
    - Resumed from the previously trained model
    - Used the smaller lr of the later stage
        - 2e-4 at epoch=80, 5e-4 at epoch=70
    - train epoch = 200, lr at epoch 80
        - Back-computed blr = lr x 256 / effective batch size = 2e-4 x 256/1032 = 4.96e-5 * (train/resume) = 9.9225e-5
    - train epoch = 300, lr at epoch 80
        - blr = 2e-4 x 256/1032 * 3 = 1.4888e-4
    - If train epoch is too large, lr drops too slowly, the training curve flattens and early-stopping is triggered prematurely
:::

---

# III. Training Script Optimization
## 1. When the batch size doubles, compute time also grows almost linearly (x2)
- [ ] Unresolved

### Suspected cause:
- Possibly a compute bottleneck somewhere
- In engine.py the mixup data augmentation is deliberately computed on the GPU, which may be the key reason compute time grows linearly

### Solution:
#### 1. Check that the logged metrics are computed correctly
- [x] Confirm that MetricLogger and SmoothedValue use the correct denominator when averaging
- [x] Confirm where the timing of each iteration starts

#### 2. Add timers around possible bottlenecks and check the time spent
- Record the mixup time separately
    - Compare ~~cpu~~ vs gpu: a. standalone processing time, b. time per iteration, c. overall epoch time
    - mixup (timm) runs in cuda by default, so the cpu test was dropped
- Add timers before and after the loss computation (forward, backprop)

:::info
==torch.cuda.synchronize and timing model computation==
- In PyTorch, CUDA operations are launched asynchronously. If `end = time.time()` is called directly, it returns as soon as the preceding work has been queued, while the GPU may still be computing
- Adding a `torch.cuda.synchronize()` line before it makes the host wait until the queued GPU work has finished before `time.time()` runs (synchronous execution)
:::

##### code
:::spoiler engine_finetune.py
```python=
# Excerpt from the training loop in train_one_epoch();
# needs `import time` at the top of engine_finetune.py.
if mixup_fn is not None:
    torch.cuda.synchronize()
    start_mixup = time.time()
    samples, targets = mixup_fn(samples, targets)
    torch.cuda.synchronize()
    metric_logger.update(mixup_time=time.time() - start_mixup)

with torch.cuda.amp.autocast():
    torch.cuda.synchronize()
    start_forward = time.time()
    outputs = model(samples)
    loss = criterion(outputs, targets)
    torch.cuda.synchronize()
    metric_logger.update(forward=time.time() - start_forward)

loss_value = loss.item()

if not math.isfinite(loss_value):
    print("Loss is {}, stopping training".format(loss_value))
    sys.exit(1)

loss /= accum_iter
torch.cuda.synchronize()
start_bp = time.time()
loss_scaler(loss, optimizer, clip_grad=max_norm,
            parameters=model.parameters(), create_graph=False,
            update_grad=(data_iter_step + 1) % accum_iter == 0)
torch.cuda.synchronize()
metric_logger.update(backprop=time.time() - start_bp)
```
:::

#### Test results
- Over a whole iteration mixup takes <1% of the time, so it is not the main performance bottleneck
- forward and backpropagation take 97.4% of the iteration time
- When the batch doubles, forward and backpropagation time also grows linearly; there is no speed-up from batched matrix computation

| accum^1^ | batch^2^ | iter(s) | mixup(s) | data(s) | forward | bp^3^  | total |
| -------- | -------- | ------- |:--------:| ------- | ------- |:------:|:-----:|
| 8        | 128      | 0.385x  |  0.0009  | 0.0001  | 0.1332  | 0.2424 | 20.49 |
| 16       | 64       | 0.197x  |  0.0005  | 0.0001  | 0.0673  | 0.1268 | 21.29 |

^1^: accum_iter, ^2^: batch size, ^3^: backpropagation

##### Discussion
- The GPU compute units may be too small (relative to a TPU) to process the very large matrix operations of a transformer in one go, so there is no effective speed-up
- If there is a chance, this will be tested on a TPU in the future

#### References
- [PyTorch测试模型执行计算耗费的时间](https://www.jianshu.com/p/cbada26ea29d?from=groupmessage)
- [torch.cuda.synchronize()同步统计pytorch调用cuda运行时间](https://blog.csdn.net/weixin_44942126/article/details/117605711)
- [同步(Synchronous)和异步(Asynchronous)](https://www.cnblogs.com/IT-CPC/p/10898871.html?fbclid=IwAR16omJ4yeAEBML3YlZdAFbIoi4Q4eNweoIk9rU2sXH_a5A4NRNVXXZQaJo)

---

# IV. Error Log
## Record template
### Problem
- [x] Resolved?
#### Error message
##### Suspected cause
#### Solution
***

## [Pretrain]
### Issue log (館碩)
1. Mass-crawled and cleaned eBird bird photos and data
2. Cut the millionaire edition (8 hosts, each with 8 Tesla T100?) down to a modest single-machine, single-card (A40) edition
3. Specified a suitable memory block size so that the GPU memory fits as large a batch size as possible (feels about like jeans right after a big meal)
4. To get the benefits of automatic mixed precision training (less memory & faster compute), spent several sleepless days trying all sorts of things, and in the end found that setting grad clip norm (capping the gradient norm at K) was pretty much enough

### 1. The official code defaults to multi-node distributed computing
- [x] Resolved?
#### Error message
##### Suspected cause
- The official code defaults to multi-node distributed computing
#### Solution
- Turn off the default distributed mode
- A memory allocation problem shows up afterwards
    - See 1.1 Memory allocation error (CUDA OOM error)

### 1.1 Memory allocation error (CUDA OOM error)
- [x] Resolved?
#### Error message
`RuntimeError: CUDA out of memory. Tried to allocate **xxx GiB** (GPU 0; xxx GiB total capacity; xxx GiB already allocated; **xxx GiB** free; xxx GiB reserved in total by PyTorch)`
##### Suspected cause
#### Solution
- In the terminal, prefix the python command with `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:80`
- Related discussion thread
    - [Keep getting CUDA OOM error with Pytorch failing to allocate all free memory ](https://discuss.pytorch.org/t/keep-getting-cuda-oom-error-with-pytorch-failing-to-allocate-all-free-memory/133896/4?fbclid=IwAR1W4OqSyhwuq2RpGxbhD6dXCglc4hIjST6VelnmmzC69FIhs-5PYWplb68)

### 2. Dataset layout
- [x] Resolved?
#### Error message
##### Suspected cause
- The dataset layout has to match the ImageNet format
#### Solution
- Move the data into the data/train/0/ path

### 3. Vanishing gradients interrupt training
- [X] Resolved?
#### Error message
- After training for 180-200 epochs, the loss becomes nan
##### Suspected cause
- Possibly caused by mixed precision training
    - [FINETUNE.md](https://github.com/facebookresearch/mae/blob/main/FINETUNE.md) mentions:

:::spoiler
> The original MAE implementation was in TensorFlow+TPU with no explicit mixed precision. This re-implementation is in PyTorch+GPU with automatic mixed precision (torch.cuda.amp). We have observed different numerical behavior between the two platforms. In this repo, we use --global_pool for fine-tuning; using --cls_token performs similarly, but there is a chance of producing NaN when fine-tuning ViT-Huge in GPUs. We did not observe this issue in TPUs. Turning off amp could solve this issue, but is slower.
:::

- PyTorch amp explained: [Automatic mixed precision for Pytorch #25081](https://github.com/pytorch/pytorch/issues/25081)

#### Solution
1. Turn off amp mixed precision
    - Looking at the relevant code, only one line in [engine_pretrain.py](https://github.com/facebookresearch/mae/blob/be47fef7a727943547afb0c670cf1b26034c3c89/engine_pretrain.py) is related to amp mixed precision
    - Comment out `with torch.cuda.amp.autocast()` and run the `loss, _, _ = model` line below it directly
    - trade-off:
        - The batch has to shrink (256>200) and compute is slower

:::spoiler
```python=47
with torch.cuda.amp.autocast():
    loss, _, _ = model(samples, mask_ratio=args.mask_ratio)
```
```python=47
# with torch.cuda.amp.autocast():
loss, _, _ = model(samples, mask_ratio=args.mask_ratio)
```
:::

2. ~~Change the amp.GradScaler scaling settings_1~~
    - [ ] Unconfirmed
    - Defaults: torch.cuda.amp.GradScaler(init_scale=65536.0, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000, enabled=True)
    - The third scaling is estimated to happen around epoch 170, by which time the scale is already 65536 x 2^3
    - The float16 maximum is 6.55 x 1e4, so we have the constraint loss / 16 * 65536 * 2^3 <= 6.55 * 1e4, and after epoch 170 a loss > 0.5 gives nan. So setting the initial scale factor to 1 should be enough; the loss should not be small enough to need up-scaling
3. Change the amp.GradScaler scaling settings_2
    - [x] loss NaN problem solved
    - The original code passes no values when initializing NativeScalerWithGradNormCount and does not enable clip_grad

:::spoiler [code](https://github.com/facebookresearch/mae/blob/main/util/misc.py)
```python=
# mae/main_pretrain.py
# line 57
from util.misc import NativeScalerWithGradNormCount as NativeScaler
loss_scaler = NativeScaler()

# mae/engine_pretrain.py
# line 57
loss_scaler(loss, optimizer, parameters=model.parameters(),
            update_grad=(data_iter_step + 1) % accum_iter == 0)

# mae/util/misc.py
class NativeScalerWithGradNormCount:
    state_dict_key = "amp_scaler"

    def __init__(self):
        self._scaler = torch.cuda.amp.GradScaler()

    def __call__(self, loss, optimizer, clip_grad=None, parameters=None, create_graph=False, update_grad=True):
        self._scaler.scale(loss).backward(create_graph=create_graph)
        if update_grad:
            if clip_grad is not None:
                assert parameters is not None
                self._scaler.unscale_(optimizer)  # unscale the gradients of optimizer's assigned params in-place
                norm = torch.nn.utils.clip_grad_norm_(parameters, clip_grad)
            else:
                self._scaler.unscale_(optimizer)
                norm = get_grad_norm_(parameters)
            self._scaler.step(optimizer)
            self._scaler.update()
        else:
            norm = None
        return norm
```
:::

- After changing what is passed to loss_scaler = NativeScaler() and enabling gradient clipping (clip_grad), the problem was solved
- Terminal command:

:::spoiler
```bash=
python main_pretrain.py \
    --batch_size 400 --accum_iter 10 \
    --model mae_vit_base_patch16 --mask_ratio 0.75 \
    --epochs 1600 --warmup_epochs 40 \
    --blr 1.5e-4 --weight_decay 0.05 \
    --data_path /home/esslab/AI_projects/shared/birds \
    --clip_grad=5 --norm_pix_loss --amp \
    --resume=output_dir/checkpoint-419.pth --start_epoch=419
```
:::

##### git change log
:::spoiler
```diff
diff --git a/main_pretrain.py b/main_pretrain.py
@@ -177,12 +188,12 @@ def main(args):
-    loss_scaler = NativeScaler()
+    loss_scaler = NativeScaler(init_scale=65536.0, growth_factor=2, growth_interval=2000)  # the values passed in are the same as the defaults
# ======================================================================
diff --git a/util/misc.py b/util/misc.py
@@ -251,8 +251,8 @@ def init_distributed_mode(args):
 class NativeScalerWithGradNormCount:
     state_dict_key = "amp_scaler"

-    def __init__(self):
-        self._scaler = torch.cuda.amp.GradScaler()
+    def __init__(self, **kwargs):
+        self._scaler = torch.cuda.amp.GradScaler(**kwargs)
# ======================================================================
diff --git a/engine_pretrain.py b/engine_pretrain.py
@@ -54,7 +59,7 @@ def train_one_epoch(model: torch.nn.Module,
             sys.exit(1)

         loss /= accum_iter
-        loss_scaler(loss, optimizer, parameters=model.parameters(),
+        loss_scaler(loss, optimizer, parameters=model.parameters(), clip_grad=args.clip_grad,
```
:::

### 4. lr setting problem when resuming training
- [x] Resolved?
#### Error message
- When resuming training, the lr value is off

![](https://hackmd.io/_uploads/By4LK3Ve5.png =500x)
##### Suspected cause
- The original training script uses an lr schedule that is determined by the number of epochs specified at the start
    - [main_pretrain.py](https://github.com/facebookresearch/mae/blob/main/main_pretrain.py)

:::spoiler
```python=163
eff_batch_size = args.batch_size * args.accum_iter * misc.get_world_size()

if args.lr is None:  # only base_lr is specified
    args.lr = args.blr * eff_batch_size / 256

print("base lr: %.2e" % (args.lr * 256 / eff_batch_size))
print("actual lr: %.2e" % args.lr)

print("accumulate grad iterations: %d" % args.accum_iter)
print("effective batch size: %d" % eff_batch_size)
```
:::

- [engine_pretrain.py](https://github.com/facebookresearch/mae/blob/main/engine_pretrain.py)
    - During training the lr is only updated once the iterations have accumulated to the effective batch size (every accum_iter steps)

:::spoiler
```python=
def train_one_epoch(model: torch.nn.Module,
                    data_loader: Iterable, optimizer: torch.optim.Optimizer,
                    device: torch.device, epoch: int, loss_scaler,
                    log_writer=None,
                    args=None):

    for data_iter_step, (samples, _) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):

        # we use a per iteration (instead of per epoch) lr scheduler
        if data_iter_step % accum_iter == 0:
            lr_sched.adjust_learning_rate(optimizer, data_iter_step / len(data_loader) + epoch, args)
```
:::

#### Solution
- Add is_resume to the condition in front of lr_sched.adjust_learning_rate
- When the resume arguments are set in the terminal, the lr is updated directly from the current epoch
    `python main_pretrain.py \`
    ` --resume output_dir/checkpoint-xxx.pth \`
    ` --start_epoch xxx \ `
- engine_pretrain.py

:::spoiler
```python=
# Excerpt from train_one_epoch() in engine_pretrain.py
print(epoch, args.start_epoch)
# True only during the first epoch after resuming
is_resume = epoch == args.start_epoch

for data_iter_step, (samples, _) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):

    # we use a per iteration (instead of per epoch) lr scheduler
    if (data_iter_step % accum_iter == 0) or is_resume:
        lr_sched.adjust_learning_rate(optimizer, data_iter_step / len(data_loader) + epoch, args)
```
:::

### 5. Cannot import container_abcs
- [x] Resolved?
#### Error message
- `cannot import name ‘container_abcs‘ from ‘torch._six`
##### Suspected cause
- container_abcs was removed after version 1.8.
#### Solution
- [cannot import name ‘container_abcs‘ from ‘torch._six‘错误的解决方法（一般升级pytorch1.9后出现）](https://blog.csdn.net/qq_45281807/article/details/121843592?fbclid=IwAR0_SxTZRR5itMrJMHK6KKC2prVqT7gXRgqt_hGUKMRY0e77cO5IhFVGs2A)
- Edit the file inside the timm package
    - The exact file path is the one shown in the error message

![](https://hackmd.io/_uploads/S1fuTMIe9.png)
- edit `vim ... /timm/models/layers/helpers.py`
```python=
import torch

# pick the import that matches the installed PyTorch version
TORCH_MAJOR = int(torch.__version__.split('.')[0])
TORCH_MINOR = int(torch.__version__.split('.')[1])

if TORCH_MAJOR == 1 and TORCH_MINOR < 8:
    from torch._six import container_abcs
else:
    import collections.abc as container_abcs
```

## [Linear Probing]
#### Models used
- ImageNet pretrained model
- Ebird pretrained model

### 1. AttributeError when running main_linprobe.py
##### Run on a single machine with a single GPU; the terminal command is:
:::spoiler
```bash=
python main_linprobe.py \
    --batch_size 1024 \
    --model vit_base_patch16 --global_pool \
    --nb_classes 1486 \
    --finetune checkpoint-1312.pth \
    --epochs 90 \
    --blr 0.1 \
    --weight_decay 0.0 \
    --data_path '../ebird/iNaturalist_2021' \
    --logfile 'log_EbirdPretrain.txt' \
    --num_workers 4
```
:::

- [x] Resolved?
#### Error message
```bash=
File "/home/ess/AI_projects/yunghui/mae/util/crop.py", line 24, in get_params
    width, height = F._get_image_size(img)
AttributeError: module 'torchvision.transforms.functional' has no attribute '_get_image_size'
```
##### Suspected cause
- torchvision version problem
#### Solution
- edit `mae/util/crop.py`
```python=
def get_params(img, scale, ratio):
    try:
        # older torchvision: private helper
        width, height = F._get_image_size(img)
    except AttributeError:
        # newer torchvision: the helper was renamed to a public function
        width, height = F.get_image_size(img)
    # ... rest of the method unchanged
```

### 2. "killed" message during training, process terminated
- [x] Resolved?
#### Error message
- `killed`

![](https://hackmd.io/_uploads/S1xvEl4E5.png)
##### Suspected cause
- Insufficient system memory
    - Check the killed process with `dmesg -T | egrep -i 'killed process'`
        - `[Wed Apr 13 14:47:00 2022] Out of memory: Killed process 268067 (python) total-vm:17002212kB, anon-rss:5095752kB, file-rss:66060kB, shmem-rss:171536kB, UID:1000 pgtables:11020kB oom_score_adj:0`
- The Linux OOM killer (Out Of Memory killer)
    - Kicks in when system memory is exhausted and selectively kills the process using the most memory in order to free memory
#### Solution
- Keeping available memory > swap space makes it less likely that the system runs out of memory and the process gets killed
    - Check with `free -m`
- Reduce the effective batch size
- Lower the num_workers setting
- Adjust the combination of batch_size and accum_iter

### 3. Abnormal training curves
- [x] Resolved?
#### Error message
- Inspecting the training curves
    - train_loss looks normal
    - test_loss keeps increasing
    - acc1 and acc5 drop slightly towards the end

![](https://hackmd.io/_uploads/By-oIgVE9.png =900x)
##### Suspected cause
- No anomalies found after checking the training script and the model
- Checked the class labels
    - Found that the train and val class labels are inconsistent
    - datasets.ImageFolder encodes class labels from the sorted folder names; val has a different number of folders than train, so the encodings do not match
    - ![](https://hackmd.io/_uploads/HyoQ_xN45.png =300x)
#### Solution
- Solved by re-uploading the val folder

***

# References
## Optimization methods
### LARS (Layer-wise Adaptive Rate Scaling)
- [Large Batch Training of Convolutional Networks](https://arxiv.org/abs/1708.03888)
- [优化器方法-LARS(Layer-wise Adaptive Rate Scaling)](https://www.jianshu.com/p/e430620d3acf)
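
As a rough illustration of the layer-wise scaling idea behind LARS, here is a minimal pure-Python sketch. It is not the optimizer shipped with the MAE repository: the function name `lars_scaled_update`, the default `trust_coefficient=0.001` and the choice to fold weight decay into the gradient before computing the ratio are assumptions made for this example only.

```python
import math


def lars_scaled_update(weights, grads, weight_decay=0.0, trust_coefficient=0.001):
    """Rescale one layer's gradient by a LARS-style trust ratio (illustrative helper)."""
    # Assumption: weight decay is added to the gradient before the ratio is computed
    update = [g + weight_decay * w for w, g in zip(weights, grads)]
    weight_norm = math.sqrt(sum(w * w for w in weights))
    update_norm = math.sqrt(sum(u * u for u in update))
    # Trust ratio = coefficient * ||w|| / ||update||; fall back to 1 if a norm is zero
    if weight_norm > 0.0 and update_norm > 0.0:
        ratio = trust_coefficient * weight_norm / update_norm
    else:
        ratio = 1.0
    return [ratio * u for u in update]


# A layer with large weights and a unit-norm gradient gets a step proportional to ||w||
scaled = lars_scaled_update([3.0, 4.0], [0.6, 0.8])
print(scaled)
```

Because every layer's step is tied to the size of its own weights, a very large base lr from the linear scaling rule is less likely to blow up individual layers, which is the motivation usually given for using LARS with large-batch training.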