Tricks to Speed Up PyTorch Model Training
===

![](https://i.imgur.com/p3NDiWi.gif)

[TOC]

---

# 一. Using a learning rate schedule

When it comes to adjusting the learning rate, most people first think of adaptive learning rate methods such as ++Adagrad++, ++RMSProp++, ++Momentum++, ++Adam++ and ++AdamW++. Those are familiar to everyone, yet **Learning Rate Scheduling** is a trick that is easy to overlook (guilty as charged).

The learning rate is one of the most important hyperparameters in training: a suitable learning rate helps the model converge quickly. A common strategy is to start with a relatively large learning rate and gradually lower it as training progresses. The questions are *when* to adjust the learning rate and *how* to adjust it so the model converges faster. Below is a short tour of several schedulers PyTorch provides.

### **1. lr_scheduler.LambdaLR**
> torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1, verbose=False)

**Description:**
&ensp;&ensp; **Sets the learning rate of each parameter group to the initial lr times a given function.**

**Key parameters:**
&ensp;&ensp; (1) **lr_lambda (function or list) – A function which computes a multiplicative factor given an integer parameter epoch, or a list of such functions, one for each group in optimizer.param_groups.**
&ensp;&ensp; (2) **last_epoch (int) – The index of last epoch. Default: -1.**

==**Example**==
```python=
# Two example lambdas; only lambda1 is used below.
optimizer = optim.SGD(net.parameters(), lr = 1e-5)
lambda1 = lambda epoch: 0.2 * epoch if epoch > 5 else 1
lambda2 = lambda epoch: 0.2 ** epoch
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda = lambda1)

for epoch in range(10):
    scheduler.step()
    print(epoch, scheduler.get_last_lr()[0])
```
==**Result**==
![](https://i.imgur.com/g2Ec9Ki.png)

### **2. lr_scheduler.MultiStepLR**
> torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1, verbose=False)

**Description:**
&ensp;&ensp; **Decays the learning rate of each parameter group by gamma once the number of epoch reaches one of the milestones.**

**Key parameters:**
&ensp;&ensp; (1) **milestones (list) – List of epoch indices. Must be increasing.**
&ensp;&ensp; (2) **gamma (float) – Multiplicative factor of learning rate decay. Default: 0.1.**
&ensp;&ensp; (3) **last_epoch (int) – The index of last epoch. Default: -1.**

==**Example**==
```python=
optimizer = optim.SGD(net.parameters(), lr = 1e-5)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5,10,15], gamma=0.1)

for epoch in range(1, 20 + 1):
    scheduler.step()
    print(epoch, scheduler.get_last_lr()[0])
```
==**Result**==
![](https://i.imgur.com/8CpNs3l.png)

### **3. lr_scheduler.ExponentialLR**
> torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1, verbose=False)

**Description:**
&ensp;&ensp; **Decays the learning rate of each parameter group by gamma every epoch.**

**Key parameters:**
&ensp;&ensp; (1) **gamma (float) – Multiplicative factor of learning rate decay.**
&ensp;&ensp; (2) **last_epoch (int) – The index of last epoch. Default: -1.**

==**Example**==
```python=
optimizer = optim.SGD(net.parameters(), lr = 1e-5)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma = 0.5)

for epoch in range(1, 20 + 1):
    scheduler.step()
    print(epoch, scheduler.get_last_lr()[0])
```
==**Result**==
![](https://i.imgur.com/bCJuRcK.png)

### **4. lr_scheduler.MultiplicativeLR**
> torch.optim.lr_scheduler.MultiplicativeLR(optimizer, lr_lambda, last_epoch=-1, verbose=False)

**Description:**
&ensp;&ensp; **Multiply the learning rate of each parameter group by the factor given in the specified function.**

**Key parameters:**
&ensp;&ensp; (1) **lr_lambda (function or list) – A function which computes a multiplicative factor given an integer parameter epoch, or a list of such functions, one for each group in optimizer.param_groups.**
&ensp;&ensp; (2) **last_epoch (int) – The index of last epoch. Default: -1.**
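**Note:** `MultiplicativeLR` looks almost identical to `LambdaLR`, so it is worth spelling out the difference before the example: `LambdaLR` multiplies the *initial* learning rate by the lambda's return value, while `MultiplicativeLR` multiplies the *current* learning rate by it. Below is a minimal sketch of that difference; the toy `nn.Linear` model and the constant factor 0.5 are placeholders chosen purely for illustration.

```python=
from torch import nn, optim

net = nn.Linear(10, 1)   # toy model, only needed so the optimizers have parameters

opt1 = optim.SGD(net.parameters(), lr = 1e-2)
opt2 = optim.SGD(net.parameters(), lr = 1e-2)

# LambdaLR: lr = initial_lr * lambda(epoch)  ->  stays at 1e-2 * 0.5
sched1 = optim.lr_scheduler.LambdaLR(opt1, lr_lambda = lambda epoch: 0.5)
# MultiplicativeLR: lr = previous_lr * lambda(epoch)  ->  halves every epoch
sched2 = optim.lr_scheduler.MultiplicativeLR(opt2, lr_lambda = lambda epoch: 0.5)

for epoch in range(3):
    sched1.step()
    sched2.step()
    print(epoch, sched1.get_last_lr()[0], sched2.get_last_lr()[0])
    # 0 0.005 0.005
    # 1 0.005 0.0025
    # 2 0.005 0.00125
```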
==**Example**==
```python=
optimizer = optim.SGD(net.parameters(), lr = 1e-5)
lambda1 = lambda epoch: 0.2 if epoch % 5 == 0 else 1
lambda2 = lambda epoch: 0.2
scheduler = optim.lr_scheduler.MultiplicativeLR(optimizer, lr_lambda = lambda2)

for epoch in range(1, 20 + 1):
    scheduler.step()
    print(epoch, scheduler.get_last_lr()[0])
```
==**Result**==
![](https://i.imgur.com/2jZohYc.png)

### **5. lr_scheduler.ReduceLROnPlateau (currently the only lr_scheduler that is not driven by the epoch count)**
> torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08, verbose=False)

**Description:**
&ensp;&ensp; **_Reduce learning rate when a metric has stopped improving._ Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. This scheduler reads a metrics quantity and if no improvement is seen for a ‘patience’ number of epochs, the learning rate is reduced.**

**Key parameters:**
&ensp;&ensp; (1) **factor (float) – Factor by which the learning rate will be reduced. new_lr = lr * factor. Default: 0.1.**
&ensp;&ensp; (2) **patience (int) – Number of epochs with no improvement after which learning rate will be reduced.**
&ensp;&ensp; (3) **threshold (float) – Threshold for measuring the new optimum, to only focus on significant changes. Default: 1e-4.**
&ensp;&ensp; (4) **threshold_mode (str) – One of rel, abs.**
&ensp;&ensp; (5) **min_lr (float or list) – A scalar or a list of scalars. A lower bound on the learning rate of all param groups or each group respectively. Default: 0.**

<br>

**What is a plateau?**
&ensp;&ensp; **The figure below plots the value of the loss function: on a plateau the gradient is close to 0 even though the loss is not at a local or global minimum, which means training can get stuck there without reaching a better solution.**
&ensp;&ensp; **The limitation of gradient descent is that once the gradient is close to 0 the parameters barely update any further, even when we are not at a local or global minimum. Plateaus and saddle points are the common cases, and adaptive learning rates are the usual way to get past them.**

![](https://i.imgur.com/zHKK3BF.png)
&ensp;&ensp; **Source: Gradient Descent [pdf](http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/Gradient%20Descent%20(v2).pdf)**

==**Example**==
```python=
import torch
from torchvision import models as model   # assuming torchvision models, aliased as `model`
from torch.optim.lr_scheduler import ReduceLROnPlateau

initial_lr = 0.1
net = model.resnet101()
optimizer = torch.optim.Adam(net.parameters(), lr = initial_lr)
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=2)

for epoch in range(1, 15):
    optimizer.zero_grad()
    optimizer.step()
    print(f"{epoch} {optimizer.param_groups[0]['lr']}")
    # A constant metric is passed here just to show the LR dropping;
    # in practice you would pass a real value, e.g. scheduler.step(val_loss).
    scheduler.step(1)
```
==**Result**==
![](https://i.imgur.com/JmTlQcG.png)

### **6. More learning rate schedulers: [[PyTorch Doc]](https://pytorch.org/docs/stable/optim.html)**

**PyTorch's official warning**
> **Prior to PyTorch 1.1.0, the learning rate scheduler was expected to be called before the optimizer's update; version 1.1.0 changed this behavior in a BC-breaking way. If you call the learning rate scheduler (scheduler.step()) before the optimizer's update (optimizer.step()), the first value of the learning rate schedule will be skipped. (~~Then again, who is still running a version older than 1.1.0 these days?~~)**

:zap: **Sample code**
> The PyTorch documentation steps the learning rate scheduler once at the end of every epoch, so following that convention, here is a fairly general version.

```python=
import torch
from torch import optim

def train(model, train_loader, epoch, optimizer, loss_fn, device = 'cpu', scheduler = None):
    model.train()
    if not scheduler:
        scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr = 1e-5, max_lr = 1e-2)

    for _ in range(1, epoch + 1):
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = loss_fn(output, target)
            loss.backward()
            optimizer.step()
        # Step the scheduler once per epoch, after the optimizer updates.
        if scheduler.__module__ == 'torch.optim.lr_scheduler':
            scheduler.step()
```
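One caveat about the generic loop above: some schedulers, such as `CyclicLR` and `OneCycleLR`, are designed to be stepped after every **batch** rather than once per epoch. Below is a sketch of that per-batch variant; it reuses the same (hypothetical) `model`, `train_loader`, `optimizer`, `loss_fn` and `epoch` as the function above, and assumes an SGD-style optimizer because `CyclicLR` also cycles momentum by default.

```python=
scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr = 1e-5, max_lr = 1e-2)

for _ in range(1, epoch + 1):
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
        scheduler.step()   # stepped once per batch for cyclical schedules
```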
---

# 二. Use multiple workers and pinned memory in DataLoader

When we use ```torch.utils.data.DataLoader```, num_workers defaults to 0 and pin_memory defaults to False. Before changing either of them, make sure your hardware can keep up (enough CPU cores and host RAM).

1. **pin_memory**
&ensp;&ensp; **When the DataLoader's pin_memory argument is True, the DataLoader copies the fetched tensors into pinned (page-locked) host memory, which allows faster transfers to CUDA-capable GPUs. The contents of pinned memory are never swapped out to virtual memory under any circumstances, so this saves the time that would otherwise be spent swapping with virtual memory.** [[More]](https://pytorch.org/docs/stable/data.html#memory-pinning)
2. **num_workers**
&ensp;&ensp; **This controls how many sub-processes are used to load the data; increasing num_workers also increases CPU RAM consumption.**

:::danger
:fire: Szymon Migacz, a software engineer for CUDA deep learning algorithms at NVIDIA, reported roughly a 2x speed-up for a single epoch just by using four workers and pinned memory. [Source](https://nvlabs.github.io/eccv2020-mixed-precision-tutorial/files/szymon_migacz-pytorch-performance-tuning-guide.pdf)
![](https://i.imgur.com/hOfY7lC.png)
:::

==**Example**==
```python=
cuda_kwargs = {
    'num_workers' : 1,
    'pin_memory'  : True,
    'batch_size'  : 64,
    'shuffle'     : True
}

train_loader = torch.utils.data.DataLoader(dataset, **cuda_kwargs)
```

---

# 三. ENABLE cuDNN AUTOTUNER

**Description:**
&ensp;&ensp; **First, a rough idea of what cuDNN is: cuDNN is NVIDIA's GPU-accelerated library of primitives for deep neural networks, with low-level optimizations for convolution, pooling and several activation functions. cuDNN is also supported by today's mainstream deep learning frameworks, such as PyTorch, Caffe, MXNet, MATLAB and more.**

**torch.backends.cudnn.benchmark:**
&ensp;&ensp; **So what does this have to do with cudnn.benchmark? When this flag is set to True, PyTorch calls [cudnnFindConvolutionForwardAlgorithmEx](https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#cudnnFindConvolutionForwardAlgorithmEx) (the link goes to NVIDIA's official API docs) to pre-tune the convolution and pooling layers in your model: for each convolution layer it benchmarks every convolution algorithm cuDNN provides and picks the fastest one, so performance ==may== improve afterwards.**

&ensp;&ensp; Why only "may" improve rather than "will"? PyTorch explains in its official GitHub docs [[Link]](https://github.com/pytorch/pytorch/blob/31ee5d8d8b249fd0dbd60e1d9c171ec75b597672/docs/source/notes/randomness.rst): **"Due to benchmarking noise and different hardware, the benchmark may select different algorithms on subsequent runs, even on the same machine"**.

**When to use:**
&ensp;&ensp; **==Only enable this when your model actually uses convolution layers==, and only when the architecture is fixed (not dynamically changing) and the input size (batch size, image size and so on) stays constant. If the convolutional part of the network keeps changing during training, the autotuner has to re-benchmark over and over and ends up costing more time instead of saving it.**

==**Example**==
```python=
if torch.cuda.is_available():
    device = torch.device('cuda')
    torch.backends.cudnn.benchmark = True
else:
    device = torch.device('cpu')
```

---

# 四. Disable bias for convolution directly followed by a Batch Norm

**Description:**
&ensp;&ensp; **Adding a bias to any channel of a feature map only shifts that channel's mean. When your model has a BatchNorm2d right after a Conv2d, the BatchNorm removes the per-channel mean anyway, so there is no point in giving the Conv2d a bias (the same goes for Conv1d and Conv3d). The figure below illustrates this.**

![](https://i.imgur.com/hrqF6Sz.png)

**The Batch Normalization algorithm:**

![](https://i.imgur.com/oS91wgc.png)

_S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015._

==**Example**==
```python=
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, inplanes: int = 1, planes: int = 1):
        super(Net, self).__init__()
        # kernel_size is required; bias is disabled because the following
        # BatchNorm2d removes the per-channel mean anyway.
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size = 3, padding = 1, bias = False)
        self.bn1 = nn.BatchNorm2d(planes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(x)
        out = self.bn1(out)
        return out
```
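As a quick numerical sanity check of the claim above, here is a small standalone sketch (the layer sizes and input shape are arbitrary): in training mode BatchNorm subtracts the per-channel batch mean, so a convolution with a bias and the same convolution without one give identical outputs once they pass through the BatchNorm.

```python=
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 16, 16)

conv_bias   = nn.Conv2d(3, 4, kernel_size = 3, padding = 1, bias = True)
conv_nobias = nn.Conv2d(3, 4, kernel_size = 3, padding = 1, bias = False)
conv_nobias.weight.data.copy_(conv_bias.weight.data)   # same weights, bias dropped
bn = nn.BatchNorm2d(4)

y1 = bn(conv_bias(x))      # conv with bias -> BN (training mode, batch statistics)
y2 = bn(conv_nobias(x))    # same conv without bias -> BN
print(torch.allclose(y1, y2, atol = 1e-5))   # True: BN removes the per-channel shift anyway
```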
---

# 五. Automatic mixed precision (AMP) for faster training on NVIDIA GPUs

**Description:**
&ensp;&ensp; **By default, most deep learning frameworks (PyTorch included) train with 32-bit floating point (FP32) arithmetic, but that much precision is not always necessary. In 2017, NVIDIA researchers developed a mixed precision training method [[Source]](https://developer.nvidia.com/blog/mixed-precision-training-deep-neural-networks/), later published at ICLR 2018 [[PAPER](https://arxiv.org/pdf/1710.03740.pdf)]: the network is trained with a mix of single precision (FP32) and half precision (FP16), reaching the same accuracy as pure FP32 training (with the same hyperparameters) in less time, with additional benefits on NVIDIA GPUs:**

* Shorter training time
* Lower memory requirements
* Room for larger batch sizes, larger models or larger inputs

&ensp;&ensp; **In PyTorch, mixed precision (AMP) training means using ```torch.cuda.amp.autocast()``` and ```torch.cuda.amp.GradScaler()``` on CUDA devices with NVIDIA Tensor Core support.**

<br>

**How to use AMP in PyTorch:**

1. **Adding autocast**

In these regions, CUDA ops run in a dtype chosen by autocast to improve performance while maintaining accuracy.

```python=
# Source: PyTorch Documents
for epoch in range(0): # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        # Runs the forward pass under autocast.
        with torch.cuda.amp.autocast():
            output = net(input)
            # output is float16 because linear layers autocast to float16.
            assert output.dtype is torch.float16

            loss = loss_fn(output, target)
            # loss is float32 because mse_loss layers autocast to float32.
            assert loss.dtype is torch.float32

        # Exits autocast before backward().
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        loss.backward()
        opt.step()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
```

**With autocast, PyTorch automatically decides which layers or functions are cast to torch.HalfTensor (float16). For the list of ops that are eligible for float16, see the table below ([Source](https://pytorch.org/docs/stable/amp.html#autocast-op-reference)).**

![](https://i.imgur.com/tdptMJY.png)

<br>

==**Important!!**==&ensp;&ensp; **Autocast should only wrap the forward pass (including the loss computation); the backward pass runs outside the autocast context, and backward ops automatically use the same dtype that autocast chose for the corresponding forward ops.**

2. **Adding GradScaler**

GradScaler helps perform the steps of gradient scaling conveniently. Gradient scaling improves convergence for networks with float16 gradients by minimizing gradient underflow.

```python=
# Source: PyTorch Documents
# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # Scales loss. Calls backward() on scaled loss to create scaled gradients.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
```

&ensp;&ensp; **Why is GradScaler still needed when autocast is already in use? Because some forward-pass operations run in float16, the floating point precision may be insufficient to represent the gradients, which can underflow during backpropagation. To prevent this, ```scaler.scale(loss).backward()``` multiplies the loss by a scale factor, so every gradient produced during backpropagation is scaled by the same factor. A well-chosen scale factor keeps very small gradients representable instead of flushing them to zero (largely a limitation of float16 precision, as shown below).**

**The ICLR 2018 paper "Mixed Precision Training" [Paper](https://openreview.net/pdf?id=r1gs9JgRZ) found that using fp16 everywhere in a model "swallows" every gradient update smaller than 2^-24, which accounted for roughly 5% of all gradient updates in the network they studied. In other words, about 5% of the model's weights would never be updated, which can keep the model from training at all.**

![](https://i.imgur.com/CYOjEEz.png)
_P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu. Mixed precision training._
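One handy follow-up pattern, sketched under the assumption that `model`, `optimizer`, `data` and `loss_fn` already exist as in the snippets above: both `autocast` and `GradScaler` accept an `enabled` flag, so the same loop can silently fall back to plain FP32 when no CUDA device is available or AMP is not wanted.

```python=
use_amp = torch.cuda.is_available()              # enable AMP only when a CUDA device is present

scaler = torch.cuda.amp.GradScaler(enabled = use_amp)

for input, target in data:
    optimizer.zero_grad(set_to_none = True)
    with torch.cuda.amp.autocast(enabled = use_amp):
        output = model(input)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()                # behaves like a plain backward() when disabled
    scaler.step(optimizer)
    scaler.update()
```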
**NVIDIA's AMP benchmark numbers ([GitHub](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/ConvNets/resnet50v1.5#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)):**
1. **Training**
![](https://i.imgur.com/ArQPuvd.png)
2. **Inference**
![](https://i.imgur.com/VHDOliM.png)

**Related resources:**
1. **ECCV 2020 Tutorial on Accelerating Computer Vision with Mixed Precision (NVLAB): [LINK](https://nvlabs.github.io/eccv2020-mixed-precision-tutorial/)**
2. **PyTorch AMP Tutorial: [Colab](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/19350f3746283d2a6e32a8e698f92dc4/amp_recipe.ipynb#scrollTo=2V5LTdiLzhk9)**

---

# 六. Increase Batch Sizes

**Increase the batch size to make better use of GPU memory.**
> **This pairs well with AMP from the previous section: AMP lowers the GPU memory requirement, leaving room for a larger batch size.**

---

# 七. Accumulating gradients

**Description:**
&ensp;&ensp; **Suppose we want to train with a batch size of 256, but the GPU memory only fits batches of 32. What now? ~~If money is no object, just upgrade the GPU.~~ If upgrading is not an option, we can accumulate gradients over 8 (= 256/32) batches without running the optimization step "optimizer.step()", simply letting loss.backward() keep adding to the computed gradients. Once the gradients of 256 inputs have been accumulated, we run the optimization step 'optimizer.step()', clear the accumulated gradients, and start the cycle again.**

**Pros:**
&ensp;&ensp; **Large-batch training without multiple GPUs: even a single GPU can train with a large effective batch size this way.**

**Cons:**
&ensp;&ensp; **It takes more time than training in parallel on multiple GPUs (obviously).**

**The basic training function, where each batch is trained like this:**
```python=
def train(model, dataloader, optimizer, epoch, accumulation_steps = 8):
    for eph_idx in range(1, epoch + 1):
        for index, (data, target) in enumerate(dataloader):
            optimizer.zero_grad()                          # reset gradients
            output = model(data)
            loss = torch.nn.MSELoss()(output, target)
            loss.backward()
            optimizer.step()                               # update parameters
```
1. **optimizer.zero_grad(): clear the gradients**
2. **Compute the loss: given the output and target, evaluate the loss function**
3. **loss.backward(): run backpropagation and compute the current gradients**
4. **optimizer.step(): update the model parameters based on the gradients**

<br>

**With gradient accumulation, each batch is trained like this:**
```python=
def train(model, dataloader, optimizer, epoch, accumulation_steps = 8):
    for eph_idx in range(1, epoch + 1):
        for index, (data, target) in enumerate(dataloader):
            output = model(data)
            loss = torch.nn.MSELoss()(output, target)
            loss = loss / accumulation_steps               # normalize the loss for accumulation
            loss.backward()                                # backpropagation
            if (index + 1) % accumulation_steps == 0:
                optimizer.step()                           # update parameters
                optimizer.zero_grad()                      # reset gradients
```
1. **Compute the loss: given the output and target, evaluate the loss function**
2. **loss.backward(): run backpropagation and compute the current gradients**
3. **Repeat steps 1 and 2 without clearing the gradients, letting them accumulate on top of the previous ones**
4. **Once the gradients have been accumulated for a set number of steps (accumulation_steps), call ```optimizer.step()``` to update the model parameters from the accumulated gradients, then call ```optimizer.zero_grad()``` to clear them**

<br>

:fire: ==**For a few examples of gradient accumulation used successfully in practice, see NVIDIA's [DeepLearningExamples](https://github.com/NVIDIA/DeepLearningExamples/blob/b1ce24a54ff25fc21b8a281abbbbb5d2d69664e2/PyTorch/LanguageModeling/BERT/run_glue.py#L491~L503) and [More](https://github.com/NVIDIA/DeepLearningExamples/search?q=gradient+accumulation)**==
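Gradient accumulation also combines naturally with the AMP recipe from the earlier section. Here is a sketch under the same assumptions as the functions above: the scaled losses are accumulated with ```scaler.scale(loss).backward()```, and the scaler only steps and updates every `accumulation_steps` batches.

```python=
scaler = torch.cuda.amp.GradScaler()

def train(model, dataloader, optimizer, epoch, accumulation_steps = 8):
    for eph_idx in range(1, epoch + 1):
        for index, (data, target) in enumerate(dataloader):
            with torch.cuda.amp.autocast():
                output = model(data)
                loss = torch.nn.MSELoss()(output, target)
                loss = loss / accumulation_steps           # normalize the accumulated loss
            scaler.scale(loss).backward()                  # accumulate scaled gradients
            if (index + 1) % accumulation_steps == 0:
                scaler.step(optimizer)                     # unscale, then optimizer.step()
                scaler.update()
                optimizer.zero_grad()                      # reset the accumulated gradients
```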
---

# 八. Training a model on multiple GPUs

&ensp;&ensp; **To train a PyTorch model on a server with multiple GPUs, the easiest entry point is torch.nn.DataParallel().**

**What nn.DataParallel actually does:**
&ensp;&ensp; **In the forward pass, the input batch is split into several chunks (call them "sub-batches") that are sent to different devices (GPUs) for computation, while the model module is replicated onto each device. In other words, the input batch is divided evenly among the devices, but the model is copied to every GPU, and each replica only has to process its own sub-batch; just make sure your batch size is larger than the number of GPUs you use. During backpropagation, the gradients from each sub-batch are accumulated into the original module. ([Official explanation](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html#torch.nn.DataParallel))**

&ensp;&ensp; **The official definition is ```CLASS torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)```: module is the model you built, device_ids are the GPU IDs used for training, and output_device is the device the outputs are gathered on. The last argument is usually omitted and defaults to device_ids[0], which is why the first GPU's memory usage is noticeably higher than the others' (on top of its share of the forward pass, it also accumulates the gradients).**

==**Example**==
```python=
device_ids = [0, 1, 2, 3]
parallel_model = torch.nn.DataParallel(model, device_ids = device_ids)  # Encapsulate the model

predictions = parallel_model(inputs)          # Forward pass on multi-GPUs
loss = loss_function(predictions, labels)     # Compute loss function
loss.mean().backward()                        # Average GPU-losses + backward pass
optimizer.step()                              # Optimizer step
predictions = parallel_model(inputs)          # Forward pass with new parameters
```
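To round this off, here is a minimal sketch of how the wrapper is typically wired up in practice (it assumes `model` and `inputs` are already defined; everything else is illustrative): the module and the input batch should live on `device_ids[0]` (cuda:0 by default) before DataParallel scatters the batch across the GPUs.

```python=
import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.to(device)                       # DataParallel expects the module on device_ids[0]

if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)       # all visible GPUs are used by default

outputs = model(inputs.to(device))             # the batch is scattered across the GPUs internally
```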