Tips for Accelerating PyTorch Model Training
===
![](https://i.imgur.com/p3NDiWi.gif)
[TOC]
---
# 1. Using a learning rate schedule
When it comes to adjusting the learning rate, most people immediately think of adaptive learning rate methods such as ++Adagrad++, ++RMSProp++, ++Momentum++, ++Adam++, and ++AdamW++. These are familiar to everyone, yet the technique of **learning rate scheduling** is often overlooked (myself included).
The learning rate is one of the most important hyperparameters during training; a suitable learning rate helps the model converge quickly. A common strategy is to start with a relatively large learning rate and gradually lower it as training progresses. The question then becomes when and how to adjust the learning rate so that the model converges faster. The following briefly introduces several schedulers provided by PyTorch.
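Before looking at the individual schedulers, it helps to see the general pattern they all share: build the scheduler on top of an optimizer and call `scheduler.step()` once per epoch, after the optimizer has updated the parameters. The sketch below is only an illustration of that pattern; it assumes a `net` model and a hypothetical `train_one_epoch` helper that are not defined here.
```python=
import torch
from torch import optim

optimizer = optim.SGD(net.parameters(), lr = 1e-2)
# StepLR: multiply the learning rate by gamma every `step_size` epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size = 10, gamma = 0.1)

for epoch in range(30):
    train_one_epoch(net, optimizer)   # hypothetical helper: forward/backward + optimizer.step() per batch
    scheduler.step()                  # adjust the learning rate once per epoch
```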
### **1. lr_scheduler.LambdaLR**
> torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1, verbose=False)
**Description:**
   **Sets the learning rate of each parameter group to the initial lr times a given function.**
**Key parameters:**
   (1) **lr_lambda (function or list) – A function which computes a multiplicative factor given an integer parameter epoch, or a list of such functions, one for each group in optimizer.param_groups.**
   (2) **last_epoch (int) – The index of last epoch. Default: -1.**
==**Example**==
```python=
import torch
from torch import optim

# `net` can be any nn.Module; the optimizer below has a single parameter group.
optimizer = optim.SGD(net.parameters(), lr = 1e-5)
lambda1 = lambda epoch: 0.2 * epoch if epoch > 5 else 1   # keep the base lr until epoch 5, then scale by 0.2 * epoch
lambda2 = lambda epoch: 0.2 ** epoch                      # alternative: exponential-style decay
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda = lambda1)
for epoch in range(10):
    scheduler.step()
    print(epoch, scheduler.get_last_lr()[0])
```
==**Result**==
![](https://i.imgur.com/g2Ec9Ki.png)
### **2. lr_scheduler.MultiStepLR**
> torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1, verbose=False)
**Description:**
   **Decays the learning rate of each parameter group by gamma once the number of epoch reaches one of the milestones.**
**Key parameters:**
   (1) **milestones (list) – List of epoch indices. Must be increasing.**
   (2) **gamma (float) – Multiplicative factor of learning rate decay. Default: 0.1.**
   (3) **last_epoch (int) – The index of last epoch. Default: -1.**
==**Example**==
```python=
optimizer = optim.SGD(net.parameters(), lr = 1e-5)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5,10,15], gamma=0.1)
for epoch in range(1, 20 + 1):
    scheduler.step()
    print(epoch, scheduler.get_last_lr()[0])
```
==**Result**==
![](https://i.imgur.com/8CpNs3l.png)
### **3. lr_scheduler.ExponentialLR**
> torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1, verbose=False)
**Description:**
   **Decays the learning rate of each parameter group by gamma every epoch.**
**Key parameters:**
   (1) **gamma (float) – Multiplicative factor of learning rate decay.**
   (2) **last_epoch (int) – The index of last epoch. Default: -1.**
==**Example**==
```python=
optimizer = optim.SGD(net.parameters(), lr = 1e-5)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma = 0.5)
for epoch in range(1, 20 + 1):
    scheduler.step()
    print(epoch, scheduler.get_last_lr()[0])
```
==**Result**==
![](https://i.imgur.com/bCJuRcK.png)
### **4. lr_scheduler.MultiplicativeLR**
> torch.optim.lr_scheduler.MultiplicativeLR(optimizer, lr_lambda, last_epoch=-1, verbose=False)
**Description:**
   **Multiply the learning rate of each parameter group by the factor given in the specified function.**
**Key parameters:**
   (1) **lr_lambda (function or list) – A function which computes a multiplicative factor given an integer parameter epoch, or a list of such functions, one for each group in optimizer.param_groups.**
   (2) **last_epoch (int) – The index of last epoch. Default: -1.**
==**Example**==
```python=
optimizer = optim.SGD(net.parameters(), lr = 1e-5)
lambda1 = lambda epoch: 0.2 if epoch % 5 == 0 else 1
lambda2 = lambda epoch: 0.2
scheduler = optim.lr_scheduler.MultiplicativeLR(optimizer, lr_lambda = lambda2)
for epoch in range(1, 20 + 1):
    scheduler.step()
    print(epoch, scheduler.get_last_lr()[0])
```
==**Result**==
![](https://i.imgur.com/2jZohYc.png)
### **5. lr_scheduler.ReduceLROnPlateau (currently the only lr_scheduler here that is not driven by the epoch count)**
> torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08, verbose=False)
**Description:**
   **_Reduce learning rate when a metric has stopped improving._ Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. This scheduler reads a metrics quantity and if no improvement is seen for a ‘patience’ number of epochs, the learning rate is reduced.**
**Key parameters:**
   (1) **factor (float) – Factor by which the learning rate will be reduced. new_lr = lr * factor. Default: 0.1.**
   (2) **patience (int) – Number of epochs with no improvement after which learning rate will be reduced.**
   (3) **threshold (float) – Threshold for measuring the new optimum, to only focus on significant changes. Default: 1e-4.**
   (4) **threshold_mode (str) – One of rel, abs. Default: 'rel'.**
   (5) **min_lr (float or list) – A scalar or a list of scalars. A lower bound on the learning rate of all param groups or each group respectively. Default: 0.**
<br>
**What is a plateau?**
   **The figure below plots the loss surface: on a plateau the gradient is close to 0, yet the loss is not at a local or global minimum, so training can get stuck there without reaching a better solution.**
   **The limitation of gradient descent is that once the gradient is close to 0 the parameters essentially stop updating, even if the current point is not a local or global minimum. Plateaus and saddle points are the common cases, and adaptive learning rates are usually used to get past them.**
![](https://i.imgur.com/zHKK3BF.png)
                        **Source: Gradient Descent [pdf](http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/Gradient%20Descent%20(v2).pdf)**
==**Example**==
```python=
import torch
from torchvision import models
from torch.optim.lr_scheduler import ReduceLROnPlateau

initial_lr = 0.1
net = models.resnet101()
optimizer = torch.optim.Adam(net.parameters(), lr = initial_lr)
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=2)
for epoch in range(1, 15):
    optimizer.zero_grad()
    optimizer.step()
    print(f"{epoch} {optimizer.param_groups[0]['lr']}")
    scheduler.step(1)   # a constant metric means "no improvement", so the lr keeps being reduced
```
==**Result**==
![](https://i.imgur.com/JmTlQcG.png)
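In real training, the value passed to `scheduler.step()` would be the metric being monitored (for example the validation loss) rather than a constant. A minimal sketch, assuming hypothetical `train_one_epoch` and `validate` helpers:
```python=
for epoch in range(1, num_epochs + 1):
    train_one_epoch(net, optimizer)   # hypothetical training helper
    val_loss = validate(net)          # hypothetical helper returning the validation loss
    scheduler.step(val_loss)          # the lr is reduced only after `patience` epochs without improvement
```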
### **6. More learning rate schedulers: [[PyTorch Doc]](https://pytorch.org/docs/stable/optim.html)**
**The warning from the official PyTorch docs**
> **Prior to PyTorch 1.1.0, the learning rate scheduler was expected to be called before the optimizer's update; version 1.1.0 changed this behavior in a BC-breaking way. If you call the learning rate scheduler (calling scheduler.step()) before the optimizer's update (calling optimizer.step()), the first value of the learning rate schedule will be skipped. (~~though hardly anyone should still be on a version older than 1.1.0~~)**
:zap: **Sample code**
> The PyTorch docs run the learning rate scheduler once at the end of every epoch, so following that idea, here is a fairly general version.
```python=
import torch
from torch import optim
def train(model, train_loader, epoch, optimizer, loss_fn, device = 'cpu', scheduler = None):
    model.train()
    if not scheduler:
        scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr = 1e-5, max_lr = 1e-2)
    for _ in range(1, epoch + 1):
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = loss_fn(output, target)
            loss.backward()
            optimizer.step()
        # Step the scheduler once per epoch, after the optimizer updates.
        if scheduler.__module__ == 'torch.optim.lr_scheduler':
            scheduler.step()
```
---
# 2. Use multiple workers and pinned memory in DataLoader
When using ```torch.utils.data.DataLoader```, the num_workers parameter defaults to 0 and pin_memory defaults to False. Before changing these two parameters, make sure your hardware can afford it.
1. pin_memory
**When the DataLoader's pin_memory argument is True, the DataLoader copies the loaded tensors into pinned (page-locked) host memory, which allows faster transfer to CUDA-enabled GPUs.
Pages in pinned memory are never swapped out under any circumstances, so the transfer avoids the extra staging copy from pageable memory.**
[[More]](https://pytorch.org/docs/stable/data.html#memory-pinning)
2. num_workers
**When loading data you can specify how many subprocesses to use; increasing num_workers also increases CPU RAM consumption.**
:::danger
:fire: Szymon Migacz, a software engineer for CUDA deep learning algorithms at NVIDIA, reported a 2x speedup for a single epoch just by using four workers and pinned memory. [Source](https://nvlabs.github.io/eccv2020-mixed-precision-tutorial/files/szymon_migacz-pytorch-performance-tuning-guide.pdf)
![](https://i.imgur.com/hOfY7lC.png)
:::
==**Example**==
```python=
cuda_kwargs = {
    'num_workers' : 1,
    'pin_memory' : True,
    'batch_size' : 64,
    'shuffle' : True
}
train_loader = torch.utils.data.DataLoader(dataset, **cuda_kwargs)
```
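Pinned memory pays off mainly when the host-to-GPU copies are also made asynchronous with `non_blocking=True`, so the transfer can overlap with computation. A minimal sketch, assuming `device` is a CUDA device:
```python=
for data, target in train_loader:
    # with pin_memory=True, non_blocking=True lets the copy overlap with GPU compute
    data = data.to(device, non_blocking = True)
    target = target.to(device, non_blocking = True)
    # ...forward/backward pass as usual
```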
---
# 3. ENABLE cuDNN AUTOTUNER
**Description:**
   **First, a rough idea of what cuDNN is: cuDNN is NVIDIA's GPU-accelerated library of primitives for deep neural networks, with low-level optimizations for convolution, pooling, and several activation functions. cuDNN is also supported by today's mainstream deep learning frameworks such as PyTorch, Caffe, MXNet, MATLAB, and so on.**
**torch.backends.cudnn.benchmark:**
   **So what does this have to do with cudnn.benchmark? When this flag is set to True, PyTorch calls [cudnnFindConvolutionForwardAlgorithmEx](https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#cudnnFindConvolutionForwardAlgorithmEx) (the link goes to NVIDIA's official API doc). For each convolution (and pooling) layer in the model it benchmarks all the convolution algorithms cuDNN provides and then picks the fastest one, so subsequent performance ==may== improve**.
   Why "may" improve rather than "will"? PyTorch explains this in its notes on GitHub [[Link]](https://github.com/pytorch/pytorch/blob/31ee5d8d8b249fd0dbd60e1d9c171ec75b597672/docs/source/notes/randomness.rst): **"Due to benchmarking noise and different hardware, the benchmark may select different algorithms on subsequent runs, even on the same machine"**.
**When to use it:**
   **==Only enable this when your model actually uses convolutions==, the network architecture is fixed (not dynamic), and the input sizes (batch size, image size, and so on) do not change. If the shapes seen by the convolution layers keep changing during training, the repeated benchmarking costs more time than it saves.**
==**Example**==
```python=
if torch.cuda.is_available():
    device = torch.device('cuda')
    torch.backends.cudnn.benchmark = True
else:
    device = torch.device('cpu')
```
---
# 4. Disable bias for convolution directly followed by a Batch Norm
**Description:**
   **Adding a bias to any channel of a feature map only shifts that channel's mean. When the model has a BatchNorm2d right after a Conv2d, the batch norm removes the per-channel mean anyway, so there is no need to give the Conv2d a bias (the same holds for Conv1d and Conv3d). The diagram below illustrates this.**
![](https://i.imgur.com/hrqF6Sz.png)
**The Batch Normalization algorithm is as follows:**
![](https://i.imgur.com/oS91wgc.png)
_S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep
network training by reducing internal covariate shift. In ICML, 2015._
==**Example**==
```python=
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, inplanes: int = 1, planes: int = 1):
        super(Net, self).__init__()
        # bias=False: the BatchNorm2d that follows removes the per-channel mean anyway
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size = 3, bias = False)
        self.bn1 = nn.BatchNorm2d(planes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(x)
        out = self.bn1(out)
        return out
```
---
# 5. Automatic mixed precision (AMP) for faster training on NVIDIA GPUs
**Description:**
   **By default, most deep learning frameworks (including PyTorch) train with 32-bit floating point (FP32) arithmetic, but for many deep learning models this much precision is not strictly necessary. In 2017, NVIDIA researchers developed a mixed precision training method [[Source]](https://developer.nvidia.com/blog/mixed-precision-training-deep-neural-networks/), later published at ICLR 2018 [[PAPER](https://arxiv.org/pdf/1710.03740.pdf)]. It combines single precision (FP32) and half precision (FP16) during training and reaches the same accuracy as pure FP32 training (with the same hyperparameters) while training faster, with additional benefits on NVIDIA GPUs:**
* Shorter training time
* Lower memory requirements
* Room for larger batch sizes, larger models, or larger inputs
   **In PyTorch, automatic mixed precision (AMP) training means using ```torch.cuda.amp.autocast()``` and ```torch.cuda.amp.GradScaler()``` on CUDA devices with Tensor Core support.**
<br>
**How to use AMP in PyTorch:**
1. **Adding autocast**
In these regions, CUDA ops run in a dtype chosen by autocast to improve performance while maintaining accuracy.
```python=
# Source: PyTorch Documents
for epoch in range(0):  # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        # Runs the forward pass under autocast.
        with torch.cuda.amp.autocast():
            output = net(input)
            # output is float16 because linear layers autocast to float16.
            assert output.dtype is torch.float16

            loss = loss_fn(output, target)
            # loss is float32 because mse_loss layers autocast to float32.
            assert loss.dtype is torch.float32

        # Exits autocast before backward().
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        loss.backward()
        opt.step()
        opt.zero_grad()  # set_to_none=True here can modestly improve performance
```
**When autocast is enabled, PyTorch automatically decides which layers or functions are run in torch.HalfTensor (float16); the table below lists which ops autocast to float16 ([Source](https://pytorch.org/docs/stable/amp.html#autocast-op-reference)).**
![](https://i.imgur.com/tdptMJY.png)
<br>
==**Important!!**==   **Autocast should only wrap the forward pass (including the loss computation); running the backward pass under autocast is not recommended.**
2. **Adding GradScaler**
GradScaler helps perform the steps of gradient scaling conveniently. Gradient scaling improves convergence for networks with float16 gradients by minimizing gradient underflow.
```python=
# Source: PyTorch Documents
from torch.cuda.amp import autocast, GradScaler

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # Scales loss. Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
```
   **Why is GradScaler still needed when autocast is already in use? Because some operations in the forward pass now run in float16, the available floating-point precision may be too small to represent the gradients, and backpropagation can underflow. To prevent this, ```scaler.scale(loss).backward()``` multiplies the loss by a scale factor, so every gradient produced during backpropagation is scaled by the same factor; a well-chosen scale factor keeps very small gradients from being flushed to zero (largely because float16 cannot represent them, as shown below).**
**The ICLR 2018 paper "Mixed Precision Training" [Paper](https://openreview.net/pdf?id=r1gs9JgRZ) found that using fp16 everywhere in a model "swallows" all gradient updates smaller than 2^-24, which accounted for roughly 5% of all weight updates in their network. In other words, about 5% of the weights would never be updated by the gradients, which can keep the model from training at all.**
![](https://i.imgur.com/CYOjEEz.png)
_P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu. Mixed precision training._
<br>
**NVIDIA's AMP benchmark results ([Github](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/ConvNets/resnet50v1.5#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)):**
1. **Training**
![](https://i.imgur.com/ArQPuvd.png)
2. **Inference**
![](https://i.imgur.com/VHDOliM.png)
**Related resources:**
1. **ECCV 2020 Tutorial on Accelerating Computer Vision with Mixed Precision(NVLAB): [LINK](https://nvlabs.github.io/eccv2020-mixed-precision-tutorial/)**
2. **PyTorch AMP Tutorial: [Colab](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/19350f3746283d2a6e32a8e698f92dc4/amp_recipe.ipynb#scrollTo=2V5LTdiLzhk9)**
---
# 6. Increase Batch Sizes
**Increase the batch size to make more effective use of GPU memory.**
> **This pairs well with the AMP technique above: AMP lowers the GPU memory requirement, so the two can be used together.**
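One simple way to judge how much room is left for a larger batch size is to watch the peak GPU memory used by a few training iterations. A minimal sketch, assuming a hypothetical `train_one_epoch` helper that runs a few (ideally AMP) iterations at the current batch size:
```python=
import torch

torch.cuda.reset_peak_memory_stats()
train_one_epoch(model, train_loader)   # hypothetical: a few iterations at the current batch size

peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024 ** 3
print(f'peak usage: {peak_gb:.1f} GB of {total_gb:.1f} GB')   # plenty of headroom -> try a larger batch size
```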
---
# 7. Accumulating gradients
**Description:**
   **Suppose we want to train with a batch size of 256, but the GPU memory only fits batches of 32 at a time. What now? ~~If money is no object, just upgrade the GPU~~. If upgrading is not an option, we can accumulate gradients over 8 (= 256/32) batches without calling the optimization step "optimizer.step()", letting loss.backward() keep adding the computed gradients. Once the gradients for 256 inputs have been accumulated, we run one optimization step 'optimizer.step()', clear the accumulated gradients, and enter the loop again.**
**Pros:**
   **Large-batch training without multiple GPUs; even a single GPU can train with a large effective batch size this way.**
**Cons:**
   **It takes more time than training in parallel on multiple GPUs (obviously).**
**A basic training function handles each batch like this:**
```python=
def train(model, dataloader, optimizer, epoch):
    for eph_idx in range(1, epoch + 1):
        for index, (data, target) in enumerate(dataloader):
            optimizer.zero_grad()                       # reset gradients
            output = model(data)
            loss = torch.nn.MSELoss()(output, target)
            loss.backward()                             # backpropagation
            optimizer.step()                            # update parameters
```
1. **optimizer.zero_grad(): clear the gradients**
2. **Compute the loss: given the output and target, compute the loss value with the loss function**
3. **loss.backward(): run backpropagation and compute the current gradients**
4. **optimizer.step(): update the model parameters according to the gradients**
<br>
**With gradient accumulation, each batch is trained like this:**
```python=
def train(model, dataloader, optimizer, epoch, accumulation_steps = 8):
    for eph_idx in range(1, epoch + 1):
        for index, (data, target) in enumerate(dataloader):
            output = model(data)
            loss = torch.nn.MSELoss()(output, target)
            loss = loss / accumulation_steps            # normalize the loss over the accumulated steps
            loss.backward()                             # backpropagation (gradients accumulate across batches)
            if (index + 1) % accumulation_steps == 0:
                optimizer.step()                        # update parameters
                optimizer.zero_grad()                   # reset gradients
```
1. **Compute the loss: given the output and target, compute the loss value with the loss function**
2. **loss.backward(): run backpropagation and compute the current gradients**
3. **Repeat steps 1 and 2 without clearing the gradients, letting them accumulate on top of the previous ones**
4. **After the gradients have accumulated for accumulation_steps batches, call ```optimizer.step()``` to update the model parameters with the accumulated gradients, then call ```optimizer.zero_grad()``` to clear them**
<br>
:fire: ==**A few projects that use accumulating gradients successfully, e.g. NVIDIA's [DeepLearningExamples](https://github.com/NVIDIA/DeepLearningExamples/blob/b1ce24a54ff25fc21b8a281abbbbb5d2d69664e2/PyTorch/LanguageModeling/BERT/run_glue.py#L491~L503), [More](https://github.com/NVIDIA/DeepLearningExamples/search?q=gradient+accumulation)**==
---
# 8. Training a model on multiple GPUs
   **To train a PyTorch model on a server with multiple GPUs, the entry-level approach is torch.nn.DataParallel().**
**Here is what nn.DataParallel actually does:**
   **In the forward pass, the input batch is split into chunks (one per device) that are scattered to the different GPUs, and the model module is replicated onto every device. In other words, the input batch is divided evenly across the devices, each replica of the module processes only its own chunk, and the batch size must be larger than the number of GPUs used. During backpropagation, the gradients from each chunk are summed into the original module. ([Official explanation](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html#torch.nn.DataParallel))**
   **The official signature is ```CLASS torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)```: module is the model you designed, device_ids are the GPU IDs to train on, and output_device is the device that gathers the outputs. output_device is usually left unspecified and defaults to device_ids[0], which is why the first GPU uses more memory than the others (besides its share of the forward pass it also gathers the outputs and accumulates the gradients).**
==**Example**==
```python=
device_ids = [0, 1, 2, 3]
# The model's parameters must already live on device_ids[0], e.g. model.to('cuda:0')
parallel_model = torch.nn.DataParallel(model, device_ids = device_ids)  # Encapsulate the model
predictions = parallel_model(inputs)       # Forward pass on multi-GPUs
loss = loss_function(predictions, labels)  # Compute loss function
loss.mean().backward()                     # Average GPU-losses + backward pass
optimizer.step()                           # Optimizer step
predictions = parallel_model(inputs)       # Forward pass with new parameters
```