# PyTorch OneCycleLR Explained
###### tags: `PyTorch`, `Python`, `scheduler`, `lr`, `learning-rate`

Ref.: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.OneCycleLR.html

## Parameters

`torch.optim.lr_scheduler.OneCycleLR()`

* ==**optimizer**,==
* ==**max_lr**,==
    * The maximum (peak) learning rate.
    * `div_factor` and `final_div_factor` derive the **initial learning rate** and the **minimum learning rate** from it:
        * ==**Initial learning rate**==: **`initial_lr`** = **`max_lr`** / **`div_factor`**
        * ==**Minimum learning rate**==: **`min_lr`** = **`initial_lr`** / **`final_div_factor`**
    * **div_factor=25.0,**
    * **final_div_factor=10000.0, (1e4)**
* ==**total_steps=None,**==
    * Total number of steps in the cycle.
    * Provide either **`total_steps`** or the pair (`epochs`, `steps_per_epoch`); one of the two is enough, because `total_steps = epochs * steps_per_epoch`.
    * **epochs=None,**
    * **steps_per_epoch=None,**
* ==**pct_start=0.3**,==
    * *Short for "percentage start".*
    * The percentage of the cycle (in number of steps) spent increasing the learning rate.
    > This is effectively the **warm-up ratio** of `total_steps`: with the default of 0.3, the learning rate rises from `initial_lr` during the first 30% of the steps.
    * `pct_start` is the fraction of the cycle taken up by phase 1, in which the learning rate gradually increases (warm-up) until it reaches the maximum given by `max_lr`. With `three_phase=False` (the default), phase 2 then decreases it from `max_lr` to `min_lr`. (For `three_phase=True`, see the `three_phase` entry.)
* **three_phase=False,**
    * Whether to use a third phase to annihilate the learning rate instead of modifying the second phase.
    > If True, use a third phase of the schedule to annihilate the learning rate according to ‘final_div_factor’ instead of modifying the second phase (the first two phases will be symmetrical about the step indicated by ‘pct_start’).
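    * Defaults to `False`, so by default OneCycleLR has only two phases.
        * Phase 1 increases the learning rate (warm-up phase); `pct_start` sets its length as a fraction of `total_steps`.
        * Phase 2 decreases the learning rate using the function chosen by `anneal_strategy`; `div_factor` and `final_div_factor` determine the final minimum (`min_lr` = `max_lr` / `div_factor` / `final_div_factor` = `initial_lr` / `final_div_factor`).
    * Example: with `pct_start=0.3`, `three_phase=True`, and every other parameter left at its default, the learning rate in the three phases behaves as follows:
        * Phase 1 (0% ~ 30%): the learning rate increases from `initial_lr` to `max_lr`.
        * Phase 2 (30% ~ 60%): the learning rate decreases back to `initial_lr`; the curve mirrors phase 1 and lasts equally long.
        * Phase 3 (60% ~ 100%): the learning rate decreases from `initial_lr` to `min_lr` (`min_lr` = `initial_lr` / `final_div_factor`).
* ==**anneal_strategy='cos',**==
    * Selects the annealing function: `'cos'` (the default, cosine annealing) or `'linear'`.
    * Linear annealing changes the learning rate at a constant rate within a phase; **cosine annealing** changes it slowly near the start and end of a phase and fastest in the middle.
* ==**cycle_momentum=True,**==
    * Whether momentum is cycled inversely to the learning rate (momentum falls while the learning rate rises, and rises while it falls). Defaults to `True`.
    * `base_momentum` and `max_momentum` set the bounds of the momentum cycle.
    * **base_momentum=0.85,**
    * **max_momentum=0.95,**
* ==**last_epoch=-1,**==
    * Used when resuming training. Despite its name, it is the index of the last batch (step) processed, not the number of epochs completed. The default of -1 starts the schedule from the beginning.
    * Because the counter is in steps, the scheduler never needs to know how many steps make up an epoch, even when only `total_steps` is given instead of (`epochs`, `steps_per_epoch`).
    > The index of the last batch. This parameter is used when resuming a training job. Since step() should be invoked after each batch instead of after each epoch, this number represents the total number of batches computed, not the total number of epochs computed. When last_epoch=-1, the schedule is started from the beginning.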
    > Default: -1
* **verbose=False**
    * Whether to print a message each time the learning rate is updated.

## Usage notes

* `scheduler.step()` should be called after `optimizer.step()`, not before it.
* OneCycleLR is meant to be stepped once per batch (step), not once per epoch.

## Code

:::spoiler (Code of ToyModel)
```python
import torch.nn as nn
import torch.nn.functional as F


class ToyModel(nn.Module):
    def __init__(self, input_shape, output_classes=10, final_embed_dim=2048):
        """input_shape: (C, H, W), assumed square (H == W).

        If output_classes is None, the model is built without the final classifier.
        """
        super().__init__()
        self.add_classifier = bool(output_classes)
        self.conv1 = nn.Conv2d(input_shape[0], 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        # Conv2d(kernel_size=5):     H' = H - 5 + 1
        # max_pool2d(kernel_size=2): H' = H // 2
        self.size = ((input_shape[1] - 5 + 1) // 2 - 5 + 1) // 2
        self.fc1 = nn.Linear(20 * self.size * self.size, final_embed_dim)
        if self.add_classifier:
            self.classifier = nn.Linear(final_embed_dim, output_classes)

    def _encoder_forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.flatten(start_dim=1)  # same as x.view(-1, 20 * self.size * self.size)
        x = self.fc1(x)
        return x

    def forward(self, x):
        x = self._encoder_forward(x)
        if self.add_classifier:
            x = F.relu(x)
            x = F.dropout(x, training=self.training)
            x = self.classifier(x)
        return x
```
:::

Example Code of OneCycleLR
```python=
import torch.optim as optim
from torch.optim.lr_scheduler import OneCycleLR
import matplotlib.pyplot as plt

TOTAL_STEPS = 200

# Build a very simple CNN model (ToyModel is defined in the spoiler block above)
model = ToyModel(input_shape=(3, 32, 32), output_classes=10)

# Initialize the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Initialize the learning rate scheduler
scheduler = OneCycleLR(optimizer,
                       max_lr=0.2,
                       div_factor=2.0,
                       total_steps=TOTAL_STEPS,
                       pct_start=0.3,
                       three_phase=True)

# Record how the learning rate changes
lr_history = []

# Update the learning rate once per step inside the training loop
for step in range(TOTAL_STEPS):
    lr_history.append(optimizer.param_groups[0]['lr'])
    # train(...): the forward and backward passes would go here
    optimizer.step()   # must come before scheduler.step()
    scheduler.step()

# Plot the learning rate schedule
plt.plot(lr_history)
plt.xlabel('Step')
plt.ylabel('Learning rate')
plt.show()
```

Below is the resulting lr curve (**`three_phase=True`**, hence three phases) over 200 steps, with the lr updated once per step.

![](https://i.imgur.com/7f6vI2v.png)

Observations:
* Once OneCycleLR is attached, it completely overrides the `lr` originally passed to the optimizer: the `lr=0.1` given to `optim.SGD` has no effect. (In this example `max_lr / div_factor` also happens to be 0.1, so the starting value coincides.)
* `initial_lr` is indeed `max_lr / div_factor`.
* With `three_phase=True`:
    * All three phases use the function selected by `anneal_strategy` (default: cosine); switching to `'linear'` turns each phase into a straight line.
    * phase 1: lr increases from `initial_lr` to `max_lr`
    * phase 2: lr decreases from `max_lr` to `initial_lr`; phase 2 mirrors phase 1
    * phase 3: lr decreases from `initial_lr` to `min_lr`
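
The observations above can also be cross-checked without PyTorch. The sketch below is a plain-Python approximation of the schedule, not PyTorch's own implementation. It assumes cosine annealing and assumes the phases end at step indices `pct_start * total_steps - 1`, `2 * pct_start * total_steps - 2` (three-phase only), and `total_steps - 1`. The helper names `cos_anneal` and `one_cycle_lrs` are hypothetical and exist only for this illustration.

```python
import math


def cos_anneal(start, end, pct):
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)


def one_cycle_lrs(max_lr, total_steps, pct_start=0.3, div_factor=25.0,
                  final_div_factor=1e4, three_phase=False):
    """Return the learning rate for every step of one cycle (assumed layout)."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor

    # Each phase is (last step index of the phase, lr at its start, lr at its end).
    # The end indices are an assumption made for this sketch.
    if three_phase:
        phases = [
            (pct_start * total_steps - 1, initial_lr, max_lr),
            (2 * pct_start * total_steps - 2, max_lr, initial_lr),
            (total_steps - 1, initial_lr, min_lr),
        ]
    else:
        phases = [
            (pct_start * total_steps - 1, initial_lr, max_lr),
            (total_steps - 1, max_lr, min_lr),
        ]

    lrs = []
    for step in range(total_steps):
        phase_start = 0.0
        for i, (phase_end, lr_from, lr_to) in enumerate(phases):
            # Use the first phase that contains this step (or the last phase).
            if step <= phase_end or i == len(phases) - 1:
                pct = (step - phase_start) / (phase_end - phase_start)
                lrs.append(cos_anneal(lr_from, lr_to, pct))
                break
            phase_start = phase_end
    return lrs


# Same settings as the OneCycleLR example above.
lrs = one_cycle_lrs(max_lr=0.2, total_steps=200, pct_start=0.3,
                    div_factor=2.0, three_phase=True)
print(f"start={lrs[0]:.6f} peak={max(lrs):.6f} end={lrs[-1]:.2e}")
```

Under these assumptions the list starts at `initial_lr`, peaks at `max_lr` at the end of phase 1, returns to `initial_lr` at the end of phase 2, and finishes at `min_lr`, which matches the three-phase behaviour described in the observations.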