
contributed by <weian312>

Mixed Precision Training

ICLR 2018
Sharan Narang et al. (Baidu), Paulius Micikevicius et al. (NVIDIA)

TODO

  • Run my own experiments
  • Some key points from the paper are still missing here

First, a quick primer on IEEE 754.

Half-Precision (FP16, binary16)


Smallest positive (subnormal) value: $2^{-24} \approx 5.96 \times 10^{-8}$, bit pattern `0 00000 0000000001`
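The half-precision limit above can be checked directly with NumPy's `float16` (a quick sanity check, not from the paper):

```python
import numpy as np

# Smallest positive binary16 subnormal: 2^-24
tiny = np.float16(2.0 ** -24)
print(tiny)                    # ≈ 5.96e-08

# Anything well below half of 2^-24 rounds to zero in half precision
print(np.float16(2.0 ** -26))  # 0.0
```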

Implementation

Let's jump straight to the implementation.
First, a quick review of deep-learning training:

  • Structure and the forward pass
    A neural network is made up of
    1. weights & biases
    2. activations
    In the forward pass, the input is multiplied by the weights, the bias is added, and the result passes through the activation function to produce the output; the output and the true label then go into the loss function to compute
    $L = \mathrm{Loss}(\mathrm{output}, \mathrm{true\ label})$
  • Backward pass
    Starting from the loss, the chain rule gives the partial derivative (gradient) $\frac{\partial L}{\partial W}$ at each node; an optimizer (e.g. SGD) is then chosen to update the weights.
    SGD (stochastic gradient descent):
    $W \leftarrow W - \eta \frac{\partial L}{\partial W}$
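The forward pass, backward pass, and SGD update above can be sketched end to end. This is a minimal NumPy illustration of one training step on a toy single-layer network (my own example with tanh activation and MSE loss; shapes and values are arbitrary, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))        # input batch
y = rng.standard_normal((4, 2))        # true labels
W = rng.standard_normal((3, 2)) * 0.1  # weights
b = np.zeros(2)                        # bias

# Forward pass: output = activation(x W + b), here activation = tanh
z = x @ W + b
out = np.tanh(z)

# Loss: L = Loss(output, true label), here mean squared error
L = np.mean((out - y) ** 2)

# Backward pass: chain rule gives dL/dW and dL/db
dout = 2 * (out - y) / y.size
dz = dout * (1 - out ** 2)             # derivative of tanh
dW = x.T @ dz
db = dz.sum(axis=0)

# SGD update: W <- W - eta * dL/dW
eta = 0.1
W -= eta * dW
b -= eta * db
```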

FP32 MASTER COPY OF WEIGHTS

Let's start with a figure from the paper; it explains this quite clearly.

(Figure: one mixed-precision training iteration, from the paper)

  • Weights, gradients, and activations are all stored in half precision, with one single-precision master copy of the weights kept on the side.
  • Both the forward and the backward pass are computed in half precision; only the final weight update (the step where the computed gradients go into the optimizer) is done in single precision, and the result is saved back as the new single-precision weights.
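A minimal sketch of why the master copy matters, using a toy scalar weight (my own illustration, not the paper's code): the same small update is rounded away in pure FP16 but survives in the FP32 master copy.

```python
import numpy as np

eta, grad = 1e-4, 0.5
update = eta * grad                    # 5e-5: tiny relative to the weight

# Pure FP16 update: 5e-5 is below half an FP16 ulp at 1.0, so it rounds away.
w16 = np.float16(1.0)
w16 = np.float16(w16 - np.float16(update))
print(w16)                             # 1.0 -- the update was lost

# FP32 master copy: update in single precision, cast to FP16 only for compute.
w32 = np.float32(1.0)
w32 = w32 - np.float32(update)
print(w32)                             # ~0.99995 -- the update survives
w_half = np.float16(w32)               # half-precision working copy for the forward pass
```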

There are two reasons for this:

  1. The gradient still gets multiplied by the learning rate (e.g. $10^{-4}$, dropping it four orders of magnitude), and the smallest positive half-precision value is only $2^{-24} \approx 5.96 \times 10^{-8}$; anything smaller flushes to zero, so the update simply never happens.
  2. The weights are large relative to the gradients, so doing the update itself in half precision (i.e. keeping no single-precision weights) goes wrong in the same way: the tiny update is rounded away. (The paper explains this in more detail if you're interested.)
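Reason 1 can be shown in a couple of lines (my own example values): scaling a small gradient by the learning rate underflows half precision.

```python
import numpy as np

# A gradient of 1e-4 times a learning rate of 1e-4 should give 1e-8,
# but that is below the smallest positive FP16 value (2^-24 ≈ 5.96e-8).
eta = np.float16(1e-4)
grad = np.float16(1e-4)
print(eta * grad)                            # 0.0 in FP16 -- the update vanishes

print(np.float32(1e-4) * np.float32(1e-4))   # ~1e-8, fine in FP32
```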

Figure 2-a in the paper: mixed-precision training with and without a single-precision copy of the weights
(dev0 is the validation set)


Figure 2-b in the paper: the values that go bad during training


Reference

TensorFlow Guide
Paper