contributed by <weian312>

Mixed Precision Training

ICLR 2018
Sharan et al. (Baidu), Paulius et al. (NVIDIA)

TODO

Hands-on measurements
Some key points of the paper are not covered yet

First, a quick primer on IEEE 754

Half-Precision (FP16, binary16)

Smallest positive (subnormal) value

$2^{-24}$

0 00000 0000000001 (1 sign bit, 5 exponent bits, 10 fraction bits)
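
To check the bit pattern above, here is a minimal sketch that uses only the Python standard library. The helper names `fp16_from_bits` and `to_fp16` are made up for this note; they rely on the `e` (half precision) format of the `struct` module.

```python
import struct

def fp16_from_bits(bits: int) -> float:
    """Interpret a 16-bit integer as an IEEE 754 binary16 value."""
    return struct.unpack("<e", bits.to_bytes(2, "little"))[0]

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# sign = 0, exponent = 00000, fraction = 0000000001
smallest = fp16_from_bits(0b0_00000_0000000001)

print(smallest == 2.0 ** -24)   # the pattern decodes to 2^-24
print(to_fp16(2.0 ** -24))      # representable, so it survives rounding
print(to_fp16(2.0 ** -26))      # too small for FP16, rounds to zero
```

Anything well below $2^{-24}$ cannot be represented in FP16 at all, which is the limitation the rest of this note works around.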

Implementation

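Let's start directly from the implementation.
First, a quick review of how deep learning training works.

Network structure and forward pass
A neural network consists of
1. Weights & Bias
2. Activations
In the forward pass, the input is multiplied by the weights, the bias is added, and the result goes through the activations to produce the output. The output is then fed into the loss function together with the true label:
$L = \mathrm{Loss}(\mathrm{output}, \mathrm{true\ label})$
Backpropagation
Starting from the loss, the chain rule gives the partial derivative (gradient) at every node,
$\frac{\partial L}{\partial W}$. An optimizer (e.g. SGD) then uses it to update the weights.
SGD (stochastic gradient descent):

$W \leftarrow W - \eta \frac{\partial L}{\partial W}$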
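
As a toy illustration of the forward pass, the chain rule, and the SGD update, here is a sketch for a single neuron. The ReLU activation, the squared-error loss, and all numeric values are assumptions chosen for the example, not something taken from the paper.

```python
def forward(w, b, x):
    """Linear layer (w * x + b) followed by a ReLU activation."""
    z = w * x + b
    return max(z, 0.0), z

def loss_fn(output, label):
    """Squared-error loss, L = 0.5 * (output - label)^2."""
    return 0.5 * (output - label) ** 2

def sgd_step(w, b, x, label, lr):
    """One SGD update: W <- W - lr * dL/dW (and the same for the bias)."""
    output, z = forward(w, b, x)
    dL_dout = output - label              # derivative of the loss
    dout_dz = 1.0 if z > 0 else 0.0       # derivative of ReLU
    dL_dw = dL_dout * dout_dz * x         # chain rule down to the weight
    dL_db = dL_dout * dout_dz             # chain rule down to the bias
    return w - lr * dL_dw, b - lr * dL_db

w, b = 0.5, 0.0
x, label = 2.0, 3.0

before = loss_fn(forward(w, b, x)[0], label)
for _ in range(20):
    w, b = sgd_step(w, b, x, label, lr=0.1)
after = loss_fn(forward(w, b, x)[0], label)

print(before, after)   # the loss shrinks as the weights are updated
```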

FP32 MASTER COPY OF WEIGHTS

The original figure (image not available here) explains the scheme quite clearly.

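Weights, gradients, and activations are all stored in half precision, and an extra single-precision copy of the weights is kept.
The forward and backward passes are both computed in half precision. Only the last step of backpropagation, the weight update (where the computed gradients are handed to the optimizer), runs in single precision, and the result is stored as the new single-precision weights.

There are two reasons for this:

1. The gradient still has to be multiplied by the learning rate (for example $10^{-4}$, a drop of four orders of magnitude), while the smallest positive half-precision value is only $2^{-24} \approx 5.96 \times 10^{-8}$. An update far below that rounds to zero, so the weight is never updated.
2. The weight can be much larger than its update. If the update is applied in half precision (that is, without a single-precision copy of the weights), the update is rounded away and is effectively zero as well. FP16 has only 10 fraction bits, so this happens once the weight is roughly 2048 times larger than the update (the paper explains this in more detail).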
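
Both failure modes can be reproduced with a small simulation. This is only a sketch: `to_fp16` and `to_fp32` are hypothetical helpers that emulate rounding through the `struct` module, and the learning rate, gradient, and step size are made-up values, not numbers from the paper.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest half-precision (binary16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round to the nearest single-precision (binary32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Reason 1: learning rate * gradient underflows in FP16.
lr = 1e-4
grad = to_fp16(1e-4)                 # the gradient itself fits in FP16
update_fp16 = to_fp16(lr * grad)     # about 1e-8, below what FP16 can hold
update_fp32 = to_fp32(lr * grad)     # still nonzero in FP32
print(update_fp16, update_fp32)

# Reason 2: the update is tiny compared with the weight.
step = 2.0 ** -13                    # weight / update = 8192 > 2048
w_half_only = to_fp16(1.0)           # no master copy: update done in FP16
w_master = to_fp32(1.0)              # FP32 master copy of the weight
for _ in range(1000):
    w_half_only = to_fp16(w_half_only - step)   # rounds back to 1.0 each time
    w_master = to_fp32(w_master - step)         # accumulates every update

w_half_view = to_fp16(w_master)      # FP16 copy used by forward/backward
print(w_half_only, w_master, w_half_view)
```

With the master copy, the thousand small updates add up and the FP16 copy derived from it reflects them. Without it, the weight never moves from 1.0.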

Figure 2a in the paper: mixed precision training with and without a single-precision copy of the weights
(dev0 is the validation set)

Figure 2b in the paper: values that become zero during training

Reference

TensorFlow Guide
Paper