Model Training Tips

Source: Hung-yi Lee (李宏毅), ML 2022 Course

General Guide

Model Bias

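Model bias arises when the model is too simple, so its function set is too small to contain a function with low loss.
Redesigning the network to be more complex can resolve model bias.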
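If the model is complex enough that its function set is rich, yet the loss is still high, the cause may be an optimization issue.
If the optimization is poorly designed, the iterations may get stuck in a local minimum.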
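If a deeper network does not reach a lower loss on the training data than a shallower network, the cause should be an optimization issue, because the deeper network can represent everything the shallower one can.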
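If the loss on the training data is small but the loss on the testing data is still high, the cause may be overfitting.
Possible ways to address overfitting:
- Add more data, or apply data augmentation
- Use a simpler network, for example fewer parameters, shared parameters, or fewer features
- Improve the training procedure, for example early stopping, regularization, or dropout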
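The diagnostic flow in the notes above can be summarized as a short sketch. The function name `diagnose`, its arguments, and the threshold `tolerance` are hypothetical and chosen only for illustration; what counts as a "high" loss depends on the task.

```python
def diagnose(train_loss, test_loss, shallower_train_loss=None, tolerance=0.1):
    """Suggest a likely cause of poor results, following the flow in these notes.

    `tolerance` is an assumed, task-dependent threshold for a "high" loss.
    `shallower_train_loss` is the training loss of a shallower network, if known.
    """
    if train_loss > tolerance:
        # High training loss: the model is too simple, or optimization failed.
        if shallower_train_loss is None:
            return "model bias or optimization issue"
        if shallower_train_loss < train_loss:
            # A shallower network fits the training data better, so the
            # deeper one is not being optimized well.
            return "optimization issue"
        return "model bias"
    if test_loss > tolerance:
        # Low training loss but high testing loss.
        return "overfitting or mismatch"
    return "ok"
```

For example, `diagnose(0.5, 0.6, shallower_train_loss=0.3)` points to an optimization issue, while `diagnose(0.01, 0.5)` points to overfitting or a train/test mismatch.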
Bias-Complexity Trade-off
Cross validation reduces the effect of how the data happened to be sampled.
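Using the public testing data to select a model is not recommended.
Instead, split the training set into a training set and a validation set, and use the validation set to select the model.
N-fold cross validation: for example, with 3 folds, split the training set evenly into train/train/val and evaluate the model, then permute which fold serves as the validation set and evaluate again; the final score is the average loss over all permutations.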
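A minimal sketch of N-fold cross validation using index lists. The helper names `k_fold_splits` and `cross_validate`, and the callback `train_and_evaluate`, are hypothetical; the sketch assumes the data has already been shuffled.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, val_indices); each fold is the validation set once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def cross_validate(n_samples, n_folds, train_and_evaluate):
    """Average the validation loss over all folds.

    `train_and_evaluate(train_idx, val_idx)` is a hypothetical callback that
    trains a fresh model on `train_idx` and returns its loss on `val_idx`.
    """
    losses = [train_and_evaluate(train, val)
              for train, val in k_fold_splits(n_samples, n_folds)]
    return sum(losses) / len(losses)
```

Each candidate model would be scored with `cross_validate`, and the one with the lowest average loss selected, without touching the public testing data.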
If the training and testing (or validation) data follow different distributions (mismatch), adding more data will not lower the testing loss.
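The Hessian matrix is built from the second derivatives of $L(θ)$.
If all eigenvalues of the Hessian are positive, $θ^{'}$ is a local minimum; conversely, if all eigenvalues are negative, $θ^{'}$ is a local maximum; if some are positive and some are negative, $θ^{'}$ is a saddle point.
If an eigenvalue of the Hessian is negative, updating the parameters along the corresponding eigenvector lowers the loss.
In practice, computing the second-derivative matrix and its eigenvalues is expensive, so this is a last resort.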
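The eigenvalue test can be illustrated on a toy case. This sketch is restricted to a symmetric 2x2 Hessian so that the eigenvalues have a closed form and no linear-algebra library is needed; the function names and the tolerance `tol` are assumptions for illustration.

```python
import math


def symmetric_2x2_eigenvalues(h):
    """Eigenvalues (smaller, larger) of a symmetric matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = h
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return mean - radius, mean + radius


def classify_critical_point(h, tol=1e-12):
    """Classify a critical point from the Hessian evaluated at that point."""
    smallest, largest = symmetric_2x2_eigenvalues(h)
    if smallest > tol:
        return "local minimum"   # all eigenvalues positive
    if largest < -tol:
        return "local maximum"   # all eigenvalues negative
    if smallest < -tol and largest > tol:
        return "saddle point"    # mixed signs
    return "inconclusive"        # some eigenvalue is (numerically) zero
```

For $L(θ) = θ_1^2 - θ_2^2$ the Hessian at the origin is `[[2, 0], [0, -2]]`, which this sketch classifies as a saddle point.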
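A local minimum in a low-dimensional space may not be a local minimum in a higher-dimensional space.
In practice, true local minima are not that common.
When training reaches a plateau and appears stuck at a local minimum, it is usually stuck at a saddle point, and there are still ways to keep optimizing.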
Batch optimization computes the loss and updates the weights once per batch; after every batch has been used once (one epoch), the data is shuffled before the next epoch begins.
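A minimal sketch of this epoch loop. The names `iterate_minibatches` and `train`, and the `update` callback that stands in for "compute the loss on this batch and update the weights", are hypothetical.

```python
import random


def iterate_minibatches(samples, batch_size, rng):
    """Shuffle once, then yield consecutive batches that cover every sample."""
    order = list(samples)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


def train(samples, batch_size, n_epochs, update, seed=0):
    """Call `update(batch)` once per batch; the data is reshuffled every epoch."""
    rng = random.Random(seed)
    n_updates = 0
    for _ in range(n_epochs):
        for batch in iterate_minibatches(samples, batch_size, rng):
            update(batch)
            n_updates += 1
    return n_updates
```

With 10 samples and a batch size of 4, each epoch performs 3 updates (the last batch holds the 2 leftover samples), so a smaller batch size means more updates per epoch.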
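With parallel computation (GPU), the time per update is about the same for a small batch and a moderately larger batch, unless the batch size is very large.
A smaller batch size means more batches per epoch, so each epoch takes longer.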
The update path with small batches is noisier, which helps optimization because training is less likely to get stuck at a critical point (Optimization).
In practice, image recognition training results show that large batches give worse accuracy.
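Because the distribution of the testing data is not necessarily the same as that of the training data (mismatch), the high randomness of small batches keeps training from settling into a narrow well of the error surface, which reduces the gap between training and testing results (Generalization).
Small batches have an advantage in both optimization and generalization.
Plain gradient descent updates the weights in the direction opposite to the gradient.
Gradient descent with momentum updates the weights along the vector sum of (the direction opposite to the gradient) and (the direction of the previous weight update).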
Momentum is a (weighted) sum of all past gradients.
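A minimal sketch of gradient descent with momentum on a one-dimensional function. The update is written as movement = momentum × previous movement − lr × current gradient; the function name `gd_momentum` and the default values of `lr` and `momentum` are assumptions for illustration.

```python
def gd_momentum(grad, theta, lr=0.1, momentum=0.9, n_steps=100):
    """Minimize a 1-D function, given its gradient, using GD with momentum."""
    movement = 0.0  # no previous update before the first step
    for _ in range(n_steps):
        g = grad(theta)
        # Sum of the previous movement (scaled) and the negative gradient step.
        movement = momentum * movement - lr * g
        theta = theta + movement
    return theta
```

Unrolling the loop shows why the movement is a weighted sum of past gradients: after two steps it equals `-lr * (momentum * g0 + g1)`. With `momentum=0.0` the function reduces to plain gradient descent.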
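At a critical point the gradient is 0.
A critical point may be a saddle point or, with small probability, a local minimum.
Saddle points and local minima can be told apart with the Hessian matrix.
A saddle point can be escaped by moving along an eigenvector of the Hessian that has a negative eigenvalue.
Small batches and momentum both help escape critical points.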

References
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes
Stochastic Weight Averaging in Parallel: Large-Batch Training that Generalizes Well
Large Batch Training of Convolutional Networks
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

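An adaptive learning rate adjusts the learning rate according to the gradient.
The adjustment favors larger steps where the gradient is flat and smaller steps where the gradient is steep.
Adagrad uses the root mean square (RMS) of all previous gradients; because the gradients are squared, only their magnitude matters, not their sign.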
RMSProp gives the current gradient more influence and past gradients less influence.
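The difference between the two can be seen in how each computes the scaling factor, written here as sigma, that divides the learning rate (the update is then `theta -= lr / sigma * g`). This is a sketch for a single scalar parameter; the function names and the decay value `alpha=0.9` are assumptions for illustration.

```python
import math


def adagrad_sigmas(gradients):
    """sigma_t = RMS of all gradients seen up to and including step t."""
    sigmas, total = [], 0.0
    for t, g in enumerate(gradients):
        total += g * g
        sigmas.append(math.sqrt(total / (t + 1)))
    return sigmas


def rmsprop_sigmas(gradients, alpha=0.9):
    """sigma_t = sqrt(alpha * sigma_{t-1}^2 + (1 - alpha) * g_t^2), sigma_0 = |g_0|."""
    sigmas = []
    for t, g in enumerate(gradients):
        if t == 0:
            squared = g * g
        else:
            # Old gradients decay by alpha each step; the newest gets (1 - alpha).
            squared = alpha * sigmas[-1] ** 2 + (1 - alpha) * g * g
        sigmas.append(math.sqrt(squared))
    return sigmas
```

After a long run of small gradients followed by one large gradient, the RMSProp sigma rises much faster than the Adagrad sigma, so RMSProp shrinks the step sooner when the surface suddenly becomes steep.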
Adam combines the properties of RMSProp and momentum.
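A minimal sketch of one Adam update for a single scalar parameter, to show where the momentum part and the RMSProp part enter. The function name `adam_step` is hypothetical; the default hyperparameters are the ones suggested in the Adam paper.

```python
import math


def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; `t` is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g        # momentum-like average of gradients
    v = beta2 * v + (1 - beta2) * g * g    # RMSProp-like average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by about `lr` regardless of the gradient's magnitude.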
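Learning rate scheduling addresses a side effect of adaptive learning rates: after training stays in a flat region for a long time, the adjusted step becomes too large.
Learning rate decay lowers the learning rate over time.
Warm-up lowers the learning rate early in training, so the model can gather more information about the error surface before taking large steps.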
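A minimal sketch of a schedule that combines warm-up and decay. The notes do not specify the shape of either phase, so the linear ramp up and linear decay to zero used here are assumptions, as are the function and parameter names; the sketch also assumes `total_steps > warmup_steps`.

```python
def learning_rate(step, base_lr, warmup_steps, total_steps):
    """Linear warm-up to `base_lr`, then linear decay to 0 at `total_steps`."""
    if step < warmup_steps:
        # Warm-up: start small and grow to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Decay: shrink linearly over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)
```

With `base_lr=1.0`, `warmup_steps=10`, and `total_steps=110`, the rate starts at 0.1, reaches 1.0 at step 10, falls to 0.5 at step 60, and reaches 0 at step 110.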

References
Adam: A Method for Stochastic Optimization

Feature Scaling

Batch Normalization

The values of different features can differ by several orders of magnitude, which makes gradient descent difficult.
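One common form of feature scaling is standardization, sketched below: each feature is shifted to zero mean and scaled to unit standard deviation. The function name `standardize` is hypothetical, and treating a constant feature as all zeros is an assumption made to avoid division by zero.

```python
import math


def standardize(rows):
    """Scale each feature (column) to zero mean and unit standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((x - mu) ** 2 for x in col) / n)
            for col, mu in zip(columns, means)]
    # A constant feature has std 0; map it to 0.0 instead of dividing by zero.
    return [[(x - mu) / sd if sd > 0 else 0.0
             for x, mu, sd in zip(row, means, stds)]
            for row in rows]
```

For rows such as `[[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]`, both features end up on the same scale even though the raw values differ by three orders of magnitude.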
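Internal covariate shift refers to the risk that, because each layer's parameters are adjusted individually, the combined result overshoots.
Internal covariate shift can be mitigated with a smaller learning rate or with Batch Normalization (BN).
In the common architecture, BN is applied before the activation layer.
Applying BN before the activation keeps the activation's input away from the flat, saturated region.
The batch mean and standard deviation need to approximate the mean and standard deviation of the whole training set, so the batch size cannot be too small.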
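A minimal sketch of the BN computation for a single feature during training. The function name and the defaults for the scale `gamma`, the shift `beta`, and the stabilizer `eps` are assumptions for illustration; statistics used at inference time are not covered here.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch, then scale by gamma and shift by beta."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
```

The extreme case of a batch with one sample shows why the batch size matters: the batch mean equals the sample itself, so `batch_norm([5.0])` returns `[0.0]` and the input information is lost.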
BN reduces the influence of parameter initialization and helps counter overfitting.
In principle, BN should be placed before the activation.
So the Batch Normalization Layer is actually inserted right after a Conv Layer/Fully Connected Layer, but before feeding into ReLu (or any other kinds of) activation (Ioffe and Szegedy, 2015)
If the model contains both BN and Dropout, Dropout can be placed after the activation layer, so the structure becomes CONV/FC => BatchNorm => ReLU => Dropout => CONV/FC.
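The ordering above can be sketched as one block in plain Python, operating on a batch given as a list of samples. All function names are hypothetical; the BN here omits the learned scale and shift, and the dropout is the training-time "inverted" variant, both chosen to keep the sketch short.

```python
import math
import random


def linear(batch, weights, biases):
    """FC layer: one output per row of `weights`."""
    return [[sum(w * x for w, x in zip(row_w, sample)) + b
             for row_w, b in zip(weights, biases)]
            for sample in batch]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) across the batch."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(col) / n for col in columns]
    variances = [sum((x - mu) ** 2 for x in col) / n
                 for col, mu in zip(columns, means)]
    return [[(x - mu) / math.sqrt(var + eps)
             for x, mu, var in zip(sample, means, variances)]
            for sample in batch]


def relu(batch):
    return [[max(0.0, x) for x in sample] for sample in batch]


def dropout(batch, p, rng):
    """Zero each value with probability p and rescale the survivors by 1 / (1 - p)."""
    return [[0.0 if rng.random() < p else x / (1.0 - p) for x in sample]
            for sample in batch]


def block(batch, weights, biases, p=0.5, rng=None):
    """FC => BatchNorm => ReLU => Dropout, in the order given above."""
    rng = rng or random.Random(0)
    return dropout(relu(batch_norm(linear(batch, weights, biases))), p, rng)
```

The output of `block` would then feed the next CONV/FC layer.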

References
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
