Model Training Tips

Source: 李宏毅 (Hung-yi Lee) ML2022 Course

General Guide


Model Bias

  • Model bias arises when the model is too simple, so the function set it covers is too small

  • Redesigning a more complex (more flexible) network can fix model bias


  • If the model is complex enough that its function set is rich, yet the training loss is still high, the cause is likely an optimization issue

  • With poor optimization, the iterative process may get stuck in a local minimum


  • If a deeper network does not achieve a lower loss on the training data than a shallower one, the cause should be an optimization issue (the deeper network can always represent whatever the shallower one represents)


  • If the loss on the training data is small but the loss on the testing data is still high, it is probably overfitting

  • Possible ways to fix overfitting (a minimal sketch follows this list):

    • Collect more data, or apply data augmentation
    • Use a simpler network, e.g. fewer parameters, shared parameters, or fewer features
    • Constrain the optimization, e.g. early stopping, regularization, dropout
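
    A minimal PyTorch-style sketch of three of these remedies (dropout, L2 regularization via weight decay, and early stopping). The model, the patience value, and the helpers `train_one_epoch` / `evaluate` are illustrative assumptions, not part of the course material.

    ```python
    import torch
    import torch.nn as nn

    # Hypothetical small network with dropout as a regularizer.
    model = nn.Sequential(
        nn.Linear(16, 64), nn.ReLU(),
        nn.Dropout(p=0.5),          # dropout to reduce overfitting
        nn.Linear(64, 1),
    )

    # weight_decay adds an L2 regularization penalty to the update.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)

    # Early stopping: stop when the validation loss has not improved for `patience` epochs.
    best_val, patience, bad_epochs = float("inf"), 5, 0
    for epoch in range(100):
        train_one_epoch(model, optimizer)   # assumed training helper
        val_loss = evaluate(model)          # assumed validation helper
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                       # early stopping
    ```
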
  • Bias-Complexity Trade-off


  • Cross-validation reduces the effect of sampling (a lucky or unlucky split)


  • It is not recommended to use the public testing data to select the model

  • Instead, split the training set into a training set and a validation set, and use the validation set to select the model

  • N-fold cross-validation: e.g. with 3 folds, split the training set evenly into train/train/val for model evaluation, repeat the evaluation for every permutation of the folds, and average the resulting losses (see the sketch below)
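
    A minimal sketch of the k-fold procedure described above, using numpy index splits. `train_and_evaluate` is a hypothetical helper that trains on the given indices and returns the validation loss.

    ```python
    import numpy as np

    def k_fold_cv(n_samples, k=3, seed=0):
        """Average validation loss over all k train/val fold assignments."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(n_samples)
        folds = np.array_split(idx, k)

        losses = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            losses.append(train_and_evaluate(train_idx, val_idx))  # hypothetical helper
        return np.mean(losses)  # use this averaged loss for model selection
    ```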


  • If the training and testing (or validation) data come from different distributions (mismatch), adding more data will not reduce the testing loss

  • The Hessian matrix is built from the second derivatives of L(θ)

  • If the eigenvalues of the Hessian are all positive, θ is a local minimum; conversely, if they are all negative, θ is a local maximum; if some are positive and some negative, θ is a saddle point

  • If an eigenvalue of the Hessian is negative, updating the parameters along its corresponding eigenvector direction reduces the loss

  • In practice, computing the full second-derivative matrix and its eigenvalues is expensive, so this is a last resort

  • A point that looks like a local minimum in a low-dimensional space may not be one in a higher-dimensional space

  • In practice, true local minima are not that common

  • When training plateaus and appears to be stuck at a local minimum, it is usually stuck at a saddle point, so there are still ways to keep optimizing (see the sketch below)
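
    A small numpy sketch of how the Hessian's eigenvalues classify a critical point and how a negative-eigenvalue eigenvector gives an escape direction. The toy loss is purely illustrative.

    ```python
    import numpy as np

    # Toy loss L(theta) = theta0^2 - theta1^2 has a saddle point at theta = (0, 0).
    def hessian(theta):
        return np.array([[2.0, 0.0],
                         [0.0, -2.0]])  # constant for this quadratic toy loss

    eigvals, eigvecs = np.linalg.eigh(hessian(np.zeros(2)))

    if np.all(eigvals > 0):
        kind = "local minimum"
    elif np.all(eigvals < 0):
        kind = "local maximum"
    else:
        kind = "saddle point"

    # Moving along an eigenvector with a negative eigenvalue decreases the loss.
    escape_dir = eigvecs[:, np.argmin(eigvals)]
    print(kind, escape_dir)
    ```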


  • Mini-batch optimization computes the loss and updates the weights on each batch in turn; only after every batch has been seen is the data shuffled for the next epoch (see the sketch below)
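
    A minimal PyTorch sketch of this loop; `shuffle=True` in the DataLoader re-shuffles the data at the start of every epoch. The dataset and model are placeholders for illustration only.

    ```python
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder dataset and model.
    dataset = TensorDataset(torch.randn(1000, 16), torch.randn(1000, 1))
    model = torch.nn.Linear(16, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = torch.nn.MSELoss()

    # shuffle=True re-shuffles the data at the beginning of each epoch.
    loader = DataLoader(dataset, batch_size=32, shuffle=True)

    for epoch in range(10):
        for x, y in loader:                 # one weight update per batch
            loss = loss_fn(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    ```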


  • With parallel computation (GPU), the time per update is similar for a small batch and a moderately large one, unless the batch size is very large

  • A small batch size means more batches per epoch, so each epoch takes longer


  • Small-batch updates follow a noisier trajectory, which helps optimization by avoiding getting stuck at critical points (optimization benefit)


  • Empirically, in image recognition training, large batches give worse accuracy


  • Because the testing distribution is not necessarily the same as the training distribution (mismatch), the high randomness of small batches keeps training from settling into sharp wells of the error surface, reducing the gap between training and testing results (generalization benefit)

  • Small batches therefore have an advantage in both optimization and generalization

  • Vanilla gradient descent updates the weights in the direction opposite to the gradient

  • Gradient descent with momentum updates the weights along the sum of (the negative gradient direction) and (the previous update direction)


  • The momentum term is a (weighted) sum of all past gradients (see the sketch below)
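
    A minimal numpy sketch of the momentum update described above; `lam` (the momentum coefficient) and `lr` (the learning rate) are standard hyperparameters, and `grad` is an assumed function returning the gradient of the loss.

    ```python
    import numpy as np

    def sgd_momentum(theta, grad, lr=0.01, lam=0.9, steps=100):
        """m accumulates a decayed sum of all past gradients."""
        m = np.zeros_like(theta)
        for _ in range(steps):
            g = grad(theta)            # assumed gradient of the loss at theta
            m = lam * m - lr * g       # movement = previous movement - lr * gradient
            theta = theta + m          # update along (negative gradient + previous direction)
        return theta
    ```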

  • At a critical point, the gradient is zero

  • A critical point may be a saddle point or (less likely) a local minimum

  • Saddle points and local minima can be distinguished by the Hessian matrix

  • A saddle point can be escaped by moving along an eigenvector of the Hessian (one with a negative eigenvalue)

  • Small batches and momentum both help escape critical points

References
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes
Stochastic Weight Averaging in Parallel: Large-Batch Training that Generalizes Well
Large Batch Training of Convolutional Networks
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

  • Adaptive learning rates adjust the learning rate for each parameter according to its gradients

  • The adjustment tends to take larger steps where the gradient is flat and smaller steps where it is steep

  • Adagrad divides by the root mean square (RMS) of all previous gradients; because of the squaring, only the gradient magnitude matters, not its sign


  • RMSProp weights the current gradient more heavily and past gradients less (an exponential moving average; see the sketch below)
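
    A small numpy sketch contrasting the two per-parameter step-size rules; `alpha` is the RMSProp decay factor and `eps` a small constant for numerical stability (both assumed standard defaults).

    ```python
    import numpy as np

    def adagrad_step(theta, g, state, lr=0.01, eps=1e-8):
        # sigma accumulates the squares of ALL past gradients (only magnitude matters).
        state["sum_sq"] = state.get("sum_sq", 0.0) + g ** 2
        return theta - lr / (np.sqrt(state["sum_sq"]) + eps) * g

    def rmsprop_step(theta, g, state, lr=0.01, alpha=0.9, eps=1e-8):
        # Exponential moving average: recent gradients weigh more than old ones.
        state["avg_sq"] = alpha * state.get("avg_sq", 0.0) + (1 - alpha) * g ** 2
        return theta - lr / (np.sqrt(state["avg_sq"]) + eps) * g
    ```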


  • Adam combines the features of RMSProp and momentum

  • Learning rate scheduling fixes the problem that, after training sits in a flat region for too long, the adaptive learning rate makes the adjusted step explode

  • Learning rate decay makes the learning rate decrease over time

  • Warm-up keeps the learning rate small at the start of training, so the model (and the adaptive statistics) can gather more information about the error surface first (a combined scheduling sketch follows)
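
    A minimal PyTorch sketch combining warm-up and decay with `LambdaLR`; the warm-up length, base learning rate, and decay shape are illustrative assumptions.

    ```python
    import torch

    model = torch.nn.Linear(16, 1)                       # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    warmup_steps = 1000

    def lr_lambda(step):
        # Linear warm-up to the base lr, then inverse-sqrt decay over time.
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        return (warmup_steps / (step + 1)) ** 0.5

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    # In the training loop: call optimizer.step(), then scheduler.step(), once per update.
    ```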


References
Adam: A Method for Stochastic Optimization

Feature Scaling

Batch Normalization

  • Different features (and hence their weights) can differ by several orders of magnitude, which easily makes gradient descent difficult


  • Internal covariate shift refers to the fact that each layer's parameters are adjusted independently, so the combined result can overshoot

  • Internal covariate shift can be mitigated with a smaller learning rate or with batch normalization

  • The common architecture applies BN before the activation layer

  • Applying BN before the activation keeps the activation's input away from the saturated, flat regions

  • The batch mean and standard deviation must approximate those of the whole training set, so the batch size cannot be too small (see the sketch below)
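
    A numpy sketch of the per-batch normalization (the learnable scale gamma and shift beta are shown at their default initial values); with a tiny batch the estimated mean and variance would be too noisy to stand in for the training-set statistics.

    ```python
    import numpy as np

    def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
        """x: (batch_size, num_features). Normalize each feature over the batch."""
        mu = x.mean(axis=0)                      # per-feature batch mean
        var = x.var(axis=0)                      # per-feature batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance per feature
        return gamma * x_hat + beta              # learnable rescale and shift
    ```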


  • BN reduces sensitivity to parameter initialization and also helps counter overfitting

  • In theory, BN should be placed before the activation:
    "So the Batch Normalization layer is actually inserted right after a Conv layer / fully connected layer, but before feeding into ReLU (or any other kind of) activation" (Ioffe and Szegedy, 2015)

  • If the model contains both BN and dropout, dropout can be inserted after the activation layer, giving the structure CONV/FC => BatchNorm => ReLU => Dropout => CONV/FC (reference; a minimal sketch follows)
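
    A minimal PyTorch sketch of the CONV/FC => BatchNorm => ReLU => Dropout ordering described above; the channel counts and dropout probability are illustrative.

    ```python
    import torch.nn as nn

    block = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1),   # CONV
        nn.BatchNorm2d(32),                           # BN before the activation
        nn.ReLU(),                                    # activation
        nn.Dropout(p=0.25),                           # dropout after the activation
        nn.Conv2d(32, 64, kernel_size=3, padding=1),  # next CONV/FC
    )
    ```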

References
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Dropout: A Simple Way to Prevent Neural Networks from Overfitting