
2. HUNG-YI LEE 2022 ML - NN Training tips

tags: Machine Learning

[Machine Learning 2021] "What to Do When Your Neural Network Fails to Train", parts (1)–(5)

(1) Local minima and saddle points

1. Why optimization fails

  • With gradient descent, a gradient of 0 does not necessarily mean we are stuck at a local minimum; all we can say is that we are stuck at a critical point.

    • at a local minimum there is no way out, while at a saddle point there is still a path down
  • How do we tell whether a critical point is a local minimum or a saddle point?

    • Mathematical approach: first understand the local shape of the loss function.
      • Around a critical point θ′, L(θ) can be written with a Taylor series approximation:
        L(θ) ≈ L(θ′) + (θ − θ′)ᵀg + ½(θ − θ′)ᵀH(θ − θ′)
        where g is the gradient and H is the Hessian at θ′. At a critical point g = 0, so the Hessian term determines the local shape.
      • Compute the Hessian H and use it to classify the critical point. Checking vᵀHv for every possible v is clearly impractical, so use the eigenvalues of H from linear algebra instead: all eigenvalues positive → local minimum; all negative → local maximum; mixed signs → saddle point. (A runnable sketch follows this list.)
  • If it is a saddle point, H also tells you how to escape:

    • in practice this is the slowest option, because computing the eigenvalues of H is very expensive
    • move along an eigenvector u of H whose eigenvalue λ is negative; updating θ = θ′ + u decreases the loss
    • a worked numerical example is given in the lecture
  • Summary: according to empirical studies, most of the time training is not actually stuck at a local minimum.
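As referenced above, a minimal sketch of the eigenvalue test, assuming a toy 2-parameter loss rather than a real network; `classify_critical_point` and the example Hessian are made up for illustration:

```python
import numpy as np

def classify_critical_point(H):
    """Classify a critical point from the eigenvalues of its Hessian H."""
    eigvals, eigvecs = np.linalg.eigh(H)   # H is symmetric, so eigh applies
    if np.all(eigvals > 0):
        return "local minimum", None
    if np.all(eigvals < 0):
        return "local maximum", None
    # Saddle point: an eigenvector with a negative eigenvalue points downhill.
    u = eigvecs[:, np.argmin(eigvals)]
    return "saddle point", u

# Toy example: L(w1, w2) = w1^2 - w2^2 has a saddle at the origin,
# with Hessian [[2, 0], [0, -2]].
H = np.array([[2.0, 0.0], [0.0, -2.0]])
kind, u = classify_critical_point(H)
print(kind, u)   # "saddle point", direction ~ [0, 1]: move along w2 to descend
```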

(2) Batch and momentum

Effect of batch size

  • A large batch size gives a more stable gradient; a small batch size produces noisier gradients.

  • Thanks to parallel GPU computation, a large batch size is not slower per update; it can actually be an advantage.

  • Optimization issue: on the training set, an overly large batch size also hurts.

    • with a large batch, training gets stuck at gradient = 0 more easily; the noisy gradients of small batches help escape critical points
  • Generalization: a small batch size also tends to perform better on the testing set.

    • one proposed explanation (still an open research question): the noise of small batches favors flat minima, which generalize better, while large batches tend to land in sharp minima
  • Each has pros and cons, so batch size is a hyperparameter to tune (a DataLoader sketch follows this list).
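A minimal sketch of treating batch size as a hyperparameter with PyTorch's DataLoader; the toy dataset and the two candidate sizes are made-up values for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up toy dataset: 1000 samples, 10 features each, binary labels.
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

# Batch size is a hyperparameter: small -> noisy gradients, many updates per
# epoch; large -> stable gradients, fewer (GPU-parallel) updates per epoch.
for batch_size in (16, 256):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    print(batch_size, "->", len(loader), "updates per epoch")
```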

momentum

  • Parameter updates borrow the physics of a ball rolling downhill: each movement is the previous movement (scaled by a weight λ) minus the learning rate times the current gradient, so inertia can carry the parameters past critical points (see the sketch below).
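A minimal sketch of gradient descent with momentum on a toy one-parameter loss; `eta` and `lam` are illustrative values, not the lecture's:

```python
# Toy loss L(theta) = theta^2, whose gradient is 2 * theta.
def grad(theta):
    return 2 * theta

theta, m = 5.0, 0.0          # initial parameter and movement
eta, lam = 0.1, 0.9          # learning rate and momentum weight (illustrative)
for _ in range(100):
    m = lam * m - eta * grad(theta)   # movement = last movement - lr * gradient
    theta = theta + m                 # update the parameter by the movement
print(theta)                          # spirals in toward the minimum at 0
```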

(3) Adapting the learning rate automatically

The problem of training getting stuck

  • As training updates the parameters, the loss keeps going down; but when the loss stops decreasing, the gradient is not necessarily small! So training is not always stuck at a critical point (local minimum or saddle point).
  • When gradient descent gets stuck, it is often not a critical point but a learning-rate problem: with a single fixed learning rate, the parameters may never reach the minimum.

Customizing the learning rate

  • Guiding principle: look at the error surface.

    • even a convex error surface can defeat one fixed learning rate: too large oscillates across the steep walls, too small barely moves along the flat floor
    • the error surface describes the relation between the parameters and the loss
    • in a flat direction: use a large learning rate
    • in a steep direction: use a small learning rate
  • How to implement this: Adagrad

    • consider a single parameter θᵢ (i indexes one parameter) as the example
      • divide the learning rate by a per-parameter σᵢ, the root mean square of that parameter's past gradients:
        θᵢᵗ⁺¹ = θᵢᵗ − (η / σᵢᵗ) gᵢᵗ,  σᵢᵗ = √( (1/(t+1)) Σₖ₌₀..ₜ (gᵢᵏ)² )
    • root mean square
      • the step is adjusted using every past gradient, re-computed at every update
      • this is the Adagrad method (a runnable sketch of the σ computations for Adagrad and RMSProp follows this list)
  • A refinement: RMSProp, which adjusts the learning rate more dynamically

    • motivation: Adagrad does not suit a complex error surface whose steepness changes over time, and it reacts more slowly than RMSProp
    • quirk: there is no paper to cite for this method, only Hinton's Coursera video
    • feature: a new hyperparameter α controls how much the newest gradient g counts:
      σᵢᵗ = √( α (σᵢᵗ⁻¹)² + (1 − α) (gᵢᵗ)² )
  • The most commonly used optimizer today: Adam

    • Adam = RMSProp + Momentum
      • PyTorch provides it as a ready-made package to study on your own
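As referenced above, a minimal sketch of the two σ computations, assuming scalar gradients; the gradient history and `alpha` are made-up values for illustration:

```python
import numpy as np

def adagrad_sigma(past_grads):
    # Adagrad: sigma is the root mean square of ALL past gradients.
    return np.sqrt(np.mean(np.square(past_grads)))

def rmsprop_sigma(sigma_prev, g, alpha=0.9):
    # RMSProp: alpha down-weights old gradients, so sigma reacts faster
    # when the steepness of the error surface changes.
    return np.sqrt(alpha * sigma_prev**2 + (1 - alpha) * g**2)

# Either way, the per-parameter update is: theta <- theta - (eta / sigma) * g
grads = [1.0, 0.8, 0.1]                  # made-up gradient history
print(adagrad_sigma(grads))              # ~0.74, dominated by the full history
print(rmsprop_sigma(0.9, grads[-1]))     # ~0.85, mostly remembers sigma_prev
```

In PyTorch, `torch.optim.Adam(model.parameters(), lr=1e-3)` combines this RMSProp-style σ with momentum.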

Effects of an adaptive learning rate

  • problem: after accumulating many small gradients, σ becomes tiny and the effective step suddenly explodes
  • solution: learning rate scheduling (a LambdaLR sketch follows this list)
    • add one of the following two mechanisms:
      • learning rate decay
        • as training nears the end, shrink the learning rate
      • warm up (a bit of black magic)
        • the learning rate first grows, then shrinks
        • it already appears in quite old papers
        • why it is needed is still not fully understood by the research community; see RAdam (arXiv:1908.03265)
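A minimal sketch of warm up followed by decay using PyTorch's LambdaLR; the placeholder model, `warmup_steps`, and `total_steps` are made-up values for illustration:

```python
import torch

model = torch.nn.Linear(10, 1)                       # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_steps, total_steps = 100, 1000                # illustrative values
def lr_factor(step):
    if step < warmup_steps:
        return step / warmup_steps                   # warm up: grow linearly
    # decay: shrink linearly toward 0 at total_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_factor)
# In the training loop, call sched.step() once per update, after opt.step().
```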

Optimization summary

Gradient descent, evolved

  • the vanilla update θᵢᵗ⁺¹ = θᵢᵗ − η gᵢᵗ becomes

    θᵢᵗ⁺¹ = θᵢᵗ − (η(t) / σᵢᵗ) mᵢᵗ

    • mᵢᵗ: momentum, a weighted sum of past gradients that keeps their directions
    • σᵢᵗ: root mean square of past gradients, which keeps only their magnitudes
    • η(t): the scheduled learning rate (decay or warm up)
    (a combined sketch follows)
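A minimal sketch of this combined update on a toy one-parameter loss; `eta`, `lam`, and `alpha` are illustrative values (this is essentially Adam without its bias correction):

```python
import numpy as np

def grad(theta):
    return 2 * theta                  # toy gradient of L(theta) = theta^2

theta, m, sigma2 = 5.0, 0.0, 0.0
eta, lam, alpha = 0.1, 0.9, 0.99      # lr, momentum weight, RMS weight
for t in range(200):
    g = grad(theta)
    m = lam * m + (1 - lam) * g                    # momentum: directions of past g
    sigma2 = alpha * sigma2 + (1 - alpha) * g**2   # RMS: magnitudes of past g
    theta -= eta / (np.sqrt(sigma2) + 1e-8) * m    # theta <- theta - eta/sigma * m
print(theta)                          # heads toward the minimum at 0
```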

(4) The loss function can matter too

classification

  • the full version is worked through in the lecture slides

  • for ease of explanation, this lecture defines:

    • ŷ is the correct answer; y is the network's raw output
    • one-hot vector: an n × 1 vector with a single 1 and 0s everywhere else
  • softmax(): converts the vector y, whose entries can be any values, into y′, whose entries lie between 0 and 1, so it can be compared with the one-hot ŷ (see the sketch after this list)

    • the input vector y is called the logits
    • benefits: it normalizes the values, and it widens the gaps between them
  • for binary classification, sigmoid is usually used instead

    • if you work it out, sigmoid is equivalent to softmax applied to the two-class case
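A minimal sketch of softmax on a made-up logit vector:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))   # subtract the max for numerical stability
    return e / e.sum()          # normalize so the outputs sum to 1

y = np.array([3.0, 1.0, -3.0])          # made-up logits
print(softmax(y))                       # ~[0.88, 0.12, 0.00]: gaps are widened
```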

loss of classification

  • two candidate losses for comparing ŷ with y′: mean squared error e = Σᵢ (ŷᵢ − yᵢ′)², and cross-entropy e = −Σᵢ ŷᵢ ln yᵢ′

  • PyTorch: when you call cross-entropy, softmax() is automatically folded into the last layer of the network, which is why you cannot find softmax anywhere in the TA's code XD

  • cross-entropy

    • commonly used for classification problems
    • it is a loss function
    • a mathematical proof of why it suits classification
      • see the earlier course videos
    • an example of why it suits classification, comparing the MSE and cross-entropy error surfaces (a PyTorch sketch follows this list)
      • in the MSE plot, the stuck region has a large loss but a very flat error surface, so the gradient is close to zero and training cannot move
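As referenced above, a minimal sketch of PyTorch's cross-entropy on made-up logits, showing that it takes raw logits directly (the softmax is built in):

```python
import torch

logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw network output, no softmax
target = torch.tensor([0])                  # index of the correct class
loss = torch.nn.CrossEntropyLoss()(logits, target)
print(loss.item())                          # -ln(softmax(logits)[0]) ~ 0.24
```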

Summary

Changing the loss function (flattening the error surface?) can change how hard the optimization is.

(5) A brief look at Batch Normalization

  • HW3 will use batch normalization (with a CNN)

changing landscape

  • one of the techniques for flattening the error surface

  • why such a hard-to-optimize (even if convex) error surface arises

    • when input features have very different ranges, the loss is steep along some parameters and flat along others
  • feature normalization

    • normalize each feature dimension to zero mean and unit variance; the normalized values are then coupled to each other through the shared μ and σ
    • a large enough batch size is needed to estimate μ and σ
    • learnable β and γ parameters are added back so the normalization does not overly constrain the model (a runnable sketch appears at the end of this section)
  • a problem at testing time

    • at inference there is often less than a full batch of data, so μ/σ cannot be computed and normalization seems impossible!
      • solution: PyTorch records the μ/σ of each training batch and uses their moving average at test time
  • batch normalization in CNNs

  • research on whether internal covariate shift actually matters

  • other ways to make the error surface less rugged

    • batch normalization was a serendipitous but useful discovery
    • other normalization methods exist
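As referenced above, a minimal sketch of the batch-normalization computation on one made-up batch of features; in practice `torch.nn.BatchNorm1d` / `BatchNorm2d` does this and also keeps the moving averages of μ/σ for test time:

```python
import torch

x = torch.randn(32, 16)                 # made-up batch: 32 samples, 16 features
mu = x.mean(dim=0)                      # per-feature mean over the batch
sigma = x.std(dim=0, unbiased=False)    # per-feature std over the batch
x_hat = (x - mu) / (sigma + 1e-5)       # normalize each feature dimension

# Learnable gamma/beta let the network undo the zero-mean/unit-variance
# constraint if that helps; a real layer learns these during training.
gamma, beta = torch.ones(16), torch.zeros(16)
y = gamma * x_hat + beta
```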

HW02

Usage and internals of Python's `with` statement