# [YT: Hung Yi Lee ML2021](https://www.youtube.com/watch?v=Ye018rCVvOo&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J)
- notes:
- **Do local minima truly cause the problem?** Local minima are a non-issue.
- ## L1 Predicting This Channel's View Count (Part 1) - Introduction to Basic Deep Learning Concepts
- Machine Learning ~ Looking for Function
- We can't write the function down ourselves, so we ask the machine to find it...
- Speech Recognition
- Image Recognition
- Playing GO
- Different types of Functions
- Regression: The function outputs a scalar
- PM2.5: (PM2.5 today, temperature, concentration of O3) $\Rightarrow$ PM2.5 of tomorrow
- Classification:
- SPAM filtering
- Go game:
- 19x19 $\Rightarrow$ Next move
- Structured Learning
- generate a picture, write an article; in essence, creation...
- ### Example:
- YouTube Channel
- f(backend data) = # of viewers tomorrow
1. Function with unknown parameters
- guess a math formula: **based on domain knowledge**
- $y = b + w\times x_1$
- $y$: no. of views of 2/26, $x_1$: no. of views of 2/25
- **$w, b$ are unknown**
- $w$ weight
- $b$ bias
- **Model**: $y = b + w\times x_1$
2. Define Loss from Training Data
- Loss is a function of parameters: $L(b, w)$
- loss: how good a set of values is.
- e.g. $L(0.5k, 1)$, i.e. $b = 0.5k$, $w = 1$
- data: 2017/01/01 ~ 2020/12/31
- $y = 0.5k + 1\times x_1$
- daily views: [4.8k, 4.9k, 7.5k, ..., 3.4k, 9.8k]
- labels $\hat{y}$ (next day's actual views): [4.9k, 7.5k, ...]
- predictions $y = 0.5k + 1\times x_1$: [5.3k, 5.4k, ...]
- $e_1 =|y-\hat{y}| = |5.3k - 4.9k| = 0.4k$
- $e_2 =|y-\hat{y}| = |5.4k - 7.5k| = 2.1k$
- ...
- $L=\frac{1}{N}\sum\limits_{n} e_n$
- smaller L is better
- L here: mean absolute error (MAE)
- mean squared error (MSE) is another common choice
3. Optimization
- $w^*, b^* = \arg\min\limits_{w, b} L$
- Gradient Descent (see the NumPy sketch at the end of this section)
- $w^i = w^{i-1} - \eta\nabla L$
- $w^i = w^{i-1} - \eta\frac{\partial L}{\partial w}\big|_{w=w^{i-1},b=b^{i-1}}$
- $b^i = b^{i-1} - \eta\frac{\partial L}{\partial b}\big|_{w=w^{i-1},b=b^{i-1}}$
- **Do local minima truly cause the problem?** Local minima are a non-issue.
- Machine Learning is so simple...(?)
- how about 2021 data?
- the prediction looks like the real data shifted by one day... which is reasonable, since the fitted model is $y = 0.1k + 0.97\times x_1$
- $L_{train} = 0.48k, L_{test} = 0.58k$
- but the data shows a weekly cycle, so we modify the model (this is domain knowledge):
- $y= b + \sum\limits_{j=1}^7 w_j x_j$
- $b^* = 0.05k, w_1^* =0.79, w_2^*=-0.31, w_3^* = 0.12, w_4^* = -0.01, w_5^* =-0.10, w_6^* =0.30, w_7^* = 0.18$
- $L = 0.38k$, $L^{'}=0.49k$ ($L^{'}$: loss on unseen 2021 data)
- $y= b + \sum\limits_{j=1}^{28} w_j x_j$
- $L = 0.33k, L^{'}=0.46k$
- $y= b + \sum\limits_{j=1}^{56} w_j x_j$
- $L = 0.32k, L^{'}=0.46k$
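A minimal NumPy sketch of steps 1–3 above (a model with unknown parameters, MAE loss, gradient descent on $w$ and $b$). The view counts, learning rate, and iteration count are made-up placeholders, not the course's data.

```python
import numpy as np

# made-up daily view counts in thousands (placeholders, not the course data)
views = np.array([4.8, 4.9, 7.5, 3.4, 9.8, 5.2, 6.1, 7.0])
x1 = views[:-1]      # today's views
y_hat = views[1:]    # tomorrow's views (labels)

# 1. function with unknown parameters: y = b + w * x1
w, b = 1.0, 0.5
eta = 0.001          # learning rate (placeholder)

for step in range(1000):
    y = b + w * x1                      # model prediction
    e = y - y_hat
    L = np.mean(np.abs(e))              # 2. loss: mean absolute error
    # 3. optimization: gradient descent on L(b, w)
    grad_w = np.mean(np.sign(e) * x1)   # dL/dw (subgradient of MAE)
    grad_b = np.mean(np.sign(e))        # dL/db
    w = w - eta * grad_w
    b = b - eta * grad_b

print(f"w* = {w:.3f}, b* = {b:.3f}, L = {L:.3f}k")
```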
- ## Above is Linear Model
- ## L2 Predicting This Channel's View Count (Part 2) - Introduction to Basic Deep Learning Concepts
- Linear model is too simple
- universality: piecewise linear functions can approximate any continuous curve, and each piece can be built from sigmoids
- $y = b + wx_1$
- $y = b + \sum\limits_{i} c_i \space sigmoid(b_i + w_i x_1)$
- $y = b + \sum\limits_{j} w_jx_j$
- $y = b + \sum\limits_{i} c_i \space sigmoid(b_i + \sum\limits_j w_{ij}x_j)$
- $w_{ij}$: weight for $x_j$ for $i$-th sigmoid
- ### Home Assignment:
1. draw network from
- $y = b + \sum\limits_{i} c_i \space sigmoid(b_i + \sum\limits_j w_{ij}x_j)$
- $w_{ij}$: weight for $x_j$ for $i$-th sigmoid
- $i = 1, 2, 3$
- $j = 1, 2, 3$
2. write down the above in matrix form (see the NumPy sketch at the end of this section)
- $y = b + \textbf{c}^T \textbf{a}$
- $\textbf{a} = \sigma(\textbf{r})$
- $\textbf{r} = \textbf{b} + \textbf{W}\textbf{x}$
- $y = b + \textbf{c}^T \space \sigma(\textbf{b} + \textbf{W}\textbf{x})$
- $L(\Theta)$
- $\Theta=\begin{bmatrix}\theta_{1} \\ \vdots\\ \theta_{n} \end{bmatrix}$
- $\Theta^* = \arg\min\limits_{\Theta} L$
- $\textbf{g} = \begin{bmatrix} \frac{\partial L}{\partial \theta_1}\big|_{\Theta = \Theta^n} \\ \frac{\partial L}{\partial \theta_2}\big|_{\Theta = \Theta^n} \\ \vdots \end{bmatrix}$
- $\textbf{g} = \nabla L(\Theta^n)$
- Batch: split the data into batches; compute the gradient and update on one batch at a time (one epoch = one pass over all batches)
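A small NumPy sketch of the matrix form $y = b + \textbf{c}^T \sigma(\textbf{b} + \textbf{W}\textbf{x})$ from the home assignment, with $i = j = 3$; the input values and the random initialization are made up for illustration.

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

rng = np.random.default_rng(0)

# three features and three sigmoids (i = j = 3), as in the home assignment
x = np.array([4.8, 4.9, 7.5])    # placeholder inputs
W = rng.normal(size=(3, 3))      # W[i, j]: weight of x_j for the i-th sigmoid
b_vec = rng.normal(size=3)       # b_i: bias inside each sigmoid
c = rng.normal(size=3)           # c_i: weight of the i-th sigmoid's output
b = rng.normal()                 # scalar output bias

r = b_vec + W @ x                # r = b + W x
a = sigmoid(r)                   # a = sigma(r)
y = b + c @ a                    # y = b + c^T a
print(y)
```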
# L3 General Guidance (任務攻略)
- Loss on Training data first
- loss is large:
- model bias: make the model more complex / flexible
- or an optimization issue: see L4, 5, 6, 7
- Loss on Training data is small
- loss on testing data is small: good!!
- loss on testing data is large:
- overfitting
- more data
- constrained model (e.g. $y = ax^2+b$, CNN)
- fewer parameters, fewer features, regularization, dropout, early stopping
- train and select your models on the training/validation data; don't abuse the public testing set (see the sketch at the end of this section)
- [how to select your final models in a kaggle competition](https://www.chioka.in/how-to-select-your-final-models-in-a-kaggle-competitio/)
- mismatch:
- training and testing data come from different distributions
- e.g. real objects in the training data but cartoon pictures in the test data: transfer learning
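A tiny sketch of the advice above: hold out a validation split from the training data and pick the model by validation loss instead of repeatedly probing the public test set. The split ratio, the candidate models, and the random data are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# placeholder data: 100 examples, 7 features (e.g. views of the past 7 days)
X = rng.normal(size=(100, 7))
y = X @ rng.normal(size=7) + 0.1 * rng.normal(size=100)

# hold out 20% of the *training* data as a validation set
idx = rng.permutation(len(X))
val, train = idx[:20], idx[20:]

def fit_least_squares(Xt, yt):
    w, *_ = np.linalg.lstsq(Xt, yt, rcond=None)
    return w

# hypothetical candidate models: all 7 features vs. only the most recent day
candidates = {"7-day model": X, "1-day model": X[:, :1]}
scores = {}
for name, feats in candidates.items():
    w = fit_least_squares(feats[train], y[train])
    scores[name] = np.mean(np.abs(feats[val] @ w - y[val]))  # MAE on the validation split

print(min(scores, key=scores.get), scores)   # select the model by validation loss only
```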
# L4 Optimization: When gradient is small
- when the gradient is zero, learning stops: we have hit a **critical point**
- critical points:
- **local minimum: no way to go**
- **saddle point: can escape**
- Taylor-expand $L(\theta)$ around $\theta = \theta ^{'}$
- $L(\theta) \approx L(\theta^{'}) + (\theta-\theta^{'})^T g + \frac{1}{2}(\theta-\theta^{'})^TH(\theta-\theta^{'})$
- the first term is $L(\theta^{'})$
- the second term uses the gradient to account for the distance between $\theta$ and $\theta^{'}$, but with curvature it cannot fully capture the difference
- so the third term is added for a closer approximation
- $H$ is the Hessian matrix
- let $v = (\theta-\theta^{'})$
- the third term is written as $\frac{1}{2}v^THv$
- at critical points: g = 0
- Conclusion: if $H$ is positive definite or negative definite, $\theta^{'}$ is not a saddle point
- Conclusion: if the eigenvalues of $H$ have both positive and negative values, $\theta^{'}$ is a saddle point
- $v^THv > 0$ for all $v$ (all eigenvalues positive): local minimum
- $v^THv < 0$ for all $v$ (all eigenvalues negative): local maximum
- Conclusion: at a saddle point, find a negative eigenvalue $\lambda$ of $H$; adding the corresponding eigenvector $u$ to $\theta^{'}$ moves in a direction that lowers the loss, so we can escape...
- but in practice nobody computes the Hessian via second derivatives, nor its eigenvalues and eigenvectors
- Empirical observation: it is hard to ever see minimum ratio = 1
- $\text{Minimum ratio} = \frac{\text{Number of positive eigenvalues}}{\text{Number of eigenvalues}}$
- i.e. however complex the space, true local minima are rare; critical points are mostly saddle points
- ### Patrick: what is this part actually saying? (see the NumPy sketch below)
- ### Remember that in practice we always use gradient descent (implemented with backpropagation), so the second-order Taylor expansion and the Hessian above are never actually computed during training.
- ### The second-order Taylor expansion, the Hessian, and the linear-algebra derivation are there to show that a critical point is a local minimum only when all eigenvalues (of $H$) are positive, and a local maximum only when all eigenvalues are negative. $\Rightarrow$ When some eigenvalues are positive and some negative while the first derivative (gradient) is zero, the point is a SADDLE POINT. Saddle points are easy to escape with small batch sizes.
- ### Experiments show that critical points are rarely LOCAL MINIMA...
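A toy NumPy sketch of the eigenvalue test described above, on the made-up loss $L(\theta) = \theta_1^2 - \theta_2^2$, whose critical point at the origin is a saddle point; the escape step size is arbitrary.

```python
import numpy as np

# toy loss with a critical point at the origin: L = theta1^2 - theta2^2
def L(theta):
    return theta[0] ** 2 - theta[1] ** 2

theta_c = np.array([0.0, 0.0])           # critical point (gradient = 0 here)
H = np.array([[2.0, 0.0],                # Hessian of L (constant for this quadratic)
              [0.0, -2.0]])

eigvals, eigvecs = np.linalg.eigh(H)
print(eigvals)                           # mixed signs -> saddle point

# escape: move along an eigenvector whose eigenvalue is negative
neg = np.argmin(eigvals)
u = eigvecs[:, neg]
theta_new = theta_c + 0.1 * u            # arbitrary small step
print(L(theta_c), L(theta_new))          # loss decreases
```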
# L5 BATCH AND MOMENTUM
- if we have 20 examples
- **a.** batch size = N, full batch
- **b.** batch size = 1
- **a.** updates after seeing all 20 examples: each update seems to take a **LONG** time, but is POWERFUL
- **b.** updates after each example: each update takes a **SHORT** time, but is NOISY
- but with a GPU (parallel computation), **a.** does not actually take longer per update.
- #### So we might think: since each update takes about the same time, of course we should pick the POWERFUL **a.**, i.e. the full batch or larger batches
- But... with a full batch or very large batches, the batch loss function varies little between updates, so it is hard to escape saddle points.
- Not only during training: models trained with small batches also tend to give better results at test time.
- With small batches we are less likely to get stuck at sharp minima: for a narrow, sharp minimum, the noisy, ever-changing update direction of small batches easily nudges us out with a small sideways step.
- Small-batch training thus tends to end up in the valley of a flat minimum, which is not sensitive to the small shift between the training and testing loss surfaces.
- momentum: the update is the previous movement plus the scaled negative gradient, $m^t = \lambda m^{t-1} - \eta g^t$, $\theta^t = \theta^{t-1} + m^t$, which helps roll past small bumps and critical points (see the NumPy sketch below)
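A minimal sketch of mini-batch gradient descent with momentum, using the update $m^t = \lambda m^{t-1} - \eta g^t$, $\theta^t = \theta^{t-1} + m^t$; the toy regression data, batch size, $\lambda$, and $\eta$ are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                 # 20 examples, as in the example above
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=20)

theta = np.zeros(2)
m = np.zeros(2)                              # movement of the last step
eta, lam, batch_size = 0.1, 0.9, 4           # placeholder hyper-parameters

for epoch in range(50):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        g = 2 * Xb.T @ (Xb @ theta - yb) / len(batch)  # gradient of MSE on this batch
        m = lam * m - eta * g                # previous movement + current (negative) gradient
        theta = theta + m                    # update along the accumulated movement

print(theta)                                 # roughly recovers [3, -2]
```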
# L6 Error surface is rugged: adaptive learning rate
- critical points are not necessarily the biggest obstacle during training
- history = model.fit()
- **The loss converges at some level and stops decreasing, so the gradient must be small??? Have you recorded the norm of the gradient? Have you actually observed it?**
- **The teacher's point: in our experiments we only ever look at the loss; we never log the gradient, and we never check whether the gradient is actually small when training gets stuck...**
- sometimes we are just bouncing back and forth between the two walls of a valley
- $\text{Training stuck} \neq \text{Small gradient}$
- in fact, plain gradient descent rarely even reaches a critical point.
- Learning rate cannot be one-size-fits-all
- different parameters need different learning rates
- in a flat direction: large learning rate
- in a steep direction: small learning rate
- (vanilla) gradient descent (for one parameter)
- $\theta_i^{t+1} \leftarrow \theta_i^{t} - \eta \space g_i^t$
- $g_i^t = \frac{\partial L}{\partial\theta_i}|_{\theta = \theta^t}$
- Adagrad: divide by the root mean square of all past gradients
- $\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta}{\sigma_i^t}g_i^t$
- $\sigma_i^0 = \sqrt{(g_i^0)^2}$
- $\sigma_i^t = \sqrt{\frac{1}{t+1}\sum\limits_{\tau=0}^t(g_i^\tau)^2}$
- the above assumes a relatively smooth error surface.
- next: letting the learning rate adapt dynamically, for very rough surfaces
- RMSProp, introduced in a Coursera lecture; **we have a hyper-parameter $\alpha$**
- $\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta}{\sigma_i^t}g_i^t$
- $\sigma_i^0 = \sqrt{(g_i^0)^2}$
- $\sigma_i^1 = \sqrt{\alpha(\sigma_i^0)^2 + (1-\alpha)(g_i^1)^2}$
- $\sigma_i^2 = \sqrt{\alpha(\sigma_i^1)^2 + (1-\alpha)(g_i^2)^2}$
- $\sigma_i^t = \sqrt{\alpha(\sigma_i^{t-1})^2 + (1-\alpha)(g_i^t)^2}$
- $0 < \alpha < 1$
- larger $\alpha$: weight the whole history more; smaller $\alpha$: weight the current gradient more
- small $\alpha$ reacts faster
- Adam : RMSProp + Momentum
- [paper](https://arxiv.org/pdf/1412.6980.pdf)
- $m_0$ for momentum
- $v_0$ for RMSProp
- with Adagrad the accumulated statistics can make the step suddenly explode $\Leftarrow$ wait, what is that...
- learning rate scheduling
- $\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta^t}{\sigma_i^t}g_i^t$
- note that $\eta^t$ now depends on time (the iteration)
- learning rate decay
- **Warm Up**: increase the learning rate first, then decrease it
- [[warm up at residual network]](https://arxiv.org/abs/1512.03385): ...So we use 0.01 to warm up the training until the training error is below 80% (about 400 iterations), and then go back to 0.1 and continue training...
- [[warm up at transformer]](https://arxiv.org/abs/1706.03762)
- possibly because at the start $\sigma_i^t$ is estimated from only a few values, so it is not yet statistically representative and its variance may be too large...
- [[RAdam]](https://arxiv.org/abs/1908.03265): The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam
- Summary of Optimization
- (vanilla) gradient descent (for one parameter)
- $\theta_i^{t+1} \leftarrow \theta_i^{t} - \eta \space g_i^t$
- Various Improvements
- $\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta ^t}{\sigma_i^t} \space m_i^t$
- $m_i^t$ Momentum: weighted sum of the previous gradients
- The teacher notes: don't assume that dividing by $\sigma_i^t$ and multiplying by $m_i^t$ cancel out and do nothing. $\sigma_i^t$ only accumulates gradient magnitudes, while $m_i^t$ has both magnitude and direction! (see the NumPy sketch after the links below)
- [[further study by TA]](https://youtu.be/4pUmZ8hXlHM)
- [[further study by TA]](https://youtu.be/e03YKGHXnL8)
- [[can we change the error surface (rugged-surface image source)]](https://arxiv.org/abs/1712.09913)
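A sketch of the summary update $\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta^t}{\sigma_i^t} m_i^t$, combining a momentum term $m$, an RMSProp-style $\sigma$, and learning rate decay (roughly Adam without bias correction); the toy loss and the hyper-parameters are placeholder assumptions.

```python
import numpy as np

# toy elongated error surface: L = 100 * t0^2 + t1^2 (very different curvature per parameter)
def grad(theta):
    return np.array([200.0 * theta[0], 2.0 * theta[1]])

theta = np.array([1.0, 1.0])
m = np.zeros(2)                        # momentum: weighted sum of previous gradients (has direction)
sigma2 = np.zeros(2)                   # RMSProp accumulator (magnitude only)
eta0, lam, alpha, eps = 0.1, 0.9, 0.99, 1e-8   # placeholder hyper-parameters

for t in range(1, 501):
    g = grad(theta)
    m = lam * m + (1 - lam) * g                       # m_i^t
    sigma2 = alpha * sigma2 + (1 - alpha) * g ** 2    # (sigma_i^t)^2
    eta_t = eta0 / np.sqrt(t)                         # learning rate decay (scheduling)
    theta = theta - eta_t * m / (np.sqrt(sigma2) + eps)

print(theta)                                          # approaches the minimum at (0, 0)
```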
# L7 Classification
- ### [HY Lee: Classification, ML2016](/9nVOfiIvTW-9SMhCcsAm4Q)
- regression
- logistic regression
- one hot encoding
- $y = b + c^T \sigma(b + W x)$
- $y = b^{'} + W^{'}\sigma(b + W x)$
- $\hat y \Leftrightarrow y^{'} = \mathrm{softmax}(y)$
- softmax: $y_i^{'} = \frac{e^{y_i}}{\sum_j e^{y_j}}$
- the gap between large and small values gets amplified
- Loss of Classification
- mse
- cross-entropy
- $e = -\sum\limits_i \hat y_i \ln y_i^{'}$
- **Minimizing cross-entropy** is equivalent to **maximizing likelihood** (see the NumPy sketch below)
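A small NumPy sketch of softmax and the cross-entropy loss as defined above; the logits and the one-hot label are made-up numbers.

```python
import numpy as np

def softmax(y):
    y = y - np.max(y)                 # numerical stability (does not change the result)
    exp_y = np.exp(y)
    return exp_y / exp_y.sum()

y = np.array([3.0, 1.0, -2.0])        # network outputs (logits), made up
y_hat = np.array([1.0, 0.0, 0.0])     # one-hot ground truth

y_prime = softmax(y)                  # y'_i = exp(y_i) / sum_j exp(y_j)
e = -np.sum(y_hat * np.log(y_prime))  # cross-entropy: e = -sum_i yhat_i * ln y'_i
print(y_prime, e)
```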
# L8 Batch Normalization
- Changing Landscape
- Feature Normalization
- samples $x^1, x^2, \ldots, x^R$; superscript = sample, subscript = feature dimension
- for dimension $i$, collect $[x_i^1, x_i^2, x_i^3, \ldots, x_i^R]$ across all samples $\rightarrow$ mean $m_i$, standard deviation $\sigma_i$
- (one kind of) Feature Normalization: $\tilde{x}_i^r = \frac{x_i^r-m_i}{\sigma_i}$
- i.e. every feature dimension is normalized
- $\tilde{x}^1 \rightarrow W^1 \rightarrow z^1 \rightarrow \text{sigmoid} \rightarrow a^1 \rightarrow W^2 \ldots$
- it seems we should also normalize $z$ or $a$; either works
- with sigmoid, normalizing $z$ is recommended
- ...
- in fact, in practice we use batch normalization, i.e. normalize over the examples within a batch (see the NumPy sketch at the end of these notes)...
- $\tilde z^i = \frac{z^i-\mu}{\sigma}$
- $\hat z^i= \gamma\odot\tilde z^i + \beta$
- $\gamma$ and $\beta$ are learnable parameters
- $\gamma$ is initialized to the all-ones vector, $\beta$ to the zero vector
- $\tilde{x}^1 \rightarrow W^1 \rightarrow z1 \rightarrow \tilde z^1 \rightarrow \hat z^1 ...$
- during testing we don't have a batch; how do we compute the mean and standard deviation?
- moving average during training.
- $\mu^1, \mu^2, ... \mu^t$
- $\bar \mu \leftarrow p\bar\mu + (1-p)\mu^t$
- using $\bar \mu$ and $\bar \sigma$ to have $\tilde z$
- i.e. $\tilde x \rightarrow W^1 \rightarrow z \rightarrow \tilde z = \frac{z-\bar \mu}{\bar \sigma}$
- [[paper: Batch Normalization]](https://arxiv.org/abs/1502.03167)
- pink curve: sigmoid activations: training fails without normalization, but works with it.
- blue: "x5" means 5 times the learning rate; it works well, since the error surface is smoother, which suits a large learning rate
- [[paper: How Does Batch Normalization Help Optimization?]](https://arxiv.org/abs/1805.11604)
- Internal Covariate Shift? When we update the first block's parameters to $A^{'}$ and it now outputs $a^{'}$, the simultaneously updated $B^{'}$ was suited to $a$, not to $a^{'}$ ($x \rightarrow A \rightarrow a \rightarrow B \rightarrow b$)
- How does batch normalization help optimization?
- it argues that internal covariate shift is not the key factor
- experiments confirm that with BN the error surface is smoother
- batch normalization
- Layer Normalization
- Instance Normalization
- Weight Normalization
- ....
- Later note: it turns out WGAN uses spectral normalization... there seems to be no escaping normalization
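A minimal NumPy sketch of batch normalization at one layer, including the moving averages used at test time (as described in L8); the batch size, the moving-average factor $p$, and the random data are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1e-5
p = 0.9                                   # moving-average factor (placeholder)

gamma = np.ones(4)                        # initialized to the all-ones vector
beta = np.zeros(4)                        # initialized to the zero vector
mu_bar = np.zeros(4)                      # running mean for testing
sigma_bar = np.ones(4)                    # running std for testing

def bn_train(z):
    """z: (batch, features) pre-activations for one batch."""
    global mu_bar, sigma_bar
    mu = z.mean(axis=0)
    sigma = z.std(axis=0)
    z_tilde = (z - mu) / (sigma + eps)    # normalize within the batch
    mu_bar = p * mu_bar + (1 - p) * mu    # moving averages for inference
    sigma_bar = p * sigma_bar + (1 - p) * sigma
    return gamma * z_tilde + beta         # z_hat = gamma ⊙ z_tilde + beta

def bn_test(z):
    z_tilde = (z - mu_bar) / (sigma_bar + eps)
    return gamma * z_tilde + beta

z_batch = rng.normal(loc=5.0, scale=2.0, size=(16, 4))
print(bn_train(z_batch).mean(axis=0))     # ~ 0 per feature after normalization
print(bn_test(rng.normal(size=(1, 4))))   # uses the running statistics
```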