# [YT: Hung Yi Lee ML2021](https://www.youtube.com/watch?v=Ye018rCVvOo&list=PLJV_el3uVTsMhtt7_Y6sgTHGHp1Vb2P2J)
- notes:
- **Do local minima truly cause the problem?** Local minima are a non-issue.
- ## L1 Predicting This Channel's View Count (Part 1) - Introduction to Basic Deep Learning Concepts
- Machine Learning ~ Looking for Function
- We can't write the function down ourselves, so we ask the machine to find it...
- Speech Recognition
- Image Recognition
- Playing GO
- Different types of Functions
- Regression: The function outputs a scalar
- PM2.5: (PM2.5 today, temperature, concentration of O3) $\Rightarrow$ PM2.5 of tomorrow
- Classification:
- SPAM filtering
- Go game:
- 19x19 $\Rightarrow$ Next move
- Structured Learning
- generate a picture, write an article; in essence, creation...
- ### Example:
- YouTube Channel
- f(backend data) = # of viewers tomorrow
1. Function with unknown parameters
- guess a math formula: **based on domain knowledge**
- $y = b + w\times x_1$
- $y$: no. of views of 2/26, $x_1$: no. of views of 2/25
- **$w, b$ are unknown**
- $w$ weight
- $b$ bias
- **Model**: $y = b + w\times x_1$
2. Define Loss from Training Data
- Loss is a function of parameters: $L(b, w)$
- loss: how good a set of values is.
- e.g. $L(0.5k, 1)$, i.e. $b = 0.5k$, $w = 1$
- data: 2017/01/01 ~ 2020/12/31
- $y = 0.5k + 1\times x_1$
- daily views: [4.8k, 4.9k, 7.5k, ..., 3.4k, 9.8k]
- labels $\hat{y}$ (next day's actual views): [4.9k, 7.5k, ...]
- predictions $y = 0.5k + 1\times x_1$: [5.3k, 5.4k, ...]
- $e_1 =|y-\hat{y}| = |5.3k - 4.9k| = 0.4k$
- $e_2 =|y-\hat{y}| = |5.4k - 7.5k| = 2.1k$
- ...
- $L=\frac{1}{N}\sum\limits_{n} e_n$
- smaller L is better
- L here: mean absolute error (MAE)
- mean squared error (MSE) is another common choice
3. Optimization
- $w^*, b^* = \arg\min\limits_{w, b} L$
- Gradient Descent (see the NumPy sketch at the end of this section)
- $w^i = w^{i-1} - \eta\nabla L$
- $w^i = w^{i-1} - \eta\frac{\partial L}{\partial w}\big|_{w=w^{i-1},b=b^{i-1}}$
- $b^i = b^{i-1} - \eta\frac{\partial L}{\partial b}\big|_{w=w^{i-1},b=b^{i-1}}$
- **Do local minima truly cause the problem?** Local minima are a non-issue.
- Machine Learning is so simple...(?)
- how about 2021 data?
- the prediction looks like the real data shifted by one day... which is reasonable, since the fitted model is $y = 0.1k + 0.97\times x_1$
- $L_{train} = 0.48k, L_{test} = 0.58k$
- but the data shows a weekly cycle, so we modify the model (this is domain knowledge):
- $y= b + \sum\limits_{j=1}^7 w_j x_j$
- $b^* = 0.05k, w_1^* =0.79, w_2^*=-0.31, w_3^* = 0.12, w_4^* = -0.01, w_5^* =-0.10, w_6^* =0.30, w_7^* = 0.18$
- $L = 0.38k$, $L^{'}=0.49k$ ($L^{'}$: loss on unseen 2021 data)
- $y= b + \sum\limits_{j=1}^{28} w_j x_j$
- $L = 0.33k, L^{'}=0.46k$
- $y= b + \sum\limits_{j=1}^{56} w_j x_j$
- $L = 0.32k, L^{'}=0.46k$
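A minimal NumPy sketch of steps 1–3 above (a model with unknown parameters, MAE loss, gradient descent on $w$ and $b$). The view counts, learning rate, and iteration count are made-up placeholders, not the course's data.

```python
import numpy as np

# made-up daily view counts in thousands (placeholders, not the course data)
views = np.array([4.8, 4.9, 7.5, 3.4, 9.8, 5.2, 6.1, 7.0])
x1 = views[:-1]      # today's views
y_hat = views[1:]    # tomorrow's views (labels)

# 1. function with unknown parameters: y = b + w * x1
w, b = 1.0, 0.5
eta = 0.001          # learning rate (placeholder)

for step in range(1000):
    y = b + w * x1                      # model prediction
    e = y - y_hat
    L = np.mean(np.abs(e))              # 2. loss: mean absolute error
    # 3. optimization: gradient descent on L(b, w)
    grad_w = np.mean(np.sign(e) * x1)   # dL/dw (subgradient of MAE)
    grad_b = np.mean(np.sign(e))        # dL/db
    w = w - eta * grad_w
    b = b - eta * grad_b

print(f"w* = {w:.3f}, b* = {b:.3f}, L = {L:.3f}k")
```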
- ## Above is Linear Model
- ## L2 Predicting This Channel's View Count (Part 2) - Introduction to Basic Deep Learning Concepts
- Linear model is too simple
- universality: piecewise linear functions can approximate any continuous curve, and each piece can be built from sigmoids
- $y = b + wx_1$
- $y = b + \sum\limits_{i} c_i \space sigmoid(b_i + w_i x_1)$
- $y = b + \sum\limits_{j} w_jx_j$
- $y = b + \sum\limits_{i} c_i \space sigmoid(b_i + \sum\limits_j w_{ij}x_j)$
- $w_{ij}$: weight for $x_j$ for $i$-th sigmoid
- ### Home Assignment:
1. draw network from
- $y = b + \sum\limits_{i} c_i \space sigmoid(b_i + \sum\limits_j w_{ij}x_j)$
- $w_{ij}$: weight for $x_j$ for $i$-th sigmoid
- $i = 1, 2, 3$
- $j = 1, 2, 3$
2. write down the above in matrix form (see the NumPy sketch at the end of this section)
- $y = b + \textbf{c}^T \textbf{a}$
- $\textbf{a} = \sigma(\textbf{r})$
- $\textbf{r} = \textbf{b} + \textbf{W}\textbf{x}$
- $y = b + \textbf{c}^T \space \sigma(\textbf{b} + \textbf{W}\textbf{x})$
- $L(\Theta)$
- $\Theta=\begin{bmatrix}\theta_{1} \\ \vdots\\ \theta_{n} \end{bmatrix}$
- $\Theta^* = \arg\min\limits_{\Theta} L$
- $\textbf{g} = \begin{bmatrix} \frac{\partial L}{\partial \theta_1}\big|_{\Theta = \Theta^n} \\ \frac{\partial L}{\partial \theta_2}\big|_{\Theta = \Theta^n} \\ \vdots \end{bmatrix}$
- $\textbf{g} = \nabla L(\Theta^n)$
- Batch: split the data into batches; compute the gradient and update on one batch at a time (one epoch = one pass over all batches)
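A small NumPy sketch of the matrix form $y = b + \textbf{c}^T \sigma(\textbf{b} + \textbf{W}\textbf{x})$ from the home assignment, with $i = j = 3$; the input values and the random initialization are made up for illustration.

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

rng = np.random.default_rng(0)

# three features and three sigmoids (i = j = 3), as in the home assignment
x = np.array([4.8, 4.9, 7.5])    # placeholder inputs
W = rng.normal(size=(3, 3))      # W[i, j]: weight of x_j for the i-th sigmoid
b_vec = rng.normal(size=3)       # b_i: bias inside each sigmoid
c = rng.normal(size=3)           # c_i: weight of the i-th sigmoid's output
b = rng.normal()                 # scalar output bias

r = b_vec + W @ x                # r = b + W x
a = sigmoid(r)                   # a = sigma(r)
y = b + c @ a                    # y = b + c^T a
print(y)
```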
# L3 General Guidance (任務攻略)
- Loss on Training data first
- loss is large:
- model bias: make the model more complex / flexible
- or an optimization issue: see L4, 5, 6, 7
- Loss on Training data is small
- loss on testing data is small: good!!
- loss on testing data is large:
- overfitting
- more data
- constrained model (e.g. $y = ax^2+b$, CNN)
- fewer parameters, fewer features, regularization, dropout, early stopping
- train and select your models on the training/validation data; don't abuse the public testing set (see the sketch at the end of this section)
- [how to select your final models in a kaggle competition](https://www.chioka.in/how-to-select-your-final-models-in-a-kaggle-competitio/)
- mismatch:
- training and testing data come from different distributions
- e.g. real objects in the training data but cartoon pictures in the test data: transfer learning
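A tiny sketch of the advice above: hold out a validation split from the training data and pick the model by validation loss instead of repeatedly probing the public test set. The split ratio, the candidate models, and the random data are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# placeholder data: 100 examples, 7 features (e.g. views of the past 7 days)
X = rng.normal(size=(100, 7))
y = X @ rng.normal(size=7) + 0.1 * rng.normal(size=100)

# hold out 20% of the *training* data as a validation set
idx = rng.permutation(len(X))
val, train = idx[:20], idx[20:]

def fit_least_squares(Xt, yt):
    w, *_ = np.linalg.lstsq(Xt, yt, rcond=None)
    return w

# hypothetical candidate models: all 7 features vs. only the most recent day
candidates = {"7-day model": X, "1-day model": X[:, :1]}
scores = {}
for name, feats in candidates.items():
    w = fit_least_squares(feats[train], y[train])
    scores[name] = np.mean(np.abs(feats[val] @ w - y[val]))  # MAE on the validation split

print(min(scores, key=scores.get), scores)   # select the model by validation loss only
```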
# L4 Optimization: When gradient is small
- when the gradient is zero, learning stops: we have hit a **critical point**
- critical points:
- **local minimum: no way to go**
- **saddle point: can escape**
- Taylor-expand $L(\theta)$ around $\theta = \theta ^{'}$
- $L(\theta) \approx L(\theta^{'}) + (\theta-\theta^{'})^T g + \frac{1}{2}(\theta-\theta^{'})^TH(\theta-\theta^{'})$
- the first term is $L(\theta^{'})$
- the second term uses the gradient to account for the distance between $\theta$ and $\theta^{'}$, but with curvature it cannot fully capture the difference
- so the third term is added for a closer approximation
- $H$ is the Hessian matrix
- let $v = (\theta-\theta^{'})$
- the third term is written as $\frac{1}{2}v^THv$
- at critical points: g = 0
- Conclusion: if $H$ is positive definite or negative definite, $\theta^{'}$ is not a saddle point
- Conclusion: if the eigenvalues of $H$ have both positive and negative values, $\theta^{'}$ is a saddle point
- $v^THv > 0$ for all $v$ (all eigenvalues positive): local minimum
- $v^THv < 0$ for all $v$ (all eigenvalues negative): local maximum
- Conclusion: at a saddle point, find a negative eigenvalue $\lambda$ of $H$; adding the corresponding eigenvector $u$ to $\theta^{'}$ moves in a direction that lowers the loss, so we can escape...
- but in practice nobody computes the Hessian via second derivatives, nor its eigenvalues and eigenvectors
- Empirical observation: it is hard to ever see minimum ratio = 1
- $\text{Minimum ratio} = \frac{\text{Number of positive eigenvalues}}{\text{Number of eigenvalues}}$
- i.e. however complex the space, true local minima are rare; critical points are mostly saddle points
- ### Patrick: what is this part actually saying? (see the NumPy sketch below)
- ### Remember that in practice we always use gradient descent (implemented with backpropagation), so the second-order Taylor expansion and the Hessian above are never actually computed during training.
- ### The second-order Taylor expansion, the Hessian, and the linear-algebra derivation are there to show that a critical point is a local minimum only when all eigenvalues (of $H$) are positive, and a local maximum only when all eigenvalues are negative. $\Rightarrow$ When some eigenvalues are positive and some negative while the first derivative (gradient) is zero, the point is a SADDLE POINT. Saddle points are easy to escape with small batch sizes.
- ### Experiments show that critical points are rarely LOCAL MINIMA...
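A toy NumPy sketch of the eigenvalue test described above, on the made-up loss $L(\theta) = \theta_1^2 - \theta_2^2$, whose critical point at the origin is a saddle point; the escape step size is arbitrary.

```python
import numpy as np

# toy loss with a critical point at the origin: L = theta1^2 - theta2^2
def L(theta):
    return theta[0] ** 2 - theta[1] ** 2

theta_c = np.array([0.0, 0.0])           # critical point (gradient = 0 here)
H = np.array([[2.0, 0.0],                # Hessian of L (constant for this quadratic)
              [0.0, -2.0]])

eigvals, eigvecs = np.linalg.eigh(H)
print(eigvals)                           # mixed signs -> saddle point

# escape: move along an eigenvector whose eigenvalue is negative
neg = np.argmin(eigvals)
u = eigvecs[:, neg]
theta_new = theta_c + 0.1 * u            # arbitrary small step
print(L(theta_c), L(theta_new))          # loss decreases
```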
# L5 BATCH AND MOMENTUM
- if we have 20 examples
- **a.** batch size = N, full batch
- **b.** batch size = 1
- **a.** updates after seeing all 20 examples: each update seems to take a **LONG** time, but is POWERFUL
- **b.** updates after each example: each update takes a **SHORT** time, but is NOISY
- but with a GPU (parallel computation), **a.** does not actually take longer per update.
- #### So we might think: since each update takes about the same time, of course we should pick the POWERFUL **a.**, i.e. the full batch or larger batches
- But... with a full batch or very large batches, the batch loss function varies little between updates, so it is hard to escape saddle points.
- Not only during training: models trained with small batches also tend to give better results at test time.
- With small batches we are less likely to get stuck at sharp minima: for a narrow, sharp minimum, the noisy, ever-changing update direction of small batches easily nudges us out with a small sideways step.
- Small-batch training thus tends to end up in the valley of a flat minimum, which is not sensitive to the small shift between the training and testing loss surfaces.
- momentum: the update is the previous movement plus the scaled negative gradient, $m^t = \lambda m^{t-1} - \eta g^t$, $\theta^t = \theta^{t-1} + m^t$, which helps roll past small bumps and critical points (see the NumPy sketch below)
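A minimal sketch of mini-batch gradient descent with momentum, using the update $m^t = \lambda m^{t-1} - \eta g^t$, $\theta^t = \theta^{t-1} + m^t$; the toy regression data, batch size, $\lambda$, and $\eta$ are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                 # 20 examples, as in the example above
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=20)

theta = np.zeros(2)
m = np.zeros(2)                              # movement of the last step
eta, lam, batch_size = 0.1, 0.9, 4           # placeholder hyper-parameters

for epoch in range(50):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        g = 2 * Xb.T @ (Xb @ theta - yb) / len(batch)  # gradient of MSE on this batch
        m = lam * m - eta * g                # previous movement + current (negative) gradient
        theta = theta + m                    # update along the accumulated movement

print(theta)                                 # roughly recovers [3, -2]
```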
# L6 Error surface is rugged: adaptive learning rate
- critical points are not necessarily the biggest obstacle during training
- history = model.fit()
- **The loss converges at some level and stops decreasing, so the gradient must be small??? Have you recorded the norm of the gradient? Have you actually observed it?**
- **The teacher's point: in our experiments we only ever look at the loss; we never log the gradient, and we never check whether the gradient is actually small when training gets stuck...**
- sometimes we are just bouncing back and forth between the two walls of a valley
- $\text{Training stuck} \neq \text{Small gradient}$
- in fact, plain gradient descent rarely even reaches a critical point.
- Learning rate cannot be one-size-fits-all
- different parameters need different learning rates
- in a flat direction: large learning rate
- in a steep direction: small learning rate
- (vanilla) gradient descent (for one parameter)
- $\theta_i^{t+1} \leftarrow \theta_i^{t} - \eta \space g_i^t$
- $g_i^t = \frac{\partial L}{\partial\theta_i}|_{\theta = \theta^t}$
- Adagrad: divide by the root mean square of all past gradients
- $\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta}{\sigma_i^t}g_i^t$
- $\sigma_i^0 = \sqrt{(g_i^0)^2}$
- $\sigma_i^t = \sqrt{\frac{1}{t+1}\sum\limits_{\tau=0}^t(g_i^\tau)^2}$
- the above assumes a relatively smooth error surface.
- next: letting the learning rate adapt dynamically, for very rough surfaces
- RMSProp, introduced in a Coursera lecture; **we have a hyper-parameter $\alpha$**
- $\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta}{\sigma_i^t}g_i^t$
- $\sigma_i^0 = \sqrt{(g_i^0)^2}$
- $\sigma_i^1 = \sqrt{\alpha(\sigma_i^0)^2 + (1-\alpha)(g_i^1)^2}$
- $\sigma_i^2 = \sqrt{\alpha(\sigma_i^1)^2 + (1-\alpha)(g_i^2)^2}$
- $\sigma_i^t = \sqrt{\alpha(\sigma_i^{t-1})^2 + (1-\alpha)(g_i^t)^2}$
- $0 < \alpha < 1$
- larger $\alpha$: weight the whole history more; smaller $\alpha$: weight the current gradient more
- small $\alpha$ reacts faster
- Adam : RMSProp + Momentum
- [paper](https://arxiv.org/pdf/1412.6980.pdf)
- $m_0$ for momentum
- $v_0$ for RMSProp
- with Adagrad the accumulated statistics can make the step suddenly explode $\Leftarrow$ wait, what is that...
- learning rate scheduling
- $\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta^t}{\sigma_i^t}g_i^t$
- note that $\eta^t$ now depends on time (the iteration)
- learning rate decay
- **Warm Up**: increase the learning rate first, then decrease it
- [[warm up at residual network]](https://arxiv.org/abs/1512.03385): ...So we use 0.01 to warm up the training until the training error is below 80% (about 400 iterations), and then go back to 0.1 and continue training...
- [[warm up at transformer]](https://arxiv.org/abs/1706.03762)
- possibly because at the start $\sigma_i^t$ is estimated from only a few values, so it is not yet statistically representative and its variance may be too large...
- [[RAdam]](https://arxiv.org/abs/1908.03265): The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam
- Summary of Optimization
- (vanilla) gradient descent (for one parameter)
- $\theta_i^{t+1} \leftarrow \theta_i^{t} - \eta \space g_i^t$
- Various Improvements
- $\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta ^t}{\sigma_i^t} \space m_i^t$
- $m_i^t$ Momentum: weighted sum of the previous gradients
- The teacher notes: don't assume that dividing by $\sigma_i^t$ and multiplying by $m_i^t$ cancel out and do nothing. $\sigma_i^t$ only accumulates gradient magnitudes, while $m_i^t$ has both magnitude and direction! (see the NumPy sketch after the links below)
- [[further study by TA]](https://youtu.be/4pUmZ8hXlHM)
- [[further study by TA]](https://youtu.be/e03YKGHXnL8)
- [[can we change the error surface (rugged-surface image source)]](https://arxiv.org/abs/1712.09913)
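A sketch of the summary update $\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta^t}{\sigma_i^t} m_i^t$, combining a momentum term $m$, an RMSProp-style $\sigma$, and learning rate decay (roughly Adam without bias correction); the toy loss and the hyper-parameters are placeholder assumptions.

```python
import numpy as np

# toy elongated error surface: L = 100 * t0^2 + t1^2 (very different curvature per parameter)
def grad(theta):
    return np.array([200.0 * theta[0], 2.0 * theta[1]])

theta = np.array([1.0, 1.0])
m = np.zeros(2)                        # momentum: weighted sum of previous gradients (has direction)
sigma2 = np.zeros(2)                   # RMSProp accumulator (magnitude only)
eta0, lam, alpha, eps = 0.1, 0.9, 0.99, 1e-8   # placeholder hyper-parameters

for t in range(1, 501):
    g = grad(theta)
    m = lam * m + (1 - lam) * g                       # m_i^t
    sigma2 = alpha * sigma2 + (1 - alpha) * g ** 2    # (sigma_i^t)^2
    eta_t = eta0 / np.sqrt(t)                         # learning rate decay (scheduling)
    theta = theta - eta_t * m / (np.sqrt(sigma2) + eps)

print(theta)                                          # approaches the minimum at (0, 0)
```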
# L7 Classification
- ### [HY Lee: Classification, ML2016](/9nVOfiIvTW-9SMhCcsAm4Q)
- regression
- logistic regression
- one hot encoding
- $y = b + c^T \sigma(b + W x)$
- $y = b^{'} + W^{'}\sigma(b + W x)$
- $\hat y \Leftrightarrow y^{'} = \mathrm{softmax}(y)$
- softmax: $y_i^{'} = \frac{e^{y_i}}{\sum_j e^{y_j}}$
- the gap between large and small values gets amplified
- Loss of Classification
- mse
- cross-entropy
- $e = -\sum\limits_i \hat y_i \ln y_i^{'}$
- **Minimizing cross-entropy** is equivalent to **maximizing likelihood** (see the NumPy sketch below)
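A small NumPy sketch of softmax and the cross-entropy loss as defined above; the logits and the one-hot label are made-up numbers.

```python
import numpy as np

def softmax(y):
    y = y - np.max(y)                 # numerical stability (does not change the result)
    exp_y = np.exp(y)
    return exp_y / exp_y.sum()

y = np.array([3.0, 1.0, -2.0])        # network outputs (logits), made up
y_hat = np.array([1.0, 0.0, 0.0])     # one-hot ground truth

y_prime = softmax(y)                  # y'_i = exp(y_i) / sum_j exp(y_j)
e = -np.sum(y_hat * np.log(y_prime))  # cross-entropy: e = -sum_i yhat_i * ln y'_i
print(y_prime, e)
```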
# L8 Batch Normalization
- Changing Landscape
- Feature Normalization
- samples $x^1, x^2, \ldots, x^R$; superscript = sample, subscript = feature dimension
- for dimension $i$, collect $[x_i^1, x_i^2, x_i^3, \ldots, x_i^R]$ across all samples $\rightarrow$ mean $m_i$, standard deviation $\sigma_i$
- (one kind of) Feature Normalization: $\tilde{x}_i^r = \frac{x_i^r-m_i}{\sigma_i}$
- i.e. every feature dimension is normalized
- $\tilde{x}^1 \rightarrow W^1 \rightarrow z^1 \rightarrow \text{sigmoid} \rightarrow a^1 \rightarrow W^2 \ldots$
- it seems we should also normalize $z$ or $a$; either works
- with sigmoid, normalizing $z$ is recommended
- ...
- in fact, in practice we use batch normalization, i.e. normalize over the examples within a batch (see the NumPy sketch at the end of these notes)...
- $\tilde z^i = \frac{z^i-\mu}{\sigma}$
- $\hat z^i= \gamma\odot\tilde z^i + \beta$
- $\gamma$ and $\beta$ are learnable parameters
- $\gamma$ is initialized to the all-ones vector, $\beta$ to the zero vector
- $\tilde{x}^1 \rightarrow W^1 \rightarrow z1 \rightarrow \tilde z^1 \rightarrow \hat z^1 ...$
- during testing we don't have a batch; how do we compute the mean and standard deviation?
- moving average during training.
- $\mu^1, \mu^2, ... \mu^t$
- $\bar \mu \leftarrow p\bar\mu + (1-p)\mu^t$
- using $\bar \mu$ and $\bar \sigma$ to have $\tilde z$
- i.e. $\tilde x \rightarrow W^1 \rightarrow z \rightarrow \tilde z = \frac{z-\bar \mu}{\bar \sigma}$
- [[paper: Batch Normalization]](https://arxiv.org/abs/1502.03167)
- pink curve: sigmoid activations: training fails without normalization, but works with it.
- blue: "x5" means 5 times the learning rate; it works well, since the error surface is smoother, which suits a large learning rate
- [[paper: How Does Batch Normalization Help Optimization?]](https://arxiv.org/abs/1805.11604)
- Internal Covariate Shift? When we update the first block's parameters to $A^{'}$ and it now outputs $a^{'}$, the simultaneously updated $B^{'}$ was suited to $a$, not to $a^{'}$ ($x \rightarrow A \rightarrow a \rightarrow B \rightarrow b$)
- How does batch normalization help optimization?
- it argues that internal covariate shift is not the key factor
- experiments confirm that with BN the error surface is smoother
- batch normalization
- Layer Normalization
- Instance Normalization
- Weight Normalization
- ....
- Later note: it turns out WGAN uses spectral normalization... there seems to be no escaping normalization
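A minimal NumPy sketch of batch normalization at one layer, including the moving averages used at test time (as described in L8); the batch size, the moving-average factor $p$, and the random data are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 1e-5
p = 0.9                                   # moving-average factor (placeholder)

gamma = np.ones(4)                        # initialized to the all-ones vector
beta = np.zeros(4)                        # initialized to the zero vector
mu_bar = np.zeros(4)                      # running mean for testing
sigma_bar = np.ones(4)                    # running std for testing

def bn_train(z):
    """z: (batch, features) pre-activations for one batch."""
    global mu_bar, sigma_bar
    mu = z.mean(axis=0)
    sigma = z.std(axis=0)
    z_tilde = (z - mu) / (sigma + eps)    # normalize within the batch
    mu_bar = p * mu_bar + (1 - p) * mu    # moving averages for inference
    sigma_bar = p * sigma_bar + (1 - p) * sigma
    return gamma * z_tilde + beta         # z_hat = gamma ⊙ z_tilde + beta

def bn_test(z):
    z_tilde = (z - mu_bar) / (sigma_bar + eps)
    return gamma * z_tilde + beta

z_batch = rng.normal(loc=5.0, scale=2.0, size=(16, 4))
print(bn_train(z_batch).mean(axis=0))     # ~ 0 per feature after normalization
print(bn_test(rng.normal(size=(1, 4))))   # uses the running statistics
```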