Machine Learning 筆記（ 1 - 8 ）

# Machine Learning 筆記（ 1 - 8 ） ###### tags : `shaimejump520`, `machine learning` ##### contributed by <`江家銘`> [參考影片：Machine Learning (Hung-yi Lee, NTU)](https://www.youtube.com/watch?v=CXgbekl66jc&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49) --- ## ML Lecture 0-1 : Introduction ---- ### 人工智慧 Artificial Intelligence（目標） * 早已提出但沒有好的方法 * Hand-crafted rule ，人類設定好規則，機器依尋規則做出回應 * 缺點：永遠無法超越創造者 ---- ### 機器學習 Machine Learning（手段） * 1980年代被提出 * Write a progarm for learning，**looking for a function from data** ---- * Frame work (以 supervise 為例) 1. 準備 a set of function (Model) $\ f_{1},\ f_{2}\cdot\cdot\ \cdot$，內有成千上萬種function 2. 將 training data 作為 input 送入 model 中 3. 從中挑出表現最好的 function $\ f^{*}$ 4. 再使用 testing data 作為 $\ f^{*}$ 的 input，看 output 是否符合預期 step1-3 為 training ，step4 為 testing ---- * Regression * The output of the target function $f$ is "scalar" ---- * Classification * Binary Classification output : **yes** or **no** example : Gmail 垃圾郵件偵測 * Multi-class Classificition output : **Class 1, Class2, ... , Class N** example : 新聞分類（政治，財經，體育...） ---- * Model * Linear Model * Non-linear Model * Deep learing ---- * Supervised Learning * 需要大量 training data * 而 training data 中 input 與 output(or label) 之間的關係，往往需要靠人工雕塑出來 * Hard to collect a large amount of labelled data ---- * Semi-supervised Learning * 有少量的 labelled data 和大量的 unlabelled data * unlabelled data 也對 learning 有幫助 ---- * Transfer Learning * 有少量的、與 learning 目標有關的 labelled data * 和大量的可能 labelled 、可能 unlabelled 的 **無關目標之data** ---- * Unsupervised Learning * 空有 output 而沒有 input * example * machine reading 給予大量的文章，machine 自行判斷出文字的含意 * machine drawing 給大量動物照片，結果能輸出自己畫出之動物照片 ---- * Structure Learning 輸出為具有結構性、較複雜之物件 * 翻譯 * 語音辨識 ---- * Reinforcement Learning * Alphago * Supervised v.s. Reinforcement | Supervised | Reinforcement | | --------------- | ------------- | | Learning from teacher| Learning from critics | | 根據棋譜決定下一步棋應落於何處 | 下棋 -> 贏了 -> 這樣下很棒！ | ---- * Learning Map ![](https://i.imgur.com/4qryhrL.png) --- ## ML Lecture 1 : Regression ---- ### Regression * example * Stock Market Forecast output 可以為道瓊工業平均指數，為一 scalar * Self-driving Car input 汽車週邊情況，output 方向盤角度 * Recommendation input 使用者及商品，output 購買可能性 ---- * Example Application ![](https://i.imgur.com/scPh2mm.png) ---- * Step 1 : Model * function set : $y=b+w\cdot x_{cp}$ $f_{1}:y=10.0+9.0\cdot x_{cp}$ $f_{2}:y=9.8+9.2\cdot x_{cp}$ $f_{3}:y=-0.8-1.2\cdot x_{cp}$（會發現此組不可能發生，因為 cp 值為負） ... ... (無窮多組) ---- * Step 1 : Model * 上面舉的 modle 為一 Linear Model : $y = b + \sum{w_{i}x_{i}}$ * $x_{i}$ : an attribute of input x, also called "Feature" * $w_{i}$ : weight * $b$ : bias ---- * Step 2 : Goodness of Function * ![](https://i.imgur.com/1C1NfjS.png =300x300) ---- * Step 2 : Goodness of Function * Training data : 10 pokemons $(x^1, \hat{y}^1)$ $(x^2, \hat{y}^2)$ $\ \ \ \ \ \vdots$ $(x^{10}, \hat{y}^{10})$ * Loss function $L$ : 任何可微分的 function，用以衡量 model 中 function 的好壞 input : a function, output : how bad it is $\begin{split} L(f) &= L(w, b)\\ &= \displaystyle\sum_{n=1}^{10}{(\hat{y}^n-(b+w\cdot x^n_{cp}))^2} \end{split}$ ---- * Graph of Loss function ![](https://i.imgur.com/luUIxEN.png =400x400) * 圖上每一點都代表一個 function ，一組 (w,b) * 顏色愈接近紅色代表 loss 愈大，表現較差，反之愈接近紫色代表 loss 愈小，是個好 function ---- * Step 3-1 : Best Function * Pick the "Best" Function $\begin{split}f^* &= arg \ \underset{f}{min} \ L(f)\\ w^*, b^* &= arg \ \underset{w, b}{min} \ L(f)\\ &=arg\ \underset{w, b}{min} \displaystyle\sum_{n=1}^{10}{(\hat{y}^n-(b+w\cdot x^n_{cp}))^2}\end{split}$ ---- * Step 3-1 : Gradient Descent * 先任意取一點初始值 $w^0$，求該點的導數，可知在 $w^0$ 附近的 loss 值是 increasing or decreasing * 為了取得較低的 loss value，我們使用公式： $w^1 = w^0 - \eta\dfrac{dL}{dw}|_{w = w^0}$ 其中 $\eta$ 為 **learning rate** ，代表學習的效率（其值愈大，學習速度愈快） * 可以發現，loss value 更新的幅度取決於 $w^0$ 的導數值，以及 learning rate，$w^0$ 前後斜率愈陡峭，learning rate 愈大，則 loss value 更新的就愈快 ---- * Step 3-1 : Gradient Descent * 在經過多次 iteration 後，我們可以得到 $w^T$，此處具有 loss function 的 local minima，(雖然好像看起來不是 global minima 感覺不是很爽，但後來會發現，在 linear regression 中是沒有 local minima 的) ---- * Step 3-1 : Gradient Descent * ![](https://i.imgur.com/zECmUqJ.png ) 以上是只有一個參數的狀況下，以下討論同時有 w, b 兩個參數的狀況（事實上沒有什麼差別） ---- * Step 3-1 : Gradient Descent * 先任意選取初始值 $w^0, b^0$ * 計算 $\tfrac{\partial{L}}{\partial{w}}|_{w=w^0, b=b^0},\ \tfrac{\partial{L}}{\partial{b}}|_{w=w^0, b=b^0}$ $w^1=w^0-\eta\tfrac{\partial{L}}{\partial{w}}|_{w=w^0, b=b^0},\ b^1=b^0-\eta\tfrac{\partial{L}}{\partial{b}}|_{w=w^0, b=b^0}$ * 經過多次 iteration ，可以取得相對較小的 loss value ---- * Step 3-1 : Gradient Descent * 將上述圖像化，可發現逐步的靠近圖中紫色部份（也就是 loss value 較低之處） ![](https://i.imgur.com/oNXOeuR.png) ---- * Step 3-1 : Gradient Descent * 前述提及，使用 Gradiant Descent 只能找到 local optimal solution 而非 global optimal solution，然而，in Linear Regression， loss function L is convex, i.e. No local optimal. ---- * Result * 根據 Gradiant Descent，最終取得的結果如下 ![](https://i.imgur.com/F3SIVHQ.png) ---- * Result 可看出 Regression 的結果及與 training data 的誤差，然而，這並非我們真正想要知道的，我們真正關心的是使用Regression 的結果用在其他 data 是否準確？ ---- * Result * 取另外 10 組資料作為 testing data 下去測試 ![](https://i.imgur.com/05oiQOR.png) ---- * Result 發現與 regression 結果大致符合，而在誤差方面，則是相較於 training data 稍大一些，由於 function 是依據 training data 產生，所以 testing data 的 error 較大實屬正常 ---- * Improvement * Select another Model ---- * 二次式：$y = b + w_{1}\cdot x_{cp} + w_{2}\cdot (x_{cp})^2$ 重複前述步驟，結果如下： ![](https://i.imgur.com/lmvkGBI.png) 相較先前已有許多進步， ---- * 三次式結果如下： ![](https://i.imgur.com/V5ron5O.png) 依舊有所進步，然而進步幅度稍緩， ---- * 再到四次式、五次式，會發現，對於 training data 愈來愈符合，然而在 testing data 表現卻愈來愈糟糕 ---- ![](https://i.imgur.com/feIOElZ.png) ---- * 上圖為五次式的結果，可以明顯地看出 regression 的結果已經走樣了，更加符合了 training data，卻在 testing data 表現相當之差 * Model selection * 挑選愈複雜的 model 其所包含的 function set 愈大，當然愈可能找到更加符合 training data 的 function，然而其結果卻不見得在 testing data 上得到好的表現 * 上述狀況稱為 overfitting * 應該選擇較 suitable 的 model ---- * Regularization $y = b + \sum w_{i}x_{i}$ $L = \displaystyle\sum_{n}(\hat y^n-(b+\sum w_{i}x_{i}))^2+$ $\lambda\sum(w_{i})^2$ * $\lambda\sum(w_{i})^2$ 使結果更加平滑，$\lambda$ 值愈大，對 training data 來說，error 也愈大，然後相對的，testing data 的 error 也會因此變小（但當 $\lambda$ 值過大時，會造成曲線過於平滑，error 反而上升） ---- * 而特別的是，在做 Regularization 時，我們透過操控 $\lambda$ 來影響 weight 參數的權重，但是卻不對 bias 做處理（如果真的那麼做了，Regularization 的結果可能會更糟），原因是 bias 主要控制平移，與平滑化無關，而 Regularization 對 testing data 的主要益處便在於平滑化 --- > [time= Apr , 2018] --- ## ML Lecture 2 : Where does the error come from? ---- ### Bias and Variance of Estimator * 這邊要先複習一下**機率與統計** * 當我們要在大量的 data 中找到平均值的時候，無法使用講所有的值相加取平取，這時候我們會使用 sample 的手法，隨機取 N 筆 data 算出平均值 m ，重複取 n 組，得到 n 個平均值 $m_1,m_2,\cdots,m_n$，再取這 n 個平均值的期望值 m，這時我們可以認為，這個期望值會很接近我們所需要的全部 data 的平均值。 ---- * 而用上述得到的 n 個平均值及 m ，我們可以用差平方的平均值求出變異數 $s^2$，取 $s^2$ 的期望值再乘以 $\frac{N}{N-1}$ 即可估算出 $\sigma^2$ * 詳細數學計算如下： $variable : 𝑥 \ \ \ \ \ \ mean\ of\ x : \mu \ \ \ \ \ \ \ variance\ of\ x : \sigma^2\\ Sample\ N\ points : \{x^1, x^2, \cdots,x^N\}\\ m = \dfrac{1}{N}\displaystyle\sum_n{x^n}\neq\mu\\ E[m]=E[\dfrac{1} {N}\displaystyle\sum_n{x^n}]=\dfrac{1}{N}\displaystyle\sum_n{E[x^n]}=\mu$ ---- * $s^2 = \dfrac{1}{N}\displaystyle\sum_n{(x^n-m)^2}\\ E[s^2] = \dfrac{N-1}{N}\sigma^2$ * 我們可以根據式子發現，若取的 N 值愈大，則我們每組的 m 及 s 會分佈的相對較集中 ---- * **bias** 及 **variance** 皆是我們估計目標的錯誤來源 ![](https://i.imgur.com/mcCCs6f.png) ---- ### Variance * 在相同 model 之下，給予多組不同 testing data ，我們可以找出多個不同的 $f^*$ ，而且當 model 愈複雜時，結果呈現的愈分散 ![](https://i.imgur.com/prMnxsx.png =650x400) ---- ![](https://i.imgur.com/MR72X1t.png =800x500) * 愈簡單的 model 愈不容易受抽樣的 data 不同而有所影響 ---- ### Bias $E[f^*]=\bar{f}$ ![](https://i.imgur.com/KY5XSHr.png =700x450) 會發現，愈複雜的 model 得到的 $\bar{f}$ 愈接近 $\hat{f}$ ---- ![](https://i.imgur.com/DLt0Ctw.png) ---- * 咎其原因，當我們選定 model 時，function set 也隨之決定了，若我們挑選到的 model 本身便不包含 target function $\ \hat{f}$（如上圖左），那麼我們所求的的 $\bar{f}$ 也就不可能跟 $\hat{f}$ 太接近，因而造成了較大的 bias。反之則如上圖右的狀況，較大的 function set 較能導致較小的 bias ---- ![](https://i.imgur.com/toK4x5J.png) ---- * 縱軸為 error ，橫軸為 model 複雜度，我們可以觀察到簡單的 model 會有較大的 bias 但會有較小之 variance，而叫複雜的 model 則會有較小的 bias 卻有較大的 variance。綜合兩者的影響，我們會得到我們真正的 error，為求得到好的 error ，我們需要平衡地挑選 model * 若 model 過於複雜導致 error 上升，此一現象稱為 **Overfitting** * 而 model 過於簡單也會導致 error 上身，此現象則稱為 **Underfitting** ---- ### Diagnosis 在嘗試解決 bias 或是 variance 的問題之前，我們首先調知道我們目件所測試的 model 他處於哪個狀態，是 bias 大、underfitting 呢？還是 variance 大、overfitting 了呢？ * 若是我們所挑中的 model 連 training data 都沒有辦法很好的 fit ，那麼我們就會認為，我們有很大的 bias ，也就是 underfitting 了 * 若是我們所挑選的 model 對 training data 有很好的表現，然而卻對 testing data 表現十分不理想，那我們就會認為，我們現在有很大的 variance ，已經 overfitting 了 ---- ### What to do with large bias? * 重新設計 model * 增加更多的 features 當作 input * 提升 model 的冪次，使 model 更加複雜 ---- ### What to do with large variance? ---- * More data * 假設原本的取樣只取 10 個 data，那們增加取樣的數量，我們所得到的 $f^*$ 的 variance 就會下降（此部份前面複習機率與統計有提及） * 此舉十分有效，然而在現實中，data 的收集十分不易，所以難以執行 * （題外話），雖然 data 收集不易，但我們可以用些小手段自己製造出所需的 data，如：於手寫辨識時，將原本擁有的筆跡做小角度旋轉，摹擬不同人寫字時角度有所不同的現象，自造出更多的 data 用於 training ，其他還有諸如圖像辨識、語音辨識等等皆能應用類似的手法稱生所需的 data ---- * Regularization * 先前的內容有提及，主要概念是降低 input 的權重，使結果的曲線能較平滑一些 ![](https://i.imgur.com/e6bcc9y.png) * 但是此舉等同於影響了 function set（變成只收集較平滑的 function），有可能會破壞 bias，所以在使用 regularization 時，必須注意調整的 weight ，要在 bias 與 variance 間取得平衡 ---- ### Model Selection * 必須要在 bias 與 variance 間取得平衡，得到最小的error * What we should not to do : ![](https://i.imgur.com/3jcxEWv.png) ---- * * 我們在挑選 model 時，往往是將 training data 分別放入多個不同的 model 中，各自挑出一個 target function ，再利用 testing data 做測試，挑選其中 error 最小的那個 model 作為所需然而，我們所擁有的 testing set 跟現實中的 testing set 又會有所差距，所以我們所挑選的那個 model 不見得會是在現實生活中表現的最好的那個，那我們所做的一切不就都白費功夫了嗎將問題放到競賽層面，我們擁有的 testing data 就是 public testing set，實際上的 testing data 就是 private testing set，相同的問題一樣會產生 ---- * Cross Validation ![](https://i.imgur.com/kHRXk6Y.png) ---- * * 先將 training set 分為兩組，一組仍為 training set，另一組為 validation set * 使用 training set 去做 training 找出各組 modle 的 $f^*$ ，再使用 validation set 驗證各個 $f^*$ 的 error * 挑選 error 較小的那個 function 作為我們的 target function（或是可以用該 model 再用整組 training set 做一次 training ，可能會用的比較安心XD） ---- * * 使用挑出來的 model 應用到 public data 上所得到的結果，會比使用 cross validation 之前更加接近在 private 上的結果 * 注意！！別因為 model 在 public 上表現不好，而對 model 做調整，否則就只是挑選出了一個適合 public data 的 model ，在 private data 中不見得會有比較好的表現（這也算是一種 bias） ---- * N-fold Cross Validation ![](https://i.imgur.com/KBCZtU8.png) ---- * * 將 training set 分為 N 組，每次挑其中一組作為 validation set ，其他的做為 training set * 如此一來總共要做 N 次，每個 model 每次會有一個 error ，對每次的 error 取平均後，再挑選 model * 用完整的 training set 去對選出的 model 做 training 再對 public data 及 private data 做 testing --- ## ML Lecture 3-1 : Gradient Descent ---- ### Review Gradient Descent 進行 machine learning 時，第一步選定 model ，第2步決定 loss function ，第3步用 gradient descent 逐步逼近我們所需的 target function * $\theta^* = arg\ \underset{\theta}{min}\ L(\theta)\ \ \ \ \ L:loss\ function\ \ \ \ \ \theta:parameters$ $Suppose\ the\ \theta\ has\ two\ variables\ \{\theta_1,\theta_2\}$ $\nabla L(\theta) = \left( \begin{array}{c} \partial L(\theta_1) / \partial \theta_1\\ \partial L(\theta_2) / \partial \theta_2\\ \end{array} \right) \ \ \ \ \ \ \ \theta^0 = \left( \begin{array}{c} \theta^0_1\\ \theta^0_2\\ \end{array} \right)$ $\theta^1=\theta^0 - \eta\nabla L(\theta^0)$ $\theta^2=\theta^1 - \eta\nabla L(\theta^1)$ ---- ![](https://i.imgur.com/JyhXSA4.png) ---- ### Tip 1 : Tuning your learning rates * learning rate 的大小需適中，太小會造成效率緩慢，太大則有可能永遠無法到達我們所需的 local minimal ---- * ![](https://i.imgur.com/l4V2Mu6.png) 如上圖，過大的 learning rate 有可能導致 loss 的下降卡住 ---- #### Adaptive Learning Rates * Popular & Simple Idea : 逐漸將 learning rate 調降 * 剛開始的 learning rate 可以大，增加效率 * 再經過多輪後，我們應該已經靠近我的的目標，此時需要調降我們的 learning rate 避免我們錯過 loss 低點 * E.g. 1/t decay: $\eta^𝑡 = \eta/\sqrt{t+1}$ * Learning rate cannot be one-size-fits-all * 每個不同的參數，應該擁有自己獨立的 learning rate ---- #### Adagrad * Divide the learning rate of each parameter by the **root mean square of its previous derivatives** * $w^{t+1} = w^t - \dfrac{\eta^t}{\sigma^t}g^t$ * $\eta^t = \dfrac{\eta}{\sqrt{t+1}}\ \ \ \ \ \ g^t = \dfrac{\partial L(\theta^t)}{\partial w}\ \ \ \ \ \ \sigma^t$ : 先前所有 w 的方均根 ---- * * $w^1 = w^0 - \dfrac{\eta^0}{\sigma^0}g^0\ \ \ \ \ \ \ \ \sigma^0 = \sqrt{(g^0)^2}$ $w^2 = w^1 - \dfrac{\eta^1}{\sigma^1}g^1\ \ \ \ \ \ \ \ \sigma^1 = \sqrt{\dfrac{1}{2}[(g^0)^2+(g^1)^2]}$ $w^3 = w^2 - \dfrac{\eta^2}{\sigma^2}g^2\ \ \ \ \ \ \ \ \sigma^2 = \sqrt{\dfrac{1}{3}[(g^0)^2+(g^1)^2+(g^2)^2]}$ $\ \ \ \ \ \ \ \ \ \ \ \ \ \vdots$ $w^{t+1} = w^t - \dfrac{\eta^t}{\sigma^t}g^t\ \ \ \ \ \ \ \ \sigma^t = \sqrt{\dfrac{1}{t+1}\sum_{i=0}^t{(g^i)^2}}$ $w^{t+1} = w^t - \dfrac{\eta^t}{\sigma^t}g^t = w^t - \dfrac{\eta}{\sqrt{\sum_{i=0}^t{(g^i)^2}}}g^t$ ---- * ![](https://i.imgur.com/LU5YTE9.png) 在一般的 gradient descent 中，$g^t$ 項（也就是 gradient）愈大，loss 的跨度通常就比較大，然而在 Adagrad 中，就不一定會是這樣了 ---- * adagrad 利用 learning rate 的分母項使 gradient 產生反差 ![](https://i.imgur.com/iDy3Xox.png) 而如此的反差可以使 adaptive learning rate 更加合適 ---- * 在單一參數的情況下，一階導數愈大，代表離最小值愈遠 ![](https://i.imgur.com/5dS4HSO.png =600x350) 注意，$x_0$ 與 minima 的距離為 $\frac{|2ax_0 + b|}{2a}$ ---- * 但在超過一個參數的情況下就不一定了 ![](https://i.imgur.com/u42rHcT.png) 如上圖，a 雖然比 c 還要小，但是距離最小值卻遠的多 ---- * 所以我們引入二階導數作為分母，消除不同參數間的差異 ![](https://i.imgur.com/iONywXW.png =600x350) 現在，我們的最佳 step 為 $\dfrac{|一階導數|}{\ \ 二階導數\ \ }$ ---- * 如此一來，原先只使用一階導數所產生的問題就有了解決 ![](https://i.imgur.com/tqibgqS.png =600x300) 原先我們認為，因為 c 大於 a，所以應該距離 minima 較遠，然而事實卻非如此。如因我們引入二階導數作為分母，c 的二階導數大於 a，在結果上，a 的 $\tfrac{|一階導數|}{\ \ 二階導數\ \ }$ 會比 c 的要小 ---- * 而在較複雜的 model 上，計算二階微分往往耗費大量時間，所以我們選擇使用 $\sqrt{(一階微分)^2\ \ \ \ }$ 來近似 ![](https://i.imgur.com/7O0fLhh.png =600x350) 至此，對於 adagrad 的 learning rate 選擇有了詳盡的解釋 ---- ### Tip 2 : Stochastic Gradient Descent Make the training faster * Gradient Descent 的 loss 會加總所有 example 的和 * $L = \displaystyle\sum_n{(\hat{y}^n - (b + \sum{w_ix_i^n}))^2}\ \ \ \ \theta^i=\theta^{i-1} - \eta\nabla L(\theta^{i-1})$ * Stochastic Gradient Descent 每次只看一個 example $x^n$ 的 loss ，所以速度會快上許多（若有 n 個 example 那就快了 n 倍） * $L^n = (\hat{y}^n - (b + \sum{w_ix_i^n}))^2\ \ \ \ \ \ \ \ \ \theta^i=\theta^{i-1} - \eta\nabla L^n(\theta^{i-1})$ ---- ![](https://i.imgur.com/WZbXupi.png =600x300) * Gradient Descent 較平穩，但效率較差 * Stochastic Gradient Descent 速度快上了多，但是較不平穩，光靠一個 example 不一定能找到我們所需的 function ，但我們有許多 examples ，可以逐步朝目標告靠近 ---- ### Tip 3 : Feature Scaling * 不同參數的分佈範圍可能很不一樣，所以我們需要做 scaling ，使範圍較接近 ![](https://i.imgur.com/BEi7GS4.png) ---- ![](https://i.imgur.com/R1h2lrG.png =600x350) * 左圖為沒有做 scaling 的範例，可發現 $w_1$ 對 $y$ 的影響較小，所以圖形中 $w_1$ 方向叫平滑，整體呈現橢圓狀，事實上，如此的圖形在做 gradient descent 時會比較困難，必須得要用到 adagrad 等 adaptive learning rate 的手法 ---- ![](https://i.imgur.com/R1h2lrG.png =600x350) * 右圖為做完 scaling 的範例，可以發現，如果是對他做 gradient descent 會很容易且快速（直接向著圓心走） ---- #### 如何做到 Feature Scaling 其實方法有千千百百種，以下介紹其中最簡單的一種。 ---- * 假設今天有 R 組 examples $\{x^1, x^2, \cdots, x^R\}$，每組 example 都有許多 component $\{x^r_1, x^r_2, \cdots\}$ ![](https://i.imgur.com/ih7FOrz.png) 其實就是一個機率統計 normalize 的過程 ---- ### Gradient Descent Theory * 當我們在做 Gradient Descent 的時候，不見得 $L(\theta^0) > L(\theta^1) > L(\theta^2) > \cdots$ ---- #### 探討 Why Gradient Descent works? * 如果天有一個初始點，我們應該可以很容易的在初始點周圍很近的地方(圖中紅色圈圈)，找到一個具有 loss 最小值的點，作為下一個起始點，如此不斷朝目標點邁進 ![](https://i.imgur.com/RRPoncp.png =600x400) ---- * **Taylor Series** : Let h(x) be any function infinitely differentiable around $x = x_0$ $\begin{split}h(x) &= \displaystyle\sum_{k=0}^\infty{\frac{h^{(k)}(x_0)}{k!}}(x-x_0)^k \\ &= h(x_0) + h'(x_0)(x-x_0) + \frac{h''(x_0)}{2!}(x-x_0)^2 + \cdots \\ &\approx h(x_0) + h'(x_0)(x-x_0) ,\ when\ x\ is\ colse\ to\ x_0\end{split}$ 我們可以在 $x_0$ 周圍得到很好的近似 ---- * Multivariable Taylor Series： $h(x,y) = h(x_0, y_0) + \dfrac{\partial h(x_0, y_0)}{\partial x}(x- x_0) \\ + \dfrac{\partial h(x_0, y_0)}{\partial y}(y - y_0) + something\ related\ to\ (x-x_0)^2\ and\ (y-y_0)^2\cdots$ When x and y is close to $x_0$ and $y_0$ : $h(x,y) \approx h(x_0, y_0) + \dfrac{\partial h(x_0, y_0)}{\partial x}(x- x_0) \\ + \dfrac{\partial h(x_0, y_0)}{\partial y}(y - y_0)$ ---- * **Formal Derivation** :（基於 Taylor Series） $L(\theta) \approx L(a,b) + \dfrac{\partial L(a,b)}{\partial\theta_1}(\theta_1 - a) + \dfrac{\partial L(a,b)}{\partial\theta_2}(\theta_2 - b)$ $s = L(a,b) \ \ \ \ \ \ \ u = \dfrac{\partial L(a,b)}{\partial\theta_1} \ \ \ \ \ \ \ v = \dfrac{\partial L(a,b)}{\partial\theta_2}$ $L(\theta) \approx s + u(\theta_1-a) + v(\theta_2-b)$ ---- * Find $\theta_1$ and $\theta_2$ in the red circle minimizing L(θ) red circle : $(\theta_1-a)^2 + (\theta_2 - b)^2 \leq d^2$ ![](https://i.imgur.com/OcVIhFi.png) ---- * Gradient descent – two variables $\Delta\theta_1 = (\theta_1-a)\ \ \ \ \ \ \ \Delta\theta_2 = (\theta_2-b)$ To minimize L(θ) : $\left( \begin{array}{c} \Delta\theta_1 \\ \Delta\theta_2 \\ \end{array} \right) = -\eta \left( \begin{array}{c} u \\ v \\ \end{array} \right)$ $\left( \begin{array}{c} \theta_1 \\ \theta_2 \\ \end{array} \right)= \left( \begin{array}{c} a \\ b \\ \end{array} \right) -\eta \left( \begin{array}{c} u \\ v \\ \end{array} \right)=\left( \begin{array}{c} a \\ b \\ \end{array} \right) -\eta \left( \begin{array}{c} \dfrac{\partial L(a,b)}{\partial\theta_1} \\ \dfrac{\partial L(a,b)}{\partial\theta_2} \\ \end{array} \right)$ ---- * 至此，推倒出了 Gradient Descent 的公式但需要特別注意，只在 learning rate 夠小的時候才成立（因為需要近似）當然，可以不止做一次微分，可以做到二次微分：Newton’s method ---- #### More Limitation of Gradient Descent * Gradient Descent 會卡在 loss function 微分值為 0 的地方（local minima 、saddle point) --- > [time= May , 2018] --- ## ML Lecture 4 : Classification ---- Function input 一個 object x, output n 個 class 的其中之一應用： * Credit Scoring * 輸入申請者的收入、存款、職業、年齡、財務歷史 * 輸出接受 or 拒絕借貸 * Medical Diagnosis * 輸入症狀、年齡、性別、病史等等 * 輸出患者疾病 * Handwritten character recognition * Face recognition ---- ### Example Application 輸入一隻寶可夢，得到該隻寶可夢的種類屬性（分類）(共 18 種) * 量化一寶可夢 : Total, HP, Attack, Defense, SP Atk, AP Def, Speed 等等。(一個 7 個 attribute 的 vector) ---- ### Traning Data for Classification #### Classification as Regression？ * 以 binary Regression 為例 * training : Class 1 means the target is 1; Class 2 means the target is -1 * testing : closer to 1 → class 1; closer to -1 → class 2 ![](https://i.imgur.com/7nHYbYI.png) 預想會是如此，兩群中間的綠線代表 y 值為 0 ---- ![](https://i.imgur.com/3VWV50a.png) 然而，當在進行 regression 時，若有一群 class 1 的點如圖中右下角所示，其 y 值遠大於 1 ，則此時**綠線**所代表的 function 會被視為 error 極大，regression 最終的結果會選擇**紫線**代表的 function 以減少右下角那些點所造成的 error > Penalize to the examples that are “too correct...” [name=(Bishop, P186)] ---- * 若採用 Multiple class * Class 1 means the target is 1; Class 2 means the target is 2; Class 3 means the target is 3 * class 兩兩之間的數值是相近的（如得到值 2.6 其歸屬於 class 3 ，但也距離 class 2 很接近），若其並沒有實際上的意義的話，則並不能得到好的結果 ---- ### Ideal Alternatives 以 binary class 為例 * Function ( Model ) : $\ f(x) = \begin{equation} \left\{ \begin{array}{l} class\ 1 , & g(x)>0\\ class\ 2 , & else \end{array} \right. \end{equation}$ $g(x)$ 為原本的結果，將$g(x)$的 output 離散化 * Loss Function : $\ L(f) = \displaystyle\sum_n{\delta(f(x^n)\neq \hat{y}^n)}$ 同樣的，loss function的結果不再是數值，而是與結果不符合的次數 * 然而，改變後的 function 皆沒有辦法再用舊的方法（如微分...）來解決，解決辦法有二： Perceptron, SVM 除上述的兩種解決方法，尚可從其他觀點來解決問題... ---- ### Generative Model 從機率角度來看，當挑選一個 x ，我們可以預估出該 x 分別屬於 class 1 及 class 2 的機率，根據機率去猜測該 x 應該是屬於哪個 class ![](https://i.imgur.com/7FuHFv3.png) * 給予 x ，其屬於 class 1 的機率： $P(C_1|x) = \dfrac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1) + P(x|C_2)P(C_2)}$ * Generative Model： $P(x) = P(x|C_1)P(C_1) + P(x|C_2)P(C_2)$ ---- ![](https://i.imgur.com/8vEL9cj.png) * Water and Normal type with ID < 400 for training, rest for testing Training: 79 Water, 61 Normal $P(C_1) = 79 / (79 + 61) =0.56$ $P(C_2) = 61 / (79 + 61) =0.44$ ---- * 接下來求 $P(x|C_1)$ 及 $P(x|C_2)$ 值我們可以藉由 Gaussian Distribution（常態分布）來估測他的機率 $f_{\mu,\Sigma}(x)=\dfrac{1}{(2\pi)^{D/2}}\dfrac{1}{|\Sigma|^{1/2}}exp\{-\dfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\}$ 給予 vector x ，回傳在此分佈下取樣出 x 的機率 * 現在的問題是，$\mu$ 及 $\Sigma$ 的該用什麼值，不同的 $\mu,\Sigma$ 會產生不同的分佈，但**都有可能**（可能大、可能小）sample 出 x 假設我們有已知的 79 組 vector : $x^1, x^2,\cdots,x^{79}$ Likelihood of $\mu,\Sigma$ : $L(\mu,\Sigma) = f_{\mu,\Sigma}(x^1)\ f_{\mu,\Sigma}(x^2)\ f_{\mu,\Sigma}(x^3)\cdots\ f_{\mu,\Sigma}(x^{79})$ 代表所有已知的 vextor 在該組分佈（$\mu,\Sigma$）之中的機率必有一組 $\mu^*,\Sigma^*$ 擁有最大的 Likelihood，也就是 $L(\mu^*,\Sigma^*)$ 擁有最大值 $\mu^* = \dfrac{1}{79}\displaystyle\sum_{n=1}^{79}x^n\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \Sigma^* = \dfrac{1}{79}\displaystyle\sum_{n=1}^{79}(x^n - \mu^*)(x^n - \mu^*)^T$ ---- 經由上述，我們已經可以求出 $P(C_1|x)$ ，若 $P(C_1|x)>0.5$ 則我們可以說 x 是屬於 class 1 然而結果卻很不理想： ![](https://i.imgur.com/V1gnGqo.png) 考慮 2 個 feature 的精確度為 47%，考慮 7 個 feature 的精確度為 54% ---- ### Modifying Model 思考，在選擇 model 時，每個 class 都有自己的 $\mu$ 及 $\Sigma$，其中 $\Sigma$ 的維度大小與決於所參考的 feature 的平方，所以如果今天所參考的 feature 一多，那整體的參數數目就會上升很快，model 就變複雜，就很有可能 overfitting 所以我們會使不同的 class 共用相同的 covarience matrix ，以漸少參數的使用 : $L(\mu^1, \mu^2, \Sigma) = f_{\mu^1,\Sigma}(x^1)\ f_{\mu^1,\Sigma}(x^2)\cdots f_{\mu^1,\Sigma}(x^{79})\ \times\ f_{\mu^2,\Sigma}(x^{80})\ f_{\mu^2,\Sigma}(x^{81})\cdots f_{\mu^2,\Sigma}(x^{140})$ $\mu^1$ 及 $\mu^2$ 不變 $\Sigma = \tfrac{79}{140}\Sigma^1 + \tfrac{61}{140}\Sigma^2$ ---- 結果如下： ![](https://i.imgur.com/I2o4dFO.png) 可以看到邊界變為一條直線，所以我們又稱這種 model 為 **Linear Model** 若考慮 7 個 feature 時，結果將會從正確率 54% 提升到正確率 73% ---- ### Posterior Probability $P(C_1|x) = \dfrac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1) + P(x|C_2)P(C_2)} = \dfrac{1}{1+\dfrac{P(x|C_2)P(C_2)}{P(x|C_1)P(C_1)}} = \dfrac{1}{1+exp(-z)} = \sigma(z)$ $z = ln \dfrac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}$![](https://i.imgur.com/uqRT4T7.png =180x120) > 以下為數學化簡，（可略） ![](https://i.imgur.com/K7gelv8.png) ![](https://i.imgur.com/MfVTbr1.png) ![](https://i.imgur.com/G9Vsh6Z.png) 考慮 $\Sigma^1 = \Sigma^2 = \Sigma$ ![](https://i.imgur.com/OXjUWFK.png) ---- 最終，$P(C_1|x) = \sigma(w \cdot x + b)$ 在 generative model 中，我們需要估算 $N_1, N_2, \mu^1, \mu^2, \Sigma$，才能得到 $w\ 跟\ b$ --- ## ML Lecture 5 : Logistic Regression ---- ### Step 1 : Function Set ![](https://i.imgur.com/fbfTDXH.png) $f_{w,b}(x) = P(C_1|x) = \sigma\lgroup\displaystyle\sum_i w_i x_i + b\rgroup$ 由於結果經過 sigmoid function 所以 output 必定介於 0 和 1 之間（不同於 Linear Regression 的結果可以是任何值） ---- ### Step 2 : Goodness of a Function ![](https://i.imgur.com/fPFJInt.png) 假設 Training data 如上圖所示（$C_1$ 代表 class 1，$C_2$ 代表 class 2），那我們可以挑選任一組 $w,b$ ，他們決定了 posterior probability ，因此我們可以計算出這 N 筆 data 是經由這組 $w,b$ 產生的機率： $Assume\ the\ data\ is\ generated\ based\ on\ \ f_{w,b}(x) = P_{w,b}(C_1|x)$ $L(w,b) = f_{w,b}(x^1)f_{w,b}(x^2)(1-f_{w,b}(x^3))\cdots f_{w,b}(x^N)$ 可以選出一組 $w^*,b^*$ 其擁有最大的 $L(w,b)$ : $w^*,b^* = arg\ \underset{w,b}{max}\ L(w,b)$ 而為**簡化計算**可將上式等同於下式： $w^*,b^* = arg\ \underset{w,b}{min}\ -lnL(w,b)$ ---- ![](https://i.imgur.com/jn2EA5h.png) $\begin{split}-lnL(w,b) = &-lnf_{w,b}(x^1)&\Rightarrow -[\hat{y}^1lnf(x^1) + (1 - \hat{y}^1)ln(1 - f(x^1))]\\ &-lnf_{w,b}(x^2) &\Rightarrow -[\hat{y}^2lnf(x^2) + (1 - \hat{y}^2)ln(1 - f(x^2))]\\ &-ln(1-f_{w,b}(x^3))6\ \ \ &\Rightarrow -[\hat{y}^1lnf(x^3) + (1 - \hat{y}^3)ln(1 - f(x^3))]\\ &\ \ \ \ \ \ \ \ \ \ \ \ \ \vdots & \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \vdots\end{split}$ $=\displaystyle\sum_n-[\hat{y}^nlnf(x^n) + (1 - \hat{y}^n)ln(1 - f(x^n))]$ 最終我們得到 $Loss\ function : L(f)=\displaystyle\sum_n C(f(x^n), \hat{y}^n)$ ---- > $C(f(x^n), \hat{y}^n) = -[\hat{y}^nlnf(x^n) + (1 - \hat{y}^n)ln(1 - f(x^n))]$ （cross entropy） > [What's Cross Entropy?](https://blog.csdn.net/rtygbwwwerr/article/details/50778098) & [What's Cross Entropy for?](http://shuokay.com/2017/06/23/cross-entropy/) ---- ### Step 3 : Find the best function 做 Gradian Descent > (略)以下是數學推導： > ![](https://i.imgur.com/FUpleX1.png) > ![](https://i.imgur.com/Nu93zeF.png) > ![](https://i.imgur.com/W5B1UZi.png) ---- 最終 Gradian Descent 的公式： $w_i \leftarrow w_i -\eta\displaystyle\sum_n-\ (\hat{y}^n- f_{w,b}(x^n))\ x_i^n$ 與 Linear Regression 一模一樣！ ---- ### *WHY NOT？* Logistic Regression + Square Error 如果在 step 2 時，Loss function 不使用 Cross Entropy，而直接使用 Square Error，不是省事許多嗎？ ---- ![](https://i.imgur.com/oHPdoaN.png) 可以發現，當 $\hat{y}^n = 0$ 時，在距離目標及遠及極近時，獲得的微分值是一樣的、而且皆為 0 ，在 $\hat{y}^n = 1$ 時也會是相同的狀況，這會使我們沒有辦法藉由 gradian descent 持續朝目標逼近。 ---- ![](https://i.imgur.com/ZyEYlmy.png) 上圖藉由 Loss 圖形讓我們可以很清楚的看見，當在遠處時，Cross Entropy 的斜率遠大於 Square Error ，當 Square Error 幾乎沒有辦法更新時，Cross Entropy 卻擁有相當大的優勢。 ---- ### Discriminative v.s. Generative ![](https://i.imgur.com/jimbgkT.png) ---- | Discriminative | Generative | | --------------------- | --------------------- | | Gaussian Distribution | Posterior Probability | | 直接找 $w, b$ | 要用 $\mu^1,\mu^2,\Sigma^{-1}$ 來計算 $w, b$ | | ![](https://i.imgur.com/Zqsfz1f.png =250x) 只考慮兩個 feature 的情況下，肉眼看不出分別 | ![](https://i.imgur.com/mqhGjDk.png =250x) 只考慮兩個 feature 的情況下，肉眼看不出分別| | 考慮全部 7 個 feature ，準確率為 79% | 考慮全部 7 個 feature ，準確率為 73% | | 看實際 data 說話 | 會自己**腦補**出可能不會發生的情況 | | training data 多時較佳 | training data 少或有問題時表現較佳 | ---- 發現 Discriminative 較 Generative 佔優勢（在大部分的例子中也是如此）因為 Generative 會自己**腦補、假設**出可能不會發生的情況，所以可能會導致不佳的結果，然而**當 training data 很少時 Generative Model 會有不錯的結果** * 在語音辨識中，便是使用 Generative 的大架構中使用 Discriminative 的 neural network * Priors probability 用 Generative 產生（該文字產生的機率） * class-dependent 用 Discriminative 來做辨識（實際語音的分類） ---- ### Multi-class Classification (3 classes as example) $C_1 : w^1,b^1\ \ \ \ \ \ z_1 = w^1\cdot x + b^1\\ C_2 : w^2,b^2\ \ \ \ \ \ z_2 = w^2\cdot x + b^2\\ C_3 : w^3,b^3\ \ \ \ \ \ z_3 = w^3\cdot x + b^3\\ \ \\ y_i = P(C_i|x)$ ---- ![](https://i.imgur.com/U6Mk0bF.png) （此 softmax function 是從 Gussian distribution 觀點下推導而成的結果，詳細內容可以看 [Bishop, P209-210] ，而另從 max entropy 觀點也可推倒出此結果）總之，對 z 取 exponential 後，再做 normalization 可得到 y，其中 y 符合： * $1 > y_i > 0$ * $\sum_iy_i = 1$ ---- 如同 2-class 的 example，對應的 x 求出 y 後，與　$\hat{y}$ 做 cross entropy，選出最 minimize 的結果： ![](https://i.imgur.com/U5rZEqh.png) 在此將 $\hat{y}$ 設為三維的 vextor，可以避免先前提到的--若將 class 1 的 $\hat{y}$ 設成 1 、class 2 的 $\hat{y}$ 設成 2 、class 3 的 $\hat{y}$ 設成 3 ，會隱含兩兩 class 之間的距離關係 ---- ### Limitation of Logistic Regression 然而，有些問題，是單一的 logistic regression 沒有辦法解決、分類的。 ![](https://i.imgur.com/jSDt6j2.png) 如上圖例子，可能就沒有辦法有效的判斷 4 個點份別屬於哪個 class ---- ![](https://i.imgur.com/vvzHKXC.png) 由於 logistic regression 的結果會是一個 linear bounded，因此最有可能得到上圖的結果，無法成功分別兩個 class ---- ### Feature Transformation 為解決 Limitation of Logistic Regression ，我們可以嘗試將四點位置做變形之後，再做 logistic regression。假設： * $x_1' : distance\ to\ \begin{bmatrix} 0 \\ 0 \end{bmatrix} \ \ \ \ \ \ \ x_2' : distance\ to\ \begin{bmatrix} 1 \\ 1 \end{bmatrix}$ 將四點從 $x_1-x_2$空間轉換到 $x_1'-x_2'$空間，如下圖： ![](https://i.imgur.com/UJk97el.png) ---- 之後便可以成功的利用 logistic regression 進行分類，然而，如此的 feature transformation 是靠著人類自己選擇、進行的，若是過於複雜的 feature transformation 很難單靠人類來訂定。 ---- 所以我們希望找出方法，能夠靠著電腦來決定 transformation 。 * Cascading logistic regression models 就是一種方法 ![](https://i.imgur.com/GRBQbPK.png) 將原本的 feature $x_1, x_2$ 分別都丟進兩個 logistic regression models，兩個 models 分別產生 $x_1', x_2'$ 作為新的、transformate 完的 feature，將 $x_1', x_2'$ 作為輸入丟進最後一個 logistic regression model，此層輸出 y ，真正的完成 classification ---- ![](https://i.imgur.com/foPD0PM.png) 如此，透過電腦自行去完成整個多層 regression 的過程，我們便稱之為 Deep Learning，其中多層次一個一個的 regression，稱為 Neuron，整個架構稱為 Neuron Network。 ---- ![](https://i.imgur.com/AP1d813.png) --- ## ML Lecture 6 : Brief Introduction of Deep Learning ---- ### Ups and downs of Deep Learning * 1958: Perceptron (linear model) * 最先提出，類似 logistic regression (但沒有使用 sigmoid function) * 1969: Perceptron has limitation * 發現有很多問無法用 linear 方法解決 * 1980s: Multi-layer perceptron * 思路如同 Deep learning ，用多層的 Perceptron 組成 Neural Network * Do not have significant difference from DNN today * 1986: Backpropagation * Usually more than 3 hidden layers is not helpful * 1989: 1 hidden layer is “good enough”, why deep? ---- * 2006: RBM initialization (breakthrough) * restricted Boltzmann machine(RBM) 被認為是當時的突破，是區分 Deep learning 跟 Multi-layer perceptron（Neural Network）的指標 * 但後來證實其實沒有多大的作用 * 2009: GPU * GPU 加速運算，讓 deep learning 效率突飛猛進 * 2011: Start to be popular in speech recognition * 2012: win ILSVRC image competition ---- ### Three Steps for Deep Learning just like the regressions we learn before * Step 1 : Neural Network * Step 2 : goodness of function * Step 3 : pick the best function ---- ### Neural Network Different connection leads to different network structures 裡頭有許多不同的 logistic regression，都擁有不同的 weights 跟 biases，在此我們用 $\theta$ 表示 * Network parameter $\theta$ : all the weights and biases in the “neurons” ---- ### Fully Connect Feedforward Network ![](https://i.imgur.com/6pA2kJr.png) ---- > 事實上，此處不一定是使用 sigmoid function，sigmoid function 所在的位置![](https://i.imgur.com/JCWkIdh.png)，在 neural network 的文獻中稱之為 **activation function**，不一定是使用 sigmoid function（事實上再現今的實作中大多數已經不使用 sigmoid function） ---- 架構如上，再放入各參數： ![](https://i.imgur.com/zFuPS4v.png) 整體就會是一個 function $f$，$f\begin{pmatrix}\begin{bmatrix} 1 \\ -1 \end{bmatrix} \end{pmatrix} = \begin{bmatrix} 0.62 \\ 0.83\end{bmatrix} \ \ \ \ \ \ \ f\begin{pmatrix}\begin{bmatrix} 0 \\ 0 \end{bmatrix} \end{pmatrix} = \begin{bmatrix} 0.51 \\ 0.85 \end{bmatrix}$ ---- 而這個 network structure 就會是一個 function set，收集所有這樣結構但參數不同的 function > 相較於先前的 logistic regression、linear regression 行為上是相同的，但 function set 卻大了很多，將先前介紹的 regression 都包含其中 ---- ![](https://i.imgur.com/5VuEy6w.png) * Input Layer：整個 Neural Network 的第一層為輸入 vector $x$，雖然不包含實際上做 regression 的 neuron，但我們還是將之稱為 Input Layer * Output Layer：整個 Neural Network 中最後一層 regression 的所在，直接輸出 output vector $y$ * Hidden Layers：夾在 input layer 與 output layer 中的每一層都稱為 hidden layer ---- ### Deep = Many hidden layers 何謂 Deep ? 3 層 hidden layers？ 8 層 hidden layers？說法各有不同，也有人將 Neural-based 的方法都稱之為 Deep ![](https://i.imgur.com/6tJwPLB.png =600x) （上圖為圖像辨識優化的演進） ---- ### Matrix Operation 我們會發現，關於 Deep learning 中的運算，我們可以很容易的用**矩陣運算**來表示： ![](https://i.imgur.com/fjuYvX8.png) ---- 整個 Neural Network 如下： ![](https://i.imgur.com/HWVXfSh.png) $y = f(x) = \sigma(W^L \cdots \sigma(W^2 \sigma(W^1 x + b^1) + b^2) \cdots + b^L)$ 將 Neural function 化為矩陣表示不僅僅只是表示方便而已，而是可以透過平行化運算（parallel computing）來達到加速的目的 > That is why GPU helpful in deep learning. ---- ### Output layer 事實上，Hidden Layers 就是扮演一個**自動化 Feature 提取器**的角色，負責完成 Feature Transformation，將好的、容易分辨的 feature 提供給 Output Layer： ![](https://i.imgur.com/aLED7aE.png =600x) 而 Output Layer 就會根據 hidden layers 提供的 feature 來分辨該輸入 $x$ 究竟是屬於哪個 class，因此，Output layer 其實就是一個 Multi-class Classifier，其中會包含 Softmax Function ---- ### Example Application Handwriting Digit Recognition ![](https://i.imgur.com/tjAhjRq.png) ---- * Input : 一張 16X16 的圖，總共有 256 個 bits，每個 bit 代表該 pixel 是黑 or 白，所以是一個 256-dim 的 vector * Output : 可能為 0~9 十個數字，所示是輸出一個 10 維的 vector，每個 dimension 代表數入數字為該值的 probability * 我們的目標是找到一個 function，能夠有很好的正確率所以我們需要設計一個好的 network structure，它決定了一個 function set 是否有可能找到良好的目標 function ---- ### FAQ 為了達成上述的目的，有幾個問題... * Q : How many layers? How many neurons for each layer? * 這沒有一個答案，我們所能做的就是 **Trial and Error + Intuition** > 非 Deep -> Deep learning：把一個問題從 **如何找到一個 Feature** 轉換成 **如何選擇一個 network stucture** > 在過去做影像辨識時，會抽一些（人定的）feature ，也就是做 feature transform，但有了 Deep learning 後，我們可以直接將影像的 pixels 丟入，讓 machine 幫我們將有用的 feature 找出來。 ---- > 而 Deep Learning 究竟好不好用，就取決於哪一個問題比較容易？ > 如：語音辨識對於人類來說十分容易，幾乎是下意識就可以完成，然而，我們卻很難找出對於機器容易分辨的 feature，並藉此完成 linear 的分類，所以 Deep Learning 十分有效。（影像辨識亦是如此） > 又如：NLP 的分辨，人們可以簡單的透過辭彙列表來分析一篇文章的情緒，這個 feature 是可以輕鬆抓到的，所以 Deep learn 對 NLP 的進步就沒有那麼顯著。（但依舊是有效的） ---- * Q : Can the structure be automatically determined? * E.g. Evolutionary Artificial Neural Networks * Q : Can we design the network structure? * Convolutional Neural Network (CNN) ---- ### Loss for an Example ![](https://i.imgur.com/HzMLoeT.png) 其實與 Multi-class Classification 一模一樣，就是計算 $y\ 與\ \hat{y}$ 的 cross entropy，並挑出最小的那個 function，但這次我們要看的是 Total Loss ---- ![](https://i.imgur.com/7uwP5im.png =400x) 每筆 training data 有會有一個做完 cross entropy 的值，total loss 就是所有 C 的加總： $L = \displaystyle\sum_{n=1}^N C^n$ 我們要從整個 function set 中，**挑出 total loss 最小的 function** 也就是說，要**找到一個 network parameter $\theta^*$ 使得 total loss 最小** ---- ### Gradian Descent $\nabla L = \begin{bmatrix} \dfrac{\partial L}{\partial w_1} \\ \dfrac{\partial L}{\partial w_2} \\ \vdots \\ \dfrac{\partial L}{\partial b_1} \\ \vdots \end{bmatrix}$ ---- ![](https://i.imgur.com/QP095YS.png) ---- ### Deeper is Better ? 以下是實驗數據，測試在每層 neuron 數 2k 的情況下，不同層數的 hidden layer 對辨識結果有何影響 | Layer X Size | Word Error Rate (%) | | ------------ | ------------------- | | 1 X 2k | 24.2 | | 2 X 2k | 20.4 | | 3 X 2k | 18.4 | | 4 X 2k | 17.8 | | 5 X 2k | 17.2 | | 7 X 2k | 17.1 | 似乎不怎麼讓人驚訝，因為就我們所知，愈多曾的 hidden layers，代表著 function set 愈大，當然愈可能找到表現較佳的 function ---- 甚至還有研究宣稱，在 neuron 數夠大的其況下，其實僅需要一層的 hidden layer 就可以表達足夠大的 function set、包含任何的 function，那果真如此的話 Why **Deep** not **Fat** ? > [Reference for the reason ](http://neuralnetworksanddeeplearning.com/chap4.html) --- ## ML Lecture 7: Backpropagation ---- ### Gradian Descent Network parameters $\theta = \{w_1, w_2, \cdots, b_1, b_2, \cdots\}$ Starting parameters $\theta^0 \longrightarrow \theta^1 \longrightarrow \theta^2 \longrightarrow \cdots$ $\nabla L(\theta) = \begin{bmatrix} \partial L(\theta)/\partial w_1 \\ \partial L(\theta)/\partial w_2 \\ \vdots \\ \partial L(\theta)/\partial b_1 \\ \partial L(\theta)/\partial b_2 \\ \vdots \end{bmatrix}$ $Compute\ \nabla L(\theta^0)\ \ \ \ \ \ \ \ \theta^1 = \theta^0 - \eta \nabla L(\theta^0)\\ Compute\ \nabla L(\theta^1)\ \ \ \ \ \ \ \ \theta^2 = \theta^1 - \eta \nabla L(\theta^1)\\ \ \ \ \ \ \ \ \ \ \ \ \ \vdots\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \vdots$ ---- 與先前所學到的 Gradian Descent 並沒有什麼不同，但在做 Deep learning 時，往往會有好多層 hidden layers，而每一層的 neural 數都是 1k、2k 起跳，會造成參數的數量都是非常非常大的，計算也非常冗長。 Backpropagation 做的事情其實就是 Gradian Descent，不過他的演算法可以使我們有效率的計算 Gradian vector ---- ### Chain Rule 稍微複習一下： ![](https://i.imgur.com/1GfBkVQ.png) ---- ### Backpropagation ![](https://i.imgur.com/f3LAWdA.png) Total Loss 是所有 $C^n$ 的總和（由先前的敘述，$y^n$ 事由 NN -- Neural Network 所產生的 Output；$\hat{y}^n$ 為我們所預期各個 class 應有的正確參數值；有一個 cost function 負責計算兩者之間的差距，先前的例子該 function 為計算 cross entropy）我們可以發現，若要計算某一參數 $w$ 對Total Loss 的微分，我們可以取參數 $w$ 分別對各個 C^n 的微分值再做加總。 ---- 而要如何計算 w 對 C 的微分值？我們先只看一個 neuron : ![](https://i.imgur.com/AfJnK9s.png) $\dfrac{\partial C}{\partial w} = \dfrac{\partial z}{\partial w}\dfrac{\partial C}{\partial z}\ \ \ \ (by\ Chain\ rule)$ $\dfrac{\partial C}{\partial w}$ 可以經由兩項相成得到，所以，我們將 Backpropagation 的計算分成了兩個部份： ---- * Forward Pass : 計算對所有參數 w 對 z 的微分 $\dfrac{\partial z}{\partial w}$ * Backward Pass : 計算所有 activation function 的 input z 對 C 的微分 $\dfrac{\partial C}{\partial z}$ ---- ### Forward Pass 簡化來看，如果 input 只有兩項，$z = x_1 w_1 + x_2 w_2 + b$： ![](https://i.imgur.com/AfJnK9s.png) 那麼 $\dfrac{\partial z}{\partial w_1}$ 跟 $\dfrac{\partial z}{\partial w_2}$ 的計算是很 trivial 的，分別是 $x_1$ 及 $x_2$，也就是該 weight 所乘上的 input ---- 所以在計算各個參數 w 的 Forward Pass 的時候，很簡單，只要找到該 w 前的 input 就會是答案： ![](https://i.imgur.com/JniSxQ2.png) ---- ### Backward Pass 而在計算 Backward Pass 時就麻煩了許多，先做一個假設，假設我們使用的 activation function 是 sigmoid function，並且 $a = \sigma(z)$ : ![](https://i.imgur.com/YYrWo1k.png) $\dfrac{\partial C}{\partial z} = \dfrac{\partial a}{\partial z} \dfrac{\partial C}{\partial a} = \sigma'(z) \dfrac{\partial C}{\partial a}\ \ \ \ \ \ \lgroup$![](https://i.imgur.com/R1Ox6Dw.png =200x140)$\sigma'(z) 為已知的常數\rgroup$ ---- $\begin{split} \dfrac{\partial C}{\partial a} &= \dfrac{\partial z'}{\partial a} \dfrac{\partial C}{\partial z'} &+ \dfrac{\partial z''}{\partial a} \dfrac{\partial C}{\partial z''} \\ &= w_3 \dfrac{\partial C}{\partial z'} &+ w_4 \dfrac{\partial C}{\partial z''} \end{split}$（由 forward pass 知道 $\dfrac{\partial z'}{\partial a} = w_3 \ \ \dfrac{\partial z''}{\partial a} = w_4$）接下來，只需要知道了 $\dfrac{\partial C}{\partial z'}\ 和\ \dfrac{\partial C}{\partial z''}\ 就可以得到\ \dfrac{\partial C}{\partial a}，並因此得到\ \dfrac{\partial C}{\partial z}$ ---- 先假設我們已經知道 $\dfrac{\partial C}{\partial z'}\ 和\ \dfrac{\partial C}{\partial z''}$ ，得到：$\dfrac{\partial C}{\partial z} = \sigma'(z) \ [w_3 \dfrac{\partial C}{\partial z'} + w_4 \dfrac{\partial C}{\partial z''} ]$ 至此得知，Backward Pass 的流程，是由後層網路的值往前推得： ![](https://i.imgur.com/1kqqWkD.png) （$\sigma'(z)$ 為一個常數，因為 $z$ 已由 forward pass 決定） ---- 先前尚未解說的 $\dfrac{\partial C}{\partial z'}\ 和\ \dfrac{\partial C}{\partial z''}$ 兩項的值，其實跟原本求的 $\dfrac{\partial C}{\partial z}$ 情況相仿，需要透過 foreard pass 及 backward pass 求得，forward pass 沒有問題，而 backward pass 的部份，因為需要後層 layer 的值來決定，所以根據後層 layer 屬性不同，可以分成兩種 case 討論： ---- * Case 1：Output Layer ![](https://i.imgur.com/gka7yMU.png) 如圖，若 $z' \ 及 \ z''$ 的後一層是 output layer，也就是通過下一層 activation function（此處可能為 softmax function）後會得到 output $y_1\ 和 \ y_2$ ，此時 $\dfrac{\partial C}{\partial z'}\ 和\ \dfrac{\partial C}{\partial z''}$ 可表示為： $$\dfrac{\partial C}{\partial z'} = \dfrac{\partial y_1}{\partial z'} \dfrac{\partial C}{\partial y_1}\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \dfrac{\partial C}{\partial z''} = \dfrac{\partial y_2}{\partial z''} \dfrac{\partial C}{\partial y_2}$$ ---- $z'$ 對 $y_1$ 的微分等於是對 softmax function 取微分，而 $y_1$ 對 $C$ 的微分等同於對 cross entropy 取微分，皆為已知；對 $z''$ 項亦同，所以最後終於求出了 $\dfrac{\partial C}{\partial z}$ 的答案 ---- * Case 2：Not Output Layer ![](https://i.imgur.com/7qZfgzw.png =500x300) 若下層依舊是 hidden layer，便沒有辦法直接得到答案，需要按照求 $\dfrac{\partial C}{\partial z}$ 的流程一樣，做 backward pass 繼續往後，recursive 求 $\dfrac{\partial C}{\partial z}$ ，直到下一層為 output layer 為止。 ---- * ![](https://i.imgur.com/QtOtZMB.png =500x300) ---- ### Summary 然而，設想，當我需要知道 $\dfrac{\partial C}{\partial z}$ 時，我需要先計算 $\dfrac{\partial C}{\partial z'}\ 和\ \dfrac{\partial C}{\partial z''}$ ，而當我需要知道 $\dfrac{\partial C}{\partial z'}$ 時，我需要先知道 $\dfrac{\partial C}{\partial z_a}\ 和\ \dfrac{\partial C}{\partial z_b}$，這將會造成大量、且重複的運算，在一層 layer 動輒上百、上千個 neural 的情況下，是極度沒有效率的作法， ---- 所以，Backpropagation 採取 backward pass 手法，**逆轉整個 network、把 activation function 當成 op-amp** ，一開始便從最後一層 layer 的微分開始計算，藉由已運算完成的結果步步往前推算，在求出 $\dfrac{\partial C}{\partial z}$ 時，後層的 $\dfrac{\partial C}{\partial z'}\ ,\ \ \dfrac{\partial C}{\partial z''}\ ,\ \ \dfrac{\partial C}{\partial z_a}\ ,\ \ \dfrac{\partial C}{\partial z_b}$ 等等都已經計算完畢。 ---- 再搭配上 Forward pass 所求出的 $\dfrac{\partial z}{\partial w}$ ，我們便可以推導出最初的所求 $\dfrac{\partial C}{\partial w}$ ![](https://i.imgur.com/KEJfE0U.png) 有了 Backpropagation 的手法，我們就可以有效率地來做 Gradian Descent --- ## ML Lecture 8-1 : “Hello world” of deep learning 此堂課大部份再介紹，[Keras](https://keras.io/) 的操作 ---- ### Keras Keras 其實是 [TensorFlow](https://github.com/tensorflow/tensorflow)、[Theano](https://github.com/Theano/Theano) 等 API 的 interface，作為一個使用更為簡單、更好入門的工具。（TensorFlow 跟 Theano 可能更 flexible 一點，甚至不僅僅可作為 Neural Network 的開發工具，也可以當成純粹的數學運算工具使用） ---- ### "Hello world" 使用 Kares 來實作 Handwriting Digit Recognition ![](https://i.imgur.com/DjBvh0n.png) 使用 [MNIST DATA](http://yann.lecun.com/exdb/mnist/) (Kares 也提供 [function](https://keras.io/datasets/) 用來載入各種 Data Sets) 與我們一直以來學習的一樣，Deep Learning 也需要3個步驟來完成 * Step 1: define a set of function * Step 2: goodness of function * Step 3: pick the best function ---- ### Step 1: Building a Network network 架構如下： ![](https://i.imgur.com/qirjNbC.png) ---- 所以首先，我們要先決定我們的 network structure ，先創建一個 model : ```python= model = Sequential() ``` 根據 Network 架構給予參數： ```python=2 model.add(Dense(input_dim=28*28, units=500, activation='sigmoid')) ``` * `Dense` 表示一個 fully connected 的 layer * 一張 image 為 $28\times28$ 個 pixels ，所以 `input_dim` 是 28*28 * `units` 代表這一層的 neural 數是 500 ---- * `Activation` 決定此層 layer 中的 neural 是使用何種 activation function，在此是使用 sigmoid function * 尚可以選擇多種不同的 function，如：softplus, softsign, relu, tanh, hard_sigmoid,linear 等等 * 其實也可以在 keras 中加入自訂的 function，只需在 Keras 的 code 中找到定義 activation function 的地方加上去就好 ---- 再加一層 layer : ```python=4 model.add(Dense(units=500, activation='sigmoid')) ``` * 此層的 `Dense` 不需要再給 `input_dim` ，因為上一層的 output 就會是這一層的 input，所以僅需設定 `units` 即可最後加上一層 output layer : ```python=6 model.add(Dense(output_dim=10, activation('softmax')) ``` * 因為此 output layer 相當於 multiclass classifler，所以此層的 activation function 使用 softmax function，當然，再其他情況下，可以挑選任何的 function 使用 ---- ### Step 2: Configuration 接著，要挑選 loss 來決定 function 的好壞 ![](https://i.imgur.com/fCiAB1K.png) ```python=8 model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) ``` * `loss` 決定 loss function，使用 cross entropy -- `'categorical_crossentropy'` * `optimizer` 決定 gradian descent 要怎麼做、learning rate 怎麼取，在這邊使用 `'adam'` * 可以選擇：SGD, RMSprop, Adagrad, Adadelta, Adam, Adamax, Nadam 等等 ---- ### Step 3: Find the optimal network parameters 已經將架構參數都給完後，最後一步是將 training data 給予我們的 neural network，真正開始 train model： ```python=11 model.fit(x_train, y_train, batch_size=100, nb_epoch=20) ``` * `x_train` 是 input，會是一個 numpy array，包含一個個二維 metric，長度會是 training data 的個數，寬度會是一張圖片 $28 \times 28=784$ 個 pixels 長 ![](https://i.imgur.com/rFtfMIX.png =300x) ---- * `y_train` 是 training data 的 label，紀錄該筆 data 實際上是哪個數字，也是 numpy array，包含一個個二維 metric，長度也是 training data 個數，寬度會是 10 ![](https://i.imgur.com/3SRBxml.png =300x) * `bitch_size` 及 `nb_epoch` 攸關 learning 過程中運算的效率，會在之後做解釋 ---- ### mini batch 在實際上 deep learning 做 gradian descent 時，我們並不會真正去 minimize total loss，而是會將我們擁有的 training data 幾個幾個一組、分成一個一個 batches（會是 random 選擇，如果不夠隨機，會影響 performance），輪流去計算一個 batch 的 total loss 來做更新 ![](https://i.imgur.com/wdOvsUU.png) ---- 步驟（一個 epoch）： * 隨機選取一組 network parameters * 挑選第 1 組 batch 計算 $L' = l^1 + l^31 + \cdots$ 根據 $L'$ 更新 parameters * 挑選第 2 組 batch 計算 $L'' = l^2 + l^16 + \cdots$ 根據 $L''$ 更新 parameters * 繼續往下挑選 batch，計算個別的 total loss，更新 parameters，直到所有 batches 都被挑選過挑選完一輪 batches 是一個 epoch，而要完成一個 learning 可能會需要好幾十個 epoch ---- 所以剛剛的 function calling： ```python=11 model.fit(x_train, y_train, batch_size=100, nb_epoch=20) ``` * `batch_size` 代表一個 mini-batch 中要有多少個 training data * `nb_epoch` 為這次的 learning 要做多少次的 epoch ---- ### Speed 在做 learning 時，一次 epoch 中會做好多次 parameters update，取決於 batch 數，所以總共的 parameters update 次數 $= batch\ 數目 \times epoch\ 數$ 假設 batch size 為 100，共有 100 個 batches，20 個 epoch，那麼總共就 update $100 \times 20 = 2000$ 次那麼如果今天把 batch size 設為 1，也就是 **Stochastic Gradian Descent** ，那就會有 10000 個batch，一個 epoch 中可以 update 參數 10000 次，**百倍於 Mini-Batch** 雖然 Stochastic 有 update 方向不穩定的缺點，但是*天下武功、唯快不破*，但他有百倍於 Mini-Batch 的機會達到 optimal 參數的位置，**Then, WHY MINI-BATCH？** ---- 看實測結果： ![](https://i.imgur.com/SNcFze6.png =500x) * 假設有 50000 個 examples * batch size = 1, 50000 updates in one epoch and takes **166s** * batch size = 10, 5000 updates in one epoch and takes **17s** 當 batch size = 1 做一個 epoch 時， batch size = 10 可以做將近十個 epoch，兩者的參數 update 次數幾乎相同，但 batch size = 10 卻更穩定、可以更快達到 optimal，所以 **Mini-Batch 勝！！！** ---- > 而 mini-batch 的效率卻要歸功於 GPU 的平行運算，使得 batch size = 1 一次只計算一個 example，跟 batch size = 10 一次算十個 examples 所花費的時間是相同的，大幅提升了效率導致 update 數也大幅上升。 > > ![](https://i.imgur.com/iSaMIB4.png =500x) > （對 GPU 來說，矩陣運算中每一個 elements 都是可以平行運算的，以矩陣運算來看的話，可以比較容易理解 GPU 對 learning 的幫助在何處） ---- > > 然而，平行運算並不是萬能，依舊受制於硬體極限，所以並非加大 batch size 都可以達到提升效能的目的， > > 而且，batch size 愈大，performance 也並非會更好，如果將 batch size 設到最大（也就是 Fully-Batch），那實際去 train 的時候，會很容易卡住（掉到 local minimal 或 saddle point），因為真實 training 的 model 往往不是 convex 的，有著許多的坑坑洞洞，很容易會掉進去，所以**如果 batch size 太大，performance 通常都很差** ---- > Stochastic 的特性是更新比較 random，所以今天如果在 train 的過程中掉進了 local minimal 或 saddle point，可能可以靠下一次的比較隨機的更新離開這個 gradian 為 0 的地方，所以我們**是需要一定的隨機性的** > batch size 的大或小，都會顯著的影響 training 的 speed and performance. > That's why we need mini-batch. ---- ### Others Function of Keras * Keras 支援我們將 model 儲存之後使用（[教學](https://faroit.github.io/keras-docs/2.0.2/getting-started/faq/#how-can-i-save-a-keras-model)） * How to use the neural network (testing): * case 1：如果我們今天有一組 testing set `x_test`，也有 testing set 的答案（label）`y_test`，那我們可以用 `evaluate`來計算我們 train 完的 model 的正確率 ```python= score = model.evaluate(x_test, y_test) print('Total loss on Testing Set:' score[0]) print('Accuracy of Testing Set:' score[1]) ``` * case 2：`predict` 實際使用 model，給予 input `x_test`，將答案輸出至 `result`（例如：給予一張圖片，model 回傳辨識的結果） ```python= result = model.predict(x_text) ``` --- ## ML Lecture 8-2 : Keras 2.0 因應 Keras 改版，function 參數給法有所改變，已經更新在 **ML Lecture 8-1** 的筆記內容 --- ## ML Lecture 8-1-1 : 實作 Handwriting Digit Recognition ---- 在安裝 tensorflow 時遇到了點困難，做了許多嘗試之後發現是我筆電的顯示卡規格不夠，前尚在嘗試能不能載舊版本的 tensorflow，不然的話可能就只能暫時使用 cpu 版的 --- > [time= Aug, 2018] ---