# Machine Learning Foundations
> A takes D and H to get g ≈ f
## L2 Learn to answer yes/no
### Perceptron Hypothesis Set
- $x^T=(1,x_1,x_2,...,x_d)$
$w^T=(-\text{threshold},w_1,w_2,...,w_d)$
$h(x)=sign(w^T\cdot x)$
- Perceptron = linear (binary) classifier
### Perceptron Learning Algorithm
- We want g to be as close as possible to the unknown f, at least on the data we have seen.
- The core idea of PLA is learning from mistakes: start from some $w_0$ and keep correcting errors to obtain a new $w_{t+1}$ (see the sketch below).
* $w_{t+1}=w_t+y_{n(t)}x_{n(t)}~~~~~$ if $~~~~~sign(w_t^Tx_{n(t)}) \neq y_{n(t)}$
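A minimal runnable sketch of the update rule above, assuming numpy; the synthetic separable data, the random seed, and the cyclic visiting order are illustrative assumptions, not part of the notes.

```python
import numpy as np

# Minimal PLA sketch on linearly separable synthetic data:
# cycle through the examples and correct every mistake found.
rng = np.random.default_rng(0)
N, d = 100, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # x_0 = 1
w_true = np.array([0.2, 1.0, -1.0])                          # hidden target weights
y = np.sign(X @ w_true)                                      # separable labels

w = np.zeros(d + 1)
updated = True
while updated:                                   # halts because the data is separable
    updated = False
    for n in range(N):
        if np.sign(X[n] @ w) != y[n]:            # mistake on (x_n, y_n)
            w += y[n] * X[n]                     # w_{t+1} = w_t + y_n x_n
            updated = True
print("mistakes left:", int(np.sum(np.sign(X @ w) != y)))
```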
### Guarantee of PLA (if the data is linearly separable, the following hold)
- $y_{n(t)}w_f^Tx_{n(t)}>0$, because the target weights $w_f$ classify every example correctly, so each label and its score have the same sign and their product is positive.
- $w_f^Tw_{t+1}=w_f^T(w_t+y_{n(t)}x_{n(t)})>w_f^Tw_t+0$ **so $w_t$ keeps getting more aligned with $w_f$**
- Mistake <-> $sign(w_t^Tx_{n(t)}) \neq y_{n(t)}$ <-> $y_{n(t)}w_t^Tx_{n(t)}\leq0$
$||w_{t+1}||^2=||w_t+y_{n(t)}x_{n(t)}||^2$
$= ||w_t||^2 +2y_{n(t)}w_t^Tx_{n(t)}+||y_{n(t)}x_{n(t)}||^2$
$\leq ||w_t||^2 + 0 +\max_n||y_nx_n||^2$ **so $||w_t||$ grows slowly: by at most $\max_n||x_n||^2$ per update**
- After $T$ corrections, $\frac{w_f^T}{||w_f||}\frac{w_T}{||w_T||}\geq\sqrt{T}\cdot \text{const}$; since this normalized inner product is at most 1, the number of updates $T$ is bounded.
### Non-Separable Data
* If the data is linearly separable, PLA halts.
* If the data is not linearly separable, finding the line with the fewest mistakes is NP-hard.
* **Pocket PLA**: after enough iterations it finds a reasonably good line (see the sketch below).
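A minimal pocket-PLA sketch under the same assumptions (numpy, synthetic data with a few flipped labels so it is not separable); the iteration budget of 2000 is an arbitrary illustrative choice.

```python
import numpy as np

# Pocket PLA sketch: run PLA-style updates on non-separable data,
# but keep the best-so-far weights in the "pocket".
def zero_one_error(w, X, y):
    return np.mean(np.sign(X @ w) != y)

rng = np.random.default_rng(0)
N, d = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = np.sign(X @ np.array([0.2, 1.0, -1.0]))
y[rng.choice(N, size=10, replace=False)] *= -1            # flip 10 labels: not separable

w = np.zeros(d + 1)
w_pocket, best = w.copy(), zero_one_error(w, X, y)
for _ in range(2000):
    mistakes = np.flatnonzero(np.sign(X @ w) != y)
    if mistakes.size == 0:
        break
    n = rng.choice(mistakes)                              # pick a random mistaken example
    w = w + y[n] * X[n]                                   # PLA update
    err = zero_one_error(w, X, y)
    if err < best:                                        # keep the best weights seen so far
        w_pocket, best = w.copy(), err
print("pocket E_in =", best)
```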
## L3 Types of learning
- Output Space
- Binary classification (yes/no questions)
- Multiclass classification (e.g., recognizing coins in a vending machine)
- Regression, possibly bounded (the output is a real number)
- Structured learning
- Label
- Supervised learning: every x is labeled with its y
- Unsupervised learning: e.g., clustering
- Semi-supervised learning (face recognition, drug testing): used when labels are expensive to obtain
- Reinforcement learning (ad systems, board games): learns from rewards and penalties on its previous outputs
- Protocol
- Batch: feed all the data to the machine at once
- Online: update with one example at a time
- Active: the machine actively asks questions (e.g., queries the label of a doubtful example)
- Input Space
- Concrete Features: defined by humans
- Raw Features: simply the data as it is stored
- Abstract Features
>Raw or abstract features require humans or machines to transform and extract useful features from them
## L4 Feasibility of Learning
- **Hoeffding Inequality:** $~P(|\mu -\nu|>\epsilon)~\le~2e^{-2\epsilon^2N}$
- The larger the sample size N, the closer $\nu$ is to $\mu$
- $\mu=\nu$ is PAC
- $P(|E_{in}(h) -E_{out}(h)|>\epsilon)~\le~2e^{-2\epsilon^2N}$
- In-sample error rate: $E_{in}(h)=\frac{1}{N}\sum\limits_{n=1}^N[[h(x_n)\neq y_n]]$
- Out-of-sample (true) error rate: $E_{out}(h)=\underset{x\sim P}{\mathcal{E}}~[[h(x)\neq f(x)]]$
- Goals of learning
- 1. $E_{in}(h)~\sim~E_{out}(h)$
- 2. $E_{in}(h)$ as small as possible
- BAD DATA
- For a single h, the **Hoeffding Inequality** guarantees that BAD datasets are rare.
- But when we get to choose among many hypotheses, the probability of hitting BAD data goes up.
- $~~~~~P_D[BAD~D$ for some $h]$
$\leq 2e^{-2\epsilon^2N}~+~2e^{-2\epsilon^2N}~+...+~2e^{-2\epsilon^2N}$
$\leq 2M~e^{-2\epsilon^2N}$
- **When M is finite**, $E_{in}(h)~\sim~E_{out}(h)$
- so we can focus on finding an h with small $E_{in}$ (a small simulation of the Hoeffding bound follows below)
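A small simulation sketch (not from the lecture) that empirically compares the bad-event frequency with the Hoeffding bound; the bin parameters `mu`, `N`, `eps` and the trial count are illustrative assumptions.

```python
import numpy as np

# Empirically check P(|mu - nu| > eps) <= 2 * exp(-2 * eps^2 * N)
# for a bin with true orange fraction mu, sampling N marbles per trial.
rng = np.random.default_rng(0)
mu, N, eps, trials = 0.4, 100, 0.1, 100_000

nu = rng.binomial(N, mu, size=trials) / N          # sample frequencies
empirical = np.mean(np.abs(nu - mu) > eps)         # observed bad-event rate
bound = 2 * np.exp(-2 * eps**2 * N)                # Hoeffding bound

print(f"empirical P(|mu-nu|>eps) = {empirical:.4f} <= bound {bound:.4f}")
```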
## L5 Training and Testing
- Goals of learning
- 1. $E_{in}(h)~\sim~E_{out}(h)$
- 2. $E_{in}(h)$ as small as possible
- What if $|\mathcal{H}|=M$ is infinite ?
- When the hypotheses in $\mathcal{H}$ are similar, their $P(BAD$ events$)$ overlap
- $P(|E_{in}(g) -E_{out}(g)|>\epsilon)\le2~effect(N)~e^{-2\epsilon^2N}$
- $\mathcal{H} = \{$all lines in $R^2\}$
- 1 point: 2 kinds
- 2 points: 4 kinds
- 3 points: at most 8 kinds (only 6 if the three points are collinear)
- 4 points: at most 14 kinds
- **Dichotomies**
- Describes **how many distinct hypotheses** there effectively are, as a function of the number of data points $N$.
- **A dichotomy**: a hypothesis "limited" to the eyes of $x_1,x_2,...,x_N$
- $h(x_1,x_2,...,x_N)=(h(x_1),h(x_2),...,h(x_N))∈\{×,◦\}^N$
- **Dichotomies Set**:|$\mathcal{H}(x_1,x_2,...,x_N)$|
- Bounded by $2^N$
- Depends on input.
- **Growth function** (see the brute-force check after this list):
- $m_\mathcal{H}(N) = \max\limits_{x_1,x_2,...,x_N}~|\mathcal{H}(x_1,x_2,...,x_N)|$
- Example
- Positive Rays : $N+1$
- Positive Intervals : $\frac{1}{2}N^2+\frac{1}{2}N+1$
- Convex Sets : $2^N$
- 2D perceptrons : $m_\mathcal{H}(N)<2^N$
- **Break Point:**
- $k$ satisfies $m_\mathcal{H}(k)<2^k$: no set of $k$ inputs can be shattered
- No break point : $m_\mathcal{H}(N)=2^N$
- Break point k : $m_\mathcal{H}(N)=O(N^{k-1})$
- Example
- Positive Rays : 2
- Positive Intervals : 3
- Convex Sets : No
- 2D perceptrons : 4
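A brute-force sketch (my own illustration, not from the lecture) that counts the dichotomies of positive rays on a few points and checks the $N+1$ formula above; the sample points are arbitrary.

```python
# Count the dichotomies of "positive rays" h(x) = sign(x - a) on N points.
def positive_ray_dichotomies(points):
    points = sorted(points)
    # candidate thresholds: below all points, between consecutive points, above all
    thresholds = (
        [points[0] - 1]
        + [(points[i] + points[i + 1]) / 2 for i in range(len(points) - 1)]
        + [points[-1] + 1]
    )
    return {tuple(+1 if x > a else -1 for x in points) for a in thresholds}

pts = [0.3, 1.2, 2.5, 4.0, 5.1]
print(len(positive_ray_dichotomies(pts)), "==", len(pts) + 1)   # 6 == 6
```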
## L6 Theory of Generalization
- Given a break point k, the growth function $m_\mathcal{H}(N)$ cannot grow too fast
- k=2
- N=1,$m_\mathcal{H}(N)$=2
- N=2,$m_\mathcal{H}(N)$=3
- N=3,$m_\mathcal{H}(N)$=4
- Bounding function, $B(N, k)=\sum\limits_{i=0}^{k-1}\binom{N}{i}$
- The maximum possible $m_\mathcal{H}(N)$ for $N$ points when the break point is $k$ (a small numeric check appears after this list)
- Conclusions
- $m_\mathcal{H}(N) ≤ B(N, k) ≤ N^{k−1}$
- Intuitively: $~P(|E_{in}(g) -E_{out}(g)|>\epsilon)\le2~m_\mathcal{H}(N)~e^{-2\epsilon^2N}$
- Actually (the VC bound): $~P(|E_{in}(g) -E_{out}(g)|>\epsilon)\le4~(2N)^{k-1}~e^{-\frac{1}{8}\epsilon^2N}$
- So generalization works when
- 1. $m_\mathcal{H}(N)$ breaks at $k$
- 2. $N$ is large enough
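A small sketch (not from the lecture) that evaluates the bounding function through the recursion $B(N,k)\leq B(N-1,k)+B(N-1,k-1)$ and compares it with the binomial-sum form above; treating the recursion as an equality here is an assumption that happens to agree with the closed form.

```python
from math import comb
from functools import lru_cache

# Bounding function via the recursion (taken as an equality for this check).
@lru_cache(maxsize=None)
def B(N, k):
    if k == 1:
        return 1              # no two dichotomies may differ -> only 1 allowed
    if N < k:
        return 2 ** N         # break point not reached: all dichotomies allowed
    return B(N - 1, k) + B(N - 1, k - 1)

for N in range(1, 8):
    assert B(N, 3) == sum(comb(N, i) for i in range(3))
print("B(10, 3) =", B(10, 3))   # 56 = C(10,0) + C(10,1) + C(10,2)
```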
## L7 The VC Dimension
- **VC dimension** $d_{VC}(\mathcal{H})$ :
- $d_{VC} = k-1$ (one less than the minimum break point): there exist $d_{VC}$ inputs that can be shattered
- $N≤d_{VC}~~\rightarrow~~$ some $N$ inputs can be shattered
- $k>d_{VC}~~~\rightarrow~~$ $k$ is a break point
- Example
- Positive Rays : $~~~~~~~d_{VC}(\mathcal{H})=1$
- Positive Intervals : $d_{VC}(\mathcal{H})=2$
- Convex Sets : $~~~~~~d_{VC}(\mathcal{H})=\infty$
- 2D perceptrons : $d_{VC}(\mathcal{H})=3$
- **Looseness of VC Bound**
- **Hoeffding for unknown Eout**
- any distribution, any target
- **$m_\mathcal{H}(N)$ instead of $|\mathcal{H}(x_1,...,x_N)|$**
- any data
- **$N^{d_{vc}}$ instead of $m_\mathcal{H}(N)$**
- any $\mathcal{H}$ of same $d_{vc}$
- **Union bound on worst cases**
- any choice made by $\mathcal{A}$
- Perceptron :$d_{VC}=d+1$
- $d_{VC}≥d+1$
- Pick the $d+1$ points $X=\begin{bmatrix}1&0&0&\dots&0\\1&1&0&\dots&0\\1&0&1&\dots&0\\\vdots&&&\ddots&\\1&0&0&\dots&1\end{bmatrix}$; $X$ is invertible, so for any $y$ we can take $w=X^{-1}y$ and get $sign(Xw)=y$
- $d_{VC}≤d+1$
- Any $d+2$ points are linearly dependent, $x_{d+2}=a_1x_1+a_2x_2+...+a_{d+1}x_{d+1}$, so one dichotomy on them can never be generated (they cannot be shattered)
- **Degrees of Freedom** :
- Parameters $w=(w_0,w_1,···,w_d)$ : can be viewed as $d+1$ degrees of freedom
- Positive Rays : $~~~~~~~d_{VC}(\mathcal{H})=1,|free~parameter| = 1$
- Positive Intervals : $d_{VC}(\mathcal{H})=2,|free~parameter| = 2$
- **TradeOff**
- small $d_{VC}$
- $E_{in}(h)~\sim~E_{out}(h)$, but $E_{in}(h)$ may be large
- large $d_{VC}$
- $E_{in}(h)~\nsim~E_{out}(h)$, but $E_{in}(h)$ can be small
- Summary
- $P(|E_{in}(g) -E_{out}(g)|>\epsilon)\le4~(2N)^{d_{vc}}~e^{-\frac{1}{8}\epsilon^2N}=\delta$
- $\epsilon$ and $\delta$ are in one-to-one correspondence
- $\delta=4~(2N)^{d_{vc}}~e^{-\frac{1}{8}\epsilon^2N}$
- $\epsilon=\sqrt{\frac{8}{N}ln\frac{4~(2N)^{d_{vc}}}{\delta}}$
- Given a tolerated bad-event probability $\delta$, with probability at least $1-\delta$:
- $P(|E_{in}(g) -E_{out}(g)|\leq\epsilon)\geq 1-\delta$
- $E_{in}(g)-\sqrt{\frac{8}{N}ln\frac{4~(2N)^{d_{vc}}}{\delta}}~\leq~E_{out}(g)~\leq~E_{in}(g)+\sqrt{\frac{8}{N}ln\frac{4~(2N)^{d_{vc}}}{\delta}}$
- In theory: $N≈10{,}000~d_{VC}$
- In practice: $N≈10~d_{VC}$ (a numeric sketch of the bound follows below)
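A numeric sketch (not from the lecture) that plugs a few sample sizes into $\epsilon=\sqrt{\frac{8}{N}\ln\frac{4(2N)^{d_{vc}}}{\delta}}$; the choices $d_{vc}=3$ (2D perceptrons) and $\delta=0.1$ are illustrative.

```python
import math

# Evaluate the VC generalization bound for a few sample sizes.
def vc_epsilon(N, dvc, delta):
    return math.sqrt(8.0 / N * math.log(4 * (2 * N) ** dvc / delta))

dvc, delta = 3, 0.1            # e.g. 2D perceptrons, 90% confidence
for N in (100, 1_000, 10_000, 100_000):
    print(f"N = {N:>7}: eps <= {vc_epsilon(N, dvc, delta):.3f}")
```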
## L8
## L9 Linear Regression
- **Problem setup**
- $(x_1,2), (x_2,-3), ..., (x_N,7)$
- $h(x) = w^Tx$
- $E_{in}(w)=\frac{1}{N}||X~w-y||^2=\frac{1}{N}(w^TX^TXw-2w^TX^Ty+y^Ty)$
- $Squared~Error=(\hat{y}−y)^2$
- $w=(X^TX)^{-1} X^Ty$
- **Linear Regression algorithm** (see the sketch after this list)
- 1.Construct $\underset{N\times (d+1)}{X}=\begin{bmatrix} -~-x_1^T-~-\\-~-x_2^T-~-\\ \vdots \\-~-x_N^T-~-\end{bmatrix}$, $~y=\begin{bmatrix}y_1\\y_2\\ \vdots \\y_N\end{bmatrix}$
- 2.Calculate pseudo-inverse $\underset{(d+1)\times N}{X^†}=(X^TX)^{-1} X^T$
- 3.$w_{LIN}=X^†y$
- 4.$h(x)=w_{LIN}^Tx$
- **Noise**
- $I-H$ transforms $y$ to $y-\hat{y}$
- $E_{in}$ = noise level · $(1 − \frac{d+1}{N})$ on average
- $E_{out}$ = noise level · $(1 + \frac{d+1}{N})$ on average
- **Classification vs Regression**
- **$err_{0/1} \leq err_{sqr}$**
- So regression can be used to solve classification
- More efficient, but the bound is looser
- Common practice: run regression to get $w_{LIN}$, then refine it with pocket
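A minimal sketch of the algorithm above on synthetic data, assuming numpy; `np.linalg.pinv` is used for the pseudo-inverse $X^\dagger$, and the noise level and seed are arbitrary illustrative choices.

```python
import numpy as np

# Linear regression via the pseudo-inverse: w_LIN = X† y.
rng = np.random.default_rng(0)
N, d = 200, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])    # add x_0 = 1
w_true = rng.normal(size=d + 1)
y = X @ w_true + 0.1 * rng.normal(size=N)                     # noisy targets

w_lin = np.linalg.pinv(X) @ y                                 # w_LIN = X† y
E_in = np.mean((X @ w_lin - y) ** 2)                          # squared in-sample error
print("E_in =", E_in)
```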
## L10 Logistic Regression
- **Problem setup**
- $(x_1,1), (x_2,0), ..., (x_N,1)$
- $h(x)=\theta(w^Tx)=\frac{1}{1+exp(-w^Tx)}$
- **Likelihood(h)**: in our data, a label of 1 means $y_i=+1$ and a label of 0 means $y_i=-1$
- $\theta(y_1w^Tx_1)\times \theta(y_2w^Tx_2)\times ...\times \theta(y_Nw^Tx_N)$, the larger the better
- **Cross Entropy Error**
- $\frac{1}{N}\sum\limits_{n=1}^N-ln~\theta(y_nw^Tx_n)$, the smaller the better
- $E_{in}(w)=~\frac{1}{N}\sum\limits_{n=1}^Nln(1+exp(-y_nw^Tx_n))$, the smaller the better
> cross entropy = $ln(1+exp(-yw^Tx))$
> cross entropy = $-\sum p(x)ln(q(x))=-\sum y~ln(\frac{1}{1+exp(-w^Tx)})$
- **Conclusion: find w such that**
- $E_{in}(w)=~\frac{1}{N}\sum\limits_{n=1}^Nln(1+exp(-y_nw^Tx_n))$ is as small as possible
- **Logistic Regression algorithm** (gradient descent; see the sketch after this list)
- 0.Initialize $w_0$
- 1.Compute $~~~~~~~~~~\nabla E_{in}(w_t)=\frac{1}{N}\sum\limits_{n=1}^N \theta(-y_nw_t^Tx_n)(-y_nx_n)$
- 2.Update $~~~~~~~~~~~~~w_{t+1}\leftarrow w_t-\eta\nabla E_{in}(w_t)$
- Repeat steps 1 and 2 until $\nabla E_{in}(w_t)\approx 0$ or enough iterations have run
- 3.Return $h(x)=\theta(w^Tx)=\frac{1}{1+exp(-w^Tx)}$
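A minimal gradient-descent sketch of the algorithm above, assuming numpy; the synthetic data, the learning rate $\eta=0.1$, and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

# Logistic regression trained by fixed-learning-rate gradient descent.
def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
N, d = 300, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])       # x_0 = 1
w_true = np.array([0.5, 2.0, -1.5])
y = np.where(rng.random(N) < sigmoid(X @ w_true), 1.0, -1.0)    # labels in {+1, -1}

w, eta = np.zeros(d + 1), 0.1
for t in range(2000):
    # gradient of E_in: mean of theta(-y_n w^T x_n) * (-y_n x_n)
    grad = np.mean(sigmoid(-y * (X @ w))[:, None] * (-y[:, None] * X), axis=0)
    w -= eta * grad                                             # gradient step

E_in = np.mean(np.log(1.0 + np.exp(-y * (X @ w))))              # cross-entropy error
print("cross-entropy E_in =", E_in)
```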
## L11
## L12 Nonlinear feature transform
- **Nonlinear feature transform** $\Phi$: $x∈X \rightarrow z∈Z$
- Circle: $\Phi(x)=(1,x_1^2,x_2^2)$
- General quadratic curves: $\Phi(x)=(1,x_1,x_2,x_1^2,x_1x_2,x_2^2)$
- **How to use?**
- Transform $\{(x_n, y_n)\}$ into $\{(z_n=\Phi(x_n), y_n)\}$
- Classify in $Z$-space to get a perceptron $\tilde{w}$
- Then $g(x)=sign(\tilde{w}^T\Phi(x))$ (see the sketch after this list)
- Q-th order polynomial on $x=(x_1,x_2,...,x_d)$
- $\Phi_Q(x)=(1,x_1,x_2,...,x_d,~...,~x_1^Q,x_1^{Q-1}x_2,...,x_d^Q)$
- **Dimension** = $1+\tilde{d}=\binom{Q+d}{Q}=\binom{Q+d}{d}=O(Q^d)$
- $\mathcal{H}_{\Phi_0}⊂\mathcal{H}_{\Phi_1}⊂\mathcal{H}_{\Phi_2}⊂...⊂\mathcal{H}_{\Phi_Q}$
- $d_{vc}(\mathcal{H}_0)≤d_{vc}(\mathcal{H}_1)≤d_{vc}(\mathcal{H}_2)≤...≤d_{vc}(\mathcal{H}_Q)$
- $E_{in}(g_0)≥E_{in}(g_1)≥E_{in}(g_2)≥...≥E_{in}(g_Q)$
- **$\Phi$ shall be decided without "peeking at" the data** (the transforms below look simpler only because they were chosen after looking at the data)
- $z = (1, x_1, x_2, x^2_1 , x_1x_2, x^2_2 )$
- $z = (1, x^2_1 , x^2_2 )$
- $z = (1, x^2_1 + x^2_2 )$
- $z = sign(0.6 − x^2_1 − x^2_2 )$
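A minimal sketch (not from the lecture) of the 2nd-order transform followed by a linear model in $Z$-space; the circular toy data and the use of linear regression plus `sign` instead of PLA are illustrative choices.

```python
import numpy as np

# 2nd-order transform Phi(x) = (1, x1, x2, x1^2, x1*x2, x2^2),
# then a linear model in Z-space.
def phi2(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.6, 1.0, -1.0)    # circular boundary

Z = phi2(X)                                 # transform to Z-space
w_tilde = np.linalg.pinv(Z) @ y             # linear regression in Z-space
g = np.sign(Z @ w_tilde)                    # g(x) = sign(w~^T Phi(x))
print("E_in (0/1) =", np.mean(g != y))
```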
## L13 Hazard of Overfitting
- **Definition:**
- Bad generalization: does well on the points it has seen, but has large error on the others.
- Overfitting: $E_{in}$ keeps decreasing while $E_{out}$ keeps increasing.
- Underfitting: $E_{in}$ keeps increasing and $E_{out}$ increases with it.
- **Target is a 10th-order polynomial plus noise**, learned with $\mathcal{H}_2$ and $\mathcal{H}_{10}$
- $g_2$ has the better $E_{out}$, **because the data set is not large enough**
- Sometimes stepping back is the way forward
- Being too clever can backfire
- **Target is a 50th-order polynomial**, learned with $\mathcal{H}_2$ and $\mathcal{H}_{10}$
- $g_2$ still has the better $E_{out}$
- **because the target's complexity itself acts like noise**
- **Overfitting complexity experiments**
- Target-function parameters: data size $N$, noise level $\sigma^2$, complexity level $Q_f$
- Measure of how badly we overfit: $E_{out}(g_{10})-E_{out}(g_2)$
- **Causes of overfitting**
- **Stochastic noise: large $\sigma^2$**
- **Deterministic noise: complex target $Q_f$**
- **Small data size $N$**
- **Excessive power**
- **Ways to deal with overfitting**
- **1.Start from a simple model**
- **2.Data cleaning/pruning**
- correct the label
- remove the example
- **3.Data hinting: exploit hints to get more data**
- add virtual examples
- the virtual examples are no longer i.i.d., which introduces bias
- **4.Regularization: step on the brakes**
- **5.Validation**
## L14 Regularization
- **A little regularization goes a long way!**
- **Stepping Back**: keep the model complexity from getting too high
- $\mathcal{H}_{10}=\{w\in\mathcal{R}^{11}\}$
- $\mathcal{H}_{2}=\{w\in\mathcal{R}^{11}~|~w_3=w_4=...=w_{10}=0\}$
- $\mathcal{H}_{2}^{'}=\{w\in\mathcal{R}^{11}~|~\#(w_i=0)=8\}~~$ **NP-hard to solve**
- $\mathcal{H}(C)=\{w\in\mathcal{R}^{11}~|~~~||w||^2\leq C\}~~$ **a similar, softer constraint**
- **Lagrange Multiplier**
- $\min\limits_{w\in\mathcal{R}^{Q+1}}~E_{in}(w)=\frac{1}{N}\sum\limits_{n=1}^{N}(w^Tz_n-y_n)^2~~~$ s.t. $\sum\limits_{q=0}^Qw_q^2\leq C$
- $\min\limits_{w\in\mathcal{R}^{Q+1}}~E_{in}(w)+\frac{\lambda}{N}w^Tw$
- $\lambda~\geq 0$ is called the **Lagrange Multiplier** and corresponds one-to-one with C
- $w^Tw$ is called the **regularizer**
- $E_{in}(w)+\frac{\lambda}{N}w^Tw$ is called the **augmented error**
- $\frac{2}{N}(Z^TZw-Z^Ty)+\frac{2\lambda}{N}w=0$
- $w_{REG}=(Z^TZ+\lambda I)^{-1}Z^Ty$, known as **ridge regression** (see the sketch after this list)
- **Legendre Polynomials: a better-behaved basis for regularization**
- $\Phi(x)=(1,L_1(x),L_2(x),...,L_Q(x))$
- **General Regularizers**
- Symmetry : $\sum[[q~is~odd]]w_q^2$
- Plausible $~$: $\sum|w_q|$,**sparsity**
- Friendly $~~~$: $\sum w_q^2$,**convex, differentiable, easy to optimize**
- Choosing $\lambda$
- More noise $\rightarrow$ a larger $\lambda$ is needed to bring $E_{out}$ down
- In practice, try several values (e.g., via validation)
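A minimal ridge-regression sketch, $w_{REG}=(Z^TZ+\lambda I)^{-1}Z^Ty$, assuming numpy; the sine target, the 10th-order polynomial features, and $\lambda=0.01$ are illustrative assumptions.

```python
import numpy as np

# Ridge regression on polynomially transformed synthetic data.
rng = np.random.default_rng(0)
N, Q, lam = 30, 10, 0.01
x = rng.uniform(-1, 1, size=N)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=N)        # noisy target

Z = np.vander(x, Q + 1, increasing=True)                 # (1, x, x^2, ..., x^Q)
w_reg = np.linalg.solve(Z.T @ Z + lam * np.eye(Q + 1), Z.T @ y)
E_in = np.mean((Z @ w_reg - y) ** 2)
print("E_in with lambda =", lam, ":", E_in)
```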
## L15 Validation
- **Model Selection Problem $\rightarrow$ handled via Validation**
- Given: $H_1,H_2,...,H_M$ with corresponding algorithms $A_1,A_2,...,A_M$
- Goal: minimize $E_{out}(g_{m^{∗}})$
- Model Selection by $E_{in}~$ : overfitting / bad generalization
- Model Selection by $E_{test}$: infeasible / cheating
- Model Selection by $E_{val}$ : legal cheating, using a held-out chunk of data that stays uncontaminated
- **Single Validation**
- $E_{out}(g)~~\underset{small~K}{\approx}~~E_{out}(g^{-})~~\underset{large~K}{\approx}~~E_{val}(g^{-})$
- Rule of thumb: $K = \frac{N}{5}$
- **Cross Validation**
- **Leave-One-Out**
- $E_{loocv}(H,A)=\frac{1}{N}\sum\limits_{n=1}^Ne_n=\frac{1}{N}\sum\limits_{n=1}^Nerr(g^−_n(x_n),y_n)$
- $E_{loocv}(H,A)\approx E_{out}(g^-)\approx E_{out}(g)$
- **V-Fold Cross Validation** (see the sketch after this list)
- Works well in practice; going all the way to Leave-One-Out is usually unnecessary
- Rule of thumb: $V=10$
- **Validation results are still somewhat optimistic compared with testing results**
- Training : **select among hypotheses**
- Validation: **select among finalists**
- Testing  : **evaluate**
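A minimal V-fold cross-validation sketch (not from the lecture) that picks $\lambda$ for ridge regression; the candidate $\lambda$ values, the synthetic data, and $V=10$ folds are illustrative assumptions.

```python
import numpy as np

# V-fold cross validation for choosing the ridge-regression lambda.
def ridge_fit(Z, y, lam):
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

rng = np.random.default_rng(0)
N, V = 100, 10
x = rng.uniform(-1, 1, size=N)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=N)
Z = np.vander(x, 11, increasing=True)                    # 10th-order features

folds = np.array_split(rng.permutation(N), V)            # V disjoint validation folds
for lam in (0.0, 0.01, 0.1, 1.0):
    errs = []
    for v in range(V):
        val = folds[v]
        train = np.hstack([folds[u] for u in range(V) if u != v])
        w = ridge_fit(Z[train], y[train], lam)
        errs.append(np.mean((Z[val] @ w - y[val]) ** 2))  # validation error on fold v
    print(f"lambda = {lam:>5}: E_cv = {np.mean(errs):.4f}")
```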
## L16 3 Principles
- **Occam’s Razor**
- An explanation should be as simple as possible.
- **Sampling Bias**
- If data is biased, learning will be biased.
- **Data Snooping**
- If a data set has affected any step of the learning process, its ability to assess the outcome has been compromised.