# Machine Learning Foundations
> A takes D and H to get g ≈ f
## L2 Learn to answer yes/no
### Perceptron Hypothesis Set
- $x^T=(1,x_1,x_2,...,x_d)$
$w^T=(-\text{threshold},w_1,w_2,...,w_d)$
$h(x)=sign(w^T\cdot x)$
- Perceptron = linear (binary) classifier
### Perceptron Learning Algorithm
- We want g to be as close as possible to the unknown f, at least on the data we have seen.
- The core idea of PLA is learning from mistakes: start from some $w_0$ and keep correcting errors to obtain a new $w_{t+1}$ (see the sketch below).
* $w_{t+1}=w_t+y_{n(t)}x_{n(t)}~~~~~$ if $~~~~~sign(w_t^Tx_{n(t)}) \neq y_{n(t)}$
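A minimal runnable sketch of the update rule above, assuming numpy; the synthetic separable data, the random seed, and the cyclic visiting order are illustrative assumptions, not part of the notes.

```python
import numpy as np

# Minimal PLA sketch on linearly separable synthetic data:
# cycle through the examples and correct every mistake found.
rng = np.random.default_rng(0)
N, d = 100, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # x_0 = 1
w_true = np.array([0.2, 1.0, -1.0])                          # hidden target weights
y = np.sign(X @ w_true)                                      # separable labels

w = np.zeros(d + 1)
updated = True
while updated:                                   # halts because the data is separable
    updated = False
    for n in range(N):
        if np.sign(X[n] @ w) != y[n]:            # mistake on (x_n, y_n)
            w += y[n] * X[n]                     # w_{t+1} = w_t + y_n x_n
            updated = True
print("mistakes left:", int(np.sum(np.sign(X @ w) != y)))
```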
### Guarantee of PLA (if the data is linearly separable, the following hold)
- $y_{n(t)}w_f^Tx_{n(t)}>0$, because the target weights $w_f$ classify every example correctly, so each label and its score have the same sign and their product is positive.
- $w_f^Tw_{t+1}=w_f^T(w_t+y_{n(t)}x_{n(t)})>w_f^Tw_t+0$ **so $w_t$ keeps getting more aligned with $w_f$**
- Mistake <-> $sign(w_t^Tx_{n(t)}) \neq y_{n(t)}$ <-> $y_{n(t)}w_t^Tx_{n(t)}\leq0$
$||w_{t+1}||^2=||w_t+y_{n(t)}x_{n(t)}||^2$
$= ||w_t||^2 +2y_{n(t)}w_t^Tx_{n(t)}+||y_{n(t)}x_{n(t)}||^2$
$\leq ||w_t||^2 + 0 +\max_n||y_nx_n||^2$ **so $||w_t||$ grows slowly: by at most $\max_n||x_n||^2$ per update**
- After $T$ corrections, $\frac{w_f^T}{||w_f||}\frac{w_T}{||w_T||}\geq\sqrt{T}\cdot \text{const}$; since this normalized inner product is at most 1, the number of updates $T$ is bounded.
### Non-Separable Data
* If the data is linearly separable, PLA halts.
* If the data is not linearly separable, finding the line with the fewest mistakes is NP-hard.
* **Pocket PLA**: after enough iterations it finds a reasonably good line (see the sketch below).
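A minimal pocket-PLA sketch under the same assumptions (numpy, synthetic data with a few flipped labels so it is not separable); the iteration budget of 2000 is an arbitrary illustrative choice.

```python
import numpy as np

# Pocket PLA sketch: run PLA-style updates on non-separable data,
# but keep the best-so-far weights in the "pocket".
def zero_one_error(w, X, y):
    return np.mean(np.sign(X @ w) != y)

rng = np.random.default_rng(0)
N, d = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = np.sign(X @ np.array([0.2, 1.0, -1.0]))
y[rng.choice(N, size=10, replace=False)] *= -1            # flip 10 labels: not separable

w = np.zeros(d + 1)
w_pocket, best = w.copy(), zero_one_error(w, X, y)
for _ in range(2000):
    mistakes = np.flatnonzero(np.sign(X @ w) != y)
    if mistakes.size == 0:
        break
    n = rng.choice(mistakes)                              # pick a random mistaken example
    w = w + y[n] * X[n]                                   # PLA update
    err = zero_one_error(w, X, y)
    if err < best:                                        # keep the best weights seen so far
        w_pocket, best = w.copy(), err
print("pocket E_in =", best)
```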
## L3 Types of learning
- Output Space
- Binary classification (yes/no questions)
- Multiclass classification (e.g., recognizing coins in a vending machine)
- Regression, possibly bounded (the output is a real number)
- Structured learning
- Label
- Supervised learning: every x is labeled with its y
- Unsupervised learning: e.g., clustering
- Semi-supervised learning (face recognition, drug testing): used when labels are expensive to obtain
- Reinforcement learning (ad systems, board games): learns from rewards and penalties on its previous outputs
- Protocol
- Batch: feed all the data to the machine at once
- Online: update with one example at a time
- Active: the machine actively asks questions (e.g., queries the label of a doubtful example)
- Input Space
- Concrete Features: defined by humans
- Raw Features: simply the data as it is stored
- Abstract Features
>Raw or abstract features require humans or machines to transform and extract useful features from them
## L4 Feasibility of Learning
- **Hoeffding Inequality:** $~P(|\mu -\nu|>\epsilon)~\le~2e^{-2\epsilon^2N}$
- The larger the sample size N, the closer $\nu$ is to $\mu$
- $\mu=\nu$ is PAC
- $P(|E_{in}(h) -E_{out}(h)|>\epsilon)~\le~2e^{-2\epsilon^2N}$
- In-sample error rate: $E_{in}(h)=\frac{1}{N}\sum\limits_{n=1}^N[[h(x_n)\neq y_n]]$
- Out-of-sample (true) error rate: $E_{out}(h)=\underset{x\sim P}{\mathcal{E}}~[[h(x)\neq f(x)]]$
- Goals of learning
- 1. $E_{in}(h)~\sim~E_{out}(h)$
- 2. $E_{in}(h)$ as small as possible
- BAD DATA
- For a single h, the **Hoeffding Inequality** guarantees that BAD datasets are rare.
- But when we get to choose among many hypotheses, the probability of hitting BAD data goes up.
- $~~~~~P_D[BAD~D$ for some $h]$
$\leq 2e^{-2\epsilon^2N}~+~2e^{-2\epsilon^2N}~+...+~2e^{-2\epsilon^2N}$
$\leq 2M~e^{-2\epsilon^2N}$
- **When M is finite**, $E_{in}(h)~\sim~E_{out}(h)$
- so we can focus on finding an h with small $E_{in}$ (a small simulation of the Hoeffding bound follows below)
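A small simulation sketch (not from the lecture) that empirically compares the bad-event frequency with the Hoeffding bound; the bin parameters `mu`, `N`, `eps` and the trial count are illustrative assumptions.

```python
import numpy as np

# Empirically check P(|mu - nu| > eps) <= 2 * exp(-2 * eps^2 * N)
# for a bin with true orange fraction mu, sampling N marbles per trial.
rng = np.random.default_rng(0)
mu, N, eps, trials = 0.4, 100, 0.1, 100_000

nu = rng.binomial(N, mu, size=trials) / N          # sample frequencies
empirical = np.mean(np.abs(nu - mu) > eps)         # observed bad-event rate
bound = 2 * np.exp(-2 * eps**2 * N)                # Hoeffding bound

print(f"empirical P(|mu-nu|>eps) = {empirical:.4f} <= bound {bound:.4f}")
```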
## L5 Training and Testing
- Goals of learning
- 1. $E_{in}(h)~\sim~E_{out}(h)$
- 2. $E_{in}(h)$ as small as possible
- What if $|\mathcal{H}|=M$ is infinite ?
- When the hypotheses in $\mathcal{H}$ are similar, their $P(BAD$ events$)$ overlap
- $P(|E_{in}(g) -E_{out}(g)|>\epsilon)\le2~effect(N)~e^{-2\epsilon^2N}$
- $\mathcal{H} = \{$all lines in $R^2\}$
- 1 point: 2 kinds
- 2 points: 4 kinds
- 3 points: at most 8 kinds (only 6 if the three points are collinear)
- 4 points: at most 14 kinds
- **Dichotomies**
- Describes **how many distinct hypotheses** there effectively are, as a function of the number of data points $N$.
- **A dichotomy**: a hypothesis "limited" to the eyes of $x_1,x_2,...,x_N$
- $h(x_1,x_2,...,x_N)=(h(x_1),h(x_2),...,h(x_N))∈\{×,◦\}^N$
- **Dichotomies Set**:|$\mathcal{H}(x_1,x_2,...,x_N)$|
- Bounded by $2^N$
- Depends on input.
- **Growth function** (see the brute-force check after this list):
- $m_\mathcal{H}(N) = \max\limits_{x_1,x_2,...,x_N}~|\mathcal{H}(x_1,x_2,...,x_N)|$
- Example
- Positive Rays : $N+1$
- Positive Intervals : $\frac{1}{2}N^2+\frac{1}{2}N+1$
- Convex Sets : $2^N$
- 2D perceptrons : $m_\mathcal{H}(N)<2^N$
- **Break Point:**
- $k$ satisfies $m_\mathcal{H}(k)<2^k$: no set of $k$ inputs can be shattered
- No break point : $m_\mathcal{H}(N)=2^N$
- Break point k : $m_\mathcal{H}(N)=O(N^{k-1})$
- Example
- Positive Rays : 2
- Positive Intervals : 3
- Convex Sets : No
- 2D perceptrons : 4
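A brute-force sketch (my own illustration, not from the lecture) that counts the dichotomies of positive rays on a few points and checks the $N+1$ formula above; the sample points are arbitrary.

```python
# Count the dichotomies of "positive rays" h(x) = sign(x - a) on N points.
def positive_ray_dichotomies(points):
    points = sorted(points)
    # candidate thresholds: below all points, between consecutive points, above all
    thresholds = (
        [points[0] - 1]
        + [(points[i] + points[i + 1]) / 2 for i in range(len(points) - 1)]
        + [points[-1] + 1]
    )
    return {tuple(+1 if x > a else -1 for x in points) for a in thresholds}

pts = [0.3, 1.2, 2.5, 4.0, 5.1]
print(len(positive_ray_dichotomies(pts)), "==", len(pts) + 1)   # 6 == 6
```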
## L6 Theory of Generalization
- Given a break point k, the growth function $m_\mathcal{H}(N)$ cannot grow too fast
- k=2
- N=1,$m_\mathcal{H}(N)$=2
- N=2,$m_\mathcal{H}(N)$=3
- N=3,$m_\mathcal{H}(N)$=4
- Bounding function, $B(N, k)=\sum\limits_{i=0}^{k-1}\binom{N}{i}$
- The maximum possible $m_\mathcal{H}(N)$ for $N$ points when the break point is $k$ (a small numeric check appears after this list)
- Conclusions
- $m_\mathcal{H}(N) ≤ B(N, k) ≤ N^{k−1}$
- Intuitively: $~P(|E_{in}(g) -E_{out}(g)|>\epsilon)\le2~m_\mathcal{H}(N)~e^{-2\epsilon^2N}$
- Actually (the VC bound): $~P(|E_{in}(g) -E_{out}(g)|>\epsilon)\le4~(2N)^{k-1}~e^{-\frac{1}{8}\epsilon^2N}$
- So generalization works when
- 1. $m_\mathcal{H}(N)$ breaks at $k$
- 2. $N$ is large enough
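A small sketch (not from the lecture) that evaluates the bounding function through the recursion $B(N,k)\leq B(N-1,k)+B(N-1,k-1)$ and compares it with the binomial-sum form above; treating the recursion as an equality here is an assumption that happens to agree with the closed form.

```python
from math import comb
from functools import lru_cache

# Bounding function via the recursion (taken as an equality for this check).
@lru_cache(maxsize=None)
def B(N, k):
    if k == 1:
        return 1              # no two dichotomies may differ -> only 1 allowed
    if N < k:
        return 2 ** N         # break point not reached: all dichotomies allowed
    return B(N - 1, k) + B(N - 1, k - 1)

for N in range(1, 8):
    assert B(N, 3) == sum(comb(N, i) for i in range(3))
print("B(10, 3) =", B(10, 3))   # 56 = C(10,0) + C(10,1) + C(10,2)
```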
## L7 The VC Dimension
- **VC dimension** $d_{VC}(\mathcal{H})$ :
- $d_{VC} = k-1$ (one less than the minimum break point): there exist $d_{VC}$ inputs that can be shattered
- $N≤d_{VC}~~\rightarrow~~$ some $N$ inputs can be shattered
- $k>d_{VC}~~~\rightarrow~~$ $k$ is a break point
- Example
- Positive Rays : $~~~~~~~d_{VC}(\mathcal{H})=1$
- Positive Intervals : $d_{VC}(\mathcal{H})=2$
- Convex Sets : $~~~~~~d_{VC}(\mathcal{H})=\infty$
- 2D perceptrons : $d_{VC}(\mathcal{H})=3$
- **Looseness of VC Bound**
- **Hoeffding for unknown Eout**
- any distribution, any target
- **$m_\mathcal{H}(N)$ instead of $|\mathcal{H}(x_1,...,x_N)|$**
- any data
- **$N^{d_{vc}}$ instead of $m_\mathcal{H}(N)$**
- any $\mathcal{H}$ of same $d_{vc}$
- **Union bound on worst cases**
- any choice made by $\mathcal{A}$
- Perceptron :$d_{VC}=d+1$
- $d_{VC}≥d+1$
- Pick the $d+1$ points $X=\begin{bmatrix}1&0&0&\dots&0\\1&1&0&\dots&0\\1&0&1&\dots&0\\\vdots&&&\ddots&\\1&0&0&\dots&1\end{bmatrix}$; $X$ is invertible, so for any $y$ we can take $w=X^{-1}y$ and get $sign(Xw)=y$
- $d_{VC}≤d+1$
- Any $d+2$ points are linearly dependent, $x_{d+2}=a_1x_1+a_2x_2+...+a_{d+1}x_{d+1}$, so one dichotomy on them can never be generated (they cannot be shattered)
- **Degrees of Freedom** :
- Parameters $w=(w_0,w_1,···,w_d)$ : can be viewed as $d+1$ degrees of freedom
- Positive Rays : $~~~~~~~d_{VC}(\mathcal{H})=1,|free~parameter| = 1$
- Positive Intervals : $d_{VC}(\mathcal{H})=2,|free~parameter| = 2$
- **TradeOff**
- small $d_{VC}$
- $E_{in}(h)~\sim~E_{out}(h)$, but $E_{in}(h)$ may be large
- large $d_{VC}$
- $E_{in}(h)~\nsim~E_{out}(h)$, but $E_{in}(h)$ can be small
- Summary
- $P(|E_{in}(g) -E_{out}(g)|>\epsilon)\le4~(2N)^{d_{vc}}~e^{-\frac{1}{8}\epsilon^2N}=\delta$
- $\epsilon$ and $\delta$ are in one-to-one correspondence
- $\delta=4~(2N)^{d_{vc}}~e^{-\frac{1}{8}\epsilon^2N}$
- $\epsilon=\sqrt{\frac{8}{N}ln\frac{4~(2N)^{d_{vc}}}{\delta}}$
- Given a tolerated bad-event probability $\delta$, with probability at least $1-\delta$:
- $P(|E_{in}(g) -E_{out}(g)|\leq\epsilon)\geq 1-\delta$
- $E_{in}(g)-\sqrt{\frac{8}{N}ln\frac{4~(2N)^{d_{vc}}}{\delta}}~\leq~E_{out}(g)~\leq~E_{in}(g)+\sqrt{\frac{8}{N}ln\frac{4~(2N)^{d_{vc}}}{\delta}}$
- In theory: $N≈10{,}000~d_{VC}$
- In practice: $N≈10~d_{VC}$ (a numeric sketch of the bound follows below)
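A numeric sketch (not from the lecture) that plugs a few sample sizes into $\epsilon=\sqrt{\frac{8}{N}\ln\frac{4(2N)^{d_{vc}}}{\delta}}$; the choices $d_{vc}=3$ (2D perceptrons) and $\delta=0.1$ are illustrative.

```python
import math

# Evaluate the VC generalization bound for a few sample sizes.
def vc_epsilon(N, dvc, delta):
    return math.sqrt(8.0 / N * math.log(4 * (2 * N) ** dvc / delta))

dvc, delta = 3, 0.1            # e.g. 2D perceptrons, 90% confidence
for N in (100, 1_000, 10_000, 100_000):
    print(f"N = {N:>7}: eps <= {vc_epsilon(N, dvc, delta):.3f}")
```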
## L8
## L9 Linear Regression
- **Problem setup**
- $(x_1,2), (x_2,-3), ..., (x_N,7)$
- $h(x) = w^Tx$
- $E_{in}(w)=\frac{1}{N}||X~w-y||^2=\frac{1}{N}(w^TX^TXw-2w^TX^Ty+y^Ty)$
- $Squared~Error=(\hat{y}−y)^2$
- $w=(X^TX)^{-1} X^Ty$
- **Linear Regression algorithm** (see the sketch after this list)
- 1.Construct $\underset{N\times (d+1)}{X}=\begin{bmatrix} -~-x_1^T-~-\\-~-x_2^T-~-\\ \vdots \\-~-x_N^T-~-\end{bmatrix}$, $~y=\begin{bmatrix}y_1\\y_2\\ \vdots \\y_N\end{bmatrix}$
- 2.Calculate pseudo-inverse $\underset{(d+1)\times N}{X^†}=(X^TX)^{-1} X^T$
- 3.$w_{LIN}=X^†y$
- 4.$h(x)=w_{LIN}^Tx$
- **Noise**
- $I-H$ transforms $y$ to $y-\hat{y}$
- $E_{in}$ = noise level · $(1 − \frac{d+1}{N})$ on average
- $E_{out}$ = noise level · $(1 + \frac{d+1}{N})$ on average
- **Classification vs Regression**
- **$err_{0/1} \leq err_{sqr}$**
- So regression can be used to solve classification
- More efficient, but the bound is looser
- Common practice: run regression to get $w_{LIN}$, then refine it with pocket
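A minimal sketch of the algorithm above on synthetic data, assuming numpy; `np.linalg.pinv` is used for the pseudo-inverse $X^\dagger$, and the noise level and seed are arbitrary illustrative choices.

```python
import numpy as np

# Linear regression via the pseudo-inverse: w_LIN = X† y.
rng = np.random.default_rng(0)
N, d = 200, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])    # add x_0 = 1
w_true = rng.normal(size=d + 1)
y = X @ w_true + 0.1 * rng.normal(size=N)                     # noisy targets

w_lin = np.linalg.pinv(X) @ y                                 # w_LIN = X† y
E_in = np.mean((X @ w_lin - y) ** 2)                          # squared in-sample error
print("E_in =", E_in)
```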
## L10 Logistic Regression
- **Problem setup**
- $(x_1,1), (x_2,0), ..., (x_N,1)$
- $h(x)=\theta(w^Tx)=\frac{1}{1+exp(-w^Tx)}$
- **Likelihood(h)**: in our data, a label of 1 means $y_i=+1$ and a label of 0 means $y_i=-1$
- $\theta(y_1w^Tx_1)\times \theta(y_2w^Tx_2)\times ...\times \theta(y_Nw^Tx_N)$, the larger the better
- **Cross Entropy Error**
- $\frac{1}{N}\sum\limits_{n=1}^N-ln~\theta(y_nw^Tx_n)$, the smaller the better
- $E_{in}(w)=~\frac{1}{N}\sum\limits_{n=1}^Nln(1+exp(-y_nw^Tx_n))$, the smaller the better
> cross entropy = $ln(1+exp(-yw^Tx))$
> cross entropy = $-\sum p(x)ln(q(x))=-\sum y~ln(\frac{1}{1+exp(-w^Tx)})$
- **Conclusion: find w such that**
- $E_{in}(w)=~\frac{1}{N}\sum\limits_{n=1}^Nln(1+exp(-y_nw^Tx_n))$ is as small as possible
- **Logistic Regression algorithm** (gradient descent; see the sketch after this list)
- 0.Initialize $w_0$
- 1.Compute $~~~~~~~~~~\nabla E_{in}(w_t)=\frac{1}{N}\sum\limits_{n=1}^N \theta(-y_nw_t^Tx_n)(-y_nx_n)$
- 2.Update $~~~~~~~~~~~~~w_{t+1}\leftarrow w_t-\eta\nabla E_{in}(w_t)$
- Repeat steps 1 and 2 until $\nabla E_{in}(w_t)\approx 0$ or enough iterations have run
- 3.Return $h(x)=\theta(w^Tx)=\frac{1}{1+exp(-w^Tx)}$
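A minimal gradient-descent sketch of the algorithm above, assuming numpy; the synthetic data, the learning rate $\eta=0.1$, and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

# Logistic regression trained by fixed-learning-rate gradient descent.
def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
N, d = 300, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])       # x_0 = 1
w_true = np.array([0.5, 2.0, -1.5])
y = np.where(rng.random(N) < sigmoid(X @ w_true), 1.0, -1.0)    # labels in {+1, -1}

w, eta = np.zeros(d + 1), 0.1
for t in range(2000):
    # gradient of E_in: mean of theta(-y_n w^T x_n) * (-y_n x_n)
    grad = np.mean(sigmoid(-y * (X @ w))[:, None] * (-y[:, None] * X), axis=0)
    w -= eta * grad                                             # gradient step

E_in = np.mean(np.log(1.0 + np.exp(-y * (X @ w))))              # cross-entropy error
print("cross-entropy E_in =", E_in)
```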
## L11
## L12 Nonlinear feature transform
- **Nonlinear feature transform** $\Phi$: $x∈X \rightarrow z∈Z$
- Circle: $\Phi(x)=(1,x_1^2,x_2^2)$
- General quadratic curves: $\Phi(x)=(1,x_1,x_2,x_1^2,x_1x_2,x_2^2)$
- **How to use?**
- Transform $\{(x_n, y_n)\}$ into $\{(z_n=\Phi(x_n), y_n)\}$
- Classify in $Z$-space to get a perceptron $\tilde{w}$
- Then $g(x)=sign(\tilde{w}^T\Phi(x))$ (see the sketch after this list)
- Q-th order polynomial on $x=(x_1,x_2,...,x_d)$
- $\Phi_Q(x)=(1,x_1,x_2,...,x_d,~...,~x_1^Q,x_1^{Q-1}x_2,...,x_d^Q)$
- **Dimension** = $1+\tilde{d}=\binom{Q+d}{Q}=\binom{Q+d}{d}=O(Q^d)$
- $\mathcal{H}_{\Phi_0}⊂\mathcal{H}_{\Phi_1}⊂\mathcal{H}_{\Phi_2}⊂...⊂\mathcal{H}_{\Phi_Q}$
- $d_{vc}(\mathcal{H}_0)≤d_{vc}(\mathcal{H}_1)≤d_{vc}(\mathcal{H}_2)≤...≤d_{vc}(\mathcal{H}_Q)$
- $E_{in}(g_0)≥E_{in}(g_1)≥E_{in}(g_2)≥...≥E_{in}(g_Q)$
- **$\Phi$ shall be decided without "peeking at" the data** (the transforms below look simpler only because they were chosen after looking at the data)
- $z = (1, x_1, x_2, x^2_1 , x_1x_2, x^2_2 )$
- $z = (1, x^2_1 , x^2_2 )$
- $z = (1, x^2_1 + x^2_2 )$
- $z = sign(0.6 − x^2_1 − x^2_2 )$
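A minimal sketch (not from the lecture) of the 2nd-order transform followed by a linear model in $Z$-space; the circular toy data and the use of linear regression plus `sign` instead of PLA are illustrative choices.

```python
import numpy as np

# 2nd-order transform Phi(x) = (1, x1, x2, x1^2, x1*x2, x2^2),
# then a linear model in Z-space.
def phi2(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.6, 1.0, -1.0)    # circular boundary

Z = phi2(X)                                 # transform to Z-space
w_tilde = np.linalg.pinv(Z) @ y             # linear regression in Z-space
g = np.sign(Z @ w_tilde)                    # g(x) = sign(w~^T Phi(x))
print("E_in (0/1) =", np.mean(g != y))
```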
## L13 Hazard of Overfitting
- **Definition:**
- Bad generalization: does well on the points it has seen, but has large error on the others.
- Overfitting: $E_{in}$ keeps decreasing while $E_{out}$ keeps increasing.
- Underfitting: $E_{in}$ keeps increasing and $E_{out}$ increases with it.
- **Target is a 10th-order polynomial plus noise**, learned with $\mathcal{H}_2$ and $\mathcal{H}_{10}$
- $g_2$ has the better $E_{out}$, **because the data set is not large enough**
- Sometimes stepping back is the way forward
- Being too clever can backfire
- **Target is a 50th-order polynomial**, learned with $\mathcal{H}_2$ and $\mathcal{H}_{10}$
- $g_2$ still has the better $E_{out}$
- **because the target's complexity itself acts like noise**
- **Overfitting complexity experiments**
- Target-function parameters: data size $N$, noise level $\sigma^2$, complexity level $Q_f$
- Measure of how badly we overfit: $E_{out}(g_{10})-E_{out}(g_2)$
- **Causes of overfitting**
- **Stochastic noise: large $\sigma^2$**
- **Deterministic noise: complex target $Q_f$**
- **Small data size $N$**
- **Excessive power**
- **Ways to deal with overfitting**
- **1.Start from a simple model**
- **2.Data cleaning/pruning**
- correct the label
- remove the example
- **3.Data hinting: exploit hints to get more data**
- add virtual examples
- the virtual examples are no longer i.i.d., which introduces bias
- **4.Regularization: step on the brakes**
- **5.Validation**
## L14 Regularization
- **A little regularization goes a long way!**
- **Stepping Back**: keep the model complexity from getting too high
- $\mathcal{H}_{10}=\{w\in\mathcal{R}^{11}\}$
- $\mathcal{H}_{2}=\{w\in\mathcal{R}^{11}~|~w_3=w_4=...=w_{10}=0\}$
- $\mathcal{H}_{2}^{'}=\{w\in\mathcal{R}^{11}~|~\#(w_i=0)=8\}~~$ **NP-hard to solve**
- $\mathcal{H}(C)=\{w\in\mathcal{R}^{11}~|~~~||w||^2\leq C\}~~$ **a similar, softer constraint**
- **Lagrange Multiplier**
- $\min\limits_{w\in\mathcal{R}^{Q+1}}~E_{in}(w)=\frac{1}{N}\sum\limits_{n=1}^{N}(w^Tz_n-y_n)^2~~~$ s.t. $\sum\limits_{q=0}^Qw_q^2\leq C$
- $\min\limits_{w\in\mathcal{R}^{Q+1}}~E_{in}(w)+\frac{\lambda}{N}w^Tw$
- $\lambda~\geq 0$ is called the **Lagrange Multiplier** and corresponds one-to-one with C
- $w^Tw$ is called the **regularizer**
- $E_{in}(w)+\frac{\lambda}{N}w^Tw$ is called the **augmented error**
- $\frac{2}{N}(Z^TZw-Z^Ty)+\frac{2\lambda}{N}w=0$
- $w_{REG}=(Z^TZ+\lambda I)^{-1}Z^Ty$, known as **ridge regression** (see the sketch after this list)
- **Legendre Polynomials: a better-behaved basis for regularization**
- $\Phi(x)=(1,L_1(x),L_2(x),...,L_Q(x))$
- **General Regularizers**
- Symmetry : $\sum[[q~is~odd]]w_q^2$
- Plausible $~$: $\sum|w_q|$,**sparsity**
- Friendly $~~~$: $\sum w_q^2$,**convex, differentiable, easy to optimize**
- Choosing $\lambda$
- More noise $\rightarrow$ a larger $\lambda$ is needed to bring $E_{out}$ down
- In practice, try several values (e.g., via validation)
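A minimal ridge-regression sketch, $w_{REG}=(Z^TZ+\lambda I)^{-1}Z^Ty$, assuming numpy; the sine target, the 10th-order polynomial features, and $\lambda=0.01$ are illustrative assumptions.

```python
import numpy as np

# Ridge regression on polynomially transformed synthetic data.
rng = np.random.default_rng(0)
N, Q, lam = 30, 10, 0.01
x = rng.uniform(-1, 1, size=N)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=N)        # noisy target

Z = np.vander(x, Q + 1, increasing=True)                 # (1, x, x^2, ..., x^Q)
w_reg = np.linalg.solve(Z.T @ Z + lam * np.eye(Q + 1), Z.T @ y)
E_in = np.mean((Z @ w_reg - y) ** 2)
print("E_in with lambda =", lam, ":", E_in)
```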
## L15 Validation
- **Model Selection Problem $\rightarrow$ handled via Validation**
- Given: $H_1,H_2,...,H_M$ with corresponding algorithms $A_1,A_2,...,A_M$
- Goal: minimize $E_{out}(g_{m^{∗}})$
- Model Selection by $E_{in}~$ : overfitting / bad generalization
- Model Selection by $E_{test}$: infeasible / cheating
- Model Selection by $E_{val}$ : legal cheating, using a held-out chunk of data that stays uncontaminated
- **Single Validation**
- $E_{out}(g)~~\underset{small~K}{\approx}~~E_{out}(g^{-})~~\underset{large~K}{\approx}~~E_{val}(g^{-})$
- Rule of thumb: $K = \frac{N}{5}$
- **Cross Validation**
- **Leave-One-Out**
- $E_{loocv}(H,A)=\frac{1}{N}\sum\limits_{n=1}^Ne_n=\frac{1}{N}\sum\limits_{n=1}^Nerr(g^−_n(x_n),y_n)$
- $E_{loocv}(H,A)\approx E_{out}(g^-)\approx E_{out}(g)$
- **V-Fold Cross Validation** (see the sketch after this list)
- Works well in practice; going all the way to Leave-One-Out is usually unnecessary
- Rule of thumb: $V=10$
- **Validation results are still somewhat optimistic compared with testing results**
- Training : **select among hypotheses**
- Validation: **select among finalists**
- Testing  : **evaluate**
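A minimal V-fold cross-validation sketch (not from the lecture) that picks $\lambda$ for ridge regression; the candidate $\lambda$ values, the synthetic data, and $V=10$ folds are illustrative assumptions.

```python
import numpy as np

# V-fold cross validation for choosing the ridge-regression lambda.
def ridge_fit(Z, y, lam):
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

rng = np.random.default_rng(0)
N, V = 100, 10
x = rng.uniform(-1, 1, size=N)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=N)
Z = np.vander(x, 11, increasing=True)                    # 10th-order features

folds = np.array_split(rng.permutation(N), V)            # V disjoint validation folds
for lam in (0.0, 0.01, 0.1, 1.0):
    errs = []
    for v in range(V):
        val = folds[v]
        train = np.hstack([folds[u] for u in range(V) if u != v])
        w = ridge_fit(Z[train], y[train], lam)
        errs.append(np.mean((Z[val] @ w - y[val]) ** 2))  # validation error on fold v
    print(f"lambda = {lam:>5}: E_cv = {np.mean(errs):.4f}")
```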
## L16 3 Principles
- **Occam’s Razor**
- An explanation should be as simple as possible.
- **Sampling Bias**
- If data is biased, learning will be biased.
- **Data Snooping**
- If a data set has affected any step of the learning process, its ability to assess the outcome has been compromised.