HTML Notes

HTML week 3

Data labels

Supervised

Unsupervised

Semi-supervised

Self-supervised

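The data supplies its own supervision, e.g., deriving the input from the output: scramble a complete jigsaw puzzle and ask the machine to put it back together.
Reference: an overview of Self-Supervised Learning for Computer Vision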

Weakly-supervised

Reinforcement

Protocols

Batch

Online

Active

Input Spaces

Concrete

Raw

Abstract

Error

Where does the error come from?

No Free Lunch Theorem for Machine Learning

Unseen cases may differ greatly from seen cases.

Off-Training-Set Error

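If the test set is too small, the measured accuracy may deviate from the true accuracy.

$E_{out}$: the true (out-of-sample) error

$E_{in}$: the error measured on the seen cases
Hoeffding's inequality bounds the gap between the two. For one fixed hypothesis $h$, $\Pr [| E_{in} (h) - E_{out} (h) | > ϵ] \leq 2 \exp (- 2 ϵ^{2} N)$; a data set $D$ on which the gap exceeds $ϵ$ is called bad for $h$.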

\begin{aligned} & \underset{D}{\Pr} [D \text{ is bad}] \\ = & \underset{D}{\Pr} [D \text{ is bad for } h_{1} \lor D \text{ is bad for } h_{2} \lor \dots \lor D \text{ is bad for } h_{M}] \\ \leq & \underset{D}{\Pr} [D \text{ is bad for } h_{1}] + \underset{D}{\Pr} [D \text{ is bad for } h_{2}] + \dots + \underset{D}{\Pr} [D \text{ is bad for } h_{M}] \\ \leq & 2 M \exp (- 2 ϵ^{2} N) \end{aligned}
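
To get a feel for the union bound above, the following minimal sketch evaluates $2 M \exp (- 2 ϵ^{2} N)$ and inverts it for $N$. The function names and the example numbers (`M=100`, `eps=0.1`, `delta=0.05`) are illustrative assumptions, not part of the lecture.

```python
import math

def union_bound(M, eps, N):
    """Upper bound 2*M*exp(-2*eps^2*N) on Pr[D is bad] for M hypotheses."""
    return 2 * M * math.exp(-2 * eps ** 2 * N)

def samples_needed(M, eps, delta):
    """Smallest N for which the bound above is at most delta."""
    return math.ceil(math.log(2 * M / delta) / (2 * eps ** 2))

# Hypothetical numbers, only for illustration.
N = samples_needed(M=100, eps=0.1, delta=0.05)
print(N, union_bound(M=100, eps=0.1, N=N))
```

Note how the required $N$ grows only logarithmically with $M$, which is why the bound stays useful for large but finite hypothesis sets.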

Size of $| H |$:
Too large: the bound is loose, so $E_{out}$ and $E_{in}$ may differ a lot.
Too small: there may be no good $h \in H$ that makes $E_{in}$ small.

But what about PLA, where $| H | = \infty$?

abbr: PAC=probably approximately correct

HTML week 4

So far the distance between $E_{in}$ and $E_{out}$ is bounded by the number of hypotheses. But what about infinitely many hypotheses? Group them into dichotomies (mapping parameters to the classification results on the samples). In a binary classification problem, the number of dichotomies is upper bounded by $2^{N}$, where $N$ is the number of samples. Growth function: the maximum number of dichotomies over all possible inputs (to make it independent of the inputs), denoted as $m_{H} (N)$. For some hypothesis sets, the number of dichotomies is lower (bp = break point):

1D perceptron, $f (x) = sign (x - h)$, $m_{H} (N) = N + 1$, bp: 2
positive intervals (1D), $f (x) = 1$ if $x \in [l, r)$, $0$ otherwise, $m_{H} (N) = \binom{N + 1}{2} + 1 = \frac{1}{2} N^{2} + \frac{1}{2} N + 1$, bp: 3
convex sets (2D): enclose one class with a convex shape, $m_{H} (N) = 2^{N}$, bp: none
2D perceptron, $m_{H} (N) < 2^{N}$ in some cases, bp: 4
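
The first two growth functions in the list above can be checked by brute force. This is a minimal sketch that enumerates dichotomies on $N$ distinct 1D points; the helper names are hypothetical, and thresholds are placed only in the gaps between sorted points, since any other threshold produces the same labeling.

```python
def _cuts(xs):
    """One threshold below all points, one in each gap, one above all points."""
    pts = sorted(xs)
    return [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]

def dichotomies_1d_perceptron(xs):
    """Distinct labelings produced by f(x) = sign(x - h)."""
    return {tuple(+1 if x > h else -1 for x in xs) for h in _cuts(xs)}

def dichotomies_positive_intervals(xs):
    """Distinct labelings produced by f(x) = 1 if x in [l, r), 0 otherwise."""
    cuts = _cuts(xs)
    return {tuple(1 if l <= x < r else 0 for x in xs)
            for l in cuts for r in cuts if l <= r}

for N in range(1, 6):
    xs = list(range(N))
    print(N, len(dichotomies_1d_perceptron(xs)),
          len(dichotomies_positive_intervals(xs)), 2 ** N)
```

The first count drops below $2^{N}$ at $N = 2$ and the second at $N = 3$, matching the break points listed above.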

$k$ is a break point if $m_{H} (k) < 2^{k}$. Property: every $n$ greater than the minimum break point is also a break point.
Some set of $N$ inputs can be shattered by $H$ $\Leftrightarrow$ $m_{H} (N) = 2^{N}$.

For a hypothesis set, VC dimension = (minimum break point) - 1. The VC dimension is the largest $N$ such that the hypotheses can shatter some set of $N$ inputs. The VC dimension can be viewed as the strength of a hypothesis set.

It is proven that if $k$ is a break point, then for $N \geq 2, k \geq 3$:

m_{H} (N) \leq N^{k - 1}

HTML week 5

Error functions are application/user-dependent.

Example:

In a supermarket fingerprint verification system that verifies customers and collects points for discounts, false rejection is a serious problem.
In the CIA, false acceptance will be VERY serious because we let intruders in.

HTML week 6

With the linear score $s = w^{T} x$:

linear classification: $h (x) = sign (s)$, error: 0/1
linear regression: $h (x) = s$, error: MSE
logistic regression: $h (x) = θ (s)$, error: cross-entropy

non-linear transform + linear model => non-linear model

HTML week 7

Introduce overfitting

Overfitting: low $E_{in}$, high $E_{out}$.

Causes of overfitting

too much noise but too few samples
too complex target but too few samples
too complex model (large $d_{vc}$) for a simple target but very few samples

Avoid Overfitting

start from simple models
data cleaning/pruning
data hinting
regularization
Validation

Data cleaning/pruning

Possibly detect outliers automatically, then:

data cleaning: correct the label
data pruning: remove the sample
The effect may be limited.

data hinting

Example: slightly shift/rotate images to generate more data.
Aka data augmentation

HTML week 8

validation

validation set: a form of "cheating",

D \to D_{train} \cup D_{val}

Set aside $K$ examples from the training data and use them as the criterion for choosing a hypothesis.

E_{out} (g) \underset{\text{small } K}{\approx} E_{out} (g^{-}) \underset{\text{large } K}{\approx} E_{val} (g^{-})

practical: $\frac{K}{N} = 20 \%$

Cross Validation

Single Cross Validation

$K = 1$: when the $n$-th example is chosen as the validation set, $E_{val} = e_{n}$.

leave-one-out cross-validation estimate

E_{loocv} (H, A) = \frac{1}{N} \sum_{n = 1}^{N} e_{n}

Hope $E_{loocv} (H, A) \approx E_{out} (g)$. In expectation over data sets $D$:

\mathbb{E}_{D} E_{loocv} (H, A) = \mathbb{E}_{D} \frac{1}{N} \sum_{n = 1}^{N} e_{n} = \overline{E_{out}} (N - 1)

disadvantage

It takes a lot of time. Not practical.

V-fold Cross Validation

Split the data into $V$ parts and take turns using each part as the validation set.
practical: $V = 5 \sim 10$
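
A minimal sketch of V-fold cross validation under simple assumptions: `train` and `error` are caller-supplied functions (hypothetical names), and the toy learner is a constant hypothesis that predicts the mean label. Setting `v = len(data)` gives the leave-one-out estimate.

```python
import random

def v_fold_indices(n, v, seed=0):
    """Shuffle 0..n-1 and deal the indices into v folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::v] for i in range(v)]

def cross_val_error(data, v, train, error):
    """Average validation error over the v folds."""
    folds = v_fold_indices(len(data), v)
    total = 0.0
    for k, val_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        g_minus = train([data[i] for i in train_idx])       # g^- from the rest
        total += error(g_minus, [data[i] for i in val_idx])
    return total / v

# Toy learner (an assumption for illustration): constant hypothesis = mean of y.
def train_mean(samples):
    mean = sum(y for _, y in samples) / len(samples)
    return lambda x: mean

def mse(g, samples):
    return sum((g(x) - y) ** 2 for x, y in samples) / len(samples)

data = [(x, 2.0 * x) for x in range(20)]
print(cross_val_error(data, v=5, train=train_mean, error=mse))
print(cross_val_error(data, v=len(data), train=train_mean, error=mse))  # leave-one-out
```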

Don't fool yourself

Report test result instead of best validation result.

Three principles

Occam's Razor

The simpler, the better.
Add only what is necessary.

Sampling Bias

Machine learning may cause harm.

Story

Election poll: phones were expensive, so supporters of one party were more likely to afford one, which biased the poll.
Netflix competition: validation error: 13% improvement
BUT validation: random examples within $D$
test: last user records after $D$

HTML week 9

Solution

Match test scenario as much as possible
One possible solution: emphasize later samples

Bank credit card approval problem

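Temporal bias
Financial markets change over time, e.g., inflation or changes in the money supply.
Distribution bias
- A card must be approved and then observed for some time before the outcome is known, so a rejected applicant may actually have been a good customer (the distributions differ: one covers everyone, the other only customers approved in the past).
- If young applicants had poor credit in the past, later young applicants may find it hard to get approved.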

Data Snooping

If designing the model while snooping(偷看) data, it may overfit.
If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.

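Peeking at the test data and building the model to fit it destroys the independence of the test data. For example, an asset manager may tune parameters to build an index with great past performance and issue an ETF on it, but the future performance is not necessarily good.
Data reusing: each paper builds on earlier ones and gets published only if it does better. Repeatedly working on the same data can turn into an indirect way of snooping on the test data.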

secret solution

carefully balance between data-driven modeling(snooping) and validation (no-snooping)

Linear Support Vector Machine

Fatness for a model solving linearly separable data

If many hypotheses can perfectly separate the data, choose the one with the largest "robustness" (it still classifies correctly even if some test data is perturbed by noise).

Robustness = fatness of the separating hyperplane = the distance to the nearest $x_{n}$ (margin).
Choose the hyperplane with the largest margin that perfectly separates the data.

Distance to Hyperplane

Shorten $x$ and $w$ first: change $x = (x_{0} (= 1), x_{1}, \dots, x_{d})$ to $x = (x_{1}, \dots, x_{d})$, and split $w = (w_{0}, w_{1}, \dots, w_{d})$ into $w = (w_{1}, \dots, w_{d})$ and $b (= w_{0})$.
So $h (x) = w^{T} x$ becomes $h (x) = w^{T} x + b$.

We want distance$(x, b, w)$ = the distance between $x$ and the hyperplane

w^{T} x^{'} + b = 0

Distance = the length of the projection of $(x - x^{'})$, for any $x^{'}$ on the hyperplane, onto the direction orthogonal to the hyperplane (the normal vector $w$). Since $w^{T} x^{'} = - b$:

\left| \frac{w^{T} (x - x^{'})}{‖ w ‖} \right| = \frac{1}{‖ w ‖} | w^{T} x + b |
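
The distance formula can be written directly in code. A minimal sketch with plain lists as vectors; the function name is an assumption.

```python
import math

def distance_to_hyperplane(x, w, b):
    """Distance |w^T x + b| / ||w|| from point x to the hyperplane w^T x' + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(score) / norm

# Hyperplane x_1 = 1 in 2D (w = (1, 0), b = -1); the point (3, 4) is 2 away from it.
print(distance_to_hyperplane([3.0, 4.0], [1.0, 0.0], -1.0))
```

Scaling $(b, w)$ by any non-zero constant leaves the distance unchanged, which is what later allows the margin constraint to be normalized.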

Support Vector Machine

Separating: for every $n$, $y_{n} (w^{T} x_{n} + b) > 0$
distance to $x_{n}$ $= \frac{1}{‖ w ‖} y_{n} (w^{T} x_{n} + b)$

Goal: find

\arg\max_{b, w} \frac{1}{‖ w ‖}

subject to

\min_{n} y_{n} (w^{T} x_{n} + b) = 1

The constraint can be relaxed to $\geq 1$: if $\min_{n} y_{n} (w^{T} x_{n} + b) > 1$, say $1.127$, we can let $b^{'} = \frac{b}{1.127}, w^{'} = \frac{w}{1.127}$, which still satisfies the constraint and makes $\frac{1}{‖ w ‖}$ larger, so such a $(b, w)$ is never optimal.

Origin of name: The optimal boundary only depends on the nearest points(support vectors(candidates)).

Solve with QP(Quadratic Programming)

Maximizing $\frac{1}{‖ w ‖}$ is equivalent to minimizing $\frac{1}{2} w^{T} w$. This is a special case of the standard QP form, which minimizes

\frac{1}{2} u^{T} Q u + p^{T} u

subject to $a_{m}^{T} u \geq c_{m}$ for $m = 1, \dots, M$, where

u = \begin{bmatrix} b \\ w \end{bmatrix}; Q = \begin{bmatrix} 0 & 0_{d}^{T} \\ 0_{d} & I_{d} \end{bmatrix}; p = 0_{d + 1}

a_{n}^{T} = y_{n} \begin{bmatrix} 1 & x_{n}^{T} \end{bmatrix}; c_{n} = 1; M = N

This is easy with a QP solver:

u \leftarrow QP (Q, p, A, c)

With non-linear transform

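As with other linear models, the transformed data can be fed directly into linear SVM. But for some transforms the dimension after the transform ($\tilde{d}$) is very large: the $Q$-order polynomial transform $Φ_{Q}$ of $d$-dimensional data has $\tilde{d} = O (Q^{d})$, which already explodes for $Q = 10, d = 12$. Depending on $\tilde{d}$ therefore discourages more complex transforms, and the dual is used to remove this dependence.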

HTML week 10

Dual Support Vector Machine

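As described above, (non-)linear SVM optimizes $\tilde{d} + 1$ variables ($w$ and $b$) subject to $N$ constraints ($y_{n} (w^{T} z_{n} + b) \geq 1$, i.e., every example is classified correctly).
We want to construct an equivalent SVM with $N$ variables and $N + 1$ constraints. The details need too much math, so only the necessary parts are introduced.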

Tool: Lagrange Multipliers

Take regularization as an example:

$\min_{w} E_{in} (w)$ s.t. $w^{T} w \leq C$ $⟺$ $\min_{w} E_{aug} (w) = E_{in} (w) + \frac{λ}{N} w^{T} w$
Requiring the weight length to be $\leq C$ is equivalent to specifying another parameter $λ$.

Constrained to Unconstrained

Use Lagrange multipliers to turn the original problem

\min_{b, w} \frac{1}{2} w^{T} w \quad \text{s.t.} \quad \forall n, y_{n} (w^{T} z_{n} + b) \geq 1

into

L (b, w, \vec{α}) = \frac{1}{2} w^{T} w + \sum_{n = 1}^{N} α_{n} (1 - y_{n} (w^{T} z_{n} + b))

where all $α_{n} \geq 0$ (this will be used in the KKT conditions).

Claim

\begin{aligned} \text{SVM} \equiv & \min_{b, w} \left( \max_{\text{all } α_{n} \geq 0} L (b, w, \vec{α}) \right) \\ = & \min_{b, w} \left( \infty \text{ if violate}; \frac{1}{2} w^{T} w \text{ if feasible} \right) \end{aligned}

violate / feasible: does $(b, w)$ satisfy the constraints?

Proof of Claim

For a violating $(b, w)$, some $(1 - y_{n} (w^{T} z_{n} + b))$ is positive, so maximizing $L$ drives the corresponding $α_{n} \to \infty$, and the result is $\infty$ too.
For a feasible $(b, w)$, all $(1 - y_{n} (w^{T} z_{n} + b)) \leq 0$. Since all $α_{n} \geq 0$, the maximum of the second term (the summation) is 0, so the maximum is $L = \frac{1}{2} w^{T} w$.

Lagrange Dual Problem

For any fixed ${\vec{α}}^{'}$ with all $α_{n}^{'} \geq 0$,

\min_{b, w} \left( \max_{\text{all } α_{n} \geq 0} L (b, w, \vec{α}) \right) \geq \min_{b, w} L (b, w, {\vec{α}}^{'})

because the max over $\vec{α}$ is $\geq$ the value at any particular ${\vec{α}}^{'}$, and this still holds after taking $\min_{b, w}$.
The above also holds for the best ${\vec{α}}^{'}$, so

\min_{b, w} \left( \max_{\text{all } α_{n} \geq 0} L (b, w, \vec{α}) \right) \geq \max_{\text{all } α_{n}^{'} \geq 0} \left( \min_{b, w} L (b, w, {\vec{α}}^{'}) \right)

LHS is equivalent to the original SVM, called primal. RHS is called Lagrange dual.

Strong/weak Duality of QP

In the Lagrange dual problem,

$\geq$: weak duality
$=$: strong duality, which holds for QP when
- convex primal
- feasible primal ($Φ$-separable, i.e., separable after the transform)
- linear constraints

Strong duality means there exists a primal-dual optimal solution for both sides.

Solving

eliminate b

Since the inner $\min_{b, w}$ is unconstrained for a fixed $\vec{α}$, we can simply set the partial derivative with respect to $b$ to zero:

\begin{aligned} L (b, w, \vec{α}) = & \frac{1}{2} w^{T} w + \sum_{n = 1}^{N} α_{n} (1 - y_{n} (w^{T} z_{n} + b)) \\ \frac{\partial L (b, w, \vec{α})}{\partial b} = & 0 = - \sum_{n = 1}^{N} α_{n} y_{n} \end{aligned}

So when maximizing over $\vec{α}$, we can add (without loss of optimality) the constraint $\sum_{n = 1}^{N} α_{n} y_{n} = 0$; otherwise the inner minimum over $b$ is $- \infty$.

Therefore, we can remove $b$:

\begin{aligned} & \min_{b, w} \frac{1}{2} w^{T} w + \sum_{n = 1}^{N} α_{n} (1 - y_{n} (w^{T} z_{n} + b)) \\ = & \min_{b, w} \frac{1}{2} w^{T} w + \sum_{n = 1}^{N} α_{n} (1 - y_{n} (w^{T} z_{n})) - b \sum_{n = 1}^{N} α_{n} y_{n} \end{aligned}

Under the constraint $\sum_{n = 1}^{N} α_{n} y_{n} = 0$ imposed outside the $\min$, the last term is $0$, so $b$ disappears.

find w

Similarly, we can set the partial derivative with respect to $w$ to zero:

\begin{aligned} L (b, w, \vec{α}) = & \frac{1}{2} w^{T} w + \sum_{n = 1}^{N} α_{n} (1 - y_{n} (w^{T} z_{n} + b)) \\ \frac{\partial L (b, w, \vec{α})}{\partial w_{i}} = & 0 = w_{i} - \sum_{n = 1}^{N} α_{n} y_{n} z_{n, i} \end{aligned}

We then have

w = \sum_{n = 1}^{N} α_{n} y_{n} z_{n}

Then substitute it into the expression, using $\sum_{n} α_{n} y_{n} w^{T} z_{n} = w^{T} w$:

\begin{aligned} & \max_{\text{all } α_{n} \geq 0} \left( \min_{b, w} \frac{1}{2} w^{T} w + \sum_{n = 1}^{N} α_{n} - w^{T} w \right) \\ = & \max_{\text{all } α_{n} \geq 0} \left( - \frac{1}{2} {‖ \sum_{n = 1}^{N} α_{n} y_{n} z_{n} ‖}^{2} + \sum_{n = 1}^{N} α_{n} \right) \end{aligned}

under the additional constraints on the $\max$: $\sum_{n = 1}^{N} α_{n} y_{n} = 0$ and $w = \sum_{n = 1}^{N} α_{n} y_{n} z_{n}$.

KKT condition

If $(b, w, \vec{α})$ is an optimal solution of the primal and the Lagrange dual, then:

primal feasible (the constraints of the original problem hold): $y_{n} (w^{T} z_{n} + b) \geq 1$
dual feasible (the constraints of the dual hold): $α_{n} \geq 0$
dual-inner optimal (the inner optimization of the dual is optimal): $\sum_{n = 1}^{N} α_{n} y_{n} = 0$ and $w = \sum_{n = 1}^{N} α_{n} y_{n} z_{n}$
primal-inner optimal (the inner optimization of the primal is optimal, so all Lagrange terms vanish): $α_{n} (1 - y_{n} (w^{T} z_{n} + b)) = 0$

These are called the Karush-Kuhn-Tucker (KKT) conditions.

Summary

Based on the above, we can use QP to solve

\min_{\text{all } α_{n} \geq 0} \left( \frac{1}{2} {‖ \sum_{n = 1}^{N} α_{n} y_{n} z_{n} ‖}^{2} - \sum_{n = 1}^{N} α_{n} \right)

subject to $\sum_{n = 1}^{N} α_{n} y_{n} = 0$, to obtain $\vec{α}$, and then use the conditions above to obtain $b, w$.

$w$: optimal $\vec{α}$ $\Rightarrow$ optimal $w = \sum_{n = 1}^{N} α_{n} y_{n} z_{n}$

$b$: optimal $\vec{α}$ $\Rightarrow$ optimal $b$

Primal feasibility, $y_{n} (w^{T} z_{n} + b) \geq 1$, gives a range for $b$ (for most examples, the non-support vectors, the left side is strictly greater than 1).
Further, by primal-inner optimality,

α_{n} (1 - y_{n} (w^{T} z_{n} + b)) = 0, \forall n

For a non-SV the second factor is non-zero, so $α_{n} = 0$. Conversely, if $α_{n} > 0$, we get:

\begin{aligned} & 1 - y_{n} (w^{T} z_{n} + b) = 0 \Rightarrow y_{n} (w^{T} z_{n} + b) = 1 \\ & y_{n} = \pm 1 \Rightarrow y_{n} = \frac{1}{y_{n}} \\ & b = y_{n} - w^{T} z_{n} \end{aligned}
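
The recovery of $w$ and $b$ from an optimal $\vec{α}$ can be sketched as follows. The function name, the tolerance `tol`, and the two-point toy problem (whose optimal $α_{1} = α_{2} = 0.5$ was worked out by hand) are assumptions for illustration; the $\vec{α}$ itself would normally come from a QP solver.

```python
def recover_w_b(alphas, ys, zs, tol=1e-9):
    """w = sum_n alpha_n y_n z_n; b = y_s - w^T z_s for any support vector s."""
    d = len(zs[0])
    w = [sum(a * y * z[i] for a, y, z in zip(alphas, ys, zs)) for i in range(d)]
    # Any example with alpha_n > 0 lies exactly on the margin boundary.
    s = next(n for n, a in enumerate(alphas) if a > tol)
    b = ys[s] - sum(wi * zi for wi, zi in zip(w, zs[s]))
    return w, b

# Toy problem: z_1 = (0, 0) with y = -1 and z_2 = (2, 0) with y = +1.
# The fattest separating line is x_1 = 1, i.e. w = (1, 0), b = -1.
zs = [[0.0, 0.0], [2.0, 0.0]]
ys = [-1, +1]
alphas = [0.5, 0.5]
print(recover_w_b(alphas, ys, zs))
```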

High-level comments

Comparing SVM and PLA

SVM	PLA
$w_{SVM} = \sum_{n} α_{n} (y_{n} z_{n})$	$w_{PLA} = \sum_{n} β_{n} (y_{n} z_{n})$
$α_{n}$ comes from the dual solution	$β_{n} =$ # of mistake corrections on $(x_{n}, y_{n})$

$w$ is a linear combination of the $y_{n} z_{n}$. This is also true for GD/SGD-based logistic regression / linear regression when the initial weight $w_{0} = 0$.
In other words, $w$ is represented by the data.
SVM: represented by SVs only.

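Comparing primal and dual

Problems

The matrix $Q$ of the dual QP is N-by-N, and its entries $q_{n, m} = y_{n} y_{m} z_{n}^{T} z_{m}$ are mostly non-zero, so storing $Q$ takes a lot of memory once N grows.
Solution: in practice, use a specialized solver.
The original goal was to avoid depending on the number of transformed features $\tilde{d}$, but computing each $q_{n, m} = y_{n} y_{m} z_{n}^{T} z_{m}$ naively is still an inner product of two vectors of length $\tilde{d}$.
Solution: kernel SVM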

Kernel Support Vector Machine

Goal: avoid the $O (\tilde{d})$ cost of computing the inner product $z_{n}^{T} z_{m}$ inside each $q_{n, m}$, $\forall n, m \in [1, N]$, by brute force.

Kernel function

= Transform + Inner Product
Use an efficient kernel function to quickly compute

z_{n}^{T} z_{m} = Φ (x_{n})^{T} Φ (x_{m})

Example for $Φ_{2}$ (second-order polynomial transform)

Φ_{2} (x) = (1, x_{1}, \dots, x_{d}, x_{1}^{2}, x_{1} x_{2}, \dots, x_{1} x_{d}, x_{2} x_{1}, x_{2}^{2}, \dots, x_{d}^{2})

speed up Q for QP solver

Consider computing

\begin{aligned} Φ_{2} (x)^{T} Φ_{2} (x^{'}) = & 1 + \sum_{i = 1}^{d} x_{i} x_{i}^{'} + \sum_{i = 1}^{d} \sum_{j = 1}^{d} (x_{i} x_{j}) (x_{i}^{'} x_{j}^{'}) \\ = & 1 + \sum_{i = 1}^{d} x_{i} x_{i}^{'} + (\sum_{i = 1}^{d} x_{i} x_{i}^{'}) (\sum_{j = 1}^{d} x_{j} x_{j}^{'}) \\ = & 1 + x^{T} x^{'} + (x^{T} x^{'})^{2} \end{aligned}

Note: $d$ is the dimension of the original data and $\tilde{d} = O (d^{2})$ is the dimension after the transform; computing the inner product in $O (d)$ time is acceptable.
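
The identity above can be checked numerically. A minimal sketch: `phi2` builds the explicit $Φ_{2}$ features (length $1 + d + d^{2}$) and `k2` computes the same inner product in $O (d)$; both names and the sample vectors are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi2(x):
    """Explicit second-order transform: (1, x_1..x_d, all x_i * x_j)."""
    return [1.0] + list(x) + [xi * xj for xi in x for xj in x]

def k2(x, xp):
    """Kernel shortcut: 1 + x^T x' + (x^T x')^2, computed in O(d)."""
    s = dot(x, xp)
    return 1.0 + s + s * s

x, xp = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
print(dot(phi2(x), phi2(xp)), k2(x, xp))
```

Both numbers agree, but the kernel version never materializes the $\tilde{d}$-dimensional vectors.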

speed up b,w

HTML week 11

Soft-Margin Support Vector Machine

SVM for Soft Binary Classification

Blending and Bagging

An Aggregation Story

aggregation for binary classification

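Suppose you have $T$ friends, and each friend's prediction for a stock is a function $g_{1}, \dots, g_{T}$; for a stock $x$, $sign (g_{t} (x))$ indicates whether it goes up or down. How can the friends' predictions be combined into one?

select: find the best-performing friend and simply copy that prediction (chosen by validation)
$G (x) = g_{t_{*}} (x)$ with $t_{*} = {argmin}_{t} E_{val} (g_{t}^{-})$.
mix uniformly: use every friend's prediction, one person one vote
$G (x) = sign (\sum_{t} 1 \cdot g_{t} (x))$
mix non-uniformly: as above, but with weights $α$
$G (x) = sign (\sum_{t} α_{t} \cdot g_{t} (x))$ with $α_{t} \geq 0$
combine conditionally (stacking): when predicting a given sector, mainly consult the expert for that sector; the weights depend on the input
$G (x) = sign (\sum_{t} q_{t} (x) \cdot g_{t} (x))$ with $q_{t} (x) \geq 0$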

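Comparison

Selection
This is the first method above: simple and widely used; it relies on one strong hypothesis.
Of course $E_{val}$ should be used rather than $E_{in}$, but the $g_{t}^{-}$ used for validation still need to be strong enough.
aggregation/blending
Also consulting weaker friends (hypotheses) may work better.

Why blending may be better

acts as feature transform: with $H_{1}$ allowing only vertical lines in the 2D plane and $H_{2}$ only horizontal lines, blending them yields boundaries with right-angle corners.
acts as regularization: after averaging, e.g., linear regression can achieve an effect similar to SVM.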

Blending for regression

Take uniform blending as an example: $G$ is the average of the $g_{t}$, that is,

G (x) = \frac{1}{T} \sum_{t} g_{t} (x)

Theoretical analysis for uniform blending

\begin{aligned} {avg}_{t} (E_{out} (g_{t})) = & {avg}_{t} ((g_{t} (x) - f (x))^{2}) \\ = & {avg}_{t} (g_{t}^{2} - 2 g_{t} f + f^{2}) \\ = & {avg}_{t} (g_{t}^{2}) - 2 G f + f^{2} \\ = & {avg}_{t} (g_{t}^{2}) + (G - f)^{2} - G^{2} \\ = & {avg}_{t} (g_{t}^{2}) - 2 G^{2} + G^{2} + (G - f)^{2} \\ = & {avg}_{t} ((g_{t} - G)^{2}) + (G - f)^{2} \\ = & {avg}_{t} ((g_{t} - G)^{2}) + E_{out} (G) \\ \geq & E_{out} (G) \end{aligned}

Therefore, uniform blending is at least as good as the average performance of the individual $g_{t}$.

This can also be interpreted as:
(expected performance of randomly choosing one hypothesis)
= (expected deviation to consensus)
+ (performance of consensus).

performance of consensus: bias
expected deviation to consensus: variance

Uniform blending reduces variance for more stable performance.
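
The decomposition can be verified numerically at a single input $x$. A minimal sketch with made-up predictions; the function name and the numbers are assumptions.

```python
def blending_decomposition(predictions, target):
    """Return (avg_t (g_t - f)^2, avg_t (g_t - G)^2, (G - f)^2) at one input x."""
    T = len(predictions)
    G = sum(predictions) / T                                 # uniform blending
    avg_err = sum((g - target) ** 2 for g in predictions) / T
    deviation = sum((g - G) ** 2 for g in predictions) / T   # deviation to consensus
    consensus_err = (G - target) ** 2                        # performance of consensus
    return avg_err, deviation, consensus_err

avg_err, deviation, consensus_err = blending_decomposition([1.0, 2.0, 6.0], target=2.0)
print(avg_err, deviation + consensus_err)
```

The two printed values match, and the consensus error alone is never larger than the average individual error.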

HTML week 12

Linear blending

Given the known $g_{t}$, each is given $α_{t}$ ballots:

G (x) = sign (\sum_{t} α_{t} g_{t} (x)) \text{ with } α_{t} \geq 0

Compute good $α_{t}$:

\min_{α_{t} \geq 0} E_{in} (\vec{α})

For linear regression (+ transform), it looks alike:

\min_{α_{t} \geq 0} \frac{1}{N} \sum_{n = 1}^{N} {(y_{n} - \sum_{t} α_{t} g_{t} (x_{n}))}^{2}

Some hypotheses may be contrarian indicators, so $α_{t}$ may be negative; the constraint $α_{t} \geq 0$ is therefore not necessarily needed.

Any blending

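Blending amounts to transforming $(x, y)$ into $(Φ^{-} (x), y)$, where $Φ^{-} (x) = (g_{1}^{-} (x), g_{2}^{-} (x), \dots)$.
Linear blending then amounts to running linear regression on the transformed data.
In fact other models can be used as well; this is any blending, which corresponds to stacking (the blending weights depend on the data).
It is powerful, but it can still overfit.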

Bagging

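Aggregation works better when the hypotheses are diverse.
Possible approaches:

use a different hypothesis set for each $g_{t}$
use different parameters (learning rate, etc.)
use randomized algorithms
use randomness in the data (different CV splits produce different $g_{t}^{-}$)

In practice, data randomness can produce different $g_{t}$ from a single data set.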

Bootstrap aggregation

Sample $N$ (or $N^{'}$) data points with replacement (the same example may be picked many times) from $D$ to form ${\tilde{D}}_{t}$.
Train $g_{t}$ by an algorithm $A$ on ${\tilde{D}}_{t}$.
Do the above many times. Output the blending $G = Uniform (\{g_{t}\})$

This simulates the real aggregation:

Request size-$N$ data $D_{t}$ from $P^{N}$ (i.i.d.)
Train $g_{t}$ by an algorithm $A$ on $D_{t}$.
Do the above many times. Output the blending $G = Uniform (\{g_{t}\})$

Because we cannot really obtain a fresh set of $N$ examples each time, resampling is an approximation.

This is also called bagging (packing the data into bags).
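
A minimal sketch of bootstrap aggregation for regression, assuming a caller-supplied `train` function (a hypothetical name) and using a constant mean-predictor as the toy base algorithm $A$.

```python
import random

def bootstrap_sample(data, rng, size=None):
    """Draw `size` examples from data with replacement (default: len(data))."""
    size = len(data) if size is None else size
    return [rng.choice(data) for _ in range(size)]

def bagging(data, train, T, seed=0):
    """Train T hypotheses on bootstrap samples and blend them uniformly."""
    rng = random.Random(seed)
    gs = [train(bootstrap_sample(data, rng)) for _ in range(T)]
    return lambda x: sum(g(x) for g in gs) / T

# Toy base algorithm (an assumption): predict the mean label of the sample.
def train_mean(samples):
    mean = sum(y for _, y in samples) / len(samples)
    return lambda x: mean

data = [(x, float(x % 3)) for x in range(30)]
G = bagging(data, train_mean, T=50)
print(G(0))
```

A base learner this stable gains little from bagging; the benefit shows up with learners that react strongly to which examples were drawn.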

bagging performance

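If the hypothesis is sensitive to randomness in the data (i.e., it produces diverse results), the average is likely to perform well (e.g., the pocket algorithm, which was not covered in class).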

Adaptive Boosting

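The apple-recognition problem

Ask a group of children to suggest ways to recognize an apple:

round
red
may also be green
may have a stem

Along the way the class reaches a consensus, and the teacher points out the misclassified examples so the children can think again.

Mapping to ML

children = (simple, weak) hypotheses
class consensus = the hypothesis after blending
teacher = reweighting, which makes the hypotheses focus on the mistakes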

Use Bootstrap Again

Bootstrapping amounts to reweighting the $N$ examples: some get weight 0, and some are counted many times.
Each $g_{t}$ is effectively minimizing the bootstrap-weighted error

E_{in}^{u} (h) = \frac{1}{N} \sum_{n = 1}^{N} u_{n} \cdot err (h (x_{n}), y_{n})

Weighted base algorithm

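We could duplicate some examples, but the weights can also be handled inside the algorithm, for example:

soft SVM: if each violation costs $C$, make the violation on example $n$ cost $C u_{n}$ instead.
logistic regression solved with SGD: sample the $n$-th example with probability proportional to $u_{n}$.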

More diverse results

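Repeat $T$ times, using $u^{(t)}$ as the weights in round $t$. We want the results of different rounds to differ greatly:

Use $u^{(t)}$ as the weights to obtain $g_{t}$.
We want $g_{t}$ to perform poorly under the weights $u^{(t + 1)}$, so that $g_{t + 1}$ differs greatly from $g_{t}$.
The reweighting depends on which examples $g_{t}$ answers correctly.

Concretely, make the weighted accuracy of $g_{t}$ under $u^{(t + 1)}$ equal to 0.5.
We want:

\frac{\sum_{n = 1}^{N} u_{n}^{(t + 1)} \cdot [[g_{t} (x_{n}) \neq y_{n}]]}{\sum_{n = 1}^{N} u_{n}^{(t + 1)}} = 0.5

So we can set $u^{(t + 1)}$ as follows:

For each $n$ that $g_{t}$ answers correctly, set $u_{n}^{(t + 1)}$ to $u_{n}^{(t)}$ divided by the fraction answered correctly.
For each $n$ that $g_{t}$ answers incorrectly, set $u_{n}^{(t + 1)}$ to $u_{n}^{(t)}$ divided by the fraction answered incorrectly.

Or cross-multiply: multiply the correct ones by the error rate (or count), and the incorrect ones by the correct rate (or count).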

Scaling factor

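Let $ϵ_{t}$ be the (weighted) error rate of $g_{t}$ under $u^{(t)}$. The method above amounts to multiplying the correct examples by $ϵ_{t}$ and the incorrect ones by $1 - ϵ_{t}$.
Dividing both factors by $\sqrt{ϵ_{t} (1 - ϵ_{t})}$ gives an equivalent, symmetric form:

\text{Scaling Factor} = S_{t} = \sqrt{\frac{1 - ϵ_{t}}{ϵ_{t}}}

Divide by $S_{t}$ when correct; multiply by $S_{t}$ when incorrect.

This will be used later.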
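
The reweighting step can be sketched as below. The function name and the four-example toy weights are assumptions; the sketch requires $0 < ϵ_{t} < 1$ so that the scaling factor is defined.

```python
import math

def reweight(u, correct):
    """One reweighting round.

    u[n] is the current weight of example n; correct[n] says whether g_t
    classified example n correctly. Requires 0 < eps < 1.
    """
    eps = sum(un for un, c in zip(u, correct) if not c) / sum(u)  # weighted error
    s = math.sqrt((1 - eps) / eps)                                # scaling factor
    new_u = [un / s if c else un * s for un, c in zip(u, correct)]
    return new_u, eps, s

u = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
new_u, eps, s = reweight(u, correct)
# Weighted error of the same g_t under the new weights:
print(sum(un for un, c in zip(new_u, correct) if not c) / sum(new_u))
```

Under the new weights the old hypothesis is no better than a coin flip, which forces the next round to learn something different.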

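How to choose $u^{(1)}$

To be best for $E_{in}$, use $u_{n}^{(1)} = \frac{1}{N}$ (equivalent to no reweighting).

How to choose the blending method $G$

Uniform blending is unlikely to work: $g_{2}$ is trained under weights for which $g_{1}$ performs poorly, which differ greatly from the original data, so $g_{2}$ may well perform badly on the original data, and the same goes for the later $g_{t}$.
Linear or non-linear blending are both possible.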

AdaBoost

AdaBoost uses a special blending method, $α_{t} = \ln (S_{t})$: since the scaling factor increases with accuracy, more accurate hypotheses get more weight.

AdaBoost = weak hypotheses (students) + reweighting (teacher) + blending (the whole class)

Theoretical guarantee

If $\max_{t} (ϵ_{t}) = ϵ < \frac{1}{2}$, then $E_{in} (G) = 0$ after $T = O (\log N)$ iterations.

Decision Stump

Classify with a single threshold on one dimension. Very weak, but very efficient.
For $N$ examples of $d$-dimensional data, training takes $O (d \cdot N \log N)$ time.
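
One way to reach the $O (d \cdot N \log N)$ cost is to sort each feature once and sweep the threshold while updating the weighted error incrementally. This is a sketch under that assumption, with hypothetical function names; the stump form $h (x) = direction \cdot sign (x_{i} - threshold)$ is also an assumption.

```python
def best_stump(X, y, u=None):
    """Return (weighted_error, feature, threshold, direction) of the best stump."""
    N, d = len(X), len(X[0])
    u = u or [1.0 / N] * N
    total = sum(u)
    best = None
    for i in range(d):
        order = sorted(range(N), key=lambda n: X[n][i])    # O(N log N) per feature
        # Threshold below every point: direction +1 predicts +1 everywhere.
        err = sum(u[n] for n in range(N) if y[n] != +1)
        candidates = [(err, float("-inf"))]
        for k, n in enumerate(order):
            # Moving the threshold past point n flips its prediction to -1.
            err += u[n] if y[n] == +1 else -u[n]
            if k + 1 == N:
                candidates.append((err, float("inf")))
            elif X[order[k + 1]][i] != X[n][i]:            # cannot cut between ties
                candidates.append((err, (X[n][i] + X[order[k + 1]][i]) / 2))
        for e_pos, thr in candidates:
            for direction, e in ((+1, e_pos), (-1, total - e_pos)):
                if best is None or e < best[0]:
                    best = (e, i, thr, direction)
    return best

def stump_predict(x, feature, thr, direction):
    return direction * (+1 if x[feature] > thr else -1)

X = [[0.0, 1.0], [0.0, 2.0], [0.0, 3.0], [0.0, 4.0]]
y = [-1, -1, +1, +1]
print(best_stump(X, y))
```

The optional weights `u` make the same routine usable as a weighted base algorithm.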

AdaBoost-Stump

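Use decision stumps as the hypotheses and train them with AdaBoost.
The first real-time face detection system used this method: the image is cut into many patches, and decision stumps decide whether a patch contains a face.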

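Supplement

AdaBoost is powerful, so beware of overfitting.
On small and medium-sized data, (soft) SVM can achieve similar results (regularization, margin).
Continuing after $E_{in} = 0$ can still reduce $E_{out}$, because a larger margin can be reached.
It suits binary classification best; for multiclass, Gradient Boosting can be used.
It generally performs worse than neural networks.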

Decision Tree

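A decision tree achieves conditional aggregation, i.e., stacking.

aggregation	blending	learning example
uniform	voting	bagging
weighted	linear	boosting
conditional	stacking	Decision Tree

A decision tree is a combination of if-else statements; each if-else is a node.
It is close to human logic, easy to interpret, simple (many financial analyses may use it), and efficient. However, it has no theoretical guarantee, it is unclear how to choose its parameters, and there is no single representative algorithm.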

Representations

Path: summation over every path $t$,
$G (x) = \sum_{t} q_{t} (x) \cdot g_{t} (x)$
q: condition, g: the hypothesis at the leaf (can be a constant)
Recursive: summation over every child $c$,
$G (x) = \sum_{c} b_{c} (x) \cdot G_{c} (x)$
b: the child's condition, G: the hypothesis of the child's subtree

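Algorithm for building a decision tree

If the termination criteria are met (this node should become a leaf), return the leaf hypothesis.
Otherwise:
- learn the node's branching criteria $b (x)$
- split the data into parts according to $b (x)$ and build a decision tree on each part recursively
- return the tree formed by $b (x)$ and the subtrees

Choices to make

termination criteria
number of children
branching criteria
hypothesis at the leaves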

Classification and Regression Tree (CART)

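Branching criteria

Decision stumps can simply be used as the hypothesis set for the branching criteria; the number of children is then 2 (binary tree).

Choosing the branching criteria: find the hypothesis whose split makes the impurity (measured by an impurity function) as small as possible. This plays roughly the role of an error function.
Concretely, take the average of the impurity of each part, weighted by the size of the part.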

Impurity Function

$E_{in}$ of the optimal constant hypothesis:
- Regression with MSE: $\bar{y} = avg (y_{1}, \dots, y_{N})$, $impurity = \frac{1}{N} \sum_{n = 1}^{N} (y_{n} - \bar{y})^{2}$
- Classification with 0-1 loss: $y^{*} = majority (y_{1}, \dots, y_{N})$, $impurity = \frac{1}{N} \sum_{n = 1}^{N} [[y_{n} \neq y^{*}]]$
Special for classification:
- Gini index: $N_{k} = \sum_{n = 1}^{N} [[y_{n} = k]]$, $impurity = 1 - \sum_{k = 1}^{K} {(\frac{N_{k}}{N})}^{2}$ (considers all $k$)
- classification error: $impurity = 1 - \max_{k} (\frac{N_{k}}{N})$ (considers only the most common $k = y^{*}$)

Classification usually uses the Gini index; regression uses the first one (MSE).
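
The impurity functions above, together with the size-weighted average used to score a binary split, can be sketched as follows. The function names are assumptions.

```python
from collections import Counter

def mse_impurity(ys):
    """Regression impurity: mean squared deviation from the mean label."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def gini(ys):
    """Gini index: 1 - sum_k (N_k / N)^2."""
    N = len(ys)
    return 1 - sum((c / N) ** 2 for c in Counter(ys).values())

def classification_error(ys):
    """1 - max_k (N_k / N): fraction of labels that differ from the majority."""
    return 1 - max(Counter(ys).values()) / len(ys)

def branch_impurity(left, right, impurity):
    """Impurity of a binary split: average over both parts, weighted by size."""
    n = len(left) + len(right)
    return (len(left) * impurity(left) + len(right) * impurity(right)) / n

labels = ["a", "a", "b", "b"]
print(gini(labels), classification_error(labels))
print(branch_impurity(["a", "a"], ["b", "b"], gini))   # a perfectly pure split
```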

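Termination criteria

CART is a fully-grown tree, so it keeps splitting until no further split is possible. Concretely:

Impurity = 0: all labels are the same, so there is nothing to split
all $x_{n}$ are the same: there is no way to split

Hypothesis at the leaves

Since the leaves cannot be split further, the optimal constant hypothesis can be used directly.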

Regularization by Pruning

One regularizer: $Ω (G) = NumberOfLeaves (G)$

The problem becomes: over all decision trees $G$, find

{argmin}_{G} E_{in} (G) + λ Ω (G)

Of course we cannot enumerate all decision trees, so usually only the following are considered:

$G^{(0)}$ = fully-grown tree
$G^{(i)}$ = ${argmin}_{G} E_{in} (G)$, where $G$ ranges over $G^{(i - 1)}$ with one leaf removed.

How to find $λ$: validation

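Branching on categorical features

The counterpart of the decision stump is the decision subset:

b (x) = [[x_{i} \in S]] + 1

where $S \subset [K]$.
The tree is still binary: one branch for values in $S$ and one for values not in $S$.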

Surrogate branching

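Use other, similar features in place of a missing feature.
In practice, though, features with no missing values may not always be available; e.g., this is probably not usable in the final project.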

HTML week 13

Random Forest

Use bagging to combine the outputs of $T$ fully-grown decision trees.

OOB examples

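Suppose bagging draws $N^{'} = N$ examples each time. Then for each $t$, the probability that $(x_{n}, y_{n})$ is never drawn is ${(1 - 1 / N)}^{N}$. When $N$ is large:

{(1 - \frac{1}{N})}^{N} = \frac{1}{{(\frac{N}{N - 1})}^{N}} = \frac{1}{{(1 + \frac{1}{N - 1})}^{N}} \approx \frac{1}{e}

So each round leaves about $N / e$ examples undrawn (out-of-bag, OOB), more than one might expect.
Use: for each $g_{t}$, its OOB examples can serve as validation data, so no retraining is needed.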
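
The $1 / e$ approximation can be checked both from the closed form and by simulating one bootstrap round. A minimal sketch; the function name and the seed are assumptions.

```python
import math
import random

def oob_fraction(N, seed=0):
    """Fraction of the N examples never drawn in one bootstrap round of size N."""
    rng = random.Random(seed)
    drawn = {rng.randrange(N) for _ in range(N)}
    return 1 - len(drawn) / N

for N in (10, 100, 10000):
    # closed form, one simulated round, and the limit 1/e
    print(N, (1 - 1 / N) ** N, oob_fraction(N), 1 / math.e)
```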

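In practice

It gives good results even on complicated data; this is the power of voting.

How large should T be?

The more the better; in practice roughly thousands to tens of thousands.
The downside is that the decision tree must be trained that many times.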

Gradient Boosted Decision Tree

(Ada)Boost + Decision Tree

HTML week 14

Neural Network

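Intuition: multiple perceptrons plus an activation (imitating neurons with a simple threshold: 1 if the input exceeds some value, 0 otherwise; one perceptron takes the outputs of other perceptrons as its inputs) can build logic gates such as AND, OR, and NOT. More layers are even more powerful.
For simplicity, use the MSE error.

Activation (Transformation)

If the outputs are only summed up, the whole network is still a linear model, so nothing is gained. The lecture covers sigmoid and tanh, and why $\tanh$ is used less nowadays.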

\tanh (s) = \frac{\exp (s) - \exp (- s)}{\exp (s) + \exp (- s)} = 2 θ (2 s) - 1

Hypothesis

For a $d^{(0)} - d^{(1)} - \dots - d^{(L)}$ neural network, the weights are $w_{i j}^{(l)}$ with

$1 \leq l \leq L$: layers
$0 \leq i \leq d^{(l - 1)}$: inputs
$1 \leq j \leq d^{(l)}$: outputs

raw output

s_{j}^{(l)} = \sum_{i = 0}^{d^{(l - 1)}} w_{i j}^{(l)} x_{i}^{(l - 1)}

transformed output

x_{j}^{(l)} = \begin{cases} A (s_{j}^{(l)}), & \text{if } l < L \\ s_{j}^{(l)}, & \text{if } l = L \end{cases}
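
The hypothesis can be evaluated with a short forward pass. A minimal sketch, assuming the weights are stored as nested lists `weights[l - 1][i][j]` $= w_{i j}^{(l)}$ with row $i = 0$ holding the bias weights, and $A = \tanh$; the toy 2-2-1 network is made up.

```python
import math

def forward(x, weights, activation=math.tanh):
    """Forward pass of a d(0)-d(1)-...-d(L) network.

    weights[l - 1][i][j] holds w_ij^(l); row i = 0 multiplies the constant x_0 = 1.
    The activation is applied on every layer except the last one.
    """
    L = len(weights)
    for l, W in enumerate(weights, start=1):
        x_aug = [1.0] + list(x)                              # prepend x_0 = 1
        s = [sum(W[i][j] * x_aug[i] for i in range(len(x_aug)))
             for j in range(len(W[0]))]                      # raw output s_j^(l)
        x = s if l == L else [activation(sj) for sj in s]    # transformed output
    return x

# Toy 2-2-1 network: output = 0.5 + tanh(x_1) + tanh(x_2).
weights = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],   # layer 1: 3 rows (bias + 2 inputs), 2 outputs
    [[0.5], [1.0], [1.0]],                  # layer 2: 3 rows (bias + 2 inputs), 1 output
]
print(forward([1.0, -1.0], weights))
```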

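universal approximator:
A neural network can approximate essentially any function; in fact even a single hidden layer suffices if it has enough neurons. This is similar to the Gaussian kernel being very powerful.

AdaBoost is hard to combine with perceptrons.

(Stochastic) gradient descent can be used to minimize the loss (error).
Taking the partial derivative of the error with respect to $w$ directly gives $\frac{\partial e}{\partial w} = \frac{\partial e}{\partial s} \cdot \frac{\partial s}{\partial w}$; only the output layer's $s$ is directly related to $e$, and the other layers need backward propagation.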

automatic differentiation

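Literally apply a very small offset and use the resulting change in the output as the derivative. (Strictly speaking this describes numerical, finite-difference differentiation; automatic differentiation instead applies the chain rule to each elementary operation.)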
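
The small-offset idea can be sketched with central differences, which is a common way to sanity-check a gradient computed by backward propagation. The function name, the step size `h`, and the one-example squared error are assumptions.

```python
def numerical_gradient(f, w, h=1e-6):
    """Approximate df/dw_i by (f(w + h e_i) - f(w - h e_i)) / (2h)."""
    grad = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += h
        w_minus[i] -= h
        grad.append((f(w_plus) - f(w_minus)) / (2 * h))
    return grad

# Squared error of a linear model on one example x = (2, 1), y = 3.
def err(w):
    return (w[0] * 2.0 + w[1] * 1.0 - 3.0) ** 2

# Analytic gradient at w = (1, 0): 2 * (2 - 3) * (2, 1) = (-4, -2).
print(numerical_gradient(err, [1.0, 0.0]))
```

Each coordinate costs two evaluations of the error, so this is practical only for checking, not for training networks with many weights.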

optimize gives local minimum

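Gradient descent alone only reaches a local minimum, so the initial weights matter: if they are too large the gradient becomes tiny (the activation "saturates"), so try small random weights.
It was later found that training rarely ends in a bad local minimum, as long as the weights are initialized by rules of thumb, or by pre-training.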

initialization

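all zeros: neither tanh nor ReLU works
all the same value: every neuron stays identical
too large: vanishing gradient (the tanh outputs are all nearly the same)

Therefore the initial weights should be small and random.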

Regularization

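Early Stopping: gradient-descent-type models explore more of the weight space as the number of steps grows, which is like $d_{vc}$ gradually increasing, so training can be stopped early based on the validation error. However, the validation error and $E_{out}$ may go down again later.
This was a popular method in the early days; nowadays double descent is used,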