# Intro ML
contributed by <`kylekylehaha`>
###### tags:`Data Science`
## Part0 What is Machine Learning?
**Key Essence of Machine Learning**
- Exists some "underlying pattern" to be learned
  - so the performance measure can be improved.
- But no programmable (easy) definition.
  - When a simple definition exists, there is no need for ML.
- Somehow there is data about the pattern.
---
## Part1 What is Classification?
### Supervised vs. Unsupervised learning
Supervised learning (e.g. classification)
- Supervision: the **training data** are accompanied by **labels**.
Unsupervised learning (e.g. clustering)
- The class labels of the training data are unknown.
---
### Classification: a two-step process
**Model construction**: describing a set of predetermined classes.
**Model usage**: classifying future or unknown objects.
---
## Part2 Decision Tree
A node represents an attribute, and each subtree (branch) under the node corresponds to one of the attribute's possible values.
Q: Decision tree vs. winner/loser tree?
A: Winner/loser trees are selection trees, used for sorting (usually in increasing order).
---
### Building a decision tree
When splitting, we use **entropy** to measure how impure (mixed) the resulting partition is.
**Entropy**:
- Definition: Info(D) is the entropy of D, where D contains m classes
$$
Info(D)=-\sum_{i=1}^m p_i\log(p_i)
$$
- $p_i$ is the probability of class $i$, i.e. the proportion of the data belonging to that class.
- A larger value means the data are more mixed; a smaller value means they are more homogeneous.
- Info(D)=0 means all tuples belong to a single class.
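
A minimal sketch (assuming NumPy) of computing Info(D) from class counts; the function name `info` is just for illustration:

```python
import numpy as np

def info(counts):
    """Entropy Info(D) computed from the class counts of a dataset D."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                     # class proportions p_i
    p = p[p > 0]                        # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))      # base-2 logarithm (bits)

print(info([5, 5]))    # 1.0 -> maximally mixed (two equal classes)
print(info([10, 0]))   # 0.0 -> all tuples in one class
```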
---
### Feature Space Separation
A **decision tree** can only draw separating lines that are **perpendicular to the axes**; it cannot classify with an oblique line. These separating lines are called discriminant functions.

(Left: the decision tree. Right: the decision tree can only cut horizontally and vertically, while the correct cut would be an oblique line.)
Pros:
- Fast
- Easily interpretable by humans
Cons:
- Accuracy
- Greedily selects the split attribute (once a level has been split, the choice is fixed and cannot be revised based on how later splits perform)
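
A minimal sketch (assuming scikit-learn) of fitting and inspecting a decision tree; every printed rule tests a single feature against a threshold, i.e. an axis-perpendicular cut:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)

# Each rule has the form "feature <= threshold": an axis-perpendicular split.
print(export_text(clf, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]))
```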
---
## Part4 Support Vector Machine (SVM)
**Discriminant Function:**
The classifier is said to assign a feature vector $x$ to class $w_i$ if
$$
g_i(x) \gt g_j(x) \quad \text{for all } j \ne i
$$
For the two-class case: $g(x)\equiv g_1(x)-g_2(x)$; assign $x$ to $w_1$ if $g(x) \gt 0$; otherwise assign $x$ to $w_2$.
---
### Linear support vector machine
The linear discriminant function with the maximum **margin** is the best.
- The margin is defined as the width by which the boundary could be increased before hitting a data point (the distance from the closest point on each side of the boundary to the boundary).
- A larger margin means a cleaner separation.
---
**Goal**: for the hyperplane $w^Tx+b=0$, find $w$ and $b$ that make the margin as large as possible; treat this as an optimization problem.
We know that
$$
w^Tx^+ +b = 1 \\
w^Tx^- +b = -1
$$
Thus, the margin width is:
$$
M=(x^+-x^-) \cdot n \\
=(x^+-x^-) \cdot \frac{w}{||w||} = \frac{2}{||w||}
$$

The points lying on the margin are called support vectors; they can be seen as the points that prop the margin open.
---
Formulation:
$$
\text{maximize} \ \frac{2}{||w||}
$$
such that
$$
\text{For } y_i=+1, \ w^Tx_i+b \ge 1\\
\text{For } y_i=-1, \ w^Tx_i+b \le -1
$$
This can be rewritten as:
Formulation:
$$
\text{minimize} \ \frac{1}{2}||w||^2
$$
such that
$$
y_i(w^Tx_i+b) \ge 1
$$
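
A minimal sketch (assuming scikit-learn) that approximates this hard-margin problem by fitting `LinearSVC` with a very large C, then reports the learned $w$, $b$, and the margin width $2/||w||$:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Two well-separated blobs, so a (near) hard-margin solution exists.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.6, random_state=0)

clf = LinearSVC(C=1e6, max_iter=100000)   # very large C approximates the hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
```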
---
What if the data are noisy, so that no linear boundary classifies them perfectly? Allow **slack variables**.
**Slack variable**: a point is allowed to fall short of the margin by some bounded amount, i.e. small violations of the margin constraint are tolerated.

---
### Non-linear SVM
Map the low-dimensional data objects to a higher-dimensional space.



---
With this mapping, our discriminant function is now:
$$
g(x)=w^T\phi(x)+b=\sum_{i \in SV}\alpha_i y_i \phi(x_i)^T\phi(x)+b
$$
There is no need to know the mapping explicitly, because we only use the **dot product** of feature vectors in both training and testing.
A **kernel function** is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space.
Examples of commonly used kernel functions:
- Linear kernel: $K(x_i,x_j)=x^T_ix_j$
- Gaussian (Radial Basis Function, RBF) kernel: $K(x_i,x_j)=\exp(-\frac{||x_i-x_j||^2}{2\sigma^2})$
The linear kernel can also be used to find the more important features (after training, each feature gets a weight), which helps deal with the curse of dimensionality.
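
A minimal sketch (assuming scikit-learn and its `make_circles` toy data) contrasting the linear and RBF kernels on data that are not linearly separable in the original space:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X_tr, y_tr)
    print(kernel, "test accuracy:", clf.score(X_te, y_te))
# The RBF kernel implicitly maps the data to a higher-dimensional space,
# so it separates the circles while the linear kernel cannot.
```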
---
## Part5 Model evaluation
**Class imbalance problem**
**Sensitivity**: True positive recognition rate
- sensitivity = TP/P (P: # of positives)
**Specificity**: True negative recognition rate
- specificity = TN/N (N: # of negatives)
**Recall and precision** should be considered together; combining the two gives the F1-score (their harmonic mean).
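
A minimal sketch (assuming scikit-learn) computing these measures from a confusion matrix; the toy labels are made up for illustration:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity = TP/P =", tp / (tp + fn))           # same as recall
print("specificity = TN/N =", tn / (tn + fp))
print("precision   =", precision_score(y_true, y_pred))
print("recall      =", recall_score(y_true, y_pred))
print("F1-score    =", f1_score(y_true, y_pred))        # harmonic mean of the two
```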
---
**Cross-validation**
- Randomly partition the data into k folds (e.g. k=10): $D_1, ..., D_k$.
- At the i-th iteration, use $D_i$ as the test set and the remaining folds as the training set.
- Shuffle the data before doing k-fold cross-validation.
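
A minimal sketch (assuming scikit-learn) of shuffled 10-fold cross-validation with a decision tree:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

cv = KFold(n_splits=10, shuffle=True, random_state=0)   # shuffle, then split into k=10 folds
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("fold accuracies:", scores)
print("mean accuracy  :", scores.mean())
```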
---
**ROC curve**: for visual comparison of classification models.
- Shows the trade-off between the **true positive** rate and the **false positive** rate
- The **area** under the ROC curve is a measure of the accuracy of the model
- The **closer** the curve is to the diagonal line, the **less** accurate the model.
- A model with perfect accuracy will have an area of 1.0
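
A minimal sketch (assuming scikit-learn, with logistic regression standing in as an arbitrary scoring classifier) that computes the ROC points and the area under the curve:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]            # score for the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)    # points of the ROC curve
print("AUC:", roc_auc_score(y_te, scores))        # 1.0 = perfect, 0.5 = the diagonal
```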

---
## Part6 Ensemble Methods
Let r classifiers vote on their outputs to decide the final label. Two things must be considered: how to train classifiers that differ from one another, and how to combine their votes (e.g. giving each classifier a different weight).
Ensemble methods: increasing the accuracy
- Use a combination of models to increase accuracy.
- Ideally, there are significant differences among the models (if the models are all similar, there is no point in voting).
Popular ensemble methods:
- Bagging: averaging the predictions over a collection of classifiers.
- Boosting: weighted voting with a collection of classifiers.
- Ensemble: combining a set of heterogeneous classifiers.
---

(Left: a single decision tree. Right: many decision trees combined with ensemble methods; compared to the left, the resulting boundary looks much more like an oblique line.)
---
### Ensemble methods: bagging
- Draw bootstrap samples from the training set; each sample is used to train its own classifier.
- For an unknown sample X, every classifier makes a prediction, and the final prediction is decided by the number of votes.
- Bagging can also predict continuous values: simply average the individual predictions.
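
A minimal sketch (assuming scikit-learn) of bagging decision trees; `BaggingClassifier` uses a decision tree as its default base estimator and votes over bootstrap-trained copies:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 50 trees, each fit on a bootstrap sample of the training set; prediction by majority vote.
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print("single tree:", DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te))
print("bagging    :", bag.score(X_te, y_te))
```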
---
### Ensemble methods: Random Forest
Whereas bagging perturbs the training data, a Random Forest perturbs each classifier itself: the attribute used for a split is chosen at random.
An ordinary decision tree selects the split attribute using Info(D), preferring the attribute that yields the smallest Info; here we instead pick attributes at random to build each new classifier.
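
A minimal sketch (assuming scikit-learn); `max_features="sqrt"` is what restricts each split to a random subset of the attributes:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 100 trees; each split only considers a random subset of attributes (random input selection).
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print("10-fold CV accuracy:", cross_val_score(rf, X, y, cv=10).mean())
```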

---
Two methods to build a Random Forest
- Forest-RI (random input selection): randomly select the candidate attributes.
- Forest-RC (random linear combinations): create new attributes that are linear combinations of the existing attributes.
---
### Ensemble methods: Boosting
Unlike Random Forests and bagging, which can both be trained in parallel, boosting trains its classifiers one iteration at a time.
---
**How it works?**
- A series of k classifiers is iteratively learned.
- After a classifier $M_i$ is learned, the weights are updated to allow the subsequent classifier, $M_{i+1}$, to **pay more attention to the training tuples that were misclassified by $M_i$**.
---
**Adaboost**
- Given d class-labeled tuples: $(X_1,y_1), (X_2,y_2),...,(X_d,y_d)$.
- Initially, each tuple has the same weight (1/d).
---
Generate k classifiers in k rounds. At round i:
- Tuples from D are sampled (with replacement) to form a training set $D_i$.
- Each tuple's **chance of being selected** is based on **its weight**.
- If a tuple is misclassified, its weight is increased; otherwise it is decreased.
---
Prediction
- Each model has a weight; multiply each model's output by its weight and combine the weighted votes to get the final class.
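
A minimal sketch (assuming scikit-learn) of AdaBoost; `estimator_weights_` holds the per-round voting weights used in the final, weighted prediction:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# k=50 boosting rounds; each round re-weights the misclassified tuples.
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print("test accuracy:", ada.score(X_te, y_te))
print("first few per-round model weights:", ada.estimator_weights_[:5])
```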
---
## Part7 Bayes Classification Methods
**Goal: compute the probability that $C_i$ occurs given $X$.**
Example: $C_i$ is the event "buys a computer", and $X$ is a person described by several attributes (e.g. age, income, credit rating, etc.).
- $P(C_i)$: the prior probability that a person buys a computer.
- $P(X|C_i)$: the probability of observing those attributes given that the person buys a computer.
- Assuming the attributes are independent of each other, $P(X|C_i)=\prod_{k=1}^{n}P(x_k|C_i)=P(x_1|C_i)\times P(x_2|C_i)\times\dots\times P(x_n|C_i)$
Finally, Bayes' theorem gives our goal, $P(C_i|X)$:
$$
P(C_i|X)=\frac{P(X|C_i)P(C_i)}{P(X)}
$$
Since we are classifying, we only need to know which probability is higher ("buys" vs. "does not buy"), so there is no need to compute $P(X)$.
Example:
- Probability of buying: 0.028/P(X)
- Probability of not buying: 0.007/P(X)
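
A minimal sketch (assuming scikit-learn) of a naive Bayes classifier on categorical attributes; the tiny "buys computer" table below is made up for illustration, and `CategoricalNB`'s default `alpha=1.0` already applies the Laplacian correction discussed next:

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Made-up (age, income, credit_rating) tuples with a "buys computer" label.
X_raw = [["youth", "high", "fair"], ["youth", "high", "excellent"],
         ["middle", "high", "fair"], ["senior", "medium", "fair"],
         ["senior", "low", "excellent"], ["middle", "low", "excellent"]]
y = ["no", "no", "yes", "yes", "no", "yes"]

enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)              # encode categories as integer codes

clf = CategoricalNB(alpha=1.0)            # multiplies per-attribute P(x_k | C_i)
clf.fit(X, y)
print(clf.predict(enc.transform([["youth", "medium", "fair"]])))
```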
---
We must avoid the zero-probability problem: if any single factor has probability 0, the whole product becomes 0.
We use the **Laplacian correction**: simply add 1 to every count.
E.g.:
1000 tuples:
- low: 0
- medium: 990
- high: 10
Here prob(low)=0, so we add 1 to every count:
- low: 1
- medium: 991
- high: 11
Then prob(low)=1/1003; prob(medium)=991/1003; prob(high)=11/1003
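
A one-step-per-line sketch of that arithmetic in Python, mirroring the counts above:

```python
# Laplacian correction on the income counts above.
counts = {"low": 0, "medium": 990, "high": 10}

smoothed = {k: v + 1 for k, v in counts.items()}    # add 1 to every count
total = sum(smoothed.values())                      # 1000 + 3 = 1003
probs = {k: v / total for k, v in smoothed.items()}
print(probs)   # low: 1/1003, medium: 991/1003, high: 11/1003
```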
---
## Part8 Discussion
### Class imbalance problem
Note that resampling must be done on the training set only; its purpose is simply to make the model learn properly instead of blindly guessing one label.
- Oversampling: increase the number of minority-class samples.
- Under-sampling: reduce the number of majority-class samples.
The imbalance in the testing set reflects the real-world distribution, so it must not be resampled; otherwise the measured performance would be misleading.
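
A minimal sketch (assuming scikit-learn's `resample` utility and made-up training data) of random oversampling applied to the training set only:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = np.array([0] * 90 + [1] * 10)         # 90 majority vs. 10 minority samples

# Oversample the minority class in the TRAINING set only; leave the test set untouched.
X_min, X_maj = X_train[y_train == 1], X_train[y_train == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print("balanced class counts:", np.bincount(y_bal))   # [90 90]
```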
---
### Normalization
Without standardization, a feature with large values can dominate and make the other features look tiny, which misleads the SVM. Therefore, standardize before using an SVM; the accuracy difference between standardized and unstandardized data can exceed 10%.
- Variance: measures how spread out the data are
$$
\frac{1}{N}\sum_{x \in X}(x-\bar{X})^2
$$
- The standard deviation is the square root of the variance
$$
\sigma=\sqrt{\frac{1}{N}\sum_{x \in X}(x-\bar{X})^2}
$$
---
Standardization makes the **data's mean 0 and its standard deviation 1**.
Two steps:
1) Subtract the mean from every value
2) Divide every value by the standard deviation
$$
x\rightarrow \frac{x-\bar{x}}{\sigma}
$$

This prevents a feature with large values from drowning out the variation of the other features.
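
A minimal sketch (assuming scikit-learn) comparing an SVM with and without standardization; `StandardScaler` performs exactly the two steps above, per feature:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = SVC().fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)   # (x - mean) / std per feature

print("SVM without standardization:", raw.score(X_te, y_te))
print("SVM with standardization   :", scaled.score(X_te, y_te))
```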
---
### Multiclass F1 score
There are micro-F1 and macro-F1. **Micro-F1** is effectively accuracy (for single-label multiclass classification), while **macro-F1** averages the per-class precision and recall.
**macro-F1**
- macro-precision $=\frac{1}{k}\sum_i \text{precision}_i$
- macro-recall $=\frac{1}{k}\sum_i \text{recall}_i$
- macro-F1-score $=\frac{2}{\frac{1}{\text{macro-precision}}+\frac{1}{\text{macro-recall}}}$
Therefore, when we face an imbalance problem, we use the macro-F1 score as the metric.
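
A minimal sketch (assuming scikit-learn and made-up imbalanced labels) showing why macro-F1 is the stricter metric:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]   # imbalanced: class 0 dominates
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]   # mostly predicts the majority class

print("accuracy:", accuracy_score(y_true, y_pred))
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))   # equals accuracy here
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))   # punishes ignoring the minority classes
```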