# Intro ML
contributed by <`kylekylehaha`>
###### tags:`Data Science`
## Part0 What is Machine Learning?
**Key Essence of Machine Learning**
- Exists some "underlying pattern" to be learned
  - so the performance measure can be improved.
- But no programmable (easy) definition.
  - When a simple definition exists, there is no need for ML.
- Somehow there is data about the pattern.
---
## Part1 What is Classification?
### Supervised vs. Unsupervised learning
Supervised learning (e.g. classification)
- Supervision: the **training data** are accompanied by **labels**.
Unsupervised learning (e.g. clustering)
- The class labels of the training data are unknown.
---
### Classification: a two-step process
**Model construction**: describing a set of predetermined classes.
**Model usage**: classifying future or unknown objects.
---
## Part2 Decision Tree
A node represents an attribute, and each subtree (branch) under the node corresponds to one of the attribute's possible values.
Q: Decision tree vs. winner/loser tree?
A: Winner/loser trees are selection trees, used for sorting (usually in increasing order).
---
### Building a decision tree
When splitting, we use **entropy** to measure how impure (mixed) the resulting partition is.
**Entropy**:
- Definition: Info(D) is the entropy of D, where D contains m classes
$$
Info(D)=-\sum_{i=1}^m p_i\log(p_i)
$$
- $p_i$ is the probability of class $i$, i.e. the proportion of the data belonging to that class.
- A larger value means the data are more mixed; a smaller value means they are more homogeneous.
- Info(D)=0 means all tuples belong to a single class.
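
A minimal sketch (assuming NumPy) of computing Info(D) from class counts; the function name `info` is just for illustration:

```python
import numpy as np

def info(counts):
    """Entropy Info(D) computed from the class counts of a dataset D."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                     # class proportions p_i
    p = p[p > 0]                        # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))      # base-2 logarithm (bits)

print(info([5, 5]))    # 1.0 -> maximally mixed (two equal classes)
print(info([10, 0]))   # 0.0 -> all tuples in one class
```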
---
### Feature Space Separation
A **decision tree** can only draw separating lines that are **perpendicular to the axes**; it cannot classify with an oblique line. These separating lines are called discriminant functions.

(Left: the decision tree. Right: the decision tree can only cut horizontally and vertically, while the correct cut would be an oblique line.)
Pros:
- Fast
- Easily interpretable by humans
Cons:
- Accuracy
- Greedily selects the split attribute (once a level has been split, the choice is fixed and cannot be revised based on how later splits perform)
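
A minimal sketch (assuming scikit-learn) of fitting and inspecting a decision tree; every printed rule tests a single feature against a threshold, i.e. an axis-perpendicular cut:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)

# Each rule has the form "feature <= threshold": an axis-perpendicular split.
print(export_text(clf, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]))
```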
---
## Part4 Support Vector Machine (SVM)
**Discriminant Function:**
The classifier is said to assign a feature vector $x$ to class $w_i$ if
$$
g_i(x) \gt g_j(x) \quad \text{for all } j \ne i
$$
For the two-class case: $g(x)\equiv g_1(x)-g_2(x)$; assign $x$ to $w_1$ if $g(x) \gt 0$; otherwise assign $x$ to $w_2$.
---
### Linear support vector machine
The linear discriminant function with the maximum **margin** is the best.
- The margin is defined as the width by which the boundary could be increased before hitting a data point (the distance from the closest point on each side of the boundary to the boundary).
- A larger margin means a cleaner separation.
---
**Goal**: for the hyperplane $w^Tx+b=0$, find $w$ and $b$ that make the margin as large as possible; treat this as an optimization problem.
We know that
$$
w^Tx^+ +b = 1 \\
w^Tx^- +b = -1
$$
Thus, the margin width is:
$$
M=(x^+-x^-) \cdot n \\
=(x^+-x^-) \cdot \frac{w}{||w||} = \frac{2}{||w||}
$$

The points lying on the margin are called support vectors; they can be seen as the points that prop the margin open.
---
Formulation:
$$
\text{maximize} \ \frac{2}{||w||}
$$
such that
$$
\text{For } y_i=+1, \ w^Tx_i+b \ge 1\\
\text{For } y_i=-1, \ w^Tx_i+b \le -1
$$
This can be rewritten as:
Formulation:
$$
\text{minimize} \ \frac{1}{2}||w||^2
$$
such that
$$
y_i(w^Tx_i+b) \ge 1
$$
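
A minimal sketch (assuming scikit-learn) that approximates this hard-margin problem by fitting `LinearSVC` with a very large C, then reports the learned $w$, $b$, and the margin width $2/||w||$:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Two well-separated blobs, so a (near) hard-margin solution exists.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.6, random_state=0)

clf = LinearSVC(C=1e6, max_iter=100000)   # very large C approximates the hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
```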
---
What if the data are noisy, so that no linear boundary classifies them perfectly? Allow **slack variables**.
**Slack variable**: a point is allowed to fall short of the margin by some bounded amount, i.e. small violations of the margin constraint are tolerated.

---
### Non-linear SVM
Map the low-dimensional data objects to a higher-dimensional space.



---
With this mapping, our discriminant function is now:
$$
g(x)=w^T\phi(x)+b=\sum_{i \in SV}\alpha_i y_i \phi(x_i)^T\phi(x)+b
$$
There is no need to know the mapping explicitly, because we only use the **dot product** of feature vectors in both training and testing.
A **kernel function** is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space.
Examples of commonly used kernel functions:
- Linear kernel: $K(x_i,x_j)=x^T_ix_j$
- Gaussian (Radial Basis Function, RBF) kernel: $K(x_i,x_j)=\exp(-\frac{||x_i-x_j||^2}{2\sigma^2})$
The linear kernel can also be used to find the more important features (after training, each feature gets a weight), which helps deal with the curse of dimensionality.
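
A minimal sketch (assuming scikit-learn and its `make_circles` toy data) contrasting the linear and RBF kernels on data that are not linearly separable in the original space:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X_tr, y_tr)
    print(kernel, "test accuracy:", clf.score(X_te, y_te))
# The RBF kernel implicitly maps the data to a higher-dimensional space,
# so it separates the circles while the linear kernel cannot.
```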
---
## Part5 Model evaluation
**Class imbalance problem**
**Sensitivity**: True positive recognition rate
- sensitivity = TP/P (P: # of positives)
**Specificity**: True negative recognition rate
- specificity = TN/N (N: # of negatives)
**Recall and precision** should be considered together; combining the two gives the F1-score (their harmonic mean).
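
A minimal sketch (assuming scikit-learn) computing these measures from a confusion matrix; the toy labels are made up for illustration:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity = TP/P =", tp / (tp + fn))           # same as recall
print("specificity = TN/N =", tn / (tn + fp))
print("precision   =", precision_score(y_true, y_pred))
print("recall      =", recall_score(y_true, y_pred))
print("F1-score    =", f1_score(y_true, y_pred))        # harmonic mean of the two
```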
---
**Cross-validation**
- Randomly partition the data into k folds (e.g. k=10): $D_1, ..., D_k$.
- At the i-th iteration, use $D_i$ as the test set and the remaining folds as the training set.
- Shuffle the data before doing k-fold cross-validation.
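
A minimal sketch (assuming scikit-learn) of shuffled 10-fold cross-validation with a decision tree:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

cv = KFold(n_splits=10, shuffle=True, random_state=0)   # shuffle, then split into k=10 folds
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("fold accuracies:", scores)
print("mean accuracy  :", scores.mean())
```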
---
**ROC curve**: for visual comparison of classification models.
- Shows the trade-off between the **true positive** rate and the **false positive** rate
- The **area** under the ROC curve is a measure of the accuracy of the model
- The **closer** the curve is to the diagonal line, the **less** accurate the model.
- A model with perfect accuracy will have an area of 1.0
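
A minimal sketch (assuming scikit-learn, with logistic regression standing in as an arbitrary scoring classifier) that computes the ROC points and the area under the curve:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]            # score for the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)    # points of the ROC curve
print("AUC:", roc_auc_score(y_te, scores))        # 1.0 = perfect, 0.5 = the diagonal
```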

---
## Part6 Ensemble Methods
Let r classifiers vote on their outputs to decide the final label. Two things must be considered: how to train classifiers that differ from one another, and how to combine their votes (e.g. giving each classifier a different weight).
Ensemble methods: increasing the accuracy
- Use a combination of models to increase accuracy.
- Ideally, there are significant differences among the models (if the models are all similar, there is no point in voting).
Popular ensemble methods:
- Bagging: averaging the predictions over a collection of classifiers.
- Boosting: weighted voting with a collection of classifiers.
- Ensemble: combining a set of heterogeneous classifiers.
---

(Left: a single decision tree. Right: many decision trees combined with ensemble methods; compared to the left, the resulting boundary looks much more like an oblique line.)
---
### Ensemble methods: bagging
- Draw bootstrap samples from the training set; each sample is used to train its own classifier.
- For an unknown sample X, every classifier makes a prediction, and the final prediction is decided by the number of votes.
- Bagging can also predict continuous values: simply average the individual predictions.
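
A minimal sketch (assuming scikit-learn) of bagging decision trees; `BaggingClassifier` uses a decision tree as its default base estimator and votes over bootstrap-trained copies:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 50 trees, each fit on a bootstrap sample of the training set; prediction by majority vote.
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print("single tree:", DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te))
print("bagging    :", bag.score(X_te, y_te))
```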
---
### Ensemble methods: Random Forest
Whereas bagging perturbs the training data, a Random Forest perturbs each classifier itself: the attribute used for a split is chosen at random.
An ordinary decision tree selects the split attribute using Info(D), preferring the attribute that yields the smallest Info; here we instead pick attributes at random to build each new classifier.
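
A minimal sketch (assuming scikit-learn); `max_features="sqrt"` is what restricts each split to a random subset of the attributes:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 100 trees; each split only considers a random subset of attributes (random input selection).
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print("10-fold CV accuracy:", cross_val_score(rf, X, y, cv=10).mean())
```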

---
Two methods to build a Random Forest
- Forest-RI (random input selection): randomly select the candidate attributes.
- Forest-RC (random linear combinations): create new attributes that are linear combinations of the existing attributes.
---
### Ensemble methods: Boosting
Unlike Random Forests and bagging, which can both be trained in parallel, boosting trains its classifiers one iteration at a time.
---
**How it works?**
- A series of k classifiers is iteratively learned.
- After a classifier $M_i$ is learned, the weights are updated to allow the subsequent classifier, $M_{i+1}$, to **pay more attention to the training tuples that were misclassified by $M_i$**.
---
**Adaboost**
- Given d class-labeled tuples: $(X_1,y_1), (X_2,y_2),...,(X_d,y_d)$.
- Initially, each tuple has the same weight (1/d).
---
Generate k classifiers in k rounds. At round i:
- Tuples from D are sampled (with replacement) to form a training set $D_i$.
- Each tuple's **chance of being selected** is based on **its weight**.
- If a tuple is misclassified, its weight is increased; otherwise it is decreased.
---
Prediction
- Each model has a weight; multiply each model's output by its weight and combine the weighted votes to get the final class.
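
A minimal sketch (assuming scikit-learn) of AdaBoost; `estimator_weights_` holds the per-round voting weights used in the final, weighted prediction:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# k=50 boosting rounds; each round re-weights the misclassified tuples.
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print("test accuracy:", ada.score(X_te, y_te))
print("first few per-round model weights:", ada.estimator_weights_[:5])
```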
---
## Part7 Bayes Classification Methods
**Goal: compute the probability that $C_i$ occurs given $X$.**
Example: $C_i$ is the event "buys a computer", and $X$ is a person described by several attributes (e.g. age, income, credit rating, etc.).
- $P(C_i)$: the prior probability that a person buys a computer.
- $P(X|C_i)$: the probability of observing those attributes given that the person buys a computer.
- Assuming the attributes are independent of each other, $P(X|C_i)=\prod_{k=1}^{n}P(x_k|C_i)=P(x_1|C_i)\times P(x_2|C_i)\times\dots\times P(x_n|C_i)$
Finally, Bayes' theorem gives our goal, $P(C_i|X)$:
$$
P(C_i|X)=\frac{P(X|C_i)P(C_i)}{P(X)}
$$
Since we are classifying, we only need to know which probability is higher ("buys" vs. "does not buy"), so there is no need to compute $P(X)$.
Example:
- Probability of buying: 0.028/P(X)
- Probability of not buying: 0.007/P(X)
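
A minimal sketch (assuming scikit-learn) of a naive Bayes classifier on categorical attributes; the tiny "buys computer" table below is made up for illustration, and `CategoricalNB`'s default `alpha=1.0` already applies the Laplacian correction discussed next:

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Made-up (age, income, credit_rating) tuples with a "buys computer" label.
X_raw = [["youth", "high", "fair"], ["youth", "high", "excellent"],
         ["middle", "high", "fair"], ["senior", "medium", "fair"],
         ["senior", "low", "excellent"], ["middle", "low", "excellent"]]
y = ["no", "no", "yes", "yes", "no", "yes"]

enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)              # encode categories as integer codes

clf = CategoricalNB(alpha=1.0)            # multiplies per-attribute P(x_k | C_i)
clf.fit(X, y)
print(clf.predict(enc.transform([["youth", "medium", "fair"]])))
```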
---
We must avoid the zero-probability problem: if any single factor has probability 0, the whole product becomes 0.
We use the **Laplacian correction**: simply add 1 to every count.
E.g.:
1000 tuples:
- low: 0
- medium: 990
- high: 10
Here prob(low)=0, so we add 1 to every count:
- low: 1
- medium: 991
- high: 11
Then prob(low)=1/1003; prob(medium)=991/1003; prob(high)=11/1003
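
A one-step-per-line sketch of that arithmetic in Python, mirroring the counts above:

```python
# Laplacian correction on the income counts above.
counts = {"low": 0, "medium": 990, "high": 10}

smoothed = {k: v + 1 for k, v in counts.items()}    # add 1 to every count
total = sum(smoothed.values())                      # 1000 + 3 = 1003
probs = {k: v / total for k, v in smoothed.items()}
print(probs)   # low: 1/1003, medium: 991/1003, high: 11/1003
```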
---
## Part8 Discussion
### Class imbalance problem
Note that resampling must be done on the training set only; its purpose is simply to make the model learn properly instead of blindly guessing one label.
- Oversampling: increase the number of minority-class samples.
- Under-sampling: reduce the number of majority-class samples.
The imbalance in the testing set reflects the real-world distribution, so it must not be resampled; otherwise the measured performance would be misleading.
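
A minimal sketch (assuming scikit-learn's `resample` utility and made-up training data) of random oversampling applied to the training set only:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = np.array([0] * 90 + [1] * 10)         # 90 majority vs. 10 minority samples

# Oversample the minority class in the TRAINING set only; leave the test set untouched.
X_min, X_maj = X_train[y_train == 1], X_train[y_train == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print("balanced class counts:", np.bincount(y_bal))   # [90 90]
```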
---
### Normalization
Without standardization, a feature with large values can dominate and make the other features look tiny, which misleads the SVM. Therefore, standardize before using an SVM; the accuracy difference between standardized and unstandardized data can exceed 10%.
- Variance: measures how spread out the data are
$$
\frac{1}{N}\sum_{x \in X}(x-\bar{X})^2
$$
- The standard deviation is the square root of the variance
$$
\sigma=\sqrt{\frac{1}{N}\sum_{x \in X}(x-\bar{X})^2}
$$
---
Standardization makes the **data's mean 0 and its standard deviation 1**.
Two steps:
1) Subtract the mean from every value
2) Divide every value by the standard deviation
$$
x\rightarrow \frac{x-\bar{x}}{\sigma}
$$

This prevents a feature with large values from drowning out the variation of the other features.
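
A minimal sketch (assuming scikit-learn) comparing an SVM with and without standardization; `StandardScaler` performs exactly the two steps above, per feature:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = SVC().fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)   # (x - mean) / std per feature

print("SVM without standardization:", raw.score(X_te, y_te))
print("SVM with standardization   :", scaled.score(X_te, y_te))
```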
---
### Multiclass F1 score
There are micro-F1 and macro-F1. **Micro-F1** is effectively accuracy (for single-label multiclass classification), while **macro-F1** averages the per-class precision and recall.
**macro-F1**
- macro-precision $=\frac{1}{k}\sum_i \text{precision}_i$
- macro-recall $=\frac{1}{k}\sum_i \text{recall}_i$
- macro-F1-score $=\frac{2}{\frac{1}{\text{macro-precision}}+\frac{1}{\text{macro-recall}}}$
Therefore, when we face an imbalance problem, we use the macro-F1 score as the metric.
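
A minimal sketch (assuming scikit-learn and made-up imbalanced labels) showing why macro-F1 is the stricter metric:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]   # imbalanced: class 0 dominates
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]   # mostly predicts the majority class

print("accuracy:", accuracy_score(y_true, y_pred))
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))   # equals accuracy here
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))   # punishes ignoring the minority classes
```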