# Intro ML

contributed by <`kylekylehaha`>

###### tags: `Data Science`

## Part0 What is Machine Learning?

**Key Essence of Machine Learning**

- There exists some "underlying pattern" to be learned
  - so that a performance measure can be improved.
- But there is no programmable (easy) definition.
  - When a simple definition exists, there is no need for ML.
- Somehow there is data about the pattern.

---

![](https://i.imgur.com/nn19Mev.png)

---

## Part1 What is Classification?

### Supervised vs. Unsupervised learning

Supervised learning (e.g. classification)
- Supervision: the **training data** are accompanied by **labels**.

Unsupervised learning (e.g. clustering)
- The class labels of the training data are unknown.

---

### Classification: a two-step process

**Model construction**: describe a set of predetermined classes.

**Model usage**: classify future or unknown objects.

---

## Part2 Decision Tree

A node corresponds to an attribute, and the subtrees (branches) under a node correspond to the possible values of that attribute.

Q: Decision tree vs. loser/winner tree?
A: Loser/winner trees are selection trees, used for sorting (usually in increasing order).

---

### Building a decision tree

When splitting, **entropy** measures how impure a split is.

**Entropy**:
- Definition: Info(D) denotes the impurity of D, where D contains m classes
$$
Info(D)=-\sum_{i=1}^m p_i\log(p_i)
$$
- $p_i$ is the probability of class i, i.e. the proportion of that class in the data.
- A larger value means more impurity; a smaller value means the data is more homogeneous.
- Info(D) = 0 means all tuples belong to a single class.
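To make $Info(D)$ concrete, here is a minimal NumPy sketch (log base 2, the usual convention for entropy in bits; the formula above leaves the base unspecified):

```python
import numpy as np

def info(labels):
    """Entropy Info(D) of a label array: -sum_i p_i * log2(p_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return max(0.0, float(-np.sum(p * np.log2(p))))  # clamp the -0.0 edge case

print(info(["yes"] * 10))              # 0.0: a pure partition
print(info(["yes"] * 5 + ["no"] * 5))  # 1.0: a 50/50 split is maximally impure
```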
---

### Feature Space Separation

A **decision tree** can only draw **axis-perpendicular lines**; it cannot classify with a diagonal line. These boundary lines are called discriminant functions.

![](https://i.imgur.com/BJifdar.png)
(Left: a decision tree. Right: the decision tree can only split with horizontal and vertical lines, while the correct boundary would be a diagonal line.)

Pros:
- Fast
- Easily interpretable by humans

Cons:
- Accuracy
- Greedily selects the split attribute (each level is fixed once split; later results cannot revise earlier choices)

---

## Part4 Support Vector Machine (SVM)

**Discriminant function:** the classifier is said to assign a feature vector $x$ to class $w_i$ if
$$
g_i(x) \gt g_j(x) \text{ for all } j \ne i
$$

For the two-class case: $g(x)\equiv g_1(x)-g_2(x)$; assign $x$ to $w_1$ if $g(x) \gt 0$; otherwise assign $x$ to $w_2$.

---

### Linear support vector machine

The linear discriminant function with the maximum **margin** is the best.
- The margin is defined as the width by which the boundary could be increased before hitting a data point (the distance from the boundary to the closest point above it, plus the distance to the closest point below it).
- A larger margin means a cleaner separation.
- ![](https://i.imgur.com/rxmJkho.png)

---

**Goal**: in $w^Tx+b=0$, find $w$ and $b$ that make the margin as large as possible; treat this as an optimization problem.

We know that
$$
w^Tx^+ +b = 1 \\
w^Tx^- +b = -1
$$

Thus, the margin width is:
$$
M=(x^+-x^-) \cdot n
=(x^+-x^-) \cdot \frac{w}{||w||} = \frac{2}{||w||}
$$

![](https://i.imgur.com/FKNpM8k.png)

The points lying on the margin are called support vectors; they can be viewed as the points that prop the margin open.

---

Formulation:
$$
\text{maximize } \frac{2}{||w||}
$$
such that
$$
\text{for } y_i=+1, \ w^Tx_i+b \ge 1\\
\text{for } y_i=-1, \ w^Tx_i+b \le -1
$$

This can be rewritten as:
$$
\text{minimize } \frac{1}{2}||w||^2
$$
such that
$$
y_i(w^Tx_i+b) \ge 1
$$

---

What if noise in the data makes a perfect linear separation impossible? Introduce **slack variables**: each point is allowed to violate the margin by a bounded amount.

![](https://i.imgur.com/9lt4rp6.png)
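A minimal scikit-learn sketch on a toy linearly separable dataset (the data is made up for illustration): fit a linear SVM, then recover $w$, $b$, and the margin width $\frac{2}{||w||}$.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two small clusters.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # a large C approximates the hard-margin SVM
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)  # the points propping the margin open
```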
---

### Non-linear SVM

Map the low-dimensional data objects to a high-dimensional space.

![](https://i.imgur.com/LP356KR.png)
![](https://i.imgur.com/NeZFnlY.png)
![](https://i.imgur.com/C0wgiDV.png)

---

With this mapping, our discriminant function is now:
$$
g(x)=w^T\phi(x)+b=\sum_{i \in SV}\alpha_i y_i \phi(x_i)^T\phi(x)+b
$$

There is no need to know the mapping explicitly, because we only use the **dot product** of feature vectors in both training and testing.

A **kernel function** is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space.

Examples of commonly used kernel functions:
- Linear kernel: $K(x_i,x_j)=x^T_ix_j$
- Gaussian (Radial Basis Function, RBF) kernel: $K(x_i,x_j)=exp(-\frac{||x_i-x_j||^2}{2\sigma^2})$

A linear kernel can also be used to find the more important features (after training, each feature receives a weight), which helps against the curse of dimensionality.
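The "only dot products" point can be shown by handing the learner nothing but a Gram matrix of kernel values. A sketch on a made-up XOR-style dataset that is not linearly separable in the original space:

```python
import numpy as np
from sklearn.svm import SVC

def rbf(xi, xj, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||xi - xj||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

# XOR-like data: no single line separates the two classes.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

# The learner never sees phi(x), only pairwise kernel values.
K = np.array([[rbf(a, b) for b in X] for a in X])
clf = SVC(kernel="precomputed", C=10.0)
clf.fit(K, y)
print(clf.predict(K))  # [0 0 1 1]: separable after the implicit mapping
```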

---

## Part5 Model evaluation

**Class imbalance problem**

**Sensitivity**: true positive recognition rate
- sensitivity = TP/P (P: # of positives)

**Specificity**: true negative recognition rate
- specificity = TN/N (N: # of negatives)

**Recall and precision** should be considered together; combining the two gives the F1 score.

---

**Cross-validation** (a sketch follows this section)
- Randomly partition the data into k = 10 folds.
- At the i-th iteration, use $D_i$ as the test set and the remaining folds as the training set.
- ![](https://i.imgur.com/8pJPvNG.png)
- Shuffle before doing k-fold cross-validation.

---

**ROC curve**: for visual comparison of classification models.
- Shows the trade-off between the **true positive** rate and the **false positive** rate.
- The **area** under the ROC curve is a measure of the accuracy of the model.
- The **closer** to the diagonal line, the **less** accurate the model.
- A model with perfect accuracy has an area of 1.0.

![](https://i.imgur.com/b7ILR9c.png)
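A minimal sketch of the shuffled 10-fold procedure described above (the dataset and the choice of classifier are placeholders):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # placeholder data
y = (X[:, 0] + X[:, 1] > 0).astype(int)

kf = KFold(n_splits=10, shuffle=True, random_state=0)  # shuffle before splitting
scores = []
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
print("mean accuracy over 10 folds:", np.mean(scores))
```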

---

## Part6 Ensemble Methods

Let r classifiers vote on the final label. Two design questions need attention: how to train classifiers that differ from one another, and how to combine their votes (e.g., with different weights per voter).

Ensemble methods: increasing the accuracy
- Use a combination of models to increase accuracy.
- Ideally, there should be significant differences among the models (if the models are all alike, voting is pointless).

Popular ensemble methods:
- Bagging: averaging the prediction over a collection of classifiers.
- Boosting: weighted voting with a collection of classifiers.
- Ensemble: combining a set of heterogeneous classifiers.

---

![](https://i.imgur.com/WEsVswb.png)
(Left: a single decision tree. Right: many decision trees combined with ensemble methods; the result on the right looks much more like the diagonal boundary.)

---

### Ensemble methods: bagging

- Draw bootstrap samples from the training set, and train one classifier per sample.
- For an unknown sample X, every classifier makes a prediction, and the final prediction for the sample is decided by the vote count.
- Bagging can also predict continuous values: simply average the predicted values.
- ![](https://i.imgur.com/4eZXPNn.jpg)
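A bagging sketch on placeholder data (scikit-learn's `BaggingClassifier` uses a decision tree as its default base learner):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))            # placeholder data
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 25 bootstrap samples, one tree per sample; the final prediction for a
# sample is the majority vote across the 25 trees.
bag = BaggingClassifier(n_estimators=25, bootstrap=True, random_state=1)
bag.fit(X, y)
print(bag.predict(X[:5]))
```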

---

### Ensemble methods: Random Forest

Where bagging manipulates the training data, Random Forest also manipulates each classifier: it picks attributes at random when splitting.

The original decision tree selects the attribute by Info(D), the smaller the better; here we instead pick attributes at random to build each new classifier.

![](https://i.imgur.com/eAHQvtC.jpg)

---

Two methods to build a Random Forest:
- Forest-RI (random input selection): randomly select attributes.
- Forest-RC (random linear combination): create new attributes that are linear combinations of the existing attributes.
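A Random Forest sketch in the Forest-RI spirit: on top of bootstrap sampling, each split considers only a random subset of the attributes, which decorrelates the trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))            # placeholder data
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# max_features="sqrt": each split draws sqrt(#attributes) candidates at random.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=1)
rf.fit(X, y)
print(rf.predict(X[:5]))
```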

---

### Ensemble methods: Boosting

Unlike Random Forest and bagging, which can be trained in parallel, boosting proceeds one iteration at a time.

---

**How does it work?**
- A series of k classifiers is learned iteratively.
- After a classifier $M_i$ is learned, the weights are updated to allow the subsequent classifier, $M_{i+1}$, to **pay more attention to the training tuples that were misclassified by $M_i$**.

---

**AdaBoost**
- Given d class-labeled tuples: $(X_1,y_1), (X_2,y_2),...,(X_d,y_d)$.
- Initially, each tuple has the same weight (1/d).

---

Generate k classifiers in k rounds. At round i:
- Tuples from D are sampled (with replacement) to form a training set $D_i$.
- Each tuple's **chance of being selected** is based on **its weight**.
- If a tuple is misclassified, its weight is increased; otherwise it is decreased.

---

Prediction
- Each model has a weight, so multiply each model's output by that model's weight and combine them to obtain the final class.
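A compact sketch of the AdaBoost loop (binary labels in {-1, +1}, decision stumps as the base learner; instead of resampling with replacement by weight, it hands the weights to the learner directly, which serves the same purpose):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=10):
    """Learn k weighted stumps; labels y must be in {-1, +1}."""
    d = len(y)
    w = np.full(d, 1.0 / d)                    # each tuple starts at weight 1/d
    stumps, alphas = [], []
    for _ in range(k):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)       # weight-aware stand-in for sampling D_i
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - err) / err)  # this model's voting weight
        w *= np.exp(-alpha * y * pred)         # raise misclassified weights, lower the rest
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Weighted vote: each model's +/-1 output times its model weight.
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
stumps, alphas = adaboost_fit(X, y, k=10)
print("training accuracy:", (adaboost_predict(stumps, alphas, X) == y).mean())
```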
---

## Part7 Bayes Classification Methods

**Goal: compute the probability of $C_i$ given $X$.**

An example: let $C_i$ be the event "buys a computer", and let $X$ be a person, where a person has several attributes (e.g. age, income, credit rating, etc.).

- $P(C_i)$: the probability that a person buys a computer.
- $P(X|C_i)$: the probability of having certain attributes, given "buys a computer".
- Assuming the attributes are mutually independent, $P(X|C_i)=\prod_{k=1}^nP(x_k|C_i)=P(x_1|C_i)\times P(x_2|C_i)\times ...\times P(x_n|C_i)$

Finally, Bayes' theorem yields our goal, $P(C_i|X)$:
$$
P(C_i|X)=\frac{P(X|C_i)P(C_i)}{P(X)}
$$

Since we only classify, we only need to know which probability is higher ("buys" vs. "does not buy"), so $P(X)$ itself does not need to be computed.

Example:
- Probability of buying: 0.028/P(X)
- Probability of not buying: 0.007/P(X)

---

Beware of the zero-probability problem: if any single factor has probability 0, the whole product becomes 0.

We use the **Laplacian correction**: simply add 1 to every count.

E.g., 1000 tuples:
- low: 0
- median: 990
- high: 10

Since prob(low) = 0, we add 1 to each count:
- low: 1
- median: 991
- high: 11

Then prob(low) = 1/1003; prob(median) = 991/1003; prob(high) = 11/1003.
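The correction in code, using the counts from the example above (adding 1 to each of the three counts adds 3 to the denominator, which is where the 1003 comes from):

```python
counts = {"low": 0, "median": 990, "high": 10}

total = sum(counts.values())                    # 1000 tuples
smoothed = {v: (c + 1) / (total + len(counts))  # Laplacian correction
            for v, c in counts.items()}
print(smoothed)  # low: 1/1003, median: 991/1003, high: 11/1003
```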

---

## Part8 Discussion

### Class imbalance problem

Note that resampling must be applied to the training set only; the point of this treatment is to make the model actually learn, rather than blindly guess one label.

- Oversampling: increase the minority samples.
- Under-sampling: reduce the majority samples.

The imbalance in the testing set reflects the real-world distribution, so it must not be resampled; otherwise the measured performance is distorted.
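A minimal oversampling sketch on made-up data using `sklearn.utils.resample` (dedicated tools such as imbalanced-learn exist, but plain resampling with replacement shows the idea):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(2)
X = rng.normal(size=(110, 3))            # placeholder data
y = np.array([0] * 100 + [1] * 10)       # 10:1 class imbalance

# Oversample the minority class (with replacement) up to the majority size.
X_up, y_up = resample(X[y == 1], y[y == 1], replace=True,
                      n_samples=100, random_state=2)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_bal))  # [100 100]
```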

---

### Normalization

Without standardization, one feature's values may dominate and make the other values look tiny, which misleads the SVM. Standardize before using an SVM; the accuracy difference between doing it and not can exceed 10%.

- Variance: measures how spread out the data is
$$
\frac{1}{N}\sum_{x \in X}(x-\bar{X})^2
$$
- The standard deviation is the square root of the variance
$$
\sigma=\sqrt{\frac{1}{N}\sum_{x \in X}(x-\bar{X})^2}
$$

---

Standardization makes **the data's mean 0 and its standard deviation 1**. Two steps: 1) subtract the mean from all values; 2) divide all values by the standard deviation.
$$
x\rightarrow \frac{x-\bar{x}}{\sigma}
$$

![](https://i.imgur.com/LAl8O3H.png)

This prevents one feature's large values from drowning out the variation of the others.
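The two steps in NumPy, applied per feature column (scikit-learn's `StandardScaler` does the same):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([rng.normal(0, 1, 100),        # small-scale feature
                     rng.normal(5000, 300, 100)])  # dominating feature

# 1) subtract each column's mean; 2) divide by each column's std deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0).round(6))  # ~[0. 0.]
print(X_std.std(axis=0).round(6))   # [1. 1.]
```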

---

### Multiclass F1 score

There are micro-F1 and macro-F1. **Micro-F1** is in fact just accuracy, while **macro-F1** can be viewed as averaging each class's precision and recall.

**macro-F1**
- macro-precision = $\frac{1}{k}\sum_i precision_i$
- macro-recall = $\frac{1}{k}\sum_i recall_i$
- macro-F1-score = $\frac{2}{\frac{1}{\text{macro-recall}}+\frac{1}{\text{macro-precision}}}$

Therefore, when facing an imbalance problem, we adopt the macro-F1 score as the metric.
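A quick check with scikit-learn (note that scikit-learn's "macro" averages the per-class F1 scores, a common variant of the formula above):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]   # imbalanced labels
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0]

# micro-F1 equals accuracy for single-label multiclass problems.
print(f1_score(y_true, y_pred, average="micro"))   # 0.8
print(accuracy_score(y_true, y_pred))              # 0.8
# macro-F1 averages per-class scores, so minority-class mistakes weigh more.
print(f1_score(y_true, y_pred, average="macro"))
```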
