###### `ML` `Decision Tree` `Visualization`

# Implement Decision Tree from scratch

## What is a Decision Tree

### I. Structure

![image](https://hackmd.io/_uploads/rk_0ORLSa.png)

### II. How to build the decision tree

1. Algorithm: learn from the dataset with a greedy divide-and-conquer strategy:

    ![image](https://hackmd.io/_uploads/H1YjQkwrp.png)

    - always test the **most important attribute** first
    - then recursively solve the smaller subproblems defined by the possible outcomes of the test

2. **Most important attribute**
    - the most important attribute in this case: **type**

      ![image](https://hackmd.io/_uploads/SyoXf1vBT.png)
    - it is the attribute that makes the most difference to the classification of an example
    - we hope to reach the correct classification with a small number of tests
    - that way, all **paths in the tree will be short** and **the tree as a whole will be shallow**

3. Performance and learning curve

    **With more training examples, the decision tree becomes more accurate.**

    ![image](https://hackmd.io/_uploads/ryIr4yDr6.png)

4. **How to choose the most important attribute**
    - In short, an algorithm first quantifies each feature, and those numbers are then used to construct every node of the decision tree. The three most common algorithms are ID3, C4.5, and CART, which rely on computations of **Gini impurity** and **entropy**.
    - **Gini impurity**

      ![image](https://hackmd.io/_uploads/Hkc6kzQH6.png)
        - the larger the computed Gini value, the greater the uncertainty of the variable; see [this illustrated example](https://zhuanlan.zhihu.com/p/545157674)
        - maximum 0.5 (for two classes), minimum 0
    - **Entropy**

      ![image](https://hackmd.io/_uploads/H1FjkMXBp.png)
        - the larger the entropy, the greater the uncertainty of the variable; see [this illustrated example](https://blog.csdn.net/qq_42308287/article/details/123963052)
        - Ex: when good and bad melons each make up half of the data, entropy is at its maximum (1); when all melons are good, or all are bad, entropy is at its minimum (0)
    - **Gini vs. Entropy**
        - once we understand Gini and entropy, we can see that if **the entropy or Gini computed for a feature is too high**, then **classifying the data by that feature leaves a lot of uncertainty**
        - conversely, a low value means high certainty, i.e., we can tell fairly accurately which class an example belongs to
    - **Goal**:
        - **find the feature with the lowest entropy or Gini to split on as the branches grow**, so the decision tree does not get too deep and the predictions at the leaf nodes are more accurate (see the sketches at the end of this section)
    - The three algorithms, ID3, C4.5, and CART, compared:

      |Algorithm|Model|Structure|Attribute selection|Continuous values|Missing data|Pruning|Feature reuse|
      |-|-|-|-|-|-|-|-|
      |ID3|classification|multiway tree|information gain|x|x|x|x|
      |C4.5|classification|multiway tree|information gain ratio|v|v|v|v|
      |CART|classification,<br/>regression tree|binary tree|Gini impurity,<br/>mean squared error|v|v|v|v|

        - [see this worked example](https://roger010620.medium.com/%E6%B1%BA%E7%AD%96%E6%A8%B9-decision-tree-%E5%B8%B8%E8%A6%8B%E7%9A%84%E4%B8%89%E7%A8%AE%E7%AE%97%E6%B3%95-id3-c4-5-cart-54091ca85044)
    - **ID3**

      ![image](https://hackmd.io/_uploads/SJARvxPHa.png)
        - ID3 may overfit, since it has no pruning strategy
        - can only process discrete features
        - does not consider missing data
        - information gain is biased toward features with many distinct values
        - the simplest of the three: useful when the decision tree is not complex
    - **C4.5**

      ![image](https://hackmd.io/_uploads/r1M7qePBT.png)
        - C4.5 uses a multiway tree, though a binary tree would be more efficient
        - can only do classification
        - computing entropy is relatively expensive (it involves logarithms)
        - the information gain ratio is biased toward features with fewer possible values (the smaller the denominator, the larger the ratio), so C4.5 does not split directly on the feature with the largest gain ratio; instead it uses a **heuristic**: first find the candidate features whose information gain is above average, then pick the one with the highest gain ratio among them
    - **CART**

      ![image](https://hackmd.io/_uploads/rJg-6ZPS6.png)
        - CART uses a binary tree -> **fast**
        - supports both classification and regression
        - uses Gini impurity -> **takes less time to compute**
        - [【机器学习】决策树(上)——ID3、C4.5、CART(非常详细)](https://zhuanlan.zhihu.com/p/85731206)
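To make the two impurity metrics concrete, here is a minimal sketch in Python. It assumes class labels arrive as a plain list; the names `gini_impurity` and `entropy` are illustrative, not taken from the article's repo:

```python
from collections import Counter
from math import log2

def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2), where p_i is the proportion of class i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)), where p_i is the proportion of class i."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

# The melon example above: a 50/50 class mix maximizes both metrics,
# and a pure node minimizes both.
print(gini_impurity(["good", "bad"]))   # 0.5 (maximum for two classes)
print(entropy(["good", "bad"]))         # 1.0 (maximum for two classes)
print(gini_impurity(["good", "good"]))  # 0.0 (pure node)
```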
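Building on those helpers, the "most important attribute" is the feature/threshold pair whose binary split minimizes the size-weighted impurity of the two children (equivalently, maximizes information gain). A minimal CART-style sketch, assuming the data is a list of numeric feature rows with a parallel label list; `best_split` and its return convention are our own choices:

```python
def weighted_impurity(left_labels, right_labels, impurity=gini_impurity):
    """Impurity of a split = size-weighted average impurity of the children."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * impurity(left_labels)
            + len(right_labels) / n * impurity(right_labels))

def best_split(rows, labels, impurity=gini_impurity):
    """Return (feature_index, threshold) of the lowest-impurity binary split."""
    best_feature, best_threshold = None, None
    best_score = float("inf")
    for feature in range(len(rows[0])):
        # Candidate thresholds: every distinct value this feature takes.
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # degenerate split: skip
            score = weighted_impurity(left, right, impurity)
            if score < best_score:
                best_score, best_feature, best_threshold = score, feature, threshold
    return best_feature, best_threshold
```

Only the scoring function would change for the other two algorithms: C4.5 ranks candidates by information gain ratio, ID3 by raw information gain over discrete values.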
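With a splitter in hand, the greedy divide-and-conquer loop from the start of this section, and the root-to-leaf traversal described in Section III below, each fit in a few lines. A bare sketch with no pruning, reusing `Counter`, `gini_impurity`, and `best_split` from the sketches above and storing nodes as plain dictionaries:

```python
def build_tree(rows, labels, max_depth=5, impurity=gini_impurity):
    """Greedy divide and conquer: split on the best attribute, then recurse."""
    # Stop at a pure node or when the depth budget is spent:
    # emit a leaf holding the majority class.
    if max_depth == 0 or len(set(labels)) == 1:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    feature, threshold = best_split(rows, labels, impurity)
    if feature is None:  # no useful split exists (e.g. duplicate rows)
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
    right = [i for i, row in enumerate(rows) if row[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           max_depth - 1, impurity),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            max_depth - 1, impurity),
    }

def predict(node, row):
    """Walk from the root, comparing feature values to thresholds, to a leaf."""
    while "leaf" not in node:
        side = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[side]
    return node["leaf"]

# Toy data: two numeric features, binary labels.
rows = [[2.0, 1.0], [3.0, 1.5], [10.0, 2.0], [11.0, 2.5]]
labels = ["real", "real", "fake", "fake"]
tree = build_tree(rows, labels)
print(predict(tree, [2.5, 1.2]))  # -> "real"
```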
<br/><br/>

### III. How it works to predict the results

Take a decision tree used to **predict fake or real accounts** as an example:

![image](https://hackmd.io/_uploads/HydxkMPr6.png)

- root: separates the account data based on the **number of followers**, which means
    - if the value of the account's **`#followers`** feature is <= 92 :arrow_right: the account is possibly real
    - else :arrow_right: it is possibly fake
    - Notice that I use the word **possibly** here because the prediction is only made at a leaf node. Before arriving at a leaf node, the class of the data cannot be confirmed.
- how:
    - the data travels down the tree by comparing its feature values to the threshold at each node, deciding which branch to take, until it arrives at a leaf node (exactly the loop in the `predict` sketch above)
    - the leaf node gives the predicted result for the data
- predict: we use this data to test

    | profile pic | nums/length username | fullname words |nums/length fullname|name==username|description length|external URL|private|#posts|#followers|#follows|fake|
    | ---- | ---- | ---- |---- |---- | ---- |----|----|----|----|----|----|
    | 1 |0.27| 0 |0 |0 |53 |0 |0 |32 |1000| 955| 0|

    - :tada: it predicts the account is fake!

    ![image](https://hackmd.io/_uploads/HJKOezPS6.png)

### IV. Why decision trees?

1. They break complex data down into more manageable parts.
2. The decision process is explicable, compared to deep learning.

:link: [complete introduction to decision tree (with source code)](https://github.com/michaeldorner/DecisionTrees/tree/master)

<br/><br/>

## Implement in Python

#### Source code and explanation provided [here](https://github.com/Hlunlun/Decision-Tree-from-scratch)

<br/><br/>

## Reference

:link: [IT鐵人賽 [Day 12] 決策樹 (Decision tree)](https://ithelp.ithome.com.tw/articles/10271143): explains in detail what the parameters mean
:link: [資料視覺化之 Decision tree (決策樹)範例與 Machine Learning (機器學習) 概念簡單教學(入門)](https://tree.rocks/decision-tree-graphviz-contour-with-pandas-gen-train-test-dataset-for-beginner-9137b7c8416a): a detailed sklearn walkthrough, from preparing the data to visualization
:link: [Graphviz download](https://graphviz.org/download/)
:link: [Python中defaultdict的用法分析详解](https://zhuanlan.zhihu.com/p/504744465)
:link: [資料視覺化之 contour ( matplotlib.pyplot ) 教學與用法](https://tree.rocks/python-matplotlib-plt-contour-or-contourf-tutorial-with-sklearn-e4497f76280): very useful for visualizing how the data is classified
:link: [决策树的手工实现与可视化(Python版本)](https://blog.csdn.net/qq_42308287/article/details/123963052): explains information gain in detail
:link: [Visualizing the Third Numeric Variable in Python](https://betterprogramming.pub/visualizing-the-third-numeric-variable-in-python-eaf73f42219): plots for three variables
:link: [How to program a decision tree in Python from 0](https://anderfernandez.com/en/blog/code-decision-tree-python-from-scratch/#How-to-choose-the-cuts-for-our-decision-tree): the author is Spanish, so the odd-looking variable names are Spanish; everything works except the last function, `classify_data()`
:link: [基尼指数在机器学习中的奥秘](https://zhuanlan.zhihu.com/p/545157674): explains the math behind Gini and entropy in detail, with very clear illustrated examples
:link: [決策樹(Decision Tree)常見的三種算法(ID3、C4.5、CART)](https://roger010620.medium.com/%E6%B1%BA%E7%AD%96%E6%A8%B9-decision-tree-%E5%B8%B8%E8%A6%8B%E7%9A%84%E4%B8%89%E7%A8%AE%E7%AE%97%E6%B3%95-id3-c4-5-cart-54091ca85044): gives the formulas for all three algorithms, with worked examples under each
:link: [【机器学习】决策树(上)——ID3、C4.5、CART(非常详细)](https://zhuanlan.zhihu.com/p/85731206)
:link: [机器学习入门 12-4 基尼系数](https://cloud.tencent.com/developer/article/1774905)
:link: [Visualize a Decision Tree in 4 Ways with Scikit-Learn and Python](https://mljar.com/blog/visualize-decision-tree/)
:link: [pydotplus使用](https://www.cnblogs.com/qizhou/p/12743708.html)