Decision Tree
Visualization
Implement Decision Tree from scratch
What is a Decision Tree?
I. Structure
A decision tree reaches its decision by performing a sequence of tests:
- each internal node tests the value of one attribute
- each branch corresponds to one possible outcome of that test
- each leaf node specifies the class to return
II. How to build the decision tree
- Algorithm: learn from a dataset, adopting a greedy divide-and-conquer strategy:
  - always test the most important attribute first
  - then recursively solve the smaller subproblems defined by the possible outcomes of the test
- Most important attribute
  - the most important attribute in this case: type
  - means the attribute that makes the most difference to the classification of an example
  - the hope is to reach the correct classification with a small number of tests
  - that way all paths in the tree are short and the tree as a whole is shallow (a minimal sketch of this recursion follows below)
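
A minimal sketch of this greedy divide-and-conquer recursion, assuming examples are dicts mapping attribute names to values; `choose_best_attribute` is a hypothetical helper (the impurity measures it would rely on are introduced below):

```python
from collections import Counter

def build_tree(rows, labels, attributes):
    """Greedy divide-and-conquer: test the best attribute, then recurse."""
    if len(set(labels)) == 1:          # all examples agree -> leaf
        return labels[0]
    if not attributes:                 # no tests left -> majority-vote leaf
        return Counter(labels).most_common(1)[0][0]
    best = choose_best_attribute(rows, labels, attributes)  # hypothetical helper
    node = {"attribute": best, "branches": {}}
    rest = [a for a in attributes if a != best]
    # Recurse on the subproblems defined by each outcome of the test.
    for value in set(row[best] for row in rows):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node["branches"][value] = build_tree([r for r, _ in pairs],
                                             [l for _, l in pairs], rest)
    return node
```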
- Performance and learning curve
  - with more training examples (data), the decision tree becomes more accurate
- How to choose the most important attribute
  - in short, an algorithm is first used to quantify the features, and those numbers are then used to construct each node of the decision tree. The three most common algorithms are ID3, C4.5, and CART, which use Gini impurity and entropy
- Gini impurity
  - Gini(D) = 1 − Σ_k p_k², where p_k is the proportion of class k in dataset D
  - the larger the computed Gini, the greater the uncertainty of the variable; see the illustrated examples in the references
  - for two classes the maximum is 0.5 and the minimum is 0 (see the sketch below)
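
A minimal Python sketch of the formula above (the function name is my own, not from the original source):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini(D) = 1 - sum of p_k squared; 0 = pure, 0.5 = max for two classes."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["good"] * 4))                    # 0.0 -> no uncertainty
print(gini_impurity(["good", "good", "bad", "bad"]))  # 0.5 -> most uncertain
```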
- Entropy
  - Entropy(D) = − Σ_k p_k · log₂(p_k)
  - the larger the entropy, the greater the uncertainty of the variable; see the illustrated examples in the references
  - Ex: when good and bad melons each make up half the data, entropy is at its maximum (1); when all the melons are good or all are bad, entropy is at its minimum (0)
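
And the corresponding sketch for entropy, using the melon example above:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(D) = -sum(p_k * log2(p_k)); 0 = pure, 1 = 50/50 for two classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["good", "good", "bad", "bad"]))  # 1.0  -> half good, half bad
print(entropy(["good"] * 4))                    # -0.0 -> all good (i.e. zero)
```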
- Gini vs. Entropy
  - having understood Gini and entropy, we know that if a feature yields a high entropy or Gini, splitting a group of data on that feature leaves high uncertainty
  - conversely, a low value means high certainty, i.e. we can tell more accurately which class a sample actually belongs to
- Goal:
  - find the feature whose split gives the lowest entropy or Gini to grow the branches, so the decision tree does not grow too deep and the predictions at the leaf nodes are more accurate
  - we will use: ID3, C4.5, CART
| Algorithm | Model | Structure | Attribute selection | Continuous values | Missing data | Pruning | Multiple use of features |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ID3 | classification | multiway tree | information gain | ✗ | ✗ | ✗ | ✗ |
| C4.5 | classification | multiway tree | information gain ratio | ✓ | ✓ | ✓ | ✓ |
| CART | classification, regression | binary tree | Gini impurity, mean squared error | ✓ | ✓ | ✓ | ✓ |
- see the worked examples in the references
- ID3
  - information gain: Gain(D, a) = Entropy(D) − Σ_v (|D_v| / |D|) · Entropy(D_v), where D_v is the subset of D in which attribute a takes value v
  - ID3 may overfit since it has no pruning strategy
  - it can only process discrete features
  - it does not consider missing data
  - information gain is biased toward features with many distinct values, which tend to yield larger gains
  - the simplest of the three: useful when the decision tree is not complex (see the sketch below)
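
A sketch of ID3's criterion, reusing the `entropy` function above (rows are again assumed to be dicts of attribute -> value):

```python
def information_gain(rows, labels, attribute):
    """Gain(D, a) = Entropy(D) - weighted entropy of the subsets split by a."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(row[attribute] for row in rows):
        subset = [l for r, l in zip(rows, labels) if r[attribute] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain  # ID3 splits on the attribute with the largest gain
```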
- C4.5
  - gain ratio: GainRatio(D, a) = Gain(D, a) / IV(a), where IV(a) = − Σ_v (|D_v| / |D|) · log₂(|D_v| / |D|)
  - C4.5 uses a multiway tree, but a binary tree would be more efficient
  - it can only do classification
  - computing entropy takes a lot of time
  - the gain ratio favors features with fewer possible values (the smaller the denominator, the larger the ratio), so C4.5 does not simply split on the feature with the largest gain ratio. Instead it uses a heuristic: first find the candidate features whose information gain is above average, then choose the one with the highest gain ratio among them (see the sketch below)
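
A sketch of C4.5's gain ratio and its above-average heuristic, reusing `entropy` and `information_gain` from above:

```python
def gain_ratio(rows, labels, attribute):
    """GainRatio = Gain / IV; IV is the entropy of the attribute's own values."""
    iv = entropy([row[attribute] for row in rows])
    return 0.0 if iv == 0 else information_gain(rows, labels, attribute) / iv

def c45_choose_attribute(rows, labels, attributes):
    # Heuristic: keep features whose information gain is above average,
    # then pick the one with the highest gain ratio among them.
    gains = {a: information_gain(rows, labels, a) for a in attributes}
    average = sum(gains.values()) / len(gains)
    candidates = [a for a in attributes if gains[a] >= average]
    return max(candidates, key=lambda a: gain_ratio(rows, labels, a))
```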
- CART
  - CART builds a binary tree: for classification it picks the split that minimizes the weighted Gini impurity, and for regression the split that minimizes the mean squared error
  - unlike ID3 and C4.5, it supports regression as well as classification
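
A sketch of how CART could pick a binary split on a numeric feature by minimizing the weighted Gini impurity (reusing `gini_impurity` from above; the function name is my own):

```python
def best_cart_split(rows, labels, feature):
    """Try thresholds between adjacent values; return the one giving the
    lowest weighted Gini impurity over the two child nodes."""
    n = len(labels)
    values = sorted(set(row[feature] for row in rows))
    best_threshold, best_score = None, float("inf")
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # candidate cut between adjacent values
        left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
        right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
        score = (len(left) / n) * gini_impurity(left) \
              + (len(right) / n) * gini_impurity(right)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score
```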
III. How the tree predicts results
Take a decision tree that predicts fake or real accounts as an example.
- root: separates the account data based on its number of followers, which means
  - if the value of the account's #followers feature is <= 92, then the account is possibly real
  - else it is possibly fake
- Notice that I use the word possibly here because the predicted result sits at a leaf node. Before arriving at a leaf node, we cannot confirm which class the data belongs to.
- how:
  - the data goes through the tree by comparing its feature value to the threshold of each node to decide which way to go, until it arrives at a leaf node (see the sketch after this list)
  - the leaf node is then the predicted result for the data
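
A minimal traversal sketch, assuming a binary-threshold node layout; the dict keys `feature`, `threshold`, `left`, `right` are my own convention, not a fixed format:

```python
def predict(node, sample):
    """Follow the branch chosen by each threshold test until a leaf."""
    while isinstance(node, dict):  # internal node -> keep testing
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node                    # leaf = predicted class
```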
- predict: we use this sample to test
| profile pic | nums/length username | fullname words | nums/length fullname | name==username | description length | external URL | private | #posts | #followers | #follows | fake |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.27 | 0 | 0 | 0 | 53 | 0 | 0 | 32 | 1000 | 955 | 0 |
- it will predict the account is fake!
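Running the sample through a toy tree with `predict` from above. Only the root split (#followers <= 92) comes from the example; collapsing the two branches straight to leaves is my simplification, so the real tree is deeper:

```python
toy_tree = {
    "feature": "#followers", "threshold": 92,
    "left": "real",   # few followers  -> possibly real
    "right": "fake",  # many followers -> possibly fake
}
sample = {"profile pic": 1, "nums/length username": 0.27, "fullname words": 0,
          "nums/length fullname": 0, "name==username": 0,
          "description length": 53, "external URL": 0, "private": 0,
          "#posts": 32, "#followers": 1000, "#follows": 955}

print(predict(toy_tree, sample))  # "fake", since 1000 > 92
```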

IV. Why decision tree?
- it breaks complex data down into more manageable parts
- its process is explainable, compared to deep learning
complete introduction to decision tree (with source code)
Implement in Python
Source code and explanation provided here
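
For comparison, a minimal scikit-learn version (the references cover sklearn and Graphviz visualization in depth); it is shown on the built-in iris dataset since the fake-account data itself is not included here:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(export_text(clf))    # text view of the learned thresholds
print(clf.predict(X[:1]))  # predicted class for one sample
```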
Reference
IT鐵人賽 [Day 12] 決策樹 (Decision tree): explains the parameters in detail
資料視覺化之 Decision tree (決策樹)範例與 Machine Learning (機器學習) 概念簡單教學(入門): a detailed walkthrough with sklearn, from preparing the data to visualization
Graphviz download
Python中defaultdict的用法分析详解
資料視覺化之 contour ( matplotlib.pyplot ) 教學與用法: very useful for visualizing data classification
决策树的手工实现与可视化(Python版本): explains information gain in detail
Visualizing the Third Numeric Variable in Python: plot for three variables
How to program a decision tree in Python from 0: the author is Spanish, so the odd-looking variable names are Spanish; everything works except the last function, classify_data()
基尼指数在机器学习中的奥秘: clearly explains the math behind Gini and entropy, with illustrated examples
決策樹(Decision Tree)常見的三種算法(ID3、C4.5、CART): gives the formulas for all three algorithms, plus worked examples for each
【机器学习】决策树(上)——ID3、C4.5、CART(非常详细)
机器学习入门 12-4 基尼系数
Visualize a Decision Tree in 4 Ways with Scikit-Learn and Python
pydotplus使用