ML Decision Tree Visualization

Implement Decision Tree from scratch

What is a Decision Tree?

I. Structure

(figure: structure of a decision tree)

II. How to build the decision tree

  1. algorithm: learns from the dataset, adopting a greedy divide-and-conquer strategy:


    • always test the most important attribute first
    • then recursively solve the smaller subproblems defined by the possible outcomes of the test
  2. most important attribute

    • the most important attribute in this case: type
    • i.e., the attribute that makes the most difference to the classification of an example
    • we hope to reach the correct classification with a small number of tests
    • then all paths in the tree will be short, and the tree as a whole will be shallow
  3. Performance and learning curve
    with more training examples (data), the decision tree becomes more accurate


  4. how to choose the most important attribute

  • In short: first use an algorithm to quantify the feature values, then use those numbers to build each node of the decision tree. Three common algorithms are ID3, C4.5, and CART; here we will use the Gini impurity computation
    • Gini impurity
      • the larger the computed Gini, the greater the uncertainty of the variable; see the illustrated example
      • maximum 0.5, minimum 0
    • Entropy
      • the larger the entropy, the greater the uncertainty of the variable; see the illustrated example
      • Ex: when good and bad melons each make up half, entropy is at its maximum (1); when all melons are good or all bad, entropy is at its minimum (0)
    • Gini vs. Entropy
      • once you understand gini and entropy, you know that if the entropy or gini computed for a feature is high, splitting the data on that feature leaves high uncertainty
      • conversely, low values mean high certainty, i.e., we can tell fairly accurately which class a sample belongs to
  • Goal:
    • find the feature with the lowest entropy or gini to split on as the branches grow, so the decision tree does not get too deep and the predictions at the leaf nodes are more accurate
  • ID3, C4.5, CART

    | Algorithm | Model | Structure | Attribute choice | Continuous values | Missing data | Pruning | Multiple use of features |
    |---|---|---|---|---|---|---|---|
    | ID3 | classification | multiway tree | information gain | x | x | x | x |
    | C4.5 | classification | multiway tree | information gain ratio | v | v | v | v |
    | CART | classification, regression | binary tree | gini impurity, mean squared error | v | v | v | v |

    • see the worked example for reference

    • ID3

      • ID3 may overfit, since there is no pruning strategy
      • can only process discrete features
      • does not consider missing data
      • information gain is biased toward features with many distinct values
      • the simplest of the three: useful when the decision tree is not complex
    • C4.5

      • C4.5 uses a multiway tree, but a binary tree would be more efficient
      • can only do classification
      • computing entropy takes a lot of time
      • the information gain ratio is biased toward features with fewer possible values (the smaller the denominator, the larger the overall value), so C4.5 does not split directly on the feature with the largest gain ratio; instead it uses a heuristic: first find the candidate features whose information gain is above average, then choose among them the one with the highest gain ratio
    • CART

      • uses a binary tree; supports both classification (Gini impurity) and regression (mean squared error)

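The impurity measures above can be sketched directly in Python. This is a minimal sketch: the function names and the good/bad melon example are my own, following the notes.

```python
from collections import Counter
import math

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2); 0 when pure, 0.5 at worst for two classes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy: -sum(p_i * log2(p_i)); 0 when pure, 1 at worst for two classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, subsets):
    """Parent entropy minus the size-weighted entropy of the child subsets (ID3's criterion)."""
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)

melons = ["good"] * 3 + ["bad"] * 3      # half good, half bad: maximal uncertainty
print(gini(melons))                      # 0.5
print(entropy(melons))                   # 1.0
# a split that perfectly separates the classes recovers all the entropy
print(information_gain(melons, [["good"] * 3, ["bad"] * 3]))  # 1.0
```

A low value from either measure means the split leaves little uncertainty, which is exactly what the greedy attribute choice looks for.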

III. How it predicts the result

Take a decision tree used to predict whether an account is fake or real as an example.

(figure: the decision tree used to predict fake vs. real accounts)

  • root: separate the account data based on the number of followers, which means

    • if the feature value #followers <= 92
      :arrow_right: the account is possibly real
    • else :arrow_right: it is possibly fake
  • Notice that I use the word possibly here, because the prediction is made at a leaf node. Before reaching a leaf node, we cannot confirm which class a sample belongs to.

  • how:

    • the data goes down the tree by comparing its feature value to the threshold at each node, deciding which branch to take, until it reaches a leaf node.
    • The leaf node is then the predicted result for that data
  • predict: we use this data to test

    | profile pic | nums/length username | fullname words | nums/length fullname | name==username | description length | external URL | private | #posts | #followers | #follows | fake |
    |---|---|---|---|---|---|---|---|---|---|---|---|
    | 1 | 0.27 | 0 | 0 | 0 | 53 | 0 | 0 | 32 | 1000 | 955 | 0 |
  • :tada: it will predict the account is fake!
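The traversal described above can be sketched with a nested dict. The single split on #followers at 92 follows the example; the node layout itself is my own assumption, not the figure's exact structure.

```python
# Internal nodes hold a feature name, a threshold, and two subtrees; leaves are class labels.
tree = {
    "feature": "#followers", "threshold": 92,
    "left": "real",   # #followers <= 92 -> possibly real
    "right": "fake",  # #followers  > 92 -> possibly fake
}

def predict(node, sample):
    """Walk down the tree, comparing the sample's feature value to each node's threshold."""
    while isinstance(node, dict):
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node

print(predict(tree, {"#followers": 1000}))  # fake
```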


IV. Why decision tree?

  1. breaks down complex data into more manageable parts
  2. the decision process is explainable, compared to deep learning

:link: complete introduction to decision tree (with source code)



Implement in Python

Source code and explanation are provided here
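As a rough sketch of the from-scratch approach (not the linked source code): greedily pick the (feature, threshold) pair with the lowest weighted Gini impurity, as in CART's binary splits, and recurse until each node is pure. All names here are my own.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every (feature, threshold) pair; return (score, feature, threshold) with lowest weighted Gini."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # split must actually separate the data
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build(rows, labels):
    """Recursively split until a node is pure (no stopping criteria or pruning here)."""
    if len(set(labels)) == 1:
        return labels[0]                       # pure leaf
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # no usable split: majority class
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {"feature": f, "threshold": t,
            "left": build(*map(list, zip(*left))),
            "right": build(*map(list, zip(*right)))}

# Toy data echoing the fake-account example: one feature (#followers).
tree = build([[50], [92], [1000], [955]], ["real", "real", "fake", "fake"])
print(tree)  # {'feature': 0, 'threshold': 92, 'left': 'real', 'right': 'fake'}
```

A real implementation would add a maximum depth, a minimum samples-per-node threshold, and pruning to avoid the overfitting discussed above.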



Reference

:link: IT鐵人賽 [Day 12] 決策樹 (Decision tree): a detailed explanation of what the parameters mean

:link: 資料視覺化之 Decision tree (決策樹)範例與 Machine Learning (機器學習) 概念簡單教學(入門): a detailed walkthrough using the sklearn package, from preparing the data to visualization

:link: Graphviz download

:link: Python中defaultdict的用法分析详解

:link: 資料視覺化之 contour ( matplotlib.pyplot ) 教學與用法: very useful for visualizing how the data is classified

:link: 决策树的手工实现与可视化(Python版本): explains information gain in detail

:link: Visualizing the Third Numeric Variable in Python: plot for three variables

:link: How to program a decision tree in Python from 0: the author is Spanish, so the strange-looking variable names are Spanish; everything works except the last function, classify_data()

:link: 基尼指数在机器学习中的奥秘: a detailed explanation of the math behind Gini and Entropy, with clear illustrated examples

:link: 決策樹(Decision Tree)常見的三種算法(ID3、C4.5、CART): gives the formulas for all three, plus worked examples under each

:link: 【机器学习】决策树(上)——ID3、C4.5、CART(非常详细)

:link: 机器学习入门 12-4 基尼系数

:link: Visualize a Decision Tree in 4 Ways with Scikit-Learn and Python

:link: pydotplus使用