###### `ML` `Decision Tree` `Visualization`

# Implement Decision Tree from scratch

## What is a Decision Tree

### I. Structure

![image](https://hackmd.io/_uploads/rk_0ORLSa.png)

### II. How to build the decision tree

1. Algorithm: learn from the dataset with a greedy divide-and-conquer strategy:

    ![image](https://hackmd.io/_uploads/H1YjQkwrp.png)

    - always test the **most important attribute** first
    - then recursively solve the smaller subproblems defined by the possible outcomes of the test

2. **Most important attribute**
    - the most important attribute in this case: **type**

      ![image](https://hackmd.io/_uploads/SyoXf1vBT.png)
    - it is the attribute that makes the most difference to the classification of an example
    - we hope to reach the correct classification with a small number of tests
    - that way, all **paths in the tree will be short** and **the tree as a whole will be shallow**

3. Performance and learning curve

    **With more training examples, the decision tree becomes more accurate.**

    ![image](https://hackmd.io/_uploads/ryIr4yDr6.png)

4. **How to choose the most important attribute**
    - In short, an algorithm first quantifies each feature, and those numbers are then used to construct every node of the decision tree. The three most common algorithms are ID3, C4.5, and CART, which rely on computations of **Gini impurity** and **entropy**.
    - **Gini impurity**

      ![image](https://hackmd.io/_uploads/Hkc6kzQH6.png)
        - the larger the computed Gini value, the greater the uncertainty of the variable; see [this illustrated example](https://zhuanlan.zhihu.com/p/545157674)
        - maximum 0.5 (for two classes), minimum 0
    - **Entropy**

      ![image](https://hackmd.io/_uploads/H1FjkMXBp.png)
        - the larger the entropy, the greater the uncertainty of the variable; see [this illustrated example](https://blog.csdn.net/qq_42308287/article/details/123963052)
        - Ex: when good and bad melons each make up half of the data, entropy is at its maximum (1); when all melons are good, or all are bad, entropy is at its minimum (0)
    - **Gini vs. Entropy**
        - once we understand Gini and entropy, we can see that if **the entropy or Gini computed for a feature is too high**, then **classifying the data by that feature leaves a lot of uncertainty**
        - conversely, a low value means high certainty, i.e., we can tell fairly accurately which class an example belongs to
    - **Goal**:
        - **find the feature with the lowest entropy or Gini to split on as the branches grow**, so the decision tree does not get too deep and the predictions at the leaf nodes are more accurate (see the sketches at the end of this section)
    - The three algorithms, ID3, C4.5, and CART, compared:

      |Algorithm|Model|Structure|Attribute selection|Continuous values|Missing data|Pruning|Feature reuse|
      |-|-|-|-|-|-|-|-|
      |ID3|classification|multiway tree|information gain|x|x|x|x|
      |C4.5|classification|multiway tree|information gain ratio|v|v|v|v|
      |CART|classification,<br/>regression tree|binary tree|Gini impurity,<br/>mean squared error|v|v|v|v|

        - [see this worked example](https://roger010620.medium.com/%E6%B1%BA%E7%AD%96%E6%A8%B9-decision-tree-%E5%B8%B8%E8%A6%8B%E7%9A%84%E4%B8%89%E7%A8%AE%E7%AE%97%E6%B3%95-id3-c4-5-cart-54091ca85044)
    - **ID3**

      ![image](https://hackmd.io/_uploads/SJARvxPHa.png)
        - ID3 may overfit, since it has no pruning strategy
        - can only process discrete features
        - does not consider missing data
        - information gain is biased toward features with many distinct values
        - the simplest of the three: useful when the decision tree is not complex
    - **C4.5**

      ![image](https://hackmd.io/_uploads/r1M7qePBT.png)
        - C4.5 uses a multiway tree, though a binary tree would be more efficient
        - can only do classification
        - computing entropy is relatively expensive (it involves logarithms)
        - the information gain ratio is biased toward features with fewer possible values (the smaller the denominator, the larger the ratio), so C4.5 does not split directly on the feature with the largest gain ratio; instead it uses a **heuristic**: first find the candidate features whose information gain is above average, then pick the one with the highest gain ratio among them
    - **CART**

      ![image](https://hackmd.io/_uploads/rJg-6ZPS6.png)
        - CART uses a binary tree -> **fast**
        - supports both classification and regression
        - uses Gini impurity -> **takes less time to compute**
        - [【机器学习】决策树(上)——ID3、C4.5、CART(非常详细)](https://zhuanlan.zhihu.com/p/85731206)
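To make the two impurity metrics concrete, here is a minimal sketch in Python. It assumes class labels arrive as a plain list; the names `gini_impurity` and `entropy` are illustrative, not taken from the article's repo:

```python
from collections import Counter
from math import log2

def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2), where p_i is the proportion of class i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)), where p_i is the proportion of class i."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

# The melon example above: a 50/50 class mix maximizes both metrics,
# and a pure node minimizes both.
print(gini_impurity(["good", "bad"]))   # 0.5 (maximum for two classes)
print(entropy(["good", "bad"]))         # 1.0 (maximum for two classes)
print(gini_impurity(["good", "good"]))  # 0.0 (pure node)
```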
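Building on those helpers, the "most important attribute" is the feature/threshold pair whose binary split minimizes the size-weighted impurity of the two children (equivalently, maximizes information gain). A minimal CART-style sketch, assuming the data is a list of numeric feature rows with a parallel label list; `best_split` and its return convention are our own choices:

```python
def weighted_impurity(left_labels, right_labels, impurity=gini_impurity):
    """Impurity of a split = size-weighted average impurity of the children."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * impurity(left_labels)
            + len(right_labels) / n * impurity(right_labels))

def best_split(rows, labels, impurity=gini_impurity):
    """Return (feature_index, threshold) of the lowest-impurity binary split."""
    best_feature, best_threshold = None, None
    best_score = float("inf")
    for feature in range(len(rows[0])):
        # Candidate thresholds: every distinct value this feature takes.
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # degenerate split: skip
            score = weighted_impurity(left, right, impurity)
            if score < best_score:
                best_score, best_feature, best_threshold = score, feature, threshold
    return best_feature, best_threshold
```

Only the scoring function would change for the other two algorithms: C4.5 ranks candidates by information gain ratio, ID3 by raw information gain over discrete values.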
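With a splitter in hand, the greedy divide-and-conquer loop from the start of this section, and the root-to-leaf traversal described in Section III below, each fit in a few lines. A bare sketch with no pruning, reusing `Counter`, `gini_impurity`, and `best_split` from the sketches above and storing nodes as plain dictionaries:

```python
def build_tree(rows, labels, max_depth=5, impurity=gini_impurity):
    """Greedy divide and conquer: split on the best attribute, then recurse."""
    # Stop at a pure node or when the depth budget is spent:
    # emit a leaf holding the majority class.
    if max_depth == 0 or len(set(labels)) == 1:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    feature, threshold = best_split(rows, labels, impurity)
    if feature is None:  # no useful split exists (e.g. duplicate rows)
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
    right = [i for i, row in enumerate(rows) if row[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           max_depth - 1, impurity),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            max_depth - 1, impurity),
    }

def predict(node, row):
    """Walk from the root, comparing feature values to thresholds, to a leaf."""
    while "leaf" not in node:
        side = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[side]
    return node["leaf"]

# Toy data: two numeric features, binary labels.
rows = [[2.0, 1.0], [3.0, 1.5], [10.0, 2.0], [11.0, 2.5]]
labels = ["real", "real", "fake", "fake"]
tree = build_tree(rows, labels)
print(predict(tree, [2.5, 1.2]))  # -> "real"
```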
<br/><br/>

### III. How it works to predict the results

Take a decision tree used to **predict fake or real accounts** as an example:

![image](https://hackmd.io/_uploads/HydxkMPr6.png)

- root: separates the account data based on the **number of followers**, which means
    - if the value of the account's **`#followers`** feature is <= 92 :arrow_right: the account is possibly real
    - else :arrow_right: it is possibly fake
    - Notice that I use the word **possibly** here because the prediction is only made at a leaf node. Before arriving at a leaf node, the class of the data cannot be confirmed.
- how:
    - the data travels down the tree by comparing its feature values to the threshold at each node, deciding which branch to take, until it arrives at a leaf node (exactly the loop in the `predict` sketch above)
    - the leaf node gives the predicted result for the data
- predict: we use this data to test

    | profile pic | nums/length username | fullname words |nums/length fullname|name==username|description length|external URL|private|#posts|#followers|#follows|fake|
    | ---- | ---- | ---- |---- |---- | ---- |----|----|----|----|----|----|
    | 1 |0.27| 0 |0 |0 |53 |0 |0 |32 |1000| 955| 0|

    - :tada: it predicts the account is fake!

    ![image](https://hackmd.io/_uploads/HJKOezPS6.png)

### IV. Why decision trees?

1. They break complex data down into more manageable parts.
2. The decision process is explicable, compared to deep learning.

:link: [complete introduction to decision tree (with source code)](https://github.com/michaeldorner/DecisionTrees/tree/master)

<br/><br/>

## Implement in Python

#### Source code and explanation provided [here](https://github.com/Hlunlun/Decision-Tree-from-scratch)

<br/><br/>

## Reference

:link: [IT鐵人賽 [Day 12] 決策樹 (Decision tree)](https://ithelp.ithome.com.tw/articles/10271143): explains in detail what the parameters mean
:link: [資料視覺化之 Decision tree (決策樹)範例與 Machine Learning (機器學習) 概念簡單教學(入門)](https://tree.rocks/decision-tree-graphviz-contour-with-pandas-gen-train-test-dataset-for-beginner-9137b7c8416a): a detailed sklearn walkthrough, from preparing the data to visualization
:link: [Graphviz download](https://graphviz.org/download/)
:link: [Python中defaultdict的用法分析详解](https://zhuanlan.zhihu.com/p/504744465)
:link: [資料視覺化之 contour ( matplotlib.pyplot ) 教學與用法](https://tree.rocks/python-matplotlib-plt-contour-or-contourf-tutorial-with-sklearn-e4497f76280): very useful for visualizing how the data is classified
:link: [决策树的手工实现与可视化(Python版本)](https://blog.csdn.net/qq_42308287/article/details/123963052): explains information gain in detail
:link: [Visualizing the Third Numeric Variable in Python](https://betterprogramming.pub/visualizing-the-third-numeric-variable-in-python-eaf73f42219): plots for three variables
:link: [How to program a decision tree in Python from 0](https://anderfernandez.com/en/blog/code-decision-tree-python-from-scratch/#How-to-choose-the-cuts-for-our-decision-tree): the author is Spanish, so the odd-looking variable names are Spanish; everything works except the last function, `classify_data()`
:link: [基尼指数在机器学习中的奥秘](https://zhuanlan.zhihu.com/p/545157674): explains the math behind Gini and entropy in detail, with very clear illustrated examples
:link: [決策樹(Decision Tree)常見的三種算法(ID3、C4.5、CART)](https://roger010620.medium.com/%E6%B1%BA%E7%AD%96%E6%A8%B9-decision-tree-%E5%B8%B8%E8%A6%8B%E7%9A%84%E4%B8%89%E7%A8%AE%E7%AE%97%E6%B3%95-id3-c4-5-cart-54091ca85044): gives the formulas for all three algorithms, with worked examples under each
:link: [【机器学习】决策树(上)——ID3、C4.5、CART(非常详细)](https://zhuanlan.zhihu.com/p/85731206)
:link: [机器学习入门 12-4 基尼系数](https://cloud.tencent.com/developer/article/1774905)
:link: [Visualize a Decision Tree in 4 Ways with Scikit-Learn and Python](https://mljar.com/blog/visualize-decision-tree/)
:link: [pydotplus使用](https://www.cnblogs.com/qizhou/p/12743708.html)