###### `ML` `Decision Tree` `Visualization`
# Implement a Decision Tree from Scratch
## What is a Decision Tree?
### I. Structure

### II. How to build the decision tree
1. Algorithm: learn from the dataset with a greedy divide-and-conquer strategy:

- always test the **most important attribute** first
- then recursively solve the smaller subproblems defined by the possible outcomes of that test
2. **most important attribute**
- the most important attribute in this case: **type**

- it is the attribute that makes the most difference to the classification of an example
- the hope is to reach the correct classification with a small number of tests
- so all **paths in the tree will be short** and **the tree as a whole will be shallow**
3. Performance and learning curve
**with more training examples (data) -> the decision tree becomes more accurate**

4. **how to choose the most important attribute**
- in short, an algorithm first quantifies each feature, and those values are then used to build every node of the decision tree; the three most common algorithms are ID3, C4.5, and CART, which rely on **Gini impurity** and **entropy** calculations
- **Gini impurity**

- the larger the Gini value, the higher the uncertainty of the variable; see [this illustrated example](https://zhuanlan.zhihu.com/p/545157674)
- maximum 0.5 (for two classes), minimum 0
- **Entropy**

- the larger the entropy, the higher the uncertainty of the variable; see [this illustrated example](https://blog.csdn.net/qq_42308287/article/details/123963052)
- Ex: when good and bad melons each make up half of the data, entropy is at its maximum (1); when all melons are good or all are bad, entropy is at its minimum (0)
- **Gini VS. Entropy**
- once Gini and entropy are understood, it is clear that if **a feature's entropy or Gini is too high**, then **splitting this group of data on that feature leaves a lot of uncertainty**;
- conversely, a low value means high certainty, i.e. we can tell fairly accurately which class a sample belongs to
- **Goal**:
- **find the feature with the lowest entropy or Gini to split on as the tree grows** (a Python sketch of these calculations follows right after this list), so the tree does not grow too deep and the predictions at the leaf nodes are more accurate
- three commonly used algorithms:
- ID3, C4.5, CART
|Algorithm|model|structure|attribute selection|continuous values|missing data|pruning|multiple use of features|
|-|-|-|-|-|-|-|-|
|ID3|classification|multiway tree|information gain|x|x|x|x|
|C4.5|classification|multiway tree|information gain ratio|v|v|v|v|
|CART|classification, <br/> regression|binary tree|Gini impurity, <br/> mean squared error|v|v|v|v|
- [可以參考範例](https://roger010620.medium.com/%E6%B1%BA%E7%AD%96%E6%A8%B9-decision-tree-%E5%B8%B8%E8%A6%8B%E7%9A%84%E4%B8%89%E7%A8%AE%E7%AE%97%E6%B3%95-id3-c4-5-cart-54091ca85044)
- **ID3**

- ID3 may overfit since it has no pruning strategy
- can only handle discrete features
- does not handle missing data
- information gain is biased toward features with many distinct values
- the simplest of the three: useful when the decision tree does not need to be complex
- **C4.5**

- C4.5 uses a multiway tree, but a binary tree would be more efficient
- can only do classification
- computing entropy (with logarithms) is relatively expensive
- the information gain ratio favors features with fewer possible values (the smaller the denominator, the larger the overall value), so C4.5 does not simply split on the feature with the largest gain ratio; instead it uses a **heuristic**: first keep the candidate features whose information gain is above average, then pick the one with the highest gain ratio among them
- **CART**

- CART uses a binary tree -> **fast**
- supports both classification and regression
- uses Gini impurity -> **cheaper to compute** (no logarithms)
- [【机器学习】决策树(上)——ID3、C4.5、CART(非常详细)](https://zhuanlan.zhihu.com/p/85731206)
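
To make the Gini and entropy discussion above concrete, here is a minimal Python sketch; the function names, the toy melon labels, and the `weighted_impurity` helper are illustrative assumptions, not taken from any particular library:

```python
import numpy as np

def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2): 0 for a pure node, at most 0.5 for two classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)): 0 for a pure node, at most 1 for two classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_impurity(feature_values, labels, threshold, measure=gini_impurity):
    """Impurity of the binary split `feature <= threshold`, weighted by branch size."""
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    if len(left) == 0 or len(right) == 0:
        return measure(labels)  # the threshold separates nothing; impurity unchanged
    n = len(labels)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

# toy labels: half good melons (1), half bad melons (0) -> maximum uncertainty
labels = np.array([1, 1, 0, 0])
print(gini_impurity(labels))  # 0.5
print(entropy(labels))        # 1.0
```

The feature (and threshold) whose split gives the lowest weighted impurity is the one chosen to grow the tree, which is exactly the goal stated above.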
<br/><br/>
### III. How the tree predicts a result
Take a decision tree that **predicts whether an account is fake or real** as an example.

- root: separate the account data based on the **number of followers**, which means
- if the account's **`#followers`** feature value is <= 92
:arrow_right: the account is likely real
- else :arrow_right: it is likely fake
- Notice that I say **likely** here because the prediction is only made at a leaf node; before a sample reaches a leaf node, its class cannot be confirmed.
- how:
- a sample travels down the tree by comparing its feature values against each node's threshold to decide which branch to take, until it reaches a leaf node (see the traversal sketch after this example)
- the leaf node holds the predicted result for that sample
- predict: we test the tree with this sample
| profile pic | nums/length username | fullname words |nums/length fullname|name==username|description length|external URL|private|#posts|#followers|#follows|fake|
| ---- | ---- | ---- |---- |---- | ---- |----|----|----|----|----|----|
| 1 |0.27| 0 |0 |0 |53 |0 |0 |32 |1000| 955| 0|
- :tada: the tree predicts that this account is fake!
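
A minimal sketch of the walk just described; only the root's `#followers <= 92` test comes from the example above, while the dictionary node layout and the deeper `#posts` test are made-up assumptions purely to illustrate the traversal:

```python
def predict(node, sample):
    """Walk from the root to a leaf; internal nodes are dicts, leaves are class labels."""
    while isinstance(node, dict):
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node

# Hypothetical tree: the root test mirrors the example above, the rest is invented.
tree = {
    "feature": "#followers", "threshold": 92,
    "left": "real",                                   # few followers -> likely real
    "right": {"feature": "#posts", "threshold": 100,  # made-up second test
              "left": "fake",
              "right": "real"},
}

sample = {"#followers": 1000, "#posts": 32}
print(predict(tree, sample))  # fake
```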

### IV. Why decision tree?
1. breaks complex data down into more manageable parts
2. the decision process is explainable, unlike deep learning
:link: [complete introduction to decision tree (with source code)](https://github.com/michaeldorner/DecisionTrees/tree/master)
<br/><br/>
## Implement in Python
#### Source code and explanation are provided [here](https://github.com/Hlunlun/Decision-Tree-from-scratch)
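
As a rough idea of the overall flow only (this is a hedged sketch, not the repository's actual code or API; the `best_split` search here naively tries each feature's median as the threshold), a recursive CART-style builder can look like this:

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Try each feature's median as a threshold; keep the split with the lowest weighted Gini."""
    best = None
    for j in range(X.shape[1]):
        t = np.median(X[:, j])
        mask = X[:, j] <= t
        if mask.all() or not mask.any():
            continue  # this threshold does not separate the data
        score = mask.mean() * gini(y[mask]) + (~mask).mean() * gini(y[~mask])
        if best is None or score < best[0]:
            best = (score, j, t, mask)
    return best

def majority_class(y):
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]

def build_tree(X, y, depth=0, max_depth=5):
    """Greedy divide and conquer: split on the best feature, then recurse on each branch."""
    if len(np.unique(y)) == 1 or depth == max_depth:
        return majority_class(y)   # pure node or depth limit -> leaf
    split = best_split(X, y)
    if split is None:
        return majority_class(y)   # nothing left to split on -> leaf
    _, feature, threshold, mask = split
    return {"feature": feature, "threshold": threshold,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth)}
```

Calling `build_tree(X, y)` returns a nested dictionary that the `predict()` sketch from section III can walk.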
<br/><br/>
## Reference
:link: [IT鐵人賽 [Day 12] 決策樹 (Decision tree)](https://ithelp.ithome.com.tw/articles/10271143): explains in detail what each parameter means
:link: [資料視覺化之 Decision tree (決策樹)範例與 Machine Learning (機器學習) 概念簡單教學(入門)](https://tree.rocks/decision-tree-graphviz-contour-with-pandas-gen-train-test-dataset-for-beginner-9137b7c8416a): a detailed walkthrough with the sklearn package, from data preparation to visualization
:link: [Graphviz download](https://graphviz.org/download/)
:link: [Python中defaultdict的用法分析详解](https://zhuanlan.zhihu.com/p/504744465)
:link: [資料視覺化之 contour ( matplotlib.pyplot ) 教學與用法](https://tree.rocks/python-matplotlib-plt-contour-or-contourf-tutorial-with-sklearn-e4497f76280): very useful for visualizing how the data is classified
:link: [决策树的手工实现与可视化(Python版本)](https://blog.csdn.net/qq_42308287/article/details/123963052): explains information gain in detail
:link: [Visualizing the Third Numeric Variable in Python](https://betterprogramming.pub/visualizing-the-third-numeric-variable-in-python-eaf73f42219): plot for three variables
:link: [How to program a decision tree in Python from 0](https://anderfernandez.com/en/blog/code-decision-tree-python-from-scratch/#How-to-choose-the-cuts-for-our-decision-tree): the author is Spanish, so the odd-looking variable names are Spanish; everything works except the last `classify_data()`
:link: [基尼指数在机器学习中的奥秘](https://zhuanlan.zhihu.com/p/545157674): a very clear, detailed explanation of the math behind Gini and entropy, with illustrated examples
:link: [決策樹(Decision Tree)常見的三種算法(ID3、C4.5、CART)](https://roger010620.medium.com/%E6%B1%BA%E7%AD%96%E6%A8%B9-decision-tree-%E5%B8%B8%E8%A6%8B%E7%9A%84%E4%B8%89%E7%A8%AE%E7%AE%97%E6%B3%95-id3-c4-5-cart-54091ca85044): gives the formulas for all three algorithms, with worked examples under each
:link: [【机器学习】决策树(上)——ID3、C4.5、CART(非常详细)](https://zhuanlan.zhihu.com/p/85731206)
:link: [机器学习入门 12-4 基尼系数](https://cloud.tencent.com/developer/article/1774905)
:link: [Visualize a Decision Tree in 4 Ways with Scikit-Learn and Python](https://mljar.com/blog/visualize-decision-tree/)
:link: [pydotplus使用](https://www.cnblogs.com/qizhou/p/12743708.html)