# Machine Learning Homework 5: Decision Tree
Department : Forestry
Student : M10912004
Name : De-Kai Kao (高得愷)
## Python
### Import
```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 1000)
```
### Plot
```
x1 = np.array([8, 9, 11, -11, -10, -10, 8, 9, 11, -11, -10, -10])
x2 = np.array([8, 9, 11, -11, -10, -11, 0, 0, -1, 8, 9, 11])
label = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])
# assemble the sample points into a DataFrame
df = pd.DataFrame({'x1': x1, 'x2': x2, 'label': label})
plt.scatter(df.x1, df.x2, c=label, marker='*')
```

### MSE
* MSE (mean squared error) $\quad MSE = \frac{1}{N}\sum^N_{i=1}(y_i-\hat{y_i})^2$
* SD (standard deviation) $\quad\;\; SD=\sqrt{\frac{\sum^N_{i=1}(y_i-\bar{y})^2}{N}}$

When the prediction $\hat{y}$ is the sample mean, the MSE equals the variance, so MSE and SD differ only by a square root. Both can be computed with numpy and pandas functions, but note that their defaults differ: pandas' `.std()` uses a denominator of $n-1$ (ddof=1), which makes sense since pandas is mainly used for DataFrame/sample handling, while numpy's `.std()` uses $n$ (ddof=0), in line with its scientific-computing focus.
```
print(x1.std())           # numpy: population SD by default (ddof=0)
print(df.x1.std(ddof=0))  # pandas: match numpy by forcing ddof=0
```
$\quad9.878427675158296$
$\quad9.878427675158296$
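For comparison, pandas' default `ddof=1` (denominator $n-1$) gives a slightly larger value, roughly $9.8784\times\sqrt{12/11}\approx10.318$ for these 12 samples:
```
print(df.x1.std())  # pandas default: ddof=1, denominator n-1
```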
## Main
* Compute the Gini impurity
$$Gini(y)=\sum^k_{i=1}p(y_i)(1-p(y_i))=1-\sum^k_{i=1}p(y_i)^2$$
```
def calculate_gini_impurity(df):
    labels = np.array(df.label)
    total_samples = len(labels)
    unique_labels = set(labels)
    gini_impurity = 1.0
    for label in unique_labels:
        # subtract the squared probability of each class
        label_probability = np.count_nonzero(labels == label) / total_samples
        gini_impurity -= label_probability ** 2
    return gini_impurity
```
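As a sanity check on the sample data above, a balanced two-class set (six points per label) gives
$$Gini=1-\left(\left(\tfrac{6}{12}\right)^2+\left(\tfrac{6}{12}\right)^2\right)=0.5$$
```
calculate_gini_impurity(df)  # 0.5 for the balanced 6/6 sample above
```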
* Compute the entropy
$$
H(y)=-\sum^k_{i=1}p(y_i)\log_2 p(y_i)
$$
```
import math

def calculate_entropy(df):
    labels = np.array(df.label)
    total_samples = len(labels)
    unique_labels = set(labels)
    entropy = 0.0
    for label in unique_labels:
        label_probability = np.count_nonzero(labels == label) / total_samples
        entropy -= label_probability * math.log2(label_probability)
    return entropy

calculate_entropy(df)
```
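Likewise, a balanced two-class set reaches the maximum entropy of one bit, which is what the call above should return on the sample data:
$$H=-\left(\tfrac{1}{2}\log_2\tfrac{1}{2}+\tfrac{1}{2}\log_2\tfrac{1}{2}\right)=1.0$$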
* Compare the MSE of x1 (`df.columns[0]`) and x2 (`df.columns[1]`): output 0 if x1's is larger, 1 if x2's is larger
```
def mse(df):
    # pick the column index with the larger variance (SD squared, ddof=0)
    if df[df.columns[0]].std(ddof=0)**2 >= df[df.columns[1]].std(ddof=0)**2:
        return 0
    else:
        return 1
#mse(df)
```
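On the sample data, x1's variance (about 97.6 by my own calculation) exceeds x2's (about 69.2), so the choice falls on x1:
```
mse(df)  # 0: var(x1) ≈ 97.6 >= var(x2) ≈ 69.2, so split on x1
```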
* Split the DataFrame into two at the mean of the column selected by `mse`
```
def node(df):
    dfcol = df.columns[mse(df)]
    print('df[{}].mse: '.format(dfcol), round(df[dfcol].std(ddof=0)**2, 3))
    print(df)
    print('- Gini:', calculate_gini_impurity(df))
    # split at the column mean: node_0 takes values >= mean, node_1 the rest
    node_0 = df[df[dfcol] >= df[dfcol].mean()]
    node_1 = df.loc[df.index.difference(node_0.index)]
    return [node_0, node_1]
#node(df)[1]
```
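On the sample data x1 is chosen and its mean is $-0.5$, so the first split should put the six rows with $x1 \in \{8, 9, 11\}$ in `node_0` and the six with $x1 \in \{-11, -10\}$ in `node_1`:
```
node_0, node_1 = node(df)  # node_0: x1 >= -0.5, node_1: x1 < -0.5
```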
* Check whether the labels are uniform; if not, split again, but only when the MSE exceeds a threshold value, otherwise output the DataFrame directly
```
def check_label_unique(df):
    mse_threshold = 0.05
    dfcol = df.columns[mse(df)]
    if len(df[df.columns[-1]].unique()) != 1:
        # labels are mixed: split only if the variance is above the threshold
        if df[dfcol].std(ddof=0)**2 < mse_threshold:
            print('df[{}].mse:'.format(dfcol), round(df[dfcol].std(ddof=0)**2, 3))
            return df
        return node(df)
    else:
        # all labels identical: this is a leaf
        return df
#check_label_unique(df)
```
* Information gain (IG) needs both the pre-split and post-split sample sets: $D_i$ is the $i$-th post-split subset, $N_p$ the number of parent samples, and $N_i$ the size of $D_i$,
$$
IG(Y)=H(Y)-\sum^k_{i=1}\frac{N_i}{N_p}H(D_i)
$$
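A minimal sketch of this formula as a helper (`information_gain` is my own hypothetical name here, reusing `calculate_entropy` from above):
```
def information_gain(parent_df, child_dfs):
    # IG = H(parent) - sum_i (N_i / N_p) * H(D_i)
    n_parent = len(parent_df)
    weighted_child_entropy = sum(
        (len(child) / n_parent) * calculate_entropy(child)
        for child in child_dfs)
    return calculate_entropy(parent_df) - weighted_child_entropy

#information_gain(df, node(df))
```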
* Run `check_label_unique`; whenever the received result is a list, **recurse into `decision_tree` itself** on each subset while computing the information gain (IG), until every received result is a DataFrame
```
def decision_tree(df):
    c = check_label_unique(df)
    ce = calculate_entropy(df)
    if type(c) == list:
        for i in c:
            # parent entropy minus this child's weighted entropy
            print('- IG', ce - (len(i.label)/len(df.label))*calculate_entropy(i))
            decision_tree(i)
    else:
        print(c)
        print(c.shape)
        print('- Gini:', calculate_gini_impurity(df))
        print('- entropy:', calculate_entropy(df))

decision_tree(df)
```
### Plot decision line
```
plt.scatter(df.x1, df.x2, c=label, marker='*')
plt.xlabel('x1')
plt.ylabel('x2')
plt.vlines(x=-0.5,
           ymin=np.nanmin(np.array([x1, x2])),
           ymax=np.nanmax(np.array([x1, x2])),
           linewidth=2, color='r')
plt.hlines(y=4.5,
           xmin=0,
           xmax=np.nanmax(np.array([x1, x2])),
           linewidth=2, color='r')
plt.hlines(y=-0.66,
           xmin=np.nanmin(np.array([x1, x2])),
           xmax=-1,
           linewidth=2, color='r')
```

## Sklearn
* Here `df` is a new randomly generated dataset with columns `x`, `y`, and `label`
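The exact random data is not reproduced here; a minimal sketch of how such a `df` might be generated (the column names `x`, `y`, `label` and the labeling rule are assumptions to match the code below):
```
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(0, 5, 100),   # random feature 1
                   'y': rng.normal(0, 5, 100)})  # random feature 2
df['label'] = (df.x + df.y > 0).astype(int)      # arbitrary labeling rule
```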
```
from sklearn import tree
X, y = df[['x', 'y']], df.label
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
```
```
tree.plot_tree(clf)
```

```
import graphviz
# dot_data = tree.export_graphviz(clf, out_file=None)
# graph = graphviz.Source(dot_data)
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=X.columns,
                                class_names=[str(c) for c in clf.classes_],
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph
```

* Discussion:
The decision tree's algorithm is easier to follow than SVM's. The sklearn package presumably decides whether to split based on the Gini value; the part I was unsure about is how the package produces the "value" at which x and y are separated into two classes. Through my own implementation, I worked out the scheme of choosing the column by the larger MSE and using that column's mean as the split point.
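On that open question: sklearn's CART implementation scans candidate thresholds (midpoints between consecutive sorted feature values) and keeps the one minimizing the weighted Gini impurity. The chosen thresholds can be read back from the fitted tree; a minimal sketch:
```
# each internal node stores the feature index and threshold sklearn chose;
# leaf nodes are marked with feature = -2
for feature, threshold in zip(clf.tree_.feature, clf.tree_.threshold):
    if feature >= 0:
        print('split', X.columns[feature], 'at', round(threshold, 3))
```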