# Machine Learning Homework 5: Decision Tree

Department: Forestry
Student ID: M10912004
Name: De-Kai Kao (高得愷)

## Python

### Import

```
import math  # used by calculate_entropy below

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 1000)
```

### Plot

```
x1 = np.array([8, 9, 11, -11, -10, -10, 8, 9, 11, -11, -10, -10])
x2 = np.array([8, 9, 11, -11, -10, -11, 0, 0, -1, 8, 9, 11])
label = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])

# Stack the three arrays into rows of (x1, x2, label) and build the DataFrame
d = np.dstack([x1, x2, label])
df = pd.DataFrame(d[0], columns=['x1', 'x2', 'label'])
plt.scatter(df.x1, df.x2, c=label, marker='*')
```

![](https://hackmd.io/_uploads/S1AfyTmr2.png)

### MSE

* MSE (mean squared error): $MSE = \frac{1}{N}\sum^N_{i=1}(y_i-\hat{y}_i)^2$
* SD (standard deviation): $SD=\sqrt{\frac{1}{N}\sum^N_{i=1}(y_i-\bar{y})^2}$

When the prediction is simply the column mean ($\hat{y}_i=\bar{y}$), MSE and SD differ only by a square root, so both can be computed with the numpy and pandas built-ins. Note, however, that the two libraries' default standard deviations differ: pandas uses an $n-1$ denominator by default, which is easy to understand because pandas is mainly used for DataFrame (sample) processing, while numpy is used more for scientific computing and divides by $n$.

```
print(x1.std())           # numpy: ddof defaults to 0 (divide by n)
print(df.x1.std(ddof=0))  # pandas: ddof defaults to 1, so pass ddof=0 to match
```

$\quad9.878427675158296$
$\quad9.878427675158296$

## Main

* Compute the Gini impurity:

$$Gini(y)=\sum^k_{i=1}p(y_i)(1-p(y_i))=1-\sum^k_{i=1}p(y_i)^2$$

```
def calculate_gini_impurity(df):
    labels = np.array(df.label)
    total_samples = len(labels)
    unique_labels = set(labels)
    gini_impurity = 1.0
    for label in unique_labels:
        label_probability = np.count_nonzero(labels == label) / total_samples
        gini_impurity -= label_probability ** 2
    return gini_impurity
```

* Compute the entropy:

$$H(y)=-\sum^k_{i=1}p(y_i)\log_2 p(y_i)$$

```
def calculate_entropy(df):
    labels = np.array(df.label)
    total_samples = len(labels)
    unique_labels = set(labels)
    entropy = 0.0
    for label in unique_labels:
        label_probability = np.count_nonzero(labels == label) / total_samples
        entropy -= label_probability * math.log2(label_probability)
    return entropy

calculate_entropy(df)
```

* Compare the MSE of x1 (`df.columns[0]`) and x2 (`df.columns[1]`): return 0 if x1's is larger, 1 if x2's is larger.

```
def mse(df):
    # Pick the feature column with the larger variance (population std squared)
    if df[df.columns[0]].std(ddof=0)**2 >= df[df.columns[1]].std(ddof=0)**2:
        return 0
    else:
        return 1

# mse(df)
```

* Split the DataFrame into two at the mean of the column selected by `mse`.

```
def node(df):
    dfcol = df.columns[mse(df)]
    print('df[{}].mse: '.format(dfcol), round(df[dfcol].std(ddof=0)**2, 3))
    print(df)
    print('- Gini:', calculate_gini_impurity(df))
    node_0 = df[df[dfcol] >= df[dfcol].mean()]
    node_1 = df.loc[df.index.difference(node_0.index)]
    return [node_0, node_1]

# node(df)[1]
```

* Check whether the label values are already uniform; if not, split once more, provided the MSE is above a threshold; otherwise return the DataFrame directly as a leaf.

```
def check_label_unique(df):
    mse_threshold = 0.05
    dfcol = df.columns[mse(df)]
    if len(df[df.columns[-1]].unique()) != 1:
        if df[dfcol].std(ddof=0)**2 < mse_threshold:
            print('df[{}].mse:'.format(dfcol), round(df[dfcol].std(ddof=0)**2, 3))
            return df
        return node(df)
    else:
        return df

# check_label_unique(df)
```

* Information gain (IG) needs the sample set before the split as well as the subsets after it; $D_i$ is the $i$-th subset after the split, $N_p$ the parent sample count, and $N_i$ the size of $D_i$:

$$IG(Y)=H(Y)-\sum^k_{i=1}\frac{N_i}{N_p}H(D_i)$$

* Run `check_label_unique`; when it returns a list, compute the information gain and let **`decision_tree` recurse into itself** on each child, until everything received is a plain DataFrame.

```
def decision_tree(df):
    c = check_label_unique(df)
    ce = calculate_entropy(df)
    if type(c) == list:
        # IG = parent entropy minus the weighted entropies of the two children
        ig = ce - sum(len(i.label) / len(df.label) * calculate_entropy(i) for i in c)
        print('- IG:', ig)
        for i in c:
            decision_tree(i)
    else:
        print(c)
        print(c.shape)
        print('- Gini:', calculate_gini_impurity(df))
        print('- entropy:', calculate_entropy(df))

decision_tree(df)
```
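As a quick sanity check of the functions above on this dataset (a sketch; it assumes the `df` built in the Plot section): the root holds six samples of each class, so $p(1)=p(2)=0.5$, giving $Gini=1-0.5^2-0.5^2=0.5$ and entropy $=1$ bit. The first split at x1's mean ($-0.5$) leaves three samples of each class on either side, so both children also have entropy 1 and the root's information gain is $1-(\frac{6}{12}\cdot1+\frac{6}{12}\cdot1)=0$.

```
# Verify the hand-worked values for the root node and its first split
print(calculate_gini_impurity(df))   # 0.5
print(calculate_entropy(df))         # 1.0
left, right = node(df)               # splits on x1 at its mean, -0.5
print(calculate_entropy(left), calculate_entropy(right))  # 1.0 1.0
```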
### Plot decision line

```
plt.scatter(df.x1, df.x2, c=label, marker='*')
plt.xlabel('x1')
plt.ylabel('x2')
plt.vlines(x=-0.5, ymin=np.nanmin(np.array([x1, x2])),
           ymax=np.nanmax(np.array([x1, x2])), linewidth=2, color='r')
plt.hlines(y=4.5, xmin=0, xmax=np.nanmax(np.array([x1, x2])), linewidth=2, color='r')
plt.hlines(y=-0.66, xmin=np.nanmin(np.array([x1, x2])), xmax=-1, linewidth=2, color='r')
```

![](https://hackmd.io/_uploads/H1eMhS482.png)

## Sklearn

* Here `df` is a separate, randomly generated dataset with feature columns `x` and `y`.

```
from sklearn import tree

X, y = df[['x', 'y']], df.label
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
```

```
tree.plot_tree(clf)
```

![](https://hackmd.io/_uploads/rJGGKeoB3.png)

```
import graphviz

# dot_data = tree.export_graphviz(clf, out_file=None)
# graph = graphviz.Source(dot_data)
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=X.columns,
                                class_names=clf.classes_.astype(str),  # sklearn's own (sorted) class order
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph
```

![](https://hackmd.io/_uploads/BywBFxsH2.png)

* Discussion: the decision tree's procedure is easier to follow than SVM's. The sklearn package presumably decides whether to split a node based on the Gini value; the part I am unsure about is how the package produces the "value" of x or y at which a node is divided into two classes. In my own implementation, worked out during this exercise, I chose the column by comparing MSE and used that column's mean as the split point.
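On that open question: CART-style trees typically enumerate candidate thresholds at the midpoints between consecutive sorted values of a feature and keep the split that minimises the weighted impurity of the two children. Below is a minimal sketch of that idea; the function `best_threshold` is illustrative only, not sklearn's API (its actual splitter is implemented in Cython and differs in detail).

```
import numpy as np

def gini(labels):
    # Gini impurity of one label array
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Return (threshold, weighted Gini) for the best binary split of x.

    Candidate thresholds are the midpoints between consecutive sorted
    unique values of x, the same candidates a CART-style splitter scans.
    """
    values = np.unique(x)                     # sorted unique feature values
    best_t, best_score = None, np.inf
    for t in (values[:-1] + values[1:]) / 2:  # midpoints between neighbours
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# e.g. best_threshold(df.x.to_numpy(), df.label.to_numpy()) on the random df above
```

Scanning every feature this way and keeping the lowest score answers "which column and which value" in a single step, whereas my implementation first fixes the column by MSE and then always cuts at its mean.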