# Introduction to Machine Learning Problem Set 4
# `yjw259`
For better readability and syntax coloring, see
https://hackmd.io/@y56/S1CYGmb_I
## 1.
### (a)
#### Print the (unweighted) classification accuracy of each of the decision stumps in the notebook.
```
m = 0 ; 0.66
m = 1 ; 0.5
m = 2 ; 0.66
m = 3 ; 0.5
m = 4 ; 0.66
m = 5 ; 0.66
m = 6 ; 0.5
m = 7 ; 0.66
m = 8 ; 0.5
m = 9 ; 0.62
m = 10 ; 0.5
m = 11 ; 0.64
m = 12 ; 0.5
m = 13 ; 0.66
m = 14 ; 0.5
```
#### What is the mean accuracy of the individual decision stumps?
```
0.5813333333333334
```
#### What is the accuracy of the ensemble?
It is `100%` after 15 iterations.
```
m = 0 ; 0.66
m = 1 ; 0.66
m = 2 ; 0.82
m = 3 ; 0.82
m = 4 ; 0.9
m = 5 ; 0.68
m = 6 ; 0.96
m = 7 ; 0.76
m = 8 ; 0.96
m = 9 ; 0.86
m = 10 ; 0.96
m = 11 ; 0.94
m = 12 ; 0.98
m = 13 ; 1.0
m = 14 ; 1.0
```
### (b)
#### Why is the ensemble more confident about the class of those eight samples after two iterations?
- After iteration-0 ($\alpha_0=0.66$), the ensemble's predicted value is `-0.66` in the region to the left of $x_1=0.558$ and `+0.66` to the right.
- After iteration-1 ($\alpha_1=0.49$), since the stump in this iteration predicts ***Red (+1)*** over the whole space, the ensemble's predicted value is `-0.17` to the left of $x_1=0.558$ and `+1.15` to the right.
- Intuitively, the eight points are predicted ***Red*** by both stumps, so the two weighted votes add up and the magnitude grows.
#### Why is the ensemble not as confident about any of the other samples after two iterations?
The other points receive opposite predictions from the two stumps, so the two weighted votes partially cancel and the ensemble's magnitude shrinks.
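As a sanity check, a minimal sketch (not part of the notebook) that adds the two weighted stump votes with the $\alpha$ values above and reproduces these margins:
```python=
# Combine the two stumps' votes with the alphas quoted above.
alphas = [0.66, 0.49]                  # alpha_0, alpha_1
stump0 = {"left": -1, "right": +1}     # stump 0 splits at x1 = 0.558
stump1 = {"left": +1, "right": +1}     # stump 1 predicts Red (+1) everywhere

for region in ("left", "right"):
    margin = alphas[0] * stump0[region] + alphas[1] * stump1[region]
    print(region, round(margin, 2))    # left -0.17, right 1.15
```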
### \(c\)
#### Why are the predictions different, even though the boundary is the same?
The cut point is chosen by minimizing the weighted sum of the two regions' Gini indices; the predicted class in each region is the sign of the weighted label sum $\sum_i w_i y_i$ over that region (the weighted majority). The weights change between iterations, so the same boundary can carry different predictions.
#### Explain what changed between iteration-1 and iteration-2, and why.
Between iteration-1 and iteration-2, AdaBoost increases the weights of the misclassified points in the middle region; with those larger weights the weighted majority on the right-hand side flips, so the prediction there becomes ***Blue***.
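A minimal sketch of this rule (hypothetical helper names, not the notebook's code): the cut minimizes the weighted sum of the two regions' Gini indices, and each region predicts the sign of its weighted label sum.
```python=
import numpy as np

def weighted_gini(w, y):
    """Gini impurity of labels y in {-1, +1} under sample weights w."""
    if w.sum() == 0:
        return 0.0
    p = w[y == 1].sum() / w.sum()      # weighted fraction of +1 labels
    return 2 * p * (1 - p)

def fit_weighted_stump(x, y, w):
    """Pick the cut point minimizing the weighted split impurity."""
    best = None
    for c in np.unique(x):
        left, right = x <= c, x > c
        cost = (w[left].sum() * weighted_gini(w[left], y[left]) +
                w[right].sum() * weighted_gini(w[right], y[right]))
        if best is None or cost < best[0]:
            # prediction in each region = sign of the weighted label sum
            pred_left = 1 if (w[left] * y[left]).sum() >= 0 else -1
            pred_right = 1 if (w[right] * y[right]).sum() >= 0 else -1
            best = (cost, c, pred_left, pred_right)
    return best[1:]   # (cut point, left prediction, right prediction)
```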
## 2.
### Single Decision Tree
#### (a)
```python=
tmp_mean = np.mean(valid_scores, axis=1)
tmp_std = np.std(valid_scores, axis=1)      # np.std normalizes by N (ddof=0)
tmp_index = list(tmp_mean).index(max(tmp_mean))
print("the depth w/ max mean accuracy:", tmp_index+1)
print("the std.dev. there:", tmp_std[tmp_index])
# "one SE" is taken here as std.dev. / (n_folds - 1) with 5 CV folds
print("one SE:", tmp_std[tmp_index]/(5-1))
print("best model: depth as", 4)
```
- the depth w/ max mean accuracy:
- 4
- the std.dev. there:
- 0.030506
- one SE:
- 0.007626
- best model:
- depth as 4
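The `valid_scores` array above comes from the notebook; a minimal sketch of the assumed setup (the depth range 1–10 is an assumption, as are the notebook's `X_train`/`y_train`):
```python=
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# 5-fold validation curve over max_depth; the default score is accuracy.
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(), X_train, y_train,
    param_name="max_depth", param_range=np.arange(1, 11), cv=5)
```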
#### (b)
```python=
dt_m = DecisionTreeClassifier(max_depth = 4)
dt_m.fit(X_train, y_train)
y_pred_m = dt_m.predict(X_test)
from sklearn.metrics import accuracy_score
print('Accuracy on test data is %.2f' % (
accuracy_score(y_test, y_pred_m)))
```
- Accuracy on test data is
- 0.92
### Pruning
#### (a)
The complexity of the pruned tree decreases as $\alpha$ increases: a larger $\alpha$ penalizes additional leaves more heavily, so more subtrees are pruned away.
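The candidate `ccp_alphas` used in the next block come from the notebook; a minimal sketch of how they are typically obtained with scikit-learn (assuming the notebook's `X_train`/`y_train`):
```python=
from sklearn.tree import DecisionTreeClassifier

# Effective alphas along the cost-complexity pruning path of a fully grown tree;
# the last value prunes the tree down to the root, hence ccp_alphas[:-1] below.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
ccp_alphas = path.ccp_alphas
```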
```python=
tmp_mean = np.mean(valid_scores, axis=1)
tmp_std = np.std(valid_scores, axis=1)      # np.std normalizes by N (ddof=0)
tmp_index = list(tmp_mean).index(max(tmp_mean))
print("the alpha w/ max mean accuracy:", ccp_alphas[:-1][tmp_index])
print("the std.dev. there:", tmp_std[tmp_index])
# "one SE" is taken here as std.dev. / (n_folds - 1) with 5 CV folds
print("one SE:", tmp_std[tmp_index]/(5-1))
# one-SE threshold: best mean accuracy minus one SE
print(tmp_mean[tmp_index]-0.007266782124777077)
# candidate alphas from the best index onward (larger alpha = simpler tree)
print(ccp_alphas[:-1][tmp_index:])
# which of those candidates still score within one SE of the best
print(tmp_mean[tmp_index:] > (tmp_mean[tmp_index]
                              - 0.007266782124777077))
print("best model: alpha as", 0.00756272)
```
- the alpha w/ max mean accuracy:
- 0.004199999999999995
- the std.dev. there:
- 0.029067128499108308
- one SE:
- 0.007266782124777077
- best model: alpha as
- 0.00756272
#### (b)
```python=
# Refit on the full training set with the alpha chosen by the one-SE rule,
# then evaluate on the held-out test set.
clf = DecisionTreeClassifier(random_state=0, ccp_alpha=0.00756272)
clf.fit(X_train, y_train)
print("test_scores:", clf.score(X_test, y_test))
```
- test_scores:
- 0.92
### Boosting
#### (a)
```python=
tmp_mean = np.mean(valid_scores, axis=1)
tmp_std = np.std(valid_scores, axis=1)      # np.std normalizes by N (ddof=0)
tmp_index = list(tmp_mean).index(max(tmp_mean))
print(tmp_index)                            # 0-based index of the best count
print("the number of trees w/ max mean accuracy:", tmp_index+1)
print("the max mean accuracy:", tmp_mean[tmp_index])
print("the std.dev. there:", tmp_std[tmp_index])
# "one SE" is taken here as std.dev. / (n_folds - 1) with 5 CV folds
print("one SE:", tmp_std[tmp_index]/(5-1))
# which tree counts score within one SE of the best mean accuracy
print(tmp_mean > (0.8442857142857143-0.005345224838248489))
print("best model: number of trees as", 12)
```
- the number of trees w/ max mean accuracy:
- 13
- the max mean accuracy:
- 0.8442857142857143
- the std.dev. there:
- 0.021380899352993955
- one SE:
- 0.005345224838248489
- best model: number of trees as
- 12
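As above, `valid_scores` comes from the notebook; a sketch of the assumed setup (the 1–25 range is an assumption; newer scikit-learn versions rename `base_estimator` to `estimator`):
```python=
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import validation_curve

# 5-fold validation curve over the number of boosting rounds (depth-1 stumps).
train_scores, valid_scores = validation_curve(
    AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1)),
    X_train, y_train,
    param_name="n_estimators", param_range=np.arange(1, 26), cv=5)
```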
#### (b)
```python=
dt_m = AdaBoostClassifier(
base_estimator=DecisionTreeClassifier(max_depth=1),
n_estimators=12)
dt_m.fit(X_train, y_train)
y_pred_m = dt_m.predict(X_test)
from sklearn.metrics import accuracy_score
print('Accuracy on test data is %.2f' % (
accuracy_score(y_test, y_pred_m)))
```
- Accuracy on test data is
- 0.87
### Bagging
#### (a)
```python=
tmp_mean = np.mean(valid_scores, axis=1)
tmp_std = np.std(valid_scores, axis=1)      # np.std normalizes by N (ddof=0)
tmp_index = list(tmp_mean).index(max(tmp_mean))
print(tmp_index)                            # 0-based index of the best count
print("the number of trees w/ max mean accuracy:", tmp_index+1)
print("the max mean accuracy:", tmp_mean[tmp_index])
print("the std.dev. there:", tmp_std[tmp_index])
# "one SE" is taken here as std.dev. / (n_folds - 1) with 5 CV folds
print("one SE:", tmp_std[tmp_index]/(5-1))
# which tree counts score within one SE of the best mean accuracy
print(tmp_mean > (0.8414285714285714-0.009675869417245768))
print("best model: number of trees as", 5)
```
- the number of trees w/ max mean accuracy:
- 23
- the max mean accuracy:
- 0.8414285714285714
- the std.dev. there:
- 0.03870347766898307
- one SE:
- 0.009675869417245768
- best model: number of trees as
- 5
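Likewise for bagging, a sketch of the assumed validation-curve setup (the 1–25 range is an assumption):
```python=
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import validation_curve

# 5-fold validation curve over the number of bagged (fully grown) trees.
train_scores, valid_scores = validation_curve(
    BaggingClassifier(base_estimator=DecisionTreeClassifier()),
    X_train, y_train,
    param_name="n_estimators", param_range=np.arange(1, 26), cv=5)
```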
#### (b)
```python=
dt_m = BaggingClassifier(
base_estimator=DecisionTreeClassifier(), n_estimators=5)
dt_m.fit(X_train, y_train)
y_pred_m = dt_m.predict(X_test)
from sklearn.metrics import accuracy_score
print('Accuracy on test data is %.2f' % (
accuracy_score(y_test, y_pred_m)))
```
- Accuracy on test data is
- 0.86
### \(c\) KNN
```python=
from sklearn.neighbors import KNeighborsClassifier
train_scores, valid_scores = validation_curve(
    KNeighborsClassifier(),
    X_train, y_train,
    param_name="n_neighbors", param_range=np.arange(1, 25),
    cv=5)
df_cv = pd.DataFrame(data={'K': np.repeat(np.arange(1, 25), repeats=5),
                           'train': train_scores.ravel(),
                           'valid': valid_scores.ravel()})
ax = sns.lineplot(data=df_cv, x='K', y='train',
                  ci='sd', marker='o', label='Training score')
ax = sns.lineplot(data=df_cv, x='K', y='valid',
                  ci='sd', marker='o', label='Validation score')
ax.set(xlabel='K', ylabel='Score');
# mark the best K (18) and its mean validation score
ax.plot([18], [0.8657142857142859], '.')
# one standard deviation below the best mean
ax.plot([18], [0.8657142857142859-0.015907898179514334], '.')
# horizontal one-SE threshold line (best mean minus one SE)
ax.plot([1, 18], [0.8657142857142859-0.0039769745448785835,
                  0.8657142857142859-0.0039769745448785835], '-')
```

```python=
tmp_mean = np.mean(valid_scores, axis=1)
tmp_std = np.std(valid_scores, axis=1)      # np.std normalizes by N (ddof=0)
tmp_index = list(tmp_mean).index(max(tmp_mean))
print("the K w/ max mean accuracy:", tmp_index+1)
print("the max mean accuracy:", tmp_mean[tmp_index])
print("the std.dev. there:", tmp_std[tmp_index])
# "one SE" is taken here as std.dev. / (n_folds - 1) with 5 CV folds
print("one SE:", tmp_std[tmp_index]/(5-1))
# which values of K score within one SE of the best mean accuracy
print(tmp_mean > (0.8657142857142859-0.0039769745448785835))
print("best model: K as", 18)
```
- the K w/ max mean accuracy:
- 18
- the max mean accuracy:
- 0.8657142857142859
- the std.dev. there:
- 0.015907898179514334
- one SE:
- 0.0039769745448785835
- best model: K as
- 18
```python=
dt_m = KNeighborsClassifier(n_neighbors=18)
dt_m.fit(X_train, y_train)
y_pred_m = dt_m.predict(X_test)
print('Accuracy on test data is %.2f' % (
accuracy_score(y_test, y_pred_m)))
```
- Accuracy on test data is
- 0.90
## 3.
- For one-feature data, Random Forest and Bagging have the same performance: with a single feature, the random subset of candidate features at each split can only contain that one feature, so RF reduces to Bagging.
- RF will perform better. When the data contain correlated (dependent) features, RF's random feature subsetting decorrelates the individual trees, whereas bagged trees tend to keep picking the same dominant features and stay correlated (see the sketch below).
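An illustration of the only difference between the two models (assumed data; not part of the assignment code): bagging lets every split consider all features, while a random forest draws a random feature subset per split.
```python=
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: each split of each tree considers all features of its bootstrap sample.
bag = BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=100)

# Random Forest: each split considers a random subset of features (max_features);
# with a single feature that subset is always the same feature, so RF == Bagging.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
```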
## 4.
- The left-most $-1$ will be weighted more heavily because it is mispredicted.
- 
- The two iterations will have the same $\alpha$, since each has only one mispredicted data point and the data points carry the same weights at the time of the misprediction, so the weighted errors are equal (see the formula below).
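For reference, the stump weight in discrete AdaBoost depends only on the weighted error $\epsilon_m$, so equal weighted errors give equal $\alpha$. The values in 1(b) ($\alpha_0 = 0.66$ for $\epsilon_0 = 0.34$) appear consistent with the form without the $\tfrac{1}{2}$ factor:
$$
\alpha_m = \ln\frac{1-\epsilon_m}{\epsilon_m}
\qquad\left(\text{some texts use } \tfrac{1}{2}\ln\tfrac{1-\epsilon_m}{\epsilon_m}\right),
\qquad \ln\frac{0.66}{0.34} \approx 0.66 .
$$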
## 5.
### the 1st cut on the whole dataset, Outlook
```python=
import numpy as np

def lg(x):
    # log2 with the convention 0 * log2(0) = 0
    if x != 0: return np.log2(x)
    return 0

def Sf(n2, y2, N):
    # weighted entropy of one branch: n2 samples, y2 positives, N samples total
    if n2 == 0: return 0
    return (n2/N*(-(y2/n2)*lg(y2/n2)-(1-(y2/n2))*lg(1-(y2/n2))))

def Sf2(y):
    # entropy of the label vector y
    tmp = sum(y)/len(y)
    return -(1-tmp)*lg(1-tmp) - (tmp)*lg(tmp)

# features: 0 = Outlook (0 Overcast, 1 Rain, 2 Sunny), 1 = Temperature,
# 2 = Humidity, 3 = Wind; labels: 1 = play, 0 = don't play
X = [[2,1,0,0],[2,1,0,1],[0,1,0,0],[1,2,0,0],[1,0,1,0],
     [1,0,1,1],[0,0,1,1],[2,2,0,0],[2,0,1,0],[1,2,1,0],
     [2,2,1,1],[0,2,0,1],[0,1,1,0],[1,2,0,1]]
y = [0,0,1,1,1,0,1,0,1,1,1,1,1,0]

print('original S:', Sf2(y))
origS = Sf2(y)
for f in [0, 1, 2, 3]:
    # count, per value of feature f, the samples and the positive labels
    n0 = n1 = n2 = y0 = y1 = y2 = 0
    for i, ele in enumerate(X):
        if ele[f] == 0:
            n0 += 1
            if y[i] == 1: y0 += 1
        if ele[f] == 1:
            n1 += 1
            if y[i] == 1: y1 += 1
        if ele[f] == 2:
            n2 += 1
            if y[i] == 1: y2 += 1
    N = len(X)
    totS = Sf(n2, y2, N) + Sf(n1, y1, N) + Sf(n0, y0, N)
    print("info gain:", origS - totS)
```
```
original S: 0.9402859586706309
info gain: 0.2467498197744391 Outlook
info gain: 0.029222565658954647
info gain: 0.15183550136234136
info gain: 0.04812703040826927
```
We choose Outlook as the 1st cut.
Overcast(0) is pure and doesn't need more cuts.
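The same gain computation is rerun on each child subset below; a reusable sketch (hypothetical function name, reusing `lg`/`Sf`/`Sf2` from above) that returns one gain per feature (the already-split feature simply gets gain 0 on its subset):
```python=
def info_gains(X, y):
    # information gain of each feature for the subset (X, y)
    origS = Sf2(y)
    gains = []
    for f in range(len(X[0])):
        counts = {}                      # feature value -> [n_samples, n_positive]
        for xi, yi in zip(X, y):
            n, p = counts.get(xi[f], [0, 0])
            counts[xi[f]] = [n + 1, p + yi]
        totS = sum(Sf(n, p, len(X)) for n, p in counts.values())
        gains.append(origS - totS)
    return gains
```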
### the 2nd cut on Rain(1), Wind
```python=
# Rain subset (Outlook = 1); the gain loop is rerun on the remaining features
X=[[1,2,0,0],[1,0,1,0],[1,0,1,1],[1,2,1,0],[1,2,0,1]]
y=[1,1,0,1,0]
```
```
original S: 0.9709505944546686
info gain: 0.01997309402197489
info gain: 0.01997309402197489
info gain: 0.9709505944546686 Wind
```
We choose Wind; both child nodes are now pure.
### the 2nd cut on Sunny(2), Humidity
```python=
# Sunny subset (Outlook = 2); the gain loop is rerun on the remaining features
X=[[2,1,0,0],[2,1,0,1],[2,2,0,0],[2,0,1,0],[2,2,1,1]]
y=[0,0,0,1,1]
```
```
original S: 0.9709505944546686
info gain: 0.5709505944546686
info gain: 0.9709505944546686 Humidity
info gain: 0.01997309402197489
```
We choose Humidity; both child nodes are now pure.
## 6.
- For a one-feature problem, the Random Tree will perform better; a Single Tree is more likely to overfit.
- The Random Tree will still perform better; a Single Tree is more likely to overfit.