# Intro to AI Lab4
**`0712245 高魁駿`**
###### tags: `AI`
## Environment:
- development tools:
- IDE: Sublime Text, CentOS 7.8 (workstation)
- Run command: `python forest.py`
## Task description
- In this project, I experimented on classification tasks with supervised learning using random forests.
## Dataset
- I used datasets downloaded from the UCI Machine Learning Repository.
The data is in the following form:
* iris: 4 attributes and a class attribute
* wine: 12 attributes and a class attribute
* ionosphere: 34 attributes and a class attribute
* glass: 10 attributes and a class attribute
## Module
* CART
* cross validation (k-fold with shuffling)
* training and testing accuracy computation, for a single tree and for the whole forest
* random forest = tree/attribute bagging + decision tree (a usage sketch follows this list)
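As a minimal sketch of how these modules fit together, the snippet below runs k-fold cross validation with the `Random_Forest` and `cross_validation_split` helpers listed in the Appendix Code; the `load_dataset` loader and the hyperparameter values are my own assumptions, not fixed parts of the project.
``` python
dataset = load_dataset('iris.data')            # hypothetical loader: rows of numeric attributes, class label last
folds = cross_validation_split(dataset, kFolds=5)

scores = []
for i, fold in enumerate(folds):
    # One fold is held out for validation; the remaining folds are merged into the training set.
    trainSet = [row for j, other in enumerate(folds) if j != i for row in other]
    forest = Random_Forest(trainSet, fold, maxDepth=15, minSize=1,
                           numTree=30, attriBag='use', relativeRatio=0.8)   # assumed settings
    scores.append(forest.get_forest_accuracy()[1])   # validation accuracy of this fold

print(sum(scores) / len(scores))
```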
## Experiment Results
My experiments compare the training and testing accuracy of a decision tree and a random forest:
<font size=4>**1. Build a decision tree**</font>
* attribute-bagging forest
* no attribute-bagging forest
The validation accuracy of the 4 datasets is listed below:
| | iris | glass | ionosphere | wine |
|:----------------------------------------- | ----- |:----- |:---------- |:---- |
| decision tree not using attribute-bagging | 94.52 | 66.67 | 87.36 | 95.40 |
| random forest using attribute-bagging | 94.52 | 76.19 | 89.08 | 95.40 |
| random forest not using attribute-bagging | 94.52 | 69.52 | 88.51 | 95.40 |
The results show that the forest with attribute bagging still does well on validation accuracy, so I decided to use attribute bagging in the later tests.
In this comparison, the controlled variable is the relative ratio of attributes (the bagging ratio), fixed at 0.8. In the experiments below, I vary this relative ratio to check whether the attribute-bagging ratio influences classification accuracy.
In some cases the accuracy may become lower after using attribute bagging. Because the number of attributes is small here, I cannot see significant differences.
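A minimal sketch of the attribute-bagging step performed at each split, assuming `p` attributes and a relative ratio of 0.8 (matching the `get_split` logic in the appendix):
``` python
import random

p = 34                # e.g. number of attributes in ionosphere
relative_ratio = 0.8  # fraction of attributes considered at each split

# With attribute bagging, each split only looks at a random subset of the attributes;
# without it, every attribute is a candidate at every split.
candidate_attributes = random.sample(range(p), int(p * relative_ratio))
print(candidate_attributes)
```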
<font size=4>**2. Changing the ratio of tree bagging**</font>
> The purpose of this experiment is to split the data into a training set and a test set in order to implement cross validation.
*(Figures for Iris, Glass, Ionosphere, and Wine.)*
The out-of-bag error may change with the size of the training subset, so I focus on the validation accuracy in this part.
In this part of the experiment I used a shell script to run it on the workstation five times, but each time I could not find a clear relation. The reason may be that when the model is simple, we do not need much data for training.
Thus, I conclude that if the dataset is complex, like "Ionosphere", the training ratio needs to be higher to perform well; if the dataset is simple, like "Iris", the model also performs well with only a small training ratio.
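A minimal sketch of this sweep over the training ratio, assuming the appendix classes and a hypothetical `load_dataset` loader; the other hyperparameters are illustrative:
``` python
dataset = load_dataset('ionosphere.data')      # hypothetical loader

for ratio in [0.2, 0.4, 0.6, 0.8]:
    trainSet, testSet = train_validation_split(dataset, ratio)
    forest = Random_Forest(trainSet, testSet, maxDepth=15, minSize=1,
                           numTree=30, attriBag='use', relativeRatio=0.8)   # assumed settings
    trainAcc, testAcc = forest.get_forest_accuracy()
    print(ratio, trainAcc, testAcc)
```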
<font size=4>**3. Changing the size of random forest**</font>
> Test different numbers of trees in a forest.
*(Figures for Iris, Glass, Ionosphere, and Wine.)*
First, I vary the forest size over a small range, but we cannot see a large influence on the simple datasets (according to the figures). I think this is because the number of attributes is small, so the effect of out-of-bag sampling is consequently small.
Later, I ran another experiment at a larger scale and attempted to figure out the relation, focusing on a more complex dataset like "Ionosphere".

The accuracy stopped growing after about 20~50 trees in the forest. The running time kept increasing as the forest grew larger, without improving the accuracy.
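A minimal sketch of the forest-size sweep with timing, again assuming the appendix classes and a hypothetical `load_dataset` loader:
``` python
import time

dataset = load_dataset('ionosphere.data')      # hypothetical loader
trainSet, testSet = train_validation_split(dataset, 0.8)

for numTree in [1, 5, 10, 20, 50, 100, 200]:
    start = time.time()
    forest = Random_Forest(trainSet, testSet, maxDepth=15, minSize=1,
                           numTree=numTree, attriBag='use', relativeRatio=0.8)   # assumed settings
    _, testAcc = forest.get_forest_accuracy()
    # Larger forests cost more time while the test accuracy plateaus around 20~50 trees.
    print(numTree, testAcc, time.time() - start)
```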
<font size=4>**4. Changing relative ratio of attribute bagging**</font>
> The advantage of bagging is that when there is noisy data (bad data) in the original training sample, the bagging sampling gives a chance of keeping the noisy data out of training, so it can reduce the instability of the model and reduce high variance.
*(Figures for Iris, Glass, Ionosphere, and Wine.)*
In this test, when the attribute-bagging ratio drops below about 0.6, the accuracy becomes unstable.
We can conclude that a good attribute ratio helps classify the data more accurately.
Typically, for a classification problem with p features, $\sqrt{p}$ (rounded down) features are used in each split. For regression problems the inventors recommend $\frac{p}{3}$ (rounded down) with a minimum node size of 5 as the default. In practice the best values for these parameters will depend on the problem, and they should be treated as tuning parameters.
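A minimal sketch comparing the √p rule of thumb with the relative-ratio rule used in this project, using the attribute counts from the Dataset section (the 0.6 ratio is just an example value):
``` python
import math

attribute_counts = {'iris': 4, 'wine': 12, 'ionosphere': 34, 'glass': 10}

for name, p in attribute_counts.items():
    sqrt_rule = int(math.sqrt(p))   # common default for classification
    ratio_rule = int(p * 0.6)       # this project's relativeRatio rule, e.g. 0.6
    print(name, sqrt_rule, ratio_rule)
```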
### Methods 5 and 6 below set limits on the tree; the point is to avoid overfitting, so we place some constraints on the decision tree
Common ways to constrain a tree are listed below (a construction sketch follows the list):
1. Minimum samples for a node split: a node must contain at least this many samples before it can be split into new nodes
2. Minimum samples for a terminal node (leaf): the minimum number of samples required to form a leaf node
3. Maximum depth of tree (vertical depth): limits how many levels the tree can grow
4. Maximum number of terminal nodes: limits the total number of leaf nodes
5. Maximum features to consider for split: the maximum number of features considered when splitting a node
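A minimal sketch applying the depth and node-size constraints with the `Decision_Tree` class from the appendix (this project only implements `maxDepth` and `minSize`; the loader and data split are assumptions):
``` python
dataset = load_dataset('glass.data')           # hypothetical loader
trainSet, testSet = train_validation_split(dataset, 0.8)

# Pre-pruning: stop splitting once depth reaches 10 or a node holds 5 samples or fewer.
tree = Decision_Tree(trainSet, testSet, maxDepth=10, minSize=5,
                     attriBag='use', relativeRatio=0.8)
trainAcc, testAcc = tree.get_tree_accuracy()
print(trainAcc, testAcc)
```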
<font size=4>**5. Changing max depth of decision tree**</font>
*(Figures for Iris, Glass, Ionosphere, and Wine.)*
From the figures, if your model is overfitting, reducing the max depth is one way to combat it.
A very low depth is also bad because the model will underfit. To find the best value you have to experiment, because overfitting and underfitting depend heavily on the dataset; there is no one-value-fits-all solution.
Thus, I conclude that a good approach is to first let the model decide the max depth, then compare my train and test scores to look for overfitting or underfitting, and decrease or increase the max depth depending on the degree.
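A minimal sketch of that procedure, assuming the appendix classes and a hypothetical `load_dataset` loader; a widening gap between training and testing accuracy is read as overfitting:
``` python
dataset = load_dataset('wine.data')            # hypothetical loader
trainSet, testSet = train_validation_split(dataset, 0.8)

for maxDepth in [1, 2, 5, 10, 15, 20]:
    tree = Decision_Tree(trainSet, testSet, maxDepth=maxDepth, minSize=1,
                         attriBag='use', relativeRatio=0.8)
    trainAcc, testAcc = tree.get_tree_accuracy()
    # A large train/test gap at high depth suggests overfitting;
    # low accuracy on both at small depth suggests underfitting.
    print(maxDepth, trainAcc, testAcc, trainAcc - testAcc)
```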
<font size=4>**6. Changing min size of the node**</font>
*(Figures for Iris, Glass, Ionosphere, and Wine.)*
Min samples split is used to control overfitting. Higher values prevent the model from learning relations that might be highly specific to the particular sample selected for a tree. Values that are too high can also lead to underfitting, so depending on the observed level of underfitting or overfitting, you can adjust the min size.
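A minimal sketch that sweeps `minSize` and counts the resulting leaves, assuming the appendix `Decision_Tree` class; `count_leaves` is a hypothetical helper that walks the node dictionaries:
``` python
def count_leaves(node):
    # Internal nodes are dicts; leaves are plain class labels.
    if not isinstance(node, dict):
        return 1
    return count_leaves(node['left']) + count_leaves(node['right'])

dataset = load_dataset('glass.data')           # hypothetical loader
trainSet, testSet = train_validation_split(dataset, 0.8)

for minSize in [1, 5, 10, 20, 40]:
    tree = Decision_Tree(trainSet, testSet, maxDepth=20, minSize=minSize,
                         attriBag='use', relativeRatio=0.8)
    _, testAcc = tree.get_tree_accuracy()
    # Larger minSize gives fewer leaves (a simpler tree) and may underfit.
    print(minSize, count_leaves(tree.root), testAcc)
```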
## Conclusion
Finally, I want to discuss how limiting the decision tree versus k-fold cross validation affects accuracy.
Ideally, the two should help raise accuracy by about the same amount. We still need to look at the differences between the individual models to judge which is better, for example by observing the OOB accuracy and the influence of each attribute type on the model. Of course, when running the methods above we should also consider the running time: for a random forest, too many trees makes the run long, and not limiting the depth of the decision tree also has an effect. In general, the deeper a tree is allowed to grow, the more complex the model becomes, because there are more splits and it captures more information about the data. This is one of the root causes of decision-tree overfitting: the model fits the training data very well but does not generalize well to the test set.
Below is my final summary (one last run using the best bagging ratio and decision-tree limits for each dataset; a run sketch follows the table):
| | limit decision tree | cross validation |
|:---------- | -------------------------------- |:---------------- |
| Iris | max depth = 10 , min sample = 15 | ratio = 0.6 |
| Glass | max depth = 20 , min sample = 5 | ratio = 0.4 |
| Ionosphere | max depth = 15 , min sample = 1 | ratio = 0.6 |
| Wine | max depth = 15 , min sample = 10 | ratio = 0.5 |
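A minimal sketch of this final run, taking the settings from the table above; the `load_dataset` loader, the forest size, and the 0.8 attribute ratio are my own assumptions:
``` python
best_settings = {
    # dataset file: (maxDepth, minSize, training ratio)
    'iris.data':       (10, 15, 0.6),
    'glass.data':      (20,  5, 0.4),
    'ionosphere.data': (15,  1, 0.6),
    'wine.data':       (15, 10, 0.5),
}

for name, (maxDepth, minSize, ratio) in best_settings.items():
    dataset = load_dataset(name)               # hypothetical loader
    trainSet, testSet = train_validation_split(dataset, ratio)
    forest = Random_Forest(trainSet, testSet, maxDepth, minSize,
                           numTree=30, attriBag='use', relativeRatio=0.8)   # assumed settings
    print(name, forest.get_forest_accuracy())
```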
## Appendix Code
``` python
import random
from copy import deepcopy
class Decision_Tree(object):
    # CART decision tree: internal nodes are dicts {'attribute', 'value', 'left', 'right'}; leaves are class labels.
    def __init__(self, train, test, maxDepth, minSize, attriBag, relativeRatio):
self.train = deepcopy(train)
self.test = deepcopy(test)
self.actualTrain = [row[-1] for row in self.train]
self.actualTest = [row[-1] for row in self.test]
self.relativeRatio = relativeRatio
self.root = self.get_split(self.train, attriBag)
self.split(self.root, maxDepth, minSize, 1, attriBag)
def split(self, node, maxDepth, minSize, depth, attriBag):
left, right = node['groups']
del(node['groups'])
# check for a no split
if not left or not right:
node['left'] = node['right'] = self.to_terminal(left + right)
return
# check for max depth
if depth >= maxDepth:
node['left'], node['right'] = self.to_terminal(left), self.to_terminal(right)
return
# process left child
if len(left) <= minSize:
node['left'] = self.to_terminal(left)
else:
node['left'] = self.get_split(left, attriBag)
self.split(node['left'], maxDepth, minSize, depth + 1, attriBag)
# process right child
if len(right) <= minSize:
node['right'] = self.to_terminal(right)
else:
node['right'] = self.get_split(right, attriBag)
self.split(node['right'], maxDepth, minSize, depth + 1, attriBag)
def to_terminal(self, group):
outcomes = [row[-1] for row in group]
return max(outcomes, key = outcomes.count)
def get_split(self, dataset, attriBag):
classLabels = list(set(row[-1] for row in dataset))
b_Attribute, b_Value, b_Score, b_Groups = 999, 999, 999, None
randomAttributes = []
        # Attribute bagging: consider only a random subset of the attributes at this split.
        if attriBag == 'use':
            randomAttributes = random.sample(range(len(dataset[0]) - 1), int((len(dataset[0]) - 1) * self.relativeRatio))
        else:
            randomAttributes = range(len(dataset[0]) - 1)
datasetCopy = deepcopy(dataset)
for attribute in randomAttributes:
datasetCopy.sort(key = lambda element : element[attribute])
for row in range(len(datasetCopy)-1):
threshold = (datasetCopy[row][attribute] + datasetCopy[row + 1][attribute]) / 2
groups = self.test_split(attribute, threshold, datasetCopy)
gini = self.gini_index(groups, classLabels)
                if gini < b_Score:
b_Attribute, b_Value, b_Score, b_Groups = attribute, threshold, gini, groups
return {'attribute':b_Attribute, 'value':b_Value, 'groups':b_Groups}
def test_split(self, attribute, value, dataset):
left, right = [], []
for row in dataset:
if row[attribute] < value:
left.append(row)
else:
right.append(row)
return left, right
def gini_index(self, groups, classLabels):
numInstances = float(sum([len(group) for group in groups]))
gini = 0.0
for group in groups:
size = float(len(group))
if size == 0:
continue
score = 0.0
for label in classLabels:
p = [row[-1] for row in group].count(label) / size
score += p * p
gini += (1.0 - score) * (size / numInstances)
return gini
def predict(self, node, row):
if row[node['attribute']] < node['value']:
if isinstance(node['left'], dict):
return self.predict(node['left'], row)
else:
return node['left']
else:
if isinstance(node['right'], dict):
return self.predict(node['right'], row)
else:
return node['right']
def get_tree_accuracy(self):
preTrain = []
for row in self.train:
preTrain.append(self.predict(self.root, row))
preTest = []
for row in self.test:
preTest.append(self.predict(self.root, row))
return [get_accuracy(self.actualTrain, preTrain), get_accuracy(self.actualTest, preTest)]
def cross_validation_split(dataset, kFolds):
    # Shuffle the data, then deal the rows into kFolds folds round-robin.
    datasetCopy = deepcopy(dataset)
    random.shuffle(datasetCopy)
    return [datasetCopy[i::kFolds] for i in range(kFolds)]
def train_validation_split(dataset, ratio):
    # Shuffle the data, then split it into a training part (the first `ratio` fraction) and a validation part (the rest).
    datasetCopy = deepcopy(dataset)
    random.shuffle(datasetCopy)
    cut = int(len(datasetCopy) * ratio)
    return datasetCopy[:cut], datasetCopy[cut:]
def get_accuracy(actual, predictions):
correct = 0
for i in range(len(actual)):
if actual[i] == predictions[i]:
correct += 1
return correct / float(len(actual)) * 100.0
class Random_Forest(object):
    # Bagged ensemble of Decision_Tree objects; prediction is the majority vote over the trees.
    def __init__(self, train, test, maxDepth, minSize, numTree, attriBag, relativeRatio):
self.forest = []
self.train = deepcopy(train)
self.test = deepcopy(test)
self.actualTrain = [row[-1] for row in self.train]
self.actualTest = [row[-1] for row in self.test]
for tree in range(numTree):
trainSet, testSet = train_validation_split(self.train, 0.8) # ratio=0.8
self.forest.append(Decision_Tree(trainSet, testSet, maxDepth, minSize, attriBag, relativeRatio))
def get_forest_accuracy(self):
preTrain = []
for row in self.train:
prediction = []
for tree in self.forest:
prediction.append(tree.predict(tree.root, row))
preTrain.append(max(prediction, key = prediction.count))
preTest = []
for row in self.test:
prediction = []
for tree in self.forest:
prediction.append(tree.predict(tree.root, row))
preTest.append(max(prediction, key = prediction.count))
return [get_accuracy(self.actualTrain, preTrain), get_accuracy(self.actualTest, preTest)]
```