# DD2421 Lab 1
## Assignment 0
The hardest problem to learn is MONK-2. The decision tree required for this concept is the deepest: ensuring that $a_i = 1$ for exactly two $i \in \{ 1,2,...,6 \}$ cannot be captured by a few splits on individual attributes and therefore produces many paths in the tree. There are ${{6}\choose{2}}=15$ possible pairs for which the statement is true, and to verify each pair we need to check all six attributes. E.g. for $i \in \{1,2\}$, we need to check ${(a_1=1)}{\wedge} (a_2=1) {\wedge} (a_3 \ne1) {\wedge} (a_4 \ne 1) {\wedge} (a_5 \ne 1) {\wedge} (a_6 \ne 1)$. These comparisons must be made for each of the 15 pairs, giving the deepest decision tree needed among the datasets provided.
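To make the concept concrete, the MONK-2 target can be written as a small predicate. This is a standalone illustration using plain tuples of attribute values, not the lab's `monkdata` sample objects:
```python
# Illustration only: the MONK-2 target concept is true iff exactly two
# of the six attributes equal 1.
def monk2_concept(attribute_values):
    """attribute_values: a tuple (a1, ..., a6) of attribute values."""
    return sum(1 for a in attribute_values if a == 1) == 2

print(monk2_concept((1, 1, 2, 3, 2, 2)))  # True: only a1 and a2 equal 1
print(monk2_concept((1, 1, 1, 3, 2, 2)))  # False: three attributes equal 1
```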
## Assignment 1
```python
import monkdata as m
import dtree as tree
print(tree.entropy(m.monk1))
print(tree.entropy(m.monk2))
print(tree.entropy(m.monk3))
```
*Result*:
```
1.0
0.957117428264771
0.9998061328047111
```
*Solution*:
| Dataset | Entropy |
| ------- | --------- |
| MONK-1 | 1.0000000 |
| MONK-2 | 0.9571174 |
| MONK-3 | 0.9998061 |
## Assignment 2
Entropy is given by the equation:
$Entropy=\sum_i{-p_{i}\log_2 {p_{i}}}$
For a uniform distribution with $n$ possible outcomes, each with probability $p_i=\frac{1}{n}$, $Entropy = -n \cdot \frac{1}{n} \log_2 \left(\frac{1}{n}\right) = \log_2 (n)$. As the number of possible outcomes $n$ increases, the entropy grows logarithmically. Uniform distributions yield the maximum unpredictability and hence the maximum entropy.
For a non-uniform distribution, the less uniform the probabilities are, the more predictable the outcomes and the lower the entropy. For example, a fair coin toss has $Entropy = -0.5\log_2(0.5)-0.5\log_2(0.5)=1$, whereas a weighted coin with $Pr(Head) = 0.9$ gives $Entropy = -0.9\log_2(0.9)-0.1\log_2(0.1)\approx 0.469$, which is much lower. A fair die likewise has higher entropy than a weighted die.
Another example of a probability distribution with low entropy is a binomial distribution with a small number of trials (e.g. n = 2), whereas one with a high entropy would be a binomial distribution with a large number of trials (e.g. n = 100).
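These numbers can be reproduced with a small helper function. This is a minimal sketch operating directly on probability vectors, independent of the lab's `dtree.entropy`, which works on datasets:
```python
import math

def entropy(probs):
    """Shannon entropy (base 2) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0
print(entropy([0.9, 0.1]))   # weighted coin: ~0.469
print(entropy([1/6] * 6))    # fair die: log2(6) ~ 2.585
print(entropy([1/4] * 4))    # uniform over 4 outcomes: log2(4) = 2
```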
## Assignment 3
```python=
datasetTitles = ['MONK-1', 'MONK-2', 'MONK-3']
datasets = [m.monk1, m.monk2, m.monk3]
for i in range(3):
    print(datasetTitles[i])
    for j in range(6):
        print(tree.averageGain(datasets[i], m.attributes[j]))
```
*Result*:
```
MONK-1
0.07527255560831925
0.005838429962909286
0.00470756661729721
0.02631169650768228
0.28703074971578435
0.0007578557158638421
MONK-2
0.0037561773775118823
0.0024584986660830532
0.0010561477158920196
0.015664247292643818
0.01727717693791797
0.006247622236881467
MONK-3
0.007120868396071844
0.29373617350838865
0.0008311140445336207
0.002891817288654397
0.25591172461972755
0.007077026074097326
```
*Solution*:
| Dataset| a1| a2| a3| a4| a5| a6|
|--------|--------|--------|--------|--------|--------|--------|
| MONK-1|0.075273|0.005838|0.004708|0.026312|0.287031|0.000758|
| MONK-2|0.003756|0.002458|0.001056|0.015664|0.017277|0.006248|
| MONK-3|0.007121|0.293736|0.000831|0.002892|0.255912|0.007077|
We deduce that we should split the three datasets on the following attributes:
+ _Monk1_: a5
+ _Monk2_: a5
+ _Monk3_: a2
The explanation follows from Assignment 2, given the gain function $G(S,A)=Entropy(S)-\sum_{k \in values(A)} \frac{|S_k|}{|S|}Entropy(S_k)$: maximizing the gain means choosing the split that minimizes the weighted entropy of the resulting subsets, i.e. the split that makes the subsets as class-pure as possible.
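The same choice can be made programmatically by taking the argmax of `averageGain` over the attributes. A small sketch reusing the variables from the snippet above:
```python
# For each dataset, pick the attribute with the highest information gain.
for title, dataset in zip(datasetTitles, datasets):
    gains = [tree.averageGain(dataset, a) for a in m.attributes]
    bestIndex = gains.index(max(gains))
    print(title, '-> split on a' + str(bestIndex + 1))
```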
## Assignment 4
Maximizing the information gain of a split implies the lowest possible weighted entropy for the subsets $S_k$ after splitting. Low entropy in a subset means a non-uniform class distribution within it; in the extreme case, an entropy of 0 means the subset contains only one class.
Since choosing attributes that maximize information gain reduces the entropy of each subset, it drives the subsets toward pure classifications, which makes it a good heuristic for picking attributes to split on.
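To see this concretely, the entropy of each MONK-1 subset after the $a_5$ split can be inspected with the same `select` and `entropy` calls used in the assignments above (for MONK-1, the subset with $a_5 = 1$ should come out pure, since $a_5 = 1$ already satisfies the target concept):
```python
# Entropy of each MONK-1 subset after splitting on a5 (values 1-4, see Pre-Assignment 5).
for v in [1, 2, 3, 4]:
    subset = tree.select(m.monk1, m.attributes[4], v)
    print('a5 =', v, '-> entropy', tree.entropy(subset))
```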
## Pre-Assignment 5
```python=
# Run this code first to determine the second split for different values of A5
# A5 takes the values 1, 2, 3, 4
values = [[1,2,3], [1,2,3], [1,2], [1,2,3], [1,2,3,4], [1,2]]
firstSplit = []
for attrib in range(len(values[4])):
    print("Attrib value:", values[4][attrib])
    dataset = tree.select(m.monk1, m.attributes[4], values[4][attrib])
    firstSplit.append(dataset)
    for i in range(len(m.attributes)):
        # Test those that were not used to split the dataset (A5)
        if i != 4:
            print('A' + str(i+1) + ':', tree.averageGain(dataset, m.attributes[i]))

# Then run this code to determine the truth values for the second split
# Final answer
def processAttrib(valueIndex, attribIndex):
    print("A5: " + str(values[4][valueIndex]))
    for attrib in range(len(values[attribIndex])):
        dataset = tree.select(firstSplit[valueIndex], m.attributes[attribIndex], values[attribIndex][attrib])
        print(" A" + str(attribIndex + 1) + ":", values[attribIndex][attrib], "-", tree.mostCommon(dataset))

print("A5: " + str(values[4][0]), "-", tree.mostCommon(firstSplit[0]))
processAttrib(1, 3)
processAttrib(2, 5)
processAttrib(3, 0)
```
*Result*:
```
A5: 1 - True
A5: 2
A4: 1 - False
A4: 2 - False
A4: 3 - False
A5: 3
A6: 1 - False
A6: 2 - False
A5: 4
A1: 1 - False
A1: 2 - False
A1: 3 - True
```
```python=
import drawtree_qt5 as dt
t1 = tree.buildTree(m.monk1, m.attributes, maxdepth=2)  # TODO: install pyqt and draw tree
dt.drawTree(t1)
```
*Result*:

## Assignment 5
```python=
datasetTitles = ['MONK-1', 'MONK-2', 'MONK-3']
datasets = [m.monk1, m.monk2, m.monk3]
testsets = [m.monk1test, m.monk2test, m.monk3test]
for i in range(len(datasets)):
    t = tree.buildTree(datasets[i], m.attributes)
    trainPerf = tree.check(t, datasets[i])
    testPerf = tree.check(t, testsets[i])
    print(datasetTitles[i])
    print('Training performance: {0:.6f}, error: {1:.6f}'.format(trainPerf, 1 - trainPerf))
    print('Test performance: {0:.6f}, error: {1:.6f}'.format(testPerf, 1 - testPerf))
```
*Result*:
```
MONK-1
Training performance: 1.000000, error: 0.000000
Test performance: 0.828704, error: 0.171296
MONK-2
Training performance: 1.000000, error: 0.000000
Test performance: 0.692130, error: 0.307870
MONK-3
Training performance: 1.000000, error: 0.000000
Test performance: 0.944444, error: 0.055556
```
*Solution*:
| | Etrain| Etest|
|--------|--------|--------|
| MONK-1|0.000000|0.171296|
| MONK-2|0.000000|0.307870|
| MONK-3|0.000000|0.055556|
As expected, MONK-2 has the highest test error.
Performance on the training set is perfect because the tree was built by maximizing information gain on exactly that data, fitting it completely. Test performance is worse than training performance, since the test data was not used when training the model.
## Assignment 6
Without pruning, the fully grown tree has maximum flexibility, so it overfits: it depends heavily on the particular training set used and therefore has high variance. Pruning reduces this dependence on the training data, decreasing the variance and the overfitting, but if carried too far it increases the bias.
## Assignment 7
```python=
import dtree as tree
import monkdata as m
import random
from matplotlib import pyplot as plt

# Splits the provided data into two datasets, a training set and a validation set, with
# the provided fraction of data in the training set.
def partition(data, fraction):
    ldata = list(data)
    random.shuffle(ldata)
    breakPoint = int(len(ldata) * fraction)
    return ldata[:breakPoint], ldata[breakPoint:]

# Takes a training data set and a validation data set, and produces a pruned tree.
# Pruning takes place as long as pruning increases performance.
def pruneTree(dataTrain, dataVal):
    # Create a full tree with the given training data
    currentTree = tree.buildTree(dataTrain, m.attributes)
    # Keep pruning the tree until performance cannot be improved any more
    while True:
        # Take the current performance as the benchmark
        currentPerformance = tree.check(currentTree, dataVal)
        # Create all possible candidates for pruned trees
        pruneCandidates = tree.allPruned(currentTree)
        # Initialize variables to track the best performing pruned tree
        bestTree = None
        bestPerformance = 0
        # Iterate through all possible pruned trees
        for candidate in pruneCandidates:
            candidatePerformance = tree.check(candidate, dataVal)
            if candidatePerformance > bestPerformance:
                bestTree = candidate
                bestPerformance = candidatePerformance
        # If the best performing pruned tree is better than the current tree, replace it
        if bestPerformance > currentPerformance:
            currentTree = bestTree
        # Otherwise end the loop
        else:
            break
    return currentTree

# Obtain the performance for the different fraction partitions
datasets = [m.monk1, m.monk3]
testsets = [m.monk1test, m.monk3test]
fractions = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
trials = 1000
means = []
variances = []
for idx, dataset in enumerate(datasets):
    fracMeans = []
    fracVariances = []
    for fraction in fractions:
        meanPerformance = 0
        meanSquaredPerformance = 0
        for i in range(trials):
            traindata, valdata = partition(dataset, fraction)
            t = pruneTree(traindata, valdata)
            performance = tree.check(t, testsets[idx])
            meanPerformance += performance
            meanSquaredPerformance += performance ** 2
        meanPerformance = meanPerformance / trials
        meanSquaredPerformance = meanSquaredPerformance / trials
        fracMeans.append(meanPerformance)
        fracVariances.append(meanSquaredPerformance - meanPerformance ** 2)
    means.append(fracMeans)
    variances.append(fracVariances)
means, variances

# Creating the plot
fig, ax = plt.subplots(2, 2, figsize=(8, 5))
fig.suptitle("Effect of data partition size on pruning performance with {} trials".format(trials))
plt.subplots_adjust(wspace=1.0, hspace=0.5)
ax[0][0].plot(fractions, means[0])
ax[0][0].set_title("Mean of performance on monk1")
ax[0][1].plot(fractions, means[1])
ax[0][1].set_title("Mean of performance on monk3")
ax[1][0].plot(fractions, variances[0])
ax[1][0].set_title("Variance of performance on monk1")
ax[1][1].plot(fractions, variances[1])
ax[1][1].set_title("Variance of performance on monk3")
plt.show()
```
*Result*:
```
([[0.7696134259259267,
0.7940949074074083,
0.821393518518518,
0.8375902777777772,
0.850081018518519,
0.8573425925925918],
[0.9140347222222196,
0.9410578703703689,
0.952831018518515,
0.957939814814814,
0.9583240740740756,
0.9514791666666671]],
[[0.001678922705547703,
0.001686099317344092,
0.0020294551397458394,
0.002168462143132599,
0.002110391884215468,
0.002288556755828619],
[0.0031534863200893826,
0.0017395438475276137,
0.0011347247460182786,
0.0009904787165626594,
0.0008794152091846641,
0.0009361361829105519]])
```
*Solution*:

For MONK-1, we will use 0.8 as the fraction of training data, since the mean test performance keeps improving up to that point. For MONK-3, we will use 0.7: the mean performance peaks there and drops again at 0.8, presumably because the validation set becomes too small to guide pruning reliably, while the variance at 0.7 is also the lowest.