# DD2421 Lab 1
## Assignment 0
The hardest problem to learn is MONK-2. The decision tree required for this concept is the deepest: ensuring that $a_i = 1$ for exactly two $i \in \{ 1,2,...,6 \}$ cannot be captured by a few splits on individual attributes and therefore produces many paths in the tree. There are ${{6}\choose{2}}=15$ possible pairs for which the statement is true, and to verify each pair we need to check all six attributes. E.g. for $i \in \{1,2\}$, we need to check ${(a_1=1)}{\wedge} (a_2=1) {\wedge} (a_3 \ne1) {\wedge} (a_4 \ne 1) {\wedge} (a_5 \ne 1) {\wedge} (a_6 \ne 1)$. These comparisons must be made for each of the 15 pairs, giving the deepest decision tree needed among the datasets provided.
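To make the concept concrete, the MONK-2 target can be written as a small predicate. This is a standalone illustration using plain tuples of attribute values, not the lab's `monkdata` sample objects:
```python
# Illustration only: the MONK-2 target concept is true iff exactly two
# of the six attributes equal 1.
def monk2_concept(attribute_values):
    """attribute_values: a tuple (a1, ..., a6) of attribute values."""
    return sum(1 for a in attribute_values if a == 1) == 2

print(monk2_concept((1, 1, 2, 3, 2, 2)))  # True: only a1 and a2 equal 1
print(monk2_concept((1, 1, 1, 3, 2, 2)))  # False: three attributes equal 1
```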
## Assignment 1
```python
import monkdata as m
import dtree as tree
print(tree.entropy(m.monk1))
print(tree.entropy(m.monk2))
print(tree.entropy(m.monk3))
```
*Result*:
```
1.0
0.957117428264771
0.9998061328047111
```
*Solution*:
| Dataset | Entropy |
| ------- | --------- |
| MONK-1 | 1.0000000 |
| MONK-2 | 0.9571174 |
| MONK-3 | 0.9998061 |
## Assignment 2
Entropy is given by the equation:
$Entropy=\sum_i{-p_{i}\log_2 {p_{i}}}$
For a uniform distribution with $n$ possible outcomes, each with probability $p_i=\frac{1}{n}$, $Entropy = -n \cdot \frac{1}{n} \log_2 \left(\frac{1}{n}\right) = \log_2 (n)$. As the number of possible outcomes $n$ increases, the entropy grows logarithmically. Uniform distributions yield the maximum unpredictability and hence the maximum entropy.
For a non-uniform distribution, the less uniform the probabilities are, the more predictable the outcomes and the lower the entropy. For example, a fair coin toss has $Entropy = -0.5\log_2(0.5)-0.5\log_2(0.5)=1$, whereas a weighted coin with $Pr(Head) = 0.9$ gives $Entropy = -0.9\log_2(0.9)-0.1\log_2(0.1)\approx 0.469$, which is much lower. A fair die likewise has higher entropy than a weighted die.
Another example of a probability distribution with low entropy is a binomial distribution with a small number of trials (e.g. n = 2), whereas one with a high entropy would be a binomial distribution with a large number of trials (e.g. n = 100).
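These numbers can be reproduced with a small helper function. This is a minimal sketch operating directly on probability vectors, independent of the lab's `dtree.entropy`, which works on datasets:
```python
import math

def entropy(probs):
    """Shannon entropy (base 2) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0
print(entropy([0.9, 0.1]))   # weighted coin: ~0.469
print(entropy([1/6] * 6))    # fair die: log2(6) ~ 2.585
print(entropy([1/4] * 4))    # uniform over 4 outcomes: log2(4) = 2
```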
## Assignment 3
```python=
datasetTitles = ['MONK-1', 'MONK-2', 'MONK-3']
datasets = [m.monk1, m.monk2, m.monk3]
for i in range(3):
    print(datasetTitles[i])
    for j in range(6):
        print(tree.averageGain(datasets[i], m.attributes[j]))
```
*Result*:
```
MONK-1
0.07527255560831925
0.005838429962909286
0.00470756661729721
0.02631169650768228
0.28703074971578435
0.0007578557158638421
MONK-2
0.0037561773775118823
0.0024584986660830532
0.0010561477158920196
0.015664247292643818
0.01727717693791797
0.006247622236881467
MONK-3
0.007120868396071844
0.29373617350838865
0.0008311140445336207
0.002891817288654397
0.25591172461972755
0.007077026074097326
```
*Solution*:
| Dataset| a1| a2| a3| a4| a5| a6|
|--------|--------|--------|--------|--------|--------|--------|
| MONK-1|0.075273|0.005838|0.004708|0.026312|0.287031|0.000758|
| MONK-2|0.003756|0.002458|0.001056|0.015664|0.017277|0.006248|
| MONK-3|0.007121|0.293736|0.000831|0.002892|0.255912|0.007077|
We deduce that we should split the three datasets on the following attributes:
+ _Monk1_: a5
+ _Monk2_: a5
+ _Monk3_: a2
The explanation follows from Assignment 2, given the gain function $G(S,A)=Entropy(S)-\sum_{k \in values(A)} \frac{|S_k|}{|S|}Entropy(S_k)$: maximizing the gain means choosing the split that minimizes the weighted entropy of the resulting subsets, i.e. the split that makes the subsets as class-pure as possible.
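The same choice can be made programmatically by taking the argmax of `averageGain` over the attributes. A small sketch reusing the variables from the snippet above:
```python
# For each dataset, pick the attribute with the highest information gain.
for title, dataset in zip(datasetTitles, datasets):
    gains = [tree.averageGain(dataset, a) for a in m.attributes]
    bestIndex = gains.index(max(gains))
    print(title, '-> split on a' + str(bestIndex + 1))
```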
## Assignment 4
Maximizing the information gain of a split implies the lowest possible weighted entropy for the subsets $S_k$ after splitting. Low entropy in a subset means a non-uniform class distribution within it; in the extreme case, an entropy of 0 means the subset contains only one class.
Since choosing attributes that maximize information gain reduces the entropy of each subset, it drives the subsets toward pure classifications, which makes it a good heuristic for picking attributes to split on.
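To see this concretely, the entropy of each MONK-1 subset after the $a_5$ split can be inspected with the same `select` and `entropy` calls used in the assignments above (for MONK-1, the subset with $a_5 = 1$ should come out pure, since $a_5 = 1$ already satisfies the target concept):
```python
# Entropy of each MONK-1 subset after splitting on a5 (values 1-4, see Pre-Assignment 5).
for v in [1, 2, 3, 4]:
    subset = tree.select(m.monk1, m.attributes[4], v)
    print('a5 =', v, '-> entropy', tree.entropy(subset))
```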
## Pre-Assignment 5
```python=
# Run this code first to determine the second split for different values of A5
# A5 takes the values 1, 2, 3, 4
values = [[1,2,3], [1,2,3], [1,2], [1,2,3], [1,2,3,4], [1,2]]
firstSplit = []
for attrib in range(len(values[4])):
    print("Attrib value:", values[4][attrib])
    dataset = tree.select(m.monk1, m.attributes[4], values[4][attrib])
    firstSplit.append(dataset)
    for i in range(len(m.attributes)):
        # Test those that were not used to split the dataset (A5)
        if i != 4:
            print('A' + str(i+1) + ':', tree.averageGain(dataset, m.attributes[i]))

# Then run this code to determine the truth values for the second split
# Final answer
def processAttrib(valueIndex, attribIndex):
    print("A5: " + str(values[4][valueIndex]))
    for attrib in range(len(values[attribIndex])):
        dataset = tree.select(firstSplit[valueIndex], m.attributes[attribIndex], values[attribIndex][attrib])
        print(" A" + str(attribIndex + 1) + ":", values[attribIndex][attrib], "-", tree.mostCommon(dataset))

print("A5: " + str(values[4][0]), "-", tree.mostCommon(firstSplit[0]))
processAttrib(1, 3)
processAttrib(2, 5)
processAttrib(3, 0)
```
*Result*:
```
A5: 1 - True
A5: 2
A4: 1 - False
A4: 2 - False
A4: 3 - False
A5: 3
A6: 1 - False
A6: 2 - False
A5: 4
A1: 1 - False
A1: 2 - False
A1: 3 - True
```
```python=
import drawtree_qt5 as dt
t1 = tree.buildTree(m.monk1, m.attributes, maxdepth=2)  # TODO: install pyqt and draw tree
dt.drawTree(t1)
```
*Result*:

## Assignment 5
```python=
datasetTitles = ['MONK-1', 'MONK-2', 'MONK-3']
datasets = [m.monk1, m.monk2, m.monk3]
testsets = [m.monk1test, m.monk2test, m.monk3test]
for i in range(len(datasets)):
    t = tree.buildTree(datasets[i], m.attributes)
    trainPerf = tree.check(t, datasets[i])
    testPerf = tree.check(t, testsets[i])
    print(datasetTitles[i])
    print('Training performance: {0:.6f}, error: {1:.6f}'.format(trainPerf, 1 - trainPerf))
    print('Test performance: {0:.6f}, error: {1:.6f}'.format(testPerf, 1 - testPerf))
```
*Result*:
```
MONK-1
Training performance: 1.000000, error: 0.000000
Test performance: 0.828704, error: 0.171296
MONK-2
Training performance: 1.000000, error: 0.000000
Test performance: 0.692130, error: 0.307870
MONK-3
Training performance: 1.000000, error: 0.000000
Test performance: 0.944444, error: 0.055556
```
*Solution*:
| | Etrain| Etest|
|--------|--------|--------|
| MONK-1|0.000000|0.171296|
| MONK-2|0.000000|0.307870|
| MONK-3|0.000000|0.055556|
As expected, MONK-2 has the highest test error.
Performance on the training set is perfect because the tree was built by maximizing information gain on exactly that data, fitting it completely. Test performance is worse than training performance, since the test data was not used when training the model.
## Assignment 6
Without pruning, the fully grown tree has maximum flexibility, so it overfits: it depends heavily on the particular training set used and therefore has high variance. Pruning reduces this dependence on the training data, decreasing the variance and the overfitting, but if carried too far it increases the bias.
## Assignment 7
```python=
import dtree as tree
import monkdata as m
import random
from matplotlib import pyplot as plt

# Splits the provided data into two datasets, a training set and a validation set, with
# the provided fraction of data in the training set.
def partition(data, fraction):
    ldata = list(data)
    random.shuffle(ldata)
    breakPoint = int(len(ldata) * fraction)
    return ldata[:breakPoint], ldata[breakPoint:]

# Takes a training data set and a validation data set, and produces a pruned tree.
# Pruning takes place as long as pruning increases performance.
def pruneTree(dataTrain, dataVal):
    # Create a full tree with the given training data
    currentTree = tree.buildTree(dataTrain, m.attributes)
    # Keep pruning the tree until performance cannot be improved any more
    while True:
        # Take the current performance as the benchmark
        currentPerformance = tree.check(currentTree, dataVal)
        # Create all possible candidates for pruned trees
        pruneCandidates = tree.allPruned(currentTree)
        # Initialize variables to track the best performing pruned tree
        bestTree = None
        bestPerformance = 0
        # Iterate through all possible pruned trees
        for candidate in pruneCandidates:
            candidatePerformance = tree.check(candidate, dataVal)
            if candidatePerformance > bestPerformance:
                bestTree = candidate
                bestPerformance = candidatePerformance
        # If the best performing pruned tree is better than the current tree, replace it
        if bestPerformance > currentPerformance:
            currentTree = bestTree
        # Otherwise end the loop
        else:
            break
    return currentTree

# Obtain the performance for the different fraction partitions
datasets = [m.monk1, m.monk3]
testsets = [m.monk1test, m.monk3test]
fractions = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
trials = 1000
means = []
variances = []
for idx, dataset in enumerate(datasets):
    fracMeans = []
    fracVariances = []
    for fraction in fractions:
        meanPerformance = 0
        meanSquaredPerformance = 0
        for i in range(trials):
            traindata, valdata = partition(dataset, fraction)
            t = pruneTree(traindata, valdata)
            performance = tree.check(t, testsets[idx])
            meanPerformance += performance
            meanSquaredPerformance += performance ** 2
        meanPerformance = meanPerformance / trials
        meanSquaredPerformance = meanSquaredPerformance / trials
        fracMeans.append(meanPerformance)
        fracVariances.append(meanSquaredPerformance - meanPerformance ** 2)
    means.append(fracMeans)
    variances.append(fracVariances)
means, variances

# Creating the plot
fig, ax = plt.subplots(2, 2, figsize=(8, 5))
fig.suptitle("Effect of data partition size on pruning performance with {} trials".format(trials))
plt.subplots_adjust(wspace=1.0, hspace=0.5)
ax[0][0].plot(fractions, means[0])
ax[0][0].set_title("Mean of performance on monk1")
ax[0][1].plot(fractions, means[1])
ax[0][1].set_title("Mean of performance on monk3")
ax[1][0].plot(fractions, variances[0])
ax[1][0].set_title("Variance of performance on monk1")
ax[1][1].plot(fractions, variances[1])
ax[1][1].set_title("Variance of performance on monk3")
plt.show()
```
*Result*:
```
([[0.7696134259259267,
0.7940949074074083,
0.821393518518518,
0.8375902777777772,
0.850081018518519,
0.8573425925925918],
[0.9140347222222196,
0.9410578703703689,
0.952831018518515,
0.957939814814814,
0.9583240740740756,
0.9514791666666671]],
[[0.001678922705547703,
0.001686099317344092,
0.0020294551397458394,
0.002168462143132599,
0.002110391884215468,
0.002288556755828619],
[0.0031534863200893826,
0.0017395438475276137,
0.0011347247460182786,
0.0009904787165626594,
0.0008794152091846641,
0.0009361361829105519]])
```
*Solution*:

For MONK-1, we will use 0.8 as the fraction of training data, since the mean test performance keeps improving up to that point. For MONK-3, we will use 0.7: the mean performance peaks there and drops again at 0.8, presumably because the validation set becomes too small to guide pruning reliably, while the variance at 0.7 is also the lowest.