We define what a data point is for the purpose of testing our functions as we build them. I've only tested with a few print statements, since most of these are simple functions without many moving parts. Usually, the correct thing to do is to write tests beforehand instead of printing afterwards. Anyway, here's a data point:
    1 (positive): {'third': True, 'second': 'A', 'first': 1}
The `info()` function tells us the entropy of a set. It is maximal for sets with more than one equiprobable label and minimal for homogeneous sets.
    1.0
    -0.0
    0.9182958340544896
    OK
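Here's a minimal sketch of what such an entropy function might look like. The `Point` namedtuple is a hypothetical stand-in for the post's actual data point (a label plus an attribute dict), which isn't shown here:

    import math
    from collections import Counter, namedtuple

    # Hypothetical stand-in for the post's data point: a label plus an attribute dict.
    Point = namedtuple('Point', ['label', 'attributes'])

    def info(points):
        """Shannon entropy (in bits) of the labels in `points`:
        maximal for equiprobable labels, zero for a homogeneous set."""
        counts = Counter(p.label for p in points)
        total = len(points)
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    print(info([Point('A', {}), Point('B', {})]))  # 1.0
    print(info([Point('A', {}), Point('A', {})]))  # -0.0, since -(1.0 * log2(1.0)) == -0.0

That `-0.0` is a harmless floating-point artifact of negating a zero sum.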
We say an attribute's information gain measures the average homogeneity (as measured by `info`) of the sets into which it divides our data. Phew. That means that the gain of an attribute is maximal if splitting the data based on that attribute yields the most homogeneous subgroups. This might have a downside for attributes with many, many values: the subgroups will be too small and consequently very homogeneous, but the tree won't generalize.
    1.0
    -0.0
    OK
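As a sketch, here is the textbook formulation, reusing `info` and `Point` from above: the parent set's entropy minus the size-weighted entropy of the subsets, which is largest when the subsets are most homogeneous. The post's own `attribute_gain()` may differ in the details:

    def attribute_gain(points, attr):
        """Textbook information gain of splitting `points` on `attr`:
        parent entropy minus the size-weighted entropy of each subset."""
        subsets = {}
        for p in points:
            subsets.setdefault(p.attributes[attr], []).append(p)
        weighted = sum(len(s) / len(points) * info(s) for s in subsets.values())
        return info(points) - weighted

    balanced = [Point('A', {'x': 0}), Point('B', {'x': 1})]
    print(attribute_gain(balanced, 'x'))  # 1.0: a perfect split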
If the data set has only one label, this function returns that label and `True`. It returns `None` and `False` otherwise.
    (None, False)
    ('A', True)
    OK
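A sketch of that check, under a made-up name since the post doesn't show one:

    def single_label(points):
        """Return (label, True) if all points share one label, (None, False) otherwise."""
        labels = {p.label for p in points}
        if len(labels) == 1:
            return labels.pop(), True
        return None, False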
This (duh) returns the most common label within a data set.
    A
    B
    OK
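Something like this would do, with `collections.Counter` doing the counting (again, the name is my guess):

    from collections import Counter

    def most_common_label(points):
        """Return the label that occurs most often in `points`."""
        return Counter(p.label for p in points).most_common(1)[0][0]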
I'll leave this as a data class. But, in truth, I could have the next function (build_tree()
) as a train()
method of the DecisionNode
class. is_leaf()
is very self-evident. classify()
will take a data point and send it to the right child for further inspection based on the node's attribute. Once the data point arrives on a leaf node, it receives that node's label.
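Here's one way the data class could look, assuming internal nodes keep a dict from attribute values to children; the field names are mine, not necessarily the post's:

    from dataclasses import dataclass, field

    @dataclass
    class DecisionNode:
        attribute: str = None  # attribute this node splits on (None for leaves)
        label: str = None      # predicted label (meaningful for leaves)
        children: dict = field(default_factory=dict)  # attribute value -> child node

        def is_leaf(self):
            return not self.children

        def classify(self, point):
            """Send `point` to the right child until it reaches a leaf's label."""
            if self.is_leaf():
                return self.label
            return self.children[point.attributes[self.attribute]].classify(point)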
The main training algorithm, adapted from Wikipedia's description.
The most important part below is the line that picks which attribute to split on. Its `key` argument is where most of the different versions of this tree-building algorithm differ. Another flexible point is the next loop, where children are added to the root node. Right now, each attribute value corresponds to one child. It doesn't have to be that way: we could add a child for every interval or set of values instead, which would require some small changes to `attribute_gain()` and the `values` loop.
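Here is one plausible shape for `build_tree()`, stitched together from the sketches above. The `max(..., key=...)` call is the attribute-selection line discussed, and the `values` loop adds one child per attribute value:

    def build_tree(points, attributes):
        """Recursive, ID3-flavoured sketch of the training algorithm."""
        label, homogeneous = single_label(points)
        if homogeneous:
            return DecisionNode(label=label)
        if not attributes:
            return DecisionNode(label=most_common_label(points))
        # The key argument is where variants of this algorithm differ.
        best = max(attributes, key=lambda a: attribute_gain(points, a))
        node = DecisionNode(attribute=best)
        remaining = [a for a in attributes if a != best]
        values = {p.attributes[best] for p in points}
        for value in values:  # one child per observed attribute value
            subset = [p for p in points if p.attributes[best] == value]
            node.children[value] = build_tree(subset, remaining)
        return node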
    label A

    decide on a
    │ 1: label B

    label B

    decide on b
    │ yes: label B
    │ no: label A
    decide on c
    │ 0: decide on b
    │ │ yes: label B
    │ │ no: decide on a
    │ │ │ 0: label B
    │ │ │ 1: label A
    │ 1: label B
    │ 2: label A
Have fun hereβ¦
    B
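For instance, an end-to-end run with the sketches above (on made-up data) might look like this:

    data = [
        Point('A', {'a': 0, 'b': 'no'}),
        Point('B', {'a': 1, 'b': 'yes'}),
        Point('B', {'a': 0, 'b': 'yes'}),
    ]
    tree = build_tree(data, ['a', 'b'])
    print(tree.classify(Point(None, {'a': 0, 'b': 'yes'})))  # B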