# Creating Decision Trees
<!-- Put the link to this slide here so people can follow -->
slide: https://hackmd.io/@ccornwell/decision-trees
---
<h3>Decision stumps</h3>
- <font size=+2>A decision stump is a particular half-space model, with ${\bf w} = \pm{\bf e_j}$ for some $j$ and the usual bias $b$ (as in half-space models).</font>
- <font size=+2 style="color:#181818;">Say $\upsilon=\pm1$ and ${\bf w} = \upsilon{\bf e_j}$. The decision stump, given ${\bf x_i}$, has output $$\operatorname{sign}(\upsilon x_{i,j} + b).$$</font>
- <font size=+2 style="color:#181818;">Outputs $1$ if $\upsilon x_{i,j} \ge -b$; outputs $-1$ if $\upsilon x_{i,j} < -b$ (deciding that $\operatorname{sign}(0)=1$ ). 3 parameters determine decision stump: $b, \upsilon, j$.</font>
- <font size=+2 style="color:#181818;">On their own, stumps are not good classifiers. They are basic unit in a decision tree, as single neurons were in a neural network.</font>
----
<h3>Decision stumps</h3>
- <font size=+2>A decision stump is a particular half-space model, with ${\bf w} = \pm{\bf e_j}$ for some $j$ and the usual bias $b$ (as in half-space models).</font>
- <font size=+2>Say $\upsilon=\pm1$ and ${\bf w} = \upsilon{\bf e_j}$. The decision stump, given ${\bf x_i}$, has output $$\operatorname{sign}(\upsilon x_{i,j} + b).$$</font>
- <font size=+2>It outputs $1$ if $\upsilon x_{i,j} \ge -b$ and $-1$ if $\upsilon x_{i,j} < -b$ (with the convention $\operatorname{sign}(0)=1$). Three parameters determine a decision stump: $b$, $\upsilon$, and $j$.</font>
- <font size=+2 style="color:#181818;">On their own, stumps are not good classifiers. They are the basic unit of a decision tree, much as single neurons are in a neural network.</font>
----
<h3>Decision stumps</h3>
- <font size=+2>A decision stump is a particular half-space model, with ${\bf w} = \pm{\bf e_j}$ for some $j$ and the usual bias $b$ (as in half-space models).</font>
- <font size=+2>Say $\upsilon=\pm1$ and ${\bf w} = \upsilon{\bf e_j}$. The decision stump, given ${\bf x_i}$, has output $$\operatorname{sign}(\upsilon x_{i,j} + b).$$</font>
- <font size=+2>It outputs $1$ if $\upsilon x_{i,j} \ge -b$ and $-1$ if $\upsilon x_{i,j} < -b$ (with the convention $\operatorname{sign}(0)=1$). Three parameters determine a decision stump: $b$, $\upsilon$, and $j$ (see the sketch below).</font>
- <font size=+2>On their own, stumps are not good classifiers. They are the basic unit of a decision tree, much as single neurons are in a neural network.</font>
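
A minimal sketch of the stump's output rule, assuming a NumPy feature vector `x`; the function name `stump_predict` is illustrative, not from the slides.

```python
import numpy as np

def stump_predict(x, j, v, b):
    """Output of the decision stump with parameters (b, v, j) on input x."""
    # sign(v * x_j + b), with the convention sign(0) = 1
    return 1 if v * x[j] + b >= 0 else -1

# Example: in R^2, split on coordinate j = 0 at the threshold x_0 = 2
x = np.array([3.0, -1.0])
print(stump_predict(x, j=0, v=1, b=-2))  # prints 1, since x_0 = 3 >= 2
```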
---
<h3>Finding best "split"</h3>
<h4>(Given training data)</h4>
- <font size=+2>One approach ("empirical risk minimization"): find the $b$, $\upsilon$, $j$ that minimize the number of errors on the training data.</font>

----
<h3>Finding best "split"</h3>
<h4>(Given training data)</h4>

- <font size=+2>Fix $\upsilon$ and $j$. Each candidate hyperplane: the set of points with $j^{th}$ coordinate equal to $-b$.</font>
- <font size=+2>As $b$ decreases past a value where the hyperplane crosses a data point, the number of errors changes by exactly $1$.</font>
- <font size=+2>Order the points by their $j^{th}$ coordinate. Then the number of errors can be computed quickly for each stump with $-b$ equal to the average of two consecutive values of $x_{i,j}$ (see the sketch below).</font>
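
A sketch of this search, assuming labels in $\{-1,+1\}$ and an $n \times d$ array `X`; the name `best_stump` and the sentinel thresholds beyond the extreme points are illustrative choices. For simplicity it recomputes the error count at each threshold rather than updating it by $\pm 1$ along the sorted points, as the last bullet suggests one could.

```python
import numpy as np

def best_stump(X, y):
    """ERM search: return the (j, v, b) with the fewest training errors."""
    n, d = X.shape
    best_err, best_params = n + 1, None
    for j in range(d):
        xs = np.sort(X[:, j])
        # Candidate thresholds: midpoints of consecutive sorted coordinates,
        # plus one threshold below all points and one above all points.
        mids = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
        for v in (1, -1):
            for m in mids:
                b = -v * m  # places the split at x_j = m for either value of v
                pred = np.where(v * X[:, j] + b >= 0, 1, -1)  # sign(0) = 1
                err = int(np.sum(pred != y))
                if err < best_err:
                    best_err, best_params = err, (j, v, b)
    return best_params, best_err
```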
---
<h3>Decision trees</h3>
- <font size=+2>To build a decision tree from training data: first find the best split.</font>
- <font size=+2>Restrict to *only the points on one side* and find the best split for those; do the same on the other side.</font>

- <font size=+2 style="color:#181818;">Continue this way, stop when the nested splits place points into subsets with only one label.</font>
----
<h3>Decision trees</h3>
- <font size=+2>To build a decision tree from training data: first find the best split.</font>
- <font size=+2>Restrict to *only the points on one side* and find the best split for those; do the same on the other side.</font>

- <font size=+2>Continue this way, stopping when the nested splits place the points into subsets that each contain only one label (see the sketch below).</font>
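
A recursive sketch of this procedure, assuming a `best_stump(X, y)` routine like the one sketched earlier; the dictionary node representation and the majority-label fallback (for subsets that no stump can split further) are illustrative choices, not from the slides.

```python
import numpy as np

def grow_tree(X, y):
    """Recursively split the data until every subset has a single label."""
    if np.all(y == y[0]):
        return {"leaf": True, "label": int(y[0])}   # pure subset: stop
    (j, v, b), _ = best_stump(X, y)
    side = v * X[:, j] + b >= 0                     # positive side of the split
    if side.all() or (~side).all():                 # split makes no progress:
        maj = 1 if np.sum(y == 1) >= np.sum(y == -1) else -1
        return {"leaf": True, "label": maj}         # fall back to majority label
    return {"leaf": False, "j": j, "v": v, "b": b,
            "pos": grow_tree(X[side], y[side]),     # recurse on each side
            "neg": grow_tree(X[~side], y[~side])}

def tree_predict(node, x):
    """Follow splits from the root down to a leaf."""
    while not node["leaf"]:
        node = node["pos"] if node["v"] * x[node["j"]] + node["b"] >= 0 else node["neg"]
    return node["label"]
```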
---
<h3>Finding best "split" (alternate method)</h3>
- <font size=+2>The earlier method for choosing the split can perform poorly on unseen data when there is a significant imbalance between the two classes.</font>
- <font size=+2>Another method: maximize the *Information Gain* ($IG$), a number computed as follows.</font>
- <font size=+2 style="color:#181818;">"Entropy": $e(r) = -r\log(r)-(1-r)\log(1-r)$, for $0<r<1$.</font>
- <font size=+2 style="color:#181818;">Define: $r$ = the proportion of all points with label $+1$; $r_+$ = the proportion of points on the positive side of the split with label $+1$; $r_-$ defined in the same way for the negative side.</font>
- <font size=+2 style="color:#181818;">If $p_+$ is the proportion of all points on the positive side, and $p_-$ similarly, then $IG = e(r) - (p_+e(r_+)+p_-e(r_-))$.</font>
----
<h3>Finding best "split" (alternate method)</h3>
- <font size=+2>The earlier method for choosing the split can perform poorly on unseen data when there is a significant imbalance between the two classes.</font>
- <font size=+2>Another method: maximize the *Information Gain* ($IG$), a number computed as follows.</font>
- <font size=+2>"Entropy": $e(r) = -r\log(r)-(1-r)\log(1-r)$, for $0<r<1$.</font>
- <font size=+2 style="color:#181818;">Define: $r$ = the proportion of all points with label $+1$; $r_+$ = the proportion of points on the positive side of the split with label $+1$; $r_-$ defined in the same way for the negative side.</font>
- <font size=+2 style="color:#181818;">If $p_+$ is the proportion of all points on the positive side, and $p_-$ similarly, then $IG = e(r) - (p_+e(r_+)+p_-e(r_-))$.</font>
----
<h3>Finding best "split" (alternate method)</h3>
- <font size=+2>The earlier method for choosing the split can perform poorly on unseen data when there is a significant imbalance between the two classes.</font>
- <font size=+2>Another method: maximize the *Information Gain* ($IG$), a number computed as follows.</font>
- <font size=+2>"Entropy": $e(r) = -r\log(r)-(1-r)\log(1-r)$, for $0<r<1$.</font>
- <font size=+2>Define: $r$ = the proportion of all points with label $+1$; $r_+$ = the proportion of points on the positive side of the split with label $+1$; $r_-$ defined in the same way for the negative side.</font>
- <font size=+2>If $p_+$ is the proportion of all points on the positive side, and $p_-$ similarly, then $IG = e(r) - (p_+e(r_+)+p_-e(r_-))$ (see the sketch below).</font>
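
A sketch of the $IG$ computation for one candidate split, following the definitions above and using the standard convention $e(0)=e(1)=0$ for pure subsets; the function names are illustrative.

```python
import numpy as np

def entropy(r):
    """e(r) = -r log(r) - (1-r) log(1-r), with e(0) = e(1) = 0."""
    if r <= 0.0 or r >= 1.0:
        return 0.0
    return -r * np.log(r) - (1 - r) * np.log(1 - r)

def information_gain(X, y, j, v, b):
    """IG of the split sign(v*x_j + b) on data X with labels y in {-1, +1}."""
    side = v * X[:, j] + b >= 0                  # points on the positive side
    r = np.mean(y == 1)                          # proportion of +1 labels overall
    p_pos, p_neg = np.mean(side), np.mean(~side)
    r_pos = np.mean(y[side] == 1) if side.any() else 0.0
    r_neg = np.mean(y[~side] == 1) if (~side).any() else 0.0
    return entropy(r) - (p_pos * entropy(r_pos) + p_neg * entropy(r_neg))
```

Choosing the $(j, \upsilon, b)$ that maximizes this quantity, in place of minimizing the raw error count, gives the alternate splitting rule.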
{"metaMigratedAt":"2023-06-15T22:26:44.937Z","metaMigratedFrom":"YAML","title":"Decision trees","breaks":true,"description":"View the slide with \"Slide Mode\".","contributors":"[{\"id\":\"da8891d8-b47c-4b6d-adeb-858379287e60\",\"add\":6972,\"del\":643}]"}