# More on decision trees
<!-- Put the link to this slide here so people can follow -->
slide: https://hackmd.io/@ccornwell/decision-trees2
---
<h3>Multiple class labels</h3>
- <font size=+2>Multi-class decision trees: like last time (with 2 labels), the goal is to partition the data so the innermost regions have one label (or close to it, if restricting the number of splits).</font>
- <font size=+2 style="color:#181818;">Deciding how to split: use Information Gain, but alter the entropy function.</font>
- <font size=+2 style="color:#181818;">Before: $e(r) = -r\log(r) - (1-r)\log(1-r)$. Had 2 classes, say $p_1 = r$ and $p_2 = 1-r$; $e$ becomes $-\left(p_1\log(p_1)+p_2\log(p_2)\right)$.</font>
- <font size=+2 style="color:#181818;">Classes $i=1,\ldots, k$. Say in a subset $S$ of the data, $p_i =$ proportion of points in $S$ with label $i$ (the probability of label $i$ in $S$). Let: $$H(S) = -\sum_{i=1}^k p_i\log(p_i).$$</font>
----
<h3>Multiple class labels</h3>
- <font size=+2>Multi-class decision trees: like last time (with 2 labels), the goal is to partition the data so the innermost regions have one label (or close to it, if restricting the number of splits).</font>
- <font size=+2>Deciding how to split: use Information Gain, but alter the entropy function.</font>
- <font size=+2>Before: $e(r) = -r\log(r) - (1-r)\log(1-r)$. Had 2 classes, say $p_1 = r$ and $p_2 = 1-r$; $e$ becomes $-\left(p_1\log(p_1)+p_2\log(p_2)\right)$.</font>
- <font size=+2 style="color:#181818;">Classes $i=1,\ldots, k$. Say in a subset $S$ of the data, $p_i =$ proportion of points in $S$ with label $i$ (the probability of label $i$ in $S$). Let: $$H(S) = -\sum_{i=1}^k p_i\log(p_i).$$</font>
----
<h3>Multiple class labels</h3>
- <font size=+2>Multi-class decision trees: like last time (with 2 labels), the goal is to partition the data so the innermost regions have one label (or close to it, if restricting the number of splits).</font>
- <font size=+2>Deciding how to split: use Information Gain, but alter the entropy function.</font>
- <font size=+2>Before: $e(r) = -r\log(r) - (1-r)\log(1-r)$. Had 2 classes, say $p_1 = r$ and $p_2 = 1-r$; $e$ becomes $-\left(p_1\log(p_1)+p_2\log(p_2)\right)$.</font>
- <font size=+2>Classes $i=1,\ldots, k$. Say in a subset $S$ of the data, $p_i =$ proportion of points in $S$ with label $i$ (the probability of label $i$ in $S$). Let: $$H(S) = -\sum_{i=1}^k p_i\log(p_i)$$ (a short sketch of computing $H$ is below).</font>
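- <font size=+2>A minimal sketch of computing this multi-class entropy with NumPy (the helper name `entropy` and the base-2 log are choices made here, not fixed by the slides):</font>

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum_i p_i*log(p_i), using the class proportions p_i within a region S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()        # proportions p_i of each class present in S
    return -np.sum(p * np.log2(p))   # classes absent from S contribute nothing

entropy(["a", "a", "b", "c"])        # proportions 1/2, 1/4, 1/4  ->  1.5 bits
```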
---
<h3>Multiple class labels</h3>
- <font size=+2>When determining a split of a region $S$ in the decision tree, let $S_{\pm}$ be the two subregions created by the split.</font>
- <font size=+2>Information Gain $IG$ is computed exactly as before, but with the new entropy: $IG=H(S) - \left(p_+H(S_+)+p_-H(S_-)\right)$, where $p_{\pm} = |S_{\pm}|/|S|$.</font>
- <font size=+2 style="color:#181818;">Nice example of computing $IG$ of a split with 3 classes on pages 6-7 of the "Predictive Medicine" presentation.</font>
----
<h3>Multiple class labels</h3>
- <font size=+2>When determining a split of a region $S$ in the decision tree, let $S_{\pm}$ be the two subregions created by the split.</font>
- <font size=+2>Information Gain $IG$ is computed exactly as before, but with the new entropy: $IG=H(S) - \left(p_+H(S_+)+p_-H(S_-)\right)$, where $p_{\pm} = |S_{\pm}|/|S|$.</font>
- <font size=+2>Nice example of computing $IG$ of a split with 3 classes on pages 6-7 of the "Predictive Medicine" presentation; a small made-up example is sketched below.</font>
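- <font size=+2>Not the "Predictive Medicine" example, but an illustrative 3-class computation of $IG$, reusing the `entropy` helper sketched earlier:</font>

```python
def information_gain(S, S_plus, S_minus):
    """IG = H(S) - (p_+ H(S_+) + p_- H(S_-)), with p_+ = |S_+|/|S| and p_- = |S_-|/|S|."""
    p_plus, p_minus = len(S_plus) / len(S), len(S_minus) / len(S)
    return entropy(S) - (p_plus * entropy(S_plus) + p_minus * entropy(S_minus))

# A split that isolates class "c" from classes "a" and "b":
S = ["a", "a", "b", "b", "c", "c"]
information_gain(S, S_plus=["a", "a", "b", "b"], S_minus=["c", "c"])   # ≈ 0.918
```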
---
<h3>Regression trees</h3>
- <font size=+2>Can the idea of decision trees be used to create a regression model? How?</font>
- <font size=+2>First, keep the goal of getting a partition so each region has similar labels -- but it's not just a handful of labels anymore: the label $y$ comes from some interval in $\mathbb R$.</font>
- <font size=+2 style="color:#181818;">Information Gain and counting errors are no good here: a given training point is likely the *only one* with its label.</font>
- <font size=+2 style="color:#181818;">Instead, use "MSE" (average of variances) of labels within a region.</font>
- <font size=+2 style="color:#181818;">Say a split will separate $S$ into $S_{\pm}$, and let $y_i$ be label of $i^{th}$ training point. Avg. of variances within $S_+$ is $$AV(S_+) = \frac{1}{2|S_+|^2}\sum_{i\in S_+}\sum_{j\in S_+}(y_i - y_j)^2.$$</font>
----
<h3>Regression trees</h3>
- <font size=+2>Can the idea of decision trees be used to create a regression model? How?</font>
- <font size=+2>First, keep the goal of getting a partition so each region has similar labels -- but it's not just a handful of labels anymore: the label $y$ comes from some interval in $\mathbb R$.</font>
- <font size=+2>Information Gain and counting errors are no good here: a given training point is likely the *only one* with its label.</font>
- <font size=+2>Instead, use "MSE" (average of variances) of labels within a region.</font>
- <font size=+2 style="color:#181818;">Say a split will separate $S$ into $S_{\pm}$, and let $y_i$ be label of $i^{th}$ training point. Avg. of variances within $S_+$ is $$AV(S_+) = \frac{1}{2|S_+|^2}\sum_{i\in S_+}\sum_{j\in S_+}(y_i - y_j)^2.$$</font>
----
<h3>Regression trees</h3>
- <font size=+2>Can the idea of decision trees be used to create a regression model? How?</font>
- <font size=+2>First, keep the goal of getting a partition so each region has similar labels -- but it's not just a handful of labels anymore: the label $y$ comes from some interval in $\mathbb R$.</font>
- <font size=+2>Information Gain and counting errors are no good here: a given training point is likely the *only one* with its label.</font>
- <font size=+2>Instead, use "MSE" (average of variances) of labels within a region.</font>
- <font size=+2>Say a split will separate $S$ into $S_{\pm}$, and let $y_i$ be label of $i^{th}$ training point. Avg. of variances within $S_+$ is $$AV(S_+) = \frac{1}{2|S_+|^2}\sum_{i\in S_+}\sum_{j\in S_+}(y_i - y_j)^2.$$</font>
----
<h3>Regression trees</h3>
- <font size=+2>Instead, use "MSE" (average of variances) of labels within a region.</font>
- <font size=+2>Say a split will separate $S$ into $S_{\pm}$, and let $y_i$ be label of $i^{th}$ training point. Avg. of variances within $S_+$ is $$AV(S_+) = \frac{1}{2|S_+|^2}\sum_{i\in S_+}\sum_{j\in S_+}(y_i - y_j)^2.$$</font>
- <font size=+2>Over all splits, minimize weighted sum of average variances: $$\frac{|S_+|}{|S|}AV(S_+) + \frac{|S_-|}{|S|}AV(S_-).$$</font>
- <font size=+2>This is the default way that `DecisionTreeRegressor`, in the Python package `sklearn`, chooses its splits (a rough sketch follows below).</font>
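- <font size=+2>A rough sketch of scoring candidate splits this way (the helpers `av` and `split_cost` and the tiny 1-D data are illustrative, not taken from `sklearn`):</font>

```python
import numpy as np

def av(y):
    """AV(S) = (1 / 2|S|^2) * sum_{i,j} (y_i - y_j)^2, which equals the variance of the labels in S."""
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)

def split_cost(y_plus, y_minus):
    """Weighted sum (|S_+|/|S|) AV(S_+) + (|S_-|/|S|) AV(S_-); the split minimizing this is chosen."""
    n = len(y_plus) + len(y_minus)
    return len(y_plus) / n * av(y_plus) + len(y_minus) / n * av(y_minus)

# Scan thresholds on a single feature x and keep the split with the smallest weighted AV:
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.9, 1.1, 1.0, 4.0, 4.2])
costs = {t: split_cost(y[x <= t], y[x > t]) for t in x[:-1]}
best_t = min(costs, key=costs.get)   # x <= 3.0 separates the low labels from the high ones
```

- <font size=+2>In `sklearn`, this is the quantity minimized at each split by `DecisionTreeRegressor`'s default `criterion="squared_error"` (called `"mse"` in older versions).</font>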
---
<h3>
Discussion
</h3>
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
{"metaMigratedAt":"2023-06-15T22:31:01.096Z","metaMigratedFrom":"YAML","title":"More about decision trees","breaks":true,"description":"View the slide with \"Slide Mode\".","contributors":"[{\"id\":\"da8891d8-b47c-4b6d-adeb-858379287e60\",\"add\":7031,\"del\":579}]"}