# More on decision trees
<!-- Put the link to this slide here so people can follow -->
slide: https://hackmd.io/@ccornwell/decision-trees2
---
<h3>Multiple class labels</h3>
- <font size=+2>Multi-class decision trees: like last time (with 2 labels), the goal is to partition the data so the innermost regions have one label (or close to it, if restricting the number of splits).</font>
- <font size=+2 style="color:#181818;">Deciding how to split: use Information Gain, but alter the entropy function.</font>
- <font size=+2 style="color:#181818;">Before: $e(r) = -r\log(r) - (1-r)\log(1-r)$. Had 2 classes, say $p_1 = r$ and $p_2 = 1-r$; $e$ becomes $-\left(p_1\log(p_1)+p_2\log(p_2)\right)$.</font>
- <font size=+2 style="color:#181818;">Classes $i=1,\ldots, k$. Say in a subset $S$ of the data, $p_i =$ proportion of points in $S$ with label $i$ (the probability of label $i$ in $S$). Let: $$H(S) = -\sum_{i=1}^k p_i\log(p_i).$$</font>
----
<h3>Multiple class labels</h3>
- <font size=+2>Multi-class decision trees: like last time (with 2 labels), the goal is to partition the data so the innermost regions have one label (or close to it, if restricting the number of splits).</font>
- <font size=+2>Deciding how to split: use Information Gain, but alter the entropy function.</font>
- <font size=+2>Before: $e(r) = -r\log(r) - (1-r)\log(1-r)$. Had 2 classes, say $p_1 = r$ and $p_2 = 1-r$; $e$ becomes $-\left(p_1\log(p_1)+p_2\log(p_2)\right)$.</font>
- <font size=+2 style="color:#181818;">Classes $i=1,\ldots, k$. Say in a subset $S$ of the data, $p_i =$ proportion of points in $S$ with label $i$ (the probability of label $i$ in $S$). Let: $$H(S) = -\sum_{i=1}^k p_i\log(p_i).$$</font>
----
<h3>Multiple class labels</h3>
- <font size=+2>Multi-class decision trees: like last time (with 2 labels), the goal is to partition the data so the innermost regions have one label (or close to it, if restricting the number of splits).</font>
- <font size=+2>Deciding how to split: use Information Gain, but alter the entropy function.</font>
- <font size=+2>Before: $e(r) = -r\log(r) - (1-r)\log(1-r)$. Had 2 classes, say $p_1 = r$ and $p_2 = 1-r$; $e$ becomes $-\left(p_1\log(p_1)+p_2\log(p_2)\right)$.</font>
- <font size=+2>Classes $i=1,\ldots, k$. Say in a subset $S$ of the data, $p_i =$ proportion of points in $S$ with label $i$ (the probability of label $i$ in $S$). Let: $$H(S) = -\sum_{i=1}^k p_i\log(p_i)$$ (a short sketch of computing $H$ is below).</font>
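- <font size=+2>A minimal sketch of computing this multi-class entropy with NumPy (the helper name `entropy` and the base-2 log are choices made here, not fixed by the slides):</font>

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum_i p_i*log(p_i), using the class proportions p_i within a region S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()        # proportions p_i of each class present in S
    return -np.sum(p * np.log2(p))   # classes absent from S contribute nothing

entropy(["a", "a", "b", "c"])        # proportions 1/2, 1/4, 1/4  ->  1.5 bits
```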
---
<h3>Multiple class labels</h3>
- <font size=+2>When determining a split of a region $S$ in the decision tree, let $S_{\pm}$ be the two subregions created by the split.</font>
- <font size=+2>Information Gain $IG$ is computed exactly as before, but with the new entropy: $IG=H(S) - \left(p_+H(S_+)+p_-H(S_-)\right)$, where $p_{\pm} = |S_{\pm}|/|S|$.</font>
- <font size=+2 style="color:#181818;">Nice example of computing $IG$ of a split with 3 classes on pages 6-7 of the "Predictive Medicine" presentation.</font>
----
<h3>Multiple class labels</h3>
- <font size=+2>When determining a split of a region $S$ in the decision tree, let $S_{\pm}$ be the two subregions created by the split.</font>
- <font size=+2>Information Gain $IG$ is computed exactly as before, but with the new entropy: $IG=H(S) - \left(p_+H(S_+)+p_-H(S_-)\right)$, where $p_{\pm} = |S_{\pm}|/|S|$.</font>
- <font size=+2>Nice example of computing $IG$ of a split with 3 classes on pages 6-7 of the "Predictive Medicine" presentation; a small made-up example is sketched below.</font>
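- <font size=+2>Not the "Predictive Medicine" example, but an illustrative 3-class computation of $IG$, reusing the `entropy` helper sketched earlier:</font>

```python
def information_gain(S, S_plus, S_minus):
    """IG = H(S) - (p_+ H(S_+) + p_- H(S_-)), with p_+ = |S_+|/|S| and p_- = |S_-|/|S|."""
    p_plus, p_minus = len(S_plus) / len(S), len(S_minus) / len(S)
    return entropy(S) - (p_plus * entropy(S_plus) + p_minus * entropy(S_minus))

# A split that isolates class "c" from classes "a" and "b":
S = ["a", "a", "b", "b", "c", "c"]
information_gain(S, S_plus=["a", "a", "b", "b"], S_minus=["c", "c"])   # ≈ 0.918
```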
---
<h3>Regression trees</h3>
- <font size=+2>Can the idea of decision trees be used to create a regression model? How?</font>
- <font size=+2>First, keep the goal of getting a partition so each region has similar labels -- but it's not just a handful of labels anymore: the label $y$ comes from some interval in $\mathbb R$.</font>
- <font size=+2 style="color:#181818;">Information Gain and counting errors are no good here: a given training point is likely the *only one* with its label.</font>
- <font size=+2 style="color:#181818;">Instead, use "MSE" (average of variances) of labels within a region.</font>
- <font size=+2 style="color:#181818;">Say a split will separate $S$ into $S_{\pm}$, and let $y_i$ be label of $i^{th}$ training point. Avg. of variances within $S_+$ is $$AV(S_+) = \frac{1}{2|S_+|^2}\sum_{i\in S_+}\sum_{j\in S_+}(y_i - y_j)^2.$$</font>
----
<h3>Regression trees</h3>
- <font size=+2>Can the idea of decision trees be used to create a regression model? How?</font>
- <font size=+2>First, keep the goal of getting a partition so each region has similar labels -- but it's not just a handful of labels anymore: the label $y$ comes from some interval in $\mathbb R$.</font>
- <font size=+2>Information Gain and counting errors are no good here: a given training point is likely the *only one* with its label.</font>
- <font size=+2>Instead, use "MSE" (average of variances) of labels within a region.</font>
- <font size=+2 style="color:#181818;">Say a split will separate $S$ into $S_{\pm}$, and let $y_i$ be label of $i^{th}$ training point. Avg. of variances within $S_+$ is $$AV(S_+) = \frac{1}{2|S_+|^2}\sum_{i\in S_+}\sum_{j\in S_+}(y_i - y_j)^2.$$</font>
----
<h3>Regression trees</h3>
- <font size=+2>Can the idea of decision trees be used to create a regression model? How?</font>
- <font size=+2>First, keep the goal of getting a partition so each region has similar labels -- but it's not just a handful of labels anymore: the label $y$ comes from some interval in $\mathbb R$.</font>
- <font size=+2>Information Gain and counting errors are no good here: a given training point is likely the *only one* with its label.</font>
- <font size=+2>Instead, use "MSE" (average of variances) of labels within a region.</font>
- <font size=+2>Say a split will separate $S$ into $S_{\pm}$, and let $y_i$ be label of $i^{th}$ training point. Avg. of variances within $S_+$ is $$AV(S_+) = \frac{1}{2|S_+|^2}\sum_{i\in S_+}\sum_{j\in S_+}(y_i - y_j)^2.$$</font>
----
<h3>Regression trees</h3>
- <font size=+2>Instead, use "MSE" (average of variances) of labels within a region.</font>
- <font size=+2>Say a split will separate $S$ into $S_{\pm}$, and let $y_i$ be label of $i^{th}$ training point. Avg. of variances within $S_+$ is $$AV(S_+) = \frac{1}{2|S_+|^2}\sum_{i\in S_+}\sum_{j\in S_+}(y_i - y_j)^2.$$</font>
- <font size=+2>Over all splits, minimize weighted sum of average variances: $$\frac{|S_+|}{|S|}AV(S_+) + \frac{|S_-|}{|S|}AV(S_-).$$</font>
- <font size=+2>This is the default way that `DecisionTreeRegressor`, in the Python package `sklearn`, chooses its splits (a rough sketch follows below).</font>
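- <font size=+2>A rough sketch of scoring candidate splits this way (the helpers `av` and `split_cost` and the tiny 1-D data are illustrative, not taken from `sklearn`):</font>

```python
import numpy as np

def av(y):
    """AV(S) = (1 / 2|S|^2) * sum_{i,j} (y_i - y_j)^2, which equals the variance of the labels in S."""
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)

def split_cost(y_plus, y_minus):
    """Weighted sum (|S_+|/|S|) AV(S_+) + (|S_-|/|S|) AV(S_-); the split minimizing this is chosen."""
    n = len(y_plus) + len(y_minus)
    return len(y_plus) / n * av(y_plus) + len(y_minus) / n * av(y_minus)

# Scan thresholds on a single feature x and keep the split with the smallest weighted AV:
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.9, 1.1, 1.0, 4.0, 4.2])
costs = {t: split_cost(y[x <= t], y[x > t]) for t in x[:-1]}
best_t = min(costs, key=costs.get)   # x <= 3.0 separates the low labels from the high ones
```

- <font size=+2>In `sklearn`, this is the quantity minimized at each split by `DecisionTreeRegressor`'s default `criterion="squared_error"` (called `"mse"` in older versions).</font>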
---
<h3>
Discussion
</h3>
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
{"metaMigratedAt":"2023-06-15T22:31:01.096Z","metaMigratedFrom":"YAML","title":"More about decision trees","breaks":true,"description":"View the slide with \"Slide Mode\".","contributors":"[{\"id\":\"da8891d8-b47c-4b6d-adeb-858379287e60\",\"add\":7031,\"del\":579}]"}