---
title: 'Additive Models and Tree-based Models'
disqus: YY
---
Additive Models and Tree-based Models
===



## Table of Contents
---
---
[TOC]
## 1.1 Generalized Additive Models
Linear models are simple and play an important role in much of data analysis, but many relationships in real data are not linear. Here we discuss a family of highly flexible statistical models that are commonly used to identify and characterize nonlinear regression effects. These models are called Generalized Additive Models (GAMs).
(To be continued; see p. 295.)
## 1.2 Tree-Based Models
Tree-based models partition the feature space into a set of rectangles and then fit a simple model (often just a constant) in each one.
The most popular tree-based method is CART (Classification And Regression Trees). Later we will also mention another common tree-based method, C4.5.
Consider a regression problem with response $Y$ and inputs $X_1$ and $X_2$, each taking values in the unit interval. In the top-left panel of the figure, we first split at $X_1=t_1$. Then the region $X_1≤t_1$ is split at $X_2=t_2$, and the region $X_1>t_1$ is split at $X_1=t_3$. Finally, the region $X_1>t_3$ is split at $X_2=t_4$. The result is a partition into five regions $R_1, R_2, R_3, R_4, R_5$. The corresponding regression model predicts $Y$ with:
$\begin{equation}\hat{f}(X)=\sum_{m=1}^{5}c_mI\{(X_1, X_2)∈R_m\}\end{equation} \tag{1.1}$
where $c_m$ is a constant and $R_m$ is the $m$-th region.
The same model can be represented by the binary tree in the bottom-right panel.
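To make the piecewise-constant form of $(1.1)$ concrete, here is a small Python sketch written for this note: the split points $t_1,\dots,t_4$, the constants $c_1,\dots,c_5$, and the way the labels $R_1,\dots,R_5$ are assigned to the rectangles are all made-up illustrative choices, not values from the text.
```Python
# Hypothetical split points and per-region constants (illustrative values only)
t1, t2, t3, t4 = 0.4, 0.5, 0.7, 0.6
c = {1: 1.0, 2: 2.5, 3: 0.5, 4: 3.0, 5: 4.5}

def region(x1, x2):
    # Reproduces the partition described above:
    # first split on X1 at t1, then split each side further.
    if x1 <= t1:
        return 1 if x2 <= t2 else 2        # R1, R2
    if x1 <= t3:
        return 3                           # R3
    return 4 if x2 <= t4 else 5            # R4, R5

def f_hat(x1, x2):
    # Equation (1.1): a sum of constants times region indicators,
    # i.e. simply the constant of the region the point falls into.
    return c[region(x1, x2)]

print(f_hat(0.2, 0.9))   # falls in R2 -> 2.5
```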

A key advantage of a decision tree is its interpretability: the partition of the feature space is fully described by a single tree. With more than two inputs it becomes hard to draw how the feature space is partitioned (as in the top-right panel), but the tree in the bottom-right panel still describes the splits, even in higher dimensions. The tree stratifies the observations into groups with high and low outcomes according to the input variables.
### 1.2.1 Regression Trees
We now describe how to grow a regression tree. Suppose the data consist of $p$ inputs and one response, for $N$ observations: $(x_i, y_i)$ for $i=1,2,\ldots,N$, with $x_i=(x_{i1}, x_{i2},\ldots, x_{ip})$. The algorithm needs to decide:
1. which input variable to split on,
2. where to split that variable, and
3. what topology (shape) the tree should have.
Suppose first that we have partitioned the feature space into $M$ regions $R_1, R_2,\ldots, R_M$, and that we model the response as a constant $c_m$ in each region:
\begin{equation}f(x) = \sum_{m=1}^{M}c_mI(x∈R_m) \tag{1.2}\end{equation}
If we adopt minimization of the sum of squared errors (i.e. $\sum{(y_i-f(x_i))^2}$) as the criterion, it is easy to see that the best $\hat{c}_m$ is simply the average of $y_i$ over region $R_m$ (**the mean is the constant that minimizes the sum of squared deviations**):
$\begin{equation}\hat{c}_m=\operatorname{ave}(y_i\mid x_i∈R_m) \tag{1.3}\end{equation}$
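One line of calculus makes the bold claim explicit: within a fixed region $R_m$, treating the fitted constant $c$ as a free parameter and setting the derivative of the squared-error criterion to zero gives
\begin{equation}\frac{\partial}{\partial c}\sum_{x_i∈R_m}(y_i-c)^2=-2\sum_{x_i∈R_m}(y_i-c)=0 \quad\Longrightarrow\quad c=\frac{1}{N_m}\sum_{x_i∈R_m}y_i,\end{equation}
which is exactly the region average in $(1.3)$ (here $N_m$ is the number of points in $R_m$, as defined later in $(1.7)$).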
Finding the best binary partition in terms of minimum sum of squares is generally computationally infeasible, so we proceed with a greedy algorithm. Starting with all of the data, consider a splitting variable $j$ and split point $s$, and define the pair of half-planes:
$\begin{equation}R_1(j,s)=\{X\mid X_j≤s\} \quad\text{and}\quad R_2(j, s)=\{X\mid X_j>s\} \tag{1.4}\end{equation}$
**We then seek, by exhaustive search, the splitting variable $j$ and split point $s$ that minimize the sum of squared differences between the observed and fitted values.** This gives the following objective:
\begin{equation}\min_{j,s}[\min_{c_1}\sum_{x_i∈R_1(j,s)}(y_i-c_1)^2 \space + \min_{c_2}\sum_{x_i∈R_2(j,s)}(y_i-c_2)^2] \tag{1.5}\end{equation}
For any choice of $j$ and $s$, the two inner minimizations inside the brackets of (1.5) are solved by (1.6):
\begin{equation}\hat{c}_1=\operatorname{ave}(y_i\mid x_i∈R_1(j,s)) \quad\text{and}\quad \hat{c}_2=\operatorname{ave}(y_i\mid x_i∈R_2(j,s))\tag{1.6}\end{equation}
For each splitting variable $j$, the split point $s$ can be found very quickly by scanning through the data points of that variable, so determining the best pair $(j,s)$ is feasible. This answers the question of which variable to split on at the current node and where to split it. Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each region, until the whole feature space (over all input variables) has been partitioned.
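The following is a minimal sketch of this greedy search for a regression split (a plain NumPy implementation written for this note, not the book's code): it scans every variable $j$ and every candidate split point $s$, computes the two region means as in $(1.6)$, and keeps the pair that minimizes the criterion $(1.5)$.
```Python
import numpy as np

def best_split(X, y):
    # X: (N, p) array of inputs, y: (N,) array of responses
    best = (None, None, np.inf)          # (j, s, sum of squared errors)
    N, p = X.shape
    for j in range(p):                   # candidate splitting variable
        values = np.unique(X[:, j])
        # candidate split points: midpoints between consecutive unique values
        for s in (values[:-1] + values[1:]) / 2:
            left = y[X[:, j] <= s]       # responses in R1(j, s)
            right = y[X[:, j] > s]       # responses in R2(j, s)
            # (1.6): the optimal constants are the region means
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))
y = np.where(X[:, 0] <= 0.5, 1.0, 3.0) + rng.normal(scale=0.1, size=100)
print(best_split(X, y))                  # should recover a split on variable 0 near 0.5
```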
#### How large should we grow the tree? A very deep tree will clearly overfit, while a very shallow tree will underfit.
Tree size is a tuning parameter governing the model's complexity, and the optimal tree size should be chosen adaptively from the data.
1. One approach is to split a node only if **the split (both the variable and the split point) reduces the sum of squared errors by more than some threshold**. This strategy is too short-sighted, however: a seemingly worthless split may lead to a very good split below it.
2. The preferred strategy is to **grow a large tree $T_0$, stopping the splitting process only when some minimum node size (say 5) is reached. This large tree $T_0$ is then pruned using "cost-complexity pruning"**, described next.
We define a subtree $T\subset{T_0}$ to be any tree that can be obtained by pruning $T_0$, that is, by collapsing any number of its internal (non-terminal) nodes. We index the terminal nodes by $m$, with node $m$ representing region $R_m$, and let $|T|$ denote the number of terminal nodes in $T$. Define the three quantities in $(1.7)$:
\begin{equation}\begin{aligned}
N_m=\#\{x_i∈R_m\} \\
\hat{c}_m=\frac{1}{N_m}\sum_{x_i∈R_m}y_i \\
Q_m(T)=\frac{1}{N_m}\sum_{x_i∈R_m}(y_i-\hat{c}_m)^2 \\
\end{aligned}
\tag{1.7}\end{equation}
We then define the "cost-complexity criterion" as:
\begin{equation}C_{\alpha}(T)=\sum_{m=1}^{|T|}N_mQ_m(T)+\alpha|T| \tag{1.8}\end{equation}
(The central idea is that **for each $\alpha$, we seek the subtree $T_{\alpha}\subset{T_0}$ that minimizes $C_{\alpha}(T)$**.)
Here $\alpha$ is the tuning parameter; it governs **the tradeoff between tree size and the tree's goodness of fit to the data**.
**$\alpha ≥ 0$, and larger values of $\alpha$ result in smaller trees**; with $\alpha=0$ the full tree $T_0$ is not pruned at all.
For each $\alpha$ one can show that there is a unique smallest subtree $T_{\alpha}$ that minimizes $C_{\alpha}(T)$. To find $T_{\alpha}$ we use **weakest link pruning** (a scikit-learn sketch follows after this list):
1. Successively collapse the internal node that produces the smallest per-node increase in $\sum_m N_m Q_m(T)$, and continue until we are left with the single-node (root) tree.
2. This gives a finite sequence of subtrees, and one can show that this sequence must contain $T_{\alpha}$. The value of ${\alpha}$ is estimated by 5- or 10-fold cross-validation: we choose the $\hat{\alpha}$ that minimizes the cross-validated sum of squared errors. The final tree is $T_{\hat{\alpha}}$.
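As a practical illustration (not part of the original text), the sketch below uses scikit-learn's minimal cost-complexity pruning, which implements this $C_\alpha(T)$ criterion: `cost_complexity_pruning_path` returns the sequence of effective $\alpha$ values produced by weakest-link pruning, and cross-validation then picks $\hat\alpha$. It assumes scikit-learn is installed and uses synthetic data.
```Python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Grow a large tree T_0 (stop only at a small minimum node size, e.g. 5)
big_tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)

# Weakest-link pruning path: effective alphas and subtree impurities
path = big_tree.cost_complexity_pruning_path(X, y)

# Pick alpha-hat by 5-fold cross-validation over the path
scores = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0, ccp_alpha=alpha)
    mse = -cross_val_score(tree, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    scores.append(mse)

best_alpha = path.ccp_alphas[int(np.argmin(scores))]
final_tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0,
                                   ccp_alpha=best_alpha).fit(X, y)
print("alpha-hat:", best_alpha, "terminal nodes:", final_tree.get_n_leaves())
```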
### 1.2.2 Classification Trees
If the target is not a continuous variable but a categorical one taking values $1, 2, \ldots, K$, the only changes needed in the algorithm concern the criteria for "which variable to split on at the current node" and "where to split it", as well as for pruning the tree.
**For a regression tree, this criterion is the squared-error node impurity, $Q_m(T)$ defined in $(1.7)$**, but this is not suitable for classification. In a node $m$, representing a region $R_m$ with $N_m$ observations, let
\begin{equation}\hat{p}_{mk}=\frac{1}{N_m}\sum_{x_i∈R_m}I(y_i=k) \tag{1.9}\end{equation}
be the proportion of class $k$ observations in node $m$. We classify the observations in node $m$ to class $k(m)=\operatorname{argmax}_{k}\hat{p}_{mk}$, the majority class in node $m$. Different measures $Q_m(T)$ of node impurity include:
1. Misclassification Error: $\frac{1}{N_m}\sum_{i∈R_m}I(y_i≠k(m))=1-\hat{p}_{mk(m)}$
2. Gini Index: $\sum_{k≠k'}\hat{p}_{mk}\hat{p}_{mk'}=\sum_{k=1}^{K}\hat{p}_{mk}(1-\hat{p}_{mk})$
3. Cross-Entropy or Deviance: $-\sum_{k=1}^{K}\hat{p}_{mk}\log\hat{p}_{mk}$ $\tag{1.10}$
For a two-class problem, if $p$ is the proportion in the second class, these three measures are, respectively:
1. $1-\max\{p, 1-p\}$
2. $2p(1-p)$
3. $-p\log{p} - (1-p)\log{(1-p)}$
Let us look at how these three measures vary with $p$:
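(In place of the figure, here is a small matplotlib sketch written for this note that plots the three two-class impurity measures above against $p$; the natural logarithm is used for the entropy curve, which is one common convention.)
```Python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)

misclass = 1 - np.maximum(p, 1 - p)                 # 1 - max{p, 1-p}
gini = 2 * p * (1 - p)                              # 2p(1-p)
entropy = -p * np.log(p) - (1 - p) * np.log(1 - p)  # -p log p - (1-p) log(1-p)

plt.plot(p, misclass, label="Misclassification error")
plt.plot(p, gini, label="Gini index")
plt.plot(p, entropy, label="Cross-entropy")
plt.xlabel("p (proportion in class 2)")
plt.ylabel("Node impurity")
plt.legend()
plt.show()
```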

When growing the tree by $(1.5)$ and evaluating subtrees as in $(1.7)$, the node impurity measures need to be weighted by $N_{m_L}$ and $N_{m_R}$, the numbers of observations in the left and right child nodes created by splitting node $m$.
#### 1.2.2.1 Worked example: Gini Index for a Classification Tree
1. Suppose we have the following dataset, where "Decision" is the target variable:

Tally how the values of the Outlook variable relate to Decision, as shown below:

Now compute the proportion of class-$k$ observations in node $m$. Let "Decision = Yes" be the first class ($k=1$) and "Decision = No" be the second class ($k=2$). Likewise, take "Outlook = Sunny" as $m=1$, "Outlook = Overcast" as $m=2$, and "Outlook = Rain" as $m=3$.
1. Following $(1.9)$, count the number of data points $N_m$ in region $R_m$ at node $m$. "Sunny" appears 5 times in the dataset, so $N_1=5$.
2. For $m=1$:
(1). Compute $\hat{p}_{mk}$ for $k=1$: \begin{equation} \hat{p}_{11}=\frac{1}{N_1}\sum_{x_i∈R_1}I(y_i=1)=\frac{1}{N_1}\sum_{i=1}^{N_1=5} \begin{cases}
1, & \text{if } y_i = 1 \text{ (i.e. Decision = Yes)},\\
0, & \text{if } y_i ≠ 1 \text{ (i.e. Decision ≠ Yes)}, \end{cases}\\
= 2/5 \quad\text{(2 of the 5 "Outlook = Sunny" data points have "Decision = Yes", hence } 2/5\text{)}\end{equation}
(2). Compute $\hat{p}_{mk}$ for $k=2$: \begin{equation} \hat{p}_{12}=\frac{1}{N_1}\sum_{x_i∈R_1}I(y_i=2)=\frac{1}{N_1}\sum_{i=1}^{N_1=5} \begin{cases}
1, & \text{if } y_i = 2 \text{ (i.e. Decision = No)},\\
0, & \text{if } y_i ≠ 2 \text{ (i.e. Decision ≠ No)}, \end{cases}\\
= 3/5 \quad\text{(3 of the 5 "Outlook = Sunny" data points have "Decision = No", hence } 3/5\text{)}\end{equation}
3. Compute $Gini \space Index(m=1)$:
$Gini(Outlook=Sunny) = \sum_{k≠k'}\hat{p}_{mk}\hat{p}_{mk'}=\sum_{k=1}^{K}\hat{p}_{mk}(1-\hat{p}_{mk})=\sum_{k=1}^{K=2}\hat{p}_{1k}(1-\hat{p}_{1k})\\ =\hat{p}_{11}\times\hat{p}_{12}+\hat{p}_{12}\times\hat{p}_{11}=(2/5 \times 3/5) + (3/5 \times 2/5) = 12/25=0.48$
4. Similarly, compute $Gini \space Index(m=2)$:
$Gini(Outlook=Overcast) = \sum_{k≠k'}\hat{p}_{mk}\hat{p}_{mk'}=\sum_{k=1}^{K}\hat{p}_{mk}(1-\hat{p}_{mk})=\sum_{k=1}^{K=2}\hat{p}_{2k}(1-\hat{p}_{2k})\\ =\hat{p}_{21}\times\hat{p}_{22}+\hat{p}_{22}\times\hat{p}_{21}=(0/4 \times 4/4 + 4/4 \times 0/4) = 0$
5. Similarly, compute $Gini \space Index(m=3)$ (a short script verifying all three values follows after this example):
$Gini(Outlook=Rain) = \sum_{k≠k'}\hat{p}_{mk}\hat{p}_{mk'}=\sum_{k=1}^{K}\hat{p}_{mk}(1-\hat{p}_{mk})=\sum_{k=1}^{K=2}\hat{p}_{3k}(1-\hat{p}_{3k})\\ =\hat{p}_{31}\times\hat{p}_{32}+\hat{p}_{32}\times\hat{p}_{31}=(3/5 \times 2/5 + 2/5 \times 3/5) = 12/25 = 0.48$
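A quick script (written for this note, with the class counts hard-coded from the table above) to verify these Gini values:
```Python
def gini(counts):
    # counts: number of observations of each class in the node
    n = sum(counts)
    props = [c / n for c in counts]
    return sum(p * (1 - p) for p in props)   # Gini index, as in (1.10)

print(gini([2, 3]))  # Outlook = Sunny:    (Yes, No) = (2, 3) -> 0.48
print(gini([4, 0]))  # Outlook = Overcast: (Yes, No) = (4, 0) -> 0.0
print(gini([3, 2]))  # Outlook = Rain:     (Yes, No) = (3, 2) -> 0.48
```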
#### 1.2.2.2 Worked example: Cross-Entropy for a Decision Tree
Steps 1 and 2 are the same as in the Gini example above.
1. Following $(1.9)$, count the number of data points $N_m$ in region $R_m$ at node $m$. "Sunny" appears 5 times in the dataset, so $N_1=5$.
2. For $m=1$:
(1). Compute $\hat{p}_{mk}$ for $k=1$: \begin{equation} \hat{p}_{11}=\frac{1}{N_1}\sum_{x_i∈R_1}I(y_i=1) \\
=\frac{1}{N_1}\sum_{i=1}^{N_1=5} \begin{cases}
1, & \text{if } y_i = 1 \text{ (i.e. Decision = Yes)},\\
0, & \text{if } y_i ≠ 1 \text{ (i.e. Decision ≠ Yes)}, \end{cases}\\
= 2/5 \quad\text{(2 of the 5 "Outlook = Sunny" data points have "Decision = Yes")}\end{equation}
(2). Compute $\hat{p}_{mk}$ for $k=2$: \begin{equation} \hat{p}_{12}=\frac{1}{N_1}\sum_{x_i∈R_1}I(y_i=2) \\
=\frac{1}{N_1}\sum_{i=1}^{N_1=5} \begin{cases}
1, & \text{if } y_i = 2 \text{ (i.e. Decision = No)},\\
0, & \text{if } y_i ≠ 2 \text{ (i.e. Decision ≠ No)}, \end{cases}\\
= 3/5 \quad\text{(3 of the 5 "Outlook = Sunny" data points have "Decision = No")}\end{equation}
3. Compute $Cross-Entropy (m=1)$:
$Cross-Entropy(Outlook=Sunny) = -\sum_{k=1}^{K=2}{\hat{p}_{1k}\log{\hat{p}_{1k}}}\\
=-[\hat{p}_{11}\log{\hat{p}_{11}} + \hat{p}_{12}\log{\hat{p}_{12}}]\\
=-[2/5 \times \log(2/5) + 3/5 \times \log(3/5)]\\
≈0.971$
4. Similarly, compute $Cross-Entropy (m=2)$:
$Cross-Entropy(Outlook=Overcast) = -\sum_{k=1}^{K=2}{\hat{p}_{2k}\log{\hat{p}_{2k}}}\\
=-[\hat{p}_{21}\log{\hat{p}_{21}} + \hat{p}_{22}\log{\hat{p}_{22}}]\\
=-[4/4 \times \log(4/4) + 0/4 \times \log(0/4)] \leftarrow \log(0/4) \text{ would produce a "Math Error"; since this is a "Pure Node", we use the convention } 0\log 0 = 0. \\
=0$
5. Similarly, compute $Cross-Entropy (m=3)$:
$Cross-Entropy(Outlook=Rain) = -\sum_{k=1}^{K=2}{\hat{p}_{3k}\log{\hat{p}_{3k}}}\\
=-[\hat{p}_{31}\log{\hat{p}_{31}} + \hat{p}_{32}\log{\hat{p}_{32}}]\\
=-[3/5 \times \log(3/5) + 2/5 \times \log(2/5)] \\
≈0.971$
Cross-entropy and the Gini index are more sensitive to changes in the node probabilities than the misclassification rate.
For example, in a two-class problem with 400 observations in each class, written (400, 400), suppose one split creates nodes (300, 100) and (100, 300), while another creates nodes (200, 400) and (200, 0). Both splits produce a misclassification rate of 0.25, but the second split produces a "Pure Node".
Both the Gini index and cross-entropy are lower for the second split, so one of these two criteria is typically used when growing the tree.
For cost-complexity pruning, any of the three measures can be used, but it is typically the misclassification rate.
The Gini index can be interpreted in two ways:
1. **Rather than classifying observations to the majority class in the node, we could classify them to class $k$ with probability $\hat{p}_{mk}$. The training error rate of this rule in the node is then $\sum_{k≠k'} \hat{p}_{mk}\hat{p}_{mk'}$, which is exactly the Gini index**.
2. Similarly, **if we code each observation as 1 for class $k$ and 0 otherwise, the variance of this 0-1 variable over the node is $\hat{p}_{mk}(1-\hat{p}_{mk})$. Summing over all classes $k$ again gives the Gini index**.
#### 1.2.2.3 Information Gain
As discussed in the previous subsection, information entropy (cross-entropy) measures the purity of a node; the formula is given in $(1.10)$. For a sample set $D$, write $Ent(D)=-\sum^{K}_{k=1}{\hat{p}_{k}\log{\hat{p}_{k}}}$: the smaller $Ent(D)$, the higher the purity of $D$. (This formula also reveals a drawback of the entropy-based criterion: it favours features with many possible values. The more values a feature can take, the purer each resulting subset tends to be, so the weighted entropy of the subsets becomes very small; in the extreme case where a feature takes a distinct value for every sample, each subset's entropy is 0.)
**Information Gain**
Suppose a categorical feature $a$ has $M$ possible values $\{a^1, a^2, ..., a^M\}$. Splitting the dataset $D$ on feature $a$ produces $M$ branches, where the $m$-th branch contains all the samples in $D$ whose value of feature $a$ is $a^m$; denote this subset by $D^m$. The entropy of each $D^m$ can be computed with the information-entropy formula above. Since the branches contain different numbers of samples, each branch is given the weight $\frac{|D^m|}{|D|}$, so that branches with more samples have more influence. The information gain obtained by splitting the sample set $D$ on feature $a$ is then:
\begin{equation}Gain(D,a)=Ent(D)-\sum^{M}_{m=1}{\frac{|D^m|}{|D|}Ent(D^m)} \tag{1.11}\end{equation}
In general, the larger the information gain, the greater the "Purity Improvement" obtained by splitting $D$ on feature $a$. Information gain can therefore be used to choose the splitting variable in a decision tree: simply pick the attribute with the largest information gain. Let us reuse the cross-entropy example from the previous subsection.
We already computed the cross-entropy of each unique value of the "Outlook" variable:
1. Compute $Cross-Entropy (m=1)$:
$Cross-Entropy(Outlook=Sunny) = Ent(D^1) = -\sum_{k=1}^{K=2}{\hat{p}_{1k}\log{\hat{p}_{1k}}}\\
=-[\hat{p}_{11}\log{\hat{p}_{11}} + \hat{p}_{12}\log{\hat{p}_{12}}]\\
=-[2/5 \times \log(2/5) + 3/5 \times \log(3/5)]\\
≈0.971$
2. Similarly, compute $Cross-Entropy (m=2)$:
$Cross-Entropy(Outlook=Overcast) = Ent(D^2) = -\sum_{k=1}^{K=2}{\hat{p}_{2k}\log{\hat{p}_{2k}}}\\
=-[\hat{p}_{21}\log{\hat{p}_{21}} + \hat{p}_{22}\log{\hat{p}_{22}}]\\
=-[4/4 \times \log(4/4) + 0/4 \times \log(0/4)] \leftarrow \log(0/4) \text{ would produce a "Math Error"; since this is a "Pure Node", we use the convention } 0\log 0 = 0. \\
=0$
3. Similarly, compute $Cross-Entropy (m=3)$:
$Cross-Entropy(Outlook=Rain) = Ent(D^3) = -\sum_{k=1}^{K=2}{\hat{p}_{3k}\log{\hat{p}_{3k}}}\\
=-[\hat{p}_{31}\log{\hat{p}_{31}} + \hat{p}_{32}\log{\hat{p}_{32}}]\\
=-[3/5 \times \log(3/5) + 2/5 \times \log(2/5)] \\
≈0.971$
4. To compute the information gain we also need $Ent(D)$, for which we no longer condition on $m$ and only consider $k$. For $k=1$, $\hat{p}_{1}=\frac{9}{14}$ (recall $k=1$ means "Decision = Yes"); for $k=2$, $\hat{p}_{2}=\frac{5}{14}$:
$Ent(D) = -\sum_{k=1}^{K=2}{\hat{p}_{k}\log{\hat{p}_{k}}} \\
= -(\frac{9}{14}\log{\frac{9}{14}}+ \frac{5}{14}\log{\frac{5}{14}}) ≈ 0.94$
5. With these values we can finally compute the information gain:
$Gain(Outlook) = Ent(D) - \sum_{m=1}^{M=3}{\frac{|D^m|}{|D|}Ent(D^m)} = 0.94 - (\frac{5}{14} \times 0.971 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.971) ≈ 0.246$
6. In the same way we can compute the information gain of the other features:
1. For the Temp variable: let $Temp(m=1)$ be Hot, $Temp(m=2)$ be Mild, and $Temp(m=3)$ be Cool.
$Cross-Entropy(m=1) = Cross-Entropy(Temp=Hot) = Ent(D^1) = -\sum_{k=1}^{K=2}{\hat{p}_{1k}\log{\hat{p}_{1k}}} \\
= -[\hat{p}_{11}\log{\hat{p}_{11}} + \hat{p}_{12}\log{\hat{p}_{12}}] \\
=-[\frac{2}{4} \times \log{\frac{2}{4}} + \frac{2}{4} \times \log{\frac{2}{4}}] \\
= 1$
$Cross-Entropy(m=2) = Cross-Entropy(Temp=Mild) = Ent(D^2) = -\sum_{k=1}^{K=2}{\hat{p}_{2k}\log{\hat{p}_{2k}}} \\
= -[\hat{p}_{21}\log{\hat{p}_{21}} + \hat{p}_{22}\log{\hat{p}_{22}}] \\
=-[\frac{4}{6} \times \log{\frac{4}{6}} + \frac{2}{6} \times \log{\frac{2}{6}}] \\
≈ 0.918$
$Cross-Entropy(m=3) = Cross-Entropy(Temp=Cool) = Ent(D^3) = -\sum_{k=1}^{K=2}{\hat{p}_{3k}\log{\hat{p}_{3k}}} \\
= -[\hat{p}_{31}\log{\hat{p}_{31}} + \hat{p}_{32}\log{\hat{p}_{32}}] \\
=-[\frac{3}{4} \times \log{\frac{3}{4}} + \frac{1}{4} \times \log{\frac{1}{4}}] \\
≈ 0.811$
The overall entropy is the same as before: $Ent(D) = -\sum_{k=1}^{K=2}{\hat{p}_{k}\log{\hat{p}_{k}}} \\
= -(\frac{9}{14}\log{\frac{9}{14}}+ \frac{5}{14}\log{\frac{5}{14}}) ≈ 0.94$
So the information gain is: $Gain(Temp) = Ent(D) - \sum_{m=1}^{M=3}{\frac{|D^m|}{|D|}Ent(D^m)} = 0.94 - (\frac{4}{14} \times 1 + \frac{6}{14} \times 0.918 + \frac{4}{14} \times 0.811) ≈ 0.029$
2. For the Humidity variable: let $Humidity(m=1)$ be High and $Humidity(m=2)$ be Normal.
$Cross-Entropy(m=1) = Cross-Entropy(Humidity=High) = Ent(D^1) = -\sum_{k=1}^{K=2}{\hat{p}_{1k}\log{\hat{p}_{1k}}} \\
= -[\hat{p}_{11}\log{\hat{p}_{11}} + \hat{p}_{12}\log{\hat{p}_{12}}] \\
=-[\frac{3}{7} \times \log{\frac{3}{7}} + \frac{4}{7} \times \log{\frac{4}{7}}] \\
= 0.985$
$Cross-Entropy(m=2) = Cross-Entropy(Humidity=Normal) = Ent(D^2) = -\sum_{k=1}^{K=2}{\hat{p}_{2k}\log{\hat{p}_{2k}}} \\
= -[\hat{p}_{21}\log{\hat{p}_{21}} + \hat{p}_{22}\log{\hat{p}_{22}}] \\
=-[\frac{6}{7} \times \log{\frac{6}{7}} + \frac{1}{7} \times \log{\frac{1}{7}}] \\
≈ 0.592$
The overall entropy is again $Ent(D) = -\sum_{k=1}^{K=2}{\hat{p}_{k}\log{\hat{p}_{k}}} \\
= -(\frac{9}{14}\log{\frac{9}{14}}+ \frac{5}{14}\log{\frac{5}{14}}) ≈ 0.94$
So the information gain is: $Gain(Humidity) = Ent(D) - \sum_{m=1}^{M=2}{\frac{|D^m|}{|D|}Ent(D^m)} = 0.94 - (\frac{7}{14} \times 0.985 + \frac{7}{14} \times 0.592) ≈ 0.1515$
3. For the Wind variable: let $Wind(m=1)$ be Strong and $Wind(m=2)$ be Weak.
$Cross-Entropy(m=1) = Cross-Entropy(Wind=Strong) = Ent(D^1) = -\sum_{k=1}^{K=2}{\hat{p}_{1k}\log{\hat{p}_{1k}}} \\
= -[\hat{p}_{11}\log{\hat{p}_{11}} + \hat{p}_{12}\log{\hat{p}_{12}}] \\
=-[\frac{3}{6} \times \log{\frac{3}{6}} + \frac{3}{6} \times \log{\frac{3}{6}}] \\
= 1$
$Cross-Entropy(m=2) = Cross-Entropy(Wind=Weak) = Ent(D^2) = -\sum_{k=1}^{K=2}{\hat{p}_{2k}\log{\hat{p}_{2k}}} \\
= -[\hat{p}_{21}\log{\hat{p}_{21}} + \hat{p}_{22}\log{\hat{p}_{22}}] \\
=-[\frac{6}{8} \times \log{\frac{6}{8}} + \frac{2}{8} \times \log{\frac{2}{8}}] \\
≈ 0.811$
The overall entropy is again $Ent(D) = -\sum_{k=1}^{K=2}{\hat{p}_{k}\log{\hat{p}_{k}}} \\
= -(\frac{9}{14}\log{\frac{9}{14}}+ \frac{5}{14}\log{\frac{5}{14}}) ≈ 0.94$
So the information gain is: $Gain(Wind) = Ent(D) - \sum_{m=1}^{M=2}{\frac{|D^m|}{|D|}Ent(D^m)} = 0.94 - (\frac{6}{14} \times 1 + \frac{8}{14} \times 0.811) ≈ 0.048$
7. Comparing the information gains of these features, we find:
* $Gain(Outlook) ≈ 0.246$
* $Gain(Temp) ≈ 0.029$
* $Gain(Humidity) ≈ 0.1515$
* $Gain(Wind) ≈ 0.048$
So we choose $Outlook$ as the splitting variable at the first (root) node. A small script verifying these numbers follows below.
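The script below (written for this note, with the class counts hard-coded from the table) reproduces these information-gain calculations:
```Python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)  # convention: 0 log 0 = 0

def information_gain(parent_counts, branch_counts):
    n = sum(parent_counts)
    weighted = sum(sum(b) / n * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - weighted

parent = [9, 5]  # (Yes, No) counts over the whole dataset
branches = {
    "Outlook":  [[2, 3], [4, 0], [3, 2]],   # Sunny, Overcast, Rain
    "Temp":     [[2, 2], [4, 2], [3, 1]],   # Hot, Mild, Cool
    "Humidity": [[3, 4], [6, 1]],           # High, Normal
    "Wind":     [[3, 3], [6, 2]],           # Strong, Weak
}
for name, counts in branches.items():
    print(name, round(information_gain(parent, counts), 3))
# Outlook 0.247, Temp 0.029, Humidity 0.152, Wind 0.048
# (these agree with the hand calculations above up to rounding of the intermediate entropies)
```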
### 1.2.3 Other Issues
#### 1.2.3.1 Categorical Predictors
When splitting a categorical predictor having $q$ unordered values, there are $2^{q-1}-1$ possible ways of partitioning the $q$ values into two groups, and this number grows very quickly with $q$. However, for a binary (0-1) outcome the computation simplifies: we order the predictor's categories by the proportion of observations falling in outcome class 1, and then split the predictor as if it were an ordered predictor.
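A minimal sketch of this ordering trick (written for this note, with a made-up categorical feature and a 0-1 outcome; the weighted Gini impurity of the two child nodes is used as the split criterion):
```Python
import numpy as np

def gini(v):
    # Gini impurity of a 0-1 vector: 2p(1-p), where p is the mean
    p = v.mean()
    return 2 * p * (1 - p)

def best_categorical_split(categories, y):
    # categories: array of category labels; y: 0-1 outcome
    cats = np.unique(categories)
    # Order the categories by the proportion of observations with outcome 1
    order = sorted(cats, key=lambda c: y[categories == c].mean())
    best = (None, np.inf)
    # Treat the ordered categories like an ordered predictor: only q-1 splits to check
    for i in range(1, len(order)):
        left_group = set(order[:i])
        mask = np.isin(categories, list(left_group))
        y_left, y_right = y[mask], y[~mask]
        score = len(y_left) * gini(y_left) + len(y_right) * gini(y_right)  # weighted child impurity
        if score < best[1]:
            best = (left_group, score)
    return best

cat = np.array(list("AABBBCCCDD"))
y = np.array([1, 1, 0, 1, 0, 0, 0, 0, 1, 1])
print(best_categorical_split(cat, y))   # expected: ({'B', 'C'}, ...) for these made-up data
```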
## 2.1 Now let's look at the code
#### Note: first download the iris data [here](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data), change the file extension of the downloaded file to '.csv', and move it into the current folder so Python can read it. (The comments are plentiful, but they help in understanding the code.)
```Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
from pprint import pprint
# Part 1: Load Iris data:
# The format of the data should be Pandas.DataFrame
df = pd.read_csv('iris.csv', names=['sepal_length', 'sepal_width',
'petal_length', 'petal_width', 'label'])
# Part 2: Train and Test Split
# Next, we should split Iris dataset into training and testing sets.
# We'll define our own "split_train_test" function instead of using predefined API
def split_train_test(df, test_size):
    # df --> raw Iris data
    # test_size --> the size of the testing set. It can be float or int type
    # When test_size is a float between 0 and 1, it means the proportion of the raw Iris data used as the testing set
    # When test_size is an int, it means the size of the testing set is "test_size"
    # Return --> a "train_df" and a "test_df", both Pandas.DataFrame
    if isinstance(test_size, float): # if test_size is float type
        test_size = round(test_size * len(df)) # e.g., if we have 150 records in the Iris dataset and test_size is 0.2, then round(150 * 0.2) = 30
    indices = df.index.tolist() # retrieve indices from the Iris DataFrame
    test_indices = random.sample(population=indices, k=test_size) # randomly sample "test_size" indices from the Iris DataFrame
    test_df = df.loc[test_indices] # the sampled indices belong to the testing set
    train_df = df.drop(test_indices) # the remaining indices belong to the training set
    return train_df, test_df
# Then we can set a random seed, run "split_train_test" and get "train_df" and "test_df"
random.seed(100)
train_df, test_df = split_train_test(df, 20) # we select 20 records as testing set
# Part 3: Helper Functions
# Part 3.1: Helper Function -- "check_purity"
# If the dataset only has 1 class, then this data is called "Pure". If the number of classes is more than 1, then it's "Not Pure"
# Now, the first Helper Function is to check the purity of our dataset.
def check_purity(data):
    label_column = data[:, -1]
    unique_classes = np.unique(label_column)
    # If the dataset is pure, return True; otherwise return False
    if len(unique_classes) == 1: return True
    else: return False
# Part 3.2: Helper Function -- "classify_data"
# classify the data
def classify_data(data):
    label_column = data[:, -1]
    unique_classes, count_unique_classes = np.unique(label_column, return_counts=True)
    # the previous line finds the unique label values and how many times each of them appears
    index = count_unique_classes.argmax() # find the index of the class that has the most records
    classification = unique_classes[index]
    return classification
# Part 3.3: Helper Function -- "find_potential_splits"
# In this function, we should traverse all records to find the potential splitting points
# What is splitting point? -- For a continuous variable, for example, if the unique values of "petal_length" are: "1.0, 1.1, 1.2, ..."
# Then, the potential splitting points will be the average value of each two nearby values: "1.05, 1.15, ..."
# We'll create a dictionary called "potential_splits" to store potential splitting points.
def find_potential_splits(data):
    potential_splits = {}
    i, n_columns = data.shape
    for index_column in range(n_columns - 1):
        potential_splits[index_column] = []
        values = data[:, index_column]
        unique_values = np.unique(values)
        for index in range(len(unique_values) - 1): # Why -1? -- Because N numbers have N-1 gaps between them.
            current_value = unique_values[index]
            next_value = unique_values[index+1]
            potential_split = (current_value + next_value) / 2 # "potential_split" is one of the potential splitting points
            potential_splits[index_column].append(potential_split)
    return potential_splits # finally, return the dictionary that contains all potential splitting points
# To help you understand how "find_potential_splits" works, you can decomment the following 4 lines to see how this function works.
##########################################################################################################################################################
# potential_splits = find_potential_splits(train_df.values)
# sns.lmplot(data=train_df, x='petal_width', y='petal_length', hue='label',
# fit_reg=False, height=5, aspect=1) # If you're using Python 3.7+, use "height". If your Python version is lower than 3.6, use "size"
# plt.vlines(x=potential_splits[3], ymin=0.5, ymax=7.5)
# plt.show()
##########################################################################################################################################################
# Previous 4 lines showed the potential splitting points of "petal_width".
# The following 4 lines will show the potential splitting points of "petal_length":
##########################################################################################################################################################
# potential_splits = find_potential_splits(train_df.values)
# sns.lmplot(data=train_df, x='petal_width', y='petal_length', hue='label',
# fit_reg=False, height=5, aspect=1)
# plt.hlines(y=potential_splits[2], xmin=0, xmax=2.5)
# plt.show()
##########################################################################################################################################################
# Part 4: Split data -- "split_data"
# Since we've found all potential splitting points, we can define a function to split our training data
# Remember, in decision tree, if we want to make a split at a node, there are 2 things we should decide:
# First is which feature (independent variable) should be selected to split. Say, this feature is f1.
# Second is, once the feature f1 is selected, which potential split should be selected
# We will talk about how these 2 things are determined later. For "split_data" function, we want it to return two datasets
# For dataset1, all its feature f1's values should be smaller than split value
# For dataset2, all its feature f1's values should be greater than split value
def split_data(data, split_column, split_value):
    split_column_values = data[:, split_column]
    data_greater = data[split_column_values > split_value]
    data_smaller = data[split_column_values <= split_value]
    return data_greater, data_smaller
# If you're careful enough, you'll find that if "petal_width" < 0.8 (approximately), all records are "setosa".
# We'll utilize this insight to check if our functions work correctly: (you can uncomment the following 9 lines to help you understand)
##########################################################################################################################################################
# split_column = 3
# split_value = 0.8 # we set split value to be 0.8. You'll see all records with "petal_width" < 0.8 are "setosa".
# data_greater, data_smaller = split_data(train_df.values, split_column, split_value)
# plot_df = pd.DataFrame(data_smaller, columns=df.columns)
# sns.lmplot(data=plot_df, x='petal_width', y='petal_length', hue='label',
# fit_reg=False, height=5, aspect=1)
# plt.vlines(x=split_value, ymin=0.5, ymax=7.5)
# plt.xlim(0, 3)
# plt.show()
##########################################################################################################################################################
# Part 5: Find the smallest cross-entropy -- "compute_entropy", "compute_cross_entropy"
# We mentioned 2 things that should be determined every time we want to make a split in the decision tree.
# Now is time to compute the cross entropy and find the smallest entropy.
# The split that minimizes the cross entropy is the best split.
def compute_entropy(data):
    label_column = data[:, -1]
    i, count = np.unique(label_column, return_counts=True)
    probabilities = count / count.sum()
    entropy = sum(probabilities * -np.log2(probabilities))
    return entropy
def compute_cross_entropy(data_greater, data_smaller):
    count_data_points = len(data_smaller) + len(data_greater)
    probability_data_greater = len(data_greater) / count_data_points
    probability_data_smaller = len(data_smaller) / count_data_points
    cross_entropy = (probability_data_greater * compute_entropy(data_greater) +
                     probability_data_smaller * compute_entropy(data_smaller))
    return cross_entropy
# Take "petal_width" < 0.8 for example. "petal_width" will be the feature to be split and the splitting point is 0.8.
##########################################################################################################################################################
# To see an example of how cross entropy is calculated, uncomment the following 4 lines:
# split_column = 3 # index 3 is "petal_width" column in our dataset.
# split_value = 0.8 # we set split value to be 0.8. You'll see all records with "petal_width" < 0.8 are "setosa".
# data_greater, data_smaller = split_data(train_df.values, split_column, split_value)
# print(compute_cross_entropy(data_greater, data_smaller))
##########################################################################################################################################################
# With the previous functions, we can easily find the optimal splitting point by traversing the dataset.
# The "feature-splitting point" pair that minimizes cross entropy will be the optimal feature-splitting point.
def find_optimal_split(data, potential_splits):
    cross_entropy = float('inf') # initialize with a very large number and keep replacing it with smaller cross-entropy values
    for column_index in potential_splits: # traverse all features
        for split_value in potential_splits[column_index]: # traverse all splitting points
            data_greater, data_smaller = split_data(data, split_column=column_index, split_value=split_value)
            current_cross_entropy = compute_cross_entropy(data_greater=data_greater, data_smaller=data_smaller)
            if current_cross_entropy <= cross_entropy:
                cross_entropy = current_cross_entropy
                optimal_split_column = column_index
                optimal_split_value = split_value
    return optimal_split_column, optimal_split_value
# Not surprisingly, one of the optimal feature-splitting point pairs should be "petal_width-0.8" (or "3-0.8", because the index of petal_width is 3)
# You can check the result using below 2 lines:
##########################################################################################################################################################
# potential_splits = find_potential_splits(train_df.values)
# print(find_optimal_split(train_df.values, potential_splits))
##########################################################################################################################################################
# Part 6: Decision Tree Algorithm -- "build_decision_tree"
# Now is time to build our decision tree algorithm.
def build_decision_tree(df, counter=0, min_sample=2, max_depth=5):
    # df --> Pandas.DataFrame type. Our dataset
    # counter --> the current depth of the tree
    # min_sample --> the minimum number of records a node must have to be split further
    # max_depth --> the maximum depth of the decision tree
    if counter == 0: # at the first layer (root layer), convert the Pandas.DataFrame to a numpy array
        global COLUMN_HEADERS
        COLUMN_HEADERS = df.columns
        data = df.values
    else: # otherwise, at the other layers, no conversion is needed
        data = df
    # base case
    # If the data is pure OR the data size < min_sample OR we reached the maximum depth, we stop:
    if (check_purity(data)) or (len(data) < min_sample) or (counter == max_depth):
        classification = classify_data(data) # then we classify the data at this leaf
        return classification
    else: # otherwise we keep growing the decision tree
        counter += 1
        potential_splits = find_potential_splits(data)
        split_column, split_value = find_optimal_split(data, potential_splits)
        data_greater, data_smaller = split_data(data, split_column=split_column, split_value=split_value)
        # instantiate the subtree:
        feature_name = COLUMN_HEADERS[split_column]
        decision = "{} <= {}".format(feature_name, split_value)
        subtree = {decision: []}
        # find the answers (recursive part)
        no_answer = build_decision_tree(data_greater, counter, min_sample, max_depth)
        yes_answer = build_decision_tree(data_smaller, counter, min_sample, max_depth)
        if yes_answer == no_answer:
            subtree = yes_answer
        else:
            subtree[decision].append(yes_answer)
            subtree[decision].append(no_answer)
        return subtree
# To help you understand how this recursive function works, uncomment the following code block.
##########################################################################################################################################################
# atree = build_decision_tree(train_df, max_depth=2)
# pprint(atree)
##########################################################################################################################################################
# Part 7: Make Classification -- "make_classification"
# We've already had "build_decision_tree" function to grow a decision tree.
# Now we're to make classification using the decision tree we built.
# In this part, the "make_classification" function takes a record and the tree built by "build_decision_tree", and returns the "label" it predicts.
def make_classification(record, tree):
    # record --> an example data point we will use to test our decision tree
    ''' A possible value of "record":
    sepal_length 6.5
    sepal_width 3
    petal_length 5.5
    petal_width 1.8
    label Iris-virginica
    Name: 116, dtype: object
    '''
    # tree --> a tree returned by the "build_decision_tree" function
    decision = list(tree.keys())[0] # an example value of "decision" might be 'petal_width <= 0.8'
    column_name, comparison, split_value = decision.split() # once split, we get 3 values: "petal_width", "<=", and "0.8", respectively.
    # So we can use column_name as an index to retrieve the record's value of the 'petal_width' feature (1.8 in the example above).
    # Then if this value <= split_value, we take the first (yes) branch; otherwise the second (no) branch
    if record[column_name] <= float(split_value):
        answer = tree[decision][0]
    else:
        answer = tree[decision][1]
    # The base case:
    if not isinstance(answer, dict): # if answer is not a dict, we reached a leaf node, so we return the answer
        return answer
    else: # otherwise we continue recursively
        remained_tree = answer
        return make_classification(record, remained_tree)
# Again, you can uncomment the following code block to see an example.
##########################################################################################################################################################
# record = test_df.iloc[2]
# tree = build_decision_tree(train_df, max_depth=2)
# print('Predicted Label: ', make_classification(record, tree))
# print('Actual Label: ', record['label'])
##########################################################################################################################################################
# You can see the Predicted and Actual Labels are the same, meaning our "build_decision_tree" function works well
# Part 8: Calculate Accuracy -- "calculate_accuracy"
# This part is easy. For classification problem, accuracy = number of correctly classified records / number of total records
def calculate_accuracy(df, tree):
    df['classification'] = df.apply(make_classification, axis=1, args=(tree,))
    df['correctly_classified'] = df['classification'] == df['label']
    accuracy = df['correctly_classified'].mean()
    return accuracy
# Let's use our test_df to see the accuracy of our decision tree model:
##########################################################################################################################################################
# tree = build_decision_tree(train_df, max_depth=2)
# accuracy = calculate_accuracy(test_df, tree)
# print('Accuracy is: ', accuracy)
##########################################################################################################################################################
```
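As an optional cross-check (not part of the original note), the same train/test split can be fed to scikit-learn's `DecisionTreeClassifier` and the resulting accuracy compared with `calculate_accuracy` above; with a small `max_depth` the two should behave similarly. This sketch assumes scikit-learn is installed and that the code block above has already been run (so `train_df`, `test_df`, `build_decision_tree`, and `calculate_accuracy` exist).
```Python
from sklearn.tree import DecisionTreeClassifier

# Reuse the train_df / test_df produced by split_train_test above
X_train, y_train = train_df.iloc[:, :4], train_df['label']
X_test, y_test = test_df.iloc[:, :4], test_df['label']

sk_tree = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=0)
sk_tree.fit(X_train, y_train)
print('scikit-learn accuracy:', sk_tree.score(X_test, y_test))

# Compare with the hand-rolled tree from Parts 6-8:
tree = build_decision_tree(train_df, max_depth=2)
print('our accuracy:', calculate_accuracy(test_df.copy(), tree))
```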