---
title: 'Additive Models and Tree-based Models'
disqus: YY
---

Additive Models and Tree-based Models
===

## Table of Contents

[TOC]

## 1.1 Generalized Additive Models

Linear models are simple and occupy an important place in data analysis, but many real-world relationships are not linear. Here we discuss a family of very flexible statistical models that are used to identify and characterize nonlinear regression effects. These models are called Generalized Additive Models (GAMs).

(To be continued, p. 295)

## 1.2 Tree-Based Models

Tree-based models partition the feature space into a set of rectangles and then fit a simple model (often just a constant) in each one. The best-known tree-based method is CART (Classification And Regression Trees); later we will also mention another common tree algorithm, C4.5.

Consider a regression problem with response $Y$ and inputs $X_1$ and $X_2$, each taking values in the unit interval. In the top-left panel of the figure below, we first split at $X_1=t_1$. In the region $X_1\le t_1$ we then split at $X_2=t_2$, and in the region $X_1>t_1$ we split at $X_1=t_3$. Finally, the region $X_1>t_3$ is split at $X_2=t_4$. This produces five regions $R_1, R_2, R_3, R_4, R_5$, and the corresponding regression model predicts $Y$ with:

\begin{equation}\hat{f}(X)=\sum_{m=1}^{5}c_mI\{(X_1, X_2)\in R_m\} \tag{1.1}\end{equation}

where $c_m$ is a constant and $R_m$ is the $m$-th region. The same model can also be represented by the binary tree in the bottom-right panel.

![](https://i.imgur.com/rCWO7Pr.png)

The biggest advantage of a decision tree is its interpretability: the partition of the feature space is completely described by a single tree. With more than two inputs it becomes hard to draw the partition itself (as in the top-right panel), but the tree representation still works in high dimensions. The decision tree stratifies the sample into strata of high and low outcome on the basis of the inputs.

### 1.2.1 Regression Trees

We now look at how to grow a regression tree. Suppose the data consist of $p$ inputs and a response for $N$ observations, i.e. $(x_i, y_i)$ for $i=1,2,\dots,N$, with $x_i=(x_{i1}, x_{i2},\dots, x_{ip})$. The algorithm needs to decide:

1. which input variable to split on,
2. where to split it, and
3. what shape (topology) the tree should have.

Suppose the feature space has been partitioned into $M$ regions $R_1, R_2,\dots, R_M$, and that we model the response as a constant $c_m$ in each region:

\begin{equation}f(x) = \sum_{m=1}^{M}c_mI(x\in R_m) \tag{1.2}\end{equation}

If we adopt minimization of the sum of squared errors between the observed and predicted values (i.e. $\sum{(y_i-f(x_i))^2}$) as the criterion, it is easy to see that the best $\hat{c}_m$ is simply the average of the $y_i$ in region $R_m$ (**because the mean is what minimizes the sum of squared deviations**):

\begin{equation}\hat{c}_m=\operatorname{ave}(y_i \mid x_i\in R_m) \tag{1.3}\end{equation}

Finding the overall best binary partition in terms of minimum sum of squares is generally computationally infeasible, so we use a greedy algorithm. Starting with all of the data, consider a splitting variable $j$ and a split point $s$, and define the pair of half-planes:

\begin{equation}R_1(j,s)=\{X \mid X_j\le s\} \quad \text{and} \quad R_2(j, s)=\{X \mid X_j>s\} \tag{1.4}\end{equation}

**We then scan through the data to find the splitting variable $j$ and split point $s$ that minimize the sum of squared errors**, i.e. we solve:

\begin{equation}\min_{j,s}\Big[\min_{c_1}\sum_{x_i\in R_1(j,s)}(y_i-c_1)^2 \; + \min_{c_2}\sum_{x_i\in R_2(j,s)}(y_i-c_2)^2\Big] \tag{1.5}\end{equation}

For any choice of $j$ and $s$, the two inner minimizations in (1.5) are solved by:

\begin{equation}\hat{c}_1=\operatorname{ave}(y_i \mid x_i\in R_1(j,s)) \quad \text{and} \quad \hat{c}_2=\operatorname{ave}(y_i \mid x_i\in R_2(j,s)) \tag{1.6}\end{equation}

For each splitting variable, the best split point $s$ can be found very quickly by scanning through that variable's data points, so the best pair $(j,s)$ is also easy to determine. This answers the question of which variable to split on at the current node and where to split it. Repeating the procedure on each resulting region eventually partitions the whole feature space (all of the input variables).
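To make the greedy search in $(1.4)$-$(1.6)$ concrete, here is a minimal NumPy sketch (my own illustration, not part of the original notes): it scans every candidate splitting variable $j$ and split point $s$ and returns the pair that minimizes the two-region sum of squared errors. The function names and the toy data are made up for illustration.

```Python
import numpy as np

def region_sse(y):
    """Sum of squared errors when a region is predicted by its mean, cf. (1.3)."""
    return 0.0 if len(y) == 0 else float(np.sum((y - y.mean()) ** 2))

def best_split_regression(X, y):
    """Greedy search over (j, s) minimizing the criterion in (1.5)."""
    best = (None, None, np.inf)            # (feature index j, split point s, SSE)
    n, p = X.shape
    for j in range(p):                     # every candidate splitting variable
        for s in np.unique(X[:, j]):       # every candidate split point
            left = y[X[:, j] <= s]         # R1(j, s), cf. (1.4)
            right = y[X[:, j] > s]         # R2(j, s)
            if len(left) == 0 or len(right) == 0:
                continue
            sse = region_sse(left) + region_sse(right)
            if sse < best[2]:
                best = (j, s, sse)
    return best

# Toy usage with simulated data (illustrative only):
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))
y = np.where(X[:, 0] <= 0.5, 1.0, 3.0) + rng.normal(scale=0.1, size=100)
print(best_split_regression(X, y))   # should pick j = 0 with s near 0.5
```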
#### How large should we grow the tree? A very deep tree will clearly overfit, while a very shallow one will underfit.

Tree size is a tuning hyperparameter that governs the model's complexity, and the optimal size should be chosen adaptively from the data itself.

1. One way to choose the tree size is to split a node **only if the split (the chosen variable together with its split point) decreases the sum of squared errors by more than some threshold**. This strategy is too short-sighted, however: a seemingly worthless split now may lead to a very good split below it.
2. The preferred strategy is to **grow a large tree $T_0$, stopping the splitting process only when some minimum node size (say 5) is reached, and then prune this large tree $T_0$ using a procedure called "cost-complexity pruning"**.

We define a subtree $T\subset{T_0}$ to be any tree that can be obtained from $T_0$ by pruning, that is, by collapsing any number of its internal (non-terminal) nodes. We index the terminal nodes by $m$, so that node $m$ represents region $R_m$, and we let $|T|$ denote the number of terminal nodes in $T$. Let the following three quantities be $(1.7)$:

\begin{equation}\begin{aligned}
N_m&=\#\{x_i\in R_m\} \\
\hat{c}_m&=\frac{1}{N_m}\sum_{x_i\in R_m}y_i \\
Q_m(T)&=\frac{1}{N_m}\sum_{x_i\in R_m}(y_i-\hat{c}_m)^2
\end{aligned} \tag{1.7}\end{equation}

Then we define the "cost-complexity criterion" as:

\begin{equation}C_{\alpha}(T)=\sum_{m=1}^{|T|}N_mQ_m(T)+\alpha|T| \tag{1.8}\end{equation}

(The central idea is that **for each $\alpha$ we seek the subtree $T_{\alpha}\subset{T_0}$ that minimizes $C_{\alpha}(T)$**.)

Here $\alpha$ is the hyperparameter to be tuned; it governs **the trade-off between tree size and the tree's goodness of fit to the data**. **$\alpha$ is $\ge 0$, and larger values of $\alpha$ result in smaller trees.** When $\alpha=0$, the full tree $T_0$ is left unpruned.

For each $\alpha$ one can show that there is a unique smallest subtree $T_{\alpha}$ that minimizes $C_{\alpha}$. To find this $T_{\alpha}$ we use **weakest-link pruning**:

1. We successively collapse (fold) the internal node that produces the smallest per-node increase in $\sum_mN_mQ_m(T)$, and continue until only the single-node (root) tree remains.
2. This gives a finite sequence of subtrees, and one can show that this sequence must contain $T_{\alpha}$. The value of ${\alpha}$ is estimated by five- or ten-fold cross-validation: we choose $\hat{\alpha}$ to minimize the cross-validated sum of squared errors. The final, trained tree is $T_{\hat{\alpha}}$.
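As a side note (not from the original text), scikit-learn exposes this cost-complexity / weakest-link machinery through `cost_complexity_pruning_path` and the `ccp_alpha` parameter of its tree estimators. The sketch below, on synthetic data of my own, enumerates the candidate $\alpha$ values in the pruning sequence and then picks $\hat{\alpha}$ by 5-fold cross-validated squared error; the data and parameter settings are illustrative assumptions.

```Python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
y = np.sin(4 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=300)

# Grow a large tree T0 (small minimum node size), then get the pruning path:
# each ccp_alpha corresponds to one subtree in the weakest-link sequence.
full_tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
path = full_tree.cost_complexity_pruning_path(X, y)

# Choose alpha by 5-fold cross-validated squared error
cv_mse = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=alpha, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5, scoring='neg_mean_squared_error')
    cv_mse.append(-scores.mean())

best_alpha = path.ccp_alphas[int(np.argmin(cv_mse))]
pruned_tree = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=best_alpha,
                                    random_state=0).fit(X, y)
print('chosen alpha:', best_alpha, 'number of leaves:', pruned_tree.get_n_leaves())
```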
### 1.2.2 Classification Trees

If the target is not a continuous variable but a categorical one taking values $1, 2, \dots, K$, the only change needed in the algorithm is the criterion used to decide "which variable to split on at the current node" and "where to split that variable".

**For a regression tree this criterion is the "squared-error node impurity" $Q_m(T)$ defined in $(1.7)$**, but that measure is not suitable for a classification tree. In a node $m$, representing a region $R_m$ with $N_m$ observations, let:

\begin{equation}\hat{p}_{mk}=\frac{1}{N_m}\sum_{x_i\in R_m}I(y_i=k) \tag{1.9}\end{equation}

be the proportion of class $k$ observations in node $m$. We classify the observations in node $m$ to class $k(m)=\operatorname{argmax}_{k}\hat{p}_{mk}$, the majority class in node $m$. Different choices of $Q_m(T)$ (i.e. of the node impurity) include:

1. Misclassification error: $\frac{1}{N_m}\sum_{i\in R_m}I(y_i\ne k(m))=1-\hat{p}_{mk(m)}$
2. Gini index: $\sum_{k\ne k'}\hat{p}_{mk}\hat{p}_{mk'}=\sum_{k=1}^{K}\hat{p}_{mk}(1-\hat{p}_{mk})$
3. Cross-entropy or deviance: $-\sum_{k=1}^{K}\hat{p}_{mk}\log\hat{p}_{mk}$

$\tag{1.10}$

For a two-class problem, if $p$ is the proportion in the second class, these three measures are, respectively:

1. $1-\max\{p, 1-p\}$
2. $2p(1-p)$
3. $-p\log{p} - (1-p)\log{(1-p)}$

The figure below shows how these three measures vary with $p$:

![](https://i.imgur.com/O1xBwGm.png)

When growing the tree, analogously to $(1.5)$ and $(1.7)$, the node impurities must be weighted by $N_{mL}$ and $N_{mR}$, the numbers of observations in the left and right child nodes created by splitting node $m$.

#### 1.2.2.1 A worked example: the Gini index in a classification tree

1. Suppose we have the following dataset, in which "Decision" is the target variable:
![](https://i.imgur.com/tS20E48.png)
Tabulating Decision against the values of the Outlook variable gives:
![](https://i.imgur.com/6KJCkOA.png)
We now compute the proportion of class $k$ observations in node $m$. Take "Decision = Yes" as the first class ($k=1$) and "Decision = No" as the second class ($k=2$), and index "Outlook = Sunny" as $m=1$, "Outlook = Overcast" as $m=2$, and "Outlook = Rain" as $m=3$.
2. Using $(1.9)$, first count $N_m$, the number of data points in region $R_m$. "Sunny" appears 5 times in the dataset, so $N_1=5$. Then, for $m=1$:
(1) With $k=1$:
\begin{equation} \hat{p}_{11}=\frac{1}{N_1}\sum_{x_i\in R_1}I(y_i=1)=\frac{1}{N_1}\sum_{i=1}^{N_1=5} \begin{cases} 1, & \text{if } y_i = 1 \text{ (i.e. Decision = Yes)}\\ 0, & \text{if } y_i \ne 1 \text{ (i.e. Decision} \ne \text{Yes)} \end{cases} = 2/5\end{equation}
(Of the 5 data points with Outlook = Sunny, 2 have Decision = Yes, hence $2/5$.)
(2) With $k=2$:
\begin{equation} \hat{p}_{12}=\frac{1}{N_1}\sum_{x_i\in R_1}I(y_i=2)=\frac{1}{N_1}\sum_{i=1}^{N_1=5} \begin{cases} 1, & \text{if } y_i = 2 \text{ (i.e. Decision = No)}\\ 0, & \text{if } y_i \ne 2 \text{ (i.e. Decision} \ne \text{No)} \end{cases} = 3/5\end{equation}
(Of the 5 data points with Outlook = Sunny, 3 have Decision = No, hence $3/5$.)
3. Compute the Gini index for $m=1$:
$Gini(\text{Outlook=Sunny}) = \sum_{k\ne k'}\hat{p}_{1k}\hat{p}_{1k'}=\hat{p}_{11}\hat{p}_{12}+\hat{p}_{12}\hat{p}_{11}=(2/5 \times 3/5) + (3/5 \times 2/5) = 12/25=0.48$
4. Similarly, for $m=2$:
$Gini(\text{Outlook=Overcast}) = \hat{p}_{21}\hat{p}_{22}+\hat{p}_{22}\hat{p}_{21}=(4/4 \times 0/4) + (0/4 \times 4/4) = 0$
5. And for $m=3$:
$Gini(\text{Outlook=Rain}) = \hat{p}_{31}\hat{p}_{32}+\hat{p}_{32}\hat{p}_{31}=(3/5 \times 2/5) + (2/5 \times 3/5) = 12/25 = 0.48$

#### 1.2.2.2 A worked example: cross-entropy in a decision tree

Steps 1 and 2 are exactly as in the Gini example above: for $m=1$ (Outlook = Sunny) we have $N_1=5$, $\hat{p}_{11}=2/5$ and $\hat{p}_{12}=3/5$. Logarithms are taken base 2 throughout.

3. Compute the cross-entropy for $m=1$:
$\text{Cross-Entropy}(\text{Outlook=Sunny}) = -\sum_{k=1}^{K=2}{\hat{p}_{1k}\log{\hat{p}_{1k}}} = -[\hat{p}_{11}\log{\hat{p}_{11}} + \hat{p}_{12}\log{\hat{p}_{12}}] = -[2/5 \times \log(2/5) + 3/5 \times \log(3/5)] ≈ 0.971$
4. Similarly, for $m=2$:
$\text{Cross-Entropy}(\text{Outlook=Overcast}) = -[\hat{p}_{21}\log{\hat{p}_{21}} + \hat{p}_{22}\log{\hat{p}_{22}}] = -[4/4 \times \log(4/4) + 0/4 \times \log(0/4)] = 0$
(The term $0/4 \times \log(0/4)$ would give a "Math Error" on a calculator; because this is a "Pure Node" we use the convention $0\log 0 = 0$.)
5. And for $m=3$:
$\text{Cross-Entropy}(\text{Outlook=Rain}) = -[\hat{p}_{31}\log{\hat{p}_{31}} + \hat{p}_{32}\log{\hat{p}_{32}}] = -[3/5 \times \log(3/5) + 2/5 \times \log(2/5)] ≈ 0.971$

Cross-entropy and the Gini index are more sensitive than the misclassification rate to changes in the node probabilities. For example, in a two-class problem with 400 observations in each class, written (400, 400), suppose one split creates nodes (300, 100) and (100, 300), while another creates nodes (200, 400) and (200, 0). Both splits produce a misclassification rate of 0.25, but the second one produces a "Pure Node". The Gini index and cross-entropy are lower for the second split, which is why one of these two measures is normally used as the splitting criterion. For "Cost-Complexity Pruning" any of the three measures can be used, but typically it is the misclassification rate. (The short sketch below checks these numbers.)
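Here is a minimal sketch (mine, not from the original notes) that checks this comparison numerically: it computes the size-weighted misclassification rate, Gini index, and cross-entropy for the (300, 100)/(100, 300) split and for the (200, 400)/(200, 0) split described above.

```Python
import numpy as np

def misclassification(counts):
    p = np.asarray(counts) / np.sum(counts)
    return float(1 - p.max())

def gini(counts):
    p = np.asarray(counts) / np.sum(counts)
    return float(np.sum(p * (1 - p)))

def entropy(counts):
    p = np.asarray(counts) / np.sum(counts)
    p = p[p > 0]                      # convention: 0 * log(0) = 0
    return float(-np.sum(p * np.log2(p)))

def weighted(measure, children):
    """Impurity of a split = child impurities weighted by child sizes."""
    sizes = np.array([np.sum(c) for c in children], dtype=float)
    return float(np.sum(sizes / sizes.sum() * [measure(c) for c in children]))

split_a = [(300, 100), (100, 300)]    # (class-1 count, class-2 count) per child
split_b = [(200, 400), (200, 0)]      # second child is a pure node

for name, measure in [('misclassification', misclassification),
                      ('gini', gini), ('entropy', entropy)]:
    print(name, weighted(measure, split_a), weighted(measure, split_b))
# misclassification ties at 0.25; gini and entropy are lower for split_b
```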
The Gini index has two common interpretations:

1. **Instead of classifying the observations in a node to the node's majority class, classify them to class $k$ with probability $\hat{p}_{mk}$. The training error rate of this rule in the node is then $\sum_{k\ne k'} \hat{p}_{mk}\hat{p}_{mk'}$, which is exactly the Gini index.**
2. Similarly, **if we code each observation as 1 when it belongs to class $k$ and 0 otherwise, the variance over the node of this 0-1 variable is $\hat{p}_{mk}(1-\hat{p}_{mk})$. Summing over all classes $k$ again gives the Gini index.**

#### 1.2.2.3 Information Gain

The previous subsection introduced information entropy (also called cross-entropy) as a measure of node purity; the formula is given in $(1.10)$. Write $Ent(D)=-\sum^{K}_{k=1}{\hat{p}_{mk}\log{\hat{p}_{mk}}}$ for a sample set $D$: the smaller $Ent(D)$ is, the higher the purity of $D$. (This formula also points to a weakness of entropy-based splitting: it is biased towards features with many possible values. The more values a feature can take, the purer the resulting subsets tend to be, so their entropies $Ent(D^m)$ become very small; in the extreme case where a feature has as many distinct values as there are samples, every subset's entropy is 0.)

**Information Gain**

Suppose a categorical feature $a$ has $M$ possible values $\{a^1, a^2, \dots, a^M\}$. If we use feature $a$ to partition the dataset $D$, we obtain $M$ branches, where the $m$-th branch contains all samples of $D$ whose value of feature $a$ is $a^m$; denote this subset by $D^m$. Its entropy can be computed with the information-entropy formula above. Since the branches contain different numbers of samples, each branch is given the weight $\frac{|D^m|}{|D|}$, so that branches with more samples have a larger influence. The information gain obtained by partitioning the sample set $D$ on feature $a$ is then:

\begin{equation}Gain(D,a)=Ent(D)-\sum^{M}_{m=1}{\frac{|D^m|}{|D|}Ent(D^m)} \tag{1.11}\end{equation}

In general, the larger the information gain, the larger the "purity improvement" obtained by splitting the dataset on feature $a$. Information gain can therefore be used to select the splitting variable in a decision tree: we simply choose the feature with the largest information gain. Using the cross-entropy example from the previous subsection:

We have already computed the cross-entropy of each unique value of the "Outlook" feature:

1. For $m=1$:
$\text{Cross-Entropy}(\text{Outlook=Sunny}) = Ent(D^1) = -\sum_{k=1}^{K=2}{\hat{p}_{1k}\log{\hat{p}_{1k}}} = -[2/5 \times \log(2/5) + 3/5 \times \log(3/5)] ≈ 0.971$
2. For $m=2$:
$\text{Cross-Entropy}(\text{Outlook=Overcast}) = Ent(D^2) = -[4/4 \times \log(4/4) + 0/4 \times \log(0/4)] = 0$ (a pure node, using the convention $0\log 0 = 0$)
3. For $m=3$:
$\text{Cross-Entropy}(\text{Outlook=Rain}) = Ent(D^3) = -[3/5 \times \log(3/5) + 2/5 \times \log(2/5)] ≈ 0.971$
4. To compute the information gain we also need $Ent(D)$; here we no longer need to consider $m$, only $k$. For $k=1$, $\hat{p}_{k=1}=\frac{9}{14}$ (recall that $k=1$ means Decision = "Yes"), and for $k=2$, $\hat{p}_{k=2}=\frac{5}{14}$:
$Ent(D) = -\sum_{k=1}^{K=2}{\hat{p}_{k}\log{\hat{p}_{k}}} = -(\frac{9}{14}\log{\frac{9}{14}}+ \frac{5}{14}\log{\frac{5}{14}}) ≈ 0.94$
5. With these values we can finally compute the information gain:
$Gain(\text{Outlook}) = Ent(D) - \sum_{m=1}^{M=3}{\frac{|D^m|}{|D|}Ent(D^m)} = 0.94 - (\frac{5}{14} \times 0.971 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.971) ≈ 0.246$
6. In the same way we can compute the information gain of the other features:
   1. For the Temp feature, let $m=1$ be Hot, $m=2$ be Mild, and $m=3$ be Cool.
   $\text{Cross-Entropy}(\text{Temp=Hot}) = Ent(D^1) = -[\hat{p}_{11}\log{\hat{p}_{11}} + \hat{p}_{12}\log{\hat{p}_{12}}] = -[\frac{2}{4} \log{\frac{2}{4}} + \frac{2}{4} \log{\frac{2}{4}}] = 1$
   $\text{Cross-Entropy}(\text{Temp=Mild}) = Ent(D^2) = -[\frac{4}{6} \log{\frac{4}{6}} + \frac{2}{6} \log{\frac{2}{6}}] ≈ 0.918$
   $\text{Cross-Entropy}(\text{Temp=Cool}) = Ent(D^3) = -[\frac{3}{4} \log{\frac{3}{4}} + \frac{1}{4} \log{\frac{1}{4}}] ≈ 0.811$
   The overall entropy is the same as before, $Ent(D) ≈ 0.94$, so the information gain is:
   $Gain(\text{Temp}) = Ent(D) - \sum_{m=1}^{M=3}{\frac{|D^m|}{|D|}Ent(D^m)} = 0.94 - (\frac{4}{14} \times 1 + \frac{6}{14} \times 0.918 + \frac{4}{14} \times 0.811) ≈ 0.029$
   2. For the Humidity feature, let $m=1$ be High and $m=2$ be Normal.
   $\text{Cross-Entropy}(\text{Humidity=High}) = Ent(D^1) = -[\frac{3}{7} \log{\frac{3}{7}} + \frac{4}{7} \log{\frac{4}{7}}] ≈ 0.985$
   $\text{Cross-Entropy}(\text{Humidity=Normal}) = Ent(D^2) = -[\frac{6}{7} \log{\frac{6}{7}} + \frac{1}{7} \log{\frac{1}{7}}] ≈ 0.592$
   With $Ent(D) ≈ 0.94$ as before, the information gain is:
   $Gain(\text{Humidity}) = Ent(D) - \sum_{m=1}^{M=2}{\frac{|D^m|}{|D|}Ent(D^m)} = 0.94 - (\frac{7}{14} \times 0.985 + \frac{7}{14} \times 0.592) ≈ 0.152$
   3. For the Wind feature, let $m=1$ be Strong and $m=2$ be Weak.
   $\text{Cross-Entropy}(\text{Wind=Strong}) = Ent(D^1) = -[\frac{3}{6} \log{\frac{3}{6}} + \frac{3}{6} \log{\frac{3}{6}}] = 1$
   $\text{Cross-Entropy}(\text{Wind=Weak}) = Ent(D^2) = -[\frac{6}{8} \log{\frac{6}{8}} + \frac{2}{8} \log{\frac{2}{8}}] ≈ 0.811$
   With $Ent(D) ≈ 0.94$ as before, the information gain is:
   $Gain(\text{Wind}) = Ent(D) - \sum_{m=1}^{M=2}{\frac{|D^m|}{|D|}Ent(D^m)} = 0.94 - (\frac{6}{14} \times 1 + \frac{8}{14} \times 0.811) ≈ 0.048$
7. Summarizing the information gains of the four features:
   * $Gain(\text{Outlook}) ≈ 0.246$
   * $Gain(\text{Temp}) ≈ 0.029$
   * $Gain(\text{Humidity}) ≈ 0.152$
   * $Gain(\text{Wind}) ≈ 0.048$

   Outlook has the largest information gain, so it is chosen as the splitting feature at the first (root) node. (The short sketch below recomputes these gains directly from the class counts.)
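To double-check the arithmetic, here is a small sketch (mine, not part of the original notes) that recomputes the information gains directly from the per-category Yes/No counts quoted in the worked example above.

```Python
import numpy as np

def entropy(counts):
    """Base-2 entropy of a list of class counts, with the 0*log(0) = 0 convention."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def information_gain(branch_counts):
    """Gain(D, a) as in (1.11): parent entropy minus weighted branch entropies."""
    total = np.sum([np.sum(c) for c in branch_counts])
    parent = entropy(np.sum(branch_counts, axis=0))          # class counts of D
    weighted = sum(np.sum(c) / total * entropy(c) for c in branch_counts)
    return parent - weighted

# (Yes, No) counts per category, as tabulated in the example above
outlook  = [(2, 3), (4, 0), (3, 2)]   # Sunny, Overcast, Rain
temp     = [(2, 2), (4, 2), (3, 1)]   # Hot, Mild, Cool
humidity = [(3, 4), (6, 1)]           # High, Normal
wind     = [(3, 3), (6, 2)]           # Strong, Weak

for name, counts in [('Outlook', outlook), ('Temp', temp),
                     ('Humidity', humidity), ('Wind', wind)]:
    print(name, round(information_gain(counts), 3))
# prints roughly: Outlook 0.247, Temp 0.029, Humidity 0.152, Wind 0.048
# (tiny differences from the hand calculation above come from rounding)
```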
### 1.2.3 Other Practical Issues

#### 1.2.3.1 Categorical Predictors

When splitting a categorical predictor having $q$ unordered values, there are $2^{q-1}-1$ possible ways of partitioning the $q$ values into two groups, and this number becomes prohibitively large as $q$ grows. For a $0-1$ outcome, however, the computation simplifies: we order the predictor's categories by the proportion of observations falling in outcome class 1, and then split the predictor as if it were an ordered predictor, as illustrated in the sketch below.
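A minimal illustration of this ordering trick (my own sketch, with made-up category labels and data): for a binary 0-1 outcome, sort the categories by their proportion of 1s and then score only the $q-1$ splits along that ordering instead of all $2^{q-1}-1$ subsets.

```Python
import pandas as pd

# Hypothetical data: a categorical predictor with q = 4 levels and a 0-1 outcome
df = pd.DataFrame({
    'color':   ['red', 'red', 'blue', 'blue', 'green',
                'green', 'yellow', 'yellow', 'red', 'green'],
    'outcome': [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
})

# Order the categories by the proportion of outcome == 1
proportions = df.groupby('color')['outcome'].mean().sort_values()
ordered_levels = list(proportions.index)
print(proportions)   # for this toy data: green < yellow < red < blue

# Treat the predictor as ordered: only q - 1 candidate splits need to be scored,
# each sending a prefix of the ordering to the left child.
candidate_splits = [(ordered_levels[:i], ordered_levels[i:])
                    for i in range(1, len(ordered_levels))]
for left, right in candidate_splits:
    print('left:', left, '| right:', right)
```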
## 2.1 Now for the Code

#### Note: please first download the iris data [here](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data), change the downloaded file's extension to '.csv', and move it into the current folder so that Python can read it. (The comments are verbose, but they should help you understand the code.)

```Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
from pprint import pprint


# Part 1: Load Iris data:
# The format of the data should be Pandas.DataFrame
df = pd.read_csv('iris.csv', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'label'])


# Part 2: Train and Test Split
# Next, we should split the Iris dataset into training and testing sets.
# We'll define our own "split_train_test" function instead of using a predefined API
def split_train_test(df, test_size):
    # df --> raw Iris data
    # test_size --> the size of the testing set. It can be float or int type
    # When test_size is a float ranging from 0 to 1, it means the proportion of the testing set selected from the raw Iris data
    # When test_size is an int, it means the size of the test set is "test_size"
    # Return --> a "train_df" and a "test_df", both Pandas.DataFrame
    if isinstance(test_size, float):  # if test_size is float type
        test_size = round(test_size * len(df))
        # e.g., if we have 150 records in the Iris data set and test_size is 0.83, then round(150 * 0.83) = round(124.5) = 125

    indices = df.index.tolist()  # retrieve indices from the Iris DataFrame
    test_indices = random.sample(population=indices, k=test_size)  # then use "test_size" to perform random sampling from the Iris DataFrame

    test_df = df.loc[test_indices]    # the selected indices belong to the testing set
    train_df = df.drop(test_indices)  # the indices not selected belong to the training set

    return train_df, test_df

# Then we can set a random seed, run "split_train_test" and get "train_df" and "test_df"
random.seed(100)
train_df, test_df = split_train_test(df, 20)  # we select 20 records as the testing set


# Part 3: Helper Functions
# Part 3.1: Helper Function -- "check_purity"
# If the dataset only has 1 class, then this data is called "Pure". If the number of classes is more than 1, then it's "Not Pure"
# Now, the first helper function is to check the purity of our dataset.
def check_purity(data):
    label_column = data[:, -1]
    unique_classes = np.unique(label_column)

    # If the dataset is pure, return True; otherwise return False
    if len(unique_classes) == 1:
        return True
    else:
        return False


# Part 3.2: Helper Function -- "classify_data"
# classify the data
def classify_data(data):
    label_column = data[:, -1]
    unique_classes, count_unique_classes = np.unique(label_column, return_counts=True)
    # the previous line finds the unique classes in the label column and how many records belong to each class

    index = count_unique_classes.argmax()  # find the index of the class that has the most records
    classification = unique_classes[index]

    return classification


# Part 3.3: Helper Function -- "find_potential_splits"
# In this function, we should traverse all records to find the potential splitting points
# What is a splitting point? -- For a continuous variable, for example, if the unique values of "petal_length" are: "1.0, 1.1, 1.2, ..."
# then the potential splitting points will be the average value of each two nearby values: "1.05, 1.15, ..."
# We'll create a dictionary called "potential_splits" to store the potential splitting points.
def find_potential_splits(data):
    potential_splits = {}
    i, n_columns = data.shape
    for index_column in range(n_columns - 1):
        potential_splits[index_column] = []
        values = data[:, index_column]
        unique_values = np.unique(values)

        for index in range(len(unique_values) - 1):
            # Why -1? -- Because the number of intervals between N numbers is N-1.
            current_value = unique_values[index]
            next_value = unique_values[index + 1]
            potential_split = (current_value + next_value) / 2  # "potential_split" is one of the potential splits
            potential_splits[index_column].append(potential_split)

    return potential_splits  # finally we return the dictionary that contains all potential splitting points

# To help you understand how "find_potential_splits" works, you can decomment the following lines to see how this function works.
##########################################################################################################################################################
# potential_splits = find_potential_splits(train_df.values)
# sns.lmplot(data=train_df, x='petal_width', y='petal_length', hue='label',
#            fit_reg=False, height=5, aspect=1)  # In newer seaborn versions (0.9+) the argument is "height"; older versions use "size"
# plt.vlines(x=potential_splits[3], ymin=0.5, ymax=7.5)
# plt.show()
##########################################################################################################################################################
# The previous lines show the potential splitting points of "petal_width".

# The following lines will show the potential splitting points of "petal_length":
##########################################################################################################################################################
# potential_splits = find_potential_splits(train_df.values)
# sns.lmplot(data=train_df, x='petal_width', y='petal_length', hue='label',
#            fit_reg=False, height=5, aspect=1)
# plt.hlines(y=potential_splits[2], xmin=0, xmax=2.5)
# plt.show()
##########################################################################################################################################################


# Part 4: Split data -- "split_data"
# Since we've found all potential splitting points, we can define a function to split our training data
# Remember, in a decision tree, if we want to make a split at a node, there are 2 things we should decide:
# First, which feature (independent variable) should be selected to split on. Say this feature is f1.
# Second, once the feature f1 is selected, which potential split should be selected.
# We will talk about how these 2 things are determined later. The "split_data" function should return two datasets:
# one in which all values of feature f1 are smaller than or equal to the split value,
# and one in which all values of feature f1 are greater than the split value.
def split_data(data, split_column, split_value):
    split_column_values = data[:, split_column]

    data_greater = data[split_column_values > split_value]
    data_smaller = data[split_column_values <= split_value]

    return data_greater, data_smaller

# If you're careful enough, you'll find that if "petal_width" < 0.8 (approximately), all records are "setosa".
# We'll utilize this insight to check if our functions work correctly: (you can decomment the following 9 lines to help you understand)
##########################################################################################################################################################
# split_column = 3
# split_value = 0.8  # we set the split value to 0.8. You'll see all records with "petal_width" < 0.8 are "setosa".
# data_greater, data_smaller = split_data(train_df.values, split_column, split_value)
# plot_df = pd.DataFrame(data_smaller, columns=df.columns)
# sns.lmplot(data=plot_df, x='petal_width', y='petal_length', hue='label',
#            fit_reg=False, height=5, aspect=1)
# plt.vlines(x=split_value, ymin=0.5, ymax=7.5)
# plt.xlim(0, 3)
# plt.show()
##########################################################################################################################################################


# Part 5: Find the smallest cross-entropy -- "compute_entropy", "compute_cross_entropy"
# We mentioned 2 things that should be determined every time we want to make a split in a decision tree.
# Now it's time to compute the cross entropy and find the smallest entropy.
# The split that minimizes the cross entropy is the best split.
def compute_entropy(data):
    label_column = data[:, -1]
    i, count = np.unique(label_column, return_counts=True)

    probabilities = count / count.sum()
    entropy = sum(probabilities * -np.log2(probabilities))

    return entropy


def compute_cross_entropy(data_greater, data_smaller):
    count_data_points = len(data_smaller) + len(data_greater)

    probability_data_greater = len(data_greater) / count_data_points
    probability_data_smaller = len(data_smaller) / count_data_points

    cross_entropy = (probability_data_greater * compute_entropy(data_greater)
                     + probability_data_smaller * compute_entropy(data_smaller))

    return cross_entropy

# Take "petal_width" < 0.8 for example. "petal_width" will be the feature to be split and the splitting point is 0.8.
##########################################################################################################################################################
# To see an example of how the cross entropy is calculated, decomment the following 4 lines:
# split_column = 3   # index 3 is the "petal_width" column in our dataset.
# split_value = 0.8  # we set the split value to 0.8. You'll see all records with "petal_width" < 0.8 are "setosa".
# data_greater, data_smaller = split_data(train_df.values, split_column, split_value)
# print(compute_cross_entropy(data_greater, data_smaller))
##########################################################################################################################################################

# With the previous functions, we can easily find the optimal splitting points by traversing the dataset.
# The "feature-splitting point" pair that minimizes the cross entropy will be the optimal feature-splitting point.
def find_optimal_split(data, potential_splits):
    cross_entropy = float('inf')  # we initialize a very large number and then replace it with smaller cross entropy values
    for column_index in potential_splits:  # traverse all features
        for split_value in potential_splits[column_index]:  # traverse all splitting points
            data_greater, data_smaller = split_data(data, split_column=column_index, split_value=split_value)
            current_cross_entropy = compute_cross_entropy(data_greater=data_greater, data_smaller=data_smaller)

            if current_cross_entropy <= cross_entropy:
                cross_entropy = current_cross_entropy
                optimal_split_column = column_index
                optimal_split_value = split_value

    return optimal_split_column, optimal_split_value

# Not surprisingly, one of the optimal feature-splitting point pairs should be "petal_width-0.8" (or "3-0.8", because the index of petal_width is 3)
# You can check the result using the 2 lines below:
##########################################################################################################################################################
# potential_splits = find_potential_splits(train_df.values)
# print(find_optimal_split(train_df.values, potential_splits))
##########################################################################################################################################################


# Part 6: Decision Tree Algorithm -- "build_decision_tree"
# Now it's time to build our decision tree algorithm.
def build_decision_tree(df, counter=0, min_sample=2, max_depth=5):
    # df --> Pandas.DataFrame type. Our dataset
    # counter --> current depth of the tree
    # min_sample --> the minimum number of records of the dataset
    # max_depth --> the maximum depth of the decision tree
    if counter == 0:
        # when we are at the first layer (root layer), we should convert the Pandas.DataFrame data to a numpy array
        global COLUMN_HEADERS
        COLUMN_HEADERS = df.columns
        data = df.values
    else:
        # otherwise, if we are at the other layers, then no conversion is needed
        data = df

    # base case
    # If data is pure OR data size < min_sample OR we reached the maximum depth, then we hit the base case:
    if (check_purity(data)) or (len(data) < min_sample) or (counter == max_depth):
        classification = classify_data(data)  # then we should classify our data
        return classification

    else:
        # otherwise we keep growing our decision tree
        counter += 1

        potential_splits = find_potential_splits(data)
        split_column, split_value = find_optimal_split(data, potential_splits)
        data_greater, data_smaller = split_data(data, split_column=split_column, split_value=split_value)

        # instantiate subtree:
        feature_name = COLUMN_HEADERS[split_column]
        decision = "{} <= {}".format(feature_name, split_value)
        subtree = {decision: []}

        # find the answer (recursive part)
        no_answer = build_decision_tree(data_greater, counter, min_sample, max_depth)
        yes_answer = build_decision_tree(data_smaller, counter, min_sample, max_depth)

        if yes_answer == no_answer:
            subtree = yes_answer
        else:
            subtree[decision].append(yes_answer)
            subtree[decision].append(no_answer)

        return subtree

# To help you understand how this recursive function works, decomment the following code block.
##########################################################################################################################################################
# atree = build_decision_tree(train_df, max_depth=2)
# pprint(atree)
##########################################################################################################################################################


# Part 7: Make Classification -- "make_classification"
# We already have the "build_decision_tree" function to grow a decision tree.
# Now we're going to make classifications using the decision tree we built.
# In this part, the "make_classification" function takes a tree returned by "build_decision_tree", makes a prediction and returns the "label" it predicts.
def make_classification(record, tree):
    # record --> an example data point we will use to test our "build_decision_tree" function
    '''
    A possible value of "record":
    sepal_length                6.5
    sepal_width                   3
    petal_length                5.5
    petal_width                 1.8
    label            Iris-virginica
    Name: 116, dtype: object
    '''
    # tree --> a tree that is given by the "build_decision_tree" function
    decision = list(tree.keys())[0]  # an example value of "decision" might be 'petal_width <= 0.8'
    column_name, comparison, split_value = decision.split()
    # Once split, we'll get 3 values: "petal_width", "<=", and "0.8", respectively.
    # So we can use column_name as an index to retrieve the record's value of the 'petal_width' feature (1.8 in the example above).
    # Then, if this value <= split_value, we follow the "yes" branch; otherwise the "no" branch.
    if record[column_name] <= float(split_value):
        answer = tree[decision][0]
    else:
        answer = tree[decision][1]

    # The base case:
    if not isinstance(answer, dict):
        # if answer is not a dict, we have reached a leaf node, so we return the answer
        return answer
    else:
        # otherwise we continue the recursive part
        remained_tree = answer
        return make_classification(record, remained_tree)

# Again, you can decomment the following code block to see an example.
##########################################################################################################################################################
# record = test_df.iloc[2]
# tree = build_decision_tree(train_df, max_depth=2)
# print('Predicted Label: ', make_classification(record, tree))
# print('Actual Label:    ', record['label'])
##########################################################################################################################################################
# You can see the Predicted and Actual Labels are the same, meaning our "build_decision_tree" function works well


# Part 8: Calculate Accuracy -- "calculate_accuracy"
# This part is easy. For a classification problem, accuracy = number of correctly classified records / total number of records
def calculate_accuracy(df, tree):
    df['classification'] = df.apply(make_classification, axis=1, args=(tree,))
    df['correctly_classified'] = df['classification'] == df['label']

    accuracy = df['correctly_classified'].mean()

    return accuracy

# Let's use our test_df to see the accuracy of our decision tree model:
##########################################################################################################################################################
# tree = build_decision_tree(train_df, max_depth=2)
# accuracy = calculate_accuracy(test_df, tree)
# print('Accuracy is: ', accuracy)
##########################################################################################################################################################
```
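As an optional sanity check (not part of the original code), the same task can be handed to scikit-learn's `DecisionTreeClassifier`, which grows a CART-style tree like the one described above. Using the same `train_df`/`test_df` split and `criterion='entropy'`, its accuracy should be comparable to that of our hand-rolled tree; the parameter choices below are illustrative.

```Python
# A cross-check with scikit-learn (assumes train_df and test_df from the code above)
from sklearn.tree import DecisionTreeClassifier

feature_columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

sk_tree = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=0)
sk_tree.fit(train_df[feature_columns], train_df['label'])

print('scikit-learn accuracy:', sk_tree.score(test_df[feature_columns], test_df['label']))
```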