# 【Big Data Processing II - Hadoop & Spark : Introduction to Information-based Learning】
## 1. Information-based Learning
- To build predictive models using only the most informative features.
- An informative feature is a descriptive feature whose values split the instances in the dataset into homogeneous sets with respect to the target feature value.
1. Information-based Learning (Decision Tree)
2. Similarity-based Learning (K-NN)
3. Error-based Learning (Neural Network, SVM)
4. Probability-based Learning (Naïve Bayes)
<br>
## 2. Basic Idea : Learning Rules from Data
| ID | SUSPICIOUS WORDS | UNKNOWN SENDER | CONTAINS IMAGES | CLASS |
| :--: | :--: | :--: | :--: | :--: |
| - | most useful | - | least useful | - |
| 376 | true | false | true | spam |
| 489 | true | true | false | spam |
| 541 | true | true | false | spam |
| 693 | false | true | true | ham |
| 782 | false | false | false | ham |
| 976 | false | false | false | ham |
> Table: An email spam prediction dataset.
- Informative features are those that split the dataset into pure sets with respect to the target class.
- All we need for this is a computational measure of the purity of a set: entropy
<center>
<img src="https://hackmd.io/_uploads/HyVyFjHVJg.png",
style="
width: 70%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
#### 2.1 Entropy/Information
- The minimum number of bits needed, on average, to encode an outcome.
- Shannon’s entropy model defines a computational measure of the impurity of the elements of a set.
- An easy way to understand the entropy of a set is to think in terms of the uncertainty associated with guessing the result if you were to make a random selection from the set.
- Entropy is related to the probability of an outcome. (It can be understood as the average number of bits needed to encode the elements of the set.)
1. High probability → Low entropy
2. Low probability → High entropy
\begin{equation}
H = \sum_{i=1}^{N} p_i log_{2}(\frac{1}{p_i})
\end{equation}
:::success
**Key notes**:
- $p_i$: the probability that the item occurs.
- $1/p_i$: rarer outcomes convey more information, so outcomes with lower probability are represented with more bits.
:::
- Example: Bucket
<center>
<img src="https://hackmd.io/_uploads/By3jFoB4yl.png",
style="
width: 90%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
\begin{align}
\text{Bucket1}: H_1 &= 1 \times \log1 = 0 \\
\text{Bucket2}: H_2 &= \frac{1}{2} \log2 + \frac{1}{4} \log4 + \frac{1}{8} \log8 + \frac{1}{8} \log8 = 1.75 \\
\text{Bucket3}: H_3 &= \frac{1}{4} \log4 + \frac{1}{4} \log4 + \frac{1}{4} \log4 + \frac{1}{4} \log4 = 2.00
\end{align}
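These bucket entropies can be reproduced with a few lines of Python (a minimal sketch, not from the slides, using the entropy formula above):
```python
import math

def entropy(probs):
    """Shannon entropy in bits: sum of p * log2(1/p) over non-zero p."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([1.0]))                    # Bucket1: 0.0
print(entropy([1/2, 1/4, 1/8, 1/8]))     # Bucket2: 1.75
print(entropy([1/4, 1/4, 1/4, 1/4]))     # Bucket3: 2.0
```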
<center>
<img src="https://hackmd.io/_uploads/rkEU0HHEyg.png",
style="
width: 70%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
:::success
**Key notes**:
- $H(\text{card})(\text{c}) = \frac{6}{12}\log\frac{12}{6} + \frac{6}{12}\log\frac{12}{6} = 1$
- $H(\text{card})(\text{d}) = \frac{6}{12}\log\frac{12}{6} + \frac{3}{12}\log\frac{12}{3} + \frac{3}{12}\log\frac{12}{3} = 1.5$
:::
- Shannon’s model of entropy is a weighted sum of the logs of the probabilities of each of the possible outcomes when we make a random selection from a set.
\begin{equation}
H(t) = -\sum_{i=1}^{l}(P(t=i) \times \log_{2}{P(t=i)})
\end{equation}
\begin{align}
H(card) &= -\sum_{i=1}^{52}(P(card=i) \times log_{2}{P(card=i)}) \\
&= -\sum_{i=1}^{52}(0.019 \times log_{2}(0.019)) = -\sum_{i=1}^{52} -0.1096 \\
&= 5.700 \text{bits}
\end{align}
<br>
## 3. Email Spam Prediction Dataset
| ID | SUSPICIOUS WORDS | UNKNOWN SENDER | CONTAINS IMAGES | CLASS |
| :--: | :--: | :--: | :--: | :--: |
| 376 | true | false | true | spam |
| 489 | true | true | false | spam |
| 541 | true | true | false | spam |
| 693 | false | true | true | ham |
| 782 | false | false | false | ham |
| 976 | false | false | false | ham |
## 3.1 Intuition
Our intuition is that the ideal feature will partition the data into pure subsets where all the instances in each subset have the same classification.
- SUSPICIOUS WORDS: a perfect split.
- UNKNOWN SENDER: a mixture, but some information (when ’true’ most instances are ’spam’).
- CONTAINS IMAGES: no information.
<center>
<img src="https://hackmd.io/_uploads/rkr4hrrVyx.png",
style="
width: 90%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
One way to implement this idea is to use a metric called information gain.
- Amount of information before the split $\rightarrow$ amount of information after the split
- The information gain of a feature can be understood as a measure of the reduction in the overall entropy of a prediction task by testing on that feature.
- Computing information gain involves the following three equations:
\begin{align}
H(t, D) &= -\sum_{l \in levels(t)} (P(t=l) \times log_2(P(t=l))) \\
rem(d, D) &= \sum_{l \in levels(d)} \frac{|D_{d=l}|}{|D|} \times H(t, D_{d=l}) \\
IG(d, D) &= H(t, D) - rem(d, D)
\end{align}
:::success
**Key notes**:
- $\frac{|D_{d=l}|}{|D|}$: Weighting
- $H(t, D_{d=l})$: entropy of partition $D_{d=l}$
:::
## 3.2 Compute Entropy for Original Data Set
\begin{align}
H(t, D) &= -\sum_{l \in \{\text{'spam', 'ham'}\}} (P(t=l) \times \log_2(P(t=l))) \\
&= -\big((P(t=\text{'spam'}) \times \log_2(P(t=\text{'spam'}))) + (P(t=\text{'ham'}) \times \log_2(P(t=\text{'ham'})))\big) \\
&= -\big((\frac{3}{6} \times \log_2(\frac{3}{6})) + (\frac{3}{6} \times \log_2(\frac{3}{6}))\big) \\
&= 1 \text{ bit}
\end{align}
## 3.3 Compute Remaining Entropy after Partitioning
| ID | SUSPICIOUS WORDS | UNKNOWN SENDER | CONTAINS IMAGES | CLASS |
| :--: | :--: | :--: | :--: | :--: |
| 376 | true | false | true | spam |
| 489 | true | true | false | spam |
| 541 | true | true | false | spam |
| 693 | false | true | true | ham |
| 782 | false | false | false | ham |
| 976 | false | false | false | ham |
\begin{align}
rem(WORDS, D) &= (\frac{|D_{WORDS = T}|}{|D|} \times H(t, D_{WORDS=T})) + (\frac{|D_{WORDS = F}|}{|D|} \times H(t, D_{WORDS=F})) \\
&= \Big(\frac{3}{6} \times \big(-\sum_{l \in \{\text{'spam', 'ham'}\}} P(t=l) \times \log_2(P(t=l))\big)\Big) \\
&+ \Big(\frac{3}{6} \times \big(-\sum_{l \in \{\text{'spam', 'ham'}\}} P(t=l) \times \log_2(P(t=l))\big)\Big) \\
&= \Big(\frac{3}{6} \times -\big(\frac{3}{3} \times \log_2(\frac{3}{3}) + \frac{0}{3} \times \log_2(\frac{0}{3})\big)\Big) \\
&+ \Big(\frac{3}{6} \times -\big(\frac{0}{3} \times \log_2(\frac{0}{3}) + \frac{3}{3} \times \log_2(\frac{3}{3})\big)\Big) \\
&= 0 \text{ bits}
\end{align}
\begin{align}
rem(SENDER, D) &= (\frac{|D_{SENDER = T}|}{|D|} \times H(t, D_{SENDER=T})) + (\frac{|D_{SENDER = F}|}{|D|} \times H(t, D_{SENDER=F})) \\
&= \Big(\frac{3}{6} \times \big(-\sum_{l \in \{\text{'spam', 'ham'}\}} P(t=l) \times \log_2(P(t=l))\big)\Big) \\
&+ \Big(\frac{3}{6} \times \big(-\sum_{l \in \{\text{'spam', 'ham'}\}} P(t=l) \times \log_2(P(t=l))\big)\Big) \\
&= \Big(\frac{3}{6} \times -\big(\frac{2}{3} \times \log_2(\frac{2}{3}) + \frac{1}{3} \times \log_2(\frac{1}{3})\big)\Big) \\
&+ \Big(\frac{3}{6} \times -\big(\frac{1}{3} \times \log_2(\frac{1}{3}) + \frac{2}{3} \times \log_2(\frac{2}{3})\big)\Big) \\
&= 0.9183 \text{ bits}
\end{align}
\begin{align}
rem(IMAGE, D) &= (\frac{|D_{IMAGE = T}|}{|D|} \times H(t, D_{IMAGE=T})) + (\frac{|D_{IMAGE = F}|}{|D|} \times H(t, D_{IMAGE=F})) \\
&= \Big(\frac{2}{6} \times \big(-\sum_{l \in \{\text{'spam', 'ham'}\}} P(t=l) \times \log_2(P(t=l))\big)\Big) \\
&+ \Big(\frac{4}{6} \times \big(-\sum_{l \in \{\text{'spam', 'ham'}\}} P(t=l) \times \log_2(P(t=l))\big)\Big) \\
&= \Big(\frac{2}{6} \times -\big(\frac{1}{2} \times \log_2(\frac{1}{2}) + \frac{1}{2} \times \log_2(\frac{1}{2})\big)\Big) \\
&+ \Big(\frac{4}{6} \times -\big(\frac{2}{4} \times \log_2(\frac{2}{4}) + \frac{2}{4} \times \log_2(\frac{2}{4})\big)\Big) \\
&= 1 \text{ bit}
\end{align}
## 3.4 Calculate the information gain for the three descriptive features in the dataset.
\begin{equation}
IG(d, D) = H(t, D) - rem(d, D)
\end{equation}
\begin{align}
IG(\text{SUSPICIOUS WORDS}, D) &= H(CLASS, D) - rem(\text{SUSPICIOUS WORDS}, D) \\
&= 1 - 0 = 1 \text{bits} \\
\\
IG(\text{UNKNOWN SENDER}, D) &= H(CLASS, D) - rem(\text{UNKNOWN SENDER}, D) \\
&= 1 - 0.9183 = 0.0817 \text{bits} \\
\\
IG(\text{CONTAINS IMAGES}, D) &= H(CLASS, D) - rem(\text{CONTAINS IMAGES}, D) \\
&= 1 - 1 = 0 \text{bits} \\
\end{align}
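The worked example above can be checked with a short, self-contained Python sketch (my own illustration; the rows follow the table):
```python
import math
from collections import Counter

# (SUSPICIOUS WORDS, UNKNOWN SENDER, CONTAINS IMAGES, CLASS)
data = [
    (True,  False, True,  "spam"),
    (True,  True,  False, "spam"),
    (True,  True,  False, "spam"),
    (False, True,  True,  "ham"),
    (False, False, False, "ham"),
    (False, False, False, "ham"),
]

def entropy(rows):
    """H(t, D): entropy of the target (last column) in bits."""
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def rem(rows, idx):
    """rem(d, D): weighted entropy remaining after partitioning on feature idx."""
    n, total = len(rows), 0.0
    for level in set(r[idx] for r in rows):
        part = [r for r in rows if r[idx] == level]
        total += (len(part) / n) * entropy(part)
    return total

for idx, name in enumerate(["SUSPICIOUS WORDS", "UNKNOWN SENDER", "CONTAINS IMAGES"]):
    print(name, round(entropy(data) - rem(data, idx), 4))
# SUSPICIOUS WORDS 1.0 / UNKNOWN SENDER 0.0817 / CONTAINS IMAGES 0.0
```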
<br>
## 4. Outline
**Basic Idea**: Entropy, Information Gain
**Decision Tree Overview**: Recursively apply the basic idea to a large data set (the result is a fixed set of rules that is easy to interpret).
## 4.1 Algorithm
**ID3 Algorithm (Iterative Dichotomizer 3)** : The ID3 algorithm builds the tree in a recursive, depth-first manner, beginning at the root node and working down to the leaf nodes.
**Decision Tree / Rule Learning**: A decision tree consists of:
- A root node (or starting node)
- Interior nodes
- Leaf nodes (or terminating nodes)
<center>
<img src="https://hackmd.io/_uploads/BkYuzsrVkl.png",
style="
width: 90%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
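Below is a compact sketch of the ID3 recursion just described (a simplified illustration of my own, not the course's reference implementation): pick the feature with the highest information gain, split the dataset on its levels, and recurse until a partition is pure or no features remain. The demo reuses the spam table from Sections 2-3, with shortened feature names.
```python
import math
from collections import Counter

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(rows, feature, target):
    n, rem = len(rows), 0.0
    for level in set(r[feature] for r in rows):
        part = [r for r in rows if r[feature] == level]
        rem += (len(part) / n) * entropy(part, target)
    return entropy(rows, target) - rem

def id3(rows, features, target):
    labels = [r[target] for r in rows]
    # Leaf node: the partition is pure, or there is nothing left to split on.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, f, target))
    rest = [f for f in features if f != best]
    return {(best, level): id3([r for r in rows if r[best] == level], rest, target)
            for level in set(r[best] for r in rows)}

rows = [
    {"WORDS": True,  "SENDER": False, "IMAGES": True,  "CLASS": "spam"},
    {"WORDS": True,  "SENDER": True,  "IMAGES": False, "CLASS": "spam"},
    {"WORDS": True,  "SENDER": True,  "IMAGES": False, "CLASS": "spam"},
    {"WORDS": False, "SENDER": True,  "IMAGES": True,  "CLASS": "ham"},
    {"WORDS": False, "SENDER": False, "IMAGES": False, "CLASS": "ham"},
    {"WORDS": False, "SENDER": False, "IMAGES": False, "CLASS": "ham"},
]
print(id3(rows, ["WORDS", "SENDER", "IMAGES"], "CLASS"))
# e.g. {('WORDS', False): 'ham', ('WORDS', True): 'spam'}
```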
## 4.2 Vegetation Classification Dataset
<center>
| ID | STREAM | SLOPE | ELEVATION | VEGETATION |
| :--: | :--: | :--: | :--: | :--: |
| 1 | false | steep | high | chaparral |
| 2 | true | moderate | low | riparian |
| 3 | true | steep | medium | riparian |
| 4 | false | steep | medium | chaparral |
| 5 | false | flat | high | conifer |
| 6 | true | steep | highest | conifer |
| 7 | true | steep | high | chaparral |
</center>
- If we select “Elevation”, we get the following partition:
<center>
<img src="https://hackmd.io/_uploads/S1LXniSV1x.png",
style="
width: 90%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
\begin{align}
H(VEGETATION, D) &= -\sum_{l \in
\begin{Bmatrix}
\text{'chaparral'} \\
\text{'riparian'} \\
\text{'conifer'}
\end{Bmatrix}}
\big(P(VEGETATION=l) \times \log_2(P(VEGETATION=l))\big) \\
&= -\big((\frac{3}{7} \times \log_2(\frac{3}{7})) + (\frac{2}{7} \times \log_2(\frac{2}{7})) + (\frac{2}{7} \times \log_2(\frac{2}{7}))\big) \\
&= 1.5567 \text{ bits}
\end{align}
\begin{align}
H(VEGETATION, D_6) &= -\sum_{l \in
\begin{Bmatrix}
\text{'chaparral'} \\
\text{'riparian'} \\
\text{'conifer'}
\end{Bmatrix}}
\big(P(VEGETATION=l) \times \log_2(P(VEGETATION=l))\big) \\
&= -\big(\frac{1}{1} \times \log_2(\frac{1}{1})\big) \\
&= 0 \text{ bits}
\end{align}
\begin{align}
H(VEGETATION, D_7) &= -\sum_{l \in
\begin{Bmatrix}
\text{'chaparral'} \\
\text{'riparian'} \\
\text{'conifer'}
\end{Bmatrix}}
\big(P(VEGETATION=l) \times \log_2(P(VEGETATION=l))\big) \\
&= -\big((\frac{1}{2} \times \log_2(\frac{1}{2})) + (\frac{1}{2} \times \log_2(\frac{1}{2})) + (\frac{0}{2} \times \log_2(\frac{0}{2}))\big) \\
&= 1.0 \text{ bits}
\end{align}
\begin{align}
H(VEGETATION, D_8) &= -\sum_{l \in
\begin{Bmatrix}
\text{'chaparral'} \\
\text{'riparian'} \\
\text{'conifer'}
\end{Bmatrix}}
\big(P(VEGETATION=l) \times \log_2(P(VEGETATION=l))\big) \\
&= -\big((\frac{2}{3} \times \log_2(\frac{2}{3})) + (\frac{0}{3} \times \log_2(\frac{0}{3})) + (\frac{1}{3} \times \log_2(\frac{1}{3}))\big) \\
&= 0.9183 \text{ bits}
\end{align}
<center>
| ID | STREAM | SLOPE | ELEVATION | VEGETATION |
| :--: | :--: | :--: | :--: | :--: |
| 1 | false | steep | high | chaparral |
| 2 | true | moderate | low | riparian |
| 3 | true | steep | medium | riparian |
| 4 | false | steep | medium | chaparral |
| 5 | false | flat | high | conifer |
| 6 | true | steep | highest | conifer |
| 7 | true | steep | high | chaparral |
</center>
<center>
<img src="https://hackmd.io/_uploads/Sy7myar4yg.png",
style="
width: 90%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
:::success
**Key note**: From the table above, ELEVATION has the largest information gain among the features, so it gives the best split.
:::
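To tie the numbers together, here is a small Python check (my own; it assumes, following the figure, that $D_6$, $D_7$, and $D_8$ are the "low", "medium", and "high" ELEVATION partitions):
```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# ELEVATION level -> VEGETATION labels of the instances in that partition
partitions = {
    "low":     ["riparian"],                           # D_6: ID 2
    "medium":  ["riparian", "chaparral"],              # D_7: IDs 3, 4
    "high":    ["chaparral", "conifer", "chaparral"],  # D_8: IDs 1, 5, 7
    "highest": ["conifer"],                            # ID 6
}
all_labels = [v for part in partitions.values() for v in part]

H_D = entropy(all_labels)
remaining = sum(len(p) / len(all_labels) * entropy(p) for p in partitions.values())
print(round(H_D, 4), round(remaining, 4), round(H_D - remaining, 4))
# 1.5567 0.6793 0.8774  -> IG(ELEVATION, D) = 0.8774 bits
```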
<center>
<img src="https://hackmd.io/_uploads/S12XgTBEJg.png",
style="
width: 70%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
<center>
<img src="https://hackmd.io/_uploads/SkgbEx6rN1g.png",
style="
width: 90%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
#### 4.2.1 The Resultant Decision Tree
<center>
<img src="https://hackmd.io/_uploads/rJ0UepSNkg.png",
style="
width: 70%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
**Inference**
<center>
<img src="https://hackmd.io/_uploads/SyY2lTrNye.png",
style="
width: 70%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
#### 4.3 Decision Tree Applications
- Will air pollution hit hazardous ("purple alert") levels?
- Will a date be a success?
- Should the bank grant you a loan?
- Will a telecom customer renew the contract?
- Is this email spam?
- Would you have died in the Titanic disaster?
- A handwritten-digit recognition dataset?
<center>
<img src="https://hackmd.io/_uploads/BJyyWTB4kl.png",
style="
width: 90%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
## 5. Advanced Issues
#### 5.1 Impurity Measure
##### 5.1.1 Entropy
<center>
<img src="https://hackmd.io/_uploads/Bybt8pHVJg.png",
style="
width: 70%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
##### 5.1.2 Information Gain Ratio
- Information gain (based on entropy) prefers features with many values. (e.g., a two-level feature such as gender versus a many-level one such as city/county)
- One way of addressing this issue is to use **information gain ratio** which is computed by dividing the information gain of a feature by **the amount of information** used to determine the value of the feature
- This helps avoid always selecting features that have a very large number of distinct values.
\begin{equation}
GR(d, D) = \frac{IG(d, D)}{-\sum_{l \in levels(d)}(P(d=l) \times \log_2{P(d=l)})}
\end{equation}
<center>
| ID | STREAM | SLOPE | ELEVATION | VEGETATION |
| :--: | :--: | :--: | :--: | :--: |
| 1 | false | steep | high | chaparral |
| 2 | true | moderate | low | riparian |
| 3 | true | steep | medium | riparian |
| 4 | false | steep | medium | chaparral |
| 5 | false | flat | high | conifer |
| 6 | true | steep | highest | conifer |
| 7 | true | steep | high | chaparral |
</center>
<br>
\begin{align}
H(VEGETATION, D) &= 1.5567 \text{bits} \\
IG(STREAM, D) &= 0.3060 \\
IG(SLOPE, D) &= 0.5774 \\
IG(ELEVATION, D) &= 0.8774 \\
\end{align}
<br>
<center>
<img src="https://hackmd.io/_uploads/Sy7myar4yg.png",
style="
width: 90%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
\begin{align}
H(STREAM, D) &= -\sum_{l \in
\begin{Bmatrix}
\text{'true'} \\
\text{'false'}
\end{Bmatrix}}
\big(P(STREAM=l) \times \log_2(P(STREAM=l))\big) \\
&= -((\frac{4}{7} \times \log_2(\frac{4}{7})) + ((\frac{3}{7} \times \log_2(\frac{3}{7}))) \\
&= 0.9852 \text{bits}
\end{align}
\begin{align}
H(SLOPE, D) &= -\sum_{l \in
\begin{Bmatrix}
\text{'flat'} \\
\text{'moderate'} \\
\text{'steep'}
\end{Bmatrix}}
\big(P(SLOPE=l) \times \log_2(P(SLOPE=l))\big) \\
&= -((\frac{1}{7} \times \log_2(\frac{1}{7})) + ((\frac{1}{7} \times \log_2(\frac{1}{7}))) + ((\frac{5}{7} \times \log_2(\frac{5}{7}))) \\
&= 1.1488 \text{bits}
\end{align}
\begin{align}
H(ELEVATION, D) &= -\sum_{l \in
\begin{Bmatrix}
\text{'low'} \\
\text{'medium'} \\
\text{'high'} \\
\text{'highest'}
\end{Bmatrix}}
\big(P(ELEVATION=l) \times \log_2(P(ELEVATION=l))\big) \\
&= -((\frac{1}{7} \times \log_2(\frac{1}{7})) + ((\frac{2}{7} \times \log_2(\frac{2}{7}))) + ((\frac{3}{7} \times \log_2(\frac{3}{7})) + ((\frac{1}{7} \times \log_2(\frac{1}{7}))) \\
&= 1.8424 \text{bits}
\end{align}
<br>
\begin{align}
GR(STREAM, D) &= \frac{0.3060}{0.9852} = 0.3106 \\
GR(SLOPE, D) &= \frac{0.5774}{1.1488} = 0.5026 \\
GR(ELEVATION, D) &= \frac{0.8774}{1.8424} = 0.4762
\end{align}
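A quick Python check of the gain-ratio numbers (my own sketch; the information-gain values are taken from the slide above rather than recomputed):
```python
import math
from collections import Counter

def feature_entropy(values):
    """H(d, D): entropy of the feature's own value distribution, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

stream    = [False, True, True, False, False, True, True]
slope     = ["steep", "moderate", "steep", "steep", "flat", "steep", "steep"]
elevation = ["high", "low", "medium", "medium", "high", "highest", "high"]

info_gain = {"STREAM": 0.3060, "SLOPE": 0.5774, "ELEVATION": 0.8774}
for name, vals in [("STREAM", stream), ("SLOPE", slope), ("ELEVATION", elevation)]:
    h = feature_entropy(vals)
    print(name, round(h, 4), round(info_gain[name] / h, 4))
# STREAM 0.9852 0.3106 / SLOPE 1.1488 0.5026 / ELEVATION 1.8424 0.4762
```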
<center>
<img src="https://hackmd.io/_uploads/B19YITSE1e.png",
style="
width: 70%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
##### 5.1.3 Gini Index
- The Gini index can be thought of as calculating how often you would misclassify an instance in the dataset if you predicted its class based on the distribution of classifications in the dataset.
\begin{equation}
Gini(t, D) = 1 - \sum_{l \in levels(t)} P(t=l)^2
\end{equation}
:::success
**重點筆記**: Information gain can be calculated using the Gini index by replacing the entropy measure with the Gini index
:::
<center>
| ID | STREAM | SLOPE | ELEVATION | VEGETATION |
| :--: | :--: | :--: | :--: | :--: |
| 1 | false | steep | high | chaparral |
| 2 | true | moderate | low | riparian |
| 3 | true | steep | medium | riparian |
| 4 | false | steep | medium | chaparral |
| 5 | false | flat | high | conifer |
| 6 | true | steep | highest | conifer |
| 7 | true | steep | high | chaparral |
</center>
\begin{align}
Gini(VEGETATION, D) =& 1 - \sum_{l \in
\begin{Bmatrix}
\text{'chaparral'} \\
\text{'riparian'} \\
\text{'conifer'}
\end{Bmatrix}}
P(VEGETATION = l)^2 \\
=& 1 - \big((\frac{3}{7})^2 + (\frac{2}{7})^2 + (\frac{2}{7})^2\big) \\
=& 0.6531
\end{align}
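This Gini value can be verified with a tiny Python sketch (my own):
```python
from collections import Counter

def gini(labels):
    """Gini(t, D) = 1 - sum over levels of P(t = l)^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

vegetation = ["chaparral", "riparian", "riparian", "chaparral",
              "conifer", "conifer", "chaparral"]
print(round(gini(vegetation), 4))  # 0.6531
```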
<center>
<img src="https://hackmd.io/_uploads/HkKGtaBV1e.png",
style="
width: 90%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
#### 5.2 Handling Continuous Features
The easiest way to handle continuous valued features is to turn them into boolean features by defining a threshold and using this threshold to partition the instances based on their value of the continuous feature.
**How do we set the threshold?**
- The instances in the dataset are sorted according to the continuous feature values.
- The adjacent instances in the ordering that have different classifications are then selected as possible threshold points.
- The optimal threshold is found by computing the information gain for each of these classification transition boundaries and selecting the boundary with the highest information gain as the threshold.
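A minimal sketch of this threshold-selection procedure (my own code; the elevation-style values and labels are made up for illustration):
```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try the midpoints between adjacent instances with different labels
    and return the threshold with the highest information gain."""
    pairs = sorted(zip(values, labels))
    h_total = entropy([l for _, l in pairs])
    best = (None, -1.0)
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2:
            continue                      # only classification transitions
        t = (v1 + v2) / 2
        left  = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        rem = (len(left) / len(pairs)) * entropy(left) \
            + (len(right) / len(pairs)) * entropy(right)
        if h_total - rem > best[1]:
            best = (t, h_total - rem)
    return best

# Hypothetical continuous ELEVATION-style values paired with class labels.
print(best_threshold([300, 1200, 2500, 3800, 4450],
                     ["riparian", "riparian", "chaparral", "chaparral", "conifer"]))
# (1850.0, 0.9709...) -> the threshold 1850 gives the highest information gain
```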
<center>
<img src="https://hackmd.io/_uploads/ry7TtarNJg.png",
style="
width: 90%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
- Once a threshold has been set, the dynamically created new boolean feature **has to compete with the other categorical features** for selection as the splitting feature at that node.
<center>
<img src="https://hackmd.io/_uploads/HkifqpH41l.png",
style="
width: 90%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
<center>
<img src="https://hackmd.io/_uploads/rJ-NqpHVke.png",
style="
width: 50%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
<center>
<img src="https://hackmd.io/_uploads/HkFH5pS41l.png",
style="
width: 90%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
#### 5.3 Predicting Continuous Targets
Regression trees are constructed so as to reduce the **variance** in the set of training examples at each of the leaf nodes in the tree.
We can do this by adapting the ID3 algorithm to use **a measure of variance** rather than a measure of classification impurity (entropy) when selecting the best attribute.
- entropy $\rightarrow$ variance
\begin{align}
var(t, D) &= \frac{\sum_{i=1}^{n} (t_i - \bar{t})^2}{n-1} \\
\mathop{\arg\min}\limits_{d \in \mathbf{d}} & \sum_{l \in levels(d)} \frac{|D_{d=l}|}{|D|} \times var(t, D_{d=l})
\end{align}
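A small Python illustration of the variance criterion (my own; the rental counts are made up): the split that groups similar target values together has the lower weighted variance and would be preferred.
```python
def variance(values):
    """Sample variance with the n-1 denominator, as in var(t, D)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else 0.0

def weighted_variance(partitions):
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * variance(p) for p in partitions)

# Hypothetical daily rental counts split by a good feature vs. a random one.
split_a = [[150, 180, 170], [820, 790, 860]]   # low weighted variance
split_b = [[150, 820, 170], [180, 790, 860]]   # high weighted variance
print(round(weighted_variance(split_a), 1), round(weighted_variance(split_b), 1))
# 733.3 142600.0
```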
##### 5.3.1 Bike Rentals per Day
<center>
<img src="https://hackmd.io/_uploads/rygD1XwV1x.png",
style="
width: 90%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
<center>
<img src="https://hackmd.io/_uploads/BkLw1mD41g.png",
style="
width: 90%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
#### 5.4 Ensemble Model
##### 5.4.1 Bagging (Bootstrap Aggregating)
Random forest: the base classifiers are trained independently of one another.
- Bootstrap Sampling
<center>
<img src="https://hackmd.io/_uploads/BJEzxmv4ke.png",
style="
width: 90%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
:::success
**重點筆記**: Each random sample is the same size as the dataset and sampling with replacement is used
:::
- Subspace Sampling
Only uses a randomly selected subset of the features in the dataset
The combination of bagging, subspace sampling, and decision trees is known as a random forest model.
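A rough sketch of how bagging plus subspace sampling would be combined (my own illustration; it assumes NumPy is available and uses a toy dataset X, y — the actual tree training is only indicated in a comment):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))            # toy dataset: 100 instances, 8 features
y = (X[:, 0] + X[:, 3] > 0).astype(int)

n_trees, n_subspace = 25, 3
samples = []
for _ in range(n_trees):
    # Bootstrap sampling: same size as the dataset, with replacement.
    rows = rng.integers(0, len(X), size=len(X))
    # Subspace sampling: each tree only sees a random subset of the features.
    cols = rng.choice(X.shape[1], size=n_subspace, replace=False)
    samples.append((X[rows][:, cols], y[rows], cols))

# Each (X_sample, y_sample, cols) would be used to train one decision tree;
# at prediction time the trees vote (majority vote for classification).
print(len(samples), samples[0][0].shape)   # 25 (100, 3)
```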
<center>
<img src="https://hackmd.io/_uploads/rJuMNQv41g.png",
style="
width: 70%;
height: auto;">
<div style="
border-bottom: 3px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 3px;">
</div>
</center>
Suppose there are 25 base classifiers
- Each classifier has error rate, $\epsilon = 0.35$ (Accuracy = 65%)
- Assume classifiers are independent
- Probability that the ensemble classifier makes a wrong prediction:
:::success
**Key note**: If the number of classifiers (25) is not large, bootstrap samples drawn from the same population differ very little, so the base classifiers end up highly similar (correlated) and the ensemble is not effective.
:::
Binary classification example: exactly 3 of the 5 classifiers (A-E) are wrong
- ABCDE
- xxxoo $\rightarrow \epsilon^3 (1-\epsilon)^2$
- oxxxo $\rightarrow \epsilon^3 (1-\epsilon)^2$
- ooxxx $\rightarrow \epsilon^3 (1-\epsilon)^2$
- In total: $\binom{5}{3} \epsilon^3 (1 - \epsilon)^2 = {}^{5}C_3 \epsilon^3 (1 - \epsilon)^2$

The majority vote is wrong when at least 3 of the 5 classifiers are wrong:
\begin{equation}
\sum_{i = 3}^{5} \binom{5}{i} \epsilon^i (1-\epsilon)^{5-i}
\end{equation}
Example with 25 classifiers: the ensemble is wrong when 13 to 25 of them are wrong
\begin{equation}
\sum_{i = 13}^{25} \binom{25}{i} \epsilon^i (1-\epsilon)^{25-i} = 0.06
\end{equation}
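This 0.06 can be checked directly with a short Python computation of the binomial sum, under the same independence assumption:
```python
from math import comb

eps, n = 0.35, 25
# Probability that a majority (13 or more) of the 25 independent base
# classifiers are wrong at the same time.
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 3))  # 0.06
```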
#### 5.5 Boosting
Boosting works by iteratively creating models and adding them to the ensemble.
The iteration stops when a predefined number of models have been added.
When we use boosting, each new model added to the ensemble is biased to pay more attention to instances that previous models misclassified. (Redo what was done wrong.)
This is done by incrementally adapting the dataset used to train the models. To do this we use a weighted dataset.
Three simple ideas behind it:
- multiple trees
- weighted voting
- sequentially adding trees
##### 5.5.1 Boosting Idea Overview

###### 5.5.1.1 Weighted Dataset
- Each instance has an associated weight $w_i \ge 0$,
- initially set to $\frac{1}{n}$, where $n$ is the number of instances in the dataset.
- After each model is added to the ensemble it is tested on the training data; the weights of the instances the model gets correct are decreased and the weights of the instances the model gets incorrect are increased.
- These weights are used as a distribution over which the dataset is sampled to create a replicated training set, where the replication of an instance is proportional to its weight.
###### 5.5.1.2 Idea Overview
During each training iteration the algorithm:
1. Induces a model and calculates the total error, $\varepsilon \in [0, 0.5]$, by summing the weights of the training instances for which the predictions made by the model are incorrect.
2. Increases the weights for the instances misclassified using:
\begin{equation}
w[i] \leftarrow w[i] \times (\frac{1}{2\varepsilon})
\end{equation}
3. Decreases the weights for the instances correctly classified:
\begin{equation}
w[i] \leftarrow w[i] \times (\frac{1}{2 \cdot (1 - \varepsilon)})
\end{equation}
4. Calculates a confidence factor, $\alpha$, for the model such that $\alpha$ increases as $\varepsilon$ decreases:
\begin{equation}
\alpha = \frac{1}{2} \times log_e (\frac{1 - \varepsilon}{\varepsilon})
\end{equation}
:::success
**Key note**: When $\varepsilon = 0.001$, the resulting value is very large, which indirectly means the confidence $\alpha$ in that model is high.
:::
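A small numeric illustration of one boosting iteration (my own sketch, with a made-up set of eight instances and a hypothetical error of $\varepsilon = 0.25$); note that after the update the weights still sum to 1:
```python
import math

n = 8
weights = [1 / n] * n                      # initial weights 1/n
wrong = [True, False, False, True, False, False, False, False]

eps = sum(w for w, miss in zip(weights, wrong) if miss)   # total error = 0.25
for i, miss in enumerate(wrong):
    if miss:
        weights[i] *= 1 / (2 * eps)                    # increase misclassified
    else:
        weights[i] *= 1 / (2 * (1 - eps))              # decrease correct

alpha = 0.5 * math.log((1 - eps) / eps)                # confidence factor
print(round(sum(weights), 3), round(alpha, 3))         # 1.0 0.549
```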
###### 5.5.1.3 Boosting Example
**Step1:**

**Step2:**

**Step3:**

- Overall decision process
```python
# Combined decision of the three boosted stumps
# (threshold values taken from the example figures above)
if x > 0.75:
    y = 1
elif x > 0.35:
    y = -1
else:
    y = 1
```
- How things move at each step

- The inference process

:::success
**Key notes**: Common questions
- Can a single tree end up deciding the whole result? Yes, it can.
- Is $\alpha$ guaranteed to grow for later trees? No, it depends on the sampling error.
- What happens if the trees are too deep? Overfitting.
- The exam will cover how this algorithm operates!
:::
## Why do we need multiple trees?
- Rule-based Learning / Information-based Learning:Decision-Tree
- Error-based Learning / Loss-based Learning:Neural Network (Logistic Regression)
- Probability-based Learning:Naïve Bayes Network (NB)
- Similarity-based Learning:K-Nearest Neighbor (K-NN)
## Different Viewpoint to the Same Problem
Q: How do we make a prediction $\hat{y_i}$ given $x_i$?
Error-based Learning
- The predicted value is given by an equation
- Find Optimal $\vec{v}$ such that
\begin{equation}
\text{Minimize } \sum_i (y_i - \hat{y_i})^2
\end{equation}
\begin{align}
\hat{y} &= \vec{l} \cdot \vec{v} =
\begin{bmatrix}
1, x, x^2, x^3 \\
\end{bmatrix}
\cdot
\begin{bmatrix}
a \\
b \\
c \\
d \\
\end{bmatrix}
\end{align}
:::success
**Key note**: If the model is too complex, it is easily disturbed by noise.
:::
Rule-based Learning
- Find Optimal Split such that
\begin{equation}
\text{Minimize } \sum_i (y_i - \hat{y_i})^2
\end{equation}
- Derive Rules
\begin{equation}
y =
\begin{cases}
1, & x < 3 \\
5, & 3 < x < 8 \\
8, & 8 < x
\end{cases}
\end{equation}
## Why do multiple trees help boost the performance?
**Boosting Penalty**
Minimize $\sum_i (y_i - \hat{y_i})^2$
- Objective Function V1
\begin{equation}
\sum_i (y_i - \hat{y_i})^2
\end{equation}
- Objective Function V2: for avoiding overfitting
\begin{equation}
\sum_i (y_i - \hat{y_i})^2 + \text{regularization}
\end{equation}
\begin{equation}
Obj(\Theta) = L(\Theta) + \Omega(\Theta)
\end{equation}
:::success
**Key notes**:
1. $L(\Theta)$: the training loss, which measures how well the model fits the training data
2. $\Omega(\Theta)$: the regularization term, which measures the complexity of the model
:::
<br><br>
# Basic Idea Review: Similarity-based Learning
- Similarity-based prediction models attempt to **mimic a very human way of reasoning** by basing predictions for a target feature value on the most **similar instances in memory**—this makes them easy to interpret and understand.
- One of the simplest and best known machine learning algorithms for this type of reasoning is called the **nearest neighbor algorithm**.

## K nearest neighbors
- The k nearest neighbors model predicts the target level with the majority vote from the set of k nearest neighbors to the query q

:::success
**Key note**:
With $k = 1$, the prediction is easily affected by noise.
:::
## A very fundamental step for all data mining techniques:
- Finding similar items

## Similarity Measures
- One of the best known metrics is the Euclidean distance, which computes the length of the straight line between two points. The Euclidean distance between two instances a and b in an m-dimensional feature space is defined as:
\begin{equation}
Euclidean(a, b) = \sqrt{\sum_{i=1}^m (a[i] - b[i])^2}
\end{equation}
:::info
**Note**: $d_1 (5.00, 2.50), d_2 (2.75, 7.50)$
\begin{align}
Euclidean(d_1, d_2) &= \sqrt{(5.00 - 2.75)^2 + (2.50 - 7.50)^2} \\
&= \sqrt{30.0625} = 5.4829
\end{align}
:::
- Another, less well known, distance measure is the **Manhattan distance** or taxi-cab distance. The Manhattan distance between two instances a and b in a feature space with m dimensions is
\begin{equation}
Manhattan(a, b) = \sum_{i=1}^m abs(a[i] - b[i])
\end{equation}
:::info
**Note**: $d_1 (5.00, 2.50), d_2 (2.75, 7.50)$
\begin{align}
Manhattan(d_1, d_2) &= abs(5.00 - 2.75) + abs(2.50 - 7.50) \\
&= 2.25 + 5 = 7.25
\end{align}
:::
- The Euclidean and Manhattan distances are special cases of Minkowski distance
- The Minkowski distance between two instances a and b in a feature space with m descriptive features is:
\begin{equation}
Minkowski(a, b) = (\sum_{i=1}^m abs(a[i] - b[i])^p)^{\frac{1}{p}}
\end{equation}
- The Minkowski distance with p = 1 is the Manhattan distance and with p = 2 is the Euclidean distance.
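The two worked distances above can be reproduced via the Minkowski form (a short sketch of my own):
```python
def minkowski(a, b, p):
    """Minkowski distance; p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

d1, d2 = (5.00, 2.50), (2.75, 7.50)
print(round(minkowski(d1, d2, 2), 4))  # Euclidean: 5.4829
print(round(minkowski(d1, d2, 1), 4))  # Manhattan: 7.25
```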
## The Nearest Neighbor Algorithm
- Require: set of training instances
- Require: a query/instance to be classified
1. Iterate across the instances in memory and find the instance that is shortest distance from the query position in the feature space.
2. Make a prediction for the query equal to the target feature value of the nearest neighbor.
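A minimal k-nearest-neighbor sketch of this procedure (my own code; the toy training set is made up):
```python
import math
from collections import Counter

def knn_predict(query, training, k=3):
    """Sort the training instances by Euclidean distance to the query and
    return the majority target level among the k nearest neighbors."""
    by_distance = sorted(training,
                         key=lambda inst: math.dist(query, inst[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), "ham"), ((1.2, 0.8), "ham"),
            ((4.0, 4.5), "spam"), ((4.2, 3.9), "spam"), ((3.8, 4.1), "spam")]
print(knn_predict((4.0, 4.0), training, k=3))  # 'spam'
```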
## Data Normalization
- The marketing department wants to decide whether or not they should contact a customer with the following profile: ⟨SALARY = 56,000, AGE = 35⟩


- This odd prediction is caused by features taking different ranges of values; this is equivalent to features having different variances.
- We can adjust for this using normalization; the equation for range normalization is:
\begin{equation}
a_i' = \frac{a_i - min(a)}{max(a) - min(a)}
\end{equation}
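Range normalization in code (a brief sketch; the salary values are hypothetical):
```python
def range_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

salaries = [46000, 56000, 52000, 61000, 64000]
print(range_normalize(salaries))
# approximately [0.0, 0.556, 0.333, 0.833, 1.0]
```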

## Handling Categorical Attributes
Example: A binary dataset listing the behavior of two individuals on a website during a trial period and whether or not they subsequently signed up for the website.

- The similarity between the current trial user, **q**, and the two users in the dataset, $d_1$ and $d_2$, in terms of co-presence (CP), co-absence (CA), presence-absence (PA), and absence-presence (AP).

### Common Similarity Measure

- Russel-Rao
\begin{equation}
Sim_{RR}(q, d) = \frac{CP(q, d)}{|q|}
\end{equation}
1. $Sim_{RR}(q, d_1) = \frac{2}{5} = 0.4$
2. $Sim_{RR}(q, d_2) = \frac{1}{5} = 0.2$
- Sokal-Michener
\begin{equation}
Sim_{SM}(q, d) = \frac{CP(q, d) + CA(q, d)}{|q|}
\end{equation}
1. $Sim_{SM}(q, d_1) = \frac{3}{5} = 0.6$
2. $Sim_{SM}(q, d_2) = \frac{4}{5} = 0.8$
- Jaccard
\begin{equation}
Sim_{J}(q, d) = \frac{CP(q, d)}{CP(q, d) + PA(q, d) + AP(q, d)}
\end{equation}
1. $Sim_{J}(q, d_1) = \frac{2}{4} = 0.5$
2. $Sim_{J}(q, d_2) = \frac{1}{2} = 0.5$
:::success
**Key note**: When the problem size is very large (e.g., $\{0, 1\}^{10^6}$):
- $Sim_{RR}(q, d) \sim 0$
- $Sim_{SM}(q, d) \sim 1$
:::
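A sketch of the CP/CA/PA/AP counts and the three measures (my own code; the binary profiles q, d1, d2 below are hypothetical but chosen to reproduce the counts used above, since the actual profiles are only shown in the figure):
```python
def counts(q, d):
    cp = sum(1 for a, b in zip(q, d) if a == 1 and b == 1)  # co-presence
    ca = sum(1 for a, b in zip(q, d) if a == 0 and b == 0)  # co-absence
    pa = sum(1 for a, b in zip(q, d) if a == 1 and b == 0)  # presence-absence
    ap = sum(1 for a, b in zip(q, d) if a == 0 and b == 1)  # absence-presence
    return cp, ca, pa, ap

def russel_rao(q, d):
    cp, _, _, _ = counts(q, d)
    return cp / len(q)

def sokal_michener(q, d):
    cp, ca, _, _ = counts(q, d)
    return (cp + ca) / len(q)

def jaccard(q, d):
    cp, _, pa, ap = counts(q, d)
    return cp / (cp + pa + ap)

q  = [1, 1, 0, 0, 0]            # hypothetical trial-user profile
d1 = [1, 1, 1, 1, 0]
d2 = [1, 0, 0, 0, 0]
for d in (d1, d2):
    print(russel_rao(q, d), sokal_michener(q, d), jaccard(q, d))
# 0.4 0.6 0.5   then   0.2 0.8 0.5
```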
## The most profitable algorithm in history?
- Recommendation System

- Item Recommender
Users keep updating their preferences, so K-NN is used to make the recommendation results better.

### Machine Learning and Personalization
Machine Learning can learn a user model of a particular user based on:
- Sample interaction
- Rated examples
This model can then be used to:
- Recommend items
- Filter information
- Predict behavior
### Movie Recommendation Example with the following user’s preference log

### Algorithms : K-Nearest Neighbors
User-Based Nearest Neighbor
- Neighbor = similar users
- Generate a prediction for an item i by analyzing ratings for i from users in u’s neighborhood
\begin{equation}
pred(u, i) = \frac{\sum_{n \in neighbors(u)} sim(u, n) \cdot (r_{ni} - \bar{r_n})}{\sum_{n \in neighbors(u)} sim(u, n)}
\end{equation}
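A rough sketch of this user-based prediction (my own; the ratings dictionary is hypothetical). It implements the formula exactly as written, so it returns the weighted deviation from each neighbor's mean; many references then add the target user's own mean $\bar{r}_u$ back to turn this into a rating:
```python
import math

def pearson(u, v, ratings):
    """Pearson correlation over the items both users have rated."""
    common = [i for i in ratings[u] if i in ratings[v]]
    if not common:
        return 0.0
    mu = sum(ratings[u][i] for i in common) / len(common)
    mv = sum(ratings[v][i] for i in common) / len(common)
    num = sum((ratings[u][i] - mu) * (ratings[v][i] - mv) for i in common)
    den = math.sqrt(sum((ratings[u][i] - mu) ** 2 for i in common)) * \
          math.sqrt(sum((ratings[v][i] - mv) ** 2 for i in common))
    return num / den if den else 0.0

def predict(u, item, ratings):
    neighbors = [n for n in ratings if n != u and item in ratings[n]]
    sims = {n: pearson(u, n, ratings) for n in neighbors}
    mean = {n: sum(ratings[n].values()) / len(ratings[n]) for n in neighbors}
    num = sum(sims[n] * (ratings[n][item] - mean[n]) for n in neighbors)
    den = sum(sims[n] for n in neighbors)
    return num / den if den else 0.0

ratings = {                       # hypothetical user -> {movie: rating}
    "Toby": {"Snakes": 4.5, "Superman": 4.0},
    "Rose": {"Snakes": 4.0, "Superman": 3.0, "Night": 4.5},
    "Mick": {"Snakes": 5.0, "Superman": 4.5, "Night": 4.0},
}
print(round(predict("Toby", "Night", ratings), 3))
# 0.083: predicted deviation above Toby's own mean rating
```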
### Algorithm: Find similar users to Toby and use their ratings to predict a rating for Toby

## Similarity? Euclidean Distance?
- Euclidean Distance (an absolute distance)
\begin{equation}
\sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + ... + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
\end{equation}

## Pearson Correlation as a Similarity Measure
- A relative distance.
- Each rating is first compared with the user's own average: a rating above the user's own average means "like", otherwise "dislike".
## Covariance and Correlation
Scatter Plot Matrix

The way to measure the correlation between features
- When a and b are small together, or large together, the two features are correlated.
- For two features, $a$ and $b$, in a dataset of $n$ instances, the sample covariance between $a$ and $b$ is
\begin{equation}
cov(a, b) = \frac{1}{n-1} \sum_{i=1}^n ((a_i - \bar{a}) \times (b_i - \bar{b}))
\end{equation}

\begin{equation}
cov(\text{HEIGHT}, \text{WEIGHT}) = \frac{7009.9}{29} = 241.72
\end{equation}
\begin{equation}
cov(\text{HEIGHT}, \text{AGE}) = \frac{570.8}{29} = 19.7
\end{equation}
- Correlation is a normalized form of covariance that range between -1 and +1
- The correlation between two features, $a$ and $b$, can be calculated as
\begin{equation}
corr(a, b) = \frac{cov(a, b)}{sd(a) \times sd(b)}
\end{equation}
\begin{equation}
corr(\text{HEIGHT}, \text{WEIGHT}) = \frac{241.72}{13.6 \times 19.8} = 0.898
\end{equation}
\begin{equation}
corr(\text{HEIGHT}, \text{AGE}) = \frac{19.7}{13.6 \times 4.2} = 0.345
\end{equation}
:::success
**Key note**: What is $corr(a, a)$?
\begin{align}
corr(a, a) &= \frac{cov(a, a)}{sd(a) \times sd(a)} \\
&= \frac{\frac{1}{n-1}\sum(a_i - \bar{a}) \cdot (a_i - \bar{a})}{sd(a) \times sd(a)} \\
&= \frac{\frac{1}{n-1}\sum(a_i - \bar{a})^2}{\sqrt{\frac{1}{n-1}\sum (a_i - \bar{a})^2} \times \sqrt{\frac{1}{n-1}\sum (a_i - \bar{a})^2}} \\
&= \frac{\frac{1}{n-1}\sum (a_i - \bar{a})^2}{\frac{1}{n-1}\sum (a_i - \bar{a})^2} = 1
\end{align}
:::
## Pearson Correlation
\begin{equation}
r = \frac{\sum (x-m_x)(y-m_y)}{\sqrt{\sum (x-m_x)^2 \sum(y - m_y)^2}}
\end{equation}
\begin{align}
r_{Matthew} &= \frac{(-0.166 \times 3.166) + (0.833 \times 0.833) + (-0.666 \times -2.166)}{\sqrt{[(-0.166)^2 + (0.833)^2 + (-0.666)^2] \times [(1.333)^2 + (0.833)^2 + (-2.166)^2]}} \\
&= 0.68
\end{align}

$r_{Rose} = 0.99$

<br>
## Algorithm: Find similar users and use their ratings to predict rating for a target

## Alternative: Item-Similarity
Item-Based Nearest Neighbor
- Generate predictions based on similarities between items.
- Prediction for a user u and item i is composed of a weighted sum of the user u’s ratings for items most similar to i.
- User Similarity
\begin{equation}
pred(u, i) = \frac{\sum_{n \in neighbors(u)} sim(u, n) \cdot (r_{ni} - \bar{r_n})}{\sum_{n \in neighbors(u)} sim(u, n)}
\end{equation}
- Item Similarity
\begin{equation}
pred(u, i) = \frac{\sum_{j \in ratedItems(u)} sim(i, j) \cdot r_{uj}}{\sum_{j \in ratedItems(u)} sim(i, j)}
\end{equation}

Find the items that a user has rated and use (1) those ratings and (2) the item similarity to predict the ratings of new items

The Night Listener’s Prediction for Toby $\sim f(\text{sim(snakes,night), sim(superman,night), sim(Dupree, night)})$
## Efficient Memory Search
### KD Tree
- Lazy Learning
- A binary search tree where every node is a k-dimensional point.
- Every node (except leaves) represents a hyperplane that divides the space into two parts.
- Points to the left (right) of this hyperplane represent the left (right) sub-tree of that node.

- As we move down the tree, we divide the space with axis-aligned hyperplanes, usually (but not always) alternating between the axes:
- Split by x-coordinate: split by a vertical line that has (ideally) half the points left or on, and half right.
- Split by y-coordinate: split by a horizontal line that has (ideally) half the points below or on and half above.


### Example
- Split by x-coordinate: split by a vertical line that has approximately half the points left or on, and half right.

- Split by y-coordinate: split by a horizontal line that has half the points below or on and half above.

- Split by x-coordinate: split by a vertical line that has half the points left or on, and half right.

- Split by y-coordinate: split by a horizontal line that has half the points below or on and half above.

- Nearest Neighbor with KD Trees

Explore the branch of the tree that is closest to the query point first.


When we reach a leaf, compute the distance to each point in the node.


Then, backtrack and try the other branch at each node visited.

Each time a new closest node is found, we can update the distance bounds.


Using the distance bounds and the bounds of the data below each node, we can prune parts of the tree that could NOT include the nearest neighbor.



The gray region does not need to be checked again.
## K-Nearest Neighbor Search
- Can find the k-nearest neighbors to a query by maintaining the k current bests instead of just one.
- Branches are only eliminated when they can't have points closer than any of the k current bests.
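A compact sketch (my own) of building a 2-d KD tree with alternating axis-aligned splits and answering a nearest-neighbor query with the explore-closest-branch-first and pruning rules described above:
```python
import math

def build(points, depth=0):
    """Recursively split on alternating axes at the median point."""
    if not points:
        return None
    axis = depth % 2                       # 0: split by x, 1: split by y
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    if node is None:
        return best
    if best is None or math.dist(query, node["point"]) < math.dist(query, best):
        best = node["point"]
    axis, split = node["axis"], node["point"][node["axis"]]
    near, far = (("left", "right") if query[axis] <= split else ("right", "left"))
    best = nearest(node[near], query, best)   # explore the closer branch first
    # Only backtrack into the far branch if the splitting plane is closer
    # than the current best distance; otherwise that subtree is pruned.
    if abs(query[axis] - split) < math.dist(query, best):
        best = nearest(node[far], query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
print(nearest(tree, (9, 2)))   # (8, 1)
```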
## NN example using kD trees
d=1 (binary search tree)


