# Chp 8 Classification: Basic Concepts
###### tags: `Data Mining 心得`
## Decision Tree Induction

### Algorithm: Generate_decision_tree.
* Input:
1. Data partition, **D**, a set of training **tuples** and their associated class labels;
2. **attribute_list**, the set of candidate attributes;
3. **Attribute_selection_method**, a procedure to determine the splitting criterion
* Method:
```java=
Create a node N;
if tuples in D are all of the same class, C, then
    return N as a leaf node labeled with the class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D;  // majority voting
apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and multiway splits allowed then  // not restricted to binary trees
    attribute_list = attribute_list - splitting_attribute;  // remove splitting attribute
for each outcome j of splitting_criterion:
    // partition the tuples and grow subtrees for each partition
    let Dj be the set of data tuples in D satisfying outcome j;  // a partition
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
endfor
return N;
```
* A partition is **pure** if all the tuples in it belong to the same class; `Attribute_selection_method` picks the splitting criterion so that the resulting partitions are as pure as possible.
* The algorithm is called recursively on each partition $D_j$ to grow the tree. The split itself depends on the splitting attribute A:
* A is discrete-valued:

* A is continuous-valued:

* A is discrete-valued and a binary tree must be produced:

* Complexity: $O(n\times |D|\times log(|D|))$
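
As a quick illustration of the induction procedure above, here is a minimal scikit-learn sketch (scikit-learn and the toy tuples are my assumptions, not part of the notes); `criterion="entropy"` corresponds to the information-gain style measure discussed in the next section.

```python
# A toy sketch with scikit-learn (assumed available); the tuples below are made up.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training tuples: [age_code, income_code] with class label buys_computer
X = [[0, 2], [0, 1], [1, 2], [2, 1], [2, 0], [1, 0]]
y = ["no", "no", "yes", "yes", "yes", "no"]

# criterion="entropy" chooses splits by an information-gain-style impurity reduction
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))  # the induced tree
print(tree.predict([[1, 1]]))                              # classify a new tuple
```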
### Attribute Selection Measures
#### Information Gain
* Def: the difference between the original information requirement (based only on the class proportions in D) and the new requirement (obtained after partitioning on A)
* The entropy of D:
* $$Info(D) = -\sum_{i=1}^m p_ilog_2(p_i)$$
* How much more information would we still need (after the partitioning) to arrive at an exact classification
* $$Info_A(D) = \sum_{j=1}^v \frac{|D_j|}{|D|}\times Info(D_j)$$
* Term $\frac{|D_j|}{|D|}$ acts as the weight of the jth partition.
* Information gain:
* $$Gain(A) = Info(D)-Info_A(D)$$
* Example:

* label: 9 class yes, 5 class no
* $info(D) = -\frac{9}{14}log_2(\frac{9}{14})-\frac{5}{14}log_2(\frac{5}{14})=0.94$
* $info_{age}(D) =\\ \frac{5}{14}\times(-\frac{2}{5}log_2\frac{2}{5}-\frac{3}{5}log_2\frac{3}{5})+\\ \frac{4}{14}\times(-\frac{4}{4}log_2\frac{4}{4})+\\ \frac{5}{14}\times(-\frac{3}{5}log_2\frac{3}{5}-\frac{2}{5}log_2\frac{2}{5})=0.694$
* $Gain(age) = 0.94-0.694 = 0.246$
* Calculate the information gain for each of the remaining features in the same way, and select the attribute with the highest gain as the splitting attribute (a short Python check of these numbers follows below).
* Result
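
A short Python check of the numbers above (it assumes only the 14-tuple counts given in the example: 9 yes / 5 no overall, and age partitions of (2 yes, 3 no), (4 yes, 0 no), (3 yes, 2 no)):

```python
import math

def entropy(counts):
    """Info(D) = -sum(p_i * log2(p_i)) over the class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

info_D = entropy([9, 5])                                      # 0.940
# age partitions: youth (2 yes, 3 no), middle_aged (4 yes, 0 no), senior (3 yes, 2 no)
partitions = [[2, 3], [4, 0], [3, 2]]
info_age = sum(sum(p) / 14 * entropy(p) for p in partitions)  # 0.694
gain_age = info_D - info_age                                  # 0.246
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
```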

#### Gain Ratio
* $$GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)},\quad SplitInfo_A(D) = -\sum_{j=1}^v \frac{|D_j|}{|D|}\times log_2\left(\frac{|D_j|}{|D|}\right)$$
* Normalizes information gain to counteract its bias toward attributes with many values; the attribute with the maximum gain ratio is selected as the splitting attribute.
#### Gini Index
* Measures the impurity of D, a data partition or set of training tuples.
* The point giving the **minimum Gini index** for a given attribute is taken as the split-point of that attribute.
* $$Gini(D) = 1 - \sum_{i=1}^m p_i^2,\ p_i=\frac{|C_{i,D}|}{|D|}$$
* For example, if a binary split on $A$ partitions $D$ into $D_1$ and $D_2$
* $Gini_A(D) = \frac{|D_1|}{|D|}Gini(D_1)+\frac{|D_2|}{|D|}Gini(D_2)$
* The reduction in impurity:
* $\Delta Gini(A) = Gini(D)-Gini_A(D)$
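
A small sketch of the Gini computation; the 9 yes / 5 no counts reuse the example above, but the binary split counts below are hypothetical:

```python
def gini(counts):
    """Gini(D) = 1 - sum(p_i^2) over the class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_D = gini([9, 5])                                    # ~0.459

# Hypothetical binary split of the 14 tuples: D1 = (7 yes, 3 no), D2 = (2 yes, 2 no)
D1, D2 = [7, 3], [2, 2]
gini_A = sum(sum(D) / 14 * gini(D) for D in (D1, D2))    # weighted impurity of the split
delta = gini_D - gini_A                                  # reduction in impurity
print(round(gini_D, 3), round(gini_A, 3), round(delta, 3))
```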
#### Tree Pruning
* **Postpruning**: Remove subtrees from a “fully grown” tree.
* **Cost complexity**:
* A function of the number of leaves in the tree and the error rate of the tree (the percentage of tuples misclassified by the tree)
* If pruning the subtree at node N would result in a smaller cost complexity, then the subtree is pruned.
* **Repetition**: an attribute is repeatedly tested along a given branch of the tree

* **Replication**: duplicate subtrees exist within the tree

## Bayes Classification Methods
* $X$ is considered “evidence.”
* $P(H|\boldsymbol X)$: the probability that the hypothesis H holds given the “evidence”, i.e., the observed data tuple $\boldsymbol X$
* **Posterior probability**: $P(H|\boldsymbol X)$
* **Prior probability**: $P(H)$
* **Bayes’ theorem**:
* $$P(H|\boldsymbol X)=\frac{P(\boldsymbol X|H)P(H)}{P(\boldsymbol X)}$$
## Naive Bayesian Classification
1.
* $\boldsymbol X=(x_1, x_2, ..., x_n)$
2.
* $m$ classes, $C_1, C_2,..., C_m$
* $\boldsymbol X$ belongs to the class $C_i$ if and only if $P(C_i|\boldsymbol X)>P(C_j|\boldsymbol X)$ for $1\leq j\leq m,\ j\neq i$
* maximize $P(C_i|\boldsymbol X)$: $P(C_i|\boldsymbol X)=\frac{P(\boldsymbol X|C_i)P(C_i)}{P(\boldsymbol X)}$
3.
* $P(\boldsymbol X)$ is constant for all classes
* We maximize $P(\boldsymbol X|C_i)P(C_i)$
* $P(C_i)=\frac{|C_{i,D}|}{|D|}$
4.
* Assume class-conditional independence (the attributes' values are conditionally independent of one another, given the class), thus
* $P(\boldsymbol X|C_i) = \prod_{k=1}^n P(x_k|C_i)=P(x_1|C_i)\times P(x_2|C_i)\times ... \times P(x_n|C_i)$
* (a) If $A_k$ is categorical, $P(x_k|C_i) = \frac{\text{number of tuples of class } C_i \text{ having value } x_k \text{ for } A_k}{|C_{i,D}|}$
* (b) If $A_k$ is continuous-valued, assume a Gaussian distribution with mean $\mu_{C_i}$ and standard deviation $\sigma_{C_i}$ estimated from the class $C_i$ tuples
* $$P(x_k|C_i)=g(x_k,\mu_{C_i},\sigma_{C_i}) = \frac{1}{\sqrt{2\pi}\sigma_{C_i}}e^{-\frac{(x_k-\mu_{C_i})^2}{2\sigma_{C_i}^2}}$$
5.
* To predict class label of $\boldsymbol X$, $P(\boldsymbol X|C_i)P(C_i)$ is evaluated for each class.
* Predict $C_i$ $\Leftrightarrow P(\boldsymbol X|C_i)P(C_i)>P(\boldsymbol X|C_j)P(C_j)$ for $1\leq j\leq m,\ j\neq i$
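
A minimal sketch of steps 3–5 for categorical attributes (the tiny data set and attribute values are made-up illustrations, not the textbook table):

```python
from collections import Counter, defaultdict

# Made-up class-labeled tuples: (age, income) -> buys_computer
data = [(("youth", "high"), "no"), (("youth", "medium"), "no"),
        (("middle", "high"), "yes"), (("senior", "medium"), "yes"),
        (("senior", "low"), "yes"), (("middle", "low"), "yes")]

class_counts = Counter(label for _, label in data)             # |C_i,D|
cond = defaultdict(Counter)                                    # per-class attribute value counts
for x, label in data:
    for k, value in enumerate(x):
        cond[(label, k)][value] += 1

def predict(x):
    # maximize P(X|C_i) P(C_i) under the class-conditional independence assumption
    def score(c):
        p = class_counts[c] / len(data)                        # P(C_i)
        for k, value in enumerate(x):
            p *= cond[(c, k)][value] / class_counts[c]         # P(x_k | C_i)
        return p
    return max(class_counts, key=score)

print(predict(("senior", "high")))                             # -> "yes"
```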
## Rule-Based Classification
### Using IF-THEN Rules for Classification
```
IF condition
THEN conclusion
```
* Assess:
* $coverage(R) = \frac{n_{covers}}{|D|}$
* $accuracy(R) = \frac{n_{correct}}{n_{covers}}$
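
A tiny sketch of the two measures for a single rule (the rule and tuples are made up):

```python
# Made-up labeled tuples: (age, buys_computer)
D = [("youth", "yes"), ("youth", "no"), ("senior", "yes"),
     ("youth", "yes"), ("middle", "no")]

# Rule R: IF age = youth THEN buys_computer = yes
covered = [(x, y) for x, y in D if x == "youth"]
correct = [(x, y) for x, y in covered if y == "yes"]

coverage = len(covered) / len(D)          # n_covers / |D|       -> 0.6
accuracy = len(correct) / len(covered)    # n_correct / n_covers -> 2/3
print(coverage, round(accuracy, 2))
```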
### Rule Extraction from a Decision Tree
* Because the rules are extracted directly from the tree, they are mutually exclusive and exhaustive.
* Exhaustive means there is one rule for each possible attribute–value combination, so no default rule is needed.
### Rule Induction Using a Sequential Covering Algorithm
* Sequential covering algorithm: Learn a set of IF-THEN rules for classification
```=
Input:
    D, a data set of class-labeled tuples;
    Att_vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
Rule_set = {};  // initial set of rules learned is empty
for each class c do
    repeat
        Rule = Learn_One_Rule(D, Att_vals, c);
        remove tuples covered by Rule from D;
        Rule_set = Rule_set + Rule;  // add new rule to rule set
    until terminating condition;
endfor
return Rule_set;
```
* Rule Quality Measures:
* **FOIL** (First Order Inductive Learner):
    * positive tuples: the tuples of the class for which we are learning rules; the remaining tuples are negative
    * $pos$ ($neg$): the number of positive (negative) tuples covered by the current rule $R$
    * $pos'$ ($neg'$): the number of positive (negative) tuples covered by the extended rule $R'$
    * $FOIL\_Gain = pos' \times (log_2\frac{pos'}{pos'+neg'}-log_2\frac{pos}{pos+neg})$
* Rule Pruning:
* Prune rules to avoid overfitting
* $FOIL\_Prune(R) = \frac{pos-neg}{pos+neg}$
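
A small sketch of the two FOIL formulas (the pos/neg counts are hypothetical):

```python
import math

def foil_gain(pos, neg, pos2, neg2):
    """Gain from extending rule R (covers pos/neg) to rule R' (covers pos2/neg2)."""
    return pos2 * (math.log2(pos2 / (pos2 + neg2)) - math.log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """Pruning measure for a rule covering pos positive and neg negative tuples."""
    return (pos - neg) / (pos + neg)

# Hypothetical counts: R covers 30 pos / 20 neg, the extended R' covers 25 pos / 5 neg
print(round(foil_gain(30, 20, 25, 5), 3))                         # positive -> R' is better
print(round(foil_prune(30, 20), 3), round(foil_prune(25, 5), 3))
```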
## Model Evaluation and Selection
### 8.5.1 Metrics for Evaluating Classifier Performance

* $accuracy = \frac{TP+TN}{P+N}$
* $precision = \frac{TP}{TP+FP}$ (exactness)
* $recall = \frac{TP}{TP+FN}$ (completeness)
* $F_1\ score = \frac{2\times precision\times recall}{precision + recall}$
* $F_{\beta} = \frac{(1+\beta^2)\times precision\times recall}{\beta^2\times precision + recall}$
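
A quick check of these metrics from a confusion matrix (the TP/FN/FP/TN counts are made up):

```python
# Hypothetical confusion-matrix counts
TP, FN, FP, TN = 90, 10, 30, 70
P, N = TP + FN, FP + TN                     # actual positives / negatives

accuracy  = (TP + TN) / (P + N)             # 0.80
precision = TP / (TP + FP)                  # 0.75 (exactness)
recall    = TP / (TP + FN)                  # 0.90 (completeness)
f1        = 2 * precision * recall / (precision + recall)

beta = 2                                    # beta > 1 weights recall more than precision
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
print(accuracy, precision, recall, round(f1, 3), round(f_beta, 3))
```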
### Holdout Method and Random Subsampling
* Holdout: the data are randomly partitioned into a training set (typically two-thirds of the data) and a test set (the remaining one-third)
* Random subsampling: holdout method repeated k times
### Cross-Validation
* k-fold cross-validation: partition the data into k folds; in each iteration one fold is the test set and the remaining k−1 folds form the training set

* Leave-one-out: k-fold cross-validation with k = |D|, i.e., one tuple is held out for testing at a time (a sketch of both follows below)
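
A sketch of k-fold cross-validation and leave-one-out with scikit-learn (an assumption; the notes do not name a library):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# 10-fold CV: every tuple is tested exactly once and used for training 9 times
scores = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(scores.mean())

# Leave-one-out: k = |D| folds, one test tuple per fold
print(cross_val_score(model, X, y, cv=LeaveOneOut()).mean())
```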

### Bootstrap
* Each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
* .632 bootstrap: on average 63.2% of the original data tuples end up in the bootstrap sample, and the remaining 36.8% form the test set (a tuple is missed in all $d$ draws with probability $(1-1/d)^d \approx e^{-1} \approx 0.368$).
### Model Selection Using Statistical Tests of Significance
* Determine whether the difference in performance between two models, $M_1$ and $M_2$, is statistically significant.
* Perform 10-fold cross-validation using the same partitioning of the data for both models, and average the 10 error rates obtained for each of $M_1$ and $M_2$.
* **t-test** (Student's t-test): the null hypothesis is that the two models are the same, i.e., the difference in **mean error rate** between them is zero.
* $\overline{err}(M)$: the mean error rate of model $M$ over the $k$ folds
* **t-statistic** with k-1 degrees of freedom for k samples(t-distribution):
* $$t=\frac{\overline{err}(M_1)-\overline{err}(M_2)}{\sqrt{var(M_1-M_2)/k}}$$
* variance: $var(M_1-M_2) = \frac{1}{k}\displaystyle\sum_{i=1}^k[err(M_1)_i-err(M_2)_i-(\overline{err}(M_1)-\overline{err}(M_2))]^2$
* Significance level ($sig$): usually 5% or 1%
* **Confidence limit**:
* Look up the t-distribution table with $k-1$ degrees of freedom at $z = sig/2$ (e.g., 0.025 for a two-sided test at $sig = 5\%$) to get the critical value.
* If $t>z$ or $t<-z$ (with $z$ the tabulated critical value), there is a statistically significant difference between $M_1$ and $M_2$
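
A sketch of the paired t-test on 10-fold error rates (the per-fold error rates are made-up numbers, and `scipy.stats` is an assumed helper):

```python
import numpy as np
from scipy import stats

# Made-up per-fold error rates from the same 10-fold partitioning
err_M1 = np.array([0.12, 0.15, 0.10, 0.14, 0.11, 0.13, 0.12, 0.16, 0.11, 0.13])
err_M2 = np.array([0.14, 0.18, 0.12, 0.15, 0.13, 0.16, 0.15, 0.17, 0.14, 0.15])

d = err_M1 - err_M2
k = len(d)
t = d.mean() / np.sqrt(d.var(ddof=0) / k)      # t-statistic as in the formula above
crit = stats.t.ppf(1 - 0.05 / 2, df=k - 1)     # two-sided critical value at sig = 5%
print(round(t, 3), round(crit, 3), abs(t) > crit)

# scipy's paired t-test (uses the sample variance, so it differs slightly)
print(stats.ttest_rel(err_M1, err_M2))
```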
### Comparing Classifiers Based on Cost–Benefit and ROC Curves
* Receiver operating characteristic curves (**ROC** curve):
* Visualize the trade-off between the true positive rate (TPR) and the false positive rate (FPR)
* $TPR=\frac{TP}{P}$(sensitivity), $FPR=\frac{FP}{N}$(1-specificity)
* Example:
* 
* 
* The closer the ROC curve of a model is to the diagonal line, the less accurate the model.
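
A sketch of building an ROC curve from probability scores (made-up labels and scores; `sklearn.metrics.roc_curve` is an assumption, the textbook constructs the curve by hand from a sorted tuple list):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels (1 = positive) and classifier probability scores
y_true  = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]

# Each threshold on the score yields one (FPR, TPR) point on the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr.round(2), tpr.round(2))))
print(roc_auc_score(y_true, y_score))   # area under the curve; 0.5 = diagonal, 1.0 = perfect
```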
## Techniques to Improve Classification Accuracy
### Ensemble Methods

### Bagging
```=
Input:
    D, a set of d training tuples;
    k, the number of models in the ensemble;
    a classification learning scheme (decision tree algorithm, naïve Bayesian, etc.).
Output: The ensemble: a composite model, M*.
Method:
for i = 1 to k do  // create k models:
    create bootstrap sample, Di, by sampling D with replacement;
    use Di and the learning scheme to derive a model, Mi;
endfor
To classify a tuple X: let each of the k models classify X and return the majority vote;
```
* Random Forests:
* generate k decision trees for the ensemble, each from a bootstrap sample of D, choosing each split from a random subset of the attributes
* Let each of the k models classify X and return the majority vote
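
A sketch of bagging and a random forest with scikit-learn (an assumption; parameter names are scikit-learn's, not the notes'):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Bagging: k bootstrap samples, one tree per sample (the default base learner),
# majority vote at prediction time
bag = BaggingClassifier(n_estimators=10, random_state=0).fit(X, y)

# Random forest: bagging plus a random subset of attributes considered at each split
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

print(bag.predict(X[:3]), rf.predict(X[:3]))
```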
### Boosting and AdaBoost
* Boosting: After a classifier, $M_i$, is learned, the weights are updated to allow the subsequent classifier, $M_{i+1}$ , to “pay more attention” to the training tuples that were misclassified by $M_i$
* **Adaboost**:
* Initially assigns each training tuple an equal weight of 1/d.
* If a tuple was incorrectly classified, its weight is increased, so that the next round focuses more on the tuples misclassified in the previous round.
* $error(M_i)=\displaystyle\sum_{j=1}^d w_j \times err(X_j)$
* $err(X_j)$: the misclassification error of tuple $X_j$
* misclassified $err(X_j)=1$, otherwise $err(X_j)=0$
* Boosting assigns a weight to each classifier’s vote, based on how well the classifier performed.
* weight of $M_i$'s vote: $log\frac{1-error(M_i)}{error(M_i)}$

```=
Input:
D, a set of d class-labeled training tuples;
k, the number of rounds (one classifier is generated per round);
a classification learning scheme.
Output: A composite model.
Method:
initialize the weight of each tuple in D to 1/d;
for i = 1 to k do  // for each round:
    sample D with replacement according to the tuple weights to obtain Di;
    use training set Di to derive a model, Mi;
    compute error(Mi), the error rate of Mi (Eq. 8.34);
    if error(Mi) > 0.5 then
        go back to the sampling step and try again;
    endif
    for each tuple in D that was correctly classified do
        multiply the weight of the tuple by error(Mi)/(1 - error(Mi));  // update weights
    normalize the weight of each tuple;
endfor
```
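
A sketch with scikit-learn's `AdaBoostClassifier` (an assumption; its exact weight update differs in detail from the pseudocode above but follows the same reweight-and-weighted-vote idea):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# k rounds; each round reweights tuples so the next learner focuses on previous mistakes,
# and each learner's vote is weighted by how well it performed
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(boost, X, y, cv=10).mean())
```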
## Improving Classification Accuracy of Class-Imbalanced Data
1. oversampling: Resampling the positive tuples so that the resulting training set contains an equal number of positive and negative tuples.
2. undersampling: decreasing the number of negative tuples
3. threshold moving: move the decision threshold, $t$, so that the rare (positive) class is easier to predict; applies to classifiers that return a continuous score, such as naive Bayesian and neural network classifiers (see the sketch below)
4. ensemble techniques
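
A small sketch of oversampling (1) and threshold moving (3) on a made-up imbalanced data set; the threshold value 0.3 and the logistic-regression classifier are illustrative assumptions:

```python
import random
from sklearn.linear_model import LogisticRegression

random.seed(0)
# Made-up imbalanced data: 90 negative tuples, 10 positive tuples (single numeric attribute)
X = [[i / 100.0] for i in range(100)]
y = [0] * 90 + [1] * 10

# 1. Oversampling: resample the rare (positive) tuples with replacement until balanced
pos = [(xi, yi) for xi, yi in zip(X, y) if yi == 1]
extra = [random.choice(pos) for _ in range(80)]
X_bal = X + [xi for xi, _ in extra]
y_bal = y + [yi for _, yi in extra]
print(sum(y_bal), "positives out of", len(y_bal))   # 90 of 180

# 3. Threshold moving: keep the original data but lower the decision threshold t
clf = LogisticRegression().fit(X, y)
p_pos = clf.predict_proba([[0.93]])[0][1]           # P(positive | X)
print(p_pos, p_pos >= 0.3)                          # predict positive if the score passes t = 0.3
```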