# Chp 8 Classification: Basic Concepts
###### tags: `Data Mining 心得`
## Decision Tree Induction

### Algorithm: Generate_decision_tree.
* Input:
1. Data partition, **D**, a set of training **tuples** and their associated class labels;
2. **attribute_list**, the set of candidate attributes;
3. **Attribute_selection_method**, a procedure to determine the splitting criterion
* Method:
```java=
Create a node N;
if tuples in D are all of the same class, C, then
    return N as a leaf node labeled with the class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D;  // majority voting
apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and multiway splits allowed then  // not restricted to binary trees
    attribute_list = attribute_list - splitting_attribute;  // remove splitting attribute
for each outcome j of splitting_criterion:
    // partition the tuples and grow subtrees for each partition
    let Dj be the set of data tuples in D satisfying outcome j;  // a partition
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
endfor
return N;
```
* A partition is **pure** if all the tuples in it belong to the same class; `Attribute_selection_method` picks the splitting criterion so that the resulting partitions are as pure as possible.
* The algorithm is called recursively on each partition $D_j$ to grow the tree. The split itself depends on the splitting attribute A:
* A is discrete-valued:

* A is continuous-valued:

* A is discrete-valued and a binary tree must be produced:

* Complexity: $O(n\times |D|\times log(|D|))$
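
As a quick illustration of the induction procedure above, here is a minimal scikit-learn sketch (scikit-learn and the toy tuples are my assumptions, not part of the notes); `criterion="entropy"` corresponds to the information-gain style measure discussed in the next section.

```python
# A toy sketch with scikit-learn (assumed available); the tuples below are made up.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training tuples: [age_code, income_code] with class label buys_computer
X = [[0, 2], [0, 1], [1, 2], [2, 1], [2, 0], [1, 0]]
y = ["no", "no", "yes", "yes", "yes", "no"]

# criterion="entropy" chooses splits by an information-gain-style impurity reduction
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))  # the induced tree
print(tree.predict([[1, 1]]))                              # classify a new tuple
```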
### Attribute Selection Measures
#### Information Gain
* Def: the difference between the original information requirement (based only on the class proportions in D) and the new requirement (obtained after partitioning on A)
* The entropy of D:
* $$Info(D) = -\sum_{i=1}^m p_ilog_2(p_i)$$
* How much more information would we still need (after the partitioning) to arrive at an exact classification
* $$Info_A(D) = \sum_{j=1}^v \frac{|D_j|}{|D|}\times Info(D_j)$$
* Term $\frac{|D_j|}{|D|}$ acts as the weight of the jth partition.
* Information gain:
* $$Gain(A) = Info(D)-Info_A(D)$$
* Example:

* label: 9 class yes, 5 class no
* $info(D) = -\frac{9}{14}log_2(\frac{9}{14})-\frac{5}{14}log_2(\frac{5}{14})=0.94$
* $info_{age}(D) =\\ \frac{5}{14}\times(-\frac{2}{5}log_2\frac{2}{5}-\frac{3}{5}log_2\frac{3}{5})+\\ \frac{4}{14}\times(-\frac{4}{4}log_2\frac{4}{4})+\\ \frac{5}{14}\times(-\frac{3}{5}log_2\frac{3}{5}-\frac{2}{5}log_2\frac{2}{5})=0.694$
* $Gain(age) = 0.94-0.694 = 0.246$
* Calculate the information gain for each of the remaining features in the same way, and select the attribute with the highest gain as the splitting attribute (a short Python check of these numbers follows below).
* Result
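
A short Python check of the numbers above (it assumes only the 14-tuple counts given in the example: 9 yes / 5 no overall, and age partitions of (2 yes, 3 no), (4 yes, 0 no), (3 yes, 2 no)):

```python
import math

def entropy(counts):
    """Info(D) = -sum(p_i * log2(p_i)) over the class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

info_D = entropy([9, 5])                                      # 0.940
# age partitions: youth (2 yes, 3 no), middle_aged (4 yes, 0 no), senior (3 yes, 2 no)
partitions = [[2, 3], [4, 0], [3, 2]]
info_age = sum(sum(p) / 14 * entropy(p) for p in partitions)  # 0.694
gain_age = info_D - info_age                                  # 0.246
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
```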

#### Gain Ratio
* $$GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)},\quad SplitInfo_A(D) = -\sum_{j=1}^v \frac{|D_j|}{|D|}\times log_2\left(\frac{|D_j|}{|D|}\right)$$
* Normalizes information gain to counteract its bias toward attributes with many values; the attribute with the maximum gain ratio is selected as the splitting attribute.
#### Gini Index
* Measures the impurity of D, a data partition or set of training tuples.
* The point giving the **minimum Gini index** for a given attribute is taken as the split-point of that attribute.
* $$Gini(D) = 1 - \sum_{i=1}^m p_i^2,\ p_i=\frac{|C_{i,D}|}{|D|}$$
* For example, if a binary split on $A$ partitions $D$ into $D_1$ and $D_2$
* $Gini_A(D) = \frac{|D_1|}{|D|}Gini(D_1)+\frac{|D_2|}{|D|}Gini(D_2)$
* The reduction in impurity:
* $\Delta Gini(A) = Gini(D)-Gini_A(D)$
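
A small sketch of the Gini computation; the 9 yes / 5 no counts reuse the example above, but the binary split counts below are hypothetical:

```python
def gini(counts):
    """Gini(D) = 1 - sum(p_i^2) over the class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_D = gini([9, 5])                                    # ~0.459

# Hypothetical binary split of the 14 tuples: D1 = (7 yes, 3 no), D2 = (2 yes, 2 no)
D1, D2 = [7, 3], [2, 2]
gini_A = sum(sum(D) / 14 * gini(D) for D in (D1, D2))    # weighted impurity of the split
delta = gini_D - gini_A                                  # reduction in impurity
print(round(gini_D, 3), round(gini_A, 3), round(delta, 3))
```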
#### Tree Pruning
* **Postpruning**: Remove subtrees from a “fully grown” tree.
* **Cost complexity**:
* A function of the number of leaves in the tree and the error rate of the tree (the percentage of tuples misclassified by the tree)
* If pruning the subtree at node N would result in a smaller cost complexity, then the subtree is pruned.
* **Repetition**: an attribute is repeatedly tested along a given branch of the tree

* **Replication**: duplicate subtrees exist within the tree

## Bayes Classification Methods
* $X$ is considered “evidence.”
* $P(H|\boldsymbol X)$: the probability that the hypothesis H holds given the “evidence”, i.e., the observed data tuple $\boldsymbol X$
* **Posterior probability**: $P(H|\boldsymbol X)$
* **Prior probability**: $P(H)$
* **Bayes’ theorem**:
* $$P(H|\boldsymbol X)=\frac{P(\boldsymbol X|H)P(H)}{P(\boldsymbol X)}$$
## Naive Bayesian Classification
1.
* $\boldsymbol X=(x_1, x_2, ..., x_n)$
2.
* $m$ classes, $C_1, C_2,..., C_m$
* $\boldsymbol X$ belongs to the class $C_i$ if and only if $P(C_i|\boldsymbol X)>P(C_j|\boldsymbol X)$ for $1\leq j\leq m,\ j\neq i$
* maximize $P(C_i|\boldsymbol X)$: $P(C_i|\boldsymbol X)=\frac{P(\boldsymbol X|C_i)P(C_i)}{P(\boldsymbol X)}$
3.
* $P(\boldsymbol X)$ is constant for all classes
* We maximize $P(\boldsymbol X|C_i)P(C_i)$
* $P(C_i)=\frac{|C_{i,D}|}{|D|}$
4.
* Assume class-conditional independence (the attributes' values are conditionally independent of one another, given the class), thus
* $P(\boldsymbol X|C_i) = \prod_{k=1}^n P(x_k|C_i)=P(x_1|C_i)\times P(x_2|C_i)\times ... \times P(x_n|C_i)$
* (a) If $A_k$ is categorical, $P(x_k|C_i) = \frac{\text{number of tuples of class } C_i \text{ having value } x_k \text{ for } A_k}{|C_{i,D}|}$
* (b) If $A_k$ is continuous-valued, assume a Gaussian distribution with mean $\mu_{C_i}$ and standard deviation $\sigma_{C_i}$ estimated from the class $C_i$ tuples
* $$P(x_k|C_i)=g(x_k,\mu_{C_i},\sigma_{C_i}) = \frac{1}{\sqrt{2\pi}\sigma_{C_i}}e^{-\frac{(x_k-\mu_{C_i})^2}{2\sigma_{C_i}^2}}$$
5.
* To predict class label of $\boldsymbol X$, $P(\boldsymbol X|C_i)P(C_i)$ is evaluated for each class.
* Predict $C_i$ $\Leftrightarrow P(\boldsymbol X|C_i)P(C_i)>P(\boldsymbol X|C_j)P(C_j)$ for $1\leq j\leq m,\ j\neq i$
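
A minimal sketch of steps 3–5 for categorical attributes (the tiny data set and attribute values are made-up illustrations, not the textbook table):

```python
from collections import Counter, defaultdict

# Made-up class-labeled tuples: (age, income) -> buys_computer
data = [(("youth", "high"), "no"), (("youth", "medium"), "no"),
        (("middle", "high"), "yes"), (("senior", "medium"), "yes"),
        (("senior", "low"), "yes"), (("middle", "low"), "yes")]

class_counts = Counter(label for _, label in data)             # |C_i,D|
cond = defaultdict(Counter)                                    # per-class attribute value counts
for x, label in data:
    for k, value in enumerate(x):
        cond[(label, k)][value] += 1

def predict(x):
    # maximize P(X|C_i) P(C_i) under the class-conditional independence assumption
    def score(c):
        p = class_counts[c] / len(data)                        # P(C_i)
        for k, value in enumerate(x):
            p *= cond[(c, k)][value] / class_counts[c]         # P(x_k | C_i)
        return p
    return max(class_counts, key=score)

print(predict(("senior", "high")))                             # -> "yes"
```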
## Rule-Based Classification
### Using IF-THEN Rules for Classification
```
IF condition
THEN conclusion
```
* Assess:
* $coverage(R) = \frac{n_{covers}}{|D|}$
* $accuracy(R) = \frac{n_{correct}}{n_{covers}}$
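
A tiny sketch of the two measures for a single rule (the rule and tuples are made up):

```python
# Made-up labeled tuples: (age, buys_computer)
D = [("youth", "yes"), ("youth", "no"), ("senior", "yes"),
     ("youth", "yes"), ("middle", "no")]

# Rule R: IF age = youth THEN buys_computer = yes
covered = [(x, y) for x, y in D if x == "youth"]
correct = [(x, y) for x, y in covered if y == "yes"]

coverage = len(covered) / len(D)          # n_covers / |D|       -> 0.6
accuracy = len(correct) / len(covered)    # n_correct / n_covers -> 2/3
print(coverage, round(accuracy, 2))
```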
### Rule Extraction from a Decision Tree
* Because the rules are extracted directly from the tree, they are mutually exclusive and exhaustive.
* Exhaustive means there is one rule for each possible attribute–value combination, so no default rule is needed.
### Rule Induction Using a Sequential Covering Algorithm
* Sequential covering algorithm: Learn a set of IF-THEN rules for classification
```=
Input:
    D, a data set of class-labeled tuples;
    Att_vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
Rule_set = {};  // initial set of rules learned is empty
for each class c do
    repeat
        Rule = Learn_One_Rule(D, Att_vals, c);
        remove tuples covered by Rule from D;
        Rule_set = Rule_set + Rule;  // add new rule to rule set
    until terminating condition;
endfor
return Rule_set;
```
* Rule Quality Measures:
* **FOIL** (First Order Inductive Learner):
    * positive tuples: the tuples of the class for which we are learning rules; the remaining tuples are negative
    * $pos$ ($neg$): the number of positive (negative) tuples covered by the current rule $R$
    * $pos'$ ($neg'$): the number of positive (negative) tuples covered by the extended rule $R'$
    * $FOIL\_Gain = pos' \times (log_2\frac{pos'}{pos'+neg'}-log_2\frac{pos}{pos+neg})$
* Rule Pruning:
* Prune rules to avoid overfitting
* $FOIL\_Prune(R) = \frac{pos-neg}{pos+neg}$
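
A small sketch of the two FOIL formulas (the pos/neg counts are hypothetical):

```python
import math

def foil_gain(pos, neg, pos2, neg2):
    """Gain from extending rule R (covers pos/neg) to rule R' (covers pos2/neg2)."""
    return pos2 * (math.log2(pos2 / (pos2 + neg2)) - math.log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """Pruning measure for a rule covering pos positive and neg negative tuples."""
    return (pos - neg) / (pos + neg)

# Hypothetical counts: R covers 30 pos / 20 neg, the extended R' covers 25 pos / 5 neg
print(round(foil_gain(30, 20, 25, 5), 3))                         # positive -> R' is better
print(round(foil_prune(30, 20), 3), round(foil_prune(25, 5), 3))
```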
## Model Evaluation and Selection
### 8.5.1 Metrics for Evaluating Classifier Performance

* $accuracy = \frac{TP+TN}{P+N}$
* $precision = \frac{TP}{TP+FP}$ (exactness)
* $recall = \frac{TP}{TP+FN}$ (completeness)
* $F_1\ score = \frac{2\times precision\times recall}{precision + recall}$
* $F_{\beta} = \frac{(1+\beta^2)\times precision\times recall}{\beta^2\times precision + recall}$
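
A quick check of these metrics from a confusion matrix (the TP/FN/FP/TN counts are made up):

```python
# Hypothetical confusion-matrix counts
TP, FN, FP, TN = 90, 10, 30, 70
P, N = TP + FN, FP + TN                     # actual positives / negatives

accuracy  = (TP + TN) / (P + N)             # 0.80
precision = TP / (TP + FP)                  # 0.75 (exactness)
recall    = TP / (TP + FN)                  # 0.90 (completeness)
f1        = 2 * precision * recall / (precision + recall)

beta = 2                                    # beta > 1 weights recall more than precision
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
print(accuracy, precision, recall, round(f1, 3), round(f_beta, 3))
```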
### Holdout Method and Random Subsampling
* Holdout: the data are randomly partitioned into a training set (typically two-thirds of the data) and a test set (the remaining one-third)
* Random subsampling: holdout method repeated k times
### Cross-Validation
* k-fold cross-validation: partition the data into k folds; in each iteration one fold is the test set and the remaining k−1 folds form the training set

* Leave-one-out: k-fold cross-validation with k = |D|, i.e., one tuple is held out for testing at a time (a sketch of both follows below)
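
A sketch of k-fold cross-validation and leave-one-out with scikit-learn (an assumption; the notes do not name a library):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# 10-fold CV: every tuple is tested exactly once and used for training 9 times
scores = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(scores.mean())

# Leave-one-out: k = |D| folds, one test tuple per fold
print(cross_val_score(model, X, y, cv=LeaveOneOut()).mean())
```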

### Bootstrap
* Each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
* .632 bootstrap: on average 63.2% of the original data tuples end up in the bootstrap sample, and the remaining 36.8% form the test set (a tuple is missed in all $d$ draws with probability $(1-1/d)^d \approx e^{-1} \approx 0.368$).
### Model Selection Using Statistical Tests of Significance
* Determine whether the difference in performance between two models, $M_1$ and $M_2$, is statistically significant.
* Perform 10-fold cross-validation using the same partitioning of the data for both models, and average the 10 error rates obtained for each of $M_1$ and $M_2$.
* **t-test** (Student's t-test): the null hypothesis is that the two models are the same, i.e., the difference in **mean error rate** between them is zero.
* $\overline{err}(M)$: the mean error rate of model $M$ over the $k$ folds
* **t-statistic** with k-1 degrees of freedom for k samples(t-distribution):
* $$t=\frac{\overline{err}(M_1)-\overline{err}(M_2)}{\sqrt{var(M_1-M_2)/k}}$$
* variance: $var(M_1-M_2) = \frac{1}{k}\displaystyle\sum_{i=1}^k[err(M_1)_i-err(M_2)_i-(\overline{err}(M_1)-\overline{err}(M_2))]^2$
* Significance level ($sig$): usually 5% or 1%
* **Confidence limit**:
* Look up the t-distribution table with $k-1$ degrees of freedom at $z = sig/2$ (e.g., 0.025 for a two-sided test at $sig = 5\%$) to get the critical value.
* If $t>z$ or $t<-z$ (with $z$ the tabulated critical value), there is a statistically significant difference between $M_1$ and $M_2$
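
A sketch of the paired t-test on 10-fold error rates (the per-fold error rates are made-up numbers, and `scipy.stats` is an assumed helper):

```python
import numpy as np
from scipy import stats

# Made-up per-fold error rates from the same 10-fold partitioning
err_M1 = np.array([0.12, 0.15, 0.10, 0.14, 0.11, 0.13, 0.12, 0.16, 0.11, 0.13])
err_M2 = np.array([0.14, 0.18, 0.12, 0.15, 0.13, 0.16, 0.15, 0.17, 0.14, 0.15])

d = err_M1 - err_M2
k = len(d)
t = d.mean() / np.sqrt(d.var(ddof=0) / k)      # t-statistic as in the formula above
crit = stats.t.ppf(1 - 0.05 / 2, df=k - 1)     # two-sided critical value at sig = 5%
print(round(t, 3), round(crit, 3), abs(t) > crit)

# scipy's paired t-test (uses the sample variance, so it differs slightly)
print(stats.ttest_rel(err_M1, err_M2))
```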
### Comparing Classifiers Based on Cost–Benefit and ROC Curves
* Receiver operating characteristic curves (**ROC** curve):
* Visualize the trade-off between the true positive rate (TPR) and the false positive rate (FPR)
* $TPR=\frac{TP}{P}$(sensitivity), $FPR=\frac{FP}{N}$(1-specificity)
* Example:
* 
* 
* The closer the ROC curve of a model is to the diagonal line, the less accurate the model.
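
A sketch of building an ROC curve from probability scores (made-up labels and scores; `sklearn.metrics.roc_curve` is an assumption, the textbook constructs the curve by hand from a sorted tuple list):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels (1 = positive) and classifier probability scores
y_true  = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]

# Each threshold on the score yields one (FPR, TPR) point on the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr.round(2), tpr.round(2))))
print(roc_auc_score(y_true, y_score))   # area under the curve; 0.5 = diagonal, 1.0 = perfect
```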
## Techniques to Improve Classification Accuracy
### Ensemble Methods

### Bagging
```=
Input:
    D, a set of d training tuples;
    k, the number of models in the ensemble;
    a classification learning scheme (decision tree algorithm, naïve Bayesian, etc.).
Output: The ensemble: a composite model, M*.
Method:
for i = 1 to k do  // create k models:
    create bootstrap sample, Di, by sampling D with replacement;
    use Di and the learning scheme to derive a model, Mi;
endfor
To classify a tuple X: let each of the k models classify X and return the majority vote;
```
* Random Forests:
* generate k decision trees for the ensemble, each from a bootstrap sample of D, choosing each split from a random subset of the attributes
* Let each of the k models classify X and return the majority vote
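
A sketch of bagging and a random forest with scikit-learn (an assumption; parameter names are scikit-learn's, not the notes'):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Bagging: k bootstrap samples, one tree per sample (the default base learner),
# majority vote at prediction time
bag = BaggingClassifier(n_estimators=10, random_state=0).fit(X, y)

# Random forest: bagging plus a random subset of attributes considered at each split
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

print(bag.predict(X[:3]), rf.predict(X[:3]))
```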
### Boosting and AdaBoost
* Boosting: After a classifier, $M_i$, is learned, the weights are updated to allow the subsequent classifier, $M_{i+1}$ , to “pay more attention” to the training tuples that were misclassified by $M_i$
* **Adaboost**:
* Initially assigns each training tuple an equal weight of 1/d.
* If a tuple was incorrectly classified, its weight is increased, so that the next round focuses more on the tuples misclassified in the previous round.
* $error(M_i)=\displaystyle\sum_{j=1}^d w_j \times err(X_j)$
* $err(X_j)$: the misclassification error of tuple $X_j$
* misclassified $err(X_j)=1$, otherwise $err(X_j)=0$
* Boosting assigns a weight to each classifier’s vote, based on how well the classifier performed.
* weight of $M_i$'s vote: $log\frac{1-error(M_i)}{error(M_i)}$

```=
Input:
D, a set of d class-labeled training tuples;
k, the number of rounds (one classifier is generated per round);
a classification learning scheme.
Output: A composite model.
Method:
initialize the weight of each tuple in D to 1/d;
for i = 1 to k do  // for each round:
    sample D with replacement according to the tuple weights to obtain Di;
    use training set Di to derive a model, Mi;
    compute error(Mi), the error rate of Mi (Eq. 8.34);
    if error(Mi) > 0.5 then
        go back to the sampling step and try again;
    endif
    for each tuple in D that was correctly classified do
        multiply the weight of the tuple by error(Mi)/(1 - error(Mi));  // update weights
    normalize the weight of each tuple;
endfor
```
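
A sketch with scikit-learn's `AdaBoostClassifier` (an assumption; its exact weight update differs in detail from the pseudocode above but follows the same reweight-and-weighted-vote idea):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# k rounds; each round reweights tuples so the next learner focuses on previous mistakes,
# and each learner's vote is weighted by how well it performed
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(boost, X, y, cv=10).mean())
```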
## Improving Classification Accuracy of Class-Imbalanced Data
1. oversampling: Resampling the positive tuples so that the resulting training set contains an equal number of positive and negative tuples.
2. undersampling: decreasing the number of negative tuples
3. threshold moving: move the decision threshold, $t$, so that the rare (positive) class is easier to predict; applies to classifiers that return a continuous score, such as naive Bayesian and neural network classifiers (see the sketch below)
4. ensemble techniques
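
A small sketch of oversampling (1) and threshold moving (3) on a made-up imbalanced data set; the threshold value 0.3 and the logistic-regression classifier are illustrative assumptions:

```python
import random
from sklearn.linear_model import LogisticRegression

random.seed(0)
# Made-up imbalanced data: 90 negative tuples, 10 positive tuples (single numeric attribute)
X = [[i / 100.0] for i in range(100)]
y = [0] * 90 + [1] * 10

# 1. Oversampling: resample the rare (positive) tuples with replacement until balanced
pos = [(xi, yi) for xi, yi in zip(X, y) if yi == 1]
extra = [random.choice(pos) for _ in range(80)]
X_bal = X + [xi for xi, _ in extra]
y_bal = y + [yi for _, yi in extra]
print(sum(y_bal), "positives out of", len(y_bal))   # 90 of 180

# 3. Threshold moving: keep the original data but lower the decision threshold t
clf = LogisticRegression().fit(X, y)
p_pos = clf.predict_proba([[0.93]])[0][1]           # P(positive | X)
print(p_pos, p_pos >= 0.3)                          # predict positive if the score passes t = 0.3
```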