# Chp 9 Classification: Advanced Methods

###### tags: `Data Mining 心得`

# Bayesian Belief Networks

## Concepts and Mechanisms

* Specify joint conditional probability distributions.
* Defined by two components: a **directed acyclic graph (DAG)** and a set of **conditional probability tables (CPTs)**.
* e.g. ![](https://i.imgur.com/mehMTzS.png)
    * $P(LungCancer = yes \mid FamilyHistory = yes, Smoker = yes) = 0.8$
    * $P(LungCancer = no \mid FamilyHistory = no, Smoker = no) = 0.9$

## Training Bayesian Belief Networks

1. Compute the gradients:
    * $$\frac{\partial \ln P_w(D)}{\partial w_{ijk}} = \displaystyle\sum_{d=1}^{|D|}\frac{P(Y_i=y_{ij}, U_i=u_{ik}\mid X_d)}{w_{ijk}}$$
2. Take a small step in the direction of the gradient:
    * $$w_{ijk} \leftarrow w_{ijk}+(l)\frac{\partial \ln P_w(D)}{\partial w_{ijk}}$$
    * $l$: learning rate
3. Renormalize the weights:
    * $w_{ijk}$ must be between 0 and 1
    * $\displaystyle\sum_j w_{ijk}$ must equal 1 for all $i, k$

# Classification by Backpropagation

## Multilayer Feed-Forward Neural Network

![](https://i.imgur.com/yw7BGjs.png =80%x)

## Backpropagation

* We need to calculate $\frac{\partial l}{\partial w}$ to update the parameters ($l$ is the loss function).
    * $\frac{\partial l}{\partial w} = \frac{\partial z}{\partial w} \frac{\partial l}{\partial z}$
* **Forward pass**: $\frac{\partial z}{\partial w}$ equals the input $a$ attached to that weight.
    * e.g. for $z=x_1w_1+x_2w_2+b$: $\frac{\partial z}{\partial w_1}=x_1$ and $\frac{\partial z}{\partial w_2}=x_2$
    * ![](https://i.imgur.com/jV7lZvX.png =60%x)
* **Backward pass**: Construct a reversed neural network whose triangle neurons output $\frac{\partial l}{\partial z}$.
    * ![](https://i.imgur.com/ZRHcnGi.png =70%x)
    * To calculate $\frac{\partial l}{\partial z}$, we also need $\frac{\partial l}{\partial z'}$ and $\frac{\partial l}{\partial z''}$ of the next layer. That is why we reverse the network and compute each neuron's $\frac{\partial l}{\partial z}$ starting from the output layer and propagating backward.
    * ![](https://i.imgur.com/4h06VPc.png =70%x) ![](https://i.imgur.com/v5mF7I9.png =70%x)
* Multiply the forward-pass $\frac{\partial z}{\partial w}$ and the backward-pass $\frac{\partial l}{\partial z}$ to get $\frac{\partial l}{\partial w}$.

# Support Vector Machines (SVM)

## 9.3.1 The Case When the Data Are Linearly Separable

* Search for the **maximum marginal hyperplane**.
    ![](https://i.imgur.com/PPpxDUV.png)
* **Support vectors**: training tuples that fall on the margin hyperplanes, i.e., satisfy $y_i(w_0+w_1x_1+w_2x_2) = 1$.
* Decision boundary:
    * $d(\boldsymbol X^T)=\displaystyle \sum_{i=1}^l y_i\alpha_i\boldsymbol X_i\boldsymbol X^T+b_0$
* The complexity of the learned classifier is characterized by the number of support vectors rather than the dimensionality of the data.

## 9.3.2 The Case When the Data Are Linearly Inseparable

* Apply a kernel function, $K(\boldsymbol X_i, \boldsymbol X_j)$, to the original input data.
    * Replace every $\phi(\boldsymbol X_i)\cdot\phi(\boldsymbol X_j)$ with $K(\boldsymbol X_i, \boldsymbol X_j)$.
* Three admissible kernel functions:
    * Polynomial kernel: $K(\boldsymbol X_i, \boldsymbol X_j) = (\boldsymbol X_i\cdot \boldsymbol X_j+1)^h$
    * Gaussian radial basis function (RBF) kernel: $K(\boldsymbol X_i, \boldsymbol X_j)=e^{\frac{-\parallel \boldsymbol X_i-\boldsymbol X_j\parallel^2}{2\sigma^2}}$
    * Sigmoid kernel: $K(\boldsymbol X_i, \boldsymbol X_j)=\tanh(\kappa \boldsymbol X_i\cdot \boldsymbol X_j-\delta)$

# Classification Using Frequent Patterns

## Associative Classification

* Three steps (a small code sketch of these steps follows the list):
    1. Mine the data for frequent itemsets, that is, find commonly occurring attribute–value pairs in the data.
    2. Analyze the frequent itemsets to generate association rules per class, which satisfy confidence and support criteria.
    3. Organize the rules to form a rule-based classifier.
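A minimal, self-contained sketch of the three steps above on a hypothetical toy dataset (the attribute–value items, class labels, and thresholds are made up for illustration). It enumerates frequent itemsets by brute force, generates class association rules that meet the support and confidence criteria, and orders them by confidence in the spirit of CBA; it is only an illustration of the idea, not an implementation of the specific algorithms below.

```python
from itertools import combinations
from collections import Counter

# Hypothetical toy data: each tuple is (a set of attribute=value items, class label).
data = [
    ({"age=youth",  "income=high"}, "buys=no"),
    ({"age=youth",  "income=high"}, "buys=no"),
    ({"age=middle", "income=high"}, "buys=yes"),
    ({"age=senior", "income=low"},  "buys=yes"),
    ({"age=senior", "income=low"},  "buys=yes"),
    ({"age=youth",  "income=low"},  "buys=yes"),
]
MIN_SUP, MIN_CONF = 2, 0.6      # absolute support count and confidence threshold

# Step 1: mine frequent itemsets (brute-force enumeration is fine for a toy example).
all_items = sorted({item for items, _ in data for item in items})
frequent = {}
for k in (1, 2):
    for cand in combinations(all_items, k):
        sup = sum(1 for items, _ in data if set(cand) <= items)
        if sup >= MIN_SUP:
            frequent[frozenset(cand)] = sup

# Step 2: generate class association rules "itemset -> class" meeting the confidence criterion.
rules = []
for itemset, sup in frequent.items():
    class_counts = Counter(label for items, label in data if itemset <= items)
    for label, count in class_counts.items():
        conf = count / sup
        if conf >= MIN_CONF:
            rules.append((itemset, label, conf, sup))

# Step 3: organize the rules into a classifier (CBA-style: order by confidence, then support).
rules.sort(key=lambda r: (-r[2], -r[3]))

def classify(tuple_items, default="buys=yes"):
    for itemset, label, conf, sup in rules:
        if itemset <= tuple_items:          # the first matching rule fires
            return label
    return default

print(classify({"age=youth", "income=high"}))   # -> buys=no
```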
* **CBA** (Classification Based on Associations)
    * If a set of rules has the same antecedent, the rule with the highest confidence is selected to represent the set.
* **CMAR** (Classification based on Multiple Association Rules)
    * Finds the complete set of rules satisfying the minimum confidence and minimum support thresholds.
    * Uses a weighted $\chi^2$ measure to find the "strongest" group of rules.
    * ![](https://i.imgur.com/l3pDcmz.png)
* **CPAR** (Classification based on Predictive Association Rules)
    * Uses FOIL to build rules that distinguish positive tuples from negative tuples.

### Discriminative Frequent Pattern–Based Classification

1. Feature generation: Use frequent itemset mining to discover frequent patterns, $F$, in each partition.
2. Feature selection: Apply feature selection to $F$, giving $F_S$, the set of selected (more discriminating) frequent patterns.
3. Learning of classification model: Build a classifier on the data set using the selected features.

![](https://i.imgur.com/RfpobVE.png =80%x)

# Lazy Learners (Learning from Your Neighbors)

* **Eager learners**: Construct a generalization (i.e., classification) model before receiving new (e.g., test) tuples to classify.
* **Lazy learners**: Only when a test tuple arrives do they perform generalization, classifying the tuple based on its similarity to the stored training tuples.

### k-Nearest-Neighbor Classifiers (KNN)

![](https://i.imgur.com/hqD7YXk.png)

* Searches the pattern space for the $k$ training tuples that are closest to the unknown tuple.
* Use min-max normalization to transform each attribute value into $[0, 1]$:
    * $v' = \frac{v-min_A}{max_A-min_A}$
* Use Euclidean distance to measure the closeness between tuples:
    * $dist(\boldsymbol X_1, \boldsymbol X_2) = \sqrt{\displaystyle\sum_{i=1}^n (x_{1i}-x_{2i})^2}$

### Case-Based Reasoning (CBR)

* Uses a database of problem solutions to solve new problems.
    * If an identical training case is found, return its accompanying solution.
    * Otherwise, search for training cases similar to the new case.
* e.g., can be applied in medical education.

# Other Classification Methods

### 1. Genetic Algorithms

![](https://i.imgur.com/p34sGiN.png =60%x)

### 2. Rough Set Approach

* A given class $C$ is approximated by two sets: a lower approximation of $C$ and an upper approximation of $C$.

![](https://i.imgur.com/4mEZDnM.png)

### 3. Fuzzy Set Approaches

* Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership that a certain value has in a given category.

![](https://i.imgur.com/4bVgZpU.png)

* Fuzzy set theory is useful for data mining systems performing rule-based classification.

# Additional Topics Regarding Classification

## Semi-Supervised Classification

### Self-training:

1. Select a learning method such as, say, Bayesian classification. Build the classifier using the labeled data, $X_l$.
2. Use the classifier to label the unlabeled data, $X_u$.
3. Select the tuple $x \in X_u$ having the highest confidence (most confident prediction). Add it and its predicted label to $X_l$.
4. Repeat (i.e., retrain the classifier using the augmented set of labeled data); a minimal code sketch of this loop appears after the Cotraining steps below.

### Cotraining:

* Two or more classifiers teach each other.

1. Define two separate, nonoverlapping feature sets for the labeled data, $X_l$.
2. Train two classifiers, $f_1$ and $f_2$, on the labeled data, where $f_1$ is trained using one of the feature sets and $f_2$ is trained using the other.
3. Classify $X_u$ with $f_1$ and $f_2$ separately.
4. Add the most confident $(x, f_1(x))$ to the set of labeled data used by $f_2$, where $x \in X_u$. Similarly, add the most confident $(x, f_2(x))$ to the set of labeled data used by $f_1$.
5. Repeat.
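A minimal self-training sketch, assuming scikit-learn is available and using Gaussian naive Bayes to stand in for "Bayesian classification"; the toy data and labeled/unlabeled split are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Hypothetical toy data: two Gaussian blobs; only four tuples start out labeled.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
labeled = np.zeros(len(X), dtype=bool)
labeled[[0, 1, 50, 51]] = True

X_l, y_l = X[labeled], y[labeled]       # X_l: labeled tuples with their labels
X_u = X[~labeled]                       # X_u: unlabeled tuples (true labels hidden)

clf = GaussianNB()
while len(X_u) > 0:
    clf.fit(X_l, y_l)                               # step 1: train on the labeled data
    proba = clf.predict_proba(X_u)                  # step 2: label the unlabeled data
    best = int(np.argmax(proba.max(axis=1)))        # step 3: most confident prediction
    X_l = np.vstack([X_l, X_u[best]])
    y_l = np.append(y_l, clf.classes_[proba[best].argmax()])
    X_u = np.delete(X_u, best, axis=0)              # step 4: repeat with the augmented X_l

print("final labeled-set size:", len(X_l))
```

Each pass moves only the single most confident prediction into $X_l$, mirroring step 3 above; a practical variant moves a batch per iteration or stops once the confidence drops below a threshold.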
## Active Learning

* An iterative form of supervised learning in which the learner actively asks a human annotator (an oracle) to label the unlabeled tuples it considers most informative, keeping labeling cost low.

![](https://i.imgur.com/LunDdme.png)

## Transfer Learning

* Extract knowledge from one or more source tasks and apply that knowledge to a target task.

![](https://i.imgur.com/fKPzBpi.png)

* e.g. TrAdaBoost (Transfer AdaBoost): reweights the source-task tuples during boosting so that source tuples that mislead the target-task model gradually lose influence. A simplified sketch follows.
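A heavily simplified sketch of that reweighting idea, not the published TrAdaBoost algorithm: decision stumps are boosted on the combined source + target data, misclassified source tuples are down-weighted while misclassified target tuples are up-weighted. The data, constants, and the choice of scikit-learn decision stumps are all illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Hypothetical data: a large source task (slightly shifted distribution)
# plus a small labeled target task.
X_src = np.vstack([rng.normal(0.5, 1, (100, 2)), rng.normal(3.5, 1, (100, 2))])
y_src = np.array([0] * 100 + [1] * 100)
X_tgt = np.vstack([rng.normal(0.0, 1, (10, 2)), rng.normal(4.0, 1, (10, 2))])
y_tgt = np.array([0] * 10 + [1] * 10)

X = np.vstack([X_src, X_tgt])
y = np.concatenate([y_src, y_tgt])
n_src, T = len(X_src), 10

w = np.ones(len(X))                                 # one weight per training tuple
beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_src) / T))

for _ in range(T):
    p = w / w.sum()
    clf = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=p)
    miss = (clf.predict(X) != y).astype(float)      # 1 where a tuple is misclassified
    # weighted error measured on the target tuples only
    eps = np.sum(p[n_src:] * miss[n_src:]) / p[n_src:].sum()
    eps = min(max(eps, 1e-10), 0.499)
    beta_tgt = eps / (1.0 - eps)
    w[:n_src] *= beta_src ** miss[:n_src]           # shrink misclassified source tuples
    w[n_src:] *= beta_tgt ** (-miss[n_src:])        # grow misclassified target tuples

print("target-set accuracy of the last weak learner:",
      (clf.predict(X_tgt) == y_tgt).mean())
```

The published algorithm additionally combines the weak learners from the later half of the iterations by weighted voting; here only the final stump is inspected for brevity.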