# NOTE: Data Mining: Concepts and Techniques

# Ch6 Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

* **Frequent pattern** mining searches for **recurring relationships** in a given data set
    * (EX) Market Basket Analysis
* Frequent Itemsets - Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of items.
* Closed Itemsets
* Association Rules
    * $support(A \Rightarrow B) = P(A \cup B)$ (the proportion of transactions in which the itemset appears)
    * $confidence(A \Rightarrow B) = P(B|A) = \frac{support(A \cup B)}{support(A)}$ (the conditional probability between the two itemsets)
* The occurrence frequency of an itemset is the number of transactions that contain the itemset.
* Association rule mining can be viewed as a two-step process:
    1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min_sup. (Most important)
    2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.

## Frequent Itemset Mining Methods

### Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation

* **Apriori property**: All nonempty subsets of a frequent itemset must also be frequent.

![](https://i.imgur.com/LJq6j3t.png)

**Algorithm: Apriori.** Find frequent itemsets using an iterative level-wise approach based on candidate generation.

* Finding all frequent itemsets
* Deriving valid association rules

Input: D, a database of transactions; min_sup, the minimum support count threshold.
Output: L, the frequent itemsets in D.

![](https://i.imgur.com/iaaNJmn.png)

* join component (steps 1-4)
* prune component (steps 5-7)
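To make the level-wise candidate generation concrete, here is a minimal Python sketch of the join/prune/count loop; it illustrates the idea rather than reproducing the book's pseudocode, and the transaction data and `min_sup` value are made-up examples.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise frequent itemset mining (a minimal sketch of Apriori)."""
    # L1: count single items and keep the frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    result = {s: counts[s] for s in Lk}
    k = 2
    while Lk:
        # Join step: merge frequent (k-1)-itemsets that differ in one item
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # One scan of D counts support for the surviving candidates
        counts = {c: sum(1 for t in transactions if c <= set(t)) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        result.update((c, counts[c]) for c in Lk)
        k += 1
    return result

# Hypothetical market-basket data
transactions = [{"milk", "bread"}, {"milk", "bread", "beer"},
                {"milk", "diapers"}, {"bread", "beer"}]
print(apriori(transactions, min_sup=2))
```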
### Generating Association Rules from Frequent Itemsets

![](https://i.imgur.com/RYzinPn.png)

### Improving the Efficiency of Apriori

* Hash-based technique
    ![](https://i.imgur.com/qJFG7vz.png)
* Transaction reduction - reducing the number of transactions scanned in future iterations
* Partitioning
    ![](https://i.imgur.com/fQ1Mx4l.png)
* Sampling - mining on a subset of the given data
* Dynamic itemset counting - adding candidate itemsets at different points during a scan

### A Pattern-Growth Approach for Mining Frequent Itemsets

* Apriori suffers from two nontrivial costs:
    1. It may generate a huge number of candidate sets.
    2. It may repeatedly scan the whole database and check a large set of candidates by pattern matching.

    --> To **solve** these problems: **frequent pattern growth (FP-growth)** (finding frequent itemsets without candidate generation) (divide-and-conquer)
* An **FP-tree** registers compressed, frequent pattern information.
    ![](https://i.imgur.com/7OrvLu7.png)
* Mining the FP-tree: construct the conditional pattern base of each frequent item
    ![](https://i.imgur.com/cK0hqQs.png)
    --> recursively search for shorter patterns in much smaller conditional databases, then concatenate the suffix
* Mining Closed and Max Patterns
    * Item merging - if $confidence(AC \Rightarrow D) = 1$, it suffices to record only ACD
    * Sub-itemset pruning
    * Item skipping

## Pattern Evaluation Methods

Which Patterns Are Interesting?

* Strong Rules Are Not Necessarily Interesting - they are sometimes misleading
* From Association Analysis to Correlation Analysis
    * The support and confidence measures are insufficient at filtering out uninteresting association rules
        --> Sol: augment the support-confidence framework with a correlation measure --> **correlation rules**
    * $A \Rightarrow B\ [support, confidence, correlation]$
    1. $lift(A,B) = \frac{P(A \cup B)}{P(A)P(B)}$
        * $< 1$ --> negatively correlated
        * $> 1$ --> positively correlated
        * $= 1$ --> independent
    2. $\chi^2$ measure: $\chi^2 = \sum \frac{(observed - expected)^2}{expected}$

### A Comparison of Pattern Evaluation Measures

1. all confidence
    $all\_conf(A,B) = \frac{sup(A \cup B)}{max\{sup(A), sup(B)\}} = min\{P(A|B), P(B|A)\}$
    --> the minimum confidence of the two rules $A \Rightarrow B$ and $B \Rightarrow A$
2. max confidence
    $max\_conf(A,B) = max\{P(A|B), P(B|A)\}$
    --> the maximum confidence of the two rules $A \Rightarrow B$ and $B \Rightarrow A$
3. Kulczynski
    $Kulc(A,B) = \frac{1}{2}(P(A|B) + P(B|A))$
    --> an average of two confidence measures: the probability of itemset B given itemset A, and the probability of itemset A given itemset B
4. cosine
    ![](https://i.imgur.com/PjXthC3.png)
    --> a harmonized lift measure that takes the square root
    * the cosine value is influenced only by the **supports of A, B, and A∪B**, and **not** by the **total number of transactions**

* Which is the best in assessing the discovered pattern relationships?
    ![](https://i.imgur.com/svL4GDy.png)
* null-transaction - a transaction that does not contain any of the itemsets being examined
* Imbalance ratio (IR) - assesses the imbalance of two itemsets
    * $IR(A,B) = \frac{|sup(A) - sup(B)|}{sup(A) + sup(B) - sup(A \cup B)}$
    * the larger the difference, the larger the imbalance
    * independent of the number of null-transactions and of the total number of transactions
* It is important to consider the null-invariance property when selecting appropriate interestingness measures for pattern evaluation
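All of the measures above are simple arithmetic over the support counts, so they are easy to compare side by side. Below is a minimal Python sketch (the function name and the support counts are illustrative assumptions); running it with a larger `n` shows that lift changes while the null-invariant measures do not:

```python
from math import sqrt

def measures(sup_a, sup_b, sup_ab, n):
    """Interestingness measures for itemsets A and B.
    sup_a, sup_b, sup_ab: support counts of A, B, and A∪B; n: total transactions."""
    p_a, p_b, p_ab = sup_a / n, sup_b / n, sup_ab / n
    return {
        # lift depends on n, so it is NOT null-invariant
        "lift": p_ab / (p_a * p_b),
        # the four null-invariant measures: functions of sup(A), sup(B), sup(A∪B) only
        "all_conf": sup_ab / max(sup_a, sup_b),
        "max_conf": max(sup_ab / sup_a, sup_ab / sup_b),
        "kulc": 0.5 * (sup_ab / sup_a + sup_ab / sup_b),
        "cosine": sup_ab / sqrt(sup_a * sup_b),
        # imbalance ratio: how skewed A and B are; also null-invariant
        "IR": abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab),
    }

# Hypothetical counts: adding null-transactions (raising n) changes lift
# but leaves the null-invariant measures and IR untouched.
print(measures(sup_a=1000, sup_b=1000, sup_ab=100, n=10_000))
print(measures(sup_a=1000, sup_b=1000, sup_ab=100, n=100_000))
```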
# Ch7 Advanced Pattern Mining

### Pattern Mining: A Road Map

A general road map on pattern mining research
![](https://i.imgur.com/lVmUkCQ.png)

### Classification

* Basic patterns
* Based on the *abstraction* levels involved in a pattern
* Based on the *number of dimensions* involved in the rule or pattern
* Based on the *types of values* handled in the rule or pattern
* Based on the *constraints or criteria* used to mine *selective patterns*
* Based on *kinds of data and features* to be mined
* Based on *application domain-specific semantics*
* Based on *data analysis usages*

## Pattern Mining in Multilevel, Multidimensional Space

* Mining Multilevel Associations
    ![](https://i.imgur.com/urn3pVK.png)
    Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework
    * Using uniform minimum support for all levels (referred to as uniform support)
    * Using reduced minimum support at lower levels (referred to as reduced support)
    * Using item or group-based minimum support (referred to as group-based support)
    * Checking redundancy among multilevel association rules
* Mining Multidimensional Associations
    * interdimensional association rules - no repeated predicates
        * $age(X, “18-25”) \wedge occupation(X, “student”) \Rightarrow buys(X, “coke”)$
        * search for frequent predicate sets
        * k-predicate set - a set containing k conjunctive predicates
    * hybrid-dimensional association rules - repeated predicates allowed
        * $age(X, “18-25”) \wedge buys(X, “popcorn”) \Rightarrow buys(X, “coke”)$
    * (dynamic) quantitative association rules
* Mining Quantitative Association Rules
    We can discretize quantitative attributes into multiple intervals and then treat them as nominal data in association mining
    * Methods
        1. data cube method - Data Cube-Based Mining of Quantitative Associations: the transformed multidimensional data may be used to construct a data cube
            (EX) ![](https://i.imgur.com/BMyeAih.png)
        2. clustering-based method - Mining Clustering-Based Quantitative Associations
            * quantitative association rules
            * top-down
            * bottom-up
        3. statistical analysis method - Using Statistical Theory to Disclose Exceptional Behavior
            * exceptional behavior
            * (EX) $sex = female \Rightarrow mean\ wage = \$7.90/hr\ (overall\ mean\ wage = \$9.02/hr)$
            * An association rule under the new definition is a rule of the form: population subset ⇒ mean of values for the subset
* Mining Rare Patterns and Negative Patterns --> patterns that are rare or negatively correlated
    * infrequent (or rare) pattern - below (or far below) a user-specified minimum support threshold, e.g., diamond, watch…
        * One method: use group-based "individualized" min-support
    * **negative pattern**
        * **Def:** If itemsets X and Y are both frequent but rarely occur together, i.e., $sup(X \cup Y) < sup(X) \times sup(Y)$, then X and Y are negatively correlated, and $X \cup Y$ is a **negatively correlated pattern** (if "<" is replaced by "$\ll$", they are strongly negatively correlated)
        * Extension ==> patterns containing k-itemsets for k > 2
        * Null-transaction problem - if there are many null-transactions, they may strongly influence a measure's assessment
        * **Def:** X and Y are strongly negatively correlated if $sup(X \cup \overline Y) \times sup(\overline X \cup Y) \gg sup(X \cup Y) \times sup(\overline X \cup \overline Y)$
            * This measure is not null-invariant, so the null-transaction problem applies; use with care
        * **Def:** Suppose that itemsets X and Y are both frequent, i.e., $sup(X) \geq min\_sup$ and $sup(Y) \geq min\_sup$. If $(P(X|Y) + P(Y|X))/2 < \epsilon$, where $\epsilon$ is a negative pattern threshold, then pattern $X \cup Y$ is a **negatively correlated pattern**
            * This definition is based on the Kulczynski measure and is null-invariant
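The Kulczynski-based definition above translates directly into a check over support counts. A minimal sketch, assuming made-up counts and an illustrative threshold $\epsilon$:

```python
def is_negatively_correlated(sup_x, sup_y, sup_xy, min_sup, epsilon):
    """Kulczynski-based test for a negatively correlated pattern X∪Y.
    sup_x, sup_y, sup_xy: support counts of X, Y, and X∪Y (null-invariant inputs)."""
    if sup_x < min_sup or sup_y < min_sup:
        return False  # both X and Y must themselves be frequent
    # Kulc(X, Y) = (P(X|Y) + P(Y|X)) / 2
    kulc = 0.5 * (sup_xy / sup_y + sup_xy / sup_x)
    return kulc < epsilon

# Hypothetical counts: X and Y are each frequent but rarely co-occur
print(is_negatively_correlated(sup_x=500, sup_y=600, sup_xy=10,
                               min_sup=100, epsilon=0.05))
```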
## Constraint-Based Frequent Pattern Mining

#### Mine together with user-provided constraints

* **Knowledge type constraints**: These specify the type of knowledge to be mined
    * (EX) association, correlation, classification, or clustering
* **Data constraints**: These specify the set of task-relevant data, using SQL-like queries
    * (EX) Find products sold together in NY stores this year
* **Dimension/level constraints**: These specify the desired dimensions (or attributes)
    * similar to projection in relational databases
    * (EX) In relevance to region, price, brand, customer category
* **Interestingness constraints**: various kinds of thresholds
    * (EX) support, confidence, and correlation
* **Rule constraints**: These specify the form of, or conditions on, the rules to be mined --> the focus of this section
    * (EX) Small sales (price < $10) triggers big sales (sum > $200)
* pattern space pruning properties
    * antimonotonicity
    * monotonicity
    * succinctness
* data space pruning properties
    * data succinctness
    * data antimonotonicity

### Metarule-Guided Mining of Association Rules

Metarules allow users to specify the syntactic form of rules that they are interested in mining. Metarules may be based on the analyst's experience, expectations, or intuition regarding the data, or may be automatically generated based on the database schema.

### Constraint-Based Pattern Generation: Pruning Pattern Space and Pruning Data Space

* Pruning Pattern Space with Pattern Pruning Constraints
    * pattern mining constraints
        1. antimonotonic - if an itemset S violates constraint c, so does any of its supersets, so further mining along S can be terminated
        2. monotonic - if an itemset S satisfies constraint c, so does any of its supersets, so there is no need to check c again
        3. succinct (precounting prunable) - constraint c can be enforced by directly manipulating the data
            ![](https://i.imgur.com/GO5cXF7.png)
        4. convertible - convert tough constraints into (anti-)monotone constraints by a proper ordering of the items in transactions
        5. inconvertible
* Pruning Data Space with Data Pruning Constraints
    * data-succinct - the data space can be pruned at the initial pattern mining process
    * data-antimonotonic - during the mining process, if a transaction t does not satisfy c, then t can be pruned to reduce data processing effort

## Mining High-Dimensional Data and Colossal Patterns

* further exploring the vertical data format to handle data sets with a large number of dimensions (also called features or items, e.g., genes) but a small number of rows (also called transactions or tuples, e.g., samples)
* mining colossal patterns (patterns of very long length) --> Pattern-Fusion

### Mining Colossal Patterns by Pattern-Fusion

* Pattern-Fusion - fuses a small number of shorter frequent patterns into colossal pattern candidates
    * traverses the pattern tree in a bounded-breadth way
    * Pattern tree traversal: candidates are taken from a pool of patterns, which results in shortcuts through pattern space to the colossal patterns
    ![](https://i.imgur.com/U8eoqr4.png)
    * core pattern (p.341)
* Outline of the Pattern-Fusion method
    1. Initial Pool
    2. Iterative Pattern-Fusion
* Pattern metric space: each point represents a core pattern. The core patterns of a colossal pattern are denser than those of a small pattern, as shown within the dotted lines.
    ![](https://i.imgur.com/dwezVYJ.png)

## Mining Compressed or Approximate Patterns

### Mining Compressed Patterns by Pattern Clustering

-- too many scattered patterns are not meaningful

* Pattern distance - a valid distance metric defined on the set of transactions
    ![](https://i.imgur.com/M5p0hB0.png)

### Extracting Redundancy-Aware Top-k Patterns

* Desired patterns: high significance & low redundancy
    (a) original patterns
    (b) redundancy-aware top-k patterns
    (c) traditional top-k patterns
    (d) k-summarized patterns
    -- the redundancy-aware top-k patterns make a trade-off between significance and redundancy
    ![](https://i.imgur.com/82ebqYO.png)

## Pattern Exploration and Application

### Semantic Annotation of Frequent Patterns

# Ch8 Classification: Basic Concepts

Classification is a form of data analysis that extracts models describing important data classes

## Basic Concepts

* Process
    1. Build a classification model based on previous data
    2. If the model's accuracy is acceptable --> use the model to classify new data

### General Approach to Classification

* Data classification
    1. learning step (or training phase), where a classification model is constructed
    2. classification step, where the model is used to predict class labels for given data
    --> supervised learning

## Decision Tree Induction

The learning of decision trees from class-labeled training tuples

![](https://i.imgur.com/UggRipX.png)

* Classification and Regression Trees (CART)
    1. Discrete-valued
        ![](https://i.imgur.com/f1ktxzM.png)
    2. Continuous-valued
        ![](https://i.imgur.com/BdLSjXy.png)
    3. Discrete-valued and a binary tree
        ![](https://i.imgur.com/MGfDEPb.png)

### Attribute Selection Measures

* **splitting rules**
* Information Gain - The attribute with the highest information gain is chosen as the splitting attribute.
    * **entropy** of D
        * $$Info(D) = -\sum_{i=1}^m p_i \log_2(p_i)$$
        * $p_i$ is a nonzero probability
    * **Information Gain** - the difference between the original information requirement and the new requirement, where $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$
        ![](https://i.imgur.com/0uKzD8T.png)
        * $Gain(A) = Info(D) - Info_A(D)$ = how much would be gained by branching on A
        * choose the attribute with minimum $Info_A(D)$ (equivalently, maximum gain)
        * EX: p.338
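As a quick worked illustration of these formulas, the sketch below computes $Info(D)$, $Info_A(D)$, and $Gain(A)$ for a hypothetical two-class data set split on a binary attribute (all counts are made up):

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy Info(D) = -sum(p_i * log2(p_i)) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_a(partitions):
    """Info_A(D) = sum(|D_j|/|D| * Info(D_j)) after splitting D on attribute A."""
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * info(p) for p in partitions)

# Hypothetical data: class labels "yes"/"no", split on a binary attribute A
D = ["yes"] * 9 + ["no"] * 5
partitions = [["yes"] * 6 + ["no"] * 1,   # tuples with A = a1
              ["yes"] * 3 + ["no"] * 4]   # tuples with A = a2
gain = info(D) - info_a(partitions)
print(f"Info(D) = {info(D):.3f}, Gain(A) = {gain:.3f}")
```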
* **Gain Ratio**
* **Gini Index**

### Tree Pruning

Addresses the problem of overfitting the data

### Scalability and Decision Tree Induction

## Bayes Classification Methods

### Naïve Bayesian Classification

## Rule-Based Classification

### Using IF-THEN Rules for Classification

![](https://i.imgur.com/3sgU1c8.png)

## Model Evaluation and Selection

* true positive (TP), true negative (TN), false positive (FP), false negative (FN), positive (P), and negative (N) samples
    ![](https://i.imgur.com/3M8yiYg.png)
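These counts are all that is needed to derive the usual evaluation measures (accuracy, error rate, sensitivity, specificity, precision). A minimal sketch with hypothetical confusion-matrix counts:

```python
def evaluation_measures(tp, tn, fp, fn):
    """Basic classifier evaluation measures derived from the confusion matrix."""
    p, n = tp + fn, fp + tn            # actual positives P and negatives N
    return {
        "accuracy":    (tp + tn) / (p + n),
        "error_rate":  (fp + fn) / (p + n),
        "sensitivity": tp / p,          # recall / true positive rate
        "specificity": tn / n,          # true negative rate
        "precision":   tp / (tp + fp),
    }

# Hypothetical confusion-matrix counts
print(evaluation_measures(tp=90, tn=80, fp=20, fn=10))
```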