# Chp 7 Advanced Pattern Mining

###### tags: `Data Mining 心得`

## 7.2 Pattern Mining in Multilevel, Multidimensional Space

* **Concept hierarchy**: A sequence of mappings from a set of low-level concepts to a higher-level, more general concept set.

![](https://i.imgur.com/9VPGPAC.png)

### 7.2.1 Multilevel Association Rules

* **Uniform minimum support**:
    * The same minimum support threshold is used at every abstraction level.
    * Drawback: items at lower abstraction levels rarely occur as frequently as those at higher levels, so interesting low-level patterns may be missed.

![](https://i.imgur.com/X4UlAqX.png)

* **Reduced minimum support**:
    * The deeper the abstraction level, the smaller the corresponding threshold.

![](https://i.imgur.com/nAgeLIG.png)

* **Group-based minimum support**:
    * Set user-specific, item-specific, or group-based minimum support thresholds when mining multilevel rules.

### 7.2.2 Multidimensional Association Rules

* e.g.
    * $$age(X, \text{“20..29”}) \land occupation(X, \text{“student”}) \Rightarrow buys(X, \text{“laptop”})$$

### 7.2.3 Mining Quantitative Association Rules

1. **Data Cube–Based Mining**:
    * e.g. the lattice of cuboids defining a data cube for the dimensions *age*, *income*, and *buys*.
    * ![](https://i.imgur.com/k1FB7mC.png =80%x)
2. **Clustering-Based Mining**:
    * For each pair of clusters (one per dimension), check whether the combination passes the minimum support threshold. If it does, continue to search for clusters in this 2-D region and proceed to even higher-dimensional combinations.
3. **Statistical Theory**:
    * May disclose exceptional behavior, e.g. a subpopulation whose mean on some attribute deviates significantly from the overall mean.

### 7.2.4 Mining Rare Patterns and Negative Patterns

* **Def**: If itemsets $X$ and $Y$ are both frequent but rarely occur together, then $X$ and $Y$ are **negatively correlated**.
* **Def**: If $X$ and $Y$ are strongly negatively correlated, then
    * $sup(X \cup \bar Y) \times sup(\bar X \cup Y) \gg sup(X \cup Y) \times sup(\bar X \cup \bar Y)$

## 7.3 Constraint-Based Frequent Pattern Mining

### Pruning Pattern Space with Pattern Pruning Constraints

1. 
**Antimonotonic**:
    * If an itemset violates the constraint, so does every superset; the Apriori property (all nonempty subsets of a frequent itemset must also be frequent) is an instance.
2. **Monotonic**:
    * If an itemset satisfies the constraint, so do all of its supersets.
3. **Succinct** (concise):
    * If a rule constraint is succinct, we can directly generate precisely the sets that satisfy it, even before support counting begins.
4. **Convertible**:
    * If the items in the itemset are arranged in a particular order, the constraint may become monotonic or antimonotonic with respect to the frequent itemset mining process.
    * e.g. “$variance(S) \geq v$”, “$standard\_deviation(S) \geq v$”
5. **Inconvertible**

### Pruning Data Space with Data Pruning Constraints

* **Data-succinct constraints**: Can be applied at the beginning of the mining process.
    * e.g. if the mined pattern must contain *digital camera*, then any transaction that does not contain *digital camera* can be pruned at the start.
* **Data-antimonotonic constraints**: If a data entry cannot satisfy a data-antimonotonic constraint with respect to the current pattern, it can be pruned.

### Mining High-Dimensional Data and Colossal Patterns

* **Row enumeration**: (vertical data format)
    * ![](https://i.imgur.com/0C6Vcj3.png =50%x)

#### Pattern-Fusion

* Def: Fuses a small number of shorter frequent patterns into colossal pattern candidates.
* Characteristics:
    * It traverses the pattern tree in a bounded-breadth way.
    * It identifies “shortcuts” whenever possible.

![](https://i.imgur.com/WMcsF8v.png =70%x)

* **Core pattern**:
    * For a pattern $\alpha$, an itemset $\beta \subseteq \alpha$ is a **$\tau$-core pattern** of $\alpha$ if $\frac{|D_{\alpha}|}{|D_{\beta}|} \geq \tau$.
* **$(d, \tau)$-robust**: $\alpha$ is $(d, \tau)$-robust if $d$ is the maximum number of items that can be removed from $\alpha$ such that the resulting pattern is still a $\tau$-core pattern of $\alpha$:
    * $d = \max_{\beta}\{|\alpha| - |\beta| \mid \beta \text{ is a } \tau\text{-core pattern of } \alpha\}$
* **Core descendants**: The lower-level core patterns of a colossal pattern.
* Pattern-Fusion method:
    1. 
**Initial Pool**:
        * Mine the complete set of frequent patterns up to a small size (e.g. 3).
    2. **Iterative Pattern-Fusion**:
        * Takes as input a **user-specified** parameter $K$, the maximum number of patterns to be mined. In each iteration, $K$ seed patterns are drawn at random from the pool, and for each seed we find all the patterns within a ball of a size specified by $\tau$; the patterns in each ball are fused into one superpattern, and these superpatterns form a new pool. If the pool contains more than $K$ patterns, the next iteration begins with this pool for a new round of random drawing. Because the support set of every superpattern shrinks with each iteration, the process terminates.

## 7.5 Mining Compressed or Approximate Patterns

#### Pattern Clustering

* **Closed frequent itemset**: $X$ is closed frequent if $X$ is frequent and there exists no proper super-itemset $Y$ of $X$ such that $Y$ has the same support count as $X$ in $D$.
* **Maximal frequent itemset**: $X$ is maximal frequent if $X$ is frequent and there exists no super-itemset $Y$ such that $X \subset Y$ and $Y$ is frequent in $D$.

![](https://i.imgur.com/9UtLaZF.png =40%x)

* **Pattern distance**:
    * $$Pat\text{_}dist(P_1, P_2) = 1-\frac{|T(P_1)\cap T(P_2)|}{|T(P_1)\cup T(P_2)|}$$
    * where $T(P_1)$ and $T(P_2)$ are the supporting transaction sets of $P_1$ and $P_2$.
* **$\delta$-covered**:
    * $P$ is $\delta$-covered by $P'$ if $O(P)\subseteq O(P^{'})$ and $Pat\text{_}Dist(P,P^{'}) \leq \delta$.
* **Representative pattern** $P_r$:
    * $$\delta = Pat\text{_}Dist(P,P_r) = 1-\frac{|T(P_r)|}{|T(P)|}\geq 1-\frac{k}{min\text{_}sup}$$

#### Extracting Redundancy-Aware Top-k Patterns

* **Redundancy-aware top-k patterns** (trade off significance against redundancy):
    * ![](https://i.imgur.com/RYDibP5.png =50%x)
* Traditional top-k strategy (significance only):
    * ![](https://i.imgur.com/UknQjfU.png =50%x)
* K-summarized pattern strategy (nonredundancy only):
    * ![](https://i.imgur.com/dNqzbqR.png =50%x)
* **Significance measure** $S$:
    * Objective measures: include support, confidence, correlation, and tf-idf.
    * Subjective measures: usually a relative score based on user prior knowledge or a background model.
* **Redundancy** $R$ between $p$ and $q$:
    * $R(p,q) = S(p) + S(q) - S(p,q)$
    * In practice, redundancy is approximated using the distance between patterns.
* **Relative significance** of $p$ given $q$:
    * $S(p \mid q) = S(p,q) - S(q) = S(p) - R(p,q)$

## 7.6 Pattern Exploration and Application

* Semantic annotation of frequent patterns:
    1. Select context units and design a strength weight for each unit to model the contexts of frequent patterns.
    2. Design similarity measures for the contexts of two patterns, and for a transaction and a pattern context.
    3. For a given frequent pattern, extract the most significant context indicators, representative transactions, and semantically similar patterns to construct a structured annotation.
* **Mutual information**:
    * $$I(X;Y) = \sum_{x\in X}\sum_{y\in Y}P(x,y)\log\frac{P(x,y)}{P(x)P(y)}$$
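The mutual information formula above can be computed directly from a joint distribution. A minimal sketch in Python (the `mutual_information` helper and the toy distributions are illustrative, not from the text):

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) = sum over (x, y) of P(x,y) * log2( P(x,y) / (P(x)P(y)) ).

    `joint` maps each pair (x, y) to P(x, y); the values must sum to 1.
    """
    # Marginal distributions P(x) and P(y)
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # Sum the weighted log-ratio over all pairs with nonzero probability
    mi = 0.0
    for (x, y), p in joint.items():
        if p > 0:
            mi += p * log2(p / (px[x] * py[y]))
    return mi

# Independent variables: P(x,y) = P(x)P(y) everywhere, so I(X;Y) = 0
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(mutual_information(independent))  # 0.0

# Perfectly dependent variables: knowing x determines y, so I(X;Y) = 1 bit
dependent = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(dependent))  # 1.0
```

Using base-2 logarithms gives the result in bits; any base works as long as it is used consistently when comparing context units.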