# Chp 6 Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

###### tags: `Data Mining 心得`

## Market Basket Analysis

* Find **associations** between the different items that customers place in their "shopping baskets"

![](https://i.imgur.com/MCBTPyM.png =70%x)

**support** reflects the usefulness and **confidence** reflects the certainty of the discovered rules.

## Frequent Itemsets, Closed Itemsets, and Association Rules

* $support(A \Rightarrow B)= P(A \cup B)$
* $confidence(A \Rightarrow B) =P(B|A) = \frac{support(A \cup B)}{support(A)}$

![](https://i.imgur.com/duALGRL.png =70%x)

* **Strong rule**: satisfies both the min_sup and min_conf thresholds

## Frequent Itemset Mining Methods

### Apriori

#### Steps

* Find all frequent itemsets: candidates $C_k$ are generated from $L_{k-1}$ by a *join* step and then reduced by a *prune* step
* Derive strong association rules from the frequent itemsets

**Downward Closure Property**: any subset of a frequent itemset must also be frequent.

![](https://i.imgur.com/Kk511Qo.png =60%x)

In theory $C_3$ initially contains itemsets such as {A,B,C} and {A,B,E}; the pruning of $C_3$ is not shown step by step here.

**Prune**: Take {A,B,C} as an example. The algorithm checks whether every 2-item subset of {A,B,C} appears in $L_2$ (three subset checks in this case). Since $L_2$ does not contain {A,B}, {A,B,C} cannot be frequent and is removed. A minimal code sketch of this join-and-prune loop is given at the end of this Apriori section.

#### Improvement

* Divide the database
    1. Search for frequent itemsets in each sub-database
    2. Then merge the results
* Hash-based technique

    ![](https://i.imgur.com/dHAkQWz.png =60%x)

    1. Hash the candidate itemsets into buckets of a hash table; itemsets falling into buckets whose counts do not reach min_support cannot be frequent
    2. Remove those itemsets from the candidate set at the end
* Candidate itemset pruning (continuing from the above)

    ![](https://i.imgur.com/A7lYmQs.png =60%x)

    Find the subsets that actually appear in the database transactions and keep a count for each element; then trim the elements whose counts do not exceed k (k = 2 in this example).
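Below is a minimal Python sketch of the candidate-generation loop described above, with explicit *join* and *prune* steps. It is only an illustration under my own assumptions: the toy transaction database and `min_support` value are made up, not taken from the figure, and this is not the textbook's exact pseudocode.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: returns {frozenset(itemset): support_count}."""
    # First scan: count individual items and keep the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        prev = list(frequent)
        # Join step: merge pairs of frequent (k-1)-itemsets whose union has size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (downward closure): every (k-1)-subset must already be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        # Scan the database once to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= set(t)) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Hypothetical toy database (not the figure's data).
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(db, min_support=2))
```

On this toy data with min_support = 2, the sketch reproduces the behaviour described above: the candidate {A,B,C} is dropped in the prune step because its subset {A,B} is not frequent.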
### FP-growth

* Drawback of the Apriori algorithm: it needs to scan the database many times
* FP-growth scans the database only *twice*
    * First scan: calculate the support of each item
    * Second scan: construct the FP-tree

Example:

![](https://i.imgur.com/4O5E8aF.png =40%x)

#### Steps

* **Constructing the FP-tree**
    1. Scan the whole database first, then calculate the support of each item.

        ![](https://i.imgur.com/5kIG2RJ.png =20%x)
    2. Create the root of the tree, labeled 'null'.
    3. Scan the first transaction and create the path $I2 \rightarrow I1 \rightarrow I5$ following the order of the support list.
    4. Scan the second transaction, and so on. If an item of the transaction already lies on the path (e.g. $I2$), increment its count by 1.

        ![](https://i.imgur.com/fzC6p89.png)
    5. An item header table is built so that each item points to its occurrences in the tree via a chain of node-links.
* **Mining the FP-tree**
    1. Start from the last item, $I5$.
    2. **Conditional pattern base**: considering $I5$ as a suffix, its two corresponding prefix paths are $<I2, I1: 1>$ and $<I2, I1, I3: 1>$.
    3. Build the $I5$-conditional FP-tree, which contains only one path: $<I2: 2, I1: 2>$.

        ![](https://i.imgur.com/uovgDBK.png =30%x)
    4. Combinations: $<I2, I5: 2>$, $<I1, I5: 2>$, $<I2, I1, I5: 2>$; these are the frequent patterns.
    5. Switch to $I4$, repeat the steps above, and so on.
* $I3$-conditional FP-tree

![](https://i.imgur.com/YFmXEjt.png =60%x)

![](https://i.imgur.com/ttooXV5.png)

* Pseudocode for mining the FP-tree:

![](https://i.imgur.com/jUhbCqz.png)

### Mining Closed and Max Patterns

* Item merging:
    * If every transaction containing a frequent itemset $X$ also contains an itemset $Y$ but not any proper superset of $Y$, then $X \cup Y$ forms a frequent *closed itemset*.
* Sub-itemset pruning:
    * If $X$ is a proper subset of an already found frequent closed itemset $Y$ and support_count($X$) = support_count($Y$), then $X$ and all of $X$'s descendants in the set enumeration tree cannot be frequent closed itemsets and thus can be pruned.
* Item skipping:
    * If a local frequent item $p$ has the same support in several header tables at different levels, we can safely prune $p$ from the header tables at higher levels.
* When a new itemset is added:
    * **Superset checking**: checks whether the new frequent itemset is a *superset* of some already found closed itemsets.
    * **Subset checking**: checks whether the newly found itemset is a *subset* of an already found closed itemset.
* Subset checking:
    * If the current itemset $S_c$ can be subsumed by another already found closed itemset $S_a$, then
        * (1) $S_c$ and $S_a$ have the same support,
        * (2) the length of $S_c$ is smaller than that of $S_a$, and
        * (3) all of the items in $S_c$ are contained in $S_a$.
* **Two-level hash index structure**:
    * The first level uses the identifier of the last item in $S_c$ as a hash key.
    * The second level uses the support of $S_c$ as a hash key.

## Pattern Evaluation Methods

* **Strong rules are not necessarily interesting**
    * Even though a rule satisfies both the support and confidence thresholds, it can be misleading: the overall (unconditional) probability of the consequent may be higher than the rule's confidence.
    * 'game' --> 'video':
        * Support = 4000/10000 = 40%, Confidence = 4000/6000 = 66%
        * But the overall probability of buying a video is 7500/10000 = 75%

![](https://i.imgur.com/PR9OzH7.png)

### Lift

* $A\Rightarrow B$ [support, confidence, correlation]
* $$lift = \frac{P(A\cup B)}{P(A)P(B)}$$
    * Lift less than 1: the occurrence of A is *negatively* correlated with the occurrence of B
    * Lift greater than 1: A and B are *positively* correlated
    * Lift equal to 1: A and B are *independent*
* $$\frac{P(game, video)}{P(game) P(video)} = \frac{0.4}{0.60 \times 0.75} = 0.89$$
* This negative correlation cannot be identified by a support–confidence framework

### $\chi^2$ measure

* $$\chi^2 = \sum \frac{(observed-expected)^2}{expected}$$

### All confidence

* $$allconf(A,B) = \frac{sup(A\cup B)}{max\{sup(A), sup(B)\}} = min\{P(A|B), P(B|A)\}$$
* $sup(A)$ denotes the support of A
* The minimum confidence of the two association rules related to A and B

### Max confidence

* $$maxconf(A,B) = max\{P(A|B),P(B|A)\}$$
* The maximum confidence of the two association rules

### Kulczynski measure

* $$Kulc(A,B) = \frac{1}{2}(P(A|B)+P(B|A))$$
* The average of the two confidence measures

### Cosine measure

* $$cosine(A,B) = \frac{P(A\cup B)}{\sqrt{P(A) \times P(B)}} = \frac{sup(A\cup B)}{\sqrt{sup(A) \times sup(B)}}$$
* Can be viewed as a *harmonized lift* measure

#### Short conclusion

* The latter four measures are good at distinguishing interesting patterns because they are not influenced by the number of *null-transactions*.
* $Kulc$ used together with the imbalance ratio $IR$ performs well.
* $$IR(A,B)=\frac{|sup(A)-sup(B)|}{sup(A)+sup(B)-sup(A\cup B)}$$
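To make these measures concrete, here is a small Python sketch that evaluates them on the game/video counts quoted in this section (6,000 game buyers, 7,500 video buyers, 4,000 who bought both, out of 10,000 transactions). The function name and structure are my own illustration, not from the text.

```python
def measures(sup_a, sup_b, sup_ab, n):
    """Interestingness measures for A => B, given absolute counts out of n transactions."""
    p_a, p_b, p_ab = sup_a / n, sup_b / n, sup_ab / n
    conf_a_b = p_ab / p_a          # P(B|A), confidence of A => B
    conf_b_a = p_ab / p_b          # P(A|B), confidence of B => A
    return {
        "lift": p_ab / (p_a * p_b),
        "all_conf": min(conf_a_b, conf_b_a),   # = sup(A∪B) / max(sup(A), sup(B))
        "max_conf": max(conf_a_b, conf_b_a),
        "kulc": 0.5 * (conf_a_b + conf_b_a),
        "cosine": p_ab / (p_a * p_b) ** 0.5,
        "IR": abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab),  # imbalance ratio
    }

# game => video example from this section.
print(measures(sup_a=6000, sup_b=7500, sup_ab=4000, n=10000))
```

Running it gives lift ≈ 0.89 (matching the computation above), all_conf ≈ 0.53, max_conf ≈ 0.67, Kulc = 0.60, cosine ≈ 0.60, and IR ≈ 0.16.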