# Chp 6 Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
###### tags: `Data Mining Notes`
## Market Basket Analysis
* Find **associations** between the different items that customers place in their “shopping baskets”

**Support** reflects the usefulness of a discovered rule, and **confidence** reflects its certainty.
## Frequent Itemsets, Closed Itemsets, and Association Rules
* $support(A \Rightarrow B)= P(A \cup B)$
* $confidence(A \Rightarrow B) =P(B|A) = \frac{support(A \cup B)}{support(A)}$

* **Strong rule**: satisfies both the min_sup and min_conf thresholds
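To make the two definitions concrete, here is a minimal Python sketch (the basket data and function names are illustrative, not from the chapter):

```python
def support(itemset, transactions):
    """support(X) = fraction of transactions containing every item of X."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B, transactions):
    """confidence(A => B) = support(A ∪ B) / support(A)."""
    return support(set(A) | set(B), transactions) / support(A, transactions)

# Hypothetical mini market-basket database:
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # 0.5 / 0.75 ≈ 0.67
```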
## Frequent Itemset Mining Methods
### Apriori
#### Steps
* Finding all frequent itemsets (candidate generation alternates a *join* step and a *prune* step)
* Deriving strong association rules from the frequent itemsets
**Downward Closure Property**:
Any subset of a frequent itemset must be frequent

In theory, $C_3$ initially contains itemsets such as {A,B,C} and {A,B,E}; the process of pruning $C_3$ is not shown here.
**Prune**:
Take {A,B,C} as an example: the algorithm checks whether all of its 2-item subsets appear in $L_2$ (three subset checks are needed here).
Because $L_2$ does not contain {A,B}, the candidate {A,B,C} is invalid and is removed.
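A minimal sketch of the join and prune steps, assuming itemsets are stored as sorted tuples (the `apriori_gen` name and the sample `L2` below are illustrative):

```python
from itertools import combinations

def apriori_gen(L_k, k):
    """Generate candidate (k+1)-itemsets from the frequent k-itemsets L_k."""
    # Join step: merge two k-itemsets that share their first k-1 items.
    candidates = set()
    for a in L_k:
        for b in L_k:
            if a[:k - 1] == b[:k - 1] and a[k - 1] < b[k - 1]:
                candidates.add(a + (b[k - 1],))
    # Prune step (downward closure): every k-subset must itself be frequent.
    return {c for c in candidates
            if all(s in L_k for s in combinations(c, k))}

L2 = {("A", "B"), ("A", "C"), ("A", "E"), ("B", "C")}
# Join yields (A,B,C), (A,B,E), (A,C,E); prune drops (A,B,E) and (A,C,E)
# because their subsets (B,E) and (C,E) are not in L2.
print(apriori_gen(L2, 2))  # {('A', 'B', 'C')}
```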
#### Improvement
* Partition the database
1. Mine frequent itemsets locally within each partition
2. Merge the local candidates and verify their actual supports with a second scan
* Hash-based

1. While scanning the database, hash each candidate itemset into a bucket of a hash table and count the hits per bucket
2. Afterwards, delete the itemsets whose bucket counts do not reach min_support, since they cannot be frequent
* Candidate itemset pruning (continuing the above)

For each transaction in the database, enumerate the candidate subsets it actually contains, then count how many of them each element appears in.
Elements whose counts do not exceed $k$ (in this example $k=2$) are trimmed away.
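A rough sketch of the hash-bucket idea behind the hash-based improvement (the bucket count and hash function are arbitrary choices here, not from the notes):

```python
from itertools import combinations

def bucket_counts(transactions, num_buckets=8):
    """While scanning for frequent 1-itemsets, also hash every 2-itemset
    of each transaction into a bucket and count the hits per bucket."""
    buckets = [0] * num_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % num_buckets] += 1
    return buckets

def may_be_frequent(pair, buckets, min_support_count):
    """A 2-itemset whose bucket count is below min_support_count cannot
    be frequent, so it is pruned from the candidate set C2."""
    return buckets[hash(pair) % len(buckets)] >= min_support_count
```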
### FP-growth
* Drawback of the Apriori algorithm: it needs to scan the database many times
* FP-growth needs only *two* scans
    * First scan: calculate the support of each item
    * Second scan: construct the FP-tree
Example (min_sup = 2), using the running transaction database:

| TID  | Items          |
| ---- | -------------- |
| T100 | I1, I2, I5     |
| T200 | I2, I4         |
| T300 | I2, I3         |
| T400 | I1, I2, I4     |
| T500 | I1, I3         |
| T600 | I2, I3         |
| T700 | I1, I3         |
| T800 | I1, I2, I3, I5 |
| T900 | I1, I2, I3     |

Descending support order: $I2:7,\ I1:6,\ I3:6,\ I4:2,\ I5:2$.
#### Steps
* **Constructing FP tree**
1. Scan the whole database first, then calculate the support of each item.

2. Create the root of the tree, labeled 'null'
3. Scan the first transaction and create the path $I2 \rightarrow I1 \rightarrow I5$, following descending support order.
4. Scan the second transaction, and so on. If a prefix of the transaction already exists in the tree (e.g. $I2$), increment the count of each shared node by 1.

5. An item header table is built so that each item points to its occurrences in the tree via a chain of node-links (a construction sketch follows these steps).
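A compact construction sketch for the steps above, using the two scans; the `Node` and `build_fp_tree` names are mine, not the textbook's:

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_sup):
    # First scan: support count of every item.
    counts = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in counts.items() if c >= min_sup}
    root, header = Node(None, None), defaultdict(list)
    # Second scan: insert each transaction in descending support order.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            if item in node.children:          # shared prefix: bump count
                node.children[item].count += 1
            else:                              # new branch + header node-link
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header
```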
* **Mining FP tree**
1. Start from the last item, $I5$.
2. **Conditional pattern base**: considering $I5$ as a suffix, its two corresponding prefix paths are $<I2, I1: 1>$ and $<I2, I1, I3: 1>$
3. Build the $I5$-conditional FP-tree, which contains only one path: $<I2: 2, I1: 2>$ ($I3$ is dropped because its count, 1, is below min_sup = 2)

4. Combinations: $<I2, I5: 2>, <I1, I5: 2>, <I2, I1, I5: 2>$; these are the frequent patterns containing $I5$
5. Switch to $I4$, repeat the steps above, and so on.
* $I3$-conditional FP-tree (unlike $I5$'s, it has two branches: $<I2: 4, I1: 2>$ and $<I1: 2>$)


* Pseudo-code for mining the FP-tree (a Python-flavored sketch follows):

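A hedged rendering of the recursive mining procedure, reusing `Node` and `build_fp_tree` from the construction sketch above (all names are mine):

```python
def fp_growth(header, min_sup, suffix=()):
    """Yield (pattern, support) pairs; `suffix` is the conditional pattern
    being grown, e.g. (I5,) while mining the I5-projected tree."""
    # Visit items starting from the least frequent (step 1 starts at I5).
    for item in sorted(header, key=lambda i: sum(n.count for n in header[i])):
        support = sum(n.count for n in header[item])
        pattern = (item,) + suffix
        yield pattern, support
        # Conditional pattern base: each node's prefix path, weighted
        # by that node's count.
        cond_base = []
        for node in header[item]:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_base.extend([tuple(reversed(path))] * node.count)
        # Build the item-conditional FP-tree and recurse on it.
        _, cond_header = build_fp_tree(cond_base, min_sup)
        yield from fp_growth(cond_header, min_sup, pattern)

# Usage with the construction sketch above:
# root, header = build_fp_tree(transactions, min_sup=2)
# print(sorted(fp_growth(header, 2)))
```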
### Mining Closed and Max Patterns
* Item merging:
    * If every transaction containing a frequent itemset $X$ also contains an itemset $Y$ but not any proper superset of $Y$, then $X\cup Y$ forms a frequent *closed itemset*, and there is no need to search for itemsets containing $X$ but not $Y$.
* Sub-itemset pruning:
* If $X$ is a proper subset of an already found frequent closed itemset Y and support_count($X$)=support_count($Y$), then $X$ and all of $X$’s descendants in the set enumeration tree cannot be frequent closed itemsets and thus can be pruned.
* Item skipping:
* If a local frequent item p has the same support in several header tables at different levels, we can safely prune p from the header tables at higher levels.
* When a new frequent itemset is found, two checks are performed:
* **Superset checking**: Checks if this new frequent itemset is a *superset* of some already found closed itemsets.
* **Subset checking**: Checks whether the newly found itemset is a *subset* of an already found closed itemset.
* Subset checking:
* If the current itemset $S_c$ can be subsumed by another already found closed itemset $S_a$, then
* (1) $S_c$ and $S_a$ have the same support.
* (2) The length of $S_c$ is smaller than that of $S_a$.
* (3) All of the items in $S_c$ are contained in $S_a$.
* **Two-level hash index structure**:
* The first level uses the identifier of the last item in $S_c$ as a hash key.
* The second level uses the support of $S_c$ as a hash key.
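A plain-dictionary sketch of that index, collapsing the two hash levels into one dict keyed by the (last item, support) pair; following the description above, the probe uses $S_c$'s last item, and it assumes the subsuming closed itemset is stored under that same item. All names here are illustrative:

```python
closed_index = {}  # (last item, support) -> list of closed itemsets

def register_closed(itemset, support, order):
    """Store a newly found closed itemset under its last item
    (w.r.t. the global item order `order`) and its support."""
    last = max(itemset, key=order.index)
    closed_index.setdefault((last, support), []).append(frozenset(itemset))

def is_subsumed(candidate, support, order):
    """Subset checking for S_c: probe the (last item, support) bucket;
    `c < s` on frozensets covers both the length and containment tests."""
    last = max(candidate, key=order.index)
    c = frozenset(candidate)
    return any(c < s for s in closed_index.get((last, support), []))
```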
## Pattern Evaluation Methods
* **Strong Rules Are Not Necessarily Interesting**
* Although such a rule satisfies both the support and confidence thresholds, it can still be misleading when the overall probability of the consequent exceeds the rule's confidence.
* $game \Rightarrow video$:
    * Support = 4000/10000 = 40%, Confidence = 4000/6000 ≈ 66%
    * But the overall probability of buying a video is 7500/10000 = 75%, so buying a game actually *decreases* the chance of buying a video

### Lift
* $A\Rightarrow B$ [support, confidence, correlation]
* $$lift = \frac{P(A\cup B)}{P(A)P(B)}$$
* Lift less than 1: A is *negatively* correlated with the occurrence of B
* Lift greater than 1: A and B are *positively* correlated
* Lift equal to 1: A and B are *independent*
* $$\frac{P(game, video)}{P(game)\,P(video)} = \frac{0.4}{0.60 \times 0.75} \approx 0.89$$
* Negative correlation cannot be identified by a support–confidence framework
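Checking the 0.89 figure directly from the counts (a throwaway helper, not a standard library call):

```python
def lift(n_ab, n_a, n_b, n_total):
    """lift(A => B) = P(A and B) / (P(A) * P(B))."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

print(lift(4000, 6000, 7500, 10000))  # ≈ 0.889 < 1: negatively correlated
```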
### $\chi^2$ measure
* $$\chi^2 = \sum \frac{(observed-expected)^2}{expected}$$
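Applied to the game/video table above (the 2000, 3500, and 500 cells follow from the stated totals; each expected count is row total × column total / grand total, e.g. $6000 \times 7500 / 10000 = 4500$):

```python
observed = {("game", "video"): 4000, ("game", "no video"): 2000,
            ("no game", "video"): 3500, ("no game", "no video"): 500}
row = {"game": 6000, "no game": 4000}      # game buyers / non-buyers
col = {"video": 7500, "no video": 2500}    # video buyers / non-buyers
total = 10000

chi2 = sum((obs - row[r] * col[c] / total) ** 2 / (row[r] * col[c] / total)
           for (r, c), obs in observed.items())
print(round(chi2, 1))  # 555.6, far above 3.84 (the 0.05 cutoff at 1 d.o.f.),
                       # so 'game' and 'video' are correlated (negatively here)
```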
### All confidence
* $$allconf(A,B) = \frac{sup(A\cup B)}{max\{sup(A), sup(B)\} } = min\{P(A|B), P(B|A)\}$$
* $sup(A)$ denotes the support of $A$
* The minimum of the confidences of the two rules $A\Rightarrow B$ and $B\Rightarrow A$
### Max confidence
* $$maxconf(A,B) = max\{P(A|B),P(B|A)\}$$
* Maximum confidence of the two association rules
### Kulczynski measure
* $$Kulc(A,B) = \frac{1}{2}(P(A|B)+P(B|A))$$
* Average of two confidence measures
### Cosine measure
* $$cosine(A,B) = \frac{P(A\cup B)}{\sqrt{P(A) \times P(B)}} = \frac{sup(A\cup B)}{\sqrt{sup(A) \times sup(B)}}$$
* Can be viewed as a *harmonized lift* measure
#### Short conclusion
* The latter four measures are good at distinguishing interesting patterns because they are *null-invariant*: they are not influenced by the number of *null-transactions* (transactions containing neither $A$ nor $B$).
* Using $Kulc$ together with the *imbalance ratio* $IR$ gives good performance in practice.
* $IR(A,B)=\frac{|sup(A)-sup(B)|}{sup(A)+sup(B)-sup(A\cup B)}$
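For reference, all five quantities side by side as a minimal sketch over raw support counts (`sup_a`, `sup_b`, `sup_ab` stand for $sup(A)$, $sup(B)$, $sup(A\cup B)$):

```python
from math import sqrt

def all_conf(sup_a, sup_b, sup_ab):
    return sup_ab / max(sup_a, sup_b)

def max_conf(sup_a, sup_b, sup_ab):
    return max(sup_ab / sup_a, sup_ab / sup_b)

def kulc(sup_a, sup_b, sup_ab):
    return (sup_ab / sup_a + sup_ab / sup_b) / 2

def cosine(sup_a, sup_b, sup_ab):
    return sup_ab / sqrt(sup_a * sup_b)

def imbalance_ratio(sup_a, sup_b, sup_ab):
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

# Game/video counts from the lift example:
print(kulc(6000, 7500, 4000))             # 0.6   (close to neutral)
print(imbalance_ratio(6000, 7500, 4000))  # ≈ 0.16 (A and B fairly balanced)
```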