# 2022 Data Mining Project 1 Report
- P76114511 楊晴雯
- Testing environment: macOS (Intel Core), VS Code, Python 3.8
### I. Analysis
1. Data Source: [the 2022 released testdata](https://drive.google.com/file/d/1utxVq9lTt6u0SYJ6oIAs1E8fFWScZDqM/view)
- 2,997 transactions
- 50 items in total
- 5 items per transaction on average
- most frequent item: `14` (2,148 occurrences)
- least frequent item: `17` (10 occurrences)
2. For each frequent set `m` of size `n`, every nonempty proper subset `p` (of size 1 to `n-1`) generates a candidate rule `p -> m-p`, yielding $C^n_1 + ... + C^n_{n-1} = 2^n - 2$ rules. Each rule is then examined by its confidence $= P(m-p \mid p)$; rules with confidence at or above `minconf` are kept in the final rule set.
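A minimal sketch of this rule-generation step (function names are illustrative, not the report's actual code; the singleton and pair supports are back-solved from the support/confidence values in the worked example below, for demonstration only):

```python
from itertools import combinations

def generate_rules(freq_set, support_of, minconf):
    """Enumerate p -> (m - p) for every nonempty proper subset p of m,
    keeping rules whose confidence = P(m) / P(p) >= minconf."""
    m = frozenset(freq_set)
    rules = []
    for r in range(1, len(m)):                 # subset sizes 1 .. n-1
        for p in combinations(sorted(m), r):
            p = frozenset(p)
            conf = support_of[m] / support_of[p]
            if conf >= minconf:
                rules.append((p, m - p, conf))
    return rules

# illustrative supports, back-solved from the worked example below
support_of = {
    frozenset({14, 30, 34}): 0.282,
    frozenset({30}): 0.563, frozenset({34}): 0.640, frozenset({14}): 0.716,
    frozenset({30, 34}): 0.368, frozenset({14, 30}): 0.421,
    frozenset({14, 34}): 0.474,
}
rules = generate_rules({14, 30, 34}, support_of, minconf=0.1)
```

All $2^3 - 2 = 6$ candidate rules survive here because every confidence is above 0.1.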
3. The formulas for rule statistics:
* $support(A\rightarrow B) =P(A, B) = \frac{C(A, B)}{C(all\_transactions)}$
* $confidence(A\rightarrow B) =\frac{P(A,B)}{P(A)}$
* $lift(A\rightarrow B)=\frac{P(A,B)}{P(A)P(B)}$
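These three statistics can be computed directly from raw transaction counts; a minimal sketch (function name and toy data are illustrative):

```python
def rule_stats(transactions, A, B):
    """Support, confidence, and lift of the rule A -> B from raw counts."""
    n = len(transactions)
    A, B = set(A), set(B)
    c_a  = sum(1 for t in transactions if A <= set(t))        # C(A)
    c_b  = sum(1 for t in transactions if B <= set(t))        # C(B)
    c_ab = sum(1 for t in transactions if (A | B) <= set(t))  # C(A, B)
    support    = c_ab / n                # P(A, B)
    confidence = c_ab / c_a              # P(A, B) / P(A)
    lift       = confidence / (c_b / n)  # P(A, B) / (P(A) * P(B))
    return support, confidence, lift

# toy transactions
txns = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3}]
s, c, l = rule_stats(txns, {1}, {2})  # support 0.5, confidence 2/3, lift 8/9
```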
4. Given the generated frequent set $\{14, 30, 34\}$, $C^3_1 + C^3_2 = 6$ rules are generated (as shown below); since `support` $= P(m)$, each rule has the same `support` $= P(\{14, 30, 34\})$. Because their `confidence` statistics are all above the `minconf` threshold, they are all selected. `lift` is calculated as the ratio between the rule's `confidence` and the support of B. `lift` measures the association between items A and B (the ratio between the probability of buying B given A and the probability of buying B without knowing whether A is bought); if `lift` > 1, the two items are positively correlated, i.e. buying A increases the probability of buying B; `lift` < 1 indicates the opposite.
```
antecedent,consequent,support,confidence,lift
{30}   ,{34 14},0.282,0.501,1.056
{34}   ,{30 14},0.282,0.441,1.046
{14}   ,{34 30},0.282,0.394,1.07
{34 30},{14}   ,0.282,0.767,1.07
{30 14},{34}   ,0.282,0.67 ,1.046
{34 14},{30}   ,0.282,0.595,1.056
```
5. One way to measure the strength of a rule `A->B` is:
- `lift` >= 1.2: A and B are positively correlated.
- 0.8 < `lift` < 1.2: A and B are independent.
- `lift` <= 0.8: A and B are negatively correlated.
Given the output of `the 2022 released testdata`, there are 350 rules whose `lift` >= 1.2 and 2 rules whose `lift` >= 1.3 (as shown below). Take the first rule for example: it means the probability of seeing items $\{1, 14, 32\}$ is much higher once we know $\{18, 34\}$ are already in the basket. Note that their support is 0.103, meaning about 310 of the 2,997 transactions contain this pattern; given that the average transaction length is only 5, a 5-item pattern this frequent is a notably strong association.
```
{18 34},{1 32 14},0.103,0.222,1.337
{1 32 14},{18 34},0.103,0.624,1.337
```
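The lift rule of thumb above can be written as a small helper (the thresholds come from the list above; the function name is illustrative):

```python
def correlation_by_lift(lift, low=0.8, high=1.2):
    """Classify the A -> B association by the lift thresholds
    used in this report (0.8 and 1.2)."""
    if lift >= high:
        return "positive"
    if lift <= low:
        return "negative"
    return "independent"

# the two strongest rules mined from the 2022 testdata have lift 1.337
label = correlation_by_lift(1.337)  # -> "positive"
```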
### II. Time and Result Analysis
#### Scenario Statistics
- Data: 2022 released testdata
(minsup, minconf) | # Frequent Sets | # Rules |
----------------- | --------------- | ------- |
(0.1, 0.1) | 1,490 | 10,286 |
(0.1, 0.7) | 1,490 | 1,106 |
(0.7, 0.1) | 2 | 0 |
(0.7, 0.7) | 2 | 0 |
The statistics show that `minconf` only affects the number of rules and has nothing to do with the number of frequent sets. `minsup`, on the other hand, trims the k-itemsets in **Apriori** and the FP-tree, the conditional FP-trees, and finally the rule supports in **FPGrowth**; therefore changing `minsup` changes both # frequent sets and # rules simultaneously.
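This asymmetry can be seen directly in code: frequent sets are mined once under `minsup`, and `minconf` only filters the candidate rules afterward. A sketch with illustrative data:

```python
def filter_rules(rules, minconf):
    """minconf only prunes already-generated rules; the frequent sets
    (and hence the candidate rules) depend only on minsup."""
    return [r for r in rules if r[2] >= minconf]

# (antecedent, consequent, confidence) triples derived from one fixed
# collection of frequent sets -- these never change with minconf
candidates = [("a", "b", 0.501), ("b", "a", 0.441), ("c", "d", 0.767)]

low  = filter_rules(candidates, 0.1)  # all 3 rules survive
high = filter_rules(candidates, 0.7)  # only the 0.767 rule survives
```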
Figure 1 shows that with `minsup` fixed at 0.1, # rules is trimmed by the raised `minconf` threshold, showing a negative-sloped linear tendency. Figure 2 shows that with `minconf` fixed at 0.1, once `minsup` rises from 0.1 to 0.2, # rules drops drastically because the # frequent sets used to generate the rules drops drastically as well.
(Figure 1)
![](https://i.imgur.com/1fp8ofQ.png)
(Figure 2)
![](https://i.imgur.com/lgmYbcM.png)
#### Runtime Statistics
The below statistics are measured in second (s).
- Data: 2022 released testdata
My Apriori algorithm is implemented with a *prefix tree*, so in each iteration the self-joining step can be done in $O(k \log N)$, where $k$ is the itemset size and $N$ is the total number of k-itemsets inserted into the tree.
My FPGrowth is, however, bottlenecked by the enumeration (`itertools.combinations`) of frequent patterns when mining the maximal frequent patterns from each conditional FP-tree.
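The bottleneck comes from enumerating every item combination along a conditional-FP-tree path, which is exponential in the path length. A sketch of that enumeration step (function and variable names are illustrative, not the report's actual code):

```python
from itertools import combinations

def patterns_from_single_path(path, suffix):
    """Enumerate all 2^len(path) - 1 non-empty combinations of items on a
    single conditional-FP-tree path, each joined with the current suffix.
    This exponential enumeration dominates the FPGrowth runtime."""
    patterns = []
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            patterns.append(frozenset(combo) | frozenset(suffix))
    return patterns

# path [14, 30] with suffix [34] yields {14,34}, {30,34}, {14,30,34}
pats = patterns_from_single_path([14, 30], suffix=[34])
```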
(minsup, minconf)| Apriori | FPGrowth |
--------------- | -------- | -------- |
(0.1, 0.1) |3.06 | 29.65 |
(0.1, 0.7) | 2.67 | 27.62 |
(0.7, 0.1) | 0.01 | 0.02 |
(0.7, 0.7) | 0.01 | 0.02 |
### III. Bonus
- Data: [Kaggle dataset (groceries)](https://www.kaggle.com/datasets/rashikrahmanpritom/groceries-dataset-for-market-basket-analysismba?select=basket.csv)
- The preprocessed version: `inputs/groceries-basket-formatted.txt`
The `Apriori` and `FPGrowth` columns show the runtime; the last 2 columns show the results (attached for statistical reference). This table further supports the point stated in Section II that the runtime of both algorithms is bounded by the time used to find the frequent sets, and since # frequent sets determines # rules, the runtime and memory usage depend mainly on `minsup`. The lower `minsup` is, the more optimization work needs to be done to keep the program fast.
It can also be observed that the confidence of the mined rules is relatively high, so that raising `minconf` from 1e-3 to 5e-3 does not filter out any existing rules.
(minsup, minconf) | Apriori Runtime (s) | FPGrowth Runtime (s) | # FreqSets | # Rules |
----------------- | ------------------- | -------------------- | ---------- | ------- |
(1e-3, 1e-3) | 34.48 | 3.00 | 750 | 1,238 |
(5e-3, 1e-3) | 8.44 | 0.6 | 126 | 74 |
(1e-3, 5e-3) | 35.64 | 3.12 | 750 | 1,238 |
(5e-3, 5e-3) | 8.50 | 0.06 | 126 | 74 |