# Data Mining project 1
###### tags: `Datamining`
* [X] Dataset1: Select from kaggle.com / UCI
* [X] Dataset2: Use IBM Quest Synthetic Data Generator
* [X] https://sourceforge.net/projects/ibmquestdatagen/
* [X] Generate different datasets
* [X] Frequent itemsets mining
* [X] Implement the Apriori algorithm and apply it to these datasets (a minimal sketch follows this list)
* [X] Hash?
* [ ] Tree?
* [X] FP-growth
* [X] Generate association rules
* [X] Compare your results
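
For reference, here is a minimal level-wise Apriori sketch in C++ (the project's language). It is **not** the code in the repository linked below; the toy transactions, the `min_support` value, and all function names are illustrative, and the classic subset-pruning step is omitted for brevity.

```cpp
#include <algorithm>
#include <iostream>
#include <set>
#include <vector>

using Itemset = std::set<int>;
using Transactions = std::vector<Itemset>;

// Count how many transactions contain every item of `candidate`.
int count_support(const Transactions& db, const Itemset& candidate) {
    int count = 0;
    for (const auto& t : db)
        if (std::includes(t.begin(), t.end(), candidate.begin(), candidate.end()))
            ++count;
    return count;
}

// Join step: union two frequent k-itemsets; keep the result if it has exactly k+1 items.
std::vector<Itemset> generate_candidates(const std::vector<Itemset>& frequent, std::size_t k) {
    std::set<Itemset> candidates;
    for (std::size_t i = 0; i < frequent.size(); ++i)
        for (std::size_t j = i + 1; j < frequent.size(); ++j) {
            Itemset merged = frequent[i];
            merged.insert(frequent[j].begin(), frequent[j].end());
            if (merged.size() == k + 1) candidates.insert(merged);
        }
    return {candidates.begin(), candidates.end()};
}

int main() {
    // Toy transaction database; each row is one basket of item IDs.
    Transactions db = {{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3}, {1, 3}};
    const int min_support = 2;  // absolute count, chosen only for this demo

    // L1: frequent 1-itemsets.
    std::set<int> items;
    for (const auto& t : db) items.insert(t.begin(), t.end());
    std::vector<Itemset> frequent;
    for (int item : items)
        if (count_support(db, {item}) >= min_support) frequent.push_back({item});

    // Level-wise loop: print the frequent k-itemsets, then grow and re-filter by support.
    for (std::size_t k = 1; !frequent.empty(); ++k) {
        for (const auto& s : frequent) {
            for (int i : s) std::cout << i << ' ';
            std::cout << "(support " << count_support(db, s) << ")\n";
        }
        std::vector<Itemset> next;
        for (const auto& c : generate_candidates(frequent, k))
            if (count_support(db, c) >= min_support) next.push_back(c);
        frequent = std::move(next);
    }
}
```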
---
### Dataset
* Kaggle [movielens](https://www.kaggle.com/jneupane12/movielens?select=ratings.csv)
* [IBM](https://github.com/StoneLin0708/DataMining2020/blob/master/data/ibm.csv)
---
### Implementation
* C++
* [GitHub](https://github.com/StoneLin0708/DataMining2020)
---
### Mining time - FP-growth is much faster
* yellow-orange curves: Apriori
* blue-green curves: FP-growth

This is expected: FP-growth scans the dataset only twice and avoids Apriori's repeated candidate generation and support counting. Confidence seems to affect mining time far less than support does, because frequent itemsets are already filtered by the support threshold before rule generation; the confidence threshold only prunes rules derived from those itemsets afterwards (see the sketch below).
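
A small sketch of why that is (the structure and names are assumptions, not the repository's API): rule generation iterates only over the frequent itemsets that survived the support threshold, and the confidence of each candidate rule is computed from their already-stored support counts, so changing `min_confidence` adds no extra passes over the dataset.

```cpp
#include <iostream>
#include <map>
#include <set>

using Itemset = std::set<int>;

// Rules are derived from the already-mined frequent itemsets and their stored
// support counts, so the confidence check never rescans the transactions.
void generate_rules(const std::map<Itemset, int>& support, double min_confidence) {
    for (const auto& [itemset, sup_ab] : support) {
        if (itemset.size() < 2) continue;
        // Try every single-item consequent (a full implementation enumerates
        // all non-empty proper subsets as consequents).
        for (int consequent : itemset) {
            Itemset antecedent = itemset;
            antecedent.erase(consequent);
            // confidence(A => B) = support(A ∪ B) / support(A); both counts are known.
            double conf = static_cast<double>(sup_ab) / support.at(antecedent);
            if (conf < min_confidence) continue;
            for (int i : antecedent) std::cout << i << ' ';
            std::cout << "=> " << consequent << "  (conf " << conf << ")\n";
        }
    }
}

int main() {
    // Toy support counts as they might come out of Apriori or FP-growth.
    std::map<Itemset, int> support = {{{1}, 4}, {{2}, 4}, {{3}, 4},
                                      {{1, 2}, 3}, {{1, 3}, 3}, {{2, 3}, 3},
                                      {{1, 2, 3}, 2}};
    generate_rules(support, 0.6);
}
```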


---
### Support & confidence
* support - the fraction of transactions that contain the itemset (how much of the data the pattern covers)
* confidence - how consistently the rule holds: confidence(A => B) = support(A ∪ B) / support(A); see the example after the table

| support | confidence | interpretation |
|-|-|-|
| low | low | rare in the dataset and no consistent pattern |
| low | high | rare in the dataset, but the pattern is consistent when it does appear |
| high | low | frequent in the dataset, but the co-occurrence comes from many different sub-patterns |
| high | high | frequent in the dataset and the pattern is consistent |
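
A tiny worked example of both metrics (toy baskets, not taken from the project's datasets): support is measured against the whole dataset, while confidence normalises by the antecedent's support.

```cpp
#include <algorithm>
#include <iostream>
#include <set>
#include <vector>

using Itemset = std::set<int>;

// Fraction of transactions that contain every item of x.
double support(const std::vector<Itemset>& db, const Itemset& x) {
    int count = 0;
    for (const auto& t : db)
        if (std::includes(t.begin(), t.end(), x.begin(), x.end())) ++count;
    return static_cast<double>(count) / db.size();
}

int main() {
    // Toy baskets; the item IDs are arbitrary.
    std::vector<Itemset> db = {{1, 2, 3}, {1, 2}, {2, 3}, {1, 4}, {2, 3, 4}};
    Itemset a = {2}, ab = {2, 3};
    double sup_ab = support(db, ab);        // 3/5 = 0.6
    double conf = sup_ab / support(db, a);  // 0.6 / 0.8 = 0.75
    std::cout << "support({2,3}) = " << sup_ab << '\n';
    std::cout << "confidence({2} => {3}) = " << conf << '\n';
}
```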
---