# Data Mining project 1

###### tags: `Datamining`

* [X] Dataset1: Select from kaggle.com / UCI
* [X] Dataset2: Use IBM Quest Synthetic Data Generator
    * [X] https://sourceforge.net/projects/ibmquestdatagen/
    * [X] Generate different datasets
* [X] Frequent itemsets mining
    * [X] Implement Apriori Algorithm and apply it on these datasets
        * [X] Hash?
        * [ ] Tree?
    * [X] FP-growth
* [X] Generate association rules
* [X] Compare your results

---

### Dataset

* Kaggle [movielens](https://www.kaggle.com/jneupane12/movielens?select=ratings.csv)
* [IBM](https://github.com/StoneLin0708/DataMining2020/blob/master/data/ibm.csv) (generated with the IBM Quest Synthetic Data Generator)

---

### Implementation

* C++
* [GitHub](https://github.com/StoneLin0708/DataMining2020)

---

### Mining Time

FP-growth is much faster than Apriori.

* yellow-orange: Apriori
* blue-green: FP-growth

Confidence seems to affect mining time far less than support does, because frequent patterns are already filtered by support before the rules are generated.

![](https://github.com/StoneLin0708/DataMining2020/blob/master/results/fpg_vs_apriori.png?raw=true)
![](https://github.com/StoneLin0708/DataMining2020/blob/master/results/fpg_close.png?raw=true)

---

### Support & Confidence

* support: how large a fraction of the transactions contains the itemset
* confidence: how consistently the consequent appears together with the antecedent

| support | confidence | interpretation |
|-|-|-|
| low  | low  | low frequency in the dataset, no consistent pattern |
| low  | high | low frequency in the dataset, but the pattern is consistent |
| high | low  | high frequency in the dataset, but it comes from many different sub-patterns |
| high | high | high frequency in the dataset and the pattern is consistent |

---
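To make the two measures concrete, below is a minimal C++ sketch, not the repository code: the transaction data, the itemsets, and the `count_support` helper are invented here for illustration. It computes the support of an itemset and the confidence of a rule X -> Y. Since confidence is just a ratio of two supports that are already known after frequent-itemset mining, checking a confidence threshold adds almost no extra work, which matches the timing observation above.

```cpp
// Minimal sketch: support and confidence of a rule X -> Y, computed from
// raw transactions. Names and data here are illustrative only.
#include <algorithm>
#include <cstdio>
#include <set>
#include <vector>

using Itemset = std::set<int>;

// Number of transactions that contain every item of `s` (naive linear scan).
static int count_support(const std::vector<Itemset>& transactions, const Itemset& s) {
    int n = 0;
    for (const auto& t : transactions)
        if (std::includes(t.begin(), t.end(), s.begin(), s.end())) ++n;
    return n;
}

int main() {
    // Toy transaction database; in the project the transactions come from
    // the movielens and IBM datasets instead.
    std::vector<Itemset> db = {{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3}, {1, 3}};

    Itemset X = {1}, Y = {2};
    Itemset XY = X;
    XY.insert(Y.begin(), Y.end());

    double sup_xy = double(count_support(db, XY)) / db.size();  // support(X union Y)
    double sup_x  = double(count_support(db, X))  / db.size();  // support(X)
    double conf   = sup_xy / sup_x;                             // confidence(X -> Y)

    std::printf("support = %.2f, confidence = %.2f\n", sup_xy, conf);
    // Confidence is a ratio of two supports the mining stage already produced,
    // so the confidence threshold barely influences mining time.
    return 0;
}
```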
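For reference, one level-wise Apriori iteration could be sketched as follows. This is an illustrative sketch under assumptions, not the implementation from the GitHub repository: the function name `apriori_step`, the naive counting loop, and the toy data are invented here, and the classic (k-1)-subset pruning step is omitted for brevity.

```cpp
// Minimal sketch of one Apriori iteration (illustrative, not the repository
// code): join frequent (k-1)-itemsets into k-candidates, then keep candidates
// whose support count reaches the minimum.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <set>
#include <vector>

using Itemset = std::set<int>;

std::vector<Itemset> apriori_step(const std::vector<Itemset>& transactions,
                                  const std::vector<Itemset>& prev_frequent,
                                  int min_count) {
    // Join step: union pairs of frequent (k-1)-itemsets that differ in one item.
    std::set<Itemset> candidates;
    for (std::size_t i = 0; i < prev_frequent.size(); ++i)
        for (std::size_t j = i + 1; j < prev_frequent.size(); ++j) {
            Itemset c = prev_frequent[i];
            c.insert(prev_frequent[j].begin(), prev_frequent[j].end());
            if (c.size() == prev_frequent[i].size() + 1)
                candidates.insert(c);
        }

    // Count step: one scan over the transactions per candidate set;
    // keep candidates that meet the minimum support count.
    std::vector<Itemset> next_frequent;
    for (const auto& c : candidates) {
        int n = 0;
        for (const auto& t : transactions)
            if (std::includes(t.begin(), t.end(), c.begin(), c.end())) ++n;
        if (n >= min_count)
            next_frequent.push_back(c);
    }
    return next_frequent;
}

int main() {
    // Toy data: frequent 1-itemsets with min_count = 3, then one step to level 2.
    std::vector<Itemset> db = {{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3}, {1, 3}};
    std::vector<Itemset> frequent1 = {{1}, {2}, {3}};
    auto frequent2 = apriori_step(db, frequent1, 3);

    for (const auto& s : frequent2) {        // prints the frequent 2-itemsets
        for (int x : s) std::printf("%d ", x);
        std::printf("\n");
    }
    return 0;
}
```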