COSC430 discussion: Apriori

--- tags: cosc430 2020 --- # COSC430 discussion: Apriori - Do you prefer discussing the recording after the lecture (a "flipped classroom", or me streaming the lecture? - Is HackMD workable as a way to collect in-lecture notes? - Do breakout rooms work effectively? Unlike in a real classroom, I don't get to listen to the general level of discussion noise, so can't figure out if you've reached a natural conclusion point or not! :smile: ## Key take-away points about the data mining lecture - shopping basket concept (frequent k-itemsets) - Support - Confidence - is there a way of figuring out appropriate minimum support? - No, this will be data-driven - Setting the minimum support too low will possibly lead to: - users being flooded with non-useful "patterns" - the underlying computing systems being put under high load due to the large size of intermediate results produced - slide 9 in the PDF notes has some examples included for the fourth bullet point (Data information and discretisation) since I was interrupted when delivering that part of the lecture, so it wasn't as clear as I'd have liked. (A few other minor points were fixed in the PDF notes too.) # Discussion on Apriori paper - Consider: - Key problem under investigation - Key idea of the proposed solution - How the idea solves the problem - Evaluation - Drawbacks to the proposal? ## Breakout room 1 ### Key problem under investigation - Mining frequent itemsets ### Key idea of the proposed solution - Speed up frequent itemsets mining by applying Apriori pruning principle: - if any subset of an itemset S is infrequent, then there is no chance for S to be frequent ### How the idea solves the problem - We can significantly reduce the number of itemsets to check (i.e., to count their support) by generating them using the join operation for previously discovered large itemsets and then pruning them with Apriori pruning principle ### Evaluation - The proposed algorithms(Apriori and AprioriTid) are more efficient than the AIS and SETM - Synthetic transactions were generated and used to evaluate the performance - AIS and SETM are significantly slower than Apriori (~ 2 times for small datasets, an order of magnitude for large datasets) - This is because their candidate sets sizes are much bigger - Apriori and AprioriTid can be combined together, this approach is called AprioriHybrid - This makes sence because Apriori works faster at the beginning - But AprioriTid starts to work faster when its special sets fit into memory ### Drawbacks to the proposal? - Quantities of items are not supported for now - Multiple taxonomies (is-a hierarchies) are not supported for now - One have to carefully choose the moment for switching from Apriori to AprioriTid algorithm becase the latter is often faster at later stages but switching itself incurs some additional costs ## Breakout room 2 ### Key problem under investigation - problem of investigating association rules and proposes two algo ### Key idea of the proposed solution - more efficient than older algorithms ### How the idea solves the problem - generate the candidate itemsets when itemsets are large in the previous pass - results in much smaller number of candidate itemsets ### Evaluation - Proposed algorithms Apriori and AprioriTid are used for discovering all signficant association rules between items in a large database of transactions. - They are compared with AIS and STEM algorithms. - The experimental results show that proposed algorithm (Apriori and AprioriTid) always outperform AIS and SETM. - Performance Gap increased with problem size and ranged from a factor of three for small problems to more than an order of magnitude for large problems. - The two proposed algorithms are than combined into a Hybrid Apriori algorithm. - Apriori Hybrid scales linearly with the number of transactions. Execution time decreases as number of items in the database increases. - As the average transaction size increases the exeution time increases gradually. ### Drawbacks to the proposal? - not considered hierarchies - not considered quantities of the item bought in a transaction ## Breakout room 3 ### Key problem under investigation - Investigating association rules (itemsets found together frequently) and presenting a new algorithm with better performance than previous algorithms (AIS and SETM) in this area. ### Key idea of the proposed solution - New algorithms differ from previous two as candidate itemsets are generated during a pass. This can be not needed as you can end up counting candidate itemsets that are too small. ### How the idea solves the problem - Aprori and AproriTid only count the candidate itemsets found to be large in previous pass of data. - AproriTid database not used after first pass. An encoding of candidate itemsets used in previous pass. This may explain why AproriTid performances better in later passes. ### Evaluation - Combines performance of two algorithms with better performance (Apriori and AprioriTid) to make a hybrid with best aspects both algorithms. AproriTid is better in later phases. - Hybrid uses apriori in early passes, then aproritid in later passes. - Improvement seen at a larger scale. ### Drawbacks to the proposal? - Quantities of items not considered. - Hierarchies not explored.