Data Mining Lab 1

# Data Mining Lab 1 ## Author 313581037 許哲瑜 ## How to run my code in Step Ⅱ and Ⅲ. 1. run my code in Step Ⅱ ```sh python step2.py -f datasetA.data -s 0.3 -p Step2_task1 ``` * `-f` : dataset path。預設值為 `datasetA.data`。 * `-s` : minimum support (%)。預設值為 `0.1`。 * `-p` : output file dir path。預設值為 ```./Step2_task1```。 2. 執行完上述指令後，會在 Step2_task1 資料夾中分別看到 `Step2_task1_datasetA_0.3_result1.txt` 、 `Step2_task1_datasetA_0.3_result2.txt` 、`Step2_task2_datasetA_0.3_result1.txt` 三個執行結果檔案，並在與 `step2.py` 同資料夾下的 `Step2_datasetA_0.3_log.txt` 檔案，是記錄**Task 1**與**Task 2**的執行時間與比例。 3. run my code in Step Ⅲ. ```sh python step3.py -f datasetA.data -s 0.3 -p Step3_task1 ``` * `-f` : dataset path。預設值為 `datasetA.data`。 * `-s` : minimum support (%)。預設值為 `0.1`。 * `-p` : output file dir path。預設值為 ```./Step3_task1```。 4. 執行完上述指令後，會在 Step3_task1 資料夾中分別看到 `Step3_task1_datasetA_0.3_result1.txt` 執行結果檔案，並在與 `step2.py` 同資料夾下的 `Step3_datasetA_0.3_log.txt` 檔案，是記錄**FP-Growth Computation Time**的資訊。 5. 判斷正確性與分析執行時間 ```shell python correctness_analysis.py -r Step2_task1 -s Step3_task1 ``` * `-r` : Step2 dirname。預設值為 `Step2_task1`。 * `-s` : Step3 dirname。預設值為 `Step3_task1`。 * `-l` : log dirname。預設值為 ```log```。 6. 上述指令會將兩個資料夾中的 xxx_task1_datasetA_0.9_result1.txt檔案進行 Frequent Itemset的比較看是否正確，正確結果如下方圖1。5 之後會將log資料夾中的檔案進行效能分析（可以將第2步與第4步的xxx_log.txt檔案移到log資料夾中），並在與`correctness_analysis.py`檔案同資料夾下產生`Analysis_log.txt`，裡面記錄 Step2 與 Step3 分別運行的時間與比率，如圖2。執行結果圖如下：圖1:<img src=https://hackmd.io/_uploads/BJr8bn0gye.png/> 圖2: <img src=https://hackmd.io/_uploads/ByPFZnAgJl.png/> ## Step II. - Mining the datasets generated in Step I using Apriori algorithm: ### Sept II. The modifications you made for Task 1 and Task 2 : #### 1. 修改讀取資料的函式 `dataFromFile` ： > 由於 datasetA.data 的資料中，每一列的前三筆分別代表、第幾列、第幾列、ITEMS數量，因此在讀取資料時要記得過濾。 ```py def dataFromFile(fname): """Function which reads from the file and yields a generator""" with open(fname, "r") as file_iter: for line in file_iter.readlines(): line = line.strip().split(" ")[3:] # 以空格切分資料，並過濾前三筆 SID TID NITEMS 不需要的欄位資訊 record = frozenset(line) # frozenset可以將重複的資料刪除，且回傳一個無法再更動的集合 yield record # yield是節省記憶體空間的用法 ``` >[!Tip] > * file_iter.readlines()：可以將檔案內容根據\n讀取成一行一行的串列，可以用for迴圈取出每一行內容。 > * line.strip().split(" ")[3:] : 將每一行的內容去除換行符號後以空格切分，並過濾出前三筆資料。 #### 2. 新增計算運行時間的程式碼： > 在Task1與Task2中都需要計算運行的時間：Count and show the whole computation time for this task (task independent)，因此在 ```if __name__ == "__main__"``` 中做了以下修改 ```py ### Task 1 Mining all Frequent Itemset ### start_task1 = time.time() largeSet, candidate_counts, items = runApriori(inFile, minSupport) end_task1 = time.time() task1_time = end_task1 - start_task1 # Count the computation time for this task (task independent). print(f"Task 1 Computation Time: {task1_time:.4f} seconds") ### Task 2 Mining all Frequent Closed Itemset ### start_task2 = time.time() # The aim for this task is to mine all Frequent Closed Itemset with computation as fast as possible. closed_itemsets = find_closed_itemsets(items) end_task2 = time.time() task2_time = task1_time + end_task2 - start_task2 print(f"Task 2 Computation Time: {task2_time:.4f} seconds") # how the ratio of computation time compared to that of Task 1: Task2 / Task1 * 100% time_ratio = (task2_time / task1_time) * 100 print(f"Task 2 to Task 1 Time Ratio: {time_ratio:.2f}%") # 將結果儲存起來 fname = options.input.split("/")[-1].split('.')[0] with open('Step2_'+fname+'_'+str(options.minS)+'_log.txt', 'w') as file: # Print out the total number of frequent closed itemset. file.write(f"Task 1 Computation Time: {task1_time:.4f} seconds\n") file.write(f"Task 2 Computation Time: {task2_time:.4f} seconds\n") file.write(f"Task 2 to Task 1 Time Ratio: {time_ratio:.2f}%") ``` >[!Tip] > * time.time():可以取得當下的時間，再進行相減後便可以得到運行時間。 > * task2_time = task1_time + end_task2 - start_task2：Task2 是使用 Task1 的結果在進行運算取得**Frequent Closed Itemset** ，因此要加上Task1的運行時間 #### 3. 修改一些 Apriori algorithm 相關的函式以符合作業需求 : * `returnItemsWithMinSupport` : 為計算輸入的項目集分別的 support 支持度，並篩選出那些滿足 minSupport 最小支持度要求的項目集 -> Frequent Itemsets ```py # 計算項目集的 support 支持度，並篩選出那些滿足 minSupport 最小支持度要求的項目集 -> Frequent Itemsets def returnItemsWithMinSupport(itemSet, transactionList, minSupport, freqSet): """calculates the support for items in the itemSet and returns a subset of the itemSet each of whose elements satisfies the minimum support""" _itemSet = set() localSet = defaultdict(int) minSupport = round(float(minSupport) / 100,3) # 輸入為％，因此要除 100 for item in itemSet: for transaction in transactionList: if item.issubset(transaction): # 如果此 item 出現在這筆交易 transaction 中 freqSet[item] += 1 localSet[item] += 1 for item, count in localSet.items(): support = float(count) / len(transactionList) # 計算每一筆 item 的 support 支持度(出現的頻率) if support >= minSupport: # 如果大於設定的 minSupport (最小支持度)則保留起來，即為 Frequent Itemsets (頻繁項目集） _itemSet.add(item) return _itemSet ``` >[!Caution] 注意！！！ > * **minSupport = round(float(minSupport) / 100,3)** : 由於指令輸入的 **minSupport** 是百分比的格式（％），因此要將數值處以100才會正確。 > * 我一開始只有單純寫 minSupport = float(minSupport) / 100，在 datasetA_0.9的數據中會出現 minSupport 為 0.0090000002 的值！ > 會導致 support 為 0.009 的 item 全部被刪除，進而出現錯誤的數量，因此要記得使用round進行浮點數的處理！ * `joinSet` : 為得到下一次的itemset的組合（擴展），沒有進行修改 ```py # 得到下一次的itemset的組合（擴展） def joinSet(itemSet, length): """Join a set with itself and returns the n-element itemsets""" return set( [i.union(j) for i in itemSet for j in itemSet if len(i.union(j)) == length] ) ``` * `getItemSetTransactionList` : 從輸入資料中取得每個集合中只有一個值的 itemSet，與所有的交易筆數 transactionList，並回傳 1-itemSets及所有的transactions，沒有進行修改 ```py # 取得每個集合中只有一個值的 itemSet，與所有的交易筆數 transactionList def getItemSetTransactionList(data_iterator): transactionList = list() itemSet = set() # 集合的資料結構，不會出現重複的 item for record in data_iterator: transaction = frozenset(record) transactionList.append(transaction) for item in transaction: itemSet.add(frozenset([item])) # Generate 1-itemSets return itemSet, transactionList ``` * `runApriori` : 根據 Task 1: Mining all Frequent Itemset 中 (b) Result file 2 (statistics file)的需求，需要紀錄 apriori algorithm 中每一次運算的前後候選的itemset數量，因此我做了以下修改： > 1. ```candidate_counts = []``` : 用來記錄作業需求，方便後續寫檔案使用 > 2. ```returnItemsWithMinSupport```：是 apriori algorithm 的核心，即將輸入的候選 itemset 根據 minSupport 來判斷是否符合頻繁項目集 Frequent Itemset，因次要在此函式前後紀錄 itemset 的數量。 > 3. 另外作業有要求：The index of iteration starts from 1，因此我在進入`while` 迴圈前有做一次紀錄: **`candidate_counts.append((1, num_candidates_before, num_candidates_after))`** ```py def runApriori(data_iter, minSupport): """ run the apriori algorithm. data_iter is a record iterator apriori algorithm 是根據以下兩個性質來運作： 1. 如果一個項目集是符合 minSupport 的頻繁項目集，則他的所有子集合也是頻繁項目集。 2. 相反來說，如果一個項目集不符合（就不是頻繁項目集）的話，即可以刪除不用再看（透過剪枝可以大幅減少計算） Return toRetItems: - items (tuple, support) """ # 步驟一：生成候選項目集，從每個items只有一個開始。 itemSet, transactionList = getItemSetTransactionList(data_iter) freqSet = defaultdict(int) largeSet = dict() candidate_counts = [] num_candidates_before = len(itemSet) # which satisfy minSupport oneCSet= returnItemsWithMinSupport(itemSet, transactionList, minSupport, freqSet) num_candidates_after = len(oneCSet) candidate_counts.append((1, num_candidates_before, num_candidates_after)) currentLSet = oneCSet # 步驟二：逐步擴展候選項目集，直到計算完所有的可能，產生符合 minSupport 的 frequent itemsets k = 2 while currentLSet != set([]): largeSet[k - 1] = currentLSet # Generate next candidates by joining the current frequent itemsets currentLSet = joinSet(currentLSet, k) num_candidates_before = len(currentLSet) currentCSet= returnItemsWithMinSupport( currentLSet, transactionList, minSupport, freqSet ) currentLSet = currentCSet num_candidates_after = len(currentLSet) # 儲存每一次修剪前後的itemset數量 candidate_counts.append((k, num_candidates_before, num_candidates_after)) #print("candidate_counts:",candidate_counts) k = k + 1 def getSupport(item): """local function which Returns the support of an item""" return float(freqSet[item]) / len(transactionList) # 處理回傳結果，toRetItems 中包含每一個 frequent itemsets 與該項目集的支持度 Support toRetItems = [] for key, value in largeSet.items(): toRetItems.extend([(tuple(item), getSupport(item)) for item in value]) return largeSet, candidate_counts, toRetItems ``` >[!Tip] 函式輸出： > * largeSet : 是所有 frequence Itemset，可以用len(largeSet)，來取得該資料集符合 minSupport 的筆數。 > * candidate_counts : 即為 Result file 2 (statistics list)要求所需要的內容。 > * toRetItems : 是根據 Result file 1 (itemset list) 要求所需要的內容，裡面會包含所有的frequence Itemset以及對應的Support數值 * `find_closed_itemsets` : 是根據 Task 2: Mining all Frequent Closed Itemset 中的需求，所額外建立的函式。 > 1. 將 Task 1 所得到的`frequent Itemset`作為輸入。 > 2. 裡用雙層迴圈遍歷所有Itemset，並判斷當前`item` 是否為其他`other_item`的**超集的子集**，並且該`item`的`support`與`other_item`的`other_support`相同，則當前項集**不是閉合的**。 > 3. 最後判斷是否任然為閉合的，是則將其添加到閉合項集中。 > 4. 作業要求：Please sort those frequent closed itemsets according to support (from large to small).所以要將**closed_itemsets**根據support進行排序，設定reverse=True可以由大到小排序。 ```py # Task 2 - Frequent Closed Itemset # mine all Frequent Closed Itemset with computation def find_closed_itemsets(freq_items): closed_itemsets = [] freq_dict = {frozenset(item): support for item, support in freq_items} for item, support in freq_dict.items(): is_closed = True for other_item, other_support in freq_dict.items(): # 該 item 是其他 freq_items的子集合，或有相同的支持度，則此 item 就不是 Frequent Closed Item if item != other_item and item.issubset(other_item) and support == other_support: is_closed = False break if is_closed: closed_itemsets.append((item, support)) # Please sort those frequent closed itemsets according to support (from large to small). closed_itemsets.sort(key=lambda x: x[1], reverse=True) return closed_itemsets ``` >[!Tip] 函式輸出： > * closed_itemsets : 包含了所有的 Frequent Closed Itemset，並且根據support值由大到小排序。 #### 4. 將上述的結果寫成檔案： > 為了確保檔案名稱的一致性，我將數入參數進行以下處理，以取得資料集名稱與最小支持度 ```py fname = options.input.split('.')[0] # 取得輸入的資料及名稱 minSupport = options.minS ``` * `weitefile_itemset` : 將```runApriori```函式的結果 toRetItems 與 fname、minSupport輸入，注意要求的suooort要以百分比表示。 ```py # Task 1 / Result 1 / # Print out all frequent itemsets with support (take percentage % and round to the first decimal place) def weitefile_itemset(items,fname,minSupport): with open(options.path + '/Step2_task1_'+fname+'_'+str(minSupport)+'_result1.txt','w') as outfile: # sort those frequent itemsets according to support (from large to small). for item, support in sorted(items, key=lambda x: x[1], reverse=True): temp = ','.join(list(item)) # take percentage % and round to the first decimal place support = round(support*100,1) # File format (each frequent itemset): [support]\t[item_set]\n outfile.write("{:.1f}".format(support)+'\t{'+temp+'}\n') ``` >[!Note] 使用方式： > ``` weitefile_itemset(items,fname,minSupport) ``` * `save_statistics` : 將```runApriori```函式的結果 largeSet、candidate_counts 與 fname、minSupport輸入。 ```py # Task 1 / Result 2 / # the number of candidates generated before and after pruning def save_statistics(largeSet, candidate_counts,fname,minSupport): total_freq_itemsets = sum(len(v) for v in largeSet.values()) with open(options.path + '/Step2_task1_'+fname+'_'+str(minSupport)+'_result2.txt', 'w') as file: # Print out the total number of frequent itemsets. file.write(f"{total_freq_itemsets}\n") # Print out the number of candidates generated before and after pruning, respectively, in each iteration of Apriori. for iteration, num_candidates_before, num_candidates_after in candidate_counts: # [index_of_iteration]\t[the number of candidates (before pruning)]\t[the number of candidates (after pruning)]\n file.write(f"{iteration}\t{num_candidates_before}\t{num_candidates_after}\n") ``` >[!Note] 使用方式： > ``` save_statistics(largeSet, candidate_counts,fname,minSupport) ``` * `save_closed_itemsets` : 將```find_closed_itemsets```函式的結果 closed_itemsets 與 fname、minSupport輸入，，注意要求的suooort要以百分比表示。 ```py # Task 2 / Mining all Frequent Closed Itemset def save_closed_itemsets(closed_itemsets,fname,minSupport): fname = options.input.split('.')[0] with open(options.path + '/Step2_task2_'+fname+'_'+str(minSupport)+'_result1.txt', 'w') as file: # [total number of frequent closed itemsets]\n file.write(f"{len(closed_itemsets)}\n") # [support]\t[item_set]\n for item, support in closed_itemsets: support_percentage = round(support * 100,1) #take percentage % and round to the first decimal place item_list = '{'+','.join(item)+'}' file.write(f"{support_percentage}\t{item_list}\n") ``` >[!Note] 使用方式： > ``` save_closed_itemsets(closed_itemsets,fname,minSupport)``` ### Sept II. Computation time and Anaylsis : 以下針對不同的資料集與不同的最小支持度來做一些分析，其中冒號後方的數字為Frequent Itemset的總數量。 - In datasetA with minimum support 0.3% : 7776 <img src=https://hackmd.io/_uploads/HkWreA0xke.png/> - In datasetA with minimum support 0.6% : 909 <img src=https://hackmd.io/_uploads/BJXwlARlkg.png/> - In datasetA with minimum support 0.9% : 414 <img src=https://hackmd.io/_uploads/rybYeARgkg.png/> - In datasetB with minimum support 0.2% : 5820 <img src=https://hackmd.io/_uploads/Sk3se0Aeke.png/> - In datasetB with minimum support 0.4% : 1445 <img src=https://hackmd.io/_uploads/B1bTxA0xJg.png/> - In datasetB with minimum support 0.6% : 667 <img src=https://hackmd.io/_uploads/H1cClARg1g.png/> - In datasetC with minimum support 0.5% : 991 <img src=https://hackmd.io/_uploads/Sk31b00xJx.png/> - In datasetC with minimum support 1.0% : 356 <img src=https://hackmd.io/_uploads/B16gWARgke.png/> - In datasetC with minimum support 1.5% : 247 <img src=https://hackmd.io/_uploads/ByC-ZRRgJl.png/> 以上的時間計算皆不包含寫入檔案所花費得時間。 ### Sept II. observations/discoveries : 1. 根據上述的實驗結果顯示，Task 2 利用 Task 1的結果再進行額外的運算，花費的時間會根據Frequent Itemset的總數量來增加，不過所額外花費的時間非常少，兩者運行時間的比例都接近100%左右。 2. 在同一份資料集中，可以發現 minimum support 的數值設定越小，符合的Itemset越多，在 apriori演算法下每次迭代得量也越多，因次花飛的時間也大幅增加。 3. 隨著資料集的筆數增加(datasetA->datasetB->datasetC)，所需的時間也會大幅增加，但也是要根據資料集的特性、分佈而定，並不是線性成長的關係。 4. 另外如上方紅色注意區塊顯示，minimum support數值計算中浮點數位元的問題，造成結果根本依據不是我們所設定的minimum support。這是我在做完sept3後回去對照時才發現！前前後後花費了不少時間成本重新計算，下次要更加小心才行。 ## Step III. - Mining the datasets generated in Step I using Apriori algorithm: 我採用了**FP-Growth** 演算法來mining 頻繁項目集 Frequent Itemset。 FP-Growth 是一種非候選集生成（Non-Candidate-based）的演算法，通過構建 **頻繁模式樹（FP-Tree）** 來壓縮交易數據並進行模式挖掘。我根據下方的網頁來學習並建立 FP-Growth algorithm： > https://github.com/lxtGH/MachineLearning-1/blob/python-2.7/docs/12.%E4%BD%BF%E7%94%A8FP-growth%E7%AE%97%E6%B3%95%E6%9D%A5%E9%AB%98%E6%95%88%E5%8F%91%E7%8E%B0%E9%A2%91%E7%B9%81%E9%A1%B9%E9%9B%86.md > > https://medium.com/@tinahuang_4101/associating-rule-%E9%97%9C%E8%81%AF%E5%BC%8F%E6%B3%95%E5%89%87%E6%87%89%E7%94%A8-apriori-fp-growth-3ab46deeeb77 ### FP-Growth algorithm 的基本流程： > FP-Growth 主要分成兩大步驟，首先是建立 FP-Tree，再來就是從建立好的FP-Tree計算得到Frequent Itemset，於是我便照著流程寫出對應的函式。 1. 首先計算每個項目的支持度，並保留那些支持度超過最小支持度閾值的項目。 ```py # 預處理交易數據 def preprocess_transactions(data_iterator, min_support): transactionList = list() item_counts = defaultdict(int) for record in data_iterator: transaction = frozenset(record) transactionList.append(transaction) for item in transaction: item_counts[item] += 1 # 計算該 item 出現在交易數據中的次數 # 計算最小支持度的閾值 ( min_support(%) / 100 * 交易數據比數) min_support = round(min_support / 100 * len(transactionList),3) # 過濾掉低於最小支持度的項目 items = set(item for item, count in item_counts.items() if count >= min_support) # 移除非頻繁項目，並對剩下的項目進行排序 # 將所有交易中的項目按照支持度從高到低進行排序（高頻項目先排序），這樣具有相同前綴的交易可以共享樹中的節點。 processed_transactions = [] for transaction in transactionList: processed_transaction = [item for item in transaction if item in items] processed_transaction.sort(key=lambda item: (-item_counts[item], item), reverse=True) if processed_transaction: processed_transactions.append(processed_transaction) return processed_transactions, len(transactionList) ``` >[!Important] 設定穩定排序 > * processed_transaction.sort(key=lambda item: (-item_counts[item], item), reverse=True) > > 在相同的support數值下，根據項目名稱的數字順序進行次要排序，這樣可以穩定FP-Tree 的建立。 > 如果沒有這樣處理，會導致每次執行的結果都不相同！ 2. 構建 FP-Tree：對每筆交易進行排序，將頻繁項目按支持度排序後插入 FP-Tree。 > 這裡根據作業要求對開源的 FP-Growth 基礎程式碼 `fpgrowth.py` 進行了以下修改： * **表頭鏈結改進**：在 `FPTree` 類別中，添加了 `headers` 字典來記錄每個項目的節點鏈接，便於快速建立條件模式基。 * 穩定排序：在多處排序中增加了穩定排序的條件，確保多次執行結果的一致性 : `items = sorted(tree.headers.items(), key=lambda x: (sum(node.count for node in x[1]),x[0]))` * **輸出結果格式**：作業要求，並增加了支持度百分比的計算，並且存取和調整需要的輸出資料 : `frequent_patterns[tuple(new_pattern)] = support`。 ```py from collections import defaultdict, namedtuple from itertools import chain, combinations # 建立 FP Tree 的節點： # 可以表示每個節點的計數 class FPTreeNode: def __init__(self, item, count, parent): self.item = item self.count = count self.parent = parent self.children = {} self.link = None # 用於節點間的鏈接 def increment(self, count): self.count += count # 建立 FP Tree： class FPTree: def __init__(self): self.root = FPTreeNode(None, 1, None) self.headers = defaultdict(list) # 每個項目的表頭 # insert 將交易資料輸入到樹的資料結構中 def add_transaction(self, transaction, count): current_node = self.root for item in transaction: if item not in current_node.children: # 創建新節點 new_node = FPTreeNode(item, count, current_node) current_node.children[item] = new_node # 更新表頭鏈接 self.headers[item].append(new_node) else: current_node.children[item].increment(count) current_node = current_node.children[item] def is_single_path(self, node=None): if node is None: node = self.root while len(node.children) == 1: node = next(iter(node.children.values())) return len(node.children) == 0 # FP-Growth 遞歸挖掘 def get_fpgrowth(tree, min_support, suffix, frequent_patterns): items = sorted(tree.headers.items(), key=lambda x: (sum(node.count for node in x[1]),x[0])) for item, nodes in items: # 計算支持度 support = sum(node.count for node in nodes) if support >= min_support: # print("support",support) # 發現頻繁項目集 new_pattern = suffix + [item] frequent_patterns[tuple(new_pattern)] = support # 構建條件模式基 conditional_tree = FPTree() for node in nodes: path = [] parent = node.parent while parent and parent.item is not None: path.append(parent.item) parent = parent.parent path.reverse() if path: conditional_tree.add_transaction(path, node.count) # 遞歸查找子樹的頻繁模式 if conditional_tree.root.children: get_fpgrowth(conditional_tree, min_support, new_pattern, frequent_patterns) ``` 使用以下程式碼，將欲處理好的交易資料加入到 FP-Tree 中對於每筆交易，逐項插入 FP-Tree，並使用 `headers` 字典鏈接每個相同項目的節點。 ```py # 構建 FP-Tree tree = FPTree() for transaction in processed_transactions: tree.add_transaction(transaction, 1) ``` 3. 使用條件模式基（Conditional Pattern Base）來生成條件 FP-Tree，並遞歸地從子樹中挖掘頻繁模式。 - 針對每個項目生成條件模式基，並構建子 FP-Tree。 - 遞歸地對條件 FP-Tree 進行頻繁項目集挖掘，直到所有頻繁項目集均被找到。 - 透過上方定義好的 `get_fpgrowth` 函式進行。 ```py # 挖掘頻繁模式 min_support = int(min_support / 100 * lengh) frequent_patterns = {} get_fpgrowth(tree, min_support, [], frequent_patterns) ``` 4. **結果輸出**： - 將所有頻繁項目集及其支持度以指定格式輸出。 ```py # Task 3 / Result 1 / all frequent itemsets def weitefile_itemset(items,lengh,fname,minSupport): # fname = "T10I6001M" with open(options.path + '/Step3_task1_'+fname+'_'+str(minSupport)+'_result1.txt','w') as outfile: for item, support in sorted(items.items(), key=lambda item: item[1], reverse=True): temp = ','.join(list(item)) support = round(support/lengh*100,1) # take percentage % and round to the first decimal place outfile.write("{:.1f}".format(support)+'\t{'+temp+'}\n') ``` ### FP-Growth 演算法的改進與優勢（Differences/Improvements compared to Apriori） FP-Growth 比 Apriori 更高效的原因 - **無需生成候選項目集**：FP-Growth 無需生成候選項目集，而是使用 FP-Tree 直接在壓縮後的樹結構中進行模式挖掘，極大地減少了內存和計算資源的消耗。 - **條件模式基的遞歸挖掘**：FP-Growth 的條件模式基（Conditional Pattern Base）和遞歸結構允許它針對頻繁模式進行逐層分解和處理，從而避免了重複計算。 - **空間利用率更高**：FP-Tree 通過壓縮交易資料來節省內存，特別是當多筆交易具有相同或部分重疊的項目時，FP-Tree 可以顯著降低空間複雜度。 - **效率提升**：FP-Growth 避免了在每一階段產生大量候選項目集的計算開銷，特別是在數據量大的情況下效率更高。 >FP-Growth 在處理大數據集方面的擴展性良好，但依然會受制於記憶體資源。以下是演算法在數據集規模擴展時的效能變化： - **計算時間隨數據量的變化**：FP-Growth 的運行時間隨數據量增加而增加，尤其在頻繁項目較多的情況下。但相較於 Apriori，FP-Growth 的時間增長更為平緩，對大數據的擴展性更優。 - **支持的最大數據集大小**：在典型的 16 GB 記憶體環境下，FP-Growth 可處理數百萬條交易記錄的大型數據集。當數據集規模繼續擴大時，可以考慮分散式 FP-Growth 以處理更大的數據集。 >FP-Growth 透過 FP-Tree 壓縮數據並進行高效的遞歸模式挖掘，適合於大規模數據分析。 ### Sept III. Computation time and Anaylsis : 以下針對不同的資料集與不同的最小支持度來做一些分析，其中冒號後方的數字為Frequent Itemset的總數量。 - In datasetA with minimum support 0.3% : 7776 <img src=https://hackmd.io/_uploads/HklRVkkWyl.png/> - In datasetA with minimum support 0.6% : 909 <img src=https://hackmd.io/_uploads/S1-xHk1Wyg.png/> - In datasetA with minimum support 0.9% : 414 <img src=https://hackmd.io/_uploads/SyIbBykZ1e.png/> - In datasetB with minimum support 0.2% : 5820 <img src=https://hackmd.io/_uploads/HJ3GHk1Wyx.png/> - In datasetB with minimum support 0.4% : 1445 <img src=https://hackmd.io/_uploads/B10QB11W1g.png/> - In datasetB with minimum support 0.6% : 667 <img src=https://hackmd.io/_uploads/B1_BHkk-Jx.png/> - In datasetC with minimum support 0.5% : 991 <img src=https://hackmd.io/_uploads/ry1wH1y-yx.png/> - In datasetC with minimum support 1.0% : 356 <img src=https://hackmd.io/_uploads/rye_rk1Wyl.png/> - In datasetC with minimum support 1.5% : 247 <img src=https://hackmd.io/_uploads/r1ZFSkyWJe.png/> 以上的時間計算皆不包含寫入檔案所花費得時間。從這些結果中可以看出，**FP-Growth 演算法（Step 3）在處理大型數據集時相比 Apriori（Step 2）所花費的時間大幅減少，但跟Apriori算法類似，如果minimum Support越小，所需要的計算時間也越多。** #### 分析結果 - **Dataset A**：FP-Growth 在 `datasetA` 中加速達 96.3% 到 98.7% 不等，尤其在較低支持度（0.3%）的情況下達到最大加速效果。 - **Dataset B**：FP-Growth 在 `datasetB` 中的加速比也顯著，在 0.2% 的支持度下加速達 99%，隨著支持度增加，加速比稍微下降，但仍在 97% 以上。 - **Dataset C**：對於更大的數據集 `datasetC`，FP-Growth 的加速效果依然穩定，保持在 96.4% 到 97.7% 之間。 - **隨著數據集規模擴大**，FP-Growth 依然能夠保持高效的計算速度。從 `datasetA` 到更大的 `datasetB` 和 `datasetC`，FP-Growth 的加速效果仍穩定在 96%-99% 之間。 - **支持的最大數據集大小**：FP-Growth 在處理百萬級以上的交易數據時表現良好，但在數據集更大時可能需要更大的內存空間，特別是當頻繁項目集增多時。若數據集進一步增大，可以考慮分散式 FP-Growth 演算法以支持更大規模的數據集。