# Data Mining Final Project

###### tags: `Data Mining` `NSYSU` `CSE223`

## [Data Set](https://tinyurl.com/4ef6py5t), [Colab Link](https://colab.research.google.com/drive/1tnOIBmqm__Mcohdb410WjZEVm3Hx1toY)

## [Reference in Kaggle](https://www.kaggle.com/hassan06/nslkdd)

## Team Members

* B083040008 黃宸洋 @kevin-NSYSU
* B083040009 薛致謙
* B083040012 陳柏翰 @B083040012
* B083040013 梁宸熏
* B084020005 蔡昀燁 @B084020005

## ToDo

- [X] Data Preprocessing
- [X] Classification
- [X] Clustering
- [X] Accuracy
- [X] Result Analysis

## Data Preprocessing

* Before preprocessing: based on our experience in the midterm project, we should first compute the **correlation between each feature and the label**, so that we can decide to drop/keep/weight each feature column.
    * Because 17 of the label types appear only in the test data, <font color="red">**we will do the [main correlation](#Data-Correlation-for-unknown-Type) after one KNN pass**</font> (to find the relationship between the 5 known types and the 17 unknown types).
    * Before the first KNN, we apply simple normalization and standardization to the train_data and test_data.

### Basic Preprocessing

* Continuous data:
    * min-max normalization
* Discrete data:
    * one-hot encoding

## Classification: KNN

* The basic KNN model is based on the midterm project (modified by @kevin-NSYSU).
* What we still need to do for KNN in the final project: because 17 label types appear only in the test data, **the main objective of KNN is to classify 6 label types** (normal, DoS, Probe, R2L, U2R, **and the unknown type**). In detail, we classify a test sample as the unknown type **if its Euclidean distance (euc_dis) to the five known labels is too far (a threshold is needed to decide this)**. After the KNN pass we know the indices of the unknown-type test data (saved in unknown.txt), and the following clustering step deals only with these unknown samples.

### Determine the unknown type threshold

* Find the distance that decides whether a test sample is a known or an unknown type.
* **The higher the threshold, the higher the accuracy with respect to the train labels, but the lower the chance of detecting the unknown type.**
* Schematic diagram: ![](https://i.imgur.com/lwqUGuu.png =450x300)
* The best choice of threshold depends on the final accuracy; more detail in [Result Analysis](#Result-Analysis). A minimal sketch of the thresholded KNN appears below.
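To make the preprocessing and the threshold idea concrete, here is a minimal sketch, not the project code: `min_max_normalize`, `knn_with_unknown` and the `UNKNOWN` class id are our own names, and we assume the features are already numeric (one-hot encoded).

```python
import numpy as np

UNKNOWN = 5  # hypothetical id for the sixth, "unknown" class (known types are 0..4)

def min_max_normalize(train, test):
    """Scale every column into [0, 1] using the ranges of the train data."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero on constant columns
    return (train - lo) / span, (test - lo) / span

def knn_with_unknown(x_train, y_train, x_test, threshold=1.5):
    """1-NN that falls back to UNKNOWN when even the nearest neighbor is too far."""
    pred = np.empty(len(x_test), dtype=int)
    for i, x in enumerate(x_test):
        dist = np.linalg.norm(x_train - x, axis=1)  # euc_dis to every train point
        nearest = dist.argmin()
        # farther than the threshold from every known sample --> unknown type
        pred[i] = UNKNOWN if dist[nearest] > threshold else y_train[nearest]
    return pred
```

Raising `threshold` keeps more samples with their nearest known label (higher accuracy on the known types), while lowering it sends more samples to the unknown bucket, matching the trade-off above.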
### Result of KNN (before preprocessing)

#### Threshold=1
![](https://i.imgur.com/yaEC35K.png =400x150)

#### Threshold=1.5
![](https://i.imgur.com/WGYRrsd.png =400x150)

#### Threshold=2
![](https://i.imgur.com/eeExJRu.png =400x150)

* We choose the KNN result with **Threshold=1.5** to improve the data correlation for the unknown type.

## Data Correlation for unknown Type

### [Colab Link](https://colab.research.google.com/drive/1rhNWCyndiqDVVw2_szl9AmWMOUSLMlmJ?usp=sharing) for code and full result

* In this section, we want to **find the feature columns that can identify the "unknown type"**.
* Reference for the correlation: test_data labeled by the 1st KNN pass (Threshold=1.5, accuracy=69.104% --> **not a 100% hit!**).
    * This test label contains six types (the five in train_label + unknown), and we assume it can tell us which feature columns are meaningful for separating the known and unknown types.
    * Once again, this label is only a 69% hit, so it might not point us in the right direction.
* Color of each type:
    * normal: <font color="blue">**blue**</font>
    * Dos: <font color="green">**green**</font>
    * Probe: **black**
    * R2L: <font color="#f00">**red**</font>
    * U2R: <font color="yellow">**yellow**</font>
    * **unknown:** <font color="purple">**purple**</font>

![](https://i.imgur.com/BGBEMNV.png =300x200) ![](https://i.imgur.com/Dp5syv4.png =300x200) ![](https://i.imgur.com/YqgAc8N.png =300x200) ![](https://i.imgur.com/24PACtZ.png =300x200) ![](https://i.imgur.com/uA5M6JW.png =300x200) ![](https://i.imgur.com/XDQzAfK.png =300x200) ![](https://i.imgur.com/ICz6CPi.png =300x200) ![](https://i.imgur.com/NIhT0nb.png =300x200) ![](https://i.imgur.com/RyRoRxy.png =400x200)

## Clustering: K-Means/K-Means++

### Procedure (K-Means)

1. Initialize centroids by either:
    a. randomly choosing K centroids on the scatter plot, or
    b. [K-Means++](#K-means).
2. Calculate the distance (euc_dis) of each point to the centroids and assign each point to the nearest cluster.
3. Update each cluster's centroid from its member points (instead of the raw average, our model below picks the member point with the smallest total distance to the rest of its cluster).
4. **Repeat steps 2 and 3** until the clusters no longer change.

### K-Means Model

```python=
##############################
# Source Code by Kevin huang #
##############################
import numpy as np
from random import sample

def euc_dis(a, b):
    # Euclidean distance between two feature vectors
    return np.linalg.norm(a - b)

class My_KMEANS():
    def fit(self, x_train, y_train):
        # x_train: test_numerical.csv
        # y_train: unknown.txt (indices of the unknown-type test data)
        self.y_train = y_train
        self.x_train = x_train[y_train, :]  # keep only the unknown-type test data

    def initial(self, k=17):
        # Randomly choose k centroids (row indices into x_train)
        self.centroid = sorted(sample(range(len(self.y_train)), k))

    def iteration(self):
        # assign/reassign each point to the nearest cluster
        self.clustering = []  # save each point's cluster id
        for index_x in range(len(self.x_train)):
            if index_x in self.centroid:  # index_x is a centroid point
                self.clustering.append(self.centroid.index(index_x))
            else:
                # euc_dis of the point to each centroid
                distoCent = [euc_dis(self.x_train[index_x], self.x_train[index_c])
                             for index_c in self.centroid]
                # nearest centroid wins
                self.clustering.append(distoCent.index(min(distoCent)))

    def assign(self, k=17):
        # reassign the centroid of each cluster: the member point with the
        # smallest total distance to the rest of its cluster becomes the centroid
        self.next_centroid = []  # save the reassigned centroids
        self.clusters = [[] for _ in range(k)]  # data points of each cluster
        for index_x in range(len(self.x_train)):
            self.clusters[self.clustering[index_x]].append(index_x)
        for index_k in range(k):
            # an empty cluster keeps its old centroid
            best_index, best_dis = self.centroid[index_k], float('inf')
            for cen_can in self.clusters[index_k]:  # centroid candidate
                # total distance of the candidate to every point in the kth cluster
                total = sum(euc_dis(self.x_train[cen_can], self.x_train[pot])
                            for pot in self.clusters[index_k])
                if total < best_dis:
                    best_index, best_dis = cen_can, total
            self.next_centroid.append(best_index)  # new centroid of the kth cluster

    def check(self):
        # True if the centroids did not change in this round
        return self.centroid == self.next_centroid

    def process(self, unknown_test_data, unknown_test_label):
        # read test_numerical.csv, unknown.txt
        self.fit(unknown_test_data, unknown_test_label)
        self.initial()  # randomly choose k centroids
        for itr in range(1000000):  # upper bound on the number of iterations
            self.iteration()
            self.assign()
            if self.check():
                break  # stop if no change
            self.centroid = self.next_centroid
```

### K-means++

* K-Means++ is an **improvement on the initial centroid assignment** in the K-Means process.
* **Goal: push the centroids as far away from one another as possible.**
* Procedure for the initial centroid assignment (a minimal sketch follows this list):
    1. First, randomly choose only one centroid.
    2. For each data point, compute its distance to the nearest previously chosen centroid.
    3. Use **roulette wheel selection** to pick the next centroid.
        * Detail: draw a random number, then subtract each point's distance from it until the number is <= 0, and select the corresponding data point.
        * **If we directly chose the farthest point, we might pick an outlier of the data space; roulette wheel selection adds randomness to the next-centroid assignment (it favors far points while avoiding outliers).**
    4. Repeat steps 2 and 3 until all k centroids are assigned.
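Here is a minimal sketch of that initialization, assuming the same `x_train` array and Euclidean distance as the model above; `kmeanspp_init` is our own name, and we keep the plain (not squared) distance weighting that the procedure describes.

```python
import numpy as np
from random import randrange, uniform

def kmeanspp_init(x_train, k=17):
    """Roulette-wheel centroid initialization in the K-Means++ style."""
    centroids = [randrange(len(x_train))]  # step 1: one random centroid
    while len(centroids) < k:              # step 4: repeat until k are chosen
        # step 2: distance of every point to its nearest chosen centroid
        dist = np.array([min(np.linalg.norm(x - x_train[c]) for c in centroids)
                         for x in x_train])
        # step 3: spin the wheel -- subtract distances until the draw falls <= 0
        draw = uniform(0, dist.sum())
        for index, d in enumerate(dist):
            draw -= d
            if draw <= 0:
                centroids.append(index)
                break
    return sorted(centroids)
```

Already-chosen centroids have distance 0 and so get no slice of the wheel, while far-away points get proportionally larger slices, which is exactly the "favor far points but avoid always taking the outlier" behavior described above.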
## After the first try...

* The accuracy of the first attempt: ![](https://i.imgur.com/PsTh9iU.png =400x)
* Data reference:
    * preprocessing by simple normalization
    * KNN with k=1, threshold=1.5
    * K-Means with k=17
* Details:
    * The initial centroids in K-Means were assigned randomly; we can try **uniformly choosing** the centroids among the unknown test data instead.
    * After the K-Means process we get a cluster for every "unknown type" test sample, but we still don't know the specific label of each cluster; three ways to solve this problem are discussed in the next topic.
* Need to improve:
    * Preprocessing with the correlation result. // done, detail in analysis
    * Fix the KNN model to produce the results of multiple K in one pass. // done
    * Complete the voting method for assigning labels. // done
    * Try a different way to assign the initial centroids. // K-Means++

## Mapping Clusters to Labels

1. Brute-force method: try all <font color="red">**17!**</font> permutations that give each cluster a distinct label, and keep the best permutation by accuracy. --> Would need about 40,000,000 days even with numba, so <font color="red">**NEGATIVE**</font>.
2. Voting method: as in KNN voting, count the occurrences of each label (from test_data_label) within each cluster and assign the label with the highest occurrence (see the sketch after this list).
    * --> Leads to the accuracy (73.4%) above.
    * What we need to worry about:
        * Different clusters might be assigned the same label (in fact, **only two labels** ('mscan', 'apache2') were assigned across all 17 clusters in this attempt). To solve it, we should find a way to fix the label assignment.
        * **It can reach a higher accuracy, but it can't show the actual effect of K-Means.**
3. Similar to method 2, but reversed: **for each label**, find the cluster in which that label has the highest occurrence and assign the label to that cluster.
    * **This better shows the result of K-Means because each cluster's label will be different.**
    * Used in [Result Analysis](#Result-Analysis).
4. Calculate the mean of each cluster and of each unknown label, then assign the label with the minimum distance between means.
    * Due to the K-Means result, the means of the clusters are too close to one another --> failed.
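A minimal sketch of the two voting variants; `clustering` (the cluster id of each unknown-type test sample) and `true_labels` are assumed inputs, and the function names are ours, not the project code.

```python
from collections import Counter

def vote_per_cluster(clustering, true_labels, k=17):
    """Method 2: each cluster takes its most frequent true label,
    so different clusters may end up with the same label."""
    mapping = {}
    for c in range(k):
        votes = Counter(lab for cid, lab in zip(clustering, true_labels) if cid == c)
        if votes:  # skip empty clusters
            mapping[c] = votes.most_common(1)[0][0]
    return mapping

def vote_per_label(clustering, true_labels, k=17):
    """Method 3: each label goes to the cluster where it occurs most often.
    In this simple sketch, if two labels pick the same cluster the later one wins."""
    mapping = {}
    for lab in set(true_labels):
        counts = Counter(cid for cid, l in zip(clustering, true_labels) if l == lab)
        mapping[counts.most_common(1)[0][0]] = lab
    return mapping
```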
## Result Analysis

### Full data set and detailed result [here](https://drive.google.com/drive/folders/15OGf8qN5EiXgFVPYbD-9K6f6_7IpzLlr?usp=sharing).

* The preprocessing and the choice of threshold affect the accuracy after KNN. Because K-Means only clusters the remaining 17 types, test data that actually belongs to the first five types but is judged unknown by KNN can never be assigned the correct label after K-Means.
    --> The upper bound of the accuracy is therefore (accuracy + Not_In_Unknown) after KNN, and the hit rate within Not_In_Unknown depends on how well K-Means performs.
    --> We must balance KNN (including the preprocessing and the choice of threshold) against K-Means to find the best accuracy.

### Meaning of numbers in Table

* "Accuracy of KNN" and "Accuracy of Unknown" represent the impact of each topic on classifying the six labels.
    * Unknown-type data may still be hit after clustering, so we also compute the ratio of data whose classification is unknown and whose actual label is one of the last 17 types; this is what "Accuracy of Unknown" (maybe hit) means.
    * $Accuracy_{KNN}=\frac{Hit_{Known}}{TestData}$ , $Accuracy_{Unknown}=\frac{Hit_{Unknown}}{TestData}$
    * The sum of these two is the best possible final accuracy after KNN.
* "Accuracy of K-Means" represents the impact of each topic on clustering.
    * $Accuracy_{KMeans}=\frac{Hit_{Unknown}}{UnknownData}$
    * Unknown data here is the data classified as the unknown type (its actual label may be of any type).
* "Final Accuracy" represents the overall influence of each topic.
    * $Accuracy_{Final}=\frac{Hit_{All}}{TestData}$
* "All Unknown Miss" represents the impact of each topic on missing the truly unknown data (classification is unknown and the actual label is one of the last 17 types).
    * $Accuracy_{AllUnMiss}=\frac{Miss_{Unknown}}{TestData}$
* Considering the random centroid assignment of K-Means(++), the K-Means and Final results will be ~~the average of five tests~~ --> **we should pick the max value among the tests.** A sketch of how these columns can be computed follows.
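The following sketch shows one way these columns could be computed; the numpy string arrays `knn_pred` (six-way KNN output with an `'unknown'` sentinel), `final_pred` (after K-Means has replaced those entries) and `known_types` (the five train-label types) are our assumptions, not the project code.

```python
import numpy as np

def table_metrics(knn_pred, final_pred, true_labels, known_types):
    """Compute the five table columns from the predictions and the ground truth."""
    n = len(true_labels)
    is_unknown = knn_pred == 'unknown'                   # classified as unknown
    truly_unknown = ~np.isin(true_labels, known_types)   # actual label is in the last 17 types

    acc_knn = np.sum(~is_unknown & (knn_pred == true_labels)) / n
    acc_unknown = np.sum(is_unknown & truly_unknown) / n  # "maybe hit" after clustering
    acc_kmeans = (np.sum(is_unknown & (final_pred == true_labels))
                  / max(np.sum(is_unknown), 1))           # denominator: unknown data only
    acc_final = np.sum(final_pred == true_labels) / n
    all_un_miss = np.sum(is_unknown & truly_unknown & (final_pred != true_labels)) / n
    return acc_knn, acc_unknown, acc_kmeans, acc_final, all_un_miss
```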
### How preprocessing affects the result

* The types of preprocessing:
    1. Simple normalization and standardization.
    2. Based on type 1, drop the columns 'land', 'wrong_fragment', 'su_attempted' and 'num_outbound_cmds'.
    ![](https://i.imgur.com/uA5M6JW.png =200x)![](https://i.imgur.com/XDQzAfK.png =200x)![](https://i.imgur.com/f8DXvzb.png =200x)![](https://i.imgur.com/TGoxxCs.png =200x)
    3. Based on type 2, weight the columns 'duration', 'src_bytes', 'dst_bytes', 'logged_in', 'dst_host_srv_count' by 1.5.
    ![](https://i.imgur.com/NiZ8qiA.png =200x)![](https://i.imgur.com/o4gQV6Z.png =200x)![](https://i.imgur.com/3lfnmSq.png =200x)![](https://i.imgur.com/9CTOeBP.png =200x)![](https://i.imgur.com/RyRoRxy.png =200x)

| Type of Preprocessing | Accuracy of KNN | Accuracy of Unknown | Accuracy of K-Means | Final Accuracy | All Unknown Miss |
| --------------------- | --------------- | ------------------- | ------------------- | -------------- | ---------------- |
| Type1 | 69.10% | 10.81% | 18.26% | 72.39% | 7.51% |
| Type2 | 69.10% | 10.81% | 17.76% | 72.29% | 7.60% |
| Type3 | 69.11% | 11.51% | 17.25% | 72.49% | 8.13% |

* Data reference:
    * KNN with k=1
    * Threshold=1.5
    * K-Means++ with k=17
    * **Voting method with no cluster merge (Method 3), in order to see the clustering result**

### How the threshold affects the result

* The relationship between the threshold and the KNN result is explained [here](#Determine-the-unknown-type-threshold).
* We choose thresholds 1, 1.5 and 2 for the analysis.

| Threshold | Accuracy of KNN | Accuracy of Unknown | Accuracy of K-Means | Final Accuracy | All Unknown Miss |
| --------- | --------------- | ------------------- | ------------------- | -------------- | ---------------- |
| 1 | 68.08% | 11.70% | 9.37% | 70.10% | 9.68% |
| 1.5 | 69.10% | 10.81% | 18.26% | 72.39% | 7.51% |
| 2 | 69.75% | 9.06% | 12.33% | 71.37% | 7.43% |

* Data reference:
    * Type1 preprocessing
    * KNN with k=1
    * K-Means++ with k=17
    * Voting method with no cluster merge (Method 3)

### How centroid assignment affects the result

* We compare the results of K-Means and K-Means++.

| Assignment | Accuracy of K-Means | Final Accuracy | All Unknown Miss |
| ---------- | ------------------- | -------------- | ---------------- |
| K-Means | 8.32% | 70.57% | 9.32% |
| K-Means++ | 18.26% | 72.39% | 7.51% |

* Data reference:
    * Type1 preprocessing
    * KNN with k=1
    * Threshold=1.5
    * K-Means/K-Means++ with k=17
    * Voting method with no cluster merge (Method 3)

## Misc Notes

* We found that with multiple k, the increased number of votes makes it easier to exceed the threshold, so more data gets recognized as unknown; without adjusting the threshold, the lower k is, the better.
* During the analysis we found that once the data is preprocessed / K-Means++ is used, Method 2 (the label-assignment method that originally had the higher accuracy) no longer outperforms Method 3, which shows that the K-Means result has indeed improved.
* With Type3 preprocessing and Method 3, the final accuracy reaches 73.28%, already close to the 73.43% first obtained with Method 2.
* Deep learning: LSTM, SVM, ...
* Other clustering methods: K-Means^2, Fuzzy C-Means, hierarchical clustering, ...
* Consider the time complexity!!