# Data Mining Final Project

###### tags: `Data Mining` `NSYSU` `CSE223`

## [Data Set](https://tinyurl.com/4ef6py5t), [Colab Link](https://colab.research.google.com/drive/1tnOIBmqm__Mcohdb410WjZEVm3Hx1toY)

## [Reference in Kaggle](https://www.kaggle.com/hassan06/nslkdd)

## Team Members

* B083040008 黃宸洋 @kevin-NSYSU
* B083040009 薛致謙
* B083040012 陳柏翰 @B083040012
* B083040013 梁宸熏
* B084020005 蔡昀燁 @B084020005

## ToDo

- [X] Data Preprocessing
- [X] Classification
- [X] Clustering
- [X] Accuracy
- [X] Result Analysis

## Data Preprocessing

* Before preprocessing: based on our experience in the midterm project, we should first compute the **correlation between each feature and the label**, so that we can decide to drop/keep/weight each feature column.
    * Because 17 of the label types appear only in the test data, <font color="red">**we will do the [main correlation](#Data-Correlation-for-unknown-Type) after one KNN pass**</font> (to find the relationship between the 5 known types and the 17 unknown types).
    * Before the first KNN, we apply simple normalization and standardization to the train_data and test_data.

### Basic Preprocessing

* Continuous data:
    * min-max normalization
* Discrete data:
    * one-hot encoding

## Classification: KNN

* The basic KNN model is based on the midterm project (modified by @kevin-NSYSU).
* What we still need to do for KNN in the final project: because 17 label types appear only in the test data, **the main objective of KNN is to classify 6 label types** (normal, DoS, Probe, R2L, U2R, **and the unknown type**). In detail, we classify a test sample as the unknown type **if its Euclidean distance (euc_dis) to the five known labels is too far (a threshold is needed to decide this)**. After the KNN pass we know the indices of the unknown-type test data (saved in unknown.txt), and the following clustering step deals only with these unknown samples.

### Determine the unknown type threshold

* Find the distance that decides whether a test sample is a known or an unknown type.
* **The higher the threshold, the higher the accuracy with respect to the train labels, but the lower the chance of detecting the unknown type.**
* Schematic diagram: ![](https://i.imgur.com/lwqUGuu.png =450x300)
* The best choice of threshold depends on the final accuracy; more detail in [Result Analysis](#Result-Analysis). A minimal sketch of the thresholded KNN appears below.
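To make the preprocessing and the threshold idea concrete, here is a minimal sketch, not the project code: `min_max_normalize`, `knn_with_unknown` and the `UNKNOWN` class id are our own names, and we assume the features are already numeric (one-hot encoded).

```python
import numpy as np

UNKNOWN = 5  # hypothetical id for the sixth, "unknown" class (known types are 0..4)

def min_max_normalize(train, test):
    """Scale every column into [0, 1] using the ranges of the train data."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero on constant columns
    return (train - lo) / span, (test - lo) / span

def knn_with_unknown(x_train, y_train, x_test, threshold=1.5):
    """1-NN that falls back to UNKNOWN when even the nearest neighbor is too far."""
    pred = np.empty(len(x_test), dtype=int)
    for i, x in enumerate(x_test):
        dist = np.linalg.norm(x_train - x, axis=1)  # euc_dis to every train point
        nearest = dist.argmin()
        # farther than the threshold from every known sample --> unknown type
        pred[i] = UNKNOWN if dist[nearest] > threshold else y_train[nearest]
    return pred
```

Raising `threshold` keeps more samples with their nearest known label (higher accuracy on the known types), while lowering it sends more samples to the unknown bucket, matching the trade-off above.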
### Result of KNN (before preprocessing)

#### Threshold=1
![](https://i.imgur.com/yaEC35K.png =400x150)

#### Threshold=1.5
![](https://i.imgur.com/WGYRrsd.png =400x150)

#### Threshold=2
![](https://i.imgur.com/eeExJRu.png =400x150)

* We choose the KNN result with **Threshold=1.5** to improve the data correlation for the unknown type.

## Data Correlation for unknown Type

### [Colab Link](https://colab.research.google.com/drive/1rhNWCyndiqDVVw2_szl9AmWMOUSLMlmJ?usp=sharing) for code and full result

* In this section, we want to **find the feature columns that can identify the "unknown type"**.
* Reference for the correlation: test_data labeled by the 1st KNN pass (Threshold=1.5, accuracy=69.104% --> **not a 100% hit!**).
    * This test label contains six types (the five in train_label + unknown), and we assume it can tell us which feature columns are meaningful for separating the known and unknown types.
    * Once again, this label is only a 69% hit, so it might not point us in the right direction.
* Color of each type:
    * normal: <font color="blue">**blue**</font>
    * Dos: <font color="green">**green**</font>
    * Probe: **black**
    * R2L: <font color="#f00">**red**</font>
    * U2R: <font color="yellow">**yellow**</font>
    * **unknown:** <font color="purple">**purple**</font>

![](https://i.imgur.com/BGBEMNV.png =300x200) ![](https://i.imgur.com/Dp5syv4.png =300x200) ![](https://i.imgur.com/YqgAc8N.png =300x200) ![](https://i.imgur.com/24PACtZ.png =300x200) ![](https://i.imgur.com/uA5M6JW.png =300x200) ![](https://i.imgur.com/XDQzAfK.png =300x200) ![](https://i.imgur.com/ICz6CPi.png =300x200) ![](https://i.imgur.com/NIhT0nb.png =300x200) ![](https://i.imgur.com/RyRoRxy.png =400x200)

## Clustering: K-Means/K-Means++

### Procedure (K-Means)

1. Initialize centroids by either:
    a. randomly choosing K centroids on the scatter plot, or
    b. [K-Means++](#K-means).
2. Calculate the distance (euc_dis) of each point to the centroids and assign each point to the nearest cluster.
3. Update each cluster's centroid from its member points (instead of the raw average, our model below picks the member point with the smallest total distance to the rest of its cluster).
4. **Repeat steps 2 and 3** until the clusters no longer change.

### K-Means Model

```python=
##############################
# Source Code by Kevin huang #
##############################
import numpy as np
from random import sample

def euc_dis(a, b):
    # Euclidean distance between two feature vectors
    return np.linalg.norm(a - b)

class My_KMEANS():
    def fit(self, x_train, y_train):
        # x_train: test_numerical.csv
        # y_train: unknown.txt (indices of the unknown-type test data)
        self.y_train = y_train
        self.x_train = x_train[y_train, :]  # keep only the unknown-type test data

    def initial(self, k=17):
        # Randomly choose k centroids (row indices into x_train)
        self.centroid = sorted(sample(range(len(self.y_train)), k))

    def iteration(self):
        # assign/reassign each point to the nearest cluster
        self.clustering = []  # save each point's cluster id
        for index_x in range(len(self.x_train)):
            if index_x in self.centroid:  # index_x is a centroid point
                self.clustering.append(self.centroid.index(index_x))
            else:
                # euc_dis of the point to each centroid
                distoCent = [euc_dis(self.x_train[index_x], self.x_train[index_c])
                             for index_c in self.centroid]
                # nearest centroid wins
                self.clustering.append(distoCent.index(min(distoCent)))

    def assign(self, k=17):
        # reassign the centroid of each cluster: the member point with the
        # smallest total distance to the rest of its cluster becomes the centroid
        self.next_centroid = []  # save the reassigned centroids
        self.clusters = [[] for _ in range(k)]  # data points of each cluster
        for index_x in range(len(self.x_train)):
            self.clusters[self.clustering[index_x]].append(index_x)
        for index_k in range(k):
            # an empty cluster keeps its old centroid
            best_index, best_dis = self.centroid[index_k], float('inf')
            for cen_can in self.clusters[index_k]:  # centroid candidate
                # total distance of the candidate to every point in the kth cluster
                total = sum(euc_dis(self.x_train[cen_can], self.x_train[pot])
                            for pot in self.clusters[index_k])
                if total < best_dis:
                    best_index, best_dis = cen_can, total
            self.next_centroid.append(best_index)  # new centroid of the kth cluster

    def check(self):
        # True if the centroids did not change in this round
        return self.centroid == self.next_centroid

    def process(self, unknown_test_data, unknown_test_label):
        # read test_numerical.csv, unknown.txt
        self.fit(unknown_test_data, unknown_test_label)
        self.initial()  # randomly choose k centroids
        for itr in range(1000000):  # upper bound on the number of iterations
            self.iteration()
            self.assign()
            if self.check():
                break  # stop if no change
            self.centroid = self.next_centroid
```

### K-means++

* K-Means++ is an **improvement on the initial centroid assignment** in the K-Means process.
* **Goal: push the centroids as far away from one another as possible.**
* Procedure for the initial centroid assignment (a minimal sketch follows this list):
    1. First, randomly choose only one centroid.
    2. For each data point, compute its distance to the nearest previously chosen centroid.
    3. Use **roulette wheel selection** to pick the next centroid.
        * Detail: draw a random number, then subtract each point's distance from it until the number is <= 0, and select the corresponding data point.
        * **If we directly chose the farthest point, we might pick an outlier of the data space; roulette wheel selection adds randomness to the next-centroid assignment (it favors far points while avoiding outliers).**
    4. Repeat steps 2 and 3 until all k centroids are assigned.
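Here is a minimal sketch of that initialization, assuming the same `x_train` array and Euclidean distance as the model above; `kmeanspp_init` is our own name, and we keep the plain (not squared) distance weighting that the procedure describes.

```python
import numpy as np
from random import randrange, uniform

def kmeanspp_init(x_train, k=17):
    """Roulette-wheel centroid initialization in the K-Means++ style."""
    centroids = [randrange(len(x_train))]  # step 1: one random centroid
    while len(centroids) < k:              # step 4: repeat until k are chosen
        # step 2: distance of every point to its nearest chosen centroid
        dist = np.array([min(np.linalg.norm(x - x_train[c]) for c in centroids)
                         for x in x_train])
        # step 3: spin the wheel -- subtract distances until the draw falls <= 0
        draw = uniform(0, dist.sum())
        for index, d in enumerate(dist):
            draw -= d
            if draw <= 0:
                centroids.append(index)
                break
    return sorted(centroids)
```

Already-chosen centroids have distance 0 and so get no slice of the wheel, while far-away points get proportionally larger slices, which is exactly the "favor far points but avoid always taking the outlier" behavior described above.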
## After the first try...

* The accuracy of the first attempt: ![](https://i.imgur.com/PsTh9iU.png =400x)
* Data reference:
    * preprocessing by simple normalization
    * KNN with k=1, threshold=1.5
    * K-Means with k=17
* Details:
    * The initial centroids in K-Means were assigned randomly; we can try **uniformly choosing** the centroids among the unknown test data instead.
    * After the K-Means process we get a cluster for every "unknown type" test sample, but we still don't know the specific label of each cluster; three ways to solve this problem are discussed in the next topic.
* Need to improve:
    * Preprocessing with the correlation result. // done, detail in analysis
    * Fix the KNN model to produce the results of multiple K in one pass. // done
    * Complete the voting method for assigning labels. // done
    * Try a different way to assign the initial centroids. // K-Means++

## Mapping Clusters to Labels

1. Brute-force method: try all <font color="red">**17!**</font> permutations that give each cluster a distinct label, and keep the best permutation by accuracy. --> Would need about 40,000,000 days even with numba, so <font color="red">**NEGATIVE**</font>.
2. Voting method: as in KNN voting, count the occurrences of each label (from test_data_label) within each cluster and assign the label with the highest occurrence (see the sketch after this list).
    * --> Leads to the accuracy (73.4%) above.
    * What we need to worry about:
        * Different clusters might be assigned the same label (in fact, **only two labels** ('mscan', 'apache2') were assigned across all 17 clusters in this attempt). To solve it, we should find a way to fix the label assignment.
        * **It can reach a higher accuracy, but it can't show the actual effect of K-Means.**
3. Similar to method 2, but reversed: **for each label**, find the cluster in which that label has the highest occurrence and assign the label to that cluster.
    * **This better shows the result of K-Means because each cluster's label will be different.**
    * Used in [Result Analysis](#Result-Analysis).
4. Calculate the mean of each cluster and of each unknown label, then assign the label with the minimum distance between means.
    * Due to the K-Means result, the means of the clusters are too close to one another --> failed.
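A minimal sketch of the two voting variants; `clustering` (the cluster id of each unknown-type test sample) and `true_labels` are assumed inputs, and the function names are ours, not the project code.

```python
from collections import Counter

def vote_per_cluster(clustering, true_labels, k=17):
    """Method 2: each cluster takes its most frequent true label,
    so different clusters may end up with the same label."""
    mapping = {}
    for c in range(k):
        votes = Counter(lab for cid, lab in zip(clustering, true_labels) if cid == c)
        if votes:  # skip empty clusters
            mapping[c] = votes.most_common(1)[0][0]
    return mapping

def vote_per_label(clustering, true_labels, k=17):
    """Method 3: each label goes to the cluster where it occurs most often.
    In this simple sketch, if two labels pick the same cluster the later one wins."""
    mapping = {}
    for lab in set(true_labels):
        counts = Counter(cid for cid, l in zip(clustering, true_labels) if l == lab)
        mapping[counts.most_common(1)[0][0]] = lab
    return mapping
```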
## Result Analysis

### Full data set and detailed result [here](https://drive.google.com/drive/folders/15OGf8qN5EiXgFVPYbD-9K6f6_7IpzLlr?usp=sharing).

* The preprocessing and the choice of threshold affect the accuracy after KNN. Because K-Means only clusters the remaining 17 types, test data that actually belongs to the first five types but is judged unknown by KNN can never be assigned the correct label after K-Means.
    --> The upper bound of the accuracy is therefore (accuracy + Not_In_Unknown) after KNN, and the hit rate within Not_In_Unknown depends on how well K-Means performs.
    --> We must balance KNN (including the preprocessing and the choice of threshold) against K-Means to find the best accuracy.

### Meaning of numbers in Table

* "Accuracy of KNN" and "Accuracy of Unknown" represent the impact of each topic on classifying the six labels.
    * Unknown-type data may still be hit after clustering, so we also compute the ratio of data whose classification is unknown and whose actual label is one of the last 17 types; this is what "Accuracy of Unknown" (maybe hit) means.
    * $Accuracy_{KNN}=\frac{Hit_{Known}}{TestData}$ , $Accuracy_{Unknown}=\frac{Hit_{Unknown}}{TestData}$
    * The sum of these two is the best possible final accuracy after KNN.
* "Accuracy of K-Means" represents the impact of each topic on clustering.
    * $Accuracy_{KMeans}=\frac{Hit_{Unknown}}{UnknownData}$
    * Unknown data here is the data classified as the unknown type (its actual label may be of any type).
* "Final Accuracy" represents the overall influence of each topic.
    * $Accuracy_{Final}=\frac{Hit_{All}}{TestData}$
* "All Unknown Miss" represents the impact of each topic on missing the truly unknown data (classification is unknown and the actual label is one of the last 17 types).
    * $Accuracy_{AllUnMiss}=\frac{Miss_{Unknown}}{TestData}$
* Considering the random centroid assignment of K-Means(++), the K-Means and Final results will be ~~the average of five tests~~ --> **we should pick the max value among the tests.** A sketch of how these columns can be computed follows.
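The following sketch shows one way these columns could be computed; the numpy string arrays `knn_pred` (six-way KNN output with an `'unknown'` sentinel), `final_pred` (after K-Means has replaced those entries) and `known_types` (the five train-label types) are our assumptions, not the project code.

```python
import numpy as np

def table_metrics(knn_pred, final_pred, true_labels, known_types):
    """Compute the five table columns from the predictions and the ground truth."""
    n = len(true_labels)
    is_unknown = knn_pred == 'unknown'                   # classified as unknown
    truly_unknown = ~np.isin(true_labels, known_types)   # actual label is in the last 17 types

    acc_knn = np.sum(~is_unknown & (knn_pred == true_labels)) / n
    acc_unknown = np.sum(is_unknown & truly_unknown) / n  # "maybe hit" after clustering
    acc_kmeans = (np.sum(is_unknown & (final_pred == true_labels))
                  / max(np.sum(is_unknown), 1))           # denominator: unknown data only
    acc_final = np.sum(final_pred == true_labels) / n
    all_un_miss = np.sum(is_unknown & truly_unknown & (final_pred != true_labels)) / n
    return acc_knn, acc_unknown, acc_kmeans, acc_final, all_un_miss
```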
### How preprocessing affects the result

* The types of preprocessing:
    1. Simple normalization and standardization.
    2. Based on type 1, drop the columns 'land', 'wrong_fragment', 'su_attempted' and 'num_outbound_cmds'.
    ![](https://i.imgur.com/uA5M6JW.png =200x)![](https://i.imgur.com/XDQzAfK.png =200x)![](https://i.imgur.com/f8DXvzb.png =200x)![](https://i.imgur.com/TGoxxCs.png =200x)
    3. Based on type 2, weight the columns 'duration', 'src_bytes', 'dst_bytes', 'logged_in', 'dst_host_srv_count' by 1.5.
    ![](https://i.imgur.com/NiZ8qiA.png =200x)![](https://i.imgur.com/o4gQV6Z.png =200x)![](https://i.imgur.com/3lfnmSq.png =200x)![](https://i.imgur.com/9CTOeBP.png =200x)![](https://i.imgur.com/RyRoRxy.png =200x)

| Type of Preprocessing | Accuracy of KNN | Accuracy of Unknown | Accuracy of K-Means | Final Accuracy | All Unknown Miss |
| --------------------- | --------------- | ------------------- | ------------------- | -------------- | ---------------- |
| Type1 | 69.10% | 10.81% | 18.26% | 72.39% | 7.51% |
| Type2 | 69.10% | 10.81% | 17.76% | 72.29% | 7.60% |
| Type3 | 69.11% | 11.51% | 17.25% | 72.49% | 8.13% |

* Data reference:
    * KNN with k=1
    * Threshold=1.5
    * K-Means++ with k=17
    * **Voting method with no cluster merge (Method 3), in order to see the clustering result**

### How the threshold affects the result

* The relationship between the threshold and the KNN result is explained [here](#Determine-the-unknown-type-threshold).
* We choose thresholds 1, 1.5 and 2 for the analysis.

| Threshold | Accuracy of KNN | Accuracy of Unknown | Accuracy of K-Means | Final Accuracy | All Unknown Miss |
| --------- | --------------- | ------------------- | ------------------- | -------------- | ---------------- |
| 1 | 68.08% | 11.70% | 9.37% | 70.10% | 9.68% |
| 1.5 | 69.10% | 10.81% | 18.26% | 72.39% | 7.51% |
| 2 | 69.75% | 9.06% | 12.33% | 71.37% | 7.43% |

* Data reference:
    * Type1 preprocessing
    * KNN with k=1
    * K-Means++ with k=17
    * Voting method with no cluster merge (Method 3)

### How centroid assignment affects the result

* We compare the results of K-Means and K-Means++.

| Assignment | Accuracy of K-Means | Final Accuracy | All Unknown Miss |
| ---------- | ------------------- | -------------- | ---------------- |
| K-Means | 8.32% | 70.57% | 9.32% |
| K-Means++ | 18.26% | 72.39% | 7.51% |

* Data reference:
    * Type1 preprocessing
    * KNN with k=1
    * Threshold=1.5
    * K-Means/K-Means++ with k=17
    * Voting method with no cluster merge (Method 3)

## Misc Notes

* We found that with multiple k, the increased number of votes makes it easier to exceed the threshold, so more data gets recognized as unknown; without adjusting the threshold, the lower k is, the better.
* During the analysis we found that once the data is preprocessed / K-Means++ is used, Method 2 (the label-assignment method that originally had the higher accuracy) no longer outperforms Method 3, which shows that the K-Means result has indeed improved.
* With Type3 preprocessing and Method 3, the final accuracy reaches 73.28%, already close to the 73.43% first obtained with Method 2.
* Deep learning: LSTM, SVM, ...
* Other clustering methods: K-Means^2, Fuzzy C-Means, hierarchical clustering, ...
* Consider the time complexity!!