Before preprocessing: based on our experience from the midterm project, we first compute the correlation between each feature and the label, so we can choose to drop/keep/weight each feature column.
Because 17 label types appear only in the test data, we run the main correlation analysis after one KNN pass (to find the relationship between the 5 known types and the 17 unknown types).
Before the first KNN, we apply simple normalization and standardization to train_data and test_data.
Basic Preprocessing
Continuous Data:
max-min method
Discrete Data:
one-hot encoding
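The two preprocessing steps above can be sketched as follows (the helper names and the toy columns are illustrative, not the project's actual code):

```python
import numpy as np

def min_max_scale(train_col, test_col):
    """Max-min method: scale both splits with the training column's min/max."""
    lo, hi = train_col.min(), train_col.max()
    span = hi - lo if hi > lo else 1.0  # guard against constant columns
    return (train_col - lo) / span, (test_col - lo) / span

def one_hot(values, categories):
    """One-hot encode a discrete column given its category list."""
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        out[row, index[v]] = 1.0
    return out

train = np.array([1.0, 3.0, 5.0])
test = np.array([2.0, 6.0])
scaled_train, scaled_test = min_max_scale(train, test)
# one-hot a toy protocol column
proto = one_hot(["tcp", "udp", "tcp"], ["tcp", "udp", "icmp"])
```

Note that the test split is scaled with the training statistics, so scaled test values may fall outside [0, 1].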
Classification: KNN
The basic KNN model is based on the midterm project (modified by @kevin-NSYSU).
What we need to do in the final project for KNN:
Because 17 label types appear only in the test data, the main objective of KNN is to classify 6 label types (normal, Dos, Probe, R2L, U2R, and unknown).
In detail, we classify a test sample as unknown type if its euc_dis to all five known labels is too far (a threshold is needed to decide this).
After the KNN process, we know the indices of the unknown-type test data (saved in unknown.txt), and the following clustering method deals only with these unknown test data.
Determine the unknown type threshold
Find the distance that separates known-type from unknown-type test data.
The higher the threshold, the higher the accuracy on the known train labels, but the lower the chance of detecting the unknown type.
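A minimal sketch of this thresholded 1-NN rule, assuming a `euc_dis` helper and toy two-dimensional data (the real model and threshold live in the project's KNN code):

```python
import numpy as np

def euc_dis(a, b):
    return float(np.linalg.norm(a - b))

def knn_with_unknown(x_train, y_train, x_test, threshold=1.5):
    """1-NN that falls back to 'unknown' when the nearest neighbour is too far."""
    labels, unknown_index = [], []
    for i, x in enumerate(x_test):
        dists = [euc_dis(x, t) for t in x_train]
        nearest = int(np.argmin(dists))
        if dists[nearest] > threshold:   # too far from every known sample
            labels.append("unknown")
            unknown_index.append(i)      # these indices go to unknown.txt
        else:
            labels.append(y_train[nearest])
    return labels, unknown_index

x_train = np.array([[0.0, 0.0], [1.0, 0.0]])
y_train = ["normal", "Dos"]
x_test = np.array([[0.1, 0.0], [5.0, 5.0]])
labels, unk = knn_with_unknown(x_train, y_train, x_test)
```

Raising `threshold` makes the `unknown` branch fire less often, which is exactly the accuracy/detection trade-off described above.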
In this section, we want to find the feature columns that can determine the "unknown type".
Reference for the correlation: test_data labeled by the 1st KNN pass (threshold=1.5, accuracy=69.104% -> not a 100% hit!).
This test label contains six types (the five in train_label plus unknown), and we assume it can tell us which feature columns are meaningful for separating known from unknown types.
Again, this label is only 69% accurate, so it might not point us in the right direction.
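One possible way to compute such a ranking, assuming the features and the six-type pseudo-labels are held in NumPy arrays (the function name and toy data are illustrative):

```python
import numpy as np

def rank_features_by_unknown_corr(features, labels):
    """Pearson correlation of each feature column with an is-unknown indicator."""
    is_unknown = np.array([1.0 if lab == "unknown" else 0.0 for lab in labels])
    scores = []
    for col in range(features.shape[1]):
        column = features[:, col]
        if column.std() == 0:  # a constant column carries no signal
            scores.append(0.0)
        else:
            scores.append(abs(np.corrcoef(column, is_unknown)[0, 1]))
    # columns sorted from most to least correlated with "unknown"
    order = sorted(range(features.shape[1]), key=lambda c: -scores[c])
    return order, scores

features = np.array([[0.0, 1.0], [0.0, 2.0], [1.0, 1.5], [1.0, 2.5]])
labels = ["normal", "Dos", "unknown", "unknown"]
order, scores = rank_features_by_unknown_corr(features, labels)
```

Columns at the bottom of the ranking are candidates to drop or down-weight.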
1. Initialize centroids: (a) randomly choose K centroids in the data space, or (b) use K-Means++.
2. Calculate the distance (euc_dis) of each point to the centroids and assign it to the nearest cluster.
3. Update each cluster centroid by computing the average point of its clustering set.
4. Repeat steps 2 and 3 until there is no change in any clustering set.
K-Means Model
############################### Source Code by Kevin huang ###############################

```python
from random import sample
# euc_dis() is the Euclidean-distance helper shared with the KNN code

class My_KMEANS():
    def fit(self, x_train, y_train):
        self.x_train = x_train    # from test_numerical.csv
        self.y_train = y_train    # from unknown.txt (indices of unknown-type test data)
        self.x_train = self.x_train[self.y_train, :]  # keep only the unknown-type test data

    def initial(self, k=17):
        # Randomly choose k centroids
        self.centroid = sorted(sample(range(len(self.y_train)), k))

    def iteration(self):
        # assign/reassign each point to the nearest cluster
        self.clustering = []  # save each point's cluster index
        for index_x in range(len(self.x_train)):  # cluster each test sample
            if index_x in self.centroid:          # index_x is itself a centroid point
                self.clustering.append(self.centroid.index(index_x))
            else:
                distoCent = []  # euc_dis to each centroid
                for index_c in self.centroid:
                    distoCent.append(euc_dis(self.x_train[index_x], self.x_train[index_c]))
                # nearest centroid wins
                self.clustering.append(distoCent.index(min(distoCent)))

    def assign(self, k=17):
        # reassign the centroid of each cluster
        self.next_centroid = []  # the reassigned centroids
        self.clusters = []       # the data points of each cluster
        for index_k in range(k):  # initialize clusters
            self.clusters.append([])
        for index_x in range(len(self.x_train)):
            self.clusters[self.clustering[index_x]].append(index_x)
        # pick, per cluster, the point with the smallest total distance to the others
        for index_k in range(k):
            best_index = 0
            best_dis = 0
            for index_cen in range(len(self.clusters[index_k])):
                # total distance of this centroid candidate to the points in the kth cluster
                total = 0
                cen_can = self.clusters[index_k][index_cen]
                for index_pot in range(len(self.clusters[index_k])):
                    pot = self.clusters[index_k][index_pot]
                    total += euc_dis(self.x_train[cen_can], self.x_train[pot])
                if index_cen == 0 or total < best_dis:
                    best_index = cen_can
                    best_dis = total
            # best_index is the new centroid of the kth cluster
            self.next_centroid.append(best_index)

    def check(self):
        # True if the centroids of the clusters did not change
        return self.centroid == self.next_centroid

    def process(self, unknown_test_data, unknown_test_label):
        # read test_numerical.csv and unknown.txt beforehand
        self.fit(unknown_test_data, unknown_test_label)
        self.initial()  # randomly choose k centroids
        for itr in range(1000000):  # at most 1000000 loops
            self.iteration()
            self.assign()
            if self.check():  # stop if no change
                break
            self.previous_centroid = self.centroid
            self.centroid = self.next_centroid
```
K-means++
K-Means++ is an improvement to the initial centroid assignment in the K-Means process.
Goal: push the centroids as far away from one another as possible.
Procedure for the initial centroid assignment:
1. Randomly choose only one centroid.
2. For each data point, compute its distance to the nearest previously chosen centroid.
3. Use Roulette Wheel Selection to assign the new centroid.
Detail: draw a random number, then subtract each distance from it until the number is <= 0, and select the corresponding data point.
If we directly chose the farthest point, we might pick an outlier in the data space; Roulette Wheel Selection adds randomness to the choice of the next centroid (it favors far points while avoiding outliers).
Repeat steps 2 and 3 until all k centroids are assigned.
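The steps above can be sketched as follows (`kmeanspp_init` and the toy points are illustrative; distances are squared, as in the standard K-Means++ weighting):

```python
import random
import numpy as np

def kmeanspp_init(points, k, seed=0):
    """K-Means++ seeding: first centroid uniform, the rest by roulette wheel."""
    rng = random.Random(seed)
    centroids = [rng.randrange(len(points))]  # step 1: one random centroid
    while len(centroids) < k:
        # step 2: squared distance of every point to its nearest chosen centroid
        d2 = np.array([min(np.sum((p - points[c]) ** 2) for c in centroids)
                       for p in points])
        # step 3: roulette wheel -- draw r in [0, sum), subtract until r <= 0
        r = rng.uniform(0, d2.sum())
        for i, w in enumerate(d2):
            r -= w
            if r <= 0:
                centroids.append(i)
                break
    return centroids

points = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
seeds = kmeanspp_init(points, 2)
```

Because already-chosen centroids have distance 0, the wheel almost never re-selects them, and far-away points receive most of the probability mass.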
After the first try…
The accuracy of the first attempt:
Data reference
Preprocessing by simple normalization
KNN with k=1,threshold=1.5
K-Means with k=17
Details
The initial centroids in K-Means are randomly assigned; we can try choosing the centroids uniformly among the unknown test data instead.
After the K-Means process, we obtain the cluster of each 'unknown type' test sample, but we still don't know the specific label of each cluster, so here are three ways to solve this problem (discussed in the next topic).
Need to improve
Preprocessing with the correlation result. // done, details in the analysis
Fix the KNN model to produce results for multiple K in one pass. // done
Complete the voting method for assigning the labels. // done
Try a different way to assign the initial centroids. // K-Means++
Mapping Clusters to Labels
Brute-force method: try all 17! permutations of assigning each cluster a distinct label and keep the permutation with the best accuracy. -> Would need 40000000 days to complete even with numba, NEGATIVE.
Voting method: as in the KNN voting, count the occurrence of each label (from test_data_label) within a cluster and assign the label with the highest occurrence.
-> Leads to the accuracy (73.4%) above.
What we need to worry about:
Different clusters might be assigned the same label (in fact, only two labels ('mscan', 'apache2') were assigned across all 17 clusters in this attempt). To solve this, we should find a way to fix the label assignment.
This can reach higher accuracy, but can't show the actual effect of K-Means.
Similar to way 2, but reversed: for each label, find the cluster in which that label has the highest occurrence and assign the label to that cluster.
This better reflects the result of K-Means because each cluster's label will be different.
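Both voting directions can be sketched with `collections.Counter` (the cluster ids and labels below are toy values; in the project the reference labels come from test_data_label):

```python
from collections import Counter

def label_per_cluster(cluster_of, label_of):
    """Way 2: each cluster takes its most frequent reference label
    (different clusters may end up with the same label)."""
    votes = {}
    for c, lab in zip(cluster_of, label_of):
        votes.setdefault(c, Counter())[lab] += 1
    return {c: cnt.most_common(1)[0][0] for c, cnt in votes.items()}

def cluster_per_label(cluster_of, label_of):
    """Way 3: each label claims the cluster where it occurs most often
    (collisions between labels are still possible in this simple sketch)."""
    votes = {}
    for c, lab in zip(cluster_of, label_of):
        votes.setdefault(lab, Counter())[c] += 1
    return {cnt.most_common(1)[0][0]: lab for lab, cnt in votes.items()}

clusters = [0, 0, 1, 1, 1]
labels   = ["mscan", "mscan", "apache2", "apache2", "mscan"]
```

Way 2 optimizes per-cluster accuracy; way 3 keeps the cluster labels distinct at the cost of some accuracy.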
"Accuracy of KNN" and the "Accuracy of Unknown" represent the impact of each topic on classifying six label.
The unknown type data maybe hit after clustering,so we calculate the accuracy that classification is unknown and the actual label is last 17 type,this is what "Accuracy of Unknown"(Maybe Hit) mean.
,
So the sum of these two will be the best final accuracy after KNN.
"Accuracy of K-Means" represents the impact of each topic on clustering.
Unknown data here are the samples classified as unknown type (their actual labels may be of any type).
"Final Accuracy" represents the overall influence of each topic.
"All Unknown Miss" represents the impact of each topic on the miss of all unknown data(classification is unknown and the actual label is last 17 type).
Consider the randomly assignment of K-Means(++),the result of K-Means and Final will be the average of five tests.–>should pick the max value among tests.
How preprocessing affects the result
The type of preprocessing:
Type 1: simple normalization and standardization.
Type 2: based on type 1, drop the columns 'land', 'wrong_fragment', 'su_attempted', and 'num_outbound_cmds'.
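A sketch of the type-2 column dropping, assuming the feature matrix is a NumPy array with a parallel list of column names (the kept column names beyond the four dropped ones are illustrative):

```python
import numpy as np

DROP = ["land", "wrong_fragment", "su_attempted", "num_outbound_cmds"]

def drop_columns(matrix, columns, to_drop):
    """Remove the listed low-correlation columns from the feature matrix."""
    keep = [i for i, name in enumerate(columns) if name not in to_drop]
    return matrix[:, keep], [columns[i] for i in keep]

columns = ["duration", "land", "wrong_fragment", "src_bytes",
           "su_attempted", "num_outbound_cmds"]
matrix = np.arange(12.0).reshape(2, 6)  # toy 2-sample feature matrix
reduced, kept = drop_columns(matrix, columns, DROP)
```

The same `keep` index list must be applied to both train and test matrices so their columns stay aligned.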