#### From STL 1 Assignment 8
# 3

Top 5 categories:
- attack_cat (which will be represented by 1 [ Attack ] and 0 [ Benign ])
- dttl
- sttl
- ackdat
- tcprtt

For the coefficients, all are taken into account, as they are attributes in the top 5 ranking.
Why are these features significant?
- p-value : all of them are below 0.05 (the default threshold)
- F-value : far larger than the significance F
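A minimal sketch of how these significance values could be reproduced in Python with `statsmodels` (an assumption; the original regression output may have come from a spreadsheet tool, and the sample file name is hypothetical):
```python=
import pandas as pd
import statsmodels.api as sm

# Hypothetical file name; columns match the top 5 attributes above
df = pd.read_csv("unsw_sample.csv")
X = sm.add_constant(df[["dttl", "sttl", "ackdat", "tcprtt"]])
y = df["attack_cat"]  # 1 = Attack, 0 = Benign

model = sm.OLS(y, X).fit()
# summary() reports the F-statistic, its p-value (the "significance F"),
# and a per-coefficient p-value for each attribute
print(model.summary())
```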


Approximate activation threshold : 0.8
Accuracy obtained : 70%
The model is not within the expected range of 90% accuracy given the significance of the top 5 attributes.
This may be due to various factors (a sketch of applying the threshold follows the list):
- The model features, although ranked in the top 5, do not entirely explain the classification of Attack vs. Benign. They may be the most influential attributes in the model, but they cannot explain the whole classification.
- 100 samples may not be enough training data to learn the features well enough to classify a record as attack or benign.
- Some data may be under-represented, as the sttl samples do not contain a variety of values, i.e. some values (31) are missing and only certain values (254) appear.
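A minimal sketch of applying such an activation threshold to predicted probabilities and recomputing accuracy (a hypothetical illustration; `y_prob` and `y_true` are assumed to come from the fitted model and the sample labels):
```python=
import numpy as np

def apply_threshold(probabilities, threshold=0.8):
    """Map predicted probabilities to 1 (Attack) / 0 (Benign)."""
    return (np.asarray(probabilities) >= threshold).astype(int)

# y_pred = apply_threshold(y_prob, threshold=0.8)
# accuracy = (y_pred == y_true).mean()  # ~0.70 in our run
```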
### STL 1 Assignment 9
```python=
import pandas as pd

print('-'*30 , 'Start', '-'*30 )
# Collecting 50 phishing URLs randomly
data0 = pd.read_csv("phish.csv")
phishurl = data0.sample(n = 50, random_state = 33).copy()
phishurl = phishurl.reset_index(drop=True)
print("Phish", phishurl.shape)
print('-'*30 , 'Start', '-'*30 )
# Collecting 50 legitimate URLs randomly
data1 = pd.read_csv("benign.csv")
data1.columns = ['URLs']
legiurl = data1.sample(n = 50, random_state = 33).copy()
legiurl = legiurl.reset_index(drop=True)
print("Legit", legiurl.shape)
```
For the samples, the following seeds are used: `33, 98, 77, 64, 122, 799`.
The runs with the first two seeds used 50 samples each; the remaining runs used 100 samples each (see the sketch below).
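A minimal sketch of scripting the repeated sampling over these seeds, reusing `data0` and `data1` from above (an assumption; the runs may instead have been re-executed manually):
```python=
# (seed, sample size) pairs as described above
runs = [(33, 50), (98, 50), (77, 100), (64, 100), (122, 100), (799, 100)]

for seed, n in runs:
    phish_sample = data0.sample(n=n, random_state=seed).reset_index(drop=True)
    legit_sample = data1.sample(n=n, random_state=seed).reset_index(drop=True)
    print(f"seed={seed}: phish {phish_sample.shape}, legit {legit_sample.shape}")
```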
#### Decision Tree:
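A minimal sketch mirroring the other classifiers in this section, using `sklearn`'s `DecisionTreeClassifier` (an assumption; the hyperparameter `max_depth = 5` follows the tree snippet used later in this report):
```python=
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)
acc_train_tree = dt.score(X_train, y_train)
acc_test_tree = dt.score(X_test, y_test)
print("Decision Tree: Accuracy on training Data: {:.3f}".format(acc_train_tree))
print("Decision Tree: Accuracy on test Data: {:.3f}".format(acc_test_tree))
```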

#### MLP
To run a simple MLP prediction, we import `MLPClassifier` from `sklearn` and use the following snippet:
```python=
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train, y_train)
acc_train_mlp = clf.score(X_train, y_train)
acc_test_mlp = clf.score(X_test, y_test)
print("MLP: Accuracy on training Data: {:.3f}".format(acc_train_mlp))
print("MLP: Accuracy on test Data: {:.3f}".format(acc_test_mlp))
```
#### SVM
Similar to MLP, we import the classifier from the `sklearn` library and find the accuracy with the following snippet:
```python=
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
clf = SVC(kernel='linear') # Linear Kernel
#Train the model using the training sets
clf.fit(X_train, y_train)
#Predict the response for test dataset
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
print("SVM : Accuracy on training Data:", accuracy_score(y_train, y_train_pred))
print("SVM : Accuracy on test Data:", accuracy_score(y_test, y_test_pred))
```
#### Random Forest
Similarly, for Random Forest we import the classifier from the `sklearn` library and find the accuracy with the following snippet:
```python=
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Random Forest : Accuracy on training Data:", train_accuracy)
print("Random Forest : Accuracy on test Data:", test_accuracy)
```

# 2
New features:
1. `Port number`
2. `Google Index`
Reason for 1 is that the default ports for website connections (secure or otherwise) are 80 and 443. If the URL points to any other port, we assume the site is malicious and therefore a phishing website.
(https://www.ssl2buy.com/wiki/port-80-http-vs-port-443-https)
```python=
from urllib.parse import urlparse

def check_port_number(addr):
    # Ports 80 (HTTP) and 443 (HTTPS), or no explicit port, are treated as normal
    domain = urlparse(addr)
    if domain.port is None or domain.port == 80 or domain.port == 443:
        return 0 # Benign
    else:
        return 1 # Malicious
```
Reason for 2 is that if the page is not indexed by Google, it is more likely to be a phishing link. Getting a website indexed by Google is a troublesome process, and a phishing link would more likely be left unindexed, since it would probably be short-lived and not worth the effort.
```python=
import re
import requests
from bs4 import BeautifulSoup

# 20. Checking Google Index
def check_google_index(url):
    google = "https://www.google.com/search?q=site:" + url + "&hl=en"
    response = requests.get(google, cookies={"CONSENT": "YES+1"})
    soup = BeautifulSoup(response.content, "html.parser")
    not_indexed = re.compile("did not match any documents")
    if soup(text=not_indexed):
        return 1 # Not indexed: likely phishing
    else:
        return 0 # Indexed: likely benign
```


(https://www.shellhacks.com/indexed-by-google-pages-checker-on-python/)
We create a new dataset of 500 legitimate and 500 phishing URLs:
```python=
#Collecting 500 Phishing URLs randomly
data0 = pd.read_csv("phish.csv")
phishurl = data0.sample(n = 500, random_state = 279 ).copy()
phishurl = phishurl.reset_index(drop=True)
print("Phish", phishurl.shape)
print('-'*30 , 'Start', '-'*30 )
#Collecting 500 Legitimate URLs randomly
data1 = pd.read_csv("benign.csv")
data1.columns = ['URLs']
legiurl = data1.sample(n = 500, random_state = 279 ).copy()
legiurl = legiurl.reset_index(drop=True)
print("Legit", legiurl.shape)
```
### Attribute Ranking


We used KStar to rank the attributes (a rough Python analogue is sketched after the ranking).
The top 5 attributes are ranked as follows:
1. Domain
2. GoogleIndex
3. Have_At
4. Redirection
5. URL_Depth
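KStar itself is a Weka classifier; as a rough Python analogue (a swapped-in technique, not the method used for the ranking above), the features can be scored with mutual information:
```python=
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

data = pd.read_csv("urldata.csv")            # the feature dataset used later in this report
X = data.drop(columns=["Domain", "Label"])   # Domain is a raw string, Label is the target
y = data["Label"]

scores = pd.Series(mutual_info_classif(X, y), index=X.columns)
print(scores.sort_values(ascending=False).head(5))
```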
### Comparing to the original dataset

In comparison to the original feature set, adding the 2 additional features seems to generally increase the train and test accuracy of the models. The additional significant features appear to make the models more accurate at predicting from the samples.
### Metrics
In the context of binary classification, metrics like accuracy, error rates, precision, recall, and AUC (Area Under the ROC Curve) are commonly used to evaluate the performance of a model. Let's explore the meaning of each of these metrics:
##### Accuracy:
Accuracy is a measure of the overall correctness of predictions made by the model. It calculates the proportion of correct predictions (both true positives and true negatives) out of the total number of predictions (true positives, true negatives, false positives, and false negatives).
Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
A high accuracy indicates a good model performance, but it can be misleading in imbalanced datasets, where the number of samples in one class vastly outweighs the other.
##### Error Rates (Misclassification Rate):
The error rate is the complement of accuracy and represents the proportion of incorrect predictions made by the model.
Formula: Error Rate = (FP + FN) / (TP + TN + FP + FN) = 1 - Accuracy
Lower error rates indicate better model performance.
##### Precision (Positive Predictive Value):
Precision measures the accuracy of positive predictions made by the model, i.e., how many of the predicted positive instances are actually positive (true positives) compared to false positives.
Formula: Precision = TP / (TP + FP)
High precision indicates that the model is making fewer false positive predictions.
##### Recall (Sensitivity, True Positive Rate):
Recall measures the ability of the model to correctly identify positive instances, i.e., how many of the actual positive instances are correctly predicted as positive (true positives) compared to false negatives.
Formula: Recall = TP / (TP + FN)
High recall indicates that the model is making fewer false negative predictions.
##### AUC (Area Under the ROC Curve):
The ROC (Receiver Operating Characteristic) curve is a graphical representation of the true positive rate (recall) against the false positive rate as the classification threshold is varied. The AUC is the area under the ROC curve, which represents the model's ability to distinguish between the positive and negative classes.
AUC ranges from 0 to 1, with higher values indicating better model performance. An AUC of 0.5 corresponds to random guessing, while an AUC of 1 indicates perfect classification.
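A minimal sketch of computing these metrics with `sklearn.metrics` (`y_test` and `y_pred` are assumed to come from one of the model sections below):
```python=
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

accuracy  = accuracy_score(y_test, y_pred)   # (TP + TN) / total
error     = 1 - accuracy                     # (FP + FN) / total
precision = precision_score(y_test, y_pred)  # TP / (TP + FP)
recall    = recall_score(y_test, y_pred)     # TP / (TP + FN)
auc       = roc_auc_score(y_test, y_pred)    # area under the ROC curve

print(f"accuracy={accuracy:.3f} error={error:.3f} "
      f"precision={precision:.3f} recall={recall:.3f} auc={auc:.3f}")
```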
## Models:
### Decision tree J48
From `sklearn`, the decision tree can be implemented with the `DecisionTreeClassifier` (note that `sklearn` provides CART rather than Weka's J48, but it plays the same role here):
```python=
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
import matplotlib.pyplot as plt
# instantiate the model
dt = DecisionTreeClassifier(max_depth = 5)
# fit the model
dt.fit(X_train, y_train)
#predicting the target value from the model for the samples
y_test_tree = dt.predict(X_test)
y_train_tree = dt.predict(X_train)
#computing the accuracy of the model performance
acc_train_tree = accuracy_score(y_train,y_train_tree)
acc_test_tree = accuracy_score(y_test,y_test_tree)
tree_classification_report = classification_report(y_test,y_test_tree)
roc_auc_score_tree = roc_auc_score(y_test,y_test_tree)
error_rates_test = 1 - acc_test_tree
print(f"Decision Tree J48: Accuracy on training Data: {acc_train_tree}")
print(f"Decision Tree J48: Accuracy on test Data: {acc_test_tree}")
print('-'*30)
print(f"Decision Tree J48: Report ")
print(tree_classification_report)
print('-'*30)
print(f"Decision Tree J48: Error Rates on test Data: {error_rates_test}")
print(f"Decision Tree J48: ROC SCORE: {roc_auc_score_tree}")
#tree visualization (fn and cn are assumed to hold the feature-name and class-name lists)
fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (4,4), dpi=600)
tree.plot_tree(dt, feature_names = fn, class_names=cn, filled = True)
fig.savefig('decision-tree-updated.png')
```

### MLP
Similar to the Decision Tree Classifier, we can also find the MLP function from `sklearn`:
```python=
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train, y_train)
acc_train_mlp = clf.score(X_train, y_train)
acc_test_mlp = clf.score(X_test, y_test)
y_test_mlp = clf.predict(X_test)
mlp_classification_report = classification_report(y_test,y_test_mlp)
roc_auc_score_mlp = roc_auc_score(y_test,y_test_mlp)
error_rates_test = 1 - acc_test_mlp
print(f"MLP: Accuracy on training Data: {acc_train_mlp}")
print(f"MLP: Accuracy on test Data: {acc_test_mlp}")
print('-'*30)
print(f"MLP: Report ")
print(mlp_classification_report)
print('-'*30)
print(f"MLP: Error Rates on test Data: {error_rates_test}")
print(f"MLP: ROC SCORE: {roc_auc_score_mlp}")
```

### Naïve Bayes
For Naive Bayes, we use Gaussian Naive Bayes because it assumes each feature (also called a parameter or predictor) is conditionally independent given the class and normally distributed, so every feature contributes an independent capacity for predicting the output variable.
```python=
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
# Model training
model.fit(X_train, y_train)
y_train_pred_nb = model.predict(X_train)
y_test_pred_nb = model.predict(X_test)
train_accuracy = accuracy_score(y_train, y_train_pred_nb)
test_accuracy = accuracy_score(y_test, y_test_pred_nb)
nb_classification_report = classification_report(y_test,y_test_pred_nb)
roc_auc_score_nb = roc_auc_score(y_test,y_test_pred_nb)
error_rates_test = 1 - test_accuracy
print("Naive Bayes : Accuracy on training Data:", train_accuracy)
print("Naive Bayes : Accuracy on test Data:", test_accuracy)
print('-'*30)
print(f"Naive Bayes: Report ")
print(nb_classification_report)
print('-'*30)
print(f"Naive Bayes: Error Rates on test Data: {error_rates_test}")
print(f"Naive Bayes: ROC SCORE: {roc_auc_score_nb}")
```

### Logistic Regression
```python=
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0).fit(X_train, y_train)
y_train_pred_lr = clf.predict(X_train)
y_test_pred_lr = clf.predict(X_test)
train_accuracy = accuracy_score(y_train, y_train_pred_lr)
test_accuracy = accuracy_score(y_test, y_test_pred_lr)
lr_classification_report = classification_report(y_test,y_test_pred_lr)
roc_auc_score_lr = roc_auc_score(y_test,y_test_pred_lr)
error_rates_test = 1 - test_accuracy
print("Logistic Regression : Accuracy on training Data:", train_accuracy)
print("Logistic Regression : Accuracy on test Data:", test_accuracy)
print('-'*30)
print(f"Logistic Regression: Report ")
print(lr_classification_report)
print('-'*30)
print(f"Logistic Regression: Error Rates on test Data: {error_rates_test}")
print(f"Logistic Regression: ROC SCORE: {roc_auc_score_lr}")
```
### SMO (SVM)
```python=
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
clf = SVC(kernel='linear') # Linear Kernel
#Train the model using the training sets
clf.fit(X_train, y_train)
#Predict the response for test dataset
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
svm_classification_report = classification_report(y_test,y_test_pred)
roc_auc_score_svm = roc_auc_score(y_test,y_test_pred)
error_rates_test = 1 - accuracy_score(y_test, y_test_pred)
print(f"SVM: Accuracy on training Data: {accuracy_score(y_train, y_train_pred)}")
print(f"SVM: Accuracy on test Data: {accuracy_score(y_test, y_test_pred)}")
print('-'*30)
print(f"SVM: Report ")
print(svm_classification_report)
print('-'*30)
print(f"SVM: Error Rates on test Data: {error_rates_test}")
print(f"SVM: ROC SCORE: {roc_auc_score_svm}")
```

# Decision Tree from Scratch
We need to understand how a decision tree produces its classification. The general idea is that each branch holds a threshold, and the data is compared against these thresholds to classify the test samples.
(https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html)
(https://towardsdatascience.com/decision-tree-algorithm-in-python-from-scratch-8c43f0e40173)
The following code snippets are based on this understanding.
As our dataset provides labels in [0, 1], we can perform binary classification on it.
1. Check that the labels in the dataset are truly binary.
```python=
import numpy as np
import pandas as pd
import random

def check_purity(data):
    # We are trying to get a binary classification,
    # so the label column must contain exactly two classes.
    label_column = data['Label']
    unique_classes = np.unique(label_column)
    if len(unique_classes) == 2:
        return 1
    else:
        return 0
```
2. We create a function to perform the train/test split for our dataset
```python=
def trainTestSplit(dataFrame, testSize):
    '''
    Since we can't use the sklearn library, we lose access to the train_test_split function.
    Args:
    - dataFrame : Dataframe
    - testSize : float or int
    Returns:
    - dataFrameTrain : Dataframe
    - dataFrameTest : Dataframe
    '''
    if isinstance(testSize, float):
        testSize = round(testSize * len(dataFrame))
    indices = dataFrame.index.tolist()
    testIndices = random.sample(population = indices, k = testSize)
    dataFrameTest = dataFrame.loc[testIndices]
    dataFrameTrain = dataFrame.drop(testIndices)
    return dataFrameTrain, dataFrameTest
```
3. Create functions to calculate the accuracy, precision, and recall
```python=
# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Calculate precision percentage
def precision_metric(actual, predicted):
    true_positive = 0
    false_positive = 0
    for i in range(len(actual)):
        if predicted[i] == 1:
            if predicted[i] == actual[i]:
                true_positive += 1
            else:
                false_positive += 1
    return true_positive / (true_positive + false_positive) * 100.0

# Calculate recall percentage
def recall_metric(actual, predicted):
    true_positive = 0
    false_negative = 0
    for i in range(len(actual)):
        if actual[i] == 1:
            if predicted[i] == actual[i]:
                true_positive += 1
            else:
                false_negative += 1
    return true_positive / (true_positive + false_negative) * 100.0
```
4. Build the Tree
(https://anderfernandez.com/en/blog/code-decision-tree-python-from-scratch/)
```python=
# Calculate the Gini index for a candidate split
def gini_index(groups, classes):
    n_instances = float(sum([len(group) for group in groups]))
    gini = 0.0
    for group in groups:
        size = float(len(group))
        if size == 0:
            continue
        score = 0.0
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p
        gini += (1.0 - score) * (size / n_instances)
    return gini # Basically, a way to weigh the probability of the prediction being wrong.

# Split a dataset based on an attribute and an attribute value
def tree_split(index, value, dataset):
    left, right = list(), list()
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Select the best split point for a dataset
def get_split(dataset):
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    # The last column is the label, so it is excluded from the candidate split features
    for index in range(len(dataset[0]) - 1):
        for row in dataset:
            groups = tree_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    return {'index':b_index, 'value':b_value, 'groups':b_groups}

# Create a terminal node value
def to_terminal(group):
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)

# Create child splits for a node or make terminal
def split(node, max_depth, min_size, depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left)
        split(node['left'], max_depth, min_size, depth+1)
    # process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right)
        split(node['right'], max_depth, min_size, depth+1)

# Build a decision tree
def build_tree(train, max_depth, min_size):
    '''
    Function to create the tree
    Args:
    - train : list of rows
    - max_depth : int
    - min_size : int
    Returns:
    - root : dict
    '''
    print('-'*30, 'Building Root', '*'*30)
    root = get_split(train)
    split(root, max_depth, min_size, 1)
    return root

# Make a prediction with a decision tree
# Classification and Regression Tree Algorithm
def decision_tree(train, max_depth, min_size):
    print('-'*30, 'Building Tree', '*'*30)
    tree = build_tree(train, max_depth, min_size)
    return tree
```
5. Generate Accuracy and Prediction from the tree
```python=
def predict(node, row):
    '''
    Recursively go through the tree to predict the classification value
    Args:
    - node : dict
    - row : np.array
    Returns:
    - the leaf value reached at the end of the recursion
    '''
    if row[node['index']] < node['value']:
        if isinstance(node['left'], dict):
            return predict(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return predict(node['right'], row)
        else:
            return node['right']

def create_predictions(test_set, tree):
    '''
    Create a list of predictions based on the test_set
    Args:
    - test_set : list of rows
    - tree : dict
    '''
    predictions = list()
    for row in test_set:
        prediction = predict(tree, row)
        predictions.append(prediction)
    return predictions

def main():
    max_depth = 5
    min_size = 10
    data = pd.read_csv("urldata.csv")
    if check_purity(data) == 0:
        exit()
    train_set, test_set = trainTestSplit(data, 0.3)
    # Drop the raw URL string; it is not a numeric feature
    train_set.drop('Domain', axis='columns', inplace=True)
    test_set.drop('Domain', axis='columns', inplace=True)
    tr_set = list(train_set.values)
    tst_set = list(test_set.values)
    tree = decision_tree(tr_set, max_depth, min_size)
    y_train_pred = create_predictions(tr_set, tree)
    y_pred = create_predictions(tst_set, tree)
    y_test = test_set['Label'].values
    y_train = train_set['Label'].values
    accuracy = accuracy_metric(y_train, y_train_pred)
    precision = precision_metric(y_train, y_train_pred)
    recall = recall_metric(y_train, y_train_pred)
    print(f'Train Accuracy : {accuracy}')
    print(f'Train Precision : {precision}')
    print(f'Train Recall : {recall}')
    test_accuracy = accuracy_metric(y_test, y_pred)
    test_precision = precision_metric(y_test, y_pred)
    test_recall = recall_metric(y_test, y_pred)
    print(f'Test Accuracy : {test_accuracy}')
    print(f'Test Precision : {test_precision}')
    print(f'Test Recall : {test_recall}')
    print('-'*30 , 'Finished', '-'*30)

if __name__ == '__main__':
    main()
```

Based on the results, 100% training accuracy would normally indicate an over-fitted model; however, with the test accuracy, recall, and precision being 100% as well, it may simply be that the model is very good at predicting the data within the generated dataset.