# Lab 1-2
[TOC]
Path: `/mlsec/malware/malware-classification.ipynb`
## Code
```python=
import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn import tree, ensemble, naive_bayes
from sklearn import model_selection
from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt
%matplotlib inline
```
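> `sklearn.cross_validation` has been removed; its functionality now lives in `sklearn.model_selection`.
> `%matplotlib inline` is a Jupyter Notebook magic command; it renders `matplotlib.pyplot` figures directly below the cell that produced them.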
```python=
df = pd.read_csv('data.csv.original', sep='|')
legit_binaries = df[0:41323].drop(['legitimate'], axis=1)
malicious_binaries = df[41323::].drop(['legitimate'], axis=1)
```
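The split above relies on the first 41323 rows being the legitimate samples. A minimal sketch of a label-based alternative follows, assuming the `legitimate` column is 1 for legitimate files and 0 for malware. The `demo_df` frame and its values are made up for illustration and are not taken from `data.csv.original`.

```python
import pandas as pd

# Tiny synthetic stand-in for data.csv.original (values are invented for illustration).
demo_df = pd.DataFrame({
    "Name": ["a.exe", "b.dll", "c.exe", "d.exe"],
    "FileAlignment": [512, 4096, 512, 512],
    "legitimate": [1, 1, 0, 0],
})

# Select rows by label rather than by position, so the row order no longer matters.
demo_legit = demo_df[demo_df["legitimate"] == 1].drop(columns=["legitimate"])
demo_malicious = demo_df[demo_df["legitimate"] == 0].drop(columns=["legitimate"])

print(len(demo_legit), len(demo_malicious))
```

Filtering on the label should give the same two groups as the positional slice when the file is sorted as assumed, and it keeps working if the rows are shuffled.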
```python=
legit_binaries['FileAlignment'].value_counts()
```
```python=
malicious_binaries['FileAlignment'].value_counts()
```
```python=
#Q1: plot and observe more data
plt.figure(figsize=(20,10))
# `normed` was removed from matplotlib; `density=True` is the replacement
plt.hist([legit_binaries['SectionsMaxEntropy'], malicious_binaries['SectionsMaxEntropy']],
         range=[0, 8], density=True, color=["green", "red"], label=["legitimate", "malicious"])
plt.legend()
plt.show()
```
```python=
X = df.drop(['Name', 'md5', 'legitimate'], axis=1).values
y = df['legitimate'].values
```
```python=
# Build a forest and compute the feature importances - n_estimators:The number of trees in the forest.
forest = ExtraTreesClassifier(n_estimators=10).fit(X, y)
# Meta-transformer for selecting features based on importance weights.
model = SelectFromModel(forest, prefit=True)
```
```python=
X_new = model.transform(X)
print('before X.shape: {}'.format(X.shape))
print('after X.shape: {}'.format(X_new.shape))
```
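The drop in column count comes from `SelectFromModel` keeping only the features whose importance reaches a threshold. For a tree ensemble with no explicit `threshold`, that threshold defaults to the mean importance. The sketch below checks this on synthetic data from `make_classification`; the `demo_*` names are hypothetical and the lab's real dataset is not used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for X / y (assumption: any numeric feature matrix behaves the same way).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

demo_forest = ExtraTreesClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)
demo_selector = SelectFromModel(demo_forest, prefit=True)

importances = demo_forest.feature_importances_
mask = demo_selector.get_support()                # boolean mask of kept features
manual_mask = importances >= importances.mean()   # default rule: keep importance >= mean

X_demo_new = demo_selector.transform(X_demo)
print("kept %d of %d features" % (mask.sum(), X_demo.shape[1]))
```

Because `ExtraTreesClassifier` is randomized, the set of kept features can change between runs unless `random_state` is fixed.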
```python=
nb_features = X_new.shape[1]
indices = np.argsort(forest.feature_importances_)[::-1][:nb_features]
for f in range(nb_features):
    # +2 skips the 'Name' and 'md5' columns that were dropped from X
    print("%d. feature %s (%f)" % (f + 1, df.columns[2+indices[f]], forest.feature_importances_[indices[f]]))
```
```python=
X_train, X_test, y_train, y_test = model_selection.train_test_split(X_new, y ,test_size=0.2)
clf = tree.DecisionTreeClassifier(max_depth=10)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
score = clf.score(X_test, y_test)
print("DecisionTree : %f %%" % ( score*100))
print(confusion_matrix(y_test, y_pred))
```
Train a DecisionTree and run predictions on the test set.
```python=
#Q2: Setup more model
algorithms = {
"DecisionTree": tree.DecisionTreeClassifier(max_depth=10),
"RandomForest": ensemble.RandomForestClassifier(n_estimators=50),
"GradientBoosting": ensemble.GradientBoostingClassifier(n_estimators=50),
"AdaBoost": ensemble.AdaBoostClassifier(n_estimators=100),
"GNB": naive_bayes.GaussianNB(),
}
X_train, X_test, y_train, y_test = model_selection.train_test_split(X_new, y ,test_size=0.2)
```
Set up each of the algorithms to compare.
```python=
#Q3: Find the best one
results = {}   # accuracy per algorithm
y_preds = {}   # test-set predictions per algorithm
print("Now testing all algorithms\n")
for algo in algorithms:
    clf = algorithms[algo]
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    score = clf.score(X_test, y_test)
    print("%s : %f %%" % (algo, score*100))
    results[algo] = score
    y_preds[algo] = y_pred

# Pick the algorithm with the highest test accuracy
winner = max(results, key=results.get)
print('\nWinning algorithm is %s with a %f %% success' % (winner, results[winner]*100))
```
Fit and evaluate every algorithm, then report the one with the highest accuracy.
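A single train/test split can favor one model by chance. As a hedged sketch (not part of the original lab), the comparison could instead be averaged over k folds with `model_selection.cross_val_score`. The data here is synthetic, and `demo_algorithms` is a hypothetical, shortened version of the `algorithms` dictionary.

```python
from sklearn import model_selection, naive_bayes, tree
from sklearn.datasets import make_classification

# Synthetic stand-in for X_new / y; the real lab data is not used here.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_algorithms = {
    "DecisionTree": tree.DecisionTreeClassifier(max_depth=10, random_state=0),
    "GNB": naive_bayes.GaussianNB(),
}

cv_results = {}
for name, clf in demo_algorithms.items():
    # cross_val_score fits a fresh clone of clf on each of the 5 folds
    scores = model_selection.cross_val_score(clf, X_demo, y_demo, cv=5)
    cv_results[name] = scores
    print("%s : %f %% (+/- %f)" % (name, scores.mean() * 100, scores.std() * 100))

# Winner by mean accuracy across folds
cv_winner = max(cv_results, key=lambda name: cv_results[name].mean())
print("Winning algorithm is %s" % cv_winner)
```

The standard deviation across folds gives a rough idea of whether the gap between two models is meaningful.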
```python=
for algo in algorithms:
    print(algo)
    print(confusion_matrix(y_test, y_preds[algo]))
```
Print the confusion matrix for each algorithm's predictions.
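To read these matrices: scikit-learn puts true labels on the rows and predicted labels on the columns, ordered by label value. The sketch below unpacks the four cells on a small hand-made example, assuming label 1 means legitimate and 0 means malware, so that a "false positive" is malware predicted as legitimate. The `y_true_demo` and `y_pred_demo` lists are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels (assumption: 1 = legitimate, 0 = malware)
y_true_demo = [1, 1, 1, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 1]

# For labels [0, 1] the matrix is [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

# Share of actual malware predicted as legitimate
missed_malware_rate = fp / (fp + tn)
# Share of actual legitimate files flagged as malware
false_alarm_rate = fn / (fn + tp)

print(tn, fp, fn, tp)
print(missed_malware_rate, false_alarm_rate)
```

For a malware detector, the missed-malware rate is often the more important number to track, even when overall accuracy looks high.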
<style>
span.hidden-xs:after {
content: ' × ML Security' !important;
}
</style>
###### tags: `ML Security`