# NTU Malware Reverse Final Project Notes
###### tags: `NTU_MR` `Malware Reverse Engineering and Analysis`

## Deep learning at the shallow end: Malware classification for non-domain experts
### How to reproduce?
1. Construct Environment

    For the whole setup procedure, see [Notes on installing TensorFlow, CUDA, and cuDNN](https://hackmd.io/@cwl0429/install_tf_guide). Following the [TensorFlow documentation](https://www.tensorflow.org/install/source_windows#gpu), I chose the versions shown below.

    | Component | CUDA | cuDNN | Python | GPU Driver Version | tensorflow | tensorflow-gpu |
    |:---------:|:----:|:-----:|:------:|:------------------:|:----------:|:--------------:|
    | Version   | 11.2 | 8.1   | 3.6.13 | 526.98             | 2.6.2      | 2.6.0          |

    Then, following the [NVIDIA cuDNN documentation](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-windows), use the `zlibwapi.dll` provided on that page directly. That archive is built for `x64` processors. **<font color=#FF0000>DO NOT USE [this page](http://www.winimage.com/zLibDll/) or [this page](https://www.dll-files.com/zlibwapi.dll.html)</font>**; those builds are for `x86` processors.
2. Problems That May Occur During Setup
    * If a command raises an exception about `zlibwapi.dll`, check this page: [TensorFlow error: Could not locate zlibwapi.dll or Could not load library cudnn_cnn_infer64_8.dll](https://blog.csdn.net/qq_45071353/article/details/124091856). Copy `zlibwapi.dll` into **`C:\Windows\System32`** and **`C:\Windows\SysWOW64`**. Also extract the `zlib123dllx64` folder downloaded from the [NVIDIA cuDNN documentation](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-zlib-windows) to a location of your choice and add that path to `Environment Variables->System variables->PATH`. In my case, it's `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\zlib123dllx64\dll_x64`.
      For the complete walkthrough of this fix, see this video at `time=11:36`:
      {%youtube 27fBCKKZdpY %}
    * If you cannot use the `nvcc` command, check this page: [CUDA installation and verification (complete): fixing "nvcc command not found"](https://blog.csdn.net/XieRuily/article/details/123670141)
3. Then revise the dataset path in the code and run it directly.

### How to Feed in `malimg-dataset`
#### Revise original code
* <font color="FF0000">Note that the `kfold` parameter cannot be set to less than 2.</font>
* You can skip or comment out every part that converts .bin files to images, and start from the `Load image data from the training set` section.
* Note that you must preserve some variables and procedures, as below:
```python
...
max_len = int(1e4)
...
binfn_id2cls = {}  # file name id is the part before '.'
for fn_label_item in train_labels_df.itertuples():
    binfn_id2cls[fn_label_item.Id] = fn_label_item.Class
```
* Furthermore, you must revise the data paths:
```python
project_dir = 'D:/NTU/First Year/Malware Reverse Engineering and Analysis/Homework/Final_Project/Dataset'
data_dir = os.path.join(project_dir, 'malimg_dataset')
train_dir = os.path.join(data_dir, 'train')
test_dir = os.path.join(data_dir, 'validation')
...
#############################################################
# Load image data from the training set
#############################################################
# NOTE: default value
width = 1
train_img_dir = train_dir
img_sfx = 'png'
...
N_CLASS = 25
...
##################################################
# Predict classes for test files, and save results
##################################################
test_img_dir = test_dir
...
```

#### Convert `malimg` data
* Common variables used below
```python
import os

project_dir = 'D:/NTU/First Year/Malware Reverse Engineering and Analysis/Homework/Final_Project/Dataset'
data_dir = os.path.join(project_dir, 'malimg_dataset')
train_dir = os.path.join(data_dir, 'train')
test_dir = os.path.join(data_dir, 'validation-original')
train_labels_fn = os.path.join(data_dir, 'trainLabels.csv')
```
1. Create CSV files that store each image ID and Class
```python
import csv

def write_labels(src_dir, csv_path):
    # Sort the class folders so the same family gets the same class number
    # in both the training and validation label files.
    folders = sorted(os.listdir(src_dir))
    with open(csv_path, 'w', newline='') as csvf:
        writer = csv.writer(csvf)  # create the CSV writer
        writer.writerow(['Id', 'Class'])
        for j, f in enumerate(folders):
            fullpath = os.path.join(src_dir, f)
            for i in os.listdir(fullpath):
                writer.writerow([i, j + 1])  # class numbers start at 1

write_labels(train_dir, './trainLabels.csv')
write_labels(test_dir, './valLabels.csv')
```
2. Copy all images from each class folder into a single folder
```python!
import shutil

f2 = 'D:/NTU/First Year/Malware Reverse Engineering and Analysis/Homework/Final_Project/Dataset/malimg_dataset/validation/'
for f in os.listdir(test_dir):
    fullpath = os.path.join(test_dir, f)
    for i in os.listdir(fullpath):
        files_src = os.path.join(fullpath, i)
        files_dest = os.path.join(f2, i)
        shutil.copyfile(files_src, files_dest)  # copy the file
```
3. Resize train/val images to 10000 bytes

    To match the input size the model accepts, every image must be shrunk to 10000 bytes. The original data went through the same procedure for the same reason.
```python!
import os

import cv2
import numpy as np

width = 1
max_len = int(1e4)

def shrink_files(src_dir, dest_dir):
    # Read each file as a raw byte stream, view it as an N x 1 image,
    # and resize it to max_len x 1 so every sample has the same length.
    for fn in os.listdir(src_dir):
        fn_wp = os.path.join(src_dir, fn)
        bin_stream = np.fromfile(fn_wp, dtype='uint8')
        bin_stream = bin_stream.reshape(-1, 1)
        # cv2.resize takes dsize as (width, height)
        img_shrink = cv2.resize(bin_stream, (width, max_len))
        img_shrink.tofile(os.path.join(dest_dir, fn))

shrink_files(
    os.path.join(data_dir, 'train-unresize'),
    'D:/NTU/First Year/Malware Reverse Engineering and Analysis/Homework/Final_Project/Dataset/malimg_dataset/train/')
shrink_files(
    os.path.join(data_dir, 'validation-unresize'),
    'D:/NTU/First Year/Malware Reverse Engineering and Analysis/Homework/Final_Project/Dataset/malimg_dataset/validation/')
```

#### Run directly
If you want to plot the confusion matrix, comment out some code at the end and add the code below.
```python!
import os

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix

labels_name = ["Adialer.C", "Agent.FYI", "Allaple.A", "Allaple.L", "Alueron.gen!J",
               "Autorun.K", "C2LOP.gen!g", "C2LOP.P", "Dialplatform.B", "Dontovo.A",
               "Fakerean", "Instantaccess", "Lolyda.AA1", "Lolyda.AA2", "Lolyda.AA3",
               "Lolyda.AT", "Malex.gen!J", "Obfuscator.AD", "Rbot!gen", "Skintrim.N",
               "Swizzor.gen!E", "Swizzor.gen!I", "VB.AT", "Wintrim.BX", "Yuner.A"]

# y_true and y_pred come from the prediction code that runs before this block.
# Class numbers run from 1 to 25.
mat_con = confusion_matrix(y_true, y_pred, labels=list(range(1, 26)))

# Setting the attributes
fig, px = plt.subplots(figsize=(8, 8))
px.matshow(mat_con, cmap=plt.cm.jet, alpha=0.5)
for m in range(mat_con.shape[0]):
    for n in range(mat_con.shape[1]):
        # matshow draws row m on the y axis and column n on the x axis
        px.text(x=n, y=m, s=str(mat_con[m, n]), va='center', ha='center', size='large')

# Sets the labels
num_class = np.array(range(len(labels_name)))
plt.xticks(num_class, labels_name, rotation=90, fontsize=10)
plt.yticks(num_class, labels_name, fontsize=10)
# plt.xlabel('Predictions', fontsize=16)
# plt.ylabel('Actuals', fontsize=16)
plt.title('Confusion Matrix', fontsize=15)
os.makedirs('./Confusion_matrix/', exist_ok=True)  # savefig fails if the folder is missing
plt.savefig(os.path.join('./Confusion_matrix/', "output.png"), format='png')
plt.show()
```
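
With 25 classes the plotted matrix can be hard to read, so a per-class recall list may be a useful companion. The sketch below is not part of the original project code: `per_class_recall` is a hypothetical helper name, and it assumes the matrix has true classes as rows and predicted classes as columns, which is the layout `confusion_matrix` returns. The toy matrix is made up for illustration only.

```python
def per_class_recall(matrix):
    """Return the recall of each class for a square confusion matrix.

    Rows are true classes and columns are predicted classes. A class
    with no true samples yields None instead of dividing by zero.
    """
    recalls = []
    for i, row in enumerate(matrix):
        total = sum(row)  # number of true samples of class i
        recalls.append(float(row[i]) / float(total) if total else None)
    return recalls


# Toy 2-class matrix, invented for this example (not project data).
toy = [[2, 0],
       [1, 3]]
print(per_class_recall(toy))
```

Assuming the variables from the plotting code are still in scope, `zip(labels_name, per_class_recall(mat_con))` would pair each malware family with its recall.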