# COMP9417 Project
Shiding Zhang, Xu Bai, Yu Song
## Topic
[Kaggle Microsoft Malware Prediction](https://www.kaggle.com/c/microsoft-malware-prediction)
## Introduction
The dataset is large, with around 9 million rows in the training file, so we rented a cloud machine with 104 vCPUs, 768 GB of RAM, and 300 GB of swap space to run the project. Usage of the **.py files** is specified below.
First, download the train and test data from Kaggle and place them in the **root directory or any directory you like**. Second, run **data_cleaning.py**. Third, generate the one-hot encoding by running **one_hot.py**, then run **gen_pca.py**. This completes the data processing stage.
For the four baseline models, run **sklearn_logistic.py**, **ann.py**, **RF_train.py**, and **lightGBM.py**. For further tuning, the **\*_cv.py** scripts can be used to perform cross validation.
The last step is to run **meta_logistic.py**, which generates the final results.
**pkl** files are used widely in this project because they save time when reading from and writing to disk. Please use ``pandas.read_pickle('path')`` / ``df.to_pickle('path')`` to load and save them.
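For example, any intermediate DataFrame can be cached and reloaded like this (the path below is purely illustrative):

```python
import pandas as pd

# Illustrative only: any intermediate DataFrame can be cached this way
df = pd.DataFrame({'a': [1, 2, 3]})
df.to_pickle('example.pkl')               # write to disk
df_again = pd.read_pickle('example.pkl')  # read back much faster than re-parsing CSV
```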
## Workload
- Data cleaning (Shiding Zhang, Xu Bai)
- Label encoding (Shiding Zhang, Xu Bai)
- One Hot Encoding and PCA (Shiding Zhang)
- Logistic model, meta model (Shiding Zhang)
- ANN, Random Forest (Xu Bai)
- LightGBM (Yu Song)
- Data combining (Xu Bai)
- Report Composing (Shiding Zhang, Xu Bai, Yu Song)
## Code structure
**train.csv**: training data (Not submitted, please download from kaggle)
**test.csv**: testing data (Not submitted, please download from kaggle)
**data_cleaning.py**: Script for data cleaning. It takes `train.csv` and `test.csv` as input and returns the cleaned train and test datasets.
**one_hot.py**: Generates the one-hot-encoded data. *in_path* should be the directory storing *train.csv* and *test.csv*. *out_path* is the directory where the one-hot-encoding files are written. *nrows* specifies how many rows to process.
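A minimal sketch of the one-hot idea, using `pandas.get_dummies` on a couple of illustrative categorical columns (the project script may use a different API and covers many more columns):

```python
import pandas as pd

# Illustrative stand-in for a couple of categorical columns from train.csv
df = pd.DataFrame({
    'ProductName': ['win8defender', 'mse', 'win8defender'],
    'Platform': ['windows10', 'windows8', 'windows10'],
})

# One way to one-hot encode; the project script may differ in detail
one_hot = pd.get_dummies(df, columns=['ProductName', 'Platform'])
one_hot.to_pickle('train_one_hot.pkl')
```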
**gen_pca.py**: Reduces the dimensionality of the one-hot data. *in_path* should be the directory storing *train.csv* and *test.csv*. *out_path* is the directory containing the one-hot-encoding files, where the PCA output is also written. *nrows* specifies how many rows to process. *pca_cols* specifies the output dimensionality.
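A minimal sketch of the PCA step, assuming a wide one-hot matrix as input (the matrix and the 50-component setting here are illustrative stand-ins for the script's *pca_cols* parameter):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Stand-in for the wide one-hot matrix; the real script reads it from out_path
one_hot = pd.DataFrame(np.random.randint(0, 2, size=(1000, 500)))

# pca_cols in gen_pca.py controls the output dimensionality; 50 is illustrative
pca = PCA(n_components=50)
reduced = pca.fit_transform(one_hot.values)

pd.DataFrame(reduced).to_pickle('train_pca.pkl')
```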
**ann_cv.py**: Cross-validation script to find an appropriate number of nearest neighbours for the ANN algorithm. `fit_hnsw_index()` in this code is adapted from a notebook on GitHub: https://github.com/stephenleo/adventures-with-ann/blob/main/knn_is_dead.ipynb.
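For reference, a minimal sketch of what an HNSW-based helper like `fit_hnsw_index()` might look like; the parameters and defaults here are assumptions, not the project's exact code:

```python
import numpy as np
import hnswlib

def fit_hnsw_index(features, ef=100, M=16):
    """Build an HNSW approximate-nearest-neighbour index over the feature matrix."""
    dim = features.shape[1]
    index = hnswlib.Index(space='l2', dim=dim)
    index.init_index(max_elements=len(features), ef_construction=ef, M=M)
    index.add_items(features, np.arange(len(features)))
    index.set_ef(ef)
    return index

# Illustrative data: query the k nearest neighbours of every point
X = np.random.rand(1000, 50).astype('float32')
index = fit_hnsw_index(X)
labels, distances = index.knn_query(X, k=10)
```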
**ann.py**: Script to generate the trained ANN model and its data. It produces two model/data pairs: one for evaluating performance and one for scoring on Kaggle.
**RF_pra_tunning.py**: Uses grid search to find the best combination of hyperparameters. This code largely follows the [tutorial](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74) written by data scientist Will Koehrsen.
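A minimal sketch of the tuning step, assuming a scikit-learn grid search over a random forest; the search space, scoring metric, and synthetic data below are illustrative rather than the script's exact settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in features; the real script loads the cleaned/encoded pickles
X = np.random.rand(500, 20)
y = np.random.randint(0, 2, size=500)

# Illustrative search space; the grid in RF_pra_tunning.py may differ
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [10, None],
    'min_samples_leaf': [1, 4],
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring='roc_auc', cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```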
**RF_train.py**: Script to produce the trained random forest model and its data.
**merge.py**: At the final stage, the predictions from the four models need to be merged; this script does that.
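A minimal sketch of merging per-model predictions, assuming each model's output is keyed by a shared identifier column; the frames and column names here are illustrative, not the script's actual files:

```python
import pandas as pd

# Illustrative per-model prediction frames keyed by MachineIdentifier
ann = pd.DataFrame({'MachineIdentifier': ['a', 'b'], 'ann': [0.2, 0.7]})
rf = pd.DataFrame({'MachineIdentifier': ['a', 'b'], 'rf': [0.3, 0.6]})
lgbm = pd.DataFrame({'MachineIdentifier': ['a', 'b'], 'lgbm': [0.4, 0.8]})
logit = pd.DataFrame({'MachineIdentifier': ['a', 'b'], 'logistic': [0.1, 0.5]})

# Join all prediction columns into one frame for the meta model
merged = ann
for frame in (rf, lgbm, logit):
    merged = merged.merge(frame, on='MachineIdentifier')

merged.to_pickle('merged_predictions.pkl')
```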
**sklearn_logistic_cv.py**: Performs cross validation for the logistic model. *in_path* should be the directory storing *train.csv* and *test.csv*. *out_path* is the directory containing the PCA files, where the model and its data are also written. *nrows* specifies how many rows to process.
**sklearn_logistic.py**: Trains the logistic regression baseline model. *in_path* should be the directory storing *train.csv* and *test.csv*. *out_path* is the directory containing the PCA files, where the model and its data are also written. *nrows* specifies how many rows to process.
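A minimal sketch of the logistic baseline on PCA features, using synthetic data as a stand-in for the real pickles; the validation split and AUC metric here are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the PCA-reduced features the script actually loads
X = np.random.rand(1000, 50)
y = np.random.randint(0, 2, size=1000)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
print('validation AUC:', roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))
```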
**lightGBM.py**: Processes the cleaned data and splits it into two parts: the larger part is used to train LightGBM, and the smaller part is used for predictions that are compared against the true values to estimate the model's accuracy. The trained model is then used to predict on the test data.
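A minimal sketch of that train/hold-out pattern with LightGBM's scikit-learn interface; the data, split ratio, and hyperparameters below are illustrative assumptions:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the cleaned training data
X = np.random.rand(1000, 30)
y = np.random.randint(0, 2, size=1000)

# Hold out a small slice for validation, as described above
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
model.fit(X_tr, y_tr)
print('validation AUC:', roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```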
**meta_logistic_cv.py**: Performs cross validation for the meta model. *in_path* should be the directory storing *train.csv* and *test.csv*. *out_path* is the directory containing the PCA files, where the model and its data are also written. *nrows* specifies how many rows to process.
**meta_logistic.py**: Trains the meta model. *in_path* should be the directory storing *train.csv* and *test.csv*. *out_path* is the directory containing the PCA files, where the model and its data are also written. *nrows* specifies how many rows to process.
- ANN: 0.08769113
- RF: 0.37145913
- Light_GBM: 1.17735825
- Logistic: 0.0709871
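A minimal sketch of the stacking idea behind the meta model, assuming the merged frame holds one prediction column per base model plus the label; all column names and data here are illustrative, not the project's actual files:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical merged frame: one probability column per base model plus the label
stack = pd.DataFrame({
    'ann': np.random.rand(1000),
    'rf': np.random.rand(1000),
    'lgbm': np.random.rand(1000),
    'logistic': np.random.rand(1000),
    'HasDetections': np.random.randint(0, 2, size=1000),
})

# Fit the logistic meta model on the base-model predictions
meta = LogisticRegression()
meta.fit(stack[['ann', 'rf', 'lgbm', 'logistic']], stack['HasDetections'])

# The learned coefficients indicate how much weight each base model receives
print(dict(zip(['ann', 'rf', 'lgbm', 'logistic'], meta.coef_[0])))
```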