---
tags: AML
title: Generative Models
---
## Generative Models
**Due Date:** 17th May 2022 (23:59)
**Submission Format:** GitHub repository link and report (PDF). For the implementation, **PyTorch** is a requirement. All libraries used in the implementation should be listed in a requirements file.
**Data:** [Task 1](https://drive.google.com/file/d/1iVl4Q4Bq3Fbwv60lLthfBc6eYxYLrdnU/view?usp=sharing) & [Task 2](https://cloudstor.aarnet.edu.au/plus/s/2DhnLGDdEECo4ys?path=%2FUNSW-NB15%20-%20CSV%20Files%2Fa%20part%20of%20training%20and%20testing%20set)
# Task 1
In this task, you are going to solve an anomaly detection problem in the domain of finance. The anomalies are suspicious customer transactions. The data has a lot of missing values.
**The goals**:
1. Fill missing values using a generative model (autoencoder)
2. Measure the performance of the autoencoder approach and compare it with statistical approaches (e.g., [imbalanced-learn](https://imbalanced-learn.org/stable/index.html))
3. Reduce the dimensionality of the data using Principal Component Analysis and Linear Discriminant Analysis, and compare the impact on a selected machine learning model
## Dataset
The data comes from Vesta's real-world e-commerce transactions and contains a wide range of features from device type to product features. The most important aspect is that the data has a lot of missing values. The data is broken into two files identity and transaction, which are joined by `TransactionID`. Not all transactions have corresponding identity information.
The data contains the following samples and class distribution:
| | Count | Percentage (%) |
|:---------------------------:|:------:|:---------------:|
| Fraudulent transactions | 20663 | 3.5 |
| Not fraudulent transactions | 569877 | 96.5 |
## Dataset Tables
### Transaction Table
***It contains money transfers and also other transactions for gifting goods and services, e.g., booking a ticket for someone else.***
* TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
* TransactionAMT: transaction payment amount in USD
* ProductCD: product code, the product for each transaction
* card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
* addr: address
* dist: distance
* P_emaildomain and R_emaildomain: purchaser and recipient email domain
* C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
* D1-D15: timedelta, such as days between previous transaction, etc.
* M1-M9: match, such as names on card and address, etc.
* Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.
```
Categorical Features:
ProductCD
card1 - card6
addr1, addr2
P_emaildomain
R_emaildomain
M1 - M9
```
### Identity Table
Variables in this table are identity information: network connection information (IP, ISP, proxy, etc.) and digital signatures (UA/browser/OS/version, etc.) associated with transactions. They are collected by Vesta's fraud protection system and digital security partners. (The field names are masked and a pairwise dictionary will not be provided, for privacy protection and contract agreement.)
```
Categorical Features:
DeviceType
DeviceInfo
id_12 - id_38
```
## Labelling logic
The logic of labeling is as follows: a transaction with a reported chargeback on the card is defined as fraud (isFraud=1), and transactions posterior to it with either a user account, email address, or billing address directly linked to these attributes are labeled as fraud too. If none of the above is reported and found beyond 120 days, the transaction is defined as legit (isFraud=0). However, in the real world fraudulent activity might not be reported, e.g., the cardholder was unaware, or forgot to report in time and beyond the claim period. In such cases, supposed fraud might be labeled as legit, but we could never know of them. [Read more](https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203#588953). -- [Vesta Corporation](https://trustvesta.com/), dataset provider.
## Files
* data_transaction.csv
* data_identity.csv
***Split the data into train and test sets (80% train, 20% test).***
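A minimal sketch of loading, joining, and splitting the files as described above; the left join reflects that not all transactions have identity information, and the `isFraud` label follows the labelling logic above:
```python
import pandas as pd
from sklearn.model_selection import train_test_split

transaction = pd.read_csv("data_transaction.csv")
identity = pd.read_csv("data_identity.csv")

# Left join: not all transactions have corresponding identity information.
data = transaction.merge(identity, how="left", on="TransactionID")

# Stratify on the label so both splits keep the ~3.5% fraud rate.
train_df, test_df = train_test_split(
    data, test_size=0.2, stratify=data["isFraud"], random_state=42
)
```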
## Filling missing values with Autoencoders
Autoencoders are generative models applicable to many different tasks, such as anomaly detection and reconstruction of damaged signals. In this task, autoencoders will be used to fill in the missing values in the dataset. Three autoencoders are to be implemented:
1. Undercomplete autoencoder
2. Regularized autoencoder
3. Variational autoencoder
To achieve this goal, the input of the encoder $E$ will need to be modified: the information about the missing values should be embedded into the model input. This information can be embedded simply using an indicator vector.
For a simple undercomplete autoencoder: given an initial data point $x_i \in \mathbb{R}^m$ and a missing-value indicator $I_i \in \mathbb{R}^m$, the loss function $L$ for the autoencoder becomes:
$$L(x_i, \hat{x}_i, I_i) = ||I_i * x_i - I_i * \hat{x}_i||$$
where $*$ denotes element-wise multiplication and:
$$\hat{x}_i = D(z_i), \qquad z_i = E(x_i, I_i)$$
The overall architecture of the autoencoder is illustrated below.
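A minimal PyTorch sketch of the undercomplete variant under the loss above; the layer widths, latent size, and the squared form of the norm are assumptions, and the convention here is that $I_i = 1$ marks an observed entry:
```python
import torch
import torch.nn as nn

class MaskedAutoencoder(nn.Module):
    """Undercomplete AE whose encoder sees the data and the indicator."""
    def __init__(self, m: int, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * m, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, m),
        )

    def forward(self, x: torch.Tensor, indicator: torch.Tensor) -> torch.Tensor:
        z = self.encoder(torch.cat([x, indicator], dim=1))  # z = E(x, I)
        return self.decoder(z)                              # x_hat = D(z)

def masked_loss(x, x_hat, indicator):
    # || I * x - I * x_hat ||^2: only observed entries (I = 1) contribute,
    # so the model is never penalized on values it has to invent.
    diff = indicator * (x - x_hat)
    return diff.pow(2).sum() / indicator.sum().clamp(min=1.0)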

# Task 2
In this task, you are going to solve an anomaly detection problem in the domain of cyber security. The anomalies are network intrusions. To balance the dataset, a conditional generative adversarial model will be implemented.
**The goals** :
1. Detect network intrusions
2. Measure the quality of data generated by a generative adversarial model.
## Dataset
The data we are going to use is UNSW-NB15, a network intrusion benchmark dataset created by the Cyber Range Lab of the Australian Centre for Cyber Security. The data is stored in CSV format and easily fits into memory. The table below outlines the class distribution in the dataset.
| Name | Training Set |
| -------------- |:---------------:|
| Analysis | 2,000 (1.14%) |
| Backdoor | 1,746 (0.99%) |
| DoS | 12,264 (6.99%) |
| Exploits | 33,393 (19.04%) |
| Fuzzers | 18,184 (10.37%) |
| Generic | 40,000 (22.81%) |
| Normal | 56,000 (31.94%) |
| Reconnaissance | 10,491 (5.98%) |
| Shell Code | 1,133 (0.65%) |
| Worms | 130 (0.07%) |
The dataset contains numerical and categorical attributes. This format is not very friendly for learning algorithms, so below we discuss how to preprocess the data before passing it to the generative and ML algorithms.
### Data Preprocessing
The simplest way to convert the string representation into a machine-readable format is to substitute each category with a unique integer identifier. You are free to apply an approach of your choice for handling categorical string data and missing values.
Deep learning and other models perform well with normalized or standardized data. To scale the data, we advise using a robust scaler, which reduces the impact of outliers. Which approach to use is up to you, but the justification of your choice should be presented in the report. As part of feature engineering, we also advise removing near-constant features (e.g., 99% of the column equal to one specific value), as in the sketch below.
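A possible preprocessing sketch, assuming the label column has already been separated out; the integer encoding and `RobustScaler` follow the advice above, and the 99% dominance threshold comes from the parenthetical example:
```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

def preprocess(features: pd.DataFrame) -> pd.DataFrame:
    """Encode categoricals, drop near-constant columns, scale robustly."""
    df = features.copy()
    # Substitute each category string with a unique integer identifier.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category").cat.codes  # missing becomes -1
    # Drop near-constant features (one value covers >= 99% of the rows).
    top_freq = df.apply(lambda c: c.value_counts(normalize=True).iloc[0])
    df = df.drop(columns=top_freq[top_freq >= 0.99].index)
    # RobustScaler centers on the median and scales by the IQR,
    # reducing the influence of outliers.
    return pd.DataFrame(RobustScaler().fit_transform(df),
                        columns=df.columns, index=df.index)
```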
## Data Balancing
To balance the data it will be necessary to use a conditional generative adversarial network (cGAN). A cGAN is a variant of the generative adversarial model which asserts a condition on the data to be generated (e.g., the type of network intrusion). The GAN is made of two feed-forward artificial neural networks known as the Generator ($G$) and the Discriminator ($D$). The network is trained in an adversarial manner, as proposed by Ian Goodfellow and his colleagues in 2014. The Generator can be represented as follows:
$$
G:(Z \times Y) \rightarrow X
$$
where $Z$ is a random noise vector from a specified distribution, e.g., the normal distribution $\mathcal{N}(\mu, \sigma^2)$, and $Y$ is the asserted condition for generating a data sample. The discriminator network $D$, which tries to distinguish samples generated by $G$ from real data samples, is defined as:
$$
D:(X \times Y) \rightarrow [0,1]
$$
The input to the discriminator $D$ is a sample $x$ and its corresponding class label $y$.
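Putting the two mappings together, $G$ and $D$ play the standard two-player minimax game, with both networks conditioned on $y$ (see reference 3):
$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y) \mid y))]
$$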
## Conditional GAN training
The Discriminator and Generator are trained in a two-player minimax game. The two parts of the GAN model ($G$ and $D$) are represented as two multi-layer feed-forward neural networks. The loss function of a conditional GAN is slightly different from the standard GAN described by Ian Goodfellow: the only difference is the conditional term, as shown in the objective above. The baseline parameters of the conditional GAN to implement for this assignment are as follows:
| Parameter | Generator | Discriminator |
| ---------------- |:---------:|:-------------:|
| Epochs | 2000 | 2000 |
| Optimizer | SGD | SGD |
| learning rate | 0.0005 | 0.0005 |
| number of layers | 5 | 5 |
| Layer 1 (Embedding layer) | | |
| Layer 2 Neurons | 128 | 512 |
| Layer 3 Neurons | 256 | 256 |
| Layer 4 Neurons | 512 | 128 |
| Layer 5 Neurons | 34 | 1 |
To measure the performance of the generator model, the [KL divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence) can be calculated; it measures the difference between the generated data and the real data. It is advisable to use TensorBoard to monitor the training process of the cGAN.
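One way to compute it, sketched below for a single feature: build histograms of the real and generated values over a shared range and pass them to `scipy.stats.entropy`, which returns the KL divergence when given two distributions. The bin count and smoothing constant are arbitrary choices here:
```python
import numpy as np
from scipy.stats import entropy

def feature_kl(real: np.ndarray, fake: np.ndarray, bins: int = 50) -> float:
    """KL(P_real || P_fake) over a shared histogram of one feature."""
    lo, hi = min(real.min(), fake.min()), max(real.max(), fake.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(fake, bins=bins, range=(lo, hi), density=True)
    eps = 1e-8  # smoothing: avoid division by zero in empty bins
    return float(entropy(p + eps, q + eps))
```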
### Embedding Layer
The embedding layer joins the label (one-hot encoding) and the data; the label can then be used as the condition for the cGAN. The output dimension of the embedding layer is not suitable for direct input to the next layer, so it is necessary to reshape the embeddings. The simplest way to make the data suitable for the next layer is to flatten the output of the embedding layer, as in the sketch below. However, you are free to try other approaches.
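A PyTorch sketch of the two networks following the baseline table: the label goes through an embedding layer (layer 1), is flattened, and is concatenated with the noise vector (generator) or the sample (discriminator). The neuron counts and the 10 classes come from the tables above, while `EMB_DIM`, `NOISE_DIM`, and the ReLU/Sigmoid activations are assumptions not fixed by the table:
```python
import torch
import torch.nn as nn

NOISE_DIM, N_CLASSES, N_FEATURES, EMB_DIM = 100, 10, 34, 10  # EMB/NOISE sizes assumed

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(N_CLASSES, EMB_DIM)   # layer 1
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + EMB_DIM, 128), nn.ReLU(), # layer 2
            nn.Linear(128, 256), nn.ReLU(),                 # layer 3
            nn.Linear(256, 512), nn.ReLU(),                 # layer 4
            nn.Linear(512, N_FEATURES),                     # layer 5
        )

    def forward(self, z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Flatten the label embedding and concatenate it with the noise.
        return self.net(torch.cat([z, self.label_emb(y).flatten(1)], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(N_CLASSES, EMB_DIM)    # layer 1
        self.net = nn.Sequential(
            nn.Linear(N_FEATURES + EMB_DIM, 512), nn.ReLU(), # layer 2
            nn.Linear(512, 256), nn.ReLU(),                  # layer 3
            nn.Linear(256, 128), nn.ReLU(),                  # layer 4
            nn.Linear(128, 1), nn.Sigmoid(),                 # layer 5
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, self.label_emb(y).flatten(1)], dim=1))
```
Per the table, both networks are then trained with SGD at learning rate 0.0005 for 2000 epochs.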
## Classification Models
After generating network intrusion samples and balancing the data, it is important to see whether balancing the data helped to improve an ML classifier's performance. There are many ML classifiers that can be used; for this assignment we will consider only three: Random Forest, [Explainable Boosting Machine](https://interpret.ml/docs/ebm.html), and a classical neural network.
:::info
The ML classifiers are not to be tuned; use default parameters. For the neural network, use 2-5 hidden layers (keep it as simple as possible).
:::
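A sketch of instantiating the three classifiers with default parameters; `ExplainableBoostingClassifier` ships with the `interpret` package linked above, and the three-hidden-layer MLP is one arbitrary choice within the 2-5 layer guideline (a small PyTorch network would do equally well):
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from interpret.glassbox import ExplainableBoostingClassifier

models = {
    "random_forest": RandomForestClassifier(),
    "ebm": ExplainableBoostingClassifier(),
    # Classical neural network: 3 hidden layers, defaults otherwise.
    "mlp": MLPClassifier(hidden_layer_sizes=(128, 64, 32)),
}
```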
## Performance Metric
To measure the performance of the selected models, there exist a number of metrics studied in the machine learning course (e.g., MSE, RMSE, precision, recall, F1-score, and weighted F1-score). Furthermore, use K-fold cross-validation (e.g., K = 5) to make sure that the selected models and the GAN did not produce the results by chance, as in the sketch below.
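For example, with the `models` dictionary from the sketch above and preprocessed features `X` and labels `y` (both assumed here), 5-fold cross-validation with a weighted F1 score looks like:
```python
from sklearn.model_selection import cross_val_score

# X, y: preprocessed features and labels (see the sketches above).
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_weighted")
    print(f"{name}: weighted F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```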
## Report & Source Code
After performing the comparison of the machine learning models, the results should be presented in the form of a report. The implementation should be in Python using the deep learning framework used in the labs (PyTorch). The implementation repository should be available on GitHub or GitLab.
Your repository should contain:
- Train and test script
- Readme file (how to run the train & test script)
- Documentation (code documentation and Readme)
Your report should contain:
- Motivation, explanation of what a reader should expect from your report
- Brief task definition and data description
- If you use an alternative data input format, explain it
- Use graphs and tables to document the results of your experiments
- Conclusion and discussion of the results
The report should be submitted in PDF format. The document should be written in LaTeX, according to the ACM SIGPLAN two-column template.
:::info
In overleaf you can create the document using :
```\documentclass[sigplan, 11pt,screen,nonacm,natbib=false]{acmart}```
:::
## References
1. [InterpretML: A Unified Framework for Machine Learning Interpretability](https://arxiv.org/pdf/1909.09223.pdf)
2. [Variational autoencoders](https://www.jeremyjordan.me/variational-autoencoders/)
3. [Conditional Generative Adversarial Nets](https://arxiv.org/pdf/1411.1784.pdf)
4. [KL Divergence Python Example](https://towardsdatascience.com/kl-divergence-python-example-b87069e4b810)