# Cyber Data Analytics
## `21/04` Lecture 1
### What's it about
Dealing with huge data sets with few positives, frequently unlabeled, and potentially containing private information.
---
### Different models
Many kinds of models can be used; they can be grouped into three types based on how they learn: geometric, logical, and probabilistic.
---
### Books
- Peter Flach, "Machine Learning" (recommended)
- Hastie, Tibshirani & Friedman, "The Elements of Statistical Learning"
- Jiawei Han, "Data Mining: Concepts and Techniques"
- Leskovec, "Mining of Massive Datasets"
---
### Tools
- scikit-learn - Python (Jupyter notebooks)
- R
- Matlab (PRTools)
- **KNIME** or RapidMiner, tools with a GUI
---
### Schedule

No exam; the grade is calculated from the assignments. There are individual components in the lab assignments.
**Mattermost is used for communication**
---
### Grading

Overall you need a 6; there is no minimum per assignment.
---
### Communication
Communication through: mattermost.ewi.tudelft.nl/cs4035-18-19 (perhaps 19-20)
---
### Botnets
Generate huge amounts of data.
---
Classic ML fails in cybersec:
- Most data is legitimate: there are hardly any positive (attack) cases to learn from.
- Sequential: some traffic, like handshakes, depends on ordering; some doesn't. The data isn't independent.
- Data is massive and continuous: training takes time, resources are limited, etc.
- Privacy sensitive: processing such data is not always allowed, yet required. This poses a problem.
---
### Imbalanced data
The all-negative classifier reaches over 99% accuracy.
Solutions to this:
- Resampling: oversample the minority or undersample the majority class
- Reweighting: assign class weights
- Synthesizing: add artificial minority data
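A quick back-of-the-envelope check makes the point concrete; the 1000-transaction dataset with 5 fraud cases below is made up for illustration:

```python
# Accuracy of the trivial "all negative" classifier on imbalanced data.
labels = [1] * 5 + [0] * 995        # 1 = fraud, 0 = legitimate (toy numbers)

predictions = [0] * len(labels)     # predict "not fraud" for everything
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

print(accuracy)  # 0.995: over 99% accurate while catching zero fraud cases
```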
---
### Standard measures

**Which measure is the most important?** It depends on the context, in this case cybersec: every false positive means an unneeded investigation, so the False Positive Rate is where it's at.
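As a sketch of how these rates are computed (the confusion counts below are invented for illustration):

```python
# Confusion-matrix rates for a hypothetical fraud detector.
tp, fp, tn, fn = 40, 200, 9700, 60   # made-up counts

fpr = fp / (fp + tn)        # fraction of legitimate cases flagged: wasted investigations
tpr = tp / (tp + fn)        # recall: fraction of fraud actually caught
precision = tp / (tp + fp)  # fraction of alerts that are real fraud

print(fpr, tpr, precision)
```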
---
### Labeling
Label with ranks instead of binary labels. Where to put the threshold? It depends on the context, and you can still change it after the model has been built.
---
### Sequential data
Patterns can be observed and used to identify threats.
---
### Sequence alignment
Cool picture.

---
### Massive and continuous
Using hash functions, the data size can be reduced (a tradeoff with accuracy). To deal with large data, distributed data processing is also an option.
---
### Problem example

Solution: Bloom filter, a set of hash functions. (Further details were missed during the stream.) The idea is that you don't need to store the entire stream of data.
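A minimal Bloom filter sketch in Python; the bit-array size and the way the k hash functions are derived below are arbitrary choices, not the lecture's:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over a fixed-size bit array.
    Membership queries may give false positives but never false negatives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k hash functions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # All k bits set -> "probably seen"; any bit unset -> "definitely not seen".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for ip in ["10.0.0.1", "10.0.0.2"]:
    bf.add(ip)
print("10.0.0.1" in bf)  # True: items that were added are always found
```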
---
### N-grams
Combine past events into a state using sliding windows. Compute probabilities of sequences from the counts of their occurrences.

Always combine n-grams with smoothing, e.g. add one (Laplace).
Size too large? Use hashing.
Also possible on byte streams: byte n-grams.
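The counting and smoothing steps above can be sketched as follows (the event alphabet below is made up):

```python
from collections import Counter

def ngram_probs(sequence, n=2):
    """Sliding-window n-gram counts with add-one (Laplace) smoothing.
    Returns a function P(next symbol | previous n-1 symbols)."""
    alphabet = set(sequence)
    context_counts = Counter()
    ngram_counts = Counter()
    for i in range(len(sequence) - n + 1):
        gram = tuple(sequence[i:i + n])
        ngram_counts[gram] += 1
        context_counts[gram[:-1]] += 1

    def prob(context, symbol):
        # Add-one smoothing: unseen n-grams get a small nonzero probability.
        return (ngram_counts[context + (symbol,)] + 1) / \
               (context_counts[context] + len(alphabet))
    return prob

events = ["syn", "ack", "data", "syn", "ack", "fin"]
p = ngram_probs(events, n=2)
print(p(("syn",), "ack"))  # high: "ack" always follows "syn" in this stream
print(p(("syn",), "fin"))  # low but nonzero, thanks to smoothing
```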

---
### Lab 1: Fraud Detection
Given a dataset of credit card transactions: which transactions are fraudulent?
**Visualise your data in any way you can before you start: Make Heat maps**
Tips for the assignment, watch out for these variables:

## `28/04` Lecture 2
### Annoying questions
Should _ be _? Yes, look at the rubric.
Will the lecture be recorded? Yes, look at the top of your screen
---
### Assignment tips and tricks
Never ever use the class label; that is cheating. In some cases, even applying feature extraction to other features before splitting into test/training sets might be considered cheating.
For the lab, simple_journal and bookingdate are the class labels; using them is not allowed.
Random cross-validation on time-series/sequential data is useless: no model is necessary, because the held-out points can be filled in by extrapolating from the surrounding data:

So for time series evaluation:

A lot of data is necessary to do this, which is impractical for the lab, so just use cross-validation but keep the caveats in mind.
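The time-series evaluation idea can be sketched as an expanding-window split, always training on the past and testing on the future (the fold sizes below are illustrative choices):

```python
def time_series_splits(n_samples, n_folds=3, min_train=2):
    """Expanding-window splits over time-ordered samples: each fold trains on
    everything before the test window, never on anything after it."""
    fold_size = (n_samples - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        test_end = min(train_end + fold_size, n_samples)
        yield list(range(train_end)), list(range(train_end, test_end))

for train_idx, test_idx in time_series_splits(11, n_folds=3):
    print(train_idx, "->", test_idx)  # test indices always come after training
```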
---
## Imbalanced Data
To account for imbalanced data, we can act on three parts of the ML pipeline.
### Input Data
Should be known:
- PCA
- Feature selection
- Noise reduction (FFA)

New:
- Sampling
  - Oversample the minority
  - Undersample the majority
  - Try to reach 50/50
- Weighting
  - Large weights for the minority
  - Small weights for the majority
- Synthesizing
  - Add artificial minority data points
**Important: test methods on unmodified test set!**
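A minimal random-oversampling sketch (the function name and toy data are mine, not the lecture's); note that it is only applied to the training set:

```python
import random

def oversample_minority(data, labels, minority=1, seed=0):
    """Duplicate random minority examples until the classes are balanced.
    Apply to the training set only; evaluate on the untouched test set."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    # Draw enough random minority copies to reach a 50/50 split.
    extra = [rng.choice(minority_idx)
             for _ in range(len(majority_idx) - len(minority_idx))]
    keep = list(range(len(labels))) + extra
    return [data[i] for i in keep], [labels[i] for i in keep]

X = [[0.1], [0.2], [0.3], [0.9]]
y = [0, 0, 0, 1]                      # one fraud case among three normal ones
X_bal, y_bal = oversample_minority(X, y)
print(sum(y_bal), "fraud out of", len(y_bal))  # 3 fraud out of 6: balanced
```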
#### The effects of sampling
With a linear separator, oversampling increases the error contribution of the minority class, so the separator will shift.
For the lab:

Try to increase the oversampling until all fraud cases are detected.
Next, try the same but with undersampling of the non-fraud cases.
Oversampling issues:
- Overfitting
- Ignorance, some classifiers ignore copies/oversampling
Undersampling issues:
- Throwing away data; some examples might be essential for good performance
By sampling, we force the algorithm to learn on an even distribution. In reality, however, the fraud/non-fraud split is not even. But sometimes this is necessary.
Less random sampling is possible:

What works? Trial and error!
#### Reweighting
Similar as sampling, but often better.
#### Synthesizing
SMOTE: Synthetic Minority Oversampling Technique

It is a standard method to reach for when learning from imbalanced data: a quick solution, but limited to numeric features. Newly created data points are also included in the next iteration of SMOTE. You stop SMOTE'ing when your distribution reaches the desired balance.
The effect of SMOTE, before:

Choose a data point, choose a nearest neighbour, and add a data point along the vector between point and neighbour:

Different separator can be learned:

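The interpolation step can be sketched in a few lines of Python; this is a simplified SMOTE with toy fraud points and an arbitrary k, not a full implementation:

```python
import math
import random

def smote(minority_points, n_new, k=1, seed=0):
    """Simplified SMOTE: for each new point, pick a minority sample, pick one
    of its k nearest minority neighbours, and interpolate at a random position
    on the segment between them. Numeric features only."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority_points)
        neighbours = sorted(
            (q for q in minority_points if q is not p),
            key=lambda q: math.dist(p, q),
        )[:k]
        q = rng.choice(neighbours)
        t = rng.random()  # random position along the vector from p to q
        synthetic.append([a + t * (b - a) for a, b in zip(p, q)])
    return synthetic

fraud = [[1.0, 1.0], [2.0, 2.0], [1.5, 0.5]]   # toy minority class
new_points = smote(fraud, n_new=4, k=2)
print(len(new_points))  # 4 synthetic fraud points between the existing ones
```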
Removing Tomek links can be used in combination with SMOTE to improve results:

- When a point's nearest neighbour belongs to the other class, the pair forms a "Tomek link"
- Remove them

--> Supervised is more effective than unsupervised
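The Tomek-link idea can be sketched as follows (toy 1-D data; a real implementation would use an efficient neighbour search):

```python
import math

def tomek_links(points, labels):
    """Find Tomek links: pairs (i, j) from different classes that are each
    other's nearest neighbour. Removing them cleans the class boundary."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))

    links = []
    for i in range(len(points)):
        j = nearest(i)
        # Mutual nearest neighbours from different classes; i < j avoids duplicates.
        if labels[i] != labels[j] and nearest(j) == i and i < j:
            links.append((i, j))
    return links

X = [[0.0], [1.0], [1.1], [5.0]]
y = [0, 0, 1, 1]
print(tomek_links(X, y))  # [(1, 2)]: the points at 1.0 and 1.1 form a link
```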
Data imputation
- When you have missing data, you can fill it with the average of the other values in that column
- Generally not good
- Can also copy values from similar-looking records
- Better is to simulate the data
Synthetic data:
- Smote
- Model based
- MICE
### Output
Instead of modifying the data, we can also modify the output. Define a cost for false positives and negatives.
Learning rankers, such as naive Bayes, and setting the decision threshold to:
- optimize cost
- obtain a predefined false positive rate
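A sketch of picking a threshold for a target false positive rate from held-out ranker scores (the scores and labels below are invented):

```python
def threshold_for_fpr(scores, labels, target_fpr):
    """Pick a decision threshold whose false positive rate on held-out data
    stays at or below target_fpr. Scores: higher = more likely fraud."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0),
                       reverse=True)
    max_fp = int(target_fpr * len(negatives))  # allowed false positives
    if max_fp >= len(negatives):
        return min(scores)                     # every alert is allowed
    # Threshold just above the (max_fp + 1)-th highest negative score.
    return negatives[max_fp] + 1e-9

scores = [0.95, 0.90, 0.80, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    0,    0,    0,    0]
t = threshold_for_fpr(scores, labels, target_fpr=0.2)
flagged = [s >= t for s in scores]
print(flagged)  # both fraud cases flagged, one false positive (FPR = 1/5)
```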
Ensemble methods can be used: Learn multiple models and combine their prediction.
Different strategies:
- How to learn different models?
- How to combine the output?
Bagging. Learn models on different samples of the data and combine their output by majority vote. A popular example is random forest.
The basic random forest:

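A toy bagging sketch; the 1-nearest-neighbour base learner is my choice for brevity (random forest uses decision trees instead):

```python
import random

def bagging_predict(X_train, y_train, x, n_models=5, seed=0):
    """Bagging: train each model on a bootstrap sample of the training data,
    then combine the predictions by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap: sample the training set with replacement.
        idx = [rng.randrange(len(X_train)) for _ in range(len(X_train))]
        sample = [(X_train[i], y_train[i]) for i in idx]
        # Base learner: 1-NN, the label of the closest sampled point.
        nearest = min(sample, key=lambda pair: abs(pair[0] - x))
        votes.append(nearest[1])
    return max(set(votes), key=votes.count)  # majority vote

X = [0.0, 0.2, 0.4, 3.0, 3.2, 3.4]
y = [0,   0,   0,   1,   1,   1]
print(bagging_predict(X, y, 3.1))  # ensemble vote for a point near class 1
```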
#### Some ensemble methods
Bagging:

Stacking. Instead of splitting, use different algorithms on same data. Combine output as another learning problem -> learn 'smart' combinations. Makes things overly complicated sometimes and hard to explain.
Metacost. Use bagging, but take the outcome of the vote and compute new labels by minimizing the conditional cost: relabel x with argmin_i Σ_j P(j|x)·C(i,j). Then run the algorithm again on the relabeled data.
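Assuming the standard MetaCost relabelling rule argmin_i Σ_j P(j|x)·C(i,j), the relabelling step can be sketched as (the cost matrix below is invented):

```python
def metacost_relabel(class_probs, cost):
    """MetaCost relabelling step: given the bagged estimate P(j|x) for each
    class j and a cost matrix cost[i][j] (cost of predicting i when the truth
    is j), relabel x with the class that minimizes expected conditional cost."""
    n_classes = len(class_probs)
    expected = [sum(class_probs[j] * cost[i][j] for j in range(n_classes))
                for i in range(n_classes)]
    return min(range(n_classes), key=lambda i: expected[i])

# Toy costs: missing fraud (false negative) costs 10x a wasted investigation.
cost = [[0, 10],   # predict 0: free if truth is 0, expensive if fraud missed
        [1, 0]]    # predict 1: cheap investigation if truth is 0
print(metacost_relabel([0.85, 0.15], cost))  # 1: 15% fraud risk already outweighs
```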
Boosting.

### Algorithm
Nearest neighbour could be adapted with weights. The idea is that fraud cases are far from normal cases, but might be close to other fraud cases.
Sicco's work: learning by optimisation, use a greedy solution to find a model with high accuracy.