# Cyber Data Analytics
## `21/04` Lecture 1
### What's it about
Dealing with huge data sets with few positives, frequently unlabeled, and potentially containing private information.
---
### Different models
Many kinds of models can be used; they can be grouped into three types based on how they learn: geometric, logical, and probabilistic.
---
### Books
- Peter Flach, "Machine Learning" (recommended)
- Hastie, Tibshirani & Friedman, "The Elements of Statistical Learning"
- Jiawei Han, "Data Mining: Concepts and Techniques"
- Leskovec, "Mining of Massive Datasets"
---
### Tools
- scikit-learn - Python (Jupyter notebooks)
- R
- Matlab (PRTools)
- **KNIME** or RapidMiner, tools with a GUI
---
### Schedule

No exam; the grade is calculated from the assignments. There are individual components in the lab assignments.
**Mattermost is used for communication**
---
### Grading

Overall you need a 6; there is no minimum per assignment.
---
### Communication
Communication through: mattermost.ewi.tudelft.nl/cs4035-18-19 (perhaps 19-20)
---
### Botnets
Generate huge amounts of data.
---
Classic ML fails in cybersec:
- Most data is legitimate: there are hardly any positive (attack) cases to learn from.
- Sequential: some traffic, like handshakes, depends on ordering; some doesn't. The data isn't independent.
- Data is massive and continuous: training takes time, resources are limited, etc.
- Privacy sensitive: processing such data is not always allowed, yet required. This poses a problem.
---
### Imbalanced data
The all-negative classifier reaches over 99% accuracy.
Solutions to this:
- Resampling: oversample the minority or undersample the majority class
- Reweighting: assign class weights
- Synthesizing: add artificial minority data
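A quick back-of-the-envelope check makes the point concrete; the 1000-transaction dataset with 5 fraud cases below is made up for illustration:

```python
# Accuracy of the trivial "all negative" classifier on imbalanced data.
labels = [1] * 5 + [0] * 995        # 1 = fraud, 0 = legitimate (toy numbers)

predictions = [0] * len(labels)     # predict "not fraud" for everything
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

print(accuracy)  # 0.995: over 99% accurate while catching zero fraud cases
```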
---
### Standard measures

**Which measure is the most important?** It depends on the context, in this case cybersec: every false positive means an unneeded investigation, so the False Positive Rate is where it's at.
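As a sketch of how these rates are computed (the confusion counts below are invented for illustration):

```python
# Confusion-matrix rates for a hypothetical fraud detector.
tp, fp, tn, fn = 40, 200, 9700, 60   # made-up counts

fpr = fp / (fp + tn)        # fraction of legitimate cases flagged: wasted investigations
tpr = tp / (tp + fn)        # recall: fraction of fraud actually caught
precision = tp / (tp + fp)  # fraction of alerts that are real fraud

print(fpr, tpr, precision)
```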
---
### Labeling
Label with ranks instead of binary labels. Where to put the threshold? It depends on the context, and you can still change it after the model has been built.
---
### Sequential data
Patterns can be observed and used to identify threats.
---
### Sequence alignment
Cool picture.

---
### Massive and continuous
Using hash functions, the data size can be reduced (a tradeoff with accuracy). To deal with large data, distributed data processing is also an option.
---
### Problem example

Solution: Bloom filter, a set of hash functions. (Further details were missed during the stream.) The idea is that you don't need to store the entire stream of data.
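A minimal Bloom filter sketch in Python; the bit-array size and the way the k hash functions are derived below are arbitrary choices, not the lecture's:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over a fixed-size bit array.
    Membership queries may give false positives but never false negatives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k hash functions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # All k bits set -> "probably seen"; any bit unset -> "definitely not seen".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for ip in ["10.0.0.1", "10.0.0.2"]:
    bf.add(ip)
print("10.0.0.1" in bf)  # True: items that were added are always found
```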
---
### N-grams
Combine past events into a state using sliding windows. Compute probabilities of sequences from the counts of their occurrences.

Always combine n-grams with smoothing, e.g. add one (Laplace).
Size too large? Use hashing.
Also possible on byte streams: byte n-grams.
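The counting and smoothing steps above can be sketched as follows (the event alphabet below is made up):

```python
from collections import Counter

def ngram_probs(sequence, n=2):
    """Sliding-window n-gram counts with add-one (Laplace) smoothing.
    Returns a function P(next symbol | previous n-1 symbols)."""
    alphabet = set(sequence)
    context_counts = Counter()
    ngram_counts = Counter()
    for i in range(len(sequence) - n + 1):
        gram = tuple(sequence[i:i + n])
        ngram_counts[gram] += 1
        context_counts[gram[:-1]] += 1

    def prob(context, symbol):
        # Add-one smoothing: unseen n-grams get a small nonzero probability.
        return (ngram_counts[context + (symbol,)] + 1) / \
               (context_counts[context] + len(alphabet))
    return prob

events = ["syn", "ack", "data", "syn", "ack", "fin"]
p = ngram_probs(events, n=2)
print(p(("syn",), "ack"))  # high: "ack" always follows "syn" in this stream
print(p(("syn",), "fin"))  # low but nonzero, thanks to smoothing
```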

---
### Lab 1: Fraud Detection
Given a dataset of credit card transactions: which transactions are fraudulent?
**Visualise your data in any way you can before you start: Make Heat maps**
Tips for the assignment, watch out for these variables:

## `28/04` Lecture 2
### Annoying questions
Should _ be _? Yes, look at the rubric.
Will the lecture be recorded? Yes, look at the top of your screen
---
### Assignment tips and tricks
Never ever use the class label; that is cheating. In some cases, even applying feature extraction to other features before splitting into test/training sets might be considered cheating.
For the lab, simple_journal and bookingdate are the class labels; using them is not allowed.
Random cross-validation on time-series/sequential data is useless: no model is necessary, because the held-out points can be filled in by extrapolating from the surrounding data:

So for time series evaluation:

A lot of data is necessary to do this, which is impractical for the lab, so just use cross-validation but keep the caveats in mind.
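The time-series evaluation idea can be sketched as an expanding-window split, always training on the past and testing on the future (the fold sizes below are illustrative choices):

```python
def time_series_splits(n_samples, n_folds=3, min_train=2):
    """Expanding-window splits over time-ordered samples: each fold trains on
    everything before the test window, never on anything after it."""
    fold_size = (n_samples - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        test_end = min(train_end + fold_size, n_samples)
        yield list(range(train_end)), list(range(train_end, test_end))

for train_idx, test_idx in time_series_splits(11, n_folds=3):
    print(train_idx, "->", test_idx)  # test indices always come after training
```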
---
## Imbalanced Data
To account for imbalanced data, we can act on three parts of the ML pipeline.
### Input Data
Should be known:
- PCA
- Feature selection
- Noise reduction (FFA)

New:
- Sampling
  - Oversample the minority
  - Undersample the majority
  - Try to reach 50/50
- Weighting
  - Large weights for the minority
  - Small weights for the majority
- Synthesizing
  - Add artificial minority data points
**Important: test methods on unmodified test set!**
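A minimal random-oversampling sketch (the function name and toy data are mine, not the lecture's); note that it is only applied to the training set:

```python
import random

def oversample_minority(data, labels, minority=1, seed=0):
    """Duplicate random minority examples until the classes are balanced.
    Apply to the training set only; evaluate on the untouched test set."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    # Draw enough random minority copies to reach a 50/50 split.
    extra = [rng.choice(minority_idx)
             for _ in range(len(majority_idx) - len(minority_idx))]
    keep = list(range(len(labels))) + extra
    return [data[i] for i in keep], [labels[i] for i in keep]

X = [[0.1], [0.2], [0.3], [0.9]]
y = [0, 0, 0, 1]                      # one fraud case among three normal ones
X_bal, y_bal = oversample_minority(X, y)
print(sum(y_bal), "fraud out of", len(y_bal))  # 3 fraud out of 6: balanced
```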
#### The effects of sampling
With a linear separator, oversampling increases the error contribution of the minority class, so the separator will shift.
For the lab:

Try to increase the oversampling until all fraud cases are detected.
Next, try the same but with undersampling of the non-fraud cases.
Oversampling issues:
- Overfitting
- Ignorance, some classifiers ignore copies/oversampling
Undersampling issues:
- Throwing away data; some examples might be essential for good performance
By sampling, we force the algorithm to learn on an even distribution. In reality, however, the fraud/non-fraud split is not even. But sometimes this is necessary.
Less random sampling is possible:

What works? Trial and error!
#### Reweighting
Similar as sampling, but often better.
#### Synthesizing
SMOTE: Synthetic Minority Oversampling Technique

It is a standard method to reach for when learning from imbalanced data: a quick solution, but limited to numeric features. Newly created data points are also included in the next iteration of SMOTE. You stop SMOTE'ing when your distribution reaches the desired balance.
The effect of SMOTE, before:

Choose a data point, choose a nearest neighbour, and add a data point along the vector between point and neighbour:

Different separator can be learned:

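The interpolation step can be sketched in a few lines of Python; this is a simplified SMOTE with toy fraud points and an arbitrary k, not a full implementation:

```python
import math
import random

def smote(minority_points, n_new, k=1, seed=0):
    """Simplified SMOTE: for each new point, pick a minority sample, pick one
    of its k nearest minority neighbours, and interpolate at a random position
    on the segment between them. Numeric features only."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority_points)
        neighbours = sorted(
            (q for q in minority_points if q is not p),
            key=lambda q: math.dist(p, q),
        )[:k]
        q = rng.choice(neighbours)
        t = rng.random()  # random position along the vector from p to q
        synthetic.append([a + t * (b - a) for a, b in zip(p, q)])
    return synthetic

fraud = [[1.0, 1.0], [2.0, 2.0], [1.5, 0.5]]   # toy minority class
new_points = smote(fraud, n_new=4, k=2)
print(len(new_points))  # 4 synthetic fraud points between the existing ones
```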
Removing Tomek links can be used in combination with SMOTE to improve results:

- When a point's nearest neighbour belongs to the other class, the pair forms a "Tomek link"
- Remove them

--> Supervised is more effective than unsupervised
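The Tomek-link idea can be sketched as follows (toy 1-D data; a real implementation would use an efficient neighbour search):

```python
import math

def tomek_links(points, labels):
    """Find Tomek links: pairs (i, j) from different classes that are each
    other's nearest neighbour. Removing them cleans the class boundary."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))

    links = []
    for i in range(len(points)):
        j = nearest(i)
        # Mutual nearest neighbours from different classes; i < j avoids duplicates.
        if labels[i] != labels[j] and nearest(j) == i and i < j:
            links.append((i, j))
    return links

X = [[0.0], [1.0], [1.1], [5.0]]
y = [0, 0, 1, 1]
print(tomek_links(X, y))  # [(1, 2)]: the points at 1.0 and 1.1 form a link
```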
Data imputation
- When you have missing data, you can fill it with the average of the other values in that column
- Generally not good
- Can also copy values from similar-looking records
- Better is to simulate the data
Synthetic data:
- Smote
- Model based
- MICE
### Output
Instead of modifying the data, we can also modify the output. Define a cost for false positives and negatives.
Learning rankers, such as naive Bayes, and setting the decision threshold to:
- optimize cost
- obtain a predefined false positive rate
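A sketch of picking a threshold for a target false positive rate from held-out ranker scores (the scores and labels below are invented):

```python
def threshold_for_fpr(scores, labels, target_fpr):
    """Pick a decision threshold whose false positive rate on held-out data
    stays at or below target_fpr. Scores: higher = more likely fraud."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0),
                       reverse=True)
    max_fp = int(target_fpr * len(negatives))  # allowed false positives
    if max_fp >= len(negatives):
        return min(scores)                     # every alert is allowed
    # Threshold just above the (max_fp + 1)-th highest negative score.
    return negatives[max_fp] + 1e-9

scores = [0.95, 0.90, 0.80, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    0,    0,    0,    0]
t = threshold_for_fpr(scores, labels, target_fpr=0.2)
flagged = [s >= t for s in scores]
print(flagged)  # both fraud cases flagged, one false positive (FPR = 1/5)
```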
Ensemble methods can be used: Learn multiple models and combine their prediction.
Different strategies:
- How to learn different models?
- How to combine the output?
Bagging. Learn models on different samples of the data and combine their output by majority vote. A popular example is random forest.
The basic random forest:

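A toy bagging sketch; the 1-nearest-neighbour base learner is my choice for brevity (random forest uses decision trees instead):

```python
import random

def bagging_predict(X_train, y_train, x, n_models=5, seed=0):
    """Bagging: train each model on a bootstrap sample of the training data,
    then combine the predictions by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap: sample the training set with replacement.
        idx = [rng.randrange(len(X_train)) for _ in range(len(X_train))]
        sample = [(X_train[i], y_train[i]) for i in idx]
        # Base learner: 1-NN, the label of the closest sampled point.
        nearest = min(sample, key=lambda pair: abs(pair[0] - x))
        votes.append(nearest[1])
    return max(set(votes), key=votes.count)  # majority vote

X = [0.0, 0.2, 0.4, 3.0, 3.2, 3.4]
y = [0,   0,   0,   1,   1,   1]
print(bagging_predict(X, y, 3.1))  # ensemble vote for a point near class 1
```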
#### Some ensemble methods
Bagging:

Stacking. Instead of splitting, use different algorithms on same data. Combine output as another learning problem -> learn 'smart' combinations. Makes things overly complicated sometimes and hard to explain.
Metacost. Use bagging, but take the outcome of the vote and compute new labels by minimizing the conditional cost: relabel x with argmin_i Σ_j P(j|x)·C(i,j). Then run the algorithm again on the relabeled data.
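Assuming the standard MetaCost relabelling rule argmin_i Σ_j P(j|x)·C(i,j), the relabelling step can be sketched as (the cost matrix below is invented):

```python
def metacost_relabel(class_probs, cost):
    """MetaCost relabelling step: given the bagged estimate P(j|x) for each
    class j and a cost matrix cost[i][j] (cost of predicting i when the truth
    is j), relabel x with the class that minimizes expected conditional cost."""
    n_classes = len(class_probs)
    expected = [sum(class_probs[j] * cost[i][j] for j in range(n_classes))
                for i in range(n_classes)]
    return min(range(n_classes), key=lambda i: expected[i])

# Toy costs: missing fraud (false negative) costs 10x a wasted investigation.
cost = [[0, 10],   # predict 0: free if truth is 0, expensive if fraud missed
        [1, 0]]    # predict 1: cheap investigation if truth is 0
print(metacost_relabel([0.85, 0.15], cost))  # 1: 15% fraud risk already outweighs
```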
Boosting.

### Algorithm
Nearest neighbour could be adapted with weights. The idea is that fraud cases are far from normal cases, but might be close to other fraud cases.
Sicco's work: learning by optimisation, use a greedy solution to find a model with high accuracy.