Web client classification: human vs. bot: Notes

## Topic: Different machine learning techniques for web bot detection based on server logs

Papers:
* [Botnet Detection Based On Machine Learning Techniques Using DNS Query Data](https://www.mdpi.com/1999-5903/10/5/43/htm)
* [Online Web Bot Detection Using a Sequential Classification Approach](https://ieeexplore.ieee.org/abstract/document/8622990)
* [Forbes: How To Improve Bot Detection With Machine Learning](https://www.forbes.com/sites/louiscolumbus/2020/09/27/how-to-improve-bot-detection-with-machine-learning/?sh=363c5bd172d0)
* [Supervised Machine Learning Bot Detection Techniques to Identify Social Twitter Bots](https://scholar.smu.edu/datasciencereview/vol1/iss2/5/)
* [Web bots detection using Particle Swarm Optimization based clustering](https://ieeexplore.ieee.org/abstract/document/6900644)
* [Towards a framework for detecting advanced Web bots](https://www.ideal-cities.eu/wp-content/uploads/2019/10/Iliou_Towards_a_-framework_for_detecting_advanced_Web_bots.pdf) (Supervised ML)
* [A Graph-Based Machine Learning Approach for Bot Detection](https://arxiv.org/pdf/1902.08538.pdf) (two-phased, both supervised and unsupervised)
* ...

Datasets: https://www.kaggle.com/remosin/bot-detection

Useful: https://github.com/chetantanwar108/ml_project_on_IBM_BOT-DETECTION

Paper table:

| Technique | Paper | Info | Style |
| -------- | -------- | -------- | -------- |
| Deep Neural Network | [Online Web Bot Detection Using a Sequential Classification Approach](https://ieeexplore.ieee.org/abstract/document/8622990) | online detection | Supervised |
| Unsupervised Deep Neural Network | [Detection of malicious and non-malicious website visitors using unsupervised neural network learning](https://www.sciencedirect.com/science/article/abs/pii/S1568494612003778) | competitive learning (hard to follow) | Unsupervised |
| A big combination of techniques? (not understood yet, need help) | [Bot recognition in a Web store: An approach based on unsupervised learning](https://www.sciencedirect.com/science/article/pii/S1084804520300515) | | Unsupervised |
| Decision tree | [Real-time Web Crawler Detection](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5898963) | online detection | |
| Hidden Markov Model | [Web Robot Detection Based on Hidden Markov Model](https://ieeexplore.ieee.org/document/4064250) | not really machine learning but stochastic | |

### Project announcement

#### Introduction
As the years go by, website and infrastructure owners face an increasing number of threats to their product's lifecycle. The growing number of web-bots, automated programs that crawl websites on the Internet and gather information, can pose a threat to a site's security and performance. According to a 2016 study [https://www.imperva.com/blog/bot-traffic-report-2016/?redirect=Incapsula] based on 100,000 randomly selected domains, 51.8% of Internet traffic comes from web-bots. These bots can be either benign, like search engine bots, or malicious, like scrapers and hacking scanners. Web-bot detection is crucial for an organisation in order to remain secure, maintain a normal server load and keep its user experience at optimal levels. As technology and automation continue to advance, the number of bots crawling the Internet will continue to rise.

#### Web-bot detection approaches
Defense and detection mechanisms against web-bots, i.e. distinguishing a non-valid website visitor (a web-bot) from a valid one, are a widely researched topic. There have been intrusion detection system-based solutions, honeypot server solutions, statistical and stochastic techniques that predict the legitimacy of a user based on its behaviour, as well as simple blacklist solutions, where the server blacklists IPs known to belong to malicious bots.
In recent years, with the rise of machine learning (ML) and artificial intelligence (AI), much of the research has turned to ML techniques. A model is trained and deployed on the website, where it can detect malicious web-bots and restrict their access.

#### Techniques in the ML approach
In our research project, we present some common ML training techniques, namely supervised, unsupervised and semi-supervised learning. In each technique the model takes as input a dataset of handpicked features of the HTTP requests made during a client's HTTP session, e.g. user agent, IP and the file visited by the client. The model then learns from this dataset using specific ML techniques and algorithms.

In supervised learning, the model learns from a labeled training dataset. This means we give it as input a dataset where the HTTP sessions are already labeled as valid, i.e. a normal Internet user, or invalid, i.e. a web-bot. The model then finds patterns that distinguish the two classes.

In unsupervised learning, the model reads an unlabeled training dataset of HTTP session features and, without any prior knowledge, tries to extract information and recognize patterns that allow it to group the sessions into separate classes.

Semi-supervised learning is a combination of the two techniques above. In every technique there is usually some ground truth data and a set of rules on which the model can rely.

#### Performance and accuracy metrics
After analyzing the training techniques, we look at the performance metrics that evaluate the effectiveness and accuracy of the classification model. We put emphasis on the F-score, which is calculated from the precision, i.e. the number of correct bot detections out of the total number of positive detections, and the recall, i.e. the number of correct bot detections out of all web-bots in the test set.

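
As a minimal sketch of the metrics described above, the following computes precision, recall and the F1 score (the harmonic mean of precision and recall) from raw detection counts. The function name `precision_recall_f1` and the example counts are hypothetical and only serve as an illustration; they do not come from any of the papers listed here.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the 'bot' class.

    tp: bots correctly detected as bots
    fp: humans mistakenly detected as bots
    fn: bots mistakenly taken for humans
    """
    # Guard against division by zero when there are no positive detections
    # or no bots in the test set.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, chosen only for illustration.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Using the bot class as the positive class matches the definitions in the text; swapping the roles of bots and humans would give different precision and recall values for the same classifier.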
#### Experiment and conclusion
Then, we run our own experiment, training a model to classify users and detect web-bots. After training our model, we evaluate its performance using the above-mentioned F-score and compare it to other training techniques. Finally, we present our experiment results and draw the conclusion of our report.

---

## Machine Learning Algorithms:
- neural networks
- decision tree learning
- Bayesian network
- naive Bayes
- hidden Markov model (not really machine learning)
- Lagrange multiplier

## Possible Research questions:
* Do supervised learning techniques achieve the best F-measure for web bot detection compared to unsupervised and semi-supervised techniques?

## ToDo:
* Give a definition/explanation of your machine learning type. Explain the techniques from that ML type (e.g. Bayesian Network). How are the techniques used for web bot detection?

## Outline (Abstract)
* Abstract
* Intro to web-bots and web-bots in recent years
    * Online/Offline detection
    * ML techniques (Un-/Semi-/Supervised)
* Web-bot detection using ML, similar work
    * Supervised learning (Victor)
        * ...
    * Unsupervised learning (Mirac)
        * ...
    * Semi-supervised learning (Henry)
        * ...
* ML performance measurement (~)
    * Methods ...
    * Methods ...
    * F-score/F-measure
    * F-score/F-measure
    * Table of the F-scores of the mentioned methods
* Our experiment
    * Run Un-/Semi-/Supervised experiment
    * Compare the F-score of our result with the other methods
* Conclusion

## Meeting logs

> 30.11.2020 Logs:
```
1. Learn ML techniques (from papers etc.)
2. Focus on the "supervised/unsupervised" part
3. Understand the algorithms and find at least 1 unsupervised and 1 supervised style for each person
```

> 10.11.2020 Chat logs:
```
* Techniques for web-bot classification and prevention
* Different techniques --> Table --> Precision and recall (compare/evaluate) --> improvement?
* Find a dataset

> "State of the art"?
NIST TREC
11:37
precision and recall
11:44
https://www.iaas.uni-stuttgart.de/en/department-service-computing/studentprojects/instructions-guidelines/
```

---

### Additional papers (other topics)
* Detection/Classification (pls no)
    * https://www.sciencedirect.com/science/article/pii/S0950705120302318 (Initial paper)
    * https://ieeexplore.ieee.org/abstract/document/983028
    * http://www.scs-europe.net/dlib/2017/ecms2017acceptedpapers/0605-dis_ECMS2017_0126.pdf
    * [Web bots detection using Particle Swarm Optimization based clustering](https://ieeexplore.ieee.org/abstract/document/6900644)
    * [Web Usage Analysis and Web Bot Detection based on Outlier Detection](https://www.ijert.org/research/web-usage-analysis-and-web-bot-detection-based-on-outlier-detection-IJERTV4IS070064.pdf)
* Defense mechanisms
    * https://ieeexplore.ieee.org/abstract/document/6682692
    * https://arxiv.org/abs/1112.5605
    * https://www.researchgate.net/profile/Baljit_Singh_Saini/publication/272719923_A_Review_of_Bot_Protection_using_CAPTCHA_for_Web_Security/links/5d469e9ba6fdcc370a79e16e/A-Review-of-Bot-Protection-using-CAPTCHA-for-Web-Security.pdf
* Development
    * https://dl.acm.org/doi/abs/10.1145/3133850.3133864
    * https://books.google.de/books?hl=el&lr=&id=VSSKBAAAQBAJ&oi=fnd&pg=PA1&dq=web+bots+development&ots=9dO-b_odBe&sig=mJN02kBxTjRmLMF9NaVUHZWBvHU#v=onepage&q=web%20bots%20development&f=false
* Impact
    * https://ieeexplore.ieee.org/abstract/document/1381249/
* Different types of bots

---

### [Paper] Identifying legitimate Web users and bots with different traffic profiles — an Information Bottleneck approach

### Definitions:
* Clustering: https://www.geeksforgeeks.org/clustering-in-machine-learning/
* Information Bottleneck: https://en.wikipedia.org/wiki/Information_bottleneck_method
* Unsupervised (Machine) Learning: https://www.datarobot.com/wiki/unsupervised-machine-learning/

* Advanced Persistent Bots
    * headless (moderate sophistication)
    * browser simulation (high sophistication)
* IBBI: Information Bottleneck approach for web Bot Identification
    * Unsupervised ML
    * Relies on:
        * Fisher Score algorithm for feature selection/scoring
        * Information Bottleneck method for session clustering
    * ![Fig. 2](https://i.imgur.com/5dlZiaz.png)
    * ![Fig. 3](https://i.imgur.com/BdGK9x5.png)
    * Cluster labeling:
        * Majority class labeling: the cluster gets the most represented class label (i.e. either bot or human)
        * Threshold-based labeling: threshold X is the minimum percentage a class label must reach inside the cluster. If a class label is at or above the threshold, the cluster is labelled as this class. Else mixed_X.
    * Classification performance
        * True Positives (TP): correctly recognized bots
        * True Negatives (TN): correctly identified humans
        * False Positives (FP): humans mistakenly classified as bots
        * False Negatives (FN): bots mistakenly taken for humans

### Formulas:
* (1): Fisher score
* (2): Mutual information between X and ~X
* (3): Maximization of the IB functional
    * Bottom-up clustering tree.
    * Initial partition ~X = X. Merge two selected clusters at each step (with (3)?)
* (4): Pair of (3)\_before_cluster_merging and (3)\_after_cluster_merging with the biggest difference
* (5): Calculates (4)
* (6)-(9): help to calculate (5)
* (10): (3) but rewritten
* (11): Entropy of a cluster to assess clustering performance
* (12): Total clustering entropy for all k clusters
* (13): Recall - Fraction of correctly recognized bots among all bots
* (14): Precision - Fraction of correctly recognized bots among all positive classifications
* (15): Accuracy - Fraction of all correct classifications
* (16): F1 - Overall classifier performance

###### tags: `archive` `uni`
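
The two cluster labeling rules from the IBBI notes above (majority class and threshold-based labeling) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the threshold is assumed to be given in percent, and `cluster_entropy` uses the plain Shannon entropy of the class labels, which is only an assumption about the form of formula (11).

```python
from collections import Counter
from math import log2
from typing import List


def majority_label(labels: List[str]) -> str:
    """Majority class labeling: return the most represented class label."""
    # On a tie, Counter.most_common returns the label seen first.
    return Counter(labels).most_common(1)[0][0]


def threshold_label(labels: List[str], threshold_pct: int) -> str:
    """Threshold-based labeling with threshold X given in percent.

    The cluster gets the dominant class label if that label's share is at
    least X percent, otherwise it is labelled 'mixed_X'.
    """
    label, count = Counter(labels).most_common(1)[0]
    share_pct = 100 * count / len(labels)
    return label if share_pct >= threshold_pct else f"mixed_{threshold_pct}"


def cluster_entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of the class labels inside one cluster.

    Assumption: stands in for formula (11); 0 means a pure cluster.
    """
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


# Hypothetical cluster of 10 sessions: 8 bots and 2 humans.
cluster = ["bot"] * 8 + ["human"] * 2
print(majority_label(cluster))
print(threshold_label(cluster, 70))
print(threshold_label(cluster, 90))
print(round(cluster_entropy(cluster), 4))
```

A lower threshold labels more clusters as bot or human, while a higher threshold pushes impure clusters into the mixed category instead of forcing a decision.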
