Viktor M.
# Web client classification: human vs. bot: Notes

## Topic: Different machine learning techniques for web bot detection based on server logs

Papers:
* [Botnet Detection Based On Machine Learning Techniques Using DNS Query Data](https://www.mdpi.com/1999-5903/10/5/43/htm)
* [Online Web Bot Detection Using a Sequential Classification Approach](https://ieeexplore.ieee.org/abstract/document/8622990)
* [Forbes: How To Improve Bot Detection With Machine Learning](https://www.forbes.com/sites/louiscolumbus/2020/09/27/how-to-improve-bot-detection-with-machine-learning/?sh=363c5bd172d0)
* [Supervised Machine Learning Bot Detection Techniques to Identify Social Twitter Bots](https://scholar.smu.edu/datasciencereview/vol1/iss2/5/)
* [Web bots detection using Particle Swarm Optimization based clustering](https://ieeexplore.ieee.org/abstract/document/6900644)
* [Towards a framework for detecting advanced Web bots](https://www.ideal-cities.eu/wp-content/uploads/2019/10/Iliou_Towards_a_-framework_for_detecting_advanced_Web_bots.pdf) (Supervised ML)
* [A Graph-Based Machine Learning Approach for Bot Detection](https://arxiv.org/pdf/1902.08538.pdf) (two-phased, both supervised and unsupervised)
* ...

Datasets: https://www.kaggle.com/remosin/bot-detection

Useful: https://github.com/chetantanwar108/ml_project_on_IBM_BOT-DETECTION

Paper table:

| Technique | Paper | Info | Style |
| -------- | -------- | -------- | -------- |
| Deep Neural Network | [Online Web Bot Detection Using a Sequential Classification Approach](https://ieeexplore.ieee.org/abstract/document/8622990) | online detection | Supervised |
| Unsupervised Deep Neural Network | [Detection of malicious and non-malicious website visitors using unsupervised neural network learning](https://www.sciencedirect.com/science/article/abs/pii/S1568494612003778) | competitive learning (hard to follow) | Unsupervised |
| A big combination of techniques? (not understood yet) | [Bot recognition in a Web store: An approach based on unsupervised learning](https://www.sciencedirect.com/science/article/pii/S1084804520300515) | | Unsupervised |
| Decision tree | [Real-time Web Crawler Detection](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5898963) | online detection | |
| Hidden Markov Model | [Web Robot Detection Based on Hidden Markov Model](https://ieeexplore.ieee.org/document/4064250) | not really machine learning but stochastic | |

### Project announcement

#### Introduction

Website and infrastructure owners face a growing number of threats to their products. Web bots, automated programs that crawl websites on the Internet and gather information, are becoming more numerous and can threaten a site's security and performance. According to a [2016 study](https://www.imperva.com/blog/bot-traffic-report-2016/?redirect=Incapsula) based on 100,000 randomly selected domains, 51.8% of website traffic comes from web bots. These bots can be benign, like search engine bots, or malicious, like scrapers and hacking scanners. Web bot detection is therefore crucial for an organisation that wants to remain secure, maintain a normal server load and keep its user experience at an optimal level. As technology and automation continue to advance, the number of bots crawling the Internet will continue to rise.

#### Web-bot detection approaches

Defense and detection mechanisms against web bots, i.e. ways to distinguish a non-valid website visitor (a web bot) from a valid one, are a widely researched topic. Proposed solutions include intrusion detection systems, honeypot servers, statistical and stochastic techniques that predict the legitimacy of a user from its behaviour, and simple blacklists, where the server blocks IP addresses known to belong to malicious bots.

In recent years, with the rise of machine learning (ML) and artificial intelligence (AI), much of the research has turned to ML techniques. A model that has been trained and deployed on the website can detect malicious web bots and restrict their access.

#### Techniques in the ML approach

In our research project, we present some of the common ML training techniques, namely supervised, unsupervised and semi-supervised learning. With each technique, the model takes as input a dataset of handpicked features of the HTTP requests made during a client's HTTP session, e.g. the user agent, the IP address and the files requested by the client. The model then learns from this dataset using the algorithms of the respective technique.

In supervised learning, the model learns from a labeled training dataset. This means we give it a dataset in which the HTTP sessions are already labeled as valid, i.e. a normal Internet user, or invalid, i.e. a web bot. The model then finds patterns that let it distinguish the two classes.

In unsupervised learning, the model reads an unlabeled training dataset of HTTP session features and, without any prior knowledge, tries to extract information and recognize patterns that allow it to group the sessions into separate classes.

Semi-supervised learning is a combination of the two techniques above. In every technique there is usually some ground-truth data and a set of rules on which the model can rely.

#### Performance and accuracy metrics

After analyzing the training techniques, we look at the performance metrics that evaluate the effectiveness and accuracy of a classification model. We give particular weight to the F-score, which is calculated from precision, i.e. the number of correct bot detections out of the total number of positive detections, and recall, i.e. the number of correct bot detections out of all the web bots in the test set.
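
As a rough illustration of the metrics just described, the sketch below computes precision, recall and the F1 variant of the F-score from raw detection counts. The function name `precision_recall_f1` and the example counts are made up for this note, and F1 (the harmonic mean of precision and recall) is assumed to be the F-score meant here.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the 'bot' class.

    tp: bots correctly detected as bots
    fp: humans wrongly detected as bots
    fn: bots that were missed (classified as humans)
    """
    # Precision: correct bot detections out of all positive detections.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: correct bot detections out of all bots in the test set.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall (0 if both are 0).
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, only to show the call; not results from any paper.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is a reasonable single number for comparing detectors on imbalanced bot/human data.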

#### Experiment and conclusion

We then run our own experiment, training a model to classify users and detect web bots. After training, we evaluate the model's performance with the F-score mentioned above and compare it with the results of other training techniques. Finally, we present our experiment results and conclude the report.

---

## Machine Learning Algorithms:
- neural networks
- decision tree learning
- Bayesian network
- naive Bayes
- hidden Markov model (not really machine learning)
- Lagrange multiplier

<!-- ## Table of content
* Commonalities (e.g. the HTTP request feature extraction technique)
-->

## Possible Research questions:
* Do supervised learning techniques achieve the best F-measure for web bot detection in comparison to unsupervised and semi-supervised techniques?

## ToDo:
* Give a definition/explanation of your machine learning type. Explain the techniques from that ML type (e.g. Bayesian Network). How are the techniques used for web bot detection?

## Outline (abstract)
* Abstract
* Intro to web bots and web bots in recent years
    * Online/Offline detection
    * ML techniques (Un-/Semi-/Supervised)
* Web-bot detection using ML, similar work
    * Supervised learning (Victor)
        * ...
    * Unsupervised learning (Mirac)
        * ...
    * Semi-supervised learning (Henry)
        * ...
* ML performance measurement (~)
    * Methods ...
    * Methods ...
    * F-score/F-measurement
    * F-score/F-measurement
    * Table of the F-scores of the mentioned methods
* Our experiment
    * Run Un-/Semi-/Supervised experiment
    * Compare the F-score of our result with the other methods
* Conclusion

## Meeting logs

> 30.11.2020 Logs:
```
1. Learn the ML techniques (from papers etc.)
2. Focus on the "supervised/unsupervised" part
3. Understand the algorithms and find at least 1 unsupervised and 1 supervised style for everyone
```

> 10.11.2020 Chat logs:
```
* Techniques for web-bot classification and prevention
* Different techniques --> Table --> Precision and recall (compare/evaluate) --> improvement?
* Find a dataset
> "State of the art"?
NIST TREC
11:37
precision and recall
11:44
https://www.iaas.uni-stuttgart.de/en/department-service-computing/studentprojects/instructions-guidelines/
```

<!--
#### Questions:
1. ~~Just web-bots or bots in general?~~
2. ~~Only the classification/identification of web-bots or also other directions?~~
3. ...

#### Topic Ideas:
* ~~A research on CAPTCHAs *(I just found a lot of papers about it :P)*~~
* ~~Different bot types and their behaviour *(maybe too simple/too little)*~~
* **Different (machine learning?) techniques for web bot detection/classification**
* ~~Differences between simple and advanced web bots and their operations/behaviour *(maybe too simple/too little)*~~

#### Possible Topics
* ~~Study (survey) on human detection of a (self-implemented?) chat-bot~~
* Are supervised learning techniques better than unsupervised learning techniques in detecting bots on social media platforms?
-->

---

### Additional papers (other topics)
* Detection/Classification (pls no)
    * https://www.sciencedirect.com/science/article/pii/S0950705120302318 (Initial paper)
    * https://ieeexplore.ieee.org/abstract/document/983028
    * http://www.scs-europe.net/dlib/2017/ecms2017acceptedpapers/0605-dis_ECMS2017_0126.pdf
    * [Web bots detection using Particle Swarm Optimization based clustering](https://ieeexplore.ieee.org/abstract/document/6900644)
    * [Web Usage Analysis and Web Bot Detection based on Outlier Detection](https://www.ijert.org/research/web-usage-analysis-and-web-bot-detection-based-on-outlier-detection-IJERTV4IS070064.pdf)
* Defense mechanisms
    * https://ieeexplore.ieee.org/abstract/document/6682692
    * https://arxiv.org/abs/1112.5605
    * https://www.researchgate.net/profile/Baljit_Singh_Saini/publication/272719923_A_Review_of_Bot_Protection_using_CAPTCHA_for_Web_Security/links/5d469e9ba6fdcc370a79e16e/A-Review-of-Bot-Protection-using-CAPTCHA-for-Web-Security.pdf
* Development
    * https://dl.acm.org/doi/abs/10.1145/3133850.3133864
    * https://books.google.de/books?hl=el&lr=&id=VSSKBAAAQBAJ&oi=fnd&pg=PA1&dq=web+bots+development&ots=9dO-b_odBe&sig=mJN02kBxTjRmLMF9NaVUHZWBvHU#v=onepage&q=web%20bots%20development&f=false
* Impact
    * https://ieeexplore.ieee.org/abstract/document/1381249/
* Different types of bots

---

### [Paper] Identifying legitimate Web users and bots with different traffic profiles - an Information Bottleneck approach

### Definitions:
* Clustering: https://www.geeksforgeeks.org/clustering-in-machine-learning/
* Information Bottleneck: https://en.wikipedia.org/wiki/Information_bottleneck_method
* Unsupervised (Machine) Learning: https://www.datarobot.com/wiki/unsupervised-machine-learning/

### Notes:
* Advanced Persistent Bots
    * headless (moderate sophistication)
    * browser simulation (high sophistication)
* IBBI: Information Bottleneck approach for web Bot Identification
    * Unsupervised ML
    * Relies on:
        * Fisher Score algorithm for feature selection/scoring
        * Information Bottleneck method for session clustering
    * ![Fig. 2](https://i.imgur.com/5dlZiaz.png)
    * ![Fig. 3](https://i.imgur.com/BdGK9x5.png)
* Cluster labeling:
    * Majority class labeling: the cluster gets the most represented class label (i.e. either bot or human)
    * Threshold-based labeling: threshold X is the minimum percentage a class label must reach inside the cluster. If a class label is above or equal to the threshold, the cluster is labelled as this class. Otherwise it is labelled mixed_X.
* Classification performance
    * True Positives (TP): correctly recognized bots
    * True Negatives (TN): correctly identified humans
    * False Positives (FP): humans mistakenly classified as bots
    * False Negatives (FN): bots mistakenly taken for humans

### Formulas:
* (1): Fisher score
* (2): Mutual information between X and ~X
* (3): Maximization of the IB functional
    * Bottom-up clustering tree.
    * Initial partition ~X = X. Merge two selected clusters at each step (with (3)?)
* (4): Pair of (3)\_before_cluster_merging and (3)\_after_cluster_merging with the biggest difference of
* (5): Calculates (4)
* (6)-(9): help to calculate (5)
* (10): (3) but rewritten
* (11): Entropy of a cluster to assess clustering performance
* (12): Total clustering entropy for all k clusters
* (13): Recall - Fraction of correctly recognized bots among all bots
* (14): Precision - Fraction of correctly recognized bots among all positive classifications
* (15): Accuracy - Fraction of all correct classifications
* (16): F1 - Overall classifier performance

###### tags: `archive` `uni`
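
The two cluster labeling rules and the entropy measures (11) and (12) from the paper notes above can be sketched roughly as follows. This is only an illustration under assumptions: the function names and toy clusters are made up, the threshold is taken as a whole-number percentage, and the entropy is assumed to be the usual Shannon entropy (base 2) with the total weighted by cluster size. The paper's exact formulas are not written out in these notes and may differ.

```python
import math
from collections import Counter
from typing import List


def majority_label(labels: List[str]) -> str:
    """Majority class labeling: the most represented label wins.
    On a tie, the label seen first in the cluster is returned."""
    return Counter(labels).most_common(1)[0][0]


def threshold_label(labels: List[str], threshold: int) -> str:
    """Threshold-based labeling: the top label must cover at least
    `threshold` percent of the cluster, otherwise 'mixed_<threshold>'."""
    label, count = Counter(labels).most_common(1)[0]
    # Integer comparison avoids floating-point rounding at the boundary.
    if count * 100 >= threshold * len(labels):
        return label
    return f"mixed_{threshold}"


def cluster_entropy(labels: List[str]) -> float:
    """Assumed form of (11): Shannon entropy of the labels in one cluster."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def total_entropy(clusters: List[List[str]]) -> float:
    """Assumed form of (12): cluster entropies weighted by cluster size."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)


# Toy clusters of ground-truth session labels (made up for illustration).
clusters = [
    ["bot"] * 9 + ["human"],
    ["human"] * 4,
    ["bot", "bot", "human", "human"],
]
for c in clusters:
    print(majority_label(c), threshold_label(c, 90), round(cluster_entropy(c), 3))
print("total entropy:", round(total_entropy(clusters), 3))
```

A pure cluster has entropy 0 and an evenly split two-class cluster has entropy 1, so a lower total entropy means the clustering separates bots and humans more cleanly.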
