# Paper presentation
## Slide 1
A very good afternoon to everyone present here, and to the respected panel members. I am Yadhu Krishna. On behalf of my fellow authors - Sanjana S and Thushara MG from Amrita Vishwa Vidyapeetham - I am here to present our research work titled "Ad Service Detection - A Comparative Study of Machine Learning Techniques", Paper ID 300, Serial No. 243. Before beginning the presentation, I would like to thank IEEE and ICCCNT for providing this opportunity.
## Slide 2
Here is the presentation outline. First, we will go through an introduction; then we present the research contributions made by our paper, followed by a literature review and the methodology; and towards the end, we discuss the results, conclusion, and future work.
## Slide 3 - Introduction
Web services have become increasingly important to our
daily lives. These include social media applications, chat
applications and other utility software and services. In addition,
the trend of integrating third-party applications has been
rising steadily.
These applications might appear free to the
users; however, they collect information about the user. The
information collected in this manner could include personally
identifiable information about the user, user behavior and
other sensitive information. The collected information can,
in turn, be sold to other third parties or used to display
personalized advertisements and generate revenue. These practices
significantly affect user privacy, which has thus become one
of the major concerns of internet users these days. Ad blockers
or content blockers were developed to overcome this problem.
However, the currently available ad blockers are based purely on a predefined blacklist of domains. This study proposes a machine
learning-based approach to detect web-based ad services and
trackers. In order to help us explore the area using machine
learning techniques, a dataset was created. The developed dataset
contained 74,000 entries comprising both domains that served
advertisements and normal domains. In addition, we have also
compared the scores that were obtained by various supervised
machine learning models, and the best model was identified.
The best model offered up to 88% accuracy in classifying ad
services and normal websites. The model contributed by this
paper could be further developed into a machine learning-based
advertisement blocker system.
## Slide 4 - Research Contributions
Traditional ad blockers work by blocking network access
to domains that are present in an internal blacklist. However,
the blacklist is mainly manually curated by reports from
individual users. We identified three major problems posed by
the pure blacklist-based approach, which in turn is the major
motivation behind this research.
The issues are:
+ Performance Issues
The number of advertisement services is constantly
increasing, and each new service would need to be added
to the blacklist. This grows the size of the blacklist and
impacts the performance of the browser or the ad blocker application.
+ Inefficiency of static blacklists
As the number of ad services and trackers
rises indefinitely, the set of domains to be blocked
cannot remain constant. Moreover, since the blacklist is curated
from reports by individuals, it is prone to errors
and bias.
+ Frequent updates to blacklists
Since the list of domains that needs
to be blocked is dynamic, frequent updates would be
required to maintain the correctness and efficiency of
the blacklist database installed in the ad blocker.
The approach we provide through this research work can be combined with classical ad blockers in order to attain higher accuracy.
We also provide a comparison of various machine learning algorithms applied to the problem, namely Logistic Regression, Random Forest, Support Vector Machines, the Decision Tree classifier, and the K-Nearest Neighbors classifier.
Now let us move to the Literature Review.
## Literature Review
A good amount of research
has been conducted in this field; however, the application
of Natural Language Processing (NLP) to the field remains largely
unexplored, which we intend to illustrate with this work.
Iqbal et al. [1] proposed a graph-based machine learning approach to automatically detect and block ads and trackers on
the web. It is based on constructing a graph from various
parts of an HTML page, including JavaScript, HTTP, and
HTML.
Bhagavatula et al. [2] trained a supervised machine
learning model based on keywords extracted from HTTP query
strings for detecting advertisements.
Zain ul abi Din et al.
proposed an ad blocker system based on deep learning [3].
The system works by intercepting and processing every image
that is obtained during execution of the web page.
Mughees et al. [4] performed a study using patterns observed during DOM
changes; these patterns were used to train a machine learning model for
detecting anti-ad blockers.
Gugelmann et al. [5] introduced a
machine learning based method for classifying web tracking and advertising (WTA) requests.
WTAGraph [6] represented the execution of a web page as a
graph and then classifiers were trained based on it to detect
advertising resources.
Hieu Le et al. [7] proposed AutoFR, which
relies on reinforcement learning (RL) to generate filter rules
for blocking unwanted URLs.
## Slide 5 - Methodology 1, 2
As a first step, we collect the required data. The dataset is compiled from two different sources.
Ad blocker dataset - This is a blacklist used by existing ad blockers. These domains are considered to serve advertisements and trackers.
Alexa Top 1M websites - This is a curated list of the top websites, collected from Amazon Alexa. These are considered to be non-ad-serving.
We then perform web scraping on all the domains to collect the necessary data. There are mainly two kinds of features - static and dynamic. Static features are those extracted directly from the domain name. Dynamic features are those extracted from the web page by querying it.
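As an illustration, static features of this kind can be computed from the domain name alone. This is a minimal sketch: the feature names and the ad-keyword pattern are illustrative assumptions, not the paper's exact feature set.

```python
import re

# Hypothetical static features derived purely from the domain name.
# Feature names and the ad-keyword pattern are illustrative assumptions.
def static_features(domain: str) -> dict:
    labels = domain.split(".")
    return {
        "domain_length": len(domain),                    # total length
        "num_subdomains": max(len(labels) - 2, 0),       # "ads.example.com" -> 1
        "num_digits": sum(c.isdigit() for c in domain),  # digit count
        "num_hyphens": domain.count("-"),                # hyphen count
        # 1 if an ad-related keyword appears at a word boundary
        "has_ad_keyword": int(bool(re.search(r"\b(ads?|track|banner)", domain))),
    }

print(static_features("ads.example.com"))
```

Features of this shape are cheap to compute since no network request is needed, which is what distinguishes them from the dynamic features.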
## Slide 6 - Features
The table lists the different features present in the dataset, along with an indication of whether each feature is static or dynamic.
## Slide 7 - Methodology 3
Then we performed Exploratory data analysis (EDA) on the dataset. Various graphs and plots were drawn to visualize the data. We also constructed a word cloud to identify important keywords on both the domains.
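At its core, the word-cloud step reduces to word-frequency counting over the scraped metadata. A minimal sketch follows; the sample strings are illustrative stand-ins, not entries from the actual dataset, and a real word cloud would additionally render the counts with a plotting library.

```python
from collections import Counter

# Word-cloud construction reduces to counting word frequencies; the most
# frequent words are the "important keywords" for each class of domain.
# The sample strings below are illustrative, not actual dataset entries.
ad_metadata = ["banner ads network tracking", "ad tracking pixel network"]
normal_metadata = ["news weather sports scores", "shopping news deals"]

def top_keywords(docs, n=3):
    counts = Counter(word for doc in docs for word in doc.split())
    return [word for word, _ in counts.most_common(n)]

print("ad domains:    ", top_keywords(ad_metadata))
print("normal domains:", top_keywords(normal_metadata))
```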
## Slide 8 - Methodology 4, 5
In this step, we apply NLP on the extracted metadata: stemming, lemmatization, stop-word removal, and special-character removal.
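A minimal sketch of this clean-up step, using an illustrative stop-word list and a crude suffix-stripping rule as stand-ins for a full NLP toolkit such as NLTK:

```python
import re

# Illustrative stand-ins for a real stop-word corpus and stemmer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in", "for"}

def crude_stem(word: str) -> str:
    # Very rough suffix stripping, standing in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list:
    text = re.sub(r"[^a-z\s]", " ", text.lower())               # special-character removal
    tokens = [t for t in text.split() if t not in STOP_WORDS]   # stop-word removal
    return [crude_stem(t) for t in tokens]                      # stemming

print(preprocess("Tracking ads and banners in the page!"))
# -> ['track', 'ads', 'banner', 'page']
```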
As the next step, we apply the machine learning algorithms listed earlier - Logistic Regression, Random Forest, SVM, Decision Tree, and K-Nearest Neighbors - and compare the results.
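The comparison step can be sketched with scikit-learn. Here a small synthetic dataset stands in for the real 74,000-entry one, so the scores are only illustrative; the hyperparameter choices shown (RBF kernel for SVM, 200 estimators for Random Forest) mirror the configurations reported in the results.

```python
# Sketch of the model-comparison step using scikit-learn. The feature matrix
# is random stand-in data, not the paper's actual features or scores.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM (RBF)": SVC(kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

# Fit each model and record its test-set accuracy.
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```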
## Slide 9 - Results
The figure depicts the performance of the various machine learning algorithms. Parameters were tuned for each algorithm, and the best results are plotted.
It can be seen that the SVM classifier with an RBF kernel and the Random Forest classifier with 200 estimators produced the best results.
## Slide 10 - Conclusion & Future Work
In this paper, we illustrated a novel approach for ad blockers. The approach overcomes three major problems associated with traditional blacklist-based ad blockers, and it can also be combined with blacklist-based ad blockers to improve accuracy. It was identified that the SVM and Random Forest classifiers work best in this scenario.
In the future, the developed model can be used to build a browser extension so that it can be deployed in real-world scenarios. The paper focuses only on certain keywords extracted from a webpage; many other features could be useful in this scenario. These need to be identified, and combinations with other algorithms can be tried.