# Paper presentation

## Slide 1

A very good afternoon to everyone present here, and to the respected panel members. I am Yadhu Krishna. On behalf of my fellow authors, Sanjana S and Thushara MG from Amrita Vishwa Vidyapeetham, I am here to present our research work titled "Ad Service Detection - A Comparative Study of Machine Learning Techniques" (Paper ID 300, Serial No. 243). Before beginning the presentation, I would like to thank IEEE and ICCCNT for providing this opportunity.

## Slide 2

Here is the presentation outline. First, we go through an introduction; then we present the research contributions made by our paper, followed by a literature review and the methodology. We discuss the results, conclusion and future work towards the end.

## Slide 3 - Introduction

Web services have become increasingly important in our daily lives. These include social media applications, chat applications and other utility software and services. The trend of integrating third-party applications has also risen. These applications might appear free to users; however, they collect information about them. The information collected in this manner can include personally identifiable information, user behavior and other sensitive data. This information can, in turn, be sold to other third parties or used to display personalized advertisements and generate revenue. This significantly affects user privacy, which has become one of the major concerns of internet users today.

Ad blockers, or content blockers, were developed to overcome this problem. However, currently available ad blockers rely purely on a predefined blacklist of domains. This study proposes a machine learning-based approach to detect web-based ad services and trackers. To help us explore the area using machine learning techniques, a dataset was created.
The developed dataset contains 74,000 entries comprising both domains that serve advertisements and normal domains. In addition, we compared the scores obtained by various supervised machine learning models and identified the best one. The best model offered up to 88% accuracy in classifying ad services and normal websites. The model contributed by this paper could be further developed into a machine learning-based advertisement blocker system.

## Slide 4 - Research Contributions

Traditional ad blockers work by blocking network access to domains present in an internal blacklist. However, this blacklist is mostly curated manually from reports by individual users. We identified three major problems posed by the pure blacklist-based approach, which is the main motivation behind this research. The issues are:

+ Performance issues: The number of advertisement services is constantly increasing, and each of these needs to be added to a blacklist. This grows the blacklist and impacts the performance of the browser or the ad blocker application.
+ Inefficiency of static blacklists: As the number of ad services and trackers keeps rising, the set of domains to be blocked cannot remain constant. Since the blacklist is curated from individual reports, it is prone to errors and bias.
+ Frequent updates to blacklists: Because the list of domains to be blocked is dynamic, frequent updates are required to maintain the correctness and efficiency of the blacklist database installed in the ad blocker.

The approach we provide through this research work can be combined with classical ad blockers to attain higher accuracy. We also provide a comparison of various machine learning algorithms applied to the problem, namely Logistic Regression, Random Forest, Support Vector Machines, Decision Tree classifier and K-Nearest Neighbors classifier.
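The kind of model comparison described above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the paper's actual experiment: the dataset, hyperparameters and scores here are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the real 74,000-entry domain dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# The five algorithm families compared in the paper.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest (200 trees)": RandomForestClassifier(n_estimators=200, random_state=42),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.3f}")
```

On the real dataset, each model's hyperparameters would be tuned rather than fixed as above.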
Now let us move to the literature review.

## Literature Review

A good amount of research has been conducted in this field; however, the application of Natural Language Processing (NLP) to it remains unexplored, which we intend to illustrate with this work.

Iqbal et al. [1] proposed a graph-based machine learning approach to automatically detect and block ads and trackers on the web. It is based on building a graph from various parts of an HTML page, including JavaScript, HTTP and HTML. Bhagavatula et al. [2] trained a supervised machine learning model for detecting advertisements based on keywords extracted from HTTP query strings. Zain ul abi Din et al. [3] proposed an ad blocker system based on deep learning; it works by intercepting and processing every image obtained during execution of the web page. Mughees et al. [4] studied patterns in DOM changes and used them to train a machine learning model for detecting anti-ad blockers. Gugelmann et al. [5] introduced a machine learning-based method for classifying web tracking and advertising (WTA) requests. WTAGraph [6] represented the execution of a web page as a graph, on which classifiers were trained to detect advertising resources. Hieu Le et al. [7] proposed AutoFR, which relies on reinforcement learning (RL) to generate filter rules that block unwanted URLs.

## Slide 5 - Methodology 1, 2

As a first step, we collect the required data. The dataset is combined from two different sources:

+ Ad blocker dataset: a blacklist used by ad blockers. These domains are considered to serve advertisements and trackers.
+ Alexa Top 1M websites: a curated list of top websites collected from Amazon Alexa. These are considered non-ad-serving.

We then perform web scraping on all the domains to collect the necessary data. There are mainly two kinds of features: static and dynamic. Static features are those directly extracted from the domain name.
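For illustration, static features of this kind might be derived purely from the domain string. The specific features and keyword list below are hypothetical examples, not the paper's actual feature set:

```python
# Illustrative keyword list; an ad-related substring in a domain label is a
# plausible (hypothetical) signal of an ad-serving domain.
AD_KEYWORDS = ("ad", "ads", "track", "click", "banner", "metric")

def static_features(domain: str) -> dict:
    """Extract simple features from a domain name alone, without any network access."""
    labels = domain.lower().split(".")
    return {
        "length": len(domain),
        "num_labels": len(labels),  # e.g. sub.example.com -> 3
        "num_digits": sum(c.isdigit() for c in domain),
        "num_hyphens": domain.count("-"),
        "has_ad_keyword": any(k in part for part in labels for k in AD_KEYWORDS),
    }

print(static_features("ads.tracker-metrics.example.com"))
```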
Dynamic features are those extracted from the web page by querying it.

## Slide 6 - Features

The table presents the different features in the dataset, along with an indication of whether each feature is static or dynamic.

## Slide 7 - Methodology 3

We then performed exploratory data analysis (EDA) on the dataset. Various graphs and plots were drawn to visualize the data. We also constructed word clouds to identify important keywords in both classes of domains.

## Slide 8 - Methodology 4, 5

In this step, we apply NLP to the extracted metadata: stemming, lemmatization, stop-word removal and special-character removal. As the next step, we apply the various machine learning algorithms listed earlier and compare the results.

## Slide 9 - Results

The figure depicts the performance of the various machine learning algorithms. Parameters were tuned for each algorithm, and the best results are plotted. The SVM classifier with an RBF kernel and the Random Forest classifier with 200 estimators produced the best results.

## Slide 10 - Conclusion & Future Work

In this paper, we illustrated a novel approach for ad blockers. The approach overcomes the three major problems associated with traditional blacklist-based ad blockers, and it can also be combined with blacklist-based ad blockers to improve accuracy. We identified that the SVM and Random Forest classifiers work best in this scenario.

In future, the developed model can be used to build a browser extension so that it can be deployed in real-world scenarios. This paper focuses only on certain keywords extracted from a web page; many other features could be useful in this scenario. These need to be identified, and combinations with other algorithms can be tried.
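As a supplementary illustration of the metadata preprocessing step from Slide 8, the cleaning stage could be sketched as below. This is a deliberately simplified, self-contained sketch: a real pipeline would use a full NLP toolkit such as NLTK for stemming, lemmatization and stop words, whereas the stop-word list and suffix rules here are placeholder assumptions.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a full corpus.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "for", "on"}

def stem(word: str) -> str:
    """Naive suffix-stripping stand-in for a proper stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    # Special-character removal and lowercasing.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Stop-word removal, then stemming.
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [stem(t) for t in tokens]

print(preprocess("Tracking ads and the targeted banners!"))
# -> ['track', 'ads', 'target', 'banner']
```

The resulting tokens would then be vectorized and fed to the classifiers compared in Slide 9.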