# URL Metadata Comparison for Deduplication Detection

By: Gita Ayu Salsabila, QA Engineer Intern

###### tags: `Fulldive Internship`

> Reference:
> - [Stackoverflow: Detecting that two URLs lead to the same site](https://stackoverflow.com/questions/24325759/how-can-i-detect-that-these-two-urls-drive-to-the-same-site)
> - [Stackoverflow: Computing the similarity between two text documents](https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents/8897648#8897648)
> - [Tf–idf algorithm](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
> - [Tutorial: NLP for checking document similarity](https://dev.to/coderasha/compare-documents-similarity-using-python-nlp-4odp)
> - [Tutorial: TF-IDF similarity checking](https://medium.com/@odysseus1337/document-class-comparison-with-tf-idf-python-1b4860b9345b)

### **Goals:**

*Detect the similarity of web pages that are reached through different URLs.*

To solve this, I first plan to fetch the metadata of a URL in HTML/XML form using a Python library (**urllib**/**BeautifulSoup**), then parse it to check whether a canonical tag/header is present. If no canonical tag/header is found, I compute **the cosine similarity** between the content of the two URLs' metadata using the **tf-idf algorithm**.

:::info
:warning: **Problems:**
- I do not yet have a good mechanism to search for and fetch "suspected URLs" that are likely to have similar metadata or lead to the same page.
- I have not yet decided on the best cosine-similarity threshold for judging whether two pages' metadata are similar enough.
- I am not sure what to do once I find that several URLs lead to the same page. So far, I just list the URLs.
:::

## Algorithm to Solve This Problem

![](https://i.imgur.com/rgUuhSs.png)

## Implementation

Check this link to see the result of the implemented code: [Google Colab](https://colab.research.google.com/drive/1_Rm5jPtBd_IKhtUTdjJxY9JrRu1srNHA?usp=sharing)

Code for checking the presence of a canonical tag:

```python=
import requests
from bs4 import BeautifulSoup

URL = 'https://9gag.com/gag/a3MD3Nr'

page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')  # parse the fetched HTML
canonical = soup.find('link', rel="canonical")

if canonical is None:
    print("Canonical URL is not found")
else:
    canonicalURL = canonical['href']
    print(canonicalURL)
```

Code to measure the similarity of two URLs' metadata using **tf-idf**:

```python=
from sklearn.feature_extraction.text import TfidfVectorizer
from urllib.request import urlopen

URL = 'http://www.eonline.com/videos/267385/luann-de-lesseps-apologizes-after-arrest'
suspectedURL = 'http://www.eonline.com/videos/266671/jennifer-lopez-is-newest-guess-girl-at-48'

# decode the raw bytes so TfidfVectorizer receives strings
page1 = urlopen(URL).read().decode('utf-8', errors='ignore')
page2 = urlopen(suspectedURL).read().decode('utf-8', errors='ignore')

tfidf = TfidfVectorizer()
vecs = tfidf.fit_transform([page1, page2])
corr_matrix = (vecs @ vecs.T).toarray()  # cosine similarities (tf-idf rows are L2-normalized)
similarity = corr_matrix[0, 1]
print(similarity)
```

Code for detecting two URLs whose metadata is 100% identical:

```python=
from urllib.request import urlopen

URL = 'http://www.eonline.com/videos/267385/luann-de-lesseps-apologizes-after-arrest'
suspectedURL = 'http://www.eonline.com/videos/266671/jennifer-lopez-is-newest-guess-girl-at-48'

page1 = urlopen(URL)
page2 = urlopen(suspectedURL)

if page1.read() == page2.read():
    print("same site")
else:
    print("different")
```

Full code that implements the algorithm for detecting URL duplication:

```python=
import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from urllib.request import urlopen

URL = 'https://medium.com/@odysseus1337/document-class-comparison-with-tf-idf-python-1b4860b9345b'

page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
canonical = soup.find('link', rel="canonical")

if canonical is None:
    print("Canonical URL is not found \n")
    suspectedURL = 'https://stackoverflow.com/questions/53748236/how-to-compare-two-text-document-with-tfidf-vectorizer/59012601#59012601'
    # decode the raw bytes so TfidfVectorizer receives strings
    page1 = urlopen(URL).read().decode('utf-8', errors='ignore')
    page2 = urlopen(suspectedURL).read().decode('utf-8', errors='ignore')
    tfidf = TfidfVectorizer()
    vecs = tfidf.fit_transform([page1, page2])
    corr_matrix = (vecs @ vecs.T).toarray()
    similarity = corr_matrix[0, 1]
    print("Has", similarity, "similarity with the suspected URL")
else:
    canonicalURL = canonical['href']
    print(canonicalURL)
```
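The branching in the full code above can also be factored into a small, network-free helper so the decision rule can be tested on its own. This is only a sketch: the name `classify_pair` and the 0.9 threshold are illustrative assumptions, not part of the code above.

```python=
def classify_pair(canonical_href, similarity, threshold=0.9):
    """Decide the deduplication status of a URL pair.

    canonical_href -- href from the page's canonical tag, or None if absent
    similarity     -- tf-idf cosine similarity of the two pages' content
    threshold      -- assumed cutoff (0.9 is a placeholder, not a tuned value)
    """
    if canonical_href is not None:
        # A canonical tag answers the question directly
        return "canonical: " + canonical_href
    return "duplicate" if similarity >= threshold else "distinct"

print(classify_pair(None, 0.95))  # -> duplicate
print(classify_pair("https://9gag.com/gag/a3MD3Nr", 0.2))  # -> canonical: https://9gag.com/gag/a3MD3Nr
```

Keeping the fetching separate from the decision also makes it easier to experiment with different thresholds later without hitting the network.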
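For the open question of choosing a similarity threshold, a self-contained experiment on toy sentences (invented here purely for illustration) gives a feel for the scores tf-idf produces for near-duplicate versus unrelated texts:

```python=
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents: docs[0] and docs[1] are near-duplicates; docs[2] is unrelated
docs = [
    "luann de lesseps apologizes after arrest",
    "luann de lesseps apologizes following her arrest",
    "jennifer lopez is the newest guess girl",
]

vecs = TfidfVectorizer().fit_transform(docs)
sims = (vecs @ vecs.T).toarray()  # pairwise cosine similarities

near_duplicate = sims[0, 1]  # well above zero: the pair shares most terms
unrelated = sims[0, 2]       # zero: the documents share no terms at all
print(near_duplicate, unrelated)
```

Real pages contain shared boilerplate (navigation, scripts, footers), so even unrelated pages from the same site will score well above zero; the threshold probably needs to be tuned on real page pairs rather than toy text.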