# URL Metadata Comparison for Deduplication Detection
By: Gita Ayu Salsabila, QA Engineer Intern
###### tags: `Fulldive Internship`
>Reference:
>- [Stackoverflow: Detection of two URL drive to the same site](https://stackoverflow.com/questions/24325759/how-can-i-detect-that-these-two-urls-drive-to-the-same-site)
>- [Stackoverflow: Detection similarity between two text document](https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents/8897648#8897648)
>- [Tf–idf Algorithm](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
>- [Tutorial: NLP for Checking Documents Similarity](https://dev.to/coderasha/compare-documents-similarity-using-python-nlp-4odp)
>- [Tutorial: TF-IDF Similarity Checking](https://medium.com/@odysseus1337/document-class-comparison-with-tf-idf-python-1b4860b9345b)
### **Goals:** *Detect the similarity of web pages driven by different URLs*
To solve this, I first plan to fetch the metadata of a URL in HTML/XML format using Python libraries (**urllib**/**BeautifulSoup**), then parse it to check whether a canonical tag/header is present. If no such tag/header is found, I compute **the cosine similarity** between the contents of the two URLs' metadata using the **tf-idf algorithm**.
:::info
:warning: **Problems:**
- I do not yet have a good mechanism to search for and fetch "suspected URLs" that probably have similar metadata or lead to the same page.
- I have not decided on the best cosine-similarity threshold for determining whether two sets of metadata are similar enough.
- I am not sure what to do once I find that several URLs drive to the same page. So far, I just list the URLs.
:::
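On the threshold question above, here is a minimal sketch of how a chosen cutoff would turn the tf-idf cosine similarity into a duplicate/non-duplicate decision. The `is_duplicate` helper and the `0.9` default are assumptions for illustration, not part of the original code; the right value would need to be tuned against real page pairs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def is_duplicate(text_a, text_b, threshold=0.9):
    """Return True when the tf-idf cosine similarity of two texts
    meets the duplicate threshold (0.9 is an assumed value)."""
    tfidf = TfidfVectorizer()
    vecs = tfidf.fit_transform([text_a, text_b])
    # tf-idf rows are L2-normalized, so the dot product is the cosine
    similarity = (vecs @ vecs.T).toarray()[0, 1]
    return similarity >= threshold

# Identical documents score 1.0, so any threshold <= 1.0 flags them;
# documents sharing no terms score 0.0.
print(is_duplicate("hello world page", "hello world page"))
print(is_duplicate("hello world page", "completely other text"))
```

A threshold near 1.0 only catches near-identical pages (e.g. the same article behind two URLs), while a lower one also flags pages that merely share a site template, so the choice trades false negatives against false positives.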
## Algorithm to Solve This Problem
1. Fetch the page behind the URL and parse its HTML with BeautifulSoup.
2. Look for a `<link rel="canonical">` tag; if present, the canonical URL identifies the page, so report it.
3. Otherwise, fetch the suspected URL, vectorize both pages with tf-idf, and compute their cosine similarity.
4. If the similarity passes a chosen threshold (or the raw responses are byte-identical), treat the URLs as duplicates.

## Implementation
Check this link to see the result of the implemented code: [Google Colab](https://colab.research.google.com/drive/1_Rm5jPtBd_IKhtUTdjJxY9JrRu1srNHA?usp=sharing)
Code for checking the presence of a canonical tag:
```python=
import requests
from bs4 import BeautifulSoup

URL = 'https://9gag.com/gag/a3MD3Nr'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')  # explicit parser avoids a warning
canonical = soup.find('link', rel="canonical")
if canonical is None:
    print("Canonical URL is not found")
else:
    canonicalURL = canonical['href']
    print(canonicalURL)
```
Code to compute the similarity of two URLs' metadata using **tf-idf**:
```python=
from sklearn.feature_extraction.text import TfidfVectorizer
from urllib.request import urlopen

URL = 'http://www.eonline.com/videos/267385/luann-de-lesseps-apologizes-after-arrest'
suspectedURL = 'http://www.eonline.com/videos/266671/jennifer-lopez-is-newest-guess-girl-at-48'
page1 = urlopen(URL).read().decode('utf-8', errors='ignore')
page2 = urlopen(suspectedURL).read().decode('utf-8', errors='ignore')

tfidf = TfidfVectorizer()
vecs = tfidf.fit_transform([page1, page2])
# tf-idf rows are L2-normalized, so the Gram matrix holds cosine similarities
corr_matrix = (vecs @ vecs.T).toarray()
similarity = corr_matrix[0, 1]
print(similarity)
```
Code for detecting two URLs whose responses are byte-for-byte identical (100% similarity):
```python=
from urllib.request import urlopen

URL = 'http://www.eonline.com/videos/267385/luann-de-lesseps-apologizes-after-arrest'
suspectedURL = 'http://www.eonline.com/videos/266671/jennifer-lopez-is-newest-guess-girl-at-48'
page1 = urlopen(URL)
page2 = urlopen(suspectedURL)
if page1.read() == page2.read():
    print("same site")
else:
    print("different")
```
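Exact byte comparison misses URLs that differ only trivially (case of the host, a fragment, a trailing slash) yet drive to the same page. As a cheap complementary check before fetching anything, the URLs themselves can be normalized and compared with the standard library. This is a sketch; the `normalize` helper and its rules are assumptions, not part of the original code:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Normalize a URL so trivially different forms compare equal:
    lowercase the scheme and host, drop the fragment, strip a
    trailing slash from the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip('/') or '/'
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ''))

# These two forms normalize to the same string
print(normalize('HTTP://Example.com/page/#section'))
print(normalize('http://example.com/page'))
```

If the normalized forms match, the URLs can be treated as duplicates without any download; if not, the byte or tf-idf comparison above is still needed, since different URLs can serve the same content.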
Full code implementing the algorithm to detect URL deduplication:
```python=
import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from urllib.request import urlopen

URL = 'https://medium.com/@odysseus1337/document-class-comparison-with-tf-idf-python-1b4860b9345b'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
canonical = soup.find('link', rel="canonical")
if canonical is None:
    print("Canonical URL is not found\n")
    suspectedURL = 'https://stackoverflow.com/questions/53748236/how-to-compare-two-text-document-with-tfidf-vectorizer/59012601#59012601'
    page1 = urlopen(URL).read().decode('utf-8', errors='ignore')
    page2 = urlopen(suspectedURL).read().decode('utf-8', errors='ignore')
    tfidf = TfidfVectorizer()
    vecs = tfidf.fit_transform([page1, page2])
    corr_matrix = (vecs @ vecs.T).toarray()
    similarity = corr_matrix[0, 1]
    print("Has", similarity, "similarity with the suspected URL")
else:
    canonicalURL = canonical['href']
    print(canonicalURL)
```