Web Scraping

It is common to write scripts that parse webpages for the text and information they contain. These scripts are called web scrapers. Scraping is a great way to create your own corpora for NLP tasks!

Example Wiki Scraper

Here is a simple web scraper that recursively collects content from wiki articles (it has been tested with Wikipedia and wikis hosted on Fandom.com, but others may work as well). To run it, you must first install a few pip packages and download the NLTK tokenizer data:

Windows:

    > pip install nltk
    > pip install beautifulsoup4
    > pip install requests
    > python
    >>> import nltk
    >>> nltk.download('punkt')

Mac:

    > pip3 install nltk
    > pip3 install beautifulsoup4
    > pip3 install requests
    > python3
    >>> import nltk
    >>> nltk.download('punkt')

Here is the scraper (copy and paste it into a file called wiki_scraper.py):

    from nltk.tokenize import word_tokenize
    from urllib.parse import urlparse
    from bs4 import BeautifulSoup
    import requests
    import codecs
    import random
    import re
    import sys

    def parse_wikipedia_rec(url: str, depth: int, rand_links: int):
        print(url)
        u = urlparse(url)

        depth -= 1

        links = set()
        corpus = []

        result = requests.get(url).text
        doc = BeautifulSoup(result, "html.parser")

        # On MediaWiki sites (Wikipedia, Fandom, etc.) the article body
        # lives in a container with class "mw-parser-output".
        output = doc.find(class_="mw-parser-output")
        if output is None:
            return corpus

        # Drop empty placeholder paragraphs before collecting text.
        for empty in output.find_all("p", class_="mw-empty-elt"):
            empty.extract()
        paragraphs = output.find_all("p")

        for paragraph in paragraphs:
            # Strip footnote markers like [1] so they don't end up as tokens.
            for sup in paragraph.find_all("sup"):
                sup.extract()
            # Collect internal wiki links (relative URLs starting with
            # /wiki/) to follow recursively.
            hrefs = paragraph.find_all("a", href=re.compile("^/wiki/"))
            for href in hrefs:
                links.add(href["href"])
            txt = paragraph.get_text()
            tokens = word_tokenize(txt)
            tokens.insert(0, '<START>')
            tokens.append('<END>')
            corpus.append(tokens)

        links = list(links)
        if depth > 0:
            # Never sample more links than the page actually has.
            for link in random.sample(links, min(rand_links, len(links))):
                # extend (not append) keeps the corpus a flat list of
                # token lists instead of nesting one corpus inside another
                corpus.extend(parse_wikipedia_rec(
                    u.scheme + "://" + u.netloc + link,
                    depth, rand_links))

        return corpus

    if len(sys.argv) != 5:
        print("Usage: wiki_scraper.py [output file] [wiki url] [max depth] [links at each depth]")
        sys.exit(1)

    corpus = parse_wikipedia_rec(sys.argv[2], int(sys.argv[3]), int(sys.argv[4]))
    with codecs.open(sys.argv[1], 'w', 'utf-8') as f:
        f.write(str(corpus))

Writing your own scraper

If you would like to scrape your own web pages, I would recommend following the BeautifulSoup tutorial series by Tech With Tim. He goes into detail on how to look through the HTML source of websites and programmatically parse them for whatever content you need.
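
As a rough starting point, here is a minimal sketch of the usual BeautifulSoup workflow: fetch a page with requests, parse it, and pull out the tags you care about. The URL and the "article-body" class name below are placeholders; you would substitute whatever tags and classes you find when inspecting your target site's HTML:

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical target page; replace with the site you want to scrape.
    url = "https://example.com/some-article"
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")

    # Find the container that holds the article text. The class name
    # "article-body" is a placeholder; use your browser's dev tools to
    # find the right tag and class on the real page.
    body = soup.find("div", class_="article-body")
    if body is not None:
        for paragraph in body.find_all("p"):
            print(paragraph.get_text())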

For our purposes, the final format should be a list of paragraphs, where each paragraph is a list of words and punctuation tokens beginning with '<START>' and ending with '<END>'. For example:

    [['<START>', 'japan', 'to', 'revise', 'long', '-', 'term', 'energy', 'demand', 'downwards', 'the',
    'ministry', 'of', 'international', 'trade', 'and', 'industry', '(', 'miti', ')', 'will', 'revise',
    'its', 'long', '-', 'term', 'energy', 'supply', '/', 'demand', 'outlook', 'by', 'august', 'to',
    'meet', 'a', 'forecast', 'downtrend', 'in', 'japanese', 'energy', 'demand', ',', 'ministry',
    'officials', 'said', '.', 'miti', 'is', 'expected', 'to', 'lower', 'the', 'projection', 'for',
    'primary', 'energy', 'supplies', 'in', 'the', 'year', '2000', 'to', '550', 'mln', 'kilolitres',
    '(', 'kl', ')', 'from', '600', 'mln', ',', 'they', 'said', '.', 'the', 'decision', 'follows',
    'the', 'emergence', 'of', 'structural', 'changes', 'in', 'japanese', 'industry', 'following',
    'the', 'rise', 'in', 'the', 'value', 'of', 'the', 'yen', 'and', 'a', 'decline', 'in', 'domestic',
    'electric', 'power', 'demand', '.', 'miti', 'is', 'planning', 'to', 'work', 'out', 'a', 'revised',
    'energy', 'supply', '/', 'demand', 'outlook', 'through', 'deliberations', 'of', 'committee',
    'meetings', 'of', 'the', 'agency', 'of', 'natural', 'resources', 'and', 'energy', ',', 'the',
    'officials', 'said', '.', 'they', 'said', 'miti', 'will', 'also', 'review', 'the', 'breakdown',
    'of', 'energy', 'supply', 'sources', ',', 'including', 'oil', ',', 'nuclear', ',', 'coal', 'and',
    'natural', 'gas', '.', 'nuclear', 'energy', 'provided', 'the', 'bulk', 'of', 'japan', "'", 's',
    'electric', 'power', 'in', 'the', 'fiscal', 'year', 'ended', 'march', '31', ',', 'supplying',
    'an', 'estimated', '27', 'pct', 'on', 'a', 'kilowatt', '/', 'hour', 'basis', ',', 'followed',
    'by', 'oil', '(', '23', 'pct', ')', 'and', 'liquefied', 'natural', 'gas', '(', '21', 'pct', '),',
    'they', 'noted', '.', '<END>'], ... ]
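
Since the script saves the corpus with str(), one way to load it back into Python is ast.literal_eval, which safely parses the list-of-lists literal. This is a minimal sketch, assuming a file named corpus.txt that was written by wiki_scraper.py:

    import ast
    import codecs

    # Read the file written by wiki_scraper.py and rebuild the
    # list of token lists. 'corpus.txt' is a placeholder filename.
    with codecs.open('corpus.txt', 'r', 'utf-8') as f:
        corpus = ast.literal_eval(f.read())

    print(len(corpus), 'paragraphs')
    print(corpus[0][:10])  # first few tokens of the first paragraph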

Corpus Source Ideas