Web Scraping

It is common to write scripts that parse webpages for the text and information they contain. These scripts are called web scrapers. Scraping is a great way to create your own corpora for NLP tasks!

Example Wiki Scraper

Here is a simple web scraper that recursively collects content from wiki articles (it has been tested with Wikipedia and wikis hosted on Fandom.com, but others may work as well). To run it, you must first install a few pip packages and download the NLTK tokenizer data:

Windows:

    > pip install nltk
    > pip install beautifulsoup4
    > pip install requests
    > python
    >>> import nltk
    >>> nltk.download('punkt')

Mac:

    > pip3 install nltk
    > pip3 install beautifulsoup4
    > pip3 install requests
    > python3
    >>> import nltk
    >>> nltk.download('punkt')

Here is the scraper (copy and paste it into a file called wiki_scraper.py):

    from nltk.tokenize import word_tokenize
    from urllib.parse import urlparse
    from bs4 import BeautifulSoup
    import requests
    import codecs
    import random
    import re
    import sys

    def parse_wikipedia_rec(url: str, depth: int, rand_links: int):
        print(url)
        u = urlparse(url)

        depth -= 1

        links = set()
        corpus = []

        result = requests.get(url).text
        doc = BeautifulSoup(result, "html.parser")

        # On MediaWiki sites (Wikipedia, Fandom, etc.) the article body
        # lives in a container with class "mw-parser-output".
        output = doc.find(class_="mw-parser-output")
        if output is None:
            return corpus

        # Drop empty placeholder paragraphs before collecting text.
        for empty in output.find_all("p", class_="mw-empty-elt"):
            empty.extract()
        paragraphs = output.find_all("p")

        for paragraph in paragraphs:
            # Strip footnote markers like [1] so they don't end up as tokens.
            for sup in paragraph.find_all("sup"):
                sup.extract()
            # Collect internal wiki links (relative URLs starting with
            # /wiki/) to follow recursively.
            hrefs = paragraph.find_all("a", href=re.compile("^/wiki/"))
            for href in hrefs:
                links.add(href["href"])
            txt = paragraph.get_text()
            tokens = word_tokenize(txt)
            tokens.insert(0, '<START>')
            tokens.append('<END>')
            corpus.append(tokens)

        links = list(links)
        if depth > 0:
            # Never sample more links than the page actually has.
            for link in random.sample(links, min(rand_links, len(links))):
                # extend (not append) keeps the corpus a flat list of
                # token lists instead of nesting one corpus inside another
                corpus.extend(parse_wikipedia_rec(
                    u.scheme + "://" + u.netloc + link,
                    depth, rand_links))

        return corpus

    if len(sys.argv) != 5:
        print("Usage: wiki_scraper.py [output file] [wiki url] [max depth] [links at each depth]")
        sys.exit(1)

    corpus = parse_wikipedia_rec(sys.argv[2], int(sys.argv[3]), int(sys.argv[4]))
    with codecs.open(sys.argv[1], 'w', 'utf-8') as f:
        f.write(str(corpus))

Writing your own scraper

If you would like to scrape your own web pages, I would recommend following the BeautifulSoup tutorial series by Tech With Tim. He goes into detail on how to look through the HTML source of websites and programmatically parse them for whatever content you need.
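
As a rough starting point, here is a minimal sketch of the usual BeautifulSoup workflow: fetch a page with requests, parse it, and pull out the tags you care about. The URL and the "article-body" class name below are placeholders; you would substitute whatever tags and classes you find when inspecting your target site's HTML:

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical target page; replace with the site you want to scrape.
    url = "https://example.com/some-article"
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")

    # Find the container that holds the article text. The class name
    # "article-body" is a placeholder; use your browser's dev tools to
    # find the right tag and class on the real page.
    body = soup.find("div", class_="article-body")
    if body is not None:
        for paragraph in body.find_all("p"):
            print(paragraph.get_text())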

For our purposes, the final format should be a list of paragraphs, where each paragraph is a list of words and punctuation tokens beginning with '<START>' and ending with '<END>'. For example:

    [['<START>', 'japan', 'to', 'revise', 'long', '-', 'term', 'energy', 'demand', 'downwards', 'the',
    'ministry', 'of', 'international', 'trade', 'and', 'industry', '(', 'miti', ')', 'will', 'revise',
    'its', 'long', '-', 'term', 'energy', 'supply', '/', 'demand', 'outlook', 'by', 'august', 'to',
    'meet', 'a', 'forecast', 'downtrend', 'in', 'japanese', 'energy', 'demand', ',', 'ministry',
    'officials', 'said', '.', 'miti', 'is', 'expected', 'to', 'lower', 'the', 'projection', 'for',
    'primary', 'energy', 'supplies', 'in', 'the', 'year', '2000', 'to', '550', 'mln', 'kilolitres',
    '(', 'kl', ')', 'from', '600', 'mln', ',', 'they', 'said', '.', 'the', 'decision', 'follows',
    'the', 'emergence', 'of', 'structural', 'changes', 'in', 'japanese', 'industry', 'following',
    'the', 'rise', 'in', 'the', 'value', 'of', 'the', 'yen', 'and', 'a', 'decline', 'in', 'domestic',
    'electric', 'power', 'demand', '.', 'miti', 'is', 'planning', 'to', 'work', 'out', 'a', 'revised',
    'energy', 'supply', '/', 'demand', 'outlook', 'through', 'deliberations', 'of', 'committee',
    'meetings', 'of', 'the', 'agency', 'of', 'natural', 'resources', 'and', 'energy', ',', 'the',
    'officials', 'said', '.', 'they', 'said', 'miti', 'will', 'also', 'review', 'the', 'breakdown',
    'of', 'energy', 'supply', 'sources', ',', 'including', 'oil', ',', 'nuclear', ',', 'coal', 'and',
    'natural', 'gas', '.', 'nuclear', 'energy', 'provided', 'the', 'bulk', 'of', 'japan', "'", 's',
    'electric', 'power', 'in', 'the', 'fiscal', 'year', 'ended', 'march', '31', ',', 'supplying',
    'an', 'estimated', '27', 'pct', 'on', 'a', 'kilowatt', '/', 'hour', 'basis', ',', 'followed',
    'by', 'oil', '(', '23', 'pct', ')', 'and', 'liquefied', 'natural', 'gas', '(', '21', 'pct', '),',
    'they', 'noted', '.', '<END>'], ... ]
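
Since the script saves the corpus with str(), one way to load it back into Python is ast.literal_eval, which safely parses the list-of-lists literal. This is a minimal sketch, assuming a file named corpus.txt that was written by wiki_scraper.py:

    import ast
    import codecs

    # Read the file written by wiki_scraper.py and rebuild the
    # list of token lists. 'corpus.txt' is a placeholder filename.
    with codecs.open('corpus.txt', 'r', 'utf-8') as f:
        corpus = ast.literal_eval(f.read())

    print(len(corpus), 'paragraphs')
    print(corpus[0][:10])  # first few tokens of the first paragraph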

Corpus Source Ideas