It is common to write scripts that parse webpages for the text and information they contain. These scripts are called web scrapers. This is a great way to create your own corpora for NLP tasks!
Here is a simple web scraper that recursively collects content from wiki articles (it has been tested with Wikipedia and wikis hosted on Fandom.com, but others may work as well). To run it, you must first install a few pip packages and run a few commands:
Windows:
> pip install nltk
> pip install beautifulsoup4
> python
>>> import nltk
>>> nltk.download('punkt')
Mac:
> pip3 install nltk
> pip3 install beautifulsoup4
> python3
>>> import nltk
>>> nltk.download('punkt')
Here is the scraper (copy and paste it into a file called wiki_scraper.py):
from nltk.tokenize import word_tokenize
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import requests
import codecs
import random
import re
import sys

def parse_wikipedia_rec(url: str, depth: int, rand_links: int):
    print(url)
    u = urlparse(url)
    depth -= 1
    links = set()
    corpus = []
    result = requests.get(url).text
    doc = BeautifulSoup(result, "html.parser")
    # MediaWiki pages keep the article body inside this container
    output = doc.find(class_="mw-parser-output")
    # Drop the empty placeholder paragraphs MediaWiki inserts
    for empty in output.find_all("p", class_="mw-empty-elt"):
        empty.extract()
    paragraphs = output.find_all("p")
    for paragraph in paragraphs:
        # Strip citation markers like [1] so they don't pollute the tokens
        for sup in paragraph.find_all("sup"):
            sup.extract()
        # Collect internal wiki links to follow at the next depth
        hrefs = paragraph.find_all("a", href=re.compile("/wiki/"))
        for href in hrefs:
            links.add(href["href"])
        txt = paragraph.get_text()
        tokens = word_tokenize(txt)
        tokens.insert(0, "<START>")
        tokens.append("<END>")
        corpus.append(tokens)
    links = list(links)
    if depth > 0:
        # Sample at most as many links as the page actually has
        for link in random.sample(links, min(rand_links, len(links))):
            # extend (not append) keeps the result a flat list of paragraphs
            corpus.extend(parse_wikipedia_rec(
                u.scheme + "://" + u.netloc + link,
                depth, rand_links))
    return corpus

if len(sys.argv) != 5:
    print("Usage: wiki_scraper.py [output file] [wiki url] [max depth] [links at each depth]")
    sys.exit(1)

corpus = parse_wikipedia_rec(sys.argv[2], int(sys.argv[3]), int(sys.argv[4]))
with codecs.open(sys.argv[1], 'w', 'utf-8') as f:
    f.write(str(corpus))
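For example, the following command scrapes the Wikipedia article on natural language processing plus 3 randomly chosen linked articles (a max depth of 2) and writes the corpus to corpus.txt. The starting URL is just an example; any Wikipedia or Fandom article URL should work (use python3 on Mac):
> python wiki_scraper.py corpus.txt https://en.wikipedia.org/wiki/Natural_language_processing 2 3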
If you would like to scrape your own web pages, I would recommend following the BeautifulSoup tutorial series by Tech With Tim. He goes into detail on how to look through the HTML source of websites and programmatically parse them for whatever content you need.
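As a starting point, here is a minimal sketch of that workflow. The URL and the "headline" tag/class are hypothetical placeholders; you would inspect your target page's HTML (for example, with your browser's developer tools) and substitute the real elements you want to extract:
from bs4 import BeautifulSoup
import requests

# Hypothetical target page and class name; replace both with the real
# URL and the tags/classes you find in the page's HTML source.
html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")

# The <title> element exists on nearly every page
print(soup.title.get_text())

# Print the text of every <h2> with class "headline" (a placeholder)
for headline in soup.find_all("h2", class_="headline"):
    print(headline.get_text().strip())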
For our purposes, the final format should be a list of paragraphs, where each paragraph is a list of words and punctuation beginning with '<START>' and ending with '<END>'. For example:
[['<START>', 'japan', 'to', 'revise', 'long', '-', 'term', 'energy', 'demand', 'downwards', 'the',
'ministry', 'of', 'international', 'trade', 'and', 'industry', '(', 'miti', ')', 'will', 'revise',
'its', 'long', '-', 'term', 'energy', 'supply', '/', 'demand', 'outlook', 'by', 'august', 'to',
'meet', 'a', 'forecast', 'downtrend', 'in', 'japanese', 'energy', 'demand', ',', 'ministry',
'officials', 'said', '.', 'miti', 'is', 'expected', 'to', 'lower', 'the', 'projection', 'for',
'primary', 'energy', 'supplies', 'in', 'the', 'year', '2000', 'to', '550', 'mln', 'kilolitres',
'(', 'kl', ')', 'from', '600', 'mln', ',', 'they', 'said', '.', 'the', 'decision', 'follows',
'the', 'emergence', 'of', 'structural', 'changes', 'in', 'japanese', 'industry', 'following',
'the', 'rise', 'in', 'the', 'value', 'of', 'the', 'yen', 'and', 'a', 'decline', 'in', 'domestic',
'electric', 'power', 'demand', '.', 'miti', 'is', 'planning', 'to', 'work', 'out', 'a', 'revised',
'energy', 'supply', '/', 'demand', 'outlook', 'through', 'deliberations', 'of', 'committee',
'meetings', 'of', 'the', 'agency', 'of', 'natural', 'resources', 'and', 'energy', ',', 'the',
'officials', 'said', '.', 'they', 'said', 'miti', 'will', 'also', 'review', 'the', 'breakdown',
'of', 'energy', 'supply', 'sources', ',', 'including', 'oil', ',', 'nuclear', ',', 'coal', 'and',
'natural', 'gas', '.', 'nuclear', 'energy', 'provided', 'the', 'bulk', 'of', 'japan', "'", 's',
'electric', 'power', 'in', 'the', 'fiscal', 'year', 'ended', 'march', '31', ',', 'supplying',
'an', 'estimated', '27', 'pct', 'on', 'a', 'kilowatt', '/', 'hour', 'basis', ',', 'followed',
'by', 'oil', '(', '23', 'pct', ')', 'and', 'liquefied', 'natural', 'gas', '(', '21', 'pct', '),',
'they', 'noted', '.', '<END>'], ... ]
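Because the scraper writes the corpus out with str(corpus), it is saved as a Python list literal, which you can load back with ast.literal_eval. A minimal sketch, assuming the output file was named corpus.txt:
import ast
import codecs

# The corpus was saved via str(corpus), i.e., as a Python literal, so
# ast.literal_eval can parse it back safely (unlike a bare eval).
with codecs.open('corpus.txt', 'r', 'utf-8') as f:
    corpus = ast.literal_eval(f.read())

print(len(corpus), 'paragraphs')
print(corpus[0][:10])  # first ten tokens of the first paragraph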