Goal: Search Common Crawl for specific URLs and for certain terms
https://commoncrawl.org/the-data/tutorials/
Problem:
It's roughly a whole internet's worth of data, captured at multiple points in time. That's a lot of data, so we'll need to get specific about where and what we're searching for.
Here's a summary of the problem:
https://www.slideshare.net/AmazonWebServices/aws-public-data-sets-how-to-stage-petabytes-of-data-for-analysis-in-aws-wps326-aws-reinvent-2018
Slide 8 onward describes what we need to do.
The crawl data is stored in the WARC format:
https://en.wikipedia.org/wiki/WARC_(file_format)
Other details:
https://dzone.com/articles/need-billions-of-web-pages-dont-bother-crawling
A 5-minute intro: https://www.youtube.com/watch?v=y4GZ0Ey9DVw
Hosted on Amazon: https://aws.amazon.com/marketplace/pp/prodview-zxtb4t54iqjmy?sr=0-1&ref_=beagle&applicationId=AWSMPContessa#resources
GPT's thoughts: https://chat.openai.com/share/659db517-cbf6-4cf2-8c40-7cfacfa3cda2
---
# Steps to search
Target URL:
https://www.forbes.com/sites/darrynpollock/2021/10/12/polkadot-founder-gavin-wood-on-the-rise-of-web-30/
2021 crawls to check (prefixes from the bucket listing):
PRE CC-MAIN-2021-04/
PRE CC-MAIN-2021-10/
PRE CC-MAIN-2021-17/
PRE CC-MAIN-2021-21/
PRE CC-MAIN-2021-25/
PRE CC-MAIN-2021-31/
PRE CC-MAIN-2021-39/
PRE CC-MAIN-2021-43/
PRE CC-MAIN-2021-49/
curl --connect-timeout 60 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-04-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 1forbes.paths
curl --connect-timeout 60 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-10-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 2forbes.paths
curl --connect-timeout 60 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-17-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 3forbes.paths
curl --connect-timeout 60 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-21-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 4forbes.paths
curl --connect-timeout 90 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-25-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 5forbes.paths
curl --connect-timeout 60 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-31-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 6forbes.paths
curl --connect-timeout 90 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-39-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 7forbes.paths
curl --connect-timeout 60 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-43-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 8forbes.paths
curl --connect-timeout 90 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-49-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 9forbes.paths
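The per-crawl queries above differ only in the crawl ID and the output file number, so they could be generated by a loop. This is a rough sketch, not a tested pipeline: the helper names `index_url` and `fetch_all` are made up for illustration, the crawl IDs are copied from the PRE listing, and the single 90-second timeout is an assumption (the commands above alternate between 60 and 90).

```shell
#!/usr/bin/env bash
# Crawl IDs copied from the PRE listing in these notes.
CRAWLS="CC-MAIN-2021-04 CC-MAIN-2021-10 CC-MAIN-2021-17 CC-MAIN-2021-21 CC-MAIN-2021-25 CC-MAIN-2021-31 CC-MAIN-2021-39 CC-MAIN-2021-43 CC-MAIN-2021-49"

# The target URL, already percent-encoded, with a trailing * wildcard.
ENCODED_URL='https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*'

# Build the index query URL for one crawl.
# $1 = crawl ID, $2 = percent-encoded URL pattern.
# (Hypothetical helper name; the URL shape is the one used in the commands above.)
index_url() {
  printf 'http://index.commoncrawl.org/%s-index?url=%s&output=json' "$1" "$2"
}

# Query every crawl in turn, writing 1forbes.paths, 2forbes.paths, and so on.
# Nothing is fetched until this function is called.
fetch_all() {
  local n=1 crawl
  for crawl in $CRAWLS; do
    curl --connect-timeout 90 --silent "$(index_url "$crawl" "$ENCODED_URL")" > "${n}forbes.paths"
    n=$((n + 1))
  done
}
```

Sourcing the file only defines the functions; calling `fetch_all` runs the nine queries. Keeping the crawl IDs in one list makes it harder to query the same crawl twice by accident.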
[other searches](/bQ4v5BnYTBSRzM9rhEOddg)
Pretty-print one random record from a results file:
sort -R hn.paths | head -n1 | python3 -m json.tool
Download a whole WARC file by its path:
curl -O https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-04/segments/1484560279657.18/warc/CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz
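Downloading a whole WARC file pulls far more than the one page we want. Each index record carries `filename`, `offset`, and `length` fields, so it should be possible to request only that record's bytes with an HTTP range and decompress just that slice. The sketch below assumes exactly that; `byte_range` and `fetch_record` are hypothetical helper names, and `BASE_URL` is an assumption copied from the full-file download in these notes (swap in whichever host currently serves the data).

```shell
#!/usr/bin/env bash
# Assumption: same host as the full-file download in these notes.
BASE_URL='https://commoncrawl.s3.amazonaws.com'

# Turn a record's offset and length into an HTTP Range value.
# Ranges are inclusive, so the last byte is offset + length - 1.
# $1 = offset, $2 = length.
byte_range() {
  printf '%s-%s' "$1" "$(( $1 + $2 - 1 ))"
}

# Fetch a single record and decompress it to stdout.
# $1 = filename, $2 = offset, $3 = length (all from one index record).
# Nothing is fetched until this function is called.
fetch_record() {
  curl --silent --range "$(byte_range "$2" "$3")" "$BASE_URL/$1" | gunzip
}
```

The three arguments to `fetch_record` come straight from one line of a `.paths` file. The output would be the WARC headers, the HTTP headers, and then the page body, which can be piped into `grep` to look for terms.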
Broader search, everything under the author's path in one crawl:
curl --connect-timeout 90 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-43-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F*&output=json' > forbes.paths
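A wildcard query like this can return many records, so a quick filter helps narrow them before fetching anything. The sketch below keeps records with status 200 whose line mentions a term. It matches against the index line (mostly the URL), not the page content, and `filter_hits` is a made-up helper name. It assumes the results file holds one JSON object per line with a `"status"` field, as the `output=json` responses do.

```shell
#!/usr/bin/env bash
# Keep only successful captures whose index line mentions a term.
# $1 = term (case-insensitive, matched as a fixed string)
# $2 = results file with one JSON record per line
filter_hits() {
  grep -E '"status": *"200"' "$2" | grep -i -F -- "$1"
}
```

For example, `filter_hits polkadot forbes.paths` would list the successfully captured URLs under the author's path that mention "polkadot".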