Common Crawl Project

Goal: Search Common Crawl for specific URLs and some terms.
https://commoncrawl.org/the-data/tutorials/

Problem: It's the entire internet's worth of data, at multiple points in time. It's a lot of data, so we'll need to get specific about where and what we're searching for.

Here's a summary of the problem (slide 8 onward describes what we need to do):
https://www.slideshare.net/AmazonWebServices/aws-public-data-sets-how-to-stage-petabytes-of-data-for-analysis-in-aws-wps326-aws-reinvent-2018

It's in this format: https://en.wikipedia.org/wiki/WARC_(file_format)

Other details: https://dzone.com/articles/need-billions-of-web-pages-dont-bother-crawling

A 5 min intro: https://www.youtube.com/watch?v=y4GZ0Ey9DVw

Hosted on Amazon: https://aws.amazon.com/marketplace/pp/prodview-zxtb4t54iqjmy?sr=0-1&ref_=beagle&applicationId=AWSMPContessa#resources

GPT's thoughts: https://chat.openai.com/share/659db517-cbf6-4cf2-8c40-7cfacfa3cda2

---

# Steps to search

https://www.forbes.com/sites/darrynpollock/2021/10/12/polkadot-founder-gavin-wood-on-the-rise-of-web-30/

Crawls available for 2021:

    PRE CC-MAIN-2021-04/
    PRE CC-MAIN-2021-10/
    PRE CC-MAIN-2021-17/
    PRE CC-MAIN-2021-21/
    PRE CC-MAIN-2021-25/
    PRE CC-MAIN-2021-31/
    PRE CC-MAIN-2021-39/
    PRE CC-MAIN-2021-43/
    PRE CC-MAIN-2021-49/

Query the index of each crawl for the article URL, one output file per crawl:

1 - curl --connect-timeout 60 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-04-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 1forbes.paths

2 - curl --connect-timeout 60 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-10-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 2forbes.paths

3 - curl --connect-timeout 60 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-17-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 3forbes.paths

4 - curl --connect-timeout 60 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-21-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 4forbes.paths

5 - curl --connect-timeout 90 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-25-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 5forbes.paths

6 - curl --connect-timeout 60 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-31-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 6forbes.paths

7 - curl --connect-timeout 90 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-39-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 7forbes.paths

8 - curl --connect-timeout 60 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-43-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 8forbes.paths

9 - curl --connect-timeout 90 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-49-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*&output=json' > 9forbes.paths

[other searches](/bQ4v5BnYTBSRzM9rhEOddg)

Pretty-print one random record from an index result file:

    sort -R hn.paths | head -n1 | python3 -m json.tool

Download a full WARC file:

    curl -O https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-04/segments/1484560279657.18/warc/CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz

Search everything under the author's path in a single crawl:

    curl --connect-timeout 90 --silent 'http://index.commoncrawl.org/CC-MAIN-2021-43-index?url=https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F*&output=json' > forbes.paths
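
The per-crawl index queries above differ only in the crawl ID and the output file name, so they could be generated by a loop rather than edited by hand. A minimal sketch, assuming a POSIX shell: `CRAWLS`, `TARGET`, `index_url`, `byte_range`, and the `RUN_FETCH` switch are hypothetical names introduced here, not part of the notes. `byte_range` additionally assumes each JSON line returned by the index carries `offset` and `length` fields for its record.

```shell
#!/usr/bin/env bash
# Sketch only: generate one index query per 2021 crawl.
# By default this prints the query URLs; set RUN_FETCH=1 to actually call curl.

# Crawl IDs copied from the PRE listing in the notes.
CRAWLS="CC-MAIN-2021-04 CC-MAIN-2021-10 CC-MAIN-2021-17 CC-MAIN-2021-21 CC-MAIN-2021-25 CC-MAIN-2021-31 CC-MAIN-2021-39 CC-MAIN-2021-43 CC-MAIN-2021-49"

# Percent-encoded article URL with the trailing * wildcard, as used in the notes.
TARGET='https%3A%2F%2Fwww.forbes.com%2Fsites%2Fdarrynpollock%2F2021%2F10%2F12%2Fpolkadot-founder-gavin-wood-on-the-rise-of-web-30%2F*'

# index_url CRAWL ENCODED_URL: print the index query URL for one crawl.
index_url() {
  printf 'http://index.commoncrawl.org/%s-index?url=%s&output=json\n' "$1" "$2"
}

# byte_range OFFSET LENGTH: print "start-end" (inclusive) for an HTTP Range request.
# Assumes OFFSET and LENGTH come from a record in a .paths file.
byte_range() {
  printf '%d-%d\n' "$1" "$(( $1 + $2 - 1 ))"
}

n=0
for crawl in $CRAWLS; do
  n=$((n + 1))
  url=$(index_url "$crawl" "$TARGET")
  if [ "${RUN_FETCH:-0}" = "1" ]; then
    # Same naming scheme as the notes: 1forbes.paths ... 9forbes.paths
    curl --connect-timeout 60 --silent "$url" > "${n}forbes.paths"
  else
    echo "$url"
  fi
done
```

Run as-is, this prints nine query URLs for inspection; with `RUN_FETCH=1` it writes `1forbes.paths` through `9forbes.paths`. Given a record's `offset` and `length`, passing `"$(byte_range OFFSET LENGTH)"` to `curl -r` against the record's WARC file should fetch just that record instead of the whole file, though that step is untested here.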