There have been several reports of sites hosted on GitHub Pages being blocked by Airtel. Most of these sites are fronted by CloudFlare with Flexible SSL. Flexible SSL leaves the CloudFlare-to-origin leg unencrypted, so Airtel (as CloudFlare's network vendor) can MITM the CloudFlare<>GitHub Pages connection and block the site.
To validate this hypothesis, I want a dataset of all sites hosted on GitHub Pages. With this in hand, it becomes much easier to find out what's hosted behind CloudFlare, what's hosted directly on GitHub, where SSL termination happens, and whether a site is blocked.
In practice, that means a dataset of almost all sites powered by GitHub Pages with a custom domain.
I came up with several approaches, but haven't found anything that works.
The GitHub API reports 302k results for CNAME files, but paging through them over the API hits the secondary rate limits very quickly (as early as the second request).
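To make the rate-limit problem concrete, here is a minimal sketch of how a client could decide how long to pause before retrying. It assumes the behaviour described in GitHub's REST API documentation: a rate-limited response comes back as 403 or 429, optionally with a `retry-after` header, or with `x-ratelimit-remaining: 0` and an `x-ratelimit-reset` timestamp. The function name `backoff_seconds` and the one-minute fallback are my own choices, and treating every 403 as a rate limit is a simplifying assumption. Note also that the search endpoints document a cap of 1,000 results per query, so backing off alone would not reach all 302k results.

```python
import time


def backoff_seconds(status, headers, attempt, now=None):
    """Return how many seconds to wait before retrying (0 means no wait).

    `headers` is a plain dict of response headers; `attempt` counts
    previous retries, starting at 0. Hypothetical helper, not a GitHub API.
    """
    if status not in (403, 429):
        return 0
    now = time.time() if now is None else now
    # Header names are case-insensitive, so normalise before lookup.
    h = {k.lower(): v for k, v in headers.items()}
    if "retry-after" in h:
        return int(h["retry-after"])
    if h.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in h:
        # The reset header is a UTC epoch timestamp in seconds.
        return max(0, int(h["x-ratelimit-reset"]) - int(now))
    # No hint from the server: wait at least a minute, doubling per attempt.
    return 60 * (2 ** attempt)
```

A caller would sleep for the returned number of seconds and then repeat the same request.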
I have a research account on Censys, but Censys scans each host and does a reverse DNS lookup afterwards. It also limits the number of "virtual hosts per IP" to 500 or so. With only 10 or so IPs fronting GitHub Pages, the result is not representative.
Both censys.io and crt.sh have certificate data for these sites, but nothing in the certs indicates that a site is hosted on GitHub Pages. GitHub-issued TLS certs come from Let's Encrypt, while CloudFlare uses its own CA.
Google search has the same problem: there is no indication in the results to search for, and Google doesn't let me search by response headers.
For every site hosted on GitHub, the server returns a unique set of headers that is fairly indicative: X-GitHub-Request-Id, X-Fastly-Request-ID, Server: GitHub.com. For sites fronted with CloudFlare, we get server: cloudflare and a few other CloudFlare headers.
So if I wanted to search for sites on GitHub Pages fronted with CloudFlare, I would want to look for sites that return an X-GitHub-Request-Id header along with server: cloudflare.
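The header check described above can be sketched as a small classifier. This is only an illustration using the header names mentioned in this post. The function names and category labels are hypothetical, and it assumes CloudFlare passes the X-GitHub-Request-Id header through from the origin, as observed above.

```python
import urllib.error
import urllib.request


def classify(headers):
    """Classify a site from its response headers (a plain dict)."""
    # Header names are case-insensitive, so normalise before lookup.
    h = {k.lower(): v for k, v in headers.items()}
    on_github = "x-github-request-id" in h
    server = h.get("server", "").lower()
    if on_github and server == "cloudflare":
        return "github-pages-behind-cloudflare"
    if on_github:
        return "github-pages-direct"
    if server == "cloudflare":
        return "cloudflare-unknown-origin"
    return "other"


def fetch_headers(url, timeout=10):
    """Issue a HEAD request and return the response headers as a dict."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return dict(resp.headers.items())
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry the headers we care about.
        return dict(err.headers.items())
```

For a single site this would be called as `classify(fetch_headers("https://example.org/"))`, where the URL is a placeholder. It only helps once a candidate list of domains exists, which is the part I'm missing.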
The Common Crawl dataset sounds like the best way to get this, but I'm unsure about its coverage. Are there better ways to solve this? Is there a search engine that lets me search by response headers?
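For the Common Crawl route, the WARC files store the raw HTTP response headers next to each capture, so the filter itself is simple once the records are extracted. The sketch below assumes the records have already been read into (URL, raw HTTP response bytes) pairs by some WARC reader. Both function names are hypothetical, and it skips *.github.io hosts because only custom domains are of interest here.

```python
from urllib.parse import urlsplit


def parse_http_headers(raw):
    """Parse a raw HTTP response into a dict with lowercase header names."""
    text = raw.decode("iso-8859-1")
    # Headers end at the first blank line; anything after it is the body.
    head = text.split("\r\n\r\n", 1)[0]
    headers = {}
    for line in head.split("\r\n")[1:]:  # skip the status line
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def github_pages_hosts(records):
    """Return custom-domain hosts whose responses carry X-GitHub-Request-Id.

    `records` is an iterable of (url, raw_http_response_bytes) pairs.
    """
    hosts = set()
    for url, raw in records:
        if "x-github-request-id" in parse_http_headers(raw):
            host = urlsplit(url).hostname
            if host and not host.endswith(".github.io"):
                hosts.add(host)
    return hosts
```

This says nothing about coverage: a site only shows up if the crawl happened to fetch at least one page from it.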