This is a challenging area: the long-term mitigation for how Insights obtains data from the source tables also needs to fit with our strategy for scaling dotcom in a resilient way and with our broader vision for moving to the Cloud.
Today, the analytics replicas exist to support the data warehouse use case; scraping data directly from MySQL from other sources is simply not a funded use case. Shifting this paradigm would require us to reassess our headcount and our hardware/VM footprint, and could be very costly to GitHub.
In addition to reliability, we also want to mature our practices around security. From a security and privacy standpoint, we cannot allow direct access to production data by services that do not own that data.
That said, we also need to consider how we invest in our overall platform and how to optimize our architecture for the future rather than for what we have today. We are moving quickly to a sharded environment where we want many small shards instead of one single shard. In that architecture, no matter how many dedicated read replicas we have for ETL, bulk ETL queries will fail at the vtgate layer because they tend to require vtgate to scatter across every shard and aggregate the results itself.
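To make that failure mode concrete, here is a minimal sketch of the scatter-gather work vtgate takes on when a bulk query spans a sharded keyspace. The shard layout, table contents, and `actor` grouping key are illustrative assumptions, not our actual schema or Vitess internals:

```python
from collections import Counter

# Hypothetical: rows sharded by repo_id across several shards. A bulk ETL
# query like `SELECT actor, COUNT(*) FROM events GROUP BY actor` gives the
# router no single shard to target, so it must fan out to every shard and
# merge the partial results itself.
SHARDS = [
    [{"repo_id": 1, "actor": "alice"}, {"repo_id": 2, "actor": "bob"}],
    [{"repo_id": 3, "actor": "alice"}, {"repo_id": 4, "actor": "carol"}],
    [{"repo_id": 5, "actor": "bob"}, {"repo_id": 6, "actor": "alice"}],
]

def scatter_gather_count_by_actor(shards):
    """Simulate the router fanning a GROUP BY out to all shards and merging."""
    merged = Counter()
    for shard in shards:                       # one query per shard: fan-out cost
        partial = Counter(row["actor"] for row in shard)
        merged.update(partial)                 # cross-shard aggregation at the router
    return merged

print(scatter_gather_count_by_actor(SHARDS))
# Counter({'alice': 3, 'bob': 2, 'carol': 1})
```

The merge step is where bulk ETL hurts: the routing layer has to hold and combine partial results from every shard, so the cost grows with shard count and result size rather than with any single shard, and adding read replicas does not change that.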
We need to be thinking about what will work for the future of GitHub overall as we build more resilient systems and prepare to move towards an eventually consistent Cloud architecture. I would recommend investing in an architecture that strengthens the production queues, which likely means improving how applications publish data needed by multiple consumers. This has added benefits that support new use cases, and it is something we must do anyway to build a more resilient architecture and to prepare for the Cloud.
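As a sketch of the publishing pattern this points at (the bus, topic name, and event shape are illustrative assumptions, not an existing GitHub interface): the owning application emits a change event once, and any number of consumers, Insights included, subscribe to the stream instead of reading the owner's tables.

```python
import json
from typing import Callable, Dict, List

class EventBus:
    """Toy stand-in for a durable queue/topic (e.g. a Kafka-style broker)."""
    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = {}

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # The owner publishes once; every consumer gets its own copy,
        # so none of them need read access to the owner's tables.
        for handler in self._subscribers.get(topic, []):
            handler(event)

bus = EventBus()
# Hypothetical consumers: the warehouse loader and Insights both subscribe.
bus.subscribe("issues.changed", lambda e: print("warehouse ETL:", json.dumps(e)))
bus.subscribe("issues.changed", lambda e: print("insights:", json.dumps(e)))

# The owning application emits the change at write time.
bus.publish("issues.changed", {"issue_id": 42, "state": "closed"})
```

A real implementation would sit on a durable, replayable log rather than in-process callbacks, but the ownership boundary is the point: consumers couple to a published event contract instead of to production MySQL, which also addresses the security concern above and sidesteps cross-shard scatter queries entirely.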