# Count Downloads Using CDN Logs

## Problem

crates.io counts downloads by crate and version. This is currently done as part of the `/download` endpoint, which counts the download and then redirects the caller to the Content Delivery Networks (CDNs) for `static.crates.io`, from where the actual file is downloaded.

```mermaid
sequenceDiagram
    User->>crates.io: Requests crate
    crates.io->>crates.io: Counts crate and version
    crates.io->>User: Redirects user to static.crates.io
    User->>static.crates.io: Requests crate
    static.crates.io->>User: Serves crate file
```

Due to the volume of requests to the `/download` endpoint, counting the crate and its version in the application has a significant performance cost. Especially when traffic spikes, the application can struggle to keep up with requests, which in the worst case can cause a service outage.

## Goal

### Key Objectives

1. Avoid hitting the web app for every crate download
2. Continue to count downloads by crate and version

### Desired Outcome

In the ideal scenario, we avoid hitting the web app for download requests altogether and go straight to the CDNs. We can achieve this by changing the `dl` field in the index's `config.json` to point to the CDN instead of the application. Full compatibility with existing behavior requires rewriting some URLs, which [has already been implemented](https://github.com/rust-lang/simpleinfra/pull/365).

```mermaid
sequenceDiagram
    User->>static.crates.io: Requests crate
    static.crates.io->>User: Serves crate file
```

The CDNs could attempt to count downloads themselves, but this is difficult because the CDNs are globally distributed. There is no single point that receives all the traffic, so download counts would need to be processed and merged somewhere else. That system would quickly face the same performance issues that `crates.io` currently faces.

We can instead use the request logs from the CDNs to count downloads asynchronously. The CDNs produce a single log line per request.
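As a sketch of the `dl` change described under "Desired Outcome": the index's `config.json` currently points `dl` at the crates.io API, and would instead point at the CDN using Cargo's download URL markers. The exact CDN URL shown here is illustrative, not the final configuration.

```json
{
  "dl": "https://static.crates.io/crates/{crate}/{crate}-{version}.crate",
  "api": "https://crates.io"
}
```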
These logs are collected and uploaded periodically to a dedicated S3 bucket as compressed archives. Whenever a new archive is uploaded to the bucket, S3 can push an event into an SQS queue. `crates.io` can monitor the queue and pull incoming events. From the event, it can determine what files to fetch from S3, download and parse them, and update the download counts in the database.

```mermaid
sequenceDiagram
    static.crates.io ->> S3: Uploads logs
    S3 ->> SQS: Queues event
    crates.io ->> SQS: Pulls event from queue
    crates.io ->> S3: Fetches new log file
    crates.io ->> crates.io: Parses log file
    crates.io ->> crates.io: Updates download counts
```

### Benefits

- Logs are processed asynchronously and in batches. This reduces the load on the server, especially during traffic spikes.
- Publishing events into SQS is natively supported on AWS and does not require any additional infrastructure (besides an SQS queue).
- crates.io is already integrated with S3 to manage crates. Its access can easily be extended to grant access to the SQS queue as well as the logs bucket.
- Monitoring the queue and pulling from SQS can be implemented within the existing crates.io codebase. Alternative solutions would require additional infrastructure and configuration, which would fragment the codebase and make long-term maintenance more difficult.

### Notes

- Logs from CloudFront and Fastly use different formats.
- Compressed archives are typically between 5 and 20 MB in size.
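To make the parsing step concrete, here is a minimal sketch of extracting a crate name and version from a logged request path. It assumes the `/crates/{name}/{name}-{version}.crate` path layout used by `static.crates.io`; since CloudFront and Fastly log lines are formatted differently, extracting the URI path from the raw log line is left to the respective log parser. The function name is hypothetical.

```rust
/// Extract the crate name and version from a CDN request path.
///
/// Assumes the `/crates/{name}/{name}-{version}.crate` layout used by
/// static.crates.io (an assumption of this sketch). Returns `None` for
/// paths that are not crate downloads.
fn parse_download_path(path: &str) -> Option<(String, String)> {
    let rest = path.strip_prefix("/crates/")?;
    let (name, file) = rest.split_once('/')?;
    let stem = file.strip_suffix(".crate")?;
    // The file name repeats the crate name, followed by `-{version}`.
    let version = stem.strip_prefix(name)?.strip_prefix('-')?;
    if version.is_empty() {
        return None;
    }
    Some((name.to_string(), version.to_string()))
}

fn main() {
    assert_eq!(
        parse_download_path("/crates/serde/serde-1.0.193.crate"),
        Some(("serde".to_string(), "1.0.193".to_string()))
    );
    // Non-download requests (e.g. index or config files) are ignored.
    assert_eq!(parse_download_path("/config.json"), None);
}
```

Counting would then amount to aggregating these `(name, version)` pairs per batch before writing a single increment per version to the database.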
## Tasks

### Infra-Team

- [ ] Create a new AWS account for crates.io
- [ ] Create a new SQS queue
- [ ] Enable publishing an event from S3 when a new archive is uploaded
- [ ] Grant crates.io access to the SQS queue and the S3 bucket with the logs

### crates.io

- [ ] Create a job that monitors the SQS queue
- [ ] Fetch and parse new log files
- [ ] Update the counts in the database
- [ ] Change the `dl` field to point to the CDN

## Resources

- 2023-11-02: [Counting crate downloads](https://rust-lang.zulipchat.com/#narrow/stream/242791-t-infra/topic/Counting.20crate.20download) in #t-infra on Zulip