# Cargo Sparse Indexes

This is a request for feedback on the new sparse index support in Cargo. This document is aimed at implementers of [Cargo registries](https://doc.rust-lang.org/cargo/reference/registries.html). The Cargo Team is looking for feedback as we head towards stabilizing this feature. If you have any feedback, please let us know by [filing an issue](https://github.com/rust-lang/cargo/issues) or contacting us on [#t-cargo on Zulip](https://rust-lang.zulipchat.com/#narrow/stream/246057-t-cargo).

Some questions that we have are:

* Do you foresee problems implementing support for sparse indexes, or with how your users will be able to use it?
* How important is easy migration from git to HTTP sparse to your users?
* How important is it to support serving an index from git and HTTP sparse simultaneously?
* A separate document [Sparse registry selection] describes a future extension to `config.json` to make migration easier. Does this solution look like it may meet your needs?
* Any other feedback is welcome.

## Introduction

Sparse indexes are a lighter-weight alternative to the [git index](https://doc.rust-lang.org/cargo/reference/registries.html#index-format), first proposed in [RFC 2789](https://rust-lang.github.io/rfcs/2789-sparse-index.html). At a high level, a sparse index is essentially the same as the git index, except the files are retrieved over HTTPS. Cargo only fetches the files it needs for a build, significantly reducing the amount of data it needs to download and store on disk.

For example, the sparse index entry for the `regex` crate on crates.io can be found at https://index.crates.io/re/ge/regex. The directory structure is the same as in the git index.

## HTTP behavior

It is recommended (though not required) that the HTTP index server support HTTP/2. Cargo attempts to use pipelining and HTTP/2 to improve the performance of fetching a large number of files in batches.
Cargo may open multiple connections to the host to further speed up the queries. However, it tries to limit this to avoid flooding the server (currently a limit of 2 connections).

Cargo fetches the `config.json` file at the start in order to get the configuration for the registry.

### Caching

Proper cache control may improve performance for users. Cargo makes use of `ETag` and `Last-Modified` headers to try to avoid downloading files that have already been downloaded and haven't changed.

Cargo passes the `If-None-Match` header with the contents of the `ETag` header from a previous response. Servers should respond with the 304 "Not Modified" status code if the file has not changed.

If the `ETag` header was not present, Cargo passes the `If-Modified-Since` header with the contents of the `Last-Modified` header from a previous response. This gives servers that don't support `ETag` a similar way to respond with the 304 "Not Modified" status code if the file has not changed. It is recommended that servers support `ETag` for improved performance.

### Cache invalidation

If a registry uses a CDN or proxy that caches access to the index files, it is recommended that the registry implement some form of cache invalidation when the files are updated. If these caches are not updated, users may not be able to access new crates until the cache is cleared.

### Nonexistent responses

Servers should respond with a 404 "Not Found", 410 "Gone", or 451 "Unavailable For Legal Reasons" status code for crates that don't exist.

## How Cargo decides which protocol to use

### crates.io

Since Cargo has built-in knowledge of [crates.io](https://crates.io/), it also has built-in knowledge of whether to use the git or sparse protocol.
Initially, Cargo will require an opt-in to switch crates.io to use the sparse index via a config option:

```toml
# config.toml example
[registries.crates-io]
protocol = "sparse"  # Can be "sparse" or "git"
```

This can also be set with the `CARGO_REGISTRIES_CRATES_IO_PROTOCOL` environment variable.

At some point in the near future, we will be looking to change the default of crates.io to "sparse" (see [tracking issue #10965](https://github.com/rust-lang/cargo/issues/10965)).

### Registry migration

For other registries, the initial support will be more limited. At this time, there is no automatic detection system, nor a simple config option similar to the crates.io case. A more sophisticated option is proposed in [Sparse registry selection], which discusses an extension to `config.json` to assist with a migration or dual-protocol support. We would be happy to have your feedback on it.

Today, a sparse registry can be accessed by prefixing the index URL with `sparse+`. For example:

```toml
[registries.my-registry]
index = "sparse+https://example.com/crates-index/"
```

The crux of the problem for sparse registries is that this URL gets embedded in `Cargo.lock` files, in the `Cargo.toml` files inside the `.crate` files, and in the index (for cross-registry support). Cargo needs some way to know that the git and sparse URLs are equivalent, and only use one of them. The `config.json` extension is a proposal on how to deal with that.

[Sparse registry selection]: https://hackmd.io/@rust-cargo-team/B13O52Zko

## Future enhancements

The following are enhancements that the Cargo team is working on or investigating:

- Authentication: There are two initiatives to add authentication support for registries: [RFC 3139 Registry Authentication](https://rust-lang.github.io/rfcs/3139-cargo-alternative-registry-auth.html) and [RFC 3231 Asymmetric Tokens](https://rust-lang.github.io/rfcs/3231-cargo-asymmetric-tokens.html).
  Both have implementations that are under review, which should be available on nightly for testing in the future.
- [Sparse registry selection] for providing a migration option for existing registries.
- Cache busting and consistency: See https://github.com/rust-lang/cargo/issues/10928
- Publish support in the presence of caching: See [#11062](https://github.com/rust-lang/cargo/pull/11062) for an enhancement where Cargo will wait until a publish is visible in the index. This should help with publishing multiple dependent crates when there are longer caching delays than are normally observed with git.

The following are things that are out of scope for the current initiative:

- End-to-end integrity and signing
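
For registry implementers, the index layout from the Introduction and the conditional-request handling from the Caching and Nonexistent responses sections can be sketched roughly as follows. This is a minimal illustration, not Cargo's or any registry's actual code: `index_path`, `Cached`, `IndexFile`, `conditional_header`, and `status_for` are hypothetical names. It assumes ASCII crate names and compares validators by exact string equality, which is simpler than full HTTP conditional-request semantics. It also always answers 404 for a missing crate, although 410 or 451 would be equally acceptable.

```rust
/// Relative path of a crate's index file, using the same layout as the git index.
/// Assumes a non-empty ASCII crate name.
fn index_path(name: &str) -> String {
    assert!(!name.is_empty() && name.is_ascii());
    let name = name.to_ascii_lowercase();
    match name.len() {
        1 => format!("1/{}", name),
        2 => format!("2/{}", name),
        3 => format!("3/{}/{}", &name[..1], name),
        _ => format!("{}/{}/{}", &name[..2], &name[2..4], name),
    }
}

/// Validators a client remembered from a previous response (hypothetical type).
struct Cached {
    etag: Option<String>,
    last_modified: Option<String>,
}

/// Current state of an index file on the server (hypothetical type).
struct IndexFile {
    etag: String,
    last_modified: String,
}

/// Client side: prefer `If-None-Match`, fall back to `If-Modified-Since`.
fn conditional_header(cached: &Cached) -> Option<(&'static str, String)> {
    if let Some(etag) = &cached.etag {
        Some(("If-None-Match", etag.clone()))
    } else {
        cached
            .last_modified
            .as_ref()
            .map(|lm| ("If-Modified-Since", lm.clone()))
    }
}

/// Server side: choose a status code for a request to one index file.
/// `file` is `None` when the crate does not exist.
fn status_for(file: Option<&IndexFile>, header: Option<(&str, &str)>) -> u16 {
    let file = match file {
        Some(f) => f,
        None => return 404, // 410 or 451 would also be acceptable
    };
    match header {
        // Simplification: exact string match instead of full validator comparison.
        Some(("If-None-Match", v)) if v == file.etag => 304,
        Some(("If-Modified-Since", v)) if v == file.last_modified => 304,
        _ => 200,
    }
}

fn main() {
    println!("{}", index_path("regex")); // prints "re/ge/regex"

    let cached = Cached {
        etag: Some("\"abc123\"".to_string()),
        last_modified: None,
    };
    let file = IndexFile {
        etag: "\"abc123\"".to_string(),
        last_modified: "Tue, 01 Nov 2022 00:00:00 GMT".to_string(),
    };
    let header = conditional_header(&cached);
    let status = status_for(
        Some(&file),
        header.as_ref().map(|(k, v)| (*k, v.as_str())),
    );
    println!("{}", status); // prints "304" because the ETag is unchanged
}
```

A real server would typically delegate validator comparison to its HTTP framework or CDN. The sketch only shows which status codes Cargo expects in each case.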