Cargo Sparse Indexes

This is a request for feedback on the new sparse index support in Cargo. This document is aimed at implementers of Cargo registries.

The Cargo Team is looking for feedback as we head towards stabilizing this feature. If you have any feedback, please let us know by filing an issue or contacting us on #t-cargo on Zulip.

Some questions that we have are:

  • Do you foresee problems implementing support for sparse indexes, or with how your users will be able to use them?
  • How important is easy migration from git to HTTP sparse to your users?
  • How important is it to support serving an index from git and HTTP sparse simultaneously?
  • A separate document, Sparse registry selection, describes a future extension to config.json to make migration easier. Does this solution look like it may meet your needs?
  • Any other feedback is welcome.

Introduction

Sparse indexes are a lighter-weight alternative to the git index, first proposed in RFC 2789.

At a high level, a sparse index is essentially the same as the git index, except the files are retrieved over HTTPS. Cargo only fetches the files it needs for a build, significantly reducing the amount of data it needs to download and store on disk.

For example, the sparse index entry for the regex crate on crates.io can be found at https://index.crates.io/re/ge/regex. The directory structure is the same as in the git index.
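To illustrate how a crate name maps to a path such as re/ge/regex, here is a minimal sketch of the layout documented for the git index: names of one, two, or three characters get dedicated directories, and longer names are nested under their first four characters. The function name index_path is an assumption made for this example and is not part of Cargo.

```python
def index_path(crate_name: str) -> str:
    """Return the relative path of a crate's index file.

    Sketch of the documented index layout; index file names are lowercase.
    """
    name = crate_name.lower()
    if not name:
        raise ValueError("crate name must not be empty")
    if len(name) == 1:
        return f"1/{name}"
    if len(name) == 2:
        return f"2/{name}"
    if len(name) == 3:
        # Three-character names are grouped by their first character.
        return f"3/{name[0]}/{name}"
    # Longer names: first two characters, then the next two, then the name.
    return f"{name[:2]}/{name[2:4]}/{name}"


print(index_path("regex"))  # re/ge/regex
```

A registry serving a sparse index would expose each crate's file at this path relative to the index root.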

HTTP behavior

It is recommended (though not required) that the HTTP index server support HTTP/2. Cargo attempts to use pipelining and HTTP/2 to improve the performance of fetching a large number of files in batches.

Cargo may open multiple connections to the host to further speed up the queries. However, it tries to limit this to avoid flooding the server (currently a limit of 2 connections).

Cargo fetches the config.json file at the start in order to get the configuration for the registry.

Caching

Proper cache control may improve the performance for users. Cargo makes use of ETag and Last-Modified headers to try to avoid downloading files that have already been downloaded and haven't changed.

Cargo passes the If-None-Match header with the contents of the ETag header from a previous response. Servers should respond with the 304 "Not Modified" status code if the file has not changed.

If the ETag header was not present, Cargo instead passes the If-Modified-Since header with the contents of the Last-Modified header from a previous response. This gives servers that don't support ETag a similar mechanism: the server can respond with the 304 "Not Modified" status code if the file has not changed.
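The header selection described above can be sketched from the client's point of view as follows. This is an illustration of the rule as stated in this document, not Cargo's actual code; the function name conditional_headers and the dictionary shape used for the cached response headers are assumptions.

```python
def conditional_headers(cached_headers: dict) -> dict:
    """Pick the conditional request header for a previously fetched file.

    `cached_headers` holds the ETag / Last-Modified values saved from an
    earlier response (hypothetical storage format for this sketch).
    """
    etag = cached_headers.get("ETag")
    if etag:
        # Prefer ETag when the earlier response carried one.
        return {"If-None-Match": etag}
    last_modified = cached_headers.get("Last-Modified")
    if last_modified:
        # Fall back to the modification date when no ETag was present.
        return {"If-Modified-Since": last_modified}
    # Nothing cached: make an unconditional request.
    return {}
```

For example, a cached response with only a Last-Modified value would lead to a request carrying If-Modified-Since and no If-None-Match.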

It is recommended that servers support ETag for improved performance.
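On the server side, the decision between 200 and 304 might look like the following sketch. It assumes the ETag is derived from a hash of the file contents (one possible choice, not a requirement) and simplifies HTTP's comparison rules; the names make_etag and response_status are hypothetical.

```python
import hashlib
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from the file contents (assumed scheme)."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def response_status(request_headers: dict, etag: str, last_modified: datetime) -> int:
    """Return 304 if the client's cached copy is still current, else 200.

    `last_modified` is expected to be a timezone-aware datetime.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        if if_none_match.strip() == "*":
            return 304
        # The header may list several tags; ignore the weak-validator prefix.
        tags = [t.strip() for t in if_none_match.split(",")]
        tags = [t[2:] if t.startswith("W/") else t for t in tags]
        return 304 if etag in tags else 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # Unparseable date: send the full file.
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution.
        return 304 if last_modified.replace(microsecond=0) <= since else 200

    return 200
```

A content-derived ETag changes whenever the index file changes, which keeps the check independent of file timestamps.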

Cache invalidation

If a registry is using some kind of CDN or proxy which caches access to the index files, then it is recommended that registries implement some form of cache invalidation when the files are updated. If these caches are not updated, then users may not be able to access new crates until the cache is cleared.

Nonexistent responses

Servers should respond with a 404 "Not Found", 410 "Gone", or 451 "Unavailable For Legal Reasons" status code for crates that don't exist.
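As a rough sketch of this rule, the lookup below serves a file when it exists and otherwise picks one of the listed status codes. The in-memory dictionary and the removed set are assumptions made for illustration; a real registry would consult its own storage, and could return 451 under the same pattern.

```python
def lookup(index_files: dict, removed: set, path: str):
    """Return (status, body) for a request to an index file path.

    `index_files` maps relative paths to file contents and `removed`
    holds paths of crates that were deleted (both hypothetical stores).
    """
    if path in index_files:
        return 200, index_files[path]
    if path in removed:
        # The crate existed once but is gone.
        return 410, b""
    # Unknown crate: tell the client the file does not exist.
    return 404, b""
```

With index_files containing only "re/ge/regex", a request for "no/ne/nonexistent" would yield a 404 with an empty body.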

How Cargo decides which protocol to use

crates.io

Since Cargo has built-in knowledge of crates.io, it also has built-in knowledge of whether to use the git or sparse protocol. Initially, Cargo will require an opt-in to switch crates.io to use the sparse index via a config option:

# config.toml example
[registries.crates-io]
protocol = "sparse"    # Can be "sparse" or "git"

This can also be set with the CARGO_REGISTRIES_CRATES_IO_PROTOCOL environment variable.
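The environment variable name follows Cargo's general convention for config keys: the key is uppercased, dots and dashes become underscores, and CARGO_ is prepended. A small sketch of that mapping, using a hypothetical helper name:

```python
def config_key_to_env(key: str) -> str:
    """Map a Cargo config key to its environment variable name."""
    return "CARGO_" + key.upper().replace(".", "_").replace("-", "_")


print(config_key_to_env("registries.crates-io.protocol"))
# CARGO_REGISTRIES_CRATES_IO_PROTOCOL
```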

At some point in the near future, we will be looking to change the default of crates.io to "sparse" (see tracking issue #10965).

Registry migration

For other registries, the initial support will be more limited. At this time, there is no automatic detection system, nor a simple config similar to the crates.io case. A more sophisticated option is proposed in Sparse registry selection, which discusses an extension to config.json to assist with a migration or dual-protocol support. We would be happy to have your feedback on it.

Today, a sparse registry can be accessed by prefixing the index URL with sparse+. For example:

[registries.my-registry]
index = "sparse+https://example.com/crates-index/"
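To make the prefix convention concrete, the sketch below separates the protocol marker from the underlying HTTPS URL and builds the URL of a file inside the index, such as config.json. The helper names, and the choice to add a trailing slash when one is missing, are assumptions for this example rather than a description of Cargo's internals.

```python
SPARSE_PREFIX = "sparse+"


def split_index_url(index: str):
    """Return (protocol, url) for an index setting from config.toml."""
    if index.startswith(SPARSE_PREFIX):
        return "sparse", index[len(SPARSE_PREFIX):]
    # Without the prefix, the index is treated as a git repository URL.
    return "git", index


def file_url(index: str, relative_path: str) -> str:
    """Build the HTTP URL of a file inside a sparse index."""
    protocol, base = split_index_url(index)
    if protocol != "sparse":
        raise ValueError("not a sparse index URL: " + index)
    if not base.endswith("/"):
        base += "/"
    return base + relative_path


print(file_url("sparse+https://example.com/crates-index/", "config.json"))
# https://example.com/crates-index/config.json
```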

The crux of the problem for sparse registries is that this URL gets embedded in Cargo.lock files, in the Cargo.toml files inside .crate files, and in the index (for cross-registry support). Cargo needs some way to know that the git and sparse URLs are equivalent, and to use only one of them. The config.json extension is a proposal on how to deal with that.

Future enhancements

The following are enhancements that the Cargo team is working on or investigating:

  • Authentication: There are two initiatives to add authentication support for registries: RFC 3139 Registry Authentication and RFC 3231 Asymmetric Tokens. Both have implementations that are under review, which should be available on nightly for testing in the future.
  • Sparse registry selection for providing a migration option for existing registries.
  • Cache busting and consistency: See https://github.com/rust-lang/cargo/issues/10928
  • Publish support in the presence of caching: See #11062 for an enhancement where Cargo will delay until a publish is seen in the index. This should help with publishing multiple dependent crates when there are longer caching delays than are normally observed with git.
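The idea behind the publish enhancement in the last item can be sketched as a polling loop that waits until the new version shows up in the crate's index file. This is only an illustration of the concept, not the implementation discussed in #11062. It assumes the index file contains one JSON object per line with a vers field, and fetch_index_file is a hypothetical callable that returns the file's text, or None if it does not exist yet.

```python
import json
import time


def version_in_index(index_file_text: str, version: str) -> bool:
    """Check whether any line of the index file describes `version`."""
    for line in index_file_text.splitlines():
        line = line.strip()
        if line and json.loads(line).get("vers") == version:
            return True
    return False


def wait_for_publish(fetch_index_file, version, timeout=60.0, interval=1.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll until `version` appears in the index or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        text = fetch_index_file()
        if text is not None and version_in_index(text, version):
            return True
        if clock() >= deadline:
            return False
        # Caches in front of the index may need time to refresh.
        sleep(interval)
```

Passing sleep and clock as parameters keeps the loop easy to exercise without real delays.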

The following are things that are out of scope for the current initiative:

  • End-to-end integrity and signing