# Pkg protocol extension: precompile files

The compiler team has done an excellent job of front-loading the vast majority of TTFX into the package precompilation process. This is both more psychologically pleasing and more efficient: people are more willing to wait during the compilation step, and package precompilation only happens (at most) once per instantiation of a specific manifest, whereas loading and execution happen more frequently. However, precompilation is quite slow, especially when loading packages with large dependency graphs. Can we avoid even that wait time for common use cases? This proposal outlines a design for extending the Pkg protocol so that Pkg clients can download pregenerated precompile files for common package configurations and use them immediately, without waiting for local precompilation.

Serving precompile files is significantly more complex than serving package source, however. There is not a single precompile file per package version, because precompile files are not portable across architectures, operating systems, or even Julia versions. Precompile files also depend on the exact versions of all transitive dependencies of the package being precompiled. In short, even for a specific package version, there's a potentially astronomical number of precompile files that could be produced, depending on the exact choice of dependency versions.

This complexity raises the question of whether serving pregenerated precompile files is even practical. My [simulation experiments](https://github.com/StefanKarpinski/Resolver.jl/blob/main/slices/analysis.jl) suggest that it is. I've generated manifests with random sets of 5 (also done with 10) packages installed, sampled proportional to their download popularity by unique client addresses. We populate the "cache" with the precompile files that would be generated for that manifest and then repeat the process, keeping track of how many precompile subsets are already in the cache.
The code also simulates LRU cache eviction with a fixed cache size. The resulting cache hit rate is greater than 99.9%. Moreover, the cache size remains relatively small: cache growth levels out at only a few hundred cache entries and growth after that is very slow. This cache size would be multiplied by the number of Julia versions and supported architectures, of course, but we can easily cache and serve tens of thousands of precompile files.

The simulation uses a static registry, which is unrealistic, but it shows that for a given registry state, generating and caching popular precompile files would be effective. Cached files would rapidly fall off in popularity when new versions of packages are published to the registry, but that seems manageable with simple caching strategies.

The simulation also suggests a simple cache warmup strategy: resolve each of the most popular packages in isolation and pregenerate the resulting precompile files. Many manifests could be satisfied entirely with precompiles generated this way.

The Pkg protocol uses [content-addressing](https://en.wikipedia.org/wiki/Content-addressable_storage) for most content: each package version and artifact has a resource path that includes a cryptographic hash of its content, and this path permanently refers to that exact content. A server may or may not be able to serve a given resource, but if it can, then we know it's the right content. Resource names never change or expire, so you can cache them without ever needing to worry about invalidation. Client or server can validate any resource at any time by checking that the content matches the hash that was requested; if the hash matches, then the content is correct.

How does the client get the names of content? It all starts with a single non-content-addressed entry point: the `/registries` endpoint, which tells the client the current contents of the registries it serves.
From there, those registries can be downloaded, and the registries contain the content hashes of all package versions the client can install. Packages in turn contain `Artifacts.toml` files that contain the content hashes of all the artifacts the client might need to install.

Precompile files present a challenge for content addressing. Unlike package versions, there are a practically unbounded number of possible precompile files. Unlike package versions, which naturally flow from a central source out to clients, demand for precompile files inherently flows from clients to the server. The client resolves package manifests and decides what versions of what packages it needs. Only when a client decides on a particular combination of packages to use is it possible to determine what precompile files it needs. How can we handle this inverted flow of information from client to server?

The basic design is that clients request URLs that are based on the inputs that would be used to generate the precompile files they need, and if the server already has such a file generated, the client can download and use it. If the server doesn't have such a precompile file, it can indicate to the client that it would like to know the full details of the requirements so that it can generate the precompile file and serve it in the future. Instead of naming precompile files by their content, this proposal names them by the hash of their requirements.

What are the requirements of a precompile file? Among other things, a precompile file is specific to the Julia system image that it was generated by. It is also specific to the Julia version and operating system, but all of that is baked into the system image, so we use the hash of the system image as a proxy for all of that system and version information. A precompile file also depends on the exact versions of all the packages that it depends on, which must be loaded prior to generation of the precompile file.
Thus, we also hash that set of dependencies and their versions. The trick there is to represent that data in a consistent, reproducible way so that clients and servers all agree on the hashes. Without further ado, let's dive into the actual proposal.

## Endpoints

This proposal adds the following optional HTTP endpoints on Pkg servers:

- `GET /precompile/sysimgs/$arch-$os`: get a list of hashes of known sysimgs for an architecture and operating system.
- `GET /precompile/$sys/$deps/$uuid`: request a precompile file for a regular package for a specific system image and set of specific dependency versions.
- `GET /precompile/$sys/$deps/$uuid+$ext`: request a precompile file for a package extension for a specific system image and set of specific dependency versions.
- `PUT /deps/$deps`: upload a set of dependency versions to the server.

The endpoints are optional in the sense that if any of them are not implemented by a Pkg server, Pkg clients will still be able to interact with the server and function properly. Lack of these endpoints simply prevents downloading of precompile files.

The first `GET` endpoint is used to check (once per session) if the client should ask the server for precompile files at all. The next two `GET` endpoints are used to check if a server has a specific precompile file that the client can download and use: one endpoint is for packages, the other is for extensions. The `PUT` endpoint is used to allow a client to upload a dependency set specification to the server.

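
As a rough illustration of how a client might assemble these resource paths, here is a minimal sketch. The helper names (`sysimgs_path`, `precompile_path`, `deps_path`) are hypothetical and not part of Pkg; the arguments correspond to the `$`-prefixed variables in the endpoint list.

```jl
# Hypothetical helpers (not part of Pkg) that assemble the resource paths
# listed above from their variable parts.

# List of known sysimg hashes for an architecture and operating system
sysimgs_path(arch, os) = "/precompile/sysimgs/$arch-$os"

# Precompile file for a regular package
precompile_path(sys, deps, uuid) = "/precompile/$sys/$deps/$uuid"

# Precompile file for a package extension
precompile_path(sys, deps, uuid, ext) = "/precompile/$sys/$deps/$uuid+$ext"

# Upload location for a dependency set on the same server
deps_path(deps) = "/deps/$deps"
```

For example, `sysimgs_path("aarch64", "macos")` yields `/precompile/sysimgs/aarch64-macos`. A `Base.UUID` value can be passed for `uuid`, since it interpolates in the standard lowercase UUID format.
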
The variable parts of these resource paths, prefixed with `$`, are the following:

- `arch`: system architecture (*e.g.* `x86_64`, `aarch64`)
- `os`: operating system name (*e.g.* `macos`, `linux`, `windows`)
- `sys`: cryptographic hash of a `sys.{so,dll,dylib}` system image file
- `deps`: cryptographic hash of the identities and versions of all dependencies baked into a precompile file
- `uuid`: the UUID of the package that a precompile file is for
- `ext`: the name of a package extension within the package being requested

How these are computed and used will be explained in further detail in what follows.

## Protocol

At a high level, this is the protocol interaction for serving precompile files:

**Sysimg negotiation**

- Client requests `GET /precompile/sysimgs/$arch-$os` to get a list of sysimg hashes
  - Field values:
    - `arch = Base.BinaryPlatforms.HostPlatform().tags["arch"]`
    - `os = Base.BinaryPlatforms.HostPlatform().tags["os"]`
- The response is a newline-separated list of sysimg hashes
  - Each one is the cryptographic hash of a `sys.{so,dll,dylib}` file the server has precompiles for
  - Must contain all hashes for the requested architecture and operating system
- Requested by a client to check if it should make precompile requests:
  - If response is `404 Not Found`, the server doesn't have precompiles
  - If response is `200 OK` and:
    - Client's sysimg hash isn't in the returned hash list: the server doesn't have precompiles for that sysimg; client should not make precompile requests
    - Client's sysimg hash is in the returned hash list: the server may have precompiles for that sysimg; client may make precompile requests

**Precompile requests**

- Client requests `GET /precompile/$sys/$deps/$uuid` for a package precompile file
  - `sys` is a hash of the system image file
  - `deps` is a hash of the dependencies for the precompile file
  - `uuid` is the UUID of the package whose precompile file is needed
- Client requests `GET /precompile/$sys/$deps/$uuid+$ext` for an extension precompile file
  - `sys` is a hash of the system image file
  - `deps` is a hash of the dependencies for the precompile file
  - `uuid` is the UUID of the package in which the extension is defined
  - `ext` is the name of the extension
- If server responds with `200 OK`
  - Precompile file is the body of the response
- If server responds with `404 Not Found`
  - Server doesn't have the requested precompile file
  - If response includes `Julia-Deps-Upload: $loc` header
    - Indicates that the server wants the client to upload the `deps` file
    - `loc` is `true` or a URL prefix to upload to (`/deps/$deps` is appended)

**Deps uploads**

- Client requests `PUT /deps/$deps` to upload a dependency set
  - Body is a specifically formatted map of dependency UUIDs to git tree hashes
  - `deps` is the hash of the body of the request
    - If hash doesn't match, server responds with `400 Bad Request`
  - Request may be made to a different server
    - Upload location indicated in response headers to precompile requests

### Sysimg Negotiation

We don't want to increase request load to Pkg servers that don't support precompile serving at all and we don't want Pkg clients to make a lot of useless extra requests. So we have clients check, once per session, before trying to request any precompile files, whether the server they're talking to even knows about their sysimg. If the server doesn't have a copy of their sysimg, it cannot possibly produce precompile files that they can use, so there is no point in the client asking for them.

The most obvious approach is for the client to send its sysimg hash to the server and for the server to indicate, yes or no, whether it has that sysimg. There is, however, a privacy concern with that approach: if a client has a rare or unique system image (e.g. you build your own Julia), then the sysimg hash acts as a unique identifier, and sending it would allow the server to track the user across sessions.
To prevent this, we follow this principle: *The client should not mention a sysimg hash before the server has demonstrated that it already knows about it.* Once a server has shown that it knows about a sysimg, we can assume that the sysimg is common enough that its hash doesn't act as a unique identifier.

The next most obvious approach is for the client to ask the server for a list of sysimg hashes that it knows about and check if its sysimg hash is in that list. This is the approach that we take: once per Julia session, the Pkg client makes a `GET /precompile/sysimgs/$arch-$os` request to check if the Pkg server knows about its sysimg and can thus potentially serve it precompile files that it can use. The server responds with a list of hashes of sysimgs that it knows about. If the client's sysimg hash is in the returned list of hashes, then the client can go ahead and request precompile files using the plain sysimg hash (this is fine since the server has shown that it already knows about it). If the client's sysimg hash is not in the returned list, the client should not request any precompile files since the server cannot have them.

The request path includes the client's `$arch-$os` component at the end so that we can keep the sizes of responses down: there should only be at most a few dozen sysimgs per architecture-operating system combination, even if the server has sysimgs for quite a lot of Julia versions.

There is one other minor privacy concern, this time for the server: the server may not want every client to know the full list of sysimg hashes that it knows about. The hashes are strong cryptographic one-way hashes, so there's not much that the client can learn from these hashes, but even the number of them could potentially be considered sensitive. (We are being *very* paranoid here.) To address this concern, the Pkg client includes a `Julia-Sysimg: $salt, $salted_hash` header in its sysimg request.
Here `salt` is a randomly generated alphanumeric string and `salted_hash` is the hash of the string `$sys$salt`, where `sys` is the plain hash of the client's system image file. The server is free to ignore this header and serve a full list of sysimgs. However, if the server considers the list of sysimg hashes it offers to be sensitive information, it can filter the list of hashes it returns based on this header. For each sysimg hash that the server knows, it computes the salted hash with the salt sent by the client, and unless the salted hash matches, it doesn't include the plain sysimg hash in its response. When the server filters its response like this, the reply body will be empty if the server does not know about the client's sysimg, or consist of a single hash matching the client's local sysimg hash.

### Deps Upload

The server can indicate that it would like the client to upload deps data by including a `Julia-Deps-Upload: $loc` header in the `404 Not Found` response to a precompile file request. The server can use any criteria it wants to decide when to request a deps upload, but typically it would keep track of how many times precompile files using that deps set have been requested and/or by how many distinct client IP addresses. Unique IP address counts can be estimated efficiently and anonymously using the technique described [here](https://discourse.julialang.org/t/announcing-package-download-stats/69073#anonymity-and-unique-ip-address-count-approximation-3) and implemented [here](https://github.com/StefanKarpinski/HyperLogLogHashIPs.jl/blob/master/src/HyperLogLogHashIPs.jl).

The URL to make an upload request is determined by the `loc` value in the response header:

- If `loc == "true"` then upload to `/deps/$deps` on the same server
- Otherwise, upload to `$loc/deps/$deps`, which should be a valid HTTP/S URL

Why do we allow the server to indicate the location to upload to?
The public package servers reachable at `pkg.julialang.org` are relatively simple systems that mainly act as stateless caches. We may not want to deploy the capability to receive uploads on them at all. Instead, we may want a smaller set of dedicated upload servers specifically for receiving and processing deps uploads. The simplest version of this would be for the public Pkg servers to include the fixed header `Julia-Deps-Upload: https://upload.pkg.julialang.org` once a deps hash has been requested enough times.

Once the upload URL is determined, the actual upload request is `PUT $path`, where `path` is the upload path ending in `/deps/$deps`, with a body consisting of carefully formatted data describing the identity and version of each dependency that isn't baked into the sysimg. This format must be very precisely specified because it gets hashed. The next section describes this (very simple) format in full.

The hash of the body should match the last path component of the upload path. That is, if the request is `PUT /deps/1234` then the hash of the request body must be `1234` (not a real hash value, but you get the idea). If the hash of the request body does not match the `deps` hash, the server should respond with a `400 Bad Request` and ignore the request. If the hash matches and the data appears to be a valid deps specification, the server can record this data and use it to generate precompile files as it sees fit.

When a server receives a deps upload, it will typically already have seen many requests for precompile files with that `deps` hash. If it hasn't, it should probably ignore the upload request. The server may use the uploaded deps data to generate precompile files that have been sufficiently frequently requested. Knowing the system image, the set of deps versions, and the UUID of the package and/or extension name is sufficient information to instantiate a manifest, load dependencies and generate a precompile file.
Details of how to do this are beyond the scope of this document, which merely aims to describe how the client and server communicate sufficient information for the purpose, not how to apply that information.

### Deps Format

The format of the deps data is very simple:

- The format is strictly lowercase, 7-bit ASCII
- It consists of a single line per dependency
- Each line has this format: `$uuid $tree`
  - `uuid` is the UUID of the dependency in the standard UUID format
    - *Example:* `eeff0fb2-45e8-465e-a99a-063be327fcad`
  - `tree` is the git tree hash (SHA1) of the dependency source tree
    - Formatted as 40 lowercase hexadecimal digits
    - *Example:* `bcfd4f3612427cdf8c2ad503b64612c8244f6857`
- Every line is terminated with a single newline character (`\n`)
- There is no leading or trailing whitespace on any line
- A single space appears on each line separating UUID from tree hash
- Lines are sorted lexicographically

Example deps data:

```
21216c6a-2e73-6563-6e65-726566657250 00805cd429dcb4870060ff49ef443486c262e38e
682c06a0-de6a-54ab-a142-c8b1cf79cde6 31e996f0a15c7b280ba9f76636b3ff9e2ae58c9a
69de0a69-1ddd-5017-9359-2bf0b02dc9f0 8489905bcdbcfac64d1daa51ca07c0d8f0283821
aea7be01-6a6a-4083-8856-8a6e6704d82a 03b4c25b43cb84cee5c90aa9b5ea0a78fd848d2f
```

Further examples of this format are included in the "Worked Example with Code" section below.

### Cryptographic Hash Choice

Any number of modern cryptographic hashes could be used for the hash functions in this protocol and they need not even all be the same. However, using a single hash throughout seems best and SHA224 seems like a good choice:

- It's one of the SHA2 standard cryptographic hash functions, like SHA256
- It's the shortest choice among the SHA2 hash functions
  - Want to keep our URLs from becoming excessively long
  - Also makes responses to sysimg queries a bit smaller
- 224 bits is still plenty to ensure that collisions are astronomically unlikely

Accordingly, in the remainder we use SHA224 as our hash function.

Hash values in this proposal are used in URL resource paths, HTTP headers, and listed in the bodies of HTTP responses. These are all places where it's prudent to not be wasteful of space, but also where alphanumeric values are required. To represent hash values compactly but in ASCII, we encode them using [Base64](https://en.wikipedia.org/wiki/Base64). Since 224 is not divisible by 6, some padding is required. Rather than padding with `=` as Base64 typically does, we pad the hash data with two extra bytes in such a way that the base 2^16 little-endian checksum of the padded hash is zero. The following code implements this encoding:

```jl
using SHA

# 16-bit checksum: sum of the data read as UInt16 words, modulo 2^16
# (`reinterpret` uses host byte order, so this assumes a little-endian host)
cksum(h) = sum(reinterpret(UInt16, h)) % UInt16

# sha224 padded to 240 bits to give zero checksum
function sha224_pad240(data::Union{AbstractString,IO})
    h = sha224(data)
    s = -cksum(h)              # value that brings the checksum to zero
    push!(h, s % UInt8)        # low byte first (little-endian)
    push!(h, (s >> 8) % UInt8)
    @assert cksum(h) == 0
    return h
end

# URL-safe Base64 alphabet (named to avoid clashing with `Base.digits`)
const base64_digits = UInt8['A':'Z';'a':'z';'0':'9';'-';'_']
@inline digit(d::UInt8) = base64_digits[(d & 0x3f) + 1]

# sha224 padded to 240 bits and base64 encoded
function sha224_base64(data::Union{AbstractString,IO})
    h = sha224_pad240(data)
    @assert length(h) == 240÷8
    v = Base.StringVector(4*length(h)÷3)
    # each group of 3 bytes becomes 4 base64 digits
    for i = 0:length(h)÷3-1
        a, b, c = h[3i+1], h[3i+2], h[3i+3]
        v[4i+1] = digit(a >> 2)
        v[4i+2] = digit((a << 4) | (b >> 4))
        v[4i+3] = digit((b << 2) | (c >> 6))
        v[4i+4] = digit(c)
    end
    return String(v)
end
```

This implements the URL variant of Base64, and each hash value is exactly 40 ASCII bytes with each byte representing a letter, a digit, `_` or `-`. Hash values are case sensitive.

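
One possible use of the zero-checksum padding is a cheap sanity check on hash strings before they are put into a URL or compared. The sketch below is an illustration rather than part of the proposal: `decode_hash` and `is_wellformed_hash` are hypothetical names, and the alphabet constant simply repeats the one used by the encoder. Like the encoder, it reads 16-bit words in host byte order and so assumes a little-endian host.

```jl
# URL-safe Base64 alphabet, in the same order as the encoder uses
const HASH_ALPHABET = UInt8['A':'Z'; 'a':'z'; '0':'9'; '-'; '_']

# Decode a 40-character hash string into its 30 padded bytes.
# Returns `nothing` if the length is wrong or a character is not in the alphabet.
function decode_hash(s::AbstractString)
    ncodeunits(s) == 40 || return nothing
    vals = UInt8[]
    for c in codeunits(s)
        i = findfirst(==(c), HASH_ALPHABET)
        i === nothing && return nothing
        push!(vals, UInt8(i - 1))      # 6-bit value of this digit
    end
    bytes = UInt8[]
    for k in 1:4:40                    # 4 digits (24 bits) -> 3 bytes
        a, b, c, d = vals[k], vals[k+1], vals[k+2], vals[k+3]
        push!(bytes, (a << 2) | (b >> 4))
        push!(bytes, (b << 4) | (c >> 2))
        push!(bytes, (c << 6) | d)
    end
    return bytes
end

# A hash string is well formed if it decodes and its 16-bit checksum is zero.
function is_wellformed_hash(s::AbstractString)
    bytes = decode_hash(s)
    bytes === nothing && return false
    return sum(reinterpret(UInt16, bytes)) % UInt16 == 0
end
```

A server could call `is_wellformed_hash` on the `sys` and `deps` path components and reject malformed requests early. A passing check only means that the string is plausibly a padded hash, not that the server knows the hash.
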
Using Base64 encoding for our SHA224 hash values also helps to visually distinguish the three kinds of randomish values we encounter:

- UUIDs: 36 hex digits and dashes (`baffcb60-1cc7-409b-ba95-337c9589c227`)
- SHA1 hashes: 40 hex digits (`a58131e794da2da6c61be736737b4641cd09b8a8`)
- SHA224 hashes: 40 base64 digits (`TnXY_QFMuQZOL6FonbZHC7JNyDmKSjfFAdX5-Riq`)

## Worked Example with Code

Suppose I am running Julia 1.10.0 and have the following `Manifest.toml` file:

```toml
# This file is machine-generated - editing it directly is not advised

julia_version = "1.10.0"
manifest_format = "2.0"
project_hash = "cb83ea646dcd21e4c1ea197fe54d897351940bc8"

[[deps.Automa]]
deps = ["PrecompileTools", "TranscodingStreams"]
git-tree-sha1 = "588e0d680ad1d7201d4c6a804dcb1cd9cba79fbb"
uuid = "67c07d97-cdcb-5c2c-af73-a7f9c32a568b"
version = "1.0.3"

[[deps.BioGenerics]]
deps = ["TranscodingStreams"]
git-tree-sha1 = "7bbc085aebc6faa615740b63756e4986c9e85a70"
uuid = "47718e42-2ac5-11e9-14af-e5595289c2ea"
version = "0.1.4"

[[deps.BioSequences]]
deps = ["BioSymbols", "PrecompileTools", "Random", "Twiddle"]
git-tree-sha1 = "6fdba8b4279460fef5674e9aa2dac7ef5be361d5"
uuid = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59"
version = "3.1.6"

[[deps.BioSymbols]]
deps = ["PrecompileTools"]
git-tree-sha1 = "e32a61f028b823a172c75e26865637249bb30dff"
uuid = "3c28c6f8-a34d-59c4-9654-267d177fcfa9"
version = "5.1.3"

[[deps.Dates]]
deps = ["Printf"]
uuid = "ade2ca70-3891-5945-98fb-dc099432e06a"

[[deps.FASTX]]
deps = ["Automa", "BioGenerics", "PrecompileTools", "StringViews", "TranscodingStreams"]
git-tree-sha1 = "bff5d62bf5e1c382a370ac701bcaea9a24115ac6"
uuid = "c2308a5c-f048-11e8-3e8a-31650f418d12"
version = "2.1.4"
weakdeps = ["BioSequences"]

    [deps.FASTX.extensions]
    BioSequencesExt = "BioSequences"

[[deps.PrecompileTools]]
deps = ["Preferences"]
git-tree-sha1 = "03b4c25b43cb84cee5c90aa9b5ea0a78fd848d2f"
uuid = "aea7be01-6a6a-4083-8856-8a6e6704d82a"
version = "1.2.0"

[[deps.Preferences]]
deps = ["TOML"]
git-tree-sha1 = "00805cd429dcb4870060ff49ef443486c262e38e"
uuid = "21216c6a-2e73-6563-6e65-726566657250"
version = "1.4.1"

[[deps.Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"

[[deps.Random]]
deps = ["SHA"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[[deps.SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"
version = "0.7.0"

[[deps.StringViews]]
git-tree-sha1 = "f7b06677eae2571c888fd686ba88047d8738b0e3"
uuid = "354b36f9-a18e-4713-926e-db85100087ba"
version = "1.3.3"

[[deps.TOML]]
deps = ["Dates"]
uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76"
version = "1.0.3"

[[deps.TranscodingStreams]]
git-tree-sha1 = "54194d92959d8ebaa8e26227dbe3cdefcdcd594f"
uuid = "3bb67fe8-82b1-5028-8e26-92a6c54297fa"
version = "0.10.3"

    [deps.TranscodingStreams.extensions]
    TestExt = ["Test", "Random"]

    [deps.TranscodingStreams.weakdeps]
    Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
    Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[[deps.Twiddle]]
git-tree-sha1 = "29509c4862bfb5da9e76eb6937125ab93986270a"
uuid = "7200193e-83a8-5a55-b20d-5d36d44a0795"
version = "1.1.2"

[[deps.Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"
```

This manifest was produced by adding the BioSequences and FASTX packages to an empty project using General registry tree `cefd0e31897c3db1bc23398d510f3a03acf49b22`.

### Sysimg Negotiation

Before trying to download any precompile files, the client needs to check whether the server knows about its system image. On my system, that request will look something like this:

```
GET /precompile/sysimgs/aarch64-macos HTTP/1.1
Julia-System: aarch64-apple-darwin-libgfortran5-cxx11-julia_version+1.10.0
Julia-Version: 1.10.0
Julia-Sysimg: NfBFpOk9vnEp7fPS, lW554ydSZVzPC2oFgczGaAMfajmECcHGhl6meQi4
```

The `Julia-Version` and `Julia-System` entries are existing headers that we send with every request.
The `Julia-Sysimg` header is a new header whose value is generated like so:

```jl
using Random, SHA

# path of the system image file used by the running Julia process
path = unsafe_string(Base.JLOptions().image_file)
# plain sysimg hash (slow, so worth caching)
sys = open(sha224_base64, path)
# fresh salt from the OS entropy source
salt = randstring(RandomDevice(), 16)
salted_hash = sha224_base64("$sys$salt")
```

Hashing the sysimg file takes some 400 ms on my system, so we'll definitely want to cache that computation. We use `RandomDevice` to generate the salt value rather than the non-cryptographic default RNG (Xoshiro256++), whose output can be predicted relatively easily. A new salt should be generated for each request to avoid requests being correlated. In this particular example, the values produced were the following:

```jl
path = "/Users/stefan/.julia/juliaup/julia-1.10.0+0.aarch64.apple.darwin14/lib/julia/sys.dylib"
sys = "SBCcsWhD03qK3hE3t87C9nseeMgccXgVVLMJfOkH"
salt = "NfBFpOk9vnEp7fPS"
salted_hash = "lW554ydSZVzPC2oFgczGaAMfajmECcHGhl6meQi4"
```

The server is free to ignore this header and send a reply that lists all the (plain) hashes of all the sysimg files it knows about for `aarch64-macos`, in which case the reply might look something like this:

```
HTTP/1.1 200 OK
Server: nginx/1.23.3

A4Sw7m_nr4PLopxuG6ssTytCGzw4rmjLkRIaV_C0
E5pFAKerk_99CT8NjRc4IfSJvwfu4oJrEvxUEmR8
SBCcsWhD03qK3hE3t87C9nseeMgccXgVVLMJfOkH
SWJNC5DJ_-lID_L6CsQ7kxoAi3SUmeu4_Fmd1p-F
TLAQMnRxc07GiKnn6mHJC-owZX_dnjoVHCcmefN6
UEGljdtDpgRdYrxyS3qrbB3LN2Tq5AKWB1Em6w5G
WKiBV-1hD-jIIp5JLjg8JNSl7ceHHJR_cpWqiGPF
rMucfJjwVpR026_g0TP4WnDFy4w-wr8WUPZmrvAX
```

This response contains the client's `sys` hash value, which indicates that the server knows about the client's system image and can serve precompile files for it. The hashes here are sorted, but they are not required to be.

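
The client-side decision for a response like this can be sketched as follows. This is a minimal illustration, assuming the client already has the HTTP status code and the response body as a string; the function names `parse_sysimg_list` and `may_request_precompiles` are hypothetical and not part of Pkg.

```jl
# Parse a newline-separated list of sysimg hashes from a response body,
# ignoring blank lines and surrounding whitespace (including any `\r`).
function parse_sysimg_list(body::AbstractString)
    hashes = Set{String}()
    for line in split(body, '\n')
        h = strip(line)
        isempty(h) || push!(hashes, String(h))
    end
    return hashes
end

# Decide whether the client may make precompile requests to this server.
# `status` is the HTTP status code; `sys` is the client's own sysimg hash.
function may_request_precompiles(status::Integer, body::AbstractString, sys::AbstractString)
    status == 200 || return false   # e.g. 404: server doesn't serve precompiles
    return sys in parse_sysimg_list(body)
end
```

Because the check is plain set membership, it gives the same answer whether the server returns its full list, a filtered list with a single hash, or an empty body.
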
If the server had chosen to filter its response based on the `Julia-Sysimg` header, the body would only contain the one matching hash value:

```
HTTP/1.1 200 OK
Server: PkgServer.jl/b5b92d86de

SBCcsWhD03qK3hE3t87C9nseeMgccXgVVLMJfOkH
```

Both responses have the same effect: the client finds its sysimg hash in the response body and concludes that it can request precompile files going forward. If the server's response does not contain the client's sysimg hash, then the client should not make any precompile requests to the server. If the server is filtering sysimgs, it may respond with `200 OK` and an empty body. This is a valid response indicating zero sysimg hashes. In that case, the client's sysimg hash is not in the response body and the client should not make precompile requests.

### Package Precompile Request & Upload

After the client installs the FASTX package from this manifest, it can attempt to download a precompile file for it. In order to do this, it needs to first determine FASTX's transitive dependency set, that is, the subset of the manifest that FASTX depends on directly or indirectly, including the package itself.
This can be determined directly from the manifest file alone:

```jl
import Base: UUID
import Pkg.Types: Manifest

# Collect the UUIDs of a package and all of its transitive dependencies
function package_deps(m::Manifest, uuid::UUID)
    work = Set([uuid])      # packages still to visit
    deps = empty(work)      # packages already visited
    while !isempty(work)
        u = pop!(work)
        push!(deps, u)
        for u′ in values(m[u].deps)
            u′ in deps || push!(work, u′)
        end
    end
    sort!(collect(deps))
end
```

We can apply this to the manifest above like this (assuming it's been activated):

```jl
import Pkg

m = Pkg.Types.EnvCache().manifest # active manifest
uuid = UUID("c2308a5c-f048-11e8-3e8a-31650f418d12") # FASTX
deps = package_deps(m, uuid)
```

This produces a vector of the following package UUIDs:

```
21216c6a-2e73-6563-6e65-726566657250 (Preferences)
354b36f9-a18e-4713-926e-db85100087ba (StringViews)
3bb67fe8-82b1-5028-8e26-92a6c54297fa (TranscodingStreams)
47718e42-2ac5-11e9-14af-e5595289c2ea (BioGenerics)
4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5 (Unicode)
67c07d97-cdcb-5c2c-af73-a7f9c32a568b (Automa)
ade2ca70-3891-5945-98fb-dc099432e06a (Dates)
aea7be01-6a6a-4083-8856-8a6e6704d82a (PrecompileTools)
c2308a5c-f048-11e8-3e8a-31650f418d12 (FASTX)
de0858da-6303-5e67-8744-51eddeeeb8d7 (Printf)
fa267f1f-6049-4f14-aa54-33bafae1ed76 (TOML)
```

From this list, we can generate a deps data file that has a line for each of the above UUIDs that aren't baked into the sysimg, mapping it to the appropriate git tree hash:

```jl
# Write one `$uuid $tree` line per dependency that isn't in the sysimg
function deps_data(io::IO, m::Manifest, deps::Vector{UUID})
    for uuid in sort!(deps)
        entry = m[uuid]
        name = entry.name
        id = Base.PkgId(uuid, name)
        Base.in_sysimage(id) && continue   # baked into the sysimg: skip
        tree = entry.tree_hash
        tree === nothing && error("$name has no tree")
        println(io, "$uuid $tree")
    end
end
```

If the deps list contains any entries that are not built into the system image and don't have a git tree hash in the manifest (e.g. some entries are dev'd), we should abort and not make a precompile request at all. Without a fixed tree hash for a dependency, the server can't possibly generate precompile files, so continuing makes no sense.
In this sample code, we handle that by erroring. We might also want to check that all package version hashes actually appear in a registry somewhere: if the server doesn't know about a tree, it can't possibly install it and generate precompiles from it. For the above deps list, `deps_data` produces the following output:

```
21216c6a-2e73-6563-6e65-726566657250 00805cd429dcb4870060ff49ef443486c262e38e
354b36f9-a18e-4713-926e-db85100087ba f7b06677eae2571c888fd686ba88047d8738b0e3
3bb67fe8-82b1-5028-8e26-92a6c54297fa 54194d92959d8ebaa8e26227dbe3cdefcdcd594f
47718e42-2ac5-11e9-14af-e5595289c2ea 7bbc085aebc6faa615740b63756e4986c9e85a70
67c07d97-cdcb-5c2c-af73-a7f9c32a568b 588e0d680ad1d7201d4c6a804dcb1cd9cba79fbb
aea7be01-6a6a-4083-8856-8a6e6704d82a 03b4c25b43cb84cee5c90aa9b5ea0a78fd848d2f
c2308a5c-f048-11e8-3e8a-31650f418d12 bff5d62bf5e1c382a370ac701bcaea9a24115ac6
```

To get our `deps` hash, we compute the `sha224_base64` hash of this data, which in this case is `38tnhoGti4gqt6oDwzymAc-2TizCeVzATt1eaYoa`. With this deps hash, we can finally construct our precompile request:

```http
GET /precompile/SBCcsWhD03qK3hE3t87C9nseeMgccXgVVLMJfOkH/38tnhoGti4gqt6oDwzymAc-2TizCeVzATt1eaYoa/c2308a5c-f048-11e8-3e8a-31650f418d12 HTTP/1.1
Julia-System: aarch64-apple-darwin-libgfortran5-cxx11-julia_version+1.10.0
Julia-Version: 1.10.0
```

If the server has this precompile file, it can return it to the client with a `200 OK` response where the precompile file is the response body. If the server doesn't have this precompile file, it should return a `404 Not Found` response. If the server is interested in receiving the contents of the above deps data, then it should include a `Julia-Deps-Upload` header in the response, for example:

```
HTTP/1.1 404 Not Found
Server: nginx/1.23.3
Julia-Deps-Upload: true
```

This response indicates that the client should upload the deps data to `/deps/38tnhoGti4gqt6oDwzymAc-2TizCeVzATt1eaYoa` on the same server.

The server could also provide a URL prefix where the data should be uploaded to:

```
HTTP/1.1 404 Not Found
Server: nginx/1.23.3
Julia-Deps-Upload: https://upload.pkg.julialang.org
```

This indicates that the client should upload the deps data to `https://upload.pkg.julialang.org/deps/38tnhoGti4gqt6oDwzymAc-2TizCeVzATt1eaYoa`. In either case, the upload request would look like this:

```
PUT /deps/38tnhoGti4gqt6oDwzymAc-2TizCeVzATt1eaYoa HTTP/1.1
Julia-System: aarch64-apple-darwin-libgfortran5-cxx11-julia_version+1.10.0
Julia-Version: 1.10.0

21216c6a-2e73-6563-6e65-726566657250 00805cd429dcb4870060ff49ef443486c262e38e
354b36f9-a18e-4713-926e-db85100087ba f7b06677eae2571c888fd686ba88047d8738b0e3
3bb67fe8-82b1-5028-8e26-92a6c54297fa 54194d92959d8ebaa8e26227dbe3cdefcdcd594f
47718e42-2ac5-11e9-14af-e5595289c2ea 7bbc085aebc6faa615740b63756e4986c9e85a70
67c07d97-cdcb-5c2c-af73-a7f9c32a568b 588e0d680ad1d7201d4c6a804dcb1cd9cba79fbb
aea7be01-6a6a-4083-8856-8a6e6704d82a 03b4c25b43cb84cee5c90aa9b5ea0a78fd848d2f
c2308a5c-f048-11e8-3e8a-31650f418d12 bff5d62bf5e1c382a370ac701bcaea9a24115ac6
```

Upon receiving this, the server should hash the body and make sure that its hash is equal to `38tnhoGti4gqt6oDwzymAc-2TizCeVzATt1eaYoa`. If it isn't, then it should reply with `400 Bad Request` and ignore the request. Of course, the hash does match in this case, and this upload would allow the server to record that this data is associated with that deps hash. This allows it to generate and subsequently serve precompile files for any requests it has received in the past with this `deps` hash value.

### Extension Precompile Request & Upload

Package extensions behave much like standalone packages that depend on their "trigger" packages, which include the package that defines the extension and the "weak dependencies" that it's associated with.
While they have a synthetic UUID generated by hashing their parent package's UUID and the extension name, this UUID isn't registered anywhere and cannot be looked up. Instead we identify extensions by their parent package's UUID and the extension name. Extensions also have a larger deps set than their parent package, which additionally includes the trigger packages and their transitive dependencies.

In our example manifest, FASTX has a "BioSequencesExt" extension. When Pkg is considering precompiling this, it can first try requesting the precompile file for this extension from the server. Before it can do this, however, it needs to find the deps list for this extension:

```jl
# Map an extension's trigger spec (a name, a list of names, or a name => UUID
# table) to the UUIDs of the trigger packages in the manifest.
ext_uuids(m::Manifest, dep::String) = UUID[u for (u, e) in m if e.name == dep]
ext_uuids(m::Manifest, deps::Vector{String}) = UUID[u for (u, e) in m if e.name in deps]
ext_uuids(m::Manifest, deps::Dict{String,UUID}) = collect(values(deps))

function extension_deps(m::Manifest, uuid::UUID, ext::String)
    # start from the parent package and the extension's trigger packages
    work = Set([uuid])
    triggers = m[uuid].exts[ext]
    weak = ext_uuids(m, triggers)
    union!(work, weak)
    # collect the transitive closure of their dependencies
    deps = empty(work)
    while !isempty(work)
        u = pop!(work)
        push!(deps, u)
        for u′ in values(m[u].deps)
            u′ in deps || push!(work, u′)
        end
    end
    sort!(collect(deps))
end
```

We can apply this to our ongoing example:

```jl
ext = "BioSequencesExt"
deps = extension_deps(m, uuid, ext)
```

This produces the following list of UUIDs:

```
21216c6a-2e73-6563-6e65-726566657250 (Preferences)
354b36f9-a18e-4713-926e-db85100087ba (StringViews)
3bb67fe8-82b1-5028-8e26-92a6c54297fa (TranscodingStreams)
3c28c6f8-a34d-59c4-9654-267d177fcfa9 (BioSymbols)
47718e42-2ac5-11e9-14af-e5595289c2ea (BioGenerics)
4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5 (Unicode)
67c07d97-cdcb-5c2c-af73-a7f9c32a568b (Automa)
7200193e-83a8-5a55-b20d-5d36d44a0795 (Twiddle)
7e6ae17a-c86d-528c-b3b9-7f778a29fe59 (BioSequences)
9a3f8284-a2c9-5f02-9a11-845980a1fd5c (Random)
ade2ca70-3891-5945-98fb-dc099432e06a (Dates)
aea7be01-6a6a-4083-8856-8a6e6704d82a (PrecompileTools)
c2308a5c-f048-11e8-3e8a-31650f418d12 (FASTX)
de0858da-6303-5e67-8744-51eddeeeb8d7 (Printf)
ea8e919c-243c-51af-8825-aaa63cd721ce (SHA)
fa267f1f-6049-4f14-aa54-33bafae1ed76 (TOML)
```

This deps set is necessarily a superset of the previous deps set for FASTX by itself: it includes the transitive dependencies of both FASTX and BioSequences, which are the triggers for the BioSequencesExt extension. We can produce deps data for this set of dependencies and versions using `deps_data` again:

```
21216c6a-2e73-6563-6e65-726566657250 00805cd429dcb4870060ff49ef443486c262e38e
354b36f9-a18e-4713-926e-db85100087ba f7b06677eae2571c888fd686ba88047d8738b0e3
3bb67fe8-82b1-5028-8e26-92a6c54297fa 54194d92959d8ebaa8e26227dbe3cdefcdcd594f
3c28c6f8-a34d-59c4-9654-267d177fcfa9 e32a61f028b823a172c75e26865637249bb30dff
47718e42-2ac5-11e9-14af-e5595289c2ea 7bbc085aebc6faa615740b63756e4986c9e85a70
67c07d97-cdcb-5c2c-af73-a7f9c32a568b 588e0d680ad1d7201d4c6a804dcb1cd9cba79fbb
7200193e-83a8-5a55-b20d-5d36d44a0795 29509c4862bfb5da9e76eb6937125ab93986270a
7e6ae17a-c86d-528c-b3b9-7f778a29fe59 6fdba8b4279460fef5674e9aa2dac7ef5be361d5
aea7be01-6a6a-4083-8856-8a6e6704d82a 03b4c25b43cb84cee5c90aa9b5ea0a78fd848d2f
c2308a5c-f048-11e8-3e8a-31650f418d12 bff5d62bf5e1c382a370ac701bcaea9a24115ac6
```

The hash of this data is `0cHNrzGNRMsmUVhsDdl3sBnn-tykWPBzqjMijnic`, which lets us construct our precompile request:

```http
GET /precompile/SBCcsWhD03qK3hE3t87C9nseeMgccXgVVLMJfOkH/0cHNrzGNRMsmUVhsDdl3sBnn-tykWPBzqjMijnic/c2308a5c-f048-11e8-3e8a-31650f418d12+BioSequencesExt HTTP/1.1
Julia-System: aarch64-apple-darwin-libgfortran5-cxx11-julia_version+1.10.0
Julia-Version: 1.10.0
```

If the server has this precompile file, it can return it with a `200 OK` response. If it doesn't, it must return `404 Not Found`, and if it wants the associated deps data, it can include the `Julia-Deps-Upload` header as before.
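
An extension request differs from a package request only in its final path segment, and the upload check on the server is the same in both cases, so both can be sketched together. The names `resource_name` and `deps_upload_valid` are hypothetical and used only for illustration. The hash function is passed in as an argument (`hashfn`) so that the sketch does not have to guess how `sha224_base64` is implemented.

```jl
# Hypothetical helpers for illustration; none of these names exist in Pkg.

# Final path segment of a precompile request: a bare UUID for a package,
# or `uuid+ExtName` for one of its extensions.
resource_name(uuid::AbstractString, ext::Union{Nothing,AbstractString}=nothing) =
    ext === nothing ? String(uuid) : "$uuid+$ext"

# Server-side check for `PUT /deps/<hash>`: accept the body only if hashing it
# reproduces the hash named in the path. `hashfn` stands in for the document's
# `sha224_base64`. When this returns `false`, the server would reply with
# `400 Bad Request` and ignore the upload.
deps_upload_valid(path_hash::AbstractString, body::AbstractString, hashfn) =
    hashfn(body) == path_hash
```

With these, `resource_name("c2308a5c-f048-11e8-3e8a-31650f418d12", "BioSequencesExt")` gives the last segment of the extension request above, and a server would call `deps_upload_valid(hash_from_path, request_body, sha224_base64)` before recording the uploaded deps data.
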