# 3. Resource Class Sharding

The first problem with HyperLogLog is that some clients get rare, highly identifiable hash values. This issue can be mitigated by sharding requests by resource class. The idea is to split requests into classes that are narrow enough that correlating a client's behavior within each class is harmless yet broad enough that estimating within each class is still useful. Rather than having a single global HyperLogLog value, a client would have a different, independent HLL value for each resource class. There may be situations where such a split is not possible, but in Julia's Pkg server context, it's fairly clear how to do it: a resource class would be some prefix of a request URL's resource path. For example, here's how I would map resource paths to resource class prefixes:

- `/registries` $\longrightarrow$ `/registries`
- `/registry/$uuid` $\longrightarrow$ `/registry/$uuid`
- `/package/$uuid/$hash` $\longrightarrow$ `/package/$uuid`
- `/artifact/$hash` $\longrightarrow$ `/artifact/$hash`

For registries and artifacts, this is the full resource path; for packages, it lops off the package version hash, keeping only the package UUID. Note that this means that each client generates a new, statistically independent HyperLogLog value for each package and artifact that they request. This sharding scheme allows estimating:

- Total Julia users via requests to `/registries`
- Users of each registry via requests to `/registry/$uuid`
- Users of each package via requests to `/package/$uuid`
- Users of each artifact via requests to `/artifact/$hash`

Estimates can be made for arbitrary slices of the request logs within each resource class too. We can, for example, aggregate at various time scales: daily, weekly, monthly, yearly. We can slice by region, operating system, Julia version, and any other data that the client shares with the servers, or by any combination of the above that may turn out to be of interest.
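
The mapping from resource paths to resource class prefixes listed above can be sketched in a few lines of Python. This is only an illustration, not the Pkg server's actual code: the function name `resource_class` is hypothetical, and the sketch assumes that request paths have exactly the four shapes shown in the mapping.

```python
def resource_class(path: str) -> str:
    """Map a request resource path to its resource class prefix.

    Hypothetical helper: assumes paths have exactly the shapes
    /registries, /registry/$uuid, /package/$uuid/$hash, /artifact/$hash.
    """
    parts = path.strip("/").split("/")
    kind = parts[0]
    if kind == "registries" and len(parts) == 1:
        return "/registries"
    if kind == "registry" and len(parts) == 2:
        # The class is the full resource path.
        return "/registry/" + parts[1]
    if kind == "package" and len(parts) == 3:
        # Drop the version hash; keep only the package UUID.
        return "/package/" + parts[1]
    if kind == "artifact" and len(parts) == 2:
        # The class is the full resource path.
        return "/artifact/" + parts[1]
    raise ValueError(f"unrecognized resource path: {path!r}")
```

Under this sketch, two requests for different versions of the same package fall into the same class, while requests for different packages never share a class.
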
There's no need to anticipate which ways of slicing and dicing will be useful ahead of time: logs can be queried and aggregated arbitrarily after the fact.

The only thing that resource class sharding prevents is aggregating or correlating *across* classes. You cannot ask how many clients downloaded package A _or_ B. You can only estimate A and B separately; you can't estimate the union or intersection. This is a limitation, but it seems like an acceptable one. Whereas there is significant demand for knowing how many users there are of specific packages, there is very little call for estimating correlations between packages.

How does sharding help privacy? Even in the original, simple "each client sends a unique ID" scheme, resource class sharding makes the privacy situation considerably better. Imagine that instead of sending the same ID with each request, each client generates a "master key" and derives a "class-specific ID" for each resource class by cryptographically hashing the master key with the resource class string. This prevents tracking users across resource classes since the IDs are different and the cryptographic hash prevents linking them. You could still see that the same client downloaded three versions of a package this month, but you'd have no way of connecting that to the same client downloading any other packages. This is actually not a bad scheme, but I suspect that some people would still strongly object to clients being uniquely identifiable at all, and it may still run afoul of privacy laws like GDPR.

Resource class sharding with HyperLogLog works in much the same way except that instead of sending the derived class-specific ID, the client uses the hash of the master key with the class string to generate a HyperLogLog value and sends that along with the request. The HLL values in different classes are statistically independent thanks to the cryptographic hashing.
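
The derivation described above can be sketched as follows. Several details here are assumptions rather than anything the scheme specifies: HMAC-SHA256 stands in for "cryptographically hashing the master key with the resource class string", the HLL value takes the textbook form of a (bucket, rank) pair, and the names `new_master_key`, `class_id`, `hll_value`, and the parameter `bucket_bits` are all hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def new_master_key() -> bytes:
    """Generate a random 256-bit master key that never leaves the client."""
    return secrets.token_bytes(32)


def class_digest(master_key: bytes, resource_class: str) -> bytes:
    # Assumption: HMAC-SHA256 as the keyed cryptographic hash.
    return hmac.new(master_key, resource_class.encode("utf-8"), hashlib.sha256).digest()


def class_id(master_key: bytes, resource_class: str) -> str:
    """Class-specific ID for the simple 'unique ID per class' scheme."""
    return class_digest(master_key, resource_class).hex()


def hll_value(master_key: bytes, resource_class: str, bucket_bits: int = 12) -> Tuple[int, int]:
    """Textbook HyperLogLog sample (bucket, rank) derived from the class digest."""
    total_bits = 256
    h = int.from_bytes(class_digest(master_key, resource_class), "big")
    rest_bits = total_bits - bucket_bits
    bucket = h >> rest_bits                # top bits select the bucket
    rest = h & ((1 << rest_bits) - 1)      # remaining bits determine the rank
    # Rank is the 1-based position of the first 1 bit in the remaining bits.
    rank = rest_bits - rest.bit_length() + 1
    return bucket, rank
```

In this sketch the client would send `hll_value(key, cls)` rather than `class_id(key, cls)`: many clients share any given (bucket, rank) pair, whereas a class-specific ID is unique to one client.
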
In a given resource class, only a few unlucky clients will have rare, uniquely identifiable values. Crucially, clients that are rare in one class will be common in most other classes. Even if a client is unlucky enough to be rare in two different classes, the server has no way of knowing that those rare HLL values belong to the same client.

Sharding also improves fairness. Without sharding, most clients are unremarkable, but a few unlucky clients have uncommon, identifiable HLL values. This is unfair: some clients are trackable, most aren't. With sharding, *every* client will be unremarkable in most classes and rare in a few. Since there is an unbounded and, in practice, large number of resource classes (one per package or artifact), each client will be anonymous in most of them and rare in only some. This is a much fairer and more symmetrical situation.

One may object that some resource classes are surely more important than others. For example, suppose a client has a rare HLL value in the `/registries` class. Doesn't this seem more significant than having a rare sample in some obscure package's class that the client may never even download? Perhaps, but consider that the only thing you can learn about a client with a rare value in the `/registries` class is how often they check for new registries, information that is not interesting or sensitive in the slightest. Not accidentally either: this information is uninteresting precisely *because* everyone requests `/registries` regularly. If a package has very few users, on the other hand, then it's more notable if some user does download it, but for the very same reason, none of the package's users is likely to have an especially rare HLL value.
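
The link between class popularity and rare values can be made concrete with a small calculation. This sketch assumes the textbook HyperLogLog model, in which each client's rank is independently geometric, so a single client has rank at least `k` with probability `2**-(k - 1)`. The function name is hypothetical.

```python
def prob_some_client_at_least(rank: int, clients: int) -> float:
    """Probability that at least one of `clients` independent clients
    has an HLL rank >= `rank` (textbook geometric rank model, rank >= 1)."""
    p_single = 2.0 ** -(rank - 1)
    return 1.0 - (1.0 - p_single) ** clients


# A class with ten clients versus a class with a million clients.
small = prob_some_client_at_least(20, 10)
large = prob_some_client_at_least(20, 1_000_000)
```

Under these assumptions, a class with ten clients has roughly a 2 in 100,000 chance of containing any client with rank 20 or higher, while a class with a million clients has roughly an 85% chance of containing one.
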
The rarity of the most uncommon HLL value in a class inherently grows with the popularity of that resource class: the rarest HLL sample values will tend to happen in the classes that are the most popular, and popular classes are the ones that are the least interesting to learn anything about. With sharding, the most identifiable HLL values tend to occur in the least interesting classes and the most interesting classes tend to have the least identifiable HLL values.

Sharding is an effective privacy measure because almost everything of interest to an attacker comes from following users across packages, which is precisely what sharding blocks. On the other hand, correlating users across packages is of fairly minor interest for legitimate purposes, and losing that ability is a small sacrifice for a significant privacy gain.

**Next:** [4. Signed HLLs?](https://hackmd.io/@HLLoverRSA/4_Signed_HLLs)