# 4. Signed HLLs?
The second major issue with clients generating and sending raw HyperLogLog values is how easy it is for a malicious client to forge large geometric values and sabotage client count estimates. If a single malicious client sends `k = 63` in every bucket, then the client count estimate is inflated to 19 quintillion (billion billion) clients, which is obviously absurd, but also makes all our estimates useless. In other words, the entire system can be brought to its knees with a tiny number of malicious requests—only one per bucket, of which there are only a few thousand. If the attack is that blatant, of course, it's easy to filter out the requests with the implausible `k = 63` geometric sample values, but a more subtle attack could render estimates incorrect without being so easy to detect or filter.
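To make the attack concrete, here's a small Python sketch (illustrative only, not the Pkg implementation) of how a single forged sample dominates a max-based geometric estimator. The exact inflated figure depends on the estimator's correction factors, but the order of magnitude is the point:

```python
import random

def geometric_sample() -> int:
    """Geometric sample: P(k) = 2^-(k+1), like counting leading zero bits of a hash."""
    k = 0
    while random.random() < 0.5 and k < 63:
        k += 1
    return k

# Honest clients: the largest sample among n clients is typically around log2(n),
# so a max-based estimate of 2^max stays in the vicinity of the true count.
honest = [geometric_sample() for _ in range(10_000)]
print(2 ** max(honest))

# A single forged k = 63 inflates the estimate to 2^63, about 9.2 quintillion.
forged = honest + [63]
print(2 ** max(forged))  # 9223372036854775808
```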
To prevent spoofing, it seems like a client would need to present, alongside its HyperLogLog value, some evidence that it got that value legitimately and didn't just "make it up." Some kind of stamp of authenticity—a digital signature. A simple signature system would be for the server to generate a random secret key and cryptographically hash the HLL value along with that key, and use the hash value as a signature. If the client presents the correct hash value with its HLL value, then the pair is valid; otherwise it should be ignored for client counting purposes. Unfortunately, there are several problems with signed HLL values—some are solvable while others are not.
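The keyed-hash scheme just described can be sketched in a few lines of Python. This is a minimal sketch using HMAC-SHA-256; the function names are hypothetical and an 18-bit HLL value is packed into 3 bytes:

```python
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # server-side secret, never shared

def sign_hll(hll_value: int) -> bytes:
    """Server: MAC the 18-bit HLL value (packed into 3 bytes) under the secret key."""
    return hmac.new(SECRET_KEY, hll_value.to_bytes(3, "big"), hashlib.sha256).digest()

def verify_hll(hll_value: int, tag: bytes) -> bool:
    """Server: accept the (value, signature) pair only if the MAC checks out."""
    return hmac.compare_digest(sign_hll(hll_value), tag)

tag = sign_hll(0x1ABCD)
assert verify_hll(0x1ABCD, tag)        # legitimate pair is accepted
assert not verify_hll(0x1ABCE, tag)    # forged value is rejected
```

A client without the secret key cannot produce a valid tag for a value it invented—but, as the next paragraphs discuss, this simple scheme creates problems of its own.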
The first problem with signed HLL values is that many signature schemes could themselves act as a unique client fingerprint, which is exactly what we're trying to avoid. It doesn't matter if the HLL value is only a few bits if the signature that accompanies it is large enough to uniquely identify billions of clients. Maybe the signature is a pure function of the HLL value, but how can the client know that? A protocol that's supposed to protect client anonymity needs not only to be resilient to misbehaving clients, but also to prevent malicious servers from surreptitiously tracking users. Unless the client has a way to verify that the signature attached to an HLL value is the unique, correct signature for that value, a signature scheme can be used by a malicious server to track clients.
This problem is solvable: public key cryptography allows signatures that can be checked but not forged. The server publishes a public key and signs each HLL value with its corresponding private key. A client can check that the signature is correct using the public key. Depending on the public key system chosen, additional care may need to be taken to ensure that multiple different signatures cannot appear correct, but this is a minor technical detail. Public key cryptography applied carefully can fully solve this issue.
A second problem with HLL signatures is that we still want to do resource class sharding. This means that a client needs a different signature for each resource class and it has to serve the same HLL value for that resource class every time (otherwise we're counting requests, not clients). The obvious way to do this is for the client to request a new HLL value the first time it makes a request in a given class and then store that value to be used in the future. Assuming 256 bits per resource class key, 18 bits per HLL value, and 512 bits per signature (common for state-of-the-art elliptic curve signatures; other systems have larger signatures), that's about 100 bytes per resource class. If the typical user installs 5k packages over time, that's half a megabyte of storage. Given modern hard disk capacity this is certainly feasible, but that's a lot of data for each client to keep around for the sake of something that arguably doesn't benefit the individual user directly. Even if it is feasible, it has an "optics" problem: people may see this data and think *"Why are they storing all this 'telemetry' data about me?"* Yes, we can explain how the data isn't used to track you, but it doesn't look great.
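The storage arithmetic is easy to check directly, using the figures assumed above:

```python
# Per-resource-class storage, using the bit sizes assumed in the text.
key_bits, hll_bits, sig_bits = 256, 18, 512

bytes_per_class = (key_bits + hll_bits + sig_bits) / 8
print(bytes_per_class)            # 98.25 — "about 100 bytes per resource class"

# A typical user installing 5k packages over time:
total_bytes = 5_000 * bytes_per_class
print(total_bytes / 1_000_000)    # 0.49125 — roughly half a megabyte
```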
A third problem with signed HLL values is that it adds operational complexity to our servers. This can certainly be done, but it comes at a cost. We have carefully designed the Pkg server system to be simple and reliable to run. The `/registries` endpoint is the _only_ Pkg endpoint that ever changes—everything else that a Pkg server serves is immutable and content-addressed and thus perfectly cacheable without possibility of invalidation. Even the `/registries` endpoint can be implemented as a small, static file that just needs to be updated when registry content changes. This means everything in the Pkg protocol is straightforwardly cacheable: most paths can be cached forever, and `/registries` can be cached with any reasonably short time to live. This makes putting a content-delivery network (CDN) in front of Pkg servers trivial, and we lean quite heavily on CDNs. An endpoint handing out randomly generated, signed HLL values would require complex, dynamic logic on a server, would be inherently uncacheable, and could not be served through a CDN. All feasible, but not ideal.
The last and most critical problem with signed values is that they still allow malicious clients to observe and selectively choose HyperLogLog values with large geometric values. If the HLL values are sent in the clear, it's trivial to request as many as one wants and selectively choose the rarest one. This can be partially addressed by encryption: instead of sending HLL values in the clear, encrypt them. We can't use off-the-shelf encryption algorithms since they all have block sizes of at least 64 bits—if we encrypt an 18-bit HLL value with a 64-bit block cipher, we get an opaque 64-bit value with plenty of bits to fingerprint clients. Fortunately, one can easily roll a bespoke [Feistel cipher](https://en.wikipedia.org/wiki/Feistel_cipher) whose block size matches our HLL size. Since the encrypted payload is the same size as an HLL value, and the signature can be checked to be a pure function of the encrypted HLL value, the client can be sure that only one HLL value's worth of identifying information is being submitted with each request.
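A toy version of such a small-block Feistel cipher is easy to construct: split the 18-bit block into two 9-bit halves and run a few rounds of a keyed round function. This is a minimal sketch—the round function, round count, and key schedule here are illustrative, not a vetted cipher design:

```python
import hashlib

HALF_BITS = 9                     # 18-bit block split into two 9-bit halves
MASK = (1 << HALF_BITS) - 1

def round_fn(half: int, round_key: bytes) -> int:
    """Keyed pseudorandom function truncated to 9 bits (illustrative only)."""
    digest = hashlib.sha256(round_key + half.to_bytes(2, "big")).digest()
    return int.from_bytes(digest[:2], "big") & MASK

def feistel_encrypt(block: int, round_keys: list) -> int:
    left, right = block >> HALF_BITS, block & MASK
    for k in round_keys:
        left, right = right, left ^ round_fn(right, k)
    return (left << HALF_BITS) | right

def feistel_decrypt(block: int, round_keys: list) -> int:
    left, right = block >> HALF_BITS, block & MASK
    for k in reversed(round_keys):
        left, right = right ^ round_fn(left, k), left
    return (left << HALF_BITS) | right

keys = [b"round-key-1", b"round-key-2", b"round-key-3", b"round-key-4"]
ciphertext = feistel_encrypt(0x2ABCD, keys)
assert 0 <= ciphertext < (1 << 18)               # output is also 18 bits
assert feistel_decrypt(ciphertext, keys) == 0x2ABCD
```

The key property is that the cipher is a permutation on exactly the 18-bit value space: the ciphertext carries no more identifying bits than the HLL value itself.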
Encryption doesn't entirely solve the observability problem, however: even if they're encrypted, a malicious client can request a large number of HLL values and observe which values are served most often—they can statistically reconstruct the distribution of HLL values by observation. They can submit the rarest ones they get as often as they want and bias our client count estimates. This is not as severe a problem as an attacker being able to forge arbitrarily large geometric samples, but it means that they *can* inflate user count estimates.
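This observation attack can be sketched in a few lines. Since the cipher is a fixed permutation, rare plaintexts map to rare ciphertexts, so an attacker can simply harvest the values seen only once. The `encrypt` stand-in here is a hypothetical injective mapping, not a real cipher:

```python
import random
from collections import Counter

def geometric_sample() -> int:
    """Geometric sample: P(k) = 2^-(k+1)."""
    k = 0
    while random.random() < 0.5 and k < 63:
        k += 1
    return k

def encrypt(k: int) -> int:
    """Stand-in for the Feistel cipher: any fixed 18-bit permutation behaves the same.
    (Multiplication by an odd constant is invertible mod 2^18, so this is injective.)"""
    return (k * 2654435761) & 0x3FFFF

# Attacker requests many encrypted HLL values and tallies how often each appears.
requests = [encrypt(geometric_sample()) for _ in range(100_000)]
counts = Counter(requests)

# Singletons almost certainly decrypt to the largest geometric samples seen—
# exactly the values worth replaying to inflate count estimates.
rare = [c for c, n in counts.items() if n == 1]
```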
Is this significantly worse than the "gold standard" of random unique client IDs? Unfortunately, it is. With client IDs, if an attacker wants to inflate the client count for a slice of requests by $n$ extra clients, they have to make $n$ requests. With encrypted, signed HLL values, the attacker has to do a lot of work up front to find rare values, but once they have some rare values, they only need to send one request per bucket in a slice in order to inflate its client count. Of course, they don't know which values are in which buckets, but they can submit all of the rarest values that they've found (presumably whichever ones they've only seen a single time). This means that given enough up-front work finding rare HLL values, any slice of requests can be inflated arbitrarily with a fixed amount of attack-time effort.
Any one of these problems might not be fatal on its own, but between the size of data storage required, the additional operational complexity, and the fact that an attacker can still inflate estimates with a fixed amount of effort, signed HyperLogLog values are, unfortunately, not an appealing solution for client count estimation. If this were the best we could do, I might consider deploying it, but I would have significant reservations. Fortunately, however, signatures are not the only possible solution.
**Next:** [5. Encrypted Geometric Sampling](https://hackmd.io/@HLLoverRSA/5_Encrypted_Geometric_Sampling)