
# 1. Anonymously Counting Users

It's hard to estimate the number of users of open source software. In the Julia ecosystem, whenever someone installs or updates packages, they connect (by default) to a global network of Pkg servers operated by the open source project. If each client were to send some metadata along with each request, allowing us to count them, then we could use the number of unique client installs making requests to Pkg servers to help figure out the number of Julia users.

We also want package authors to be able to estimate the number of people using their specific software. This isn't merely a matter of curiosity or bragging rights: research funding for open source software is often predicated on demonstrating impact. If you can say "our package was installed on 50,000 systems last year", that's a pretty strong argument for the impact of your project. As a community, we also want to understand other things about Julia users, including how many people use various operating systems, how many people are using different versions of Julia, and where in the world Julia users come from.

The simplest way to count unique installs would be for each Pkg client to generate a unique random value (*e.g.* a random 128-bit value) the first time they make a Pkg request, save it, and then send it as a header along with each Pkg server request. This was proposed at one point, but after much discussion it was deemed to be too much of a privacy hazard. Such a unique ID would allow anyone with access to Pkg server logs to track every Pkg request made by every client. Even with the ability to opt out, doing this by default is too invasive.

If we're going to count clients, we need to ensure anonymity. It should be impossible for anyone, even with full access to our Pkg server logs, to identify or track users making requests.

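
To make the trade-off concrete, here is a minimal Python sketch of the random-ID scheme described above. It is illustrative only: the log format and the field names (`client_id`, `package`, `os`) are assumptions made for this example, not the actual Pkg server log schema, and the helper functions are hypothetical.

```python
import secrets


def new_client_id() -> str:
    """Generate a random 128-bit ID, as a client would on its first request."""
    return secrets.token_hex(16)  # 16 random bytes = 128 bits, hex-encoded


def count_unique_clients(log, **filters):
    """Count distinct client IDs among log entries matching every filter."""
    return len({
        entry["client_id"]
        for entry in log
        if all(entry.get(key) == value for key, value in filters.items())
    })


def requests_by_client(log, client_id):
    """The privacy hazard: recover every request one client ever made."""
    return [entry for entry in log if entry["client_id"] == client_id]


# Two hypothetical clients, each generating and saving an ID once.
alice, bob = new_client_id(), new_client_id()

# Hypothetical server log: the ID arrives with every request.
log = [
    {"client_id": alice, "package": "DataFrames", "os": "linux"},
    {"client_id": alice, "package": "Plots", "os": "linux"},
    {"client_id": bob, "package": "DataFrames", "os": "macos"},
    {"client_id": alice, "package": "DataFrames", "os": "linux"},
]

total = count_unique_clients(log)  # unique installs overall
dataframes_linux = count_unique_clients(log, package="DataFrames", os="linux")
alice_history = requests_by_client(log, alice)  # full per-client request trail
```

Counting is trivial and works on any slice of the log, which is what makes the scheme attractive. The same ID that enables counting, however, also lets `requests_by_client` reconstruct a complete history for any individual client, which is exactly the tracking problem described above.
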
The naive random ID approach, even though it was deemed unacceptable for privacy reasons, will serve as a "gold standard" for us, at least in terms of functionality. We will judge our solutions by how close they get to providing equivalent functionality to unique client IDs. The "golden features" include:

- Clients can generate their own IDs; servers do not need to issue them
- Servers can passively collect request logs and post-process them
- Client counts can be computed within arbitrary slices of request logs

Ideally we want something that gives us similar architectural simplicity and flexibility but preserves user anonymity and privacy. Is this even possible? If we can come up with a user count estimation scheme that functionally approximates unique client IDs while preserving user anonymity, we will consider it a success.

The rest of this writeup is about exactly that: a new technique for estimating unique client installs in such a way that individual users cannot be tracked or identified, even by people with full access to server logs. It makes two modest sacrifices compared to unique client IDs that have little impact on utility in practice:

1. It doesn't allow client counting across completely arbitrary slices of requests: there are certain classes of requests across which client counts cannot be aggregated.
2. It is approximate rather than exact. We can estimate the number of users in each slice of logs with some random error. The error bounds, however, are small and tunable.

Our approach combines two algorithms that each seem like magic: HyperLogLog and RSA. HyperLogLog is an algorithm for estimating the number of unique values in huge data sets using less memory than seems like it should be possible. RSA (Rivest–Shamir–Adleman) is one of the original public key cryptography techniques, which, even though we now take it for granted, also doesn't seem like it should be possible.

The new protocol, which I'm tentatively calling "HyperLogLog over RSA", combines these magical technologies and does something new: it allows accurately estimating the number of unique clients making requests to a service without anyone being able to identify or track those clients.

Throughout this writeup, I will use Julia's client-server interaction as the grounding context, but user counting via HyperLogLog over RSA is a completely general technique that could be used by other open source projects or in any situation where one wants to estimate unique clients while provably preserving anonymity. Privacy laws like Europe's [GDPR](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation) and California's [CCPA](https://en.wikipedia.org/wiki/California_Consumer_Privacy_Act) make this not just a Good Thing™ to do, but also a legal requirement in many parts of the world.

**Next:** [2. HyperLogLog](https://hackmd.io/@HLLoverRSA/2_HyperLogLog)