For Cryptographers

# For Cryptographers {%hackmd @HLLoverRSA/latex-preamble %} From a cryptographer's point of view, there are two parts to HyperLogLog over RSA that can largely be decoupled. Unsurprisingly, they are the HyperLogLog part and the RSA part. HyperLogLog (HLL) estimation provides core of the protocol. You can understand the privacy model and how client counts are estimated just through HLL sampling and estimation. The core of the protocol is that in each "resource class" each client chooses a uniformly random bucket index from $\set{0, 1, \dots, B-1}$ and a geometrically distributed sample value from $\set{0, 1, \dots, m}$. They send this index-sample pair to the server and the server can aggregate these as it chooses in order to estimate the number of unique client values in arbitrary subsets of requests within a resource class. The main issue with this protocol is that plain text HyperLogLog values are trivial to forge, and forging them would allow an attacker to inflate client count estimates. That is where the RSA layer comes in. We map RSA values in $\Z_N^*$ to HLL values, so that sampling uniformly from $\Z_N^*$ gives the right distribution on HLL values. This is a heavily surjective mapping, with each fiber having size proportional to the probability of the HLL value in the image. Each client randomly chooses a persistent secret, $x \in \Z_N^*$, which determines their HLL value. If they were to send $x$ with every request, it would, of course, be trivial to identify their requests. Instead, for each request, a client chooses a random multiplier, $w \in W$, and a random exponent, $t \in T$, and sends $y = wx^t$. The sets $W$ and $T$ are defined below. The crucial part is that $W$ and $T$ are defined such that any two clients with the same HLL value are equally likely to send $y$, and therefore they are indistinguishable to the server. The RSA modulus, $N$, has the following structure: $$ N = P Q = (2 B p + 1)(2^m q + 1) $$ where $B$ and $m$ are public parameters indicating the number of HyperLogLog buckets and the maximum geometric sample value, respectively. And $P$, $Q$, $p$ and $q$ are all distinct primes that are coprime to $B$. We require that $B$ is odd and that $m ≥ 2$. In practice, we use $B = 2^{12}-1$ and $m = 63$. $P$ and $Q$ are large primes, with roughly $\frac{1}{2} \log_2(N)$ bits each, so that $N$ is a large, balanced semiprime, as usual for an RSA ring. The primes, $P$, $Q$, $p$ and $q$, are all odd, distinct and coprime to $B$. The multiplicative structure of the RSA ring is: $$\begin{align} \Z_N^* &\cong C_2 \times C_B \times C_{2^m} \times C_{pq} \end{align}$$ where $C_n$ denotes the cyclic group of order $n$. The sets $W$ and $T$ that random multipliers and exponents are sample from are defined as: $$\begin{align} W &= \set{\, z^{B2^m} \st z \in \Z_N^* \,} = (\Z_N^*)^{B2^m} \\ T &= \set{\, 2Bi + 1 \st i \in \set{0, \dots, 2^{m-1}-1} \,} \end{align}$$ It is easy to sample random values from these sets without knowing the factorization of $N$. Let $\hll: \Z_N^* \to B \times (m+1)$ be the function mapping RSA values to HLL values. If $\log(x) = (a, b, c, d) \in \Z_2 \times \Z_B \times \Z_{2^m} \times \Z_{pq}$ then we define this as: $$ \hll(x) = (b, \tz(c)) $$ This can be computed straightforwardly by the party who knows the factorization of $N$. There are four main facts to establish: 1. Distribution: $\hll$ induces the right distribution on $B \times (m+1)$. 2. Anonymity: $\hll(x) = \hll(y) \iff \E (w, t) \in W \times T: y = wx^t$. 3. Tamper resistance: clients cannot bias $\hll(x)$ without factoring $N$. 4. Safety: that $N$ has the correct construction. For the distribution requirement: it's straightforward to see that the orders of elements in $C_{2^m}$ has the desired geometric distribution. This is addressed in detail in [Encrypted Geometric Sampling](https://hackmd.io/@HLLoverRSA/5_Encrypted_Geometric_Sampling#Geometric-RSA-Rings). This has also been empirically tested, but feedback would be appreciated on whether this needs to be established more rigorously or whether it is clear enough. A proof of the anonymity property is presented in [Proof of Anonymity](https://hackmd.io/@HLLoverRSA/9_Proof_of_Anonymity). It takes a slightly unorthodox approach of defining a "semigenertor" for multiplicative group that is a product of cyclic components and then working with logarithms of components with respect to this semigenerator element. If there is a more standard way to approach or present this, I'm all ears. Tamper resistance is discussed in [Malicious Clients](https://hackmd.io/@HLLoverRSA/11_Malicious_Clients), but the evidence is flimsy, and the section says as much. Ideally there would be a reduction from being able to forge values with large geometric samples to factoring $N$ or some other problem that is believed to be hard in RSA rings, but I don't have that. This is definitely the weakest part of the formalization, but also in some sense, the least important—it's much more important to have provable client anonymity. The final problem to address is demonstrating that $N$ has a "harmless" structure. The proof of anonymity definitely applies of $N$ has the expected structure. But if $N$ is allowed to have any structure at all, it's quite easy to arrange for $y = wx^t$ to be uniquely identifying. It would be possible to give a zero-knowledge proof that $N$ has the expected structure. I personally found that annoying and tricky to implement, so I ended up using a different form of evidence. The client can easily check that $N = 3 \bmod 4$ and that $\gcd(B, N) = \gcd(B, N-1) = 1$. If it is also shown that if for every $x, y \in \Z_N^*$ with positive Jacobi symbol, one of $x$, $y$ or $xy$ must be a quadratic residue, then $N$ has at most two distinct prime factors. An interactive protocol for testing this is straightforward: the client picks random $x$ and $y$ and challenges the server to produce $r$ such that $r^2 = x$ or $r^2 = y$ or $r^2 = xy$. This can be turned into a non-interactive certificate in the usual way, but producing $x$ and $y$ via cryptographic hashing (making sure to include $N$ in the input). I also show that when $N$ has more than two distinct prime factors, the probability for each random pair $(x, y)$ that one of $x$, $y$ and $xy$ is a quadratic residue is at most $5/8$, which tells us how many challenges the server needs to solve in order to reach a given level of certainty. All of this is presented in [Malicious Servers](https://hackmd.io/@HLLoverRSA/12_Malicious_Servers). That's the heart of it. There's also the business of resource class sharding and choosing a single $x_0 \in \Z_N^*$ as a "master key" to generate $x$ values specific to each resource class.