# 11. Malicious Clients
{%hackmd @HLLoverRSA/latex-preamble %}
Recall that with plain-text HyperLogLog client counting we had the rather severe problem that an attacker could send just one request for each bucket and inflate the unique client estimate for a resource class to the maximum possible value. We want to convince ourselves that our new protocol doesn't have this problem. Our "gold standard" in terms of attack effort is the naive unique client ID approach: an attacker needs to send one request for every additional increment of the count in a given resource class—the attack effort is linear with a coefficient of one.
If a malicious client sends $y$ with $\Jacobi_N(y) ≠ -1$, the server will detect it. So we can focus on a malicious client sending $y$ with $\Jacobi_N(y) = -1$. As discussed in the previous section, every element $y \in J_N^-$ is of the form $y = x_0 g^h$ where $g$ is our chosen semigenerator and $x_0$ is an arbitrary twist element with $\Jacobi_N(x_0) = -1$. Here we use $h$ as just an arbitrary attacker-controlled exponent value, not necessarily the output of a hash function. The question is whether a malicious client can influence $y$ to have large geometric sample values.
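The server-side rejection check is just a Jacobi symbol computation, which needs only the public modulus $N$, never its factorization. A minimal stdlib-only sketch (the function name `jacobi` is mine, not part of the protocol):

```python
def jacobi(a, n):
    """Jacobi symbol (a/n) for odd n > 0, via binary quadratic reciprocity."""
    assert n > 0 and n % 2 == 1
    a %= n
    result = 1
    while a != 0:
        while a % 2 == 0:       # pull out factors of 2: (2/n) = -1 iff n = 3, 5 mod 8
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a             # reciprocity: flip the sign iff both are 3 mod 4
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0   # 0 means gcd(a, n) > 1
```

The server would simply reject any submitted $y$ for which `jacobi(y, N)` is not $-1$.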
The server doesn't publish any $x_0$ value, so clients have to generate one for themselves. However, since they don't know the factorization of $N$, they have no idea what its logarithms are. For our analysis, write $\log(x_0) = (a, b, c, d)$, but keep in mind that the attacker has no idea what these values are. The HyperLogLog sample that $y = x_0 g^h$ encodes is $(b + h, \tz(c + h))$. The attacker controls $h$ but knows neither $b$ nor $c$. Since they don't know $c$, they cannot force $\tz(c + h)$ to be large, or even know how large it is for any particular $h$. In order to hit $\tz(c + h) ≥ k$ the attacker needs to happen to choose an $h$ whose last $k$ bits complement $c$ perfectly, which occurs with probability $1/2^k$, exactly the probability of picking a value that good by chance. What they can do, however, is scan through consecutive $h$ values. If they scan $2^k$ consecutive values, they're guaranteed to hit $h \equiv -c \pmod{2^k}$ for one of those values and they may happen to hit something better.
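This scanning behaviour is easy to check empirically. In the simulation below (the bit width and parameter values are illustrative choices of mine), for a secret $c$, scanning any $2^k$ consecutive $h$ values always reaches $\tz(c + h) \ge k$, and the average of the maxima comes out near $k + 1$:

```python
import random

def tz(x, width=32):
    """Trailing zero bits of x (x == 0 treated as `width` zeros)."""
    return (x & -x).bit_length() - 1 if x else width

k, trials = 6, 2000
rng = random.Random(1)
maxima = []
for _ in range(trials):
    c = rng.getrandbits(32)       # secret exponent coordinate, unknown to the attacker
    start = rng.getrandbits(32)   # wherever the attacker happens to start scanning
    maxima.append(max(tz((c + start + i) % 2**32) for i in range(2**k)))

assert min(maxima) >= k           # the scan always reaches tz >= k
print(sum(maxima) / trials)       # empirically close to k + 1
```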
In the unique ID scheme, each request with a freshly "forged" client ID inflates the client count by one. How does HLL over RSA compare to this when an attacker tries to inflate its estimates? When scanning a consecutive block of $2^k$ values, the expected maximum value of $\tz(c + h)$ is $k + 1$: the one value in the block divisible by $2^k$ contributes $k$ guaranteed trailing zeros, plus one more in expectation since its higher bits behave randomly. This is slightly better than sending random values. If an attacker sends $B 2^k$ spoofed requests, $2^k$ per bucket, they can push each bucket up to an expected value of $k + 1$, which gives a client count estimate of:
$$\begin{align}
\hat{n} = \frac{1}{2 \ln(2)} B 2^{k+1} = \frac{1}{\ln(2)} B2^k
\end{align}$$
This uses the original Flajolet *et al.* (2007) estimator for $B ≥ 128$. Better estimators exist for small samples, and should be used for client count estimates, but for this analysis the original asymptotic formula is fine. So, at scale, an attacker can expect to inflate the estimate by $\ln(2)^{-1} \approx 1.44$ per malicious request. This is a rather good result: attack effort is linear and the coefficient is just a bit larger than one, only slightly worse than the unique ID gold standard. In short, whereas the naive unique ID scheme has perfect accuracy and resists inflation well but has awful privacy, our HLL over RSA scheme is approximate (with tunable accuracy), recovers nearly the same inflation resistance, and provides provable client anonymity.
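To make the per-request inflation concrete, here is the arithmetic of the estimate above with example numbers (the specific $B$ and $k$ are illustrative):

```python
import math

B, k = 128, 10                        # bucket count and scan depth (example values)
alpha = 1 / (2 * math.log(2))         # large-B limit of the HLL bias constant

registers = [k + 1] * B               # every bucket pushed to its expected maximum
raw_estimate = alpha * B**2 / sum(2.0**-m for m in registers)

requests = B * 2**k                   # total spoofed requests the attacker sent
inflation_per_request = raw_estimate / requests   # comes out to 1/ln(2), about 1.44
```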
While this is quite a good result, our analysis of resistance to malicious clients is somewhat weak. If an attacker can learn the $c$ coordinate of any $x_0$ value with respect to any semigenerator $g$, then they can forge arbitrarily rare $x_0 g^h$ values by choosing $h = -c \bmod 2^k$ for whatever $k$ they want. They can also cover all the buckets by covering all $h \bmod B$ residue classes. A proof of attack resistance would use knowledge of such a tuple, $(x_0, g, c)$, to factor $N$ or to perform some other computation that is believed to be hard in RSA rings. We do not have such a proof. The best we have is an argument that this seems hard: the exponent coordinates in $\Z_N^*$ are secret in a way that many protocols rely on, and recovering even one is widely believed to be hard. Without the requirement that $\Jacobi_N(x_0) = -1$ it would be quite easy to find values with known coordinates with respect to $g$: just take powers of $g$. Since $\Jacobi_N(g) = 1$, however, any power of $g$ also has positive Jacobi symbol. To get to a value with negative Jacobi symbol, one needs to start from an element with negative Jacobi symbol, which presents would-be attackers with a seemingly unsolvable bootstrapping problem: there's no apparent way to make that leap without introducing an unknown $c$ coordinate.
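To see why a leaked $c$ coordinate would be fatal, note that the forgery itself is trivial arithmetic once $c$ is known; all of the difficulty lies in learning $c$. A toy illustration (the particular value of $c$ here is made up):

```python
def tz(x):
    """Trailing zero bits of a positive integer."""
    return (x & -x).bit_length() - 1

c = 0xB6D                  # pretend this secret coordinate has leaked
for k in (4, 8, 16, 24):
    h = (-c) % 2**k        # the forged exponent: cancels the low k bits of c
    assert tz(c + h) >= k  # so the sample looks at least 2**k-rare
```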
This is a vaguer argument than I would like to make in a write-up full of rigorous proofs. Fortunately, proving that client anonymity is preserved is the much more important result, and for that we do have a solid proof. The inability to provably guarantee resistance to malicious client attacks is more acceptable: that's a risk that we, as the operators of the system, can choose to take on. If our estimates suddenly start looking ridiculously large, we can begin to wonder whether someone has cracked the protocol.
**Next:** [12. Malicious Servers](https://hackmd.io/@HLLoverRSA/12_Malicious_Servers)