Maxwell++ tree attack: seeing through the obfuscation

## Maxwell++ tree Otherwise know as a sparse summation Merkle tree, the Maxwell++ tree takes the regular SMT (Maxwell+ with the security fix proposed by Hu, Zhang, and Guo in 2019<sup>[[2](https://eprint.iacr.org/2018/1139.pdf)]</sup>) and splits up each leaf node into 2 or more nodes in order to obfuscate user balances. ## Attack vector Various exchanges use the Maxwell++ tree to do proof of liabilities. Some of the implementations of the tree are fully public (i.e. all leaf nodes are downloadable). If an adversary could piece together a subset of nodes such that they are sure the the subset is exactly all of the nodes that make up a single user then they can use this information to their advantage. The full breadown of attacks that this information allows will not be covered here but one example would be to track a user's balance over time to determine the leverage of their position and then execute trades to force their liquidation. It's now just a problem of figuring out the viability of such an attack. The argument given below hinges on the assumption that humans use round numbers more so than random numbers for their trades i.e. a user is more likely to trade 1 BTC than 1.71640531 BTC. ## Mathematical argument We need 2 things for this attack to be viable: 1. An efficient algorithm for finding all subsets that sum to a certain amount. 2. The probability of finding a subset $T$ of a random set $S$ that sums to a specific number $a$ should be negligible. If this probability is negligible in the random case and we are able to find such a subset summing to a round number then it is highly likely that the subset we found precisely matches the nodes belonging to a single user. The problem we need to solve is a common one known as the [Subset Sum problem](https://en.wikipedia.org/wiki/Subset_sum_problem) and the most efficient algorithm know for solving it runs in $\tilde{O}(n+a)$ time where $n$ is the size of the set and $a$ the number to sum to. More information about the various algorithms can be found [here](https://arxiv.org/pdf/1807.08248.pdf). This takes care of #1. #2 is a little trickier because there is no literature out there describing the pure probability but after some rabbit-holing into the math an equation has been found and indeed it shows that the probability of finding a subset that sums to $a$ is negligible for the parameters used in the Maxwell++ PoL trees. ## Probability equation Formal description of the problem: 1. Pick $a,b,n,m \in \mathbb{Z}^+$ 2. Construct a random multiset $S$ of size $n$ by doing the following *(note that $S$ is the random version of the set of balances for the leaf nodes of the Maxwell++ tree)*: a. Start with $S=\emptyset$ b. Pick random element in the range $[1,b]$ and add it to $S$ c. Repeat the above step until you have $n$ elements 3. Pick a random subset $T$ of $S$ that has size $m$ 4. Find $$\text{Pr} \left[\sum_{t \in T}t = a \right] = \frac{N}{D}$$ The following equation describes the numerator: $$N(a,b,n,m)=\Lambda_m \left[ \left(\!\!{n-m+1 \choose m}\!\!\right) + \sum_{k=1}^{n-m} \left(\!\!{k \choose m}\!\!\right) \left[ \left(\!\!{b \choose n-m-k+1}\!\!\right) - \sum_{l=0}^{n-m-k} \left(\!\!{b-1 \choose l}\!\!\right) \right] \right]$$ where $$ \Lambda_m = \begin{cases} p_m(a) &: a < b+m-1 \\ p_m(a) - \sum_{k=m-1}^{\alpha} &: b+m-1 \le a \le \beta \\ p_m(2\beta -a) - \sum_{k=m-1}^{\bar{}\alpha} &: \beta < a \le 2\beta -b-m+1 \\ p_m(2\beta -a) &: 2\beta-b-m+1 < a \le mb \end{cases} $$ and $$ \begin{aligned} \alpha &= a-b-1 \\ \bar{\alpha} &= 2\beta -a-b-1 \\ \beta &= m\left( 1+\frac{b-1}{2}\right) \end{aligned} $$ $p$ is the [partition function](https://en.wikipedia.org/wiki/Partition_(number_theory)#Restricted_part_size_or_number_of_parts) and the double-braced binomial terms are called [multiset coefficients](https://en.wikipedia.org/wiki/Multiset#Counting_multisets). The denominator is simply $$D(b,n,m)=\left(\!\!{b \choose n}\!\!\right)\binom{n}{m}$$ Details on the construction of the formula is out of the scope of this document and it must be noted that a mathematical proof of its validity does not exist. It has been tested by summing the individual probabilities to get $1$, for a range of values of $a,b,n,m$. We want the following equation to hold: $$1=\sum_{a=m}^{bm}\frac{N(a,b,n,m)}{D(b,n,m)}$$ The following histogram shows the summation for ranges $b \in [5,40]$, $n \in [5,40]$, $m \in [2,\min(n,8)]$: ![](https://hackmd.io/_uploads/rJURwDBq2.png) Most of the them are $1$ and all of them are $>.65$ which is sufficient enough for our purposes. The equation holds well in these ranges but as soon as $m$ is made larger the error margin quickly grows (I suspect the $\Lambda_m$ term is to blaim). Example for $b \in [5,10]$, $n \in [5,20]$, $m \in [2,n]$ ![](https://hackmd.io/_uploads/rJFfLDrq2.png) This is not an issue since we only need to use the equation for $m \in [2,8]$, because $m$ represents the number of times nodes have been split in the tree which is not going to be high. For our purposes we would like to evaluate the probability for values of $a,b,n,m$ that mimic real PoL data. Let's use the following values (taken from real exchange PoL data): $$ \begin{aligned} a &= 100000000 \text{ 1 BTC} \\ b &= 11006492497730 \text{ (~110064 BTC)} \\ n &= 2000000 \\ m &\in [2,5] \end{aligned} $$ and get our total probability by summing over $a$: $$ \sum_{a=2}^5\frac{N(a,b,nm)}{D(b,n,m)}= \text{?} $$ Unfortunately my computer does not have nearly enough memory to run the not-very-optimized program I wrote for these numbers. But it's fairly easy to see that the probability would be negligible just by eyeballing the formula. ## Conclusion One now only needs to write the algorithm described in #1, set the target $a$ to round values, run it and wait. If a subset is found it is highly likely this subset precisely represents a user.