Let's consider the case of [Summa](https://github.com/summa-dev).
Currently the tree is planned to be held with the exchange. This means the exchange will know who is requesting inclusion proofs, which means that over time and multiple PoLs the exchange may be able to safely exclude certain users from the tree and go undetected.
One way to circumvent this problem is to make the tree fully public. This does, however, expose the number of users of the exchange to everyone. This approach is also not feasible due to the fact that each inclusion proof is 2752 bytes, so with 100M users that's 256 GBs total.
## Possible alternative solution
Another method that could work is to do the following:
1. Exchange groups leaf nodes together and commits to each grouping by publishing some data on-chain
2. User can determine which group they are in using their hash and the on-chain commitment
3. User requests from the exchange all the inclusion proofs for a particular group
As long as the exchange does not know the identity of the user that does the query they do not know which proof will be verified, only that it is one of the many in the group.
This way the tree is able to reside with the exchange (so no need to worry about data leakage), but the exchange does not know which user is doing verification (only that it is one of the members of the subset). There are some security concerns with this approach:
- **unshuffled subsets, few verifications:** if the users are not shuffled across subsets over different PoLs, and only a few users are doing verification, then there will most likely be some subsets that no user has requested proofs for and so the exchange can omit all the users in that subset from the next PoL.
- **shuffled subsets, few verifications:** if the users are shuffled across subsets over different PoLs, and only a few users are doing verification, then there will most likely be some subsets that no user has requested proofs for so the exchange can take the intersctions of subsets [that have been requested] across different PoLs and slowly zone-in on the identity of those that are doing verificiation.
It's not clear exactly how many users would need to do verification in order to void the above 2 security concerns, so this needs to be delved into properly at some point.
### More formal definition of the solution
Let $U$ be the set of all users in the tree and $\mathbb{V}$ its power set: $\mathbb{V}=\mathcal{P}(U)=\{V:V \subseteq U\}$. We need 2 functions:
$$
\begin{align*}
f&:\mathbb{V}\to K \\
g&:K \times U \to \mathbb{Z}_2
\end{align*}
$$
where $K$ some finite space with bounding factor $m$, such as $\mathbb{F}_{2^m}$ or $\mathbb{Z}_2^m$ or $[1,2,...,m]$. $f$ is the function that the exchange uses to create a commitment $k$ to a specific subset of users, which they post on-chain. $g$ is the function that the user would use to check if they are in a given subset.
$f$ and $g$ need to have the following properties:
1. $\forall \space V \in \mathbb{V} \quad f(V)=k \implies g(v,k)=1 \space \forall \space v \in V$
2. $\forall \space V \in \mathbb{V} \quad f(V)=k \implies g(u,k)=0 \space \forall \space u \in U-V$
3. An adversary with access to $k$ s.t. $k=f(V)$ is not able determine any of the following in polynomial time:
- any $v \in V$
- any $u \in U \text{ s.t. } u \notin V$
- $|V|$
It may be possible for the exchange to do some cheating by including users in more than 1 subset, but this should not be an issue (need to maybe think about this a bit more).
The work to be done now is figure out $f, g, K$.
### One realization of the solution
We can make a start on $f, g, K$ by using polynomials. First, construct the disjoint split of the users:
$$
\mathbb{V}^*= \{ V_1, V_2, ... V_n : V_i \in \mathbb{V}, V_i \text{ disjoint}, V_i \neq \emptyset \}
$$
Let $k_i=f(V_i)$ and $v_{i,j} \in V_i$ be the $j^{th}$ element of $V_i$. We define our polynomial $p$ like so:
$$
\begin{align*}
p(x)=&\prod_{i=1}^{n}\prod_{j=1}^{|V_i|}(x-k_iv_{i,j})
\end{align*}
$$
and our $g$ like so:
$$
g(u,k)=
\begin{cases}
1 : p(uk)=0 \\
0 : \text{otherwise}
\end{cases}
$$
and our $f$ can be a cryptographic hash function, which defines $K$. Depending on which space we want $p$ & $v$ to live in we would have to adjust the definition of $p$ & $g$ so $k_iv_{i,j}$ is well defined.
A first look at our setup tells us we have properties #1 & #2, as long as our space chosen for $p$ is sufficiently large such that $\text{Pr}[k_av_{a,b}=k_cv_{c,d}, a \neq c] < \epsilon$. Property #3 is not achieved because any adversary that can factor the polynomial at least once will be able to access a users identity and know which subset they are in, in polynomial time.
This is a fair start. It may be possible to achieve #3 by using the same technique as used in Kate commitments: embedding the polynomial into an elliptic curve group via the function $p \to G^p$ where $G$ is a generator of the group. The intractability of the discrete logarithm problem will then be what gives us #3. The only concern at this point would be computational comlpexity for users/exchanges.
### Problems with the realization
Each set's polynomial will have degree $n|V_i|$ which will be the minimum number of values that need to be stored in order to describe it (i.e. the coefficients). This is greater than the minimum size required to store each set, which would be about $|V_i|$ (having simply the user IDs, or their hashes). Embedding this into a group would make the amount of data to store increase, as well as the computational complexity for the user who needs to find which set they are in.
### Another way
This question [was asked](https://crypto.stackexchange.com/questions/107174/public-commitments-to-subsets) on the crypto stackexchange and Bloom filters were suggested. These save on space and can be padded to hide somewhat the size of the subsets.