owned this note
owned this note
Published
Linked with GitHub
# Scaling Semaphore

*Thanks to [Vivian](https://github.com/vplasencia) for answering so many questions.*
[Semaphore](https://docs.semaphore.pse.dev/) is a protocol that allows you to cast a message as a provable group member without revealing your identity.
When dealing with large groups it's not possible to run the protocol as it is normally used, mostly on end user's devices to maintain privacy. This introduces storage, computation and bandwidth requirements that make it impractical.
Worldcoin [World ID](https://world.org/blog/worldcoin/world-id-faqs) (~11M users) currently [uses semaphore](https://worldcoin.org/blog/worldcoin/intro-zero-knowledge-proofs-semaphore-application-world-id) for their protocol. The solution they found for this issue was to delegate the proving to their servers. The downside of this solution is that users' identities are no longer private, since they need to be disclosed to the server to generate the proofs.
In this document I'll try to expand on this issue and present some possible solutions.
## Semaphore Protocol
The Semaphore flow, as defined in the [protocol specifications](https://hackmd.io/@vplasencia/B1sCrsoFkg), is the following:
1. The user generates an identity.
2. The identity is added to the group.
3. An anonymous membership proof can be generated using the identity and group.
4. A nullifier is generated to uniquely represent the user's zk proof while ensuring anonymity.
5. The user can attach an anonymous message (such as a vote, text message).

Our interest lies in the third step of the process, this is, the creation of the anonymous membership proof. In order to do this, you need both the **identity** and the **group**.
- **Identity (commitment)**: public Semaphore identity value used in Semaphore groups.
- **Group**: [Lean Incremental Merkle Tree](https://github.com/privacy-scaling-explorations/zk-kit/tree/main/papers/leanimt) in which each leaf is an identity commitment for a user.
First, a Merkle proof of inclusion is generated. Then, this is input to a circuit that generates a zero-knowledge proof that can then be used to anonymously prove that you're part of the group.
## Membership Proof
A membership proof lets you verify that a specific leaf is part of the LeanIMT. It consists of:
- **Leaf value**: The member you want to prove is included.
- **Sibling hashes**: A list of the neighboring node hashes encountered on the path from the leaf to the root.
- **Path information (or index)**: A series of bits or an index that indicates at each level whether the leaf was a left or right child, which tells you how to combine it with its sibling when recomputing the hash.
- **Merkle root**: The final root hash of the tree, against which you verify the recomputed result.
Using this data, you can iteratively hash the leaf with its siblings (in the order specified by the path) to rebuild the root. If the rebuilt root matches the given Merkle root, the proof is valid and the leaf is confirmed to be part of the tree.
If we give away this information, our identities are revealed. If we don't want this to happen, we have to generate a zero knowledge proof that we belong to the group.
## Zero Knowledge Proof
To generate the zk-proof we can use the [Semaphore circuit](https://github.com/semaphore-protocol/semaphore/blob/main/packages/circuits/src/semaphore.circom), which takes as some of it's inputs:
```circom
signal input merkleProofLength,
merkleProofIndices[MAX_DEPTH],
merkleProofSiblings[MAX_DEPTH];
```
Is clear then that any plan to delegate the generation of this proof should be made around a [Private Proof Delegation](https://github.com/privacy-scaling-explorations/private-proof-delegation-docs) scheme.
The current solution is to generate the proof locally. As I mentioned earlier, this could present a storage and/or computational limitation.
## Storage
The Semaphore v4 protocol supports trees with [depth up to 32](https://github.com/semaphore-protocol/semaphore/blob/f67958349810b0d6cda8d77488e8371f8d6f3985/packages/utils/src/constants.ts#L7).
To calculate the amount of nodes in a binary tree with depth $d$ we can use the following equation:
$$
2^{d+1}-1
$$
In our case,
$$
2^{32+1}-1 = 8,589,934,591
$$
Each leaf contains a Poseidon hash over a [BN254 field](https://github.com/chancehudson/poseidon-lite), with elements of 32 bytes each.
$$
8589934591 \times 32 = 274,877,906,912 \; bytes
$$
So the full size is `~256 GB`.
But ok, let's assume $8.5$ billion IDs is a lot, for a tree of depth $23$ the group can hold $16,777,215$ participants, not too far away from World-ID. In this case, the group would be of size $536,870,880 \; bytes$ or `~500 MB`.
> 💭 We can make the proof only with some leaves and hashes, but if we ask for this *specific* data we'll be giving away our identity.
In the latter case, it's not too unrealistic to download that amount of data if it's something the user has to do just **once**, but it already is prohibitely expensive for the rest of applications.
## Computation
The computational side of things don't seem to be a limitation.
Even though the the [Poseidon Hash circuit](https://github.com/iden3/circomlib/blob/master/circuits/poseidon.circom) seems to be quite expensive, the [Binary Merkle Root circuit](https://github.com/privacy-scaling-explorations/zk-kit.circom/blob/9d1bbca137ee47552a63c0c3c24f8f7dad0ccfd9/packages/binary-merkle-root/src/binary-merkle-root.circom#L18) only runs it at most `MAX_DEPTH` (32) times.
```circom
for (var i = 0; i < MAX_DEPTH; i++) {
var isDepth = IsEqual()([depth, i]);
roots[i] <== isDepth * nodes[i];
root += roots[i];
var c[2][2] = [ [nodes[i], siblings[i]], [siblings[i], nodes[i]] ];
var childNodes[2] = MultiMux1(2)(c, indices[i]);
nodes[i + 1] <== Poseidon(2)(childNodes);
}
```
### Benchmarks
Using the [semaphore v4 benchmarks](https://github.com/vplasencia/semaphore-benchmarks/blob/main/node/src/generate-proof.ts) I calculated some average times for the proof generation for different groups. Locally ran in an M1 MacBook Pro.
| Depth | IDs | Proof Generation Time (ms) |
|-------|-----------|------------------------------|
| 9 | 1,000 | 352.17517 |
| 13 | 15,000 | 452.96763 |
| 16 | 130,000 | 496.71687 |
| 19 | 1,000,000 | 583.64228 |
Based on these numbers, it seems like the computation necessary to generate the proofs for large groups is not a blocker. This allows us to focus on the privacy issues related to obtaining the needed data.
## Solutions
### Fetch Merkle Path Using PIR
As we've seen in the **Storage** section, relatively large groups prevents the user from being able to store the whole tree. So our main assumption here will be that the tree is stored in a server.
To generate the membership proof we'll have to fetch the Merkle Path (necessary data to make the merkle proof) corresponding to the identity commitment. If we request this information then we would be revealing who we are to the server. A way to work around this issue is [Private Information Retrieval (PIR)](https://crypto.stanford.edu/pir-library/).
PIR allows you to perform *private queries*, this is, ask a database owner for some data in a way that they can't tell what is you requested, but still being able to give it to you. Of course, this comes with some performance overhead.
Then, we could use a PIR scheme for requesting the leaves of the tree that we need. This comes with it's own challenges, since in order to do that we should know beforehand **what leaves to ask for.**
Since we're not storing the tree, we don't know it's structure and the position of our identity commitment.
A possible solution to this is to perform a two-party computation with the server to execute a function that returns the merkle path leaves indices, with our commitment as input.
Once we now the merkle path, we can then use PIR to fetch for the commitments and hashes.
### Private Proof Delegation
It could be possible to delegate the proof generation using a of Private Proof Delegation scheme.
This is a current research area in PSE, and it would allow to obtain the proof without the server being able to *read* the inputs.
### Merkle Forests
Merkle Forests can reduce the cost of the proof generation by lowering the anonimity degree, in a way that the user proves membership of a sub-tree instead of the whole group.
There is an [implementation](https://github.com/samzkback/merkle-forest) with this solution, which also includes *elastic groups*, that is, allowing the user to choose their anonimity degree, choosing groups of different sizes.
### Merkle Forests + PIR
One option to achieve full privacy with Merkle Forests is fetching them using PIR.
If we have large enough sub-trees, then the user would be able to fetch every other sub-tree root to create a full proof. Their own sub-tree could be fetched using PIR.
Sub-trees should be large enough so we can have a small set of sub-tree roots, but small enough so they can be fetched using PIR. Below are some SOTA PIR practical settings:
- Frodo PIR
- DB Size: ~$2^{20}$ elements
- Element size: `1 KB`
- Respire
- DB Size: ~$2^{20}$ elements
- Element size: `256 B`
- Spiral
- Public Parameters: `14 ~ 47 MB`
- Element size `100 KB`
- Database size: Should be adjusted by pre-processing time. 16.8 s for a 256 MB database, and 569 s for an 8 GB database.
PIR seems to be optimized for many elements of small size, instead of few large elements DBs, so we won't be able to have large trees.
Instead, we can reduce the anonimity of the whole set, to sub-trees of maximum size of $2^{20}$ elements, with `32 B` elements.
It would be possible to perform PIR on these sub-trees, and then perform the proof of membership over this sub-tree only. This will have to be complemented with a proof of membership of the sub-tree root, that can be calculated server-side.
A great insight from [Keewoo](https://keewoolee.github.io/), the structure of the LeanIMT allows us to:
- Know our index when we join the group.
- Know the tree structure with the amount of leaves. It is an append-only tree so the structure is deterministic.
This allows us to recreate the merkle path indices just by knowing the amount of leaves in the tree. We can then request these indices using PIR to calculate the proof.
## Initial Exploration Path
The first route we will work on, this is implement a PoC and benchmark it, is the **Merkle Forests + PIR** proposed solution.
We can split this into two separate parts, one for PIR for trees of up to $2^{20}$ leaves and then extend this PIR sub-trees to reach up to $2^{32}$.
The first step would be implement the [LeanIMT](https://github.com/vplasencia/leanimt-rs) **blind** merkle path retrieval. This is what would allow us to know the nodes indices that belong to our merkle path, without fetching the tree.
The next step would be to implement PIR for the LeanIMT. The chosen PIR scheme is [Respire](https://eprint.iacr.org/2024/1165), for which one of the authors (David J. Wu) gave a great presentation for a [PSE L&S session](https://www.youtube.com/watch?v=Nf4IZ2kTPN4). The reasons are:
- This scheme is tailored for databases of small records.
- It has no offline phase.
- It supports batch queries.
- It has a [working implementation](https://github.com/AMACB/respire).
The semaphore group structure can then be modified in a way that these trees are sub-trees, then extending the amount of participants a single group can hold. If the roots of all sub-trees can be stored by the user, then proofs can still be done without revealing the sub-tree the identity commitment is.
## Resources
- [Semaphore Specs](https://hackmd.io/@vplasencia/B1sCrsoFkg)
- [Merkle Tree Size Sheet](https://docs.google.com/spreadsheets/d/17gvQMLuZ6jXnWRCFNOQCVl-ZJr18FoTHZVX1cGrf-po/edit?gid=0#gid=0)
- [Lean Incremental Merkle Tree](https://github.com/privacy-scaling-explorations/zk-kit/tree/main/papers/leanimt)