Scaling Semaphore

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Thanks to Vivian for answering so many questions.

Semaphore is a protocol that allows you to cast a message as a provable group member without revealing your identity.

When dealing with large groups it's not possible to run the protocol as it is normally used, mostly on end user's devices to maintain privacy. This introduces storage, computation and bandwidth requirements that make it impractical.

Worldcoin World ID (~11M users) currently uses semaphore for their protocol. The solution they found for this issue was to delegate the proving to their servers. The downside of this solution is that users' identities are no longer private, since they need to be disclosed to the server to generate the proofs.

In this document I'll try to expand on this issue and present some possible solutions.

Semaphore Protocol

The Semaphore flow, as defined in the protocol specifications, is the following:

The user generates an identity.
The identity is added to the group.
An anonymous membership proof can be generated using the identity and group.
A nullifier is generated to uniquely represent the user's zk proof while ensuring anonymity.
The user can attach an anonymous message (such as a vote, text message).

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

Our interest lies in the third step of the process, this is, the creation of the anonymous membership proof. In order to do this, you need both the identity and the group.

Identity (commitment): public Semaphore identity value used in Semaphore groups.
Group: Lean Incremental Merkle Tree in which each leaf is an identity commitment for a user.

First, a Merkle proof of inclusion is generated. Then, this is input to a circuit that generates a zero-knowledge proof that can then be used to anonymously prove that you're part of the group.

Membership Proof

A membership proof lets you verify that a specific leaf is part of the LeanIMT. It consists of:

Leaf value: The member you want to prove is included.
Sibling hashes: A list of the neighboring node hashes encountered on the path from the leaf to the root.
Path information (or index): A series of bits or an index that indicates at each level whether the leaf was a left or right child, which tells you how to combine it with its sibling when recomputing the hash.
Merkle root: The final root hash of the tree, against which you verify the recomputed result.

Using this data, you can iteratively hash the leaf with its siblings (in the order specified by the path) to rebuild the root. If the rebuilt root matches the given Merkle root, the proof is valid and the leaf is confirmed to be part of the tree.

If we give away this information, our identities are revealed. If we don't want this to happen, we have to generate a zero knowledge proof that we belong to the group.

Zero Knowledge Proof

To generate the zk-proof we can use the Semaphore circuit, which takes as some of it's inputs:

signal input merkleProofLength,
             merkleProofIndices[MAX_DEPTH],
             merkleProofSiblings[MAX_DEPTH];

Is clear then that any plan to delegate the generation of this proof should be made around a Private Proof Delegation scheme.

The current solution is to generate the proof locally. As I mentioned earlier, this could present a storage and/or computational limitation.

Storage

The Semaphore v4 protocol supports trees with depth up to 32.

To calculate the amount of nodes in a binary tree with depth

d

we can use the following equation:

2^{d + 1} - 1

In our case,

2^{32 + 1} - 1 = 8, 589, 934, 591

Each leaf contains a Poseidon hash over a BN254 field, with elements of 32 bytes each.

8589934591 \times 32 = 274, 877, 906, 912 b y t e s

So the full size is ~256 GB.

But ok, let's assume

8.5

billion IDs is a lot, for a tree of depth

23

the group can hold

16, 777, 215

participants, not too far away from World-ID. In this case, the group would be of size

536, 870, 880 b y t e s

or ~500 MB.

💭 We can make the proof only with some leaves and hashes, but if we ask for this specific data we'll be giving away our identity.

In the latter case, it's not too unrealistic to download that amount of data if it's something the user has to do just once, but it already is prohibitely expensive for the rest of applications.

Computation

The computational side of things don't seem to be a limitation.

Even though the the Poseidon Hash circuit seems to be quite expensive, the Binary Merkle Root circuit only runs it at most MAX_DEPTH (32) times.

for (var i = 0; i < MAX_DEPTH; i++) {
    var isDepth = IsEqual()([depth, i]);

    roots[i] <== isDepth * nodes[i];
    root += roots[i];

    var c[2][2] = [ [nodes[i], siblings[i]], [siblings[i], nodes[i]] ];
    var childNodes[2] = MultiMux1(2)(c, indices[i]);

    nodes[i + 1] <== Poseidon(2)(childNodes);
}

Benchmarks

Using the semaphore v4 benchmarks I calculated some average times for the proof generation for different groups. Locally ran in an M1 MacBook Pro.

Depth	IDs	Proof Generation Time (ms)
9	1,000	352.17517
13	15,000	452.96763
16	130,000	496.71687
19	1,000,000	583.64228

Based on these numbers, it seems like the computation necessary to generate the proofs for large groups is not a blocker. This allows us to focus on the privacy issues related to obtaining the needed data.

Solutions

Fetch Merkle Path Using PIR

As we've seen in the Storage section, relatively large groups prevents the user from being able to store the whole tree. So our main assumption here will be that the tree is stored in a server.

To generate the membership proof we'll have to fetch the Merkle Path (necessary data to make the merkle proof) corresponding to the identity commitment. If we request this information then we would be revealing who we are to the server. A way to work around this issue is Private Information Retrieval (PIR).

PIR allows you to perform private queries, this is, ask a database owner for some data in a way that they can't tell what is you requested, but still being able to give it to you. Of course, this comes with some performance overhead.

Then, we could use a PIR scheme for requesting the leaves of the tree that we need. This comes with it's own challenges, since in order to do that we should know beforehand what leaves to ask for.

Since we're not storing the tree, we don't know it's structure and the position of our identity commitment.

A possible solution to this is to perform a two-party computation with the server to execute a function that returns the merkle path leaves indices, with our commitment as input.

Once we now the merkle path, we can then use PIR to fetch for the commitments and hashes.

Private Proof Delegation

It could be possible to delegate the proof generation using a of Private Proof Delegation scheme.

This is a current research area in PSE, and it would allow to obtain the proof without the server being able to read the inputs.

Merkle Forests

Merkle Forests can reduce the cost of the proof generation by lowering the anonimity degree, in a way that the user proves membership of a sub-tree instead of the whole group.

There is an implementation with this solution, which also includes elastic groups, that is, allowing the user to choose their anonimity degree, choosing groups of different sizes.

Merkle Forests + PIR

One option to achieve full privacy with Merkle Forests is fetching them using PIR.

If we have large enough sub-trees, then the user would be able to fetch every other sub-tree root to create a full proof. Their own sub-tree could be fetched using PIR.

Sub-trees should be large enough so we can have a small set of sub-tree roots, but small enough so they can be fetched using PIR. Below are some SOTA PIR practical settings:

Frodo PIR
- DB Size: ~
  $2^{20}$ elements
- Element size: 1 KB
Respire
- DB Size: ~
  $2^{20}$ elements
- Element size: 256 B
Spiral
- Public Parameters: 14 ~ 47 MB
- Element size 100 KB
- Database size: Should be adjusted by pre-processing time. 16.8 s for a 256 MB database, and 569 s for an 8 GB database.

PIR seems to be optimized for many elements of small size, instead of few large elements DBs, so we won't be able to have large trees.

Instead, we can reduce the anonimity of the whole set, to sub-trees of maximum size of

2^{20}

elements, with 32 B elements.

It would be possible to perform PIR on these sub-trees, and then perform the proof of membership over this sub-tree only. This will have to be complemented with a proof of membership of the sub-tree root, that can be calculated server-side.

A great insight from Keewoo, the structure of the LeanIMT allows us to:

Know our index when we join the group.
Know the tree structure with the amount of leaves. It is an append-only tree so the structure is deterministic.

This allows us to recreate the merkle path indices just by knowing the amount of leaves in the tree. We can then request these indices using PIR to calculate the proof.

Initial Exploration Path

The first route we will work on, this is implement a PoC and benchmark it, is the Merkle Forests + PIR proposed solution.

We can split this into two separate parts, one for PIR for trees of up to

2^{20}

leaves and then extend this PIR sub-trees to reach up to

2^{32}

The first step would be implement the LeanIMT blind merkle path retrieval. This is what would allow us to know the nodes indices that belong to our merkle path, without fetching the tree.

The next step would be to implement PIR for the LeanIMT. The chosen PIR scheme is Respire, for which one of the authors (David J. Wu) gave a great presentation for a PSE L&S session. The reasons are:

This scheme is tailored for databases of small records.
It has no offline phase.
It supports batch queries.
It has a working implementation.

The semaphore group structure can then be modified in a way that these trees are sub-trees, then extending the amount of participants a single group can hold. If the roots of all sub-trees can be stored by the user, then proofs can still be done without revealing the sub-tree the identity commitment is.

Resources

AtHeartEngineer

2025/02/19 19:45:02

storage and computation

And bandwidth, simply fetching a large tree may be a challenge

2025/02/19 19:52:03

But ok, let's assume billion users is a lot, for a tree of depth the group can hold participants, not too far away from World-ID. In this case, the group would be of size or ~500 Mb.

A user may have more than 1 identity commitment in the tree. For example, a user could have 1 id commitment per device.

Brechy

2025/02/20 20:10:53

Oh, then I should update these references as IDs instead of users

2025/02/19 20:20:03

I was thinking something like this + PIR. So you could divide a group into some lower degree, and then fetch the merkle path for that sub-group with PIR. A group of depth 24 for example could be cut into multiple depth 16s, which are roughly 4mb and is probably on the upper end of reasonable/practical. https://docs.google.com/spreadsheets/d/17gvQMLuZ6jXnWRCFNOQCVl-ZJr18FoTHZVX1cGrf-po/edit?gid=0#gid=0

2025/02/20 20:12:52

This spreadsheet is great! I'll add it to the resources. Yes, I need to evaluate the PIR schemes size limits.

vplasencia

2025/02/25 12:46:32

130,000

I just released a new LeanIMT version that fixes this issue and also released a new Semaphore version with the latest LeanIMT version. Please, try again with the latest Semaphore version: `4.9.1`. Context of this error: The LeanIMT TS implementation was improved to support initializing the tree (and adding members at once) with a large number of members e.g. > 110 000 members. LeanIMT issue: https://github.com/privacy-scaling-explorations/zk-kit/issues/376 LeanIMT PR: https://github.com/privacy-scaling-explorations/zk-kit/pull/377

2025/02/25 12:51:17

Also updated the Semaphore benchmarks app. This can be useful to check benchmarks quickly: https://benchmarks.semaphore.pse.dev/

2025/02/25 14:04:50

Thank you so much! I'll update these benchmarks