# Batch Signature Check Via SIPP: Cost Analysis
## Overview
Outlined in more detail here: https://hackmd.io/1D8Y1fRtSMW1mKSl9kbMsw
The Prover computes the inner pairing product $Z = \vec{A} * \vec{B}$, where $\vec{A} * \vec{B} = \prod_i e(A_i, B_i)$, and passes it to the Verifier.
The Prover and Verifier then follow these steps:
1. The Prover computes the cross inner pairing products $Z_L = \vec{A}_{[n/2+1:n]} * \vec{B}_{[1:n/2]}$ and $Z_R = \vec{A}_{[1:n/2]} * \vec{B}_{[n/2+1:n]}$ and sends them to the Verifier. Here, $\vec{A}_{[1:n/2]}$ denotes the vector composed of the first $n/2$ elements of $\vec{A}$, and $\vec{A}_{[n/2+1:n]}$ the vector composed of the last $n/2$ elements of $\vec{A}$.
2. The Verifier samples a random $x \in \mathbb{F}^{*}_{r}$ and provides it to the Prover.
3. Both the Verifier and Prover compute $\vec{A}' = \vec{A}_{[1:n/2]} + x \vec{A}_{[n/2+1:n]}$; $\vec{B}' = \vec{B}_{[1:n/2]} + x^{-1} \vec{B}_{[n/2+1:n]}$.
4. The Verifier computes $Z' = Z_L^{x} \cdot Z \cdot Z_R^{x^{-1}}$.
5. The following updates are made: $A \leftarrow A'$, $B \leftarrow B'$, $Z \leftarrow Z'$, $n \leftarrow n/2$.
Once $n = 1$, the Verifier checks whether $e(A, B) \stackrel{?}{=} Z$, and if so, accepts (a code sketch of this recursion appears at the end of this overview).
Let $N=2^n$ be the size of the inner product, i.e. the initial vector length, so the protocol above runs for $n$ halving rounds.
The inputs to the protocol are
- $\vec{A} \in {\mathbb{G}}_1^N$
- $\vec{B} \in {\mathbb{G}}_2^N$
- $Z \in {\mathbb{G}}_T$
The proof consists of the $2n$ target group elements $\vec{Z}_{L}, \vec{Z}_{R} \in {\mathbb{G}}_T^n$ (*as well as the $A$ and $B$ vectors, which are $2n$ group elements in ${\mathbb{G}}_1$ and ${\mathbb{G}}_2$ - Nicolas*).
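As a concrete illustration, below is a minimal interactive sketch of the folding rounds above over BLS12-381, using py_ecc for the (slow) reference arithmetic. The Fiat-Shamir transcript, input commitments, and all performance concerns are omitted, and the names and structure are illustrative only, not any particular implementation.

```python
# Minimal interactive SIPP sketch over BLS12-381 using py_ecc (slow reference arithmetic).
# Assumes the vector length is a power of two; challenges are sampled directly instead of
# being derived via Fiat-Shamir, so this only illustrates the folding rounds above.
from secrets import randbelow
from py_ecc.bls12_381 import G1, G2, add, multiply, pairing, curve_order

def ipp(A, B):
    """Inner pairing product Z = prod_i e(A_i, B_i), an element of GT (FQ12)."""
    Z = pairing(B[0], A[0])
    for a, b in zip(A[1:], B[1:]):
        Z = Z * pairing(b, a)
    return Z

def fold(left, right, x):
    """Element-wise left + x * right."""
    return [add(l, multiply(r, x)) for l, r in zip(left, right)]

def sipp(A, B, Z):
    """Run the halving rounds; return True iff the final pairing check passes."""
    while len(A) > 1:
        h = len(A) // 2
        # Prover: cross inner pairing products for this round.
        Z_L = ipp(A[h:], B[:h])
        Z_R = ipp(A[:h], B[h:])
        # Verifier: random non-zero challenge x and its inverse mod r.
        x = randbelow(curve_order - 1) + 1
        x_inv = pow(x, -1, curve_order)
        # Both parties fold the vectors; the Verifier updates Z.
        A = fold(A[:h], A[h:], x)
        B = fold(B[:h], B[h:], x_inv)
        Z = (Z_L ** x) * Z * (Z_R ** x_inv)
    return pairing(B[0], A[0]) == Z

# Tiny example with N = 4 (n = 2 rounds); expect a couple of minutes with py_ecc.
A = [multiply(G1, randbelow(curve_order - 1) + 1) for _ in range(4)]
B = [multiply(G2, randbelow(curve_order - 1) + 1) for _ in range(4)]
assert sipp(A, B, ipp(A, B))
```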
## Verifier Work
- Checking the commitment to the $A$, $B$, and $Z$ vectors. The example in code uses a running hash over all elements of the vectors; the method used here is not application specific and is configurable.
- Computing the $n$ Fiat-Shamir challenges from the vectors, as well as their inverses. The inverses can be computed externally and constrained.
- Computing $n$ MSMs in each of the source groups and the target group. This is the dominating cost, and it depends on the size of the challenges: https://ethresear.ch/t/security-of-bls-batch-verification/10748
- The pairing check $e(A,B)=Z$. This can be done in circuit or by an external verifier (see the sketch after this list).
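For orientation, here is a hedged sketch of that verifier work, assuming the proof was produced with the same (placeholder) transcript rules: a running-hash transcript for Fiat-Shamir, the challenge/inverse pairs, the MSMs that recompute the folded $A$ and $B$ (written here as a single MSM per group with product coefficients, which computes the same folded points as the $n$ successively halving MSMs), and the final pairing check. The serialization and transcript format are placeholders, not those of any particular implementation.

```python
# Sketch of the verifier work listed above (py_ecc, reference only).
import hashlib
from py_ecc.bls12_381 import add, multiply, pairing, curve_order

def absorb(state: bytes, *elements) -> bytes:
    """Running hash over transcript elements (placeholder serialization via repr)."""
    h = hashlib.sha256(state)
    for e in elements:
        h.update(repr(e).encode())
    return h.digest()

def challenge(state: bytes) -> int:
    return int.from_bytes(state, "big") % curve_order or 1

def verify(A, B, Z, proof):
    """A, B: original vectors; Z: claimed inner pairing product;
    proof: list of (Z_L, Z_R) pairs, one per folding round (n = log2 N)."""
    n = len(proof)
    state = absorb(b"", A, B, Z)                 # commitment to the inputs
    xs = []
    for Z_L, Z_R in proof:                       # n challenges and their inverses
        state = absorb(state, Z_L, Z_R)
        x = challenge(state)
        x_inv = pow(x, -1, curve_order)          # could be supplied externally and constrained
        xs.append((x, x_inv))
        Z = (Z_L ** x) * Z * (Z_R ** x_inv)      # fold the GT element
    def folded(vec, use_inverse):
        # Coefficient of element i is the product of the challenges (or inverses)
        # selected by the bits of i, most significant bit first.
        acc = None
        for i, P in enumerate(vec):
            c = 1
            for j, (x, x_inv) in enumerate(xs):
                if (i >> (n - 1 - j)) & 1:
                    c = c * (x_inv if use_inverse else x) % curve_order
            acc = multiply(P, c) if acc is None else add(acc, multiply(P, c))
        return acc
    return pairing(folded(B, True), folded(A, False)) == Z   # final pairing check
```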
## Proposed Architecture
For Ethereum, over an epoch, assuming 1M validators (~900k currently, padded in practice for forward applicability):
One epoch consists of 32 slots; there are 64 committees per slot; each committee has 512 participants (this is dynamic with validator set growth and participation); and there is one aggregate signature per committee. Therefore:
$N_e=2^{20}$ for an epoch (roughly: strictly it is $2^{20}$ public key factors plus $c$ aggregate signature factors $\langle g^{-1},\sigma\rangle$)
$N_s=2^{15}$ for a slot
$s=32$ slots per epoch
$c=2,048$ committees per epoch
$\mathbf{pk} \in \mathbb{G}_1^{N_e}$
$\mathbf{H(m)} \in \mathbb{G}_2^c$
$\mathbf{\sigma} \in \mathbb{G}_2^c$
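For reference, the epoch and slot sizes follow directly from these parameters:
$$N_e = 32 \cdot 64 \cdot 512 = 2^{5} \cdot 2^{6} \cdot 2^{9} = 2^{20}, \qquad N_s = 64 \cdot 512 = 2^{15}.$$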
In order to parallelize SIPP, each slot runs its own SIPP instance, with the observation that the resulting $A_i, B_i, Z_i$ for slot $i$ can be accumulated into the slot $i+1$ SIPP proof. The accumulator can also assume the correctness of the $A_i, B_i, Z_i$ for slot $i$, and defer that verification to a secondary set of proofs.
While this enables a continuously running and continuable batch signature proof, it significantly increases the total verifier work for a proof over multiple instances. For $N$ total pairings over $i$ steps, verifier work changes from $O(\log N)$ to $O(i \log{N \over{i}})$ or concretely, in the case of Ethereum from 20 MSMs in each group to 480 ($32 * 15$) MSMs in each group. This 24x increase can be acceptable overhead if the parallelization benefit pays for the increased cost.
The downside of a monolithic SIPP proof is the need to wait until epoch completion (or even finality) to begin making the proof, meaning latency is serial in the SIPP proof generation and the secondary MSM verifier proof (not a data parallel computation). In practice, on a 16-core CPU (the compute baseline assumed throughout the post) this is a ~1 minute SIPP proving step, with a secondary verifier still required to compress the verification into a succinct proof.
MSMs in circuit are expensive, as you need to configure the circuit for the largest possible scalars (the security parameter, in the case of randomness). The only known in-code examples for BLS12-381 are [circom-pairing](https://github.com/yi-sun/circom-pairing), which is assumed to be inefficient but presents the lowest engineering cost for implementation. A $\mathbb{G}_1$ exponentiation is 2.0M (non-linear) constraints, a $\mathbb{G}_2$ exponentiation is 4.1M constraints, and a $\mathbb{G}_T$ exponentiation is 8M constraints (again assuming the scalar is in $[0,2^{250}]$).
It then follows that in the monolithic SIPP case, the MSM verifier circuit requires ~575M constraints, which is out of reach for a monolithic proof. Amortizing that cost using a folding scheme over iterative calculations, and achieving data parallelism, could reduce how serial the proof is and the overall wall-clock latency.
### Using Binary PCD for MSMs
Assuming a per-slot SIPP implementation to enable data parallel computation of the MSM proofs, each slot would have one SIPP instance $SIPP(A_i, B_i, Z_i, r_i)$ and require a proof of well-formedness of $A_i, B_i, Z_i, r_i$, such that the verifier accepts if $e(A_i,B_i)=Z_i$ $\land$ Verify($\pi_{A_i}$) $\land$ Verify($\pi_{B_i}$) $\land$ Verify($\pi_{Z_i}$) $\land$ Verify($\pi_{r_i}$). These proofs can further be folded across slots, with each slot MSM instance being merged into the following slot instance.
Each slot MSM would be a binary tree with base 8, height 4 at slot 0, 5 at slot >0, with an IVC proof for MSMs over each group (best for parallelism).
Again, for $N$ total pairings over $i$ steps, verifier work is $O(i \log(N/i))$, or concretely 480 ($32 \cdot 15$) MSMs in each group. This constitutes ~6.8B cumulative constraints.
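One plausible breakdown consistent with these figures, assuming one exponentiation per group per folding step at the circom-pairing constraint counts quoted above:

```python
# Back-of-the-envelope check of the ~6.8B figure: 480 folding steps, each with one
# exponentiation in G1 (2.0M constraints), G2 (4.1M), and GT (8.0M).
slots, rounds_per_slot = 32, 15                 # N_s = 2^15 pairings per slot
steps = slots * rounds_per_slot                 # 480 folding steps over the epoch
constraints = steps * (2.0e6 + 4.1e6 + 8.0e6)
print(steps, f"{constraints / 1e9:.1f}B")       # 480 6.8B
```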
Proving latency of such a system is dominated by the size of the step circuit, with the optimum achieved with a larger step size (# constraints) and fewer steps. When also considering the single-threaded witness generation process, which is linear in the number of constraints, the optimum looks to be around 2M constraints (although that is from a small sample size). With exponentiations in $\mathbb{G}_T$ being the critical path at 8M constraints per step, witgen would take ~100s, with the folding steps taking ~30s (per slot).
The benefit of this architecture is that the final latency after epoch close would be the time it takes to prove a single slot, ~130s, treating SIPP itself as instantaneous (our tests showed a slot at ~2.5s and an epoch at ~80s). Achieving this optimal latency in practice would require a large CPU (>32 cores) or several CPUs operating in parallel, as each following slot would begin before the proving of the previous one finished. This would incur non-negligible data overhead.
Ignoring the final pairing check, the MSM proofs will need further compression to be usable onchain. This will result in additional compression proofs at additional latency cost.
This also does not include the cost and complexity of tracking the randomness which feeds the MSMs, nor of validating the transcripts. I assume both of these would be per-slot proofs, but at lower unit cost than the $\mathbb{G}_T$ exponentiations. The randomness proof could probably be done in parallel with the MSM proofs, while constraining the equality of its output to the MSM proof input somewhere.
So in theory, the final verifier would be a single pairing, plus verifying 3 IVC proofs with step sizes of 2M, 4M, and 8M constraints respectively, verifying a transcript, and verifying a randomness proof. This also doesn't include a set membership check for ensuring the $A$ vector is composed of active validators only, a majority check, or a multiplicity check.
## Paths for optimization
Only looking at the big rocks, where can we optimize? The dominating latency cost comes from the witgen cost, which is linear in the number of constraints. The MSM circuits could likely be optimized, but that represents a large engineering undertaking. Even assuming a 4x improvement (a gnark implementation, for example), the step would still be 2M constraints with ~30+30s total pipeline cost. Reducing security by reducing the size of the scalar for the MSM would also greatly reduce the overall cost (an MSM with a $2^{10}$ scalar is ~160k constraints in $\mathbb{G}_T$). The SIPP verifier logic also requires some complex additional circuits and invariant constraints in the verifier that have an opaque cost (it could be explosive, e.g. hashing over ~$2 \cdot 2^{20}$ group elements).
All in all, the overall cost is borderline not practical/performant enough and warrants additional research into better alternatives.
## Can we do better
In what world is logarithmic work less efficient than linear work? When the logarithmic work is over a degree-12 extension field and involves MSMs with random scalars.
To review: in SIPP, we verify a batch of signatures as an IPA over $\mathbb{G}_1$ and $\mathbb{G}_2$ group elements. This requires random linear combinations of elements (as exponentiations) to reduce verifier work to logarithmic in the size of the vector. But this random linear combination creates non-vanishing $\mathbb{G}_T$ factors as a result.
What would happen if we took a step back and verified a batch of signatures as a single pairing check, just as with SIPP, but the $A$ and $B$ points are just the aggregate (via point addition) $\mathbb{G}_1$ and $\mathbb{G}_2$ points over the entire signing set? This:
$$
\begin{bmatrix}
pk_1 \\
pk_2 \\
pk_3 \\
\vdots \\
pk_n \\
g_{1}^{-1}
\end{bmatrix} \cdot \begin{bmatrix}
H_0(m_1) \\
H_0(m_2) \\
H_0(m_3) \\
\vdots \\
H_0(m_n) \\ \sigma_{agg} \end{bmatrix} =1
$$
Reduces to this:
$$e(A,B)=1$$
where, for $A \in \mathbb{G}_1$ and $B \in \mathbb{G}_2$:
$$A = \sum_{j=1}^{c}\Big(g^{-1} + \sum_{i=1}^{v}\mathbf{pk}_i\Big)_j$$
$$B = \sum_{j=1}^{c}\big(H(m_j) + \sigma_j\big)$$
Here the sums are simply group element point addition in each source group, $\mathbf{pk}_i$ is the $i$-th public key of the given committee, $g^{-1}$ is the inverse of the $\mathbb{G}_1$ generator, $H(m_j)$ is the hash of the message signed by committee $j$, $\sigma_j$ is the aggregate signature of committee $j$, $v$ is the number of validators in a committee, and $c$ is the number of committees in an epoch (always 2048 for mainnet Ethereum).
The verifier would then need to check that $A$ and $B$ are the valid point adds over the vectors above, which is linear work but a completely data parallel, commutative calculation that requires no randomness, no MSMs, and no work in $\mathbb{G}_T$. We get the same ability to batch verify signatures over heterogeneous messages and accumulate and continue proofs.
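A minimal sketch of this accumulation, using py_ecc (slow reference arithmetic); the committee inputs, hash-to-curve, and participation handling are placeholders and out of scope here, and the snippet only transcribes the aggregation and the single pairing check described above:

```python
# Data-parallel accumulation of A and B by point addition, per the formulas above.
from py_ecc.bls12_381 import G1, add, neg, pairing, Z1, Z2
from py_ecc.fields import bls12_381_FQ12 as FQ12

def accumulate_epoch(committees):
    """committees: iterable of (pks, h_m, sigma), where pks is the list of participating
    G1 public keys, h_m is the G2 hash of the committee message, and sigma is the
    committee's aggregate G2 signature."""
    A, B = Z1, Z2                        # identity elements of G1 and G2
    for pks, h_m, sigma in committees:   # fully data parallel in practice
        a_j = neg(G1)                    # one g^{-1} factor per committee
        for pk in pks:
            a_j = add(a_j, pk)
        A = add(A, a_j)
        B = add(B, add(h_m, sigma))
    return A, B

def epoch_check(A, B):
    """The single pairing check e(A, B) = 1 described above."""
    return pairing(B, A) == FQ12.one()
```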
What is the cost profile of this architecture?
I go into more detail in [this](https://hackmd.io/3JcQnT6_RjW3xjfCHGd6Yg) post but I will summarize here.
A binary PCD implementation of point adds follows in much the same manner, where each slot operates in parallel: one point add instance over $\mathbb{G}_1$ elements yields a proof for $A_i$ for each slot $i$, and one point add instance over $\mathbb{G}_2$ elements yields a proof for $B_i$. Each slot instance is then merged with the following slot's top level instance to yield an epoch proof for each group element.
Over the epoch, this corresponds to $2^{20}+1$ $\mathbb{G}_1$ point adds (adding all of the pks plus one additional inverse generator factor, $g^{-c}$), and $2^{12}$ $\mathbb{G}_2$ point adds (all of the $H(m_c)$ and $\sigma_c$).
Point additions in circuit are much cheaper at 4205 and 8579 constraints in the respective groups, using circom-pairing, for a total of 4.2B constraints for the $\mathbb{G}_1$ point add proof and 35.1M constraints for the $\mathbb{G}_2$ point add proof (4.2B total). This makes the $\mathbb{G}_1$ proof the bottleneck of performance.
Because of the small step size, we are free to optimize the size of the step (the number of point adds per step) for efficiency and hardware. In practice, with single-threaded witness generation, the optimum is configuring for 512 point adds per step at 2.1M constraints, aligning to one step per committee and allowing witness data to be generated in parallel across 64 committees on a 64-core CPU. This results in a slot proof binary tree of base 32 and height 5 for slot 0, and 6 for slot >0.
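The step size and step count per slot follow directly from the committee structure and the per-addition constraint counts above:
$$512 \cdot 4205 \approx 2.15\text{M constraints per step}, \qquad 2^{15} / 512 = 64 \text{ steps (one per committee) per slot}.$$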
Our benchmarks show that this architecture requires ~30s witgen (assuming parallelized) and ~30s proving time (on 16 cores). Note that with 64 cores (in the witgen assumption), the proving time would likely see a factor of 2x+ reduction.
We haven't spent much time optimizing the $\mathbb{G}_2$ point add proof, as each committee step is only 8.5k constraints. We could batch those into slot-sized steps of 544k constraints and operate a binary tree over the epoch of slots as IVC. This would not pose any bottlenecks.
As before, the final epoch latency for all proofs to complete would be the time it takes to prove a single slot's critical path, ~60s.
As with the previously proposed construction, the final proofs would need some compression to be used onchain. So in theory, the final verifier would be a single pairing, plus verifying 2 IVC proofs with step sizes of 2.1M and 544k constraints respectively. This also doesn't include a set membership check for ensuring the $A$ vector is composed of active validators only, a majority check, or a multiplicity check.
### Paths for optimization
This architecture has a much simpler cost profile and circuit surface, with a single pairing and two IVC point add proofs. There are also interesting and clearer paths for circuit optimization.
Increasing the number of point adds per circuit (more constraints per step, fewer steps) increases the witgen time without meaningfully impacting the prover time (1024 point adds per circuit doubles the witgen time while reducing the prover time by only 10%). Decreasing the number of point adds decreases the witgen time linearly but increases the prover time, while also requiring more cores to fully parallelize.
One possible optimization is changing from proving a point addition directly to verifying the correctness of a point addition. This is (in theory - haven't implemented it yet) much cheaper than calculating the point addition in circuit, and since the calculation is fully data parallel, all steps can be computed outside of the circuit. To verify a point add $A + B = C$, you verify that all three points lie on the curve and that $A$, $B$, and $-C$ lie on the same line ($C$ is the reflection of the third intersection of the line through $A$ and $B$ with the curve). The collinearity check is just a dot product where one point is rotated 90°: $(y_2-y_1)(x_3-x_1) - (x_2-x_1)(-y_3-y_1) = 0$. This could result in a large reduction over $2^{20}$ steps, keeping in mind that every single constraint removed per addition saves ~1M constraints from the total $\mathbb{G}_1$ cost.
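A small sketch of this check, assuming py_ecc and affine coordinates; the doubling and point-at-infinity cases are ignored here and would need separate handling in a real circuit:

```python
# Verify A + B = C by checking all three points are on the curve and that A, B, -C
# are collinear, instead of computing the addition itself.
from secrets import randbelow
from py_ecc.bls12_381 import G1, add, multiply, curve_order

def on_curve_g1(pt):
    x, y = pt
    return y * y == x * x * x + 4                    # BLS12-381 G1: y^2 = x^3 + 4

def check_point_add(A, B, C):
    (x1, y1), (x2, y2), (x3, y3) = A, B, C
    collinear = (y2 - y1) * (x3 - x1) - (x2 - x1) * (-y3 - y1) == 0
    return on_curve_g1(A) and on_curve_g1(B) and on_curve_g1(C) and collinear

A = multiply(G1, randbelow(curve_order - 1) + 1)
B = multiply(G1, randbelow(curve_order - 1) + 1)
assert check_point_add(A, B, add(A, B))
```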
## Comparison
SIPP Batch Signature Proof
- 1 pairing
- 3 IVC proofs for MSMs in each group
- 1 proof for transcript correctness
- 1 proof for fiat-shamir challenge creation
- 6.8B total verifier constraints for MSM ops
- 8M constraints as largest step size
- ~130s slot prover latency for MSMs
- High complexity, large circuits, many paths for optimization, many opaque and unknown costs
Point Add Based Batch Signature Proof
- 1 pairing
- 2 IVC proofs for point adds in each group
- 4.2B total verifier constraints for add ops
- 2M constraints as largest step size
- ~60s slot prover latency
- Low complexity, small circuits, clear paths for optimization, not many hiding spots for surprises