# Efficient Batch Signature Verification Proofs Via Lookups And Deferred Computation
**Pathway toward efficient proof construction for proving finality of Ethereum**
# Introduction
This is a continuation of our article series documenting our research and development on our work towards a scalable and efficient proof of finality for Ethereum. The initial discussion outlining design goals and motivation can be found [here](https://hackmd.io/@wrinqle/rye7Dju92), along with benchmarking results using R1CS BLS signature verification circuits and Nova recursive schemes. The result of those learnings led to implementing a batch BLS signature verification proof via Inner Pairing Products ([here](https://hackmd.io/@wrinqle/rywNJSkeT)) which has led to this system design to achieve optimal parallelization and lower latency.
If you want to understand more of the problem statement, motivation and earlier work, please reference the previous articles linked above.
The core design goal is to create an efficient and highly parallelizable proving system to generate a succinctly verifiable proof of finality for Ethereum, for use in interoperability as a blockheader relay that requires no additional trust assumptions aside from the security guarantees given by the Ethereum consensus mechanism. Designing such a proving system has two major difficulties: the size and complexity of the statement to be proven (2,048 aggregate BLS signatures over BLS12-381, with ~1M signers per epoch) and the speed at which consensus progresses (one epoch every ~6 minutes). A naive arithmetization of consensus verification logic would exceed 50 billion constraints. In practice, this would mean producing a proof over an order of magnitude more constraints than a zk-rollup's circuit, in an order of magnitude less time than it takes a zk-rollup to produce that proof (speaking roughly). As this is not practical given our design goals, creativity must be employed to exploit the algebraic properties of the signature scheme, the recursive structure of consensus, and data parallelism to achieve a practical proving system. Additionally, in production as a blockheader relay, the overhead and cost of proof generation directly impacts the user's cost, making hardware efficiency of the proving system a much more concrete goal.
The resulting proving system decomposes the verification logic of majority consensus into basic operations, constrained in subcircuits as preprocessing proofs, the correctness of which is proven via lookups that commit to the transcripts the preprocessing proofs operate over. This allows for the batch verification of an entire signing set as a single pairing, deferring all other proving logic to preprocessing proofs and enabling the parallel processing of all separate proof logic in purpose-built circuits using recursive schemes. As all data is public and available as epochs progress, the inputs and expected outputs of each proof step can be calculated outside of the circuits, allowing for full parallelization of recursive circuits, processing slots in parallel and as they occur.
This architecture has a 5x reduction in constraints compared to the naive construction (~9B) in a fully data parallelized and MSM based system, in a modular and composable architecture that enables the optimization of each proof separately. Benchmarks show this proof construction is capable of generating all preprocessing proofs in < 1 minute post epoch completion using a 64 core CPU with no GPU acceleration (which could reduce runtime by an additional factor of 4-10x). With this, along with circuit optimizations and implementation of curve cycle optimizations, fully onchain usable finality proofs are achievable in less than 1 minute post finality.
This proving system achieves the design goals in low-latency with clear paths for step-function optimizations and, as such, will be the architecture used in our testnet release. This document will therefore be a living document, updated with further detail as development solidifies around unbuilt aspects of the proving system.
## Previous Work and Key Learnings
The general takeaway from previous iterations is that monolithic, naive arithmetizations of signature verification (or even of non-native group and field arithmetic) in a generalized circuit will never be performant enough when extending that logic over a consensus mechanism. Taking advantage of the recursive structure of consensus allows for parallelization over smaller steps (22.5M constraints per committee, with 2,048 committees per epoch) while employing some method to parallelize proving and accumulation of proofs. This can be done with GKR, as done by Polyhedra [[ref](https://arxiv.org/pdf/2210.00264.pdf)], or by folding schemes [[ref to previous work](https://hackmd.io/@wrinqle/rye7Dju92)]. In either case, prover work is still proportional to the total number of constraints of the overall statement (~50B), meaning latency is reduced but overhead and cost remain high.
In order to reduce the size of the overall statement, we looked to batch the cryptographic operations required to verify the set of BLS signatures into specialized subcircuits that are tailored to each specific operation.
This led to employing Inner Pairing Products as a batch signature verification preprocessing proof to reduce the total number of pairings from 4,096 down to 1. Verifying this proof in circuit involves doing many non-native group and field operations, over each of the source groups as well as the target group, resulting in ~575M non-linear constraints.
To optimize this verifier circuit, we looked to move as much non-native group and field work as possible outside of the general purpose compression SNARK, offloading it to additional preprocessing SNARKs for each of the operations required. As these operations are highly iterative and parallelizable, we then construct these proofs as recursive SNARKs, each step proving the validity of some sequential group operation in the overall statement. This architecture results in a set of proofs that are orders of magnitude cheaper than the corresponding monolithic naive circuit encoding.
Even with the efficiency gained with pairing reduction and Inner Pairing Product verification operations in preprocessing proofs, we still had the issue of set membership and linking the committee level participation sets to the elements in the batch signature proof. This would require homomorphic commitments to the signing sets (2,048 of them per epoch) or point addition inside the set membership proofs with the verifier also having to check equivalence between the vectors in the batch proof and the sum of all commitments. This large overhead would be in addition to the aforementioned costs.
Additionally, none of the required Inner Pairing Product verifier group operations can be calculated until the epoch is complete, making this large proof to verify and link invariants, however efficient, our biggest barrier to low-latency proof generation.
Therefore, in order to achieve the lowest possible latency with the least overhead, our proving system must be incremental and data parallel, while minimizing the number of expensive operations for signature checks (requiring only a single pairing if possible) and offloading as much non-native arithmetic work as possible to preprocessing proofs over simple group operations that can be highly parallelized and provably linked to the active validator set.
### Data Parallelization
As the entire set of data these proofs operate over is public and available incrementally as an epoch progresses, we can take advantage of data parallelization to externally compute the inputs and expected outputs of each proof step and separately prove the correctness of each execution in a distributed manner, across possibly different CPUs, to amortize the proving costs and achieve the desired performance and latency goals.
### Recursive Proof Composition Via Curve Cycles
Although the paradigm of parallel generation of preprocessing SNARKs enables high customizability of composable subcircuits and use-case optimized preprocessing SNARKs, it comes at the additional cost of requiring a proof wrapper to turn individual proofs of individual statements into an aggregate proof of a combined statement. For our intended onchain use, this compression wrapper proof needs to be over the BN254 curve, which makes the curve choice and verification logic of all preprocessing proofs a determining factor for overall cost and latency.
Proving systems are defined over elliptic curves that have a scalar field $\mathbb{F}_q$ of order $q$, with group elements belonging to a larger base field $\mathbb{F}_p$ of prime order $p$. This allows us to prove statements about $\mathbb{F}_q$-arithmetic circuit satisfiability. Verifying these proofs requires performing arithmetic operations in $\mathbb{F}_p$ which, even if verifying the proof in a proving system over the same curve, becomes non-native field arithmetic over $\mathbb{F}_p$. This can lead to an order of magnitude or more constraints in overhead compared to native field arithmetic.
The Grumpkin curve has a group scalar field order equal to the base field prime order of the BN254 curve, and the Grumpkin base field prime order is equal to the group scalar field order of BN254. This allows us to efficiently verify proofs in Grumpkin in circuits defined over BN254 and vice versa, since the operations over the base field of one curve are done in native arithmetic in the scalar field of the other.
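For concreteness, the cycle relation can be checked directly from the curves' published field moduli (a minimal sketch using only the standard, publicly documented BN254/Grumpkin parameters):

```python
# Minimal illustration of the BN254/Grumpkin 2-cycle using the standard,
# publicly documented field moduli (reproduced here for illustration only).

# BN254: base field modulus p and scalar field (group order) r
BN254_BASE_FIELD = 21888242871839275222246405745257275088696311157297823662689037894645226208583
BN254_SCALAR_FIELD = 21888242871839275222246405745257275088548364400416034343698204186575808495617

# Grumpkin is constructed so that the two curves swap fields
GRUMPKIN_BASE_FIELD = 21888242871839275222246405745257275088548364400416034343698204186575808495617
GRUMPKIN_SCALAR_FIELD = 21888242871839275222246405745257275088696311157297823662689037894645226208583

# The cycle property: Grumpkin group elements have coordinates in BN254's
# scalar field, so a circuit defined over BN254 manipulates them with native
# arithmetic, and vice versa.
assert GRUMPKIN_BASE_FIELD == BN254_SCALAR_FIELD
assert GRUMPKIN_SCALAR_FIELD == BN254_BASE_FIELD
```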
Halo2 utilizes the same curve cycle paradigm to enable efficient recursion, which is explained in clear detail [here](https://zcash.github.io/halo2/background/recursion.html).
As Grumpkin is not pairing based, we are limited in how we can prove statements over Grumpkin; its optimal use within folding implementations is still an area of active research.
This curve cycle will be used for all of our subcircuits, with the primary SNARKs defined over the Grumpkin curve to make verification in BN254 much more efficient, and with folding run over the cycle on Grumpkin with a BN254 decider.
_______
If this general proving system paradigm sounds familiar, it is because our realizations have converged in a very similar way with the overall design goals of the Aztec proving system, Goblin Plonk [[ref](https://hackmd.io/@aztec-network/B19AA8812)]. Being that we are not concerned with zero knowledge, privacy, large instruction sets or resource constrained provers, our proving system design can get away with being far simpler.
Unless otherwise specified, all times referenced in this document were performed on a 32 core AMD CPU with no GPU acceleration.
# High Level Overview
At a high level, the finality proof verifies an epoch of validator attestations as a batch signature check in a single pairing. The well-formation of the aggregate group elements in the pairing check is proven in two recursive point addition proofs: one over $\mathbb{G}_1$, and one over $\mathbb{G}_2$ which also hashes each individual committee attestation message to the corresponding $\mathbb{G}_2$ point. The correctness of the point addition transcripts, ensuring that all validator keys exist in the beacon chain's active validator list, is proven via a lookup proof over the current beacon chain state. Finally, the validator indices used for the lookup feed a simple final majority and multiplicity check to ensure that majority was reached.

# Pairing Reduction By Batching Group Operations
SIPP demonstrates that checking the overall statement for BLS signature verification over a batch of multiple distinct messages can be reduced to a single pairing check, with the constraint that the given points used in the pairing check were calculated correctly starting from the initial vectors of group elements representing public keys and messages.
This demonstrated a scalable relation across heterogeneous messages and signing sets that could be expanded from a committee, to a slot, and across an epoch, or perhaps even across heterogeneous consensus mechanisms (if they all use BLS signatures over the 12-381 curve).

This relationship can then be reduced by "summing" all of the $\mathbb{G}_1$ points and all of the $\mathbb{G}_2$ points; if the sums are well-formed, the pairing constraint still holds. Unfortunately, while proving this relationship via Inner Product Arguments is very prover efficient, the verifier checks this proof via random linear combinations of the group elements. The verifier is therefore required to do logarithmic work "replaying" the calculations of random linear combinations of group elements (exponentiations) in both of the source groups as well as the target group.
To verify this proof in circuit, this work needs to be done somewhere along the way, even if we use additional preprocessing proofs. As an upper bound, with $\mathbb{G}_T$ exponentiations costing ~8M constraints, $\mathbb{G}_2$ exponentiations costing ~4M constraints, and $\mathbb{G}_1$ exponentiations costing nearly ~2M constraints, even if we could fully parallelize these proofs, executing them over an entire validator set and epoch would make our latency goals difficult to achieve.
Instead of relying on random linear combinations to prove well-formation, we can instead perform point addition over the entire signing set directly. Point additions are much less expensive than exponentiations when done in-circuit (~350-500x less expensive, at ~20k, ~9k, and ~4k constraints for $\mathbb{G}_T$, $\mathbb{G}_2$, and $\mathbb{G}_1$ group operations respectively), but they require a number of group operations linear in the number of participants (~1M $\mathbb{G}_1$ point adds, ~4,096 $\mathbb{G}_2$ point adds). While this does increase the total number of constraints of our statement by an order of magnitude, this construction doesn't require any $\mathbb{G}_T$ operations or any exponentiations and is fully data parallel.
Using this, we can simplify the batch signature check to a single constrained pairing:
$$e(A,B) = 1$$
where, for $A \in \mathbb{G}_1$ and $B \in \mathbb{G}_2$:
$$A = \sum_{j=1}^{c}(g^{-1}\sum_{i=1}^v\mathbf{pk}_i)_j $$
$$B = \sum_{j=1}^{c}\mathbf{H(m_j)} \mathbf{\sigma}_j $$
Where the sum is simply group element point addition in each source group, $\mathbf{pk}_i$ is the vector of public keys for a given committee, $g^{-1}$ is the inverse of the $\mathbb{G}_1$ generator, $\mathbf{H(m_j)}$ is the vector of hashes of messages per committee, $\mathbf{\sigma}_j$ is the vector of committee aggregate signatures, $v$ is the number of validators in a committee and $c$ is the number of committees in an epoch (always 2048 for mainnet Ethereum).
Instead of checking the correctness of the $A$ and $B$ group elements, this proof assumes their correctness, deferring that computation to the point addition proofs, allowing us to generate this pairing check as a standalone proof in parallel.
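As a sketch of the deferred computation, the aggregates could be assembled outside the circuit roughly as follows, reading the $\mathbb{G}_2$ product above as the group operation; all group-operation names are hypothetical placeholders for a BLS12-381 backend rather than a specific library:

```python
# Sketch only: how the deferred aggregates A and B are computed outside the
# circuit. All group-operation names are hypothetical placeholders for a
# BLS12-381 backend; the pairing proof itself only constrains e(A, B) == 1.

def aggregate_A(committees, g1_add, g1_neg, G1_GENERATOR):
    """A = sum over committees of (g^{-1} + sum of participating public keys)."""
    A = None
    for committee in committees:
        acc = g1_neg(G1_GENERATOR)  # per-committee inverse-generator balancing term
        for pk in committee["participating_pubkeys"]:
            acc = g1_add(acc, pk)
        A = acc if A is None else g1_add(A, acc)
    return A

def aggregate_B(committees, g2_add, hash_to_g2):
    """B = sum over committees of H(m_j) combined with sigma_j."""
    B = None
    for committee in committees:
        term = g2_add(hash_to_g2(committee["message"]), committee["aggregate_signature"])
        B = term if B is None else g2_add(B, term)
    return B

# Final check, proven as a standalone pairing proof with A and B as inputs:
#   assert pairing(A, B) == GT_IDENTITY
```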
# Recursive Point Addition Proofs
## Over $\mathbb{G}_1$
Generating an aggregate $\mathbb{G}_1$ point with an accompanying proof of correctness requires $v + s$ total point additions in the $\mathbb{G}_1$ group (one per participating validator public key, a single constant inverse-generator balancing factor, and one merge per slot after the first to fold each slot's instances together):
$$A = \sum_{j=1}^{c}(g^{-1}\sum_{i=1}^v\mathbf{pk}_i)_j $$
Where $s$ is the number of slots per epoch (32), $v$ is the number of public keys in the participation set, $\mathbf{pk}_i$ is the vector of participating validators in the epoch (~1,000,000 for simplicity, ~880,000 at the time of writing), and there is one $g^{-1}$ entry per committee, with $c$ committees per epoch (always 2,048). This can be simplified further by adding over the set of all $\mathbf{pk}_i$ and then including a single constant $g^{-2048}$ factor.
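Spelled out, this simplification collapses the 2,048 per-committee generator terms into one constant:
$$A = \sum_{j=1}^{c}\left(g^{-1}+\sum_{i=1}^{v}\mathbf{pk}_i\right)_j = g^{-2048} + \sum_{\text{all participating }\mathbf{pk}}\mathbf{pk}$$
so the bulk of the epoch's work is one long, commutative chain of public key additions plus a single constant term.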
Each slot has 64 committees, with each committee containing up to 512 participants (~430 at the time of writing), which means ~32,000 point additions per slot. With slots occurring every ~12 seconds, each slot can be processed independently and in parallel as soon as its signatures are available, and the results can be folded together as the epoch progresses.
As this is the same operation done many times sequentially with the output of step $i-1$ passed as input to step $i$ with all values known, we can precompute the inputs and outputs outside of the circuit and have freedom for optimizing the computation of these proofs. Because of the commutative nature of elliptic curve point addition, this can be done in any order. In general, in data parallel computations, executing the computation in a binary tree results in the most efficient algorithm with lowest real time execution. This binary tree construction can be implemented in Nova (with some additional overhead).
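A minimal sketch of this out-of-circuit schedule, with `point_add` standing in for the $\mathbb{G}_1$ addition the step circuit constrains (a hypothetical helper, not a specific library API):

```python
# Sketch: precompute every step's (inputs, output) for a slot as a binary-tree
# reduction, so all step witnesses are available before proving starts.
# `point_add` is a hypothetical placeholder for G1 point addition.

def tree_reduce(points, point_add):
    """Return (aggregate, transcript); transcript lists each addition level by level."""
    transcript = []
    level = list(points)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):  # pair adjacent elements
            out = point_add(level[i], level[i + 1])
            transcript.append((level[i], level[i + 1], out))
            next_level.append(out)
        if len(level) % 2 == 1:                # odd element passes through unchanged
            next_level.append(level[-1])
        level = next_level
    return level[0], transcript

# Stand-in check with integers in place of group elements:
aggregate, steps = tree_reduce(list(range(1, 17)), lambda a, b: a + b)
assert aggregate == sum(range(1, 17)) and len(steps) == 15
```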
Optimizing the number of point adds per step versus the total number of steps using Nova is related to how slots are structured and the hardware the proofs will be generated on. Being that witness generation is a single-threaded operation, it is most efficient to maximize the number of instances running across all available cores. However, having more instances than cores reduces overall efficiency and increases memory contention. So while executing this proving system on a machine with more than 32 cores is possible, clock speed per core begins to decrease, reducing overall speed. And while witness generation cannot be accelerated in a meaningful way, folding, being MSM based, can be heavily accelerated using GPUs.
An additional barrier we face is the inability to dynamically size the number of point additions done per step. This is a core limitation to circuit logic and requires circuits to be maximally configured to support longest possible execution. While SuperNova does introduce folding across non-uniform circuits, where we could select the circuit size at runtime, it is unclear if the added complexity and overhead would outweigh the simplicity of padding the undersized inputs to a maximally configured step instead. This will be explored in later optimizations.
Our sweet spot is 512 point additions per step with 64 steps per slot (one per committee). This allows the circuit to handle maximum participation in a committee while still being fast enough for our purposes.
Folding in a binary tree construction has a base of 32 running instances (each folding two instances together) and can be completed in less than 15 seconds per slot.

As each slot finishes, an additional Nova tree instance is generated, which once completed, folds the previous running instance into the new running instance. This two dimensional folding tree is carried out slot by slot until the end of the epoch, with the final output being an aggregate $\mathbb{G}_1$ point representing the aggregate public keys of the signing set (plus a constant $g^{-2048}$ term).

The resulting proof is over ~1M point adds for ~4.7B total constraints. Although the overall statement is very large, the data parallel nature of the computation paired with parallelizing folding reduces the final latency between epoch end and proof completion to the time to prove a single slot, which is ~15 seconds.
## Over $\mathbb{G}_2$
Similarly, generating an aggregate $\mathbb{G}_2$ point with accompanying proof of correctness requires $2c+s-1$ (2 additions per committee with one merge per slot excluding the first slot) point additions in $\mathbb{G}_2$:
$$B = \sum_{j=1}^{c}\mathbf{H(m_j)} \mathbf{\sigma}_j $$
Where $\mathbf{H(m_j)}$ is the vector of hashes of messages per committee, $\mathbf{\sigma}_j$ is the vector of committee aggregate signatures, and $c$ is the number of committees in an epoch. Each step of this circuit also needs to check that the point $H(m_j)$ is well-formed by taking as input the message $m_j$ and hashing it to the corresponding $\mathbb{G}_2$ group element.
The same optimization structure follows here, where the optimal trade-off between step size and number of steps is one step per committee: 2 point adds plus one hash-to-group operation per step.
Folding in a binary tree construction follows similarly (with the same figures as above): a base of 32 running instances, completed in ~15 seconds per slot, and run in parallel with the point addition proof in $\mathbb{G}_1$.
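As with the $\mathbb{G}_1$ chain, each step's witness is computed outside the circuit; a minimal sketch of one committee step, with `hash_to_g2` and `g2_add` as hypothetical placeholders for the BLS12-381 operations the step circuit constrains:

```python
# Sketch: one G2 step per committee, computed outside the circuit to provide
# the step's witness. `hash_to_g2` and `g2_add` are hypothetical placeholders.

def g2_step(running_B, message, aggregate_signature, hash_to_g2, g2_add):
    h = hash_to_g2(message)                                    # one hash-to-group per step
    return g2_add(g2_add(running_B, h), aggregate_signature)  # two point additions per step
```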
As each slot finishes, an additional Nova instance is generated, folding the previous running instance into the new instance. This is carried out in a similar way across slots to achieve a final epoch aggregate $\mathbb{G}_2$ point representing the aggregate signatures and messages for the epoch.
Hashing to group is a relatively expensive operation to implement in circuit, with many opportunities to optimize further. Our current $\mathbb{G}_2$ hash-and-add circuit is over 2M constraints (of similar size to the $\mathbb{G}_1$ point add step circuit), making the overall statement size ~4.3B constraints. The resulting proof has similar performance to the $\mathbb{G}_1$ proof, with latency determined by the slot proving time at just under 15 seconds.
## Verifying Nova Proofs In Circuit
Utilizing Nova as a precomputation proof enables efficient and highly parallelizable proofs. Verifying a Nova proof requires checking circuit satisfiability of the folded instance over the primary curve, verifying the proof of folding over the second curve, and checking the hash commitments of instances are correct. This makes Nova proofs not particularly efficient to directly verify in circuit as, no matter the curve cycle choice, a non-negligible amount of work needs to be done over the second curve using non-native arithmetic.
To make this more efficient, we make use of an additional compression layer to simplify in-circuit verification, such that the circuit satisfiability and proof of folding are checked using a polynomial commitment scheme. This then reduces the verification of each Nova proof to two openings and instance hash verification. Currently, this is implemented using Spartan, a public-coin, prover efficient sum-check based protocol using a multivariate polynomial commitment scheme, Spark.
Sum-check based schemes offer optimal prover overhead without requiring a trusted setup with the tradeoff being higher verifier overhead in opening a multivariate polynomial commitment. Zeromorph [[ref](https://eprint.iacr.org/2023/917)] can be used as a tool to transform, via linear isomorphism, a multivariate polynomial commitment to a univariate polynomial commitment, which is far more verifier friendly and can be aggregated with other KZG commitments for batch openings. In practice, applying this to our Nova compression proof output yields optimal prover overhead with best case verifier overhead in a univariate KZG commitment. More work can be done here to further reduce verifier overhead via [curve transposition](https://notes.ethereum.org/@dankrad/kzg_commitments_in_proofs), but it is unclear if this is needed complexity.
CycleFold [[ref](https://eprint.iacr.org/2023/1192)] improves the Nova verifier overhead by instantiating the folding scheme verifier in the primary curve, essentially instantiating the curve cycle inside the folding scheme, moving almost all of the work on the secondary curve to the primary curve. This reduces the verification complexity in circuit by an order of magnitude, requiring only a single scalar multiplication in the non-native field. This does add implementation complexity by requiring the inclusion of a folding verifier circuit on the first curve along with the relaxed step circuit, requiring the implementation of non-uniform multi-folding. This will be part of future optimization explorations.
## Attestation And Slot Mechanics
In Ethereum consensus, actual slot progression and the number of attestations per slot can vary, with attestations not necessarily pointing to the same source/target checkpoints. In order to capture all consensus data, each slot allows up to 128 attestations (including late or minority forks), which enables the beacon chain to catch up in situations where slots were skipped. For simplicity, we discuss only the happy path, but the non-uniform path follows the same overall logic with added dynamic structuring. Being that we care only about majority consensus toward a given checkpoint to assess finality, we only include attestations that contribute to vote weight for a given checkpoint. This may result in some slot proofs containing fewer than 64 attestations, reducing the number of steps in that slot proof. In the case of a skipped slot, where the following slot has more than 64 attestations for a given checkpoint, a secondary Nova instance is generated to capture the overflow from the skipped slot.
# Set Membership Proof Via Lookups
In order to ensure the correctness of the aggregate point proofs, we must also prove that the public keys contained in the $\mathbb{G}_1$ point addition proof are all currently active validators.
In the same way we wish to parallelize the generation of the pairing check with the point addition proofs by assuming correctness of inputs and deferring that computation to a secondary set of proofs, we assume the correctness of the transcripts over which the point addition proofs operate and defer that computation to a separate lookup proof. This lookup proof operates over each committee signing set to check validator status against the beacon chain validator list and commits to a transcript that represents the correct inputs to the point addition proofs. This commitment is then checked in the compression proof for equivalence to the transcript commitment used in the Nova point addition proofs.
Here we use Lasso [[ref](https://eprint.iacr.org/2023/1216)] as a means to commit to the lookup table as a snapshot in time of the active validator list, enabling trustless checking of the root of truth for the proving system, as well as a means to look up a vector of $v$ field elements that comprise the bigint representation of participating public keys (passed as input to the Nova point addition proofs), taking as input the aggregated participation bitvector as a lookup matrix.
The lookup argument proves that for a lookup table of active validator public keys $t \in \mathbb{F}^m$, and a lookup vector of participating public keys $a \in \mathbb{F}^v$, that $a$ is a subset of $t$. This is done by proving knowledge of a sparse bit matrix $M \in \{0,1\}^{v \times m}$ such that $M \cdot t = a$.
In our case, the participation bit vector from each committee can be directly transformed into the sparse lookup matrix, $M$, which, when matrix-vector multiplied against the lookup table $t$, results in the public keys of the validators that participated in signing an attestation.
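As a toy illustration (with small integers standing in for public key limbs), the participation bits select rows of $M$, and $M \cdot t$ recovers exactly the participating keys:

```python
# Sketch: turn a committee participation bitvector into the sparse lookup
# matrix M and check M * t = a (toy integers stand in for key limbs).

def lookup_matrix_from_bits(participation_bits, committee_indices, table_size):
    """Each row selects the table entry of one participating validator."""
    rows = []
    for bit, validator_index in zip(participation_bits, committee_indices):
        if bit:
            row = [0] * table_size
            row[validator_index] = 1
            rows.append(row)
    return rows

# toy example: table of 8 "public keys", committee of 4 validators, 3 signed
t = [101, 102, 103, 104, 105, 106, 107, 108]
committee = [6, 2, 5, 0]          # validator indices assigned to this committee
bits = [1, 1, 0, 1]               # participation bitvector from the attestation
M = lookup_matrix_from_bits(bits, committee, len(t))
a = [sum(M_ij * t_j for M_ij, t_j in zip(row, t)) for row in M]
assert a == [107, 103, 101]       # the participating validators' keys
```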
Lasso shows that proving the set membership of $a$ in $t$, or knowledge of $M$ such that $M \cdot t = a$, is equivalent (with negligible error) to verifying:
$$ \sum_{y \in \{0,1\}^{\log m}} \tilde{M}(r,y) \cdot \tilde{t}(y) = \tilde{a}(r) $$
where $\tilde{M}$, $\tilde{t}$, and $\tilde{a}$ are the multilinear extension polynomials of $M$, $t$, and $a$, evaluated at a verifier-selected random point $r \in \mathbb{F}^{\log v}$. Using the sum-check protocol, this statement can efficiently be reduced to checking polynomial evaluations at a single random point using their previously committed forms.
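As a toy numeric check of this identity (brute-force evaluation over a small prime field; the real system commits to these polynomials and runs sum-check rather than summing directly):

```python
# Verify  sum_{y in {0,1}^{log m}} M~(r, y) * t~(y) == a~(r)  on a tiny instance.
import random

P = 2**61 - 1  # toy prime standing in for the PCS field

def eq(bits, point):
    """Multilinear equality polynomial eq(b, x) = prod(b_i x_i + (1-b_i)(1-x_i))."""
    acc = 1
    for b, x in zip(bits, point):
        acc = acc * ((b * x + (1 - b) * (1 - x)) % P) % P
    return acc

def to_bits(index, n):
    return [(index >> k) & 1 for k in range(n)]

def mle(values, point):
    """Evaluate the multilinear extension of `values` (length 2^len(point)) at `point`."""
    return sum(v * eq(to_bits(i, len(point)), point) for i, v in enumerate(values)) % P

# tiny instance: table t of m = 4 entries, lookup vector a of v = 2 entries
t = [11, 22, 33, 44]
M = [[0, 0, 1, 0],   # a[0] looks up t[2]
     [1, 0, 0, 0]]   # a[1] looks up t[0]
a = [33, 11]         # = M * t
log_m, log_v = 2, 1

# flatten M with column bits as the low-order variables, row bits as the high-order ones
M_flat = [M[i][j] for i in range(2) for j in range(4)]

r = [random.randrange(P) for _ in range(log_v)]   # verifier challenge on the row index

lhs = 0
for y in range(len(t)):                           # brute-force sum over boolean y
    y_bits = to_bits(y, log_m)
    lhs = (lhs + mle(M_flat, y_bits + r) * t[y]) % P
assert lhs == mle(a, r)
```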
Building this for a table of ~1M validator keys, with a lookup of ~900k participants, the proof consists of a homomorphic commitment to $\tilde{M}$, $\tilde{a}$, and $\tilde{t}$, with a sum-check proof for evaluation.
Since the lookup proof requires knowing which specific validators participated in each committee/slot, the proof cannot be generated until this data is fully available. This can either happen once per slot concurrently with the Nova proofs, or once per epoch. Batching multiple lookups into a single Lasso proof is covered in a later section.
## Multiplicity-check Of Participation Bits
In order to openly prove the security of the overall proof such that vote weight is strictly "1 key, 1 vote" and to ensure no attestations are duplicated to represent false majority, we must also prove that the lookup matrix $M$ used by the Lasso proof does not look up repeated entries. Naively, this corresponds to the following additional constraints on $M \in \{0,1\}^{v \times m}$:
$$ \sum_{i=1}^{v} M_{i,j} \leq 1 \quad \forall j \in [m] $$
to ensure that every column of the matrix has at most one non-zero entry, or in other words, that every row of the matrix is unique. Representing these constraints such that they can be efficiently verified using the commitment to $M$ depends on the specific underlying polynomial commitment scheme used by Lasso. Improving on this naive representation will be part of future optimization explorations.
## Majority-check Of Consensus
To constrain that the overall finality proof actually represents the aggregate attestations of a majority of validators toward a given source checkpoint (the definition of finality), we must also prove that the public keys included in the point addition proof represent at least 2/3 of the active validator set.
The verifier knows the dimensionality of $M$, $t$, and $a$ in order to calculate a valid challenge $r$ when verifying the lookup proof; the verifier can therefore also constrain that $v \geq \frac{2}{3} m$, where $m$ is the number of active validators in the epoch.
This check, paired with the multiplicity check, completes the logic for ensuring default safety for the overall finality proof: creating a falsely valid proof would require controlling the majority of validator keys.
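Stated as plain out-of-circuit checks on a concrete $M$ (a minimal sketch; in the proof these become constraints on the commitment to $M$), the two invariants are:

```python
# Sketch: the two safety invariants evaluated directly on the lookup matrix M
# (v rows of participating validators, m columns of active validators).

def check_multiplicity(M):
    """No validator key is looked up twice: every column sums to at most 1."""
    m = len(M[0])
    return all(sum(row[j] for row in M) <= 1 for j in range(m))

def check_majority(v, m):
    """Participation covers at least 2/3 of the active validator set."""
    return 3 * v >= 2 * m

# toy example: 4 distinct participants out of 6 active validators
M = [[0, 1, 0, 0, 0, 0],
     [0, 0, 0, 1, 0, 0],
     [1, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 0]]
assert check_multiplicity(M)
assert check_majority(v=len(M), m=len(M[0]))
```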
## Verifying Lasso Proofs In Circuit
The verifier for Lasso is required to verify a sum-check proof requiring $O(\log N)$ field operations in the base field over which the PCS is defined. Using Spark as our PCS, which only requires discrete log hardness over $\mathbb{G}$, we can generate our lookup proof over Grumpkin, which makes proof verification use native field arithmetic in the BN254 compression proof.
## Checking Transcript Equivalence
The lookup proof contains a commitment to a vector $a$ corresponding to the results of the lookup into the table $t$ that aligns to the exact inputs of the group operations done in the $\mathbb{G}_1$ point addition proof. The compression proof therefore needs to additionally constrain that the commitment to the vector $a$ matches the commitments to the transcript of the point addition Nova proof. This is done by asserting a match between the committed folded instances and the committed lookup vector over the same randomness, for each step in the point addition proof.
## Gaining Efficiency In Parallelization And Heterogeneous Lookups
As we look to reduce latency in the proving system, parallelizing the lookups may be necessary, as mentioned previously. Lasso is purposefully built with splitting tables into subtables in mind. This would have lookup table $t$ be defined as:
$$ t = [t_1,t_2,...t_{2048}]$$
with each $t_c$ being the subtable of validators for committee $c$ as determined by the shuffle algorithm, each of size $N^{1/c}$. It would then follow that there exists a $c$-variate polynomial $g$ such that for every $r = (r_1,...,r_c)$,
$$T[r] = g(T_1[r_1],...,T_c[r_c])$$
which allows us to batch prove lookups across a set of tables by individually calculating their multilinear extensions as that data becomes available. This enables the calculation of each committee lookup and point addition proof in parallel at the moment the slot is accepted. This added complexity doesn't come for free and adds an extra evaluation (field operation) query per table, so the added benefit in data parallelization may not offset the added verifier overhead in practice.
However, this concept also extends into combining heterogeneous lookups across different tables, to have a single lookup proof for all deferred elliptic curve operation proofs. This concept will be important for extending this proving system across additional consensus mechanisms.
# Proof Composition And Final BN254 Verification And Compression Proof
The final proof step composes all of the individual proof statements into a single BN254 compression proof that verifies (a sketch of this composition follows the list):
- The single pairing of A and B
- Constraining that the pairing is 1
- Verify the $\mathbb{G}_1$ point addition proof
- Verify that its output matches A
- Verify the $\mathbb{G}_2$ point addition proof
- Verify that its output matches B
- Verify the set membership lookup proof
- Verify the multiplicity constraint
- Verify the majority assertion
- Verify the iteration counts of the deciders are equal to the proper number of steps in an epoch
- Verify transcript equivalence between lookup proof and point addition proofs
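As a sketch of how these checks might compose, with every helper and field name below being a hypothetical placeholder rather than an existing API:

```python
# Hypothetical orchestration of the compression circuit's checks; every method
# and attribute name is a placeholder illustrating the list above.

def verify_finality(pairing_proof, g1_proof, g2_proof, lookup_proof, epoch):
    A, B = g1_proof.output, g2_proof.output
    assert pairing_proof.verify(A, B)                      # e(A, B) == 1 on the claimed points
    assert g1_proof.verify() and g2_proof.verify()         # point addition proofs for both groups
    assert lookup_proof.verify()                           # Lasso set membership
    assert lookup_proof.multiplicity_ok()                  # no repeated table entries
    assert 3 * lookup_proof.num_participants >= 2 * epoch.active_validators   # majority
    assert g1_proof.num_steps == epoch.expected_g1_steps   # iteration counts match the epoch shape
    assert g2_proof.num_steps == epoch.expected_g2_steps
    assert lookup_proof.transcript_commitment == g1_proof.transcript_commitment
    return True
```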
# Expanding This Proof Architecture Over Multiple Networks
This proving system lays the foundation of a highly data parallel, modular and parallelizable recursive proving system that can scale horizontally across different networks using BLS signatures over the 12-381 curve, as well as other signature schemes with the addition of non-native operation specific recursive circuits.
This construction has proven performant down to low latency (< 3 minutes) for a consensus mechanism that requires > 1M group operations over the large 381-bit field with ~1M signing keys, which presents the largest foreseeable proving overhead for PoS consensus mechanisms.
Being that almost (reserving some room for consensus mechanisms we are unfamiliar with) every PoS blockchain uses elliptic curve signature schemes, verifying consensus can always be decomposed into combinations of iterative group operations together with checking the well-formation of those results over a verifiably correct transcript (along with some other invariants). As all elliptic curve groups are additive and commutative, and the data needed to verify consensus is public and known ahead of proving, all of these operations can be parallelized and concurrently proven. As a result, this proving system architecture can be extended to many PoS blockchain consensus mechanisms and achieve similar or better performance.
Merging these proofs over heterogeneous signing schemes will require additional layers of recursion and batch verification of KZG proofs, which carry additional prover overhead. As each of the subcircuits are data parallel with overlap across different networks, this additional latency and overhead will be sublinear, with the final proving costs being constant in size of a single step over all instructions used.
As such, this architecture provides the raw materials to achieve the vision of a highly parallelized, scalable and efficient unified state verification layer.