# Aggregation proposals
[TOC]
## Context
The current bottleneck of our chain is that a unique proof is verified by the whole network for each single 32/64GB sector. Hence, we want to remove the linear amount of proofs being verified onchain. With Hyperdrive, we are already aggregating proofs, and that will give a noticeable throughput improvement. However, onchain verification still contains linear operations, such as individual sector book-keeping, that will soon dominate the cost of verification. In this document we evaluate solutions to remove as many linearities as possible from the state and from onchain verification, on a short/mid term basis.
## Goals
We aim for the following goals:
* The state does not contain a linear amount of information anymore, in particular the $comR$s, but only an "aggregation" of $comR$s.
* As a consequence, the SnarkPack verifier doesn't need to read *all* the public inputs to pass as input to the Fiat-Shamir transform
* Likewise, the verifier doesn't need to read *all* the public inputs inside the Groth16 equation
Optional goals:
* Keeping the same circuit for ProveCommit would be ideal
* Find a way to compress the $comD$ as well
## Background
* Prover is aggregating $n$ proofs
* Each proof has $p$ public inputs $a_j$ for $j:0 \rightarrow p$
* We denote $a_{i,j} \in \mathbb{F}$ the $j$-th public input of the $i$-th proof
SnarkPack uses the public inputs in two places, and these are the two places where this proposal removes most linearities:
1. Inputs to the Fiat Shamir transform to generate a random $r$ used for a random linear combination of all the proofs
* $r = H(C_p, \{a_{i,j}\} ) \in \mathbb{F}$ with $i:0 \rightarrow n$ and $j: 0 \rightarrow p$
* $C_p$ is a commitment to all the proofs
2. Verification of the Groth16 equation using the random linear combination
* The relevant term that uses these public inputs is:
$$
T = \prod\limits_{j=0}^{p} S_j^{\sum\limits_{i=0}^{n} a_{i,j} * r^i}
$$
where $S_j \in \mathbb{G_1}$ is part of the verification key of the Groth16 ProveCommit proof.
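To make this term concrete, here is a toy sketch of the exponents $l_j = \sum_i a_{i,j} r^i$ and the multiexponentiation $T$. Small prime moduli stand in for the scalar field and $\mathbb{G_1}$ (real SnarkPack works over BLS12-381, where exponents live modulo the group order); all values are illustrative.

```python
# Toy model of the SnarkPack public-input term. Small prime moduli stand
# in for the scalar field and G1; a real implementation reduces exponents
# modulo the group order of BLS12-381.
P = 2**31 - 1   # stand-in modulus for the group G1
F = 2**13 - 1   # stand-in scalar field modulus

def lincomb_exponents(a, r):
    """l_j = sum_i a[i][j] * r^i for each public-input slot j."""
    n, p = len(a), len(a[0])
    return [sum(a[i][j] * pow(r, i, F) for i in range(n)) % F
            for j in range(p)]

def term_T(S, l):
    """T = prod_j S[j]^{l_j}: the multiexp the verifier computes."""
    T = 1
    for S_j, l_j in zip(S, l):
        T = (T * pow(S_j, l_j, P)) % P
    return T

a = [[5, 7], [11, 13], [17, 19]]  # a[i][j]: j-th input of i-th proof (n=3, p=2)
r = 1234                          # Fiat-Shamir challenge
S = [3, 5]                        # S_j taken from the verification key
print(term_T(S, lincomb_exponents(a, r)))
```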
The public inputs *for one ProveCommit* are composed of multiple pieces:
* ReplicaID = H(ProverID, SectorID, ReplicaSeed, CommD, PoRepID)
* Hash is done **outside the circuit**
* The $comR$: this is generated at PreCommit time, so it is currently known to the chain before the miner sends its ProveCommit. One per sector.
* The $comD$: this is a commitment to the data being embedded in the sector
* The challenges $c_k$: these are generated from a fixed seed $s$ from the random beacon (say up to $k$ challenges)
**Deals**: With respect to the miner storage power, deals are considered in two places:
* PreCommit time: Miner advertises which deal CIDs $dealID$ he is integrating into his sector. The commitment (root of a Merkle tree) of these $dealID$s is called $comD$ (see the sketch after this list)
* ProveCommit:
* Miner proves he knows some leaves whose root is $comD$
* Verifier also checks $comD$ is correctly computed from the advertised $dealID$s from PreCommit
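To make the $dealID \rightarrow comD$ mapping concrete, here is a minimal sketch that builds $comD$ as the root of a binary Merkle tree over the advertised deal IDs. SHA-256 and the 32-byte leaf encoding are stand-ins for the real construction; the deal IDs are hypothetical.

```python
# Sketch: comD as the Merkle root over the advertised dealIDs.
# SHA-256 and the leaf encoding stand in for the real construction.
import hashlib

def h(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(left + right).digest()

def merkle_root(leaves: list) -> bytes:
    layer = list(leaves)
    while len(layer) > 1:
        if len(layer) % 2:              # pad odd layers with a zero node
            layer.append(b"\x00" * 32)
        layer = [h(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

deal_ids = [d.to_bytes(32, "big") for d in (42, 43, 57)]  # hypothetical dealIDs
comD = merkle_root(deal_ids)
print(comD.hex())
```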
## Overview of the proposals
### Aggregating sectors
There exist two main "practical" solutions (in the sense of not requiring a re-design of everything in Filecoin) to achieve these goals; both are centered around **aggregating the $comR$s** and keeping only a **single aggregate $comR^*$ onchain**. We call an aggregation of sectors proved at the same time a **supersector**, which is designated by a single $comR^*$ onchain.
There are currently two ways envisioned for doing this:
1. **SnarkPack inputs aggregation**: onchain verification now includes verifying a SNARK that proves the aggregation of inputs has been done correctly, and gives the aggregation to the SnarkPack verifier directly - this removes the linearity in the SnarkPack system.
2. **Circuit inputs aggregation**: onchain verification now only passes a constant number of inputs per proof, i.e. the $comR^*$, and the circuit checks that the $comR$ belongs to $comR^*$, i.e. that the sector belongs to the supersector.
**Size of supersectors**: Given we use Groth16 and don't have unlimited recursion available, we must fix a constant size $n$ for how many sectors are included in a supersector. If a miner provides fewer than $n$ sectors, then simple padding is applied at the proof level.
### Enforcing uniqueness of sectors
For each of these solutions, we need to preserve the same guarantees as those currently made by onchain verification. We can use SNARKs to help us. One particular constraint is to enforce the uniqueness of sector IDs: they all must be unique, either by proving they are unique or by showing they are incremental. Indeed, the sectorID is the only (minimal) "variable" for a given miner when sealing a new sector. We want to remove this check from onchain verification, such that there is no need to pass individual $comR$s or sectorIDs onchain. There are two envisioned ways of doing this, both compatible with any of the above proposals:
1. **Incremental SectorID**: Miner proves he is using sequential sectorIDs inside a SNARK
2. **Unique SectorID**: Miner proves he is using unique sectorIDs inside a SNARK
*Why can't we simply put the $comR$s into an accumulator?*
The SNARK does detect with high probability that the prover used an invalid sectorID, because all the labellings will be incorrect with respect to the valid encoding. However, the prover can still create a new $comR$ from the same replica: the proof, by design, allows the prover to cheat up to a certain percentage, like 20%.
### Deals
Because deals touch many areas in Filecoin, including the market actors, completely removing the linearities around them would probably require a more in-depth restructuring of the state (moving to a fully stateless approach).
The current state of Filecoin does not keep around the $comD$ of each sector already, so the mapping is between $dealID$s and a sector.
In this proposal, the mapping is now between $dealID$s and a supersector, but **it keeps the same logic**: the prover has to prove in a ProveCommit that he knows some leaves corresponding to a $comD$ that depends on the deals.
### Overview of Costs of engineering
Proposal 1 requires the engineering of:
* 1 aggregation snark proof
* 1 windowpost proof
* 1 winning post proof
* 1 merging supersector proof
* Rework the snarkpack library to operate in two steps (mild effort)
Proposal 2 requires the engineering of:
* 1 porep proof
* 1 wpost proof
* 1 merging supersector proof
* 1 winning post proof
In both cases, 3 new circuits will have to be developed, and the new way of aggregating sectors into supersectors will have to be implemented on the proofs side as well.
## Proposal 1: SnarkPack inputs aggregation with incremental SectorIDs
### PreCommit step
#### Sector related operations
* Generate a unique SuperSector ID or $SSID$
* Generate the different SectorIDs $S_i = H(SSID, MinerID, i)$
* Generate all the respective ReplicaIDs $R_i$ using these $S_i$
* Generate all the PoReps using the respective sector IDs, leading to many $comR$s
* Create a Merkle tree of all of these $comR$s, whose root is $comR^*$
* We must use a SNARK friendly hash function here because we are going to verify inclusion proofs in circuits later on.
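A minimal sketch of these PreCommit derivations, assuming SHA-256 as a stand-in for the SNARK-friendly hash (e.g. Poseidon), $n$ a power of two, and illustrative byte encodings:

```python
# Sketch: derive sector IDs S_i = H(SSID, MinerID, i) and build comR* as
# the root of a tree over the comRs. SHA-256 stands in for the
# SNARK-friendly hash the circuits would actually use.
import hashlib

def H(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def sector_ids(ssid: bytes, miner_id: bytes, n: int) -> list:
    return [H(ssid, miner_id, i.to_bytes(8, "big")) for i in range(n)]

def comr_star(com_rs: list) -> bytes:
    layer = list(com_rs)                 # n assumed to be a power of two
    while len(layer) > 1:
        layer = [H(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

ssid, miner = b"ssid-0", b"f01234"                 # hypothetical identifiers
sids = sector_ids(ssid, miner, 4)
com_rs = [H(b"sealed-replica", s) for s in sids]   # placeholders for real comRs
print(comr_star(com_rs).hex())
```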
#### Deals Information
Assuming the prover has a list $L = (D_1,\dots,D_p)$ of deal IDs ($p$ of them), he wants to put them in the supersector. Let's call $L_i$ the *subset* of deals he wants to include in the $i$-th sector.
**Note**: deals repartition between the different sectors is something to be customized according to specific needs / constraints. This solution enables any kind of subset.
#### On chain submission
Prover submits the following:
* $comR^*$
* $SSID$
* randomness epoch chosen for this precommit step
* Deals related information: all subsets of deals $L_i$ with $i: 0 \rightarrow n$. **This linear information is kept in the state**.
#### Trade off
* It is not strictly required that all sectors use the same precommit epoch to get the same randomness. If we don't require it, that means we still have *at most* a linear amount of epochs to track onchain *for the duration of precommit*; this eventually gets removed after some time or after the provecommit. As a first step, it might be an acceptable cost because we still get rid of the linear amount of $comR$s onchain, and the miner doesn't need huge hardware to make and keep all proofs at the same time.
### ProveCommit
Steps of the prover:
1. Prover gets the seed $s$ to generate the challenges $c_{i,k}$ from the seed ($i$ sector, $k$ challenge index)
2. Prover generates all its individual ProveCommit proofs as usual
3. Prover starts the SnarkPack procedure, but with a modified $r$:
    * $r = H(C_p, comR^*, s, minerID)$ - all this information is public
4. Prover creates an "aggregation" Groth16 proof:
* Public inputs (statement): $r$, the $comR^*$, the miner ID, $s$, $SSID$, all $l_j$ (see below), and all $comD_i$
* Private inputs (witness): the individual $comR$s, the individual challenges $c_{i,k}$, the individual $S_i$
* Circuit:
    a. Prove correct generation of all sector IDs $S_i$ (hashing)
    b. Prove correct generation of all the challenges (hashing)
    c. Prove Merkle inclusion of all the given $comR$s in $comR^*$
    d. Prove correct computation of the $j$-th terms of the linear combination of SnarkPack: $l_j = \sum\limits_{i=0}^{n} a_{i,j} * r^i \in \mathbb{F}$ for $j: 0 \rightarrow p$. Note the prover has access to all $a_{i,j}$ from the public + private inputs
    e. Ensure all $comR$s are different
5. Prover sends onchain: the SnarkPack proof + the "aggregation" proof + the $l_j$s
6. Verifier loads the $comR^*$, minerID from state
7. Verifier generates the $r$ from SnarkPack and all the $comD_i$ from PreCommit information (from all the $L_i$)
8. Verifier now verifies the "aggregation" proof (now that he has all public inputs)
9. If successful, then he verifies SnarkPack with the output of the "aggregation" proof for the random linear combination:
$$
T = \prod\limits_{j=0}^{p} S_j^{l_j}
$$
This takes time only with respect to the number of public inputs of a ProveCommit, not with respect to the number of proofs.
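As an illustration of the modified Fiat-Shamir step above, the challenge $r$ now hashes only constant-size public data rather than all $n \cdot p$ public inputs. SHA-256, the field modulus, and the byte encodings below are stand-ins:

```python
# Sketch: r = H(C_p, comR*, s, minerID), all constant-size public data.
# SHA-256 and the modulus stand in for the real hash and scalar field.
import hashlib

FIELD = 2**255 - 19   # illustrative prime; BLS12-381 has its own scalar field

def fiat_shamir_r(c_p: bytes, comr_star: bytes, seed: bytes, miner_id: bytes) -> int:
    digest = hashlib.sha256(c_p + comr_star + seed + miner_id).digest()
    return int.from_bytes(digest, "big") % FIELD

r = fiat_shamir_r(b"commitment-to-proofs", b"comR*", b"beacon-seed", b"f01234")
```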
#### Deals
For the deals, the verifier first has to compute all the $comD_i$ manually before giving them to the aggregation SNARK. **This work is linear in the number of deals**. As well, the state must store all deal IDs individually, as is currently the case.
**Constant $comD$**: the Filecoin API could just give a single $comD$ and the proof system would take care of using the same one at the relevant places. In this case, it has to perform a multiexp like the following:
$$
\prod\limits_{j=0}^{p} S_j^{comD_j}
$$
We can't gain anything on the computation by having the same $comD_j$ across all $j$, but this detail will be hidden at the border between Lotus and Filecoin.
### WindowPost
For WindowPost, we need to make a new circuit with the following characteristics (OR change the inner circuit of ProveCommit):
* Public inputs (statement): the $comR^*$, the seeds $s_i$ that generate the challenges
* Private inputs (witness): *all* the individual $comR$s from the supersector, all challenges $c_i$ ($x$ for each sector) generated from the seed $s_i$
* Circuit:
    * verify all the individual $comR$s are included in $comR^*$
    * for each $comR_i$, take the seed $s_i$ and generate $x$ challenges
    * for each such challenge, prove the inclusion of the corresponding leaf inside the relevant $comR$ (native windowpost circuit)
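To fix ideas, here is an out-of-circuit mirror of this logic, a sketch only: SHA-256 replaces the real hash, the `(sibling, is_left)` path format and the challenge derivation are hypothetical, and a real circuit would also bind each challenge to the position of the opened leaf.

```python
# Out-of-circuit mirror of the WindowPost circuit logic above (sketch).
import hashlib

def H(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def derive_challenges(seed: bytes, sector: int, x: int) -> list:
    """x challenges for one sector, derived from the public seed."""
    return [H(seed, sector.to_bytes(8, "big"), k.to_bytes(8, "big"))
            for k in range(x)]

def verify_path(leaf: bytes, path: list, root: bytes) -> bool:
    """path: list of (sibling, leaf_side_is_left) pairs up to the root."""
    node = leaf
    for sibling, is_left in path:
        node = H(node, sibling) if is_left else H(sibling, node)
    return node == root

def windowpost_check(comr_star, sectors, seed, x) -> bool:
    """sectors: list of (comR, path_to_comr_star, challenged_leaf_proofs)."""
    for i, (com_r, star_path, leaf_proofs) in enumerate(sectors):
        if not verify_path(com_r, star_path, comr_star):   # comR in comR*
            return False
        challenges = derive_challenges(seed, i, x)
        for c, (leaf, path) in zip(challenges, leaf_proofs):
            # a real circuit also ties challenge c to the opened leaf position
            if not verify_path(leaf, path, com_r):
                return False
    return True
```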
**Trade-off**: We can also create an extra "glue" SNARK, similar to ProveCommit, that aggregates the public inputs so we don't change the window post circuit. But this solution requires aggregating multiple window posts, as we are using SnarkPack.
#### Aggregating WindowPosts
Given the circuit above, we can use SnarkPack natively to aggregate WindowPost proofs. The work will be linear, but in the number of supersectors.
### Merging SuperSectors
The main **condition** is that merging supersectors can only happen when the total number of sectors is $\leq n$. Let's say one wants to aggregate two supersectors. Here is some notation:
* First one designated by $comR_1^*$ with $n_1$ actual sectors $comR_{1,1} \dots comR_{1,n_1}$ inside
* Second one designated by $comR_2^*$ with $n_2$ actual sectors $comR_{2,1} \dots comR_{2,n_2}$ inside
* The new supersector containing sector from both supersectors is designated via $comR_a^*$ and contains $n_1 + n_2$ sectors.
As noted in the overview section, we must have that $n_1+n_2 \leq n$.
The circuit looks like the following:
* Public Inputs: $comR_1^*, comR_2^*, n_1, n_2, comR_a^*$
* Private Inputs: $comR_{1,1} \dots comR_{1,n_1}, comR_{2,1} \dots comR_{2,n_2}$
* Outside circuit: prover computes the new Merkle tree whose leaves are all the $comR$s from both supersectors
* Circuit:
* For each $comR_{1,i}$, verify inclusion proofs to $comR_1^*$
* For each $comR_{2,i}$, verify inclusion proofs to $comR_2^*$
* For each $comR_{i,j}$, with $i=1,2$, verify inclusion proof to $comR_a^*$
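An out-of-circuit sketch of the merge check, with the same illustrative hash and path format as the earlier sketches:

```python
# Sketch: every comR must verify against its old supersector root and
# against the new root comR_a*. SHA-256 stands in for the real hash.
import hashlib

def H(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def verify_path(leaf, path, root):
    node = leaf
    for sibling, is_left in path:
        node = H(node, sibling) if is_left else H(sibling, node)
    return node == root

def merge_check(root_1, root_2, root_a, group_1, group_2) -> bool:
    """group_k: list of (comR, path_to_root_k, path_to_root_a) triples."""
    for old_root, group in ((root_1, group_1), (root_2, group_2)):
        for com_r, old_path, new_path in group:
            if not (verify_path(com_r, old_path, old_root)
                    and verify_path(com_r, new_path, root_a)):
                return False
    return True
```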
#### Constraint & Variations
This "naive" mechanism forces **all sectors** in the supersectors to be proven in this new circuit. Indeed, if we allowed a restricted number, the verifier wouldn't know whether the prover is re-proving old proofs or not. So by default, **we could assume a supersector contains 2500 sectors**. When aggregating with SnarkPack, there would be 1 public input per supersector, so linearity is OK there.
**For larger supersector sizes**, assuming 2500 sectors is the maximum we can prove inside a wpost, we can change the above circuit to only prove a subset by including the indexes as public inputs. The circuit will then only verify inclusion proofs for the given indexes. That way, we can prove X sectors in one proof, and another X in a second proof, both for the same supersector.
In that case, when aggregating with SnarkPack, we would need **a number of inputs linear in the number of sectors being proven**. These inputs are only indexes, but the verifier would need to know which ones (for example, first 0->2500 in one proof, then 2501->5000 in another proof).
## Proposal 2: Circuit inputs aggregation with incremental SectorIDs
### PreCommit
Similar to proposal 1.
### ProveCommit
#### Single Sector
In this proposal, we modify the ProveCommit circuit to check directly inside the circuit that the sector belongs to the supersector. **The circuit still proves only one sector**:
* Public inputs (statement): $comR^*$, $seed$, $i$, $minerID$, $SSID$, $comD_i$
* Private inputs (witness): $comR$, the inclusion proof for $comR$, $sectorID$, $replicaID$, challenges $x_i$
* Circuit:
* Verify correct computation of $sectorID$ (hashing)
* Verify correct computation of ReplicaID
* Verify correct computation of challenges $x_i$ (hashing)
* Verify the inclusion proof from $comR$ to $comR^*$
* Verify the inclusion proof from $comD$ to $comD^*$ (hashing)
* Verify ProveCommit as before (inclusion proofs on challenges)
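A sketch of the hashing chain this circuit re-verifies, using the definitions above (SHA-256 and the byte encodings are illustrative stand-ins for the in-circuit hash):

```python
# Sketch: sectorID and ReplicaID derivations checked inside the circuit.
import hashlib

def H(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def sector_id(ssid: bytes, miner_id: bytes, i: int) -> bytes:
    # S_i = H(SSID, MinerID, i)
    return H(ssid, miner_id, i.to_bytes(8, "big"))

def replica_id(prover_id, s_id, replica_seed, com_d, porep_id) -> bytes:
    # ReplicaID = H(ProverID, SectorID, ReplicaSeed, CommD, PoRepID)
    return H(prover_id, s_id, replica_seed, com_d, porep_id)

sid = sector_id(b"ssid-0", b"f01234", 7)
rid = replica_id(b"f01234", sid, b"chain-randomness", b"comD_7", b"porep-v1")
```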
#### Super Sector
Prover creates a proof for each sector in the supersector, and then aggregates them using SnarkPack natively.
Note how the verifier must, as in Proposal 1, compute the $comD$s himself to give as public inputs to the SnarkPack aggregation routine.
#### Linearities
SnarkPack at this stage still needs the public inputs of each PoRep: this leads to a **still linear amount of public inputs**, but with a **much lower constant**. For each ProveCommit proof to be aggregated, the verifier needs to provide:
* $comR^*$ - must be read from the state, only once given the sectors belong to the same supersector
* $i$ index of the sector - no read, the verifier simply inputs this index
* $minerID$ - read only once per supersector from state
* $seed$ - read only once per supersector from chain info
* $SSID$ - read only once per supersector from state
* $comD$ - read from state, **per proof** (linear)
That means there are 6 public inputs per proof. Using 8192 PoReps, that still means 49152 public inputs. Going to a higher number of proofs using SnarkPack, **verification time may suffer because of this linearity**.
**TODO**: to be analyzed more deeply if we want to go with this solution, where the bottleneck lies inside SnarkPack. General consensus seems to be that the current bottleneck is state reading, so this would be fine in that case - except for $comD$.
### WindowPost
Similar to Proposal 1.
## Comparisons & Overview
Proposal 1 has the following pros:
* The ProveCommit circuit is NOT modified, so there is no distinction in hardware/proving time between miners that use supersectors and miners that don't
* Removes all linearities when verifying ProveCommits for a supersector
However, it requires a complex interaction between SnarkPack and the proofs (which may be hard to prove secure); we're not using SnarkPack as a black box anymore.
Proposal 2 is potentially more elegant, as it brings us closer to the design of fully stateless mining in Filecoin. However, it requires new proof systems for ProveCommit and, unfortunately, still keeps some linearities in the ProveCommit phase.
In any case, both systems **require a new prove commit**.
## Areas of improvements
### Deals
The linearity still present in both systems is due to the deals; **that is left to solve**. Namely, both proposals remove the need to store individual $comD$s, but the individual $dealID$s stored inside a supersector are still needed in state. Moving that offchain requires touching many other components of Filecoin, and would most likely end up making the full state provable offchain, the "stateless mining" solution.
### Proving Uniqueness of Sector IDs
There is also another way to ensure the uniqueness of sectorIDs that removes the need for incremental sector IDs, using accumulators, and that approach may lead closer to the fully stateless mining approach. It is however a bit more complex to implement in a SNARK.
The key problem here is how to build an accumulator that (1) supports non membership proofs and (2) produces proofs that can be efficiently verified in a circuit.
#### "Pure" cryptographic accumulators
Using accumulators in pairing settings, or even on regular curves or in RSA settings, yields (1) but not (2) given our current SNARK framework (BLS12-381). We could use a different curve like BLS12-377 that enables efficient pairings inside the circuit, or another proof system, but that is unlikely to give a solution short term.
#### Merkle Trees
Using a Merkle tree naively fits (2) but not (1). To get (1), we need special properties of the Merkle tree, and it isn't clear yet how these would impact the computation inside the circuit.
#### Sorted Merkle Tree
A sorted tree allows proving non-membership by exhibiting the existing neighbors in the tree. Example: to prove 10 isn't included in the tree, prove that 9 and 11 share a common ancestor with no leaf in between. This solution requires complex machinery to update the tree each time we include a new value. If the tree is empty at the beginning, it isn't clear we can do this in a straightforward fashion using our current SNARK toolset (Groth16 / R1CS). We would potentially need more recent constructions that allow incremental computation to support updates at varying depths.
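A toy sketch of the adjacency argument (tree plumbing elided; a real proof would additionally carry Merkle paths for both neighbors and a proof that they are adjacent leaves):

```python
# Sketch: non-membership of x via the adjacent pair (a, b) with a < x < b.
def non_membership_witness(sorted_leaves: list, x: int):
    for a, b in zip(sorted_leaves, sorted_leaves[1:]):
        if a < x < b:
            return (a, b)   # witness: x would have to sit between a and b
    return None

assert non_membership_witness([3, 9, 11, 20], 10) == (9, 11)
```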
#### A Merkle Tree of "0"s
Another practical approach is to consider a Merkle Tree full of zeros, with a maximum size $M$ defined. To prove non-membership of entry $s_i$, we look at index $s_i$ and prove inclusion of the leaf "0". To update, we simply put the value $s_i$ at the index $s_i$ and prove inclusion of the value $s_i$.
This gives (1) and (2). However, it might come at a high practical cost, given miners must now keep in memory/disk a **large** Merkle tree, regardless of the number of sectors they have currently proven! Potentially we can cache most layers of the tree so miners don't need to compute from scratch. It will nevertheless probably impact the computation of proofs negatively.
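A minimal sketch of this zero-tree accumulator, with SHA-256 standing in for the SNARK-friendly hash and a tiny $M = 2^4$ for illustration:

```python
# Sketch: a Merkle tree of zeros. Leaf index s_i holds 0 until sectorID
# s_i is inserted, so non-membership is an inclusion proof of the zero
# leaf at that index. SHA-256 stands in for the SNARK-friendly hash.
import hashlib

DEPTH = 4                      # tree holds M = 2**DEPTH slots
ZERO = b"\x00" * 32

def H(l: bytes, r: bytes) -> bytes:
    return hashlib.sha256(l + r).digest()

class ZeroTree:
    def __init__(self):
        self.leaves = [ZERO] * (1 << DEPTH)

    def root(self) -> bytes:
        layer = self.leaves
        while len(layer) > 1:
            layer = [H(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
        return layer[0]

    def path(self, idx: int) -> list:
        """Siblings from leaf to root, with our side at each level."""
        layer, out = self.leaves, []
        while len(layer) > 1:
            out.append((layer[idx ^ 1], idx % 2 == 0))  # (sibling, we_are_left)
            layer = [H(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
            idx //= 2
        return out

    def insert(self, s_i: int):
        self.leaves[s_i] = s_i.to_bytes(32, "big")

def verify(leaf: bytes, path: list, root: bytes) -> bool:
    node = leaf
    for sibling, we_are_left in path:
        node = H(node, sibling) if we_are_left else H(sibling, node)
    return node == root

t = ZeroTree()
assert verify(ZERO, t.path(10), t.root())     # 10 unused: zero leaf proven
t.insert(10)
assert verify((10).to_bytes(32, "big"), t.path(10), t.root())  # now a member
```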
# Notes
## Notes 2020-06-16
@anca @rosario @nikkolasg
* It is highly likely that miners / nodes still need to access the individual comRs for disputing wpost (maybe even for deals?) - we should look into this more.
* We need to make sure the comR* is correctly formed from valid comRs, so at PreCommit time the prover sends the individual comRs in the transaction data and the verifier recreates the commitment and validates each individual comR. This is to avoid Anca's attack where one can reuse previous comRs.
* We CAN remove the hashing of the challenges inside the extra SNARK if we want to. That means the verifier will have to compute these challenges outside the circuit. We could use a fast hash function like BLAKE3 for this so that it is done efficiently. Even more so if we use the same challenges across all proofs.
* Using the same challenges across all proofs needs to be checked with Lucas, to see if PoRep would still be secure overall for many proofs. More thoughts are needed on this topic.
* Not clear yet if we can avoid having a new circuit for this.
## Notes 2020-06-17
@anca @rosario @nikkolasg @nicola
* ReplicaID = H(ProverID, SectorID, ReplicaSeed, CommD, PoRepID)
* ProverID: miner address
* SectorID: picked by the miner and it can't already be used
* ReplicaSeed: some randomness picked in the past from the chain (this ties the proof to a fork of the blockchain)
* CommD: commitment to the data embedded in the sector
* PoRepID: the identifier for this type of proof of replication version
ReplicaID is unique per PoRep; what makes it unique per sector is the SectorID.
Aggregation (CommR1, CommR2, CommR3) -> CommRAgg
* Proof 1 (SectorID) -> CommR
* Proof 2 (SectorID) -> CommR
* Proof 3 (SectorID) -> CommR
ProveCommit(CommRAgg, proof, XXXXX)
Problem: If we naively aggregate inputs, we lose track of whether CommRs have been re-used
### Input aggregation at circuit layer
* Circuit CommRAgg, index
* SubCircuit1: There exists a path at index i that has a leaf that is CommR
* SubCircuit2: Prove CommR is highly likely to be correctly generated from SectorID using some challenges which select some nodes to be correctly generated and consistent with CommR
* Check all Merkle tree witnesses of challenged nodes are consistent with the position and with CommR and CommD
* Check that the labeling of a node = H(ReplicaID||parents)
```
          CommRAgg
         /        \
   CommR1   ...   CommRn
     /\
    /  \
node_1 ... node_z
```
* Make all proofs to be aggregated have the same SectorID
* This does not work, because sectors with the same sector IDs lead to the same CommR and one could generate one replica and aggregate it many times
* Subsectors have a Sector ID -> this will be a public input to the SNARK
* This makes verification time linear AND requires the chain to store SectorIDs in the state, so the state size is linear in the subsectors.
* Subsectors have a subsectorID which is SuperSectorID||i
Circuit(CommRAgg, SuperSectorID, index):
* Private Input: CommR at the given index in the super sector
* SubCircuit1(i): There exists a path at index i that has a leaf that is CommR
* SubCircuit2(i):
* Prove all Merkle tree witnesses for parents and challenged nodes
* Check that the labeling is done correctly
* Check that ReplicaID = H(ProverID, SuperSectorID||index, ReplicaSeed, CommD, PoRepID)
SnarkPack([snark1(CommRAgg, 1)...snarkn(CommRAgg, n)])
Verification(CommRAgg, [1...n])
### Input aggregation at aggregation layer
**Way 1: Prover packs all the inputs into the snark**
Glue SNARK with replicaID inside of circuit (non-linear):
* Subcircuit 1: Ensure that the linear combinations are all correctly computed with inputs that are all aggregated into a CommInputs: Merkle tree of (CommR1, 1, ReplicaID1), ..., (CommRn, n, ReplicaIDn)
* Subcircuit 2: For each leaf i show that
* the second item is increasing from 1 to n
* the ReplicaID is constructed as H(......||i)
**Way 2: Verifier computes inputs to be passed to snarkpack**
SnarkPack with replicaID outside of circuit (non-linear):
* At verification time verifier must generate 1..n, must generate ReplicaID1..ReplicaIDn
* Glue Circuit:
* Subcircuit 1: Ensure that the linear combinations are all correctly computed with inputs that are all aggregated into a CommInputs: Merkle tree of (CommR1, 1, ReplicaID1), ..., (CommRn, n, ReplicaIDn)
### Meeting 6 July
w/ luca + nikkolasg
* We should draft something about winning post (that means 3 new circuits to realize)
* We should confirm no sector IDs are passed into windowpost / winningpost
* Challenges:
# Drafts - workplace - Do not look if you don't want to lose time
## Deal solution 1
precommit:
prover gives $L_i$ (subsets saying which dealIDs are in which sector), the $comD_i$ and $comD^*$
provecommit:
Verifier recomputes the $comD_i$ from the dealIDs (SHA-256, so it is fast)
circuit public input takes $comD^*$, as well as the $comD_i$
The regular ProveCommit verifies that for each challenge $j$ at sector $i$, the prover knows the path from the leaf to $comD_i$
The aggregation circuit also verifies that $comD_i$ is part of $comD^*$ - so each public input $comD_i$ passed to the ProveCommit is also proven to be part of $comD^*$; therefore the whole $comD^*$ is proven correct, and we have a mapping from dealIDs -> $comD^*$
**Linearities**:
Computation:
* verifier needs to compute all $comD_i$ manually
Storing:
* list of Deals tied to super sector, with $comD^*$
## Deal solution 2
precommit:
prover gives $L_i$, verifier computes the $comD_i$ himself
provecommit:
circuit public input takes all the $comD_i$ and simply aggregates them normally
verifier needs to load these as public inputs
**Linearities**
Computation:
* verifier needs to compute all $comD_i$ manually
Storing:
* List of Deals with each $comD_i$