# Using lazy loading for duplicate subcircuit constraints to cut 95% memory usage for gnark R1CS
## Purpose
We are the zkbnb team and we are optimizing the lib used in zkbnb. When using gnark as a base zksnarks lib for the roll-up system, we face memory issues. For block size 64(a block contains 64 transactions), with circuit (link to circuit code) which compiled to 60m constrains, we will reach 128GB memory usage in this proof generation for the block. It not only costs more in hardware but also impacts on the performance because it requires lots of page swap.
We detect two parts using huge memory/storage size.
1. R1CS size. In our case R1CS takes about 56G memory.
2. Pk size. In our case Pk size takes about 64G memory.
Here we mainly focus on reducing the R1CS size.
## Background
#### Groth16
[Groth16](https://eprint.iacr.org/2016/260.pdf) are widely used by layer 2 and roll-up projects as their final prove backend system. It will output a succinct proof that can be easily verified onchain. Groth16 first needs an arithmetic circuit that fully describes the function.
#### arithmetic circuit
An arithmetic circuit consists of gates computing arithmetic operations like addition and multiplication, with wires connecting the gates. For example, the circuit looks like this

#### R1CS
R1CS is the description of an arithmetic circuit used by groth16. R1CS is composed of many R1C, every single R1C will contain $L$, $R$, $O$ `LinearExpression`, and `LinearExpression` contains `variables` or `constants` set.
One R1C will describe that $L * R = O$. For the figure above, for the g1 gate we can describe it by R1C $L(w_1) * R(w_2) = O(w_4)$.
#### Phases
The Whole E2E proving contains compile, setup, solve and prove phases.
**In compile phase**, a frontend system will read all possible constraints and output the R1CS containing various R1C.
During compilation, gnark will iterate the circuit definition, and generate R1CS. During this process, there will be many internal variables added. For example, if writing a circuit like this,
```go=
// Define declares the circuit constraints
// x**3 + x + 5 == y
func (circuit *Circuit) Define(api frontend.API) error {
x3 := api.Mul(circuit.X, circuit.X, circuit.X)
api.AssertIsEqual(circuit.Y, api.Add(x3, circuit.X, 5))
return nil
}
```
Here, x3 is the internal variable and will be a part of `LinearExpression`.
**In setup phase**, a backend system will uses FFT(Fast Fourier Transform) to get all the pk points on the curve, and outputs the vk as well.
**In solving phase** R1C will be extracted and calculate the right value. For example, if L,R,O is all solved, then the solving system will constraint them to satisfy the multiplication equation. If any two of them are solved, then the solving system will calculate the unsolved one. If less than two of them are solved, the solving system will panic.
**In proving phase**, once the backend system get all the solutions satisfied the constraints, we use MSM(Multi-Scalar Multiplication) of correspond pk and solution to get the proof point on the curve. And then the backend will aggregate the proof points to a valid proof.
## Solution
### Gnark Official Solution
There’re many efforts to reduce memory usage by the Consensys team, in the gnark latest release v0.8.0, like in this gnark PR [#418](https://github.com/ConsenSys/gnark/pull/418) the R1CS compilation will replace the long LinearExpression with a new one, and then using the new one for other R1C in the afterward compilation.
In the gnark PR [#412](https://github.com/ConsenSys/gnark/pull/412) the updates for the R1CS core system also mainly focus on R1CS memory usage and re-structure the LinearExperssion and Term(item in Linear Expression) fields.
These effort has reduced memory, however, the explosion of zk application more complex systems(like zkevm or recursive proving) make the circuit even larger. These optimizations will not stop the R1CS memory usage from growing linearly with the circuit size growing.
**We will need to specifically deal with the large circuit for reducing memory.**
### Our Solution
Noted that the large circuit always contains loops in R1CS generation, here we call them `duplicate subcircuits`. For example, the Merkle tree in the circuit will reuse hash functions many times, and the hash functions like MiMC will reuse permutations many times. Thus we come up with an idea to highly compress the `duplicate subcircuit` constraints to simple subcircuit constraints. And then we lazy load the constraint for use.
### Example
Let’s imagine a giant circuit with inputs i1 and i2, assume that v_ is the internal variables generated, and c_ is the constant variables used, during compilation to generate the R1CS the circuit will generate R1C as below,
_circuit 1 generated constraints as below_
```
1. (i1 + c1) * (i1 + c1) = v1
2. (i2 + c2) * (i2 + c2) = v2
3. (v1 + c3) * (v1 + c3) = v3
4. v3 * v3 = v4
n. ...constraints
```
If compilation later meets the same duplicate subcircuit with inputs i3 and i4, assume that before processing circuit 2, the existing internal variables index is already 1000, then the constraints will be
_circuit 2 generated constraints as below_
```
1. (i3 + c1) * (i3 + c1) = v1001
2. (i4 + c2) * (i4 + c2) = v1002
3. (v1001 + c3) * (v1001 + c3) = v1003
4. v1003 * v1003 = v1004
n. ...constraints
```
As you can see, there will be many internal variables generated in the duplicate subcircuit, the variables’ born and order are all the same, except that they got a shift(here is 1000) for the variables in the same constraint.
Here we do not look at constraint `1` and constraint `2` because they contain input variables.
For constraint `3` you will find that we can easily generate constraint 3 in circuit 2 by adding shift 1000 to the circuit 1 constraint 3's `v1` and `v3`.
For constraint `4` you will find that we can easily generate constraint 4 in circuit 2 by adding shift 1000 to the circuit 1 constraint 4's `v3` and `v4`.
so we save the structure static constraints(meanwhile circuit 1 generated constraints) for the first time and during solving we recover the other same duplicate circuits by
1. retrieve the input constraints, which contain the input variables. this is the same with normal constraints.
2. retrieve constraints and add shifts to the variables. In the above case, we have shift 1000, so in constraint 4 we can recover (v3 * v3 = v4 in static constraints) to (v1003 * v1003 = v1004)
So during compilation(or during R1CS generation), we need to record the first appear duplicate subcircuit and record input constraints for any other same duplicate subcircuit, then record the shift. So Afterwards we can easily recover the constraints which are hidden.
now we conclude the solution
1. record repeatable constraints structure (during this process, we will record the static constraints StaticConstraints, and it will be stored in a map by a key defined by the user, for looking up the constraint that needs to be recovered)
2. record repeatable constraints’ inputs, and pick only the input constraints to store. (GeneralLazyInputs)
3. lazify and remove all the repeatable r1cs structure, with only lazy inputs constraints left.
4. recover the r1c when solving. the input constraints can be easily extracted from the store, for others we have a shift recorded the distance from the static constraints, and add the offset shift to the variables to recover.
## Experiment
This is the experiment data for dumping the r1cs to file (for a living prover, another part pk memory usage will not be affected here), and we test for the `origin version`, `optimized version without lazify`(compared to the original version it includes some meta) and `optimized version with lazify`. and the mimc testing circuit for testing is like below.
### circuit
```go=
for i := 0; i < times; i++ {
mimc.Write(circuit.PreImage, circuit.PreImage, circuit.PreImage, circuit.PreImage)
api.AssertIsDifferent(mimc.Sum(), circuit.Hash)
mimc.Reset()
}
```
#### experiments show that we save up to **95%** memory of r1cs
### results
#### mimc-inputs-4 (save 70% r1cs memory usage)
mimc | 1 | 10 | 100 | 1000 | 10000
-- | -- | -- | -- | -- | --
dump_r1cs_w/o_origin (kb) | 127 | 1,225 | 12,939 | 135,828 | 1,364,706
dump_r1cs_w/o_lazify (kb) | 244 | 1,342 | 13,056 | 135,946 | 1,364,833
dump_r1cs_w/_lazify (kb) | 158 | 479 | 4,055 | 43,072 | 435,292
as you can see, for the such circuit we save nearly 70% of memory, and both lazified and non-lazified grow linearly with the mimc hashes num, that's because we use input constraints to recover true constraints. If the input constraints contained are very little, then the memory of r1cs used actually will be very little.
so we test for Poseidon 12 inputs, which surely the input constraints contained in its static constraints are less.
#### poseidon-inputs-12 (save 95% r1cs memory usage)
poseidon | 1 | 10 | 100 | 1000 | 10000
-- | -- | -- | -- | -- | --
dump_r1cs_w/o_origin (kb) | 313 | 2,520 | 24,602 | 273,890 | N/A
dump_r1cs_w/o_lazify (kb) | 546 | 2,761 | 24,915 | 275,049 | N/A
dump_r1cs_w/_lazify (kb) | 325 | 450 | 1,745 | 16,962 | N/A
this time it saves nearly 95% of the memory used in r1cs.
During our zkbnb production usage of this optimization, we are able to reduce 70% memory usage of R1CS size. We apply the technique to duplicate subcircuit of MiMC permutations, which occupy 70% of constraints usage.
## Contribution
1. Reorganize the R1CS structures, and enable the high compression for the R1CS large circuit to save 70% of memory usage of zkbnb.
2. Make the memory usage by R1CS not grow linearly with circuit size, so that developer can ignore the limit of memory usage, which will surely help to make applications on gnark much more practical.
3. With this effort, we are able to minimize the R1CS storage size, which will make the R1CS reloading much faster. (Here We also use parallelize loading). The R1CS loading time in zkbnb for 20M constraints is only 8 seconds, compared to the former 30 minutes.
## What can be done next?
1. formulize the R1CS compress format, so that we can not only apply to gnark but also to all R1CS-based zksnarks lib.
2. autonomously detect the duplicate subcircuit, now we urge users to find the duplicate subcircuit themselves, which is not making sense for an infrastructure lib.
we will work will the Consensys team to make this happen.