# Halo2 Peak Memory Usage
This note walks through pseudocode of `create_proof` in `halo2_proofs` and counts the memory usage at each step.
```rust
fn create_proof(params: &KZGParams, pk: &ProvingKey, circuit: &Circuit, instances: &[&[F]]) {
// Let:
//
// - k: log2 of the number of rows
// - n: `1 << k`
// - d: Degree of circuit
// - e: Extension magnitude, equal to `(d - 1).next_power_of_two()`
// - c_f: number of fixed columns
// - c_a: number of advice columns
// - c_i: number of instance columns
// - c_p: number of columns enabled for copy constraints
// - c_pg: number of grand products in the permutation argument, equal to `div_ceil(c_p, d - 2)`
// - c_l: number of lookup arguments
//
// The memory usage counters M.C, M.S and M.E stand for:
//
// - M.C: number of "n elliptic curve points" (with symbol ◯)
// - M.S: number of "n field elements" (with symbol △)
// - M.E: number of "e * n field elements" (with symbol ⬡)
//
// So the actual memory usage in terms of bytes will be:
//
// M = 32 * n * (2 * M.C + M.S + e * M.E)
//
// We'll ignore other values whose size is sublinear in n.
// 0. In the beginning:
//
// `params` has:
// ◯ powers_of_tau
// ◯ ifft(powers_of_tau)
//
// M.C = 2 (+= 2)
// M.S = 0
// M.E = 0
//
// `pk` has:
// ⬡ l0
// ⬡ l_last
// ⬡ l_active_row
// △ fixed_lagranges (c_f)
// △ fixed_monomials (c_f)
// ⬡ fixed_extended_lagranges (c_f)
// △ permutation_lagranges (c_p)
// △ permutation_monomials (c_p)
// ⬡ permutation_extended_lagranges (c_p)
//
// M.C = 2
// M.S = 2 * c_f + 2 * c_p (+= 2 * c_f + 2 * c_p)
// M.E = 3 + c_f + c_p (+= 3 + c_f + c_p)
//
// And let's ignore `circuit`
// 1. Pad instances into Lagrange form and compute their monomial form.
//
// M.C = 2
// M.S = 2 * c_f + 2 * c_p + 2 * c_i (+= 2 * c_i)
// M.E = 3 + c_f + c_p
let instance_lagranges = instances.to_lagranges();
let instance_monomials = instance_lagranges.to_monomials();
// 2. Synthesize circuit and collect advice column values.
//
// M.C = 2
// M.S = 2 * c_f + 2 * c_p + 2 * c_i + c_a (+= c_a)
// M.E = 3 + c_f + c_p
let advice_lagranges = circuit.synthesize_all_phases();
// 3. Generate permuted inputs and tables of the lookup arguments.
// For each lookup argument, we have:
//
// △ compressed_input_lagranges - cached for later computation
// △ permuted_input_lagranges
// △ permuted_input_monomials
// △ compressed_table_lagranges - cached for later computation
// △ permuted_table_lagranges
// △ permuted_table_monomials
//
// M.C = 2
// M.S = 2 * c_f + 2 * c_p + 2 * c_i + c_a + 6 * c_l (+= 6 * c_l)
// M.E = 3 + c_f + c_p
let (
compressed_input_lagranges,
permuted_input_lagranges,
permuted_input_monomials,
compressed_table_lagranges,
permuted_table_lagranges,
permuted_table_monomials,
) = lookup_permuted();
// 4. Generate grand products of the permutation argument.
//
// M.C = 2
// M.S = 2 * c_f + 2 * c_p + 2 * c_i + c_a + 6 * c_l + c_pg (+= c_pg)
// M.E = 3 + c_f + c_p + c_pg (+= c_pg)
let (
perm_grand_product_monomials,
perm_grand_product_extended_lagranges,
) = permutation_grand_products();
// 5. Generate grand products of the lookup arguments.
// Then drop the Lagrange values that are no longer needed.
//
// M.C = 2
// M.S = 2 * c_f + 2 * c_p + 2 * c_i + c_a + 3 * c_l + c_pg (-= 3 * c_l)
// M.E = 3 + c_f + c_p + c_pg
let lookup_product_monomials = lookup_grand_products();
drop(compressed_input_lagranges);
drop(permuted_input_lagranges);
drop(compressed_table_lagranges);
drop(permuted_table_lagranges);
// 6. Generate random polynomial.
//
// M.C = 2
// M.S = 1 + 2 * c_f + 2 * c_p + 2 * c_i + c_a + 3 * c_l + c_pg (+= 1)
// M.E = 3 + c_f + c_p + c_pg
let random_monomial = random();
// 7. Turn advice_lagranges into advice_monomials.
let advice_monomials = advice_lagranges.to_monomials();
drop(advice_lagranges);
// 8. Generate necessary extended lagranges.
//
// M.C = 2
// M.S = 1 + 2 * c_f + 2 * c_p + 2 * c_i + c_a + 3 * c_l + c_pg
// M.E = 3 + c_f + c_p + c_pg + c_i + c_a (+= c_i + c_a)
let instance_extended_lagranges = instance_monomials.to_extended_lagranges();
let advice_extended_lagranges = advice_monomials.to_extended_lagranges();
// 9. While computing the quotient, these extended lagranges:
//
// ⬡ permuted_input_extended_lagranges
// ⬡ permuted_table_extended_lagranges
// ⬡ lookup_product_extended_lagranges
//
// of each lookup argument are generated on the fly and dropped before the next one.
//
// And 1 extra quotient_extended_lagrange is created. So the peak memory:
//
// M.C = 2
// M.S = 1 + 2 * c_f + 2 * c_p + 2 * c_i + c_a + 3 * c_l + c_pg
// M.E = 4 + c_f + c_p + c_pg + c_i + c_a + 3 * (c_l > 0) (+= 3 * (c_l > 0) + 1)
let quotient_extended_lagrange = quotient_extended_lagrange();
// 10. After the quotient is computed, drop all the other extended lagranges.
//
// M.C = 2
// M.S = 1 + 2 * c_f + 2 * c_p + 2 * c_i + c_a + 3 * c_l + c_pg
// M.E = 4 + c_f + c_p (-= c_pg + c_i + c_a + 3 * (c_l > 0))
drop(instance_extended_lagranges);
drop(advice_extended_lagranges);
drop(perm_grand_product_extended_lagranges);
// 11. Turn quotient_extended_lagrange into monomial form.
// And then cut into `d - 1` pieces.
//
// M.C = 2
// M.S = 2 * c_f + 2 * c_p + 2 * c_i + c_a + 3 * c_l + c_pg + d (+= d - 1)
// M.E = 3 + c_f + c_p (-= 1)
let quotient_monomials = quotient_monomials();
drop(quotient_extended_lagrange);
// 12. Evaluate and open all polynomials except the instance ones.
}
```
From this accounting, the current peak memory usage occurs at step 9:
```
M.C = 2
M.S = 1 + 2 * c_f + 2 * c_p + 2 * c_i + c_a + 3 * c_l + c_pg
M.E = 4 + c_f + c_p + c_pg + c_i + c_a + 3 * (c_l > 0)
```
Here is a script to reproduce the actual peak memory usage: https://gist.github.com/han0110/676ba81904a5e99d5ff7d8961cebaed5.
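As a quick cross-check (separate from the gist above), here is a minimal sketch that plugs the step-9 counts into the byte formula `M = 32 * n * (2 * M.C + M.S + e * M.E)`. The function name and the example parameters are made up for illustration.
```rust
/// Rough byte estimate of the step-9 peak. Parameter names mirror the note;
/// the concrete values in `main` are placeholders, not a real circuit.
fn peak_memory_bytes(k: u32, d: usize, c_f: usize, c_a: usize, c_i: usize, c_p: usize, c_l: usize) -> usize {
    let n = 1usize << k;
    let e = (d - 1).next_power_of_two();
    let c_pg = (c_p + (d - 2) - 1) / (d - 2); // div_ceil(c_p, d - 2)

    // Step-9 counts from the note.
    let m_c = 2;
    let m_s = 1 + 2 * c_f + 2 * c_p + 2 * c_i + c_a + 3 * c_l + c_pg;
    let m_e = 4 + c_f + c_p + c_pg + c_i + c_a + if c_l > 0 { 3 } else { 0 };

    // M = 32 * n * (2 * M.C + M.S + e * M.E)
    32 * n * (2 * m_c + m_s + e * m_e)
}

fn main() {
    // e.g. k = 20, degree-5 circuit, a handful of columns and one lookup.
    println!("{} bytes", peak_memory_bytes(20, 5, 4, 8, 1, 10, 1));
}
```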
## Improvements
### Low-hanging fruits
- After step 5, `instance_lagranges` can already be dropped
- Generate `perm_grand_product_extended_lagranges` on the fly as well, so they don't overlap with the lookup arguments' auxiliary extended lagranges (see the sketch after the formula below)
With these, the peak memory usage reduces to:
```
M.C = 2
M.S = 1 + 2 * c_f + 2 * c_p + c_i + c_a + 3 * c_l + c_pg
M.E = 4 + c_f + c_p + c_i + c_a + max(3 * (c_l > 0), c_pg)
```
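For the second bullet, the idea is an "extend, accumulate, drop" loop: only one extended column is alive at a time, so the extended-domain footprint becomes the maximum rather than the sum. Below is a hypothetical sketch of that pattern, not the `halo2_proofs` API: `Poly`, `extend_to_coset`, and `accumulate_into_quotient` are placeholders that only mimic the allocations.
```rust
struct Poly(Vec<u64>);

fn extend_to_coset(monomial: &Poly, e: usize) -> Poly {
    // Stand-in for the coset-extended FFT; real code would evaluate the
    // monomial over the extended domain. Here we only mimic the allocation.
    let mut values = monomial.0.clone();
    values.resize(monomial.0.len() * e, 0);
    Poly(values)
}

fn accumulate_into_quotient(quotient: &mut Poly, extended: &Poly) {
    // Stand-in for folding this column's constraint contribution into the
    // quotient accumulator.
    for (q, v) in quotient.0.iter_mut().zip(extended.0.iter()) {
        *q = q.wrapping_add(*v);
    }
}

fn main() {
    let (n, e) = (4usize, 4usize);
    let perm_grand_product_monomials = vec![Poly(vec![1, 2, 3, 4]), Poly(vec![5, 6, 7, 8])];
    let mut quotient = Poly(vec![0; n * e]);

    for monomial in &perm_grand_product_monomials {
        // Only one extended column is alive at a time, so the extended-domain
        // peak holds max(3 * (c_l > 0), c_pg) columns rather than their sum.
        let extended = extend_to_coset(monomial, e);
        accumulate_into_quotient(&mut quotient, &extended);
        // `extended` goes out of scope here, before the next column is extended.
    }
}
```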
### Compute quotient chunk by chunk
In fact:
```rust
best_fft(monomial, omega_nd, log2_nd).into_iter().skip(i).step_by(d)
== best_fft(coset(monomial, omega_nd_to_i), omega_n, log2_n)
```
So we can compute the quotient chunk by chunk without holding the extended lagranges of all monomials in memory at once, and the degree no longer affects the peak memory usage.
The performance drawback is that `coset(monomial, omega_nd_to_i)` has to be computed `d - 1` extra times, which seems acceptable because it's not that expensive. A toy check of the identity is sketched below.
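Here is a toy sanity check of that identity, assuming a naive DFT over F_17 in place of `best_fft` and the real scalar field; none of this is halo2 code.
```rust
const P: u64 = 17;

fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

/// Evaluate the polynomial given by `coeffs` at omega^0, ..., omega^(size - 1).
/// A naive O(size^2) DFT standing in for `best_fft`.
fn naive_dft(coeffs: &[u64], omega: u64, size: usize) -> Vec<u64> {
    (0..size)
        .map(|j| {
            coeffs.iter().enumerate().fold(0u64, |acc, (k, &c)| {
                (acc + c * pow_mod(omega, (j * k) as u64)) % P
            })
        })
        .collect()
}

fn main() {
    let (n, d) = (4usize, 4usize); // small domain size and extension factor
    let omega_nd = 3; // primitive 16th root of unity in F_17
    let omega_n = pow_mod(omega_nd, d as u64); // primitive 4th root of unity

    let monomial = [5u64, 7, 11, 2]; // arbitrary polynomial of degree < n

    // Left-hand side: evaluate over the whole extended domain of size n * d,
    // then take every d-th evaluation starting at offset i.
    let extended_evals = naive_dft(&monomial, omega_nd, n * d);

    for i in 0..d {
        let lhs: Vec<u64> = extended_evals.iter().copied().skip(i).step_by(d).collect();

        // Right-hand side: multiply the k-th coefficient by omega_nd^(i * k),
        // i.e. `coset(monomial, omega_nd_to_i)`, then evaluate only over the
        // small domain of size n.
        let omega_nd_to_i = pow_mod(omega_nd, i as u64);
        let coset: Vec<u64> = monomial
            .iter()
            .enumerate()
            .map(|(k, &c)| c * pow_mod(omega_nd_to_i, k as u64) % P)
            .collect();
        let rhs = naive_dft(&coset, omega_n, n);

        assert_eq!(lhs, rhs, "chunk {i} mismatch");
    }
    println!("chunked evaluation matches the full extended-domain FFT");
}
```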
With this, the peak memory usage reduces to:
```
M.C = 2
M.S = 1 + 2 * c_f + 2 * c_p + 2 * c_i + 2 * c_a + 3 * c_l + c_pg
+ max(3 * (c_l > 0), c_pg)
M.E = 4 + c_f + c_p
```
To reduce memory even more, we can also discard the extended lagranges in the proving key and compute them on the fly.
```
M.C = 2
M.S = 7 + 3 * c_f + 3 * c_p + 2 * c_i + 2 * c_a + 3 * c_l + c_pg
+ max(3 * (c_l > 0), c_pg)
M.E = 1
```
So the peak memory usage is roughly 2x the total size of all polynomials.
## References
- https://github.com/scroll-tech/halo2/pull/28 - Original implementation of "compute quotient chunk by chunk"
- https://github.com/taikoxyz/halo2/pull/6 - Adoption by Taiko
- https://github.com/axiom-crypto/halo2/pull/17 - Adoption by Axiom