# opt idea for halo2-based circuit
[toc]
## 0. current situation
## 1. recap: halo2-lib circuits
Circuits that use`halo2-lib` builds upon the simple `FlexGate` which then builds upon the simplest `BaseGate`. For each [`BaseGate`](https://github.com/axiom-crypto/halo2-lib/blob/1cf8eca320406e0c9b9402258e96227af9c88d3b/halo2-base/src/gates/flex_gate/mod.rs#L39-L48), it has
- 1 selector, 1 advice column.
- The custom gate is `sel * (adv + adv' * adv'' - adv''') = 0`.
And the `FlexGate` has [several `BaseGate`](https://github.com/axiom-crypto/halo2-lib/blob/1cf8eca320406e0c9b9402258e96227af9c88d3b/halo2-base/src/gates/flex_gate/mod.rs#L123-L135) inside. The advices in different base_gate instance are connected (or copied) to each other by permutation argument.
For example, if `FlexGate` has $n$ base gates and $m$ constant columns, then it has
- $n$ selector (fixed column);
- $n$ advice columns;
- the permutation argument has $n+m$ columns.
advice columns layout
```
| n advice columns | l lookup advice columns | m lookup constant columns | 1 advice column |
// 1st compression
| 15 (sel*(a+a'*a'' -a''') | 2 (lookup) | 1 (lookup) |
```
For example,
1. scroll's first layer compression circuit has $n = 15$ advice columns, $l = 2$ **lookup advice** columns, 1 **instance** column and $m = 1$ **constant** columns. Therefore, the number of columns in permutation argument is $n+m+l+1 = 19$.
2. scroll's second layer compression circuit has $n = 2$ advice columns, $l = 2$ **lookup advice** columns, 1 **instance** column and $m = 1$ **constant** columns. Therefore, the number of columns in permutation argument is $n+m+l+1 = 6$.
The **main observation** we have is
> The $n$ custom gates are **disjoint** in the sense that to evaluate the $i$-th custom gate, the cpu/gpu only needs to have access to $i$-th selector and $i$-th advice.
The quotient polynomial $$\begin{equation}\begin{split} q(X) & = \sum_{i=0}^{n-1} y^i * \textrm{sel}_i(X)* \big[a_i(X) + a_i(X*\omega)*a_i(X*\omega^2) - a_i(X*\omega^3)\big] \\ & + y^n \sum P_i(X) + y^{n+2t} * \sum L_i(X).\end{split}\end{equation}.$$
Below we use $s_i(X) = \textrm{sel}_i(X)$ to make it easier to read the doc.
|circuit | #(columns) in permutation |t| chunk_len| degree|
|--------|----|----|----|---|
|1st compression| 19| 7| 3| 2^24|
|2nd compression| 6 | 2 | 3| 2^24|
Let's denote by $t$ the number of permutation $z(X)$ polynomials.
1. permutation's quotient polynomial $P(X)$ is RLC of the following $2t$ polynomials:
a. $P_0(X) = l_0(X) * (1 - z_0(X))$;
b. $P_1(X) = l_{\textrm{last}}(X) * \big(z_{t-1}^2(X) - z_{t-1}(X)\big)$;
c. $P_{i+1}(X) = l_0(X)*\big(z_i(X) - z_{i-1}(\omega^{\textrm{last}}*X) \big)$ for $1 \le i < t$;
d. $$\begin{equation}\begin{split} P_{t+i}(X) & = (1 - (l_{\textrm{last}}(X) + l_{\textrm{blind}}(X)) * \\ & \big(z_i(\omega*X) (a_{3i}(X) + \beta \sigma_{3i}(X) + \gamma)*(a_{3i+1}(X) + \beta \sigma_{3i+1}(X) + \gamma) *(\cdots) - \\ & z_i(X)*(a_{3i}(X) + \beta \delta^{3i} X + \gamma)*(a_{3i+1}(X) + \beta \delta^{3i+1} X + \gamma)*(\cdots) \big) \end{split}\end{equation}$$ for $0 \le i < t$;
2. lookup's quotient polynomial $L(X)$ is easy as each lookup can be handled separately.
<!--is RLC of the following polynomials:
a. (logup / mv-lookup) variant:
- $L_{3i+0}(X) = l_0(X)*\phi(X)$;
- $L_{3i+1}(X) = l_{\textrm{last}}(X)*\phi(X)$;
- $L_{3i+2}(X) = (1- l_{\textrm{last}}(X) - l_{\textrm{blind}}(X))*(\textrm{lhs} - \textrm{rhs})$
b. old halo2 lookup variant:
- $L_{5i+0}(X) = l_0(X) * (1 - z(X))$;
- $L_{5i+1}(X) = l_{\textrm{last}}(X) * (z^2(X) - z(X))$;
- $L_{5i+2}(X)$;
- $L_{5i+3}(X) = l_0(X) * (a'(X) - s'(X))$;
- $L_{5i+4}(X) = (1 - l_{\textrm{last}}(X) - l_{\textrm{blind}}(X))*(a'(X) - s'(X))*(a'(X) - a'(\omega^{-1}*X))$.
-->
### 1.1 how to handle permutation
Let's denote by $c$ the chunk length of permutation argument. The most common value of $c$ is 3.
Let's say we have $s$ GPUs ($s$ can be one of $[1, 2, 4]$), then we need to transfer/store the following polynomials to the $j$-th gpu:
1. $z_j(X), z_{s+j}(X), z_{2s+j}(X), \ldots$;
2. $a_{3*j}(X), a_{3j+1}(X), a_{3j+2}(X), a_{3(s+j)}(X), a_{3(s+j)+1}(X), \ldots$;
3. $\sigma_{3*j}(X), \sigma_{3j+1}(X), \sigma_{3j+2}(X), \sigma_{3(s+j)}(X), \sigma_{3(s+j)+1}(X), \ldots$;
4. $l_0(X), l_{\textrm{blind}}(X), l_{\textrm{last}}(X)$: its lagrange form is simple;
That is,
- To evaluate grand product $z(X)$ for one permutation, we need to store $2*c+1 = 7$ polynomials.
- to handle one permutation on $j$-th gpu, we need to store at most $2c+2 = 8$ polynomials (we do not need to transfer $\sigma(X)$ as we can generate it on gpu).
## 2. new workflow
0. phase1:
After finished the circuit synthesis, **compress** all advice polynomials $a_i(X)$ and selector polynomials $s_i(X)$ on cpu side.
Store the compressed version in main memory.
1. phase3: permutation
For the $i$-th permutation, the result grand product poly $z_i(X)$ be stored in $j = i \pmod{s}$-th gpu.
- decode the polynomials $a_{3i}(X), a_{3i+1}(X), a_{3i+2}(X)$ from compressed form on the gpu side.
- generate the sigma polynomials $\sigma_{3i}(X), \sigma_{3i+1}(X), \sigma_{3i+2}(X)$ on the gpu side.
- compute the $z_i(X)$ from the above two set $(a, s)$ of polynomials on the gpu side.
- compute the commit of $z_i(X)$.
- store
```rust
// sigma polys are generated automatically.
// c = chunk_len
// sigma_{c*i}, sigma_{ci+1}, ...., sigma_{ci+c-1}.
fn compute_perm_z_gpu(&self, i: usize, cols: &[PolyCompressed], gpu_idx: usize) -> MemObject;
fn compute_perm_z_cpu(&self, i: usize, cols: &[Polynomial]);
// z0(X) stored at gpu 0 -> gpu 1
// z1(X) stored at gpu 1 -> gpu 2
// z2(X) stored at gpu 2 -> gpu 3
// z3(X) stored at gpu 3 -> gpu 0
// ...
```
2. phase4: quotient
**For the $i$-th permutation, all relevant polynomials will be stored in $j = i \pmod{s}$-th gpu.** Each gpu has 1 accumulating polynomial $Q_j(X)$.
a. run ifft_cosetfft_part on $z_i(X)$ and copy it back to cpu;
b. after all $s$ GPUs have send back then each gpu copy $z_{i-1}(X)$ to its memory, then go to step c.
c. generate the sigma polynomials $\sigma_{3i}(X), \sigma_{3i+1}(X), \sigma_{3i+2}(X)$ on the gpu side.
d. do the following steps
1. if i == 0, then $Q_j(X) \leftarrow Q_j(X) + y^n * l_0(X) *(1 - z_i(X))$;
2. else if i == t-1, then $Q_j(X) \leftarrow Q_j(X) + y^{n+t-1} * l_{\textrm{last}}(X) * (z_i^2(X) - z_i(X))$;
3. else $Q_j(X) \leftarrow Q_j(X) + y^{n+i} * l_0(X)*\big(z_i(X) - z_{i-1}(\omega^{\textrm{last}}*X) \big)$;
4. \begin{equation}\begin{split} Q_j(X) \leftarrow & Q_j(X) + y^{n+i}* (1 - (l_{\textrm{last}}(X) + l_{\textrm{blind}}(X)) * \\ & \big(z_i(\omega*X) (a_{3i}(X) + \beta \sigma_{3i}(X) + \gamma)*(a_{3i+1}(X) + \beta \sigma_{3i+1}(X) + \gamma) *(\cdots) - \\ & z_i(X)*(a_{3i}(X) + \beta \delta^{3i} X + \gamma)*(a_{3i+1}(X) + \beta \delta^{3i+1} X + \gamma)*(\cdots) \big) \end{split}\end{equation}
e. if we need to compute custom gate quotient, then
for k in `[3i, 3i+1, 3i+2]`, do the following steps:
1. decode the $s_k(X)$ from its compressed form on the gpu side.
2. decode the $a_k(X)$ from its compressed form on the gpu side.
3. $Q_j(X) \leftarrow Q_j(X) + y^k * s_k(X)*(a_k(X) + a_k(X*\omega)*a_k(X*\omega^2) - a_k(X*\omega^3))$.
f. if we need to compute lookup quotient
TODO: add mv-lookup quotient here.
the final quotient polynomial $Q(X) = \sum_{j=0}^{s-1} Q_j(X)$.
## 3. evaluation
## Appendix
Here we list the number of polynomials that common consumer-level GPUs can hold.
|gpu| log(degree)| #(polys)|
|-|-|-|
|RTX 3080 (10G) | 21 | 160|
|RTX 3080 (10G) | 22 | 80 |
|RTX 3080 (10G) | 23 | 40 |
|RTX 3080 (10G) | **24** | **20** |
|RTX 3080 (10G) | 25 | 10 |
|RTX 3080 (10G) | 26 | 5 |
|RTX 3090 (24G) | 21 | 384|
|RTX 3090 (24G) | 22 | 192|
|RTX 3090 (24G) | 23 | 96|
|RTX 3090 (24G) | **24** | **48**|
|RTX 3090 (24G) | **26** | **12**|