optimizing quotient poly computation of halo2-lib favored circuits.

# opt idea for halo2-based circuit [toc] ## 0. current situation ## 1. recap: halo2-lib circuits Circuits that use`halo2-lib` builds upon the simple `FlexGate` which then builds upon the simplest `BaseGate`. For each [`BaseGate`](https://github.com/axiom-crypto/halo2-lib/blob/1cf8eca320406e0c9b9402258e96227af9c88d3b/halo2-base/src/gates/flex_gate/mod.rs#L39-L48), it has - 1 selector, 1 advice column. - The custom gate is `sel * (adv + adv' * adv'' - adv''') = 0`. And the `FlexGate` has [several `BaseGate`](https://github.com/axiom-crypto/halo2-lib/blob/1cf8eca320406e0c9b9402258e96227af9c88d3b/halo2-base/src/gates/flex_gate/mod.rs#L123-L135) inside. The advices in different base_gate instance are connected (or copied) to each other by permutation argument. For example, if `FlexGate` has $n$ base gates and $m$ constant columns, then it has - $n$ selector (fixed column); - $n$ advice columns; - the permutation argument has $n+m$ columns. advice columns layout ``` | n advice columns | l lookup advice columns | m lookup constant columns | 1 advice column | // 1st compression | 15 (sel*(a+a'*a'' -a''') | 2 (lookup) | 1 (lookup) | ``` For example, 1. scroll's first layer compression circuit has $n = 15$ advice columns, $l = 2$ **lookup advice** columns, 1 **instance** column and $m = 1$ **constant** columns. Therefore, the number of columns in permutation argument is $n+m+l+1 = 19$. 2. scroll's second layer compression circuit has $n = 2$ advice columns, $l = 2$ **lookup advice** columns, 1 **instance** column and $m = 1$ **constant** columns. Therefore, the number of columns in permutation argument is $n+m+l+1 = 6$. The **main observation** we have is > The $n$ custom gates are **disjoint** in the sense that to evaluate the $i$-th custom gate, the cpu/gpu only needs to have access to $i$-th selector and $i$-th advice. The quotient polynomial $$\begin{equation}\begin{split} q(X) & = \sum_{i=0}^{n-1} y^i * \textrm{sel}_i(X)* \big[a_i(X) + a_i(X*\omega)*a_i(X*\omega^2) - a_i(X*\omega^3)\big] \\ & + y^n \sum P_i(X) + y^{n+2t} * \sum L_i(X).\end{split}\end{equation}.$$ Below we use $s_i(X) = \textrm{sel}_i(X)$ to make it easier to read the doc. |circuit | #(columns) in permutation |t| chunk_len| degree| |--------|----|----|----|---| |1st compression| 19| 7| 3| 2^24| |2nd compression| 6 | 2 | 3| 2^24| Let's denote by $t$ the number of permutation $z(X)$ polynomials. 1. permutation's quotient polynomial $P(X)$ is RLC of the following $2t$ polynomials: a. $P_0(X) = l_0(X) * (1 - z_0(X))$; b. $P_1(X) = l_{\textrm{last}}(X) * \big(z_{t-1}^2(X) - z_{t-1}(X)\big)$; c. $P_{i+1}(X) = l_0(X)*\big(z_i(X) - z_{i-1}(\omega^{\textrm{last}}*X) \big)$ for $1 \le i < t$; d. $$\begin{equation}\begin{split} P_{t+i}(X) & = (1 - (l_{\textrm{last}}(X) + l_{\textrm{blind}}(X)) * \\ & \big(z_i(\omega*X) (a_{3i}(X) + \beta \sigma_{3i}(X) + \gamma)*(a_{3i+1}(X) + \beta \sigma_{3i+1}(X) + \gamma) *(\cdots) - \\ & z_i(X)*(a_{3i}(X) + \beta \delta^{3i} X + \gamma)*(a_{3i+1}(X) + \beta \delta^{3i+1} X + \gamma)*(\cdots) \big) \end{split}\end{equation}$$ for $0 \le i < t$; 2. lookup's quotient polynomial $L(X)$ is easy as each lookup can be handled separately.  ### 1.1 how to handle permutation Let's denote by $c$ the chunk length of permutation argument. The most common value of $c$ is 3. Let's say we have $s$ GPUs ($s$ can be one of $[1, 2, 4]$), then we need to transfer/store the following polynomials to the $j$-th gpu: 1. $z_j(X), z_{s+j}(X), z_{2s+j}(X), \ldots$; 2. $a_{3*j}(X), a_{3j+1}(X), a_{3j+2}(X), a_{3(s+j)}(X), a_{3(s+j)+1}(X), \ldots$; 3. $\sigma_{3*j}(X), \sigma_{3j+1}(X), \sigma_{3j+2}(X), \sigma_{3(s+j)}(X), \sigma_{3(s+j)+1}(X), \ldots$; 4. $l_0(X), l_{\textrm{blind}}(X), l_{\textrm{last}}(X)$: its lagrange form is simple; That is, - To evaluate grand product $z(X)$ for one permutation, we need to store $2*c+1 = 7$ polynomials. - to handle one permutation on $j$-th gpu, we need to store at most $2c+2 = 8$ polynomials (we do not need to transfer $\sigma(X)$ as we can generate it on gpu). ## 2. new workflow 0. phase1: After finished the circuit synthesis, **compress** all advice polynomials $a_i(X)$ and selector polynomials $s_i(X)$ on cpu side. Store the compressed version in main memory. 1. phase3: permutation For the $i$-th permutation, the result grand product poly $z_i(X)$ be stored in $j = i \pmod{s}$-th gpu. - decode the polynomials $a_{3i}(X), a_{3i+1}(X), a_{3i+2}(X)$ from compressed form on the gpu side. - generate the sigma polynomials $\sigma_{3i}(X), \sigma_{3i+1}(X), \sigma_{3i+2}(X)$ on the gpu side. - compute the $z_i(X)$ from the above two set $(a, s)$ of polynomials on the gpu side. - compute the commit of $z_i(X)$. - store ```rust // sigma polys are generated automatically. // c = chunk_len // sigma_{c*i}, sigma_{ci+1}, ...., sigma_{ci+c-1}. fn compute_perm_z_gpu(&self, i: usize, cols: &[PolyCompressed], gpu_idx: usize) -> MemObject; fn compute_perm_z_cpu(&self, i: usize, cols: &[Polynomial]); // z0(X) stored at gpu 0 -> gpu 1 // z1(X) stored at gpu 1 -> gpu 2 // z2(X) stored at gpu 2 -> gpu 3 // z3(X) stored at gpu 3 -> gpu 0 // ... ``` 2. phase4: quotient **For the $i$-th permutation, all relevant polynomials will be stored in $j = i \pmod{s}$-th gpu.** Each gpu has 1 accumulating polynomial $Q_j(X)$. a. run ifft_cosetfft_part on $z_i(X)$ and copy it back to cpu; b. after all $s$ GPUs have send back then each gpu copy $z_{i-1}(X)$ to its memory, then go to step c. c. generate the sigma polynomials $\sigma_{3i}(X), \sigma_{3i+1}(X), \sigma_{3i+2}(X)$ on the gpu side. d. do the following steps 1. if i == 0, then $Q_j(X) \leftarrow Q_j(X) + y^n * l_0(X) *(1 - z_i(X))$; 2. else if i == t-1, then $Q_j(X) \leftarrow Q_j(X) + y^{n+t-1} * l_{\textrm{last}}(X) * (z_i^2(X) - z_i(X))$; 3. else $Q_j(X) \leftarrow Q_j(X) + y^{n+i} * l_0(X)*\big(z_i(X) - z_{i-1}(\omega^{\textrm{last}}*X) \big)$; 4. \begin{equation}\begin{split} Q_j(X) \leftarrow & Q_j(X) + y^{n+i}* (1 - (l_{\textrm{last}}(X) + l_{\textrm{blind}}(X)) * \\ & \big(z_i(\omega*X) (a_{3i}(X) + \beta \sigma_{3i}(X) + \gamma)*(a_{3i+1}(X) + \beta \sigma_{3i+1}(X) + \gamma) *(\cdots) - \\ & z_i(X)*(a_{3i}(X) + \beta \delta^{3i} X + \gamma)*(a_{3i+1}(X) + \beta \delta^{3i+1} X + \gamma)*(\cdots) \big) \end{split}\end{equation} e. if we need to compute custom gate quotient, then for k in `[3i, 3i+1, 3i+2]`, do the following steps: 1. decode the $s_k(X)$ from its compressed form on the gpu side. 2. decode the $a_k(X)$ from its compressed form on the gpu side. 3. $Q_j(X) \leftarrow Q_j(X) + y^k * s_k(X)*(a_k(X) + a_k(X*\omega)*a_k(X*\omega^2) - a_k(X*\omega^3))$. f. if we need to compute lookup quotient TODO: add mv-lookup quotient here. the final quotient polynomial $Q(X) = \sum_{j=0}^{s-1} Q_j(X)$. ## 3. evaluation ## Appendix Here we list the number of polynomials that common consumer-level GPUs can hold. |gpu| log(degree)| #(polys)| |-|-|-| |RTX 3080 (10G) | 21 | 160| |RTX 3080 (10G) | 22 | 80 | |RTX 3080 (10G) | 23 | 40 | |RTX 3080 (10G) | **24** | **20** | |RTX 3080 (10G) | 25 | 10 | |RTX 3080 (10G) | 26 | 5 | |RTX 3090 (24G) | 21 | 384| |RTX 3090 (24G) | 22 | 192| |RTX 3090 (24G) | 23 | 96| |RTX 3090 (24G) | **24** | **48**| |RTX 3090 (24G) | **26** | **12**|