# Designing a plonk circuit for Liam Eagen's protocol

## Liam Eagen's protocol

First of all, let me describe my version of Liam's protocol (it uses an arbitrary negative decomposition base; it is -3 in the paper, but lookups skew this choice a bit).

Let us denote by ``p`` the prime order of the curve group, and by ``B`` the decomposition base. We set ``D = ceil(log(p, B) / 2)``, which is the number of B-digits in each scalar. That means scalars can only be up to roughly $\sqrt p$, which is done to ensure that they do not overflow. A full-range MSM will require twice as many operations. I also assume that we are in a fixed-base scenario; this is not an important restriction, but it is convenient for me now. We denote by ``N`` our total number of points.

1. We are given arrays ``scalars[N], points[N]``. We claim that their inner product is a point ``Q``.

2. We decompose each scalar into a ``scalars_buckets[N][B-1]`` array: each ``scalars[i]`` equals the sum ``sum<j>(dgt_enc(j) * scalars_buckets[i][j])``, where ``dgt_enc(j)`` is the ``j``-th nonzero digit (we will use digits from ``-B//2`` to ``B//2`` for odd ``B``, and from ``-B/2+1`` to ``B/2`` for even ``B``).

   Each ``scalars_buckets`` entry only has digits ``1`` or ``0``, and is obtained by collecting all digits of the corresponding scalar equal to ``dgt_enc(j)``. This is checked via splitting into limbs, and a lookup.

3. We compute an array of coefficient functions ``Div[D]``, where each ``Div[i]`` is a function of the form ``a(x) + y*b(x)``, with maximal possible degrees satisfying ``deg(a) == ceil(N/2) + B``, ``deg(b) == floor(N/2) + B``. That means each such function has a total of ``N+B+1`` coefficients.

   These functions are computed as functions vanishing in collections of ``N+2`` points, obtained in the following fashion.

   Denote by ``points_table[j][i] = points[i] * digit(scalar[i], j, B)`` the table whose rows are indexed by digit positions, where the function ``digit`` computes the ``j``-th digit of ``scalar[i]`` in the ``-B``-base system.
Now, do the following:

```
acc = E(0)
for j in range(D):
    for _ in range(B):
        points_table[D-j-1].append(-acc)
    acc = E(0)
    for pt in points_table[D-j-1]:
        acc += pt
    points_table[D-j-1].append(-acc)
```

Here ``E(0)`` denotes the neutral element of our group. What happens here is: we compute the accumulated sum of each row (starting from the higher ones), and append its negation ``B`` times to the next row before computing the next accumulator, et cetera. ``Div[i]`` is a vanishing function of the ``i``-th row of this table. We must allocate these polynomials into our circuit in the first phase.

4. A new round starts: we get a challenge point ``A`` from the verifier, and set ``B = -2*A``. Denote also by ``z`` the line function passing through ``A`` and ``B``, with ``z = y - t*x + f``, ``t`` being the slope. Compute

   ``c2 = 2*B.y*(A.x - B.x) / (3*B.x**2 - 2*t*B.y)``

Now, we need to compute in-circuit the values ``Div[i](A), Div[i](B)`` and their derivatives along the tangent direction of the curve, ``Der[i] = Div[i].diff(x) + (3*x**2 / (2*y)) * Div[i].diff(y)``, also in the points ``A`` and ``B``. We will use an additional challenge to do this more efficiently; this will be explained later.

```
lhs = 0
for i in range(D):
    lhs += (-B)**i * (c2 * Der[i](B)/Div[i](B) - (c2 + 2*t) * Der[i](A)/Div[i](A))
```

(here ``-B`` is the decomposition base, not the point ``B``). This must be equal to rhs:

```
rhs = 0
for i in range(N):
    for w in range(B-1):
        dgt = dgt_enc(w)
        pt = dgt * points[i]
        rhs += scalars_buckets[i][w] * (A.x - pt.x) / (-z(pt))
rhs += (A.x - Q.x) / (-z(Q))
```

## Optimizations

In this part, I will describe an efficient digit-splitting strategy, and a strategy to evaluate polynomials and their derivatives in a point.

### Digit splitting

Denote by ``T`` the logarithm of the size of the available lookup table. It will likely be in the range ~15-20 for the cases we are willing to consider. Then, we populate this table with the following entries: $a_0 + a_1 (-B) + a_2 (-B)^2 + \ldots + a_{T-1} (-B)^{T-1}$, with all $a_i$-s being $0$ or $1$ (base ``-B`` numbers with binary digits).
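Before moving on, the bookkeeping above is easy to sanity-check over the integers: replace the curve group by $(\mathbb{Z}, +)$, so ``E(0) = 0`` and points are plain integers. The sketch below (helper names are mine, base -5, small demo scalars) decomposes scalars into negative-base digits, builds ``points_table``, runs the accumulator loop verbatim, and checks that every row sums to zero (so a function vanishing on it can exist) while the final accumulator equals the claimed inner product.

```python
B = 5   # decomposition base (scalar digits live in base -B)
D = 8   # number of digits; enough for the small demo scalars below

def digit_decompose(n, base=B, width=D):
    """Digits of n in base -base, each in {-(base//2), ..., base//2}."""
    digits = []
    for _ in range(width):
        r = n % base
        if r > base // 2:
            r -= base          # map 3, 4 to -2, -1 for base 5
        digits.append(r)
        n = (n - r) // (-base)  # exact division: n - r is divisible by base
    assert n == 0, "scalar too large for the chosen width"
    return digits

# Stand-in for curve points: the additive group of integers.
scalars = [1234, 5678, 910]
points  = [3, 7, 11]
N = len(scalars)

# points_table[j][i] = points[i] * digit(scalar[i], j)
points_table = [[points[i] * digit_decompose(scalars[i])[j] for i in range(N)]
                for j in range(D)]

# The accumulator loop from the protocol description, verbatim.
acc = 0
for j in range(D):
    for _ in range(B):
        points_table[D-j-1].append(-acc)
    acc = 0
    for pt in points_table[D-j-1]:
        acc += pt
    points_table[D-j-1].append(-acc)

# Every row sums to the neutral element, so a vanishing function exists...
assert all(sum(row) == 0 for row in points_table)
# ...and acc recovers the inner product (Horner evaluation in base -B).
assert acc == sum(s * p for s, p in zip(scalars, points))
```

The second assertion is exactly the Horner recurrence ``acc_k = row_sum_k - B * acc_{k+1}``: appending ``-acc`` to the next row ``B`` times is how multiplication by the base ``-B`` is expressed purely through group additions.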
Every ``scalars_buckets[i][j]`` must be split into ``L = ceil(D/T)`` limbs, which are individually looked up in this table. This is the main bulk of the work in the computation of the rhs.

### Computing polynomials

Considering the computation of the lhs, the principal part is that we need to commit to all the coefficients of the polynomials, and then evaluate them and their derivatives in a few points.

Normally, evaluating a polynomial in a point with Horner's method would require a witness of roughly the same size as the length of the original polynomial. This is likely acceptable in scenarios where we are bandwidth-constrained. If we are aiming for mainnet, theoretically, we do not care that much about using a few more rotations (in SHPlonK, these require only a few more field operations). Therefore, I'd like to suggest an optional optimization for computing a polynomial P in a public challenge point x:

1. Create a gate of the form $a_0 + a_1 x + \ldots + a_{9} x^9 = b_0$, and apply it to each batch of 10 consecutive coefficients of the polynomial.
2. Compute $b_0 + x^{10} b_{10} + x^{20} b_{20} + \ldots$ using Horner's method.

This can be optimized even further by combining both operations in the same gate:

$b_{10i} = x^{10} b_{10(i+1)} + a_{10i} + a_{10i+1} x + \ldots + a_{10i+9} x^9$

This means that the column b will have a lot of spare cells; only 10% of its cells are taken by this computation. 10, of course, is just an example here.

Another optimization stems from noticing that the computation of the derivative is a linear operation. This means that, using an *additional challenge round*, we can commit to the values ``Div[i](A)``, ``Div[i](B)``, ``Der[i](A)`` and ``Der[i](B)``, then sample a challenge and only evaluate a single random linear combination of these polynomials. This gives roughly a 4x advantage, and the linear combination itself can be computed using a similar method.

## Estimates

Before we continue further, let's determine the tradeoffs. When we increase the base B, the amount of work needed to compute the lhs decreases (as D decreases).
However, the amount of work needed to compute the rhs increases. The following table helps to understand these dynamics.

LHS-1 corresponds to the first phase of computing the LHS, i.e. the size of all coefficients of all polynomials, D\*N. As we have explained, if we do not use our optimization trick, the rest is ~ the same amount of work; if we do use it, it is ~10% (or whatever fan-in we choose).

RHS-1 = N + N\*(B-1) + N\*(B-1)\*L: the total number of scalars + buckets + limbs of these buckets. L is computed as ceil(D/T). Here are the stats for T=15 (witness size consumed per point):

| BASE | LHS-1 | RHS-1 | lookup |
| ---- | ----- | ----- | ------ |
| 2    | 127   | 11    | 9      |
| 3    | 81    | 15    | 12     |
| 4    | 64    | 19    | 15     |
| 5    | 55    | 21    | 16     |
| 6    | 50    | 26    | 20     |
| 7    | 46    | 31    | 24     |
| 8    | 43    | 29    | 21     |
| 9    | 41    | 33    | 24     |
| 10   | 39    | 37    | 27     |
| 11   | 37    | 41    | 30     |
| 12   | 36    | 45    | 33     |
| 13   | 35    | 49    | 36     |
| 14   | 34    | 53    | 39     |
| 15   | 33    | 57    | 42     |
| 16   | 32    | 61    | 45     |
| 17   | 32    | 65    | 48     |
| 18   | 31    | 69    | 51     |
| 19   | 30    | 55    | 36     |
| 20   | 30    | 58    | 38     |

We are also guaranteed to spend some work on lookups, and at least some work on computing the polynomials. If we are in a bandwidth-constrained scenario, we should likely choose a relatively large base and avoid using high fan-in gates. Here, I describe an optimization with prover performance and gas cost in mind (but not proof size) - i.e. I will use a relatively large number of rotations and will aim to minimize the number of columns.

## Suggested layout

Here, I assume N ~ 1000. If it is larger, say 10000, it is possible to use a larger lookup table.

Our circuit will have 3 phases: the first phase consists of the column ``a``, the second of ``b``, and the third will also have 1 column, ``c``. It will use a custom lookup argument and no additional variable columns, with ``T=15``, ``B=5``.

---

### Column ``a``

In the column ``a``, we allocate all the coefficients of our polynomials, with the same-index coefficients of different polynomials placed consecutively.
This takes ``55*(N+2)`` rows.

---

### Column ``b``

In the second phase, we obtain the public challenge point ``A`` from the verifier, and compute ``B``, ``f``, ``t`` (and the function ``z = y - t*x + f``); we also compute the slopes ``sl_A = 3*A.x**2 / (2*A.y)`` and ``sl_B = 3*B.x**2 / (2*B.y)``.

In the column ``b``, we allocate our scalars, in the following order:

|row|b|
|---|---|
|0|sc[0]|
|1|sc_bucket_neg_2[0]|
|2|sc_bucket_neg_2_limb[0]|
|3|sc_bucket_neg_2_limb[1]|
|4|sc_bucket_neg_2_limb[2]|
|5|sc_bucket_neg_2_limb[3]|
|6|sc_bucket_neg_1[0]|
|7|sc_bucket_neg_1_limb[0]|
|8|sc_bucket_neg_1_limb[1]|
|9|sc_bucket_neg_1_limb[2]|
|10|sc_bucket_neg_1_limb[3]|
|11|sc_bucket_1[0]|
|12|sc_bucket_1_limb[0]|
|13|sc_bucket_1_limb[1]|
|14|sc_bucket_1_limb[2]|
|15|sc_bucket_1_limb[3]|
|16|sc_bucket_2[0]|
|17|sc_bucket_2_limb[0]|
|18|sc_bucket_2_limb[1]|
|19|sc_bucket_2_limb[2]|
|20|sc_bucket_2_limb[3]|
|21|0|
|22|sc[1]|
|...|...|

And activate the following gates:

``GATE_SC_WELL_FORMED: b - ((-2)*b[+1] + (-1)*b[+6] + b[+11] + 2*b[+16])``

which is activated if ``row%22 == 0``, and

``GATE_BUCKET_WELL_FORMED: b - b[+1] - (-5)**15 * b[+2] - (-5)**30 * b[+3] - (-5)**45 * b[+4]``

This occupies ``22*N`` rows. Next, we leave one empty row. After that, we allocate ``ceil(2^15 * (22/20)) = 36045`` rows, and here we note that for N=1000 this is very similar in length to the original ``55*N``, which is the height of the first column. For much larger N, an inconvenient gap will occur; this can be compensated for by increasing the lookup table. This rapidly gets messy, and the gains are relatively small, though not exactly negligible. To give an example, for N=10k, using a 2^18-sized table and base 6, we obtain LHS-1 = 50 and RHS-1 = 21, which leaves a gap of 29N - enough to fit 2^18 ~ 262k rows. This would give a solid 10% improvement. (I assume a tighter fit can be obtained if we use more columns, but I am refusing to do it to keep a clear optimization objective.)
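Both well-formedness gates can be checked against a concrete allocation. The sketch below (helper names are mine) builds the 22-row block of column ``b`` for a single scalar with ``B=5``, ``T=15``, ``L=4``, ``D=55``, treating buckets and limbs as base -5 numbers with digits 0/1, and verifies both gate identities:

```python
B, T, L, D = 5, 15, 4, 55   # base, table log-size, limbs per bucket, digits
DIGITS = [-2, -1, 1, 2]     # dgt_enc(j) for j = 0..3 (nonzero digits for B = 5)

def digits_neg5(n, width=D):
    """Digits of n in base -5, each in {-2, -1, 0, 1, 2}."""
    out = []
    for _ in range(width):
        r = n % B
        if r > B // 2:
            r -= B
        out.append(r)
        n = (n - r) // (-B)
    assert n == 0
    return out

sc = 123456789
ds = digits_neg5(sc)

# bits[j][k] == 1 iff the k-th digit of sc equals dgt_enc(j)
bits = [[1 if d == DIGITS[j] else 0 for d in ds] for j in range(4)]
# bucket j is a base -5 number with 0/1 digits collecting those positions...
buckets = [sum(b * (-B)**k for k, b in enumerate(bt)) for bt in bits]
# ...and splits into L = 4 limbs of at most T = 15 base -5 digits each
limbs = [[sum(bt[T*m + k] * (-B)**k for k in range(T) if T*m + k < D)
          for m in range(L)] for bt in bits]

# Lay out one 22-row block of column b.
col_b = [0] * 22
col_b[0] = sc
for j in range(4):
    col_b[1 + 5*j] = buckets[j]
    for m in range(L):
        col_b[2 + 5*j + m] = limbs[j][m]

# GATE_SC_WELL_FORMED at row 0: sc is the digit-weighted sum of its buckets.
assert col_b[0] == -2*col_b[1] - col_b[6] + col_b[11] + 2*col_b[16]
# GATE_BUCKET_WELL_FORMED at rows 1, 6, 11, 16: bucket == limbs recombined.
for j in range(4):
    r = 1 + 5*j
    assert col_b[r] == col_b[r+1] + (-5)**15*col_b[r+2] \
                     + (-5)**30*col_b[r+3] + (-5)**45*col_b[r+4]
```

In the circuit the same identities hold modulo the scalar field; integers are used here only so the check is self-contained.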
In the rows from ``22*N+1`` to ``22*N+36045+1`` we put the consecutive values $q_i$, where $q_i$ is the number of limbs which turned out to be equal to $s_0 + (-5) s_1 + \ldots + (-5)^{14} s_{14}$, where $\{s_j\}$ is the bit encoding of $i$. We also skip every row whose index mod 22 equals either 0 or 21. These will be used in our lookup argument.

Next, we leave one empty row once again. In the next $4D = 220$ rows we allocate the claimed values ``Div[i](A)``, ``Div[i](B)``, ``Der[i](A)``, ``Der[i](B)``.

---

We will use selectors to combine these gates (and will reuse them later):

``S1 * GATE_SC_WELL_FORMED + S2 * GATE_BUCKET_WELL_FORMED``

In the first 22N rows, S1 is activated provided ``row%22 == 0``, and S2 if ``row%22 in [1, 6, 11, 16]``. In all other rows they are 0. This ends the second phase.

---

### Column ``c``

**Lookup**

In the third phase, we obtain two independent challenges: ``v`` for the lookup argument, and ``r`` for the random linear combination of coefficients.

Now, we want to use the lookup argument. Let us denote by ``t`` our precomputed table column of base -5 limbs with digits 0 and 1. Assume this table is allocated in rows ``22*N`` to ``22*N + 36045``, and assume that all other values in ``t`` are 0. We will use the log-derivative lookup argument: compute the sum of $q_i / (v - t_i)$, and check that it is equal to the sum of $1/(v-\text{limb})$ over all limbs.

To compute the lookup lhs, we use the following expression:

``LOOKUP_GATE_LHS = (c - c[-1]*CONST[-1] - c[-3]*(1 - CONST[-1]))*(v - t) - b``

where, in the table region, the table of constants satisfies CONST == 1 provided that ``row%22 != 21 and row%22 != 20``, and 0 otherwise. This gate is activated by a selector ``S_TABLE``, which works in the table region (from 22\*N to 22\*N + 36045), skipping rows equal to 0 and 21 mod 22.
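Before wiring the rhs into gates, the log-derivative identity itself can be checked in a few lines: with multiplicities $q_i$ counting how often table entry $t_i$ occurs among the limbs, $\sum_i q_i/(v - t_i) = \sum 1/(v - \text{limb})$ for any challenge $v$ outside the table. A toy sketch (the field size and the values are illustrative, not the circuit's actual parameters):

```python
P = 2**61 - 1  # a toy prime, standing in for the circuit's scalar field

def inv(x):
    """Field inverse via Fermat's little theorem."""
    return pow(x, P - 2, P)

# A tiny table of base -5 limb values with 0/1 digits, reduced mod P...
table = [x % P for x in [0, 1, -5, 1 - 5]]
# ...and looked-up limb values; every one of them occurs in the table.
limbs = [x % P for x in [1, 1, -5, 0, 1, 1 - 5]]

# Multiplicities: q_i = number of limbs equal to table entry t_i.
q = [limbs.count(t) for t in table]

v = 123456789  # in the real protocol this is a verifier challenge
lhs = sum(qi * inv((v - ti) % P) for qi, ti in zip(q, table)) % P
rhs = sum(inv((v - li) % P) for li in limbs) % P
assert lhs == rhs  # the log-derivative lookup identity
```

The gates below compute exactly these two running sums in column ``c``, with the divisions expressed as products against ``(v - t)`` and ``(v - b)``.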
To compute the lookup rhs, we use the following gates:

``LOOKUP_GATE_RHS_1 = (c - c[-1]) * (v - b) - 1``

``LOOKUP_GATE_RHS_2 = (c - c[-2]) * (v - b) - 1``

``LOOKUP_GATE_RHS_3 = (c - c[-3]) * (v - b) - 1``

We can reuse some selectors (and introduce a new one, S3, which is active if ``row%22 in [3,4,5, 8,9,10, 13,14,15, 18,19,20]``), together with S_TABLE, which is active in the rows corresponding to the table:

``LOOKUP_GATE_RHS_3 * S1[-2] + LOOKUP_GATE_RHS_2 * (S2[-1] - S1[-2]) + S3 * LOOKUP_GATE_RHS_1 + S_TABLE * LOOKUP_GATE_LHS``

Now, we constrain:

``c[-1] === 0`` (lookup rhs init)

``c[22*N] === 0`` (lookup lhs init)

``c[22*N + 36045] === c[22*N - 1]`` (lhs = rhs)

*Note: this is the first time we have actually used a copy constraint. It might be theoretically possible to avoid using copy constraints completely, but I am not going to risk my sanity for such an optimization. Moreover, the number of selectors I would need to introduce in the process is likely to eat all the performance gains I could obtain this way.*

*Therefore, I will instead be satisfied if copy constraints are only used in column c.*

**RHS**

The remaining space in the first region of the column c consists of rows with ``row%22 in [0,1,6,11,16,21]``. We will use another precomputed column, ``CONST``, which will contain, in particular, precomputed points: for the values ``sc_bucket[i][j]`` (which, we recall, are allocated in rows ``22*i + 1 + 5*j``), we put the corresponding point ``dgt_enc(j) * points[i]`` in the same and the next row.
Now, we will use the following gates:

``RHS_ACCUMULATOR_GATE_1 = (c - c[-5])*(-(CONST[+1] - t*CONST + f)) - (A.x - CONST)*b``

(if this is not obvious: with ``CONST, CONST[+1]`` holding ``pt.x, pt.y``, this gate is ``rhs += (A.x - pt.x)/(-z(pt))``). This gate must be activated in rows = 6, 11, 16 mod 22. In rows = 1 mod 22, we should activate

``RHS_ACCUMULATOR_GATE_2 = (c - c[-7])*(-(CONST[+1] - t*CONST + f)) - (A.x - CONST)*b``

We can reuse the existing selectors, activating them in the following fashion:

``(S2 - S1[-1]) * RHS_ACCUMULATOR_GATE_1 + S1[-1] * RHS_ACCUMULATOR_GATE_2``

To ensure the accumulation starts from 0, we must constrain ``c[-6] === 0``.

The result of this process (which is almost the RHS) is put into the cell ``c[22*(N-1)+16]``.

**LHS**

Now, we wish to compute the random linear combination of our polynomials. In order to do this, we introduce two more gates:

``GATE_POLY_ACCUMULATOR_1 = c - r**11 * c[-1] * CONST - (a + r*a[+1] + r**2 * a[+2] + ... + r**10 * a[+10])``

``GATE_POLY_ACCUMULATOR_2 = c[+11] - r**11 * c[-10] * CONST - (a + r*a[+1] + r**2 * a[+2] + ... + r**10 * a[+10])``

Predictably, the first of these is activated using the selector ``S1``, and the other one with the selector ``S1[-10]``.

We will allocate the values of CONST as follows: 1 on each 0-th and 10-th row (recall that we have only partially allocated consts in rows 1,2, 6,7, 11,12, 16,17); except if ``row%(N+2) == 0``, in which case ``CONST = 0`` (and we assume that N+2 is divisible by 22, or just add zero coefficients there). This effectively ensures that each ``c[(N+2)*i - 1]`` holds an accumulated coefficient of the polynomial.

The remaining work is best done with a universal arithmetic gate, for example of the following form:

``ARITH = b*CONST + c[-3]*c[-2] + c[-1]*CONST[+1] - c``

The remaining equations involve at best ~4N work, so with this it will be at worst ~8N, which is acceptable.
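The fold performed by the two poly-accumulator gates can be illustrated outside the circuit: combine the coefficients in fan-in-11 chunks, multiplying the running accumulator by $r^{11}$ at each step, and compare against a plain Horner evaluation. A sketch over a toy prime field (names and sizes are illustrative):

```python
import random

P = 2**61 - 1  # toy prime, standing in for the circuit's scalar field
random.seed(1)

coeffs = [random.randrange(P) for _ in range(66)]  # 6 chunks of fan-in 11
r = random.randrange(P)                            # the RLC challenge

# Fold in chunks of 11, highest chunk first, as the accumulator gates do:
# acc_new = r^11 * acc_old + (a0 + a1*r + ... + a10*r^10).
acc = 0
r11 = pow(r, 11, P)
for i in reversed(range(0, len(coeffs), 11)):
    chunk_val = sum(a * pow(r, k, P) for k, a in enumerate(coeffs[i:i+11])) % P
    acc = (r11 * acc + chunk_val) % P

# Plain Horner evaluation of the same polynomial at r, for reference.
direct = 0
for a in reversed(coeffs):
    direct = (direct * r + a) % P
assert acc == direct
```

The in-circuit version spends one ``c`` cell per 11 coefficients instead of one per coefficient, which is where the "only ~10% of cells" claim from the optimization section comes from.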
The remaining equations involve:

- copying the committed evaluations from b to c (this can be done using ARITH),
- checking that their random linear combination is correct (using Horner's method, but now we don't bother creating an additional gate and just naively compute the polynomial evaluation),
- evaluating the final sum for the lhs,
- and adding one more term to the rhs (corresponding to the point Q, which is the claimed answer).

----------

## Estimated stats

This method seems to require:

7 fixed columns: CONST, t, S1, S2, S3, S_TABLE, S_ARITH.

Gates of degree at most 3 (counting selectors).

Rotations:

- CONST, CONST[-1], CONST[+1] (3 rotations)
- S1, S1[-1], S1[-2], S1[-10] (4 rotations)
- S2, S2[-1] (2 rotations)
- S3 (1 rotation)
- S_TABLE (1 rotation)
- S_ARITH (1 rotation)
- t (1 rotation)

total: 13 openings

Three advice columns: a, b, c. Rotations:

- a - a[+10] (11 rotations)
- b (1 rotation)
- c, c[-1], c[-2], c[-3], c[-5], c[-7], c[-10] (7 rotations)

total: 19 openings

Row cost is slightly larger than 55N.

---