
---
tags: plonk, halo, snark, recursion, circuit, r1cs, cryptography
---

# Fast recursive arguments based on Plonk and Halo

*Daniel Lubarov, William Borgeaud & Brendan Farmer (Mir Protocol)*

<span style="color:red">[This content has been moved! See [our blog](https://mirprotocol.org/blog/Fast-recursive-arguments-based-on-Plonk-and-Halo) for the most recent version.]</span>

**TLDR**: By combining Plonk's permutation argument, Halo's polynomial commitment scheme, and a high-arity circuit model, we are able to recursively verify an argument using fewer than $2^{14}$ gates. We have a prototype implementation of this scheme in [Plonky](https://github.com/mir-protocol/plonky). While it's not ready for real use yet, we're seeing encouraging results, with recursive proofs taking ~9 seconds on a 6-core laptop.

This post assumes some familiarity with [Plonk](https://eprint.iacr.org/2019/953), [Halo](https://eprint.iacr.org/2019/1021) and [Rescue](https://eprint.iacr.org/2019/426).

## Contents

[TOC]

## Batch opening polynomial commitments

Verifying a Plonk argument involves opening several polynomial commitments at a challenge point $x$, and opening one of them at an additional point $y$. The Halo paper describes a protocol for reducing several openings to one, but it involves a significant amount of computation per opening on the verifier's part, which we would like to avoid. We will instead apply a batching technique described by Izaak Meckler [here](https://hackmd.io/f6ShwcCLTfmClTmmmMUjuA?view).

Let $G$ be a sequence of random generators, and let $X = (x^0, x^1, x^2, \dots)$ and $Y = (y^0, y^1, y^2, \dots)$. The prover first sends each polynomial commitment $c(f_i) = \langle f_i, G \rangle$ and purported evaluations $z_i = \langle f_i, X \rangle, w_i = \langle f_i, Y \rangle$.
The verifier samples random $\alpha, \beta \in \mathbb{F}$ and computes

\begin{align}
Z &= \sum_i \alpha^i z_i, \\
W &= \sum_i \alpha^i w_i, \\
c(F) &= \sum_i \alpha^i c(f_i),
\end{align}

where $F = \sum_i \alpha^i f_i$. At this point, the prover must argue that $F(x) = Z$ and $F(y) = W$ which, with negligible loss of soundness, reduces to

$$
\left\langle F, \, X + \beta Y \right\rangle = Z + \beta W.
$$

The prover sends an inner product argument for the relation above. Verifying it requires knowing $\langle s, G \rangle$ and $\langle s, X + \beta Y \rangle$, where $s$ is as defined in the Halo paper. The former is handled with the usual Halo technique, and the latter can be computed as $\langle s, X \rangle + \beta \langle s, Y \rangle = g(X, u_i) + \beta g(Y, u_i)$.

## Halo's bottleneck: curve multiplication

In Halo, verifying a polynomial commitment's opening involves computing

$$
Q = \sum_{j=1}^k \left( \left[ u_j^2 \right] L_j \right) + P' + \sum_{j=1}^k \left( \left[ u_j^{-2} \right] R_j \right),
$$

where $L_j$ and $R_j$ are given by the prover, and $u_j \in \{ 0, 1 \}^\lambda$ are random challenge points.

Each curve multiplication $[r] P$ could be performed with a simple double-and-add algorithm, which would involve $\lambda$ additions and $\lambda$ doublings. Halo does something similar, but leverages an endomorphism $\phi$ to process two bits at a time, reducing the cost to $\lambda/2$ additions and $\lambda/2$ doublings.

We can do better by noticing that the equation above has the form of a multi-scalar multiplication with $2k$ terms (plus the addition of $P'$). A simple, circuit-friendly MSM optimization is to perform the doublings simultaneously for all terms (cf. simultaneous squaring). This brings the cost down to $k \lambda$ additions and $\lambda / 2$ doublings, or $(k + 1/2) \lambda$ group operations.
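The shared-doubling idea can be illustrated with a minimal Python sketch. It is not Plonky's code: the function name `msm_shared_doublings`, the toy group (integers modulo `Q` under addition) and the sample inputs are all assumptions chosen for illustration. The sketch also omits the endomorphism, so it consumes one scalar bit per iteration rather than two; the point is only that a single accumulator doubling serves every term.

```python
def msm_shared_doublings(scalars, points, bits, add, double, identity):
    """Compute sum_i [scalars[i]] points[i] with one shared accumulator.

    Returns (result, additions, doublings) so the operation counts are visible.
    """
    acc = identity
    adds = doubles = 0
    for bit in reversed(range(bits)):        # most significant bit first
        acc = double(acc)                    # one doubling serves all terms
        doubles += 1
        for r, p in zip(scalars, points):
            if (r >> bit) & 1:               # add the term only if its bit is set
                acc = add(acc, p)
                adds += 1
    return acc, adds, doubles


# Toy group (assumption): integers modulo Q under addition, so [r]P = r * P mod Q.
Q = 1_000_003
add = lambda a, b: (a + b) % Q
double = lambda a: (2 * a) % Q

scalars = [0b1011, 0b0110, 0b1111]
points = [5, 7, 11]
result, adds, doubles = msm_shared_doublings(scalars, points, 4, add, double, 0)
```

With three 4-bit scalars, running double-and-add separately on each term would cost 12 doublings, while the shared accumulator needs only 4. In a circuit, the branch on each bit would be replaced by a constraint that selects the point to add.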
## Generalizing Plonk's circuit model

In Plonk's "standard" circuit model, each gate interacts with three wires, $a$, $b$ and $c$, and enforces a single constraint on the contents of those wires,

$$
q_L a + q_R b + q_O c + q_M a b + q_C = 0,
$$

where $q$ values can be thought of as gate configuration flags. As an example, we could create a multiplicative constraint $ab = c$ by configuring $q_M = 1$ and $q_O = -1$, with the other $q$ values being zero.

This model is nice and simple, but leads to rather high gate counts. Curve operations seem to require around 7 gates, for example, depending on the curve and completeness assumptions. Thankfully, the basic Plonk scheme is highly flexible; there are several generalizations which we can use to achieve lower gate counts.

First, we can use higher-arity gates. If we wanted a single gate to perform a curve operation, for example, we might use an arity of 6. This increases the prover's cost per gate, but it also allows us to dramatically reduce gate counts for certain operations.

Second, we are not limited to a single constraint. At first glance, it might seem as though adding several constraints would require the prover to compute many more FFTs, but this depends on our approach. Let $d$ be the maximum degree of any constraint. If we extend each polynomial to degree $d$ upfront, then all constraint-related arithmetic thereafter can be done in point-value form, with no additional FFTs.

Third, not all wires necessarily need to be routed. "Advice" wires are useful for things like purported inverses, or for intermediate results. Advice wires do not contribute to the degree of Plonk's permutation argument, which is often the highest-degree polynomial in Plonk-based constructions.

Finally, since Plonk opens each polynomial at a "shifted" point $g x$ in addition to the challenge point $x$, constraints can operate on the wires of the "next gate" in addition to the "local gate". This is one of the main insights behind TurboPlonk.
We take this approach even further, adding additional shifted openings in order to minimize the need for "copying" wires with Plonk's permutation argument.

## Elliptic curve gates

In the MSM algorithm described previously, most steps involve conditionally negating a point, conditionally applying the endomorphism $\phi((x,y)) = (\zeta x, y)$, then adding the modified point to an accumulator. This can be expressed as

\begin{align}
x_1 &\leftarrow (1 + (\zeta - 1) r_\mathrm{high}) x, \\
y_1 &\leftarrow (2 r_\mathrm{low} - 1) y, \\
(x_3, y_3) &\leftarrow (x_1, y_1) + (x_2, y_2),
\end{align}

where $(x, y)$ is the point being multiplied, $(x_1, y_1)$ is the point to be added to the accumulator, $(x_2, y_2)$ is the old state of the accumulator, $(x_3, y_3)$ is its updated state, and $r_\mathrm{high}, r_\mathrm{low}$ are two consecutive bits of the scalar.

### Explicit formulae

For short Weierstrass curves in affine form, incomplete addition can be computed as

\begin{align}
\mathrm{inv} &= 1 / (x_1 - x_2), \\
\lambda &= (y_1 - y_2) \mathrm{inv}, \\
x_3 &= \lambda^2 - x_1 - x_2, \\
y_3 &= \lambda(x_1 - x_3) - y_1.
\end{align}

A simple way to arithmetize this computation is to introduce advice wires for the intermediate results like $\lambda$. In particular, our gate could be defined in the following way:

* Routed wires: $x$, $y$, $x_2$, $y_2$, $x_3$, $y_3$, $r_\mathrm{low}$, $r_\mathrm{high}$
* Advice wires: $x_1$, $y_1$, $\mathrm{inv}$, $\lambda$
* Constraints:

\begin{align}
r_\mathrm{low} (r_\mathrm{low} - 1) &= 0, \\
r_\mathrm{high} (r_\mathrm{high} - 1) &= 0, \\
x_1 &= (1 + (\zeta - 1) r_\mathrm{high}) x, \\
y_1 &= (2 r_\mathrm{low} - 1) y, \\
(x_1 - x_2) \mathrm{inv} &= 1, \\
\lambda &= (y_1 - y_2) \mathrm{inv}, \\
x_3 &= \lambda^2 - x_1 - x_2, \\
y_3 &= \lambda(x_1 - x_3) - y_1.
\end{align}

This may seem like an awful lot of constraints, but low-degree constraints have little bearing on performance, since they only need to be checked at a single challenge point.
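To make the gate concrete, here is a small Python sketch that fills in the advice wires for one step and then evaluates the eight constraints. Everything specific is an assumption chosen so the numbers can be checked by hand: the field $\mathbb{F}_{13}$, the cube root of unity $\zeta = 3$, the curve $y^2 = x^3 + 2$, and the helper names `make_witness` and `constraints`. The bit wires are called `r_low` and `r_high`, matching the constraint equations. It is not taken from Plonky.

```python
P = 13      # toy field modulus (assumption); P % 3 == 1, so a cube root of unity exists
ZETA = 3    # 3**3 % 13 == 1, so (x, y) -> (ZETA * x, y) maps the curve to itself
B = 2       # toy curve y^2 = x^3 + 2 over F_13 (assumption)


def on_curve(x, y):
    return (y * y - x**3 - B) % P == 0


def make_witness(x, y, x2, y2, r_low, r_high):
    """Prover side: compute the advice wires and the new accumulator."""
    x1 = (1 + (ZETA - 1) * r_high) * x % P     # apply the endomorphism iff r_high == 1
    y1 = (2 * r_low - 1) * y % P               # negate the point iff r_low == 0
    inv = pow((x1 - x2) % P, -1, P)            # ValueError in the exceptional case x1 == x2
    lam = (y1 - y2) * inv % P
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return dict(x=x, y=y, x2=x2, y2=y2, x3=x3, y3=y3, r_low=r_low,
                r_high=r_high, x1=x1, y1=y1, inv=inv, lam=lam)


def constraints(w):
    """Verifier side: every entry must be 0 mod P for the gate to be satisfied."""
    return [v % P for v in (
        w["r_low"] * (w["r_low"] - 1),
        w["r_high"] * (w["r_high"] - 1),
        w["x1"] - (1 + (ZETA - 1) * w["r_high"]) * w["x"],
        w["y1"] - (2 * w["r_low"] - 1) * w["y"],
        (w["x1"] - w["x2"]) * w["inv"] - 1,
        w["lam"] - (w["y1"] - w["y2"]) * w["inv"],
        w["x3"] - (w["lam"] ** 2 - w["x1"] - w["x2"]),
        w["y3"] - (w["lam"] * (w["x1"] - w["x3"]) - w["y1"]),
    )]


# Point (1, 4), accumulator (2, 6), bits r_low = 0 and r_high = 1.
w = make_witness(1, 4, 2, 6, r_low=0, r_high=1)
residuals = constraints(w)
```

Here the modified point is $(3, 9)$ and the updated accumulator is $(4, 1)$, which lies on the toy curve. Changing any single wire of `w` makes at least one residual nonzero.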
### Exceptional cases

The affine formulae above assume $x_1 \ne x_2$. In the context of the Halo MSMs described earlier, a malicious prover could break this assumption simply by sending identical $L_j$ values in two consecutive rounds, so as a security measure we must verify that $x_1 \ne x_2$. We enforce this with the constraint $(x_1 - x_2) \mathrm{inv} = 1$.

For an honest prover, though, each $L_j$ is independently random since it incorporates a random group element $[l_j] H$. Since $P(x_1 = x_2) = 2^{-|F|}$ for an honest prover, we can simply abort the protocol in this case, incurring a negligible completeness error.

### Optimizations

In practice, we wouldn't want so many advice wires, as each wire requires the prover to compute a polynomial commitment and contributes to the argument length. $x_1$, $y_1$ and $\lambda$ can simply be inlined.

Further, we can combine the accumulators $(x_2, y_2)$ and $(x_3, y_3)$ by constraining the accumulator wires of the "next" gate.

Finally, our actual implementation uses a single gate to perform these curve operations while simultaneously verifying the decomposition of each scalar.

## Rescue hashes

Let $M = \bigl( \begin{smallmatrix}A & B \\ C & D\end{smallmatrix}\bigr)$ be an MDS matrix. A single round of a width-2 Rescue permutation can be defined as

\begin{align}
\operatorname{step}_{i,1}\left(\begin{bmatrix}x_1 \\ x_2\end{bmatrix}\right) &= M \begin{bmatrix}x_1^{1/\alpha} \\ x_2^{1/\alpha}\end{bmatrix} + \begin{bmatrix}r_{i,1} \\ r_{i,2}\end{bmatrix}, \\
\operatorname{step}_{i,2}\left(\begin{bmatrix}x_1 \\ x_2\end{bmatrix}\right) &= M \begin{bmatrix}x_1^{\alpha} \\ x_2^{\alpha}\end{bmatrix} + \begin{bmatrix}r_{i,3} \\ r_{i,4}\end{bmatrix}, \\
\operatorname{round}_i &= \operatorname{step}_{i,2} \circ \operatorname{step}_{i,1},
\end{align}

where $r_{i, 1} \dots r_{i, 4}$ are random constants.
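As a worked illustration of the round function, the following Python sketch evaluates one round over a toy field. The parameters are assumptions picked for hand-checkable arithmetic, not real Rescue parameters: the prime $p = 11$, the exponent $\alpha = 3$ (coprime to $p - 1$), the matrix $M = \bigl( \begin{smallmatrix}2 & 1 \\ 1 & 1\end{smallmatrix}\bigr)$ and the round constants $(1, 2, 3, 4)$. The names `step` and `rescue_round` are likewise illustrative. Note that an $\alpha$th root is itself an exponentiation, by $\alpha^{-1} \bmod (p - 1)$, which is a large exponent for a realistic field.

```python
P = 11                               # toy prime field (assumption)
ALPHA = 3                            # gcd(ALPHA, P - 1) == 1, so x -> x^ALPHA is a permutation
ALPHA_INV = pow(ALPHA, -1, P - 1)    # exponent that computes alpha-th roots
M = ((2, 1), (1, 1))                 # toy 2x2 matrix with nonzero entries and determinant (assumption)


def step(state, exponent, consts):
    """Raise each element to `exponent`, multiply by M, then add the constants."""
    powered = [pow(s, exponent, P) for s in state]
    return tuple(
        (M[row][0] * powered[0] + M[row][1] * powered[1] + consts[row]) % P
        for row in range(2)
    )


def rescue_round(state, r):
    """One round: the inverse-power step followed by the forward-power step."""
    mid = step(state, ALPHA_INV, r[0:2])     # step_{i,1}, uses r_{i,1} and r_{i,2}
    return step(mid, ALPHA, r[2:4])          # step_{i,2}, uses r_{i,3} and r_{i,4}


roots = tuple(pow(x, ALPHA_INV, P) for x in (2, 5))   # alpha-th roots of the inputs
output = rescue_round((2, 5), (1, 2, 3, 4))
```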
Computing $\alpha$th roots deterministically would be expensive, so we ask the prover to supply them via advice wires $y_1$ and $y_2$. Let $z_1$ and $z_2$ denote the output of the round function; then our gate may be defined as

* Routed wires: $x_1$, $x_2$, $z_1$, $z_2$
* Advice wires: $y_1$, $y_2$
* Constraints:

\begin{align}
x_1 &= y_1^\alpha, \\
x_2 &= y_2^\alpha, \\
z_1 &= A(A y_1 + B y_2 + r_{i,1})^\alpha + B(C y_1 + D y_2 + r_{i,2})^\alpha + r_{i,3}, \\
z_2 &= C(A y_1 + B y_2 + r_{i, 1})^\alpha + D(C y_1 + D y_2 + r_{i,2})^\alpha + r_{i,4}.
\end{align}

We could even perform several rounds of Rescue in a single gate, but in practice we are limited by the number of constants available to each gate. A single round already involves 4 constants $r_{i, 1} \dots r_{i, 4}$, and adding more would mean more preprocessed polynomials which must be opened at a challenge point.

As a future optimization, we could interleave Rescue gates with gates which do not require any constants, such as curve operation gates. Each Rescue gate could then utilize the unused constant slots belonging to its neighbor, enabling it to perform more steps without increasing the number of constant polynomials.

## A unified constraint set

So far, we have described separate systems of equations for each gate type, but Plonk assumes a single set of constraints. To accomplish this, we combine each constraint with a "filter" expression which evaluates to 0 or 1, indicating whether the constraint should be applied to a given gate index. Our unified constraint set is then simply a sum of filtered constraints.

A simple way to implement this would involve a constant polynomial for each gate type. For example, `is_rescue_gate` could be defined as the polynomial with `is_rescue_gate(g^i) = 1` if and only if gate `i` is a Rescue gate. We use a variety of custom gates, though, and we wouldn't want to open so many constant polynomials.
Instead we organize gates in the leaves of a binary tree, as shown:

```graphviz
digraph G {
    r [shape=point]
    0 [shape=point]
    00 [label="Rescue A"]
    01 [label="Rescue B"]
    1 [shape=point]
    10 [shape=point]
    11 [label="Curve Add (Endo)"]
    100 [shape=point]
    101 [shape=point]
    1010 [shape=point]
    1011 [shape=point]
    10100 [shape=point]
    101000 [label="Buffer"]
    101001 [label="Public Input"]
    10101 [label="Curve Add"]
    1000 [label="Base 4 Sum"]
    1001 [label="Arithmetic"]
    10110 [label="Constant"]
    10111 [label="Curve Double"]
    r -> 0
    r -> 1
    0 -> 00
    0 -> 01
    1 -> 10
    1 -> 11
    10 -> 100
    10 -> 101
    100 -> 1000
    100 -> 1001
    101 -> 1010
    101 -> 1011
    1010 -> 10100
    1010 -> 10101
    1011 -> 10110
    1011 -> 10111
    10100 -> 101000
    10100 -> 101001
}
```

We then introduce a constant polynomial $C_i$ for each layer of the tree, and set its value to 0 or 1 to indicate a left or a right turn in the tree. For example, our arithmetic gate has a path of 1001, so its constraints are combined with the filter $C_1(x) (1 - C_2(x)) (1 - C_3(x)) C_4(x)$.

It might seem odd that different gates are given different depths in this binary tree. There are two reasons for this. First, a smaller depth results in a lower-degree filter polynomial, and in order to keep our max degree low, certain gates with higher-degree constraints must be given lower-degree filters.

Second, any constants not used in the filter polynomial are available to be used for gate configuration. Our arithmetic gate has a filter involving only $C_1, \dots, C_4$, for example, so it uses $C_5$ and $C_6$ to configure the type of arithmetic being performed.

## Future improvements

While we have been focused on optimizing our recursive circuit size, there is also plenty of room to speed up our primitives. Plonky is currently 100% Rust, and we expect a major speedup on x86 systems from [carry chain optimizations](https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf), which the compiler is not capable of.
Our proving time is dominated by multi-scalar multiplications, which we implemented using a variant of Yao's method. We may be able to do better with Pippenger's algorithm, especially for the IPA reduction which involves variable-base MSMs.

Another potential improvement was suggested by Daira Hopwood. Instead of applying the endomorphism zero or one times based on a bit of the scalar, we could apply it zero, one or two times based on a base-3 limb of the scalar. This would reduce the number of iterations from 64 to 50 while maintaining injectivity, although it becomes more difficult to prove.

## Thanks

Thanks to Daira Hopwood, Sean Bowe and Zachary Williamson for helpful pointers and discussions.