# Notes about optimizing emulated pairing (part 1)

This note describes the techniques used for optimizing the bilinear pairing computation with field emulation in Groth16 using gnark. A shortlist of the techniques used:

* field emulation;
* augmented Groth16;
* product argument in-circuit;
* amortized methods in gnark.

## Field emulation

When using pairing-based SNARK constructions (for example, Groth16 and PLONK), the computations are performed on the scalar field of the underlying elliptic curve. Depending on the context, either the elliptic curve is fixed (for example, when we target a verifier running as an Ethereum smart contract) or we have little freedom in choosing a curve with a suitable scalar field. This means that if we want to perform computations over an arbitrary finite field (from here on, let's call it the non-native field), we have to emulate multi-precision arithmetic on the native field.

![](https://i.imgur.com/XK5AoS8.png)

For consistency, let us denote the modulus of the native field by $r$ and the modulus of the non-native field by $q$. Usually, multi-precision arithmetic is implemented by decomposing an integer value $x$ into $n$ limbs using a fixed base $b$:

$$
x = \sum_{i=0}^{n-1} x_i b^i.
$$

Given add-with-carry and mul-with-carry primitives, we can implement multi-precision addition and multiplication by operating on the limbs and carrying the excess over to the next limb, as learned in school. This approach works really well on native hardware, which has suitable primitive integer types (8-, 16-, 32- and 64-bit wide unsigned integers, i.e. $b \in \{2^8, 2^{16}, 2^{32}, 2^{64}\}$) and arithmetic operations with carry.

However, in a native field we only have field addition and multiplication. If the field cardinality is not a power of two, then whenever an operation overflows the underlying field we lose information about the exact value of the result due to automatic modular reduction. To avoid this, we need to ensure that the limb values are always small enough that multiplication (and, to a lesser extent, addition) never overflows the native field. But as the bitlength of the product of two $t$-bit inputs is up to $2t$, even for very small $t$ we overflow the native field after only a few multiplications. This means that we need a way to reduce the limbs.

Secondly, we have to be able to reduce an integer value modulo $q$. Essentially, we have to show that there exist $k, x'$ such that

$$
x = k \cdot q + x',
$$

or alternatively

$$
x - x' = k \cdot q.
$$

We compute both $x'$ and $k$ non-deterministically (in gnark terms, using hints). For correctness, we must ensure that $k, x' < q$ and that the limbs satisfy $x'_i, k_i < b$. By computing the integer value $y = x - x'$, we are left to show that

$$
y = k \cdot q. \quad (1)
$$

For that, we can instead consider the polynomial representation of the multi-precision integer (the product $k \cdot q$ occupies up to $2n-1$ limbs)

$$
p_y(Z) = \sum_{i=0}^{2n-2} y_i Z^i.
$$

With this representation we have $p_y(b) = y$ for the base $b$. Now, $(1)$ becomes

$$
p_{y}(Z) = p_{k}(Z) p_{q}(Z), \quad (1')
$$

which we can show by evaluating both sides at $2n-1$ different points: since $\deg(p_{y}), \deg(p_q p_k) \leq 2n-2$, two distinct polynomials of this degree can agree on at most $2n-2$ points.

Although we did not mention it explicitly, the main cost of the multiplication is in showing that the limbs $y_i, k_i$ are bounded by $b$.
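To make the evaluation check concrete, here is a minimal out-of-circuit sketch in Go, using plain `big.Int` rather than gnark's emulated-field API. It builds the limb representation and verifies the identity $(1')$ by evaluating both sides at $2n-1$ points; the product limbs of $y$ are taken as the raw limb-wise convolution of $k$ and $q$, so carry propagation stays implicit, and all names and parameters are illustrative.

```go
package main

import (
	"fmt"
	"math/big"
)

// decompose splits v into n limbs in base 2^w (little-endian),
// i.e. v = sum_i limbs[i] * 2^(w*i).
func decompose(v *big.Int, w uint, n int) []*big.Int {
	mask := new(big.Int).Sub(new(big.Int).Lsh(big.NewInt(1), w), big.NewInt(1))
	limbs := make([]*big.Int, n)
	rest := new(big.Int).Set(v)
	for i := range limbs {
		limbs[i] = new(big.Int).And(rest, mask)
		rest.Rsh(rest, w)
	}
	return limbs
}

// eval evaluates the limb polynomial p(Z) = sum_i limbs[i] Z^i at z (Horner).
func eval(limbs []*big.Int, z *big.Int) *big.Int {
	res := new(big.Int)
	for i := len(limbs) - 1; i >= 0; i-- {
		res.Mul(res, z).Add(res, limbs[i])
	}
	return res
}

// mulLimbs returns the coefficients of p_a(Z)*p_b(Z): the limb-wise
// product of a and b, without any carry propagation.
func mulLimbs(a, b []*big.Int) []*big.Int {
	res := make([]*big.Int, len(a)+len(b)-1)
	for i := range res {
		res[i] = new(big.Int)
	}
	for i := range a {
		for j := range b {
			res[i+j].Add(res[i+j], new(big.Int).Mul(a[i], b[j]))
		}
	}
	return res
}

func main() {
	const w, n = 64, 4 // base b = 2^64, four limbs
	// q: the BN254 base field modulus (the non-native modulus)
	q, _ := new(big.Int).SetString("21888242871839275222246405745257275088696311157297823662689037894645226208583", 10)
	k := big.NewInt(12345) // some quotient, as if provided by a hint
	pq := decompose(q, w, n)
	pk := decompose(k, w, n)
	py := mulLimbs(pk, pq) // limbs of y = k*q, 2n-1 of them

	// evaluating p_y at the base b recovers the integer y = k*q
	b := new(big.Int).Lsh(big.NewInt(1), w)
	fmt.Println(eval(py, b).Cmp(new(big.Int).Mul(k, q)) == 0) // true

	// checking p_y(z) == p_k(z)*p_q(z) at 2n-1 points establishes the
	// polynomial identity (1'), as both sides have degree <= 2n-2
	for i := int64(1); i <= int64(2*n-1); i++ {
		z := big.NewInt(i)
		lhs := eval(py, z)
		rhs := new(big.Int).Mul(eval(pk, z), eval(pq, z))
		fmt.Println(lhs.Cmp(rhs) == 0) // true
	}
}
```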
Naively, we would show the limb bounds by decomposing every limb $y_i, k_i$ into bits non-deterministically and proving that the decomposition is correct. More formally, we can write a value $v$ in base 2 as

$$
v = \sum_{i=0}^{t-1} v_i 2^i,
$$

where $t$ is the bitlength of the native field modulus. However, this alone is not sound in the SNARK circuit, as we could assign any value to $v_i$. Thus, we additionally have to assert

$$
v_i \cdot (1-v_i) = 0, \quad \forall i \in [t].
$$

We can see that this approach is really inefficient, as we have to create roughly $t$ R1CS constraints per decomposed value. For every multiplication this is approximately $2t$ constraints! Let's see how to make the multiplication more efficient, in three steps.

## Non-interactive protocols

The first critical step is to obtain some randomness in-circuit. But circuits are fixed, and in plain Groth16 (and also PLONK) the only party who could theoretically provide randomness is the prover. This is not sufficient for probabilistic checks, as the prover could provide values which lead to a false positive outcome.

![](https://i.imgur.com/UE1vljc.png)
*prover chooses a path which leads to a false positive outcome*

The simplest way to overcome this problem is the Fiat-Shamir heuristic. Using the Fiat-Shamir heuristic, the verifier's challenges in the protocol are replaced with hashes of the protocol transcript up to that point. For example, if we have an interactive protocol

```
PROVER                VERIFIER
msg
        -- msg -->
                      ch <- Sample()
        <-- ch --
use ch
......
```

then it is turned into a non-interactive protocol as

```
PROVER                DETERMINISTIC VERIFIER
msg
        -- msg -->
                      ch <- H(msg)
        <-- ch --
use ch
......
```

Now the proof is the transcript `(msg, ch, ...)` of the protocol. Everyone can verify the correctness of the proof by checking that the verifier's challenges are computed correctly. Assuming we use a strong hash function, this transformation is sound: it is difficult for the prover to find inputs to the hash function which lead to a favourable challenge, as changing even a single bit of the input cascades and changes many output bits.

We could use this approach in SNARK circuits as well, but here we run into the wall of inefficiency again. Cryptographic hash functions are not SNARK-friendly and create a lot of constraints in the circuit. Even with SNARK-friendly hash functions it is infeasible to feed the whole protocol transcript into the hash function, as the hashing overhead cancels out any possible saving coming from the probabilistic checks.
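As an illustration, the following small Go sketch applies the transformation out-of-circuit, with SHA-256 standing in for the strong hash function $H$; the function names and the reduction into the scalar field are illustrative choices, not gnark API.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/big"
)

// challenge derives the verifier's challenge ch <- H(transcript) by hashing
// all protocol messages so far and reducing the digest into the scalar field.
func challenge(r *big.Int, transcript ...[]byte) *big.Int {
	h := sha256.New()
	for _, msg := range transcript {
		h.Write(msg)
	}
	return new(big.Int).Mod(new(big.Int).SetBytes(h.Sum(nil)), r)
}

func main() {
	// r: the BN254 scalar field modulus
	r, _ := new(big.Int).SetString("21888242871839275222246405745257275088548364400416034343698204186575808495617", 10)
	msg := []byte("prover's first message")
	// the prover and the deterministic verifier derive the same challenge
	fmt.Println(challenge(r, msg))
}
```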
## Augmented Groth16

:::info
The approach for constructing a commitment from the witness comes from the [LegoSNARK paper](https://eprint.iacr.org/2019/142.pdf).
:::

However, there is a way. To explain it, we have to recall how Groth16 works at a high level. The arithmetic circuit is compiled into $c$ equations of the form

$$
\left(\sum_{i = 0}^N u_{i,j} x_i \right) \cdot \left(\sum_{i=0}^N v_{i,j} x_i \right) = \sum_{i=0}^N w_{i,j} x_i, \quad \forall j \in [c].
$$

Here the vector $\mathbf{x} = (x_0, \ldots, x_N)$ is the witness, where the first element is $1$, the next $\ell$ elements are public (the statement) and the rest is the prover's witness. It is possible to represent the matrices $(u_{i,j})$, $(v_{i,j})$ and $(w_{i,j})$ as polynomials $u_i(Z)$, $v_i(Z)$ and $w_{i}(Z)$ such that their evaluations over some domain $\Omega$ correspond to the matrix cells. Then the check becomes

$$
\left( \sum_{i=0}^N u_i(Z) x_i \right) \cdot \left( \sum_{i=0}^N v_i(Z) x_i \right) = \sum_{i=0}^N w_i(Z) x_i + h(Z) Z_\Omega(Z).
$$

Omitting the full SNARK proof computation (see [Groth16](https://eprint.iacr.org/2016/260.pdf) for details), the prover also computes

$$
C = \frac{\sum_{i=\ell+1}^N x_i(\beta u_i(x) + \alpha v_i(x) + w_i(x)) + h(x)Z_\Omega(x)}{\delta} + As + Br - rs \delta \quad (2)
$$

and the verifier checks

$$
[A]_1 \cdot [B]_2 = [\alpha]_1 \cdot [\beta]_2 + \sum_{i=0}^\ell x_i \left[\frac{\beta u_i(x) + \alpha v_i(x) + w_i(x)}{\gamma} \right]_1 \cdot [\gamma]_2 + [C]_1 \cdot [\delta]_2. \quad (3)
$$

Now, we can use a Pedersen commitment to the witness elements $\{x_i\}_{i = \ell+1}^{\ell + \mu}$ as a Fiat-Shamir challenge. For this, we first have to compute the commitment basis

$$
G_i = \left[\frac{\beta u_i(x) + \alpha v_i(x) + w_i(x)}{\gamma} \right]_1, \quad \ell + 1 \leq i \leq \ell + \mu,
$$

and the commitment value is

$$
M = \sum_{i=\ell+1}^{\ell+\mu} x_i G_i.
$$

Additionally, the prover has to provide a proof of knowledge $\pi$ to ensure that $M$ is correctly computed using the basis $\{G_i\}_{i = \ell+1}^{\ell + \mu}$. The modified prover computation $(2)$ is now:

$$
C = \frac{\sum_{i=\ell\color{red}{+\mu}+1}^N x_i(\beta u_i(x) + \alpha v_i(x) + w_i(x)) + h(x)Z_\Omega(x)}{\delta} + As + Br - rs \delta
$$

and the verifier's computation $(3)$ is:

$$
[A]_1 \cdot [B]_2 = [\alpha]_1 \cdot [\beta]_2 + \sum_{i=0}^\ell x_i \left[\frac{\beta u_i(x) + \alpha v_i(x) + w_i(x)}{\gamma} \right]_1 \cdot [\gamma]_2 \color{red}{+ M \cdot [\gamma]_2} + [C]_1 \cdot [\delta]_2;\\
\color{red}{\text{Verify}(\{G_i\}, \pi, M) = 1}.
$$

The commitment $M$ can now be used in-circuit as a Fiat-Shamir challenge on the inputs $\{x_i\}_{i = \ell+1}^{\ell+\mu}$ after hashing it into the scalar field.

:::info
We can build a similar argument in PLONK using a custom gate to mark the variables we wish to commit to.
:::

## Product argument for range checking

We can now derive a single Fiat-Shamir challenge from some chosen variables in the circuit. Let's choose these inputs well. In the first section we had to check that the limbs obtained from hints are less than $b$. Rephrasing, we can say that the value of a limb must lie in the table $\{f_i\}_{i=0}^{b-1} = \{0, 1, \ldots, b-1\}$. If we denote all the variables we want to range check by $\{s_i\}_{i=1}^{m}$, then we want to show that $\{s_i\} \subseteq \{f_i\}$. We compute the histogram of $\{s_i\}$ relative to $\{f_i\}$ to get the multiplicity $e_i$ of every element of $\{f_i\}$. We build the canonical polynomials of the sets as

$$
F(Z) = \prod_{i=0}^{b-1} (Z-f_i)^{e_i};\\
S(Z) = \prod_{i=1}^{m} (Z-s_i)
$$

and need to show that

$$
F(Z) = S(Z). \quad (4)
$$

By computing the commitment $r = \text{Commit}(\{s_i\}, \{e_i\})$ we can check $(4)$ with two polynomial evaluations:

$$
F(r) = S(r).
$$

The evaluations cost $b\log(m) + m$ multiplications. Compared to the binary decomposition approach this is a significant performance improvement when we perform a lot of range checks, as the cost of the table side is fixed and amortized over all queries.

There is an additional trick which allows us to get rid of the $\log(m)$ factor in evaluating $F(Z)$. Instead of $(4)$ we look at its logarithmic derivative. The equation to check becomes

$$
\frac{F'(Z)}{F(Z)} = \frac{S'(Z)}{S(Z)}\\
\sum_{i=0}^{b-1} \frac{e_i}{Z-f_i} = \sum_{i=1}^{m} \frac{1}{Z-s_i}.
$$

On both sides we replaced products with inverses, which have the same cost, but we got rid of the exponents $e_i$, which decreases the number of constraints needed to evaluate the left-hand side.

:::info
The idea for the log-derivative lookup argument comes from [Haböck](https://eprint.iacr.org/2022/1530.pdf).
:::
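To see the bookkeeping concretely, the following out-of-circuit Go sketch computes the multiplicities $\{e_i\}$ as a histogram and compares the two sums of inverses at a challenge point. The modulus and the fixed challenge value are illustrative stand-ins; in the real protocol the challenge is derived from the commitment.

```go
package main

import (
	"fmt"
	"math/big"
)

// sumInv computes sum_i mult[i] / (z - vals[i]) mod p.
func sumInv(p, z *big.Int, vals, mult []int64) *big.Int {
	sum := new(big.Int)
	for i, v := range vals {
		// term = mult[i] * (z - v)^(-1) mod p
		term := new(big.Int).Sub(z, big.NewInt(v))
		term.Mod(term, p)
		term.ModInverse(term, p)
		term.Mul(term, big.NewInt(mult[i]))
		sum.Add(sum, term).Mod(sum, p)
	}
	return sum
}

func main() {
	// p: the BN254 scalar field modulus (stand-in for the native field)
	p, _ := new(big.Int).SetString("21888242871839275222246405745257275088548364400416034343698204186575808495617", 10)
	table := []int64{0, 1, 2, 3}            // {f_i} with b = 4
	queries := []int64{1, 3, 3, 0, 2, 1, 1} // {s_i}, the range-checked values
	// histogram of the queries relative to the table -> multiplicities {e_i}
	e := make([]int64, len(table))
	for _, s := range queries {
		e[s]++
	}
	ones := make([]int64, len(queries))
	for i := range ones {
		ones[i] = 1
	}
	z := big.NewInt(123456789) // stand-in for the commitment-derived challenge
	lhs := sumInv(p, z, table, e)   // sum e_i / (z - f_i)
	rhs := sumInv(p, z, queries, ones) // sum 1 / (z - s_i)
	fmt.Println(lhs.Cmp(rhs) == 0) // true
}
```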
## Implementation detail

gnark aims to be a modular and safe tool for circuit developers. For example, we saw in the range checking gadget that before we can compute the challenge commitment and assert the product check, we have to collect all the queries and compute the multiplicities. From the user's perspective this means that the computations are performed out of order. An option would be to require the developer to perform these steps manually, but this would provide an opportunity to write unsound circuits if the user forgets to include any query. To avoid this, we have to build the range checking gadget in a way which ensures that the checks are always included in the circuit.

For that, we added methods for registering circuit finalizer callbacks, as well as caches. The finalizer callbacks are functions which are called after the user-defined part of the circuit. Additionally, we use caches to instantiate some constraints only once. We saw that in the range checking gadget the left-hand side of the equation adds $b$ constraints regardless of how many queries we perform, so we had better instantiate the gadget only once and then apply it to different parts of the circuit by decomposing the individual checks into smaller table-specific checks.

As an example, given the following simple circuit in gnark, let's see what happens in the background:

```go=
package main

import (
	"math/big"

	"github.com/consensys/gnark-crypto/ecc"
	"github.com/consensys/gnark-crypto/ecc/bn254"
	"github.com/consensys/gnark-crypto/ecc/bn254/fr"
	"github.com/consensys/gnark/backend/groth16"
	"github.com/consensys/gnark/frontend"
	"github.com/consensys/gnark/frontend/cs/r1cs"
	"github.com/consensys/gnark/std/algebra/emulated/sw_bn254"
)

type PairingComputationCircuit struct {
	// the witness vector is automatically allocated and initialised
	ElG1       sw_bn254.G1Affine
	ElG2       sw_bn254.G2Affine
	ExpectedGt sw_bn254.GTEl
	// the in-circuit types are fully compatible with the out-circuit
	// counterparts in gnark-crypto
}

func (c *PairingComputationCircuit) Define(api frontend.API) error {
	// we use a cached instance of the pairing context. The pairing context
	// itself uses a cached instance of the range check context. We can call
	// different gadgets which all use the same range check gadget instance.
	pairingCtx, err := sw_bn254.NewPairing(api)
	if err != nil {
		return err
	}
	// the witness non-native element limbs are automatically constrained the
	// first time they are used in a circuit. Intermediate results are
	// automatically constrained.
	res, err := pairingCtx.Pair([]*sw_bn254.G1Affine{&c.ElG1},
		[]*sw_bn254.G2Affine{&c.ElG2})
	if err != nil {
		return err
	}
	pairingCtx.AssertIsEqual(res, &c.ExpectedGt)
	// we check that all witness elements are used in the circuit. Without
	// the assertion we would get a compile-time error.
	return nil
	// at compile time the optimal range check table size is computed, and the
	// range-checked elements are decomposed into table entries and checked.
}

func Example() {
	var P bn254.G1Affine
	var Q bn254.G2Affine
	var a, b fr.Element
	_, _, g1gen, g2gen := bn254.Generators()
	a.SetRandom()
	b.SetRandom()
	P.ScalarMultiplication(&g1gen, a.BigInt(new(big.Int)))
	Q.ScalarMultiplication(&g2gen, b.BigInt(new(big.Int)))
	x, err := bn254.Pair([]bn254.G1Affine{P}, []bn254.G2Affine{Q})
	if err != nil {
		panic(err)
	}
	assignment := PairingComputationCircuit{
		ElG1:       sw_bn254.NewG1Affine(P),
		ElG2:       sw_bn254.NewG2Affine(Q),
		ExpectedGt: sw_bn254.NewGTEl(x),
		// the conversion automatically splits the extension field elements
		// into limbs for non-native field arithmetic.
	}
	ccs, err := frontend.Compile(ecc.BN254.ScalarField(), r1cs.NewBuilder,
		&PairingComputationCircuit{})
	if err != nil {
		panic(err)
	}
	pk, vk, err := groth16.Setup(ccs)
	if err != nil {
		panic(err)
	}
	w, err := frontend.NewWitness(&assignment, ecc.BN254.ScalarField())
	if err != nil {
		panic(err)
	}
	proof, err := groth16.Prove(ccs, pk, w)
	// the prover computes the Groth16 proof and the commitment proof of
	// knowledge
	if err != nil {
		panic(err)
	}
	pw, err := w.Public()
	if err != nil {
		panic(err)
	}
	err = groth16.Verify(proof, vk, pw)
	// the verifier checks the Groth16 proof and the commitment proof.
	if err != nil {
		panic(err)
	}
}
```
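User-level gadgets can tap into the same shared context directly. Below is a minimal sketch, assuming the `std/rangecheck` package shipped with recent gnark versions: all `Check` calls are collected by one cached instance and finalized together.

```go
package main

import (
	"github.com/consensys/gnark/frontend"
	"github.com/consensys/gnark/std/rangecheck"
)

// RangeCircuit is a toy circuit showing the shared range-check context.
type RangeCircuit struct {
	A, B frontend.Variable
}

func (c *RangeCircuit) Define(api frontend.API) error {
	// rangecheck.New returns the range checker cached for this builder, so
	// every gadget in the circuit feeds the same instance.
	rc := rangecheck.New(api)
	rc.Check(c.A, 16) // assert A fits in 16 bits
	rc.Check(c.B, 8)  // assert B fits in 8 bits
	// the lookup table and the multiplicity constraints are emitted only
	// once, in a finalizer callback which runs after Define returns.
	return nil
}
```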
## Performance

Using these tricks, the performance of BN254-in-BN254 and BLS12-381-in-BN254 pairing computation on a MacBook Pro M1 is:

* BN254 - number of constraints: 1393318, proving time: 5.7s
* BLS12-381 - number of constraints: 2088277, proving time: 7.4s

In the next part we will describe further optimisations for the pairing computation.