--- description: We present an optimization that reduces the number of operations needed to compute the modular multiplication of two big integers for most (but not quite all) choices of modulus. This work was done at ConsenSys by Gautam Botrel, Gus Gutoski and Thomas Piellard image: https://i.imgur.com/iAj7POt.png --- # Faster big-integer modular multiplication for most moduli We discovered an optimization that reduces the number of operations needed to compute the modular multiplication of two big integers for most (but not quite all) choices of modulus. Big-integer modular multiplication is a low-level component found in most modern cryptographic protocols including * Elliptic curve cryptography, such as * Elliptic Curve Digital Signature Algorithm (ECDSA) * Elliptic Curve Diffie-Hellman key exchange (ECDH) * Elliptic Curve pairings * RSA cryptography * Zero-knowledge proofs Some protocols (such as zero-knowledge proofs and elliptic curve pairings) invoke modular multiplication billions of times in a single execution. Other protocols (such as ECDSA or ECDH) are themselves invoked billions of times each second around the world. In both cases, even a tiny improvement in modular multiplication yields a very significant computational saving---every nanosecond counts. In our [benchmarks](#Benchmarks), our implementation (`goff`) yields a **10--15% speed improvement** over exisiting libraries on 64-bit architectures for moduli of bit-length 704 or less. The [source code](https://github.com/consensys/gnark) is freely available under an Apache 2.0 License. *We are not aware of any prior art describing this optimization, though it is possible that we missed something. If you’ve seen this optimization before then please do let us know!* :fire: `goff` is a tool we wrote to generate pure Go code (no assembly!) for fast modular arithmetic. For more details check out our [`goff` announcement article](https://hackmd.io/@zkteam/goff). :fire: <img style="display: block;margin: auto;" width="50%" src="https://i.imgur.com/UUppBSP.png"> ## Montgomery multiplication ### The modular multiplication problem Given integers $a$, $b$, and $q$ the modular multiplication problem is to compute the remainder of the product $$a*b \mod q$$ On computers a division operation is much slower than other operations such as multiplication. Thus, a naive implementation of $ab \bmod q$ using a division operation is prohibitively slow. In 1985, [Montgomery introduced a method](https://en.wikipedia.org/wiki/Montgomery_modular_multiplication) to avoid costly divisions. This method, now called _Montgomery multiplication_, is among the fastest solutions to the problem and it continues to enjoy widespread use in modern cryptography. ### Overview of the solution: Montgomery multiplication There are many good expositions of Montgomery multiplication---[here](https://eprint.iacr.org/2017/1057) is one example. As such, we do not go into detail on the mathematics of Montgomery multiplication. Instead, this section is intended to establish notation that is used throughout this article. The Montgomery multiplication algorithm does not directly compute $ab \bmod q$. Instead it computes $abR^{-1} \bmod q$ for some carefully chosen number $R$ called the _Montgomery radix_. Typically, $R$ is set to the smallest power of two exceeding $q$ that falls on a computer word boundary. For example, if $q$ is 381 bits then $R=2^{6\times 64}=2^{384}$ on a 64-bit architecture. In order to make use of Montgomery multiplication the numbers $a,b$ must be encoded into _Montgomery form_: instead of storing $a,b$, we store the numbers $\tilde{a}, \tilde{b}$ given by \begin{align} \tilde{a} &= aR \mod q \\ \tilde{b} &= bR \mod q \end{align} A simple calculation shows that Montgomery multiplication produces the product $ab\bmod q$, _also encoded in Montgomery form_: $$ (aR)(bR)R^{-1} = abR \mod q$$ The idea is that numbers are _always stored in Montgomery form_ so as to avoid costly conversions to and from Montgomery form. Other arithmetic operations such as addition, subtraction are unaffected by Montgomery form encoding. But the modular inverse computation $a^{-1}\bmod q$ must be adapted to account for Montgomery form. We do not discuss modular inversion in this article. ## Implementation For security purposes, cryptographic protocols use large moduli---$a,b$ and $q$ are stored on multiple machine words (multi-precision). In this article we let $D$ denote the basis in which integers are represented. (For example, $D=2^{64}$ if a word is 64 bits). A large number $n$ can be represented by its base-$D$ digits $n_0,\dots,n_m$ stored in machine words (`uint`): $$n = \sum_{i=0}^m n_{i}D^{i}$$ There are several variations of multi-precision Montgomery multiplication. A popular choice is the Coarsely Integrated Operand Scanning (CIOS) variant ^[[Tolga Acar's thesis](https://www.microsoft.com/en-us/research/wp-content/uploads/1998/06/97Acar.pdf)]. In some settings, factors such as modulus size, CPU cache management, optimization techniques, architecture and available instruction set might favor other variants. ### How fast is the CIOS method? Let $N$ denote the number of machine words needed to store the modulus $q$. For example, if $q$ is a 381-bit prime and the hardware has 64-bit word size then $N=6$. The CIOS method solves modular multiplication using $4N^2 + 4N + 2$ unsigned integer additions and $2N^2 + N$ unsigned integer multiplications. Our optimization reduces the number of additions needed in CIOS Montgomery multiplication to only $4N^2 - N$, **a savings of $5N+2$ additions**. This optimization can be used whenever the highest bit of the modulus is zero (and not *all* of the remaining bits are set---see below for details). The core of the state-of-the-art CIOS Montgomery multiplication is reproduced below. This listing is adapted from [Section 2.3.2 of Tolga Acar's thesis](https://www.microsoft.com/en-us/research/wp-content/uploads/1998/06/97Acar.pdf). The symbols in this listing have the following meanings: - `N` is the number of machine words needed to store the modulus $q$. - `D` is the word size. For example, on a 64-bit architecture `D` is $2^{64}$. - `a[i], b[i], q[i]` is the `i`th word of the numbers $a,b,q$. - `q'[0]` is the lowest word of the number $-q^{-1} \bmod R$. This quantity is pre-computed, as it does not depend on the inputs $a,b$. - `t` is a temporary array of size $N+2$ - `C`, `S` are machine words. A pair `(C,S)` refers to (hi-bits, lo-bits) of a two-word number ```clike= for i=0 to N-1 C := 0 for j=0 to N-1 (C,t[j]) := t[j] + a[j]*b[i] + C (t[N+1],t[N]) := t[N] + C C := 0 m := t[0]*q'[0] mod D (C,_) := t[0] + m*q[0] for j=1 to N-1 (C,t[j-1]) := t[j] + m*q[j] + C (C,t[N-1]) := t[N] + C t[N] := t[N+1] + C ``` --- :::success **We show that we can get rid of additions line 5 and 14 if the highest word of the modulus $q$ is at most $\frac{D-1}{2} - 1$**. This condition holds if and only if the highest bit of the modulus is zero and not all of the remaining bits are set. ::: ### Our optimization Observe that lines 4 and 11 have the form $$(\mathrm{hi},\mathrm{lo}) := m_1 + m_2*B + m_3$$ where $m_1,m_2,m_3,B,\mathrm{hi},\mathrm{lo}$ are machine words, so that each is at most $D-1$. If $B \leq \frac{D-1}{2} - 1$ then a simple calculation shows: \begin{align} m_1 + m_2B + m_3 &\leq (D-1)+(D-1)\left( \frac{D-1}{2} - 1 \right) + (D-1) \\ &= D \underbrace{ \left( \frac{D-1}{2} \right)}_{\mathrm{hi}} + \underbrace{ \left( \frac{D+1}{2} - 1 \right) }_{\mathrm{lo}} \end{align} from which we observe: #### Fact 1 :::success If $B \leq \frac{D-1}{2} - 1$ then $\mathrm{hi} \leq \frac{D-1}{2}$. ::: --- We leverage [Fact 1](#fact-1) to prove the following: #### Fact 2 :::success If the highest word of $q$ is at most $\frac{D-1}{2} - 1$ then the variables `t[N]`, `t[N+1]` always store the value `0` at the beginning of each iteration of the outer `i`-loop. ::: This fact can be proven by induction. The base case `i=0` is trivial, since the `t` array is initialized to `0`. For the inductive step at iteration `i` suppose `t[N]=t[N+1]=0` and trace the execution through the iteration. Begin at the final iteration of the first inner loop (`j=N-1`) on line 4. Because $a<q$ and because the highest word of $q$ is smaller than $\frac{D-1}{2}$ we may employ [Fact 1](#fact-1) to see that the carry `C` is at most $\frac{D-1}{2}$. Then line 5 sets ```clike t[N] = C t[N+1] = 0 ``` A similar observation holds at the end of the second inner loop (`j=N-1`) on line 11: [Fact 1](#fact-1) implies that the carry `C` is at most $\frac{D-1}{2}$. We previously observed that `t[N]` is also at most $\frac{D-1}{2}$, so `t[N] + C` is at most $$ \frac{D-1}{2} + \frac{D-1}{2} = D-1$$ which fits entirely into a single word. Then line 13 sets `C` to `0`, line 14 sets `t[N]` to `0`. The proof by induction is now complete. --- So here we are: at line 5, we no longer need the addition so we store `C` in a variable `A`, at line 13 we do ```clike (_,t[N-1]) := A + C ``` ### Final algorithm For $N<12$^[`goff` is not optimized for $N\geq 12$], we merge the two inner loops ```clike= for i=0 to N-1 (A,t[0]) := t[0] + a[0]*b[i] m := t[0]*q'[0] mod W C,_ := t[0] + m*q[0] for j=1 to N-1 (A,t[j]) := t[j] + a[j]*b[i] + A (C,t[j-1]) := t[j] + m*q[j] + C t[N-1] = C + A ``` ## Montgomery squaring The condition on the modulus differs, here $$q[N-1] <= \frac{D-1}{4} - 1$$ The reasoning is similar and we end up with^[adapted from [Johann Großschädl's work](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.115.3276)] ```clike= for i=0 to N-1 C, t[i] = a[i] * a[i] + t[i] p = 0 for j=i+1 to N-1 p,C,t[j] = 2*a[j]*a[i] + t[j] + (p,C) A = C m = t[0] * q'[0] C, _ = t[0] + q[0]*m for j=1 to N-1 C, t[j-1] = q[j]*m + t[j] + C t[N-1] = C + A ``` ## Benchmarks [`goff`](https://github.com/consensys/`goff`) (go finite field) will generate field arithmetic source code in Go for any given modulus. In the context of [`gnark`](https://github.com/consensys/gnark), our ZKP framework, we are particularly interested in fields from BLS12-381, BLS12-377 and BN256 ($N\leq 6$). :::warning `goff` has not been audited and is provided as-is, use at your own risk. In particular, `goff` makes no security guarantees such as constant time implementation or side-channel attack resistance. ::: :::warning All benchmarks hereafter are run on a 2016 MBP with a Intel Core i7 (2.9 GHz). These results may be impacted by * available instruction set * CPU cache configuration * artifacts (thermal trottling, turbo boost, ...) * compiler optimizations ::: ### FIPS vs CIOS vs our optimization First, we measure the impact of our algorithmic optimization against two existing methods (CIOS and FIPS). To do so, we implemented all three methods in `goff` and generated code for modulus sizes between 127 bits and 2047 bits, that is for $N \in [2...32]$ and $D=2^{64}$. We can see that for $N\leq 8$, **our optimization yields a ~5-18% improvement** over the "schoolbook" CIOS method. Take these numbers with caution, as it's likely possible to further optimize the codebase, and we spend more time doing so in our implementation (==NO_CARRY==). <br/> | N | 2|3|4|5|6|7|8| | -------- | --------|--------| -------- | -------- |-------- |-------- |------- | | CIOS (ns/op)|10.7|18.2|29.1|46.1|61.7|84.5|109.6| | NO_CARRY (ns/op) |10.2|16.6|25.3|37.6|50.8|72.4|94.3| | improvement |==-4.7%==|==-8.8%==|==-13.1%==|==-18.4%==|==-17.7%==|==-14.3%==|==-14.0%==| <br/> ![](https://i.imgur.com/LaxCfaB.png) However, we notice as $N$ scales that performance gains becomes negligible, mostly because the economy of few additions don't weight much in front of CPU cache accesses. We didn't invest time in making our method more efficient for $N\geq 12$ as it's out of the scope of our main project. ![](https://i.imgur.com/fuDxgtY.png) ### `goff` vs others It is difficult to precisely compare benchmarks between languages. Nonetheless, it appears that our Go implementation is slightly faster than existing state-of-the-art implementations. Here are our measurements for the Montgomery modular multiplication. | library | bn256 (ns/op)|bls381 (ns/op)| language | project | | -------- | --------|--------| -------- | -------- | | `goff` ==*== |21.0|40.7| Go | `gnark` | | bn256 ==*== |28.5|| Go | Cloudflare | | libff |33|60| Rust | ZCash | | barretenberg ==*== |need update| | C++ | AZTEC | | mcl ==*== |need update| | C++ | | | ctbignum |36.4| | C++ | | | libff ==*?== (using GMP)^[benchmark code adapted from [ctbignum](https://github.com/niekbouman/ctbignum/blob/master/benchmarks/src/bench-montmul.cpp). libff seems to have seems to have [state of the art assembly implementation](https://github.com/scipr-lab/libff/blob/master/libff/algebra/fields/fp.tcc), so we're cautious on these results] |49|| C++ | libsnark | Note that libraries marked with ==*== have a constant time implementation of the Montgomery multiplication. The CIOS method (even with our optimization) can be made constant time by always subtracting the modulus $q$ to the result $t$ instead of comparing $t$ and $q$. :fire: In our [introduction to `goff`](https://hackmd.io/@zkteam/goff), we walk through some optimization techniques we used. Spoiler: no assembly! :fire: --- <img style="float:left;width:90px;margin-right:20px;" src="https://i.imgur.com/tYQ8i9r.png">*Authors: [Gautam Botrel](https://www.linkedin.com/in/gautam-botrel/), [Gus Gutoski](https://www.linkedin.com/in/ggutoski/), and [Thomas Piellard](https://www.linkedin.com/in/thomas-piellard-53b881129/)* *We’re a research team at [ConsenSys](https://consensys.net/). If you are interested in our work (fast finite field arithmetic, elliptic curve pairings, zero-knowledge proofs), [give us a shout](mailto:zkteam@consensys.net).*