---
description: We present an optimization that reduces the number of operations needed to compute the modular multiplication of two big integers for most (but not quite all) choices of modulus. This work was done at ConsenSys by Gautam Botrel, Gus Gutoski and Thomas Piellard
image: https://i.imgur.com/iAj7POt.png
---
# Faster big-integer modular multiplication for most moduli
We discovered an optimization that reduces the number of operations needed to compute the modular multiplication of two big integers for most (but not quite all) choices of modulus.
Big-integer modular multiplication is a low-level component found in most modern cryptographic protocols including
* Elliptic curve cryptography, such as
* Elliptic Curve Digital Signature Algorithm (ECDSA)
* Elliptic Curve Diffie-Hellman key exchange (ECDH)
* Elliptic Curve pairings
* RSA cryptography
* Zero-knowledge proofs
Some protocols (such as zero-knowledge proofs and elliptic curve pairings) invoke modular multiplication billions of times in a single execution. Other protocols (such as ECDSA or ECDH) are themselves invoked billions of times each second around the world. In both cases, even a tiny improvement in modular multiplication yields a very significant computational saving---every nanosecond counts.
In our [benchmarks](#Benchmarks), our implementation (`goff`) yields a **10--15% speed improvement** over exisiting libraries on 64-bit architectures for moduli of bit-length 704 or less. The [source code](https://github.com/consensys/gnark) is freely available under an Apache 2.0 License.
*We are not aware of any prior art describing this optimization, though it is possible that we missed something. If you’ve seen this optimization before then please do let us know!*
:fire: `goff` is a tool we wrote to generate pure Go code (no assembly!) for fast modular arithmetic. For more details check out our [`goff` announcement article](https://hackmd.io/@zkteam/goff). :fire:
<img style="display: block;margin: auto;" width="50%"
src="https://i.imgur.com/UUppBSP.png">
## Montgomery multiplication
### The modular multiplication problem
Given integers $a$, $b$, and $q$ the modular multiplication problem is to compute the remainder of the product $$a*b \mod q$$ On computers a division operation is much slower than other operations such as multiplication. Thus, a naive implementation of $ab \bmod q$ using a division operation is prohibitively slow.
In 1985, [Montgomery introduced a method](https://en.wikipedia.org/wiki/Montgomery_modular_multiplication) to avoid costly divisions. This method, now called _Montgomery multiplication_, is among the fastest solutions to the problem and it continues to enjoy widespread use in modern cryptography.
### Overview of the solution: Montgomery multiplication
There are many good expositions of Montgomery multiplication---[here](https://eprint.iacr.org/2017/1057) is one example. As such, we do not go into detail on the mathematics of Montgomery multiplication. Instead, this section is intended to establish notation that is used throughout this article.
The Montgomery multiplication algorithm does not directly compute $ab \bmod q$. Instead it computes $abR^{-1} \bmod q$ for some carefully chosen number $R$ called the _Montgomery radix_. Typically, $R$ is set to the smallest power of two exceeding $q$ that falls on a computer word boundary. For example, if $q$ is 381 bits then $R=2^{6\times 64}=2^{384}$ on a 64-bit architecture.
In order to make use of Montgomery multiplication the numbers $a,b$ must be encoded into _Montgomery form_: instead of storing $a,b$, we store the numbers $\tilde{a}, \tilde{b}$ given by
\begin{align}
\tilde{a} &= aR \mod q \\
\tilde{b} &= bR \mod q
\end{align} A simple calculation shows that Montgomery multiplication produces the product $ab\bmod q$, _also encoded in Montgomery form_:
$$ (aR)(bR)R^{-1} = abR \mod q$$ The idea is that numbers are _always stored in Montgomery form_ so as to avoid costly conversions to and from Montgomery form.
Other arithmetic operations such as addition, subtraction are unaffected by Montgomery form encoding. But the modular inverse computation $a^{-1}\bmod q$ must be adapted to account for Montgomery form. We do not discuss modular inversion in this article.
## Implementation
For security purposes, cryptographic protocols use large moduli---$a,b$ and $q$ are stored on multiple machine words (multi-precision). In this article we let $D$ denote the basis in which integers are represented. (For example, $D=2^{64}$ if a word is 64 bits). A large number $n$ can be represented by its base-$D$ digits $n_0,\dots,n_m$ stored in machine words (`uint`):
$$n = \sum_{i=0}^m n_{i}D^{i}$$
There are several variations of multi-precision Montgomery multiplication. A popular choice is the Coarsely Integrated Operand Scanning (CIOS) variant ^[[Tolga Acar's thesis](https://www.microsoft.com/en-us/research/wp-content/uploads/1998/06/97Acar.pdf)]. In some settings, factors such as modulus size, CPU cache management, optimization techniques, architecture and available instruction set might favor other variants.
### How fast is the CIOS method?
Let $N$ denote the number of machine words needed to store the modulus $q$. For example, if $q$ is a 381-bit prime and the hardware has 64-bit word size then $N=6$. The CIOS method solves modular multiplication using $4N^2 + 4N + 2$ unsigned integer additions and $2N^2 + N$ unsigned integer multiplications.
Our optimization reduces the number of additions needed in CIOS Montgomery multiplication to only $4N^2 - N$, **a savings of $5N+2$ additions**. This optimization can be used whenever the highest bit of the modulus is zero (and not *all* of the remaining bits are set---see below for details).
The core of the state-of-the-art CIOS Montgomery multiplication is reproduced below. This listing is adapted from [Section 2.3.2 of Tolga Acar's thesis](https://www.microsoft.com/en-us/research/wp-content/uploads/1998/06/97Acar.pdf). The symbols in this listing have the following meanings:
- `N` is the number of machine words needed to store the modulus $q$.
- `D` is the word size. For example, on a 64-bit architecture `D` is $2^{64}$.
- `a[i], b[i], q[i]` is the `i`th word of the numbers $a,b,q$.
- `q'[0]` is the lowest word of the number $-q^{-1} \bmod R$. This quantity is pre-computed, as it does not depend on the inputs $a,b$.
- `t` is a temporary array of size $N+2$
- `C`, `S` are machine words. A pair `(C,S)` refers to (hi-bits, lo-bits) of a two-word number
```clike=
for i=0 to N-1
C := 0
for j=0 to N-1
(C,t[j]) := t[j] + a[j]*b[i] + C
(t[N+1],t[N]) := t[N] + C
C := 0
m := t[0]*q'[0] mod D
(C,_) := t[0] + m*q[0]
for j=1 to N-1
(C,t[j-1]) := t[j] + m*q[j] + C
(C,t[N-1]) := t[N] + C
t[N] := t[N+1] + C
```
---
:::success
**We show that we can get rid of additions line 5 and 14 if the highest word of the modulus $q$ is at most $\frac{D-1}{2} - 1$**.
This condition holds if and only if the highest bit of the modulus is zero and not all of the remaining bits are set.
:::
### Our optimization
Observe that lines 4 and 11 have the form $$(\mathrm{hi},\mathrm{lo}) := m_1 + m_2*B + m_3$$ where $m_1,m_2,m_3,B,\mathrm{hi},\mathrm{lo}$ are machine words, so that each is at most $D-1$. If $B \leq \frac{D-1}{2} - 1$ then a simple calculation shows:
\begin{align}
m_1 + m_2B + m_3 &\leq (D-1)+(D-1)\left( \frac{D-1}{2} - 1 \right) + (D-1) \\
&= D \underbrace{ \left( \frac{D-1}{2} \right)}_{\mathrm{hi}} + \underbrace{ \left( \frac{D+1}{2} - 1 \right) }_{\mathrm{lo}}
\end{align} from which we observe:
#### Fact 1
:::success
If $B \leq \frac{D-1}{2} - 1$ then $\mathrm{hi} \leq \frac{D-1}{2}$.
:::
---
We leverage [Fact 1](#fact-1) to prove the following:
#### Fact 2
:::success
If the highest word of $q$ is at most $\frac{D-1}{2} - 1$ then the variables `t[N]`, `t[N+1]` always store the value `0` at the beginning of each iteration of the outer `i`-loop.
:::
This fact can be proven by induction.
The base case `i=0` is trivial, since the `t` array is initialized to `0`.
For the inductive step at iteration `i` suppose `t[N]=t[N+1]=0` and trace the execution through the iteration.
Begin at the final iteration of the first inner loop (`j=N-1`) on line 4. Because $a<q$ and because the highest word of $q$ is smaller than $\frac{D-1}{2}$ we may employ [Fact 1](#fact-1) to see that the carry `C` is at most $\frac{D-1}{2}$. Then line 5 sets
```clike
t[N] = C
t[N+1] = 0
```
A similar observation holds at the end of the second inner loop (`j=N-1`) on line 11: [Fact 1](#fact-1) implies that the carry `C` is at most $\frac{D-1}{2}$. We previously observed that `t[N]` is also at most $\frac{D-1}{2}$, so `t[N] + C` is at most $$ \frac{D-1}{2} + \frac{D-1}{2} = D-1$$
which fits entirely into a single word. Then line 13 sets `C` to `0`, line 14 sets `t[N]` to `0`.
The proof by induction is now complete.
---
So here we are: at line 5, we no longer need the addition so we store `C` in a variable `A`, at line 13 we do
```clike
(_,t[N-1]) := A + C
```
### Final algorithm
For $N<12$^[`goff` is not optimized for $N\geq 12$], we merge the two inner loops
```clike=
for i=0 to N-1
(A,t[0]) := t[0] + a[0]*b[i]
m := t[0]*q'[0] mod W
C,_ := t[0] + m*q[0]
for j=1 to N-1
(A,t[j]) := t[j] + a[j]*b[i] + A
(C,t[j-1]) := t[j] + m*q[j] + C
t[N-1] = C + A
```
## Montgomery squaring
The condition on the modulus differs, here $$q[N-1] <= \frac{D-1}{4} - 1$$
The reasoning is similar and we end up with^[adapted from [Johann Großschädl's work](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.115.3276)]
```clike=
for i=0 to N-1
C, t[i] = a[i] * a[i] + t[i]
p = 0
for j=i+1 to N-1
p,C,t[j] = 2*a[j]*a[i] + t[j] + (p,C)
A = C
m = t[0] * q'[0]
C, _ = t[0] + q[0]*m
for j=1 to N-1
C, t[j-1] = q[j]*m + t[j] + C
t[N-1] = C + A
```
## Benchmarks
[`goff`](https://github.com/consensys/`goff`) (go finite field) will generate field arithmetic source code in Go for any given modulus.
In the context of [`gnark`](https://github.com/consensys/gnark), our ZKP framework, we are particularly interested in fields from BLS12-381, BLS12-377 and BN256 ($N\leq 6$).
:::warning
`goff` has not been audited and is provided as-is, use at your own risk. In particular, `goff` makes no security guarantees such as constant time implementation or side-channel attack resistance.
:::
:::warning
All benchmarks hereafter are run on a 2016 MBP with a Intel Core i7 (2.9 GHz).
These results may be impacted by
* available instruction set
* CPU cache configuration
* artifacts (thermal trottling, turbo boost, ...)
* compiler optimizations
:::
### FIPS vs CIOS vs our optimization
First, we measure the impact of our algorithmic optimization against two existing methods (CIOS and FIPS).
To do so, we implemented all three methods in `goff` and generated code for modulus sizes between 127 bits and 2047 bits, that is for $N \in [2...32]$ and $D=2^{64}$.
We can see that for $N\leq 8$, **our optimization yields a ~5-18% improvement** over the "schoolbook" CIOS method. Take these numbers with caution, as it's likely possible to further optimize the codebase, and we spend more time doing so in our implementation (==NO_CARRY==).
<br/>
| N | 2|3|4|5|6|7|8|
| -------- | --------|--------| -------- | -------- |-------- |-------- |------- |
| CIOS (ns/op)|10.7|18.2|29.1|46.1|61.7|84.5|109.6|
| NO_CARRY (ns/op) |10.2|16.6|25.3|37.6|50.8|72.4|94.3|
| improvement |==-4.7%==|==-8.8%==|==-13.1%==|==-18.4%==|==-17.7%==|==-14.3%==|==-14.0%==|
<br/>
![](https://i.imgur.com/LaxCfaB.png)
However, we notice as $N$ scales that performance gains becomes negligible, mostly because the economy of few additions don't weight much in front of CPU cache accesses.
We didn't invest time in making our method more efficient for $N\geq 12$ as it's out of the scope of our main project.
![](https://i.imgur.com/fuDxgtY.png)
### `goff` vs others
It is difficult to precisely compare benchmarks between languages. Nonetheless, it appears that our Go implementation is slightly faster than existing state-of-the-art implementations.
~~Here are our measurements for the Montgomery modular multiplication.~~
:::info
**Edit (Sept 2020)**: [you will find up to date benchmarks here](https://github.com/ConsenSys/gurvy#benchmarks).
:::
:fire: In our [introduction to `goff`](https://hackmd.io/@zkteam/goff), we walk through some optimization techniques we used. :fire:
---
<img style="float:left;width:90px;margin-right:20px;"
src="https://i.imgur.com/tYQ8i9r.png">*Authors: [Gautam Botrel](https://www.linkedin.com/in/gautam-botrel/), [Gus Gutoski](https://www.linkedin.com/in/ggutoski/), and [Thomas Piellard](https://www.linkedin.com/in/thomas-piellard-53b881129/)*
*We’re a research team at [ConsenSys](https://consensys.net/). If you are interested in our work (fast finite field arithmetic, elliptic curve pairings, zero-knowledge proofs), [give us a shout](mailto:zkteam@consensys.net).*

Authors: Youssef El Housni (ConsenSys, GRACE) and Aurore Guillevic (Inria, Aarhus University) Last year in [HG20], we introduced the BW6-761 curve which is a pairing-friendly curve that forms a 2-chain with BLS12-377 (previously introduced in ZEXE). That is, BW6-761 has a subgroup order $r$ precisely equal to the characteristic $q$ of the base field on which BLS12-377 is defined. In a recent work [HG21], we generalised the curve construction, the pairing formulas ($e : \mathbb{G}_1 \times \mathbb{G}_2 → \mathbb{G}_T$) and the group operations to any BW6 curve defined on top of any BLS12 curve, forming a family of 2-chain pairing-friendly curves. We also investigated other possible 2-chain families, and much more (read the paper!). Since BW6-761 is already implemented in few projects (gnark-crypto, arkworks, EY/libff, kilic/bw6 and constantine) and is used in production (Celo, Aleo), natural questions arise: What changes for BW6-761 after this work? Is the new formulas backward-compatible with the current implementations? If not, does it worth switching? The new general construction where BW6-761 would fall, is coherent with the groups operations and pairing formulae found previously in [HG20]. However, we found new pairing algorithms that are faster. In this note, we try to shed light on the differences and the backward compatibilty. Pairing computation Given $P \in \mathbb{G}_1$ and $Q \in \mathbb{G}_2$, a pairing computation for BW6 is

10/12/2022With my co-authors Aurore Guillevic and Diego F. Aranha, we published on ePrint mid-May a survey of elliptic curves for proof systems (ePrint 2022/586). Recently, there have been many tailored constructions of these curves that aim at efficiently implementing different kinds of proof systems. In that survey we provide the reader with a comprehensive view on existing work and revisit the contributions in terms of efficiency and security. We present an overview at three stages of the process: curves to instantiate a SNARK, curves to instantiate a recursive SNARK, and curves to express an elliptic-curve related statement. In this post, we are only interested in the first case. For pairing-based SNARKs, such as the circuit-specific Groth16 [1], bilinear pairings ($e:\mathbb{G}_1\times \mathbb{G}_2 \rightarrow \mathbb{G}_T$) are needed for the proof verification. For universal SNARKs, such as PLONK [2] or Marlin [3], a polynomial commitment scheme is needed. An example of an efficient one is the Kate-Zaverucha-Goldberg (KZG) scheme [4]. The verifier needs to verify some polynomial openings using bilinear pairings. So far, for both circuit-specific and universal constructions, the verification algorithms boil down mainly to pairing computations. However, the prover work is not the same. For Groth16, it is mainly multi-scalar-multiplications (MSM) in both $\mathbb{G}_1$ and $\mathbb{G}_2$ while for KZG-based schemes it is mainly MSMs in $\mathbb{G}_1$ only. In section 4 of the survey paper, we look at a particular family of pairing-friendly elliptic curves suitable for KZG. We derive algorithm 6 to tailor this family to the SNARK context. Note that KZG is also useful for other compelling applications such as Verkle trees for blockchain stateless clients. Pairing-friendly curves

6/14/2022Authors: Youssef El Housni (ConsenSys, GRACE) and Aurore Guillevic (inria) Following the recent work on FFT over non-smooth fields (ECFFT), it might be viable to instantiate a SNARK with an elliptic curve of non-smooth subgroup order. Particularly, one layer SNARK composition can be achieved with less constraints for the choice of curves. In this note we propose a 2-chain based on the widely used BL12-381 curve. The outer curve is a BW6 with a subgroup order 2-adicity equal to one. ECFFT For smooth finite fields $\mathbb{F_q}$ (i.e., when $q−1$ factors into small primes) the Fast Fourier Transform (FFT) leads to the fastest known algebraic algorithms for many basic polynomial operations, such as multiplication, division, interpolation and multi-point evaluation. However, the same operations over fields with no smooth order root of unity suffer from an asymptotic slowdown. [In a recent work by Ben-Sasson, Carmon, Kopparty and Levit, called ECFFT, the authors proposed a new approach to fast algorithms for polynomial operations over all large finite fields.] The key idea is to replace the group of roots of unity with a set of points $L \subset F$ suitably related to a well-chosen elliptic curve group (the set $L$ itself is not a group). The key advantage of this approach is that elliptic curve groups can be of any size in the Hasse-Weil interval $[q+1\pm2\sqrt{q}]$ and thus can have subgroups of large, smooth order, which an FFT-like divide and conquer algorithm can exploit. (From the paper's abstract) 2-chains

6/2/2022last benchmark update: 29/01/21 There are several pairing-friendly elliptic curves libraries used in zero-knowledge proofs (ZKP) projects. Typically, the most important elliptic curve operations in ZKP schemes are "multi scalar multiplication" (MSM) and "pairing product" (PP). This writeup is a try to benchmark and compare the building blocks of these operations in two different architechtures. In fact, since MSM and PP are size-dependant, we focus on timings of mixed addition in G1/G2, doubling in G1/G2, Miller loop and Final exponentiation. The libraries are written in different languages with different software and mathematical optimizations, and with more or less assembly code. The target curves are: BN254, BLS12-381, BLS12-377 and BW6-761. The chosen libraries are:

4/22/2021
Published on ** HackMD**