We discovered an optimization that reduces the number of operations needed to compute the modular multiplication of two big integers for most (but not quite all) choices of modulus.
Big-integer modular multiplication is a low-level component found in most modern cryptographic protocols.
Some protocols (such as zero-knowledge proofs and elliptic curve pairings) invoke modular multiplication billions of times in a single execution. Other protocols (such as ECDSA or ECDH) are themselves invoked billions of times each second around the world. In both cases, even a tiny improvement in modular multiplication yields a significant computational saving: every nanosecond counts.
In our benchmarks, our implementation (goff) yields a 10–15% speed improvement over existing libraries on 64-bit architectures for moduli of bit-length 704 or less. The source code is freely available under an Apache 2.0 License.
We are not aware of any prior art describing this optimization, though it is possible that we missed something. If you’ve seen this optimization before then please do let us know!
goff is a tool we wrote to generate pure Go code (no assembly!) for fast modular arithmetic. For more details check out our goff announcement article. Given integers a, b and a modulus q, modular multiplication is the task of computing a*b mod q; a naive implementation requires an expensive division by q.
In 1985, Montgomery introduced a method to avoid costly divisions. This method, now called Montgomery multiplication, is among the fastest solutions to the problem and it continues to enjoy widespread use in modern cryptography.
There are many good expositions of Montgomery multiplication; here is one example. As such, we do not go into detail on the mathematics of Montgomery multiplication. Instead, this section is intended to establish notation that is used throughout this article.
The Montgomery multiplication algorithm does not directly compute a*b mod q. Instead, it computes the Montgomery product a*b*R^(-1) mod q, where R > q is a conveniently chosen power of 2 (and hence coprime to the odd modulus q).
In order to make use of Montgomery multiplication, the numbers must first be converted to Montgomery form: x is represented by x*R mod q. This encoding is compatible with the Montgomery product: applying it to x*R and y*R yields precisely x*y*R mod q.
Other arithmetic operations such as addition and subtraction are unaffected by the Montgomery form encoding. But the modular inverse computation must be adjusted to account for the extra factor of R.
For security purposes, cryptographic protocols use large moduli that do not fit into a single machine word (such as a uint64). Numbers must therefore be stored across several machine words, and Montgomery multiplication must be carried out with multi-precision arithmetic.
There are several variations of multi-precision Montgomery multiplication. A popular choice is the Coarsely Integrated Operand Scanning (CIOS) variant [1]. In some settings, factors such as modulus size, CPU cache management, optimization techniques, architecture and available instruction set might favor other variants.
Let N denote the number of machine words needed to store the modulus. Our optimization reduces the number of single-word additions needed in CIOS Montgomery multiplication.
The core of the state-of-the-art CIOS Montgomery multiplication is reproduced below. This listing is adapted from Section 2.3.2 of Tolga Acar's thesis. The symbols in this listing have the following meanings:
- `N` is the number of machine words needed to store the modulus `q`
- `D` is the word size. For example, on a 64-bit architecture `D` is `2^64`
- `a[i]`, `b[i]`, `q[i]` is the `i`-th word of the numbers `a`, `b`, `q`
- `q'[0]` is the lowest word of the number `q' = -q^(-1) mod R`
- `t` is a temporary array of size `N+2`
- `C`, `S` are machine words. A pair `(C,S)` refers to (hi-bits, lo-bits) of a two-word number
```
 1   for i=0 to N-1
 2       C := 0
 3       for j=0 to N-1
 4           (C,t[j]) := t[j] + a[j]*b[i] + C
 5       (t[N+1],t[N]) := t[N] + C
 6
 7       C := 0
 8       m := t[0]*q'[0] mod D
 9       (C,_) := t[0] + m*q[0]
10       for j=1 to N-1
11           (C,t[j-1]) := t[j] + m*q[j] + C
12
13       (C,t[N-1]) := t[N] + C
14       t[N] := t[N+1] + C
```
We show that we can get rid of the additions on lines 5 and 14 whenever the highest word of the modulus satisfies q[N-1] <= (D/2) - 2 (on a 64-bit architecture, q[N-1] <= 2^63 - 2). This condition holds if and only if the highest bit of the modulus is zero and not all of the remaining bits of its highest word are set.
Observe that lines 4 and 11 have the form (C,S) := t[j] + x*y + C.

Fact 1: If x <= (D/2) - 2 and y, t[j], C are each at most D - 1, then t[j] + x*y + C <= (D-1) + ((D/2)-2)*(D-1) + (D-1) = (D/2)*(D-1), so the carry C produced by such an operation is at most (D/2) - 1.

We leverage Fact 1 to prove the following:
If the highest word of the modulus satisfies q[N-1] <= (D/2) - 2, then t[N] and t[N+1] always store the value 0 at the beginning of each iteration of the outer i-loop.
This fact can be proven by induction. The base case i=0 is trivial, since the t array is initialized to 0. For the inductive step at iteration i, suppose t[N] = t[N+1] = 0 and trace the execution through the iteration.
Begin at the final iteration of the first inner loop (j=N-1) on line 4. Here x = a[N-1] <= q[N-1] <= (D/2) - 2 (because a < q), so by Fact 1 the carry C is at most (D/2) - 1. Line 5 then computes t[N] + C = 0 + C, which fits in a single word, leaving

t[N] = C <= (D/2) - 1
t[N+1] = 0
A similar observation holds at the end of the second inner loop (j=N-1) on line 11: Fact 1 implies that the carry C is at most (D/2) - 1 (here x = q[N-1] <= (D/2) - 2). Since t[N] is also at most (D/2) - 1, the sum t[N] + C is at most D - 2, which fits entirely into a single word. Then line 13 sets C to 0, and line 14 sets t[N] to t[N+1] + C = 0. As t[N+1] was set to 0 on line 5, both t[N] and t[N+1] are 0 at the start of iteration i+1.
The proof by induction is now complete.
So here we are: at line 5 we no longer need the addition, so we simply store C in a variable A instead of writing it to t[N]; and at line 13, since A + C fits in a single word, we do

(_,t[N-1]) := A + C

so line 14 (and the words t[N], t[N+1]) disappear entirely.
For modular multiplication, the optimized algorithm becomes:
```
for i=0 to N-1
    (A,t[0]) := t[0] + a[0]*b[i]
    m := t[0]*q'[0] mod D
    (C,_) := t[0] + m*q[0]
    for j=1 to N-1
        (A,t[j]) := t[j] + a[j]*b[i] + A
        (C,t[j-1]) := t[j] + m*q[j] + C
    t[N-1] := C + A
```
For squaring, the condition on the modulus differs slightly, because the doubled cross-products 2*a[j]*a[i] produce larger intermediate values. The reasoning is similar and we end up with[1]:
```
for i=0 to N-1
    (C,t[i]) := a[i]*a[i] + t[i]
    p := 0
    for j=i+1 to N-1
        (p,C,t[j]) := 2*a[j]*a[i] + t[j] + (p,C)
    A := C
    m := t[0]*q'[0] mod D
    (C,_) := t[0] + m*q[0]
    for j=1 to N-1
        (C,t[j-1]) := t[j] + m*q[j] + C
    t[N-1] := C + A
```
goff (go finite field) will generate field arithmetic source code in Go for any given modulus.
In the context of gnark, our ZKP framework, we are particularly interested in the fields from BLS12-381, BLS12-377 and BN256.
goff has not been audited and is provided as-is, use at your own risk. In particular, goff makes no security guarantees such as a constant-time implementation or side-channel attack resistance.
All benchmarks hereafter are run on a 2016 MacBook Pro with an Intel Core i7 (2.9 GHz). These results may be impacted by factors such as CPU frequency scaling and background load.
First, we measure the impact of our algorithmic optimization against two existing methods (CIOS and FIPS).
To do so, we implemented all three methods in goff and generated code for modulus sizes between 127 bits and 2047 bits, that is, for N between 2 and 32. We can see that for N between 2 and 8 the NO_CARRY variant is consistently faster than CIOS:
| N | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|
| CIOS (ns/op) | 10.7 | 18.2 | 29.1 | 46.1 | 61.7 | 84.5 | 109.6 |
| NO_CARRY (ns/op) | 10.2 | 16.6 | 25.3 | 37.6 | 50.8 | 72.4 | 94.3 |
| improvement | -4.7% | -8.8% | -13.1% | -18.4% | -17.7% | -14.3% | -14.0% |
However, we notice that as N grows beyond 5, the relative improvement shrinks. We didn't invest time in making our method more efficient for larger values of N.
## goff vs others

It is difficult to precisely compare benchmarks between languages. Nonetheless, it appears that our Go implementation is slightly faster than existing state-of-the-art implementations.
Here are our measurements for the Montgomery modular multiplication.
Edit (Sept 2020): you will find up-to-date benchmarks here.
In a follow-up article about goff, we walk through some of the optimization techniques we used. We’re a research team at ConsenSys. If you are interested in our work (fast finite field arithmetic, elliptic curve pairings, zero-knowledge proofs), give us a shout.