Try   HackMD

Faster big-integer modular multiplication for most moduli

We discovered an optimization that reduces the number of operations needed to compute the modular multiplication of two big integers for most (but not quite all) choices of modulus.

Big-integer modular multiplication is a low-level component found in most modern cryptographic protocols including

  • Elliptic curve cryptography, such as
    • Elliptic Curve Digital Signature Algorithm (ECDSA)
    • Elliptic Curve Diffie-Hellman key exchange (ECDH)
    • Elliptic Curve pairings
  • RSA cryptography
  • Zero-knowledge proofs

Some protocols (such as zero-knowledge proofs and elliptic curve pairings) invoke modular multiplication billions of times in a single execution. Other protocols (such as ECDSA or ECDH) are themselves invoked billions of times each second around the world. In both cases, even a tiny improvement in modular multiplication yields a very significant computational saving-every nanosecond counts.

In our benchmarks, our implementation (goff) yields a 1015% speed improvement over exisiting libraries on 64-bit architectures for moduli of bit-length 704 or less. The source code is freely available under an Apache 2.0 License.

We are not aware of any prior art describing this optimization, though it is possible that we missed something. If you’ve seen this optimization before then please do let us know!

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
goff is a tool we wrote to generate pure Go code (no assembly!) for fast modular arithmetic. For more details check out our goff announcement article.
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Montgomery multiplication

The modular multiplication problem

Given integers

a,
b
, and
q
the modular multiplication problem is to compute the remainder of the product
abmodq
On computers a division operation is much slower than other operations such as multiplication. Thus, a naive implementation of
abmodq
using a division operation is prohibitively slow.

In 1985, Montgomery introduced a method to avoid costly divisions. This method, now called Montgomery multiplication, is among the fastest solutions to the problem and it continues to enjoy widespread use in modern cryptography.

Overview of the solution: Montgomery multiplication

There are many good expositions of Montgomery multiplication-here is one example. As such, we do not go into detail on the mathematics of Montgomery multiplication. Instead, this section is intended to establish notation that is used throughout this article.

The Montgomery multiplication algorithm does not directly compute

abmodq. Instead it computes
abR1modq
for some carefully chosen number
R
called the Montgomery radix. Typically,
R
is set to the smallest power of two exceeding
q
that falls on a computer word boundary. For example, if
q
is 381 bits then
R=26×64=2384
on a 64-bit architecture.

In order to make use of Montgomery multiplication the numbers

a,b must be encoded into Montgomery form: instead of storing
a,b
, we store the numbers
a~,b~
given by
a~=aRmodqb~=bRmodq
A simple calculation shows that Montgomery multiplication produces the product
abmodq
, also encoded in Montgomery form:
(aR)(bR)R1=abRmodq
The idea is that numbers are always stored in Montgomery form so as to avoid costly conversions to and from Montgomery form.

Other arithmetic operations such as addition, subtraction are unaffected by Montgomery form encoding. But the modular inverse computation

a1modq must be adapted to account for Montgomery form. We do not discuss modular inversion in this article.

Implementation

For security purposes, cryptographic protocols use large moduli-

a,b and
q
are stored on multiple machine words (multi-precision). In this article we let
D
denote the basis in which integers are represented. (For example,
D=264
if a word is 64 bits). A large number
n
can be represented by its base-
D
digits
n0,,nm
stored in machine words (uint):
n=i=0mniDi

There are several variations of multi-precision Montgomery multiplication. A popular choice is the Coarsely Integrated Operand Scanning (CIOS) variant [1]. In some settings, factors such as modulus size, CPU cache management, optimization techniques, architecture and available instruction set might favor other variants.

How fast is the CIOS method?

Let

N denote the number of machine words needed to store the modulus
q
. For example, if
q
is a 381-bit prime and the hardware has 64-bit word size then
N=6
. The CIOS method solves modular multiplication using
4N2+4N+2
unsigned integer additions and
2N2+N
unsigned integer multiplications.

Our optimization reduces the number of additions needed in CIOS Montgomery multiplication to only

4N2N, a savings of
5N+2
additions
. This optimization can be used whenever the highest bit of the modulus is zero (and not all of the remaining bits are set-see below for details).

The core of the state-of-the-art CIOS Montgomery multiplication is reproduced below. This listing is adapted from Section 2.3.2 of Tolga Acar's thesis. The symbols in this listing have the following meanings:

  • N is the number of machine words needed to store the modulus
    q
    .
  • D is the word size. For example, on a 64-bit architecture D is
    264
    .
  • a[i], b[i], q[i] is the ith word of the numbers
    a,b,q
    .
  • q'[0] is the lowest word of the number
    q1modR
    . This quantity is pre-computed, as it does not depend on the inputs
    a,b
    .
  • t is a temporary array of size
    N+2
  • C, S are machine words. A pair (C,S) refers to (hi-bits, lo-bits) of a two-word number
for i=0 to N-1 C := 0 for j=0 to N-1 (C,t[j]) := t[j] + a[j]*b[i] + C (t[N+1],t[N]) := t[N] + C C := 0 m := t[0]*q'[0] mod D (C,_) := t[0] + m*q[0] for j=1 to N-1 (C,t[j-1]) := t[j] + m*q[j] + C (C,t[N-1]) := t[N] + C t[N] := t[N+1] + C

We show that we can get rid of additions line 5 and 14 if the highest word of the modulus

q is at most
D121
.
This condition holds if and only if the highest bit of the modulus is zero and not all of the remaining bits are set.

Our optimization

Observe that lines 4 and 11 have the form

(hi,lo):=m1+m2B+m3 where
m1,m2,m3,B,hi,lo
are machine words, so that each is at most
D1
. If
BD121
then a simple calculation shows:
m1+m2B+m3(D1)+(D1)(D121)+(D1)=D(D12)hi+(D+121)lo
from which we observe:

Fact 1

If

BD121 then
hiD12
.


We leverage Fact 1 to prove the following:

Fact 2

If the highest word of

q is at most
D121
then the variables t[N], t[N+1] always store the value 0 at the beginning of each iteration of the outer i-loop.

This fact can be proven by induction.
The base case i=0 is trivial, since the t array is initialized to 0.
For the inductive step at iteration i suppose t[N]=t[N+1]=0 and trace the execution through the iteration.

Begin at the final iteration of the first inner loop (j=N-1) on line 4. Because

a<q and because the highest word of
q
is smaller than
D12
we may employ Fact 1 to see that the carry C is at most
D12
. Then line 5 sets

t[N] = C
t[N+1] = 0

A similar observation holds at the end of the second inner loop (j=N-1) on line 11: Fact 1 implies that the carry C is at most

D12. We previously observed that t[N] is also at most
D12
, so t[N] + C is at most
D12+D12=D1

which fits entirely into a single word. Then line 13 sets C to 0, line 14 sets t[N] to 0.
The proof by induction is now complete.


So here we are: at line 5, we no longer need the addition so we store C in a variable A, at line 13 we do

(_,t[N-1]) := A + C 

Final algorithm

For

N<12[1], we merge the two inner loops

for i=0 to N-1 (A,t[0]) := t[0] + a[0]*b[i] m := t[0]*q'[0] mod W C,_ := t[0] + m*q[0] for j=1 to N-1 (A,t[j]) := t[j] + a[j]*b[i] + A (C,t[j-1]) := t[j] + m*q[j] + C t[N-1] = C + A

Montgomery squaring

The condition on the modulus differs, here

q[N1]<=D141
The reasoning is similar and we end up with[1]

for i=0 to N-1 C, t[i] = a[i] * a[i] + t[i] p = 0 for j=i+1 to N-1 p,C,t[j] = 2*a[j]*a[i] + t[j] + (p,C) A = C m = t[0] * q'[0] C, _ = t[0] + q[0]*m for j=1 to N-1 C, t[j-1] = q[j]*m + t[j] + C t[N-1] = C + A

Benchmarks

goff (go finite field) will generate field arithmetic source code in Go for any given modulus.

In the context of gnark, our ZKP framework, we are particularly interested in fields from BLS12-381, BLS12-377 and BN256 (

N6).

goff has not been audited and is provided as-is, use at your own risk. In particular, goff makes no security guarantees such as constant time implementation or side-channel attack resistance.

All benchmarks hereafter are run on a 2016 MBP with a Intel Core i7 (2.9 GHz).

These results may be impacted by

  • available instruction set
  • CPU cache configuration
  • artifacts (thermal trottling, turbo boost, )
  • compiler optimizations

FIPS vs CIOS vs our optimization

First, we measure the impact of our algorithmic optimization against two existing methods (CIOS and FIPS).
To do so, we implemented all three methods in goff and generated code for modulus sizes between 127 bits and 2047 bits, that is for

N[2...32] and
D=264
.

We can see that for

N8, our optimization yields a ~5-18% improvement over the "schoolbook" CIOS method. Take these numbers with caution, as it's likely possible to further optimize the codebase, and we spend more time doing so in our implementation (NO_CARRY).


N 2 3 4 5 6 7 8
CIOS (ns/op) 10.7 18.2 29.1 46.1 61.7 84.5 109.6
NO_CARRY (ns/op) 10.2 16.6 25.3 37.6 50.8 72.4 94.3
improvement -4.7% -8.8% -13.1% -18.4% -17.7% -14.3% -14.0%

However, we notice as

N scales that performance gains becomes negligible, mostly because the economy of few additions don't weight much in front of CPU cache accesses.

We didn't invest time in making our method more efficient for

N12 as it's out of the scope of our main project.

goff vs others

It is difficult to precisely compare benchmarks between languages. Nonetheless, it appears that our Go implementation is slightly faster than existing state-of-the-art implementations.
Here are our measurements for the Montgomery modular multiplication.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
In our introduction to goff, we walk through some optimization techniques we used.
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →


Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
Authors: Gautam Botrel, Gus Gutoski, and Thomas Piellard

We’re a research team at ConsenSys. If you are interested in our work (fast finite field arithmetic, elliptic curve pairings, zero-knowledge proofs), give us a shout.