Faster big-integer modular multiplication for most moduli

We discovered an optimization that reduces the number of operations needed to compute the modular multiplication of two big integers for most (but not quite all) choices of modulus.

Big-integer modular multiplication is a low-level component found in most modern cryptographic protocols including

Elliptic curve cryptography, such as
- Elliptic Curve Digital Signature Algorithm (ECDSA)
- Elliptic Curve Diffie-Hellman key exchange (ECDH)
- Elliptic Curve pairings
RSA cryptography
Zero-knowledge proofs

Some protocols (such as zero-knowledge proofs and elliptic curve pairings) invoke modular multiplication billions of times in a single execution. Other protocols (such as ECDSA or ECDH) are themselves invoked billions of times each second around the world. In both cases, even a tiny improvement in modular multiplication yields a very significant computational saving–-every nanosecond counts.

In our benchmarks, our implementation (goff) yields a 10–15% speed improvement over exisiting libraries on 64-bit architectures for moduli of bit-length 704 or less. The source code is freely available under an Apache 2.0 License.

We are not aware of any prior art describing this optimization, though it is possible that we missed something. If you’ve seen this optimization before then please do let us know!

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

goff is a tool we wrote to generate pure Go code (no assembly!) for fast modular arithmetic. For more details check out our goff announcement article.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Montgomery multiplication

The modular multiplication problem

Given integers

a

b

, and

q

the modular multiplication problem is to compute the remainder of the product

a * b \mod q

On computers a division operation is much slower than other operations such as multiplication. Thus, a naive implementation of

a b mod q

using a division operation is prohibitively slow.

In 1985, Montgomery introduced a method to avoid costly divisions. This method, now called Montgomery multiplication, is among the fastest solutions to the problem and it continues to enjoy widespread use in modern cryptography.

Overview of the solution: Montgomery multiplication

There are many good expositions of Montgomery multiplication–-here is one example. As such, we do not go into detail on the mathematics of Montgomery multiplication. Instead, this section is intended to establish notation that is used throughout this article.

The Montgomery multiplication algorithm does not directly compute

a b mod q

. Instead it computes

a b R^{- 1} mod q

for some carefully chosen number

R

called the Montgomery radix. Typically,

R

is set to the smallest power of two exceeding

q

that falls on a computer word boundary. For example, if

q

is 381 bits then

R = 2^{6 \times 64} = 2^{384}

on a 64-bit architecture.

In order to make use of Montgomery multiplication the numbers

a, b

must be encoded into Montgomery form: instead of storing

a, b

, we store the numbers

\tilde{a}, \tilde{b}

given by

\begin{aligned} \tilde{a} & = a R \mod q \\ \tilde{b} & = b R \mod q \end{aligned}

A simple calculation shows that Montgomery multiplication produces the product

a b mod q

, also encoded in Montgomery form:

(a R) (b R) R^{- 1} = a b R \mod q

The idea is that numbers are always stored in Montgomery form so as to avoid costly conversions to and from Montgomery form.

Other arithmetic operations such as addition, subtraction are unaffected by Montgomery form encoding. But the modular inverse computation

a^{- 1} mod q

must be adapted to account for Montgomery form. We do not discuss modular inversion in this article.

Implementation

For security purposes, cryptographic protocols use large moduli–-

a, b

and

q

are stored on multiple machine words (multi-precision). In this article we let

D

denote the basis in which integers are represented. (For example,

D = 2^{64}

if a word is 64 bits). A large number

n

can be represented by its base-

D

digits

n_{0}, \dots, n_{m}

stored in machine words (uint):

n = \sum_{i = 0}^{m} n_{i} D^{i}

There are several variations of multi-precision Montgomery multiplication. A popular choice is the Coarsely Integrated Operand Scanning (CIOS) variant ^[1]. In some settings, factors such as modulus size, CPU cache management, optimization techniques, architecture and available instruction set might favor other variants.

How fast is the CIOS method?

Let

N

denote the number of machine words needed to store the modulus

q

. For example, if

q

is a 381-bit prime and the hardware has 64-bit word size then

N = 6

. The CIOS method solves modular multiplication using

4 N^{2} + 4 N + 2

unsigned integer additions and

2 N^{2} + N

unsigned integer multiplications.

Our optimization reduces the number of additions needed in CIOS Montgomery multiplication to only

4 N^{2} - N

, a savings of
$5 N + 2$ additions. This optimization can be used whenever the highest bit of the modulus is zero (and not all of the remaining bits are set–-see below for details).

The core of the state-of-the-art CIOS Montgomery multiplication is reproduced below. This listing is adapted from Section 2.3.2 of Tolga Acar's thesis. The symbols in this listing have the following meanings:

N is the number of machine words needed to store the modulus
$q$ .
D is the word size. For example, on a 64-bit architecture D is
$2^{64}$ .
a[i], b[i], q[i] is the ith word of the numbers
$a, b, q$ .
q'[0] is the lowest word of the number
$- q^{- 1} mod R$ . This quantity is pre-computed, as it does not depend on the inputs
$a, b$ .
t is a temporary array of size
$N + 2$
C, S are machine words. A pair (C,S) refers to (hi-bits, lo-bits) of a two-word number














for i=0 to N-1
    C := 0
    for j=0 to N-1
      (C,t[j]) := t[j] + a[j]*b[i] + C
    (t[N+1],t[N]) := t[N] + C
  
    C := 0
    m := t[0]*q'[0] mod D
    (C,_) := t[0] + m*q[0]
    for j=1 to N-1
      (C,t[j-1]) := t[j] + m*q[j] + C

    (C,t[N-1]) := t[N] + C
    t[N] := t[N+1] + C

We show that we can get rid of additions line 5 and 14 if the highest word of the modulus

$q$ is at most
$\frac{D - 1}{2} - 1$ .
This condition holds if and only if the highest bit of the modulus is zero and not all of the remaining bits are set.

Our optimization

Observe that lines 4 and 11 have the form

(hi, lo) := m_{1} + m_{2} * B + m_{3}

where

m_{1}, m_{2}, m_{3}, B, hi, lo

are machine words, so that each is at most

D - 1

. If

B \leq \frac{D - 1}{2} - 1

then a simple calculation shows:

\begin{aligned} m_{1} + m_{2} B + m_{3} & \leq (D - 1) + (D - 1) (\frac{D - 1}{2} - 1) + (D - 1) \\ = D \underset{hi}{\underset{⏟}{(\frac{D - 1}{2})}} + \underset{lo}{\underset{⏟}{(\frac{D + 1}{2} - 1)}} \end{aligned}

from which we observe:

Fact 1

B \leq \frac{D - 1}{2} - 1

then

hi \leq \frac{D - 1}{2}

We leverage Fact 1 to prove the following:

Fact 2

If the highest word of

q

is at most

\frac{D - 1}{2} - 1

then the variables t[N], t[N+1] always store the value 0 at the beginning of each iteration of the outer i-loop.

This fact can be proven by induction.
The base case i=0 is trivial, since the t array is initialized to 0.
For the inductive step at iteration i suppose t[N]=t[N+1]=0 and trace the execution through the iteration.

Begin at the final iteration of the first inner loop (j=N-1) on line 4. Because

a < q

and because the highest word of

q

is smaller than

\frac{D - 1}{2}

we may employ Fact 1 to see that the carry C is at most

\frac{D - 1}{2}

. Then line 5 sets

t[N] = C
t[N+1] = 0

A similar observation holds at the end of the second inner loop (j=N-1) on line 11: Fact 1 implies that the carry C is at most

\frac{D - 1}{2}

. We previously observed that t[N] is also at most

\frac{D - 1}{2}

, so t[N] + C is at most

\frac{D - 1}{2} + \frac{D - 1}{2} = D - 1

which fits entirely into a single word. Then line 13 sets C to 0, line 14 sets t[N] to 0.
The proof by induction is now complete.

So here we are: at line 5, we no longer need the addition so we store C in a variable A, at line 13 we do

(_,t[N-1]) := A + C

Final algorithm

For

N < 12

^[1], we merge the two inner loops









for i=0 to N-1
    (A,t[0]) := t[0] + a[0]*b[i] 
    m := t[0]*q'[0] mod W
    C,_ := t[0] + m*q[0]
    for j=1 to N-1
        (A,t[j])  := t[j] + a[j]*b[i] + A
        (C,t[j-1]) := t[j] + m*q[j] + C
   
    t[N-1] = C + A

Montgomery squaring

The condition on the modulus differs, here

q [N - 1] <= \frac{D - 1}{4} - 1

The reasoning is similar and we end up with^[1]














 for i=0 to N-1
    C, t[i] = a[i] * a[i] + t[i]
    p = 0
    for j=i+1 to N-1
        p,C,t[j] = 2*a[j]*a[i] + t[j] + (p,C)

    A = C

    m = t[0] * q'[0]
    C, _ = t[0] + q[0]*m
    for j=1 to N-1
        C, t[j-1] = q[j]*m +  t[j] + C

    t[N-1] = C + A

Benchmarks

goff (go finite field) will generate field arithmetic source code in Go for any given modulus.

In the context of gnark, our ZKP framework, we are particularly interested in fields from BLS12-381, BLS12-377 and BN256 (

N \leq 6

goff has not been audited and is provided as-is, use at your own risk. In particular, goff makes no security guarantees such as constant time implementation or side-channel attack resistance.

All benchmarks hereafter are run on a 2016 MBP with a Intel Core i7 (2.9 GHz).

These results may be impacted by

available instruction set
CPU cache configuration
artifacts (thermal trottling, turbo boost, …)
compiler optimizations

FIPS vs CIOS vs our optimization

First, we measure the impact of our algorithmic optimization against two existing methods (CIOS and FIPS).
To do so, we implemented all three methods in goff and generated code for modulus sizes between 127 bits and 2047 bits, that is for

N \in [2. . .32]

and

D = 2^{64}

We can see that for

N \leq 8

, our optimization yields a ~5-18% improvement over the "schoolbook" CIOS method. Take these numbers with caution, as it's likely possible to further optimize the codebase, and we spend more time doing so in our implementation (NO_CARRY).

N	2	3	4	5	6	7	8
CIOS (ns/op)	10.7	18.2	29.1	46.1	61.7	84.5	109.6
NO_CARRY (ns/op)	10.2	16.6	25.3	37.6	50.8	72.4	94.3
improvement	-4.7%	-8.8%	-13.1%	-18.4%	-17.7%	-14.3%	-14.0%

However, we notice as

N

scales that performance gains becomes negligible, mostly because the economy of few additions don't weight much in front of CPU cache accesses.

We didn't invest time in making our method more efficient for

N \geq 12

as it's out of the scope of our main project.

`goff` vs others

It is difficult to precisely compare benchmarks between languages. Nonetheless, it appears that our Go implementation is slightly faster than existing state-of-the-art implementations.
~~Here are our measurements for the Montgomery modular multiplication.~~

Edit (Sept 2020): you will find up to date benchmarks here.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

In our introduction to goff, we walk through some optimization techniques we used.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Authors: Gautam Botrel, Gus Gutoski, and Thomas Piellard

We’re a research team at ConsenSys. If you are interested in our work (fast finite field arithmetic, elliptic curve pairings, zero-knowledge proofs), give us a shout.

Faster big-integer modular multiplication for most moduli

Montgomery multiplication

The modular multiplication problem

Overview of the solution: Montgomery multiplication

Implementation

How fast is the CIOS method?

Our optimization

Fact 1

Fact 2

Final algorithm

Montgomery squaring

Benchmarks

FIPS vs CIOS vs our optimization

goff vs others

Read more

What changes for BW6-761?

Yet another curve, but THE curve for your KZG!

BW6 over BLS12-381

Benchmarking pairing-friendly elliptic curves libraries

`goff` vs others