Montgomery modular multiplication

Original version

Suppose $a, b \in Z_{N}$. Computing the modular product $[a b]$ requires first multiplying them, which gives a value in the range $[0, N^{2} - 2 N + 1]$, and then using Euclidean division to get $a b = q N + r$ with $q = \lfloor a b / N \rfloor$, so that $[a b] = r$. The division is quite expensive. The Montgomery form chooses a different divisor $R > N$ with $\gcd(R, N) = 1$. For example, by choosing $R = 2^{n}$, the division is just a shift by $n$ bits. Thus, we just need to make sure the numerator is divisible by $R$.

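The Montgomery form of $a$ is defined as $\bar{a} = [a R]$, i.e. the residue class of $a R$. We have:

$$\gcd(R, N) = 1 \implies \exists R^{-1} : R R^{-1} \equiv 1 \pmod{N} \implies \exists N' < R : R R^{-1} = N' N + 1$$

i.e. $N'$ is the negative modular inverse of $N$ with respect to $R$: $N N' \equiv -1 \pmod{R}$. Multiplication by $R$ is a bijection on $Z_{N}$ that preserves addition, and it becomes a ring isomorphism when Montgomery forms are multiplied through the reduction defined below. For products we have:

$$T = [a R] [b R] \equiv a b R \cdot R \equiv [a b R] \cdot R \pmod{N}$$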

Notice that $T < N \cdot N < R N$. For any $0 \leq T < R N$, define the reduction of $T$ as:

$$t = \mathrm{redc}(T) := T R^{-1} \pmod{N}$$

i.e. $\overline{a b} = \mathrm{redc}(\bar{a} \bar{b})$. We use the following algorithm to do the reduction:

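/// Fast REDC algorithm. The constants below are small example values.
/// R is a constant with gcd(R, N) = 1.
/// N_PRIME is a constant with R*R^{-1} = N_PRIME*N + 1, i.e. N_PRIME = -N^{-1} mod R.
/// Assumes the input satisfies 0 <= t < R*N.
/// To calculate [ab]: (1) a' = [aR], b' = [bR]; (2) [abR] = redc(a'*b'); (3) [ab] = redc([abR]).
const R: u64 = 1 << 16;
const N: u64 = 65521; // odd, N < R
const N_PRIME: u64 = 61167; // N * N_PRIME = -1 mod R
fn redc(t: u64) -> u64 {
    let u = t % R; // can skip
    let m = (u * N_PRIME) % R; // or: m = (t * N_PRIME) % R
    let s = (t + m * N) / R; // exact division, since R | (t + m*N)
    // s < 2N, so one conditional subtraction brings it into [0, N)
    if s >= N {
        s - N
    } else {
        s
    }
}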

It's easy to see that $t R \equiv T + m N \equiv T \pmod{N}$. We still need to prove $R \mid (T + m N)$ and $0 \leq (T + m N) / R < 2 N$. The first can be checked directly, using $m = ((T \bmod R) N') \bmod R$ and $N N' \equiv -1 \pmod{R}$:

$$T + m N \equiv T + T N' N = T (1 + N' N) \equiv 0 \pmod{R}$$

For the second, $T < R N$ and $m < R$ give $T + m N < 2 R N$, so a single conditional subtraction of $N$ is enough.
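As a quick sanity check of the reduction, here is a small worked example. The numbers $N = 7$, $R = 8$ and $T = 20$ are chosen only for illustration and do not come from the text above.

```latex
\begin{aligned}
N &= 7,\quad R = 8,\quad N' = 1 \quad (N N' = 7 \equiv -1 \pmod{8}),\quad R^{-1} \equiv 1 \pmod{7} \\
T &= 20 < R N = 56 \\
m &= ((T \bmod R) \cdot N') \bmod R = (4 \cdot 1) \bmod 8 = 4 \\
T + m N &= 20 + 28 = 48 = 6 R \\
t &= 48 / 8 = 6 < N \\
T R^{-1} &\equiv 20 \cdot 1 \equiv 6 \pmod{7}
\end{aligned}
```

Both routes give $6$, and no final subtraction is needed in this case because $t < N$ already.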

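Also notice that we can use $[a R] = \mathrm{redc}(a \cdot [R^{2}])$, with $[R^{2}] = R^{2} \bmod N$ precomputed, to quickly calculate $[a R]$. The input stays in range because $a \cdot [R^{2}] < N^{2} < R N$.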
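The whole single-word flow can be sketched as follows. This is a minimal illustration, not a production implementation: it assumes $R = 2^{32}$ and an odd modulus below $2^{32}$, and the names `neg_inv`, `redc` and `mont_mul` are hypothetical helpers. $N'$ is computed with a Newton iteration, which is one common way to get an inverse modulo a power of two.

```rust
// Assumptions of this sketch: R = 2^32, n is odd and n < 2^32.
const R_BITS: u32 = 32;
const R_MASK: u64 = (1u64 << R_BITS) - 1;

/// N' = -n^{-1} mod 2^32 via Newton iteration (n must be odd).
fn neg_inv(n: u64) -> u64 {
    let mut inv: u64 = 1; // correct to 1 bit because n is odd
    for _ in 0..5 {
        // each step doubles the number of correct low bits: 1, 2, 4, 8, 16, 32
        inv = inv.wrapping_mul(2u64.wrapping_sub(n.wrapping_mul(inv)));
    }
    inv.wrapping_neg() & R_MASK
}

/// redc(t) = t * R^{-1} mod n, for 0 <= t < R * n.
fn redc(t: u64, n: u64, n_prime: u64) -> u64 {
    let m = ((t & R_MASK) * n_prime) & R_MASK; // m = (t mod R) * N' mod R
    // t + m*n can exceed 64 bits, so widen before the exact division by R
    let s = ((t as u128 + m as u128 * n as u128) >> R_BITS) as u64;
    if s >= n { s - n } else { s }
}

/// [ab] computed through Montgomery form; a, b < n.
fn mont_mul(a: u64, b: u64, n: u64) -> u64 {
    let n_prime = neg_inv(n);
    // [R^2]; in practice this is precomputed once per modulus
    let r2 = ((1u128 << (2 * R_BITS)) % n as u128) as u64;
    let a_bar = redc(a * r2, n, n_prime); // [aR]
    let b_bar = redc(b * r2, n, n_prime); // [bR]
    let ab_bar = redc(a_bar * b_bar, n, n_prime); // [abR]
    redc(ab_bar, n, n_prime) // back to [ab]
}

fn main() {
    let n: u64 = 4294967291; // odd modulus below 2^32, chosen for illustration
    let (a, b) = (123456789u64, 987654321u64);
    let expected = ((a as u128 * b as u128) % n as u128) as u64;
    assert_eq!(mont_mul(a, b, n), expected);
    println!("{} * {} mod {} = {}", a, b, n, mont_mul(a, b, n));
}
```

Converting in and out of Montgomery form costs three extra reductions here, so the approach pays off only when many multiplications are chained in Montgomery form, as in modular exponentiation.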

Multi-precision version

Represent an integer in radix $b$ as $A = a_{r - 1} b^{r - 1} + \dots + a_{1} b + a_{0}$, and assume $R = b^{r}$. Here is a multi-precision version of the Montgomery algorithm.

fn redc(A,B,N,R) {
    let n0' = -n0^{-1} mod b;
    let mut T = A*B;
    // T is updated in every iteration so that b^{i+1} | T
    for i in 0..r {
        let mi = ti*n0' mod b; // ti is the i-th word of the latest T
        T = T + mi*N*b^i;
    }
    T = T/R; // exact, since R = b^r divides T
    if T >= N { T - N } else { T }
}

Here we use a trick: to calculate $m_{i}$, instead of using $N'$, which is the negative modular inverse of $N$ with respect to $R = b^{r}$, we can use $n_{0}' \equiv -n_{0}^{-1} \pmod{b}$, which is the negative modular inverse of $n_{0}$ with respect to $b$.
The idea is that every time we update $T$, we make $T$ divisible by $b^{i + 1}$ for $i \in \{0, \dots, r - 1\}$, and thus eventually by $R = b^{r}$.
We prove this by induction. Suppose that $T^{(i)}$ is the value after loop $i$, i.e.

$$T^{(i)} = A B + \sum_{j = 0}^{i} m_{j} N b^{j}$$

When $i = 0$:

$$T^{(0)} = A B + m_{0} N = A B + (A B)_{0} n_{0}' (n_{0} + n_{1} b + \dots) = (A B)_{0} (1 + n_{0}' n_{0}) + b \cdot (\dots)$$

which is divisible by $b$ because $1 + n_{0}' n_{0} \equiv 0 \pmod{b}$ (reducing $m_{0}$ modulo $b$ only changes the $b \cdot (\dots)$ term).

Now for step $i$, first notice that words $0$ through $i - 1$ of $T^{(i - 1)}$ are $0$ by induction, thus we have:

$$T^{(i)} = (0, \dots, 0, T_{i}^{(i - 1)}, \dots) + m_{i} N b^{i} = T_{i}^{(i - 1)} b^{i} + T_{i}^{(i - 1)} n_{0}' n_{0} b^{i} + b^{i + 1} (\dots) = T_{i}^{(i - 1)} b^{i} (1 + n_{0}' n_{0}) + b^{i + 1} (\dots)$$

This finishes the proof.
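The induction can be checked numerically with a small sketch. The parameters here are assumptions made only so that everything fits in a `u128`: radix $b = 2^{8}$ and $r = 4$ words, so $R = 2^{32}$. The function names are hypothetical, and $n_{0}'$ is found by brute force because $b$ is tiny.

```rust
const W: u32 = 8; // bits per word, so b = 2^8 (assumption of this sketch)
const WORDS: u32 = 4; // r = 4, so R = b^r = 2^32

/// n0' = -n0^{-1} mod b, by brute force over the 255 candidates.
fn neg_inv_word(n0: u128) -> u128 {
    let b = 1u128 << W;
    (1..b).find(|x| (n0 * x + 1) % b == 0).expect("N must be odd")
}

/// Returns A * B * R^{-1} mod N for A, B < N < R and N odd.
fn redc_words(a: u128, b_val: u128, n: u128) -> u128 {
    let b = 1u128 << W;
    let n0_prime = neg_inv_word(n % b);
    let mut t = a * b_val;
    for i in 0..WORDS {
        let t_i = (t >> (W * i)) % b; // i-th word of the latest T
        let m_i = (t_i * n0_prime) % b;
        t += m_i * n * (1u128 << (W * i)); // clears word i of T
        // the invariant from the induction: b^{i+1} divides T
        assert_eq!(t % (1u128 << (W * (i + 1))), 0);
    }
    let s = t >> (W * WORDS); // exact division by R
    if s >= n { s - n } else { s }
}

fn main() {
    let n: u128 = 0xF123_4567; // odd modulus below 2^32, chosen for illustration
    let (a, b) = (0x1234_5678u128, 0x0ABC_DEF1u128);
    let s = redc_words(a, b, n);
    // s is A*B*R^{-1} mod N, so multiplying back by R must give A*B mod N
    assert_eq!((s << 32) % n, (a * b) % n);
    println!("redc = {:#x}", s);
}
```

The `assert_eq!` inside the loop is exactly the statement $b^{i+1} \mid T^{(i)}$, so the program would panic if the word-sized inverse $n_{0}'$ did not do the job of the full $N'$.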
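From the above algorithm and the proof of the trick, we can immediately improve it by interleaving the multiplication with the reduction: $m_{i}$ depends only on word $i$ of the current $T$, so each loop only needs to add one partial product $b_{i} A b^{i}$ of $A B$ instead of starting from the full product.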

fn redc(A,B,N,R) {
    let n0' = -n0^{-1} mod b;
    let mut T = 0;
    // T is updated in every iteration so that b^{i+1} | T
    for i in 0..r {
        T = T + bi*A*b^i; // bi is the i-th word of B; only this partial product is needed so far
        let mi = ti*n0' mod b; // ti is the i-th word of the latest T
        T = T + mi*N*b^i;
    }
    T = T/R; // exact, since R = b^r divides T
    if T >= N { T - N } else { T }
}

In practice, we also shift $T$ by one word at the end of each loop as a further optimization, which gives the following:

// CIOS
// t has r+2 words and starts at 0; in "mod b", b is the radix, while b[i] is a word of B
for i=0 to r-1
    C := 0
    // calculate T=T+B[i]*A
    for j=0 to r-1
        (C,t[j]) := t[j] + a[j]*b[i] + C
    (t[r+1],t[r]) := t[r] + C

    C := 0
    m := t[0]*N'[0] mod b
    // calculate T=(T+m*N)/b
    (C,_) := t[0] + m*N[0]
    for j=1 to r-1
        (C,t[j-1]) := t[j] + m*N[j] + C // shift by one word here
    (C,t[r-1]) := t[r] + C
    t[r] := t[r+1] + C
// final step: if t >= N then t := t - N
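The CIOS pseudocode above can be turned into a runnable sketch as follows. The limb count (`LIMBS = 2`), the 32-bit limb size and all function names are assumptions made for illustration; two limbs are used only so that the result can be checked against plain `u128` arithmetic. The final conditional subtraction is included explicitly.

```rust
const LIMBS: usize = 2; // r, with radix b = 2^32 (assumptions of this sketch)

/// -n0^{-1} mod 2^32 via Newton iteration (n0 must be odd).
fn neg_inv_u32(n0: u32) -> u32 {
    let mut inv: u32 = 1;
    for _ in 0..5 {
        inv = inv.wrapping_mul(2u32.wrapping_sub(n0.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// CIOS: returns a * b * R^{-1} mod n with R = 2^(32 * LIMBS).
/// Limbs are little-endian; n is odd; a, b < n; n0_inv = -n[0]^{-1} mod 2^32.
fn mont_mul_cios(a: &[u32; LIMBS], b: &[u32; LIMBS], n: &[u32; LIMBS], n0_inv: u32) -> [u32; LIMBS] {
    let mut t = [0u32; LIMBS + 2];
    for i in 0..LIMBS {
        // t = t + a * b[i]
        let mut c: u64 = 0;
        for j in 0..LIMBS {
            let s = t[j] as u64 + a[j] as u64 * b[i] as u64 + c;
            t[j] = s as u32;
            c = s >> 32;
        }
        let s = t[LIMBS] as u64 + c;
        t[LIMBS] = s as u32;
        t[LIMBS + 1] = (s >> 32) as u32;

        // t = (t + m * n) / 2^32, with m chosen so that the low word cancels
        let m = t[0].wrapping_mul(n0_inv) as u64;
        let mut c = (t[0] as u64 + m * n[0] as u64) >> 32;
        for j in 1..LIMBS {
            let s = t[j] as u64 + m * n[j] as u64 + c;
            t[j - 1] = s as u32; // shift by one word here
            c = s >> 32;
        }
        let s = t[LIMBS] as u64 + c;
        t[LIMBS - 1] = s as u32;
        t[LIMBS] = t[LIMBS + 1] + (s >> 32) as u32;
    }
    // final conditional subtraction: t < 2n, so subtract n at most once
    let mut d = [0u32; LIMBS];
    let mut borrow: u64 = 0;
    for j in 0..LIMBS {
        let s = (t[j] as u64).wrapping_sub(n[j] as u64).wrapping_sub(borrow);
        d[j] = s as u32;
        borrow = s >> 63; // 1 if this limb subtraction wrapped around
    }
    if t[LIMBS] != 0 || borrow == 0 {
        d
    } else {
        let mut out = [0u32; LIMBS];
        out.copy_from_slice(&t[..LIMBS]);
        out
    }
}

fn to_limbs(x: u64) -> [u32; LIMBS] {
    [x as u32, (x >> 32) as u32]
}

fn from_limbs(l: &[u32; LIMBS]) -> u64 {
    l[0] as u64 | (l[1] as u64) << 32
}

fn main() {
    let n: u64 = 0xFFFF_FFFF_0000_0001; // odd 64-bit modulus, chosen for illustration
    let (a, b) = (0x1234_5678_9ABC_DEF0u64, 0x0FED_CBA9_8765_4321u64);
    let out = from_limbs(&mont_mul_cios(&to_limbs(a), &to_limbs(b), &to_limbs(n), neg_inv_u32(n as u32)));
    // out = a*b*R^{-1} mod n, so out * 2^64 must equal a*b modulo n
    assert_eq!(((out as u128) << 64) % n as u128, (a as u128 * b as u128) % n as u128);
    println!("{:#x}", out);
}
```

The extra word `t[LIMBS + 1]` exists only to hold the carry out of the multiplication pass; after the reduction pass the running value is below $2N$ again, so `t[LIMBS]` is at most 1.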

Algorithm comparison (quoted from the Ph.D. thesis of Tolga Acar, page 23; here $s$ is our $r$ above):

| Method | Multiplications | Additions | Reads | Writes | Space |
| --- | --- | --- | --- | --- | --- |
| SOS | $2 s^{2} + s$ | $4 s^{2} + 4 s + 2$ | $6 s^{2} + 7 s + 3$ | $2 s^{2} + 6 s + 2$ | $2 s + 2$ |
| CIOS(*) | $2 s^{2} + s$ | $4 s^{2} + 4 s + 2$ | $6 s^{2} + 7 s + 2$ | $2 s^{2} + 5 s + 1$ | $s + 3$ |
| FIOS | $2 s^{2} + s$ | $5 s^{2} + 3 s + 2$ | $7 s^{2} + 5 s + 2$ | $3 s^{2} + 4 s + 1$ | $s + 3$ |
| FIPS | $2 s^{2} + s$ | $6 s^{2} + 2 s + 2$ | $9 s^{2} + 8 s + 2$ | $5 s^{2} + 8 s + 1$ | $s + 3$ |
| CIHS | $2 s^{2} + s$ | $4 s^{2} + 4 s + 2$ | $6.5 s^{2} + 6.5 s + 2$ | $3 s^{2} + 5 s + 1$ | $s + 3$ |

The best algorithm is the CIOS (Coarsely Integrated Operand Scanning) method.

A further improvement on CIOS is possible, which reduces the number of additions needed in CIOS Montgomery multiplication to only $4 r^{2} - r$, a saving of $5 r + 2$ additions. This optimization can be used whenever the highest bit of the modulus is zero and not all of the remaining bits are set (see goff for details).

Derivation of CIOS algorithm

Assuming $b = 2^{w}$ and $R = b^{r}$, Montgomery multiplication computes $S := x \cdot y \cdot 2^{- w r} \pmod{N}$. Writing $x = \sum_{i} x_{i} 2^{w i}$, we have:

$$S = \sum_{i = 0}^{r - 1} x_{i} \cdot y \cdot 2^{- (r - i) w}$$

We can write it as a running sum:

S=0;
for i in 0..r {
    S = (S + xi*y)*2^{-w}; // mod N
}

To compute the multiplication by $2^{- w}$ quickly, we find $q_{i}$ such that $S + x_{i} \cdot y + q_{i} \cdot N \equiv 0 \mod 2^{w}$, which makes the division by $2^{w}$ exact, i.e. $q_{i} = (- N^{- 1}) (S + x_{i} \cdot y) \bmod 2^{w}$. We only need to make this true on the first limb, i.e.

$$\begin{aligned}
N' &= - N^{- 1} \bmod 2^{w} \\
N' &= N_{0}' + N_{1}' \cdot 2^{w} + \dots \\
N \cdot N' \equiv - 1 \bmod 2^{w} &\leftrightarrow N_{0} \cdot N_{0}' \equiv - 1 \leftrightarrow N_{0}' \equiv - N_{0}^{- 1} \bmod 2^{w} \\
q_{i} &= N_{0}' \cdot (S_{0} + x_{i} \cdot y_{0}) \bmod 2^{w}
\end{aligned}$$

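for i in 0..r {
    S[i] = 0;
}
for i in 0..r {
    // calculate q_i from the lowest limb; (hi, lo) is a double-word result
    (a, t) = S[0] + x[i]*y[0];
    (_, qi) = n0'*t;
    // the low word of t + qi*N[0] is zero, keep only the carry
    (c, _) = t + qi*N[0];
    // calculate (S + x[i]*y + qi*N) / 2^w with two carry chains a and c
    for j in 1..r {
        (a, S[j]) = S[j] + x[i]*y[j] + a;
        (c, S[j-1]) = S[j] + qi*N[j] + c;
    }
    S[r-1] = c + a; // no overflow when the highest bit of N is zero
}
// final reduction: if S >= N { S = S - N }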

The inner for loop updates $(S [0], \dots, S [r - 1])$ for each $x [i] \cdot y$. Notice that $S [0] \equiv 0 \mod 2^{w}$ after our correction, so we can drop $S [0]$. Then we shift $S [j]$ to $S [j - 1]$ for $j = 1, \dots, r - 1$, which is equivalent to dividing by $2^{w}$. It's clear that we have $2 r^{2} + r$ multiplications instead of the $3 r^{2}$ of a naive implementation.
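The running-sum view can be sketched directly on machine integers. This is an illustration under assumed parameters ($w = 16$, $r = 4$, so $R = 2^{64}$), with a hypothetical function name, and it keeps $S$ in a single `u128` instead of limbs so that the arithmetic is easy to follow. Note the final conditional subtraction: the loop only guarantees $S < 2N$.

```rust
const W: u32 = 16; // limb width w (assumption of this sketch)
const R_LIMBS: u32 = 4; // r, so R = 2^(w * r) = 2^64

/// Returns x * y * 2^{-64} mod n for x, y < n, n odd, using the running sum
/// S = (S + x_i * y + q_i * N) / 2^w.
fn mont_running_sum(x: u64, y: u64, n: u64) -> u64 {
    let (x, y, n) = (x as u128, y as u128, n as u128);
    let mask = (1u128 << W) - 1;
    // N0' = -N0^{-1} mod 2^w, by brute force since 2^w is small here
    let n0 = n & mask;
    let n0_prime = (1..=mask)
        .find(|q| ((n0 * q + 1) & mask) == 0)
        .expect("N must be odd");
    let mut s: u128 = 0;
    for i in 0..R_LIMBS {
        let x_i = (x >> (W * i)) & mask;
        let t = s + x_i * y;
        let q_i = ((t & mask) * n0_prime) & mask; // only the lowest limb matters
        s = (t + q_i * n) >> W; // exact: the lowest limb is zero
    }
    // the loop keeps s < 2n, so one conditional subtraction finishes the job
    if s >= n { (s - n) as u64 } else { s as u64 }
}

fn main() {
    let n: u64 = 0xFFFF_FFFF_0000_0001; // odd modulus, chosen for illustration
    let (x, y) = (0x1234_5678_9ABC_DEF0u64, 0x0FED_CBA9_8765_4321u64);
    let s = mont_running_sum(x, y, n) as u128;
    // s * R must equal x * y modulo n
    assert_eq!((s << 64) % n as u128, (x as u128 * y as u128) % n as u128);
    println!("{:#x}", s);
}
```

When $q_{i} = 0$ in every step, the loop adds no multiple of $N$ at all, so the result is correct only because of the exact divisions and the last comparison against $N$.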

A further optimization on wasm is to choose the limb size $w$ so as to reduce the number of carry operations. Check this for detail.

tags: `public`

Dmitry Kazakov

2025/01/10 12:54:31

The derivation of CIOS algorithm looks broken. When x[0]*y[0]=0 qi=0 then no modulo reduction happens. A numeric example: X=0x1000000000000000, Y=0x2000000000000000, N=0x70000000000000000000000000000001 This will result in: 0x200000000000000
