Try   HackMD

Montgomery modular multiplication

Original version

Suppose

a,bZN, modular multiplication
[ab]
requires first multiply them in the range
[0,N22N+1]
, and then use Euclidean division theorem to get
ab=qN+r,q=ab/N
, such that
[ab]=r
. The division is quite expensive. Montgomery form is to choose a different divisor
R>N
with
gcd(R,N)=1
. For example, by choose
R=2n
, the division is just shift by
n
bits. Thus, we just want to make sure the numerator is divisible by
R
.

Montgomery form of

a is defined as
a¯=[aR]
, i.e. residual class of
aR
. We have:
gcd(R,N)=1R1:RR11N<R:RR1=NN+1

i.e.
N
is negative mod inverse of
N
w.r.t
R
. And we have multiplication by
R
is ring isomorphism. And:
T=[aR][bR]abRR[abR]R

Notice
T<NN<RN
. For any
0T<RN
, define reduction form of
T
as:
t=redc(T):=TR1(modN)

i.e.
ab=redc(a¯b¯)
. We use the following algorithm to do the reduction:

/// fast REDC algorithm. /// R is const with: gcd(R,N)=1 /// N' is const with: R*R^{-1} = N'*N + 1 /// assume input 0<=T<R*N /// To calculate [ab], (1) a'=[aR], b'=[bR] (2)[ab]=redc(a'*b') fn redc(T: u64) -> u64 { let u = T % R; // can skip let m = (u*N') % R; // or: m = (T*N') % R; let mut t = (T+m*N)/R; //here R|(T+m*N) t = if t >= N { t - N } else { t } t }

It's easy to see

tRT+mNT(modN). We still need to prove
R|(T+mN)
and
0T+mN<2N
, and this can be checked directly:
(TmodR)xTx(modR)T+mN=(T+((TmodR)kmodR)N)/R(modR)T+TkN=T(1+kN)0(modR)

Also notice that we can use:
[aR]=redc(aR2)
to fast calculate
[aR]
.

Multi-precision version

Represent integer with radix

b:
A=ar1br1++a1b+a0
. And assume
R=br
. Let's give a Montgomery algorithm with multiprecision.

fn redc(A,B,N,R) { let n0' = -n0^{-1} mod b; let mut T = A*B; // T updated every loop so that b^{i+1}|T for i in 0..(r-1) { let mi = ti*n0' mod b; // ti is the i-th word of latest T T = T+mi*N*b^i; } T }

Here we use a trick:
To calculate

mi, instead of using
N
which is negative mod inverse of
N
w.r.t
R=br
; we can use
n0n(modb)
, which is negative mod inverse of
n0
w.r.t.
b
.
The idea is that everytime we update
T
, we make
T
divisible by
bi+1,i{0,,n1}
, thus eventually by
R=br
.
We prove this by induction, suppose that
T(i)
is the value after loop
i
, i.e.
T(i)=AB+j=0imjNbj

When
j=0
:
T(0)=AB+m0N=AB+(AB)0n0(n0+n1b+)=(AB)0(1+n0n0)+b()

Now for
j=i
, first notice that first
i1
words of
T(i1)
is
0
by induction, thus we have:
T(i)=(0,,0,Ti(i1),)+miNbi=Ti(i1)bi+Ti(i1)n0n0bi+bi+1()=Ti(i1)bi(1+n0n0)+bi+1()

This finishs the proof.
From above algorithm and proof of the trick, we can immediately improve it by only adding the lowest non-zero word of
AB
in each loop.

fn redc(A,B,N,R) {
    let n0' = -n0^{-1} mod b;
    let mut T = 0;
    // T updated every loop so that b^{i+1}|T
    for i in 0..(r-1) {
        T = T+bi*A*b^i; // since we only need the lowest non-zero word
        let mi = ti*n0' mod b; // ti is the i-th word of latest T
        T = T+mi*N*b^i;
    }
    T
}

In practice, we will shift

T by one word at the end of each loop, to further optimize, which becomes the following:

// CIOS for i=0 to r-1 C := 0 // calculate T=T+B[i]*A for j=0 to r-1 (C,t[j]) := t[j] + a[j]*b[i] + C (t[r+1],t[r]) := t[r] + C C := 0 m := t[0]*N'[0] mod b // calculate T=T+m*N (C,_) := t[0] + m*N[0] for j=1 to r-1 (C,t[j-1]) := t[j] + m*N[j] + C //shift by one word here (C,t[r-1]) := t[r] + C t[r] := t[r+1] + C

Algorithm comparison (quote from Ph.D thesis of Tolga Acar, page 23. Here

s is our
r
above):

Method Multiplications Additions Reads Writes Space
SOS
2s2+s
4s2+4s+2
6s2+7s+3
2s2+6s+2
2s+2
CIOS(*)
2s2+s
4s2+4s+2
6s2+7s+2
2s2+5s+1
s+3
FIOS
2s2+s
5s2+3s+2
7s2+5s+2
3s2+4s+1
s+3
FIPS
2s2+s
6s2+2s+2
9s2+8s+2
5s2+8s+1
s+3
CIHS
2s2+s
4s2+4s+2
6.5s2+6.5s+2
3s2+5s+1
s+3

The best algorithm is the CIOS (Coarsely Integrated Operand Scanning) method.

A further improvement on CIOS is possible which reduces the number of additions needed in CIOS Montgomery multiplication to only

4r2r, a saving of
5r+2
additions. This optimization can be used whenever the highest bit of the modulus is zero and not all of the remaining bits are set. (see goff for details).

Derivation of CIOS algorithm

Assuming

b=2w,R=br, Montgomery reduction is to calculate
S:=xy2wr
. We have:
S=i=0r1xiy2(wi)r

We can write it as running sum:

S=0;
for i in 0..r-1 {
    S = (S + xi*y)*2^{-w}
}

To fast calculate

2w, we find
N
such that
S+xiy
with
S+xiy+qiN0mod2w
. i.e.
qi=(N1)(S+xiy)
. We only need make it true on first limb, i.e.
N=N1mod2wN=N0+N12w+NN1mod2wN0N01N0N01mod2wqi=N0(S0+xiy0)

for i in 0..r-1 {
    S[i] = 0;
}
for i in 0..r-1 {
    // calculate q_i
    (_, t') = S[0] +x[i]*y[0];
    (_, qi) = n0'*t';
    // carry
    (c, _) = t + qi*N[0];
    // calculate S+x[i]*y+qi*N
    for j in 1..r-1 {
        (c, S[j-1]) = S[j] + x[i]*y[j] + qi*N[j] + c;
    }
    S[r-1] = c;
}

The inner for loop will update

(S[0],,S[r1]) for each
x[i]y
. Notice that
S[0]=0mod2w
after our correction, we can drop
S[0]
. Then we shift
S[i]
to
S[i1]
for
i=1..n1
which is equivalent to divde by
2w
. It's clear that we have
2r2+r
multiplications instead of naive implementation of
3r2
.

Further optimization on wasm is to choose the correct limb size

w to reduce the number of carry operations. Check this for detail.

tags: public