Suppose
Montgomery form of
i.e.
Notice
i.e.
/// fast REDC algorithm.
/// R is const with: gcd(R,N)=1
/// N' is const with: R*R^{-1} = N'*N + 1
/// assume input 0<=T<R*N
/// To calculate [ab], (1) a'=[aR], b'=[bR] (2)[ab]=redc(a'*b')
fn redc(T: u64) -> u64 {
let u = T % R; // can skip
let m = (u*N') % R; // or: m = (T*N') % R;
let mut t = (T+m*N)/R; //here R|(T+m*N)
t = if t >= N {
t - N
} else {
t
}
t
}
It's easy to see
Also notice that we can use:
Represent integer with radix
fn redc(A,B,N,R) {
let n0' = -n0^{-1} mod b;
let mut T = A*B;
// T updated every loop so that b^{i+1}|T
for i in 0..(r-1) {
let mi = ti*n0' mod b; // ti is the i-th word of latest T
T = T+mi*N*b^i;
}
T
}
Here we use a trick:
To calculate
The idea is that everytime we update
We prove this by induction, suppose that
When
Now for
This finishs the proof.
From above algorithm and proof of the trick, we can immediately improve it by only adding the lowest non-zero word of
fn redc(A,B,N,R) {
let n0' = -n0^{-1} mod b;
let mut T = 0;
// T updated every loop so that b^{i+1}|T
for i in 0..(r-1) {
T = T+bi*A*b^i; // since we only need the lowest non-zero word
let mi = ti*n0' mod b; // ti is the i-th word of latest T
T = T+mi*N*b^i;
}
T
}
In practice, we will shift
// CIOS
for i=0 to r-1
C := 0
// calculate T=T+B[i]*A
for j=0 to r-1
(C,t[j]) := t[j] + a[j]*b[i] + C
(t[r+1],t[r]) := t[r] + C
C := 0
m := t[0]*N'[0] mod b
// calculate T=T+m*N
(C,_) := t[0] + m*N[0]
for j=1 to r-1
(C,t[j-1]) := t[j] + m*N[j] + C //shift by one word here
(C,t[r-1]) := t[r] + C
t[r] := t[r+1] + C
Algorithm comparison (quote from Ph.D thesis of Tolga Acar, page 23. Here
Method | Multiplications | Additions | Reads | Writes | Space |
---|---|---|---|---|---|
SOS | |||||
CIOS(*) | |||||
FIOS | |||||
FIPS | |||||
CIHS |
The best algorithm is the CIOS (Coarsely Integrated Operand Scanning) method.
A further improvement on CIOS is possible which reduces the number of additions needed in CIOS Montgomery multiplication to only
Assuming
We can write it as running sum:
S=0;
for i in 0..r-1 {
S = (S + xi*y)*2^{-w}
}
To fast calculate
for i in 0..r-1 {
S[i] = 0;
}
for i in 0..r-1 {
// calculate q_i
(_, t') = S[0] +x[i]*y[0];
(_, qi) = n0'*t';
// carry
(c, _) = t + qi*N[0];
// calculate S+x[i]*y+qi*N
for j in 1..r-1 {
(c, S[j-1]) = S[j] + x[i]*y[j] + qi*N[j] + c;
}
S[r-1] = c;
}
The inner for loop will update
Further optimization on wasm is to choose the correct limb size
public