# Montgomery modular multiplication
## Original version
Suppose $a,b\in \mathbb{Z}_N$, modular multiplication $[ab]$ requires first multiply them in the range $[0,N^2-2N+1]$, and then use Euclidean division theorem to get $ab=qN+r,q=\lfloor ab/N\rfloor$, such that $[ab]=r$. The division is quite expensive. Montgomery form is to choose a different divisor $R>N$ with $\gcd{(R, N)}=1$. For example, by choose $R=2^n$, the division is just shift by $n$ bits. Thus, we just want to make sure the numerator is divisible by $R$.
**Montgomery form** of $a$ is defined as $\bar{a}=[aR]$, i.e. residual class of $aR$. We have:
$$\gcd{(R,N)}=1\implies \exists R^{-1}:RR^{-1}\equiv 1\implies\exists N'<R:RR^{-1}=N'N+1$$
i.e. $N'$ is negative mod inverse of $N$ w.r.t $R$. And we have multiplication by $R$ is ring isomorphism. And:
$$T=[aR][bR]\equiv abR\cdot R\equiv [abR]\cdot R$$
Notice $T<N\cdot N< R*N$. For any $0\leq T < RN$, define reduction form of $T$ as:
$$t=redc(T):=TR^{-1}\pmod N$$
i.e. $\overline{ab}=redc(\bar{a}\bar{b})$. We use the following algorithm to do the reduction:
```rust=
/// fast REDC algorithm.
/// R is const with: gcd(R,N)=1
/// N' is const with: R*R^{-1} = N'*N + 1
/// assume input 0<=T<R*N
/// To calculate [ab], (1) a'=[aR], b'=[bR] (2)[ab]=redc(a'*b')
fn redc(T: u64) -> u64 {
let u = T % R; // can skip
let m = (u*N') % R; // or: m = (T*N') % R;
let mut t = (T+m*N)/R; //here R|(T+m*N)
t = if t >= N {
t - N
} else {
t
}
t
}
```
It's easy to see $tR\equiv T+mN\equiv T\pmod N$. We still need to prove $R|(T+mN)$ and $0\leq T+mN < 2N$, and this can be checked directly:
$$(T\,\mathrm{mod}\,R)x\equiv Tx\pmod R\\
\implies T+mN=(T+((T\,\mathrm{mod}\,R)k\,\mathrm{mod}\,R)N)/R \pmod R\\
\equiv T+TkN=T(1+kN)\equiv 0\pmod R
$$
Also notice that we can use: $[aR]=redc(aR^2)$ to fast calculate $[aR]$.
## Multi-precision version
Represent integer with radix $b$: $A=a_{r-1}b^{r-1}+\cdots+a_1b+a_0$. And assume $R=b^r$. Let's give a Montgomery algorithm with multiprecision.
```rust=
fn redc(A,B,N,R) {
let n0' = -n0^{-1} mod b;
let mut T = A*B;
// T updated every loop so that b^{i+1}|T
for i in 0..(r-1) {
let mi = ti*n0' mod b; // ti is the i-th word of latest T
T = T+mi*N*b^i;
}
T
}
```
Here we use a trick:
To calculate $m_i$, instead of using $N'$ which is negative mod inverse of $N$ w.r.t $R=b^r$; we can use $n'_0\equiv n \pmod b$, which is negative mod inverse of $n_0$ w.r.t. $b$.
The idea is that everytime we update $T$, we make $T$ divisible by $b^{i+1},i\in\{0,\cdots,n-1\}$, thus eventually by $R=b^r$.
We prove this by induction, suppose that $T^{(i)}$ is the value after loop $i$, i.e.
$$T^{(i)}=AB+\sum_{j=0}^{i}m_jNb^j$$
When $j=0$:
$$T^{(0)}=AB+m_0N=AB+(AB)_0n'_0(n_0+n_1b+\cdots)\\
=(AB)_0(1+n'_0n_0)+b\cdot(\cdots)$$
Now for $j=i$, first notice that first $i-1$ words of $T^{(i-1)}$ is $0$ by induction, thus we have:
$$T^{(i)}=(0,\cdots,0,T^{(i-1)}_i,\cdots)+m_iNb^i=T_i^{(i-1)}b^i+T_i^{(i-1)}n'_0n_0b^i+b^{i+1}(\cdots)\\
=T^{(i-1)}_ib^i(1+n'_0n_0)+b^{i+1}(\cdots)$$
This finishs the proof.
From above algorithm and proof of the trick, we can immediately improve it by only adding the lowest non-zero word of $AB$ in each loop.
```
fn redc(A,B,N,R) {
let n0' = -n0^{-1} mod b;
let mut T = 0;
// T updated every loop so that b^{i+1}|T
for i in 0..(r-1) {
T = T+bi*A*b^i; // since we only need the lowest non-zero word
let mi = ti*n0' mod b; // ti is the i-th word of latest T
T = T+mi*N*b^i;
}
T
}
```
In practice, we will shift $T$ by one word at the end of each loop, to further optimize, which becomes the following:
```rust=
// CIOS
for i=0 to r-1
C := 0
// calculate T=T+B[i]*A
for j=0 to r-1
(C,t[j]) := t[j] + a[j]*b[i] + C
(t[r+1],t[r]) := t[r] + C
C := 0
m := t[0]*N'[0] mod b
// calculate T=T+m*N
(C,_) := t[0] + m*N[0]
for j=1 to r-1
(C,t[j-1]) := t[j] + m*N[j] + C //shift by one word here
(C,t[r-1]) := t[r] + C
t[r] := t[r+1] + C
```
Algorithm comparison (quote from Ph.D thesis of Tolga Acar, page 23. Here $s$ is our $r$ above):
| Method | Multiplications | Additions | Reads | Writes | Space |
| --------| -------- | -------- |-------- |-------- |-------- |
| SOS | $2s^2+s$ | $4s^2+4s+2$ | $6s^2+7s+3$ |$2s^2+6s+2$ |$2s+2$ |
| CIOS(*) | $2s^2+s$ |$4s^2+4s+2$ |$6s^2+7s+2$ |$2s^2+5s+1$ |$s+3$ |
| FIOS |$2s^2+s$ |$5s^2+3s+2$ |$7s^2+5s+2$ |$3s^2+4s+1$ |$s+3$ |
| FIPS |$2s^2+s$ |$6s^2+2s+2$ |$9s^2+8s+2$ |$5s^2+8s+1$ |$s+3$ |
| CIHS |$2s^2+s$ |$4s^2+4s+2$ |$6.5s^2+6.5s+2$ |$3s^2+5s+1$ |$s+3$ |
The best algorithm is the CIOS (Coarsely Integrated Operand Scanning) method.
A further improvement on CIOS is possible which reduces the number of additions needed in CIOS Montgomery multiplication to only $4r^2-r$, a saving of $5r+2$ additions. This optimization can be used whenever the highest bit of the modulus is zero and not all of the remaining bits are set. (see [goff](https://hackmd.io/@gnark/modular_multiplication) for details).
## Derivation of CIOS algorithm
Assuming $b=2^w, R=b^r$, Montgomery reduction is to calculate $S:=x\cdot y \cdot 2^{-wr}$. We have:
$$S=\sum_{i=0}^{r-1} x_i\cdot y\cdot 2^{-(w-i)r}$$
We can write it as running sum:
```rust!
S=0;
for i in 0..r-1 {
S = (S + xi*y)*2^{-w}
}
```
To fast calculate $2^{-w}$, we find $N'$ such that $S+x_i\cdot y$ with $S+x_i\cdot y+q_i\cdot N\equiv 0 \mod 2^w$. i.e. $q_i=(-N^{-1})(S+x_i\cdot y)$. We only need make it true on first limb, i.e.
$$N' = -N^{-1} \mod 2^w\\
N'=N_0'+N_1'\cdot 2^w+\cdots\\
N\cdot N'\equiv -1 \mod 2^w \leftrightarrow N_0\cdot N'_0\equiv -1\leftrightarrow N_0'\equiv -N_0^{-1}\mod 2^w \\
q_i=N_0'\cdot (S_0+x_i\cdot y_0)
$$
```rust!
for i in 0..r-1 {
S[i] = 0;
}
for i in 0..r-1 {
// calculate q_i
(_, t') = S[0] +x[i]*y[0];
(_, qi) = n0'*t';
// carry
(c, _) = t + qi*N[0];
// calculate S+x[i]*y+qi*N
for j in 1..r-1 {
(c, S[j-1]) = S[j] + x[i]*y[j] + qi*N[j] + c;
}
S[r-1] = c;
}
```
The inner for loop will update $(S[0],\cdots,S[r-1])$ for each $x[i]*y$. Notice that $S[0] = 0 \mod 2^w$ after our correction, we can drop $S[0]$. Then we shift $S[i]$ to $S[i-1]$ for $i=1..n-1$ which is equivalent to divde by $2^w$. It's clear that we have $2r^2+r$ multiplications instead of naive implementation of $3r^2$.
Further optimization on wasm is to choose the correct limb size $w$ to reduce the number of carry operations. Check [this](https://github.com/mitschabaude/montgomery#13-x-30-bit-multiplication) for detail.
###### tags: `public`