Montgomery modular multiplication

# Montgomery modular multiplication ## Original version Suppose $a,b\in \mathbb{Z}_N$, modular multiplication $[ab]$ requires first multiply them in the range $[0,N^2-2N+1]$, and then use Euclidean division theorem to get $ab=qN+r,q=\lfloor ab/N\rfloor$, such that $[ab]=r$. The division is quite expensive. Montgomery form is to choose a different divisor $R>N$ with $\gcd{(R, N)}=1$. For example, by choose $R=2^n$, the division is just shift by $n$ bits. Thus, we just want to make sure the numerator is divisible by $R$. **Montgomery form** of $a$ is defined as $\bar{a}=[aR]$, i.e. residual class of $aR$. We have: $$\gcd{(R,N)}=1\implies \exists R^{-1}:RR^{-1}\equiv 1\implies\exists N'<R:RR^{-1}=N'N+1$$ i.e. $N'$ is negative mod inverse of $N$ w.r.t $R$. And we have multiplication by $R$ is ring isomorphism. And: $$T=[aR][bR]\equiv abR\cdot R\equiv [abR]\cdot R$$ Notice $T<N\cdot N< R*N$. For any $0\leq T < RN$, define reduction form of $T$ as: $$t=redc(T):=TR^{-1}\pmod N$$ i.e. $\overline{ab}=redc(\bar{a}\bar{b})$. We use the following algorithm to do the reduction: ```rust= /// fast REDC algorithm. /// R is const with: gcd(R,N)=1 /// N' is const with: R*R^{-1} = N'*N + 1 /// assume input 0<=T<R*N /// To calculate [ab], (1) a'=[aR], b'=[bR] (2)[ab]=redc(a'*b') fn redc(T: u64) -> u64 { let u = T % R; // can skip let m = (u*N') % R; // or: m = (T*N') % R; let mut t = (T+m*N)/R; //here R|(T+m*N) t = if t >= N { t - N } else { t } t } ``` It's easy to see $tR\equiv T+mN\equiv T\pmod N$. We still need to prove $R|(T+mN)$ and $0\leq T+mN < 2N$, and this can be checked directly: $$(T\,\mathrm{mod}\,R)x\equiv Tx\pmod R\\ \implies T+mN=(T+((T\,\mathrm{mod}\,R)k\,\mathrm{mod}\,R)N)/R \pmod R\\ \equiv T+TkN=T(1+kN)\equiv 0\pmod R $$ Also notice that we can use: $[aR]=redc(aR^2)$ to fast calculate $[aR]$. ## Multi-precision version Represent integer with radix $b$: $A=a_{r-1}b^{r-1}+\cdots+a_1b+a_0$. And assume $R=b^r$. Let's give a Montgomery algorithm with multiprecision. ```rust= fn redc(A,B,N,R) { let n0' = -n0^{-1} mod b; let mut T = A*B; // T updated every loop so that b^{i+1}|T for i in 0..(r-1) { let mi = ti*n0' mod b; // ti is the i-th word of latest T T = T+mi*N*b^i; } T } ``` Here we use a trick: To calculate $m_i$, instead of using $N'$ which is negative mod inverse of $N$ w.r.t $R=b^r$; we can use $n'_0\equiv n \pmod b$, which is negative mod inverse of $n_0$ w.r.t. $b$. The idea is that everytime we update $T$, we make $T$ divisible by $b^{i+1},i\in\{0,\cdots,n-1\}$, thus eventually by $R=b^r$. We prove this by induction, suppose that $T^{(i)}$ is the value after loop $i$, i.e. $$T^{(i)}=AB+\sum_{j=0}^{i}m_jNb^j$$ When $j=0$: $$T^{(0)}=AB+m_0N=AB+(AB)_0n'_0(n_0+n_1b+\cdots)\\ =(AB)_0(1+n'_0n_0)+b\cdot(\cdots)$$ Now for $j=i$, first notice that first $i-1$ words of $T^{(i-1)}$ is $0$ by induction, thus we have: $$T^{(i)}=(0,\cdots,0,T^{(i-1)}_i,\cdots)+m_iNb^i=T_i^{(i-1)}b^i+T_i^{(i-1)}n'_0n_0b^i+b^{i+1}(\cdots)\\ =T^{(i-1)}_ib^i(1+n'_0n_0)+b^{i+1}(\cdots)$$ This finishs the proof. From above algorithm and proof of the trick, we can immediately improve it by only adding the lowest non-zero word of $AB$ in each loop. ``` fn redc(A,B,N,R) { let n0' = -n0^{-1} mod b; let mut T = 0; // T updated every loop so that b^{i+1}|T for i in 0..(r-1) { T = T+bi*A*b^i; // since we only need the lowest non-zero word let mi = ti*n0' mod b; // ti is the i-th word of latest T T = T+mi*N*b^i; } T } ``` In practice, we will shift $T$ by one word at the end of each loop, to further optimize, which becomes the following: ```rust= // CIOS for i=0 to r-1 C := 0 // calculate T=T+B[i]*A for j=0 to r-1 (C,t[j]) := t[j] + a[j]*b[i] + C (t[r+1],t[r]) := t[r] + C C := 0 m := t[0]*N'[0] mod b // calculate T=T+m*N (C,_) := t[0] + m*N[0] for j=1 to r-1 (C,t[j-1]) := t[j] + m*N[j] + C //shift by one word here (C,t[r-1]) := t[r] + C t[r] := t[r+1] + C ``` Algorithm comparison (quote from Ph.D thesis of Tolga Acar, page 23. Here $s$ is our $r$ above): | Method | Multiplications | Additions | Reads | Writes | Space | | --------| -------- | -------- |-------- |-------- |-------- | | SOS | $2s^2+s$ | $4s^2+4s+2$ | $6s^2+7s+3$ |$2s^2+6s+2$ |$2s+2$ | | CIOS(*) | $2s^2+s$ |$4s^2+4s+2$ |$6s^2+7s+2$ |$2s^2+5s+1$ |$s+3$ | | FIOS |$2s^2+s$ |$5s^2+3s+2$ |$7s^2+5s+2$ |$3s^2+4s+1$ |$s+3$ | | FIPS |$2s^2+s$ |$6s^2+2s+2$ |$9s^2+8s+2$ |$5s^2+8s+1$ |$s+3$ | | CIHS |$2s^2+s$ |$4s^2+4s+2$ |$6.5s^2+6.5s+2$ |$3s^2+5s+1$ |$s+3$ | The best algorithm is the CIOS (Coarsely Integrated Operand Scanning) method. A further improvement on CIOS is possible which reduces the number of additions needed in CIOS Montgomery multiplication to only $4r^2-r$, a saving of $5r+2$ additions. This optimization can be used whenever the highest bit of the modulus is zero and not all of the remaining bits are set. (see [goff](https://hackmd.io/@gnark/modular_multiplication) for details). ## Derivation of CIOS algorithm Assuming $b=2^w, R=b^r$, Montgomery reduction is to calculate $S:=x\cdot y \cdot 2^{-wr}$. We have: $$S=\sum_{i=0}^{r-1} x_i\cdot y\cdot 2^{-(w-i)r}$$ We can write it as running sum: ```rust! S=0; for i in 0..r-1 { S = (S + xi*y)*2^{-w} } ``` To fast calculate $2^{-w}$, we find $N'$ such that $S+x_i\cdot y$ with $S+x_i\cdot y+q_i\cdot N\equiv 0 \mod 2^w$. i.e. $q_i=(-N^{-1})(S+x_i\cdot y)$. We only need make it true on first limb, i.e. $$N' = -N^{-1} \mod 2^w\\ N'=N_0'+N_1'\cdot 2^w+\cdots\\ N\cdot N'\equiv -1 \mod 2^w \leftrightarrow N_0\cdot N'_0\equiv -1\leftrightarrow N_0'\equiv -N_0^{-1}\mod 2^w \\ q_i=N_0'\cdot (S_0+x_i\cdot y_0) $$ ```rust! for i in 0..r-1 { S[i] = 0; } for i in 0..r-1 { // calculate q_i (_, t') = S[0] +x[i]*y[0]; (_, qi) = n0'*t'; // carry (c, _) = t + qi*N[0]; // calculate S+x[i]*y+qi*N for j in 1..r-1 { (c, S[j-1]) = S[j] + x[i]*y[j] + qi*N[j] + c; } S[r-1] = c; } ``` The inner for loop will update $(S[0],\cdots,S[r-1])$ for each $x[i]*y$. Notice that $S[0] = 0 \mod 2^w$ after our correction, we can drop $S[0]$. Then we shift $S[i]$ to $S[i-1]$ for $i=1..n-1$ which is equivalent to divde by $2^w$. It's clear that we have $2r^2+r$ multiplications instead of naive implementation of $3r^2$. Further optimization on wasm is to choose the correct limb size $w$ to reduce the number of carry operations. Check [this](https://github.com/mitschabaude/montgomery#13-x-30-bit-multiplication) for detail. ###### tags: `public`

Read more

Barrett Reduction

Lookup argument for folding

Plonk By Hand

MACI v1.0: top-up voice credits