# Multi-scalar Multiplication (MSM)
## Goal
Let $\mathbb{G}$ be a group of prime order $p=2^\lambda$. Let $g_0, ..., g_{N-1}$ be elements of $\mathbb{G}$ and let $e_0, ..., e_{N-1}$ be elements of $\mathbb{Z}_p$. Let $G = \sum_{i=0}^{N-1} e_i * g_i$.
*Problem*: Given $g_0, ..., g_{N-1}$ and $e_0, ..., e_{N-1}$, compute G.
## Naive Approach
Let $e_i = e_{i,\lambda-1}e_{i,\lambda-2}...e_{i,0}$, where each $e_{i,j} \in {0,1}$.
We have
$$e_i = e_{i,\lambda-1}*2^{\lambda-1} + e_{i,\lambda-2}*2^{\lambda-2} + \cdots + e_{i,0}*2^0$$
To compute $e_i * g_i$, we can compute $2^0*g_i$, $2^1*g_i$, ..., $2^{\lambda-1}*g_i$. Then, we simply add them up to get $e_i*g_i$. Here, we need $\lambda-1$ squarings and $\lambda-1$ additions.
To computing $e_1*g_1$, $e_2*g_2$, ..., $e_{N-1}*g_{N-1}$, we need $N*(\lambda-1)$ squarings and $N*(\lambda-1)$ additions.
We additionally need $N-1$ additions to sum $e_i*g_i$ into $G$.
In total, we need $N*(\lambda-1)$ squarings and $N*\lambda -1$ additions.
## Pippenger Approach
At a high level, Pippenger approach divides $\lambda$ bits into $num\_window$ $(=\frac{\lambda}{c})$ bit windows where each bit window has $c$ bits. $c$ is also called "window size".
Part I: [computation for each window]
Initialize a vector $window\_res = [0,0,...,0]$ of length $num\_window$.
For the $w$-th window:
* Initialize a vector $buckets = [0, 0, ..., 0]$ of length $num\_buckets = 2^c-1$.
* Get the $start\_idx$ of the current window. Remember that we are considering a $c$-bit window out of $\lambda$-bit fields.
* For each pair of $(e_i, g_i)$:
* * Get the bits in scalar $e_i$ corresponding to the current window. Formally, we have $scalar = (e_i >> start\_idx) \% 2^c$.
* * If $scalar \neq 0$, we have $buckets[scalar-1] +=g_i$. Recall that $buckets$ does not have a zero bucket.
* Initialize $tmp = 0$
* For $j$ in $2^c-1$ to $1$:
* * $tmp = tmp + buckets[j]$
* $window\_res[w] = tmp$
Part II: [Sum over all windows]
* $lowest = window\_res[0]$
* $total = 0$
* For $w$ in $num\_window-1$ to $1$:
* * $total += window\_res[w]$
* * for _ in 0..c: $total = total+total$
* $total = total + lowest$
$total$ is the final result.
In Part I. For each window, we need $N + 2^c$ additions. Since we have $\frac{\lambda}{c}$ windows, we need $(N+2^c)*\frac{\lambda}{c}$ additions.
In Part II, we need $1$ addition and $c$ squarings for each window.
In total, we need $\frac{\lambda}{c}*(N+2^c+1)$ additions and $\lambda$ squarings.
In Arkworks implementation, c is set to be $ln(N)+2$.
For notation simplicity, let's set c to be $log2(N)$. Then, the computation complexity of Pippenger's algorithm is $\frac{2*N*\lambda}{log_2(N)}+1$ additions and $\lambda$ squarings.
Since $\lambda$ is a constant for a elliptic curve, people usually say that Pippenger has a time complexity of $O(\frac{N}{log(N)})$.
## Optimization 1: Batch Affine for Point Addition
Point addition is the major computation in Pippenger's algorithm.
### Math Formula for Point Addition
Let's first talk about the formula for point addition.
Consider two non-zero, non-symmetric points $P = (x_1, y_1)$ and $Q = (x_2, y_2)$.
If $P$ and $Q$ are distinct ($x_1 \neq x_2$), we have:
$$m = \frac{y_2 - y_1}{x_2 - x_1}$$
$$x_3 = m^2-x_1-x_2$$
$$y_3 = m*(x_1-x_3) - y_1$$
If $P=Q$, we need to set $m$ as:
$$m = \frac{3*x_1^2 + a}{2*y_1}$$
### Naive approach
We can compute following the math formula. For point addition, we would need $1$ division, $2$ multiplications, and $6$ additions on finite fields.
The most expensive part is the division when computing $\frac{1}{x_2-x_1}$. In finite field, computing division is expensive and requires Extended Euclidean Theorem. Computing the division consumes ~$\lambda$ (e.g., 254) operations, which are several orders of magnitudes larger than remaining computations (e.g., multiplications and additions). Thus, the major optimizations for point addition is to improve performance for division.
### Mixed-addition (projective) formula for Elliptic Curve with a=0
Consider a special curve $y^2 = x^3 + b$. Note that BLS12-381 follows this curve.
Given a point $P=(x_1, y_1, z_1)$ in projective curve and a point $Q=(x_2, x_2, 1)$ in affine curve. We want to compute a point $R = (x_3, y_3, z_3) = P+Q$
We need **$7$ field multiplications, $4$ squarings, $9$ field additions, $3$ multiplication by $2$, and $1$ multiplication by $1$**. Here, multiplication by $2$ or $4$ is counted different from field multiplications since $2$ and $4$ are quite small and require much less computation than field multiplications with general field elements (e.g., 0x$123...987$).
Computation [1]:
Z1Z1 = Z12
U2 = X2*Z1Z1
S2 = Y2*Z1*Z1Z1
H = U2-X1
HH = H2
I = 4*HH
J = H*I
r = 2*(S2-Y1)
V = X1*I
X3 = r2-J-2*V
Y3 = r*(V-X3)-2*Y1*J
Z3 = (Z1+H)2-Z1Z1-HH
### Batch Affine
Point addition becomes different when independently adding many points [2], especially in MSM. We may use only $1$ division for all point additions. In this way, we can save computation.
Formally, consider a sequence of points
$$P_1=(x_{1,1}, y_{1,1}), Q_1=(x_{1,2}, y_{1,2}), \cdots, P_n=(x_{n,1}, y_{n,1}), Q_n=(x_{n,2}, y_{n,2})$$
We want to compute $P_1+Q_1, P_2+Q_2, \cdots, P_n+Q_n$.
Denote $a_i = x_{i,2}-x_{i,1}$.
We have the following computation
$$\forall i \in \{1,2, \cdots, n\}, d_i = \prod _{j=1}^{i-1}a_j$$
$$s = (d_n*a_n)^{-1}$$
$$\forall i \in \{1,2, \cdots, n\}, e_i = s*\prod_{j=i+1}^na_j$$
$$\forall i \in \{1,2, \cdots, n\}, r_i = d_i*e_i$$
We can easily check that $r_i = \frac{1}{a_i} = \frac{1}{x_{i,2}-x_{i,1}}$. Suppose we have a large number of independent point additions, the computation for $s$ term is amortized to be "free" since we only need to inverse once. So, we only need $3$ multiplications for $d_i$, $e_i$, and $r_i$.
Finally, we need $6$ multiplications and $6$ additions for each point addition.
## Optimization 2: GLV with efficient endomorphism
**Goal**: Accelerate a single point multiplication $k*P$, where $k$ is a finite field and $P$ is a point on an elliptic curve.
Credit to [3,4]
### Part I: Properties under Endomorphism
For short Weierstrass curves $y^2=x^2 + b\; mod \; r$, if there exists a cube root of unity mod r, we can take advantage of an endomorphism to decompose a 254 bit scalar into 2 128 bit scalars.
Suppose $\beta$ = cube root of 1, mod q (q = order of Fq).
$s$ = cube root of 1, mod r (r = order of fr)
For a point $P_1 = (x_0,y_0)$, where $y^2 = x^3 + b$, we know that the point $P_2 = (x_0 * \beta, y_0)$ is also a point on the curve [Note: remember that $\beta^3=1$ and $(x_0*\beta)^3=x_0^3$]. For some elliptic curves (e.g., BN254), we can represent $P_2$ as a scalar multiplication of $P_1$, where $P_2 = s * P_1$. If we can find such as $s$, we can apply the following optimizations in Part II. For certain elliptic curves (e.g., BLS12-381), we cannot find such a $s$ and cannot apply the following optimizations.
For a generic multiplication of $P_1$ by a 254 bit scalar $k$, we can decompose $k$ into two $127$-bit scalars $(k_1, k_2)$, such that $k = k_1 + (k_2 * s)$. We can now represent $(k*P_1)$ as $(k_1*P_1) + (k_2 * P_2)$, where $P_2 = (x * \beta, y)$. Here, $k_1$ and $k_2$ have half the bit length of $k$. This is the **key property** we want from endomorphism. This property leads to significant saving in the number of point doubling (see detailed discussion below in Part II).
### Part II: Accelerate with above properties
#### *Naive* point multiplication
Let's first consider a (kind of) *naive* point multiplication $k*P$, where $k$ is a $\lambda$-bit scalar and $P$ is an elliptic curve point.
At a high level, this algorithm splits $\lambda$ bits into $w$-bit windows.
**Algorithm 1:**
1. Compute $i*P$ for all $i$ in $[0,2^w-1]$.
2. Write $k = (k^{d-1}, ..., k^1, k^0)$ where each $k^i$ is a bit string of length $w$, and $d = ceil(\lambda/w)$
3. $R \leftarrow O$
4. For $i$ from $d-1$ down to $0$:
4.1 $R \leftarrow 2^w R$
4.2 $R \leftarrow R + (k^i * P)$
5. Return $R$.
**Analysis:**
Line 1 requires $2^w$ point additions.
Line 4.1 requires $w$ point doubling. Line 4.2 requires 1 point addition. Considering the loop in Line 4, we need $dw \approx \lambda$ point doublings and $d=ceil(\lambda/w)$ point additions.
In total, **algorithm 1 requires $\lambda$ point doublings and $2^w + ceil(\lambda/w)$ point additions.** Consider that 1 point doubling has almost the same cost as 1 point addition. The major cost here is the $\lambda$ point doublings. The reason is that $\lambda$ is $254$ and is much larger than $2^w + ceil(\lambda/w)$ when $w$ is small (e.g., $4$).
#### Optimized point addition
In Part I, we mention that, with endomorphism, we can split $\lambda$-bit scalar $k$ into two $\lambda/2$-bit scalars $k_1$ and $k_2$ such that $k = k_1 + (k_2*s)$ and $k*P = k_1*P + k_2*Q$.
**Algorithm 2: Compute $k_1*P + k_2*Q$ simultaneously**
1. Compute $i*P+j*Q$ for all $i,j$ in $[0,2^w-1]$.
2. Write $k_1 = (k_1^{d'-1}, ..., k_1^1, k_1^0)$, $k_2 = (k_2^{d'-1}, ..., k_2^1, k_2^0)$ where each $k_1^i$ and $k_2^i$ is a bit string of length $w$, and $d' = ceil(\frac{\lambda}{2*w})$
3. $R \leftarrow O$
4. For $i$ from $d'-1$ down to $0$:
4.1 $R \leftarrow 2^w R$
4.2 $R \leftarrow R + (k_1^i * P + k_2^i * Q)$.
5. Return $R$.
**Analysis:**
Line 1 requires $2^{w+1}$ point additions.
Line 4.1 requires $w$ point doubling. Line 4.2 requires $1$ point addition. Considering the loop in Line 4, we need $d'w\approx\lambda/2$ point doublings and $d'=ceil(\frac{\lambda}{2*w})$ point additions.
In total, we need $\lambda/2$ point doublings and $2^{w+1}+ceil(\frac{\lambda}{2*w})$ point additions. This optimized algorithm reduces the number of point doubling by half.
## Optimization3: windoed non-adjacent form (wnaf)
TODO
## Reference
[1] http://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-0.html#addition-add-2007-bl
[2] Ariel Gabizon and Zachary J. Williamson (Aztec Protocol). Proposal: The Turbo-PLONK program syntax for specifying SNARK programs. https://docs.zkproof.org/pages/standards/accepted-workshop3/proposal-turbo_plonk.pdf
[3] Robert P. Gallant, Robert J. Lambert, and Scott A. Vanstone. Faster Point Multiplication on Elliptic Curves with Efficient Endomorphisms.
[4] Aztec implementation for GLV method on BN254. https://github.com/AztecProtocol/barretenberg/blob/f45c27ecd70a3c5607cf30ad5e36bcb4b7385b6d/barretenberg/src/aztec/ecc/fields/field.hpp#L212