Multi-scalar Multiplication (MSM)

# Multi-scalar Multiplication (MSM) ## Goal Let $\mathbb{G}$ be a group of prime order $p=2^\lambda$. Let $g_0, ..., g_{N-1}$ be elements of $\mathbb{G}$ and let $e_0, ..., e_{N-1}$ be elements of $\mathbb{Z}_p$. Let $G = \sum_{i=0}^{N-1} e_i * g_i$. *Problem*: Given $g_0, ..., g_{N-1}$ and $e_0, ..., e_{N-1}$, compute G. ## Naive Approach Let $e_i = e_{i,\lambda-1}e_{i,\lambda-2}...e_{i,0}$, where each $e_{i,j} \in {0,1}$. We have $$e_i = e_{i,\lambda-1}*2^{\lambda-1} + e_{i,\lambda-2}*2^{\lambda-2} + \cdots + e_{i,0}*2^0$$ To compute $e_i * g_i$, we can compute $2^0*g_i$, $2^1*g_i$, ..., $2^{\lambda-1}*g_i$. Then, we simply add them up to get $e_i*g_i$. Here, we need $\lambda-1$ squarings and $\lambda-1$ additions. To computing $e_1*g_1$, $e_2*g_2$, ..., $e_{N-1}*g_{N-1}$, we need $N*(\lambda-1)$ squarings and $N*(\lambda-1)$ additions. We additionally need $N-1$ additions to sum $e_i*g_i$ into $G$. In total, we need $N*(\lambda-1)$ squarings and $N*\lambda -1$ additions. ## Pippenger Approach At a high level, Pippenger approach divides $\lambda$ bits into $num\_window$ $(=\frac{\lambda}{c})$ bit windows where each bit window has $c$ bits. $c$ is also called "window size". Part I: [computation for each window] Initialize a vector $window\_res = [0,0,...,0]$ of length $num\_window$. For the $w$-th window: * Initialize a vector $buckets = [0, 0, ..., 0]$ of length $num\_buckets = 2^c-1$. * Get the $start\_idx$ of the current window. Remember that we are considering a $c$-bit window out of $\lambda$-bit fields. * For each pair of $(e_i, g_i)$: * * Get the bits in scalar $e_i$ corresponding to the current window. Formally, we have $scalar = (e_i >> start\_idx) \% 2^c$. * * If $scalar \neq 0$, we have $buckets[scalar-1] +=g_i$. Recall that $buckets$ does not have a zero bucket. * Initialize $tmp = 0$ * For $j$ in $2^c-1$ to $1$: * * $tmp = tmp + buckets[j]$ * $window\_res[w] = tmp$ Part II: [Sum over all windows] * $lowest = window\_res[0]$ * $total = 0$ * For $w$ in $num\_window-1$ to $1$: * * $total += window\_res[w]$ * * for _ in 0..c: $total = total+total$ * $total = total + lowest$ $total$ is the final result. In Part I. For each window, we need $N + 2^c$ additions. Since we have $\frac{\lambda}{c}$ windows, we need $(N+2^c)*\frac{\lambda}{c}$ additions. In Part II, we need $1$ addition and $c$ squarings for each window. In total, we need $\frac{\lambda}{c}*(N+2^c+1)$ additions and $\lambda$ squarings. In Arkworks implementation, c is set to be $ln(N)+2$. For notation simplicity, let's set c to be $log2(N)$. Then, the computation complexity of Pippenger's algorithm is $\frac{2*N*\lambda}{log_2(N)}+1$ additions and $\lambda$ squarings. Since $\lambda$ is a constant for a elliptic curve, people usually say that Pippenger has a time complexity of $O(\frac{N}{log(N)})$. ## Optimization 1: Batch Affine for Point Addition Point addition is the major computation in Pippenger's algorithm. ### Math Formula for Point Addition Let's first talk about the formula for point addition. Consider two non-zero, non-symmetric points $P = (x_1, y_1)$ and $Q = (x_2, y_2)$. If $P$ and $Q$ are distinct ($x_1 \neq x_2$), we have: $$m = \frac{y_2 - y_1}{x_2 - x_1}$$ $$x_3 = m^2-x_1-x_2$$ $$y_3 = m*(x_1-x_3) - y_1$$ If $P=Q$, we need to set $m$ as: $$m = \frac{3*x_1^2 + a}{2*y_1}$$ ### Naive approach We can compute following the math formula. For point addition, we would need $1$ division, $2$ multiplications, and $6$ additions on finite fields. The most expensive part is the division when computing $\frac{1}{x_2-x_1}$. In finite field, computing division is expensive and requires Extended Euclidean Theorem. Computing the division consumes ~$\lambda$ (e.g., 254) operations, which are several orders of magnitudes larger than remaining computations (e.g., multiplications and additions). Thus, the major optimizations for point addition is to improve performance for division. ### Mixed-addition (projective) formula for Elliptic Curve with a=0 Consider a special curve $y^2 = x^3 + b$. Note that BLS12-381 follows this curve. Given a point $P=(x_1, y_1, z_1)$ in projective curve and a point $Q=(x_2, x_2, 1)$ in affine curve. We want to compute a point $R = (x_3, y_3, z_3) = P+Q$ We need **$7$ field multiplications, $4$ squarings, $9$ field additions, $3$ multiplication by $2$, and $1$ multiplication by $1$**. Here, multiplication by $2$ or $4$ is counted different from field multiplications since $2$ and $4$ are quite small and require much less computation than field multiplications with general field elements (e.g., 0x$123...987$). Computation [1]: Z1Z1 = Z12 U2 = X2*Z1Z1 S2 = Y2*Z1*Z1Z1 H = U2-X1 HH = H2 I = 4*HH J = H*I r = 2*(S2-Y1) V = X1*I X3 = r2-J-2*V Y3 = r*(V-X3)-2*Y1*J Z3 = (Z1+H)2-Z1Z1-HH ### Batch Affine Point addition becomes different when independently adding many points [2], especially in MSM. We may use only $1$ division for all point additions. In this way, we can save computation. Formally, consider a sequence of points $$P_1=(x_{1,1}, y_{1,1}), Q_1=(x_{1,2}, y_{1,2}), \cdots, P_n=(x_{n,1}, y_{n,1}), Q_n=(x_{n,2}, y_{n,2})$$ We want to compute $P_1+Q_1, P_2+Q_2, \cdots, P_n+Q_n$. Denote $a_i = x_{i,2}-x_{i,1}$. We have the following computation $$\forall i \in \{1,2, \cdots, n\}, d_i = \prod _{j=1}^{i-1}a_j$$ $$s = (d_n*a_n)^{-1}$$ $$\forall i \in \{1,2, \cdots, n\}, e_i = s*\prod_{j=i+1}^na_j$$ $$\forall i \in \{1,2, \cdots, n\}, r_i = d_i*e_i$$ We can easily check that $r_i = \frac{1}{a_i} = \frac{1}{x_{i,2}-x_{i,1}}$. Suppose we have a large number of independent point additions, the computation for $s$ term is amortized to be "free" since we only need to inverse once. So, we only need $3$ multiplications for $d_i$, $e_i$, and $r_i$. Finally, we need $6$ multiplications and $6$ additions for each point addition. ## Optimization 2: GLV with efficient endomorphism **Goal**: Accelerate a single point multiplication $k*P$, where $k$ is a finite field and $P$ is a point on an elliptic curve. Credit to [3,4] ### Part I: Properties under Endomorphism For short Weierstrass curves $y^2=x^2 + b\; mod \; r$, if there exists a cube root of unity mod r, we can take advantage of an endomorphism to decompose a 254 bit scalar into 2 128 bit scalars. Suppose $\beta$ = cube root of 1, mod q (q = order of Fq). $s$ = cube root of 1, mod r (r = order of fr) For a point $P_1 = (x_0,y_0)$, where $y^2 = x^3 + b$, we know that the point $P_2 = (x_0 * \beta, y_0)$ is also a point on the curve [Note: remember that $\beta^3=1$ and $(x_0*\beta)^3=x_0^3$]. For some elliptic curves (e.g., BN254), we can represent $P_2$ as a scalar multiplication of $P_1$, where $P_2 = s * P_1$. If we can find such as $s$, we can apply the following optimizations in Part II. For certain elliptic curves (e.g., BLS12-381), we cannot find such a $s$ and cannot apply the following optimizations. For a generic multiplication of $P_1$ by a 254 bit scalar $k$, we can decompose $k$ into two $127$-bit scalars $(k_1, k_2)$, such that $k = k_1 + (k_2 * s)$. We can now represent $(k*P_1)$ as $(k_1*P_1) + (k_2 * P_2)$, where $P_2 = (x * \beta, y)$. Here, $k_1$ and $k_2$ have half the bit length of $k$. This is the **key property** we want from endomorphism. This property leads to significant saving in the number of point doubling (see detailed discussion below in Part II). ### Part II: Accelerate with above properties #### *Naive* point multiplication Let's first consider a (kind of) *naive* point multiplication $k*P$, where $k$ is a $\lambda$-bit scalar and $P$ is an elliptic curve point. At a high level, this algorithm splits $\lambda$ bits into $w$-bit windows. **Algorithm 1:** 1. Compute $i*P$ for all $i$ in $[0,2^w-1]$. 2. Write $k = (k^{d-1}, ..., k^1, k^0)$ where each $k^i$ is a bit string of length $w$, and $d = ceil(\lambda/w)$ 3. $R \leftarrow O$ 4. For $i$ from $d-1$ down to $0$: 4.1 $R \leftarrow 2^w R$ 4.2 $R \leftarrow R + (k^i * P)$ 5. Return $R$. **Analysis:** Line 1 requires $2^w$ point additions. Line 4.1 requires $w$ point doubling. Line 4.2 requires 1 point addition. Considering the loop in Line 4, we need $dw \approx \lambda$ point doublings and $d=ceil(\lambda/w)$ point additions. In total, **algorithm 1 requires $\lambda$ point doublings and $2^w + ceil(\lambda/w)$ point additions.** Consider that 1 point doubling has almost the same cost as 1 point addition. The major cost here is the $\lambda$ point doublings. The reason is that $\lambda$ is $254$ and is much larger than $2^w + ceil(\lambda/w)$ when $w$ is small (e.g., $4$). #### Optimized point addition In Part I, we mention that, with endomorphism, we can split $\lambda$-bit scalar $k$ into two $\lambda/2$-bit scalars $k_1$ and $k_2$ such that $k = k_1 + (k_2*s)$ and $k*P = k_1*P + k_2*Q$. **Algorithm 2: Compute $k_1*P + k_2*Q$ simultaneously** 1. Compute $i*P+j*Q$ for all $i,j$ in $[0,2^w-1]$. 2. Write $k_1 = (k_1^{d'-1}, ..., k_1^1, k_1^0)$, $k_2 = (k_2^{d'-1}, ..., k_2^1, k_2^0)$ where each $k_1^i$ and $k_2^i$ is a bit string of length $w$, and $d' = ceil(\frac{\lambda}{2*w})$ 3. $R \leftarrow O$ 4. For $i$ from $d'-1$ down to $0$: 4.1 $R \leftarrow 2^w R$ 4.2 $R \leftarrow R + (k_1^i * P + k_2^i * Q)$. 5. Return $R$. **Analysis:** Line 1 requires $2^{w+1}$ point additions. Line 4.1 requires $w$ point doubling. Line 4.2 requires $1$ point addition. Considering the loop in Line 4, we need $d'w\approx\lambda/2$ point doublings and $d'=ceil(\frac{\lambda}{2*w})$ point additions. In total, we need $\lambda/2$ point doublings and $2^{w+1}+ceil(\frac{\lambda}{2*w})$ point additions. This optimized algorithm reduces the number of point doubling by half. ## Optimization3: windoed non-adjacent form (wnaf) TODO ## Reference [1] http://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-0.html#addition-add-2007-bl [2] Ariel Gabizon and Zachary J. Williamson (Aztec Protocol). Proposal: The Turbo-PLONK program syntax for specifying SNARK programs. https://docs.zkproof.org/pages/standards/accepted-workshop3/proposal-turbo_plonk.pdf [3] Robert P. Gallant, Robert J. Lambert, and Scott A. Vanstone. Faster Point Multiplication on Elliptic Curves with Efficient Endomorphisms. [4] Aztec implementation for GLV method on BN254. https://github.com/AztecProtocol/barretenberg/blob/f45c27ecd70a3c5607cf30ad5e36bcb4b7385b6d/barretenberg/src/aztec/ecc/fields/field.hpp#L212