## TLDR

If we look more closely at current MSM implementations for CPUs, such as [Arkworks](https://github.com/arkworks-rs/algebra/blob/master/ec/src/scalar_mul/variable_base/mod.rs#L175-L270), we can see that the algorithms follow the high-level strategy below:

1. Reduce one b-bit MSM to several c-bit MSMs, for a manageable c <= b
2. Use tricks to solve the c-bit MSMs
3. Combine the c-bit MSMs into the final b-bit MSM

## Accumulation phase

> Accumulate the buckets in a window-wise fashion, resulting in fewer threads with a larger workload each, maximizing the utilization of each thread.

This strategy is used in most MSM implementations for CPUs.

- bucket array: `B[window_size][bucket_size]`
- window size: the number of windows, i.e. the b-bit scalar length ceiling-divided by the c-bit chunk size: `ceil(b / c)`
- bucket size: the number of possible values of a c-bit scalar chunk, i.e. $2^c$ (the code below allocates only $2^c - 1$ buckets, since a zero chunk contributes nothing)

Accumulation, where $s[i][j]$ denotes the $j$-th c-bit chunk of the $i$-th scalar:

$B\big[ j \big]\big[ s[i][j] \big] \mathrel{+}= P[i]$

code from Arkworks:

```rust=
let window_sums: Vec<_> = ark_std::cfg_into_iter!(window_starts)
    .map(|w_start| {
        let mut res = zero;
        let mut buckets = vec![zero; (1 << c) - 1];
        scalars_and_bases_iter.clone().for_each(|(&scalar, base)| {
            if scalar == one {
                // We only process unit scalars once in the first window.
                if w_start == 0 {
                    res += base;
                }
            } else {
                let mut scalar = scalar;

                // We right-shift by w_start, thus getting rid of the
                // lower bits.
                scalar >>= w_start as u32;

                // We mod the remaining bits by 2^{window size}, thus taking `c` bits.
                let scalar = scalar.as_ref()[0] % (1 << c);

                // If the scalar is non-zero, we update the corresponding
                // bucket.
                // (Recall that `buckets` doesn't have a zero bucket.)
                if scalar != 0 {
                    buckets[(scalar - 1) as usize] += base;
                }
            }
        });
        ...
```

## Reduction phase

final window object: `W[window_size]`

Reduction:

$W[ j ] = \sum\limits_{k=1}^{bsize} k \cdot B[ j ][ k ]$

This weighted sum is implemented with a running sum of suffix sums, so each bucket needs only one group addition into the running sum.

code from Arkworks:
```rust=
        ...
        // Compute sum_{i in 0..num_buckets} (sum_{j in i..num_buckets} bucket[j])
        // This is computed below for b buckets, using 2b curve additions.
        //
        // We could first normalize `buckets` and then use mixed-addition
        // here, but that's slower for the kinds of groups we care about
        // (Short Weierstrass curves and Twisted Edwards curves).
        // In the case of Short Weierstrass curves,
        // mixed addition saves ~4 field multiplications per addition.
        // However normalization (with the inversion batched) takes ~6
        // field multiplications per element,
        // hence batch normalization is a slowdown.

        // `running_sum` = sum_{j in i..num_buckets} bucket[j],
        // where we iterate backward from i = num_buckets to 0.
        let mut running_sum = V::zero();
        buckets.into_iter().rev().for_each(|b| {
            running_sum += &b;
            res += &running_sum;
        });
        res
    })
    .collect();
...
```

## Compute MSM from reduced windows

MSM $= \sum\limits_{j=0}^{\text{wsize}-1} 2^{cj} \cdot W[j]$

Multiplying by $2^c$ between windows (implemented as $c$ doublings) recombines the c-bit window results back into the b-bit MSM result.

code from Arkworks:

```rust=
...
// We store the sum for the lowest window.
let lowest = *window_sums.first().unwrap();

// We're traversing windows from high to low.
lowest
    + &window_sums[1..]
        .iter()
        .rev()
        .fold(zero, |mut total, sum_i| {
            total += sum_i;
            for _ in 0..c {
                total.double_in_place();
            }
            total
        })
```
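To see the accumulation phase in isolation, here is a minimal sketch using plain `i128` addition as a stand-in for the elliptic-curve group operation; the function name `accumulate_window` and the integer "points" are illustrative assumptions, not Arkworks APIs:

```rust
/// Accumulate one c-bit window starting at bit `w_start`.
/// `buckets[k - 1]` ends up holding the sum of all points whose scalar
/// chunk equals k; a zero chunk contributes nothing, so there is no
/// zero bucket (matching the Arkworks layout).
fn accumulate_window(scalars: &[u64], points: &[i128], w_start: u32, c: u32) -> Vec<i128> {
    let mut buckets = vec![0i128; (1usize << c) - 1];
    for (&s, &p) in scalars.iter().zip(points) {
        // Take the c bits of the scalar starting at bit `w_start`.
        let chunk = (s >> w_start) & ((1u64 << c) - 1);
        if chunk != 0 {
            buckets[(chunk - 1) as usize] += p;
        }
    }
    buckets
}

fn main() {
    // Scalars 5 = 0b0101 and 3 = 0b0011, with c = 2.
    // Window 0 chunks: 5 -> 1, 3 -> 3, so bucket 1 gets 10 and bucket 3 gets 20.
    assert_eq!(accumulate_window(&[5, 3], &[10, 20], 0, 2), vec![10, 0, 20]);
    // Window 1 chunks: 5 -> 1, 3 -> 0, so only bucket 1 gets 10.
    assert_eq!(accumulate_window(&[5, 3], &[10, 20], 2, 2), vec![10, 0, 0]);
    println!("ok");
}
```

The same indexing logic appears in the Arkworks snippet as `scalar >>= w_start` followed by `scalar % (1 << c)`.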
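Putting the three phases together, the following is an end-to-end sketch over 64-bit scalars, again with `i128` addition in place of the curve group (so $c$ doublings become a left shift by $c$); `reduce_window` and `msm` are hypothetical names for illustration, not the Arkworks API:

```rust
/// Reduction phase: W = sum_{k=1}^{2^c - 1} k * B[k], computed with the
/// running-sum-of-suffix-sums trick: iterating from the highest bucket
/// down, bucket k stays inside `running_sum` for exactly k iterations,
/// so it is added into `res` exactly k times.
fn reduce_window(buckets: &[i128]) -> i128 {
    let mut running_sum = 0i128;
    let mut res = 0i128;
    for &b in buckets.iter().rev() {
        running_sum += b;
        res += running_sum;
    }
    res
}

/// Full MSM over 64-bit scalars with window width c.
fn msm(scalars: &[u64], points: &[i128], c: u32) -> i128 {
    let num_windows = (64 + c - 1) / c; // ceil(b / c) with b = 64
    let window_sums: Vec<i128> = (0..num_windows)
        .map(|j| {
            // Accumulation phase for window j.
            let mut buckets = vec![0i128; (1usize << c) - 1];
            for (&s, &p) in scalars.iter().zip(points) {
                let chunk = (s >> (j * c)) & ((1u64 << c) - 1);
                if chunk != 0 {
                    buckets[(chunk - 1) as usize] += p;
                }
            }
            reduce_window(&buckets)
        })
        .collect();
    // Recombination: from the highest window down, multiply by 2^c
    // (the integer analogue of doubling c times) and add the next window.
    window_sums.iter().rev().fold(0i128, |total, &w| (total << c) + w)
}

fn main() {
    let scalars = [3u64, 5, 1000];
    let points = [7i128, 11, -2];
    // Compare against the naive sum of scalar * point.
    let naive: i128 = scalars.iter().zip(&points).map(|(&s, &p)| s as i128 * p).sum();
    for c in [2u32, 4, 8] {
        assert_eq!(msm(&scalars, &points, c), naive);
    }
    println!("ok");
}
```

The fold in `msm` is the same Horner-style traversal from high window to low as in the Arkworks recombination code, with the shift standing in for the `double_in_place` loop.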