## TL;DR
If we look closely at current CPU MSM implementations such as [Arkworks](https://github.com/arkworks-rs/algebra/blob/master/ec/src/scalar_mul/variable_base/mod.rs#L175-L270), we can see that the algorithms follow the high-level strategy below:
1. Reduce one b-bit MSM to several c-bit MSMs, for a manageable chunk size c <= b
2. Use tricks (bucketing) to solve the c-bit MSMs
3. Combine the c-bit results into the final b-bit MSM
## Accumulation phase
> Accumulate the buckets in a window-wise fashion, resulting in fewer threads with a larger workload each, which maximizes the utilization of every thread. This is the approach used in most CPU MSM implementations.
- bucket: `B[window_size][bucket_size]`
- window size: $\lceil b / c \rceil$, i.e. the b-bit scalar length ceiling-divided by the c-bit chunk size
- bucket size: the number of possible non-zero chunk values, i.e. $2^c - 1$; each chunk's value (read from the scalar's bit representation) indexes its bucket, and zero chunks are skipped
Accumulation: $B\Big[ j \Big]\Big[ s [ i ][ j ] \Big] += P[ i ]$
code from Arkworks:
```rust=
let window_sums: Vec<_> = ark_std::cfg_into_iter!(window_starts)
    .map(|w_start| {
        let mut res = zero;
        let mut buckets = vec![zero; (1 << c) - 1];
        scalars_and_bases_iter.clone().for_each(|(&scalar, base)| {
            if scalar == one {
                // We only process unit scalars once in the first window.
                if w_start == 0 {
                    res += base;
                }
            } else {
                let mut scalar = scalar;
                // We right-shift by w_start, thus getting rid of the
                // lower bits.
                scalar >>= w_start as u32;
                // We mod the remaining bits by 2^{window size}, thus taking `c` bits.
                let scalar = scalar.as_ref()[0] % (1 << c);
                // If the scalar is non-zero, we update the corresponding
                // bucket.
                // (Recall that `buckets` doesn't have a zero bucket.)
                if scalar != 0 {
                    buckets[(scalar - 1) as usize] += base;
                }
            }
        });
        ...
```
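To see the accumulation step in isolation, here is a minimal sketch, not Arkworks code: it stands in `i128` integers under addition for curve points (so `+=` models point addition), and the window width `C = 4` and all function names are assumptions of this sketch.

```rust
// Toy model: "points" are plain i128 integers and the group operation is
// integer addition, so the bucket logic can be checked with ordinary arithmetic.
const C: u32 = 4; // c-bit chunk size (assumed for this sketch)

// Accumulate one window starting at bit `w_start`:
// B[chunk - 1] += P[i] for every non-zero c-bit chunk.
fn accumulate_window(scalars: &[u64], points: &[i128], w_start: u32) -> Vec<i128> {
    let mut buckets = vec![0i128; (1 << C) - 1]; // no bucket for chunk == 0
    for (&s, &p) in scalars.iter().zip(points) {
        let chunk = (s >> w_start) & ((1 << C) - 1); // take c bits of the scalar
        if chunk != 0 {
            buckets[(chunk - 1) as usize] += p;
        }
    }
    buckets
}

fn main() {
    // Scalars 3 and 5 land in buckets 2 and 4 of the lowest window.
    let buckets = accumulate_window(&[3, 5], &[10, 20], 0);
    assert_eq!(buckets[2], 10);
    assert_eq!(buckets[4], 20);
    println!("buckets = {:?}", buckets);
}
```

Replacing `i128` with a curve point type and `+=` with point addition recovers the real accumulation loop.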
## Reduction phase
final window object: W[window_size]
Reduction: $W[ j ] = \sum\limits_{k=1}^{2^c - 1} k \cdot{} B[ j ][ k ]$
Apply the sum-of-sums (running-sum) trick to implement this with additions only, avoiding the scalar multiplications by $k$.
code from Arkworks:
```rust=
        ...
        // Compute sum_{i in 0..num_buckets} (sum_{j in i..num_buckets} bucket[j])
        // This is computed below for b buckets, using 2b curve additions.
        //
        // We could first normalize `buckets` and then use mixed-addition
        // here, but that's slower for the kinds of groups we care about
        // (Short Weierstrass curves and Twisted Edwards curves).
        // In the case of Short Weierstrass curves,
        // mixed addition saves ~4 field multiplications per addition.
        // However normalization (with the inversion batched) takes ~6
        // field multiplications per element,
        // hence batch normalization is a slowdown.
        // `running_sum` = sum_{j in i..num_buckets} bucket[j],
        // where we iterate backward from i = num_buckets to 0.
        let mut running_sum = V::zero();
        buckets.into_iter().rev().for_each(|b| {
            running_sum += &b;
            res += &running_sum;
        });
        res
    })
    .collect();
...
```
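The running-sum trick can be checked on the same integer stand-in for the group; this is a sketch under that assumption, not Arkworks code.

```rust
// Toy model: buckets hold i128 "points" and integer addition stands in for
// point addition. Computes W = sum_{k=1}^{n} k * buckets[k-1] with 2n
// additions, i.e. without any scalar multiplications.
fn reduce_window(buckets: &[i128]) -> i128 {
    let mut res = 0i128;
    let mut running_sum = 0i128;
    for &b in buckets.iter().rev() {
        running_sum += b; // suffix sum: buckets[k-1] + ... + buckets[n-1]
        res += running_sum; // bucket k ends up counted exactly k times
    }
    res
}

fn main() {
    // 1 * 10 + 2 * 0 + 3 * 5 = 25
    let w = reduce_window(&[10, 0, 5]);
    assert_eq!(w, 25);
    println!("W = {w}");
}
```

Because bucket `k` is included in the last `k` running sums, it contributes exactly `k` times to the result, which is precisely the weighted sum the reduction formula asks for.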
## Compute MSM from reduced windows
MSM $= \sum\limits_{j=0}^{\text{wsize}-1} 2^{cj} \cdot{} W[j]$
Multiplying by $2^c$ per window (via $c$ repeated doublings) recombines the c-bit window results into the final b-bit MSM result.
code from Arkworks:
```rust=
...
// We store the sum for the lowest window.
let lowest = *window_sums.first().unwrap();
// We're traversing windows from high to low.
lowest
    + &window_sums[1..]
        .iter()
        .rev()
        .fold(zero, |mut total, sum_i| {
            total += sum_i;
            for _ in 0..c {
                total.double_in_place();
            }
            total
        })
```
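A sketch of this Horner-style recombination on the same integer stand-in (doubling is just multiplying by 2 here); the function name and window width are assumptions of the sketch.

```rust
// Toy model: window sums are i128 "points". Horner-style recombination:
// MSM = W[0] + 2^c * (W[1] + 2^c * (W[2] + ...)).
fn combine_windows(window_sums: &[i128], c: u32) -> i128 {
    let lowest = window_sums[0];
    lowest
        + window_sums[1..].iter().rev().fold(0i128, |mut total, &w| {
            total += w;
            for _ in 0..c {
                total *= 2; // stands in for total.double_in_place()
            }
            total
        })
}

fn main() {
    // With c = 4 and two windows: 3 + 2^4 * 2 = 35
    assert_eq!(combine_windows(&[3, 2], 4), 35);
    // With three windows: 1 + 2^4 * 2 + 2^8 * 3 = 801
    assert_eq!(combine_windows(&[1, 2, 3], 4), 801);
    println!("ok");
}
```

Folding from the highest window down means each window sum is doubled `c` times for every lower window it passes through, which yields exactly the $2^{cj}$ weights without ever computing a scalar multiplication.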