# MSM Analysis

## Slowdown in the Lurk MSM

When running the Lurk pipeline, we noticed that the MSM was taking longer than expected compared to the benchmarks on `pasta-msm`. Each folding step performs two large MSMs: the commitment to the witness and the commitment to the cross-term `T`. In both cases, the individual timings are much slower than expected.

To isolate the issue, we set up the following test. First, we generated a test run of Lurk and recorded all of the witnesses and cross-terms by serializing them to disk. Then, we deserialized these test "`lurkrs`" vectors in `pasta-msm` and benchmarked them against randomly generated "`random`" vectors of the same length. The benchmarks are below.

> NOTE: All benchmarks shown below were run on an NVIDIA RTX A6000 GPU.

The table contains 16 test vectors, labeled by:

- either "`witness`" or "`T`" (meaning cross-term);
- a number (1-30), which is a unique ID to keep the vectors distinct; the numbers are not contiguous because we have filtered out small MSMs of length < 100,000;
- the length of the vector.

|                                      | `random`                   | `lurkrs`                           |
|:-------------------------------------|:---------------------------|:-----------------------------------|
| **`witness 1, 7941351 scalars`**     | `119.79 ms` (✅ **1.00x**)  | `552.09 ms` (❌ *4.60x slower*)     |
| **`T 2, 9699051 scalars`**           | `144.74 ms` (✅ **1.00x**)  | `143.87 ms` (✅ **1.00x faster**)   |
| **`witness 5, 7941351 scalars`**     | `118.40 ms` (✅ **1.00x**)  | `560.24 ms` (❌ *4.74x slower*)     |
| **`T 6, 9699051 scalars`**           | `143.99 ms` (✅ **1.00x**)  | `323.89 ms` (❌ *2.25x slower*)     |
| **`witness 9, 7941351 scalars`**     | `120.42 ms` (✅ **1.00x**)  | `561.95 ms` (❌ *4.67x slower*)     |
| **`T 10, 9699051 scalars`**          | `143.02 ms` (✅ **1.00x**)  | `350.17 ms` (❌ *2.44x slower*)     |
| **`witness 13, 7941351 scalars`**    | `111.75 ms` (✅ **1.00x**)  | `560.65 ms` (❌ *5.04x slower*)     |
| **`T 14, 9699051 scalars`**          | `119.85 ms` (✅ **1.00x**)  | `468.23 ms` (❌ *3.93x slower*)     |
| **`witness 17, 7941351 scalars`**    | `119.13 ms` (✅ **1.00x**)  | `560.10 ms` (❌ *4.66x slower*)     |
| **`T 18, 9699051 scalars`**          | `143.52 ms` (✅ **1.00x**)  | `564.23 ms` (❌ *3.94x slower*)     |
| **`witness 21, 7941351 scalars`**    | `121.13 ms` (✅ **1.00x**)  | `558.80 ms` (❌ *4.65x slower*)     |
| **`T 22, 9699051 scalars`**          | `145.63 ms` (✅ **1.00x**)  | `614.23 ms` (❌ *4.23x slower*)     |
| **`witness 25, 7941351 scalars`**    | `118.84 ms` (✅ **1.00x**)  | `557.14 ms` (❌ *4.72x slower*)     |
| **`T 26, 9699051 scalars`**          | `141.11 ms` (✅ **1.00x**)  | `679.76 ms` (❌ *4.81x slower*)     |
| **`witness 29, 7941351 scalars`**    | `119.40 ms` (✅ **1.00x**)  | `557.92 ms` (❌ *4.68x slower*)     |
| **`T 30, 9699051 scalars`**          | `142.80 ms` (✅ **1.00x**)  | `702.32 ms` (❌ *4.94x slower*)     |

We conclude that there is a significant slowdown in the Lurk MSM pipeline.
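For reference, here is a minimal sketch of the shape of this comparison, not the actual benchmark harness: it assumes `pasta_curves` types, and the `msm` closure is a stand-in for whichever GPU entry point is under test, not a real `pasta-msm` API name.

```rust
use ff::Field;
use pasta_curves::pallas;
use rand::rngs::OsRng;
use std::time::Instant;

/// Time a recorded "lurkrs" scalar vector against a freshly sampled random
/// vector of the same length, using the same bases and the same MSM routine.
/// `msm` is a placeholder for the GPU entry point being benchmarked.
fn compare_lurkrs_vs_random<F>(
    bases: &[pallas::Affine],
    lurkrs_scalars: &[pallas::Scalar],
    msm: F,
) where
    F: Fn(&[pallas::Affine], &[pallas::Scalar]) -> pallas::Point,
{
    assert_eq!(bases.len(), lurkrs_scalars.len());

    // Random scalars of the same length as the recorded vector.
    let random_scalars: Vec<pallas::Scalar> = (0..lurkrs_scalars.len())
        .map(|_| pallas::Scalar::random(OsRng))
        .collect();

    let start = Instant::now();
    let _ = msm(bases, &random_scalars);
    println!("random: {:?}", start.elapsed());

    let start = Instant::now();
    let _ = msm(bases, lurkrs_scalars);
    println!("lurkrs: {:?}", start.elapsed());
}
```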
## Looking at the data

We analyzed the distribution and statistics of the `lurkrs` vectors and collected them in [visual plots](https://github.com/lurk-lab/pasta-msm/tree/pushing-the-limit/plots) and [stats](https://github.com/lurk-lab/pasta-msm/tree/pushing-the-limit/stats). For brevity, let's just look at the first vector (labeled 1), though the analysis and the conclusions we draw here apply equally to all the other vectors.

### Table for `witness #1`

```
scalars length: 7941351
freq>1000 (count, total, %): 726, 5769851, 72.66%
freq>500 (count, total, %): 1045, 6042757, 76.09%
freq>100 (count, total, %): 4129, 6680757, 84.13%
top 100 values:
0, 0, 1757881
1, 1, 77154
2, 336293d1250e16ace79c1b9cfdd14e5d80bc7928a947635797fcbc1ebfa512e8, 12600
3, 294b6a7f0984f3d0560b3d0d973d20f41186547a1e41c758e2a3348c0141a665, 12600
4, 1af18bf8b21cf712a700f08f14dbcc588ad948ebc1ff85cc300eea6a581d26bf, 12600
5, 4943278e947828f315482d004370c011c9acf8706cac544710a26ce111fb11a, 9517
6, 29dd0402a87d5b40f9d6cfd1b6bafe4b88d3a89d1a2391749fd334313e8bdb08, 9517
7, 2a53d5a36091904f09cd86bd43b07c7961faad8ec81c97c9bd30d3c6e9da1ff2, 9517
8, 3ab358db82ff700d702170c90e32398517828e100235f92d83675a10738fc5de, 8747
9, 399b95a127d3d123a3b7da459fcde8918efcf0437899c6b0c210f5c20270c828, 8747
```

The table first lists the overall frequency data at certain cutoffs. Each `freq>X` row should be read as:

- `count` is the number of distinct scalars that occur more than `X` times;
- `total` is the sum of the frequencies of those scalars, i.e. the sum of all the frequencies counted by `count`;
- `%` is the percentage these `freq>X` scalars make up of the total vector, i.e. `total / length`.

Following that, we list the top 100 most frequent scalar values (only the first 10 are shown above), ordered from highest to lowest frequency.

Details that stand out:

- There is a large number of zeros (1757881 occurrences) and ones (77154 occurrences). Relative to the total length of the vector, zeros make up a significant 22%, while ones make up only 1%.
- Many specific scalar values show up very frequently, in the thousands. Focus on the top `freq>X` rows of the table: there are only ~700 scalars (0.01% of the vector's length) which occur more than 1000 times, but these alone make up 72.66% of the vector!
- These observations are supported by the plot below, which shows significant variation in the distribution of scalars, especially around zero and one (the leftmost spike).

### Plot for `witness #1`

![lurkrs_1](https://hackmd.io/_uploads/HkAjdLEKa.png)

Note, all the data for the rest of the vectors can be found here:

- stats: https://github.com/lurk-lab/pasta-msm/tree/pushing-the-limit/stats
- plots: https://github.com/lurk-lab/pasta-msm/tree/pushing-the-limit/plots
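Statistics like the `freq>X` rows above can be reproduced with a single CPU pass over a scalar vector. Below is a minimal sketch, assuming `pasta_curves` scalars and keying a hash map on the canonical byte encoding; `frequencies` and `report` are illustrative helpers, not part of any existing crate.

```rust
use std::collections::HashMap;

use ff::PrimeField;
use pasta_curves::pallas;

/// Count how often each scalar value occurs, keyed by its canonical
/// byte encoding.
fn frequencies(scalars: &[pallas::Scalar]) -> HashMap<[u8; 32], u64> {
    let mut freqs: HashMap<[u8; 32], u64> = HashMap::new();
    for s in scalars {
        *freqs.entry(s.to_repr()).or_insert(0) += 1;
    }
    freqs
}

/// Print one `freq>X` row as in the table above: how many distinct scalars
/// occur more than `cutoff` times, how many entries of the vector they
/// account for in total, and what fraction of the vector that is.
fn report(freqs: &HashMap<[u8; 32], u64>, length: usize, cutoff: u64) {
    let frequent: Vec<u64> = freqs.values().copied().filter(|&f| f > cutoff).collect();
    let count = frequent.len();
    let total: u64 = frequent.iter().sum();
    let pct = 100.0 * total as f64 / length as f64;
    println!("freq>{cutoff} (count, total, %): {count}, {total}, {pct:.2}%");
}
```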
## Potential causes for slowdown

Now that we've looked at the data, let's consider some possibilities for the cause of the slowdown.

### Zeros and ones

To test the impact of the large number of zeros (and, to a lesser extent, ones) in the vector, we replace all the zeros and ones in `witness #1` with random values and benchmark the result.

|                            | `lurkrs`                   | `lurkrs no zeros`                 |
|:---------------------------|:---------------------------|:----------------------------------|
| **`#1, 7941351 scalars`**  | `550.19 ms` (✅ **1.00x**)  | `566.97 ms` (✅ **1.03x slower**)  |

|                            | `lurkrs`                   | `lurkrs no zeros and ones`        |
|:---------------------------|:---------------------------|:----------------------------------|
| **`#1, 7941351 scalars`**  | `546.36 ms` (✅ **1.00x**)  | `562.64 ms` (✅ **1.03x slower**)  |

Conclusion: the large number of zeros and ones is not a cause of the slowdown.

Full benchmarks for all vectors are here:

- zeros: https://gist.github.com/winston-h-zhang/5e73cc8fd236b4ca5baa18f279b0f2e5;
- zeros and ones: https://gist.github.com/winston-h-zhang/73533bc2f74015e954e34e8e5d31a71a.

### Skewed distribution

Given that the zeros and ones are not the reason for the slowdown, the only other striking feature of the data is the skewed distribution of scalars -- we conclude this is the cause of the slowdown.

## Potential solutions

### In theory

How can we deal with scalars that appear with very high frequency? In `sppark`, we know that all the ones are "filtered out" in a preprocessing step that adds up all of the bases associated with the ones. Let's apply the same idea to the other high-frequency scalars and expand that preprocessing step. Instead of filtering only the ones, we pick some threshold `THRESHOLD` and filter all scalars with frequency greater than `THRESHOLD`. For each of these high-frequency scalars, we collect all the bases associated with that scalar -- let's call these the "coincident bases" (of the scalar) -- and sum them. After this is done, we are left with a compressed MSM that is smaller than the original. Finally, we simply compute the smaller MSM normally with Pippenger.

To summarize, we have 3 major steps in this modified MSM algorithm (see the sketch below):

1. Preprocess the large MSM by collecting statistics about the frequencies and coincident bases.
2. For each high-frequency scalar, sum all of its coincident bases -- this is embarrassingly parallel.
3. Compute the smaller (compressed) MSM.
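Here is a minimal CPU sketch of this idea, assuming `pasta_curves`/`group` types; `compress` is an illustrative helper, not an existing `sppark` or `pasta-msm` API, and in a real implementation steps (1) and (2) would run on the GPU. It buckets indices by scalar value, sums the coincident bases of every scalar above the threshold, and returns a smaller MSM instance for step (3).

```rust
use std::collections::HashMap;

use ff::{Field, PrimeField};
use group::{prime::PrimeCurveAffine, Curve, Group};
use pasta_curves::pallas;

/// Steps (1) and (2): bucket the indices of each distinct scalar value, and
/// for every scalar that occurs more than `threshold` times, replace all of
/// its (scalar, base) pairs by a single pair whose base is the sum of the
/// coincident bases. The result is a smaller MSM with the same value.
fn compress(
    scalars: &[pallas::Scalar],
    bases: &[pallas::Affine],
    threshold: usize,
) -> (Vec<pallas::Scalar>, Vec<pallas::Affine>) {
    // Step 1: for each distinct scalar value, record the indices at which it
    // occurs (i.e. which bases are coincident to it).
    let mut buckets: HashMap<[u8; 32], Vec<usize>> = HashMap::new();
    for (i, s) in scalars.iter().enumerate() {
        buckets.entry(s.to_repr()).or_default().push(i);
    }

    let mut out_scalars = Vec::new();
    let mut out_bases = Vec::new();

    for (_, indices) in buckets {
        let scalar = scalars[indices[0]];
        if bool::from(scalar.is_zero()) {
            // Zero scalars contribute nothing and can be dropped outright.
            continue;
        }
        if indices.len() > threshold {
            // Step 2: sum the coincident bases of a high-frequency scalar
            // (embarrassingly parallel across distinct scalar values).
            let sum = indices
                .iter()
                .fold(pallas::Point::identity(), |acc, &i| acc + bases[i].to_curve());
            out_scalars.push(scalar);
            out_bases.push(sum.to_affine());
        } else {
            // Low-frequency scalars pass through unchanged.
            for &i in &indices {
                out_scalars.push(scalar);
                out_bases.push(bases[i]);
            }
        }
    }

    // Step 3 (not shown): feed (out_bases, out_scalars) to the usual
    // Pippenger MSM.
    (out_scalars, out_bases)
}
```

Since every high-frequency scalar keeps its summed base and every other scalar passes through unchanged (with zeros dropped), the compressed MSM evaluates to exactly the same group element as the original.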
### In practice

To partially test this idea, we estimated the performance of steps (2) and (3) and compared them to what we currently have.

First, we estimate how much smaller and faster step (3) is by simulating what we would get from compressing the Lurk vectors with `THRESHOLD=1`, i.e. we merge all the scalars as much as possible. The benchmarks are below.

|                            | `random`                   | `lurkrs`                          | `lurkrs compressed`              |
|:---------------------------|:---------------------------|:----------------------------------|:---------------------------------|
| **`#1, 7941351 scalars`**  | `121.32 ms` (✅ **1.00x**)  | `547.39 ms` (❌ *4.51x slower*)    | `19.74 ms` (🚀 **6.15x faster**)  |
| **`#2, 9699051 scalars`**  | `141.87 ms` (✅ **1.00x**)  | `145.90 ms` (✅ **1.03x slower**)  | `38.80 ms` (🚀 **3.66x faster**)  |
| **`#5, 7941351 scalars`**  | `117.80 ms` (✅ **1.00x**)  | `550.36 ms` (❌ *4.67x slower*)    | `18.25 ms` (🚀 **6.45x faster**)  |
| **`#6, 9699051 scalars`**  | `137.59 ms` (✅ **1.00x**)  | `321.07 ms` (❌ *2.33x slower*)    | `52.13 ms` (🚀 **2.64x faster**)  |
| **`#9, 7941351 scalars`**  | `120.01 ms` (✅ **1.00x**)  | `555.16 ms` (❌ *4.63x slower*)    | `17.04 ms` (🚀 **7.04x faster**)  |
| **`#10, 9699051 scalars`** | `144.52 ms` (✅ **1.00x**)  | `345.90 ms` (❌ *2.39x slower*)    | `61.81 ms` (🚀 **2.34x faster**)  |
| **`#13, 7941351 scalars`** | `120.17 ms` (✅ **1.00x**)  | `554.94 ms` (❌ *4.62x slower*)    | `17.12 ms` (🚀 **7.02x faster**)  |
| **`#14, 9699051 scalars`** | `142.90 ms` (✅ **1.00x**)  | `463.73 ms` (❌ *3.25x slower*)    | `57.67 ms` (🚀 **2.48x faster**)  |
| **`#17, 7941351 scalars`** | `118.52 ms` (✅ **1.00x**)  | `557.10 ms` (❌ *4.70x slower*)    | `16.43 ms` (🚀 **7.21x faster**)  |
| **`#18, 9699051 scalars`** | `144.78 ms` (✅ **1.00x**)  | `560.78 ms` (❌ *3.87x slower*)    | `65.20 ms` (🚀 **2.22x faster**)  |
| **`#21, 7941351 scalars`** | `118.18 ms` (✅ **1.00x**)  | `555.62 ms` (❌ *4.70x slower*)    | `17.44 ms` (🚀 **6.78x faster**)  |
| **`#22, 9699051 scalars`** | `147.36 ms` (✅ **1.00x**)  | `610.48 ms` (❌ *4.14x slower*)    | `65.04 ms` (🚀 **2.27x faster**)  |
| **`#25, 7941351 scalars`** | `119.47 ms` (✅ **1.00x**)  | `555.26 ms` (❌ *4.65x slower*)    | `16.50 ms` (🚀 **7.24x faster**)  |
| **`#26, 9699051 scalars`** | `142.61 ms` (✅ **1.00x**)  | `675.52 ms` (❌ *4.74x slower*)    | `66.13 ms` (🚀 **2.16x faster**)  |
| **`#29, 7941351 scalars`** | `119.86 ms` (✅ **1.00x**)  | `554.61 ms` (❌ *4.63x slower*)    | `16.63 ms` (🚀 **7.21x faster**)  |
| **`#30, 9699051 scalars`** | `146.12 ms` (✅ **1.00x**)  | `698.63 ms` (❌ *4.78x slower*)    | `67.42 ms` (🚀 **2.17x faster**)  |

Comparing the `lurkrs` column to the `lurkrs compressed` column, the speedups range from roughly 4x (for the less-skewed cross-term vectors) to over 30x (for the witness vectors). Of course, we are aware that this does not account for steps (1) and (2).

The next step is to estimate step (2). We expect others to be much more knowledgeable here, but we've added a partial analysis. Since `sppark` optimizes ones by adding up all their coincident bases, we can test against running a vector of all ones. To adjust for any overhead, we also run a vector of all zeros. The difference between these two runs then gives us the pure time it takes to add all the ones.

```
GPU/ones/preallocated 10000000 scalars   time: [61.739 ms 62.573 ms 63.441 ms]
GPU/zeros/preallocated 10000000 scalars  time: [54.205 ms 54.969 ms 55.761 ms]
```

Given that adding ~10 million ones takes around 62.5 ms - 55 ms = 7.5 ms, we should be able to add around 10 million points / 7.5 ms ≈ 1.33 billion points per second. From our test set, no vector length exceeds 10 million, so we estimate that step (2) will take, cautiously, around 10 ms. Adding this estimate to the estimates of step (3) from the benchmarks above (the `lurkrs compressed` column) gives a good idea of how long steps (2) + (3) would take.

Finally, we should also note that step (1) will take a nontrivial amount of work, since the preprocessing step includes moving data onto the device, counting the scalars, binning their coincident bases, perhaps some sorting, etc. We are not sure how to estimate the performance of step (1), though we suspect it will be relatively fast compared to the work of steps (2) + (3). Generously, perhaps 10-15 ms.

In total, adding ~25 ms to the `lurkrs compressed` column of the benchmarks above gives a reasonable estimate of the performance of this modified MSM. This indicates a possible run time of roughly 50-100 ms for Lurk MSMs, a significant speedup over our current timings. In the cases where the number of coincident bases is high, it should even perform better than the baseline random MSM.

## Closing

This document is largely a summary of everything we've discussed internally about the Lurk MSM slowdown. We hope it provides good data for further analysis. We have also included some ideas for a solution, but these are only estimates, and we are of course open to anything better. Please let us know if anyone wants the Lurk vectors as test data, or needs a way to reproduce the results -- we can quickly set that up.