# Notes on Large MSM project

## Notes from mid-January

We are computing $\sum_{i=1}^n c_i G_i$.

Let $k$ be fixed, and $c_i = \sum_j c_{i,j} 2^{j * k}$. Let's say $l = 254/k$. Then

\begin{align*}
\sum_{i=1}^n c_i G_i &= \sum_{i=1}^n \Big(\sum_{j=1}^{l} c_{i,j} 2^{j * k}\Big) G_i \\
&= \sum_{j=1}^{l} \Big(\sum_{i=1}^n c_{i,j} (G_i 2^{j * k})\Big) \\
&= \sum_j B_j
\end{align*}

The inner sum $\sum_{i=1}^n c_{i,j} (G_i 2^{j * k})$ is called the "sub-MSM". This inner sub-MSM operation is what the pseudocode describes. The main strength of the sub-MSM algorithm is that, because the coefficients are small, we can use (non-ZK) RAM lookups on `buckets`, which speeds things up quite a bit.

Questions:
- Why do we use the blinding generator H, and not G?
  - For avoiding the infinity point.
- Why is `total` not equal to H?
  - To cancel that blinder, essentially.
- Performance concerns:
  - Halo2wrong ECC addition is 202 rows.
  - In our case we'll use wide rows, so we'll have just one row per foreign field EC addition.
    - A really wide row will handle the elliptic addition (consisting of …).
    - But the wide-instead-of-tall trade-off is reasonable. For each of the columns we use an FFT to convert it from one form to another, and then another FFT for extending; so $n \log n \cdot \#\text{columns}$.
  - Each iteration of the algorithm with limited buckets is at most $3n$ additions (a generic sketch follows this list):
    - The beginning of the cycle populates the buckets; it takes $n$ additions because that is how many multiplications there are in `to_scale_pairs`.
    - The end of the loop does two additions per non-zero bucket (assume there are $n'$ of these), so we have $2 n'$ additions, which is in the worst case $2n$ additions.
  - The whole algorithm then takes $3n \cdot (256/k)$ additions, because we need to run the inner algorithm $(256/k)$ times and then sum the results (but that final sum is negligibly small and can be accumulated along the way).
  - Assuming the worst case $n = 2^{16}$ and $k = 15$, we get roughly $2^{17}$ additions, which is still more than the $2^{16}$ budget.
    - But we will use folding.
    - The only shared state between the rounds of folding is (1) the total accumulator for the computed value $\sum_{i=1}^j B_i$, and (2) the RAM lookups. RAM lookups are not ZK and they don't need to be.
    - However this needs to be repeated for all the sections, $254/14 \approx 19$ of them.
  - Are all the $n \cdot (256/k)$ base points always an input to every folding round?
    - This computation also looks like $x_0 G + x_1 (2^k G) + x_2 (2^{2k} G) + \cdots$
    - How is that supposed to be reasonable speed?
  - How do we recombine after we do the inner operation? Isn't this another MSM?
    - We'll continue adding in the folding procedure.
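For reference, below is a generic, out-of-circuit Rust sketch of the bucket method for one sub-MSM, illustrating the two phases behind the $3n$ addition count. It is not the project's actual pseudocode (no `to_scale_pairs`, no blinding with H, no RAM lookups here), and `Point` / `ZERO` / `add` are placeholders.

```rust
// Generic, out-of-circuit sketch of the bucket method for one sub-MSM
// (computing sum_i c_{i,j} * (2^{j*k} G_i) for k-bit digits c_{i,j}).
// `Point`, `ZERO` and `add` are placeholders, not real curve code.

#[derive(Clone, Copy)]
struct Point; // elided curve point representation
const ZERO: Point = Point;

fn add(_a: Point, _b: Point) -> Point {
    Point // placeholder for a real EC addition
}

/// `digits[i]` is the k-bit digit c_{i,j}; `bases[i]` is 2^{j*k} * G_i,
/// assumed precomputed outside the circuit.
fn sub_msm(k: u32, digits: &[usize], bases: &[Point]) -> Point {
    let mut buckets = vec![ZERO; 1usize << k];

    // Phase 1: at most n additions, one per (digit, base) pair.
    for (&d, &g) in digits.iter().zip(bases.iter()) {
        if d != 0 {
            buckets[d] = add(buckets[d], g);
        }
    }

    // Phase 2: combine sum_d d * buckets[d] via running suffix sums,
    // two additions per visited bucket. (The notes above count only the
    // non-zero buckets, which the in-circuit version can skip thanks to
    // the RAM lookups.)
    let mut running = ZERO;
    let mut total = ZERO;
    for b in buckets.iter().skip(1).rev() {
        running = add(running, *b);
        total = add(total, running);
    }
    total
}
```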
### Notes 2024/01/18

If $c_i$ is on 15 bits, we have $2^{15}$ different values. Therefore we can put all the multiples $G_{1}, 2 G_{1}, \cdots$ of $G_{1}$ in a precomputed table.

| coef     | X coordinate       | Y coordinate       |
| -------- | ------------------ | ------------------ |
| 1        | $G_{1}$.x          | $G_{1}$.y          |
| 2        | $(2 G_{1})$.x      | $(2 G_{1})$.y      |
| 3        | $(3 G_{1})$.x      | $(3 G_{1})$.y      |
| 4        | $(4 G_{1})$.x      | $(4 G_{1})$.y      |
| ...      | ...                | ...                |
| $2^{15}$ | $(2^{15} G_{1})$.x | $(2^{15} G_{1})$.y |

A much vaguer version, for step number $j$:

| coef | point |
| ---- | ----- |
| $c$  | $\sum_{i :\, c_{i,j} = c} 2^{k j} G_i$ (the sum over the bases $i$ that are scaled by $c$) |

Circuit that LambdaClass is supposed to write:
- verification of a Pickles proof
- written in Kimchi

The circuit has to:
- encode a large MSM over the Pasta curves (costly)
- encode foreign field arithmetic over the Pasta curves (costly)

The proof is going to be based on the KZG polynomial commitment instead of IPA, for Ethereum compatibility (using the bn254 pairing that is already available on Ethereum). Verifying a Pickles proof implies encoding a large IPA verification, which consists of a large MSM (around $2^{15}$ points).

Computing the MSM naively in a circuit (foreign field arithmetic simulating Pasta inside bn254) would cost more (a scalar multiplication is 80k rows?) than the available SRS allows (around $2^{16}$), therefore we have to optimise in-circuit.

### Q/A

Q1: What are the available SRS sizes that LambdaClass wants to use?

Q2: What kind of proof system will they use?
A (Misha): Kimchi -> they will write a Kimchi verifier over bn254 and KZG in Solidity.

Q3: How is folding going to be used?
A: Consider $\sum_{j=1}^{l} (\sum_{i=1}^n c_{i,j} (G_i 2^{j * k}))$. We hope to fit the internal sum $\sum_{i=1}^n c_{i,j} (G_i 2^{j * k})$ in one Kimchi circuit, and use folding across the $l$ outer terms.

Q4: What is the current circuit size? With the naive MSM? With the described MSM optimisation?
A: With naive MSMs and standard narrow circuits, think about 200 rows per foreign field addition. In our case we will have a (quite) wide row per one foreign field addition. Naively, multiply this by 256 (double-and-add ladder) to get the cost of one scalar multiplication. Multiply this by $2^{15}$ to get the size of the whole circuit.
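Plugging in the (rough, assumed) figures from this answer gives a sense of scale:

$$200 \cdot 256 \cdot 2^{15} \approx 1.7 \cdot 10^{9} \text{ rows},$$

which is far beyond an SRS of around $2^{16} \approx 65\,000$ rows, hence the need to optimise the MSM in-circuit.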
Q5: Can we have a link to the current work that has been done by LambdaClass?

Q6: If LambdaClass uses o1js, we will need to expose the optimised MSM in o1js. Is it part of the project?
A (Misha): Absolutely not in the scope at the moment :) At least Matthew was explicitly limiting the scope to just the MSM and saying not to worry about integration with anything that LambdaClass does.

Q7: What is the Kimchi layout, i.e. the number of columns?
A (Misha): As many as we need to make the row count fit into the available BN254 KZG SRS. We want to have (at least?) a foreign field curve addition on one row, i.e. an addition of two points $p_1$ and $p_2$ of Pallas/Vesta encoded in a circuit using the bn254 prime field.

Q8: Does the task include the implementation of the addition of two Pallas or Vesta curve points in the prime field of BN254?
A: Yes. And it seems to be the harder bit.

Q9: Do we want to fit one inner algorithm (sub-MSM) run (that is, at most $3n$ additions) into one Kimchi circuit / per one fold? Or will we have 3 folds per run of the inner algorithm?

Q10: How much can we reuse from the existing FF mul/add codebase?
A: (Misha) I vaguely remember Matthew was not positive about it. (Danny) It seems it will be reimplemented in o1js directly.

Q11: Why do we need to have wide rows instead of making more folding repetitions?

Q12: Which coordinates would be better to use (in a circuit with foreign field arithmetic)?
- [Affine](https://www.hyperelliptic.org/EFD/g1p/auto-shortw.html)
- [Jacobian](https://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian.html#doubling-dbl-2007-bl)
- [Projective](https://www.hyperelliptic.org/EFD/g1p/auto-shortw-projective.html)

We might find inspiration and decide which encoding is more efficient by looking at existing FF/ECC implementations:
- https://github.com/privacy-scaling-explorations/halo2wrong
- https://github.com/DelphinusLab/halo2ecc-s
- https://hackmd.io/@arielg/B13JoihA8

Q13: Only using Logup (additive lookups)?

KZG on bn254 <--> bn254 forms a cycle with Grumpkin. Verification of KZG can be encoded with the Grumpkin curve (see [this Aztec blog post](https://hackmd.io/@aztec-network/ByzgNxBfd#2-Grumpkin---A-curve-on-top-of-BN-254-for-SNARK-efficient-group-operations)).

### Algebraic structures

- BN254:
  - Scalar field ($\mathbb{F}_{scalar}$): 21888242871839275222246405745257275088548364400416034343698204186575808495617
  - Base field ($\mathbb{F}_{base}$): 21888242871839275222246405745257275088696311157297823662689037894645226208583
  - Equation: Weierstrass curve $y^2 = x^3 + 3$
- Vesta:
  - Scalar field ($\mathbb{F}_{scalar}$): 28948022309329048855892746252171976963363056481941560715954676764349967630337
  - Base field ($\mathbb{F}_{base}$): 28948022309329048855892746252171976963363056481941647379679742748393362948097
  - Equation: Weierstrass curve $y^2 = x^3 + 5$
- Pallas:
  - Scalar field ($\mathbb{F}_{scalar}$): 28948022309329048855892746252171976963363056481941647379679742748393362948097
  - Base field ($\mathbb{F}_{base}$): 28948022309329048855892746252171976963363056481941560715954676764349967630337
  - Equation: Weierstrass curve $y^2 = x^3 + 5$

We have BN254($\mathbb{F}_{scalar}$) < BN254($\mathbb{F}_{base}$) < Vesta($\mathbb{F}_{scalar}$) < Vesta($\mathbb{F}_{base}$).

Assuming we verify an MSM for a Step proof:
- BN254($\mathbb{F}_{scalar}$) is the field the circuit has to be expressed in
- Vesta($\mathbb{F}_{scalar}$) is the field of the scalars used in the MSM
- Vesta($\mathbb{F}_{base}$) is the field of the coordinates of the curve points

Therefore,

$$\underbrace{\sum_{j=1}^{l} \Bigg(\sum_{i=1}^n \overbrace{c_{i,j}}^{\in\, \text{Vesta}(\mathbb{F}_{scalar})} \cdot \overbrace{\left(G_i\, 2^{j * k}\right)}^{\text{coordinates in Vesta}(\mathbb{F}_{base})\text{, computed externally}}\Bigg)}_{\text{encoded in BN254}(\mathbb{F}_{scalar})}$$

The coefficients $c_{i, j}$ will be encoded on $k$ bits (so $2^k$ possible values), with $k$ small compared to the field size (around 15). A lookup table will be used to fetch the corresponding multiples of $G_i 2^{j * k}$. Therefore, the only operation that we need to encode is the addition of Vesta($\mathbb{F}_{base}$) elements inside BN254($\mathbb{F}_{scalar}$).

Note that the elements $G_i 2^{j * k}$ have coordinates in Vesta($\mathbb{F}_{base}$). Therefore, the table will require more than one limb for each coordinate.
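As an illustration, here is a hypothetical Rust layout for one row of such a table; the field names and the choice of three 88-bit limbs (as in the existing Kimchi FFA gates) are assumptions for the example, not the actual table layout:

```rust
/// Hypothetical layout for one row of the (coefficient -> point) lookup
/// table, for a fixed step j and base index i. Illustrative only; not the
/// actual Kimchi table layout.
struct PrecomputedRow {
    /// Small scalar c in [1, 2^15); fits natively.
    coef: u16,
    /// x-coordinate of c * (2^{j*k} G_i). It lives in Vesta's base field
    /// (up to ~255 bits), so it cannot fit into a single BN254 scalar
    /// field element (< 2^254) and is stored as several limbs,
    /// e.g. three 88-bit limbs as in the existing Kimchi FFA gates.
    x_limbs: [u128; 3],
    /// y-coordinate, limbed the same way.
    y_limbs: [u128; 3],
}
```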
### Next task (Danny) - 20240118

- Play with Kimchi over BN254.
  - An example exists in https://github.com/o1-labs/proof-systems/blob/master/kimchi/src/tests/generic.rs#L90.
- Play with Kimchi with KZG --> implement a small circuit doing an MSM for two points over BN254.

Suggestion for working collaboratively: share a feature branch we both push on.

#### Plan suggestion (drafty - Danny)

##### Foreign field arithmetic and arithmetic over other curves

* Have a look at the foreign field addition and multiplication gates in Kimchi.
  * Read the doc in the book + the code. Maybe start without details (half a day), then jump on the code to implement, and come back after that for a deeper understanding.
* Implement the addition of points of (non-native) Vesta with (native) BN254(Fp) using the existing FFA library within Kimchi.
  * The ECC addition operation must be implemented from scratch, but it relies on the foreign field addition/multiplication support which already exists in Kimchi.
  * Try with Jacobian, projective and affine coordinates.
  * Important: we need to verify that the conditions are respected for the scalar field of BN254. Initially, it was written for the scalar field of Vesta/Pallas.
  * Check the number of constraints (2 days).
  * Training with the current gate can help to understand how it works at the moment, and it can be useful for getting ideas to create a new gate that performs a foreign EC addition on one row (which is supposed to be a goal?).

##### Lookups

* Analyze additive lookups (logup) and try to use them in a simple circuit with one of the Pasta curves. Must be able to prove and verify a circuit. It is independent of this work.
* Do we have to change the implementation of foreign field multiplication to use additive lookups instead of Plookup?

##### Folding

* Analyze folding and try to use it in a simple circuit with one of the Pasta curves. Must be able to prove and verify a circuit. It is independent of this work.

## Kimchi's Foreign Field Arithmetic Algorithm

There is already an implementation of FFA (addition and multiplication) in Kimchi that can be used as inspiration:
- https://o1-labs.github.io/proof-systems/kimchi/foreign_field_add.html
- https://o1-labs.github.io/proof-systems/kimchi/foreign_field_mul.html

In the following are some notes on how it is used and how it needs to be modified.

### (Sketch) Misha: Understanding the algorithm

Assume that we have two fields: $\mathbb{F}_f$ (foreign) and $\mathbb{F}_n$ (native).

#### FF Addition

Let's start with addition. Our original equation is as follows:
$$ a + s \cdot b = r \mod f $$
given that $a, b, r \in \mathbb{F}_f$ and $s \in \{-1,1\}$.

This is equivalent to the following equation over the integers:
$$ a + s \cdot b = q \cdot f + r $$
given that $a, b, r \in [0, f)$, $s \in \{-1,1\}$, and $q \in \{-1,0,1\}$.

Now, let $B = 2^{88}$. Then the coefficients can be decomposed into three limbs each, as follows:
$$
\begin{align*}
a = a_0 + a_1 \cdot 2^{88} + a_2 \cdot 2^{176} = \sum_{i=0}^2 a_i B^i
\end{align*}
$$
with each limb $a_i \in [0,2^{88})$. And the equation becomes:
$$
\begin{align*}
\sum_{i=0}^2 a_i B^i + s \cdot \Big(\sum_{i=0}^2 b_i B^i\Big) = q \cdot \Big(\sum_{i=0}^2 f_i B^i\Big) + \sum_{i=0}^2 r_i B^i
\end{align*}
$$
with the constraints that $a_i, b_i, r_i \in [0,2^{88})$, $s \in \{-1,1\}$, and $q \in \{-1,0,1\}$.

But given that $\sum a_i B^i$ is not necessarily an element of $\mathbb{F}_f$ (this sum can be bigger than $f$), we must also assert the limb decompositions $a = \sum a_i B^i$, and similarly for $b$ and $r$. This assumes that $|\mathbb{F}_f| \leq |\mathbb{F}_n|$, which is true in our case --- then $a$ "fits" into $\mathbb{F}_n$, and so the equation $a = \sum a_i B^i$ "fits" natively within our native field.
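To make the limb decomposition concrete, here is a minimal out-of-circuit sketch (not Kimchi code; it assumes the `num-bigint` crate) that splits a value into three 88-bit limbs and checks the recomposition:

```rust
// Out-of-circuit sketch of the 3-limb decomposition
// a = a_0 + a_1*B + a_2*B^2 with B = 2^88, using `num-bigint`.
use num_bigint::BigUint;
use num_traits::One;

fn decompose_3_limbs(a: &BigUint) -> [BigUint; 3] {
    let b: BigUint = BigUint::one() << 88u32; // B = 2^88
    let mask = &b - BigUint::one();           // 2^88 - 1
    [
        a & &mask,             // a_0 = a mod B
        (a >> 88u32) & &mask,  // a_1
        (a >> 176u32) & &mask, // a_2
    ]
}

fn main() {
    // Example value: the BN254 base field modulus (any value below 2^264 works).
    let a = BigUint::parse_bytes(
        b"21888242871839275222246405745257275088696311157297823662689037894645226208583",
        10,
    )
    .unwrap();
    let [a0, a1, a2] = decompose_3_limbs(&a);
    let b: BigUint = BigUint::one() << 88u32;
    // Check a = a_0 + a_1*B + a_2*B^2.
    assert_eq!(a, &a0 + &a1 * &b + &a2 * (&b * &b));
}
```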
We still can't express our main equation natively, so we must split it into several sub-equations reasoning about limbs:
$$
\begin{align*}
a_0 + s \cdot b_0 &= q \cdot f_0 + r_0 + c_0 \cdot B\\
a_1 + s \cdot b_1 &= q \cdot f_1 + r_1 + c_1 \cdot B - c_0\\
a_2 + s \cdot b_2 &= q \cdot f_2 + r_2 \phantom{{} + c_2 \cdot B} - c_1
\end{align*}
$$
The $c_i$ are the newly introduced carry bits, and they must be constrained to $c_i \in \{0,1\}$.

@volhovm: I don't understand why the $c_i$ bits do not have something similar to the $s$ bits before them, depending on whether we're subtracting or adding...

#### FF Multiplication

The original formula $a \cdot b = r \mod f$ for $a, b, r \in [0,f-1]$ translates to the following equation over the integers:
$$a \cdot b = q \cdot f + r$$
Since $a, b \in [0,f-1]$, the maximum value of $a \cdot b$ is $(f-1)(f-1) = f^2 - 2f + 1$, so the maximum value of $q$ is $\lfloor (f-1)^2 / f \rfloor = f - 2$; we constrain $q \in [0,f-1]$, the same range as the other values, which is sufficient.

We will use the same $B = 2^{88}$ sized limbs as in the FF addition. This results in the equation above (spanning about 512 bits) being split into 6 limbs and, therefore, sub-equations.

@volhovm: TODO check optimality. This was optimised for 7-column Kimchi, while in our case we have as many columns as we want.

The next step is to split the equation by CRT into two sub-equations. Pick the smallest integer of the form $M = 2^t \cdot n$ for the smallest $t$ (with $n$ being the fixed prime size of $\mathbb{F}_n$) such that $M > (f-1)(f-1) + f$, a bound on the right-hand side of our equation. Now, since $n$ is prime, $2^t$ is coprime with it, and any operation mod $M$ can be CRT-recomposed from two operations mod $n$ and mod $2^t$.

Let $a_t = a \mod 2^t$, and similarly for $b_t$, $f_t$, $q_t$, $r_t$. Now let $a_n$ and the others be the same but mod $n$. Our general equation
$$a \cdot b = q \cdot f + r \mod M, \qquad M = 2^t \cdot n$$
is equivalent to:
$$
\begin{align*}
a_t \cdot b_t &= q_t \cdot f_t + r_t \mod 2^t \\
a_n \cdot b_n &= q_n \cdot f_n + r_n \mod n
\end{align*}
$$
under the condition that each element is constrained to be correct modulo its corresponding subfield: $a_t = a \mod 2^t$ and so on.

To CRT-recombine $m$ such that $m = m_t \mod 2^t \land m = m_n \mod n$, we compute
$$m = m_t + \big(((m_n - m_t) \bmod n) \cdot ((2^t)^{-1} \bmod n) \bmod n\big) \cdot 2^t$$

@volhovm: isn't it a bit expensive to compute modulo reductions?..

Note that the second equation is very cheap since we work in $\mathbb{F}_n$: $a_n, b_n, f_n$ are already $\mod n$, and all the operations in the second equation are native.
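As a quick numerical sanity check of this recombination formula, here is a standalone sketch with toy parameters (the values $t = 5$, $n = 97$, $m = 1234$ are arbitrary choices for illustration; this is not Kimchi code):

```rust
// Out-of-circuit sanity check of the CRT recombination
// m = m_t + (((m_n - m_t) mod n) * ((2^t)^{-1} mod n) mod n) * 2^t.
fn mod_pow(mut base: u128, mut exp: u128, modulus: u128) -> u128 {
    // Square-and-multiply exponentiation modulo `modulus`.
    let mut acc = 1u128;
    base %= modulus;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % modulus;
        }
        base = base * base % modulus;
        exp >>= 1;
    }
    acc
}

fn main() {
    let (t, n): (u32, u128) = (5, 97);
    let two_t = 1u128 << t; // 2^t = 32, coprime with the prime n

    let m = 1234u128 % (two_t * n); // some value in [0, M) with M = 2^t * n
    let (m_t, m_n) = (m % two_t, m % n);

    // (2^t)^{-1} mod n, via Fermat's little theorem since n is prime.
    let inv_two_t = mod_pow(two_t, n - 2, n);

    let diff = (m_n + n - m_t % n) % n;                // (m_n - m_t) mod n
    let recombined = m_t + (diff * inv_two_t % n) * two_t;
    assert_eq!(recombined, m);
}
```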
## Implementation Plan and Breakdown

Optimistic estimates for ~1.3 people.

1. Finalizing MVLookup + adding lookup tests (dw).
   - 1 day.
   - MVLookup already works, but we want to make sure that it works correctly and is (not terribly un)sound.
1. Implementing the simplest gate + adding expression framework commitment evaluation into `prover.rs`.
   - 1 day.
   - Create an `MSMEnv` similar to how it's done in the zkVM, make sure commitments are created and evaluated properly.
   - Make sure it works with lookups.
1. Implementing the 16-limb FFA addition. With a test.
   - 1-2 days, should be fairly easy.
   - Might be hard for @volhovm, who has less experience with Kimchi circuits and testing them.
1. Implementing the 16-limb FFA multiplication. With a test.
   - 2 days.
   - Multiplication is a bit more tricky, but by now we should have everything end-to-end.
1. Implementing the 16-limb FFEC addition. With a test.
   - 1-2 days.
   - Should be fairly easy compared to the previous steps. Maybe we will need to think about the infinity point. Maybe we will only need to write the incomplete formula (TODO: check with Matthew).
1. Implementing the sub-MSM algorithm for a small MSM size that fully fits within the circuit. With a test.
   - 2 days.
   - Will need an accumulator version of our lookup argument. Maybe we will need RAM lookups and not just logup?
   - With $2^{15}$ rows in the SRS we can fit one run of the sub-MSM protocol for an MSM of size $N = 2^{13}$ into the circuit if we do one ECC addition per row. This is because the sub-MSM algorithm takes $3N$ rows.
1. Scaling up the MSM to the circuit limits.
   - 2 days.
   - With the previous estimates and $N = 2^{13}$ we can do $l = 254/k \approx 17$ independent proofs, each for one run of the sub-MSM algorithm.
   - But the circuit has a lot of extra space, since a quarter of the rows are empty.
   - Instead we can try to fit 3 additions per row, in which case we'll be able to run 4 iterations of the sub-MSM algorithm per circuit, reducing the number of proofs to $l/4$. This increases the proof size, of course, but is more efficient.

#### Footer

. . . (to be able to scroll lower)