# New multiplication algorithm for Circle STARK

We present a new algorithm for multiplication and modular reduction over the M31, CM31, and QM31 fields that are used in Circle STARK. This algorithm is from [Avihu Levy](https://x.com/avihu28) from [StarkWare](https://x.com/StarkWareLtd). Credits also go to [Liam Eagen](https://x.com/liameagen) from [Alpen Labs](https://x.com/alpenlabs), who pointed out to us that multiplying $a$ and $b$ can be done through the squares of $a+b$ and $a-b$, and that the side effect of the result being multiplied by 4 can be addressed by floor-dividing each element in a lookup table of these squares by 4.

We start by presenting an 8-bit $\times$ 8-bit $\rightarrow$ 16-bit multiplier. M31 will be built from this multiplier.

## 8-bit $\times$ 8-bit $\rightarrow$ 16-bit multiplier

Given $a$ and $b$, both of which are nonnegative numbers of no more than 8 bits, we want to compute $a\cdot b$, which, as expected, would be a 16-bit number.

First, it is useful to observe that this multiplication can be represented as a difference of squares, $(a+b)^2$ and $(a-b)^2$, assuming $a\geq b$. If we encounter $a < b$, we can simply swap the two numbers and continue with the algorithm.

$$a\cdot b = \frac{(a+b)^2 - (a-b)^2}{4}$$

If we ignore the denominator $4$ for a moment, it is possible to use a table to look up $x^2$ for a given $x$ that is 9-bit. Here, we consider 9-bit because $a+b$ can go up to 9 bits. This results in a lookup table of size 512. In Bitcoin script, we can build this table.

```rust
script! {
    for i in (0..512).rev() {
        { i * i }
    }
}
```

To look up this table given $x$, we only need to:

```rust
script! {
    // x is currently the top of the stack and the table is k elements away
    if k != 0 {
        { k } OP_ADD
    }
    OP_PICK
    // x^2 is now at the top of the stack
}
```

To address the denominator 4, it turns out that we only need to floor-divide each element in the table by 4.

```rust
script! {
    for i in (0..512).rev() {
        { i * i / 4 }
    }
}
```

To show why this works, we decompose $(a+b)^2$ and $(a-b)^2$ under modulus 4.

$$(a+b)^2 = 4n_1 + r_1~~(0\leq r_1 < 4)$$

$$(a-b)^2 = 4n_2 + r_2~~(0\leq r_2 < 4)$$

We want to show that $r_1 = r_2$, because if that is the case, $n_1 - n_2$ is the desired result, as obtained from the table lookups.

$$(a+b)^2 - (a-b)^2 = 4(n_1 - n_2)$$

To prove $r_1=r_2$, observe that:

$$(a+b)^2 - (a-b)^2 = 4ab$$

$$(a+b)^2 - (a-b)^2 = 4(n_1 - n_2) + r_1 - r_2$$

Doing the modular reduction on the right-hand sides, we have $0 \equiv r_1 - r_2 \pmod{4}$. Since $0\leq r_1, r_2 < 4$, we have $r_1 = r_2$.

Another way to understand why this holds is to see that:

- if $x\equiv 0~\mathrm{or}~2 \pmod{4}$, or, to put it simply, an even $x$, then $x^2\equiv 0 \pmod{4}$.
- if $x\equiv 1~\mathrm{or}~3 \pmod{4}$, or, to put it simply, an odd $x$, then $x^2\equiv 1 \pmod{4}$.

And we can easily see that $(a+b)$ and $(a-b)$ are either both even or both odd, simply because $b \equiv -b \pmod{2}$. This is why their squares have the same residue modulo 4.
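As a sanity check outside of Bitcoin script, the following plain-Rust sketch (our own illustration, not part of the implementation) exhaustively verifies that the floor-divided table recovers $a\cdot b$ for all 8-bit inputs.

```rust
fn main() {
    // The same table as above: floor(i^2 / 4) for every possible sum a + b.
    let table: Vec<u32> = (0..512u32).map(|i| i * i / 4).collect();

    for a in 0..256u32 {
        for b in 0..=a {
            // table[a + b] - table[a - b] = ((a+b)^2 - (a-b)^2) / 4 = a * b,
            // because (a+b)^2 and (a-b)^2 share the same residue modulo 4.
            assert_eq!(table[(a + b) as usize] - table[(a - b) as usize], a * b);
        }
    }
    println!("floor-divided square table recovers a * b for all 8-bit pairs");
}
```

The check covers $a \geq b$; for $a < b$ the operands are simply swapped, as noted above.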
**Possible optimization:** We believe that there is a large design space for this type of algorithm and we expect people to come up with ways to improve it further. Here we list a number of optimizations that we have thought about but did not experiment with thoroughly or implement, due to a lack of time or the limited improvement suggested by an initial study.

There are ways to shrink the table so that it does not take up too much stack space. First, note that if $256\leq a+b \leq 512$, one does not have to use a table of size 512. This is because one can use $t = a + b - 256$ and we know:

$$(a+b)^2 = (t + 256)^2 = t^2 + 512t + 65536$$

Since $0\leq t\leq 256$, the lookup can be done efficiently. One uses repeated doubling to compute $512t$ from $t$. This clearly introduces additional overhead, but it leaves more stack space available.

Another idea to control the table size is to observe that:

$$(a+b+1)^2 = (a+b)^2 + 2(a+b) + 1$$

$$(a-b+1)^2 = (a-b)^2 + 2(a-b) + 1$$

And therefore:

$$(a+b+1)^2 - (a-b+1)^2 = 4ab + 4b$$

Or, furthermore,

$$ab + b = \frac{(a+b+1)^2 - (a-b+1)^2}{4}$$

This means that one can use a smaller table that only contains the squares of even numbers (or, equivalently, odd numbers): when $a+b$ is even, so is $a-b$, and the original identity applies; when $a+b$ is odd, both $a+b+1$ and $a-b+1$ are even, so the shifted identity applies and one subtracts $b$ from the result.

Deciding the parity of an 8-bit limb can be challenging, as it usually requires bit decomposition. It appears that the best solution is to store each limb as a signed number:

- If $a$ is odd, the limb is $-(a+1)/2$, a negative number.
- If $a$ is even, the limb is $a/2$.

Note that the encoding was designed specifically in the way above because we need to distinguish between $1$ and $0$ after the encoding (note: "negative zero" exists in Bitcoin script but is difficult to use due to BIP-62). Alternatively, one can also make odd numbers positive and even numbers non-positive. The sign needs to be extracted before the actual algorithm, and the actual algorithm needs to find $(a+b+1)/2$ (if $a+b$ is odd) or $(a+b)/2$ (if $a+b$ is even) using the halved values stored above.

The second idea can work together with the first idea, meaning that it is possible to shrink the table from 512 down to 128.

Yet another idea is to consider $(a-b)^2$, $a^2$, $b^2$, which require three lookups, but all of them are within the 8-bit range. The idea is that $a^2 + b^2 - (a-b)^2 = 2ab$. To remove the $2$, we can divide all elements in the lookup table by 2, but some adjustment is needed (see the sketch at the end of this section):

- if $a$ and $b$ are both odd, $a^2$ and $b^2$ are both odd, but $(a-b)^2$ is even. This causes an error of 1.
- if $a$ and $b$ are both even, then $a^2$, $b^2$, and $(a-b)^2$ are all even. This is fine.
- if one is even and the other is odd, it is also fine.

This method can also work with the previous idea, but some care is needed. For example, if we want to incorporate it with the idea that compresses odd and even numbers in the lookup table, observe that:

- if $a$ and $b$ are both odd, we can look up $(a+1)^2$, $(b+1)^2$, and $(a-b)^2$:
$$(a+1)^2 + (b+1)^2 - (a-b)^2 = 2(ab + a + b + 1)$$
- if $a$ is odd but $b$ is even, we can look up $(a+1)^2$, $b^2$, and $(a-b+1)^2$:
$$(a+1)^2 + b^2 - (a-b+1)^2 = 2(ab + b)$$
- if $a$ is even but $b$ is odd, we can look up $a^2$, $(b+1)^2$, and $(a-b-1)^2$:
$$a^2 + (b+1)^2 - (a-b-1)^2 = 2(ab + a)$$

One can see that the downside is that each lookup becomes much more expensive.

It is possible to quantify whether an optimization is worthwhile, because whenever stack size is a concern, we can address that by splitting the program using covenants. Currently, pushing the lookup table takes 1657 bytes, and a covenant takes roughly an additional 1512 bytes. Some calculation suggests that any single optimization that shrinks the table by half but brings an overhead of more than $8\%$ over the current algorithm (or, equivalently, an additional overhead of 33 bytes in Bitcoin script per M31 multiplication) is not worthwhile; one should instead use a covenant to continue the computation in another script execution.
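To illustrate the adjustment needed for the three-lookup variant above, here is a plain-Rust sketch (again our own illustration, not the Bitcoin script implementation) that exhaustively checks the halved table with the both-odd correction.

```rust
fn main() {
    // Halved table: floor(i^2 / 2) for i in 0..256, used for a, b, and |a - b|.
    let table: Vec<u32> = (0..256u32).map(|i| i * i / 2).collect();

    for a in 0..256u32 {
        for b in 0..256u32 {
            let diff = if a >= b { a - b } else { b - a };
            let mut result = table[a as usize] + table[b as usize] - table[diff as usize];
            // When a and b are both odd, a^2 and b^2 are odd but (a - b)^2 is even,
            // so the two floor divisions together lose exactly 1.
            if a % 2 == 1 && b % 2 == 1 {
                result += 1;
            }
            assert_eq!(result, a * b);
        }
    }
    println!("three-lookup variant with the halved table verified for all 8-bit pairs");
}
```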
## M31

We now turn our attention to M31. Unlike the convention in which an M31 element is expressed as exactly one element on the stack, the implementation of this new algorithm assumes that an M31 element is expressed as 4 limbs: $a_1$, $a_2$, $a_3$, $a_4$, in which $a_1$, $a_2$, $a_3$ are all 8-bit, while $a_4$ is expected to be only 7-bit. The four limbs together represent one M31 element.

$$a = a_1 + a_2 \cdot 256 + a_3\cdot 256^{2} + a_4\cdot 256^{3}$$

**Multiplying an element by 256.** One building block that we will use is multiplying a number by 256. This can be implemented by doing `OP_DUP OP_ADD` eight times, as follows, which takes 16 bytes.

```rust
script! {
    for _ in 0..8 {
        OP_DUP OP_ADD
    }
}
```

If one has OP_CAT, there is a slightly faster implementation that only takes 8 bytes: prepend a 0x00 byte, which becomes the new least-significant byte, as long as the number is not zero (prepending 0x00 to zero would violate BIP-62, since the result would no longer be minimally encoded).

```rust
script! {
    OP_SIZE
    OP_NOT
    OP_NOTIF
        OP_PUSHBYTES_1 OP_PUSHBYTES_0
        OP_SWAP OP_CAT
    OP_ENDIF
}
```

We now go back to the M31 algorithm. To multiply two numbers $a$ and $b$, we expect that the limbs of $a$ and $b$ are already constructed.

$$a_1, a_2, a_3, a_4, b_1, b_2, b_3, b_4$$

The textbook multiplication is our starting point. It calculates 7 intermediate results using 16 invocations of the 8-bit $\times$ 8-bit $\rightarrow$ 16-bit multiplier.

- $s_1 = a_1\cdot b_1$
- $s_2 = a_1\cdot b_2 + a_2\cdot b_1$
- $s_3 = a_1\cdot b_3 + a_2\cdot b_2 + a_3\cdot b_1$
- $s_4 = a_1\cdot b_4 + a_2\cdot b_3 + a_3\cdot b_2 + a_4 \cdot b_1$
- $s_5 = a_2\cdot b_4 + a_3\cdot b_3 + a_4\cdot b_2$
- $s_6 = a_3\cdot b_4 + a_4\cdot b_3$
- $s_7 = a_4\cdot b_4$

Note that M31 has the modular reduction over $2^{31} - 1$. The base of $s_5$ is $2^{32}$, the base of $s_6$ is $2^{40}$, and the base of $s_7$ is $2^{48}$. Since $2^{32}\equiv 2$, $2^{40}\equiv 2\cdot 256$, and $2^{48}\equiv 2\cdot 256^2 \pmod{2^{31}-1}$, we can inline this part of the modular reduction by folding $s_5$, $s_6$, and $s_7$, each multiplied by 2, into the lower intermediate results.

- $s_1 = a_1\cdot b_1 + (a_2\cdot b_4 + a_3\cdot b_3 + a_4\cdot b_2)\cdot 2$
- $s_2 = a_1\cdot b_2 + a_2\cdot b_1 + (a_3\cdot b_4 + a_4\cdot b_3) \cdot 2$
- $s_3 = a_1\cdot b_3 + a_2\cdot b_2 + a_3\cdot b_1 + (a_4\cdot b_4)\cdot 2$
- $s_4 = a_1\cdot b_4 + a_2\cdot b_3 + a_3\cdot b_2 + a_4 \cdot b_1$

This does not reduce the number of multiplications, but it simplifies the handling of the reduction step that we discuss now.

Our goal is to obtain the computation result as a single M31 element. That is to say, we want to reduce $s_1$, $s_2$, $s_3$, $s_4$, each of which is a number of roughly 16 to 19 bits, with the corresponding bases $2^0$, $2^8$, $2^{16}$, $2^{24}$, to a single number that represents the M31 element. To perform the reduction, we require a hint element $q$ defined as follows.

$$q = \left\lfloor\frac{s_1 + s_2\cdot 256 + s_3\cdot 256^2 + s_4\cdot 256^3}{2^{31} - 1}\right\rfloor$$

The reduction algorithm is as follows (note that $q\cdot 128\cdot 256^3 = q\cdot 2^{31}$, and the final $+\,q$ accounts for the $-1$ in the modulus):

- $t \leftarrow (s_4 - q\cdot 128)\cdot 256$
- $t \leftarrow (t + s_3)\cdot 256$
- $t \leftarrow (t + s_2)\cdot 256$
- $t \leftarrow t + q + s_1$

The resulting $t$ is the desired M31 element. The overhead comes mostly from the multiplications by 128 and by 256.
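Putting the folded schoolbook step and the hint-based reduction together, here is a plain-Rust sketch (our own illustration; the names `to_limbs` and `m31_mul` are ours) that mirrors the arithmetic above and checks it against a direct modular multiplication.

```rust
const P: u64 = (1 << 31) - 1;

// Split an M31 element into the four limbs used above (8, 8, 8, and 7 bits).
fn to_limbs(x: u64) -> [u64; 4] {
    [x & 0xff, (x >> 8) & 0xff, (x >> 16) & 0xff, x >> 24]
}

fn m31_mul(a: [u64; 4], b: [u64; 4]) -> u64 {
    // Schoolbook products with s5, s6, s7 folded in (each multiplied by 2).
    let s1 = a[0] * b[0] + 2 * (a[1] * b[3] + a[2] * b[2] + a[3] * b[1]);
    let s2 = a[0] * b[1] + a[1] * b[0] + 2 * (a[2] * b[3] + a[3] * b[2]);
    let s3 = a[0] * b[2] + a[1] * b[1] + a[2] * b[0] + 2 * (a[3] * b[3]);
    let s4 = a[0] * b[3] + a[1] * b[2] + a[2] * b[1] + a[3] * b[0];

    // The hint q; in the script it is supplied by the (untrusted) prover.
    let q = (s1 + (s2 << 8) + (s3 << 16) + (s4 << 24)) / P;

    // The reduction: t = (((s4 - 128 q) * 256 + s3) * 256 + s2) * 256 + q + s1.
    let mut t = (s4 as i64 - 128 * q as i64) * 256;
    t = (t + s3 as i64) * 256;
    t = (t + s2 as i64) * 256;
    (t + q as i64 + s1 as i64) as u64
}

fn main() {
    for &(a, b) in &[(0u64, 0u64), (1, P - 1), (123_456_789, 2_086_419_420), (P - 1, P - 1)] {
        assert_eq!(m31_mul(to_limbs(a), to_limbs(b)), a * b % P);
    }
    println!("limb multiplication with hint-based reduction matches a * b mod p");
}
```

In the Bitcoin script version, each $a_i\cdot b_j$ product would instead come from the lookup-table multiplier, and the multiplications by 128 and 256 use the doubling gadget above.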
At the end of the computation, we enforce that the result lies in the range $[0, 2^{31}-1)$, since $q$ is supplied externally and is untrusted.

```rust
script! {
    // enforce not negative
    OP_DUP OP_DUP OP_ABS OP_EQUALVERIFY
    // enforce smaller than the limit
    OP_DUP { (1i64 << 31) - 1 } OP_LESSTHAN OP_VERIFY
}
```

**Possible optimization:** The textbook multiplication uses 16 invocations of the 8-bit $\times$ 8-bit $\rightarrow$ 16-bit multiplier, which contributes the majority of the overhead, so a natural question is whether we can reduce the number of multiplications. There is some possibility, but we feel that it could be tough, for the following reason. For example, if we use 2-layer Karatsuba, we should expect a reduction from 16 multiplications to 9, but the additions required in Karatsuba have the side effect that the inputs to the multipliers grow from 8-bit to 9-bit or even 10-bit. Although methods exist to address this by handling the highest bit separately, it is quite possible that the overhead of such handling exceeds the saving from 16 to 9. This is particularly because a single 8-bit $\times$ 8-bit $\rightarrow$ 16-bit multiplication takes only 12-18 bytes; any handling that is even slightly complicated may exceed this, making Karatsuba more expensive. Nevertheless, there could be methods that make this faster, but we do not know a concrete alternative.

## CM31

We can view the M31 multiplier as an algorithm with the following syntax:

- **input:** $a$ given in 4 limbs, $b$ given in 4 limbs, where each limb is no more than 8 bits
- **output:** $a\cdot b$ after reduction modulo $2^{31} - 1$

Now, given two CM31 elements $a_1 + ia_2$ and $b_1 + ib_2$, their product $c_1 + ic_2$ is:

$$c_1 = a_1 \cdot b_1 - a_2 \cdot b_2$$

$$c_2 = a_1\cdot b_2 + a_2 \cdot b_1$$

And it is known that one can compute $c_1$ and $c_2$ using Karatsuba with three M31 multiplications.

- $t_1 = a_1\cdot b_1$
- $t_2 = a_2\cdot b_2$
- $t_3 = (a_1 + a_2)\cdot (b_1 + b_2)$
- $c_1 = t_1 - t_2$
- $c_2 = t_3 - t_1 - t_2$

One can use the same algorithm as before to compute $t_1$ and $t_2$. But additional care is needed when handling $t_3$, because we have not defined how to add two limb representations of M31 together. Note that $a_1$ and $a_2$ (and likewise $b_1$ and $b_2$) each consist of 4 limbs (8-bit, 8-bit, 8-bit, 7-bit). Directly adding them limb by limb would not work because the first three limbs could go up to 9 bits, preventing the table lookup method.

Our solution is to define an algorithm that adds two M31 limb representations together. Given $a$ written as $a_1, a_2, a_3, a_4$ and $b$ written as $b_1, b_2, b_3, b_4$, we compute the sum $c$ as follows.

- $c_1 \leftarrow a_1 + b_1$, $c_2 \leftarrow a_2 + b_2$, $c_3 \leftarrow a_3 + b_3$, $c_4 \leftarrow a_4 + b_4$.
- if $c_1 \geq 256$, then $c_1 \leftarrow c_1 - 256$ and $c_2 \leftarrow c_2 + 1$.
- if $c_2 \geq 256$, then $c_2 \leftarrow c_2 - 256$ and $c_3 \leftarrow c_3 + 1$.
- if $c_3 \geq 256$, then $c_3 \leftarrow c_3 - 256$ and $c_4 \leftarrow c_4 + 1$.
- $c$ is represented as $c_1, c_2, c_3, c_4$.

This method gives a $c$ that also consists of 4 limbs, but now all four limbs can be up to 8 bits (the last limb may now be larger). We do not perform further reduction on the last limb here, because the table lookup algorithm as well as the reduction algorithm for M31 can tolerate an 8-bit last limb. With this limb addition, we can compute $t_3$ and follow the Karatsuba algorithm as above. The result is a CM31 element.
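To make the flow concrete, here is a small plain-Rust sketch (our own illustration; `m31_mul` here is merely a stand-in that recombines the limbs and reduces, not the script-based multiplier) of the limb addition and the three-multiplication CM31 product.

```rust
const P: u64 = (1 << 31) - 1;

fn to_limbs(x: u64) -> [u64; 4] {
    [x & 0xff, (x >> 8) & 0xff, (x >> 16) & 0xff, x >> 24]
}

fn from_limbs(l: [u64; 4]) -> u64 {
    l[0] + (l[1] << 8) + (l[2] << 16) + (l[3] << 24)
}

// Stand-in for the limb-based M31 multiplier described above; here we simply
// recombine the limbs and reduce, since the focus is the limb addition.
fn m31_mul(a: [u64; 4], b: [u64; 4]) -> u64 {
    (from_limbs(a) * from_limbs(b)) % P
}

// Limb-wise addition with carry propagation: the first three limbs stay below
// 256, and the fourth limb is allowed to reach 8 bits (no further reduction).
fn add_limbs(a: [u64; 4], b: [u64; 4]) -> [u64; 4] {
    let mut c = [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]];
    for i in 0..3 {
        if c[i] >= 256 {
            c[i] -= 256;
            c[i + 1] += 1;
        }
    }
    c
}

// CM31 multiplication with three M31 multiplications (Karatsuba).
fn cm31_mul(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let t1 = m31_mul(to_limbs(a.0), to_limbs(b.0));
    let t2 = m31_mul(to_limbs(a.1), to_limbs(b.1));
    let t3 = m31_mul(
        add_limbs(to_limbs(a.0), to_limbs(a.1)),
        add_limbs(to_limbs(b.0), to_limbs(b.1)),
    );
    // c1 = t1 - t2, c2 = t3 - t1 - t2 (mod p)
    ((t1 + P - t2) % P, (t3 + 2 * P - t1 - t2) % P)
}

fn main() {
    // (1 + 2i)(3 + 4i) = -5 + 10i over M31.
    assert_eq!(cm31_mul((1, 2), (3, 4)), (P - 5, 10));
    println!("CM31 Karatsuba with limb addition verified on the sample input");
}
```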
## QM31

Now, we abstract the CM31 multiplier as an algorithm with the following syntax:

- **input:** $a$ given in 8 limbs, $b$ given in 8 limbs, where each limb is no more than 8 bits, and the 4-th and 8-th limbs must be no more than 7 bits
- **output:** $a\cdot b$ after reduction, where $a$ and $b$ are both CM31

Now, given two QM31 elements $a_1 + ja_2$ and $b_1 + jb_2$, their product $c_1 + jc_2$ is:

$$c_1 = a_1\cdot b_1 + (i + 2) \cdot a_2\cdot b_2$$

$$c_2 = a_1\cdot b_2 + a_2 \cdot b_1$$

We continue to use Karatsuba, which consists of five steps.

- $t_1 = a_1\cdot b_1$
- $t_2 = a_2\cdot b_2$
- $t_3 = (a_1 + a_2)\cdot (b_1 + b_2)$
- $c_1 = t_1 + (i + 2)\cdot t_2$
- $c_2 = t_3 - t_1 - t_2$

The challenge is again $t_3$. In the CM31 algorithm, we add limbs together, which may leave the 4-th limb of an M31 element at 8 bits. If we also naively add limbs here in QM31, the CM31 algorithm may end up creating a 4-th limb of 9 bits, which prevents the table lookup.

Our solution is to perform a partial reduction (a sketch of this addition appears at the end of this section). Given $a$ written as $a_1, a_2, a_3, a_4, a_5, a_6, a_7, a_8$ and $b$ written as $b_1, b_2, b_3, b_4, b_5, b_6, b_7, b_8$, we compute the sum $c$ as follows.

- $c_1 \leftarrow a_1 + b_1$, $c_2 \leftarrow a_2 + b_2$, $c_3 \leftarrow a_3 + b_3$, $c_4 \leftarrow a_4 + b_4$.
- $c_5 \leftarrow a_5 + b_5$, $c_6 \leftarrow a_6 + b_6$, $c_7 \leftarrow a_7 + b_7$, $c_8 \leftarrow a_8 + b_8$.
- if $c_1 \geq 256$, then $c_1 \leftarrow c_1 - 256$ and $c_2 \leftarrow c_2 + 1$.
- if $c_2 \geq 256$, then $c_2 \leftarrow c_2 - 256$ and $c_3 \leftarrow c_3 + 1$.
- if $c_3 \geq 256$, then $c_3 \leftarrow c_3 - 256$ and $c_4 \leftarrow c_4 + 1$.
- if $c_4 \geq 128$, then $c_4 \leftarrow c_4 - 128$ and $c_1 \leftarrow c_1 + 1$.
- if $c_5 \geq 256$, then $c_5 \leftarrow c_5 - 256$ and $c_6 \leftarrow c_6 + 1$.
- if $c_6 \geq 256$, then $c_6 \leftarrow c_6 - 256$ and $c_7 \leftarrow c_7 + 1$.
- if $c_7 \geq 256$, then $c_7 \leftarrow c_7 - 256$ and $c_8 \leftarrow c_8 + 1$.
- if $c_8 \geq 128$, then $c_8 \leftarrow c_8 - 128$ and $c_5 \leftarrow c_5 + 1$.
- $c$ is represented as $c_1, c_2, c_3, c_4, c_5, c_6, c_7, c_8$.

Note that the last rule in each half uses the fact that $128\cdot 256^3 = 2^{31} \equiv 1 \pmod{2^{31}-1}$, so subtracting 128 from the top limb and adding 1 to the bottom limb of that half preserves the value modulo $2^{31}-1$.

This algorithm has a side effect: $c_1$ and $c_5$ may go beyond 8 bits and become exactly 256, which takes 9 bits. Our solution is to make the table lookup mechanism "tolerate" 256 by making the lookup table slightly larger (by one element), so that it accepts sums from 0 to 512 inclusive; therefore, even if the 8-bit $\times$ 8-bit $\rightarrow$ 16-bit multiplier encounters two 256 limbs, it can still work.

After this is addressed, we continue with the Karatsuba algorithm, which gives us a QM31 element. This would be sufficient to cover all the arithmetic operations in Circle STARK.
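For completeness, here is a plain-Rust sketch (our own illustration; the helper names are ours) of the 8-limb addition with partial reduction, checking that each 4-limb half remains congruent to the corresponding M31 sum.

```rust
const P: u64 = (1 << 31) - 1;

// One M31 element as four limbs (8, 8, 8, and 7 bits); a CM31 element is two
// such groups side by side, giving the 8 limbs used in the QM31 algorithm.
fn to_limbs(x: u64) -> [u64; 4] {
    [x & 0xff, (x >> 8) & 0xff, (x >> 16) & 0xff, x >> 24]
}

fn from_limbs(l: &[u64]) -> u64 {
    l[0] + (l[1] << 8) + (l[2] << 16) + (l[3] << 24)
}

fn cm31_to_limbs(x: (u64, u64)) -> [u64; 8] {
    let (r, i) = (to_limbs(x.0), to_limbs(x.1));
    [r[0], r[1], r[2], r[3], i[0], i[1], i[2], i[3]]
}

// The 8-limb addition with partial reduction: carries move upward within each
// 4-limb half, and an overflow of the 4-th (resp. 8-th) limb wraps around into
// the 1-st (resp. 5-th) limb, using 2^31 = 1 mod p.
fn add_limbs_partial(a: [u64; 8], b: [u64; 8]) -> [u64; 8] {
    let mut c = [0u64; 8];
    for i in 0..8 {
        c[i] = a[i] + b[i];
    }
    for half in [0usize, 4] {
        for i in 0..3 {
            if c[half + i] >= 256 {
                c[half + i] -= 256;
                c[half + i + 1] += 1;
            }
        }
        if c[half + 3] >= 128 {
            c[half + 3] -= 128;
            c[half] += 1;
        }
    }
    c
}

fn main() {
    let (a, b) = ((P - 1, 12345), (P - 2, P - 3)); // two CM31 elements (real part, i part)
    let c = add_limbs_partial(cm31_to_limbs(a), cm31_to_limbs(b));
    // Each half stays congruent to the corresponding sum modulo p; the 1-st and
    // 5-th limbs may reach exactly 256, which the enlarged lookup table accepts.
    assert_eq!(from_limbs(&c[0..4]) % P, (a.0 + b.0) % P);
    assert_eq!(from_limbs(&c[4..8]) % P, (a.1 + b.1) % P);
    println!("partial-reduction limb addition verified on the sample inputs");
}
```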