See [soundness proof](https://hackmd.io/@mitschabaude/SJsLdMgFn) and [existing FFmul RFC](https://o1-labs.github.io/proof-systems/rfcs/foreign_field_mul.html) for relevant, but not required context. **Recap**: The original design could not be made sound because we are quite sure that individual range checks on all limbs $q_0$, $q_1$, $q_2$ are necessary, and the original design already uses all 7 x 2 = 14 permutable cells in the two rows available to the gate. Original design: ```Rust //~ | col | `ForeignFieldMul` | `Zero` | //~ | --- | ---------------------------- | ------------------------- | //~ | 0 | `left_input0` (copy) | `remainder0` (copy) | //~ | 1 | `left_input1` (copy) | `remainder1` (copy) | //~ | 2 | `left_input2` (copy) | `remainder2` (copy) | //~ | 3 | `right_input0` (copy) | `quotient_bound01` (copy) | //~ | 4 | `right_input1` (copy) | `quotient_bound2` (copy) | //~ | 5 | `right_input2` (copy) | `product1_lo` (copy) | //~ | 6 | `carry1_lo` (copy) | `product1_hi_0` (copy) | //~ | 7 | `carry1_hi` (plookup) | | //~ | 8 | `carry0` | | //~ | 9 | `quotient0` | | //~ | 10 | `quotient1` | | //~ | 11 | `quotient2` | | //~ | 12 | `quotient_bound_carry` | | //~ | 13 | `product1_hi_1` | | //~ | 14 | | | ``` Why we need the 14 permutable cells in the original design: - We need 3x left input and 3x right input to use the inputs - We need 3 range checks on intermediate values (`carry1_lo`, `product1_lo` and `product1_hi_0`) and there is no design known to us which needs fewer. We need ranged-checked values to be in permutable cells, so that the RC gates can use them. - We need 3x `remainder` to use it as output of the gate, for example, for passing it as input to another ffmul gate. - We use 2x `quotient_bound` because we **thought** we only needed 2 upper bounds on $q$, and that a "compact RC" using 2 permutable cells was enough for those bounds In total, that's 3 + 3 + 3 + 3 + 2 = 14. Now that we know we need 3x `quotient`, we need to either add a row or find a different tweak to free up one permutable cell. Happily, at least we now know that we don't need `quotient_bound01` (permutable) and `quotient_bound_carry` (witness). We still need to compute the bound $q_2 + f'_2$ (`quotient_bound2`) but we can do that elsewhere if we expose $q_2$ (`quotient2`) from this gate. This write-up proposes two ideas to free up a cell, and based on that, three candidate revised gate designs. ## Candidate 1: Compact $r$ As the [soundness proof](https://hackmd.io/@mitschabaude/SJsLdMgFn) shows, the FFmul gate only requires a compact RC for $r$. Thus, we can expose $r$ as a combined bottom limb `remainder_01` = $r_{01} = r_0 + 2^\ell r_1$ plus the high limb `remainder2` = $r_2$. This frees up the necessary space for 3x quotient: ```Rust //~ | col | `ForeignFieldMul` | `Zero` | //~ | --- | ---------------------------- | ---------------------- | //~ | 0 | `left_input0` (copy) | `remainder01` (copy) | //~ | 1 | `left_input1` (copy) | `remainder2` (copy) | //~ | 2 | `left_input2` (copy) | `quotient0` (copy) | //~ | 3 | `right_input0` (copy) | `quotient1` (copy) | //~ | 4 | `right_input1` (copy) | `quotient2` (copy) | //~ | 5 | `right_input2` (copy) | `product1_lo` (copy) | //~ | 6 | `carry1_lo` (copy) | `product1_hi_0` (copy) | //~ | 7 | `carry1_hi` (plookup) | | //~ | 8 | `carry0` | | //~ | 9 | `product1_hi_1` | | //~ | 10 | | | //~ | 11 | | | //~ | 12 | | | //~ | 13 | | | //~ | 14 | | | ``` Compared to the original design, 3 witness cells become free as the quotient bound is removed. **Advantage**: This is the minimal possible change to the gate that enables RCing all of $q$, giving us soundness. **Trade-offs**: - **Compact $r$**: We can no longer use all three remainder outputs directly in another gate. All 3 will be available from the MRC gate though, which we want to add (almost) every time we use this gate anyway; so we just have to add the additional logic in the FFmul gadget to add the MRC gates for $r$ immmediately and expose its outputs. - **Separate quotient bound**: The quotient $q_0, q_1, q_2$ is exposed for an MRC, but the high limb bound $q_2 + f'_2$ is not. So we have to do that addition elsewhere. Without a dedicated gate to do this, we will likely waste half a double generic gate for this single addition. Again, it is likely the job of an ffmul gadget to handle the addition and add the RC to `external_checks`. **Total rows: 15.833**. 2 for FFmul, 10 * 4/3 = 13.33 for 10 RCs (3 intermediate values, 6 outputs, quotient bound), 0.5 for computing quotient bound ## Candidate 2: Inlining one range check An alternative, earlier design used a trick to reduce rows: It used free space to inline an entire RC. In that design, it used 7 lookup cells (holding $7\cdot 12 = 84$ bits) and one additional witness cell holding the remaining 4 bits. We could do the same: our two rows still have exactly 7 lookup cells available, and plenty of space for the 4-bit witness. The concrete proposal is to inline the RC on `carry1_low` and lookup on `carry1_hi` with 7x 12-bit lookups and additional 3x 2-bit and 1x 1-bit constraints on 4 smaller limbs for a total of $84 + 3\cdot 2 + 1 = 91 = \ell + 3$ bits, which is what we need for $v_1 \in [0, 2^{\ell + 3})$ ($v_1$ is the carry on the high-limb mod 2^t equation). So, we introduce `carry1_0`, ..., ,`carry1_9`, `carry1_a` and `carry1` to hold their sum, with constraints * 12-bit lookup $v_{10},...,v_{16}$ * 2-bit constraints on $v_{17},v_{18},v_{19}$ * 1-bit constraint on $v_{1a}$ * constraint which combines them: $$ v_1 = \sum_{i=0}^7 v_{1i} 2^{12 i} + v_{i8} 2^{12\cdot 7 + 2} + v_{i9} 2^{12\cdot 7 + 4} + v_{ia} 2^{12\cdot 7 + 6} $$ The free permutable cell is used to expand the $r$ output into 3 cells again. Gate layout: ```Rust //~ | col | `ForeignFieldMul` | `Zero` | //~ | --- | -------------------------- | ------------------------ | //~ | 0 | `left_input0` (copy) | `remainder0` (copy) | //~ | 1 | `left_input1` (copy) | `remainder1` (copy) | //~ | 2 | `left_input2` (copy) | `remainder2` (copy) | //~ | 3 | `right_input0` (copy) | `quotient0` (copy) | //~ | 4 | `right_input1` (copy) | `quotient1` (copy) | //~ | 5 | `right_input2` (copy) | `quotient2` (copy) | //~ | 6 | `product1_lo` (copy) | `product1_hi_0` (copy) | //~ | 7 | `carry1_0` (plookup) | `carry1_4` (plookup) | //~ | 8 | `carry1_1` (plookup) | `carry1_5` (plookup) | //~ | 9 | `carry1_2` (plookup) | `carry1_6` (plookup) | //~ | 10 | `carry1_3` (plookup) | `carry1` | //~ | 11 | `carry1_7` | `carry0` | //~ | 12 | `carry1_8` | `product1_hi_1` | //~ | 13 | `carry1_9` | | //~ | 14 | `carry1_a` | | ``` **Advantage**: We save an entire range-check, and simplify usage of the gate because the remainder is immediately available as 3 separate limbs. **Trade-offs**: * **Slightly more complicated gate**. * **Separate quotient bound**: See above. **Total rows: 14.5**. 2 for FFmul, 9 * 4/3 = 12 for 9 RCs (2 intermediate values, 6 outputs, 1 quotient bound), 0.5 for computing quotient bound ## Candidate 3: Compact $r$ and inlining RC and bring back quotient bound We listed getting access to all three remainder limbs as advantage of candidate 2, but that doesn't seem relevant if we add the MRC for $r$ right in the gadget anyway. So why don't we use the free cell for something actually useful? The final proposal is to do the same as the second, but keeping $r$ compact in 2 cells. Instead, we expose the quotient bound $q_2 + f'_2$ in the free permutable cell. This saves 0.5 rows and simplifies usage of the gate because the addition doesn't have to be done externally. ```Rust //~ | col | `ForeignFieldMul` | `Zero` | //~ | --- | -------------------------- | ------------------------ | //~ | 0 | `left_input0` (copy) | `remainder01` (copy) | //~ | 1 | `left_input1` (copy) | `remainder2` (copy) | //~ | 2 | `left_input2` (copy) | `quotient0` (copy) | //~ | 3 | `right_input0` (copy) | `quotient1` (copy) | //~ | 4 | `right_input1` (copy) | `quotient2` (copy) | //~ | 5 | `right_input2` (copy) | `quotient_bound` (copy) | //~ | 6 | `product1_lo` (copy) | `product1_hi_0` (copy) | //~ | 7 | `carry1_0` (plookup) | `carry1_4` (plookup) | //~ | 8 | `carry1_1` (plookup) | `carry1_5` (plookup) | //~ | 9 | `carry1_2` (plookup) | `carry1_6` (plookup) | //~ | 10 | `carry1_3` (plookup) | `carry1` | //~ | 11 | `carry1_7` | `carry0` | //~ | 12 | `carry1_8` | `product1_hi_1` | //~ | 13 | `carry1_9` | | //~ | 14 | `carry1_a` | | ``` **Advantage**: Even more efficient than candidate 2 without actual downsides. **Trade-offs**: * **Slightly more complicated gate**. * **Compact $r$**: See above. **Total rows: 14**. 2 for FFmul, 9 * 4/3 = 12 for 9 RCs (2 intermediate values, 6 outputs, 1 quotient bound)