# Ron's Optimization for Multilinear Extension Evaluation
**Author's Note:** *This article builds directly on concepts established in **Part 1: Multilinear Polynomial Extension and Direct Evaluation**. If you haven't read that yet, I strongly advise starting there to understand the definition of the MLE and the naive direct evaluation method. This part focuses on optimization strategies proposed by Ron Rothblum.*
<br/>
<br/>
In Part 1, we implemented the "Direct Evaluation" of a Multilinear Extension (MLE). While mathematically clear, that approach suffers from a significant bottleneck: it requires $O(m \cdot 2^m)$ field multiplications.
In the context of Zero-Knowledge Proofs (such as Spartan or the Sumcheck protocol), the prover often works with large polynomials ($m \approx 20+$). At this scale, the difference between $m \cdot 2^m$ and $2^m$ is a factor of 20x or more. Standard Dynamic Programming (DP) can achieve $O(2^m)$ time, but often at the cost of $O(2^m)$ memory—a prohibitive cost for large zkVMs.
In this post, we explore an elegant optimization I call **Ron’s Method** (from [Rothblum](https://eprint.iacr.org/2024/1103.pdf)). This technique allows us to evaluate the MLE in **linear time** $O(2^m)$ with only $O(m)$ **extra memory** (logarithmic in the input size) by utilizing **Gray Codes**.
### The Intuition: Streaming the Hypercube
Recall the MLE formula:
$$
\tilde{f}(z) = \sum_{b \in \{0,1\}^m} \text{eq}(z, b) \cdot f(b)
$$
The Lagrange basis term $\text{eq}(z, b)$ is a product of $m$ univariate terms. In the standard lexicographic iteration (000, 001, 010...), consecutive points $b$ and $b'$ might differ by multiple bits. For example, moving from `011` to `100` flips 3 bits. This forces us to recompute the product almost entirely.
However, if we iterate over the hypercube using a **Gray Code** sequence, consecutive codewords differ in exactly **one bit**. This property allows us to update our running product $\text{eq}(z, b)$ with a single multiplication.
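This one-bit property is easy to verify directly. A minimal sketch (the helper name `gray` is ours, not from the paper):

```rust
// Binary-reflected Gray code: g(k) = k ^ (k >> 1).
fn gray(k: usize) -> usize {
    k ^ (k >> 1)
}

fn main() {
    let m = 3;
    // Consecutive codewords differ in exactly one bit.
    for k in 1..(1usize << m) {
        let diff = gray(k - 1) ^ gray(k);
        assert_eq!(diff.count_ones(), 1);
    }
    // The m = 3 sequence used throughout this post:
    // 000, 001, 011, 010, 110, 111, 101, 100
    let seq: Vec<usize> = (0..(1usize << m)).map(gray).collect();
    assert_eq!(seq, vec![0, 1, 3, 2, 6, 7, 5, 4]);
    println!("{:?}", seq);
}
```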
### The Update Logic
Let $b$ and $b'$ be two adjacent points in a Gray code sequence that differ only at index $k$.
To transition from $\text{eq}(z, b)$ to $\text{eq}(z, b')$, we don't need to rebuild the product. We simply "swap out" the term for the $k$-th coordinate.
For every dimension $i \in \{0, \dots, m-1\}$, flipping bit $i$ of $b$ rescales the product according to the identity:
$$
eq(z, b \oplus e_i) = eq(z, b) \cdot \frac{eq_1(z_i, b_i \oplus 1)}{eq_1(z_i, b_i)}
$$
By precomputing these ratios, we can transition from one basis evaluation to the next with **one multiplication**.
1. **Flip $0 \to 1$ (Up):** $R_{i, \uparrow} = \frac{z_i}{1 - z_i}$
2. **Flip $1 \to 0$ (Down):** $R_{i, \downarrow} = \frac{1 - z_i}{z_i}$
*(Note: We assume $z_i \notin \{0,1\}$ for now. We will handle the boolean case in the "Dimension Reduction" section).*
The update rule becomes:
$$
\text{eq}(z, b') = \text{eq}(z, b) \times (\text{Ratio for bit } k)
$$
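We can sanity-check this identity numerically before moving on. The sketch below uses a toy prime field $\mathbb{F}_{101}$ rather than a real pairing-friendly field; the prime, the challenge values, and the helper names (`eq1`, `inv`, `pow_mod`) are ours for illustration:

```rust
// Toy check of the update identity over F_p with p = 101.
const P: u64 = 101;

fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1u64;
    b %= P;
    while e > 0 {
        if e & 1 == 1 {
            acc = acc * b % P;
        }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

// Fermat inverse; requires a != 0 mod P.
fn inv(a: u64) -> u64 {
    pow_mod(a, P - 2)
}

// eq_1(z, b) = z if b = 1, else (1 - z).
fn eq1(z: u64, bit: u64) -> u64 {
    if bit == 1 { z } else { (1 + P - z) % P }
}

fn main() {
    let z = [5u64, 7, 11]; // arbitrary non-boolean challenge
    // eq(z, 000) and eq(z, 010) computed from scratch:
    let eq_000 = eq1(z[0], 0) * eq1(z[1], 0) % P * eq1(z[2], 0) % P;
    let eq_010 = eq1(z[0], 0) * eq1(z[1], 1) % P * eq1(z[2], 0) % P;
    // Flipping bit 1 from 0 to 1 costs one multiplication by R_up = z_1 / (1 - z_1):
    let r_up = z[1] * inv((1 + P - z[1]) % P) % P;
    assert_eq!(eq_000 * r_up % P, eq_010);
    println!("eq(z,000) * R_up == eq(z,010)");
}
```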
### Visualizing the Traversal
Let's trace this with a concrete example where $m=3$. We want to compute the weighted sum of evaluations $f(b)$.
* **Dimensions ($m$):** 3
* **Verifier Challenge ($z$):** $z = (z_0, z_1, z_2)$ (using 0-indexing for convenience).
* **Evaluations ($f$):** A vector $[0, 1, 2, 3, 4, 5, 6, 7]$.
* *Note: These values typically map to lexicographic indices $000_2$ to $111_2$. Because we are traversing in Gray code order, we must access these values out of order.*
**1. Precomputation**
We calculate the update ratios for each dimension $i \in \{0, 1, 2\}$. Assuming $z_i \notin \{0, 1\}$:
* To flip bit $i$ from $0 \to 1$: multiply by $R_{i, \uparrow} = \frac{z_i}{1-z_i}$
* To flip bit $i$ from $1 \to 0$: multiply by $R_{i, \downarrow} = \frac{1-z_i}{z_i}$
**2. The Traversal**
* Start at $b=000$.
* Initial accumulator $E = (1-z_0)(1-z_1)(1-z_2)$.
* We traverse the Gray code sequence.
| Step | Gray Code ($b$) | Bit Flipped | Action on $E$ | Contribution to Sum |
| :--- | :--- | :--- | :--- | :--- |
| **0** | `000` | — | Initial Calc | $E \cdot f(0)$ |
| **1** | `001` | 0 ($0 \to 1$) | $E \times R_{0, \uparrow}$ | $E \cdot f(1)$ |
| **2** | `011` | 1 ($0 \to 1$) | $E \times R_{1, \uparrow}$ | $E \cdot f(3)$ |
| **3** | `010` | 0 ($1 \to 0$) | $E \times R_{0, \downarrow}$ | $E \cdot f(2)$ |
| **4** | `110` | 2 ($0 \to 1$) | $E \times R_{2, \uparrow}$ | $E \cdot f(6)$ |
| **5** | `111` | 0 ($0 \to 1$) | $E \times R_{0, \uparrow}$ | $E \cdot f(7)$ |
| **6** | `101` | 1 ($1 \to 0$) | $E \times R_{1, \downarrow}$ | $E \cdot f(5)$ |
| **7** | `100` | 0 ($1 \to 0$) | $E \times R_{0, \downarrow}$ | $E \cdot f(4)$ |
Notice that for every step after the initialization, we perform exactly **one field multiplication** to update our weight $E$, plus one multiplication and **one addition** to accumulate the contribution $E \cdot f(b)$. This is optimal for a single streaming pass.
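The traversal above translates almost line-for-line into code. The following sketch runs the same Gray-code scan over a toy prime field $\mathbb{F}_{101}$ and cross-checks it against the direct $O(m \cdot 2^m)$ evaluation from Part 1; the field choice and all names (`mle_gray`, `mle_direct`) are illustrative:

```rust
const P: u64 = 101; // toy prime field, for illustration only

fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1u64;
    b %= P;
    while e > 0 {
        if e & 1 == 1 { acc = acc * b % P; }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

fn inv(a: u64) -> u64 { pow_mod(a, P - 2) } // Fermat inverse, a != 0

// Gray-code evaluation: one weight-update multiplication per step.
// Assumes every z_i is non-boolean (invertible and != 1).
fn mle_gray(evals: &[u64], z: &[u64]) -> u64 {
    let m = z.len();
    let mut up = Vec::with_capacity(m);
    let mut down = Vec::with_capacity(m);
    let mut e = 1u64; // running eq(z, b), starting at b = 00..0
    for &zi in z {
        let om = (1 + P - zi) % P;   // 1 - z_i
        up.push(zi * inv(om) % P);   // z_i / (1 - z_i)
        down.push(om * inv(zi) % P); // (1 - z_i) / z_i
        e = e * om % P;
    }
    let mut idx = 0usize; // bit i of idx is b_i (b_0 = least significant)
    let mut sum = e * evals[idx] % P;
    let mut prev = 0usize;
    for k in 1..(1usize << m) {
        let gray = k ^ (k >> 1);
        let diff = prev ^ gray; // exactly one bit set
        let i = diff.trailing_zeros() as usize;
        let r = if (gray & diff) != 0 { up[i] } else { down[i] };
        e = e * r % P;
        idx ^= diff;
        sum = (sum + e * evals[idx]) % P;
        prev = gray;
    }
    sum
}

// Direct O(m * 2^m) evaluation for cross-checking (see Part 1).
fn mle_direct(evals: &[u64], z: &[u64]) -> u64 {
    let mut sum = 0u64;
    for b in 0..evals.len() {
        let mut w = 1u64;
        for (i, &zi) in z.iter().enumerate() {
            let factor = if (b >> i) & 1 == 1 { zi } else { (1 + P - zi) % P };
            w = w * factor % P;
        }
        sum = (sum + w * evals[b]) % P;
    }
    sum
}

fn main() {
    let evals: Vec<u64> = (0..8).collect(); // f = [0, 1, ..., 7]
    let z = [5u64, 7, 11];
    assert_eq!(mle_gray(&evals, &z), mle_direct(&evals, &z));
    println!("gray = direct = {}", mle_gray(&evals, &z));
}
```

The cross-check against `mle_direct` is a handy regression test: any bug in the ratio logic or the index bookkeeping makes the two results diverge immediately.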
**3. Detailed Trace of the Math**
We track the accumulator $E$ as it transforms through the hypercube.
* **Definitions:**
* $R_{i, \uparrow} = \frac{z_i}{1-z_i}$ (multiplier when bit $i$ flips $0 \to 1$).
* $R_{i, \downarrow} = \frac{1-z_i}{z_i}$ (multiplier when bit $i$ flips $1 \to 0$).
* Target index $b$ is written as $(b_2, b_1, b_0)_2$.
#### **Step 0: Initialization**
* **Gray Code:** `000` (Integer 0)
* **Action:** Compute the base term directly.
* **Math:**
$$E_0 = (1-z_2)(1-z_1)(1-z_0)$$
* **Result:** Matches $eq(z, 000)$.
* **Accumulate:** Add $E_0 \cdot f(0)$ to sum.
#### **Step 1: Flip Bit 0 ($0 \to 1$)**
* **Gray Code:** `001` (Integer 1)
* **Action:** Update $E_0$ using ratio $R_{0, \uparrow}$.
* **Math:**
$$E_1 = E_0 \cdot \frac{z_0}{1-z_0}$$
$$E_1 = (1-z_2)(1-z_1)\cancel{(1-z_0)} \cdot \frac{z_0}{\cancel{1-z_0}}$$
* **Result:** $(1-z_2)(1-z_1)z_0$. Matches $eq(z, 001)$.
* **Accumulate:** Add $E_1 \cdot f(1)$ to sum.
#### **Step 2: Flip Bit 1 ($0 \to 1$)**
* **Gray Code:** `011` (Integer 3)
* **Action:** Update $E_1$ using ratio $R_{1, \uparrow}$.
* **Math:**
$$E_2 = E_1 \cdot \frac{z_1}{1-z_1}$$
$$E_2 = (1-z_2)\cancel{(1-z_1)}z_0 \cdot \frac{z_1}{\cancel{1-z_1}}$$
* **Result:** $(1-z_2)z_1 z_0$. Matches $eq(z, 011)$.
* **Accumulate:** Add $E_2 \cdot f(3)$ to sum. *Note: We access index 3 here, not 2.*
#### **Step 3: Flip Bit 0 ($1 \to 0$)**
* **Gray Code:** `010` (Integer 2)
* **Action:** Update $E_2$ using ratio $R_{0, \downarrow}$.
* **Math:**
$$E_3 = E_2 \cdot \frac{1-z_0}{z_0}$$
$$E_3 = (1-z_2)z_1 \cancel{z_0} \cdot \frac{1-z_0}{\cancel{z_0}}$$
* **Result:** $(1-z_2)z_1 (1-z_0)$. Matches $eq(z, 010)$.
* **Accumulate:** Add $E_3 \cdot f(2)$ to sum.
#### **Step 4: Flip Bit 2 ($0 \to 1$)**
* **Gray Code:** `110` (Integer 6)
* **Action:** Update $E_3$ using ratio $R_{2, \uparrow}$.
* **Math:**
$$E_4 = E_3 \cdot \frac{z_2}{1-z_2}$$
$$E_4 = \cancel{(1-z_2)}z_1 (1-z_0) \cdot \frac{z_2}{\cancel{1-z_2}}$$
* **Result:** $z_2 z_1 (1-z_0)$. Matches $eq(z, 110)$.
* **Accumulate:** Add $E_4 \cdot f(6)$ to sum.
#### **Step 5: Flip Bit 0 ($0 \to 1$)**
* **Gray Code:** `111` (Integer 7)
* **Action:** Update $E_4$ using ratio $R_{0, \uparrow}$.
* **Math:**
$$E_5 = E_4 \cdot \frac{z_0}{1-z_0}$$
$$E_5 = z_2 z_1 \cancel{(1-z_0)} \cdot \frac{z_0}{\cancel{1-z_0}}$$
* **Result:** $z_2 z_1 z_0$. Matches $eq(z, 111)$.
* **Accumulate:** Add $E_5 \cdot f(7)$ to sum.
#### **Step 6: Flip Bit 1 ($1 \to 0$)**
* **Gray Code:** `101` (Integer 5)
* **Action:** Update $E_5$ using ratio $R_{1, \downarrow}$.
* **Math:**
$$E_6 = E_5 \cdot \frac{1-z_1}{z_1}$$
$$E_6 = z_2 \cancel{z_1} z_0 \cdot \frac{1-z_1}{\cancel{z_1}}$$
* **Result:** $z_2 (1-z_1) z_0$. Matches $eq(z, 101)$.
* **Accumulate:** Add $E_6 \cdot f(5)$ to sum.
#### **Step 7: Flip Bit 0 ($1 \to 0$)**
* **Gray Code:** `100` (Integer 4)
* **Action:** Update $E_6$ using ratio $R_{0, \downarrow}$.
* **Math:**
$$E_7 = E_6 \cdot \frac{1-z_0}{z_0}$$
$$E_7 = z_2 (1-z_1) \cancel{z_0} \cdot \frac{1-z_0}{\cancel{z_0}}$$
* **Result:** $z_2 (1-z_1) (1-z_0)$. Matches $eq(z, 100)$.
* **Accumulate:** Add $E_7 \cdot f(4)$ to sum.
At the end of this linear scan, we have computed $\tilde{f}(z)$ using exactly **7 multiplications** for the weight updates (plus the initial setup), **8 multiplications** for the contributions $E \cdot f(b)$, and **8 additions** for the accumulation.
The algorithm described so far assumes that $z_i \notin \{0, 1\}$ to avoid division by zero during the update steps. However, in real-world protocols (like Sumcheck), it is common for the verifier's challenge vector $z$ to contain **boolean values** (0 or 1), or for the engineer to want to evaluate the extension on a sub-cube.
The paper provides a powerful optimization for this case: **Dimension Reduction**.
#### Dimension Reduction
If a coordinate $z_i$ is boolean ($0$ or $1$), the Lagrange basis polynomial $eq_1(z_i, b_i)$ collapses:
* If $z_i = 1$, then $eq_1(1, b_i)$ is $1$ only if $b_i = 1$, and $0$ otherwise.
* If $z_i = 0$, then $eq_1(0, b_i)$ is $1$ only if $b_i = 0$, and $0$ otherwise.
Mathematically, we partition the coordinates into two sets:
1. **$S$ (Boolean Set):** Indices where $z_i \in \{0, 1\}$.
2. **$\bar{S}$ (Non-Boolean Set):** Indices where $z_i \in \mathbb{F} \setminus \{0, 1\}$.
The paper observes that $eq(z, b) = 0$ whenever the bits in $b$ do not exactly match the boolean values in $z$. Therefore, we can **skip** all vectors $b$ where $b_S \neq z_S$. We only perform the Gray code enumeration over the remaining coordinates in $\bar{S}$.
This effectively reduces the problem size from $2^m$ to $2^{|\bar{S}|}$, changing the complexity from exponential in $m$ to exponential only in the number of *non-boolean* variables.
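The skip rule amounts to a simple predicate on truth-table indices. A minimal sketch (the helper `surviving_indices` is ours, not from the paper):

```rust
// Given the boolean constraints (index, value) from z, keep only the
// indices b of the 2^m truth table whose bits match every constraint.
fn surviving_indices(constraints: &[(usize, u64)], m: usize) -> Vec<usize> {
    (0..(1usize << m))
        .filter(|&b| constraints.iter().all(|&(i, v)| ((b >> i) & 1) as u64 == v))
        .collect()
}

fn main() {
    // m = 3 with z_0 = 1 boolean: only odd indices (b_0 = 1) survive.
    assert_eq!(surviving_indices(&[(0, 1)], 3), vec![1, 3, 5, 7]);
    // Two boolean coordinates leave 2^(3-2) = 2 surviving indices.
    assert_eq!(surviving_indices(&[(0, 0), (2, 1)], 3).len(), 2);
    println!("ok");
}
```

In the real algorithm we never materialize this filtered list; the surviving indices are enumerated implicitly by fixing the constrained bits and Gray-coding only the free ones.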
### Example when dimension reduction kicks in
Let's revisit our 3-variable example where $m=3$ and the full evaluation vector is indices $[0, \dots, 7]$.
**Scenario:**
Suppose the verifier sends the challenge:
$$z = (z_0, z_1, z_2) = (0.5, 1, 0)$$
Here:
* $z_0 = 0.5$ (Non-boolean). This is the only "active" dimension we must iterate.
* $z_1 = 1$ (Boolean).
* $z_2 = 0$ (Boolean).
**The Filtering Logic:**
We rely on the property that for the final result to be non-zero, the bits of our index $b = (b_2, b_1, b_0)_2$ must match the boolean constraints of $z$:
1. Since $z_1 = 1$, we must have $b_1 = 1$.
2. Since $z_2 = 0$, we must have $b_2 = 0$.
The only free variable is $b_0$.
**The Reduced Iteration:**
Instead of iterating through all 8 indices ($000$ to $111$), we essentially fix $b_2=0$ and $b_1=1$, and iterate $b_0$ through its Gray code sequence ($0 \to 1$).
Our target indices are:
1. $b = (0, 1, 0)_2 \to$ **Index 2**
2. $b = (0, 1, 1)_2 \to$ **Index 3**
**The Execution Trace:**
We are now computing a 1-variable MLE (over $z_0$) scaled by the fixed boolean terms (which evaluate to 1).
| Step | Active Bit ($b_0$) | Fixed Bits ($b_2, b_1$) | Full Index | Value Calculation |
| :--- | :--- | :--- | :--- | :--- |
| **0** | $0$ | $0, 1$ | **2** (`010`) | $Accum += f(2) \cdot (1-z_0)$ |
| **1** | $1$ | $0, 1$ | **3** (`011`) | $Accum += f(3) \cdot (z_0)$ |
We computed the exact result using only **2 steps** instead of 8. In a large system (e.g., $m=20$), if half the variables are boolean, this optimization speeds up the prover by a factor of $2^{10} \approx 1000\times$.
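This two-step trace can be checked numerically. Over a toy field $\mathbb{F}_{101}$, the challenge value "0.5" becomes $2^{-1} = 51$; the helper name `reduced_eval` is ours:

```rust
const P: u64 = 101; // toy prime field

// Reduced evaluation for z = (z_0, 1, 0): only indices 010 and 011 survive,
// so the whole MLE collapses to f(2) * (1 - z_0) + f(3) * z_0.
fn reduced_eval(evals: &[u64], z0: u64) -> u64 {
    let base_idx = 0b010usize; // b_2 = 0, b_1 = 1 fixed by the boolean coords
    let one_minus_z0 = (1 + P - z0) % P;
    (evals[base_idx] * one_minus_z0 + evals[base_idx | 1] * z0) % P
}

fn main() {
    let evals: Vec<u64> = (0..8).collect();
    let z0 = 51; // "0.5" in F_101, i.e. the inverse of 2
    // f(2)/2 + f(3)/2 = 5/2 = 5 * inv(2) = 53 in F_101
    assert_eq!(reduced_eval(&evals, z0), 53);
    println!("reduced evaluation = {}", reduced_eval(&evals, z0));
}
```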
## Ron's Algorithm: Implementation Explanation
This is a high-performance implementation of **Proposition 1** and **Corollary 2** from Rothblum's note.
The code implements the core optimization: calculating the Multilinear Extension (MLE) in $O(2^m)$ time and $O(m)$ space using **Gray code enumeration**. It also implements the specific optimization for "general vectors" (where some $z_i \in \{0,1\}$) to reduce the problem size.
Here is the breakdown of the Rust code, linked directly to the theoretical steps in the paper.
### 1. The Precomputed Update Ratios
**Code:**
```rust
#[derive(Clone, Copy, Debug)]
pub struct UpdateRatio<F: Field> {
up: F, // Multiplier to flip 0 -> 1: z / (1-z)
down: F, // Multiplier to flip 1 -> 0: (1-z) / z
}
```
**Theory Connection:**
The paper states that consecutive terms in a Gray code sequence differ by exactly one bit. To update the running product $eq(z, b)$, we perform a single multiplication.
The derivation comes from Equation (2) in the paper:
$$eq(z, b \oplus e_i) = \frac{eq_1(z_i, b_i \oplus 1)}{eq_1(z_i, b_i)} \cdot eq(z, b)$$
* If flipping $0 \to 1$: The multiplier is $\frac{z_i}{1-z_i}$ (variable `up`).
* If flipping $1 \to 0$: The multiplier is $\frac{1-z_i}{z_i}$ (variable `down`).
The code precomputes these values to ensure the inner loop only performs **one multiplication** per step, fulfilling the requirement of exactly $2^m$ multiplications.
### 2. Initialization and Handling Boolean Inputs
**Code:**
```rust
let mut ratios = Vec::with_capacity(m);
let mut active_bit_positions = Vec::with_capacity(m);
let mut global_idx = 0usize;
let mut current_eq = F::one();
for (i, &val) in z.iter().enumerate() {
let global_bit_pos = m - 1 - i; // Big-endian index adjustment
if val.is_zero() {
// Do nothing; bit remains 0 in global_idx
} else if val.is_one() {
global_idx |= 1 << global_bit_pos; // Fixed to 1
} else {
// General case: z_i is not 0 or 1
let one_minus_val = F::one() - val;
let r_up = val * one_minus_val.inverse().unwrap();
let r_down = one_minus_val * val.inverse().unwrap();
ratios.push(UpdateRatio { up: r_up, down: r_down });
active_bit_positions.push(global_bit_pos);
current_eq *= one_minus_val; // Initial value for b_i = 0
}
}
```
**Theory Connection:**
This section implements the optimization for "general vectors" described in the proof of Proposition 1.
* **Boolean Inputs ($z_i \in \{0,1\}$):** The paper notes that $eq(z_S, b_S) = 0$ whenever $b_S \neq z_S$.
* The code efficiently skips these dimensions. It fixes the bits in `global_idx` (effectively "hard-coding" the correct path in the truth table) and does not add them to `active_bit_positions`. This reduces the loop size from $2^m$ to $2^{|\bar{S}|}$, where $|\bar{S}|$ is the number of non-boolean variables.
* **Initial `current_eq`:** The Gray code iteration starts at logical $00\dots0$. For the active variables, $eq_1(z_i, 0) = (1 - z_i)$. Therefore, the starting `current_eq` is initialized to $\prod_{i \in Active} (1 - z_i)$.
### 3. The Gray Code Loop (The Core Algorithm)
**Code:**
```rust
// ... setup code ...
total_sum += current_eq * evals[global_idx]; // First term (all active bits 0)
let mut prev_gray = 0usize;
for k in 1..num_steps {
// Standard Gray code generation: g(k) = (k >> 1) ^ k
let gray = (k >> 1) ^ k;
let diff = prev_gray ^ gray;
// Identify which bit changed
let active_idx = diff.trailing_zeros() as usize;
let global_bit_pos = active_bit_positions[active_idx];
// Determine direction of flip
let direction_up = (gray & diff) != 0;
if direction_up {
current_eq *= ratios[active_idx].up;
} else {
current_eq *= ratios[active_idx].down;
}
// Update the truth table index
global_idx ^= 1 << global_bit_pos;
total_sum += current_eq * evals[global_idx];
prev_gray = gray;
}
```
**Theory Connection:**
This loop implements the enumeration described in the proof: "We generate the sequence according to the Gray code ordering...".
1. **`let gray = (k >> 1) ^ k;`**: This generates the standard binary reflected Gray code.
2. **`active_idx`**: This finds the index $i$ where the change occurred. Because we filtered out Boolean inputs, this index maps to `active_bit_positions` rather than the raw variable index.
3. **The Update:** The code executes the update rule: "store only the previous $eq(z,b)$ and update it using a single multiplication".
* If `direction_up` is true, we flipped $0 \to 1$, so we multiply by `up` ($z/(1-z)$).
* Otherwise, we flipped $1 \to 0$, so we multiply by `down` ($(1-z)/z$).
4. **`global_idx ^= 1 << global_bit_pos`**: This effectively "moves" our pointer in the truth table `evals` to the next required value without recalculating the index from scratch.
### Summary of Implementation Efficiency
* **Multiplications:** One weight-update multiplication per loop iteration, plus one multiplication per accumulated term `current_eq * evals[global_idx]` — roughly $2 \cdot 2^{|Active|}$ in total.
* **Space:** The `ratios` and `active_bit_positions` vectors consume $O(m)$ space, matching the paper's claim of linear space complexity.
* **Additions:** Exactly $2^{|Active|}$ additions to `total_sum`.
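For completeness, here is a runnable end-to-end sketch that stitches the snippets together. It swaps the arkworks `Field` trait for a toy prime field $\mathbb{F}_{101}$ and uses a little-endian bit convention (bit $i$ of the index is $b_i$) to match the worked examples earlier in the post; treat it as an illustration of the structure, not the reference implementation linked below.

```rust
const P: u64 = 101; // stand-in for a real field like BLS12-381 Fr

fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1u64;
    b %= P;
    while e > 0 {
        if e & 1 == 1 { acc = acc * b % P; }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

fn inv(a: u64) -> u64 { pow_mod(a, P - 2) } // Fermat inverse, a != 0

#[derive(Clone, Copy, Debug)]
struct UpdateRatio {
    up: u64,   // z / (1 - z), applied when a bit flips 0 -> 1
    down: u64, // (1 - z) / z, applied when a bit flips 1 -> 0
}

fn evaluate_mle(evals: &[u64], z: &[u64]) -> u64 {
    let m = z.len();
    let mut ratios = Vec::with_capacity(m);
    let mut active_bit_positions = Vec::with_capacity(m);
    let mut global_idx = 0usize;
    let mut current_eq = 1u64;
    for (i, &val) in z.iter().enumerate() {
        let global_bit_pos = i; // little-endian: bit i of the index is b_i
        if val == 0 {
            // b_i fixed to 0; nothing to do
        } else if val == 1 {
            global_idx |= 1 << global_bit_pos; // b_i fixed to 1
        } else {
            let one_minus = (1 + P - val) % P;
            ratios.push(UpdateRatio {
                up: val * inv(one_minus) % P,
                down: one_minus * inv(val) % P,
            });
            active_bit_positions.push(global_bit_pos);
            current_eq = current_eq * one_minus % P; // b_i starts at 0
        }
    }
    let num_steps = 1usize << ratios.len();
    let mut total_sum = current_eq * evals[global_idx] % P;
    let mut prev_gray = 0usize;
    for k in 1..num_steps {
        let gray = (k >> 1) ^ k;
        let diff = prev_gray ^ gray;
        let active_idx = diff.trailing_zeros() as usize;
        let r = ratios[active_idx];
        let mult = if (gray & diff) != 0 { r.up } else { r.down };
        current_eq = current_eq * mult % P;
        global_idx ^= 1 << active_bit_positions[active_idx];
        total_sum = (total_sum + current_eq * evals[global_idx]) % P;
        prev_gray = gray;
    }
    total_sum
}

fn main() {
    let evals: Vec<u64> = (0..8).collect();
    // z = (1/2, 1, 0): dimension reduction leaves indices 2 and 3,
    // and the result is f(2)/2 + f(3)/2 = 5/2 = 53 in F_101.
    assert_eq!(evaluate_mle(&evals, &[51, 1, 0]), 53);
    // Fully non-boolean challenge: one update multiplication per step.
    println!("evaluate_mle = {}", evaluate_mle(&evals, &[5, 7, 11]));
}
```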
## Performance Benchmarks
Theory suggests that Ron’s optimized method ($O(2^m)$) should strictly outperform the Direct method ($O(m \cdot 2^m)$). However, in systems programming, theory must always contend with CPU architecture, branch prediction, and memory hierarchy.
We implemented both methods in Rust using `arkworks` (BLS12-381 `Fr` field) and benchmarked them using `criterion` on an Apple Silicon machine.
### The Results
| Variables ($m$) | Vector Size ($2^m$) | Direct Method | Ron's Optimized | Speedup Factor |
| :--- | :--- | :--- | :--- | :--- |
| **5** | 32 | 3.16 µs | 13.96 µs | **4x (Slower)** |
| **10** | 1,024 | 170.43 µs | 54.62 µs | **3.12x** |
| **17** | 131,072 | 37.42 ms | 4.68 ms | **7.99x** |
| **20** | 1,048,576 | 341.27 ms | 102.17 ms | **3.34x** |
The data reveals three distinct phases of performance behavior:
**1. The Overhead Phase ($m=5$)**
At small sizes ($m=5$), the Optimized method is actually **4x slower** than the naive approach.
* **Why?** The setup costs dominate. The Optimized method requires allocating vectors for ratios, computing inverses for $z_i$, and performing bitwise logic to generate Gray codes.
* **The Lesson:** For tiny sub-circuits or small lookups, the naive $O(m \cdot 2^m)$ approach is preferred because the constant factors are negligible and the code is instruction-cache friendly.
**2. The Compute-Bound Phase ($m=10$ to $m=17$)**
This is the "sweet spot" where the algorithm shines.
* At $m=17$, we see an **8x speedup**.
* Here, the vector size fits comfortably within the CPU's L2/L3 cache. The CPU is bound by arithmetic operations (field multiplications). By reducing the multiplications from $17 \cdot 2^{17}$ to $1 \cdot 2^{17}$, the theoretical efficiency translates directly to wall-clock time.
**3. The Memory-Bound Phase ($m=20$)**
This phase tends to get less attention. At $m=20$, the speedup drops from ~8x to ~3.3x.
* **The Context:** A vector of $2^{20}$ BLS12-381 scalars (32 bytes each) occupies $\approx 33.5 \text{ MB}$. This exceeds the L3 cache size of most consumer CPUs.
* **The Bottleneck:** The "Direct" method iterates through the `evals` vector linearly (`0, 1, 2...`), which is perfect for the CPU's hardware prefetcher.
* **Ron's Method:** While efficient in arithmetic, it traverses memory based on Gray codes (`000` $\to$ `001` $\to$ `011` $\to$ `010`). This results in a non-linear memory access pattern. Once the data size exceeds the cache, the CPU spends more cycles waiting for RAM (cache misses) than performing the field multiplications we optimized away.
### My hot take
Ron’s optimization is a critical tool for modern zkVMs, particularly in the $m=10$ to $m=18$ range common in folding schemes and sumcheck protocols. However, engineers should remain aware of the "Memory Wall" at larger sizes ($m \ge 20$), where memory bandwidth may become the new bottleneck over arithmetic complexity.
**References:**
1. Rothblum, [A Note on Efficient Computation of the Multilinear Extension](https://eprint.iacr.org/2024/1103.pdf)
2. Developeruche, [Rust Impl](https://github.com/developeruche/cryptography-n-zk-research/blob/main/exps/eff-mle/src/ron.rs)