# Efficient Implementation of the Multilinear Equality Polynomial

This note details efficient strategies for evaluating the **multilinear equality polynomial**. We will explore two primary approaches:

- the current method in Plonky3 for a single evaluation point,
- a more advanced batched method that significantly improves performance when accumulating multiple evaluations.

## The Multilinear Equality Polynomial: Basics

The multilinear equality polynomial, denoted $\text{eq}(\mathbf{x}, \mathbf{z})$, is a function of two vectors $\mathbf{x}, \mathbf{z} \in \mathbb{F}^n$. For our purposes, we are interested in its evaluation over the Boolean hypercube, where $\mathbf{x} \in \{0, 1\}^n$. The polynomial is defined to be **1** if $\mathbf{x} = \mathbf{z}$ and **0** otherwise. Its algebraic form is:

$$\text{eq}(\mathbf{x}, \mathbf{z}) = \prod_{i=0}^{n-1} (x_i z_i + (1 - x_i)(1 - z_i))$$

Each factor $(x_i z_i + (1 - x_i)(1 - z_i))$ evaluates to 1 if $x_i = z_i$ and 0 if $x_i \neq z_i$ (for Boolean inputs). The product of these factors is therefore 1 only if all coordinates match.

Our goal is to compute the evaluations of $\alpha \cdot \text{eq}(\mathbf{x}, \mathbf{z})$ for a given point $\mathbf{z}$ and scalar $\alpha$, across all $2^n$ possible inputs $\mathbf{x} \in \{0, 1\}^n$.

## Approach 1: Recursive Evaluation for a Single Point

A naive evaluation would iterate through all $2^n$ values of $\mathbf{x}$ and compute the full product for each, leading to a complexity of $O(n \cdot 2^n)$. We can do much better by exploiting the polynomial's recursive structure.
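As a reference point, here is a minimal sketch of the naive $O(n \cdot 2^n)$ evaluation. It uses plain `u64` arithmetic modulo the BabyBear prime in place of a real field type; the function name `eval_eq_naive` is illustrative and not part of the Plonky3 API.

```rust
/// Illustrative modulus: the BabyBear prime (2^31 - 2^27 + 1).
const P: u64 = 2013265921;

/// Naive O(n * 2^n) evaluation: for every x in {0,1}^n, compute the full
/// product alpha * prod_i (x_i z_i + (1 - x_i)(1 - z_i)) mod P from scratch.
fn eval_eq_naive(z: &[u64], alpha: u64) -> Vec<u64> {
    let n = z.len();
    (0..(1u64 << n))
        .map(|x| {
            let mut acc = alpha;
            for (i, &z_i) in z.iter().enumerate() {
                // Read x_i from the i-th most significant position, so that
                // x_0 selects the low/high half of the output table.
                let x_i = (x >> (n - 1 - i)) & 1;
                // Factor is z_i if x_i = 1, and (1 - z_i) mod P if x_i = 0.
                let factor = if x_i == 1 { z_i } else { (1 + P - z_i) % P };
                acc = acc * factor % P;
            }
            acc
        })
        .collect()
}
```

A quick sanity check: for a Boolean point $\mathbf{z}$, the output table is $\alpha$ at the index of $\mathbf{z}$ and 0 everywhere else, and for any $\mathbf{z}$ the table sums to $\alpha$.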
### Mathematical Foundation

Let's split the vectors $\mathbf{x}$ and $\mathbf{z}$ into their first coordinate and the rest:

* $\mathbf{x} = (x_0, \mathbf{x}')$ where $\mathbf{x}' = (x_1, \dots, x_{n-1})$
* $\mathbf{z} = (z_0, \mathbf{z}')$ where $\mathbf{z}' = (z_1, \dots, z_{n-1})$

The definition of $\text{eq}$ can be decomposed as follows:

$$\text{eq}(\mathbf{x}, \mathbf{z}) = (x_0 z_0 + (1-x_0)(1-z_0)) \cdot \prod_{i=1}^{n-1} (x_i z_i + (1-x_i)(1-z_i))$$

$$\text{eq}(\mathbf{x}, \mathbf{z}) = (x_0 z_0 + (1-x_0)(1-z_0)) \cdot \text{eq}(\mathbf{x}', \mathbf{z}')$$

Since $x_0$ can only be 0 or 1, we get two distinct cases:

1. If $x_0 = 0$: $\text{eq}((0, \mathbf{x}'), \mathbf{z}) = (1 - z_0) \cdot \text{eq}(\mathbf{x}', \mathbf{z}')$
2. If $x_0 = 1$: $\text{eq}((1, \mathbf{x}'), \mathbf{z}) = z_0 \cdot \text{eq}(\mathbf{x}', \mathbf{z}')$

This recurrence is the property used for an efficient implementation: it allows us to compute the evaluations for $n$ variables from the evaluations for $n-1$ variables. We can start with a base case (e.g., $n=0$ or $n=1$) and build up the full evaluation table of size $2^n$, reducing the total complexity to $O(2^n)$.

### Implementation Strategy (currently in Plonky3)

Reference code: https://github.com/Plonky3/Plonky3/blob/36171c7eec615783f5e3a313fc728c04af6ec8ef/multilinear-util/src/eq.rs

The Rust code implements this recursive strategy with several layers of optimization.

#### Core Recursion

The `eval_eq_basic` function is a direct translation of the math. It recursively splits the problem in half:

1. It takes the current evaluation point `eval` and a `scalar` (which accumulates the products from previous steps).
2. It computes two new scalars: `s0 = scalar * (1 - z_0)` and `s1 = scalar * z_0`.
3. It calls itself twice:
   - once for the "low" half of the output buffer (where $x_0 = 0$) with `s0`,
   - once for the "high" half (where $x_0 = 1$) with `s1`.
The recursion stops at small base cases (`n = 1, 2, 3`), which are unrolled for efficiency.

```rust
// Simplified logic from eval_eq_basic: one recursion step.
let (low, high) = out.split_at_mut(out.len() / 2);
let s1 = scalar * z_0;
// scalar * (1 - z_0), computed with a subtraction instead of a second multiplication.
let s0 = scalar - s1;
eval_eq_basic(tail, low, s0);
eval_eq_basic(tail, high, s1);
```

#### Optimization 1: SIMD via Packing

As a reminder, modern CPUs can perform the same operation on multiple data points simultaneously (SIMD). The `p3-field` library exposes this through "packed" field types (e.g., `F::Packing`). A packed type can hold multiple field elements (e.g., 8 or 16) and apply arithmetic to all of them at once.

The `eval_eq_packed` function leverages this. Instead of a single `scalar`, it operates on a `PackedField` of scalars. The recursive logic remains the same, but each multiplication or subtraction now processes multiple evaluation paths in parallel. This is particularly effective for the final few variables in the recursion, as handled by `packed_eq_poly`.

:::info

### An Example of Packed Evaluation

This example demonstrates how the `eval_eq_packed` algorithm uses SIMD to evaluate the polynomial over its final two variables. The process is shown as a sequence of transformations on packed coefficient vectors.

### **Initial State**

The process begins with an **initial state**, represented by a single packed vector we'll call $C_{initial}$. To understand its components, remember that the total evaluation of the equality polynomial is a long product of terms, one for each variable:

$$\text{term}_0 \cdot \text{term}_1 \cdot \dots \cdot \text{term}_{n-1}$$

When our example starts, the algorithm is already deep in this process and has computed the product for the first $k$ variables (terms $0$ through $k-1$). The components of $C_{initial}$ are the **partial results** of this work. Let the initial state be:

$$C_{initial} = [c_0, c_1, c_2, c_3]$$

Here, each component $c_i$ is a partial product for a specific evaluation path.
For example, $c_0$ represents:

$$c_0 = \text{term}_0 \cdot \text{term}_1 \cdot \dots \cdot \text{term}_{k-1}$$

The vector contains four such components because **SIMD parallelism** allows the algorithm to process four independent evaluation paths at the same time (the width 4 is just an arbitrary value for the example):

* $c_0$ is the partial result for the first path.
* $c_1$ is the partial result for the second path.
* $c_2$ is the partial result for the third path.
* $c_3$ is the partial result for the fourth path.

### **Step 1: Transformation by the First Variable ($z_k$)**

The first recursive step processes the variable $z_k$. It transforms the single initial vector $C_{initial}$ into **two** new vectors for the next level. These represent the two branches of the computation, for $x_k = 0$ and $x_k = 1$.

Let $C_{0*}$ and $C_{1*}$ be the resulting vectors, where the $*$ denotes a placeholder for the yet-to-be-evaluated final variable. They are computed as:

$$\begin{aligned} C_{0*} &= C_{initial} \cdot (1 - z_k) \\ C_{1*} &= C_{initial} \cdot z_k \end{aligned}$$

Given our initial state $C_{initial} = [c_0, c_1, c_2, c_3]$, the component-wise breakdown is as follows:

$$\begin{aligned} C_{0*} &= [c_0(1-z_k), \ c_1(1-z_k), \ c_2(1-z_k), \ c_3(1-z_k)] \\ C_{1*} &= [c_0 z_k, \ c_1 z_k, \ c_2 z_k, \ c_3 z_k] \end{aligned}$$

Thanks to SIMD, the four multiplications required for each new vector are performed in a single CPU instruction. At this point, the computation has branched, and we now have two packed vectors tracking the state of our four parallel evaluations.

### **Step 2: Transformation by the Final Variable ($z_{k+1}$)**

The second and final recursive step processes the variable $z_{k+1}$. This step is applied to **each** of the vectors from the previous stage, expanding them further.
The vector $C_{0*}$ is transformed into two final vectors:

$$\begin{aligned} C_{00} &= C_{0*} \cdot (1 - z_{k+1}) \\ C_{01} &= C_{0*} \cdot z_{k+1} \end{aligned}$$

Simultaneously, the vector $C_{1*}$ is transformed into two final vectors:

$$\begin{aligned} C_{10} &= C_{1*} \cdot (1 - z_{k+1}) \\ C_{11} &= C_{1*} \cdot z_{k+1} \end{aligned}$$

This completes the recursive evaluation, leaving us with four packed vectors, each corresponding to a specific outcome for $(x_k, x_{k+1})$.

### **Summary of Final Results**

The four final vectors contain the fully evaluated products. By substituting the intermediate definitions, we can see the complete calculation for each outcome.

* For the outcome $(x_k, x_{k+1}) = (0, 0)$, the result is in $C_{00}$:
  $$C_{00} = [c_0(1-z_k)(1-z_{k+1}), \ c_1(1-z_k)(1-z_{k+1}), \ \dots ]$$
* For the outcome $(x_k, x_{k+1}) = (0, 1)$, the result is in $C_{01}$:
  $$C_{01} = [c_0(1-z_k)(z_{k+1}), \ c_1(1-z_k)(z_{k+1}), \ \dots ]$$
* For the outcome $(x_k, x_{k+1}) = (1, 0)$, the result is in $C_{10}$:
  $$C_{10} = [c_0(z_k)(1-z_{k+1}), \ c_1(z_k)(1-z_{k+1}), \ \dots ]$$
* For the outcome $(x_k, x_{k+1}) = (1, 1)$, the result is in $C_{11}$:
  $$C_{11} = [c_0(z_k)(z_{k+1}), \ c_1(z_k)(z_{k+1}), \ \dots ]$$

These four vectors are then unpacked into the final output slice. This method efficiently computes all outcomes in parallel using a minimal number of SIMD operations.

![image](https://hackmd.io/_uploads/HJPTjS3sgl.png)

:::

#### Optimization 2: Parallelism

For a large number of variables $n$, the evaluation domain of size $2^n$ is massive, so the computation can be split across multiple CPU cores. The `eval_eq_common` function orchestrates this:

1. It splits the $n$ variables into three groups based on the machine's capabilities:
   * **Top `log_num_threads` variables:** Used for top-level parallelism. The problem is split into `num_threads` independent sub-problems.
   * **Middle variables:** Handled by the recursive `eval_eq_packed` function within each thread.
   * **Bottom `log_packing_width` variables:** Processed together using SIMD packing at the base of the recursion.
2. Rayon's `par_chunks_exact_mut` is used to assign a chunk of the output buffer and a pre-computed scalar to each available thread.

This hybrid approach ensures that the work is balanced and that both multi-core parallelism and single-core SIMD instructions are fully utilized.

:::info

### An Example of the Hybrid Parallelism Strategy

This example demonstrates how the `eval_eq_common` function orchestrates both multi-core parallelism and SIMD packing by splitting the variables into three distinct groups.

### **Scenario Setup**

To see the full logic, let's assume a larger problem:

* **Total variables:** $n = 10$.
* **CPU cores:** 4, so `log_num_threads` = 2.
* **SIMD packing width:** 4, so `log_packing_width` = 2.

### **Step 1: Variable Split (3 Groups)**

The core of the strategy is to divide the 10 variables ($z_0$ to $z_9$) into three groups, each with a specific role:

* **Top variables: $z_0, z_1$** (`log_num_threads` = 2 variables)
  * Their purpose is to create the large, independent tasks for each of the 4 CPU cores.
* **Middle variables: $z_2, \dots, z_7$** (the remaining 6 variables)
  * These will be processed recursively *within* each parallel task. This is the main workload for each core.
* **Bottom variables: $z_8, z_9$** (`log_packing_width` = 2 variables)
  * These are handled at the lowest level of the recursion using SIMD instructions, as shown in the packed evaluation example.

### **Step 2: Preparing the Initial States (Bottom-Up)**

Instead of starting from the top, the algorithm efficiently prepares the initial state for each parallel task by working from the bottom up. It creates a `parallel_buffer` with 4 rows, one for each core.

#### **Processing the Bottom (SIMD) Variables**

First, the algorithm computes the equality polynomial for only the **bottom variables** ($z_8, z_9$).
This produces a single packed vector of results, exactly as detailed in the SIMD example. Let's call this initial packed result $C_{bottom}$:

$$C_{bottom} \leftarrow \text{result of eq}((x_8, x_9), (z_8, z_9))$$

This single packed vector is placed in the **first row** of the `parallel_buffer`.

#### **Processing the Top (Multi-Core) Variables**

Next, the algorithm uses the **top variables** ($z_0, z_1$) to expand the single result $C_{bottom}$ into four distinct initial states, filling the entire `parallel_buffer`:

* **Row 0 (for Core 0):** $C_{core0} = C_{bottom} \cdot (1-z_0)(1-z_1)$
* **Row 1 (for Core 1):** $C_{core1} = C_{bottom} \cdot (1-z_0)(z_1)$
* **Row 2 (for Core 2):** $C_{core2} = C_{bottom} \cdot (z_0)(1-z_1)$
* **Row 3 (for Core 3):** $C_{core3} = C_{bottom} \cdot (z_0)(z_1)$

At the end of this step, the `parallel_buffer` is fully prepared. Each row contains the correct starting (packed) coefficient vector for one of the four cores.

### **Step 3: Launching Parallel Tasks**

With the initial states prepared, the algorithm launches four parallel tasks. Each core receives its own unique starting state and is assigned a distinct part of the problem:

* **Core 0 receives:**
  * **Initial state:** the packed vector $C_{core0}$.
  * **Workload:** the **middle variables** $z_2, \dots, z_7$.
  * It will compute all evaluations for inputs where $(x_0, x_1) = (0, 0)$.
* **Core 1 receives:**
  * **Initial state:** the packed vector $C_{core1}$.
  * **Workload:** the **middle variables** $z_2, \dots, z_7$.
  * It will compute all evaluations for inputs where $(x_0, x_1) = (0, 1)$.
* **Cores 2 and 3** receive their respective states ($C_{core2}$, $C_{core3}$) and handle the remaining cases.

Each core now independently runs the recursive, SIMD-optimized `eval_eq_packed` function on its assigned workload. This hybrid "Bottom → Top → Middle" approach ensures that the work is split efficiently across cores while also maximizing the performance within each core using SIMD.
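The top-variable expansion of Step 2 can be sketched with plain integers. The sketch below uses `u64` arithmetic modulo the BabyBear prime in place of real (packed) field types, with a single scalar `alpha` standing in for the packed row $C_{bottom}$; the function name `expand_top_scalars` is illustrative, not the Plonky3 API.

```rust
/// Illustrative modulus: the BabyBear prime.
const P: u64 = 2013265921;

/// Expand an initial scalar by the top coordinates `top_z`, producing one
/// starting scalar per thread (2^top_z.len() of them). Row i corresponds to
/// fixing (x_0, ..., x_{t-1}) to the bits of i, most significant first.
fn expand_top_scalars(alpha: u64, top_z: &[u64]) -> Vec<u64> {
    let mut rows = vec![alpha];
    for &z_j in top_z {
        let mut next = Vec::with_capacity(rows.len() * 2);
        for &s in &rows {
            let s1 = s * z_j % P;      // branch x_j = 1: s * z_j
            let s0 = (s + P - s1) % P; // branch x_j = 0: s * (1 - z_j) = s - s1
            next.push(s0);
            next.push(s1);
        }
        rows = next;
    }
    rows
}
```

Since $\sum_{\mathbf{x}} \text{eq}(\mathbf{x}, \mathbf{z}) = 1$, the per-core scalars always sum back to the original $\alpha$, which makes a handy sanity check.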
:::

## Approach 2: Batched Evaluation for Multiple Points

The first approach is optimized for a single evaluation, but its performance becomes a critical bottleneck in protocols that require verifying many constraints at once. This section details a more advanced **batched** method that could solve this problem efficiently. We would like to implement this in Plonky3 so that both use cases are handled cleanly.

### The Motivation for Batching: The `combine` Bottleneck

In many cryptographic protocols, it's common to take a large number of individual checks and probabilistically combine them into a single, more efficient check. The `Statement::combine` function from the WHIR codebase is a perfect example of this. Its goal is to take a set of $m$ individual constraints of the form $p(\mathbf{z}_i) = s_i$ and merge them using random challenge powers $\gamma^i$.

The core of this operation is the construction of an aggregated "weight polynomial" $W(X)$, defined as:

$$W(X) = \sum_{i=0}^{m-1} \gamma^i \cdot \text{eq}(\mathbf{z}_i, X)$$

To proceed, the prover must compute the evaluations of this $W(X)$ over the entire Boolean hypercube. A naive implementation would call our optimized `eval_eq` function once for each of the $m$ terms in the sum and then add the results. As profiling and analysis from Angus Gruen revealed, this is a significant performance hotspot:

> *Currently, `eval_eq`...is responsible for around 40 - 50% of the total proof time. Primarily this time consumption occurs in a call to `combine`...As each call of `eval_eq` costs roughly $2^n$ multiplications and $2^{n+1}$ additions/subtractions, if we want to accumulate $m$ points, this will require roughly $m \cdot 2^n$ multiplications.*

To solve this, Angus proposed an alternative approach.

### Leveraging Linearity

Instead of computing each `eq` polynomial's evaluations and then summing them, we can leverage linearity to perform the summation *inside* the recursive evaluation.
Let's illustrate with the simple case where $n=1$, as highlighted in the performance analysis:

* **Current approach (compute then sum):** for each of the $m$ points, we would:
  1. Compute $\gamma_i \cdot z_i$ (1 multiplication).
  2. Compute $\gamma_i \cdot (1 - z_i) = \gamma_i - \gamma_i \cdot z_i$ (1 subtraction).
  3. Add these two values to running accumulators (2 additions).

  **Total for $m$ points: $m$ multiplications and $3m$ additions/subtractions.**

* **New approach (sum then compute):** the final accumulated values we need are $\sum \gamma_i z_i$ and $\sum \gamma_i (1-z_i)$. We can compute these more directly:
  1. Compute the sum $\sum \gamma_i z_i$ ($m$ multiplications, $m-1$ additions).
  2. Compute the sum $\sum \gamma_i$ ($m-1$ additions).
  3. The second accumulator is obtained with one subtraction: $\sum \gamma_i (1-z_i) = \sum \gamma_i - \sum \gamma_i z_i$.

  **Total for $m$ points: $m$ multiplications and roughly $2m$ additions/subtractions.**

By rearranging the operations, we save approximately $m$ additions. This insight can be generalized to the full recursive algorithm.

### The Batched Recursive Algorithm

The key to a faster implementation is to generalize the recursive strategy from a single point to a **batch** of points. Instead of computing each equality polynomial and then summing the results, we will build an algorithm that works on the entire sum at every step.

#### The Mathematical Blueprint

Our goal is to directly compute the sum $S(\mathbf{x}) = \sum_{i=0}^{m-1} \gamma_i \cdot \text{eq}(\mathbf{z}_i, \mathbf{x})$ for all $\mathbf{x}$ on the Boolean hypercube. The recursive property of the equality polynomial extends to the entire sum. At each step of the recursion, we can update the whole vector of scalars $(\gamma_0, \gamma_1, \dots, \gamma_{m-1})$ to generate two new vectors of scalars:

- one for the $x_j = 0$ branch,
- one for the $x_j = 1$ branch.

Let's formalize this by splitting on the first variable, $x_0$.
The full sum can be written as:

$$S((x_0, \mathbf{x}')) = \sum_{i=0}^{m-1} \gamma_i \cdot (x_0 z_{i,0} + (1-x_0)(1-z_{i,0})) \cdot \text{eq}(\mathbf{x}', \mathbf{z}'_i)$$

This gives us two sub-problems:

1. **For the $x_0 = 0$ branch**, the problem becomes a new sum over the remaining variables $\mathbf{x}'$, but with a new set of scalars $\gamma'_i$:
   $$S((0, \mathbf{x}')) = \sum_{i=0}^{m-1} \underbrace{[\gamma_i \cdot (1 - z_{i,0})]}_{\text{new scalar } \gamma'_i} \cdot \text{eq}(\mathbf{x}', \mathbf{z}'_i)$$
2. **For the $x_0 = 1$ branch**, we get a similar sub-problem with its own new scalars $\gamma''_i$:
   $$S((1, \mathbf{x}')) = \sum_{i=0}^{m-1} \underbrace{[\gamma_i \cdot z_{i,0}]}_{\text{new scalar } \gamma''_i} \cdot \text{eq}(\mathbf{x}', \mathbf{z}'_i)$$

This reveals the recursive structure: the sub-problem at each step is identical in form to the original, just with an updated list of scalars. Our algorithm can simply recurse by re-computing this list of scalars for each variable.

### Implementation Strategy

The new implementation should refactor the original logic to align with this batched mathematical model. Throughout the following description, we will refer to this commit as a first draft of the implementation: https://github.com/Plonky3/Plonky3/blob/36171c7eec615783f5e3a313fc728c04af6ec8ef/multilinear-util/src/eq.rs

## **Batched Implementation**

### **New Data Structures: Representing the Batch**

The function signature is redesigned to handle an entire batch of $m$ evaluation problems at once.

* **Single-point input:**
  * A single point $\mathbf{z} = (z_0, \dots, z_{n-1})$ and a scalar $\alpha$.
* **Batched input:**
  * An $n \times m$ **matrix of points** $\mathbf{Z}$. Each column is a complete point $\mathbf{z}_i$, and each row contains the $j$-th coordinate for all points in the batch.
    $$\mathbf{Z} = \begin{pmatrix} z_{0,0} & z_{0,1} & \dots & z_{0,m-1} \\ z_{1,0} & z_{1,1} & \dots & z_{1,m-1} \\ \vdots & \vdots & \ddots & \vdots \\ z_{n-1,0} & z_{n-1,1} & \dots & z_{n-1,m-1} \end{pmatrix}$$

    Each of the $m$ columns is one point of the batch, and row $j$ collects the $j$-th coordinate of every point.

  * An $m$-dimensional **vector of scalars** $\mathbf{\gamma}$, holding the corresponding weight for each point:
    $$\mathbf{\gamma} = (\gamma_0, \gamma_1, \dots, \gamma_{m-1})$$

### **The Recursive Step**

The core of the new algorithm is how it updates the vector of scalars at each recursive step. Instead of operating on a single value, it transforms the entire vector.

Let $\mathbf{\gamma}^{(j)}$ be the vector of scalars at recursion depth $j$. The algorithm takes this vector and the $j$-th row of the points matrix, $\mathbf{z}_{j, \cdot} = (z_{j,0}, \dots, z_{j, m-1})$, and produces two new vectors for the next level.

* **For the $x_j = 0$ branch**, the new vector of scalars is computed via element-wise multiplication (Hadamard product $\odot$):
  $$\mathbf{\gamma}^{(j+1)}_{x_j=0} = \mathbf{\gamma}^{(j)} \odot (1 - \mathbf{z}_{j, \cdot}) = (\overbrace{\gamma_0^{(j)} (1-z_{j,0})}^{\text{new } \gamma'_0}, \overbrace{\gamma_1^{(j)} (1-z_{j,1})}^{\text{new } \gamma'_1}, \dots, \overbrace{\gamma_{m-1}^{(j)} (1-z_{j,m-1})}^{\text{new } \gamma'_{m-1}})$$
* **For the $x_j = 1$ branch**, the process is similar:
  $$\mathbf{\gamma}^{(j+1)}_{x_j=1} = \mathbf{\gamma}^{(j)} \odot \mathbf{z}_{j, \cdot} = (\overbrace{\gamma_0^{(j)} z_{j,0}}^{\text{new } \gamma''_0}, \overbrace{\gamma_1^{(j)} z_{j,1}}^{\text{new } \gamma''_1}, \dots, \overbrace{\gamma_{m-1}^{(j)} z_{j,m-1}}^{\text{new } \gamma''_{m-1}})$$

This is the central loop of the recursion: at each level, it produces the inputs for the next level by performing an element-wise multiplication across the entire batch.

### **Base Cases for the Final Summation**

The recursion bottoms out when there is only one variable left ($n=1$).
At this point, the unrolled base case (like the `eval_eq_1` function) performs the final summation with maximum efficiency. The function receives a single row of points $\mathbf{z}_{0, \cdot}$ and the final vector of scalars $\mathbf{\gamma}$. It must compute the two values of the combined polynomial, $S(0)$ and $S(1)$:

$$\begin{aligned} S(0) &= \sum_{i=0}^{m-1} \gamma_i (1-z_{0,i}) \\ S(1) &= \sum_{i=0}^{m-1} \gamma_i z_{0,i} \end{aligned}$$

Instead of a loop over both sums, it uses the linearity insight:

1. Compute $S(1)$ with a single dot product: $S(1) = \mathbf{\gamma} \cdot \mathbf{z}_{0, \cdot}$.
2. Compute the total sum of scalars: $S_{sum} = \sum_{i=0}^{m-1} \gamma_i$.
3. Find $S(0)$ with a single subtraction: $S(0) = S_{sum} - S(1)$.

This avoids $m$ extra operations compared to a naive summation loop (as described earlier in the theoretical section), providing a significant speedup for the base case.

#### **Adapting Optimizations**

The existing optimization strategies are adapted to work on this new batched data model:

* **SIMD packing:** the element-wise vector multiplications in the recursive step are perfectly suited for SIMD. A packed register can hold (e.g.) 4 scalars from $\mathbf{\gamma}^{(j)}$ and 4 corresponding coordinates from $\mathbf{z}_{j,\cdot}$, performing 4 of the required updates in a single instruction. This is applied across the columns of the matrix.
* **Parallelism:** the top-level parallel split remains the same. The key difference is that the initial state prepared for each of the `num_threads` is no longer a single pre-computed scalar but an entire **vector** of pre-computed scalars, one per point in the batch (so the per-thread buffer becomes a matrix). This allows the full, batched computation to be distributed and solved across all available cores.
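To make the batched recursion concrete, here is a minimal, non-SIMD, non-parallel sketch of the whole algorithm, again using `u64` arithmetic modulo the BabyBear prime in place of real field types. The names (`eval_eq_batched`, the row-major `z_rows` layout) are illustrative assumptions, not the Plonky3 API, and for brevity the sketch recurses down to $n=0$ rather than unrolling an $n=1$ base case as the real implementation would.

```rust
/// Illustrative modulus: the BabyBear prime.
const P: u64 = 2013265921;

/// Batched evaluation sketch: fills `out` with, for every x in {0,1}^n,
///   S(x) = sum_i gammas[i] * eq(z_i, x)  (mod P),
/// where `z_rows[j][i]` is the j-th coordinate of the i-th point
/// (one row per variable, one column per point).
fn eval_eq_batched(z_rows: &[Vec<u64>], gammas: &[u64], out: &mut [u64]) {
    match z_rows.split_first() {
        // Base case: no variables left, so eq(...) = 1 for every point and
        // the result is simply the sum of the current scalars.
        None => out[0] = gammas.iter().fold(0, |a, &g| (a + g) % P),
        Some((row, tail)) => {
            // Update the whole vector of scalars for the two branches:
            // x_j = 1 takes gamma_i * z_{j,i}; x_j = 0 takes the remainder,
            // gamma_i * (1 - z_{j,i}) = gamma_i - gamma_i * z_{j,i}.
            let g1: Vec<u64> = gammas.iter().zip(row).map(|(&g, &z)| g * z % P).collect();
            let g0: Vec<u64> = gammas.iter().zip(&g1).map(|(&g, &s1)| (g + P - s1) % P).collect();
            let (low, high) = out.split_at_mut(out.len() / 2);
            eval_eq_batched(tail, &g0, low);
            eval_eq_batched(tail, &g1, high);
        }
    }
}
```

Note that the summation over the batch happens *inside* the recursion: no per-point table of size $2^n$ is ever materialized and then accumulated, which is exactly the saving the batched approach is after.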