XL-Circuits in Halo2

# XL-Circuits in Halo2 *Status: ideas.* In Halo2, circuits are organized in a rectangular table, where the rows correspond to repeated applications of gates. This article explores the concept of XL-circuits, where the witness is organized in fewer columns of different sizes. This mostly targets KZG-based systems, where we try to make circuits narrower, i.e. with fewer columns. Then the methods below should be same-or-better than current practices, prover and verifier sides. With the Halo2-style IPA commitment scheme, trade-offs are complicated but mostly favor the prover. The limitation of rectangular tables is that the prover work is multiplied by the highest degree of any gate, which is typically under-exploited. The prover work is also tightly coupled with the height of the witness table, which forces suboptimal designs in practice (there are examples in zkEVM). The section "Exploration" below expands on this with performance estimates. But let’s jump right to the proposal. This builds on vanishing arguments from the [Halo2 book](https://zcash.github.io/halo2/design/proving-system/vanishing.html). #### Before: ![](https://i.imgur.com/zlu3tJd.png =x200) #### After: ![](https://i.imgur.com/aEaSoiK.png =x200) ## A Size System Let’s start by naming things. * We define a progression of column sizes from `S` to `XXL`: small, medium, large, extra-large. These correspond to overlapping domains of polynomial evaluation. * We call `XL` the domain of size sufficient to prove that gates are satisfied, i.e. for the vanishing argument. Its size `N` is our main sizing and performance parameter. * XL-gates are constraints that apply `N` times on the rows of the XL domain. Similarly, we can run `N/2` times L-gates, `N/4` M-gates, and `N/8` S-gates. ![](https://i.imgur.com/2gYon1J.jpg =100x) Columns have different capacities, i.e. numbers of cells holding different values, but they are aligned, e.g. an offset of +1 in S equals an offset of +8 in XL. Domain | Column Capacity (n) | Gate Max Degree (d) | Lookup Support (see respective sections) | Use-cases -------|---------------|------------|------------------------------------------|------------------------------- S | N/8 | 8 + 1 | Quad | Complex gates M | N/4 | 4 + 1 | Dual | Complex gates L | N/2 | 2 + 1 | Single | Simple gates, Plonk XL | **N** = len(h) | 2 | XL-Lookup | Integers, function tables, booleans, R1CS-style XXL | 2N | 1 | Hyper-Lookup (cq…) | Integers, function tables The size system follows from the equation `N = n * (d-1)`, which combines parameters of Halo2: `n` the size of columns, `d` the maximum gate degree, into `N` the power of 2 just above the degree of the quotient `h(X)`. The count of `#columns ~ #cells / n` affects both the prover’s, and importantly, the verifier’s or aggregator’s costs. Meanwhile, `N` is fixed by a given circuit design and this drives the prover’s work. Our goal is to maximize the utility of that work with the fewest columns of the largest `n` sizes possible. ## XL-Gates In gate expressions, we count how many columns are multiplied together. A column counts as a fraction, as per the table. Example: `S*S` is worth the same as `M` because `N/8 + N/8 = N/4`. The product terms must add up to XXL (`2N`) at most. Therefore, expressions of this form are allowed in XL-gates: XXL XL * XL ## S/M/L-Gates In order to mix all columns within a single XL-zero-check, we adapt S/M/L-gates into the XL domain by masking. This means multiplying by XL-columns that have 0 at the rows outside of the smaller domains, which we call S-, M-, or L-mask (see table below): Mask * Expression ↑ ↑ XL XL In masked expressions, the products must add up to XL, with forms such as: XL L * L L * M * M L * M * S * S … S^8 All additions, scalings, and rotations are allowed, such as: XL - L * L Masking is counted as `+1` degree in the table above. Masks are a little less flexible, but almost as useful as regular columns, for instance as row selectors. From there, how do we design a gate? Here are the rules: * Pick the domain where the gate applies. If not XL, then masks are used. * Columns in this and larger domains can be accessed at the corresponding rows, and at any relative offset. * It’s hard to access smaller columns. Instead, make another gate in the smaller domain. ## Visualization Let’s recap in a table. The ✅’s are the cells that store fixed or witness values. Row (ωʳᵒʷ) | S [N/8] | M [N/4] | L [N/2] | XL [N] | | S-mask [N] | M-mask [N] | L-mask [N] ----------|---------|---------|---------|--------|-|------------|------------|-------- 8i | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ | ✅ 8i+1 | | | | ✅ | | 0 | 0 | 0 8i+2 | | | ✅ | ✅ | | 0 | 0 | ✅ 8i+3 | | | | ✅ | | 0 | 0 | 0 8i+4 | | ✅ | ✅ | ✅ | | 0 | ✅ | ✅ 8i+5 | | | | ✅ | | 0 | 0 | 0 8i+6 | | | ✅ | ✅ | | 0 | 0 | ✅ 8i+7 | | | | ✅ | | 0 | 0 | 0 Here are two schematic examples of zero-checks or vanishing arguments, and the maximum degree of polynomials involved: Argument: h ~ XL_column * XL_column / t Domains: XL ~ XL * XL / XL Sizes: N = N + N - N Degrees: N-1 > N-1 + N-1 - N Argument: h ~ L_column * L_column * mask / t Domains: XL ~ L * L * XL / XL Sizes: N = N/2 + N/2 + N - N Degrees: N-1 > N/2-1 + N/2-1 + N-1 - N ### Footnotes: * We could complicate things even more with zero-checks on smaller domains, but that would not bring much, and miss out on XL and XXL sizes. Rather, we standardize everything around XL-gates. * Sizes are powers-of-two as usual to work with radix-2 FFTs. The sizes of h are rounded up. ## L-Lookups Let’s see how Halo2 lookups work with sizes ([reference](https://zcash.github.io/halo2/design/proving-system/lookup.html)). A lookup needs 3 internal columns, and gate degree 3: * Sorted versions of the column A and of the table T * A column Z for running products * Constraints of this form: ` Z * A * T ` We can lookup in the domain L, and the column Z will be of type L-mask, according to the sizing table above. The total degree works out to XXL as needed: Z * A * T ↑ ↑ ↑ XL L L = XXL *Note: this is a version without zero-knowledge, appropriate for zkEVM use-cases. For ZK, we need `+1` degree, so we downsize all domains by one notch.* *Note: it is complicated to implement this with masks. XL-Lookups below are simpler and more efficient.* ## XL-Lookups Now, what we really want to do is lookups in the XL domain, because: 1. A lookup is quite expensive so we want to amortize it. 2. XL tables are larger, so more powerful. What can we do with these XL-Lookups? * They are great for large **integer decompositions**. Limbs are **range-checked** in large batches, and combined in linear expressions. * Similarly, we can lookup input/output of **precomputed functions**, and we can chain those, all in XL size. * If we do need high-degree gates, we can **batch** smaller lookups into a bigger one. See below. We need to bring the degree down from 3 to 2. Note that the implementation above handles A and T together, which is no longer the best optimization. We can simply not do that, and use separate Zᴀ and Zᴛ columns. For the purpose of degree calculations, constraints will have these forms: Zᴀ * A Zᴛ * T A' * T' ↑ ↑ XL XL = XXL Here is one such constraint in full, with `γ` a random challenge, and `ωX` meaning "next row": Zᴀ(ωX) * (A'(X) + γ) - Zᴀ(X) * (A(X) + γ) == 0 **Result:** we can lookup `N` cells using `5` columns, versus `8` in the L domain, `16` in the M domain, or `32` in the S domain (calculations below). ## Batched Lookups Consider a circuit with four M-columns used in high-degree gates, but also in lookups. Instead of doing four lookups of these small columns, we can copy their content into an XL column, with cells interleaved four-by-four, enforced by `M_k[4i] == XL[4i+k]`: Row (ωʲ) | XL-Lookup | M0 | M1 | M2 | M3 --------------|-----------|-------|----|----|---- 4i | m0 | == m0 | m1 | m2 | m3 4i+1 | m1 | | == | | 4i+2 | m2 | | | == | 4i+3 | m3 | | | | == Then, we run a single lookup on that large column. The cost of batching works out as follows, if each circuit column counts as 1, plus an amortization of the shared lookup columns. Domain | Batch Size | Cost per circuit column | Cost per `N` cells | …without batching --------|------------|-------------------------|--------------------|------------------ S | 8 * N/8 | 1.625 | 13 | 32 or 26 M | 4 * N/4 | 2.25 | 9 | 16 or 14 L | 2 * N/2 | 3.5 | 7 | 8 XL | 1 * N | 5 | 5 | 5 Compare to the cost of a regular lookup which is 4. *- Calculations per `N` cells: in XL, once {A, A', T', Zᴀ, Zᴛ} makes 5 columns. `7 = 5XL + 2L` and `9 = 5XL + 4M` and `13 = 5XL + 8S`.* *- Per circuit column: divide by the batch size.* *- Calculations without batching: a regular lookup with {A, A', T', Z} is 4. Domains L/M/S are smaller so need 2, 4, or 8 times regular lookups, making 8, 16, 32.* *- In principle, M and S could be implemented with only 2 Zs, plus {A, A', T'} 4 or 8 times, making `14 = 2+4*3` and `26 = 2+8*3`.* *- More optimization: It may be possible to do this without the explicit A column, instead composing the smaller circuit columns in a product. This would lower the counts by `1`.* ## Hyper-Lookups By hyper-lookups we refer to lookups into tables much larger than our circuits: Caulk+, Flookup, Ba*loo*, cq, … The same reasonning as XL-Lookups apply, but tables are bigger, so more powerful. For instance, we can realistically do range checks of 24 bits at once. Hyper-lookups are more expensive to verify, firstly because they rely on pairing checks. Moreover, a *cq* proof uses 8 polynomial commitments ([reference](https://eprint.iacr.org/2022/1763.pdf)), twice as much as XL-Lookups. Luckily, we can also amortize it twice better, with XXL columns. Indeed, hyper-lookups come with their own pairing-based checks and there are no constraints on the actual looked up column, so XXL works. The commitment counts "per circuit column" and "per `N` cells" are similar to XL-lookups above. From there, we can use XXL columns in use-cases of degree 1: integer decompositions, precomputed functions, and lookup batching (same as previous sections). ## API In the API of the original Halo2 implementation, there is a single parameter `k`, and all of the following have size `2ᵏ`: * The length of all columns. * The capacity of the polynomial commitments. * The number of gate applications, i.e., the domain of `t`. All existing circuits essentially live in either the M or the S domain. They look like the wide or the tall examples in sections below. Ideally, there should be an on-boarding path for developers to add XL-style features on demand. The easiest way to integrate the ideas above is to add support for **batched lookups**. The user code requests k lookups into a same table. The backend decides whether and how to batch them using larger domains (see XL-lookups). In any case, it returns k columns in the single working domain. The user code may not be exposed to other domains. Meanwhile, some **automation** may be possible. With a sufficiently abstract definition of columns, rows, and regions, the Layouter should be able to allocate or split or concatenate vectors into the minimal set of polynomials. It can also plan the prover’s heavy lifting over the smallest possible evaluation domains (see [PR#427](https://github.com/zcash/halo2/issues/427)), and maybe skip some FFTs. This requires a little more research. The final level is to allow **explicit multi-size designs**. A natural way to extend the circuit API is to add support for "super-sampled" columns from the larger domains. Intermediate "data rows" will appear inserted in-between the regular rows where gates are applied. Data rows will be indexed by half- and quarter-rotations; corresponding to shift by √ω seen from the smaller gate domain. A new option will change the domain of gates to XL, from the current equivalent of M or S. However, to use the most beneficial features, the developer needs to work in the XL domain, with possibly some adjustments to existing M- or S-gates. ---- # Exploration and Calculations 🛑 Below this line is the detailed reasonning that led to the system above. 🛑 ---- Let’s introduce an example task. Then we will compare multiple circuit designs for the same computation. The first two designs favor respectively prover or verifier work at the expense of the other. Then, we introduce a mixed design which has "best of both worlds" properties. ## Motivating Examples In order to compare apples to apples, all examples below target these parameters: `n`: The number of applications of gates that we need. `w`: The number of witness variables per gate application. Let’s take an example computation, repeated `n` times, using `w=6` variables per iteration (`a…f`), a selector `s`, and implemented with gates of this form: (Π⁴(a,b) + Σ²(c,d,e,f)) * s == 0 The term `Π⁴(a,b)` represents some complex logic with a moderate degree of 4, which is typical; sometimes it may be 8. The term `Σ²(c,d,e,f)` represents a function of many variables but a low degree of 2. The ubiquitous examples are the decomposition of an integer into smaller limbs, or binary logic. Besides, we would like to constrain the variables c,d,e,f to belong to lookup tables. That means the term `Σ²` is actually fairly powerful in spite of the low gate degree. See the lookup section below. Some form of gate selectors is usually needed, which is exemplified wholesale by the term `s`. The gates are applied using a vanishing argument, that is proving the existence of the quotient `h(X)`, with `t(X)` the polynomial vanishing on the chosen domain: h = (Π⁴(a,b) + Σ²(c,d,e,f)) * s / t Our main performance metric is going to be the size of the Lagrange domain where `h(X)` is computed, which is the power of two just above its degree. This is because proving involves interpolating all input vectors to this size, then evaluating the gates this number of times. ## Basic Wide Circuit Let’s start with the simplest design. Row | t | s | A | B | C | D | E | F | [height = n] ----|----|-----|----|----|----|----|----|---|-------------- i | 0 | 0/1 | a | b | c | d | e | f | We allocate one committed column per variable, A to F, and we count one selector. commit_count = 1 + w = 7 Each row contains the value of the variables for one of `n` iterations, and the gates apply on every row. len(t) = len(s) = len(A) = … = len(F) = n We can calculate the size of the domain of `h(X)` by looking at multiplications in its expression. The degrees of `s` and `t` cancel each other, and the term `Π⁴` dominates: len(h) ≈ 4*n ↑ h(X) ∝ Π⁴(A,B) + Σ²(C,D,E,F) ↓ 2*n lower than 4*n Note: the exact minimum size is `4(n-1)`, but we neglect that because we will round all sizes up to powers of two, that is `n ≤ 2ᵏ` and `len(h) = 4 * 2ᵏ` ## Tall Circuit Let’s assume that we want to reduce proof size and verifier work. We can pack different variables on consecutive rows of a same column, which we will call "data rows". The variables on data rows are accessed using rotations, such as `c = A(ω³ⁱ⁺¹)`. Purpose | Row | t | s | A | B | [height ≈ 3n] -------------|-----------|-----|---|----|---|-------------- gate row | 3i | 0 | 1 | a | b | \`- data row | 3i + 1 | 0 | 0 | c | d | \`- data row | 3i + 2 | 0 | 0 | e | f | The data rows are used purely to store witness and the gates do not apply to them. Here we exploit a selector column `s` to disable the gates at these positions. In practice, selectors may cost more than in basic circuits. We changed the aspect ratio by a factor of 3 in this example. commit_count = 1 + w/3 = 3 len(t) = len(s) = len(A) = len(B) ≈ 3n Note: depending on the value of `n`, we may tweak the design to obtain again `len(A) = 2ᵏ` and `len(h) = 4 * 2ᵏ`, but that is not the point here. Plugging this into the expression of h: len(h) ≈ 4*3n ≈ 12n ↑ h(X) ∝ Π⁴(A,B) + Σ²(C,D,E,F) ↓ 2*3n lower than 4*3n The domain of h is getting large, because gates degree and data rows both have a multiplicative cost effect. That is the major factor causing slow proving times and high RAM usage. Note: there are real-world examples of this method used intensively and with more costly ratios. **This is the problem that the mixed circuit below intends to solve.** ## Mixed Circuit In a mixed circuit, we organize the witness based on how variables are used. Variables involved in high-degree logic are stored in regular columns of length `n`, like a wide circuit. Variables involved in low-degree operations are packed into fewer, longer columns of length `2n`, like a tall circuit. Iteration | Row | t [n] | s [n] | A [n] | B [n] | C [2n] | D [2n] -------------|---------|--------|-------|-------|-------|--------|-------- current | 2i | 0 | 0/1 | a | b | c | d \`- data row | 2i + 1 | | | | | e | f commit_count = 5 ∈ [1 + w/2; 1 + w] (depends on the mix) len(t) = len(s) = len(A) = len(B) = n len(C) = len(D) = 2n len(h) = 4*n ↑ h(X) ∝ Π⁴(A,B) + Σ²(C,D,E,F) ↓ 2*2n same as 4*n * The domain of h remains low, like the wide circuit and unlike the tall circuit. Notably, the C and D columns, while holding more data than A or B, do not increase the degree of h because their term `Σ²(C,D)` has a low enough degree even after doubling the lengths. * The benefit gets more significant with more variables in low-degree functions; for example, many integer decompositions or many lookups. The point is that we are able to balance *short-columns in high-degree gates* versus *long-columns in low-degree gates*. This gives us an optimal usage of the extended domain of `h`. ## Mixed Circuit, high-degree We can also go for another balanced design. Assume we can rewrite the gates to exploit functions such as `Π⁸(A)`. The higher degree 8 is costly but it may enable savings elsewhere, exemplified here by removing the variable `b`. Π⁸(a) + Σ²(c,d,e,f) == 0 Meanwhile, the data columns can go up to a length of 4n. Iteration | Row | t [n] | s [n] | A [n] | C [4n] -------------|---------|--------|-------|-------|-------- current | 4i | 0 | 0/1 | a | c \`- data row | 4i + 1 | | | | d \`- data row | 4i + 2 | | | | e \`- data row | 4i + 3 | | | | f commit_count = 3 ∈ [1 + w/4; 1 + w] (depends on the mix) len(t) = len(s) = len(A) = n len(C) = 4n len(h) = n*8 + n - n = 8n ↑ ↑ ↑ h(X) = (Π⁸(A) + Σ²(C)) * s / t ↓ 4n*2 same as n*8 ## Visualization of the interpolating polynomials Row (ωʲ) | t [n] | Short [n] | Long [2n] | h [4n] ----------|-------|-----------|-----------|--- 4i | 0 | x | x | x 4i+1 | | | | x 4i+2 | | | x | x 4i+3 | | | | x ## The Double-Wide Circuit If we want to trade-off in the other direction - wider and shorter - we can always reorganize the wide circuit into two parts side-by-side, which executes two iterations per row, for double the width and half the height. Row (ωⁱ) | t | Aₗ | Bₗ | Cₗ | Dₗ | | Aᵣ | Bᵣ | Cᵣ | Dᵣ [length = n/2] ---------|-----|----|----|----|----|-|----|----|----|---- i | 0 | a | b | c | d | | a' | b' | c' | d' commit_count = w*2 = 8 len(t) = len(Aₗ) = … = len(Dᵣ) = n/2 len(h) = 2*(n/2) = n - There are twice as many commitments, so the verifier work roughly doubles (with the KZG PCS). - The size of the domain is halved, so the prover work decreases somewhat due to the super-linearity of FFTs. - The complexity of evaluating h stays about the same, because we evaluate two gates on half fewer rows. ## A special case of the tall circuit Starting from the tall circuit above, and assuming the selector has a static power-of-two period, we can optimize it away in the vanishing polynomial. However, this seems impractical for real-world circuits that mix multiple tasks. index (ωⁱ) | t | A | B -----------|---------------|---|--- i | 0 | a | b i+1 | (not applied) | c | d i+2 | 0 | a'| b' i+3 | (not applied) | c'| d' commit_count = w/2 = 2 len(A) = len(B) = 2n len(t) = n len(h) = 2n*2 + 2n - n = 5n ↑ ↑ ↑ ↑ ↑ h(X) = (A² * B + A₊₁ + 2B₊₁ - A") / t ## Lookups (details) Say we have `kn` variables in a wide circuit with `k` lookups of length `n`. In the examples above, `k=2`. The metrics are: #columns = 4k (example: 8) len(h) = n + n + n - n = 2n ↑ ↑ ↑ ↑ h(X) ~ Z * A * S / t Let’s compare with a tall circuit with `1` lookup of length `kn`: #columns = 4 len(h) = 2kn (example: 4n) This factor `2k` is costly. Instead, we can make another trade-off. Indeed, the implementation above handles A and S together. We can simply not do that, and use separate Zᴀ and Zꜱ columns. Schematically, the gates will contain products of this form: Zᴀ * A Zꜱ * S The metrics become: #columns = 5 len(h) = kn + kn - kn = kn (example: 2n) ↑ ↑ ↑ h(X) ~ Zᴀ * A / t ## Dual- and Quad- Lookups It is possible to merge multiple lookups around the same Z column, with more multiplications in the gate. In fact, the original lookup above is already a merge of the two multiset equalities for one lookup, `A⇔A'` and `T⇔T'`. According to the rules of our system, we can pack two M-lookups or four S-lookups against a single Z check. Z * (A₀ * T₀) * (A₁ * T₁) XL * M⁴ = XXL Z * (A₀ * T₀) * (A₁ * T₁) * (A₂ * T₂) * (A₃ * T₃) XL * S⁸ = XXL This is not an earth-shaking improvement though, because the column count only goes from 4 to 3.5 or 3.25 per lookup. It’s much better to use XL- or Hyper-lookups. ## Hyper-Lookups (details) Domain | Batch Size | Cost per circuit column | Cost per `N` cells | Cost of `2N` batch ----------|------------|-------------------------|--------------------|-------------------- S | 16 * N/8 | 1.5625 | 12.5 | 25 M | 8 * N/4 | 2.125 | 8.5 | 17 L | 4 * N/2 | 3.25 | 6.5 | 13 XL | 2 * N | 5.5 | 5.5 | 11 XXL | 1 * 2N | 9 | 4.5 | 9 *Calculations: `9 = 8cq + 1XXL`, then `11 = 9 + 2XL` and `13 = 9 + 4L` and `17 = 9 + 8M` and `25 = 9 + 16S`. Then divide by 2 or by the batch size.* ## Comparison of lookup implementations (draft) ![](https://i.imgur.com/SUdDWOp.jpg) ## Footnote on Polynomial Commitments `n`: The capacity of the polynomial commitment scheme. With the KZG scheme, `n` is free for the verifier, but it is limited by the practical size of the trusted setup. With the Halo2 IPA scheme, the batched verification cost is linear in `n`. In case a column is larger than `n`, it’s best to rework the circuit to make it wider, i.e. more columns. In principle, large columns can be split over multiple commitments, and Halo2 does do that for `h(X)`, but this cancels the gains verifier-side. ---- # Feedback Thanks to str4d for his insights. Maybe it is possible to achieve the same results with the rectangular model. The layouter would detect low-degree columns, or recognizes developer annotations. Then it either packs the layout into fewer columns, or optimize their computation in smaller domains (like [#427](https://github.com/zcash/halo2/issues/427)). The question is whether automatic layout is sufficient, versus explicit designs.