
generalized CCS backend / DSL proposal

Our proposal with Yar involves moving away from PlonK completely and using more modern arguments instead (technically, something similar to CCS, though it is a bit more general).

So, I'd like to discuss what the differences between PlonK and CCS are, and what advantages and disadvantages this move has.

PlonK arithmetization

We have a matrix, consisting of some columns ($C_0, \dots, C_{k-1}$); some of them are actually fixed (selector columns), but let's not differentiate them for now.

A collection of gates $E_1, \dots, E_s$ is defined, which are applied at every row of our matrix.
A bunch of copy constraints of the form $C_a[i] == C_b[j]$ is applied on top of that (actually only on special "copy-activated" columns, but I will ignore this for simplicity).

In order to implement the copy constraints, we define a permutation on the cells of our matrix. Let's denote by $e(a, i)$ some unique indexing of the cells of the matrix. Then the permutation is defined on this set of indices.

We define a permutation $\sigma$ satisfying the condition that, starting from some cell $e$, it visits exactly all the cells $e'$ such that a (transitive) copy constraint is applied to $e, e'$ (i.e., the cycles of this permutation = the sets of cells that we want to make equal to each other).
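
To make the cycle structure concrete, here is a minimal Python sketch (the helper `sigma_from_copies` is hypothetical, not from any codebase) that builds $\sigma$ from a list of copy constraints with a union-find, so that each cycle of $\sigma$ is exactly one class of cells forced to be equal:

```python
# Hypothetical sketch: derive sigma from copy constraints via union-find.
def sigma_from_copies(num_cols, num_rows, copies):
    def e(a, i):                      # e(a, i): unique index of cell (a, i)
        return a * num_rows + i

    parent = list(range(num_cols * num_rows))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for (a, i), (b, j) in copies:     # each constraint C_a[i] == C_b[j]
        parent[find(e(a, i))] = find(e(b, j))

    # Rotate each equivalence class by one step: the cycles of sigma are
    # exactly the sets of cells that must be equal.
    classes = {}
    for x in range(num_cols * num_rows):
        classes.setdefault(find(x), []).append(x)
    sigma = list(range(num_cols * num_rows))
    for cls in classes.values():
        for idx in range(len(cls)):
            sigma[cls[idx]] = cls[(idx + 1) % len(cls)]
    return sigma

# two columns, two rows; a single constraint C_0[1] == C_1[0]
s = sigma_from_copies(2, 2, [((0, 1), (1, 0))])
assert s[1] == 2 and s[2] == 1   # cells e(0,1)=1 and e(1,0)=2 form a 2-cycle
```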

How it is implemented, roughly

After committing to these columns, a new round of the IOP starts. We spawn a random challenge $\gamma$, and form a single gate equation, $E = \sum_i \gamma^i E_i$. Then, two things happen:

  1. The gate equation $E(C_0[i], \dots, C_{k-1}[i])$ is applied at each row index $i$ (via some IOP, for example sumcheck in HyperPlonk, or the standard PlonK IOP).

  2. A challenge $\tau$ is spawned, and a permutation argument is applied to the sets $C_a[i] + \tau \cdot e(a, i)$ and $C_a[i] + \tau \cdot \sigma(e(a, i))$ (see the sketch right after this list).

     The permutation argument always involves committing to $\sigma(e(a, i))$, as it is pretty much unstructured, and sometimes to additional data (as in normal Halo2); though you could use any permutation argument instead (for example, GKR-based).
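
For intuition, here is a minimal sketch of the multiset equality that the permutation argument enforces. The grand-product form with an extra random challenge `beta` is an assumption for illustration; concrete protocols (Halo2-style products, GKR, etc.) realize the same check differently:

```python
# Check that {C_a[i] + tau*e(a,i)} and {C_a[i] + tau*sigma(e(a,i))} are equal
# as multisets, via a grand product over a random challenge beta.
P = 2**61 - 1   # toy prime field

def permutation_check(columns, sigma, tau, beta):
    num_rows = len(columns[0])
    def e(a, i):
        return a * num_rows + i
    lhs = rhs = 1
    for a, col in enumerate(columns):
        for i, val in enumerate(col):
            lhs = lhs * (beta + val + tau * e(a, i)) % P
            rhs = rhs * (beta + val + tau * sigma[e(a, i)]) % P
    return lhs == rhs

cols = [[5, 7], [7, 9]]        # satisfies C_0[1] == C_1[0]
sigma = [0, 2, 1, 3]           # 2-cycle on e(0,1)=1 and e(1,0)=2
assert permutation_check(cols, sigma, tau=123456, beta=987654)
```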

Important takeaways:

In order to apply multiple gates to a single element, you are more or less forced to copy it to multiple places. This leaves a large space for micro-optimizations. Copy constraints are better activated on a smaller number of columns (especially in Halo2), which also increases the gains from low-level optimizations and from pipelining operations into different state machines.

I will argue that using CCS instead, we can get away with a much narrower set of low-level optimizations and still get better performance with much smaller user-facing complexity.

CCS (not our version, just how it is defined by Srinath)

We still have columns, but now they are "virtual", in the sense that we do not commit to them; instead we have a witness $W$, and the columns are $C_a = M_a(W)$ - applications of some matrices.

There is still a gate equation applied to these columns, but no copy constraints are applied (as we can simulate them using these linear maps).

CCS is a superset of R1CS

To simulate R1CS, set up three matrices $M_0, M_1, M_2$ and apply a single equation $C_0[i] \cdot C_1[i] - C_2[i] = 0$.
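
A minimal sketch of this reduction, with dense matrices purely for readability (a real system would represent them sparsely):

```python
# R1CS as CCS: columns C_a = M_a(W), one gate C_0[i]*C_1[i] - C_2[i] = 0.
P = 2**61 - 1   # toy prime field

def matvec(M, W):
    return [sum(m * w for m, w in zip(row, W)) % P for row in M]

def ccs_check(M0, M1, M2, W):
    C0, C1, C2 = matvec(M0, W), matvec(M1, W), matvec(M2, W)
    return all((a * b - c) % P == 0 for a, b, c in zip(C0, C1, C2))

# Example: x * x = 9 with witness W = (1, x, x^2).
M0 = [[0, 1, 0]]   # selects x
M1 = [[0, 1, 0]]   # selects x
M2 = [[0, 0, 1]]   # selects x^2
assert ccs_check(M0, M1, M2, [1, 3, 9])
```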

CCS is a superset of PlonK

This is a bit contentious, because in Srinath's original paper it was not stated that you can linearly combine gates with a challenge; however, it is pretty obvious that you can:

Set $W$ to be the de-duplicated witness (i.e., glue all copied things together), and set the $M_i$'s to be the mappings that reconstruct the $i$-th column from $W$.
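
A small sketch of this construction; the `src` encoding (which witness entry each cell is a copy of) is just one convenient way to write down the de-duplication:

```python
# Simulating PlonK copy constraints with CCS matrices: M_a is the 0/1 matrix
# that re-expands the de-duplicated witness W into column a.
def column_matrix(src_a, witness_len):
    # one-hot rows: row i reads W[src_a[i]]
    return [[1 if c == src_a[i] else 0 for c in range(witness_len)]
            for i in range(len(src_a))]

# Two columns of length 3 sharing one value: C_0[2] and C_1[0] both read W[2],
# so the copy constraint C_0[2] == C_1[0] holds by construction.
W = [7, 11, 42, 5, 6]
src = [[0, 1, 2], [2, 3, 4]]
M = [column_matrix(s, len(W)) for s in src]
C = [[sum(m * w for m, w in zip(row, W)) for row in Ma] for Ma in M]
assert C[0][2] == C[1][0] == 42
```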

How CCS can be implemented

There are a few versions of CCS now observed in the wild (or at least proposed):

  1. Free linear combination protocols. These are descendants of Groth16, the most notable being Garuda. They do the $M_i$'s for free, but require a trusted setup and can only work for elliptic curves. Inventing a free linear combination protocol for STARKs is out of scope for this discussion.
  2. SPARK. This is a multivariate protocol, which commits to a matrix $M$ in a sparse form and simulates the evaluation of the dense multivariate polynomial $M(x_0, \dots, x_{k-1}; y_0, \dots, y_{l-1})$ at a point $(r; r')$.

There are a few versions of SPARK, but in essence it can be thought of as committing to a triple of tables $(\mathsf{val}, \mathsf{row}, \mathsf{col})$, where an entry $(a, i, j)$ means that we add the value $a$ to the cell $(i, j)$.

To perform the evaluation, we use some indexed lookup argument (Lasso, logup, or whatever you like) to index into the tables $eq(r, \cdot)$ and $eq(r', \cdot)$, and then compute the sum of triple products $\sum_i \mathsf{val}[i] \cdot eq(r, \mathsf{row}[i]) \cdot eq(r', \mathsf{col}[i])$.
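
Here is a minimal sketch of that evaluation sum; the lookup argument certifying the $eq$-table reads is elided, this only shows the arithmetic that both sides agree on:

```python
# SPARK evaluation: the dense evaluation of M at (r; r') equals
# sum_i val[i] * eq(r, row[i]) * eq(r', col[i]) over the sparse entries.
P = 2**61 - 1   # toy prime field

def bits(x, n):
    return [(x >> j) & 1 for j in range(n)]

def eq(r, x_bits):
    # eq(r, x) = prod_j (r_j * x_j + (1 - r_j) * (1 - x_j))
    acc = 1
    for rj, xj in zip(r, x_bits):
        acc = acc * (rj * xj + (1 - rj) * (1 - xj)) % P
    return acc

def spark_eval(entries, r, r2):
    k, l = len(r), len(r2)
    return sum(v * eq(r, bits(i, k)) * eq(r2, bits(j, l))
               for v, i, j in entries) % P

# sanity check: at a Boolean point, the sum reads off a single matrix cell
entries = [(5, 2, 1), (7, 0, 3)]   # M[2][1] = 5, M[0][3] = 7
assert spark_eval(entries, bits(2, 2), bits(1, 2)) == 5
```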

It might appear that SPARK is definitely heavier than the normal permutation argument: while we do not actually need to commit to $\mathsf{val}$, as it is always $1$, we seem to spend 2 lookups per copy. This impression is correct; however, it is possible to modify this argument to make it vastly cheaper.

Vectorized SPARK (our suggestion, part 1)

Start by observing that SPARK has an interesting property - it is basically a list of execution calls that tell you how to map your witness to your columns. You could, in theory, add more calls: for example, a call that copies a chunk of data from a witness to a column in a single instruction.

In fact, it is a relatively easy task! (While we do not suggest actually doing it this way, because it would be non-ergonomic, even this would likely outperform plonkish-style constraints by decreasing the number of operations that we need to process.)

Let's see how it could be done. We start by restricting our chunk: it should be something that we call a standard segment - a chunk of the form $[i \cdot 2^k .. (i+1) \cdot 2^k]$. Let's assume that we want to map it to a chunk $[j \cdot 2^k .. (j+1) \cdot 2^k]$.

Then, the instruction $\mathrm{vmap}_k(i, j)$ just needs to compute the term $\sum_{u=0}^{2^k-1} eq(r, 2^k i + u) \cdot eq(r', 2^k j + u)$ (these are just the terms that a normal SPARK would have in the sum, with $\mathsf{val} = 1$).

But this sum is actually equal to

$$eq(r[k..], i) \cdot eq(r'[k..], j) \cdot \sum_{u=0}^{2^k-1} eq(r[..k], u) \cdot eq(r'[..k], u)$$

The sum term does not depend on $i$ or $j$, so it can be computed once and reused across all invocations.
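
The following sketch checks this factorization numerically over a toy field. The little-endian bit-order convention (coordinate $j$ of $r$ matches bit $j$ of the index, so $r[..k]$ covers the low-order bits) is an assumption made to pin the notation down:

```python
import random

P = 2**61 - 1   # toy prime field

def bits(x, n):
    return [(x >> j) & 1 for j in range(n)]

def eq(r, x_bits):
    acc = 1
    for rj, xj in zip(r, x_bits):
        acc = acc * (rj * xj + (1 - rj) * (1 - xj)) % P
    return acc

k, n = 3, 6              # segments of size 2^k inside a table of size 2^n
i, j = 5, 2              # source and target segment indices
r  = [random.randrange(P) for _ in range(n)]
r2 = [random.randrange(P) for _ in range(n)]

# Left side: the 2^k terms a plain SPARK would contribute, with val = 1.
lhs = sum(eq(r, bits((i << k) + u, n)) * eq(r2, bits((j << k) + u, n))
          for u in range(2**k)) % P

# Right side: two "segment" factors times a sum independent of i and j.
common = sum(eq(r[:k], bits(u, k)) * eq(r2[:k], bits(u, k))
             for u in range(2**k)) % P
rhs = eq(r[k:], bits(i, n - k)) * eq(r2[k:], bits(j, n - k)) * common % P

assert lhs == rhs
```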

Actually, this can be done for any $2^k \times 2^l$ matrix that you want to apply multiple times: for a matrix $a$, precompute the term $\sum_{u=0}^{2^k-1} \sum_{v=0}^{2^l-1} a[u, v] \cdot eq(r[..k], u) \cdot eq(r'[..l], v)$.
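
A sketch of that per-gadget precomputation (dense iteration for clarity; `precompute_gadget` is an illustrative name):

```python
P = 2**61 - 1   # toy prime field

def bits(x, n):
    return [(x >> j) & 1 for j in range(n)]

def eq(r, x_bits):
    acc = 1
    for rj, xj in zip(r, x_bits):
        acc = acc * (rj * xj + (1 - rj) * (1 - xj)) % P
    return acc

def precompute_gadget(a, r_low, r2_low):
    # a: dense 2^k x 2^l block; r_low = r[..k], r2_low = r'[..l].
    # Fold the block against the low-order eq tables once; each placement of
    # the gadget then only contributes the two cheap "segment" factors.
    k, l = len(r_low), len(r2_low)
    return sum(a[u][v] * eq(r_low, bits(u, k)) * eq(r2_low, bits(v, l))
               for u in range(2**k) for v in range(2**l)) % P
```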

This is, naturally, very aligned with what you want to do as a programmer: premake some gadget (a component, in circom terms), and apply it multiple times. If you allocate this component in standard segments, the SPARK computation for it will consist of two parts:

  1. Computing the gluing of the individual "model" component (this can be done recursively, but even done directly it is only slightly heavier than PlonK).
  2. Applying vectorized SPARK as above to each invocation of the component.

While this looks like a lot of space to optimize, in fact programmers already write their code modularly, splitting everything into separate gadgets. So I expect that direct application of this approach is totally worth doing (and you also get almost-free linear combinations - i.e., they become free if you use the gadget enough times).

FFI

In fact, I want to stress that we are not bound to CCS. We can use these SPARK mappings to send our actual committed witness to a bunch of virtual columns (potentially even of different sizes!)

I suggest that we model this (very generalized) CCS as follows:

  1. We have a witness $W$.
  2. We have maps to some virtual polynomials $C_i$ (potentially of different lengths).
  3. We can apply any IOPs to these columns (and we have standard ones for the "custom gates" that we use, lookups, etc.).

These IOPs can call back and allocate additional witnesses in later rounds if they want (and we will figure out how to package this in a relatively uniform way), but they likely should not be able to call the SPARK emulator back.
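
To make the shape of this model concrete, here is a hypothetical sketch of the interface; every name is illustrative and nothing here is a settled API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VirtualColumn:
    length: int
    # a SPARK-style map: given the witness, produce this column's values
    mapping: Callable[[list[int]], list[int]]

@dataclass
class GeneralizedCCS:
    witness: list[int]
    columns: list[VirtualColumn]        # possibly of different lengths
    # each IOP sees the materialized columns; it may allocate extra witness
    # data in later rounds, but never calls back into the SPARK emulator
    iops: list[Callable[[list[list[int]]], bool]]

    def check(self) -> bool:
        cols = [c.mapping(self.witness) for c in self.columns]
        return all(iop(cols) for iop in self.iops)
```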

Futuristic SPARK machine (our suggestion, part 2)

There is a surprising upgrade that seems possible in this paradigm. Specifically, it seems that it can allow us to completely solve the circuit vs VM tradeoff!

The tradeoff is:

  1. Circuits are faster for pure functions, but unable to execute branching (you need to pay for all possible branches) => they do not scale well.
  2. VMs are slower (every load / store has a cost of roughly 2 lookups), but as an upside, they can use branching.

There are a few current design attempts at improving this situation, typically using SNARK recursion:

  1. Aztec path: starting from the circuit paradigm (in Aztec, every contract is a Noir circuit), you can simulate the call stack using recursive aggregation of proofs. The size of the call stack is then severely limited by the recursion overhead.
  2. Precompile path: starting from the VM paradigm, you might wish to add certain circuits which execute "fat" opcodes - precompiles. However, you can't do this by building a monolithic circuit (this was one of the main problems of PSE's zk-EVM), so the typical approach taken by modern VMs is pushing these precompile invocations down to specific custom SNARKs tailored to optimize only a single particular instruction. This makes them much less usable in situations where a single precompile is invoked only a few times - i.e., such VMs are forced to have relatively large upfront costs to compute anything at all! - not good for client-side proving.
    Another problem with the precompile path is the fact that you need to know beforehand which accelerated instructions you want.

We propose the dynamic circuit builder path - a relatively lightweight VM which processes the control flow only and builds the circuit as it progresses (this approach is partially inspired by the definition of quantum computation, where a classical computer builds a quantum circuit).

While this looked completely impossible in PlonK (outputting selector tables might be OK, but dealing with the permutation seems completely insane), the vectorized SPARK is already, naturally, a virtual machine for the computation of the transition matrices $M_a$. It only seems natural to take advantage of this fact and introduce dynamic loads and stores into it - which makes it possible to create an arbitrary circuit from the call stack (the details are still being settled, and we are far from a formal specification; but it seems there is no barrier here).

The advantages are hard to overstate:

  1. From the perspective of a circuit builder, it unlocks the ability to use conditional jumps. I.e., you can have as much hand-written optimization as you want, and still invoke only the components that you need, when you need them.
  2. From the perspective of a VM lookup lover, this is an opportunity to turn every pure function into a precompile! You can make "precompiles" for all your purposes, they are not baked into the proof system, and in essence you get a significant performance boost for branch-less computation.