Our proposal with Yar involves moving away from PlonK completely and using more modern arguments instead (technically, something similar to CCS, though it is a bit more general).
So, I'd like to discuss what the differences between PlonK and CCS are, and what advantages and disadvantages this move has.
We have a matrix consisting of some columns $C_1, \ldots, C_k$; some of them are actually fixed (selector columns), but let's not differentiate them for now.
A collection of gates $G_1, \ldots, G_m$ is defined; they are applied at every row of our matrix.
A bunch of copy constraints of the form $C_i[x] = C_j[y]$ is applied on top of that (actually only on special "copy activated columns", but I will ignore it for simplicity).
In order to implement the copy constraints, we define a permutation on the cells of our matrix. Let's denote by $I$ some unique indexing of the cells of the matrix. Then, a permutation $\sigma$ is defined on this set of indices.
We define the permutation $\sigma$ to satisfy the following condition: starting from some cell $c$ and repeatedly applying $\sigma$, we visit exactly all the cells $c'$ such that a (transitive) copy constraint is applied to $c$ and $c'$ (i.e. the cycles of this permutation = the sets of cells that we want to make equal to each other).
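To make the cycle structure concrete, here is a small sketch (with a made-up cell indexing, not any real PlonK/Halo2 layout) that builds $\sigma$ from a list of copy constraints and checks that its cycles are exactly the equality classes:

```python
# A minimal sketch: cells are (column, row) pairs, copy constraints are pairs of
# cells, and sigma is built so that its cycles are the (transitive) equality classes.
def build_sigma(cells, copy_constraints):
    # Union-find over cells to collect the equality classes.
    parent = {c: c for c in cells}
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c
    for a, b in copy_constraints:
        parent[find(a)] = find(b)

    classes = {}
    for c in cells:
        classes.setdefault(find(c), []).append(c)

    # sigma cyclically shifts each equality class; untouched cells are fixed points.
    sigma = {c: c for c in cells}
    for cls in classes.values():
        for c, nxt in zip(cls, cls[1:] + cls[:1]):
            sigma[c] = nxt
    return sigma

cells = [(col, row) for col in range(3) for row in range(4)]
sigma = build_sigma(cells, [((0, 1), (1, 2)), ((1, 2), (2, 0))])
assert sigma[(0, 1)] == (1, 2) and sigma[(1, 2)] == (2, 0) and sigma[(2, 0)] == (0, 1)
```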
How it is implemented, roughly
After committing to these columns, a new round of the IOP starts. We spawn a random challenge $\alpha$, and form a single gate equation $G = \sum_i \alpha^i G_i$. Then, two things happen:
The gate equation is applied at each row (via some IOP: for example, sumcheck in HyperPlonk, or the standard PlonK IOP).
A challenge $\beta$ is spawned, and a permutation argument is applied to the sets $\{(w_c, c)\}_{c \in I}$ and $\{(w_c, \sigma(c))\}_{c \in I}$ (where $w_c$ denotes the value in cell $c$).
The permutation argument always involves committing to $\sigma$, as it is pretty much unstructured, and sometimes to additional data (such as the running-product column in normal Halo2); though you could use any permutation argument instead (for example, a GKR-based one).
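As a toy numeric illustration of why equal grand products over these two sets enforce the copies (toy field, made-up challenges and values; a real prover commits to a running product instead of computing everything in the clear):

```python
# Toy prime field and made-up data, for illustration only.
P = 2**31 - 1

def grand_product(values, labels, beta, gamma):
    # Product of (w_c + beta * label + gamma) over all cells.
    acc = 1
    for v, s in zip(values, labels):
        acc = acc * ((v + beta * s + gamma) % P) % P
    return acc

# Cells 0..3; sigma has the cycle (0 2), so cells 0 and 2 must hold equal values.
sigma = [2, 1, 0, 3]
values = [7, 5, 7, 9]                  # satisfies the copy constraint w_0 = w_2
identity = list(range(4))

beta, gamma = 123456789, 987654321     # stand-ins for random verifier challenges
lhs = grand_product(values, identity, beta, gamma)                      # set {(w_c, c)}
rhs = grand_product(values, [sigma[c] for c in identity], beta, gamma)  # set {(w_c, sigma(c))}
assert lhs == rhs   # the two sets are equal as multisets, so the products match
```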
Important takeaways:
In order to apply multiple gates to a single element, you are more or less forced to copy it to multiple places. This gives a large space for micro-optimizations. Copy constraints are better activated on a smaller number of columns (especially in Halo2); this also increases the gains from low-level optimizations and from pipelining operations into different state machines.
I will argue that by using CCS instead, we can get away with a much narrower set of low-level optimizations and get better performance with much less user-facing complexity.
We still have columns, but now they are "virtual", in the sense that we do not commit to them; instead, we have a witness $w$, and the columns are $M_i w$ - applications of some matrices $M_i$ to it.
There is still a gate equation applied to these columns, but no copy constraints applied (as we can simulate them using these linear maps).
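For reference, the gate equation over the virtual columns can be written in the usual CCS shape (here $c_i$ are constants, $S_i$ are multisets of matrix indices, and the product of the vectors $M_j w$ is taken entrywise):

$$\sum_{i=1}^{q} c_i \cdot \bigcirc_{j \in S_i} \big( M_j \, w \big) \;=\; \mathbf{0}.$$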
CCS is a superset of R1CS
To simulate R1CS, set up three matrices $A$, $B$, $C$ and apply a single equation $(A w) \circ (B w) - C w = 0$ (where $\circ$ is the entrywise product).
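A minimal numeric sketch of this reduction (toy matrices, toy prime field, one constraint encoding a single multiplication):

```python
# Toy field and a toy R1CS instance, for illustration only.
P = 2**31 - 1

def matvec(M, w):
    return [sum(m * x for m, x in zip(row, w)) % P for row in M]

# One constraint row encoding w2 = w0 * w1 over the witness w = (w0, w1, w2).
A = [[1, 0, 0]]
B = [[0, 1, 0]]
C = [[0, 0, 1]]
w = [3, 5, 15]

Aw, Bw, Cw = matvec(A, w), matvec(B, w), matvec(C, w)
# CCS-style gate: (A w) o (B w) - (C w) = 0, entrywise.
assert all((a * b - c) % P == 0 for a, b, c in zip(Aw, Bw, Cw))
```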
CCS is a superset of PlonK
This is a bit contentious because Srinath's original paper does not state that you can linearly combine gates with a challenge; however, it is pretty obvious that you can:
Set $w$ to be the de-duplicated witness (i.e. glue all copied things together), and set the $M_i$-s to be the mappings that reconstruct the $i$-th column from $w$.
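A small sketch of this simulation, assuming a hypothetical two-column layout where copies are just repeated reads of the de-duplicated witness:

```python
# Illustrative only: the columns and index maps below are made up.
def selection_matrix(index_map, witness_len):
    # Row r of the matrix picks witness entry index_map[r].
    return [[1 if j == src else 0 for j in range(witness_len)] for src in index_map]

def matvec(M, w):
    return [sum(m * x for m, x in zip(row, w)) for row in M]

w = [10, 20, 30]                    # de-duplicated witness
col0_map = [0, 1, 2]                # column 0 reads w_0, w_1, w_2
col1_map = [2, 2, 0]                # column 1 repeats w_2 twice and copies w_0

M0 = selection_matrix(col0_map, len(w))
M1 = selection_matrix(col1_map, len(w))
assert matvec(M0, w) == [10, 20, 30]
assert matvec(M1, w) == [30, 30, 10]   # copies are just repeated reads of w
```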
How CCS can be implemented
There are a few versions of CCS now observed in the wild (or at least proposed).
There are a few versions of SPARK (Spartan's technique for committing to sparse matrices), but in essence it can be thought of as committing to a triple of tables $(\mathsf{row}, \mathsf{col}, \mathsf{val})$, where the entry $(\mathsf{row}_k, \mathsf{col}_k, \mathsf{val}_k)$ means that we add the value $\mathsf{val}_k \cdot w_{\mathsf{col}_k}$ to the cell $\mathsf{row}_k$ of the column.
To evaluate the multilinear extension $\widetilde{M}(r_x, r_y)$ at a challenge point, we use some indexed lookup argument (Lasso, logup, or whatever you like) to index into the tables $\mathsf{eq}(r_x, \cdot)$ and $\mathsf{eq}(r_y, \cdot)$, and then compute the sum of triple products $\sum_k \mathsf{val}_k \cdot \mathsf{eq}(r_x, \mathsf{row}_k) \cdot \mathsf{eq}(r_y, \mathsf{col}_k)$.
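A toy check of this evaluation (without the lookup argument itself): the sparse sum of triple products over the committed $(\mathsf{row}, \mathsf{col}, \mathsf{val})$ entries matches a dense evaluation of $\widetilde{M}$ at the same point. The field, the matrix and the randomness are illustrative only.

```python
import random

P = 2**31 - 1

def bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def eq(r, index):
    # Multilinear eq(r, index) = prod_i (r_i * b_i + (1 - r_i) * (1 - b_i)).
    acc = 1
    for r_i, b_i in zip(r, bits(index, len(r))):
        acc = acc * ((r_i * b_i + (1 - r_i) * (1 - b_i)) % P) % P
    return acc

n = 2                                         # a 4 x 4 matrix
entries = [(0, 1, 1), (2, 3, 5), (3, 0, 7)]   # committed (row, col, val) triples
rx = [random.randrange(P) for _ in range(n)]
ry = [random.randrange(P) for _ in range(n)]

# Sparse evaluation: sum of triple products over the nonzero entries only.
sparse = sum(val * eq(rx, row) % P * eq(ry, col) % P for row, col, val in entries) % P

# Dense evaluation of the same matrix MLE, over all 16 cells.
dense = [[0] * 4 for _ in range(4)]
for row, col, val in entries:
    dense[row][col] = val
full = sum(dense[row][col] * eq(rx, row) % P * eq(ry, col) % P
           for row in range(4) for col in range(4)) % P
assert sparse == full
```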
It might appear that SPARK is definitely heavier than the normal permutation argument: while we do not actually need to commit to $\mathsf{val}$ as it is always $1$, we seem to spend 2 lookups per copy. This impression is correct; however, it is possible to modify this argument to make it vastly cheaper.
Start by observing that SPARK has an interesting property - it is basically a list of execution calls that tell you how to map your witness to your columns. You could, in theory, add more calls: for example, a call that copies a chunk of data from a witness to a column in a single instruction.
In fact, it is a relatively easy task! (While we do not suggest actually doing it this way because it would be non-ergonomic, even this would likely outperform plonkish-style constraints by decreasing the number of operations that we need to process.)
Let's see how it could be done. We start by restricting our chunk: it should be something that we call a standard segment - a chunk of indices of the form $\{a \cdot 2^k,\; a \cdot 2^k + 1,\; \ldots,\; (a+1) \cdot 2^k - 1\}$, i.e. $2^k$ consecutive indices starting at a multiple of $2^k$. Let's assume that we want to map it to a chunk $\{b \cdot 2^k, \ldots, (b+1) \cdot 2^k - 1\}$ of a column.
Then, the instruction just needs to compute the term $\sum_{j=0}^{2^k-1} \mathsf{eq}(r_x, b \cdot 2^k + j) \cdot \mathsf{eq}(r_y, a \cdot 2^k + j)$ (these are just the terms that a normal SPARK would have in the sum, with $\mathsf{val}_k = 1$).
But this sum is actually equal to $\mathsf{eq}(r_x^{\mathsf{hi}}, b) \cdot \mathsf{eq}(r_y^{\mathsf{hi}}, a) \cdot \sum_{j=0}^{2^k-1} \mathsf{eq}(r_x^{\mathsf{lo}}, j) \cdot \mathsf{eq}(r_y^{\mathsf{lo}}, j)$, where $r^{\mathsf{lo}}$ and $r^{\mathsf{hi}}$ denote the parts of the challenge corresponding to the $k$ low address bits and to the remaining high bits.
The sum term $\sum_{j=0}^{2^k-1} \mathsf{eq}(r_x^{\mathsf{lo}}, j) \cdot \mathsf{eq}(r_y^{\mathsf{lo}}, j)$ can be computed once.
Actually, this can be done for any matrix that you want to apply multiple times: for a matrix $M$ mapping standard segments to standard segments, precompute the term $\widetilde{M}(r_x^{\mathsf{lo}}, r_y^{\mathsf{lo}}) = \sum_{i,j} M_{ij} \cdot \mathsf{eq}(r_x^{\mathsf{lo}}, i) \cdot \mathsf{eq}(r_y^{\mathsf{lo}}, j)$ once.
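A numeric sanity check of this factorization, assuming the aligned-segment convention above (sizes and randomness are arbitrary; here the "gadget" is a plain copy, so the precomputed inner term is just $\sum_j \mathsf{eq}(r_x^{\mathsf{lo}}, j) \cdot \mathsf{eq}(r_y^{\mathsf{lo}}, j)$):

```python
import random

P = 2**31 - 1

def bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def eq(r, index):
    acc = 1
    for r_i, b_i in zip(r, bits(index, len(r))):
        acc = acc * ((r_i * b_i + (1 - r_i) * (1 - b_i)) % P) % P
    return acc

k, total = 3, 6                       # segments of size 2^3 inside a space of size 2^6
rx = [random.randrange(P) for _ in range(total)]
ry = [random.randrange(P) for _ in range(total)]
rx_lo, rx_hi = rx[:k], rx[k:]         # low bits address the position inside the segment
ry_lo, ry_hi = ry[:k], ry[k:]         # high bits address which segment it is
a, b = 5, 2                           # witness segment a -> column segment b

# Direct SPARK-style sum over the 2^k copied cells.
direct = sum(eq(rx, b * 2**k + j) * eq(ry, a * 2**k + j) % P for j in range(2**k)) % P
# Reusable inner term (computed once per gadget) times the cheap per-instance factor.
inner = sum(eq(rx_lo, j) * eq(ry_lo, j) % P for j in range(2**k)) % P
factored = eq(rx_hi, b) * eq(ry_hi, a) % P * inner % P
assert direct == factored
```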
This is, naturally, very aligned with what you want to do as a programmer: premake some gadget (a component, in circom terms), and apply it multiple times. If you allocate this component in standard segments, the computation of SPARK for it will consist of two computations: a one-time computation of the gadget's term $\widetilde{M}_{\mathsf{gadget}}(r_x^{\mathsf{lo}}, r_y^{\mathsf{lo}})$, and, for each instance of the gadget, a multiplication of this precomputed value by the cheap factor $\mathsf{eq}(r_x^{\mathsf{hi}}, b) \cdot \mathsf{eq}(r_y^{\mathsf{hi}}, a)$ for that instance's segment offsets $a, b$.
While this looks like it leaves a lot of optimization work to the user, in fact programmers already write their code modularly, splitting everything into separate gadgets. So I expect that direct application of this approach is totally worth doing (and you also get almost free linear combinations - i.e. they become free if you use the gadget enough times).
In fact, I want to stress that we are not bound to CCS. We can use these SPARK mappings to send our actual committed witness to a bunch of virtual columns (even of a different size, potentially!)
I suggest that we model this (very generalized) CCS as follows: a SPARK emulator takes the committed witness and a list of mapping instructions and produces the virtual columns, and then a collection of IOPs (gate equations, lookups, and so on) is run over these virtual columns.
These IOPs can call back and allocate their additional witnesses in later rounds if they want (and we will figure out how to package this in a relatively uniform way), but they likely should not be able to call the SPARK emulator back.
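To make the shape of this model concrete, here is a hypothetical, heavily simplified data model of the "SPARK emulator + IOPs" pipeline; the instruction format and all names are illustrative, not a proposed specification:

```python
from dataclasses import dataclass

@dataclass
class SegmentCopy:
    src: int      # start of an aligned chunk in the witness
    dst: int      # start of an aligned chunk in a virtual column
    col: int      # which virtual column to write into
    length: int   # chunk length (a power of two)

def emulate(witness, instructions, num_cols, col_len):
    # The "SPARK emulator": materialize virtual columns from mapping instructions.
    cols = [[0] * col_len for _ in range(num_cols)]
    for ins in instructions:
        cols[ins.col][ins.dst:ins.dst + ins.length] = witness[ins.src:ins.src + ins.length]
    return cols

def gate_iop(cols):
    # Stand-in for the downstream IOPs: a single gate c0 * c1 - c2 = 0 per row.
    return all(a * b - c == 0 for a, b, c in zip(*cols))

witness = [2, 4, 3, 5, 6, 20]
instructions = [
    SegmentCopy(src=0, dst=0, col=0, length=2),   # (2, 4)  -> column 0
    SegmentCopy(src=2, dst=0, col=1, length=2),   # (3, 5)  -> column 1
    SegmentCopy(src=4, dst=0, col=2, length=2),   # (6, 20) -> column 2
]
cols = emulate(witness, instructions, num_cols=3, col_len=2)
assert gate_iop(cols)   # 2 * 3 = 6 and 4 * 5 = 20, row by row
```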
There is a surprising upgrade that seems possible in this paradigm. Specifically, it seems that it can allow us to completely solve the circuit vs VM tradeoff!
The tradeoff is: a hand-written circuit only pays for the operations it actually performs, but its control flow must be fixed in advance; a VM supports arbitrary control flow, but every step pays for the full instruction set and the bookkeeping around it.
There are a few current design attempts at improving this situation, typically using SNARK recursion.
We propose the dynamic circuit builder path - a relatively lightweight VM which processes only the control flow and builds the circuit as it progresses (this approach is partially inspired by the definition of quantum computation, where a classical computer builds a quantum circuit).
While this looked completely impossible in PlonK (outputting selector tables might be ok, but dealing with the permutation seems completely insane), the vectorized SPARK is already, naturally, a virtual machine for the computation of the transition matrices $M_i$. It only seems natural to take advantage of this fact and introduce dynamic loads and stores into it - which makes it possible to create an arbitrary circuit from the call stack (details are still being settled here, and we are far from a formal specification; but it seems there is no barrier here).
Advantages are hard to overstate: