Polynomial composition ideas
A major bottleneck in the prover is composing the univariate trace polynomials into the multivariate constraint polynomials to produce one univariate polynomial. Let's think about what it's doing.
The multivariate constraints are a bunch of terms which each look like this:
The max degree of our constraints is 4 so there are at most 4 variables in each term. So for each term we have to multiply max 4 polynomials together and then multiply the result by a single field element.
If the polynomials are in eval form (over the root of unity) then multiplication is just pointwise vector multiplication. We already have these polynomials in eval form- they're literally the columns of the trace. We can even transpose the trace so that the polynomials are all in contiguous arrays if that could speed things up. So for each term, we just do pointwise multiplications on the already-computed eval form of the polynomials, do an NTT to turn it back into monomial basis, and then multiply by the coefficient which is a base field element.
Question- is there a way to multiply by the coefficient while still in the eval form? That could speed things up.
So fundamentally, composing the polynomials is just doing a bunch of pointwise vector multiplications followed by an NTT (and then multiplying by the coefficient). These can all be done in parallel and they're all over the base field. We are planning to invest in doing fancy NTT tricks using our field and in building FPGAs to run these NTTs quickly. We can just reuse all this logic to do the composition.
One detail I omitted is that we also need the 'second row polys' that give the variables in the second row. This is just the first row polys composed with
One optimization is that we can precompute what powers of each column are needed and then compute those ahead of time. I know we sometimes multiply one column by itself (for example when constraining a column to binary we do
Another question is to do pointwise multiplications of large vectors can we also use the FPGA or do this with GPUs?
All this combined makes me think we don't need to turn the multivariable polynomial constraints into graphs and can just use the normal map representation to compute every term in parallel and then add them together in the end.