# Multi-Var-Length Keccak Circuit
Design proposal for a Keccak circuit that can handle multiple variable-length inputs.
## Context
After redefining $$\mathsf{circuitID}(vk) = \mathsf{keccak}(vk)$$ we will be hashing
- Fixed-length inputs: The Pedersen commitment $m$
- Inputs with one variable length: The proof ID is $\mathsf{keccak}(cid, p_0, \ldots p_\ell)$
- Inputs with two variable lengths: The circuit ID is $\mathsf{keccak}(vk.fixed, vk.s, vk.pedersen)$ and the last two inputs have variable lengths.
## Summary
This proposal attempts to treat the two var-length cases consistently. The rough idea is
1. Decompose inputs to bytes
2. Fill buffer `preimage` with bytes of fixed input then bytes of each var-len input
3. Feed `preimage` to existing pipeline:
a. Bytes to keccak padded words
b. Grey-box computes all intermediate digests
c. True digest selected based on `preimage` length
Note that
- Part 1 should be performed by the `KeccakCircuit`, which is aware of the types/encodings of its inputs (e.g. vk consists of G1, G2 points in limb representation)
- Part 2 is a `KeccakChip` functionality that is just responsible for copying the meaningful bytes from each input piece into a buffer.
- Part 3 is `KeccakChip::keccak_var_len`
Part 2 is described below.
### Example
The difficulty of having multiple variable-length inputs is that some entries in `preimage` may come from either of the variable-length inputs.
Say we wish to hash `vk`, consisting of
- `vk.alpha` (64 bytes)
- `vk.beta` (128 bytes)
- `vk.gamma` (128 bytes)
- `vk.delta` (128 bytes)
- `vk.s` (64 * `vk.s.length` bytes)
- `vk.h` (256 * `has_commitment` bytes)
Then the first 448 bytes of `preimage` are always `[vk.alpha || vk.beta || vk.gamma || vk.delta]`. But the remaining bytes of `preimage` are either bytes of `vk.s` or `vk.h` (or padding), depending on `vk.s.length` and `has_commitment`.
Therefore the circuit component that forms `preimage` must be able to insert values from `vk.s` or `vk.h` into any slot of `preimage` from index 448.
Suppose these unrealistic values:
- `vk.s.length = 2` out of a max possible of `4`
- `vk.h.length = 2` out of a max possible of `2`
Then `preimage` has length `448 + 4 + 2` and the 6 final bytes are
$$\begin{align*}
\begin{pmatrix}s_0 & s_1 & s_2 & s_3\end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0\end{pmatrix} + \begin{pmatrix} h_0 & h_1 \end{pmatrix} \begin{pmatrix} 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0\end{pmatrix} = \\ \begin{pmatrix}s_0 & s_1 & h_0 & h_1 & 0 & 0 \end{pmatrix}
\end{align*}$$
<!-- $$\begin{pmatrix}s_0 & s_1 & s_2 & s_3 & h_0 & h_1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0\end{pmatrix}$$ -->
Here $s_i$ are the *padded* `vk.s` elements, same for $h_i$. That is, multiplying by appropriate "selector matrices" and summing puts the right elements in the right places.
## Specification
Define a "selector matrix" $S_{L_p, L_{tot},o,\ell}$ for padded length $L_p$, total length $L_{tot}$, offset $o$, and data length $\ell$ to have
- $L_p$ rows
- $L_{tot}$ columns
- Its first $o$ columns all zero
- A diagonal of 1's of length $\ell$ starting from the index $o$ column
$S_{L_p, L_{tot},o,\ell}[i,j]$ can be computed in-circuit treating $L_p, L_{tot}$ as circuit constants and $o, \ell$ as witness variables
$$S_{L_p, L_{tot},o,\ell}[i,j]= (j-i == o) \land (i < o+\ell)$$
> The $(i<o+\ell)$ condition looks wrong to me now -- won't it just be the first $\ell$ rows of the matrix that can have a 1 and it occurs in column $i + o$? So that the expression would be $(j-i == o) \land (i < \ell)$? That has a simpler interpretation: The row/column index differ by the correct offset and there are exactly $\ell$ non-zero columns
### Sketch
- Consistency condition: first input's data length = second input's offset
- Then `preimage[fixed_size...] = vk.s * selector_0 + vk.h * selector_1`
- The full generic matrix multiplication is unnecessary, structure of the selectors + known possible lengths can be used to reduce operations
### Questions/Notes
- Can the matrix mult. be done on bigger chunks, e.g. at level of field elements?
- Minimum size hints (e.g. `vk.s.length >=1`) can be used to eliminate some matrix operations, though this gets complex
### Implementation
```rust
fn var_len_query(
fixed_input: &[AssignedValue],
fixed_size: usize, // size of fixed_input, known at keygen
var_input: &[&[AssignedValue]], // each var_input[i] a single var input
offsets: &[AssignedValue] // var_input.len many offsets describing where each piece starts
lengths: &[AssignedValue] // var_input.len many lengths describing meaningful length in each var_input
) {
// Length checks:
assert_eq!(fixed_input.len(), fixed_size);
let mut input_length = ctx.assign_constant(fixed_size);
for (offset, length) in offsets.iter().zip_eq(lengths.iter()) {
gate.constrain_equal(input_length, offset + 1); // Redefine input_length to start 1 lower
input_length = gate.add(input_length, length);
}
//
}
```
- 2 fn's:
- Append var-len to fixed-length: return var-len buffer = list of values and size. Knows size of fixed-length and can ignore first block of matrix
- append var-len to var-len: return same type. More generic matrix mult.
The functions could append data to a `KeccakQuery`, consisting of
- Fixed input: we assume `preimage` begins with data of a fixed length (which we can allow to be 0).
- Var inputs: each of these is a pair `(padded_bytes, len)` where `padded_bytes.length` is a circuit constant.
Then when it's time to `finalize`, the algorithm is roughly
- Copy `fixed_bytes` into `preimage`
- Sum `padded_bytes.length` over all var inputs to obtain $L_{tot}$. This is how many more bytes will be added to `preimage`, though in general some are padding. This is a circuit constant.
- Initialize `meaningful_length = 0`
- For `var_input` of `var_inputs`:
- `offset := meaningful_length`
- `meaningful_length += var_input.len`
- Perform (partial) matrix mult. `padded_bytes *`$S_{L_{tot}, L_p, o, \ell}$
- Add to `preimage[fixed_bytes.length..]`
### Naive Version of the copy
The basic operation we keep doing is `preimage += var_input * S`, where `preimage` is actually the slice `preimage[fixed_input_length..]` and `S` is the appropriate selector matrix.
Then for each $j$ this is `preimage[j] += inner_product(var_input, S[.,j])`, i.e. inner product of the padded variable input vector with the `j`th column of the selector matrix.
### Optimizations
- Below the diagonal is always 0 on the selector matrices: $S_{i,j}=0$ if $i>j$
- Each var input has a furthest position where it could be placed in the preimage, and beyond that position the corresponding column of $S$ will be 0. This farthest position is determined by the padded input length of the current var input and all preceding ones, hence is a circuit constant.
- Example: Var inputs of padded lengths $2, 3, 5$.
- First selector matrix can only have first two columns non-zero.
- Second selector matrix can only have first 5 columns non-zero
- Third selector matrix can only have first 10 columns non-zero
So the relevant quantity for `var_inputs[i]` is `var_input_lengths[0] + ... + var_input_lengths[i]`. This could be called `running_padded_len` and be passed into `copy_bytes` to provide the upper bound
# Circuit ID Computation
Encode VK as concatenation of:
- fixed bytes: (vk.alpha, beta, gamma, delta, vk.s.length (as `u256`, length as vec of scalars, not byte length), vk.s[0])
- vk.s[1..]
- vk.h (potentially empty)
If `vk.has_commitment`,
$$\mathsf{circuitID} = \mathsf{keccak}(\texttt{"UPA Groth16 with commitment circuit id"} || vk.fixed|| vk.s[1..] || vk.h)$$
Else
$$\mathsf{circuitID} = \mathsf{keccak}(\texttt{"UPA Groth16 circuit id"} || vk.fixed || vk.s[1..])$$
# Optimization: Fq-inputs
Treating the var-length inputs as byte vectors had poor performance because they got too long. We should be able to treat the var-length input vectors as a larger data-type while doing the matrix mult. step, then do byte decomposition after. In this case the appropriate data type would be $\mathbb{F}_q$ points. It would be nice to operate on a whole $\mathbb{F}_q$ point at once, but may be necessary to treat it as 3 $\mathbb{F}_r$-limbs.
The vector of $\mathbb{F}_r$-limbs
$$
\begin{pmatrix} a_0^0 & a_0^1 & a_0^2 & a_1^0 & a_1^1 & a_1^2 & \ldots & a_{k-1}^2 \end{pmatrix}
$$
corresponds to the vector of $\mathbb{F}_q$ points
$$
\begin{pmatrix} a_0 & a_1 & \ldots & a_{k-1} \end{pmatrix}
$$
I don't see how we could handle the latter vector of $\mathbb{F}_q$ points in circuit, I think we'll need to do matmul using an $\mathbb{F}_r$ vector. But we can at least avoid recomputing some matrix elements unnecessarily; the matrix elements will be 1's or 0's in chunks of 3:
$$
\begin{pmatrix} a_0^0 & a_0^1 & a_0^2 & a_1^0 & a_1^1 & a_1^2 & \ldots & a_{k-1}^2 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & & & & & & & \ldots \\
0 & 1 & 0 & & 0 & & & 0 & & \ldots \\
0 & 0 & 1 & & & & & & & \ldots \\
& & & & & & & & & \ldots \\
& 0 & & & 1 & & & 0 & & \ldots \\
& & & & & & & & & \ldots \\
\vdots & & & & \ddots
\end{pmatrix}
$$
So we'll compute the slice-adder matrix 3-elements at a time and just multiply each group of 3 limbs. That forms a sort of "pre-preimage" where the variable input portion of the preimage is specified as a vector of Fq-points encoded in chunks of 3 Fr-limbs.
Then the desired preimage is computed by
- Copying in the fixed-length input bytes (no change compared to above b/c there was no overhead to treating these as bytes)
- For each chunk of 3 Fr's from pre-preimage:
- Copy over 11 bytes from first
- Copy over 11 bytes from second
- Copy over 10 bytes from third
- Compute preimage length (in bytes) as fixed input byte length + 32 * var-input Fq length.
But this logic to compute the preimage from the pre-preimage is fixed, it doesn't need to know anything about the preimage sizes.
> Could do an additional optimization to account for the fact that our second var input is all-or-nothing, so there's no need to compute those matrix elements individually. But it's just the difference between computing 1 vs 8 matrix elements, you still have to do all the matrix multiplications to position them correctly.