# Multi-Var-Length Keccak Circuit Design proposal for a Keccak circuit that can handle multiple variable-length inputs. ## Context After redefining $$\mathsf{circuitID}(vk) = \mathsf{keccak}(vk)$$ we will be hashing - Fixed-length inputs: The Pedersen commitment $m$ - Inputs with one variable length: The proof ID is $\mathsf{keccak}(cid, p_0, \ldots p_\ell)$ - Inputs with two variable lengths: The circuit ID is $\mathsf{keccak}(vk.fixed, vk.s, vk.pedersen)$ and the last two inputs have variable lengths. ## Summary This proposal attempts to treat the two var-length cases consistently. The rough idea is 1. Decompose inputs to bytes 2. Fill buffer `preimage` with bytes of fixed input then bytes of each var-len input 3. Feed `preimage` to existing pipeline: a. Bytes to keccak padded words b. Grey-box computes all intermediate digests c. True digest selected based on `preimage` length Note that - Part 1 should be performed by the `KeccakCircuit`, which is aware of the types/encodings of its inputs (e.g. vk consists of G1, G2 points in limb representation) - Part 2 is a `KeccakChip` functionality that is just responsible for copying the meaningful bytes from each input piece into a buffer. - Part 3 is `KeccakChip::keccak_var_len` Part 2 is described below. ### Example The difficulty of having multiple variable-length inputs is that some entries in `preimage` may come from either of the variable-length inputs. Say we wish to hash `vk`, consisting of - `vk.alpha` (64 bytes) - `vk.beta` (128 bytes) - `vk.gamma` (128 bytes) - `vk.delta` (128 bytes) - `vk.s` (64 * `vk.s.length` bytes) - `vk.h` (256 * `has_commitment` bytes) Then the first 448 bytes of `preimage` are always `[vk.alpha || vk.beta || vk.gamma || vk.delta]`. But the remaining bytes of `preimage` are either bytes of `vk.s` or `vk.h` (or padding), depending on `vk.s.length` and `has_commitment`. Therefore the circuit component that forms `preimage` must be able to insert values from `vk.s` or `vk.h` into any slot of `preimage` from index 448. Suppose these unrealistic values: - `vk.s.length = 2` out of a max possible of `4` - `vk.h.length = 2` out of a max possible of `2` Then `preimage` has length `448 + 4 + 2` and the 6 final bytes are $$\begin{align*} \begin{pmatrix}s_0 & s_1 & s_2 & s_3\end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0\end{pmatrix} + \begin{pmatrix} h_0 & h_1 \end{pmatrix} \begin{pmatrix} 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0\end{pmatrix} = \\ \begin{pmatrix}s_0 & s_1 & h_0 & h_1 & 0 & 0 \end{pmatrix} \end{align*}$$ <!-- $$\begin{pmatrix}s_0 & s_1 & s_2 & s_3 & h_0 & h_1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0\end{pmatrix}$$ --> Here $s_i$ are the *padded* `vk.s` elements, same for $h_i$. That is, multiplying by appropriate "selector matrices" and summing puts the right elements in the right places. ## Specification Define a "selector matrix" $S_{L_p, L_{tot},o,\ell}$ for padded length $L_p$, total length $L_{tot}$, offset $o$, and data length $\ell$ to have - $L_p$ rows - $L_{tot}$ columns - Its first $o$ columns all zero - A diagonal of 1's of length $\ell$ starting from the index $o$ column $S_{L_p, L_{tot},o,\ell}[i,j]$ can be computed in-circuit treating $L_p, L_{tot}$ as circuit constants and $o, \ell$ as witness variables $$S_{L_p, L_{tot},o,\ell}[i,j]= (j-i == o) \land (i < o+\ell)$$ > The $(i<o+\ell)$ condition looks wrong to me now -- won't it just be the first $\ell$ rows of the matrix that can have a 1 and it occurs in column $i + o$? So that the expression would be $(j-i == o) \land (i < \ell)$? That has a simpler interpretation: The row/column index differ by the correct offset and there are exactly $\ell$ non-zero columns ### Sketch - Consistency condition: first input's data length = second input's offset - Then `preimage[fixed_size...] = vk.s * selector_0 + vk.h * selector_1` - The full generic matrix multiplication is unnecessary, structure of the selectors + known possible lengths can be used to reduce operations ### Questions/Notes - Can the matrix mult. be done on bigger chunks, e.g. at level of field elements? - Minimum size hints (e.g. `vk.s.length >=1`) can be used to eliminate some matrix operations, though this gets complex ### Implementation ```rust fn var_len_query( fixed_input: &[AssignedValue], fixed_size: usize, // size of fixed_input, known at keygen var_input: &[&[AssignedValue]], // each var_input[i] a single var input offsets: &[AssignedValue] // var_input.len many offsets describing where each piece starts lengths: &[AssignedValue] // var_input.len many lengths describing meaningful length in each var_input ) { // Length checks: assert_eq!(fixed_input.len(), fixed_size); let mut input_length = ctx.assign_constant(fixed_size); for (offset, length) in offsets.iter().zip_eq(lengths.iter()) { gate.constrain_equal(input_length, offset + 1); // Redefine input_length to start 1 lower input_length = gate.add(input_length, length); } // } ``` - 2 fn's: - Append var-len to fixed-length: return var-len buffer = list of values and size. Knows size of fixed-length and can ignore first block of matrix - append var-len to var-len: return same type. More generic matrix mult. The functions could append data to a `KeccakQuery`, consisting of - Fixed input: we assume `preimage` begins with data of a fixed length (which we can allow to be 0). - Var inputs: each of these is a pair `(padded_bytes, len)` where `padded_bytes.length` is a circuit constant. Then when it's time to `finalize`, the algorithm is roughly - Copy `fixed_bytes` into `preimage` - Sum `padded_bytes.length` over all var inputs to obtain $L_{tot}$. This is how many more bytes will be added to `preimage`, though in general some are padding. This is a circuit constant. - Initialize `meaningful_length = 0` - For `var_input` of `var_inputs`: - `offset := meaningful_length` - `meaningful_length += var_input.len` - Perform (partial) matrix mult. `padded_bytes *`$S_{L_{tot}, L_p, o, \ell}$ - Add to `preimage[fixed_bytes.length..]` ### Naive Version of the copy The basic operation we keep doing is `preimage += var_input * S`, where `preimage` is actually the slice `preimage[fixed_input_length..]` and `S` is the appropriate selector matrix. Then for each $j$ this is `preimage[j] += inner_product(var_input, S[.,j])`, i.e. inner product of the padded variable input vector with the `j`th column of the selector matrix. ### Optimizations - Below the diagonal is always 0 on the selector matrices: $S_{i,j}=0$ if $i>j$ - Each var input has a furthest position where it could be placed in the preimage, and beyond that position the corresponding column of $S$ will be 0. This farthest position is determined by the padded input length of the current var input and all preceding ones, hence is a circuit constant. - Example: Var inputs of padded lengths $2, 3, 5$. - First selector matrix can only have first two columns non-zero. - Second selector matrix can only have first 5 columns non-zero - Third selector matrix can only have first 10 columns non-zero So the relevant quantity for `var_inputs[i]` is `var_input_lengths[0] + ... + var_input_lengths[i]`. This could be called `running_padded_len` and be passed into `copy_bytes` to provide the upper bound # Circuit ID Computation Encode VK as concatenation of: - fixed bytes: (vk.alpha, beta, gamma, delta, vk.s.length (as `u256`, length as vec of scalars, not byte length), vk.s[0]) - vk.s[1..] - vk.h (potentially empty) If `vk.has_commitment`, $$\mathsf{circuitID} = \mathsf{keccak}(\texttt{"UPA Groth16 with commitment circuit id"} || vk.fixed|| vk.s[1..] || vk.h)$$ Else $$\mathsf{circuitID} = \mathsf{keccak}(\texttt{"UPA Groth16 circuit id"} || vk.fixed || vk.s[1..])$$ # Optimization: Fq-inputs Treating the var-length inputs as byte vectors had poor performance because they got too long. We should be able to treat the var-length input vectors as a larger data-type while doing the matrix mult. step, then do byte decomposition after. In this case the appropriate data type would be $\mathbb{F}_q$ points. It would be nice to operate on a whole $\mathbb{F}_q$ point at once, but may be necessary to treat it as 3 $\mathbb{F}_r$-limbs. The vector of $\mathbb{F}_r$-limbs $$ \begin{pmatrix} a_0^0 & a_0^1 & a_0^2 & a_1^0 & a_1^1 & a_1^2 & \ldots & a_{k-1}^2 \end{pmatrix} $$ corresponds to the vector of $\mathbb{F}_q$ points $$ \begin{pmatrix} a_0 & a_1 & \ldots & a_{k-1} \end{pmatrix} $$ I don't see how we could handle the latter vector of $\mathbb{F}_q$ points in circuit, I think we'll need to do matmul using an $\mathbb{F}_r$ vector. But we can at least avoid recomputing some matrix elements unnecessarily; the matrix elements will be 1's or 0's in chunks of 3: $$ \begin{pmatrix} a_0^0 & a_0^1 & a_0^2 & a_1^0 & a_1^1 & a_1^2 & \ldots & a_{k-1}^2 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & & & & & & & \ldots \\ 0 & 1 & 0 & & 0 & & & 0 & & \ldots \\ 0 & 0 & 1 & & & & & & & \ldots \\ & & & & & & & & & \ldots \\ & 0 & & & 1 & & & 0 & & \ldots \\ & & & & & & & & & \ldots \\ \vdots & & & & \ddots \end{pmatrix} $$ So we'll compute the slice-adder matrix 3-elements at a time and just multiply each group of 3 limbs. That forms a sort of "pre-preimage" where the variable input portion of the preimage is specified as a vector of Fq-points encoded in chunks of 3 Fr-limbs. Then the desired preimage is computed by - Copying in the fixed-length input bytes (no change compared to above b/c there was no overhead to treating these as bytes) - For each chunk of 3 Fr's from pre-preimage: - Copy over 11 bytes from first - Copy over 11 bytes from second - Copy over 10 bytes from third - Compute preimage length (in bytes) as fixed input byte length + 32 * var-input Fq length. But this logic to compute the preimage from the pre-preimage is fixed, it doesn't need to know anything about the preimage sizes. > Could do an additional optimization to account for the fact that our second var input is all-or-nothing, so there's no need to compute those matrix elements individually. But it's just the difference between computing 1 vs 8 matrix elements, you still have to do all the matrix multiplications to position them correctly.