# Single-cycle hashing in Distaff VM
This note assumes some familiarity with [permutation checks](https://hackmd.io/@arielg/ByFgSDA7D), [Rescue Prime](https://eprint.iacr.org/2020/1143) hash function, and [Distaff VM](https://github.com/GuildOfWeavers/distaff).
Currently, computing a 2-to-1 Rescue hash (64 bytes -> 32 bytes) in Distaff VM requires 10 cycles and can be initiated only on steps which are multiples of 16. This effectively means that we can compute one hash every 16 cycles. Hashing is done like so:
1. First, we need to push the data we want to hash onto the stack. Since the current implementation of Distaff VM uses a 128-bit field, each element on the stack is a 128-bit value, so we need to push 4 elements onto the stack.
2. Then, we need to initialize the capacity portion of the state to zeros. This requires pushing two more ZERO elements onto the stack. The VM has a `PAD2` instruction which does this in a single cycle.
3. Then, we need to execute the `RESCR` instruction 10 times in a row. This instruction computes a single round of the Rescue hash function and leaves the result on the stack. The first invocation of the `RESCR` instruction must happen on a step which is a multiple of 16.
4. The result of the hash is now in the 5th and 6th registers from the top of the stack, while the top 4 registers contain garbage. We can use the `DROP4` instruction to bring the result to the top of the stack in a single cycle.
Graphically, this looks roughly like the image below. Here, we are assuming that the stack has already been initialized with the values $v_1, v_2, v_3, v_4$, and we are trying to compute $hash(v_1, v_2, v_3, v_4) \rightarrow (r_1, r_2)$.

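The steps above can be sketched as a toy stack model. This is a sketch under assumptions: `rescue_round` is a stand-in for the real `RESCR` round function, and stack elements are plain Python integers rather than 128-bit field elements.

```python
# Toy model of the 2-to-1 hashing sequence on the Distaff stack.
# The real RESCR round is abstracted away as an opaque `rescue_round`
# callback; the point here is the stack shuffling and the cycle count.

def hash_2_to_1(stack, rescue_round):
    """stack[0] is the top; v1..v4 are assumed to already be on top."""
    cycles = 0

    # Step 2: PAD2 pushes two ZERO elements to initialize the
    # capacity portion of the hasher state.
    stack = [0, 0] + stack
    cycles += 1

    # Step 3: execute RESCR 10 times; each invocation permutes the
    # 6-element state at the top of the stack.
    for _ in range(10):
        stack[:6] = rescue_round(stack[:6])
        cycles += 1

    # Step 4: DROP4 removes the top 4 (garbage) registers, leaving
    # the 2-element hash result on top.
    stack = stack[4:]
    cycles += 1

    return stack, cycles
```

End-to-end, the sequence costs 12 cycles (1 for `PAD2`, 10 for `RESCR`, 1 for `DROP4`), on top of whatever it took to push $v_1, ..., v_4$.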
Moving to a 64-bit field and using Rescue Prime instead of plain Rescue, we can improve on this a bit. Specifically, we can reduce the number of `RESCR` operations to 7, and hashing can start on a cycle which is a multiple of 8 (rather than 16).
However, there is a way to do much better than that.
## New hashing approach
The core idea of the new approach is that we can "offload" hash computations to a pre-computed table and then just reference rows in the table via a permutation check. Assuming we are now in a 64-bit field, a table for computing a 2-to-1 $hash(v_1, v_2, v_3, v_4, v_5, v_6, v_7, v_8) \rightarrow (r_1, r_2, r_3, r_4)$ would be built up like so ("-" is used in place of values which would take too much effort to fill in by hand):

Here, we are assuming that we are using Rescue Prime instantiated in a **64-bit field** over a state of 12 field elements. To compute a hash, we need 13 columns: one for the address (which increases monotonically), and 12 for the hasher state. We also need to compute 7 rounds (at a 40% security margin) - so, we are able to compute a single hash in an 8-step cycle.
The AIR constraints for the table are pretty simple:
1. For the address column $a_0$, we need to enforce that $a_0' = a_0 + 1$, where $a_0$ is the value of the cell in the current row, and $a_0'$ is the value of the cell in the next row.
2. For the remaining 12 columns ($a_1, ..., a_{12}$), we need to enforce regular Rescue Prime round constraints for all steps except the last one in every 8-step cycle. For this step, the first 8 cells should be unconstrained, and the other 4 cells should be set to zeros.
To handle the above exception, we can introduce a simple periodic column $k_0$, which will contain 7 zeros followed by a single one. We can then multiply the constraints either by $k_0$ or $(1 - k_0)$ to apply the right set of constraints for a given step (see table below).

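The gating by $k_0$ can be illustrated like this (hypothetical helper names; real constraints are polynomial evaluations over a field, not Python integers):

```python
# Sketch of the periodic column k_0 and how it selects between the two
# constraint sets. Plain integers stand in for field elements.

CYCLE = 8

def k0(step):
    # 7 zeros followed by a single one, repeated every 8 steps
    return 1 if step % CYCLE == CYCLE - 1 else 0

def gated_constraint(step, round_value, reset_value):
    # round_value: evaluation of a Rescue Prime round constraint
    # reset_value: evaluation of the capacity-reset constraint
    # On ordinary steps (k_0 = 0) only the round constraint is active;
    # on the last step of each cycle (k_0 = 1) only the reset constraint.
    k = k0(step)
    return (1 - k) * round_value + k * reset_value
```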
The reason for the second periodic column $k_1$ is that for the permutation check, we want to skip all intermediate states and include only the states from the first and last step of every 8-step cycle. For example, we could compute values included in the permutation check product like so:
$$
v_i = \left(\sum_{j=0}^{8} \alpha^{j+1} \cdot a_j\right) \cdot (1 - k_0) \cdot k_1 + \left(\sum_{j=0}^{4} \alpha^{j+1} \cdot a_j\right) \cdot k_0 \cdot k_1
$$
where $\alpha$ is a random value used for the random linear combination.
We can then compute the product of all $v_i$ values like so ($z$ is also a random value):
$$
\prod_{i=0}^n ((z + v_i) \cdot k_{i,1} + 1 - k_{i,1})
$$
Thus, on steps where $k_1 = 1$, $(z + v_i)$ will be included in the product, otherwise the value of the product will remain the same since $((z + v_i) \cdot k_{i,1} + 1 - k_{i,1})$ will evaluate to $1$.
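A quick numeric check of this selection mechanism (plain integers in place of field elements; `perm_product` is a hypothetical helper name):

```python
# Product over rows: rows with k_1 = 1 contribute a factor of (z + v_i);
# rows with k_1 = 0 contribute a factor of 1, leaving the product as-is.

def perm_product(vs, k1s, z):
    result = 1
    for v, k1 in zip(vs, k1s):
        result *= (z + v) * k1 + 1 - k1
    return result
```

For example, with $v = (3, 4, 5)$, $k_1 = (1, 0, 1)$ and $z = 10$, the product is $(10+3) \cdot (10+5) = 195$; the middle row is skipped.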
Now, to use the pre-computed hashes in the VM, we can introduce an `RHASH` operation, which can hash up to eight 64-bit values in a single VM cycle. The operation to compute $hash(u_1, u_2, u_3, u_4, u_5, u_6, u_7, u_8) \rightarrow (p_1, p_2, p_3, p_4)$ could work something like this:

Denoting stack registers as $s_0, ..., s_n$ where $s_0$ is the top of the stack, and helper registers as $h_0, ..., h_3$, this operation will add the following terms to the permutation product:
$$
z + (h_0 \cdot \alpha + s_7 \cdot \alpha^2 + ... + s_0 \cdot \alpha^9)
$$
$$
z + ((h_0 + 7) \cdot \alpha + s_3' \cdot \alpha^2 + ... + s_0' \cdot \alpha^5)
$$
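Numerically, the two terms can be sketched like this (plain integers in place of field elements; `rlc` and `rhash_terms` are hypothetical helper names):

```python
# The two terms RHASH adds to the permutation product: one matching the
# first row of an 8-step hash-table cycle (address + 8 inputs) and one
# matching the last row (address + 7, plus the 4 outputs).

def rlc(values, alpha):
    # random linear combination: values[0]*alpha^1 + values[1]*alpha^2 + ...
    return sum(v * alpha ** (k + 1) for k, v in enumerate(values))

def rhash_terms(h0, inputs, outputs, alpha, z):
    assert len(inputs) == 8 and len(outputs) == 4
    term_in = z + rlc([h0] + inputs, alpha)        # first row of the cycle
    term_out = z + rlc([h0 + 7] + outputs, alpha)  # last row of the cycle
    return term_in, term_out
```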
One nice property of this new approach is that the values in the hash table and helper registers are filled in by the prover. Thus, as far as the user is concerned, they loaded eight 64-bit values onto the stack, executed the `RHASH` instruction, and got the result in the top four registers of the stack in the next cycle.
## Supporting N-to-1 hashing
The above construction allows us to compute 2-to-1 hashes, but it will not work if we want to compute N-to-1 hashes because the capacity registers $a_9, ..., a_{12}$ get reset to zero on every 8th step. We can, of course, simulate N-to-1 hashing using a series of 2-to-1 hashes, but that won't be as efficient.
Thus, to support N-to-1 hashing natively, we need a way to carry over the values in the capacity registers to the next cycle. This can be achieved by imposing an extra constraint on the last four columns of the table. This constraint may look something like this (as before, $a_i$ denotes the value of column $a_i$ at the current row, and $a_i'$ denotes the value at the next row):
$$
\left(1 - \prod_{i=9}^{12}(1 - a_i')\right) \cdot \left(1 - \prod_{i=9}^{12}(1 - a_i' + a_i)\right) \cdot k_0 = 0
$$
The basic idea is that when $k_0 = 1$, the values in the last 4 columns of the table must either: (1) be reset to zeros, or (2) be equal to the values from the previous row. The former is enforced by the first term of the above constraint, while the latter is enforced by the second term.
And as before, when $k_0 = 0$ we would still need to enforce regular Rescue Prime transition constraints.
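To check that this behaves as intended, here is a numeric sketch over a 64-bit prime field (the "Goldilocks" prime $2^{64} - 2^{32} + 1$ is used as a plausible stand-in; as noted above, this is a rough sketch, not necessarily a sound encoding for all field values):

```python
# Either-reset-or-carry check on the capacity columns a_9..a_12,
# evaluated on the step where k_0 = 1. The result should be zero when
# the next-row cells are either all zeros or all equal to the
# current-row cells.

P = 2**64 - 2**32 + 1  # assumption: a 64-bit prime field

def field_prod(xs):
    out = 1
    for x in xs:
        out = out * x % P
    return out

def capacity_constraint(cur, nxt):
    # zero if the next row resets the capacity to zeros ...
    reset = (1 - field_prod((1 - b) % P for b in nxt)) % P
    # ... or if it carries the current values over unchanged
    carry = (1 - field_prod((1 - b + a) % P for a, b in zip(cur, nxt))) % P
    return reset * carry % P
```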
## Other potential uses of the table
One obvious limiting factor is that the number of `RHASH` operations cannot exceed 1/8 of the number of cycles consumed by the program. However, it is rather unlikely that a program will execute an `RHASH` operation every 8th cycle on average (unless all the program does is hashing). In fact, there will likely be many fewer `RHASH` instructions than the capacity of the table would allow.
Thus, we may end up wasting a lot of trace space. Fortunately, with some modifications, the table can be adapted to a number of other use cases. For example, it could be used to implement read-write memory. In such a case, a row in the table would look like so:

where:
* `ctx` holds the context ID of the currently executing function (every function gets its own memory).
* `addr` is the memory address. Each memory address points to a 256-bit segment of memory (or four 64-bit field elements, to be exact). These addresses don't need to be contiguous.
* `clk` is the clock cycle of a read or a write operation. These values must be in increasing order.
* `old_n` are the four 64-bit values stored at the address at the beginning of the read or write operation.
* `new_n` are the four 64-bit values stored at the address at the end of the read or write operation.
* There are two extra columns that remain. These could be used for intermediate values to help enforce necessary constraints (e.g., constraints enforcing that `clk` values are always increasing).
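For intuition, the read/write consistency such a table must enforce can be sketched in plain code (hypothetical names; a real design would express this with AIR constraints and permutation checks rather than a loop, and rows are assumed here to be sorted by context, address, and clock cycle):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MemRow:
    ctx: int                        # context ID of the executing function
    addr: int                       # address of a 256-bit memory segment
    clk: int                        # clock cycle of the read/write
    old: Tuple[int, int, int, int]  # values before the operation
    new: Tuple[int, int, int, int]  # values after the operation

def check_memory(rows: List[MemRow]) -> bool:
    prev = None
    for row in rows:
        if prev and (prev.ctx, prev.addr) == (row.ctx, row.addr):
            # same memory segment: clock must strictly increase, and the
            # old values must match what the previous operation left behind
            if row.clk <= prev.clk or row.old != prev.new:
                return False
        prev = row
    return True
```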
Thus, with this setup, we could load or store up to 256 bits of data with a single operation. For example, stack transition for a load operation could look like so:

As a part of this operation, we would also need to enforce that values in the helper registers are equal to the corresponding values at the top of the stack on the next cycle.
A store operation could look like so:

These are just rough sketches: they don't capture all the details and there are likely ways to do it more efficiently.