# Blake2b implementation in Halo2
## Overview
The main idea of this implementation is to have some modular basic operations and use them as independent configs. These operation gates are:
* Addition mod $2^{64}$.
* Right Rotation of 16, 24, 32 and 63 bits in numbers of 64 bits.
* Xor of 64 bits
* Not of 64 bits
We also need to decompose each 64-bit word in the trace into limbs, so we use utility methods to divide our word in 8 limbs, 8-bit each.
Then we have the Blake2bChip which makes use of all of the above chips.
Finally we have a Blake2b Gadget that uses an instance of Blake2bChip to operate.
## Cell Types
In this implementation we use custom types to refer to cells that have already been constrained. We have 4 main types:
* AssignedBit: $[0,1]$
* AssignedByte: $[0,255]$
* AssignedBlake2bWord: $[0, 2^{16}-1]$
* AssignedRow: 1 AssignedBlake2bWord and 8 AssignedBytes
The AssignedBit is constrained in the AdditionGate and it's used for the carry value.
The AssignedByte is constrained by range-checks.
The AssignedBlake2bWord is constrained by a decomposition gate, along with the range-checks over the bytes that compose it.
The AssignedRow doesn't need any additional constraints.
## Gate description
### Decomposition
We use utility functions to decompose 64bit values into 8 limbs of 8 bits each. To verify this in-circuit, we performe a range check of each limb and ensure that they add up to the the full number using a Halo2 custom gate. That is, decomposes the value $$x = \sum_{i=0}^7 l_i*2^i$$into a row $|x|l_0|l_1| \ldots |l_7|$ where the first column represents the full number and the following 8 columns represent the limbs that compose it. By ensuring that each limb is an 8bit number, we make sure that the full number is a 64bit number.
* Max constraint degree: 2
* Range checks: 8 in the same row
* Owned columns: $t_{range}$, $q_{range}$, $q_{decompose}$
* Inherited columns: full_number_u64, 8 limbs
* Rows per operation: 1
* Lookup tables: 1 column of $2⁸$ rows for the range-check ($t_{range}$)
### Negate
* Description: Receives an already constrained AssignedBlake2bWord and copies it into a new cell. Then calculates it's bitwise negation and writes it in the following row. Both cells are in the full_number_u64 column.
* Explanation of the constraint: We know that if $x =\neg y$ then $x + y = 2^{64} - 1$ (all bits will be 1). So the restriction for this chip is $$q_{negate} \cdot (2^{64} - 1 - x - y = 0)$$
* Max constraint degree: 2
* Rows per operation: 2
* Range checks: 8 in each row
* Owned columns: q_negate
* Inherited columns: full_number_u64, 8 limbs
* Lookup tables: ---
**Soundness consideration**:
### Rotate 63b chip
* Description: This chip rotates an AssignedBlake2bWord 63 bits to the right and writes it in the following row, in the full_number_u64 column. It also writes the limbs of the result and range-checks them, activating the decomposition gate to verify the correct decomposition.
* Constraint: $$q_{rot63}\cdot (2x-y)\cdot (2x-y-(2^{64}-1))$$ where $x$ is the input number and $y$ is the rotated number, i.e. the result.
* Explanation: Let's prove that this constraint is complete and sound whenever the modulo of the prime field we are using is greater than $2^{65}$.
**Completeness**: If we want to rotate $x$ which has a binary representation of $b_0|\ldots |b_{63}$, we want to get $y = b_1|\ldots |b_{63}|b_0$ as a result, where $b_i = 0,1$ for each $i=0,1,\ldots, 63$. Let us write $x$ as $$x = b_0\cdot 2^{63} + z$$where $z$ is a 63 bits number with binary representation $z = b_1|\ldots| b_{63}$. Now, we can also write $y = 2z+b_0$. Therefore, $$2x-y = (b_0\cdot 2^{64}+2z) - (2z + b_0) = b_0\cdot 2^{64} - b_0 = b_0\cdot (2^{64}-1).$$ Thus, if $y$ is the right rotation of 63 bits of $x$, the expression $2x-y$ will be either $0$ or $2^{64}-1$ (depending on the value of $b_0$, which is the most representative bit in $x$). This makes the expression $$(2x-y)\cdot (2x-y-(2^{64}-1))$$ equal to $0$.
**Soundness**: Suppose we have a field element $y$ such that $$(2x-y)\cdot (2x-y-(2^{64}-1)) = 0$$and that the modulo $p$ of our field is $p>2^{65}$.
Since both $x$ and $y$ are range-checked to be 64bit numbers (i.e. $0 \le x,y < 2^{64}$), we know that the expression $2x-y$ is in the range $(-2^{64},2^{65}-1).$ Now, the constraint is satisfied only if $2x-y \equiv 0 \pmod p$ or $2x-y\equiv 2^{64}-1 \pmod p.$
Since $p > 2^{65}$, the only *integer* congruent to $0 \pmod p$ in the range $(-2^{64},2^{65}-1)$ is $0$. Similarly, the only *integer* congruent to $2^{64}-1 \pmod p$ in the range $(-2^{64},2^{65}-1)$ is $2^{64}-1$.
This tells us that, if $x,y$ satisfy the constraint, then $2x-y$ is equal to either $0$ of $2^{64}-1$ *as integers*.
If $2x-y=0$, this tells us that $2x=y$. Since $y < 2^{64}$, we can imply that $x<2^{63}$. So the most significant bit of $x$ as a 64b number is $0$, which is also the last bit of $y$ (since $y$ equals $2x$ which is even). The equality $2x=y$ tells us that the other 63 bits of each number are the same but shifted one place. Therefore, $y$ is the rotation of $x$.
If $2x-y=2^{64}-1$, we know that the most significant bit of $x$ is $1$ (because otherwise $x \le 2^{63}-1$ and therefore $2x \le 2^{64}-2$), and that the less significant bit of $y$ is also $1$ since $y$ is odd. And then, the equality $2x-2^{64} = y - 1$ tells us that the other 63 bits of each number are the same but shifted one place. Therefore, $y$ is the rotation of $x$. $\square$
* Max constraint degree: 3
* Owned columns: $q_{rot63}$
* Inherited columns: full_number_u64, 8 limbs
* Rows per operation: 1
**Optimization note:** you might have noticed that this operation only occupies one row instead of two (one for each value). This is because this gate expects the input operand to be in the previous row in the circuit, so there's no need to copy it and we can just reuse it.
### Rotate 16b, 24b and 32b chip (Limb Rotation)
* Description: these gates receive an AssignedRow as its input, not just the full value. This is because we need the limbs to make the rotations. The idea is that since each full number can be represented with 8 8bit limbs, **rotations that are multiples of 8 are just a rearrange of the limbs**. The operation consists in copying the already constrained limbs into the next row, with the corresponding rotation.
For example, if we want to make a 32-bit rotation, we would arrange the limbs from $$|x|l0|l1|l2|l3|l4|l5|l6|l7|$$ to $$|rot32(x)|l4|l5|l6|l7|l0|l1|l2|l3|$$
* Copy contraints: 8
* Max constraint degree: 2
* Owned columns: ---
* Inherited columns: full_number_u64, 8 limbs
* Rows per operation: 1
**Optimization note:** this operation also occupies only one row instead of two because it expects the input operand to be in the previous row in the circuit, and reuses it. Also, since the limbs are already constrained, there's no need to range-check them again, but we still have to constrain the rotated full number to have a correct decomposition in limbs.
### Addition mod64 chip
* Description: Receives two already constrained AssignedBlake2bWord $x,y$ and copies them in two rows on the full_number_u64 column.
Then, the two values are added and the result mod $2^{64}$ ($z$ from now on) is written in a third row, in the same column. The result is written along with its limb decomposition (for optimization reasons, will talk about it later). If the result of the sum is larger than $2^{64} - 1$ and there is an overflow, the *carry* column is set to $1$, if it is not, the column is set to $0$.
The base constraint is $$x+y = z + carry*2^{64}$$ and note that the carry is always either $0$ or $1$ since $x,y$ are 64-bit numbers.
To enforce that carry is 0 or 1, we add the following restriction to the gate: $$carry * (1 - carry) = 0$$
For optimization reasons, the carry is written in the place that would've taken the first limb of the second operand, which is free because we're not decomposing the operands.
* Max constraint degree: 2
* Owned columns: $q_{add}$
* Rows per operation: 3 (we can reduce it to 2 with an optimization explained below)
* Lookups: 8 (for the 8 limbs of the result)
* Inherited columns: full_number_u64, 8 limbs
**Optimization notes:** depending on the case, we will reuse the first operand. Sometimes we know that the first operand is the last written row in the circuit, but sometimes that doesn't happen. When it happens, we don't need to copy the first operand, and thus use only 2 rows.
### XOR gate
* Description: Receives two already constrained AssignedBlake2bWord $x,y$ and copies them in two rows, decomposing them into limbs and constraining and range-checking them. The limbs are a key part to compute the xor operation.
The gate uses a lookup table that precomputes the result of the xor between every pair of bytes. This results in 3 columns $2^8*2^8=2^{16}$ rows. These columns are for the two operands and the result.
* Max constraint degree: 3 (lookups)
* Rows per operation: 3 (we can reduce it to 2 with an optimization explained below)
* Lookups: 8, one per limb
* Owned columns: t_xor_left, t_xor_right, t_xor_out, q_xor
* Inherited columns: full_number_u64, 8 limbs
* Lookup tables: XOR precomputation --> 3 columns, $2^{16}$ rows.
**Optimization notes:** depending on the case, we will reuse the first operand. Sometimes we know that the first operand is the last written row in the circuit, but sometimes that doesn't happen. When it happens, we don't need to copy the first operand, and thus use only 2 rows.
## Optimization for Addition and XOR gates
Addition and Xor gates receive two values as parameters. Both gates require both inputs and the result to be consecutive rows in the trace due the locality restriction in the Halo2 framework. As a consequence, these gates should copy the input values as new rows in the trace and write the result afterwards.
Sometimes, mainly in the Mix function of Blake2b, we perform the addition or XOR operation with the last element of the trace as an operand (the other operand is somewhere else in the trace). This means that we could just copy this last input and reuse the first one, which is already in a near position that is useful for the gate. So we made this optimization, which means that XOR and ADD operations can cost us 3 rows when we copy both input parameters, or 2 in the case we can reuse one of them.
Additionaly, every time a xor operation appears on the trace is imediately after an addition operation, and remember that the Xor gate needs the limb decomposition of both operands. Taking those two facts in consideration, it would be ideal if the Addition gate already provided the result decomposed; that way we could leverage the optimization mentioned above. This is why the Addition gate decomposes its result, returning an AssignedRow instead of an AssignedBlake2bWord.
Something similar happens with the limb decomposition: every time a limb decomposition operation appears in the trace es imediately after a xor operation. In this case we don't need to do any additional effort, since the xor result is already decomposed into limbs, which is what the limb rotation needs.
## Some implementation details
Some details of Blake2bChip are:
* Owned columns: $q_{range}$, $q_{decompose}$, $t_{range}$. All of them are fixed.
* Inherited columns: full_number_u64, 8 limbs. All of them are advice columns.
The cost model is:
* The amount of advice rows is 2469 (for 1 block of input)
* The amount of advice columns is 9
* The amount of instance columns is 1 (the circuit that uses the Blake2bChip has one column for public inputs, used to constrain the result).
* The amount of fixed columns is 11 (4 lookup columns, 1 fixed column and 6 selector columns).
* The gate degree is 3.
* The max degree is 5.
* The compressed rows count is 65536 (2^16).
* The min k is $2^{17}$. This is due to the xor lookup table, which has a length of $2^{16}$.
## Benchmarks
Here we'll present the results of the benchmarks. These were performed in a machine with the following specifications:
**OS:** NixOS 24.05.5287.759537f06e69 (Uakari)
**Kernel:** x86_64 Linux 6.6.52
**CPU:** Intel Core i9-14900K @ 32x 5.7GHz [37.0°C]
**GPU:** NVIDIA GeForce RTX 4090
**RAM:** 47940MiB
We separated the results in 4 benches:
* Generation of the verification key
* Generation of the proving key
* Generation of the proof
* Verification
The amount of blocks tested were 1, 5, 10, 15, 20 and 30.
There were 30 inputs per bench/optimization/block_size.
To summarize then:
* 4 benches
* 3 optimizations
* 6 block sizes
* 30 inputs per execution
For a detailed report, check the git repository with the full results.
#### Verification key

#### Proving key

#### Proof

#### Verify

## Trace structure
The trace is composed of 9 advice columns. Those columns are:
* The full number operand column (full_number_u64)
* The 8 8-bit limbs of the full number
You can look at the trace [here](https://docs.google.com/spreadsheets/d/1KCaAlI-EpuaBJ35hdxv_4IqrWgbyFV4aG5KHb3SLPJo/edit?usp=sharing).
The trace starts by defining a set of constants. We need to define the following:
- Initialization Vector constants: we have 8 of them.
- Zero constant. A constant that contains the zero value. We need it to restrict padding of the input.
- The last one (called init_const_0) is a constant that is the result of performing a few initial operations on the constant values needed before compressing the blocks.
After defining these constants, we iterate over blocks (in the example there's just one). Each time we need to process a new block, we put it in the trace as 16 64-bit numbers, which are decomposed into limbs and range-checked. After that the compression round begins. Here we're working with a state composed by 16 64-bit words, and 4 positions of the state are going to be changed in the Mix round. There are 96 mixing rounds inside a Compress iteration (in the example there's just one mixing round).
## Soundness notes
It may result a bit confusing the fact that we're not range-checking the full_number_u64 all the time. However, we can guarantee that whenever a value appears in that column it is in fact a 64-bit number.
* **Base cases**
* **Constants:** when we write a constant in the trace, the range-check is not needed because that value is public both for prover and verifier.
* **Inputs**: the inputs of the circuit should be cells containing bytes. When we copy them into the Blake2b circuit we range-check them, and we turn on the decomposition gate to ensure that the corresponding full number is the result of composing those bytes, which are his limbs.
* **State components**: the initial global state of Blake2b is composed by constants, and in each compress iteration the local state is initialized with constants, so we can guarantee that those are in range.
* **Recursive case**
Now we have to ensure that all the values are in range, not only the initial ones. Well, the recursive case is given by each operation, namely the Addition, Xor, Negation and Rotations. We can assume that the inputs of the operations are in range, so if the operations generate results in range under that premise, we can propagate the "in-range" property through all the algorithm.
Each gate has its own reasons to preserve the property, given by its constraints (in the case of addition, rot63 and negation), its lookups (in the case of xor) or its copy constraints (in the limb rotations).
Also, is important to note that the inputs of the operations are allways results or copies of results of other operations (except when they're base cases).
With all of this in mind, we can ensure that all the values are range-checked.