# A new AIR for Blake hash
## 1. Introduction
The purpose of this post is to describe a new AIR design for the Blake hash, valid for the variants Blake2s and Blake3. More precisely, we will describe an AIR for the Blake permutation $\mathsf{blake}$, which is the building block of the hash.
We will first delve into the definition of $\mathsf{blake}$ in Section 2 and then in Section 3 we will describe the actual AIR. The implementation of the AIR (as well as end-to-end proving) can be found in the [Stwo repository](https://github.com/starkware-libs/stwo/tree/dev/crates/prover/src/examples/blake).
## 2. The Blake hash
<!-- \label{sec:hash} -->
The Blake hash (in both variants Blake{2s,3}) takes as inputs a string of $n$ bits and outputs a string of 256 bits. As is standard for hashing functions, the hash of an arbitrary length string is obtained by chaining an inner function that works on fixed-size inputs. This building block is a function $\mathsf{blake}: \{0,1\}^{512}\times \{0,1\}^{512}\to\{0,1\}^{512}$ that takes as input two strings of 512 bits (a *message* and a *state*) and outputs a string of 512 bits (a new state). This function will be the focus of this post.
In what follows, a *word* is a string of 32 bits. We are given as input a message $m=(m_1,\ldots,m_{16})$ and a state $v=(v_1,\ldots,v_{16})$, each consisting of 16 words $m_i, v_i$, $i=1,\ldots,16$. The function $\mathsf{blake}$ applies to the state vector an inner function $\mathsf{round}$, a certain number of times $n_{rounds}\in\{7,10\}$ (Blake2s requires 10 rounds, and Blake3 requires 7 rounds), and outputs a new state $v'=(v'_1,\ldots,v'_{16}$). When starting out the hashing, the state is initialized to a constant predefined vector as described in the specifications of the Blake3 hash [1] (see the Appendix for its definition).
We omit from the notation the dependence of $\mathsf{blake}$ on $n_{rounds}$.
### The function $\mathsf{round}$
The function $\mathsf{round}$ takes as input a state $v=(v_1,\ldots,v_{16})$ and a message $m=(m_1,\ldots,m_{16})$, both consisting of 16 words. The output of $\mathsf{round}$ is a new state $v'=(v'_1,\ldots,v'_{16})$.
The function $\mathsf{round}$ depends, in turn, on another function, $\mathsf{G}$, which sits at the lowest level of the hierarchy and performs arithmetic and bitwise operations.
To deal with $\mathsf{G}$ it will be clearer to treat it as an algorithm which, given some part of the state, it will modify it *in-place*, without outputting anything. Before defining $\mathsf{G}$ we introduce some notation for bitwise operations. For two bit-strings $a,b$, we write $a\oplus b$ for their XOR, and for a bit-string $a$ and an integer $n$, we denote by $a\;\texttt{>>>}\;n$ the bit-rotation of $a$ to the right with a shift of $n$. So, for example we have the equality $10010\;\texttt{>>>}\;2 = 10100$.
Now let's define $\mathsf{G}$. It takes as input 4 words $a,b,c,d$ and two auxiliary words $f_0, f_1$, and it modifies $a,b,c,d$ in-place, as follows (additions are all modulo $2^{32}$):
\begin{align*}
a &\leftarrow a + b + f_0 \\
d &\leftarrow (a\oplus d) \;\texttt{>>>}\; 16 \\
c &\leftarrow c + d \\
b &\leftarrow (b \oplus c) \;\texttt{>>>}\; 12 \\
a &\leftarrow a + b + f_1 \\
d &\leftarrow (a\oplus d) \;\texttt{>>>}\; 8 \\
c &\leftarrow c + d \\
b &\leftarrow (b\oplus c) \;\texttt{>>>}\; 7
\end{align*}
The left arrows above denote a modification of the variable in-place. To compute $\mathsf{round}$, we first arrange the input state $v$ in a $4\times4$ matrix
$$
\begin{pmatrix}
v_1 & v_2 & v_3 & v_4 \\
v_5 & v_6 & v_7 & v_8 \\
v_9 & v_{10} & v_{11} & v_{12} \\
v_{13} & v_{14} & v_{15} & v_{16}
\end{pmatrix}
$$
Then, we apply the function $\mathsf{G}$ to all columns and to all diagonals of the above matrix, using some of the words in the message $m$ as auxiliary words $f_0, f_1$.
<!-- The choice of these message words depends on the round number $r$ and on a constant table of permutations $(\sigma_1,\ldots,\sigma_{10})$ (see Appendix \ref{app:constants} for its definition), where each $\sigma_i$ is a permutation of $\{1,\ldots, 16\}$. For a round number $r$, we use the permutation $\sigma_r$ to determine the auxiliary words in the applications of $\mathsf{G}$. -->
More precisely, for state $v=(v_1,\ldots,v_{16})$ and message $m=(m_1,\ldots,m_{16})$, the function $\mathsf{round}$ consists in applying $\mathsf{G}$ sequentially to the columns and diagonals of the above matrix as follows
\begin{align*}
\mathsf{G}(v_1,v_5,v_9,v_{13},m_1, m_2) \\
\mathsf{G}(v_2,v_6,v_{10},v_{14}, m_3, m_4) \\
\mathsf{G}(v_3,v_7,v_{11},v_{13}, m_5, m_6) \\
\mathsf{G}(v_4,v_8,v_{12},v_{16}, m_7, m_8) \\
\mathsf{G}(v_1,v_6,v_{11},v_{16}, m_9, m_{10}) \\
\mathsf{G}(v_2,v_7,v_{12},v_{13}, m_{11}, m_{12}) \\
\mathsf{G}(v_3,v_8,v_9,v_{14}, m_{13}, m_{14}) \\
\mathsf{G}(v_4,v_5,v_{10},v_{15}, m_{15}, m_{16})
\end{align*}
The output is the state vector $v'$ that consists of the words $v_1,\ldots,v_{16}$ after all the above applications of $\mathsf{G}$.
### From $\mathsf{round}$ to $\mathsf{blake}$
Now that we have the function $\mathsf{round}$, we can describe how to compute the function $\mathsf{blake}$ on an input message $m=(m_1,\ldots,m_{16})$ and an input state $v^0 = (v^0_1,\ldots,v^0_{16})$.
The output state is obtained by applying $\mathsf{round}$ successively to each intermediate state starting from $v^0$ and with a certain message input. To determine the latter, we need a constant table of permutations $(\sigma_1,\ldots,\sigma_{10})$ (see the appendix below for their definition), where each $\sigma_i$ is a permutation of $\{1,\ldots, 16\}$. For a round number $r$, the message input to $\mathsf{round}$ is the permutation of $m$ by $\sigma_r$, i.e. $m_{\sigma_r} := (m_{\sigma_r(1)},\ldots,m_{\sigma_r(16)})$.
More precisely, for each round $r=1,\ldots,n_{rounds}$, define successively $v^r=\mathsf{round}(v^{r-1}, m_{\sigma_r})$. The final hash, i.e. $\mathsf{blake}(m)$, is $v^{n_{rounds}}$. Note that for Blake3 $(n_{rounds}=7)$, we only need the first 7 permutations.
## 3. The AIR
The goal of this section is to define an AIR for the function $\mathsf{blake}$. The AIR will be defined over the Mersenne field $\mathbb{F}_p$ where $p=2^{31}-1$. This prime is the one used by Circle STARK [2] and the Stwo prover.
For those unfamiliar with the notion of AIR, it essentially means constructing a "table" (in our case, a union of a number of columns of different lengths), where each entry is an element of the Mersenne field , and such that:
1. The entries of the table satisfy certain polynomial constraints,
2. The input and the output of the function $\mathsf{blake}$ appear at some predefined locations in the table.
For a precise definition of AIR see e.g. [3, Section 5.1].
### Structure of the AIR
Each word (32 bits) will be stored in two trace cells, each holding a 16 bit value, interpreted as an element of $\mathbb{F}_p$.
The AIR is divided into smaller AIRs and the linking between them is performed through lookup arguments. Except for lookups, there are no constraints which mix between different rows. The main components of the AIR are:
1. A component (the "round" component) that computes the $\mathsf{round}$ function on the intermediate states and permuted messages.
2. A component (the "scheduler") that links the intermediate states/permuted messages to the inputs received by the round component.
3. Five constant XOR tables of different dimensions.
The AIR itself is agnostic to the implementation of the lookup argument (in the code we implement *logUp* [4]).
We keep the same notation as in Section 2, and denote by $m = (m_1,\ldots,m_{16})$ the input message and by $v^r$, $r=0,\ldots,n_{rounds}$, the intermediate states.
### The XOR tables
There are five constant XOR tables that compute the XOR for all possible input pairs of bit length $12,9,8,7,4$, respectively.
### The scheduler component
The scheduler consists of a single row containing the intermediate states $v^0,\ldots,v^{n_{rounds}}$ and the message $m$.
For each round $r$, the component "sends" (by means of a lookup argument) the tuple $(v^r,v^{r+1},m_{\sigma_r})$ to the round component described below.
The scheduler guarantees that the inputs used by the round component are the correct ones and that the states that it holds are correctly computed. Note that the round component receives the message already permuted according to the correct permutation.
### The round component
This component verifies a single round. For verifying a total of $N$ hashes, this component should have at least $N\cdot n_{rounds}$ rows. Let's describe the constraints for a single row.
We first allocate $16\cdot2\cdot2=64$ cells to the input state and the permuted message. Recall that the function $\mathsf{round}$ applies $\mathsf{G}$ eight times, on certain subsets of the input state. A single application of $\mathsf{G}$ consists in performing an addition (with three summands at most) and a XOR rotation, sequentially, 4 times. (By "XOR rotation" we just mean a XOR followed by a rotation.) Let's describe the constraints for a single addition-XOR rotation.
For a word $a$, we will write $a=(a_h,a_l)$ to denote its high 16 bits and low 16 bits, respectively, interpreted as field elements.
#### Addition
Let $a=(a_h,a_l),b=(b_h, b_l),c=(c_h,c_l)$ be 3 words. We wish to compute and prove the sum $a+b+c \;\;(\text{mod}\;2^{32})$. Allocate two additional cells for the claimed sum $s=(s_h, s_l)$: to prove that $(s_h, s_l)$ is indeed the sum $a+b+c \;\;(\text{mod}\;2^{32})$, it's enough to prove
1. $(a_l + b_l + c_l - s_l) \,/\, 2^{16}\in\{0,1,2\}$,
2. $(a_h + b_h + c_h - s_h) \,/\, 2^{16}\in\{0,1,2\}$,
3. $0\le s_h,s_l\le 2^{16}-1$.
We impose the first two constraints on the trace cells, i.e. the degree 3 constraint given by the vanishing of $x(x-1)(x-2)$. We delay the range-check of point 3. to the next stage. We will obtain it as a by-product of the constraints for the XOR rotation.
#### XOR rotation
Let $a=(a_h,a_l),b=(b_h, b_l)$ be 2 words and $rot\in\{7,8,12\}$ a rotation shift (recall that in $\mathsf{G}$ we need to rotate with shifts $\{7,8,12,16\}$. We omit the case of $rot=16$ which is similar to the case of $rot=8$ except for the order in which the cells enter the constraints).
We wish to compute and prove the XOR rotation $(a\oplus b)\;\texttt{>>>}\; rot$. First split each of $a_h,a_l,b_h,b_l$ into their low $rot$ bits and high $16-rot$ bits, which we'll denote by $a_{hh},a_{hl},\ldots, b_{lh},b_{ll}$. In terms of trace cells, it's enough to allocate 4 cells to $a_{hh}, a_{lh}, b_{hh}, b_{lh}$ (the other four "double-indexed" variables can be derived algebraically from $a_{hh}, a_{lh}, b_{hh}, b_{lh}$ and $a_h,a_l,b_h,b_l$).
For each pair $(a_{ll},b_{ll}), (a_{lh},b_{lh}), (a_{hl}, b_{hl}), (a_{hh},b_{hh})$, allocate a new cell for the result of their XOR, which we denote by $c_{ll},c_{lh},c_{hl},c_{hh}$. To prove that the XOR is computed correctly, we do a lookup to the XOR table of the correct size (either $rot$ or $16-rot$).
This lookup (say, for $c_{ll}$) ensures two things:
1. That $c_{ll}$ is indeed the XOR of $a_{ll}$ and $b_{ll}$,
2. That $0\le a_{ll},b_{ll}\le 2^{rot}-1$.
Note that the lookup for $c_{lh}$ gives $0\le a_{lh}\le 2^{16-rot}-1$. With the previous bounds, this implies $0\le a_l\le 2^{16}-1$ (by the same argument, this bound holds for $a_h,b_l,b_h$), i.e. we get for free a range-check of the single-indexed variables. By giving the output of the addition as input to the XOR rotation, we end up with the range-check that we needed in order to prove the correctness of the addition.
### Costs
1. **XOR tables**. The largest table is the XOR table for width $12$, which is a column of $(2^{12})^2 = 2^{24}$ trace cells. Since we use *logUp* as lookup argument, we also store a multiplicity column, where each cell contains the number of invocations to that row. In total: $2^{24}$ trace cells and $2^{24}$ lookups.
2. **Scheduler component**: we need to allocate cells for $n_{rounds}+1$ states and one message, and each word takes up 2 cells: $(n_{rounds}+2)\cdot 16\cdot 2$. There are $n_{rounds}$ of lookups.
3. **Round component**: for one round, we have:
- Initial state and message: $64$ cells + $1$ lookup to the scheduler AIR.
- Addition: a single addition costs $2$ cells, so in a round, addition costs $2\cdot 4\cdot 8 = 64$ cells.
- XOR rotation: a single XOR rotation costs $8$ cells + $4$ lookups, so in a round, XOR rotation costs $8\cdot 4\cdot 8=256$ cells and $4\cdot 4\cdot 8 = 128$ lookups.
Cells: $64 + 64 + 256 =384$.
Lookups: $1 + 128 = 129$.
In total: $384\cdot n_{rounds}$ cells and $129\cdot n_{rounds}$ lookups.
The grand total across the non-constant components is:
- cells: $(n_{rounds}+2)\cdot 32 + 384\cdot n_{rounds} = 64 + 416\cdot n_{rounds}$.
- lookups: $n_{rounds} + 129\cdot n_{rounds}=130\cdot n_{rounds}$.
In Blake3 $(n_{rounds}=7)$:
- cells: $64 + 416\cdot 7 = 2976$
- lookups: $130\cdot 7 = 910$
In Blake2s $(n_{rounds}=10)$:
- cells: $64 + 416\cdot 10 = 4224$
- lookups: $130\cdot 10 = 1300$
## Appendix: Blake constants
### Initial state
The first 8 words of the initial state $v^0$ are:
\begin{align*}
\texttt{0x6a09e667} \\
\texttt{0xbb67ae85} \\
\texttt{0x3c6ef372} \\
\texttt{0xa54ff53a} \\
\texttt{0x510e527f} \\
\texttt{0x9b05688c} \\
\texttt{0x1f83d9ab} \\
\texttt{0x5be0cd19}
\end{align*}
The last 8 words are $0$.
### Permutation table
\begin{align*}
\sigma_1 = (1\;\;2\;\;3\;\;4\;\;5\;\;6\;\;7\;\;8\;\;9\;\;10\;\;11\;\;12\;\;13\;\;14\;\;15\;\;16) \\
\sigma_2 = (15\;\;11\;\;5\;\;9\;\;10\;\;16\;\;14\;\;7\;\;2\;\;13\;\;1\;\;3\;\;12\;\;8\;\;6\;\;4) \\
\sigma_3 = (12\;\;9\;\;13\;\;1\;\;6\;\;3\;\;16\;\;14\;\;11\;\;15\;\;4\;\;7\;\;8\;\;2\;\;10\;\;5) \\
\sigma_4 = (8\;\;10\;\;4\;\;2\;\;14\;\;13\;\;12\;\;15\;\;3\;\;7\;\;6\;\;11\;\;5\;\;1\;\;16\;\;9) \\
\sigma_5 = (10\;\;1\;\;6\;\;8\;\;3\;\;5\;\;11\;\;16\;\;15\;\;2\;\;12\;\;13\;\;7\;\;9\;\;4\;\;14) \\
\sigma_6 = (3\;\;13\;\;7\;\;11\;\;1\;\;12\;\;9\;\;4\;\;5\;\;14\;\;8\;\;6\;\;16\;\;15\;\;2\;\;10) \\
\sigma_7 = (13\;\;6\;\;2\;\;16\;\;15\;\;14\;\;5\;\;11\;\;1\;\;8\;\;7\;\;4\;\;10\;\;3\;\;9\;\;12) \\
\sigma_8 = (14\;\;12\;\;8\;\;15\;\;13\;\;2\;\;4\;\;10\;\;6\;\;1\;\;16\;\;5\;\;9\;\;7\;\;3\;\;11) \\
\sigma_9 = (7\;\;16\;\;15\;\;10\;\;12\;\;4\;\;1\;\;9\;\;13\;\;3\;\;14\;\;8\;\;2\;\;5\;\;11\;\;6) \\
\sigma_{10} = (11\;\;3\;\;9\;\;5\;\;8\;\;7\;\;2\;\;6\;\;16\;\;12\;\;10\;\;15\;\;4\;\;13\;\;14\;\;1) \\
\end{align*}
## References
[1] J. O'Connor, J.-P. Aumasson, S. Neves, Z. Wilcox-O'Hearn. Blake3 - one function, fast everywhere. https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf. 2021
[2] U. Haböck, D. Levit, S. Papini. Circle STARKs. https://eprint.iacr.org/2024/278. 2024
[3] StarkWare. ethSTARK. https://eprint.iacr.org/2021/582.pdf. 2021
[4] U. Haböck. Multivariate lookups based on logarithmic derivatives. https://eprint.iacr.org/2022/1530. 2022