Plookup is a modification of PLONK to allow efficient table lookups in a circuit. We consider the cost of implementing the AES block cipher using Plookup.
Definition: The \(b\)-sparse representation of an \(n\)-bit integer \(x = \sum\limits^{n-1}_{i=0} x_i 2^i\) is \(\mathsf{sparse}_b(x) = \sum\limits^{n-1}_{i=0} x_i b^i\).
Definition: A \(k\)-ary operation on integers \(\odot\) is "sparse-friendly" if there exists \(f_\odot\) such that \(\bigodot\limits^{k-1}_{j=0} a_j = f_\odot\left(\sum\limits^{k}_{j=0} \mathsf{sparse}_{k+1}(a_j)\right)\). In other words, the result of \(\odot\) depends only on the bitwise sum of the inputs.
Fact: bitwise \(k\)-ary XOR is sparse-friendly. (As are [N]AND, [N]OR, and NXOR, but we won't need those.)
This technique was invented by Vitalik Buterin. We'd considered \(4\)-sparse representation at the Boulder Zcash summit in November 2014, but discarded it then because there was no efficient way to do lookups in R1CS-based proof systems.
The AES state is a \(4\) by \(4\) matrix with elements in \(GF(2^8)\). A \(GF(2^8)\) element is represented in polynomial basis \(GF(2)[X]/(X^8 + X^4 + X^3 + X + 1)\), where the polynomial \(p(X) = \sum\limits^{7}_{i=0} c_i X^i\) is typically identified with its integer evaluation \(p(2) = \sum\limits^{7}_{i=0} c_i 2^i\).
Following the AES specification, we use the notation \(S_{i,j}\) for the state element at (zero-based) row \(i\) and column \(j\).
The AES round function is defined as \(\mathsf{AddRoundKey}_{RK} \circ \mathsf{MixColumns} \circ \mathsf{ShiftRows} \circ \mathsf{SubBytes}\):
\(\mathsf{SubBytes}\) — a non-linear substitution step where each byte is replaced with another according to a lookup function \(\mathsf{SBox}\).
\(\mathsf{ShiftRows}\) — a transposition step where the last three rows of the state are shifted cyclically a certain number of steps.
\(\mathsf{MixColumns}\) — a linear mixing operation which operates on the columns of the state, combining the four bytes in each column.
\(\mathsf{AddRoundKey}_{RK}\) — each byte of the state is combined with a byte of the round key \(RK\) using bitwise XOR (equivalently, polynomial addition).
The \(\mathsf{MixColumns}\) operation transforms each column \(0 \leq j \leq 3\) as follows:
\(\begin{bmatrix} S'_{0,j} \\ S'_{1,j} \\ S'_{2,j} \\ S'_{3,j} \end{bmatrix} = \begin{bmatrix} 2 & 3 & 1 & 1 \\[0.7ex] 1 & 2 & 3 & 1 \\[0.7ex] 1 & 1 & 2 & 3 \\[0.7ex] 3 & 1 & 1 & 2 \end{bmatrix} \begin{bmatrix} S_{0,j} \\[0.5ex] S_{1,j} \\[0.5ex] S_{2,j} \\[0.5ex] S_{3,j} \end{bmatrix}\)
(as polynomials, \(2\) means \(X\) and \(3\) means \(X + 1\)).
Since \(\mathsf{ShiftRows}\) commutes with \(\mathsf{SubBytes}\), we can write the round function as \(\mathsf{AddRoundKey}_{RK} \circ \mathsf{MixColumns} \circ \mathsf{SubBytes} \circ \mathsf{ShiftRows}\). When each byte is represented separately (no matter what representation is used), \(\mathsf{ShiftRows}\) is free, and so we are left with implementing \(\mathsf{AddRoundKey}_{RK} \circ \mathsf{MixColumns} \circ \mathsf{SubBytes}\).
Let \(S^{in}_{i,j}\) be the input element at \((i, j)\), after applying \(\mathsf{ShiftRows}\) to the state from the previous round.
Suppose we use a \(4\)-sparse representation for each byte of the state. We will use \(\underline{underlining}\) for values in \(4\)-sparse representation and functions that return such values.
Let \(\underline{f_{\oplus}}\) be the function that applies \((0, 1, 2, 3) \mapsto (0, 1, 0, 1)\) to each base-\(4\) digit, and let \(f_{\oplus}\) be the function that applies \(\underline{f_{\oplus}}\) and then converts back to binary form (i.e. squashing out the zero bits).
We can implement \(3\)-ary bitwise XOR as \(x \oplus y \oplus z = f_{\oplus}(\underline{x} + \underline{y} + \underline{z})\), or \(\underline{x \oplus y \oplus z} = \underline{f_{\oplus}}(\underline{x} + \underline{y} + \underline{z})\) to leave the result in \(4\)-sparse form.
Let \(\underline{F_z}(\underline{x}) = \mathsf{sparse}_4\big(z\;\mathsf{SBox}(f_{\oplus}(\underline{x}))\big)\) for \(1 \leq z \leq 3\).
In each round we can write \(\mathsf{AddRoundKey}_{RK} \circ \mathsf{MixColumns} \circ \mathsf{SubBytes}\) as
\(S^{out}_{0,j}\) \(= 2\;\mathsf{SBox}(S^{in}_{0,j}) \oplus 3\;\mathsf{SBox}(S^{in}_{1,j}) \oplus 1\;\mathsf{SBox}(S^{in}_{2,j}) \oplus 1\;\mathsf{SBox}(S^{in}_{3,j}) \oplus RK_{0,j}\)
\(S^{out}_{1,j}\) \(= 1\;\mathsf{SBox}(S^{in}_{0,j}) \oplus 2\;\mathsf{SBox}(S^{in}_{1,j}) \oplus 3\;\mathsf{SBox}(S^{in}_{2,j}) \oplus 1\;\mathsf{SBox}(S^{in}_{3,j}) \oplus RK_{1,j}\)
\(S^{out}_{2,j}\) \(= 1\;\mathsf{SBox}(S^{in}_{0,j}) \oplus 1\;\mathsf{SBox}(S^{in}_{1,j}) \oplus 2\;\mathsf{SBox}(S^{in}_{2,j}) \oplus 3\;\mathsf{SBox}(S^{in}_{3,j}) \oplus RK_{2,j}\)
\(S^{out}_{3,j}\) \(= 3\;\mathsf{SBox}(S^{in}_{0,j}) \oplus 1\;\mathsf{SBox}(S^{in}_{1,j}) \oplus 1\;\mathsf{SBox}(S^{in}_{2,j}) \oplus 2\;\mathsf{SBox}(S^{in}_{3,j}) \oplus RK_{3,j}\)
which in \(4\)-sparse form can be implemented as
\(\underline{S^{out}_{0,j}}\) \(= \underline{f_{\oplus}}\Big(\underline{F_2}\big(\underline{S^{in}_{0,j}}\big) + \underline{F_3}\big(\underline{S^{in}_{1,j}}\big) + \underline{F_1}\big(\underline{S^{in}_{2,j}}\big)\Big) + \underline{F_1}\big(\underline{S^{in}_{3,j}}\big) + \underline{RK_{0,j}}\)
\(\underline{S^{out}_{1,j}}\) \(= \underline{f_{\oplus}}\Big(\underline{F_1}\big(\underline{S^{in}_{0,j}}\big) + \underline{F_2}\big(\underline{S^{in}_{1,j}}\big) + \underline{F_3}\big(\underline{S^{in}_{2,j}}\big)\Big) + \underline{F_1}\big(\underline{S^{in}_{3,j}}\big) + \underline{RK_{1,j}}\)
\(\underline{S^{out}_{2,j}}\) \(= \underline{f_{\oplus}}\Big(\underline{F_1}\big(\underline{S^{in}_{0,j}}\big) + \underline{F_1}\big(\underline{S^{in}_{1,j}}\big) + \underline{F_2}\big(\underline{S^{in}_{2,j}}\big)\Big) + \underline{F_3}\big(\underline{S^{in}_{3,j}}\big) + \underline{RK_{2,j}}\)
\(\underline{S^{out}_{3,j}}\) \(= \underline{f_{\oplus}}\Big(\underline{F_3}\big(\underline{S^{in}_{0,j}}\big) + \underline{F_1}\big(\underline{S^{in}_{1,j}}\big) + \underline{F_1}\big(\underline{S^{in}_{2,j}}\big)\Big) + \underline{F_2}\big(\underline{S^{in}_{3,j}}\big) + \underline{RK_{3,j}}\)
Note that there are two uses of each \(\underline{F_1}(\underline{S^{in}_{i,j}})\) term, so there are \(12\) unique \(F_z\) lookups here, not \(16\). So, in total we have \(64\) lookups per round (\(4\) \(\underline{f_{\oplus}}\) and \(12\) \(\underline{F_z}\) per column).
The \(S^{out}_{i,j}\) are then permuted by \(\mathsf{ShiftRows}\) to become the \(S^{in}_{i,j}\) of the following round.
Before the first round, we need \(16\) lookups to convert each byte to \(4\)-sparse form. At that point we add the initial round key before computing the first round function as above. After the last round, we need \(16\) more \(f_{\oplus}\) lookups to convert back to binary form.
So,
We need five tables \((f_{\oplus}, \underline{f_{\oplus}}, \underline{F_1}, \underline{F_2}, \underline{F_3})\), each with \(4^8 = 2^{16}\) entries, and one table \(\mathsf{sparse}_4\) with \(2^8\) entries. This does not include computing round keys or the cost of additions.
Define \(G(\underline{x}) = \mathsf{SBox}(f_{\oplus}(\underline{x}))\) and \(\underline{F'_z}(x) = \mathsf{sparse}_4(z x)\) for \(1 \leq z \leq 3\).
Then we have \(\underline{F_z}(\underline{x}) = \underline{F'_z}(G(\underline{x}))\) for \(1 \leq z \leq 3\), which we substitute in the round function above. This doubles the number of lookups needed for \(\underline{F_z}\), so we have \(112\) lookups per round or \(1152\) in total for AES-128. But we only need three \(2^{16}\)-sized tables (\(f_{\oplus}\), \(\underline{f_{\oplus}}\), and \(G\)) rather than five.
We could use \(3\)-sparse representations throughout. This means that we can only use binary XOR rather than \(3\)-ary XOR before normalizing.
For each byte we have, for example
\(\underline{S^{out}_{0,j}}\) \(= \underline{f_{\oplus}}\Bigg(\underline{f_{\oplus}}\bigg(\underline{f_{\oplus}}\Big(\underline{F_2}\big(\underline{S^{in}_{0,j}}\big) + \underline{F_3}\big(\underline{S^{in}_{1,j}}\big)\Big) + \underline{F_1}\big(\underline{S^{in}_{2,j}}\big)\bigg) + \underline{F_1}\big(\underline{S^{in}_{3,j}}\big)\Bigg) + \underline{RK_{0,j}}\)
So, we need \(12\) \(\underline{f_{\oplus}}\) lookups and \(12\) \(\underline{F_z}\) lookups per column, i.e. \(96\) lookups per round. We still need \(16\) lookups to convert to \(3\)-sparse form before the first round, and \(16\) lookups to convert back to binary form after the last round.
So in this variant,
This has smaller overall table size (five tables of size \(3^8 = 6561\) and one of size \(2^8\)), and better performance than the "fewer large tables" variant above, so the "fewer large tables" variant can be discarded.
Rijndael also supports \(192\)-bit and \(256\)-bit blocks. This is achieved by increasing the number of columns to \(6\) or \(8\) in a straightforward way. The number of lookups scales linearly with the block size.
The resulting lookup counts for \(4\)-sparse representation are:
\begin{array}{|c|ccc|} \hline & & \text{Block size} & \\ \text{Rounds} & 128 & 192 & 256 \\ \hline 10 & 672 & 1008 & 1344 \\ 12 & 800 & 1200 & 1600 \\ 14 & 928 & 1392 & 1856 \\ \hline \end{array}
and for \(3\)-sparse representation:
\begin{array}{|c|ccc|} \hline & & \text{Block size} & \\ \text{Rounds} & 128 & 192 & 256 \\ \hline 10 & 992 & 1488 & 1984 \\ 12 & 1184 & 1776 & 2368 \\ 14 & 1376 & 2064 & 2752 \\ \hline \end{array}