Implementing AES with Lookups

Plookup is a modification of PLONK to allow efficient table lookups in a circuit. We consider the cost of implementing the AES block cipher using Plookup.

Sparse representations

Definition: The \(b\)-sparse representation of an \(n\)-bit integer \(x = \sum\limits^{n-1}_{i=0} x_i 2^i\) is \(\mathsf{sparse}_b(x) = \sum\limits^{n-1}_{i=0} x_i b^i\).

Definition: A \(k\)-ary operation on integers \(\odot\) is "sparse-friendly" if there exists \(f_\odot\) such that \(\bigodot\limits^{k-1}_{j=0} a_j = f_\odot\left(\sum\limits^{k}_{j=0} \mathsf{sparse}_{k+1}(a_j)\right)\). In other words, the result of \(\odot\) depends only on the bitwise sum of the inputs.

Fact: bitwise \(k\)-ary XOR is sparse-friendly. (As are [N]AND, [N]OR, and NXOR, but we won't need those.)

This technique was invented by Vitalik Buterin. We'd considered \(4\)-sparse representation at the Boulder Zcash summit in November 2014, but discarded it then because there was no efficient way to do lookups in R1CS-based proof systems.

AES round function

The AES state is a \(4\) by \(4\) matrix with elements in \(GF(2^8)\). A \(GF(2^8)\) element is represented in polynomial basis \(GF(2)[X]/(X^8 + X^4 + X^3 + X + 1)\), where the polynomial \(p(X) = \sum\limits^{7}_{i=0} c_i X^i\) is typically identified with its integer evaluation \(p(2) = \sum\limits^{7}_{i=0} c_i 2^i\).

Following the AES specification, we use the notation \(S_{i,j}\) for the state element at (zero-based) row \(i\) and column \(j\).

The AES round function is defined as \(\mathsf{AddRoundKey}_{RK} \circ \mathsf{MixColumns} \circ \mathsf{ShiftRows} \circ \mathsf{SubBytes}\):

\(\mathsf{SubBytes}\) — a non-linear substitution step where each byte is replaced with another according to a lookup function \(\mathsf{SBox}\).
\(\mathsf{ShiftRows}\) — a transposition step where the last three rows of the state are shifted cyclically a certain number of steps.
\(\mathsf{MixColumns}\) — a linear mixing operation which operates on the columns of the state, combining the four bytes in each column.
\(\mathsf{AddRoundKey}_{RK}\) — each byte of the state is combined with a byte of the round key \(RK\) using bitwise XOR (equivalently, polynomial addition).

The \(\mathsf{MixColumns}\) operation transforms each column \(0 \leq j \leq 3\) as follows:

\(\begin{bmatrix} S'_{0,j} \\ S'_{1,j} \\ S'_{2,j} \\ S'_{3,j} \end{bmatrix} = \begin{bmatrix} 2 & 3 & 1 & 1 \\[0.7ex] 1 & 2 & 3 & 1 \\[0.7ex] 1 & 1 & 2 & 3 \\[0.7ex] 3 & 1 & 1 & 2 \end{bmatrix} \begin{bmatrix} S_{0,j} \\[0.5ex] S_{1,j} \\[0.5ex] S_{2,j} \\[0.5ex] S_{3,j} \end{bmatrix}\)

(as polynomials, \(2\) means \(X\) and \(3\) means \(X + 1\)).

A possible implementation

Since \(\mathsf{ShiftRows}\) commutes with \(\mathsf{SubBytes}\), we can write the round function as \(\mathsf{AddRoundKey}_{RK} \circ \mathsf{MixColumns} \circ \mathsf{SubBytes} \circ \mathsf{ShiftRows}\). When each byte is represented separately (no matter what representation is used), \(\mathsf{ShiftRows}\) is free, and so we are left with implementing \(\mathsf{AddRoundKey}_{RK} \circ \mathsf{MixColumns} \circ \mathsf{SubBytes}\).

Let \(S^{in}_{i,j}\) be the input element at \((i, j)\), after applying \(\mathsf{ShiftRows}\) to the state from the previous round.

Suppose we use a \(4\)-sparse representation for each byte of the state. We will use \(\underline{underlining}\) for values in \(4\)-sparse representation and functions that return such values.

Let \(\underline{f_{\oplus}}\) be the function that applies \((0, 1, 2, 3) \mapsto (0, 1, 0, 1)\) to each base-\(4\) digit, and let \(f_{\oplus}\) be the function that applies \(\underline{f_{\oplus}}\) and then converts back to binary form (i.e. squashing out the zero bits).

We can implement \(3\)-ary bitwise XOR as \(x \oplus y \oplus z = f_{\oplus}(\underline{x} + \underline{y} + \underline{z})\), or \(\underline{x \oplus y \oplus z} = \underline{f_{\oplus}}(\underline{x} + \underline{y} + \underline{z})\) to leave the result in \(4\)-sparse form.

Let \(\underline{F_z}(\underline{x}) = \mathsf{sparse}_4\big(z\;\mathsf{SBox}(f_{\oplus}(\underline{x}))\big)\) for \(1 \leq z \leq 3\).

In each round we can write \(\mathsf{AddRoundKey}_{RK} \circ \mathsf{MixColumns} \circ \mathsf{SubBytes}\) as

\(S^{out}_{0,j}\) \(= 2\;\mathsf{SBox}(S^{in}_{0,j}) \oplus 3\;\mathsf{SBox}(S^{in}_{1,j}) \oplus 1\;\mathsf{SBox}(S^{in}_{2,j}) \oplus 1\;\mathsf{SBox}(S^{in}_{3,j}) \oplus RK_{0,j}\)
\(S^{out}_{1,j}\) \(= 1\;\mathsf{SBox}(S^{in}_{0,j}) \oplus 2\;\mathsf{SBox}(S^{in}_{1,j}) \oplus 3\;\mathsf{SBox}(S^{in}_{2,j}) \oplus 1\;\mathsf{SBox}(S^{in}_{3,j}) \oplus RK_{1,j}\)
\(S^{out}_{2,j}\) \(= 1\;\mathsf{SBox}(S^{in}_{0,j}) \oplus 1\;\mathsf{SBox}(S^{in}_{1,j}) \oplus 2\;\mathsf{SBox}(S^{in}_{2,j}) \oplus 3\;\mathsf{SBox}(S^{in}_{3,j}) \oplus RK_{2,j}\)
\(S^{out}_{3,j}\) \(= 3\;\mathsf{SBox}(S^{in}_{0,j}) \oplus 1\;\mathsf{SBox}(S^{in}_{1,j}) \oplus 1\;\mathsf{SBox}(S^{in}_{2,j}) \oplus 2\;\mathsf{SBox}(S^{in}_{3,j}) \oplus RK_{3,j}\)

which in \(4\)-sparse form can be implemented as

\(\underline{S^{out}_{0,j}}\) \(= \underline{f_{\oplus}}\Big(\underline{F_2}\big(\underline{S^{in}_{0,j}}\big) + \underline{F_3}\big(\underline{S^{in}_{1,j}}\big) + \underline{F_1}\big(\underline{S^{in}_{2,j}}\big)\Big) + \underline{F_1}\big(\underline{S^{in}_{3,j}}\big) + \underline{RK_{0,j}}\)
\(\underline{S^{out}_{1,j}}\) \(= \underline{f_{\oplus}}\Big(\underline{F_1}\big(\underline{S^{in}_{0,j}}\big) + \underline{F_2}\big(\underline{S^{in}_{1,j}}\big) + \underline{F_3}\big(\underline{S^{in}_{2,j}}\big)\Big) + \underline{F_1}\big(\underline{S^{in}_{3,j}}\big) + \underline{RK_{1,j}}\)
\(\underline{S^{out}_{2,j}}\) \(= \underline{f_{\oplus}}\Big(\underline{F_1}\big(\underline{S^{in}_{0,j}}\big) + \underline{F_1}\big(\underline{S^{in}_{1,j}}\big) + \underline{F_2}\big(\underline{S^{in}_{2,j}}\big)\Big) + \underline{F_3}\big(\underline{S^{in}_{3,j}}\big) + \underline{RK_{2,j}}\)
\(\underline{S^{out}_{3,j}}\) \(= \underline{f_{\oplus}}\Big(\underline{F_3}\big(\underline{S^{in}_{0,j}}\big) + \underline{F_1}\big(\underline{S^{in}_{1,j}}\big) + \underline{F_1}\big(\underline{S^{in}_{2,j}}\big)\Big) + \underline{F_2}\big(\underline{S^{in}_{3,j}}\big) + \underline{RK_{3,j}}\)

Note that there are two uses of each \(\underline{F_1}(\underline{S^{in}_{i,j}})\) term, so there are \(12\) unique \(F_z\) lookups here, not \(16\). So, in total we have \(64\) lookups per round (\(4\) \(\underline{f_{\oplus}}\) and \(12\) \(\underline{F_z}\) per column).

The \(S^{out}_{i,j}\) are then permuted by \(\mathsf{ShiftRows}\) to become the \(S^{in}_{i,j}\) of the following round.

Before the first round, we need \(16\) lookups to convert each byte to \(4\)-sparse form. At that point we add the initial round key before computing the first round function as above. After the last round, we need \(16\) more \(f_{\oplus}\) lookups to convert back to binary form.

So,

  • AES-128 (with \(10\) rounds) requires \(672\) lookups.
  • AES-192 (with \(12\) rounds) requires \(800\) lookups.
  • AES-256 (with \(14\) rounds) requires \(928\) lookups.

We need five tables \((f_{\oplus}, \underline{f_{\oplus}}, \underline{F_1}, \underline{F_2}, \underline{F_3})\), each with \(4^8 = 2^{16}\) entries, and one table \(\mathsf{sparse}_4\) with \(2^8\) entries. This does not include computing round keys or the cost of additions.

With fewer large tables

Define \(G(\underline{x}) = \mathsf{SBox}(f_{\oplus}(\underline{x}))\) and \(\underline{F'_z}(x) = \mathsf{sparse}_4(z x)\) for \(1 \leq z \leq 3\).

Then we have \(\underline{F_z}(\underline{x}) = \underline{F'_z}(G(\underline{x}))\) for \(1 \leq z \leq 3\), which we substitute in the round function above. This doubles the number of lookups needed for \(\underline{F_z}\), so we have \(112\) lookups per round or \(1152\) in total for AES-128. But we only need three \(2^{16}\)-sized tables (\(f_{\oplus}\), \(\underline{f_{\oplus}}\), and \(G\)) rather than five.

With smaller (\(3^8\)-sized) tables

We could use \(3\)-sparse representations throughout. This means that we can only use binary XOR rather than \(3\)-ary XOR before normalizing.

For each byte we have, for example

\(\underline{S^{out}_{0,j}}\) \(= \underline{f_{\oplus}}\Bigg(\underline{f_{\oplus}}\bigg(\underline{f_{\oplus}}\Big(\underline{F_2}\big(\underline{S^{in}_{0,j}}\big) + \underline{F_3}\big(\underline{S^{in}_{1,j}}\big)\Big) + \underline{F_1}\big(\underline{S^{in}_{2,j}}\big)\bigg) + \underline{F_1}\big(\underline{S^{in}_{3,j}}\big)\Bigg) + \underline{RK_{0,j}}\)

So, we need \(12\) \(\underline{f_{\oplus}}\) lookups and \(12\) \(\underline{F_z}\) lookups per column, i.e. \(96\) lookups per round. We still need \(16\) lookups to convert to \(3\)-sparse form before the first round, and \(16\) lookups to convert back to binary form after the last round.

So in this variant,

  • AES-128 (with \(10\) rounds) requires \(992\) lookups.
  • AES-192 (with \(12\) rounds) requires \(1184\) lookups.
  • AES-256 (with \(14\) rounds) requires \(1376\) lookups.

This has smaller overall table size (five tables of size \(3^8 = 6561\) and one of size \(2^8\)), and better performance than the "fewer large tables" variant above, so the "fewer large tables" variant can be discarded.

Other Rijndael block sizes

Rijndael also supports \(192\)-bit and \(256\)-bit blocks. This is achieved by increasing the number of columns to \(6\) or \(8\) in a straightforward way. The number of lookups scales linearly with the block size.

The resulting lookup counts for \(4\)-sparse representation are:

\begin{array}{|c|ccc|} \hline & & \text{Block size} & \\ \text{Rounds} & 128 & 192 & 256 \\ \hline 10 & 672 & 1008 & 1344 \\ 12 & 800 & 1200 & 1600 \\ 14 & 928 & 1392 & 1856 \\ \hline \end{array}

and for \(3\)-sparse representation:

\begin{array}{|c|ccc|} \hline & & \text{Block size} & \\ \text{Rounds} & 128 & 192 & 256 \\ \hline 10 & 992 & 1488 & 1984 \\ 12 & 1184 & 1776 & 2368 \\ 14 & 1376 & 2064 & 2752 \\ \hline \end{array}

Select a repo