---
Title: How to Store a Permutation Compactly
breaks: false
---
$$
\def\Fp{\mathbb{F}_p}
\def\deq{\mathrel{\mathop:}=}
\def\range#1{[{#1}]}
\def\alg#1{\textsf{#1}}
\def\rank{\textsf{rank}}
\def\msb{\textsf{msb}}
\def\bit{\textsf{bit}}
\def\mib#1{\color{blue}{\mathbf{{#1}}}}
\def\mir#1{\color{red}{\mathbf{{#1}}}}
$$
# How to Store a Permutation Compactly
### Bram Cohen and Dan Boneh
#### Feb. 2022
---
In this note we describe a compact way to store a permutation, while supporting a fast way to evaluate and invert the permutation. This question comes up naturally when optimizing the software underlying the [Chia proof-of-space](https://docs.chia.net/docs/03consensus/proof-of-space) blockchain. [Chia](https://chia.net) replaces Nakamoto's energy hungry proof-of-work consensus with an eco-friendly [proof-of-space](https://en.wikipedia.org/wiki/Proof_of_space).
We end the note with an open problem that we don't know how to solve.
## How to store a permutation compactly
A permutation on the set $\range{n} \deq \{0,1,2,\ldots,n-1\}$ is a one-to-one function from $\range{n}$ to $\range{n}$. The group of all $n!$ permutations on $\range{n}$ is denoted by $S_n$. Here is an example permutation in $S_{16}$ that maps $0 \to 9,\ \ 1 \to 14,\ \ 2 \to 12,\ \ 3 \to 2$, etc.:
$$
\begin{equation}
\pi_1 \deq
\left(\begin{array}{*{16}c}
0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\
9 & 14 & 12 & 2 & 7 & 10 & 13 & 3 & 1 & 15 & 0 & 6 & 4 & 11 & 5 & 8
\end{array} \right)
\tag{1} % \label{eq:exmpl}
\end{equation}
$$
We say that $D$ is a data structure for storing a permutation if $D$ supports the following interface:
* <span style="display:inline-block;width:18ex"> $\alg{process}(\pi) \to d_\pi\,$:</span> process the given permutation $\pi \in S_n$ and output its compact representation $d_\pi$,
* <span style="display:inline-block;width:18ex"> $\alg{eval}(d_\pi,x) \to y\,$:</span> on input $d_\pi$ and $x \in \range{n}$, output $y = \pi(x)$.
We require that $\alg{eval}(\cdot,\cdot)$ be fast, meaning that its running should be at most $O(\log n)$. Algorithm $\alg{process}(\cdot)$ can run in time polynomial in $n$.
Our goal is to design a data structure for storing a permutation where the worst case size of $d_\pi$, measured in bits, is as small as possible. By worst case we mean worst case over all $\pi \in S_n$. For extra bonus points we also want the data structure to support fast inversion:
* <span style="display:inline-block;width:18ex"> $\alg{invert}(d_\pi, y) \to x\,$:</span> on input $d_\pi$ and $y \in \range{n}$, output $x = \pi^{-1}(y)$.
This algorithm should similarly run in time at most $O(\log n)$.
## The trivial solution
Suppose $n$ is a power of 2. The trivial data structure for storing a permutation $\pi \in S_n$ simply lists the elements of the permutation in order. That is,
* $\alg{process}(\pi)$ outputs $d_\pi \deq \bigl[\pi(0),\ \pi(1),\ \ldots,\ \pi(n-1)\bigr]$, and
* $\alg{eval}(d_\pi, x)$ outputs $d_\pi[x]$ for $x \in \range{n}$.
For example, for the permutation $\pi_1$ in $(1)$ above we have
$$ d_{\pi_1} \deq [1010\; 1100\; 1101\; 1001\; 0111\; 0010\; 1110\; 0011\; 1011\; 1111\; 0000\; 0110\; 0100\; 0001\; 0101\; 1000].$$ This trivial data structure has the following properties:
* the length of $d_\pi$ is always $n \log_2 n$ bits, and
* $\alg{eval}(\cdot,\cdot)$ runs in constant time.
This looks quite good. So, are we done?
Well, not quite. Since the number of permutation in $S_n$ is $n!$, the number of bits needed to represent a permutation in the worst case is at least
$$ {\cal P}(n) \deq \lceil \log_2(n!) \rceil.$$
By [Stirling's approximation](https://en.wikipedia.org/wiki/Stirling%27s_approximation), and the fact that $\log_2 e$ is about $1.443$, we know that
$$
{\cal P}(n) = \lceil \log_{2}(n!)\rceil = n\log_{2} n - n \log_{2} e + \Theta(\log _{2}n)
\approx n\log_{2} n - \fbox{$1.443\, n$}.
$$
This means that we should be able to reduce the space needed to store a premutation by about 1.443 bits per entry over the trivial data structure.
**The optimally compressed representation.** One can represent a permutation $\pi$ in $S_n$ as an integer $d_\pi$ in $\{1, \ldots, n!\}$, and this will take at most ${\cal P}(n)$ bits. While this is more compact than the trivial data structure, we lose the ability to quickly evaluate and invert the permutation.
So, to summarize:
<p style="border:2px solid blue;padding:1em">
The challenge is to construct a data structure for storing a permutation
that takes less space than the trivial data stucture,
specifically 1.44 <i>n</i> fewer bits,
while retaining the ability to quickly evaluate the permutation.
</p>
## Who cares?
Is saving 1.44 bits per entry worth discussing?
Yes! There is an immediate application. In a Proof-of-Space based blockchain the monetary rewards for those providing proofs is proportional to the amount of space they have. Therefore, compressing the files used to create a Proof-of-Space will lead to increased rewards without the need to buy additional storage. It is a good day when a clever algorithm can increase revenues without increasing cost.
The [Chia Proof-of-Space](https://docs.chia.net/docs/03consensus/proof-of-space) uses a domain of size $n = 2^{32}$. A reduction of $1.44 n$ bits in storage results in a $1.44 n / (n \log_2 n) = 1.44/32 \approx 4.5\%$ increase in rewards, at no additional cost. However, due to the specifics of the Chia Proof-of-Space, one needs to compress a slightly more complicated object than a permutation. Adapting the compression technique presented here to Chia can, in principle, increase revenues by about 0.48\% in the limit.
## A compact data structure for storing a permutation
Now back to our question. From here on we will assume that $n$ is a power of two, so that $n = 2^k$ for some positive integer $k$. Recall that we define ${\cal P}(n) \deq \lceil \log_2(n!) \rceil$.
Our starting point is the classic [Beneš network](https://en.wikipedia.org/wiki/Clos_network#Bene%C5%A1_network_(m_=_n_=_2)) (1965). A Beneš network with $2$ inputs is a simple switch that has two inputs and two outputs. The switch has two settings: in its "zero" setting the outputs are the same as the inputs; in its "one" setting the outputs are swapped. This is illustrated in the following figure:
<img style="display:block;margin:auto;width:80%;padding-bottom:1em" src="https://i.imgur.com/SQH9VA7.jpg">
<br>
A Beneš network with $n = 2^k$ inputs is defined recursively using the following figure:
<img style="display:block;margin:auto;width:100%;padding:2em 0em" src="https://i.imgur.com/1Ww7cDr.jpg">
It is a simple exercise to show that for every permutation $\pi$ in $S_n$ there is a setting of the switches in the Beneš network that implements the permutation $\pi$. For example, for $n=16$, here are the switch settings for the permutation $\pi_1$ from $(1)$. The dotted path shows how the input `4' is mapped to the output '12'. The purpose of the green switches will become clear in a minute.
<img style="display:block;margin:auto;width:100%;padding-bottom:2em" src="https://hackmd.io/_uploads/ByZNP-CwC.png">
Once all the switches are set, evaluating $\pi(x)$ for an input $x \in \range{n}$ takes $(2 \log_2 n)-1$ steps. For a given input $x \in \range{n}$, we follow the path from $x$ to an output by traversing one switch at a time. This path contains exactly $(2 \log_2 n) - 1$ switches. Inverting the permutation is similarly done in $(2 \log_2 n)-1$ steps by processing the switches in the reverse order.
How much space do we need to store all the switch settings? The number of switches in an $n$-input Beneš network is exactly
$$\text{# of switches} = n \log_2 n - \frac{n}{2}.$$ This lets us represent any permutation in $S_n$ using that many bits, which is a savings of 0.5 bits per entry over the trivial solution. A good start, but we want to do better.
[Waksman](https://dl.acm.org/doi/abs/10.1145/321439.321449) (1968) observed that the top-left-most switch of a Beneš network can always be set to zero while still retaining the network's ability to express every permutation $\pi$ in $S_n$. This switch is shown in green in our recursive description of the Beneš network. We can apply this observation recursively to all the constituent Beneš networks and set all the green switches in our example network to zero. Since these switches are fixed, we do not need to store their settings, and as a result we save a total of $n/2 - 1$ bits. Hence, this observation reduces the number of bits to exactly
$$\text{# of bits} = n \log_2 n - n + 1.$$
Good progress, but we are still not at $n \log_2 n - 1.443\, n$.
[Munroa, Raman, Ramanc, and Rao](https://www.sciencedirect.com/science/article/pii/S0304397512002253) (2012) suggest a way to further compress a Beneš network. Let $q = 2^\ell$ be a small power of two, say $\ell \leq 8$. We prematurely terminate the recursive structure of the Beneš network at a permutation of size $q$, and then encode this permutation (on a domain of size $q$) using the *optimally compressed representation*. Here is the network for $\pi_1$ when we terminate the recursion at a permutation of size 4:
<img style="display:block;margin:auto;width:100%;padding:1em 0em 2em 0em" src="https://hackmd.io/_uploads/HkN3D-0PR.png">
Here we replaced the three inner layers of the Beneš network with four permutations in $S_4$. Each permutation can be represented as an integer $d$ where $d \in \range{24}$. Evaluating one of these $S_4$ permutations takes constant time.
More generally, suppose we terminate the recursion at a permutation of size $q = 2^\ell$. This eliminates the inner most $2\ell + 1$ layers of the network. Then the number of bits needed to store the entire network is:
$$
\begin{align}
\underbrace{n (\log_2 n - \log_2 q)}_{\begin{array}{c}\text{# switches excluding the} \\ \text{$2\ell+1$ inner layers} \end{array}} & - \underbrace{(n/q)+1}_{\begin{array}{c}\text{# Waksman} \\ \text{bits} \end{array}} + \underbrace{(n/q) \cdot \cal{P}(q)}_{\begin{array}{c}\text{size of the single} \\ \text{$q$-permutation layer} \end{array}} = \\[8px] & = n \log_2 n - n \Bigl(\log_2 q + (1/q) - \lceil \log_2(q!) \rceil / q \Bigr) + 1.
\tag{2}\label{eq:calc}
\end{align}
$$
Concretely, we get
* for $q = 2$ the data structure uses about $\bigl[n \log_2 n - n\bigr]$ bits. (a Beneš-Waksman network)
* for $q = 4$ the data structure uses about $\bigl[n \log_2 n - n\bigr]$ bits. (as in the figure above)
* for $q = 8$ the data structure uses about $\bigl[n \log_2 n - 1.125 n\bigr]$ bits.
* for $q = 16$ the data structure uses about $\bigl[n \log_2 n - 1.25 n\bigr]$ bits.
* for $q = 32$ the data structure uses about $\bigl[n \log_2 n - 1.34 n\bigr]$ bits.
* For $q = 256$ the data structure uses about $\bigl[n \log_2 n - 1.43 n\bigr]$ bits.
This converges to ${\cal P}(n) \approx n \log_2 n - 1.443 n$ bits, which is exactly what we want. In practice using $q = 32$ is sufficient. We note that one can store a batch of $S_q$ permutations more compactly than storing them separately. This lets us remove the ceiling function in the $\lceil \log_2(q!) \rceil$ term in $\eqref{eq:calc}$ and get some more space savings.
<br>
**Evaluation time.** We can speed up the evaluation procedure by processing multiple layers at a time. To do so, observe that the group of twelve blue switches in the figure below are sufficient to process the first three layers for the inputs $\{0,1,\ldots,7\}$.
<img style="display:block;margin:auto;width:100%;padding-bottom:1em" src="https://hackmd.io/_uploads/rkBVubCPR.png">
The twelve yellow switches are sufficient to process the first three layers for the inputs $\{8,\ldots,15\}$. Thus, we can store the twelve bits for the blue switches in one block on disk, and the twelve bits for the yellow switches in another block on disk. Now, for an input $x \in \range{16}$, we can process the first three layers with a single disk access.
For permutations in $S_n$ where $n=2^{32}$ --- the case of interest to us --- the Beneš network contains $63$ layers. Using $q = 32$ we get to remove the 11 inner layers, and replace them with permutations in $S_{32}$. This leaves 53 layers to process: 26 switching layers, a permutation in $S_{32}$, and 26 more switching layers.
The plan is to process eight layers at a time. In this case, the group of "blue" switches contains $2^7 \times 8 = 1024$ switches, which means storing 1024 bits, or 128 bytes, in one block on disk. Evaluating the permutation at a given input $x \in \range{2^{32}}$ can now be done by reading only *seven* disk blocks:
* we process the first 24 layers, eight layers at a time, by reading three disk blocks;
* we process the next two layers, the permutation in $S_{32}$, and the next two layers after that, using a single disk access;
* finally, we process the remaining 24 layers, eight layers at a time, by reading three more disk blocks.
This is a total of seven disk blocks that need to be read. Interestingly, evaluating the inverse premutation at a given input $y \in \range{2^{32}}$ is done exactly the same way, just in the reverse order.
## Conclusion
This completes our story. We explained how a compressed Beneš network lets us store a permutation $\pi$ in $S_n$ using an optimal number of bits. Evaluating $\pi(x)$ and $\pi^{-1}(y)$ can be done in about $2\log_2 n$ steps. The algorithm has some locality that lets us reduce the number of disk reads.
**An open problem.** Design a data structure with similar performance to the one presented here *and* better locality. In particular, evaluating a permutation in $S_n$ for $n=2^{32}$ should require reading only a single disk block (4KB). This will match the performance of the trivial solution, but with less space.
## Related work
[Bernstein](https://cr.yp.to/papers/controlbits-20200923.pdf) (2020) describes Waksman's observation and its history in Section 6. We thank Dan Bernstein for pointing us to this, and for other helpful comments.
[Barbay and Navarro](https://arxiv.org/pdf/0902.1038.pdf) show how to compress specific classes of permutations in $S_n$, while retaining the ability to quickly evalaute the premutation and its inverse. For example, consider the class of *local* permutations where there is a known bound $b \ll n$ such that $\bigl|\pi(x)-x\bigr| \leq b$ for all $x \in \range{n}$.