These are my design notes for a min-max slasher for Lighthouse, the Ethereum 2.0 client from Sigma Prime.
In this post Proto describes a scheme for efficiently storing attestation data so that surround slashings can be quickly detected. Detecting surrounds is the difficult part of building a slasher, so we'll focus mainly on that for now and come back to detecting double votes later.
Proto's construction is as follows:
The most conceptually straight-forward variant involves storing the min and max targets directly.
Let AttestationData be the attestation data type from the Eth2 spec, with source.epoch and target.epoch fields, and let N be the length of the weak subjectivity period in epochs (54K epochs).

For each validator, we store two arrays of length N: min_targets and max_targets. The entry min_targets[e] holds the minimum target epoch of any attestation by that validator with source epoch greater than e, and max_targets[e] holds the maximum target epoch of any attestation with source epoch less than e.
In natural language:

If we have a new attestation with source epoch s and target epoch t, and min_targets[s] < t, then some previously seen attestation has a source greater than s and a target less than t, i.e. the new attestation surrounds it and is slashable (Lemma 1). Similarly, the max targets array allows us to check whether a new attestation is surrounded by an existing one: if max_targets[s] > t, then some previously seen attestation has a source less than s and a target greater than t, so the new attestation is slashable (Lemma 2).
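As a concrete illustration of Lemmas 1 & 2, a slashability check along these lines might look like the sketch below. The function name and argument layout are illustrative rather than anything from Lighthouse, and the wrap-around indexing introduced next is ignored for simplicity.

```python
# Sketch of the checks implied by Lemmas 1 & 2 (illustrative names only).
def is_surround_slashable(min_targets, max_targets, attestation):
    s = attestation.source.epoch
    t = attestation.target.epoch
    # Lemma 1: some stored attestation has source > s and target < t,
    # so the new attestation surrounds it.
    if min_targets[s] < t:
        return True
    # Lemma 2: some stored attestation has source < s and target > t,
    # so it surrounds the new attestation.
    if max_targets[s] > t:
        return True
    return False
```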
So far we've considered arrays of length 54K epochs without considering what happens once the chain progresses past epoch 54K. In this case, we want to wrap around the end of the array so that we are still storing the same amount of history, while overwriting the oldest values.
For an epoch e, we store its data at index e % N, overwriting the entry that previously held epoch e - N.

This means that the arrays now require knowledge of the current epoch, which necessitates some clarifications:
* Let INT_MAX denote the maximum value representable by an array element (0xffff for a u16).
* For min targets, the entry for each new epoch is initialised to INT_MAX. Initialising the min target for new epochs to INT_MAX represents a lack of information, and ensures that new attestations with that source epoch are not incorrectly flagged as slashable (Lemma 1). The epoch update is trivial: no attestation can yet have a source epoch greater than the new current epoch, so there is nothing to carry over.
* For max targets, the initial 0 value represents a lack of information, and ensures new attestations are not slashable (Lemma 2). The epoch update is more interesting: if a maximum target is known for the current epoch, then it also applies to the next epoch, because any attestation with a source epoch less than the current epoch also has a source epoch less than the next epoch. So when the current epoch advances, the max target for the new epoch is initialised from the previous current epoch's value.
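Based on these rules, advancing a validator's arrays to a new current epoch might look like the following sketch. The function name is mine, and the constants are just the values discussed above.

```python
N = 54_000        # weak subjectivity period in epochs, as above
INT_MAX = 0xffff  # maximum value of a u16, as above

# Advance the arrays by one epoch: the new epoch's min target starts at
# INT_MAX (no information), while its max target carries over from the
# previous current epoch.
def advance_current_epoch(min_targets, max_targets, current_epoch):
    next_epoch = current_epoch + 1
    min_targets[next_epoch % N] = INT_MAX
    max_targets[next_epoch % N] = max_targets[current_epoch % N]
    return next_epoch
```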
Lemmas 1 & 2 give us rules for checking when a new attestation is slashable, but we also need to update the min-max arrays to account for the new attestation. We can do this efficiently using the following algorithms:
# Update every entry from one epoch prior to `attestation.source.epoch`,
# working backwards. Bail out of the update early if an existing target is
# already less than or equal to the new min target, because that min target
# will also apply to all prior epochs.
def update_min_targets(
    min_targets: [u16],
    attestation: AttestationData,
    current_epoch: Epoch,
):
    e = attestation.source.epoch
    min_epoch = 0 if current_epoch < N else current_epoch - N
    while e > min_epoch:
        e -= 1
        if attestation.target.epoch < min_targets[e % N]:
            min_targets[e % N] = attestation.target.epoch
        else:
            return
# Update every entry from one epoch after `attestation.source.epoch` up to the
# current epoch. Bail out early if an existing target is already greater than
# or equal to the new max target, because that max target will also apply to
# all subsequent epochs.
def update_max_targets(
    max_targets: [u16],
    attestation: AttestationData,
    current_epoch: Epoch,
):
    e = attestation.source.epoch + 1
    while e <= current_epoch:
        if attestation.target.epoch > max_targets[e % N]:
            max_targets[e % N] = attestation.target.epoch
            e += 1
        else:
            return
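For clarity, here is how the pieces might fit together when processing a single attestation for one validator: a sketch combining the Lemma 1 & 2 checks with the update functions above. Double-vote detection and fetching the conflicting stored attestation are omitted, and the function name is illustrative.

```python
# N, update_min_targets and update_max_targets are as defined above.
def process_attestation(min_targets, max_targets, attestation, current_epoch):
    s = attestation.source.epoch
    t = attestation.target.epoch
    # Lemma 1 & 2 checks against the wrapped arrays.
    slashable = min_targets[s % N] < t or max_targets[s % N] > t
    # Fold the new attestation into the arrays either way.
    update_min_targets(min_targets, attestation, current_epoch)
    update_max_targets(max_targets, attestation, current_epoch)
    return slashable
```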
Rather than storing the target of attestations in the arrays directly, we can instead equivalently store their distance from the epoch of the array entry, i.e. the entry for epoch e stores target - e.

This has several advantages:

* The values stay small. The distance will always fit inside a u16, because within the weak subjectivity period it is bounded by the array length N.
* If all attestations are targeting the epoch immediately after their source epoch, then the arrays are completely uniform: long runs of a single repeated distance, which are ideal for compression (see the sketch below).
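Here is a minimal sketch of the distance encoding, under my reading that each entry stores its target's distance from the entry's own epoch. Details such as how a max target below the entry's epoch would be represented may differ in practice.

```python
# Distance encoding sketch: entries store target - entry_epoch.
def encode_distance(target_epoch, entry_epoch):
    distance = target_epoch - entry_epoch
    # Within the weak subjectivity period the distance is bounded by N = 54K,
    # so it always fits in a u16.
    return distance

def decode_target(distance, entry_epoch):
    return entry_epoch + distance

# With ideal attestations (target == source + 1), the min-target entry for
# epoch e is set by the attestation with source e + 1 and target e + 2, so
# every entry holds the same distance of 2 and the array is perfectly uniform.
```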
In implementation, we have several options for compressing these arrays, depending on our choice of data store.
It is likely that as well as repetition within the arrays for a single validator, there will also be repetition across the attestation behaviours of validators. It is essential that we take advantage of this repetition to reduce disk usage, as naively the arrays could use up to 64.8 GB, and we don't want to be rewriting entire arrays all the time.
Here's a first draft of a chunking scheme, which comes with some drawbacks:
* Let C be the number of array elements per chunk.
* Keep an index mapping (validator_index: u64, chunk_idx: u16) => ChunkHash, alongside a content-addressed store of ChunkHash => chunk (sketched below).
* When a new attestation appears:
  * Look up the chunk_hash for (validator_index, chunk_idx), where chunk_idx = att.source.epoch / C.
  * Apply the min-max updates to the chunk(s), then write each updated chunk back and store the new chunk_hash for (validator_index, chunk_idx). This will involve hashing each updated chunk, so we should choose a cheap hash function (maybe BLAKE2b).
* In order to implement garbage collection, we could tag each chunk with the epoch at which it was last used. This would use an extra 8 bytes per chunk, and require an extra write to every chunk used in an epoch.
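To make the layout concrete, here is a toy sketch of the index and content-addressed chunk store using in-memory dicts and Python's hashlib.blake2b. The names and the C = 240 value are illustrative, not an actual Lighthouse schema.

```python
import hashlib

C = 240  # array elements per chunk (~1 day of epochs); illustrative value

# Toy in-memory stand-ins for the two on-disk tables; a real implementation
# would use the database backing the slasher.
chunk_index = {}  # (validator_index, chunk_idx) => chunk hash
chunk_store = {}  # chunk hash => chunk bytes (content-addressed, de-duplicated)

def chunk_idx(source_epoch):
    return source_epoch // C

def chunk_hash(chunk_bytes):
    # Cheap hash for content addressing (e.g. BLAKE2b, as suggested above).
    return hashlib.blake2b(chunk_bytes, digest_size=32).digest()

def read_chunk(validator_index, idx):
    h = chunk_index.get((validator_index, idx))
    return chunk_store.get(h) if h is not None else None

def write_chunk(validator_index, idx, chunk_bytes):
    h = chunk_hash(chunk_bytes)
    chunk_store[h] = chunk_bytes  # identical chunks collapse to a single entry
    chunk_index[(validator_index, idx)] = h
```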
Some properties of this scheme:

* De-duplication: a chunk whose contents are common across validators (e.g. a uniform 232323... pattern) can be stored once and referenced from multiple (validator_id, chunk_idx) pairs.
* Each epoch, at most 2 * V chunks need to be written to disk. With C=240 (~one day) and V=300k this is only 2 * 2 * 240 * 300_000 = 288 MB.
* Hashing chunks is cheap: a quick benchmark of BLAKE2b on a [u16; 240] array shows that it runs in less than 1μs! For arrays of 2048 elements (far beyond what we're likely to want) the hashing time is ~4μs.
* The index requires V * N/C * H bytes, where V is num validators and H is the size of a hash (probably 32 bytes). That's around 2.16 GB for C=240, smaller with larger chunk sizes.
* The chunk data requires some uniqueness fraction U times the full size 2 * 2 * V * N (64.8 GB for V=300k, U=1.0).

Here's another approach to chunking, due to Proto.
Split each array into "horizontal" runs of length C. Then, take K runs of length C from different validators and store these as a chunk on disk (size 2 * K * C bytes).
Keep a mapping chunk_id => chunk, where chunk_id is computed from validator index and epoch like so (not necessary in an SQL DB):
validator_chunk_number = validator_index / K
chunk_number = att.source.epoch / C
chunk_id = validator_chunk_number * C + chunk_number
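For illustration, the full mapping from (validator_index, epoch) to a chunk and a position within it might look like the sketch below. The K and C values and the function name are placeholders, and integer division is written as // in Python.

```python
C = 240  # epochs per chunk (illustrative)
K = 256  # validators per chunk (illustrative)

def chunk_location(validator_index, epoch):
    # Follows the chunk_id formula above.
    validator_chunk_number = validator_index // K
    chunk_number = epoch // C
    chunk_id = validator_chunk_number * C + chunk_number
    # Position within the 2 * K * C byte chunk: which validator's run,
    # and which element of that run.
    sub_chunk = validator_index % K
    offset = epoch % C
    return chunk_id, sub_chunk, offset
```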
The update protocol for a new att is:
* Load the chunk for chunk_id(validator_index, att.source.epoch).
* Apply the update to the validator_index % K sub-chunk.
* K chunks containing all validators get updated… TODO

The real differentiation between 1D and 2D is:
1D provides de-duplication but requires garbage collection and uses more space in the worst-case than 2D.
What's the break-even point for 1D to use less space?
Assuming a single array for simplicity
1D Disk Usage:
* Index: H*V*N/C
* Chunks: U*2*V*N (U = uniqueness factor)
2D Disk Usage:
* Chunks: 2*V*N
Solving for equality yields U = 1 - H/(2C) as the break-even point at which both strategies use the same disk space.
With the last_used_epoch
tags for GC:
1D Disk Usage:
* Index: H*V*N/C
* Chunks: U*V*N/C*(2C + 8)
We get U = (2C - H)/(2C + 8)
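Plugging representative numbers into the two break-even formulas (C = 240 and 32-byte hashes, both illustrative) gives a feel for how much de-duplication is needed:

```python
# Break-even uniqueness factor U at which 1D chunking uses the same disk
# space as 2D, for C = 240 elements per chunk and H = 32-byte hashes.
C = 240
H = 32

u_plain = 1 - H / (2 * C)               # without last_used_epoch tags
u_with_gc = (2 * C - H) / (2 * C + 8)   # with 8-byte last_used_epoch tags

print(f"break-even U without GC tags: {u_plain:.3f}")   # ~0.933
print(f"break-even U with GC tags:    {u_with_gc:.3f}")  # ~0.918
```

So under these parameters, de-duplication has to eliminate at least roughly 7-8% of the chunk data before 1D chunking beats 2D on disk usage.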
Some numbers are here: https://docs.google.com/spreadsheets/d/1UEWfQ_Oh73uarXLqTuKz9JERoRGutfLr-M6bGqX9u-w
It looks like the de-duplication could prove useful, particularly if we allow ourselves 8 or 16 byte BLAKE2b hashes. However, due to the extra complexity of this approach, I'm going to start by implementing 2D chunking.
Although we could rely on the compression provided by a key-value store or relational database, support is patchy between databases, and we can probably do better with a custom-designed format. For completeness, here's a quick summary of compression support across different stores.
LevelDB uses Snappy, but LMDB doesn't support any form of compression.
Postgres has no support for compressed tables, but can compress individual rows, which could work with a one-array-per-row storage schema. MariaDB supports compressed pages, which would work with one-array-per-row or one-element-per-row.
This section is mostly obsoleted by chunking, above ^
To compress the long runs of identical distances, it seems run-length encoding (RLE) could be effective. Noting that the min target for the current epoch always starts at the fresh sentinel value, the encoded array for a well-behaved validator will usually be one long run plus a short run at the end.
We're using tuples of (run length, value), stored as a Vec<(u16, u16)> (in Rust notation).
With this encoding it doesn't matter what data store we use: we can store a value (or row) for each validator index, and read/write it as it changes, without using much space on disk.
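A minimal sketch of this run-length encoding (the helper names are mine):

```python
# Run-length encoding as (run length, value) pairs, i.e. a Vec<(u16, u16)>.
def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1] = (runs[-1][0] + 1, v)
        else:
            runs.append((1, v))
    return runs

def rle_decode(runs):
    out = []
    for length, value in runs:
        out.extend([value] * length)
    return out

# e.g. a uniform min-target array with one fresh sentinel entry at the end:
# rle_encode([2] * (N - 1) + [0xff]) == [(N - 1, 2), (1, 0xff)]
```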
To really optimise for the best-case scenario, we could use a run-length encoded queue which always stores the value for the current epoch at the last index. In the best case this could remove the need to write the queue to disk every epoch, in a flow like the following:
1. At epoch e, we have min_targets = [(N - 1, 2), (1, 0xff)] stored on disk.
2. The epoch advances, so the in-memory queue becomes min_targets = [(N - 2, 2), (2, 0xff)].
3. The validator's attestation for epoch e + 1 arrives and is applied in memory, new value: min_targets = [(N - 1, 2), (1, 0xff)]. This is the same as the value already stored on disk – no need to write it.

In the best case, the min-max target queues could require only four u16s each (8 bytes). Assuming the data store requires a u64 to encode the length of values/rows, this pushes the minimum size to 16 bytes per queue. For 300k validators, we're looking at roughly 2 * 16 * 300_000 bytes ≈ 9.6 MB.
Keep in mind this is just the disk usage for the min-max arrays, and that we also need to store all the attestations, which will require hundreds of gigabytes.
This is probably also very unrealistic, so let's look at some less uniform attestation patterns. Here are the expected sizes for different run-length encodings with X% of the full number of elements:
Beyond 50% uniqueness the RLE is actually less efficient than storing every value, as each (run length, value) tuple takes 4 bytes, twice the size of a raw entry, and storing every value directly costs 2 arrays * 54k epochs * 2 bytes (u16) * 300k validators = 64.8 GB. We could have a flag bit to switch between RLE and direct encoding if 50%+ uniqueness is observed in practice. Hopefully it won't be! And if we can get some compression from the filesystem or the DB, that will help too (particularly if it acts across rows/values).
If every validator attests every epoch, we are going to be reading (and often writing) every min-max array on disk… That's a lot of data! In the worst case that's 64.8 GB per epoch, or 180 MB/s… That's just within the realm of feasibility if we can avoid writing unchanged values and sit below the worst case most of the time.
TODO