Implementing A Min-Max Slasher

These are my design notes for a min-max slasher for Lighthouse, the Ethereum 2.0 client from Sigma Prime.

Theory

In this post Proto describes a scheme for efficiently storing attestation data so that surround slashings can be quickly detected. Detecting surrounds is the difficult part of building a slasher, so we'll focus mainly on that for now and come back to detecting double votes later.

Proto's construction is as follows:

For each validator, store two arrays indexed by epochs.
The arrays are 54K elements each to account for the weak subjectivity period.
The entries of the arrays are attestation targets or some encoding thereof, described in the next sections.

Storing the Target Directly

The most conceptually straightforward variant involves storing the min and max targets directly.

Let

A_{v}

be the set of all attestations signed by validator

v

. Each

a \in A_{v}

has type AttestationData.

For each validator

v

, let

{min_targets}_{v}

and

{max_targets}_{v}

be 54K element arrays such that:

{min_targets}_{v} [i] = min ({a.target.epoch ∣ a \in A_{v}, a.source.epoch > i})

{max_targets}_{v} [i] = max ({a.target.epoch ∣ a \in A_{v}, a.source.epoch < i})

In natural language:

{min_targets}_{v} [i]

is the minimum target of an attestation with source epoch greater than

i

{max_targets}_{v} [i]

is the maximum target of an attestation with source epoch less than

i
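As a sanity check, both definitions can be evaluated by brute force over a validator's attestations. This is only an illustrative sketch: it models an attestation as a hypothetical `(source_epoch, target_epoch)` pair rather than a full AttestationData, and returns `None` where the set in the definition is empty.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for AttestationData: (source_epoch, target_epoch).
Attestation = Tuple[int, int]

def min_target(attestations: List[Attestation], i: int) -> Optional[int]:
    """Minimum target among attestations with source epoch > i."""
    targets = [t for (s, t) in attestations if s > i]
    return min(targets) if targets else None

def max_target(attestations: List[Attestation], i: int) -> Optional[int]:
    """Maximum target among attestations with source epoch < i."""
    targets = [t for (s, t) in attestations if s < i]
    return max(targets) if targets else None

attestations = [(2, 3), (3, 6), (5, 7)]
print(min_target(attestations, 2))  # 6: sources 3 and 5 qualify, targets 6 and 7
print(max_target(attestations, 5))  # 6: sources 2 and 3 qualify, targets 3 and 6
```

A real slasher would never recompute these from scratch; the arrays exist precisely so that each lookup is a single index operation.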

Why do this?

Lemma 1:

x surrounds an existing attestation ⟺ x .target.epoch > {min_targets}_{v} [x .source.epoch]

If we have a new attestation

x

, then we can check whether it is surrounding any existing attestation by checking if

x .target.epoch > {min_targets}_{v} [x .source.epoch]

. If the target epoch of the new attestation is greater than the min target for its source, that means there exists a previously signed attestation

a

with

a .source.epoch > x .source.epoch

and

a.target.epoch < x.target.epoch

, i.e.

x

surrounds

a

. Moreover, because

a

is the attestation with the minimum target amongst attestations that could be surrounded by

x

, we know that if the check fails, then

x

cannot surround any other previously signed attestation. This has been a hand-wavy proof of Lemma 1.

Lemma 2:

x is surrounded by an existing attestation ⟺ x .target.epoch < {max_targets}_{v} [x .source.epoch]

Similarly, the max targets array allows us to check whether a new attestation

x

is surrounded by any existing attestation. If

x .target.epoch < {max_targets}_{v} [x .source.epoch]

then

\exists a \in A_{v} . a .source.epoch < x .source.epoch \land a .target.epoch > x .target.epoch

, i.e.

a

surrounds

x

. By similar reasoning to above we have the other direction of the iff.
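The two lemmas translate into a pair of single-lookup checks. The sketch below assumes the arrays store targets directly and are indexed by epoch without wrap-around; the function names, the `TOP` constant and the hand-filled arrays are illustrative only.

```python
from typing import List

TOP = 2**64 - 1  # assumed encoding of "no attestation with a later source"

def surrounds_existing(min_targets: List[int], source: int, target: int) -> bool:
    """Lemma 1: the new attestation (source, target) surrounds an existing one."""
    return target > min_targets[source]

def surrounded_by_existing(max_targets: List[int], source: int, target: int) -> bool:
    """Lemma 2: the new attestation (source, target) is surrounded by an existing one."""
    return target < max_targets[source]

# Arrays for a validator whose only prior attestation is (source=1, target=5),
# filled in by hand from the definitions for epochs 0..6.
min_targets = [5, TOP, TOP, TOP, TOP, TOP, TOP]
max_targets = [0, 0, 5, 5, 5, 5, 5]

print(surrounds_existing(min_targets, source=0, target=6))      # True: (0, 6) surrounds (1, 5)
print(surrounded_by_existing(max_targets, source=2, target=3))  # True: (1, 5) surrounds (2, 3)
print(surrounds_existing(min_targets, source=2, target=6))      # False
print(surrounded_by_existing(max_targets, source=2, target=6))  # False
```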

Wrap-Around

So far we've considered arrays of length 54K epochs without considering what happens once the chain progresses past epoch 54K. In this case, we want to wrap around the end of the array so that we are still storing the same amount of history, while overwriting the oldest values.

For an epoch

e

within the weak subjectivity period we store the minimum/maximum target for

e

at index

e mod N

{min_targets}_{v} [e mod N] = min ({a.target.epoch ∣ a \in A_{v}, a.source.epoch > e})

{max_targets}_{v} [e mod N] = max ({a.target.epoch ∣ a \in A_{v}, a.source.epoch < e})

This means that the arrays now require knowledge of the current epoch which necessitates some clarifications:

To prevent overwriting historic values that are still in use, the arrays should only store information about attestations such that
$current_epoch - N < a .source.epoch \leq current_epoch$ , where
$N$ is the length of the array.
Given the above restriction, we need to define how the arrays are initialised at epoch 0, and how they should be updated when the current epoch advances.
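To make the window restriction concrete, here is a small sketch with a deliberately tiny `N`. The helper names are made up for illustration. It shows that an epoch exactly `N` behind the current one maps to the same slot as the current epoch, which is why it has to fall outside the window.

```python
def array_index(epoch: int, n: int) -> int:
    """Slot used for `epoch` in an array of length `n`."""
    return epoch % n

def in_window(source_epoch: int, current_epoch: int, n: int) -> bool:
    """True if current_epoch - n < source_epoch <= current_epoch."""
    return current_epoch - n < source_epoch <= current_epoch

N = 8  # tiny array length, for illustration only
print(array_index(10, N))   # 2
print(array_index(2, N))    # 2: same slot as epoch 10
print(in_window(3, 10, N))  # True
print(in_window(2, 10, N))  # False: its slot now belongs to epoch 10
```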

Initialisation and Epoch Advances

Let

⊤

(top) be the target that is greater than all others (e.g. INT_MAX).

For

{min_targets}_{v}

Initial value:
${min_targets}_{v} = [⊤, undefined, \dots]$
Epoch update:
${min_targets}_{v} [next_epoch mod N] = ⊤$

Initialising the min target for new epochs to

⊤

reflects that we have no information about attestations with source epochs greater than new epochs, and ensures that any attestation with source equal to the new epoch is unslashable (Lemma 1).

For

{max_targets}_{v}

Initial value:
${max_targets}_{v} = [0, undefined, \dots]$
Epoch update:
${max_targets}_{v} [next_epoch mod N] = {max_targets}_{v} [current_epoch mod N]$

For max targets, the initial 0 value represents a lack of information, and ensures new attestations are not slashable (Lemma 2). The epoch update is more interesting: if a maximum target is known for the current epoch, then it also applies to the next epoch because the

a

with max target has

a .source.epoch < current_epoch < next_epoch

. However, there could also be attestations with

a .source.epoch = current_epoch

which should also be taken into account. To deal with this, an epoch of lookahead could be implemented, for example by constraining the source of attestations processed:

current_epoch - (N - 1) < a .source.epoch \leq current_epoch - 1

, or something equivalent. We'll see how this falls out in the implementation.
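The initial values and epoch updates above can be sketched as follows. This is the plain rule as stated, without the one-epoch lookahead, so it does not yet account for attestations whose source equals the current epoch. `TOP`, `init_arrays` and `advance_epoch` are names invented for this sketch, and `None` stands in for the undefined entries.

```python
from typing import List, Optional, Tuple

TOP = 2**64 - 1  # assumed encoding of the "top" target

def init_arrays(n: int) -> Tuple[List[Optional[int]], List[Optional[int]]]:
    """Arrays at epoch 0; None marks the entries the text calls undefined."""
    min_targets: List[Optional[int]] = [TOP] + [None] * (n - 1)
    max_targets: List[Optional[int]] = [0] + [None] * (n - 1)
    return min_targets, max_targets

def advance_epoch(min_targets, max_targets, current_epoch: int, n: int) -> int:
    """Apply the epoch update rules and return the new current epoch."""
    next_epoch = current_epoch + 1
    # Nothing is known yet about attestations with a source after next_epoch.
    min_targets[next_epoch % n] = TOP
    # Any max target known for the current epoch also applies to the next one.
    max_targets[next_epoch % n] = max_targets[current_epoch % n]
    return next_epoch

min_targets, max_targets = init_arrays(4)
epoch = advance_epoch(min_targets, max_targets, 0, 4)
print(epoch, max_targets)  # 1 [0, 0, None, None]

max_targets[1] = 9  # pretend an attestation with source 0 and target 9 was recorded
epoch = advance_epoch(min_targets, max_targets, epoch, 4)
print(epoch, max_targets)  # 2 [0, 9, 9, None]
```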

New Attestation Updates

Lemmas 1 & 2 give us rules for checking when a new attestation is slashable, but we also need to update the min-max arrays for the existence of the new attestation. We can do this efficiently using the following algorithms:

N = 54_000  # array length in epochs (the weak subjectivity period)

# Update every entry from one epoch prior to `attestation.source.epoch`.
# Bail out of the update early if an existing target is already less than
# or equal to the new min target, because that min target will also apply to all
# prior epochs.
def update_min_targets(
    min_targets: list[int],
    attestation: "AttestationData",
    current_epoch: int,
):
    e = attestation.source.epoch
    # Oldest epoch still inside the window `current_epoch - N < e <= current_epoch`.
    # Stopping here avoids overwriting the slot that belongs to `current_epoch`.
    min_epoch = max(current_epoch - (N - 1), 0)
    while e > min_epoch:
        e -= 1
        if attestation.target.epoch < min_targets[e % N]:
            min_targets[e % N] = attestation.target.epoch
        else:
            return

# Update every entry from one epoch after `attestation.source.epoch`.
# Bail out early if an existing target is already greater than or equal to the
# new max target, because that max target will also apply to all later epochs.
def update_max_targets(
    max_targets: list[int],
    attestation: "AttestationData",
    current_epoch: int,
):
    e = attestation.source.epoch + 1
    while e <= current_epoch:
        if attestation.target.epoch > max_targets[e % N]:
            max_targets[e % N] = attestation.target.epoch
            e += 1
        else:
            return
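To see the checks and updates working together, here is a minimal end-to-end sketch. It restates the two update loops using plain `(source, target)` integers instead of AttestationData, uses a tiny `N`, and initialises every slot to "no information". The lower bound `current_epoch - (N - 1)` and the decision not to record slashable attestations are assumptions of this sketch rather than settled design.

```python
N = 8            # tiny history length, for illustration only
TOP = 2**64 - 1  # assumed encoding of "no information" for min targets

def update_min_targets(min_targets, source, target, current_epoch):
    # Oldest epoch still inside the window current_epoch - N < e <= current_epoch.
    min_epoch = max(current_epoch - (N - 1), 0)
    e = source
    while e > min_epoch:
        e -= 1
        if target < min_targets[e % N]:
            min_targets[e % N] = target
        else:
            return  # earlier epochs already hold a min target at least as small

def update_max_targets(max_targets, source, target, current_epoch):
    e = source + 1
    while e <= current_epoch:
        if target > max_targets[e % N]:
            max_targets[e % N] = target
            e += 1
        else:
            return  # later epochs already hold a max target at least as large

def process_attestation(min_targets, max_targets, source, target, current_epoch):
    """Check Lemmas 1 and 2, then record the attestation if it is not slashable."""
    if target > min_targets[source % N]:
        return "surrounds existing"
    if target < max_targets[source % N]:
        return "surrounded by existing"
    update_min_targets(min_targets, source, target, current_epoch)
    update_max_targets(max_targets, source, target, current_epoch)
    return "ok"

# The chain is at epoch 5 and nothing has been recorded yet.
min_targets, max_targets = [TOP] * N, [0] * N
print(process_attestation(min_targets, max_targets, 2, 3, current_epoch=5))  # ok
print(process_attestation(min_targets, max_targets, 1, 4, current_epoch=5))  # surrounds existing

min_targets, max_targets = [TOP] * N, [0] * N
print(process_attestation(min_targets, max_targets, 1, 5, current_epoch=5))  # ok
print(process_attestation(min_targets, max_targets, 2, 4, current_epoch=5))  # surrounded by existing
```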

Implementation

Storing the Distance

Rather than storing the target of attestations in the arrays directly, we can instead equivalently store their distance from epoch

e

{min_targets}_{v} [e mod N] = min ({a.target.epoch - e ∣ a \in A_{v}, a.source.epoch > e})

{max_targets}_{v} [e mod N] = max ({a.target.epoch - e ∣ a \in A_{v}, a.source.epoch < e})

This has several advantages:

It reduces the number of bits required to store the target. Absolute target epochs keep growing and exceed the maximum value of a u16 (65,535) not long after the 54,000th epoch, whereas the distance will always fit inside a u16.
In ideal network conditions (the common case – we hope), storing the distance makes the values of the array more uniform, which allows for very effective compression.

If all attestations are targeting the epoch immediately after their source epoch, then the arrays are completely uniform:

{min_targets}_{v} = [2, 2, 2, \dots]

{max_targets}_{v} = [0, 0, 0, \dots]
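The uniform values can be reproduced by brute force from the distance definitions. The sketch below uses hypothetical `(source, target)` pairs for a validator that attests every epoch with target = source + 1, and only looks at interior epochs so that both sets in the definitions are non-empty.

```python
from typing import Iterable, List, Tuple

TOP = 2**16 - 1  # assumed u16 encoding of "no information"

def distance_arrays(
    attestations: List[Tuple[int, int]], epochs: Iterable[int]
) -> Tuple[List[int], List[int]]:
    """Evaluate the min/max distance definitions directly for each epoch."""
    min_d, max_d = [], []
    for e in epochs:
        later = [t - e for (s, t) in attestations if s > e]
        earlier = [t - e for (s, t) in attestations if s < e]
        min_d.append(min(later) if later else TOP)
        max_d.append(max(earlier) if earlier else 0)
    return min_d, max_d

# Ideal behaviour: one attestation per epoch, target = source + 1.
ideal = [(s, s + 1) for s in range(10)]
min_d, max_d = distance_arrays(ideal, range(1, 9))
print(min_d)  # [2, 2, 2, 2, 2, 2, 2, 2]
print(max_d)  # [0, 0, 0, 0, 0, 0, 0, 0]
```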

In implementation, we have several options for compressing these arrays, depending on our choice of data store.

1D Chunking

It is likely that as well as repetition within the arrays for a single validator, there will also be repetition across the attestation behaviours of validators. It is essential that we take advantage of this repetition to reduce disk usage, as naively the arrays could use up to 64.8 GB, and we don't want to be rewriting entire arrays all the time.

Here's a first draft of a chunking scheme, that comes with some drawbacks:

Let C be the number of array elements per chunk
Store chunks in the database addressed by their hash
For each validator, store a mapping (validator_index: u64, chunk_idx: u16) => ChunkHash

When a new attestation appears:

1. Load the chunk_hash for (validator_index, chunk_idx), where chunk_idx = att.source.epoch / C
2. Load the chunk for chunk_hash
3. Check the attestation against the chunk, and make any necessary modifications to it (lazily loading neighbouring chunks as necessary)
4. Store any updated chunks in the DB and update the pointers for each (validator_index, chunk_idx). This will involve hashing each updated chunk, so we should choose a cheap hash function (maybe BLAKE2b)
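A rough in-memory sketch of the content-addressed layout might look like the following. Two dictionaries stand in for the database tables, chunks are serialised as little-endian u16 values, and the `ChunkStore` class with its method names is hypothetical. BLAKE2b comes from Python's standard `hashlib`.

```python
import hashlib
import struct
from typing import Dict, List, Tuple

class ChunkStore:
    def __init__(self) -> None:
        self.chunks: Dict[bytes, bytes] = {}               # chunk_hash -> serialised chunk
        self.pointers: Dict[Tuple[int, int], bytes] = {}   # (validator_index, chunk_idx) -> chunk_hash

    @staticmethod
    def serialize(chunk: List[int]) -> bytes:
        # Each element is a u16, stored little-endian.
        return struct.pack(f"<{len(chunk)}H", *chunk)

    def put(self, validator_index: int, chunk_idx: int, chunk: List[int]) -> bytes:
        data = self.serialize(chunk)
        digest = hashlib.blake2b(data, digest_size=32).digest()
        # Only store the chunk if an identical one is not already present.
        self.chunks.setdefault(digest, data)
        self.pointers[(validator_index, chunk_idx)] = digest
        return digest

    def get(self, validator_index: int, chunk_idx: int) -> List[int]:
        data = self.chunks[self.pointers[(validator_index, chunk_idx)]]
        return list(struct.unpack(f"<{len(data) // 2}H", data))

store = ChunkStore()
store.put(0, 0, [2, 2, 2, 2])
store.put(1, 0, [2, 2, 2, 2])  # identical contents: shared with validator 0
store.put(1, 1, [2, 3, 2, 3])
print(len(store.chunks), len(store.pointers))  # 2 3
print(store.get(1, 0))                         # [2, 2, 2, 2]
```

Note that nothing here ever deletes a chunk, which is exactly the garbage collection problem discussed next.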

Garbage Collection

In order to implement garbage collection, we could tag each chunk with the epoch at which it was last used. This would use an extra 8 bytes per chunk, and require an extra write to every chunk used in an epoch.

Upsides

Allows chunk sharing between and within target arrays for validators. Common runs of data like 232323... can be stored once and referenced from multiple (validator_id, chunk_idx) pairs
Compatible with compression: just compress each chunk
Compatible with in-memory caches at steps 1 & 2
Step 4 can be quite efficient by checking for the existence of the chunks in the database before writing
In each epoch at most 2 * V chunks need to be written to disk. With C=240 (~one day) and V=300k this is only 2 * 2 * 240 * 300_000 = 288 MB
In each epoch around 300k chunk pointers might need to be updated – at most another few hundred MBs

Downsides

Garbage-collection: finding chunks that are no longer used is expensive/impractical. We could add reference counting but that would be a lot of write overhead.
Hashing overhead: hashing each chunk is potentially expensive, particularly if we use SHA-256. A BLAKE-family hash function would mitigate this concern, and might even reduce the size of chunk pointers (we could choose a digest size < 32 bytes, sacrificing some security). A preliminary benchmark of BLAKE2b on [u16; 240] shows that it runs in less than 1μs! For arrays of 2048 elements (far beyond what we're likely to want) the hashing time is ~4μs.

Notes

Total space required for the chunk pointers is V * N/C * H where V is num validators and H is the size of a hash (probably 32 bytes). That's around 2.16 GB for C=240, smaller with larger chunk sizes.
Total space required for the chunks themselves is a uniqueness factor U times the full size 2 * 2 * V * N (64.8 GB for V=300k, U=1.0).
Choosing the chunk size is a trade-off between chunks that are likely to repeat (small chunks), and the size of the index/pointers (big chunks).
Need to prevent multiple attestations for the same validator from racing (a queue perhaps?)
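The arithmetic behind these estimates is easy to reproduce. The helper names below are made up for this sketch; the formulas are the ones quoted above, with 2-byte elements and two arrays per validator for the chunk figures.

```python
def pointer_bytes(v: int, n: int, c: int, h: int) -> int:
    """Space for the chunk pointers: V * N/C * H."""
    return v * (n // c) * h

def chunk_bytes(v: int, n: int, u: float) -> float:
    """Space for the chunks: U * 2 arrays * 2 bytes * V * N."""
    return u * 2 * 2 * v * n

def per_epoch_chunk_write_bytes(v: int, c: int) -> int:
    """Upper bound on chunk data written per epoch: 2 arrays * 2 bytes * C * V."""
    return 2 * 2 * c * v

V, N, C, H = 300_000, 54_000, 240, 32
print(pointer_bytes(V, N, C, H) / 1e9)          # 2.16 (GB)
print(chunk_bytes(V, N, 1.0) / 1e9)             # 64.8 (GB)
print(per_epoch_chunk_write_bytes(V, C) / 1e6)  # 288.0 (MB)
```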

2D Chunking

Here's another approach to chunking, due to Proto.

Split each array into "horizontal" runs of length C. Then, take K runs of length C from different validators and store these as a chunk on disk (size 2 * K * C bytes).

Disk layout is chunk_id => chunk, where chunk_id is computed from validator index and epoch like so:
- validator_chunk_number = validator_index / K
- chunk_number = (att.source.epoch mod N) / C
- chunk_id = validator_chunk_number * (N / C) + chunk_number (not necessary in an SQL DB)

The update protocol for a new att is:

Load the chunk for chunk_id(validator_index, att.source.epoch)
Fetch the validator's chunk from the validator_index % K sub-chunk
Check for slashing and update the chunk (lazily loading neighbouring chunks and modifying them as necessary)
Write all updated chunks back to the database
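A sketch of the addressing for this layout is below. It makes two assumptions that go slightly beyond the description: the epoch is reduced modulo N so that ids wrap around with the arrays, and the multiplier is the number of chunks per validator row (N / C) so that ids stay unique for any choice of C. The value K = 256 and the flat "K runs of C values" layout inside a chunk are illustrative guesses too.

```python
def chunk_id(validator_index: int, source_epoch: int, n: int, c: int, k: int) -> int:
    """Key of the 2D chunk holding this validator's value for this epoch."""
    validator_chunk_number = validator_index // k
    chunk_number = (source_epoch % n) // c
    return validator_chunk_number * (n // c) + chunk_number

def offset_in_chunk(validator_index: int, source_epoch: int, c: int, k: int) -> int:
    """Element offset inside a chunk laid out as K consecutive runs of C values.

    Assumes N is a multiple of C, so epoch % C picks the position within the run.
    """
    return (validator_index % k) * c + (source_epoch % c)

N, C, K = 54_000, 240, 256
print(chunk_id(1000, 500, N, C, K))      # 677 = 3 * 225 + 2
print(offset_in_chunk(1000, 500, C, K))  # 55700 = 232 * 240 + 20
print(chunk_id(1000, 500 + N, N, C, K))  # 677: same slot after wrap-around
```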

Upsides

Single chunk table (no need for the index, thus smaller overall)
No hashing
Compression-friendly: each large chunk will likely compress quite well

Downsides

Concurrency control necessary to prevent races when multiple validators update a chunk (queue)
No de-duplication across validator chunks

Other

Writes per epoch: in the best-case K chunks containing all validators get updated… TODO

1D vs 2D

The real differentiation between 1D and 2D is:

1D provides de-duplication but requires garbage collection and uses more space in the worst-case than 2D.

What's the break-even point for 1D to use less space?

Assuming a single array for simplicity

1D Disk Usage:
* Index: H*V*N/C
* Chunks: U*2*V*N (U uniqueness factor)
2D Disk Usage:
* Chunks: 2*V*N

Solving for equality yields U = 1 - H/(2C) as the break-even point at which both strategies use the same disk space.

With the last_used_epoch tags for GC:

1D Disk Usage:
* Index: H*V*N/C
* Chunks: U*V*N/C*(2C + 8)

We get U = (2C - H)/(2C + 8)
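Both break-even formulas are the same expression with a different per-chunk overhead, which makes them easy to tabulate. The function name and the sample parameters below are only for illustration; the linked spreadsheet remains the reference for the actual numbers.

```python
def break_even_uniqueness(c: int, h: int, gc_tag_bytes: int = 0) -> float:
    """Uniqueness factor U at which 1D and 2D chunking use the same disk space.

    With gc_tag_bytes = 0 this is 1 - H/(2C); with an 8-byte tag it is
    (2C - H)/(2C + 8).
    """
    return (2 * c - h) / (2 * c + gc_tag_bytes)

for h in (32, 16, 8):
    plain = break_even_uniqueness(240, h)
    tagged = break_even_uniqueness(240, h, gc_tag_bytes=8)
    print(f"C=240 H={h}: U={plain:.3f} without GC tags, U={tagged:.3f} with")
```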

Some numbers are here: https://docs.google.com/spreadsheets/d/1UEWfQ_Oh73uarXLqTuKz9JERoRGutfLr-M6bGqX9u-w

It looks like the de-duplication could prove useful, particularly if we allow ourselves 8 or 16 byte BLAKE2b hashes. However, due to the extra complexity of this approach, I'm going to start by implementing 2D chunking.

Data Store Compression

Although we could rely on the compression provided by a key-value store or relational database, support is patchy between databases, and we can probably do better with a custom-designed format. For completeness, here's a quick summary of compression support across different stores.

LevelDB uses Snappy, but LMDB doesn't support any form of compression.

Postgres has no support for compressed tables, but can compress individual rows, which could work with a one-array-per-row storage schema. MariaDB supports compressed pages, which would work with one-array-per-row or one-element-per-row.

Run-Length Encoding

This section is mostly obsoleted by the chunking sections above.

To compress the long runs of identical distances, it seems run-length encoding (RLE) could be effective. Noting that the min target for the current epoch

e_{C}

will always be

⊤

, the run-length encoding

E

of an ideal

{min_targets}_{v}

array would be:

\begin{aligned} {min_targets}_{v} = & [2, \dots, 2, ⊤, 2, \dots] \\ E ({min_targets}_{v}) = & [(e_{C} mod N, 2), (1, ⊤), (N - (e_{C} mod N) - 1, 2)] \end{aligned}

We're using tuples of

(count, value)

to represent the run-length encoding (a Vec<(u16, u16)>).
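A minimal encoder and decoder for this `(count, value)` representation could look like the sketch below. It uses plain Python lists and a tiny `N`; `TOP` is an assumed u16 encoding of the top value.

```python
from typing import List, Tuple

TOP = 2**16 - 1  # assumed u16 encoding of the top value

def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal values into (count, value) runs."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1] = (runs[-1][0] + 1, v)
        else:
            runs.append((1, v))
    return runs

def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (count, value) runs back into the full array."""
    out: List[int] = []
    for count, value in runs:
        out.extend([value] * count)
    return out

# Ideal min targets array with N = 8 and the current epoch at slot 3.
ideal = [2, 2, 2, TOP, 2, 2, 2, 2]
encoded = rle_encode(ideal)
print(encoded)                       # [(3, 2), (1, 65535), (4, 2)]
print(rle_decode(encoded) == ideal)  # True
```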

With this encoding, it doesn't matter what data store we use, we can store a value (or row) for each validator index, and read/write it as it changes, without using much space on disk.

To really optimise for the best-case scenario, we could use a run-length encoded queue which always stores the value for the current epoch at the last index. In the best case this could remove the need to write the queue to disk every epoch, in a flow like the following:

Initially at epoch e, we have min_targets = [(N - 1, 2), (1, 0xffff)]
Perform the epoch update: min_targets = [(N - 2, 2), (2, 0xffff)]
Apply attestations for epoch e + 1 in memory, new value: min_targets = [(N - 1, 2), (1, 0xffff)]. This is the same as the value already stored on disk – no need to write it.

Estimated Disk Usage

In the best case, the min-max target queues could require only four u16s each (8 bytes). Assuming the data store requires a u64 to encode the length of values/rows, this pushes the minimum size to 16 bytes. For 300k validators, we're looking at:

Unrealistic Best Case: 2 arrays (min/max) * 16 bytes * 300k validators = 9.6 MB (tiny!)

Keep in mind this is just the disk usage for the min-max arrays, and that we also need to store all the attestations, which will require hundreds of gigabytes.

This is probably also very unrealistic, so let's look at some less uniform attestation patterns. Here are the expected sizes for different run-length encodings with X% of the full number of elements:

2% uniqueness: 2 * (0.02 * 54_000 * 4 + 8) * 300_000 = 2.6 GB
10% uniqueness: 2 * (0.1 * 54_000 * 4 + 8) * 300_000 = 13.0 GB
30% uniqueness: 2 * (0.3 * 54_000 * 4 + 8) * 300_000 = 38.9 GB
50% uniqueness: 2 * (0.5 * 54_000 * 4 + 8) * 300_000 = 64.8 GB
100% uniqueness: 2 * (1.0 * 54_000 * 4 + 8) * 300_000 = 129.6 GB

Beyond 50% uniqueness the RLE is actually less efficient than storing every value, as 2 arrays * 54k epochs * 2 bytes (u16) * 300k validators = 64.8 GB. We could have a flag bit to switch between RLE and direct encoding if 50%+ uniqueness is observed in practice. Hopefully it won't be! And if we can get some compression from the filesystem or the DB, that will help too (particularly if it acts across rows/values).
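The estimates above all come from one formula, so they can be regenerated for other parameters. In this sketch each run costs 4 bytes (two u16s) and each value carries an 8-byte length prefix, as assumed in the text. The function names are hypothetical.

```python
def rle_disk_bytes(
    uniqueness: float,
    n: int = 54_000,
    validators: int = 300_000,
    run_bytes: int = 4,
    length_prefix: int = 8,
) -> float:
    """2 arrays * (runs * bytes per run + length prefix) * validators."""
    return 2 * (uniqueness * n * run_bytes + length_prefix) * validators

def direct_disk_bytes(n: int = 54_000, validators: int = 300_000) -> int:
    """2 arrays * N epochs * 2 bytes (u16) * validators."""
    return 2 * n * 2 * validators

for u in (0.02, 0.1, 0.3, 0.5, 1.0):
    print(f"{u:.0%} uniqueness: {rle_disk_bytes(u) / 1e9:.1f} GB")
print(f"direct encoding: {direct_disk_bytes() / 1e9:.1f} GB")
```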

Estimated Write Throughput

If every validator attests every epoch we are going to be reading (and often writing) every min-max array on disk… That's a lot of data! In the worst-case that's 64.8 GB per epoch, or roughly 170 MB/s with 384-second epochs… That's just within the realm of feasibility if we can avoid writing unchanged values and sit below the worst-case most of the time.

Architecture

Stand-alone process
Ingest attestations from the BN via HTTP (hex-encoded SSZ?)

TODO

The Plan

Write a Rust library to prototype the min-max arrays and their update protocols
Implement 2D chunking
Choose a database (leaning towards LMDB) and get a prototype slasher running on Altona to estimate the effectiveness of our design. In practice the network might be kind to us…

Diederik Loerakker

2020/07/10 15:07:03

then the arrays are completely uniform:

Maybe add a more explicit note that it is uniform, but only per validator. However, different validators will likely have similar arrays, and this should be considered for compression. (Edited)

2020/07/10 15:09:00

Before going into compression, a section about splitting the history of a single validator would help. And possibly doing the same for all validators at aligned chunks. Recent history is very "hot", while we commonly may never see an attestation again for older epochs. (Edited)

2020/07/10 15:11:54

one-array-per-row

Don't rewrite the whole history of a validator please! This would apply to all validators. That's a lot of writes. Chunkification, splitting at time intervals, batching groups (or all) validators, makes more sense. It can be a rolling thing, where you keep the last N batches in memory, and the others are cold on disk. (Edited)

2020/07/10 15:16:45

It's definitely applicable, but before going into ideal storage layout, it does not help as much. If multiple validators are batched together, and split between time, I expect blobs of data that are very regular, but have a little noise. Most sparse-detail compression techniques will work, and don't require rolling our own compression. Instead time is better spent on figuring out what chunkification would look like. Good insight with run length encoding though :+1: (Edited)

2020/07/10 15:18:43

which will require hundreds of gigabytes.

But are mostly completely cold. Except:
- a fast cuckoo filter to check membership
- an index of roots to find the attestation by

(Edited)

2020/07/10 15:22:57

uniqueness

If we use existing compression techniques, it's not limited to uniqueness. E.g. there may very well be common distance patterns like 121212, which can be symbolic in compression, and avoid poor runlength encoding. (Edited)

2020/07/10 15:24:17

64.8 GB

Remember that we also like to have the data in memory to hit it very quickly. If we chunkify the storage, we can have just the last few chunks in memory, and won't hit anything close to the 100+GB. (Edited)

Guest Reed

2024/08/14 03:21:37

ok sounds good

2020/07/10 15:28:40

. That's a lot of dat

Don't rewrite the whole array, chunkify over time. E.g. a week of attestations = 1575 epochs = about 2 GB of memory at 50% uniqueness (probably much less). Split it up in 32 chunks (w.r.t. time), and rotate them out over time. Only overwriting the touched data. (Edited)

2020/07/10 15:30:24

min-max arrays

I would take smaller steps, and start with just the update protocol, no compression or chunkification. And then let's discuss chunkification some more first, I think it's more essential than the RLE part. (Edited)

2020/07/10 15:31:49

. I

121112, 121111, etc., very repeated, will be common. RLE may work ok-ish, but if there's a lot of low-resolution noise, then other existing compression may probably work better. (Edited)

2020/07/10 15:33:55

Min-Max Slasher

Yay, thanks for looking into my slashing detection algo idea. This write-up is better than the original! However, there are still some points to look at before starting a prototype implementation, I added some feedback to this post, and happy to chat about them. (Edited)