These are my design notes for a min-max slasher for Lighthouse, the Ethereum 2.0 client from Sigma Prime.
In this post Proto describes a scheme for efficiently storing attestation data so that surround slashings can be quickly detected. Detecting surrounds is the difficult part of building a slasher, so we'll focus mainly on that for now and come back to detecting double votes later.
Proto's construction is as follows:
The most conceptually straight-forward variant involves storing the min and max targets directly.
Let AttestationData be the attestation data type from the Eth2 spec, with source.epoch and target.epoch fields, and let N be the length of the weak subjectivity period in epochs (54K epochs).

For each validator, we store two arrays of length N: min_targets and max_targets. The entry min_targets[e] holds the minimum target epoch of any attestation by that validator with source epoch greater than e, and max_targets[e] holds the maximum target epoch of any attestation with source epoch less than e.
In natural language:

If we have a new attestation with source epoch s and target epoch t, and min_targets[s] < t, then some previously seen attestation has a source greater than s and a target less than t, i.e. the new attestation surrounds it and is slashable (Lemma 1). Similarly, the max targets array allows us to check whether a new attestation is surrounded by an existing one: if max_targets[s] > t, then some previously seen attestation has a source less than s and a target greater than t, so the new attestation is slashable (Lemma 2).
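As a concrete illustration of Lemmas 1 & 2, a slashability check along these lines might look like the sketch below. The function name and argument layout are illustrative rather than anything from Lighthouse, and the wrap-around indexing introduced next is ignored for simplicity.

```python
# Sketch of the checks implied by Lemmas 1 & 2 (illustrative names only).
def is_surround_slashable(min_targets, max_targets, attestation):
    s = attestation.source.epoch
    t = attestation.target.epoch
    # Lemma 1: some stored attestation has source > s and target < t,
    # so the new attestation surrounds it.
    if min_targets[s] < t:
        return True
    # Lemma 2: some stored attestation has source < s and target > t,
    # so it surrounds the new attestation.
    if max_targets[s] > t:
        return True
    return False
```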
So far we've considered arrays of length 54K epochs without considering what happens once the chain progresses past epoch 54K. In this case, we want to wrap around the end of the array so that we are still storing the same amount of history, while overwriting the oldest values.
For an epoch e, we store its data at index e % N, overwriting the entry that previously held epoch e - N.

This means that the arrays now require knowledge of the current epoch, which necessitates some clarifications:
* Let INT_MAX denote the maximum value representable by an array element (0xffff for a u16).
* For min targets, the entry for each new epoch is initialised to INT_MAX. Initialising the min target for new epochs to INT_MAX represents a lack of information, and ensures that new attestations with that source epoch are not incorrectly flagged as slashable (Lemma 1). The epoch update is trivial: no attestation can yet have a source epoch greater than the new current epoch, so there is nothing to carry over.
* For max targets, the initial 0 value represents a lack of information, and ensures new attestations are not slashable (Lemma 2). The epoch update is more interesting: if a maximum target is known for the current epoch, then it also applies to the next epoch, because any attestation with a source epoch less than the current epoch also has a source epoch less than the next epoch. So when the current epoch advances, the max target for the new epoch is initialised from the previous current epoch's value.
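Based on these rules, advancing a validator's arrays to a new current epoch might look like the following sketch. The function name is mine, and the constants are just the values discussed above.

```python
N = 54_000        # weak subjectivity period in epochs, as above
INT_MAX = 0xffff  # maximum value of a u16, as above

# Advance the arrays by one epoch: the new epoch's min target starts at
# INT_MAX (no information), while its max target carries over from the
# previous current epoch.
def advance_current_epoch(min_targets, max_targets, current_epoch):
    next_epoch = current_epoch + 1
    min_targets[next_epoch % N] = INT_MAX
    max_targets[next_epoch % N] = max_targets[current_epoch % N]
    return next_epoch
```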
Lemmas 1 & 2 give us rules for checking when a new attestation is slashable, but we also need to update the min-max arrays to account for the new attestation. We can do this efficiently using the following algorithms:
# Update every entry from one epoch prior to `attestation.source.epoch`,
# working backwards. Bail out of the update early if an existing target is
# already less than or equal to the new min target, because that min target
# will also apply to all prior epochs.
def update_min_targets(
    min_targets: [u16],
    attestation: AttestationData,
    current_epoch: Epoch,
):
    e = attestation.source.epoch
    min_epoch = 0 if current_epoch < N else current_epoch - N
    while e > min_epoch:
        e -= 1
        if attestation.target.epoch < min_targets[e % N]:
            min_targets[e % N] = attestation.target.epoch
        else:
            return
# Update every entry from one epoch after `attestation.source.epoch` up to the
# current epoch. Bail out early if an existing target is already greater than
# or equal to the new max target, because that max target will also apply to
# all subsequent epochs.
def update_max_targets(
    max_targets: [u16],
    attestation: AttestationData,
    current_epoch: Epoch,
):
    e = attestation.source.epoch + 1
    while e <= current_epoch:
        if attestation.target.epoch > max_targets[e % N]:
            max_targets[e % N] = attestation.target.epoch
            e += 1
        else:
            return
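For clarity, here is how the pieces might fit together when processing a single attestation for one validator: a sketch combining the Lemma 1 & 2 checks with the update functions above. Double-vote detection and fetching the conflicting stored attestation are omitted, and the function name is illustrative.

```python
# N, update_min_targets and update_max_targets are as defined above.
def process_attestation(min_targets, max_targets, attestation, current_epoch):
    s = attestation.source.epoch
    t = attestation.target.epoch
    # Lemma 1 & 2 checks against the wrapped arrays.
    slashable = min_targets[s % N] < t or max_targets[s % N] > t
    # Fold the new attestation into the arrays either way.
    update_min_targets(min_targets, attestation, current_epoch)
    update_max_targets(max_targets, attestation, current_epoch)
    return slashable
```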
Rather than storing the target of attestations in the arrays directly, we can instead equivalently store their distance from the epoch of the array entry, i.e. the entry for epoch e stores target - e.

This has several advantages:

* The values stay small. The distance will always fit inside a u16, because within the weak subjectivity period it is bounded by the array length N.
* If all attestations are targeting the epoch immediately after their source epoch, then the arrays are completely uniform: long runs of a single repeated distance, which are ideal for compression (see the sketch below).
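Here is a minimal sketch of the distance encoding, under my reading that each entry stores its target's distance from the entry's own epoch. Details such as how a max target below the entry's epoch would be represented may differ in practice.

```python
# Distance encoding sketch: entries store target - entry_epoch.
def encode_distance(target_epoch, entry_epoch):
    distance = target_epoch - entry_epoch
    # Within the weak subjectivity period the distance is bounded by N = 54K,
    # so it always fits in a u16.
    return distance

def decode_target(distance, entry_epoch):
    return entry_epoch + distance

# With ideal attestations (target == source + 1), the min-target entry for
# epoch e is set by the attestation with source e + 1 and target e + 2, so
# every entry holds the same distance of 2 and the array is perfectly uniform.
```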
In implementation, we have several options for compressing these arrays, depending on our choice of data store.
It is likely that as well as repetition within the arrays for a single validator, there will also be repetition across the attestation behaviours of validators. It is essential that we take advantage of this repetition to reduce disk usage, as naively the arrays could use up to 64.8 GB, and we don't want to be rewriting entire arrays all the time.
Here's a first draft of a chunking scheme, which comes with some drawbacks:
* Let C be the number of array elements per chunk.
* Keep an index mapping (validator_index: u64, chunk_idx: u16) => ChunkHash, alongside a content-addressed store of ChunkHash => chunk (sketched below).
* When a new attestation appears:
  * Look up the chunk_hash for (validator_index, chunk_idx), where chunk_idx = att.source.epoch / C.
  * Apply the min-max updates to the chunk(s), then write each updated chunk back and store the new chunk_hash for (validator_index, chunk_idx). This will involve hashing each updated chunk, so we should choose a cheap hash function (maybe BLAKE2b).
* In order to implement garbage collection, we could tag each chunk with the epoch at which it was last used. This would use an extra 8 bytes per chunk, and require an extra write to every chunk used in an epoch.
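To make the layout concrete, here is a toy sketch of the index and content-addressed chunk store using in-memory dicts and Python's hashlib.blake2b. The names and the C = 240 value are illustrative, not an actual Lighthouse schema.

```python
import hashlib

C = 240  # array elements per chunk (~1 day of epochs); illustrative value

# Toy in-memory stand-ins for the two on-disk tables; a real implementation
# would use the database backing the slasher.
chunk_index = {}  # (validator_index, chunk_idx) => chunk hash
chunk_store = {}  # chunk hash => chunk bytes (content-addressed, de-duplicated)

def chunk_idx(source_epoch):
    return source_epoch // C

def chunk_hash(chunk_bytes):
    # Cheap hash for content addressing (e.g. BLAKE2b, as suggested above).
    return hashlib.blake2b(chunk_bytes, digest_size=32).digest()

def read_chunk(validator_index, idx):
    h = chunk_index.get((validator_index, idx))
    return chunk_store.get(h) if h is not None else None

def write_chunk(validator_index, idx, chunk_bytes):
    h = chunk_hash(chunk_bytes)
    chunk_store[h] = chunk_bytes  # identical chunks collapse to a single entry
    chunk_index[(validator_index, idx)] = h
```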
Some properties of this scheme:

* De-duplication: a chunk whose contents are common across validators (e.g. a uniform 232323... pattern) can be stored once and referenced from multiple (validator_id, chunk_idx) pairs.
* Each epoch, at most 2 * V chunks need to be written to disk. With C=240 (~one day) and V=300k this is only 2 * 2 * 240 * 300_000 = 288 MB.
* Hashing chunks is cheap: a quick benchmark of BLAKE2b on a [u16; 240] array shows that it runs in less than 1μs! For arrays of 2048 elements (far beyond what we're likely to want) the hashing time is ~4μs.
* The index requires V * N/C * H bytes, where V is num validators and H is the size of a hash (probably 32 bytes). That's around 2.16 GB for C=240, smaller with larger chunk sizes.
* The chunk data requires some uniqueness fraction U times the full size 2 * 2 * V * N (64.8 GB for V=300k, U=1.0).

Here's another approach to chunking, due to Proto.
Split each array into "horizontal" runs of length C. Then, take K runs of length C from different validators and store these as a chunk on disk (size 2 * K * C bytes).
Keep a mapping chunk_id => chunk, where chunk_id is computed from validator index and epoch like so (not necessary in an SQL DB):
validator_chunk_number = validator_index / K
chunk_number = att.source.epoch / C
chunk_id = validator_chunk_number * C + chunk_number
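For illustration, the full mapping from (validator_index, epoch) to a chunk and a position within it might look like the sketch below. The K and C values and the function name are placeholders, and integer division is written as // in Python.

```python
C = 240  # epochs per chunk (illustrative)
K = 256  # validators per chunk (illustrative)

def chunk_location(validator_index, epoch):
    # Follows the chunk_id formula above.
    validator_chunk_number = validator_index // K
    chunk_number = epoch // C
    chunk_id = validator_chunk_number * C + chunk_number
    # Position within the 2 * K * C byte chunk: which validator's run,
    # and which element of that run.
    sub_chunk = validator_index % K
    offset = epoch % C
    return chunk_id, sub_chunk, offset
```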
The update protocol for a new att is:
* Load the chunk for chunk_id(validator_index, att.source.epoch).
* Apply the update to the validator_index % K sub-chunk.
* K chunks containing all validators get updated… TODO

The real differentiation between 1D and 2D is:
1D provides de-duplication but requires garbage collection and uses more space in the worst-case than 2D.
What's the break-even point for 1D to use less space?
Assuming a single array for simplicity
1D Disk Usage:
* Index: H*V*N/C
* Chunks: U*2*V*N (U = uniqueness factor)
2D Disk Usage:
* Chunks: 2*V*N
Solving for equality yields U = 1 - H/(2C) as the break-even point at which both strategies use the same disk space.
With the last_used_epoch
tags for GC:
1D Disk Usage:
* Index: H*V*N/C
* Chunks: U*V*N/C*(2C + 8)
We get U = (2C - H)/(2C + 8)
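Plugging representative numbers into the two break-even formulas (C = 240 and 32-byte hashes, both illustrative) gives a feel for how much de-duplication is needed:

```python
# Break-even uniqueness factor U at which 1D chunking uses the same disk
# space as 2D, for C = 240 elements per chunk and H = 32-byte hashes.
C = 240
H = 32

u_plain = 1 - H / (2 * C)               # without last_used_epoch tags
u_with_gc = (2 * C - H) / (2 * C + 8)   # with 8-byte last_used_epoch tags

print(f"break-even U without GC tags: {u_plain:.3f}")   # ~0.933
print(f"break-even U with GC tags:    {u_with_gc:.3f}")  # ~0.918
```

So under these parameters, de-duplication has to eliminate at least roughly 7-8% of the chunk data before 1D chunking beats 2D on disk usage.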
Some numbers are here: https://docs.google.com/spreadsheets/d/1UEWfQ_Oh73uarXLqTuKz9JERoRGutfLr-M6bGqX9u-w
It looks like the de-duplication could prove useful, particularly if we allow ourselves 8 or 16 byte BLAKE2b hashes. However, due to the extra complexity of this approach, I'm going to start by implementing 2D chunking.
Although we could rely on the compression provided by a key-value store or relational database, support is patchy between databases, and we can probably do better with a custom-designed format. For completeness, here's a quick summary of compression support across different stores.
LevelDB uses Snappy, but LMDB doesn't support any form of compression.
Postgres has no support for compressed tables, but can compress individual rows, which could work with a one-array-per-row storage schema. MariaDB supports compressed pages, which would work with one-array-per-row or one-element-per-row.
This section is mostly obsoleted by chunking, above ^
To compress the long runs of identical distances, it seems run-length encoding (RLE) could be effective. Noting that the min target for the current epoch always starts at the fresh sentinel value, the encoded array for a well-behaved validator will usually be one long run plus a short run at the end.
We're using tuples of (run length, value), stored as a Vec<(u16, u16)> (in Rust notation).
With this encoding it doesn't matter what data store we use: we can store a value (or row) for each validator index, and read/write it as it changes, without using much space on disk.
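A minimal sketch of this run-length encoding (the helper names are mine):

```python
# Run-length encoding as (run length, value) pairs, i.e. a Vec<(u16, u16)>.
def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1] = (runs[-1][0] + 1, v)
        else:
            runs.append((1, v))
    return runs

def rle_decode(runs):
    out = []
    for length, value in runs:
        out.extend([value] * length)
    return out

# e.g. a uniform min-target array with one fresh sentinel entry at the end:
# rle_encode([2] * (N - 1) + [0xff]) == [(N - 1, 2), (1, 0xff)]
```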
To really optimise for the best-case scenario, we could use a run-length encoded queue which always stores the value for the current epoch at the last index. In the best case this could remove the need to write the queue to disk every epoch, in a flow like the following:
1. At epoch e, we have min_targets = [(N - 1, 2), (1, 0xff)] stored on disk.
2. The epoch advances, so the in-memory queue becomes min_targets = [(N - 2, 2), (2, 0xff)].
3. The validator's attestation for epoch e + 1 arrives and is applied in memory, new value: min_targets = [(N - 1, 2), (1, 0xff)]. This is the same as the value already stored on disk – no need to write it.

In the best case, the min-max target queues could require only four u16s each (8 bytes). Assuming the data store requires a u64 to encode the length of values/rows, this pushes the minimum size to 16 bytes per queue. For 300k validators, we're looking at roughly 2 * 16 * 300_000 bytes ≈ 9.6 MB.
Keep in mind this is just the disk usage for the min-max arrays, and that we also need to store all the attestations, which will require hundreds of gigabytes.
This is probably also very unrealistic, so let's look at some less uniform attestation patterns. Here are the expected sizes for different run-length encodings with X% of the full number of elements:
Beyond 50% uniqueness the RLE is actually less efficient than storing every value, as each (run length, value) tuple takes 4 bytes, twice the size of a raw entry, and storing every value directly costs 2 arrays * 54k epochs * 2 bytes (u16) * 300k validators = 64.8 GB. We could have a flag bit to switch between RLE and direct encoding if 50%+ uniqueness is observed in practice. Hopefully it won't be! And if we can get some compression from the filesystem or the DB, that will help too (particularly if it acts across rows/values).
If every validator attests every epoch, we are going to be reading (and often writing) every min-max array on disk… That's a lot of data! In the worst case that's 64.8 GB per epoch, or 180 MB/s… That's just within the realm of feasibility if we can avoid writing unchanged values and sit below the worst case most of the time.
TODO