
Optimized Prysm Slasher Design

Background

Currently, our slasher implementation uses a simple approach towards min-max span detection. The problem of slashing detection is non-trivial: a slasher needs to keep track of attesting histories for every single validator in some space-optimal format, and it must observe every single attestation the beacon node sees to perform slashing detection. The hardest part about building a slasher is optimizing its resource use. Our end goal is for an average staker to be able to run a slasher in the same way they can run a beacon node.

Unfortunately, our current approach does not fare well in terms of resource use. We need to revamp it from the ground up and test the assumptions we made when first developing it. Thankfully, Lighthouse has done a lot of work on optimized slashing detection here: https://hackmd.io/@sproul/min-max-slasher and their slasher implementation is a great reference that will be the main backbone of this design discussion. Also, see protolambda's exploration of this problem here: https://github.com/protolambda/eth2-surround.

We have a working prototype of this design document here: https://github.com/prysmaticlabs/prysm/tree/advancedslasher

Explaining surround votes

The hardest part about slashing detection comes when considering how to detect surround votes for attesters. As such, this design will only focus on that aspect of slasher.

At a high-level, a surround vote happens when an attester creates an attestation that "contradicts" history and chain consensus, for example, an attestation implying they believe consensus should be different from what the chain has already decided. Let's take a look at an example first:

A = (source: 1, target: 2)
B = (source: 0, target: 3)


The second attestation, B, is surrounding the first one, A, because the validator is changing their mind by saying the justified epoch is actually earlier than what their previous attestation claimed. This is a slashable offense. Conversely, A is surrounded by B. If we see that a validator created this pair of attestations, in either order, that validator needs to be slashed. That is, both surrounding and surrounded attestations count as slashable offenses.

The reason this sort of event is difficult to detect efficiently is that, with a naive approach, we need to keep track of every (source, target) pair for every single validator, and for every new attestation that comes in, we need to loop through all of these pairs for each validator to cross-check whether there is a slashable event. As expected, the memory and time complexity of the naive approach is unreasonable.

Naive approach:

  1. Receive an attestation
  2. Store source_epoch => [target_epoch] in the DB
  3. When checking if an attestation att is slashable for a public key:
    • For every (source, [target]) pair in the DB
      • Run checkSurround(att.source, source, att.target, target)

Better way to detect surround votes

Formally, a surround vote is defined as:

an attestation A surrounds B if and only if

A.source.epoch < B.source.epoch < B.target.epoch < A.target.epoch
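
To make the definition concrete, here is a minimal Go sketch of this predicate together with the naive detection loop from the previous section; the Vote type and function names are illustrative assumptions, not Prysm's actual API.

package main

import "fmt"

// Vote is a simplified attestation holding only its source and target epochs.
type Vote struct {
	SourceEpoch uint64
	TargetEpoch uint64
}

// checkSurround returns true if a surrounds b, per the definition above:
// a.source.epoch < b.source.epoch < b.target.epoch < a.target.epoch.
func checkSurround(a, b Vote) bool {
	return a.SourceEpoch < b.SourceEpoch &&
		b.SourceEpoch < b.TargetEpoch &&
		b.TargetEpoch < a.TargetEpoch
}

// isSlashableNaive is the naive approach: compare an incoming attestation
// against every previously stored attestation for the same validator.
func isSlashableNaive(incoming Vote, history []Vote) bool {
	for _, prev := range history {
		if checkSurround(incoming, prev) || checkSurround(prev, incoming) {
			return true
		}
	}
	return false
}

func main() {
	a := Vote{SourceEpoch: 1, TargetEpoch: 2}
	b := Vote{SourceEpoch: 0, TargetEpoch: 3}
	fmt.Println(isSlashableNaive(b, []Vote{a})) // true: b surrounds a
}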

Using this definition, we can create two arrays that help us quickly detect surround votes, namely min_targets and max_targets.

Where we define:

min_targets[i] is the minimum target of an attestation with source epoch greater than i

max_targets[i] is the maximum target of an attestation with source epoch less than i

In Lighthouse's design doc https://hackmd.io/@sproul/min-max-slasher, they mathematically prove surround votes can be caught using these arrays because, for an attestation A:

A surrounds an existing attestation if and only if A.target.epoch > min_targets[A.source.epoch]

and

A is surrounded by an existing attestation if and only if A.target.epoch < max_targets[A.source.epoch]

Given we only care about attestations within a weak subjectivity period, for every validator, we have to store 2 arrays of type []uint64, with length 54k.

Even then, for many validators, each of these arrays is wasteful given we do not actually need to store absolute target epochs at all. Instead, we can store the difference between each epoch and the attestation target epoch, guaranteeing values in our arrays are bounded by MaxUint16 (the weak subjectivity period is well under 65,536 epochs), further reducing the space requirements. We will call these epoch-distance arrays min_spans and max_spans. Let's take a closer look to ensure we understand what these arrays are.

min_spans[e] is defined as the min((att.target.epoch - e) for att in attestations) where att.source.epoch > e. That is, it is the minimum distance between the specified epoch and all attestation target epochs a validator has created where att.source.epoch > e.

max_spans[e] is defined as the max((att.target.epoch - e) for att in attestations) where att.source.epoch < e. That is, it is the maximum distance between the specified epoch and all attestation target epochs a validator has created where att.source.epoch < e.

Remember, the reason we do this is to optimize for storage. Differences between epochs can be stored as uint16 values instead of uint64, giving us precious space savings. The above is pretty hard to understand without an example, so let's try to construct some min and max spans with the following attestation history, which would be the ideal attesting history since genesis up to and including epoch 4.

(source: 0, target: 1)
(source: 1, target: 2)
(source: 2, target: 3)
(source: 3, target: 4)

The attestations above represent ideal network conditions. We'll build up min spans first, computing entries for epochs 0 through 3.

Building min spans manually

For min_spans[0], we find all attestations with source > 0. Then, for each, we compute the difference (target - 0). The smallest such value is 2, so we set the value as 2.

For min_spans[1], we find all attestations with source > 1. Then, for each, we compute the difference (target - 1). The smallest such value is 2, so we set the value as 2.

For min_spans[2], we find all attestations with source > 2. Then, for each, we compute the difference (target - 2). The smallest such value is 2, so we set the value as 2.

For min_spans[3], we find all attestations with source > 3. We have none, so we set the value to undefined

Our list looks like this min_spans = [2, 2, 2, undefined]

Building max spans manually

For max_spans[0], we find all attestations with source < 0. We have none, so we set the value as 0. The reason we can use 0 here is because detecting a surrounded vote given an attestation A involves checking whether A's target distance is less than max_spans[A.source]. Since no distance can be less than 0, 0 serves as a good neutral element for max_spans. More on this is explained in the next section.

For max_spans[1], we find all attestations with source < 1. Then, for each, we compute the difference (target - 1). The largest such value is 0, so we set the value as 0.

For max_spans[2], we find all attestations with source < 2. Then, for each, we compute the difference (target - 2). The largest such value is 0, so we set the value as 0.

For max_spans[3], we find all attestations with source < 3. Then, for each, we compute the difference (target - 3). The largest such value is 0, so we set the value as 0.

Our list looks like this: max_spans = [0, 0, 0, 0]
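
To sanity-check the two worked examples above, here is a brute-force Go sketch that computes both arrays directly from the definitions (this is purposely not the optimized algorithm); using math.MaxUint16 as the "undefined" sentinel for min spans is an assumption made only for illustration.

package main

import (
	"fmt"
	"math"
)

// Vote is a simplified attestation holding only its source and target epochs.
type Vote struct {
	Source uint64
	Target uint64
}

// buildSpans computes min_spans and max_spans for epochs [0, numEpochs) straight
// from the definitions. math.MaxUint16 stands in for "undefined" in min spans,
// and 0 is the neutral element for max spans.
func buildSpans(atts []Vote, numEpochs uint64) (minSpans, maxSpans []uint16) {
	minSpans = make([]uint16, numEpochs)
	maxSpans = make([]uint16, numEpochs)
	for e := uint64(0); e < numEpochs; e++ {
		minSpans[e] = math.MaxUint16
		maxSpans[e] = 0
		for _, att := range atts {
			diff := int64(att.Target) - int64(e)
			if att.Source > e && diff < int64(minSpans[e]) {
				minSpans[e] = uint16(diff)
			}
			if att.Source < e && diff > int64(maxSpans[e]) {
				maxSpans[e] = uint16(diff)
			}
		}
	}
	return
}

func main() {
	atts := []Vote{{0, 1}, {1, 2}, {2, 3}, {3, 4}}
	minSpans, maxSpans := buildSpans(atts, 4)
	fmt.Println(minSpans) // [2 2 2 65535] (65535 = undefined)
	fmt.Println(maxSpans) // [0 0 0 0]
}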

Detecting surround votes with min/max spans

Let's take a look at an example of a surround vote and see why these spans arrays are so helpful to us:

A = (source: 3, target: 4)
B = (source: 2, target: 5)

The second attestation, B, is surrounding the first one, A. How can we verify that B is surrounding A using our spans? We use min_spans in this case, with the following condition explained perfectly by the Lighthouse team:

If the target epoch of a new attestation, B, is greater than the min target for its source, that means there exists a previously signed attestation A with A.source > B.source and A.target < B.target, i.e. B surrounds A. Moreover, because A is the attestation with the minimum target amongst attestations that could be surrounded by B, we know that if the check fails, then B cannot surround any other previously signed attestation.

In other words, B surrounds A if and only if B.target - B.source > min_spans[B.source] (recall that min_spans stores distances to targets rather than the targets themselves).

Ok, so we want to check whether

B.target - B.source = 5 - 2 = 3 > min_spans[B.source = 2]

First of all, let's update our min spans with attestation A.

For min_spans[2], we find all attestations with source > 2. Then, for each, we compute the difference (target - 2). The smallest such value is (4 - 2) = 2, so we set the value as 2.

So then 3 > 2, and we have a surround vote!

That's great! Now what if we saw B first and then A? How can we verify that A is surrounded by B? We use max_spans in this case with the following condition:

A is surrounded by B if and only if A.target - A.source < max_spans[A.source].

So then, we want to check

A.target - A.source = 4 - 3 = 1
max_spans[A.source = 3] = ???

To do this, let's update our max spans with attestation B at epoch 3.

For max_spans[3], we find all attestations with source < 3. Then, for each, we compute the difference (target - 3). The largest such value is (5 - 3) = 2, so we set the value as 2.

A.target - A.source = 1
max_spans[A.source = 3] = 2

So then 1 < 2, and we have a surrounded vote! The rationale is as follows:

If the target epoch of an attestation A is less than the max target for its source, that means there exists a previously signed attestation B with B.source < A.source and B.target > A.target, i.e. B surrounds A. Moreover, because B's target is the maximum target amongst attestations that could surround A, we know that if the check fails, then A cannot be surrounded by any other previously signed attestation.

These spans arrays give us a nice, compact way of mathematically checking surround votes quite easily!
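
Putting the two checks together, here is a Go sketch expressed over span distances (target minus source), matching the worked examples above; the function names and the MaxUint16 "undefined" sentinel are illustrative assumptions rather than Prysm's actual API.

package main

import (
	"fmt"
	"math"
)

// surroundsExisting: the incoming attestation surrounds some previously seen
// attestation if its distance exceeds min_spans at its source epoch. An
// "undefined" entry (MaxUint16) can never be exceeded, so it never false-positives.
func surroundsExisting(source, target uint64, minSpans []uint16) bool {
	return uint16(target-source) > minSpans[source]
}

// surroundedByExisting: the incoming attestation is surrounded by some previously
// seen attestation if its distance is below max_spans at its source epoch.
func surroundedByExisting(source, target uint64, maxSpans []uint16) bool {
	return uint16(target-source) < maxSpans[source]
}

func main() {
	// After processing A = (source: 3, target: 4):
	// min_spans = [4, 3, 2, undefined].
	minSpans := []uint16{4, 3, 2, math.MaxUint16}
	fmt.Println(surroundsExisting(2, 5, minSpans)) // true: B = (2, 5) surrounds A

	// After processing B = (source: 2, target: 5):
	// max_spans = [0, 0, 0, 2].
	maxSpans := []uint16{0, 0, 0, 2}
	fmt.Println(surroundedByExisting(3, 4, maxSpans)) // true: A = (3, 4) is surrounded by B
}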

Towards an optimal approach: 1D chunking

The great thing about these span arrays is that under ideal network conditions, as we covered in the previous section and as we are seeing on mainnet today, min and max spans for most validators will look like this (assuming each of their attestations' target epochs immediately follows their source epoch):

min_spans = [2, 2, 2, ...]

and for max spans:

max_spans = [0, 0, 0, ...]

Let's think about the naive way of storing these on disk. Something we could do is store two buckets in boltDB, called minSpansBucket and maxSpansBucket where the keys are validator indices and the values are the respective arrays of 54000 elements (the weak subjectivity period).

Since each array is a []uint16, and we have 54000 elements in each, the total data stored on disk for 100,000 validators is:

min_spans_size = 54000 * 2 bytes (since each elem is uint16) = 108 KB
max_spans_size = 54000 * 2 bytes = 108 KB

total_per_validator = min_spans_size + max_spans_size = 216 KB
total_values = num_validator_indices * total_per_validator
             = 100,000 * 216 KB = 21.6 GB total

21.6 GB for 100k validators ends up being pretty big! Can we do something more to reduce this space?

Let's take a look at a single validator with 100% attestation effectiveness. Their spans arrays will look like [2, 2, 2, ...] and [0, 0, 0, ...]. Given this data is all redundant, we could split each array up into chunks of length C. There will be 54000 / C chunks. So if C = 1000, then we have 54 chunks of length 1000. Chunk index 1 starts at epoch 1000, chunk index 10 starts at epoch 10000, etc.

Then, we can take each chunk's hash and store it in the DB. In a separate bucket, we store (validator_idx ++ chunk_idx) => chunk_hash. So our schema looks like this:

(validator_idx ++ chunk_idx) => chunk_hash
chunk_hash => chunk

To give an example, let's say validator index 1 is attesting perfectly and has a min span array that looks like this: [2, 2, 2, ..., 2]. We split it up into 54 chunks of C = 1000. We take the hash of each chunk, and since every element is 2 across the entire array, all chunks have the same hash 0x923099304902934. This means that in the DB, we are saving a TON of space, storing 54 chunks' worth of data as a single chunk since they are the same, redundant data.

Given certain runs of numbers could be repeated across tons of validators, best case scenario we would have only 2 chunks in the database for the entire validator registry!

This means that our DB would look something like this

0x923099304902934 -> [2, 2, 2, ..., 2]
0x1231a3723762833 -> [0, 0, 0, ..., 0]

if we were to store chunks addressed by hash and we had perfect uniformity. Even if we do not have perfect uniformity, the more uniform each chunk is, the more we can save on space and reuse chunks!
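
For illustration, here is a rough Go sketch of the 1D schema using in-memory maps instead of boltDB buckets; the key encoding and the use of SHA-256 are assumptions made only for this example.

package chunkstore

import (
	"crypto/sha256"
	"encoding/binary"
)

// store mirrors the two-bucket schema: an index mapping
// (validator_idx ++ chunk_idx) => chunk_hash, and a content-addressed table
// mapping chunk_hash => chunk. Identical chunks are stored only once.
type store struct {
	index  map[[16]byte][32]byte
	chunks map[[32]byte][]uint16
}

// indexKey concatenates the validator index and chunk index into a fixed-size key.
func indexKey(validatorIdx, chunkIdx uint64) [16]byte {
	var key [16]byte
	binary.LittleEndian.PutUint64(key[:8], validatorIdx)
	binary.LittleEndian.PutUint64(key[8:], chunkIdx)
	return key
}

// hashChunk serializes a chunk of uint16 span values and hashes it.
func hashChunk(chunk []uint16) [32]byte {
	buf := make([]byte, 2*len(chunk))
	for i, v := range chunk {
		binary.LittleEndian.PutUint16(buf[2*i:], v)
	}
	return sha256.Sum256(buf)
}

// saveChunk writes the chunk for a (validator, chunk index) pair. If another
// validator already stored an identical chunk, only the small index entry is added.
func (s *store) saveChunk(validatorIdx, chunkIdx uint64, chunk []uint16) {
	h := hashChunk(chunk)
	s.index[indexKey(validatorIdx, chunkIdx)] = h
	if _, ok := s.chunks[h]; !ok {
		s.chunks[h] = chunk
	}
}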

This approach is great; however, it becomes complex because we need to perform database pruning for chunks that are no longer in use, and the simplicity of the schema makes this really hard to do. We would need to tag each chunk with the latest epoch at which it was used and run a garbage collection routine over those tags. The Lighthouse team abandoned this approach given the complexity of this pruning routine.

How do you choose C?

The parameter C, being the number of elements per chunk, is important in determining how much disk space our approach saves. The space used on disk is proportional to C * U, where U is a uniformity factor for a chunk, namely, how redundant the data in a chunk actually is. The Lighthouse team chose a default value of C = 16 based on optimization analysis performed in their design document.

Easiest and optimal approach: 2D chunking

An even better approach is as follows. Instead of storing chunks and hashes for each validator index for min and max spans, we can be a little more clever with how we store data on disk. Under ideal network conditions, where every target epoch immediately follows its source,
min spans for a set of validators will look as follows:

 min_spans_validator_0 = [2, 2, 2, ..., 2]
 min_spans_validator_1 = [2, 2, 2, ..., 2]
 min_spans_validator_2 = [2, 2, 2, ..., 2]
 ...

Next, we can chunk this list of min spans into chunks of length C. For C = 2, for example:

                            chunk0  chunk1       chunkN
                             {  }   {   }         {  }
 chunked_min_spans_val_0 = [[2, 2], [2, 2], ..., [2, 2]]
 
                            chunk0  chunk1       chunkN
                             {  }   {   }         {  }
 chunked_min_spans_val_1 = [[2, 2], [2, 2], ..., [2, 2]]
 
                            chunk0  chunk1       chunkN
                             {  }   {   }         {  }
 chunked_min_spans_val_2 = [[2, 2], [2, 2], ..., [2, 2]]

Finally, we can store each chunk index for K validators into a single flat slice. For K = 3:

                                   val0  val1  val2
                                   {  }  {  }  {  }
  chunk_0_for_validators_0_to_3 = [2, 2, 2, 2, 2, 2]
  
                                   val0  val1  val2
                                   {  }  {  }  {  }
  chunk_1_for_validators_0_to_3 = [2, 2, 2, 2, 2, 2]
  
                                   val0  val1  val2
                                   {  }  {  }  {  }
  chunk_2_for_validators_0_to_3 = [2, 2, 2, 2, 2, 2]
    

Then we have an easy way of representing a chunk C_i for K validators on disk, all in a single flat bytes slice. The 1D approach provides de-duplication but requires garbage collection and uses more space in the worst case than 2D, because we have to store hashes for every chunk and also garbage collection markers to allow us to prune old chunks. In both approaches, we still have to store the 2 * 54000 span values per validator on disk.

We store these in a single bucket in our database with the schema chunk_id => flat_chunk where the chunk_id for an incoming attestation is computed as follows from a validator index and epoch:

validator_chunk_number = validator_index / K
chunk_number = att.source.epoch / C
chunk_id = validator_chunk_number * C + chunk_number
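
To make the arithmetic concrete, here is a tiny Go sketch that follows the three formulas above verbatim; the choices C = 16 (the default mentioned earlier) and K = 3 (mirroring the examples) are assumptions, and whether production code packs the key exactly this way is not guaranteed.

package main

import "fmt"

const (
	C = 16 // epochs per chunk (assumed default from the section above)
	K = 3  // validators per flat chunk (mirrors the examples)
)

// chunkID combines the validator chunk number and the epoch chunk number into
// a single key for the chunk bucket, per the formulas above.
func chunkID(validatorIndex, sourceEpoch uint64) uint64 {
	validatorChunkNumber := validatorIndex / K
	chunkNumber := sourceEpoch / C
	return validatorChunkNumber*C + chunkNumber
}

func main() {
	// Validator 7 attesting with source epoch 40:
	// (7 / 3) * 16 + (40 / 16) = 2*16 + 2 = 34.
	fmt.Println(chunkID(7, 40)) // 34
}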

The Lighthouse design doc outlines this approach and mentions that the update method when receiving an attestation is:

  1. Load the chunk for chunk_id(validator_index, att.source.epoch)
  2. Fetch the validator's chunk from the validator_index % K sub-chunk
  3. Check for slashing and update the chunk (lazily loading neighbouring chunks and modifying them as necessary)
  4. Write all updated chunks back to the database

Upsides:
  • Single chunk table (no need for the index, thus smaller overall)

There is no need for hashing in this scheme, so there is less data to store, making this approach space efficient and predictable.

Let's take a look at how an update process specifically works for this 2D chunking scheme:

Let's set H = historyLength = 4, meaning a min span will hold 4 epochs' worth of attesting history. Then we set C = 2, meaning we will chunk the min span into arrays each of length 2, and K = 3, meaning we store each chunk index for 3 validators at a time.

So assume we get a target 3 for source 0 and validator 0. Then, we need to update every epoch in the span from 3 to 0 inclusive. First, we find out which chunk epoch 3 falls into, which is calculated as:

chunk_idx = (epoch % H) / C = (3 % 4) / 2 = 1


					val0        val1        val2
				      {     }     {      }    {      }
  chunk_1_for_validators_0_to_3 = [[nil, nil], [nil, nil], [nil, nil]]
			              |    |
			              |    |-> epoch 3 for validator 0
				      |
                                     |-> epoch 2 for validator 0

				        val0        val1        val2
				      {     }     {      }    {      }
  chunk_0_for_validators_0_to_3 = [[nil, nil], [nil, nil], [nil, nil]]
				      |    |
				      |    |-> epoch 1 for validator 0
				      |
			              |-> epoch 0 for validator 0

Next up, we proceed with the update process for validator index 0, starting at epoch 3 and updating every value along the way according to the update rules for min spans.

Once we finish updating a chunk, we need to move on to the next chunk. We stop whenever we reach the minimum epoch we need to update; in our example, we stop at 0, which is part of chunk 0, so we need to perform updates across 2 different "2D" data structures as shown above.

After doing the math, chunk_0 for validator_0 will look like [3, 2] (epochs 0 and 1) and chunk_1 for validator_0 will look like [1, 0] (epochs 2 and 3).
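
The walkthrough above can be reproduced with a short Go sketch; this is a simplified illustration (not the production update routine) that assumes the value written at each epoch is simply the distance to the attestation's target.

package main

import "fmt"

const (
	H = 4 // historyLength
	C = 2 // epochs per chunk
	K = 3 // validators per chunk
)

// updateMinSpans walks from the target epoch down to epoch 0, writing the
// distance (target - epoch) into the cell for (validator, epoch), lazily
// creating each chunk when we cross into it. chunks maps chunk_idx => flat slice.
func updateMinSpans(chunks map[uint64][]uint16, validatorIdx, target uint64) {
	for e := int64(target); e >= 0; e-- {
		epoch := uint64(e)
		chunkIdx := (epoch % H) / C
		chunk, ok := chunks[chunkIdx]
		if !ok {
			chunk = make([]uint16, K*C)
			chunks[chunkIdx] = chunk
		}
		cell := (validatorIdx%K)*C + epoch%C
		chunk[cell] = uint16(target - epoch)
	}
}

func main() {
	chunks := make(map[uint64][]uint16)
	// Attestation with source 0, target 3 for validator 0.
	updateMinSpans(chunks, 0, 3)
	fmt.Println(chunks[0][:C]) // epochs 0 and 1 for validator 0: [3 2]
	fmt.Println(chunks[1][:C]) // epochs 2 and 3 for validator 0: [1 0]
}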

Requirements of our implementation

When actually writing the code for our slasher, we need to keep the following items in mind:

  • Simplicity in the implementation
  • Caching strategies only where needed and done at the very end
  • Ability to scale to large numbers of validators under reasonable resource constraints
  • Extremely easy to test
  • Easy entrypoint to run a simulation of any number of validators and attestations for testing correctness and benchmarking performance
  • Readable code with comprehensive godoc comments and a simple package structure
  • Performant in terms of disk I/O

Design Question: Part of Beacon Node vs. Independent Process

Currently, we designed our slasher to be its own independent process. However, this led to a lot more pain than benefit, as we realized the needs of slasher a bit too late. In particular, slasher needs access to all indexed attestations observed by the beacon node, and also needs access to certain information about the validator registry. Given we do not store indexed attestations in the database, our current slasher reaches out to a beacon node gRPC endpoint which duplicates work by converting attestations to indexed form and also aggregating them, which puts a lot of load on our beacon node process at runtime, especially during times of low chain participation. Given that, at the most basic level, a simple feed of indexed attestations is all we need for slashing detection, we posit that slasher should instead be a service within a running beacon node. Another reason is that slasher is so tightly coupled with a beacon node that it makes sense to have both of them together.

Implementation

Inspired by lighthouse's slasher implementation (https://github.com/sigp/lighthouse/tree/stable/slasher/src), our approach mirrors its general flow.

Components

Here are the required components for our implementation:

  • Source of indexed attestations (a beacon node sending to an event feed, for example)
  • Receiver of attestations from an event feed, able to perform basic integrity checks and enqueue received attestations for processing
  • Background routine running on_tick to dequeue attestations, group them by validator chunk indices, and process those batches
  • A Chunker interface, specifying methods to update chunks, detect slashings within a chunk, etc.
    • a MinChunkList implementation of the Chunker interface
    • a MaxChunkList implementation of the Chunker interface
  • A function which, for each attestation, is able to get the chunks for all validators involved in the attesting indices, update the min and max chunks, and save them to the database. This function will leverage the Chunker interface
  • Simulator that can serve as a mock source of indexed attestations for benchmarking, testing, and more
  • Background routine which periodically deletes redundant attestations from the db
  • API endpoints for remote slashing protection
  • Ability to prune old data and defer attestations from the future for later processing (out of scope)

The flow of an attestation

When we first receive an attestation via some source, we:

  • Check if the attestation is valid (source epoch < target epoch)
    • If not, discard
  • Check if attestation.target.epoch <= current epoch
    • If not, defer for future processing
  • Enqueue verified attestations for batch processing
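
A minimal Go sketch of these intake checks; the types and queue handling here are illustrative assumptions, not the actual Prysm service code.

package main

import "fmt"

// att is a simplified indexed attestation used only for this sketch.
type att struct {
	source, target uint64
}

// receive applies the checks from the list above: discard invalid attestations,
// defer attestations from future epochs, and enqueue everything else.
func receive(a att, currentEpoch uint64, queue, deferred *[]att) string {
	if a.source >= a.target {
		return "discarded" // invalid: source epoch must be less than target epoch
	}
	if a.target > currentEpoch {
		*deferred = append(*deferred, a) // from the future: defer for later processing
		return "deferred"
	}
	*queue = append(*queue, a) // valid and current: enqueue for batch processing
	return "queued"
}

func main() {
	var queue, deferred []att
	fmt.Println(receive(att{source: 1, target: 2}, 10, &queue, &deferred))  // queued
	fmt.Println(receive(att{source: 9, target: 12}, 10, &queue, &deferred)) // deferred
}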

In the background, we run a goroutine which processes enqueued attestations on tick intervals (per epoch):

OnTick(Epoch) → Group by validator_chunk_idx → Process batch (validator_chunk_idx, []attestations) → SaveToDB(attestations in batch) → FetchChunks(batch) → slashings := UpdateChunks(MinChunk) → slashings := UpdateChunks(MaxChunk) → ProcessSlashings(slashings) → End

Grouping attestations by validator chunk index will allow us to process batches easily and concurrently in goroutines, making our lives easier.
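
A Go sketch of that grouping step, with K = 3 as in the earlier examples; the types and names here are assumptions for illustration only.

package main

import "fmt"

const K = 3 // validators per chunk row, as in the earlier examples

// batchAtt is a simplified indexed attestation for this sketch.
type batchAtt struct {
	attestingIndices []uint64
	source, target   uint64
}

// groupByValidatorChunkIndex buckets a dequeued batch by validator_chunk_idx so
// each bucket can be processed concurrently. An attestation lands in every
// bucket that one of its attesting indices maps to, but only once per bucket.
func groupByValidatorChunkIndex(batch []batchAtt) map[uint64][]batchAtt {
	groups := make(map[uint64][]batchAtt)
	for _, a := range batch {
		seen := make(map[uint64]bool)
		for _, valIdx := range a.attestingIndices {
			chunkIdx := valIdx / K
			if !seen[chunkIdx] {
				groups[chunkIdx] = append(groups[chunkIdx], a)
				seen[chunkIdx] = true
			}
		}
	}
	return groups
}

func main() {
	batch := []batchAtt{{attestingIndices: []uint64{0, 4}, source: 1, target: 2}}
	fmt.Println(len(groupByValidatorChunkIndex(batch))) // 2: indices 0 and 4 fall in different rows
}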

The chunk interface

Given the routine for updating chunks is the same at a high level for both min and max spans, we can define a Chunker interface whose methods do the heavy lifting.

type Chunker interface {
    // NeutralElement returns the value an empty span cell is initialized to.
    NeutralElement() uint16
    // Chunk returns the underlying flat slice of span values.
    Chunk() []uint16
    // CheckSlashable checks an incoming attestation against the chunk's data
    // for a validator, returning whether it is slashable and the kind of slashing.
    CheckSlashable(
        validatorIdx uint64,
        attestation *ethpb.IndexedAttestation,
    ) (bool, slashingKind, error)
    // Update writes an attestation's span values into the chunk starting at
    // startEpoch, reporting whether the update needs to continue in a neighboring chunk.
    Update(
        chunkIdx,
        validatorIdx,
        startEpoch uint64,
        attestation *ethpb.IndexedAttestation,
    ) (bool, error)
    // StartEpoch returns the first epoch to update for a given source epoch, if any.
    StartEpoch(sourceEpoch, currentEpoch uint64) (uint64, bool)
    // NextEpoch returns the next epoch to update after finishing the current chunk.
    NextEpoch(startEpoch uint64) uint64
}

Then, we can implement this interface for min and max spans, as they differ slightly in how the underlying implementation of these methods works.

type MinChunk struct {
    config *Config
    data   []uint16
}

type MaxChunk struct {
    config *Config
    data   []uint16
}
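
As a sketch of how the two implementations might diverge, here are plausible NeutralElement and Chunk methods continuing the types above (assuming the math package is imported); the sentinel choices follow from the min/max semantics described earlier and are assumptions rather than the final Prysm code.

// NeutralElement for min spans: an empty cell must never win a "min" comparison,
// so we assume the largest possible uint16 as the starting value.
func (m *MinChunk) NeutralElement() uint16 {
	return math.MaxUint16
}

// NeutralElement for max spans: 0 can never win a "max" comparison and, as noted
// earlier, can never flag a false surrounded vote, so it is a safe starting value.
func (m *MaxChunk) NeutralElement() uint16 {
	return 0
}

// Chunk exposes the underlying flat slice of span values for K validators.
func (m *MinChunk) Chunk() []uint16 { return m.data }
func (m *MaxChunk) Chunk() []uint16 { return m.data }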

Testing

Given how unintuitive the optimized slasher implementation is, we need a way to ensure it works as expected for all kinds of edge cases, and that it can handle as many test scenarios as we can devise. To do this, we leverage the simplicity of the slasher entrypoint being a single feed of indexed attestations. Slasher does not care what the source is, as long as it can receive these attestations from some event feed.

Additionally, we can create a simulator that can craft all sorts of combinations of indexed attestations for any number of validators and epochs.

type MockFeeder struct {
	feed   *event.Feed
	config *Config
}

func (m *MockFeeder) IndexedAttestationFeed() *event.Feed {
	return m.feed
}

func (m *MockFeeder) generateFakeAttestations(ctx context.Context) {
	ticker := slotutil.GetEpochTicker(m.config.GenesisTime, m.config.Interval)
	defer ticker.Done()
	for {
		select {
		case currentEpoch := <-ticker.C():
			// Craft a fake indexed attestation for the current epoch and send it
			// over the same feed a real beacon node would use.
			att := m.generateIndexedAtt(ctx, currentEpoch)
			m.feed.Send(att)
		case <-ctx.Done():
			return
		}
	}
}

func (m *MockFeeder) generateIndexedAtt(ctx context.Context, currentEpoch uint64) *ethpb.IndexedAttestation {
	...
}

Work Estimates

This design has now been completely implemented in Prysm under the feature/slasher branch as of March 25, 2021.