Our ultimate goal is to ensure that FC implementations conform to the FC spec. The ideal way to reach this goal would be formal verification of the FC implementations; however, we assume that is beyond our reach at present. Therefore, testing is the only practical approach. It is still very challenging, though, and in this write-up we analyze how the problem can be solved efficiently.
The methodology is implemented in the fc-compliance branch of the eth2.0-specs fork; it is to be merged into the main repo.
Preliminary research has been done in this repo.
Here is a summary of the proposed ideas, which will be described in more detail later.
During a testing project, we typically face the following problems:
- defining the SUT (system under test)
- the test oracle problem (how to detect failures)
- choosing a code coverage criterion
- generating test vectors
- executing tests efficiently
In our case, the SUT is the set of FC implementations. Testing the FC spec itself is a related problem, but we do not address it here. We decompose our goal into a set of sub-problems: test each FC implementation in isolation.

In general, it might be beneficial to test several implementations simultaneously; however, we ignore such an approach for now. It can be considered in the future.
NB It's expected that additional auxiliary SUTs may appear, e.g. the original FC spec can be transformed in various ways - decomposed, restructured, desugared, transpiled, etc. In order to ensure that a transformed version behaves as the original one, we may employ testing. This is not the primary focus of the write-up, particularly since it should be an easier problem to solve.
The FC spec is executable (a reference implementation), which means we can cast the problem as differential testing of the FC client implementations against the FC spec.
That means the test oracle problem is easy to solve: the main property to check - all programs choose the same canonical chain - is straightforward to implement.

However, this property check alone might miss certain bugs, e.g. the correct block can still outweigh the others despite erroneously calculated weights. Catching such bugs is possible, but more test vectors are required (which perhaps should be specially designed). Therefore, implementing additional property checks can improve confidence without growing the test suite size.
We expect that client FC implementations already expose certain functionality to other client subsystems, which includes:
- the current head of the canonical chain (`get_head`)
- the justified and finalized checkpoints

We thus believe it is reasonable to implement corresponding test oracles for these properties, which will check that the FC spec and an FC implementation agree on their outcomes for each test vector.
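For illustration, here is a minimal sketch of such an oracle, where `run_spec` and `run_impl` are hypothetical drivers replaying a test vector against the executable FC spec and an FC implementation, respectively:

```python
# A minimal sketch of the differential test oracle; run_spec/run_impl are
# hypothetical drivers which replay a test vector and expose the resulting
# fork choice outcomes.
def check_test_vector(test_vector):
    spec_result = run_spec(test_vector)   # executable FC spec (the reference)
    impl_result = run_impl(test_vector)   # FC implementation under test

    # The main property: both choose the same canonical chain head.
    assert impl_result.head == spec_result.head

    # Additional property checks improve confidence without growing the
    # test suite, e.g. agreement on justified/finalized checkpoints.
    assert impl_result.justified_checkpoint == spec_result.justified_checkpoint
    assert impl_result.finalized_checkpoint == spec_result.finalized_checkpoint
```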
It is well known that traditional code coverage criteria do not work well in the case of differential testing (see, for example, NEZHA: Efficient Domain-Independent Differential Testing). A particular statement in the spec can correspond to multiple statements in an implementation and vice versa. Therefore, the traditional approach to measuring coverage, which sums up the code elements reached, is not enough.
NB Traditional "additive" criterion:
there should be a test vector triggering each (reachable) code element
For example, consider a test vector which triggers the `specStmt` statement in the spec and `implStmt` in the implementation. Additionally, let's assume that both `specStmt` and `implStmt` have already been covered by prior tests. However, it can be the case that `specStmt` has been tested only in combination with some other `implStmt2`, while `implStmt` has been tested in combination with yet another `specStmt2`. An "additive" coverage metric may decide that the test case doesn't improve coverage, since both `specStmt` and `implStmt` have already been touched. However, when testing an implementation against a spec, it should be considered a new test vector (activating some fresh SUT behavior).
We therefore need a "multiplicative" coverage criterion which takes into account interactions of spec and implementation code elements:
there should be a test vector for each (reachable) pair of code elements of the spec and of the implementation
There can be roughly two flavors of such "multiplicative" criteria: path-based and set-based. Typically, code coverage information is presented as a set of pairs `(codeElement, count)` describing how often a particular `codeElement` (e.g. a branch or statement) has been visited during code execution. In the case of two programs, there are two such sets.
With the set-based approach, the footprint of a single test run consists of the set of code elements visited during the run. When calculating the coverage measure for a test suite, one takes the union of the individual test footprints.

With the path-based approach, the footprint of a single test run consists of a single element, (abstractly) describing the test execution path, e.g. the sequence of statements visited or branches taken. When calculating the coverage measure of a test suite, the individual elements are combined into a set.
Similarly, there can be two procedures to calculate the "multiplicative" coverage measure:
- path-based: footprints are pairs `(specPath, implPath)`, where `specPath` and `implPath` abstract the execution paths captured while executing the spec and the implementation, respectively, on the same test vector
- set-based: footprints are pairs of sets `specElems` and `implElems`, where `specElems` and `implElems` consist of the code elements of the spec and of the implementation, respectively, visited during execution of the same test vector
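To make the two procedures concrete, here is a minimal sketch, where `trace_spec` and `trace_impl` are hypothetical helpers returning the code elements visited while replaying a test vector:

```python
# A sketch of both "multiplicative" coverage procedures; trace_spec and
# trace_impl are hypothetical tracing helpers returning visited code
# elements in visit order.
def path_based_footprint(test_vector):
    # The whole (abstracted) execution paths, paired.
    return (tuple(trace_spec(test_vector)), tuple(trace_impl(test_vector)))

def set_based_footprint(test_vector):
    # All (spec element, impl element) pairs touched by the same vector.
    spec_elems = set(trace_spec(test_vector))
    impl_elems = set(trace_impl(test_vector))
    return {(s, i) for s in spec_elems for i in impl_elems}

def multiplicative_coverage(test_suite):
    path_pairs, elem_pairs = set(), set()
    for tv in test_suite:
        path_pairs.add(path_based_footprint(tv))
        elem_pairs |= set_based_footprint(tv)
    return len(path_pairs), len(elem_pairs)
```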
consist of code elements of the spec and the implementation respectively, visited during execution of the same test vector.Reaching a good code coverage can be a challenging problem. It becomes even more complicated in the case of our "multiplicative" coverage criterion, where interactions of spec and implementation code elements should be covered.
Another important limitation is that while we can employ a white-box approach to reach good coverage of the FC spec, we cannot expect the same level of effort in the case of FC implementations. It may well be that some client implementers are ready to apply white-box techniques at the implementation level, but we can't expect it to be the common ground.
Since the FC spec is the central component and we have several FC implementations, we assume the following approach to obtain good coverage according to our "multiplicative" coverage criterion:
1. generate a test suite which thoroughly covers the FC spec, using white-box techniques
2. use that test suite as a seed corpus when fuzzing each FC implementation, to cover implementation-specific behaviors
NB This is roughly the same approach as the one employed by the beacon-fuzz project.
The first step ensures that each distinct FC spec behavior is covered; however, this is not enough to guarantee good coverage according to our "multiplicative" criterion. Therefore, we complement it with fuzzing: for each test vector from the FC spec test suite, we generate multiple new test vectors in order to cover reachable FC implementation behaviors.
While more complicated approaches are possible, we focus on the simple one for now. In particular, fuzz testing tools are available for popular programming languages.
So, any test suite developed for the FC spec can also be used to test FC implementations. It's obviously a black-box approach with regard to the FC implementation code base.
We can fuzz the suite, using it as the seed corpus and applying mutations to its test vectors. We consider black-box (blind) and gray-box (coverage-guided) fuzzing here:
- black-box (blind) fuzzing applies random mutations to test vectors without any feedback from the SUT
- gray-box (coverage-guided) fuzzing additionally records coverage feedback from the FC implementation and prioritizes mutated vectors which unlock new behaviors
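A minimal sketch of both modes, reusing the hypothetical `mutate` and `check_test_vector` helpers from above (`impl_coverage` stands for a coverage-recording callback of the implementation under test):

```python
import random

# Black-box (blind) fuzzing: mutate each seed a fixed number of times,
# with no feedback from the SUT.
def blind_fuzz(seed_corpus, mutations_per_seed=100):
    for seed in seed_corpus:
        for _ in range(mutations_per_seed):
            check_test_vector(mutate(seed))

# Gray-box (coverage-guided) fuzzing: keep mutated vectors which produce
# previously unseen implementation footprints.
def coverage_guided_fuzz(seed_corpus, impl_coverage, budget=10_000):
    corpus, seen = list(seed_corpus), set()
    for _ in range(budget):
        candidate = mutate(random.choice(corpus))
        check_test_vector(candidate)
        footprint = frozenset(impl_coverage(candidate))
        if footprint not in seen:   # new behavior unlocked: keep the input
            seen.add(footprint)
            corpus.append(candidate)
```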
In the future, test vectors obtained during client-specific fuzzing can be merged back and used to update the common test suite. That can even be done in an online fashion, i.e. while testing multiple FC implementations simultaneously.
The component common to both fuzzing approaches is the set of test vector mutators. A general problem with fuzz testing is that it is difficult to obtain truly diverse test vectors: while some behaviors are easy to reach, fuzzing eventually gets stuck in these "easy" regions. Obviously, some behaviors may demand very specific conditions to be satisfied, which can be extremely difficult to reach with random mutations.

One example is that straightforward mutation tends to generate invalid inputs, causing the SUT to repeatedly visit input validation paths. In general, it can be very difficult to generate semantically interesting inputs by dumb (input-structure-unaware) fuzzing.

One way to alleviate the problem is to design mutation operators which allow for better control of the expected behavior. For example, if we know that input validation paths are already well covered, we may want to reduce the importance of mutations which make test vectors invalid, e.g. by lowering the probability of applying such mutations.
We can design such mutation operators based on the FC spec decomposition. For example, a mutation operator can:
- drop or add an event (e.g. a tick, block or attestation)
- re-order events within a certain time period

Based on the decomposition, it is easier to predict whether such a mutation will change the expected behavior. For example, if we know that some event is invalid, it will be dropped; thus one can drop or add such events without affecting the overall outcome (of the FC spec). Similarly, certain events can be re-ordered within a certain time period. A sketch of such operators follows.
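Here, a test vector is modeled as a time-ordered list of `(time, event)` pairs, and `spec.is_valid` is a hypothetical validity predicate derived from the FC spec:

```python
import random

# The FC spec drops invalid events anyway, so removing one should not
# change the expected outcome - a behavior-preserving mutation.
def drop_invalid_event(vector, spec):
    invalid = [i for i, (t, e) in enumerate(vector) if not spec.is_valid(e, t)]
    if not invalid:
        return vector
    mutated = list(vector)
    del mutated[random.choice(invalid)]
    return mutated

# Events delivered close together in time can often be re-ordered
# without affecting the overall outcome (the window size is an assumption).
def swap_adjacent_events(vector, window=4):
    if len(vector) < 2:
        return vector
    i = random.randrange(len(vector) - 1)
    mutated = list(vector)
    if abs(mutated[i][0] - mutated[i + 1][0]) <= window:
        mutated[i], mutated[i + 1] = mutated[i + 1], mutated[i]
    return mutated
```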
As the FC spec is central in our setting, we believe it is justified to use white-box techniques to obtain sufficient spec coverage. We consider three approaches:
- restructuring/decomposing the FC spec, so that it becomes easier to sample
- bounded model checking, i.e. constructing new test vectors with the help of an SMT solver
- transpiling the FC spec to C and employing the model checking and testing tools available for C and LLVM
Let's illustrate with an example how one can exploit code structure to aid test generation. Assume that the SUT can be represented as a composition of two functions, `f(g(x))`. To generate a covering set of inputs, there are two approaches:
- sample `x` directly
- sample inputs to `g` and to `f` separately

Often, the second approach is easier, e.g. when `g` is contractive, in the sense that for a reachable output value of `g` there typically exist multiple input values leading to it.
A simple but realistic example is when `g` transforms sequences of things into sets of the same things. Assume additionally that `f` is independent of the order of elements in its input sets. Obviously, for each possible output of `g` - and correspondingly for each reachable input of `f` - there are multiple possible inputs of `g` (excluding possible degenerate cases like an empty list), including arbitrary reorderings and duplications of elements. Due to the above assumption (`f` is independent of input element order), a better test strategy is to cover `f` and `g` independently:
- cover `f` directly, sampling sets of elements instead of sequences of elements
- when covering `g`, one can ignore `f`'s code structure
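A toy model of this situation (all names hypothetical):

```python
# g contracts sequences to sets; f is independent of element order,
# so f and g can be covered separately.
def g(events):
    return frozenset(events)   # sequence -> set (a contractive step)

def f(event_set):
    return max(event_set)      # order-independent consumer

# Covering f directly: sample sets instead of sequences.
assert f(frozenset({1, 2, 3})) == 3

# Any reordering or duplication in g's input yields the same overall
# outcome, so such variations need not be covered in combination with f.
assert f(g([1, 2, 3])) == f(g([3, 1, 2, 2]))
```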
Note that the FC spec is quite similar to this example, i.e. it can be restructured so that its intermediate state consists of several sets plus a counter (current time):
NB The proposer boost can be represented as a set with at most one element.
Additionally, when choosing the head based on that information, one doesn't need the vast majority of the info contained in blocks and attestations - it is mostly required to check message validity. The only significant exception is that the justified checkpoint state is used to calculate participation and weights; however, that is the state of a single block, while a test vector can include multiple blocks.
Therefore sampling the intermediate state is much easier than sampling sequences of events.
Note also that such a decomposition can help with fuzzing inputs. E.g. if we have a seed vector for `f`, which is a set, we need to construct some `g` input leading to that vector, e.g. by choosing some order of elements. It is then easy to mutate the vector in a semantically meaningful way: one can re-order or duplicate elements, changing the `g` input while preserving the underlying set and thus the expected outcome.
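For instance, a sketch of such a mutation, producing a fresh `g` input from a set-valued seed:

```python
import random

# Pick a fresh order and duplicate a few elements: the resulting sequence
# is a new g-input with the same underlying set, hence the same expected
# outcome (max_dups is an arbitrary illustration parameter).
def sample_sequence(seed_set, max_dups=2):
    events = list(seed_set)
    random.shuffle(events)                  # arbitrary reordering
    for e in random.sample(events, k=min(max_dups, len(events))):
        events.insert(random.randrange(len(events) + 1), e)  # duplication
    return events
```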
Here is a high-level description of what a decomposed FC spec can look like. There is also an example of a similar spec restructuring, albeit pursuing a somewhat different goal (simplifying formal verification of the FC spec properties).
The current FC `Store` contains certain fields which can always be reconstructed (`block_states`, `checkpoint_states`, unrealized justifications, etc.). To simplify test generation, it makes sense to restructure the `Store`, so that the restructured version (let's call it `TestStore`) contains only the following fields:
- `blocks` (valid and in-time)
- `latest_messages`, excluding votes of validators in `equivocating_indices`
- `current_time`
- `proposer_boost`
The rest of the `Store` properties can be reconstructed with corresponding query methods (see the example above).
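A rough sketch of what the `TestStore` might look like (the field types are simplified stand-ins for the spec's types):

```python
from dataclasses import dataclass
from typing import Dict, Optional

Root = bytes  # stand-in for the spec's Root type

@dataclass
class TestStore:
    blocks: Dict[Root, 'BeaconBlock']            # valid and in-time blocks
    latest_messages: Dict[int, 'LatestMessage']  # excl. equivocating validators
    current_time: int
    proposer_boost: Optional[Root] = None        # a set with at most one element
```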
Processing test vector events (ticks, blocks, attestations, attester slashings) is performed via the following steps:
- `TestStore` updates
- invoking the `get_head` function (adjusted to work with the `TestStore`)
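A minimal sketch of this processing loop, with a hypothetical `apply_event` helper performing the `TestStore` updates:

```python
# Replay a test vector against the TestStore, collecting the head after
# each event (get_head is the spec function adjusted for the TestStore).
def process_test_vector(store: TestStore, events) -> list:
    heads = []
    for event in events:               # ticks, blocks, attestations, slashings
        apply_event(store, event)      # TestStore updates
        heads.append(get_head(store))  # adjusted get_head
    return heads
```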
Coverage-guided fuzzing is a gray-box approach and thus often inefficient: it tends to get stuck in "easy" regions, and it can be quite tricky to make it reach difficult code locations. One way to resolve the problem is to employ a white-box approach, i.e. take control-flow and data-flow dependencies into account when constructing test vectors.
Bounded model checking is one rather straightforward approach to such test generation. Assume we have a test vector. We can inspect its execution trace, i.e. record the branches taken and construct path conditions and data flow equations along the trace. Now, to sample a new path, we can invert one path condition (e.g. the last one) and ask an SMT solver to solve the corresponding SMT problem. If it succeeds, the solution can be converted into a new test vector, which will lead to a different execution path (due to the inverted path condition).
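A minimal sketch of the idea with the `z3` Python bindings (`pip install z3-solver`), using toy path conditions recorded along a hypothetical trace:

```python
from z3 import Int, Solver, Not, sat

slot = Int('slot')   # symbolic test vector fields (toy example)
time = Int('time')

# Path conditions recorded along the observed trace, e.g.:
#   1. slot >= 0          (input validation passed)
#   2. time >= slot * 12  (the block is not from the future)
path_conditions = [slot >= 0, time >= slot * 12]

solver = Solver()
solver.add(*path_conditions[:-1])     # keep the trace prefix
solver.add(Not(path_conditions[-1]))  # invert the last branch

if solver.check() == sat:
    model = solver.model()
    # The model converts into a new test vector which takes the other
    # side of the last branch.
    print({'slot': model[slot], 'time': model[time]})
```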
Since the FC spec is relatively simple - a) there is only one spec, compared to several implementations, and b) it is relatively unoptimized - it makes sense to employ the model checking approach to generate tests. Additionally, as we have seen above, it can be restructured to be even easier to sample.
Model checking is not universally applicable; there can be tough problems:
- functions like `state_transition` and `process_slots`, which can turn out to be too complicated even for powerful SMT solvers

Therefore, it might be necessary to help the test generator resolve these problems, e.g. by providing annotations or perhaps even by customizing certain test generation procedures.
There is a lack of formal method tools for Python. So, perhaps the best way to generate tests for the FC spec is to transpile it to C and employ the wide selection of model checking and testing tools available for C and LLVM.
A very incomplete list of model checkers for C: CBMC, ESBMC, CPAchecker, SeaHorn.
Our approach has now become rather sophisticated; however, in practice it still does not guarantee really good coverage - in particular, due to the gray-box fuzzing of FC implementations.
We may still have to rely on a brute-force approach, i.e. generate as many test scenarios as possible. It is therefore important to ensure efficient execution of the test scenarios, especially considering that there are several FC implementations.
There can be the following obstacles:
- a test scenario execution results in a single `get_head` property check; it might be more productive to validate additional properties (we have discussed the problem above)

We expect that FC client implementers aim at a high quality of their implementations and will be willing to cooperate here, if the proposed enhancements are reasonable. We also expect that such improvements are already (perhaps partially) implemented.
Coverage-guided fuzzing relies on recorded execution behavior: if a new, previously unseen execution trace is encountered, the fuzzing heuristic concludes that the corresponding program input is interesting, as it has unlocked new behaviors. It thus makes sense to pay more attention to that input by mutating it more often.

In practice, the observed behavior (e.g. an execution trace) can be affected by other factors, like non-determinism and global state. Therefore, freshly observed behavior doesn't necessarily mean an interesting input has been discovered. In such cases, the efficiency of coverage-guided fuzzing is undermined.
Therefore, it is important to ensure that the SUT - i.e. an FC implementation under test - can be configured to run in a deterministic fashion. Additionally, global state changes (which can affect the recorded execution behavior) shouldn't normally be preserved across individual test case executions.
A fuzz target should normally be very fast. However, in the case of FC, the vast majority of time is spent in BLS and hashing routines, which inevitably results in very slow test execution. While in general it is important to test such routines too, that can be done outside of fuzz testing. To improve fuzz testing efficiency, it is crucial to be able to replace the slow production BLS and hash routines with lightweight testing routines, as sketched below.
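For example, a sketch of such lightweight test doubles (the function names are hypothetical; the wiring mechanism is client-specific):

```python
import hashlib

def fake_bls_verify(pubkey: bytes, message: bytes, signature: bytes) -> bool:
    # Accept everything: signature checks are assumed to be tested elsewhere.
    return True

def fake_bls_sign(privkey: int, message: bytes) -> bytes:
    # Deterministic placeholder "signature", keeping test vectors reproducible.
    return hashlib.sha256(privkey.to_bytes(32, 'little') + message).digest()

# During fuzzing, the SUT is wired (e.g. via dependency injection or a
# build flag) to use the fakes instead of the real BLS library.
```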
Another important factor is that the SUT should be stripped of functionality which is irrelevant to the FC itself, e.g. logging, gathering statistics, initialization of unrelated modules, etc.
TODO
The FC spec can be transpiled to C/C++ or some other compiled language to improve its execution speed. Transpiling to C/C++ is mostly beneficial since:
- it results in much faster test execution
- a wide selection of model checking and testing tools is available for C and LLVM (as discussed above)
NB To obtain a dramatic speed-up, type inference should be performed. This is possible, since the FC Python spec is well annotated.
A downside of the approach is that the derived C spec is different from the original one; thus one cannot be sure that we are testing FC implementations against the original FC spec.

However, the problem is relatively easy to alleviate. The transpilation is performed on a method-by-method basis. Therefore, it is much easier to validate that the transpiled version behaves correctly, since we can split the work into a number of sub-tasks: validating that each transpiled method behaves correctly. Since methods in the FC spec are quite simple, it is relatively easy to ensure good test coverage. Additionally, such a method-level approach lets us construct method inputs which might be unreachable within whole-spec (system- or integration-level) testing.
We roughly estimated that a test version of the Python FC spec can run roughly 10-30 times slower than the corresponding test version of the FC spec transpiled to C. By a test version we assume:
- production BLS and hash routines replaced with lightweight testing counterparts
- functionality irrelevant to the FC itself stripped off
We didn't run dedicated performance tests, relying instead on freely available Python vs C benchmarks. Additionally, beacon-fuzz developers communicated to us that Python spec performance was a real bottleneck for state transition fuzzing.
Code coverage measures/criteria can be affected by code transformations. For example, real-world code consists of nested expressions, and statements can include complex expressions; if one flattens/simplifies such expressions, new statements are introduced.

This is also true of branching: e.g. the control flow inside short-circuit operators can be made explicit by corresponding transformations. Another example is that exception handling can be transformed into explicit control flow.
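A toy illustration of such a desugaring (the function and constant are hypothetical):

```python
FINALIZED_SLOT = 0  # hypothetical constant for illustration

def is_viable(block):
    # One statement; the second decision is hidden inside `and`.
    return block.justified and block.slot > FINALIZED_SLOT

def is_viable_desugared(block):
    # The short-circuit is now explicit, so statement/branch coverage
    # distinguishes the case where block.justified is False.
    if not block.justified:
        return False
    return block.slot > FINALIZED_SLOT
```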
It thus might be beneficial to implement such desugaring transformations in order to obtain finer-grained and more precise coverage measures. In the case of the FC spec, this will force test generation to yield more precise FC spec coverage, in the sense that more information (statements, branches, conditions) is taken into account when calculating the coverage measures. Due to the central role of the FC spec, such precise coverage might be crucial.

Note also that while such desugaring might be difficult to implement from scratch, it is naturally obtained when transpiling the Python spec to C.
In general, transpiling Python to C can be a challenging problem, at least if one wants a high-quality translation. In particular, one wants type inference in order to obtain really fast code.
In practice, the problem can be simplified significantly: