Our ultimate goal is to ensure that FC implementations conform to the FC spec. The ideal way to reach this goal would be formal verification of the FC implementations; however, we assume that is beyond our reach at present. Therefore, testing is the only practical approach. It is still very challenging, though, and in this write-up we analyze how the problem can be solved efficiently.
The methodology is implemented in the fc-compliance branch of the eth2.0-specs fork; it is to be merged into the main repo.
Preliminary research has been done in this repo.
Here is a summary of the proposed ideas, which will be described in more detail later.
During a testing project, we typically face the following problems:
- defining the SUT (system under test)
- the test oracle problem (how to detect failures)
- choosing a code coverage criterion
- generating test vectors
- executing tests efficiently
In our case, the SUT is the set of FC implementations. Testing the FC spec itself is a related problem, but we do not address it here. We decompose our goal into a set of sub-problems: test each FC implementation in isolation.

In general, it might be beneficial to test several implementations simultaneously; however, we ignore such an approach for now. It can be considered in the future.
NB It's expected that additional auxiliary SUTs may appear, e.g. the original FC spec can be transformed in various ways - decomposed, restructured, desugared, transpiled, etc. In order to ensure that a transformed version behaves as the original one, we may employ testing. This is not the primary focus of the write-up, particularly since it should be an easier problem to solve.
The FC spec is executable (a reference implementation), which means we can cast the problem as differential testing of the FC client implementations against the FC spec.
That means the test oracle problem is easy to solve: the main property to check - all programs choose the same canonical chain - is straightforward to implement.

However, this property check alone might miss certain bugs, e.g. the correct block can still outweigh the others despite erroneously calculated weights. Catching such bugs is possible, but more test vectors are required (which perhaps should be specially designed). Therefore, implementing additional property checks can improve confidence without growing the test suite size.
We expect that client FC implementations already expose certain functionality to other client subsystems, which includes:
- the current head of the canonical chain (`get_head`)
- the justified and finalized checkpoints

We thus believe it is reasonable to implement corresponding test oracles for these properties, which will check that the FC spec and an FC implementation agree on their outcomes for each test vector.
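For illustration, here is a minimal sketch of such an oracle, where `run_spec` and `run_impl` are hypothetical drivers replaying a test vector against the executable FC spec and an FC implementation, respectively:

```python
# A minimal sketch of the differential test oracle; run_spec/run_impl are
# hypothetical drivers which replay a test vector and expose the resulting
# fork choice outcomes.
def check_test_vector(test_vector):
    spec_result = run_spec(test_vector)   # executable FC spec (the reference)
    impl_result = run_impl(test_vector)   # FC implementation under test

    # The main property: both choose the same canonical chain head.
    assert impl_result.head == spec_result.head

    # Additional property checks improve confidence without growing the
    # test suite, e.g. agreement on justified/finalized checkpoints.
    assert impl_result.justified_checkpoint == spec_result.justified_checkpoint
    assert impl_result.finalized_checkpoint == spec_result.finalized_checkpoint
```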
It is well known that traditional code coverage criteria do not work well in the case of differential testing (see, for example, NEZHA: Efficient Domain-Independent Differential Testing). A particular statement in the spec can correspond to multiple statements in an implementation and vice versa. Therefore, the traditional approach to measuring coverage, which sums up the code elements reached, is not enough.
NB Traditional "additive" criterion:
there should be a test vector triggering each (reachable) code element
For example, consider a test vector which triggers the `specStmt` statement in the spec and `implStmt` in the implementation. Additionally, let's assume that both `specStmt` and `implStmt` have already been covered by prior tests. However, it can be the case that `specStmt` has been tested only in combination with some other `implStmt2`, while `implStmt` has been tested in combination with yet another `specStmt2`. An "additive" coverage metric may decide that the test case doesn't improve coverage, since both `specStmt` and `implStmt` have already been touched. However, when testing an implementation against a spec, it should be considered a new test vector (activating some fresh SUT behavior).
We therefore need a "multiplicative" coverage criterion which takes into account interactions of spec and implementation code elements:
there should be a test vector for each (reachable) pair of code elements of the spec and of the implementation
There can be roughly two flavors of such "multiplicative" criteria: path-based and set-based. Typically, code coverage information is presented as a set of pairs `(codeElement, count)` describing how often a particular `codeElement` (e.g. a branch or statement) has been visited during code execution. In the case of two programs, there are two such sets.
With the set-based approach, the footprint of a single test run consists of the set of code elements visited during the run. When calculating the coverage measure for a test suite, one takes the union of the individual test footprints.

With the path-based approach, the footprint of a single test run consists of a single element, (abstractly) describing the test execution path, e.g. the sequence of statements visited or branches taken. When calculating the coverage measure of a test suite, the individual elements are combined into a set.
Similarly, there can be two procedures to calculate the "multiplicative" coverage measure:
- path-based: footprints are pairs `(specPath, implPath)`, where `specPath` and `implPath` abstract the execution paths captured while executing the spec and the implementation, respectively, on the same test vector
- set-based: footprints are pairs of sets `specElems` and `implElems`, where `specElems` and `implElems` consist of the code elements of the spec and of the implementation, respectively, visited during execution of the same test vector
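To make the two procedures concrete, here is a minimal sketch, where `trace_spec` and `trace_impl` are hypothetical helpers returning the code elements visited while replaying a test vector:

```python
# A sketch of both "multiplicative" coverage procedures; trace_spec and
# trace_impl are hypothetical tracing helpers returning visited code
# elements in visit order.
def path_based_footprint(test_vector):
    # The whole (abstracted) execution paths, paired.
    return (tuple(trace_spec(test_vector)), tuple(trace_impl(test_vector)))

def set_based_footprint(test_vector):
    # All (spec element, impl element) pairs touched by the same vector.
    spec_elems = set(trace_spec(test_vector))
    impl_elems = set(trace_impl(test_vector))
    return {(s, i) for s in spec_elems for i in impl_elems}

def multiplicative_coverage(test_suite):
    path_pairs, elem_pairs = set(), set()
    for tv in test_suite:
        path_pairs.add(path_based_footprint(tv))
        elem_pairs |= set_based_footprint(tv)
    return len(path_pairs), len(elem_pairs)
```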
consist of code elements of the spec and the implementation respectively, visited during execution of the same test vector.Reaching a good code coverage can be a challenging problem. It becomes even more complicated in the case of our "multiplicative" coverage criterion, where interactions of spec and implementation code elements should be covered.
Another important limitation is that while we can employ a white-box approach to reach good coverage of the FC spec, we cannot expect the same level of effort in the case of FC implementations. It may well be that some client implementers are ready to apply white-box techniques at the implementation level, but we can't expect it to be the common ground.
Since the FC spec is the central component and we have several FC implementations, we assume the following approach to obtain good coverage according to our "multiplicative" coverage criterion:
1. generate a test suite which thoroughly covers the FC spec, using white-box techniques
2. use that test suite as a seed corpus when fuzzing each FC implementation, to cover implementation-specific behaviors
NB This is roughly the same approach as the one employed by the beacon-fuzz project.
The first step ensures that each distinct FC spec behavior is covered; however, this is not enough to guarantee good coverage according to our "multiplicative" criterion. Therefore, we complement it with fuzzing: for each test vector from the FC spec test suite, we generate multiple new test vectors in order to cover reachable FC implementation behaviors.
While more complicated approaches are possible, we focus on the simple one for now. In particular, fuzz testing tools are available for popular programming languages.
So, any test suite developed for the FC spec can also be used to test FC implementations. It's obviously a black-box approach with regard to the FC implementation code base.
We can fuzz the suite, using it as the seed corpus and applying mutations to its test vectors. We consider black-box (blind) and gray-box (coverage-guided) fuzzing here:
- black-box (blind) fuzzing applies random mutations to test vectors without any feedback from the SUT
- gray-box (coverage-guided) fuzzing additionally records coverage feedback from the FC implementation and prioritizes mutated vectors which unlock new behaviors
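A minimal sketch of both modes, reusing the hypothetical `mutate` and `check_test_vector` helpers from above (`impl_coverage` stands for a coverage-recording callback of the implementation under test):

```python
import random

# Black-box (blind) fuzzing: mutate each seed a fixed number of times,
# with no feedback from the SUT.
def blind_fuzz(seed_corpus, mutations_per_seed=100):
    for seed in seed_corpus:
        for _ in range(mutations_per_seed):
            check_test_vector(mutate(seed))

# Gray-box (coverage-guided) fuzzing: keep mutated vectors which produce
# previously unseen implementation footprints.
def coverage_guided_fuzz(seed_corpus, impl_coverage, budget=10_000):
    corpus, seen = list(seed_corpus), set()
    for _ in range(budget):
        candidate = mutate(random.choice(corpus))
        check_test_vector(candidate)
        footprint = frozenset(impl_coverage(candidate))
        if footprint not in seen:   # new behavior unlocked: keep the input
            seen.add(footprint)
            corpus.append(candidate)
```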
In the future, test vectors obtained during client-specific fuzzing can be merged back and used to update the common test suite. That can even be done in an online fashion, i.e. while testing multiple FC implementations simultaneously.
The component common to both fuzzing approaches is the set of test vector mutators. A general problem with fuzz testing is that it is difficult to obtain truly diverse test vectors: while some behaviors are easy to reach, fuzzing eventually gets stuck in these "easy" regions. Obviously, some behaviors may demand very specific conditions to be satisfied, which can be extremely difficult to reach with random mutations.

One example is that straightforward mutation tends to generate invalid inputs, causing the SUT to repeatedly visit input validation paths. In general, it can be very difficult to generate semantically interesting inputs by dumb (input-structure-unaware) fuzzing.

One way to alleviate the problem is to design mutation operators which allow for better control of the expected behavior. For example, if we know that input validation paths are already well covered, we may want to reduce the importance of mutations which make test vectors invalid, e.g. by lowering the probability of applying such mutations.
We can design such mutation operators based on the FC spec decomposition. For example, a mutation operator can:
- drop or add an event (e.g. a tick, block or attestation)
- re-order events within a certain time period

Based on the decomposition, it is easier to predict whether such a mutation will change the expected behavior. For example, if we know that some event is invalid, it will be dropped; thus one can drop or add such events without affecting the overall outcome (of the FC spec). Similarly, certain events can be re-ordered within a certain time period. A sketch of such operators follows.
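Here, a test vector is modeled as a time-ordered list of `(time, event)` pairs, and `spec.is_valid` is a hypothetical validity predicate derived from the FC spec:

```python
import random

# The FC spec drops invalid events anyway, so removing one should not
# change the expected outcome - a behavior-preserving mutation.
def drop_invalid_event(vector, spec):
    invalid = [i for i, (t, e) in enumerate(vector) if not spec.is_valid(e, t)]
    if not invalid:
        return vector
    mutated = list(vector)
    del mutated[random.choice(invalid)]
    return mutated

# Events delivered close together in time can often be re-ordered
# without affecting the overall outcome (the window size is an assumption).
def swap_adjacent_events(vector, window=4):
    if len(vector) < 2:
        return vector
    i = random.randrange(len(vector) - 1)
    mutated = list(vector)
    if abs(mutated[i][0] - mutated[i + 1][0]) <= window:
        mutated[i], mutated[i + 1] = mutated[i + 1], mutated[i]
    return mutated
```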
As the FC spec is central in our setting, we believe it is justified to use white-box techniques to obtain sufficient spec coverage. We consider three approaches:
- restructuring/decomposing the FC spec, so that it becomes easier to sample
- bounded model checking, i.e. constructing new test vectors with the help of an SMT solver
- transpiling the FC spec to C and employing the model checking and testing tools available for C and LLVM
Let's illustrate with an example how one can exploit code structure to aid test generation. Assume that the SUT can be represented as a composition of two functions, `f(g(x))`. To generate a covering set of inputs, there are two approaches:
- sample `x` directly
- sample inputs to `g` and to `f` separately

Often, the second approach is easier, e.g. when `g` is contractive, in the sense that for a reachable output value of `g` there typically exist multiple input values leading to it.
A simple but realistic example is when `g` transforms sequences of things into sets of the same things. Assume additionally that `f` is independent of the order of elements in its input sets. Obviously, for each possible output of `g` - and correspondingly for each reachable input of `f` - there are multiple possible inputs of `g` (excluding possible degenerate cases like an empty list), including arbitrary reorderings and duplications of elements. Due to the above assumption (`f` is independent of input element order), a better test strategy is to cover `f` and `g` independently:
- cover `f` directly, sampling sets of elements instead of sequences of elements
- when covering `g`, one can ignore `f`'s code structure
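A toy model of this situation (all names hypothetical):

```python
# g contracts sequences to sets; f is independent of element order,
# so f and g can be covered separately.
def g(events):
    return frozenset(events)   # sequence -> set (a contractive step)

def f(event_set):
    return max(event_set)      # order-independent consumer

# Covering f directly: sample sets instead of sequences.
assert f(frozenset({1, 2, 3})) == 3

# Any reordering or duplication in g's input yields the same overall
# outcome, so such variations need not be covered in combination with f.
assert f(g([1, 2, 3])) == f(g([3, 1, 2, 2]))
```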
Note that the FC spec is quite similar to this example, i.e. it can be restructured so that its intermediate state consists of several sets plus a counter (current time):
NB The proposer boost can be represented as a set with at most one element.
Additionally, when choosing the head based on that information, one doesn't need the vast majority of the info contained in blocks and attestations - it is mostly required to check message validity. The only significant exception is that the justified checkpoint state is used to calculate participation and weights; however, that is the state of a single block, while a test vector can include multiple blocks.
Therefore sampling the intermediate state is much easier than sampling sequences of events.
Note also that such a decomposition can help with fuzzing inputs. E.g. if we have a seed vector for `f`, which is a set, we need to construct some `g` input leading to that vector, e.g. by choosing some order of elements. It is then easy to mutate the vector in a semantically meaningful way: one can re-order or duplicate elements, changing the `g` input while preserving the underlying set and thus the expected outcome.
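For instance, a sketch of such a mutation, producing a fresh `g` input from a set-valued seed:

```python
import random

# Pick a fresh order and duplicate a few elements: the resulting sequence
# is a new g-input with the same underlying set, hence the same expected
# outcome (max_dups is an arbitrary illustration parameter).
def sample_sequence(seed_set, max_dups=2):
    events = list(seed_set)
    random.shuffle(events)                  # arbitrary reordering
    for e in random.sample(events, k=min(max_dups, len(events))):
        events.insert(random.randrange(len(events) + 1), e)  # duplication
    return events
```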
Here is a high-level description of what a decomposed FC spec can look like. There is also an example of a similar spec restructuring, albeit pursuing a somewhat different goal (simplifying formal verification of the FC spec properties).
The current FC `Store` contains certain fields which can always be reconstructed (`block_states`, `checkpoint_states`, unrealized justifications, etc.). To simplify test generation, it makes sense to restructure the `Store`, so that the restructured version (let's call it `TestStore`) contains only the following fields:
- `blocks` (valid and in-time)
- `latest_messages`, excluding votes of validators in `equivocating_indices`
- `current_time`
- `proposer_boost`
The rest of the `Store` properties can be reconstructed with corresponding query methods (see the example above).
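A rough sketch of what the `TestStore` might look like (the field types are simplified stand-ins for the spec's types):

```python
from dataclasses import dataclass
from typing import Dict, Optional

Root = bytes  # stand-in for the spec's Root type

@dataclass
class TestStore:
    blocks: Dict[Root, 'BeaconBlock']            # valid and in-time blocks
    latest_messages: Dict[int, 'LatestMessage']  # excl. equivocating validators
    current_time: int
    proposer_boost: Optional[Root] = None        # a set with at most one element
```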
Processing test vector events (ticks, blocks, attestations, attester slashings) is performed via the following steps:
- `TestStore` updates
- invoking the `get_head` function (adjusted to work with the `TestStore`)
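A minimal sketch of this processing loop, with a hypothetical `apply_event` helper performing the `TestStore` updates:

```python
# Replay a test vector against the TestStore, collecting the head after
# each event (get_head is the spec function adjusted for the TestStore).
def process_test_vector(store: TestStore, events) -> list:
    heads = []
    for event in events:               # ticks, blocks, attestations, slashings
        apply_event(store, event)      # TestStore updates
        heads.append(get_head(store))  # adjusted get_head
    return heads
```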
Coverage-guided fuzzing is a gray-box approach and thus often inefficient: it tends to get stuck in "easy" regions, and it can be quite tricky to make it reach difficult code locations. One way to resolve the problem is to employ a white-box approach, i.e. take control-flow and data-flow dependencies into account when constructing test vectors.
Bounded model checking is one rather straightforward approach to such test generation. Assume we have a test vector. We can inspect its execution trace, i.e. record the branches taken and construct path conditions and data flow equations along the trace. Now, to sample a new path, we can invert one path condition (e.g. the last one) and ask an SMT solver to solve the corresponding SMT problem. If it succeeds, the solution can be converted into a new test vector, which will lead to a different execution path (due to the inverted path condition).
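A minimal sketch of the idea with the `z3` Python bindings (`pip install z3-solver`), using toy path conditions recorded along a hypothetical trace:

```python
from z3 import Int, Solver, Not, sat

slot = Int('slot')   # symbolic test vector fields (toy example)
time = Int('time')

# Path conditions recorded along the observed trace, e.g.:
#   1. slot >= 0          (input validation passed)
#   2. time >= slot * 12  (the block is not from the future)
path_conditions = [slot >= 0, time >= slot * 12]

solver = Solver()
solver.add(*path_conditions[:-1])     # keep the trace prefix
solver.add(Not(path_conditions[-1]))  # invert the last branch

if solver.check() == sat:
    model = solver.model()
    # The model converts into a new test vector which takes the other
    # side of the last branch.
    print({'slot': model[slot], 'time': model[time]})
```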
Since the FC spec is relatively simple - a) there is only one spec, compared to several implementations, and b) it is relatively unoptimized - it makes sense to employ the model checking approach to generate tests. Additionally, as we have seen above, it can be restructured to be even easier to sample.
Model checking is not universally applicable; there can be tough problems:
- functions like `state_transition` and `process_slots`, which can turn out to be too complicated even for powerful SMT solvers

Therefore, it might be necessary to help the test generator resolve these problems, e.g. by providing annotations or perhaps even by customizing certain test generation procedures.
There is a lack of formal method tools for Python. So, perhaps the best way to generate tests for the FC spec is to transpile it to C and employ the wide selection of model checking and testing tools available for C and LLVM.
A very incomplete list of model checkers for C: CBMC, ESBMC, CPAchecker, SeaHorn.
Our approach has now become rather sophisticated; however, in practice it still does not guarantee really good coverage - in particular, due to the gray-box fuzzing of FC implementations.
We may still have to rely on a brute-force approach, i.e. generate as many test scenarios as possible. It is therefore important to ensure efficient execution of the test scenarios, especially considering that there are several FC implementations.
There can be the following obstacles:
- a test scenario execution results in a single `get_head` property check; it might be more productive to validate additional properties (we have discussed the problem above)

We expect that FC client implementers aim at a high quality of their implementations and will be willing to cooperate here, if the proposed enhancements are reasonable. We also expect that such improvements are already (perhaps partially) implemented.
Coverage-guided fuzzing relies on recorded execution behavior: if a new, previously unseen execution trace is encountered, the fuzzing heuristic concludes that the corresponding program input is interesting, as it has unlocked new behaviors. It thus makes sense to pay more attention to that input by mutating it more often.

In practice, the observed behavior (e.g. an execution trace) can be affected by other factors, like non-determinism and global state. Therefore, freshly observed behavior doesn't necessarily mean an interesting input has been discovered. In such cases, the efficiency of coverage-guided fuzzing is undermined.
Therefore, it is important to ensure that the SUT - i.e. an FC implementation under test - can be configured to run in a deterministic fashion. Additionally, global state changes (which can affect the recorded execution behavior) shouldn't normally be preserved across individual test case executions.
A fuzz target should normally be very fast. However, in the case of FC, the vast majority of time is spent in BLS and hashing routines, which inevitably results in very slow test execution. While in general it is important to test such routines too, that can be done outside of fuzz testing. To improve fuzz testing efficiency, it is crucial to be able to replace the slow production BLS and hash routines with lightweight testing routines, as sketched below.
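For example, a sketch of such lightweight test doubles (the function names are hypothetical; the wiring mechanism is client-specific):

```python
import hashlib

def fake_bls_verify(pubkey: bytes, message: bytes, signature: bytes) -> bool:
    # Accept everything: signature checks are assumed to be tested elsewhere.
    return True

def fake_bls_sign(privkey: int, message: bytes) -> bytes:
    # Deterministic placeholder "signature", keeping test vectors reproducible.
    return hashlib.sha256(privkey.to_bytes(32, 'little') + message).digest()

# During fuzzing, the SUT is wired (e.g. via dependency injection or a
# build flag) to use the fakes instead of the real BLS library.
```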
Another important factor is that the SUT should be stripped of functionality which is irrelevant to the FC itself, e.g. logging, gathering statistics, initialization of unrelated modules, etc.
TODO
The FC spec can be transpiled to C/C++ or some other compiled language to improve its execution speed. Transpiling to C/C++ is mostly beneficial since:
- it results in much faster test execution
- a wide selection of model checking and testing tools is available for C and LLVM (as discussed above)
NB To obtain a dramatic speed-up, type inference should be performed. This is possible, since the FC Python spec is well annotated.
A downside of the approach is that the derived C spec is different from the original one; thus one cannot be sure that we are testing FC implementations against the original FC spec.

However, the problem is relatively easy to alleviate. The transpilation is performed on a method-by-method basis. Therefore, it is much easier to validate that the transpiled version behaves correctly, since we can split the work into a number of sub-tasks: validating that each transpiled method behaves correctly. Since methods in the FC spec are quite simple, it is relatively easy to ensure good test coverage. Additionally, such a method-level approach lets us construct method inputs which might be unreachable within whole-spec (system- or integration-level) testing.
We roughly estimated that a test version of the Python FC spec can run roughly 10-30 times slower than the corresponding test version of the FC spec transpiled to C. By a test version we assume:
- production BLS and hash routines replaced with lightweight testing counterparts
- functionality irrelevant to the FC itself stripped off
We didn't run dedicated performance tests, relying instead on freely available Python vs C benchmarks. Additionally, beacon-fuzz developers communicated to us that Python spec performance was a real bottleneck for state transition fuzzing.
Code coverage measures/criteria can be affected by code transformations. For example, real-world code consists of nested expressions, and statements can include complex expressions; if one flattens/simplifies such expressions, new statements are introduced.

This is also true of branching: e.g. the control flow inside short-circuit operators can be made explicit by corresponding transformations. Another example is that exception handling can be transformed into explicit control flow.
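A toy illustration of such a desugaring (the function and constant are hypothetical):

```python
FINALIZED_SLOT = 0  # hypothetical constant for illustration

def is_viable(block):
    # One statement; the second decision is hidden inside `and`.
    return block.justified and block.slot > FINALIZED_SLOT

def is_viable_desugared(block):
    # The short-circuit is now explicit, so statement/branch coverage
    # distinguishes the case where block.justified is False.
    if not block.justified:
        return False
    return block.slot > FINALIZED_SLOT
```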
It thus might be beneficial to implement such desugaring transformations in order to obtain finer-grained and more precise coverage measures. In the case of the FC spec, this will force test generation to yield more precise FC spec coverage, in the sense that more information (statements, branches, conditions) is taken into account when calculating the coverage measures. Due to the central role of the FC spec, such precise coverage might be crucial.

Note also that while such desugaring might be difficult to implement from scratch, it is naturally obtained when transpiling the Python spec to C.
In general, transpiling Python to C can be a challenging problem, at least if one wants a high-quality translation. In particular, one wants type inference in order to obtain really fast code.
In practice, the problem can be simplified significantly: