Franck Cassez, Head of Research at Mantle. — all views expressed are my own.
In a previous post I have introduced an algorithm to construct control flow graphs (CFGs) for programs in EVM bytecode. CFGs can be used for security audits or static analyses to provide strong guarantees on the program's behaviours. The assumption is that the CFG of a program provides an over-approximation of the behaviours of the program. As a result, safety properties that hold on the CFG hold on the program too.
In this post, I'll give a formal definition of correctness for CFGs and an efficient correctness test that solely depends on the semantics of the EVM.
Figure 1, top right, shows a simple example of a Yul program with a single loop. This Yul code can be compiled to EVM bytecode using the Solidity compiler (solc --yul
) which generates an hexadecimal string simpleLoop.bin
:
The sequence of bytes in simpleLoop.bin
is the program's EVM bytecode. It can be decompiled (we use our tool ByteSpector) into a more readable format, a CFG, Figure 1, left.
Figure 1 – Control Flow Graph for the binary representation of SimpleLoop.bin
. The 0x
column is the address of the instruction on the right hand-side. Most of the instructions occupy one byte in the bytecode except the PUSHk
instructions that occupy k + 1
bytes. The instruction to executed next is pointed to by the program counter pc
. Plain arrows indicate that the control flows to the next instruction at address pc + 1
(+ k
for PUSHk
), e.g. 0x01 DUP1
is followed by 0x02 JUMPDEST
, and dashed arrows indicate a jump to a target address whjich at the top of the stack, e.g. 0x20 JUMP
is followed by 0x02 JUMPDEST
.
In our context, a CFG is a graph the nodes of which are labelled with segments of EVM bytecode (as in Figure 1).
The graph has a designated intial node, in our example, the node with label segment 0
. A path in a graph is a sequence of consecutive edges (arrows) starting at the initial node. For instance the sequence segment 0 to segment 1 to segment 3 to segment 1
, is a path in the CFG above. Segment 0 to segment 2
is not a path. A path defines a sequence of EVM instructions, an execution, obtained by concatenating the labels (segments) of the nodes on the path. For instance the path segment 0 to segment 1
defines the execution that is the concatentation of the bytecode of segment 0
followed by the bytecode of segment 1
. We write Execs(g)
for the set of executions that are defined by the finite paths in g
.
Segment 1
to Segment 3
while the program only allows 10. This is perfectly fine as a CFG is meant to over-approximate the set of executions of a given program.
segment 0
to segment 1
in the CFG, it does not necessarily imply that the sequence of instructions of segment 0
followed by segment 1
is a feasible sequence of instructions. This is the intent when building a CFG but there could be some bugs in the CFG generation algorithm. So to make sure this is satisfied, we need to take into account the semantics of the instructions.
To be more concrete, assume that the instruction at 0x07
in segment 1
, Figure 1, is PUSH1 0x0a
instead of PUSH 0x13
but we still have an edge from segment 1
to segment 3
. The execution of PUSH 0x0a
followed by JUMPI
if the condition is true, leads to 0x0a
, not to 0x13
and the CFG is incorrect in the sense that executing the code in the segments does not reflect the information and labels of the CFG. In the previous example, you may argue that the error is obvious, but in a more general context and more complex or larger bytecode, the target of a JUMP/JUMPI
instruction may have been pushed a few segments before, for instance to implement a function call/return, and in that case it may not be a constant value. The objective here is to detect errors, a posteriori, that could have happened in the construction of CFG, or to confirm the absence of errors (and validate) in the CFG.
Given an input i
(e.g., the calldata
), the execution of a program prog
(in EVM bytecode) defines a unique[2] trace: the sequence of instructions executed during the execution of prog
on input i
. We write Execs(prog)
for the set of finite traces of prog
over all inputs.
We can now provide a precise definition of correctness for CFGs.
g
and a program prog
, g
is a correct CFG for prog
iff Execs(prog)
is included in Execs(g)
.
Of course this definition allows CFGs to be very permissive and for instance a trivial CFG with edges between any two segments of code satisfy this definition. The art of building useful CFGs is to build over-approximations that are not too coarse and can be used to guarantee some useful properties of the program.
Nice! We now have a well-defined criterion to check correctness of CFGs. How do we implement the verification of the criterion?
We can provide a sufficient condition (test) for the correctness criterion above in the form of a simulation relation check on the CFG. What is that? We check the following property P1
on the CFG: given a node src
in the CFG, if there is an edge (src, tgt)
, we are going to prove that executing (using the semantics of the EVM) the code in src
actually leads to the start address of the code in tgt
.
Let prog
be the program defined by the code in the segments in the nodes of the CFG. Assume the execution of the code in src
leads to target address a
; if P1
is true, this execution is matched by the edge (src, tgt)
in the CFG, because the start address of tgt
is a
. In this sense every edge (src, tgt)
in the CFG simulates (reads "can do the same and maybe more") the code in src
. A property of simulation relation for CFG (finite automata) is that, by induction, they entail language inclusion which is exactly the correctness criterion.
P1
is true for every edge in the CFG g
, and prog
is the program defined by the code in the segments of g
, then Execs(prog)
is included in Execs(g)
and g
is a correct CFG for prog
.
We can use the Dafny-EVM, a formal and executable semantics of the EVM in Dafny, to prove P1
for each edge. The Dafny-EVM provides the semantics of each EVM instruction as a state transformer. For instance, the effect of PUSH0
is defined by the function Push0
like so:
This function applies to any Evm.ExecutingState
, a non-error state of the EVM, and returns an Evm.State
. An Evm.State
is either an Evm.ExecutingState
, if the execution of the instruction does not result in an error or revert (for Push0
, this requires that the stack in state st
is not full), or an error-state (this is the case if the capacity of the stack is less than 1). In the Dafny-EVM, the type Evm.State
comprises of the program counter, stack, memory storage, context, gas etc.
Using the Dafny-EVM semantics, we can encode the effect of each segment of code in the CFG as the composition of the semantic functions of the instructions in the segment:
The code above contains some statements, requires
and assert
that are verification statements. For instance, a requires
statement is a precondition. The preconditions of ExecuteFromCFGNode_seg0
specify that this function is only defined for states st0
such that i) the PC
is 0x0
(line 3), and ii) the stack is empty (line 5), and iii) the capacity[3] of the stack is larger than 1.
And now we leverage the automated reasoning power of verification-friendly languages like Dafny. Dafny checks that preconditions hold at each call site. For instance, the call to the function ExecuteFromCFGNode_seg1(st2)
at line 13, triggers a verification that for all possible st2
, st2
satisfies the preconditions of ExecuteFromCFGNode_seg1
(lines 17, 19–20). If this is not satisfied, Dafny reports an error. On the other hand, if the code of ExecuteFromCFGNode_seg0
verifies, this implies that for any input state st0
that satisfies the preconditions of ExecuteFromCFGNode_seg0
, we are guaranteed to reach a state st2
that satisfies the preconditions of ExecuteFromCFGNode_seg1
. And as a result, it proves that executing in the EVM the instructions of segment 1
leads to the first instruction of segment 0
.
This gives us a way to check that an edge in the graph satisfies P1
.
(src, tgt)
in the graph g
as a Dafny function that captures the effect of the edge. If all the functions verify, then all the edges are semantically correct and satisfy the property P1
.
For final nodes in the graph with no outgoing edges, e.g. a segment that ends in revert
, return
or stop
, we have a simpler encoding. For instance, the node Segment 2
in the CFG above is defined by the function, without a call to a function at the end:
Let Verifier(g)
be the Dafny program defined above given a CFG g
. We can verify Verifier(g)
with Dafny, and
Verifier(g)
verifies, then g
is a correct CFG.
The previous strategy is good but it uses the exact semantics of the EVM. An Evm.State
in the Dafny-EVM is a data structure that comprises all the fields that are needed to compute the new state: the context, the stack, remaining gas etc.
This makes the verification task for each edge resource-intensive, and sometimes Dafny timeout and cannot verify the correctness of a segment. The reason is that, to prove the correctness of a sequence of instructions, a verification condition is built by Dafny, and its size depends on the size of the sequence and on the complexity of each function that gives the semantics of an EVM instruction. So for complex instructions (e.g. memory, storage), or long sequences of instructions (more than 40 instructions), the verification condition is too large and the verification tasks times out (if we allow say 20 seconds).
This is why in the actual implementation of the correctness test I use an abstract semantics of the EVM that ignores lots of components of the Emv.State
. For instance I use an abstract state of the form:
with only two fields the pc
and the stack
.
The abstract semantics of the EVM instructions is much simpler, for instance this is the abstract semantcis of Pop
and SStore
:
The actual correctness test uses the abstract semantics, and with this trick and a much simpler semantics, we can verify large segments (more than 40 instructions).
A
simulates B
and B
simulates C
then A
simulates C
;C
) simulates the abstract semantics B
; this is achieved with the test I presented above using the abstract semantics intead of the concrete Dafny-EVM; andB
simulates the Dafny-EVM semantics A
.The conclusion is that the CFG simulates the Dafny-EVM semantics.
What are the benefits? The abstract semantics is much cheaper (verification-wise) to verify when we test Verifier(g)
.
Note that the proof of simulation of the concrete (Dafny-EVM) semantics by the abstract one (step 3.) can be done once, and in Dafny.
I'll skip over the details, but the proof of 3. amounts to defining an Abstraction
function that maps Evm.State
to EState
and then proving that, for each EVM instruction, the abstract semantics simulates the Dafny-EVM semantics. It looks like this for the Pop
instruction:
We can unpack the lemma above and explain what it captures:
st: Evm.ExecutingState
, for alls: EState
that abstracts (requires
, line 3) st
,Bytecode.Pop
(concrete semantics) from st
and it leads to an ExecutingState
(non-error) Bytecode.Pop(st)
, line 4, then:
Pop
(Pop.requires(s)
, line 5) andPop(s)
abstracts Bytecode.Pop(st)
, line 5.This lemma is automatically verified by Dafny without a need for any proof hints (the body of the lemma is left empty). We can prove a similar lemma for each instruction whereby proving that the abstract semantics simulates the concrete one.
Using the technique presented above, we can check whether a CFG is correct for some program in EVM bytecode. The test is very efficient as it uses an abstract semantics that id easy to verify using a verification-friendly language like Dafny. With the abstract semantics, we can verify the CFG for contracts like an ERC-20 or the Deposit Contract.
The main advantages of this approach are:
All the steps in the previous technique are formally verified and machine-checkable.
To the best of my knowledge, this is the first technique that provides verified control flow graphs, and the guarantee that all the executions of a program are indeed executions of the CFG. As a result we provide a very high level of security, every step of the way being machine-checkable. This can be leveraged to conclude that safety properties that are satisfied by the CFG are guaranteed to hold on the EVM bytecode program.
While developing the CFG verification technique, I have uncovered a subtle bug in the algorithm that generates the CFGs. The bug was discovered while verifying one CFG that did not verify … and rightly so because it was incorrect. The bug was in the implementation of the loop invariant detection module, where a condition was missing. It affected only one CFG, but still, it was a bug.
On the positive side, all the other test CFGs (except the one mentioned previously) passed the verification test, which demonstrates that the construction algorithm is reliable and robust.