Verified Control Flow Graphs for EVM Bytcode

Franck Cassez, Head of Research at Mantle. — all views expressed are my own.

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

In a previous post I have introduced an algorithm to construct control flow graphs (CFGs) for programs in EVM bytecode. CFGs can be used for security audits or static analyses to provide strong guarantees on the program's behaviours. The assumption is that the CFG of a program provides an over-approximation of the behaviours of the program. As a result, safety properties that hold on the CFG hold on the program too.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

But what if this assumption does not hold and the CFG is flawed? What if the algorithm that generates the CFG is flawed? The fundamental premise of the security audit does not hold and this may invalidate all of its conclusions.

In this post, I'll give a formal definition of correctness for CFGs and an efficient correctness test that solely depends on the semantics of the EVM.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

This provides an efficient automated technique to formally guarantee that the CFGs over-approximate the behaviours of a program, and are correct. They can be used as trustworthy inputs to security audits.

What is a correct CFG?

Figure 1, top right, shows a simple example of a Yul program with a single loop. This Yul code can be compiled to EVM bytecode using the Solidity compiler (solc --yul) which generates an hexadecimal string simpleLoop.bin:

5f805b600a81106013575060405260206040f35b906001600a9101919050600256

The sequence of bytes in simpleLoop.bin is the program's EVM bytecode. It can be decompiled (we use our tool ByteSpector) into a more readable format, a CFG, Figure 1, left.

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

Figure 1 – Control Flow Graph for the binary representation of SimpleLoop.bin. The 0x column is the address of the instruction on the right hand-side. Most of the instructions occupy one byte in the bytecode except the PUSHk instructions that occupy k + 1 bytes. The instruction to executed next is pointed to by the program counter pc. Plain arrows indicate that the control flows to the next instruction at address pc + 1 (+ k for PUSHk), e.g. 0x01 DUP1 is followed by 0x02 JUMPDEST, and dashed arrows indicate a jump to a target address whjich at the top of the stack, e.g. 0x20 JUMP is followed by 0x02 JUMPDEST.

Executions of a CFG

In our context, a CFG is a graph the nodes of which are labelled with segments of EVM bytecode (as in Figure 1).

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

The segment in a CFG are linear in the sense that the instructions in a segment are executed one after the other. (Un-)conditional jumps can only occur at the end of a segment.

The graph has a designated intial node, in our example, the node with label segment 0. A path in a graph is a sequence of consecutive edges (arrows) starting at the initial node. For instance the sequence segment 0 to segment 1 to segment 3 to segment 1, is a path in the CFG above. Segment 0 to segment 2 is not a path. A path defines a sequence of EVM instructions, an execution, obtained by concatenating the labels (segments) of the nodes on the path. For instance the path segment 0 to segment 1 defines the execution that is the concatentation of the bytecode of segment 0 followed by the bytecode of segment 1. We write Execs(g) for the set of executions that are defined by the finite paths in g.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

A first observation is that the CFG in Figure 1 does not capture loop bounds accurately^[1] as it allows an arbitrary (but finite) number of loops from Segment 1 to Segment 3 while the program only allows 10. This is perfectly fine as a CFG is meant to over-approximate the set of executions of a given program.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

A second observation is that, if there is an edge fromsegment 0 to segment 1 in the CFG, it does not necessarily imply that the sequence of instructions of segment 0 followed by segment 1 is a feasible sequence of instructions. This is the intent when building a CFG but there could be some bugs in the CFG generation algorithm. So to make sure this is satisfied, we need to take into account the semantics of the instructions.

To be more concrete, assume that the instruction at 0x07 in segment 1, Figure 1, is PUSH1 0x0a instead of PUSH 0x13 but we still have an edge from segment 1 to segment 3. The execution of PUSH 0x0a followed by JUMPI if the condition is true, leads to 0x0a, not to 0x13 and the CFG is incorrect in the sense that executing the code in the segments does not reflect the information and labels of the CFG. In the previous example, you may argue that the error is obvious, but in a more general context and more complex or larger bytecode, the target of a JUMP/JUMPI instruction may have been pushed a few segments before, for instance to implement a function call/return, and in that case it may not be a constant value. The objective here is to detect errors, a posteriori, that could have happened in the construction of CFG, or to confirm the absence of errors (and validate) in the CFG.

Semantics of EVM bytecode

Given an input i (e.g., the calldata), the execution of a program prog (in EVM bytecode) defines a unique^[2] trace: the sequence of instructions executed during the execution of prog on input i. We write Execs(prog) for the set of finite traces of prog over all inputs.

Correctness criterion

We can now provide a precise definition of correctness for CFGs.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Given a CFG g and a program prog, g is a correct CFG for prog iff Execs(prog) is included in Execs(g).

Of course this definition allows CFGs to be very permissive and for instance a trivial CFG with edges between any two segments of code satisfy this definition. The art of building useful CFGs is to build over-approximations that are not too coarse and can be used to guarantee some useful properties of the program.

Nice! We now have a well-defined criterion to check correctness of CFGs. How do we implement the verification of the criterion?

A sufficient condition to prove correctness

We can provide a sufficient condition (test) for the correctness criterion above in the form of a simulation relation check on the CFG. What is that? We check the following property P1 on the CFG: given a node src in the CFG, if there is an edge (src, tgt), we are going to prove that executing (using the semantics of the EVM) the code in src actually leads to the start address of the code in tgt.

Let prog be the program defined by the code in the segments in the nodes of the CFG. Assume the execution of the code in src leads to target address a; if P1 is true, this execution is matched by the edge (src, tgt) in the CFG, because the start address of tgt is a. In this sense every edge (src, tgt) in the CFG simulates (reads "can do the same and maybe more") the code in src. A property of simulation relation for CFG (finite automata) is that, by induction, they entail language inclusion which is exactly the correctness criterion.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

If P1 is true for every edge in the CFG g, and prog is the program defined by the code in the segments of g, then Execs(prog) is included in Execs(g) and g is a correct CFG for prog.

We can use the Dafny-EVM, a formal and executable semantics of the EVM in Dafny, to prove P1 for each edge. The Dafny-EVM provides the semantics of each EVM instruction as a state transformer. For instance, the effect of PUSH0 is defined by the function Push0 like so:









function Push0(st: Evm.ExecutingState): (st': Evm.State)
{
  if st.Capacity() >= 1 then 
    // Push 0 on the stack and advance to next instruction
    st.Push(0).Next()
  else  
    // The stack is full and we cannot push.   
    ERROR(STACK_OVERFLOW)
}

This function applies to any Evm.ExecutingState, a non-error state of the EVM, and returns an Evm.State. An Evm.State is either an Evm.ExecutingState, if the execution of the instruction does not result in an error or revert (for Push0, this requires that the stack in state st is not full), or an error-state (this is the case if the capacity of the stack is less than 1). In the Dafny-EVM, the type Evm.State comprises of the program counter, stack, memory storage, context, gas etc.

Using the Dafny-EVM semantics, we can encode the effect of each segment of code in the CFG as the composition of the semantic functions of the instructions in the segment:






























function ExecuteFromCFGNode_seg0(st0: Evm.ExecutingState): (st': Evm.State)
  // PC requirement for this node.
  requires st0.PC() == 0x0 as nat
  // Stack size requirements for this node.
  requires st0.Operands() >= 0
  requires st0.Capacity() >= 2
{
    var st1 := Push0(st0);
    var st2 := Dup(st1, 1);
    assert st2.EXECUTING?;
    // Edge from seg0 to seg1, execute seg1 from st2
    ExecuteFromCFGNode_seg1(st2)
}

function ExecuteFromCFGNode_seg1(st0: Evm.ExecutingState): (st': Evm.State)
  // PC requirement for this node.
  requires st0.PC() == 0x2 as nat
  // Stack size requirements for this node.
  requires st0.Operands() >= 2
  requires st0.Capacity() >= 2
{
  var st1 := JumpDest(st0);
  var st2 := PushN(st1, 1, 0x0a);
  var st3 := Dup(st2, 2);
  var st4 := Lt(st3);
  var st5 := PushN(st4, 1, 0x13);
  var st6 := JumpI(st5);
  ...
}
...

The code above contains some statements, requires and assert that are verification statements. For instance, a requires statement is a precondition. The preconditions of ExecuteFromCFGNode_seg0 specify that this function is only defined for states st0 such that i) the PC is 0x0 (line 3), and ii) the stack is empty (line 5), and iii) the capacity^[3] of the stack is larger than 1.

And now we leverage the automated reasoning power of verification-friendly languages like Dafny. Dafny checks that preconditions hold at each call site. For instance, the call to the function ExecuteFromCFGNode_seg1(st2) at line 13, triggers a verification that for all possible st2, st2 satisfies the preconditions of ExecuteFromCFGNode_seg1 (lines 17, 19–20). If this is not satisfied, Dafny reports an error. On the other hand, if the code of ExecuteFromCFGNode_seg0 verifies, this implies that for any input state st0 that satisfies the preconditions of ExecuteFromCFGNode_seg0, we are guaranteed to reach a state st2 that satisfies the preconditions of ExecuteFromCFGNode_seg1. And as a result, it proves that executing in the EVM the instructions of segment 1 leads to the first instruction of segment 0.

This gives us a way to check that an edge in the graph satisfies P1.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

We can encode every edge (src, tgt) in the graph g as a Dafny function that captures the effect of the edge. If all the functions verify, then all the edges are semantically correct and satisfy the property P1.

For final nodes in the graph with no outgoing edges, e.g. a segment that ends in revert, return or stop, we have a simpler encoding. For instance, the node Segment 2 in the CFG above is defined by the function, without a call to a function at the end:















function ExecuteFromCFGNode_seg2(st0: Evm.ExecutingState): (st': Evm.State)
  // PC requirement for this node.
  requires st0.pc == 0xa as nat
  // Stack requirements for this node.
  requires st0.Operands() >= 2
{
  var st1 := Pop(st0);
  var st2 := PushN(st1, 1, 0x40);
  var st3 := MStore(st2);
  var st4 := PushN(st3, 1, 0x20);
  var st5 := PushN(st4, 1, 0x40);
  var st6 := Return(st5);
  // Segment is terminal (Revert, Stop or Return)
  st6
}

Let Verifier(g) be the Dafny program defined above given a CFG g. We can verify Verifier(g) with Dafny, and

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

if Verifier(g) verifies, then g is a correct CFG.

An efficient machine-checkable test of correctness

The previous strategy is good but it uses the exact semantics of the EVM. An Evm.State in the Dafny-EVM is a data structure that comprises all the fields that are needed to compute the new state: the context, the stack, remaining gas etc.

This makes the verification task for each edge resource-intensive, and sometimes Dafny timeout and cannot verify the correctness of a segment. The reason is that, to prove the correctness of a sequence of instructions, a verification condition is built by Dafny, and its size depends on the size of the sequence and on the complexity of each function that gives the semantics of an EVM instruction. So for complex instructions (e.g. memory, storage), or long sequences of instructions (more than 40 instructions), the verification condition is too large and the verification tasks times out (if we allow say 20 seconds).

This is why in the actual implementation of the correctness test I use an abstract semantics of the EVM that ignores lots of components of the Emv.State. For instance I use an abstract state of the form:





datatype EState = EState(pc: nat, stack: seq<Int.u256>)
 {
    // The number of operands on the stack (stack size).
    function Operands(): nat { |stack| }
 }

with only two fields the pc and the stack.

The abstract semantics of the EVM instructions is much simpler, for instance this is the abstract semantcis of Pop and SStore:












function Pop(s: EState): (s': EState)
    requires s.Operands() >= 1
{
    EState(s.pc + 1, s.stack[1..])
}

function SStore(s: EState): (s': EState)
    requires s.Operands() >= 2
    ensures s'.Operands() == s.Operands() - 2
{
    EState(s.pc + 1, s.stack[2..])
}

The actual correctness test uses the abstract semantics, and with this trick and a much simpler semantics, we can verify large segments (more than 40 instructions).

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

But why is it safe to use the abstract semantics instead of the concrete and precise Dafny-EVM semantics? It is because:

simulation is a transitive property. If A simulates B and B simulates C then A simulates C;
so we can prove that the CFG (C) simulates the abstract semantics B; this is achieved with the test I presented above using the abstract semantics intead of the concrete Dafny-EVM; and
we can prove that the abstract semantics B simulates the Dafny-EVM semantics A.

The conclusion is that the CFG simulates the Dafny-EVM semantics.

What are the benefits? The abstract semantics is much cheaper (verification-wise) to verify when we test Verifier(g). Note that the proof of simulation of the concrete (Dafny-EVM) semantics by the abstract one (step 3.) can be done once, and in Dafny.

I'll skip over the details, but the proof of 3. amounts to defining an Abstraction function that maps Evm.State to EState and then proving that, for each EVM instruction, the abstract semantics simulates the Dafny-EVM semantics. It looks like this for the Pop instruction:







// A Dafny lemma to prove that abstract Pop simulates Bytecode.Pop
lemma SimulationCorrectPop(s: EState, st: Evm.ExecutingState)
  requires s.Abstracts(st) 
  ensures Bytecode.Pop(st).EXECUTING? ==> 
     Pop.requires(s) && Pop(s).Abstracts(Bytecode.Pop(st))
{   //  Thanks Dafny
}

We can unpack the lemma above and explain what it captures:

given a concrete st: Evm.ExecutingState, for alls: EState that abstracts (requires, line 3) st,
if we can execute Bytecode.Pop (concrete semantics) from st and it leads to an ExecutingState (non-error) Bytecode.Pop(st), line 4, then:
- we can execute abstract Pop (Pop.requires(s), line 5) and
- the resulting state Pop(s) abstracts Bytecode.Pop(st), line 5.

This lemma is automatically verified by Dafny without a need for any proof hints (the body of the lemma is left empty). We can prove a similar lemma for each instruction whereby proving that the abstract semantics simulates the concrete one.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

This provides a machine-checkable proof that the use of the abstract semantics in the test is correct.

Conclusion

Using the technique presented above, we can check whether a CFG is correct for some program in EVM bytecode. The test is very efficient as it uses an abstract semantics that id easy to verify using a verification-friendly language like Dafny. With the abstract semantics, we can verify the CFG for contracts like an ERC-20 or the Deposit Contract.

The main advantages of this approach are:

we do not need to trust the algorithm that computes the CFG. It is not part of the Trusted Computing Base (TCB). The result of the algorithm can be independently checked against the actual semantics of the EVM.
the use of the abstract semantics vs the reference Dafny-EVM semantics is provably correct.

All the steps in the previous technique are formally verified and machine-checkable.

To the best of my knowledge, this is the first technique that provides verified control flow graphs, and the guarantee that all the executions of a program are indeed executions of the CFG. As a result we provide a very high level of security, every step of the way being machine-checkable. This can be leveraged to conclude that safety properties that are satisfied by the CFG are guaranteed to hold on the EVM bytecode program.

And … it is already been useful

While developing the CFG verification technique, I have uncovered a subtle bug in the algorithm that generates the CFGs. The bug was discovered while verifying one CFG that did not verify … and rightly so because it was incorrect. The bug was in the implementation of the loop invariant detection module, where a condition was missing. It affected only one CFG, but still, it was a bug.

On the positive side, all the other test CFGs (except the one mentioned previously) passed the verification test, which demonstrates that the construction algorithm is reliable and robust.

This is something that can actually be done by unfolding loops and get a more precise CFG. ↩︎
The EVM is deterministic. ↩︎
This follows from the fact that the stack is empty and the maximum capacity is 1024, so for segment 0, this precondition is redundant. ↩︎

What is a correct CFG?

Executions of a CFG

Semantics of EVM bytecode

Correctness criterion

A sufficient condition to prove correctness

An efficient machine-checkable test of correctness

Conclusion

And … it is already been useful

Read more

Fees on Rollups

Fair Sequencer

Formal verification of a zk gadget

Control Flow Graph reconstruction for EVM bytecode