--- title: Verified Control Flow Graphs for EVM Bytcode tags: EVM, Ethereum, Dafny, Semantics breaks: false --- [Franck Cassez](https://franck44.github.io), Head of Research at [Mantle](https://www.mantle.xyz). — all views expressed are my own. [![Twitter URL](https://img.shields.io/twitter/url/https/twitter.com/franckDownunder.svg?style=social&label=Follow%20%40FranckDownunder)](https://twitter.com/franckDownunder) In a previous [post]() I have introduced an algorithm to construct control flow graphs (CFGs) for programs in EVM bytecode. CFGs can be used for security audits or static analyses to provide strong guarantees on the program's behaviours. The assumption is that the CFG of a program provides an _over-approximation_ of the behaviours of the program. As a result, safety properties that hold on the CFG hold on the program too. :::warning :warning: But what if this assumption does not hold and the CFG is flawed? What if the algorithm that generates the CFG is flawed? The fundamental premise of the security audit does not hold and this may invalidate all of its conclusions. ::: In this post, I'll give a formal definition of _correctness_ for CFGs and an _efficient correctness test_ that solely depends on the _semantics_ of the EVM. :::success :bulb: This provides an **efficient automated technique to** formally **guarantee** that the **CFGs** over-approximate the behaviours of a program, and are **correct**. They can be used as **trustworthy** inputs to security audits. ::: ## What is a correct CFG? Figure 1, top right, shows a simple example of a [Yul](https://docs.soliditylang.org/en/latest/yul.html) program with a single loop. This Yul code can be compiled to EVM bytecode using the Solidity compiler (`solc --yul`) which generates an hexadecimal string `simpleLoop.bin`: ``` 5f805b600a81106013575060405260206040f35b906001600a9101919050600256 ``` The sequence of bytes in `simpleLoop.bin` is the program's EVM bytecode. It can be decompiled (we use our tool [ByteSpector](https://bytespector.org/#)) into a more readable format, a CFG, Figure 1, left. <!-- (the segments in Figure 1): the hexadecimal EVM opcodes e.g. `5f`, `80` are disassembled as EVM _instructions_, `PUSH0`, `DUP1`, and the code is structured into _linear segments_, i.e. sections of the bytecode where instructions are executed in the order (top to bottom) in which they appear in a segment. `JUMP/JUMPI/REVERT/RETURN/STOP` instructions can only be at the last position in a segment.^[In the implementation of thr algorithm, I have actually _proved_ this property for all input bytecode.] --> ![cfg-simpleLoop](https://hackmd.io/_uploads/ryvaaKnup.jpg) **Figure 1** -- Control Flow Graph for the binary representation of `SimpleLoop.bin`. The `0x` column is the address of the instruction on the right hand-side. Most of the instructions occopy one byte in the bytecode except the `PUSHk` instructions that occupy `k + 1` bytes. The instruction to executed next is pointed to by the _program counter_ `pc`. Plain arrows indicate that the control flows to the _next_ instruction at address `pc + 1` (`+ k` for `PUSHk`), e.g. `0x01 DUP1` is followed by `0x02 JUMPDEST`, and dashed arrows indicate a _jump_ to a target address whjich at the top of the stack, e.g. `0x20 JUMP` is followed by `0x02 JUMPDEST`. ### Executions of a CFG In our context, a CFG is a graph the nodes of which are labelled with _segments_ of EVM bytecode (as in Figure 1). :::info :bulb: The segment in a CFG are _linear_ in the sense that the instructions in a segment are executed one after the other. (Un-)conditional _jumps_ can only occur at the end of a segment. ::: The graph has a designated _intial node_, in our example, the node with label `segment 0`. A _path_ in a graph is a sequence of consecutive edges (arrows) starting at the initial node. For instance the sequence `segment 0 to segment 1 to segment 3 to segment 1`, is a path in the CFG above. `Segment 0 to segment 2` is not a path. A path defines a sequence of EVM instructions, an _execution_, obtained by concatenating the labels (segments) of the nodes on the path. For instance the path `segment 0 to segment 1` defines the execution that is the concatentation of the bytecode of `segment 0` followed by the bytecode of `segment 1`. We write `Execs(g)` for the set of executions that are defined by the finite paths in `g`. :::warning :warning: A **first** observation is that the CFG in Figure 1 does not capture loop bounds accurately^[This is something that can actually be done by unfolding loops and get a more precise CFG.] as it allows an arbitrary (but finite) number of loops from `Segment 1` to `Segment 3` while the program only allows 10. This is perfectly fine as a CFG is meant to over-approximate the set of executions of a given program. ::: :::warning :warning: A **second** observation is that, if there is an edge from`segment 0` to `segment 1` in the CFG, it does not necessarily imply that the sequence of instructions of `segment 0` followed by `segment 1` is a _feasible sequence_ of instructions. This is the _intent_ when building a CFG but there could be some bugs in the CFG generation algorithm. So to make sure this is satisfied, we need to take into account the semantics of the instructions. ::: To be more concrete, assume that the instruction at `0x07` in `segment 1`, Figure 1, is `PUSH1 0x0a` instead of `PUSH 0x13` _but_ we still have an edge from `segment 1` to `segment 3`. The execution of `PUSH 0x0a` followed by `JUMPI` if the condition is true, leads to `0x0a`, not to `0x13` and the CFG is incorrect in the sense that _executing_ the code in the segments does not reflect the information and labels of the CFG. In the previous example, you may argue that the error is obvious, but in a more general context and more complex or larger bytecode, the target of a `JUMP/JUMPI` instruction may have been pushed a few segments before, for instance to implement a function call/return, and in that case it may not be a constant value. The objective here is to detect errors, a posteriori, that could have happened in the construction of CFG, or to confirm the absence of errors (and validate) in the CFG. ### Semantics of EVM bytecode Given an input `i` (e.g., the `calldata`), the execution of a program `prog` (in EVM bytecode) defines a unique^[The EVM is deterministic.] _trace_: the _sequence of instructions executed_ during the execution of `prog` on input `i`. We write `Execs(prog)` for the set of finite traces of `prog` _over all inputs_. ### Correctness criterion We can now provide a precise definition of correctness for CFGs. :::success :bulb: Given a CFG `g` and a program `prog`, `g` is a _correct CFG for `prog`_ iff `Execs(prog)` is included in `Execs(g)`. ::: Of course this definition allows CFGs to be very permissive and for instance a trivial CFG with edges between any two segments of code satisfy this definition. The art of building useful CFGs is to build over-approximations that are not too coarse and can be used to guarantee some useful properties of the program. Nice! We now have a well-defined criterion to check correctness of CFGs. How do we implement the verification of the criterion? ## A sufficient condition to prove correctness We can provide a sufficient condition (test) for the correctness criterion above in the form of a _simulation relation_ check on the CFG. What is that? We check the following property `P1` on the CFG: given a node `src` in the CFG, if there is an edge `(src, tgt)`, we are going to prove that _executing (using the semantics of the EVM) the code_ in `src` _actually leads_ to the start address of the code in `tgt`. <!-- How does it relate to the correctness criterion? Intuitively a _simulation relation_ captures the following property: a system `A` _simulates_ `B` if `A` can do everything `B` can do (but possibly more). And we can prove that `A` simulates `B` by providing a relation which is a mapping from the states of `A` and `B` that is _inductive_. --> Let `prog` be the program defined by the code in the segments in the nodes of the CFG. Assume the execution of the code in `src` leads to target address `a`; if `P1` is true, this execution is _matched_ by the edge `(src, tgt)` in the CFG, because the start address of `tgt` is `a`. In this sense every edge `(src, tgt)` in the CFG _simulates_ (reads "can do the same and maybe more") the code in `src`. A property of simulation relation for CFG (finite automata) is that, by induction, they entail _language inclusion_ which is exactly the correctness criterion. :::success :bulb: If `P1` is true for every edge in the CFG `g`, and `prog` is the program defined by the code in the segments of `g`, then `Execs(prog)` is included in `Execs(g)` and `g` is a correct CFG for `prog`. ::: We can use the [Dafny-EVM](https://github.com/Consensys/evm-dafny), a formal and executable semantics of the EVM in Dafny, to prove `P1` for each edge. The Dafny-EVM provides the semantics of each EVM instruction as a _state transformer_. For instance, the effect of `PUSH0` is defined by the [function `Push0`](https://github.com/Consensys/evm-dafny/blob/2338bd196dadee1fdf2be02e2914f0309e7978a9/src/dafny/bytecode.dfy#L1480) like so: ```rust= function Push0(st: Evm.ExecutingState): (st': Evm.State) { if st.Capacity() >= 1 then // Push 0 on the stack and advance to next instruction st.Push(0).Next() else // The stack is full and we cannot push. ERROR(STACK_OVERFLOW) } ``` This function applies to any `Evm.ExecutingState`, a non-error state of the EVM, and returns an `Evm.State`. An `Evm.State` is either an `Evm.ExecutingState`, if the execution of the instruction does not result in an error or revert (for `Push0`, this requires that the stack in state `st` is not full), or an error-state (this is the case if the capacity of the stack is less than 1). In the Dafny-EVM, the [type](https://github.com/Consensys/evm-dafny/blob/2338bd196dadee1fdf2be02e2914f0309e7978a9/src/dafny/state.dfy#L138) `Evm.State` comprises of the **program counter**, **stack**, memory storage, context, gas etc. Using the Dafny-EVM semantics, we can encode the effect of each segment of code in the CFG as the composition of the semantic functions of the instructions in the segment: ```rust= function ExecuteFromCFGNode_seg0(st0: Evm.ExecutingState): (st': Evm.State) // PC requirement for this node. requires st0.PC() == 0x0 as nat // Stack size requirements for this node. requires st0.Operands() >= 0 requires st0.Capacity() >= 2 { var st1 := Push0(st0); var st2 := Dup(st1, 1); assert st2.EXECUTING?; // Edge from seg0 to seg1, execute seg1 from st2 ExecuteFromCFGNode_seg1(st2) } function ExecuteFromCFGNode_seg1(st0: Evm.ExecutingState): (st': Evm.State) // PC requirement for this node. requires st0.PC() == 0x2 as nat // Stack size requirements for this node. requires st0.Operands() >= 2 requires st0.Capacity() >= 2 { var st1 := JumpDest(st0); var st2 := PushN(st1, 1, 0x0a); var st3 := Dup(st2, 2); var st4 := Lt(st3); var st5 := PushN(st4, 1, 0x13); var st6 := JumpI(st5); ... } ... ``` The code above contains some statements, `requires` and `assert` that are _verification statements_. For instance, a `requires` statement is a _precondition_. The preconditions of `ExecuteFromCFGNode_seg0` specify that this function is only defined for states `st0` such that _i)_ the `PC` is `0x0` (line 3), and _ii)_ the stack is empty (line 5), and iii) the capacity^[This follows from the fact that the stack is empty and the maximum capacity is 1024, so for segment 0, this precondition is redundant.] of the stack is larger than 1. And now we leverage the automated reasoning power of verification-friendly languages like Dafny. Dafny _checks_ that preconditions hold at each call site. For instance, the call to the function `ExecuteFromCFGNode_seg1(st2)` at line 13, triggers a verification that _for all_ possible `st2`, `st2` satisfies the preconditions of `ExecuteFromCFGNode_seg1` (lines 17, 19--20). If this is not satisfied, Dafny reports an error. On the other hand, if the code of `ExecuteFromCFGNode_seg0` _verifies_, this implies that for _any input state_ `st0` that satisfies the preconditions of `ExecuteFromCFGNode_seg0`, we are guaranteed to reach a state `st2` that satisfies the preconditions of `ExecuteFromCFGNode_seg1`. And as a result, it _proves_ that _executing in the EVM_ the instructions of `segment 1` leads to the first instruction of `segment 0`. This gives us a way to check that an edge in the graph satisfies `P1`. :::success :bulb: We can encode every edge `(src, tgt)` in the graph `g` as a Dafny function that captures the effect of the edge. If all the functions verify, then all the edges are semantically correct and satisfy the property `P1`. ::: For final nodes in the graph with no outgoing edges, e.g. a segment that ends in `revert`, `return` or `stop`, we have a simpler encoding. For instance, the node `Segment 2` in the CFG above is defined by the function, without a call to a function at the end: ```rust= function ExecuteFromCFGNode_seg2(st0: Evm.ExecutingState): (st': Evm.State) // PC requirement for this node. requires st0.pc == 0xa as nat // Stack requirements for this node. requires st0.Operands() >= 2 { var st1 := Pop(st0); var st2 := PushN(st1, 1, 0x40); var st3 := MStore(st2); var st4 := PushN(st3, 1, 0x20); var st5 := PushN(st4, 1, 0x40); var st6 := Return(st5); // Segment is terminal (Revert, Stop or Return) st6 } ``` Let `Verifier(g)` be the Dafny program defined above given a CFG `g`. We can verify `Verifier(g)` with Dafny, and :::success :bulb: if `Verifier(g)` _verifies_, then `g` is a correct CFG. ::: ### An efficient machine-checkable test of correctness The previous strategy is good but it uses the _exact semantics_ of the EVM. An `Evm.State` in the Dafny-EVM is a data structure that comprises all the fields that are needed to compute the new state: the _context_, the _stack_, _remaining gas_ etc. This makes the verification task for each edge resource-intensive, and sometimes Dafny timeout and cannot verify the correctness of a segment. The reason is that, to prove the correctness of a sequence of instructions, a _verification condition_ is built by Dafny, and its size depends on the size of the sequence and on the complexity of each function that gives the semantics of an EVM instruction. So for complex instructions (e.g. memory, storage), or long sequences of instructions (more than 40 instructions), the verification condition is too large and the verification tasks times out (if we allow say 20 seconds). This is why in the actual implementation of the correctness test I use an [abstract semantics of the EVM](https://github.com/franck44/evm-dis/blob/a7fa7569af421c1cef8a3553b205d07a14e9e53d/src/dafny/AbstractSemantics/AbstractSemantics.dfy) that ignores lots of components of the `Emv.State`. For instance I use an abstract state of the form: ```rust= datatype EState = EState(pc: nat, stack: seq<Int.u256>) { // The number of operands on the stack (stack size). function Operands(): nat { |stack| } } ``` with only two fields the `pc` and the `stack`. The abstract semantics of the EVM instructions is much simpler, for instance this is the abstract semantcis of `Pop` and `SStore`: ```rust= function Pop(s: EState): (s': EState) requires s.Operands() >= 1 { EState(s.pc + 1, s.stack[1..]) } function SStore(s: EState): (s': EState) requires s.Operands() >= 2 ensures s'.Operands() == s.Operands() - 2 { EState(s.pc + 1, s.stack[2..]) } ``` The actual correctness test uses the abstract semantics, and with this trick and a much simpler semantics, we can verify large segments (more than 40 instructions). :::success :bulb: But why is it safe to use the abstract semantics instead of the concrete and precise Dafny-EVM semantics? It is because: 1. simulation is a transitive property. If `A` simulates `B` and `B` simulates `C` then `A` simulates `C`; 2. so we can prove that the CFG (`C`) simulates the abstract semantics `B`; this is achieved with the test I presented above using the abstract semantics intead of the concrete Dafny-EVM; and 3. we can prove that the abstract semantics `B` simulates the Dafny-EVM semantics `A`. The conclusion is that the CFG simulates the Dafny-EVM semantics. ::: What are the benefits? The abstract semantics is much cheaper (verification-wise) to verify when we test `Verifier(g)`. Note that the proof of simulation of the concrete (Dafny-EVM) semantics by the abstract one (step 3.) can be done _once_, and in Dafny. I'll skip over the details, but the proof of 3. amounts to defining an `Abstraction` function that maps `Evm.State` to `EState` and then proving that, for _each EVM instruction_, the abstract semantics simulates the Dafny-EVM semantics. It looks like this for the `Pop` instruction: ```rust= // A Dafny lemma to prove that abstract Pop simulates Bytecode.Pop lemma SimulationCorrectPop(s: EState, st: Evm.ExecutingState) requires s.Abstracts(st) ensures Bytecode.Pop(st).EXECUTING? ==> Pop.requires(s) && Pop(s).Abstracts(Bytecode.Pop(st)) { // Thanks Dafny } ``` We can unpack the lemma above and explain what it captures: - given a concrete `st: Evm.ExecutingState`, for all`s: EState` that abstracts (`requires`, line 3) `st`, - if we can execute `Bytecode.Pop` (concrete semantics) from `st` and it leads to an `ExecutingState` (non-error) `Bytecode.Pop(st)`, line 4, then: - we can execute abstract `Pop` (`Pop.requires(s)`, line 5) and - the resulting state `Pop(s)` abstracts `Bytecode.Pop(st)`, line 5. This lemma is automatically verified by Dafny without a need for any proof hints (the body of the lemma is left empty). We can prove a similar [lemma for each instruction](https://github.com/franck44/evm-dis/blob/a7fa7569af421c1cef8a3553b205d07a14e9e53d/src/dafny/AbstractSemantics/SimulationProof.dfy) whereby proving that the abstract semantics simulates the concrete one. :::success :bulb: This provides a machine-checkable proof that the use of the abstract semantics in the test is correct. ::: ### Conclusion Using the technique presented above, we can check whether a CFG is correct for some program in EVM bytecode. The test is very efficient as it uses an abstract semantics that id easy to verify using a verification-friendly language like Dafny. With the abstract semantics, we can verify the CFG for contracts like an ERC-20 or the Deposit Contract. The main advantages of this approach are: - _we do not need to trust the algorithm_ that computes the CFG. It is not part of the Trusted Computing Base (TCB). The result of the algorithm can be independently checked against the _actual_ semantics of the EVM. - the use of the abstract semantics vs the reference Dafny-EVM semantics is _provably correct_. All the steps in the previous technique are formally verified and machine-checkable. To the best of my knowledge, this is the first technique that provides _verified_ control flow graphs, and the guarantee that _all the executions_ of a program are indeed executions of the CFG. As a result we provide a very high level of security, every step of the way being machine-checkable. This can be leveraged to conclude that safety properties that are satisfied by the CFG are guaranteed to hold on the EVM bytecode program. ### And ... it is already been useful While developing the CFG verification technique, I have uncovered a subtle bug in the algorithm that generates the CFGs. The bug was discovered while verifying one CFG that did not verify ... and rightly so because it was incorrect. The bug was in the [implementation of the loop invariant detection module](https://github.com/franck44/evm-dis/blob/a7fa7569af421c1cef8a3553b205d07a14e9e53d/src/dafny/utils/EVMObject.dfy#L258), where a condition was missing. It affected only one CFG, but still, it was a bug. On the positive side, all the other test CFGs (except the one mentioned previously) passed the verification test, which demonstrates that the construction algorithm is reliable and robust. [^fn]: