---
title: Control Flow Graph reconstruction for EVM bytecode
tags: EVM, Ethereum, Dafny, Semantics
breaks: false
---

[Franck Cassez](https://franck44.github.io), Head of Research at [Mantle](https://www.mantle.xyz). — all views expressed are my own.

[![Twitter URL](https://img.shields.io/twitter/url/https/twitter.com/franckDownunder.svg?style=social&label=Follow%20%40FranckDownunder)](https://twitter.com/franckDownunder)

---

Control flow graphs (CFGs) provide an abstract semantics of programs. They are used in compilers to optimise the code, e.g. for constants propagation or for minimising the number of used registers. CFGs are also instrumental in qualitative and quantitative analysis of programs and are used as inputs to static analysers or worst-case execution-time analysers.

Building CFGs from high-level (structured) code is reasonably easy, but unfortunately, building CFGs from low-level code, e.g. ARM assembly or, in our context, **EVM bytecode**, is notoriously hard. This is mostly due to _dynamic jumps_: the next value of the _program counter_ may depend on the history of the computation and is not hard-coded.

In this post I'll describe an algorithm to compute _precise_ and _minimal_ CFGs for EVM bytecode. The algorithm I have designed can handle complex EVM bytecode and construct CFGs for contracts like an [ERC-20](https://github.com/franck44/evm-dis/blob/0b8c7e49bf4e68885601c2f8d6bf20e43ecdfd11/src/dafny/tests/src/erc-20/erc-20-cfg.dot), or the [Deposit Contract](https://github.com/franck44/evm-dis/blob/0b8c7e49bf4e68885601c2f8d6bf20e43ecdfd11/src/dafny/tests/src/Deposit/depositContract-cfg.dot).

## Overview

In the next sections I'll explain what a CFG is and how the algorithm works. The algorithm uses a combination of _abstract interpretation_, _path unfolding_, _loop detection_ and _automata minimisation_. I have implemented the algorithm in [Dafny](https://github.com/dafny-lang/dafny), and the code is available in the [evm-dis](https://github.com/franck44/evm-dis) repository. Dafny has several backends to generate code in C#, Java, Python and some other languages.

:::success
:smiley: I have built the Python, JavaScript, Java and C# artefacts so you don't need to install Dafny to run the CFG generator; you can use one of the target builds in Python, JavaScript, Java or C#.
:::

You can refer to the [usage section](https://github.com/franck44/evm-dis#usage) of the repository to see how to run your favourite^[Other builds will be released when the corresponding Dafny backends are ready, e.g. Rust.] artefact(s) of the CFG generator.

## What are Control Flow Graphs?

In this section I'll explain what a CFG is and give some simple examples for programs in EVM bytecode.

### Execution of EVM Bytecode

Given an EVM program $p$ in bytecode, and an input $i$ for $p$, the [semantics of the EVM](https://github.com/Consensys/evm-dafny) uniquely defines $run(p, i)$, the _execution_ of $p$ on input $i$, as a finite sequence of EVM states. You can find the formal definition of an [EVM state](https://github.com/Consensys/evm-dafny/blob/60bce44ee75978a4c97b9eab8e03424c9c233bbd/src/dafny/state.dfy#L53) in the [Dafny-EVM](https://github.com/Consensys/evm-dafny) semantics. A state $s$ comprises the program counter $s.pc$, the stack content $s.stack$, the memory content $s.memory$, and several other fields. The set of _executions_ of $p$ over all its possible inputs is:

$$Runs(p) = \{ run(p, i) \, |\, i \in {\cal I} \}$$

where ${\cal I}$ is the set of all inputs.
Now, assume we restrict our attention to the _program counter_ in executions. Given an execution $e \in Runs(p)$, which is a sequence $s_0, s_1, \cdots, s_n$ of EVM states, we define $e.pc = s_0.pc, s_1.pc, \cdots, s_n.pc$ the corresponding sequence of program counters. We can apply this projection to all the executions of $p$ and we obtain the set:

$$ PCRuns(p) = \{ e.pc \ | \ e \in Runs(p) \}\mathpunct. $$

$PCRuns(p)$ is a (possibly infinite) _set_ of sequences of program counters. Sets of sequences are sometimes called _languages_. [_Finite automata_](https://en.wikipedia.org/wiki/Finite-state_machine) can finitely represent sets in the family of [_regular languages_](https://en.wikipedia.org/wiki/Regular_language). It turns out that $PCRuns(p)$ is, in general, not a regular language, but it can be _over-approximated_ by a regular language.

:::info
:bulb: A CFG for $p$ is a **finite automaton** that **over-approximates** $PCRuns(p)$.
:::

By this definition, an automaton with a node per segment and edges between any two nodes is a CFG: it over-approximates the sequences of PCs. However, this is not a very useful CFG as it does not provide any information on the program. The art of building good CFGs is to construct approximations that are not too coarse and provide some information on the program structure.

### Examples of CFGs

The simple EVM program `TwoCalls` below simulates two successive calls to the same function and illustrates some difficulties in constructing CFGs.

<center>
<table style="font-size:100%">
<tr>
<th>TwoCalls – EVM Bytecode</th>
<th ALIGN="center">Control Flow Graphs</th>
</tr>
<tr>
<td style="border: none">

The execution of the program below starts with `pc==0`, Segment 0. Segment 3 (`0x0d`) in the code below is executed twice. There are two successive _jumps_ to Segment 3:

- first jump from `0x04`;
- second jump from `0x0a`.

```
PC  :Instruction     // stack
// Segment 0
0x00:PUSH1 0x05      // [0x05]
0x02:PUSH1 0x0d      // [0x0d, 0x05]
0x04:JUMP            // call1 JUMP to 0x0d
// Segment 1
0x05:JUMPDEST        // [] empty
0x06:PUSH1 0x0b      // [0x0b]
0x08:PUSH1 0x0d      // [0x0d, 0x0b]
0x0a:JUMP            // call2 JUMP to 0x0d
// Segment 2
0x0b:JUMPDEST        // [] empty
0x0c:STOP            // []
// Segment 3
0x0d:JUMPDEST        // call1 [0x05] | call2 [0x0b]
0x0e:JUMP            // Return to 0x05 | Return to 0x0b
```

After the first jump we should _return_ to `0x05`; this is the address that is pushed prior to the first call. After the second jump, we should _return_ to `0x0b`; this is the address pushed prior to the second call.

CFG1 (opposite, top) over-approximates sequences of PCs in `TwoCalls` but contains some infeasible runs e.g., Seg0, Seg3, Seg1, Seg3, Seg1, Seg3, Seg2. CFG2 is more accurate and captures exactly the set of feasible sequences of PCs in `TwoCalls`.

</td>
<td style="border: none;">

```graphviz
digraph CFG {
graph[fontsize=24,fontname=arial,label="CFG1 for TwoCalls"]
node [shape=box]
node[fontname=arial, fontsize=14]
edge[fontname=arial]
ranking=TB
{rank=max; s15_4}
s0_0 [label=< <B>N0. Segment 0 [0x0] </B><BR ALIGN="CENTER"/> PUSH1 0x05 <BR ALIGN="LEFT"/> PUSH1 0x0d <BR ALIGN="LEFT"/> JUMP <BR ALIGN="LEFT"/> >]
s1_1 [label=< <B>N1. Segment 3 [0xd] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> JUMP <BR ALIGN="LEFT"/> >]
s3_2 [label=< <B>N2. Segment 1 [0x5] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> PUSH1 0x0b <BR ALIGN="LEFT"/> PUSH1 0x0d <BR ALIGN="LEFT"/> JUMP <BR ALIGN="LEFT"/> >]
s15_4 [style=filled,color=orange,fontcolor=white,label=< <B>N3.
Segment 2 [0xb] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> STOP <BR ALIGN="LEFT"/> >]
s0_0 -> s1_1 [tooltip="Jump",style=dashed]
s1_1 -> s3_2 [tooltip="Jump",style=dashed]
s3_2 -> s1_1 [tooltip="Jump",style=dashed]
s1_1 -> s15_4 [tooltip="Jump",style=dashed]
s3_2 -> s15_4 [style=invis]
s1_1 -> s15_4 [style=invis]
}
```

```graphviz
digraph CFG {
graph[fontsize=10,fontname=arial,label="CFG2 for TwoCalls"]
node [shape=box]
node[fontname=arial, fontsize=8]
edge[fontname=arial]
ranking=TB
s0_0 [label=< <B>N0. Segment 0 [0x0] </B><BR ALIGN="CENTER"/> PUSH1 0x05 <BR ALIGN="LEFT"/> PUSH1 0x0d <BR ALIGN="LEFT"/> JUMP <BR ALIGN="LEFT"/> >]
s1_1 [label=< <B>N1. Segment 3 [0xd] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> JUMP <BR ALIGN="LEFT"/> >]
s3_2 [label=< <B>N2. Segment 1 [0x5] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> PUSH1 0x0b <BR ALIGN="LEFT"/> PUSH1 0x0d <BR ALIGN="LEFT"/> JUMP <BR ALIGN="LEFT"/> >]
s7_3 [label=< <B>N3. Segment 3 [0xd] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> JUMP <BR ALIGN="LEFT"/> >]
s15_4 [style=filled,color=orange,fontcolor=white,label=< <B>N4. Segment 2 [0xb] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> STOP <BR ALIGN="LEFT"/> >]
s0_0 -> s1_1 [tooltip="Jump",style=dashed]
s3_2 -> s7_3 [tooltip="Jump",style=dashed]
s1_1 -> s3_2 [tooltip="Jump",style=dashed]
s7_3 -> s15_4 [tooltip="Jump",style=dashed]
}
```

</td>
</tr>
</table>
</center>

### How can we build CFGs?

The program counter is updated after each instruction. For the majority of instructions, like `ADD`, `DUP1`, etc., the program counter advances to the next instruction in the code, something like `pc := pc + 1`. For other instructions like `JUMP/JUMPI` the execution can jump/branch to an address that is not the instruction after the current one.

In the example above, the bytecode is split into _linear segments_: contiguous sequences of instructions with no branching _except possibly the last instruction_. Segments' boundaries are determined by `JUMPDEST` (instructions we can jump to), `JUMP/JUMPI`, `STOP` or `REVERT` instructions. Each segment ends with a `JUMP/JUMPI`, or with a `STOP/REVERT` that ends the program. In the example above, there are four segments.

:::info
:bulb: In each segment it is easy to track the program counter up to the last instruction: it progresses to the next instruction in the code. The CFG of a segment is linear. To avoid cluttering the CFG, we collapse all the nodes within a segment into the same CFG node.
:::

A CFG (a finite automaton) defines sequences of CFG nodes as follows: a) there is a designated _initial node_, e.g. `N0` in the CFGs above, and b) sequences are defined as the nodes we see on _paths_ in the graph. Paths are built by following directed edges between two CFG nodes. For instance CFG1 contains a path `N0, N1, N2, N1, N2, N3`. Each node is labelled with a segment, so a sequence of nodes defines a sequence of segments (the labels of the nodes).

One of the difficulties in constructing a CFG for this program is that the target of the `JUMP` in Segment 3 is determined by the top element of the stack _when we execute_ `0x0e: JUMP`; and it is not the same for the first and the second call.

:::info
:bulb: To construct CFGs we may have to track the history of states in the executions.
:::

Let's now assume that we have a way to track the states. To build a CFG, we can just _unfold each possible execution_ of the program, one at a time, and build the CFG as a collection of all the paths.
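To give a flavour of this, here is a minimal sketch of such an unfolding, in Python (not the actual Dafny implementation). The function `succ` is a hypothetical stand-in for whatever computes the nodes reachable in one step from a node; computing it, and deciding when to stop, is exactly what the rest of this post is about.

```python
# Sketch of path unfolding: collect CFG edges by exploring every execution
# path from the initial segment. Illustrative only; `succ` is a hypothetical
# successor function and the depth bound is a crude way to force termination.
def unfold(start, succ, max_depth=100):
    edges = set()

    def dfs(node, depth):
        if depth >= max_depth:
            return
        for nxt in succ(node):
            edges.add((node, nxt))
            dfs(nxt, depth + 1)

    dfs(start, 0)
    return edges

# Toy successor relation, hand-written to mimic TwoCalls (the two visits to
# Segment 3 are distinguished by hand here; in reality they have to be
# distinguished by the tracked state).
toy = {"Seg0": ["Seg3a"], "Seg3a": ["Seg1"], "Seg1": ["Seg3b"],
       "Seg3b": ["Seg2"], "Seg2": []}
print(unfold("Seg0", lambda n: toy[n]))
```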
This is going to work if executions have a maximum depth but it may never end if we have _loops_ in the program. Indeed, another difficulty is to deal with `JUMP/JUMPI` instructions that branch to a segment we have already seen on the path we are currently unfolding. This is similar to what is happening in the `TwoCalls` program: after Segment 0, Segment 3, Segment 1, we encounter Segment 3 again. Do we add an edge to the previously seen node `N1` that has label Segment 3, or do we continue to unfold the graph? If we choose to create an edge to the previously seen Segment 3 (CFG1), we add _infeasible_ paths to the graph. If we continue to unfold and we are on a loop, the unfolding may never end.

:::info
:bulb: When unfolding a path in the CFG, if we encounter a segment that was already seen on a node in this path, how do we decide whether we can _branch back_ to the previously seen node and stop the unfolding of the path?
:::

For the reasons explained above, designing an algorithm that builds _precise_ CFGs (minimising the number of infeasible paths) and _always terminates_ is hard. For instance, CFG1 above contains loops whereas `TwoCalls` has none, so it is not a very precise CFG for `TwoCalls`. In the next sections I'll explain how we can track the _useful_ content of the state, and provide a criterion to stop unfolding a path.

## An Algorithm to build CFGs

As mentioned before, it is notoriously hard to build CFGs for low-level code like ARM assembly. In ARM assembly, target addresses of JUMPs can be stored in memory or computed at runtime. And the target addresses can be anywhere in the code. The situation for EVM bytecode is much more favourable.

First, there is a known and fixed set of _valid_ targets for `JUMP`s: the `JUMPDEST` instructions. In EVM bytecode, every target of a `JUMP/JUMPI` must be a `JUMPDEST` instruction. This is the case for the program `TwoCalls` introduced above. This means that all the `JUMP` targets are in the set of `JUMPDEST`s. In itself this enables us to define a simple algorithm to build a CFG: from each `JUMP/JUMPI`, we create edges to each `JUMPDEST`. This over-approximates the set of sequences of PCs, but this over-approximation may be too coarse to be of any use (e.g. CFG1 above).

There is a second EVM feature that impacts how targets of jumps are stored and retrieved across successive jumps.

:::info
:bulb: It is **very unlikely** that target addresses of `JUMP/JUMPI` will be stored in memory or storage because using memory or storage is much more expensive, gas-wise, than using the stack in the EVM.
:::

As a result, compilers to EVM bytecode will almost certainly _stack_ return addresses. It is also unlikely that targets of jumps are _computed_ e.g. with `ADD` or `MUL`, as every operation costs some gas.

:::success
:bulb: Tracking the _content of the stack_ and the _effect of some simple stack operations_ may be enough to determine target addresses of `JUMP/JUMPI` instructions.
:::

### Abstract States & Semantics

An (EVM) _abstract state_ `s` has two components: `pc` is the program counter, and `stack` is the stack represented by a _sequence_ of elements. We use `s.pc` and `s.stack` to reference the components. We let $\textit{AState}$ be the set of abstract states. The _empty stack_ is the empty sequence `[]`, and for non-empty stacks, the top is the first element of the sequence e.g., `[a, b]` means that `a` is at the top of the stack.
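As an illustration, an abstract state could be represented as follows (a Python sketch only; the actual implementation is written in Dafny in the evm-dis repository):

```python
from dataclasses import dataclass

UNKNOWN = None  # stands for the "unknown" stack value, written ⊥ below

@dataclass
class AState:
    pc: int
    stack: list  # stack[0] is the top of the stack; entries are values or UNKNOWN

# e.g. the abstract state of TwoCalls just after the first JUMP to 0x0d:
s = AState(pc=0x0d, stack=[0x05])
```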
We write $s.Peek(k)$ for the $(k+1)$-th element of the stack (from the top), with $s.Peek(0)$ being the top. Given a sequence $x = x_0, x_1, x_2, \ldots, x_n$, $tail(x)$ is the sequence $x_1, x_2, \ldots, x_n$.

The _abstract semantics_ of EVM instructions is given by the function

$$\textit{Next}: \textit{AState} \times I \to \textit{AState}$$

with $I$ the set of EVM instructions. The complete definition of the $\textit{Next}$ function can be found [here](https://github.com/franck44/evm-dis/blob/11ad490fb319eadb11e9086b7e354d3ea4078483/src/dafny/utils/Instructions.dfy#L231). The semantics^[I omit the checks on the stack sizes and assume when $Peek(k)$ is used there are at least k + 1 elements on the stack.] of `JUMP` and `JUMPI` are: given a state $s=(pc, st)$,

$$
\begin{align*}
\textit{Next}((pc, st), \texttt{JUMP}) & = (st.Peek(0), tail(st)) & \\
\textit{Next}((pc, st), \texttt{JUMPI}) & = \begin{cases} \mathbf{if} \, st.Peek(1) > 0 \, \mathbf{then} \, \bigl(st.Peek(0), tail(tail(st))\bigr) \\ \mathbf{else}\, \bigl(pc + 1, tail(tail(st))\bigr)\mathpunct. \end{cases} &
\end{align*}
$$

The abstract semantics of other instructions can be defined in the same way. However, we also want to _ignore_ some of the stack content that may not be relevant to tracking the `JUMP` targets. To do so, we have two types of elements in the stack: _values_ (non-negative integers), and _unknown_, denoted $\bot$.

As mentioned before, most of the time, target addresses of `JUMP`s will be stacked with a `PUSH` instruction. The stack is then managed in a way that the target addresses are on top of the stack when needed by a `JUMP` (it is the role of the compiler to ensure that the code meets this requirement). This also means that lots of operations on the stack e.g., `ADD`, _will not produce_ targets of `JUMP`s and the actual result can be ignored, but of course not the effect on the _stack size_. As a result we define the abstract semantics of operations like `ADD, SUB, PUSH0` and others as follows:

$$
\begin{align*}
\textit{Next}((pc, st), \texttt{ADD}) & = (pc + 1, [\bot] + tail(tail(st))) & \\
\textit{Next}((pc, st), \texttt{PUSH0}) & = (pc + 1, [\bot] + st) \mathpunct. &
\end{align*}
$$

The trick here is that the `ADD` pops the two topmost elements of the stack, adds them and pushes the result, but in the _abstract semantics_, we omit the actual result and replace it with $\bot$. For operations that _may_ push jump targets like `PUSH1` to `PUSH32`, we want to track the values pushed on the stack^[Note that `PUSHk` instructions take a $k$-byte argument, so the next instruction is at address $1 + k + pc$ in the code.]:

$$
\begin{align*}
\textit{Next}((pc, st), \texttt{PUSHk v}) & = (pc + 1 + k, [v] + st)\mathpunct.
\end{align*}
$$

And finally, for other stack operations like `DUPk`, `SWAPk`, we re-arrange the stack elements according to the operation without modifying their contents:

$$
\begin{align*}
\textit{Next}((pc, st), \texttt{DUP1}) & = (pc + 1, [st.Peek(0)] + st) & \\
\textit{Next}((pc, st), \texttt{SWAP1}) & = (pc + 1, [st.Peek(1), st.Peek(0)] + tail(tail(st)))\mathpunct.
\end{align*}
$$

Let's see how this works on an example. The program below, **Loop 1**, contains a loop. In this code snippet, the loop can repeat forever and the stack size can grow arbitrarily large^[I know this is not feasible in the EVM but we are building CFGs that are over-approximations here.]. If we use our abstract semantics to track the content of the stack (`?` denotes the unknown stack value $\bot$), we obtain the values below.
<center>
<table>
<tr>
<th>Loop 1 – EVM Bytecode</th>
<th ALIGN="center">Control Flow Graph</th>
</tr>
<tr>
<td style="border: none">

```
PC  Instruction      // stack
// Segment 0
0x00:PUSH1 0x02      // [0x02]
// Segment 1
0x02:JUMPDEST        // [0x02]
0x03:PUSH0           // [?, 0x02]
0x04:SWAP1           // [0x02, ?]
0x05:DUP1            // [0x02, 0x02, ?]
0x06:JUMP            // [0x02, ?]
...
```

After `n` jumps, the abstract stack content at `0x02` is `[0x02, ?, ?, ... ?]` (`n` unknown values).

</td>
<td style="border: none">

```graphviz
digraph CFG {
graph[fontname=arial, label="Loop 1 – CFG"]
node [shape=box]
node[fontname=arial]
edge[fontname=arial]
ranking=TB
s0_0 [label=< <B>Segment 0 [0x0] </B><BR ALIGN="CENTER"/> PUSH1 0x02 <BR ALIGN="LEFT"/> >]
s0_1 [label=< <B>Segment 1 [0x2] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> PUSH0 <BR ALIGN="LEFT"/> SWAP1 <BR ALIGN="LEFT"/> DUP1 <BR ALIGN="LEFT"/> JUMP <BR ALIGN="LEFT"/> >]
s0_0 -> s0_1 [tooltip="Next"]
s0_1 -> s0_1 [tooltip="Jump",style=dashed]
}
```

</td>
</tr>
</table>
</center>

This seems to be very good and we can successfully track the targets of `JUMP`s after each iteration. But if we unfold the CFG after each iteration, we build arbitrarily long paths as the stack size grows arbitrarily large. So how do we compute the CFG on the right-hand side? How can we guarantee that we always end up with the same target address after the `JUMP` instruction at `0x06`?

### Loop Detection

When we unfold a path $\pi$ in the CFG we can collect all the values of PCs^[We only need to collect the PCs at the beginning of each segment.] encountered so far. If we reach a value of PC that we have already seen on $\pi$ at index $i$, i.e. $\pi[i]$, we may check if we can stop unfolding the path and branch back to $\pi[i]$.

:::info
:question: But what exactly is the criterion to decide if we can stop unfolding and add an edge to a node we have already seen?
:::

Let's take the example of **Loop 1** above. Unfolding the graph leads to something like:

```graphviz
digraph {
rankdir=LR
node[shape=box]
n1 [label="Node 0\nSegment 0 \n PUSH1 0x02"]
n2 [label="Node 1\nSegment 1 \nJUMPDEST\nPUSH0 \n... \nDUP1\nJUMP"]
n3 [label="Node 2\nSegment 1 \nJUMPDEST\nPUSH0 \n... \nDUP1\nJUMP"]
n4 [label="Node 3 \nSegment 1 \nJUMPDEST\nPUSH0 \n... \nDUP1\nJUMP"]
n5 [shape=plain,label=" ..."]
n1 -> n2 [label="Next"]
n2 -> n3 [label="JUMP"]
n3 -> n4 [label="JUMP"]
n4 -> n5
}
```

As we have seen an occurrence of Segment 1 before `Node 2`, we would like to know if we can stop at `Node 2` and add a loop edge on `Node 1`. We would like to _fold_ this graph like so:

```graphviz
digraph {
rankdir=LR
node[shape=box]
n1 [label="Node 0 (Segment 0)"]
n2 [label="Node 1 (Segment 1)"]
n1 -> n2 [label="Next"]
n2 -> n2 [style=dashed, label="JUMP ?"]
}
```

We can do so if we _can guarantee_ that every time we execute Segment 1 (`Node 1`), we end up, after the `JUMP`, at Segment 1 again, i.e. with `pc==0x02`. A **sufficient condition** to test whether this holds is the following:

1. compute the _weakest precondition_ for Segment 1, with target postcondition `pc==0x02`;
2. check whether this weakest precondition is _invariant_ when executing Segment 1.

#### Weakest Preconditions

If you are not familiar with _preconditions_ and _weakest preconditions_, I'll try to explain what they mean. Assume we want to _guarantee_ that after executing a `JUMP` instruction, the program counter is `0x02`.
As a `JUMP` instruction updates the program counter to the value at the _top_ of the stack, the _least_ we can require is that, before we execute the `JUMP` instruction, the top of the stack contains `0x02`. Any other choice would **not lead** to `0x02` after the `JUMP`. Overall, if we start in an (EVM) state such that the top of the stack is `0x02`, written as the predicate `Peek(0)==0x02`, a `JUMP` instruction will update the program counter to `0x02`. `Peek(0)==0x02` is a _precondition_ (holds _before_) that ensures that executing a `JUMP` instruction leads to the postcondition (holds _after_) `pc==0x02`. If you are familiar with Hoare triples and Hoare logic, this can be summarised as:

$$ \{\ Peek(0) = 0x02 \ \}\quad \texttt{JUMP} \quad \{\ pc = 0x02 \ \}\mathpunct. $$

Moreover, `Peek(0)==0x02` is the _minimal requirement_ to ensure that `pc==0x02` after executing a `JUMP`. So `Peek(0)==0x02` is the _weakest precondition_ for `pc==0x02` after a `JUMP`. Let's write that as follows:

$$ \text{WPre}(\texttt{JUMP}, pc == 0x02) = Peek(0) = 0x02 \mathpunct. $$

Remember that a weakest precondition is a predicate, so the equation above reads "WPre is the predicate `Peek(0) == 0x02`", not "WPre is equal to Peek(0), itself equal to `0x02`".

Computing weakest preconditions can be extended to _sequences_ of instructions by computing the _WPre_ from the last to the first instruction. For a sequence of instructions $\mathit{xi} + [a]$, and a predicate $P$,

$$\text{WPre}(\mathit{xi} + [a], P) = \text{WPre}(\mathit{xi}, \text{WPre}(a, P))\mathpunct.$$

For instance:

$$
\begin{align}
\text{WPre}(\texttt{DUP1 JUMP}, pc == 0x02) & = \text{WPre}(\texttt{DUP1} , \text{WPre}(\texttt{JUMP}, pc = 0x02) ) & \\
& = \text{WPre}(\texttt{DUP1} , Peek(0) = 0x02 ) & \\
& = Peek(0) = 0x02
\end{align}
$$

We just back-propagate weakest preconditions from the last instruction to the first. If we do this computation for all the instructions in Segment 1, we obtain:

$$
\begin{align}
& \text{WPre}(\texttt{JUMPDEST PUSH0 SWAP1 DUP1 JUMP}, pc = 0x02) & \\
& = \text{WPre}(\texttt{JUMPDEST PUSH0 SWAP1}, Peek(0) = 0x02) & \\
& = \text{WPre}\biggl(\texttt{JUMPDEST PUSH0}, \text{WPre}(\texttt{SWAP1}, Peek(0) = 0x02)\biggr) & \\
& = \text{WPre}(\texttt{JUMPDEST PUSH0}, Peek(1) = 0x02) & \\
& = \text{WPre}\biggl(\texttt{JUMPDEST}, \text{WPre}(\texttt{PUSH0}, Peek(1) = 0x02)\biggr) & \\
& = \text{WPre}(\texttt{JUMPDEST}, Peek(0) = 0x02) & \\
& = Peek(0) = 0x02 &
\end{align}
$$

So the weakest precondition for Segment 1 to guarantee the postcondition $pc=0x02$ is $Peek(0) = 0x02$. Let's call this precondition `WPRE1`.

#### Is the Weakest Precondition an invariant?

To decide whether we can add a loop edge on `Node 1` in the CFG above, we are going to check whether `WPRE1` is _preserved_ by Segment 1. By _preserved_, I mean that if an EVM abstract state $s_0$ satisfies `WPRE1`, and we execute Segment 1, then the resulting state $s_1$ also satisfies `WPRE1`. You can see here that if this is the case, `WPRE1` is an _invariant_ for Segment 1: if it holds before we execute it, it holds at the end. And this implies that after _any number of executions_ of Segment 1, we have `pc == 0x02`. Putting everything together, we can guarantee that any future computations after Segment 1 can only execute Segment 1 and so we can safely add a loop edge to `Node 1` in our CFG.

How do we check that `WPRE1` is an invariant? We are going to compute the postcondition of `WPRE1` after Segment 1. If it _implies_ `WPRE1` then it is an invariant.
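Here is a rough Python sketch of both steps for Segment 1 of **Loop 1** (illustration only: the actual Dafny implementation covers the full instruction set and checks stack sizes; representing a predicate as a mapping from stack positions to required values is an assumption of this sketch):

```python
# Sketch: weakest precondition of a stack predicate across the opcodes of
# Segment 1 of "Loop 1", plus a forward abstract step to check invariance.
# A predicate is a dict {k: v} meaning "Peek(k) == v" for every entry
# (conflicting constraints are not handled in this sketch).
UNKNOWN = None

def wpre_instr(op, post):
    """Weakest precondition of `post` before executing `op`."""
    if op == "JUMPDEST":          # no effect on the stack
        return dict(post)
    if op == "PUSH0":             # pushes one element (a constraint on Peek(0) is dropped here)
        return {k - 1: v for k, v in post.items() if k >= 1}
    if op == "SWAP1":             # swaps Peek(0) and Peek(1)
        return {(1 if k == 0 else 0 if k == 1 else k): v for k, v in post.items()}
    if op == "DUP1":              # Peek(0) and Peek(1) after both come from Peek(0) before
        return {(0 if k <= 1 else k - 1): v for k, v in post.items()}
    raise NotImplementedError(op)

def wpre_segment(instrs, target):
    """WPre for 'pc == target' after the final JUMP of the segment."""
    assert instrs[-1] == "JUMP"
    cond = {0: target}            # WPre(JUMP, pc == target) is Peek(0) == target
    for op in reversed(instrs[:-1]):
        cond = wpre_instr(op, cond)
    return cond

def next_abs(state, op):
    """Abstract Next for the opcodes of Segment 1 (all one-byte opcodes here)."""
    pc, st = state
    if op == "JUMPDEST":
        return (pc + 1, st)
    if op == "PUSH0":
        return (pc + 1, [UNKNOWN] + st)
    if op == "SWAP1":
        return (pc + 1, [st[1], st[0]] + st[2:])
    if op == "DUP1":
        return (pc + 1, [st[0]] + st)
    if op == "JUMP":
        return (st[0], st[1:])
    raise NotImplementedError(op)

SEG1 = ["JUMPDEST", "PUSH0", "SWAP1", "DUP1", "JUMP"]

wpre1 = wpre_segment(SEG1, 0x02)  # {0: 0x02}, i.e. Peek(0) == 0x02
print("WPRE1:", wpre1)

# Invariance check: run Segment 1 from an abstract state satisfying WPRE1 ...
s = (0x02, [0x02, UNKNOWN])
for op in SEG1:
    s = next_abs(s, op)
# ... and check that the resulting state is back at 0x02 and satisfies WPRE1.
print("invariant:", s[0] == 0x02 and all(s[1][k] == v for k, v in wpre1.items()))
```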
I'll skip the details here, but let's start in a state $s$ that satisfies `WPRE1`, i.e. such that $s.stack = [0x02, \ldots]$ and $s.pc = 0x02$. After executing Segment 1 from $s$, we end up in a new state $s'$ with $s'.stack = [0x02, \bot, \ldots]$ and $s'.pc = 0x02$. Does $s'$ satisfy `WPRE1`? Yes, because the top of the stack is `0x02`. So `WPRE1` is an invariant for Segment 1. And this means that we can add an edge from `Node 1` to `Node 1` in the CFG because we have **proved** that this is the only possible target of the `JUMP`.

To be complete, we also need to take into account any stack constraint `SC1` that should hold when we reach the first occurrence of `Node 1`. We do so by building a conjunct `WPre := WPRE1 and SC1` and checking the invariance of `WPre`.

### Automata Minimisation

The algorithm presented above unfolds each execution path separately and stops the unfolding when an _invariant_ (as presented above) is detected on the path. This may lead to CFGs that are _not_ minimal. For instance, the CFG on the left-hand side below corresponds to the unfolding of two paths that end up in the same sequence of segments: `Segment 3, Segment 1`. In other words, the orange and red nodes can be merged as they define the same _future sequences_ of PCs.

Minimising the raw CFG on the left amounts to minimising an automaton and can be done with standard techniques to _partition_ the set of nodes into _equivalence classes_. I have implemented a minimisation algorithm^[A mixture of Moore's and Hopcroft's algorithms.] that can generate a minimised version of the path-unfolding CFG. The result of the minimisation for the Raw CFG below is depicted on the right-hand side.

<center>
<table>
<tr>
<th>Raw CFG</th>
<th ALIGN="center">Minimised CFG</th>
</tr>
<tr>
<td style="border: none">

```graphviz
digraph CFG {
node [shape=box]
node[fontname=arial, fontsize=10]
edge[fontname=arial]
ranking=TB
s0_0 [label=< <B>Segment 0 [0x0] </B><BR ALIGN="CENTER"/> PUSH1 0x0a <BR ALIGN="LEFT"/> PUSH1 0x08 <BR ALIGN="LEFT"/> PUSH1 0x03 <BR ALIGN="LEFT"/> SWAP1 <BR ALIGN="LEFT"/> PUSH1 0x13 <BR ALIGN="LEFT"/> JUMP <BR ALIGN="LEFT"/> >]
s1_1 [label=< <B>Segment 2 [0x13] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> SWAP2 <BR ALIGN="LEFT"/> SWAP1 <BR ALIGN="LEFT"/> DUP1 <BR ALIGN="LEFT"/> DUP4 <BR ALIGN="LEFT"/> LT <BR ALIGN="LEFT"/> PUSH1 0x1f <BR ALIGN="LEFT"/> JUMPI <BR ALIGN="LEFT"/> >]
s1_2 [style=filled, color=orange, label=< <B>Segment 3 [0x1c] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> POP <BR ALIGN="LEFT"/> JUMP <BR ALIGN="LEFT"/> >]
s5_3 [style=filled,color=olivedrab,fontcolor=white,label=< <B>Segment 1 [0xa] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> PUSH1 0x40 <BR ALIGN="LEFT"/> MSTORE <BR ALIGN="LEFT"/> PUSH1 0x20 <BR ALIGN="LEFT"/> PUSH1 0x40 <BR ALIGN="LEFT"/> RETURN <BR ALIGN="LEFT"/> >]
s3_2 [label=< <B>Segment 4 [0x1f] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> SWAP1 <BR ALIGN="LEFT"/> SWAP2 <BR ALIGN="LEFT"/> POP <BR ALIGN="LEFT"/> SWAP1 <BR ALIGN="LEFT"/> PUSH0 <BR ALIGN="LEFT"/> PUSH1 0x1c <BR ALIGN="LEFT"/> JUMP <BR ALIGN="LEFT"/> >]
s7_3 [style=filled, color=tomato, label=< <B>Segment 3 [0x1c] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> POP <BR ALIGN="LEFT"/> JUMP <BR ALIGN="LEFT"/> >]
s15_4 [style=filled,color=olivedrab,fontcolor=white,label=< <B>Segment 1 [0xa] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> PUSH1 0x40 <BR ALIGN="LEFT"/> MSTORE <BR ALIGN="LEFT"/> PUSH1 0x20 <BR ALIGN="LEFT"/> PUSH1 0x40 <BR ALIGN="LEFT"/> RETURN <BR ALIGN="LEFT"/> >]
s0_0 -> s1_1
[tooltip="Jump",style=dashed] s1_1 -> s1_2 [tooltip="Next"] s1_2 -> s5_3 [tooltip="Jump",style=dashed] s1_1 -> s3_2 [tooltip="Jump",style=dashed] s3_2 -> s7_3 [tooltip="Jump",style=dashed] s7_3 -> s15_4 [tooltip="Jump",style=dashed] } ``` </td> <td style="border: none"> ```graphviz digraph CFG { node [shape=box] node[fontname=arial, fontsize=10] edge[fontname=arial] ranking=TB s0_0 [label=< <B>Segment 0 [0x0] </B><BR ALIGN="CENTER"/> PUSH1 0x0a <BR ALIGN="LEFT"/> PUSH1 0x08 <BR ALIGN="LEFT"/> PUSH1 0x03 <BR ALIGN="LEFT"/> SWAP1 <BR ALIGN="LEFT"/> PUSH1 0x13 <BR ALIGN="LEFT"/> JUMP <BR ALIGN="LEFT"/> >] s1_1 [label=< <B>Segment 2 [0x13] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> SWAP2 <BR ALIGN="LEFT"/> SWAP1 <BR ALIGN="LEFT"/> DUP1 <BR ALIGN="LEFT"/> DUP4 <BR ALIGN="LEFT"/> LT <BR ALIGN="LEFT"/> PUSH1 0x1f <BR ALIGN="LEFT"/> JUMPI <BR ALIGN="LEFT"/> >] s1_2 [label=< <B>Segment 3 [0x1c] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> POP <BR ALIGN="LEFT"/> JUMP <BR ALIGN="LEFT"/> >] s3_2 [label=< <B>Segment 4 [0x1f] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> SWAP1 <BR ALIGN="LEFT"/> SWAP2 <BR ALIGN="LEFT"/> POP <BR ALIGN="LEFT"/> SWAP1 <BR ALIGN="LEFT"/> PUSH0 <BR ALIGN="LEFT"/> PUSH1 0x1c <BR ALIGN="LEFT"/> JUMP <BR ALIGN="LEFT"/> >] s5_3 [style=filled,color=olivedrab,fontcolor=white,label=< <B>Segment 1 [0xa] </B><BR ALIGN="CENTER"/> JUMPDEST <BR ALIGN="LEFT"/> PUSH1 0x40 <BR ALIGN="LEFT"/> MSTORE <BR ALIGN="LEFT"/> PUSH1 0x20 <BR ALIGN="LEFT"/> PUSH1 0x40 <BR ALIGN="LEFT"/> RETURN <BR ALIGN="LEFT"/> >] s0_0 -> s1_1 [tooltip="Jump",style=dashed] s1_1 -> s1_2 [tooltip="Next"] s1_1 -> s3_2 [tooltip="Jump",style=dashed] s1_2 -> s5_3 [tooltip="Jump",style=dashed] s3_2 -> s1_2 [tooltip="Jump",style=dashed] } ``` </td> </tr> </table> </center> ## Conclusion I have presented an algorithm to construct CFGs from EVM bytecode. The algorithm uses an abstract semantics of the EVM, and a _semantically sound_ criterion to detect loops on execution paths. Because we use an abstract semantics, the construction may fail in some cases, for instance: 1. if a target of a `JUMP` is stored outside of the stack, 2. if a target of a `JUMP` is computed with e.g. `ADD`, 3. if we never find an invariant. As discussed in this post, 1. and 2. are unlikely to occur as _reasonable_ compiler strategies would choose to store targets of `JUMP`s on the stack. Regarding 3., "the proof (of usefulness) is in the pudding". I have tried to construct several CFGs for programs with loops and the algorithm always terminated and successfuly detected path invariants. The algorithm presented above offers **precision** by unfolding each path separately, **sound loop folding** by using a _semantically sound_ criterion to detect loops on a path, and a **minimal size CFG** by collapsing nodes to equivalence classes of nodes in the graph. #### Acknowledgements Thanks to Andreas Penzkofer for insighful comments on the draft version of this post. I have previsouly worked on CFG construction. The technique presented above is inspired by the Master's Thesis work of [D. Klumpp](https://dominik-klumpp.net) that I supervised at Macquarie University, Sydney. [Automated Control Flow Reconstruction from Assembler Programs](https://www.researchgate.net/publication/329701188_Automated_Control_Flow_Reconstruction_from_Assembler_Programs). D. Klumpp. Masters' Thesis. Dec. 2018 [^fn]: