EVM Parallelization Refinement Notes

Introduction

In this note, we present two example algorithms that utilize formal methods for better EVM parallelization by refining the state witness oracles.

Formal Methods

One way to facilitate better refinement is using formal methods (a type of static analysis) to infer the shared location information \(\kappa\). If executing \(txn_i\) and \(txn_j\) in either order using the sequential EVM \(\delta\) produces the same Ethereum state, then we say they are parallelizable; otherwise, they are conflicting. If our analysis gives no false negatives on the conflicting property, then our analysis is sound. If our analysis gives no false positives on the conflicting property, then our analysis is precise.
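The definition above can be made concrete with a toy model: a state is a dict of storage slots, a transaction is a function from state to state, and two transactions are parallelizable iff they commute. Everything here (apply_txns, the toy transactions) is illustrative and not real EVM semantics:

```python
def apply_txns(state, txns):
    # Run transactions sequentially; each txn maps a state to a new state.
    for txn in txns:
        state = txn(dict(state))
    return state

def parallelizable(state, txn_i, txn_j):
    # txn_i and txn_j are parallelizable iff they commute on `state`.
    return apply_txns(state, [txn_i, txn_j]) == apply_txns(state, [txn_j, txn_i])

# Writers to disjoint slots commute; writers to the same slot may not.
inc_a = lambda s: {**s, "a": s["a"] + 1}
dbl_a = lambda s: {**s, "a": s["a"] * 2}
inc_b = lambda s: {**s, "b": s["b"] + 1}

s0 = {"a": 1, "b": 0}
print(parallelizable(s0, inc_a, inc_b))  # True
print(parallelizable(s0, inc_a, dbl_a))  # False
```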

So we have two choices here. We could give a sound list, overestimating the storage accesses: we guarantee no conflicts but perform some unnecessary sequential computation. Or we could give a precise list, underestimating the storage accesses: we risk ending up with inconsistent states from \(\delta\) and \(\delta_p\), but the time saved by speculative execution might be worth the occasional re-computation (this is the art of balancing soundness against preciseness).

If you think more about the idea of soundness and preciseness, you can easily come up with cases where two transactions are parallelizable but not obviously so. For example, consider a DeFi protocol that only allows holders of its native token to claim an airdrop of another token:

function claim_airdrop() external {
    require(!claimed[msg.sender]);
    require(IERC20(native_token).balanceOf(msg.sender) > 0);
    claimed[msg.sender] = true;
    IERC20(airdrop_token).mint(msg.sender, 400 ether);
    emit AirdropClaimed(msg.sender, IERC20(airdrop_token).symbol(), 400 ether);
}

Now suppose Alice and Bob both already have a balance of 100 native tokens, and there are two transactions \(txn_1\) and \(txn_2\): in \(txn_1\) Alice calls the claim_airdrop function, and in \(txn_2\) Bob calls IERC20(native_token).transfer, transferring all 100 of his native tokens to Alice. Placing \(txn_2\) before \(txn_1\) makes \(txn_1\) read a different native token balance from storage for Alice, but this read affects neither \(txn_1\)'s execution (control flow) nor the final state change.

And this is just a very simple case; you get the idea: more complex interleavings of functions could make it difficult to deduce the parallelizability of \(txn_1\) and \(txn_2\) without actually executing them. In fact, it is impossible to be both sound and precise with static reasoning alone. Although it is a fantastic idea to build quick pattern matchers for those special cases and use them as optimizations for our main static analysis engine, for now we leave that for future work and focus on building a "good enough" static analyzer for deriving the storage access information \(\kappa\).

Approach

We define our algorithm using abstract interpretation. In short, abstract interpretation replaces the concrete value of a variable (like uint256 18000000) with an abstract value during the program "interpretation." This abstract value is usually an element of a complete lattice. In short, a complete lattice \((L, \sqsubseteq, \sqcup, \sqcap, \bot, \top)\) requires that \(L\) is a partially ordered set under the relation \(\sqsubseteq\), and that every subset of \(L\) has both a supremum (its least upper bound in terms of \(\sqsubseteq\)) and an infimum (its greatest lower bound in terms of \(\sqsubseteq\)). The operator for getting the supremum is called join (\(\sqcup\)) and the operator for getting the infimum is called meet (\(\sqcap\)). Finally, we have \(\bot = \sqcap L\), the infimum of the set of all elements in \(L\), and \(\top = \sqcup L\), the supremum of the set of all elements in \(L\).

Why use such a fancy mathematical structure, you ask? Well, it helps us formulate the relation between abstract interpretation and concrete execution as a Galois connection and prove properties of it: for example, that with well-defined transition functions, our interpretation always terminates. For the interested wrinkle-brain theorists out there, you can dig into abstract interpretation starting from Patrick Cousot's tutorial (if the symbols and Greek letters feel like too much too soon, don't worry! We present concrete examples at the end of this section).

Assumption

We assume that our EVM has a backtracking capability; in other words, it can take and keep a snapshot of the EVM state whenever we reach an opcode that changes our program counter pc, so that later, when we finish the execution, we can "go back in time" to the previous EVM state and choose a different path. For example, when we encounter JUMPI, our EVM can take a snapshot of its state, proceed without jumping (just incrementing pc), and then, once that execution finishes, come back and pick up the snapshotted state to execute with a jump (changing pc to the jump destination).
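This backtracking loop can be sketched over a toy instruction set (the instruction tuples and the worklist discipline are illustrative, not a real EVM): on every JUMPI we snapshot the state, explore the fall-through path now, and revisit the snapshot for the taken path later.

```python
import copy

def explore(code):
    # Run `code` (a list of toy instructions) exploring BOTH outcomes of
    # every JUMPI; returns the sorted list of HALT pcs reached.
    worklist = [{"pc": 0}]      # pending snapshots to "go back in time" to
    halted = []
    while worklist:
        st = worklist.pop()
        while True:
            op = code[st["pc"]]
            if op[0] == "HALT":
                halted.append(st["pc"])
                break
            if op[0] == "JUMPI":
                snap = copy.deepcopy(st)   # snapshot the state...
                snap["pc"] = op[1]         # ...taken branch, explored later
                worklist.append(snap)
                st["pc"] += 1              # fall through now
            else:
                st["pc"] += 1
    return sorted(halted)

code = [("NOP",), ("JUMPI", 4), ("NOP",), ("HALT",), ("HALT",)]
print(explore(code))  # [3, 4]
```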

Algorithms

We present two algorithms utilizing abstract interpretation: one for general transactions, and another for special-case transactions like an ERC20 token transfer or a Uniswap add-liquidity call. For the latter, we can run the analysis beforehand to get state witness summaries and determine at runtime which summaries to use based on dynamic profiling.

Coarse Analysis

For general transactions, we perform abstract interpretation after the client receives them; thus we are performing a Just-In-Time (JIT) abstract interpretation. Obviously, this requires the abstract interpretation to be fast, therefore we use a coarse domain (i.e., a complete lattice that has relatively fewer elements and a simpler structure under its \(\sqsubseteq\) relation) for our abstract values.

Since the goal is to have the overall amortized time of parallel coarse analysis and execution be much less than the time sequential execution takes, we deliberately choose a coarse domain that includes havoc operations. Basically, havoc means deliberately overestimating the abstract value of some variable. The reason we do this is that setting the abstract value of a variable to \(\top\) (unknown) saves analysis time (think of \(\top\) as meaning the value could be anything), because performing join/meet with \(\top\) is fast (constant time complexity, since \(\forall v, v \sqcap \top = v, v \sqcup \top = \top\)). And since \(\top\) is an overestimation, every variable that depends (whether control-flow dependent or data-flow dependent) on a variable with abstract value \(\top\) also gets \(\top\) as its abstract value, making operations on it O(1) as well (a kind of domino effect).

The specific reason we use this is that when encountering SLOAD (recall this is the most expensive runtime opcode), we want to cut its execution time to almost nothing, so we won't actually read storage, as that would turn our "abstract interpretation" into an actual "concrete interpretation." So whenever we see SLOAD, we just put a \(\top\) value on top of the stack and go on with our analysis. According to our previous performance analysis, SLOAD accounts for roughly 33% of total EVM instruction emulation overhead, so cutting this alone could directly cut interpretation time by up to 33%, not accounting for the domino effect.

We define the complete lattice \((L_c, \sqsubseteq, \sqcup, \sqcap, \bot, \top)\) used by our coarse analysis below:

  • \((L_c ,\sqsubseteq)\): \(\bot \sqsubseteq v \sqsubseteq \top\) for every concrete value \(v \in \{0, 1, 2, ..., 2^{256}-1\}\); distinct concrete values are incomparable (a flat lattice)
  • \(\top\): Unknown value
  • \(\bot\): Undefined
  • \(\sqcap\): \(\forall v, v \sqcap \bot = \bot, v \sqcap \top = v\); for distinct concrete values \(v \neq w\), \(v \sqcap w = \bot\) (notice \(\sqcap\) is commutative)
  • \(\sqcup\): \(\forall v, v \sqcup \top = \top, v \sqcup \bot = v\); for distinct concrete values \(v \neq w\), \(v \sqcup w = \top\)
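A minimal sketch of this flat lattice, with TOP/BOT as sentinel values and every concrete uint256 sitting between them (the string sentinels are an illustrative choice):

```python
TOP, BOT = "⊤", "⊥"

def join(a, b):
    # Least upper bound ⊔: distinct concrete values jump to ⊤.
    if a == BOT: return b
    if b == BOT: return a
    if a == b:   return a
    return TOP

def meet(a, b):
    # Greatest lower bound ⊓: distinct concrete values drop to ⊥.
    if a == TOP: return b
    if b == TOP: return a
    if a == b:   return a
    return BOT

print(join(3, 3), join(3, 5), join(3, TOP))   # 3 ⊤ ⊤
print(meet(3, TOP), meet(3, 5), meet(3, BOT)) # 3 ⊥ ⊥
```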

Next, we present a sound (conflict is impossible) naive algorithm for parallel EVM.

  1. For each \(txn_i\) in \(\bar{t}\), run the following in parallel:
    1.1. Initialize a set \(l_i^p = \emptyset\) which will contain all storage access traces in a certain execution path \(p\) (an execution path is defined as a sequence of EVM opcode executions starting from the contract function selector to REVERT or RETURN). Initialize a set \(e_i = \emptyset\) which will include the storage access trace of \(txn_i\) across all paths.
    1.2. Run abstract interpretation using the transition functions listed in the table below. Whenever we encounter SLOAD or SSTORE and see value \(v\) on top of the stack, we do \(l_i^p = l_i^p \cup \{ v \}\) (recall that the topmost element of the stack is the storage key for SSTORE and SLOAD). And whenever we encounter JUMPI, we perform a backtrack, creating two execution paths \(p_1\) and \(p_2\), each with their storage trace sets \(l_i^{p_1}\) and \(l_i^{p_2}\) initialized to \(l_i^p\). Whenever we finish a path \(p\), we do \(e_i = e_i \cup \ l_i^p\).
  2. Check \(e_i\) against the other transactions: if \(e_i \cap e_j = \emptyset\), execute \(txn_i\) and \(txn_j\) in parallel; otherwise execute them in the order of \(\bar{t}\).
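Step 2 can be sketched as a greedy batching scheduler (the batch representation is a hypothetical choice, not part of the algorithm above): each batch runs internally in parallel, batches run sequentially, so conflicting transactions keep their order in \(\bar{t}\).

```python
def schedule(access_sets):
    # access_sets[i] is e_i, the storage-access set of txn_i.
    batches = []
    for i, e_i in enumerate(access_sets):
        # Join the most recent batch only if disjoint from all its members;
        # a conflict with an EARLIER batch is fine, since batches run in order.
        if batches and all(e_i.isdisjoint(access_sets[j]) for j in batches[-1]):
            batches[-1].append(i)   # runs in parallel with the last batch
        else:
            batches.append([i])     # conflict: start a new sequential step
    return batches

# txn0/txn1 touch disjoint slots; txn2 conflicts with txn0; txn3 is free.
print(schedule([{0x1, 0x2}, {0x3}, {0x2}, {0x9}]))  # [[0, 1], [2, 3]]
```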

Note that we introduce \(\top\) on the top of the stack after we interpret SLOAD abstractly. This should not be confused with seeing the value \(v\) at the current stack top and adding it to the set \(l_i^p\) (the set of accessed storage locations) right before our interpretation of SLOAD and SSTORE.

We now walk through how we do abstract interpretation on some representative EVM opcodes in the following table. Whenever we see an EVM opcode at pc and "stack input" on the EVM stack, we perform "operation" and get "stack output." During our abstract interpretation (as shown in the following table), we only introduce \(\top\) when we interpret SLOAD, but we allow \(\top\) to be propagated when we interpret other opcodes like ADD, SHA3, etc.

opcode        |stack input           |stack output           |operation
--------------+----------------------+-----------------------+----------------------
SLOAD         |key|...               |⊤|...                  |push ⊤, add key to l_i
SSTORE        |key|value|...         |...                    |nop, add key to l_i
MLOAD         |offset|...            |value(⊤)|...           |if offset = ⊤ then push ⊤ else push memory[offset:offset+32]
ADD           |a|b|...               |a+b(⊤)|...             |if a = ⊤ or b = ⊤ then push ⊤ else push a+b
SHA3          |offset|length|...     |hash(⊤)|...            |if offset = ⊤ or length = ⊤ then push ⊤ else push hash
BALANCE       |addr|...              |addr.balance(⊤)|...    |if addr = ⊤ then push ⊤ else push address(addr).balance
JUMPI         |dest|cond|...         |...                    |if dest = ⊤ then return with warning; else if cond = ⊤ then backtrack $pc; else jump iff cond ≠ 0
JUMP          |dest|...              |...                    |if dest = ⊤ then return with warning else $pc = dest
PUSH1         |...                   |uint8|...              |pushes a 1-byte value to stack
DUP1          |value|...             |value|value|...        |duplicates value whose domain is L_c
CALLDATALOAD  |i|...                 |msg.data[i:i+32]|...   |if i = ⊤ then push ⊤ else loads a uint256 from message data
SELFDESTRUCT  |addr|...              |...                    |if addr = ⊤ then just destroys contract else destroys the contract and sends all funds to addr
RETURN        |offset|length|...     |...                    |return l_i, the collected storage accesses
REVERT        |offset|length|...     |...                    |return l_i, the collected storage accesses
...

And to deal with inter-procedural calls, we perform the operation shown in the following table:

opcode: CALL          
stack input: |gas|addr|value|argsOffset|argsLength|retOffset|retLength|...
stack output: |success|...
operation: if addr = ⊤ then return with ⊤ added to l_i else proceed

For other inter-procedural opcodes like DELEGATECALL and STATICCALL, the handling is analogous to that of CALL.
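A few rows of the table above can be turned into a toy coarse interpreter over a straight-line opcode trace (the tuple trace format is illustrative; only PUSH/SLOAD/SSTORE/ADD are modeled):

```python
TOP = "⊤"

def coarse_interpret(trace):
    # Returns l, the set of storage keys touched by SLOAD/SSTORE.
    stack, l = [], set()
    for op, *args in trace:
        if op == "PUSH":
            stack.append(args[0])
        elif op == "SLOAD":
            key = stack.pop()
            l.add(key)            # record the key (it may itself be ⊤)
            stack.append(TOP)     # havoc: never actually read storage
        elif op == "SSTORE":
            key = stack.pop()
            stack.pop()           # discard the value being stored
            l.add(key)
        elif op == "ADD":
            a, b = stack.pop(), stack.pop()
            stack.append(TOP if TOP in (a, b) else (a + b) % 2**256)
    return l

# Slot 5 is read concretely; ⊤ then propagates through ADD, so the key
# later written by SSTORE is havocked to ⊤ (the domino effect).
trace = [("PUSH", 42), ("PUSH", 5), ("SLOAD",),
         ("PUSH", 1), ("ADD",), ("SSTORE",)]
assert coarse_interpret(trace) == {5, TOP}
```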

Precise Analysis

For transactions that interact with popular contracts (e.g., Uni v3 pool) we pre-perform a static analysis with a relatively refined domain on the contract. Specifically, we call the functions in those contracts special case functions, denoted by \(f_i\). Our goal is to generate a \(\kappa\) that is very close to \(\kappa_{perfect}\) in this step, without much consideration for the speed of our analysis.

We define the complete lattice \((L_p, \sqsubseteq, \sqcup, \sqcap, \bot, \top)\) used by our precise analysis below:

  • \((L_p ,\sqsubseteq)\): \(\bot \sqsubseteq\) exp \(\sqsubseteq \top\). exp's formal definition in BNF is given below. Note that hash functions / SLOAD / MLOAD / CALLDATALOAD are treated as uninterpreted functions (e.g., sha3(a + sha3(0x1efff321 + b))) so that later we can use congruence closure for automated reasoning.
  • \(\top\): Unknown value
  • \(\bot\): Undefined
  • \(\sqcap\): \(\forall v, v \sqcap \bot = \bot, v \sqcap \top = v\) (notice \(\sqcap\) is commutative)
  • \(\sqcup\): \(\forall v, v \sqcup \top = \top, v \sqcup \bot = v\)
exp      ::= literal | symbol | function | exp + exp | exp - exp | ...
literal  ::= 0 | 1 | 2 | ... | 2^256 - 1
symbol   ::= a | b | c | ...
function ::= sha3(exp, exp) | sload(exp) | mload(exp) | calldataload(exp)
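Such exp values can be sketched as nested tuples, so structurally identical expressions compare equal without ever evaluating the uninterpreted functions (a cheap stand-in for congruence closure; sha3 is simplified to a single argument here, and the slot derivation is illustrative):

```python
def sym(name): return ("sym", name)
def lit(n):    return ("lit", n)
def add(a, b): return ("add", a, b)
def sha3(e):   return ("sha3", e)    # uninterpreted
def sload(e):  return ("sload", e)   # uninterpreted

# An illustrative slot expression for _balances[sender] (mapping at slot 0x3):
slot = sha3(add(sym("sender"), lit(0x3)))
balance_sender = sload(slot)

# Two syntactically identical accesses are recognized as the same key:
assert slot == sha3(add(sym("sender"), lit(0x3)))
assert balance_sender == ("sload", slot)
```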

Next, we present a sound naive algorithm for \(\delta_p\).

  1. For each special case function \(f_i\) run the following before real-time execution:
    1.1. Initialize a set \(l_i^p = \emptyset\) which will contain all storage access traces on a certain execution path \(p\). Initialize an additional set \(c_i^p = \emptyset\) used for tracking the environment constraints we collect on path \(p\). Finally, initialize a set \(e_i = \emptyset\) whose elements will be tuples of the form \((l_i^p, c_i^p)\).
    1.2. Run abstract interpretation using the transition rules defined below. Whenever we encounter SLOAD or SSTORE, and we see value \(v\) on top of the stack, we do \(l_i^p = l_i^p \cup \{ v \}\) (recall the fact that the topmost element of the stack is the storage key for SSTORE and SLOAD). And whenever we encounter JUMPI, we perform a backtrack, creating two execution paths \(p_1\) and \(p_2\) each with their storage trace set \(l_i^{p_1}\) and \(l_i^{p_2}\) initialized to \(l_i^p\). Whenever we finish a path \(p\), we do \(e_i = e_i \cup \ (l_i^p, c_i^p)\).
  2. Upon receiving \(\bar{t}\) and \(s_{k-1}\) (the Ethereum state as of the previous block, i.e., block \(k-1\)) in real time:
    2.1. Substitute each symbolic value in \(c_i^p\) with its actual concrete value, then solve \(\bigwedge c_i^p \wedge \forall v, sstore(sload(v)) = v\) (basically, we "and" together all the conditions in \(c_i^p\), plus the additional axiom \(\forall v, sstore(sload(v)) = v\), which saves us from actually loading the concrete value of \(sload(v)\) from storage). If the solver returns true, insert the set of concrete values into \(l_i^p\) and let \(l_i = l_i^p\).
    2.2. Check \(l_i\) against the other transactions: if \(l_i \cap l_j = \emptyset\), execute \(txn_i\) and \(txn_j\) in parallel; otherwise execute them in the order of \(\bar{t}\).

The intuition for step 2.1 is that since we have collected, for each path \(p\), its satisfiability condition \(c_i^p\) and its storage access trace \(l_i^p\), we want to determine which path \(p\) the real transaction will take by checking whether \(c_i^p\) is satisfiable with the concrete values inserted. If it is, then the real transaction will indeed take path \(p\) and end up with the storage access trace set \(l_i^p\).
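Step 2.1 can be sketched without a real SMT solver by modeling each path's constraints as predicates over the concrete environment (pick_path and the toy paths below are hypothetical names, not the actual implementation):

```python
def pick_path(paths, concrete):
    # `paths` is a list of (constraints, accesses): a constraint is a
    # predicate over the concrete environment, an access is a function
    # from the environment to a concrete storage key.
    for constraints, accesses in paths:
        if all(c(concrete) for c in constraints):
            return {a(concrete) for a in accesses}
    return None  # no path satisfiable

# Toy model of _transfer: one path reverts when sender == 0, the other
# reads _balances[sender] (the slot modeled as a ("bal", sender) key).
paths = [
    ([lambda e: e["sender"] == 0], []),                  # revert path
    ([lambda e: e["sender"] != 0],
     [lambda e: ("bal", e["sender"])]),                  # happy path
]
print(pick_path(paths, {"sender": 0xAB}))  # {('bal', 171)}
```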

We walk through how our analysis generates \(\kappa\) for a simple ERC20 contract template taken from OpenZeppelin.

function _transfer(address sender, address recipient, uint256 amount) internal virtual {
    require(sender != address(0));
    require(recipient != address(0));
    uint256 senderBalance = _balances[sender];
    require(senderBalance >= amount);
    unchecked {
        _balances[sender] = senderBalance - amount;
    }
    _balances[recipient] += amount;
    emit Transfer(sender, recipient, amount);
}

The source code, up to and including the first read of _balances[sender], roughly translates[1] to the following bytecode (with sender and recipient represented as symbolic values), with our analysis steps written as comments:

    ... loading parameters ...
    PUSH sender     // stack: sender | ...
    PUSH 0          // stack: 0 | sender | ...
    NEQ             // fork two paths with c_i = {sender = 0} and c_i = {sender != 0} respectively, stack: 1 or 0 | ...
    PUSH CONT1      // stack: cont1 | 1 or 0
    JUMPI           // based on which fork we are on, jump to cont1 or not, stack: |...
    REVERT          // return l_i = \emptyset and c_i = {sender = 0}
    JUMPDEST CONT1:
    PUSH recipient  // stack: recipient | ...
    PUSH 0          // stack: 0 | recipient | ...
    NEQ             // fork two paths with c_i = {recipient = 0} and c_i = {recipient != 0} respectively, stack: 1 or 0 | ...
    PUSH CONT2      // stack: cont2 | 1 or 0
    JUMPI           // based on which fork we are on, jump to cont2 or not, stack: |...
    REVERT          // return l_i = \emptyset and c_i = {sender != 0, recipient = 0}
    JUMPDEST CONT2:
    PUSH 0x3        // stack: 0x3 | ...
    SHA3            // stack: 0x1bf0b26eb2090599dd68cbb42c86a674cb07ab7adc103ad3ccdf521bb79056b9 | ...
    PUSH sender     // stack: sender | 0x1bf0b26eb2090599dd68cbb42c86a674cb07ab7adc103ad3ccdf521bb79056b9 | ...
    ADD             // stack: sender + 0x1bf0b26eb2090599dd68cbb42c86a674cb07ab7adc103ad3ccdf521bb79056b9 | ...
    SHA3            // stack: sha3(sender + 0x1bf0b26eb2090599dd68cbb42c86a674cb07ab7adc103ad3ccdf521bb79056b9) | ...
    SLOAD           // stack: sload(sha3(sender + 0x1bf0b26eb2090599dd68cbb42c86a674cb07ab7adc103ad3ccdf521bb79056b9)) | ..., l_i = {sha3(sender + 0x1bf0b26eb2090599dd68cbb42c86a674cb07ab7adc103ad3ccdf521bb79056b9)}
    ...
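The tail of this trace can be replayed with a string-based symbolic stack (a deliberately crude stand-in for the exp domain; the variable names are illustrative):

```python
K = "0x1bf0b26eb2090599dd68cbb42c86a674cb07ab7adc103ad3ccdf521bb79056b9"

stack, l_i = [], set()
stack.append(K)                                  # PUSH 0x3; SHA3 (precomputed)
stack.append("sender")                           # PUSH sender
stack.append(f"{stack.pop()} + {stack.pop()}")   # ADD: builds "sender + K"
stack.append(f"sha3({stack.pop()})")             # SHA3: uninterpreted
l_i.add(stack[-1])                               # SLOAD records the key...
stack[-1] = f"sload({stack[-1]})"                # ...and pushes sload(key)

assert l_i == {f"sha3(sender + {K})"}
```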

In short, we use the idea of forking in symbolic execution by introducing an environment constraint set \(c_i^p\) and symbolic values for every unknown input. And whenever we encounter an opcode concerning external input/output, we treat it as if it was an uninterpreted function. And this collected \(c_i^p\) will finally be used to solve for \(l_i^p\). Note that we could also do unification on symbolic constraints (which is analogous to summary-based symbolic evaluation) before real-time transactions come in to save reasoning time.

Of course, the actual compiled bytecode may have parts that symbolic execution simply cannot solve. For example, the Solidity compiler compiles the comparison with the zero address not into numeric operations but into bit-vector operations, and bit operations are hard for SMT solvers. We can combat this either by transpiling the bytecode to another, more formal-methods-friendly language, or by using concolic execution so that symbolic values only appear in path conditions but not in the potential \(\kappa\) (stack values are then concrete rather than symbolic; this makes \(\kappa\) potentially unsound but accelerates the analysis a lot). Of course, we still suffer from path explosion. But since we are not doing this analysis on the fly, we can always retreat to a coarser method and still get a strictly better result than using no refinement for parallelization at all.

Pushing the boundary

There are many long-tail optimizations[2] that we can do to our algorithms:

  • separate read and write operations in \(l_i\): currently, if \(txn_i\) and \(txn_j\) both only read a location and never write it, we do not parallelize them even though we could.
  • commutativity of the write operation: commutative write operations could make storage accesses to the same location parallelizable. Previous work has shown that this is possible even while preserving precision.
  • a more refined \(l_i^p\) set: currently, \(l_i^p\) is one big set that doesn't distinguish reads from writes and doesn't account for where those reads/writes happen. There are several optimizations we could apply to make this faster, such as giving \(l_i^p\) a more refined structure and then eliminating the cases we talked about in the section on parallelization algorithms, etc. Details are described below.
  • pattern matchers: since our analysis result on \(l_i^p\) only guarantees no false negatives, and we don't want to make our analysis too complex, we could build simple pattern matchers that eliminate special cases, making us more precise while keeping soundness.
  • abstract interpretation tweaks: we could tweak the design of \(L_c\) to use other domains, or use other ordering strategies to try to accelerate the process. We could even tweak the transition function \(F\); for example, we could probabilistically make MLOAD opaque (pushing \(\top\) every time) and see how that affects precision and performance.
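The first optimization above can be sketched as a read/write-aware conflict check (the (reads, writes) pair representation is an assumed refinement of \(l_i\), not part of the algorithms above): only write/write and read/write overlaps are real conflicts.

```python
def conflicts(rw_i, rw_j):
    # rw_* is a (reads, writes) pair of storage-key sets.
    r_i, w_i = rw_i
    r_j, w_j = rw_j
    return (not w_i.isdisjoint(w_j)      # write/write conflict
            or not w_i.isdisjoint(r_j)   # write/read conflict
            or not r_i.isdisjoint(w_j))  # read/write conflict

print(conflicts(({0x1}, set()), ({0x1}, set())))  # False: two reads commute
print(conflicts(({0x1}, set()), (set(), {0x1})))  # True: read vs write
```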

  1. The bytecode I shared here differs in major ways from what Solidity actually outputs. For example, the comparison of address sender with zero is done via bit-vector operations. Here we deliberately "hand-compiled" the code for simplicity in our demonstration. ↩︎

  2. It would also be interesting to study how the preciseness of our analysis (e.g., tuning the domains) impacts our overall performance. Basically, this asks: if you were to plot a graph where the x-axis is the abstraction level (how much information \(\kappa\) gives) and the y-axis is the performance boost to parallel execution (the amortized execution time of \(\delta_p\)), what would it look like? It would also be interesting to see the conflict rate of \(\bar{t}\) under \(\kappa\)'s information. ↩︎
