# Vyper's O(1) selector tables
In [version 0.3.10](https://github.com/vyperlang/vyper/releases/tag/v0.3.10), vyper shipped with "O(1) selector tables". What is a selector table and why is it important? Read on to find out!
## Selector Tables
In every EVM language, when you call a function within a contract, you provide its method ID as a 4-byte signature ("selector" or "method id" -- I will use these terms interchangeably). For example, the 4-byte signature for `foo()` is `0xc2985578`. This is calculated as the first 4 bytes of the keccak256 hash of the string `"foo()"`, as defined in the [ABI v2 spec](https://docs.soliditylang.org/en/v0.8.21/abi-spec.html#formal-specification-of-the-encoding).
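If you want to double check a selector yourself, it only takes a few lines of Python (a quick sketch; I'm assuming the `eth-hash` package here, but any keccak256 implementation works):
```python
from eth_hash.auto import keccak  # assumption: the eth-hash package is installed

def selector(signature: str) -> str:
    # the method id is the first 4 bytes of keccak256 of the canonical signature
    return "0x" + keccak(signature.encode()).hex()[:8]

assert selector("foo()") == "0xc2985578"
```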
Note that the EVM does not know anything about the ABI! The ABI is essentially a convention which is agreed upon by all smart contracts so that they can talk to each other. So when you pass that 4-byte signature to the contract, it has to decode it and figure out which part of the contract to jump to.
Prior to 0.3.10, vyper did this with a linear search. That is, it did something like:
```python
method_id = calldata[:4]
if method_id == 0xc2985578:
    goto foo__label
if method_id == 0xfebb0f7e: # method id for `bar()`
    goto bar__label
...
# go to the fallback function when no method id is found
goto fallback__label
```
However, that was not very efficient. Each branch in the EVM could cost 22 gas, and 11 bytes worth of bytecode. This adds up! For a large contract with 80 external methods, that costs at least 880 bytes, and 1760 gas to resolve the method (just finding the location in the code!) in the worst case. The one redeeming feature of this system is that users could manually order "hot" methods towards the top of the file to come earlier in the selector table. But in general, the system is very expensive.
There are actually some additional checks which I've glossed over: the calldatasize check and the payable check. They look something like this:
```python
if method_id == 0xc2985578:
    assert calldatasize >= 4 # prevents certain strange collisions between method ids
    assert callvalue == 0 # foo() is a nonpayable function
```
These can be considered a constant addition to the gas cost since they are not traversed during the search, but they do get executed once before function entry. They also add some code size. In particular, `CALLDATASIZE PUSH1 4 LT PUSH2 <revert block> JUMPI` costs 8 bytes, and `CALLVALUE PUSH2 <revert block> JUMPI` costs 5 bytes. That sounds cheap, but it adds up -- for our hypothetical large contract with 80 methods, these checks can cost another 1040 bytes!
## Take 1 -- Magic
My first encounter with this problem was nearly 3 years ago, in [July 2021](https://github.com/vyperlang/vyper/issues/2386). I had been working on codesize optimizations for Vyper and realized that the selector table was particularly costly. I was familiar with [common jumptable implementations](https://en.wikipedia.org/wiki/Branch_table) for switch statements, but those typically work well when the input values are dense. In our case, method ids are more or less randomly distributed through 32-bit space, so a naive jumptable would require code size on the order of `2**32`! Not feasible.
My first idea here was to find some magic hash function which would map those 4-byte method ids to the dense set from 0-N (where N is the number of selectors). These are typically called [perfect hash functions](https://en.wikipedia.org/wiki/Perfect_hash_function), and they can theoretically be computed given a statically known set of numbers. I quickly wrote up a script to find this magic number given a set of selectors. It looked something like this:
```python
#!/usr/bin/env python
from vyper.utils import method_id_int

seed = 1
signatures = [f"foo{seed + i}()" for i in range(10)]
method_ids = [method_id_int(sig) for sig in signatures]

def HASH(method_id, magic, n):
    return (method_id * magic) % n

magic = 0
while True:
    mapped_method_ids = [HASH(method_id, magic, len(method_ids)) for method_id in method_ids]
    if len(set(mapped_method_ids)) == len(method_ids):
        print("FOUND MAGIC:", magic)
        break
    magic += 1
```
For 5 selectors, it worked! I increased the number of selectors. By 15 selectors, performance dropped off a cliff -- I let my laptop crunch for hours and it could not find a solution. The search seemed exponential; for moderately larger numbers, it might take longer than the heat death of the universe to find a solution. I tried different hash functions -- sha256 instead of mulmod, "clever" starting points for the magic number, etc. -- but nothing improved the algorithmic performance.
What about regular hash tables? Those didn't seem like they would work too well. To minimize collisions, you need [low load](https://en.wikipedia.org/wiki/Hash_table#Load_factor), and space is at a premium here. And even once you resolve to a bucket, resolving the collisions inside buckets (classically implemented as a linked list traversal) would not be efficient enough on the EVM, which is already resource-starved. I did some rough estimates and concluded -- if we need to loop, we have already lost!
I set the problem aside for some time.
## Take 2: Hash Tables
I picked up this problem again in May of 2023. @bout3fiddy from Curve was having some issues with codesize and so I made some small "optimizations" to the selector table -- several of which turned out to not work very well. Frustrated, I started thinking about the O(1) selector table again. I looked at some resources I had looked at before including research on perfect hashing and [cuckoo hashing](https://en.wikipedia.org/wiki/Cuckoo_hashing).
But this time I read the resources more carefully.
Both perfect hash constructions and cuckoo hashing use a two-level technique. I had already rejected cuckoo hashing because it eats up two slots per item, which would not result in enough codesize savings.
However, the two-level technique got me thinking. Why don't we use a two-level technique here? We can do something which looks like a regular hash table. But after scoping into a bucket, we don't need to loop to resolve the collision. We can use the magic number technique from before! Even though we can't calculate the magic number for sets larger than ~12, we can just target buckets of size 10-12. Thus, we just require two levels of pointer indirection in order to resolve from the selector to the code destination we need to jump to!
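To make the idea concrete, here is a rough sketch of the construction (simplified, not the actual vyper implementation; the shift amount and search bound are illustrative):
```python
def find_bucket_magic(method_ids, max_tries=2**16):
    # brute-force a per-bucket "magic" so that
    # ((method_id * magic) >> 24) % len(method_ids) is collision-free
    n = len(method_ids)
    if n <= 1:
        return 0
    for magic in range(max_tries):
        slots = {((m * magic) >> 24) % n for m in method_ids}
        if len(slots) == n:
            return magic
    return None  # give up; the caller should retry with a different bucketing

def build_selector_table(method_ids, n_buckets):
    # level 1: a plain mod distributes method ids into small buckets
    buckets = [[] for _ in range(n_buckets)]
    for m in method_ids:
        buckets[m % n_buckets].append(m)
    # level 2: solve the small "perfect hash" problem independently per bucket
    table = []
    for bucket in buckets:
        magic = find_bucket_magic(bucket)
        if magic is None:
            return None  # unlucky bucketing; try a different n_buckets
        table.append((magic, bucket))
    return table
```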
Excited, I quickly wrote some scripts to prove out that, indeed, we could solve this "perfect hash" for sets larger than 10. I started solving sets > 100 within a few hundred ms and was satisfied. Eventually, these scripts evolved into tuning benchmarks which [remain in the vyper codebase today](https://github.com/vyperlang/vyper/blob/db8ac3c29ebae17a123ad526ec4ce69f3734be40/vyper/codegen/jumptable_utils.py#L166-L186).
At this point, I was confident that the technique would work. However, I had to iron out all the details. Even if the technique was O(1), if the constant was high, it could still be more expensive than other techniques. For reference, a selector table with a single method id in it (traversing it takes a single branch) can be implemented to cost between 56-90 gas. If the hash table traversal wasn't comparable to that, users would want to use the cheaper alternatives. In other words, even with our super-cool O(1) technique, it still had to be hyper-optimized to be competitive!
At this point, I was taken on a trip, and was forced to be in nature for some time with my thoughts. This was helpful for ironing out details.
## Details, and a Tradeoff
The first order of business was to actually finalize the hash function. The hash function has to be extremely cheap -- keep in mind that even the simplest arithmetic operation costs at least 9 gas on the EVM (e.g. `DUP1 PUSH1 24 SHR`). Given that our target is 56 gas, we don't have a lot of budget for arithmetic operations! I ended up choosing a simple hash function which performed well during tuning -- a single multiplication, right shift, and modulus operation.
With that settled, I started to work on the layout of the hash table. It would live in the bytecode, as part of a kind of data section. I quickly realized that for our purposes here, the `CODECOPY` operation is actually quite expensive, because it takes 3 arguments. The minimum cost for a `CODECOPY` operation is 15 gas -- 9 gas for the arguments, and 6 gas for the copy operation itself. So we want to minimize the number of times we need to use `CODECOPY`.
I ended up with two main sections: the bucket headers, and the tables with method information. To get the bucket, we would simply `MOD` the input against the number of buckets, and use that as an offset into the bucket header section. From the bucket header section, we would get the magic number for that bucket (which essentially encodes the only variable information about the hash function for the bucket), and then use the hash function described in the previous step to find the location of the method in the given bucket. This results in a total of two `CODECOPY`s.
But this introduced an interesting problem: could the bucket overhead overshadow the bytecode savings? I realized we could improve the distribution of items by trying a range of target bucket sizes. Interestingly, and opposite to "normal" hash table implementations, we want to *maximize* bucket size, since that minimizes the bucket overhead. So we spend some compilation time optimizing the bucket size. By trying a range of possible bucket sizes, we avoid, with very high probability, the "pathological" cases where the bucket overhead overshadows the rest of the selector table. In the vast majority of cases, the bucket overhead is less than 10% of the selector table.
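Roughly, that tuning loop could look like the following (another simplified sketch, reusing the hypothetical `build_selector_table` from above; the real benchmarks live in `jumptable_utils.py`):
```python
def pick_bucketing(method_ids, target_bucket_sizes=range(6, 13)):
    # fewer buckets means fewer bucket headers, i.e. less overhead, so prefer
    # the smallest n_buckets for which every bucket still has a solvable magic
    best_n, best_table = None, None
    for target in target_bucket_sizes:
        n_buckets = max(1, (len(method_ids) + target - 1) // target)
        table = build_selector_table(method_ids, n_buckets)
        if table is None:
            continue  # some bucket had no solvable magic; try another size
        if best_n is None or n_buckets < best_n:
            best_n, best_table = n_buckets, table
    return best_n, best_table
```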
### Layout
So let's review the layout. The layout I ended up with was as follows (cf. [the vyper source code](https://github.com/vyperlang/vyper/blob/37ef8f4b54375a458e8b708cf3c41877b5f1655e/vyper/codegen/module.py#L158)):
```
bucket magic number <2 bytes> | bucket location <2 bytes> | bucket size <1 byte>
```
We would `CODECOPY` the bucket header at `BUCKET_OFST + (method_id % num_buckets) * sz_bucket` to memory, then `mload` it and unpack it using bit operations. Then we would find the function info by using the formula `bucket_location + ((method_id * bucket_magic_number >> 24) % bucket_size) * sz_function_info`.
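The bucket header unpacking is just a few bit operations, mirroring the layout above:
```python
def unpack_bucket_header(header: int):
    # header: bucket magic <2 bytes> | bucket location <2 bytes> | bucket size <1 byte>
    bucket_magic = header >> 24
    bucket_location = (header >> 8) & 0xFFFF
    bucket_size = header & 0xFF
    return bucket_magic, bucket_location, bucket_size
```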
The info for each function would look like ([link to source code](https://github.com/vyperlang/vyper/blob/37ef8f4b54375a458e8b708cf3c41877b5f1655e/vyper/codegen/module.py#L186))
```
method id <4 bytes> | label <2 bytes> | func info <1-3 bytes>
```
The func info field (1-3 bytes, packed) holds the expected calldatasize and the `is_nonpayable` bit. The number of bytes required for the func info (that 1-3 byte quantity) is determined at compile time by the largest `expected_calldatasize` in the contract.
I was able to pack the `is_nonpayable` bit into the func info even when the func info is only 1 byte, because the expected calldatasize is always a multiple of 32, plus 4, so its low bits are always `0b00100` -- leaving the lowest bit free for the flag.
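For example (with illustrative values), a nonpayable function taking a single `uint256` argument expects `4 + 32 = 36` bytes of calldata; the lowest bit of that number is always zero, so the flag can ride along for free:
```python
expected_calldatasize = 4 + 32  # selector + one 32-byte argument == 0b100100
is_nonpayable = 1

packed = expected_calldatasize | is_nonpayable   # == 0b100101
assert packed & 1 == is_nonpayable               # unpack the flag
assert packed & 0xFFFE == expected_calldatasize  # unpack the calldatasize
```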
All told, the func info is 7-9 (typically 7) bytes, and the bucket header is 5 bytes -- and since there is typically one bucket header for every ~10 func infos, the amortized cost is about 8 bytes per method. This is good! For that 80-method contract, we are saving 1.4KB!
But I was running into the limits of this approach. Remember how I was saying each arithmetic operation costs at least 9 gas? Well, it turns out that that constant runtime factor I was worried about is larger than I wanted.
### The Trouble with Gas
Each component of decoding the hash table, even though well optimized, added up. Let's break it down. I'll use "pseudo"-assembly, so the instructions are accounted for, but for the sake of presentation I'll be a little loose with stack variable locations and cleanup. For instance, `SCHEDULE<variable>` indicates that we have scheduled `<variable>` to be in the right place on the stack, and don't need real (`DUP` or `SWAP`) instructions to get it there.
```asm
// Getting the method ID: 11 gas
PUSH0 CALLDATALOAD PUSH1 224 SHR
// Finding the bucket ID: 11 gas
DUP1 PUSH <n_buckets> MOD
// Resolving to code location, copy to memory: 26 gas
PUSH1 5 PUSH1 <szbucket> MUL PUSH2 <BUCKETS_OFST> ADD PUSH1 27 CODECOPY
// Unpack bucket header: 39 gas
PUSH0 MLOAD // load bucket info to stack
DUP PUSH1 24 SHR // bucket magic
DUP PUSH1 8 SHR PUSH1 0xFFFF AND // bucket location
DUP PUSH1 0xFF AND // bucket size
// Calculate function info location, copy to memory: 42 gas
PUSH1 7 // func info size
SCHEDULE<bucket size>
DUP<method id> SCHEDULE<bucket magic> MUL PUSH1 24 SHR MOD // our hash function!
PUSH1 7 MUL SCHEDULE<bucket location> ADD // the func info pointer
PUSH1 27 CODECOPY
// Decode func info: 47 gas
PUSH0 MLOAD // load func info to stack
DUP PUSH1 1 AND // is nonpayable
DUP PUSH1 0xFFFE AND // expected calldatasize
DUP PUSH1 8 SHR PUSH1 0xFFFF AND // function label
DUP PUSH1 24 SHR // function method id
// Perform necessary checks: 59 gas
PUSH1 3 CALLDATASIZE GT // in case there are trailing 0s in the method id
DUP<method_id> SCHEDULE<function method id> EQ // method id from calldata == method id
AND // method id matches and calldatasize is valid (otherwise, fallback)
PUSH2 <join> JUMPI PUSH2<fallback function> JUMP JUMPDEST<join> // if should_fallback: goto fallback
CALLVALUE SCHEDULE<is nonpayable> MUL // bad callvalue
SCHEDULE<expected calldatasize> CALLDATASIZE LT // bad calldatasize
OR PUSH2<revert block> JUMPI // if bad calldatasize or bad callvalue: goto fail
// We are free! Jump to the function: 8 gas
SCHEDULE<function label> JUMP
```
Grand total: 219 gas[^1]! That's a lot compared to our benchmark. If we compare to a binary search implementation, we start to outperform binary search (gas-wise) at around 64 methods. We'd really like to be as close to the benchmark as possible.
[^1]: I saved myself the work of doing the implementation to find out the real cost by doing these estimations in my head (and, once there started to be multiple options, with excel+calculator). I was pleasantly surprised, once done with the implementation, that it matched my estimations very closely :). Being in nature helped!
So, looking over our hashtable implementation, we can see there are a few places which are really cutting into our gas budget.
1. Two hash functions -- the first mod to find the bucket, and the second, more complicated hash function -- cost 25% of our budget.
2. Unpacking all the metadata is actually very expensive! This costs 40% of our budget.
3. We are overpaying for the callvalue and calldatasize checks. Because they are stored as data, we cannot do certain optimizations -- for example, we cannot optimize out the callvalue check when the nonpayable flag is false, because we need to inspect it at runtime.
Foiled again! Although it seems like we are on the right track. It seems we are starting to face a code size vs. gas tradeoff. If we inspect the layout, a lot of the cost is in unpacking; we could improve the gas cost if we gave up some packing. What if we took some of the things which currently behave like runtime data and moved them into regular code blocks?
So I moved on to trying to optimize this for gas, rather than codesize.
## Optimizing for Gas
Now that it was clear that the bulk of the gas cost was copying things from the data section, I shifted gears a little bit. We could use the same technique, but with a twist. After the first `MOD` operation to find the bucket, instead of copying function info from a data section, we could use a more traditional hash table implementation. But traversing data is expensive! That's ok, because we actually know the structure of the hash table at compile-time. Instead of traversing data, we can traverse code!
The structure is actually very similar to the original linear search algorithm. However, we use that initial `MOD` operation to get us *really* close to our final target, before starting the linear search. And this time, we want the buckets to be as small as possible, to minimize the gas spent traversing any particular bucket[^4]. This technique is better explained using pseudo-python code instead of assembly.
```python
method_id = calldataload(0) >> 224
bucket = method_id % n_buckets

# grab the bucket location from the data section
sz_bucket = 2
codecopy(28, BUCKET_OFST + bucket*sz_bucket, sz_bucket)
bucket_location = mload(0)

goto bucket_location

...

jumpdest <bucket 1 location>
if method_id == 0xabcd:
    ... # calldatasize check
    ... # callvalue check
    goto method_0xabcd
if method_id == 0x1234:
    ... # calldatasize check
    goto method_0x1234
goto fallback # close off the bucket

jumpdest <bucket 2 location>
if method_id == 0xbdef:
    ... # calldatasize check
    ... # callvalue check
    goto method_0xbdef
goto fallback # close off the bucket
```
This is cool! We have reduced the amount of hashtable overhead. We can actually estimate -- that bucket jumping preamble looks like it costs about 60 gas. And since the data is all encoded in code, we don't need to design all that pesky code layout stuff. Keeping in mind that each branch costs about 22 gas, it looks like we can get into our function in about 80-100 gas, depending on which checks are required!
[^4]: I also wrote some tuning scripts for this. It used more or less similar techniques as described before, so I won't go into it, but [the script is in the vyper codebase as before](https://github.com/vyperlang/vyper/blob/db8ac3c29ebae17a123ad526ec4ce69f3734be40/vyper/codegen/jumptable_utils.py#L189-L212).
However, this is going to be strictly larger in codesize than the original linear search. You can see this because the hashtable structure is kind of superimposed over the linear search structure of the code. So, while this is super cheap gas-wise (it is pretty much always more performant than linear search for any N > 1!), it doesn't solve the original problem of improving code size. And, try as I might, I couldn't find a middle ground, an implementation which was strictly an improvement in both code size and gas.
## A Conundrum - Have your cake and eat it?
So, what was I to do? After meditating a bit, I created a spreadsheet comparing the options and shopped it around.
![image](https://hackmd.io/_uploads/H1Rw-kqc6.png)
After getting some feedback, I concluded that both options would be useful, and that users should be able to decide between them based on their performance constraints. So I decided to implement both! Which implementation gets chosen is determined by a new CLI flag, `-Ogas` vs `-Ocodesize`[^2]. You can see the work and description in the PR to the vyper codebase: https://github.com/vyperlang/vyper/pull/3496.
[^2]: A source code pragma, e.g. `#pragma optimize codesize`, can also be used.
## Conclusion
The proof is in the pudding. With its new selector tables, Vyper contracts are competitive with -- and often more performant than -- handrolled huff / assembly / bytecode contracts[^3]. This is because the selector table gives a gas advantage out of the gate, before you have even started executing any user code! (Implementing O(1) selector tables is, of course, possible in huff or assembly, but it is complicated and the various approaches end up being difficult to maintain.) If you care about gas or codesize and have been on the fence about trying vyper, now is the time to get involved!
[^3]: https://twitter.com/big_tech_sux/status/1679991525172813824
As always, feel free to jump in our discord (https://discord.gg/6tw7PTM7C2) or get your hands dirty and visit our github (https://github.com/vyperlang/vyper).
Until next time!