# ignacio (jsign) - Update 15

In the last two weeks I had the opportunity to dive into the _replay benchmark_ and the _tree conversion_ tool.

## What is the _replay benchmark_?

The _replay benchmark_ is a setup in which a geth node imports a chain of ~500k blocks. It's a fast way to benchmark the performance of block processing in geth. It lets us import the same chain using either the current Merkle Patricia Trie or the Verkle Trie, and compare their performance.

This benchmark is the best one we have today for comparing both kinds of trees, because compared with _synthetic benchmarks_ (i.e. generating random key-values in in-memory trees):

- It shows the big picture of how the trie is wired into geth.
- The access patterns (e.g. insertions, updates, reads) are closer to the real world (random key-values aren't representative).
- It shows how the serialization and deserialization of tree nodes plays out in the client execution pipeline.
- The geth node also uses other methods in the `go-verkle` library which are unrelated to the _tree_ itself (e.g. calculating tree keys also involves cryptographic operations).

This benchmark was created by Guillaume, and he has been validating every evaluated improvement with it. It is now included as a CI action in the go-verkle repository, and he has sent me some extra files so I can run it locally.

## What is the _tree conversion_ tool?

This is a new CLI command in geth that _migrates_ an existing Merkle Patricia Trie (MPT) to its equivalent Verkle Trie (VKT).

This tool is useful because it allows the _replay benchmark_ to run both for the MPT and the VKT. To run the _replay benchmark_ using the VKT, you first need to migrate the initial state of the benchmark to a VKT, and then you can import the chain.

You can think about this like an SQL database schema upgrade. Whenever you start up the new version of an application, you might need to run some SQL migrations first to accommodate the existing data to the new schema, and only then can you run the upgraded version of the application (if you don't, you might not find data where you expect it).

## Exploration

I've been running the _replay benchmark_ locally to understand the performance difference between MPTs and VKTs. I've been merging some tentative optimizations that we still have as draft PRs, to check whether the improvements we expect actually materialize in the benchmark.

### Serializing commitments in uncompressed form

I tested an idea we had been exploring in previous weeks: saving commitments in tree nodes in uncompressed form (instead of compressed). After looking at the impact of this change locally, I included other optimizations that helped to speed things up further. You can see the related PRs in:

- go-ethereum: https://github.com/gballet/go-ethereum/pull/164
- go-verkle: https://github.com/gballet/go-verkle/pull/327
- go-ipa: https://github.com/crate-crypto/go-ipa
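
To make the trade-off concrete, here is a minimal sketch using P-256 from the Go standard library as a stand-in curve (this is not the actual go-verkle/go-ipa code; the real commitments are Banderwagon points, which are 32 bytes compressed and 64 bytes uncompressed). The shape of the trade-off is the same: the compressed form is half the size on disk, while deserializing it requires recomputing the missing coordinate, which is where the extra CPU goes.

```go
package main

import (
	"crypto/elliptic"
	"crypto/rand"
	"fmt"
)

func main() {
	// P-256 as an illustrative stand-in for Banderwagon.
	curve := elliptic.P256()
	_, x, y, _ := elliptic.GenerateKey(curve, rand.Reader)

	// Compressed: only X plus a sign byte. Smallest on disk, but reading it
	// back requires solving the curve equation for Y.
	compressed := elliptic.MarshalCompressed(curve, x, y) // 33 bytes here, 32 on Banderwagon

	// Uncompressed: X and Y stored verbatim. Twice the bytes on disk, but
	// deserialization is essentially a copy.
	uncompressed := elliptic.Marshal(curve, x, y) // 65 bytes here, 64 on Banderwagon

	fmt.Printf("compressed:   %d bytes\n", len(compressed))
	fmt.Printf("uncompressed: %d bytes\n", len(uncompressed))

	// Decompression is the CPU work that storing uncompressed points avoids.
	rx, ry := elliptic.UnmarshalCompressed(curve, compressed)
	fmt.Println("round-trip ok:", rx.Cmp(x) == 0 && ry.Cmp(y) == 0)
}
```

In other words, storing the uncompressed form trades disk bytes for CPU time, and the benchmark is what tells us whether that trade pays off end to end.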
Locally, this led to a 1.57x speedup in this end-to-end benchmark, which was very exciting. Unfortunately, when I ran it on an AWS `m5.xlarge` VM I didn't see that kind of speedup. Diving deeper into why, we discovered some interesting facts:

- The CPU improvements were still there, so we were saving work for the CPU.
- The reason this didn't materialize into the expected _wall-clock_ time improvement is that the change puts more work on the disk, so the CPU improvement is offset by the disk taking more time to do its part of the work.
- The reason I see more wall-clock gains on my machine compared to `m5.xlarge` is that my NVMe is faster (more IOPS and lower latency) than `m5.xlarge` with a fast-ish disk. This makes sense, since my computer is a desktop machine (bare metal) rather than a virtualized one.

Although we already suspected that saving points in uncompressed form was going to add more stress to the disk, I didn't expect it to account for that much. This made me take a closer look at what the disk was doing in general.

### Further look at disk IO

I compared the _replay benchmark_ runs for MPT and VKT side by side with respect to disk IO.

This is a summary of the _replay benchmark_ for MPT:
![](https://i.imgur.com/agqkZ3v.png)

This is a summary of the _replay benchmark_ for VKT:
![](https://i.imgur.com/JWjwkbR.png)

You can see there how `Write(MB)` for MPT and VKT is ~816MiB and ~35GiB respectively. The concrete numbers don't matter much; what matters is the order-of-magnitude difference. VKTs are doing more than an order of magnitude more disk IO, which flips the performance problem from being CPU-bound to being disk-bound. This order-of-magnitude difference holds independently of the _uncompressed vs compressed_ serialization change.

If you think about it for a moment, it makes sense, since the nodes of the tree are _fatter_ in VKTs, which has this side effect (see the rough sketch at the end of this post). This will probably need more thought to confirm whether that is the complete explanation, and whether we can do anything about it.

## Conclusion

It has been very useful to get my hands on this benchmark. It has re-calibrated my point of view on the performance difference between MPTs and VKTs in the most real-ish scenario we have of a client using them with real transactions.
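
As a closing aside, here is a rough back-of-the-envelope sketch of the _fatter nodes_ intuition from the disk IO section. The numbers are simplified assumptions of mine (they ignore encoding overheads and the exact go-verkle on-disk format), not measurements from the benchmark; the only point is the relative scale.

```go
package main

import "fmt"

func main() {
	// Illustrative, simplified sizes; the real on-disk encodings differ in detail.

	// MPT: a fully populated branch node references up to 16 children by
	// their 32-byte keccak hashes, and a leaf holds a single account or slot.
	mptBranch := 16 * 32 // ~512 bytes, ignoring RLP overhead

	// VKT: a leaf node groups up to 256 values of 32 bytes each under one
	// stem, plus its commitments (32 bytes compressed / 64 uncompressed each).
	vktLeafValues := 256 * 32 // ~8 KiB of values alone

	fmt.Printf("MPT branch node:      ~%d bytes\n", mptBranch)
	fmt.Printf("VKT leaf node values: ~%d bytes\n", vktLeafValues)

	// Every state update rewrites the modified nodes on the path from the
	// changed leaf up to the root, so fatter nodes mean more bytes written
	// per block, which is consistent with the Write(MB) gap shown above.
}
```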