# SHA256 optimization for Merkle Roots <p style="text-align: center;"> We benchmark a SHA256 algorithm optimized for 64 bytes blocks.<br> We find substantial improvements over existing Ethereum consensus clients</p> ## TL; DR; - Most SHA256 libraries are highly optimized to hash large buffers. - When computing the root of a Merkle tree, the padding block takes as much time as the data block. - Hardcoding the precomputed scheduled words of the padding block we save 40% of hashing time. ## 1. Introduction The SHA256 hashing algorithm takes a message MSG of arbitrary length and produces a digest DIGEST of 256 bits (32 bytes). The algorithm goes roughly as follows 1. Add bits to MSG in a predetermined manner to obain a padded message PMSG consisting on an integral multiple of 512 bits (64 bytes). 2. Start with a hardcoded predetermined initial digest DIGEST = DIGEST_INIT. This digest is successively transformed by reading one 512 bits BLOCK at a time from PMSG and updating DIGEST schematically as: a. Read 4 double words SCHEDA ($4 \cdot 32 = 128$ bits) from BLOCK. b. compute 4 more double words SCHEDB. c. Update DIGEST as a function of the read 4 double words SCHEDA. After four iterations of the above, there are no more double words to read from BLOCK, the algorithm continues as follows d. Consider the previously read 4 words as a message block: SCHEDA = SCHEDB. e. compute 4 more double words SCHEDB. f. Update DIGEST as a function of SCHEDA. Points e) trough f) are repeated 60 times. 3. Repeat 2) by reading the next 512 bits BLOCK from PMSG until there are no more bits on PMSG to read from. The resulting DIGEST obtained at the last iteration of f) is returned as a digest of the message. The computational part of the algorithm is divided in two parts: updating the DIGEST in points c) and f) above, and *scheduling* which consists on computing the words SCHEDB in points b) and e) above. When MSG itself is an integral multiple of 512 bits, the padding is an entire extra block. Moreover, this padding block does not depend on MSG, only on its size. Most applications of SHA256 consist of validating large chunks of data, transferred over the wire, and checksummed in the receiving end. Since this ought to be on live systems, performance is tantamount and most applications are written either directly in highly optimized assembly code or C intrinsics. However, they are optimized for this task: hashing of large messages. Tricks and tools vary from using special instructions on newer CPUS to hashing several blocks in parallel by using large registers within the CPU to hold several of the scheduled words SCHEDX at the same time. While the computation of DIGEST has to be carried out sequentially from one BLOCK to the next BLOCK, the computation the scheduled words can be made in parallell, so while hashing one block one can compute the scheduled words for the next one at the same time. When the message is thousands of blocks large, an extra padding block does not affect performance. However the situation for Merkle trees such as implemented in Ethereum, with 32 bytes chunks, is precisely the opposite. A Merkle tree consists of a binary tree schematically as depicted in Figure 1. graphviz digraph G { root [color=blue, style=bold] a [color=blue, style=bold] root -> a [color=blue]; b [label ="#(#(0,0),#(0,0))",style=filled] root -> b c [label = "b"] d [label ="c",color=blue,style=bold] e [label = "#(0,0)", style=filled] f [label = "#(0,0)", style=filled] a -> c a -> d[color=blue] b -> e; b -> f; c -> 1; c -> 2; 3 4 [label ="0",color=blue,style=bold] 5 [label ="0",style=filled] 6 [label ="0",style=filled] 7 [label ="0",style=filled] 8 [label = "0",style=filled] d -> 3; d -> 4 [color=blue] e -> 5; e -> 6; f -> 7; f -> 8; labeltoc = "t" label="Figure 1: A Merkle tree of depth 3." }  Each leaf node consists of a 32 bytes chunk of data, and each parent node has the digest of the 64 byte block obtained by concatenating the two children chunks. When hashing a current BeaconState on the Ethereum consensus layer, we have to hash a few Merkle trees of depth 40, that is, there are $2^{40}$ leaves!. Fortunately, most of the leaves are filled with zero. In the above Figure 1, only the first three leaves have data, while the last 5 are filled with zero. In a BeaconState we find trees with about 200,000 chunks with data out of the possible $2^{40}$ (this is still about $2^{18}$ worth of chunks!). All clients therefore precompute a *zero hash array* (the grayed area in Figure 1). Thus at the time of computing the digest in the node labeled root, only the nodes with actual data need to be hashed, plus the siblings of the *zero hash* part of the tree. In the example of Figure 1, the only boundary node is the one labeled a that needs to be hashed concatenating it with its sibling. Even after considering these optimizations, hashing a complete BeaconState today takes about 2 million calls to the SHA256 hashing algorithm, and each of these calls consist of hashing one 64 byte block (two sibling nodes concatenated). So for each one of these calls, an extra padding block is added. Exactly the same block in each case (remember the padding block only depends on the size of MSG which is one BLOCK for all calls to the SHA256 algorithm). It takes the same time to hash the paddings, than the actual data we want to hash. ## 2. Hardcoding the padding scheduled words Recall that while we cannot update the DIGEST for one round of the algorithm without computing the previous round, we can in fact compute the scheduled words without knowing the updated DIGEST. This is particularly useful in the Merkle tree computation: we have an entire block of 64 bytes consisting of the padding block, that we know before hand. So the corresponding scheduled words SCHEDB for each round of the algorithm can be hardcoded since they are **known before compile time!**. The overal time savings from this idea ammounted to 40% of the total hashing time in the benchmark below. This implementation detail is used in at least one of [bitcoind's](https://github.com/bitcoin/bitcoin) hashing algorithms, but surprisingly seems to be missing on the existing Ethereum clients algorithms. ## 3. Caching is more important at runtime Before showing the results of our benchmarks, it is important to notice that in real-life applications (namely all consensus clients), one rarely needs to hash a complete BeaconState from scratch. An optimization that clients use is to keep a cache of the whole hash tree after they have computed it initially. Thus, if some leaves change, only few of the nodes need to be recomputed in order to update the digest at root. Consider the example of Figure 1. If some data is added in the leaf node highlighted in blue, only the blue nodes need to be recomputed, that is the nodes labelled c, a and root. This optimization reduces dramatically the hashing time of the digest at root, since most lists in the largest ssz object on the beacon chain (the BeaconState) are slowly changing (consider the validator list, which can only add a few validators per epoch plus a few of them that change status). ## 4. Our benchmarks We benchmarked the computation of the hash tree root of the beacon state at slot [1605741](https://beaconcha.in/block/1605741) on mainnet. This block ocurred on Jul 12, 2021, 9:28 AM and serializes to 28MB. The tests were carried on an AMD Ryzen 5 3600 6-Core Processor. The implementation using the idea of Section 2 can be found in https://github.com/potuz/mammon. The test consists simply on loading the ssz file from disk, deserialize into a BeaconState object and time the computation of the hash_tree_root. We include here the results of the tests from - [Prysm](https://github.com/prysmaticlabs/prysm): it is based on [fastssz](https://github.com/ferranbt/fastssz) which in turn is based on [sha256-simd](https://github.com/minio/sha256-simd) which is a fork from [intel-ipsec-mb](https://github.com/intel/intel-ipsec-mb). The hashing algorithm ultimately used is that of [supranational's blst](https://github.com/supranational/blst/blob/master/src/sha256.h) - [Nimbus](https://github.com/status-im/nimbus-eth2): hashing ultimately done by blst as above. - [Lighhouse](https://github.com/sigp/lighthouse): ~~hashing ultimately done by blst as above.~~ @sproul tells me that they just switched libraries and are using the sha2 Rust crate. We include three different results for [mammon's implementation](https:://github.com/potuz/mammon) 1. OpenSSL's hashing algorithm (no hardcoding of the padding block) 2. 1x SHA-ni implementation with hardcoded padding block. 3. 2x SHA-ni implementation with hardcoded padding block. The three implementations use Intel's original assembly with [SHA Extensions](https://software.intel.com/content/www/us/en/develop/articles/intel-sha-extensions.html). The CPU used supports native SHA Extensions, and OpenSSL is compiled with this support. The second implementation consists of the same Intel code, optimized for 64 bytes by hardcoding the padding block. The third implementation in addition uses SSE instructions to compute the scheduling of two blocks at a time. The results are summarized in the following table: | implementation | time to hash | | :------------- | :------------- | | Prysm | 292ms | | Nimbus | 251ms | | Lighthouse | 718ms | | Mammon OpenSSL | 270ms | | Mammon 1x SHA-ni | 189ms | | Mammon 2x SHA-ni | 161ms | In all cases, the assembly code goes back to Intel's original implementation with minor changes so the differences can only be traced back to either the Merkleization algorithm (language calls overhead and the such may enter here) or the hardcoding of the padding block in the SHA algorithm. The 30% difference between Mammon's OpenSSL implementation and the 1x hardcoded one, can only be attributed to the hardcoding of the padding block as the Merkleization algorithm is invariant. The total time saving by using basic SSE vectorization techniques went to 40%. The results of Prysm, Nimbus and Mammon without hardcoding the padding block are at par (with variation less than 20% between top and bottom), supporting the idea that little can be done to optimize the Merkleization algorithm nor the SHA implementation that has years of research into Intel's code. ~~The results from Lighthouse's benchmark is unexpected and unexplained at this time~~ (see Section 7). We remark again that these numbers do not reflect at all the timings of real-life implementation where cache-techniques are dominant. There is nothing that can be inferred from this experiment other than unequivocally pointing that hardcoding the padding block into the hashing algorithm of all clients (presumably supranational's blst) would speed up hashing by at least 30% regardless of implementation details. We did not benchmark Teku due to the complications of benchmarking JVM. I will be happy to include Teku's results if someone designs the benchmark. ## 5. Conclusions - We proved that hardcoding the padding block in the SHA256 algorithm carries substantial improvements regardless of implementation details. - We provided a working implementation of Serialization / Merkleization that implements the above ideas. - We did not benchmark AVX2 nor AVX512 implementations instead of SHA-ni implementations. The gain by hardcoding the padding block's scheduled words should be present in this situation as well as it is independent of implementation. ## 6. Acknowledgments: I am indebted to @nisdas from prysmaticlabs, @arnetheduck from Nimbus team, @pawan and @sproul from sigma prime and @Adrian Sutton from Pegasys for helping me benchmark (or showing me why it is diffucult in the last case) their respective implementations. I am particularly indebted to @proto for his generosity with his own time. ## 7. Update: We implemented assembly code for AVX2 (8 blocks at a time), AVX (4 blocks at a time) and SSE (1 block at a time) and ran benchmarks on the same state on an Intel laptop capable of AVX2 but without SHA extensions:  Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz  The results are sumarized in the following table: |implementation | time to hash | |:------------- | :----------- | |Lighthouse | 998ms | | Prysm | 1085ms | | Nimbus | 1112ms | | Mammon | 654ms | Notice Mammon's gain over the consensus clients is roughly the same as with SHA extensions, that is, hardcoding the padding block results in 40% decrease of the hashing time. The assembly for these implementations and the above one are based on [intel-ipsec-mb](https://github.com/intel/intel-ipsec-mb) and are available in Mammon's repo https://github.com/potuz/mammon ### Lighthouse: This also explains the poor results of Lighthouse in the previous benchmark of Section 4. Lighthouse only uses SHA extensions in the experimental branch. In fact running the benchmark of Section 4 with Lighthouse's expermiental branch, we get 319ms instead of the reported 718ms. That is, Lighthouse, without using any paralellization (their most efficient code uses paralellization, which we didn't benchmark to avoid misleading differences here), is at par with the remaining clients. The poor performance without SHA extensions is actually due to AMD's poor performance in AVX2 operations: the same benchmark of this section 7, carried on the Ryzen 5 CPU of section 4, leads to 760ms for Mammon's performance!, that is, this Desktop is slower than the older laptop to hash without the SHA extensions. And Lighthouse's algorithm, without hardcoding the padding block, hashes faster than Mammon's algorithm (we assume that this is due to the heavy usage of 256bits vectors on Mammon's algorithm)