# Crushing Prysm's Hasher

Created: November 19, 2021 8:57 PM
Last Edited Time: November 20, 2021 2:45 PM
Type: Technical Design

# Introduction

A detailed analysis of the Merkleization algorithm used by both Prysm's custom hasher and fastssz's revealed two crucial flaws:

1. The low-level SHA256 algorithm was not hardcoding the padding block.
2. The high-level algorithm would make repeated calls to the low-level library, `N` times to hash `2` chunks, instead of one call to hash `2N` chunks.

The first flaw is independent of Prysm, and every implementation of Merkle trees in Ethereum clients was subject to it. It was documented in detail in [this document](https://hackmd.io/hOZnkh50RKehsOyf0b3CdQ). We briefly describe this problem below, but in this note we will mainly focus on 2. and its implications for Prysm's code and its interaction with fastssz.

## Benchmarks

The following benchmarks were carried out on the develop branch of Prysm versus a naive implementation of a Merkleization algorithm that does not suffer from 1) and 2) above. The test consists of simply hashing a list of 400,000 random `uint64` values. The results are striking:

```
goos: linux
goarch: amd64
cpu: AMD Ryzen 5 3600 6-Core Processor
BenchmarkHashBalanceShani-12          160    7629704 ns/op
BenchmarkHashBalanceShaniPrysm-12      15   74012328 ns/op
PASS

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
BenchmarkHashBalanceAVX-4              68   26677965 ns/op
BenchmarkHashBalancePrysm-4             7  165434686 ns/op
PASS

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
BenchmarkHashBalanceAVX2-4            121    9711482 ns/op
BenchmarkHashBalancePrysm-4            10  103716714 ns/op
PASS
```

# The inner workings of SHA256

## The padding block

From [Intel's description of the SHA256 hashing algorithm](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sha-extensions.html):

> The process of SHA to calculate the message digest has two phases. First is the preprocessing of the message to pad it out to a 64 byte multiple with the length of the message embedded in the last 8 bytes. The message is then split into 64 byte blocks to be processed in the next phase. The second phase is the hash computation, which has two main components itself. One is the message schedule which takes the 64 byte block and expands it into 32-bit dwords to be processed per round, and the other is the absorption of a given rounds message dword into the working variables.

From the description above, suppose your message is `length` bytes long. If `length % 64` is bigger than 55, then your message will require an extra 64-byte block just to encode the message length as a big-endian `uint64`.

One of the main applications of hashing libraries is to provide a quick way of validating the contents of a large message or buffer: you compute its hash and check it against a given 32-byte hash from a trusted source. Imagine you want to check that you have downloaded the correct version of geth from a torrent, a file several megabytes in size. The extra cost of possibly hashing one extra 64-byte block is minimal. On the other end of the spectrum, imagine that you only need to hash 64 bytes. In this case you need to process two blocks: the first block is your message, and the second block simply contains the padding and the length, both of which are known beforehand, since the padded message is always 128 bytes (64 bytes of data plus the padding block). That is, the second block is always the same: the byte `0x80`, followed by zeros, with the message length (512 bits) encoded as a big-endian `uint64` in the last 8 bytes.
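To make this concrete, here is a minimal, self-contained Go sketch (illustrative only, not Prysm code) that spells out this constant padding block; an optimized hasher precomputes the message schedule of this block once and reuses it on every 64-byte hash:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// paddingBlock returns the second block that SHA256 processes for any
// 64-byte message: the byte 0x80, then zeros, with the message length
// in bits (512) as a big-endian uint64 in the last 8 bytes.
func paddingBlock() [64]byte {
	var pad [64]byte
	pad[0] = 0x80
	binary.BigEndian.PutUint64(pad[56:], 512) // 64 bytes = 512 bits
	return pad
}

func main() {
	// The padding block is identical for every 64-byte message, so its
	// message schedule never needs to be recomputed.
	fmt.Printf("padding block: %x\n", paddingBlock())
	fmt.Printf("digest of 64 zero bytes: %x\n", sha256.Sum256(make([]byte, 64)))
}
```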
This brings us to the second part of the above quote: there are two main components to the hashing algorithm, the scheduling and the actual computation rounds. Without getting into the details of how these components work, it suffices to say that scheduling can be done in parallel: it depends on the message block itself and not on previous work done. The message rounds, on the other hand, depend on previously processed words. Since we already know in advance that half of our message consists of just the padding block, we can save its scheduled part, precomputed, and feed it to the message rounds at the right time. This procedure saves about 35%-40% of the total hashing time for a 128-byte message.

**When we compute the hash tree root of a Merkle tree, we repeatedly hash 64-byte blocks.** For example, to compute the hash tree root of the validator registry or the list of RANDAO reveals, we currently perform a little over 500,000 hashes of 64-byte blocks. That is, on each such call, we save the scheduling of the padding block by hardcoding it.

## SIMD and Vectorization

SIMD stands for "Single Instruction, Multiple Data"; the [Wikipedia article](https://en.wikipedia.org/wiki/SIMD) does a good job of describing it, so we won't go into much detail. For the purpose of this short document, it suffices to say that it allows a modern processor supporting 64-, 128-, 256- or 512-bit registers to operate on 2, 4, 8 or 16 double words (32 bits each) at a time respectively, performing the same instruction on each one of them. Every modern computer supports some form of vectorization; realistically, going back to 2004, we can assume that every computer supports SSE3. In practice, almost all modern computers support AVX2 (256-bit registers), and newer Intel models support AVX512.

**How does this affect hashing?** Different subtrees of a Merkle tree are completely independent when computing the hash tree root, so we could be processing 2, 4, 8 or 16 blocks at a time on each call to the low-level hashing algorithm, if our CPU supports it.

## SHA-ni Extensions

Due to the importance of hashing, some modern CPUs, starting from the Ryzen series at AMD and newer Intel models, support direct machine instructions that perform the scheduling and the message rounds for SHA algorithms. Even though these instructions can only handle 2 blocks in parallel, they perform much faster than vectorized implementations.

# Prysm's current implementation

Prysm currently has two different pathways to compute the hash tree root of an SSZ object: either relying on fastssz to do it, or using a custom hasher in the case of the `BeaconState`. Either way, in the case where hashing has the most impact, namely validator-registry-sized lists, we rely on fastssz. Take for example the call to [hash the balance list](https://github.com/prysmaticlabs/prysm/blob/develop/beacon-chain/state/v1/state_trie.go#L322):

```go
	return b.recomputeFieldTrie(validators, b.state.Validators)
case balances:
	return stateutil.Uint64ListRootWithRegistryLimit(b.state.Balances)
case randaoMixes:
```

```go
// Uint64ListRootWithRegistryLimit computes the HashTreeRoot Merkleization of
// a list of uint64 and mixed with registry limit.
func Uint64ListRootWithRegistryLimit(balances []uint64) ([32]byte, error) {
	hasher := hash.CustomSHA256Hasher() // <-- (1)
	balancesMarshaling := make([][]byte, 0)
	for i := 0; i < len(balances); i++ {
		balanceBuf := make([]byte, 8)
		binary.LittleEndian.PutUint64(balanceBuf, balances[i])
		balancesMarshaling = append(balancesMarshaling, balanceBuf)
	}
	balancesChunks, err := ssz.Pack(balancesMarshaling)
	if err != nil {
		return [32]byte{}, errors.Wrap(err, "could not pack balances into chunks")
	}
	maxBalCap := params.BeaconConfig().ValidatorRegistryLimit
	elemSize := uint64(8)
	balLimit := (maxBalCap*elemSize + 31) / 32
	if balLimit == 0 {
		if len(balances) == 0 {
			balLimit = 1
		} else {
			balLimit = uint64(len(balances))
		}
	}
	balancesRootsRoot, err := ssz.BitwiseMerkleize(hasher, balancesChunks, uint64(len(balancesChunks)), balLimit) // <-- (2)
	if err != nil {
		return [32]byte{}, errors.Wrap(err, "could not compute balances merkleization")
	}
	balancesLengthRoot := make([]byte, 32)
	binary.LittleEndian.PutUint64(balancesLengthRoot, uint64(len(balances)))
	return ssz.MixInLength(balancesRootsRoot, balancesLengthRoot), nil
}
```

Two lines are marked in the snippet above: in (1) we obtain `hasher`, a custom hasher that is then passed to fastssz in (2). Here is its declaration, with its signature:

```go
func CustomSHA256Hasher() func([]byte) [32]byte
```

We should already suspect that something is wrong here: we only get one root out of this function! There is no way that `BitwiseMerkleize` can process several blocks in parallel with a function of that signature. Indeed, if we track down the implementation of that function, we are eventually led to [this loop](https://github.com/prysmaticlabs/prysm/blob/develop/encoding/ssz/merkleize.go#L93-L108):

```go
for j = 0; ; j++ {
	// stop merging when we are in the left side of the next combi
	if i&(uint64(1)<<j) == 0 {
		// if we are at the count, we want to merge in zero-hashes for padding
		if i == count && j < depth {
			v := hasher.Combi(hArr, trie.ZeroHashes[j])
			copy(h, v[:])
		} else {
			break
		}
	} else {
		// keep merging up if we are the right side
		v := hasher.Combi(tmp[j], hArr)
		copy(h, v[:])
	}
}
```

And we see that the hasher is called with two 32-byte chunks, forming one 64-byte block per call. In fact, the signature of that `hasher.Combi` function is

```go
Combi(a [32]byte, b [32]byte) [32]byte
```

**It does not matter what low-level implementation of SHA256 we use: this method can never use vectorization.**

# A proposal for a solution

We propose a solution in two parts, which are independent: 1) can be solved in the immediate term, and 2) in the short-to-medium term.

## Solving 1

In order to solve 1 we need our hasher to have a different signature than either `Combi` or `CustomSHA256Hasher` above; we need something like

```go
Hasher(dst [][32]byte, chunks [][32]byte) error
```

This function takes a list `chunks` of 32-byte chunks, hashes consecutive pairs of them (each pair forming a 64-byte block), and writes the resulting roots to `dst`. As such, this function should check that `dst` is properly allocated, with half as many entries as `chunks`, or allocate it correctly itself.

Before producing an explicit algorithm using this signature, let us describe it in words, as this is more enlightening. We are given a list of, say, 260,000 `uint64` values, corresponding to the balances of all current validators. We encode each `uint64` into 8 bytes as little endian and then pack them together into 32-byte chunks. All in all, we have 32,500 blocks of 64 bytes.
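Concretely, this packing step might look like the following minimal sketch (the helper name is hypothetical, not Prysm's code):

```go
import "encoding/binary"

// packUint64s encodes each value as 8 little-endian bytes and packs
// them four to a 32-byte chunk, zero-padding the last chunk. For
// 260,000 balances this yields 65,000 chunks, i.e. 32,500 64-byte blocks.
func packUint64s(values []uint64) [][32]byte {
	chunks := make([][32]byte, (len(values)+3)/4)
	for i, v := range values {
		binary.LittleEndian.PutUint64(chunks[i/4][(i%4)*8:], v)
	}
	return chunks
}
```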
We are supposed to construct a tree with these 32,500 pairs of sibling leaves at the bottom, and then add an extra 137,438,920,972 imaginary pairs that are zeroed; let us not worry about this implementation detail. The point is that for the first level up the tree we need to hash our 32,500 pairs, and we can do so in parallel. So if our CPU supports AVX2, we can hash 8 pairs at a time. In the next layer we will have 16,250 hashes to perform, and these we can still do 8 at a time. At this rate, we will perform approximately 65,000 hashes at a rate of 8 per round, while Prysm's algorithm would do one pair at a time!

[Here's a complete algorithm:](https://github.com/prysmaticlabs/prysm/blob/custom_hasher/beacon-chain/state/stateutil/validator_root.go#L175-L232)

```go
func merkleizeFlatArray(vec [][32]byte, depth uint8,
	hasher func([][32]byte, [][32]byte, uint64), // <-- (1)
	zero_hash_array [][32]byte) [32]byte {
	if depth == 0 && len(vec) == 1 {
		return vec[0]
	}
	if len(vec) == 0 {
		panic("Can't have empty vec")
	}
	// allocate size for the buffer
	layer := (len(vec) + 1) / 2
	length := 0
	for {
		length += layer - 1
		if layer == 1 {
			break
		}
		layer = (layer + 1) / 2
	}
	length += int(depth)
	hash_tree := make([][32]byte, length)
	first := uint64(0)
	height := uint8(1)
	last := uint64(len(vec)+1) / 2
	if len(vec) > 1 {
		hasher(hash_tree, vec, last)
	}
	if len(vec)%2 == 1 {
		hash_tree[last-1] = hash.Hash2ChunksShani(vec[len(vec)-1], zero_hash_array[0])
	}
	for {
		dist := last - first
		if dist < 2 {
			break
		}
		hasher(hash_tree[last:], hash_tree[first:], dist/2) // <-- (2)
		first = last
		last += (dist + 1) / 2
		if dist%2 != 0 {
			hash_tree[last-1] = hash.Hash2ChunksShani(hash_tree[first-1], zero_hash_array[height])
		}
		height++
	}
	for {
		if height >= depth {
			break
		}
		hash_tree[last] = hash.Hash2ChunksShani(hash_tree[last-1], zero_hash_array[height])
		last++
		height++
	}
	return hash_tree[last-1]
}
```

The relevant lines have been marked (1) and (2). The hasher has the signature described above (taking 32-byte chunks rather than 64-byte blocks, but this is irrelevant, as described below). This is a flat-array algorithm, described in more detail in [this document](https://hackmd.io/ylcpJV11ReuldNXge370qQ).

## Solving 2

Unfortunately there is no easy way to solve 2 without convincing some serious people to maintain a hashing library, or writing one ourselves. As an insecure starting point, assembly implementing the hardcoded padding block has been posted in [this branch](https://github.com/prysmaticlabs/prysm/tree/custom_hasher/crypto/hash/custom_hasher/assembly). Currently we have implementations for SSE3 (1 block at a time), AVX (4 blocks at a time), AVX2 (8 blocks at a time) and the SHA-ni extensions (2 and 1 blocks at a time). We are missing ARM assembly and AVX512. Most importantly, we are missing hardening the assembly, which is no easy task. One of the toughest points is stack control: all of these instructions rely on correct stack alignment. Currently we achieve this by first exporting the symbols to a C library that Go can then import into Prysm.
The signatures of those functions are:

```c
#ifndef CUSTOM_HASHER
#define CUSTOM_HASHER
#include <stdint.h>

extern void sha256_4_avx(unsigned char* output, const unsigned char* input, uint64_t blocks);
extern void sha256_8_avx2(unsigned char* output, const unsigned char* input, uint64_t blocks);
extern void sha256_shani(unsigned char* output, const unsigned char* input, uint64_t blocks);
extern void sha256_1_avx(unsigned char* output, const unsigned char* input);
#endif
```

That is, these functions take a pointer to the destination buffer, a pointer to the input chunks to be hashed, and a count representing the number of blocks to hash. There are no bounds checks in these functions: it is up to the caller to have its stack aligned and to have the buffers allocated correctly beforehand. The bindings for these functions can take the following form, so as to make sure that the buffers are correctly allocated:

```go
// #include "hasher.h"
import "C"

func HasherAVX2(dst [][32]byte, chunks [][32]byte) error {
	if len(dst)*2 != len(chunks) {
		return errors.New("incorrect length for destination buffer")
	}
	C.sha256_8_avx2((*C.uchar)(&dst[0][0]), (*C.uchar)(&chunks[0][0]), C.ulong(len(dst)))
	return nil
}
```

We see here that we are passing pointers to our underlying assembly functions, so the size of the chunks is irrelevant: we use `[32]byte` in this example, but the signature could just as easily take `[]byte`.

# Issues and future work

Concretely, I propose:

- Implementing the above signature for the custom hasher immediately, in all fastssz-independent code. This is entirely within Go (a portable fallback sketch is given at the end of this note).
- Implementing this algorithm in Kasey's ssz library in the near future.
- Adjusting our data structures to use flat arrays instead of triple-leaved trees. This makes passing pointers to the low-level functions much easier, and it optimizes jumps. The above snippet, modulo minor obvious micro-optimizations, is already highly optimized.
- Implementing our own assembly and using it to gain a ~2x performance boost from the hardcoded padding block.

This last point is tricky for a number of reasons. Besides the nonexistent AVX512 and ARM code, here are a few of the points that have to be addressed:

- Register clobbering is different on Linux and Windows; I did not care about Windows when I wrote the above. It can be fixed easily, but it requires a lot of testing.
- Making it portable may result in a nightmare with Bazel. A good assembler is needed, and this could further delay Apple's M1 support.
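Until the assembly back ends are hardened and ported, a portable pure-Go fallback with the proposed signature keeps the new API usable everywhere. Here is a minimal sketch (the name `HasherGeneric` is hypothetical, not part of the branch above), which also provides a baseline to test the vectorized implementations against:

```go
import (
	"crypto/sha256"
	"errors"
)

// HasherGeneric hashes consecutive pairs of 32-byte chunks, one 64-byte
// block at a time, writing each resulting root to dst. It is a scalar
// stand-in for the vectorized back ends on unsupported CPUs.
func HasherGeneric(dst [][32]byte, chunks [][32]byte) error {
	if len(chunks)%2 != 0 || len(dst)*2 != len(chunks) {
		return errors.New("incorrect length for destination buffer")
	}
	var block [64]byte
	for i := 0; i < len(chunks)/2; i++ {
		copy(block[:32], chunks[2*i][:])
		copy(block[32:], chunks[2*i+1][:])
		dst[i] = sha256.Sum256(block[:])
	}
	return nil
}
```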
