# Crushing Prysm's Hasher

Created: November 19, 2021 8:57 PM
Last Edited Time: November 20, 2021 2:45 PM
Type: Technical Design

# Introduction

A detailed analysis of the Merkleization algorithm used by both Prysm's custom hasher and fastssz's revealed two crucial flaws:

1. The low-level SHA256 algorithm was not hardcoding the padding block.
2. The high-level algorithm would make repeated calls to the low-level library, `N` times to hash `2` chunks, instead of one call to hash `2N` chunks.

The first flaw is independent of Prysm, and every implementation of Merkle trees in Ethereum clients was subject to it. It was documented in detail in [this document](https://hackmd.io/hOZnkh50RKehsOyf0b3CdQ). We briefly describe this problem below, but in this note we will mainly focus on 2. and its implications for Prysm's code and its interaction with fastssz.

## Benchmarks

The following benchmarks were carried out on the develop branch of Prysm versus a naive implementation of a Merkleization algorithm that does not suffer from 1) and 2) above. The test consists of simply hashing a list of 400,000 random `uint64` values. The results are striking:

```
goos: linux
goarch: amd64
cpu: AMD Ryzen 5 3600 6-Core Processor
BenchmarkHashBalanceShani-12          160    7629704 ns/op
BenchmarkHashBalanceShaniPrysm-12      15   74012328 ns/op
PASS

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
BenchmarkHashBalanceAVX-4              68   26677965 ns/op
BenchmarkHashBalancePrysm-4             7  165434686 ns/op
PASS

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
BenchmarkHashBalanceAVX2-4            121    9711482 ns/op
BenchmarkHashBalancePrysm-4            10  103716714 ns/op
PASS
```

# The inner workings of SHA256

## The padding block

From [Intel's description of the SHA256 hashing algorithm](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sha-extensions.html):

> The process of SHA to calculate the message digest has two phases. First is the preprocessing of the message to pad it out to a 64 byte multiple with the length of the message embedded in the last 8 bytes. The message is then split into 64 byte blocks to be processed in the next phase. The second phase is the hash computation, which has two main components itself. One is the message schedule which takes the 64 byte block and expands it into 32-bit dwords to be processed per round, and the other is the absorption of a given rounds message dword into the working variables.

From the description above, suppose your message is `length` bytes long. If `length % 64` is bigger than 55, then your message will require an extra 64-byte block just to encode the message length as a big-endian `uint64`.

One of the main applications of hashing libraries is to provide a quick way of validating the contents of a large message or buffer: you compute its hash and check it against a given 32-byte hash from a trusted source. Imagine you want to check that you have downloaded the correct version of geth from a torrent, a file several megabytes in size. The extra cost of possibly hashing one extra 64-byte block is minimal. On the other end of the spectrum, imagine that you only need to hash 64 bytes. In this case you need to process two blocks: the first block is your message, and the second block simply contains the padding and the length, both of which are known beforehand, since the padded message is always 128 bytes (64 bytes of data plus the padding block). That is, the second block is always the same: the byte `0x80`, followed by zeros, with the message length (512 bits) encoded as a big-endian `uint64` in the last 8 bytes.
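To make this concrete, here is a minimal, self-contained Go sketch (illustrative only, not Prysm code) that spells out this constant padding block; an optimized hasher precomputes the message schedule of this block once and reuses it on every 64-byte hash:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// paddingBlock returns the second block that SHA256 processes for any
// 64-byte message: the byte 0x80, then zeros, with the message length
// in bits (512) as a big-endian uint64 in the last 8 bytes.
func paddingBlock() [64]byte {
	var pad [64]byte
	pad[0] = 0x80
	binary.BigEndian.PutUint64(pad[56:], 512) // 64 bytes = 512 bits
	return pad
}

func main() {
	// The padding block is identical for every 64-byte message, so its
	// message schedule never needs to be recomputed.
	fmt.Printf("padding block: %x\n", paddingBlock())
	fmt.Printf("digest of 64 zero bytes: %x\n", sha256.Sum256(make([]byte, 64)))
}
```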
This brings us to the second part of the above quote: there are two main components to the hashing algorithm, the scheduling and the actual computation rounds. Without getting into the details of how these components work, it suffices to say that scheduling can be done in parallel: it depends on the message block itself and not on previous work done. The message rounds, on the other hand, depend on previously processed words. Since we already know in advance that half of our message consists of just the padding block, we can save its scheduled part, precomputed, and feed it to the message rounds at the right time. This procedure saves about 35%-40% of the total hashing time for a 128-byte message.

**When we compute the hash tree root of a Merkle tree, we repeatedly hash 64-byte blocks.** For example, to compute the hash tree root of the validator registry or the list of RANDAO reveals, we currently perform a little over 500,000 hashes of 64-byte blocks. That is, on each such call, we save the scheduling of the padding block by hardcoding it.

## SIMD and Vectorization

SIMD stands for "Single Instruction, Multiple Data"; the [Wikipedia article](https://en.wikipedia.org/wiki/SIMD) does a good job of describing it, so we won't go into much detail. For the purpose of this short document, it suffices to say that it allows a modern processor supporting 64-, 128-, 256- or 512-bit registers to operate on 2, 4, 8 or 16 double words (32 bits each) at a time respectively, performing the same instruction on each one of them. Every modern computer supports some form of vectorization; realistically, going back to 2004, we can assume that every computer supports SSE3. In practice, almost all modern computers support AVX2 (256-bit registers), and newer Intel models support AVX512.

**How does this affect hashing?** Different subtrees of a Merkle tree are completely independent when computing the hash tree root, so we could be processing 2, 4, 8 or 16 blocks at a time on each call to the low-level hashing algorithm, if our CPU supports it.

## SHA-ni Extensions

Due to the importance of hashing, some modern CPUs, starting from the Ryzen series at AMD and newer Intel models, support direct machine instructions that perform the scheduling and the message rounds for SHA algorithms. Even though these instructions can only handle 2 blocks in parallel, they perform much faster than vectorized implementations.

# Prysm's current implementation

Prysm currently has two different pathways to compute the hash tree root of an SSZ object: either relying on fastssz to do it, or using a custom hasher in the case of the `BeaconState`. Either way, in the case where hashing has the most impact, namely validator-registry-sized lists, we rely on fastssz. Take for example the call to [hash the balance list](https://github.com/prysmaticlabs/prysm/blob/develop/beacon-chain/state/v1/state_trie.go#L322):

```go
	return b.recomputeFieldTrie(validators, b.state.Validators)
case balances:
	return stateutil.Uint64ListRootWithRegistryLimit(b.state.Balances)
case randaoMixes:
```

```go
// Uint64ListRootWithRegistryLimit computes the HashTreeRoot Merkleization of
// a list of uint64 and mixed with registry limit.
func Uint64ListRootWithRegistryLimit(balances []uint64) ([32]byte, error) {
	hasher := hash.CustomSHA256Hasher() // <-- (1)
	balancesMarshaling := make([][]byte, 0)
	for i := 0; i < len(balances); i++ {
		balanceBuf := make([]byte, 8)
		binary.LittleEndian.PutUint64(balanceBuf, balances[i])
		balancesMarshaling = append(balancesMarshaling, balanceBuf)
	}
	balancesChunks, err := ssz.Pack(balancesMarshaling)
	if err != nil {
		return [32]byte{}, errors.Wrap(err, "could not pack balances into chunks")
	}
	maxBalCap := params.BeaconConfig().ValidatorRegistryLimit
	elemSize := uint64(8)
	balLimit := (maxBalCap*elemSize + 31) / 32
	if balLimit == 0 {
		if len(balances) == 0 {
			balLimit = 1
		} else {
			balLimit = uint64(len(balances))
		}
	}
	balancesRootsRoot, err := ssz.BitwiseMerkleize(hasher, balancesChunks, uint64(len(balancesChunks)), balLimit) // <-- (2)
	if err != nil {
		return [32]byte{}, errors.Wrap(err, "could not compute balances merkleization")
	}
	balancesLengthRoot := make([]byte, 32)
	binary.LittleEndian.PutUint64(balancesLengthRoot, uint64(len(balances)))
	return ssz.MixInLength(balancesRootsRoot, balancesLengthRoot), nil
}
```

Two lines are marked in the snippet above: in (1) we obtain `hasher`, a custom hasher that is then passed to fastssz in (2). Here is its declaration, with its signature:

```go
func CustomSHA256Hasher() func([]byte) [32]byte
```

We should already suspect that something is wrong here: we only get one root out of this function! There is no way that `BitwiseMerkleize` can process several blocks in parallel with a function of that signature. Indeed, if we track down the implementation of that function, we are eventually led to [this loop](https://github.com/prysmaticlabs/prysm/blob/develop/encoding/ssz/merkleize.go#L93-L108):

```go
for j = 0; ; j++ {
	// stop merging when we are in the left side of the next combi
	if i&(uint64(1)<<j) == 0 {
		// if we are at the count, we want to merge in zero-hashes for padding
		if i == count && j < depth {
			v := hasher.Combi(hArr, trie.ZeroHashes[j])
			copy(h, v[:])
		} else {
			break
		}
	} else {
		// keep merging up if we are the right side
		v := hasher.Combi(tmp[j], hArr)
		copy(h, v[:])
	}
}
```

And we see that the hasher is called with two 32-byte chunks, forming one 64-byte block per call. In fact, the signature of that `hasher.Combi` function is

```go
Combi(a [32]byte, b [32]byte) [32]byte
```

**It does not matter what low-level implementation of SHA256 we use: this method can never use vectorization.**

# A proposal for a solution

We propose a solution in two parts, which are independent: 1) can be solved in the immediate term, and 2) in the short-to-medium term.

## Solving 1

In order to solve 1 we need our hasher to have a different signature than either `Combi` or `CustomSHA256Hasher` above; we need something like

```go
Hasher(dst [][32]byte, chunks [][32]byte) error
```

This function takes a list `chunks` of 32-byte chunks, hashes consecutive pairs of them (each pair forming a 64-byte block), and writes the resulting roots to `dst`. As such, this function should check that `dst` is properly allocated, with half as many entries as `chunks`, or allocate it correctly itself.

Before producing an explicit algorithm using this signature, let us describe it in words, as this is more enlightening. We are given a list of, say, 260,000 `uint64` values, corresponding to the balances of all current validators. We encode each `uint64` into 8 bytes as little endian and then pack them together into 32-byte chunks. All in all, we have 32,500 blocks of 64 bytes.
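Concretely, this packing step might look like the following minimal sketch (the helper name is hypothetical, not Prysm's code):

```go
import "encoding/binary"

// packUint64s encodes each value as 8 little-endian bytes and packs
// them four to a 32-byte chunk, zero-padding the last chunk. For
// 260,000 balances this yields 65,000 chunks, i.e. 32,500 64-byte blocks.
func packUint64s(values []uint64) [][32]byte {
	chunks := make([][32]byte, (len(values)+3)/4)
	for i, v := range values {
		binary.LittleEndian.PutUint64(chunks[i/4][(i%4)*8:], v)
	}
	return chunks
}
```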
We are supposed to construct a tree with these 32,500 pairs of sibling leaves at the bottom, and then add an extra 137,438,920,972 imaginary pairs that are zeroed; let us not worry about this implementation detail. The point is that for the first level up the tree we need to hash our 32,500 pairs, and we can do so in parallel. So if our CPU supports AVX2, we can hash 8 pairs at a time. In the next layer we will have 16,250 hashes to perform, and these we can still do 8 at a time. At this rate, we will perform approximately 65,000 hashes at a rate of 8 per round, while Prysm's algorithm would do one pair at a time!

[Here's a complete algorithm:](https://github.com/prysmaticlabs/prysm/blob/custom_hasher/beacon-chain/state/stateutil/validator_root.go#L175-L232)

```go
func merkleizeFlatArray(vec [][32]byte, depth uint8,
	hasher func([][32]byte, [][32]byte, uint64), // <-- (1)
	zero_hash_array [][32]byte) [32]byte {
	if depth == 0 && len(vec) == 1 {
		return vec[0]
	}
	if len(vec) == 0 {
		panic("Can't have empty vec")
	}
	// allocate size for the buffer
	layer := (len(vec) + 1) / 2
	length := 0
	for {
		length += layer - 1
		if layer == 1 {
			break
		}
		layer = (layer + 1) / 2
	}
	length += int(depth)
	hash_tree := make([][32]byte, length)
	first := uint64(0)
	height := uint8(1)
	last := uint64(len(vec)+1) / 2
	if len(vec) > 1 {
		hasher(hash_tree, vec, last)
	}
	if len(vec)%2 == 1 {
		hash_tree[last-1] = hash.Hash2ChunksShani(vec[len(vec)-1], zero_hash_array[0])
	}
	for {
		dist := last - first
		if dist < 2 {
			break
		}
		hasher(hash_tree[last:], hash_tree[first:], dist/2) // <-- (2)
		first = last
		last += (dist + 1) / 2
		if dist%2 != 0 {
			hash_tree[last-1] = hash.Hash2ChunksShani(hash_tree[first-1], zero_hash_array[height])
		}
		height++
	}
	for {
		if height >= depth {
			break
		}
		hash_tree[last] = hash.Hash2ChunksShani(hash_tree[last-1], zero_hash_array[height])
		last++
		height++
	}
	return hash_tree[last-1]
}
```

The relevant lines have been marked (1) and (2). The hasher has the signature described above (taking 32-byte chunks rather than 64-byte blocks, but this is irrelevant, as described below). This is a flat-array algorithm, described in more detail in [this document](https://hackmd.io/ylcpJV11ReuldNXge370qQ).

## Solving 2

Unfortunately there is no easy way to solve 2 without convincing some serious people to maintain a hashing library, or writing one ourselves. As an insecure starting point, assembly implementing the hardcoded padding block has been posted in [this branch](https://github.com/prysmaticlabs/prysm/tree/custom_hasher/crypto/hash/custom_hasher/assembly). Currently we have implementations for SSE3 (1 block at a time), AVX (4 blocks at a time), AVX2 (8 blocks at a time) and the SHA-ni extensions (2 and 1 blocks at a time). We are missing ARM assembly and AVX512. Most importantly, we are missing hardening the assembly, which is no easy task. One of the toughest points is stack control: all of these instructions rely on correct stack alignment. Currently we achieve this by first exporting the symbols to a C library that Go can then import into Prysm.
The signatures of those functions are:

```c
#ifndef CUSTOM_HASHER
#define CUSTOM_HASHER
#include <stdint.h>

extern void sha256_4_avx(unsigned char* output, const unsigned char* input, uint64_t blocks);
extern void sha256_8_avx2(unsigned char* output, const unsigned char* input, uint64_t blocks);
extern void sha256_shani(unsigned char* output, const unsigned char* input, uint64_t blocks);
extern void sha256_1_avx(unsigned char* output, const unsigned char* input);
#endif
```

That is, these functions take a pointer to the destination buffer, a pointer to the input chunks to be hashed, and a count representing the number of blocks to hash. There are no bounds checks in these functions: it is up to the caller to have its stack aligned and to have the buffers allocated correctly beforehand. The bindings for these functions can take the following form, so as to make sure that the buffers are correctly allocated:

```go
// #include "hasher.h"
import "C"

func HasherAVX2(dst [][32]byte, chunks [][32]byte) error {
	if len(dst)*2 != len(chunks) {
		return errors.New("incorrect length for destination buffer")
	}
	C.sha256_8_avx2((*C.uchar)(&dst[0][0]), (*C.uchar)(&chunks[0][0]), C.ulong(len(dst)))
	return nil
}
```

We see here that we are passing pointers to our underlying assembly functions, so the size of the chunks is irrelevant: we use `[32]byte` in this example, but the signature could just as easily take `[]byte`.

# Issues and future work

Concretely, I propose:

- Implementing the above signature for the custom hasher immediately, in all fastssz-independent code. This is entirely within Go (a portable fallback sketch is given at the end of this note).
- Implementing this algorithm in Kasey's ssz library in the near future.
- Adjusting our data structures to use flat arrays instead of triple-leaved trees. This makes passing pointers to the low-level functions much easier, and it optimizes jumps. The above snippet, modulo minor obvious micro-optimizations, is already highly optimized.
- Implementing our own assembly and using it to gain a ~2x performance boost from the hardcoded padding block.

This last point is tricky for a number of reasons. Besides the nonexistent AVX512 and ARM code, here are a few of the points that have to be addressed:

- Register clobbering is different on Linux and Windows; I did not care about Windows when I wrote the above. It can be fixed easily, but it requires a lot of testing.
- Making it portable may result in a nightmare with Bazel. A good assembler is needed, and this could further delay Apple's M1 support.
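Until the assembly back ends are hardened and ported, a portable pure-Go fallback with the proposed signature keeps the new API usable everywhere. Here is a minimal sketch (the name `HasherGeneric` is hypothetical, not part of the branch above), which also provides a baseline to test the vectorized implementations against:

```go
import (
	"crypto/sha256"
	"errors"
)

// HasherGeneric hashes consecutive pairs of 32-byte chunks, one 64-byte
// block at a time, writing each resulting root to dst. It is a scalar
// stand-in for the vectorized back ends on unsupported CPUs.
func HasherGeneric(dst [][32]byte, chunks [][32]byte) error {
	if len(chunks)%2 != 0 || len(dst)*2 != len(chunks) {
		return errors.New("incorrect length for destination buffer")
	}
	var block [64]byte
	for i := 0; i < len(chunks)/2; i++ {
		copy(block[:32], chunks[2*i][:])
		copy(block[32:], chunks[2*i+1][:])
		dst[i] = sha256.Sum256(block[:])
	}
	return nil
}
```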
