# Improve BeaconState hashTreeRoot()
## 1. Description
As of June 2024, the Holesky validator set size is 1.73M and mainnet is 1.45M. We see that BeaconState's `hashTreeRoot()` takes up to 1.5s when preparing for the next epoch, and this will keep increasing in the upcoming months. Improving this function is critical for Lodestar, especially after Electra, where we'll likely see longer epoch transitions.
With the new SIMD implementation of sha256, we hope to improve this time further, see below for more details.
* ### 1.1 Current implementation
The majority of the `hashTreeRoot()` time is spent on validators, so we need to care about them the most.
Right now each validator is typed as a ContainerNodeStruct, which caches the validator value without a tree. When a validator is modified, we need to recreate the whole tree and compute its root. Note that the tree is not stored anywhere: we create it when we need it, and it is evicted by gc afterwards.
- Pros:
- It's super fast to read a validator field value
- It's much lighter to store a validator as a value compared to nodes, because an epoch field is tracked as 1 number in the value instead of 8 numbers in a Node
- Cons:
- Tree creation takes a lot of time
- Cannot cache hashed nodes
- Cannot batch hash
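The trade-off above could be sketched as follows. `ValueBackedNode` and `valueToTree` are hypothetical names for illustration, not the actual ssz API (which uses ContainerNodeStruct/BranchNodeStruct):

```typescript
// Sketch of a value-backed node: the validator is stored as a plain object,
// and a merkle tree is only materialized when the root is needed.
// Names below are illustrative, not the actual ssz API.
type Validator = {effectiveBalance: number; exitEpoch: number /* ... */};

class ValueBackedNode {
  constructor(readonly value: Validator) {}

  // Fast path: reads come straight from the cached value.
  get(field: keyof Validator): number {
    return this.value[field];
  }

  // Slow path: hashing recreates the whole subtree every time, and the
  // temporary nodes are garbage-collected afterwards.
  hashTreeRoot(valueToTree: (v: Validator) => Uint8Array): Uint8Array {
    return valueToTree(this.value); // O(tree size) per call
  }
}
```

This is why reads are cheap but every `hashTreeRoot()` call pays the full tree creation cost again.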
* ### 1.2 Benchmark
This is the benchmark result with a 200k validator beacon state where half of the validators are modified:
```
✓ BeaconState ViewDU recursive hash vc=200000 1.045194 ops/s 956.7598 ms/op - 13 runs 16.5 s
✓ BeaconState ViewDU recursive hash - commit step vc=200000 30.49162 ops/s 32.79589 ms/op - 12 runs 2.96 s
✓ BeaconState ViewDU validator tree creation vc=100000 4.024561 ops/s 248.4743 ms/op - 15 runs 6.76 s
```
This shows two things we need to improve:
- Find a way to batch hash
- Reduce validator tree creation time
* ### 1.3 SIMD sha256
There's a hashtree-js binding [here](https://github.com/ChainSafe/hashtree-js) which calls the great native hashtree implementation from Prysm, which promises to be 20x faster than the current version in Lodestar. I'm waiting for Mathew to try its performance on different cpus, but even if it's only 5x faster than the current version, that's already amazing to me.
Besides that, we also have a wasm implementation [here](https://github.com/ChainSafe/ssz/pull/367); the best it can do is 2x faster on a 2017 Intel Macbook.
The benchmark below shows that hashtree is almost 5x faster than as-sha256 on a Mac M1 cpu:
```
hasher
as-sha256
✓ hash 2 Uint8Array 500000 times - as-sha256 2.717292 ops/s 368.0135 ms/op - 12 runs 47.8 s
✓ hashTwoObjects 500000 times - as-sha256 2.864475 ops/s 349.1041 ms/op - 12 runs 45.4 s
- batchHashObjects - as-sha256
✓ executeHashComputations - as-sha256 31.25938 ops/s 31.99039 ms/op - 37 runs 3.19 s
hashtree
✓ hash 2 Uint8Array 500000 times - hashtree 5.551561 ops/s 180.1295 ms/op - 12 runs 23.4 s
✓ hashTwoObjects 500000 times - hashtree 6.362881 ops/s 157.1615 ms/op - 12 runs 20.4 s
- batchHashObjects - hashtree
✓ executeHashComputations - hashtree 149.3710 ops/s 6.694742 ms/op - 136 runs 8.26 s
```
## 2. Get HashComputations and execute it
### 2.1 persistent-merkle-tree
I created this data structure for the batch hash work; the field names are not really nice but good enough for a POC:
```typescript=
export type HashComputation = {
  src0: Node;
  src1: Node;
  dest: Node;
};
```
Benchmarks show that batch hash is less than 2x faster than regular hash, even though `hashtree.executeHashComputations()` is 5x faster than as-sha256. I suspect the `HashComputation` objects above cause gc to run more:
```
Node batchHash
✓ getHashComputations 250000 nodes 86.79037 ops/s 11.52202 ms/op - 32 runs 28.0 s
✓ batchHash 250000 nodes 19.13791 ops/s 52.25231 ms/op - 29 runs 22.2 s
✓ get root 250000 nodes 10.25390 ops/s 97.52391 ms/op - 18 runs 14.1 s
✓ getHashComputations 500000 nodes 38.27039 ops/s 26.12986 ms/op - 34 runs 57.2 s
✓ batchHash 500000 nodes 8.678133 ops/s 115.2322 ms/op - 27 runs 38.3 s
✓ get root 500000 nodes 5.180939 ops/s 193.0152 ms/op - 12 runs 18.8 s
```
`getHashComputations()` takes 11% - 13% of the time; it basically does one traversal from the root to group HashComputations by level, then executes them from the bottom level up. There's an idea to improve `getHashComputations()` as below.
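The traverse-group-execute flow could be sketched like this; the `Node` shape and function bodies below are simplified stand-ins for the persistent-merkle-tree implementation, not the actual code:

```typescript
// Sketch: collect HashComputations level by level from the root, then execute
// from the deepest level up, so child roots are ready before their parents.
type Node = {left?: Node; right?: Node; root?: Uint8Array; dirty: boolean};
type HashComputation = {src0: Node; src1: Node; dest: Node};

function getHashComputations(node: Node, level: number, byLevel: HashComputation[][]): void {
  if (!node.dirty || !node.left || !node.right) return;
  (byLevel[level] ??= []).push({src0: node.left, src1: node.right, dest: node});
  getHashComputations(node.left, level + 1, byLevel);
  getHashComputations(node.right, level + 1, byLevel);
}

function executeHashComputations(
  byLevel: HashComputation[][],
  hashTwo: (a: Uint8Array, b: Uint8Array) => Uint8Array
): void {
  // Deepest level first: children are hashed before their parents,
  // and each level can be handed to the hasher as one batch.
  for (let level = byLevel.length - 1; level >= 0; level--) {
    for (const {src0, src1, dest} of byLevel[level] ?? []) {
      dest.root = hashTwo(src0.root ?? new Uint8Array(32), src1.root ?? new Uint8Array(32));
      dest.dirty = false;
    }
  }
}
```

Grouping by level is what makes a SIMD hasher effective: all pairs at one level are independent and can be hashed in a single batch call.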
### 2.2 Get HashComputations during ViewDU.commit() in ssz
ViewDU in ssz has a great design for tracking changed nodes and views. Based on this we can compute HashComputations during `commit()` and save one tree traversal (see above). The saved time is small, as benchmarked, but it's still reasonable to do.
It saves about 10ms in my benchmark result:
```
BeaconState ViewDU partially modified tree vc=200000 numModified=100000
✓ BeaconState ViewDU batchHash - commit & getHashComputation vc=20 1.802902 ops/s 554.6614 ms/op - 79 runs 52.9 s
✓ BeaconState ViewDU hashTreeRoot - commit step vc=200000 1.835587 ops/s 544.7849 ms/op - 59 runs 39.5 s
```
## 3. Lazily create and persist validator trees
TLDR: This is only good enough to start with; persisting the tree in order to batch hash turned out not to be a great idea after I ran a node on holesky (see below).
The current implementation only creates the validator tree on the fly without tracking it, so we cannot batch hash.
The first approach is to lazily compute the tree when needed and persist it in BranchNodeStruct.
```
BeaconState ViewDU partially modified tree vc=200000 numModified=100000
✓ BeaconState ViewDU recursive hash vc=200000 1.128123 ops/s 886.4279 ms/op - 27 runs 29.1 s
✓ BeaconState ViewDU recursive hash - commit step vc=200000 13.13220 ops/s 76.14873 ms/op - 45 runs 8.58 s
✓ BeaconState ViewDU validator tree creation vc=100000 2.462462 ops/s 406.0976 ms/op - 67 runs 39.7 s
✓ BeaconState ViewDU batchHash vc=200000 1.292733 ops/s 773.5550 ms/op - 43 runs 37.7 s
✓ BeaconState ViewDU batchHash - commit & getHashComputation vc=20 1.802902 ops/s 554.6614 ms/op - 79 runs 52.9 s
✓ BeaconState ViewDU batchHash - hash step vc=200000 3.886184 ops/s 257.3218 ms/op - 57 runs 57.0 s
✓ BeaconState ViewDU hashTreeRoot vc=200000 1.322128 ops/s 756.3564 ms/op - 63 runs 53.6 s
✓ BeaconState ViewDU hashTreeRoot - commit step vc=200000 1.835587 ops/s 544.7849 ms/op - 59 runs 39.5 s
✓ BeaconState ViewDU hashTreeRoot - hash step vc=200000 3.717275 ops/s 269.0142 ms/op - 57 runs 60.2 s
```
The above benchmark shows that tree creation takes up to 1.5x the hash time, while we cannot do much to improve the hash time itself, so I decided to improve tree creation instead. Note that tree creation time increased 1.6x (406ms vs 248ms) because of the persistence, which shows this is not a great approach.
Batch hash did help a bit though (total time is 756ms vs 886ms the old way).
## 4. Avoid validator tree creation time
TLDR: great for the benchmark, but it pays 9GB more RAM to save 1s of epoch transition, which is not sustainable in the long run.
### 4.1 Maintain both validator tree and value
I enhanced the ContainerNodeStruct ViewDU to track the value plus the changed nodes and views.
- The read operations are as fast as before
- The write (commit/hashTreeRoot) operations are as fast as a regular Container's, with no tree creation involved
- Benchmark results are great:
- no tree creation time, so the default recursive hash is 2x faster
- the number of HashComputations is reduced, so hash time is 2x faster
- in total, it's 3x faster than the original result (328ms vs 956ms)
```
BeaconState ViewDU partially modified tree vc=200000 numModified=100000
✓ BeaconState ViewDU recursive hash vc=200000 2.383132 ops/s 419.6159 ms/op - 18 runs 12.6 s
✓ BeaconState ViewDU recursive hash - commit step vc=200000 5.037264 ops/s 198.5205 ms/op - 16 runs 6.25 s
✓ BeaconState ViewDU validator tree creation vc=100000 35.76944 ops/s 27.95682 ms/op - 38 runs 20.7 s
✓ BeaconState ViewDU batchHash vc=200000 2.855090 ops/s 350.2516 ms/op - 13 runs 7.67 s
✓ BeaconState ViewDU batchHash - commit & getHashComputation vc=20 4.192557 ops/s 238.5179 ms/op - 14 runs 6.50 s
✓ BeaconState ViewDU batchHash - hash step vc=200000 7.569542 ops/s 132.1084 ms/op - 51 runs 29.1 s
✓ BeaconState ViewDU hashTreeRoot vc=200000 3.039540 ops/s 328.9972 ms/op - 12 runs 6.44 s
✓ BeaconState ViewDU hashTreeRoot - commit step vc=200000 4.499143 ops/s 222.2646 ms/op - 21 runs 8.92 s
✓ BeaconState ViewDU hashTreeRoot - hash step vc=200000 7.361959 ops/s 135.8334 ms/op - 36 runs 21.0 s
```
### 4.2 Testing
I continued with the spec tests, and they all passed.
Then I integrated it into lodestar and got a surprise: OOM.
My investigation showed that I could only load up to 1.2M validators; both mainnet and holesky surpass that.
It's not possible to store a full BeaconState tree with the current implementation.
### 4.3 Troubleshooting
Each node object could take 2 bytes x 8 for property names and 8 bytes x 8 for property values, plus the overhead of the Object itself, so it takes at least (2x8 + 8x8) = 80 bytes.
A validator value is represented by:
```
- pubkey: 96 bytes = 3 nodes
- withdrawalCredentials: 32 bytes = 1 node
- effectiveBalance: 8 bytes = 1 node
- slashed: boolean = 1 node
- activationEligibilityEpoch: 8 bytes = 1 node
- activationEpoch: 8 bytes = 1 node
- exitEpoch: 8 bytes = 1 node
- withdrawableEpoch: 8 bytes = 1 node
```
validator tree:
```
level
0 validator root
/ \
1 10 11
/ \ / \
2 20 21 22 23
/ \ / \ / \ / \
3 pub with eff sla act act exit with
/ \
4 pub0 pub1
```
```
level 0: 1 node
level 1: 2 nodes
level 2: 4 nodes
level 3: 8 nodes (8 properties in total)
level 4: 2 nodes (only for pubkey)
total: 17 nodes
```
So it has 16 nodes in addition to the current implementation (which is 1 node). Considering each node is at least 80 bytes, a validator tree could take at least 16 x 80 = 1280 bytes, so with more than 1.7M holesky validators it takes at least 1.28KB x 1.7M ≈ 2.18 GB, which explains the OOM.
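As a quick sanity check of the arithmetic above:

```typescript
// Quick check of the memory estimate above.
const bytesPerNode = 2 * 8 + 8 * 8; // 8 property names (~2 bytes) + 8 values (8 bytes) = 80 bytes
const extraNodesPerValidator = 17 - 1; // full 17-node tree minus the single node stored today
const bytesPerValidator = extraNodesPerValidator * bytesPerNode; // 1280 bytes
const holeskyGB = (1.7e6 * bytesPerValidator) / 1e9; // ≈ 2.18 GB of additional tree nodes
```

And this is only a lower bound: it ignores the `Uint8Array` roots cached per node and the extra gc pressure from keeping all of these objects alive.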
### 4.4 More testing
I was only able to run a node with this branch with `--max-old-space-size` set to 16GB.
The PrepareNextEpoch duration is 1s less than stable.
![Screenshot 2024-06-18 at 16.32.19](https://hackmd.io/_uploads/BkeE4CABA.png)
So we pay around 9GB to save 1s of epoch transition, which may not be worth it; `gc` time increased too, and it does not seem sustainable in the long term.
## 5. Batch commit
Right now we commit validators individually, so we would have to put everything inside HashComputations in order to batch, but that requires persisting trees as we saw.
We need to find a way to commit validators and compute their roots in batch.
### 5.1 Prepopulate validator trees
- Extend the state.validators type to specify its own ViewDU, called ValidatorsViewDU
  - ValidatorsViewDU extends ListCompositeViewDU
  - Create 16 validator trees once
- Each item of it is a ValidatorViewDU
  - For every 16 validators, populate the trees from the ValidatorViewDU values, then compute the roots in batch
  - ValidatorViewDU creates a new node = root + value from there
- ValidatorViewDU extends ContainerStructViewDU
  - new method to populate a tree from the value: valueToTree(node: Node): void
  - new method commitHashObject(ho: HashObject): this creates a new root node + value, applying ho (HashObject) to the root
- BranchNodeStruct
  - no change
  - in ValidatorViewDU.commitHashObject(), new an instance and call applyHash() directly
- Pros:
  - able to batch hash without creating new trees every time
  - trees are created once and reused, which should mean less gc
- Cons:
  - not generic, we have to create a specific type for state.validators
  - 2-step commit, one for validators and one for the rest; state.validators.commit() actually does the hash as well
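The batch flow described above might be sketched as follows; the signatures of `valueToTree`, `commitHashObject`, and the batch hasher are assumptions for illustration, not the actual ssz API:

```typescript
// Sketch: reuse a fixed set of preallocated validator trees, populate them
// from the changed validator values, hash the roots in one batch, then apply
// each resulting root back without keeping the populated tree around.
type HashObject = {h0: number /* ...h1..h7 in the real type */};

interface ChangedValidator {
  valueToTree(tree: Uint8Array): void; // serialize value into a reusable tree buffer
  commitHashObject(ho: HashObject): void; // build new root node from root + value
}

function commitValidators(
  changed: ChangedValidator[],
  scratchTrees: Uint8Array[], // e.g. 16 reusable buffers, allocated once
  digestBatch: (trees: Uint8Array[]) => HashObject[] // batch hasher (e.g. hashtree)
): void {
  for (let i = 0; i < changed.length; i += scratchTrees.length) {
    const group = changed.slice(i, i + scratchTrees.length);
    // 1. populate the reusable trees from the validator values
    group.forEach((v, j) => v.valueToTree(scratchTrees[j]));
    // 2. compute the group's roots in a single batch hash
    const roots = digestBatch(scratchTrees.slice(0, group.length));
    // 3. apply each root back; the scratch trees are reused for the next group
    group.forEach((v, j) => v.commitHashObject(roots[j]));
  }
}
```

The key design choice is that the scratch trees are allocated once and overwritten per group, so batching no longer implies persisting a tree per validator.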
The result (with hashtree) is great:
- recursive hash is 3.24x faster
- batchHash is 2.89x faster
- hashTreeRoot is 3x faster
```
BeaconState ViewDU partially modified tree vc=200000 numModified=100000
✓ BeaconState ViewDU recursive hash vc=200000 3.653621 ops/s 273.7010 ms/op - 28 runs 12.8 s
✓ BeaconState ViewDU recursive hash - validators commit step vc=20 4.545098 ops/s 220.0172 ms/op - 18 runs 6.41 s
✓ BeaconState ViewDU validator tree creation vc=100000 128.2135 ops/s 7.799492 ms/op - 28 runs 22.1 s
✓ BeaconState ViewDU batchHash vc=200000 3.733401 ops/s 267.8523 ms/op - 18 runs 6.93 s
✓ BeaconState ViewDU batchHash - commit & getHashComputation vc=20 4.310165 ops/s 232.0097 ms/op - 15 runs 5.80 s
✓ BeaconState ViewDU batchHash - hash step vc=200000 31.72988 ops/s 31.51603 ms/op - 14 runs 10.9 s
✓ BeaconState ViewDU hashTreeRoot vc=200000 3.954695 ops/s 252.8640 ms/op - 16 runs 6.07 s
✓ BeaconState ViewDU hashTreeRoot - commit step vc=200000 4.277347 ops/s 233.7898 ms/op - 18 runs 6.67 s
✓ BeaconState ViewDU hashTreeRoot - hash step vc=200000 30.73572 ops/s 32.53543 ms/op - 26 runs 15.0 s
```
### 5.2 Prepopulate Uint8Array
Starting from the same idea as 5.1, the insight is that we don't need to populate the middle nodes in order to get a validator root. We only need a Uint8Array covering the validator's leaves at the bottom level, and we let the hasher compute the root level by level.
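This can be sketched as hashing a flat buffer level by level until one root remains; the code below is a simplified stand-in for hashtree's digestNLevelUnsafe(), and the `hashInto(input, output)` callback signature is an assumption:

```typescript
// Sketch: serialize a validator's leaf chunks into one flat Uint8Array, then
// hash pairs of 32-byte chunks level by level until a single 32-byte root remains.
// hashInto(input, output) is assumed to hash each 64-byte block of input into
// 32 bytes of output, processing blocks left to right.
function digestNLevel(
  data: Uint8Array,
  nLevel: number,
  hashInto: (input: Uint8Array, output: Uint8Array) => void
): Uint8Array {
  let length = data.length;
  for (let i = 0; i < nLevel; i++) {
    // Write each level's output over the front of the same buffer: no second
    // buffer and no extra Uint8Array.set() per level (hence "unsafe" - the
    // input is consumed as it is overwritten).
    hashInto(data.subarray(0, length), data.subarray(0, length / 2));
    length = length / 2;
  }
  return data.subarray(0, 32); // the validator root
}
```

For a validator, nLevel would be 3 on the padded 8-field container (plus one more level for the pubkey chunks), and no intermediate Node objects are ever allocated.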
The result is even 30% faster than 5.1:
```
BeaconState ViewDU partially modified tree vc=200000 numModified=100000
✓ BeaconState ViewDU recursive hash vc=200000 4.706004 ops/s 212.4945 ms/op - 23 runs 9.20 s
✓ BeaconState ViewDU recursive hash - validators commit step vc=20 6.520329 ops/s 153.3665 ms/op - 12 runs 3.77 s
✓ BeaconState ViewDU validator tree creation vc=100000 7.992198 ops/s 125.1220 ms/op - 14 runs 6.58 s
✓ BeaconState ViewDU batchHash vc=200000 4.850986 ops/s 206.1437 ms/op - 14 runs 4.79 s
✓ BeaconState ViewDU batchHash - commit & getHashComputation vc=20 5.573049 ops/s 179.4350 ms/op - 14 runs 4.28 s
✓ BeaconState ViewDU batchHash - hash step vc=200000 40.95685 ops/s 24.41594 ms/op - 16 runs 10.0 s
✓ BeaconState ViewDU hashTreeRoot vc=200000 5.156880 ops/s 193.9157 ms/op - 23 runs 6.88 s
✓ BeaconState ViewDU hashTreeRoot - commit step vc=200000 5.963063 ops/s 167.6991 ms/op - 14 runs 4.41 s
✓ BeaconState ViewDU hashTreeRoot - hash step vc=200000 35.06000 ops/s 28.52253 ms/op - 30 runs 13.2 s
- BeaconState ViewDU hashTreeRoot - commit step each validator vc=200000
```
#### 5.2.1 Improve hashtree digestNLevelUnsafe
The key function to compute a validator root from a single Uint8Array is digestNLevelUnsafe(). We used 2 Uint8Arrays to hash at each level; it turned out we can use only one, which avoids a `Uint8Array.set()`. Also, we should not allocate a temporary HashObject and call `applyHash()`; each hash method should support taking the result as an output parameter.
This is 30% faster than the above:
```
BeaconState ViewDU partially modified tree vc=200000 numModified=100000
✓ BeaconState ViewDU recursive hash vc=200000 6.950837 ops/s 143.8676 ms/op - 14 runs 5.26 s
✓ BeaconState ViewDU recursive hash - commit step vc=200000 9.200534 ops/s 108.6893 ms/op - 13 runs 3.20 s
✓ BeaconState ViewDU validator tree creation vc=100000 8.308017 ops/s 120.3657 ms/op - 12 runs 4.73 s
✓ BeaconState ViewDU batchHash vc=200000 7.139702 ops/s 140.0619 ms/op - 15 runs 3.88 s
✓ BeaconState ViewDU batchHash - commit & getHashComputation vc=20 8.223193 ops/s 121.6073 ms/op - 18 runs 4.35 s
✓ BeaconState ViewDU batchHash - hash step vc=200000 35.87954 ops/s 27.87103 ms/op - 20 runs 8.51 s
✓ BeaconState ViewDU hashTreeRoot vc=200000 7.531041 ops/s 132.7838 ms/op - 19 runs 4.53 s
✓ BeaconState ViewDU hashTreeRoot - commit step vc=200000 9.199858 ops/s 108.6973 ms/op - 13 runs 3.24 s
✓ BeaconState ViewDU hashTreeRoot - hash step vc=200000 38.55383 ops/s 25.93776 ms/op - 22 runs 8.32 s
```
Compared to 1.2, this is 956/132 ≈ 7.24x faster than unstable in my environment.
## 6. Lessons learned from this work
hashtree performance is great, but an equally important improvement from this work is using reusable memory to hash validators in batch. We may not use the batch technique in other places, but hashing with reusable memory is a great technique on its own; we can use it to compute the root of:
- Attestation
- Signing Root

hashtree also has a nice `hashInto` api which saves a lot of memory for the consumer; we should apply the same thing to as-sha256.
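To illustrate the shape of such an api, here is a minimal sketch built on node's crypto module; the signature is illustrative of the idea, not the actual hashtree or as-sha256 API:

```typescript
import {createHash} from "node:crypto";

// Sketch of a hashInto-style API: the caller owns both buffers, so hashing
// N pairs writes into preallocated memory instead of allocating N result arrays.
function hashInto(input: Uint8Array, output: Uint8Array): void {
  if (input.length % 64 !== 0 || output.length * 2 !== input.length) {
    throw new Error("input must be N 64-byte blocks, output N 32-byte digests");
  }
  for (let i = 0; i < input.length / 64; i++) {
    // hash each 64-byte block into the caller-provided 32-byte output slot
    const digest = createHash("sha256")
      .update(input.subarray(i * 64, (i + 1) * 64))
      .digest();
    output.set(digest, i * 32);
  }
}
```

A native implementation would write digests into `output` directly; the point of the signature is that the consumer can reuse the same output buffer across calls.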