# NOMT
## Scaling Blockchains with a High Throughput State Database
---
NOMT is a state database, a critical component of a blockchain node.
- High Throughput Commits
- Low latency lookups
- Optimized for **merklized state**
- Optimized for modern Solid State Drives (SSDs)
- Full-block state proofs
---
NOMT is designed to be a drag-and-drop replacement for RocksDB, MDBX, LMDB, etc
And to encapsulate all the Merkle Trie logic within it.
---
Inspired by this blog post.
<img src="https://hackmd.io/_uploads/HygqiZNsWJg.png" style="width: 60%; height=auto">
<small>and funded with a grant from Sovereign Labs (targeted first user)</small>
---
Scan for GitHub
<img src="https://hackmd.io/_uploads/ryAJf7ibke.png" style="width: 33%; height=auto">
or navigate to
https://github.com/thrumdev/nomt
---
Built with care by
<img src="https://hackmd.io/_uploads/r1Hhm7iW1g.png" style="width: 20%; height=auto">
<img src="https://hackmd.io/_uploads/rkk5X7sWke.png" style="width: 20%; height=auto">
<img src="https://hackmd.io/_uploads/S1D6mQiZyx.png" style="width: 20%; height=auto">
<small>three Rust systems programmers from the Polkadot sphere</small>
---
## Why?
* Large State = many users = bigger network effects
* Large State = more complex apps = more value-add
* Every 10k TPS of throughput gives space for ~100M DAU.
* SSDs are getting better fast
* Trend of industry to scale with RAM is misguided long-term
* it's cool
---
## TL;DW
* Support billions of accounts on a 4TB SSD
* Targeting 25k worst-case TPS for V1
* Potential for cheap scaling with multiple SSDs and SSD improvements
* Fully authenticated, ready for interoperability
---
## State Database?
Every node needs to store state: token balances, contract code, prices, etc.
Nodes keep this state in an embedded database on their disk.
---
<img src="https://hackmd.io/_uploads/Bk6t87sZke.png" style="width: 60%; height=auto">
Authenticated State Trees
---
### Authenticated State Trees
- Encode the entire state of the blockchain
- With a super light commitment (the "root")
- That can be used to prove the value of any item in the state
**Incredibly important for interoperability, light clients, bridging, coprocessors, etc.**
---
NOMT uses a Binary Merkle Trie
![image](https://hackmd.io/_uploads/rJexnlEo-1g.png)
---
Let's talk about SSDs
---
Typical high-end SSD specs (Q4 2024):
- 4TB capacity
- 40 microsecond latency
- 1M IOPS (I/Os per Second)
---
![image](https://hackmd.io/_uploads/SkImNVsb1l.png)
Latency: time to serve a single request.
<small>increases as more requests are queued</small>
---
![image](https://hackmd.io/_uploads/Bk-1SVoZyx.png)
Throughput: number of requests per second.
<small>note that average latency at 1M IOPS is 1 microsecond!</small>
---
## What's an I/O?
A 4 KB (single **page**) read or write operation on the disk.
<small>SSDs only work in data of exactly one page.</small>
---
## SSDs are advancing rapidly
![image alt](https://hackmd.io/_uploads/S1RzY1n-yx.png =500x500)
---
## SSDs are advancing rapidly
| | Crucial T705 | Samsung 980 |
| -------- | -------- | -------- |
| Release Date | Feb. 2024 | Sep. 2020 |
| IOPS | 1.5M | 1M |
| Price at Launch| $689 | $430 |
| Price (Nov. 2024)| $564 | $199
| Capacity | 4TB | 2TB
---
## SSDs are advancing rapidly
![image](https://hackmd.io/_uploads/S1nSik2Wyg.png)
<small>source: Tom's Hardware</small>
---
![for_exceptional_user](https://hackmd.io/_uploads/By-zEm2Zyx.png)
---
## Why not just add more RAM?
- A SuperMicro 12TB motherboard + 12TB of compatible RAM would cost over $40k (for DDR4, not DDR5 RAM)
- A 4TB RAM configuration would cost $5k USD
- 3 4TB SSDs would cost $1500.
- SSD reads can be _faster_ than RAM copies due to Direct Memory Access
- PCI is getting _crazy fast_
---
![for_esteemed_broski](https://hackmd.io/_uploads/H1ChCynb1e.png)
---
## The SSD caveat: Parallelism
- You need a parallel VM to take advantage of SSD IOPS.
- Theoretical max sequential throughput at 40us latency is only ~6K TPS
---
![image](https://hackmd.io/_uploads/S1127en-1e.png)
---
# Back To NOMT
---
## NOMT Architecture
<img src="https://hackmd.io/_uploads/H1noBg2W1x.png" style="width: 90%; height=auto">
---
## Bitbox: Merkle Node Storage 🥁
<img src="https://www.rob.tech/assets/images/merklization_diagrams/page_1.svg" style="width: 80%; height=auto">
<small>Just like this, except each page is 4KB and stores 6 layers</small>
<small>This is known as a "proto-VEB layout"</small>
<small>Pages stored in an on-disk Hash-Table with a WAL</small>
<small>Top 4 levels of the page tree are kept in memory</small>
---
## Beatree: Key-Value Store 🪘
![image](https://hackmd.io/_uploads/r1vnKM3-kl.png)
<small>Branches use bitwise prefix compression and separators</small>
<small>Optimized for high-entropy, short keys</small>
<small>Typical Branching factor of 375</small>
<small>~200MB memory per billion items</small>
---
## Engineering Decisions
- Targets Linux as first-class (needed for io-uring)
- Targets generic Unix/macOS as functional (for development)
- Direct I/O for max IOPS
- Single-state DB, stores only finalized state. Unfinalized state must live in-memory.
- Rust 🦀
---
## Space Usage
- 1 billion items estimated at 150GB on-disk
- Large occasional jumps in space usage in bitbox every time the number of items increases by ~64.
- We are working to make this _gradual_
- Aiming to support tens of billions of items per 4TB of storage.
---
## Theoretical Performance Example
<small>
Assume 10 billion accounts and simple transactions (2 random read-writes).
Reads: 3 (1 beatree, 2 merkle)
Writes: 5 (2 beatree, 3 merkle)
Total I/O per TX: 16
At 1M IOPS: 62kTPS
At 500k IOPS: 31.5kTPS
</small>
---
Our performance goal for NOMT V1 (EOY 2024) is 25kTPS with 1 billion accounts.
<br>
<small>Current benchmarks on 128M accounts at ~40kTPS</small>
<small>side note: we assume random worst-case access</small>
---
## Node Requirements
- RAM/Disk size scale with DB size. 4GB RAM and 256GB SSD for 1B items.
- Disk: >500k IOPS
- CPU: 8 cores
---
## Thank You
<img src="https://hackmd.io/_uploads/ryAJf7ibke.png" style="width: 40%; height=auto">
https://github.com/thrumdev/nomt
---
## (Appendix: New Bottlenecks)
---
## Bottleneck: Bandwidth
Scenario: Prove 300k state reads against a database of 1B items.
<small>this is 25kTPS * 6 seconds</small>
Proof size is <span style="color: red">120MB</span>, requiring 160Mbps of bandwidth.
<small>Note that Polkadot PoVs are only 5MB!</small>
---
## Bottleneck: Bandwidth
![image](https://hackmd.io/_uploads/ryKUlQhWkg.png)
<small>Possible solution: use succinct proving techniques to "compress" state proofs. Likely needs GPU acceleration.</small>
---
## Bottleneck: Signature Verification
* Ed25519 signature verification is 26us or 11.5us when batched.
* Secp256k1 verification is 70us
Solution: have dedicated CPU cores or GPU acceleration for signature verification
{"title":"NOMT: Scaling Blockchains Talk Sub0 Nov/2024","contributors":"[{\"id\":\"5e8fb0c4-eedb-4242-a274-7d4baea002b3\",\"add\":9165,\"del\":2118}]","description":"What is \"NOMT\"?"}