# NOMT ## Scaling Blockchains with a High Throughput State Database --- NOMT is a state database, a critical component of a blockchain node. - High Throughput Commits - Low latency lookups - Optimized for **merklized state** - Optimized for modern Solid State Drives (SSDs) - Full-block state proofs --- NOMT is designed to be a drag-and-drop replacement for RocksDB, MDBX, LMDB, etc And to encapsulate all the Merkle Trie logic within it. --- Inspired by this blog post. <img src="https://hackmd.io/_uploads/HygqiZNsWJg.png" style="width: 60%; height=auto"> <small>and funded with a grant from Sovereign Labs (targeted first user)</small> --- Scan for GitHub <img src="https://hackmd.io/_uploads/ryAJf7ibke.png" style="width: 33%; height=auto"> or navigate to https://github.com/thrumdev/nomt --- Built with care by <img src="https://hackmd.io/_uploads/r1Hhm7iW1g.png" style="width: 20%; height=auto"> <img src="https://hackmd.io/_uploads/rkk5X7sWke.png" style="width: 20%; height=auto"> <img src="https://hackmd.io/_uploads/S1D6mQiZyx.png" style="width: 20%; height=auto"> <small>three Rust systems programmers from the Polkadot sphere</small> --- ## Why? * Large State = many users = bigger network effects * Large State = more complex apps = more value-add * Every 10k TPS of throughput gives space for ~100M DAU. * SSDs are getting better fast * Trend of industry to scale with RAM is misguided long-term * it's cool --- ## TL;DW * Support billions of accounts on a 4TB SSD * Targeting 25k worst-case TPS for V1 * Potential for cheap scaling with multiple SSDs and SSD improvements * Fully authenticated, ready for interoperability --- ## State Database? Every node needs to store state: token balances, contract code, prices, etc. Nodes keep this state in an embedded database on their disk. --- <img src="https://hackmd.io/_uploads/Bk6t87sZke.png" style="width: 60%; height=auto"> Authenticated State Trees --- ### Authenticated State Trees - Encode the entire state of the blockchain - With a super light commitment (the "root") - That can be used to prove the value of any item in the state **Incredibly important for interoperability, light clients, bridging, coprocessors, etc.** --- NOMT uses a Binary Merkle Trie ![image](https://hackmd.io/_uploads/rJexnlEo-1g.png) --- Let's talk about SSDs --- Typical high-end SSD specs (Q4 2024): - 4TB capacity - 40 microsecond latency - 1M IOPS (I/Os per Second) --- ![image](https://hackmd.io/_uploads/SkImNVsb1l.png) Latency: time to serve a single request. <small>increases as more requests are queued</small> --- ![image](https://hackmd.io/_uploads/Bk-1SVoZyx.png) Throughput: number of requests per second. <small>note that average latency at 1M IOPS is 1 microsecond!</small> --- ## What's an I/O? A 4 KB (single **page**) read or write operation on the disk. <small>SSDs only work in data of exactly one page.</small> --- ## SSDs are advancing rapidly ![image alt](https://hackmd.io/_uploads/S1RzY1n-yx.png =500x500) --- ## SSDs are advancing rapidly | | Crucial T705 | Samsung 980 | | -------- | -------- | -------- | | Release Date | Feb. 2024 | Sep. 2020 | | IOPS | 1.5M | 1M | | Price at Launch| $689 | $430 | | Price (Nov. 2024)| $564 | $199 | Capacity | 4TB | 2TB --- ## SSDs are advancing rapidly ![image](https://hackmd.io/_uploads/S1nSik2Wyg.png) <small>source: Tom's Hardware</small> --- ![for_exceptional_user](https://hackmd.io/_uploads/By-zEm2Zyx.png) --- ## Why not just add more RAM? - A SuperMicro 12TB motherboard + 12TB of compatible RAM would cost over $40k (for DDR4, not DDR5 RAM) - A 4TB RAM configuration would cost $5k USD - 3 4TB SSDs would cost $1500. - SSD reads can be _faster_ than RAM copies due to Direct Memory Access - PCI is getting _crazy fast_ --- ![for_esteemed_broski](https://hackmd.io/_uploads/H1ChCynb1e.png) --- ## The SSD caveat: Parallelism - You need a parallel VM to take advantage of SSD IOPS. - Theoretical max sequential throughput at 40us latency is only ~6K TPS --- ![image](https://hackmd.io/_uploads/S1127en-1e.png) --- # Back To NOMT --- ## NOMT Architecture <img src="https://hackmd.io/_uploads/H1noBg2W1x.png" style="width: 90%; height=auto"> --- ## Bitbox: Merkle Node Storage 🥁 <img src="https://www.rob.tech/assets/images/merklization_diagrams/page_1.svg" style="width: 80%; height=auto"> <small>Just like this, except each page is 4KB and stores 6 layers</small> <small>This is known as a "proto-VEB layout"</small> <small>Pages stored in an on-disk Hash-Table with a WAL</small> <small>Top 4 levels of the page tree are kept in memory</small> --- ## Beatree: Key-Value Store 🪘 ![image](https://hackmd.io/_uploads/r1vnKM3-kl.png) <small>Branches use bitwise prefix compression and separators</small> <small>Optimized for high-entropy, short keys</small> <small>Typical Branching factor of 375</small> <small>~200MB memory per billion items</small> --- ## Engineering Decisions - Targets Linux as first-class (needed for io-uring) - Targets generic Unix/macOS as functional (for development) - Direct I/O for max IOPS - Single-state DB, stores only finalized state. Unfinalized state must live in-memory. - Rust 🦀 --- ## Space Usage - 1 billion items estimated at 150GB on-disk - Large occasional jumps in space usage in bitbox every time the number of items increases by ~64. - We are working to make this _gradual_ - Aiming to support tens of billions of items per 4TB of storage. --- ## Theoretical Performance Example <small> Assume 10 billion accounts and simple transactions (2 random read-writes). Reads: 3 (1 beatree, 2 merkle) Writes: 5 (2 beatree, 3 merkle) Total I/O per TX: 16 At 1M IOPS: 62kTPS At 500k IOPS: 31.5kTPS </small> --- Our performance goal for NOMT V1 (EOY 2024) is 25kTPS with 1 billion accounts. <br> <small>Current benchmarks on 128M accounts at ~40kTPS</small> <small>side note: we assume random worst-case access</small> --- ## Node Requirements - RAM/Disk size scale with DB size. 4GB RAM and 256GB SSD for 1B items. - Disk: >500k IOPS - CPU: 8 cores --- ## Thank You <img src="https://hackmd.io/_uploads/ryAJf7ibke.png" style="width: 40%; height=auto"> https://github.com/thrumdev/nomt --- ## (Appendix: New Bottlenecks) --- ## Bottleneck: Bandwidth Scenario: Prove 300k state reads against a database of 1B items. <small>this is 25kTPS * 6 seconds</small> Proof size is <span style="color: red">120MB</span>, requiring 160Mbps of bandwidth. <small>Note that Polkadot PoVs are only 5MB!</small> --- ## Bottleneck: Bandwidth ![image](https://hackmd.io/_uploads/ryKUlQhWkg.png) <small>Possible solution: use succinct proving techniques to "compress" state proofs. Likely needs GPU acceleration.</small> --- ## Bottleneck: Signature Verification * Ed25519 signature verification is 26us or 11.5us when batched. * Secp256k1 verification is 70us Solution: have dedicated CPU cores or GPU acceleration for signature verification
{"title":"NOMT: Scaling Blockchains Talk Sub0 Nov/2024","contributors":"[{\"id\":\"5e8fb0c4-eedb-4242-a274-7d4baea002b3\",\"add\":9165,\"del\":2118}]","description":"What is \"NOMT\"?"}
    942 views