# Binary Sv2
Our benchmarking verified our claims about lifetimes on Sv2 types leaking into the user application. With lifetimes, stored frames eventually exhaust the pool capacity, forcing memory allocation and adding latency, since deserialization must then happen in newly allocated space. This contrasts with releasing the pool space as soon as a frame is received and immediately cloning it into an owned type: later deserializations never need fresh allocations, and the pool space stays free for concurrent deserialization requests.
This is a generated report of some of the more interesting benchmark results (AI was used to render the results as tables):
```
What the Results Show
Pool Exhaustion Boundary (exhaustion_boundary)
| Frame index | zc_hold (ns) | owned_release (ns) |
|-------------|--------------|--------------------|
| 1 | 76 | 465 |
| 2 | 75 | 425 |
| 3 | 74 | 413 |
| 4 | 74 | 396 |
| 5 | 77 | 485 |
| 6 | 82 | 494 |
| 7 | 91 | 661 |
| 8 | 99 | 475 |
| 9 | 197 ← 2× jump | 471 ← stays flat |
| 10 | 166 | 455 |
| 11 | 159 | 462 |
| 12 | 180 | 495 |
The step-change from frame 8 → 9 is unmistakable. This is exactly POOL_CAPACITY = 8. After that point every single ZC decode triggers a malloc from the system allocator instead
of a pointer-bump from the pool.
Copy Overhead (copy_overhead)
| Operation | Time |
|------------------------------------------------|-------|
| Decode only (frame + pool slot, no field copy) | 80 ns |
| Decode + deserialize + copy to owned | 318 ns|
The copy cost for this message (32-byte merkle root + 64-byte coinbase) is ~238 ns — roughly 3 memory allocations (Vec::from_slice for coinbase). This is the price the owned
pattern pays per message.
The Crossover Analysis
This is the critical insight:
- ZC within pool capacity (frames 1–8): ~76–99 ns/msg — fast pool bump allocation
- ZC beyond pool capacity (frames 9+): ~160–200 ns/msg — every decode is a malloc
- Owned every message: ~450 ns/msg — constant, includes the copy
For a 100-message burst:
- ZC held: 8 × ~80ns + 92 × ~180ns = ~17,200 ns plus 92 heap allocations still live
- Owned: 100 × ~450ns = ~45,000 ns but pool always free, zero residual heap pressure
So for small payloads, ZC held is still faster in total time. But this reverses for larger payloads (B016M, 1 MB coinbase) where the copy cost of owned would be a fraction of
the repeated malloc/mmap calls for large heap allocations. More importantly, "ZC held" produces 92 live heap allocations that the allocator must eventually free — the GC
pressure and memory fragmentation cost does not appear in per-decode timings but shows up in total system throughput.
For the sustained-100 and throughput groups the results will make the full story visible. Run the complete suite:
cargo bench --bench pool_lifecycle --features with_buffer_pool
The benchmark files are at:
- sv2/codec-sv2/benches/pool_lifecycle.rs — all 5 benchmark groups
- sv2/codec-sv2/benches/common.rs — added OwnedMsg + OwnedMsg::from_zc
---
Throughput results: ZC Hold vs Owned Release
| N msgs | ZC Hold (total) | Owned Release (total) | ZC ns/msg | Owned ns/msg | ZC overhead |
|--------|------------------|------------------------|------------|---------------|--------------|
| 1 | 323 ns | 347 ns | 323 | 347 | -7% (ZC faster) |
| 4 | 1.16 µs | 1.38 µs | 290 | 345 | -16% (ZC faster) |
| 8 | 2.51 µs | 2.69 µs | 314 | 336 | -6% (ZC faster) |
| 9 | 2.95 µs | 3.01 µs | 328 | 334 | ≈ break-even |
| 16 | 5.74 µs | 5.63 µs | 359 | 352 | +2% (owned faster) |
| 32 | 13.1 µs | 11.9 µs | 409 | 371 | +10% |
| 64 | 30.4 µs | 26.4 µs | 475 | 413 | +15% |
| 100 | 46.2 µs | 41.6 µs | 462 | 416 | +11% |
| 200 | 97.0 µs | 80.2 µs | 485 | 401 | +21% |
| 1000 | 505 µs | 387 µs | 505 | 387 | +30% |
---
What this tells you
Crossover is at N≈9–16. Before that, ZC is faster (no copy overhead, pool-backed allocation). After that, ZC degrades monotonically because every message past the 8th pool slot forces a malloc.
At 100 messages:
- ZC: 462 ns/msg — each message after slot 8 is a heap allocation
- Owned: 416 ns/msg — constant-time; the pool slot recycles every iteration
- ZC is 11% slower per message
At 1000 messages:
- ZC: 505 ns/msg
- Owned: 387 ns/msg
- ZC is 30% slower per message — and this gap keeps widening with N because every single one of the 992 over-capacity messages pays malloc while owned stays at 0-slot pool
occupancy the entire run
The owned pattern's per-message cost is essentially flat (~387–416 ns) across all N because the pool slot recycles. The ZC pattern's per-message cost keeps climbing because
accumulated frames pin more and more allocator pressure even after the 8-slot limit is hit (more live allocations = more fragmentation, longer free/malloc paths at teardown).
```
Let me be clear: the extra cost we incur from allocating memory for deserialization when the memory pool is full is modest, on the order of 100 ns per message in the benchmark above. In fact, looking at the full picture, the zero-copy (lifetime-based) approach is somewhat faster than the owned variant in isolation. From first principles this makes sense: the memory is already allocated, and we simply reuse it to reference the frame/message, avoiding the cost of an additional allocation. The problem arises in how we use it.
In our applications, the usual workflow is to take the frame's bytes from the pool, use the message borrowed from that frame, and sometimes store it. Storing the message pins its pool slot, which leads to buffer pool exhaustion and raises deserialization cost, since every further decode then requires a fresh allocation. At the same time, we often change the type of the message we store: we clone it or make it `'static`, incurring additional heap allocation for the clone, which removes much of the advantage of the initial zero-copy approach. The sketch below models this workflow.
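To make the failure mode concrete, here is a minimal self-contained sketch. All names are stand-ins, not the real codec-sv2 API; a pool slot is modelled as a plain byte buffer.
```rust
// Stand-in for one slot of the buffer pool.
struct Slot(Vec<u8>);

// Zero-copy view into a slot: holding it pins the slot.
struct MsgRef<'a> {
    payload: &'a [u8],
}

// "Deserialization" without copying: the message borrows the slot.
fn decode(slot: &Slot) -> MsgRef<'_> {
    MsgRef { payload: &slot.0 }
}

fn main() {
    // POOL_CAPACITY = 8, as in the benchmark above.
    let pool: Vec<Slot> = (0..8).map(|_| Slot(vec![0u8; 96])).collect();

    // Storing the borrowed messages keeps every slot pinned: none can
    // be recycled while `held` is alive, so frame 9 onwards must fall
    // back to the global allocator (the 2x jump at frame 9 above).
    let held: Vec<MsgRef<'_>> = pool.iter().map(decode).collect();
    assert_eq!(held[7].payload.len(), 96);
}
```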
The lifetime on these types also leaks the abstraction into the user layer. We do not want that. Instead, the user should get to decide whether to allocate memory for a message in the shared memory pool or use the global allocator. This is very similar to how serde lets users opt in to zero-copy deserialization, as the next example shows.
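For comparison, here is roughly how serde expresses that opt-in. The owned struct copies data out of the input and carries no lifetime; the borrowed struct pins the input buffer instead (serde borrows `&str` and `&[u8]` fields automatically when the struct has a lifetime parameter). serde_json is used purely as an illustrative deserializer here.
```rust
use serde::Deserialize;

// Opt-out: fields are copied out of the input; no lifetime.
#[derive(Deserialize)]
struct SetupOwned {
    vendor: String,
}

// Opt-in: `vendor` borrows from the input, so the struct is tied
// to the input buffer's lifetime.
#[derive(Deserialize)]
struct SetupBorrowed<'a> {
    vendor: &'a str,
}

fn main() {
    let json = String::from(r#"{ "vendor": "acme" }"#);
    let owned: SetupOwned = serde_json::from_str(&json).unwrap();
    let borrowed: SetupBorrowed<'_> = serde_json::from_str(&json).unwrap();
    // Same data; only the ownership (and thus the buffer's fate) differs.
    assert_eq!(owned.vendor, borrowed.vendor);
}
```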
## Migration
To provide this opt-in behavior we do not need to change much in our current implementation. Currently, all our types are backed by the `Inner` type:
```rust
pub enum Inner<'a, const ISFIXED: bool, const SIZE: usize,
const HEADERSIZE: usize, const MAXSIZE: usize> {
Ref(&'a mut [u8]), // zero-copy, borrows from buffer
Owned(Vec<u8>), // heap-allocated copy
}
```
```rust
pub type U256<'a> = Inner<'a, true, 32, 0, 0>;
pub type Signature<'a> = Inner<'a, true, 64, 0, 0>;
pub type B064K<'a> = Inner<'a, false, 1, 2, { u16::MAX as usize }>;
pub type B0255<'a> = Inner<'a, false, 1, 1, 255>;
pub type Str0255<'a> = Inner<'a, false, 1, 1, 255>;
```
This forces every type built on `Inner`, even when it holds the `Owned` variant, to carry a lifetime tied back to the original allocation:
```rust
// We have no choice — the lifetime bleeds up the entire call chain
struct SetupConnection<'a> {
vendor: Str0255<'a>,
hardware: Str0255<'a>,
firmware: Str0255<'a>,
device_id: B032<'a>,
}
```
The clean way out is to split `Inner` into two distinct types: a `Ref(&'a mut [u8])` variant that keeps the lifetime and an `Owned(Vec<u8>)` variant that does not. This removes the lifetime from the owned path entirely and makes downstream implementations agnostic to lifetimes when they do not opt for zero-copy. Choosing between the two types is how the user opts in to zero-copy:
```rust
// Zero-copy variant: keeps the lifetime and borrows from the decode buffer.
pub enum InnerRef<'a, const ISFIXED: bool, const SIZE: usize,
                  const HEADERSIZE: usize, const MAXSIZE: usize> {
    Ref(&'a mut [u8]),
}

// Owned variant: no lifetime, holds a heap-allocated copy.
pub enum Inner<const ISFIXED: bool, const SIZE: usize,
               const HEADERSIZE: usize, const MAXSIZE: usize> {
    Owned(Vec<u8>),
}
```
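One way the opt-in boundary could look (a sketch, not a committed API): converting a borrowed value into an owned one is a single explicit copy, so the user sees exactly where the allocation happens.
```rust
impl<'a, const ISFIXED: bool, const SIZE: usize,
     const HEADERSIZE: usize, const MAXSIZE: usize>
    InnerRef<'a, ISFIXED, SIZE, HEADERSIZE, MAXSIZE>
{
    /// Copies the borrowed bytes into a heap-backed `Inner`, releasing
    /// the borrow on the pool slot. The only extra allocation lives here.
    pub fn into_owned(self) -> Inner<ISFIXED, SIZE, HEADERSIZE, MAXSIZE> {
        match self {
            InnerRef::Ref(slice) => Inner::Owned(slice.to_vec()),
        }
    }
}
```
At the message level, the two struct variants would then look like this: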
```rust
// Pool slot is freed as soon as from_bytes() returns
#[derive(Serialize, Deserialize)]
struct SetupConnection {
min_version: u16,
max_version: u16,
flags: u32,
endpoint_host: Str0255, // ← owned, heap Vec
endpoint_port: u16,
vendor: Str0255, // ← owned, heap Vec
hardware_version: Str0255, // ← owned, heap Vec
firmware: Str0255, // ← owned, heap Vec
device_id: Str0255, // ← owned, heap Vec
}
// generated impl: Decodable where each field is decoded by copying from buffer
// OPT-IN — explicit 'decoder lifetime — zero-copy, pool-tied
#[derive(Serialize, Deserialize)]
struct SetupConnectionRef<'decoder> {
min_version: u16,
max_version: u16,
flags: u32,
endpoint_host: Str0255Ref<'decoder>, // ← zero-copy, pool stays pinned
endpoint_port: u16,
vendor: Str0255Ref<'decoder>,
hardware_version: Str0255Ref<'decoder>,
firmware: Str0255Ref<'decoder>,
device_id: Str0255Ref<'decoder>,
}
// generated impl: Decodable<'decoder> where fields borrow from input slice
```
All in all, in the owned-type scenario we want to use the pool only to hold the frame during deserialization; once the message is deserialized, it should reside in separate memory, as the sketch below illustrates.
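A minimal self-contained sketch of that target flow (stand-in types again, not the real codec-sv2 API): the borrow on the frame ends as soon as the owned message has been decoded, so the pool slot can recycle immediately.
```rust
// Stand-in for a frame sitting in pool memory.
struct Frame(Vec<u8>);

// Owned message: no lifetime, lives on the heap.
struct SetupConnection {
    flags: u32,
}

// Owned decode: fields are copied out; nothing keeps borrowing `bytes`.
fn from_bytes_owned(bytes: &[u8]) -> SetupConnection {
    SetupConnection {
        flags: u32::from_le_bytes(bytes[0..4].try_into().unwrap()),
    }
}

fn main() {
    let frame = Frame(vec![1, 0, 0, 0]);   // raw frame in pool memory
    let msg = from_bytes_owned(&frame.0);  // pool is needed only here
    drop(frame);                           // slot is recyclable already
    assert_eq!(msg.flags, 1);              // message lives independently
}
```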
There are improvements we want to make to streamline the implementation, but they are not part of this document.