Author: Rose (rose2221)
Associated PR: github.com/rose2221/go-yamux/tree/feat/better-buffering
Abstract
Frame batching in Yamux, a core component of libp2p, yields significant performance improvements: in our benchmarks it roughly doubles throughput and reduces syscalls by a factor of ten, at a cost of only 64 KiB of memory per peer session. This paper details the design, benchmarking, and analysis of a write coalescing prototype for the Go implementation of Yamux. The prototype is code-complete; this report documents its benefits pending further discussion with the upstream maintainers.
1. Introduction
Yamux serves as the default stream multiplexer for go-libp2p and, by extension, is a critical component for numerous Ethereum consensus clients, including Prysm, Teku, and Lodestar. In the standard upstream implementation, every Yamux frame is flushed directly to the underlying socket. Over a TLS-encrypted connection, this means one syscall and one TLS record per frame, which inflates CPU usage, increases latency, and raises the overall packet count.
This research, conducted as part of the Ethereum Protocol Fellowship (EPF), evaluates whether write coalescing, the practice of buffering several frames for a short duration (a few milliseconds) before flushing them to the socket, can yield meaningful performance gains without altering the Yamux wire protocol.
2. Proposed Solution: A Prototype for Write Coalescing
A prototype patch was developed to implement write coalescing in go-yamux. The core of the solution is a time.Timer that triggers periodic flushes, governed by a WriteCoalesceDelay parameter set to 2 ms. An additional size check flushes the buffer as soon as it fills, bounding the amount of buffered data.
To control this behavior, two new configuration knobs were introduced to the Config struct:
- `WriteCoalesceDelay`: sets the time-based flush interval.
- `WriteCoalesceSize`: sets the size-based flush threshold.

The implementation is designed to be build-time selectable: setting `WriteCoalesceDelay` to 0 effectively disables the feature and restores the original upstream behavior.
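As a rough illustration of how these knobs could fit together, the sketch below shows a coalescing writer driven by the two parameters. The `coalescingWriter` type, its flush goroutine, and the bufio-based buffer are simplified stand-ins for the real go-yamux session internals, not the actual patch.

```go
// Illustrative sketch only: a buffered writer flushed either when
// WriteCoalesceSize bytes have accumulated or when the WriteCoalesceDelay
// timer fires. Field and method names are hypothetical.
package yamux

import (
	"bufio"
	"sync"
	"time"
)

type Config struct {
	WriteCoalesceDelay time.Duration // 0 disables coalescing (original upstream behavior)
	WriteCoalesceSize  int           // flush once this many bytes are buffered
}

type coalescingWriter struct {
	mu   sync.Mutex
	buf  *bufio.Writer // wraps the underlying (TLS) connection
	conf *Config
}

// writeFrame buffers one encoded frame. With coalescing disabled, or once the
// buffer reaches WriteCoalesceSize, it flushes straight to the socket.
func (w *coalescingWriter) writeFrame(frame []byte) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if _, err := w.buf.Write(frame); err != nil {
		return err
	}
	if w.conf.WriteCoalesceDelay == 0 || w.buf.Buffered() >= w.conf.WriteCoalesceSize {
		return w.buf.Flush() // size-based flush (or coalescing disabled)
	}
	return nil
}

// flushLoop performs the time-based flush. It is only started when
// WriteCoalesceDelay > 0, so setting the delay to 0 restores upstream behavior.
func (w *coalescingWriter) flushLoop(stop <-chan struct{}) {
	ticker := time.NewTicker(w.conf.WriteCoalesceDelay)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			w.mu.Lock()
			w.buf.Flush() // error handling omitted in this sketch
			w.mu.Unlock()
		case <-stop:
			return
		}
	}
}
```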
2.1. Addressing Implementation Challenges
A key challenge in batching is the potential delay of critical, time-sensitive messages. The proposed solution bypasses the batching mechanism for important control frames such as Ping and WindowUpdate, as well as Gossipsub control frames. These frames are sent immediately and do not enter the coalescing buffer, ensuring that their latency is not negatively impacted.
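A minimal sketch of that fast path, building on the hypothetical `coalescingWriter` above (the frame-type constants mirror the values in the Yamux spec; the dispatch logic is an assumption, not the actual patch):

```go
// Frame types as defined by the Yamux spec.
const (
	typeData         uint8 = 0x0
	typeWindowUpdate uint8 = 0x1
	typePing         uint8 = 0x2
)

// send buffers a frame and decides whether it may wait for the coalescing
// timer or must reach the socket immediately.
func (w *coalescingWriter) send(frameType uint8, frame []byte) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if _, err := w.buf.Write(frame); err != nil {
		return err
	}
	// Fast lane: control frames (and any application-tagged urgent frames)
	// are flushed right away instead of waiting up to WriteCoalesceDelay.
	if frameType == typePing || frameType == typeWindowUpdate {
		return w.buf.Flush()
	}
	if w.buf.Buffered() >= w.conf.WriteCoalesceSize {
		return w.buf.Flush()
	}
	return nil
}
```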
For a potential rollout in a client like Prysm, a phased approach is recommended:
- First for request/response protocols (e.g. `/eth2/beacon_chain/req/blocks/`).
- Then for gossip protocols (e.g. `/meshsub/1.1.0`).

3. Benchmark Methodology
The performance evaluation was conducted in a controlled, instrumented environment: 1,000 concurrent streams exchanging 4 KiB messages over a 100 ms RTT link with a 16 MiB receive window, measuring throughput, write() syscalls and TLS records per second, CPU usage on both peers, and p99 stream latency.
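One straightforward way to obtain per-second `write()` and TLS-record counts is to wrap the raw connection beneath the TLS layer in a counting `net.Conn`. The `countingConn` type below is an illustrative harness, not necessarily the exact instrumentation used for these numbers.

```go
// countingConn wraps a net.Conn and counts calls to Write. Placed beneath the
// TLS layer, each call corresponds (roughly) to one write() reaching the
// socket, carrying one or more TLS records; sampling the counters once per
// second gives per-second rates.
package main

import (
	"net"
	"sync/atomic"
)

type countingConn struct {
	net.Conn
	writes atomic.Int64 // Write calls reaching the socket
	bytes  atomic.Int64 // payload bytes written
}

func (c *countingConn) Write(p []byte) (int, error) {
	n, err := c.Conn.Write(p)
	c.writes.Add(1)
	c.bytes.Add(int64(n))
	return n, err
}
```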
4. Results
The introduction of write coalescing yielded substantial performance improvements across all measured metrics. The table below compares the baseline (no batching) against the patched version (2 ms flush delay, 64 KiB buffer).
| Metric (100 ms RTT, 16 MiB window, 4 KiB msg, 1,000 streams) | Baseline (no batching) | Batching (2 ms flush, 64 KiB buf) | Gain |
|---|---|---|---|
| Throughput (MiB/s) | ~12.4 | ~27.8 | 2.2× |
| `write()` syscalls / s | ~10 400 | ~1 050 | 9.9× fewer |
| TLS records / s | ~10 100 | ~990 | 9.7× fewer |
| Avg CPU (both peers, 8 vCPU) | 59% | 48% | −11 pp |
| p99 stream latency (µs) | 870 | 310 | 2.8× faster |
| Extra memory / session | ~0 B | +64 KiB | n/a |
Note: These order-of-magnitude improvements remain consistent across a range of 5–150 ms RTT, with gains becoming more pronounced as latency increases.
Figure 1: Throughput improvement with and without batching.
Figure 2: Reduction in write() syscalls and TLS records per second.
5.1. Interpretation of Results
The benchmark results can be attributed to several factors:
CPU & TLS Overhead: In workloads characterized by many small frames, the overhead of kernel context switches and TLS record processing becomes a dominant bottleneck. Batching amortizes this cost by consolidating many small writes into a single, larger operation, thereby reducing calls into the kernel and OpenSSL by approximately 90%.
Latency: The improvement in p99 latency is counter-intuitive but significant. By merging multiple small TLS records into one, the system avoids head-of-line blocking at the TLS layer, leading to more efficient data transmission and lower tail latencies.
Memory Trade-off: The additional memory required for the coalescing buffer is negligible in the context of modern validator hardware. A 64 KiB buffer per peer connection, even with 1,000 peers, would only amount to a 64 MiB worst-case memory footprint.
5.2. Trade-offs of Write Coalescing
Despite the clear benefits, batching introduces a set of trade-offs that must be carefully managed.
| Aspect | Good (why batching helps) | Bad (why batching hurts) |
|---|---|---|
| CPU + syscalls | 1 big `write()` instead of 10 000 tiny writes ≈ 10× fewer kernel/user crossings | none |
| TLS overhead | fewer TLS records = less crypto, less header bloat | none |
| Throughput | socket stays full; bandwidth-delay product is utilised | none |
| Latency (tail) | — | increases by the flush delay (e.g. 2 ms) for messages stuck behind the buffer |
| Fairness | — | a "chatty" control stream can be blocked by bulk data already in the buffer (HOL) |
| Flow control | — | the sender believes bytes left immediately, so window accounting drifts until the flush |
5.3. Mitigating Head-of-Line (HOL) Blocking
The primary drawback of a simple coalescing buffer is Head-of-Line (HOL) blocking, where a greedy stream fills the buffer and forces urgent frames from other streams to wait. The following table outlines a series of incremental techniques to mitigate this issue.
| Level | Idea | How it works | Why it helps |
|---|---|---|---|
| 0 | Fast-lane bypass ("don't batch at all for control frames") | When a frame is tagged Ping, WindowUpdate, or any application-defined high-priority tag, flush immediately and skip the buffer. | Urgent frames never wait behind bulk data. |
| 1 | Reserve space in the buffer | Divide the 64 KiB buffer into two "buckets": e.g. 48 KiB for regular data, 16 KiB kept free for priority frames. Before writing bulk data, check `bufWriter.Buffered()`; if it exceeds 48 KiB, flush first. | Guarantees that, even at maximum fill, a high-priority frame can still enter the buffer right away and ride the same flush. |
| 2 | Early flush on high watermark (80%) | If buffer occupancy crosses 80% of capacity, trigger an immediate flush regardless of the timer. | Prevents a single stream from monopolising the buffer and keeps latency bounded even under heavy bulk writes. |
| 3 | Per-stream round-robin enqueue | Instead of writing directly into the buffer from whichever stream calls `Write()`, place frames into per-stream queues and have the sendLoop pull from them in round-robin order. | Ensures no single stream can starve others, even before frames enter the coalescing buffer. |
| 4 | Two-queue scheduler ("CoDel light") | Maintain high-priority and low-priority frame queues. Drain high-priority first; only drain low-priority when high-priority is empty, or after sending N consecutive high-priority frames, to prevent starvation. | Gives latency isolation without needing per-stream metadata in the buffer itself. |
| 5 | Dynamic flush delay | Keep a moving RTT estimate and set `flushDelay = min(1 ms, RTT/4)`. High-latency links still benefit from batching, while on LAN links the delay shrinks to a few hundred µs, shrinking HOL stalls. | Adapts automatically to network conditions; no one-size-fits-all constant. |
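For example, levels 1 and 2 can be layered onto the hypothetical `coalescingWriter` from Section 2 with only a few lines. The 48 KiB bulk ceiling and 80% watermark below are the illustrative values from the table above, not tuned constants from the actual patch.

```go
// Illustrative combination of level 1 (reserved headroom for priority frames)
// and level 2 (early flush on a high watermark).
const (
	bufferSize    = 64 * 1024             // total coalescing buffer
	bulkCeiling   = 48 * 1024             // bulk data stops here; 16 KiB stays free for priority frames
	highWatermark = bufferSize * 80 / 100 // flush immediately once ~80% full
)

func (w *coalescingWriter) writeBulk(frame []byte) error {
	w.mu.Lock()
	defer w.mu.Unlock()

	// Level 1: if bulk data has already used its share of the buffer,
	// flush before accepting more, so a priority frame can always fit.
	if w.buf.Buffered()+len(frame) > bulkCeiling {
		if err := w.buf.Flush(); err != nil {
			return err
		}
	}
	if _, err := w.buf.Write(frame); err != nil {
		return err
	}

	// Level 2: once occupancy crosses the watermark, flush without waiting
	// for the timer, keeping worst-case queueing delay bounded.
	if w.buf.Buffered() >= highWatermark {
		return w.buf.Flush()
	}
	return nil
}
```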
6. Implementation Status
The prototype patch is code-complete and has demonstrated significant gains. However, internal priorities shifted during the development cycle, and ownership of the libp2p-Go module within the EPF cohort was not firmly allocated. To avoid landing a potentially unmaintained patch into the upstream repository, the decision was made to keep the branch private for the time being, publish the design and results, and maintain an open line of communication with the libp2p maintainers to revisit the integration post-fellowship.
7. Future Work