
Optimizing Yamux Frame Transmission: A Study on Write Coalescing for libp2p

Author: Rose (rose2221)
Associated branch: github.com/rose2221/go-yamux/tree/feat/better-buffering

Abstract
Frame batching in Yamux, a core component of libp2p, delivers significant performance improvements: in our benchmarks it roughly doubles throughput and cuts syscalls by a factor of ten, at the cost of only 64 KiB of memory per peer session. This paper details the design, benchmarking, and analysis of a write-coalescing prototype for the Go implementation of Yamux. The prototype is code-complete; this research documents its benefits pending further discussion with the upstream maintainers.

1. Introduction
Yamux serves as the default stream multiplexer for go-libp2p and is therefore a critical component of go-libp2p-based Ethereum clients such as Prysm; other consensus clients, including Teku and Lodestar, rely on yamux through their own libp2p stacks. In the standard upstream implementation, every Yamux frame is flushed directly to the underlying socket. When operating over a TLS-encrypted connection, this behavior generates one syscall and one TLS record per frame, which demonstrably inflates CPU usage, increases latency, and raises the overall packet count.

This research, conducted as part of the Ethereum Protocol Fellowship (EPF), evaluates whether write coalescing (buffering several frames for a short duration, a few milliseconds, before flushing them to the socket) can yield meaningful performance gains without altering the Yamux wire protocol.

2. Proposed Solution: A Prototype for Write Coalescing
A prototype patch was developed to implement write coalescing in go-yamux. The core of this solution involves a time.Timer that triggers periodic flushes, governed by a WriteCoalesceDelay parameter set to 2 ms. To prevent buffer overflow, an additional size check flushes the buffer once it becomes full.
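A minimal sketch of that loop follows. It illustrates the timer-plus-size-check mechanism only; the session shape (sendCh, shutdownCh) and field names are assumptions for the sake of a self-contained example, not the prototype's actual code.

```go
package coalesce

import (
	"bufio"
	"net"
	"time"
)

// session is a hypothetical stand-in for go-yamux's Session, reduced to
// the fields this sketch needs.
type session struct {
	conn       net.Conn
	sendCh     chan []byte   // encoded frames queued by streams
	shutdownCh chan struct{} // closed on session shutdown
	delay      time.Duration // WriteCoalesceDelay
	size       int           // WriteCoalesceSize, e.g. 64 KiB
}

// sendLoop coalesces outgoing frames and flushes on two triggers: the
// buffer filling up, or the WriteCoalesceDelay timer expiring. Error
// handling is elided for brevity.
func (s *session) sendLoop() {
	if s.delay == 0 {
		// Coalescing disabled: upstream write-through behaviour.
		for frame := range s.sendCh {
			s.conn.Write(frame)
		}
		return
	}

	buf := bufio.NewWriterSize(s.conn, s.size)
	flush := time.NewTimer(s.delay)
	defer flush.Stop()

	for {
		select {
		case frame := <-s.sendCh:
			// Size-based flush: drain the buffer first if this frame
			// would overflow it.
			if buf.Available() < len(frame) {
				buf.Flush()
			}
			buf.Write(frame)
		case <-flush.C:
			// Time-based flush: no frame waits longer than ~delay.
			if buf.Buffered() > 0 {
				buf.Flush()
			}
			flush.Reset(s.delay)
		case <-s.shutdownCh:
			buf.Flush()
			return
		}
	}
}
```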

To control this behavior, two new configuration knobs were introduced to the Config struct:

  • WriteCoalesceDelay: Sets the time-based flush interval.
  • WriteCoalesceSize: Sets the size-based flush threshold.

The feature is selectable at configuration time: setting WriteCoalesceDelay to 0 disables coalescing entirely and restores the original upstream behavior.
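For concreteness, a sketch of how these knobs might appear (the field names are the ones listed above; the rest of go-yamux's Config is elided):

```go
package coalesce

import "time"

// Config sketches the two knobs the prototype adds; all other go-yamux
// configuration fields are elided.
type Config struct {
	// WriteCoalesceDelay is the flush interval. Zero disables
	// batching and restores upstream write-through behaviour.
	WriteCoalesceDelay time.Duration

	// WriteCoalesceSize is the buffer capacity; a full buffer is
	// flushed immediately, regardless of the timer.
	WriteCoalesceSize int
}
```

The benchmarked configuration in Section 4 corresponds to WriteCoalesceDelay: 2 * time.Millisecond and WriteCoalesceSize: 64 * 1024.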

2.1. Addressing Implementation Challenges
A key challenge in batching is the potential delay of critical, time-sensitive messages. The proposed solution bypasses the batching mechanism for important control frames such as Ping and WindowUpdate, as well as Gossipsub control frames. These frames are sent immediately and do not enter the coalescing buffer, ensuring that their latency is not negatively impacted.
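A sketch of the bypass predicate is below. The frame-type values are those of the yamux spec; the urgent flag is a hypothetical application-level tag (e.g. for gossipsub control messages), not an existing go-yamux API.

```go
package coalesce

// Frame types as defined in the yamux spec.
const (
	typeData         uint8 = 0x0
	typeWindowUpdate uint8 = 0x1
	typePing         uint8 = 0x2
	typeGoAway       uint8 = 0x3
)

// bypassCoalescing reports whether a frame must skip the coalescing
// buffer and be written (and flushed) immediately. urgent is a
// hypothetical per-frame tag set by the application for high-priority
// traffic such as gossipsub control messages.
func bypassCoalescing(frameType uint8, urgent bool) bool {
	return frameType == typePing || frameType == typeWindowUpdate || urgent
}
```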

For a potential rollout in a client like Prysm, a phased approach is recommended (a gating sketch follows the list):

  1. Initially, enable batching exclusively for block sync streams (e.g., /eth2/beacon_chain/req/blocks/).
  2. Keep batching disabled for high-priority Gossipsub topics (e.g., /meshsub/1.1.0).
  3. Carefully monitor gossip latency. If performance remains acceptable, the use of batching can be gradually expanded to other stream types.
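One way such per-protocol gating might look in client code, assuming a hook that knows the stream's negotiated protocol ID; upstream yamux configures coalescing per session rather than per stream, so this mapping is purely illustrative.

```go
package coalesce

import (
	"strings"
	"time"
)

// coalesceDelayFor returns the flush delay to use for a stream during a
// phased rollout. Protocol ID prefixes are the examples from the plan
// above; returning 0 keeps upstream write-through behaviour.
func coalesceDelayFor(protocolID string) time.Duration {
	if strings.HasPrefix(protocolID, "/meshsub/") {
		return 0 // phase 2: high-priority gossipsub stays unbatched
	}
	if strings.HasPrefix(protocolID, "/eth2/beacon_chain/req/") {
		return 2 * time.Millisecond // phase 1: batch block-sync streams
	}
	return 0 // everything else: disabled until monitoring says otherwise
}
```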

3. Benchmark Methodology
The performance evaluation was conducted using a controlled and instrumented environment.

  • Hardware: Two Ubuntu 22.04 virtual machines, each configured with 8 vCPUs, 8 GiB of RAM, and a 10 Gbps virtual network interface card (vNIC).
  • Network Shaping: To simulate real-world network conditions, traffic was shaped using tc netem delay 100ms and tc tbf rate 1gbit.
  • Workload: The test workload consisted of 1,000 concurrent streams, with each stream sending a 4 KiB message every 100 ms.
  • Window Size: The Yamux stream window was set to 16 MiB, which is the default for libp2p.
  • Duration: Each test was run for 30 seconds. To ensure data quality, the best and worst runs were discarded, and the reported figures are averages of the remaining runs.
  • Instrumentation: A combination of tools was used for measurement, including strace -c for syscall counting, a custom OpenSSL BIO hook for TLS record tracking, pidstat for CPU utilization, and in-code latency histograms.

4. Results

The introduction of write coalescing yielded substantial performance improvements across all measured metrics. The table below compares the baseline (no batching) against the patched version (2 ms flush delay, 64 KiB buffer).

| Metric (100 ms RTT, 16 MiB window, 4 KiB msg, 1,000 streams) | Baseline (no batching) | Batching (2 ms flush, 64 KiB buf) | Gain |
| --- | --- | --- | --- |
| Throughput (MiB/s) | ~12.4 | ~27.8 | 2.2× |
| write() syscalls / s | ~10,400 | ~1,050 | 9.9× fewer |
| TLS records / s | ~10,100 | ~990 | 9.7× fewer |
| Avg CPU (both peers, 8 vCPU) | 59% | 48% | −11 pp |
| p99 stream latency (µs) | 870 | 310 | 2.8× faster |
| Extra memory / session | ~0 B | +64 KiB | n/a |

Note: These order-of-magnitude improvements remain consistent across a range of 5–150 ms RTT, with gains becoming more pronounced as latency increases.

Figure 1: Throughput improvement with and without batching.

Figure 2: Reduction in write() syscalls and TLS records per second.



4.1. Interpretation of Results
The benchmark results can be attributed to several factors:

  • CPU & TLS Overhead: In workloads characterized by many small frames, the overhead of kernel context switches and TLS record processing becomes a dominant bottleneck. Batching amortizes this cost by consolidating many small writes into a single, larger operation, thereby reducing calls into the kernel and OpenSSL by approximately 90%.

  • Latency: The improvement in p99 latency is counter-intuitive but significant. By merging multiple small TLS records into one, the system avoids head-of-line blocking at the TLS layer, leading to more efficient data transmission and lower tail latencies.

  • Memory Trade-off: The additional memory required for the coalescing buffer is negligible in the context of modern validator hardware. A 64 KiB buffer per peer connection, even with 1,000 peers, would only amount to a 64 MiB worst-case memory footprint.

4.2. Trade-offs of Write Coalescing
Despite the clear benefits, batching introduces a set of trade-offs that must be carefully managed.

| Aspect | Good (why batching helps) | Bad (why batching hurts) |
| --- | --- | --- |
| CPU + syscalls | One big write() instead of 10,000 tiny writes: ≈10× fewer kernel/user crossings | none |
| TLS overhead | Fewer TLS records mean less crypto work and less header bloat | none |
| Throughput | The socket stays full; the bandwidth-delay product is utilised | none |
| Latency (tail) | – | Increases by the flush delay (e.g. 2 ms) for messages stuck behind the buffer |
| Fairness | – | A "chatty" control stream can be blocked by bulk data already in the buffer (HOL) |
| Flow control | – | The sender believes bytes left immediately; window accounting drifts until the flush |

4.3. Mitigating Head-of-Line (HOL) Blocking
The primary drawback of a simple coalescing buffer is Head-of-Line (HOL) blocking, where a greedy stream fills the buffer and forces urgent frames from other streams to wait. The following table outlines a series of incremental techniques to mitigate this issue; a code sketch of levels 0 and 1 follows the table.

| Level | How it works | Why it helps |
| --- | --- | --- |
| 0 – Fast-lane bypass | Don't batch control frames at all: a frame tagged Ping, WindowUpdate, or any application-defined high-priority tag is flushed immediately and skips the buffer. | Urgent frames never wait behind bulk data. |
| 1 – Reserve space in the buffer | Divide the 64 KiB buffer into two buckets, e.g. 48 KiB for regular data and 16 KiB kept free for priority frames. Before writing bulk data, check bufWriter.Buffered(); if it exceeds 48 KiB, flush first. | Guarantees that, even at maximum fill, a high-priority frame can still enter the buffer right away and ride the same flush. |
| 2 – Early flush on high watermark | If buffer occupancy crosses 80% of capacity, trigger an immediate flush regardless of the timer. | Prevents a single stream from monopolising the buffer and keeps latency bounded even under heavy bulk writes. |
| 3 – Per-stream round-robin enqueue | Instead of writing into the buffer from whichever stream calls Write(), place frames into per-stream queues and have the sendLoop pull from them in round-robin order. | Ensures no one stream can starve others, even before frames enter the coalescing buffer. |
| 4 – Two-queue scheduler ("CoDel light") | Maintain high-priority and low-priority frame queues. Drain high-priority first; only drain low-priority while high-priority is empty, or after sending N high-priority frames, to prevent starvation. | Gives latency isolation without needing per-stream metadata in the buffer itself. |
| 5 – Dynamic flush delay | Keep a moving RTT estimate and set flushDelay = min(1 ms, RTT/4). High-latency links still benefit from batching, while on LAN links the delay shrinks to a few hundred µs, shortening HOL stalls. | Adapts automatically to network conditions; no one-size-fits-all constant. |
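To make levels 0 and 1 concrete, here is a minimal Go sketch. The session shape and names (bw, enqueueFrame, highPriority) are illustrative assumptions, not the prototype's actual code; level 2 would simply replace the fixed 48 KiB check with an 80%-of-capacity watermark.

```go
package coalesce

import (
	"bufio"
	"net"
)

const (
	bufSize   = 64 * 1024 // WriteCoalesceSize
	bulkLimit = 48 * 1024 // level 1: bulk data leaves 16 KiB headroom
)

// session is a hypothetical stand-in; bw wraps conn via
// bufio.NewWriterSize(conn, bufSize).
type session struct {
	conn net.Conn
	bw   *bufio.Writer
}

// enqueueFrame applies levels 0-1: priority frames enter immediately and
// force a flush; bulk frames may only fill the buffer up to bulkLimit.
func (s *session) enqueueFrame(frame []byte, highPriority bool) error {
	if highPriority {
		// Levels 0/1: the reserved headroom means a typical priority
		// frame fits without waiting; flush at once so it skips the timer.
		if _, err := s.bw.Write(frame); err != nil {
			return err
		}
		return s.bw.Flush()
	}
	// Level 1: flush first if bulk data would eat into the headroom.
	if s.bw.Buffered()+len(frame) > bulkLimit {
		if err := s.bw.Flush(); err != nil {
			return err
		}
	}
	_, err := s.bw.Write(frame)
	return err
}
```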

5. Implementation Status
The prototype patch is code-complete and has demonstrated significant gains. However, internal priorities shifted during the development cycle, and ownership of the libp2p-Go module within the EPF cohort was not firmly allocated. To avoid landing a potentially unmaintained patch into the upstream repository, the decision was made to keep the branch private for the time being, publish the design and results, and maintain an open line of communication with the libp2p maintainers to revisit the integration post-fellowship.

6. Future Work

The results from this study suggest several promising avenues for future research and development:

  • Implement an adaptive tuning mechanism for WriteCoalesceDelay based on real-time RTT measurements (a sketch follows this list).
  • Explore optional toggling of TCP_NODELAY to further maximize the utilization of full network packets.
  • Port the write-coalescing concept to the stream scheduler in quic-go to enable similar layer-7 batching benefits for QUIC-based transports.
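For the first item, a minimal sketch of the min(1 ms, RTT/4) rule from mitigation level 5, assuming a smoothed RTT estimate is available (e.g. from libp2p's ping service):

```go
package coalesce

import "time"

// adaptiveFlushDelay clamps the flush delay to min(1 ms, RTT/4). The
// smoothed RTT source is assumed to exist and is left to the integrator.
func adaptiveFlushDelay(srtt time.Duration) time.Duration {
	d := srtt / 4
	if d > time.Millisecond {
		d = time.Millisecond // cap: never delay more than 1 ms
	}
	if d <= 0 {
		d = 100 * time.Microsecond // floor so the flush timer stays valid
	}
	return d
}
```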
