# Network / Transport Working Group - BSDCan 2019 Devsummit
## State of RACK/BBR (lstewart@)
To avoid destabilizing the base TCP stack while supporting development of TCP stack changes, Netflix has added support for pluggable TCP stacks. Randall and others work on newer features in developmental TCP stacks, one of which is RACK (Recent ACKnowledgment). The RACK stack also includes other newer IETF work, such as TLP (Tail Loss Probe) and PRR (Proportional Rate Reduction). All of this is upstreamed, but the modules are not enabled by default. They can be enabled by adding an `EXTRA_TCP_STACKS` option to a kernel config file (as a makeoption, perhaps?).
Should these be enabled by default? jtl@ asserts they were considered experimental in FreeBSD 11, but not as much in FreeBSD 12. Conclusion: we will enable the alternate stacks in the build by default.
What is the status of RACK and BBR in RFCs? BBR is an experimental draft, and RACK is standards track and adopted by the TCP working group. TLP is part of the RACK RFC, but PRR is a separate experimental RFC. Microsoft has fully deployed RACK. It is also the default in Linux.
draft-ietf-tcpm-rack, so still a draft.
The default stack is set via a sysctl. The stack of a given socket can be changed via setsockopt(), even on a live connection. New sockets from accept() inherit the stack of the parent listen socket.
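A concrete sketch of the per-socket interface (assuming an alternate stack module such as tcp_rack has been loaded; `TCP_FUNCTION_BLK` and `struct tcp_function_set` are the names in the upstream code, and the system-wide default is chosen via the net.inet.tcp.functions_default sysctl — treat the details as illustrative):

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>	/* TCP_FUNCTION_BLK, struct tcp_function_set */
#include <err.h>
#include <string.h>

/*
 * Switch an existing socket (even a live connection) to the "rack"
 * stack by naming the function block to use.
 */
static void
use_rack_stack(int s)
{
	struct tcp_function_set tfs;

	memset(&tfs, 0, sizeof(tfs));
	strlcpy(tfs.function_set_name, "rack", sizeof(tfs.function_set_name));
	if (setsockopt(s, IPPROTO_TCP, TCP_FUNCTION_BLK, &tfs,
	    sizeof(tfs)) == -1)
		err(1, "setsockopt(TCP_FUNCTION_BLK)");
}
```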
RACK is relatively stable and the bulk of fast-paced development has subsided. Netflix still needs to verify that all of RACK is upstreamed to FreeBSD.
BBR is a congestion control algorithm Google published towards the end of 2016. They rolled it out internally and eventually on the internet. YouTube now uses BBR.
Randall has a review on phabricator [D19878](https://reviews.freebsd.org/D19878) to sync FreeBSD's RACK with Netflix's soon-to-be production RACK.
Randall forked the RACK stack when the BBR paper was published and used it to develop a BBR stack. They have been doing A/B testing against the stock RACK stack.
They have not done BBRv2. They have a bug-for-bug-compatible BBRv1, and a second v1-inspired BBR that is better tuned for Netflix's workload.
They don't want to upstream BBR until they feel confident that it is a "BBRv1 certified" stack. They think they will be ready to publish v1 in the not-too-distant future.
BBR doesn't use the modular congestion control framework because CC doesn't have the right hooks. Lawrence wants to revisit the CC framework to fix that eventually, but initially BBR will be a separate stack rather than a CC algorithm.
BBRv1 ignores packet loss, on the assumption that it isn't a valid indicator of congestion. However, networks have evolved over time around existing TCP and intentionally use loss to signal congestion. BBRv2 adds back a form of loss-based congestion signal, and uses a variant of ECN.
## Hardware/Software Pacing
As of a year ago, there is now a software pacing system (TCPHPTS, the TCP high-precision timer system) that aims to give timer precision on the order of a few microseconds. BBR uses it for pacing; RACK uses it for burst mitigation.
There is also hardware pacing (RATELIMIT) available for both cxgbe and mlx5. Intel has some hardware that can do it, but it is not currently supported in their FreeBSD drivers.
Could we add support via iflib? Some speculation that a common abstraction could be pulled up into iflib.
A common abstraction for hardware/software pacing was discussed last year; no one has worked on it since. For BBR, Randall has some code that handles hardware vs. software pacing, but the software pacing lives at the transport layer, so it sits at a different layer and has different semantics from the hardware pacing.
jhb wants to clarify where the layer of abstraction needs to reside, e.g. if it's the application, the existing socket option is fine. jtl is thinking about an abstraction inside the kernel for use among various in-kernel consumers. Hardware rates are often fixed values, and how you map a request to the available rates depends on the policy implementation, e.g. "give me the closest rate at or above X". A sketch of one such policy follows.
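Everything below is invented for illustration (it is not an existing kernel interface): given a NIC's table of fixed supported rates, pick the closest rate at or above the request.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical policy helper: return the closest supported rate at or
 * above want_bps; if the request exceeds every table entry, fall back
 * to the fastest rate the hardware offers. Other policies (closest at
 * or below, reject the request) are equally plausible, which is why a
 * kernel abstraction is attractive rather than baking a policy into
 * each driver.
 */
static uint64_t
pick_hw_rate(const uint64_t *rates, size_t nrates, uint64_t want_bps)
{
	uint64_t best = 0, fastest = 0;

	for (size_t i = 0; i < nrates; i++) {
		if (rates[i] > fastest)
			fastest = rates[i];
		if (rates[i] >= want_bps && (best == 0 || rates[i] < best))
			best = rates[i];
	}
	return (best != 0 ? best : fastest);
}
```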
lstewart notes that Linux-style qdisc pacing, which is transport-agnostic and sits logically below the transport layer, is quite different from the transport-layer pacing we currently have and use, e.g. in the BBR stack. There may be reason to consider a transport-agnostic pacer in the broader HW/SW pacing discussion.
Someone from iX was thinking that pacing may provide a useful control knob for storage-system QoS, e.g. to keep concurrent users from stomping on one another.
Navdeep asked if an interface-wide rate limit would be valuable, e.g. for jails. jtl asked the room and no one had an immediate use case.
## TLS offload
TLS offload refers to two things:
- offloading crypto from userland to the kernel
- offloading from the kernel to the NIC
Both are helpful for outbound-heavy TLS workloads.
jhb is working on portions of NIC TLS offload in collaboration with other folks who have worked on it. The goal is to allow a sendfile-like data model for TLS connections. The TLS work depends on a new type of external mbuf backed by an array of physical addresses. Patches for the mbuf changes exist for Chelsio and mlx5.
Kernel software TLS makes use of the same idea as "not-ready buffers waiting for disk I/O", instead marking mbufs as "not ready, waiting for crypto". It doesn't use the in-kernel crypto framework; instead, BoringSSL bits are plugged into the kernel as a module.
Kernel hardware TLS reuses the send tags used for RATELIMIT to allocate TLS sessions on a NIC. mbufs are tagged with a handle to the TLS session. The IP output layer ensures unencrypted TLS frames don't go out on the wire. The NIC driver has to recognize TLS mbufs and ensure they are encrypted and segmented.
The kernel <-> user API for both software and hardware TLS is a new socket option to set up a TLS session with keys. It does not currently support key renegotiation, but it should be possible to support that using the existing socket option. There is also a socket option to query/set software vs. hardware TLS offload. One important part of the API is that userland just writes application data via write() or sendfile(), and the kernel determines the TLS framing at the socket buffer layer so that TCP accounts for the TLS framing overhead correctly. Userland can send a TLS frame with a non-application-data type by using sendmsg() with a special control message that sets the TLS record type.
Another key point is that the TLS offload only covers the "bulk data" part of the connection, after keys have been negotiated. OpenSSL in userland performs the key negotiation, and only the negotiated session keys are handed to the kernel. In addition, the current API only offloads transmit, not receive.
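A sketch of the transmit-side API as described, using names from the patches under review at the time (option, structure, and constant names here are approximate, not the final interface):

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/ktls.h>		/* struct tls_enable, TLS_SET_RECORD_TYPE */
#include <netinet/in.h>
#include <netinet/tcp.h>	/* TCP_TXTLS_ENABLE */
#include <crypto/cryptodev.h>	/* CRYPTO_AES_NIST_GCM_16 */
#include <string.h>

/* Hand the kernel the session keys that OpenSSL negotiated in userland. */
static int
enable_ktls_tx(int s, const uint8_t *key, int keylen,
    const uint8_t *iv, int ivlen)
{
	struct tls_enable en;

	memset(&en, 0, sizeof(en));
	en.cipher_algorithm = CRYPTO_AES_NIST_GCM_16;
	en.cipher_key = key;
	en.cipher_key_len = keylen;
	en.iv = iv;
	en.iv_len = ivlen;
	en.tls_vmajor = TLS_MAJOR_VER_ONE;
	en.tls_vminor = TLS_MINOR_VER_TWO;
	return (setsockopt(s, IPPROTO_TCP, TCP_TXTLS_ENABLE, &en,
	    sizeof(en)));
}

/*
 * Non-application-data records (e.g. alerts) go out via sendmsg() with
 * a TLS_SET_RECORD_TYPE control message; ordinary write()/sendfile()
 * traffic is framed as application data by the socket buffer layer.
 */
static ssize_t
send_tls_record(int s, uint8_t rec_type, void *buf, size_t len)
{
	char cbuf[CMSG_SPACE(sizeof(rec_type))];
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = IPPROTO_TCP;
	cmsg->cmsg_type = TLS_SET_RECORD_TYPE;
	cmsg->cmsg_len = CMSG_LEN(sizeof(rec_type));
	*(uint8_t *)CMSG_DATA(cmsg) = rec_type;
	return (sendmsg(s, &msg, 0));
}
```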
### IRC Q&A:
Q: My use case is this: a privilege-separated process. Handle the key exchange in one process and pass the data. Without kernel TLS this requires a proxy process to encrypt/decrypt; with the right API, the key-exchange process could pass a kernel-TLS-enabled socket to the less trusted process. The problem is receive.
A: The received data is still encrypted TLS records, and right now that decryption is still done in userland. The target workloads are things like web servers with unbalanced traffic (little RX, lots of TX).
(Asker: thanks, that answers my question.)
Q: How does userspace know whether it should send normal data or a control message? A: Only key renegotiation in OpenSSL needs to send non-application data.
Q: Is the syscall API finished? Is it possible to inject kernel-TLS sockets into TLS-unaware processes? A: You should ask questions during the session. :) Kernel TLS depends on a patched OpenSSL.
## NUMA improvements for TCP
Drew created a socket option, used with nginx at Netflix, that associates a listen socket with a NUMA domain to help passive-open affinity. Drew is wondering if others have a use for anything like this and whether the interface could be improved.
It only works on the passive-open side. Navdeep is wondering if we could add smarts for the active-open side.
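The notes don't record the option's name, so the constant below is a placeholder invented for illustration; the shape described is a plain integer-valued socket option on the listen socket:

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <err.h>

/* Placeholder name and value: the real option is whatever Drew's patch
 * defines, not this. */
#define	TCP_LISTEN_NUMA_DOMAIN	0x1000

/*
 * Pin a listen socket's connections to one NUMA domain, so that each
 * nginx worker (itself pinned to that domain) accepts only
 * domain-local connections.
 */
static void
set_listen_numa_domain(int ls, int domain)
{
	if (setsockopt(ls, IPPROTO_TCP, TCP_LISTEN_NUMA_DOMAIN, &domain,
	    sizeof(domain)) == -1)
		err(1, "setsockopt(TCP_LISTEN_NUMA_DOMAIN)");
}
```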
Navdeep asked about mbuf allocators being NUMA-aware. Jeff has a branch on GitHub that makes UMA close to fully NUMA-aware, including one patch in particular that makes freeing to the wrong domain super cheap, so you can mark zones NUMA without issue.
## Epoch design question
Conversion from locks to epochs was done fairly mechanically; in the new world order, we could enter the epoch much earlier in packet processing and protect the entirety of processing, instead of entering/exiting many times as is done currently. Gleb has WIP to take the epoch earlier and hold it longer, but it's a lot of work, and he wants to check that it's the right thing to pursue before continuing (a shape sketch follows the Q&A below).
Question on whether benchmarking has shown a current problem... yes, there is profiling data showing this is in fact a performance issue currently.
pkelsey asked if this would cause a bunch more work to be queued up and done on exit from the epoch section, essentially pessimising the exit point. Answer: no, reclamation work is deferred to a different context.
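A shape sketch using the epoch(9) KPI (the function names are placeholders, not Gleb's actual patch, and the macro spellings are as in -CURRENT around this time):

```c
#include <sys/param.h>
#include <sys/epoch.h>

/* Today (simplified): each layer brackets only its own processing, so a
 * packet traversing the stack enters and exits the net epoch repeatedly. */
static void
layer_input_today(void)
{
	struct epoch_tracker et;

	NET_EPOCH_ENTER(et);
	/* ... one layer's processing ... */
	NET_EPOCH_EXIT(et);
	/* the next layer up enters and exits again, and so on */
}

/* Proposed: enter once near driver input and hold for the whole path.
 * Exiting stays cheap because reclamation is deferred to another
 * context, per the answer to pkelsey above. */
static void
packet_path_proposed(void)
{
	struct epoch_tracker et;

	NET_EPOCH_ENTER(et);
	/* ... ethernet -> IP -> transport processing, one epoch section ... */
	NET_EPOCH_EXIT(et);
}
```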
bjoern, markj +1 Gleb's proposal.
Drew noted a strong dislike for locking macros, which some in the room agree with. Gleb should be able to remove a bunch of lock macros that will become superfluous as a result of his change. pkelsey said it would make his life harder as a downstream consumer of the stack if all lock macros were removed, e.g. for libuinet, as he redefines the macros to no-ops. sjg asked if it would be a significant problem for Juniper... Juniper will be in tonnes of pain on the next update regardless, so it doesn't seem like a deal breaker.
For development, it would be nice to have the equivalent of a WITNESS warning when recursing on an epoch, if only specifically for the net epoch.
## IETF and "some congestion experienced" (SCE)
Rod has a functional implementation of SCE and is involved in the IETF efforts to promote/discuss it. He was curious if there was much interest in the room; no one spoke up. lstewart commented that SCE has had a somewhat lukewarm reception in IETF land and asked if Rod had thoughts on the future of SCE vs. other uses for the ECN bits, à la Accurate ECN. Rod says the intent is to keep working on SCE and progressing it in the IETF, so we'll see... it will essentially be a competing proposal for the bits/interpretation of the bits.
## IN_foo macros
Rough consensus in the room that we should change ifconfig so that omitting the netmask returns an error. There will need to be a warning period before transitioning to a hard error for 13. Also needs a ports exp-run.
Locator/network-ID split... the intent is to focus on integrating with IPv6 only.
Bruce Simpson has a FreeBSD implementation... Bjoern is soliciting anyone with interest to help look at the code with an eye to bringing it into FreeBSD. Bruce's code is very modular and mostly doesn't touch the main kernel code.
## Stack modularity 2nd half
Bjoern asked how you deal with symbol collisions. A: symbol munging, but it's imperfect, as it doesn't deal well with e.g. structs.
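An illustration of that limitation (this is not Netflix's actual build machinery):

```c
/*
 * Function symbols can be made unique per stack with a build-time
 * rename, so two copies of the same code can coexist in one kernel:
 */
#define	tcp_do_segment	rack_tcp_do_segment	/* linker sees a unique symbol */

/*
 * Structs are harder: renaming the tag doesn't help, because the copied
 * stack still exchanges the base stack's structures (e.g. struct tcpcb)
 * with common code, so it cannot grow or reorder their fields without
 * breaking every other consumer of that layout.
 */
struct tcpcb;	/* layout stays fixed by the base stack's definition */
```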
Netflix does do full copies, but references common-code symbols and non-inlined functions where it makes sense to.
Rod asked if CC modules are treated as sub-modules of the stacks and whether there is any ordering dependency. A: no; the stack reaches the CC module via an indirect pointer, and newreno is always present.
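A toy userland model of that arrangement (not the kernel code; the real structures live under sys/netinet/cc):

```c
#include <stdio.h>

/* Each connection carries an indirect pointer to its CC algorithm's
 * function table; a compiled-in newreno entry means the pointer always
 * has a valid target, so there is no module-load ordering dependency. */
struct cc_algo {
	const char *name;
	void (*ack_received)(void);
};

static void
newreno_ack_received(void)
{
	printf("newreno: ack received\n");
}

static struct cc_algo newreno = { "newreno", newreno_ack_received };

struct conn {
	struct cc_algo *cc;	/* indirect pointer, never NULL */
};

int
main(void)
{
	struct conn c = { .cc = &newreno };	/* default when no CC module is loaded */

	c.cc->ack_received();	/* any TCP stack calls through the pointer */
	return (0);
}
```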
Use case for a whole network stack as a module?
Juniper described their abstraction for using their custom stack with FreeBSD drivers (a different layer of abstraction from the TCP modules Netflix uses and upstreamed). They put their whole-network-stack-as-a-module infrastructure proposal up for review some years ago, but didn't get much review feedback. One point of concern from Gleb was the overhead of adding accessors to ifnet. sjg noted prior discussion of a project to break up the ifnet, though there aren't any concrete patches that take this route. Some small bits of the Juniper abstraction patch set got merged, but the bulk of the work has not. Currently only a single network stack can be loaded, but the goal is to allow simultaneous loading and dynamic runtime selection.
Some discussion around the overhead of indirect calls and whether any data has been collected to justify the concerns various people have raised in the past; no one in the room could recall any such data.
Q: What is the long-term goal for whole stacks? Keep both? Unify into one "ideal" design? (This is about whole IP stacks, not just transport protocols.)
Steve stated the goal is for whole stacks to be loadable simultaneously and runtime-selectable for different use cases.
From the modular TCP-stack perspective, Jonathan feels that having soon-to-be three stacks (default, RACK, BBR) will be undesirable and difficult to maintain long term, and that we need to discuss unification.
Comment: this reminds me a lot of https://www.netbsd.org/gallery/presentations/dennis/2015_AsiaBSDCon/BSDNet.pdf
jtl notes having Randall present for a future revisit of this topic would be good.
lstewart put forward a strawman proposal: a current production stack plus an innovation stack that gets pivoted to on some sort of time/release cadence. Plenty of other models could be imagined. To be discussed further at a transport call or devsummit; some food for thought.
Bjoern lamented some KPI/KBI breakage that happened in 11 (context?). Some discussion around the lack of clarity about which structures are part of the KPI/KBI; general consensus on the need for better clarity, probably via some documentation as a starting point. Unclear if anyone was identified to do the work...
sjg proposed a strawman rule: no MFC of struct changes under sys/ without a KPI/KBI-breakage sanity-check review. pkelsey countered that finding the right person at the right time to review is unlikely to always be an option, and that codifying the KPI/KBI rules in a tool that developers and RE folks can run is a better option.
jhb notes that traditionally we've really only tried to guarantee stability for a few things like network drivers and file systems, but don't guarantee (and don't want to) for e.g. the VM system. So what would we mark as covered by our (best-effort) guarantee? sys/sys and sys/net*? Perhaps we could pick a trial rule, run an experiment to see how many merges would have been flagged by it, look at the false-positive rate, and refine to a point where the project could adopt it.
## Context sharing
Things people are working on:
pkelsey: project requirements for 10k+ ALTQ queues, ALTQ at 10 Gbps, and thousands of VLANs. pkelsey has been fixing various bottlenecks to realise these requirements.
Also going to do QinQ support.
bjoern: Working towards being able to compile without any IPv4 (headers gone)
Q: Is anyone working on if_bridge at >=1 Gb/s / 1 Mpps, or on RPVST+ (per-VLAN spanning tree; context: connecting bridges via tunnels carrying multiple VLANs; switches don't like RSTP inside VLANs, e.g. HP switches just STP-block the whole physical port)? What happened to bhyve VPC?
A: Doesn't sound like folks in this room know the answer.
(Asker: thanks; sorry, I picked the wrong format.)