BSDCam Transport Session

Agenda Bashing
Linux NetDev Report from thj@
- Co-located with IETF event
- Not especially useful for FreeBSD people
- Things they are doing:
  - tight vendor integration for switch ASICs
  - switchdev API, switch configurations
    - Mellanox, Barefoot, and Cumulus
  - FreeBSD likely to lag behind
  - Barefoot: Intellectual Property in compiler
    - Would be willing to open source spec for configuring ASIC
  - Librification of netfilter tools (firewall rules in JSON)
  - Write firewall config tools in higher level languages
  - Open question: what do sysadmins want (libraries, JSON, etc.)?
  - Demo of netfilter implemented in eBPF
Have a "Tell developers what you think/want" session at MeetBSD
- Getting more feedback from users and sysadmins
Have a FreeBSDCon, a devsummit focused on getting users to tell us about their needs/pains/desires
Making IPv6 Suck Less
- Perform Better
- Missing RFCs
- thj is implementing RFC7112
- Roaming WiFi: IPv4 renegotiates DHCP, but SLAAC doesn't get reset
- jtl's concern: the complexity of headers, cases where host may be instructed to do work.
  - There are some measurements of what % of traffic gets dropped if it has extension headers. Cisco is apparently doing fresh stats on this.
- jtl would like a sysctl bitmask to ignore specific types of extension headers
- A bug with v6 fragments: if RSS is enabled, the counter of how many headers have been processed gets reset to 0
- Optimizations that have only been done to v4, may need to be replicated for v6
- v4 may be more strictly compliant, v6 is often less compliant
  - v4 would not accept more than 16 fragments
  - bz would like us to be RFC 8200 compliant
- Who wants to actually work on v6: thj, bz, gallatin@, left 1/2 of rrs@, right 1/2 of tuexen@
- Old ipv6 todo page: https://wiki.freebsd.org/IPv6/ToDo
- An equiv to the v4 RFC page: https://wiki.freebsd.org/TransportProtocols/tcp_rfc_compliance
- We need more test cases, both for things that work (so we don't break them), and for things that are broken (so we know when it is fixed)
  - OpenBSD has a python based v6 test suite that works on FreeBSD
- tuexen@ has a set of test packages that are ready to be hooked up to CI
- Take Away: status reports on the bi-weekly transport call
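
The sysctl bitmask jtl suggested could be sketched roughly as follows. This is a minimal Python illustration, not FreeBSD code: the bit assignments in `IGNORE_BITS` and both function names are hypothetical, and "ignore" is interpreted here as simply reporting which headers in the chain the mask selects. Only the next-header values (0, 43, 44, 60) come from the IPv6 specification.

```python
# IPv6 next-header values for the common extension headers.
HOP_BY_HOP, ROUTING, FRAGMENT, DEST_OPTS = 0, 43, 44, 60

# Hypothetical bit assignments for an "ignore these headers" bitmask.
IGNORE_BITS = {HOP_BY_HOP: 0x1, ROUTING: 0x2, FRAGMENT: 0x4, DEST_OPTS: 0x8}


def walk_extension_headers(first_next_header, payload):
    """Yield (header_type, header_bytes) for each extension header.

    `payload` is everything after the fixed 40-byte IPv6 header. The walk
    assumes an unfragmented packet or a first fragment, so that any bytes
    following a fragment header are still part of the header chain.
    """
    nh, off = first_next_header, 0
    while nh in IGNORE_BITS:
        if off + 8 > len(payload):
            raise ValueError("truncated extension header")
        next_nh = payload[off]
        # The fragment header is always 8 bytes; the others carry a length
        # field counted in 8-byte units, not including the first 8 bytes.
        length = 8 if nh == FRAGMENT else (payload[off + 1] + 1) * 8
        if off + length > len(payload):
            raise ValueError("truncated extension header")
        yield nh, payload[off:off + length]
        off += length
        nh = next_nh


def ignored_header_types(mask, first_next_header, payload):
    """Return the header types in the chain whose bit is set in `mask`."""
    return [nh for nh, _ in walk_extension_headers(first_next_header, payload)
            if IGNORE_BITS[nh] & mask]
```

For example, with a hop-by-hop header followed by a routing header, a mask of `0x2` would select only the routing header.
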
IP[46]/TCP Reassembly Bugs/Stuff
- Researcher found that 'walking linked list is slow, and bad'
- The kernel created long linked lists for out-of-order TCP segments and fragment chains.
- IPv6: used to limit resources very differently than v4, now uses the same vocabulary
- IPv6 fragments were not hashed into buckets, now they are
- Performance suffers too much when the list exceeds 100 entries, so this is the new limit
  - Mostly just a workaround that papers over the problem. Needs an algorithmic fix
  - If there are more than a trivial number of fragments, a better solution is needed. glebius@ is working on an implementation of the fragment processing code using a red-black tree. It needs a security review, and it is an open question whether the performance impact is acceptable.
- TCP: rrs@ working on coalescing code
  - Updated version coming to phabricator soon
  - tuexen@ wrote test cases for reassembly
- jhb@ and jtl@ have a todo list
  - use queue.h
- v6 code requires changes in many places
- Need a modernization pass, remove #ifdef KAME etc
  - Too much noise in the code, harder to read and reason about
  - Need a regression suite
  - Give it the FreeBSD stink™
- bz@ may have old project in perforce that does some cleanup, likely applies fairly well
- Todo: pf
- brooks@ would prefer a cleanup of the IOCTLs
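
To illustrate the difference between walking a linked list and keeping the out-of-order data in an ordered structure with a hard cap, here is a minimal Python sketch. It is not the FreeBSD code and not glebius@'s red-black tree: `ReassemblyQueue` and its methods are hypothetical names, a sorted list with binary search stands in for the tree, and overlapping segments are not handled. The default limit of 100 is the figure from the notes above.

```python
import bisect

MAX_SEGMENTS = 100  # cap mentioned in the notes


class ReassemblyQueue:
    """Holds out-of-order segments sorted by their starting offset."""

    def __init__(self, limit=MAX_SEGMENTS):
        self.limit = limit
        self.starts = []    # sorted list of start offsets
        self.segments = {}  # start offset -> payload bytes

    def insert(self, start, data):
        """Queue a segment; return False if it is a duplicate or over the cap."""
        if start in self.segments or len(self.starts) >= self.limit:
            return False
        # Binary search finds the position in O(log n); the list shift is
        # still O(n), which a balanced tree would avoid.
        bisect.insort(self.starts, start)
        self.segments[start] = data
        return True

    def pop_contiguous(self, next_offset):
        """Remove and return the bytes that are in order from next_offset."""
        out = bytearray()
        while self.starts and self.starts[0] == next_offset:
            data = self.segments.pop(self.starts.pop(0))
            out += data
            next_offset += len(data)
        return bytes(out)
```

The cap bounds the work an attacker can force per queue, while the ordered structure addresses the underlying cost of finding the insertion point.
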
TFO (TCP Fast Open)
- Who might have patches?
- Known interop problem with Windows
  - TCP option alignment
- tuexen@ has test cases for this, need to extract them from him (with pliers)
  - Limelight extension with shared secret
Alternate Stacks
- Infrastructure
  - Allow different TCP stacks concurrently (side-by-side)
  - Use setsockopt() to assign individual sockets to the alternative stacks
  - Requires that when you switch stacks you must update the common TCPcb
  - A/B test stacks, route n% of traffic to the new stack, compare stats from the two stacks
  - Can be used for different workloads
  - Live-patching by loading newer version of stack without rebooting
  - Allows much more active development, frees development from usual requirements (work across low cpu/ram count to high cpu/ram count)
- RACK
  - IETF draft: https://tools.ietf.org/html/draft-ietf-tcpm-rack-04
    - Our code only supports draft -02.
    - Netflix not driven to update at this time
  - Recent ACK + Tail loss probe
    - use RTT to predict when to try to keep transmitting
    - use SACK and RTT to predict when to retransmit
    - PRR (Proportional Rate Reduction, https://tools.ietf.org/html/rfc6937): keep sending more data as you get ACKs instead of waiting for 1/2 of window
    - Burst mitigation, high precision timing system
    - Much better quality of experience
    - Keeps a send map, how many times each segment has been sent, better than old SACK
    - robert@ asks about reducing diff between base stack and RACK
    - Improved recovery
  - In head, higher cost to use
  - Almost all video traffic at Netflix uses RACK
  - Even fill traffic will use RACK eventually
  - Head is a bit different than what Netflix is using right now, head is considered far better
  - Doing new tests to compare 2017 to 2018 stack
- BBR
  - Experimental congestion control, but actually a different stack
  - Builds on RACK
  - Even higher cost than RACK
  - BBR v1.0 is controversial
  - Netflix has enhanced this for their implementation
  - In router small buffer scenarios it is unfair to newreno/cubic
  - BBR v2.0 looks to improve this
  - Netflix not necessarily sold on Google's ideas
  - Assumes loss is not congestion based
  - "Policer detection" to notice when you are being rate limited by a middlebox
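
The A/B testing idea from the infrastructure notes (route n% of traffic to the new stack, then compare stats) can be sketched as a deterministic per-connection choice. This is a Python illustration only: `pick_stack`, the connection identifier format, and the stack names `"default"` and `"candidate"` are all hypothetical. In the real system the chosen stack would be applied per socket with setsockopt(), as described above.

```python
import hashlib


def pick_stack(conn_id, percent_new, default="default", candidate="candidate"):
    """Return the stack name for a connection.

    Hashing the connection identifier gives each connection a stable
    bucket in 0..99, so the same connection always gets the same answer
    and roughly `percent_new` percent of connections get the candidate.
    """
    digest = hashlib.sha256(conn_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return candidate if bucket < percent_new else default
```

A deterministic hash, rather than a random draw, keeps the assignment reproducible when comparing stats from the two stacks afterwards.
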
"Blackbox Recorder"
- Volunteer to make ports? :-)
- Came from Netflix
- Log state of TCPcb, the packet, timers, other data to ring buffer
- Can be dumped out to userspace
- Tooling exists, needs ports
- Writes out pcapng files
- Traceviewer provides visual interface
- Analysis daemon that runs continuously and runs tests against the data, in the form of assertions
- After panic, can extract the data from the ring buffer
- RACK and BBR development depended upon blackbox
- Extend wireshark to understand the metadata
  - Attend SharkFest to present FreeBSD work
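
The recorder's basic shape, a fixed-size ring buffer of state snapshots with assertions run over the dump, can be sketched in a few lines of Python. This is not the Netflix tooling: `BlackBox`, `failed_assertions`, and the record fields are hypothetical, chosen only to show the idea.

```python
from collections import deque


class BlackBox:
    """Fixed-size ring buffer: the oldest records are overwritten first."""

    def __init__(self, capacity):
        self._ring = deque(maxlen=capacity)

    def log(self, event, **state):
        """Record an event name together with a snapshot of selected state."""
        self._ring.append({"event": event, **state})

    def dump(self):
        """Return the surviving records, oldest first."""
        return list(self._ring)


def failed_assertions(records, assertions):
    """Run named predicates over every record and list the ones that fail.

    `assertions` maps a name to a function taking one record and
    returning True when the record looks healthy.
    """
    return [(name, rec) for rec in records
            for name, check in assertions.items() if not check(rec)]
```

An analysis pass could then be as simple as `failed_assertions(box.dump(), {"cwnd_positive": lambda r: r["cwnd"] > 0})`.
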
RCU "Locking"
- mmacy@ applied RCU to IP stack
- Requires a mindset change
- read-locks are not always "locks"
  - Register your intention to read the data structure
  - ConcurrencyKit will not garbage collect the data while you are using it
- In 13 we should shift to using these more
- To date we only have a first pass
- More thought needed about which data structures require "full" locks
- Make engineering decisions to use the new CK features more
- Avoid "lock chains" that require acquiring many locks in a sequence
- Rethink locking from a more fundamental perspective
- Used to allow add/remove from list, while another process is walking through the list
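
The last point, changing a list while a reader is still walking it, is the core of the read-copy-update idea and can be shown with a small Python sketch. This is an analogy under stated assumptions, not how ConcurrencyKit works internally: `RcuList` is a hypothetical class, writers publish a fresh immutable copy, and Python's own memory management stands in for the deferred reclamation that the kernel has to do explicitly.

```python
import threading


class RcuList:
    """Readers take a snapshot without locking; writers copy and publish."""

    def __init__(self, items=()):
        self._snapshot = tuple(items)
        self._write_lock = threading.Lock()  # serializes writers only

    def read(self):
        """Return the current snapshot; it never changes once obtained."""
        return self._snapshot

    def add(self, item):
        with self._write_lock:
            # Build a new tuple, then publish it with a single assignment.
            self._snapshot = self._snapshot + (item,)

    def remove(self, item):
        with self._write_lock:
            self._snapshot = tuple(x for x in self._snapshot if x != item)
```

A reader that started iterating before a removal still sees the old contents, and the old copy is only freed once no reader holds it any more.
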
Netflix is committed to upstreaming and being good community citizens
