
BSDCam Transport Session

  • Agenda Bashing
  • Linux NetDev Report from thj@
    • Co-located with IETF event
    • Not especially useful for FreeBSD people
    • Things they are doing:
      • tight vendor integration for switch ASICs
      • switchdev API, switch configurations
        • Mellanox, Barefoot, and Cumulus
      • FreeBSD likely to lag behind
      • Barefoot: Intellectual Property in compiler
        • Would be willing to open source spec for configuring ASIC
      • Librification of netfilter tools (firewall rules in JSON)
      • Write firewall config tools in higher level languages
      • What do sysadmins want to have: libraries, JSON, etc.?
      • Demo of netfilter implemented in eBPF
  • Have a "Tell developers what you think/want" session at MeetBSD
    • Getting more feedback from users and sysadmins
  • Have a FreeBSDCon, a devsummit focused on getting users to tell us about their needs/pains/desires
  • Making IPv6 Suck Less
    • Perform Better
    • Missing RFCs
    • thj is implementing RFC7112
    • Roaming WiFi: IPv4 renegotiates DHCP, but SLAAC doesn't get reset
    • jtl's concern: the complexity of headers, cases where host may be instructed to do work.
      • There are some measurements of what % of traffic gets dropped if it has extension headers. Cisco is apparently doing fresh stats on this.
    • jtl would like a sysctl bitmask to ignore specific types of extension headers
    • A bug with v6 fragments: if RSS is enabled, the counter of how many headers have been processed gets reset to 0
    • Optimizations that have only been done to v4, may need to be replicated for v6
    • v4 may be more strictly compliant, v6 is often less compliant
      • v4 would not accept more than 16 fragments
      • bz would like us to be RFC 8200 compliant
    • Who wants to actually work on v6: thj, bz, gallatin@, left 1/2 of rrs@, right 1/2 of tuexen@
    • Old IPv6 todo page: https://wiki.freebsd.org/IPv6/ToDo
    • An equiv to the v4 RFC page: https://wiki.freebsd.org/TransportProtocols/tcp_rfc_compliance
    • We need more test cases, both for things that work (so we don't break them), and for things that are broken (so we know when it is fixed)
      • OpenBSD has a python based v6 test suite that works on FreeBSD
    • tuexen@ has a set of test packages that are ready to be hooked up to CI
    • Take Away: status reports on the bi-weekly transport call
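A minimal sketch of the extension header bitmask idea raised above, written as a userland model rather than kernel code. The protocol numbers are the IANA values for the IPv6 extension headers; the bit assignments in `MASK_BITS` and both function names are hypothetical, and what the stack would actually do on a match (skip the header, drop the packet) is left open, as it was in the session.

```python
# IANA protocol numbers for four IPv6 extension headers.
HOPOPTS, ROUTING, FRAGMENT, DSTOPTS = 0, 43, 44, 60

# Hypothetical bit assignments for a sysctl-style mask (not an existing sysctl).
MASK_BITS = {HOPOPTS: 0x1, ROUTING: 0x2, FRAGMENT: 0x4, DSTOPTS: 0x8}


def walk_ext_headers(first_nh, payload):
    """Yield each extension header type that follows the fixed IPv6 header.

    first_nh is the Next Header value from the fixed header and payload is
    everything after the 40-byte fixed header.
    """
    nh, off = first_nh, 0
    while nh in MASK_BITS:
        if off + 8 > len(payload):
            raise ValueError("truncated extension header")
        yield nh
        next_nh, ext_len = payload[off], payload[off + 1]
        # The Fragment header is a fixed 8 bytes; the others carry their
        # length in 8-byte units, not counting the first 8 bytes.
        size = 8 if nh == FRAGMENT else (ext_len + 1) * 8
        nh, off = next_nh, off + size


def matches_mask(first_nh, payload, mask):
    """Return True if any header in the chain has its bit set in mask."""
    return any(MASK_BITS[t] & mask for t in walk_ext_headers(first_nh, payload))
```

For example, a packet carrying a Hop-by-Hop header followed by a Routing header matches a mask with the Routing bit set, but not a mask with only the Fragment bit set.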
  • IP[46]/TCP Reassembly Bugs/Stuff
    • Researcher found that 'walking linked list is slow, and bad'
    • The kernel created long linked lists for out-of-order TCP segments and fragment chains.
    • IPv6: Used to limit resources very differently than v4, now uses the same vocabulary
    • IPv6 fragments were not hashed into buckets, now they are
    • Performance suffers too much when the list exceeds 100 entries; this is the new limit
      • Mostly just a workaround, papers over the problem. Needs an algorithmic fix
      • If more than a trivial number of fragments, needs a better solution. glebius@ is working on an implementation of fragment processing code using a red-black tree. Needs a security review. Is the performance impact acceptable?
    • TCP: rrs@ working on coalescing code
      • Updated version coming to phabricator soon
      • tuexen@ wrote test cases for reassembly
    • jhb@ and jtl@ have a todo list
      • use queue.h
    • v6 code requires changes in many places
    • Need a modernization pass, remove #ifdef KAME etc
      • Too much noise in the code, harder to read and reason about
      • Need a regression suite
      • Give it the FreeBSD stink
    • bz@ may have old project in perforce that does some cleanup, likely applies fairly well
    • Todo: pf
    • brooks@ would prefer a cleanup of the IOCTLs
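A rough model of the two ideas in this section: cap the out-of-order queue (100 is the limit quoted above) and find the insert position by binary search instead of walking a linked list. This is a sketch under stated assumptions, not the kernel code: the class and method names are hypothetical, and a sorted Python list with `bisect` only stands in for a balanced tree (the search is O(log n), but the list insert itself still shifts elements).

```python
import bisect

MAX_QUEUE_LEN = 100  # the limit quoted in the notes


class ReassemblyQueue:
    """Toy out-of-order segment queue keyed by starting sequence offset."""

    def __init__(self, limit=MAX_QUEUE_LEN):
        self.limit = limit
        self.starts = []  # sorted starting offsets
        self.segs = {}    # starting offset -> payload bytes

    def insert(self, start, data):
        """Queue a segment; return False if it is dropped because the queue is full."""
        if start in self.segs:
            # Same starting offset seen again: keep the longer payload.
            if len(data) > len(self.segs[start]):
                self.segs[start] = data
            return True
        if len(self.starts) >= self.limit:
            return False
        bisect.insort(self.starts, start)
        self.segs[start] = data
        return True

    def pop_in_order(self, next_expected):
        """Remove and return the contiguous bytes beginning at next_expected."""
        out = bytearray()
        while self.starts and self.starts[0] <= next_expected:
            start = self.starts.pop(0)
            data = self.segs.pop(start)
            skip = next_expected - start  # bytes already delivered
            if skip < len(data):
                out += data[skip:]
                next_expected = start + len(data)
        return bytes(out)
```

The cap keeps the worst case bounded, which is the workaround described above; the ordered structure is the part that corresponds to the algorithmic fix.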
  • TFO (TCP Fast Open)
    • Who might have patches?
    • Known interop problem with Windows
      • TCP option alignment
    • tuexen@ has test cases for this, need to extract them from him (with pliers)
      • Limelight extension with shared secret
  • Alternate Stacks
    • Infrastructure
      • Allow different TCP stacks concurrently (side-by-side)
      • Use setsockopt() to assign individual sockets to the alternative stacks
      • Requires that when you switch stacks you must update the common TCPcb
      • A/B test stacks, route n% of traffic to the new stack, compare stats from the two stacks
      • Can be used for different workloads
      • Live-patching by loading newer version of stack without rebooting
      • Allows much more active development, frees development from usual requirements (work across low cpu/ram count to high cpu/ram count)
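One way the "route n% of traffic to the new stack" split could be decided is a stable hash of a connection identifier, so the same connection always lands on the same stack and the two populations can be compared afterwards. This is only an illustration: `pick_stack`, the identifier format, and the stack names `stack-a` and `stack-b` are placeholders, not names from the FreeBSD code.

```python
import hashlib


def pick_stack(conn_id, percent_new, default="stack-a", candidate="stack-b"):
    """Deterministically assign a connection to one of two stacks.

    percent_new is the share of connections (0 to 100) sent to the candidate.
    """
    digest = hashlib.sha256(conn_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return candidate if bucket < percent_new else default


# Example: send roughly 10% of connections to the candidate stack.
choices = [pick_stack(f"198.51.100.7:{port}", 10) for port in range(1024, 1034)]
```

In a real deployment the result would drive the per-socket setsockopt() call mentioned above; that call is deliberately left out here.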
    • RACK
      • IETF draft: https://tools.ietf.org/html/draft-ietf-tcpm-rack-04
        • Our code only supports draft -02.
        • Netflix not driven to update at this time
      • Recent ACK + Tail loss probe
        • use RTT to predict when to try to keep transmitting
        • use SACK and RTT to predict when to retransmit
        • PRR (Proportional Rate Reduction, https://tools.ietf.org/html/rfc6937): keep sending more data as you get ACKs instead of waiting for 1/2 of window
        • Burst mitigation, high precision timing system
        • Much better quality of experience
        • Keeps a send map, how many times each segment has been sent, better than old SACK
        • robert@ asks about reducing diff between base stack and RACK
        • Improved recovery
      • In head, higher cost to use
      • Almost all video traffic at Netflix uses RACK
      • Even fill traffic will use RACK eventually
      • Head is a bit different than what Netflix is using right now, head is considered far better
      • Doing new tests to compare 2017 to 2018 stack
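The PRR item above can be made concrete with the per-ACK send count from RFC 6937. The sketch below follows the RFC's pseudocode and variable names (`prr_delivered`, `prr_out`, `RecoverFS`, `pipe`); the clamp to zero at the end and the `simulate` helper are additions for illustration, and none of this is taken from the RACK stack in head.

```python
def prr_sndcnt(prr_delivered, prr_out, ssthresh, recover_fs, pipe,
               delivered_data, mss, conservative=False):
    """Return how much data one ACK permits sending during loss recovery."""
    if pipe > ssthresh:
        # Proportional part: ceil(prr_delivered * ssthresh / RecoverFS),
        # computed with integer arithmetic.
        target = -(-prr_delivered * ssthresh // recover_fs)
        sndcnt = target - prr_out
    else:
        if conservative:  # PRR-CRB
            limit = prr_delivered - prr_out
        else:             # PRR-SSRB
            limit = max(prr_delivered - prr_out, delivered_data) + mss
        sndcnt = min(ssthresh - pipe, limit)
    return max(sndcnt, 0)  # clamp added in this sketch


def simulate(acks, ssthresh, recover_fs, pipe):
    """Feed ACKs that each deliver one segment; return the send count per ACK."""
    prr_delivered = prr_out = 0
    sent = []
    for _ in range(acks):
        prr_delivered += 1
        pipe -= 1  # the ACKed segment has left the network
        n = prr_sndcnt(prr_delivered, prr_out, ssthresh, recover_fs, pipe, 1, 1)
        prr_out += n
        pipe += n
        sent.append(n)
    return sent
```

With 20 segments in flight at the start of recovery and `ssthresh` of 10 (units are segments), `simulate(6, 10, 20, 19)` sends one segment for every second ACK, so the window shrinks gradually instead of stalling until half of it has drained.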
    • BBR
      • Experimental congestion control, but actually a different stack
      • Builds on RACK
      • Even higher cost than RACK
      • BBR v1.0 is controversial
      • Netflix has enhanced this for their implementation
      • In router small buffer scenarios it is unfair to newreno/cubic
      • BBR v2.0 looks to improve this
      • Netflix not necessarily sold on Google's ideas
      • Assumes loss is not congestion based
      • "Policer detection" to notice when you are being rate limited by a middlebox
  • "Blackbox Recorder"
    • Volunteer to make ports? :-)
    • Came from Netflix
    • Log state of TCPcb, the packet, timers, other data to ring buffer
    • Can be dumped out to userspace
    • Tooling exists, needs ports
    • Writes out pcapng files
    • Traceviewer provides visual interface
    • Analysis daemon that runs continuously and runs tests against the data, in the form of assertions
    • After panic, can extract the data from the ring buffer
    • RACK and BBR development depended upon blackbox
    • Extend wireshark to understand the metadata
      • Attend SharkFest to present FreeBSD work
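A toy model of the two pieces described above: a fixed-size ring buffer of per-event records, and an analysis pass that expresses its checks as assertions over those records. The class, the record fields (`seq`, `cwnd`), and the `failing` helper are all hypothetical; the real tooling and its record format are not shown in these notes.

```python
from collections import deque


class BlackBox:
    """Keep only the most recent `capacity` event records."""

    def __init__(self, capacity):
        self.ring = deque(maxlen=capacity)  # oldest entries fall off the front

    def log(self, **event):
        self.ring.append(event)

    def dump(self):
        """Return the surviving records, oldest first."""
        return list(self.ring)


def failing(records, assertion):
    """Return the records for which the assertion does not hold."""
    return [r for r in records if not assertion(r)]


bb = BlackBox(capacity=3)
for i in range(5):
    bb.log(seq=i, cwnd=10 * i)

# Only the last three events survive; one of them violates the check.
bad = failing(bb.dump(), lambda r: r["cwnd"] <= 30)
```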
  • RCU "Locking"
    • mmacy@ applied RCU to IP stack
    • Requires a mindset change
    • read-locks are not always "locks"
      • Register your intention to read the data structure
      • ConcurrencyKit will not garbage collect the data while you are using it
    • In 13 we should shift to using these more
    • To date we only have a first pass
    • More thought about which data structures require "full" locks
    • Make engineering decisions to use the new CK features more
    • Avoid "lock chains" that require acquiring many locks in a sequence
    • Rethink locking from a more fundamental perspective
    • Used to allow add/remove from list, while another process is walking through the list
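The "register your intention to read" idea can be modelled as epoch bookkeeping: a reader records the epoch when it starts, a writer unlinks an item and retires it, and the item is only freed once no reader that started before the unlink is still active. The sketch below is a single-threaded model of that bookkeeping only. It is not the ConcurrencyKit API, it has no atomics or memory barriers, and every name in it is hypothetical.

```python
class EpochDomain:
    """Single-threaded model of epoch-based deferred reclamation."""

    def __init__(self):
        self.epoch = 0
        self.readers = {}  # reader id -> epoch observed at entry
        self.retired = []  # (epoch at retirement, object)
        self.freed = []    # objects that were safe to free, in order

    def read_begin(self, rid):
        self.readers[rid] = self.epoch

    def read_end(self, rid):
        del self.readers[rid]
        self.reclaim()

    def retire(self, obj):
        """Called by a writer after unlinking obj from the shared structure."""
        self.retired.append((self.epoch, obj))
        self.epoch += 1
        self.reclaim()

    def reclaim(self):
        # A reader that entered at or before the retirement epoch may still
        # hold a reference, so the object must be kept for now.
        oldest = min(self.readers.values(), default=self.epoch)
        keep = []
        for retired_at, obj in self.retired:
            if retired_at < oldest:
                self.freed.append(obj)
            else:
                keep.append((retired_at, obj))
        self.retired = keep
```

For example, if reader `a` is walking a list when a node is retired, the node stays allocated until `a` finishes, even if reader `b` starts afterwards; with no readers active, a retired node is freed immediately.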
  • Netflix is committed to upstreaming and being good community citizens