---
tags: packet, freebsd, bgp, bird
---

# Faster failover and better load balancing with BGP & ECMP

Previously I've combined [CARP] and [haproxy] for high-availability solutions, but CARP is relatively slow (around 3 seconds) to fail over, and only suited to a single datacentre environment. I'd really like to have a solution that can handle a major screw-up, whether on my part, or provider-induced.

If you get a failover IP or a load-balanced solution from the Big Cloud Vendors, they are using network layer 3 protocols to implement this functionality. This same functionality is also used to wire up the separate physical and logical ISP networks out there into "the internet" as we know it today.

The protocols that allow these networks to constantly exchange availability and reachability information are described in internet RFCs, and there are a large number of open source implementations, as well as a slew of commercial options.

We can use the same functionality that makes the internet resilient to make our own networks resilient too, assuming our providers support it.

With some testing it looks like I have sub-second failover at the IP layer, and almost seamless load balancing at layer 7 as a result. If you accept that any mid-flight transaction that loses its server TCP connection will inevitably have to re-try that transaction, and that your application has to be able to handle that presumably infrequent interruption, this is a really, really sweet solution.

The two pieces we'll look into today are Border Gateway Protocol v4 (just "BGP", as the older versions are effectively gone now) and Equal Cost Multi-Pathing, or ECMP.

As with all my work, this is based on FreeBSD and uses a number of open source technologies, but you could do this on any modern UNIX.

> Caveat dear reader, this is new tech for me and while I've attempted
> to verify my understanding, I may have made grievous mistakes that could
> endanger many kittens, and put countless cat GIFs at risk should you
> implement this in your own environment without suitable guidance.
> Corrections welcomed, [email] or [mastodon].

## [BGP] TLDR

A protocol for exchanging network reachability information, over TCP port 179, with explicitly configured neighbours. BGP daemons *usually* (but not always, viz ExaBGP) also interact with the server's routing services, usually called the FIB (forwarding information base), to provide the kernel with optimal routes (based on pricing, transit agreements, appropriate shortest or fastest paths).

Some good resources are:

- https://vincent.bernat.ch/en/blog/2013-exabgp-highavailability what we want to end up with
- https://en.wikipedia.org/wiki/Border_Gateway_Protocol
- https://man.openbsd.org/bgpd a really good explanation of the rules
- https://twitter.com/bgpstream everything is broken all the time
- https://www.inetdaemon.com/tutorials/internet/ip/routing/bgp/ the internals
- https://www.bgp4.as/ so much BGP stuff here
- https://tools.ietf.org/html/rfc4271 the official RFC

There are 2 common modes of using BGP: "iBGP", where the peers are routers inside your own network and the routing is not visible externally, and "eBGP", where the peers belong to separate networks and the rest of the internet can see your routing changes. The protocol is essentially the same in both modes, however operationally it is a significant change. Failover *within* your own network can be sub-second, but externally on the open internet you are relying on upstream routers to propagate changes. BFD (Bidirectional Forwarding Detection), which BGP daemons can use to notice a dead neighbour faster, allows this to be improved, but I've not yet explored this functionality.

Each BGP "separate network" is known as an Autonomous System, or AS.
AS numbers are assigned by the regional internet registries, such as RIPE (for Europe), on behalf of IANA, and there is a set of private AS numbers (64512 to 65534 in the original 16-bit range) for internal usage, just as RFC1918 addresses are used for IP ranges internal to your network today.

There are many other routing protocols in use today, but BGP is both the most common on the internet, and the one you're most likely to be able to set up with your ISP or hosting provider today.

> NB BGP has simplistic security using TCP MD5 signatures, but nothing
> more. Better functionality is in the works, but not all providers and
> ISPs have it yet.

## [ECMP] TLDR

Equal Cost Multi Pathing is exactly what it says on the tin. If the IP routing layer is aware of which next-hop routers (in our case servers running our precious applications) are available, then it can route IP datagrams consistently to those available servers based on certain rules. See [RFC2991] and [RFC2992] for very readable details.

[RFC2991]: https://tools.ietf.org/html/rfc2991
[RFC2992]: https://tools.ietf.org/html/rfc2992

If the IP datagrams of a given flow are always routed to the same server, chosen by hashing (for example) the source IP and destination IP, then we can expect that our front end load balancers running [haproxy] will consistently receive TCP flows as intended, barring any failover changes.

It's important to understand that we're playing with fire here - routing is generally an IP (layer 3) concern, and we are actually abusing it by introducing some additional knowledge from the layers above, to gain both load balancing of bandwidth and requests across multiple servers, and additional redundancy.

This tech is called anycast, and was initially used for UDP-based DNS lookups - a perfect fit, as there's no stream or flow involved across datagrams. Thus, *TCP anycast* does introduce some risk, although in my experience it's an excellent tradeoff.

## why BGP on servers?

I've previously implemented reliable systems using [CARP] and [VRRP], which are only suitable for servers within the same datacentre, and aren't suitable for sharing the load across those servers. I would like to be able to move that failure point outside of my servers, and get faster failover time.

A router is just a server with a shitload of network bandwidth and fancy software. It knows how to route around damaged or inefficient parts of the internet, and how to communicate that status to other routers.

We can re-use that same functionality to advertise micro-routes to a single IP or small subnet, and do this on multiple servers. If we assign an IP to each application, and advertise that IP from only active/healthy servers or application instances, then we get an extremely high speed failover implementation. In addition, if the upstream router provides [ECMP], then we get load balancing as well - bonus!

Finally, if we can teach our *software* on our application servers to do their own BGP announcements, we can have a software stack that handles its own failover and load balancing functionality internally. This is not *quite* as easy as the first part, so let's just keep it simple for the moment.

[ECMP]: https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing
[CARP]: https://www.freebsd.org/doc/handbook/carp.html
[VRRP]: https://tools.ietf.org/html/rfc2338

### ipsec

The BGP TCP sessions are secured using TCP-MD5 signatures, which FreeBSD manages through its ipsec key database (the SAD, as shown by `setkey`), available in FreeBSD 11 and up by default. As [bird] does the injection of these keys for you directly, based on the `password` set for the neighbour in the conf file, this makes it really easy. [OpenBGPD] doesn't do this, which is why I ended up with bird. I've since discovered that setting it up by hand is not very tricky, so once I have ipv6 working, expect an [OpenBGPD] example in the future.

Below we can see the ipsec status for 2 connected nodes before (no SAD entries) vs after starting bird, with 2 entries:

```
# setkey -D
No SAD entries.
# birdc enable bgp1
BIRD 1.6.4 ready.
bgp1: enabled
# setkey -D
10.99.71.130 10.99.71.131
        tcp mode=any spi=233523366(0x0deb48a6) reqid=0(0x00000000)
        A: tcp-md5 61343834 66303030 32356439 31374145 31316163
        seq=0x00000000 replay=0 flags=0x00000040 state=mature
        created: Dec 12 21:18:32 2018   current: Dec 12 21:18:34 2018
        diff: 2(s)      hard: 0(s)      soft: 0(s)
        last:                           hard: 0(s)      soft: 0(s)
        current: 0(bytes)       hard: 0(bytes)  soft: 0(bytes)
        allocated: 0    hard: 0 soft: 0
        sadb_seq=1 pid=18787 refcnt=1
10.99.71.131 10.99.71.130
        tcp mode=any spi=4096(0x00001000) reqid=0(0x00000000)
        A: tcp-md5 61343834 66303030 32356439 31374145 31316163
        seq=0x00000000 replay=0 flags=0x00000040 state=mature
        created: Dec 12 21:18:32 2018   current: Dec 12 21:18:34 2018
        diff: 2(s)      hard: 0(s)      soft: 0(s)
        last:                           hard: 0(s)      soft: 0(s)
        current: 0(bytes)       hard: 0(bytes)  soft: 0(bytes)
        allocated: 0    hard: 0 soft: 0
        sadb_seq=0 pid=18787 refcnt=1
#
```

### manage setkey manually

- the SPI values (`0x1000` here) are arbitrary but must be unique

```
# echo 'add 10.99.71.131 10.99.71.130 tcp 0x1000 -A tcp-md5 "very_secret";' | setkey -c
# setkey -D
10.99.71.131 10.99.71.130
        tcp mode=any spi=4096(0x00001000) reqid=0(0x00000000)
        A: tcp-md5 76657279 5f736563 726574
        seq=0x00000000 replay=0 flags=0x00000040 state=mature
        created: Mar 19 13:44:59 2019   current: Mar 19 13:45:08 2019
        diff: 9(s)      hard: 0(s)      soft: 0(s)
        last: Mar 19 13:45:04 2019      hard: 0(s)      soft: 0(s)
        current: 93(bytes)      hard: 0(bytes)  soft: 0(bytes)
        allocated: 1    hard: 0 soft: 0
        sadb_seq=0 pid=41924 refcnt=1
```

## manual setup

- enable forwarding in the kernel
- set up the lo interface to present our external IP
- set up ipsec
- start the router

```
# not actually required but could be useful
# inject the routes only into (e.g.)
# a jail
# echo 'net.fibs="16"' >> /boot/loader.conf
echo net.inet.ip.forwarding=1 >> /etc/sysctl.conf
echo net.inet6.ip6.forwarding=1 >> /etc/sysctl.conf
ifconfig lo1 create inet 147.75.194.20/32 up name bgp1
kldload tcpmd5
touch /etc/ipsec.conf
sysrc routing_enable=YES
sysrc ipsec_enable=YES
service ipsec start
service sysctl reload
service routing start
```

## FreeBSD config

```
hostname="f01.skunkwerks.at"
# changes
cloned_interfaces="lagg0"
ifconfig_igb0="up"
ifconfig_igb1="up"
ifconfig_lagg0="laggproto loadbalance laggport igb0 laggport igb1"
ifconfig_lagg0_alias0="inet <public_ip> netmask 255.255.255.254"
ifconfig_lagg0_alias1="inet <rfc1918_ip> netmask 255.255.255.254"
ifconfig_lagg0_ipv6="inet6 <public_ip6> prefixlen 127"
defaultrouter="<ip4_upstream_router>"
ipv6_defaultrouter="<ip6_upstream_router>"
static_routes="private"
route_private="-net <rfc1918_subnet> <rfc1918_upstream_router>"
cloned_interfaces="${cloned_interfaces} lo1"
kld_list="${kld_list} tcpmd5"
ifconfig_lo1="inet <floating_ip/32> rxcsum txcsum up"
routing_enable="YES"
ipsec_enable="YES"
bird_enable="YES"
...
```

```
# /usr/local/etc/bird.conf

# various options for logging, not mutually exclusive
log stderr all;
log syslog all;
log "/var/log/bird.log" { debug, trace, info, remote, warning, error, auth, fatal, bug };

filter packet_bgp {
    # the IP range(s) to announce via BGP from this machine
    if net = <floating_ip/32> then accept;
}

router id <rfc1918_ip>; # this server's IP address

debug protocols { events, states };
watchdog warning 5s;
watchdog timeout 30s;

protocol direct {
    interface "bgp1"; # restrict network interfaces it works with (our renamed lo interface)
}

protocol kernel {
    persist;      # don't remove routes on bird shutdown; whether you want this depends on your setup
    scan time 10; # scan kernel routing table every 10 seconds
    import all;   # default is import all
    export all;   # default is export none
}

protocol device {
    scan time 10; # scan interfaces every 10 seconds
}

protocol bgp {
    export filter packet_bgp;
    local as 65000;
    graceful restart yes; # ensure we don't trash the existing routing table on startup
    graceful restart time 0;
    # BFD is only available in later versions
    # bfd graceful;
    # long lived graceful restart yes;
    # long lived stale time 120;
    source address <rfc1918_private_ip>;
    neighbor <rfc1918_upstream_router> as 65530;
    password "...";
}
```

## interacting with birdc

- are we broadcasting and accepting routes?
- what protocols are enabled?
- disable our BGP announcement (aka take ourselves out of the loop)
- check the new status

```
# birdc show status
BIRD 1.6.4 ready.
BIRD 1.6.4
Router ID is 10.99.71.133
Current server time is 2018-12-12 22:59:46
Last reboot on 2018-12-11 11:34:56
Last reconfiguration on 2018-12-12 22:57:08
Daemon is up and running
# birdc show protocols
BIRD 1.6.4 ready.
name     proto    table    state  since       info
direct1  Direct   master   up     2018-12-11
kernel1  Kernel   master   up     2018-12-11
device1  Device   master   up     2018-12-11
bgp1     BGP      master   up     22:53:28    Established
# birdc disable bgp1
BIRD 1.6.4 ready.
bgp1: disabled
# birdc show protocols
BIRD 1.6.4 ready.
name     proto    table    state  since       info
direct1  Direct   master   up     2018-12-11
kernel1  Kernel   master   up     2018-12-11
device1  Device   master   up     2018-12-11
bgp1     BGP      master   down   22:59:56
```

## Testing TCP-MD5 directly

You can use tcpdump or ngrep as usual to see what's happening on the wire. Use `nc -4vSkl 179` to listen on a socket, and have the kernel handle your MD5SIG for you. Use `nc -4v 127.0.0.1 179` to connect. I recommend using IPs initially to be certain you're not having traffic go out some other path or interface unexpectedly - which won't then have the signature applied in-kernel.

## Confusing things and helpful people

If you don't have `source address` set, your packets will not flow.

TCP-MD5SIG can *only* work with direct TCP connections, as both endpoints are part of the signed content. This means that NAT (including CGN) breaks your MD5 signatures. The solution, such as it is, is to use IPv6 and hope for a direct connection, or possibly use some sort of tunnelling software.

Why the floating IP actually lives on a loopback address was kindly clarified by [Peter Hessler](https://bsd.network/@phessler), who does a really neat OpenBGPD introduction course at many BSD events. You should skip this blog and go do his course next time.

The whole ipsec setup wasn't at all clear. While it was trivial to set up with bird, I wasted a lot of time not understanding what the problem was initially. Luckily, another helpful FreeBSD developer, [Olivier Cochard-Labbé](https://plus.google.com/104048332916271832882), helped me on IRC to get some things sorted out with FreeBSD's TCP-MD5 protection. I had already spent half a day misunderstanding this, so his timely intervention was much appreciated by my family :-)

If you want to configure the keys manually in future, simply refer to the last part of https://www.freebsd.org/cgi/man.cgi?query=setkey&sektion=8 and you're done.

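If you do go down the manual route, it helps to be able to check that the key `setkey -D` reports is the one you think you loaded. The `A: tcp-md5` line prints the password as hex bytes in groups of four, so a small helper can render a password in the same layout for comparison. This is only a minimal sketch: `key_to_hex` is a made-up name, not part of setkey or bird, and it assumes a plain ASCII password and the output layout seen in the transcripts earlier.

```shell
#!/bin/sh
# key_to_hex is a hypothetical helper, not part of setkey or bird.
# It prints a password as hex bytes grouped four at a time, the same
# layout as the "A: tcp-md5" line in `setkey -D` output.
key_to_hex() {
    printf '%s' "$1" |
        od -An -v -tx1 |        # dump every byte as two hex digits
        tr -d '[:space:]' |     # join the digits into one long string
        sed -e 's/\(.\{8\}\)/\1 /g' -e 's/ $//'  # regroup as 4-byte words
}

key_to_hex "very_secret"
```

For the `very_secret` example this prints `76657279 5f736563 726574`, which is what the manual `setkey -D` transcript shows for that key.
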
Olivier's [BSDRP] (BSD Router Project) contains a full FreeBSD-based VM and lab to work with, includes [FRRouting] (a fork of [Quagga]) and [bird], and has a great deal of useful information and configuration tips in there. In particular, https://bsdrp.net/documentation/examples and the [lab source](https://github.com/ocochard/BSDRP/blob/master/BSDRP/Files/usr/local/sbin/labconfig) were extremely useful. It is rare to have a fully-fledged self-configured VM environment to refer to where everything works. This is an incredible project.

[OpenBGPD]'s excellent man pages https://man.openbsd.org/bgpd & https://man.openbsd.org/bgpd.conf were very helpful. The authors are friendly and I'm also grateful to Peter for making my first BSDCan conference so enjoyable that I have been to all the ones I could get to since then.

### blog posts

- https://bird.network.cz/?get_doc&f=bird.html&v=16
- https://gitlab.labs.nic.cz/labs/bird/wikis/home
- https://gitlab.labs.nic.cz/labs/bird/wikis/Examples
- https://gitlab.labs.nic.cz/labs/bird/wikis/transition-notes-to-bird-2
- http://godevops.net/2016/08/22/haproxy-high-availability-using-rhi-quagga-and-ospf
- https://thebsd.club/index.php?p=/discussion/8/qemu-with-a-bridged-tap0-interface-on-a-freebsd-host
- https://www.slideshare.net/shusugimoto1986/tutorial-using-gobgp-as-an-ixp-connecting-router
- http://networkstatic.net/gobgp-control-plane-evolving-software-networking/
- https://fastnetmon.com/docs/gobgp-integration/
- https://blog.marquis.co/configuring-bgp-using-bird-on-ubuntu-14-04lts/
- https://genneko.github.io/playing-with-bsd/networking/freebsd-vti-ipsec/
- https://vincent.bernat.ch/en/blog/2018-bgp-llgr
- https://blog.apnic.net/2018/11/06/bgp-llgr-robust-and-reactive-bgp-sessions/
- https://labs.ripe.net/Members/claudio_jeker/openbgpd-adding-diversity-to-route-server-landscape
- https://www.freebsd.org/cgi/man.cgi?query=if_ipsec(4)
- https://www.noction.com/blog/equal-cost-multipath-ecmp
- https://www.routetocloud.com/2015/01/asymmetric-routing-with-ecmp-and-edge-firewall-enabled/
- https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225792
- https://sostechblog.com/2017/07/20/nsx-bgp-ecmp-quick-hits/
- https://www.cisco.com/c/en/us/td/docs/ios-xml/ios/mp_l3_vpns/configuration/xe-3s/asr903/mp-l3-vpns-xe-3s-asr903-book/mp-l3-vpns-xe-3s-asr903-book_chapter_0100.pdf
- http://blog.cochard.me/2016/01/example-of-freebsd-bug-hunting-session.html
- https://conferences.sigcomm.org/sigcomm/2003/papers/p49-sobrinho.pdf
- https://www.inetdaemon.com/tutorials/internet/ip/routing/bgp/operation/messages/index.shtml

### open source BGP daemons

- https://www.bizety.com/2018/09/04/bgp-open-source-tools-quagga-vs-bird-vs-exabgp/
- [bird]
- [GoBGPD] - https://osrg.github.io/gobgp/
- [FRRouting] a fork of Quagga
- [Quagga]
- [OpenBGPD] an OpenBSD-based project, recently with a further update
- [ExaBGP] for simply announcing route availability

[BSDRP]: https://bsdrp.net/
[bird]: http://bird.network.cz/
[GoBGPD]: https://github.com/osrg/gobgp
[FRRouting]: https://frrouting.org/
[Quagga]: http://www.nongnu.org/quagga/
[OpenBGPD]: http://www.openbgpd.org/
[ExaBGP]: https://github.com/Exa-Networks/exabgp
[BGP]: https://tools.ietf.org/html/rfc4271
[haproxy]: http://www.haproxy.org/
[email]: mailto:dch@skunkwerks.at
[mastodon]: https://bsd.network/@dch/