---
title: resilient-control-plane-networking
authors:
- @aojea
- @sttts
reviewers:
- @sttts
- @rphilips
- @bbennet
approvers:
- @mfojtik
creation-date: 2021-09-23
last-updated: 2021-10-05
status: implementable
---
# Resilient Control Plane Networking
## Release Signoff Checklist
- [x] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in
[openshift-docs](https://github.com/openshift/openshift-docs/)
## Summary
An internal API load-balancer (LB) is used today by a number of core components (kubelet, the SDN itself, kube-controller-manager, kube-scheduler, multus/CNI plugins, and potentially more) to connect to the API without the need of a running SDN. The internal load-balancer is:
1. used for discovery reasons (bootstrap)
2. **and** for routing the actual control-plane data to healthy and ready API servers.
(1) is necessary, especially during the bootstrapping phase of a new node, but can also be done with DNS.
(2) is an implementation detail: LBs poll the active API servers and redirect client connections to the healthy ones.
In the past, the internal LB has been the subject of several customer escalations, and it is repeatedly reflected in CI instability. The LB is often a black box and is not a component of the cluster, as it can be provided by the customer or the cloud provider, which makes debugging very hard and increases the response time.
The current architecture is particularly sensitive to LB misconfiguration or misbehavior and has become a major problem for the supportability of the product, regularly involving core engineering to diagnose customer problems and causing large outages on customer clusters.
This enhancement is not about replacing the internal LB or changing the OpenShift architecture. It provides a self-healing solution for when (2) fails, reducing the unavailability time from hours or days to seconds, and it improves the network performance of the cluster by removing the current penalty of two hops to reach the API server through the LB, preferring local connectivity instead.
## Motivation
Load-balancer issues have proven to be very hard to debug in customer cases. The internal API LB is critical for the health of the cluster. Misconfiguration or misbehaviour, often neither black nor white, leads to a large variety of symptoms and cascading failures that are often not easy to distinguish from other in-cluster problems, even for the most advanced OpenShift engineers. For customers and support engineers these situations are overwhelming, especially when the usual cluster debugging tools are unavailable or unreliable in these moments.
The usual procedure to prove that the LB is misconfigured involves switching the control plane to connect to the API server via localhost and then observing whether the cluster stabilizes. If it does, that serves as enough proof for the customer to investigate their infrastructure.
Hence, one wish in these situations is that at least the control plane stays up and the debugging tools keep working. If customer workloads are also affected (they often use the same load-balancer), the customer is much better equipped to diagnose misbehaviour of their own applications than problems deep in control-plane components.
Today, the API LBs are configured by the installer (IPI) or provided by the customer (UPI).
The OpenShift teams whose components depend on the load-balancer often have limited to zero knowledge about how these are configured, or how to diagnose misbehaviour. In the escalations, the LBs were misconfigured accidentally by customers, or the LB hardware was faulty, eventually leading to production outages. Engineering spent many hours with the customer to root cause each individual problem.
### Goals
- Maximize fault-tolerance of the control plane by minimizing dependencies on customer or cloud provided infrastructure.
- Improve performance of components that actively use the internal API LB today for API traffic (including SDN, KCM, KS, kubelets on masters, kubelets on workers) by using a direct host-network connection wherever possible.
- Improve resilience of components that actively use the internal API LB today for API traffic (including SDN, KCM, KS, kubelets on masters, kubelets on workers) to be able to fall back to different endpoints in case of network failures.
- Respect health and readiness of API server instances (by using only ready endpoints).
- Keep backwards compatibility, using the internal load-balancer for discovery, bootstrapping, and potentially fallback.
- **Short-term:** opt-in for direct connections via feature-gate (without blocking upgrades), to
give an easy diagnosis mechanism in customer cases.
- **Medium-term:** switch to direct connections by default, potentially with an opt-out mechanism.
- **Medium-term:** remove the iptables-based hair-pinning work-around on GCP and Azure that is provided by the MCO today.
### Non-Goals
- Eliminate internal API load-balancer entirely.
- Switch over every OpenShift component/operator or customer workload. This enhancement is *not* about the "kubernetes" API service and its consumers, and not about consumers of the internal API load-balancer.
## Proposal
We propose to implement [RFC 7838](https://datatracker.ietf.org/doc/html/rfc7838), HTTP Alternative Services, in OpenShift. This specification defines a new concept in HTTP, "Alternative Services", that allows an origin server to nominate additional means of interacting with it on the network.
The kube-apiserver instances publish their IPs via the "kubernetes" service endpoints. In the case of OpenShift, these IP addresses map to the master nodes' IP addresses (as the kube-apiserver pods run on the host network). They are routable from the host networks on all nodes. These endpoints respect the readiness of the kube-apiserver, which means that if a kube-apiserver is being terminated (and with that starts returning failure on `/readyz`), the endpoint is automatically removed from the service by the kube-apiserver itself. The kube-apiserver instances will add the Alt-Svc HTTP header with the available kube-apiserver IPs, based on the published endpoints.
client-go will allow the use of a custom round-tripper that processes the Alt-Svc headers and selects the destination host for each request. Only HTTP/2 and HTTPS will be supported, to reduce the risk of incompatibilities and security issues.
It will cache the endpoint list for at most `shutdown-delay-duration` seconds (configured via a kube-apiserver flag; today this is 70s), and relist on the next resolver call.
We will hide this behaviour behind a feature-gate `APIServerAlternativeServices`:
- if the feature-gate is disabled (which will be the default in the beginning, but will switch eventually), the endpoint listing will be done but not used.
- if the feature-gate is enabled, the clients using the Alternative Service round tripper will be able to choose the destination API server dynamically, preferring localhost.
The clients will not access the feature gate directly, but will simply look at the Alt-Svc headers in the responses.
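To make the header format concrete, the following is a minimal sketch of how a client could extract the alternative authorities from an Alt-Svc response header value as defined by RFC 7838. The function name is illustrative and not part of this proposal, and parameters such as `ma` (max-age) and `persist` are ignored for brevity.
```go
package altsvc

import "strings"

// ParseAltSvc extracts the h2 alternative authorities ("host:port") from an
// Alt-Svc header value such as:
//
//	h2="10.0.0.2:6443", h2="10.0.0.3:6443"
//
// Parameters such as ma (max-age) and persist are ignored in this sketch.
func ParseAltSvc(value string) []string {
	var hosts []string
	for _, entry := range strings.Split(value, ",") {
		entry = strings.TrimSpace(entry)
		// Drop any parameters after the first ";".
		if i := strings.Index(entry, ";"); i >= 0 {
			entry = entry[:i]
		}
		// Expect the form proto="authority"; only h2 is accepted.
		parts := strings.SplitN(entry, "=", 2)
		if len(parts) != 2 || strings.TrimSpace(parts[0]) != "h2" {
			continue
		}
		authority := strings.Trim(strings.TrimSpace(parts[1]), `"`)
		if authority != "" {
			hosts = append(hosts, authority)
		}
	}
	return hosts
}
```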
### User Stories
1. As a cluster admin I want my Openshift Cluster to be resilient to load-balancer issues:
- the load-balancer has non-perfect logic for health and readiness of API server instances
(e.g. /readyz is not respected, but only tcp health).
- the load-balancer has limited bandwidth and hence throttles traffic or drops packets.
- the load-balancer is inaccessible due to a network partition.
2. As a cluster admin I want my OpenShift cluster control-plane to stay up at all costs, so that the OpenShift cluster debugging tooling keeps working (oc debug, oc must-gather, etc.). As a non-core OpenShift engineer I am overwhelmed by issues deep in the control-plane components.
3. OpenShift components should be as performant as possible, which means using the optimal network path in each situation:
- Local traffic: maximum performance and bandwidth and minimum jitter and latency, practically no packet drops, and very cheap.
- Node - Node: network performance is limited to the network and the nodes' interfaces, traffic is affected by the network.
- Node - LB - Node: network performance is influenced by network, the nodes' interfaces *and* the LB. LBs are costly.
4. As an OpenShift developer I want all cluster components to adapt to a changing number or to changing IP addresses of master machines. I don't want the workers to reboot for that, and a reconciling client should not be any more noticeable than with a load-balancer in front of the kube-apiservers.
5. As an OpenShift developer I want all cluster components to fail over quickly when an API server instance goes down, without depending on leader election timeouts.
### Implementation Details/Notes/Constraints [optional]
A number of OpenShift cluster components depend on an internal API load-balancer. Its URL is added by the installer to the `Infrastructure` CR and consumed either directly or configured through operators:
```yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
spec:
  platformSpec:
    type: None
status:
  apiServerInternalURI: https://internal-apiserver-load-balancer.rh:6443
  apiServerURL: https://api.load-balancer.rh:6443
```
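For illustration, this is one way a component could read the internal API URL from the `Infrastructure` object shown above. The helper name is ours, and a dynamic client is used only to keep the sketch self-contained; operators typically use the generated typed clients from openshift/client-go instead.
```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// internalAPIServerURL reads status.apiServerInternalURI from the cluster
// Infrastructure object, following the manifest above.
func internalAPIServerURL(ctx context.Context, client dynamic.Interface) (string, error) {
	gvr := schema.GroupVersionResource{Group: "config.openshift.io", Version: "v1", Resource: "infrastructures"}
	infra, err := client.Resource(gvr).Get(ctx, "cluster", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	url, _, err := unstructured.NestedString(infra.Object, "status", "apiServerInternalURI")
	return url, err
}
```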
Since all the components have to connect to the kube-apiservers, they can process the Alt-Svc headers in order to obtain the available API server IPs and apply certain heuristics client-side to choose the best API server, instead of going through the load-balancer.
There are two problems we have to solve:
- get a list of available control-plane nodes
- client capability to be able to connect to multiple control-planes
#### Kube-apiserver: list of available control-plane nodes
This problem can be solved in two different ways:
1. Manually, by configuring on the client the list of available control-plane nodes.
This can be done by extending the client-go configuration with an additional field that allows users to include a list of alternative control-plane nodes that the client can use.
`k8s.io/client-go/rest/config.go`
```go
type Config struct {
	// Host must be a host string, a host:port pair, or a URL to the base of the apiserver.
	// If a URL is given then the (optional) Path of that URL represents a prefix that must
	// be appended to all request URIs used to access the apiserver. This allows a frontend
	// proxy to easily relocate all of the apiserver endpoints.
	Host string
	// AlternativeHosts must be a comma-separated list of hosts, host:port pairs or URLs to the base of
	// different apiservers. The client can use any of them to access the apiserver.
	AlternativeHosts string
	// ... remaining fields unchanged
}
```
2. Automatically, using RFC 7838 HTTP Alternative Services.
This specification defines a new concept in HTTP, "Alternative Services", that allows an origin server to nominate additional means of interacting with it on the network.
All conformant clusters are required to publish a list of endpoints with the apiserver addresses. In the default implementation, this list of endpoints is generated by a [reconcile loop that guarantees that only ready apiserver addresses are present](https://github.com/kubernetes/kubernetes/tree/master/pkg/controlplane/reconcilers).
The proposal is for API servers to have an option that enables generation of the Alternative Services headers based on the list of Endpoints created for the `kubernetes.default` service. This requires that the users of this feature belong to the cluster network, otherwise we risk blackholing traffic. To avoid this problem and other security issues, the RFC 7838 Alt-Svc headers will not be inserted in all requests: they will be filtered by RBAC and will be optionally enabled. Another possibility is to add configuration options that restrict the server's Alternative Services to certain subnets, or that discriminate clients by source IP. A minimal server-side sketch follows the example below.
An example request, with the resulting response headers, looks like this:
```sh
I1103 12:30:59.066469 1558329 round_trippers.go:454] GET https://openshift.internal.lb:6443/api/v1/namespaces/default/pods?limit=500 200 OK in 1 milliseconds
I1103 12:30:59.066484 1558329 round_trippers.go:460] Response Headers:
I1103 12:30:59.066491 1558329 round_trippers.go:463] Cache-Control: no-cache, private
I1103 12:30:59.066502 1558329 round_trippers.go:463] Alt-Svc: h2="10.0.0.2:6443", h2="10.0.0.3:6443", h2="10.0.0.4:6443"
```
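The following is a minimal, hedged sketch of such a server-side filter in Go. The `WithAltSvc` wrapper and the injected `readyEndpoints` function are assumptions of this sketch, as is the absence of the RBAC and subnet filtering discussed above; it only illustrates how the header from the example could be produced from the ready endpoints.
```go
package filters

import (
	"fmt"
	"net/http"
	"strings"
)

// WithAltSvc wraps an HTTP handler and advertises the ready apiserver
// endpoints via the Alt-Svc response header, as shown in the example above.
// readyEndpoints is assumed to return "IP:port" strings derived from the
// endpoints of the kubernetes.default service.
func WithAltSvc(handler http.Handler, readyEndpoints func() []string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if eps := readyEndpoints(); len(eps) > 0 {
			values := make([]string, 0, len(eps))
			for _, ep := range eps {
				// RFC 7838 alt-value: protocol-id followed by a quoted authority.
				values = append(values, fmt.Sprintf("h2=%q", ep))
			}
			w.Header().Set("Alt-Svc", strings.Join(values, ", "))
		}
		handler.ServeHTTP(w, r)
	})
}
```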
#### Client-go
Client-go builds a layer on top of the Go net/http library, abstracting communication with the apiservers using [Requests](https://github.com/kubernetes/client-go/blob/master/rest/request.go).
The Go net/http library allows a request to be modified at different points:
- RoundTripper/Transport
- Dialer
- DNS dialer
Client-go already implements some [custom RoundTrippers](https://github.com/kubernetes/client-go/blob/master/transport/round_trippers.go) for some functionalities, like debugging, authentication, impersonation, ...
The client-go base URL is immutable after creation, which guarantees that Requests will not be able to modify the URL while the round tripper is trying to use Alternative Services. The proposal is to implement a new round tripper that has a local cache with the list of available apiservers.
This list can be created manually via configuration (comma-separated list on the Config object) or automatically via the Alt-Svc headers.
The round tripper processes the Alt-Svc headers and stores them in a local cache that is used for new connections. For each new connection, the round tripper implements the following logic (a minimal sketch in Go follows the list):
1. If it is already using an Alternative Service, stick to it, to avoid flapping.
2. If it is a connection to the original host, but there are Alternative Services in the cache:
   2.1. If one of the alternative services runs on the same host as the client, use it.
   2.2. If the original host is present as an alternative service, use it.
3. If a connection against an alternative service fails, the request is retried until it exhausts all the alternative services, falling back to the original host in case no alternative service is available. This ensures backwards compatibility.
   3.1. If it is a network error, the round tripper blocks that alternative service for a specified timeout so it is not retried during that period.
   3.2. If it is a certificate error, the alternative service is never tried again.
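A minimal sketch of this selection heuristic, under the assumption of a simple in-memory cache (the type and field names are ours, not part of the prototype):
```go
package altsvc

import (
	"net"
	"sync"
	"time"
)

// altSvcCache is an illustrative per-client cache of alternative services.
type altSvcCache struct {
	mu           sync.Mutex
	current      string               // alternative currently in use, sticky to avoid flapping
	alternatives []string             // "IP:port" entries parsed from Alt-Svc headers
	blockedUntil map[string]time.Time // alternatives that recently failed with a network error
	localIPs     map[string]bool      // IPs assigned to this host, to prefer a local apiserver
}

func (c *altSvcCache) blocked(host string) bool {
	return time.Now().Before(c.blockedUntil[host])
}

// chooseHost implements the selection logic from the list above.
func (c *altSvcCache) chooseHost(originalHost string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	// 1. Stick to the alternative already in use to avoid flapping.
	if c.current != "" && !c.blocked(c.current) {
		return c.current
	}
	// 2.1 Prefer an alternative running on the same host as the client.
	for _, alt := range c.alternatives {
		host, _, err := net.SplitHostPort(alt)
		if err == nil && c.localIPs[host] && !c.blocked(alt) {
			c.current = alt
			return alt
		}
	}
	// 2.2 If the original host is itself published as an alternative, keep it.
	for _, alt := range c.alternatives {
		if alt == originalHost && !c.blocked(alt) {
			c.current = alt
			return alt
		}
	}
	// 3. Otherwise use any non-blocked alternative, falling back to the
	// original host (the load-balancer) when all of them are exhausted.
	for _, alt := range c.alternatives {
		if !c.blocked(alt) {
			c.current = alt
			return alt
		}
	}
	return originalHost
}
```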
The round tripper replaces the original `URL.Host` field with the alternative service `Host`, and sets the HTTP/2 `:authority` pseudo-header and the SNI TLS server name to `kubernetes.default`.
Example:
```sh
https://apiserver.lb:6443/path?query
```
is converted to:
```sh
https://10.0.0.2:6443/path?query
Host: kubernetes.default
SNI TLS.ServerName = kubernetes.default
```
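Building on the cache and parser sketches above (all hypothetical and assumed to live in the same illustrative package), the rewrite could look roughly like this; the retry and fallback-on-failure handling from the list above is omitted for brevity, and the SNI server name is assumed to be pinned once on the underlying transport rather than per request:
```go
package altsvc

import "net/http"

// altSvcRoundTripper rewrites the request URL host to the chosen alternative
// apiserver and sets the Host header (the HTTP/2 ":authority" pseudo-header)
// to kubernetes.default.
type altSvcRoundTripper struct {
	cache    *altSvcCache      // selection logic from the previous sketch
	delegate http.RoundTripper // e.g. an HTTP/2-enabled transport
}

func (rt *altSvcRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	target := rt.cache.chooseHost(req.URL.Host)
	if target != req.URL.Host {
		// Clone before mutating: round trippers must not modify the caller's request.
		req = req.Clone(req.Context())
		req.URL.Host = target
		req.Host = "kubernetes.default"
	}
	resp, err := rt.delegate.RoundTrip(req)
	if err == nil && resp != nil {
		// Refresh the cache from any Alt-Svc header in the response.
		if v := resp.Header.Get("Alt-Svc"); v != "" {
			rt.cache.update(ParseAltSvc(v))
		}
	}
	return resp, err
}

// update replaces the cached alternatives, e.g. from a parsed Alt-Svc header.
func (c *altSvcCache) update(alternatives []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.alternatives = alternatives
}
```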
Requiring HTTPS protects the client against attacks that redirect traffic to an untrusted endpoint: an alternative service is only used if it presents a valid serving certificate for `kubernetes.default`.
Requiring HTTP/2 allows the client to detect stale entries, since the HTTP/2 client can detect broken or idle connections using PING frames. It also allows multiplexing multiple requests over a single TCP connection, reducing the number of TCP connections required and avoiding possible network issues related to the number of open file descriptors on the host.
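As a sketch of how the transport side could be configured under these requirements, the following uses the PING-based health check of `golang.org/x/net/http2`; the helper name and the timeout values are illustrative, not part of the proposal:
```go
package altsvc

import (
	"crypto/tls"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newHTTP2Transport configures an HTTPS-only transport with SNI pinned to
// kubernetes.default and HTTP/2 PING-based health checking, so that stale
// connections to alternative services are detected and closed.
func newHTTP2Transport(tlsConfig *tls.Config) (http.RoundTripper, error) {
	if tlsConfig == nil {
		tlsConfig = &tls.Config{}
	} else {
		tlsConfig = tlsConfig.Clone()
	}
	tlsConfig.ServerName = "kubernetes.default" // SNI; must be covered by the apiserver serving certificate
	tr := &http.Transport{TLSClientConfig: tlsConfig}
	h2, err := http2.ConfigureTransports(tr)
	if err != nil {
		return nil, err
	}
	// Send a PING frame if the connection has been idle for 30s and close it
	// if no reply arrives within 15s (illustrative values).
	h2.ReadIdleTimeout = 30 * time.Second
	h2.PingTimeout = 15 * time.Second
	return tr, nil
}
```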
Consumers of the client-go library will not need to make any additional changes, since the round tripper will be a no-op if there are no Alt-Svc headers present in the responses.
#### Prototype
A working prototype can be found in:
https://github.com/openshift/kubernetes/pull/1019
An example of an external project consuming the new client-go can be found in the OVN project:
https://github.com/openshift/ovn-kubernetes/pull/823
#### Priority&Fairness
To not taint the priority&fairness behaviour due to the new Endpoints requests,
we will add a FlowSchema `kubernetes-endpoints` that white-lists these requests
in its own p&f bucket.
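A sketch of what such a FlowSchema could look like, expressed with the `flowcontrol/v1beta1` Go types; the priority level, matching precedence, and subjects are illustrative, only the name and the intent of isolating these Endpoints requests in their own bucket come from this proposal:
```go
package altsvc

import (
	flowcontrolv1beta1 "k8s.io/api/flowcontrol/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// kubernetesEndpointsFlowSchema isolates reads of the kubernetes.default
// endpoints in their own priority&fairness flow schema.
var kubernetesEndpointsFlowSchema = &flowcontrolv1beta1.FlowSchema{
	ObjectMeta: metav1.ObjectMeta{Name: "kubernetes-endpoints"},
	Spec: flowcontrolv1beta1.FlowSchemaSpec{
		PriorityLevelConfiguration: flowcontrolv1beta1.PriorityLevelConfigurationReference{
			Name: "system", // illustrative priority level
		},
		MatchingPrecedence: 800, // illustrative precedence
		DistinguisherMethod: &flowcontrolv1beta1.FlowDistinguisherMethod{
			Type: flowcontrolv1beta1.FlowDistinguisherMethodByUserType,
		},
		Rules: []flowcontrolv1beta1.PolicyRulesWithSubjects{{
			Subjects: []flowcontrolv1beta1.Subject{{
				Kind:  flowcontrolv1beta1.SubjectKindGroup,
				Group: &flowcontrolv1beta1.GroupSubject{Name: "system:authenticated"}, // illustrative subject
			}},
			ResourceRules: []flowcontrolv1beta1.ResourcePolicyRule{{
				Verbs:      []string{"get", "list", "watch"},
				APIGroups:  []string{""},
				Resources:  []string{"endpoints"},
				Namespaces: []string{"default"},
			}},
		}},
	},
}
```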
### Risks and Mitigations
To avoid possible attacks on the Alternative Services implementation, only HTTP/2 and HTTPS will be
allowed.
The client-go round tripper will be a no-op if there are no Alternative Services present, so it remains fully compatible with the current implementation.
The kube-apiserver generation of Alt-Svc headers will be optional and will provide the corresponding configuration options to avoid security issues.
### Test Plan
This feature allows removing the dependency on the internal load-balancer, although the LB is still
required for bootstrapping. The feature will be tested by adding duplicate jobs that use internal
load-balancers, but in this case the internal load-balancer will be removed once the cluster has
been deployed.
### Graduation Criteria
- beta: fully backwards compatible, and jobs using the new feature are more resilient than jobs
using cloud load-balancers.
- stable: no performance degradation with respect to "vanilla" client-go.
#### Dev Preview -> Tech Preview
- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers
- Enumerate service level indicators (SLIs), expose SLIs as metrics
- Write symptoms-based alerts for the component(s)
#### Tech Preview -> GA
- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default
- Backhaul SLI telemetry
- Document SLOs for the component
- Conduct load testing
**For non-optional features moving to GA, the graduation criteria must include end to end tests.**
#### Removing a deprecated feature
- Announce deprecation and support policy of the existing feature
- Deprecate the feature
### Upgrade / Downgrade Strategy
If applicable, how will the component be upgraded and downgraded? Make sure this is in the test
plan.
Consider the following in developing an upgrade/downgrade strategy for this enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing cluster required to
make on upgrade in order to keep previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing cluster required to
make on upgrade in order to make use of the enhancement?
Upgrade expectations:
- Each component should remain available for user requests and workloads during upgrades. Ensure the components leverage best practices in handling [voluntary disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to this should be identified and discussed here.
- Micro version upgrades - users should be able to skip forward versions within a minor release stream without being required to pass through intermediate versions - i.e. `x.y.N->x.y.N+2` should work without requiring `x.y.N->x.y.N+1` as an intermediate step.
- Minor version upgrades - you only need to support `x.N->x.N+1` upgrade steps. So, for example, it is acceptable to require a user running 4.3 to upgrade to 4.5 with a `4.3->4.4` step followed by a `4.4->4.5` step.
- While an upgrade is in progress, new component versions should continue to operate correctly in concert with older component versions (aka "version skew"). For example, if a node is down, and an operator is rolling out a daemonset, the old and new daemonset pods must continue to work correctly even while the cluster remains in this partially upgraded state for some time.
Downgrade expectations:
- If an `N->N+1` upgrade fails mid-way through, or if the `N+1` cluster is misbehaving, it should be possible for the user to rollback to `N`. It is acceptable to require some documented manual steps in order to fully restore the downgraded cluster to its previous state. Examples of acceptable steps include:
- Deleting any CVO-managed resources added by the new version. The CVO does not currently delete resources that no longer exist in the target version.
### Version Skew Strategy
How will the component handle version skew with other components? What are the guarantees? Make sure this is in the test plan.
Consider the following in developing a version skew strategy for this enhancement:
- During an upgrade, we will always have skew among components, how will this impact your work?
- Does this enhancement involve coordinating behavior in the control plane and in the kubelet? How does an n-2 kubelet without this feature available behave when this feature is used?
- Will any other components on the node change? For example, changes to CSI, CRI or CNI may require updating that component before the kubelet.
## Implementation History
Major milestones in the life cycle of a proposal should be tracked in `Implementation History`.
## Drawbacks
The API server resolver only influences the hostname; this means that the API server Endpoints have
to use the same port as the one configured in `apiServerInternalURI`. However, this is mitigated by
falling back to the previous behavior and using the default Go resolver.
## Alternatives
- Per-node load-balancer: this option requires adding a new component per node, with its own
  lifecycle and operations. It solves the problem of depending on an external, opaque load-balancer,
  but it adds a new maintenance and supportability problem. In addition, to be backwards
  compatible, it requires modifying the local resolvers to redirect the apiServerInternalURI to the
  per-node load-balancers, and we still have an intermediate hop.
- Custom Dialer: this is the simplest option, but it is hard to make compatible with the current client-go, which already uses some custom Dialers. A prototype implementation can be found in https://github.com/aojea/client-go-multidialer
- Custom in-memory DNS: this option is the most complicated and the hardest to maintain. It creates an in-memory DNS resolver, but still requires a custom Dialer for client-go. A prototype implementation can be found in https://github.com/aojea/kubernetes/pull/1
## Infrastructure Needed [optional]
Use this section if you need things from the project. Examples include a new subproject, repos
requested, github details, and/or testing infrastructure.
Listing these here allows the community to get the process for these resources started right away.