
# API Retry in Kube

`client-go` provides three modes of communication:
- `Do` and `DoRaw`: for regular requests, such as creating a `ConfigMap` or getting a `Pod` object.
- `Watch`: for watch requests.
- `Stream`: for streaming APIs, for example when we run `oc logs`.

Previously, only `Do` and `DoRaw` had built-in retry logic. Recently we did the following:
- refactored the retry logic into a single unit that is reusable and testable.
- added retry logic to `Watch` and `Stream`.

You can go through the following PRs to get a glimpse of what was done:
- https://github.com/kubernetes/kubernetes/pull/102217
- https://github.com/kubernetes/kubernetes/pull/102606

## Retry Semantics: How does the client know which failed request to retry?

`client-go` uses the following `net/http` function to send a request to the server:
```
var request *http.Request
var client http.Client
response, err := client.Do(request)
```
- `err != nil` implies that the request failed with an error.
- otherwise, the client decides whether to retry the request based on the `StatusCode` of the http `response`:
  - `Watch`: `StatusCode != 200` is an error and we can safely retry.
  - `Stream`: `StatusCode >= 200 && StatusCode < 300` is a success, and any other `StatusCode` implies we can retry.

### Server
If the server wants the caller to retry the request, it sends the following response to the caller:
```
StatusCode = {429|5xx}
Header:
  Retry-After: N
```
- A: the response `StatusCode` is either `429` or `5xx`
- B: the response has a `Retry-After` header with a numeric value `N (N >= 0)`

Both `A` and `B` must be present in the response for the request to be retried.

One interesting fact: getting `429` does not always mean that the request was rejected by Priority and Fairness. The `kube-apiserver` rest/registry layer also returns `429` to the caller if it wants the caller to retry the request.
### Client
```
response, err := client.Do(request)
```
After receiving `response` and `err`, `client-go` determines retryability:
- is `err` set, and is it retryable?
- are both `A` and `B` true?

If `err` is set, then we retry only if:
- the request `verb` is `GET` (write operations are not retried as they may not be idempotent)
- the `err` is one of:
  - "connection reset"
  - EOF
  - unexpected EOF
  - connection reset by peer
  - use of closed network connection
  - http2: server sent GOAWAY and closed the connection

If either of the two checks above passes (a retryable `err`, or a response with both `A` and `B`), `client-go` goes ahead and attempts a retry, subject to a few constraints:

### Retry Constraints
- `MaxRetries`: `client-go` allows the caller to set the maximum number of retries; the default is `10`.
- The retry is roughly bound to the `context` of the `http.Request` object: if the `context` expires, the retry operation is aborted.
- The `body` of the request must be an `io.Seeker` so that we can seek to the beginning of the buffer before the next retry is attempted. If `Seek(0,0)` fails, we usually see the following error:
```
can't Seek() back to beginning of body
```

## Retry Loop
- 1: `client-go` attaches a `backoff` to a request; the default is `no backoff`. Exponential backoff can be enabled by setting two environment variables.
- 2: Calculate how much time to wait before the next attempt.
- 3: Call `Backoff.Sleep`.
- 4: Apply any client-side throttling using the `rateLimiter` associated with the request.
- 5: Send the request.
- 6: Update `Backoff` with the response from the server.
- 7: Determine whether the request is retryable, based on `(response, err)`.
- 8: Seek to the beginning of the request `body` before the next retry; abort on any error.
- 9: Sleep for N seconds, where N is obtained from the `Retry-After` response header.
- 10: Close the `body` of the http `Response` object to avoid a memory leak.

```sequence
Client->Client: 1: calculate Backoff Wait
Client->Client: 2: Backoff.Sleep
Client->Client: 3: Apply client-side throttling
Client->Server: 4: Send Request: client.Do(request)
Server->Client: 429 ('Retry-After: N')
Client->Client: 5: Update Backoff
Client->Client: 6: Is (response, err) retryable? Yes
Client->Client: 7: Current Attempt < MaxRetries? Yes
Client->Client: 8: Seek to the beginning of request body
Client->Client: 9: Sleep(N seconds)
Client->Client: 10: Close the body of the response
```

How to enable `URL Backoff`:
```
// Environment variables: Note that the duration should be long enough that the backoff
// persists for some reasonable time (i.e. 120 seconds). The typical base might be "1".
envBackoffBase     = "KUBE_CLIENT_BACKOFF_BASE"
envBackoffDuration = "KUBE_CLIENT_BACKOFF_DURATION"
```