# Resilience: Solving application networking challenges

## Building resilience into application libraries

Before service-mesh technology was widely available, we as service developers had to write many of these basic resilience patterns into our application code, using libraries such as Finagle (Twitter), Hystrix (Netflix), and Ribbon (Netflix). These libraries were very popular in the Java community, including within the Spring Framework. The problem with this approach is that across different permutations of languages, frameworks, and infrastructure, we end up with varying implementations. Twitter Finagle and NetflixOSS were great for Java developers, but Node.js, Go, and Python developers had to find or implement their own variants of these patterns. In some cases, these libraries were also invasive to the application code: networking code was sprinkled around and obscured the actual business logic.

## Using Istio to build resilience into applications

Istio's service proxy sits next to the application and handles all network traffic to and from it. Because the service proxy understands application-level requests and messages (such as HTTP requests), Istio can implement resilience features within the proxy. For example, we can configure Istio to retry a failed request up to three times when a service call returns an HTTP 503.

Istio's service proxy implements these basic resilience patterns out of the box:

+ Client-side load balancing
+ Locality-aware load balancing
+ Timeouts and retries
+ Circuit breaking

In the following examples, we use this architecture.

![](https://i.imgur.com/pmZn7wz.png)

### Client-side load balancing

Service operators and developers can configure which load-balancing algorithm a client uses by defining a `DestinationRule` resource. Istio's service proxy is based on Envoy and supports Envoy's load-balancing algorithms:

+ Round robin (default)
+ Random
+ Weighted least request

Let's deploy a simple backend and a simple web service.
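These examples assume the workloads run in a namespace with automatic sidecar injection enabled; the `istioinaction` namespace shows up later in destination-rule hosts and proxy statistics, so that name is assumed here. If you are starting from scratch, a minimal setup might look like the following sketch (adjust the namespace name to your environment):

```bash
# Assumption: a namespace called istioinaction is used for all workloads in this section
kubectl create namespace istioinaction

# Enable automatic Envoy sidecar injection for that namespace
kubectl label namespace istioinaction istio-injection=enabled

# Make it the default namespace for the kubectl commands that follow
kubectl config set-context --current --namespace=istioinaction
```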
```bash kubectl apply -f - <<EOF apiVersion: v1 kind: ServiceAccount metadata: name: simple-backend --- apiVersion: v1 kind: Service metadata: labels: app: simple-backend name: simple-backend spec: ports: - name: http port: 80 protocol: TCP targetPort: 8080 selector: app: simple-backend --- apiVersion: apps/v1 kind: Deployment metadata: labels: app: simple-backend name: simple-backend-1 spec: replicas: 1 selector: matchLabels: app: simple-backend template: metadata: labels: app: simple-backend spec: serviceAccountName: simple-backend containers: - env: - name: "LISTEN_ADDR" value: "0.0.0.0:8080" - name: "SERVER_TYPE" value: "http" - name: "NAME" value: "simple-backend" - name: "MESSAGE" value: "Hello from simple-backend-1" - name: "TIMING_VARIANCE" value: "40ms" - name: "TIMING_50_PERCENTILE" value: "150ms" - name: KUBERNETES_NAMESPACE valueFrom: fieldRef: fieldPath: metadata.namespace image: nicholasjackson/fake-service:v0.17.0 imagePullPolicy: IfNotPresent name: simple-backend ports: - containerPort: 8080 name: http protocol: TCP securityContext: privileged: false --- apiVersion: apps/v1 kind: Deployment metadata: labels: app: simple-backend name: simple-backend-2 spec: replicas: 2 selector: matchLabels: app: simple-backend template: metadata: labels: app: simple-backend spec: serviceAccountName: simple-backend containers: - env: - name: "LISTEN_ADDR" value: "0.0.0.0:8080" - name: "SERVER_TYPE" value: "http" - name: "NAME" value: "simple-backend" - name: "MESSAGE" value: "Hello from simple-backend-2" - name: "TIMING_VARIANCE" value: "10ms" - name: "TIMING_50_PERCENTILE" value: "150ms" - name: KUBERNETES_NAMESPACE valueFrom: fieldRef: fieldPath: metadata.namespace image: nicholasjackson/fake-service:v0.17.0 imagePullPolicy: IfNotPresent name: simple-backend ports: - containerPort: 8080 name: http protocol: TCP securityContext: privileged: false EOF ``` ```bash kubectl apply -f - <<EOF apiVersion: v1 kind: ServiceAccount metadata: name: simple-web --- apiVersion: v1 kind: Service metadata: labels: app: simple-web name: simple-web spec: ports: - name: http port: 80 protocol: TCP targetPort: 8080 selector: app: simple-web --- apiVersion: apps/v1 kind: Deployment metadata: labels: app: simple-web name: simple-web spec: replicas: 1 selector: matchLabels: app: simple-web template: metadata: labels: app: simple-web spec: serviceAccountName: simple-web containers: - env: - name: "LISTEN_ADDR" value: "0.0.0.0:8080" - name: "UPSTREAM_URIS" value: "http://simple-backend:80/" - name: "SERVER_TYPE" value: "http" - name: "NAME" value: "simple-web" - name: "MESSAGE" value: "Hello from simple-web!!!" - name: KUBERNETES_NAMESPACE valueFrom: fieldRef: fieldPath: metadata.namespace image: nicholasjackson/fake-service:v0.17.0 imagePullPolicy: IfNotPresent name: simple-web ports: - containerPort: 8080 name: http protocol: TCP securityContext: privileged: false EOF ``` ```bash kubectl apply -f - <<EOF apiVersion: networking.istio.io/v1alpha3 kind: Gateway metadata: name: simple-web-gateway spec: selector: istio: ingressgateway servers: - port: number: 80 name: http protocol: HTTP hosts: - "simple-web.istioinaction.io" --- apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: simple-web-vs-for-gateway spec: hosts: - "simple-web.istioinaction.io" gateways: - simple-web-gateway http: - route: - destination: host: simple-web EOF ``` Let’s specify the load balancing for any client calling the simple-backend service to be `ROUND_ROBIN` with an Istio `DestinationRule` resource. 
```bash kubectl apply -f - <<EOF apiVersion: networking.istio.io/v1beta1 kind: DestinationRule metadata: name: simple-backend-dr spec: host: simple-backend trafficPolicy: loadBalancer: simple: ROUND_ROBIN EOF ``` Let’s look at a somewhat realistic scenario using a load generator and changing the latency of the simple-backend service. ### CLI load generation tool Let's set up our scenario, calling our service again and observe service response times: ```bash time curl -s -o /dev/null -H "Host: simple-web.istioinaction.io" localhost ``` ```bash real 0m0.372s user 0m0.015s sys 0m0.014s ``` ```bash time curl -s -o /dev/null -H "Host: simple-web.istioinaction.io" localhost ``` ```bash real 0m0.240s user 0m0.000s sys 0m0.005s ``` Each time we call the service, the response times are different. Load balancing can be an effective strategy to reduce the effect of endpoints unexpected latency. We will use a CLI load generation tool called [Fortio](https://github.com/fortio/fortio/releases). Run with Docker: ```bash docker run --rm --network host --name fortio fortio/fortio:1.6.8 curl -H "Host: simple-web.istioinaction.io" http://<your-machine-ip>/ ``` Next, we create a version of the `simple-backend-1` service that increases latency for up to one second. ```bash kubectl apply -f - <<EOF apiVersion: apps/v1 kind: Deployment metadata: labels: app: simple-backend name: simple-backend-1 spec: replicas: 1 selector: matchLabels: app: simple-backend template: metadata: labels: app: simple-backend spec: serviceAccountName: simple-backend containers: - env: - name: "LISTEN_ADDR" value: "0.0.0.0:8080" - name: "SERVER_TYPE" value: "http" - name: "NAME" value: "simple-backend" - name: "MESSAGE" value: "Hello from simple-backend-1" - name: "TIMING_VARIANCE" value: "10ms" - name: "TIMING_50_PERCENTILE" value: "1000ms" - name: KUBERNETES_NAMESPACE valueFrom: fieldRef: fieldPath: metadata.namespace image: nicholasjackson/fake-service:v0.17.0 imagePullPolicy: IfNotPresent name: simple-backend ports: - containerPort: 8080 name: http protocol: TCP securityContext: privileged: false EOF ``` Now, We will use Fortio to send 1,000 requests per second through 10 connections for 60 seconds. By running Fortio in server mode, we can access a web dashboard where we can input the parameters of our test, execute the test, and visualize the results. ```bash docker run --rm -p 8080:8080 --name fortio fortio/fortio:1.6.8 server ``` Open your browser and fill in the following parameters: + Title: roundrobin + URL: `http://localhost` + QPS: 1000 + Duration: 60s + Threads: 10 + Jitter: Checked + Headers: "Host: simple-web.istioinaction.io" Start running the test by clicking the Start button and wait for the test to complete. ![](https://i.imgur.com/TOVXWrP.png) For this round-robin load-balancing strategy, the resulting latencies are as follows: + 50%: 191.47 ms + 75%: 1013.31 ms + 90%: 1033.15 ms + 99%: 1045.05 ms + 99.9%: 1046.24 ms Now, let’s change the load-balancing algorithm to `LEAST_CONN` and try the same load test again: ```bash kubectl apply -f - <<EOF apiVersion: networking.istio.io/v1beta1 kind: DestinationRule metadata: name: simple-backend-dr spec: host: simple-backend trafficPolicy: loadBalancer: simple: LEAST_CONN EOF ``` For this least-connection load-balancing strategy, the latencies are as follows: + 50%: 184.79 ms + 75%: 195.63 ms + 90%: 1036.89 ms + 99%: 1124.00 ms + 99.9%: 1132.71 ms You received a better result. 
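If you want to confirm which algorithm the client sidecar is actually using, you can inspect its Envoy cluster configuration. A minimal sketch, assuming a single `simple-web` pod acts as the client and the workloads run in the `istioinaction` namespace:

```bash
# Grab the simple-web pod name (assumes one replica labeled app=simple-web)
WEB_POD=$(kubectl get pod -l app=simple-web -o jsonpath='{.items[0].metadata.name}')

# Show the load-balancing policy Envoy uses for the simple-backend cluster
istioctl proxy-config cluster "$WEB_POD" \
  --fqdn simple-backend.istioinaction.svc.cluster.local \
  -o json | grep lbPolicy
```

With the `LEAST_CONN` destination rule applied, you should see Envoy report the policy as `LEAST_REQUEST` rather than anything connection-based, which is what the following note explains.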
> Even though the Istio configuration refers to least-request load balancing as `LEAST_CONN`, Envoy tracks request depth per endpoint, not connections.

At this point, we are finished with the Fortio web UI. Shut down the `fortio server` command by pressing `Ctrl-C`.

### Locality-aware load balancing

One role of a control plane like Istio's is understanding the topology of services and how that topology may evolve. For example, Istio can identify the region and availability zone in which a particular service is deployed and give priority to services that are closer.

![](https://i.imgur.com/gMhD0bp.png)

When deploying to Kubernetes in the cloud, region and zone information is added as labels on the Kubernetes nodes. Istio picks up these node labels and enriches the Envoy load-balancing endpoints with locality information. In older versions of the Kubernetes API, `failure-domain.beta.kubernetes.io/region` and `failure-domain.beta.kubernetes.io/zone` were the labels used to identify the region and zone. In recent versions, those labels have been replaced with `topology.kubernetes.io/region` and `topology.kubernetes.io/zone`. **Be aware that cloud vendors may still use the older failure-domain labels; Istio looks for both.**

Istio also provides a way to explicitly set the locality of our workloads for testing. We can label a Pod with `istio-locality` and give it an explicit region/zone. For example:

```yaml
...
  labels:
    app: simple-backend
    istio-locality: us-west1.us-west1-b
...
```

Istio's locality-aware load balancing is enabled by default. If you wish to disable it, set `meshConfig.localityLbSetting.enabled` to `false`.

For locality-aware load balancing to work, Istio needs health checking to know which endpoints in the load-balancing pool are unhealthy. For example, we can configure passive health checking by adding outlier detection for the simple-backend service:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: simple-backend-dr
spec:
  host: simple-backend
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 1
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 100
```

### Locality-aware load balancing with weighted distribution

By default, Istio's service proxy sends all traffic to services in the same locality and spills over only when there are failures or unhealthy endpoints. Suppose the services in a certain zone or region cannot handle all the incoming load. We may want to spill over to a neighboring locality so that 70% of traffic goes to the closest locality and 30% goes to the neighboring locality.

![](https://i.imgur.com/SmCdl5J.png)

For example:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: simple-backend-dr
spec:
  host: simple-backend.istioinaction.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        distribute:
        - from: us-west1/us-west1-a/*
          to:
            "us-west1/us-west1-a/*": 70
            "us-west1/us-west1-b/*": 30
    connectionPool:
      http:
        http2MaxRequests: 10
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 1
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 100
```

## Transparent timeouts and retries

Istio allows us to configure various types of timeouts and retries to overcome inherent network unreliability.

### Timeouts

When things slow down, resources are held longer and requests take longer to handle; to guard against this, we should implement timeouts on our calls.
Let's deploy a version of the `simple-backend` service that inserts a one-second delay in processing for roughly 50% of the calls to that instance:

```bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: simple-backend
  name: simple-backend-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simple-backend
  template:
    metadata:
      labels:
        app: simple-backend
    spec:
      serviceAccountName: simple-backend
      containers:
      - env:
        - name: "LISTEN_ADDR"
          value: "0.0.0.0:8080"
        - name: "SERVER_TYPE"
          value: "http"
        - name: "NAME"
          value: "simple-backend"
        - name: "MESSAGE"
          value: "Hello from simple-backend-1"
        - name: "TIMING_VARIANCE"
          value: "10ms"
        - name: "TIMING_50_PERCENTILE"
          value: "1000ms"
        - name: KUBERNETES_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: nicholasjackson/fake-service:v0.17.0
        imagePullPolicy: IfNotPresent
        name: simple-backend
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        securityContext:
          privileged: false
EOF
```

Let's call the service and see how long each call takes:

```bash
for i in {1..10}; do time curl -s -H "Host: simple-web.istioinaction.io" localhost | jq .code; printf "\n"; done
```

You will see that some requests take one second or longer. We can specify per-request timeouts with the Istio `VirtualService` resource, for example:

```bash
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: simple-backend-vs
spec:
  hosts:
  - simple-backend
  http:
  - route:
    - destination:
        host: simple-backend
    timeout: 0.5s
EOF
```

When we call the service again, the maximum response time is about 0.5 seconds, but calls that hit the timeout fail with an HTTP 500 error:

```bash
for i in {1..10}; do time curl -s -H "Host: simple-web.istioinaction.io" localhost | jq .code; printf "\n"; done
```

```bash
500

real    0m0.545s
user    0m0.008s
sys     0m0.000s
200

real    0m0.171s
user    0m0.009s
sys     0m0.000s
200

real    0m0.174s
user    0m0.011s
sys     0m0.000s
...
```

In the next section, we discuss other options to remedy failures like timeouts.

### Retries

When calling a service and experiencing intermittent network failures, we may want the application to retry the request. Before we begin, let's reset the `simple-backend` service back to its default behavior:

```bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: simple-backend
  name: simple-backend-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simple-backend
  template:
    metadata:
      labels:
        app: simple-backend
    spec:
      serviceAccountName: simple-backend
      containers:
      - env:
        - name: "LISTEN_ADDR"
          value: "0.0.0.0:8080"
        - name: "SERVER_TYPE"
          value: "http"
        - name: "NAME"
          value: "simple-backend"
        - name: "MESSAGE"
          value: "Hello from simple-backend-1"
        - name: "TIMING_VARIANCE"
          value: "40ms"
        - name: "TIMING_50_PERCENTILE"
          value: "150ms"
        - name: KUBERNETES_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: nicholasjackson/fake-service:v0.17.0
        imagePullPolicy: IfNotPresent
        name: simple-backend
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        securityContext:
          privileged: false
EOF
```

Istio has retries enabled by default and will retry up to two times. To make the effect of retries obvious, let's first disable the default retries for our example:

```bash
istioctl install --set profile=demo --set meshConfig.defaultHttpRetryPolicy.attempts=0
```

Now let's deploy a version of the simple-backend service that has periodic (75%) failures.
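Before applying the failing backend, we can optionally confirm that the mesh-wide retry change reached the `simple-web` sidecar. This is a hedged sketch rather than an exact transcript: it assumes a single `simple-web` pod is running in the current namespace, and with `attempts` set to 0 you should see the route's retry policy disappear or report zero retries.

```bash
# Grab the simple-web pod name (assumes one replica labeled app=simple-web)
WEB_POD=$(kubectl get pod -l app=simple-web -o jsonpath='{.items[0].metadata.name}')

# Dump the sidecar's route configuration and look for the retry policy
istioctl proxy-config route "$WEB_POD" -o json | grep -A3 retryPolicy
```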
![](https://i.imgur.com/6x2MIFm.png) ```bash kubectl apply -f - <<EOF apiVersion: apps/v1 kind: Deployment metadata: labels: app: simple-backend name: simple-backend-1 spec: replicas: 1 selector: matchLabels: app: simple-backend template: metadata: labels: app: simple-backend spec: serviceAccountName: simple-backend containers: - env: - name: "LISTEN_ADDR" value: "0.0.0.0:8080" - name: "SERVER_TYPE" value: "http" - name: "NAME" value: "simple-backend" - name: "MESSAGE" value: "Hello from simple-backend-1" - name: "ERROR_TYPE" value: "http_error" - name: "ERROR_RATE" value: "0.75" - name: "ERROR_CODE" value: "503" - name: KUBERNETES_NAMESPACE valueFrom: fieldRef: fieldPath: metadata.namespace image: nicholasjackson/fake-service:v0.14.1 imagePullPolicy: IfNotPresent name: simple-backend ports: - containerPort: 8080 name: http protocol: TCP securityContext: privileged: false EOF ``` We define our `simple-backend` service to return HTTP 503 codes. If we call the service a number of times, we should see some failures: ```bash for in in {1..10}; do curl -s -H "Host: simple-web.istioinaction.io" localhost | jq .code; printf "\n"; done ``` ```bash 500 200 200 200 200 200 200 500 200 200 ``` In the previous configurations, we disabled the default retry policy. Let’s explicitly configure retry attempts to be 2 for calls to `simple-backend` with the `VirtualService` resource: ```bash kubectl apply -f - <<EOF apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: simple-backend-vs spec: hosts: - simple-backend http: - route: - destination: host: simple-backend retries: attempts: 2 EOF ``` If we call our service again, we see no failures: ```bash for in in {1..10}; do curl -s -H "Host: simple-web.istioinaction.io" localhost | jq .code; printf "\n"; done ``` Although there were failures (as we saw earlier), they are not bubbled up to the caller because we enabled Istio’s retry policy to work around those errors. By default, HTTP 503 is one of the retriable status codes. But, You can configure a retry policy in more detail. 
For example:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: simple-backend-vs
spec:
  hosts:
  - simple-backend
  http:
  - route:
    - destination:
        host: simple-backend
    retries:
      attempts: 2
      retryOn: gateway-error,connect-failure  # Errors to retry
      perTryTimeout: 300ms                    # Timeout per attempt
      retryRemoteLocalities: true             # Whether to retry endpoints in other localities
```

If we deploy our `simple-backend` service to return HTTP 500 codes, the default retry behavior will not catch that:

```bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: simple-backend
  name: simple-backend-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simple-backend
  template:
    metadata:
      labels:
        app: simple-backend
    spec:
      serviceAccountName: simple-backend
      containers:
      - env:
        - name: "LISTEN_ADDR"
          value: "0.0.0.0:8080"
        - name: "SERVER_TYPE"
          value: "http"
        - name: "NAME"
          value: "simple-backend"
        - name: "MESSAGE"
          value: "Hello from simple-backend-1"
        - name: "ERROR_TYPE"
          value: "http_error"
        - name: "ERROR_RATE"
          value: "0.75"
        - name: "ERROR_CODE"
          value: "500"
        - name: KUBERNETES_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: nicholasjackson/fake-service:v0.14.1
        imagePullPolicy: IfNotPresent
        name: simple-backend
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        securityContext:
          privileged: false
EOF
```

Let's check:

```bash
for i in {1..10}; do curl -s -H "Host: simple-web.istioinaction.io" localhost | jq .code; printf "\n"; done
```

HTTP 500 is not among the status codes retried by default, so let's configure a retry policy that retries on all HTTP 5xx responses:

```bash
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: simple-backend-vs
spec:
  hosts:
  - simple-backend
  http:
  - route:
    - destination:
        host: simple-backend
    retries:
      attempts: 2
      retryOn: 5xx  # Retries on HTTP 5xx
EOF
```

Try again, and we should no longer see HTTP 500 errors:

```bash
for i in {1..10}; do curl -s -H "Host: simple-web.istioinaction.io" localhost | jq .code; printf "\n"; done
```

### Retries in terms of timeouts

Each retry has its own `perTryTimeout`. Note that `(perTryTimeout * attempts) + backoff` must be less than the overall request timeout. Also keep in mind there is a backoff delay between retries, and it counts against the overall request timeout. Between retries, Istio backs off with a base of 25 ms; for each successive retry, it waits roughly `(25 ms x attempt #)` to stagger the retries.

![](https://i.imgur.com/mXPZ0Gn.png)

Importantly, the default retry setting of Istio (`attempts: 2`) can compound across layers of services and lead to a "thundering herd" problem, for example:

![](https://i.imgur.com/yBH2NhQ.png)

**One option to deal with this situation is to limit the retry attempts at the edges of your architecture to one or none.**

### Request Hedging

This is an advanced topic that is not directly exposed in the Istio API. When a request reaches its per-try timeout, we can optionally configure Envoy to **send another request to a different host to "race" the original**. If the hedged request returns successfully, its response is sent to the original downstream caller. This is called "request hedging"; to set it up, we can use an `EnvoyFilter` resource.

## Circuit breaking with Istio

Circuit breaking is a pattern that prevents an application from repeatedly trying to execute an operation that is likely to fail.
In the network stack, a circuit breaker acts as a proxy that can monitor the number of recent failures that have occurred, and use this information to decide whether to allow the operation to proceed, or simply return an exception immediately. We want to reduce traffic to unhealthy systems, so we don’t continue to overload them and prevent them from recovering. Istio doesn’t have an explicit configuration called “circuit breaker,” but it provides two controls for limiting load on backend services, especially those experiencing issues, to effectively enforce a circuit breaker. **The first** is to manage how many connections and outstanding requests are allowed to a specific service. ![](https://i.imgur.com/A0IelsQ.png) In Istio, we use the `connectionPool` settings in a *Destination Rule* to limit the number of connections and requests that send to the service. If too many requests pile up, we can `short-circuit` them (fail fast) and return to the client. **The second** control is to observe the health of endpoints in the load-balancing pool and remove misbehaving endpoints for a time (skip sending traffic to them). Let's practice, first, we scale down the `simple-backend-2` service to a replica of 0. ```bash kubectl scale deploy/simple-backend-2 --replicas=0 ``` Next, let’s deploy the version of `simple-backend-1` service back to a one-second delay. ```bash kubectl apply -f - <<EOF apiVersion: apps/v1 kind: Deployment metadata: labels: app: simple-backend name: simple-backend-1 spec: replicas: 1 selector: matchLabels: app: simple-backend template: metadata: labels: app: simple-backend spec: serviceAccountName: simple-backend containers: - env: - name: "LISTEN_ADDR" value: "0.0.0.0:8080" - name: "SERVER_TYPE" value: "http" - name: "NAME" value: "simple-backend" - name: "MESSAGE" value: "Hello from simple-backend-1" - name: "TIMING_VARIANCE" value: "10ms" - name: "TIMING_50_PERCENTILE" value: "1000ms" - name: KUBERNETES_NAMESPACE valueFrom: fieldRef: fieldPath: metadata.namespace image: nicholasjackson/fake-service:v0.17.0 imagePullPolicy: IfNotPresent name: simple-backend ports: - containerPort: 8080 name: http protocol: TCP securityContext: privileged: false EOF ``` We delete all existing destination rules. ```bash kubectl delete destinationrule --all ``` Now, Let’s run a very simple load test with one connection `(-c 1)` sending one request per second `(-qps 1)`. 
```bash
docker run --rm --network host --name fortio fortio/fortio:1.6.8 load -H "Host: simple-web.istioinaction.io" -quiet -jitter -t 30s -c 1 -qps 1 http://<your-machine-ip>/
```

```
Aggregated Function Time : count 30 avg 1.0122189 +/- 0.004946 min 1.0084185 max 1.033391 sum 30.366567
# target 50% 1.02047
# target 75% 1.02693
# target 90% 1.03081
# target 99% 1.03313
# target 99.9% 1.03337
Sockets used: 1 (for perfect keepalive, would be 1)
Jitter: true
Code 200 : 30 (100.0 %)
All done 30 calls (plus 1 warmup) 1012.219 ms avg, 1.0 qps
```

### Connection Pool

Next, we configure the connection pool with a very simple set of limits:

```bash
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: simple-backend-dr
spec:
  host: simple-backend
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
        maxRetries: 1
        http2MaxRequests: 1
EOF
```

Here's what these settings mean:

+ `maxConnections`: the total number of connections allowed
+ `http1MaxPendingRequests`: the allowable number of requests that are pending and don't have a connection to use (queued requests)
+ `http2MaxRequests`: the maximum number of parallel requests across all endpoints/hosts, for both HTTP/2 and HTTP/1.1 (this setting is unfortunately misnamed in Istio)

Next, if we increase the number of connections and requests per second from 1 to 2, we can trip the circuit breaker:

```bash
docker run --rm --network host --name fortio fortio/fortio:1.6.8 load -H "Host: simple-web.istioinaction.io" -quiet -jitter -t 30s -c 2 -qps 2 http://<your-machine-ip>/
```

```bash
...
# target 50% 1.12
# target 75% 1.68
# target 90% 2.00125
# target 99% 2.01704
# target 99.9% 2.01862
Sockets used: 27 (for perfect keepalive, would be 2)
Jitter: true
Code 200 : 31 (55.4 %)
Code 500 : 25 (44.6 %)
All done 56 calls (plus 2 warmup) 815.563 ms avg, 1.8 qps
```

Some requests were returned as failed (HTTP 5xx). So how do we know for sure that they were rejected by circuit breaking and not caused by upstream failures? To answer that, we enable extended statistics collection for the service, using the `sidecar.istio.io/statsInclusionPrefixes` annotation, for example:

```yaml
...
  template:
    metadata:
      annotations:
        sidecar.istio.io/statsInclusionPrefixes: "cluster.outbound|80||simple-backend"
      labels:
        app: simple-web
...
```

Let's apply it to the `simple-web` service:

```bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: simple-web
  name: simple-web
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simple-web
  template:
    metadata:
      annotations:
        sidecar.istio.io/statsInclusionPrefixes: "cluster.outbound|80||simple-backend"
      labels:
        app: simple-web
    spec:
      serviceAccountName: simple-web
      containers:
      - env:
        - name: "LISTEN_ADDR"
          value: "0.0.0.0:8080"
        - name: "UPSTREAM_URIS"
          value: "http://simple-backend:80/"
        - name: "SERVER_TYPE"
          value: "http"
        - name: "NAME"
          value: "simple-web"
        - name: "MESSAGE"
          value: "Hello from simple-web!!!"
        - name: KUBERNETES_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: nicholasjackson/fake-service:v0.17.0
        imagePullPolicy: IfNotPresent
        name: simple-web
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        securityContext:
          privileged: false
EOF
```

Next, we reset all the statistics for the Istio proxy in the `simple-web` service:

```bash
kubectl exec -it deploy/simple-web -c istio-proxy -- curl -X POST localhost:15000/reset_counters
```

Then we generate load again, after an optional check that the limits are in place.
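To perform that optional check, we can inspect the circuit-breaker thresholds Envoy derived from the destination rule on the `simple-web` sidecar. A minimal sketch, assuming a single `simple-web` pod and that the workloads run in the `istioinaction` namespace (adjust the FQDN if yours differs):

```bash
# Grab the simple-web pod name (assumes one replica labeled app=simple-web)
WEB_POD=$(kubectl get pod -l app=simple-web -o jsonpath='{.items[0].metadata.name}')

# Inspect the circuit-breaker thresholds on the outbound simple-backend cluster
istioctl proxy-config cluster "$WEB_POD" \
  --fqdn simple-backend.istioinaction.svc.cluster.local \
  -o json | grep -A10 circuitBreakers
```

You should see thresholds such as `maxConnections`, `maxPendingRequests`, and `maxRequests` matching the limits in the destination rule.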
```bash
docker run --rm --network host --name fortio fortio/fortio:1.6.8 load -H "Host: simple-web.istioinaction.io" -quiet -jitter -t 30s -c 2 -qps 2 http://<your-machine-ip>/
```

And we can check the statistics from the Istio proxy:

```bash
kubectl exec -it deploy/simple-web -c istio-proxy -- curl localhost:15000/stats | grep simple-backend | grep overflow
```

```bash
cluster.outbound|80||simple-backend.istio-action.svc.cluster.local.upstream_cx_overflow: 54
cluster.outbound|80||simple-backend.istio-action.svc.cluster.local.upstream_cx_pool_overflow: 0
cluster.outbound|80||simple-backend.istio-action.svc.cluster.local.upstream_rq_pending_overflow: 24
cluster.outbound|80||simple-backend.istio-action.svc.cluster.local.upstream_rq_retry_overflow: 0
```

The statistics we're most interested in are `upstream_cx_overflow` and `upstream_rq_pending_overflow`, which indicate that enough connections and requests went over our specified thresholds (either too many requests in parallel or too many queued up) to trip the circuit breaker. When a request fails because it tripped a circuit-breaking threshold, Istio's service proxy adds an `x-envoy-overloaded` header.

### Removing misbehaving services with outlier detection

Istio uses Envoy's outlier-detection functionality to remove misbehaving hosts of a service from the load-balancing pool. To get started, let's reset everything:

```bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: simple-backend
  name: simple-backend-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simple-backend
  template:
    metadata:
      labels:
        app: simple-backend
    spec:
      serviceAccountName: simple-backend
      containers:
      - env:
        - name: "LISTEN_ADDR"
          value: "0.0.0.0:8080"
        - name: "SERVER_TYPE"
          value: "http"
        - name: "NAME"
          value: "simple-backend"
        - name: "MESSAGE"
          value: "Hello from simple-backend-1"
        - name: "TIMING_VARIANCE"
          value: "40ms"
        - name: "TIMING_50_PERCENTILE"
          value: "150ms"
        - name: KUBERNETES_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: nicholasjackson/fake-service:v0.17.0
        imagePullPolicy: IfNotPresent
        name: simple-backend
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        securityContext:
          privileged: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: simple-backend
  name: simple-backend-2
spec:
  replicas: 2
  selector:
    matchLabels:
      app: simple-backend
  template:
    metadata:
      labels:
        app: simple-backend
    spec:
      serviceAccountName: simple-backend
      containers:
      - env:
        - name: "LISTEN_ADDR"
          value: "0.0.0.0:8080"
        - name: "SERVER_TYPE"
          value: "http"
        - name: "NAME"
          value: "simple-backend"
        - name: "MESSAGE"
          value: "Hello from simple-backend-2"
        - name: "TIMING_VARIANCE"
          value: "10ms"
        - name: "TIMING_50_PERCENTILE"
          value: "150ms"
        - name: KUBERNETES_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: nicholasjackson/fake-service:v0.17.0
        imagePullPolicy: IfNotPresent
        name: simple-backend
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        securityContext:
          privileged: false
EOF
```

```bash
kubectl delete destinationrule --all
```

Next, we create a version of the `simple-backend` service that returns HTTP 500 on 75% of the calls.
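Before introducing the failures, it can be helpful to snapshot which `simple-backend` endpoints Envoy currently has in its load-balancing pool, so you can re-run the same command later and watch outlier detection eject the bad one. A hedged sketch, assuming the `istioinaction` namespace and a single `simple-web` pod:

```bash
# Grab the simple-web pod name (assumes one replica labeled app=simple-web)
WEB_POD=$(kubectl get pod -l app=simple-web -o jsonpath='{.items[0].metadata.name}')

# List the endpoints of the outbound simple-backend cluster and their health status
istioctl proxy-config endpoint "$WEB_POD" \
  --cluster "outbound|80||simple-backend.istioinaction.svc.cluster.local"
```

While everything is behaving, the status and outlier-check columns should report the endpoints as healthy. Now apply the failing version of `simple-backend-1`: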
```bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: simple-backend
  name: simple-backend-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simple-backend
  template:
    metadata:
      labels:
        app: simple-backend
    spec:
      serviceAccountName: simple-backend
      containers:
      - env:
        - name: "LISTEN_ADDR"
          value: "0.0.0.0:8080"
        - name: "SERVER_TYPE"
          value: "http"
        - name: "NAME"
          value: "simple-backend"
        - name: "MESSAGE"
          value: "Hello from simple-backend-1"
        - name: "ERROR_TYPE"
          value: "http_error"
        - name: "ERROR_RATE"
          value: "0.75"
        - name: "ERROR_CODE"
          value: "500"
        - name: KUBERNETES_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: nicholasjackson/fake-service:v0.14.1
        imagePullPolicy: IfNotPresent
        name: simple-backend
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        securityContext:
          privileged: false
EOF
```

Run our tests:

```bash
docker run --rm --network host --name fortio fortio/fortio:1.6.8 load -H "Host: simple-web.istioinaction.io" -allow-initial-errors -quiet -jitter -t 30s -c 10 -qps 20 http://<your-machine-ip>/
```

```
...
Code 200 : 313 (52.2 %)
Code 500 : 287 (47.8 %)
All done 600 calls (plus 10 warmup) 94.778 ms avg, 19.9 qps
```

Some calls did indeed fail. A service that is failing regularly may be overloaded or degraded, so we should stop sending traffic to it for a while and give it a chance to recover. Let's configure outlier detection to do exactly that:

```bash
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: simple-backend-dr
spec:
  host: simple-backend
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 1
      interval: 5s
      baseEjectionTime: 5s
      maxEjectionPercent: 100
EOF
```

In this destination rule, we configure `consecutive5xxErrors` with a value of 1, which means outlier detection trips after only one bad request. The `interval` setting specifies how often the Istio service proxy checks the hosts and decides whether to eject an endpoint (this happens automatically). If a service endpoint is ejected, it stays ejected for `n * baseEjectionTime`, where n is the number of times that particular endpoint has been ejected. Finally, we configure `maxEjectionPercent` with a value of 100, meaning we're willing to eject 100% of the hosts; in that case, no requests get through while all the hosts are misbehaving.

Re-run our tests:

```bash
docker run --rm --network host --name fortio fortio/fortio:1.6.8 load -H "Host: simple-web.istioinaction.io" -allow-initial-errors -quiet -jitter -t 30s -c 10 -qps 20 http://<your-machine-ip>/
```

```bash
...
Code 200 : 586 (97.7 %)
Code 500 : 14 (2.3 %)
All done 600 calls (plus 10 warmup) 160.247 ms avg, 19.9 qps
```

Our error rate is much lower, but we still have 14 failed calls: during each five-second interval, some requests reach the misbehaving host before the next outlier-detection check ejects it. Let's enable the default retry settings again to work around those last few errors:

```bash
istioctl install --set profile=demo --set meshConfig.defaultHttpRetryPolicy.attempts=2
```

Try the load test once again, and you should see no errors.

###### tags: `istio`