# Using `timeoutSeconds: 10` instead of 30 seconds since cert-manager 1.12 This change from 30 to 10 seconds happened in https://github.com/cert-manager/cert-manager/pull/3323. Unfortunately, in case of a network issue, lowering to 10 seconds hides useful error messages under the unhelpful `context deadline exceeded`: ```text Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.certmanager.k8s.io": Post https://kubernetes.default.svc:443/apis/webhook.certmanager.k8s.io/v1beta1/mutations?timeout=30s: context deadline exceeded ``` By have been setting `timeoutSeconds` to 10 seconds, the TCP handshake timeout (10 seconds), TLS handshake timeout (30 seconds) and HTTP response headers timeout (10 seconds) never show: ```text context deadline exceeded | 10 seconds | timeout v +-------------------------------------------------------------------------+ i/o timeout | net/http: TLS handshake timeout 10 seconds | | timeout v | +------------+ 30 seconds | net/http: request canceled |TCP | timeout v while awaiting headers |handshake +---------------------+ | +------------| TLS | | | handshake +------------+ 10 seconds | +---------------------| sending | timeout v | request +------------+ +------------|receiving |------+ |resp. header| recv.| +------------+ resp.| | body +-----+ +------|other| |logic| +-----+ ``` <!-- https://textik.com/#5c3239d6c79478e3 --> <!-- Kubernetes "wraps" the HTTP calls made to the webhook with a context: https://github.com/kubernetes/kubernetes/blob/8d25cc31939/staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/validating/dispatcher.go#L246-L251 --> When someone needs to debug `context deadline exceeded`, I would recommend people to set the timeoutSeconds to its maximum (30 seconds). One quick way of doing that is to run the following: ```bash kubectl patch validatingwebhookconfigurations cert-manager-webhook --type=json \ -p '[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 30}]' kubectl patch mutatingwebhookconfigurations cert-manager-webhook --type=json \ -p '[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 30}]' ``` ## Reproducing `EOF` Let's hijack the webhook deployment so that validation and mutation requests are sent to my machine's host namespace using Telepresence v1: ``` telepresence --swap-deployment cert-manager-webhook --namespace cert-manager --run-shell ``` Let's trigger a webhook request: ``` kubectl apply -f- <<<'{"apiVersion":"cert-manager.io/v1","kind":"Certificate","metadata":{"name":"test"}}' ``` When nothing listening on 12500, kube-apiserver tells us "EOF": ``` failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": EOF ``` ## Reproducing `context deadline exceeded` Now, let us run something on 12500 so that the TCP connection succeeds but not the TLS connection: ``` sudo nc -v -l 10250 ``` Let's trigger the webhook request again: ``` failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": context deadline exceeded ``` ## Reproducing `net/http: TLS handshake timeout` In the previous test, we were hitting the "overall" context timeout of 10 seconds. It is configured in the validation and mutation configurations. Let us configure it to 30 seconds, which the maximum kube-apiserver will accept: ```bash kubectl patch validatingwebhookconfigurations cert-manager-webhook --type=json \ -p '[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 30}]' kubectl patch mutatingwebhookconfigurations cert-manager-webhook --type=json \ -p '[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 30}]' ``` If we trigger a webhook request again, we can now see the "actual" problem: ``` failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s": net/http: TLS handshake timeout ``` This means the TCP connection succeeded, but Client Hello was not responded to, which is what we were expecting. ## Reproducing `connect: connection refused` Now, let us stop the Telepresence session to see what happens when the pod port doesn't have anything listening on 12500: ``` kubectl -n cert-manager get deploy cert-manager-webhook -ojson \ | jq 'del(.spec.template.spec.containers[0].args[] | select(test("--secure-port="))) | .spec.template.spec.containers[0].args += ["--secure-port=12599"]' \ | kubectl apply -f- ``` Now, I trigger the request to the webhook: ``` kubectl apply -f- <<<'{"apiVersion":"cert-manager.io/v1","kind":"Certificate","metadata":{"name":"test"}}' ``` The TCP connection can't be established: ``` failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s": dial tcp 10.0.159.136:443: connect: connection refused ``` ![](https://hackmd.io/_uploads/rJQl4BKpi.png) ## Reproducing `context deadline exceeded (Client.Timeout exceeded while awaiting headers)` First, let's pretend that packets are dropped: ```bash sudo iptables -A INPUT -p tcp --destination-port 12345 -j DROP ``` Then: ```go package main import ( "context" "fmt" "net" "net/http" "time" ) func main() { tests := []struct { contextTimeout time.Duration clientTimeout time.Duration dialTimeout time.Duration tlsHandshakeTimeout time.Duration respHeaderTimeout time.Duration }{ { 100 * time.Millisecond, 1000 * time.Millisecond, 1000 * time.Millisecond, 1000 * time.Millisecond, 1000 * time.Millisecond, }, { 1000 * time.Millisecond, 100 * time.Millisecond, 1000 * time.Millisecond, 1000 * time.Millisecond, 1000 * time.Millisecond, }, { 2000 * time.Millisecond, 100 * time.Millisecond, 1000 * time.Millisecond, 1000 * time.Millisecond, 1000 * time.Millisecond, }, { 1000 * time.Millisecond, 1000 * time.Millisecond, 100 * time.Millisecond, 1000 * time.Millisecond, 1000 * time.Millisecond, }, { 2000 * time.Millisecond, 1000 * time.Millisecond, 100 * time.Millisecond, 1000 * time.Millisecond, 1000 * time.Millisecond, }, { 1000 * time.Millisecond, 1000 * time.Millisecond, 1000 * time.Millisecond, 100 * time.Millisecond, 1000 * time.Millisecond, }, { 2000 * time.Millisecond, 1000 * time.Millisecond, 1000 * time.Millisecond, 100 * time.Millisecond, 1000 * time.Millisecond, }, { 1000 * time.Millisecond, 1000 * time.Millisecond, 1000 * time.Millisecond, 1000 * time.Millisecond, 100 * time.Millisecond, }, { 2000 * time.Millisecond, 1000 * time.Millisecond, 1000 * time.Millisecond, 1000 * time.Millisecond, 100 * time.Millisecond, }, } fmt.Printf("| ContextTimeout | ClientTimeout | DialTimeout | TLSHandshakeTimeout | RespHeaderTimeout| ActualTimeout | Message |\n") fmt.Printf("|----------------|---------------|-------------|---------------------|------------------|---------------|---------|\n") for _, test := range tests { start := time.Now() ctx, _ := context.WithTimeout(context.Background(), test.contextTimeout) client := &http.Client{ Timeout: test.clientTimeout, Transport: &http.Transport{ DialContext: (&net.Dialer{ Timeout: test.dialTimeout, }).DialContext, TLSHandshakeTimeout: test.tlsHandshakeTimeout, ResponseHeaderTimeout: test.respHeaderTimeout, }, } req, err := http.NewRequestWithContext(ctx, "GET", "https://localhost:12345", nil) if err != nil { panic(err) } req = req.WithContext(ctx) _, err = client.Do(req) fmt.Printf("| %v | %v | %v | %v | %v | %v | %v |\n", test.contextTimeout, test.clientTimeout, test.dialTimeout, test.tlsHandshakeTimeout, test.respHeaderTimeout, time.Now().Sub(start), err) } } ``` When packets are dropped somewhere: | ContextTimeout | ClientTimeout | DialTimeout | TLSHandshakeTimeout | RespHeaderTimeout| ActualTimeout | Message | |----------------|---------------|-------------|---------------------|------------------|---------------|---------| | 100ms | 1s | 1s | 1s | 1s | 100.4ms | Get "https://localhost:12345": context deadline exceeded | | 1s | 100ms | 1s | 1s | 1s | 100.2ms | Get "https://localhost:12345": context deadline exceeded (Client.Timeout exceeded while awaiting headers) | | 2s | 100ms | 1s | 1s | 1s | 100.2ms | Get "https://localhost:12345": context deadline exceeded (Client.Timeout exceeded while awaiting headers) | | 1s | 1s | 100ms | 1s | 1s | 100.4ms | Get "https://localhost:12345": dial tcp 127.0.0.1:12345: i/o timeout | | 2s | 1s | 100ms | 1s | 1s | 100.3ms | Get "https://localhost:12345": dial tcp 127.0.0.1:12345: i/o timeout | | 1s | 1s | 1s | 100ms | 1s | 1s | Get "https://localhost:12345": context deadline exceeded | | 2s | 1s | 1s | 100ms | 1s | 1.0008s | Get "https://localhost:12345": context deadline exceeded (Client.Timeout exceeded while awaiting headers) | | 1s | 1s | 1s | 1s | 100ms | 1s | Get "https://localhost:12345": context deadline exceeded | | 2s | 1s | 1s | 1s | 100ms | 1s | Get "https://localhost:12345": context deadline exceeded (Client.Timeout exceeded while awaiting headers) | When port is listening but isn't doing TLS: | ContextTimeout | ClientTimeout | DialTimeout | TLSHandshakeTimeout | RespHeaderTimeout| ActualTimeout | Message | |----------------|---------------|-------------|---------------------|------------------|---------------|---------| | 100ms | 1s | 1s | 1s | 1s | 100.4ms | Get "https://localhost:12346": context deadline exceeded | | 1s | 100ms | 1s | 1s | 1s | 100.4ms | Get "https://localhost:12346": context deadline exceeded (Client.Timeout exceeded while awaiting headers) | | 2s | 100ms | 1s | 1s | 1s | 100.4ms | Get "https://localhost:12346": context deadline exceeded (Client.Timeout exceeded while awaiting headers) | | 1s | 1s | 100ms | 1s | 1s | 1.0005s | Get "https://localhost:12346": context deadline exceeded | | 2s | 1s | 100ms | 1s | 1s | 1.0004s | Get "https://localhost:12346": context deadline exceeded (Client.Timeout exceeded while awaiting headers) | | 1s | 1s | 1s | 100ms | 1s | 100.5ms | Get "https://localhost:12346": net/http: TLS handshake timeout | | 2s | 1s | 1s | 100ms | 1s | 100.5ms | Get "https://localhost:12346": net/http: TLS handshake timeout | | 1s | 1s | 1s | 1s | 100ms | 1.0003s | Get "https://localhost:12346": context deadline exceeded | | 2s | 1s | 1s | 1s | 100ms | 1.0003s | Get "https://localhost:12346": context deadline exceeded (Client.Timeout exceeded while awaiting headers) | When port not being listened to on remote: | 100ms | 1s | 1s | 1s | 1s | 300µs | Get "https://localhost:12347": dial tcp 127.0.0.1:12347: connect: connection refused | | 1s | 100ms | 1s | 1s | 1s | 300µs | Get "https://localhost:12347": dial tcp 127.0.0.1:12347: connect: connection refused | | 2s | 100ms | 1s | 1s | 1s | 100µs | Get "https://localhost:12347": dial tcp 127.0.0.1:12347: connect: connection refused | | 1s | 1s | 100ms | 1s | 1s | 100µs | Get "https://localhost:12347": dial tcp 127.0.0.1:12347: connect: connection refused | | 2s | 1s | 100ms | 1s | 1s | 200µs | Get "https://localhost:12347": dial tcp 127.0.0.1:12347: connect: connection refused | | 1s | 1s | 1s | 100ms | 1s | 100µs | Get "https://localhost:12347": dial tcp 127.0.0.1:12347: connect: connection refused | | 2s | 1s | 1s | 100ms | 1s | 100µs | Get "https://localhost:12347": dial tcp 127.0.0.1:12347: connect: connection refused | | 1s | 1s | 1s | 1s | 100ms | 100µs | Get "https://localhost:12347": dial tcp 127.0.0.1:12347: connect: connection refused | | 2s | 1s | 1s | 1s | 100ms | 100µs | Get "https://localhost:12347": dial tcp 127.0.0.1:12347: connect: connection refused | ## Reproducing `dial tcp: i/o timeout` Like before, make sure you are hijacking the cert-manager webhook using Telepresence v1: ```bash telepresence --swap-deployment cert-manager-webhook --namespace cert-manager --run-shell ``` Then, let us drop all traffic coming to 10250 so that the TCP connection cannot be established: ```bash sudo iptables -A INPUT -p tcp --destination-port 10250 -j DROP ``` Surpringly, it doesn't show `dial tcp: i/o timeout`. After investigating, it makes sense since the sshuttle proxy pod (started by Telepresence v1) is terminating the TCP connection. The error is `net/http: TLS handshake timeout`: ``` net/http: TLS handshake timeout ``` Instead of doing that through sshuttle, let's do this within the cert-manager-webhook pod. Let us exit Telepresence, and exec into the network namespace in order to drop TCP connections going to 10250: ```bash docker exec -it kind-control-plane bash -c \ 'ip netns exec $(ip netns identify $(ps -ax | grep /app/cmd/webhook/webhook | grep -v grep | awk '{print $1}')) iptables -A INPUT -p tcp --destination-port 10250 -j DROP' ``` This time, we get: ``` net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) ``` It seems like `tcp dial: i/o timeout` is by far the trickiest error to reproduce. It often involves Cilium or Weave Net. My understanding is that `tcp dial: i/o timeout` indicates that the TCP handshake succeeded, but that ```bash docker exec -it kind-control-plane "apt update && apt install tcpdump -y" docker exec -i kind-control-plane tcpdump -i any -s 0 -U -n -w - | wireshark -k -i - ``` The SYN packet is sent: ![](https://hackmd.io/_uploads/r13TjXtai.png) What is the difference between `dial tcp: i/o timeout` and `net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)`? ```bash docker exec -i kind-control-plane bash -c \ 'ip netns exec $(ip netns identify $(ps -ax | grep /app/cmd/webhook/webhook | grep -v grep | awk {print\ \$1})) tcpdump -i any -s 0 -U -n -w -' | wireshark -k -i - ``` `i/o timeout` doesnt seem to happen when kube-apiserver contacts the webhook. In [cert-manager#5189](https://github.com/cert-manager/cert-manager/issues/5189), `i/o timeout` happens in controller-runtime: > webhook.go:122] cert-manager "msg"="Failed initialising server" "error"="error building admission chain: Get "https://10.96.0.1:443/api": dial tcp 10.96.0.1:443: i/o timeout" It originates from [`c.discovery.ServerGroups()`](https://github.com/jetstack/cert-manager/blob/4a7fa9071085266e60132aba59c0f1bf06706b10/internal/plugin/admission/certificaterequest/approval/certificaterequest_approval.go#L283). In the following error, it originates from kube-apiserver talking to the webhook (v0.14.0). > Error from server (InternalError): error when creating "test-resources.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: dial tcp 10.43.18.211:443: i/o timeout Similarly, in [cert-manager#4918](https://github.com/cert-manager/cert-manager/issues/4918), the cainjector (v1.7.0) fails: > logr.go:265] cert-manager "msg"="Failed to get API Group-Resources" "error"="Get \"https://10.3.240.1:443/api?timeout=32s\": dial tcp 10.3.240.1:443: i/o timeout" ## Problems tcpdump + wireshark ```bash docker exec -it kind-control-plane bash -c "tcpdump -i any -s0 -n -w - 2>/dev/null" \ | wireshark -k -i - ``` ![](https://hackmd.io/_uploads/S1IwK7FTo.png) This is because stderr and stdout are "mixed" with `docker exec`. --- ```bash docker exec -it kind-control-plane bash -c "tcpdump -i any -s0 -n -w - 2>/dev/null" \ | wireshark -k -i - ``` ![](https://hackmd.io/_uploads/S1NMtXtao.png) "Frame too long" This is because, again, docker exec's `-t` combines stdout and stderr. Works: ```bash docker exec -i kind-control-plane tcpdump -i any -s 0 -U -n -w - | wireshark -k -i - ```