# The Definitive Debugging Guide for the cert-manager Webhook Pod
> ✨ This page will be copied over to cert-manager.io to gain visibility, see the pull request [#1002](https://github.com/cert-manager/website/pull/1002 "Validation webhook troubleshooting: add a step by step guide").
The cert-manager webhook, also called "validation webhook", is a pod that
runs as part of your cert-manager installation. When applying a manifest
with `kubectl`, the Kubernetes apiserver calls the cert-manager webhook
over TLS to validate your manifests. This guide helps you debug
connectivity issues between the Kubernetes apiserver and the cert-manager
webhook pod.
The error messages listed in this page are encountered while installing or
upgrading cert-manager, or shortly after installing or upgrading
cert-manager when trying to create a Certificate, Issuer, or any other
cert-manager custom resource.
> Last updated on 7 June 2022 (cert-manager 1.8).
In the below diagram, we show the common pattern when debugging an issue with the cert-manager webhook: when creating a cert-manager custom resource, the apiserver connects over TLS to the cert-manager webhook pod. The red cross indicates that the apiserver fails talking to the webhook.
<img alt="Diagram that shows a kubectl command that aims to create an issuer resource, and an arrow towards the Kubernetes apiserver, and an arrow between the apiserver and the webhook that indicates that the apiserver tries to connect to the webhook. This last arrow is crossed in red." src="https://i.imgur.com/9rIBMeC.png" width="500"/>
The rest of this document presents the known error messages that you may encounter.
## Error 1: `connect: connection refused`
> This issue was reported in 4 GitHub issues ([#2736](https://github.com/jetstack/cert-manager/issues/2736 "Getting WebHook Connection Refused error when using Azure DevOps Pipelines"), [#3133](https://github.com/jetstack/cert-manager/issues/3133 "Failed calling webhook webhook.cert-manager.io: connect: connection refused"), [#3445](https://github.com/jetstack/cert-manager/issues/3445 "Connection refused for cert-manager-webhook service"), [#4425](https://github.com/cert-manager/cert-manager/issues/4425 "Webhook error")), was reported in 1 GitHub issue in an external project ([aws-load-balancer-controller#1563](https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1563 "Internal error occurred: failed calling webhook webhook.cert-manager.io, no endpoints available")), on Stackoverflow ([serverfault#1076563](https://web.archive.org/web/20210903183221/https://serverfault.com/questions/1076563/creating-issuer-for-kubernetes-cert-manager-is-causing-404-and-500-error "Creating issuer for kubernetes cert-manager is causing 404 and 500 error")), and was mentioned in 13 Slack messages that can be listed with the search `in:#cert-manager in:#cert-manager-dev ":443: connect: connection refused"`. This error message can also be found in other projects that are building webhooks ([kubewarden-controller#110](https://github.com/kubewarden/kubewarden-controller/issues/110 "Investigate failure on webhooks not ready when installing cert-manager from helm chart: connection refused")).
Shortly after installing or upgrading cert-manager, you may hit this error when
creating a Certificate, Issuer, or any other cert-manager custom resource. For
example, creating an Issuer resource with the following command:
```sh
kubectl apply -f- <<EOF
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: example
spec:
  selfSigned: {}
EOF
```
shows the following error message:
```text
Error from server (InternalError): error when creating "STDIN":
Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook:
Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s":
dial tcp 10.96.20.99:443: connect: connection refused
```
When installing or upgrading cert-manager 1.5.0 and above with Helm, a very similar
error message may appear when running `helm install` or `helm upgrade`:
```text
Error: INSTALLATION FAILED: Internal error occurred:
failed calling webhook "webhook.cert-manager.io": failed to call webhook:
Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s":
dial tcp 10.96.20.99:443: connect: connection refused
```
The message "connection refused" happens when the apiserver tries to establish
a TCP connection with the cert-manager-webhook. In TCP terms, the apiserver
sent the `SYN` packet to start the TCP handshake, and received an `RST` packet
in return.
If we were to use tcpdump inside the control plane node where the apiserver
is running, we would see a packet returned to the apiserver:
```text
192.168.1.43 (apiserver) -> 10.96.20.99 (webhook pod) TCP 59466 → 443 [SYN]
10.96.20.99 (webhook pod) -> 192.168.1.43 (apiserver) TCP 443 → 59466 [RST, ACK]
```
The `RST` packet is sent by the Linux kernel when nothing is listening to the
requested port. The `RST` packet can also be returned by one of the TCP hops,
e.g., a firewall, as detailed in the StackOverflow page [What can be the reasons of connection refused errors?](https://stackoverflow.com/a/2333446/3808537).
Note that firewalls usually don't return an `RST` packet; they usually drop
the `SYN` packet entirely, and you end up with the error message `i/o
timeout` or `context deadline exceeded` described below.
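If you want to capture this traffic yourself, a minimal sketch follows. It assumes you have SSH and root access to the control plane node and that `tcpdump` is installed there; 10.96.20.99 is the cluster IP taken from the error message above, so replace it with the IP from your own error.
```sh
# Capture traffic to/from the webhook's cluster IP on the control plane node,
# then reproduce the error with kubectl apply and open the capture in Wireshark.
# Note: after the iptables rewrite you may see the pod IP and port 10250 instead,
# so you can widen the filter if nothing shows up.
tcpdump -ni any 'host 10.96.20.99 and tcp port 443' -w webhook.pcap
```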
Let's eliminate the possible causes from the closest to the source of the TCP
connection (the apiserver) to its destination (the cert-manager-webhook pod).
Let's imagine that the name `cert-manager-webhook.cert-manager.svc` was
resolved to 10.43.183.232. This is a cluster IP. The control plane node, in
which the apiserver process runs, uses its iptables rules (typically
programmed by kube-proxy) to rewrite the destination IP to one of the
webhook pod IPs. That might be the first problem: sometimes, no pod IP is
associated with a given cluster IP, because the pod IP is only added to the
Endpoints resource once the webhook's readiness probe succeeds.
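You can double-check which cluster IP the Service carries; the IP shown in the "connection refused" error should match it. A quick check, assuming the default Service name and namespace used by the Helm chart:
```sh
# The CLUSTER-IP column should match the IP that appears in the
# "connection refused" error message.
kubectl -n cert-manager get service cert-manager-webhook
```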
Let us first check whether it is a problem with the Endpoint resource:
```sh
kubectl get endpoints -n cert-manager cert-manager-webhook
```
A valid output would look like this:
```text
NAME ENDPOINTS AGE
cert-manager-webhook 10.244.0.2:10250 27d ✅
```
If you have this valid output and still get `connect: connection refused`,
then the issue is deeper in the networking stack. We won't dig into this
case, but you might want to use `tcpdump` and Wireshark to see whether
traffic properly flows from the apiserver to the node's host namespace. The
traffic from the host namespace to the pod's namespace already works fine
since the kubelet was already able to reach the readiness endpoint.
Common issues include a firewall dropping traffic from the control plane
to the workers; for example, the apiserver on GKE is only allowed to talk
to worker nodes (which is where the cert-manager webhook is running) over
port 10250. On EKS, your security groups might deny TCP traffic over port
10250 from the control plane's VPC towards the workers' VPC.
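To see which port the apiserver actually needs to reach, you can inspect the Service's target port and the container port it maps to. A quick check, assuming the default names used by the Helm chart:
```sh
# The Service's target port (often the named port "https")...
kubectl -n cert-manager get service cert-manager-webhook \
  -o jsonpath='{.spec.ports[0].targetPort}{"\n"}'

# ...and the container port it maps to on the webhook pod.
kubectl -n cert-manager get pods -l app.kubernetes.io/name=webhook \
  -o jsonpath='{.items[0].spec.containers[0].ports}{"\n"}'
```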
If you see `<none>`, it indicates that the cert-manager webhook is properly running
but its readiness endpoint can't be reached:
```text
NAME ENDPOINTS AGE
cert-manager-webhook <none> 236d ❌
```
To fix `<none>`, you will have to check whether the cert-manager-webhook deployment
is healthy. The ENDPOINTS column stays at `<none>` as long as the cert-manager-webhook
pod isn't marked as ready.
```sh
kubectl get pod -n cert-manager -l app.kubernetes.io/name=webhook
```
You should see that the pod is `Running`, and that the number of pods
that are ready is `0/1`:
```text
NAME READY STATUS RESTARTS AGE
cert-manager-webhook-76578c9687-24kmr 0/1 Running 7 (8h ago) 28d ❌
```
We won't be detailing the case where you get `1/1` and `Running`, since
it would indicate an inconsistent state in Kubernetes.
Continuing with `0/1`, that means the readiness endpoint isn't answering.
When that happens, no endpoint is created. The next step is to figure out
why the readiness endpoint isn't answering. Let us see which port the
kubelet is using when hitting the readiness endpoint:
```sh
kubectl -n cert-manager get deploy cert-manager-webhook -oyaml | grep -A5 readiness
```
In our example, the port that the kubelet will try to hit is 6080:
```yaml
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz
    port: 6080 # ✨
    scheme: HTTP
```
Now, let us port-forward to that port and see if `/healthz` works. In a shell session,
run:
```sh
kubectl -n cert-manager port-forward deploy/cert-manager-webhook 6080
```
In another shell session, run:
```sh
curl -sS --dump-header - 127.0.0.1:6080/healthz
```
The happy output is:
```http
HTTP/1.1 200 OK ✅
Date: Tue, 07 Jun 2022 17:16:56 GMT
Content-Length: 0
```
If the readiness endpoint doesn't work, you will see:
```text
curl: (7) Failed to connect to 127.0.0.1 port 6080 after 0 ms: Connection refused ❌
```
At this point, verify that the readiness endpoint is configured on that
same port. Let us see the logs to check that our webhook is listening on 6080
for its readiness endpoint:
```console
$ kubectl logs -n cert-manager -l app.kubernetes.io/name=webhook | head -10
I0607 webhook.go:129] "msg"="using dynamic certificate generating using CA stored in Secret resource"
I0607 server.go:133] "msg"="listening for insecure healthz connections" "address"=":6081" ❌
I0607 server.go:197] "msg"="listening for secure connections" "address"=":10250"
I0607 dynamic_source.go:267] "msg"="Updated serving TLS certificate"
...
```
In the above example, the issue was a misconfiguration of the readiness port. In
the webhook deployment, the argument `--healthz-port=6081` was mismatched with
the readiness configuration.
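To confirm such a mismatch on your own cluster, you can compare the readiness probe's port with the `--healthz-port` argument passed to the webhook container. A sketch, assuming the standard deployment name:
```sh
# The port used by the readiness probe...
kubectl -n cert-manager get deploy cert-manager-webhook \
  -o jsonpath='{.spec.template.spec.containers[0].readinessProbe.httpGet.port}{"\n"}'

# ...should match the port passed to --healthz-port in the container arguments.
kubectl -n cert-manager get deploy cert-manager-webhook \
  -o jsonpath='{.spec.template.spec.containers[0].args}{"\n"}'
```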
## Error 2: `i/o timeout`
> This error message was reported 26 times on Slack. To list these messages, do a search with `in:#cert-manager in:#cert-manager-dev "443: i/o timeout"`. The error message was also reported in 2 GitHub issues ([#2811](https://github.com/cert-manager/cert-manager/issues/2811 "i/o timeout from apiserver when connecting to webhook on k3s"), [#4073](https://github.com/cert-manager/cert-manager/issues/4073 "Internal error occurred: failed calling webhook")).
```text
Error from server (InternalError): error when creating "STDIN": Internal error occurred:
failed calling webhook "webhook.cert-manager.io": failed to call webhook:
Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s":
dial tcp 10.0.0.69:443: i/o timeout
```
When the apiserver tries to talk to the cert-manager webhook, the `SYN` packet
is never answered, and the connection times out. If we were to run tcpdump inside
the webhook's net namespace, we would see:
```text
192.168.1.43 (apiserver) -> 10.0.0.69 (webhook pod) TCP 44772 → 443 [SYN]
192.168.1.43 (apiserver) -> 10.0.0.69 (webhook pod) TCP [TCP Retransmission] 44772 → 443 [SYN]
192.168.1.43 (apiserver) -> 10.0.0.69 (webhook pod) TCP [TCP Retransmission] 44772 → 443 [SYN]
192.168.1.43 (apiserver) -> 10.0.0.69 (webhook pod) TCP [TCP Retransmission] 44772 → 443 [SYN]
```
This issue is caused by the `SYN` packet being dropped somewhere.
If you are using a "private" GKE cluster, your worker nodes won't have
external IPs assigned. Unlike public GKE clusters where the control plane
can freely talk to pods over any TCP port thanks to the external IPs, the
control plane in private GKE clusters can only talk to the pods in worker
nodes over TCP ports 10250 and 443. These are ports inside the pod,
not service ports: the Service used to expose the cert-manager webhook
must use a `targetPort` of 10250 or 443. Fortunately, the cert-manager
webhook listens on 10250 by default, and the Helm chart's Service reaches
it through the named target port `https`:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: cert-manager-webhook
spec:
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: "https"
```
If the port the webhook listens on is neither 443 nor 10250, you will have to add a new
firewall rule. You can read the section [Adding a firewall rule in a GKE
private
cluster](https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#add_firewall_rules)
in the Google documentation.
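As a sketch, such a rule can be created with `gcloud`; the network name, the control plane (master) CIDR, and the node tag below are placeholders that you need to replace with the values of your own private cluster:
```sh
# Allow the GKE control plane (master CIDR) to reach the webhook's port on the
# worker nodes. NETWORK, MASTER_CIDR, and NODE_TAG are placeholders.
gcloud compute firewall-rules create allow-apiserver-to-cert-manager-webhook \
  --network NETWORK \
  --source-ranges MASTER_CIDR \
  --target-tags NODE_TAG \
  --allow tcp:10250
```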
If you are on EKS and use a custom CNI such as Weave or Calico, the
Kubernetes apiserver (which runs on its own nodes) might not be able to
reach the webhook pod. This happens because the control plane cannot be
configured to run on a custom CNI on EKS, meaning that the CNI cannot
enable connectivity between the apiserver and the pods running on the
worker nodes. To address this, run the webhook on the host network with
`--set webhook.hostNetwork=true`. And because the kubelet already listens
on 10250 on the host, you will also have to move the webhook to another
port with `--set webhook.securePort=10260`.
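Put together, a Helm invocation that applies both workarounds could look like this (a sketch; adjust the version and namespace to your setup):
```sh
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --version v1.8.0 \
  --set webhook.hostNetwork=true \
  --set webhook.securePort=10260
```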
If you are using a network policy controller such as Calico, check that
there exists a policy allowing traffic from the apiserver to the webhook pod
over TCP port 10250.
If you are using AWS, the control plane is in its own VPC, and the worker
nodes are in theirs. You might want to check that your security groups
allow TCP traffic over port 10250 from the control plane's VPC to the
workers' VPC.
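As a sketch, you can inspect and, if needed, open that rule with the AWS CLI; the two security group IDs below are placeholders for your worker nodes' and control plane's security groups:
```sh
# Inspect the inbound rules of the worker nodes' security group (placeholder ID).
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0

# Allow TCP 10250 from the control plane's security group to the workers'.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 10250 \
  --source-group sg-0fedcba9876543210
```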
If you need to debug reachability issues (i.e., packets being dropped), we
advise using `tcpdump` along with Wireshark at every TCP hop. You can
follow the article [Debugging Kubernetes Networking: my kube-dns is not
working!](https://maelvls.dev/debugging-kubernetes-networking/) to learn
how to use `tcpdump` with Wireshark to debug networking issues.
## Error 3: `x509: certificate is valid for xxx.internal, not cert-manager-webhook.cert-manager.svc`
```text
Internal error occurred: failed calling webhook "webhook.cert-manager.io":
Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s:
x509: certificate is valid for ip-192-168-xxx-xxx.xxx.compute.internal,
not cert-manager-webhook.cert-manager.svc
```
> This issue was first reported in
> [#3237](https://github.com/cert-manager/cert-manager/issues/3237 "Can't
> create an issuer when cert-manager runs on EKS in Fargate pods (AWS)").
This is most probably because you are running on EKS with Fargate enabled. Fargate
creates a micro-VM per pod, and the VM's kernel is used to run the container in its
own namespace. The problem is that each micro-VM gets its own kubelet. As for any
Kubernetes node, the VM's port 10250 is listened to by a kubelet process. And 10250
is also the port that the cert-manager webhook listens on.
But that's not a problem: the kubelet process and the cert-manager webhook process
are running in two separate net namespaces, and ports don't clash. That's the case
both in traditional Kubernetes nodes, as well as inside a Fargate microVM.
The problem arises when the apiserver tries hitting the Fargate pod: the microVM's
host net namespace is configured to port-forward every possible port for maximum
compatibility with traditional pods, as demonstrated in this [StackOverflow page](https://stackoverflow.com/questions/66445207/eks-fargate-connect-to-local-kubelet "EKS Fargate connect to local kubelet").
But the port 10250 is already used by the microVM's kubelet, so anything hitting
this port won't be port-forwarded and will hit the kubelet instead.
To sum up, the cert-manager webhook looks healthy and is able to listen to port 10250
as per its logs, but the microVM's host does not port-forward 10250 to the webhook's
net namespace. That's the reason you see a message about an unexpected domain showing
up when doing the TLS handshake: although the cert-manager webhook is properly running,
the kubelet is the one responding to the apiserver.
This is a limitation of Fargate's microVMs: the IP of the pod and the IP of the node
are the same. It gives you the same experience as traditional pods, but it poses
networking challenges.
To fix the issue, the trick is to change the port the cert-manager webhook is
listening on. Using Helm, we can use the parameter `webhook.securePort`:
```sh
helm install \
cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--version v1.8.0 \
--set webhook.securePort=10260
```
## Error 4: `service "cert-managercert-manager-webhook" not found`
```text
Error from server (InternalError): error when creating "test-resources.yaml": Internal error occurred:
failed calling webhook "webhook.cert-manager.io": failed to call webhook:
Post "https://cert-managercert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s":
service "cert-managercert-manager-webhook" not found
```
This error was discussed in the following two issues:
- [#3195](https://github.com/jetstack/cert-manager/issues/3195 "service cert-manager-webhook not found")
- [#4999](https://github.com/cert-manager/cert-manager/issues/4999 "Verification on 1.7.2 fails (Kubectl apply), service cert-manager-webhook not found")
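The issues above discuss the root causes; as a first debugging step, you can compare the Service name referenced by the webhook configuration with the Services that actually exist. A sketch, assuming the webhook configuration is named `cert-manager-webhook`:
```sh
# The Service name and namespace that the apiserver tries to reach...
kubectl get validatingwebhookconfigurations cert-manager-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.service.namespace}/{.webhooks[0].clientConfig.service.name}{"\n"}'

# ...must match an existing Service:
kubectl get service -n cert-manager
```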
## Error 5: `no endpoints available for service "cert-manager-webhook"`
```text
Error: INSTALLATION FAILED: Internal error occurred:
failed calling webhook "webhook.cert-manager.io":
Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s:
no endpoints available for service "cert-manager-webhook"
```
This error is rare and was only seen in OVHcloud managed Kubernetes clusters, where
the etcd resource quota is quite low. When the quota is reached, the whole cluster
starts behaving erratically; one symptom is that Endpoints resources stop
being created.
The workaround is to remove some other resources to get under the limit. That, or move
to a Kubernetes offering that doesn't have this sort of limitation on core resources.
## Error 6: `x509: certificate has expired or is not yet valid`
> This error message was reported once in Slack: [1](https://kubernetes.slack.com/archives/C4NV3DWUC/p1618579222346800).
When using `kubectl apply`:
```text
Internal error occurred: failed calling webhook "webhook.cert-manager.io":
Post https://kubernetes.default.svc:443/apis/webhook.cert-manager.io/v1beta1/mutations?timeout=30s:
x509: certificate has expired or is not yet valid
```
This error showed up
[once](https://kubernetes.slack.com/archives/C4NV3DWUC/p1618579222346800).
Please reply in that Slack thread, since we are still unsure as to what may
cause this issue; to get access to the Kubernetes Slack, visit
[https://slack.k8s.io/](https://slack.k8s.io/).
## Error 7: `net/http: request canceled while waiting for connection`
```text
Error from server (InternalError): error when creating "STDIN":
Internal error occurred: failed calling webhook "webhook.cert-manager.io":
Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s:
net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```
## Error 8: `context deadline exceeded`
> This error message was reported in 4 GitHub issues ([2319](https://github.com/cert-manager/cert-manager/issues/2319 "Documenting context deadline exceeded errors relating to the webhook, on bare metal"), [2706](https://github.com/cert-manager/cert-manager/issues/2706 ""), [5189](https://github.com/cert-manager/cert-manager/issues/5189 "Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s: context deadline exceeded"), [5004](https://github.com/cert-manager/cert-manager/issues/5004 "After installing cert-manager using kubectl, cmctl check api fails with https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s: context deadline exceeded")), and once [on Stackoverflow](https://stackoverflow.com/questions/72059332/how-can-i-fix-failed-calling-webhook-webhook-cert-manager-io).
This error appears when trying to apply an Issuer or any other cert-manager
custom resource after having installed or upgraded cert-manager:
```text
Error from server (InternalError): error when creating "STDIN":
Internal error occurred: failed calling webhook "webhook.cert-manager.io":
Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s:
context deadline exceeded
```
> The message `context deadline exceeded` also appears when using `cmctl
> check api`. The cause is identical, you can continue reading this section
> to debug it.
>
> ```text
> Not ready: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook:
> Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s":
> context deadline exceeded
> ```
The above message corresponds to cert-manager 0.12 and above, where the
admission webhook mechanism is used (with the resources
ValidatingWebhookConfiguration and MutatingWebhookConfiguration).
In cert-manager 0.11 and below, the webhook relied on the APIService
mechanism, and the message looked a bit differently but the cause was the
same:
```text
Error from server (InternalError): error when creating "STDIN":
Internal error occurred: failed calling webhook "webhook.certmanager.k8s.io":
Post https://kubernetes.default.svc:443/apis/webhook.certmanager.k8s.io/v1beta1/mutations?timeout=30s:
context deadline exceeded
```
At this point, we want to understand which timeout we are hitting and why.
The Kubernetes apiserver may hit two different timeouts when talking to the
cert-manager webhook:
- `i/o timeout` is the TCP handshake timeout and comes from
[`DialTimeout`](https://pkg.go.dev/net#DialTimeout) in the Kubernetes
apiserver. The name resolution may be the cause, but usually, this
message appears after the apiserver sent the `SYN` packet and waited for
10 seconds for the `SYN-ACK` packet to be received from the cert-manager
webhook.
- `net/http: request canceled while waiting for connection (Client.Timeout
exceeded while awaiting headers)` is the HTTP response timeout and comes
from
[here](https://github.com/kubernetes/kubernetes/blob/abba1492f/staging/src/k8s.io/apiserver/pkg/util/webhook/webhook.go#L96-L101)
and is configured to [30
seconds](https://github.com/kubernetes/kubernetes/blob/abba1492f/staging/src/k8s.io/apiserver/pkg/util/webhook/webhook.go#L36-L38).
The Kubernetes apiserver has already sent the HTTP request and is waiting for
the HTTP response headers (e.g., `HTTP/1.1 200 OK`).
- `net/http: TLS handshake timeout` is when the TCP handshake is done, and
the Kubernetes apiserver sent the initial TLS handshake packet
(`ClientHello`) and waited for 10 seconds for the cert-manager webhook to
answer with the `ServerHello` packet.
- `context deadline exceeded` is a "higher level" timeout and acts on the
whole HTTP interaction: it includes the DNS resolution, the TCP
handshake, the TLS handshake, sending the HTTP request and receiving the
HTTP response. The Kubernetes apiserver often sets this timeout to 10 or
30 seconds. When talking to the cert-manager webhook, the apiserver also
tells the cert-manager webhook how long the timeout will be using the
query parameter `?timeout=10s` or `?timeout=30s`:
```text
Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s"
```
In this example, if the HTTP response is not received under 10 seconds,
the Kubernetes apiserver will error with the message `context deadline
exceeded`.
As you can guess, the message `context deadline exceeded` obfuscates which
part of the connection was hanging. Was it waiting for the TCP handshake,
or had the TCP connection already been established but the webhook wasn't
sending any HTTP response?
We can sort these messages in two categories: either it is a connectivity
issue (`SYN` is dropped), or it is a webhook issue (i.e., the TLS
certificate is wrong, or the webhook is not returning any HTTP response).
| Message | Category |
|-----------------------------------------------------|--------------------------------------|
| `i/o timeout` | connectivity issue |
| `net/http: request canceled while awaiting headers` | webhook issue |
| `net/http: TLS handshake timeout` | webhook issue |
| `context deadline exceeded` | either connectivity or webhook issue |
With the message `context deadline exceeded`, we first have to know whether
it is a connectivity or a webhook issue. To rule out that it is a webhook
issue, we can port-forward:
```sh
kubectl -n cert-manager port-forward deploy/cert-manager-webhook 10250
```
In another shell session, check that you can reach the webhook:
```sh
curl -vsS --resolve cert-manager-webhook.cert-manager.svc:10250:127.0.0.1 \
--service-name cert-manager-webhook-ca \
--cacert <(kubectl -n cert-manager get secret cert-manager-webhook-ca -ojsonpath='{.data.ca\.crt}' | base64 -d) \
https://cert-manager-webhook.cert-manager.svc:10250/validate 2>&1 -d@- <<'EOF' | sed '/^* /d; /bytes data]$/d; s/> //; s/< //'
{"kind":"AdmissionReview","apiVersion":"admission.k8s.io/v1","request":{"kind":{"group":"cert-manager.io","version":"v1","kind":"Certificate"},"resource":{"group":"cert-manager.io","version":"v1","resource":"certificates"},"subResource":"status","requestKind":{"group":"cert-manager.io","version":"v1","kind":"Certificate"},"requestResource":{"group":"cert-manager.io","version":"v1","resource":"certificates"},"requestSubResource":"status","name":"example","namespace":"default","operation":"UPDATE","userInfo":{"username":"system:admin","groups":["system:masters","system:authenticated"]},"object":{"apiVersion":"cert-manager.io/v1","kind":"Certificate","metadata":{},"spec":{"dnsNames":["example.boring.mael-valais-gcp.jetstacker.net"],"issuerRef":{"group":"cert-manager.io","kind":"Issuer","name":"letsencrypt"},"secretName":"example","usages":["digital signature","key encipherment"]},"status":{}},"oldObject":{"apiVersion":"cert-manager.io/v1","kind":"Certificate","metadata":{},"spec":{"dnsNames":["example.boring.mael-valais-gcp.jetstacker.net"],"issuerRef":{"group":"cert-manager.io","kind":"Issuer","name":"letsencrypt"},"secretName":"example","usages":["digital signature","key encipherment"]},"status":{}},"dryRun":false,"options":{"kind":"UpdateOptions","apiVersion":"meta.k8s.io/v1"}}}
EOF
```
The happy output looks like this:
```http
POST /validate HTTP/1.1
Host: cert-manager-webhook.cert-manager.svc:10250
User-Agent: curl/7.83.0
Accept: */*
Content-Length: 1299
Content-Type: application/x-www-form-urlencoded
HTTP/1.1 200 OK
Date: Wed, 08 Jun 2022 14:52:21 GMT
Content-Length: 2029
Content-Type: text/plain; charset=utf-8
...
"response": {
"uid": "",
"allowed": true
}
```
If the response shows `200 OK`, then we were able to rule out the webhook
issue. Since the message is `context deadline exceeded`, and not a TLS
issue, we can conclude that the problem is a connectivity issue. Please
follow the instructions in the `i/o timeout` section above to continue
debugging.
## Error 9: `net/http: TLS handshake timeout`
> This error message was reported in 1 GitHub issue ([#2602](https://github.com/cert-manager/cert-manager/issues/2602 "Internal error occurred: failed calling webhook webhook.cert-manager.io: Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s: net/http: TLS handshake timeout")).
```text
Error from server (InternalError): error when creating "STDIN":
Internal error occurred: failed calling webhook "webhook.cert-manager.io":
Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s:
net/http: TLS handshake timeout
```
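To check whether the webhook is able to complete a TLS handshake at all, you can port-forward to the webhook pod and run a handshake by hand; a sketch, assuming the webhook listens on its default port 10250:
```sh
# In a first shell session, forward the webhook's secure port locally.
kubectl -n cert-manager port-forward deploy/cert-manager-webhook 10250

# In a second shell session, attempt a TLS handshake. A healthy webhook answers
# with its serving certificate; a handshake that hangs points at the webhook itself.
openssl s_client -connect 127.0.0.1:10250 \
  -servername cert-manager-webhook.cert-manager.svc </dev/null
```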
## Error 10: `HTTP probe failed with statuscode: 500`
> This error message was reported in 2 GitHub issues ([#3185](https://github.com/cert-manager/cert-manager/issues/3185 "kubectl install cert-manager: Readiness probe failed: HTTP probe failed with statuscode: 500"), [#4557](https://github.com/cert-manager/cert-manager/issues/4557 "kubectl install cert-manager: Readiness probe failed: HTTP probe failed with statuscode: 500")).
The error message is visible as an event on the cert-manager webhook:
```text
Warning Unhealthy <invalid> (x13 over 15s) kubelet, node83
Readiness probe failed: HTTP probe failed with statuscode: 500
```
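Debugging this is similar to the readiness problems covered in Error 1: look at the webhook logs and hit the readiness endpoint yourself. A brief sketch, assuming the default readiness port 6080:
```sh
# Check the webhook logs around the time of the failed probe.
kubectl -n cert-manager logs -l app.kubernetes.io/name=webhook --tail=20

# In one shell session, forward the readiness port (6080 by default).
kubectl -n cert-manager port-forward deploy/cert-manager-webhook 6080

# In another shell session, hit the readiness endpoint to see the 500 yourself.
curl -sS --dump-header - 127.0.0.1:6080/healthz
```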
## Error 11: `Service Unavailable`
> This error was reported in 1 GitHub issue ([#4281](https://github.com/cert-manager/cert-manager/issues/4281 "Can't deploy Issuer, Service Unavailable"))
```text
Error from server (InternalError): error when creating "STDIN": Internal error occurred:
failed calling webhook "webhook.cert-manager.io":
Post "https://my-cert-manager-webhook.default.svc:443/mutate?timeout=10s":
Service Unavailable
```
The above message appears in Kubernetes clusters using the Weave CNI.
## Error 12: `failed calling admission webhook: the server is currently unable to handle the request`
> This issue was reported in 4 GitHub issues ([1369](https://github.com/cert-manager/cert-manager/issues/1369 "the server is currently unable to handle the request"), [1425](https://github.com/cert-manager/cert-manager/issues/1425 "Verifying Install: failed calling admission webhook (Azure, GKE private cluster)"), [3542](https://github.com/cert-manager/cert-manager/issues/3542 "SSL Certificate Manager has got expired, we need to renew SSL certificate in existing ClusterIssuer Kubernetes Service (AKS)"), [4852](https://github.com/cert-manager/cert-manager/issues/4852 "error: unable to retrieve the complete list of server APIs: webhook.cert-manager.io/v1beta1: the server is currently unable to handle the request (AKS)")).
```text
Error from server (InternalError): error when creating "test-resources.yaml": Internal error occurred:
failed calling admission webhook "issuers.admission.certmanager.k8s.io":
the server is currently unable to handle the request
```
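This message is typical of an aggregated API being unavailable. If you are running cert-manager 0.11 or below, which relied on the APIService mechanism mentioned in the `context deadline exceeded` section above, a first check is whether the corresponding APIService is reported as available; this is only a sketch of one possible cause:
```sh
# An APIService whose AVAILABLE column says "False" (for example ServiceNotFound
# or FailedDiscoveryCheck) explains "the server is currently unable to handle the request".
kubectl get apiservice | grep certmanager
```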
## Error 13: `x509: certificate signed by unknown authority`
> Reported in GitHub issues
> ([2602](https://github.com/cert-manager/cert-manager/issues/2602#issuecomment-606474055 "x509: certificate signed by unknown authority"))
When installing or upgrading cert-manager and using a namespace that is not
`cert-manager`:
```text
Error: UPGRADE FAILED: release core-l7 failed, and has been rolled back due to atomic being set:
failed to create resource: conversion webhook for cert-manager.io/v1alpha3, Kind=ClusterIssuer failed:
Post https://cert-manager-webhook.core-l7.svc:443/convert?timeout=30s:
x509: certificate signed by unknown authority
```
A very similar error message may show when creating an Issuer or any other
cert-manager custom resource:
```text
Internal error occurred: failed calling webhook "webhook.cert-manager.io":
Post https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s:
x509: certificate signed by unknown authority
```
With `cmctl install` and `cmctl check api`, you might see the following
error message:
```text
2022/06/06 15:36:30 Not ready: the cert-manager webhook CA bundle is not injected yet
(Internal error occurred: conversion webhook for cert-manager.io/v1alpha2, Kind=Certificate failed:
Post "https://<company_name>-cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s":
x509: certificate signed by unknown authority)
```
If you are using cert-manager 0.14 or below with Helm and are installing
into a namespace other than `cert-manager`, the CRD manifest had the
namespace name `cert-manager` hardcoded. You can see the hardcoded
namespace in the following annotation:
```sh
kubectl get crd issuers.cert-manager.io -oyaml | grep inject
```
You will see the following:
```yaml
cert-manager.io/inject-ca-from-secret: cert-manager/cert-manager-webhook-ca
# ^^^^^^^^^^^^
# hardcoded
```
> **Note 1:** this bug in the cert-manager Helm chart was
> [fixed](https://github.com/cert-manager/cert-manager/commit/f33beefc3289725b8d52fee0109138a9a4b7840e)
> in cert-manager 0.15.
>
> **Note 2:** since cert-manager 1.6, this annotation is [not being
> used](https://github.com/cert-manager/cert-manager/pull/4841) anymore on
> the cert-manager CRDs since there is no need for conversion anymore.
The solution, if you are still using cert-manager 0.14 or below, is to
render the manifest using `helm template`, then edit the annotation to use
the correct namespace, and then use `kubectl apply` to install
cert-manager.
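A sketch of that workaround, assuming you are installing into a hypothetical namespace `my-namespace` (pin `--version` to the cert-manager version you are actually installing):
```sh
# Render the manifests, rewrite the hardcoded "cert-manager" namespace in the
# inject-ca-from-secret annotation, and apply the result. "my-namespace" is a
# placeholder for your own namespace.
helm template cert-manager jetstack/cert-manager --namespace my-namespace \
  | sed 's|cert-manager/cert-manager-webhook-ca|my-namespace/cert-manager-webhook-ca|g' \
  | kubectl apply -f -
```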
If you are using cert-manager 1.6 and below, the issue might be due to the
cainjector being stuck trying to inject the self-signed certificate that
the cert-manager webhook created and stored in the Secret resource
`cert-manager-webhook-ca` into the `spec.caBundle` field of the
cert-manager CRDs. The first step is to check whether the cainjector is
running with no problem:
```console
$ kubectl -n cert-manager get pods -l app.kubernetes.io/name=cainjector
NAME READY STATUS RESTARTS AGE
cert-manager-cainjector-5c55bb7cb4-6z4cf 1/1 Running 11 (31h ago) 28d
```
Looking at the logs, you will be able to tell if the leader election
worked. It can take up to one minute for the leader election to
complete.
```console
I0608 start.go:126] "starting" version="v1.8.0" revision="e466a521bc5455def8c224599c6edcd37e86410c"
I0608 leaderelection.go:248] attempting to acquire leader lease kube-system/cert-manager-cainjector-leader-election...
I0608 leaderelection.go:258] successfully acquired lease kube-system/cert-manager-cainjector-leader-election
I0608 controller.go:186] cert-manager/secret/customresourcedefinition/controller/controller-for-secret-customresourcedefinition "msg"="Starting Controller"
I0608 controller.go:186] cert-manager/certificate/customresourcedefinition/controller/controller-for-certificate-customresourcedefinition "msg"="Starting Controller"
I0608 controller.go:220] cert-manager/secret/customresourcedefinition/controller/controller-for-secret-customresourcedefinition "msg"="Starting workers" "worker count"=1
I0608 controller.go:220] cert-manager/certificate/customresourcedefinition/controller/controller-for-certificate-customresourcedefinition "msg"="Starting workers" "worker count"=1
```
The happy output contains lines like this:
```console
I0608 sources.go:184] cert-manager/secret/customresourcedefinition/generic-inject-reconciler
"msg"="Extracting CA from Secret resource" "resource_name"="issuers.cert-manager.io" "secret"="cert-manager/cert-manager-webhook-ca"
I0608 controller.go:178] cert-manager/secret/customresourcedefinition/generic-inject-reconciler
"msg"="updated object" "resource_name"="issuers.cert-manager.io"
```
Now, look for any message that indicates that the Secret resource that the
cert-manager webhook created can't be loaded. An error message that
might show up is:
```text
E0608 sources.go:201] cert-manager/secret/customresourcedefinition/generic-inject-reconciler
"msg"="unable to fetch associated secret" "error"="Secret \"cert-manager-webhook-caq\" not found"
```
The following message indicates that the given CRD has been skipped because
the annotation is missing. You can ignore these messages:
```text
I0608 controller.go:156] cert-manager/secret/customresourcedefinition/generic-inject-reconciler
"msg"="failed to determine ca data source for injectable" "resource_name"="challenges.acme.cert-manager.io"
```
If nothing seems wrong with the cainjector logs, you will want to check
that the `spec.caBundle` field in the validation, mutation, and conversion
configurations are correct. The Kubernetes apiserver uses the contents of
that field to trust the cert-manager webhook. The `caBundle` contains the
self-signed CA created by the cert-manager webhook when it started.
```console
$ kubectl get validatingwebhookconfigurations cert-manager-webhook -ojson | jq '.webhooks[].clientConfig'
{
  "caBundle": "LS0tLS1...LS0tLS0K",
  "service": {
    "name": "cert-manager-webhook",
    "namespace": "cert-manager",
    "path": "/validate",
    "port": 443
  }
}
```
```console
$ kubectl get mutatingwebhookconfigurations cert-manager-webhook -ojson | jq '.webhooks[].clientConfig'
{
  "caBundle": "LS0tLS1...RFLS0tLS0K",
  "service": {
    "name": "cert-manager-webhook",
    "namespace": "cert-manager",
    "path": "/mutate",
    "port": 443
  }
}
```
Let us see the contents of the `caBundle`:
```console
$ kubectl get mutatingwebhookconfigurations cert-manager-webhook -ojson \
| jq '.webhooks[].clientConfig.caBundle' -r | base64 -d \
| openssl x509 -noout -text -in -
Certificate:
Data:
Version: 3 (0x2)
Serial Number:
ee:8f:4f:c8:55:7b:16:76:d8:6a:a2:e5:94:bc:7c:6b
Signature Algorithm: ecdsa-with-SHA384
Issuer: CN = cert-manager-webhook-ca
Validity
Not Before: May 10 16:13:37 2022 GMT
Not After : May 10 16:13:37 2023 GMT
Subject: CN = cert-manager-webhook-ca
```
Let us also check that it matches the CA certificate stored in the Secret
resource `cert-manager-webhook-ca`:
```console
$ kubectl -n cert-manager get secret cert-manager-webhook-ca -ojsonpath='{.data.ca\.crt}' \
| base64 -d | openssl x509 -noout -text -in -
Certificate:
Data:
Version: 3 (0x2)
Serial Number:
ee:8f:4f:c8:55:7b:16:76:d8:6a:a2:e5:94:bc:7c:6b
Signature Algorithm: ecdsa-with-SHA384
Issuer: CN = cert-manager-webhook-ca
Validity
Not Before: May 10 16:13:37 2022 GMT
Not After : May 10 16:13:37 2023 GMT
Subject: CN = cert-manager-webhook-ca
```
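Rather than comparing the two certificates by eye, you can diff them directly; a sketch:
```sh
# Both commands must print the exact same CA certificate; a difference means
# the cainjector has not injected the current CA into the webhook configuration.
diff \
  <(kubectl get validatingwebhookconfigurations cert-manager-webhook -ojson \
      | jq -r '.webhooks[0].clientConfig.caBundle' | base64 -d) \
  <(kubectl -n cert-manager get secret cert-manager-webhook-ca -ojsonpath='{.data.ca\.crt}' \
      | base64 -d)
```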
Our final test is to try to connect to the webhook using this trust bundle.
Let us port-forward to the webhook pod:
```sh
kubectl -n cert-manager port-forward deploy/cert-manager-webhook 10250
```
In another shell session, send a `/validate` HTTP request with the
following command:
```sh
curl -vsS --resolve cert-manager-webhook.cert-manager.svc:10250:127.0.0.1 \
--service-name cert-manager-webhook-ca \
--cacert <(kubectl get validatingwebhookconfigurations cert-manager-webhook -ojson | jq '.webhooks[].clientConfig.caBundle' -r | base64 -d) \
https://cert-manager-webhook.cert-manager.svc:10250/validate 2>&1 -d@- <<'EOF' | sed '/^* /d; /bytes data]$/d; s/> //; s/< //'
{"kind":"AdmissionReview","apiVersion":"admission.k8s.io/v1","request":{"kind":{"group":"cert-manager.io","version":"v1","kind":"Certificate"},"resource":{"group":"cert-manager.io","version":"v1","resource":"certificates"},"subResource":"status","requestKind":{"group":"cert-manager.io","version":"v1","kind":"Certificate"},"requestResource":{"group":"cert-manager.io","version":"v1","resource":"certificates"},"requestSubResource":"status","name":"example","namespace":"default","operation":"UPDATE","userInfo":{"username":"system:admin","groups":["system:masters","system:authenticated"]},"object":{"apiVersion":"cert-manager.io/v1","kind":"Certificate","metadata":{},"spec":{"dnsNames":["example.boring.mael-valais-gcp.jetstacker.net"],"issuerRef":{"group":"cert-manager.io","kind":"Issuer","name":"letsencrypt"},"secretName":"example","usages":["digital signature","key encipherment"]},"status":{}},"oldObject":{"apiVersion":"cert-manager.io/v1","kind":"Certificate","metadata":{},"spec":{"dnsNames":["example.boring.mael-valais-gcp.jetstacker.net"],"issuerRef":{"group":"cert-manager.io","kind":"Issuer","name":"letsencrypt"},"secretName":"example","usages":["digital signature","key encipherment"]},"status":{}},"dryRun":false,"options":{"kind":"UpdateOptions","apiVersion":"meta.k8s.io/v1"}}}
EOF
```
You should see a successful HTTP request and response:
```http
POST /validate HTTP/1.1
Host: cert-manager-webhook.cert-manager.svc:10250
User-Agent: curl/7.83.0
Accept: */*
Content-Length: 1299
Content-Type: application/x-www-form-urlencoded
HTTP/1.1 200 OK
Date: Wed, 08 Jun 2022 16:20:45 GMT
Content-Length: 2029
Content-Type: text/plain; charset=utf-8
...
```
## Error 14: `cluster scoped resource "mutatingwebhookconfigurations/" is managed and access is denied`
> This message was reported in GitHub issue
> [3717](https://github.com/cert-manager/cert-manager/issues/3717 "Cannot
> install on GKE autopilot cluster due to mutatingwebhookconfigurations
> access denied").
While installing cert-manager on GKE Autopilot, you will see the following
message:
```text
Error: rendered manifests contain a resource that already exists. Unable to continue with install:
could not get information about the resource:
mutatingwebhookconfigurations.admissionregistration.k8s.io "cert-manager-webhook" is forbidden:
User "XXXX" cannot get resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope:
GKEAutopilot authz: cluster scoped resource "mutatingwebhookconfigurations/" is managed and access is denied
```
This error message will appear when using Kubernetes 1.20 and below with
GKE Autopilot. It is due to a [restriction on mutating admission webhooks
in GKE
Autopilot](https://github.com/cert-manager/cert-manager/issues/3717).
As of October 2021, the "rapid" Autopilot release channel has rolled out
version 1.21 for Kubernetes masters. Installation via the Helm chart may
end in an error message but cert-manager is reported to be working by some
users. Feedback and PRs are welcome.
## Error 15: `the namespace "kube-system" is managed and the request's verb "create" is denied`
When installing cert-manager on GKE Autopilot with Helm, you will see the
following error message:
```text
Not ready: the cert-manager webhook CA bundle is not injected yet
```
After this failure, you should still see the three pods happily running:
```console
$ kubectl get pods -n cert-manager
NAME READY STATUS RESTARTS AGE
cert-manager-76578c9687-24kmr 1/1 Running 0 47m
cert-manager-cainjector-b7d47f746-4799n 1/1 Running 0 47m
cert-manager-webhook-7f788c5b6-mspnt 1/1 Running 0 47m
```
But looking at the logs of any of these pods, you will see the following error
message:
```text
E0425 leaderelection.go:334] error initially creating leader election record:
leases.coordination.k8s.io is forbidden: User "system:serviceaccount:cert-manager:cert-manager-webhook"
cannot create resource "leases" in API group "coordination.k8s.io" in the namespace "kube-system":
GKEAutopilot authz: the namespace "kube-system" is managed and the request's verb "create" is denied
```
That is due to a limitation of GKE Autopilot. It is not possible to create
resources in the `kube-system` namespace, and cert-manager uses the
well-known `kube-system` namespace to manage the leader election. To get
around the limitation, you can tell Helm to use a different namespace for
the leader election:
```sh
helm install cert-manager jetstack/cert-manager --version 1.8.0 \
--namespace cert-manager --create-namespace \
--set global.leaderElection.namespace=cert-manager
```