The cert-manager webhook is a pod that runs as part of your cert-manager installation. When applying a manifest with `kubectl`, the Kubernetes API server calls the cert-manager webhook over TLS to validate your manifests. This guide helps you debug connectivity issues between the Kubernetes API server and the cert-manager webhook pod.
The error messages listed on this page are encountered while installing or upgrading cert-manager, or shortly after installing or upgrading cert-manager when trying to create a Certificate, Issuer, or any other cert-manager custom resource.
The diagram below shows the common pattern when debugging an issue with the cert-manager webhook: when creating a cert-manager custom resource, the API server connects over TLS to the cert-manager webhook pod. The red cross indicates that the API server fails to talk to the webhook.
The rest of this document presents the known error messages that you may encounter.
## Error 1: `connect: connection refused`
This issue was reported in 4 GitHub issues (#2736, #3133, #3445, #4425), in 1 GitHub issue in an external project (aws-load-balancer-controller#1563), on Stack Overflow (serverfault#1076563), and in 13 Slack messages that can be listed with the search `in:#cert-manager in:#cert-manager-dev ":443: connect: connection refused"`. This error message can also be found in other projects that are building webhooks (kubewarden-controller#110).
Shortly after installing or upgrading cert-manager, you may hit this error when creating a Certificate, Issuer, or any other cert-manager custom resource. For example, creating an Issuer resource with the command below shows the error message.
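A minimal sketch of the reproduction, assuming a self-signed Issuer named `example` (the name and kind of resource don't matter):

```sh
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: example
spec:
  selfSigned: {}
EOF
```

The error message returned by the API server has roughly this shape (illustrative; the host, port, and timeout vary):

```
Error from server (InternalError): error when creating "STDIN": Internal error occurred:
failed calling webhook "webhook.cert-manager.io": Post
"https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": connect: connection refused
```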
When installing or upgrading cert-manager 1.5.0 and above with Helm, a very similar error message may appear when running `helm install` or `helm upgrade`.
The message "connection refused" happens when the API server tries to establish
a TCP connection with the cert-manager-webhook. In TCP terms, the API server
sent the SYN
packet to start the TCP handshake, and received an RST
packet
in return.
If we were to use `tcpdump` on the control plane node where the API server is running, we would see a packet returned to the API server:
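A sketch of such a capture, assuming the webhook is served over port 443 (adjust the interface and port to your cluster):

```sh
# Show only RST packets coming back on port 443.
tcpdump -ni any 'tcp port 443 and tcp[tcpflags] & (tcp-rst) != 0'
```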
The `RST` packet is sent by the Linux kernel when nothing is listening on the requested port. The `RST` packet can also be returned by one of the TCP hops, e.g., a firewall, as detailed in the Stack Overflow page What can be the reasons of connection refused errors?.
Note that firewalls usually don't return an `RST` packet; they usually drop the `SYN` packet entirely, and you end up with the error message `i/o timeout` or `context deadline exceeded`. If that is the case, continue your investigation with the sections Error 2: `i/o timeout` (connectivity issue) and Error 8: `context deadline exceeded`, respectively.
Let's eliminate the possible causes, from the closest to the source of the TCP connection (the API server) to its destination (the pod cert-manager-webhook).
Let's imagine that the name `cert-manager-webhook.cert-manager.svc` was resolved to 10.43.183.232. This is a cluster IP. The control plane node, on which the API server process runs, uses its iptables rules to rewrite the destination IP to a pod IP. That might be the first problem: sometimes, no pod IP is associated with a given cluster IP because the Endpoints resource isn't filled in with pod IPs as long as the readiness probe doesn't pass.
Let us first check whether it is a problem with the Endpoints resource.
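A sketch of the check, assuming cert-manager is installed in the `cert-manager` namespace:

```sh
kubectl -n cert-manager get endpoints cert-manager-webhook
```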
A valid output would look like this (illustrative; your pod IP, port, and age will differ):
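```
NAME                   ENDPOINTS           AGE
cert-manager-webhook   10.244.0.23:10250   27d
```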
If you have this valid output and still get `connect: connection refused`, then the issue is deeper in the networking stack. We won't dig into this case, but you might want to use `tcpdump` and Wireshark to see whether traffic properly flows from the API server to the node's host namespace. The traffic from the host namespace to the pod's namespace already works fine, since the kubelet was already able to reach the readiness endpoint.
Common issues include a firewall dropping traffic from the control plane to the workers; for example, the API server on GKE is only allowed to talk to worker nodes (which is where the cert-manager webhook is running) over port 10250. On EKS, your security groups might deny traffic from your control plane VPC towards your workers VPC over TCP port 10250.
If you see `<none>` instead (illustrative output below), it indicates that the cert-manager webhook is properly running but its readiness endpoint can't be reached:
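```
NAME                   ENDPOINTS   AGE
cert-manager-webhook   <none>      236d
```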
To fix `<none>`, you will have to check whether the cert-manager-webhook Deployment is healthy. The endpoints stay at `<none>` as long as cert-manager-webhook isn't marked as "healthy". To check, list the webhook pod; a sketch, assuming the Helm chart's default labels:
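```sh
kubectl -n cert-manager get pods -l app.kubernetes.io/name=webhook
```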
You should see that the pod is `Running` and that the number of ready containers is `0/1` (illustrative output below):
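```
NAME                                    READY   STATUS    RESTARTS   AGE
cert-manager-webhook-6d4c5c44bb-mdf6c   0/1     Running   0          9s
```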
We won't detail the case where you get `1/1` and `Running`, since it would indicate an inconsistent state in Kubernetes.
Continuing with `0/1`: the readiness endpoint isn't answering. When that happens, no endpoint is created. The next step is to figure out why the readiness endpoint isn't answering. Let us see which port the kubelet uses when hitting the readiness endpoint:
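A sketch of how to find the probe configuration (namespace and resource name assume a default Helm install):

```sh
kubectl -n cert-manager get deployment cert-manager-webhook -o yaml | grep -A5 readinessProbe
```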
In our example, the port that the kubelet will try to hit is 6080 (illustrative snippet below):
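```yaml
readinessProbe:
  httpGet:
    path: /healthz
    port: 6080
    scheme: HTTP
```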
Now, let us port-forward to that port and see if `/healthz` works. In a shell session, run:
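```sh
# Forward local port 6080 to the webhook's readiness port (6080 in this example).
kubectl -n cert-manager port-forward deploy/cert-manager-webhook 6080
```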
In another shell session, run:
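```sh
# Probe the readiness endpoint through the port-forward; -i shows the status line.
curl -i http://localhost:6080/healthz
```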
The happy output is a plain `200 OK` (illustrative):
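```
HTTP/1.1 200 OK
```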
If the readiness endpoint doesn't work, you will see a connection error instead; the exact message depends on where the connection fails (e.g., a `connection refused` from curl, or an error from the port-forward session).
At this point, verify that the readiness endpoint is configured on that same port. Let us look at the logs to check that our webhook is listening on 6080 for its readiness endpoint:
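A sketch, again assuming the Helm chart's default labels:

```sh
kubectl -n cert-manager logs -l app.kubernetes.io/name=webhook
```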
In the above example, the issue was a misconfiguration of the readiness port: in the webhook Deployment, the argument `--healthz-port=6081` was mismatched with the readiness probe configuration.
## Error 2: `i/o timeout` (connectivity issue)

This error message was reported 26 times on Slack. To list these messages, do a search with `in:#cert-manager in:#cert-manager-dev "443: i/o timeout"`. The error message was also reported in GitHub issues (#2811, #4073).
When the API server tries to talk to the cert-manager webhook, the `SYN` packet is never answered and the connection times out. If we were to run `tcpdump` inside the webhook's network namespace, we would see the `SYN` packets arriving unanswered, or no traffic at all, depending on where the packet is dropped. This issue is caused by the `SYN` packet being dropped somewhere.
### GKE Private Cluster

The default Helm configuration should work with GKE private clusters, but changing `securePort` might break it.
For context: unlike public GKE clusters, where the control plane can freely talk to pods over any TCP port, the control plane in private GKE clusters can only talk to the pods on worker nodes over TCP ports 10250 and 443. These two open ports refer to the `containerPort` inside the pod, not the port called `port` in the Service resource.
For it to work, the `containerPort` inside the Deployment must be either 10250 or 443; `containerPort` is configured by the Helm value `webhook.securePort`. By default, `webhook.securePort` is set to 10250.
To see if something is off with the `containerPort`, let us start by looking at the Service resource:
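A sketch, assuming the default namespace and resource names:

```sh
kubectl -n cert-manager get service cert-manager-webhook -o yaml
```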
Looking at the output, we see that the `targetPort` is set to `"https"` (abbreviated, illustrative output below):
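```yaml
apiVersion: v1
kind: Service
metadata:
  name: cert-manager-webhook
spec:
  ports:
    - name: https
      port: 443
      protocol: TCP
      targetPort: https
```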
The reason the above `port: 443` can't be the cause is that kube-proxy, which also runs on the control plane node, translates the webhook's cluster IP to a pod IP and also translates the above `port: 443` to the value in `containerPort`.
To see what is behind the target port `"https"`, we look at the Deployment resource:
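A sketch of the check, followed by an abbreviated, illustrative output in which `securePort` was changed to a custom value:

```sh
kubectl -n cert-manager get deployment cert-manager-webhook -o yaml
```

```yaml
    ports:
      - containerPort: 12345   # illustrative custom securePort
        name: https
```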
The output shows that the `containerPort` is not set to 10250, meaning that a new firewall rule will have to be added in Google Cloud.
To recap: if the above `containerPort` is neither 443 nor 10250 and you prefer not to change `containerPort` to 10250, you will have to add a new firewall rule. You can read the section Adding a firewall rule in a GKE private cluster in the Google documentation.
For context, the reason we did not default `securePort` to 443 is that binding to 443 requires an additional Linux capability (`NET_BIND_SERVICE`); 10250, on the other hand, doesn't require any additional capability.
### EKS with a custom CNI

If you are on EKS and you are using a custom CNI such as Weave or Calico, the Kubernetes API server (which is in its own node) might not be able to reach the webhook pod. This happens because the control plane cannot be configured to run on a custom CNI on EKS, meaning that the CNI cannot enable connectivity between the API server and the pods running on the worker nodes.
Supposing that you are using Helm, the workaround is to add the following values to your `values.yaml` file:
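A sketch of the values, using the `webhook.hostNetwork` and `webhook.securePort` settings described below:

```yaml
webhook:
  hostNetwork: true
  securePort: 10260
```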
Or if you are using Helm from the command-line, use the following flag:
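A sketch, assuming the chart is installed from the `jetstack` Helm repository:

```sh
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --set webhook.hostNetwork=true \
  --set webhook.securePort=10260
```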
By setting `hostNetwork` to `true`, the webhook pod runs in the host's network namespace. By running in the host's network namespace, the webhook pod becomes accessible over the node's IP, which means you work around the fact that kube-apiserver can't reach any pod IPs or cluster IPs.
By setting `securePort` to 10260 instead of relying on the default value (10250), you prevent a conflict between the webhook and the kubelet. The kubelet, an agent that runs on every Kubernetes worker node directly on the host, uses port 10250 to expose its internal API to kube-apiserver.
To understand how `hostNetwork` and `securePort` interact, we have to look at how the TCP connection is established. When the kube-apiserver process tries to connect to the webhook pod, kube-proxy (which also runs on control plane nodes, even without a CNI) kicks in and translates the webhook's cluster IP to the webhook's host IP.
The reason 10250 is used as the default `securePort` is that it works around another limitation of GKE private clusters, as detailed in the above section GKE Private Cluster.
### Network policies

Assuming that you are using the Helm chart with the default value of `webhook.securePort` (10250), and that you are using a network policy controller such as Calico, check that there is a policy allowing traffic from the API server to the webhook pod over TCP port 10250.
### EKS security groups

Assuming that you are using the Helm chart with the default value of `webhook.securePort` (10250), you might want to check that your AWS security groups allow TCP traffic over port 10250 from the control plane's VPC to the workers' VPC.
### Other causes

If none of the above causes apply, you will need to figure out why the webhook is unreachable.
To debug reachability issues (i.e., packets being dropped), we advise using `tcpdump` along with Wireshark at every TCP hop. You can follow the article Debugging Kubernetes Networking: my `kube-dns` is not working! to learn how to use `tcpdump` with Wireshark to debug networking issues.
## Error 3: `x509: certificate is valid for xxx.internal, not cert-manager-webhook.cert-manager.svc` (EKS with Fargate pods)

This issue was first reported in #3237.
This is most probably because you are running on EKS with Fargate enabled. Fargate creates a microVM per pod, and the VM's kernel is used to run the container in its own namespace. The problem is that each microVM gets its own kubelet. As on any Kubernetes node, the VM's port 10250 is listened on by a kubelet process. And 10250 is also the port that the cert-manager webhook listens on.
But that's not a problem in itself: the kubelet process and the cert-manager webhook process run in two separate network namespaces, and the ports don't clash. That's the case both on traditional Kubernetes nodes and inside a Fargate microVM.
The problem arises when the API server tries hitting the Fargate pod: the microVM's host net namespace is configured to port-forward every possible port for maximum compatibility with traditional pods, as demonstrated in the Stack Overflow page EKS Fargate connect to local kubelet. But the port 10250 is already used by the microVM's kubelet, so anything hitting this port won't be port-forwarded and will hit the kubelet instead.
To sum up, the cert-manager webhook looks healthy and is able to listen on port 10250 as per its logs, but the microVM's host does not port-forward 10250 to the webhook's network namespace. That's the reason you see a message about an unexpected domain when doing the TLS handshake: although the cert-manager webhook is properly running, the kubelet is the one responding to the API server.
This is a limitation of Fargate's microVMs: the IP of the pod and the IP of the node are the same. It gives you the same experience as traditional pods, but it poses networking challenges.
To fix the issue, the trick is to change the port the cert-manager webhook is listening on. Using Helm, we can use the parameter `webhook.securePort`:
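A sketch, using 10260 as the alternative port (any free port that the microVM port-forwards works):

```sh
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --set webhook.securePort=10260
```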
service "cert-managercert-manager-webhook" not found
We are unsure about the cause of this error; please comment on one of the GitHub issues above if you happen to come across it.
## Error 5: `no endpoints available for service "cert-manager-webhook"` (OVHcloud)

This issue was first reported once in Slack (1).
This error is rare and was only seen in OVHcloud managed Kubernetes clusters, where the etcd resource quota is quite low. etcd is the database where your Kubernetes resources (such as pods and deployments) are stored. OVHcloud limits the disk space used by your resources in etcd. When the limit is reached, the whole cluster starts behaving erratically, and one symptom is that Endpoints resources aren't created.
To verify that it is in fact a quota problem, you should be able to see messages like the following in your kube-apiserver logs:
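An illustrative example of the etcd quota error (the exact wording may differ):

```
rpc error: code = Unknown desc = etcdserver: mvcc: database space exceeded
```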
The workaround is to remove some resources, such as CertificateRequest resources, to get under the limit, as explained in OVHcloud's ETCD Quotas error, troubleshooting page.
## Error 6: `x509: certificate has expired or is not yet valid`

This error message was reported once in Slack (1).
The error appears when using `kubectl apply`.

Please reply to the above Slack message if you come across this issue, since we are still unsure what may cause it; to get access to the Kubernetes Slack, visit https://slack.k8s.io/.
## Error 7: `net/http: request canceled while waiting for connection`

This error was reported in 1 Slack message (1).
## Error 8: `context deadline exceeded`

This error message was reported in GitHub issues (#2319, #2706, #5189, #5004), and once on Stack Overflow.
This error appears with cert-manager 0.12 and above when trying to apply an Issuer or any other cert-manager custom resource after having installed or upgraded cert-manager:
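An illustrative shape of the error (the exact URL and timeout vary):

```
Error from server (InternalError): error when creating "STDIN": Internal error occurred:
failed calling webhook "webhook.cert-manager.io": Post
"https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=30s": context deadline exceeded
```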
ℹ️ In older releases of cert-manager (0.11 and below), the webhook relied on the APIService mechanism, and the message looked a bit different but the cause was the same:
ℹ️ The message `context deadline exceeded` also appears when using `cmctl check api`. The cause is identical; you can continue reading this section to debug it.
The trouble with the message `context deadline exceeded` is that it obfuscates which part of the HTTP interaction timed out. It might be the DNS resolution, the TCP handshake, the TLS handshake, sending the HTTP request, or receiving the HTTP response.
ℹ️ For context, the query parameter `?timeout=30s` that you can see in the above error messages is a timeout that the API server sets when calling the webhook. It is often set to 10 or 30 seconds.
The following diagram shows the three errors that may be hidden behind the catch-all "context deadline exceeded" error message, represented by the outer box, which is usually thrown after 30 seconds:
In the rest of this section, we will try to trigger one of the three "more specific" errors:

- `i/o timeout` is the TCP handshake timeout and comes from `DialTimeout` in the Kubernetes apiserver. The name resolution may be the cause, but usually this message appears after the API server sent the `SYN` packet and waited 10 seconds for the `SYN-ACK` packet to be received from the cert-manager webhook.
- `net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)` is the HTTP response timeout, comes from here, and is configured to 30 seconds. The Kubernetes API server already sent the HTTP request and is waiting for the HTTP response headers (e.g., `HTTP/1.1 200 OK`).
- `net/http: TLS handshake timeout` is when the TCP handshake is done and the Kubernetes API server sent the initial TLS handshake packet (`ClientHello`), then waited 10 seconds for the cert-manager webhook to answer with the `ServerHello` packet.

We can sort these three messages into two categories: either it is a connectivity issue (the `SYN` packet is dropped), or it is a webhook issue (e.g., the TLS certificate is wrong, or the webhook is not returning any HTTP response):
| Timeout message | Category |
|---|---|
| `i/o timeout` | connectivity issue |
| `net/http: TLS handshake timeout` | webhook-side issue |
| `net/http: request canceled while awaiting headers` | webhook-side issue |
The first step is to rule out a webhook-side issue. In your shell session, run the following:
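A sketch, assuming the webhook listens on the default `securePort` of 10250:

```sh
kubectl -n cert-manager port-forward deploy/cert-manager-webhook 10250
```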
In another shell session, check that you can reach the webhook:
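A sketch; `--insecure` is used because we only want to check that the webhook answers with an HTTP response at all, not to validate its certificate:

```sh
curl --insecure -i https://localhost:10250/validate
```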
The happy output looks like this (abbreviated, illustrative):
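```
HTTP/1.1 200 OK
```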
If the response shows `200 OK`, we can rule out a webhook-side issue. Since the initial error message was `context deadline exceeded` and not an apiserver-side issue such as `x509: certificate signed by unknown authority` or `x509: certificate has expired or is not yet valid`, we can conclude that the problem is a connectivity issue: the Kubernetes API server isn't able to establish a TCP connection to the cert-manager webhook. Please follow the instructions in the section Error 2: `i/o timeout` (connectivity issue) above to continue debugging.
## Error 9: `net/http: TLS handshake timeout`

This error message was reported in 1 GitHub issue (#2602).
Looking at the above diagram, this error message indicates that the Kubernetes API server successfully established a TCP connection to the pod IP associated with the cert-manager webhook. The TLS handshake timeout means that the cert-manager webhook process isn't the one terminating the TCP connection: there is some HTTP proxy in between that is probably waiting for a plain HTTP request instead of a `ClientHello` packet.
We are unsure of the cause of this error. Please comment on the above GitHub issue if you notice this error.
## Error 10: `HTTP probe failed with statuscode: 500`

This error message was reported in 2 GitHub issues (#3185, #4557).
The error message is visible as an event on the cert-manager webhook:
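A sketch of how to see the event, with an illustrative event line (the exact counts and timings vary):

```sh
kubectl -n cert-manager describe pod -l app.kubernetes.io/name=webhook
```

```
Warning  Unhealthy  Readiness probe failed: HTTP probe failed with statuscode: 500
```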
We are unsure of the cause of this error. Please comment on the above GitHub issue if you notice this error.
## Error 11: `Service Unavailable`

This error was reported in 1 GitHub issue (#4281).
This message appears in Kubernetes clusters using the Weave CNI.
We are unsure of the cause of this error. Please comment on the above GitHub issue if you notice this error.
## Error 12: `failed calling admission webhook: the server is currently unable to handle the request`

This issue was reported in 4 GitHub issues (#1369, #1425, #3542, #4852).
We are unsure of the cause of this error. Please comment in one of the above GitHub issues if you are able to reproduce this error.
## Error 13: `x509: certificate signed by unknown authority`

This error was reported in GitHub issues (#2602).
This error may appear when installing or upgrading cert-manager in a namespace that is not `cert-manager`.
A very similar error message may show up when creating an Issuer or any other cert-manager custom resource.
You might also see a similar error message with `cmctl install` and `cmctl check api`.
If you are using cert-manager 0.14 and below with Helm and are installing into a namespace different from `cert-manager`, the CRD manifest had the namespace name `cert-manager` hardcoded. You can see the hardcoded namespace in the following annotation:
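A sketch of how to look for the annotation on one of the CRDs:

```sh
kubectl get crd certificates.cert-manager.io -o yaml | grep inject-ca-from
```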
You will see the following:
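An illustrative annotation, with the hardcoded `cert-manager` namespace in its value:

```yaml
cert-manager.io/inject-ca-from-secret: cert-manager/cert-manager-webhook-ca
```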
Note 1: this bug in the cert-manager Helm chart was fixed in cert-manager 0.15.
Note 2: since cert-manager 1.6, this annotation is no longer used on the cert-manager CRDs, since conversion is no longer needed.
The solution, if you are still using cert-manager 0.14 or below, is to render the manifest using `helm template`, edit the annotation to use the correct namespace, and then use `kubectl apply` to install cert-manager.
If you are using cert-manager 1.6 and below, the issue might be due to the cainjector being stuck trying to inject the self-signed certificate that the cert-manager webhook created and stored in the Secret resource `cert-manager-webhook-ca` into the `spec.caBundle` field of the cert-manager CRDs. The first step is to check whether the cainjector is running without problems:
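A sketch, assuming the Helm chart's default labels:

```sh
kubectl -n cert-manager get pods -l app.kubernetes.io/name=cainjector
kubectl -n cert-manager logs -l app.kubernetes.io/name=cainjector
```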
Looking at the logs, you will be able to tell whether the leader election worked. It can take up to one minute for the leader election to complete.
The happy output contains lines like this (illustrative; this is the standard client-go leader election message):
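```
leaderelection.go:248] successfully acquired lease kube-system/cert-manager-cainjector-leader-election
```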
Now, look for any message that indicates that the Secret resource that the cert-manager webhook created can't be loaded. The two error messages that might show up are:
The following message indicates that the given CRD has been skipped because the annotation is missing. You can ignore these messages:
If nothing seems wrong with the cainjector logs, you will want to check that the `spec.caBundle` field in the validating, mutating, and conversion webhook configurations is correct. The Kubernetes API server uses the contents of that field to trust the cert-manager webhook. The `caBundle` contains the self-signed CA created by the cert-manager webhook when it started.
Let us see the contents of the `caBundle`:
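A sketch that decodes the `caBundle` of the validating webhook configuration and prints the certificate (the resource name assumes a default Helm install):

```sh
kubectl get validatingwebhookconfigurations cert-manager-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d \
  | openssl x509 -noout -text
```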
Let us check that the contents of `caBundle` work for connecting to the webhook:
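A sketch that saves the decoded `caBundle` to a temporary file, so that curl can use it as a trust root in the final test below:

```sh
kubectl get validatingwebhookconfigurations cert-manager-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d > /tmp/cabundle.pem
```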
Our final test is to try to connect to the webhook using this trust bundle. Let us port-forward to the webhook pod:
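A sketch, assuming the default `securePort` of 10250:

```sh
kubectl -n cert-manager port-forward deploy/cert-manager-webhook 10250
```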
In another shell session, send a `/validate` HTTP request with the following command:
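A sketch; `--resolve` maps the webhook's DNS name to the forwarded local port, so that TLS hostname verification succeeds against the saved `caBundle`:

```sh
curl -i --cacert /tmp/cabundle.pem \
  --resolve cert-manager-webhook.cert-manager.svc:10250:127.0.0.1 \
  https://cert-manager-webhook.cert-manager.svc:10250/validate
```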
You should see a successful HTTP request and response (abbreviated, illustrative):
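```
HTTP/1.1 200 OK
```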
## Error 14: `cluster scoped resource "mutatingwebhookconfigurations/" is managed and access is denied`

This message was reported in GitHub issue #3717.
While installing cert-manager on GKE Autopilot, you will see the following message:
This error message will appear when using Kubernetes 1.20 and below with GKE Autopilot. It is due to a restriction on mutating admission webhooks in GKE Autopilot.
As of October 2021, the "rapid" Autopilot release channel has rolled out version 1.21 for Kubernetes masters. Installation via the Helm chart may end in an error message, but cert-manager is reported by some users to be working. Feedback and PRs are welcome.
the namespace "kube-system" is managed and the request's verb "create" is denied
When installing cert-manager on GKE Autopilot with Helm, you will see the following error message:
After this failure, you should still see the three pods happily running:
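A sketch of the check:

```sh
kubectl -n cert-manager get pods
```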
But looking at the logs of the controller or cainjector pods, you will see the following error message:
That is due to a limitation of GKE Autopilot: it is not possible to create resources in the `kube-system` namespace, and cert-manager uses the well-known `kube-system` namespace to manage the leader election. To get around the limitation, you can tell Helm to use a different namespace for the leader election:
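A sketch, assuming the chart is installed from the `jetstack` Helm repository; `global.leaderElection.namespace` is the chart value that controls where the leader election leases are created:

```sh
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set global.leaderElection.namespace=cert-manager
```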
## Error 16: `x509: certificate is valid for cert-manager-webhook.cert-manager.svc, not cert-manager-webhook.somenamespace.svc`

This error was reported in 1 GitHub issue (#2602).
If you installed cert-manager using Helm