# Design: DNS over HTTPS instead of plain DNS for DNS-01 ACME Issuers

**Status:** this is a design draft created by Maël Valais on 12 April 2022 and updated on 9 June 2023.

Issues:

- [Support DoT (DNS over TLS) for Recursive Nameservers](https://github.com/cert-manager/cert-manager/issues/4153) implemented by [Implement the DNS-over-HTTPS check](https://github.com/cert-manager/cert-manager/pull/5003).

**TL;DR**: We recommend supporting the RFC 8484 flavor of DNS-over-HTTPS by extending the existing `--dns01-recursive-nameservers` flag. This will unblock at least the 12 people who reacted on the PR and issue. The aim isn't full support for all secure DNS protocols; the goal is to work around egress limitations (egress port 53 closed) in companies with restricted egress access. Anyone who needs other secure DNS protocols can instead use one of the DoH proxies listed below.

## The problem

Some companies forbid egress traffic over UDP port 53 and only allow egress TCP traffic over port 443. In this case, cert-manager is unable to perform the ACME DNS-01 self-check. This is a stronger version of the split-horizon DNS problem.

Running cert-manager with the [`--dns01-recursive-nameservers`][dns01-recursive-nameservers] flag does not solve this problem, since cert-manager still can't do DNS lookups over UDP port 53. For example, this would not work:

```text
--dns01-recursive-nameservers=1.1.1.1:53
```

[dns01-recursive-nameservers]: https://cert-manager.io/docs/configuration/acme/dns01/#setting-nameservers-for-dns01-self-check

Users affected:

- Florian Liebhart (Volkswagen),
- Matthew de Haast (Fynbos),
- Sven Schliesing (Tagesschau).

## Solution 1: DNS-over-HTTPS

> Florian has an in-flight PR: [#5003](https://github.com/cert-manager/cert-manager/pull/5003).

There are two DNS-over-HTTPS protocols:

- The DNS Wire format, as defined in [RFC 8484][rfc8484]
- The JSON API format, an ad-hoc format created by Google, and supported by Cloudflare and Alibaba.

[rfc8484]: https://www.rfc-editor.org/rfc/rfc8484 "DNS Queries over HTTPS (DoH)"
[curl-wiki-doh]: https://github.com/curl/curl/wiki/DNS-over-HTTPS "DoH - DNS over HTTPS"

The DNS Wire format is supported by 63 providers (as of 11 April 2022, the providers are listed in the curl wiki page [DNS over HTTPS][curl-wiki-doh]).

<!--
```console
$ curl -sS https://github.com/curl/curl/wiki/DNS-over-HTTPS | htmlq --pretty 'table:nth-child(6) > tbody > tr > td:nth-child(1)' --text --remove-nodes strong | grep . | wc -l
63
```
-->

On the other hand, the JSON API format is supported by 3 providers:

- Google Public DNS,
- Cloudflare's 1.1.1.1,
- Alibaba Public DNS.

Although it hasn't been standardized, the JSON API is mostly well documented in the Cloudflare [Using JSON](https://developers.cloudflare.com/1.1.1.1/encryption/dns-over-https/make-api-requests/dns-json/) page.

The JSON API is easier to write a client for, to test, and to debug, and it supports `TXT`. However, it only supports a subset of the DNS record types (it doesn't support `SOA`, `CNAME`, or `CAA`), and it is only offered by 3 providers.

The plan is to extend the existing global flag `--dns01-recursive-nameservers`, e.g.,

```bash
cert-manager-controller \
  --dns01-recursive-nameservers "https://8.8.8.8/resolve"
```

Later on, if someone needs to select the DNS resolver per issuer, we could also add a field to the issuer to override the global flag. We chose not to implement it for now. It would look like this:

```yaml
# This is a FUTURE POSSIBLE example.
apiVersion: cert-manager.io/v1
kind: Issuer
spec:
  acme:
    solvers:
      - dns01:
          selfCheck:
            dnsOverHTTPSRFC: https://8.8.8.8/resolve
```

**5 July 2022 (Mael, Florian): Field vs.
Flag**: Today we identified two arguments in favor of going with a field and not a flag:

- (weak argument) A field means testing is much easier, since our testing infra doesn't like flags and we have nothing in place to handle flags. We always test with the same flags.
- (weak argument) If we were going with a flag, we would prevent people from using one DNS-01 provider that supports DNS-over-HTTPS along with another DNS-01 provider that doesn't support DNS-over-HTTPS. The reason it is a weak argument is that the very problem DNS-over-HTTPS solves (blocked egress on port 53) already prevents you from using a non-DNS-over-HTTPS provider. For this argument to be "strong", we would need to find at least one current cert-manager user who has two DNS-01 providers, one of which is "internal" (meaning that it would not need DNS-over-HTTPS). We think this case is very rare, and as evidence of this rarity, no one has ever mentioned two DNS-01 providers being in conflict because of `--dns01-recursive-nameservers`. (Admittedly, `--dns01-recursive-nameservers` is not as conflicting as a flag for DNS-over-HTTPS, which would break any DNS-01 provider that isn't DNS-over-HTTPS-enabled.)

We don't have strong arguments in favor of the field (over the flag). We aim to go with the field approach since it makes end-to-end testing much easier.

**26 July 2022 (Sven): DNS-over-HTTPS over HTTP proxy:** On top of forbidding traffic over UDP port 53, companies often require the use of an HTTP proxy (as opposed to allowing egress HTTPS traffic for specific domains). Sven Schliesing [pointed out](https://kubernetes.slack.com/archives/C4NV3DWUC/p1658828857158289?thread_ts=1656539619.508449&cid=C4NV3DWUC) that the `cloudflared` solution does not deal with the HTTP proxy problem. Another hurdle when using an HTTP proxy is that, instead of using the standard CONNECT method, some HTTP proxies do TLS re-encryption with their own root certificate.
There are also HTTP proxies requiring NTLM authentication.

To conclude this section, we suggest adding `dnsOverHTTPSJSONEndpoint` to support the minimal set of features that will unblock at least 7 people.

## Solution 2: DNS-over-HTTPS Proxy

> **TL;DR:** although DoH proxies exist and work for the most part, we have found that they are challenging to deploy (lack of maintained containers, requires CAP_ADMIN).

As [proposed on the Kubernetes Slack](https://kubernetes.slack.com/archives/C4NV3DWUC/p1655321707116059) by Matthew de Haast, it is possible to enable DNS-over-HTTPS with cert-manager by deploying the `cloudflared` proxy as a deployment/service in the same namespace as cert-manager. cert-manager is then set to point to that service for DNS queries. This workaround has been tested and solves a split DNS issue with an AWS hosted zone. To learn more about `cloudflared`, you can visit: <https://developers.cloudflare.com/1.1.1.1/encryption/dns-over-https/dns-over-https-client/>.

It is also possible to use the sidecar approach (i.e., run that container in the same pod as the cert-manager-controller container), but that requires a lot of changes to the cert-manager Helm chart.

Cloudflared is not the only alternative.
In June 2023, we found that there is no Kubernetes-enabled DNS proxy tool that fits the bill:

| Project | Stars | State |
|---------|-------|-------|
| [AdguardTeam/dnsproxy][] | 1800 | no image |
| [aarond10/https_dns_proxy][] | 705 | no upstream image, unofficial image [`moranbw/https-dns-proxy-docker`][] |
| [DNSCrypt/doh-server][] | 573 | no image |
| [facebookarchive/doh-proxy][] | 462 | archived |
| [satishweb/docker-doh][] | 88 | no image |
| [junkurihara/doh-server][] (DNSCrypt fork) | 1 | no upstream image, unofficial image [`jqtype/doh-server`][] |
| [jacobwoffenden/container-doh-proxy][] (cloudflared-based) | 0 | official image [`ghcr.io/jacobwoffenden/doh-proxy`][] |

[DNSCrypt/doh-server]: https://github.com/DNSCrypt/doh-server
[satishweb/docker-doh]: https://github.com/satishweb/docker-doh
[jacobwoffenden/container-doh-proxy]: https://github.com/jacobwoffenden/container-doh-proxy
[`ghcr.io/jacobwoffenden/doh-proxy`]: https://github.com/jacobwoffenden/container-doh-proxy/pkgs/container/doh-proxy
[AdguardTeam/dnsproxy]: https://github.com/AdguardTeam/dnsproxy
[facebookarchive/doh-proxy]: https://github.com/facebookarchive/doh-proxy
[aarond10/https_dns_proxy]: https://github.com/aarond10/https_dns_proxy
[`moranbw/https-dns-proxy-docker`]: https://github.com/moranbw/https-dns-proxy-docker
[junkurihara/doh-server]: https://github.com/junkurihara/doh-server
[`jqtype/doh-server`]: https://hub.docker.com/r/jqtype/doh-server

We tried [aarond10/https_dns_proxy][] since it had an image available (although unofficial) and its star count suggests it is at least somewhat maintained. It worked. That said, since no official image is available, we do not recommend using it or any of the other tools in this list. Instead, we recommend implementing DNS-over-HTTPS in cert-manager.
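To give a sense of how small the client side of the RFC 8484 wire format is, here is a minimal Go sketch that encodes a DNS query and builds the corresponding DoH `GET` URL. It is an illustration only, not the code from the in-flight PR: the helper names `buildQuery` and `dohGetURL` and the endpoint `https://dns.example/dns-query` are hypothetical, and a real client would rather rely on an existing DNS library and would still need to send the request and parse the response.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// DNS record type codes used in this sketch.
const (
	typeA   uint16 = 1
	typeTXT uint16 = 16
)

// buildQuery encodes a single-question DNS query (class IN) in wire format.
// The message ID stays 0, which RFC 8484 recommends for HTTP caching.
func buildQuery(fqdn string, qtype uint16) ([]byte, error) {
	msg := make([]byte, 12, 64) // 12-byte header, all zero
	msg[2] = 0x01               // flags: recursion desired (RD)
	msg[5] = 0x01               // QDCOUNT = 1
	for _, label := range strings.Split(strings.TrimSuffix(fqdn, "."), ".") {
		if len(label) == 0 || len(label) > 63 {
			return nil, fmt.Errorf("invalid label %q in %q", label, fqdn)
		}
		msg = append(msg, byte(len(label)))
		msg = append(msg, label...)
	}
	msg = append(msg, 0)                           // root label ends the name
	msg = append(msg, byte(qtype>>8), byte(qtype)) // QTYPE
	msg = append(msg, 0, 1)                        // QCLASS = IN
	return msg, nil
}

// dohGetURL builds the RFC 8484 GET form: ?dns=<base64url without padding>.
// A real client would also send "Accept: application/dns-message".
func dohGetURL(endpoint string, msg []byte) string {
	return endpoint + "?dns=" + base64.RawURLEncoding.EncodeToString(msg)
}

func main() {
	q, err := buildQuery("_acme-challenge.example.com.", typeTXT)
	if err != nil {
		panic(err)
	}
	// dns.example is a placeholder endpoint, not a real provider.
	fmt.Println(dohGetURL("https://dns.example/dns-query", q))
}
```

Since the request is a plain HTTPS call on port 443, it can go through the same egress path as any other HTTPS traffic, which is the property this design relies on.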
For more elaborate use-cases (such as DNS-over-HTTPS using the RFC protocol), we recommend using one of the DoH proxies that are still maintained.

## Appendix

### Does DNS-over-HTTPS need finding the authoritative nameservers and following CNAME records?

> **TL;DR:** "Authoritative nameserver lookup" and "CNAME follow" need to be disabled when using DNS-over-HTTPS (see conclusion).

For context, when solving a DNS-01 challenge, cert-manager does DNS queries at two moments:

- **Find Zone**: before adding a `TXT` record, cert-manager calls [`FindZoneByFqdn`](https://github.com/cert-manager/cert-manager/blob/440da719a9b30d0d2c891b93b08d89bc09e637e2/pkg/issuer/acme/dns/util/wait.go#L324) to find the apex domain of the zone to which cert-manager needs to add the `TXT` record. It finds it by looking for the first `SOA` record, starting from the domain on which the `TXT` record is meant to be inserted.
- **Self-Check**: after adding the `TXT` record, cert-manager does a self-check by querying the `TXT` record in the function [`checkDNSPropagation`](https://github.com/cert-manager/cert-manager/blob/440da719a9b30d0d2c891b93b08d89bc09e637e2/pkg/issuer/acme/dns/util/wait.go#L104).

Years ago, cert-manager used to do a simple DNS lookup. Nowadays, cert-manager has three different DNS schemes:

1. **Resolver Mode**: In this mode, cert-manager crawls up the tree using 8.8.8.8 but uses the first IP in the `NS` record to fetch the `TXT` record.
   - It starts by querying the `SOA` record for `_acme-challenge.domain.com.` using 8.8.8.8.
   - When it finds a `CNAME` record, it follows it, and when it finds a `CAA` record, it does something (I can't remember).
   - When it finds an `SOA` record, it fetches the `NS` record using 8.8.8.8.
   - Finally, it uses one of the DNS IPs in there to fetch the `TXT` record.
2. **Resolver Mode Without Authoritative Check**: (enabled with `--dns01-recursive-nameservers-only=true`) Same as "resolver mode" except cert-manager uses 8.8.8.8 for fetching the `TXT` record.
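The "Find Zone" step described above can be sketched in Go as a walk up the labels of the challenge name. This is a simplified illustration under stated assumptions, not the actual `FindZoneByFqdn` implementation: the names `findZone` and `soaLookup` are hypothetical, `CNAME` handling is left out, and the resolver is replaced by an injected function so that the sketch runs without network access.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// soaLookup reports whether a resolver returns an SOA record owned by name.
// It is injected so that this sketch runs without network access.
type soaLookup func(name string) (bool, error)

// findZone walks up the labels of fqdn and returns the first name that owns
// an SOA record, i.e. the apex of the zone that should receive the TXT record.
func findZone(fqdn string, lookup soaLookup) (string, error) {
	name := fqdn
	if !strings.HasSuffix(name, ".") {
		name += "."
	}
	for name != "" && name != "." {
		found, err := lookup(name)
		if err != nil {
			return "", err
		}
		if found {
			return name, nil
		}
		// Drop the leftmost label and try the parent name.
		name = name[strings.Index(name, ".")+1:]
	}
	return "", errors.New("no SOA record found for " + fqdn)
}

func main() {
	// Fake resolver data: only example.com. is a zone apex.
	apexes := map[string]bool{"example.com.": true}
	lookup := func(name string) (bool, error) {
		fmt.Println("SOA?", name)
		return apexes[name], nil
	}
	zone, err := findZone("_acme-challenge.www.example.com", lookup)
	if err != nil {
		panic(err)
	}
	fmt.Println("zone:", zone)
}
```

Every `lookup` call here only needs a recursive resolver, so this step could in principle be served by a DoH endpoint, provided that endpoint answers `SOA` queries.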

In the three modes, it is possible to change the default 8.8.8.8 nameserver by using the flag `--dns01-recursive-nameservers`.

**Note:** I think talking about "recursive name server" isn't a good idea. 99.9% of end-user DNS servers (e.g., 1.1.1.1) and DoH endpoints are recursive resolvers. The only non-recursive name servers are the `NS` servers (i.e., the authoritative name servers), and they are never used directly by anyone (except other nameservers), so there is no good reason to talk about "recursive" name servers in the documentation. I think it was a mistake to mention "recursive" in the flag `--dns01-recursive-nameservers`: it creates doubt for the user for no good reason. No one would think that they need to give some `NS` IPs.

But our current DNS-01 self-check does not simply rely on a single DNS lookup. In order to work around split-horizon DNS, where cert-manager and Let's Encrypt are relying on different name servers, we advised a workaround a long time ago. We call it the "authoritative nameserver lookup". The authoritative nameserver lookup does what a recursive name server would do if we were to query it. The goal of mimicking what a recursive DNS resolver would do is to solve two issues:

1. The DNS server that cert-manager is talking to caches records, and we worried that the negative-caching TTL period (how long the DNS server caches an NXDOMAIN response) would lead to more waiting.
2. We want to do the self-check on the authoritative nameserver so we can be sure the record has been updated at the root. This is what Let's Encrypt does.

But in the face of restricted DNS environments, as James Munnelly explains in [Solutions for split-horizon and restricted DNS environment issues](https://github.com/cert-manager/cert-manager/issues/903) (written in 2018), the "authoritative nameserver lookup" won't work. In 2018, the problem was stated as:

> A user has configured their cluster/VPC so that all outbound traffic on port 53 is denied, **except for the one cluster DNS server** (i.e.
kube-dns, or their route53 resolver).

In that case, the "authoritative nameserver lookup" does not make sense and has to be disabled, since cert-manager can only expect to be talking to that one resolver. But what if all outbound traffic over UDP port 53 is blocked, and there is no way to contact an outside recursive nameserver over UDP port 53?

> Other cert-manager issues in which using a DNS lookup instead of mimicking the behavior of a recursive DNS resolver would solve the issue:
>
> - [Wrong SOA record while updating delegated _acme-challenge zone](https://github.com/cert-manager/cert-manager/issues/3453)

Comment from Jake Sanders: Richard Wall is working on Readiness gates for ACME challenges. Perhaps we could work out what the API looks like for them and see if it makes sense to add conditions for "passed DNS check", or "passed DNS over HTTPS check".

Conclusion 9 June 2023:

- "Authoritative nameserver lookup" does not make sense in the context of DNS-over-HTTPS since cert-manager can only talk to one resolver.
- "CNAME Following" needs to be disabled for the same reason.

----

The reason we talk about "authoritative name servers" in cert-manager's DNS-01 code base is that cert-manager's self-check tries to be as close as possible to how the challenge will be performed by Let's Encrypt. Let's Encrypt queries the authoritative name servers via Unbound's resolver. Unlike our laptops and smartphones, which rely on a non-authoritative name server that does the recursive look-up for us, Unbound first queries the `NS` record for the top-level domain (e.g., `.com`), and then recursively finds the name servers of the zone in which the challenged domain is located. For example, that would be `delegated.domain.com` for the challenge `_challenge.delegated.domain.com`. Imagine you have a setup with 1 primary name server and 3 secondaries: you will hit [4-ns-out-of-sync][] if one of the 3 secondaries isn't in sync with the primary.

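The out-of-sync scenario above can be illustrated with a small Go sketch that only reports success once every authoritative nameserver serves the expected `TXT` value. This is an illustration under stated assumptions, not cert-manager's or Let's Encrypt's actual logic: the names `propagated` and `txtQuery`, the nameserver hostnames, and the token value are all hypothetical, and the per-nameserver query is injected so that the sketch runs without network access.

```go
package main

import (
	"errors"
	"fmt"
)

// txtQuery asks one specific nameserver for the TXT values at fqdn.
// It is injected so that this sketch runs without network access.
type txtQuery func(nameserver, fqdn string) ([]string, error)

// propagated reports whether every authoritative nameserver already serves
// the expected TXT value. A single stale secondary makes it return false.
func propagated(nameservers []string, fqdn, want string, query txtQuery) (bool, error) {
	if len(nameservers) == 0 {
		return false, errors.New("no authoritative nameservers given")
	}
	for _, ns := range nameservers {
		values, err := query(ns, fqdn)
		if err != nil {
			return false, fmt.Errorf("querying %s: %w", ns, err)
		}
		seen := false
		for _, v := range values {
			if v == want {
				seen = true
				break
			}
		}
		if !seen {
			return false, nil // this nameserver has not caught up yet
		}
	}
	return true, nil
}

func main() {
	// Fake zone data: one primary and three secondaries, one of them stale.
	served := map[string][]string{
		"ns1.example.": {"token"},
		"ns2.example.": {"token"},
		"ns3.example.": {"token"},
		"ns4.example.": {}, // stale secondary
	}
	query := func(ns, fqdn string) ([]string, error) { return served[ns], nil }
	servers := []string{"ns1.example.", "ns2.example.", "ns3.example.", "ns4.example."}
	ok, err := propagated(servers, "_acme-challenge.example.com.", "token", query)
	fmt.Println(ok, err)
}
```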
[4-ns-out-of-sync]: https://community.letsencrypt.org/t/please-query-the-authoritative-dns-sec-with-dns-01/146187/2