# Design: DNS over HTTPS instead of plain DNS for DNS-01 ACME Issuers

**Status:** this is a design draft created by Maël Valais on 12 April 2022 and updated on 6 July 2022. In September 2022, this page will be copied over to the [`design/`](https://github.com/cert-manager/cert-manager/blob/master/design/) folder.

Issues:

- [Support DoT (DNS over TLS) for Recursive Nameservers](https://github.com/cert-manager/cert-manager/issues/4153) implemented by [Implement the DNS-over-HTTPS check](https://github.com/cert-manager/cert-manager/pull/5003).

## The problem

Some companies forbid traffic over UDP port 53 and only allow egress TCP traffic over port 443. In this case, cert-manager is unable to perform the ACME DNS-01 self-check. This is a stronger version of the split-horizon DNS problem.

Running cert-manager with the [`--dns01-recursive-nameservers`][dns01-recursive-nameservers] flag does not solve the above problem, since cert-manager still can't do DNS lookups over UDP port 53. For example, this would not work:

```text
--dns01-recursive-nameservers=
```

[dns01-recursive-nameservers]: https://cert-manager.io/docs/configuration/acme/dns01/#setting-nameservers-for-dns01-self-check

Users affected:

- Florian Liebhart (Volkswagen),
- Matthew de Haast (Fynbos),
- Sven Schliesing (Tagesschau).

## Solution 1: DNS-over-HTTPS

> Florian has an in-flight PR: [#5003](https://github.com/cert-manager/cert-manager/pull/5003).

There are two DNS-over-HTTPS protocols:

- The DNS Wire format, as defined in [RFC 8484][rfc8484].
- The JSON API format, an ad-hoc format created by Google and supported by Cloudflare and Alibaba.

[rfc8484]: https://www.rfc-editor.org/rfc/rfc8484 "DNS Queries over HTTPS (DoH)"
[curl-wiki-doh]: https://github.com/curl/curl/wiki/DNS-over-HTTPS "DoH - DNS over HTTPS"

The DNS Wire format is supported by 63 providers (as of 11 April 2022; the providers are listed in the curl wiki page [DNS over HTTPS][curl-wiki-doh]).
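To make the second format in the list above more concrete, here is a minimal Python sketch of a `TXT` lookup against a JSON API endpoint. It is an illustration only and not the code from the pull request. The function names and the `https://doh.example/resolve` endpoint are hypothetical, and the `name`/`type` query parameters, the `Accept: application/dns-json` header and the `Status`/`Answer` response fields are assumptions based on the publicly documented Google and Cloudflare JSON APIs.

```python
import json
import urllib.request
from urllib.parse import urlencode

TXT_RECORD_TYPE = 16  # numeric DNS record type for TXT
NOERROR = 0           # DNS response code for a successful lookup


def build_doh_json_url(endpoint: str, name: str, rtype: str = "TXT") -> str:
    """Build a JSON API query URL of the form <endpoint>?name=...&type=..."""
    return f"{endpoint}?{urlencode({'name': name, 'type': rtype})}"


def parse_txt_answers(body: str) -> list[str]:
    """Return the TXT values found in a JSON API response body."""
    doc = json.loads(body)
    if doc.get("Status") != NOERROR:
        return []
    values = []
    for answer in doc.get("Answer", []):
        if answer.get("type") == TXT_RECORD_TYPE:
            # TXT data is assumed to be returned wrapped in double quotes.
            values.append(answer["data"].strip('"'))
    return values


def lookup_txt(endpoint: str, name: str, timeout: float = 5.0) -> list[str]:
    """Query a JSON API endpoint over HTTPS (TCP 443) for TXT records."""
    request = urllib.request.Request(
        build_doh_json_url(endpoint, name),
        headers={"Accept": "application/dns-json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_txt_answers(response.read().decode("utf-8"))
```

A self-check would then compare the expected challenge value with the result of something like `lookup_txt(endpoint, "_acme-challenge.example.com")`, using only HTTPS traffic.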
<!--
```console
$ curl -sS https://github.com/curl/curl/wiki/DNS-over-HTTPS | htmlq --pretty 'table:nth-child(6) > tbody > tr > td:nth-child(1)' --text --remove-nodes strong | grep . | wc -l
63
```
-->

On the other hand, the JSON API format is supported by 3 providers:

- Google's Cloud DNS,
- Cloudflare's,
- Alibaba Public DNS.

Although it hasn't been standardized, the JSON API is mostly well documented in the Cloudflare [Using JSON](https://developers.cloudflare.com/) page.

We went with the JSON API for two reasons:

1. because it is easy to write a client for it, and to test and debug that client,
2. because our use-case (fetching TXT records) does not involve the kind of complex DNS queries that would require the DNS Wire format.

The major caveat is that only 3 providers support the JSON API.

The plan is to add a new field `dnsOverHTTPSEndpoint` on the Issuer and ClusterIssuer resources:

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
spec:
  acme:
    solvers:
      - dns01:
          dnsOverHTTPSEndpoint: # 🚧
```

This endpoint would have to be compatible with the JSON API.

We chose to use a field on the Issuer and ClusterIssuer types because this setting is a per-ACME-provider setting. For example, if you have an internal Smallstep ACME server and you are also using Let's Encrypt, you would not want to be using a cert-manager-wide flag such as:

```text
--acme-dns01-check-dns-over-https-endpoint=
```

since the self-check for Smallstep would start failing.

For the same reasons, we may have made a mistake in making the "alternative nameserver for self-checks" setting a flag:

```text
--dns01-recursive-nameservers=
```

It should have been a field on the Issuer and ClusterIssuer types:

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
spec:
  acme:
    solvers:
      - dns01:
          selfCheckDNS: # 🌟
```

**5 July 2022 (Maël, Florian): Field vs.
Flag**: Today we have identified two arguments in favor of going with a field and not a flag:

- (weak argument) A field makes testing much easier, since our testing infrastructure has nothing in place to handle flags. We always test with the same flags.
- (weak argument) If we were going with a flag, we would prevent people from using one DNS-01 provider that supports DNS-over-HTTPS along with another DNS-01 provider that doesn't support DNS-over-HTTPS. The reason it is a weak argument is that the very problem that DNS-over-HTTPS solves will already block you from using a non-DNS-over-HTTPS provider. In order for this argument to be "strong", we should find at least one current cert-manager user who has two DNS-01 providers, one of which is "internal" (meaning that it would not need DNS-over-HTTPS). We think this case is very rare, and as a proof of this rarity, no one has ever mentioned the limitation of `--dns01-recursive-nameservers` when two DNS-01 providers are in conflict. (Admittedly, `--dns01-recursive-nameservers` is not as conflicting as a flag for DNS-over-HTTPS would be, since the latter would break any non-DNS-over-HTTPS-enabled DNS-01 provider.)

We don't have strong arguments in favor of the field (over the flag). We aim to go with the field approach since it makes end-to-end testing much easier.

**26 July 2022 (Sven): DNS-over-HTTPS over HTTP proxy:** On top of forbidding traffic over UDP port 53, companies often require the use of an HTTP proxy (as opposed to allowing egress HTTPS traffic for specific domains). Sven Schliesing [pointed out](https://kubernetes.slack.com/archives/C4NV3DWUC/p1658828857158289?thread_ts=1656539619.508449&cid=C4NV3DWUC) that the `cloudflared` solution does not deal with the HTTP proxy problem. Another hurdle when using an HTTP proxy is that, instead of using the standard CONNECT protocol, some HTTP proxies do TLS re-encryption with their own root certificate.
There are also HTTP proxies requiring NTLM authentication.

## Solution 2 (workaround): Cloudflared for Proxying DNS Queries Coming From cert-manager

As [proposed on the Kubernetes Slack](https://kubernetes.slack.com/archives/C4NV3DWUC/p1655321707116059) by Matthew de Haast, it is possible to enable DNS-over-HTTPS with cert-manager by deploying the `cloudflared` proxy as a deployment/service in the same namespace as cert-manager. cert-manager is then set to point to that service for DNS queries. This workaround has been tested and solves a split DNS issue with an AWS hosted zone.

To learn more about `cloudflared`, you can visit: <https://developers.cloudflare.com/>.

## Appendix

### Does DNS-over-HTTPS need to find the authoritative nameservers and follow CNAME records?

For context, when solving a DNS-01 challenge, cert-manager does DNS queries at two moments:

- **Before**: before adding a `TXT` record, cert-manager calls [`FindZoneByFqdn`](https://github.com/cert-manager/cert-manager/blob/440da719a9b30d0d2c891b93b08d89bc09e637e2/pkg/issuer/acme/dns/util/wait.go#L324) to find the apex domain of the zone to which cert-manager needs to add the `TXT` record. It finds it by looking for the first `SOA` record, starting with the domain on which the `TXT` record is meant to be inserted.
- **After**: after adding the `TXT` record, cert-manager does a self-check by querying the `TXT` record in the function [`checkDNSPropagation`](https://github.com/cert-manager/cert-manager/blob/440da719a9b30d0d2c891b93b08d89bc09e637e2/pkg/issuer/acme/dns/util/wait.go#L104).

Years ago, cert-manager used to do a simple DNS lookup, but our current DNS-01 self-check does not simply rely on a single DNS lookup. Here is why we started doing the recursive lookup ourselves instead of relying on whichever DNS server Kubernetes is configured with.
In order to work around split-horizon DNS, where cert-manager and Let's Encrypt are relying on different name servers, we devised a workaround a long time ago. We call it the "authoritative nameserver lookup". The authoritative nameserver lookup does what a recursive name server would do if we were to query it. The goal of mimicking what a recursive DNS resolver would do is to solve two issues:

1. The DNS server that cert-manager is talking to caches records, and we worried that the negative-caching TTL period (how long the DNS server caches an NXDOMAIN response) would lead to more waiting.
2. We want to do the self-check on the authoritative nameserver so we can be sure the record has been updated at the root. This is what Let's Encrypt does.

But in the face of restricted DNS environments, as James Munnelly explains in [Solutions for split-horizon and restricted DNS environment issues](https://github.com/cert-manager/cert-manager/issues/903) (written in 2018), the "authoritative nameserver lookup" won't work. In 2018, the problem was stated as:

> A user has configured their cluster/VPC so that all outbound traffic on port 53 is denied, **except for the one cluster DNS server** (i.e. kube-dns, or their route53 resolver).

In that case, the "authoritative nameserver lookup" does not make sense and has to be disabled, since cert-manager can only expect to be talking to that one resolver.

But what if all outbound traffic over UDP port 53 is blocked, and there is no way to contact an outside recursive nameserver over UDP port 53?

> Other cert-manager issues in which using a DNS lookup instead of mimicking the behavior of a recursive DNS resolver would solve the issue:
>
> - [Wrong SOA record while updating delegated _acme-challenge zone](https://github.com/cert-manager/cert-manager/issues/3453)

Comment from Jake Sanders: Richard Wall is working on Readiness gates for ACME challenges.
Perhaps we could work out what the API looks like for them and see if it makes sense to add conditions for "passed DNS check", or "passed DNS over HTTPS check".
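
For reference, the **Before** step described at the start of this appendix (finding the zone apex by looking for the first `SOA` record) can be sketched in a few lines of Python. This is an illustration of the idea only, not a port of `FindZoneByFqdn`, and it ignores CNAME handling. The names `candidate_zones`, `find_zone` and the `has_soa` callback are hypothetical; the callback stands for whatever transport performs the `SOA` query, whether plain DNS or DNS-over-HTTPS.

```python
from typing import Callable


def candidate_zones(fqdn: str) -> list[str]:
    """List the names to probe, from the full name up to the top-level domain."""
    labels = fqdn.rstrip(".").split(".")
    return [".".join(labels[i:]) + "." for i in range(len(labels))]


def find_zone(fqdn: str, has_soa: Callable[[str], bool]) -> str:
    """Return the first candidate name for which an SOA record exists.

    `has_soa` is a hypothetical callback that performs the actual SOA query.
    """
    for zone in candidate_zones(fqdn):
        if has_soa(zone):
            return zone
    raise LookupError(f"no SOA record found for any parent of {fqdn}")
```

Because the transport is injected through `has_soa`, the same walk applies whether the `SOA` queries go over UDP port 53 or over a DNS-over-HTTPS endpoint.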