
Design: DNS over HTTPS instead of plain DNS for DNS-01 ACME Issuers

Status: this is a design draft created by Maël Valais on 12 April 2022 and updated on 9 June 2023.

Issues:

TL;DR: We recommend supporting RFC 8484 DNS-over-HTTPS by extending the existing --dns01-recursive-nameservers flag. This will unblock at least the 12 people who reacted on the PR and issue. The aim isn't full support for every secure DNS protocol; the goal is to work around egress limitations (egress port 53 closed) in companies with restricted egress access. Anyone who needs other secure DNS protocols can instead use one of the DoH proxies listed below.

The problem

Some companies forbid egress traffic over UDP port 53 and only allow egress TCP traffic over port 443. In this case, cert-manager is unable to perform the ACME DNS-01 self-check.

This is a stronger version of the split-horizon DNS problem. Running cert-manager with the --dns01-recursive-nameservers flag does not solve it, since cert-manager still can't do DNS lookups over UDP port 53. For example, this would not work:

--dns01-recursive-nameservers=1.1.1.1:53

Users affected:

  • Florian Liebhart (Volkswagen),
  • Matthew de Haast (Fynbos),
  • Sven Schliesing (Tagesschau).

Solution 1: DNS-over-HTTPS

Florian has an in-flight PR: #5003.

There are two DNS-over-HTTPS protocols:

  • The DNS Wire format, as defined in RFC 8484
  • The JSON API format, an ad-hoc format created by Google and supported by Cloudflare and Alibaba.

The DNS Wire format is supported by 63 providers (as of 11 April 2022, the providers are listed in the curl wiki page DNS over HTTPS).
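To make the wire format concrete, here is a minimal sketch of how an RFC 8484 GET request URL is built: the DNS query is encoded in the usual binary format, then base64url-encoded without padding and passed in the dns parameter. The endpoint https://doh.example/dns-query and the function names are assumptions made for illustration; this is not code from cert-manager or from PR #5003.

```python
import base64
import struct

TXT = 16  # numeric RR type for TXT records


def build_dns_query(name: str, qtype: int) -> bytes:
    """Encode a single-question DNS query in wire format (RFC 1035)."""
    # Header: ID=0 (RFC 8484 suggests 0 to help HTTP caching), flags=RD,
    # QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    qname = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        qname += bytes([len(raw)]) + raw
    qname += b"\x00"  # the root label terminates the name
    question = qname + struct.pack("!HH", qtype, 1)  # QCLASS=IN
    return header + question


def doh_get_url(endpoint: str, name: str, qtype: int) -> str:
    """Build an RFC 8484 GET URL: base64url without padding in ?dns=."""
    encoded = base64.urlsafe_b64encode(build_dns_query(name, qtype)).rstrip(b"=")
    return f"{endpoint}?dns={encoded.decode('ascii')}"


# Hypothetical endpoint, used only to show the shape of the URL.
url = doh_get_url("https://doh.example/dns-query", "_acme-challenge.example.com", TXT)
```

The request would then be sent with the header "Accept: application/dns-message", and the response body is a DNS message in the same binary format.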

In contrast, the JSON API format is supported by 3 providers:

  • Google Public DNS,
  • Cloudflare's 1.1.1.1,
  • Alibaba Public DNS.

Although it hasn't been standardized, the JSON API is mostly well documented in the Cloudflare Using JSON page.

The JSON API is easier to write a client for, to test, and to debug, and it supports TXT records. However, it doesn't support SOA, CNAME, or CAA records, so the JSON format only covers a subset of DNS record types, and it is only offered by 3 providers.
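To show why a JSON API client is simple to write, here is a minimal sketch that extracts TXT values from a response body. The field names (Status, Answer, name, type, data) follow the JSON format as documented by Google and Cloudflare; the sample body and the function name txt_values are made up for illustration.

```python
import json

TXT = 16  # numeric RR type for TXT records


def txt_values(body: str, fqdn: str) -> list:
    """Return the TXT strings found for fqdn in a JSON API response body."""
    doc = json.loads(body)
    if doc.get("Status") != 0:  # 0 is NOERROR; 3 would be NXDOMAIN
        return []
    want = fqdn.rstrip(".").lower()
    values = []
    for answer in doc.get("Answer", []):
        if answer.get("type") != TXT:
            continue
        if answer.get("name", "").rstrip(".").lower() != want:
            continue
        # TXT data may come wrapped in double quotes; strip them.
        values.append(answer.get("data", "").strip('"'))
    return values


# Illustrative response body, not captured from a real resolver.
sample = (
    '{"Status": 0, "Answer": [{"name": "_acme-challenge.example.com.", '
    '"type": 16, "TTL": 60, "data": "\\"token-123\\""}]}'
)
found = txt_values(sample, "_acme-challenge.example.com")
```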

The plan is to extend the existing global flag --dns01-recursive-nameservers, e.g.,

cert-manager-controller \
  --dns01-recursive-nameservers "https://8.8.8.8/resolve"
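One way the extended flag could be interpreted is sketched below: entries starting with https:// are treated as DNS-over-HTTPS endpoints, and everything else keeps its current host:port meaning. This is a sketch under assumptions, not the parsing logic of cert-manager; the function name parse_nameservers and the ("doh" | "dns", value) tuples are hypothetical.

```python
from urllib.parse import urlsplit


def parse_nameservers(flag_value: str) -> list:
    """Split a comma-separated flag value into ("doh" | "dns", value) pairs."""
    entries = []
    for raw in flag_value.split(","):
        value = raw.strip()
        if not value:
            continue
        if value.startswith("https://"):
            if not urlsplit(value).hostname:
                raise ValueError(f"missing host in DoH URL: {value!r}")
            entries.append(("doh", value))
        elif "://" in value:
            # Reject plain http:// and anything else that is not host:port.
            raise ValueError(f"unsupported scheme in {value!r}")
        else:
            entries.append(("dns", value))
    return entries


doh_only = parse_nameservers("https://8.8.8.8/resolve")
plain = parse_nameservers("1.1.1.1:53, 8.8.8.8:53")
```

Keying the behavior on the URL scheme means existing host:port values keep working unchanged.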

Later on, if someone needs to select the DNS resolver per issuer, we could also add a field to the issuer to override the global flag. We chose not to implement it for now. It would look like this:

# This is a FUTURE POSSIBLE example.
apiVersion: cert-manager.io/v1
kind: Issuer
spec:
  acme:
    solvers:
      - dns01:
          selfCheck:
            dnsOverHTTPSRFC: https://8.8.8.8/resolve

5 July 2022 (Mael, Florian): Field vs. Flag: Today we have identified the two arguments in favor of going with a field and not a flag:

  • (weak argument) A field means testing is much easier, since our testing infra doesn't like flags and we have nothing in place to handle flags. We always test with the same flags.
  • (weak argument) If we were going with a flag, we would prevent people from using one DNS-01 provider that supports DNS-over-HTTPS along with another DNS-01 provider that doesn't. The reason it is a weak argument is that the problem DNS-over-HTTPS solves already blocks you from using a non-DNS-over-HTTPS provider. For this argument to be "strong", we would need to find at least one current cert-manager user who has two DNS-01 providers, one of which is "internal" (meaning that it would not need DNS-over-HTTPS). We think this case is very rare, and as evidence of this rarity, no one has ever reported two DNS-01 providers being in conflict because of --dns01-recursive-nameservers. (Admittedly, --dns01-recursive-nameservers is not as conflicting as a flag for DNS-over-HTTPS, which would break any DNS-01 provider that is not DNS-over-HTTPS-enabled.)

We don't have strong arguments in favor of the field (over the flag). We aim to go with the field approach since it makes end-to-end testing much easier.

26 July 2022 (Sven): DNS-over-HTTPS over HTTP proxy: On top of forbidding traffic over UDP port 53, companies often require the use of an HTTP proxy (as opposed to allowing egress HTTPS traffic for specific domains). Sven Schliesing pointed out that the cloudflared solution does not deal with the HTTP proxy problem.

Another hurdle when using an HTTP proxy is that, instead of tunnelling with the standard CONNECT method, some HTTP proxies do TLS re-encryption with their own root certificate. There are also HTTP proxies requiring NTLM authentication.
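To illustrate what a DoH client has to handle in such environments, here is a minimal sketch of an HTTPS client that goes through an HTTP proxy and, optionally, trusts a corporate root certificate for proxies that re-encrypt TLS. The proxy URL and the function name build_doh_opener are hypothetical, and NTLM authentication is not covered.

```python
import ssl
import urllib.request


def build_doh_opener(proxy_url=None, ca_file=None):
    """Return a urllib opener for DoH requests behind an HTTP proxy.

    proxy_url: e.g. "http://proxy.internal:3128" (hypothetical address).
    ca_file:   PEM bundle holding the proxy's root certificate, needed when
               the proxy re-encrypts TLS; None keeps the system trust store.
    """
    context = ssl.create_default_context(cafile=ca_file)
    handlers = [urllib.request.HTTPSHandler(context=context)]
    if proxy_url:
        # HTTPS requests are tunnelled through the proxy with CONNECT.
        handlers.append(urllib.request.ProxyHandler({"https": proxy_url}))
    return urllib.request.build_opener(*handlers)


opener = build_doh_opener(proxy_url="http://proxy.internal:3128")
```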

To conclude this section, we suggest adding dnsOverHTTPSJSONEndpoint to support the minimal set of features that will unblock at least 7 people.

Solution 2: DNS-over-HTTPS Proxy

TL;DR: although DoH proxies exist and work for the most part, we have found them challenging to deploy (lack of maintained container images, requires CAP_ADMIN).

As proposed on the Kubernetes Slack by Matthew de Haast, it is possible to enable DNS-over-HTTPS with cert-manager by deploying the cloudflared proxy as a deployment/service in the same namespace as cert-manager. cert-manager is then set to point to that service for DNS queries. This workaround has been tested and solves a split DNS issue with an AWS hosted zone. To learn more about cloudflared, you can visit: https://developers.cloudflare.com/1.1.1.1/encryption/dns-over-https/dns-over-https-client/.

It is also possible to use the sidecar approach (i.e., run that container in the same pod as the cert-manager-controller container), but that requires a lot of changes to the cert-manager Helm chart.

Cloudflared is not the only alternative. In June 2023, we found that no Kubernetes-enabled DNS proxy tool fits the bill:

| Project | Stars | State |
|---|---|---|
| AdguardTeam/dnsproxy | 1800 | no image |
| aarond10/https_dns_proxy | 705 | no upstream image, unofficial image moranbw/https-dns-proxy-docker |
| DNSCrypt/doh-server | 573 | no image |
| facebookarchive/doh-proxy | 462 | archived |
| satishweb/docker-doh | 88 | no image |
| junkurihara/doh-server (DNSCrypt fork) | 1 | no upstream image, unofficial image jqtype/doh-server |
| jacobwoffenden/container-doh-proxy (cloudflared-based) | 0 | official image ghcr.io/jacobwoffenden/doh-proxy |

We tried aarond10/https_dns_proxy since it had an image available (although unofficial) and its many stars suggest that it is somewhat maintained. It worked.

That said, since no official image is available for it, we do not recommend using it, nor any of the tools in this list. Instead, we recommend implementing DNS-over-HTTPS in cert-manager. For more elaborate use cases (such as DNS-over-HTTPS using the RFC protocol), we recommend using one of the DoH proxies that are still maintained.

Appendix

Does DNS-over-HTTPS require finding the authoritative nameservers and following CNAME records?

TL;DR: “Authoritative nameserver lookup” and "CNAME follow" need to be disabled when using DNS-over-HTTPS (see conclusion).

For context, when solving a DNS-01 challenge, cert-manager does DNS queries at two moments:

  • Find Zone: before adding a TXT record, cert-manager calls FindZoneByFqdn to find the apex domain of the zone to which it needs to add the TXT record. It finds it by looking for the first SOA record, starting with the domain on which the TXT record is meant to be inserted.
  • Self-Check: after adding the TXT record, cert-manager does a self-check by querying the TXT record in the function checkDNSPropagation.
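The "Find Zone" step can be sketched as a walk up the DNS tree, one label at a time, until a name with an SOA record is found. This is a minimal sketch and not the actual FindZoneByFqdn implementation; has_soa is a hypothetical callback standing in for the real SOA query.

```python
def soa_candidates(fqdn: str) -> list:
    """Names to query for SOA, from the most specific name up to the TLD."""
    labels = fqdn.rstrip(".").split(".")
    return [".".join(labels[i:]) + "." for i in range(len(labels))]


def find_zone(fqdn: str, has_soa):
    """Return the first candidate for which has_soa(name) is true, else None."""
    for name in soa_candidates(fqdn):
        if has_soa(name):
            return name
    return None


# Pretend that only example.com. is a zone apex.
zone = find_zone("_acme-challenge.www.example.com.", lambda n: n == "example.com.")
```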

Years ago, cert-manager used to do a simple DNS lookup. Nowadays, cert-manager has three different DNS schemes:

  1. Resolver Mode:
    In this mode, cert-manager crawls up the tree using 8.8.8.8 but uses the first IP in the NS record to fetch the TXT record.
    • It starts by querying the SOA record for _acme-challenge.domain.com. using 8.8.8.8.
    • When it finds a CNAME record, it follows it, and when it finds a CAA record, it does something (I can't remember what).
    • When it finds an SOA record, it fetches the NS record using 8.8.8.8.
    • Finally, it uses one of the name server IPs from that NS record to fetch the TXT record.
  2. Resolver Mode Without Authoritative Check: (enabled with --dns01-recursive-nameservers-only=true)
    Same as "resolver mode" except cert-manager uses 8.8.8.8 for fetching the TXT record.

In the three modes, it is possible to change the default 8.8.8.8 nameserver by using the flag --dns01-recursive-nameservers.
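The difference between the two modes listed above comes down to which server receives the final TXT query. The sketch below lists the queries each mode would issue; it leaves out CNAME and CAA handling, and the function name, the tuple layout, and the 203.0.113.10 address are assumptions made for illustration.

```python
def plan_self_check(fqdn, zone, resolver, ns_ips, recursive_only=False):
    """List the (server, record type, name) queries a self-check would issue."""
    plan = [
        (resolver, "SOA", fqdn),  # walk up from fqdn until an SOA is found
        (resolver, "NS", zone),   # name servers of the zone that owns fqdn
    ]
    # Resolver mode asks an authoritative server for the TXT record;
    # with --dns01-recursive-nameservers-only=true the resolver is asked.
    target = resolver if recursive_only else ns_ips[0]
    plan.append((target, "TXT", fqdn))
    return plan


fqdn = "_acme-challenge.example.com."
default_plan = plan_self_check(fqdn, "example.com.", "8.8.8.8:53", ["203.0.113.10:53"])
recursive_plan = plan_self_check(
    fqdn, "example.com.", "8.8.8.8:53", ["203.0.113.10:53"], recursive_only=True
)
```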

Note: I think talking about "recursive name server" isn't a good idea. 99.9% of end-user DNS servers (e.g., 1.1.1.1) and DoH endpoints are recursive resolvers. The only non-recursive name servers are the NS servers (i.e., the authoritative name servers), which are never queried directly by anyone (except other name servers), so there is no good reason to talk about "recursive" name servers in the documentation. I think it was a mistake to mention "recursive" in the flag --dns01-recursive-nameservers: it creates doubt for the user for no good reason. No one would think that they need to give some NS IPs.

But our current DNS-01 self-check does not simply rely on a single DNS lookup. In order to work around split-horizon DNS, where cert-manager and Let's Encrypt rely on different name servers, we devised a workaround a long time ago. We call it the "authoritative nameserver lookup". It does what a recursive name server would do if we were to query it. Mimicking what a recursive resolver would do is meant to solve two issues:

  1. The DNS that cert-manager is talking to is caching records, and we worried that the negative-caching TTL period (how long the DNS caches an NXDOMAIN response) would lead to more waiting.
  2. We want to do the self-check on the authoritative nameserver so we can be sure the record has been updated at the root. This is what Let's Encrypt does.

But in the face of restricted DNS environments, as James Munnelly explains in Solutions for split-horizon and restricted DNS environment issues (written in 2018), the "authoritative nameserver lookup" won't work. In 2018, the problem was stated as:

A user has configured their cluster/VPC so that all outbound traffic on port 53 is denied, except for the one cluster DNS server (i.e. kube-dns, or their route53 resolver).

In which case the "authoritative nameserver lookup" does not make sense and has to be disabled, since cert-manager can only expect to be talking to that one resolver.

But what if all outbound traffic over UDP port 53 is blocked, and there is no way to contact an outside recursive nameserver over UDP port 53?

Other cert-manager issues in which using a DNS lookup instead of mimicking the behavior of a recursive DNS resolver would solve the issue:

Comment from Jake Sanders: Richard Wall is working on Readiness gates for ACME challenges. Perhaps we could work out what the API looks like for them and see if it makes sense to add conditions for "passed DNS check", or "passed DNS over HTTPS check".

Conclusion 9 June 2023:

  • “Authoritative nameserver lookup” does not make sense in the context of DNS-over-HTTPS since cert-manager can only talk to one resolver.
  • "CNAME Following" needs to be disabled for the same reason.

The reason we talk about "authoritative name servers" in cert-manager's DNS-01 code base is because cert-manager's self-check tries to be as close as possible to how the challenge will be performed by Let's Encrypt.

Let's Encrypt queries the authoritative name servers via Unbound's resolver. Unlike our laptops and smartphones, which rely on a non-authoritative name server that does the recursive look-up for us, Unbound first queries the NS record for the top-level domain (e.g., .com), and then recursively finds the name servers of the zone in which the challenged domain is located. For example, that would be delegated.domain.com for the challenge _challenge.delegated.domain.com.

Imagine you have a setup with 1 primary name server and 3 secondaries: you will hit 4-ns-out-of-sync if one of the 3 secondaries isn't in sync with the primary.
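The out-of-sync situation can be sketched as a check that every authoritative name server already returns the expected TXT value. The function name, the name server names, and the answers below are made up for illustration; a real check would get the answers from DNS queries.

```python
def propagated_everywhere(expected: str, answers_by_ns: dict):
    """Return (ok, lagging): ok is True only if every name server has the value."""
    lagging = sorted(
        ns for ns, values in answers_by_ns.items() if expected not in values
    )
    return (not lagging, lagging)


# One primary and three secondaries; ns3 has not picked up the record yet.
answers = {
    "ns0.example.com.": {"token-123"},
    "ns1.example.com.": {"token-123"},
    "ns2.example.com.": {"token-123"},
    "ns3.example.com.": set(),
}
ok, lagging = propagated_everywhere("token-123", answers)
```

A self-check that only asks one resolver cannot see this situation, which is the trade-off of disabling the authoritative nameserver lookup.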