SIG: SIG-Policy
Begin Design Discussion: 2024-02-12
Cilium Release: 1.16
Authors: Casey Callendrello cdc@isovalent.com, Sebastian Wicki sebastian@isovalent.com
Improve FQDN-based policy performance by changing how identities and labels are managed. Rather than propagating FQDN -> IP mappings to selectors, label IPs by the FQDN selector(s) that return them.
e.g. 18.192.231.252 -> (dns:*.cilium.io)
ToFQDN scaling limitations are a source of pain for many, many Cilium end users. When a domain name returns a large and constantly varying set of IPs (e.g. S3), this causes significant identity churn and increased memory usage.
At the time of writing, Cilium allocates a unique security identity for each IP address, even if all of those addresses are selected by a single selector. All these identities then need to be pushed into every relevant PolicyMap as well as Envoy – all before returning the DNS response to the waiting endpoint.
First, let us understand the existing components at work for ToFQDN policies.
Label: A key-value pair, optionally tagged with a source. For example, k8s:app:database is equivalent to the Pod label app=database.
Identity: A unique set of labels. Every identity has a numeric identifier.
Selector: A policy entity that selects identities.
Label Selector: A selector that references by label key and/or value.
CIDR Selector: A selector that references by CIDR. CIDR selectors are converted to a special form of label selectors.
ToFQDN Selector: A selector that dynamically selects based on the result of intercepted DNS requests and responses.
The BPF policy engine relies on identities – it does not understand CIDRs or IPs directly. So, special CIDR labels are used, and special local identities are allocated on-demand to represent these external IP blocks.
The IPCache aggregates multiple sources of labels for external CIDRs. As necessary, it will generate the complete set of labels and allocate a numerical identity.
Consider a ToCIDR block that selects 192.0.2.0/24. The policy subsystem will upsert the prefix-label mapping 192.0.2.0/24 -> (cidr:192.0.2.0/24). The IPCache will allocate a local identity (e.g. 16777216) for this set of labels. It will also convert the ToCIDR selector to a label selector, selecting the identical CIDR label. The SelectorCache will then select this numeric identity when generating the policy maps.

If another ToCIDR block selects 192.0.2.128/25, a similar prefix-label mapping will be created: 192.0.2.128/25 -> (cidr:192.0.2.128/25). Likewise, a new identity will be allocated for this unique set of labels. The SelectorCache will be updated, and the label selector cidr:192.0.2.0/24 will now match two identities.
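To make the allocation model concrete, here is a minimal, self-contained Go sketch (hypothetical types, not Cilium's actual allocator) of handing out one local numeric identity per unique label set:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// identityAllocator hands out one numeric identity per unique label set,
// mirroring (in spirit) how the IPCache allocates local identities.
type identityAllocator struct {
	next uint32
	ids  map[string]uint32 // canonical label-set key -> numeric identity
}

// lookupOrAllocate returns the identity for the given label set, allocating a
// new one only if this exact set has not been seen before.
func (a *identityAllocator) lookupOrAllocate(labels []string) uint32 {
	sorted := append([]string(nil), labels...)
	sort.Strings(sorted)
	key := strings.Join(sorted, ",")
	if id, ok := a.ids[key]; ok {
		return id
	}
	id := a.next
	a.next++
	a.ids[key] = id
	return id
}

func main() {
	a := &identityAllocator{next: 16777216, ids: map[string]uint32{}}
	fmt.Println(a.lookupOrAllocate([]string{"cidr:192.0.2.0/24"}))   // 16777216
	fmt.Println(a.lookupOrAllocate([]string{"cidr:192.0.2.128/25"})) // 16777217
	fmt.Println(a.lookupOrAllocate([]string{"cidr:192.0.2.0/24"}))   // 16777216 again
}
```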
The SelectorCache maintains a list of all "active" selectors for policies. If, for example, there is a network policy that grants access to label team: database, this selector will "live" in the SelectorCache. The policy will ask the SelectorCache for all identities that match this selector. Furthermore, the policy will "register" for incremental updates to this selector.
The NameManager maintains two key pieces of state: the set of names and patterns selected by ToFQDN policies, and a cache of observed DNS name -> IP mappings. As DNS responses come in, the NameManager determines if the name is selected by any ToFQDN policies. If so, it inserts the IP into the IPCache (if needed) and updates the relevant ToFQDN selectors.
The IPCache stores the mapping from prefix to labels. It also allocates identities for labels, and manages the mapping from prefix -> numeric identity.
When an intercepted DNS response is allowed, two key updates happen before the response is released:

1. ipc.UpsertPrefixes() is called, associating the complete set of CIDR labels with this IP address.
2. sc.UpdateFQDNSelector() is called, updating the set of selected IPs for this given FQDNSelector.

Only at this point can the DNS packet be returned to the waiting endpoint.
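As a rough sketch of this ordering constraint (stub types and the handleDNSResponse wrapper are illustrative; only the two calls above come from the text):

```go
package main

import "fmt"

// ipCache and selectorCache are stand-ins for the real components; their
// method signatures here are assumptions for illustration only.
type ipCache struct{}
type selectorCache struct{}

func (ipc *ipCache) UpsertPrefixes(ip string, labels []string) {
	fmt.Printf("ipcache: %s -> %v\n", ip, labels)
}

func (sc *selectorCache) UpdateFQDNSelector(selector string, ips []string) {
	fmt.Printf("selector %s now selects %v\n", selector, ips)
}

// handleDNSResponse shows the ordering: the response is only released to the
// waiting endpoint after the ipcache and selector cache have been updated.
func handleDNSResponse(ipc *ipCache, sc *selectorCache, name string, ips []string) {
	for _, ip := range ips {
		// 1. Associate the CIDR labels with this IP.
		ipc.UpsertPrefixes(ip, []string{"cidr:" + ip + "/32"})
	}
	// 2. Update the set of IPs selected by the matching FQDN selector.
	sc.UpdateFQDNSelector("*.cilium.io", ips)
	// 3. Only now can the DNS packet be returned to the waiting endpoint.
	fmt.Println("release DNS response for", name)
}

func main() {
	handleDNSResponse(&ipCache{}, &selectorCache{}, "www.cilium.io",
		[]string{"52.58.254.253", "18.192.231.252"})
}
```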
Some DNS names (e.g. S3) will return many different IP addresses. S3 even has a TTL of 5 seconds, so the set of IPs will quickly be very broad. This causes a large number of identities to be allocated – one per IP. These identities then need to be plumbed through to all endpoints.
Adding a new identity to a selector requires an xDS roundtrip to Envoy. This has a high tail latency, and in practice can take up to 100ms. This may have to happen multiple times for a single proxied FQDN request.
This CFP proposes 3 key changes. Together, they preserve correctness while radically restructuring how prefixes are selected:

1. Labels propagate downwards in the IPCache. If prefix 192.0.0.0/22 has label foo:bar, then prefix 192.0.2.0/24 also has label foo:bar, unless overridden. The IPCache aggregates labels accordingly.
2. CIDR selectors only insert their own prefix. When a ToCIDR selector selects 192.0.2.0/24, only the prefix -> label mapping 192.0.2.0/24 -> cidr:192.0.2.0/24 is added to the IPCache.
3. ToFQDN selectors label IPs directly. When the NameManager learns a new name -> IP pair, it upserts the label IP -> (dns:selector). It does not create any CIDR labels.

Currently, prefixes in the IPCache are independent. Instead, labels should flow downwards from prefixes. For example, consider an IPCache with two prefixes:

- 192.0.2.0/24 -> (k8s:foo:bar)
- 192.0.2.2/32 -> (reserved:kube-apiserver)
Then, the complete set of labels for 192.0.2.2/32 would be (reserved:world, k8s:foo:bar, cidr:192.0.2.2/32, reserved:kube-apiserver). As a special case, shorter CIDR labels can be omitted from descendant prefixes, as the selector logic will match longer prefixes correctly.
This is a cornerstone for implementing another feature, non-k8s label sources. It is also required for correctness within the context of this proposal, especially when ToFQDN and CIDR selectors overlap.
With label down-propagation, it is important that updates are propagated accordingly: when a prefix P is updated or deleted, updates must also be triggered for all prefixes contained in P.

This update may be calculated efficiently using a bitwise LPM trie. Alternatively, scanning the entire IPCache to determine affected prefixes may turn out to be more efficient.
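As a minimal sketch of the scanning approach (hypothetical types and function names, using net/netip), labels for a prefix could be aggregated by merging the labels of every broader prefix that contains it:

```go
package main

import (
	"fmt"
	"net/netip"
)

// effectiveLabels aggregates labels for a prefix by merging its own labels with
// those of every broader (shorter) prefix that contains it: a linear scan over
// the metadata map, as an alternative to walking an LPM trie.
func effectiveLabels(meta map[netip.Prefix][]string, p netip.Prefix) []string {
	seen := map[string]bool{}
	var out []string
	for covering, labels := range meta {
		// A broader prefix contributes its labels to all prefixes it contains.
		if covering.Bits() <= p.Bits() && covering.Contains(p.Addr()) {
			for _, l := range labels {
				if !seen[l] {
					seen[l] = true
					out = append(out, l)
				}
			}
		}
	}
	return out
}

func main() {
	meta := map[netip.Prefix][]string{
		netip.MustParsePrefix("192.0.2.0/24"): {"k8s:foo:bar"},
		netip.MustParsePrefix("192.0.2.2/32"): {"reserved:kube-apiserver"},
	}
	fmt.Println(effectiveLabels(meta, netip.MustParsePrefix("192.0.2.2/32")))
	// Output (order may vary): [k8s:foo:bar reserved:kube-apiserver]
}
```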
The label selector logic is already cidr-aware, and does not require cidr label expansion for shorter prefixes to correctly match descendant prefixes.
So, if selector A selects cidr:192.168.0.0/24, there will be a single identity, id1 -> (cidr:192.168.0.0/24). If selector B selects cidr:192.168.0.0/25, then there will be two identities: id1 -> (cidr:192.168.0.0/24) and id2 -> (cidr:192.168.0.0/25). The label matching logic ensures that selector A will now select id1 and id2, and selector B will select just id2.
Thus, when generating the set of labels for a prefix, we should discard any intermediate CIDR labels, keeping only the longest prefix.
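A small sketch of the cidr-aware matching this relies on (hypothetical types, not the actual SelectorCache implementation): a selector for a shorter prefix matches any identity whose cidr: label denotes a prefix contained within it:

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// Identity is a simplified stand-in for Cilium's numeric identity plus its labels.
type Identity struct {
	ID     uint32
	Labels []string
}

// selectsCIDR reports whether a CIDR label selector (e.g. "cidr:192.168.0.0/24")
// matches an identity, by checking whether any of the identity's cidr: labels
// fall within the selector's prefix.
func selectsCIDR(selector string, id Identity) bool {
	selPrefix := netip.MustParsePrefix(strings.TrimPrefix(selector, "cidr:"))
	for _, l := range id.Labels {
		if !strings.HasPrefix(l, "cidr:") {
			continue
		}
		p := netip.MustParsePrefix(strings.TrimPrefix(l, "cidr:"))
		// The selector matches if its prefix covers the identity's prefix.
		if selPrefix.Bits() <= p.Bits() && selPrefix.Contains(p.Addr()) {
			return true
		}
	}
	return false
}

func main() {
	id1 := Identity{ID: 16777216, Labels: []string{"cidr:192.168.0.0/24"}}
	id2 := Identity{ID: 16777217, Labels: []string{"cidr:192.168.0.0/25"}}

	// Selector A (the /24) matches both; selector B (the /25) matches only id2.
	fmt.Println(selectsCIDR("cidr:192.168.0.0/24", id1), selectsCIDR("cidr:192.168.0.0/24", id2))   // true true
	fmt.Println(selectsCIDR("cidr:192.168.0.0/25", id1), selectsCIDR("cidr:192.168.0.0/25", id2)) // false true
}
```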
Currently, FQDN selectors (as stored in the SelectorCache) understand the set of IPs that they select. They determine the relevant set of identities by generating the equivalent cidr:x.x.x.x/32 labels and looking for matching identities.

Presently, when a new (name: IP) mapping is learned, the NameManager inserts that IP in the IPCache and updates the selectors' set of desired labels to include that IP's CIDR label.
Instead, the NameManager should label IPs with the selector(s) that match that name. So, if a DNS answer www.cilium.io: 52.58.254.253, 18.192.231.252 is seen, and there is a ToFQDN policy that allows *.cilium.io, two new entries are inserted into the IPCache metadata layer:

- 52.58.254.253 -> (dns:*.cilium.io)
- 18.192.231.252 -> (dns:*.cilium.io)
If no identity exists for the label set (dns:*.cilium.io), it would be allocated and the selector would be updated. Then, the updated mapping from prefix to identity would be inserted for both IPs in the IPCache.
This optimizes a common case in the FQDN response path: learning a new IP. If the new IP does not require an identity allocation, then only a single BPF write to the IP -> identity cache is needed – neither the PolicyMaps nor Envoy need to be updated.
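A sketch of the proposed behavior (all names here are illustrative, including the toy wildcard matcher): when a DNS answer arrives, each answered IP is labeled with the matching selector(s); no CIDR labels and no per-IP identities are created:

```go
package main

import "fmt"

// nameManager is an illustrative stand-in: it holds FQDN selectors (patterns)
// and records which dns: labels each IP carries.
type nameManager struct {
	selectors []string            // e.g. "*.cilium.io"
	ipLabels  map[string][]string // IP -> dns: labels
}

// matches is a toy wildcard check standing in for real FQDN pattern matching.
func matches(pattern, name string) bool {
	if len(pattern) > 1 && pattern[0] == '*' {
		suffix := pattern[1:] // e.g. ".cilium.io"
		return len(name) >= len(suffix) && name[len(name)-len(suffix):] == suffix
	}
	return pattern == name
}

// observeDNSAnswer labels each answered IP with the selector(s) matching the name.
func (nm *nameManager) observeDNSAnswer(name string, ips []string) {
	for _, sel := range nm.selectors {
		if !matches(sel, name) {
			continue
		}
		for _, ip := range ips {
			label := "dns:" + sel
			nm.ipLabels[ip] = append(nm.ipLabels[ip], label)
			fmt.Printf("ipcache metadata: %s -> (%s)\n", ip, label)
		}
	}
}

func main() {
	nm := &nameManager{
		selectors: []string{"*.cilium.io"},
		ipLabels:  map[string][]string{},
	}
	nm.observeDNSAnswer("www.cilium.io", []string{"52.58.254.253", "18.192.231.252"})
}
```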
We would like to optimize the dynamic path, so that learning a new IP generally does not require an identity allocation or policy update. One way to do this is to pre-allocate an identity for FQDN selectors, even if there are no corresponding IPs.
This has advantages and disadvantages. It could cause unnecessary BPF PolicyMap entries, but reduces the critical-path latency. It also reduces the chance that packets are dropped on agent restart. We plan to enable pre-allocation.
Consider two selectors: one selects *.cilium.io, another selects www.cilium.io. Imagine DNS responses have been seen for www.cilium.io: IP1, IP2 and dev.cilium.io: IP2, IP3.

The IPCache label map would have:

- IP1 -> (dns:*.cilium.io, dns:www.cilium.io)
- IP2 -> (dns:*.cilium.io, dns:www.cilium.io)
- IP3 -> (dns:*.cilium.io)

There would, then, be two identities allocated:

- ID1 -> (dns:*.cilium.io, dns:www.cilium.io)
- ID2 -> (dns:*.cilium.io)

and the IP -> ID mapping would be:

- IP1 -> ID1
- IP2 -> ID1
- IP3 -> ID2
The selector *.cilium.io selects ID1 and ID2, and www.cilium.io selects ID1. Every selector resolves to the correct set of IPs; thus, overlapping selectors are handled correctly.
It is possible for the same IP to map to multiple names. Consider two names, foo.com: IP1, IP2 and bar.com: IP2, IP3. If selector A selects foo.com, and selector B selects bar.com, the state of the IPCache should be:

- IP1 -> (dns:foo.com)
- IP2 -> (dns:foo.com, dns:bar.com)
- IP3 -> (dns:bar.com)

There would be 3 identities allocated for the 3 unique sets of labels, and the selectors would select these identities accordingly. So, overlapping IPs are handled correctly.
Consider a selector, *.cilium.io, which currently selects one IP address, 52.58.254.253. If, separately, a ToCIDR selector selects 52.58.254.0/24, the state of the IPCache will be:

- 52.58.254.0/24 -> (cidr:52.58.254.0/24)
- 52.58.254.253/32 -> (dns:*.cilium.io)

However, when the IP -> Identity mapping is calculated, because of label propagation, the complete set of labels for 52.58.254.253/32 is (cidr:52.58.254.0/24, dns:*.cilium.io). So, the selector for 52.58.254.0/24 correctly selects both identities. Therefore, overlapping ToFQDN and CIDR selectors are handled correctly.
While this proposal reduces operations for the common case (adding a new IP to an existing selector), it can increase identity churn for other cases. Identity churn is the allocation of a new identity for an existing prefix.
Identity churn is not a correctness issue, as policy updates are carefully sequenced so as not to drop traffic. It is, however, a general load on the system.
In general, this proposal should be a significant performance improvement. CIDR selectors are relatively static, whereas FQDN updates are highly dynamic (and have latency guarantees). But, it is a tradeoff. A cluster with dynamic, overlapping CIDR selectors may find the identity churn to be more costly than expected.
Currently, the only labels that apply to a CIDR are the cidr: labels, which clearly propagate downwards. There are no other label sources for non-leaf CIDRs. Were this to change, we may need to re-evaluate this decision.
As described in Components & Flow, the name manager maintains a mapping from each DNS name to all associated IPs (and vice versa). It is responsible for adding IPs observed during DNS lookups to IPCache.
The name manager learns of new IP<->Name mappings whenever there is an intercepted DNS lookup and adds those to its internal DNS cache. IPs need to be evicted from that cache once their TTL has expired and they are no longer in use by an active connection.[1]
The name cache can essentially be viewed as a mapping from a DNS name to a list of IPs associated with that DNS name. Note that an IP might be associated with multiple names (e.g. IP 104.198.14.52 may be found both in the cilium.io entry and in the ebpf.io entry).
The relevant question around name cache GC which this CFP needs to answer is how the name manager reacts to changes in its DNS cache, i.e. which IPCache updates need to be issued when an entry is added to or evicted from the DNS name cache.
With CIDR-based identities (status quo), the lifecycle of a DNS IP in IPCache is rather simple: IPs are upserted as CIDR prefixes into IPCache when they are observed as part of a DNS lookup and there is a matching selector for them. When the last reference to an IP in the name manager cache is garbage collected, we remove the prefix from IPCache (more details on this can be found in the appendix).
With fqdn-label-based identities, we need to track which labels an IP needs to be associated with. An IP can be part of multiple DNS lookups, and thus be selected by different selectors. IPCache updates issued by the name manager (which add or remove labels from IPs) are all associated with the DNS name that maps to the IP.
As an example, let us assume we have two ToFQDN selectors, matchName: *.com and matchName: *.io. Let's assume we have two lookups, one for ebpf.com and one for cilium.io, both mapping to the same IP (1.2.3.4).
Then the IPCache metadata cache will store the following labels, with the domain name as the associated name owner:
- 1.2.3.4, owner: daemon/fqdn-name-manager/ebpf.com, labels: fqdn:*.com
- 1.2.3.4, owner: daemon/fqdn-name-manager/cilium.io, labels: fqdn:*.io

When IPCache then resolves the identity of the prefix 1.2.3.4, it merges the two label sets into the final identity label set fqdn:*.com,fqdn:*.io.
Note that the two metadata entries are independently managed. If all selectors matching cilium.io are removed, we issue an IPCache update removing all labels for the resource owner daemon/fqdn-name-manager/cilium.io. However, because the IP is still matched by the fqdn:*.com selector via the ebpf.com entry, which is unaffected by the change, the IP itself remains in IPCache, but with an updated identity.
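A sketch of this merge behavior (illustrative types only): metadata entries are keyed by resource owner, and the identity label set for a prefix is the union across owners, so removing one owner's entry leaves the others intact:

```go
package main

import (
	"fmt"
	"sort"
)

// metadata tracks, per prefix, the labels contributed by each resource owner,
// e.g. "daemon/fqdn-name-manager/ebpf.com" -> ["fqdn:*.com"].
type metadata map[string]map[string][]string // prefix -> owner -> labels

// identityLabels merges the label sets of all owners for a prefix into the
// label set used to resolve the prefix's identity.
func (m metadata) identityLabels(prefix string) []string {
	set := map[string]bool{}
	for _, labels := range m[prefix] {
		for _, l := range labels {
			set[l] = true
		}
	}
	var out []string
	for l := range set {
		out = append(out, l)
	}
	sort.Strings(out)
	return out
}

func main() {
	m := metadata{
		"1.2.3.4/32": {
			"daemon/fqdn-name-manager/ebpf.com":  {"fqdn:*.com"},
			"daemon/fqdn-name-manager/cilium.io": {"fqdn:*.io"},
		},
	}
	fmt.Println(m.identityLabels("1.2.3.4/32")) // [fqdn:*.com fqdn:*.io]

	// If all selectors matching cilium.io are removed, only that owner's entry
	// is deleted; the prefix stays in the IPCache with an updated identity.
	delete(m["1.2.3.4/32"], "daemon/fqdn-name-manager/cilium.io")
	fmt.Println(m.identityLabels("1.2.3.4/32")) // [fqdn:*.com]
}
```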
Labels are added or removed from IPCache under the following conditions:

- When a DNS lookup is observed for a (name, IP) pair not yet in the name cache, the labels of all ToFQDN selectors matching name are added to IP for the resource owner name.
- When a ToFQDN selector is added or removed from the selector cache, we iterate over all matching (name, IP) pairs from the name cache and update the labels of each IP for the given resource owner name. If no selector matches name anymore, all labels owned by name are removed from IPCache.
- When the name cache garbage-collects (name, IP) pairs (e.g. because the TTL of name has expired), the labels of each (name, IP) pair no longer present in the name cache are removed from IPCache.

Cilium allocates security identities locally for certain types of destinations. Remote IPs discovered by ToFQDN policies are one source of local identities. In order to prevent traffic drops on restart, Cilium attempts to maintain a stable mapping between IP and numeric identity between restarts.
The system as of v1.15 reads the "outgoing" BPF IPCache, which provides Prefix :: Numeric identity mappings. It then uses heuristics to reconstruct the likely set of labels for that prefix, and attempts to ensure the same prefix gets the same numeric identity. Once all sources of metadata have synchronized, the IPCache proceeds with determining labels and generating the new BPF map.
Since, in v1.15 and before, the only possible labels are the CIDR labels themselves and the reserved:kube-apiserver label, the 1:1 mapping between identity and prefix is simple to recreate.
As always, the numeric identity for a given prefix should, as much as possible, be stable. The reconstruction heuristics will need to change, as a single ID can correspond to multiple prefixes. We can no longer use a simple heuristic to re-create identities.
However, all hope is not lost. The state of the FQDN policy engine is also checkpointed to disk, so we can mostly recreate the set of labels that apply to a given IP. Additionally, we will checkpoint the local allocator state to disk, so that the same set of labels receives the same identity.
Checkpointing the local allocator is required so that policy maps (which contain only numeric IDs) remain stable across restarts.
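A minimal sketch of what such a checkpoint could look like (the file path, JSON format, and structure here are assumptions, not the actual implementation): persisting the label-set -> NID map lets identical label sets receive identical numeric identities after a restart:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// checkpoint is a hypothetical on-disk format: a stable mapping from a
// canonical label-set key to its local numeric identity (NID).
type checkpoint struct {
	Identities map[string]uint32 `json:"identities"`
}

func save(path string, cp checkpoint) error {
	data, err := json.Marshal(cp)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600)
}

func load(path string) (checkpoint, error) {
	var cp checkpoint
	data, err := os.ReadFile(path)
	if err != nil {
		return cp, err
	}
	return cp, json.Unmarshal(data, &cp)
}

func main() {
	path := filepath.Join(os.TempDir(), "local-allocator-checkpoint.json")
	_ = save(path, checkpoint{Identities: map[string]uint32{
		"fqdn:*,fqdn:one.one.one.one": 16777216,
		"fqdn:*":                      16777217,
	}})
	cp, _ := load(path)
	// On restart, the same label set maps to the same NID, keeping policy maps stable.
	fmt.Println(cp.Identities["fqdn:*"]) // 16777217
}
```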
The full procedure is, roughly, as follows:

1. Restore the checkpointed local allocator state, so that a given set of labels maps to the same numeric identity as before the restart.
2. Restore the checkpointed FQDN engine state, recreating the set of labels that apply to each IP.
3. Wait for all data sources to synchronize.
4. Compute labels for every prefix and allocate identities, reusing the restored Label -> NID mappings.

In this proposal, the Label -> NID mapping is now fixed. Thus, for prefixes to have stable NIDs across restarts, they must have the same labels. The agent waits for its data sources to synchronize before computing labels, so this should generally be the case.
The IPCache has three prefixes, each with a corresponding identity:

- 192.168.0.0/24 -> (cidr:192.168.0.0/24)
- 1.1.1.1/32 -> (fqdn:*, fqdn:one.one.one.one)
- 2.2.2.2/32 -> (fqdn:*)
If, on agent restart, all prefixes have the same set of labels, then the identity mapping is stable. If, however, the FQDN checkpointing is not exact (which does happen when an IP is near expiration), then 1.1.1.1/32 may change labels to (fqdn:*). However, since the identity for fqdn:* is already allowed by policy maps, no traffic to fqdn:* is dropped.
Thus, transitioning to label-based, rather than prefix-based allocation should preserve policy stability on restarts.
There is currently a circuit-breaker, --tofqdns-endpoint-max-ip-per-hostname, which is intended to prevent names such as S3 from causing identity churn. In this proposal, large names such as S3 should not allocate more than one or two identities, making this less critical.
However, this is still a security-sensitive area, since we are allocating resources based on external, untrusted packets. We will need a sane circuit-breaker to ensure we do not allocate a large number of identities based on malicious input.
The maximum number of identities allocated by ToFQDN selectors is the number of unique selector combinations seen for given IPs. The theoretical maximum is, therefore, 2^n for n selectors.[2] 10 distinct selectors would be OK (1024 combinations); more than 16 selectors could theoretically exceed the identity space.
We need a good circuit breaker for this that does not break the S3 case.
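One possible shape for such a circuit breaker (purely illustrative; not an existing flag or API): cap the number of distinct selector-label combinations rather than the number of IPs per name, so that large names like S3, which map to a single combination, remain unaffected:

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// combinationLimiter caps the number of distinct selector-label combinations
// that may be turned into identities. Counting combinations (not IPs) keeps the
// S3 case cheap: thousands of IPs sharing one combination count only once.
type combinationLimiter struct {
	max  int
	seen map[string]bool
}

var errTooManyCombinations = errors.New("too many distinct FQDN selector combinations")

func (l *combinationLimiter) admit(selectorLabels []string) error {
	key := append([]string(nil), selectorLabels...)
	sort.Strings(key)
	k := strings.Join(key, ",")
	if l.seen[k] {
		return nil
	}
	if len(l.seen) >= l.max {
		return errTooManyCombinations
	}
	l.seen[k] = true
	return nil
}

func main() {
	l := &combinationLimiter{max: 2, seen: map[string]bool{}}
	fmt.Println(l.admit([]string{"dns:*.s3.amazonaws.com"}))               // <nil>
	fmt.Println(l.admit([]string{"dns:*.s3.amazonaws.com"}))               // <nil>, same combination
	fmt.Println(l.admit([]string{"dns:*.cilium.io"}))                      // <nil>
	fmt.Println(l.admit([]string{"dns:*.cilium.io", "dns:www.cilium.io"})) // error: limit reached
}
```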
It is desired that the (ip -> nid) and (selector -> nid) mappings remain stable across restarts. This is because policy updates are not completely atomic, so identity stability is required to prevent spurious drops during the regeneration process. The window for drops widens when Envoy is also in use. This mechanism is best-effort but generally quite reliable.
On restart, there will be no local allocator checkpoint, so no local identities will exist. The restoration will dump the list of (prefix -> nid) pairs, then insert a (cidr:xxx/32) label into the ipcache and reserve the nid for that prefix as a "fallback".

The FQDN state will be restored, loading (domain -> ip) pairs. Then the (selector -> ip) mapping will be recreated, and (fqdn:selectorXXX) labels will be inserted into the ipcache for relevant prefixes.
After reading all k8s objects, the ipcache will proceed with injection. Every restored prefix will have the labels (cidr:xxx/32, fqdn:selectorXXX, fqdn:selectorYYY). An identity will be allocated for every IP, and thanks to the reservation mechanism, the numeric ID will be stable. Thus, the selector -> nids and ip -> nid mappings will remain stable on upgrade, and no traffic will be dropped.
This relies on IPs having the same set of labels before and after restart, which is not guaranteed. Logic changes may cause label sets to shift. We may need to have an "upgrade" mechanism that modifies existing labels based on certain upgrade heuristics.
The constraints for downgrades are somewhat relaxed; brief policy drops are acceptable while the agent finishes converging.
On downgrade to v1.15, the restoration logic will, again, upsert (cidr:xxx/32) labels and reserved numeric identities into the ipcache. However, in v1.16 more than one IP may share the same local identity. Thus, one IP will "capture" the numeric ID, while the others will change. Connections to those IPs may be briefly interrupted, but only during endpoint regeneration. Once regeneration is complete, there should be no further disruption.
As a precautionary measure, v1.15.tip should be updated
TODO
This section describes how IPs were added and removed from IPCache by the name manager before this proposal. All prefixes were added with the same metadata resource owner, daemon/fqdn-name-manager.
Prefixes were upserted into IPCache when a DNS lookup was observed for a (name, IP) pair not yet in the name cache. Prefixes were removed from IPCache when the last (name, IP) pair referencing an IP was garbage collected from the name cache.
[1] This is an oversimplification. In reality, the GC mechanism is rather complex and there are per-endpoint caches as well as a global cache. However, as the exact caching and GC mechanism itself is not subject to change in this CFP, we omit going into details on how it works and assume a simplified view.
[2] I somehow didn't believe this and actually calculated it, which is very silly indeed.