In the default mode of the VPC CNI, without prefix delegation, individual IP addresses are assigned to ENIs.
With prefix delegation, a range of IP addresses is assigned to an ENI. Traffic to an address within the assigned range is directed to the ENI. This greatly increases the number of IP addresses, and thus pods, that may be on a single node.
More specifically, you can assign a private CIDR range (network prefix) to your network interfaces (ENIs) on EC2 instances. IPv4 and IPv6 ranges may be assigned to an ENI. The prefix assignments may be manually or automatically managed.
Amazon VPC CNI assigns network prefixes to Amazon EC2 network interfaces to increase the number of IP addresses available to nodes and to increase pod density per node. You can configure version 1.9.0 or later of the Amazon VPC CNI add-on to assign IPv4 and IPv6 CIDRs instead of individual IP addresses to network interfaces.
[[GDC: JC do you prefer "ENABLE_PREFIX_DELEGATION is set to true" as the first line (technical accuracy) or "When prefix delegation is enabled" (more general, users should follow the link to learn how to enable)]]
[[GDC: SJ is it always a /28 prefix? is it always an IPv4 prefix? Do we want to add detail here about prefix selection?]
[SJ: Yes, it is always /28, VPC doesn't allow custom prefix today]
]
When `ENABLE_PREFIX_DELEGATION` is set to true, Amazon VPC CNI allocates /28 IPv4 address prefixes (16 IP addresses each) to the ENIs instead of assigning individual IPv4 addresses. The maximum number of elastic network interfaces per instance type remains the same in prefix assignment mode. Pods are assigned an IPv4 address from a prefix assigned to the ENI. Follow the instructions in the EKS user guide to enable prefix IP mode.
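As a small illustration of what a /28 prefix covers, the sketch below uses Python's standard `ipaddress` module. The prefix `10.0.1.16/28` and the helper name `in_prefix` are made up for this example and are not part of the VPC CNI.

```python
import ipaddress

# Hypothetical /28 prefix assigned to an ENI (example value only).
ENI_PREFIX = ipaddress.ip_network("10.0.1.16/28")


def in_prefix(address: str, prefix: ipaddress.IPv4Network = ENI_PREFIX) -> bool:
    """Return True if the address falls inside the prefix assigned to the ENI."""
    return ipaddress.ip_address(address) in prefix


if __name__ == "__main__":
    print(ENI_PREFIX.num_addresses)  # a /28 always holds 16 addresses
    print(in_prefix("10.0.1.20"))    # inside 10.0.1.16 through 10.0.1.31
    print(in_prefix("10.0.1.32"))    # first address after the prefix
```

Any Pod address drawn from this range is reachable through the ENI that holds the prefix, which is why one prefix slot can serve up to 16 Pods.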
During worker node initialization, the VPC CNI assigns a CIDR block prefix to the primary ENI. The CNI pre-allocates a prefix and IP addresses for faster pod startup by maintaining a warm pool. The number of prefixes held in the warm pool can be controlled by setting the following environment variables:
- `WARM_PREFIX_TARGET`, the number of prefixes to be allocated in excess of instant need.
- `WARM_IP_TARGET`, the number of IP addresses to be allocated in excess of instant need.
- `MINIMUM_IP_TARGET`, the minimum number of IP addresses to be allocated at any time.
- `WARM_IP_TARGET` and `MINIMUM_IP_TARGET`, if set, will override `WARM_PREFIX_TARGET`.
As IP needs increase (as more Pods are scheduled), additional prefixes will be requested for the existing ENI.
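The precedence between these variables can be sketched as follows. This is an illustration only, not the ipamd implementation: the function name `effective_warm_settings` is hypothetical, it assumes that setting either `WARM_IP_TARGET` or `MINIMUM_IP_TARGET` is enough to take precedence, and it assumes a default of 1 for `WARM_PREFIX_TARGET`.

```python
def effective_warm_settings(env: dict) -> dict:
    """Decide which warm-pool settings govern allocation (simplified model).

    `env` maps environment variable names to string values, for example the
    variables set on the CNI container.
    """
    warm_ip = env.get("WARM_IP_TARGET")
    min_ip = env.get("MINIMUM_IP_TARGET")

    # Assumption: if either IP-level target is set, the prefix target is ignored.
    if warm_ip is not None or min_ip is not None:
        return {
            "mode": "ip-targets",
            "WARM_IP_TARGET": int(warm_ip or 0),
            "MINIMUM_IP_TARGET": int(min_ip or 0),
        }

    # Assumption: WARM_PREFIX_TARGET defaults to 1 when nothing is set.
    return {
        "mode": "prefix-target",
        "WARM_PREFIX_TARGET": int(env.get("WARM_PREFIX_TARGET", 1)),
    }


if __name__ == "__main__":
    print(effective_warm_settings({}))
    print(effective_warm_settings({"WARM_PREFIX_TARGET": "2", "WARM_IP_TARGET": "5"}))
```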
First, the VPC CNI attempts to allocate a new prefix to an existing ENI. If the ENI is at capacity, the VPC CNI attempts to allocate a new ENI to the instance. New ENIs will be attached until the maximum per instance (defined by the instance type) is reached.
When a new ENI is allocated, ipamd allocates either one prefix or the number of prefixes needed to maintain the `WARM_PREFIX_TARGET`, `WARM_IP_TARGET`, and `MINIMUM_IP_TARGET` settings.
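The allocation order described above can be sketched as a small decision function. This is a simplified model rather than the actual ipamd logic; the function name, the return values, and the example limits (3 ENIs with 9 prefix slots each) are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple


def next_allocation_step(
    prefixes_per_eni: List[int],
    max_prefixes_per_eni: int,
    max_enis: int,
) -> Tuple[str, Optional[int]]:
    """Return the next action when another prefix is needed.

    prefixes_per_eni holds the current prefix count of each attached ENI.
    """
    # Step 1: prefer a free slot on an ENI that is already attached.
    for index, count in enumerate(prefixes_per_eni):
        if count < max_prefixes_per_eni:
            return ("assign_prefix", index)

    # Step 2: every attached ENI is full, so attach a new one if allowed.
    if len(prefixes_per_eni) < max_enis:
        return ("attach_eni", len(prefixes_per_eni))

    # Step 3: the instance type's ENI limit is reached.
    return ("at_capacity", None)


if __name__ == "__main__":
    print(next_allocation_step([3], max_prefixes_per_eni=9, max_enis=3))
    print(next_allocation_step([9], max_prefixes_per_eni=9, max_enis=3))
    print(next_allocation_step([9, 9, 9], max_prefixes_per_eni=9, max_enis=3))
```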
[[GDC: I am reluctant to include this. I think the guidance should be only to use the script/resources linked in the last paragraph.]]
[[GDC: note to self – I think this guidance appears multiple times. Reduce duplication]]
You can use the following formula to determine the maximum number of Pods you can deploy on a node when Prefix IP mode is enabled.
For example, on an m5.large instance, the maximum number of Pods you can run without prefix IP mode is 29, whereas with prefix assignment it is 110 on smaller instance types and 250 on larger instance types with more than 30 vCPUs.
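As a rough sketch of this kind of calculation, the code below assumes the commonly documented EKS formula: ENIs * (IPv4 addresses per ENI - 1) + 2 without prefixes, with each slot multiplied by 16 in prefix mode, then capped at 110 or 250. The function names are hypothetical, the `+ 2` allowance and the m5.large limits used in the example (3 ENIs, 10 IPv4 addresses per ENI, 2 vCPUs) are assumptions, and the result should be checked against the EKS user guide.

```python
PREFIX_SIZE = 16  # addresses in a /28 prefix


def ip_based_limit(enis: int, ips_per_eni: int, prefix_delegation: bool) -> int:
    """Pods that could be given an address, ignoring CPU and memory."""
    slots = ips_per_eni - 1  # the primary address of each ENI is not used for Pods
    per_slot = PREFIX_SIZE if prefix_delegation else 1
    # Assumption: +2 allows for Pods that use host networking.
    return enis * slots * per_slot + 2


def recommended_max_pods(enis: int, ips_per_eni: int, vcpus: int,
                         prefix_delegation: bool) -> int:
    """Apply the 110 / 250 cap described in the text when prefixes are used."""
    limit = ip_based_limit(enis, ips_per_eni, prefix_delegation)
    if not prefix_delegation:
        return limit
    cap = 250 if vcpus > 30 else 110
    return min(limit, cap)


if __name__ == "__main__":
    # Assumed m5.large limits: 3 ENIs, 10 IPv4 addresses per ENI, 2 vCPUs.
    print(recommended_max_pods(3, 10, 2, prefix_delegation=False))
    print(ip_based_limit(3, 10, prefix_delegation=True))
    print(recommended_max_pods(3, 10, 2, prefix_delegation=True))
```

Under these assumptions the address-based limit in prefix mode is far higher than the cap, so the cap, not the IP space, determines the recommended value.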
Managed node groups automatically calculate the maximum number of Pods for you. Do not change EKS's recommended value for the maximum number of Pods; doing so can cause Pod scheduling failures due to resource limitations.
For self-managed nodes, we suggest setting the maximum number of Pods as described in the EKS user guide to avoid exhausting the instance’s CPU and memory resources. Consider using the max-pods-calculator.sh script to calculate EKS's recommended maximum Pods for a given instance type. Also, the Kubernetes community recommends that the maximum number of Pods be no more than 110 or 10 times the number of cores.
Consider using similar instance types in a node group to maximize node use. Your node group may contain instances of many types. If an instance has a low maximum pod count, that value is applied to all nodes in the node group.
!!! warning
    The maximum pod count for all nodes in a particular node group is defined by the lowest maximum pod count of any single instance type in the node group.
`WARM_PREFIX_TARGET` to conserve IPv4 addresses
The installation manifest's default value for `WARM_PREFIX_TARGET` is 1. In most cases, the recommended value of 1 for `WARM_PREFIX_TARGET` provides a good balance between fast pod launch times and few unused IP addresses assigned to the instance.
If you need to further conserve IPv4 addresses per node, use the `WARM_IP_TARGET` and `MINIMUM_IP_TARGET` settings, which override `WARM_PREFIX_TARGET` if set. By setting `WARM_IP_TARGET` to a value less than 16, you can prevent the CNI from keeping an entire excess prefix attached.
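To make the trade-off concrete, the following sketch estimates how many addresses sit unused under each setting. It is a simplified model, not the CNI's actual accounting: it assumes Pods are packed densely into prefixes, and the function names `attached_prefixes` and `unused_ips` are hypothetical.

```python
import math

PREFIX_SIZE = 16  # addresses in a /28 prefix


def attached_prefixes(pods, warm_prefix_target=1,
                      warm_ip_target=None, minimum_ip_target=None):
    """Estimate the number of /28 prefixes kept attached (simplified model)."""
    if warm_ip_target is not None or minimum_ip_target is not None:
        # IP-level targets take precedence over the prefix target.
        needed = max(pods + (warm_ip_target or 0), minimum_ip_target or 0)
        return math.ceil(needed / PREFIX_SIZE)
    # Prefix target: prefixes in use plus the requested spare prefixes.
    return math.ceil(pods / PREFIX_SIZE) + warm_prefix_target


def unused_ips(pods, **targets):
    """Addresses attached to the node but not assigned to any Pod."""
    return attached_prefixes(pods, **targets) * PREFIX_SIZE - pods


if __name__ == "__main__":
    for pods in (0, 5, 17):
        by_prefix = unused_ips(pods, warm_prefix_target=1)
        by_ip = unused_ips(pods, warm_ip_target=5, minimum_ip_target=5)
        print(pods, by_prefix, by_ip)
```

In this model, 5 Pods with `WARM_PREFIX_TARGET` of 1 hold two prefixes, while a `WARM_IP_TARGET` of 5 is satisfied by a single prefix.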
Allocating an additional prefix to an existing ENI is a faster EC2 API operation than creating and attaching a new ENI to the instance. Using prefixes improves performance while being frugal with IPv4 address allocation. Attaching a prefix typically completes in under a second, whereas attaching a new ENI can take up to 10 seconds. For most use cases, the CNI will only need a single ENI per worker node when running in prefix assignment mode. If you can afford (in the worst case) up to 15 unused IPs per node, we strongly recommend using the newer prefix assignment networking mode, and realizing the performance and efficiency gains that come with it.
When EC2 allocates a /28 IPv4 prefix to an ENI, it has to be a contiguous block of IP addresses from your subnet. If the subnet that the prefix is generated from is fragmented (a highly used subnet with scattered secondary IP addresses), the prefix attachment may fail, and you will see the following error message in the VPC CNI logs:
To avoid fragmentation and have sufficient contiguous space for creating prefixes, you may use VPC Subnet CIDR reservations to reserve IP space within a subnet for exclusive use by prefixes. Once you create a reservation, the VPC CNI plugin will call EC2 APIs to assign prefixes that are automatically allocated from the reserved space.
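The effect of fragmentation can be illustrated with the standard `ipaddress` module. The sketch below lists the aligned /28 blocks in a subnet that contain no used address. The subnet, the used addresses, and the function name `free_prefix_blocks` are made-up examples, and the model ignores the addresses that AWS reserves in every subnet.

```python
import ipaddress
from typing import Iterable, List


def free_prefix_blocks(subnet_cidr: str, used_ips: Iterable[str]) -> List[ipaddress.IPv4Network]:
    """Return the /28 blocks of the subnet that have no address in use."""
    subnet = ipaddress.ip_network(subnet_cidr)
    used = {ipaddress.ip_address(ip) for ip in used_ips}
    free = []
    for block in subnet.subnets(new_prefix=28):
        # A single used address anywhere in the block makes it unusable as a prefix.
        if not any(address in used for address in block):
            free.append(block)
    return free


if __name__ == "__main__":
    # Four scattered addresses occupy every /28 block of this /26 subnet.
    scattered = ["10.0.0.5", "10.0.0.20", "10.0.0.40", "10.0.0.50"]
    print(free_prefix_blocks("10.0.0.0/26", scattered))

    # The same subnet with its used addresses clustered in one block.
    clustered = ["10.0.0.5", "10.0.0.6", "10.0.0.7", "10.0.0.8"]
    print(free_prefix_blocks("10.0.0.0/26", clustered))
```

In the scattered case 60 of the 64 addresses are free, yet no contiguous /28 block remains. Reserving space for prefixes keeps such blocks intact.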
[[GDC: link to how to do this?]]
It is recommended to create a new subnet, reserve space for prefixes, and enable prefix assignment with VPC CNI for worker nodes running in that subnet. If the new subnet is dedicated only to Pods running in your EKS cluster with VPC CNI prefix assignment enabled, then you can skip the prefix reservation step.
Prefix mode works with VPC CNI version 1.9.0 and later. Avoid downgrading the Amazon VPC CNI add-on to a version lower than 1.9.0 once prefix mode is enabled and prefixes are assigned to ENIs. If you decide to downgrade the VPC CNI, you must delete and recreate the nodes.
It is highly recommended that you create new nodes to increase the number of available IP addresses. Cordon and drain all the existing nodes to safely evict all of your existing Pods. Pods on new nodes will be assigned an IP from a prefix assigned to an ENI. After you confirm the Pods are running, you can delete the old nodes and node groups.
[[GDC: replace with link to IPv6 page? move following section to IPv6 page?]]
[[GDC: I think this is general description of IPv6, how it works on EKS, and benefits.]
[SJ: Yes, next section is not under recommendation but more so additional information. The intent is to cover how Prefix assignment works in IPv6 clusters]
]
Cluster networking on IPv6 clusters is supported only in prefix assignment mode by the AWS VPC CNI and only works with AWS Nitro-based EC2 instances. Prefix assignment is enabled by default on IPv6 clusters (supported by VPC CNI v1.10.0 and later). In contrast to IPv4, the VPC CNI assigns a /80 IPv6 prefix to the ENI.
A single IPv6 prefix has many addresses (a /80 provides on the order of 10^14 addresses per ENI) and is big enough to support large clusters with millions of Pods, which also removes the need for warm prefixes and minimum IP configurations.
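The size of a /80 prefix can be checked with Python's standard `ipaddress` module. The prefix below comes from the IPv6 documentation range and is only an example; the helper name `addresses_in_prefix` is hypothetical.

```python
import ipaddress


def addresses_in_prefix(cidr: str) -> int:
    """Return the number of addresses covered by a CIDR block."""
    return ipaddress.ip_network(cidr).num_addresses


if __name__ == "__main__":
    count = addresses_in_prefix("2001:db8::/80")
    # A /80 leaves 128 - 80 = 48 host bits, so the block holds 2**48 addresses.
    print(f"{count:,}")
    print(f"{count:.1e}")
```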
A single IPv6 prefix is sufficient to run many Pods on a single node. This also effectively removes the max-pods limitations tied to ENI and IP limitations. Although IPv6 removes direct dependency on max-pods, when using prefix attachments with smaller instance types like the m5.large, you’re likely to exhaust the instance’s CPU and memory resources long before you exhaust its IP addresses. Please follow the recommendations for maximum Pods as below.