Post-Mortem Summary: Production Cluster Outage Due to CNI Version Mismatch
Event Overview
On April 19, 2024, an unexpected update to the VPC CNI plugin during a cluster capacity expansion caused significant disruption in our production Kubernetes cluster. The update, triggered by an Infrastructure as Code (IaC) automation, resulted in a version mismatch that prevented updates and pod terminations.
Impact
The mainnet production environment faced approximately 3.5 hours of downtime, impacting core services and preventing settlements during this period.
Resolution Steps
The resolution involved upgrading the Kubernetes cluster and setting up new node groups with a new version of the networking plugin.