### Post-Mortem Summary: Production Cluster Outage Due to CNI Version Mismatch
#### Event Overview
On April 19, 2024, an unexpected update to the VPC CNI plugin during a cluster capacity expansion caused significant disruption in our production Kubernetes cluster. The update, triggered by our Infrastructure as Code (IaC) automation, resulted in a version mismatch that left the cluster unable to roll out updates or replace terminated pods.
#### Impact
The mainnet production environment faced approximately 3.5 hours of downtime, impacting core services and preventing settlements for the duration of the incident.
#### Resolution Steps
The resolution involved upgrading the Kubernetes cluster and provisioning new node groups running a compatible version of the CNI networking plugin.
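The post does not name the IaC tooling involved, so the following is only a minimal sketch, assuming a Pulumi (TypeScript) setup, of how a replacement managed node group with an explicitly chosen Kubernetes version can be declared; every identifier and version string is an illustrative placeholder:

```typescript
import * as aws from "@pulumi/aws";

// All identifiers and versions below are illustrative placeholders,
// not the values used in the actual recovery.
const nodeRoleArn = "arn:aws:iam::123456789012:role/eks-node-role";
const subnetIds = ["subnet-aaaa1111", "subnet-bbbb2222"];

const replacementNodes = new aws.eks.NodeGroup("replacement-nodes", {
    clusterName: "production-cluster",
    nodeRoleArn: nodeRoleArn,
    subnetIds: subnetIds,
    version: "1.29",     // Kubernetes version stated explicitly
    scalingConfig: {     // capacity chosen deliberately, not left to defaults
        desiredSize: 6,
        minSize: 3,
        maxSize: 12,
    },
});
```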
#### Cause
The root cause was identified as the lack of a pinned CNI version in our IaC, combined with a recent change in our IaC library version. Because the package version was not properly pinned, the library change triggered an automatic update of the networking resource. Without a compatible CNI version, newly deployed pods were unable to receive an IP address and remained stuck in the Pending state.
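Pinning the plugin version directly in the IaC removes this failure mode: the CNI only changes when an engineer deliberately edits the version string. As a minimal sketch, assuming the same hypothetical Pulumi (TypeScript) setup, the VPC CNI can be managed as an explicit EKS addon with a pinned version:

```typescript
import * as aws from "@pulumi/aws";

// Managing the VPC CNI as an explicit EKS addon with a pinned version:
// an IaC library upgrade can no longer change it silently. The cluster
// name and version string are illustrative placeholders.
const vpcCni = new aws.eks.Addon("vpc-cni", {
    clusterName: "production-cluster",
    addonName: "vpc-cni",
    addonVersion: "v1.16.4-eksbuild.2",   // pinned; upgraded only deliberately
    resolveConflictsOnUpdate: "PRESERVE", // keep in-place customizations on upgrade
});
```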
The capacity update itself had been initiated in response to a performance regression caused by a change to the liquidity fetching logic for CoW Protocol's baseline liquidity. The increased memory and CPU consumption led to core services being frequently evicted. After the accidental CNI update, every pod restart left the pod stuck in the Pending state, and the system became increasingly unavailable.
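For context on the eviction mechanism: Kubernetes evicts pods under node resource pressure, and pods whose declared requests understate their real footprint are among the first to go. A minimal sketch, assuming a hypothetical Pulumi Kubernetes (TypeScript) deployment with placeholder names and figures, of declaring the increased footprint explicitly:

```typescript
import * as k8s from "@pulumi/kubernetes";

// Declaring the increased footprint explicitly: requests inform scheduling,
// the memory limit caps runaway growth. Service name, image, and figures
// are illustrative placeholders, not actual production values.
const appLabels = { app: "core-service" };
const coreService = new k8s.apps.v1.Deployment("core-service", {
    spec: {
        replicas: 3,
        selector: { matchLabels: appLabels },
        template: {
            metadata: { labels: appLabels },
            spec: {
                containers: [{
                    name: "core-service",
                    image: "registry.example.com/core-service:1.2.3",
                    resources: {
                        requests: { cpu: "500m", memory: "1Gi" },
                        limits: { memory: "2Gi" },
                    },
                }],
            },
        },
    },
});
```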
#### Lessons Learned
This incident underscores the importance of pinning versions in our IaC scripts to prevent unintended updates. It also highlights the need for easy-to-revert change management, especially in production environments.
#### Recovery Timeline
- **16:40 UTC**: Outage officially begins with settlements ceasing.
- **17:03 UTC**: CNI version mismatch identified as the cause.
- **~19:00 UTC**: Cluster upgrade begins.
- **20:08 UTC**: Services restored and operational.
For further details, see the updates on this incident communicated via our official Twitter account [here](https://twitter.com/CoWSwap/status/1781406819006464111).