# Article import issues `#67855`

Date: 2022-02-02

## Contributors:

* Lukáš Loukota (Ackee BE)
* Martin Beránek (Ackee DevOps)

#### People involved:

* Marek Keřka (LS)
* Filip Horký (LS)
* Aleš Kafka (LS)

## Summary:

Article import got stuck. No messages from the `extract-article` queue were consumed.

## Impact:

Clients did not receive the latest news for ~30 minutes.

## Root-cause:

The GKE node pool was updated due to the Terraform module's [latest bug fix](https://github.com/AckeeCZ/terraform-gke-vpc/commit/cc0889f8e87aa440006fbbbdec6e92032691363a), which caused the nodes to slowly evict all running pods, install the [netd](https://github.com/GoogleCloudPlatform/netd) daemon set, and allow pods to be scheduled again. Because the underlying agents now request more CPU, pods with higher CPU demands became unschedulable. As a result, Boxer pods did not fit on any node and the import got stuck.

## Resolution:

The Boxer deployment's CPU requests were lowered, which allowed the scheduler to place the pods back onto the nodes. The maximum number of pods on the HPA was raised, and the node pools were set to scale up by 4 additional nodes (in each zone). An illustrative sketch of this kind of change is included at the end of this document.

## Detection:

By a customer; first reported in the [Slack `#ls-alerts` channel](https://ackee.slack.com/archives/C01V8TMSTMZ/p1643733866648409).

## Action items:

* Node pools will be vertically scaled to leave room for the local agents; the node size is currently being evaluated in the development environment, see
  * https://gitlab.ack.ee/Infra/flash-news/flash-news-infrastruktura-development/-/blob/master/gke.tf#L15
* Alerts for messages stuck in the queue will be adjusted to react faster, triggering on messages older than 600 seconds
* CPU allocation will be tested more thoroughly each time the node pool is updated

# Lessons Learned

## What went well:

* The reservation could be changed quickly from the [Workload console](https://console.cloud.google.com/kubernetes/workload/overview?project=flash-news-production)

## What went wrong:

* The node pool upgrade was tested on [FS](https://gitlab.ack.ee/Infra/flash-sport/flash-sport-infrastruktura-production/-/commit/e55318a86f94d192fe67e7e5add995abdbb2dfbe), but the issue did not manifest there because of the node size at FS
* The incident manifested itself only after a long time because a node pool upgrade takes hours

## Where we got lucky:

* Lukáš Loukota was online and checked for issues faster than Martin even had a chance to get online 👍

## Timeline:

* 1.2.22 17:44 - a customer reported an issue in the [Slack `#ls-alerts` channel](https://ackee.slack.com/archives/C01V8TMSTMZ/p1643733866648409)
* 1.2.22 17:52 - [Marek Keřka gave a call](https://ackee.app.eu.opsgenie.com/alert/detail/03331667-902c-4feb-b765-fc3ba53f5519-1643734375258/logs) to the support number and reported the cron [retriever-application-cron-pull-storyfa-articles](https://console.cloud.google.com/kubernetes/cronjob/europe-west3/flash-news-production-50479/production/retriever-application-cron-pull-storyfa-articles/details?project=flash-news-production) as being stuck
* 1.2.22 17:56 - monitoring [reported](https://ackee.slack.com/archives/C01V8TMSTMZ/p1643734594516599) a max-age alert for the `extract-article` queue
* 1.2.22 17:59 - after a short evaluation of the situation, Martin Beránek executed the [cron job manually](https://cloudlogging.app.goo.gl/MnSkEZLBsrhGARka7) to resolve the reported issue; the workload did not report any underlying issues, so the reported issue was not the root cause
* 1.2.22 18:02 - Lukáš Loukota reported possible scheduler issues in GKE for the Boxer deployment, confirmed by Martin Beránek
* 1.2.22 18:11 - Martin Beránek deployed lower CPU requests for Boxer, which allowed the scheduler to place the pods onto the node pools
* 1.2.22 18:23 - Martin Beránek [reported](https://ackee.slack.com/archives/C01V8TMSTMZ/p1643736210058339?thread_ts=1643733866.648409&cid=C01V8TMSTMZ) that messages were being consumed

## Supporting information:

Slack discussion of the issue: https://ackee.slack.com/archives/C01V8TMSTMZ/p1643733866648409
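## Appendix: illustrative manifest sketch

For illustration, below is a minimal sketch of the kind of Kubernetes change described in the Resolution: lowering the Deployment's CPU requests so its pods fit on the nodes again, and raising the HPA's pod ceiling. The `boxer` name comes from this report, but the image, CPU/memory figures, and replica counts are hypothetical placeholders, not the actual production manifests.

```yaml
# Illustrative sketch only: image, CPU/memory values and replica counts are
# hypothetical placeholders, not the real production configuration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: boxer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: boxer
  template:
    metadata:
      labels:
        app: boxer
    spec:
      containers:
        - name: boxer
          image: example.registry.dev/boxer:latest  # placeholder image
          resources:
            requests:
              cpu: 250m        # lowered request so the pod fits beside the node agent overhead
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: boxer
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: boxer
  minReplicas: 2
  maxReplicas: 6               # raised ceiling so the HPA can add pods once they are schedulable again
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Lowering the requests trades scheduling headroom for a smaller guaranteed CPU share, so it is a stop-gap until the node pools are vertically scaled as described in the action items.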