# Pub/Sub billing alert `#70349`

Date: 2022-04-07

## Contributors:

* Ondřej Štorc (Ackee DevOps)
* Martin Beránek (Ackee DevOps)
* Jaroslav Šmolík (Ackee BE)

#### People involved:

* Ladislav Louka (Ackee BE)
* Jan Vítek (Ackee PM)
* Dominik Veselý (Ackee CTO)

## Summary:

An enormous increase in traffic on the `extract-article` Pub/Sub queue in the stage environment caused a significant [increase in billing](https://console.cloud.google.com/billing/018C0C-488344-C3897B/reports;chartType=STACKED_BAR;timeRange=CUSTOM_RANGE;from=2022-04-03;to=2022-04-06;grouping=GROUP_BY_SKU;projects=flash-news-stage?project=flash-news-stage). The overall cost was ~1200 EUR.

## Impact:

Queue processing of `extract-article` was stuck on the page https://www.villarrealcf.es/en/news/item/33908-weekend-results-for-yellows-sides. Other messages were processed correctly. There was NO IMPACT ON PRODUCTION.

## Root-cause:

The custom DLQ mechanism was misconfigured. The implementation counts occurrences of `UUID` hashes from the article tracing info and increments a counter under a dedicated Redis key. When the counter for a message from the `extract-article` subscription reaches the threshold (3), the message is ACKed and moved to `extract-article-error`. Due to a configuration error, the DLQ topic was the same as the normal subscription (`extract-article`). Since there is a topic of the same name linked to the subscription, the faulty message kept being republished to the same topic, reappearing in the same subscription over and over again, even after the threshold was met. (source: [Jaroslav Šmolík](https://ackee.slack.com/archives/CQ0DZNWG5/p1649322975780909?thread_ts=1649317020.988629&cid=CQ0DZNWG5))

## Resolution:

The valid configuration was [merged](https://gitlab.ack.ee/Backend/flash-news-boxer/-/merge_requests/273/diffs) and the fixed version of boxer was then deployed.

## Detection:

By the billing alert in the [Slack `#ls-alerts` channel](https://ackee.slack.com/archives/C01V8TMSTMZ/p1649314809243409).
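
The loop described under Root-cause can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not boxer's actual code: in-memory dictionaries stand in for Pub/Sub and Redis, and the names `validate_dlq_config`, `simulate` and `max_deliveries` are hypothetical. Only the threshold (3) and the topic names come from this report.

```python
from collections import defaultdict, deque

THRESHOLD = 3  # from the report: a message is moved to the DLQ after 3 occurrences


def validate_dlq_config(source_topic: str, dlq_topic: str) -> None:
    """Hypothetical startup check: the DLQ topic must not be the source topic."""
    if source_topic == dlq_topic:
        raise ValueError(f"DLQ topic must differ from source topic {source_topic!r}")


def simulate(source_topic: str, dlq_topic: str, max_deliveries: int = 20) -> dict:
    """Deliver one always-failing message and report what happens to it."""
    queues = defaultdict(deque)   # stand-in for Pub/Sub topics and their subscriptions
    counters = defaultdict(int)   # stand-in for the per-message Redis counters
    queues[source_topic].append("msg-1")
    deliveries = 0

    for _ in range(max_deliveries):   # safety cap so the misconfigured case terminates
        if not queues[source_topic]:
            break
        msg = queues[source_topic].popleft()
        deliveries += 1
        counters[msg] += 1
        if counters[msg] >= THRESHOLD:
            # Threshold reached: ACK and republish to the configured DLQ topic.
            queues[dlq_topic].append(msg)
        else:
            # Processing failed below the threshold: the message is redelivered.
            queues[source_topic].append(msg)

    return {"deliveries": deliveries, "still_in_source": len(queues[source_topic])}
```

With distinct topics (`extract-article` and `extract-article-error`) the message leaves the source queue after three deliveries. With both set to `extract-article`, it keeps circulating until the safety cap stops the simulation, which mirrors the behaviour seen on stage.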
## Action items:

* Fix the misconfiguration and check it once it is merged between environments.
* Introduce a new billing alert that notifies us separately about increases over a given value. Based on current empirical experience, we will apply a 30 USD threshold.
* Investigate possible alert floods on stage.
* Write to billing support to check whether there is any way to lower the high bill.

# Lessons Learned

## What went well:

* Quick response from all parties involved after the issue was acknowledged and escalated (once detected).

## What went wrong:

* The configuration error got to production without triggering any monitoring incident.
* Due to alert noise, the first [billing alert](https://ackee.slack.com/archives/C01V8TMSTMZ/p1649228409676269) was not checked properly.
* Signs of a problem were visible in the Slack channel, yet they went unnoticed because issues arrive in the channel every morning due to K6 benchmarks.
* [Site size](https://www.villarrealcf.es/en/news/item/33908-weekend-results-for-yellows-sides) was causing issues during processing, which resulted in the message being handled by the DLQ mechanism.
* Processed site size was discussed in the past: there is no reason to process a site larger than a given threshold. The threshold itself was [never decided](https://ackee.slack.com/archives/CQ0DZNWG5/p1630423170022300?thread_ts=1630421940.019600&cid=CQ0DZNWG5).
* Due to logging, each time the [site](https://www.villarrealcf.es/en/news/item/33908-weekend-results-for-yellows-sides) was processed, a large log line was generated, causing a large bill for log volume.
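
The two size-related points above (oversized sites and oversized log lines) could be guarded along these lines. This is only a sketch: the report states that no size threshold was ever decided, so `MAX_ARTICLE_BYTES` and `MAX_LOG_CHARS` are placeholder values, and the function names are hypothetical rather than part of boxer.

```python
MAX_ARTICLE_BYTES = 2_000_000  # placeholder: the real threshold was never decided
MAX_LOG_CHARS = 500            # placeholder cap on how much payload is logged


def should_process(html: str, limit: int = MAX_ARTICLE_BYTES) -> bool:
    """Return False for pages whose UTF-8 size exceeds the limit."""
    return len(html.encode("utf-8")) <= limit


def truncate_for_log(text: str, limit: int = MAX_LOG_CHARS) -> str:
    """Shorten a payload before logging so one message cannot inflate log volume."""
    if len(text) <= limit:
        return text
    omitted = len(text) - limit
    return f"{text[:limit]}... [{omitted} chars omitted]"
```

Rejecting an oversized page before processing keeps it out of the retry path altogether, while truncating logged payloads limits the log volume even if a large message is retried.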
## Where we got lucky:

* A similar issue didn't happen in production or development.
* The fix happened within a few minutes once detected.

## Timeline:

* 28.3.2022 13:24 - Version of boxer with invalid configuration was [deployed](https://gitlab.ack.ee/Backend/flash-news-boxer/-/pipelines/301908) to `stage`
* 04.4.2022 18:42 - Around this time an article was published for [processing](https://cloudlogging.app.goo.gl/hwLyQiUX1NZRNy4t9); it wasn't processed and was sent back to the same queue
* 06.4.2022 09:00 - First billing [alert](https://ackee.slack.com/archives/C01V8TMSTMZ/p1649228409676269), with Log Volume increased to 40$ from 7.3$
* 06.4.2022 09:03 - Ondřej Štorc [acknowledged](https://ackee.slack.com/archives/C01V8TMSTMZ/p1649228601393439) the first alert but didn't investigate deeply
* 07.4.2022 09:00 - Second billing [alert](https://ackee.slack.com/archives/C01V8TMSTMZ/p1649314809243409), with Log Volume increased to over 300$
* 07.4.2022 09:03 - Ondřej Štorc [acknowledged](https://ackee.slack.com/archives/C01V8TMSTMZ/p1649314998631049) the second alert and started analyzing the issue
* 07.4.2022 09:37 - Martin Beránek created a Slack [thread](https://ackee.slack.com/archives/CQ0DZNWG5/p1649317020988629) where he tagged Ladislav Louka and Jaroslav Šmolík
* 07.4.2022 09:37 - Ondřej Štorc found out that the issue [is caused](https://ackee.slack.com/archives/CQ0DZNWG5/p1649317157388199?thread_ts=1649317020.988629&cid=CQ0DZNWG5) by increased traffic from the `extract-article` queue
* 07.4.2022 09:46 - Martin Beránek [stopped](https://ackee.slack.com/archives/CQ0DZNWG5/p1649317596559019?thread_ts=1649317020.988629&cid=CQ0DZNWG5) boxer processing on `stage` to avoid a further increase in billing
* 07.4.2022 10:05 - Martin Beránek [found](https://ackee.slack.com/archives/CQ0DZNWG5/p1649318710434369?thread_ts=1649317020.988629&cid=CQ0DZNWG5) the article whose processing could potentially be causing the issue
* 07.4.2022 11:08 - Jaroslav Šmolík [found](https://ackee.slack.com/archives/CQ0DZNWG5/p1649322519522289?thread_ts=1649317020.988629&cid=CQ0DZNWG5) the issue in the configuration of boxer
* 07.4.2022 11:11 - Jaroslav Šmolík [merged](https://gitlab.ack.ee/Backend/flash-news-boxer/-/merge_requests/273) the fix to the `stage` branch
* 07.4.2022 11:17 - Fixed version of boxer was [deployed](https://gitlab.ack.ee/Backend/flash-news-boxer/-/jobs/1795897) to `stage`, which also started boxer on `stage` again
* 07.4.2022 12:06 - Ondřej Štorc [confirmed](https://ackee.slack.com/archives/CQ0DZNWG5/p1649325968840869?thread_ts=1649317020.988629&cid=CQ0DZNWG5) that this issue also affects `production`; however, Martin Beránek checked that it didn't manifest there
* 07.4.2022 13:33 - Jaroslav Šmolík [deployed](https://gitlab.ack.ee/Backend/flash-news-boxer/-/commit/0f1b8f791f33a454f4a1bd0a55e3227302ea4029) the fixed version to `production`

## Supporting information:

Slack discussion of the issue: https://ackee.slack.com/archives/CQ0DZNWG5/p1649317020988629