# GCP LB HTTP 500 class responses `#55654`
Date: 2021-02-06
## Contributors:
* Jaroslav Šmolík (Ackee BE)
* Martin Beránek (Ackee DevOps)
#### People involved:
* Ondřej Vicha (LS)
* Marek Keřka (LS)
* Filip Horký (LS)
* Jan Vítek (Ackee PM)
* Marek Mouček (Ackee PM)
* Josef Gattermayer (Ackee)
* Tomáš Hejátko (Ackee DevOps)
* Marek Přibáň (Ackee PM)
* Jakub Baierl (Ackee FE)
* Jiří Čermák (Ackee FE)
* Tomáš Sluťák (Lukapo L1 support)
## Summary:
Customer content editors were not able to edit sources in the CMS FE.
## Impact:
The content collectors could not work properly over the weekend, so content from new sources was presumably not up to date.
The incident had no impact on FlashNews clients with regard to application availability,
latency, or similar metrics.
## Root-cause:
A hotfix for an unrelated [issue](https://gitlab.ack.ee/Backend/flash-news-shared-pei/-/commit/2c0b7fc6197701fa73270bcc9d18dbb90398ace1#0ad871e22898bb236d726d0964c5d27ca8560f94_246_245)
introduced incorrect validation of the `subtype` field on the backend. As a result, the CMS frontend could not submit any changes to sources.
All such calls to the REST API returned HTTP 500.
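
The linked commits hold the actual code. The following is only an illustrative sketch of how a field check can become too strict. It assumes, based on the name of the fix branch (`fix-subtype-null-validation`), that the faulty validation rejected a null or missing `subtype`. The function names and the allowed values are hypothetical and do not come from the repository.

```python
from typing import Any, Mapping, Optional

# Hypothetical allowed values; the real ones are not stated in this report.
ALLOWED_SUBTYPES = {"rss", "twitter"}


def validate_subtype_strict(source: Mapping[str, Any]) -> Optional[str]:
    """Too strict: a null or missing subtype is treated as invalid."""
    subtype = source.get("subtype")
    if subtype not in ALLOWED_SUBTYPES:
        return f"invalid subtype: {subtype!r}"
    return None


def validate_subtype_nullable(source: Mapping[str, Any]) -> Optional[str]:
    """Accepts a null or missing subtype, rejects only unknown values."""
    subtype = source.get("subtype")
    if subtype is None:
        return None
    if subtype not in ALLOWED_SUBTYPES:
        return f"invalid subtype: {subtype!r}"
    return None


# A source without a subtype, as the CMS frontend might submit it (assumption).
payload = {"name": "example source", "subtype": None}
strict_error = validate_subtype_strict(payload)
nullable_error = validate_subtype_nullable(payload)
```

With the strict variant every such submission fails, while the nullable variant only rejects values outside the allowed set.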
## Resolution:
The BE team created a [hotfix](https://gitlab.ack.ee/Backend/flash-news-setter/-/commit/596c6082308c9e99623c91c69013ad9585e0fdad)
based on the released [fix](https://gitlab.ack.ee/Backend/flash-news-shared-pei/-/commits/44.4.2)
in the [flash-news-shared-pei](https://gitlab.ack.ee/Backend/flash-news-shared-pei/) repository.
## Detection:
On 6.2.2021 the customer [informed](https://ackee.slack.com/archives/C017B41VBP0/p1612599941006900) us about the issue in
the Slack channel.
## Action items:
* L1 support was instructed to take HTTP 500 error alerts more seriously.
* The alert condition was discussed; the threshold on the intensity of incoming HTTP 500 errors will be lowered.
* Users of the CMS should get a direct contact to support (#livesport-alerts, phone number, ...);
the chain of people informed about this alert grew above 10.
* The backend team will test hotfixes in the future.
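
To illustrate why a lower threshold matters, here is a minimal sketch of a sliding-window alert condition. It is not the Grafana rule used in production; the function name, the window length and both thresholds are hypothetical values chosen for the example.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


def alert_times(
    events: Iterable[Tuple[int, int]], threshold: int, window_s: int
) -> List[int]:
    """Return the timestamps at which the number of 5xx responses seen in
    the trailing window reaches the threshold.

    `events` are (timestamp_in_seconds, http_status) pairs sorted by time.
    """
    recent: Deque[int] = deque()
    fired: List[int] = []
    for ts, status in events:
        if status < 500:
            continue  # only 5xx responses count towards the alert
        recent.append(ts)
        # Drop errors that fell out of the trailing window.
        while recent and recent[0] <= ts - window_s:
            recent.popleft()
        if len(recent) >= threshold:
            fired.append(ts)
    return fired


# A low-intensity error stream: one HTTP 500 per minute for ten minutes.
events = [(t, 500) for t in range(0, 600, 60)]

high = alert_times(events, threshold=10, window_s=300)  # never fires
low = alert_times(events, threshold=5, window_s=300)    # fires from t=240 on
```

In this example a slow but steady stream of errors never reaches the higher threshold, which matches the situation described under "What went wrong", while the lower threshold fires after the fifth error.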
# Lessons Learned
## What went well:
* The people needed to resolve the issue were available within a few minutes of being notified.
* Finding the root-cause in CMS wasn't too complicated.
* Fix was deployed in a matter of minutes.
## What went wrong:
* L1 support did not notice the issue on Friday and did not escalate it.
* There was no Grafana alert on Saturday because the intensity of HTTP 500 errors was not high enough.
* The number of people involved in the alert grew above 10, yet only 2-3 had an impact on fixing the issue.
* The customer did not report the issue to L1 support; the information was passed via 3 people
(Josef Gattermayer, Marek Mouček, Ondřej Vicha) to DevOps first.
* The message about the alert was sent to the Slack channel #fn_ackee_cms instead of #livesport-alerts.
## Where we got lucky:
* Fix was deployed in a matter of minutes (Kudos to Jára).
## Timeline:
* 5.2.21 18:28 - [Noticed](https://ackee.slack.com/archives/C01AWR4EZE0/p1612546102057500) already on Friday,
but not escalated because L1 support misunderstood the support [playbooks](https://gitlab.ack.ee/Infra/livesport-support/)
* 6.2.21 9:25 - [Customer](https://ackee.slack.com/archives/C017B41VBP0/p1612618663014700)
reported the issue as highly important
* 6.2.21 9:30 - Marek Keřka reported the correct number to L1 support, but nobody called, which partially broke the chain of support
* 6.2.21 9:52 - Filip Horký tagged Jan Vítek to inform him and ask for help
* 6.2.21 9:53 - Jan Vítek tagged Marek Mouček, who was only partially available due to issues with internet access
* 6.2.21 10:11 - Marek Mouček wrote to [Slack](https://ackee.slack.com/archives/CQ0DZNWG5/p1612602695069800)
to ask for anyone available
* 6.2.21 10:16 - Marek Mouček sent a direct message to Martin Beránek about the issue, asking whether he could investigate
* 6.2.21 10:19 - Marek Mouček created a group conversation with Martin Beránek and Ondřej Vicha so that Martin Beránek
could report on the progress of the issue; he also stated that he would be only partially available on Slack due to connection issues
* 6.2.21 10:22 - Marek Mouček [created](https://ackee.slack.com/archives/CQ0DZNWG5/p1612603345071800?thread_ts=1612602695.069800&cid=CQ0DZNWG5)
a user for Martin Beránek to check the root-cause
* 6.2.21 10:30 - Martin Beránek found the issue on the frontend, where the backend reported validation problems with the `subtype` field
* 6.2.21 10:39 - Martin Beránek called Jaroslav Šmolík to fix the issue
* 6.2.21 10:45 - Martin Beránek called Jiří Čermák just in case the issue was also frontend related
* 6.2.21 10:54 - Jaroslav Šmolík investigated the issue, created a hotfix and pushed the changes to
[flash-news-shared-pei](https://gitlab.ack.ee/Backend/flash-news-shared-pei/-/commits/feat/55654-fix-subtype-null-validation)
* 6.2.21 11:19 - [Deploy](https://gitlab.ack.ee/Backend/flash-news-setter/-/pipelines/99407) of the fix to production took place
* 6.2.21 11:30 - Customer was informed about the [fix](https://ackee.slack.com/archives/C017B41VBP0/p1612607647011100?thread_ts=1612599941.006900&cid=C017B41VBP0)
## Supporting information:
* Slack discussion of the issue on Saturday at Ackee Slack: https://ackee.slack.com/archives/CQ0DZNWG5/p1612602695069800
* Slack discussion about the issue: https://ackee.slack.com/archives/C017B41VBP0/p1612599941006900
* Logs with HTTP 500 error codes: https://console.cloud.google.com/logs/viewer?hl=cs&project=flash-news-production&minLogLevel=0&expandAll=false&timestamp=2021-02-06T09:22:58.368181000Z&customFacets=&limitCustomFacetWidth=true&dateRangeStart=2021-02-06T08:23:28.210Z&dateRangeEnd=2021-02-06T09:23:28.210Z&interval=PT1H&resource=http_load_balancer%2Fforwarding_rule_name%2Fk8s2-fs-94e0magq-production-setter-b9gl5h83&scrollTimestamp=2021-02-06T09:19:33.038191000Z&advancedFilter=resource.type%3D%22http_load_balancer%22%0Aresource.labels.forwarding_rule_name%3D%22k8s2-fs-94e0magq-production-setter-b9gl5h83%22%0AhttpRequest.status%3E%3D500