# GCP LB HTTP 500 class responses `#61521`
Date: 2021-07-22
## Contributors:
* Tomáš Hejátko (Ackee DevOps)
* Martin Beránek (Ackee DevOps)
## People involved:
* Vojtěch Bruk (LS)
* Marek Keřka (LS)
* Lukáš Horák (Ackee FE)
* Patrik Šonský (Ackee PM)
## Summary:
Flashsport.com was partially unavailable due to issues with the elasticsearch internal load balancer.
## Impact:
The backend of the application couldn't serve requests directed to elasticsearch, so users couldn't use search.
Other related endpoints (see [logs](https://cloudlogging.app.goo.gl/GgBDRfzK1pm2Vtp68)) also started to return
HTTP 504 responses, and users could not access articles in the application.
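
Such responses can also be pulled from Cloud Logging on the command line. A minimal sketch, assuming the edge HTTP(S)
load balancer writes its request logs under `resource.type="http_load_balancer"` and that the project name from the GCP
support case referenced under Action items applies (both are assumptions; the saved incident query is linked above):

```shell
# List recent 504 responses logged by the HTTP(S) load balancer.
# Filter and project are illustrative; the saved incident query is linked above.
gcloud logging read \
  'resource.type="http_load_balancer" AND httpRequest.status=504' \
  --project=flash-news-production \
  --freshness=2h \
  --limit=20
```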
## Root-cause:
Health checks monitoring the health of the elasticsearch cluster nodes started to report the nodes as unhealthy. Based on
that, all instances were removed from the internal TCP load balancer backend, and the load balancer started to time out
all requests.
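
For future triage, the health state the load balancer acts on can be inspected directly from the CLI. A minimal sketch,
assuming a hypothetical backend service `elasticsearch-ilb` and health check `elasticsearch-hc` (the real names, region
and project live in the production Terraform configuration):

```shell
# Per-instance health as reported by the internal backend service.
# Names, region and project below are placeholders, not production values.
gcloud compute backend-services get-health elasticsearch-ilb \
  --region=europe-west3 \
  --project=flash-news-production

# Configuration of the health check itself (port, interval, thresholds).
# Add --region=... if the health check is a regional resource.
gcloud compute health-checks describe elasticsearch-hc \
  --project=flash-news-production
```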
## Resolution
Redeploying the whole network setup (internal load balancer, health checks) fixed the issue.
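
The Lessons Learned below imply the redeploy went through Terraform; one way to force recreation of just the affected
resources is sketched here, assuming Terraform >= 0.15.2 and hypothetical resource addresses:

```shell
# Force recreation of the health check and the internal load balancer
# backend service on the next apply. The resource addresses are
# illustrative; the real ones live in the production Terraform code.
terraform apply \
  -replace="module.elasticsearch.google_compute_health_check.this" \
  -replace="module.elasticsearch.google_compute_region_backend_service.this"

# On Terraform older than 0.15.2, `terraform taint <address>` followed by a
# plain `terraform apply` achieves the same effect.
```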
## Detection
The issue was reported in the #ls-alerts channel by [OpsGenie](https://ackee.slack.com/archives/C01V8TMSTMZ/p1626937557097200).
## Action items:
* The customer & developers are to be informed to report issues only to the #ls-alerts channel
* GCP support is to be notified about the [issue](https://console.cloud.google.com/support/cases/detail/28547460?project=flash-news-production)
* There was no notification from the Blue Medora elasticsearch monitoring; [Tomáš Hejátko](https://redmine.ack.ee/issues/61532)
was asked to inspect the issue
* The SRE [communication pattern](https://sre.google/workbook/incident-response/) (IC, OL, CL) should be adopted
in larger incidents to avoid communication overhead on the on-call engineer fixing the issue
# Lessons Learned
## What went well:
* The issue was identical to the one which happened on Flashnews on 21.7.21; the fix was also identical
## What went wrong:
* Multiple parties were informed about the alert and notified the on-call engineer -- instead of fixing the issue, the majority
of the work was communication, so detection of the root cause took longer than expected
* The Terraform state of `flash-sport-infrastruktura-production` contains 539 resources, which slows down `terraform refresh` and
consequently the `terraform apply` command (a possible CLI workaround is sketched after this list)
* The on-call engineer got misled by resources which had been changed outside of Terraform:
```hcl
Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform since the last "terraform apply":

  # module.cloudflare.cloudflare_custom_pages.flashsport_app_http500 has been deleted
  - resource "cloudflare_custom_pages" "flashsport_app_http500" {
      - id      = "a05cff376bf17d1977a6353a6a0834ab" -> null
      - state   = "customized" -> null
      - type    = "500_errors" -> null
      - url     = "https://maintenance.flashsport.app/index.html" -> null
      - zone_id = "811e48d7459fb60837636e44e0543426" -> null
    }
```
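
Both Terraform pain points above (the slow full refresh and drift from changes made outside of Terraform) can be worked
around on the CLI. A sketch, with a hypothetical module address, of what an emergency apply and the post-incident state
reconciliation could look like:

```shell
# Emergency apply: skip the full state refresh (the plan may then be based
# on stale state) and narrow the scope to the affected module.
# The module address is a placeholder.
terraform plan -refresh=false -target=module.elasticsearch_network -out=tfplan
terraform apply tfplan

# After the incident: fold changes made outside of Terraform (such as the
# deleted cloudflare_custom_pages resource above) back into the state
# without touching any real infrastructure.
terraform apply -refresh-only
```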
# Timeline
* 22.7.21 8:54 - The [customer](https://ackee.slack.com/archives/C01FSPZPFC6/p1626936862124700) noticed weird behavior and
reported it in a **private channel** to which the on-call engineer did not have access
* 22.7.21 9:00 - [OpsGenie](https://ackee.slack.com/archives/C01V8TMSTMZ/p1626937217096300) detected higher latencies
than expected and reported them in the #ls-alerts channel
* 22.7.21 9:05 - [OpsGenie](https://ackee.slack.com/archives/C01V8TMSTMZ/p1626937557097200) detected the issue and reported
it in the #ls-alerts channel
* 22.7.21 9:06 - Tomáš Hejátko reported the issue to Martin Beránek in a private message
* 22.7.21 9:07 - [Lukáš Horák](https://ackee.slack.com/archives/CKD4SUW8N/p1626937679093100) reported
the issue in the #flash-sport-api channel
* 22.7.21 9:09 - The on-call engineer informed all the channels that the issue had been noticed:
* [x_flashsport_product](https://ackee.slack.com/archives/C01FSPZPFC6/p1626937775128100)
* [flash-sport-api](https://ackee.slack.com/archives/CKD4SUW8N/p1626937724093200?thread_ts=1626937679.093100&cid=CKD4SUW8N)
* 22.7.21 9:12 - [Marek Keřka](https://ackee.slack.com/archives/C01V8TMSTMZ/p1626937938097700?thread_ts=1626937217.096300&cid=C01V8TMSTMZ)
was notified about the issue
* 22.7.21 9:17 - After analysis, the on-call engineer discovered that the issue was [the same](https://ackee.slack.com/archives/C017B41VBP0/p1626855776000600)
as the one which happened on 21.7.21; the diagnosis was confirmed with Tomáš Hejátko at 9:19
* 22.7.21 9:20 - The load balancer and [health checks](https://cloudlogging.app.goo.gl/GzxPirq7D1onK1AZ6) were redeployed
* 22.7.21 9:21 - The [customer](https://ackee.slack.com/archives/C01FSPZPFC6/p1626938476135400) was informed that the fix was on the way
* 22.7.21 9:23 - Patrik Šonský asked the on-call engineer in a private message to report on the issue and was asked to wait until the issue was fixed
* 22.7.21 9:40 - The [customer](https://ackee.slack.com/archives/C01FSPZPFC6/p1626939606140800) was informed that the issue had been fixed
* 22.7.21 9:45 - The [alert](https://ackee.slack.com/archives/C01V8TMSTMZ/p1626939941097900) was closed by the monitoring
## Supporting information:
Slack notifications:
* Flashsport channel https://ackee.slack.com/archives/CKD4SUW8N/p1626937679093100
* Customer private slack channel https://ackee.slack.com/archives/C01FSPZPFC6/p1626936862124700
* ls-alerts channel https://ackee.slack.com/archives/C01V8TMSTMZ/p1626937217096300