# WATcloud Incident Log

###### tags: `infra`

## HOWTO

- As soon as an issue arises, create a page from the [on-call incident template](/Ylm_iG-qTAmlDhI9FvV5Mw)
- Don't be afraid to post scrap notes. Prefer to overpost rather than underpost.
- Record:
    - What service was down?
    - What was the problem? Why was the service down?
    - How was this problem discovered?
    - How was it solved?
    - Who were the on-call engineers?
    - Was the proper escalation procedure followed?

Nits:

- Try to be precise about time zones. Know the difference between `EST` and `EDT`. Using offsets from `UTC` is acceptable too.
- Try using ISO date format, e.g. `2022-01-01` instead of `2022-Jan-1` or `Jan 1, 2022`.

## Index

- [2021-12-22 thor-ubuntu1 connectivity issues](/jQ30njACTHeu-AB21V_y_g)
- [2021-12-28 thor-ubuntu1 issues](/dUMJroayR9e23OpO6m3Mbw)
- [2021-12-27 tr ubuntu1 connectivity issues](/vnFR68VATNaXnjMcuNtg6Q)
- [2022-01-01 bastion, thor-ubuntu1, tr-ubuntu1 issues](/UF7Rn6a0QdiM0YEEK9iYqg)
- [2022-02-13 Teleport, Ceph issues](/kw-_WGv7RoG3VpDkclmYvQ)

## 2022-01-16 1:42 PM EST

- Bastion SSH was down
- The Healthchecks.io log contained this:
    ```
    Command failed with status 255. Sleeping for 10s and retrying (1 left)
    Executing ['ssh' '-o' 'StrictHostKeyChecking=no' '-i' '/home/watonomous/.ssh/connection_test' 'connection_test@bastion.watonomous.ca']
    Pseudo-terminal will not be allocated because stdin is not a terminal.
    ```
- The problem was that a test SSH connection to bastion failed. This was discovered through the Healthchecks.io log.
- I let it retry the test SSH connection, and it passed the next time. This can be explained as flakiness in bastion.
- On-call engineer: Arjun Krishna
- Proper escalation procedure followed: Yes

## 2022-01-22 2:10 PM EST

*Problem:* **delta-ubuntu1 suffered `io-error`**

*Symptoms:*

- no internet access (the whole VM appeared to be down)
- healthchecks.io failed
- pagerduty triggered
- PVE web UI displayed `Status: io-error`

*Diagnosis:*

The VM uses a volume stored in the `delta-lvm-thin` pool, but the volume (225.49 GiB) was over-provisioned, since the pool only had a size of 214.75 GiB. Writes to the disk image could therefore not complete, which produced the I/O error. This was checked using the `qm config` command.

This likely happened because the configs refer to disk size in gibibytes (GiB or G), while the VM Disks panel of Proxmox reports the size in gigabytes (GB).

To fix this we either need to make the VM size lower than the total pool size of 214 GiB through the host

## 2022-02-08 10:19 AM EST

Problem: Everything went down

TODO: Arthur Chen https://discord.com/channels/478659303167885314/580890419379044362/940718454825156709

## 2022-02-16 08:25 PM EST

Problem: thor-ubuntu1 was running low on filesystem capacity

Symptoms: PagerDuty alert

Diagnosis: we needed to free up some filesystem space

Solution: ran `docker system prune` to free up filesystem capacity taken up by Docker containers
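
Follow-up idea: low disk space like this could be flagged before it pages anyone. Below is a minimal sketch of such a check, not something currently deployed. The function names, the `/` mount point, and the 10% threshold are all assumptions chosen for illustration.

```python
import shutil


def is_low_on_space(total_bytes: int, free_bytes: int,
                    min_free_fraction: float = 0.10) -> bool:
    """Return True when the free share of the filesystem is below the threshold.

    The 10% default is an arbitrary illustrative value, not a WATcloud setting.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes < min_free_fraction


def check_path(path: str = "/", min_free_fraction: float = 0.10) -> bool:
    """Check the filesystem that contains `path` (hypothetical helper)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return is_low_on_space(usage.total, usage.free, min_free_fraction)


if __name__ == "__main__":
    if check_path("/"):
        print("low on space: consider cleaning up, e.g. `docker system prune`")
    else:
        print("disk space ok")
```

Keeping the threshold comparison in a separate pure function makes it easy to test without touching a real filesystem.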