# WATcloud Incident Log
###### tags: `infra`
## HOWTO
- As soon as an issue arises, create a page from the [on-call incident template](/Ylm_iG-qTAmlDhI9FvV5Mw)
- Don't be afraid to post scrap notes. Prefer to overpost rather than underpost.
- Record:
    - What service was down?
    - What was the problem? Why was the service down?
    - How was this problem discovered?
    - How was it solved?
    - Who were the on-call engineers?
    - Was the proper escalation procedure followed?
Nits:
- Be precise about time zones. Know the difference between `EST` and `EDT`. Offsets from `UTC` are acceptable too.
- Use the ISO date format, e.g. `2022-01-01` instead of `2022-Jan-1` or `Jan 1, 2022`.
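
As a rough illustration of the two nits above, the following sketch formats an incident time as an ISO 8601 timestamp with an explicit UTC offset. The helper name `incident_timestamp` and the fixed-offset constants are assumptions made for this example, not part of any existing tooling.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets: EST is UTC-05:00, EDT is UTC-04:00.
EST = timezone(timedelta(hours=-5), "EST")
EDT = timezone(timedelta(hours=-4), "EDT")


def incident_timestamp(year, month, day, hour, minute, tz):
    """Return an ISO 8601 timestamp (minute precision) with its UTC offset."""
    local = datetime(year, month, day, hour, minute, tzinfo=tz)
    return local.isoformat(timespec="minutes")


# 1:42 PM EST on 2022-01-16, written unambiguously:
print(incident_timestamp(2022, 1, 16, 13, 42, EST))  # 2022-01-16T13:42-05:00

# The same instant expressed in UTC:
utc = datetime(2022, 1, 16, 13, 42, tzinfo=EST).astimezone(timezone.utc)
print(utc.isoformat(timespec="minutes"))  # 2022-01-16T18:42+00:00
```

Writing the offset explicitly avoids having to remember whether daylight saving time was in effect on the day of the incident.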
## Index
- [2021-12-22 thor-ubuntu1 connectivity issues](/jQ30njACTHeu-AB21V_y_g)
- [2021-12-28 thor-ubuntu1 issues](/dUMJroayR9e23OpO6m3Mbw)
- [2021-12-27 tr-ubuntu1 connectivity issues](/vnFR68VATNaXnjMcuNtg6Q)
- [2022-01-01 bastion, thor-ubuntu1, tr-ubuntu1 issues](/UF7Rn6a0QdiM0YEEK9iYqg)
- [2022-02-13 Teleport, Ceph issues](/kw-_WGv7RoG3VpDkclmYvQ)
## 2022-01-16 1:42 PM EST
- Bastion SSH was down
- The Healthchecks.io log contained this:
```
Command failed with status 255. Sleeping for 10s and retrying (1 left)
Executing ['ssh' '-o' 'StrictHostKeyChecking=no' '-i' '/home/watonomous/.ssh/connection_test' 'connection_test@bastion.watonomous.ca']
Pseudo-terminal will not be allocated because stdin is not a terminal.
```
- The problem was discovered through the Healthchecks.io log, which showed that a test SSH connection to bastion had failed
- I let the test SSH connection retry and it passed on the next attempt, so this can be explained as flakiness in bastion
- On-call engineer: Arjun Krishna
- Proper escalation procedure followed: Yes
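
The retry behaviour visible in the log above can be sketched roughly as follows. This is not the actual health check script; the function name `run_with_retries`, its parameters, and the message format are assumptions for illustration only.

```python
import subprocess
import time


def run_with_retries(cmd, retries=2, delay_s=10, run=subprocess.run, sleep=time.sleep):
    """Run cmd; on a non-zero exit status, retry up to `retries` more times.

    `run` and `sleep` are injectable so the logic can be exercised without
    opening a real SSH connection. Returns True as soon as one attempt succeeds.
    """
    for attempts_left in range(retries, -1, -1):
        # stdin is detached, as it would be in a non-interactive check.
        result = run(cmd, stdin=subprocess.DEVNULL)
        if result.returncode == 0:
            return True
        if attempts_left > 0:
            print(f"Command failed with status {result.returncode}. "
                  f"Sleeping for {delay_s}s and retrying ({attempts_left} left)")
            sleep(delay_s)
    return False
```

A single failed attempt followed by a success, as in this incident, makes the check pass overall, which is why a one-off failure only shows up in the log.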
## 2022-01-22 2:10 PM EST
*Problem:* **delta-ubuntu1 suffered `io-error`**
*Symptoms:*
- no internet access (the whole VM appeared to be down)
- Healthchecks.io check failed
- PagerDuty alert triggered
- PVE web UI displayed `Status: io-error`
*Diagnosis:*
The VM uses a volume stored in the `delta-lvm-thin` pool, but the volume (225.49 GiB) was over-provisioned, since the pool only had a size of 214.75 GiB. Writes to the disk image could therefore fail, which produced the I/O error.
This was checked using the `qm config` command.
This likely happened because the configs refer to disk size in gibibytes (GiB or G), while the VM Disks panel of Proxmox reports the size in gigabytes (GB).
To fix this, the VM disk needs to be made smaller than the total pool size of 214 GiB, from the host.
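
To make the GiB versus GB mix-up concrete, here is a small conversion sketch. The helper names `gib_to_gb` and `gb_to_gib` are made up for this example. Note that 200 GiB and roughly 214.75 GB are the same number of bytes; whether that is what happened with the figures in this entry was not verified.

```python
GIB = 2**30   # bytes in one gibibyte (GiB)
GB = 10**9    # bytes in one gigabyte (GB)


def gib_to_gb(gib):
    """Convert a size in GiB to GB."""
    return gib * GIB / GB


def gb_to_gib(gb):
    """Convert a size in GB to GiB."""
    return gb * GB / GIB


print(round(gib_to_gb(200), 2))     # 214.75
print(round(gb_to_gib(214.75), 2))  # 200.0
```

The two units differ by about 7%, so a size read in GB and then entered as GiB can overshoot a pool that looked large enough.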
## 2022-02-08 10:19 AM EST
Problem: Everything went down
TODO: Arthur Chen
https://discord.com/channels/478659303167885314/580890419379044362/940718454825156709
## 2022-02-16 08:25 PM EST
Problem: thor-ubuntu1 was running low on free filesystem space
Symptoms: PagerDuty alert
Diagnosis: we needed to free up some filesystem space
Solution: ran `docker system prune` to free up filesystem space taken up by Docker containers
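
A quick way to see how full a filesystem is before and after a cleanup like this is sketched below. The function names and the 90% threshold are assumptions for illustration, not the alerting rule that actually fired.

```python
import shutil


def percent_used(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


def needs_cleanup(path="/", threshold_pct=90.0):
    """Return True when usage is at or above the (hypothetical) threshold."""
    return percent_used(path) >= threshold_pct


print(f"{percent_used('/'):.1f}% used, cleanup needed: {needs_cleanup('/')}")
```

The percentage can differ slightly from what `df` reports, because `df` accounts for blocks reserved for root.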