# On-Call Duty Cheatsheet

## Example Procedure

- [ ] PagerDuty messages or calls you
- [ ] Acknowledge the outage on PagerDuty and Discord
- [ ] Go to healthchecks.io
    - [ ] Find the check that triggered the alert
    - [ ] Read the log to find out what the issue is and what caused it
- [ ] Use that information to debug and solve the issue
    - [ ] You can connect to the service and debug using Linux commands
    - [ ] You can check the [On-Call Duty Log](/98cb9-l5QUGttT8jxTyiOQ) to see how people solved similar issues in the past
- [ ] If you can't solve the issue (and it doesn't resolve itself after some time)
    - [ ] You can consider just restarting the service (through Proxmox, etc.)
    - [ ] You can explicitly ask other members of the Server Cluster division for help
    - [ ] You can explicitly escalate the issue to the secondary on-call engineer (through Discord and PagerDuty)

## Tools

- [Server Cluster Status Dashboard](https://status.watonomous.ca/)
    - https://status.watonomous.ca/
- [Proxmox Dashboard](https://wato-nuc.uwaterloo.ca:8006/)
    - https://wato-nuc.uwaterloo.ca:8006/
    - You need to be connected through the [Campus VPN](https://uwaterloo.ca/science-computing/how-tos/campus-vpn-virtual-private-network) or port forwarding
- [Healthchecks](https://healthchecks.io)
    - https://healthchecks.io
    - You can check the logs here to see what triggered the alert

## Common Linux Commands Used

- `tsh`, `ssh`: Teleport and SSH, used to connect to the server
    - e.g. `tsh ssh arjun_krishna@wato3-ubuntu1`
    - [How to set up Teleport](https://hackmd.io/@watonomous/teleport)
- `ip a`, `ethtool`: tools for debugging connections
- [Linux System Administrator Cheatsheet](https://cheatography.com/wbrandes/cheat-sheets/linux-system-administration/)

## Resources

- [Issues with On-Call Duty we are trying to improve currently](https://docs.google.com/document/d/1_xXpyOk0WDQCSzX5ztNdxiWYAZyfuHpkqPWJwG2tIS8/edit?usp=sharing)
    - Reading this will give you more context on how to be a great on-call engineer
- [Server cluster outage simulations](https://hackmd.io/iCYMOwUXQDqP6FsTWr1Rww)
    - These show an outage being solved in real time
- [Garage Electrical Dependencies](https://docs.google.com/drawings/d/1Le6XuCEPQCeM1vM22u2CyHL1ZKMJm2VNcRG2pDwi8XM/edit?usp=sharing)
    - Displays the electrical dependencies of machines in the garage

## Primary On-Call Engineer Responsibilities

- Acknowledge outages in PagerDuty and Discord within a reasonable amount of time (5-10 minutes)
- Try to resolve the issue
- If you can't resolve the issue on your own, explicitly ask other Server Cluster division members for help
    - You can also escalate the outage to the secondary on-call engineer
- After the issue is resolved, record it in the [On-Call Duty log]()
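
For the debugging step of the procedure, a minimal first-pass sketch using the connection tools listed under Common Linux Commands Used (`ip a` and `ethtool`) might look like the following. It assumes you are already logged in to the affected machine (for example via `tsh ssh`). The helper name `run_if_present` and the default interface name `eth0` are placeholders for illustration, not part of the actual cluster setup.

```shell
#!/bin/sh
# First-pass network triage sketch (illustrative only).

# Run a command if it is installed; otherwise print a note and move on.
# `run_if_present` is a hypothetical helper name.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $* =="
        "$@" 2>&1 || echo "(exited with status $?)"
    else
        echo "== $1: not installed, skipping =="
    fi
}

# `eth0` is an assumed default; take the real name from the `ip a` output.
IFACE="${IFACE:-eth0}"

run_if_present ip a               # addresses and link state of all interfaces
run_if_present ethtool "$IFACE"   # link speed, duplex, and "Link detected"
```

If saved as, say, `triage.sh` (a hypothetical file name), it could be run as `IFACE=<interface> sh triage.sh`, and the output pasted into the On-Call Duty Log when recording the outage.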