# On-Call Duty Cheatsheet
## Example Procedure
- [ ] PagerDuty messages or calls you
- [ ] Acknowledge the outage on PagerDuty and Discord
- [ ] Go to healthchecks.io
- [ ] Find the issue that triggered the alert
- [ ] Read the log to find out what the issue is and what caused it
- [ ] Use that information to debug and solve the issue
    - [ ] You can connect to the service and debug using Linux commands
    - [ ] You can check the [On-Call Duty Log](/98cb9-l5QUGttT8jxTyiOQ) to see how people solved similar issues in the past
- [ ] If you can't solve the issue (and it doesn't resolve itself after some time)
    - [ ] You can consider just restarting the service (through Proxmox, etc.)
    - [ ] You can explicitly communicate with other members of the Server Cluster division to ask for help
    - [ ] You can explicitly escalate the issue to the secondary on-call engineer (through Discord and PagerDuty)
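Once you are connected to the affected machine, a first look might go as follows. This is a minimal sketch, not an official runbook: the helper names `triage` and `run_if_present` are made up for this example, and the choice of checks (load, disk, memory, failed systemd units, recent journal errors) is an assumption about what is usually worth looking at first. Every command here is read-only.

```shell
# Hypothetical helper: run a command only if it is installed,
# and keep going even if it fails.
run_if_present() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@" || echo "(command failed: $*)"
  else
    echo "(skipped: $1 not installed)"
  fi
}

# Hypothetical helper: read-only first look at a misbehaving host.
triage() {
  echo "== uptime / load =="
  run_if_present uptime
  echo "== disk usage =="
  run_if_present df -h
  echo "== memory =="
  run_if_present free -h
  echo "== failed systemd units =="
  run_if_present systemctl --failed --no-pager
  echo "== recent errors (current boot) =="
  run_if_present journalctl -p err -b --no-pager -n 50
}
```

Paste the functions into your shell on the host and run `triage`. A full disk, exhausted memory, or a failed unit often points directly at the cause reported in the alert log.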
## Tools
- [Server Cluster Status Dashboard](https://status.watonomous.ca/)
    - https://status.watonomous.ca/
- [Proxmox Dashboard](https://wato-nuc.uwaterloo.ca:8006/)
    - https://wato-nuc.uwaterloo.ca:8006/
    - You need to be connected through the [Campus VPN](https://uwaterloo.ca/science-computing/how-tos/campus-vpn-virtual-private-network) or port forwarding
- [Healthchecks](https://healthchecks.io)
    - https://healthchecks.io
    - You can check the logs here to see what triggered the alert
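If you use port forwarding instead of the VPN to reach the Proxmox dashboard, an SSH local forward is one way to do it. The sketch below only builds and prints the command so you can review it before running it. `JUMP_USER`, `JUMP_HOST`, and the function name `proxmox_tunnel_cmd` are placeholders invented for this example; which host you can actually forward through is an assumption you need to replace with a machine you already have SSH access to.

```shell
# Placeholder values (assumptions): replace with your own account
# and a host that can reach wato-nuc.uwaterloo.ca.
JUMP_USER="your_username"
JUMP_HOST="jump-host.example.org"

# Print (do not run) an SSH command that forwards local port 8006
# to the Proxmox dashboard. -N opens the tunnel without a remote shell.
proxmox_tunnel_cmd() {
  echo "ssh -N -L 8006:wato-nuc.uwaterloo.ca:8006 ${JUMP_USER}@${JUMP_HOST}"
}
```

Run the printed command in a separate terminal, then open `https://localhost:8006/` in your browser while the tunnel is up.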
## Common Linux Commands Used
- `tsh`, `ssh`: Teleport and SSH, used to connect to the server
    - e.g. `tsh ssh arjun_krishna@wato3-ubuntu1`
    - [How to set up Teleport](https://hackmd.io/@watonomous/teleport)
- `ip a`, `ethtool`: tools for debugging network connections
- [Linux System Administrator Cheatsheet](https://cheatography.com/wbrandes/cheat-sheets/linux-system-administration/)
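As an illustration of how `ip` and `ethtool` fit together when a machine seems unreachable, here is a small sketch. The function name `net_check` and the default interface `eth0` are assumptions for this example; use the interface name that `ip a` reports on the machine you are debugging.

```shell
# Hypothetical helper: quick read-only look at network state.
# Usage: net_check <interface>   (defaults to eth0, a placeholder)
net_check() {
  iface="${1:-eth0}"
  echo "== addresses =="
  # Brief per-interface state (UP/DOWN) and assigned addresses
  ip -br addr show 2>&1 || echo "(ip failed or not installed)"
  echo "== routes =="
  # Routing table, including the default gateway
  ip route show 2>&1 || echo "(ip failed or not installed)"
  echo "== link settings for ${iface} =="
  # Negotiated speed, duplex, and whether a link is detected
  ethtool "$iface" 2>&1 || echo "(ethtool failed or not installed)"
}
```

For example, `net_check eno1` on a server whose interface is named `eno1`. An interface shown as DOWN, a missing default route, or `Link detected: no` in the `ethtool` output each narrow down where the connection is broken.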
## Resources
- [Issues with On-Call Duty we are trying to improve currently](https://docs.google.com/document/d/1_xXpyOk0WDQCSzX5ztNdxiWYAZyfuHpkqPWJwG2tIS8/edit?usp=sharing)
    - Reading this will give you more context on how to be a great On-Call engineer
- [Server cluster outage simulations](https://hackmd.io/iCYMOwUXQDqP6FsTWr1Rww)
    - These show an outage being solved in real time
- [Garage Electrical Dependencies](https://docs.google.com/drawings/d/1Le6XuCEPQCeM1vM22u2CyHL1ZKMJm2VNcRG2pDwi8XM/edit?usp=sharing)
    - Displays the electrical dependencies of machines in the garage
## Primary On-Call Engineer Responsibilities
- Acknowledge outages in PagerDuty and Discord within a reasonable amount of time (5-10 minutes)
- Try to resolve the issue
- If you can't resolve the issue on your own, explicitly communicate with other Server Cluster division members and ask for help
    - You can also escalate the outage to the secondary On-Call Engineer
- After the issue is resolved, record it in the [On-Call Duty Log](/98cb9-l5QUGttT8jxTyiOQ)