
# 2i2c Triager Process DRAFT PROPOSAL

## Design Goals

Team members take turns being responsible for normal requests from customers.

* Site Reliability Engineers (SREs): Protected from interruptions to the greatest extent possible, and able to work on projects.
* Triager: Focus on interrupt-driven work.

Note: while the below draws extensively on oncall rotation design for engineers who can get paged for production issues, paging is not the goal for the triage role. Some teams combine support/interrupt work into an oncall role; others use a primary/secondary oncall model where the secondary handles pages. Because support is a form of operational load for a team, oncall practices are a useful reference point.

## Terminology

* Rotation cycle - the length of time for each engineer to be rotated through the role
* Shift - a specific time range in a day that an engineer is on the rotation
* Ticket - an entry in the Freshdesk Support Queue

### Roles

*Community representative*

Primary point of contact for a community, who ensures that the interests of the Hub Community are represented in the infrastructure and that the hub serves their needs. They have the authority to speak on behalf of the community and to make decisions about the infrastructure that the community uses.

*Support Stewards*

A **two-person team** of `Support Stewards` works together to triage and respond to all external support requests. Tenure on the support team is **four weeks**. Every **two weeks** (generally at the sprint meeting), a Support Steward cycles off the support team, and a new team member joins the team. The support team rotates through [the “Open Infrastructure Engineering Team” on this page](https://team-compass.2i2c.org/en/latest/reference/team.html).

The primary responsibilities of the Support Stewards are:

- Ensure that we meet `our Support Service Level Objectives`.
- Carry out our support process.
- Act as primary points of contact with `Community Representative`s.
- Trigger an `Incident Response` if need be.

Common alternate terms: **Customer Liaison**, **External Liaison**, or **Customer Support**.

*Hub administrator*

## Background

Current process description

## Current Roles and team structure

Supporting a 2i2c hub is a collaborative process between 2i2c and the community we serve. The `Support Team` is one of the main teams in our `Managed JupyterHub Service Team`. It consists of three main roles: `Support Stewards`, `Community Representatives`, and `Hub Administrators`.

## Proposal

### Roles and Responsibilities

### Ticket workflow

### Triage checklist

### Rotation and Shift Pattern Design

Given the current timezone distribution of 2i2c's single SRE team, shift rotation should be as equitable as possible, preventing a "Night Shift" (Team Lifecycles) phenomenon for some of the team. The below assumes a team of 3 SREs in Americas timezones and 3 SREs in European timezones.

#### Scenario I: 2-person team with 15-day rotations doing triage at 50% of your time

As described, the 50% time allocation often expands in practice. With a team of 6, this means a shift cycle of six weeks if all engineers are available, giving 2 weeks of interrupt time and on average 4 weeks of project-focused time.

#### Scenario II: 1-person team with 15-day rotations doing triage at 100% of your time

With a team of 6, this means a shift cycle of 12 weeks, leaving a gap of 10 weeks between shifts. That frequency risks engineers who are not oncall being out of touch with production processes.

#### Scenario III: 2 1-person teams grouped by TZ with 15-day rotations

The shift cycle would be six weeks, but this requires two sub-teams of 3, so changes to the shift pattern are likely to be more frequent.
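The cycle lengths quoted in the three scenarios follow from simple arithmetic, which the sketch below makes explicit. It is an illustration only: the `RotationPlan` class and its field names are hypothetical, and it assumes that a "15-day" rotation is treated as 2 weeks and that every engineer is available.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class RotationPlan:
    """Hypothetical model of a triage rotation (names are illustrative)."""

    team_size: int       # engineers taking part in the rotation
    on_shift: int        # engineers on triage at the same time
    rotation_weeks: int  # length of one rotation; "15 days" is assumed to be 2 weeks

    @property
    def cycle_weeks(self) -> int:
        """Weeks until every engineer has served one rotation."""
        return ceil(self.team_size / self.on_shift) * self.rotation_weeks

    @property
    def gap_weeks(self) -> int:
        """Weeks an engineer spends off triage between two rotations."""
        return self.cycle_weeks - self.rotation_weeks


scenario_1 = RotationPlan(team_size=6, on_shift=2, rotation_weeks=2)
scenario_2 = RotationPlan(team_size=6, on_shift=1, rotation_weeks=2)
# Scenario III runs one such rotation per timezone sub-team of 3.
scenario_3 = RotationPlan(team_size=3, on_shift=1, rotation_weeks=2)

for name, plan in [("I", scenario_1), ("II", scenario_2), ("III", scenario_3)]:
    print(f"Scenario {name}: cycle {plan.cycle_weeks} weeks, gap {plan.gap_weeks} weeks")
```

Changing `team_size` shows how sensitive each scenario is to absences: with one engineer unavailable, Scenario III's affected sub-team drops to a four-week cycle while the other sub-team is unchanged.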
## Triager Responsibilities

Triager is responsible for:

TODO(pnasrat): consolidate the below taken from slack

- ack the tickets
- classify them
- resolve them if you can (and if the resolution is straightforward, i.e. docs, suggestions, etc.)
- create GH issues for them when we need eng resources to be assigned to the resolution.

1. Watch for new tickets (interrupts)
2. Make judgement calls, and then use some method to assign them to someone (could be yourself)
3. Folks work on the tickets assigned to them on both GitHub and Freshdesk. So the source of truth for 'what you are working on' is now in two places. We can try using Freshdesk's GitHub integration if needed to bring this to one place, but I think it's ok to start without that.

> At the start of each shift, the on-call engineer reads the handoff from the previous shift. The on-call engineer minimizes user impact first, then makes sure the issues are fully addressed. At the end of the shift, the on-call engineer sends a handoff email to the next engineer on-call.

REF

## GitLab Handover process

A handover occurs between shifts. With 2 people in the rotation for timezone coverage, this means a daily handover and an end-of-rotation handover.

The GitLab On-Call Handover describes a tool for engineers on-call (EOC) that will "automatically create a new issue and pre-populate some information such as outgoing/incoming EOC handles, open/closed incidents, resolved alerts, etc", used for the daily handoff between the oncall shifts - [example handover issue](https://gitlab.com/gitlab-com/gl-infra/on-call-handovers/-/issues/3602).
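For a ticket-only rotation, a pre-populated handover note along the same lines could be drafted from a small data structure. The sketch below is a minimal illustration, not GitLab's tool: the `Handover` class, its fields, the handles, and the ticket text are all hypothetical, and it renders plain markdown rather than calling any Freshdesk or GitHub API.

```python
from dataclasses import dataclass, field


def bullets(items, prefix="- "):
    """Render items as markdown bullets, or a placeholder when the list is empty."""
    return [prefix + item for item in items] or ["- (none)"]


@dataclass
class Handover:
    """Hypothetical handover note between two triagers."""

    outgoing: str  # handle of the triager ending the shift
    incoming: str  # handle of the triager starting the shift
    open_tickets: list = field(default_factory=list)
    resolved_tickets: list = field(default_factory=list)
    notes: str = ""

    def to_markdown(self) -> str:
        lines = [
            f"## Triage handover: {self.outgoing} -> {self.incoming}",
            "",
            "### Still open",
            *bullets(self.open_tickets, prefix="- [ ] "),
            "",
            "### Resolved this shift",
            *bullets(self.resolved_tickets),
            "",
            "### Notes",
            self.notes or "(none)",
        ]
        return "\n".join(lines)


# Example data is made up for illustration.
handover = Handover(
    outgoing="triager-a",
    incoming="triager-b",
    open_tickets=["Ticket 101: hub login failing"],
    notes="Waiting on the community representative for ticket 101.",
)
print(handover.to_markdown())
```

The rendered text could be pasted into whatever channel the team picks for handovers, such as an issue or an email.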
In addition, at the end of GitLab's weekly rotation a "summary comment of notable events, incidents, etc" is posted to a mailing list ([sample weekly summary](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/12323/?_gl=1*17pe3zo*_ga*MTY0Nzg0MDIwNC4xNjc2OTg5OTY4*_ga_ENFH3X7M5Y*MTY3Njk5NTQ5OS4yLjEuMTY3Njk5NTg2Mi4wLjAuMA..#note_487947960)).

Lastly, GitLab uses a weekly 30-minute oncall meeting between the oncall team rotating off and the new oncall team rotating on. That is likely unnecessary for a pure ticket/support rotation; if the role extends to being pageable oncall, it might be worth doing at the end of the 15-day rotation.

## Further Reading

* Google SRE
  * [Dealing with Interrupts](https://sre.google/sre-book/dealing-with-interrupts/)
  * [Chapter 11: Being On-Call](https://sre.google/sre-book/being-on-call/)
  * [SRE Workbook: On-Call](https://sre.google/workbook/on-call/)
  * [Team Lifecycles](https://sre.google/workbook/team-lifecycles/)
* GitLab
  * [How Our Production Team Runs The Weekly On-Call Handover](https://medium.com/gitlab-magazine/how-our-production-team-runs-the-weekly-on-call-handover-6b7cae8e68fb)
  * [GitLab On-Call Handover](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/on-call-handover/)
  * [GitLab Handover Examples](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/12323/?_gl=1*vb3roo*_ga*MTY0Nzg0MDIwNC4xNjc2OTg5OTY4*_ga_ENFH3X7M5Y*MTY3Njk4OTk2OC4xLjEuMTY3Njk5MDAxMS4wLjAuMA..#note_487947960)
* Other
  * [Making On-Call Not Suck](https://dev.to/molly/making-on-call-not-suck-490)
  * [No More On-Call Martyrs](https://sysadvent.blogspot.com/2016/12/day-6-no-more-on-call-martyrs.html)
  * [On-Call Handbook: Optimizing Your On-Call Rotations](https://github.com/alicegoldfuss/oncall-handbook/blob/master/optimizing_rotations.md)