# Proposal for Arringo Incident Response Strategy

Following the recent production issues, it is important that we introduce a proper workflow for responding to future incidents.

Topics to discuss:

- Incident **reporting**
- Incident **identification** and **escalation**
- Incident **resolution**
- Incident **post analysis**

## Incident reporting and classification

When a customer first encounters a problem, Support Agents are the first line of help. Agents need to understand whether the customer has done something wrong, whether the problem depends on the customer's actions or environment, or whether it is a problem in our application.

There is probably already a workflow in place for identifying whether this is an application problem:

- Verify the user is following the correct procedure
- Clear the browser cache, try incognito mode or a different browser, etc.

If Support identifies an application problem, they create a ticket describing the problem the user is facing. It is very important to have as much information as possible in the ticket:

- UserId
- Screenshots or video of the problem
- [Web Session Traffic, HAR file](https://support.google.com/admanager/answer/10358597?hl=en)
- **IMPROVEMENT** Screen recording of the session ([LogRocket](https://logrocket.com/) or similar)

The support agent should also answer the following questions very clearly to classify the problem:

* (Scope) Who the problem is affecting:
  * All the customers and admins (ALL)
  * All the customers (ALL-USERS)
  * All the admins (ALL-ADMINS)
  * Only certain customers or admins (USERS or ADMINS)
  * Only one customer or admin (S-USER, S-ADMIN)
* (Criticality) How bad the error is:
  * The users or admins having the problem are completely **blocked from using the application** (ALLBLOCK)
  * A **feature of the application is not working** but the application is still usable (FBLOCK)
  * A feature of the application misbehaves but is still **usable** (FBEHAV)

From this classification we can derive a tagline to prepend to the description, for example:

ADMINS-FBLOCK: Clicking on user details gives an error

## Incident identification and escalation

When an application problem is reported, the manager can sometimes spot immediately exactly which error needs fixing. Sometimes deeper investigation is needed just to identify the error.

As response time is critical for production applications, it is **vital to be able to reproduce the error as fast as possible**. This calls for:

* Feature and DB parity (possibly with mangled data) on the staging environment
* **IMPROVEMENT** Possibility to copy the affected user to the staging DB
* **IMPROVEMENT** **Team leads (BE and FE)** could have credentialed access to the Production environment. This would speed up problem identification and resolution.

After analysis, the Team lead can classify the problem even further:

- FEBE: Both Frontend and Backend needed for the resolution
- FE: Frontend only
- BE: Backend only
- DEVOPS: DevOps problem

After analysis, an urgency level is assigned (as Jira labels) based on the type of incident, the number of users affected, and whether it is blocking.

The escalation process should be based on:

- Urgency
- Team involved

The earlier example would become:

ADMINS-FBLOCK-FE: Clicking on user details gives an error

## Incident Resolution

A resolution flow should be followed to solve the problem. Most of the time this will happen as a HOT FIX from the main repository branch. It is important to label the code changes as HOT FIX so that the fixed code can later be merged back into all the other branches.

Sometimes a fix is not urgent and can take more time. In that case it can proceed as an epic or task in Jira following the normal user process.

## Incident Post Analysis (Post Mortem)

Production problems can happen, but we should learn from our mistakes and ensure that we gradually **prevent, reduce, and contain most of the errors**.
After the problem is resolved, a quick call and a short summary document on the measures to put in place to prevent the error from happening again can be helpful.
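
The tagline convention from the reporting and escalation sections (scope, criticality, then the team code added after analysis) could be generated from structured fields rather than typed by hand, which keeps ticket titles consistent. The following is only a minimal sketch in Python: the names `Scope`, `Criticality`, `Team`, and `Incident` are hypothetical and do not refer to any existing Arringo tooling or Jira API. Only the code values themselves come from this proposal.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Scope(Enum):
    """Who the problem is affecting (codes taken from this proposal)."""
    ALL = "ALL"
    ALL_USERS = "ALL-USERS"
    ALL_ADMINS = "ALL-ADMINS"
    USERS = "USERS"
    ADMINS = "ADMINS"
    S_USER = "S-USER"
    S_ADMIN = "S-ADMIN"


class Criticality(Enum):
    """How bad the error is (codes taken from this proposal)."""
    ALLBLOCK = "ALLBLOCK"
    FBLOCK = "FBLOCK"
    FBEHAV = "FBEHAV"


class Team(Enum):
    """Team needed for the resolution, set by the team lead after analysis."""
    FEBE = "FEBE"
    FE = "FE"
    BE = "BE"
    DEVOPS = "DEVOPS"


@dataclass(frozen=True)
class Incident:
    """Hypothetical container for the classification fields of a ticket."""
    scope: Scope
    criticality: Criticality
    description: str
    team: Optional[Team] = None  # unknown until the team lead has analysed it

    def tagline(self) -> str:
        """Build '<SCOPE>-<CRITICALITY>[-<TEAM>]: <description>'."""
        parts = [self.scope.value, self.criticality.value]
        if self.team is not None:
            parts.append(self.team.value)
        return f"{'-'.join(parts)}: {self.description}"


# As filed by Support, before the team lead's analysis.
reported = Incident(
    scope=Scope.ADMINS,
    criticality=Criticality.FBLOCK,
    description="Clicking on user details gives an error",
)
print(reported.tagline())

# After analysis, the team lead adds the team code.
escalated = Incident(
    scope=reported.scope,
    criticality=reported.criticality,
    description=reported.description,
    team=Team.FE,
)
print(escalated.tagline())
```

Using enumerations means a mistyped code fails as soon as the ticket is created instead of producing a tagline that nobody can filter on later.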