# Postmortem/Retrospective (adapted from Amazon Web Services)

## Mission and Tenets

Mission: Improve the overall quality of our systems by documenting events to identify root causes and address them through trackable action items.

* The postmortem process should always be seen as an opportunity to learn and improve, not blame or punish.
* We want cumulative organizational data and lesson sharing to limit fallout from similar events in the future, or better still, prevent them entirely.

## The Document Template

* [Summary] (<= 3 paragraphs)
* [Metrics/Graphs] (>=1 graphs/tables illustrating impact of event)
* [User Impact]: (1-2 paragraph summary of user-facing impact/experience during the event)
* [Incident Response] (Detection/Mitigation/Diagnosis/Resolution): Four questions (see below).
* [Timeline]: Explain how incident was managed. Include *event* start and end times, not just team's perception of event.
* [Five Whys][] ([wikipedia][Five Whys wikipedia]): dig down until *root cause* is identified
* [Lessons Learned], [Corrective Actions], [Action Items]: The [Five Whys] yield lessons, and lessons yield actions.

[Summary]: #Summary
[Metrics/Graphs]: #Metrics/Graphs
[User Impact]: #User-Impact
[Incident Response]: #Incident-Response
[Timeline]: #Timeline
[Five Whys]: #Five-Whys
[Lessons Learned]: #Lessons-Learned
[Corrective Actions]: #Corrective-Actions
[Action Items]: #Action-Items
[Five Whys wikipedia]: https://en.wikipedia.org/wiki/Five_whys

# Title Here

## Summary

(<= 3 paragraphs)

## Metrics/Graphs

(>=1 graphs/tables illustrating impact of event)

(embedded mermaid is nice, but links to images are also fine)

```mermaid
gantt
    title Timeline of Issue Arrival
    dateFormat YYYY-MM-DD HH:mm
    (start point): 2021-05-06 11:58, 1h
    fallout event 1 : 2021-05-06 15:08, 1h
    discussion (detection point): 2021-05-06 15:11, 1d
    fallout event 2 : 2021-05-07 16:02, 1h
    fix deployed (mitigation point) : 2021-05-08 10:00, 1h
```

## User Impact

(1-2 paragraph summary of user-facing impact/experience during the event)

## Incident Response

(Note: See terminology [cheat sheet](#Incident-Response-Cheat-Sheet) below the four Q&As)

(Detection/Mitigation/Diagnosis/Resolution: Four questions, see below.)

* Question: How was the event detected (e.g. an alarm? manually?)
    * [ ] Answer here
* Question: How could time to detection be improved? As a thought exercise, how would you have cut the time in half?
    * [ ] Answer here
* Question: How did you reach the point where you knew how to mitigate the impact (here called the "decision point")?
    * [ ] Answer here
* Question: How could time to mitigation be improved? As a thought exercise, how would you have cut the time in half?
    * [ ] Answer here

### Incident Response Cheat Sheet

[Term](https://greatcircle.com/blog/2018/04/24/how-to-improve-your-incident-response-times/) | Definition
-----------|------
Start Time | when your users first started being impacted
Detection Time | when the team first became aware of the impact
Response Time | when a person first started actively working on the problem (not merely acknowledged it)
Mitigation Time | when the problem was resolved from the user’s point of view
Resolution Time | when the incident response is “finished” from the responder’s point of view

## Timeline

(Explain how the incident was managed. Include *event* start and end times, not just the team's perception of the event.)

Timestamp (Time Zone here) | Event
--------------------| -----
2021-05-05 22:09 | event description
2021-05-06 11:48 | event description

## Five Whys

([wikipedia][Five Whys wikipedia]; dig down until *root cause* is identified)

* [ ] Q: Why did the event happen?
    * [ ] A: because of action X
* [ ] Q: Why did action X happen?
    * [ ] A: because of constraint Y
* [ ] Q: Why was Y a constraint?
    * [ ] A: because of root cause Z ==> Lesson Learned.

## Lessons Learned, Corrective Actions, Action Items

The [Five Whys] yield lessons, and lessons yield actions.

* Each chain of [Five Whys] should yield at least one lesson learned. Each lesson learned should yield at least one corrective action. Each corrective action should yield at least one action item.
* Corrective Actions are high-level activities to address the problem, while Action Items are individual tasks that can be assigned due dates.

### Lessons Learned

* [ ] Lesson 1
* [ ] Lesson 2

### Corrective Actions

(link back to lessons)

* [ ] Action 1a
* [ ] Action 1b
* [ ] Action 2a

### Action Items

(link back to actions, and include near-term deadlines)

* [ ] Action Item
* [ ] Action Item
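
When filling in the Incident Response questions, it can help to turn the cheat sheet timestamps into durations. The following is a minimal sketch rather than part of the template: the function name `response_intervals`, the dictionary keys, and the choice to measure both detection and mitigation from the Start Time are assumptions made for illustration, and the sample timestamps are simply the placeholder values from the gantt example.

```python
from datetime import datetime, timedelta

# Matches the "YYYY-MM-DD HH:mm" format used in the Timeline table.
TIME_FORMAT = "%Y-%m-%d %H:%M"


def response_intervals(start, detection, mitigation):
    """Return elapsed times between the cheat-sheet timestamps.

    All names here are hypothetical; adjust them to your own conventions.
    """
    started = datetime.strptime(start, TIME_FORMAT)
    detected = datetime.strptime(detection, TIME_FORMAT)
    mitigated = datetime.strptime(mitigation, TIME_FORMAT)

    # Guard against timestamps entered out of order.
    if not started <= detected <= mitigated:
        raise ValueError("expected start <= detection <= mitigation")

    return {
        # Assumption: both intervals are measured from Start Time.
        "time_to_detection": detected - started,
        "time_to_mitigation": mitigated - started,
        "detection_to_mitigation": mitigated - detected,
    }


if __name__ == "__main__":
    intervals = response_intervals(
        start="2021-05-06 11:58",
        detection="2021-05-06 15:11",
        mitigation="2021-05-08 10:00",
    )
    for name, elapsed in intervals.items():
        print(f"{name}: {elapsed}")
```

Halving each printed value gives a concrete target for the "how would you have cut the time in half?" thought exercise.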