# Postmortem/Retrospective
(adapted from Amazon Web Services)
## Mission and Tenets
Mission: Improve the overall quality of our systems by documenting events to identify root causes and address them through trackable action items.
* The postmortem process should always be seen as an opportunity to learn and improve, not blame or punish.
* We want cumulative organizational data and lesson sharing to limit fallout from similar events in the future, or better still, prevent them entirely.
## The Document Template
* [Summary] (<= 3 paragraphs)
* [Metrics/Graphs] (>= 1 graph or table illustrating the impact of the event)
* [User Impact] (1-2 paragraph summary of user-facing impact/experience during the event)
* [Incident Response] (Detection/Mitigation/Diagnosis/Resolution): Four questions (see below).
* [Timeline]: Explain how the incident was managed. Include *event* start and end times, not just the team's perception of the event.
* [Five Whys][] ([Wikipedia][Five Whys wikipedia]): dig down until the *root cause* is identified
* [Lessons Learned], [Corrective Actions], [Action Items]: The [Five Whys] yield lessons, and lessons yield actions.
[Summary]: #Summary
[Metrics/Graphs]: #Metrics/Graphs
[User Impact]: #User-Impact
[Incident Response]: #Incident-Response
[Timeline]: #Timeline
[Five Whys]: #Five-Whys
[Lessons Learned]: #Lessons-Learned
[Corrective Actions]: #Corrective-Actions
[Action Items]: #Action-Items
[Five Whys wikipedia]: https://en.wikipedia.org/wiki/Five_whys
# Title Here
## Summary
(<= 3 paragraphs)
## Metrics/Graphs
(>= 1 graph or table illustrating the impact of the event)
(embedded Mermaid is nice, but links to images are also fine)
```mermaid
gantt
title Timeline of Issue Arrival
dateFormat YYYY-MM-DD HH:mm
issue begins (start point) : 2021-05-06 11:58, 1h
fallout event 1 : 2021-05-06 15:08, 1h
discussion (detection point): 2021-05-06 15:11, 1d
fallout event 2 : 2021-05-07 16:02, 1h
fix deployed (mitigation point) : 2021-05-08 10:00, 1h
```
## User Impact
(1-2 paragraph summary of user-facing impact/experience during the event)
## Incident Response
(Note: See the terminology [cheat sheet](#Incident-Response-Cheat-Sheet) below the four Q&As)
(Detection/Mitigation/Diagnosis/Resolution: Four questions, see below.)
* Question: How was the event detected (e.g. by an alarm or manually)?
* [ ] Answer here
* Question: How could time to detection be improved? As a thought exercise, how would you have cut the time in half?
* [ ] Answer here
* Question: How did you reach the point where you knew how to mitigate the impact (here called the "decision point")?
* [ ] Answer here
* Question: How could time to mitigation be improved? As a thought exercise, how would you have cut the time in half?
* [ ] Answer here
### Incident Response Cheat Sheet
[Term](https://greatcircle.com/blog/2018/04/24/how-to-improve-your-incident-response-times/) | Definition
-----------|------
Start Time | when your users first started being impacted
Detection Time | when the team first became aware of the impact
Response Time | when a person first started actively working on the problem (not merely acknowledged it)
Mitigation Time | when the problem was resolved from the user’s point of view
Resolution Time | when the incident response is “finished” from the responder’s point of view
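
The times in the cheat sheet can be turned into durations for the four questions above. The following is a minimal sketch, not part of the original template: the function name `incident_durations` and the choice to measure every milestone from Start Time are assumptions, and the timestamps are only the illustrative ones from the sample Gantt chart.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"  # same shape as the timestamps used in this template


def incident_durations(times: dict) -> dict:
    """Return the elapsed time from Start Time to every later milestone.

    `times` maps a milestone name ("start", "detection", ...) to a timestamp
    string. Measuring from Start Time is an assumption of this sketch.
    """
    parsed = {name: datetime.strptime(value, FMT) for name, value in times.items()}
    start = parsed["start"]
    return {
        f"time_to_{name}": moment - start
        for name, moment in parsed.items()
        if name != "start"
    }


# Illustrative values only, copied from the sample Gantt chart.
example = {
    "start": "2021-05-06 11:58",
    "detection": "2021-05-06 15:11",
    "mitigation": "2021-05-08 10:00",
}
durations = incident_durations(example)
for name, delta in durations.items():
    print(f"{name}: {delta}")
```

Halving one of these durations gives a concrete target for the "cut the time in half" thought exercise.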
## Timeline
(Explain how the incident was managed. Include *event* start and end times, not just the team's perception of the event.)
Timestamp (Time Zone here) | Event
--------------------| -----
2021-05-05 22:09 | event description
2021-05-06 11:48 | event description
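
If timeline entries are collected from several people, it may help to sort them before pasting them into the table. The sketch below is one possible way to do that; the helper `timeline_table`, the `UTC` label, and the event descriptions are hypothetical and not part of the original template.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"  # timestamp format used in the table above


def timeline_table(events, time_zone):
    """Render (timestamp, description) pairs as a chronologically sorted table."""
    rows = sorted(events, key=lambda event: datetime.strptime(event[0], FMT))
    lines = [f"Timestamp ({time_zone}) | Event", "--------------------| -----"]
    lines += [f"{stamp} | {text}" for stamp, text in rows]
    return "\n".join(lines)


# Entries are deliberately out of order to show the sorting.
table = timeline_table(
    [
        ("2021-05-06 11:48", "second event description"),
        ("2021-05-05 22:09", "first event description"),
    ],
    "UTC",
)
print(table)
```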
## Five Whys
([Wikipedia][Five Whys wikipedia]; dig down until the *root cause* is identified)
* [ ] Q: Why did the event happen?
* [ ] A: Because of action X.
* [ ] Q: Why did action X happen?
* [ ] A: Because of constraint Y.
* [ ] Q: Why was Y a constraint?
* [ ] A: Because of root cause Z ==> Lesson Learned.
## Lessons Learned, Corrective Actions, Action Items
The [Five Whys] yield lessons, and lessons yield actions.
* Each chain of [Five Whys] should yield at least one lesson learned. Each lesson learned should yield at least one corrective action. Each corrective action should yield at least one action item.
* Corrective Actions are high-level activities to address the problem, while Action Items are individual tasks that can be assigned due dates.
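
The "at least one" rule above can be checked mechanically once the lessons, corrective actions, and action items are written down. This is a rough sketch under assumed conventions: the nested-dictionary layout and the function name `find_gaps` are hypothetical, and the entries reuse the placeholder names from this template.

```python
def find_gaps(lessons):
    """Report lessons with no corrective action and actions with no action item.

    `lessons` is assumed to look like:
    {lesson: {corrective_action: [action_item, ...]}}
    """
    gaps = []
    for lesson, actions in lessons.items():
        if not actions:
            gaps.append(f"lesson without corrective action: {lesson}")
        for action, items in actions.items():
            if not items:
                gaps.append(f"corrective action without action item: {action}")
    return gaps


# Placeholder entries: "Action 1b" and "Lesson 2" are left incomplete on purpose.
example = {
    "Lesson 1": {"Action 1a": ["Action Item"], "Action 1b": []},
    "Lesson 2": {},
}
gaps = find_gaps(example)
for gap in gaps:
    print(gap)
```

An empty result means every lesson leads to at least one corrective action and every corrective action to at least one action item.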
### Lessons Learned
* [ ] Lesson 1
* [ ] Lesson 2
### Corrective Actions
(link back to lessons)
* [ ] Action 1a
* [ ] Action 1b
* [ ] Action 2a
### Action Items
(link back to actions, and include near-term deadlines)
* [ ] Action Item
* [ ] Action Item