Postmortem/Retrospective

(adapted from Amazon Web Services)

Mission and Tenets

Mission: Improve the overall quality of our systems by documenting events to identify root causes and address them through trackable action items.

  • The postmortem process should always be seen as an opportunity to learn and improve, not to blame or punish.
  • We want cumulative organizational data and lesson sharing to limit fallout from similar events in the future, or better still, prevent them entirely.

The Document Template

Title Here

Summary

(<= 3 paragraphs)

Metrics/Graphs

(>=1 graphs/tables illustrating impact of event)

(embedded Mermaid is nice, but links to images are also fine)

gantt
    title Timeline of Issue Arrival
    dateFormat YYYY-MM-DD HH:mm
    (start point) : 2021-05-06 11:58, 1h
    fallout event 1 : 2021-05-06 15:08, 1h
    discussion (detection point) : 2021-05-06 15:11, 1d
    fallout event 2 : 2021-05-07 16:02, 1h
    fix deployed (mitigation point) : 2021-05-08 10:00, 1h
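
If the event list already exists elsewhere (for example in the Timeline section), the chart source could be generated rather than typed by hand. The following is only a minimal sketch: `render_gantt` and the `(label, start, duration)` tuple layout are hypothetical names chosen for illustration, and the output mirrors the sample chart's `title`, `dateFormat`, and task lines.

```python
def render_gantt(title, events):
    """Render (label, start, duration) tuples as Mermaid gantt source text.

    Hypothetical helper: labels must not contain ':' because Mermaid uses
    the colon to separate a task name from its start time and duration.
    """
    lines = ["gantt", f"    title {title}", "    dateFormat YYYY-MM-DD HH:mm"]
    for label, start, duration in events:
        if ":" in label:
            raise ValueError(f"label must not contain ':': {label!r}")
        lines.append(f"    {label} : {start}, {duration}")
    return "\n".join(lines)


# Sample rows taken from the chart in this template.
events = [
    ("fallout event 1", "2021-05-06 15:08", "1h"),
    ("fix deployed (mitigation point)", "2021-05-08 10:00", "1h"),
]
chart = render_gantt("Timeline of Issue Arrival", events)
print(chart)
```

The printed text can be pasted into the document wherever Mermaid is rendered.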

User Impact

(1-2 paragraph summary of user-facing impact/experience during the event)

Incident Response

(Note: See the terminology cheat sheet below the four Q&As.)

(Detection/Mitigation/Diagnosis/Resolution: Four questions, see below.)

  • Question: How was the event detected (e.g. by an alarm, or manually)?

    • Answer here
  • Question: How could time to detection be improved? As a thought exercise, how would you have cut the time in half?

    • Answer here
  • Question: How did you reach the point where you knew how to mitigate the impact (here called the "decision point")?

    • Answer here
  • Question: How could time to mitigation be improved? As a thought exercise, how would you have cut the time in half?

    • Answer here

Incident Response Cheat Sheet

Term       | Definition
Start      | Time when your users first started being impacted
Detection  | Time when the team came to know there was impact
Response   | Time when a person first started actively working on the problem (not merely acknowledged it)
Mitigation | Time when the problem was resolved from the user’s point of view
Resolution | Time when the incident response is “finished” from the responder’s point of view
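
To answer the "time to detection" and "time to mitigation" questions with numbers, the intervals can be computed directly from the cheat-sheet timestamps. This is a minimal sketch that assumes both intervals are measured from Start (the template does not say which reference point to use); `response_intervals` is a hypothetical helper, and the timestamps are the sample values from the chart in this template.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"  # same layout as the timestamps used in this template


def response_intervals(start, detection, mitigation):
    """Return (time to detection, time to mitigation).

    Assumption: both intervals are measured from the Start time.
    """
    t_start = datetime.strptime(start, FMT)
    t_detect = datetime.strptime(detection, FMT)
    t_mitigate = datetime.strptime(mitigation, FMT)
    if t_detect < t_start or t_mitigate < t_start:
        raise ValueError("detection and mitigation cannot precede the start")
    return t_detect - t_start, t_mitigate - t_start


ttd, ttm = response_intervals(
    "2021-05-06 11:58",  # Start
    "2021-05-06 15:11",  # Detection
    "2021-05-08 10:00",  # Mitigation
)
print(ttd)  # 3:13:00
print(ttm)  # 1 day, 22:02:00
```

Halving either figure gives a concrete target for the "cut the time in half" thought exercise.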

Timeline

(Explain how the incident was managed. Include the actual event start and end times, not just the team's perception of the event.)

Timestamp (Time Zone here) | Event
2021-05-05 22:09           | event description
2021-05-06 11:48           | event description

Five Whys

(See the Wikipedia article on the Five Whys; dig down until the root cause is identified.)

  • Q: Why did the event happen?
    • A: because of action X
  • Q: Why did action X happen?
    • A: because of constraint Y
  • Q: Why was Y a constraint?
    • A: because of root cause Z ==> Lesson Learned.

Lessons Learned, Corrective Actions, Action Items

The Five Whys yield lessons, and lessons yield actions.

  • Each chain of Five Whys should yield at least one lesson learned. Each lesson learned should yield at least one corrective action. Each corrective action should yield at least one action item.
  • Corrective Actions are high-level activities to address the problem, while Action Items are individual tasks that can be assigned due dates.
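
The "at least one" rules above can be checked mechanically before a postmortem is signed off. The sketch below is illustrative only: the `Lesson` and `CorrectiveAction` classes and the `find_gaps` function are hypothetical names, not part of any existing tool, and the sample data reuses the placeholder names from this template.

```python
from dataclasses import dataclass, field


@dataclass
class CorrectiveAction:
    name: str
    action_items: list[str] = field(default_factory=list)


@dataclass
class Lesson:
    name: str
    corrective_actions: list[CorrectiveAction] = field(default_factory=list)


def find_gaps(lessons):
    """List every lesson without a corrective action and every
    corrective action without an action item."""
    gaps = []
    for lesson in lessons:
        if not lesson.corrective_actions:
            gaps.append(f"{lesson.name}: no corrective action")
        for action in lesson.corrective_actions:
            if not action.action_items:
                gaps.append(f"{action.name}: no action item")
    return gaps


lessons = [
    Lesson("Lesson 1", [
        CorrectiveAction("Action 1a", ["Action Item A"]),
        CorrectiveAction("Action 1b"),  # deliberately left without an item
    ]),
    Lesson("Lesson 2"),  # deliberately left without a corrective action
]
gaps = find_gaps(lessons)
print(gaps)
# ['Action 1b: no action item', 'Lesson 2: no corrective action']
```

An empty list means every lesson traces through to at least one assignable task.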

Lessons Learned

  • Lesson 1
  • Lesson 2

Corrective Actions

(link back to lessons)

  • Action 1a
  • Action 1b
  • Action 2a

Action Items

(link back to actions, and include near-term deadlines)

  • Action Item
  • Action Item