# Postmortem Analysis
Felix Klock & Wesley Wiser
---
## Background
* Rust 1.52.0 added extra incr-comp validation
* Result: Exposed preexisting issues in `rustc`; ICEs for users.
* Release: Thu May 6th; mitigation deployed on Mon May 10th (disabling incr-comp).
* Since then, the team has been resolving the uncovered issues and is on track to re-enable incremental in 1.54.
---
## Moving forward
* Because this was a high-profile issue affecting many users, we've conducted a postmortem/retrospective/correction of error.
* Felix has written a postmortem document and we've used three of our weekly steering meetings to hold postmortem discussions.
---
* We want to understand *what* and *why*:
    * Explain issue: error itself and user impact
    * Explain why it happened
* So that we can *fix* our processes
    * Prevent similar issues from occurring
    * Install guardrails to reduce severity of future incidents
* A 9-section postmortem doc drives the above efforts
---
## Preventing repeat occurrence
Focus is on process, not people, so we fix processes.
<!-- Often, all involved make reasonable *local* decisions. -->
<!-- One of our findings was that, essentially, each person involved in the chain of events made
reasonable decisions at each step, but because we didn't have good processes in place for this
kind of change, this led to things slipping through the cracks unintentionally. -->
---
## Anti-pattern: finger-pointing
Shame is the path to the dark side.
Shame ⟹ Fear ⟹ Hiding Facts ⟹ Misunderstanding ⟹ Repeat Failures
(Also: Fear ⟹ Risk Aversion ⟹ Stagnation.)
<!-- .slide:
data-background="https://i.redd.it/70su2f1iwtn31.png" data-background-size="40%" data-background-position="top left"
-->
###### image courtesy of u/noteblockiller
---
## Template: Understand Issue
* Summary (<= 3 paragraphs)
* Metrics/Graphs (>=1 graphs/tables illustrating impact of event)
* User Impact: (1-2 paragraph summary of user-facing impact/experience during the event)
---
## Template: Understand Response
* Incident Response; four questions:
    * How was the event detected? <!-- (e.g. an alarm? manual?) -->
    * How could time to detection be halved<!--; e.g.: how could you halve the time-->?
    * How did you reach the point where you knew how to mitigate the impact?
    * How could time to mitigation be halved<!--; e.g.: how could you halve the time-->?
* Timeline: Explain how incident was managed. Include *event* start and end times, not just team's perception of event.
---
## Template: Why event happened
* Five Whys ([wikipedia][Five Whys wikipedia]): dig down until *root cause* is identified.
    * Each answer is meant to spark one or more why questions, until you find the root cause.
    * Search may end up digging into events before the incident began.
[Five Whys wikipedia]: https://en.wikipedia.org/wiki/Five_whys
---
## Template: How to fix
* Lessons Learned, Corrective Actions, Action Items
* The Five Whys yield lessons and lessons yield long- and short-term actions.
----
## Example "Five Why" trajectory
* Why did Beta channel fail as canary for 1.52.0? (⟹ two-fold answer)
* Why didn't Beta channel cause ICEs for CI services, the hypothesized primary user?
* Why are end developers favoring Nightly or Stable over Beta channel?
* Why not incentivize developers to use Beta; e.g. Beta could provide access to features unavailable in Stable?
---
## Outcomes for our fingerprint postmortem
- "5 Whys" turned into 21 Whys + answers
- 6 Lessons Learned
- 9 Corrective Actions
- 11 Action Items
- These are fairly large items so please get in touch if you want to help!
---
## Conclusions
- Overall, we're very happy with the results of the process!
- We've had good discussions about the principles we use to make difficult decisions.
- We have some new strategies to avoid similar issues in the future.
---
## References
* Document Template: https://hackmd.io/esAoSljqQtyJnDi4ecCg2g
* pnkfelix postmortem slide deck (longer): https://hackmd.io/SbubQQIlRGSlrOpKQf73Tg
* 1.52 incremental fingerprint postmortem: https://hackmd.io/DhKzaRUgTVGSmhW8Mj0c8A