Postmortem Analysis
Felix Klock & Wesley Wiser
Background
Rust 1.52.0 added extra incr-comp validation
Result: Exposed preexisting issues in rustc; ICEs for users.
Release: Thu May 6th; mitigation deployed on Mon May 10th (disabling incr-comp).
Since then, the team has been resolving the uncovered issues and is on track to re-enable incremental in 1.54.
Moving forward
Because this was a high-profile issue affecting many users, we've conducted a postmortem/retrospective/correction of error.
Felix has written a postmortem document and we've used three of our weekly steering meetings to hold postmortem discussions.
We want to understand what and why:
Explain issue: error itself and user impact
Explain why it happened
So that we can fix our processes
Prevent similar issues from occurring
Install guardrails to reduce severity of future incidents
A 9-section postmortem doc drives the above efforts
Preventing repeat occurrence
Focus is on process, not people, so we fix processes.
Anti-pattern: finger-pointing
Shame is the path to the dark side.
Shame ⟹ Fear ⟹ Hiding Facts ⟹ Misunderstanding ⟹ Repeat Failures
(Also: Fear ⟹ Risk Aversion ⟹ Stagnation.)
image courtesy of u/noteblockiller
Template: Understand Issue
Summary (<= 3 paragraphs)
Metrics/Graphs (>=1 graphs/tables illustrating impact of event)
User Impact: (1-2 paragraph summary of user-facing impact/experience during the event)
Template: Understand Response
Incident Response; four questions:
How was the event detected?
How could time to detection be halved?
How did you reach the point where you knew how to mitigate the impact?
How could time to mitigation be halved?
Timeline: Explain how the incident was managed. Include event start and end times, not just the team's perception of the event.
Template: Why event happened
Template: How to fix
Lessons Learned, Corrective Actions, Action Items
The Five Whys yield lessons and lessons yield long- and short-term actions.
Example "Five Why" trajectory
Why did the Beta channel fail as a canary for 1.52.0? (⟹ two-fold answer)
Why didn't the Beta channel cause ICEs for CI services, the hypothesized primary user?
Why are end developers favoring Nightly or Stable over Beta channel?
Why not incentivize developers to use Beta; e.g. Beta could provide access to features unavailable in Stable?
Outcomes for our fingerprint postmortem
"5 Whys" turned into 21 Whys + answers
6 Lessons Learned
9 Corrective Actions
11 Action Items
These are fairly large items so please get in touch if you want to help!
Conclusions
Overall, we're very happy with the results of the process!
We've had good discussions about the principles we use to make difficult decisions.
We have some new strategies to avoid similar issues in the future.