Rust itself may be safe
(at least in its idealized form)
… but
process failures happen
(we are all only human, after all!)
A Postmortem (or "Correction of Error") is a structured process to
Why do this?
Primary: prevent future occurrences of similar events.
Secondary: reduce fallout from future failure events.
Assigning blame to people is not a goal.
Goal is to identify processes that can be improved, not people to publicly shame.
Software development is a collective, social activity
In open source, we all have differing degrees of ownership.
So: Don't play the blame game. For example, there's no need to include names in the document.
Assumption: There is always room for improvement.
(A postmortem may conclude that no currently cost-effective change is known. But: 1. this is expected to be a rare outcome, and 2. the exercise of determining that conclusion is still worthwhile.)
Primary: Concrete improvements to development and deployment processes (or plans to make such improvements)
Secondary: Documentation of failure event itself
Even though the document itself is secondary, writing and discussing the document is an amazing tool to drive the team towards the primary output.
The postmortem document
a document, suitable for a general audience, that:
Note: Postmortem document has three potential audiences
Focus is on:
As Rust developers, we often focus on 100% prevention of bugs ahead of time, e.g. via static analysis or tricky type system encodings.
The Postmortem can end up discussing ways to achieve 100% prevention.
But: first focus of the document is on how the teams responded to the incident.
Rust 1.52.0 fingerprint bug
Event started on Thursday May 6th, and mitigation was deployed on Monday May 10th.
Compiler team proposed applying the Postmortem process on June 2nd. We have since had three one-hour steering meetings about the event (on June 25th, July 2nd, and July 9th), all driven by one (evolving) Postmortem document.
Postmortem document itself (aka "Correction of Error", or COE): https://hackmd.io/DhKzaRUgTVGSmhW8Mj0c8A
Incident Response (Detection/Mitigation/Diagnosis/Resolution) and Timeline: drives brainstorming on what to fix
New section, "Leadup": Our release process naturally lends itself to discussion of what happened before the release of a buggy tool, since that is also relevant to prevention.
Five Whys: iterative process to identify root causes, by branching through different causal paths.
Lessons Learned, Corrective Actions, Action Items: Again, this guides collective discussion.
Action Items are meant to have deadlines attached – otherwise, they're not short term!
from fingerprint 1.52.0 postmortem
(see doc for elided answers; point is, each answer directly yields followup "Why" Q's)
Lesson Learned (5/5): We need to be willing and able to communicate more quickly to our users in response to an event like this.
Corrective Action (5a): Define a playbook for how to communicate effectively to our users after a user-impacting event has been discovered.
Corrective Action (5b): Add ability to issue Public Service Announcements (PSAs) from rustup.
Action Item (5a-i, due 2021-08-05): pnkfelix does a survey of the landscape for response playbooks during user-impacting events, especially when it comes to programming languages.
No concrete action item yet for (5b).
Participants from compiler seemed to like the structured process.
However: writing this document was not trivial.
pnkfelix vastly underestimated the effort involved, even after being repeatedly warned that such documents can be multiple-week efforts.
Failures happen… but we don't have to panic! in response
Don't play the blame game
Investing effort in a well-written, acted upon, and widely-shared postmortem can drive positive organizational change and prevent repeat failure events.
https://medium.com/the-cloud-architect/incident-postmortem-template-7b0e0a04f7a8
https://medium.com/@josh_70523/postmortem-correction-of-error-coe-template-db69481da31d