(adapted from Amazon Web Services)
Mission: Improve the overall quality of our systems by documenting events to identify root causes and address them through trackable action items.
This document deviates slightly from the above template: It provides a Lead-Up section, separate from the Timeline. The focus of Lead-Up is to provide a chronological explanation of why the error happened (in other documents of this type, one can derive this from a well-written "Five Whys"; in this case, we have crisp chronological narrative, so I included it separately).
On Thursday 2021-05-06 11:48 EST, Rust 1.52.0-stable was released.
At 2021-05-06 15:08 EST, a user reported that the rust compiler was signaling an internal compiler error (ICE) with the message "found unstable fingerprints." 3 minutes later, at 2021-05-06 15:11 EST, a team member investigated, and determined that the fingerprint check (an internal consistency check known to be in need of mitigation, and expected to land in the 1.53 release on 2021-06-17) was in fact part of the 1.52.0-stable release.
Two days of discussion followed, largely debating whether to either (1.) leave the compiler's operational behavior mostly unchanged, but give users better diagnostic guidance as to what to do in response to a fingerprint ICE, or (2.) disable incremental-compilation entirely, which sidesteps the fingerprint checking and thus the stops the ICE from occurring, at the cost of making compile-times slower for users.
On Friday at 2021-05-07 16:25 EST, text explaining the problem and providing advice to users on how to address it locally was pinned to the github issue tracker. On Saturday at 2021-05-08 14:54 EST, a plan was approved by leads of the release and compiler teams to soft-disable incremental compilation on the stable channel. On Monday 2021-05-10, Rust issued a point release, 1.52.1, along with a blog post explaining the event.
This picture shows how 20 issues arrived after the 1.52.0 release on Thursday, in parallel with team discussion about what to do. That discussion time ends when a decision was released, and the decision time block ends at the 1.52.1 release on the following Monday.
It is interesting to note that the rate of incoming issues did not really change in response to the github pinned issue describing the problem. However, that same github pinned issue explicitly said that we still wanted bug reports about occurrences of the ICE, and thus we should not make inferences about the effectiveness of the pinned issue based on that observation.
The fingerprint ICEs all arise from inconsistencies when comparing the incremental-compilation cache stored on disk against the values computed during a current rustc invocation, which means they all arise from using incremental compilation.
This has the following implications:
The four likely ways that a user could have incremental compilation turned on (and thus hit the fingerprint ICE) were identified in the 1.52.1 blog post. We expect the most common one was users who were building with with the dev
or test
profiles.
20 distinct issues reporting the fingerprint ICE were filed against the 1.52.0 release. There were at least two tweets (1, 2) mentioning ICEs on 1.52.0.
We do not have telemetry from rustc, so we do not have a good way to measure "true" impact of event.
[FIXME: Anyone else have ideas on other data to include here?]
(Note: See terminology cheat sheet below the four Q&A's)
Note: The time to detection, when viewed as the start time (1.52.0 release) to the start of discussion, was 3h 11min.
Question: How could time to detection be improved? As a thought exercise, how would you have cut the time in half?
Question: How did you reach the point where you knew how to mitigate the impact (here called the "decision point")?
Note: The time to mitigation, when viewed as period from the event start time (1.52.0 release, Thursday 2021-05-06) to the mitigation time (1.52.1 release, Monday 2021-05-10), was 4 days.
Question: How could time to mitigation be improved? As a thought exercise, how would you have cut the time in half?
Thought experiment (for future reference): In order to gain trust with our users, should we have acted more quickly to disable incremental compilation with no environment variable opt-in, and get that hypothetical patch release out on Friday?
pnkfelix thinks we should not be satisfied. But pnkfelix is also not sure on how best to address it; further discussed in Five Whys.
Term | Definition |
---|---|
Start Time | when your users first started being impacted |
Detection Time | when the team came to know there was impact. |
Response Time | when a person first started actively working on the problem (not merely acknowledged it) |
Mitigation Time | when the problem was resolved from the user’s point of view |
Resolution Time | when the incident response is “finished” from the responder’s point of view |
The Rust compiler ("rustc") has support for incremental compilation ("incr-comp"), which tracks how parts of the input source code influence the final build product.
Over three years ago, the rust compiler gained, via PR #45867, experimental code to verify incremental compilation hash (aka "fingerprint") of a query result. At that time, fingerprint mismatches were believed to, at most, be correlated with unnecessary work from incremental compilation. We did not think fingerprint mismatches could cause unsoundness via miscompilation.
On 2021-03-08, a contributor discovered an example of an incremental compilation bug that did yield miscompilation (segmentation faults, out-of-bounds errors, etc) and was correlated with fingerprint mismatches.
In response to this, on 2021-03-13, we turned fingerprint checking on by default, via PR #83007. We knew at that time that this PR would inject some number of internal compiler errors (ICEs).
The fingerprint checking was initially deployed on the Rust nightly channel, following its train release model. Therefore, ICE's started being identified by nightly users, who filed bug reports. Concurrently, changes were made to the compiler to address the problems being filed. However, some of those fixes only landed in nightly after the fingerprint checking had been lifted to the beta channel (following the train model). Thus, there was a mismatch, where the beta compiler would yield ICE's that a nightly user would never see, and yet the beta was the product that was on deck to be promoted to the stable release.
On Wednesday 2021-05-05, a team member noted, by filing Issue #84970, that we needed a plan for addressing the ICE's injected by PR #83007. However, team leadership failed to realize that the ICE's would be part of the 1.52.0 release.
Timestamp (Eastern Time) | Event |
---|---|
2021-05-05 22:09 | Issue #84970 is filed, acknowledging need to mitigate fingerprint ICEs before they slip to stable version. It says slippage is expected to occur in 1.53, due out six weeks later (2021-06-17). |
2021-05-06 11:48 | Rust 1.52.0 is released – start point |
2021-05-06 15:08 | Issue #85003 filed. |
2021-05-06 15:11 | Team member starts investigating – detection point |
2021-05-06 16:02 | Issue #85004 filed. |
2021-05-06 16:55 | Issue #85010 filed. |
2021-05-06 23:28 | Issue #85019 filed. |
2021-05-07 04:43 | Issue #85025 filed |
2021-05-07 06:05 | End user reverts to 1.51 release in order to work around fingerprint errors from rustc |
2021-05-07 07:12 | Issue #85028 filed |
2021-05-07 11:27 | Issue #85039 filed |
2021-05-07 15:34 | blog post explaining issue put up for team review |
2021-05-07 16:25 | Issue #84970 pinned to github issues page, transcribing blog post advice for our users |
2021-05-07 21:40 | Issue #85062 filed |
2021-05-07 22:21 | Issue #85064 filed |
2021-05-08 02:32 | Issue #85294 reported (via comment) |
2021-05-08 04:54 | Issue #85069 filed |
2021-05-08 05:09 | Issue #85070 filed |
2021-05-08 12:01 | Issue #85080 filed |
2021-05-08 14:54 | proposal put forward to soft-disable incr-comp in stable. Compiler team lead adds a
Image Not Showing
Possible Reasons
|
2021-05-08 17:17 | Issue #85091 filed |
2021-05-09 01:33 | PR #85097 lands, soft-disabling incr-comp in stable |
2021-05-09 21:16 | Issue #85129 filed |
2021-05-10 00:56 | Issue #85134 filed |
2021-05-10 04:43 | Issue #85140 filed |
2021-05-10 10:40 | Rust 1.52.1 is released – mitigation point |
2021-05-13 09:28 | Issue #85257 is filed |
2021-05-18 16:50 | Issue #85449 is filed |
2021-05-20 18:23 | Issue #86498 is filed |
From perspective of this document, the release of 1.52.1 on 2021-05-10 is the effective end of the event for our users (apart from detail that more bugs trickled in from users who had not yet upgraded from 1.52.0). The issue itself is not fully resolved until incremental-compilation has been re-enabled for users of stable Rust, currently planned as part of the 1.54.0 release (2021-07-29).
At the time of the 1.52.1 release, the bugs themselves are still in the mainline code base, but the critical point of failure, the fingerprint ICE,
has been prevented, via the heavy hammer of disabling incremental compilation entirely on the stable release channel.
Also, the above timeline only includes issues filed against 1.52-stable;
issues against nightlies do not represent the same kind of fallout that we are
trying to measure here. This in part because we, by design, do not hold the nightly builds of rustc
to the same standards as the stable and beta builds.
Why did the 1.52.0 release start emitting Internal Compiler Errors (ICEs) on valid input source code?
Why did the team enable an internal consistency check that could yield ICEs with an unknown extent of user impact?
Why was Safety First the guiding principle for handling this situation?
Why wasn't the team proactively working to assess the impact of this check and ensure that all instances of it were resolved before it hit beta?
Why wasn't the compiler team broadly aware of the scope of the problem? Aren't you triaging the incoming bugs?
Why are you relying on human beings to notice that kind of correlation? Can't machines do that?
Why didn't the compiler team file an issue to track the associated ICE's as part of the plan that enabled fingerprinting by default
Why didn't you make a finer-grain check that would catch only the unsound cases?
Why does the check yield an Internal Compiler Error rather than a
nice diagnostic error like the errors we see from programmatic mistakes in our
source code?
Why aren't all Internal Compiler Errors using the nice diagnostic machinery?
Why are you not putting proper diagnostics on the ICE's that have been long-lived,
since the Rust project cannot provide a guarantee that all ICE's will be short-lived?
Why did the 1.52.1 release disable incremental compilation entirely (when obscure opt-in flag absent)
rather than instruct users to clean their cached state in response to the ICE?
Why did the 1.52.1 release disable incremental compilation, rather than recovering from the ICE by
automatically deleting the incremental cache and restarting compilation.
Why didn't the beta channel act as a canary about 1.52.0 causing ICE's for end developers?
Why didn't the beta channel act as a canary about 1.52.0 causing ICEs for rustc
bootstrap?
rustc
bootstrap process has had problems with incremental for a long time, long predating this bug. Many developers have internalized the need to x.py clean
after a rebase to clear out incremental cache. These history has led to people ignoring ICEs during bootstrap, even for steps where it represents a real bug.Why are real bugs conflated with rustc
bootstrapping artifacts?
Why didn't the beta channel act as a canary about 1.52.0 causing ICE's for CI services?
Why are end developers favoring nightly or stable over beta channel?
Why don't you entice people to use Beta by providing some sort of carrot (i.e. reward) in return?
For example, beta could provide access to certain features that are not available in Stable?
testing
channelWhy did users continue to file issues about the fingerprint ICE after the 1.52.1-stable release had been put out?
Why does the ICE message suggest filing a bug report against an outdated version of the compiler?
Why did it take until Monday for you to tell your users about the problem?
No, really: Why did it take until Monday for you to tell your users about the problem?
Why don't you have a lighter weight means than big blog posts to communicate important changes in status to your users, such as a banner on the website or a social media account?
Note: pnkfelix is not expecting many of the proposals below to be adopted. These are, however, the best ideas he could come up with for addressing the lessons learned written above.
rustc
), starting with either an MCP or RFC; probably needs a project group dedicated to initial engagement with key stakeholders.x.py
to reduce cases where incr-comp state is incorrectly reused and clarify which incr-comp ICE's represent real issues with the underlying rustc
.Note: In principle, Action Items are meant to be things that can be assigned deadlines. I have not attempted to actually assign any deadlines below, but I do believe they all could be assigned deadlines, if we were to choose to invest time in them. Most of them take form of "Find owner", since our work assignment is largely bottom-up; i.e. it is dependent on volunteers getting interested in doing something, rather than leadership dictating what must happen.
rustc
bootstrap.x.py
should be revised to raise an alarm on ICEs that come from the beta compiler.