Try   HackMD

Course Correction: Rust 1.52 fingerprint bug

Items from 2021-07-09 meeting

  • proposed action: add ability to issue Public Service Announcements (PSAs) from rustup.
  • proposed action: start adding links to pre-existing issues for the diagnostics emitted by "long-lived ICEs"
  • edit: clarify distinction between "We need better visibility (or more advertisement within the teams) of when user-disruptive changes are scheduled to hit." and "We need better tracking of when certain ICE's start becoming more serious problems for our users."
  • edit: why was "We need to stop unfixed ICEs from becoming problems that we never hear about." considered a separate Lesson? Just merge it with neighboring lesson.
  • edit: for Need more testing of dev workflows, add Corrective Action that covers automated testing, e.g. something analogous to mw's incr-comp commit checker.
  • proposal (from Wesley): I would really like to have a dedicated WG/team whatever of beta testers. They might also be a good candidate set of people for opt-in telemetry

Items for 2021-07-02 meeting

  • Review Five Whys and Lessons Learned, get feedback on what other "Whys" are missing, what to elaborate on, and what to cut.
  • Discuss: How can we increase beta engagement.
  • Discuss: Many versions of Rust shipped with these bugs undetected; how do we calculate whether a given ICE "is worth it"? (See also item below, which seems strongly related.)
  • Discuss: What options, besides Telemetry, are there for evaluating "true user impact" for these events? In particular: How can we know, in general, whether we overreacted (especially when our reaction may actualy be the reason we didn't observe a broader impact)?
  • Discuss: What needs to be done to this doc before we publicize it beyond our team? (Also: Where should we publicize it?)
  • If time, review Corrective Actions and Action Items; but those are less important than items above.

Todo Items for Doc Itself

  • clean up introduction before broader dissemination.
  • add appendix with timeline of issues filed about the ICE against nightly/beta before the 1.52.0, to provide perspective on what evidence we did have that there was a problem here, and what the lower-bound on its scope was.
  • expand five whys section with elaboration on how the incr-comp miscompilations are fundamentally more problematic than other miscompilations, because they are so hard to reproduce outside of user's context.

Mission and Tenets

(adapted from Amazon Web Services)

Mission: Improve the overall quality of our systems by documenting events to identify root causes and address them through trackable action items.

  • The Correction of Error (COE) process should always be seen as an opportunity to learn and improve, not blame or punish.
  • We want cumulative organizational data and lesson sharing to limit fallout from similar events in the future, or better still, prevent them entirely.

The COE Template

This document deviates slightly from the above template: It provides a Lead-Up section, separate from the Timeline. The focus of Lead-Up is to provide a chronological explanation of why the error happened (in other documents of this type, one can derive this from a well-written "Five Whys"; in this case, we have crisp chronological narrative, so I included it separately).

The Rust 1.52.0 fingerprint bug

Summary

On Thursday 2021-05-06 11:48 EST, Rust 1.52.0-stable was released.
At 2021-05-06 15:08 EST, a user reported that the rust compiler was signaling an internal compiler error (ICE) with the message "found unstable fingerprints." 3 minutes later, at 2021-05-06 15:11 EST, a team member investigated, and determined that the fingerprint check (an internal consistency check known to be in need of mitigation, and expected to land in the 1.53 release on 2021-06-17) was in fact part of the 1.52.0-stable release.

Two days of discussion followed, largely debating whether to either (1.) leave the compiler's operational behavior mostly unchanged, but give users better diagnostic guidance as to what to do in response to a fingerprint ICE, or (2.) disable incremental-compilation entirely, which sidesteps the fingerprint checking and thus the stops the ICE from occurring, at the cost of making compile-times slower for users.

On Friday at 2021-05-07 16:25 EST, text explaining the problem and providing advice to users on how to address it locally was pinned to the github issue tracker. On Saturday at 2021-05-08 14:54 EST, a plan was approved by leads of the release and compiler teams to soft-disable incremental compilation on the stable channel. On Monday 2021-05-10, Rust issued a point release, 1.52.1, along with a blog post explaining the event.

Metrics/Graphs

This picture shows how 20 issues arrived after the 1.52.0 release on Thursday, in parallel with team discussion about what to do. That discussion time ends when a decision was released, and the decision time block ends at the 1.52.1 release on the following Monday.

2021-05-072021-05-092021-05-112021-05-132021-05-152021-05-172021-05-191.52.0 release (start point)85003 discussion (detection point)85004 85010 85019 85025 85028 85039 github pinned issue85062 85064 85294 85069 85070 85080 decision 85091 85129 85134 85140 1.52.1-release (mitigation point) 85257 85449 86498 Timeline of Issue Arrival

It is interesting to note that the rate of incoming issues did not really change in response to the github pinned issue describing the problem. However, that same github pinned issue explicitly said that we still wanted bug reports about occurrences of the ICE, and thus we should not make inferences about the effectiveness of the pinned issue based on that observation.

User Impact

The fingerprint ICEs all arise from inconsistencies when comparing the incremental-compilation cache stored on disk against the values computed during a current rustc invocation, which means they all arise from using incremental compilation.
This has the following implications:

  1. an initial clean build would never hit the fingerprint ICE,
  2. a build with incremental compilation disabled would never hit the fingerprint ICE thus, a release build with default settings would never hit the fingerprint ICE.

The four likely ways that a user could have incremental compilation turned on (and thus hit the fingerprint ICE) were identified in the 1.52.1 blog post. We expect the most common one was users who were building with with the dev or test profiles.

20 distinct issues reporting the fingerprint ICE were filed against the 1.52.0 release. There were at least two tweets (1, 2) mentioning ICEs on 1.52.0.

We do not have telemetry from rustc, so we do not have a good way to measure "true" impact of event.

[FIXME: Anyone else have ideas on other data to include here?]

Incident Response

(Note: See terminology cheat sheet below the four Q&A's)

  • Question: How was the event detected (e.g. an alarm? manual?)
    • Answer: An issue was filed from a user on the day of the release. A team member noticed issues being filed, and raised the alarm on our shared Zulip chat space.

Note: The time to detection, when viewed as the start time (1.52.0 release) to the start of discussion, was 3h 11min.

  • Question: How could time to detection be improved? As a thought exercise, how would you have cut the time in half?

    • Answer: Given that the response time after the user filed an issue was nearly immediate, the only way to cut time further would be to detect the problem before our users do. However, merely saying "more automated testing" is too vague to be actionable.
    • The main actionable thing I can imagine would be to add tests of representative developer workflows, namely incremental rebuilds in a non-release environment (perhaps such a thing could be added to our crater infrastructure). Notably, our current test suite and CI infrastructure do not really attempt to emulate the typical developer experience in a way that exercises the incremental compiler.
      • Even if such workflows are too expensive to be part of the CI or nightly test suites, it probably makes sense to invest in such as part of preparing the beta for promotion to stable.
    • One other point: rustc itself is bootstrapped atop a beta compiler, so one would think that we should have caught this when 1.52-beta was used as the basis for bootstrapping rustc. Why did we not?
      • Are most rustc developers not opting into incremental=1?
      • Or was this bug actually rare enough that many of us just happened to miss it?
      • (Or worse, how many rustc dev's just clean the cache and keep moving in this scenario, despite knowing that the current beta is the next release?)
        • Maybe x.py is part of the answer here: It could perhaps provide proactive guidance advising one to file a bug when a problem is detected on a run of the beta compiler, as opposed to the locally built compiler under development.
  • Question: How did you reach the point where you knew how to mitigate the impact (here called the "decision point")?

    • Answer: Collective discussion over 2 day period.

Note: The time to mitigation, when viewed as period from the event start time (1.52.0 release, Thursday 2021-05-06) to the mitigation time (1.52.1 release, Monday 2021-05-10), was 4 days.

  • Question: How could time to mitigation be improved? As a thought exercise, how would you have cut the time in half?

    • Time of detection to time of decision (of how to mitigate) was just shy of 48 hours. In other words, it took us two days to decide what to do, and at the end of those two days, it was Saturday, so a release was not going to go out the door until Monday. (A point release is very unlikely to be issued on the weekend under our current development model.)
    • To cut the time in half, we probably would need to decide in the future to spend less time debating and instead acting more quickly to get a point release out the door. However, it was not obvious that a point release was the right answer here, and even if it were, a point release scheduled for Friday would have needed to use a truly trivial patch to achieve that turnaround time.
    • The main realistic options for a Friday patch release would have been either:
      1. disable the fingerprint check (which is counter to our principle that Safety First is paramount), or
      2. disable incremental compilation, with no environment variable to opt back into it.

    Thought experiment (for future reference): In order to gain trust with our users, should we have acted more quickly to disable incremental compilation with no environment variable opt-in, and get that hypothetical patch release out on Friday?

    • See also: https://twitter.com/pnkfelix/status/1397571643405475840
    • Very few people made use of the environment variable opt-in, so maybe we over-indexed on providing that esoteric functionality.
    • (Also from that tweet: 30% of the respondents didn't even know there was an issue with 1.52.0. This may be further evidence that the this ICE didn't affect quite as many as we thought.)

Question: Are we happy/satisfied with how quickly we acted here? Are our users?

pnkfelix thinks we should not be satisfied. But pnkfelix is also not sure on how best to address it; further discussed in Five Whys.

Incident Response Cheat Sheet

Term Definition
Start Time when your users first started being impacted
Detection Time when the team came to know there was impact.
Response Time when a person first started actively working on the problem (not merely acknowledged it)
Mitigation Time when the problem was resolved from the user’s point of view
Resolution Time when the incident response is “finished” from the responder’s point of view

Lead-Up

What actions led to the event itself?

The Rust compiler ("rustc") has support for incremental compilation ("incr-comp"), which tracks how parts of the input source code influence the final build product.

Over three years ago, the rust compiler gained, via PR #45867, experimental code to verify incremental compilation hash (aka "fingerprint") of a query result. At that time, fingerprint mismatches were believed to, at most, be correlated with unnecessary work from incremental compilation. We did not think fingerprint mismatches could cause unsoundness via miscompilation.

On 2021-03-08, a contributor discovered an example of an incremental compilation bug that did yield miscompilation (segmentation faults, out-of-bounds errors, etc) and was correlated with fingerprint mismatches.
In response to this, on 2021-03-13, we turned fingerprint checking on by default, via PR #83007. We knew at that time that this PR would inject some number of internal compiler errors (ICEs).

The fingerprint checking was initially deployed on the Rust nightly channel, following its train release model. Therefore, ICE's started being identified by nightly users, who filed bug reports. Concurrently, changes were made to the compiler to address the problems being filed. However, some of those fixes only landed in nightly after the fingerprint checking had been lifted to the beta channel (following the train model). Thus, there was a mismatch, where the beta compiler would yield ICE's that a nightly user would never see, and yet the beta was the product that was on deck to be promoted to the stable release.

On Wednesday 2021-05-05, a team member noted, by filing Issue #84970, that we needed a plan for addressing the ICE's injected by PR #83007. However, team leadership failed to realize that the ICE's would be part of the 1.52.0 release.

Timeline

Timestamp (Eastern Time) Event
2021-05-05 22:09 Issue #84970 is filed, acknowledging need to mitigate fingerprint ICEs before they slip to stable version. It says slippage is expected to occur in 1.53, due out six weeks later (2021-06-17).
2021-05-06 11:48 Rust 1.52.0 is released start point
2021-05-06 15:08 Issue #85003 filed.
2021-05-06 15:11 Team member starts investigating detection point
2021-05-06 16:02 Issue #85004 filed.
2021-05-06 16:55 Issue #85010 filed.
2021-05-06 23:28 Issue #85019 filed.
2021-05-07 04:43 Issue #85025 filed
2021-05-07 06:05 End user reverts to 1.51 release in order to work around fingerprint errors from rustc
2021-05-07 07:12 Issue #85028 filed
2021-05-07 11:27 Issue #85039 filed
2021-05-07 15:34 blog post explaining issue put up for team review
2021-05-07 16:25 Issue #84970 pinned to github issues page, transcribing blog post advice for our users
2021-05-07 21:40 Issue #85062 filed
2021-05-07 22:21 Issue #85064 filed
2021-05-08 02:32 Issue #85294 reported (via comment)
2021-05-08 04:54 Issue #85069 filed
2021-05-08 05:09 Issue #85070 filed
2021-05-08 12:01 Issue #85080 filed
2021-05-08 14:54 proposal put forward to soft-disable incr-comp in stable. Compiler team lead adds a decision point
2021-05-08 17:17 Issue #85091 filed
2021-05-09 01:33 PR #85097 lands, soft-disabling incr-comp in stable
2021-05-09 21:16 Issue #85129 filed
2021-05-10 00:56 Issue #85134 filed
2021-05-10 04:43 Issue #85140 filed
2021-05-10 10:40 Rust 1.52.1 is released mitigation point
2021-05-13 09:28 Issue #85257 is filed
2021-05-18 16:50 Issue #85449 is filed
2021-05-20 18:23 Issue #86498 is filed

From perspective of this document, the release of 1.52.1 on 2021-05-10 is the effective end of the event for our users (apart from detail that more bugs trickled in from users who had not yet upgraded from 1.52.0). The issue itself is not fully resolved until incremental-compilation has been re-enabled for users of stable Rust, currently planned as part of the 1.54.0 release (2021-07-29).

At the time of the 1.52.1 release, the bugs themselves are still in the mainline code base, but the critical point of failure, the fingerprint ICE,
has been prevented, via the heavy hammer of disabling incremental compilation entirely on the stable release channel.

Also, the above timeline only includes issues filed against 1.52-stable;
issues against nightlies do not represent the same kind of fallout that we are
trying to measure here. This in part because we, by design, do not hold the nightly builds of rustc to the same standards as the stable and beta builds.

Five Whys

  • Why did the 1.52.0 release start emitting Internal Compiler Errors (ICEs) on valid input source code?

    • Because the compiler team turned on an internal consistency check that was triggered by pre-existing bugs.
  • Why did the team enable an internal consistency check that could yield ICEs with an unknown extent of user impact?

    • Because some of the bugs caught by the check could otherwise cause unsoundness in the object code
      generated by the compiler. Such miscompilations can be subtle and hard to diagnose (for both the end user and our own compiler developers). In any case, we value Safety First even if it means exposing a subpar user experience.
    • While we did not (and still do not) know the full extent of user impact, we did have some bounds on its extent: e.g. release builds not opting into incremental compilation are not affected.
  • Why was Safety First the guiding principle for handling this situation?

    • Miscompilations are a very serious issue for us because they undermine the guarantees of safe Rust and potentially expose users to issues beyond their control.
    • It is worth exploring whether we responded with the right level of intensity:
      • Did we respond too aggressively? After all, many versions of stable Rust shipped with this issue and (arguably) minimal impact to users.
      • Did we not respond forcefully enough? Perhaps the opt-in to re-enable a known unsound feature was not the best idea? Perhaps we should have supplied patches for old Rust versions disabling the feature as well?
  • Why wasn't the team proactively working to assess the impact of this check and ensure that all instances of it were resolved before it hit beta?

    • The team leads were aware of the fingerprint check being turned on by default, and advertised it as part of the 2021-03-25 wg-incr-comp checkin.
    • But there was no team-wide tracking of the set of issues that the fingerprint checking caused.
      • The closest thing to a tracking issue was that when a bug report was filed, another contributor might happen to point out its relationship with PR #83007, which made PR #83007 itself an informal tracking issue. Other than that, there was no way to discover these other than a github search for A-incr-comp and I-ICE, or I-ICE and fingerprint.
    • Thus, there was no team-wide discussion of scope of the problem (i.e. the number of ICEs this was causing) until 2021-05-06, with the filing of #84970, and the team as a whole did not recognize the scope of the problem.
  • Why wasn't the compiler team broadly aware of the scope of the problem? Aren't you triaging the incoming bugs?

    • By default, we don't assign a P-critical priority to ICE's: they are not treated as release blockers. They are usually treated as P-high (or even P-medium) bugs.
    • We rely on human beings to 1. notice any uptick in a particular class of ICEs representing a significant user impact, and 2. file a corresponding issue or request corresponding increase in prioritization of existing issues.
  • Why are you relying on human beings to notice that kind of correlation? Can't machines do that?

    • That would be an interesting avenue to explore. Noted in Lessons Learned.
  • Why didn't the compiler team file an issue to track the associated ICE's as part of the plan that enabled fingerprinting by default

  • Why didn't you make a finer-grain check that would catch only the unsound cases?

    • Because we don't know how to distinguish the two cases via the compiler.
  • Why does the check yield an Internal Compiler Error rather than a
    nice diagnostic error like the errors we see from programmatic mistakes in our
    source code?

    • Because the problems caught by the check reflect bugs in the compiler itself,
      not mistakes made in the user's source code.
      We want to keep it clear that the problem here is entirely within the compiler.
    • However, PR 84998 did switch to a structured diagnostic that still included
      the words "internal compiler error". That just wasn't part of output associated
      with the diagnostic from the original ICE.
  • Why aren't all Internal Compiler Errors using the nice diagnostic machinery?

    • Internal Compiler Errors are not meant to be long-lived errors in
      the product; they are meant to have their root causes identified
      quickly and then be subsequently resolved. Therefore, they are
      usually not worth investing the extra development effort
      associated with a proper diagnostic error.
  • Why are you not putting proper diagnostics on the ICE's that have been long-lived,
    since the Rust project cannot provide a guarantee that all ICE's will be short-lived?

    • That sounds like a plausible idea. Added to Corrective Actions below.
    • However, keep in mind the question of whether the fingerprint ICE would have been considered "long-lived" by whatever criteria the question is employing. In other words, proper diagnostics on long-lived ICEs might not have addressed this case in the first place.
    • Also: Arguably we already do this, just in an ad-hoc manner: If users or we ourselves start encountering an ICE that is correlated with some source code error, we strive to replace that experience with a proper diagnostic. We just do not have a formal process in place to drive that work.
    • Also: A long-lived ICE is not necessarily one that is affecting users significantly. Arguably, the real Action needed here, based on this line of thinking (and on our inability a good evaluation of "User Impact") is to invest in Telemetry, feeding from rustc back to the team, so that we could actually learn which ICE's need the most attention. Added to Corrective Actions below.
  • Why did the 1.52.1 release disable incremental compilation entirely (when obscure opt-in flag absent)
    rather than instruct users to clean their cached state in response to the ICE?

    • Because the ICE appeared to be happening with such frequency for some projects that users
      would in practice be cleaning their cached state on every run; i.e. their overall workflow would be at least 2x slower.
    • Also, in some IDEs, rebuilds are happening on the fly, not as the result of an explicit user invocation.
      In that context, a message instructing users to clean their incremental cache before building would
      be unlikely to be addressed by the IDE. The resulting ICEs would deliver a poor user experience for
      the IDE user.
    • Thus, we decided it was better to just disable incremental compilation entirely,
      rather than cause frequent breakage for existing user workflows.
  • Why did the 1.52.1 release disable incremental compilation, rather than recovering from the ICE by
    automatically deleting the incremental cache and restarting compilation.

    • This was considered, but dismissed as too difficult and
      risky to implement in the short-time frame for 1.52.1.
    • (Also: the same reasoning from previous "Why" applies here: for some users, this was occurring on every build, and thus automatic recovery would have the effect of making most builds 2x slower.)
    • As for considering that idea for a longer time frame:
      Following the dictum that "tolerated bugs don't get fixed,"
      we choose to invest effort in fixing the bugs themselves,
      rather than papering over them in a way that provides a subpar user experience.
  • Why didn't the beta channel act as a canary about 1.52.0 causing ICE's for end developers?

    • Because the primary users of the beta channel are not end developers. The two primary users are 1. continuous integration (CI) services and 2. the rustc bootstrap process itself.
      (End developers tend to use either nightly or stable for local development.)
  • Why didn't the beta channel act as a canary about 1.52.0 causing ICEs for rustc bootstrap?

    • The rustc bootstrap process has had problems with incremental for a long time, long predating this bug. Many developers have internalized the need to x.py clean after a rebase to clear out incremental cache. These history has led to people ignoring ICEs during bootstrap, even for steps where it represents a real bug.
  • Why are real bugs conflated with rustc bootstrapping artifacts?

    • Because its not easy to correlate the output from x.py with what ICE's represent real bug.
    • But: this could be addressed. Added as Lesson Learned and Corrective Action.
  • Why didn't the beta channel act as a canary about 1.52.0 causing ICE's for CI services?

    • CI services tend to do from-scratch builds, which inherently have no incremental cached state and
      thus would not hit the fingerprint ICE's.
    • Also, under our current defaults, incremental compilation is only enabled for debugging and testing builds. Release builds don't use incremental compilation unless you opt-into it. So CI services focusing on release builds also will not hit this, unless they enable incremental explicitly.
  • Why are end developers favoring nightly or stable over beta channel?

    • Because using Beta usually doesn't provide significant benefit over the alternatives.
      Nightly gives access to latest and greatest features, and the fastest turnaround time for
      feature request to delivery. Stable provides a stability guarantee: It has been the most tested
      and imposes the highest bar for backports, which are their own source of risk.
      Beta has all the disadvantages of Stable, and some amount of added risk over Stable,
      with no advantages of its own. Someone who wants to try out the shiny new things is more likely
      to just use Nightly (potentially a slightly out-of-date version of Nightly).
  • Why don't you entice people to use Beta by providing some sort of carrot (i.e. reward) in return?
    For example, beta could provide access to certain features that are not available in Stable?

    • Well, beta does do that, in the sense that Beta provides a preview of what the next Stable release will
      provide.
      But that "carrot" only entices people who care about those very specific features. In the common
      case that set of features do not provide sufficient enticement.
    • Beyond that: Providing access to features not available in Stable would defeat part of the purpose
      of Beta, in that we aspire for Beta to closely match the Stable experience.
      If we had a subset of features exposed in 1.x-beta that end up
      hidden in 1.x-stable, that would counteract that aforementioned
      aspiration.
    • However, it might be reasonable to provide a separate fork of
      the beta channel (where the fork is perhaps updated three weeks
      into each six-week release cycle) that would provide such
      enhancements.
      See also jonhoo's posts about a testing channel
      (irlo 1, zulip,
      irlo 2, rfc), though I think those propsals largely
      focus on exposing certain features on the stable channel, and
      thus would not directly serve as a carrot for use of the beta
      channel without some revision.
  • Why did users continue to file issues about the fingerprint ICE after the 1.52.1-stable release had been put out?

    • We do not proactively push users to upgrade to the latest compiler, so users will sometimes encounter ICE's, and file bugs (as suggested by the ICE diagnostic) without first checking against a newer version of the compiler.
  • Why does the ICE message suggest filing a bug report against an outdated version of the compiler?

    • Because the compiler does not attempt to reflectively determine if a newer version of itself is available.
    • It is not clear with what frequency users encounter ICE's that have their root causes fixed on later versions of the compiler, so it may be counterproductive to blindly advise users to upgrade in respoonse to an ICE, without also filing a bug report for the ICE.
  • Why did it take until Monday for you to tell your users about the problem?

    • We did put up a github pinned issue on Friday
  • No, really: Why did it take until Monday for you to tell your users about the problem?

    • We had a blog post drafted on Friday. We were not yet sure what form the mitigation would take, so we couldn't include all of the details of the mitigation in that draft.
    • One detail here was that since we knew that the mitigation might take the form of a 1.52.1 point release. So some of the delay was motivated by the idea that it was better to have one single post (rather than have one post on Friday announcing the problem along with a workaround, and then a follow-up post on Monday announcing the 1.52.1 point release).
  • Why don't you have a lighter weight means than big blog posts to communicate important changes in status to your users, such as a banner on the website or a social media account?

    • We probably could have tweeted something from the @rustlang twitter account.
    • We probably need a playbook for how to handle situations like this in the future, that includes advice like that.

Lessons Learned, Corrective Actions, Action Items

  • Each chain of Five Whys should yield at least one lesson learned. Each lesson learned should yield at least one corrective action. Each corrective action should yield at least one action item.
  • Corrective Actions are high level activities to address the problem, while Action Items are individual tasks that can be assigned due dates.

Lessons Learned

Lack of visibility

  1. We need better visibility (or more advertisement within the teams) of when user-disruptive changes are scheduled to hit.
    • Notably, PR #83007 was known to inject ICE's, and it had been assigned a github milestone of 1.52.0. However, we didn't see that PR, or its milestone, when we were planning our next steps as part of reviewing issue #84970.

Long-lived, Unfixed ICE's

  1. We need better tracking of when certain ICE's start becoming more serious problems for our users.
  2. We need to stop unfixed ICEs from becoming problems that we never hear about.
    • (Arguably we didn't learn this lesson from this event, but rather from a particular line of Socratic inquiry.)

Some user workflows aren't modelled

  1. We need better testing of representative user workflows (i.e., debug builds) in our pipeline; at the very least, sometime prior to a stable release.

Communication Breakdown

  1. We need to be willing and able to communicate more quickly to our users in response to an event like this.

Bootstrap Errors are Obscure

  1. x.py's output hides which ICEs are rebuild artifacts vs legitimate bugs.

Corrective Actions

Note: pnkfelix is not expecting many of the proposals below to be adopted. These are, however, the best ideas he could come up with for addressing the lessons learned written above.

Review Milestone Issues

  • (1a.) Proposal: T-compiler should be reviewing the issues on milestones at some cadence oriented around the release cycle.

Correlate ICEs

  • (2a.) Proposal: Use machine automation to correlate ICE's being filed, and raise priority on ones that seem to have broad impact.
  • (2b.) When we add internal checking with the intent of uncovering pre-existing bugs, we should file a tracking issue as well, to track the bugs uncovered by the ICE (I am presuming this is superior to current practice of relying on people linking to the PR that added the check as means of correlation).

Dev Tool Telemetry

  • (2c.) Proposal: Invest effort in Dev-Tool Telemetry (including for rustc), starting with either an MCP or RFC; probably needs a project group dedicated to initial engagement with key stakeholders.

Track ICEs

  • (3a.) Proposal: Make list of all ICE's beyond some age (six months?)
    that are still open and still reproduce on nightly. Then, invite volunteers to drive down the size of the list, either by addressing their root causes,
    or convert them to proper diagnostic errors (to give users better guidance about what they might do in response to them).

Widen Beta Audience

  • (4a.) Proposal: Increase usage of the beta channel, in order to provide direct testing feedback.

Event Playbook

  • (5a.) Proposal: Define a playbook for how to communicate effectively to our users after a user-impacting event has been discovered.

Revisit Release Timing

  • (5b.) Proposal: Consider moving releases from Thursday to Wednesday, so that we have more time to react before the weekend for user events that have immediate impact.

Deobscure Bootstrap Errors w.r.t. Incremental

  • (6a.) Proposal: Revise x.py to reduce cases where incr-comp state is incorrectly reused and clarify which incr-comp ICE's represent real issues with the underlying rustc.

Action Item

Note: In principle, Action Items are meant to be things that can be assigned deadlines. I have not attempted to actually assign any deadlines below, but I do believe they all could be assigned deadlines, if we were to choose to invest time in them. Most of them take form of "Find owner", since our work assignment is largely bottom-up; i.e. it is dependent on volunteers getting interested in doing something, rather than leadership dictating what must happen.

  • (1a-i.) Proposal: T-compiler leads allocate time to learn how release team leverages milestones.
  • (1a-ii.) T-compiler lead(s) attend certain release team meetings, to get visibility into with discussions of milestoned issues and pull requests.
  • (1a-iii.) T-compiler leads allocate meeting slot, independent of release team, to discuss milestoned issues and pull requests. (The choice between (1a-ii.) vs (1a-iii.) will in part be driven by what is learned from (1a-i.))
  • (2a-i.) Proposal: Find owner for hypothetical Automated ICE Analysis initiative
  • (2b-i.) Proposal: Update rustc-dev-guide with guidance on adding a tracking issue whenever adding ICE that is known to expose old and previously unknown bugs.
  • (2c-i.) Proposal: Find owner for hypothetical Dev-Tool Telemetry initiative.
  • (3a-i.) Proposal: Find owner to build tooling to list "old" ICEs (e.g. > 6 months?) that still reproduce on nightly.
  • (4a-i.) Proposal: Find owner to investigate ways to provide "carrots" for users to use beta, or a fork thereof ("gamma"?), rather than
    stable or nightly. Perhaps try to align it with jonhoo's "Preview for Unstable Features" proposal (which has been closed, but has useful discussion).
  • (5a-i.) Proposal: pnkfelix does a survey of the landscape for response playbooks during user-impacting events, especially when it comes to programming languages: what avenues/resources do we have available? What styles of communication work best when interfacing with developers?
  • (6a-i.) We need to fix how version tracking is handled during rustc bootstrap.
  • (6a-ii.) x.py should be revised to raise an alarm on ICEs that come from the beta compiler.