---
tags: perf.rlo, rustc
---
# Categorized Pain Points for perf.rlo
The following is a list of identified pain points in the performance-measuring process, sorted by relevance to our current target audience: rustc developers who wish to measure the performance impact of their changes to the compiler.
* Note that the most common case of "their changes" is an individual pull request being evaluated via `@rust-timer queue`. (Longer-term performance trends are of interest too, but we prioritize things that help make decisions about individual PRs.)
## Feedback Analysis
The following is an analysis of the feedback gathered:
* Lack of insight into:
    * **what** benchmarks are measuring
        * **What** parts of the compilation process does a given benchmark stress?
    * **how** to best interpret results
        * **How** can I translate percentage changes in a given metric into an understanding of how the compiler's performance has changed? **How** can I do this quickly?
    * **when** to act on benchmark results vs. ignore them
        * **When** can I ignore noise? **When** is a legitimate performance regression small enough to ignore?
* Lack of insight is compounded by multiple performance criteria
    * There are multiple performance criteria (e.g., instruction count, memory consumption, storage consumption, wall time, etc.) and no easy way to quickly see relevant changes in all criteria.
* Need for a better way to translate benchmark results into an understanding of the reason(s) behind a performance regression.
    * For example, it's hard to determine the cause of a performance regression on rollups.
* Need for a better process for ensuring performance regressions are tracked and fixed (or marked as "won't fix").
* Not all relevant performance criteria are being measured:
    * No disk space metric
    * I/O usage is not properly measured
    * Only Linux is measured
* Limited benchmark sample size
    * Are we not measuring common usage patterns?
    * Are parts of the compiler not being stressed?
## The current process
This section outlines the current process for discovering performance regressions in rustc _as they happen_. Note: it does not cover any ongoing effort to track general performance trends.
* A rustc contributor opens a pull request for some change.
* Someone (typically the author or the reviewer) determines whether this change is likely to impact compiler performance and issues a call to `@rust-timer queue` (see the example comment after this list).
* The reviewer and/or author determine if the metrics indicate a regression and if that regression needs to be addressed.
* Subsequent runs of `@rust-timer queue` are performed until the regression is eliminated or deemed acceptable, and the change is merged.
* The perf suite is run on all merges and a weekly triage report is produced showing regressions.
* Someone is responsible for interpreting the results and "nagging" the authors and maintainers to look into the performance regressions.
* Sometimes these issues are followed up on and sometimes not.
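For reference, a perf run on an open PR is typically requested with a single GitHub comment that asks bors for a try build and queues the benchmarks against it once that build is ready. The invocation below is a representative sketch rather than authoritative documentation; the rustc-perf docs have the full command syntax:

```
@bors try @rust-timer queue
```

Once the benchmarks finish, the perf bot replies on the PR with a link to the comparison view for that run, which is what the author and reviewer then have to interpret.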
### Questions for improving the process
* How can we ensure that `@rust-timer queue` is being used when it should be?
    * In other words, how do we lower the number of performance regressions that are not caught in initial review?
* How can we ensure that reviewers properly interpret performance metrics and don't merge "unacceptable" performance regressions?
    * Can we provide guidance for when a regression is acceptable and when it is not?
        * This question relates to a lot of the feedback about it being hard to interpret the metrics and to know when metrics represent a "true" regression.
    * It arguably takes longer to spot a regression than it should (especially when that regression shows up in under-emphasized metrics such as memory consumption).
    * Sometimes what appears to be a regression is just noise, but knowing when this is the case is a bit of an art.
* How can we more easily ensure that performance regressions that make it past review are brought to the attention of the author/reviewer?
    * How can we ensure they are fixed or deemed "won't fix"?
* Specifically for rollups:
    * How can we ensure that performance-sensitive changes are not included in rollups?
    * If a regression does appear in a rollup, how can we easily spot the culprit? (see the note below)
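One partial answer already exists, assuming my understanding of the perf bot is current: it can queue a benchmark run for a specific commit that has build artifacts available (for example, a try build), rather than only for the head of a PR. The command below is illustrative and the SHA is a placeholder:

```
@rust-timer build <full-commit-sha>
```

Combined with try builds of a rollup's intermediate merge commits, this can narrow a regression down to a single constituent PR, at the cost of one full benchmark run per commit.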
## Next Steps
* Post, as part of the perf bot's response, whether a run resulted in a possible performance regression (other)
* Raise at the compiler triage meeting whether we should suggest a perf run when the `compiler/` directory is changed (pnkfelix)
* max rss needs to be part of triage (pnkfelix)
* Think about tags for performance regression tracking (open an MCP?)
    * Perhaps start with a larger doc highlighting the broader set of proposed perf changes (rylev)
## Appendix
### Raw Feedback
* Not clear which benchmarks matter. Am I supposed to weight all equally? (Does summary weight all equally?)
* noise, in particular on some benchmarks, is frequently present, and makes it easy to essentially ignore such benchmarks.
* the compare mode presentation, for each benchmark, has one "full" and then several incr variants. This might be leading to over-emphasis on incr performance.
* How do I tell what things are likely noise vs real gains and losses in performance
* regressions have unclear responsibility in terms of tracking down cause/fixing and deciding whether to revert. (is it the author? performance triager?)
* rollups often lead to performance regressions
* regressions during a rollup have unclear provenance
* hard to get a good overview of performance (of a particular diff) across many metric types (e.g., wall time, cpu instructions, rss, etc.)
* who decides how much regression is "too much"?
* The detailed comparison view often does not correlate with summarized comparison view for small regressions/speedups.
* making decisions based on compare view often seems fairly arbitrary (eh, some minor regressions, some minor improvements, seems good enough :shrug:)
* We don't have disk space as metric. :)
* "well known" that some components are causing major increases in timing (e.g., simd in core/std), but unclear if we can say "stop adding intrinsics" or how to evaluate the tradeoff between features and performance. particularly performance of rustc developers vs end users, too
* The impact of cgu partitioning is not controlled for, while introducing systematic bias in measurements.
* overwhelming for newcomers on how to understand what they're looking at
* unclear on how representative some of the current benchmarks are
* coverage and scale: which parts of the compiler are "well covered" by the current benchmarks, and which are not; at which points do some pieces of rustc stop scaling
* performance budget is limited, in that we are not easily able to add everything we'd like benchmarked
* no data about IO, while incremental benchmarks hitting the disk more can be impactful, and hard to see
* (maybe just a problem for me: possible lack of statistics / insight into the benchmarked crates, what and why they exercise what they do)
* too much data -- boiling data down to a view of "how is compiler doing". The dashboard helps here, but is it representative?
* no measurement of runtime performance of generated code (kind of orthogonal, kind of not)