---
tags: perf.rlo, rustc
---

# Categorized Pain Points for perf.rlo

The following is a list of identified pain points in the performance-measuring process, sorted by relevance to our current target audience: rustc devs who wish to measure the performance impact of their changes to the compiler.

* Note: the most common case of "their changes" is an individual pull request being evaluated via `@rust-timer queue`. (Longer-term performance trends are of interest too, but we prioritize things that help with making decisions about individual PRs.)

## Feedback Analysis

The following is an analysis of the feedback gathered:

* Lack of insight into:
  * **what** benchmarks are measuring
    * **What** parts of the compilation process does a given benchmark stress?
  * **how** to best interpret results
    * **How** can I translate percentage changes in a given metric into an understanding of how the compiler's performance has changed? **How** can I do this quickly?
  * **when** to act on benchmark results vs. ignore them
    * **When** can I ignore noise? **When** is a legitimate performance regression small enough to ignore?
* Lack of insight is compounded by multiple performance criteria:
  * There are multiple performance criteria (e.g., instruction count, memory consumption, storage consumption, wall time, etc.) and no easy way to quickly see relevant changes in all criteria.
* Need for a better way to translate benchmark results into an understanding of the reason(s) behind a performance regression.
  * For example, it's hard to determine the cause of a performance regression on rollups.
* Need for a better process for ensuring performance regressions are tracked and fixed (or marked as "won't fix").
* Not all relevant performance criteria are being measured:
  * No disk space metric
  * I/O usage not properly measured
  * Only Linux
* Limited benchmark sample size:
  * Are we not measuring common usage patterns?
  * Are parts of the compiler not being stressed?

## The current process

This section describes the current process for discovering performance regressions in rustc _as they happen_. Note: it does not cover any ongoing effort to track general performance trends.

* A rustc contributor opens a pull request for some change.
* Someone (typically the author or the reviewer) determines whether this change is likely to impact compiler performance and issues a call to `@rust-timer queue`.
* The reviewer and/or author determine whether the metrics indicate a regression and whether that regression needs to be addressed.
* Subsequent `@rust-timer queue` runs are performed until the regression is eliminated or deemed acceptable, and the change is merged.
* The perf suite is run on all merges, and a weekly triage report is produced showing regressions.
* Someone is responsible for interpreting the results and "nagging" the authors and maintainers to look into the performance regressions.
* Sometimes these issues are followed up on, and sometimes they are not.

### Questions for improving the process

* How can we ensure that `@rust-timer queue` is being used when it should be?
  * In other words, how do we lower the number of performance regressions that are not caught in initial review?
* How can we ensure that reviewers properly interpret performance metrics and don't merge "unacceptable" performance regressions?
  * Can we provide guidance for when a regression is acceptable and when it is not?
  * This question relates to a lot of the feedback about it being hard to interpret the metrics and to know when metrics represent a "true" regression.
  * It arguably takes longer to spot a regression than it should (especially when that regression exposes itself through certain under-emphasized metrics such as memory consumption).
  * Sometimes what appears to be a regression is just noise, but knowing when this is the case is a bit of an art (a rough rule-of-thumb sketch appears at the end of the appendix).
* How can we more easily ensure that performance regressions that make it past review are brought to the attention of the author/reviewer?
  * How can we ensure they are fixed or deemed "won't fix"?
* Specifically for rollups:
  * How can we ensure that performance-sensitive changes are not being included in rollups?
  * If a regression does appear in a rollup, how can we easily spot the culprit? (A sketch of the current manual approach follows this list.)
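For the rollup question, the only approach available today (to my knowledge) is manual: re-benchmark the suspect constituent PRs individually and compare against the rollup's parent. The `@bors try` + `@rust-timer queue` combination is the same mechanism described above; treat the `@rust-timer build <sha>` form as an assumption to verify against the rustc-perf documentation.

```
# On each constituent PR suspected of causing the regression,
# request a try build plus a perf run of that PR in isolation:
@bors try @rust-timer queue

# If benchmark artifacts already exist for a specific commit
# (e.g. a merge commit on master), a run can reportedly be
# requested for that exact sha instead:
@rust-timer build <sha>
```

This narrows the culprit down by elimination, but it is slow and entirely manual, which is a large part of the pain point.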
## Next Steps

* Post, in the perf bot's response, whether a run resulted in a possible performance regression. (other)
* Raise at the compiler triage meeting whether we should suggest a perf run when the `compiler/` dir is changed. (pnkfelix)
* Max RSS needs to be part of triage. (pnkfelix)
* Think about tags for performance regression tracking (open an MCP?).
* Perhaps start a larger doc highlighting the larger perf proposal changes. (rylev)

## Appendix

### Raw Feedback

* Not clear which benchmarks matter. Am I supposed to weight all equally? (Does the summary weight all equally?)
* noise, in particular on some benchmarks, is frequently present, and makes it easy to essentially ignore such benchmarks.
* the compare mode presentation, for each benchmark, has one "full" and then several incr variants. This might be leading to over-emphasis on incr performance.
* How do I tell what things are likely noise vs. real gains and losses in performance?
* regressions have unclear responsibility in terms of tracking down the cause/fixing and deciding whether to revert (is it the author? the performance triager?)
* rollups often lead to performance regressions
* regressions during a rollup have unclear provenance
* hard to get a good overview of performance (of a particular diff) across many metric types (e.g., wall time, cpu instructions, rss, etc.)
* who decides how much regression is "too much"?
* The detailed comparison view often does not correlate with the summarized comparison view for small regressions/speedups.
* making decisions based on the compare view often seems fairly arbitrary (eh, some minor regressions, some minor improvements, seems good enough :shrug:)
* We don't have disk space as a metric. :)
* "well known" that some components are causing major increases in timing (e.g., simd in core/std), but unclear if we can say "stop adding intrinsics" or how to evaluate the tradeoff between features and performance. particularly the performance felt by rustc developers vs. end users, too
* The impact of cgu partitioning is not controlled for, while introducing systematic bias in measurements.
* overwhelming for newcomers to understand what they're looking at
* unclear how representative some of the current benchmarks are
* coverage and scale: which parts of the compiler are "well covered" by the current benchmarks, and which are not; at which points do some pieces of rustc stop scaling
* the performance budget is limited, in that we are not easily able to add everything we'd like benchmarked
* no data about IO, while incremental benchmarks hitting the disk more can be impactful, and hard to see
* (maybe just a problem for me: possible lack of statistics / insight into the benchmarked crates, what and why they exercise what they do)
* too much data -- boiling data down to a view of "how is the compiler doing". The dashboard helps here, but is it representative?
* no measurement of runtime performance of generated code (kind of orthogonal, kind of not)
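Several of the raw feedback items above (e.g., "How do I tell what things are likely noise vs. real gains" and "who decides how much regression is 'too much'?") ask for a sharper line between noise and real change. Purely as an illustrative sketch, and not anything perf.rlo implements today, one possible rule of thumb is to compare a metric's relative delta against that benchmark's historical run-to-run variation. The function names, the 1.5x margin, and the numbers below are all made up for illustration.

```rust
/// A minimal sketch of one possible rule of thumb: treat a change as
/// significant only if its relative delta exceeds the benchmark's
/// historical run-to-run variation by some margin.
#[derive(Debug, PartialEq)]
enum Verdict {
    LikelyNoise,
    Improvement,
    Regression,
}

/// Classify a single benchmark/metric pair.
/// `old` and `new` are raw metric values (e.g. instruction counts);
/// `historical_noise` is the typical relative variation observed for
/// this benchmark between identical runs (e.g. 0.002 = 0.2%).
fn classify(old: f64, new: f64, historical_noise: f64) -> Verdict {
    let relative_delta = (new - old) / old;
    // The 1.5x safety margin is an arbitrary placeholder.
    let threshold = historical_noise * 1.5;
    if relative_delta.abs() <= threshold {
        Verdict::LikelyNoise
    } else if relative_delta < 0.0 {
        Verdict::Improvement
    } else {
        Verdict::Regression
    }
}

fn main() {
    // Hypothetical numbers: a 0.15% increase on a benchmark that normally
    // fluctuates by 0.2% between identical runs is probably noise...
    assert_eq!(
        classify(1_000_000_000.0, 1_001_500_000.0, 0.002),
        Verdict::LikelyNoise
    );
    // ...while a 2% increase on that same benchmark probably is not.
    assert_eq!(
        classify(1_000_000_000.0, 1_020_000_000.0, 0.002),
        Verdict::Regression
    );
}
```

Even a crude rule like this, surfaced directly in the compare view, might reduce how much of the interpretation currently relies on intuition.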