---
tags: perf.rlo, rustc
---

# perf.rlo

## What are the goals of this meeting?

* Use the meeting time itself to review the site and/or establish goals?
* What action items can we establish?
* Who are the stakeholders?
* Do we think we have sufficient people present here to assign action items usefully?
* Do we think we have sufficient people present here to make decisions?

## Topics

* Who are the groups of people that consume this data, and what does each of these groups want to see when visiting the site?
    * Compiler team: curious as to the general trajectory of performance
    * Those actively working on performance (perf team or a motivated contributor)
    * rustc dev tracking the performance impact of their changes
    * Curious non-dev person looking to see whether rustc is fast enough
    * Curious person who uses Rust
* Who should establish the set of benchmarks?
    * How often should the set change?
    * How can we be confident this captures a reasonable slice of "representative" Rust programs?
    * Do we need to gather expertise in data science?
    * Is there research-worthy material here? simulacrum thinks we are running up against open research problems (or at least problems without well-known solutions)
* Changes/enhancements to discuss
    * How to present the metrics we gather today
        * Do the current defaults reflect what matters?
        * Cannot currently visualize more than one metric at once.
        * Should we change how the metrics are stored?
    * What metrics to gather in the future
        * Disk space usage is an obvious one we need
        * pnkfelix thinks more data about cache misses is also needed (more real-world impact than instruction counts)
            * (Note that perf counters have architectural limits)
* How do we track cross-platform performance?
* What automation could be added to ease existing workflows (e.g., the triage report)?
* "Competitors" exist
    * e.g. arewefastyet.rs
    * Should we embrace these (e.g. actively advertise them), or leverage them as staging grounds for ideas?
        * That is, ideas w.r.t. what to benchmark, what metrics to gather, and how to present them.

# Collected pain points

* Not clear which benchmarks matter. Am I supposed to weight all equally? (Does the summary weight all equally?)
* Noise, in particular on some benchmarks, is frequently present, and makes it easy to essentially ignore such benchmarks.
* The compare-mode presentation, for each benchmark, has one "full" and then several incr variants. This might be leading to over-emphasis on incremental performance.
* How do I tell which things are likely noise vs. real gains and losses in performance? (One possible heuristic is sketched after this list.)
* Too much data -- boiling the data down to a view of "how is the compiler doing". The dashboard helps here, but is it representative?
* Regressions have unclear responsibility in terms of tracking down the cause, fixing it, and deciding whether to revert. (Is it the author? The performance triager?)
* Regressions during a rollup have unclear provenance.
* Hard to get a good overview of performance (of a particular diff) across many metric types (e.g., wall time, CPU instructions, RSS, etc.).
* Who decides how much regression is "too much"?
* The detailed comparison view often does not correlate with the summarized comparison view for small regressions/speedups.
* Making decisions based on the compare view often seems fairly arbitrary (eh, some minor regressions, some minor improvements, seems good enough :shrug:).
* We don't have disk space as a metric. :)
* It is "well known" that some components are causing major increases in timing (e.g., simd in core/std), but it is unclear if we can say "stop adding intrinsics" or how to evaluate the tradeoff between features and performance -- particularly performance for rustc developers vs. end users, too.
* The impact of CGU partitioning is not controlled for, while introducing systematic bias in measurements.
* Overwhelming for newcomers to understand what they're looking at.
* No measurement of the runtime performance of generated code (kind of orthogonal, kind of not).
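
As a concrete starting point for the noise question above, one common heuristic (this is only a sketch for discussion, not a description of what perf.rlo actually implements) is to treat a delta as significant only when it exceeds some multiple of the benchmark's recent variability. All names, the sample values, and the threshold factor `k` below are illustrative assumptions:

```rust
// Sketch: flag a change as "likely real" only if it exceeds k standard
// deviations of the benchmark's recent history. Purely illustrative.

/// Mean of a slice of samples.
fn mean(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

/// Sample standard deviation, used here as a crude noise estimate.
fn std_dev(samples: &[f64]) -> f64 {
    let m = mean(samples);
    let var = samples
        .iter()
        .map(|x| (x - m).powi(2))
        .sum::<f64>()
        / (samples.len() - 1) as f64;
    var.sqrt()
}

/// True if `new` differs from the historical mean by more than `k` sigma.
fn likely_real_change(history: &[f64], new: f64, k: f64) -> bool {
    (new - mean(history)).abs() > k * std_dev(history)
}

fn main() {
    // Hypothetical instruction counts for one benchmark over recent masters.
    let history = [1_000.0, 1_004.0, 998.0, 1_002.0, 1_001.0];
    println!("{}", likely_real_change(&history, 1_030.0, 3.0)); // true: outside noise
    println!("{}", likely_real_change(&history, 1_003.0, 3.0)); // false: within noise
}
```

A real implementation would also have to cope with benchmarks whose noise level itself drifts over time, which is part of why this feels like an open problem rather than a solved one.
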
:) * "well known" that some components are causing major increases in timing (e.g., simd in core/std), but unclear if we can say "stop adding intrinsics" or how to evaluate the tradeoff between features and performance. particularly performance of rustc developers vs end users, too * The impact of cgu partitioning is not controlled for, while introducing systematic bias in measurements. * overwhelming for newcomers on how to understand what they're looking at * no measurement of runtime performance of generated code (kind of orthogonal, kind of not) ## Understanding the issue With a focus on rustc devs who use perf.rlo to gauge the performance impact of their changes to the compiler, it's important to understand a typical workflow: * Dev introduces change * Dev or a reviewer kicks off a perf run to see what the impact is * TODO FINSIH