# Parallel measurements

## Setup

1. Get a rustc/cargo toolchain with the parallel feature enabled (Zoxc)
   * In the root of the rust repo, set `parallel-compiler = true` under `[rust]` in `config.toml`
   * (Optional) If there was an existing compilation before, run `./x.py clean`
   * `6f087ac1c17723a84fd45f445c9887dbff61f8c0` is a try-build commit for `--enable-parallel-compiler`
   * Compare with `80e7cde2238e837a9d6a240af9a3253f469bb2cf` (bors master)
2. The flags for cargo and rustc to compile with N threads (Zoxc); see the sketch after the measurement links below for how these combine
   * `cargo -jN`
   * `rustc -Zthreads=N`
3. The configuration/changes to perf to support multi-thread benchmarks (TODO - @simulacrum)
4. Possible perf support for full crate graph build times (TODO - @simulacrum)

## What to measure (in order of importance)

- perf runs with 1 thread (sequential overhead)
- perf runs with 2 threads (expected benefit)
- full crate graph measurements
  - rustc bootstrap
  - servo
  - cranelift
  - ripgrep
- ideally: perf runs with 4 and 8 threads

## Machines to measure on

- perf.rlo (4 cores, 8 threads: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz)
- Niko's "14 core"
- eddyb's "lotsa cores"

## Measurements

https://docs.google.com/spreadsheets/d/1bNQJSDhbmOKbtb8EKg40C7D_gboayDpDoC-eQO6XbdI/edit#gid=0

### from perf.rlo (4 core, 8 thread CPU)

Wall time, single crate:

* [single vs. -Zthreads=8](https://perf.rust-lang.org/compare.html?start=single-threaded&end=parallel-rustc-8&stat=wall-time) - 40-60% speedup on average; no significant performance hits
* [single vs. -Zthreads=4](https://perf.rust-lang.org/compare.html?start=single-threaded&end=parallel-rustc-4&stat=wall-time) - 35-50% speedup on average; no significant performance hits
* [single vs. -Zthreads=2](https://perf.rust-lang.org/compare.html?start=single-threaded&end=parallel-rustc-2&stat=wall-time) - 20-30% speedup on average; no significant performance hits (most are tiny, i.e., <1 second)
* [single vs. -Zthreads=1](https://perf.rust-lang.org/compare.html?start=single-threaded&end=parallel-rustc-1&stat=wall-time) - 10-20% slowdown (i.e., the overhead of parallelism)

### from Niko's 14 core (28 thread?)

* [single vs. -Zthreads=8](https://perf.rust-lang.org/compare.html?start=niko-single-thread&end=niko-parallel-rustc-8&stat=wall-time)
* [single vs. -Zthreads=4](https://perf.rust-lang.org/compare.html?start=niko-single-thread&end=niko-parallel-rustc-4&stat=wall-time)
* [single vs. -Zthreads=2](https://perf.rust-lang.org/compare.html?start=niko-single-thread&end=niko-parallel-rustc-2&stat=wall-time)
* [single vs. -Zthreads=1](https://perf.rust-lang.org/compare.html?start=niko-single-thread&end=niko-parallel-rustc-1&stat=wall-time)
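As a concrete reference for how the setup steps above fit together, here is a minimal sketch of building the parallel-enabled toolchain and timing a crate at several `-Zthreads` values. The crate directory, the rustup toolchain name `parallel`, the `-j8` value, and the thread counts are illustrative assumptions, not the exact commands behind the numbers above.

```sh
#!/usr/bin/env bash
# Sketch only. Assumes config.toml already has `parallel-compiler = true`
# under [rust] (setup step 1) and that the stage1 build has been linked
# as a rustup toolchain, e.g.:
#   rustup toolchain link parallel build/x86_64-unknown-linux-gnu/stage1
set -euo pipefail

./x.py clean   # optional: only needed if there was a previous build
./x.py build   # build the parallel-enabled compiler

# Placeholder path for whichever crate is being measured.
cd path/to/some-benchmark-crate
for threads in 1 2 4 8; do
    cargo +parallel clean
    echo "== -Zthreads=${threads}, cargo -j8 =="
    # -jN bounds how many crates cargo compiles concurrently (setup step 2);
    # -Zthreads=N bounds the threads inside each rustc invocation.
    time RUSTFLAGS="-Zthreads=${threads}" cargo +parallel build --release -j8
done
```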
## Whole crate graph data -- from Mark's computer (8 core, 16 thread)

[Summary Sheet](https://docs.google.com/spreadsheets/d/1vadQWQQqTODU1_cAENnUjLyXM6cxms-tiCf2kCiNGGM/edit#gid=0)

> `0` is the sequential compiler, for reference;
> `1` is `-Zthreads=1`, which shows the sequential overhead;
> `2` is `-Zthreads=2`, which shows the benefit from parallelism.

### rustc stage 0 compilation

| Configuration | Wall time (s) | Run-to-run spread |
| --- | --- | --- |
| `./rustc-stage0-single-threaded` | 552.759227012 | +- 0.13% |
| `./rustc-stage0-multi-threaded-t2-j8` | 683.030195011 | +- 5.02% |
| `./rustc-stage0-multi-threaded-t2-j16` | 688.029377313 | +- 7.82% |
| `./rustc-stage0-multi-threaded-t4-j16` | 710.423041145 | +- 8.60% |
| `./rustc-stage0-multi-threaded-t16-j16` | 812.783412266 | +- 8.37% |
| `./rustc-stage0-multi-threaded-t2-j4` | 957.243363191 | +- 5.62% |

### Cargo compilation (not all data gathered yet)

| Configuration | Wall time (s) | Run-to-run spread |
| --- | --- | --- |
| `./cargo-multi-threaded-t1-j1` | 553.290588371 | +- 0.05% |
| `./cargo-multi-threaded-t2-j1` | 553.770275596 | +- 0.03% |
| `./cargo-multi-threaded-t3-j1` | 553.533103358 | +- 0.04% |
| `./cargo-multi-threaded-t4-j1` | 553.758315225 | +- 0.03% |
| `./cargo-multi-threaded-t6-j1` | 553.333730522 | +- 0.03% |
| `./cargo-multi-threaded-t8-j1` | 553.487285875 | +- 0.03% |

Mark's initial thoughts:

* Do we need data on e.g. Windows and macOS? perf currently can't collect that, but it could be gathered manually (a sketch of one way to do so follows below).
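The wall-time figures above, each with a +- spread across runs, look like they were saved per configuration from something such as `perf stat --repeat`; that is an assumption, not something stated here. Under that assumption, here is a minimal sketch of collecting comparable numbers by hand. The repeat count, output file names, and the `build-crate-graph.sh` wrapper are hypothetical; on Windows/macOS, where `perf` is unavailable, a portable timer would have to be substituted for `perf stat`.

```sh
#!/usr/bin/env bash
# Sketch only: repeat count, file names, and the build wrapper are
# illustrative, not the actual procedure used for the data above.
set -euo pipefail

# Each entry is "<rustc threads> <cargo jobs>".
for cfg in "1 1" "2 8" "2 16" "4 16"; do
    read -r threads jobs <<< "$cfg"
    # `perf stat -r 3` runs the command three times and reports
    # "seconds time elapsed ( +- x% )" in the output file (Linux only).
    perf stat -r 3 -o "multi-threaded-t${threads}-j${jobs}" -- \
        ./build-crate-graph.sh "${threads}" "${jobs}"   # hypothetical wrapper around the full build
done
```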