
---
tags: parallel, rustc
---

# WG Parallel Planning

## Persistent links

* [Rustc Parallel Conventions document](https://hackmd.io/jY9E4_8qS3-v1lgIKgzFlA)

Todo:

* Mark: rustc patch: 4 thread cap, default parallel on nightly
* Alex: Announce intent for the 4 thread change
* Alex: Announce nightly with parallelism
    * before/after -- previous nightly vs. next nightly, all default settings
    * full crate graph, single crate (`cargo clean -p $(cargo pkgid)`)
* Mark: rustc patch: talk to Cargo for request token/release token
    * note: make sure LLVM threading is using this logic

Known bugs:

* rustc will release implicit token (will run over max implicit parallelism)
* rustc spawns all rayon threads immediately instead of rate-limiting them

----

## Current Action Items

* [x] simulacrum will pursue lint store cleanup -- simulacrum
* [x] land the rayon PR -- niko will review and publish
* [x] create a hackmd with guidelines -- niko, [done](https://hackmd.io/jY9E4_8qS3-v1lgIKgzFlA)
* [x] upload video -- niko, [done](https://www.youtube.com/watch?v=VtsaiTiAjz8&feature=youtu.be)
* [ ] maybe prep a short blog post -- niko
* [x] prepare some notes on how jobserver integration in rustc works -- cuviper
    * ping Alex re: hand-off around LLVM translation / compilation
    * https://hackmd.io/DnuWl7-PRb6t15GBduGLZw
* [x] pre-audit `librustc_data_structures::sync` -- spastorino
    * [x] https://hackmd.io/rvXvkvKfSOWlw_4DjiMdZw
* [ ] Finishing up Body Cache update -- nashenas88 (Paul)
    * https://hackmd.io/9GU-3sz0T3Wqee3D6PN5hA
* [x] review CrateMetadata locks -- spastorino
    * [x] https://hackmd.io/42cKqJroTGKQlNdcID_GlA

## 2019.10.28

[Recording](https://youtu.be/Wh20eXfMOSk)

Agenda:

* How it "felt" to move from `Sess` to `TyCtxt`
    * there are a lot of things in `Sess` that are not immutable *yet*
        * but generally become so once "compilation proper" starts
    * Mark did a LintStore change, where we created it and moved it, frozen, into the ty ctxt
        * Will this be a problem?
            * hopefully not -- queries are good enough here
* [Go over the `sync` module review that we did?](https://hackmd.io/rvXvkvKfSOWlw_4DjiMdZw)
    * Atomic -- maybe not worth it to sometimes use atomic, sometimes not
        * Just switch over to crossbeam's `Atomic<T>` unilaterally?
    * Lrc/Weak -- confusing for new folks
        * maybe just not use it?
        * Arc may not have much cost
    * Lock --
        * we probably can't just switch these to use Lock unilaterally
    * MTLock/MTRef -- we could probably get rid of these
* Review more mutable state
    * CrateMetadata -- Santiago is taking a look at this
        * https://hackmd.io/42cKqJroTGKQlNdcID_GlA
    * syntax gated spans -- why Lock?
    * Session -- move things over to queries or just directly assign
        * Mark will be taking a look at this
        * plugins, crate types, recursion/type length limits, etc.
        * allocator_kind, inject, etc.
        * trait_methods_found, confused_type_with_std_module
            * move to Resolver outputs directly
* Performance action items (Alex to take a look at it):
    * rip out jobserver and bench perf
        * Mark to get a try build here and hand off artifacts
    * investigate exact cause of slowdown
        * Alex will take point on this for next meeting
        * alex's analysis -- https://hackmd.io/oUdvUU2lTk2ZxfOWQBT9xQ?both
        * is it jobserver? can the coarser thread allocation work better?
        * do we thrash with "big loops"?
    * get strace logs for jobserver file handle
        * i.e., how many times per session does rustc read/write?
    * get a self profile crox view of parallel compiler?

## 2019.10.07

### Tasks

* [LintStore](https://hackmd.io/shGNgcHTQ0mGxAiEnv0xYw?edit)
    * conclusions:
        * code seems correct but could be made simpler
        * remove `Lock<Option<..>>` inside of `LintStore` and register a "builder" instead of the pass itself
        * can we "remove" the `RwLock` somehow from session?
            * one idea is a `Frozen` pattern that maybe implements `Deref`
            * another is to "steal" from the sess when creating tcx and expose via a query
* [PR#63756](https://hackmd.io/@simulacrum/r1jsI5U_r)
    * conclusions:
        * code looks fine
        * future cleanup may be worth it, but holding off in light of current ongoing work
        * minor concern over atomic ordering, comment left on PR

### Guidelines and conventions

Separate document: https://hackmd.io/jY9E4_8qS3-v1lgIKgzFlA

* Atomic orderings:
    * use SeqCst everywhere unless there is a strong reason to do otherwise
        * if so, it should be documented
* Interior mutability in a struct:
    * always private fields
    * document lock ordering if there are multiple locks
    * try to minimize overall size of module
    * don't return guards or have "open-ended" locking patterns
* Initialization pattern in session:
    * How to handle this? Will require some exploration

### Other

* Measurements:
    * https://mark.rousskov.org/parallel-compiler-data/
    * also wallclock data at the bottom of this doc
    * conclusions:
        * `-j1` performance is not too bad, ~5-10% slower; for very large crates (e.g., script-servo) that amounts to a few seconds out of a multi-minute build
        * scalability is not great, but a full crate graph build can see significant wins
            * we don't really know why that is
* What to discuss next time?
    * rayon fork changes?
        * should land josh's changes, do a semver bump
    * jobserver?
    * further audits?

## Roadmap

* Sequential overhead
    * Rerun perf benchmark with `-j1` (but not limiting parallel codegen) and identify hotspots
    * Identify cases one by one and optimize
* Overly fine-grained locking risks subtle ordering or deadlock bugs
    * Solution: audit
* Poor jobserver integration leading to overall poor scaling
* Little public testing of correctness and performance
    * Call for performance testing, asking for data with `-Ztimings`
        * requires us to have easy builds available, perhaps? At least useful for correctness
* Rayon fork
    * Do we feel the need to eliminate it?
        * Let's update it at least
    * Solution: Review the patches
* Documentation of key components
    * What are the major sources of shared state and where is each documented?
    * How does jobserver integration work and can we improve on that?
    * Why do we have the Rayon fork?
    * How does thread-local state work -- how does it get communicated to the workers?
    * How to handle multiple threads competing for a single query
* Unify parallel type-check and parallel code-gen into one framework
    * [Idea: alexcrichton can explain](https://rust-lang.zulipchat.com/#narrow/stream/187679-t-compiler.2Fwg-parallel-rustc/topic/truly.20parallel.20codegen)

## Another view on the above, categorized by "next step"

* measure performance
    * initial focus: seq overhead
* produce binary builds
    * we as developers should be able to easily test out changes
    * document this for other developers too, not time to get everyone else involved!
* audit and document fine-grained locking
    * produce a list of things to be audited
    * schedule a weekly meeting, recorded Zoom calls
    * meetings with the "explain, discuss, document" format
        * rayon fork features -- do we need them?
        * parallel code-gen: how does it work, can it be unified?
        * jobserver integration
        * major sources of shared state
        * [pre-existing questions doc](https://hackmd.io/XDC24IlWT4OIxYdmIxH4Xg)
* user involvement: start getting people using it
    * create instructions on how to use alt builds for correctness checking
    * or to roll your own build for perf testing
    * how to gather data, where to submit

## Scenarios to profile and measure

* Whole crate graph builds with full parallelism. Should see significant wins in build time as well as CPU usage.
    * Cargo (done)
    * Rustc (done)
    * sccache (done)
    * `script` from Servo (done)
    * ... this is also what the post on internals would ask for
    * See: https://mark.rousskov.org/parallel-compiler-data/
* Run the `perf` suite with full parallelism enabled.
  This gives an idea of the single-crate benefit of full parallelism across a suite of scenarios (incremental, warm cache, cold cache, etc.)
    * [-Zthreads=virtual cores and physical cores](https://perf.rust-lang.org/compare.html?start=0221e265621a5fcc68ca62bdcdeabad1882a0e9a&end=f0b7b0a9327d3b43aa45a89e90d9785a06059b5a&stat=wall-time)
        * ~20-40% win wall time in best case
    * [-Zthreads=physical cores](https://perf.rust-lang.org/compare.html?start=032a53a06ce293571e51bbe621a5c480e8a28e95&end=3ec96e4535a2cfb4ee83cd65d738f98aef82bc8a&stat=wall-time)
* What is the impact of "`-j1`" w.r.t. non-codegen threads
    * Run perf suite and look at `*-check` benchmarks
    * [Detailed results](https://perf.rust-lang.org/compare.html?start=702b45e409495a41afcccbe87a251a692b0cefab&end=dc78b8ba143915e07375e9d7f05838222cb1db3e&stat=wall-time)
        * 5-17% regression in wall time for single-threaded vs. -Zthreads=1 parallel
        * need to recollect for self profile, self profile is currently broken
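
As a rough illustration of two ideas from the 2019.10.07 notes (the `Frozen` pattern that implements `Deref`, and the interior-mutability guidelines: private fields, `SeqCst` by default, no returned guards), here is a minimal standalone sketch. It uses only the standard library. The `Frozen` and `Stats` types and their methods are hypothetical names invented for this example; they are not rustc's actual types, and the real design may differ.

```rust
use std::ops::Deref;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical `Frozen` wrapper: the value is built mutably up front,
/// then wrapped so that only shared (read-only) access remains.
pub struct Frozen<T>(T);

impl<T> Frozen<T> {
    pub fn freeze(value: T) -> Self {
        Frozen(value)
    }
}

impl<T> Deref for Frozen<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.0
    }
}

/// Hypothetical struct following the interior-mutability guidelines:
/// fields are private and all locking stays inside the methods.
pub struct Stats {
    hits: AtomicUsize,
    names: Mutex<Vec<String>>,
}

impl Stats {
    pub fn new() -> Self {
        Stats { hits: AtomicUsize::new(0), names: Mutex::new(Vec::new()) }
    }

    /// Uses `SeqCst`, the default ordering suggested by the guidelines.
    /// Returns the count after this hit.
    pub fn record_hit(&self) -> usize {
        self.hits.fetch_add(1, Ordering::SeqCst) + 1
    }

    pub fn hits(&self) -> usize {
        self.hits.load(Ordering::SeqCst)
    }

    /// The guard is dropped before returning, so no lock escapes.
    pub fn add_name(&self, name: &str) {
        self.names.lock().unwrap().push(name.to_string());
    }

    /// Returns a copy instead of a guard, avoiding open-ended locking.
    pub fn names(&self) -> Vec<String> {
        self.names.lock().unwrap().clone()
    }
}

fn main() {
    // Build mutably, then freeze before sharing across threads.
    let mut lints = Vec::new();
    lints.push("unused_variables");
    let lints = Frozen::freeze(lints);

    let stats = Stats::new();
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                // Read-only access to the frozen data through `Deref`.
                for lint in lints.iter() {
                    stats.add_name(lint);
                }
                stats.record_hit();
            });
        }
    });

    assert_eq!(stats.hits(), 4);
    assert_eq!(stats.names().len(), 4);
}
```

Returning a cloned `Vec` from `names` trades some copying for never exposing a `MutexGuard`; whether that trade is acceptable on a hot path would need measuring.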
