---
tags: parallel, rustc
---
# WG Parallel Planning
## Persistent links
* [Rustc Parallel Conventions document](https://hackmd.io/jY9E4_8qS3-v1lgIKgzFlA)
Todo:
* Mark: rustc patch: 4 thread cap, default parallel on nightly
* Alex: Announce intent for the 4 thread change
* Alex: Announce nightly with parallelism
* before/after -- previous nightly vs. next nightly, all default settings
* full crate graph, single crate (`cargo clean -p $(cargo pkgid)`)
* Mark: rustc patch: talk to Cargo for request token/release token (see the jobserver sketch after this list)
* note: make sure LLVM threading is using this logic
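For context, a minimal sketch (not rustc's actual implementation) of the request/release protocol against Cargo's jobserver, using the `jobserver` crate; `run_parallel_work` is a hypothetical stand-in for a unit of work such as a codegen unit:

```rust
use jobserver::Client;

fn main() -> std::io::Result<()> {
    // Inherit the jobserver Cargo passes down via the environment, or fall
    // back to a local one capped at 4 tokens when running standalone.
    let client = match unsafe { Client::from_env() } {
        Some(client) => client,
        None => Client::new(4)?,
    };

    // Hold a token for each extra unit of parallel work; the implicit token
    // already covers the main thread.
    let token = client.acquire()?; // blocks until Cargo grants a token
    run_parallel_work();           // hypothetical stand-in for real work
    drop(token);                   // hands the token back so siblings can run

    Ok(())
}

fn run_parallel_work() {
    // placeholder
}
```

LLVM codegen threads would have to go through the same acquire/release path so the whole process stays within the granted token count.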
Known bugs:
* rustc will release implicit token (will run over max implicit parallelism)
* rustc spawns all rayon threads immediately instead of rate-limiting them
----
## Current Action Items
* [x] simulacrum will pursue lint store cleanup -- simulacrum
* [x] land the rayon PR -- niko will review and publish
* [x] create a hackmd with guidelines -- niko, [done](https://hackmd.io/jY9E4_8qS3-v1lgIKgzFlA)
* [x] upload video -- niko, [done](https://www.youtube.com/watch?v=VtsaiTiAjz8&feature=youtu.be)
* [ ] maybe prep a short blog post -- niko
* [x] prepare some notes on how jobserver integration in rustc works -- cuviper
* ping Alex re: hand-off around LLVM translation / compilation
* https://hackmd.io/DnuWl7-PRb6t15GBduGLZw
* [x] pre-audit `librustc_data_structures::sync` -- spastorino
* [x] https://hackmd.io/rvXvkvKfSOWlw_4DjiMdZw
* [ ] Finishing up Body Cache update - nashenas88 (Paul)
* https://hackmd.io/9GU-3sz0T3Wqee3D6PN5hA
* [x] review CrateMetadata locks -- spastorino
* [x] https://hackmd.io/42cKqJroTGKQlNdcID_GlA
## 2019.10.28
[Recording](https://youtu.be/Wh20eXfMOSk)
Agenda:
* How it "felt" to move from `Sess` to `TyCtxt`
* there are a lot of things in `Sess` that are not immutable *yet*
* but generally become so once "compilation proper" starts
* Mark did a LintStore change, where we created it and moved it, frozen, into the tcx
* Will this be a problem?
* hopefully not -- queries are good enough here
* [Go over the `sync` module review that we did?](https://hackmd.io/rvXvkvKfSOWlw_4DjiMdZw) (see the shim sketch after this agenda)
* Atomic -- maybe not worth it to use atomics in some cases and not others
* Just switch over to crossbeam's `Atomic<T>` unilaterally?
* Lrc/Weak -- confusing for new folks
* maybe just not use it?
* Arc may not have much cost
* Lock --
* we probably can't just switch these to use Lock unilaterally
* MTLock/MTRef -- we could probably get rid of these
* Review more mutable state
* CrateMetadata - Santiago is taking a look at this
* https://hackmd.io/42cKqJroTGKQlNdcID_GlA
* syntax gated spans - why Lock?
* Session - move things over to queries or just directly assign
* Mark will be taking a look at this
* plugins, crate types, recursion/type length limits, etc
* allocator_kind, inject, etc.
* trait_methods_found, confused_type_with_std_module
* move to Resolver outputs directly
* Performance action items (Alex to take a look at these):
* rip out jobserver and bench perf
* Mark to get a try build here and handoff artifacts
* investigate exact cause of slowdown
* Alex will take point on this for next meeting
* alex's analysis - https://hackmd.io/oUdvUU2lTk2ZxfOWQBT9xQ?both
* is it jobserver? can the coarser thread allocation work better?
* do we thrash with "big loops"?
* get strace logs for jobserver file handle
* i.e., how many times per session does rustc read/write?
* get a self-profile `crox` view of the parallel compiler?
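As background for the `sync` discussion above, a rough sketch of the cfg-switched shim pattern under review; the `parallel` cfg name and the closure-based API here are illustrative assumptions, not the real `librustc_data_structures::sync` code:

```rust
// One `Lock<T>` name that is a cheap RefCell in the single-threaded build and
// a real mutex in the parallel build, behind a single closure-based API.

#[cfg(not(parallel))]
pub struct Lock<T>(std::cell::RefCell<T>);

#[cfg(not(parallel))]
impl<T> Lock<T> {
    pub fn new(value: T) -> Self {
        Lock(std::cell::RefCell::new(value))
    }
    pub fn with_lock<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        f(&mut self.0.borrow_mut())
    }
}

#[cfg(parallel)]
pub struct Lock<T>(std::sync::Mutex<T>);

#[cfg(parallel)]
impl<T> Lock<T> {
    pub fn new(value: T) -> Self {
        Lock(std::sync::Mutex::new(value))
    }
    pub fn with_lock<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        f(&mut self.0.lock().unwrap())
    }
}

fn main() {
    // Callers write the same code either way; only the underlying type changes.
    let counter = Lock::new(0u32);
    counter.with_lock(|n| *n += 1);
    counter.with_lock(|n| println!("count = {}", n));
}
```

Exposing access through a closure instead of returning a guard also matches the "no open-ended locking" guideline further down.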
## 2019.10.07
### Tasks
* [LintStore](https://hackmd.io/shGNgcHTQ0mGxAiEnv0xYw?edit)
* conclusions:
* code seems correct but could be made simpler
* remove `Lock<Option<..>>` inside of `LintStore` and register a "builder" instead of the pass itself
* can we "remove" the `RwLock` somehow from session?
* one idea is a `Frozen` pattern that maybe implements `Deref` (see the sketch after this list)
* another is to "steal" from the sess when creating tcx and expose via a query
* [PR#63756](https://hackmd.io/@simulacrum/r1jsI5U_r)
* conclusions:
* code looks fine
* future cleanup may be worth it, but holding off in light of current ongoing work
* minor concern over atomic ordering, comment left on PR
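A minimal sketch of the `Frozen`-plus-`Deref` idea mentioned above (illustrative only; fitting it into `Session`/`TyCtxt` is the part that still needs exploration):

```rust
use std::ops::Deref;

/// Hands out shared access only after the value has been frozen, so
/// post-freeze code cannot mutate it.
pub struct Frozen<T>(T);

impl<T> Frozen<T> {
    pub fn freeze(value: T) -> Self {
        Frozen(value)
    }
}

impl<T> Deref for Frozen<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.0
    }
}

fn main() {
    // Build the store mutably, then freeze it before handing it over (e.g. to
    // the tcx); `Deref` gives read-only access, and there is no way back to `&mut`.
    let mut lints: Vec<&'static str> = Vec::new();
    lints.push("unused_variables");
    let store = Frozen::freeze(lints);
    println!("registered {} lints", store.len());
}
```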
### Guidelines and conventions
Separate document: https://hackmd.io/jY9E4_8qS3-v1lgIKgzFlA
* Atomic orderings:
* use SeqCst everywhere unless there is a strong reason to do otherwise
* if so, it should be documented
* Interior mutability in a struct (see the sketch after this list):
* always private fields
* document lock ordering if there are multiple locks
* try to minimize the overall size of the module
* don't return guards or have "open-ended" locking patterns
* Initialization pattern in session:
* How to handle this? Will require some exploration
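To make the conventions concrete, a small illustrative struct (hypothetical, not from rustc) that follows them: private interior mutability, a documented lock order, `SeqCst` orderings, and no guards escaping the API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Illustrative only: interior mutability lives behind private fields and is
/// reached through methods, never by handing out guards.
///
/// Lock ordering: `names` is always taken before `stats`, never the reverse.
pub struct Registry {
    names: Mutex<Vec<String>>,
    stats: Mutex<Vec<usize>>,
    // SeqCst everywhere, per the guideline; a weaker ordering would need a
    // comment explaining why it is sound.
    total: AtomicUsize,
}

impl Registry {
    pub fn new() -> Self {
        Registry {
            names: Mutex::new(Vec::new()),
            stats: Mutex::new(Vec::new()),
            total: AtomicUsize::new(0),
        }
    }

    pub fn register(&self, name: String, len: usize) {
        // Respect the documented lock ordering: names, then stats.
        let mut names = self.names.lock().unwrap();
        let mut stats = self.stats.lock().unwrap();
        names.push(name);
        stats.push(len);
        self.total.fetch_add(1, Ordering::SeqCst);
    }

    pub fn total(&self) -> usize {
        self.total.load(Ordering::SeqCst)
    }
}

fn main() {
    let registry = Registry::new();
    registry.register("unused_variables".to_string(), 1);
    println!("registered: {}", registry.total());
}
```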
### Other
* Measurements:
* https://mark.rousskov.org/parallel-compiler-data/
* also wallclock data at the bottom of this doc
* conclusions:
* `-j1` performance is not too bad, ~5-10% overhead; for very large crates (e.g., script-servo) that amounts to a few seconds on a multi-minute build
* scalability is not great, but a full crate graph build can see significant wins
* we don't really know why that is
* What to discuss next time?
* rayon fork changes?
* should land josh's changes, do a semver bump
* jobserver?
* further audits?
## Roadmap
* Sequential overhead
* Rerun perf benchmark with `-j1` (but not limiting parallel codegen) and identify hotspots
* Identify cases one by one and optimize
* Overly fine-grained locking risks subtle ordering or deadlock bugs
* Solution: audit
* Poor jobserver integration leading to overall poor scaling
* Little public testing of correctness and performance
* Call for performance testing, asking for data with `-Ztimings`
* requires us to have easy builds available, perhaps? At least useful for correctness
* Rayon fork
* Do we feel the need to eliminate it?
* Let's update it at least
* Solution: Review the patches
* Documentation of key components
* What are the major sources of shared state and where is each documented?
* How does jobserver integration work and can we improve on that?
* Why do we have the Rayon fork?
* How does thread-local state work -- how does it get communicated to the workers? (see the sketch after this list)
* How to handle multiple threads competing for a single query
* Unify parallel type-check and parallel code-gen into one framework
* [Idea: alexcrichton can explain](https://rust-lang.zulipchat.com/#narrow/stream/187679-t-compiler.2Fwg-parallel-rustc/topic/truly.20parallel.20codegen)
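For the thread-local-state question above, a minimal sketch of the scoped thread-local pattern (simplified; rustc's real `ImplicitCtxt` machinery is more involved, and `with_session`/`current_session` are made-up names):

```rust
use std::cell::Cell;

thread_local! {
    // Each thread carries an identifier for the "context" it is working for;
    // in rustc this would be a pointer to the implicit context, not a number.
    static CURRENT_SESSION: Cell<usize> = Cell::new(0);
}

fn with_session<R>(session_id: usize, f: impl FnOnce() -> R) -> R {
    CURRENT_SESSION.with(|current| {
        let previous = current.get();
        current.set(session_id);
        let result = f();
        current.set(previous);
        result
    })
}

fn current_session() -> usize {
    CURRENT_SESSION.with(|current| current.get())
}

fn main() {
    // A worker thread would do this before running a stolen job, so queries
    // executing on that thread can find the right context.
    with_session(42, || {
        println!("working for session {}", current_session());
    });
}
```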
## Another view on the above, categorized by "next step"
* measure performance
* initial focus: seq overhead
* produce binary builds
* we as developers should be able to easily test out changes
* document this for other developers too, though it's not yet time to get everyone else involved!
* audit and document fine-grained locking
* produce a list of things to be audited
* schedule a weekly meeting, recorded Zoom calls
* meetings with the "explain, discuss, document" format
* rayon fork features -- do we need them?
* parallel code-gen: how does it work, can it be unified?
* jobserver integration
* major sources of shared state
* [pre-existing questions doc](https://hackmd.io/XDC24IlWT4OIxYdmIxH4Xg)
* user involvement: start getting people using it
* create instructions on how to use alt builds for correctness checking
* or to roll your own build for perf testing
* how to gather data, where to submit
## Scenarios to profile and measure
* Whole crate graph builds with full parallelism. Should see significant wins in build time as well as CPU usage.
* Cargo (done)
* Rustc (done)
* sccache (done)
* `script` from Servo (done)
* ... this is also what the post on internals would ask for
* See: https://mark.rousskov.org/parallel-compiler-data/
* Run the `perf` suite with full parallelism enabled. Gives an idea of the benefit for a single crate, with full parallelism, across a suite of scenarios (incremental, warm cache, cold cache, etc.)
* [-Zthreads=virtual cores and physical cores](https://perf.rust-lang.org/compare.html?start=0221e265621a5fcc68ca62bdcdeabad1882a0e9a&end=f0b7b0a9327d3b43aa45a89e90d9785a06059b5a&stat=wall-time)
* ~20-40% win wall time in best case
* [-Zthreads=physical cores](https://perf.rust-lang.org/compare.html?start=032a53a06ce293571e51bbe621a5c480e8a28e95&end=3ec96e4535a2cfb4ee83cd65d738f98aef82bc8a&stat=wall-time)
* What is the impact of "`-j1`" w.r.t. non-codegen threads
* Run perf suite and look at `*-check` benchmarks
* [Detailed results](https://perf.rust-lang.org/compare.html?start=702b45e409495a41afcccbe87a251a692b0cefab&end=dc78b8ba143915e07375e9d7f05838222cb1db3e&stat=wall-time)
* 5-17% regression in wall time for single-threaded vs. -Zthreads=1 parallel
* need to re-collect self-profile data; self-profile is currently broken