owned this note
owned this note
Published
Linked with GitHub
# Brainstorming for shrinkmem sprint
## Tools for measuring memory usage
- DHAT
- massif + massif-visualizer
- `*top`
- `perf`
- ebpf
- heaptrack
- elfutils
- visualization with chrome/firefox devtools
- `-Z time-passes`
DHAT seems to be them most common, but has no experts on T-compiler. Consider adding a DHAT tutorial?
- Linux and Windows are the most common dev platforms
- Should be published before first week of March
On linux `perf` is capable of statistically profiling the entire system [kernel and userspace].
Also on linux ebpf does event tracing and profiling safely interfacing between userspace programs and the kernel using a reduced instruction set and sandboxing.
For linux heaptrack traces all memory allocations and annotates those events with stack traces. [Comparison to massif](https://github.com/KDE/heaptrack#comparison-to-valgrinds-massif). Breaks [when linking jemalloc statically](https://bugs.kde.org/show_bug.cgi?id=431746).
`-Z time-passes` makes it easy to identify which parts of compilation are causing memory increases. Real profilers don't always make this obvious. But be aware of some gotchas, which can lead to misinterpretation of the output. For example, sometimes passes "cause" memory increases by virtue of being the first to call some query, the output doesn't show how passes are nested, and some passes can occur concurrently (codegen and optimization).
## LLVM uses the most memory in codegen builds
- Find a way to swap in a custom allocator? (`#[global_allocator]` won't work)
- Schedule codegen units differently? We currently schedule all of the largest CGUs first (roughly), which noticeably increases memory usage when there are outsized CGUs. See https://github.com/rust-lang/rust/issues/82685.
- Intelligently divide outsized CGUs, with attention to limiting impact on inlining decisions. It seems like it would be feasible to analyze the mono item graph and partition it into groups of mono items which cannot be reached from one another. Splitting the CGU on those partitions couldn't affect inlining.
- Factor in inlining before or during CGU merging. This can help merging produce a more even partition. See https://github.com/rust-lang/rust/pull/65281.
- Make CGU merging smarter. It currently merges the smallest CGUs until we're under the limit on the number of CGUs. This strategy can cause our final set of CGUs to have a larger greatest sized CGU than other strategies.
For example, with a CGUs limit of 2, and four CGUs of size 3, 3, 2, and 1, the merge-smallest strategy leaves us with CGUs of sizes 6 and 3. But we could have merged CGUs such that we have CGUs of sizes 5 and 4.
- More heavily rely on LTO for inlining. Instead of including inlined functions in every CGU that uses them, just let inlining occur during LTO if applicable. Not familiar enough with area to know if this is feasible.
- Immediately wrap LLVM allocated types in RAII-style wrappers, so that we don't inadvertently leak memory. I believe there are a couple places where we leak, though I don't know that they're impactful.
- Enable more MIR optimizations.
* Automatically set mir-opt-level=2 in release mode
* move all mir-opt-level=3 opts to level 4
* (level 4 now meaning "slow")
* move all mir-opt-level=2 opts to level 3
* add individual opts to level 2
* would probably be nicer to do after reviving https://github.com/rust-lang/rust/pull/77665
* enable inliner by default only in release mode and without incremental: https://github.com/rust-lang/rust/pull/82280
- Enable polymorphization.
## The DepGraph requires a large amount of memory
This has been demonstrated by tgnottingham's PRs,
notably https://github.com/rust-lang/rust/pull/79589
and https://github.com/rust-lang/rust/pull/80957.
- Zoxc prototyped [a change](https://github.com/rust-lang/rust/pull/60035) where some DepGraph data is persisted directly to disk rather than kept in memory.
- Perhaps parts of the `PreviousDepGraph` could be dropped as they're found to be irrelevant to the current session. Or the `PreviousDepGraph` and `CurrentDepGraph` could be merged, and `PreviousDepGraph` parts either shared or dropped appropriately during compilation.
- Can shave off 4 bytes per node in `PreviousDepGraph` by only encoding the end edge in `edge_list_indices`, and deriving the start edge from the end edge of the previous node. This would require storing edge data in `edge_list_data` in order of nodes, which we currently don't do. When I considered this, I estimated about 1% max-rss reduction on the style-servo-opt incr-unchanged and incr-patched benchmarks. It might be slightly better now that some other memory usage improvements have landed, but it'll still be very small.
- Can shave off 4 bytes per node in `CurrentDepGraph` during incr-full builds by avoiding use of the `hybrid_indices` collection during those builds (i.e. when `PreviousDepGraph` node count is zero). Likely to be a very small max-rss improvement in real-world benchmarks.
- Can more aggressively share edges with `PreviousDepGraph`. We only share edges with the PDG for `DarkGreen` nodes, because they're known to be sharable without any extra processing. `LightGreen` and `Red` nodes can share edges with the PDG too, if we compare the edges and find them to be the same. I'd probably redefine the meaning of `Light` and `Dark` if we did this, so that `Light` means we can't share edges and `Dark` means we can. Then we'd replace the `Red` collection with `LightRed` and `DarkRed` collections. Also likely to be a very small max-rss improvement in real-world benchmarks.
## Inefficient layout of rustc data structures
Some data structures in the AST and HIR require more memory than
strictly required to encode the stored information.
For instance:
- `hir::FnHeader` requires 4 bytes to store,
but only encodes 8 bits worth of information
(unsafety, constness, async are 1 bit each, abi is 5 bits);
- uneven enum variants waste space in padding.
Sprint-size proposal:
- census of such inefficiencies,
- targetted fixes.
Long-term possibilities:
- transition from tree-like HIR to index-based HIR
(ie: like rust-analyzer does, using `ExprId` instead of `&Expr<'_>`);
- implement dynamically-sized enums in the language
(a proc-macro based on [RFC 2594](https://github.com/rust-lang/rfcs/pull/2594) should be enough);
- add bit fields to the language.
## Duplicated information storage
Even packed, spans infest everything, and are stored many times in different places.
Meanwhile, their use is exceptional for error reporting
(except for the SyntaxContext which contains hygiene information).
This duplicated storage for unlikely usage could to be reduced.
Sprint-size proposal:
- census of all queries that store spans, attributes and pointers to HIR data;
- targetted removal of such redundancies.
## Useless information storage
The rustc query system stores a lot of information forever.
However, some of this information may not be useful during
all the compilation session, and could be reclaimed by the system.
Sprint-sized questions:
- Are there more queries whose results should be stolen by a subsequent one?
- Could HIR and TypeckResults be stolen at MIR building?
## Memory usage by AST expansion
Macro expansion can increase significantly the size of a crate.
At the moment, resolving, AST-HIR lowering and HIR indexing require
the full AST to be expanded at the same moment.
This is explained by the current resolve algorithm,
which is a fixed-point method walking the whole crate.
Such a behaviour corresponds to the allowed mutual dependency
between files in the same crate.
However, an incremental version of resolve could be invented:
given a partially-expanded
(= incomplete AST where the missing parts have a derived SyntaxContext),
performing resolution using just the expanded part
(eventually failing for some names),
and completing later once a more thorough expansion happens.
(ie: resolve with not-fully-expanded, and on-demand expansion).
Together with incremental HIR lowering and indexing,
this could allow for incremental AST expansion.
Sprint-size problem:
- implement incremental HIR indexing,
- implement incremental HIR lowering (ie: on-demand lowering of an item),
- investigate whether a partial Rust hygienic resolution is even
possible with a partially expanded AST?
Long-term possibility: depending on the above questions,
finally reach end-to-end queries.