# Analysis of `rustc-benchmarking-data`

lqd gathered a lot of data in the `rustc-benchmarking-data` repository. This document is nnethercote's analysis of it (with a few additional comments from others). It is long, detailed, and quite dry. It is aimed at Rust compiler developers, and not intended for a general audience. It is also not the highest-quality prose, in part because it is likely to become out of date in the not-too-distant future as performance work addresses the things this measurement and analysis have identified.

See the roadmap for a higher-level view of rustc perf work for 2022.

As well as an analysis, this document serves as a means of tracking who is doing/has done what work. Task assignments are shown in square brackets, e.g. "[name]".
## round-1-cachegrind-check

### Executive summary
- `parse_tt` and other functions related to macro parsing are the hottest, and correlate highly with allocations. [nnethercote, this blog post has details]
- `memcpy` is high in functions using BitSets a lot for dataflow analysis, e.g. in `http-0.2.6`. [nnethercote, #93984]
- `SetImpliedBits` function getting target feature information.
- Hot functions in a single crate:
  - `deunicode-1.3.1` dominated by `core::ascii::escape_default` [martingms, #94776]
  - `tinyvec-1.5.1` dominated by `<rustc_mir_build::build::Builder>::diverge_cleanup` [nnethercote: not worth fixing within rustc, but there are several possible fixes within `tinyvec` itself. See #161 for details.]
  - `unicode-normalization-0.1.19` dominated by `try_eval_bits` [nnethercote, #97936]

### Widely used functions
This shows all functions across all benchmarks, weighted by their `Ir` (Cachegrind instruction-read) percentage. This demonstrates breadth of usage. I've excluded malloc, memcpy, and dlopen/elf stuff, which made up lots of slots.

This table is hard to read, but metadata decoding dominates because of its effect on small crates. The next section ("Hot functions in multiple crates") breaks hot functions down more and is probably more useful.
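For concreteness, the weighting described above can be sketched as follows. This is my reading of the methodology, not the actual analysis script: sum each function's `Ir` percentage over all benchmarks, so a function that is moderately hot everywhere ranks above one that is very hot in a single benchmark.

```rust
use std::collections::HashMap;

// Sketch (an assumption about the methodology, not the real script):
// aggregate per-benchmark Ir percentages per function, then rank by total.
fn weight_functions(per_benchmark: &[Vec<(&str, f64)>]) -> Vec<(String, f64)> {
    let mut totals: HashMap<&str, f64> = HashMap::new();
    for bench in per_benchmark {
        for &(func, ir_pct) in bench {
            *totals.entry(func).or_insert(0.0) += ir_pct;
        }
    }
    let mut ranked: Vec<(String, f64)> =
        totals.into_iter().map(|(f, p)| (f.to_string(), p)).collect();
    // Sort hottest-first by summed percentage.
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    ranked
}

fn main() {
    let benches = vec![
        vec![("parse_tt", 5.0), ("decode_entry", 1.0)],
        vec![("decode_entry", 4.5)],
    ];
    // decode_entry totals 5.5, parse_tt 5.0, so decode_entry ranks first.
    println!("{:?}", weight_functions(&benches));
}
```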
### Hot functions in multiple crates
This section lists all the functions that hit 1.5% or higher in one benchmark and appear in more than one benchmark. It's a long list. Related functions (i.e. functions that are hot in tandem) are grouped together.
[nnethercote, mostly related to macro parsing, greatly improved, see here for details]
This table undersells the cost of allocations a lot, because it only shows `_int_free` results. But it also oversells a little in a different way, because jemalloc is more efficient than glibc malloc (which is measured here). We can probably assume allocations in general account for double the percentages in this table. See the DHAT results for more data.

Note that these crates correlate highly with the crates where `parse_tt` and related functions are hot. [nnethercote, greatly improved, see here for details]
These are all the macro parsing functions.
[nnethercote, #93984, completed, addresses the biggest of these: keccak, http, vte]
The high numbers for `keccak` and `http-0.2.6` are due to large bitsets in borrowck dataflow analysis. Note that `keccak-0.1.0` has some significant changes vs. `keccak` in rustc-benchmarks.
[Hard to improve. On x86-64 we query ~50 target feature flags, for things like SSE*, AVX*, etc. This is within `target_features` in `compiler/rustc_codegen_llvm/src/llvm_util.rs`. We check one flag at a time because the LLVM interface makes it hard to do otherwise, and LLVM is moderately slow to check each one. Even though it's a significant fraction of execution time for small programs, the absolute time is low, so it doesn't seem worth any further effort.]

This is significant only for very small crates. It's getting some target feature information from LLVM.
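The access pattern described above amounts to one LLVM round-trip per candidate feature. A simplified sketch, where `llvm_has_feature` is a hypothetical stand-in for the real per-feature FFI query (the actual code lives in `compiler/rustc_codegen_llvm/src/llvm_util.rs`):

```rust
// Hypothetical stand-in for the per-feature LLVM query; each call to the
// real thing crosses the FFI boundary, which is what makes the loop slow.
fn llvm_has_feature(feature: &str) -> bool {
    matches!(feature, "sse2" | "avx")
}

// One query per candidate: ~50 round-trips on x86-64.
fn supported_target_features<'a>(candidates: &[&'a str]) -> Vec<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|f| llvm_has_feature(f))
        .collect()
}

fn main() {
    let features = supported_target_features(&["sse2", "avx", "avx512f"]);
    println!("{:?}", features); // prints ["sse2", "avx"] with the stand-in above
}
```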
[nnethercote, #97575 fixes it]
Metadata decoding. High relative numbers for many, but mostly on very short-running crates, with constant amounts of decoding, presumably for decoding common libs like `std` and `core`.

[nnethercote, #94316]
[nnethercote, #98153]
[lcnr + nnethercote, #97345]
[This code was heavily optimised a couple of years ago for rustc-perf benchmarks like `keccak` and `inflate`, and further improvements are difficult. #97674 has some small improvements.]

A few crates are over-represented: `wast-39.0.0`, `rustc-serialize-0.3.24`, `wasmparser-0.82.0`, `inflate-0.4.5`.

[This is caused by lots of type folding and interning, very hard to improve.]
[nnethercote, #96210 + #96683]
[nnethercote, #93984]
[lcnr + nnethercote, #97345]
[lcnr + nnethercote, #97345]
[lcnr + nnethercote, #97345]
## round-2-llvm-lines-leaf-crate

### Executive summary
The top functions in `std`, `alloc` and `core`, as weighted by "Lines" counts. (The percentages here are more useful as a relative measure than an absolute measure.)

`grow_amortized` has been heavily optimized in the past, and the other top functions are generally very small and hard to improve upon.

One possibility: `map_fold`: just inline and remove it? Most affected: actix-router, quote, diesel_derives, bytecount. [nnethercote, #94442, didn't help]

## round-3-dhat
### Executive summary
- `parse_tt` and related macro parsing functions cause by far the most allocations, and correlate highly with the Cachegrind results. [nnethercote, this blog post has details]
- `match_impl`, `super_relate_tys`, ena snapshot vecs, `escape_default`, `ModChild`, thir `mirror_expr_inner`, etc. [lcnr + nnethercote, #97345, deals with `match_impl`/`super_relate_tys`; nnethercote, #98569, deals with `ModChild`; not much other scope for easy improvements]

Top 20 malloc users, from Cachegrind, and the biggest sources of allocations as determined by looking at the DHAT profiles.
Hottest program points (PPs) by allocation rate (blocks). This isn't a perfect metric, because sometimes multiple distinct PPs are best considered in combination, which requires human understanding of the stack traces. But it's a good start, and while the Cachegrind numbers can be high when lots of allocations are spread across lots of places, single PPs that allocate a lot are more likely to be optimizable.
Ones marked with `**` are not in the top 20 Cachegrind list above.

Macro parsing dominates here, again. `match_impl` also shows up.

Hottest program points (PPs) by allocation rate (bytes). Excludes tiny crates dominated by metadata decoding, which all have a `bytes` value in the range 0.9-2.0MB, mostly around 1.4MB. The rightmost column indicates the hot allocation causes.

Ones marked with `**` are not in the top 20 Cachegrind list above.

A much wider range of results here.
Hottest program points (PPs) by peak memory usage. The rightmost column indicates the hot allocation causes.
BitSets are again common. `DroplessArena` ones are difficult to action because that covers many different types. Otherwise, fairly spread out.

Highest peak memory usage, absolute.

Other than `http-0.2.6`, which is dominated by BitSets, these are not all that interesting. No particularly hot allocation sites, a pretty similar mix, with higher ones tending to be `DroplessArena`, MIR CFG building, metadata encoding, etc.

## round-4-line-counts
Biggest crates.
Smallest crates.
Not much to analyze here.
## round-5-cachegrind-debug
This is similar to round-1-cachegrind-check, but with additional LLVM costs, which aren't very interesting to analyze here.
## round-6-llvm-lines-project
This gave very similar results to round-2-llvm-lines-leaf-crate, so I haven't analyzed it.
## round-7-cargo-timing-check-j1
The use of `-j1` forces codegen to be non-parallel, which makes these results non-representative. See the `-j8` results instead.

## round-8-cargo-timing-debug-j1
The use of `-j1` forces codegen to be non-parallel, which makes these results non-representative. See the `-j8` results instead.

## round-9-cargo-timing-opt-j1
The use of `-j1` forces codegen to be non-parallel, which makes these results non-representative. See the `-j8` results instead.

## round-10-cachegrind-opt
This is similar to round-1-cachegrind-check, but with additional LLVM costs, which aren't very interesting to analyze here.
## round-13-cargo-timing-opt-j8
Most expensive crates. The counts are seconds of compile time, e.g. `syn` accounted for 643.5 seconds of compile time, which is 8.1% of the total. (There is of course overlap in crate compilation, so this doesn't say much about the critical path.)

`syn`/`quote`/`proc-macro2` (and their build scripts) are the most frequent.

Very surprising to see so many build scripts in there! Definitely worth investigation.
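For scale, the two quoted figures imply a dataset-wide total of roughly 643.5 / 0.081 ≈ 7944 seconds of compile time. This back-of-envelope figure is my own, not from the source data:

```rust
fn main() {
    // syn's share: 643.5 s is 8.1% of the total, so the implied total is:
    let syn_seconds = 643.5_f64;
    let syn_fraction = 0.081;
    let total = syn_seconds / syn_fraction; // ≈ 7944 s
    println!("implied total: {:.0} s (~{:.1} hours)", total, total / 3600.0);
}
```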
Some analysis of build script use-cases (areas where declaratively supporting the feature in cargo would remove the need for the script):

- Detecting the rustc version:
  - `syn`
  - `proc-macro2`
  - `libc`
  - `serde`
  - `log`. This seems more of a convenience than something impossible without a script, though: the crate could likely contain `cfg` expressions matching the same targets (doing so would remove this node from 120 crates' dependency graphs in the dataset) [https://github.com/rust-lang/log/issues/489]
- Detecting targets:
  - `proc-macro2`: e.g. for the wasm target
  - `libc`: e.g. for the FreeBSD target versions
  - `memchr`: e.g. for SIMD
  - `serde`: e.g. for wasm/asm.js, and architectures where libstd supports atomics
  - `futures-core`: e.g. for targets without atomic CAS ops
- Detecting env vars (same as the `target` use-case above, as parsing the `TARGET` env var):
  - `proc-macro2` also checks the `DOCS_RS` env var, likely to control and improve rustdoc output on docs.rs
  - `libc`: for CI to deny warnings, to check if it's a dependency of libstd, and to access cargo feature flags (which are probably equivalent to using `cfg!` expressions in the build script)
- Detecting nightly features (e.g. `proc-macro2`)
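As a concrete illustration of replacing a target-probing build script with declarative `cfg`, here is a hypothetical sketch (not taken from any of the crates above): the `futures-core`-style "no atomic CAS" fallback can often be expressed directly with the built-in `target_has_atomic` predicate, with no script emitting `cargo:rustc-cfg` lines at all.

```rust
// Hypothetical sketch: instead of a build script probing the target and
// emitting `cargo:rustc-cfg=...`, select the implementation declaratively
// with the built-in cfg predicate, removing the script (and its compile
// time) from the dependency graph entirely.

#[cfg(target_has_atomic = "ptr")]
fn atomics_impl() -> &'static str {
    "native atomics"
}

#[cfg(not(target_has_atomic = "ptr"))]
fn atomics_impl() -> &'static str {
    "lock-based fallback"
}

fn main() {
    // On mainstream 64-bit targets this takes the first branch.
    println!("{}", atomics_impl());
}
```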
TODO: also investigate the build script compile times. Some of these scripts are simple (but use various parts of libstd), yet compile slowly (e.g. `syn`'s build script compiles in >400ms in 150 crates). We need to look into whether that's because of opt levels or something else; maybe some simple scripts could be interpreted.

Most popular crates, i.e. how often they are dependencies of other crates.
`libc`, `cfg-if`, `unicode-xid`, and `syn`/`quote`/`proc-macro2`/`unicode-xid` are the most popular.

The biggest projects, i.e. most crates compiled.
219 out of 777 projects contain a single crate, i.e. zero dependencies.
Observations just from looking at some timings graphs.
- The `hyper` crate depends on the `h2` crate, but doesn't start building until `h2` is fully compiled, rather than when `h2`'s metadata is emitted before codegen. Is this necessary? E.g. in `warp-0.3.2`. [lqd, hyper: #2770, complete]
- `syn`, e.g. in `actix-connect-2.0.0`.
- `zstd-sys build script (run)` in `awc-3.0.0-beta.19`. Can we do better with them? Prioritizing some of them earlier in the pipeline could help, thanks to increased parallelism. The same thing used to happen on servo, and I've also seen it on crates depending on `openssl`; it is tracked in this cargo issue. Note: native library builds can also compete for tokens and build in parallel, so moving those earlier can in turn make them build slower because of higher contention and fewer resources.

## round-11-cargo-timing-check-j8
Most expensive crates, same idea as for round-13.
Reasonably similar results to round-13.
## round-12-cargo-timing-debug-j8
Most expensive crates, same idea as for round-13.
Reasonably similar results to round-13.
## round-14-self-profile-check
The heaviest relative queries seen. (More data here.) The `expand_crate` ones have some correlation with the hot macro parsing results seen with Cachegrind and DHAT.

Slowest passes overall, weighted by percentages.
## round-15-self-profile-debug
The heaviest relative queries seen.
Slowest passes overall, weighted by percentages.
## round-16-self-profile-opt
The heaviest relative queries seen.
Slowest passes overall, weighted by percentages.
## round-17-time-passes-check

### Executive summary
`-Ztime-passes` gives both time and RSS (absolute and change) for each pass. Self-profiling covers time, so I'll just analyze the change in RSS for each stage. I don't entirely trust the RSS numbers produced by `-Ztime-passes`; they sometimes seem wonky, but here goes.

Weighted RSS changes. Note that the totals aren't that meaningful; it's about the percentages.
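For reference, the `-Ztime-passes` lines look something like `time:   0.026; rss:   67MB ->   70MB (   +3MB)  parse_crate` (format from memory of recent nightlies; treat it as an assumption). Extracting the per-pass RSS delta, the quantity analyzed in this round, is then a small parsing exercise:

```rust
// Sketch of pulling the RSS delta out of a -Ztime-passes line. The line
// format is an assumption modeled on recent rustc nightlies; adjust the
// parsing if the real output differs.
fn rss_delta_mb(line: &str) -> Option<(i64, String)> {
    // Expect "... (   +3MB)  pass_name" at the end of the line.
    let open = line.rfind('(')?;
    let close = line[open..].find(')')? + open;
    let delta = line[open + 1..close]
        .trim()
        .trim_end_matches("MB")
        .parse::<i64>()
        .ok()?;
    let pass = line[close + 1..].trim().to_string();
    Some((delta, pass))
}

fn main() {
    let line = "time:   0.026; rss:   67MB ->   70MB (   +3MB)  parse_crate";
    println!("{:?}", rss_delta_mb(line)); // prints Some((3, "parse_crate"))
}
```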
I don't think the `total` number is meaningful. `macro_expand_crate` and `expand_crate` are almost always identical; not sure what to make of that, seems suspicious.

## round-18-time-passes-debug
Numbers for front-end passes are similar to round-17, as expected. Codegen passes add some extra memory use, unsurprisingly.

## round-19-time-passes-opt
`LLVM_lto_optimize` is the most memory-hungry pass, in general.