# notes about pyca/cryptography

note to self: some python packages seem required for the crate to compile (`cffi` I think)

#### 0. nightly vs msrv

I'm currently testing on a recent nightly, compared to the crate's MSRV of 1.56. I'm mentioning this so that some of my results can be reproduced (nightlies are also built differently from stable: more assertions/logging can slow them down a bit, which may skew results), and because some of the findings may apply differently to older versions. Let me know if I should benchmark and profile on a different version of the compiler instead.

(I'm using an epyc-7401P with the usual benchmarking settings, so it's 32 cores, but you mentioned having 22 so I'll use that as the maximum.)

It also seems cargo can't easily clean just this cdylib crate, which was interesting; maybe I'll open an issue about it. Not a big deal, but it makes benchmarking only the final crate harder, so I'll also usually benchmark with incremental compilation turned off explicitly (touching the source files and rebuilding should then only rebuild the final crate).

#### 1. crates

One of the first things I wanted to do was deduplicate syn from the dependency tree, but then I saw that of course alex had not only noticed that already, but also opened issues and PRs to upgrade dependencies to syn2. And the last crate introducing syn1 also has multiple open PRs to switch to v2.

Since I was surprised at the cost of cdylibs vs dylibs, which I had never noticed before (8% here), I also wondered whether introducing inner crates would be beneficial for parallelism, but it looks like it's already set up this way, with only the python-related code in the cdylib.

(I've also wondered whether there's a way to avoid having ouroboros involved at all if the self-referential structures don't change very often, à la vendoring the macro-expanded code in the repository and updating it manually on demand, since it also uses syn1 and `ouroboros_macro` takes a few seconds to build. Not a concern if build dependencies were cached, of course.)

I also had issues trying to visualize our full self-profiling data with events, which is currently the only way to tie some rustc slowness to a specific piece of code in the crate -- so I'll need to investigate these panics as well (one of the reasons why I'll do a follow-up analysis, once I'm at least able to gather this vital information to diagnose compile times).

#### 2. linker

alex mentioned link times in the email, and it generally seems safe to use a faster linker. (This generally also improves the compilation of build dependencies, since even proc-macros are linked.) A particularly easy way to test this with the same LLVM version as the one rustc uses is via `-Zgcc-ld=lld` on nightlies (the ones distributed via rustup, not by a distro: the former are packaged with a build of lld on the major targets). It's not extremely significant overall, still around 3%, and the `run_linker` self-profile events show that linking times are reduced to about a third of the default's. (note: mold is in the same ballpark, only slightly faster)

#### 3. lto off vs default

In the email alex mentioned turning off LTO was underwhelming. I'm getting slightly better results than you are. By default rustc will do "local thinlto", unless explicitly told not to (or if it's not needed, e.g. when there's only a single codegen unit). (And running ThinLTO on the stdlib's CGUs is like 3s.) So I've tried all 3.
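For reference, here's roughly what the three `Cargo.{rev}.toml` variants used below set in the release profile (my sketch of the setup, not the actual files):

```
# Cargo.thin.toml
[profile.release]
lto = "thin"       # cross-crate ThinLTO

# Cargo.default.toml: presumably leaves `lto` unset (equivalent to
# `lto = false`), i.e. the default "thin local" LTO within each crate

# Cargo.off.toml
[profile.release]
lto = "off"        # no LTO at all
```

The runs below compare these three variants, first at -j22: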
```
> hyperfine -w2 --runs 10 --prepare "cp Cargo.{rev}.toml Cargo.toml && cargo clean" -L rev thin,default,off "CARGO_INCREMENTAL=0 cargo build --no-default-features --release -j22" -n "LTO: {rev}"
Benchmark 1: LTO: thin
  Time (mean ± σ):     21.461 s ±  0.017 s    [User: 93.153 s, System: 7.243 s]
  Range (min … max):   21.436 s … 21.488 s    10 runs

Benchmark 2: LTO: default
  Time (mean ± σ):     19.343 s ±  0.027 s    [User: 86.496 s, System: 7.074 s]
  Range (min … max):   19.304 s … 19.399 s    10 runs

Benchmark 3: LTO: off
  Time (mean ± σ):     18.696 s ±  0.022 s    [User: 75.188 s, System: 6.599 s]
  Range (min … max):   18.662 s … 18.734 s    10 runs

Summary
  'LTO: off' ran
    1.03 ± 0.00 times faster than 'LTO: default'
    1.15 ± 0.00 times faster than 'LTO: thin'
```

And at j2:

```
> hyperfine -w2 --runs 10 --prepare "cp Cargo.{rev}.toml Cargo.toml && cargo clean" -L rev thin,default,off "CARGO_INCREMENTAL=0 cargo build --no-default-features --release -j2" -n "LTO: {rev}"
Benchmark 1: LTO: thin
  Time (mean ± σ):     53.429 s ±  0.848 s    [User: 91.500 s, System: 6.369 s]
  Range (min … max):   51.326 s … 54.396 s    10 runs

Benchmark 2: LTO: default
  Time (mean ± σ):     48.973 s ±  0.232 s    [User: 85.584 s, System: 6.148 s]
  Range (min … max):   48.509 s … 49.222 s    10 runs

Benchmark 3: LTO: off
  Time (mean ± σ):     44.538 s ±  0.361 s    [User: 74.366 s, System: 6.005 s]
  Range (min … max):   43.589 s … 44.826 s    10 runs

Summary
  'LTO: off' ran
    1.10 ± 0.01 times faster than 'LTO: default'
    1.20 ± 0.02 times faster than 'LTO: thin'
```

That looks like an interesting improvement, but I'm currently in "rustc profiling mode" where even small % wins are hard to achieve.

Also, at j2, the default of 16 CGUs may also be suboptimal with ThinLTO in this situation: it seems that 4 could be a better choice, at around 5% faster (with caveats about runtime performance of course), and it's also 5% faster under the default "thin local" LTO. At j22, something closer to the core count with some leeway, like 28-30, should generally improve compile times, but one can surely look for a good value for each point on the "fast iteration loop vs runtime performance" scale (and I've sometimes seen lore like "try 2x the core count"). Since I'm not sure whether this is acceptable in your use-cases, I have more questions about the use of LTO at the end of these notes.

#### 4. cargo llvm lines

To stay on the topic of LTO, ThinLTO is what's causing the cargo-llvm-lines issue. Some workarounds to avoid that issue/warning (probably unsatisfying, since you're likely trying to understand compile times in the actual config you ship):

- using fat LTO
- turning off LTO altogether

Otherwise, _with_ ThinLTO, using a single CGU would also give a bit more information (though I'm locally seeing slightly more results than the 3 lines mentioned in your issue). Something in the range of 350K lines of IR, though I'm not sure if this is accurate.

Slightly unrelated note: in the same vein as `cargo-llvm-lines`, there's `-Zdump-mono-stats`, which prints similar stats but currently only at the MIR level (so not the actually emitted LLVM IR). That makes it less accurate, since the "cost" is an approximation (until I manage to add stats to our codegen backends in the future; then these flags will be close to parity with the tool), but the number of instantiations of generic functions is itself accurate. That can give a side view of the size and number of monomorphisations. As an example, here's an excerpt of how this looks on the main crate.
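(For reference, roughly how I gather these stats; the flag is nightly-only, and exactly where the per-crate output files land may differ on your setup, so treat the details as approximate.)

```
> RUSTFLAGS="-Zdump-mono-stats" CARGO_INCREMENTAL=0 cargo build --no-default-features --release
```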
(The default format is markdown, but you can also use `-Zdump-mono-stats-format=json` if you'd want to do some processing with `jq`.)

| Item | Instantiation count | Estimated Cost Per Instantiation | Total Estimated Cost |
| --- | ---: | ---: | ---: |
| `std::result::Result::<T, E>::map` | 257 | 23 | 5911 |
| `ouroboros::macro_help::alloc::raw_vec::RawVec::<T, A>::grow_amortized` | 28 | 176 | 4928 |
| `<std::slice::Iter<'a, T> as std::iter::Iterator>::next` | 27 | 178 | 4806 |
| `x509::certificate::parse_cert_ext` | 1 | 4324 | 4324 |
| `x509::extensions::encode_extension` | 1 | 4075 | 4075 |

llvm-lines also reports (with the workarounds from above) the same functions. So I presume you were able to work around that issue, and then discovered the suboptimal IR lowering for matching on arrays?

#### 5. array matching lowering

To get a quick idea of the rough cost of this meh lowering, I tested rewriting the match statements to if/else blocks in:

- `x509::certificate::parse_cert_ext`
- `x509::sign::identify_key_hash_type_for_oid`
- `x509::extensions::encode_extension`

That does reduce the amount of IR reported, and here's how that looks for compilation. It's kinda washed out amongst the full project build, but if we e.g. only rebuild the cdylib, it's a tiny bit better. At j2:

```
> hyperfine -w2 --runs 10 --prepare "cargo clean && CARGO_INCREMENTAL=0 cargo build --no-default-features --release && touch src/*.rs && cp extensions.{rev}.rs src/x509/extensions.rs && cp sign.{rev}.rs src/x509/sign.rs && cp certificate.{rev}.rs src/x509/certificate.rs" -L rev before,after "CARGO_INCREMENTAL=0 cargo build --no-default-features --release -j2" -n "array matches: {rev}"
Benchmark 1: array matches: before
  Time (mean ± σ):     21.215 s ±  0.032 s    [User: 38.014 s, System: 0.777 s]
  Range (min … max):   21.144 s … 21.267 s    10 runs

Benchmark 2: array matches: after
  Time (mean ± σ):     20.901 s ±  0.032 s    [User: 37.403 s, System: 0.766 s]
  Range (min … max):   20.830 s … 20.932 s    10 runs

Summary
  'array matches: after' ran
    1.02 ± 0.00 times faster than 'array matches: before'
```

A small improvement, so it doesn't seem to be the biggest compile-time cost. I'll gather more stats about the actual modules/functions and optimization passes (in particular in the `x509` crate) in my next tests and confirm this either way. I have not looked into how to fix the issue in rustc itself, but I saw some other contributors were already looking into it.

#### 6. cost of overflow checks

Just for my own knowledge, I also wanted to see the compile-time cost of overflow checks (I'm sure you wouldn't disable them on crypto code). Good to see it's low, at most 1-2%.

#### 7. cost of optimizing dependencies

For full builds where the runtime performance of dependencies may matter less than their build times (if this use-case is applicable to you, e.g. running some quick tests), and where there is no CI caching, it's possible to lower their opt-level. Since the cargo timings report shows crates like `asn1` and `pyo3` as having the biggest codegen times, I also wanted to try that. Turning off LTO and setting opt-level = 0 for them is around 5% faster at j2.
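Concretely, the kind of override I mean looks roughly like this (a sketch on my side, using cargo's per-package profile overrides; to be adjusted to the profile you actually ship):

```
[profile.release]
lto = "off"

# build the heaviest dependencies without optimizations, while the
# workspace crates keep the regular release opt-level
[profile.release.package.asn1]
opt-level = 0

[profile.release.package.pyo3]
opt-level = 0
```

The comparison below is between the two dependency opt-levels, both without LTO: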
```
Benchmark 1: no LTO, dependencies at default opt-level
  Time (mean ± σ):     44.236 s ±  0.516 s    [User: 74.406 s, System: 6.068 s]
  Range (min … max):   43.510 s … 45.036 s    10 runs

Benchmark 2: no LTO, dependencies at opt-level=0
  Time (mean ± σ):     42.122 s ±  0.215 s    [User: 70.034 s, System: 6.092 s]
  Range (min … max):   41.770 s … 42.509 s    10 runs

Summary
  'no LTO, dependencies at opt-level=0' ran
    1.05 ± 0.01 times faster than 'no LTO, dependencies at default opt-level'
```

(Sometimes a level of 1 can also be a possibility, where less time is spent in optimization but still enough for the runtime to be better than without optimizations.)

All this is moot if the use-cases allow caching pre-built, fully optimized dependencies on CI, for example, but I wanted to mention it in case they don't.

#### 8. build overrides

Specifically tackling build dependencies (e.g. incremental compilation) doesn't make a difference, which is expected on recent nightlies. On CI, I tend to turn incremental off explicitly, as it can also be a small win on local build dependencies (which are built without optimizations even in release builds).

#### 9. possibly flawed example

Again, I'm not sure how this impacts codegen and your use cases, and it's likely not usable as-is, but here's what I tried in the end (a sketch of this combined configuration is at the end of these notes):

- default thin local LTO
- avoid the suboptimal arrays in match arms
- opt-level = 0 for pyo3/asn1
- linking with rust-lld
- hopefully strip libstd's debuginfo
- explicit no incremental

At j2 with 4 cgus, that's 21% better (a bit more with no LTO, 26%). At j22 with 30 cgus, that's 13% (26% with no LTO here as well). Nothing mindblowing, but somewhat expected when LLVM dominates the runtime that much. I'll need to dig deeper to understand why.

#### outro. questions about use cases

All of the above was mostly introductory for me, rather than actionable advice: familiarizing myself with the crate a bit before digging deeper to get hopefully more interesting stats, like which functions are taking the most time to optimize and so on. But now that I'm a bit more familiar with the crate, I wanted to learn more about the important use-cases we're trying to optimize for:

- What kinds of workflows are the most common? Can they be split into different build configurations?
- The `-j2` stats in the email make me think CI times are important. If that's the case, then the full clean build times can be reduced via caching, at the very least for the dependencies (via sccache, or maybe cargo chef and similar). Or by having different CI jobs dedicated to fast feedback, vs ones at max opt levels w/ LTO in a nightly and published releases.
- Are fast local iteration times important? What kind of tests are done? Functional, performance, etc? Maybe we can try to look more into incremental compilation, compared to only looking at release builds with LTO. Or are from-scratch builds the most important use case?

I'm also not sure which parallelism limit would be best to run my tests with, even if there are multiple values of interest; depending on the use-cases, either j2 or j22 could be the most realistic, and I'd like at least to have a view that matches the issues you encounter.
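PS, the sketch mentioned in section 9: roughly how I'd express that combined configuration (my approximation; the codegen-units value depends on the -j level, the match-arm rewrite and the libstd debuginfo stripping aren't expressible here, and `-Zgcc-ld=lld` is nightly-only):

```
# Cargo.toml -- default "thin local" LTO, so `lto` is left unset,
# plus the per-package opt-level = 0 overrides for asn1/pyo3 from section 7
[profile.release]
codegen-units = 4              # 4 at -j2; something like 28-30 at -j22

# .cargo/config.toml -- link with the lld bundled with rustup toolchains
[build]
rustflags = ["-Zgcc-ld=lld"]
```

The builds themselves were run with `CARGO_INCREMENTAL=0`, as in the benchmarks above.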