Alex Crichton

# Rustc's parallel LLVM backend

The main phases a crate goes through from rustc to a binary artifact:

1. The crate is split into *codegen units* (CGUs), a rustc-specific data structure saying what to translate where.
2. Each codegen unit is translated to an LLVM module. Each LLVM module lives in its own `LLVMContextRef` and is independent of all other modules/contexts, allowing us to send these between threads.
3. Codegen units are always optimized. Even at `-O0` we optimize (think of `#[inline(always)]`).
4. If enabled, codegen units then participate in ThinLTO or "fat LTO".
5. Remaining codegen units are then translated to machine code.

Step (1) cannot be parallelized; it's a query. Step (2) today is sequential; with a parallel rustc it can be parallelized. Step (3) is parallelized today. Step (4) cannot be parallelized (by definition it's serial work). Step (5) is parallelized today.

An excellent visual overview of codegen today looks like the graph in [this comment](https://github.com/rust-lang/rust/issues/64913#issuecomment-537226853). The notable points on that graph are:

* Thread 0 is the thread that generates LLVM modules (translates to LLVM). It runs at the top but after a point finishes producing modules.
* The "stair step" look of the other threads represents how the main thread creates an LLVM module and then sends it to another thread. After sending the LLVM module to another thread, the main thread keeps translating. This "stair" is "fixed" with a parallel rustc, where it can look more like a wall.
* This is an optimized build, so ThinLTO is enabled. There's a first point where all LLVM CGUs stop. That's where the first phase of optimization is done and a small amount of serial work happens. This serial work is attributed to Thread 6; you can see it poke out a bit.
* Afterwards a wall of optimization happens while all CGUs are optimized after ThinLTO in parallel.

![](https://i.imgur.com/Ei7b0iD.png)

## Codegen Units

For the purposes of parallelism, codegen units aren't really too interesting beyond the fact that rustc can create multiple codegen units for one crate. One very interesting aspect, though, is that the split of a crate into codegen units is automatic and takes no user input. We ideally want each codegen unit to take roughly the same amount of time in LLVM, to avoid one thread spinning on an extra-huge CGU while all the other threads are idle because their LLVM modules were small. [PRs like this](https://github.com/rust-lang/rust/pull/65281) are recent attempts to improve rustc's auto-splitting algorithm. Apart from the splitting itself, though, codegen units aren't particularly interesting with respect to the parallelism of the backend.

## Main Thread vs Workers

> **Note**: this is documentation that is basically only relevant to today's architecture. Much of this will change with a truly parallel rustc.

Translation from Rust MIR to LLVM IR requires the `TyCtxt` to be around, which means that today this translation is a single-threaded task. The main thread is the only thread which can create an LLVM module, and as a result it will create CGUs and then send them to a "coordinator thread" for further work.

The "coordinator thread" is unconditionally spawned by rustc and is used to coordinate work between the main thread and the worker threads which are optimizing/codegen'ing. It has a [very large doc block](https://github.com/rust-lang/rust/blob/b7a9c285a50f3a94c44687ba9ff3ab0648243aaa/src/librustc_codegen_ssa/back/write.rs#L1055-L1189). This thread is responsible for a few critical tasks:

* It signals the main thread when it can translate a new CGU (more on this later).
* It receives translated CGUs from the main thread, and then spawns a new thread to optimize each one.
* It collects optimized CGUs from worker threads. It also performs the serial ThinLTO work (or linking in "fat LTO").
* It manages codegen for each CGU.
* It is responsible for collecting all of the results and sending them back to the main thread when everything is ready.

Today the coordinator thread has a "mysterious" channel in `TyCtxt` which is a backchannel for sending it information. This has probably been refactored by now, but there's basically a channel for sending messages to it.

Worker threads which actually perform work are pretty simple. They have some shenanigans to ensure that error reports from LLVM are routed back to the main thread, as well as handling of panics to ensure things are torn down cleanly. Other than that, though, they're pretty standard "just go do the work then exit" threads. We don't currently have a thread pool, and I don't think we've actually yet seen benchmarks showing that we need a thread pool...
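
To make that shape concrete, here's a minimal, hypothetical sketch of the main-thread/coordinator/worker plumbing using standard-library channels. All the type and function names here are made up, and the jobserver throttling and memory back-pressure (both covered below) are omitted:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical placeholder types; rustc's real ones carry far more state.
struct CodegenUnit;
struct LlvmModule;
struct ObjectFile;

fn translate_to_llvm(_cgu: CodegenUnit) -> LlvmModule {
    LlvmModule // main-thread-only: needs `TyCtxt`
}

fn optimize_and_codegen(_module: LlvmModule) -> ObjectFile {
    ObjectFile // pure LLVM work: safe on any thread
}

fn backend(cgus: Vec<CodegenUnit>) -> Vec<ObjectFile> {
    let (to_coordinator, from_main) = mpsc::channel();

    // Coordinator: receive translated modules and spawn a fresh worker
    // per module (there is no thread pool today), then collect results.
    let coordinator = thread::spawn(move || {
        let workers: Vec<_> = from_main
            .into_iter()
            .map(|module| thread::spawn(move || optimize_and_codegen(module)))
            .collect();
        workers.into_iter().map(|w| w.join().unwrap()).collect()
    });

    // Main thread: the only thread allowed to touch `TyCtxt`, so it
    // translates CGUs serially, shipping each module off as it's produced.
    for cgu in cgus {
        to_coordinator.send(translate_to_llvm(cgu)).unwrap();
    }
    drop(to_coordinator); // hang up so the coordinator's receive loop ends

    coordinator.join().unwrap()
}
```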

## ThinLTO

This was discussed a bit above and you can see it in action [with the timing graph](https://github.com/rust-lang/rust/issues/64913#issuecomment-537226853).

The general gist is that *any parallelism at all* requires multiple CGUs. We cannot run passes on an LLVM module in parallel; each LLVM module is single-threaded (think of today's rustc `TyCtxt`). Therefore, to get any parallelism whatsoever, we need to create multiple codegen units for the backend.

As soon as you create multiple codegen units, though, you've removed inlining opportunities between those codegen units. Due to the automatic nature of partitioning and the lack of "`#[inline]` everywhere" when ThinLTO was added, we basically need an automatic, compiler-built-in way of recovering inlining opportunities. While we *could* just execute full LTO (merge everything into one CGU), that removes parallelism opportunities. ThinLTO exists for this purpose! ThinLTO is designed to allow cross-module inlining to happen and then perform optimization passes on each module in isolation. In other words, it requires a piece of serial work to calculate some inlining data structures, and then each module can be optimized in parallel. This has almost all of the benefits of "single CGU fat LTO" but critically can take advantage of all your CPU cores.

In any case, that's the motivation for ThinLTO. The impact on the compiler is that we have a coordination point for crates just after optimization and just before codegen when compiled in release mode. This is handled by the coordinator thread and is quite complicated today, unfortunately.

## Jobserver

Parallelism in a build tool is hard. Cargo will, for example, spawn `$NCPU` `rustc` processes in parallel. It would be a big bummer for each `rustc` to then spawn `$NCPU` threads, possibly creating `$NCPU * $NCPU` amounts of work on the system. In addition to overloading the system work-wise, it can also cause a lot of OOM situations, because that's a huge number of CGUs in memory.

Anyway, the "solution" to this is to use a [jobserver](https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html), which was invented by `make` and which we've lifted for use in Cargo. A "jobserver" is a glorified IPC semaphore. You add a bunch of tokens to it, share it with all your processes, and then whenever a process wants to do more work it acquires a token first. By placing N tokens in a jobserver you're guaranteed that no more than N processes will be running at a time.

The integration in rustc isn't too interesting here. The [`jobserver` crate](https://crates.io/crates/jobserver) has all the cross-platform details and is used by both `rustc` and Cargo. First `rustc` needs a [`Client`](https://docs.rs/jobserver/0.1.17/jobserver/struct.Client.html), which it obtains via the [`Client::from_env`](https://docs.rs/jobserver/0.1.17/jobserver/struct.Client.html#method.from_env) method. If that fails it just creates a local one with `$NCPU` tokens.

The weird thing about jobservers is that they're always blocking. Apparently across platforms and across `make` versions you just can't rely on nonblocking I/O. That's a bummer for our coordinator thread, because while it's blocked waiting for a jobserver token, other messages might come in. It turns out `make` literally relies on `EINTR` via `SIGCHLD` to wake up the blocking call for a jobserver token. It's weird. In any case we solve this with **another** helper thread via [`Client::into_helper_thread`](https://docs.rs/jobserver/0.1.17/jobserver/struct.Client.html#method.into_helper_thread). The jobserver helper thread works like so:

* Occasionally you call [`HelperThread::request_token`](https://docs.rs/jobserver/0.1.17/jobserver/struct.HelperThread.html#method.request_token). This causes the helper thread to attempt to read a token from the jobserver (blocking the helper thread).
* When a token is acquired, the helper thread invokes the callback passed to [`Client::into_helper_thread`](https://docs.rs/jobserver/0.1.17/jobserver/struct.Client.html#method.into_helper_thread). In rustc's case this sends the `Acquired` token on a channel back to the coordinator thread.
* The coordinator thread receives a token and then may let itself spawn more work.

Tokens are [stored locally in the coordinator thread](https://github.com/rust-lang/rust/blob/b7a9c285a50f3a94c44687ba9ff3ab0648243aaa/src/librustc_codegen_ssa/back/write.rs#L1224-L1226), are [then used to limit](https://github.com/rust-lang/rust/blob/b7a9c285a50f3a94c44687ba9ff3ab0648243aaa/src/librustc_codegen_ssa/back/write.rs#L1343-L1358) how much work is spun up if we have work to do, and are [immediately released](https://github.com/rust-lang/rust/blob/b7a9c285a50f3a94c44687ba9ff3ab0648243aaa/src/librustc_codegen_ssa/back/write.rs#L1360-L1361) if we have tokens but aren't using them for work. In general this strategy works out pretty well because the units of parallelism here are extremely coarse (whole LLVM modules), so excessive jobserver traffic doesn't really show up much.
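
As a rough, hypothetical illustration of that flow (not rustc's actual code, and assuming the 0.1.x `jobserver` API linked above), the helper-thread dance looks roughly like this:

```rust
use std::sync::mpsc;

fn main() -> std::io::Result<()> {
    // Reuse the jobserver inherited from Cargo/make, or fall back to a
    // private one. `from_env` is unsafe because it trusts inherited file
    // descriptors. (rustc queries the real CPU count; 4 is a stand-in.)
    let client = match unsafe { jobserver::Client::from_env() } {
        Some(client) => client,
        None => jobserver::Client::new(4)?,
    };

    // The helper thread owns the blocking reads; every token it acquires
    // is forwarded to the coordinator over a plain channel.
    let (tx, rx) = mpsc::channel();
    let helper = client.into_helper_thread(move |token| {
        let _ = tx.send(token);
    })?;

    // Coordinator side: whenever a CGU is queued, ask for a token, then
    // wait on the channel (in rustc, alongside all its other messages).
    helper.request_token();
    let token: jobserver::Acquired = rx.recv().unwrap()?;
    // ... spawn a worker holding `token` ...
    drop(token); // dropping an `Acquired` releases it back to the jobserver
    Ok(())
}
```

Dropping an `Acquired` token hands it back to the jobserver, which is what makes the "immediately released if unused" behavior above cheap to implement.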

## Limiting CGUs in-memory

The last thing that's a real gotcha with the parallel backend today is a feature added long ago in [#43506](https://github.com/rust-lang/rust/pull/43506). The jobserver is a coarse way to limit the amount of active memory on a system, but that only matters if each process doesn't itself allocate a ton of memory in a single-threaded context. Before [#43506](https://github.com/rust-lang/rust/pull/43506) rustc would, serially, translate all modules to LLVM and *then* process all modules in parallel. This meant that rustc had a massive peak memory spike during translation, where `TyCtxt` and every single LLVM module were all resident in memory at the same time.

The solution to this problem was basically to avoid all LLVM modules being resident in memory at the same time. The way to do that was to hook the main translation thread into the coordinator thread, and have the coordinator thread start/stop the main thread whenever it thinks there are enough LLVM modules in memory. Inside the coordinator thread loop [is a block that governs this](https://github.com/rust-lang/rust/blob/b7a9c285a50f3a94c44687ba9ff3ab0648243aaa/src/librustc_codegen_ssa/back/write.rs#L1253-L1275) as well as [a heuristic function](https://github.com/rust-lang/rust/blob/b7a9c285a50f3a94c44687ba9ff3ab0648243aaa/src/librustc_codegen_ssa/back/write.rs#L1518-L1526). This heuristic attempts to keep the number of active LLVM modules in memory at a minimum, stopping the main thread from translating new LLVM modules as appropriate and using the main thread's jobserver token to instead optimize/codegen a module.
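
In spirit (with entirely made-up field names and an invented threshold; the real heuristic lives in `write.rs` at the links above), the decision looks something like:

```rust
// A heavily simplified, hypothetical model of the coordinator's
// back-pressure decision; not rustc's actual bookkeeping.
struct Coordinator {
    translated_queue: usize, // LLVM modules translated, awaiting a worker
    running_workers: usize,  // workers currently optimizing/codegen'ing
    tokens: usize,           // jobserver tokens currently held
}

impl Coordinator {
    /// Let the main thread translate another CGU only while the queue of
    /// already-translated modules isn't deep enough to keep every token
    /// busy; otherwise memory piles up exactly as in #43506.
    fn main_thread_may_translate(&self) -> bool {
        self.translated_queue < self.tokens.saturating_sub(self.running_workers) + 1
    }

    /// When translation is paused, the main thread's implicit token is
    /// spent optimizing/codegen'ing a queued module instead of idling.
    fn main_thread_should_optimize(&self) -> bool {
        !self.main_thread_may_translate() && self.translated_queue > 0
    }
}
```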

## How parallel rustc changes translation

A parallel compiler will radically change our LLVM backend and (I believe) drastically simplify it. As soon as we can use `TyCtxt` in parallel, this lifts the limitation that only the main thread can produce LLVM modules, allowing us to remove the coordinator thread entirely and restructure things in a much more understandable way. It won't be "insanely simple", but for an optimized (ThinLTO) build we could start having driver code that looks like:

```rust
// serially partition the crate
let cgus = tcx.codegen_units();

// in parallel translate CGUs to LLVM and then optimize
let objects = cgus
    .par_iter()
    .map(|cgu| translate_to_llvm(cgu))
    .map(|cgu| optimize(cgu))
    .collect::<Vec<_>>();

// serially calculate ThinLTO data
let data = thin_lto::prepare(&objects);

// and then in parallel again do final optimizations and codegen
let objects = objects
    .into_par_iter()
    .map(|cgu| thinlto_optimize_and_codegen(&data, cgu))
    .collect::<Vec<_>>();

return objects;
```

One of the critical simplifications is that the jobserver specialization goes away, since it's baked into whatever rustc already has for "run this loop in parallel". Additionally the previous section, limiting CGUs in-memory, no longer needs special treatment. We don't run the risk of the main thread running away and translating all LLVM modules at once, because there is no main thread.

... as I write this, though, I realize that we don't necessarily drop `TyCtxt` eagerly here, and that may be important. Anyway, this is definitely something we can play with, but I think it will become much simpler, at least by removing the coordinator thread. (Because oh man, would it be nice to do that.)
