# ChaCha20 SIMD Usage
We want to make sure the `chacha_block` function of the ChaCha20 stream cipher is making full use of an architecture's [Single Instruction Multiple Data (SIMD)](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) instructions. Ideally LLVM automatically "vectorizes" the code without having to rely on processor-dependent "intrinsic" hints from the developer. With that said, the code might have to be massaged a bit in order for LLVM to identify a vectorizable function.
* [LDK ChaCha20 Implementation in v0.0.123](https://github.com/lightningdevkit/rust-lightning/blob/0.0.123/lightning/src/crypto/chacha20.rs)
## Naive Implementation with Rust v1.80.1
Initial tests were performed on the current stable version of `rustc`, which is `1.80.1`. The more modern toolchain is generally better for diagnosing asm output than the conservative MSRV version `1.63.0`, and hopefully any findings can be backported to the older Rust compiler without too much trouble.
The tests involve two compile time flags which let the compiler squeeze as many SIMD instructions as possible out of the code.
* `opt-level=3` -- Turning the optimization level to the max `3`, a.k.a. `cargo`'s `--release` profile. Requiring this flag doesn't feel like a huge ask for library users.
* `target-cpu=native` -- A general "tune the shit out of this code for this specific CPU". I imagine this comes at the cost of the resulting executable being less portable, but I haven't dug into the weight of the tradeoffs there. It likely is a larger ask for library users though.
The naive implementation of `chacha_block` is at least very well contained in the `State` struct which represents the state of the ChaCha20 cipher. This makes it easy to analyze with tools like `cargo-show-asm` to see exactly what the functions boil down to in machine code and whether LLVM is taking advantage of SIMD instructions.
```rust
/// The 512-bit cipher state is chunk'd up into 16 32-bit words.
///
/// The 16 words can be visualized as a 4x4 matrix:
///
/// 0 1 2 3
/// 4 5 6 7
/// 8 9 10 11
/// 12 13 14 15
#[derive(Clone, Copy, Debug)]
struct State([u32; 16]);
impl State {
/// New prepared state.
const fn new(key: SessionKey, nonce: Nonce, count: u32) -> Self {
let mut state: [u32; 16] =
[WORD_1, WORD_2, WORD_3, WORD_4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
// The next two rows (8 words) are based on the session key.
// Using a while loop here to keep the function constant.
let mut i = 0;
while i < 8 {
state[i + 4] = u32::from_le_bytes([
key.0[i * 4],
key.0[i * 4 + 1],
key.0[i * 4 + 2],
key.0[i * 4 + 3],
]);
i += 1;
}
// The final row is the count and the nonce.
state[12] = count;
state[13] = u32::from_le_bytes([nonce.0[0], nonce.0[1], nonce.0[2], nonce.0[3]]);
state[14] = u32::from_le_bytes([nonce.0[4], nonce.0[5], nonce.0[6], nonce.0[7]]);
state[15] = u32::from_le_bytes([nonce.0[8], nonce.0[9], nonce.0[10], nonce.0[11]]);
State(state)
}
/// Each "quarter" round of ChaCha scrambles 4 of the 16 words (where the quarter comes from)
/// that make up the state using some Addition (mod 2^32), Rotation, and XOR (a.k.a. ARX).
fn quarter_round(&mut self, a: usize, b: usize, c: usize, d: usize) {
self.0[a] = self.0[a].wrapping_add(self.0[b]);
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
self.0[c] = self.0[c].wrapping_add(self.0[d]);
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(12);
self.0[a] = self.0[a].wrapping_add(self.0[b]);
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(8);
self.0[c] = self.0[c].wrapping_add(self.0[d]);
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(7);
}
/// Transform the state by performing the ChaCha block function.
fn chacha_block(&mut self) {
let initial_state = self.0;
for _ in 0..10 {
for (a, b, c, d) in CHACHA_ROUND_INDICIES {
self.quarter_round(a, b, c, d);
}
}
for (modified, initial) in self.0.iter_mut().zip(initial_state.iter()) {
*modified = modified.wrapping_add(*initial)
}
}
/// Expose the 512-bit state as a byte stream.
fn keystream(&self) -> [u8; 64] {
let mut keystream: [u8; 64] = [0; 64];
for (k, s) in keystream.chunks_exact_mut(4).zip(self.0) {
k.copy_from_slice(&s.to_le_bytes());
}
keystream
}
}
```
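The snippet above references a few items which are defined elsewhere in the module. For context, here is a sketch of what they presumably look like; the struct shapes are inferred from usage, while the word constants are ChaCha's standard "expand 32-byte k" string and the round indices are the standard four column rounds followed by four diagonal rounds:

```rust
// Assumed definitions for the items referenced above. The struct shapes are
// guesses from usage; the constants are the standard ChaCha20 values.
struct SessionKey([u8; 32]);
struct Nonce([u8; 12]);

// "expa", "nd 3", "2-by", "te k" read as little-endian 32-bit words.
const WORD_1: u32 = 0x61707865;
const WORD_2: u32 = 0x3320646e;
const WORD_3: u32 = 0x79622d32;
const WORD_4: u32 = 0x6b206574;

// Four "column" rounds followed by four "diagonal" rounds make up one
// double round; `chacha_block` runs ten of these for the "20" in ChaCha20.
const CHACHA_ROUND_INDICIES: [(usize, usize, usize, usize); 8] = [
    (0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
    (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14),
];
```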
The following long output is the asm of the function generated with the `cargo-show-asm` tool: `cargo asm --package chacha20-poly1305 chacha20_poly1305::chacha20::State::chacha_block`. This uses the aggressive `release` build profile by default, a high optimization level, so the compiler should already be attempting to use as many SIMD instructions as possible. The `--rust` flag makes `cargo-show-asm` interleave little hints as to where you are in the Rust source.
Some findings!
* The `quarter_round` function is always being inlined by the compiler no matter the optimization level, so in order to analyze its machine code we need to add a `#[inline(never)]` just for testing purposes. For now, just letting it get inlined.
* The `chacha_block` function has some easy-to-spot for loops; the first nested pair maps to `.LBB0_1` (outer) and `.LBB0_2` (inner).
* `.LBB0_2` clearly contains all the `quarter_round` craziness, the function has definitely been inlined by the compiler.
* There are some SIMD instructions being used, like `movdqu` and `paddd`, but they are only used in the *third* (last) for loop of the `chacha_block` function. The inner quarter round operations are performed using scalar instructions (`add`, `xor`, `rol`).
* `movups`, `movaps`, `movdqu`, `movdqa` -- All ways to efficiently move around 128 bits in the special SIMD `XMM` registers.
* `paddd` -- Parallel addition of four 32-bit integers.
* These are all SSE (Streaming SIMD Extensions) instructions, which are a subset of all possible SIMD instructions. More modern AVX instructions are also possible, but I am not sure if there is any benefit in this scenario. AVX instructions operate on wider registers (256 and 512 bits).
* Even with both the `opt-level=3` and `target-cpu=native` flags enabled, the quarter round does not use SIMD instructions. Sad!
ASM command, no extra `RUSTFLAGS`: `cargo asm --lib --rust --package chacha20-poly1305 chacha20_poly1305::chacha20::State::chacha_block`
```asm
.section .text.chacha20_poly1305::chacha20::State::chacha_block,"ax",@progbits
.p2align 4, 0x90
.type chacha20_poly1305::chacha20::State::chacha_block,@function
chacha20_poly1305::chacha20::State::chacha_block:
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 103
fn chacha_block(&mut self) {
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
push r15
.cfi_def_cfa_offset 24
push r14
.cfi_def_cfa_offset 32
push r12
.cfi_def_cfa_offset 40
push rbx
.cfi_def_cfa_offset 48
sub rsp, 336
.cfi_def_cfa_offset 384
.cfi_offset rbx, -48
.cfi_offset r12, -40
.cfi_offset r14, -32
.cfi_offset r15, -24
.cfi_offset rbp, -16
mov rbx, rdi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 104
let initial_state = self.0;
movups xmm0, xmmword ptr [rdi]
movaps xmmword ptr [rsp + 48], xmm0
movups xmm0, xmmword ptr [rdi + 16]
movaps xmmword ptr [rsp + 32], xmm0
movups xmm0, xmmword ptr [rdi + 32]
movaps xmmword ptr [rsp + 16], xmm0
movups xmm0, xmmword ptr [rdi + 48]
movaps xmmword ptr [rsp], xmm0
xor ebp, ebp
lea r14, [rip + .L__unnamed_1]
lea r15, [rsp + 64]
mov r12, qword ptr [rip + memcpy@GOTPCREL]
.p2align 4, 0x90
.LBB0_1:
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/int_macros.rs : 2222
let (a, b) = intrinsics::add_with_overflow(self as $ActualT, rhs as $ActualT);
inc ebp
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 106
for (a, b, c, d) in CHACHA_ROUND_INDICIES {
mov edx, 256
mov rdi, r15
mov rsi, r14
call r12
mov qword ptr [rsp + 320], 0
mov qword ptr [rsp + 328], 8
mov edx, 24
.p2align 4, 0x90
.LBB0_2:
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ptr/mod.rs : 1325
crate::intrinsics::read_via_copy(src)
mov rax, qword ptr [rsp + rdx + 40]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 92
self.0[a] = self.0[a].wrapping_add(self.0[b]);
cmp rax, 15
ja .LBB0_9
mov rdi, qword ptr [rsp + rdx + 48]
cmp rdi, 16
jae .LBB0_10
mov rcx, qword ptr [rsp + rdx + 56]
mov r8, qword ptr [rsp + rdx + 64]
mov esi, dword ptr [rbx + 4*rdi]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
add esi, dword ptr [rbx + 4*rax]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 92
self.0[a] = self.0[a].wrapping_add(self.0[b]);
mov dword ptr [rbx + 4*rax], esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 93
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
cmp r8, 16
jae .LBB0_11
xor esi, dword ptr [rbx + 4*r8]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
rol esi, 16
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 93
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
mov dword ptr [rbx + 4*r8], esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 94
self.0[c] = self.0[c].wrapping_add(self.0[d]);
cmp rcx, 16
jae .LBB0_12
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
add esi, dword ptr [rbx + 4*rcx]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 94
self.0[c] = self.0[c].wrapping_add(self.0[d]);
mov dword ptr [rbx + 4*rcx], esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 95
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(12);
xor esi, dword ptr [rbx + 4*rdi]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
rol esi, 12
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 95
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(12);
mov dword ptr [rbx + 4*rdi], esi
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
add esi, dword ptr [rbx + 4*rax]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 96
self.0[a] = self.0[a].wrapping_add(self.0[b]);
mov dword ptr [rbx + 4*rax], esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 97
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(8);
xor esi, dword ptr [rbx + 4*r8]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
rol esi, 8
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 97
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(8);
mov dword ptr [rbx + 4*r8], esi
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
add esi, dword ptr [rbx + 4*rcx]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 98
self.0[c] = self.0[c].wrapping_add(self.0[d]);
mov dword ptr [rbx + 4*rcx], esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 99
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(7);
xor esi, dword ptr [rbx + 4*rdi]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
rol esi, 7
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 99
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(7);
mov dword ptr [rbx + 4*rdi], esi
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/index_range.rs : 119
if self.len() > 0 {
add rdx, 32
cmp rdx, 280
jne .LBB0_2
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/cmp.rs : 1565
fn lt(&self, other: &$t) -> bool { (*self) < (*other) }
cmp ebp, 10
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/iter/range.rs : 753
if self.start < self.end {
jne .LBB0_1
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
*modified = modified.wrapping_add(*initial)
movdqu xmm0, xmmword ptr [rbx]
movdqa xmm1, xmmword ptr [rsp + 48]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
paddd xmm1, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
*modified = modified.wrapping_add(*initial)
movdqu xmm0, xmmword ptr [rbx + 16]
movdqa xmm2, xmmword ptr [rsp + 32]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
paddd xmm2, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
*modified = modified.wrapping_add(*initial)
movdqu xmm0, xmmword ptr [rbx + 32]
movdqa xmm3, xmmword ptr [rsp + 16]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
paddd xmm3, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
*modified = modified.wrapping_add(*initial)
movdqu xmm0, xmmword ptr [rbx + 48]
movdqa xmm4, xmmword ptr [rsp]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
paddd xmm4, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
*modified = modified.wrapping_add(*initial)
movdqu xmmword ptr [rbx], xmm1
movdqu xmmword ptr [rbx + 16], xmm2
movdqu xmmword ptr [rbx + 32], xmm3
movdqu xmmword ptr [rbx + 48], xmm4
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 113
}
add rsp, 336
.cfi_def_cfa_offset 48
pop rbx
.cfi_def_cfa_offset 40
pop r12
.cfi_def_cfa_offset 32
pop r14
.cfi_def_cfa_offset 24
pop r15
.cfi_def_cfa_offset 16
pop rbp
.cfi_def_cfa_offset 8
ret
.LBB0_9:
.cfi_def_cfa_offset 384
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 92
self.0[a] = self.0[a].wrapping_add(self.0[b]);
lea rdx, [rip + .L__unnamed_2]
mov esi, 16
mov rdi, rax
call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
.LBB0_10:
lea rdx, [rip + .L__unnamed_3]
mov esi, 16
call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
.LBB0_11:
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 93
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
lea rdx, [rip + .L__unnamed_4]
mov esi, 16
mov rdi, r8
call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
.LBB0_12:
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 94
self.0[c] = self.0[c].wrapping_add(self.0[d]);
lea rdx, [rip + .L__unnamed_5]
mov esi, 16
mov rdi, rcx
call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
```
Adding the `target-cpu=native` compile flag enables a bit more SIMD usage on my laptop. It uses the 256-bit AVX2 instructions instead of the 128-bit SSE2 ones. But it still does not optimize the quarter round operations, so not a game changer.
ASM command: `RUSTFLAGS="-C opt-level=3 -C target-cpu=native" cargo asm --lib --rust --package chacha20-poly1305 chacha20_poly1305::chacha20::State::chacha_block`
## Type Modifications
Something needs to change with the `quarter_round` function so that LLVM knows it can use SIMD instructions in it. The bulk of the work happens in `quarter_round`, so for it to not be optimized is a big blow. But we don't want to depend on intrinsic hints (e.g. `#[target_feature(enable = "avx2")]`) since they are architecture specific.
One option which allows for more optimizations is to generate multiple blocks in parallel. The only difference between the state inputs would be the value of the block counter. This is a form of *Horizontal Vectorization*. Each 512-bit state representation can theoretically be operated on in parallel. And it nicely fits into some of the most modern SIMD `AVX-512` instructions, which I doubt is a coincidence.
```
// Horizontal Vectorization, each vector is operated on in parallel.
[a1, a2, a3, a4] -> a1 + a2 + a3 + a4
[b1, b2, b3, b4] -> b1 + b2 + b3 + b4
```
This does introduce quite a bit more bookkeeping complexity: keeping track of the count and making sure generated blocks are used in order. Any misstep and the train goes off the rails. It has the benefit of being kind of a "wrapper" around the current `State` implementation, but with that said, I am not quite sure what other "tips" I would need to give LLVM in order to leverage this. Going to focus on lower hanging fruit.
And that fruit is *Vertical Vectorization*.
```
// Vertical Vectorization, operates on data in multiple vectors in parallel.
[a1, a2, a3, a4] + [b1, b2, b3, b4] = [a1+b1, a2+b2, a3+b3, a4+b4]
```
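In Rust terms, vertical vectorization is just lane-wise operations over fixed-size arrays. A minimal sketch (the function name is made up, not from the implementation) of the addition pictured above, which LLVM can usually compile down to a single `paddd`-style instruction when optimizations are on:

```rust
// Lane-wise wrapping addition over two 4-lane vectors. Each lane is
// independent, which is exactly the shape SIMD hardware wants.
fn add_lanes(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    let mut out = [0u32; 4];
    for i in 0..4 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}
```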
When looking at ChaCha's quarter round function, the simple *scalar* approach operates on 4 words (32 bits each) at a time, which is 1/4 of the total state. A quick recap: the cipher's state is broken up into 16 words. When the state is updated (a.k.a. a block is created), the quarter round function is first run on the words with indexes `[0,4,8,12]`. Looking at the ascii art below, you can visualize this as a "column" of the state.
```
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
```
Next, a quarter round is run on `[1,5,9,13]`, then `[2,6,10,14]`, and finally `[3,7,11,15]`. All the columns! It then runs some quarter rounds across some "diagonals", but before going there, notice that the column quarter rounds each act on an independent chunk of data. So, while the scalar quarter round implementation acts on one "column" at a time, the vectorized approach will take all the "rows" and apply the quarter round to each "lane" (SIMD word for column) in parallel.
```
[ 0 | 1 | 2 | 3 ] // U32x4
[ 4 | 5 | 6 | 7 ] // U32x4
[ 8 | 9 | 10 | 11 ] // U32x4
[ 12 |13 | 14 | 15 ] // U32x4
```
So each row is a vector of four 32-bit words. Not by chance, this vector size fits perfectly into a lot of SIMD instructions. The quarter round operations are still applied to the columns (e.g. `[0,4,8,12]`), but this can be done in parallel with the other columns by leveraging the SIMD lanes. We just have to organize our data in a way that lets the compiler pick up on the fact that it can do this. And as we saw from above, tossing a bunch of tuples at it ain't good enough.
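To convince ourselves the four column quarter rounds really are independent, here is a sketch (the helper names are made up, not from the implementation) of the quarter round applied lane-wise to the four row vectors. Running it on the rows should produce the same state as running the scalar quarter round on each column in turn:

```rust
// Lane-wise helpers over a 4-lane row of the state.
fn add(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    core::array::from_fn(|i| a[i].wrapping_add(b[i]))
}
fn xor(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    core::array::from_fn(|i| a[i] ^ b[i])
}
fn rotl(v: [u32; 4], n: u32) -> [u32; 4] {
    core::array::from_fn(|i| v[i].rotate_left(n))
}

// One ChaCha quarter round applied to all four columns at once. Row `a`
// holds words 0..4, row `b` words 4..8, and so on, so lane `i` across the
// four rows is exactly column `i` of the state.
fn column_quarter_rounds(
    mut a: [u32; 4], mut b: [u32; 4], mut c: [u32; 4], mut d: [u32; 4],
) -> ([u32; 4], [u32; 4], [u32; 4], [u32; 4]) {
    a = add(a, b); d = rotl(xor(d, a), 16);
    c = add(c, d); b = rotl(xor(b, c), 12);
    a = add(a, b); d = rotl(xor(d, a), 8);
    c = add(c, d); b = rotl(xor(b, c), 7);
    (a, b, c, d)
}
```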
The `std` library actually has some examples of these in the appropriately named [`std::simd`](https://doc.rust-lang.org/std/simd/index.html) module: [`u32x4`](https://doc.rust-lang.org/std/simd/type.u32x4.html) and [`u32x16`](https://doc.rust-lang.org/std/simd/type.u32x16.html). These are also available in `core`, which is good news for no-std support. However, the interface is still experimental and feature-gated on [86656](https://github.com/rust-lang/rust/issues/86656), and the `portable_simd` feature is not available in Rust 1.63.0 (MSRV). Bummer! But we can experiment with types similar to the std library's, which is also what LDK did in their implementation.
Here is the new fancy `U32x4` type, which matches the std library's SIMD interface for easy migration in the future but implements only the bare minimum needed for ChaCha.
```rust
use core::ops::BitXor;

#[repr(C, align(16))]
#[derive(Clone, Copy, Debug, PartialEq)]
struct U32x4([u32; 4]);
impl U32x4 {
#[inline(always)]
fn wrapping_add(self, rhs: Self) -> Self {
let mut result = [0u32; 4];
(0..4).for_each(|i| {
result[i] = self.0[i].wrapping_add(rhs.0[i]);
});
U32x4(result)
}
#[inline(always)]
fn rotate_left(self, n: u32) -> Self {
let mut result = [0u32; 4];
(0..4).for_each(|i| {
result[i] = self.0[i].rotate_left(n);
});
U32x4(result)
}
#[inline(always)]
fn rotate_elements_left<const N: u32>(self) -> Self {
let mut result = [0u32; 4];
(0..4).for_each(|i| {
result[i] = self.0[(i + N as usize) % 4];
});
U32x4(result)
}
#[inline(always)]
fn rotate_elements_right<const N: u32>(self) -> Self {
let mut result = [0u32; 4];
(0..4).for_each(|i| {
result[i] = self.0[(i + 4 - N as usize) % 4];
});
U32x4(result)
}
#[inline(always)]
fn to_le_bytes(self) -> [u8; 16] {
let mut bytes = [0u8; 16];
bytes[0..4].copy_from_slice(&self.0[0].to_le_bytes());
bytes[4..8].copy_from_slice(&self.0[1].to_le_bytes());
bytes[8..12].copy_from_slice(&self.0[2].to_le_bytes());
bytes[12..16].copy_from_slice(&self.0[3].to_le_bytes());
bytes
}
}
impl BitXor for U32x4 {
type Output = Self;
#[inline(always)]
fn bitxor(self, rhs: Self) -> Self {
let mut result = [0u32; 4];
(0..4).for_each(|i| {
result[i] = self.0[i] ^ rhs.0[i];
});
U32x4(result)
}
}
```
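The `rotate_elements_left`/`rotate_elements_right` methods exist for the diagonal rounds: rotating row `b` left by one lane, `c` by two, and `d` by three lines the diagonals up as columns, so the same lane-parallel quarter round can be reused before the rows are rotated back. A sketch of that shuffle (the `diagonalize` name is made up; the rotation semantics mirror `U32x4::rotate_elements_left` above, where element `0` takes element `N`'s value):

```rust
// Rotate a row's elements left by n lanes, mirroring the behavior of
// U32x4::rotate_elements_left in the type above.
fn rotate_elements_left(v: [u32; 4], n: usize) -> [u32; 4] {
    core::array::from_fn(|i| v[(i + n) % 4])
}

// Shuffle rows b, c, and d so each SIMD lane holds one diagonal of the
// state; the column quarter round then doubles as the diagonal round.
fn diagonalize(
    b: [u32; 4], c: [u32; 4], d: [u32; 4],
) -> ([u32; 4], [u32; 4], [u32; 4]) {
    (
        rotate_elements_left(b, 1),
        rotate_elements_left(c, 2),
        rotate_elements_left(d, 3),
    )
}
```

After the quarter round, the inverse shuffle (right rotations by the same lane counts) restores the row layout.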
## SIMD Results
After a few small tweaks to how data flows from the top of the `chacha_block` function, here are the results with the new `U32x4` type usage.
Without the `opt-level=3` or `target-cpu=native` flags, the code is entirely scalar. It is not identical to the naive version above, so maybe the compiler is making some other optimizations, but they are not SIMD related. With `opt-level=3`, there is still no SIMD usage, which I find disappointing but perhaps not a deal-breaker. With the `target-cpu=native` flag, though, the code is entirely SIMD optimized, including the quarter round. There is rampant use of the `xmm*` registers and SIMD instructions throughout.
ASM command: `RUSTFLAGS="-C opt-level=3 -C target-cpu=native" cargo asm --lib --rust --package chacha20-poly1305 chacha20_poly1305::chacha20::State::chacha_block`
```asm
.section .text.chacha20_poly1305::chacha20::State::chacha_block,"ax",@progbits
.p2align 4, 0x90
.type chacha20_poly1305::chacha20::State::chacha_block,@function
chacha20_poly1305::chacha20::State::chacha_block:
.cfi_startproc
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 224
let mut a = self.a;
mov esi, dword ptr [rdi]
vmovdqu xmm2, xmmword ptr [rdi + 4]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 225
let mut b = self.b;
mov r10d, dword ptr [rdi + 32]
vmovdqu xmm1, xmmword ptr [rdi + 20]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 226
let mut c = self.c;
vmovdqu xmm0, xmmword ptr [rdi + 36]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 227
let mut d = self.d;
mov edx, dword ptr [rdi + 52]
mov ecx, dword ptr [rdi + 56]
mov eax, dword ptr [rdi + 60]
vpshufd xmm3, xmm1, 198
vpshufd xmm4, xmm2, 161
vpinsrd xmm8, xmm4, esi, 2
vpshufd xmm4, xmm0, 24
vpinsrd xmm10, xmm4, edx, 0
vpshufd xmm4, xmm0, 255
vpinsrd xmm11, xmm4, eax, 1
vpblendd xmm7, xmm3, xmm2, 8
mov r8d, 10
vmovdqa xmm3, xmmword ptr [rip + .LCPI0_0]
vmovdqa xmm4, xmmword ptr [rip + .LCPI0_1]
vmovdqa xmm5, xmmword ptr [rip + .LCPI0_2]
vmovdqa xmm6, xmmword ptr [rip + .LCPI0_3]
mov r9d, ecx
.p2align 4, 0x90
.LBB0_1:
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpshufd xmm9, xmm7, 57
vpaddd xmm8, xmm9, xmm8
vpinsrd xmm9, xmm10, r10d, 0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
vpshufd xmm12, xmm8, 78
vpbroadcastd xmm10, xmm10
vpblendd xmm10, xmm12, xmm10, 8
vpunpcklqdq xmm11, xmm11, xmm8
vpinsrd xmm11, xmm11, r9d, 2
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpshufb xmm11, xmm11, xmm3
vpshufb xmm10, xmm10, xmm3
vpxor xmm10, xmm10, xmm11
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm9, xmm10, xmm9
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
vpshufd xmm11, xmm9, 57
vpxor xmm7, xmm11, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpsrld xmm11, xmm7, 20
vpslld xmm7, xmm7, 12
vpor xmm7, xmm11, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpshufd xmm11, xmm7, 57
vpaddd xmm8, xmm11, xmm8
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpshufb xmm11, xmm8, xmm4
vpshufb xmm10, xmm10, xmm5
vpxor xmm10, xmm11, xmm10
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm9, xmm10, xmm9
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
vpshufd xmm11, xmm9, 57
vpxor xmm7, xmm11, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpsrld xmm11, xmm7, 25
vpslld xmm7, xmm7, 7
vpor xmm7, xmm11, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm8, xmm8, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpshufb xmm11, xmm8, xmm3
vpshufb xmm10, xmm10, xmm6
vpxor xmm10, xmm11, xmm10
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm11, xmm10, xmm9
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
vpxor xmm7, xmm11, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpsrld xmm9, xmm7, 20
vpslld xmm7, xmm7, 12
vpor xmm7, xmm9, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm8, xmm8, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpshufb xmm9, xmm8, xmm5
vpshufb xmm10, xmm10, xmm5
vpxor xmm9, xmm9, xmm10
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm12, xmm9, xmm11
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
vpxor xmm7, xmm12, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpsrld xmm10, xmm7, 25
vpslld xmm7, xmm7, 7
vpor xmm7, xmm10, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/iter/range.rs : 753
if self.start < self.end {
vmovd r10d, xmm12
vpextrd r9d, xmm9, 3
vpblendd xmm10, xmm12, xmm9, 1
vpshufd xmm11, xmm9, 233
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/cmp.rs : 1565
fn lt(&self, other: &$t) -> bool { (*self) < (*other) }
dec r8d
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/iter/range.rs : 753
if self.start < self.end {
jne .LBB0_1
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpextrd r8d, xmm8, 2
add r8d, esi
vpshufd xmm3, xmm8, 241
vpblendd xmm3, xmm3, xmm7, 8
vpaddd xmm2, xmm3, xmm2
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 234
self.a = a.wrapping_add(self.a);
mov dword ptr [rdi], r8d
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpbroadcastd xmm3, xmm12
vpshufd xmm4, xmm7, 198
vpblendd xmm3, xmm4, xmm3, 8
vpaddd xmm1, xmm3, xmm1
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 234
self.a = a.wrapping_add(self.a);
vmovdqu xmmword ptr [rdi + 4], xmm2
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpbroadcastq xmm2, xmm9
vpshufd xmm3, xmm12, 219
vpblendd xmm2, xmm3, xmm2, 8
vpaddd xmm0, xmm2, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 235
self.b = b.wrapping_add(self.b);
vmovdqu xmmword ptr [rdi + 20], xmm1
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vmovd esi, xmm9
add esi, edx
add r9d, ecx
vpextrd ecx, xmm9, 2
add ecx, eax
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 236
self.c = c.wrapping_add(self.c);
vmovdqu xmmword ptr [rdi + 36], xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 237
self.d = d.wrapping_add(self.d);
mov dword ptr [rdi + 52], esi
mov dword ptr [rdi + 56], r9d
mov dword ptr [rdi + 60], ecx
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 238
}
ret
```
Is it unusual for a library to require the `target-cpu` flag to get SIMD instructions? Using LDK's ChaCha20 implementation as a benchmark, it does not use SIMD instructions without at least `opt-level=3`, and its quarter round is not vectorized without `target-cpu`. So perhaps the `target-cpu` flag is a necessity for now, until the usage of Rust's native SIMD types is possible.
## Rust 1.63.0 MSRV
With fully optimized ASM output, the `1.63.0` version of `rustc` partially uses SIMD instructions with the new code. It is better than the initial naive version on `1.80.1`, but the quarter round does not use SIMD and generally more instructions are emitted.
ASM command: `RUSTFLAGS="-C opt-level=3 -C target-cpu=native" cargo asm --lib --rust --package chacha20-poly1305 chacha20_poly1305::chacha20::State::chacha_block`
```asm
.section .text.chacha20_poly1305::chacha20::State::chacha_block,"ax",@progbits
.p2align 4, 0x90
.type chacha20_poly1305::chacha20::State::chacha_block,@function
chacha20_poly1305::chacha20::State::chacha_block:
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 221
fn chacha_block(&mut self) {
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
push r15
.cfi_def_cfa_offset 24
push r14
.cfi_def_cfa_offset 32
push r13
.cfi_def_cfa_offset 40
push r12
.cfi_def_cfa_offset 48
push rbx
.cfi_def_cfa_offset 56
sub rsp, 48
.cfi_def_cfa_offset 104
.cfi_offset rbx, -56
.cfi_offset r12, -48
.cfi_offset r13, -40
.cfi_offset r14, -32
.cfi_offset r15, -24
.cfi_offset rbp, -16
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 224
let mut a = self.a;
mov eax, dword ptr [rdi]
mov dword ptr [rsp + 16], eax
vmovdqu xmm0, xmmword ptr [rdi + 4]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 225
let mut b = self.b;
vmovdqu xmm1, xmmword ptr [rdi + 20]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 226
let mut c = self.c;
vmovdqu xmm2, xmmword ptr [rdi + 36]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 227
let mut d = self.d;
mov r8d, dword ptr [rdi + 52]
mov r9d, dword ptr [rdi + 56]
mov qword ptr [rsp + 40], rdi
mov ecx, dword ptr [rdi + 60]
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/iter/range.rs : 621
if self.start < self.end {
vpextrd ebp, xmm0, 1
vpextrd r10d, xmm0, 2
vmovd eax, xmm0
vpextrd esi, xmm0, 3
vmovd r14d, xmm1
vpextrd edi, xmm1, 1
vpextrd edx, xmm1, 2
vpextrd r12d, xmm1, 3
vpextrd r15d, xmm2, 1
vmovd r11d, xmm2
vpextrd r13d, xmm2, 2
vpextrd ebx, xmm2, 3
mov dword ptr [rsp + 32], 10
mov dword ptr [rsp + 28], ecx
mov dword ptr [rsp + 4], ecx
mov dword ptr [rsp + 24], r9d
mov dword ptr [rsp], r9d
mov r9d, edi
mov dword ptr [rsp + 20], r8d
mov edi, r8d
mov ecx, dword ptr [rsp + 16]
.p2align 4, 0x90
.LBB14_1:
mov dword ptr [rsp + 12], r9d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
add ecx, esi
add eax, r14d
add ebp, r9d
add r10d, edx
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
xor ebx, ecx
xor edi, eax
mov dword ptr [rsp + 8], edi
mov r9d, dword ptr [rsp]
xor r9d, ebp
mov edi, dword ptr [rsp + 4]
xor edi, r10d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 212
intrinsics::rotate_left(self, n as $SelfT)
rorx r8d, ebx, 16
rorx ebx, dword ptr [rsp + 8], 16
mov dword ptr [rsp], ebx
rorx r9d, r9d, 16
rorx edi, edi, 16
mov dword ptr [rsp + 4], edi
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
add r12d, r8d
add r11d, ebx
add r15d, r9d
add r13d, edi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
xor esi, r12d
xor r14d, r11d
mov ebx, dword ptr [rsp + 12]
xor ebx, r15d
xor edx, r13d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 212
intrinsics::rotate_left(self, n as $SelfT)
rorx edi, esi, 20
rorx esi, r14d, 20
mov dword ptr [rsp + 36], esi
rorx ebx, ebx, 20
rorx edx, edx, 20
mov dword ptr [rsp + 8], edx
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
add ecx, edi
mov r14d, edi
add eax, esi
add ebp, ebx
add r10d, edx
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
xor r8d, ecx
mov edi, dword ptr [rsp]
xor edi, eax
xor r9d, ebp
mov esi, dword ptr [rsp + 4]
xor esi, r10d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 212
intrinsics::rotate_left(self, n as $SelfT)
rorx edx, r8d, 24
mov dword ptr [rsp], edx
rorx r8d, edi, 24
mov dword ptr [rsp + 12], r8d
rorx r9d, r9d, 24
mov dword ptr [rsp + 4], r9d
rorx edi, esi, 24
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
add r12d, edx
add r11d, r8d
add r15d, r9d
add r13d, edi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
xor r14d, r12d
mov esi, dword ptr [rsp + 36]
xor esi, r11d
xor ebx, r15d
mov edx, dword ptr [rsp + 8]
xor edx, r13d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 212
intrinsics::rotate_left(self, n as $SelfT)
rorx r14d, r14d, 25
mov dword ptr [rsp + 8], r14d
rorx r9d, esi, 25
rorx ebx, ebx, 25
rorx r8d, edx, 25
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
add ecx, r9d
add eax, ebx
add ebp, r8d
add r10d, r14d
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
xor edi, ecx
mov r14d, dword ptr [rsp]
xor r14d, eax
mov edx, dword ptr [rsp + 12]
xor edx, ebp
mov esi, dword ptr [rsp + 4]
xor esi, r10d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 212
intrinsics::rotate_left(self, n as $SelfT)
rorx edi, edi, 16
mov dword ptr [rsp + 4], edi
rorx r14d, r14d, 16
mov dword ptr [rsp], r14d
rorx edx, edx, 16
rorx esi, esi, 16
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
add r15d, edi
add r13d, r14d
add r12d, edx
add r11d, esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
xor r9d, r15d
xor ebx, r13d
xor r8d, r12d
mov edi, dword ptr [rsp + 8]
xor edi, r11d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 212
intrinsics::rotate_left(self, n as $SelfT)
rorx r14d, r9d, 20
rorx r9d, ebx, 20
rorx ebx, r8d, 20
mov dword ptr [rsp + 8], ebx
rorx edi, edi, 20
mov dword ptr [rsp + 12], edi
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
add ecx, r14d
add eax, r9d
mov r8d, r9d
add ebp, ebx
add r10d, edi
mov ebx, dword ptr [rsp + 4]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
xor ebx, ecx
mov edi, dword ptr [rsp]
xor edi, eax
xor edx, ebp
xor esi, r10d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 212
intrinsics::rotate_left(self, n as $SelfT)
rorx r9d, ebx, 24
rorx ebx, edi, 24
rorx edi, edx, 24
rorx edx, esi, 24
mov dword ptr [rsp + 4], r9d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
add r15d, r9d
add r13d, ebx
add r12d, edi
mov dword ptr [rsp], edx
add r11d, edx
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
xor r14d, r15d
xor r8d, r13d
mov edx, dword ptr [rsp + 8]
xor edx, r12d
mov esi, dword ptr [rsp + 12]
xor esi, r11d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 212
intrinsics::rotate_left(self, n as $SelfT)
rorx r14d, r14d, 25
rorx r9d, r8d, 25
rorx edx, edx, 25
rorx esi, esi, 25
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/cmp.rs : 1398
fn lt(&self, other: &$t) -> bool { (*self) < (*other) }
dec dword ptr [rsp + 32]
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/iter/range.rs : 621
if self.start < self.end {
jne .LBB14_1
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
vmovd xmm3, eax
vpinsrd xmm3, xmm3, ebp, 1
vpinsrd xmm3, xmm3, r10d, 2
vpinsrd xmm3, xmm3, esi, 3
vmovd xmm4, r14d
vpinsrd xmm4, xmm4, r9d, 1
vpinsrd xmm4, xmm4, edx, 2
vpinsrd xmm4, xmm4, r12d, 3
add ecx, dword ptr [rsp + 16]
vmovd xmm5, r11d
vpinsrd xmm5, xmm5, r15d, 1
vpinsrd xmm5, xmm5, r13d, 2
mov rax, qword ptr [rsp + 40]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 234
self.a = a.wrapping_add(self.a);
mov dword ptr [rax], ecx
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
vpinsrd xmm5, xmm5, ebx, 3
add edi, dword ptr [rsp + 20]
mov edx, dword ptr [rsp]
add edx, dword ptr [rsp + 24]
mov ecx, dword ptr [rsp + 4]
add ecx, dword ptr [rsp + 28]
vpaddd xmm0, xmm3, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 234
self.a = a.wrapping_add(self.a);
vmovdqu xmmword ptr [rax + 4], xmm0
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
vpaddd xmm0, xmm4, xmm1
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 235
self.b = b.wrapping_add(self.b);
vmovdqu xmmword ptr [rax + 20], xmm0
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
vpaddd xmm0, xmm5, xmm2
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 236
self.c = c.wrapping_add(self.c);
vmovdqu xmmword ptr [rax + 36], xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 237
self.d = d.wrapping_add(self.d);
mov dword ptr [rax + 52], edi
mov dword ptr [rax + 56], edx
mov dword ptr [rax + 60], ecx
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 238
}
add rsp, 48
.cfi_def_cfa_offset 56
pop rbx
.cfi_def_cfa_offset 48
pop r12
.cfi_def_cfa_offset 40
pop r13
.cfi_def_cfa_offset 32
pop r14
.cfi_def_cfa_offset 24
pop r15
.cfi_def_cfa_offset 16
pop rbp
.cfi_def_cfa_offset 8
ret
```
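A few things stand out in the listing. The hot loop (`.LBB14_1`) runs entirely on scalar general-purpose registers with `add`/`xor`/`rorx` instructions; the `xmm` vector registers appear only when loading the state up front and performing the final `vpaddd` wrapping adds. In other words, LLVM is not vectorizing the rounds themselves. Note also that the `rorx` rotate-right counts `16`/`20`/`24`/`25` are just the ChaCha20 left-rotations `16`/`12`/`8`/`7` expressed as right-rotations. For reference, below is a minimal sketch of the scalar quarter-round (per RFC 8439) that each `add`/`xor`/`rorx` triple in the loop corresponds to; the function name and index-based signature here are illustrative, not the crate's actual API.

```rust
/// Illustrative ChaCha20 quarter-round (RFC 8439, section 2.1) over a flat
/// 16-word state. Each add/xor/rotate line maps onto one of the
/// wrapping_add / xor / rotate_left triples visible in the disassembly above
/// (rotate_left(16/12/8/7) shows up as rorx by 16/20/24/25).
fn quarter_round(state: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize) {
    state[a] = state[a].wrapping_add(state[b]);
    state[d] = (state[d] ^ state[a]).rotate_left(16);

    state[c] = state[c].wrapping_add(state[d]);
    state[b] = (state[b] ^ state[c]).rotate_left(12);

    state[a] = state[a].wrapping_add(state[b]);
    state[d] = (state[d] ^ state[a]).rotate_left(8);

    state[c] = state[c].wrapping_add(state[d]);
    state[b] = (state[b] ^ state[c]).rotate_left(7);
}

fn main() {
    // RFC 8439 section 2.1.1 quarter-round test vector.
    let mut s = [0u32; 16];
    s[0] = 0x11111111;
    s[1] = 0x01020304;
    s[2] = 0x9b8d6f43;
    s[3] = 0x01234567;
    quarter_round(&mut s, 0, 1, 2, 3);
    assert_eq!(s[0], 0xea2a92f4);
    assert_eq!(s[1], 0xcb1cf8ce);
    assert_eq!(s[2], 0x4581472e);
    assert_eq!(s[3], 0x5881c4bb);
}
```

Because each quarter-round only touches four of the sixteen words, and a double round applies it to four disjoint column (then diagonal) groups, the four applications are independent and in principle vectorizable; the listing above shows LLVM instead serializing them through general-purpose registers.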
## Is the struct needed?