# ChaCha20 SIMD Usage
We want to make sure the `chacha_block` function of the ChaCha20 stream cipher is making full use of an architecture's [Single Instruction Multiple Data (SIMD)](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) instructions. Ideally LLVM automatically "vectorizes" the code without having to rely on processor-dependent "intrinsic" hints from the developer. With that said, the code might have to be massaged a bit in order for LLVM to identify a vectorizable function.
* [LDK ChaCha20 Implementation in v0.0.123](https://github.com/lightningdevkit/rust-lightning/blob/0.0.123/lightning/src/crypto/chacha20.rs)
## Naive Implementation with Rust v1.80.1
Initial tests were performed on the current stable version of `rustc`, which is `1.80.1`. The more modern toolchain is generally better for diagnosing asm output than the conservative MSRV version `1.63.0`, and hopefully any findings can be backported to the older Rust compiler without too much trouble.
The tests involve two compile time flags which let the compiler squeeze as many SIMD instructions as possible out of the code.
* `opt-level=3` -- Turning the optimization level to the max `3`, a.k.a. `cargo`'s `--release` profile. Requiring this flag doesn't feel like a huge ask for library users.
* `target-cpu=native` -- A general "tune the shit out of this code for this specific CPU". I imagine this comes at the cost of the resulting executable being less portable, but I haven't dug into the weight of the tradeoffs there. It likely is a larger ask for library users though.
The naive implementation of `chacha_block` is at least very well contained in the `State` struct which represents the state of the ChaCha20 cipher. This makes it easy to analyze with tools like `cargo-show-asm` to see exactly what the functions boil down to in machine code and whether LLVM is taking advantage of SIMD instructions.
```rust
/// The 512-bit cipher state is chunk'd up into 16 32-bit words.
///
/// The 16 words can be visualized as a 4x4 matrix:
///
/// 0 1 2 3
/// 4 5 6 7
/// 8 9 10 11
/// 12 13 14 15
#[derive(Clone, Copy, Debug)]
struct State([u32; 16]);
impl State {
/// New prepared state.
const fn new(key: SessionKey, nonce: Nonce, count: u32) -> Self {
let mut state: [u32; 16] =
[WORD_1, WORD_2, WORD_3, WORD_4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
// The next two rows (8 words) are based on the session key.
// Using a while loop here to keep the function constant.
let mut i = 0;
while i < 8 {
state[i + 4] = u32::from_le_bytes([
key.0[i * 4],
key.0[i * 4 + 1],
key.0[i * 4 + 2],
key.0[i * 4 + 3],
]);
i += 1;
}
// The final row is the count and the nonce.
state[12] = count;
state[13] = u32::from_le_bytes([nonce.0[0], nonce.0[1], nonce.0[2], nonce.0[3]]);
state[14] = u32::from_le_bytes([nonce.0[4], nonce.0[5], nonce.0[6], nonce.0[7]]);
state[15] = u32::from_le_bytes([nonce.0[8], nonce.0[9], nonce.0[10], nonce.0[11]]);
State(state)
}
/// Each "quarter" round of ChaCha scrambles 4 of the 16 words (where the quarter comes from)
/// that make up the state using some Addition (mod 2^32), Rotation, and XOR (a.k.a. ARX).
fn quarter_round(&mut self, a: usize, b: usize, c: usize, d: usize) {
self.0[a] = self.0[a].wrapping_add(self.0[b]);
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
self.0[c] = self.0[c].wrapping_add(self.0[d]);
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(12);
self.0[a] = self.0[a].wrapping_add(self.0[b]);
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(8);
self.0[c] = self.0[c].wrapping_add(self.0[d]);
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(7);
}
/// Transform the state by performing the ChaCha block function.
fn chacha_block(&mut self) {
let initial_state = self.0;
for _ in 0..10 {
for (a, b, c, d) in CHACHA_ROUND_INDICIES {
self.quarter_round(a, b, c, d);
}
}
for (modified, initial) in self.0.iter_mut().zip(initial_state.iter()) {
*modified = modified.wrapping_add(*initial)
}
}
/// Expose the 512-bit state as a byte stream.
fn keystream(&self) -> [u8; 64] {
let mut keystream: [u8; 64] = [0; 64];
for (k, s) in keystream.chunks_exact_mut(4).zip(self.0) {
k.copy_from_slice(&s.to_le_bytes());
}
keystream
}
}
```
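The snippet above references a few items which are defined elsewhere in the module. For context, here is a sketch of what they presumably look like; the struct shapes are inferred from usage, while the word constants are ChaCha's standard "expand 32-byte k" string and the round indices are the standard four column rounds followed by four diagonal rounds:

```rust
// Assumed definitions for the items referenced above. The struct shapes are
// guesses from usage; the constants are the standard ChaCha20 values.
struct SessionKey([u8; 32]);
struct Nonce([u8; 12]);

// "expa", "nd 3", "2-by", "te k" read as little-endian 32-bit words.
const WORD_1: u32 = 0x61707865;
const WORD_2: u32 = 0x3320646e;
const WORD_3: u32 = 0x79622d32;
const WORD_4: u32 = 0x6b206574;

// Four "column" rounds followed by four "diagonal" rounds make up one
// double round; `chacha_block` runs ten of these for the "20" in ChaCha20.
const CHACHA_ROUND_INDICIES: [(usize, usize, usize, usize); 8] = [
    (0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
    (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14),
];
```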
The following long output is the asm of the function generated with the `cargo-show-asm` tool: `cargo asm --package chacha20-poly1305 chacha20_poly1305::chacha20::State::chacha_block`. This uses the aggressive `release` build profile by default, a high optimization level, so the compiler should already be attempting to use as many SIMD instructions as possible. The `--rust` flag makes `cargo-show-asm` interleave little hints as to where you are in the Rust source.
Some findings!
* The `quarter_round` function is always being inlined by the compiler no matter the optimization level, so in order to analyze its machine code we need to add a `#[inline(never)]` just for testing purposes. For now, just letting it get inlined.
* The `chacha_block` function has some easy-to-spot for loops; the first nested pair maps to `.LBB0_1` (outer) and `.LBB0_2` (inner).
* `.LBB0_2` clearly contains all the `quarter_round` craziness, the function has definitely been inlined by the compiler.
* There are some SIMD instructions being used, like `movdqu` and `paddd`, but they are only used in the *third* (last) for loop of the `chacha_block` function. The inner quarter round operations are performed using scalar instructions (`add`, `xor`, `rol`).
* `movups`, `movaps`, `movdqu`, `movdqa` -- All ways to efficiently move around 128 bits in the special SIMD `XMM` registers.
* `paddd` -- Parallel addition of four 32-bit integers.
* These are all SSE (Streaming SIMD Extensions) instructions, which are a subset of all possible SIMD instructions. More modern AVX instructions are also possible, but I am not sure if there is any benefit in this scenario. AVX instructions operate on wider registers (256 and 512 bits).
* Even with both the `opt-level=3` and `target-cpu=native` flags enabled, the quarter round does not use SIMD instructions. Sad!
ASM command, no extra `RUSTFLAGS`: `cargo asm --lib --rust --package chacha20-poly1305 chacha20_poly1305::chacha20::State::chacha_block`
```asm
.section .text.chacha20_poly1305::chacha20::State::chacha_block,"ax",@progbits
.p2align 4, 0x90
.type chacha20_poly1305::chacha20::State::chacha_block,@function
chacha20_poly1305::chacha20::State::chacha_block:
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 103
fn chacha_block(&mut self) {
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
push r15
.cfi_def_cfa_offset 24
push r14
.cfi_def_cfa_offset 32
push r12
.cfi_def_cfa_offset 40
push rbx
.cfi_def_cfa_offset 48
sub rsp, 336
.cfi_def_cfa_offset 384
.cfi_offset rbx, -48
.cfi_offset r12, -40
.cfi_offset r14, -32
.cfi_offset r15, -24
.cfi_offset rbp, -16
mov rbx, rdi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 104
let initial_state = self.0;
movups xmm0, xmmword ptr [rdi]
movaps xmmword ptr [rsp + 48], xmm0
movups xmm0, xmmword ptr [rdi + 16]
movaps xmmword ptr [rsp + 32], xmm0
movups xmm0, xmmword ptr [rdi + 32]
movaps xmmword ptr [rsp + 16], xmm0
movups xmm0, xmmword ptr [rdi + 48]
movaps xmmword ptr [rsp], xmm0
xor ebp, ebp
lea r14, [rip + .L__unnamed_1]
lea r15, [rsp + 64]
mov r12, qword ptr [rip + memcpy@GOTPCREL]
.p2align 4, 0x90
.LBB0_1:
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/int_macros.rs : 2222
let (a, b) = intrinsics::add_with_overflow(self as $ActualT, rhs as $ActualT);
inc ebp
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 106
for (a, b, c, d) in CHACHA_ROUND_INDICIES {
mov edx, 256
mov rdi, r15
mov rsi, r14
call r12
mov qword ptr [rsp + 320], 0
mov qword ptr [rsp + 328], 8
mov edx, 24
.p2align 4, 0x90
.LBB0_2:
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ptr/mod.rs : 1325
crate::intrinsics::read_via_copy(src)
mov rax, qword ptr [rsp + rdx + 40]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 92
self.0[a] = self.0[a].wrapping_add(self.0[b]);
cmp rax, 15
ja .LBB0_9
mov rdi, qword ptr [rsp + rdx + 48]
cmp rdi, 16
jae .LBB0_10
mov rcx, qword ptr [rsp + rdx + 56]
mov r8, qword ptr [rsp + rdx + 64]
mov esi, dword ptr [rbx + 4*rdi]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
add esi, dword ptr [rbx + 4*rax]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 92
self.0[a] = self.0[a].wrapping_add(self.0[b]);
mov dword ptr [rbx + 4*rax], esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 93
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
cmp r8, 16
jae .LBB0_11
xor esi, dword ptr [rbx + 4*r8]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
rol esi, 16
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 93
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
mov dword ptr [rbx + 4*r8], esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 94
self.0[c] = self.0[c].wrapping_add(self.0[d]);
cmp rcx, 16
jae .LBB0_12
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
add esi, dword ptr [rbx + 4*rcx]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 94
self.0[c] = self.0[c].wrapping_add(self.0[d]);
mov dword ptr [rbx + 4*rcx], esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 95
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(12);
xor esi, dword ptr [rbx + 4*rdi]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
rol esi, 12
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 95
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(12);
mov dword ptr [rbx + 4*rdi], esi
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
add esi, dword ptr [rbx + 4*rax]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 96
self.0[a] = self.0[a].wrapping_add(self.0[b]);
mov dword ptr [rbx + 4*rax], esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 97
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(8);
xor esi, dword ptr [rbx + 4*r8]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
rol esi, 8
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 97
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(8);
mov dword ptr [rbx + 4*r8], esi
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
add esi, dword ptr [rbx + 4*rcx]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 98
self.0[c] = self.0[c].wrapping_add(self.0[d]);
mov dword ptr [rbx + 4*rcx], esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 99
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(7);
xor esi, dword ptr [rbx + 4*rdi]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
rol esi, 7
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 99
self.0[b] = (self.0[b] ^ self.0[c]).rotate_left(7);
mov dword ptr [rbx + 4*rdi], esi
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/index_range.rs : 119
if self.len() > 0 {
add rdx, 32
cmp rdx, 280
jne .LBB0_2
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/cmp.rs : 1565
fn lt(&self, other: &$t) -> bool { (*self) < (*other) }
cmp ebp, 10
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/iter/range.rs : 753
if self.start < self.end {
jne .LBB0_1
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
*modified = modified.wrapping_add(*initial)
movdqu xmm0, xmmword ptr [rbx]
movdqa xmm1, xmmword ptr [rsp + 48]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
paddd xmm1, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
*modified = modified.wrapping_add(*initial)
movdqu xmm0, xmmword ptr [rbx + 16]
movdqa xmm2, xmmword ptr [rsp + 32]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
paddd xmm2, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
*modified = modified.wrapping_add(*initial)
movdqu xmm0, xmmword ptr [rbx + 32]
movdqa xmm3, xmmword ptr [rsp + 16]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
paddd xmm3, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
*modified = modified.wrapping_add(*initial)
movdqu xmm0, xmmword ptr [rbx + 48]
movdqa xmm4, xmmword ptr [rsp]
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
paddd xmm4, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 111
*modified = modified.wrapping_add(*initial)
movdqu xmmword ptr [rbx], xmm1
movdqu xmmword ptr [rbx + 16], xmm2
movdqu xmmword ptr [rbx + 32], xmm3
movdqu xmmword ptr [rbx + 48], xmm4
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 113
}
add rsp, 336
.cfi_def_cfa_offset 48
pop rbx
.cfi_def_cfa_offset 40
pop r12
.cfi_def_cfa_offset 32
pop r14
.cfi_def_cfa_offset 24
pop r15
.cfi_def_cfa_offset 16
pop rbp
.cfi_def_cfa_offset 8
ret
.LBB0_9:
.cfi_def_cfa_offset 384
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 92
self.0[a] = self.0[a].wrapping_add(self.0[b]);
lea rdx, [rip + .L__unnamed_2]
mov esi, 16
mov rdi, rax
call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
.LBB0_10:
lea rdx, [rip + .L__unnamed_3]
mov esi, 16
call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
.LBB0_11:
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 93
self.0[d] = (self.0[d] ^ self.0[a]).rotate_left(16);
lea rdx, [rip + .L__unnamed_4]
mov esi, 16
mov rdi, r8
call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
.LBB0_12:
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 94
self.0[c] = self.0[c].wrapping_add(self.0[d]);
lea rdx, [rip + .L__unnamed_5]
mov esi, 16
mov rdi, rcx
call qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
```
Adding the `target-cpu=native` compile flag enables a bit more SIMD usage on my laptop. It uses the 256-bit AVX2 instructions instead of the 128-bit SSE2 ones. But it still does not optimize the quarter round operations, so not a game changer.
ASM command: `RUSTFLAGS="-C opt-level=3 -C target-cpu=native" cargo asm --lib --rust --package chacha20-poly1305 chacha20_poly1305::chacha20::State::chacha_block`
## Type Modifications
Something needs to change with the `quarter_round` function so that LLVM knows it can use SIMD instructions in it. The bulk of the work happens in `quarter_round`, so for it to not be optimized is a big blow. But we don't want to depend on intrinsic hints (e.g. `#[target_feature(enable = "avx2")]`) since they are architecture specific.
One option which allows for more optimizations is to generate multiple blocks in parallel. The only difference between the state inputs would be the value of the block counter. This is a form of *Horizontal Vectorization*. Each 512-bit state representation can theoretically be operated on in parallel. And it nicely fits into some of the most modern SIMD `AVX-512` instructions, which I doubt is a coincidence.
```
// Horizontal Vectorization, each vector is operated on in parallel.
[a1, a2, a3, a4] -> a1 + a2 + a3 + a4
[b1, b2, b3, b4] -> b1 + b2 + b3 + b4
```
This does introduce quite a bit more bookkeeping complexity: keeping track of the count and making sure generated blocks are used in order. Any misstep and the train goes off the rails. It has the benefit of being kind of a "wrapper" around the current `State` implementation, but with that said, I am not quite sure what other "tips" I would need to give LLVM in order to leverage this. Going to focus on lower hanging fruit.
And that fruit is *Vertical Vectorization*.
```
// Vertical Vectorization, operates on data in multiple vectors in parallel.
[a1, a2, a3, a4] + [b1, b2, b3, b4] = [a1+b1, a2+b2, a3+b3, a4+b4]
```
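In Rust terms, vertical vectorization is just lane-wise operations over fixed-size arrays. A minimal sketch (the function name is made up, not from the implementation) of the addition pictured above, which LLVM can usually compile down to a single `paddd`-style instruction when optimizations are on:

```rust
// Lane-wise wrapping addition over two 4-lane vectors. Each lane is
// independent, which is exactly the shape SIMD hardware wants.
fn add_lanes(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    let mut out = [0u32; 4];
    for i in 0..4 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}
```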
When looking at ChaCha's quarter round function, the simple *scalar* approach operates on 4 words (32 bits each) at a time, which is 1/4 of the total state. A quick recap: the cipher's state is broken up into 16 words. When the state is updated (a.k.a. a block is created), the quarter round function is first run on the words with indexes `[0,4,8,12]`. Looking at the ascii art below, you can visualize this as a "column" of the state.
```
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
```
Next, a quarter round is run on `[1,5,9,13]`, then `[2,6,10,14]`, and finally `[3,7,11,15]`. All the columns! It then runs some quarter rounds across some "diagonals", but before going there, notice that the column quarter rounds each act on an independent chunk of data. So, while the scalar quarter round implementation acts on one "column" at a time, the vectorized approach will take all the "rows" and apply the quarter round to each "lane" (SIMD word for column) in parallel.
```
[ 0 | 1 | 2 | 3 ] // U32x4
[ 4 | 5 | 6 | 7 ] // U32x4
[ 8 | 9 | 10 | 11 ] // U32x4
[ 12 |13 | 14 | 15 ] // U32x4
```
So each row is a vector of four 32-bit words. Not by chance, this vector size fits perfectly into a lot of SIMD instructions. The quarter round operations are still applied to the columns (e.g. `[0,4,8,12]`), but this can be done in parallel with the other columns by leveraging the SIMD lanes. We just have to organize our data in a way that lets the compiler pick up on the fact that it can do this. And as we saw from above, tossing a bunch of tuples at it ain't good enough.
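To convince ourselves the four column quarter rounds really are independent, here is a sketch (the helper names are made up, not from the implementation) of the quarter round applied lane-wise to the four row vectors. Running it on the rows should produce the same state as running the scalar quarter round on each column in turn:

```rust
// Lane-wise helpers over a 4-lane row of the state.
fn add(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    core::array::from_fn(|i| a[i].wrapping_add(b[i]))
}
fn xor(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    core::array::from_fn(|i| a[i] ^ b[i])
}
fn rotl(v: [u32; 4], n: u32) -> [u32; 4] {
    core::array::from_fn(|i| v[i].rotate_left(n))
}

// One ChaCha quarter round applied to all four columns at once. Row `a`
// holds words 0..4, row `b` words 4..8, and so on, so lane `i` across the
// four rows is exactly column `i` of the state.
fn column_quarter_rounds(
    mut a: [u32; 4], mut b: [u32; 4], mut c: [u32; 4], mut d: [u32; 4],
) -> ([u32; 4], [u32; 4], [u32; 4], [u32; 4]) {
    a = add(a, b); d = rotl(xor(d, a), 16);
    c = add(c, d); b = rotl(xor(b, c), 12);
    a = add(a, b); d = rotl(xor(d, a), 8);
    c = add(c, d); b = rotl(xor(b, c), 7);
    (a, b, c, d)
}
```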
The `std` library actually has some examples of these in the appropriately named [`std::simd`](https://doc.rust-lang.org/std/simd/index.html) module: [`u32x4`](https://doc.rust-lang.org/std/simd/type.u32x4.html) and [`u32x16`](https://doc.rust-lang.org/std/simd/type.u32x16.html). These are also available in `core`, which is good news for no-std support. However, the interface is still experimental and feature-gated on [86656](https://github.com/rust-lang/rust/issues/86656), and the `portable_simd` feature is not available in Rust 1.63.0 (MSRV). Bummer! But we can experiment with types similar to the std library's, which is also what LDK did in their implementation.
Here is the new fancy `U32x4` type, which matches the std library's SIMD interface for easy migration in the future but implements only the bare minimum needed for ChaCha.
```rust
use core::ops::BitXor;

#[repr(C, align(16))]
#[derive(Clone, Copy, Debug, PartialEq)]
struct U32x4([u32; 4]);
impl U32x4 {
#[inline(always)]
fn wrapping_add(self, rhs: Self) -> Self {
let mut result = [0u32; 4];
(0..4).for_each(|i| {
result[i] = self.0[i].wrapping_add(rhs.0[i]);
});
U32x4(result)
}
#[inline(always)]
fn rotate_left(self, n: u32) -> Self {
let mut result = [0u32; 4];
(0..4).for_each(|i| {
result[i] = self.0[i].rotate_left(n);
});
U32x4(result)
}
#[inline(always)]
fn rotate_elements_left<const N: u32>(self) -> Self {
let mut result = [0u32; 4];
(0..4).for_each(|i| {
result[i] = self.0[(i + N as usize) % 4];
});
U32x4(result)
}
#[inline(always)]
fn rotate_elements_right<const N: u32>(self) -> Self {
let mut result = [0u32; 4];
(0..4).for_each(|i| {
result[i] = self.0[(i + 4 - N as usize) % 4];
});
U32x4(result)
}
#[inline(always)]
fn to_le_bytes(self) -> [u8; 16] {
let mut bytes = [0u8; 16];
bytes[0..4].copy_from_slice(&self.0[0].to_le_bytes());
bytes[4..8].copy_from_slice(&self.0[1].to_le_bytes());
bytes[8..12].copy_from_slice(&self.0[2].to_le_bytes());
bytes[12..16].copy_from_slice(&self.0[3].to_le_bytes());
bytes
}
}
impl BitXor for U32x4 {
type Output = Self;
#[inline(always)]
fn bitxor(self, rhs: Self) -> Self {
let mut result = [0u32; 4];
(0..4).for_each(|i| {
result[i] = self.0[i] ^ rhs.0[i];
});
U32x4(result)
}
}
```
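The `rotate_elements_left`/`rotate_elements_right` methods exist for the diagonal rounds: rotating row `b` left by one lane, `c` by two, and `d` by three lines the diagonals up as columns, so the same lane-parallel quarter round can be reused before the rows are rotated back. A sketch of that shuffle (the `diagonalize` name is made up; the rotation semantics mirror `U32x4::rotate_elements_left` above, where element `0` takes element `N`'s value):

```rust
// Rotate a row's elements left by n lanes, mirroring the behavior of
// U32x4::rotate_elements_left in the type above.
fn rotate_elements_left(v: [u32; 4], n: usize) -> [u32; 4] {
    core::array::from_fn(|i| v[(i + n) % 4])
}

// Shuffle rows b, c, and d so each SIMD lane holds one diagonal of the
// state; the column quarter round then doubles as the diagonal round.
fn diagonalize(
    b: [u32; 4], c: [u32; 4], d: [u32; 4],
) -> ([u32; 4], [u32; 4], [u32; 4]) {
    (
        rotate_elements_left(b, 1),
        rotate_elements_left(c, 2),
        rotate_elements_left(d, 3),
    )
}
```

After the quarter round, the inverse shuffle (right rotations by the same lane counts) restores the row layout.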
## SIMD Results
After a few small tweaks to how data flows from the top of the `chacha_block` function, here are the results with the new `U32x4` type usage.
Without the `opt-level=3` or `target-cpu=native` flags, the code is entirely scalar. It is not identical to the naive version above, so maybe the compiler is making some other optimizations, but they are not SIMD related. With `opt-level=3`, there is still no SIMD usage, which I find disappointing but perhaps not a deal-breaker. With the `target-cpu=native` flag, though, the code is entirely SIMD optimized, including the quarter round. There is rampant use of the `xmm*` registers and SIMD instructions throughout.
ASM command: `RUSTFLAGS="-C opt-level=3 -C target-cpu=native" cargo asm --lib --rust --package chacha20-poly1305 chacha20_poly1305::chacha20::State::chacha_block`
```asm
.section .text.chacha20_poly1305::chacha20::State::chacha_block,"ax",@progbits
.p2align 4, 0x90
.type chacha20_poly1305::chacha20::State::chacha_block,@function
chacha20_poly1305::chacha20::State::chacha_block:
.cfi_startproc
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 224
let mut a = self.a;
mov esi, dword ptr [rdi]
vmovdqu xmm2, xmmword ptr [rdi + 4]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 225
let mut b = self.b;
mov r10d, dword ptr [rdi + 32]
vmovdqu xmm1, xmmword ptr [rdi + 20]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 226
let mut c = self.c;
vmovdqu xmm0, xmmword ptr [rdi + 36]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 227
let mut d = self.d;
mov edx, dword ptr [rdi + 52]
mov ecx, dword ptr [rdi + 56]
mov eax, dword ptr [rdi + 60]
vpshufd xmm3, xmm1, 198
vpshufd xmm4, xmm2, 161
vpinsrd xmm8, xmm4, esi, 2
vpshufd xmm4, xmm0, 24
vpinsrd xmm10, xmm4, edx, 0
vpshufd xmm4, xmm0, 255
vpinsrd xmm11, xmm4, eax, 1
vpblendd xmm7, xmm3, xmm2, 8
mov r8d, 10
vmovdqa xmm3, xmmword ptr [rip + .LCPI0_0]
vmovdqa xmm4, xmmword ptr [rip + .LCPI0_1]
vmovdqa xmm5, xmmword ptr [rip + .LCPI0_2]
vmovdqa xmm6, xmmword ptr [rip + .LCPI0_3]
mov r9d, ecx
.p2align 4, 0x90
.LBB0_1:
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpshufd xmm9, xmm7, 57
vpaddd xmm8, xmm9, xmm8
vpinsrd xmm9, xmm10, r10d, 0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
vpshufd xmm12, xmm8, 78
vpbroadcastd xmm10, xmm10
vpblendd xmm10, xmm12, xmm10, 8
vpunpcklqdq xmm11, xmm11, xmm8
vpinsrd xmm11, xmm11, r9d, 2
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpshufb xmm11, xmm11, xmm3
vpshufb xmm10, xmm10, xmm3
vpxor xmm10, xmm10, xmm11
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm9, xmm10, xmm9
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
vpshufd xmm11, xmm9, 57
vpxor xmm7, xmm11, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpsrld xmm11, xmm7, 20
vpslld xmm7, xmm7, 12
vpor xmm7, xmm11, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpshufd xmm11, xmm7, 57
vpaddd xmm8, xmm11, xmm8
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpshufb xmm11, xmm8, xmm4
vpshufb xmm10, xmm10, xmm5
vpxor xmm10, xmm11, xmm10
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm9, xmm10, xmm9
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
vpshufd xmm11, xmm9, 57
vpxor xmm7, xmm11, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpsrld xmm11, xmm7, 25
vpslld xmm7, xmm7, 7
vpor xmm7, xmm11, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm8, xmm8, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpshufb xmm11, xmm8, xmm3
vpshufb xmm10, xmm10, xmm6
vpxor xmm10, xmm11, xmm10
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm11, xmm10, xmm9
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
vpxor xmm7, xmm11, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpsrld xmm9, xmm7, 20
vpslld xmm7, xmm7, 12
vpor xmm7, xmm9, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm8, xmm8, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpshufb xmm9, xmm8, xmm5
vpshufb xmm10, xmm10, xmm5
vpxor xmm9, xmm9, xmm10
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpaddd xmm12, xmm9, xmm11
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
vpxor xmm7, xmm12, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 232
return intrinsics::rotate_left(self, n);
vpsrld xmm10, xmm7, 25
vpslld xmm7, xmm7, 7
vpor xmm7, xmm10, xmm7
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/iter/range.rs : 753
if self.start < self.end {
vmovd r10d, xmm12
vpextrd r9d, xmm9, 3
vpblendd xmm10, xmm12, xmm9, 1
vpshufd xmm11, xmm9, 233
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/cmp.rs : 1565
fn lt(&self, other: &$t) -> bool { (*self) < (*other) }
dec r8d
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/iter/range.rs : 753
if self.start < self.end {
jne .LBB0_1
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpextrd r8d, xmm8, 2
add r8d, esi
vpshufd xmm3, xmm8, 241
vpblendd xmm3, xmm3, xmm7, 8
vpaddd xmm2, xmm3, xmm2
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 234
self.a = a.wrapping_add(self.a);
mov dword ptr [rdi], r8d
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpbroadcastd xmm3, xmm12
vpshufd xmm4, xmm7, 198
vpblendd xmm3, xmm4, xmm3, 8
vpaddd xmm1, xmm3, xmm1
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 234
self.a = a.wrapping_add(self.a);
vmovdqu xmmword ptr [rdi + 4], xmm2
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vpbroadcastq xmm2, xmm9
vpshufd xmm3, xmm12, 219
vpblendd xmm2, xmm3, xmm2, 8
vpaddd xmm0, xmm2, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 235
self.b = b.wrapping_add(self.b);
vmovdqu xmmword ptr [rdi + 20], xmm1
// /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/num/uint_macros.rs : 1753
intrinsics::wrapping_add(self, rhs)
vmovd esi, xmm9
add esi, edx
add r9d, ecx
vpextrd ecx, xmm9, 2
add ecx, eax
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 236
self.c = c.wrapping_add(self.c);
vmovdqu xmmword ptr [rdi + 36], xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 237
self.d = d.wrapping_add(self.d);
mov dword ptr [rdi + 52], esi
mov dword ptr [rdi + 56], r9d
mov dword ptr [rdi + 60], ecx
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 238
}
ret
```
Is it unusual for a library to require the `target-cpu` flag to get SIMD instructions? Using LDK's ChaCha20 implementation as a benchmark, it does not use SIMD instructions without at least `opt-level=3`, and its quarter round is not vectorized without `target-cpu`. So perhaps the `target-cpu` flag is a necessity for now, until the usage of Rust's native SIMD types is possible.
## Rust 1.63.0 MSRV
With fully optimized ASM output, the `1.63.0` version of `rustc` partially uses SIMD instructions with the new code. It is better than the initial naive version on `1.80.1`, but the quarter round does not use SIMD and generally more instructions are emitted.
ASM command: `RUSTFLAGS="-C opt-level=3 -C target-cpu=native" cargo asm --lib --rust --package chacha20-poly1305 chacha20_poly1305::chacha20::State::chacha_block`
```asm
.section .text.chacha20_poly1305::chacha20::State::chacha_block,"ax",@progbits
.p2align 4, 0x90
.type chacha20_poly1305::chacha20::State::chacha_block,@function
chacha20_poly1305::chacha20::State::chacha_block:
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 221
fn chacha_block(&mut self) {
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
push r15
.cfi_def_cfa_offset 24
push r14
.cfi_def_cfa_offset 32
push r13
.cfi_def_cfa_offset 40
push r12
.cfi_def_cfa_offset 48
push rbx
.cfi_def_cfa_offset 56
sub rsp, 48
.cfi_def_cfa_offset 104
.cfi_offset rbx, -56
.cfi_offset r12, -48
.cfi_offset r13, -40
.cfi_offset r14, -32
.cfi_offset r15, -24
.cfi_offset rbp, -16
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 224
let mut a = self.a;
mov eax, dword ptr [rdi]
mov dword ptr [rsp + 16], eax
vmovdqu xmm0, xmmword ptr [rdi + 4]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 225
let mut b = self.b;
vmovdqu xmm1, xmmword ptr [rdi + 20]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 226
let mut c = self.c;
vmovdqu xmm2, xmmword ptr [rdi + 36]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 227
let mut d = self.d;
mov r8d, dword ptr [rdi + 52]
mov r9d, dword ptr [rdi + 56]
mov qword ptr [rsp + 40], rdi
mov ecx, dword ptr [rdi + 60]
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/iter/range.rs : 621
if self.start < self.end {
vpextrd ebp, xmm0, 1
vpextrd r10d, xmm0, 2
vmovd eax, xmm0
vpextrd esi, xmm0, 3
vmovd r14d, xmm1
vpextrd edi, xmm1, 1
vpextrd edx, xmm1, 2
vpextrd r12d, xmm1, 3
vpextrd r15d, xmm2, 1
vmovd r11d, xmm2
vpextrd r13d, xmm2, 2
vpextrd ebx, xmm2, 3
mov dword ptr [rsp + 32], 10
mov dword ptr [rsp + 28], ecx
mov dword ptr [rsp + 4], ecx
mov dword ptr [rsp + 24], r9d
mov dword ptr [rsp], r9d
mov r9d, edi
mov dword ptr [rsp + 20], r8d
mov edi, r8d
mov ecx, dword ptr [rsp + 16]
.p2align 4, 0x90
.LBB14_1:
mov dword ptr [rsp + 12], r9d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
add ecx, esi
add eax, r14d
add ebp, r9d
add r10d, edx
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
xor ebx, ecx
xor edi, eax
mov dword ptr [rsp + 8], edi
mov r9d, dword ptr [rsp]
xor r9d, ebp
mov edi, dword ptr [rsp + 4]
xor edi, r10d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 212
intrinsics::rotate_left(self, n as $SelfT)
rorx r8d, ebx, 16
rorx ebx, dword ptr [rsp + 8], 16
mov dword ptr [rsp], ebx
rorx r9d, r9d, 16
rorx edi, edi, 16
mov dword ptr [rsp + 4], edi
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
add r12d, r8d
add r11d, ebx
add r15d, r9d
add r13d, edi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
xor esi, r12d
xor r14d, r11d
mov ebx, dword ptr [rsp + 12]
xor ebx, r15d
xor edx, r13d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 212
intrinsics::rotate_left(self, n as $SelfT)
rorx edi, esi, 20
rorx esi, r14d, 20
mov dword ptr [rsp + 36], esi
rorx ebx, ebx, 20
rorx edx, edx, 20
mov dword ptr [rsp + 8], edx
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
add ecx, edi
mov r14d, edi
add eax, esi
add ebp, ebx
add r10d, edx
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
xor r8d, ecx
mov edi, dword ptr [rsp]
xor edi, eax
xor r9d, ebp
mov esi, dword ptr [rsp + 4]
xor esi, r10d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 212
intrinsics::rotate_left(self, n as $SelfT)
rorx edx, r8d, 24
mov dword ptr [rsp], edx
rorx r8d, edi, 24
mov dword ptr [rsp + 12], r8d
rorx r9d, r9d, 24
mov dword ptr [rsp + 4], r9d
rorx edi, esi, 24
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
add r12d, edx
add r11d, r8d
add r15d, r9d
add r13d, edi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
xor r14d, r12d
mov esi, dword ptr [rsp + 36]
xor esi, r11d
xor ebx, r15d
mov edx, dword ptr [rsp + 8]
xor edx, r13d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 212
intrinsics::rotate_left(self, n as $SelfT)
rorx r14d, r14d, 25
mov dword ptr [rsp + 8], r14d
rorx r9d, esi, 25
rorx ebx, ebx, 25
rorx r8d, edx, 25
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
add ecx, r9d
add eax, ebx
add ebp, r8d
add r10d, r14d
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
xor edi, ecx
mov r14d, dword ptr [rsp]
xor r14d, eax
mov edx, dword ptr [rsp + 12]
xor edx, ebp
mov esi, dword ptr [rsp + 4]
xor esi, r10d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 212
intrinsics::rotate_left(self, n as $SelfT)
rorx edi, edi, 16
mov dword ptr [rsp + 4], edi
rorx r14d, r14d, 16
mov dword ptr [rsp], r14d
rorx edx, edx, 16
rorx esi, esi, 16
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
add r15d, edi
add r13d, r14d
add r12d, edx
add r11d, esi
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
xor r9d, r15d
xor ebx, r13d
xor r8d, r12d
mov edi, dword ptr [rsp + 8]
xor edi, r11d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 212
intrinsics::rotate_left(self, n as $SelfT)
rorx r14d, r9d, 20
rorx r9d, ebx, 20
rorx ebx, r8d, 20
mov dword ptr [rsp + 8], ebx
rorx edi, edi, 20
mov dword ptr [rsp + 12], edi
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
add ecx, r14d
add eax, r9d
mov r8d, r9d
add ebp, ebx
add r10d, edi
mov ebx, dword ptr [rsp + 4]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
xor ebx, ecx
mov edi, dword ptr [rsp]
xor edi, eax
xor edx, ebp
xor esi, r10d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 212
intrinsics::rotate_left(self, n as $SelfT)
rorx r9d, ebx, 24
rorx ebx, edi, 24
rorx edi, edx, 24
rorx edx, esi, 24
mov dword ptr [rsp + 4], r9d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
add r15d, r9d
add r13d, ebx
add r12d, edi
mov dword ptr [rsp], edx
add r11d, edx
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 119
result[i] = self.0[i] ^ rhs.0[i];
xor r14d, r15d
xor r8d, r13d
mov edx, dword ptr [rsp + 8]
xor edx, r12d
mov esi, dword ptr [rsp + 12]
xor esi, r11d
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 212
intrinsics::rotate_left(self, n as $SelfT)
rorx r14d, r14d, 25
rorx r9d, r8d, 25
rorx edx, edx, 25
rorx esi, esi, 25
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/cmp.rs : 1398
fn lt(&self, other: &$t) -> bool { (*self) < (*other) }
dec dword ptr [rsp + 32]
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/iter/range.rs : 621
if self.start < self.end {
jne .LBB14_1
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
vmovd xmm3, eax
vpinsrd xmm3, xmm3, ebp, 1
vpinsrd xmm3, xmm3, r10d, 2
vpinsrd xmm3, xmm3, esi, 3
vmovd xmm4, r14d
vpinsrd xmm4, xmm4, r9d, 1
vpinsrd xmm4, xmm4, edx, 2
vpinsrd xmm4, xmm4, r12d, 3
add ecx, dword ptr [rsp + 16]
vmovd xmm5, r11d
vpinsrd xmm5, xmm5, r15d, 1
vpinsrd xmm5, xmm5, r13d, 2
mov rax, qword ptr [rsp + 40]
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 234
self.a = a.wrapping_add(self.a);
mov dword ptr [rax], ecx
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
vpinsrd xmm5, xmm5, ebx, 3
add edi, dword ptr [rsp + 20]
mov edx, dword ptr [rsp]
add edx, dword ptr [rsp + 24]
mov ecx, dword ptr [rsp + 4]
add ecx, dword ptr [rsp + 28]
vpaddd xmm0, xmm3, xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 234
self.a = a.wrapping_add(self.a);
vmovdqu xmmword ptr [rax + 4], xmm0
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
vpaddd xmm0, xmm4, xmm1
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 235
self.b = b.wrapping_add(self.b);
vmovdqu xmmword ptr [rax + 20], xmm0
// /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/num/uint_macros.rs : 1184
intrinsics::wrapping_add(self, rhs)
vpaddd xmm0, xmm5, xmm2
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 236
self.c = c.wrapping_add(self.c);
vmovdqu xmmword ptr [rax + 36], xmm0
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 237
self.d = d.wrapping_add(self.d);
mov dword ptr [rax + 52], edi
mov dword ptr [rax + 56], edx
mov dword ptr [rax + 60], ecx
// /home/njohnson/grd/rust-bitcoin/chacha20_poly1305/src/chacha20.rs : 238
}
add rsp, 48
.cfi_def_cfa_offset 56
pop rbx
.cfi_def_cfa_offset 48
pop r12
.cfi_def_cfa_offset 40
pop r13
.cfi_def_cfa_offset 32
pop r14
.cfi_def_cfa_offset 24
pop r15
.cfi_def_cfa_offset 16
pop rbp
.cfi_def_cfa_offset 8
ret
```
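A few things stand out in the listing. The hot loop (`.LBB14_1`) runs entirely on scalar general-purpose registers with `add`/`xor`/`rorx` instructions; the `xmm` vector registers appear only when loading the state up front and performing the final `vpaddd` wrapping adds. In other words, LLVM is not vectorizing the rounds themselves. Note also that the `rorx` rotate-right counts `16`/`20`/`24`/`25` are just the ChaCha20 left-rotations `16`/`12`/`8`/`7` expressed as right-rotations. For reference, below is a minimal sketch of the scalar quarter-round (per RFC 8439) that each `add`/`xor`/`rorx` triple in the loop corresponds to; the function name and index-based signature here are illustrative, not the crate's actual API.

```rust
/// Illustrative ChaCha20 quarter-round (RFC 8439, section 2.1) over a flat
/// 16-word state. Each add/xor/rotate line maps onto one of the
/// wrapping_add / xor / rotate_left triples visible in the disassembly above
/// (rotate_left(16/12/8/7) shows up as rorx by 16/20/24/25).
fn quarter_round(state: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize) {
    state[a] = state[a].wrapping_add(state[b]);
    state[d] = (state[d] ^ state[a]).rotate_left(16);

    state[c] = state[c].wrapping_add(state[d]);
    state[b] = (state[b] ^ state[c]).rotate_left(12);

    state[a] = state[a].wrapping_add(state[b]);
    state[d] = (state[d] ^ state[a]).rotate_left(8);

    state[c] = state[c].wrapping_add(state[d]);
    state[b] = (state[b] ^ state[c]).rotate_left(7);
}

fn main() {
    // RFC 8439 section 2.1.1 quarter-round test vector.
    let mut s = [0u32; 16];
    s[0] = 0x11111111;
    s[1] = 0x01020304;
    s[2] = 0x9b8d6f43;
    s[3] = 0x01234567;
    quarter_round(&mut s, 0, 1, 2, 3);
    assert_eq!(s[0], 0xea2a92f4);
    assert_eq!(s[1], 0xcb1cf8ce);
    assert_eq!(s[2], 0x4581472e);
    assert_eq!(s[3], 0x5881c4bb);
}
```

Because each quarter-round only touches four of the sixteen words, and a double round applies it to four disjoint column (then diagonal) groups, the four applications are independent and in principle vectorizable; the listing above shows LLVM instead serializing them through general-purpose registers.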
## Is the struct needed?