Woke up on Saturday, thinking I was going to spend the next 30 minutes cleaning up this simple util function in our rust-kzg-bn254 Rust crate and then move on to better things:
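Roughly, it looked like this (a reconstruction with illustrative names; the real crate code differs in its details):

```rust
const BYTES_PER_FIELD_ELEMENT: usize = 32;

// Reconstruction of the original shape: walk the input in 32-byte strides and
// copy everything except the first byte of each chunk.
fn remove_first_byte_of_each_chunk(data: &[u8]) -> Vec<u8> {
    let data_size = data.len();
    let mut output = Vec::new();
    for i in (0..data_size).step_by(BYTES_PER_FIELD_ELEMENT) {
        let mut end = i + BYTES_PER_FIELD_ELEMENT;
        // The branch this post keeps coming back to: it is only true for the
        // final, possibly partial, chunk.
        if end > data_size {
            end = data_size;
        }
        output.extend_from_slice(&data[i + 1..end]);
    }
    output
}
```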
It takes a slice of bytes, and removes the first byte of every 32-byte chunk. Its purpose is to decode a payload that was encoded as an array of bn254 field elements:
[0,1,2,…,31,0,33,…,63,0,65,…,73] -> [1,2,…,31,33,…,63,65,…,73]
Having read the Rust book's section comparing loops and iterators, which shows that Rust's iterators are a zero-cost abstraction, I thought I was just going to have to rewrite it in functional form, which would make it both cleaner and faster (because of the zero-cost abstraction):
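Something along these lines (an illustrative sketch):

```rust
const BYTES_PER_FIELD_ELEMENT: usize = 32;

// A functional version of the same transformation: drop the first byte of
// every 32-byte chunk and collect the rest into a fresh Vec.
fn remove_first_byte_of_each_chunk_iter(data: &[u8]) -> Vec<u8> {
    data.chunks(BYTES_PER_FIELD_ELEMENT)
        .flat_map(|chunk| chunk.iter().skip(1))
        .copied()
        .collect()
}
```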
I smirked thinking back to Joe Armstrong's famous quote:
"Make it work, then make it beautiful, then if you really, really have to, make it fast. 90% of the time, if you make it beautiful, it will already be fast. So really, just make it beautiful!"
After all, code typically doesn't get more beautiful than its functional form. And Rust was supposed to give me the fast for free, as long as I made it beautiful. This had taken me all of 10 minutes, so I had a bit of extra time to do what every good engineer should do: benchmark. Even Joe Armstrong would agree that the only real source of truth is CPU time. An hour later (I had to learn Criterion…), I realized my functional code was actually 3x slower. What?!
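For reference, a minimal Criterion harness for this comparison can look like the sketch below; the crate name `payload_utils` and the two function names are just the illustrative ones from the sketches above, not the crate's real identifiers.

```rust
// benches/remove_byte.rs
// Minimal Criterion setup (assumed crate name `payload_utils`, illustrative
// function names from the earlier sketches).
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use payload_utils::{remove_first_byte_of_each_chunk, remove_first_byte_of_each_chunk_iter};

fn bench_remove_byte(c: &mut Criterion) {
    // A few MiB of zeroed "padded" input is enough to see the difference.
    let data = vec![0u8; 32 * 100_000];
    c.bench_function("for_loop", |b| {
        b.iter(|| remove_first_byte_of_each_chunk(black_box(&data)))
    });
    c.bench_function("iterator", |b| {
        b.iter(|| remove_first_byte_of_each_chunk_iter(black_box(&data)))
    });
}

criterion_group!(benches, bench_remove_byte);
criterion_main!(benches);
```

Wiring it up also needs a `criterion` entry under `[dev-dependencies]` and a `[[bench]]` section with `harness = false` in Cargo.toml.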
By putting this function in godbolt and looking at the assembly, we can see that it copies one byte at a time.
The reasons for this are still not entirely clear to me; we'll look at some of them in a section below. But first, it makes more sense to go back to the original function and optimize it gradually, one change at a time, to understand which features of the code are slowing it down.
One problem with the original function that jumped out at me is the if statement inside the for loop. It goes against matklad's "push ifs up and fors down" aphorism. Now, any human looking at the code can easily see that the branch condition `if end > data_size` only evaluates to true on the very last iteration of the loop. So clearly, either the compiler could optimize it away, or, failing that, modern CPU branch prediction would eat it for breakfast. Right? Right?? Turns out that's not the case. Compilers are still dumb, and even predictable branches can increase branch mispredictions. This is also why getting the compiler to optimize away bounds checks, or even writing unsafe Rust in certain cases when the compiler won't comply, can be very useful.
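As a hypothetical aside (not code from the crate), this is the kind of thing meant by getting the compiler to optimize away bounds checks: hand it enough information up front that it can prove every index is in range.

```rust
// Hypothetical example: the assert lets LLVM prove that every `data[i]` below
// is in bounds, so the per-iteration bounds checks (and their branches) are
// compiled away; without it, the loop keeps a length comparison around so it
// can panic at exactly the right index.
fn sum_first_16(data: &[u8]) -> u32 {
    assert!(data.len() >= 16);
    let mut sum = 0u32;
    for i in 0..16 {
        sum += data[i] as u32;
    }
    sum
}
```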
Here is the optimized version.
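In sketch form (same illustrative names as before), the idea is that the hot loop only ever sees full 32-byte chunks, and the single possibly-short chunk is handled once, after the loop:

```rust
const BYTES_PER_FIELD_ELEMENT: usize = 32;

// Branch hoisted out of the loop: the loop body handles only full 32-byte
// chunks, and the `end > data_size` case is dealt with exactly once at the end.
fn remove_first_byte_of_each_chunk_fast(data: &[u8]) -> Vec<u8> {
    let data_size = data.len();
    let full_chunks = data_size / BYTES_PER_FIELD_ELEMENT;
    let mut output = Vec::new();

    for i in 0..full_chunks {
        let start = i * BYTES_PER_FIELD_ELEMENT;
        output.extend_from_slice(&data[start + 1..start + BYTES_PER_FIELD_ELEMENT]);
    }

    // Trailing partial chunk (if any), handled outside the hot loop.
    let tail_start = full_chunks * BYTES_PER_FIELD_ELEMENT;
    if tail_start < data_size {
        output.extend_from_slice(&data[tail_start + 1..data_size]);
    }
    output
}
```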
And here are the benchmark results, also including the functional form for comparison.
So we see a ~7x speedup by moving the branch outside of the for loop! But also surprisingly, the functional form is ~27x slower than the fast version.
Applying what we learned above, we can massage the functional form to achieve similar speeds! Perhaps "zero-cost abstraction iterators" should be renamed to the more appropriate "zero-cost abstraction iterators, as long as you know what you're doing and don't write the code you would want to write."
The function below is what I ended up submitting in my PR.
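In spirit it looks like the sketch below (not the verbatim PR code; the crate's version may differ in naming and edge-case handling): pre-allocate the output, and iterate with `chunks_exact()` so the hot loop only ever sees full chunks.

```rust
const BYTES_PER_FIELD_ELEMENT: usize = 32;

// Sketch of the final shape: pre-allocate the output vector and use
// chunks_exact() so the main loop only sees full 32-byte chunks and can be
// auto-vectorized.
fn remove_first_byte_of_each_chunk_final(data: &[u8]) -> Vec<u8> {
    // Each 32-byte chunk contributes at most 31 bytes to the output.
    let capacity =
        data.len().div_ceil(BYTES_PER_FIELD_ELEMENT) * (BYTES_PER_FIELD_ELEMENT - 1);
    let mut output = Vec::with_capacity(capacity);

    let mut chunks = data.chunks_exact(BYTES_PER_FIELD_ELEMENT);
    for chunk in &mut chunks {
        output.extend_from_slice(&chunk[1..]);
    }
    // chunks_exact() hands us the trailing partial chunk separately.
    let remainder = chunks.remainder();
    if !remainder.is_empty() {
        output.extend_from_slice(&remainder[1..]);
    }
    output
}
```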
There are 2 problems with the original functional function:

1. `collect()` doesn't pre-allocate a vector of the right capacity, which leads to extraneous copies
2. `chunks()` doesn't guarantee that every chunk is a full 32 bytes, which prevents the main copy loop from being vectorized

We fixed those by:

1. pre-allocating an output vector of the right capacity instead of relying on `collect()`
2. using `chunks_exact()` instead of `chunks()`, in order to allow the main loop to auto-vectorize and use SIMD instructions to copy 128 (or more) bytes at a time

For completeness, we can put the function in godbolt and observe the optimized assembly output for the SIMD copying. The control flow graph shows the main copying loop between `.LBB2_7` and `.LBB2_6`.
And here's the LLM-annotated output, clearly showing the `stur` and `str` (store (unscaled) register) SIMD copies.
The `q` prefix indicates the use of NEON 128-bit registers. The `stur` (u for unscaled) with offset #15 shows up because we're storing data with a 1-byte shift to remove the empty byte: that offset isn't a multiple of 16, so it can't be encoded by the regular scaled-offset `str` form.