No sooner had I published the last piece than I found something really cool that I don't think is that well known.
The Cortex-M33 is an Armv8-M processor - that is, it implements the Microcontroller Profile part of Version 8 of the Arm Architecture Specification.
The Cortex-M55 and Cortex-M85 are Armv8.1-M processors - they implement Version 8.1. I sort of knew this, but didn't pay any real attention to what it means. But I should have done.
Armv8.1-M introduces new Branch and Loop Instructions. They are listed at https://developer.arm.com/documentation/101928/0101/The-Cortex-M85-Instruction-Set--Reference-Material/Armv8-1-M-branch-and-loop-instructions/List-of-Armv8-1-M-branch-and-loop-instructions, but I'll try and explain with some code examples.
Let's say we want to process some data.
#[no_mangle]
pub fn double_values(data: &mut [u8]) {
    for item in data.iter_mut() {
        *item = *item * 2;
    }
}
Let's compile for Armv8-M Mainline, with opt-level=s.
rustc sample.rs --emit asm --target=thumbv8m.main-none-eabi --edition 2021 --crate-type=rlib -C opt-level=s
cat sample.s | grep -v -e "\s\.[a-z]" -e "^\."
We get:
double_values:
    cbz r1, .LBB0_2
.LBB0_1:
    ldrb r2, [r0]
    subs r1, #1
    lsl.w r2, r2, #1
    strb r2, [r0], #1
    bne .LBB0_1
.LBB0_2:
    bx lr
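This is roughly:
- If r1 is zero, leave
- Load a byte from the address in r0, into r2
- Subtract 1 from r1
- Shift r2 left by one bit, which doubles it
- Store r2 to the address in r0, and then increment r0 by 1
- Loop back if r1 is not zero

If we want to process, say, 46 bytes, this will loop 46 times and it will take roughly 370 clock cycles.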
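To make that walk-through concrete, here is a minimal sketch of the same loop written back out in Rust, one statement per instruction. It is only a software model for counting iterations, not what the compiler does: the function name `scalar_model` and the `iterations` counter are my own inventions for illustration, and an index stands in for the pointer in r0.

```rust
/// Hypothetical model of the scalar loop above. Returns how many times the
/// loop body ran. Variable names mirror the registers in the listing.
pub fn scalar_model(data: &mut [u8]) -> usize {
    let mut r0 = 0; // index standing in for the pointer held in r0
    let mut r1 = data.len(); // number of bytes left, as in r1
    let mut iterations = 0;

    // cbz r1, .LBB0_2: nothing to do for an empty slice
    if r1 == 0 {
        return iterations;
    }

    loop {
        let mut r2 = data[r0]; // ldrb r2, [r0]
        r1 -= 1; // subs r1, #1
        r2 <<= 1; // lsl.w r2, r2, #1 (the top bit is discarded)
        data[r0] = r2; // strb r2, [r0], #1 ...
        r0 += 1; // ... including the post-increment of r0
        iterations += 1;
        // bne .LBB0_1: go round again while r1 is not zero
        if r1 == 0 {
            break;
        }
    }
    iterations
}
```

Calling `scalar_model` on a 46-byte slice runs the body 46 times, once per byte.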
If we compile for opt-level=3, we get a lot of loop unrolling but no new instructions.
Let's now tell LLVM we have a Cortex-M55, and we want to do opt-level=2.
rustc sample.rs --emit asm --target=thumbv8m.main-none-eabi --edition 2021 --crate-type=rlib -C opt-level=2 -C target-cpu=cortex-m55
cat sample.s | grep -v -e "\s\.[a-z]" -e "^\."
We get:
double_values:
    push {r7, lr}
    mov r7, sp
    cmp r1, #0
    it eq
    popeq {r7, pc}
    dlstp.8 lr, r1
.LBB0_2:
    vldrb.u8 q0, [r0]
    vshl.i8 q0, q0, #1
    vstrb.8 q0, [r0], #16
    letp lr, .LBB0_2
    pop {r7, pc}
It didn't do any loop-unrolling! But what did it do?
- Return early if the length in r1 is zero
- Start a 'do loop', using lr to count how many items are left
- Load into q0 either 16 bytes or lr bytes, whichever is smaller
- Shift each byte in q0 left by one bit
- Store q0 to the pointer in r0, and then increment r0 by 16
- Decrease lr by the number of bytes just processed and, if it is not zero, return to the label .LBB0_2

As I understand it, the letp instruction isn't even really an instruction that needs to be executed. The loop state is held within the processor outside of the normal registers, and resetting PC back to the loop start and doing the loop decrement effectively takes zero cycles. This loop will take
The impressive thing here is that we didn't have to have any instructions for "handle this in units of 16 bytes, and once that's done, deal with whatever remainder is left over one byte at a time". It's all done in hardware - during a 'do loop', the vector instructions process either 16 bytes, or the number of bytes remaining if you're at the end of the slice. If we want to process 46 bytes, this loop repeats only 3 times - with bytes 0..=15, 16..=31 and 32..=45. Yet it's no bigger, in terms of code space, than the old opt-level=s loop.
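If it helps to see that "16 bytes, or whatever is left" behaviour spelled out, here is a rough software model of it in Rust. This is a sketch of my understanding, not of how the hardware is built: `tail_predicated_model`, `LANES` and the `iterations` counter are names I made up, `remaining` plays the part of lr, and `offset` plays the part of the pointer in r0.

```rust
/// Number of byte-sized lanes in one 128-bit vector register such as q0.
const LANES: usize = 16;

/// Hypothetical software model of the dlstp.8 / letp loop. Returns how many
/// times the loop body ran.
pub fn tail_predicated_model(data: &mut [u8]) -> usize {
    let mut remaining = data.len(); // element count, as held in lr
    let mut offset = 0; // stands in for the pointer in r0
    let mut iterations = 0;

    while remaining > 0 {
        // Only the lanes that still have data are active on this pass.
        let active = remaining.min(LANES);
        for byte in &mut data[offset..offset + active] {
            *byte <<= 1; // vshl.i8 q0, q0, #1, applied lane by lane
        }
        offset += LANES; // the store post-increments r0 by 16
        remaining = remaining.saturating_sub(LANES); // letp winds lr down
        iterations += 1;
    }
    iterations
}
```

For a 46-byte slice this model runs its body 3 times, with 16, 16 and 14 active lanes, and there is no separate clean-up loop for the last 14 bytes.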
With Rust and Armv8.1-M Branch and Loop Extensions, we get very small code and very fast code - an excellent combination. And I would like to see stable support for recompiling libcore to take advantage of these kinds of new instructions, or a new thumbv81m.main-none-eabi{hf,} pair of targets which default to using at least Integer MVE, if not Float MVE.