The Rust Platform Docs highlight lots of options you can turn on when programming an Arm microcontroller. But what do they do? Let's have a look!
Here's our sample source code:
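(What follows is a sketch of it: the `i32_add4`, `f64_add4` and `i8_i16_add` names come from the discussion below, while the exact signatures and the `f32_add4` name are assumptions.)

```rust
// Sketch of the sample, reconstructed from the discussion that follows.
// The signatures (and the f32_add4 name) are assumptions, not the original.
#![no_std]

/// Add four single-precision floats to four others, element-wise.
#[no_mangle]
pub fn f32_add4(a: &[f32; 4], b: &[f32; 4]) -> [f32; 4] {
    let mut out = [0.0; 4];
    for i in 0..4 {
        out[i] = a[i] + b[i];
    }
    out
}

/// Add four double-precision floats to four others, element-wise.
#[no_mangle]
pub fn f64_add4(a: &[f64; 4], b: &[f64; 4]) -> [f64; 4] {
    let mut out = [0.0; 4];
    for i in 0..4 {
        out[i] = a[i] + b[i];
    }
    out
}

/// Add each value in `a` to double the corresponding value in `b`.
#[no_mangle]
pub fn i32_add4(a: &[i32; 4], b: &[i32; 4]) -> [i32; 4] {
    let mut out = [0; 4];
    for i in 0..4 {
        out[i] = a[i] + 2 * b[i];
    }
    out
}

/// Sign-extend an `i8` and add it to an `i16`.
#[no_mangle]
pub fn i8_i16_add(a: i16, b: i8) -> i16 {
    a + i16::from(b)
}
```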
We're going to look at the Cortex-M7 (`thumbv7em-none-eabi`), the Cortex-M33 (`thumbv8m.main-none-eabi`) and the Cortex-M55 (`thumbv8m.main-none-eabi`). Other Arm CPUs will be similar, but this covers the basics. We're sticking to the soft-float ABI (`eabi`) so that we can turn the FPU on and off. If we used the hard-float ABI (`eabihf`), we would have to have the FPU enabled.
The Cortex-M7 can have either no FPU, a single-precision FPU (for processing `f32` values) or a double-precision FPU (for handling both `f32` and `f64` values). We use the `thumbv7em-none-eabi` target for this CPU, which assumes we have no FPU. Let's build our sample:
I've stripped out the assembler directives, because they are not interesting to us. Let's just look at the assembly code for each function:
This target doesn't use the FPU, so we're calling the `__aeabi_fadd` function to do the floating-point addition for us.
Pretty similar to the last one, except it's calling `__aeabi_dadd` to add two 'doubles' (double-precision floating-point values, a `double` in C or an `f64` in Rust).
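These `__aeabi_*` routines are ordinary functions provided by the compiler's runtime support library, and on the soft-float ABI they take their arguments in integer registers. Their signatures, written as Rust declarations purely for illustration (the sample never needs to declare them itself):

```rust
// The Arm run-time ABI helpers that soft-float builds call for floating-point
// arithmetic. Declared here only to show their shape.
extern "C" {
    fn __aeabi_fadd(a: f32, b: f32) -> f32; // single-precision add
    fn __aeabi_dadd(a: f64, b: f64) -> f64; // double-precision add
}
```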
For the `i32_add4` function, it's using `add.w` with an `lsl #1` operand to add a value shifted left by one (which is the same as multiplying by two), and it does it four times. I have no idea why it chose `ldrd` to load the first four values from the pointer in `r1`, and `ldm.w` and `ldr` to load the four values from the pointer in `r2`. You'd think they'd be loaded in the same way. It's also unclear why it reserves 12 bytes of stack, but only stores one value in that space (the original contents of `r11`).
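There's a reason the doubling never shows up as a multiply: adding a value shifted left by one place is identical to adding twice that value, so the compiler folds the `* 2` into the add's shifted operand. The identity, spelt out as (hypothetical) Rust helpers:

```rust
// Two ways of writing the same computation. The flexible second operand of
// `add.w` lets the shift happen inside the add, for free.
pub fn add_doubled_with_multiply(a: i32, b: i32) -> i32 {
    a.wrapping_add(b.wrapping_mul(2))
}

pub fn add_doubled_with_shift(a: i32, b: i32) -> i32 {
    a.wrapping_add(b.wrapping_shl(1))
}
```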
Every Armv7E-M CPU has DSP support, so for `i8_i16_add` it gives us a single `sxtab` instruction to do a sign-extend-and-add.
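In other words, `sxtab` takes the bottom byte of one register, sign-extends it, and adds it to another register, all in one instruction. A rough Rust equivalent of its semantics (a hypothetical helper, just for illustration):

```rust
/// What `sxtab Rd, Rn, Rm` computes: Rd = Rn + sign_extend(Rm[7:0]).
pub fn sxtab_equivalent(rn: i32, rm: i32) -> i32 {
    let low_byte = (rm & 0xFF) as u8 as i8; // bits 7:0, reinterpreted as signed
    rn.wrapping_add(i32::from(low_byte))
}
```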
-Ctarget-cpu=cortex-m7
Now let's tell Rust we have a Cortex-M7, with `-C target-cpu=cortex-m7`. This will cause LLVM to assume we have a Cortex-M7 with a double-precision FPU, because it always assumes the maximum set of features for a CPU.
Now it's copying the floats into the `s0` register, and friends, and using the `vadd.f32` instruction to do a floating-point add.
LLVM assumed we had a double-precision FPU, so it's copying the floats into the `d0` register, and friends, and using the `vadd.f64` instruction to do a double-precision floating-point add.
You can tell rustc that we only have a single-precision FPU with the `-C target-cpu=cortex-m7 -C target-feature=-fp64` arguments. Then this function will look exactly like it did for the plain build.
Our other two functions didn't change - they don't do any FPU stuff.
The Cortex-M33 can have either no FPU, or a single-precision FPU (for processing `f32` values). It can also optionally have the DSP extension. We use the `thumbv8m.main-none-eabi` target for this CPU. Let's build our sample:
If we diff the two outputs, we see the only thing that changed is the `i8_i16_add` function:
The `sxtab` instruction is from the DSP extension, which this target doesn't assume we have, so it has to do separate sign-extension and add operations here.
We can turn the DSP extension on with `-C target-feature=+dsp` and then this function looks like the Cortex-M7 version.
-Ctarget-cpu=cortex-m33
Now let's tell Rust we have a Cortex-M33, with `-C target-cpu=cortex-m33`. This will cause LLVM to assume we have a Cortex-M33 with a single-precision FPU, because it always assumes the maximum set of features for a CPU, and that's the best a Cortex-M33 can have.
Rather than load-load-add-load-load-add…, the scheduler has decided to emit load-load-load… add-add-add…. Perhaps that's because the Cortex-M33 has a shorter pipeline than the Cortex-M7. But basically these are the same instructions in a different order.
It now uses `vldr` to load each 64-bit value into a double-precision register, and `vmov` to split that register into two integer registers, one containing the top half and one containing the bottom half.
Before, for each item in the array it generated something like:
Now it generates:
Are `vldr` and `vmov` faster than `ldrd` and `mov`? I assume so. I guess you could look at the cycle counts in the processor specification if you wanted.
Our `i32_add4` function is basically unchanged, but arranged into a slightly different order. Our `i8_i16_add` is back to using the `sxtab` instruction, because LLVM assumed the Cortex-M33 has the optional DSP extension enabled. You can turn it off with `-C target-feature=-dsp` if yours doesn't have it.
The Cortex-M55 is interesting because it optionally has the M-Profile Vector Extensions, also known as MVE or Helium. Our target is still `thumbv8m.main-none-eabi`, so the plain build won't have changed from the Cortex-M33 one; let's go straight to telling LLVM we have a Cortex-M55.
Hot diggity! MVE gives us the `vldrw.u32` instruction, which can pull four 32-bit values from memory into our new 128-bit `qX` registers. The `vadd.f32` instruction can operate on those `qX` registers to perform four additions at the same time. So our whole loop is now just five instructions.
Sadly MVE doesn't support double-precision floats, so the `f64_add4` function is unchanged. However…
It can do integers too, so again, just five instructions to double four values and then perform four additions.
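The fixed-size functions only need a single pass through the vector unit, but the same pattern scales up: hand LLVM a plain element-wise loop over slices and, with MVE enabled and optimisations on, it has a chance to keep those `qX` registers busy across a whole buffer. A sketch (this function isn't part of the original sample, and whether it actually vectorises depends on the optimisation settings):

```rust
/// Element-wise add over slices of arbitrary length - a candidate for
/// auto-vectorisation when MVE is available, though that's not guaranteed.
#[no_mangle]
pub fn f32_add_slices(out: &mut [f32], a: &[f32], b: &[f32]) {
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}
```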
If you want MVE without specifying that you have the Cortex-M55 (or Cortex-M85), it's the `+mve` flag for integer MVE and `+mve.fp` for floating-point MVE.
Again, we have DSP enabled by default, so it has the `sxtab` version of `i8_i16_add`.
We've seen that changing the `target-cpu` can not only change the order of the instructions emitted, but also enable additional CPU features (that we may or may not actually have on our CPU). These can drastically reduce the size of our code and increase performance - especially when using the M-Profile Vector Extensions.
Why not try coming up with some algorithms in Rust and seeing what effect the different Arm architectures have and what the different `target-cpu` options will do? It's all listed in the Rust Platform Docs.
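For example, even something as small as a dot product comes out quite differently with no FPU, a single-precision FPU, or MVE. A suggested starting point (not code from this article):

```rust
/// A little experiment to compile for each target-cpu and compare.
/// Multiply-accumulate loops show off the FPU and MVE options nicely.
#[no_mangle]
pub fn dot_product(a: &[f32; 8], b: &[f32; 8]) -> f32 {
    let mut acc = 0.0f32;
    for i in 0..8 {
        acc += a[i] * b[i];
    }
    acc
}
```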