# Memory Functions
This whole document is released as [Creative Commons 0 (CC0)][cc0].
[cc0]: https://creativecommons.org/share-your-work/public-domain/cc0/
More usable code for this article, including tests, can be found in the [github repository](https://github.com/Lokathor/arm7tdmi_mem_fns).
# Introduction
The "memory functions" do two basic kinds of things:
* They copy data from `src` to `dest`, and the two regions might or might not overlap.
* They set data in `dest` to a specific value, without reading from a source region.
Depending on the details, there's various optimizations that you can apply to these tasks.
This means that a full suite of memory functions will usually have a few similar functions, each for a slightly different task.
In Rust the memory functions are assumed to exist as part of the basic code provided by the platform.
These memory functions are called very often, and so they should be finely tuned to be as fast as possible on each platform.
Rust assumes that the platform developers know what they're doing, and that they will provide fast versions.
If your platform doesn't do this, you can still get basic versions of the memory functions from the [compiler-builtins](https://github.com/rust-lang/compiler-builtins) crate.
Normally this is all sorted out for you automatically.
When we're programming Rust for the Gameboy Advance (GBA) we use the `thumbv4t-none-eabi` target.
This is a "bare metal" target, with no provided code libraries.
This means that a Rust program on the GBA gets the basic memory functions from `compiler-builtins`.
Except we don't want the basic versions.
LLVM is reasonable at codegen, but sometimes it misses stuff.
Also, we want to have the linker put our memory functions into IWRAM rather than ROM.
Luckily, the `compiler-builtins` functions are only "weak linked" into the program.
This means that we can override their version of a function just by providing our own version.
All we have to do to fix our situation is "just" define our own memory functions.
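Just to picture what "providing our own version" looks like from the Rust side, here's a rough sketch (**not** the fast version we're building toward in this article, and your project may also want a linker section attribute to place it in IWRAM): an exported `#[no_mangle]` function with the matching C signature takes the place of the weakly linked `compiler-builtins` one.
```rust
// A deliberately naive byte-at-a-time `memcpy`, just to show the override shape.
// The strong `#[no_mangle]` symbol wins over the weakly linked builtin version.
#[no_mangle]
pub unsafe extern "C" fn memcpy(dest: *mut u8, src: *const u8, count: usize) -> *mut u8 {
    let mut i = 0;
    while i < count {
        *dest.add(i) = *src.add(i);
        i += 1;
    }
    dest
}
```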
To follow this article you need to be able to read the basics of ARM assembly.
Normally one might suggest the official docs for this, but the ARM developer website changes often enough that old links seem to break all the time.
Instead, the guide I'd suggest is from [Azeria Labs][azeria-labs], which covers the fundamental ideas nicely in a few short pages.
[azeria-labs]: https://azeria-labs.com/writing-arm-assembly-part-1/
We'll be using the modern ARM assembly instruction syntax, where conditional instructions always have the condition on the end (eg: `ldrheq`).
This is the default for LLVM (and so for Rust too).
If you read the original ARMv4T manuals from 1994 or whenever, they show the old-style syntax, where the conditional for some instructions goes "inside" the rest of the instruction (eg: `ldreqh`).
This is the default parsing mode for the GNU assembler.
If you want to use the modern syntax with the GNU assembler put the `.syntax unified` directive at the top of the file.
## Function Signature Goals
For this issue we want to approach things in little pieces at a time, and build the pieces up into a final "complete" thing.
To do that, we need to start by checking what we're even trying to build.
First up is the memory functions that are normally provided by a platform's `libc`.
I know we're programming Rust here, but since they're defined as C functions we'll look at the signature in the C style.
```c
void *memcpy(void *dest, const void *src, size_t count);
void *memmove(void *dest, const void *src, size_t count);
void *memset(void *dest, int ch, size_t count);
```
* "memory copy", for when the two regions don't overlap.
* "memory move", for when the two regions *might* overlap.
* "memory set", which sets all bytes to be the `u8` passed in `ch`. The `ch` value is passed as an `int` because this `memset` actually *predates* the ability to specify argument types in a function signature. Just go with it.
The `count` argument is always the count *in bytes*.
Oh, and the return value is always the original `dest` pointer that was passed in.
That lets you potentially do C's version of call chains where you write code like
```C
op_two(op_one(buffer, arg1), arg2);
```
On a lot of platforms that might be all you'd need to think about.
However, with targets that follow the ARM EABI (that includes us), there's some additional memory functions too.
LLVM will use the ARM EABI functions when appropriate, so we want these to be as fast as possible as well.
```c
void __aeabi_memcpy8(void *dest, const void *src, size_t n);
void __aeabi_memcpy4(void *dest, const void *src, size_t n);
void __aeabi_memcpy(void *dest, const void *src, size_t n);
void __aeabi_memmove8(void *dest, const void *src, size_t n);
void __aeabi_memmove4(void *dest, const void *src, size_t n);
void __aeabi_memmove(void *dest, const void *src, size_t n);
void __aeabi_memset8(void *dest, size_t n, int c);
void __aeabi_memset4(void *dest, size_t n, int c);
void __aeabi_memset(void *dest, size_t n, int c);
void __aeabi_memclr8(void *dest, size_t n);
void __aeabi_memclr4(void *dest, size_t n);
void __aeabi_memclr(void *dest, size_t n);
```
These largely work like the `libc` versions except that:
* There's no return value. This might seem unimportant, but it actually does allow a small optimization.
* There's "4" and "8" versions of the functions. These functions can be used when the caller knows that the pointers are already aligned to 4 or 8, which allows significant optimization. The GBA's CPU is old enough that alignment to 8 doesn't help, but alignment to 4 will be much faster. Note that the number of bytes passed to the aligned functions **does not** need to be a multiple of the alignment. You could call `__aeabi_memcpy4` and tell it to copy 5 bytes, if you really want.
* `__aeabi_memclr` ("memory clear") is a new group of functions that work like "memory set", but the byte value to set is just implied to be 0.
* The `__aeabi_memset` functions have the argument order swapped compared to `memset`. Now the byte to set is the third argument. This allows `__aeabi_memclr` implementations to share most of their code with `__aeabi_memset` implementations.
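If it helps to see that last point concretely, here's a rough Rust sketch of the pair (byte-at-a-time and completely unoptimized, just showing the argument order trick):
```rust
// Because `dest` and `n` are already the first two arguments, "memory clear"
// can just supply the byte value and defer to "memory set" without having to
// shuffle any arguments around.
pub unsafe extern "C" fn __aeabi_memclr(dest: *mut u8, n: usize) {
    __aeabi_memset(dest, n, 0);
}

pub unsafe extern "C" fn __aeabi_memset(dest: *mut u8, n: usize, c: i32) {
    let byte = c as u8;
    for i in 0..n {
        dest.add(i).write(byte);
    }
}
```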
## Copy Ordering
When we're accessing a bunch of memory, accessing the memory in order is just faster.
This is very true on a modern system, but even the old GBA cares.
Accesses are categorized into "sequential" ("S") and "non-sequential" ("N").
The exact time taken varies by game pak, but the usual timings are that sequential accesses take 2 cycles and non-sequential accesses take 4 cycles.
Obviously we should always write our copy loops to only use sequential accesses, right?
Just in case we're accessing the ROM as part of a copy?
Well, it's not so simple (it never is).
Imagine we have some memory like this.
(We'll use a little-endian layout for our diagram, since the GBA is a little-endian device.)
| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| H | G | F | E | D | C | B | A |
Let's imagine we want to copy a 4 byte span, with the destination starting at 2 and the source starting at 0.
* First we read A from 0 and write it to 2.
* Next we read B from 1 and write it to 3.
* Next we *should have* read C from 2, but we already overwrote that value in the first step!
When our two regions to copy between *might* overlap (like with `memmove`), we need to be sure that the overlapping portion gets handled first so that the source data is not destroyed by our own writes to the destination.
We do that by copying in the correct direction.
We'll call the sequential version "forward", and the non-sequential version "reverse".
For each copying loop we design we'll need to write both the forward and reverse version.
Then we pick between the two versions like this:
```rust
fn copy_overlapping(d: *mut u8, s: *const u8, count: usize) {
if (d as *const u8) < s {
copy_forward(d, s, count);
} else {
copy_reverse(d, s, count);
}
}
```
And if we *know* the regions don't overlap (such as copying ROM data to RAM), then we can just always use the forward loop like we wanted.
## Compiler Explorer
The ever handy [Compiler Explorer][compiler-explorer] website lets us write some Rust and see the assembly output.
Unfortunately for us, Compiler Explorer isn't set up to build for an ARMv4T target.
Right now Compiler Explorer only supports Rust's Tier 1 and Tier 2 targets.
Instead, we have to have CE show us assembly for `armv5te-unknown-linux-gnueabi`, which is the oldest ARM target that's Tier 2.
This means that **sometimes** we're going to see assembly output that isn't actually legal for us to use on the GBA (eg: a `blx` instruction, popping an `lr` value on the stack directly into `pc` instead of using `bx lr`, etc).
But *mostly* it's close enough that we can use it as an approximate starting point.
[compiler-explorer]: https://rust.godbolt.org/
There's also always the "[Code Size vs Code Speed][size-vs-speed]" trade off to consider.
We'll want to look at the output of a few different optimization levels and pay attention to the differences.
[size-vs-speed]: https://docs.rust-embedded.org/book/unsorted/speed-vs-size.html
# Copy By `u8`
The first loops we'll look at will work exclusively in 8-bit units.
> Hey Lokathor, isn't that silly?
> Because I thought you said we wanted to go *fast* with these copies!
Yes, I know friends, I know.
But it's not as silly as it might seem.
There's two main reasons to write a `u8`-only copy loop.
1) When accessing SRAM, the program **must** (in the [RFC 2119][rfc2119] sense) run the code out of Work Ram rather than ROM, and **must** access using only 8-bit accesses (`ldrb`, `ldrsb`, `strb`).
2) It's educational! We're starting small as a warm-up. We'll try out some basic techniques we can apply to bigger examples, that sort of thing.
[rfc2119]: https://datatracker.ietf.org/doc/html/rfc2119
> **Cool Loka's Hot Tip:** Technically speaking we should be using [MaybeUninit][mu] because the memory functions are defined by Rust to be "untyped" copies that ignore the initialization state of the memory involved, but since we're going to replace the Rust code with assembly code in the final version anyway I'm going to skip that part because it's a lot of visual clutter we don't need to just see the loop structure.
[mu]: https://doc.rust-lang.org/nightly/core/mem/union.MaybeUninit.html
So here's what our Rust code looks like:
```rust
#![no_std]
pub unsafe extern "C" fn copy_u8_forward(mut dest: *mut u8, mut src: *const u8, mut count: usize) {
const SIZE_U8: usize = core::mem::size_of::<u8>();
while count >= SIZE_U8 {
*dest = *src;
dest = dest.add(SIZE_U8);
src = src.add(SIZE_U8);
count -= SIZE_U8;
}
}
pub unsafe extern "C" fn copy_u8_reverse(mut dest: *mut u8, mut src: *const u8, mut count: usize) {
const SIZE_U8: usize = core::mem::size_of::<u8>();
while count >= SIZE_U8 {
count -= SIZE_U8;
*dest.add(count) = *src.add(count);
}
}
```
And here's what it looks like in the different optimization modes:
* `opt-level=3`: https://rust.godbolt.org/z/5x7vzcEfM
* `opt-level=s`: same as opt=3
* `opt-level=z`: https://rust.godbolt.org/z/1rKhGM8v7
In this case, opt=3 (fast) and opt=s (small) outputs are the same, and opt=z (aggressively small) cuts out an early return when the copy length is 0.
Let's take a close look, starting with `copy_u8_forward`.
We're going to go *really* slow for the first function or two, and then speed up a bit once we've got the basics down.
## Forward `u8`
To start, let's use the `z` output as our base.
We're going to assume that people won't call the function with a zero sized copy, so we probably don't need the extra early return check to try and minimize work.
To be clear: people *could* call with a copy size of 0, and our function must be correct if that happens, but we'll *assume* that the copy size is larger than 0 even if it makes the 0-size copy case be slightly less fast.
Generally, any memory function like this should assume that the amount of memory to manipulate is *larger*, rather than *smaller*, because when the amount of memory to manipulate is smaller the compiler will just do a few reads and writes on its own without calling a memory function.
Okay here we go:
```arm
@ Starting with the opt=z output
copy_u8_forward:
.LBB0_1:
cmp r2, #0
bxeq lr
ldrb r3, [r1], #1
sub r2, r2, #1
strb r3, [r0], #1
b .LBB0_1
```
To start, we've removed the `example::` prefix from the label name.
Since the label part of a line ends with `:` there can't be a `::` in the label name.
Technically the *real* label that `rustc` used for the function was `_ZN7example15copy_u8_forward17hc841ec0dfd74b8c3E`, and then Compiler Explorer just de-mangled that for us when showing the output.
Since we don't want to bother with mangling, we'll keep our labels plain and simple.
Also, we can eliminate the local label and just jump to the top of the function using its non-local label.
```arm
copy_u8_forward:
cmp r2, #0
bxeq lr
ldrb r3, [r1], #1
sub r2, r2, #1
strb r3, [r0], #1
b copy_u8_forward
```
Okay, so, can we improve this code?
Well, quite honestly, we probably can.
LLVM isn't that good at taking full advantage of the [Conditional Execution][conditional-execution] that all our ARM instructions allow.
[conditional-execution]: https://azeria-labs.com/arm-conditional-execution-and-branching-part-6/
The first suspicious part is that there's `cmp` and `sub`, both for the same value in the same loop.
For later ARM devices this is basically fine, since the result of an `ldr` isn't available on the next cycle.
Something needs to go between the `ldr` and the `str`; if you don't have anything there the CPU just stalls for a cycle anyway, so putting the `sub` in that slot is fine.
For this early ARM chip in the GBA we don't need the gap, so maybe we can move around the `sub` and shorten the loop.
* `cmp a, b` sets the CPU status flags based on the result of `a-b`, without storing that output to any register.
* `sub d, a, b` does `d = a-b` *without* setting the flags.
* There's also `subs d, a, b` which is similar but *with* setting the flags.
This means if we see `cmp` and `sub` in the same loop, we might be able to use `subs` instead.
It's fine to move the `sub` earlier in the loop.
The `bxeq` and `ldrb` don't interact with `r2` at all, so if `r2` gets subtracted earlier or later they don't care.
Let's check what that looks like.
```arm
copy_u8_forward:
cmp r2, #0
sub r2, r2, #1
bxeq lr
ldrb r3, [r1], #1
strb r3, [r0], #1
b copy_u8_forward
```
So now we can replace `cmp` and `sub` with a `subs` instruction if we change the condition on the `bxeq`.
Currently we break out of the loop if `cmp` finds out that the count is 0 (if `count - 0 == 0`).
However, if we use `subs` then we're setting the flags based on the value *after* we subtract 1, not *before* we subtract 1.
If we had 1, and subtract 1, and get 0, then the counter was 1 when we entered the loop, so we want to do the loop one final time to copy the final byte.
What we need to do is stop the loop when we have 0, subtract 1, and underflow the value to `u32::MAX`.
For `subs` the change in flags work as follows:
* N Flag = 1 if result (as a signed value) is negative
* Z Flag = 1 if result is 0
* C Flag = 1 if the subtraction **did not** produce an unsigned underflow. For addition the carry is an unsigned overflow flag, but for subtraction the meaning is inverted ([see the ARM docs][arm-carry-flag]), so don't forget.
* V Flag = 1 if the result of the operation is greater than or equal to 2<sup>31</sup>, or less than -2<sup>31</sup>.
[arm-carry-flag]: https://developer.arm.com/documentation/dui0801/g/Condition-Codes/Carry-flag
So as an example, `3 - 1` gives `2` with flag N=0, Z=0, C=1, V=0.
And when we do `0 - 1`, underflowing to `u32::MAX`, the flags are N=1, Z=0, C=0, V=0.
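If you want to double check flag math like that yourself, here's a tiny plain-Rust model of how `subs` sets N, Z, and C (V left out), matching the two examples above:
```rust
// Returns (result, n, z, c) for `subs result, a, b`.
// Remember: C=1 means the subtraction did NOT produce an unsigned underflow.
fn subs_flags(a: u32, b: u32) -> (u32, bool, bool, bool) {
    let (result, borrowed) = a.overflowing_sub(b);
    (result, (result as i32) < 0, result == 0, !borrowed)
}
// subs_flags(3, 1) == (2, false, false, true)         N=0, Z=0, C=1
// subs_flags(0, 1) == (u32::MAX, true, false, false)  N=1, Z=0, C=0
```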
That means the condition for our `bx` needs to be `cc` ("carry clear").
```arm
copy_u8_forward:
subs r2, r2, #1
bxcc lr
ldrb r3, [r1], #1
strb r3, [r0], #1
b copy_u8_forward
```
Neat.
This is actually where I stopped when I first started writing this article.
I said to myself, "that's as tight a loop as we can make it", and I moved on.
Then [felixjones](https://github.com/felixjones) on GitHub looked at the code and said "hey get that `bx` out of your loop".
See, we can't forget that *every* instruction can be conditional.
We can move the `bx` out of the loop to the end, and change the `b` to a `bgt` so that we only restart the loop when there's a non-zero count remaining at the end of each pass. Finally, we can put a `cs` on the `ldrb` and `strb` instructions so that if the function is called with a count of 0 we safely pass over those instructions.
```arm
copy_u8_forward:
subs r2, r2, #1
ldrbcs r3, [r1], #1
strbcs r3, [r0], #1
bgt copy_u8_forward
bx lr
```
And, yeah.
This is nicer.
With the `bx lr` outside the loop.
That's why you show your code to other folks and have them give it a check.
## Reverse `u8`
LLVM wants our opt=z reverse loop to look like this:
```arm
copy_u8_reverse:
sub r1, r1, #1
sub r0, r0, #1
.LBB1_1:
cmp r2, #0
bxeq lr
ldrb r3, [r1, r2]
strb r3, [r0, r2]
sub r2, r2, #1
b .LBB1_1
```
That's not so great.
It's the same problem as above, LLVM is placing the subtraction at the wrong time, so we've got all sorts of extra work going on.
We can just change it to `subs` and have our conditions like before.
The difference is that we're going to use "offset addressing" instead of "post-indexed" addressing.
Instead of the base pointers moving (after the access) with each loop, they stay the same and the count is what moves.
```arm
copy_u8_reverse:
subs r2, r2, #1
ldrbcs r3, [r1, r2]
strbcs r3, [r0, r2]
bgt copy_u8_reverse
bx lr
```
Now the copy starts with the final byte, then one byte back, then one byte back, until it copies the 0th byte.
Super easy, no trouble at all.
# Copy By (Mostly) `u16`
In practice, other than with SRAM, we'd never want to copy by only `u8`.
The next size up is `u16` ("halfword").
We might want to use this if we're working with stuff going into VRAM.
Modes 3, 4, and 5, along with the color palettes, are all 2-byte aligned data.
VRAM has the fun quirk where `u8` *reads* are fine, but `u8` *writes* affect both bytes within the halfword.
We need to be careful to avoid `u8` writes when writing to VRAM.
So we'll have some functions for "mostly `u16`".
It's "mostly" because they'll use halfword copies as much as possible.
If you do request a copy length that's an odd number then the function will be forced to do one byte copy, but otherwise you can be sure you'll get halfword copies.
> **Cool Loka's Hot Tip:** The [MaybeUninit][mu] thing from before still applies.
```rust
#![no_std]
pub unsafe extern "C" fn copy_u16_forward(mut dest: *mut u8, mut src: *const u8, mut count: usize) {
const SIZE_U16: usize = core::mem::size_of::<u16>();
while count >= SIZE_U16 {
dest.cast::<u16>().write(src.cast::<u16>().read());
dest = dest.add(SIZE_U16);
src = src.add(SIZE_U16);
count -= SIZE_U16;
}
if count > 0 {
dest.cast::<u8>().write(src.cast::<u8>().read());
}
}
pub unsafe extern "C" fn copy_u16_reverse(mut dest: *mut u8, mut src: *const u8, mut count: usize) {
const SIZE_U16: usize = core::mem::size_of::<u16>();
if (count % SIZE_U16) > 0 {
count -= core::mem::size_of::<u8>();
dest.add(count).write(src.add(count).read());
}
while count >= SIZE_U16 {
count -= SIZE_U16;
dest.add(count).cast::<u16>().write(src.add(count).cast::<u16>().read());
}
}
```
* `opt-level=3`: https://rust.godbolt.org/z/hen7n88xE
* `opt-level=s`: same as opt=3
* `opt-level=z`: https://rust.godbolt.org/z/n4YzYjT3b
Just like with the `u8` loops, opt=3 and opt=s give us the same output, and only opt=z goes for the extremely compact form.
## Forward `u16`
So LLVM wants us to do one of these:
```arm
@ opt 3
copy_u16_forward:
cmp r2, #2
blo .LBB0_2
.LBB0_1:
ldrh r3, [r1], #2
sub r2, r2, #2
strh r3, [r0], #2
cmp r2, #1
bhi .LBB0_1
.LBB0_2:
cmp r2, #1
ldrbeq r1, [r1]
strbeq r1, [r0]
bx lr
@ opt z
copy_u16_forward:
.LBB0_1:
cmp r2, #1
bls .LBB0_3
ldrh r3, [r1], #2
sub r2, r2, #2
strh r3, [r0], #2
b .LBB0_1
.LBB0_3:
ldrbeq r1, [r1]
strbeq r1, [r0]
bx lr
```
This isn't ideal.
As with the `u8` code, LLVM is trying to optimize for the *small* case to be faster.
What we want is for our *common* case to be faster, and for us it's the *large* case that's common.
If we take our `u8` forward loop and just change a few bits to make the main loop go by halfwords it looks like this.
```arm
copy_u16_forward:
subs r2, r2, #2
ldrhcs r3, [r1], #2
strhcs r3, [r0], #2
bgt copy_u16_forward
bx lr
@ todo: handle the spare byte, if any
```
So we subtract 2.
If we **didn't** go below 0 then we had 2 or more before the subtract.
So we copy a halfword and advance our pointers.
Then if the count is still greater than 0 we do it again.
Except we can't have an unconditional return any more.
After the subtraction the result could be
* more than 0, so the `bgt` catches it and the loop goes again.
* 0, so the loop started at 2, and we return (we can use `bxeq`).
* -1, in which case we had 1 at the start of the loop, so copy a byte and return.
* -2, in which case we had 0 at the start of the loop, so just return.
We can decide between the -1 and -2 cases by adding 1 back to the count.
If we land on 0 then that means we should copy the extra byte.
```arm
copy_u16_forward:
subs r2, r2, #2
ldrhcs r3, [r1], #2
strhcs r3, [r0], #2
bgt copy_u16_forward
bxeq lr
adds r2, r2, #1
ldrbeq r3, [r1]
strbeq r3, [r0]
bx lr
```
Don't forget about the -2 case and think that we can just remove the `adds` and unconditionally copy the last byte once we're after the `bxeq lr`.
That would be a bad time for sure.
## Reverse `u16`
With the reverse loop we have to handle a possible odd byte count *before* we do our halfwords.
LLVM wants us to do it like this
```arm
@ opt=3
copy_u16_reverse:
tst r2, #1
beq .LBB1_2
sub r2, r2, #1
ldrb r3, [r1, r2]
strb r3, [r0, r2]
.LBB1_2:
cmp r2, #2
bxlo lr
sub r12, r0, #2
sub r1, r1, #2
.LBB1_4:
add r0, r1, r2
add r3, r12, r2
sub r2, r2, #2
ldrh r0, [r0]
cmp r2, #1
strh r0, [r3]
bhi .LBB1_4
bx lr
@ opt=z
copy_u16_reverse:
tst r2, #1
beq .LBB1_2
sub r2, r2, #1
ldrb r3, [r1, r2]
strb r3, [r0, r2]
.LBB1_2:
sub r12, r0, #2
sub r1, r1, #2
.LBB1_3:
cmp r2, #1
bxls lr
add r0, r1, r2
add r3, r12, r2
sub r2, r2, #2
ldrh r0, [r0]
strh r0, [r3]
b .LBB1_3
```
The opt=3 version has an 8 instruction halfword loop with `bx` on the outside of the loop (cool).
The opt=z version has an 8 instruction halfword loop with `bxls` on the inside of the loop (not cool).
So the opt=z version *saved no space* compared to the opt=3 version (8 instructions either way), and yet it's slower.
I dunno what LLVM is doing, but that doesn't seem like a good time.
There's definitely situations where an optimizer can do some weird math transformation or something and give you better code than you'd know to write on your own, but this is an example of when the opposite happens.
For whatever reason, the optimizer fell over on the job with this one.
One thing that both LLVM versions agree on is starting the function with a `tst` instruction.
This is like `cmp` in the sense that it does an operation just to set the flags, without storing the output.
The difference is which operation is used to set the flags.
* `cmp a, b` sets the flags based on `a - b`
* `tst a, b` sets the flags based on `a & b`
So if we do `tst r2, #1` we'll be able to find out if our count is starting as an odd value.
That makes sense, and we can agree with LLVM that it's the correct opening move for this function.
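As a plain-Rust way of stating the difference (purely a model, not code we'll use anywhere), the two flag setters look like this:
```rust
// `cmp` sets flags from a subtraction, `tst` sets flags from an AND,
// and neither one stores its result anywhere except the flags.
fn cmp_sets_z(a: u32, b: u32) -> bool {
    a.wrapping_sub(b) == 0
}
fn tst_sets_z(a: u32, b: u32) -> bool {
    (a & b) == 0
}
// `tst count, #1`: Z ends up set exactly when the count is even,
// so `ne` means "odd count" and `eq` means "even count".
```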
After the `tst` LLVM jumps ahead in the function if we're on the even-count case, and then the odd-count case falls through to a byte copy.
That's no good.
Our ideal path should be the path with the *least* branches, so we want to re-arrange the code some.
We'll put the byte copy at the end of the function as a little sub-routine thing.
Then the odd-count case can jump there (and jump back to the halfword loop after the byte copy), while the even-count case just doesn't jump at all.
Either way, we get to the halfword loop, which has the same basic structure as our `u8` reverse loop.
We're just going by 2 instead of by 1.
```arm
copy_u16_reverse:
tst r2, #1
bne 2f
1:
subs r2, r2, #2
ldrhcs r3, [r1, r2]
strhcs r3, [r0, r2]
bgt 1b
bx lr
2:
sub r2, r2, #1
ldrb r3, [r1, r2]
strb r3, [r0, r2]
b 1b
```
Looks like LLVM's opt=z version was 15 instructions, and we've narrowed it down to 11 instructions.
Not bad at all.
# Copy By (Mostly) `u32`
Here's where things can get really interesting.
The biggest data copies that we'd have to do on the GBA are probably things to do with copying around tile data.
Tiles are always 8x8, and they can be 4-bits-per-pixel (4bpp) or 8-bits-per-pixel (8bpp).
So that's either 32 bytes or 64 bytes, just for a single tile.
And most things to do with tiles involve more than one tile.
We still might copy small stuff, say 12 or 16 bytes,
but when the byte count to copy starts to get high it'll quite possibly get *extremely* high.
If there's a "large" creature that needs 4 by 4 tiles to draw, even at just 4bpp that's gonna be a 512 byte copy.
So we'll finally have to be very careful that we support both extremely large copies (hundreds of bytes) and also moderate copies (32 bytes or less).
We can't really let either code path take a speed hit.
Let's try a basic rust version modeled after the `u8` and `u16` loops we've seen:
```rust
#![no_std]
pub unsafe extern "C" fn copy_u32_forward(mut dest: *mut u8, mut src: *const u8, mut count: usize) {
const SIZE_U32: usize = core::mem::size_of::<u32>();
while count >= SIZE_U32 {
dest.cast::<u32>().write(src.cast::<u32>().read());
dest = dest.add(SIZE_U32);
src = src.add(SIZE_U32);
count -= SIZE_U32;
}
while count > 0 {
*dest = *src;
dest = dest.add(1);
src = src.add(1);
count -= 1;
}
}
pub unsafe extern "C" fn copy_u32_reverse(mut dest: *mut u8, mut src: *const u8, mut count: usize) {
const SIZE_U32: usize = core::mem::size_of::<u32>();
while (count % SIZE_U32) > 0 {
count -= core::mem::size_of::<u8>();
dest.add(count).write(src.add(count).read());
}
while count >= SIZE_U32 {
count -= SIZE_U32;
dest.add(count).cast::<u32>().write(src.add(count).cast::<u32>().read());
}
}
```
* `opt-level=3`: https://rust.godbolt.org/z/sPPGvhnPT
* `opt-level=s`: https://rust.godbolt.org/z/7rr73T13W (finally a difference from opt=3!)
* `opt-level=z`: https://rust.godbolt.org/z/MfsoexYnE
Okay, one thing is that ARMv4T has `ldm` (load-multiple) and `stm` (store-multiple).
These let you load and store several words at once, which makes bulk copying more efficient.
It's how `push` and `pop` work for the stack, but with any pointer and block of memory.
I'm not seeing any of those used in any of the optimization levels we're looking at.
Let's see if we can convince LLVM to use `ldm` and `stm`.
We'll have it copy an entire `[u32; 8]` per loop (the size of one 4bpp tile), that might do the trick:
```rust
pub unsafe extern "C" fn copy_u32_forward(mut dest: *mut u8, mut src: *const u8, mut count: usize) {
const SIZE_U32: usize = core::mem::size_of::<u32>();
while count >= (SIZE_U32 * 8) {
dest.cast::<[u32;8]>().write(src.cast::<[u32;8]>().read());
dest = dest.add(SIZE_U32 * 8);
src = src.add(SIZE_U32 * 8);
count -= (SIZE_U32 * 8);
}
while count >= SIZE_U32 {
dest.cast::<u32>().write(src.cast::<u32>().read());
dest = dest.add(SIZE_U32);
src = src.add(SIZE_U32);
count -= SIZE_U32;
}
while count > 0 {
*dest = *src;
dest = dest.add(1);
src = src.add(1);
count -= 1;
}
}
```
So while there's at least 8 words left to go, we'll block copy 8 words.
And for this version, LLVM's [opt=3](https://rust.godbolt.org/z/sarjd7d9a) gives us...
```arm
example::copy_u32_forward:
push {r4, r5, r6, lr}
mov r4, r2
mov r5, r1
mov r6, r0
cmp r2, #32
blo .LBB0_2
.LBB0_1:
mov r0, r6
mov r1, r5
mov r2, #32
bl memmove
sub r4, r4, #32
add r5, r5, #32
add r6, r6, #32
cmp r4, #31
bhi .LBB0_1
.LBB0_2:
cmp r4, #4
blo .LBB0_4
.LBB0_3:
ldr r0, [r5], #4
sub r4, r4, #4
str r0, [r6], #4
cmp r4, #3
bhi .LBB0_3
.LBB0_4:
cmp r4, #0
beq .LBB0_6
.LBB0_5:
ldrb r0, [r5], #1
subs r4, r4, #1
strb r0, [r6], #1
bne .LBB0_5
.LBB0_6:
pop {r4, r5, r6, pc}
```
CURSE YOU BLASTED DRAGON MACHINE.
We're trying to *implement* `memmove`, you can't *call* `memmove` while we're *implementing* it!!!
Alright, fine, this handwritten assembly will just look even *less* like the LLVM assembly than normal.
## Forward `u32`
### Slow But Simple `u32` Forward
To begin, let's again take our `u8` loop and just update it to go by `u32`.
```arm
copy_u32_forward:
subs r2, r2, #4
ldrcs r3, [r1], #4
strcs r3, [r0], #4
bgt copy_u32_forward
bxeq lr
@ todo: handle the spare bytes, if any
```
We load (`ldr`) and store (`str`) one word at a time when the carry flag is set (`cs`),
meaning that the subtraction *didn't* cause an unsigned underflow,
meaning that we started the loop with 4 or more bytes in the count.
After the word copy:
* If: the bytes left are greater than 0 we go back to the start of the loop (`bgt`)
* Else If: the count is now *exactly* zero we return (`bxeq`)
* Else: we started the loop with *less than* 4 bytes, then subtracted 4, so we've got *some amount* negative in our counter.
In the `u16` "else" case we had either 0 or 1 spare bytes we also wanted to copy.
We resolved it with `adds`, a single conditional copy, and then an unconditional return.
In the `u32` "else" case we've got 0, 1, 2, or 3 spare bytes to copy.
That's enough possibilities that it calls for a loop.
At this point in the code the *current* count is 4 less than we're actually supposed to copy.
If we need 1 more byte copied, the count will be -3.
If there's 2 more to copy then -2, 3 more is -1.
Also, if we entered the loop with 0, we'll actually be at -4.
This means that we **can't** just loop by adding 1 to our counter and continuing until we get to 0.
That would give us like the inverse of the correct number of additional bytes.
The simplest solution in this scenario is to just add 4 back into the counter (to "undo" the most recent substract) and do a normal `u8` copy loop.
We could either call over to `copy_u8_forward`, or we could inline that loop into our `u32` function.
Any type in Rust that's got an alignment of 4 will also have a size that's a multiple of 4.
This means that handling the 1-3 spare bytes case is fairly rare.
* If we branch over to `copy_u8_forward` we pay 3 CPU cycles when we hit this case (a `b` takes 1N+2S, which is 3 cycles total for IWRAM code).
* If we inline that loop here we remove the one branch but it takes up an additional 20 bytes of IWRAM space.
I'd rather pay the 3 CPU cycles on rare occasions and save the IWRAM space.
So our fully working `u32` copy looks like this:
```arm
copy_u32_forward:
subs r2, r2, #4
ldrcs r3, [r1], #4
strcs r3, [r0], #4
bgt copy_u32_forward
bxeq lr
add r2, r2, #4
b copy_u8_forward
```
To be clear: this is **not** as fast as possible for large copies.
It does at least *work* in all cases, though.
If we want to go faster than this, we'll need those `ldm` and `stm` instructions.
### `ldm` and `stm` loop for `u32` Forward
If we use `ldm` we can load more than one register up with a series of words starting from a base address.
Technically the `ldm` instruction has several possible addressing modes, but if you don't specify one it defaults to `ia`, "increment after".
It takes a register list, and for each register that you load into, the pointer will advance one word during the loading, so that it'll load a series of successive words.
If you put `!` after the base register then the final address value is written back into that register, so that you can keep advancing the address through memory on the next pass of a loop or however else.
The `stm` instruction lets us store multiple registers out in the same way.
The combination of `ldm` and `stm` we'll call a "block" copy.
Right now we're using our registers as follows:
* r0: destination pointer
* r1: source pointer
* r2: byte count
* r3: temp storage
Per the C ABI for ARM, we can freely use `r0-r3`, `r12`, and `lr` (aka `r14`) for whatever we want during a function call.
If we want to alter any other register we've got to restore the value before our function returns.
We do this by pushing the values to the stack, then popping them back off the stack later.
The pushing and popping itself takes time.
So there's two basic strategies:
* Use just `r3` and `r12` for our block copy, doing two words per loop while avoiding the need for the stack.
* Use `r3`, `r12`, *and* push 6 other registers, so that we can do eight words (one 4bpp tile) per loop. But then we have to use the stack.
We could consider the cycle timings as well:
| op | time |
|:---|:-----|
| ldr | 1S+1N+1I |
| str | 2N |
| ldm | (x)S+1N+1I |
| stm | (x-1)S+2N |
| sub | 1S |
| b | 2S+1N |
The `x` value on `ldm`/`stm` is the number of registers you're loading or storing.
Note that some situations can make these instructions cost more than listed, such as if you ever use `pc` as a destination, but for our purposes these numbers are fine.
Also, any conditional op takes just 1S if the condition doesn't hold.
Let's see how long things take.
For simplicity we'll pretend that N and S are both 1, even though sometimes they might be more.
```arm
@ copy 1 word loop: 2.25 cycles per byte (9/4)
1:
subs r2, r2, #4 @ 1
ldrcs r3, [r1], #4 @ 3
strcs r3, [r0], #4 @ 2
bgt 1b @ 3
```
```arm
@ copy 2 words loop: 1.375 cycles per byte (11/8)
1:
subs r2, r2, #8 @ 1
ldmcs r1!, {r3, r12} @ 4
stmcs r0!, {r3, r12} @ 3
bgt 1b @ 3
```
```arm
@ copy 8 words loop: 0.71875 cycles per byte (23/32)
1:
subs r2, r2, #32 @ 1
ldmcs r1!, {r3-r9, r12} @ 10
stmcs r0!, {r3-r9, r12} @ 9
bgt 1b @ 3
```
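Here's that per-byte arithmetic as a tiny plain-Rust helper, using the timing table above and pretending S, N, and I are all 1 cycle (for the 1-word loop, `ldr`/`str` happen to cost the same as a 1-register `ldm`/`stm`):
```rust
// Cycles per byte for a copy loop that moves `x` words per pass.
fn cycles_per_byte(x: u32) -> f32 {
    let subs = 1; // 1S
    let ldm = x + 1 + 1; // xS + 1N + 1I
    let stm = (x - 1) + 2; // (x-1)S + 2N
    let b = 2 + 1; // 2S + 1N
    (subs + ldm + stm + b) as f32 / (x * 4) as f32
}
// cycles_per_byte(1) == 2.25     (9/4)
// cycles_per_byte(2) == 1.375    (11/8)
// cycles_per_byte(8) == 0.71875  (23/32)
```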
* The final use of each loop causes the branch condition to not hold, and then we cut 2 cycles
* To get the spare six registers for the 8-word loop we need 7 cycles for the push (`stm`) and 8 cycles for the pop (`ldm`).
So if there's 32 bytes:
* The 8-word loop will be 38 cycles (push, break on the first loop, pop)
* The 2-word loop will be 42 cycles (3 complete loops, break on the 4th loop)
Fancy as it is to not use the stack, even just a 32 byte copy is enough to justify using the stack.
If we're copying less than 32 bytes... well that's actually tricky.
We get better performance per byte from using the 2-word loop.
However, it makes the tail conditions really complicated.
In addition to spare bytes after all the word copying, we'd *also* have to handle a possible spare word after the 2-word copying.
And when we merge the +8 and -4 into just a +4, we have to change our conditional copy to use `pl` for "plus" (which includes zero too).
That all seems extra fiddly, I'm not sure we want to deal with that.
Awh...
nuts...
```arm
one_word_loop:
subs r2, r2, #4
ldrcs r3, [r1], #4
strcs r3, [r0], #4
bgt one_word_loop
bxeq lr
add r2, r2, #4
b copy_u8_forward
two_then_one_loop:
subs r2, r2, #8
ldmcs r1!, {r3, r12}
stmcs r0!, {r3, r12}
bgt two_then_one_loop @ if counter > 8, continue loop
bxeq lr @ if counter == 8, return
adds r2, r2, #4 @ otherwise go back some
ldrpl r3, [r1], #4 @ if counter > 4, copy
strpl r3, [r0], #4
bxeq lr @ if counter == 4, return
add r2, r2, #4 @ otherwise fix *again*
b copy_u8_forward @ then byte copy
```
Then if all my math is right, the time totals look like this
| Words | one_word_loop | two_then_one_loop |
|:-:|:-:|:-:|
| 7 | 64 | 46 |
| 6 | 55 | 36 |
| 5 | 46 | 34 |
| 4 | 37 | 24 |
| 3 | 28 | 22 |
| 2 | 19 | 12 |
| 1 | 10 | 12 |
| 0 | 16 | 20 |
How many words does LLVM copy on its own before calling a function?
If we go back to that code sample where we tried to get LLVM to use `ldr`/`str` by having it copy arrays of `u32` we can get a hint.
For `u32` arrays of 16 bytes or less LLVM copies on its own, and then when the array is 20 or more bytes it uses a call to `memmove`.
That is **not** meant to be a statement of how LLVM always generates the code, it's just what it does with `u32` arrays in that situation.
If we try using `u8` arrays in that function, it directly copies an 8 byte array, and then goes to `memmove` at 12 bytes.
I guess it depends on the exact types involved, but I think that we can be *fairly sure* that our `u32` copy function won't be called for single word copies.
LLVM is at least that smart.
The more complex `two_then_one_loop` already performs better than the `one_word_loop` with just 2 words, and the difference increases as the number of words goes up.
Using `ldm`/`stm` to reduce loop iterations is *very* effective.
And so, with a heavy sigh that the code has gotten complex like this, I think the way we want to write it all looks like:
```arm
copy_u32_forward:
cmp r2, #32
bge 2f
1: @ copy less than 8 words
subs r2, r2, #8
ldmcs r1!, {r3, r12}
stmcs r0!, {r3, r12}
bgt 1b
bxeq lr
adds r2, r2, #4
ldrpl r3, [r1], #4
strpl r3, [r0], #4
bxeq lr
add r2, r2, #4
b copy_u8_forward
2: @ copy 8 word blocks
push {r4-r9}
3:
subs r2, r2, #32
ldmcs r1!, {r3-r9, r12}
stmcs r0!, {r3-r9, r12}
bgt 3b
pop {r4-r9}
bxeq lr
@ if we didn't early return the last loop over-subtracted
@ so fix that up before going to the main code path
adds r2, r2, #32
b 1b
```
### The Carry/Sign Test
There's at least one more trick we can apply.
Right now it's possible to have 0 to 3 spare bytes when we enter our word copy loop, which eventually falls down to calling `copy_u8_forward`.
The time to load and store isn't any faster for bytes than for words, so that 1-byte loop takes 9 cycles per full pass the same as our 1-word version does.
So if we call it for 3 spare bytes we're gonna spend a lot of time on a very small amount of work.
What we'd rather have happen in the 3 byte case is just do one halfword copy and then one byte copy, and save some cycles.
If only there was a way to quickly figure out when to do a halfword and/or byte copy.
Turns out there is, lucky us.
We can use what's called the "carry/sign test" to determine the state of bits 0 and 1 of our count in a single instruction.
We just "logical shift left" (aka `<<`) the count by 31 bits.
```arm
lsls r12, r2, #31
```
That's the whole thing!
For a shift, the flags are set as follows:
* N Flag = 1 if result (as a signed value) is negative
* Z Flag = 1 if result is 0
* C Flag = 1 if the last bit shifted out of the number is 1 (and unchanged if the shift amount is 0).
* V Flag = unaffected
So when we use `lsls #31`:
* Bit 0 of the starting value is now the sign bit of the result, which determines N.
* Bit 1 is the last bit we shifted out, which determines C.
* The rest of the bits get shifted out entirely and discarded, so they can't hurt our check (and by this point in the code we know the count is less than 4 anyway).
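If you want to convince yourself of that with a model (plain Rust, nothing we actually ship), shifting left by 31 routes bit 0 into the sign flag and bit 1 into the carry flag:
```rust
// Model of `lsls rX, count, #31`: returns (n, c).
fn carry_sign_test(count: u32) -> (bool, bool) {
    let result = count << 31;
    let n = (result as i32) < 0; // N = bit 0 of `count`
    let c = (count >> 1) & 1 == 1; // C = last bit shifted out = bit 1 of `count`
    (n, c)
}
// carry_sign_test(3) == (true, true)    copy a halfword AND a byte
// carry_sign_test(2) == (false, true)   just a halfword
// carry_sign_test(1) == (true, false)   just a byte
// carry_sign_test(0) == (false, false)  nothing left to do
```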
So we shift the count left by 31 bits, then do a conditional halfword copy (using `cs` for "carry set"), followed by a conditional byte copy (using `mi` for "minus").
```arm
@ we don't need the output, just the flags changes
lsls r3, r2, #31
ldrhcs r3, [r1], #2
strhcs r3, [r0], #2
ldrbmi r3, [r1], #1
strbmi r3, [r0], #1
```
No need for a loop.
Now we can just replace the call to `copy_u8_forward` with this new assembly, and then an unconditional `bx lr`, and we'll be in good shape.
```arm
1: @ copy less than 8 words
subs r2, r2, #8
ldmcs r1!, {r3, r12}
stmcs r0!, {r3, r12}
bgt 1b
bxeq lr
adds r2, r2, #4
ldrpl r3, [r1], #4
strpl r3, [r0], #4
bxeq lr
add r2, r2, #4
lsls r3, r2, #31
ldrhcs r3, [r1], #2
strhcs r3, [r0], #2
ldrbmi r3, [r1], #1
strbmi r3, [r0], #1
bx lr
```
Ah, wait, wait!
We're adding 4 back to our counter value to undo our most recent subtraction so that we have "the count value we had when we entered the loop".
Then we do the carry/sign test on that value.
Except our numbers are two's complement integers...
so...
if we're just adding and subtracting any multiple of 4, it's not going to affect the lowest two bits anyway, which are the bits we're looking at with our carry/sign test.
| Count Before Sub | Bits |
|:-:|:-:|
| 0 | `00000000000000000000000000000000` |
| 1 | `00000000000000000000000000000001` |
| 2 | `00000000000000000000000000000010` |
| 3 | `00000000000000000000000000000011` |
| Count After Sub | Bits |
|:-:|:-:|
| -4 | `11111111111111111111111111111100` |
| -3 | `11111111111111111111111111111101` |
| -2 | `11111111111111111111111111111110` |
| -1 | `11111111111111111111111111111111` |
With or without that 4 added back in, the lowest two bits are the same.
Which means we can skip that `add` entirely.
```arm
bxeq lr
@ we can skip the add!
lsls r3, r2, #31
```
No need to even change any conditionals or anything.
I love it.
So our final version of the function looks like this.
```arm
copy_u32_forward:
cmp r2, #32
bge 2f
1: @ copy less than 8 words
subs r2, r2, #8
ldmcs r1!, {r3, r12}
stmcs r0!, {r3, r12}
bgt 1b
bxeq lr
adds r2, r2, #4
ldrpl r3, [r1], #4
strpl r3, [r0], #4
bxeq lr
lsls r3, r2, #31
ldrhcs r3, [r1], #2
strhcs r3, [r0], #2
ldrbmi r3, [r1], #1
strbmi r3, [r0], #1
bx lr
2: @ copy 8 words at a time (loop)
push {r4-r9}
3:
subs r2, r2, #32
ldmcs r1!, {r3-r9, r12}
stmcs r0!, {r3-r9, r12}
bgt 3b
pop {r4-r9}
bxeq lr
adds r2, r2, #32
b 1b
```
This is getting a little complicated.
## Reverse `u32`
Similar to before, once we know how we want our forward loop to go we can roughly do the same steps "in reverse" to form our reverse loop.
The biggest difference is that `ldm`/`stm` can't use offset addressing, so if we hit that case we'll have to adjust the pointers manually as part of that setup.
Other than that, this should be fairly simple.
* carry/sign test on the count
* single byte copy, if any
* single halfword copy, if any
* Now we know the count is a clean multiple of 4
* Copy 8 words at a time (if the count is high enough)
* Copy 2 words at a time
* Copy one final word if necessary
We start with a carry/sign test.
If the flags show that we need to copy a byte or a halfword then we'll do that.
Except this is the reverse loop so "do that" is subtracting the correct value from the byte count and then using the byte count as an offset for the addressing.
```arm
copy_u32_reverse:
lsls r3, r2, #31
submi r2, r2, #1
ldrbmi r3, [r1, r2]
strbmi r3, [r0, r2]
subcs r2, r2, #2
ldrhcs r3, [r1, r2]
strhcs r3, [r0, r2]
@ TODO: the rest of the owl
```
Oh golly that's a lot of conditional instructions.
Like we had with the block copy in our forward loop, we're gonna push this out of the main path of the code.
And since we'll also do that with the block copy, I think we're going to end up with more than just 3 labels in this function.
Let's finally use some "local labels".
They're just like normal labels, but they're not exported from the file that they appear in is all.
The prefix for a local label varies by platform, for us it's `.L`, and they're otherwise like all the other labels we've used.
A label name can be fairly long.
I'm gonna use some long label names to comment the intent a bit.
```arm
copy_u32_reverse:
tst r2, #3
bne .L_handle_byte_and_halfword
.L_count_is_now_a_multiple_of_four:
@ TODO: the rest of the owl
.L_handle_byte_and_halfword:
lsls r3, r2, #31
submi r2, r2, #1
ldrbmi r3, [r1, r2]
strbmi r3, [r0, r2]
subcs r2, r2, #2
ldrhcs r3, [r1, r2]
strhcs r3, [r0, r2]
b .L_count_is_now_a_multiple_of_four
```
Amazing.
Looks better than I thought it would.
So we know our pointers are aligned to 4, because that's a requirement to call the function.
And now at the "TODO" point in the code we know the count is aligned to 4 also.
And `ldm`/`stm` don't accept offset addressing.
So we just add the count to both our pointers.
```arm
.L_count_is_now_a_multiple_of_four:
add r1, r1, r2
add r0, r0, r2
@ TODO: the rest of the owl
```
Now we're set to use `ldm`/`stm`, so I guess we put that in.
As with the forward loop, I think we want to put this to the side of the main code path and try to give the 1-7 word copy portion the main code path.
The change here to make it "reverse" is that instead of using `ldm`, which is short for `ldmia` ("load-multiple, increment after"), we'll use `ldmdb` ("load-multiple, decrement before").
This matches our earlier reverse loops where we subtract before the copy and then use that subtracted value as the offset.
```arm
.L_count_is_now_a_multiple_of_four:
add r1, r1, r2
add r0, r0, r2
cmp r2, #32
bge .L_block_copy_sub
.L_done_with_block_copy:
@ TODO: the rest of the owl
.L_block_copy_sub:
push {r4-r9}
1:
subs r2, r2, #32
ldmdbcs r1!, {r3-r9, r12}
stmdbcs r0!, {r3-r9, r12}
bgt 1b
pop {r4-r9}
bxeq lr
adds r2, r2, #32
b .L_done_with_block_copy
```
Now we want to fill in the `.L_done_with_block_copy`.
At this point we know that we're going to copy 0-7 words of data.
Actually, now that I'm thinking in words instead of in bytes, another idea comes to mind.
The `r2` register holds the number of bytes to copy.
Our goal, you could say, is to clear all of the bits.
Once all the bits in our counter are 0, that means there's no bytes left to copy and so we can return.
We know bits 0 and 1 are clear.
We know bits 5 and above are clear.
We're just not sure about bits 2, 3, and 4.
| Words Remaining | Counter Bits |
|:-:|:-:|
| 0 | `00000000000000000000000000000000` |
| 1 | `00000000000000000000000000000100` |
| 2 | `00000000000000000000000000001000` |
| 3 | `00000000000000000000000000001100` |
| 4 | `00000000000000000000000000010000` |
| 5 | `00000000000000000000000000010100` |
| 6 | `00000000000000000000000000011000` |
| 7 | `00000000000000000000000000011100` |
So with only 3 bits to test and possibly copy for, it's kinda practical to just test each bit and copy that much if we get a hit.
```arm
.L_done_with_block_copy:
@ check for 4+ words remaining, copy if so
tst r2, #(1<<4)
ldmdbne r1!, {r3, r12}
stmdbne r0!, {r3, r12}
ldmdbne r1!, {r3, r12}
stmdbne r0!, {r3, r12}
@ TODO: the rest of the owl
```
Yeah... that doesn't seem so bad.
Know what else is cool?
The carry/sign test thing, you can do it on *any* two bits in a register.
As long as they're next to each other of course.
We just adjust the logical left shift amount.
You subtract the *higher* number bit from 32, and shift by that much.
Before we wanted to test bits 0 and 1, so we shifted by 31 (`32-1==31`).
Now we want bits 2 and 3, so we shift by 29 (`32-3==29`).
```arm
.L_done_with_block_copy:
@ check for 4+ words remaining, copy if so
tst r2, #(1<<4)
ldmdbne r1!, {r3, r12}
stmdbne r0!, {r3, r12}
ldmdbne r1!, {r3, r12}
stmdbne r0!, {r3, r12}
@ carry/sign for 2 and/or 1 words remaining
lsls r3, r2, #29
ldmdbcs r1!, {r3, r12}
stmdbcs r0!, {r3, r12}
ldrmi r3, [r1, #-4]
strmi r3, [r0, #-4]
bx lr
```
And now that I've written it like this... it becomes obvious that probably the forward loop should be using the same technique.
Let's not go back.
For now, just *pretend* that we went back and updated the forward loop.
There's still more to talk about, and when we see the "full and complete" version of stuff that'll also include seeing the forward loop updated to use bit testing.
Okay, so here's the full u32 reverse copy:
```arm
copy_u32_reverse:
tst r2, #3
bne .L_handle_byte_and_halfword
.L_count_is_now_a_multiple_of_four:
add r1, r1, r2
add r0, r0, r2
cmp r2, #32
bge .L_block_copy_sub
.L_done_with_block_copy:
tst r2, #(1<<4)
ldmdbne r1!, {r3, r12}
stmdbne r0!, {r3, r12}
ldmdbne r1!, {r3, r12}
stmdbne r0!, {r3, r12}
lsls r3, r2, #29
ldmdbcs r1!, {r3, r12}
stmdbcs r0!, {r3, r12}
ldrmi r3, [r1, #-4]
strmi r3, [r0, #-4]
bx lr
.L_block_copy_sub:
push {r4-r9}
1:
subs r2, r2, #32
ldmdbcs r1!, {r3-r9, r12}
stmdbcs r0!, {r3-r9, r12}
bgt 1b
pop {r4-r9}
bxeq lr
adds r2, r2, #32
b .L_done_with_block_copy
.L_handle_byte_and_halfword:
lsls r3, r2, #31
submi r2, r2, #1
ldrbmi r3, [r1, r2]
strbmi r3, [r0, r2]
subcs r2, r2, #2
ldrhcs r3, [r1, r2]
strhcs r3, [r0, r2]
b .L_count_is_now_a_multiple_of_four
```
And, hey, if we didn't need to have the extra +4 fixup step to undo our subtract when we go past 4 with the single word copy, then I guess we don't need to have a +32 fixup step at the end of the block copy either. They're both a power of 2 that's bigger than the bits we care about, so it's all the same as long as we're only doing bit tests.
I don't want to paste that whole thing a second time, so let's pretend I did and just call that another small thing to fix in the next big version somewhere farther down the page.
# A Working `__aeabi_memcpy`
We've seen a lot of variations on copy loops.
It's time to make the thing we came here to make, a complete `__aeabi_memcpy` function.
This will be like the "forward" loops we did before, but we'll have more prologue code.
We don't know if our two pointers are aligned to 1, 2, or 4.
The best copy loop we can use depends on the *minimum* of the alignment of both pointers.
If our pointers start less aligned than we want, *sometimes* we can do a small copy to become aligned, and then use a better aligned loop.
* **Example 1:** one pointer is 5, and the other is 11. Those are both align 1, but if we just do a single byte copy they advance to 6 and 12, which are both align 2. Now we can go by halfwords and speed up the rest of the copying.
* **Example 2:** If one pointer is 7, and the other is 14, then they'll unfortunately be *perpetually* out of alignment. Even if we try to do a small copy at the start to get them lined up, one of them will always be out of alignment when the other is in alignment.
We'll need various checks on our input to figure out what the situation is, then we can jump to the loop that fits our needs best.
Since we've got 2 values, and each value has 2 possible states for a given bit, let's make one of those truth table things.
| Dest | Src | Can Align 2 ? |
|:-:|:-:|:-:|
| Even | Even | Yes |
| Even | Odd | No |
| Odd | Even | No |
| Odd | Odd | Yes |
Ah, wonderful.
The words in the chart aren't the classic "0" or "1", but the overall pattern matches the [Exclusive-Or][xor-wp] operation.
In Rust we write it with `^`.
In ARM assembly it's written with `eor`.
[xor-wp]: https://en.wikipedia.org/wiki/Exclusive_or#Truth_table
Depending on the Exclusive Or between the 0-bit values of the address we either will or won't be able to align the pointers to 2.
The same idea holds for the 1-bit of the address values and trying to align to 4.
It applies up to any alignment we need.
When we need both pointers to be aligned to X (where X is a power of 2 like normal), then if all the bits *below* the X bit are clear when you Exclusive Or the two pointers, you'll be able to do it.
Either both pointers are already aligned, or both pointers can be advanced the same amount to make them aligned.
Of course, that assumes that we have enough bytes to copy that there's still work to do after doing some initial small copies to advance the pointers.
As usual, if the call comes in to do a 0 byte copy or a 1 byte copy or something else small like that we need to maintain correctness.
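Here's that alignment rule as a short plain-Rust sketch (purely illustrative; the assembly below does the same check with an `eor` and a carry/sign test):
```rust
// The best alignment the two pointers can ever share is decided by the
// low bits of `dest XOR src`.
fn best_shared_alignment(dest: usize, src: usize) -> usize {
    let x = dest ^ src;
    if x & 0b01 != 0 {
        1 // bit 0 differs: stuck doing byte copies forever
    } else if x & 0b10 != 0 {
        2 // bit 1 differs: halfword copies at best
    } else {
        4 // low two bits match: we can get both pointers to align 4
    }
}
// best_shared_alignment(5, 11) == 2   (Example 1 above)
// best_shared_alignment(7, 14) == 1   (Example 2 above)
```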
Like with most all our other copy loops, `r0`, `r1`, and `r2` are all in use.
We're using `r3` as our temp storage for the copy.
Let's put our alignment check in `r12`.
```arm
__aeabi_memcpy:
eor r12, r0, r1
```
* If Bit 0 of `r12` is *set* we know we're capped to being align 1 at best.
* Else If Bit 1 of `r12` is set we're capped at being align 2 at best.
* Else we can get ourselves to being align 4, which is the best situation for us.
Since that's another "two bits next to each other" situation we can again use a carry/sign test.
You learn just *one neat trick* (virtual machines hate it!) and spend the entire rest of the article stuck using it over and over.
Anyway, now we can branch to the align1 path on negative,
branch to the align2 path on carry,
otherwise we keep going on the align4 path.
Once we're on the path we want, the align2 and align4 cases might need a little work up front to actually get the alignment we're going for. The align1 case obviously never has any starting work to do, since a pointer can't be aligned to less than 1.
Something like this.
```arm
__aeabi_memcpy:
eor r12, r0, r1
lsls r12, r12, #31
bmi copy_u8_forward
bcs copy_u16_forward_fixup_prelude
@ fallthrough to copy_u32_forward_fixup_prelude
```
For the fixup preludes we check the bits of either address to determine how much we need to fix up by.
Since we did the exclusive-or we know that both addresses will need fixing by the same amount, however much that is.
This feels like another situation where we want to assume we don't have to do the work on the main path, and then put any extra work on a side path.
```arm
copy_u32_forward_fixup_prelude:
tst r0, #3
bne .L_copy_u32_forward_do_a_fixup
.L_copy_u32_forward_after_the_fixup_is_complete:
@ TODO
.L_copy_u32_forward_do_a_fixup:
lsls r12, r0, #31
@ advance by 1 byte if bit0 is set in the dest address
submi r2, r2, #1
ldrbmi r3, [r1], #1
strbmi r3, [r0], #1
@ advance by 2 bytes if bit1 is set in the dest address
subcs r2, r2, #2
ldrhcs r3, [r1], #2
strhcs r3, [r0], #2
@ fixup complete
b .L_copy_u32_forward_after_the_fixup_is_complete
```
How big can a label be?
I don't know honestly.
If we need to use smaller labels in the actual code that's fine.
For illustrative / educational purposes I'm going to keep the extra verbose labels in the sample code blocks.
Anyway, once a path has been selected, and the fixup applied, then we just need to actually do the copy.
For "memcpy" style functions, it's assumed that the two regions are *exclusive*, so we can always select the forward copy loop.
## `__aeabi_memcpy4` and `__aeabi_memcpy8`
These functions are just like the general `__aeabi_memcpy`, except that they require the pointers to already be aligned to the amount of their number.
That means that we can basically just rename `.L_copy_u32_forward_after_the_fixup_is_complete` to `__aeabi_memcpy4`, since both blocks assume pointers aligned to 4.
For us there's no advantage to a pointer alignment of 8.
Later ARM CPUs have "double word" operations, but we don't.
That means we can just make `__aeabi_memcpy8` be an alias for the 4 version.
```arm
copy_u32_forward_fixup_prelude:
tst r0, #3
bne .L_copy_u32_forward_do_a_fixup
@ fallthrough to __aeabi_memcpy4
__aeabi_memcpy8:
__aeabi_memcpy4:
@ TODO
.L_copy_u32_forward_do_a_fixup:
lsls r12, r0, #31
@ advance by 1 byte if bit0 is set in the dest address
submi r2, r2, #1
ldrbmi r3, [r1], #1
strbmi r3, [r0], #1
@ advance by 2 bytes if bit1 is set in the dest address
subcs r2, r2, #2
ldrhcs r3, [r1], #2
strhcs r3, [r0], #2
@ fixup complete
b __aeabi_memcpy4
```
Aw, dag, the indentation for my labels is getting kinda weird...
I was using no-indent for labels that are exported from the file,
and then 2 spaces for labels that aren't exported,
but now in the name of having the "happy path" branch as little as possible our sad path code is jumping all over the page.
Guess it can't be helped, code just does this after a while when you keep handling more and more situations.
# A Working `__aeabi_memmove`
TODO
# A Working `__aeabi_memset`
TODO
# A Working `__aeabi_memclr`
TODO
# Working `libc` functions
TODO