
Memory Functions

This whole document is released as Creative Commons 0 (CC0).

More usable code for this article, including tests, can be found in the github repository.

Introduction

The "memory functions" do two basic kinds of things:

  • They copy data from src to dest, and the two regions might or might not overlap.
  • They set data in dest to a specific value, without reading from a source region.

Depending on the details, there's various optimizations that you can apply to these tasks. This means that a full suite of memory functions will usually have a few similar functions, each for a slightly different task.

In Rust the memory functions are assumed to exist as part of the basic code provided by the platform. These memory functions are called very often, and so they should be finely tuned to be as fast as possible on each platform. Rust assumes that the platform developers know what they're doing, and that they will provide fast versions. If your platform doesn't do this, you can still get basic versions of the memory functions from the compiler-builtins crate. Normally this is all sorted out for you automatically.

When we're programming Rust for the Gameboy Advance (GBA) we use the thumbv4t-none-eabi target. This is a "bare metal" target, with no provided code libraries. This means that a Rust program on the GBA gets the basic memory functions from compiler-builtins.

Except we don't want the basic versions. LLVM is reasonable at codegen, but sometimes it misses stuff. Also, we want to have the linker put our memory functions into IWRAM rather than ROM. Luckily, the compiler-builtins functions are only "weak linked" into the program. This means that we can override their version of a function just by providing our own version.

All we have to do to fix our situation is "just" define our own memory functions.

To follow this article you need to be able to read the basics of ARM assembly. Normally one might suggest the official docs for this, but the ARM developer website changes often enough that old links seem to break all the time. Instead, the guide I'd suggest is from Azeria Labs, which covers the fundamental ideas nicely in a few short pages.

We'll be using the modern ARM assembly instruction syntax, where conditional instructions always have the condition on the end (eg: ldrheq). This is the default for LLVM (and so for Rust too). If you read the original ARMv4T manuals from 1994 or whenever, they show the old-style syntax, where the conditional for some instructions goes "inside" the rest of the instruction (eg: ldreqh). That's the default parsing mode for the GNU assembler. If you want to use the modern syntax with the GNU assembler put the .syntax unified directive at the top of the file.

Function Signature Goals

For this issue we want to approach things in little pieces at a time, and build the pieces up into a final "complete" thing. To do that, we need to start by checking what we're even trying to build.

First up is the memory functions that are normally provided by a platform's libc. I know we're programming Rust here, but since they're defined as C functions we'll look at the signature in the C style.

void *memcpy(void *dest, const void *src, size_t count);

void *memmove(void *dest, const void *src, size_t count);

void *memset(void *dest, int ch, size_t count);
  • "memory copy", for when the two regions don't overlap.
  • "memory move", for when the two regions might overlap.
  • "memory set", which sets all bytes to be the u8 passed in ch. The ch value is passed as an int because this memset actually predates the ability to specify argument types in a function signature. Just go with it.

The count argument is always the count in bytes. Oh, and the return value is always the original dest pointer that was passed in. That lets you potentially do C's version of call chains where you write code like

op_two(op_one(buffer, arg1), arg2);
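
Just for reference, here's roughly how that same trio looks if you declare it from the Rust side (a sketch only, using u8 pointers in place of C's void pointers; normally the platform or compiler-builtins provides the real definitions):

// The classic libc trio, as Rust extern declarations.
// `count` is in bytes, and each one returns the original `dest` pointer.
extern "C" {
  fn memcpy(dest: *mut u8, src: *const u8, count: usize) -> *mut u8;
  fn memmove(dest: *mut u8, src: *const u8, count: usize) -> *mut u8;
  fn memset(dest: *mut u8, ch: i32, count: usize) -> *mut u8;
}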

On a lot of platforms that might be all you'd need to think about. However, with targets that follow the ARM EABI (that includes us), there's some additional memory functions too. LLVM will use the ARM EABI functions when appropriate, so we want these to be as fast as possible as well.

void __aeabi_memcpy8(void *dest, const void *src, size_t n);
void __aeabi_memcpy4(void *dest, const void *src, size_t n);
void __aeabi_memcpy(void *dest, const void *src, size_t n);

void __aeabi_memmove8(void *dest, const void *src, size_t n);
void __aeabi_memmove4(void *dest, const void *src, size_t n);
void __aeabi_memmove(void *dest, const void *src, size_t n);

void __aeabi_memset8(void *dest, size_t n, int c);
void __aeabi_memset4(void *dest, size_t n, int c);
void __aeabi_memset(void *dest, size_t n, int c);

void __aeabi_memclr8(void *dest, size_t n);
void __aeabi_memclr4(void *dest, size_t n);
void __aeabi_memclr(void *dest, size_t n);

These largely work like the libc versions except that:

  • There's no return value. This might seem unimportant, but it actually does allow a small optimization.
  • There's "4" and "8" versions of the functions. These functions can be used when the caller knows that the pointers are already aligned to 4 or 8, which allows significant optimization. The GBA's CPU is old enough that alignment to 8 doesn't help, but alignment to 4 will be much faster. Note that the number of bytes passed to the aligned functions does not need to be a multiple of the alignment. You could call __aeabi_memcpy4 and tell it to copy 5 bytes, if you really want.
  • __aeabi_memclr ("memory clear") is a new group of functions that work like "memory set", but the byte value to set is just implied to be 0.
  • The __aeabi_memset functions have the argument order swapped compared to memset. Now the byte to set is the third argument. This allows __aeabi_memclr implementations to share most of their code with __aeabi_memset implementations.
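
Since that last point is easy to get backwards, here's a plain Rust sketch of the memset/memclr relationship (illustration only, the versions we actually build later are assembly):

// "memory clear" is just "memory set" with the byte value pinned to 0. With
// the byte moved to the third argument, memclr can simply defer to memset.
#[no_mangle]
pub unsafe extern "C" fn __aeabi_memclr(dest: *mut u8, n: usize) {
  __aeabi_memset(dest, n, 0)
}

#[no_mangle]
pub unsafe extern "C" fn __aeabi_memset(dest: *mut u8, n: usize, c: i32) {
  let byte = c as u8;
  for i in 0..n {
    dest.add(i).write(byte);
  }
}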

Copy Ordering

When we're accessing a bunch of memory, accessing the memory in order is just faster. This is very true on a modern system, but even the old GBA cares. Accesses are categorized into "sequential" ("S") and "non-sequential" ("N"). The exact time taken varies by game pak, but the usual timings are that sequential accesses take 2 cycles and non-sequential accesses take 4 cycles.

Obviously we should always write our copy loops to only use sequential accesses, right? Just in case we're accessing the ROM as part of a copy? Well, it's not so simple (it never is).

Imagine we have some memory like this. (We'll use a little-endian layout for our diagram, since the GBA is a little-endian device.)

7 6 5 4 3 2 1 0
H G F E D C B A

Let's imagine we want to copy a 4 byte span, with the destination starting at 2 and the source starting at 0.

  • First we read A from 0 and write it to 2.
  • Next we read B from 1 and write it to 3.
  • Next we should have read C from 2, but we already overwrote that value in the first step!

When our two regions to copy between might overlap (like with memmove), we need to be sure that the overlapping portion gets handled first so that the source data is not destroyed by our own writes to the destination. We do that by copying in the correct direction. We'll call the sequential version "forward", and the non-sequential version "reverse". For each copying loop we design we'll need to write both the forward and reverse version. Then we pick between the two versions like this:

fn copy_overlapping(d: *mut u8, s: *const u8, count: usize) {
    if (d as *const u8) < s {
        copy_forward(d, s, count);
    } else {
        copy_reverse(d, s, count);
    }
}

And if we know the regions don't overlap (such as copying ROM data to RAM), then we can just always use the forward loop like we wanted.
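
If you want to see that corruption happen in real code rather than a diagram, here's a tiny Rust test using a deliberately naive forward copy (purely illustrative, this is not one of our real memory functions):

// A naive forward copy within one buffer. When the regions overlap and the
// destination is "ahead" of the source, it re-reads bytes it already wrote.
fn naive_forward_copy(buf: &mut [u8], dest: usize, src: usize, count: usize) {
  for i in 0..count {
    buf[dest + i] = buf[src + i];
  }
}

#[test]
fn overlapping_forward_copy_goes_wrong() {
  let mut buf = *b"ABCDEFGH";
  // copy 4 bytes from index 0 to index 2, like the diagram above
  naive_forward_copy(&mut buf, 2, 0, 4);
  // a correct memmove would give "ABABCDGH", but we get repeated "AB"s
  assert_eq!(&buf, b"ABABABGH");
}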

Compiler Explorer

The ever handy Compiler Explorer website lets us write some Rust and see the assembly output. Unfortunately for us, Compiler Explorer isn't set up to build for an ARMv4T target. Right now Compiler Explorer only supports Rust's Tier 1 and Tier 2 targets. Instead, we have to have CE show us assembly for armv5te-unknown-linux-gnueabi, which is the oldest ARM target that's Tier 2. This means that sometimes we're going to see assembly output that isn't actually legal for us to use on the GBA (eg: a blx instruction, popping an lr value on the stack directly into pc instead of using bx lr, etc). But mostly it's close enough that we can use it as an approximate starting point.

There's also always the "Code Size vs Code Speed" trade off to consider. We'll want to look at the output of a few different optimization levels and pay attention to the differences.

Copy By u8

The first loops we'll look at will work exclusively in 8-bit units.

Hey Lokathor, isn't that silly? Because I thought you said we wanted to go fast with these copies!

Yes, I know friends, I know. But it's not as silly as it might seem.

There's two main reasons to write a u8-only copy loop.

  1. When accessing SRAM, the program must (in the RFC 2119 sense) run the code out of Work Ram rather than ROM, and must access using only 8-bit accesses (ldrb, ldrsb, strb).
  2. It's educational! We're starting small as a warm up. We'll try out some basic techniques we can apply to bigger examples, that sort of thing.

Cool Loka's Hot Tip: Technically speaking we should be using MaybeUninit because the memory functions are defined by Rust to be "untyped" copies that ignore the initialization state of the memory involved, but since we're going to replace the Rust code with assembly code in the final version anyway I'm going to skip that part because it's a lot of visual clutter we don't need to just see the loop structure.

So here's what our Rust code looks like:

#![no_std]

pub unsafe extern "C" fn copy_u8_forward(mut dest: *mut u8, mut src: *const u8, mut count: usize) {
  const SIZE_U8: usize = core::mem::size_of::<u8>();
  while count >= SIZE_U8 {
    *dest = *src;
    dest = dest.add(SIZE_U8);
    src = src.add(SIZE_U8);
    count -= SIZE_U8;
  }
}

pub unsafe extern "C" fn copy_u8_reverse(mut dest: *mut u8, mut src: *const u8, mut count: usize) {
  const SIZE_U8: usize = core::mem::size_of::<u8>();
  while count >= SIZE_U8 {
    count -= SIZE_U8;
    *dest.add(count) = *src.add(count);
  }
}

And here's what it looks like in the different optimization modes:

In this case, opt=3 (fast) and opt=s (small) outputs are the same, and opt=z (aggressively small) cuts out an early return when the copy length is 0.

Let's take a close look, starting with copy_u8_forward. We're going to go really slow for the first function or two, and then speed up a bit once we've got the basics down.

Forward u8

To start, let's use the z output as our base. We're going to assume that people won't call the function with a zero sized copy, so we probably don't need the extra early return check to try and minimize work. To be clear: people could call with a copy size of 0, and our function must be correct if that happens, but we'll assume that the copy size is larger than 0 even if it makes the 0-size copy case be slightly less fast. Generally, any memory function like this should assume that the amount of memory to manipulate is larger, rather than smaller, because when the amount of memory to manipulate is smaller the compiler will just do a few reads and writes on its own without calling a memory function.

Okay here we go:

@ Starting with the opt=z output

copy_u8_forward:
  .LBB0_1:
    cmp     r2, #0
    bxeq    lr
    ldrb    r3, [r1], #1
    sub     r2, r2, #1
    strb    r3, [r0], #1
    b       .LBB0_1

To start, we've removed the example:: prefix from the label name. Since the label part of a line ends with : there can't be a :: in the label name. Technically the real label that rustc used for the function was _ZN7example15copy_u8_forward17hc841ec0dfd74b8c3E, and then Compiler Explorer just de-mangled that for us when showing the output. Since we don't want to bother with mangling, we'll keep our labels plain and simple.
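
As an aside, if you keep a memory function written in Rust instead of assembly, the way to get a plain, unmangled symbol like this is the no_mangle attribute. A quick sketch (same loop as before, just exported under the plain name):

// `#[no_mangle]` makes the exported symbol exactly `copy_u8_forward` instead
// of the long mangled name that rustc would otherwise generate.
#[no_mangle]
pub unsafe extern "C" fn copy_u8_forward(mut dest: *mut u8, mut src: *const u8, mut count: usize) {
  while count > 0 {
    *dest = *src;
    dest = dest.add(1);
    src = src.add(1);
    count -= 1;
  }
}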

Also we can eliminate the local label, we can just jump to the top of the function using its non-local label.

copy_u8_forward:
    cmp     r2, #0
    bxeq    lr
    ldrb    r3, [r1], #1
    sub     r2, r2, #1
    strb    r3, [r0], #1
    b       copy_u8_forward

Okay, so, can we improve this code? Well, quite honestly, we probably can. LLVM isn't that good at taking full advantage of the Conditional Execution that all our ARM instructions allow.

The first suspicious part is that there's both a cmp and a sub for the same value in the same loop. On later ARM devices this arrangement is basically fine, since the result of an ldr isn't available on the next cycle: something needs to go between the ldr and the str, and if you don't have anything the CPU just stalls a cycle anyway, so putting the sub there costs nothing. On the early ARM chip in the GBA we don't need the gap, so maybe we can move the sub around and shorten the loop.

  • cmp a, b sets the CPU status flags based on the result of a-b, without storing that output to any register.
  • sub d, a, b does d = a-b without setting the flags.
  • There's also subs d, a, b which is similar but with setting the flags.

This means if we see cmp and sub in the same loop, we might be able to use subs instead. It's fine to move the sub earlier in the loop. The bxeq and ldrb don't interact with r2 at all, so if r2 gets subtracted earlier or later they don't care. Let's check what that looks like.

copy_u8_forward:
    cmp     r2, #0
    sub     r2, r2, #1
    bxeq    lr
    ldrb    r3, [r1], #1
    strb    r3, [r0], #1
    b       copy_u8_forward

So now we can replace cmp and sub with a subs instruction if we change the condition on the bxeq. Currently we break out of the loop if cmp finds out that the count is 0 (if count - 0 == 0). However, if we use subs then we're setting the flags based on the value after we subtract 1, not before we subtract 1. If we had 1, and subtract 1, and get 0, then the counter was 1 when we entered the loop, so we want to do the loop one final time to copy the final byte. What we need to do is stop the loop when we have 0, subtract 1, and underflow the value to u32::MAX.

For subs the change in flags work as follows:

  • N Flag = 1 if result (as a signed value) is negative
  • Z Flag = 1 if result is 0
  • C Flag = 1 if the subtraction did not produce an unsigned underflow. For anything else it's an unsigned overflow flag, but for subtraction the meaning is inverted, so don't forget.
  • V Flag = 1 if the result of the operation is greater than or equal to 2^31, or less than -2^31.

So as an example, 3 - 1 gives 2 with flag N=0, Z=0, C=1, V=0. And when we do 0 - 1, underflowing to u32::MAX, the flags are N=1, Z=0, C=0, V=0. That means the condition for our bx needs to be cc ("carry clear").
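
If you want to double check those flag examples without reaching for an emulator, here's a quick Rust model of subs (it only models N, Z, and C, which are the flags we care about here):

// Models the N, Z, and C flags after `subs`. Note that C is the *inverse* of
// the unsigned borrow, which is the "inverted for subtraction" rule above.
fn subs_flags(a: u32, b: u32) -> (bool, bool, bool) {
  let (result, borrow) = a.overflowing_sub(b);
  let n = (result as i32) < 0; // N: result is negative as a signed value
  let z = result == 0;         // Z: result is zero
  let c = !borrow;             // C: no unsigned underflow happened
  (n, z, c)
}

// subs_flags(3, 1) == (false, false, true)   i.e. N=0, Z=0, C=1
// subs_flags(0, 1) == (true,  false, false)  i.e. N=1, Z=0, C=0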

copy_u8_forward:
    subs    r2, r2, #1
    bxcc    lr
    ldrb    r3, [r1], #1
    strb    r3, [r0], #1
    b       copy_u8_forward

Neat.

This is actually where I stopped when I first started writing this article. I said to myself, "that's as tight a loop as we can make it", and I moved on.

Then felixjones on GitHub looked at the code and said "hey get that bx out of your loop".

See, we can't forget that every instruction can be conditional. We can move the bx out of the loop to the end, change the b to use bgt so that we only restart the loop when we have a non-zero amount of count remaining at the end of each loop, and finally we can put a cs on the ldrb and strb instructions so that if the function is called with a count of 0 we will safely pass over those instructions.

copy_u8_forward:
    subs    r2, r2, #1
    ldrbcs  r3, [r1], #1
    strbcs  r3, [r0], #1
    bgt     copy_u8_forward
    bx      lr

And, yeah. This is nicer. With the bx lr outside the loop. That's why you show your code to other folks and have them give it a check.

Reverse u8

LLVM wants our opt=z reverse loop to look like this:

copy_u8_reverse:
    sub     r1, r1, #1
    sub     r0, r0, #1
  .LBB1_1:
    cmp     r2, #0
    bxeq    lr
    ldrb    r3, [r1, r2]
    strb    r3, [r0, r2]
    sub     r2, r2, #1
    b       .LBB1_1

That's not so great. It's the same problem as above, LLVM is placing the subtraction at the wrong time, so we've got all sorts of extra work going on.

We can just change it to subs and have our conditions like before. The difference is that we're going to use "offset addressing" instead of the "post-indexed" addressing. Instead of the base pointers moving (after the access) with each loop, they stay the same and the count is what moves.

copy_u8_reverse:
    subs    r2, r2, #1
    ldrbcs  r3, [r1, r2]
    strbcs  r3, [r0, r2]
    bgt     copy_u8_reverse
    bx      lr

Now the copy starts with the final byte, then one byte back, then one byte back, until it copies the 0th byte. Super easy, no trouble at all.

Copy By (Mostly) u16

In practice, other than with SRAM, we'd never want to copy by only u8.

The next size up is u16 ("halfword"). We might want to use this if we're working with stuff going into VRAM. Modes 3, 4, and 5, along with the color palettes, are all 2-byte aligned data. VRAM has the fun quirk where u8 reads are fine, but u8 writes affect both bytes within the halfword. We need to be careful to avoid u8 writes when writing to VRAM.

So we'll have some functions for "mostly u16". It's "mostly" because they'll use halfword copies as much as possible. If you do request a copy length that's an odd number then the function will be forced to do one byte copy, but otherwise you can be sure you'll get halfword copies.

Cool Loka's Hot Tip: The MaybeUninit thing from before still applies.

#![no_std]

pub unsafe extern "C" fn copy_u16_forward(mut dest: *mut u8, mut src: *const u8, mut count: usize) {
  const SIZE_U16: usize = core::mem::size_of::<u16>();
  while count >= SIZE_U16 {
    dest.cast::<u16>().write(src.cast::<u16>().read());
    dest = dest.add(SIZE_U16);
    src = src.add(SIZE_U16);
    count -= SIZE_U16;
  }
  if count > 0 {
    dest.cast::<u8>().write(src.cast::<u8>().read());
  }
}

pub unsafe extern "C" fn copy_u16_reverse(mut dest: *mut u8, mut src: *const u8, mut count: usize) {
  const SIZE_U16: usize = core::mem::size_of::<u16>();
  if (count % SIZE_U16) > 0 {
    count -= core::mem::size_of::<u8>();
    dest.add(count).write(src.add(count).read());
  }
  while count >= SIZE_U16 {
    count -= SIZE_U16;
    dest.add(count).cast::<u16>().write(src.add(count).cast::<u16>().read());
  }
}

Just like with the u8 loops, opt=3 and opt=s give us the same output, and only opt=z goes for the extremely compact form.

Forward u16

So LLVM wants us to do one of these:

@ opt 3
copy_u16_forward:
    cmp     r2, #2
    blo     .LBB0_2
  .LBB0_1:
    ldrh    r3, [r1], #2
    sub     r2, r2, #2
    strh    r3, [r0], #2
    cmp     r2, #1
    bhi     .LBB0_1
  .LBB0_2:
    cmp     r2, #1
    ldrbeq  r1, [r1]
    strbeq  r1, [r0]
    bx      lr

@ opt z
copy_u16_forward:
  .LBB0_1:
    cmp     r2, #1
    bls     .LBB0_3
    ldrh    r3, [r1], #2
    sub     r2, r2, #2
    strh    r3, [r0], #2
    b       .LBB0_1
  .LBB0_3:
    ldrbeq  r1, [r1]
    strbeq  r1, [r0]
    bx      lr

This isn't ideal. As with the u8 code, LLVM is trying to optimize for the small case to be faster. What we want is for our common case to be fast, and for us it's the large case that's common.

If we take our u8 forward loop and just change a few bits to make the main loop go by halfwords it looks like this.

copy_u16_forward:
    subs    r2, r2, #2
    ldrhcs  r3, [r1], #2
    strhcs  r3, [r0], #2
    bgt     copy_u16_forward
    bx      lr
    @ todo: handle the spare byte, if any

So we subtract 2. If we didn't go below 0 then we had 2 or more before the subtract. So we copy a halfword and advance our pointers. Then if the count is still greater than 0 we do it again.

Except we can't have an unconditional return any more. After the subtraction the result could be

  • more than 0, so the bgt catches it and the loop goes again.
  • 0, so the loop started at 2, and we return (we can use bxeq).
  • -1, in which case we had 1 at the start of the loop, so copy a byte and return.
  • -2, in which case we had 0 at the start of the loop, so just return.

We can decide between the -1 and -2 cases by adding 1 back to the count. If we land on 0 then that means we should copy the extra byte.

copy_u16_forward:
    subs    r2, r2, #2
    ldrhcs  r3, [r1], #2
    strhcs  r3, [r0], #2
    bgt     copy_u16_forward
    bxeq    lr
    adds    r2, r2, #1
    ldrbeq  r3, [r1]
    strbeq  r3, [r0]
    bx      lr

Don't forget about the -2 case and think that we can just remove the adds and unconditionally copy the last byte once we're after the bxeq lr. That would be a bad time for sure.

Reverse u16

With the reverse loop we have to handle a possible odd byte count before we do our halfwords.

LLVM wants us to do it like this

@ opt=3
copy_u16_reverse:
    tst     r2, #1
    beq     .LBB1_2
    sub     r2, r2, #1
    ldrb    r3, [r1, r2]
    strb    r3, [r0, r2]
  .LBB1_2:
    cmp     r2, #2
    bxlo    lr
    sub     r12, r0, #2
    sub     r1, r1, #2
  .LBB1_4:
    add     r0, r1, r2
    add     r3, r12, r2
    sub     r2, r2, #2
    ldrh    r0, [r0]
    cmp     r2, #1
    strh    r0, [r3]
    bhi     .LBB1_4
    bx      lr

@ opt=z
copy_u16_reverse:
    tst     r2, #1
    beq     .LBB1_2
    sub     r2, r2, #1
    ldrb    r3, [r1, r2]
    strb    r3, [r0, r2]
  .LBB1_2:
    sub     r12, r0, #2
    sub     r1, r1, #2
  .LBB1_3:
    cmp     r2, #1
    bxls    lr
    add     r0, r1, r2
    add     r3, r12, r2
    sub     r2, r2, #2
    ldrh    r0, [r0]
    strh    r0, [r3]
    b       .LBB1_3

The opt=3 version has an 8 instruction halfword loop with bx on the outside of the loop (cool). The opt=z version has an 8 instruction halfword loop with bxls on the inside of the loop (not cool). So the opt=z version saved no space compared to the opt=3 version (8 instructions either way), and yet it's slower. I dunno what LLVM is doing, but that doesn't seem like a good time.

There's definitely situations where an optimizer can do some weird math transformation or something and give you better code than you'd know to write on your own, but this is an example of when the opposite happens. For whatever reason, the optimizer fell over on the job with this one.

One thing that both LLVM versions agree on is starting the function with a tst instruction. This is like cmp in the sense that it does an operation just to set the flags, without storing the output. The difference is which operation is used to set the flags.

  • cmp a, b sets the flags based on a - b
  • tst a, b sets the flags based on a & b

So if we do tst r2, #1 we'll be able to find out if our count is starting as an odd value. That makes sense, and we can agree with LLVM that it's the correct opening move for this function.

After the tst LLVM jumps ahead in the function if we're on the even-count case, and then the odd-count case falls through to a byte copy. That's no good. Our ideal path should be the path with the fewest branches, so we want to re-arrange the code some. We'll put the byte copy at the end of the function as a little sub-routine thing. Then the odd-count case can jump there, do the byte copy, and then jump to the halfword loop, while the even-count case doesn't have to jump at all. Either way, we get to the halfword loop. It's the same basic structure as we had for the u8 reverse loop, we're just going by 2 instead of by 1.

copy_u16_reverse:
    tst     r2, #1
    bne     2f
  1:
    subs    r2, r2, #2
    ldrhcs  r3, [r1, r2]
    strhcs  r3, [r0, r2]
    bgt     1b
    bxeq    lr
  2:
    sub     r2, r2, #1
    ldrb    r3, [r1, r2]
    strb    r3, [r0, r2]
    b       1b

Looks like LLVM's opt=z version was 15 instructions, and we've narrowed it down to 11 instructions. Not bad at all.

Copy By (Mostly) u32

Here's where things can get really interesting.

The biggest data copies that we'd have to do on the GBA are probably things to do with copying around tile data. Tiles are always 8x8, and they can be 4-bits-per-pixel (4bpp) or 8-bits-per-pixel (8bpp). So that's either 32 bytes or 64 bytes, just for a single tile. And most things to do with tiles involve more than one tile. We still might copy small stuff, say 12 or 16 bytes, but when the byte count to copy starts to get high it'll quite possibly get extremely high. If there's a "large" creature that needs 4 by 4 tiles to draw, even at just 4bpp that's gonna be a 512 byte copy. So we'll finally have to be very careful that we support both extremely large copies (hundreds of bytes) and also moderate copies (32 bytes or less). We can't really let either code path take a speed hit.
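
Just to pin down the numbers in that paragraph, here's the tile math as Rust constants:

// 8x8 pixels at 4 or 8 bits per pixel, converted to bytes.
const TILE_4BPP_BYTES: usize = (8 * 8 * 4) / 8; // 32
const TILE_8BPP_BYTES: usize = (8 * 8 * 8) / 8; // 64
// A "large" 4x4-tile object at 4bpp.
const BIG_OBJ_4BPP_BYTES: usize = 4 * 4 * TILE_4BPP_BYTES; // 512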

Let's try a basic rust version modeled after the u8 and u16 loops we've seen:

#![no_std]

pub unsafe extern "C" fn copy_u32_forward(mut dest: *mut u8, mut src: *const u8, mut count: usize) {
  const SIZE_U32: usize = core::mem::size_of::<u32>();
  while count >= SIZE_U32 {
    dest.cast::<u32>().write(src.cast::<u32>().read());
    dest = dest.add(SIZE_U32);
    src = src.add(SIZE_U32);
    count -= SIZE_U32;
  }
  while count > 0 {
    *dest = *src;
    dest = dest.add(1);
    src = src.add(1);
    count -= 1;
  }
}

pub unsafe extern "C" fn copy_u32_reverse(mut dest: *mut u8, mut src: *const u8, mut count: usize) {
  const SIZE_U32: usize = core::mem::size_of::<u32>();
  while (count % SIZE_U32) > 0 {
    count -= core::mem::size_of::<u8>();
    dest.add(count).write(src.add(count).read());
  }
  while count >= SIZE_U32 {
    count -= SIZE_U32;
    dest.add(count).cast::<u32>().write(src.add(count).cast::<u32>().read());
  }
}

Okay one thing is that ARMv4T has ldm (load-multiple) and stm (store-multiple). These let you load and store several words at once, which makes bulk copying more efficient. It's how push and pop work for the stack, but with any pointer and block of memory. I'm not seeing any of those used in any of the optimization levels we're looking at.

Let's see if we can convince LLVM to use ldm and stm. We'll have it copy an entire [u32; 8] per loop (the size of one 4bpp tile), that might do the trick:

pub unsafe extern "C" fn copy_u32_forward(mut dest: *mut u8, mut src: *const u8, mut count: usize) {
  const SIZE_U32: usize = core::mem::size_of::<u32>();
  while count >= (SIZE_U32 * 8) {
    dest.cast::<[u32;8]>().write(src.cast::<[u32;8]>().read());
    dest = dest.add(SIZE_U32 * 8);
    src = src.add(SIZE_U32 * 8);
    count -= (SIZE_U32 * 8);
  }
  while count >= SIZE_U32 {
    dest.cast::<u32>().write(src.cast::<u32>().read());
    dest = dest.add(SIZE_U32);
    src = src.add(SIZE_U32);
    count -= SIZE_U32;
  }
  while count > 0 {
    *dest = *src;
    dest = dest.add(1);
    src = src.add(1);
    count -= 1;
  }
}

So while there's potentially more than 8 words left to go, we'll block copy 8 words.

And for this version, LLVM's opt=3 gives us

example::copy_u32_forward:
    push    {r4, r5, r6, lr}
    mov     r4, r2
    mov     r5, r1
    mov     r6, r0
    cmp     r2, #32
    blo     .LBB0_2
  .LBB0_1:
    mov     r0, r6
    mov     r1, r5
    mov     r2, #32
    bl      memmove
    sub     r4, r4, #32
    add     r5, r5, #32
    add     r6, r6, #32
    cmp     r4, #31
    bhi     .LBB0_1
  .LBB0_2:
    cmp     r4, #4
    blo     .LBB0_4
  .LBB0_3:
    ldr     r0, [r5], #4
    sub     r4, r4, #4
    str     r0, [r6], #4
    cmp     r4, #3
    bhi     .LBB0_3
  .LBB0_4:
    cmp     r4, #0
    beq     .LBB0_6
  .LBB0_5:
    ldrb    r0, [r5], #1
    subs    r4, r4, #1
    strb    r0, [r6], #1
    bne     .LBB0_5
  .LBB0_6:
    pop     {r4, r5, r6, pc}

CURSE YOU BLASTED DRAGON MACHINE.

We're trying to implement memmove, you can't call memmove while we're implementing it!!!

Alright, fine, this handwritten assembly will just look even less like the LLVM assembly than normal.

Forward u32

Slow But Simple u32 Forward

To begin, let's again take our u8 loop and just update it to go by u32.

copy_u32_forward:
    subs    r2, r2, #4
    ldrcs   r3, [r1], #4
    strcs   r3, [r0], #4
    bgt     copy_u32_forward
    bxeq    lr
    @ todo: handle the spare bytes, if any

We load (ldr) and store (str) one word at a time when the carry flag is set (cs), meaning that the subtraction did not cause an unsigned underflow, meaning that we started the loop with 4 or more bytes in the count.

After the word copy:

  • If: the bytes left are greater than 0 we go back to the start of the loop (bgt)
  • Else If: the count is now exactly zero we return (bxeq)
  • Else: we started the loop with less than 4 bytes, then subtracted 4, so we've got some amount negative in our counter.

In the u16 "else" case we had either 0 or 1 spare bytes we also wanted to copy. We resolved it with adds, a single conditional copy, and then an unconditional return.

In the u32 "else" case we've got 0, 1, 2, or 3 spare bytes to copy. That's enough possibilities that it calls for a loop.

At this point in the code the current count is 4 less than we're actually supposed to copy. If we need 1 more byte copied, the count will be -3. If there's 2 more to copy then -2, 3 more is -1. Also, if we entered the loop with 0, we'll actually be at -4. This means that we can't just loop by adding 1 to our counter and continuing until we get to 0. That would give us like the inverse of the correct number of additional bytes.

The simplest solution in this scenario is to just add 4 back into the counter (to "undo" the most recent subtraction) and do a normal u8 copy loop. We could either call over to copy_u8_forward, or we could inline that loop into our u32 function.

Any type in Rust that's got an alignment of 4 will also have a size that's a multiple of 4. This means that handling the 1-3 spare bytes case is fairly rare.
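
That's not just a hunch, by the way: Rust guarantees that a type's size is a multiple of its alignment, so we can even have the compiler double check the claim for us:

// Size is always a multiple of alignment in Rust, so copying whole values of
// an align-4 type never leaves 1-3 spare bytes behind.
const _: () = {
  assert!(core::mem::size_of::<u32>() % core::mem::align_of::<u32>() == 0);
  assert!(core::mem::size_of::<[u32; 7]>() % core::mem::align_of::<[u32; 7]>() == 0);
};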

  • If we branch over to copy_u8_forward we pay 3 CPU cycles when we hit this case (a b takes 1N+2S, which is 3 cycles total for IWRAM code).
  • If we inline that loop here we remove the one branch but it takes up an additional 20 bytes of IWRAM space.

I'd rather pay the 3 CPU cycles on rare occasions and save the IWRAM space. So our fully working u32 copy looks like this:

copy_u32_forward:
    subs    r2, r2, #4
    ldrcs   r3, [r1], #4
    strcs   r3, [r0], #4
    bgt     copy_u32_forward
    bxeq    lr
    add     r2, r2, #4
    b       copy_u8_forward

To be clear: this is not as fast as possible for large copies, but it does at least work.

If we want to go faster than this, we'll need those ldm and stm instructions.

ldm and stm loop for u32 Forward

If we use ldm we can load more than one register with a series of words starting from a base address. Technically ldm has several possible addressing modes, but if you don't specify one you get ia ("increment after") as the default. It takes a register list, and for each register that you load into, the address advances one word during the loading, so it loads a series of successive words. If you put ! after the base address register then the final address value is written back into that register, so you can keep advancing through memory on the next pass of a loop or however else.

The stm instruction lets us store multiple registers out in the same way.

The combination of ldm and stm we'll call a "block" copy.

Right now we're using our registers as follows:

  • r0: destination pointer
  • r1: source pointer
  • r2: byte count
  • r3: temp storage

Per the C ABI for ARM, we can freely use r0-r3, r12, and lr (aka r14) for whatever we want during a function call. If we want to alter any other register we've got to restore the value before our function returns. We do this by pushing the values to the stack, then popping them back off the stack later. The pushing and popping itself takes time.

So there's two basic strategies:

  • We can use just r3 and r12 for our block copy, doing two words per loop and avoiding the need for the stack.
  • Or we can use r3 and r12 plus push 6 other registers, letting us do eight words (one 4bpp tile) per loop. But then we have to use the stack.

We could consider the cycle timings as well:

op time
ldr 1S+1N+1I
str 2N
ldm (x)S+1N+1I
stm (x-1)S+2N
sub 1S
b 2S+1N

The x value on ldm/stm is the number of registers you're loading or storing. Note that some situations can make these instructions cost more than listed, such as if you ever use pc as a destination, but for our purposes these numbers are fine. Also, any conditional op takes just 1S if the condition doesn't hold.

Let's see how long things take. For simplicity we'll pretend that N and S are both 1, even though sometimes they might be more.

@ copy 1 word loop: 2.25 cycles per byte (9/4)
  1:
    subs    r2, r2, #4    @ 1
    ldrcs   r3, [r1], #4  @ 3
    strcs   r3, [r0], #4  @ 2
    bgt     1b            @ 3
@ copy 2 words loop: 1.375 cycles per byte (11/8)
  1:
    subs    r2, r2, #8     @ 1
    ldmcs   r1!, {r3, r12} @ 4
    stmcs   r0!, {r3, r12} @ 3
    bgt     1b             @ 3
@ copy 8 words loop: 0.71875 cycles per byte (23/32)
  1:
    subs    r2, r2, #32       @ 1
    ldmcs   r1!, {r3-r9, r12} @ 10
    stmcs   r0!, {r3-r9, r12} @ 9
    bgt     1b                @ 3
  • The final use of each loop causes the branch condition to not hold, and then we cut 2 cycles
  • To get the spare six registers for the 8-word loop we need 7 cycles for the push and 8 cycles for the pop.

So if there's 32 bytes:

  • The 8-word loop will be 36 cycles (push, break on the first loop, pop)
  • The 2-word loop will be 42 cycles (3 complete loops, break on the 4th loop)

Fancy as it is to not use the stack, even just a 32 byte copy is enough to justify using the stack.

If we're copying less than 32 bytes, well, that's actually tricky. We get better performance per byte from using the 2-word loop. However, it makes the tail conditions really complicated. In addition to spare bytes after all the word copying, we'd also have to handle a possible spare word after the 2-word copying. And when we merge the +8 and -4 to be just a +4 then we have to change our conditional copy to use pl for "plus" (which includes zero too). That all seems extra fiddly, I'm not sure we want to deal with that.

Awh nuts

one_word_loop:
    subs    r2, r2, #4
    ldrcs   r3, [r1], #4
    strcs   r3, [r0], #4
    bgt     one_word_loop
    bxeq    lr
    add     r2, r2, #4
    b       copy_u8_forward

two_then_one_loop:
    subs    r2, r2, #8
    ldmcs   r1!, {r3, r12}
    stmcs   r0!, {r3, r12}
    bgt     two_then_one_loop @ if counter > 8, continue loop
    bxeq    lr                @ if counter == 8, return
    adds    r2, r2, #4        @ otherwise go back some
    ldrpl   r3, [r1], #4      @ if counter >= 4, copy a word
    strpl   r3, [r0], #4
    bxeq    lr                @ if counter == 4, return
    addmi   r2, r2, #4        @ otherwise fix *again*, but only if we skipped the word
    b       copy_u8_forward   @ then byte copy

Then if all my math is right, the time totals look like this

Words one_word_loop two_then_one_loop
7 64 46
6 55 36
5 46 34
4 37 24
3 28 22
2 19 12
1 10 12
0 16 20

How many words does LLVM copy on its own before calling a function? If we go back to that code sample where we tried to get LLVM to use ldm/stm by having it copy arrays of u32 we can get a hint. For u32 arrays of 16 bytes or less LLVM copies on its own, and then when the array is 20 or more bytes it uses a call to memmove. That is not meant to be a statement of how LLVM always generates the code, it's just what it does with u32 arrays in that situation. If we try using u8 arrays in that function, it directly copies an 8 byte array, and then goes to memmove at 12 bytes.

I guess it depends on the exact types involved, but I think that we can be fairly sure that our u32 copy function won't be called for single word copies. LLVM is at least that smart. The more complex two_then_one_loop already performs better than the one_word_loop with just 2 words, and the difference increases as the number of words goes up. Using ldm/stm to reduce loop iterations is very effective.
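
If you want to poke at that cutoff yourself, a little probe function like this (hypothetical, purely for Compiler Explorer experiments) makes it easy to watch where the inline copies stop and the memmove calls start:

#![no_std]

// Bump the array length up and down and watch the assembly flip between
// inline loads/stores and a call to memmove. The exact cutoff depends on the
// LLVM version, the target, and the optimization level.
pub unsafe extern "C" fn copy_probe(dest: *mut u8, src: *const u8) {
  dest.cast::<[u32; 5]>().write(src.cast::<[u32; 5]>().read());
}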

And so, with a heavy sigh that the code has gotten complex like this, I think the way we want to write it all looks like:

copy_u32_forward:
    cmp     r2, #32
    bge     2f
  1: @ copy less than 8 words
    subs    r2, r2, #8
    ldmcs   r1!, {r3, r12}
    stmcs   r0!, {r3, r12}
    bgt     1b
    bxeq    lr
    adds    r2, r2, #4
    ldrpl   r3, [r1], #4
    strpl   r3, [r0], #4
    bxeq    lr
    addmi   r2, r2, #4
    b       copy_u8_forward
  2: @ copy 8 word blocks
    push    {r4-r9}
  3:
    subs    r2, r2, #32
    ldmcs   r1!, {r3-r9, r12}
    stmcs   r0!, {r3-r9, r12}
    bgt     3b
    pop     {r4-r9}
    bxeq    lr
    @ if we didn't early return the last loop over-subtracted
    @ so fix that up before going to the main code path
    adds    r2, r2, #32
    b       1b

The Carry/Sign Test

There's at least one more trick we can apply.

Right now it's possible to have 0 to 3 spare bytes when we enter our word copy loop, which eventually falls down to calling copy_u8_forward. The time to load and store isn't any faster for bytes than for words, so that 1-byte loop takes 9 cycles per full pass the same as our 1-word version does. So if we call it for 3 spare bytes we're gonna spend a lot of time on a very small amount of work.

What we'd rather have happen in the 3 byte case is just do one halfword copy and then one byte copy, and save some cycles. If only there was a way to quickly figure out when to do a halfword and/or byte copy.

Turns out there is, lucky us.

We can use what's called the "carry/sign test" to determine the state of bits 0 and 1 of our count in a single instruction. We just "logical shift left" (aka <<) the count by 31 bits.

lsls r12, r2, #31

That's the whole thing!

For a shift, the flags are set as follows:

  • N Flag = 1 if result (as a signed value) is negative
  • Z Flag = 1 if result is 0
  • C Flag = 1 if the last bit shifted out of the number is 1 (and unchanged if the shift amount is 0).
  • V Flag = unaffected

So when we use lsls #31:

  • Bit 0 of the starting value is now the sign bit of the result, which determines N.
  • Bit 1 is the last bit we shifted out, which determines C.
  • The rest of the bits we know are 0, since we know the count is less than 4 by this point in the code. Even if they weren't 0 for sure, they won't hurt our check.

So we shift the count left by 31 bits, then do a conditional halfword copy (using cs for "carry set"), followed by a conditional byte copy (using mi for "minus").

@ we don't need the output, just the flags changes
lsls    r3, r2, #31
ldrhcs  r3, [r1], #2
strhcs  r3, [r0], #2
ldrbmi  r3, [r1], #1
strbmi  r3, [r0], #1

No need for a loop.
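
Here's the same trick written out as plain Rust bit math, if that helps it click (just an illustration of which flag ends up holding which bit):

// What `lsls rX, count, #31` tells us: bit 0 of the count lands in the sign
// position (the N flag) and bit 1 is the last bit shifted out (the C flag).
fn spare_byte_and_halfword(count: u32) -> (bool, bool) {
  let copy_halfword = count & 0b10 != 0; // C flag ("carry set")
  let copy_byte = count & 0b01 != 0;     // N flag ("minus")
  (copy_halfword, copy_byte)
}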

Now we can just replace the call to copy_u8_forward with this new assembly, and then an unconditional bx lr, and we'll be in good shape.

  1: @ copy less than 8 words
    subs    r2, r2, #8
    ldmcs   r1!, {r3, r12}
    stmcs   r0!, {r3, r12}
    bgt     1b
    bxeq    lr
    adds    r2, r2, #4
    ldrpl   r3, [r1], #4
    strpl   r3, [r0], #4
    bxeq    lr
    addmi   r2, r2, #4
    lsls    r3, r2, #31
    ldrhcs  r3, [r1], #2
    strhcs  r3, [r0], #2
    ldrbmi  r3, [r1], #1
    strbmi  r3, [r0], #1
    bx      lr

Ah, wait, wait! We're adding 4 back to our counter value (in the case where we skipped that last word copy) to undo the most recent subtraction, so that the carry/sign test sees the actual number of leftover bytes. Except our numbers are two's complement integers, so if we're just adding and subtracting multiples of 4, it's not going to affect the lowest two bits anyway, and those are the only bits we're looking at with our carry/sign test.

Count Before Sub Bits
0 00000000000000000000000000000000
1 00000000000000000000000000000001
2 00000000000000000000000000000010
3 00000000000000000000000000000011
Count After Sub Bits
-4 11111111111111111111111111111100
-3 11111111111111111111111111111101
-2 11111111111111111111111111111110
-1 11111111111111111111111111111111

With or without that 4 added back in, the lowest two bits are the same. Which means we can skip that add entirely.

    bxeq    lr
    @ we can skip the add!
    lsls    r3, r2, #31

No need to even change any conditionals or anything. I love it.
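
If you're suspicious of that argument, it's easy to state as a one-line property check: offsetting a wrapping 32-bit counter by any multiple of 4 never touches its lowest two bits.

// The property behind "we can skip the add": the low two bits survive any
// wrapping addition or subtraction of a multiple of 4.
fn low_bits_survive(count: u32, words: u32) -> bool {
  count.wrapping_sub(words.wrapping_mul(4)) & 0b11 == count & 0b11
}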

So our final version of the function looks like this.

copy_u32_forward:
    cmp     r2, #32
    bge     2f
  1: @ copy less than 8 words
    subs    r2, r2, #8
    ldmcs   r1!, {r3, r12}
    stmcs   r0!, {r3, r12}
    bgt     1b
    bxeq    lr
    adds    r2, r2, #4
    ldrpl   r3, [r1], #4
    strpl   r3, [r0], #4
    bxeq    lr
    lsls    r3, r2, #31
    ldrhcs  r3, [r1], #2
    strhcs  r3, [r0], #2
    ldrbmi  r3, [r1], #1
    strbmi  r3, [r0], #1
    bx      lr
  2: @ copy 8 words at a time (loop)
    push    {r4-r9}
  3:
    subs    r2, r2, #32
    ldmcs   r1!, {r3-r9, r12}
    stmcs   r0!, {r3-r9, r12}
    bgt     3b
    pop     {r4-r9}
    bxeq    lr
    adds    r2, r2, #32
    b       1b

This is getting a little complicated.

Reverse u32

Similar to before, once we know how we want our forward loop to go we can roughly do the same steps "in reverse" to form our reverse loop. The biggest difference is that ldm/stm can't use offset addressing, so if we hit that case we'll have to adjust the pointers manually as part of that setup. Other than that, this should be fairly simple.

  • carry/sign test on the count
  • single byte copy, if any
  • single halfword copy, if any
  • Now we know the count is a clean multiple of 4
  • Copy 8 words at a time (if the count is high enough)
  • Copy 2 words at a time
  • Copy one final word if necessary

We start with a carry/sign test. If the flags show that we need to copy a byte or a halfword then we'll do that. Except this is the reverse loop, so "do that" means subtracting the correct value from the byte count and then using the byte count as an offset for the addressing.

copy_u32_reverse:
    movs    r3, r2, lsl #31
    submi   r2, r2, #1
    ldrbmi  r3, [r1, r2]
    strbmi  r3, [r0, r2]
    subcs   r2, r2, #2
    ldrhcs  r3, [r1, r2]
    strhcs  r3, [r0, r2]
    @ TODO: the rest of the owl

Oh golly that's a lot of conditional instructions. Like we had with the block copy in our forward loop, we're gonna push this out of the main path of the code. And since we'll also do that with the block copy, I think we're going to end up with more than just 3 labels in this function. Let's finally use some "local labels". They're just like normal labels, except they're not exported from the file that they appear in. The prefix for a local label varies by platform, for us it's .L, and they're otherwise like all the other labels we've used.

A label name can be fairly long. I'm gonna use some long label names to comment the intent a bit.

copy_u32_reverse:
    tst     r2, #3
    bne     .L_handle_byte_and_halfword
  .L_count_is_now_a_multiple_of_four:
    @ TODO: the rest of the owl
  .L_handle_byte_and_halfword:
    lsls    r3, r2, #31
    submi   r2, r2, #1
    ldrbmi  r3, [r1, r2]
    strbmi  r3, [r0, r2]
    subcs   r2, r2, #2
    ldrhcs  r3, [r1, r2]
    strhcs  r3, [r0, r2]
    b       .L_count_is_now_a_multiple_of_four

Amazing. Looks better than I thought it would.

So we know our pointers are aligned to 4, because that's a requirement to call the function. And now at the "TODO" point in the code we know the count is aligned to 4 also. And ldm/stm don't accept offset addressing. So we just add the count to both our pointers.

  .L_count_is_now_a_multiple_of_four:
    add     r1, r1, r2
    add     r0, r0, r2
    @ TODO: the rest of the owl

Now we're set to use ldm/stm, so I guess we put that in. As with the forward loop, I think we want to put this to the side of the main code path and try to give the 1-7 word copy portion the main code path. The change here to make it "reverse" is that instead of using ldm, which is short for ldmia ("load-multiple, increment after"), we'll use ldmdb ("load-multiple, decrement before"). This matches our earlier reverse loops where we subtract before the copy and then use that subtracted value as the offset.

  .L_count_is_now_a_multiple_of_four:
    add     r1, r1, r2
    add     r0, r0, r2
    cmp     r2, #32
    bge     .L_block_copy_sub
  .L_done_with_block_copy:
    @ TODO: the rest of the owl
  .L_block_copy_sub:
    push    {r4-r9}
  1:
    subs    r2, r2, #32
    ldmdbcs r1!, {r3-r9, r12}
    stmdbcs r0!, {r3-r9, r12}
    bgt     1b
    pop     {r4-r9}
    bxeq    lr
    adds    r2, r2, #32
    b       .L_done_with_block_copy

Now we want to fill in the .L_done_with_block_copy. At this point we know that we're going to copy 0-7 words of data.

Actually, now that I'm thinking in words instead of in bytes, another idea comes to mind. The r2 register holds the number of bytes to copy. Our goal, you could say, is to clear all of the bits. Once all the bits in our counter are 0, that means there's no bytes left to copy and so we can return. We know bits 0 and 1 are clear. We know bits 5 and above are clear. We're just not sure about bits 2, 3, and 4.

Words Remaining Counter Bits
0 00000000000000000000000000000000
1 00000000000000000000000000000100
2 00000000000000000000000000001000
3 00000000000000000000000000001100
4 00000000000000000000000000010000
5 00000000000000000000000000010100
6 00000000000000000000000000011000
7 00000000000000000000000000011100

So with only 3 bits to test and possibly copy for, it's kinda practical to just test each bit and copy that much if we get a hit.

  .L_done_with_block_copy:
    @ check for 4+ words remaining, copy if so
    tst     r2, #(1<<4)
    ldmdbne r1!, {r3, r12}
    stmdbne r0!, {r3, r12}
    ldmdbne r1!, {r3, r12}
    stmdbne r0!, {r3, r12}
    @ TODO: the rest of the owl

Yeah that doesn't seem so bad.

Know what else is cool? The carry/sign test thing, you can do it on any two bits in a register. As long as they're next to each other of course. We just adjust the logical left shift amount. You subtract the higher number bit from 32, and shift by that much. Before we wanted to test bits 0 and 1, so we shifted by 31 (32-1==31). Now we want bits 2 and 3, so we shift by 29 (32-3==29).
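
In Rust terms the general rule looks like this (the N flag gets the lower bit of the pair, the C flag gets the upper bit):

// Generalized carry/sign test: to check adjacent bits `lo` and `lo + 1`,
// shift left by 32 - (lo + 1). That's 31 for bits 0/1 and 29 for bits 2/3.
fn adjacent_bit_flags(value: u32, lo: u32) -> (bool, bool) {
  let shift = 32 - (lo + 1);
  let n = ((value << shift) as i32) < 0;  // N: bit `lo` became the sign bit
  let c = value & (1 << (lo + 1)) != 0;   // C: bit `lo + 1` was the last bit shifted out
  (n, c)
}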

  .L_done_with_block_copy:
    @ check for 4+ words remaining, copy if so
    tst     r2, #(1<<4)
    ldmdbne r1!, {r3, r12}
    stmdbne r0!, {r3, r12}
    ldmdbne r1!, {r3, r12}
    stmdbne r0!, {r3, r12}
    @ carry/sign for 2 and/or 1 words remaining
    lsls    r3, r2, #29
    ldmdbcs r1!, {r3, r12}
    stmdbcs r0!, {r3, r12}
    ldrmi   r3, [r1, #-4]
    strmi   r3, [r0, #-4]
    bx      lr

And now that I've written it like this it becomes obvious that probably the forward loop should be using the same technique.

Let's not go back. For now, just pretend that we went back and updated the forward loop. There's still more to talk about, and when we see the "full and complete" version of stuff that'll also include seeing the forward loop updated to use bit testing.

Okay, so here's the full u32 reverse copy:

copy_u32_reverse:
    tst     r2, #3
    bne     .L_handle_byte_and_halfword
  .L_count_is_now_a_multiple_of_four:
    add     r1, r1, r2
    add     r0, r0, r2
    cmp     r2, #32
    bge     .L_block_copy_sub
  .L_done_with_block_copy:
    tst     r2, #(1<<4)
    ldmdbne r1!, {r3, r12}
    stmdbne r0!, {r3, r12}
    ldmdbne r1!, {r3, r12}
    stmdbne r0!, {r3, r12}
    lsls    r3, r2, #29
    ldmdbcs r1!, {r3, r12}
    stmdbcs r0!, {r3, r12}
    ldrmi   r3, [r1, #-4]
    strmi   r3, [r0, #-4]
    bx      lr
  .L_block_copy_sub:
    push    {r4-r9}
  1:
    subs    r2, r2, #32
    ldmdbcs r1!, {r3-r9, r12}
    stmdbcs r0!, {r3-r9, r12}
    bgt     1b
    pop     {r4-r9}
    bxeq    lr
    adds    r2, r2, #32
    b       .L_done_with_block_copy
  .L_handle_byte_and_halfword:
    lsls    r3, r2, #31
    submi   r2, r2, #1
    ldrbmi  r3, [r1, r2]
    strbmi  r3, [r0, r2]
    subcs   r2, r2, #2
    ldrhcs  r3, [r1, r2]
    strhcs  r3, [r0, r2]
    b       .L_count_is_now_a_multiple_of_four

And, hey, if we didn't need to have the extra +4 fixup step to undo our subtract when we go past 4 with the single word copy, then I guess we don't need to have a +32 fixup step at the end of the block copy either. They're both a power of 2 that's bigger than the bits we care about, so it's all the same as long as we're only doing bit tests. I don't want to paste that whole thing a second time, so let's pretend I did and just call that another small thing to fix in the next big version somewhere farther down the page.

A Working __aeabi_memcpy

We've seen a lot of variations on copy loops. It's time to make the thing we came here to make, a complete __aeabi_memcpy function.

This will be like the "forward" loops we did before, but we'll have more prologue code.

We don't know if our two pointers are aligned to 1, 2, or 4. The best copy loop we can use depends on the minimum of the alignment of both pointers. If our pointers start less aligned than we want, sometimes we can do a small copy to become aligned, and then use a better aligned loop.

  • Example 1: one pointer is 5, and the other is 11. Those are both align 1, but if we just do a single byte copy they advance to 6 and 12, which are both align 2. Now we can go by halfwords and speed up the rest of the copying.
  • Example 2: If one pointer is 7, and the other is 14, then they'll unfortunately be perpetually out of alignment. Even if we try to do a small copy at the start to get them lined up, one of them will always be out of alignment when the other is in alignment.

We'll need various checks on our input to figure out what the situation is, then we can jump to the loop that fits our needs best.

Since we've got 2 values, and each value has 2 possible states for a given bit, let's make one of those truth table things.

Dest Src Can Align 2 ?
Even Even Yes
Even Odd No
Odd Even No
Odd Odd Yes

Ah, wonderful. The words in the chart aren't the classic "0" or "1", but the overall pattern matches the Exclusive-Or operation. In Rust we write it with ^. In ARM assembly it's written with eor.

Depending on the Exclusive Or between the 0-bit values of the address we either will or won't be able to align the pointers to 2. The same idea holds for the 1-bit of the address values and trying to align to 4. It applies up to any alignment we need. When we need both pointers to be aligned to X (where X is a power of 2 like normal), then if all the bits below X are clear when you Exclusive Or the two pointers you'll be able to do it. Either both pointers are already aligned, or both pointers can be advanced the same amount to make them aligned.

Of course, that assumes that we have enough bytes to copy that there's still work to do after doing some initial small copies to advance the pointers. As usual, if the call comes in to do a 0 byte copy or a 1 byte copy or something else small like that we need to maintain correctness.
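
As a quick Rust model of that rule (just to state the check plainly, the real version happens in the assembly below):

// Two addresses can be advanced to a shared alignment exactly when the XOR of
// the addresses has no bits set below that alignment (a power of two).
fn can_co_align(dest: usize, src: usize, align: usize) -> bool {
  debug_assert!(align.is_power_of_two());
  (dest ^ src) & (align - 1) == 0
}

// Example 1 from above: 5 and 11 can both be walked up to alignment 2.
// Example 2: 7 and 14 can never be aligned to 2 at the same time.
// can_co_align(5, 11, 2) == true
// can_co_align(7, 14, 2) == false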

Like with most all our other copy loops, r0, r1, and r2 are all in use. We're using r3 as our temp storage for the copy. Let's put our alignment check in r12.

__aeabi_memcpy:
    eor     r12, r0, r1
  • If Bit 0 of r12 is set we know we're capped to being align 1 at best.
  • Else If Bit 1 of r12 is set we're capped at being align 2 at best.
  • Else we can get ourselves to being align 4, which is the best situation for us.

Since that's another "two bits next to each other" situation we can again use a carry/sign test. You learn just one neat trick (virtual machines hate it!) and spend the entire rest of the article stuck using it over and over.

Anyway, now we can branch to the align1 path on negative, branch to the align2 path on carry, otherwise we keep going on the align4 path.

Once we're on the path we want, for the align2 and align4 case we might need to do a little work to actually get the alignment we're going for. In the align1 case we've obviously never got any starting work, a pointer can't be aligned to less than 1 after all.

Something like this.

__aeabi_memcpy:
    eor     r12, r0, r1
    lsls    r12, r12, #31
    bmi     copy_u8_forward
    bcs     copy_u16_forward_fixup_prelude
    @ fallthrough to copy_u32_forward_fixup_prelude

For the fixup preludes we check the bits of either address to determine how much we need to fix up by. Since we did the exclusive-or we know that both addresses will need fixing by the same amount, however much that is. This feels like another situation where we want to assume we don't have to do the work on the main path, and then put any extra work on a side path.

copy_u32_forward_fixup_prelude:
    tst     r0, #3
    bne     .L_copy_u32_forward_do_a_fixup
  .L_copy_u32_forward_after_the_fixup_is_complete:
    @ TODO
  .L_copy_u32_forward_do_a_fixup:
    lsls    r12, r0, #31
    @ advance by 1 byte if bit0 is set in the dest address
    submi   r2, r2, #1
    ldrbmi  r3, [r1], #1
    strbmi  r3, [r0], #1
    @ re-test the dest address (the byte copy may have advanced it), then
    @ advance by 2 bytes if bit1 is set
    lsls    r12, r0, #31
    subcs   r2, r2, #2
    ldrhcs  r3, [r1], #2
    strhcs  r3, [r0], #2
    @ fixup complete
    b       .L_copy_u32_forward_after_the_fixup_is_complete

How big can a label be? I don't know honestly. If we need to use smaller labels in the actual code that's fine. For illustrative / educational purposes I'm going to keep the extra verbose labels in the sample code blocks.

Anyway, once a path has been selected, and the fixup applied, then we just need to actually do the copy.

For "memcpy" style functions, it's assumed that the two regions are exclusive, so we can always select the forward copy loop.

__aeabi_memcpy4 and __aeabi_memcpy8

These functions are just like the general __aeabi_memcpy, except that they require the pointers to already be aligned to the amount of their number.

That means that we can basically just rename .L_copy_u32_forward_after_the_fixup_is_complete to __aeabi_memcpy4, since both blocks assume pointers aligned to 4.

For us there's no advantage to a pointer alignment of 8. Later ARM CPUs have "double word" operations, but we don't. That means we can just make __aeabi_memcpy8 be an alias for the 4 version.

copy_u32_forward_fixup_prelude:
    tst     r0, #3
    bne     .L_copy_u32_forward_do_a_fixup
    @ fallthrough to __aeabi_memcpy4
__aeabi_memcpy8:
__aeabi_memcpy4:
    @ TODO
  .L_copy_u32_forward_do_a_fixup:
    lsls    r12, r0, #31
    @ advance by 1 byte if bit0 is set in the dest address
    submi   r2, r2, #1
    ldrbmi  r3, [r1], #1
    strbmi  r3, [r0], #1
    @ re-test the dest address, then advance by 2 bytes if bit1 is set
    lsls    r12, r0, #31
    subcs   r2, r2, #2
    ldrhcs  r3, [r1], #2
    strhcs  r3, [r0], #2
    @ fixup complete
    b       __aeabi_memcpy4

Aw, dag, the indentation for my labels is getting kinda weird. I was using no-indent for labels that are exported from the file, and then 2 spaces for labels that aren't exported, but now in the name of having the "happy path" branch as little as possible our sad path code is jumping all over the page. Guess it can't be helped, code just does this after a while when you keep handling more and more situations.

A Working __aeabi_memmove

TODO

A Working __aeabi_memset

TODO

A Working __aeabi_memclr

TODO

Working libc functions

TODO
