Memory Functions
This whole document is released as Creative Commons 0 (CC0).
More usable code for this article, including tests, can be found in the GitHub repository.
Introduction
The "memory functions" do two basic kinds of things:

- Copy data from a `src` region to a `dest` region, and the two regions might or might not overlap.
- Set a `dest` region to a specific value, without reading from a source region.

Depending on the details, there's various optimizations that you can apply to these tasks. This means that a full suite of memory functions will usually have a few similar functions, each for a slightly different task.
In Rust the memory functions are assumed to exist as part of the basic code provided by the platform. These memory functions are called very often, and so they should be finely tuned to be as fast as possible on each platform. Rust assumes that the platform developers know what they're doing, and that they will provide fast versions. If your platform doesn't do this, you can still get basic versions of the memory functions from the compiler-builtins crate. Normally this is all sorted out for you automatically.
When we're programming Rust for the Gameboy Advance (GBA) we use the `thumbv4t-none-eabi` target. This is a "bare metal" target, with no provided code libraries. This means that a Rust program on the GBA gets the basic memory functions from `compiler-builtins`.

Except we don't want the basic versions. LLVM is reasonable at codegen, but sometimes it misses stuff. Also, we want to have the linker put our memory functions into IWRAM rather than ROM. Luckily, the `compiler-builtins` functions are only "weak linked" into the program. This means that we can override their version of a function just by providing our own version.

All we have to do to fix our situation is "just" define our own memory functions.
To follow this article you need to be able to read the basics of ARM assembly. Normally one might suggest the official docs for this, but the ARM developer website changes often enough that old links seem to break all the time. Instead, the guide I'd suggest is from Azeria Labs, which covers the fundamental ideas nicely in a few short pages.
We'll be using the modern ARM assembly instruction syntax, where conditional instructions always have the condition on the end (eg: `ldrheq`). This is the default for LLVM (and so for Rust too). If you read the original ARMv4T manuals from 1994 or whenever, they show the old-style syntax, where the conditional for some instructions goes "inside" the rest of the instruction (eg: `ldreqh`). That is the default parsing mode for the GNU assembler. If you want to use the modern syntax with the GNU assembler, put the `.syntax unified` directive at the top of the file.

Function Signature Goals
For this issue we want to approach things in little pieces at a time, and build the pieces up into a final "complete" thing. To do that, we need to start by checking what we're even trying to build.
First up is the memory functions that are normally provided by a platform's `libc`. I know we're programming Rust here, but since they're defined as C functions we'll look at the signature in the C style.

For `memset`, the `u8` value to write is passed in as `ch`. The `ch` value is passed as an `int` because `memset` actually predates the ability to specify argument types in a function signature. Just go with it.

The `count` argument is always the count in bytes. Oh, and the return value is always the original `dest` pointer that was passed in. That lets you potentially do C's version of call chains, where the result of one memory function call is passed directly as an argument to another call.

On a lot of platforms that might be all you'd need to think about. However, with targets that follow the ARM EABI (that includes us), there's some additional memory functions too. LLVM will use the ARM EABI functions when appropriate, so we want these to be as fast as possible as well.
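Since the article's original C snippet isn't reproduced here, a sketch of the standard `libc` shapes written as Rust `extern` declarations (the C prototypes themselves are the standard ones):

```rust
use core::ffi::c_void;

extern "C" {
    // void *memcpy(void *dest, const void *src, size_t count);
    fn memcpy(dest: *mut c_void, src: *const c_void, count: usize) -> *mut c_void;
    // void *memmove(void *dest, const void *src, size_t count);
    fn memmove(dest: *mut c_void, src: *const c_void, count: usize) -> *mut c_void;
    // void *memset(void *dest, int ch, size_t count);
    fn memset(dest: *mut c_void, ch: i32, count: usize) -> *mut c_void;
}

fn main() {
    let mut buf = [0u8; 8];
    // memset hands back the original dest pointer, which is what makes
    // the C-style call chaining possible.
    let p = unsafe { memset(buf.as_mut_ptr().cast(), 0xAB, buf.len()) };
    assert_eq!(p, buf.as_mut_ptr().cast());
    assert_eq!(buf, [0xAB; 8]);
}
```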
These largely work like the `libc` versions except that:

- The `4` and `8` suffixed versions require their pointers to already be aligned to 4 or 8. The byte count doesn't have to be a multiple of that, though, so you can call `__aeabi_memcpy4` and tell it to copy 5 bytes, if you really want.
- `__aeabi_memclr` ("memory clear") is a new group of functions that work like "memory set", but the byte value to set is just implied to be 0.
- The `__aeabi_memset` functions have the argument order swapped compared to `memset`. Now the byte to set is the third argument. This allows `__aeabi_memclr` implementations to share most of their code with `__aeabi_memset` implementations.

Copy Ordering
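To make the argument-order point concrete, here's a hedged Rust sketch. The symbol names are the real AEABI ones, but the byte-at-a-time bodies are purely illustrative, not the hand-tuned assembly this article is building toward:

```rust
// Note the swapped order versus libc memset: the byte to set comes LAST.
unsafe extern "C" fn __aeabi_memset(dest: *mut u8, n: usize, c: i32) {
    for i in 0..n {
        *dest.add(i) = c as u8;
    }
}

// "memory clear" is "memory set" with the byte implied to be 0. Because
// the byte is the third argument, a real asm version can just load 0 and
// fall through into the memset code.
unsafe extern "C" fn __aeabi_memclr(dest: *mut u8, n: usize) {
    __aeabi_memset(dest, n, 0);
}

fn main() {
    let mut buf = [1u8; 4];
    unsafe { __aeabi_memclr(buf.as_mut_ptr(), buf.len()) };
    assert_eq!(buf, [0; 4]);
}
```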
When we're accessing a bunch of memory, accessing the memory in order is just faster. This is very true on a modern system, but even the old GBA cares. Accesses are categorized into "sequential" ("S") and "non-sequential" ("N"). The exact time taken varies by game pak, but the usual timings are that sequential accesses take 2 cycles and non-sequential accesses take 4 cycles.
Obviously we should always write our copy loops to only use sequential accesses, right? Just in case we're accessing the ROM as part of a copy? Well, it's not so simple (it never is).
Imagine we have some memory like this. (We'll use a little-endian layout for our diagram, since the GBA is a little-endian device.)
Let's imagine we want to copy a 4 byte span, with the destination starting at 2 and the source starting at 0.
When our two regions to copy between might overlap (like with `memmove`), we need to be sure that the overlapping portion gets handled first so that the source data is not destroyed by our own writes to the destination. We do that by copying in the correct direction. We'll call the sequential version "forward", and the non-sequential version "reverse". For each copying loop we design we'll need to write both the forward and reverse version. Then we pick between the two versions based on how the pointers compare: if the destination starts below the source we copy forward, otherwise we copy in reverse.

And if we know the regions don't overlap (such as copying ROM data to RAM), then we can just always use the forward loop like we wanted.
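The direction rule can be sketched with a tiny index-based model (assumed helper, for illustration only):

```rust
// With dest below src, copying from the 0th byte up is overlap-safe.
// With dest above src, we must copy from the last byte down instead.
fn copy_within(buf: &mut [u8], src: usize, dest: usize, count: usize) {
    if dest < src {
        for i in 0..count {
            buf[dest + i] = buf[src + i]; // forward: 0th byte first
        }
    } else {
        for i in (0..count).rev() {
            buf[dest + i] = buf[src + i]; // reverse: last byte first
        }
    }
}

fn main() {
    // Overlapping copy of bytes [0..4] up to [2..6]: reverse keeps it safe.
    let mut buf = *b"abcdef";
    copy_within(&mut buf, 0, 2, 4);
    assert_eq!(&buf, b"ababcd");
}
```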
Compiler Explorer
The ever handy Compiler Explorer website lets us write some Rust and see the assembly output. Unfortunately for us, Compiler Explorer isn't set up to build for an ARMv4T target. Right now Compiler Explorer only supports Rust's Tier 1 and Tier 2 targets. Instead, we have to have CE show us assembly for `armv5te-unknown-linux-gnueabi`, which is the oldest ARM target that's Tier 2. This means that sometimes we're going to see assembly output that isn't actually legal for us to use on the GBA (eg: a `blx` instruction, popping an `lr` value on the stack directly into `pc` instead of using `bx lr`, etc). But mostly it's close enough that we can use it as an approximate starting point.

There's also always the "Code Size vs Code Speed" trade off to consider. We'll want to look at the output of a few different optimization levels and pay attention to the differences.

Copy By `u8`
The first loops we'll look at will work exclusively in 8-bit units.
Yes, I know friends, I know. But it's not as silly as it might seem.
There's two main reasons to write a `u8`-only copy loop:

- It's the simplest possible loop, which makes it a good place to get the basics down.
- The GBA's SRAM can only be correctly accessed one byte at a time, using the byte access instructions (`ldrb`, `ldrsb`, `strb`).

So here's what our Rust code looks like:
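(The article's exact source lives in its repo; this is a hedged reconstruction of the same shape of function that produces the assembly discussed below.)

```rust
#[inline(never)]
pub unsafe extern "C" fn copy_u8_forward(
    mut dest: *mut u8, mut src: *const u8, mut count: usize,
) {
    // Copy one byte at a time, from the 0th byte upward.
    while count > 0 {
        *dest = *src;
        dest = dest.add(1);
        src = src.add(1);
        count -= 1;
    }
}

#[inline(never)]
pub unsafe extern "C" fn copy_u8_reverse(
    dest: *mut u8, src: *const u8, mut count: usize,
) {
    // Copy one byte at a time, from the last byte downward.
    while count > 0 {
        count -= 1;
        *dest.add(count) = *src.add(count);
    }
}

fn main() {
    let src = *b"hello";
    let mut dst = [0u8; 5];
    unsafe { copy_u8_forward(dst.as_mut_ptr(), src.as_ptr(), 5) };
    assert_eq!(&dst, b"hello");
}
```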
And here's what it looks like in the different optimization modes:

- `opt-level=3`: https://rust.godbolt.org/z/5x7vzcEfM
- `opt-level=s`: same as opt=3
- `opt-level=z`: https://rust.godbolt.org/z/1rKhGM8v7

In this case, opt=3 (fast) and opt=s (small) outputs are the same, and opt=z (aggressively small) cuts out an early return when the copy length is 0.

Let's take a close look, starting with `copy_u8_forward`. We're going to go really slow for the first function or two, and then speed up a bit once we've got the basics down.

Forward `u8`
To start, let's use the `z` output as our base. We're going to assume that people won't call the function with a zero sized copy, so we probably don't need the extra early return check to try and minimize work. To be clear: people could call with a copy size of 0, and our function must be correct if that happens, but we'll assume that the copy size is larger than 0 even if it makes the 0-size copy case be slightly less fast. Generally, any memory function like this should assume that the amount of memory to manipulate is larger, rather than smaller, because when the amount of memory to manipulate is smaller the compiler will just do a few reads and writes on its own without calling a memory function.

Okay, here we go:
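(A hedged sketch of roughly what that opt=z loop looks like, reconstructed from the discussion that follows rather than pasted verbatim from Compiler Explorer:)

```arm
example::copy_u8_forward:
.LBB0_1:
  cmp  r2, #0          @ is the count 0 yet?
  bxeq lr              @ if so, return
  ldrb r3, [r1], #1    @ load a byte, post-indexed: src += 1
  strb r3, [r0], #1    @ store a byte, post-indexed: dest += 1
  sub  r2, r2, #1      @ count -= 1 (flags NOT set)
  b    .LBB0_1
```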
To start, we've removed the `example::` prefix from the label name. Since the label part of a line ends with `:` there can't be a `::` in the label name. Technically the real label that `rustc` used for the function was `_ZN7example15copy_u8_forward17hc841ec0dfd74b8c3E`, and then Compiler Explorer just de-mangled that for us when showing the output. Since we don't want to bother with mangling, we'll keep our labels plain and simple.

Also we can eliminate the local label; we can just jump to the top of the function using its non-local label.
Okay, so, can we improve this code? Well, quite honestly, we probably can. LLVM isn't that good at taking full advantage of the Conditional Execution that all our ARM instructions allow.
The first suspicious part is that there's a `cmp` and a `sub`, both for the same value in the same loop. For later ARM devices this is basically fine since the result of `ldr` isn't available next cycle. Something needs to go between the `ldr` and the `str`; if you don't have anything, the CPU just stalls a cycle anyway, so putting the `sub` there is fine. For this early ARM chip in the GBA we don't need the gap, so maybe we can move around the `sub` and shorten the loop.

- `cmp a, b` sets the CPU status flags based on the result of `a-b`, without storing that output to any register.
- `sub d, a, b` does `d = a-b` without setting the flags.
- There's also `subs d, a, b`, which is similar but with setting the flags.

This means if we see `cmp` and `sub` in the same loop, we might be able to use `subs` instead. It's fine to move the `sub` earlier in the loop. The `bxeq` and `ldrb` don't interact with `r2` at all, so if `r2` gets subtracted earlier or later they don't care. Let's check what that looks like.

So now we can replace `cmp` and `sub` with a `subs` instruction if we change the condition on the `bxeq`. Currently we break out of the loop if `cmp` finds out that the count is 0 (if `count - 0 == 0`). However, if we use `subs` then we're setting the flags based on the value after we subtract 1, not before we subtract 1. If we had 1, and subtract 1, and get 0, then the counter was 1 when we entered the loop, so we want to do the loop one final time to copy the final byte. What we need to do is stop the loop when we have 0, subtract 1, and underflow the value to `u32::MAX`.

For `subs` the change in flags works as follows:

- N is set if bit 31 of the result is set (the result is "negative").
- Z is set if the result is zero.
- C is set if the subtraction did not need an unsigned borrow.
- V is set if the subtraction had a signed overflow.

So as an example, `3 - 1` gives `2` with flags N=0, Z=0, C=1, V=0. And when we do `0 - 1`, underflowing to `u32::MAX`, the flags are N=1, Z=0, C=0, V=0. That means the condition for our `bx` needs to be `cc` ("carry clear").

Neat.

This is actually where I stopped when I first started writing this article. I said to myself, "that's as tight a loop as we can make it", and I moved on.

Then felixjones on GitHub looked at the code and said "hey get that `bx` out of your loop".

See, we can't forget that every instruction can be conditional. We can move the `bx` out of the loop to the end, change the `b` to a `bgt` so that we only restart the loop when we have a non-zero amount of count remaining at the end of each loop, and finally we can put a `cs` on the `ldrb` and `strb` instructions so that if the function is called with a count of 0 we will safely pass over those instructions.

And, yeah. This is nicer. With the `bx lr` outside the loop. That's why you show your code to other folks and have them give it a check.

Reverse `u8`

LLVM wants our opt=z reverse loop to look like this:

That's not so great. It's the same problem as above: LLVM is placing the subtraction at the wrong time, so we've got all sorts of extra work going on.

We can just change it to `subs` and have our conditions like before. The difference is that we're going to use "offset addressing" instead of "post-indexed" addressing. Instead of the base pointers moving (after the access) with each loop, they stay the same and the count is what moves.

Now the copy starts with the final byte, then one byte back, then one byte back, until it copies the 0th byte. Super easy, no trouble at all.
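Putting those changes together, hedged sketches of the finished byte loops (reconstructed from the description above, not copied from the article's repo):

```arm
copy_u8_forward:
  subs   r2, r2, #1        @ count -= 1, setting the flags
  ldrbcs r3, [r1], #1      @ carry set: we had at least 1 byte, so copy it
  strbcs r3, [r0], #1
  bgt    copy_u8_forward   @ still more than 0 left? go again
  bx     lr

copy_u8_reverse:
  subs   r2, r2, #1        @ count -= 1, setting the flags
  ldrbcs r3, [r1, r2]      @ offset addressing: pointers stay, count moves
  strbcs r3, [r0, r2]
  bgt    copy_u8_reverse
  bx     lr
```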
Copy By (Mostly) `u16`

In practice, other than with SRAM, we'd never want to copy by only `u8`.

The next size up is `u16` ("halfword"). We might want to use this if we're working with stuff going into VRAM. Modes 3, 4, and 5, along with the color palettes, are all 2-byte aligned data. VRAM has the fun quirk where `u8` reads are fine, but `u8` writes affect both bytes within the halfword. We need to be careful to avoid `u8` writes when writing to VRAM.

So we'll have some functions for "mostly `u16`". It's "mostly" because they'll use halfword copies as much as possible. If you do request a copy length that's an odd number then the function will be forced to do one byte copy, but otherwise you can be sure you'll get halfword copies.

- `opt-level=3`: https://rust.godbolt.org/z/hen7n88xE
- `opt-level=s`: same as opt=3
- `opt-level=z`: https://rust.godbolt.org/z/n4YzYjT3b

Just like with the `u8` loops, opt=3 and opt=s give us the same output, and only opt=z goes for the extremely compact form.

Forward `u16`
So LLVM wants us to do one of these:
This isn't ideal. As with the `u8` code, LLVM is trying to optimize for the small case to be faster. What we want is for our common case to be faster, and for us it's the large case that's common.

If we take our `u8` forward loop and just change a few bits to make the main loop go by halfwords it looks like this.

So we subtract 2. If we didn't go below 0 then we had 2 or more before the subtract. So we copy a halfword and advance our pointers. Then if the count is still greater than 0 we do it again.
Except we can't have an unconditional return any more. After the subtraction the result could be:

- Positive: there's still more to copy, so the `bgt` catches it and the loop goes again.
- Zero: the count was an exact number of halfwords, so we can return (the `bxeq`).
- Negative one: there's a single spare byte that still needs copying.
- Negative two: the count was already 0 when we entered the loop, so there's nothing left at all.

We can decide between the -1 and -2 cases by adding 1 back to the count. If we land on 0 then that means we should copy the extra byte.
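Assembled from those pieces, a hedged sketch of the forward halfword loop (my reconstruction, not the article's exact listing):

```arm
copy_u16_forward:
  subs   r2, r2, #2        @ count -= 2, setting flags
  ldrhcs r3, [r1], #2      @ carry set: we had 2 or more, copy a halfword
  strhcs r3, [r0], #2
  bgt    copy_u16_forward  @ more than 0 left: loop again
  bxeq   lr                @ exactly 0 left: done
  adds   r2, r2, #1        @ -1 becomes 0: one spare byte; -2 becomes -1: none
  ldrbeq r3, [r1]          @ copy the final odd byte, if any
  strbeq r3, [r0]
  bx     lr
```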
Don't forget about the -2 case and think that we can just remove the `adds` and unconditionally copy the last byte once we're after the `bxeq lr`. That would be a bad time for sure.

Reverse `u16`
With the reverse loop we have to handle a possible odd byte count before we do our halfwords.
LLVM wants us to do it like this:
The opt=3 version has an 8 instruction halfword loop with `bx` on the outside of the loop (cool). The opt=z version has an 8 instruction halfword loop with `bxls` on the inside of the loop (not cool). So the opt=z version saved no space compared to the opt=3 version (8 instructions either way), and yet it's slower. I dunno what LLVM is doing, but that doesn't seem like a good time.

There's definitely situations where an optimizer can do some weird math transformation or something and give you better code than you'd know to do on your own, but this is an example of when the opposite happens. For whatever reason, the optimizer fell over on the job with this one.
One thing that both LLVM versions agree on is starting the function with a `tst` instruction. This is like `cmp` in the sense that it does an operation just to set the flags, without storing the output. The difference is which operation is used to set the flags:

- `cmp a, b` sets the flags based on `a - b`
- `tst a, b` sets the flags based on `a & b`

So if we do `tst r2, #1` we'll be able to find out if our count is starting as an odd value. That makes sense, and we can agree with LLVM that it's the correct opening move for this function.

After the `tst`, LLVM jumps ahead in the function if we're on the even-count case, and then the odd-count case falls through to a byte copy. That's no good. Our ideal path should be the path with the least branches, so we want to re-arrange the code some. We'll put the byte copy at the end of the function as a little sub-routine thing. Then the odd-count case can jump there (and jump back to the halfword loop after the byte copy), or the even-count case can just not jump at all. Either way, we get to the halfword loop, in the same basic structure as we had for the `u8` reverse loop. We're just going by 2 instead of by 1.

Looks like LLVM's opt=z version was 15 instructions, and we've narrowed it down to 11 instructions. Not bad at all.
Copy By (Mostly) `u32`
Here's where things can get really interesting.
The biggest data copies that we'd have to do on the GBA are probably things to do with copying around tile data. Tiles are always 8x8, and they can be 4-bits-per-pixel (4bpp) or 8-bits-per-pixel (8bpp). So that's either 32 bytes or 64 bytes, just for a single tile. And most things to do with tiles involve more than one tile. We still might copy small stuff, say 12 or 16 bytes, but when the byte count to copy starts to get high it'll quite possibly get extremely high. If there's a "large" creature that needs 4 by 4 tiles to draw, even at just 4bpp that's gonna be a 512 byte copy. So we'll finally have to be very careful that we support both extremely large copies (hundreds of bytes) and also moderate copies (32 bytes or less). We can't really let either code path take a speed hit.
Let's try a basic Rust version modeled after the `u8` and `u16` loops we've seen:

- `opt-level=3`: https://rust.godbolt.org/z/sPPGvhnPT
- `opt-level=s`: https://rust.godbolt.org/z/7rr73T13W (finally a difference from opt=3!)
- `opt-level=z`: https://rust.godbolt.org/z/MfsoexYnE

Okay, one thing is that ARMv4T has `ldm` (load-multiple) and `stm` (store-multiple). These let you load and store several words at once, which makes bulk copying more efficient. It's how `push` and `pop` work for the stack, but with any pointer and block of memory. I'm not seeing any of those used in any of the optimization levels we're looking at.

Let's see if we can convince LLVM to use `ldm` and `stm`. We'll have it copy an entire `[u32; 8]` per loop (the size of one 4bpp tile), that might do the trick:

So while there's potentially more than 8 words left to go, we'll block copy 8 words.
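(A hedged sketch of that attempt, assumed rather than copied from the article; `count` is in bytes, and the spare-bytes tail is omitted for brevity:)

```rust
pub unsafe extern "C" fn copy_u32_forward(
    mut dest: *mut u32, mut src: *const u32, mut count: usize,
) {
    // Move whole [u32; 8] chunks (one 4bpp tile) while we can...
    while count >= 32 {
        *dest.cast::<[u32; 8]>() = *src.cast::<[u32; 8]>();
        dest = dest.add(8);
        src = src.add(8);
        count -= 32;
    }
    // ...then single words. (1-3 spare bytes not handled in this sketch.)
    while count >= 4 {
        *dest = *src;
        dest = dest.add(1);
        src = src.add(1);
        count -= 4;
    }
}

fn main() {
    let src: Vec<u32> = (0..12).collect();
    let mut dst = vec![0u32; 12];
    unsafe { copy_u32_forward(dst.as_mut_ptr(), src.as_ptr(), 12 * 4) };
    assert_eq!(dst, src);
}
```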
And for this version, LLVM's opt=3 gives us…
CURSE YOU BLASTED DRAGON MACHINE.
We're trying to implement `memmove`, you can't call `memmove` while we're implementing it!!!

Alright, fine, this handwritten assembly will just look even less like the LLVM assembly than normal.
Forward `u32`

Slow But Simple `u32` Forward

To begin, let's again take our `u8` loop and just update it to go by `u32`.

We load (`ldr`) and store (`str`) one word at a time when the carry flag is set (`cs`), meaning that we did not have an unsigned underflow, meaning that we started the loop with 4 or more bytes in the count.
bgt
)bxeq
)In the
u16
"else" case we had either 0 or 1 spare bytes we also wanted to copy. We resolved it withadds
, a single conditional copy, and then an unconditional return.In the
u32
"else" case we've got 0, 1, 2, or 3 spare bytes to copy. That's enough possibilities that it calls for a loop.At this point in the code the current count is 4 less than we're actually supposed to copy. If we need 1 more byte copied, the count will be -3. If there's 2 more to copy then -2, 3 more is -1. Also, if we entered the loop with 0, we'll actually be at -4. This means that we can't just loop by adding 1 to our counter and continuing until we get to 0. That would give us like the inverse of the correct number of additional bytes.
The simplest solution in this scenario is to just add 4 back into the counter (to "undo" the most recent subtract) and do a normal `u8` copy loop. We could either call over to `copy_u8_forward`, or we could inline that loop into our `u32` function.

Any type in Rust that's got an alignment of 4 will also have a size that's a multiple of 4. This means that handling the 1-3 spare bytes case is fairly rare. If we branch over to `copy_u8_forward` we pay 3 CPU cycles when we hit this case (a `b` takes 1N+2S, which is 3 cycles total for IWRAM code).

I'd rather pay the 3 CPU cycles on rare occasions and save the IWRAM space. So our fully working `u32` copy looks like this:

To be clear: this is not as fast as possible for large copies, but it does work in every case.
If we want to go faster than this, we'll need those `ldm` and `stm` instructions.

`ldm` and `stm` loop for `u32` Forward

If we use `ldm` we can load more than one register up with a series of words starting from a base address. Technically `ldm` has several possible addressing modes, but if you don't specify one it means `ia` ("increment after") as the default. It takes a register list, and for each register that you load into, the pointer will advance one word during the loading, so that it'll load a series of successive words. If you put `!` after the base register then the final address value is written back into that register, so that you can keep advancing the address through memory on the next pass of a loop or however else.

The `stm` instruction lets us store multiple registers out in the same way.

The combination of `ldm` and `stm` we'll call a "block" copy.

Right now we're using our registers as follows:
Per the C ABI for ARM, we can freely use `r0`-`r3`, `r12`, and `lr` (aka `r14`) for whatever we want during a function call. If we want to alter any other register we've got to restore the value before our function returns. We do this by pushing the values to the stack, then popping them back off the stack later. The pushing and popping itself takes time.

So there's two basic strategies:

- Use just `r3` and `r12` for our block copy, doing two words per loop while avoiding the need for the stack.
- Use `r3`, `r12`, and push 6 other registers, then we can do eight words (one 4bpp tile) per loop. But then we have to use the stack.

We could consider the cycle timings as well:
The `x` value on `ldm`/`stm` is the number of registers you're loading or storing. Note that some situations can make these instructions cost more than listed, such as if you ever use `pc` as a destination, but for our purposes these numbers are fine. Also, any conditional op takes just 1S if the condition doesn't hold.
So if there's 32 bytes:
Fancy as it is to not use the stack, even just a 32 byte copy is enough to justify using the stack.
If we're copying less than 32 bytes… well that's actually tricky. We get better performance per byte from using the 2-word loop. However, it makes the tail conditions really complicated. In addition to spare bytes after all the word copying, we'd also have to handle a possible spare word after the 2-word copying. And when we merge the +8 and -4 to be just a +4 then we have to change our conditional copy to use `pl` for "plus" (which includes zero too). That all seems extra fiddly, I'm not sure we want to deal with that.

Awh… nuts…
Then if all my math is right, the time totals look like this
How many words does LLVM copy on its own before calling a function? If we go back to that code sample where we tried to get LLVM to use `ldr`/`str` by having it copy arrays of `u32` we can get a hint. For `u32` arrays of 16 bytes or less LLVM copies on its own, and then when the array is 20 or more bytes it uses a call to `memmove`. That is not meant to be a statement of how LLVM always generates the code, it's just what it does with `u32` arrays in that situation. If we try using `u8` arrays in that function, it directly copies an 8 byte array, and then goes to `memmove` at 12 bytes.

I guess it depends on the exact types involved, but I think that we can be fairly sure that our `u32` copy function won't be called for single word copies. LLVM is at least that smart. The more complex `two_then_one_loop` already performs better than the `one_word_loop` with just 2 words, and the difference increases as the number of words goes up. Using `ldm`/`stm` to reduce loop iterations is very effective.

And so, with a heavy sigh that the code has gotten complex like this, I think the way we want to write it all looks like:
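The core of the big-copy path, as a hedged sketch (register choices assumed; the tail handling that follows the block loop is omitted here):

```arm
  push   {r4-r9}            @ free up six more scratch registers
.L_block_copy_loop:
  subs   r2, r2, #32        @ count -= 32, setting flags
  ldmcs  r1!, {r3-r9, r12}  @ carry set: we had 32+ bytes, load 8 words
  stmcs  r0!, {r3-r9, r12}  @ store 8 words, advancing both pointers
  bgt    .L_block_copy_loop
  pop    {r4-r9}            @ restore the pushed registers before the tail
```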
The Carry/Sign Test
There's at least one more trick we can apply.
Right now it's possible to have 0 to 3 spare bytes when we enter our word copy loop, which eventually falls down to calling `copy_u8_forward`. The time to load and store isn't any faster for bytes than for words, so that 1-byte loop takes 9 cycles per full pass, the same as our 1-word version does. So if we call it for 3 spare bytes we're gonna spend a lot of time on a very small amount of work.

What we'd rather have happen in the 3 byte case is just one halfword copy and then one byte copy, and save some cycles. If only there was a way to quickly figure out when to do a halfword and/or byte copy.
Turns out there is, lucky us.
We can use what's called the "carry/sign test" to determine the state of bits 0 and 1 of our count in a single instruction. We just "logical shift left" (aka `<<`) the count by 31 bits.

That's the whole thing!

For a shift, the flags are set as follows:

- N is set if bit 31 of the result is set.
- Z is set if the result is zero.
- C is set to the last bit that got shifted out.
- V is unaffected.

So when we use `lsls #31`, the carry flag ends up holding what was bit 1 of the count, and the sign flag ends up holding what was bit 0.

So we shift the count left by 31 bits, then do a conditional halfword copy (using `cs` for "carry set"), followed by a conditional byte copy (using `mi` for "minus").

No need for a loop.
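A little Rust model of what the `lsls` carry/sign test computes (assumed helper, for illustration): after shifting left by `shift`, the carry flag holds bit `32 - shift` and the sign flag holds bit `31 - shift`.

```rust
fn carry_sign_test(value: u32, shift: u32) -> (bool, bool) {
    let carry = (value >> (32 - shift)) & 1 != 0; // the last bit shifted out
    let sign = (value >> (31 - shift)) & 1 != 0;  // the bit that lands in bit 31
    (carry, sign)
}

fn main() {
    // Shift by 31: carry = bit 1 (halfword needed), sign = bit 0 (byte needed).
    assert_eq!(carry_sign_test(0b00, 31), (false, false));
    assert_eq!(carry_sign_test(0b01, 31), (false, true));
    assert_eq!(carry_sign_test(0b10, 31), (true, false));
    assert_eq!(carry_sign_test(0b11, 31), (true, true));
}
```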
Now we can just replace the call to `copy_u8_forward` with this new assembly, and then an unconditional `bx lr`, and we'll be in good shape.

Ah, wait, wait! We're adding 4 back to our counter value to undo our most recent subtraction so that we have "the count value we had when we entered the loop". Then we do the carry/sign test on that value. Except our numbers are two's complement integers… so… if we're just adding and subtracting any multiple of 4, it's not going to affect the lowest two bits anyway, which are the bits we're looking at with our carry/sign test.
00000000000000000000000000000000
00000000000000000000000000000001
00000000000000000000000000000010
00000000000000000000000000000011
11111111111111111111111111111100
11111111111111111111111111111101
11111111111111111111111111111110
11111111111111111111111111111111
With or without that 4 added back in, the lowest two bits are the same. Which means we can skip that `add` entirely.

No need to even change any conditionals or anything. I love it.
So our final version of the function looks like this.
This is getting a little complicated.
Reverse `u32`

Similar to before, once we know how we want our forward loop to go we can roughly do the same steps "in reverse" to form our reverse loop. The biggest difference is that `ldm`/`stm` can't use offset addressing, so if we hit that case we'll have to adjust the pointers manually as part of that setup. Other than that, this should be fairly simple.

We start with a carry/sign test. If the flags show that we need to copy a byte or a halfword then we'll do that. Except this is the reverse loop, so "do that" means subtracting the correct value from the byte count and then using the byte count as an offset for the addressing.

Oh golly that's a lot of conditional instructions. Like we had with the block copy in our forward loop, we're gonna push this out of the main path of the code. And since we'll also do that with the block copy, I think we're going to end up with more than just 3 labels in this function. Let's finally use some "local labels". They're just like normal labels, except they're not exported from the file that they appear in. The prefix for a local label varies by platform; for us it's `.L`, and they're otherwise like all the other labels we've used.

A label name can be fairly long. I'm gonna use some long label names to comment the intent a bit.
Amazing. Looks better than I thought it would.
So we know our pointers are aligned to 4, because that's a requirement to call the function. And now at the "TODO" point in the code we know the count is aligned to 4 also. And `ldm`/`stm` don't accept offset addressing. So we just add the count to both our pointers.

Now we're set to use `ldm`/`stm`, so I guess we put that in. As with the forward loop, I think we want to put this to the side of the main code path and try to give the 1-7 word copy portion the main code path. The change here to make it "reverse" is that instead of using `ldm`, which is short for `ldmia` ("load-multiple, increment after"), we'll use `ldmdb` ("load-multiple, decrement before"). This matches our earlier reverse loops where we subtract before the copy and then use that subtracted value as the offset.

Now we want to fill in the `.L_done_with_block_copy`. At this point we know that we're going to copy 0-7 words of data.

Actually, now that I'm thinking in words instead of in bytes, another idea comes to mind. The `r2` register holds the number of bytes to copy. Our goal, you could say, is to clear all of the bits. Once all the bits in our counter are 0, that means there's no bytes left to copy and so we can return. We know bits 0 and 1 are clear. We know bits 5 and above are clear. We're just not sure about bits 2, 3, and 4.

00000000000000000000000000000000
00000000000000000000000000000100
00000000000000000000000000001000
00000000000000000000000000001100
00000000000000000000000000010000
00000000000000000000000000010100
00000000000000000000000000011000
00000000000000000000000000011100
So with only 3 bits to test and possibly copy for, it's kinda practical to just test each bit and copy that much if we get a hit.
Yeah… that doesn't seem so bad.
Know what else is cool? The carry/sign test thing, you can do it on any two bits in a register. As long as they're next to each other, of course. We just adjust the logical left shift amount. You subtract the higher numbered bit from 32, and shift by that much. Before we wanted to test bits 0 and 1, so we shifted by 31 (`32-1==31`). Now we want bits 2 and 3, so we shift by 29 (`32-3==29`).

And now that I've written it like this… it becomes obvious that probably the forward loop should be using the same technique.
Let's not go back. For now, just pretend that we went back and updated the forward loop. There's still more to talk about, and when we see the "full and complete" version of stuff that'll also include seeing the forward loop updated to use bit testing.
Okay, so here's the full u32 reverse copy:
And, hey, if we didn't need to have the extra +4 fixup step to undo our subtract when we go past 4 with the single word copy, then I guess we don't need to have a +32 fixup step at the end of the block copy either. They're both a power of 2 that's bigger than the bits we care about, so it's all the same as long as we're only doing bit tests. I don't want to paste that whole thing a second time, so let's pretend I did and just call that another small thing to fix in the next big version somewhere farther down the page.
A Working `__aeabi_memcpy`

We've seen a lot of variations on copy loops. It's time to make the thing we came here to make, a complete `__aeabi_memcpy` function.

This will be like the "forward" loops we did before, but we'll have more prologue code.
We don't know if our two pointers are aligned to 1, 2, or 4. The best copy loop we can use depends on the minimum of the alignment of both pointers. If our pointers start less aligned than we want, sometimes we can do a small copy to become aligned, and then use a better aligned loop.
We'll need various checks on our input to figure out what the situation is, then we can jump to the loop that fits our needs best.
Since we've got 2 values, and each value has 2 possible states for a given bit, let's make one of those truth table things.
Ah, wonderful. The words in the chart aren't the classic "0" or "1", but the overall pattern matches the Exclusive-Or operation. In Rust we write it with `^`. In ARM assembly it's written with `eor`.

Depending on the Exclusive Or between the 0-bit values of the addresses we either will or won't be able to align the pointers to 2. The same idea holds for the 1-bit of the address values and trying to align to 4. It applies up to any alignment we need. When we need both pointers to be aligned to X (where X is a power of 2 like normal), then if all the bits below X are clear when you Exclusive Or the two pointers you'll be able to do it. Either both pointers are already aligned, or both pointers can be advanced the same amount to make them aligned.
Of course, that assumes that we have enough bytes to copy that there's still work to do after doing some initial small copies to advance the pointers. As usual, if the call comes in to do a 0 byte copy or a 1 byte copy or something else small like that we need to maintain correctness.
Like with most all our other copy loops, `r0`, `r1`, and `r2` are all in use. We're using `r3` as our temp storage for the copy. Let's put our alignment check in `r12`:

- If bit 0 of `r12` is set we know we're capped to being align 1 at best.
- If bit 1 of `r12` is set we're capped at being align 2 at best.

Since that's another "two bits next to each other" situation we can again use a carry/sign test. You learn just one neat trick (virtual machines hate it!) and spend the entire rest of the article stuck using it over and over.
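The alignment-capping idea as a small Rust sketch (assumed helper name, for illustration):

```rust
// eor the two addresses together; the low bits of the result cap the
// mutual alignment the two pointers can ever reach.
fn mutual_alignment_cap(dest: usize, src: usize) -> usize {
    let check = dest ^ src; // this is the value we'd park in r12
    if check & 0b01 != 0 {
        1 // bit 0 set: capped at align 1
    } else if check & 0b10 != 0 {
        2 // bit 1 set: capped at align 2
    } else {
        4 // both clear: we can reach align 4
    }
}

fn main() {
    assert_eq!(mutual_alignment_cap(0x9000, 0x9004), 4);
    assert_eq!(mutual_alignment_cap(0x9001, 0x9003), 2);
    assert_eq!(mutual_alignment_cap(0x9001, 0x9002), 1);
}
```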
Anyway, now we can branch to the align1 path on negative, branch to the align2 path on carry, otherwise we keep going on the align4 path.
Once we're on the path we want, for the align2 and align4 case we might need to do a little work to actually get the alignment we're going for. In the align1 case we've obviously never got any starting work, a pointer can't be aligned to less than 1 after all.
Something like this.
For the fixup preludes we check the bits of either address to determine how much we need to fix up by. Since we did the exclusive-or we know that both addresses will need fixing by the same amount, however much that is. This feels like another situation where we want to assume we don't have to do the work on the main path, and then put any extra work on a side path.
How big can a label be? I don't know honestly. If we need to use smaller labels in the actual code that's fine. For illustrative / educational purposes I'm going to keep the extra verbose labels in the sample code blocks.
Anyway, once a path has been selected, and the fixup applied, then we just need to actually do the copy.
For "memcpy" style functions, it's assumed that the two regions are exclusive, so we can always select the forward copy loop.
`__aeabi_memcpy4` and `__aeabi_memcpy8`

These functions are just like the general `__aeabi_memcpy`, except that they require the pointers to already be aligned to the amount of their number.

That means that we can basically just rename `.L_copy_u32_forward_after_the_fixup_is_complete` to `__aeabi_memcpy4`, since both blocks assume pointers aligned to 4.

For us there's no advantage to a pointer alignment of 8. Later ARM CPUs have "double word" operations, but we don't. That means we can just make `__aeabi_memcpy8` be an alias for the 4 version.

Aw, dag, the indentation for my labels is getting kinda weird… I was using no-indent for labels that are exported from the file, and then 2 spaces for labels that aren't exported, but now in the name of having the "happy path" branch as little as possible our sad path code is jumping all over the page. Guess it can't be helped, code just does this after a while when you keep handling more and more situations.
A Working `__aeabi_memmove`

TODO

A Working `__aeabi_memset`

TODO

A Working `__aeabi_memclr`

TODO

Working `libc` functions

TODO