# memmove
contributed by <`AndybnACT`>
tags: `interview`

## memmove
To implement a `memmove`, we need to consider whether the input parameters alias each other (i.e. the memory blocks overlap). If they overlap and the destination address is higher than the source address, then the copy from `src` to `dst` must proceed from high to low addresses. Otherwise, we move the memory block from low to high.
Then, we can implement the `memmove` function. Note that we copy data byte by byte to deal with objects of any size (e.g. `char` or `int` arrays).
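As a minimal sketch of this baseline (the function name is mine, not necessarily the author's):

```c
#include <stddef.h>

/* Baseline: copy byte by byte, choosing the direction that survives overlap. */
void *my_memmove(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;
    if (d > s) {
        /* dst above src: copy backwards, from high to low address */
        while (n--)
            d[n] = s[n];
    } else {
        /* otherwise: copy forwards, from low to high address */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    return dst;
}
```

Because each byte is read before the overlapping region is overwritten, this handles both overlap directions correctly.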
## memmove with block size of register width
Copying data byte by byte seems inefficient at first glimpse, since Intel allows us to load up to a QWORD (8 bytes) in a single instruction. Thus, if the boundaries are handled properly, we can copy data 8 bytes at a time, reverting to byte-by-byte copying for any remaining bytes at the end.
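A sketch of this idea (my own naming; only the forward path uses 8-byte chunks here, and `memcpy` of a fixed 8 bytes is used so the compiler emits a single unaligned QWORD load/store without aliasing or alignment issues):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: move 8 bytes per iteration, then finish the tail byte by byte. */
void *memmove_qword(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;
    if (d > s && d < s + n) { /* overlap with dst above src: go backwards */
        while (n--)
            d[n] = s[n];
        return dst;
    }
    size_t chunks = n / 8;
    for (size_t i = 0; i < chunks; i++) {
        uint64_t tmp;
        memcpy(&tmp, s + i * 8, 8);  /* compiles to one 8-byte load */
        memcpy(d + i * 8, &tmp, 8);  /* and one 8-byte store */
    }
    for (size_t i = chunks * 8; i < n; i++) /* remaining 0-7 bytes */
        d[i] = s[i];
    return dst;
}
```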
## memmove with sse3 on x86_64
Furthermore, we may utilize the SIMD unit on modern CPUs. With the help of the compiler intrinsics `_mm_storeu_si128()` and `_mm_lddqu_si128()`, we can move 16 bytes at a time. According to the Intel intrinsics guide:

> `_mm_storeu_si128()`: Store 128-bits of integer data from a into memory. mem_addr does not need to be aligned on any particular boundary.

> `_mm_lddqu_si128()`: Load 128-bits of integer data from unaligned memory into dst. This intrinsic may perform better than `_mm_loadu_si128` when the data crosses a cache line boundary.

## memmove with avx2
Surprisingly, we found that using the aligned `avx2` extension, operating on 32-byte data blocks, performs 23% better than the unaligned `sse3` version and therefore beats the `glibc` implementation. However, both the unaligned `avx2` and the aligned `sse3` versions perform 0.3% slower than the unaligned `sse3` one.
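A minimal sketch of the unaligned 16-byte variant (my own naming; I use `_mm_loadu_si128` from SSE2 here so the snippet builds without `-msse3`, whereas the version discussed above uses `_mm_lddqu_si128`; the `avx2` variant is analogous with 32-byte `_mm256_*` operations, and a scalar fallback keeps the sketch portable):

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Sketch of the 16-byte-at-a-time forward copy with a byte-wise tail.
 * The backward (overlapping) direction falls back to byte copying here. */
void *memmove_sse(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;
    if (d > s && d < s + n) { /* overlap with dst above src: copy backwards */
        while (n--)
            d[n] = s[n];
        return dst;
    }
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        /* unaligned 16-byte load/store */
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_storeu_si128((__m128i *)(d + i), v);
    }
#endif
    for (; i < n; i++) /* tail bytes (or the whole copy without SSE2) */
        d[i] = s[i];
    return dst;
}
```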
We compare the implemented functions together with the one from `glibc (2.30)`. To eliminate external disturbance as much as possible, we use the following boot parameters:
Besides, turning off turbo boost and selecting a fixed CPU frequency-scaling governor are necessary.
Last but not least, turning off ASLR lets processes have the same view of memory across executions.
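On Linux, these two knobs can typically be flipped as follows (my illustration of the setup, not necessarily the author's exact commands; the turbo-boost path assumes the `intel_pstate` driver, and both require root):

```shell
# Disable turbo boost (intel_pstate driver):
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# Disable ASLR so every run sees the same address-space layout:
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
```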
To select a proper test function during these tests, we declare an array of function pointers as below.
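The source does not reproduce the declaration here, but such a table might look like this (names hypothetical; a byte-by-byte stand-in is included so the snippet is self-contained):

```c
#include <stddef.h>
#include <string.h>

typedef void *(*memmove_fn)(void *dst, const void *src, size_t n);

/* Byte-by-byte stand-in so the table below compiles on its own. */
static void *memmove_byte(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;
    if (d > s)
        while (n--) d[n] = s[n];
    else
        for (size_t i = 0; i < n; i++) d[i] = s[i];
    return dst;
}

/* The test loop indexes this table to pick an implementation per run. */
static memmove_fn candidates[] = {
    memmove,      /* glibc reference */
    memmove_byte, /* byte-by-byte */
    /* memmove_qword, memmove_sse, ... other variants would go here */
};
```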
The adopted clock source is `CLOCK_PROCESS_CPUTIME_ID`, which claims to have 1 ns resolution on my machine.
Then, we do the following to get the execution time inside our test loop. This code remains correct even if `tv_nsec` overflows.
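The elapsed-time computation presumably looks like this (a sketch with my own naming): the nanosecond difference may come out negative when `tv_nsec` wrapped past one second between the two readings, but folding it into the second difference cancels that out automatically.

```c
#include <time.h>

/* Nanoseconds between two clock_gettime() readings. A negative
 * (end.tv_nsec - start.tv_nsec) is compensated by the extra whole
 * second in (end.tv_sec - start.tv_sec), so no special case is needed. */
static long long elapsed_ns(struct timespec start, struct timespec end)
{
    return (long long)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (end.tv_nsec - start.tv_nsec);
}
```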
We run a total of 2000 iterations for each class of functions, with `memmove` sizes ranging from 0 to 65536 bytes.
We drop the first and last 500 measurements, so the state of the cache is less likely to interfere across runs.
Each measurement then goes into `t_push` to update the statistics.
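The body of `t_push` is not shown in this excerpt; assuming it maintains a running mean and variance, a Welford-style update would be one plausible shape (struct and function names are my reconstruction):

```c
/* Hypothetical reconstruction of t_push: Welford's online algorithm
 * for the running mean and variance of the timing samples. */
struct t_stat {
    long long count;
    double mean;
    double m2; /* sum of squared deviations from the current mean */
};

static void t_push(struct t_stat *st, double sample)
{
    st->count++;
    double delta = sample - st->mean;
    st->mean += delta / st->count;
    st->m2 += delta * (sample - st->mean);
}

static double t_variance(const struct t_stat *st)
{
    return st->count > 1 ? st->m2 / (st->count - 1) : 0.0;
}
```

Welford's update avoids the catastrophic cancellation a naive sum-of-squares accumulator suffers when the samples are large and close together, which is exactly the case for nanosecond timings.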
To draw a tentative conclusion:
First, interestingly, using a byte-by-byte copying routine doesn't necessarily mean it will perform slower than the others, provided the compiler does the right optimizations.
Second, the unaligned `sse3` intrinsics perform even better than the aligned ones.
Third, we can beat `glibc` with the help of `avx2`. Unfortunately, I don't have a machine supporting `avx512`; perhaps we could go one step further with its help.
## move_zeros
How embarrassing is it to be stuck on a question tagged `easy` on LeetCode in an interview? Well, I hope that we won't make the same mistake again… Here is the problem:
> Given an array nums, write a function to move all 0's to the end of it while maintaining the relative order of the non-zero elements.
With a hint from the interviewer, here goes my solution:
This is not correct either; we should make the following modification:
If we don't want to have a nested for-loop, then we could simply do:
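The original snippet is not shown here, but the standard single-pass answer to this problem is the two-pointer compaction below (a sketch; the function name is mine):

```c
/* Single pass, no nesting: compact the non-zero elements to the front,
 * then fill the remaining slots with zeros. Relative order is preserved. */
void move_zeros(int *nums, int size)
{
    int last = 0; /* next slot for a non-zero element */
    for (int i = 0; i < size; i++)
        if (nums[i] != 0)
            nums[last++] = nums[i];
    while (last < size)
        nums[last++] = 0;
}
```

For example, `{0, 1, 0, 3, 12}` becomes `{1, 3, 12, 0, 0}`.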