contributed by < kc71486 >

tags: RISC-V, jserv
Forked from problem C. All code in this block comes from quizcv2.c.
In order to correctly perform matrix multiplication, properly handling floating point addition and multiplication is important.
Below is the implementation of floating point multiplication.
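As a rough illustration, here is a minimal sketch of how such a bit-level `fmul32` can be written, assuming normal inputs and truncation instead of rounding (special values omitted); the actual quizcv2.c code may differ:

```c
#include <stdint.h>
#include <string.h>

/* Sketch only: normal numbers, truncating rounding, no special values. */
static inline float fmul32(float a, float b)
{
    uint32_t ia, ib;
    memcpy(&ia, &a, 4);                               /* reinterpret bits */
    memcpy(&ib, &b, 4);

    uint32_t sign = (ia ^ ib) & 0x80000000u;          /* XOR of sign bits */
    int32_t  exp  = ((ia >> 23) & 0xFF) + ((ib >> 23) & 0xFF) - 127;
    uint32_t ma   = (ia & 0x7FFFFFu) | 0x800000u;     /* implicit leading 1 */
    uint32_t mb   = (ib & 0x7FFFFFu) | 0x800000u;

    uint64_t m = (uint64_t) ma * mb;                  /* 46..48-bit product */
    if (m & (1ULL << 47)) {                           /* renormalize        */
        m >>= 1;
        exp++;
    }
    uint32_t mant = (uint32_t)(m >> 23) & 0x7FFFFFu;  /* drop implicit 1    */

    uint32_t ir = sign | ((uint32_t) exp << 23) | mant;
    float r;
    memcpy(&r, &ir, 4);
    return r;
}
```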
And below is the implementation of floating point addition.
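Again a minimal sketch, assuming normal numbers with the same sign (subtraction, rounding, and special values omitted), illustrative rather than the exact quizcv2.c code:

```c
#include <stdint.h>
#include <string.h>

/* Sketch only: same-sign normal inputs, truncating rounding. */
static inline float fadd32(float a, float b)
{
    uint32_t ia, ib;
    memcpy(&ia, &a, 4);
    memcpy(&ib, &b, 4);

    /* make ia the operand with the larger magnitude */
    if ((ia & 0x7FFFFFFFu) < (ib & 0x7FFFFFFFu)) {
        uint32_t t = ia; ia = ib; ib = t;
    }
    uint32_t sign = ia & 0x80000000u;
    int32_t  ea = (ia >> 23) & 0xFF;
    int32_t  eb = (ib >> 23) & 0xFF;
    uint32_t ma = (ia & 0x7FFFFFu) | 0x800000u;   /* implicit leading 1 */
    uint32_t mb = (ib & 0x7FFFFFu) | 0x800000u;

    int32_t d = ea - eb;                  /* align the smaller mantissa */
    mb = (d < 32) ? (mb >> d) : 0;

    uint32_t m = ma + mb;                 /* 24 + 24 bits -> at most 25 */
    if (m & 0x1000000u) {                 /* carry out: renormalize     */
        m >>= 1;
        ea++;
    }
    uint32_t ir = sign | ((uint32_t) ea << 23) | (m & 0x7FFFFFu);
    float r;
    memcpy(&r, &ir, 4);
    return r;
}
```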
You can find the source code here. Feel free to fork and modify it.
All code with line numbers in this block comes from quizcv4.c.
Let's first look at the exponent increment:
The sole purpose of `inc` in this program is to add 1, which can easily be substituted by a plain addition:
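As a hypothetical reconstruction (the exact `inc` body in quizcv4.c may differ), a carry-propagating `inc` and its trivial replacement look like:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative reconstruction: propagate the +1 carry with XOR/AND
 * instead of the + operator. */
static int32_t inc(int32_t x)
{
    for (int32_t c = 1; c; ) {
        int32_t s = x ^ c;        /* sum without carry */
        c = (x & c) << 1;         /* carry bits        */
        x = s;
    }
    return x;
}

int main(void)
{
    int32_t exp = 127;
    assert(inc(exp) == exp + 1);  /* so inc(exp) becomes exp + 1 */
    return 0;
}
```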
Functions `inc` and `mask_lowest_zero` are now unused, so we can safely delete them.
Now let's look at the mantissa multiplication:
My initial thought was to split the int64_t product into two int32_t halves, which looks like:
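A sketch of that idea (names are illustrative, not the exact quizcv4.c code): the 64-bit product is kept as a `{hi, lo}` pair of 32-bit words built from 16-bit partial products, and the expensive part is detecting the carries when the partial products are summed:

```c
#include <stdint.h>

typedef struct { uint32_t hi, lo; } u64parts;

/* 32x32 -> 64-bit multiply emulated with 32-bit words. */
static u64parts imul32(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFF, ah = a >> 16;
    uint32_t bl = b & 0xFFFF, bh = b >> 16;

    uint32_t ll = al * bl;            /* bits  0..31 */
    uint32_t lh = al * bh;            /* bits 16..47 */
    uint32_t hl = ah * bl;            /* bits 16..47 */
    uint32_t hh = ah * bh;            /* bits 32..63 */

    u64parts r;
    r.lo = ll + (lh << 16);
    uint32_t carry = (r.lo < ll);     /* overflow check on the low word  */
    uint32_t lo2 = r.lo + (hl << 16);
    carry += (lo2 < r.lo);            /* ...and again for the second add */
    r.lo = lo2;
    r.hi = hh + (lh >> 16) + (hl >> 16) + carry;
    return r;
}
```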
The worst part of this code is the overflow detection, which requires a lot of cycles. To improve that, we can use some special properties of mantissa multiplication to our advantage.
Since we only need the highest 24 bits, we can rewrite `imul32`, change the output type to int32_t, and merge the left shift into the same function.
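A sketch of the specialized version, under the assumption that both inputs are 24-bit mantissas (implicit 1 at bit 23): every partial sum then provably fits in 32 bits, so no carry detection is needed, and the function returns the product already shifted down by 23 bits. The name `imul32_hi` is an assumption:

```c
#include <stdint.h>

/* For 24-bit a and b: ah, bh < 2^8, so mid < 2^25 and hi < 2^16;
 * nothing overflows 32 bits and the result is exactly (a*b) >> 23. */
static int32_t imul32_hi(int32_t a, int32_t b)
{
    uint32_t ah = (uint32_t) a >> 16, al = a & 0xFFFF;
    uint32_t bh = (uint32_t) b >> 16, bl = b & 0xFFFF;

    uint32_t hi  = ah * bh;           /* bits 32..47 */
    uint32_t mid = ah * bl + al * bh; /* bits 16..40, no carry possible */
    uint32_t lo  = al * bl;           /* bits  0..31 */

    /* floor((a*b) >> 23), assembled without a 64-bit temporary */
    return (int32_t)((hi << 9) + ((mid + (lo >> 16)) >> 7));
}
```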
All code with line numbers in this block comes from quizcv4.c.
In the C standard, if we want to use a 2D array, we can either specify the column count (like `float a[][3]`) or use a double pointer (like `float **a`). The second approach is much more versatile, but it requires an additional indirection and some assumptions about array length, which is not ideal.
So we use a 1D array instead of a 2D one, and carry the dimension information implicitly.
Without the double lookup, indexing becomes quite challenging. A naive way would be to compute each index with a multiplication, but multiplication is even slower, so we instead incrementally increase and decrease the indices as we traverse.
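A sketch of that layout and indexing scheme (the fixed 3×3 shape and the names are assumptions): rather than `a[i][j]` (a second pointer load) or `a[i * n + j]` (a multiply), the offsets are maintained with additions only; the quiz code would use fmul32/fadd32 in place of `*` and `+`:

```c
#include <stddef.h>

#define N 3

static void matmul(const float *a, const float *b, float *c)
{
    size_t ia = 0, ic = 0;                /* running row offsets       */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            size_t ib = (size_t) j;       /* top of column j in b      */
            for (int k = 0; k < N; k++) {
                acc += a[ia + k] * b[ib];
                ib += N;                  /* next row of b: an add, not a mul */
            }
            c[ic++] = acc;
        }
        ia += N;                          /* next row of a             */
    }
}
```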
The double lookup also leads to cache locality issues, which would make the program slower once cache miss latency is taken into account.
The first version is really unoptimized: it has 2044 bytes of instructions, 13890 cycles, a 0.988 data cache hit rate, and a 0.8421 instruction cache hit rate (direct-mapped).
In this program, we are multiplying a 3×3 matrix by a 3×3 matrix, which requires 27 fadd32 and 27 fmul32 calls, along with 27 mmul calls. Considering how many cycles a normal integer multiplication takes, it would be a challenge to drastically decrease the instruction count.
42 cache misses in the direct-mapped cache is surprisingly good compared to the total instructions retired.
If we were using the conventional approach and allocating on the heap, data fragmentation would probably cause a lot of cache misses.
From my perspective, it is weird that the 2-way set-associative cache performs worse than the direct-mapped one.
The performance loss is probably mainly caused by excessive memory accesses, such as:
Most of them can be replaced by saved registers, eliminating a lot of memory accesses.
Because the multiplication loop always runs 24 iterations (one per mantissa bit), we can unroll it at the expense of file size and cache locality.
Also, there are some bugs in the special-value handling of both fadd32 and fmul32, although this is not a big issue for matrix multiplication since infinity and NaN are equally invalid in a matrix.
Our previous design didn't handle flexible-size matrix multiplication, so we also implement a custom low-complexity dynamic allocator.
Finally, there are two competing matrix traversal methods (in the original figures, the number in each matrix cell marks the time at which that location is first used). Using some basic analysis, we can find that the second method requires fewer cycles.
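The exact orders shown in the figures may differ, but a typical pair of candidates is the classic i-j-k order versus the i-k-j order, sketched below; the second keeps the inner loop walking b and c sequentially (both assume c is zero-initialized):

```c
#define N 3

/* method 1: i-j-k, each c[i][j] is finished one element at a time */
void mmul_ijk(const float *a, const float *b, float *c)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
}

/* method 2: i-k-j, a[i][k] is reused across a whole row of b and
 * every inner access moves through memory sequentially */
void mmul_ikj(const float *a, const float *b, float *c)
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
}
```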
To address the above issues, we update our assembly to the next version (with its C code equivalent).
We change the location of each variable from memory to saved registers, and merge variables with non-overlapping lifetimes.
Because there are always 24 bits in the mantissa, we can use loop unrolling to greatly decrease branch misses at the cost of code size and instruction cache hit rate.
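A sketch of the idea in C (the real code tracks a 48-bit hi:lo pair in two registers; a 64-bit type stands in here, and the name is an assumption): the trip count is a compile-time constant, so the loop branch disappears entirely, and each step is also written branchlessly:

```c
#include <stdint.h>

static uint64_t mmul_unrolled(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
/* add (a << i) when bit i of b is set, without any branch */
#define STEP(i) r += ((uint64_t) a << (i)) & -(uint64_t)((b >> (i)) & 1u);
    STEP(0)  STEP(1)  STEP(2)  STEP(3)  STEP(4)  STEP(5)
    STEP(6)  STEP(7)  STEP(8)  STEP(9)  STEP(10) STEP(11)
    STEP(12) STEP(13) STEP(14) STEP(15) STEP(16) STEP(17)
    STEP(18) STEP(19) STEP(20) STEP(21) STEP(22) STEP(23)
#undef STEP
    return r;
}
```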
The complexity of this implementation is $O(n)$ (with $n$ the number of matrix elements), with zero-initializing the matrix being the slowest part. We treat the heap as a stack and move the top up when needed (all heap variables have unlimited lifetime in this implementation).
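A sketch of such an allocator (`heap_base`, `HEAP_SIZE`, and the function names are assumptions): the heap only grows and nothing is ever freed, so allocation itself is constant time and the zero-initialization dominates:

```c
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE 4096
static uint8_t heap_base[HEAP_SIZE];
static size_t  heap_top = 0;                 /* next free offset      */

static void *alloc(size_t nbytes)
{
    nbytes = (nbytes + 3) & ~(size_t) 3;     /* keep 4-byte alignment */
    if (heap_top + nbytes > HEAP_SIZE)
        return 0;                            /* out of memory         */
    void *p = &heap_base[heap_top];
    heap_top += nbytes;                      /* bump the stack top    */
    return p;
}

/* zero-initializing the n elements dominates the cost, hence O(n) */
static float *alloc_matrix(int rows, int cols)
{
    float *m = alloc((size_t) rows * cols * sizeof(float));
    for (int i = 0; m && i < rows * cols; i++)
        m[i] = 0.0f;
    return m;
}
```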
If we never overflow or underflow, we can simply remove the overflow and underflow sections and even free up some registers.
If there are no infinities or NaNs in the matrix, that special-case handling can be removed.
If the input matrix is always valid, those if statements can also be removed.
`highestbit` and `mmul` can be inlined into their calling functions, but that loses some readability, and the call overhead is already really small.
The total cycle count is 39% lower, which is quite considerable. Note that since my gcc at -O1 or above cannot translate the assembly correctly (the function is inlined, and the argument is completely ignored), the printed result will be wrong (the result in the heap is still correct, though).
Unsurprisingly, gcc -O1 outperforms my handwritten assembly.
Data cache misses and writebacks are much lower. Instruction cache misses are also lower, probably due to the smaller code size.
The 2-way set-associative cache makes data cache misses and writebacks drastically lower, since matrices a and c map to overlapping cache areas in the direct-mapped configuration and no longer evict each other.