Contributed by kc71486
RISC-V
, jserv
Follow the instruction from this note
Ignore Configure $PATH
part, and add riscv-none-elf-gcc and rv32emu into ~/.bashrc
instead.
Then source
the .bashrc file
The following question is picked from the Assignment 1
唐飴苹 Calculate the Hamming Distance using Counting Leading Zeros
The Hamming Distance between two integers is defined as the number of differing bits at the same position when comparing the binary representations of the integers. For example, the Hamming Distance between 1011101 and 1001001 is 2.
In the assignment, I implement the program to calculate the Hamming Distance between the two given 64-bit unsigned integers.
Source code
The original implementation of the questions is as follows.
There is an easier way to achieve this, but since it is probably not how we should do the assignment, I am not doing it.
Easier way:
The reason I choose this question is that hamming code is used in error correcting code, which is widely used in main memory, especially in servers, where reliability is very important.
To compile the .c file, I wrote makefile.
And use make
to compile.
There are some modification I need to make:
For every test, I run the shell that includes:
dec:56266, cycle:7517, result: (correct)
O0 is the baseline of everything. Nothing special other than large program size.
A lot of load/store, no wonder it is so slow.
dec:55106, cycle:2562, result: (correct)
Much faster than O0. Also smaller code size.
Almost all load/store has become register. count_leading_zero was inlined. A lot of shift
and or
, probably because 64 bit shift is tedious.
dec:51666, cycle:2866, result: (correct)
Suprisingly slower than O1. And bigger code size. Really counter intuitive.
dec:55102, cycle:2798, result: (correct)
Suprisingly faster than O2. And slightly smaller code size.
dec:55122, cycle:2866, result: (correct)
Same to O2, I feel like some O2 optimization flags make things worse.
dec:55122, cycle:2866, result: (correct)
Same to O2, nothing special.
dec:53414, cycle:3198, result: (incorrect)
Slower than O2 but faster than O0, bigger code size, and the result is wrong.
Modified version:
test1_x0 = 0x0013000000100000;
test1_x1 = 0x00000000000FFFFF;
test2_x0 = 0x0000000200000001;
test2_x1 = 0x7FFFFFFFFFFFFFFE;
test3_x0 = 0x000000028370228F;
test3_x1 = 0x000000028370228F;
This testcase is more general (slight more random instead of consecutive bit difference). The assembly would pass the original testcase.
Upon inspecting the assembly, I found that the workflow of the assembly is quite different from the corresponding c code, so The first thing I had to do was match the c code.
There are several places that didn't follow the c code, such as:
instead of original:
It doesn't lead to incorrect result, but additional count_leading_zeros
function call does make assembly slower.
And also c1 and c2 assignment is wrong, the assembly can be interpreted as
instead of intended:
dec:56826, cycle:2566, result: (correct)
Slightly slower than -O1 optimization, indicating that the code is almost optimized. Code size is larger than O1, but that can be improved.
printf
is bulky, and we don't actually need that much code size to perform general purpose print function. Therefore, I rewrited print function to improve so the code size will be much smaller.
It shouldn't affect cycle since cycle measurement doesn't include any print function.
I replace
with
which results in
dec:11216, cycle:2566, result: (correct)
Unsuprisingly, much smaller code size. Cycle count didn't change because cycle measurement didn't include this part.
When dealing with int64 right shift, we need to perform three shift
and one or
operation, which is a lot. But since leading zero is either in higher word or lower word, we only need to do it once.
Depend on the higher bit, there are two situation:
Another place where 64 bit shift can be simplified.
Because the purpose of this loop is to traverse all digits, we can loop them seperately without two part interacting with each other.
I used autorun.sh to automate these process.
The script can only run in bash environment, using shell will cause compile error.
program size:
textdatabssdec
9092142486811384
execution result:
Elapse cycle:1406
(result: correct)
Worse in size, but much better in speed.
Let HammingDistance_c
and HammingDistance_s
have identical function type, so they are easily interchangable. This will increase cycles in c O0 optimization.
Add instruction count in main. Note that cycle measurement is in instruction count measurement, so instruction count might end up more than cycle count.
In previous version, I used blt
and bgt
instead of bltu
and bgtu
, this will cause some error in some edge case. Fixing it will cause cycle to increase.
After inline count_leading_zero, the register available is enough to totally remove all saved registers. Also, since the function don't call anything, return address no longer need to be saved.
Split data section and functions in object file using -fdata-sections -ffunction-sections
, and auto clear unused function using --gc-sections
. Doing so will decrease the code size but increase the cycle.
Add more unimportant things in bash script, nothing special.
c implementation O0:
program size:
textdatabssdec
41243647845272
execution result:
Elapse cycle:8570
Instruction count:8588
(result: correct)
More cycle than first version (+1053 cycles) because more features are added, and apparently unoptimized new features have awful performance. Much smaller code size because most unused function are removed.
c implementation O1:
program size:
textdatabssdec
26563127803748
execution result:
Elapse cycle:2692
Instruction count:2706
(result: correct)
Slightly more cycle than first version (+130 cycles) because more features are added. Also much smaller code size.
asm implementation O0:
textdatabssdec
27203647843868
execution result:
Elapse cycle:1608
Instruction count:1626
(result: correct)
Suprisingly not much more cycle than previous version (+220 cycles). Probably because more feature are added and the bugfix. Also much smaller code size.
target | O0 | O1 | O2 | Os |
---|---|---|---|---|
v1_c | 7517 | 2562 | 2866 | 2798 |
v1_asm | 3198 | 3198 | 3198 | 3198 |
v2_asm | 2566 | 2566 | 2566 | 2566 |
v3_asm | 1406 | 1404 | 1404 | 1404 |
v4_c | 8570 | 2692 | 2693 | 2957 |
v4_asm | 1608 | 1599 | 1599 | 1599 |
target | O0 | O1 | O2 | Os |
---|---|---|---|---|
v1_c | 56266 | 55106 | 55122 | 55102 |
v1_asm | 56870 | 56870 | 56870 | 56870 |
v2_asm | 11216 | 11216 | 11216 | 11216 |
v3_asm | 11384 | 10088 | 10080 | 9992 |
v4_c | 5272 | 3748 | 3744 | 3656 |
v4_asm | 3868 | 3528 | 3524 | 3440 |
appendix: (v4_c, O3): (2693, 4872)
Copy main function from disassembled elf file, and perform these changes:
gp
referenced variable into la;lw
.jal <address>
into call <label>
.s0
referenced address into sp
referenced address (useless).Write new bash script (runmains.sh) that uses as
and ld
that compiles mains.s. gcc
and as
and ld
all have unique flags.
My itos function requires integer division and modulo, and since gcc provides these funcion in their standard library. I was able to get away with it. Now I need to copy them and include them into my file.
To remove warning.
Elapse cycle:8582^@Instruction count:8596^@Hamming Distance:24^@ …
I don't know why printchar newline doesn't work. This part of main function is identical to c counterpart, and I didn't modify printchar during this step.
You should refine the use of write
system call.