# Assignment2: GNU Toolchain
contributed by < [`GliAmanti`](https://github.com/GliAmanti) >
## Installation
My OS: **``Ubuntu 22.04 LTS``**
I modify some steps in [Lab2](https://hackmd.io/@sysprog/SJAR5XMmi) to adapt the instructions to my environment.
### Prepare GNU Toolchain for RISC-V
1. Create a directory, and download the GNU toolchain tarball with the `wget` command.
```
mkdir hw2
cd hw2
wget https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/download/v13.2.0-2/xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz
```
2. Extract the tarball, and copy the extracted directory to a specific path.
```
tar zxvf xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz
cp -af xpack-riscv-none-elf-gcc-13.2.0-2 $HOME/下載/hw2/riscv-none-elf-gcc
```
3. Configure `$PATH` environment variable.
```
cd riscv-none-elf-gcc
echo "export PATH=$HOME/下載/hw2/riscv-none-elf-gcc/bin:$PATH" > setenv
```
4. Update `$PATH` environment variable.
```
source setenv
```
5. Check the toolchain version. This should work if you set `$PATH` properly.
```
riscv-none-elf-gcc -v
```
::: info
The output message will be:
```
gcc version 13.2.0 (xPack GNU RISC-V Embedded GCC x86_64)
```
:::
:::success
Remember to repeat step **4** every time you open a new terminal to run [rv32emu](https://github.com/sysprog21/rv32emu).
:::
### Get and build [rv32emu](https://github.com/sysprog21/rv32emu)
1. [rv32emu](https://github.com/sysprog21/rv32emu) relies on some third-party packages to be fully usable and to provide you full access to all of its features. Your target system must have a functional SDL2 library.
```
sudo apt update
sudo apt install libsdl2-dev libsdl2-mixer-dev
```
2. Get and build [rv32emu](https://github.com/sysprog21/rv32emu) from source.
```
git clone https://github.com/sysprog21/rv32emu
cd rv32emu
make
```
3. Validate [rv32emu](https://github.com/sysprog21/rv32emu)
```
make check
```
::: info
The output message will be:
```
Running hello.elf ... [OK]
Running puzzle.elf ... [OK]
Running pi.elf ... [OK]
```
:::
4. Run hello.elf.
```
build/rv32emu build/hello.elf
```
::: info
The output message will be:
```
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
inferior exit code 0
```
:::
### Using GNU Toolchain
Follow the steps in [Lab2: Using GNU Toolchain](https://hackmd.io/@sysprog/SJAR5XMmi#Using-GNU-Toolchain).
## Question
The following question is picked from Assignment 1.
> 唐飴苹 [**Calculate the Hamming Distance using Counting Leading Zeros**](https://hackmd.io/@O6C2C3zQQBanDM55QRZ7DQ/Lab1_RV32I_assembly)
>
> The Hamming Distance between two integers is defined as the number of differing bits at the same position when comparing the binary representations of the integers. For example, the Hamming Distance between 1011101 and 1001001 is 2.
>
> In the assignment, I implement the program to calculate the Hamming Distance between the two given 64-bit unsigned integers.
:::spoiler The original **C implementation** of the question
```c
#include <stdio.h>
#include <stdint.h>

uint64_t test1_x0 = 0x0000000000100000;
uint64_t test1_x1 = 0x00000000000FFFFF;
uint64_t test2_x0 = 0x0000000000000001;
uint64_t test2_x1 = 0x7FFFFFFFFFFFFFFE;
uint64_t test3_x0 = 0x000000028370228F;
uint64_t test3_x1 = 0x000000028370228F;

uint16_t count_leading_zeros(uint64_t x){
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x |= (x >> 32);
    /* count ones (population count) */
    x -= ((x >> 1) & 0x5555555555555555);
    x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
    x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);
    x += (x >> 32);
    return (64 - (x & 0x7f));
}

int HammingDistance(uint64_t x0, uint64_t x1){
    int Hdist = 0;
    int16_t max_digit = 64 - (int16_t)count_leading_zeros((x0 > x1)? x0 : x1);
    while(max_digit > 0){
        uint64_t c1 = x0 & 1;
        uint64_t c2 = x1 & 1;
        if(c1 != c2) Hdist += 1;
        x0 = x0 >> 1;
        x1 = x1 >> 1;
        max_digit -= 1;
    }
    return Hdist;
}

int main(){
    printf("Hamming Distance = %d\n", HammingDistance(test1_x0, test1_x1));
    printf("Hamming Distance = %d\n", HammingDistance(test2_x0, test2_x1));
    printf("Hamming Distance = %d\n", HammingDistance(test3_x0, test3_x1));
    return 0;
}
```
:::
:::spoiler The original **RISC-V Assembly implementation** of the question
```
.data
test_data_1: .dword 0x0000000000100000, 0x00000000000FFFFF # HD(1048576, 1048575) = 21
test_data_2: .dword 0x0000000000000001, 0x7FFFFFFFFFFFFFFE # HD(1, 9223372036854775806) = 63
test_data_3: .dword 0x000000028370228F, 0x000000028370228F # HD(10795098767, 10795098767) = 0
msg_string: .string "\nHamming Distance="
.text
main:
addi sp, sp, -12
# push pointers of test data onto the stack
la t0, test_data_1
sw t0, 0(sp)
la t0, test_data_2
sw t0, 4(sp)
la t0, test_data_3
sw t0, 8(sp)
# initialize main_loop
addi s0, zero, 3 # s0 : number of test case
addi s1, zero, 0 # s1 : test case counter
addi s2, sp, 0 # s2 : points to test_data_1
main_loop:
la a0, msg_string
li a7, 4 # print string
ecall
lw a0, 0(s2) # a0 : pointer to the first data in test_data_1
addi a1, a0, 8 # a1 : pointer to the second data in test_data_1
jal ra, hd_func
# print the result #
li a7, 1 # print integer
ecall # print result of hd_cal (which is in a0)
addi s2, s2, 4 # s2 : points to next test_data
addi s1, s1, 1 # counter++
bne s1, s0, main_loop
addi sp, sp, 12
li a7, 10
ecall
# hamming distance function
hd_func:
addi sp, sp, -36
sw ra, 0(sp)
sw s0, 4(sp) # address of x0
sw s1, 8(sp) # address of x1
sw s2, 12(sp) # digit of x0
sw s3, 16(sp) # digit of x1
sw s4, 20(sp) # lower part of x0
sw s5, 24(sp) # higher part of x0
sw s6, 28(sp) # lower part of x1
sw s7, 32(sp) # higher part of x1
# get address of x0 and x1
mv s0, a0 # s0 : address of x0
mv s1, a1 # s1 : address of x1
# get x0_digit
lw a0, 0(s0) # a0 : lower part of x0
lw a1, 4(s0) # a1 : higher part of x0
jal ra clz
li s2, 64
sub s2, s2, a0 # s2 : x0_digit (return value saved in a0)
# get x1_digit
lw a0, 0(s1) # a0 : lower part of x1
lw a1, 4(s1) # a1 : higher part of x1
jal ra clz
li s3, 64
sub s3, s3, a0 # s3 : x1_digit (return value saved in a0)
# get x0(s5 s4) and x1(s7 s6)
lw s4, 0(s0)
lw s5, 4(s0)
lw s6, 0(s1)
lw s7, 4(s1)
# compare with two digit
slt t0, s2, s3
bne t0, zero, x1_larger
mv s3, zero # s3: hd counter
bgt s2, zero, hd_cal_loop
# when digit is 0
mv a0, s2 # save max_digit to a0
j hd_func_end
x1_larger:
mv s2, s3 # s2 : max_digit
mv s3, zero # s3: hd counter
bgt s2, zero, hd_cal_loop
# when digit is 0
mv a0, s2 # save max_digit to a0
j hd_func_end
hd_func_end:
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
lw s5, 24(sp)
lw s6, 28(sp)
lw s7, 32(sp)
addi sp, sp, 36
ret
# hamming distance calculation (result save in a0, a1)
hd_cal_loop:
# when the current digit larger than 32
addi t2, zero, 32
bgt s2, t2, hd_getLSB_upper
# hd_getLSB_lower : and with 1
li t3, 0x00000001
and t4, s4, t3
and t5, s6, t3
j hd_cal_shift
hd_getLSB_upper:
# and with 1
li t3, 0x00000001
and t4, s5, t3
and t5, s7, t3
hd_cal_shift:
# (s5 s4) = x >> 1
srli t0, s4, 1
slli t1, s5, 31
or s4, t0, t1 # s4 >> 1
srli s5, s5, 1 # s5 >> 1
# (s7 s6) = x >> 1
srli t0, s6, 1
slli t1, s7, 31
or s6, t0, t1 # s6 >> 1
srli s7, s7, 1 # s7 >> 1
beq t4, t5, hd_check_loop
addi s3, s3, 1
hd_check_loop:
addi s2, s2, -1
bne s2, zero, hd_cal_loop
mv a0, s3 # save return value to a0
j hd_func_end
# count leading zeros
clz:
addi sp, sp, -4
sw ra, 0(sp)
beq a1, zero, clz_lower_set_one
clz_upper_set_one:
srli t1, a1, 1
or a1, a1, t1
srli t1, a1, 2
or a1, a1, t1
srli t1, a1, 4
or a1, a1, t1
srli t1, a1, 8
or a1, a1, t1
srli t1, a1, 16
or a1, a1, t1
li a0, 0xffffffff
j clz_count_ones
clz_lower_set_one:
srli t0, a0, 1
or a0, a0, t0
srli t0, a0, 2
or a0, a0, t0
srli t0, a0, 4
or a0, a0, t0
srli t0, a0, 8
or a0, a0, t0
srli t0, a0, 16
or a0, a0, t0
clz_count_ones:
# x = (a1 a0)
# x -= ((x >> 1) & 0x5555555555555555); #
srli t0, a0, 1
slli t1, a1, 31
or t0, t0, t1 # t0 >> 1
srli t1, a1, 1 # t1 >> 1
li t2, 0x55555555
and t0, t0, t2
and t1, t1, t2
sltu t3, a0, t0 # t3 : borrow bit
sub a0, a0, t0
sub a1, a1, t1
sub a1, a1, t3
# x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); #
srli t0, a0, 2
slli t1, a1, 30
or t0, t0, t1 # t0 >> 2
srli t1, a1, 2 # t1 >> 2
li t2, 0x33333333
and t0, t0, t2
and t1, t1, t2
and t4, a0, t2
and t5, a1, t2
# (a1 a0) = (t1 t0) + (t5 t4)
add a0, t0, t4
sltu t3, a0, t0 # t3 : carry bit
add a1, t1, t5
add a1, a1, t3
# x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; #
srli t0, a0, 4
slli t1, a1, 28
or t0, t0, t1 # t0 >> 4
srli t1, a1, 4 # t1 >> 4
add t0, t0, a0
sltu t3, t0, a0 # t3 : carry bit
add t1, t1, a1
add t1, t1, t3
li t2, 0x0f0f0f0f
and a0, t0, t2
and a1, t1, t2
# x += (x >> 8); #
srli t0, a0, 8
slli t1, a1, 24
or t0, t0, t1 # t0 >> 8
srli t1, a1, 8 # t1 >> 8
add a0, a0, t0
sltu t3, a0, t0 # t3 : carry bit
add a1, a1, t1
add a1, a1, t3 # (a1 a0) += (t1 t0)
# x += (x >> 16); #
srli t0, a0, 16
slli t1, a1, 16
or t0, t0, t1 # t0 >> 16
srli t1, a1, 16 # t1 >> 16
add a0, a0, t0
sltu t3, a0, t0 # t3 : carry bit
add a1, a1, t1
add a1, a1, t3 # (a1 a0) += (t1 t0)
# x += (x >> 32); #
# (t1 t0) = x >> 32
mv t0, a1
mv t1, zero
add a0, a0, t0
sltu t3, a0, t0 # t3 : carry bit
add a1, a1, t1
add a1, a1, t3 # (a1 a0) += (t1 t0)
# return (64 - (x & 0x7f));
# a0 = (x & 0x7f)
andi a0, a0, 0x7f
li t0, 64
sub a0, t0, a0 # a0 = (64 - (x & 0x7f))
lw ra, 0(sp)
addi sp, sp, 4
ret
```
:::
## Optimization
### My Modified C Code
<!-- Here is my source code in [GitHub](). -->
:::spoiler Rewrite hamming distance function (Version 1)
```c
int HammingDistance(uint64_t x0, uint64_t x1)
{
    uint64_t xorVal = x0 ^ x1;
    int Hdist = 0;
    int16_t max_digit = 64 - (int16_t)count_leading_zeros((x0 > x1)? x0 : x1);
    while (max_digit > 0)
    {
        if(xorVal % 2 == 1)
        {
            Hdist += 1;
        }
        xorVal >>= 1;
        max_digit -= 1;
    }
    return Hdist;
}
```
First, XOR ``x0`` and ``x1`` so that every differing bit becomes 1 in ``xorVal``. Then use ``% 2`` to check whether the rightmost bit is 1, and shift ``xorVal`` right by one until all ``max_digit`` bits have been examined.
:::
:::spoiler Rewrite hamming distance function (Version 2)
```c
int HammingDistance(uint64_t x0, uint64_t x1)
{
    uint64_t xorVal = x0 ^ x1;
    int Hdist = 0;
    int16_t max_digit = 64 - (int16_t)count_leading_zeros((x0 > x1)? x0 : x1);
    while (max_digit > 0)
    {
        /* parentheses are needed: == binds tighter than & in C */
        if((xorVal & 1) == 1)
        {
            Hdist += 1;
        }
        xorVal >>= 1;
        max_digit -= 1;
    }
    return Hdist;
}
```
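I change the condition in the ``if`` statement from ``% 2`` to ``& 1``, but the cycle count does not decrease, presumably because the compiler already turns ``% 2`` on an unsigned value into a bitwise AND.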
:::
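
As a possible next step (not measured here), the loop can be dropped entirely: once `x0 ^ x1` is computed, the Hamming distance is simply the number of set bits in that value. The sketch below reuses the population-count steps that already appear inside `count_leading_zeros`. The function name `hamming_popcount` is made up for illustration, and its cycle count on rv32emu has not been measured.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the original code): Hamming distance as
 * popcount(x0 ^ x1), using the same bit tricks as the "count ones" part of
 * count_leading_zeros. */
int hamming_popcount(uint64_t x0, uint64_t x1)
{
    uint64_t x = x0 ^ x1;                        /* 1 bits mark differing positions */
    x -= (x >> 1) & 0x5555555555555555ULL;       /* sums of adjacent bit pairs */
    x = ((x >> 2) & 0x3333333333333333ULL)
        + (x & 0x3333333333333333ULL);           /* sums per 4-bit group */
    x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0fULL;  /* sums per byte */
    x += x >> 8;                                 /* fold the byte sums together */
    x += x >> 16;
    x += x >> 32;
    return (int) (x & 0x7f);                     /* total is in 0..64, fits in 7 bits */
}
```

With the three test pairs used in this note, `hamming_popcount` should return 21, 63 and 0, the same values as the loop-based versions.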
### My Hand Written RISC-V Assembly Code
<!-- Here is my source code in [GitHub](). -->
:::spoiler Based on my modified C code
```
.data
test_data_1: .dword 0x0000000000100000, 0x00000000000FFFFF # HD(1048576, 1048575) = 21
test_data_2: .dword 0x0000000000000001, 0x7FFFFFFFFFFFFFFE # HD(1, 9223372036854775806) = 63
test_data_3: .dword 0x000000028370228F, 0x000000028370228F # HD(10795098767, 10795098767) = 0
msg_string: .string "\nHamming Distance="
.text
main:
addi sp, sp, -12
# push pointers of test data onto the stack
la t0, test_data_1
sw t0, 0(sp)
la t0, test_data_2
sw t0, 4(sp)
la t0, test_data_3
sw t0, 8(sp)
# initialize main_loop
addi s0, zero, 3 # s0 : number of test case
addi s1, zero, 0 # s1 : test case counter
addi s2, sp, 0 # s2 : points to test_data_1
main_loop:
la a0, msg_string
li a7, 4 # print string
ecall
lw a0, 0(s2) # a0 : pointer to the first data in test_data_1
addi a1, a0, 8 # a1 : pointer to the second data in test_data_1
jal ra, hd_func
# print the result #
li a7, 1 # print integer
ecall # print result of hd_cal (which is in a0)
addi s2, s2, 4 # s2 : points to next test_data
addi s1, s1, 1 # counter++
bne s1, s0, main_loop
addi sp, sp, 12
li a7, 10
ecall
# hamming distance function
hd_func:
addi sp, sp, -20
sw ra, 0(sp)
sw s0, 4(sp) # address of x0
sw s1, 8(sp) # address of x1
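sw s2, 12(sp) # reused below as the constant 1 (bit mask)
sw s3, 16(sp) # reused below as the loop counter (max_digit)
# note: s4-s8 are also overwritten below but are not saved here
# get address of x0 and x1
mv s0, a0 # s0 : address of x0
mv s1, a1 # s1 : address of x1
lw a0, 0(s0) # a0 : lower part of x0
lw a1, 4(s0) # a1 : higher part of x0
mv s4, a0 # s4: lower part of x0
mv s5, a1 # s5: higher part of x0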
lw a0, 0(s1) # a0 : lower part of x1
lw a1, 4(s1) # a1 : higher part of x1
xor s6, s4, a0 # s6: lower part of xorVal
xor s7, s5, a1 # s7: higher part of xorVal
# compare with x0 and x1
cmp:
blt s5, a1, jmpClz # compare the higher part only
mv a0, s4
mv a1, s5
jmpClz:
jal ra clz
li s3, 64 # s3: 64
sub s3, s3, a0 # s3: max_digit = 64 - clz (return value saved in a0)
addi s2, x0, 1 # s2: 1
mv s8, zero # s8: hd counter
j hd_cal_loop # note: assumes max_digit > 0 (x0 and x1 are not both zero)
hd_func_end:
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
addi sp, sp, 20
ret
# hamming distance calculation (result saved in a0)
hd_cal_loop:
and t0, s6, s2
bne t0, s2, hd_cal_shift
addi s8, s8, 1 # Hdist += 1
hd_cal_shift:
# (s7 s6) = x >> 1
srli t0, s6, 1
slli t1, s7, 31
or s6, t0, t1 # s6 >> 1
srli s7, s7, 1 # s7 >> 1
hd_check_loop:
addi s3, s3, -1
bne s3, zero, hd_cal_loop
mv a0, s8 # save return value to a0
j hd_func_end
# count leading zeros
clz:
addi sp, sp, -4
sw ra, 0(sp)
beq a1, zero, clz_lower_set_one
clz_upper_set_one:
srli t1, a1, 1
or a1, a1, t1
srli t1, a1, 2
or a1, a1, t1
srli t1, a1, 4
or a1, a1, t1
srli t1, a1, 8
or a1, a1, t1
srli t1, a1, 16
or a1, a1, t1
li a0, 0xffffffff
j clz_count_ones
clz_lower_set_one:
srli t0, a0, 1
or a0, a0, t0
srli t0, a0, 2
or a0, a0, t0
srli t0, a0, 4
or a0, a0, t0
srli t0, a0, 8
or a0, a0, t0
srli t0, a0, 16
or a0, a0, t0
clz_count_ones:
# x = (a1 a0)
# x -= ((x >> 1) & 0x5555555555555555); #
srli t0, a0, 1
slli t1, a1, 31
or t0, t0, t1 # t0 >> 1
srli t1, a1, 1 # t1 >> 1
li t2, 0x55555555
and t0, t0, t2
and t1, t1, t2
sltu t3, a0, t0 # t3 : borrow bit
sub a0, a0, t0
sub a1, a1, t1
sub a1, a1, t3
# x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); #
srli t0, a0, 2
slli t1, a1, 30
or t0, t0, t1 # t0 >> 2
srli t1, a1, 2 # t1 >> 2
li t2, 0x33333333
and t0, t0, t2
and t1, t1, t2
and t4, a0, t2
and t5, a1, t2
# (a1 a0) = (t1 t0) + (t5 t4)
add a0, t0, t4
sltu t3, a0, t0 # t3 : carry bit
add a1, t1, t5
add a1, a1, t3
# x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; #
srli t0, a0, 4
slli t1, a1, 28
or t0, t0, t1 # t0 >> 4
srli t1, a1, 4 # t1 >> 4
add t0, t0, a0
sltu t3, t0, a0 # t3 : carry bit
add t1, t1, a1
add t1, t1, t3
li t2, 0x0f0f0f0f
and a0, t0, t2
and a1, t1, t2
# x += (x >> 8); #
srli t0, a0, 8
slli t1, a1, 24
or t0, t0, t1 # t0 >> 8
srli t1, a1, 8 # t1 >> 8
add a0, a0, t0
sltu t3, a0, t0 # t3 : carry bit
add a1, a1, t1
add a1, a1, t3 # (a1 a0) += (t1 t0)
# x += (x >> 16); #
srli t0, a0, 16
slli t1, a1, 16
or t0, t0, t1 # t0 >> 16
srli t1, a1, 16 # t1 >> 16
add a0, a0, t0
sltu t3, a0, t0 # t3 : carry bit
add a1, a1, t1
add a1, a1, t3 # (a1 a0) += (t1 t0)
# x += (x >> 32); #
# (t1 t0) = x >> 32
mv t0, a1
mv t1, zero
add a0, a0, t0
sltu t3, a0, t0 # t3 : carry bit
add a1, a1, t1
add a1, a1, t3 # (a1 a0) += (t1 t0)
# return (64 - (x & 0x7f));
# a0 = (x & 0x7f)
andi a0, a0, 0x7f
li t0, 64
sub a0, t0, a0 # a0 = (64 - (x & 0x7f))
lw ra, 0(sp)
addi sp, sp, 4
ret
```
:::
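
The `cmp` block above decides which operand is larger by looking at the upper words only, and `blt` is a signed comparison. That is enough for the three test cases here, but in general a 64-bit unsigned comparison on RV32 also has to consult the lower words when the upper words are equal. A minimal C model of that decomposition is sketched below; the helper name `gt_u64_halves` is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true when (hi0:lo0) > (hi1:lo1) as unsigned
 * 64-bit values, using only 32-bit comparisons as RV32I would. */
bool gt_u64_halves(uint32_t hi0, uint32_t lo0, uint32_t hi1, uint32_t lo1)
{
    if (hi0 != hi1)
        return hi0 > hi1;  /* unsigned compare (bltu/bgtu rather than blt) */
    return lo0 > lo1;      /* upper words tie, so the lower words decide */
}
```

Since only the position of the highest set bit matters for `max_digit`, another option would be to call `clz` on `x0 | x1`, which avoids the comparison altogether.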
## Analysis
### 1. [ticks.c](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c)
To measure the performance, I add the following code before the original C main function.
```c
#include <inttypes.h>
typedef uint64_t ticks;

static inline ticks getticks(void)
{
    uint64_t result;
    uint32_t l, h, h2;
    asm volatile(
        "rdcycleh %0\n"
        "rdcycle %1\n"
        "rdcycleh %2\n"
        "sub %0, %0, %2\n"
        "seqz %0, %0\n"
        "sub %0, zero, %0\n"
        "and %1, %1, %0\n"
        : "=r"(h), "=r"(l), "=r"(h2));
    result = (((uint64_t) h) << 32) | ((uint64_t) l);
    return result;
}
```
And add the following code in the original C main function.
```c
int main(){
    ticks t0 = getticks();
    printf("Hamming Distance = %d\n", HammingDistance(test1_x0, test1_x1));
    printf("Hamming Distance = %d\n", HammingDistance(test2_x0, test2_x1));
    printf("Hamming Distance = %d\n", HammingDistance(test3_x0, test3_x1));
    ticks t1 = getticks();
    printf("cycle counts: %" PRIu64 "\n", t1 - t0);
    return 0;
}
```
I call the above code *``getticks_original``*.
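
Reading the inline assembly in `getticks` step by step: the upper counter word is read twice around the lower word, and if the two upper reads differ (the lower word wrapped in between), the lower word is masked to zero. Operand `%0` (`h`) is overwritten by that mask, so after the assembly `h` holds all ones or zero rather than the upper counter word. The difference `t1 - t0` is still the elapsed cycle count as long as the lower word does not wrap between the two calls. The host-side model below mirrors the same arithmetic on plain integers so it can be checked without a RISC-V target; `combine_ticks` is a made-up name and is not part of `ticks.c`.

```c
#include <stdint.h>

/* Host-side model (illustration only): applies the same arithmetic as the
 * inline assembly to three values that have already been read. */
uint64_t combine_ticks(uint32_t h, uint32_t l, uint32_t h2)
{
    h = h - h2;    /* sub  %0, %0, %2   : zero iff both upper reads match */
    h = (h == 0);  /* seqz %0, %0       : 1 if they match, else 0 */
    h = 0u - h;    /* sub  %0, zero, %0 : 0xFFFFFFFF or 0, used as a mask */
    l = l & h;     /* and  %1, %1, %0   : drop the lower word on a mismatch */
    return ((uint64_t) h << 32) | (uint64_t) l;
}
```

For example, two stable readings with lower words 100 and 500 differ by exactly 400, even though each combined value carries `0xFFFFFFFF` in its upper half.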
1. Compile *``getticks_original``* to RV32I assembly code.
```
riscv-none-elf-gcc -S -march=rv32i -mabi=ilp32 hammingDist.c -O0 -o hammingDist_O0.s
```
::: success
I also test different optimization options by changing **``-O0``** to **``-O2``**, **``-Ofast``** and **``-Os``**.
:mag: Please check [here](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) for other optimization options.
:::
2. Combine the assembly code with the linking information, and create the **executable and linkable format (ELF)** file.
```
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 hammingDist_O0.s -O0 -o hammingDist_O0.elf
```
::: success
Steps **1** and **2** can be combined like this:
```
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 hammingDist.c -O0 -o hammingDist_O0.elf
```
:::
3. Run the ELF file with [rv32emu](https://github.com/sysprog21/rv32emu).
```
cd ..
cd ..
./build/rv32emu ./tests/asm-hello/hammingDist_O0.elf
```
or
```
../.././build/rv32emu hammingDist_O0.elf
```
::: info
The output will be:
```
Hamming Distance = 21
Hamming Distance = 63
Hamming Distance = 0
cycle counts: 12491
```
:::
#### Comparison Table
| Optimization Option | Cycle Counts |
|:---------------------------------- |:------------:|
| -O0 | 12491 |
| -O2 | 8713 |
| -Ofast | 8642 |
| -Os | 8678 |
| My modified C with -O0 | **10383** |
| My modified C with -Ofast | **8051** |
<!-- | My handwritten RISC-V with -O0 | |
| My handwritten RISC-V with -OOfast | | -->
### 2. [perfcounter](https://github.com/sysprog21/rv32emu/tree/master/tests/perfcounter)
To measure the performance, I add the following code before the original C main function.
```c
#include <string.h>
extern uint64_t get_cycles();
```
And add the following code in the original C main function.
```c
int main(){
    uint64_t oldcount = get_cycles();
    printf("Hamming Distance = %d\n", HammingDistance(test1_x0, test1_x1));
    printf("Hamming Distance = %d\n", HammingDistance(test2_x0, test2_x1));
    printf("Hamming Distance = %d\n", HammingDistance(test3_x0, test3_x1));
    uint64_t cyclecount = get_cycles() - oldcount;
    printf("cycle counts: %u\n", (unsigned int) cyclecount);
    return 0;
}
```
I call the above code *``getcycles_original``*.
1. Write my own **Makefile**.
```makefile
.PHONY: clean

include ../../mk/toolchain.mk

CC = riscv-none-elf-gcc
CFLAGS = -march=rv32i_zicsr_zifencei -mabi=ilp32 -O0 -Wall

OBJS = \
	getcycles.o \
	hammingDist.o
BIN = hammingDist_O0.elf

%.o: %.S
	$(CC) $(CFLAGS) -c -o $@ $<

%.o: %.c
	$(CC) $(CFLAGS) -c -o $@ $<

all: $(BIN)

$(BIN): $(OBJS)
	$(CC) -o $@ $^

clean:
	$(RM) $(BIN) $(OBJS)
```
::: success
I also test different optimization options by changing **``-O0``** to **``-O2``**, **``-Ofast``** and **``-Os``**.
:::
2. Put *``getcycles_original``*, *``getcycles.S``* and *``Makefile``* into the same directory.

3. Compile the code.
```
make
```
::: success
Make sure there are no existing ELF or object files in the same directory. If there are, remove them first:
```
make clean
```
:::
:::danger
Avoid using screenshots that solely contain plain text. Here are the reasons why:
1. Text-based content is more efficiently searchable than having to browse through images iteratively.
2. The rendering engine of HackMD can consistently generate well-structured layouts with annotated text instead of relying on arbitrary pictures.
3. It provides a more accessible and user-friendly experience for individuals with visual impairments.
:notes: jserv
:::
4. Run the ELF file with [rv32emu](https://github.com/sysprog21/rv32emu).
```
cd ..
cd ..
./build/rv32emu ./tests/asm-hello/hammingDist_O0.elf
```
or
```
../.././build/rv32emu hammingDist_O0.elf
```
::: info
The output will be:
```
Hamming Distance = 21
Hamming Distance = 63
Hamming Distance = 0
cycle counts: 11190
```
:::
#### Comparison Table
| Optimization Option | Cycle Counts |
|:---------------------------------- |:------------:|
| -O0 | 11190 |
| -O2 | 7435 |
| -Ofast | 7363 |
| -Os | 7400 |
| My modified C with -O0 | **9082** |
| My modified C with -Ofast | **6772** |
<!-- | My handwritten RISC-V with -O0 | |
| My handwritten RISC-V with -OOfast | | -->
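
Putting the two comparison tables side by side, the modified C code saves the same absolute number of cycles under both measurement methods (2108 at `-O0` and 591 at `-Ofast`), which works out to roughly 16.9% and 6.8% with `ticks.c`, and 18.8% and 8.0% with `perfcounter`. The helpers below redo that arithmetic from the table values; their names are made up for illustration.

```c
#include <stdint.h>

/* Cycles saved going from a baseline build to the modified build. */
uint32_t cycles_saved(uint32_t baseline, uint32_t modified)
{
    return baseline - modified;
}

/* Relative reduction in tenths of a percent, rounded to nearest.
 * Integer arithmetic only, so a result of 169 means 16.9%. */
uint32_t reduction_permille(uint32_t baseline, uint32_t modified)
{
    return (cycles_saved(baseline, modified) * 1000u + baseline / 2) / baseline;
}
```

For instance, `reduction_permille(12491, 10383)` corresponds to the `-O0` row measured with `ticks.c`.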
### 3. RDCYCLE/RDCYCLEH
To run the assembly code with [rv32emu](https://github.com/sysprog21/rv32emu), I have to make some modifications to my handwritten implementation.
:::success
:mag: Please check [syscall.md](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md) for more details about [rv32emu](https://github.com/sysprog21/rv32emu).
:::
1. Add file ``myHammingDist.ld``.
```
OUTPUT_ARCH("riscv")
ENTRY(_start)
SECTIONS
{
. = 0x0;
}
```
2. Write my own **Makefile**.
```makefile
.PHONY: clean

include ../../mk/toolchain.mk

ASFLAGS = -march=rv32i_zicsr -mabi=ilp32
LDFLAGS = --oformat=elf32-littleriscv
BIN = myHammingDist.elf

%.o: %.S
	$(CROSS_COMPILE)as -R $(ASFLAGS) -c -o $@ $<

all: $(BIN)

myHammingDist.elf: myHammingDist.o
	$(CROSS_COMPILE)ld -o $@ -T myHammingDist.ld $(LDFLAGS) $<

clean:
	$(RM) $(BIN) myHammingDist.o
```
3. Replace the label ``main`` with ``_start``.
```
.text
_start:
```
Add the following code at the top of the file.
```
.global _start
.set SYSEXIT, 93
.set SYSWRITE, 64
```
4. Change ``jal ra clz`` to ``jal ra, clz``; the GNU assembler requires a comma between the operands.
:::info
Otherwise, the output will be:
```
hammingDist.S: Assembler messages:
hammingDist.S:75: Error: illegal operands `jal ra clz'
hammingDist.S:82: Error: illegal operands `jal ra clz'
```
:::
5. Add a ``print_ascii`` block to print the results.
6. Add ``get_cycles_init`` and ``get_cycles_end`` to count the cycles.
7. Change the system calls to the [rv32emu](https://github.com/sysprog21/rv32emu) versions.
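
Steps 5 and 7 go together: the `write` system call in rv32emu (number 64, the `SYSWRITE` constant above) takes a buffer of bytes, so an integer result has to be converted to ASCII digits before it can be printed. The body of `print_ascii` is not shown in this note, so the following is only a C reference model of the conversion such a routine needs to perform; `u32_to_ascii` and its buffer layout are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the decimal digits of `value` into `buf`
 * (no terminating NUL) and returns how many bytes were written.
 * `buf` must have room for at least 10 bytes (UINT32_MAX has 10 digits). */
size_t u32_to_ascii(uint32_t value, char *buf)
{
    char tmp[10];
    size_t n = 0;

    /* Produce the digits least-significant first. */
    do {
        tmp[n++] = (char) ('0' + value % 10);
        value /= 10;
    } while (value != 0);

    /* Reverse into the caller's buffer so the most significant digit comes first. */
    for (size_t i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];

    return n;
}
```

Plain RV32I has no `div` or `rem` instruction, so in assembly the `% 10` and `/ 10` steps would have to be done by repeated subtraction or a similar routine. For the results in this note (at most 64), two digits are enough.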
<!-- To run assembly code with [rv32emu](https://github.com/sysprog21/rv32emu), I have to do some modification on original implementation. -->
<!-- ### 3. [Ripes](https://github.com/mortbopet/Ripes)
I fail to translate the code that can be executed flawlessly with [rv32emu](https://github.com/sysprog21/rv32emu), since [rv32emu](https://github.com/sysprog21/rv32emu) seems to have the problem to print integer. So I compare the cycle counts outputed by Ripes.
:::warning
Don't do that. Get the things right.
:notes: jserv
:::
#### Original

#### Optimized
 -->
:::warning
You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution.
:notes: jserv
:::