# Assignment2: GNU Toolchain
contributed by < [`GliAmanti`](https://github.com/GliAmanti) >

## Installation
My OS: **``Ubuntu 22.04 LTS``**

I modified some steps in [Lab2](https://hackmd.io/@sysprog/SJAR5XMmi) to adapt the instructions to my environment.

### Prepare GNU Toolchain for RISC-V
1. Create a working directory, and download the GNU toolchain tarball with the `wget` command.
```
mkdir hw2
cd hw2
wget https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/download/v13.2.0-2/xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz
```
2. Extract the tarball, and copy the extracted directory to a specific path.
```
tar zxvf xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz
cp -af xpack-riscv-none-elf-gcc-13.2.0-2 $HOME/下載/hw2/riscv-none-elf-gcc
```
3. Write the `$PATH` setting to a `setenv` file.
```
cd riscv-none-elf-gcc
echo "export PATH=$HOME/下載/hw2/riscv-none-elf-gcc/bin:$PATH" > setenv
```
4. Load the setting to update the `$PATH` environment variable of the current shell.
```
source setenv
```
5. Check the toolchain version. This should work if `$PATH` is set properly.
```
riscv-none-elf-gcc -v
```
::: info
The output should end with:
```
gcc version 13.2.0 (xPack GNU RISC-V Embedded GCC x86_64)
```
:::

:::success
Remember to repeat step **4** every time you open a new terminal to run [rv32emu](https://github.com/sysprog21/rv32emu).
:::

### Get and build [rv32emu](https://github.com/sysprog21/rv32emu)
1. [rv32emu](https://github.com/sysprog21/rv32emu) relies on some third-party packages to provide access to all of its features. The target system must have a functional SDL2 library.
```
sudo apt update
sudo apt install libsdl2-dev libsdl2-mixer-dev
```
2. Get and build [rv32emu](https://github.com/sysprog21/rv32emu) from source.
```
git clone https://github.com/sysprog21/rv32emu
cd rv32emu
make
```
3. Validate [rv32emu](https://github.com/sysprog21/rv32emu).
```
make check
```
::: info
The output message will be:
```
Running hello.elf ... [OK]
Running puzzle.elf ... [OK]
Running pi.elf ... [OK]
```
:::

4. Run `hello.elf`.
```
build/rv32emu build/hello.elf
```
::: info
The output message will be:
```
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
inferior exit code 0
```
:::

### Using GNU Toolchain
Follow the steps in [Lab2: Using GNU Toolchain](https://hackmd.io/@sysprog/SJAR5XMmi#Using-GNU-Toolchain).

## Question
The following question is taken from Assignment 1.
> 唐飴苹 [**Calculate the Hamming Distance using Counting Leading Zeros**](https://hackmd.io/@O6C2C3zQQBanDM55QRZ7DQ/Lab1_RV32I_assembly)
>
> The Hamming Distance between two integers is defined as the number of differing bits at the same position when comparing the binary representations of the integers. For example, the Hamming Distance between 1011101 and 1001001 is 2.
>
> In the assignment, I implement the program to calculate the Hamming Distance between the two given 64-bit unsigned integers.

:::spoiler The original **C implementation** of the question
```c
#include <stdio.h>
#include <stdint.h>

uint64_t test1_x0 = 0x0000000000100000;
uint64_t test1_x1 = 0x00000000000FFFFF;
uint64_t test2_x0 = 0x0000000000000001;
uint64_t test2_x1 = 0x7FFFFFFFFFFFFFFE;
uint64_t test3_x0 = 0x000000028370228F;
uint64_t test3_x1 = 0x000000028370228F;

uint16_t count_leading_zeros(uint64_t x){
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x |= (x >> 32);

    /* count ones (population count) */
    x -= ((x >> 1) & 0x5555555555555555);
    x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
    x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);
    x += (x >> 32);

    return (64 - (x & 0x7f));
}

int HammingDistance(uint64_t x0, uint64_t x1){
    int Hdist = 0;
    int16_t max_digit = 64 - (int16_t)count_leading_zeros((x0 > x1)?
                        x0 : x1);

    while(max_digit > 0){
        uint64_t c1 = x0 & 1;
        uint64_t c2 = x1 & 1;
        if(c1 != c2) Hdist += 1;
        x0 = x0 >> 1;
        x1 = x1 >> 1;
        max_digit -= 1;
    }
    return Hdist;
}

int main(){
    printf("Hamming Distance = %d\n", HammingDistance(test1_x0, test1_x1));
    printf("Hamming Distance = %d\n", HammingDistance(test2_x0, test2_x1));
    printf("Hamming Distance = %d\n", HammingDistance(test3_x0, test3_x1));
    return 0;
}
```
:::

:::spoiler The original **RISC-V Assembly implementation** of the question
```
.data
test_data_1: .dword 0x0000000000100000, 0x00000000000FFFFF # HD(1048576, 1048575) = 21
test_data_2: .dword 0x0000000000000001, 0x7FFFFFFFFFFFFFFE # HD(1, 9223372036854775806) = 63
test_data_3: .dword 0x000000028370228F, 0x000000028370228F # HD(10795098767, 10795098767) = 0
msg_string: .string "\nHamming Distance="

.text
main:
    addi sp, sp, -12

    # push pointers of test data onto the stack
    la t0, test_data_1
    sw t0, 0(sp)
    la t0, test_data_2
    sw t0, 4(sp)
    la t0, test_data_3
    sw t0, 8(sp)

    # initialize main_loop
    addi s0, zero, 3     # s0 : number of test case
    addi s1, zero, 0     # s1 : test case counter
    addi s2, sp, 0       # s2 : points to test_data_1

main_loop:
    la a0, msg_string
    li a7, 4             # print string
    ecall

    lw a0, 0(s2)         # a0 : pointer to the first data in test_data_1
    addi a1, a0, 8       # a1 : pointer to the second data in test_data_1
    jal ra, hd_func

    # print the result #
    li a7, 1             # print integer
    ecall                # print result of hd_cal (which is in a0)

    addi s2, s2, 4       # s2 : points to next test_data
    addi s1, s1, 1       # counter++
    bne s1, s0, main_loop

    addi sp, sp, 12
    li a7, 10
    ecall

# hamming distance function
hd_func:
    addi sp, sp, -36
    sw ra, 0(sp)
    sw s0, 4(sp)         # address of x0
    sw s1, 8(sp)         # address of x1
    sw s2, 12(sp)        # digit of x0
    sw s3, 16(sp)        # digit of x1
    sw s4, 20(sp)        # lower part of x0
    sw s5, 24(sp)        # higher part of x0
    sw s6, 28(sp)        # lower part of x1
    sw s7, 32(sp)        # higher part of x1

    # get address of x0 and x1
    mv s0, a0            # s0 : address of x0
    mv s1, a1            # s1 : address of x1

    # get x0_digit
    lw a0, 0(s0)         # a0 : lower part of x0
    lw a1, 4(s0)         # a1 : higher part of x0
    jal ra clz
    li s2, 64
    sub s2, s2, a0       # s2 : x0_digit (return value saved in a0)

    # get x1_digit
    lw a0, 0(s1)         # a0 : lower part of x1
    lw a1, 4(s1)         # a1 : higher part of x1
    jal ra clz
    li s3, 64
    sub s3, s3, a0       # s3 : x1_digit (return value saved in a0)

    # get x0(s5 s4) and x1(s7 s6)
    lw s4, 0(s0)
    lw s5, 4(s0)
    lw s6, 0(s1)
    lw s7, 4(s1)

    # compare with two digit
    slt t0, s2, s3
    bne t0, zero, x1_larger
    mv s3, zero          # s3: hd counter
    bgt s2, zero, hd_cal_loop
    # when digit is 0
    mv a0, s2            # save max_digit to a0
    j hd_func_end

x1_larger:
    mv s2, s3            # s2 : max_digit
    mv s3, zero          # s3: hd counter
    bgt s2, zero, hd_cal_loop
    # when digit is 0
    mv a0, s2            # save max_digit to a0
    j hd_func_end

hd_func_end:
    lw ra, 0(sp)
    lw s0, 4(sp)
    lw s1, 8(sp)
    lw s2, 12(sp)
    lw s3, 16(sp)
    lw s4, 20(sp)
    lw s5, 24(sp)
    lw s6, 28(sp)
    lw s7, 32(sp)
    addi sp, sp, 36
    ret

# hamming distance calculation (result save in a0, a1)
hd_cal_loop:
    # when the current digit larger than 32
    addi t2, zero, 32
    bgt s2, t2, hd_getLSB_upper

    # hd_getLSB_lower : and with 1
    li t3, 0x00000001
    and t4, s4, t3
    and t5, s6, t3
    j hd_cal_shift

hd_getLSB_upper:
    # and with 1
    li t3, 0x00000001
    and t4, s5, t3
    and t5, s7, t3

hd_cal_shift:
    # (s5 s4) = x >> 1
    srli t0, s4, 1
    slli t1, s5, 31
    or s4, t0, t1        # s4 >> 1
    srli s5, s5, 1       # s5 >> 1

    # (s7 s6) = x >> 1
    srli t0, s6, 1
    slli t1, s7, 31
    or s6, t0, t1        # s6 >> 1
    srli s7, s7, 1       # s7 >> 1

    beq t4, t5, hd_check_loop
    addi s3, s3, 1

hd_check_loop:
    addi s2, s2, -1
    bne s2, zero, hd_cal_loop
    mv a0, s3            # save return value to a0
    j hd_func_end

# count leading zeros
clz:
    addi sp, sp, -4
    sw ra, 0(sp)
    beq a1, zero, clz_lower_set_one

clz_upper_set_one:
    srli t1, a1, 1
    or a1, a1, t1
    srli t1, a1, 2
    or a1, a1, t1
    srli t1, a1, 4
    or a1, a1, t1
    srli t1, a1, 8
    or a1, a1, t1
    srli t1, a1, 16
    or a1, a1, t1
    li a0, 0xffffffff
    j clz_count_ones

clz_lower_set_one:
    srli t0, a0, 1
    or a0, a0, t0
    srli t0, a0, 2
    or a0, a0, t0
    srli t0, a0, 4
    or a0, a0, t0
    srli t0, a0, 8
    or a0, a0, t0
    srli t0, a0, 16
    or a0, a0, t0

clz_count_ones:
    # x = (a1 a0)
    # x -= ((x >> 1) & 0x5555555555555555); #
    srli t0, a0, 1
    slli t1, a1, 31
    or t0, t0, t1        # t0 >> 1
    srli t1, a1, 1       # t1 >> 1
    li t2, 0x55555555
    and t0, t0, t2
    and t1, t1, t2
    sltu t3, a0, t0      # t3 : borrow bit
    sub a0, a0, t0
    sub a1, a1, t1
    sub a1, a1, t3

    # x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); #
    srli t0, a0, 2
    slli t1, a1, 30
    or t0, t0, t1        # t0 >> 2
    srli t1, a1, 2       # t1 >> 2
    li t2, 0x33333333
    and t0, t0, t2
    and t1, t1, t2
    and t4, a0, t2
    and t5, a1, t2
    # (a1 a0) = (t1 t0) + (t5 t4)
    add a0, t0, t4
    sltu t3, a0, t0      # t3 : carry bit
    add a1, t1, t5
    add a1, a1, t3

    # x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; #
    srli t0, a0, 4
    slli t1, a1, 28
    or t0, t0, t1        # t0 >> 4
    srli t1, a1, 4       # t1 >> 4
    add t0, t0, a0
    sltu t3, t0, a0      # t3 : carry bit
    add t1, t1, a1
    add t1, t1, t3
    li t2, 0x0f0f0f0f
    and a0, t0, t2
    and a1, t1, t2

    # x += (x >> 8); #
    srli t0, a0, 8
    slli t1, a1, 24
    or t0, t0, t1        # t0 >> 8
    srli t1, a1, 8       # t1 >> 8
    add a0, a0, t0
    sltu t3, a0, t0      # t3 : carry bit
    add a1, a1, t1
    add a1, a1, t3       # (a1 a0) += (t1 t0)

    # x += (x >> 16); #
    srli t0, a0, 16
    slli t1, a1, 16
    or t0, t0, t1        # t0 >> 16
    srli t1, a1, 16      # t1 >> 16
    add a0, a0, t0
    sltu t3, a0, t0      # t3 : carry bit
    add a1, a1, t1
    add a1, a1, t3       # (a1 a0) += (t1 t0)

    # x += (x >> 32); #
    # (t1 t0) = x >> 32
    mv t0, a1
    mv t1, zero
    add a0, a0, t0
    sltu t3, a0, t0      # t3 : carry bit
    add a1, a1, t1
    add a1, a1, t3       # (a1 a0) += (t1 t0)

    # return (64 - (x & 0x7f));
    # a0 = (x & 0x7f)
    andi a0, a0, 0x7f
    li t0, 64
    sub a0, t0, a0       # a0 = (64 - (x & 0x7f))

    lw ra, 0(sp)
    addi sp, sp, 4
    ret
```
:::

## Optimization
### My Modified C Code
<!-- Here is my source code in [GitHub](). -->

:::spoiler Rewrite hamming distance function (Version 1)
```c
int HammingDistance(uint64_t x0, uint64_t x1)
{
    uint64_t xorVal = x0 ^ x1;
    int Hdist = 0;
    int16_t max_digit = 64 - (int16_t)count_leading_zeros((x0 > x1)?
                        x0 : x1);

    while (max_digit > 0) {
        if(xorVal % 2 == 1) {
            Hdist += 1;
        }
        xorVal >>= 1;
        max_digit -= 1;
    }
    return Hdist;
}
```
First, use ``xor`` to find the bits that differ between ``x0`` and ``x1``. Then, use ``%`` to check whether the rightmost bit is 1.
:::

:::spoiler Rewrite hamming distance function (Version 2)
```c
int HammingDistance(uint64_t x0, uint64_t x1)
{
    uint64_t xorVal = x0 ^ x1;
    int Hdist = 0;
    int16_t max_digit = 64 - (int16_t)count_leading_zeros((x0 > x1)? x0 : x1);

    while (max_digit > 0) {
        /* parentheses are required: == binds tighter than & */
        if((xorVal & 1) == 1) {
            Hdist += 1;
        }
        xorVal >>= 1;
        max_digit -= 1;
    }
    return Hdist;
}
```
I changed the condition in the `if` statement from ``%`` to ``&``, but this does not reduce the cycle count, presumably because the compiler already lowers an unsigned ``% 2`` to a bitwise AND.
:::

### My Hand Written RISC-V Assembly Code
<!-- Here is my source code in [GitHub](). -->

:::spoiler Based on my modified C code
```
.data
test_data_1: .dword 0x0000000000100000, 0x00000000000FFFFF # HD(1048576, 1048575) = 21
test_data_2: .dword 0x0000000000000001, 0x7FFFFFFFFFFFFFFE # HD(1, 9223372036854775806) = 63
test_data_3: .dword 0x000000028370228F, 0x000000028370228F # HD(10795098767, 10795098767) = 0
msg_string: .string "\nHamming Distance="

.text
main:
    addi sp, sp, -12

    # push pointers of test data onto the stack
    la t0, test_data_1
    sw t0, 0(sp)
    la t0, test_data_2
    sw t0, 4(sp)
    la t0, test_data_3
    sw t0, 8(sp)

    # initialize main_loop
    addi s0, zero, 3     # s0 : number of test case
    addi s1, zero, 0     # s1 : test case counter
    addi s2, sp, 0       # s2 : points to test_data_1

main_loop:
    la a0, msg_string
    li a7, 4             # print string
    ecall

    lw a0, 0(s2)         # a0 : pointer to the first data in test_data_1
    addi a1, a0, 8       # a1 : pointer to the second data in test_data_1
    jal ra, hd_func

    # print the result #
    li a7, 1             # print integer
    ecall                # print result of hd_cal (which is in a0)

    addi s2, s2, 4       # s2 : points to next test_data
    addi s1, s1, 1       # counter++
    bne s1, s0, main_loop

    addi sp, sp, 12
    li a7, 10
    ecall

# hamming distance function
hd_func:
    addi sp, sp, -40     # save every callee-saved register used below
    sw ra, 0(sp)
    sw s0, 4(sp)         # address of x0
    sw s1, 8(sp)         # address of x1
    sw s2, 12(sp)        # constant 1 (LSB mask)
    sw s3, 16(sp)        # max_digit (loop counter)
    sw s4, 20(sp)        # lower part of x0
    sw s5, 24(sp)        # higher part of x0
    sw s6, 28(sp)        # lower part of xorVal
    sw s7, 32(sp)        # higher part of xorVal
    sw s8, 36(sp)        # hd counter

    # get address of x0 and x1
    mv s0, a0            # s0 : address of x0
    mv s1, a1            # s1 : address of x1

    lw a0, 0(s0)         # a0 : lower part of x0
    lw a1, 4(s0)         # a1 : higher part of x0
    mv s4, a0            # s4: lower part of x0
    mv s5, a1            # s5: higher part of x0

    lw a0, 0(s1)         # a0 : lower part of x1
    lw a1, 4(s1)         # a1 : higher part of x1

    xor s6, s4, a0       # s6: lower part of xorVal
    xor s7, s5, a1       # s7: higher part of xorVal

# pass the larger of x0 and x1 to clz (unsigned 64-bit comparison)
cmp:
    bltu s5, a1, jmpClz  # x0.hi < x1.hi: x1 is larger and already in (a1 a0)
    bltu a1, s5, use_x0  # x0.hi > x1.hi: x0 is larger
    bltu s4, a0, jmpClz  # equal higher parts: compare the lower parts
use_x0:
    mv a0, s4
    mv a1, s5

jmpClz:
    jal ra clz
    li s3, 64            # s3: 64
    sub s3, s3, a0       # s3: max_digit = 64 - clz (return value saved in a0)
    addi s2, x0, 1       # s2: 1
    mv s8, zero          # s8: hd counter
    bne s3, zero, hd_cal_loop
    mv a0, zero          # max_digit is 0, so the distance is 0; fall through

hd_func_end:
    lw ra, 0(sp)
    lw s0, 4(sp)
    lw s1, 8(sp)
    lw s2, 12(sp)
    lw s3, 16(sp)
    lw s4, 20(sp)
    lw s5, 24(sp)
    lw s6, 28(sp)
    lw s7, 32(sp)
    lw s8, 36(sp)
    addi sp, sp, 40
    ret

# hamming distance calculation (result saved in a0)
hd_cal_loop:
    and t0, s6, s2
    bne t0, s2, hd_cal_shift
    addi s8, s8, 1       # Hdist += 1

hd_cal_shift:
    # (s7 s6) = x >> 1
    srli t0, s6, 1
    slli t1, s7, 31
    or s6, t0, t1        # s6 >> 1
    srli s7, s7, 1       # s7 >> 1

hd_check_loop:
    addi s3, s3, -1
    bne s3, zero, hd_cal_loop
    mv a0, s8            # save return value to a0
    j hd_func_end

# count leading zeros
clz:
    addi sp, sp, -4
    sw ra, 0(sp)
    beq a1, zero, clz_lower_set_one

clz_upper_set_one:
    srli t1, a1, 1
    or a1, a1, t1
    srli t1, a1, 2
    or a1, a1, t1
    srli t1, a1, 4
    or a1, a1, t1
    srli t1, a1, 8
    or a1, a1, t1
    srli t1, a1, 16
    or a1, a1, t1
    li a0, 0xffffffff
    j clz_count_ones

clz_lower_set_one:
    srli t0, a0, 1
    or a0, a0, t0
    srli t0, a0, 2
    or a0, a0, t0
    srli t0, a0, 4
    or a0, a0, t0
    srli t0, a0, 8
    or a0, a0, t0
    srli t0, a0, 16
    or a0, a0, t0

clz_count_ones:
    # x = (a1 a0)
    # x -= ((x >> 1) & 0x5555555555555555); #
    srli t0, a0, 1
    slli t1, a1, 31
    or t0, t0, t1        # t0 >> 1
    srli t1, a1, 1       # t1 >> 1
    li t2, 0x55555555
    and t0, t0, t2
    and t1, t1, t2
    sltu t3, a0, t0      # t3 : borrow bit
    sub a0, a0, t0
    sub a1, a1, t1
    sub a1, a1, t3

    # x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); #
    srli t0, a0, 2
    slli t1, a1, 30
    or t0, t0, t1        # t0 >> 2
    srli t1, a1, 2       # t1 >> 2
    li t2, 0x33333333
    and t0, t0, t2
    and t1, t1, t2
    and t4, a0, t2
    and t5, a1, t2
    # (a1 a0) = (t1 t0) + (t5 t4)
    add a0, t0, t4
    sltu t3, a0, t0      # t3 : carry bit
    add a1, t1, t5
    add a1, a1, t3

    # x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; #
    srli t0, a0, 4
    slli t1, a1, 28
    or t0, t0, t1        # t0 >> 4
    srli t1, a1, 4       # t1 >> 4
    add t0, t0, a0
    sltu t3, t0, a0      # t3 : carry bit
    add t1, t1, a1
    add t1, t1, t3
    li t2, 0x0f0f0f0f
    and a0, t0, t2
    and a1, t1, t2

    # x += (x >> 8); #
    srli t0, a0, 8
    slli t1, a1, 24
    or t0, t0, t1        # t0 >> 8
    srli t1, a1, 8       # t1 >> 8
    add a0, a0, t0
    sltu t3, a0, t0      # t3 : carry bit
    add a1, a1, t1
    add a1, a1, t3       # (a1 a0) += (t1 t0)

    # x += (x >> 16); #
    srli t0, a0, 16
    slli t1, a1, 16
    or t0, t0, t1        # t0 >> 16
    srli t1, a1, 16      # t1 >> 16
    add a0, a0, t0
    sltu t3, a0, t0      # t3 : carry bit
    add a1, a1, t1
    add a1, a1, t3       # (a1 a0) += (t1 t0)

    # x += (x >> 32); #
    # (t1 t0) = x >> 32
    mv t0, a1
    mv t1, zero
    add a0, a0, t0
    sltu t3, a0, t0      # t3 : carry bit
    add a1, a1, t1
    add a1, a1, t3       # (a1 a0) += (t1 t0)

    # return (64 - (x & 0x7f));
    # a0 = (x & 0x7f)
    andi a0, a0, 0x7f
    li t0, 64
    sub a0, t0, a0       # a0 = (64 - (x & 0x7f))

    lw ra, 0(sp)
    addi sp, sp, 4
    ret
```
:::

## Analysis
### 1. [ticks.c](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c)
To measure the performance, I add the following code before the original C main function.
```c
#include <inttypes.h>

typedef uint64_t ticks;
static inline ticks getticks(void)
{
    uint64_t result;
    uint32_t l, h, h2;
    asm volatile(
        "rdcycleh %0\n"
        "rdcycle %1\n"
        "rdcycleh %2\n"
        "sub %0, %0, %2\n"
        "seqz %0, %0\n"
        "sub %0, zero, %0\n"
        "and %1, %1, %0\n"
        : "=r"(h), "=r"(l), "=r"(h2));
    /* After the asm block h holds an all-ones or zero mask, not the high
     * word, so the high half of the result is taken from h2. */
    result = (((uint64_t) h2) << 32) | ((uint64_t) l);
    return result;
}
```
Then add the following code to the original C main function.
```c
int main(){
    ticks t0 = getticks();
    printf("Hamming Distance = %d\n", HammingDistance(test1_x0, test1_x1));
    printf("Hamming Distance = %d\n", HammingDistance(test2_x0, test2_x1));
    printf("Hamming Distance = %d\n", HammingDistance(test3_x0, test3_x1));
    ticks t1 = getticks();
    printf("cycle counts: %" PRIu64 "\n", t1 - t0);
    return 0;
}
```
I call the code above *``getticks_original``*.

1. Compile *``getticks_original``* to RV32I assembly code.
```
riscv-none-elf-gcc -S -march=rv32i -mabi=ilp32 hammingDist.c -O0 -o hammingDist_O0.s
```
::: success
I also test different optimization options by changing **``-O0``** to **``-O2``**, **``-Ofast``** and **``-Os``**.

:mag: Please check [here](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) for other optimization options.
:::

2. Assemble and link the assembly code to create the **executable and linkable format (ELF)** file.
```
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 hammingDist_O0.s -O0 -o hammingDist_O0.elf
```
::: success
Steps **1** and **2** can be combined like this:
```
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 hammingDist.c -O0 -o hammingDist_O0.elf
```
:::

3. Run the ELF file with [rv32emu](https://github.com/sysprog21/rv32emu).
```
cd ../..
./build/rv32emu ./tests/asm-hello/hammingDist_O0.elf
```
or
```
../../build/rv32emu hammingDist_O0.elf
```
::: info
The output will be:
```
Hamming Distance = 21
Hamming Distance = 63
Hamming Distance = 0
cycle counts: 12491
```
:::

#### Comparison Table
| Optimization Option                | Cycle Counts |
|:---------------------------------- |:------------:|
| -O0                                | 12491        |
| -O2                                | 8713         |
| -Ofast                             | 8642         |
| -Os                                | 8678         |
| My modified C with -O0             | **10383**    |
| My modified C with -Ofast          | **8051**     |
<!--
| My handwritten RISC-V with -O0     |              |
| My handwritten RISC-V with -Ofast  |              |
-->

### 2. [perfcounter](https://github.com/sysprog21/rv32emu/tree/master/tests/perfcounter)
To measure the performance, I add the following code before the original C main function.
```c
#include <string.h>

extern uint64_t get_cycles();
```
Then add the following code to the original C main function.
```c
int main(){
    uint64_t oldcount = get_cycles();
    printf("Hamming Distance = %d\n", HammingDistance(test1_x0, test1_x1));
    printf("Hamming Distance = %d\n", HammingDistance(test2_x0, test2_x1));
    printf("Hamming Distance = %d\n", HammingDistance(test3_x0, test3_x1));
    uint64_t cyclecount = get_cycles() - oldcount;
    printf("cycle counts: %u\n", (unsigned int) cyclecount);
    return 0;
}
```
I call the code above *``getcycles_original``*.

1. Write my own **Makefile**.
```makefile
.PHONY: clean

include ../../mk/toolchain.mk

CC = riscv-none-elf-gcc
CFLAGS = -march=rv32i_zicsr_zifencei -mabi=ilp32 -O0 -Wall

OBJS = \
    getcycles.o \
    hammingDist.o
BIN = hammingDist_O0.elf

%.o: %.S
	$(CC) $(CFLAGS) -c -o $@ $<

%.o: %.c
	$(CC) $(CFLAGS) -c -o $@ $<

all: $(BIN)

$(BIN): $(OBJS)
	$(CC) -o $@ $^

clean:
	$(RM) $(BIN) $(OBJS)
```
::: success
I also test different optimization options by changing **``-O0``** to **``-O2``**, **``-Ofast``** and **``-Os``**.
:::

2. Put *``getcycles_original``*, *``getcycles.S``* and *``Makefile``* into the same directory.
![](https://hackmd.io/_uploads/HygRlIYfp.png)

3. Compile the code.
```
make
```
The output will be:
![](https://hackmd.io/_uploads/HyIVBUFGp.png)

::: success
Make sure that no previously built ELF file or object file is left in the same directory.
```
make clean
```
:::

:::danger
Avoid using screenshots that solely contain plain text. Here are the reasons why:
1. Text-based content is more efficiently searchable than having to browse through images iteratively.
2. The rendering engine of HackMD can consistently generate well-structured layouts with annotated text instead of relying on arbitrary pictures.
3. It provides a more accessible and user-friendly experience for individuals with visual impairments.

:notes: jserv
:::

4. Run the ELF file with [rv32emu](https://github.com/sysprog21/rv32emu).
```
cd ../..
./build/rv32emu ./tests/asm-hello/hammingDist_O0.elf
```
or
```
../../build/rv32emu hammingDist_O0.elf
```
::: info
The output will be:
```
Hamming Distance = 21
Hamming Distance = 63
Hamming Distance = 0
cycle counts: 11190
```
:::

#### Comparison Table
| Optimization Option                | Cycle Counts |
|:---------------------------------- |:------------:|
| -O0                                | 11190        |
| -O2                                | 7435         |
| -Ofast                             | 7363         |
| -Os                                | 7400         |
| My modified C with -O0             | **9082**     |
| My modified C with -Ofast          | **6772**     |
<!--
| My handwritten RISC-V with -O0     |              |
| My handwritten RISC-V with -Ofast  |              |
-->

### 3. RDCYCLE/RDCYCLEH
To run assembly code with [rv32emu](https://github.com/sysprog21/rv32emu), I have to make some modifications to my handwritten implementation.

:::success
:mag: Please check [syscall.md](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md) for more details about [rv32emu](https://github.com/sysprog21/rv32emu).
:::

1. Add file ``myHammingDist.ld``.
```
OUTPUT_ARCH("riscv")
ENTRY(_start)

SECTIONS {
    . = 0x0;
}
```
2. Write my own **Makefile**.
```
.PHONY: clean

include ../../mk/toolchain.mk

ASFLAGS = -march=rv32i_zicsr -mabi=ilp32
LDFLAGS = --oformat=elf32-littleriscv
BIN = myHammingDist.elf

%.o: %.S
	$(CROSS_COMPILE)as -R $(ASFLAGS) -c -o $@ $<

all: $(BIN)

myHammingDist.elf: myHammingDist.o
	$(CROSS_COMPILE)ld -o $@ -T myHammingDist.ld $(LDFLAGS) $<

clean:
	$(RM) $(BIN) myHammingDist.o
```
3. Replace label ``main`` with ``_start``.
```
.text
_start:
```
Add the following code at the top of the file.
```
.global _start

.set SYSEXIT, 93
.set SYSWRITE, 64
```
4. Modify ``jal ra clz`` to ``jal ra, clz``.
:::info
Otherwise, the output will be:
```
hammingDist.S: Assembler messages:
hammingDist.S:75: Error: illegal operands `jal ra clz'
hammingDist.S:82: Error: illegal operands `jal ra clz'
```
:::

5. Add a ``print_ascii`` block to print the results.
6. Add ``get_cycles_init`` and ``get_cycles_end`` to count the cycles.
7. Change the system calls to the [rv32emu](https://github.com/sysprog21/rv32emu) version.

<!-- To run assembly code with [rv32emu](https://github.com/sysprog21/rv32emu), I have to do some modification on original implementation. -->

<!--
### 3. [Ripes](https://github.com/mortbopet/Ripes)
I fail to translate the code that can be executed flawlessly with [rv32emu](https://github.com/sysprog21/rv32emu), since [rv32emu](https://github.com/sysprog21/rv32emu) seems to have the problem to print integer.
So I compare the cycle counts output by Ripes.

:::warning
Don't do that. Get the things right.
:notes: jserv
:::

#### Original
![](https://hackmd.io/_uploads/rJEjm70fT.png)

#### Optimized
![](https://hackmd.io/_uploads/BJb7zIRfa.png)
-->

:::warning
You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution.
:notes: jserv
:::
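
On RV32, RDCYCLE and RDCYCLEH each return only 32 bits, so a 64-bit cycle count has to be assembled from two reads, and the lower half may wrap between them. The sketch below is a host-side model of the masking idea used by `getticks` in section 1: the three arguments stand for the values that `rdcycleh`, `rdcycle`, `rdcycleh` would return in that order. The function name `combine_cycle_halves` is hypothetical, and this is only an illustration under those assumptions, not the code used for the measurements above.

```c
#include <stdint.h>

/*
 * Host-side model (hypothetical helper) of combining the two halves of the
 * cycle counter on RV32. hi_before, lo and hi_after stand for the results of
 * RDCYCLEH, RDCYCLE and RDCYCLEH executed back to back.
 */
uint64_t combine_cycle_halves(uint32_t hi_before, uint32_t lo, uint32_t hi_after)
{
    /* All ones when the high half did not change between the two reads,
     * zero when the low half wrapped in between. */
    uint32_t mask = (hi_before == hi_after) ? UINT32_MAX : 0u;

    /* After a wrap it is unknown which high half lo belongs to, so lo is
     * cleared and the newer high half is paired with a low half of zero. */
    return ((uint64_t) hi_after << 32) | (uint64_t) (lo & mask);
}
```

In the handwritten assembly, the same three values would come from the three counter reads inside ``get_cycles_init`` and ``get_cycles_end``. Another common option is a retry loop that repeats the three reads whenever the two high halves differ, which avoids discarding the low half.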