Assignment2: GNU Toolchain

# Rewite code contributed by <[scones525](https://github.com/scones525/Computer-Architecture-2023Fall_NCKU/tree/main/hw2)> [Calculate the Hamming Distance using Counting Leading Zeros](https://github.com/yptang5488/Computer-Architecture/tree/master) by yptang5488. This is a program that can calculate the Hamming distance between two numbers. We can notice that he uses the "clz" function to reduce the computational workload. ## C program ### yptang5488 version ```c int HammingDistance(uint64_t x0, uint64_t x1){ int Hdist = 0; int16_t max_digit = 64 - (int16_t)count_leading_zeros((x0 > x1)? x0 : x1); while(max_digit > 0){ uint64_t c1 = x0 & 1; uint64_t c2 = x1 & 1; if(c1 != c2) Hdist += 1; x0 = x0 >> 1; x1 = x1 >> 1; max_digit -= 1; } return Hdist; } ``` 1. max_digit is equal to the most significant bit of the larger of the two numbers. 2. The "while" loop is used to check whether there are any differences in bits between the two of them. ### My first version. ```c int HammingDistance(uint64_t x0, uint64_t x1){ int Hdist = 0; int16_t max_digit = 64 - (int16_t)count_leading_zeros((x0 > x1)? x0 : x1); uint64_t c1 = x0 ^ x1; while(max_digit > 0){ Hdist += c1 & 1; c1 = c1 >> 1; max_digit -= 1; } return Hdist; } ``` He has demonstrated a method for determining whether two bits are the same. That is using c1, c2 to test they are same or not. By computing the XOR between two numbers, you can identify which bits are different. Utilizing this property can significantly reduce the need for storage space and computational resources. ## Assembly I encountered problems when putting these code from ripes to rv32emu and running the program. 1.Error message: ``` riscv-none-elf-ld: warning: cannot find entry symbol _start; defaulting to 00000000 ``` In Ripes, main is a starting entry but rv32emu is not. So, the solution is: ``` # add this on top of the file .global _start # modify main as this _start: ``` 2.System call If we want to print a string in Ripes, we can use: ``` .data str_print .string "Hello." .text main: la a0, str_print li a7, 4 ecall ``` However there are some different in rv32emu: ``` .data str_print: .string "Hello" .set str_size, .-str_print .set SYSEXIT, 93 .set SYSWRITE, 64 .text _start: li a7, SYSWRITE # "write" syscall li a0, 1 # 1 = standard output (stdout) la a1, msg_string # load address of hello string li a2, msg_size # length of hello string ecall # invoke syscall to print the string ``` After there modification, we can run his code in rv32emu correctly. ### His Assembly :::spoiler ```c .data test_data_1: .dword 0x0000000000100000, 0x00000000000FFFFF # HD(1048576, 1048575) = 21 test_data_2: .dword 0x0000000000000001, 0x7FFFFFFFFFFFFFFE # HD(1, 9223372036854775806) = 63 test_data_3: .dword 0x000000028370228F, 0x000000028370228F # HD(10795098767, 10795098767) = 0 msg_string: .string "\nHamming Distance=" .text main: addi sp, sp, -12 # push pointers of test data onto the stack la t0, test_data_1 sw t0, 0(sp) la t0, test_data_2 sw t0, 4(sp) la t0, test_data_3 sw t0, 8(sp) # initialize main_loop addi s0, zero, 3 # s0 : number of test case addi s1, zero, 0 # s1 : test case counter addi s2, sp, 0 # s2 : points to test_data_1 main_loop: la a0, msg_string li a7, 4 # print string ecall lw a0, 0(s2) # a0 : pointer to the first data in test_data_1 addi a1, a0, 8 # a1 : pointer to the second data in test_data_1 jal ra, hd_func # print the result # li a7, 1 # print integer ecall # print result of hd_cal (which is in a0) addi s2, s2, 4 # s2 : points to next test_data addi s1, s1, 1 # counter++ bne s1, s0, main_loop addi sp, sp, 12 li a7, 10 ecall # hamming distance function hd_func: addi sp, sp, -36 sw ra, 0(sp) sw s0, 4(sp) # address of x0 sw s1, 8(sp) # address of x1 sw s2, 12(sp) # digit of x0 sw s3, 16(sp) # digit of x1 sw s4, 20(sp) # lower part of x0 sw s5, 24(sp) # higher part of x0 sw s6, 28(sp) # lower part of x1 sw s7, 32(sp) # higher part of x1 # get address of x0 and x1 mv s0, a0 # s0 : address of x0 mv s1, a1 # s1 : address of x1 # get x0_digit lw a0, 0(s0) # a0 : lower part of x0 lw a1, 4(s0) # a1 : higher part of x0 jal ra clz li s2, 64 sub s2, s2, a0 # s2 : x0_digit (return value saved in a0) # get x1_digit lw a0, 0(s1) # a0 : lower part of x1 lw a1, 4(s1) # a1 : higher part of x1 jal ra clz li s3, 64 sub s3, s3, a0 # s3 : x1_digit (return value saved in a0) # get x0(s5 s4) and x1(s7 s6) lw s4, 0(s0) lw s5, 4(s0) lw s6, 0(s1) lw s7, 4(s1) # compare with two digit slt t0, s2, s3 bne t0, zero, x1_larger mv s3, zero # s3: hd counter bgt s2, zero, hd_cal_loop # when digit is 0 mv a0, s2 # save max_digit to a0 j hd_func_end x1_larger: mv s2, s3 # s2 : max_digit mv s3, zero # s3: hd counter bgt s2, zero, hd_cal_loop # when digit is 0 mv a0, s2 # save max_digit to a0 j hd_func_end hd_func_end: lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) lw s4, 20(sp) lw s5, 24(sp) lw s6, 28(sp) lw s7, 32(sp) addi sp, sp, 36 ret # hamming distance calculation (result save in a0, a1) hd_cal_loop: # when the current digit larger than 32 addi t2, zero, 32 bgt s2, t2, hd_getLSB_upper # hd_getLSB_lower : and with 1 li t3, 0x00000001 and t4, s4, t3 and t5, s6, t3 j hd_cal_shift hd_getLSB_upper: # and with 1 li t3, 0x00000001 and t4, s5, t3 and t5, s7, t3 hd_cal_shift: # (s5 s4) = x >> 1 srli t0, s4, 1 slli t1, s5, 31 or s4, t0, t1 # s4 >> 1 srli s5, s5, 1 # s5 >> 1 # (s7 s6) = x >> 1 srli t0, s6, 1 slli t1, s7, 31 or s6, t0, t1 # s6 >> 1 srli s7, s7, 1 # s7 >> 1 beq t4, t5, hd_check_loop addi s3, s3, 1 hd_check_loop: addi s2, s2, -1 bne s2, zero, hd_cal_loop mv a0, s3 # save return value to a0 j hd_func_end # count leading zeros clz: addi sp, sp, -4 sw ra, 0(sp) beq a1, zero, clz_lower_set_one clz_upper_set_one: srli t1, a1, 1 or a1, a1, t1 srli t1, a1, 2 or a1, a1, t1 srli t1, a1, 4 or a1, a1, t1 srli t1, a1, 8 or a1, a1, t1 srli t1, a1, 16 or a1, a1, t1 li a0, 0xffffffff j clz_count_ones clz_lower_set_one: srli t0, a0, 1 or a0, a0, t0 srli t0, a0, 2 or a0, a0, t0 srli t0, a0, 4 or a0, a0, t0 srli t0, a0, 8 or a0, a0, t0 srli t0, a0, 16 or a0, a0, t0 clz_count_ones: # x = (a1 a0) # x -= ((x >> 1) & 0x5555555555555555); # srli t0, a0, 1 slli t1, a1, 31 or t0, t0, t1 # t0 >> 1 srli t1, a1, 1 # t1 >> 1 li t2, 0x55555555 and t0, t0, t2 and t1, t1, t2 sltu t3, a0, t0 # t3 : borrow bit sub a0, a0, t0 sub a1, a1, t1 sub a1, a1, t3 # x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); # srli t0, a0, 2 slli t1, a1, 30 or t0, t0, t1 # t0 >> 2 srli t1, a1, 2 # t1 >> 2 li t2, 0x33333333 and t0, t0, t2 and t1, t1, t2 and t4, a0, t2 and t5, a1, t2 # (a1 a0) = (t1 t0) + (t5 t4) add a0, t0, t4 sltu t3, a0, t0 # t3 : carry bit add a1, t1, t5 add a1, a1, t3 # x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; # srli t0, a0, 4 slli t1, a1, 28 or t0, t0, t1 # t0 >> 4 srli t1, a1, 4 # t1 >> 4 add t0, t0, a0 sltu t3, t0, a0 # t3 : carry bit add t1, t1, a1 add t1, t1, t3 li t2, 0x0f0f0f0f and a0, t0, t2 and a1, t1, t2 # x += (x >> 8); # srli t0, a0, 8 slli t1, a1, 24 or t0, t0, t1 # t0 >> 8 srli t1, a1, 8 # t1 >> 8 add a0, a0, t0 sltu t3, a0, t0 # t3 : carry bit add a1, a1, t1 add a1, a1, t3 # (a1 a0) += (t1 t0) # x += (x >> 16); # srli t0, a0, 16 slli t1, a1, 16 or t0, t0, t1 # t0 >> 16 srli t1, a1, 16 # t1 >> 16 add a0, a0, t0 sltu t3, a0, t0 # t3 : carry bit add a1, a1, t1 add a1, a1, t3 # (a1 a0) += (t1 t0) # x += (x >> 32); # # (t1 t0) = x >> 32 mv t0, a1 mv t1, zero add a0, a0, t0 sltu t3, a0, t0 # t3 : carry bit add a1, a1, t1 add a1, a1, t3 # (a1 a0) += (t1 t0) # return (64 - (x & 0x7f)); # a0 = (x & 0x7f) andi a0, a0, 0x7f li t0, 64 sub a0, t0, a0 # a0 = (64 - (x & 0x7f)) lw ra, 0(sp) addi sp, sp, 4 ret ``` ::: <s> ![](https://hackmd.io/_uploads/SkazlUyGp.png) </s> :::warning You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution. :notes: jserv ::: In order to improve instruction ==rdcycle== in my homework, I insert some new code into my [hamming.c](https://github.com/scones525/Computer-Architecture-2023Fall_NCKU/blob/main/hw2/hamming.c) main function . like this... ```c uint32_t get_cycle() { uint32_t cycle_value; asm volatile ("rdcycle %0" : "=r"(cycle_value)); return cycle_value; } int main(){ uint32_t startcycle = get_cycle(); . . . uint32_t endcycle = get_cycle(); printf("RDCYCLE value: %u\n", endcycle - startcycle); . . . } ``` So we can learn RDCYCLE value in c code. And I use following instruction to check it: ``` build/rv32emu tests/hello-c/hamming_O0.elf ``` The output will look like... ``` Hamming Distance = 21 Hamming Distance = 63 Hamming Distance = 0 RDCYCLE value: 12468 inferior exit code 0 ``` You can check the cycle value in Comparision part. ### Modify to run on rv32emu The following is the code I have modified, which can be executed on rv32emu. :::spoiler ```c .org 0 .global _start /* newlib system calls */ .set SYSEXIT, 93 .set SYSWRITE, 64 .data test_data_1: .dword 0x0000000000100000, 0x00000000000FFFFF # HD(1048576, 1048575) = 21 .set arr_size1, .test_data_1 test_data_2: .dword 0x0000000000000001, 0x7FFFFFFFFFFFFFFE # HD(1, 9223372036854775806) = 63 .set arr_size1, .test_data_2 test_data_3: .dword 0x000000028370228F, 0x000000028370228F # HD(10795098767, 10795098767) = 0 .set arr_size1, .test_data_3 msg_string: .string "\nHamming Distance=" .set msg_size, .-msg_string .text _start: addi sp, sp, -12 # push pointers of test data onto the stack la t0, test_data_1 sw t0, 0(sp) la t0, test_data_2 sw t0, 4(sp) la t0, test_data_3 sw t0, 8(sp) # initialize main_loop addi s0, zero, 3 # s0 : number of test case addi s1, zero, 0 # s1 : test case counter addi s2, sp, 0 # s2 : points to test_data_1 main_loop: li a7, SYSWRITE # "write" syscall li a0, 1 # 1 = standard output (stdout) la a1, msg_string # load address of hello string li a2, msg_size # length of hello string ecall # invoke syscall to print the string lw a0, 0(s2) # a0 : pointer to the first data in test_data_1 addi a1, a0, 8 # a1 : pointer to the second data in test_data_1 jal ra, hd_func # print the result # addi t3, a0, 0 # Divide the number by 10 to get the tens place li t1, 10 # divisor li t2, 0 # quotient divide_loop: blt a0, t1, done_divide sub a0, a0, t1 addi t2, t2, 1 j divide_loop done_divide: # At this point, t2 contains the quotient (6 for 69) # Get the remainder: original number minus (t2 * 10) # Manually multiplying t2 by 10 using shift and add for rv32i slli t4, t2, 1 # t4 = t2 * 2 slli t5, t2, 3 # t5 = t2 * 8 add t4, t4, t5 # t4 = t2 * 10 sub a0, t3, t4 # a0 now has the remainder # Convert the tens place to ASCII and print addi t2, t2, 48 # convert to ASCII addi a0, a0, 48 addi sp, sp, -8 sw t2, 0(sp) sw a0, 4(sp) addi a1, sp, 0 li a7, SYSWRITE li a0, 1 li a2, 1 ecall addi a1, sp, 4 li a7, SYSWRITE li a0, 1 li a2, 1 ecall addi sp, sp, 8 addi sp, sp, 4 addi s2, s2, 4 # s2 : points to next test_data addi s1, s1, 1 # counter++ bne s1, s0, main_loop addi sp, sp, 12 end: end: li a7, SYSEXIT # "exit" syscall add a0, x0, 0 # Use 0 return code ecall # invoke syscall to terminate the program # hamming distance function hd_func: addi sp, sp, -36 sw ra, 0(sp) sw s0, 4(sp) # address of x0 sw s1, 8(sp) # address of x1 sw s2, 12(sp) # digit of x0 sw s3, 16(sp) # digit of x1 sw s4, 20(sp) # lower part of x0 sw s5, 24(sp) # higher part of x0 sw s6, 28(sp) # lower part of x1 sw s7, 32(sp) # higher part of x1 # get address of x0 and x1 mv s0, a0 # s0 : address of x0 mv s1, a1 # s1 : address of x1 # get x0_digit lw a0, 0(s0) # a0 : lower part of x0 lw a1, 4(s0) # a1 : higher part of x0 jal ra, clz li s2, 64 sub s2, s2, a0 # s2 : x0_digit (return value saved in a0) # get x1_digit lw a0, 0(s1) # a0 : lower part of x1 lw a1, 4(s1) # a1 : higher part of x1 jal ra, clz li s3, 64 sub s3, s3, a0 # s3 : x1_digit (return value saved in a0) # get x0(s5 s4) and x1(s7 s6) lw s4, 0(s0) lw s5, 4(s0) lw s6, 0(s1) lw s7, 4(s1) # compare with two digit slt t0, s2, s3 bne t0, zero, x1_larger mv s3, zero # s3: hd counter bgt s2, zero, hd_cal_loop # when digit is 0 mv a0, s2 # save max_digit to a0 j hd_func_end x1_larger: mv s2, s3 # s2 : max_digit mv s3, zero # s3: hd counter bgt s2, zero, hd_cal_loop # when digit is 0 mv a0, s2 # save max_digit to a0 j hd_func_end hd_func_end: lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) lw s2, 12(sp) lw s3, 16(sp) lw s4, 20(sp) lw s5, 24(sp) lw s6, 28(sp) lw s7, 32(sp) addi sp, sp, 36 ret # hamming distance calculation (result save in a0, a1) hd_cal_loop: # when the current digit larger than 32 addi t2, zero, 32 bgt s2, t2, hd_getLSB_upper # hd_getLSB_lower : and with 1 li t3, 0x00000001 and t4, s4, t3 and t5, s6, t3 j hd_cal_shift hd_getLSB_upper: # and with 1 li t3, 0x00000001 and t4, s5, t3 and t5, s7, t3 hd_cal_shift: # (s5 s4) = x >> 1 srli t0, s4, 1 slli t1, s5, 31 or s4, t0, t1 # s4 >> 1 srli s5, s5, 1 # s5 >> 1 # (s7 s6) = x >> 1 srli t0, s6, 1 slli t1, s7, 31 or s6, t0, t1 # s6 >> 1 srli s7, s7, 1 # s7 >> 1 beq t4, t5, hd_check_loop addi s3, s3, 1 hd_check_loop: addi s2, s2, -1 bne s2, zero, hd_cal_loop mv a0, s3 # save return value to a0 j hd_func_end # count leading zeros clz: addi sp, sp, -4 sw ra, 0(sp) beq a1, zero, clz_lower_set_one clz_upper_set_one: srli t1, a1, 1 or a1, a1, t1 srli t1, a1, 2 or a1, a1, t1 srli t1, a1, 4 or a1, a1, t1 srli t1, a1, 8 or a1, a1, t1 srli t1, a1, 16 or a1, a1, t1 li a0, 0xffffffff j clz_count_ones clz_lower_set_one: srli t0, a0, 1 or a0, a0, t0 srli t0, a0, 2 or a0, a0, t0 srli t0, a0, 4 or a0, a0, t0 srli t0, a0, 8 or a0, a0, t0 srli t0, a0, 16 or a0, a0, t0 clz_count_ones: # x = (a1 a0) # x -= ((x >> 1) & 0x5555555555555555); # srli t0, a0, 1 slli t1, a1, 31 or t0, t0, t1 # t0 >> 1 srli t1, a1, 1 # t1 >> 1 li t2, 0x55555555 and t0, t0, t2 and t1, t1, t2 sltu t3, a0, t0 # t3 : borrow bit sub a0, a0, t0 sub a1, a1, t1 sub a1, a1, t3 # x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); # srli t0, a0, 2 slli t1, a1, 30 or t0, t0, t1 # t0 >> 2 srli t1, a1, 2 # t1 >> 2 li t2, 0x33333333 and t0, t0, t2 and t1, t1, t2 and t4, a0, t2 and t5, a1, t2 # (a1 a0) = (t1 t0) + (t5 t4) add a0, t0, t4 sltu t3, a0, t0 # t3 : carry bit add a1, t1, t5 add a1, a1, t3 # x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; # srli t0, a0, 4 slli t1, a1, 28 or t0, t0, t1 # t0 >> 4 srli t1, a1, 4 # t1 >> 4 add t0, t0, a0 sltu t3, t0, a0 # t3 : carry bit add t1, t1, a1 add t1, t1, t3 li t2, 0x0f0f0f0f and a0, t0, t2 and a1, t1, t2 # x += (x >> 8); # srli t0, a0, 8 slli t1, a1, 24 or t0, t0, t1 # t0 >> 8 srli t1, a1, 8 # t1 >> 8 add a0, a0, t0 sltu t3, a0, t0 # t3 : carry bit add a1, a1, t1 add a1, a1, t3 # (a1 a0) += (t1 t0) # x += (x >> 16); # srli t0, a0, 16 slli t1, a1, 16 or t0, t0, t1 # t0 >> 16 srli t1, a1, 16 # t1 >> 16 add a0, a0, t0 sltu t3, a0, t0 # t3 : carry bit add a1, a1, t1 add a1, a1, t3 # (a1 a0) += (t1 t0) # x += (x >> 32); # # (t1 t0) = x >> 32 mv t0, a1 mv t1, zero add a0, a0, t0 sltu t3, a0, t0 # t3 : carry bit add a1, a1, t1 add a1, a1, t3 # (a1 a0) += (t1 t0) # return (64 - (x & 0x7f)); # a0 = (x & 0x7f) andi a0, a0, 0x7f li t0, 64 sub a0, t0, a0 # a0 = (64 - (x & 0x7f)) lw ra, 0(sp) addi sp, sp, 4 ret ``` ::: # Optimized by riscv-none-elf-gcc I will compare -O0, -O1, -O2, -O3, -Os, -Ofast as following. ## -O0 Optimized Assembly code * Compile ``` $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 hamming.c -o hamming_O0.elf ``` * Size ``` sean@sean:~/rv32emu/tests/hello-c$ riscv-none-elf-size hamming_O0.elf text data bss dec hex filename 76972 2372 1528 80872 13be8 hamming_O0.elf ``` * Save txt ``` $ riscv-none-elf-objdump -d hamming_O0.elf >dis_objdump_O0.txt ``` ### Disassembly code I chose ==hammingDistance== to compare, because it contains the most crucial functionality. * Observation * Line of code : `104` * Allocate 96 bytes in stack In hamming distance, there are several variable... First, we need to compare distance between x0 and x1. Second, calculate maximum digits in the larger one between x0 and x1 and stroe it in max_digit. c1 and c2 represent x0 and x1 doing "and" with 1 respectively. Hdist represent to hamming distace value. Now we can check there register in below code. * `-84(s0)` store Number 1 * `-96(s0)` store Number 2 * `-92(s0)` store max_digit * `-52(s0)` store Hdist * Number of lw & sw * `lw` : 28 * `sw` : 24 You can access the full file from the following link:[dis_objdump_O0.txt](https://github.com/scones525/Computer-Architecture-2023Fall_NCKU/blob/main/hw2/dis_objdump_O0.txt) ```c= 00010608 <HammingDistance>: 10608: fa010113 add sp,sp,-96 1060c: 04112e23 sw ra,92(sp) 10610: 04812c23 sw s0,88(sp) 10614: 05212a23 sw s2,84(sp) 10618: 05312823 sw s3,80(sp) 1061c: 05412623 sw s4,76(sp) 10620: 05512423 sw s5,72(sp) 10624: 05612223 sw s6,68(sp) 10628: 05712023 sw s7,64(sp) 1062c: 03812e23 sw s8,60(sp) 10630: 03912c23 sw s9,56(sp) 10634: 06010413 add s0,sp,96 10638: faa42423 sw a0,-88(s0) 1063c: fab42623 sw a1,-84(s0) 10640: fac42023 sw a2,-96(s0) 10644: fad42223 sw a3,-92(s0) 10648: fc042623 sw zero,-52(s0) 1064c: fa842603 lw a2,-88(s0) 10650: fac42683 lw a3,-84(s0) 10654: fa042703 lw a4,-96(s0) 10658: fa442783 lw a5,-92(s0) 1065c: 00068513 mv a0,a3 10660: 00078593 mv a1,a5 10664: 00a5ee63 bltu a1,a0,10680 <HammingDistance+0x78> 10668: 00068513 mv a0,a3 1066c: 00078593 mv a1,a5 10670: 00b51c63 bne a0,a1,10688 <HammingDistance+0x80> 10674: 00060513 mv a0,a2 10678: 00070593 mv a1,a4 1067c: 00a5f663 bgeu a1,a0,10688 <HammingDistance+0x80> 10680: 00060713 mv a4,a2 10684: 00068793 mv a5,a3 10688: 00070513 mv a0,a4 1068c: 00078593 mv a1,a5 10690: af1ff0ef jal 10180 <count_leading_zeros> 10694: 00050793 mv a5,a0 10698: 00078713 mv a4,a5 1069c: 04000793 li a5,64 106a0: 40e787b3 sub a5,a5,a4 106a4: 01079793 sll a5,a5,0x10 106a8: 0107d793 srl a5,a5,0x10 106ac: fcf41523 sh a5,-54(s0) 106b0: 0b40006f j 10764 <HammingDistance+0x15c> 106b4: fa842783 lw a5,-88(s0) 106b8: 0017fb13 and s6,a5,1 106bc: fac42783 lw a5,-84(s0) 106c0: 0007fb93 and s7,a5,0 106c4: fd642023 sw s6,-64(s0) 106c8: fd742223 sw s7,-60(s0) 106cc: fa042783 lw a5,-96(s0) 106d0: 0017fc13 and s8,a5,1 106d4: fa442783 lw a5,-92(s0) 106d8: 0007fc93 and s9,a5,0 106dc: fb842c23 sw s8,-72(s0) 106e0: fb942e23 sw s9,-68(s0) 106e4: fc042703 lw a4,-64(s0) 106e8: fb842783 lw a5,-72(s0) 106ec: 00f71863 bne a4,a5,106fc <HammingDistance+0xf4> 106f0: fc442703 lw a4,-60(s0) 106f4: fbc42783 lw a5,-68(s0) 106f8: 00f70863 beq a4,a5,10708 <HammingDistance+0x100> 106fc: fcc42783 lw a5,-52(s0) 10700: 00178793 add a5,a5,1 10704: fcf42623 sw a5,-52(s0) 10708: fac42783 lw a5,-84(s0) 1070c: 01f79793 sll a5,a5,0x1f 10710: fa842703 lw a4,-88(s0) 10714: 00175913 srl s2,a4,0x1 10718: 0127e933 or s2,a5,s2 1071c: fac42783 lw a5,-84(s0) 10720: 0017d993 srl s3,a5,0x1 10724: fb242423 sw s2,-88(s0) 10728: fb342623 sw s3,-84(s0) 1072c: fa442783 lw a5,-92(s0) 10730: 01f79793 sll a5,a5,0x1f 10734: fa042703 lw a4,-96(s0) 10738: 00175a13 srl s4,a4,0x1 1073c: 0147ea33 or s4,a5,s4 10740: fa442783 lw a5,-92(s0) 10744: 0017da93 srl s5,a5,0x1 10748: fb442023 sw s4,-96(s0) 1074c: fb542223 sw s5,-92(s0) 10750: fca45783 lhu a5,-54(s0) 10754: fff78793 add a5,a5,-1 10758: 01079793 sll a5,a5,0x10 1075c: 0107d793 srl a5,a5,0x10 10760: fcf41523 sh a5,-54(s0) 10764: fca41783 lh a5,-54(s0) 10768: f4f046e3 bgtz a5,106b4 <HammingDistance+0xac> 1076c: fcc42783 lw a5,-52(s0) 10770: 00078513 mv a0,a5 10774: 05c12083 lw ra,92(sp) 10778: 05812403 lw s0,88(sp) 1077c: 05412903 lw s2,84(sp) 10780: 05012983 lw s3,80(sp) 10784: 04c12a03 lw s4,76(sp) 10788: 04812a83 lw s5,72(sp) 1078c: 04412b03 lw s6,68(sp) 10790: 04012b83 lw s7,64(sp) 10794: 03c12c03 lw s8,60(sp) 10798: 03812c83 lw s9,56(sp) 1079c: 06010113 add sp,sp,96 107a0: 00008067 ret ``` ## O1 Optimized Assembly code * Compile ``` $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 hamming.c -o hamming_O0.elf ``` * Text ``` sean@sean:~/rv32emu/tests/hello-c$ riscv-none-elf-size hamming_O1.elf text data bss dec hex filename 75908 2372 1528 79808 137c0 hamming_O1.elf ``` * Save txt ``` $ riscv-none-elf-objdump -d hamming_O1.elf >dis_objdump_O1.txt ``` ### Disassemble code * Observation * Line of code : `47` * Allocate 32 bytes in stack * s0 store Number 1 that need to compute distance * s1 store Number 2 that need to compute distance to Number 1 * a5 store max_digit * a4 store Hdist * Number of lw & sw * `lw` : 5 * `sw` : 5 You can access the full file from the following link:[dis_objdump_O1.txt](https://github.com/scones525/Computer-Architecture-2023Fall_NCKU/blob/main/hw2/dis_objdump_O1.txt) ```c= 000102f0 <HammingDistance>: 102f0: fe010113 add sp,sp,-32 102f4: 00112e23 sw ra,28(sp) 102f8: 00812c23 sw s0,24(sp) 102fc: 00912a23 sw s1,20(sp) 10300: 01212823 sw s2,16(sp) 10304: 01312623 sw s3,12(sp) 10308: 00050493 mv s1,a0 1030c: 00058913 mv s2,a1 10310: 00060413 mv s0,a2 10314: 00068993 mv s3,a3 10318: 00b6ea63 bltu a3,a1,1032c <HammingDistance+0x3c> 1031c: 00060513 mv a0,a2 10320: 00068593 mv a1,a3 10324: 00d91863 bne s2,a3,10334 <HammingDistance+0x44> 10328: 00967663 bgeu a2,s1,10334 <HammingDistance+0x44> 1032c: 00048513 mv a0,s1 10330: 00090593 mv a1,s2 10334: e4dff0ef jal 10180 <count_leading_zeros> 10338: 04000793 li a5,64 1033c: 40a787b3 sub a5,a5,a0 10340: 01079793 sll a5,a5,0x10 10344: 4107d793 sra a5,a5,0x10 10348: 06f05063 blez a5,103a8 <HammingDistance+0xb8> 1034c: 00000513 li a0,0 10350: 0084c733 xor a4,s1,s0 10354: 00177713 and a4,a4,1 10358: 00e50533 add a0,a0,a4 1035c: 01f91713 sll a4,s2,0x1f 10360: 0014d493 srl s1,s1,0x1 10364: 009764b3 or s1,a4,s1 10368: 00195913 srl s2,s2,0x1 1036c: 01f99713 sll a4,s3,0x1f 10370: 00145413 srl s0,s0,0x1 10374: 00876433 or s0,a4,s0 10378: 0019d993 srl s3,s3,0x1 1037c: fff78793 add a5,a5,-1 10380: 01079793 sll a5,a5,0x10 10384: 4107d793 sra a5,a5,0x10 10388: fc0794e3 bnez a5,10350 <HammingDistance+0x60> 1038c: 01c12083 lw ra,28(sp) 10390: 01812403 lw s0,24(sp) 10394: 01412483 lw s1,20(sp) 10398: 01012903 lw s2,16(sp) 1039c: 00c12983 lw s3,12(sp) 103a0: 02010113 add sp,sp,32 103a4: 00008067 ret ``` Notice line 26 ~ 28 29~32 In this case, if we want to right shift a number in uint_64 we need to know its sign first. After doing left shift, do or with its sign. ## O2 Optimized Assembly code * Compile ``` $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O2 hamming.c -o hamming_O2.elf ``` * text ``` sean@sean:~/rv32emu/tests/hello-c$ riscv-none-elf-size hamming_O2.elf text data bss dec hex filename 75912 2372 1528 79812 137c4 hamming_O2.elf ``` * write disassembly code ``` $ riscv-none-elf-objdump -d hamming_O2.elf >dis_objdump_O2.txt ``` ### Dissembly code * Observation * Line of code : `32` * Allocate 32 bytes in stack * s0 store Number 1 that need to compute distance * s1 store Number 2 that need to compute distance to Number 1 * a5 store max_digit * a4 store Hdist * Number of lw & sw * `lw` : 5 * `sw` : 5 You can access the full file from the following link:[dis_objdump_O2.txt](https://github.com/scones525/Computer-Architecture-2023Fall_NCKU/blob/main/hw2/dis_objdump_O2.txt) ```c= 00010374 <HammingDistance>: 10374: fe010113 add sp,sp,-32 10378: 00812c23 sw s0,24(sp) 1037c: 00912a23 sw s1,20(sp) 10380: 01212823 sw s2,16(sp) 10384: 01312623 sw s3,12(sp) 10388: 00112e23 sw ra,28(sp) 1038c: 00060413 mv s0,a2 10390: 00068913 mv s2,a3 10394: 00058993 mv s3,a1 10398: 00050493 mv s1,a0 1039c: 08b6e863 bltu a3,a1,1042c <HammingDistance+0xb8> 103a0: 00060513 mv a0,a2 103a4: 00068593 mv a1,a3 103a8: 08d98063 beq s3,a3,10428 <HammingDistance+0xb4> 103ac: e59ff0ef jal 10204 <count_leading_zeros> 103b0: 04000793 li a5,64 103b4: 40a787b3 sub a5,a5,a0 103b8: 01079793 sll a5,a5,0x10 103bc: 4107d793 sra a5,a5,0x10 103c0: 00000513 li a0,0 103c4: 04f05463 blez a5,1040c <HammingDistance+0x98> 103c8: fff78793 add a5,a5,-1 103cc: 0084c733 xor a4,s1,s0 103d0: 01079693 sll a3,a5,0x10 103d4: 01f99593 sll a1,s3,0x1f 103d8: 01f91613 sll a2,s2,0x1f 103dc: 00177713 and a4,a4,1 103e0: 0014d493 srl s1,s1,0x1 103e4: 00145413 srl s0,s0,0x1 103e8: 01079793 sll a5,a5,0x10 103ec: 0106d693 srl a3,a3,0x10 103f0: 00e50533 add a0,a0,a4 103f4: 0095e4b3 or s1,a1,s1 103f8: 0019d993 srl s3,s3,0x1 103fc: 00866433 or s0,a2,s0 10400: 00195913 srl s2,s2,0x1 10404: 4107d793 sra a5,a5,0x10 10408: fc0690e3 bnez a3,103c8 <HammingDistance+0x54> 1040c: 01c12083 lw ra,28(sp) 10410: 01812403 lw s0,24(sp) 10414: 01412483 lw s1,20(sp) 10418: 01012903 lw s2,16(sp) 1041c: 00c12983 lw s3,12(sp) 10420: 02010113 add sp,sp,32 10424: 00008067 ret 10428: f89672e3 bgeu a2,s1,103ac <HammingDistance+0x38> 1042c: 00048513 mv a0,s1 10430: 00098593 mv a1,s3 10434: f79ff06f j 103ac <HammingDistance+0x38> ``` In line 15, it used beq to handle a special case(x0, x1 are same). Because in hamming distance, if the two number that we want to calculate their distance is totally same the answer will be 0. ## O3 Optimized Assembly code * Compile ``` $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O3 hamming.c -o hamming_O3.elf ``` * text ``` sean@sean:~/rv32emu/tests/hello-c$ riscv-none-elf-size hamming_O3.elf text data bss dec hex filename 76192 2372 1528 80092 138dc hamming_O3.elf ``` * write disassembly code ``` $ riscv-none-elf-objdump -d hamming_O3.elf >dis_objdump_O3.txt ``` ### Dissembly code * Observation * Line of code : `120` * Allocate 32 bytes in stack * s0 store Number 1 that need to compute distance * s1 store Number 2 that need to compute distance to Number 1 * a5 store max_digit * a4 store Hdist * Number of lw & sw * `lw` : 0 * `sw` : 0 You can access the full file from the following link:[dis_objdump_O3.txt](https://github.com/scones525/Computer-Architecture-2023Fall_NCKU/blob/main/hw2/dis_objdump_O3.txt) ```c= 00010374 <HammingDistance>: 10374: 00050713 mv a4,a0 10378: 00060893 mv a7,a2 1037c: 00068813 mv a6,a3 10380: 1ab6ee63 bltu a3,a1,1053c <HammingDistance+0x1c8> 10384: 1ad58a63 beq a1,a3,10538 <HammingDistance+0x1c4> 10388: 01f81793 sll a5,a6,0x1f 1038c: 0018d513 srl a0,a7,0x1 10390: 00a7e533 or a0,a5,a0 10394: 00185793 srl a5,a6,0x1 10398: 0107e7b3 or a5,a5,a6 1039c: 01156833 or a6,a0,a7 103a0: 00285513 srl a0,a6,0x2 103a4: 01e79893 sll a7,a5,0x1e 103a8: 00a8e533 or a0,a7,a0 103ac: 0027d893 srl a7,a5,0x2 103b0: 0117e7b3 or a5,a5,a7 103b4: 00a86833 or a6,a6,a0 103b8: 01c79893 sll a7,a5,0x1c 103bc: 00485513 srl a0,a6,0x4 103c0: 00a8e533 or a0,a7,a0 103c4: 0047d893 srl a7,a5,0x4 103c8: 0117e7b3 or a5,a5,a7 103cc: 00a86833 or a6,a6,a0 103d0: 01879893 sll a7,a5,0x18 103d4: 00885513 srl a0,a6,0x8 103d8: 00a8e533 or a0,a7,a0 103dc: 0087d893 srl a7,a5,0x8 103e0: 0117e7b3 or a5,a5,a7 103e4: 00a86833 or a6,a6,a0 103e8: 01079893 sll a7,a5,0x10 103ec: 01085513 srl a0,a6,0x10 103f0: 00a8e533 or a0,a7,a0 103f4: 0107d893 srl a7,a5,0x10 103f8: 0117e7b3 or a5,a5,a7 103fc: 00a86833 or a6,a6,a0 10400: 00f86833 or a6,a6,a5 10404: 01f79313 sll t1,a5,0x1f 10408: 00185513 srl a0,a6,0x1 1040c: 555558b7 lui a7,0x55555 10410: 55588893 add a7,a7,1365 # 55555555 <__BSS_END__+0x55531611> 10414: 00a36533 or a0,t1,a0 10418: 01157533 and a0,a0,a7 1041c: 0017d313 srl t1,a5,0x1 10420: 40a80533 sub a0,a6,a0 10424: 011378b3 and a7,t1,a7 10428: 00a83833 sltu a6,a6,a0 1042c: 411787b3 sub a5,a5,a7 10430: 410787b3 sub a5,a5,a6 10434: 01e79313 sll t1,a5,0x1e 10438: 00255813 srl a6,a0,0x2 1043c: 333338b7 lui a7,0x33333 10440: 33388893 add a7,a7,819 # 33333333 <__BSS_END__+0x3330f3ef> 10444: 01036833 or a6,t1,a6 10448: 01187833 and a6,a6,a7 1044c: 0027d313 srl t1,a5,0x2 10450: 01157533 and a0,a0,a7 10454: 00a80533 add a0,a6,a0 10458: 01137333 and t1,t1,a7 1045c: 0117f7b3 and a5,a5,a7 10460: 01053833 sltu a6,a0,a6 10464: 00f307b3 add a5,t1,a5 10468: 00f80833 add a6,a6,a5 1046c: 01c81893 sll a7,a6,0x1c 10470: 00455793 srl a5,a0,0x4 10474: 00f8e7b3 or a5,a7,a5 10478: 00a78533 add a0,a5,a0 1047c: 00485893 srl a7,a6,0x4 10480: 010888b3 add a7,a7,a6 10484: 00f537b3 sltu a5,a0,a5 10488: 0f0f1837 lui a6,0xf0f1 1048c: f0f80813 add a6,a6,-241 # f0f0f0f <__BSS_END__+0xf0ccfcb> 10490: 011787b3 add a5,a5,a7 10494: 0107f7b3 and a5,a5,a6 10498: 01057533 and a0,a0,a6 1049c: 01879893 sll a7,a5,0x18 104a0: 00855813 srl a6,a0,0x8 104a4: 0108e833 or a6,a7,a6 104a8: 01050833 add a6,a0,a6 104ac: 0087d893 srl a7,a5,0x8 104b0: 011787b3 add a5,a5,a7 104b4: 00a83533 sltu a0,a6,a0 104b8: 00f50533 add a0,a0,a5 104bc: 01051893 sll a7,a0,0x10 104c0: 01085793 srl a5,a6,0x10 104c4: 00f8e7b3 or a5,a7,a5 104c8: 00f807b3 add a5,a6,a5 104cc: 01055893 srl a7,a0,0x10 104d0: 0107b833 sltu a6,a5,a6 104d4: 01150533 add a0,a0,a7 104d8: 00a80833 add a6,a6,a0 104dc: 010787b3 add a5,a5,a6 104e0: 07f7f513 and a0,a5,127 104e4: 00050793 mv a5,a0 104e8: 06050063 beqz a0,10548 <HammingDistance+0x1d4> 104ec: 00000513 li a0,0 104f0: fff78793 add a5,a5,-1 104f4: 00c74833 xor a6,a4,a2 104f8: 01079893 sll a7,a5,0x10 104fc: 01f59e13 sll t3,a1,0x1f 10500: 01f69313 sll t1,a3,0x1f 10504: 00187813 and a6,a6,1 10508: 00175713 srl a4,a4,0x1 1050c: 00165613 srl a2,a2,0x1 10510: 01079793 sll a5,a5,0x10 10514: 0108d893 srl a7,a7,0x10 10518: 01050533 add a0,a0,a6 1051c: 00ee6733 or a4,t3,a4 10520: 0015d593 srl a1,a1,0x1 10524: 00c36633 or a2,t1,a2 10528: 0016d693 srl a3,a3,0x1 1052c: 4107d793 sra a5,a5,0x10 10530: fc0890e3 bnez a7,104f0 <HammingDistance+0x17c> 10534: 00008067 ret 10538: e4a678e3 bgeu a2,a0,10388 <HammingDistance+0x14> 1053c: 00070893 mv a7,a4 10540: 00058813 mv a6,a1 10544: e45ff06f j 10388 <HammingDistance+0x14> 10548: 00000513 li a0,0 1054c: 00008067 ret ``` ## Conclusion in different optimizer flag * From -O0 to -O1 * Significant reduction in code lines. * In "-O1" line 26~28, noticed that compiler didn't translate it line by line according to the original C code, but rather optimized it. It uses extra register to restore the result of "XOR". * Contract origin c code: ```c uint64_t c1 = x0 & 1; uint64_t c2 = x1 & 1; if(c1 != c2) Hdist += 1; ``` And -O0's disaambly code: ```c lw a4, -84(s0) # store number 1 and s6, a4, 1 lw a5, -96(s0) # store number 2 and s7, a5, 1 # if equal ,no need to add 1 beq a4,a5,10708 <HammingDistance+0x100> lw a5,-52(s0) add a5,a5,1 sw a5,-52(s0) ``` -O1's disaambly code ```c xor a4, s1, s2 and a4, a4, 1 # Compute these two lowest bits are same or not add a0, a0, 4 # this is Hdist ``` They are same function but their dissambly are very different. * From -O1 to -O2 There are some speaical case to Reduce computational load. For example in -O2 dissambly code: In line 15, it used beq to handle a special case ```c 103a8: 08d98063 beq s3,a3,10428 <HammingDistance+0xb4> . . . 10428: f89672e3 bgeu a2,s1,103ac <HammingDistance+0x38> 1042c: 00048513 mv a0,s1 10430: 00098593 mv a1,s3 10434: f79ff06f j 103ac <HammingDistance+0x38> ``` Because in hamming distance, if the two number that we want to calculate their distance is totally same the answer will be 0. In other words, there's no need to calculate the Hamming distance for two numbers that are identical. * From -O2 to -O3 There are some loop unrolling when using -O3 flag to dissambly c code. ### Comparison | | -O0 | -O1 | -O2 | -O3 | |--------------------|-----------|-----------|-----------|-----------| | Line of Code | 104 | 47 | 32 | 120 | | Register used | 16 | 12 | 12 | 7 | | Stack used (bytes) | 96 | 32 | 32 | 0 | | lw/sw count | 52 | 10 | 10 | 10 | | loop unrolling | N | N | N | Y | | RDCYCLE value | 12468 | 8470 | 8707 | 8635 |