Lab1: RV32I Assembly

# Total Hamming Distance Using Clz ([Leetcode 477](https://leetcode.com/problems/total-hamming-distance/description/)) ## Abstraction Hamming distance is an elegant mathematic concept for binary world, especially in infornmation and communication technology. For example, Hamming distance can be used for calculating bit error rate, optimal decoding method, power calculation, and more. In this homework, I'll discuss some method for calculation of total Hamming distance with a set of integer given, comparing the performance of each case. And the concept of clz mentioned in problem C of quiz one will be applied. ## Quiz1 Problem `C` ### `fabsf` C code ``` c static inline float fabsf(float x) { uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value x = *(float *)&i; // Write the modified bits back into the float return x; } ``` ### `fabsf` assembly code ``` = .data test_data: .word 0xBF800000 # -1.0 .word 0x43D80000 # 432.0 .word 0x42180000 # -38.0 verify: .word 0X3F800000 # 1.0 .word 0x43D80000 # 432.0 .word 0X42180000 # 38.0 wrong_message: .asciz "wrong\n" right_message: .asciz "right\n" newline: .asciz "\n" .text .global main main: la t0, test_data # Load address of `tests` into t0 la t1, verify # Load address of 'verify' into t1 data addi t2, t2, 3 #3 number of test data loop: beq t2,zero,exit # If the t2=0, exit. addi t2,t2,-1 # t2-1 jal fabsf # Jump and link to `fabsf` function addi t0,t0,4 # next tested data addi t1,t1,4 # next verifying data j loop # jump to loop fabsf: lw a0, 0(t0) # Load the word from memory address pointed by t0 into a0 lw a1, 0(t1) # Load the word from memory address pointed by t1 into a1 li a2, 0x7FFFFFFF # Mask and a0,a0,a2 # Set first bit to zero bne a0, a1,false # if a0 is not equal to a1, then false message sw a0, 0(t0) # Store the word from t1 back to memory address pointed by t0 addi t4, ra,0 # store ra in t4 jal true # jump to true message addi ra,t4,0 # reload ra # Return ret false: la a0, wrong_message li a7,4 ecall lw a0,0(t0) li a7,1 ecall lw a0,0(t1) li a7,1 ecall ret true: la a0, right_message li a7,4 ecall la a0, newline li a7,4 ecall lw a0,0(t0) li a7,1 ecall la a0, newline li a7,4 ecall lw a0,0(t1) li a7,1 ecall ret exit: li a7,10 ecall ``` ### `clz` C code ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` ### `clz` assembly In this case, I remain alarm message only under false condition. ``` = .data test_data: .word 0x1 .word 0x100 .word 0x100000 verify: .word 31 .word 23 .word 11 wrong_message: .asciz "wrong" end_message: .asciz "\n end" .text .global main main: la t0, test_data la t1, verify li t2, 3 #3 test data jal test #jump and link to test function j end test: addi t4,ra,0 #store ra in t4 beq t2, zero, exit #end if t2 = 0 lw a0,0(t0) #load first test data lw a1,0(t1) #load first verifuing data li a2,0 #use for counting the number of leading zero, set to 0 li a3,32 #total 32bits jal clz #jump to clz and save position to clz, counting leading zero and save to a3 bne a3 ,a1,false #if a3 != a1, go labeled false addi ra,t4,0 #recover ra addi t2,t2,-1 #index -1 addi t0,t0,4 #next data addi t1,t1,4 #next verying data j test #back to this loop clz: beq a0, zero,exit # if a0 == 0, meaning that shift encough step, end loop addi a3,a3,-1 #if t1 !=0, leadning zero bits -1 srli a0,a0,1 #right shift 1 bit j clz #back to loop exit: ret false: la a0, wrong_message li a7,4 ecall j end end: la a0,end_message li a7,4 ecall li a7,10 ecall ``` ## Leetcode: Total Hamming Distance([Leetcode 477](https://leetcode.com/problems/total-hamming-distance/description/)) **Description:** The Hamming distance between two integers is the number of positions at which the corresponding bits are different. **Constrains:** 1 <= nums.length <= 104 0 <= nums[i] <= 109 The answer for the given input will fit in a 32-bit integer. ### Solution `A` **1. Idea** To calculate the total Hamming distance of all given integers, I propose calculating the Hamming distance of the first element with the remaining elements. Then, calculate the Hamming distance of the second element with the remaining elements, excluding the first element, and so on. Continue this until the last element and the element before it are processed, and sum up all the Hamming distances to obtain the total Hamming distance. The Hamming distance of each pair will apply $XOR$ and right shift operation. **2. Mathematical Representation** $$H_{total}=\displaystyle\sum^{N-1}_{i=1}\sum^N_{j=i+1}H(x_i,x_j)$$ where $H$ denoted as the Hamming distance function and $x_i$ denoted as the integer . **3. C code** Noted that this method can only afford calculation for Hamming distance in few integers due to its complexity $O(n^2)$. ```C int totalHammingDistance(int* nums, int numsSize) { int totalDistance = 0; for (int i = 0; i < numsSize-1;i ++) { for (int j = i + 1; j < numsSize;j++) { uint32_t w = nums[i]^ nums[j]; for (int t = 0; t < 32;t++) { totalDistance += w & 0x1; w = w >> 1; } } } return totalDistance; } ``` **4. Assembly code** In this case, I found that the allocation of alias is a big issue because I didn't utilize stack to store data when jump to other functions. Because the total Hamming distance is the summation of Hamming distance of each pair, here I only test one data to check the final result. That is, in order to fit the final result, each computation of Hamming distance generally must be correct. ``` = .data test: .word 5,8,104,233,1,5344,703 size: .word 7 verify: .word 100 pair: .asciz "\n pair HD = " total: .asciz "\n total HD = " fa: .asciz "\n false" successful: .asciz "\n successful" .text .global main main: la t0,test #loda address of test data to t0 la t1,verify #load address of verifying data to t1 la t2,size #load address of test data size to t2 lw t2,0(t2) #load the value of size li t3,0 #t3 = i li t4,0 #t4 = j li s0,0 #Total Hamming distance, s0 jal loop1 la t0,verify lw t0,0(t0) beq s0,t0,end j false loop1: addi a5,ra,0 #store ra of main in a5 bge t3,t2,exit #if t3>=t2 => i<3 ,exit addi t4,t3,1 #index of inner loop, start from t3+1, j=i+1 lw a3,0(t0) #load the i-th element of data addi a1,t0,4 #load the j=i+1 th data address jal loop2 #counting Hamming distance with i fixed addi t0,t0,4 #shift to next element, i+1 addi ra,a5,0 #recover ra of main addi t3,t3,1 j loop1 #back to this loop loop2: addi a6,ra,0 #store ra of loop1 in a6 bge t4,t2,exit #if j>= 3, exit lw t5,0(a1) #load jth element xor t5,t5,a3 #obtain xor result of ith and jth elements li a0,0 #reset a0 for bit shift, <32 li t6,32 #reset a1 for total bit number li s1,0 #number of Hammingdis of each pair jal HammingDis #counting Hamming distance of ith and jth elements addi ra,a6,0 add s0,s0,s1 #sum up each pair addi a1,a1,4 #j=j+1 element addi t4,t4,1 j loop2 HammingDis: bge a0,t6,exit #shift 31 times li a2,1 #load for and the least significant bit and a2,a2,t5 # add s1,s1,a2 #plus to total Hamming distance srli t5,t5,1 #right shift 1 bits addi a0,a0,1 #index +1 j HammingDis #count for next bit exit: ret end: la a0,total li a7,4 ecall addi a0,s0,0 li a7,1 ecall li a7,10 ecall false: la a0,fa li a7,4 ecall li a7,10 ecall ``` **5. Performance** ![image](https://hackmd.io/_uploads/H14PLiwJyx.png) ### Solution `B` **1. Idea** In this version, before counting Hamming distance, I try to calculate the leading zero of $XOR$ to reduce the total cycle. I guess it might be helpful for the thorough procedures. **2. C code** ```C int totalHammingDistance(int* nums, int numsSize) { int totalDistance = 0; for (int i = 0; i < numsSize - 1; i++) { for (int j = i + 1; j < numsSize; j++) { uint32_t w = nums[i] ^ nums[j]; if (w == 0) { continue; //avoid undefined case } int leading_zeros = __builtin_clz(w); int bits_to_process = 32 - leading_zeros; //figure out clz to reduce the following cycle for (int t = 0; t < bits_to_process; t++) { totalDistance += w & 0x1; w >>= 1; } } return totalDistance; } ``` **3. Revised Assembly Code** In this revised case, I try to allocate alias more clearly, including the utilization of stack pointer. Furthermore, clz method is also applied in this approach with the performance improved significantly and shown at the next subtitle. ``` .data test: .word 5,8,104,233,1,5344,703 size: .word 7 total: .text .global main main: la s0, test #load address of test data to s0 la s1, size #load address of test data size to s1 lw s1,0(s1) # s1 = numsSize li s2,0 # s2 = total Hd li t0, 0 # outer loop index = 1 jal loop1 jal end loop1: # to = outer loop index bge t0, s1, exit addi t1,t0,1 #inner loop index, start form t0+1 addi t2,s0,4 #load the data address to t2 used for shift in the loop lw a0, 0(s0) #load first element li a1, 32 #the clz of first element addi sp,sp,-8 sw ra, 4(sp) #sp for ra sw a0, 0(sp) #sp for first element jal clz #calculate clz of first element lw a0,0(sp) #load second element jal loop2 #loop 2 for Hamming distance with reamining elements lw ra,4(sp) #recover ra addi sp,sp,8 #recover sp addi t0,t0,1 #next loop addi s0,s0,4 #next element j loop1 loop2: # a0 = the first element # a1 = the clz of the first element # a2 = the clz of the second element # t1 is the index = i+1 # t2 = address of second element bge t1, s1, exit addi sp,sp,-16 sw ra,12(sp) sw a0,8(sp) sw a1,4(sp) sw t1,0(sp) #count clz of t2 lw a0,0(t2) #load second element to a0 li a1,32 #set clz to 0 jal clz #count clz of second element addi a2,a1,0 #clz store to a2 lw a1,4(sp) #recover a1 jal reverse addi a0,a1,-32 # a0=a1-32 sub a0,zero,a0 # a0=0-a0=-a1+32 lw t1,8(sp) #t1 = first element used for Hd_clz lw t3,0(t2) #t3 = second element for Hd_clz li a1,0 #index for Hd_clz = 0 xor t1,t1,t3 # xor t1,t2 to t1 jal HammingDis_clz lw t1,0(sp) #recover data lw a1,4(sp) #recover data lw a0,8(sp) #recover data lw ra,12(sp) #recover data addi sp,sp,16 #recover sp addi t2,t2,4 #next element addi t1,t1,1 #next loop j loop2 reverse: bge a2,a1 ,exit #if a1>a2, a0 = 32-a2, else a0 = 32-a1 addi a1,a2,0 ret clz: # a0 is the element need to be calculated it's clz # a1 is the clz beq a0, zero,exit # if a0 == 0, meaning that shift encough step, end loop addi a1,a1,-1 #if t1 !=0, leadning zero bits -1 srli a0,a0,1 #right shift 1 bit j clz #back to loop HammingDis_clz: # a0 = less clz value of two elements # a1 = index # t1 = xor value # t3 = check bge a1, a0, exit andi t3, t1, 1 # LSB add s2,s2,t3 # add LSB to total Hd srli t1,t1,1 # right shift one bit addi a1,a1,1 j HammingDis_clz #count for next bit exit: ret end: la a0,total li a7,4 ecall addi a0,s2,0 li a7,1 ecall li a7,10 ecall ``` **4. Comparison of Performances** The pictures below showed the leading zero concept applied dramatically decreased the cycle number. All parameters are improved significantly. ![image](https://hackmd.io/_uploads/H14PLiwJyx.png) Fig a. with no leading zero concept ![image](https://hackmd.io/_uploads/HJTpHxYkke.png) Fig b. with leading zero concept ### Solution `C` **1. Idea** However, the complexities of above solutions aren't inproved. I figure out another approach to sovle the complexity issue. Under the condition that we know the total number of integers, we can calculate the total Hamming distance at specific bit position then sum up. The following steps show this idea more clearly. **a.** Consider in bit position $l$, $$H(x_{il},x_{jl})=\begin{cases} 1,if\ \ \ x_{il}\ \ne x_{jl}\\ 0,if\ \ \ x_{il}\ = x_{jl} \end{cases}$$ Only bit different will cause $H=1$ **b.** With above steps, denoted $o_l, z_l$ as the number of one and zero at the position $l$, respectively. $$\begin{align}H_{total}&=\displaystyle\sum^{N-1}_{i=1}\sum^N_{j=i+1}H(x_i,x_j)=\sum^{32}_{l=1}\sum^{N-1}_{i=1}\sum^N_{j=i+1}H(x_{il},x_{jl})\\&=\sum^{32}_{l=1}o_lz_l\end{align}$$ **c.** Conclusion: only the multiplication value of $o_l,z_l$ in each bit position is needed! **2.C Code** ```C int totalHammingDistance(int* nums, int numsSize) { int totalDistance = 0; //total Hamming Distance for (int l = 0; l < 32; l++) //outer loop for bit position { int o = 0; //set nubmer of one in o int z = 0; //set number of zero in z for (int j = 0; j < numsSize; j++) //check every integer { nums[j]&1 ? o++:z++; //check LSB = 1? nums[j] = nums[j] >> 1; //right shift tested data } totalDistance = totalDistance + o * z; //sum up } return totalDistance; } ``` **3.Assembly** ``` .data test: .word 5,8,104,233,1,5344,703, 832940,10239,4829,829,111,284920,1042,17,5873,478293,182,4471,23 size: .word 20 verify: .word 1314 right_mes: .asciz "right" wrong_mes: .asciz "wrong" .text main: la s0, test #load address of test data to s0 la s1, size #load address of test data size to s1 lw s1,0(s1) # s1 = numsSize li s2,0 # s2 = total Hd li t0, 0 # bits_shift index = 1 li t1, 32 # total 32 bit shifts bits_shift: bge t0,t1,end # 32bits shift addi a0,s0,0 # copy the address of data li a1,0 # the number of zero li a2,0 # the number of one li t2,0 # index addi sp,sp,-4 # store return address on stack sw ra,0(sp) jal data_shift # count each zero and one of elements lw ra,0(sp) addi sp,sp,4 addi t0,t0,1 # next bit j bits_shift # loop data_shift: # a0 = address of the data # a1 = number of zero # a2 = number of one # t2 = index # t3 = data value bge t2,s1,z_times_o # index >= size number, count z times o lw t3,0(a0) # load element addi t2,t2,1 # next loop andi t4,t3,1 # store and result to t4 srli t3,t3,1 # right shift one bit sw t3,0(a0) # restore to s0 position beq t4,zero,z_add # if t4=0, LSB=0, zero++ addi a2,a2,1 # else , one++ c: addi a0,a0,4 # next element j data_shift # back loop z_add: addi a1,a1,1 # add number of zero j c # back and continue z_times_o: # result will be stored in s2 # a1 = number of zero # a2 = number of one # t2 = temp value for store LSB beq a1,zero,exit #right shift a1. if zero, return to main loop li t2,0 andi t2,a1,1 # LSB of a1 beq t2,zero,b # if zero, j to b, next bit add s2,s2,a2 # add to total Hd b: slli a2,a2,1 srli a1,a1,1 j z_times_o # back to this loop exit: ret end: la t0, verify lw t1, 0(t0) bne t1,s2,false la a0, right_mes li a7,4 ecall li a7,10 ecall false: la a0, wrong_mes li a7,4 ecall li a7,10 ecall ``` **4. Performance** It seems worse than above cases. ![image](https://hackmd.io/_uploads/S1Vj87YJye.png) **5. Comparison by using different data** I'm wondering why it looks worse than above cases, so I tried another data of which with large number of intege. The results are shown below. With the integers seies: $5,8,104,233,1,5344,703, 832940,10239,4829,829,111,284920,1042,17,5873,478293,182,4471,23$ a. Solution `A` ![image](https://hackmd.io/_uploads/S1m72mty1e.png) b. Solution `B` This revised versio seems make the performance much better. ![image](https://hackmd.io/_uploads/r1kO2mKk1e.png) c. Solution `C` This solution make a significant improvement of cycles! !![image](https://hackmd.io/_uploads/HJjT8fckyl.png) ## Assembly Code Improvement Based on Pipeline Diagram (Solution `C`) ### Unrolling func. data_shift * When processing `data_shift`, here are some data hazard issues that I found. As below showed.![image](https://hackmd.io/_uploads/rydZKM911g.png)When `EX` beq, the IF and ID will be eliminated at the next cycle and IF will will replaced by `addi a1,a1, 1` if `t4==0`. I have no idea how to avoid such banch issue. * Another problem that I found is if processing either`C:` or`z_add:` the data hazard impacted dramatically of the performance. As picture showed.![image](https://hackmd.io/_uploads/HJeRvXqJJx.png)![image](https://hackmd.io/_uploads/B1acuX9kke.png) To reduce to impact of this problem, I tried to enroll code of this part. The correesponding assembly code revised as: ``` data_shift: # a0 = address of the data # a1 = number of zero # a2 = number of one # t2 = index # t3 = data value bge t2,s1,z_times_o # index >= size number, count z times o lw t3,0(a0) # load element addi t2,t2,1 # next loop andi t4,t3,1 # store and result to t4 beq t4,zero,z_add # if t4=0, LSB=0, zero++ addi a2,a2,1 srli t3,t3,1 # right shift one bit sw t3,0(a0) # restore to s0 position addi a0,a0,4 # next element j data_shift # back loop z_add: srli t3,t3,1 # right shift one bit sw t3,0(a0) # restore to s0 position addi a1,a1,1 # add number of zero addi a0,a0,4 j data_shift # back and continue ``` And the performance after revised: ![image](https://hackmd.io/_uploads/B1ItYQcJyg.png) * Conclusion Currently, I am studing how to avoid nop issues when encounter jump address and jump branch. ## Reference > [Example RISC-V Assembly Programs](https://marz.utk.edu/my-courses/cosc230/book/example-risc-v-assembly-programs/) [Leetcode 477 ](https://leetcode.com/problems/total-hamming-distance/description/) [open AI](https://chatgpt.com/) [RISC-V Assembly Programmer’s Manual](https://github.com/riscv-non-isa/riscv-asm-manual/blob/main/src/asm-manual.adoc)