# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by <[`RBing`](https://www.github.com/RBing123/computer_architecture.git)>(Rong-Bing, Fang) :::danger Be aware of the headings with Markdown syntax. ::: ## Abstract Problem C in the quiz1 defines a set of functions to convert a 16-bit half-precision floating-point number (`fp16`) to a 32-bit single-precision floating-point number (`fp32`). The core function, `fp16_to_fp32`, shifts the 16-bit floating-point input into the upper half of a 32-bit word, separates the sign bit, and normalizes the mantissa and exponent. It then adjusts for the differences in exponent bias between half-precision and single-precision formats. The function handles special cases such as `denormalized numbers`, `zero`, `NaN`, and `infinity`, ensuring proper conversion by setting the correct bits for each scenario. The helper function my_clz is used to count leading zeros in the exponent and normalize denormalized numbers. The following illustration is about each function and its corresponding RISC-Vassembly in Problem C of the quiz1. ### fabsf The function `fabsf` computes the absolute value of a floating-point number by manipulatin its binary representation. It reads the bitwise representation of the float `x` as a 32-bit integer, clears the sign bit ( `MSB` ) to remove any negative sign, and them writes the modified bits back into the float. The IEEE 754 standard defines the format for representing floating-point numbers in binary. The representation of a floating-point number is divided into three main parts: | sign(1 bit) | exponent(8 bits) | mantissa(23 bits) | | -------- | -------- | -------- | | 0 | 00000000 | 00000000000000000000000 | **1. Sign Bit** This single bit indicates the sign of the number: * `0` for positive number * `1` for negative number **2. Exponent** The exponent is stored in a biased format, which means a bias value is subtracted from the actual exponent to allow both positive and negative exponents to be represented without using a sign bit. * for `float` (32-bit), the bias is `127` * for `double` (64-bit), the bias is `1023` **3. Mantissa** Also known as the fraction or significand, this part stores the actual digits of the number. For normalized numbers, an implicit leading 1 is assumed before the binary point, so only the fractional part is stored. ```c static inline float fabsf(float x) { uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value x = *(float *)&i; // Write the modified bits back into the float return x; } ``` The corresponding RISC-V assembly are showing below. ```c fabsf: # assume that the input parameter x is stored at a0 # hold the mask 0x7FFFFFFF in t0 li t0, 0x7FFFFFFFF and a0, a0, t0 # bitwise 'and' with x(a0) and mask(t0) ret ``` :::danger Use headnings instead of font tags in HTML. ::: ### my_clz The `my_clz` function counts the number of leading zeros in a 32-bit unsigned integer `x`. This is the breakdown of its functionality. 1. **Initialization**: It initializes a counter `count` to zero, which will keep track of the number of leading zeros. 2. **Loop through Bits**: The function uses a for loop that iterates from [the most significant bit](https://en.wikipedia.org/wiki/Bit_numbering#Most_significant_bit) (`MSB`) to [the least siginificant bit]((https://en.wikipedia.org/wiki/Bit_numbering#Most_significant_bit)) (`LSB`). In each iteration, it checks if the current bit is set. 3. **Bitwise Operation**: The expression `x & (1U << i)` checks if the `i`-th bit of `x` is `1`. If it finds a `1`, the loop breaks, meaning it has encountered the first non-zeor bit. 4. **Counting Leading Zeros**: If the current bit is `0`, it increments the `count`. This continues until a `1` is found or all bits have been checked. 5. **Return**: The function returns the total count of leading zeros. ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` The corresponding RISC-V assembly are showing below. ```c my_clz: # assume that input parameter x is stored at a0 addi t0, zero, 0 # init count = 0 addi t1, zero, 31 # init i=31 for for_loop addi s0, zero, 1 # hold 1U my_clz_for_loop: sll t2, s0, t1 # (1U << i) stored at t2 and t2, a0, t2 # x & (1U << i) beqz t2, count_leading_zero # if x & (1U << i) == 1 go back for loop j clz_finish count_leading_zero: addi t0, t0, 1 # otherwise, leading zero count++ addi t1, t1, -1 # i-- bgez t1, my_clz_for_loop # if i!=0 go to for loop otherwise end the function clz_finish: mv a0, t0 # count(t0) move to return register ret ``` :::danger Refine the label names, making them informative. ::: :::info I think I have refined the label name and made them more informative. If there are more specific areas for improvement, please let me know. ::: ### fp16_to_fp32 The `fp16_to_fp32` function converts a 16-bit half-precision floating-point number (`h`) into a 32-bit single-precision floating-point number. This is the breakdown of its functionality. 1. **Shift Half-Precision to Upper 32 Bits**: ```c const uint32_t w = (uint32_t) h << 16; ``` This shifts the 16-bit half-precision number `h` to the upper half of a 32-bit integer, preparing it for conversion. This creates space for the mantissa and exponent while initializing the lower 16 bits to zero. 2. **Extract the sign bit**: ```c const uint32_t sign = w & UINT32_C(0x80000000); ``` The sign bit is extracted from the shifted value w. It checks the most significant bit (bit 31) of the 32-bit word, which indicates the sign of the floating-point number. 3. **Extract Mantissa and Exponent**: ```c const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); ``` This operation clears the sign bit from `w`, allowing the function to isolate the mantissa and exponent bits. 4. **Calculate Renormalization Shift**: ```c uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; ``` The `my_clz` function is called to count the number of leading zeros in the mantissa and exponent. This determines how much the mantissa needs to be shifted for normalization. The shift is adjusted for normalized and denormalized numbers, where denormalized numbers will have a greater shift. 5. **Identify NaN and Infinity**: ```c const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); ``` This check identifies whether the half-precision number represents NaN (Not a Number) or infinity by examining the exponent bits. If the exponent is 15, the result indicates that the number is either NaN or infinity. 6. **Create Zero Mask**: ```c const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; ``` This creates a mask to check if the input half-precision number is zero. If nonsign equals zero, this results in a mask of all ones (0xFFFFFFFF); otherwise, it results in zero. 7. **Form the output**: ```c return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); ``` The final return statement combines the sign, normalized mantissa, adjusted exponent, and applies the previously calculated masks. ```c static inline uint32_t fp16_to_fp32(uint16_t h) { /* * Extends the 16-bit half-precision floating-point number to 32 bits * by shifting it to the upper half of a 32-bit word: * +---+-----+------------+-------------------+ * | S |EEEEE|MM MMMM MMMM|0000 0000 0000 0000| * +---+-----+------------+-------------------+ * Bits 31 26-30 16-25 0-15 * * S - sign bit, E - exponent bits, M - mantissa bits, 0 - zero bits. */ const uint32_t w = (uint32_t) h << 16; /* * Isolates the sign bit from the input number, placing it in the most * significant bit of a 32-bit word: * * +---+----------------------------------+ * | S |0000000 00000000 00000000 00000000| * +---+----------------------------------+ * Bits 31 0-31 */ const uint32_t sign = w & UINT32_C(0x80000000); /* * Extracts the mantissa and exponent from the input number, placing * them in bits 0-30 of the 32-bit word: * * +---+-----+------------+-------------------+ * | 0 |EEEEE|MM MMMM MMMM|0000 0000 0000 0000| * +---+-----+------------+-------------------+ * Bits 30 27-31 17-26 0-16 */ const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); /* * The renorm_shift variable indicates how many bits the mantissa * needs to be shifted to normalize the half-precision number. * For normalized numbers, renorm_shift will be 0. For denormalized * numbers, renorm_shift will be greater than 0. Shifting a * denormalized number will move the mantissa into the exponent, * normalizing it. */ uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; /* * If the half-precision number has an exponent of 15, adding a * specific value will cause overflow into bit 31, which converts * the upper 9 bits into ones. Thus: * inf_nan_mask == * 0x7F800000 if the half-precision number is * NaN or infinity (exponent of 15) * 0x00000000 otherwise */ const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); /* * If nonsign equals 0, subtracting 1 will cause overflow, setting * bit 31 to 1. Otherwise, bit 31 will be 0. Shifting this result * propagates bit 31 across all bits in zero_mask. Thus: * zero_mask == * 0xFFFFFFFF if the half-precision number is * zero (+0.0h or -0.0h) * 0x00000000 otherwise */ const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; /* * 1. Shifts nonsign left by renorm_shift to normalize it (for denormal * inputs). * 2. Shifts nonsign right by 3, adjusting the exponent to fit in the * 8-bit exponent field and moving the mantissa into the correct * position within the 23-bit mantissa field of the single-precision * format. * 3. Adds 0x70 to the exponent to account for the difference in bias * between half-precision and single-precision. * 4. Subtracts renorm_shift from the exponent to account for any * renormalization that occurred. * 5. ORs with inf_nan_mask to set the exponent to 0xFF if the input * was NaN or infinity. * 6. ANDs with the inverted zero_mask to set the mantissa and exponent * to zero if the input was zero. * 7. Combines everything with the sign bit of the input number. */ return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } ``` The corresponding RISC-V assembly are showing below. ```c fp16_to_fp32: # assume that the input parameter h is stored at a0 slli t0, a0, 16 # a0 left shift 16 bits li t1, 0x80000000 # hold mask 0x80000000 and s0, t1, t0 # extract sign bit li t1, 0x7FFFFFFF # hold mask 0x7FFFFFFF and t3, t1, t0 # extract the mantissa and exponent mv a0, t3 # move nonsign value to argu. reg. for function call jal ra, my_clz # jal to my_clz and assume return value store at a1 addi a1, a1, -5 # renorm_shift-5 bgt a1, zero, cont # if renorm_shift-5 > 0, cont. addi a1, zero, 0 # else set zero cont: li t4, 0x04000000 # hold mask 0x04000000 add t5, t3, t4 # add nonsign and mask srli t5, t5, 8 # right shift 8 bit li t4, 0x7F800000 # hold mask and t5, t5, t4 addi t6, t3, -1 # nonsign - 1 srli t6, t6, 31 # zero_mask li t1, 0xFFFFFFFF xor t6, t6, t1 sll t3, t3, 3 srli t2, t2, 3 li t1, 0x70 sub a1, t1, a1 slli a1, a1, 23 add t2, t2, a1 or t2, t2, t5 and t2, t2, t6 or a0, t2, s0 ret ``` ### Test The full RISC-V assembly of problem c. ```c .data test_1: .word 0xFF00 test_2: .word 0x3C00 test_3: .word 0xE010 str: .string "\nThe value converted from fp16 to fp32 is " .text main: la a0, str li a7, 4 ecall lw a0, test_1 jal ra, fp16_to_fp32 li a7, 1 ecall la a0, str li a7, 4 ecall lw a0, test_2 jal ra, fp16_to_fp32 li a7, 1 ecall la a0, str li a7, 4 ecall lw a0, test_3 jal ra, fp16_to_fp32 li a7, 1 ecall li a7, 10 ecall fp16_to_fp32: mv t0, a0 slli t0, t0, 16 li t1, 0x80000000 and t1, t1, t0 # t1 is the extract sign bit li t2, 0x7FFFFFFF # hold mask 0x7FFFFFFF and t2, t2, t0 # t2(nonsign) is the extract the mantissa and exponent my_clz: addi t3, zero, 0 # t3 hold the count addi t4, zero, 31 # init i=31 for for loop addi t5, zero, 1 # hold 1U my_clz_for_loop: sll t6, t5, t4 # (1U << i) stored at t6 and t6, t2, t6 # x & (1U << i) beqz t6, count_leading_zero # if x & (1U << i) == 1 go back for loop j clz_finish count_leading_zero: addi t3, t3, 1 addi t4, t4, -1 bgez t4, my_clz_for_loop clz_finish: addi t3, t3, -5 # renorm_shift-5 bgt t3, zero, cont # if renorm_shift-5 > 0, cont. addi t3, zero, 0 # else set zero cont: li t4, 0x04000000 # hold mask 0x04000000 add t5, t2, t4 # add nonsign and mask srli t5, t5, 8 # right shift 8 bit li t4, 0x7F800000 # hold mask and t5, t5, t4 # t5 hold inf_nan_mask addi t6, t2, -1 # nonsign - 1 srli t6, t6, 31 # t6 hold zero_mask li t4, 0xFFFFFFFF xor t6, t6, t4 # t6 hold ~zero_mask sll t2, t2, t3 srli t2, t2, 3 li t4, 0x70 sub t3, t4, t3 slli t3, t3, 23 add t2, t2, t3 or t2, t2, t5 and t2, t2, t6 or a0, t1, t2 ret ``` :::danger Show the full C source code corresponding to the original problem set. ::: The `my_clz` function calculates how many leading zeros are in the binary representation of a 32-bit unsigned integer (`uint32_t`). Based on this function, we can calculate the `Hamming distance` which will be introduced in later sections. The Hamming distance measures the number of differing bits between two binary values. By using a similar approach to bitwise operations, we can count how many bits differ between two integers, a concept closely related to the leading zero count demonstrated in this function. ### Motivation [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) is a metric used to measure the difference between two binary strings (or bit sequences) of equal length. It is defined as the number of positions at which the corresponding bits are different. For example, In AI-driven video generation (e.g., GANs or diffusion models), Hamming distance can be used as a metric to assess the similarity between different latent representations. By measuring how different the generated video frames are from the input or reference frames, it can guide optimization processes. Based on the above applications and the topic I am currently researching, I am quite interested in hamming distance. I will use `RISC-V assembly` to implement this topic, focusing on calculating Hamming distance. ## Problem Based on the above description of [Subject](Subject), the CLZ problem is similar to calculating hamming distance. [leetcode 461. Hamming Distance](https://leetcode.com/problems/hamming-distance/description/) > Description : The Hamming distance between two integers is the number of positions at which the corresponding bits are different. > > Given two integers x and y, return the Hamming distance between them. ``` Example : Input: x = 1, y = 4 Output: 2 Explanation: 1 (0 0 0 1) 4 (0 1 0 0) ↑ ↑ ``` The above arrows point to positions where the corresponding bits are different. ## Solution The Hamming distance between two integers is defined as the number of bit positions at which the corresponding bits are different. This can be calculated by performing a bitwise XOR operation on the two integers and then counting the number of '1's in the resulting binary number `diff`. Instead of directly counting all the '1's in `diff`, count the leading zeros using a function `my_clz`, which identifies how many irrelevant bits precede the first '1'. By shifting diff left by the number of leading zeros, we can eliminate these unnecessary bits and focus on the significant ones. Finally, iterate through the remaining bits of the shifted result to count the '1's, yielding the Hamming distance more efficiently by reducing the number of bits processed. ### C program ```c #include <stdio.h> #include <stdint.h> uint32_t test_11 = 0xFFFFFFFF; uint32_t test_12 = 0x00000000; uint32_t test_21 = 0x7FF00132; uint32_t test_22 = 0xCAE65324; uint32_t test_31 = 0x00000000; uint32_t test_32 = 0x00000000; static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } int hamming_distance(uint32_t x, uint32_t y) { uint32_t diff = x ^ y; int hamming_count = 0; // Shift out the leading zeros first int leading_zeros = my_clz(diff); diff <<= leading_zeros; // Shift left to remove leading zeros // Count remaining '1's in diff for (int i = 0; i < 32 - leading_zeros; i++) { if (diff & (1U << (31 - i))) hamming_count++; } return hamming_count; } int main(void){ printf("the hamming distance is : %d\n", hamming_distance(test_11, test_12)); printf("the hamming distance is : %d\n", hamming_distance(test_21, test_22)); printf("the hamming distance is : %d\n", hamming_distance(test_31, test_32)); } ``` :::danger You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking. ::: ### Assembly code ```c .data test_data_1: .word 0xFFFFFFFF, 0x00000000 # decimal is 4294967295 and 0, hamming distance is 32 test_data_2: .word 0x7FF00132, 0xCAE65324 # decimal is 2146435378 and 3404092196, hamming distance is 14 test_data_3: .word 0x00000000, 0x00000000 # decimal is 0 and 0, hamming distance is 0 print_string: .string "\nHamming Distance is " .text main: addi sp, sp, -32 # 3 variables need to be saved # store the data in stack lw t0, test_data_1 sw t0, 0(sp) lw t0, test_data_1+4 sw t0, 4(sp) lw t0, test_data_2 sw t0, 8(sp) lw t0, test_data_2+4 sw t0, 12(sp) lw t0, test_data_3 sw t0, 16(sp) lw t0, test_data_3+4 sw t0, 20(sp) # initialize main loop addi s3, zero, 3 # number of test cases main_loop: # Print results la a0, print_string # Load print string address li a7, 4 # System call for print integer ecall # calculate hamming distance for each pair of data lw a0, 0(sp) # load first number lw a1, 4(sp) # load second number jal ra, hamming_distance li a7, 1 # print integer ecall # print result of hd_cal (which is in a0) addi sp, sp, 8 # s2 : points to next test_data addi s3, s3, -1 # counter++ bnez s3, main_loop # Exit program addi sp, sp, 32 li a7, 10 ecall # hamming distance hamming_distance: addi sp, sp, -8 # Allocate memory for hamming_count on stack sw ra, 0(sp) # save return address xor s6, a0, a1 # diff = x ^ y li t4, 0 # hamming_count = 0 mv a0, s6 # move data to a0 and call my_clz jal ra, my_clz # jump and link to my_clz # a0 involve with leading zero mv t2, a0 # move leading zero from a0 to t2 sll s6, s6, t2 # remove leading zero # Calculate the number of remaining 1's in the diff addi t3, zero, 32 # 32 bits sub t2, t3, t2 # calculate 32 - leading zero (i) li a3, 1 # hold 1 hamming_count_loop: beqz t2, hamming_done # If t2 == 0, jump to done sub s7, t3, t2 # 32 - i sll t6, a3, s7 # 1 << (32-i) and t6, s6, t6 # t6 = diff & 1 beqz t6, skip_count # If t6 == 0, skip counting addi t4, t4, 1 # hamming_count++ skip_count: addi t2, t2, -1 # Decrease bit counter j hamming_count_loop # Repeat loop hamming_done: mv a0, t4 # Move hamming_count to a0 (return value) lw ra, 0(sp) # Restore return address addi sp, sp, 8 # Restore stack space ret # Return from the function # clz function my_clz: addi sp, sp, -8 # return count and hold 1U sw ra, 0(sp) # save return address mv s0, a0 # move input argument to saved register addi s1, zero, 0 # init count = 0 addi t0, zero, 31 # init i=31 addi s2, zero, 1 # hold 1U my_clz_loop: sll t1, s2, t0 # (1U << i) and t1, s0, t1 # x & (1U << i) beqz t1, count # if x & (1U << i) == 1 go back for loop j clz_finish count: addi s1, s1, 1 # otherwise, count++ addi t0, t0, -1 bgez t0, my_clz_loop # if i!=0 go to for loop otherwise end the function clz_finish: mv a0, s1 # count(s1) move to return register lw ra, 0(sp) # lw return address from 0(sp) addi sp, sp, 8 # stack pointer ret ``` :::danger Use fewer instructions. ::: ## Result :::danger Check list: 1. Did your test suite include the corner cases? 2. Can you test suite be operated via given [test driver](https://en.wikipedia.org/wiki/Test_driver)? 3. Can you check the report and break down the details? ::: :::info What does the content of the checklist mean? ::: ### test data 1 * `test_data_11` = `0xFFFFFFFF` ( `4294967295` in decimal ) * `test_data_12` = `0x00000000` ( `0` in decimal ) ### test data 2 * `test_data_21` = `0x7FF00132` ( `2146435378` in decimal ) * `test_data_22` = `0xCAE65324` ( `3404092196` in decimal ) ### test data 3 * `test_data_31` = `0x00000000` ( `0` in decimal ) * `test_data_32` = `0x00000000` ( `0` in decimal ) ### C program output Run the c program will obtain the following result. ```shell $ gcc -o hw hw1.c $ ./hw the hamming distance is : 32 the hamming distance is : 14 the hamming distance is : 0 ``` :::danger Do not use screenshots for plain text content, as this is inaccessible to visually impaired users. ::: ### RISC-V assembly output When Ripes is finished running, we can get the following result which is about the hamming distance and the execution information. **Result:** ```shell Hamming Distance is 32 Hamming Distance is 14 Hamming DIstance is 0 Program exited with code: 0 ``` **Execution info:** ```shell cycles: 1183 Instrs. retired: 831 CPI: 1.42 IPC: 0.702 Clock rate: 9.26hz ``` <div align='center'> <img src=https://hackmd.io/_uploads/BkwNxLYkJl.png> <img src=https://hackmd.io/_uploads/By4IeUKyke.png> </div> ## Analysis I test the code using [Ripes](https://github.com/mortbopet/Ripes) simulator. ### Pseudo instruction ```pseudocode! 00000000 <main>: 0: fe010113 addi x2 x2 -32 4: 10000297 auipc x5 0x10000 8: ffc2a283 lw x5 -4 x5 c: 00512023 sw x5 0 x2 10: 10000297 auipc x5 0x10000 14: ff42a283 lw x5 -12 x5 18: 00512223 sw x5 4 x2 1c: 10000297 auipc x5 0x10000 20: fec2a283 lw x5 -20 x5 24: 00512423 sw x5 8 x2 28: 10000297 auipc x5 0x10000 2c: fe42a283 lw x5 -28 x5 30: 00512623 sw x5 12 x2 34: 10000297 auipc x5 0x10000 38: fdc2a283 lw x5 -36 x5 3c: 00512823 sw x5 16 x2 40: 10000297 auipc x5 0x10000 44: fd42a283 lw x5 -44 x5 48: 00512a23 sw x5 20 x2 4c: 00300993 addi x19 x0 3 00000050 <main_loop>: 50: 10000517 auipc x10 0x10000 54: fc850513 addi x10 x10 -56 58: 00400893 addi x17 x0 4 5c: 00000073 ecall 60: 00012503 lw x10 0 x2 64: 00412583 lw x11 4 x2 68: 024000ef jal x1 36 <hamming_distance> 6c: 00100893 addi x17 x0 1 70: 00000073 ecall 74: 00810113 addi x2 x2 8 78: fff98993 addi x19 x19 -1 7c: fc099ae3 bne x19 x0 -44 <main_loop> 80: 00c10113 addi x2 x2 32 84: 00a00893 addi x17 x0 10 88: 00000073 ecall 0000008c <hamming_distance>: 8c: ff810113 addi x2 x2 -8 90: 00112023 sw x1 0 x2 94: 00b54b33 xor x22 x10 x11 98: 00000e93 addi x29 x0 0 9c: 000b0513 addi x10 x22 0 a0: 048000ef jal x1 72 <my_clz> a4: 00050393 addi x7 x10 0 a8: 007b1b33 sll x22 x22 x7 ac: 02000e13 addi x28 x0 32 b0: 407e03b3 sub x7 x28 x7 b4: 00100693 addi x13 x0 1 000000b8 <hamming_count_loop>: b8: 02038063 beq x7 x0 32 <hamming_done> bc: 407e0bb3 sub x23 x28 x7 c0: 01769fb3 sll x31 x13 x23 c4: 01fb7fb3 and x31 x22 x31 c8: 000f8463 beq x31 x0 8 <skip_count> cc: 001e8e93 addi x29 x29 1 000000d0 <skip_count>: d0: fff38393 addi x7 x7 -1 d4: fe5ff06f jal x0 -28 <hamming_count_loop> 000000d8 <hamming_done>: d8: 000e8513 addi x10 x29 0 dc: 00012083 lw x1 0 x2 e0: 00810113 addi x2 x2 8 e4: 00008067 jalr x0 x1 0 000000e8 <my_clz>: e8: ff810113 addi x2 x2 -8 ec: 00112023 sw x1 0 x2 f0: 00050413 addi x8 x10 0 f4: 00000493 addi x9 x0 0 f8: 01f00293 addi x5 x0 31 fc: 00100913 addi x18 x0 1 00000100 <my_clz_loop>: 100: 00591333 sll x6 x18 x5 104: 00647333 and x6 x8 x6 108: 00030463 beq x6 x0 8 <count> 10c: 0100006f jal x0 16 <clz_finish> 00000110 <count>: 110: 00148493 addi x9 x9 1 114: fff28293 addi x5 x5 -1 118: fe02d4e3 bge x5 x0 -24 <my_clz_loop> 0000011c <clz_finish>: 11c: 00048513 addi x10 x9 0 120: 00012083 lw x1 0 x2 124: 00810113 addi x2 x2 8 128: 00008067 jalr x0 x1 0 ``` ### Break down the details In the first case, we obtain the result, which is 32. This is because the two numbers being compared are `0xFFFFFFFF` and `0x00000000`, meaning all 32 bits differ, resulting in a Hamming distance of `32`. The third test case is also crucial because it represents the scenario where both inputs are zero. This case is significant for a couple of reason: 1. **Edge case** handling : It tests how the implementation handles the edge case of having no differing bits. Since the Hamming distance is defined as the number of differing bits between two binary strings, an input of all zeros should return a Hamming distance of `zero`. This helps ensure that the algorithm correctly identifies cases where there are no bits to compare. 2. **Undefined Behavior** with `__builtin_clz`: This case also emphasizes the importance of handling edge cases correctly in relation to using `__builtin_clz`. According to the specifications, using `__builtin_clz` on zero leads to undefined behavior. By including this test case, we can validate that the algorithm gracefully handles such situations without causing unexpected results or crashes. It’s important to note, however, that when calculating the Hamming distance or when using the `__builtin_clz` function (which counts leading zeros), special care must be taken with inputs like 0. The behavior of `__builtin_clz(0)` is technically undefined according to the [GCC documentation](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html). This is because the function expects the input to be a non-zero value, and on most systems, calling it with 0 will return a result based on system-specific behavior, which may not always be what is expected. In some cases, it might return 32, but this should not be relied upon as a standard behavior. Therefore, if handling zero is a possibility, it’s important to add specific checks in the code to avoid invoking undefined behavior. > According to the [GCC documentation](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html), the `__builtin_clz` function counts the number of leading zeros in an unsigned integer. However, if the input value is zero, the behavior is undefined. >This is because the result of __builtin_clz(0) would theoretically require counting all the bits, but since the number of leading zeros for 0 is ambiguous in a standard binary representation (as all bits are zeros), compilers do not define a result for this case. This leads to platform-dependent outcomes or, in some cases, crashes. Furthermore, the `hamming_distance` function plays a crucial role in determining the Hamming distance through the following steps. 1. Calculate the **XOR**: The function accepts two integer values (`a0` and `a1`), which represent the two binary numbers for which the Hamming distance is to be calculated. 2. **Count Leading Zeros**: To facilitate the counting of `1`s, the function determines how many leading zeros are present in the result of the XOR operation. This is achieved by calling the `my_clz` function, which implements the count leading zeros (`clz`) operation. 3. **Shift to Remove Leading Zeros**: the function shifts the XOR result to the left, effectively discarding the leading zeros. This prepares the number for counting the remaining `1`s in its binary representation. 4. **Count Remaining `1`s**: A loop iterates through the remaining bits, checking each bit of the modified XOR result. Each time a `1` is encountered, a counter (`t4`) is incremented. The loop continues until all bits are checked. ### 5-stage pipelined processor Ripes provide different processors to run the code. Then I choose 5-stage processor to run my program. ![image](https://hackmd.io/_uploads/BkWOaCl1yg.png) I use `addi x29, x29, 1` in the psedoinstruction, which address is at `0xCC` for example and analyze how the processor operates the instruction in different stages. 1. **IF (Instruction Fetch)** * Instruction is being fetched from the memory. * Update the PC to the next sequential instruction by adding 4 (because each instruction is 4 bytes) to the PC. 2. **ID (Instruction Decode and register read)** * Decode the instruction and read the registers corresponding to register source specifiers from the register file 3. **EX (Execute)** * The ALU performs one of the following functions on the operands prepared in the prior cycle * Memory reference (load/store)—The ALU adds the base register and the offset to form the effective address. * Register-Register ALU instruction—The ALU performs the operation specified by the ALU opcode on the values read from the register file. * Register-Immediate ALU instruction—The ALU performs the operation specified by the ALU opcode on the first value read from the register file and the sign-extended immediate. * Conditional branch—Determine whether the condition is true. 4. **MEM (Memory Access)** * If the instruction is a load, the memory does a read using the effective address computed in the previous cycle. * If it is a store, then the memory writes the data from the second register read from the register file using the effective address. 5. **WB (Write Back to registers)** * Write the result into the register file * From the memory system (for a load) * From the ALU (for an Register-Register ALU instruction) --- Take `addi x29, x29, 1` in the psedoinstruction, which address is at `0xCC` and see how it works in each stage. **<font size=4>1. IF</font>** ![image](https://hackmd.io/_uploads/BJG2pClJke.png) First, the program counter ( `PC` ) is `0x000000cc`, it means that the next instruction to be executed is located at memory address `0x000000cc`. The CPU will **fetch the instruction** stored at this address from memory as the next step in the instruction cycle. We can find `addi` instruction and get its function code and opcode from [ The RISC-V Instruction Set Manual (p.104)](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf) : ![圖片](https://hackmd.io/_uploads/rJfuBJZ1Je.png) And then it will be transformed to `0x001e8e93` as a machine code. | imm[11:0] | rs1 | funct3 | rd | opcode | | --------- | --- | --- | -- | ------- | | 000000000001 | 11101(x29) | 000 | 11101(x29) | 0010011 | Simultaneously, `PC`(`0x000000cc`) will be added by 4 and obtain the next instruction address which is `0x000000d0`. **<font size=4>2. ID</font>** ![圖片](https://hackmd.io/_uploads/SyxteybJkg.png) In this stage, the decoder decodes the machine code(`0x001e8e93`) into four parts. * `opcode` field which is `ADDI` * `R1 idx` which is `0x1d` (x29) * `Wr idx` which is `0x1d` (x29) * `R2 idx` is not used because this is an I-type instruction. Decoder will get the `imm[11:0]` and extend to a 32-bit number for calculation. This 12-bit value is sign-extended to 32 bits, meaning the sign bit (the 12th bit) is propagated across the remaining upper bits . > the explanation about the 12-bit sign-extension can be found in the [RISC-V manual](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf), specifically in Chapter 2: "RV32I Base Integer Instruction Set", under Section 2.4 Immediate Encoding. However, the `R1 idx` will be sent to the `Registers` to obtain the value. * `Reg 1` = `0x00000000` **<font size=4>3. EX</font>** ![圖片](https://hackmd.io/_uploads/Skzjey-yye.png) In this stage, ALU sums the input that will be chosen by 2-level multiplexers(`MUXs`) : * choice of `op1` : `Reg 1` go through 2 MUXs as the first input of ALU. * choice of `op2` : `imm.` go through 2 MUXs as the second input of ALU. In addition, `Reg 2` is not used in this situation. So, the input and output of ALU are : * **Input** * `op1` = `0x00000000` (value in x29) * `op2` = `0x00000001` (imm. value) * **Output** * `ALU_RES` = `0x00000001` Meanwhile, the `Wr idx` (`0x1d`) and the `PC` (`0x000000d0`) will be sent into next stage. **<font size=4>4. MEM</font>** ![圖片](https://hackmd.io/_uploads/SktRl1Z1Jx.png) In this stage, I-type format doesn't need to load data from memory, so the `Wr en` control signal is set to 0. `PC`(`0x000000d0`), `ALU_RES` (`0x00000001`) and the `Wr idx`(`0x1d`) will go through to the next stage. **<font size=4>5. WB</font>** ![圖片](https://hackmd.io/_uploads/BkXWbkbykl.png) In this stage, `ALU_RES`(`0x00000001`) will write back to the target register `Wr idx`(`0x1d`). Before write back to the register, `PC`(0x000000d0), `ALU_RES` and `mem_read_out` will go through a MUXs to determine which data will be writen back to the register. ### Memory update In the two images above, it can be seen that the value stored in register `x29` has changed from `0x00000000` to `0x00000001`. We also know that the CPU updates the register data only after completing the Write Back (`WB`) stage. <div align="center"> <img src="https://hackmd.io/_uploads/SyxCa_lZk1g.png" width=350> <img src="https://hackmd.io/_uploads/S1u2seZkyg.png" width=350> </div> ![圖片](https://hackmd.io/_uploads/S1zBFxWyJe.png)