# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by <[`RBing`](https://www.github.com/RBing123/computer_architecture.git)>(Rong-Bing, Fang)
:::danger
Be aware of the headings with Markdown syntax.
:::
## Abstract
Problem C in the quiz1 defines a set of functions to convert a 16-bit half-precision floating-point number (`fp16`) to a 32-bit single-precision floating-point number (`fp32`). The core function, `fp16_to_fp32`, shifts the 16-bit floating-point input into the upper half of a 32-bit word, separates the sign bit, and normalizes the mantissa and exponent. It then adjusts for the differences in exponent bias between half-precision and single-precision formats. The function handles special cases such as `denormalized numbers`, `zero`, `NaN`, and `infinity`, ensuring proper conversion by setting the correct bits for each scenario. The helper function my_clz is used to count leading zeros in the exponent and normalize denormalized numbers.
The following illustration is about each function and its corresponding RISC-Vassembly in Problem C of the quiz1.
### fabsf
The function `fabsf` computes the absolute value of a floating-point number by manipulatin its binary representation. It reads the bitwise representation of the float `x` as a 32-bit integer, clears the sign bit ( `MSB` ) to remove any negative sign, and them writes the modified bits back into the float.
The IEEE 754 standard defines the format for representing floating-point numbers in binary. The representation of a floating-point number is divided into three main parts:
| sign(1 bit) | exponent(8 bits) | mantissa(23 bits) |
| -------- | -------- | -------- |
| 0 | 00000000 | 00000000000000000000000 |
**1. Sign Bit**
This single bit indicates the sign of the number:
* `0` for positive number
* `1` for negative number
**2. Exponent**
The exponent is stored in a biased format, which means a bias value is subtracted from the actual exponent to allow both positive and negative exponents to be represented without using a sign bit.
* for `float` (32-bit), the bias is `127`
* for `double` (64-bit), the bias is `1023`
**3. Mantissa**
Also known as the fraction or significand, this part stores the actual digits of the number. For normalized numbers, an implicit leading 1 is assumed before the binary point, so only the fractional part is stored.
```c
static inline float fabsf(float x) {
uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value
x = *(float *)&i; // Write the modified bits back into the float
return x;
}
```
The corresponding RISC-V assembly are showing below.
```c
fabsf:
# assume that the input parameter x is stored at a0
# hold the mask 0x7FFFFFFF in t0
li t0, 0x7FFFFFFFF
and a0, a0, t0 # bitwise 'and' with x(a0) and mask(t0)
ret
```
:::danger
Use headnings instead of font tags in HTML.
:::
### my_clz
The `my_clz` function counts the number of leading zeros in a 32-bit unsigned integer `x`.
This is the breakdown of its functionality.
1. **Initialization**: It initializes a counter `count` to zero, which will keep track of the number of leading zeros.
2. **Loop through Bits**: The function uses a for loop that iterates from [the most significant bit](https://en.wikipedia.org/wiki/Bit_numbering#Most_significant_bit) (`MSB`) to [the least siginificant bit]((https://en.wikipedia.org/wiki/Bit_numbering#Most_significant_bit)) (`LSB`). In each iteration, it checks if the current bit is set.
3. **Bitwise Operation**: The expression `x & (1U << i)` checks if the `i`-th bit of `x` is `1`. If it finds a `1`, the loop breaks, meaning it has encountered the first non-zeor bit.
4. **Counting Leading Zeros**: If the current bit is `0`, it increments the `count`. This continues until a `1` is found or all bits have been checked.
5. **Return**: The function returns the total count of leading zeros.
```c
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
```
The corresponding RISC-V assembly are showing below.
```c
my_clz:
# assume that input parameter x is stored at a0
addi t0, zero, 0 # init count = 0
addi t1, zero, 31 # init i=31 for for_loop
addi s0, zero, 1 # hold 1U
my_clz_for_loop:
sll t2, s0, t1 # (1U << i) stored at t2
and t2, a0, t2 # x & (1U << i)
beqz t2, count_leading_zero # if x & (1U << i) == 1 go back for loop
j clz_finish
count_leading_zero:
addi t0, t0, 1 # otherwise, leading zero count++
addi t1, t1, -1 # i--
bgez t1, my_clz_for_loop # if i!=0 go to for loop otherwise end the function
clz_finish:
mv a0, t0 # count(t0) move to return register
ret
```
:::danger
Refine the label names, making them informative.
:::
:::info
I think I have refined the label name and made them more informative.
If there are more specific areas for improvement, please let me know.
:::
### fp16_to_fp32
The `fp16_to_fp32` function converts a 16-bit half-precision floating-point number (`h`) into a 32-bit single-precision floating-point number.
This is the breakdown of its functionality.
1. **Shift Half-Precision to Upper 32 Bits**:
```c
const uint32_t w = (uint32_t) h << 16;
```
This shifts the 16-bit half-precision number `h` to the upper half of a 32-bit integer, preparing it for conversion. This creates space for the mantissa and exponent while initializing the lower 16 bits to zero.
2. **Extract the sign bit**:
```c
const uint32_t sign = w & UINT32_C(0x80000000);
```
The sign bit is extracted from the shifted value w. It checks the most significant bit (bit 31) of the 32-bit word, which indicates the sign of the floating-point number.
3. **Extract Mantissa and Exponent**:
```c
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
```
This operation clears the sign bit from `w`, allowing the function to isolate the mantissa and exponent bits.
4. **Calculate Renormalization Shift**:
```c
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
```
The `my_clz` function is called to count the number of leading zeros in the mantissa and exponent. This determines how much the mantissa needs to be shifted for normalization. The shift is adjusted for normalized and denormalized numbers, where denormalized numbers will have a greater shift.
5. **Identify NaN and Infinity**:
```c
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000);
```
This check identifies whether the half-precision number represents NaN (Not a Number) or infinity by examining the exponent bits. If the exponent is 15, the result indicates that the number is either NaN or infinity.
6. **Create Zero Mask**:
```c
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
```
This creates a mask to check if the input half-precision number is zero. If nonsign equals zero, this results in a mask of all ones (0xFFFFFFFF); otherwise, it results in zero.
7. **Form the output**:
```c
return sign | ((((nonsign << renorm_shift >> 3) +
((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
```
The final return statement combines the sign, normalized mantissa, adjusted exponent, and applies the previously calculated masks.
```c
static inline uint32_t fp16_to_fp32(uint16_t h) {
/*
* Extends the 16-bit half-precision floating-point number to 32 bits
* by shifting it to the upper half of a 32-bit word:
* +---+-----+------------+-------------------+
* | S |EEEEE|MM MMMM MMMM|0000 0000 0000 0000|
* +---+-----+------------+-------------------+
* Bits 31 26-30 16-25 0-15
*
* S - sign bit, E - exponent bits, M - mantissa bits, 0 - zero bits.
*/
const uint32_t w = (uint32_t) h << 16;
/*
* Isolates the sign bit from the input number, placing it in the most
* significant bit of a 32-bit word:
*
* +---+----------------------------------+
* | S |0000000 00000000 00000000 00000000|
* +---+----------------------------------+
* Bits 31 0-31
*/
const uint32_t sign = w & UINT32_C(0x80000000);
/*
* Extracts the mantissa and exponent from the input number, placing
* them in bits 0-30 of the 32-bit word:
*
* +---+-----+------------+-------------------+
* | 0 |EEEEE|MM MMMM MMMM|0000 0000 0000 0000|
* +---+-----+------------+-------------------+
* Bits 30 27-31 17-26 0-16
*/
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
/*
* The renorm_shift variable indicates how many bits the mantissa
* needs to be shifted to normalize the half-precision number.
* For normalized numbers, renorm_shift will be 0. For denormalized
* numbers, renorm_shift will be greater than 0. Shifting a
* denormalized number will move the mantissa into the exponent,
* normalizing it.
*/
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
/*
* If the half-precision number has an exponent of 15, adding a
* specific value will cause overflow into bit 31, which converts
* the upper 9 bits into ones. Thus:
* inf_nan_mask ==
* 0x7F800000 if the half-precision number is
* NaN or infinity (exponent of 15)
* 0x00000000 otherwise
*/
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) &
INT32_C(0x7F800000);
/*
* If nonsign equals 0, subtracting 1 will cause overflow, setting
* bit 31 to 1. Otherwise, bit 31 will be 0. Shifting this result
* propagates bit 31 across all bits in zero_mask. Thus:
* zero_mask ==
* 0xFFFFFFFF if the half-precision number is
* zero (+0.0h or -0.0h)
* 0x00000000 otherwise
*/
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
/*
* 1. Shifts nonsign left by renorm_shift to normalize it (for denormal
* inputs).
* 2. Shifts nonsign right by 3, adjusting the exponent to fit in the
* 8-bit exponent field and moving the mantissa into the correct
* position within the 23-bit mantissa field of the single-precision
* format.
* 3. Adds 0x70 to the exponent to account for the difference in bias
* between half-precision and single-precision.
* 4. Subtracts renorm_shift from the exponent to account for any
* renormalization that occurred.
* 5. ORs with inf_nan_mask to set the exponent to 0xFF if the input
* was NaN or infinity.
* 6. ANDs with the inverted zero_mask to set the mantissa and exponent
* to zero if the input was zero.
* 7. Combines everything with the sign bit of the input number.
*/
return sign | ((((nonsign << renorm_shift >> 3) +
((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}
```
The corresponding RISC-V assembly are showing below.
```c
fp16_to_fp32:
# assume that the input parameter h is stored at a0
slli t0, a0, 16 # a0 left shift 16 bits
li t1, 0x80000000 # hold mask 0x80000000
and s0, t1, t0 # extract sign bit
li t1, 0x7FFFFFFF # hold mask 0x7FFFFFFF
and t3, t1, t0 # extract the mantissa and exponent
mv a0, t3 # move nonsign value to argu. reg. for function call
jal ra, my_clz # jal to my_clz and assume return value store at a1
addi a1, a1, -5 # renorm_shift-5
bgt a1, zero, cont # if renorm_shift-5 > 0, cont.
addi a1, zero, 0 # else set zero
cont:
li t4, 0x04000000 # hold mask 0x04000000
add t5, t3, t4 # add nonsign and mask
srli t5, t5, 8 # right shift 8 bit
li t4, 0x7F800000 # hold mask
and t5, t5, t4
addi t6, t3, -1 # nonsign - 1
srli t6, t6, 31 # zero_mask
li t1, 0xFFFFFFFF
xor t6, t6, t1
sll t3, t3, 3
srli t2, t2, 3
li t1, 0x70
sub a1, t1, a1
slli a1, a1, 23
add t2, t2, a1
or t2, t2, t5
and t2, t2, t6
or a0, t2, s0
ret
```
### Test
The full RISC-V assembly of problem c.
```c
.data
test_1: .word 0xFF00
test_2: .word 0x3C00
test_3: .word 0xE010
str: .string "\nThe value converted from fp16 to fp32 is "
.text
main:
la a0, str
li a7, 4
ecall
lw a0, test_1
jal ra, fp16_to_fp32
li a7, 1
ecall
la a0, str
li a7, 4
ecall
lw a0, test_2
jal ra, fp16_to_fp32
li a7, 1
ecall
la a0, str
li a7, 4
ecall
lw a0, test_3
jal ra, fp16_to_fp32
li a7, 1
ecall
li a7, 10
ecall
fp16_to_fp32:
mv t0, a0
slli t0, t0, 16
li t1, 0x80000000
and t1, t1, t0 # t1 is the extract sign bit
li t2, 0x7FFFFFFF # hold mask 0x7FFFFFFF
and t2, t2, t0 # t2(nonsign) is the extract the mantissa and exponent
my_clz:
addi t3, zero, 0 # t3 hold the count
addi t4, zero, 31 # init i=31 for for loop
addi t5, zero, 1 # hold 1U
my_clz_for_loop:
sll t6, t5, t4 # (1U << i) stored at t6
and t6, t2, t6 # x & (1U << i)
beqz t6, count_leading_zero # if x & (1U << i) == 1 go back for loop
j clz_finish
count_leading_zero:
addi t3, t3, 1
addi t4, t4, -1
bgez t4, my_clz_for_loop
clz_finish:
addi t3, t3, -5 # renorm_shift-5
bgt t3, zero, cont # if renorm_shift-5 > 0, cont.
addi t3, zero, 0 # else set zero
cont:
li t4, 0x04000000 # hold mask 0x04000000
add t5, t2, t4 # add nonsign and mask
srli t5, t5, 8 # right shift 8 bit
li t4, 0x7F800000 # hold mask
and t5, t5, t4 # t5 hold inf_nan_mask
addi t6, t2, -1 # nonsign - 1
srli t6, t6, 31 # t6 hold zero_mask
li t4, 0xFFFFFFFF
xor t6, t6, t4 # t6 hold ~zero_mask
sll t2, t2, t3
srli t2, t2, 3
li t4, 0x70
sub t3, t4, t3
slli t3, t3, 23
add t2, t2, t3
or t2, t2, t5
and t2, t2, t6
or a0, t1, t2
ret
```
:::danger
Show the full C source code corresponding to the original problem set.
:::
The `my_clz` function calculates how many leading zeros are in the binary representation of a 32-bit unsigned integer (`uint32_t`).
Based on this function, we can calculate the `Hamming distance` which will be introduced in later sections. The Hamming distance measures the number of differing bits between two binary values. By using a similar approach to bitwise operations, we can count how many bits differ between two integers, a concept closely related to the leading zero count demonstrated in this function.
### Motivation
[Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) is a metric used to measure the difference between two binary strings (or bit sequences) of equal length. It is defined as the number of positions at which the corresponding bits are different.
For example, In AI-driven video generation (e.g., GANs or diffusion models), Hamming distance can be used as a metric to assess the similarity between different latent representations. By measuring how different the generated video frames are from the input or reference frames, it can guide optimization processes.
Based on the above applications and the topic I am currently researching, I am quite interested in hamming distance. I will use `RISC-V assembly` to implement this topic, focusing on calculating Hamming distance.
## Problem
Based on the above description of [Subject](Subject), the CLZ problem is similar to calculating hamming distance.
[leetcode 461. Hamming Distance](https://leetcode.com/problems/hamming-distance/description/)
> Description : The Hamming distance between two integers is the number of positions at which the corresponding bits are different.
>
> Given two integers x and y, return the Hamming distance between them.
```
Example :
Input: x = 1, y = 4
Output: 2
Explanation:
1 (0 0 0 1)
4 (0 1 0 0)
↑ ↑
```
The above arrows point to positions where the corresponding bits are different.
## Solution
The Hamming distance between two integers is defined as the number of bit positions at which the corresponding bits are different. This can be calculated by performing a bitwise XOR operation on the two integers and then counting the number of '1's in the resulting binary number `diff`. Instead of directly counting all the '1's in `diff`, count the leading zeros using a function `my_clz`, which identifies how many irrelevant bits precede the first '1'. By shifting diff left by the number of leading zeros, we can eliminate these unnecessary bits and focus on the significant ones. Finally, iterate through the remaining bits of the shifted result to count the '1's, yielding the Hamming distance more efficiently by reducing the number of bits processed.
### C program
```c
#include <stdio.h>
#include <stdint.h>
uint32_t test_11 = 0xFFFFFFFF;
uint32_t test_12 = 0x00000000;
uint32_t test_21 = 0x7FF00132;
uint32_t test_22 = 0xCAE65324;
uint32_t test_31 = 0x00000000;
uint32_t test_32 = 0x00000000;
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
int hamming_distance(uint32_t x, uint32_t y) {
uint32_t diff = x ^ y;
int hamming_count = 0;
// Shift out the leading zeros first
int leading_zeros = my_clz(diff);
diff <<= leading_zeros; // Shift left to remove leading zeros
// Count remaining '1's in diff
for (int i = 0; i < 32 - leading_zeros; i++) {
if (diff & (1U << (31 - i)))
hamming_count++;
}
return hamming_count;
}
int main(void){
printf("the hamming distance is : %d\n", hamming_distance(test_11, test_12));
printf("the hamming distance is : %d\n", hamming_distance(test_21, test_22));
printf("the hamming distance is : %d\n", hamming_distance(test_31, test_32));
}
```
:::danger
You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking.
:::
### Assembly code
```c
.data
test_data_1: .word 0xFFFFFFFF, 0x00000000 # decimal is 4294967295 and 0, hamming distance is 32
test_data_2: .word 0x7FF00132, 0xCAE65324 # decimal is 2146435378 and 3404092196, hamming distance is 14
test_data_3: .word 0x00000000, 0x00000000 # decimal is 0 and 0, hamming distance is 0
print_string: .string "\nHamming Distance is "
.text
main:
addi sp, sp, -32 # 3 variables need to be saved
# store the data in stack
lw t0, test_data_1
sw t0, 0(sp)
lw t0, test_data_1+4
sw t0, 4(sp)
lw t0, test_data_2
sw t0, 8(sp)
lw t0, test_data_2+4
sw t0, 12(sp)
lw t0, test_data_3
sw t0, 16(sp)
lw t0, test_data_3+4
sw t0, 20(sp)
# initialize main loop
addi s3, zero, 3 # number of test cases
main_loop:
# Print results
la a0, print_string # Load print string address
li a7, 4 # System call for print integer
ecall
# calculate hamming distance for each pair of data
lw a0, 0(sp) # load first number
lw a1, 4(sp) # load second number
jal ra, hamming_distance
li a7, 1 # print integer
ecall # print result of hd_cal (which is in a0)
addi sp, sp, 8 # s2 : points to next test_data
addi s3, s3, -1 # counter++
bnez s3, main_loop
# Exit program
addi sp, sp, 32
li a7, 10
ecall
# hamming distance
hamming_distance:
addi sp, sp, -8 # Allocate memory for hamming_count on stack
sw ra, 0(sp) # save return address
xor s6, a0, a1 # diff = x ^ y
li t4, 0 # hamming_count = 0
mv a0, s6 # move data to a0 and call my_clz
jal ra, my_clz # jump and link to my_clz
# a0 involve with leading zero
mv t2, a0 # move leading zero from a0 to t2
sll s6, s6, t2 # remove leading zero
# Calculate the number of remaining 1's in the diff
addi t3, zero, 32 # 32 bits
sub t2, t3, t2 # calculate 32 - leading zero (i)
li a3, 1 # hold 1
hamming_count_loop:
beqz t2, hamming_done # If t2 == 0, jump to done
sub s7, t3, t2 # 32 - i
sll t6, a3, s7 # 1 << (32-i)
and t6, s6, t6 # t6 = diff & 1
beqz t6, skip_count # If t6 == 0, skip counting
addi t4, t4, 1 # hamming_count++
skip_count:
addi t2, t2, -1 # Decrease bit counter
j hamming_count_loop # Repeat loop
hamming_done:
mv a0, t4 # Move hamming_count to a0 (return value)
lw ra, 0(sp) # Restore return address
addi sp, sp, 8 # Restore stack space
ret # Return from the function
# clz function
my_clz:
addi sp, sp, -8 # return count and hold 1U
sw ra, 0(sp) # save return address
mv s0, a0 # move input argument to saved register
addi s1, zero, 0 # init count = 0
addi t0, zero, 31 # init i=31
addi s2, zero, 1 # hold 1U
my_clz_loop:
sll t1, s2, t0 # (1U << i)
and t1, s0, t1 # x & (1U << i)
beqz t1, count # if x & (1U << i) == 1 go back for loop
j clz_finish
count:
addi s1, s1, 1 # otherwise, count++
addi t0, t0, -1
bgez t0, my_clz_loop # if i!=0 go to for loop otherwise end the function
clz_finish:
mv a0, s1 # count(s1) move to return register
lw ra, 0(sp) # lw return address from 0(sp)
addi sp, sp, 8 # stack pointer
ret
```
:::danger
Use fewer instructions.
:::
## Result
:::danger
Check list:
1. Did your test suite include the corner cases?
2. Can you test suite be operated via given [test driver](https://en.wikipedia.org/wiki/Test_driver)?
3. Can you check the report and break down the details?
:::
:::info
What does the content of the checklist mean?
:::
### test data 1
* `test_data_11` = `0xFFFFFFFF` ( `4294967295` in decimal )
* `test_data_12` = `0x00000000` ( `0` in decimal )
### test data 2
* `test_data_21` = `0x7FF00132` ( `2146435378` in decimal )
* `test_data_22` = `0xCAE65324` ( `3404092196` in decimal )
### test data 3
* `test_data_31` = `0x00000000` ( `0` in decimal )
* `test_data_32` = `0x00000000` ( `0` in decimal )
### C program output
Run the c program will obtain the following result.
```shell
$ gcc -o hw hw1.c
$ ./hw
the hamming distance is : 32
the hamming distance is : 14
the hamming distance is : 0
```
:::danger
Do not use screenshots for plain text content, as this is inaccessible to visually impaired users.
:::
### RISC-V assembly output
When Ripes is finished running, we can get the following result which is about the hamming distance and the execution information.
**Result:**
```shell
Hamming Distance is 32
Hamming Distance is 14
Hamming DIstance is 0
Program exited with code: 0
```
**Execution info:**
```shell
cycles: 1183
Instrs. retired: 831
CPI: 1.42
IPC: 0.702
Clock rate: 9.26hz
```
<div align='center'>
<img src=https://hackmd.io/_uploads/BkwNxLYkJl.png>
<img src=https://hackmd.io/_uploads/By4IeUKyke.png>
</div>
## Analysis
I test the code using [Ripes](https://github.com/mortbopet/Ripes) simulator.
### Pseudo instruction
```pseudocode!
00000000 <main>:
0: fe010113 addi x2 x2 -32
4: 10000297 auipc x5 0x10000
8: ffc2a283 lw x5 -4 x5
c: 00512023 sw x5 0 x2
10: 10000297 auipc x5 0x10000
14: ff42a283 lw x5 -12 x5
18: 00512223 sw x5 4 x2
1c: 10000297 auipc x5 0x10000
20: fec2a283 lw x5 -20 x5
24: 00512423 sw x5 8 x2
28: 10000297 auipc x5 0x10000
2c: fe42a283 lw x5 -28 x5
30: 00512623 sw x5 12 x2
34: 10000297 auipc x5 0x10000
38: fdc2a283 lw x5 -36 x5
3c: 00512823 sw x5 16 x2
40: 10000297 auipc x5 0x10000
44: fd42a283 lw x5 -44 x5
48: 00512a23 sw x5 20 x2
4c: 00300993 addi x19 x0 3
00000050 <main_loop>:
50: 10000517 auipc x10 0x10000
54: fc850513 addi x10 x10 -56
58: 00400893 addi x17 x0 4
5c: 00000073 ecall
60: 00012503 lw x10 0 x2
64: 00412583 lw x11 4 x2
68: 024000ef jal x1 36 <hamming_distance>
6c: 00100893 addi x17 x0 1
70: 00000073 ecall
74: 00810113 addi x2 x2 8
78: fff98993 addi x19 x19 -1
7c: fc099ae3 bne x19 x0 -44 <main_loop>
80: 00c10113 addi x2 x2 32
84: 00a00893 addi x17 x0 10
88: 00000073 ecall
0000008c <hamming_distance>:
8c: ff810113 addi x2 x2 -8
90: 00112023 sw x1 0 x2
94: 00b54b33 xor x22 x10 x11
98: 00000e93 addi x29 x0 0
9c: 000b0513 addi x10 x22 0
a0: 048000ef jal x1 72 <my_clz>
a4: 00050393 addi x7 x10 0
a8: 007b1b33 sll x22 x22 x7
ac: 02000e13 addi x28 x0 32
b0: 407e03b3 sub x7 x28 x7
b4: 00100693 addi x13 x0 1
000000b8 <hamming_count_loop>:
b8: 02038063 beq x7 x0 32 <hamming_done>
bc: 407e0bb3 sub x23 x28 x7
c0: 01769fb3 sll x31 x13 x23
c4: 01fb7fb3 and x31 x22 x31
c8: 000f8463 beq x31 x0 8 <skip_count>
cc: 001e8e93 addi x29 x29 1
000000d0 <skip_count>:
d0: fff38393 addi x7 x7 -1
d4: fe5ff06f jal x0 -28 <hamming_count_loop>
000000d8 <hamming_done>:
d8: 000e8513 addi x10 x29 0
dc: 00012083 lw x1 0 x2
e0: 00810113 addi x2 x2 8
e4: 00008067 jalr x0 x1 0
000000e8 <my_clz>:
e8: ff810113 addi x2 x2 -8
ec: 00112023 sw x1 0 x2
f0: 00050413 addi x8 x10 0
f4: 00000493 addi x9 x0 0
f8: 01f00293 addi x5 x0 31
fc: 00100913 addi x18 x0 1
00000100 <my_clz_loop>:
100: 00591333 sll x6 x18 x5
104: 00647333 and x6 x8 x6
108: 00030463 beq x6 x0 8 <count>
10c: 0100006f jal x0 16 <clz_finish>
00000110 <count>:
110: 00148493 addi x9 x9 1
114: fff28293 addi x5 x5 -1
118: fe02d4e3 bge x5 x0 -24 <my_clz_loop>
0000011c <clz_finish>:
11c: 00048513 addi x10 x9 0
120: 00012083 lw x1 0 x2
124: 00810113 addi x2 x2 8
128: 00008067 jalr x0 x1 0
```
### Break down the details
In the first case, we obtain the result, which is 32. This is because the two numbers being compared are `0xFFFFFFFF` and `0x00000000`, meaning all 32 bits differ, resulting in a Hamming distance of `32`. The third test case is also crucial because it represents the scenario where both inputs are zero. This case is significant for a couple of reason:
1. **Edge case** handling : It tests how the implementation handles the edge case of having no differing bits. Since the Hamming distance is defined as the number of differing bits between two binary strings, an input of all zeros should return a Hamming distance of `zero`. This helps ensure that the algorithm correctly identifies cases where there are no bits to compare.
2. **Undefined Behavior** with `__builtin_clz`: This case also emphasizes the importance of handling edge cases correctly in relation to using `__builtin_clz`. According to the specifications, using `__builtin_clz` on zero leads to undefined behavior. By including this test case, we can validate that the algorithm gracefully handles such situations without causing unexpected results or crashes.
It’s important to note, however, that when calculating the Hamming distance or when using the `__builtin_clz` function (which counts leading zeros), special care must be taken with inputs like 0. The behavior of `__builtin_clz(0)` is technically undefined according to the [GCC documentation](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html). This is because the function expects the input to be a non-zero value, and on most systems, calling it with 0 will return a result based on system-specific behavior, which may not always be what is expected. In some cases, it might return 32, but this should not be relied upon as a standard behavior. Therefore, if handling zero is a possibility, it’s important to add specific checks in the code to avoid invoking undefined behavior.
> According to the [GCC documentation](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html), the `__builtin_clz` function counts the number of leading zeros in an unsigned integer. However, if the input value is zero, the behavior is undefined.
>This is because the result of __builtin_clz(0) would theoretically require counting all the bits, but since the number of leading zeros for 0 is ambiguous in a standard binary representation (as all bits are zeros), compilers do not define a result for this case. This leads to platform-dependent outcomes or, in some cases, crashes.
Furthermore, the `hamming_distance` function plays a crucial role in determining the Hamming distance through the following steps.
1. Calculate the **XOR**: The function accepts two integer values (`a0` and `a1`), which represent the two binary numbers for which the Hamming distance is to be calculated.
2. **Count Leading Zeros**: To facilitate the counting of `1`s, the function determines how many leading zeros are present in the result of the XOR operation. This is achieved by calling the `my_clz` function, which implements the count leading zeros (`clz`) operation.
3. **Shift to Remove Leading Zeros**: the function shifts the XOR result to the left, effectively discarding the leading zeros. This prepares the number for counting the remaining `1`s in its binary representation.
4. **Count Remaining `1`s**: A loop iterates through the remaining bits, checking each bit of the modified XOR result. Each time a `1` is encountered, a counter (`t4`) is incremented. The loop continues until all bits are checked.
### 5-stage pipelined processor
Ripes provide different processors to run the code. Then I choose 5-stage processor to run my program.

I use `addi x29, x29, 1` in the psedoinstruction, which address is at `0xCC` for example and analyze how the processor operates the instruction in different stages.
1. **IF (Instruction Fetch)**
* Instruction is being fetched from the memory.
* Update the PC to the next sequential instruction by adding 4 (because each instruction is 4 bytes) to the PC.
2. **ID (Instruction Decode and register read)**
* Decode the instruction and read the registers corresponding to register source specifiers from the register file
3. **EX (Execute)**
* The ALU performs one of the following functions on the operands prepared in the prior cycle
* Memory reference (load/store)—The ALU adds the base register and the offset to form the effective address.
* Register-Register ALU instruction—The ALU performs the operation specified by the ALU opcode on the values read from the register file.
* Register-Immediate ALU instruction—The ALU performs the operation specified by the ALU opcode on the first value read from the register file and the sign-extended immediate.
* Conditional branch—Determine whether the condition is true.
4. **MEM (Memory Access)**
* If the instruction is a load, the memory does a read using the effective address computed in the previous cycle.
* If it is a store, then the memory writes the data from the second register read from the register file using the effective address.
5. **WB (Write Back to registers)**
* Write the result into the register file
* From the memory system (for a load)
* From the ALU (for an Register-Register ALU instruction)
---
Take `addi x29, x29, 1` in the psedoinstruction, which address is at `0xCC` and see how it works in each stage.
**<font size=4>1. IF</font>**

First, the program counter ( `PC` ) is `0x000000cc`, it means that the next instruction to be executed is located at memory address `0x000000cc`. The CPU will **fetch the instruction** stored at this address from memory as the next step in the instruction cycle.
We can find `addi` instruction and get its function code and opcode from [ The RISC-V Instruction Set Manual (p.104)](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf) :

And then it will be transformed to `0x001e8e93` as a machine code.
| imm[11:0] | rs1 | funct3 | rd | opcode |
| --------- | --- | --- | -- | ------- |
| 000000000001 | 11101(x29) | 000 | 11101(x29) | 0010011 |
Simultaneously, `PC`(`0x000000cc`) will be added by 4 and obtain the next instruction address which is `0x000000d0`.
**<font size=4>2. ID</font>**

In this stage, the decoder decodes the machine code(`0x001e8e93`) into four parts.
* `opcode` field which is `ADDI`
* `R1 idx` which is `0x1d` (x29)
* `Wr idx` which is `0x1d` (x29)
* `R2 idx` is not used because this is an I-type instruction.
Decoder will get the `imm[11:0]` and extend to a 32-bit number for calculation.
This 12-bit value is sign-extended to 32 bits, meaning the sign bit (the 12th bit) is propagated across the remaining upper bits .
> the explanation about the 12-bit sign-extension can be found in the [RISC-V manual](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf), specifically in Chapter 2: "RV32I Base Integer Instruction Set", under Section 2.4 Immediate Encoding.
However, the `R1 idx` will be sent to the `Registers` to obtain the value.
* `Reg 1` = `0x00000000`
**<font size=4>3. EX</font>**

In this stage, ALU sums the input that will be chosen by 2-level multiplexers(`MUXs`) :
* choice of `op1` : `Reg 1` go through 2 MUXs as the first input of ALU.
* choice of `op2` : `imm.` go through 2 MUXs as the second input of ALU. In addition, `Reg 2` is not used in this situation.
So, the input and output of ALU are :
* **Input**
* `op1` = `0x00000000` (value in x29)
* `op2` = `0x00000001` (imm. value)
* **Output**
* `ALU_RES` = `0x00000001`
Meanwhile, the `Wr idx` (`0x1d`) and the `PC` (`0x000000d0`) will be sent into next stage.
**<font size=4>4. MEM</font>**

In this stage, I-type format doesn't need to load data from memory, so the `Wr en` control signal is set to 0.
`PC`(`0x000000d0`), `ALU_RES` (`0x00000001`) and the `Wr idx`(`0x1d`) will go through to the next stage.
**<font size=4>5. WB</font>**

In this stage, `ALU_RES`(`0x00000001`) will write back to the target register `Wr idx`(`0x1d`). Before write back to the register, `PC`(0x000000d0), `ALU_RES` and `mem_read_out` will go through a MUXs to determine which data will be writen back to the register.
### Memory update
In the two images above, it can be seen that the value stored in register `x29` has changed from `0x00000000` to `0x00000001`. We also know that the CPU updates the register data only after completing the Write Back (`WB`) stage.
<div align="center">
<img src="https://hackmd.io/_uploads/SyxCa_lZk1g.png" width=350>
<img src="https://hackmd.io/_uploads/S1u2seZkyg.png" width=350>
</div>
