# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < `hhchen` > ## Problem C in [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol) ### C Code ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } static inline uint32_t fp16_to_fp32(uint16_t h) { const uint32_t w = (uint32_t) h << 16; const uint32_t sign = w & UINT32_C(0x80000000); const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } ``` ### Assembly Code for function fp16_to_fp32 ```c .data argument: .word 0x0000 endl: .string "\n" .text la t0, argument lw t0, 0(t0) # store x in t0 slli t0, t0, 16 li s1, 0x80000000 and s1, s1, t0 # s1 = x and with 0x80000000(sign) li s2, 0x7FFFFFFF and s2, s2, t0 # s2 = x and with 0x7FFFFFFF(unsign) addi t1, t1, 31 # t1 = 31 (t1 is i) addi t2, t2, 1 # t2 = 1 clz_loop: sll t4, t2, t1 # t4 record 1U shift i addi t1, t1, -1 # i = i - 1 and t4, t4, s1 # x & (1U << i) bnez t4, out_clz # break addi t5, t5, 1 # did not enter the if statement and incremented count bge t1, zero, clz_loop # return to the clz_loop out_clz: addi s3 ,s3 , 5 addi t5, t5, -5 bge t5, s3, out add t5, x0, x0 # t5 = renorm_shift out: li t3, 0x4000000 add t3, s2, t3 srli t3, t3, 8 li s3, 0x7F800000 and t3, t3, s3 # s3 = inf_nan_mask addi s4, s2, -1 srli s4, s4, 31 # s4 = zero mask sll s2, s2, t5 srli s2, s2, 3 sub t5, zero, t5 addi t5, t5, 0x70 slli t5, t5, 23 # 0x70 - renorm << 23 or t5, t5, s3 xori s4, s4, -1 and t5, t5, s4 add s2, t5, s2 or s1, s1, s2 mv a0, s1 li a7,34 ecall ``` ## Binary Watch ([LeetCode 401](https://leetcode.com/problems/binary-watch/description/)) ### Motivation I chose the Binary Watch problem (LeetCode 401) because I saw someone solve the [Number of 1 Bits problem (LeetCode 191)](https://leetcode.com/problems/number-of-1-bits/description/) before. These problems are similar because they both deal with counting bits. For the Binary Watch, I used a simple method that checks all possibilities. ### Problem A binary watch has 4 LEDs on the top to represent the hours (0-11), and 6 LEDs on the bottom to represent the minutes (0-59). Each LED represents a zero or one, with the least significant bit on the right. For example, the below binary watch reads "4:51". ![image](https://hackmd.io/_uploads/SJXEBZ-J1e.png) Given an integer turnedOn which represents the number of LEDs that are currently on (ignoring the PM), return all possible times the watch could represent. You may return the answer in any order. The hour must not contain a leading zero. For example, "01:00" is not valid. It should be "1:00". The minute must consist of two digits and may contain a leading zero. For example, "10:2" is not valid. It should be "10:02". ## Solution ### Idea for problem solving - Enumerate all possible hours (0 to 11) and for each hour, enumerate all possible minutes (0 to 59). - Convert both the hour and minute to binary and concatenate them. - Then count the number of 1s in the resulting binary string. This count is known as the [Hamming weight](https://en.wikipedia.org/wiki/Hamming_weight) (you can use the concept of Hamming weight, as in [LeetCode problem 191](https://leetcode.com/problems/number-of-1-bits/description/)). - If the count is exactly n, add the corresponding time to the result set. - After completing the enumeration, you will have the desired set of times. - This approach effectively finds all times on a 12-hour clock whose binary representation has a Hamming weight of n. ### C Code - At first, I implemented the Hamming weight function and printed the corresponding value to test if the functionality was working correctly. #### Hamming Weight (Version 1) - To compute the 1’s of a number, I apply the idea that the result of `n & (n - 1)` will always be the value of `n` without the least significant 1. ```c int hammingWeight(unsigned short n) { int count = 0; while (n != 0){ n = n & (n - 1); count++; } return count; } ``` #### Hamming Weight (Version 2) - This function `hammingWeight` calculates the number of bits set to `1` (Hamming Weight) in an `unsigned short` (16-bit) value `n`. It uses an optimized method to count bits through successive bitwise operations: 1. **First Step**: `(n & 0x5555) + ((n >> 1) & 0x5555)` — pairs of bits are combined. 2. **Second Step**: `(n & 0x3333) + ((n >> 2) & 0x3333)` — groups of 4 bits are added. 3. **Third Step**: `(n & 0x0F0F) + ((n >> 4) & 0x0F0F)` — groups of 8 bits are added. 4. **Fourth Step**: `(n & 0x00FF) + ((n >> 8) & 0x00FF)` — combines all 16 bits. The final result is the count of `1` bits in `n`. This approach is efficient for counting bits in constant time. ```c int hammingWeight(unsigned short n) { n = (n & 0x5555) + ((n >> 1) & 0x5555); n = (n & 0x3333) + ((n >> 2) & 0x3333); n = (n & 0x0F0F) + ((n >> 4) & 0x0F0F); n = (n & 0x00FF) + ((n >> 8) & 0x00FF); return n; } ``` #### Reading BinaryWatch ```c char** readBinaryWatch(int turnedOn, int* returnSize) { char** valid = (char**)malloc(256 * sizeof(char*)); *returnSize = 0; if (turnedOn < 0 || turnedOn > 10) // If the number of LEDs is out of range, return empty result return valid; for (unsigned short i = 0; i < 1024; ++i) { // Iterate through all possible 10-bit combinations if (hammingWeight(i) == turnedOn && (i >> 6) < 12 && (i & 0x3F) < 60) { // Check if the number of bits set to '1' matches and the time is valid int hour = i >> 6; // Extract the hour (first 4 bits) int minute = i & 0x3F; // Extract the minutes (last 6 bits) char* timeString = (char*)malloc(6 * sizeof(char)); if (minute < 10) sprintf(timeString, "%d:0%d", hour, minute); else sprintf(timeString, "%d:%d", hour, minute); valid[*returnSize] = timeString; (*returnSize)++; } } return valid; } ``` #### Main Program ```c #include <stdio.h> #include <stdlib.h> #include <string.h> int main() { int turnedOn = 1; // Number of LEDs turned on (fixed to 1 for testing) int returnSize; // Number of valid time results // Get the valid times for the given number of LEDs char** times = readBinaryWatch(turnedOn, &returnSize); // Print the valid times printf("\nValid times for %d LEDs turned on:\n", turnedOn); for (int i = 0; i < returnSize; i++) { printf("%s\n", times[i]); free(times[i]); } free(times); return 0; } ``` - Console Output ```c Valid times for 1 LEDs turned on: 0:01 0:02 0:04 0:08 0:16 0:32 1:00 2:00 4:00 8:00 ``` ### Assembly Code #### Binary Watch (Version 1) ```c .data str_valid: .string "LEDs turned on:" str_colon: .string ":" str_leading_zero: .string "0" str_newline: .string "\n" input_turned_on: .word 1 .text .globl main main: addi sp, sp, -4 sw ra, 0(sp) lw a0, input_turned_on # Load LED turned on count into a0 jal ra, readBinaryWatch # Call readBinaryWatch function # Print result message la a0, str_valid # Load result message string li a7, 4 # Syscall: print string ecall lw a0, input_turned_on # Load LED turned on count li a7, 1 # Syscall: print integer ecall la a0, str_newline # Print newline li a7, 4 ecall # Exit program lw ra, 0(sp) addi sp, sp, 4 li a7, 10 # Syscall: exit ecall # hammingWeight function hammingWeight: li t0, 0 # Initialize count to 0 hammingWeight_loop: beqz a0, hammingWeight_end # If n == 0, exit loop addi t1, a0, -1 # t1 = n - 1 and a0, a0, t1 # n = n & (n - 1) addi t0, t0, 1 # Increment count j hammingWeight_loop hammingWeight_end: mv a0, t0 # Move count to return register ret # readBinaryWatch function readBinaryWatch: addi sp, sp, -28 sw ra, 24(sp) sw s0, 20(sp) sw s1, 16(sp) sw s2, 12(sp) sw s3, 8(sp) sw s4, 4(sp) sw s5, 0(sp) mv s5, a0 # s5 = turnedOn (input parameter) li s0, 0 # s0 = counter li s1, 1024 # s1 = upper bound li s3, 0 # s3 = number of valid times found loop_watch: beq s0, s1, end_readBinaryWatch mv a0, s0 jal ra, hammingWeight bne a0, s5, next_iteration # Extract hour and minute srli t0, s0, 6 andi t0, t0, 0xF # t0 = hour andi t1, s0, 0x3F # t1 = minute # Check hour < 12 and minute < 60 li t2, 12 bge t0, t2, next_iteration li t2, 60 bge t1, t2, next_iteration # Print hour mv a0, t0 li a7, 1 # Syscall: print integer ecall # Print colon la a0, str_colon li a7, 4 # Syscall: print string ecall # Print minute with leading zero if needed li t2, 10 bge t1, t2, print_minute la a0, str_leading_zero li a7, 4 # Syscall: print string ecall print_minute: mv a0, t1 li a7, 1 # Syscall: print integer ecall # Print newline la a0, str_newline li a7, 4 # Syscall: print string ecall addi s3, s3, 1 # Increment valid times counter next_iteration: addi s0, s0, 1 j loop_watch end_readBinaryWatch: mv a0, s3 # Return number of valid times found # Restore saved registers and return lw ra, 24(sp) lw s0, 20(sp) lw s1, 16(sp) lw s2, 12(sp) lw s3, 8(sp) lw s4, 4(sp) lw s5, 0(sp) addi sp, sp, 28 ret ``` #### Binary Watch (Version 2) ```c .data str_valid: .string "LEDs turned on:" str_colon: .string ":" str_leading_zero: .string "0" str_newline: .string "\n" input_turned_on: .word 1 .text .globl main main: addi sp, sp, -4 sw ra, 0(sp) lw a0, input_turned_on # Load LED turned on count into a0 jal ra, readBinaryWatch # Call readBinaryWatch function # Print result message la a0, str_valid # Load result message string li a7, 4 # Syscall: print string ecall lw a0, input_turned_on # Load LED turned on count li a7, 1 # Syscall: print integer ecall la a0, str_newline # Print newline li a7, 4 ecall # Exit program lw ra, 0(sp) addi sp, sp, 4 li a7, 10 # Syscall: exit ecall # hammingWeight function hammingWeight: mv t0, a0 srli t1, t0, 1 li t2, 0x55555555 and t1, t1, t2 sub t0, t0, t1 li t1, 0x33333333 and t2, t0, t1 srli t0, t0, 2 and t0, t0, t1 add t0, t0, t2 srli t1, t0, 4 add t0, t0, t1 li t1, 0x0f0f0f0f and t0, t0, t1 srli t1, t0, 8 add t0, t0, t1 srli t1, t0, 16 add a0, t0, t1 andi a0, a0, 0x3f ret # readBinaryWatch function readBinaryWatch: addi sp, sp, -28 sw ra, 24(sp) sw s0, 20(sp) sw s1, 16(sp) sw s2, 12(sp) sw s3, 8(sp) sw s4, 4(sp) sw s5, 0(sp) mv s5, a0 # s5 = turnedOn (input parameter) li s0, 0 # s0 = counter li s1, 1024 # s1 = upper bound li s3, 0 # s3 = number of valid times found loop_watch: beq s0, s1, end_readBinaryWatch mv a0, s0 jal ra, hammingWeight bne a0, s5, next_iteration # Extract hour and minute srli t0, s0, 6 andi t0, t0, 0xF # t0 = hour andi t1, s0, 0x3F # t1 = minute # Check hour < 12 and minute < 60 li t2, 12 bge t0, t2, next_iteration li t2, 60 bge t1, t2, next_iteration # Print hour mv a0, t0 li a7, 1 # Syscall: print integer ecall # Print colon la a0, str_colon li a7, 4 # Syscall: print string ecall # Print minute with leading zero if needed li t2, 10 bge t1, t2, print_minute la a0, str_leading_zero li a7, 4 # Syscall: print string ecall print_minute: mv a0, t1 li a7, 1 # Syscall: print integer ecall # Print newline la a0, str_newline li a7, 4 # Syscall: print string ecall addi s3, s3, 1 # Increment valid times counter next_iteration: addi s0, s0, 1 j loop_watch end_readBinaryWatch: mv a0, s3 # Return number of valid times found # Restore saved registers and return lw ra, 24(sp) lw s0, 20(sp) lw s1, 16(sp) lw s2, 12(sp) lw s3, 8(sp) lw s4, 4(sp) lw s5, 0(sp) addi sp, sp, 28 ret ``` :::danger Check list: 1. Did your test suite include the corner cases? 2. Can you test suite be operated via given [test driver](https://en.wikipedia.org/wiki/Test_driver)? 3. Can you check the report and break down the details? ::: :::success I added three new chapters: Results, Comparison, and Analysis. ::: ## Result ### Test case 1. Test case 1 (0 LEDs on): Expected output: Only one valid time - 0:00 Analysis: This is a corner case where no LEDs are lit. The only valid time in this scenario is 0:00 (midnight). It tests the program's ability to handle the minimum possible input. 2. Test case 2 (1 LED on): Expected output: 12 valid times Analysis: This case represents the lower bound for valid inputs. The possible times are: - Hours: 1:00, 2:00, 4:00, 8:00 - Minutes: 0:01, 0:02, 0:04, 0:08, 0:16, 0:32 - Total: 10 times This case tests the program's ability to handle a minimal valid input and correctly generate all possible times with only one bit set. 3. Test case 3 (9 LEDs on): Expected output: No valid times Analysis: This is an invalid input corner case. It's impossible to have 9 LEDs on in a valid time configuration because: - Hours use 4 bits (0-11 in binary) - Minutes use 6 bits (0-59 in binary) - Total bits available: 4 + 6 = 10 bits ## Comparing This analysis primarily focuses on comparing two versions of the Hamming weight function implementation. The Hamming weight function counts the number of '1' bits in a binary number, which is crucial for the Binary Watch problem. | Feature | Version 1 | Version 2 | |----------------------------|-------------------------------------------|-------------------------------------| | **Algorithm Type** | Bitwise Clear (Lowest Bit) | Mask and Divide-and-Conquer | | **Code Complexity** | Simple and intuitive | More complex (uses masks and shifts)| | **Parallelism Potential** | No parallelism | Can be parallelized with masking | | **Use Case** | Simple implementation, fewer 1-bits | High efficiency for large data | I tested two versions of the Hamming weight function.The cycle count, CPI, and IPC of this assembly code(as shown in the image below). | Execution info | Version 1 | Version 2 | |----------------------------|--------------|--------------| | **Cycles** | 56734 | 38302 | | **Instrs. retired** | 36156 | 30012 | | **CPI** | 1.57 | 1.28 | | **IPC** | 0.637 | 0.784 | | **Clock rate** | 0 Hz | 0 Hz | ## Analysis I test the code using [Ripes](https://ripes.me/) simulator. ### Pseudo instruction The translated code looks like: ```c 00000000 <main>: 0: ffc10113 addi x2 x2 -4 4: 00112023 sw x1 0 x2 8: 10000517 auipc x10 0x10000 c: 00e52503 lw x10 14 x10 10: 0a0000ef jal x1 160 <readBinaryWatch> 14: 10000517 auipc x10 0x10000 18: fec50513 addi x10 x10 -20 1c: 00400893 addi x17 x0 4 20: 00000073 ecall 24: 10000517 auipc x10 0x10000 28: ff252503 lw x10 -14 x10 2c: 00100893 addi x17 x0 1 30: 00000073 ecall 34: 10000517 auipc x10 0x10000 38: fe050513 addi x10 x10 -32 3c: 00400893 addi x17 x0 4 40: 00000073 ecall 44: 00012083 lw x1 0 x2 48: 00410113 addi x2 x2 4 4c: 00a00893 addi x17 x0 10 50: 00000073 ecall 00000054 <hammingWeight>: 54: 00050293 addi x5 x10 0 58: 0012d313 srli x6 x5 1 5c: 555553b7 lui x7 0x55555 60: 55538393 addi x7 x7 1365 64: 00737333 and x6 x6 x7 68: 406282b3 sub x5 x5 x6 6c: 33333337 lui x6 0x33333 70: 33330313 addi x6 x6 819 74: 0062f3b3 and x7 x5 x6 78: 0022d293 srli x5 x5 2 7c: 0062f2b3 and x5 x5 x6 80: 007282b3 add x5 x5 x7 84: 0042d313 srli x6 x5 4 88: 006282b3 add x5 x5 x6 8c: 0f0f1337 lui x6 0xf0f1 90: f0f30313 addi x6 x6 -241 94: 0062f2b3 and x5 x5 x6 98: 0082d313 srli x6 x5 8 9c: 006282b3 add x5 x5 x6 a0: 0102d313 srli x6 x5 16 a4: 00628533 add x10 x5 x6 a8: 03f57513 andi x10 x10 63 ac: 00008067 jalr x0 x1 0 000000b0 <readBinaryWatch>: b0: fe410113 addi x2 x2 -28 b4: 00112c23 sw x1 24 x2 b8: 00812a23 sw x8 20 x2 bc: 00912823 sw x9 16 x2 c0: 01212623 sw x18 12 x2 c4: 01312423 sw x19 8 x2 c8: 01412223 sw x20 4 x2 cc: 01512023 sw x21 0 x2 d0: 00050a93 addi x21 x10 0 d4: 00000413 addi x8 x0 0 d8: 40000493 addi x9 x0 1024 dc: 00000993 addi x19 x0 0 000000e0 <loop_watch>: e0: 08940463 beq x8 x9 136 <end_readBinaryWatch> e4: 00040513 addi x10 x8 0 e8: f6dff0ef jal x1 -148 <hammingWeight> ec: 07551a63 bne x10 x21 116 <next_iteration> f0: 00645293 srli x5 x8 6 f4: 00f2f293 andi x5 x5 15 f8: 03f47313 andi x6 x8 63 fc: 00c00393 addi x7 x0 12 100: 0672d063 bge x5 x7 96 <next_iteration> 104: 03c00393 addi x7 x0 60 108: 04735c63 bge x6 x7 88 <next_iteration> 10c: 00028513 addi x10 x5 0 110: 00100893 addi x17 x0 1 114: 00000073 ecall 118: 10000517 auipc x10 0x10000 11c: ef850513 addi x10 x10 -264 120: 00400893 addi x17 x0 4 124: 00000073 ecall 128: 00a00393 addi x7 x0 10 12c: 00735a63 bge x6 x7 20 <print_minute> 130: 10000517 auipc x10 0x10000 134: ee250513 addi x10 x10 -286 138: 00400893 addi x17 x0 4 13c: 00000073 ecall 00000140 <print_minute>: 140: 00030513 addi x10 x6 0 144: 00100893 addi x17 x0 1 148: 00000073 ecall 14c: 10000517 auipc x10 0x10000 150: ec850513 addi x10 x10 -312 154: 00400893 addi x17 x0 4 158: 00000073 ecall 15c: 00198993 addi x19 x19 1 00000160 <next_iteration>: 160: 00140413 addi x8 x8 1 164: f7dff06f jal x0 -132 <loop_watch> 00000168 <end_readBinaryWatch>: 168: 00098513 addi x10 x19 0 16c: 01812083 lw x1 24 x2 170: 01412403 lw x8 20 x2 174: 01012483 lw x9 16 x2 178: 00c12903 lw x18 12 x2 17c: 00812983 lw x19 8 x2 180: 00412a03 lw x20 4 x2 184: 00012a83 lw x21 0 x2 188: 01c10113 addi x2 x2 28 18c: 00008067 jalr x0 x1 0 ``` In each row it denotes address in instruction memory, instruction's machine code (in hex) and instruction itself respectively. ### 5-stage pipelined processor The 5 stages is designed to deal with the instructions in pipeline approach to enhance the execution efficiency. The 5 stages is shown as below: ![image](https://hackmd.io/_uploads/SJ4RbNtJ1e.png) - Instruction fetch (IF) - Instruction decode and register fetch (ID) - Execute (EX) - Memory access (MEM) - Register write back (WB) The five-stage pipeline in a processor is used to parallelize the execution of instructions, allowing multiple instructions to be processed simultaneously at different stages. Here’s an overview of the pipeline stages and an explanation of how an I-type instruction progresses through these stages: ### I-Type Instruction Example Let’s use the **I-Type instruction** `addi x2, x2, -16` as an example and demonstrate how it passes through each of the five stages in the RISC-V pipeline. 1. **Instruction Fetch (IF)**: In the **IF** stage, the processor fetches the instruction `addi x2, x2, -16` from memory. The **Program Counter (PC)** determines the address of the instruction. After fetching, the PC is incremented by `4` to point to the next instruction, since RISC-V instructions are 4 bytes long. ![image](https://hackmd.io/_uploads/BkTfkwtJJg.png) 2. **Instruction Decode (ID)**: In the **ID** stage, the fetched instruction is decoded by the control unit to determine that it is an `addi` instruction, an I-Type arithmetic operation. The control signals are generated to guide the ALU to perform addition. The value of the source register (`x2`) is read from the register file, and the immediate value (`-16`) is extracted and sign-extended to 32 bits. ![image](https://hackmd.io/_uploads/rJMSJDKJkl.png) 3. **Execute (EX)**: In the **EX** stage, the ALU performs the addition operation between the value from `x2` and the immediate value `-16`. - For example, let’s assume `x2` initially contains the value `64`. - The ALU then calculates `64 + (-16)` which results in `48`. ![image](https://hackmd.io/_uploads/SyyP1wFJyg.png) 4. **Memory Access (MEM)**: In the **MEM** stage, there is no memory operation needed for this instruction since it is purely an arithmetic operation. Therefore, the ALU result (`48`) is simply forwarded to the next stage. ![image](https://hackmd.io/_uploads/BJKvJPKy1g.png) 5. **Register Write Back (WB)**: No action needed for `sw`. In the **WB** stage, the result from the ALU (`48`) is written back to the destination register (`x2`). After this stage is completed, the value in register `x2` is updated to `48`. ![image](https://hackmd.io/_uploads/Hy3c1wKJJx.png) ## Reference - [Binary Watch C++ Solution](https://hackmd.io/@Inversionpeter/BkccX5Ovw) - [191. Number of 1 Bits](https://hackmd.io/jj61kCP_R-etsz4c7KTOSw?view)