# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [`sarahting101`](https://github.com/sarahting101) > :::danger Choose one problem (A, B, or C) from [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate it from C code to a complete RISC-V assembly program, and include the relevant test data. ::: ## Quiz1 Problem `C` :::danger Improve your expressions. ::: The problem statement is in [here](https://hackmd.io/@sysprog/arch2024-quiz1-sol#Problem-C). Convert a 16-bit floating-point number to a 32-bit floating-point number, which will utilize counting leading zeros function. ## Implementation ### C code The `my_clz` function is from Quiz, To let it run, I add the main function. ```c= #include <stdio.h> #include <stdint.h> static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } int main(){ int num = 1000; int result = 0; result = my_clz(num); printf("%d\n", result); return 0; } ``` :::danger Show the full C source code corresponding to the original problem set. ::: - The full C source code for Problem `C`, including `my_clz` function. ```c= #include <stdio.h> #include <stdint.h> static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } static inline uint32_t fp16_to_fp32(uint16_t h) { const uint32_t w = (uint32_t) h << 16; printf("w = %x\n", w); const uint32_t sign = w & UINT32_C(0x80000000); const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } int main(){ int num = 0x3C00; int result = 0; result = fp16_to_fp32(num); printf("result = %x\n", result); return 0; } ``` ### Assembly code - First version To calculate the number of leading zeros in a binary number using the logical left shift `sll` and bitwise AND operations `and`. But it needs to run multiple times in the `clz` loop (up to 32 times), which takes a lot of time. ```assembly= .data input: .word 1000 .text main: lw a0, input li t0, 31 #i li t1, -1 #count jal ra, clz mv a0, t1 li a7, 1 ecall li a7, 10 ecall clz: addi t1, t1, 1 li t3, 1 sll t3, t3, t0 and t2, a0, t3 addi t0, t0, -1 #--i beq t2, zero, clz ret ``` - Second version To improve the efficiency of counting leading zeros in a binary number, I use a more optimized algorithm, called population count, determines how many 1s are there in a binary number without requiring a loop. ```assembly= .data input: .word 1000 input2: .word 44093 .text main: lw a0, input srli t0, a0, 1 or a0, a0, t0 srli t0, a0, 2 or a0, a0, t0 srli t0, a0, 4 or a0, a0, t0 srli t0, a0, 8 or a0, a0, t0 srli t0, a0, 16 or a0, a0, t0 srli t0, a0, 1 li t2, 0x55555555 and t1, t0, t2 sub a0, a0, t1 srli t0, a0, 2 li t2, 0x33333333 and t1, t0, t2 and t3, a0, t2 add a0, t1, t3 srli t0, a0, 4 add t1, a0, t0 li t2, 0x0f0f0f0f and a0, t1, t2 srli t0, a0, 8 add a0, a0, t0 srli t0, a0, 16 add a0, a0, t0 andi a0, a0, 0x7f li t1, 32 sub t0, t1, a0 mv a0, t0 li a7, 1 ecall li a7, 10 ecall ``` :::danger You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking. ::: - Third version The full Assembly code for Problem `C` and it can automate the testing procedures. If all test results are correct, output `Correct!`; otherwise, output `False!` ```assembly= .data input_fp16: .word 0x3C00 .word 0x61A0 .word 0xD730 output_fp32: .word 0x3F800000 .word 0x44340000 .word 0xC2E60000 str1: .string "False!" str2: .string "Correct!" .text main: li t5, 3 la t6, input_fp16 loop_cases: lw a0, 0(t6) lw a6, 12(t6) jal ra, fp16_to_fp32 bne t0, a6, print_false addi t6, t6, 4 addi t5, t5, -1 bnez t5, loop_cases la a0, str2 li a7, 4 ecall li a7, 10 ecall fp16_to_fp32: addi sp, sp, -4 sw ra, 0(sp) slli a0, a0, 16 #a0 = w li t0, 0x80000000 and a1, a0, t0 #a1 = sign li t0, 0x7fffffff and a2, a0, t0 #a2 = nonsign mv t4, a2 jal ra, clz mv a2, t4 li t0, 5 bltu a3, t0, less_than_5 addi a3, a3, -5 jal ra, continue less_than_5: li a3, 0 continue: li t0, 0x04000000 add t0, a2, t0 srli t0, t0, 8 li t1, 0x7F800000 and a4, t0, t1 #a4 = inf_nan_mask addi t0, a2, -1 srli a5, t0, 31 #a5 = zero_mask sll t0, a2, a3 srli t0, t0, 3 #check1 li t1, 0x70 sub t3, t1, a3 #check2 slli t3, t3, 23 #triangle add t0, t0, t3 #square or t0, t0, a4 #circle xori t1, a5, -1 and t0, t0, t1 or t0, a1, t0 #t0 = ans lw ra, 0(sp) addi sp, sp, 4 ret clz: srli t0, a2, 1 or a2, a2, t0 srli t0, a2, 2 or a2, a2, t0 srli t0, a2, 4 or a2, a2, t0 srli t0, a2, 8 or a2, a2, t0 srli t0, a2, 16 or a2, a2, t0 srli t0, a2, 1 li t2, 0x55555555 and t1, t0, t2 sub a2, a2, t1 srli t0, a2, 2 li t2, 0x33333333 and t1, t0, t2 and t3, a2, t2 add a2, t1, t3 srli t0, a2, 4 add t1, a2, t0 li t2, 0x0f0f0f0f and a2, t1, t2 srli t0, a2, 8 add a2, a2, t0 srli t0, a2, 16 add a2, a2, t0 andi a2, a2, 0x7f li t1, 32 sub t0, t1, a2 mv a3, t0 #a3 = renorm_shift ret print_false: la a0, str1 li a7, 4 ecall li a7, 10 ecall ``` ## Minimum Bit Flips to Convert Number ### Problem statement A bit flip of a number `x` is choosing a bit in the binary representation of `x` and flipping it from either `0` to `1` or `1` to `0`. For example, for `x = 7`, the binary representation is `111` and we may choose any bit (including any leading zeros not shown) and flip it. We can flip the first bit from the right to get `110`, flip the second bit from the right to get `101`, flip the fifth bit from the right (a leading zero) to get `10111`, etc. Given two integers **start** and **goal**, return the **minimum number of bit flips** to convert start to goal. Constraints: 0 <= `start`, `goal` <= 10^9^ ### Motives for choosing the problem - The relationship between Quiz1 problems B, while not involving IEEE floating point representation, still requires binary conversion. - The question concerns bit operations, and I consider bit flipping is both easy to implement and fast to execute in RISC-V architecture.This operation is also fundamental to a classical problem known as the Hamming Distance. ## Implementation ### C code - First version Convert the decimal to binary, and then perform bit comparison. ```cpp= void dectobin(int *arr, int num){ int i = 0; while(num){ if(num%2){ arr[i] = 1; } num/=2; i++; } return; } int minBitFlips(int start, int goal) { int startBinary[36]; int goalBinary[36]; dectobin(startBinary, start); dectobin(goalBinary, goal); int count = 0; for(int i = 0 ; i < 36 ; i++){ if(startBinary[i]!=goalBinary[i]){ count++; } } return count; } ``` - Time Complexity `O(log n)` (where `n` is the larger of `start` or `goal`) - Second version ```cpp= int minBitFlips(int start, int goal) { int count = 0; int diff = start ^ goal; while (diff) { count += (diff & 1); diff >>= 1; } return count; } ``` - Key Improvements Compared to the first version, although the code have the same time complexity of `O(log n)`, the optimization comes from several factors: - Direct Bit Comparison Use logical operations for optimization. XOR operation can identify the bits that differ between `start` and `goal`. This reduces the need for multiple loops or array accesses. - Elimination of Extra Arrays The original code used two arrays (`startBinary` and `goalBinary`) to store binary representations, which required additional space and made the code longer and more complex. The optimized code calculates the differing bits directly without storing intermediate representations. - Third version ```cpp= int minBitFlips(int start, int goal) { return __builtin_popcount(start ^ goal); } ``` - Key Improvements Using built-in functions like `__builtin_popcount` (in GCC) to count the number of set bits in the XOR result. However, this is not practical when writing assembly code, as it can only use the RISC-V instruction set. Therefore, when implementing the assembly code, the second version will serve as the primary reference. ### Assembly code - First version Try to modify the second edition of the C language into assembly code, without handling multiple test data yet. ```assembly= .data start: .word 10 goal: .word 7 .text main: lw a0, start lw a1, goal li t2, 0 xor t0, a0, a1 jal ra, count mv a0, t2 li a7, 1 ecall li a7, 10 ecall count: andi t1, t0, 1 add t2, t2, t1 srli t0, t0, 1 bne t0, zero, count jr x1 ``` - Second version Create three test data sets and ensure the output includes more information. ```assembly= .data start1: .word 10 goal1: .word 7 start2: .word 3 goal2: .word 4 start3: .word 0 goal3: .word 31 str1: .string "Input: start = " str2: .string ", goal = " str3: .string "Output: " .text main: lw a0, start1 #case1 lw a1, goal1 jal ra, forcase lw a0, start2 #case2 lw a1, goal2 jal ra, forcase lw a0, start3 #case3 lw a1, goal3 jal ra, forcase li a7, 10 ecall forcase: addi sp, sp, -4 sw ra, 0(sp) li t2, 0 xor t0, a0, a1 jal ra, counting jal ra, printResult lw ra, 0(sp) addi sp, sp, 4 ret counting: andi t1, t0, 1 add t2, t2, t1 srli t0, t0, 1 bne t0, zero, counting jr x1 printResult: mv t3, a0 mv t4, a1 la a0, str1 #Input: start = li a7, 4 ecall mv a0, t3 #start value li a7, 1 ecall la a0, str2 #, goal = li a7, 4 ecall mv a0, t4 #goal value li a7, 1 ecall li a0, 10 #newline li a7, 11 ecall la a0, str3 #Output: li a7, 4 ecall mv a0, t2 #count li a7, 1 ecall li a0, 10 #newline li a7, 11 ecall li a0, 10 #newline li a7, 11 ecall ret ``` :::danger Use fewer instructions. ::: Output ``` Input: start = 10, goal = 7 Output: 3 Input: start = 3, goal = 4 Output: 3 Input: start = 0, goal = 31 Output: 5 ``` - Third version Added a loop to iterate through the three test cases using a single block of code. ```assembly= .data start1: .word 10 goal1: .word 7 start2: .word 3 goal2: .word 4 start3: .word 0 goal3: .word 31 str1: .string "Input: start = " str2: .string ", goal = " str3: .string "Output: " .text main: li t5, 3 la t6, start1 loop_cases: lw a0, 0(t6) lw a1, 4(t6) jal ra, forcase addi t6, t6, 8 addi t5, t5, -1 bnez t5, loop_cases li a7, 10 ecall forcase: addi sp, sp, -4 sw ra, 0(sp) li t2, 0 xor t0, a0, a1 jal ra, counting jal ra, printResult lw ra, 0(sp) addi sp, sp, 4 jr x1 counting: andi t1, t0, 1 add t2, t2, t1 srli t0, t0, 1 bne t0, zero, counting jr x1 printResult: mv t3, a0 mv t4, a1 la a0, str1 #Input: start = li a7, 4 ecall mv a0, t3 #start value li a7, 1 ecall la a0, str2 #, goal = li a7, 4 ecall mv a0, t4 #goal value li a7, 1 ecall li a0, 10 #newline li a7, 11 ecall la a0, str3 #Output: li a7, 4 ecall mv a0, t2 #count li a7, 1 ecall li a0, 10 #newline li a7, 11 ecall li a0, 10 #newline li a7, 11 ecall jr x1 ``` - Fourth version Using counting leading zero, yet there are more cycles than in the previous iteration. That is, in my opinion, because the third version already uses `srli` to reduce the number to zero; counting leading zeros is not necessary to determine when to stop the calculation. ```assembly= .data start1: .word 10 goal1: .word 7 start2: .word 3 goal2: .word 4 start3: .word 0 goal3: .word 31 str1: .string "Input: start = " str2: .string ", goal = " str3: .string "Output: " .text main: li t5, 3 la t6, start1 loop_cases: lw a0, 0(t6) lw a1, 4(t6) jal ra, forcase addi t6, t6, 8 addi t5, t5, -1 bnez t5, loop_cases li a7, 10 ecall forcase: addi sp, sp, -4 sw ra, 0(sp) li t2, 0 xor t0, a0, a1 #count t0 1 sum jal ra, clz jal ra, counting jal ra, printResult lw ra, 0(sp) addi sp, sp, 4 jr x1 clz: mv a2, t0 mv t4, t0 srli t0, a2, 1 or a2, a2, t0 srli t0, a2, 2 or a2, a2, t0 srli t0, a2, 4 or a2, a2, t0 srli t0, a2, 8 or a2, a2, t0 srli t0, a2, 16 or a2, a2, t0 srli t0, a2, 1 li t2, 0x55555555 and t1, t0, t2 sub a2, a2, t1 srli t0, a2, 2 li t2, 0x33333333 and t1, t0, t2 and t3, a2, t2 add a2, t1, t3 srli t0, a2, 4 add t1, a2, t0 li t2, 0x0f0f0f0f and a2, t1, t2 srli t0, a2, 8 add a2, a2, t0 srli t0, a2, 16 add a2, a2, t0 andi a2, a2, 0x7f li t1, 32 sub t0, t1, a2 #t0 is the ans of clz li t2, 0 li t1, 0 li t3, 32 sub t0, t3, t0 ret counting: andi t3, t4, 1 add t2, t2, t3 srli t4, t4, 1 addi t1, t1, 1 bne t1, t0, counting ret printResult: mv t3, a0 mv t4, a1 la a0, str1 #Input: start = li a7, 4 ecall mv a0, t3 #start value li a7, 1 ecall la a0, str2 #, goal = li a7, 4 ecall mv a0, t4 #goal value li a7, 1 ecall li a0, 10 #newline li a7, 11 ecall la a0, str3 #Output: li a7, 4 ecall mv a0, t2 #count li a7, 1 ecall li a0, 10 #newline li a7, 11 ecall li a0, 10 #newline li a7, 11 ecall jr x1 ``` - Fifth version Unlike the previous version, I did not utilize the entire code for counting leading zeros in these versions. I use the final portion of it, named population count, which counts the number of 1s in a binary number. The computation algorithm counts the number of 1s without the need for a loop, inheriting the benefits of the second version of `clz`. No matter how large the number is, the same execution cycle can be achieved. ```assembly= .data start1: .word 10 goal1: .word 7 start2: .word 3 goal2: .word 4 start3: .word 0 goal3: .word 31 str1: .string "Input: start = " str2: .string ", goal = " str3: .string "Output: " .text main: li t5, 3 la t6, start1 loop_cases: lw a0, 0(t6) lw a1, 4(t6) jal ra, forcase addi t6, t6, 8 addi t5, t5, -1 bnez t5, loop_cases li a7, 10 ecall forcase: addi sp, sp, -4 sw ra, 0(sp) li t2, 0 xor t0, a0, a1 #count t0 1 sum jal ra, counting jal ra, printResult lw ra, 0(sp) addi sp, sp, 4 jr x1 counting: mv a2, t0 srli t0, a2, 1 li t2, 0x55555555 and t1, t0, t2 sub a2, a2, t1 srli t0, a2, 2 li t2, 0x33333333 and t1, t0, t2 and t3, a2, t2 add a2, t1, t3 srli t0, a2, 4 add t1, a2, t0 li t2, 0x0f0f0f0f and a2, t1, t2 srli t0, a2, 8 add a2, a2, t0 srli t0, a2, 16 add a2, a2, t0 andi a2, a2, 0x7f mv t2, a2 ret printResult: mv t3, a0 mv t4, a1 la a0, str1 #Input: start = li a7, 4 ecall mv a0, t3 #start value li a7, 1 ecall la a0, str2 #, goal = li a7, 4 ecall mv a0, t4 #goal value li a7, 1 ecall li a0, 10 #newline li a7, 11 ecall la a0, str3 #Output: li a7, 4 ecall mv a0, t2 #count li a7, 1 ecall li a0, 10 #newline li a7, 11 ecall li a0, 10 #newline li a7, 11 ecall jr x1 ``` :::danger You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking. ::: - Sixth version Use fewer instructions and automate the testing procedures. ```assembly= .data start: .word 10 .word 3 .word 0 goal: .word 7 .word 4 .word 31 output: .word 3 .word 3 .word 5 str1: .string "False!" str2: .string "Correct!" .text main: li t5, 3 la t6, start loop_cases: lw a0, 0(t6) lw a1, 12(t6) lw a3, 24(t6) jal ra, forcase bne a3, t2, print_false addi t6, t6, 4 addi t5, t5, -1 bnez t5, loop_cases la a0, str2 li a7, 4 ecall li a7, 10 ecall forcase: addi sp, sp, -4 sw ra, 0(sp) li t2, 0 xor t0, a0, a1 #count t0 1 sum jal ra, counting lw ra, 0(sp) addi sp, sp, 4 jr x1 counting: mv a2, t0 srli t0, a2, 1 li t2, 0x55555555 and t1, t0, t2 sub a2, a2, t1 srli t0, a2, 2 li t2, 0x33333333 and t1, t0, t2 and t3, a2, t2 add a2, t1, t3 srli t0, a2, 4 add t1, a2, t0 li t2, 0x0f0f0f0f and a2, t1, t2 srli t0, a2, 8 add a2, a2, t0 srli t0, a2, 16 add a2, a2, t0 andi a2, a2, 0x7f mv t2, a2 ret print_false: la a0, str1 li a7, 4 ecall li a7, 10 ecall ``` ## Analysis ### Optimizations Made :::danger Improve your expressions. ::: #### Between C code and assembly code, there are four improvement. - Inline Counting Logic: The counting logic is kept in the same function to avoid the overhead of function calls. - Reduced Stack Usage: Removed unnecessary stack manipulation by keeping all necessary values in registers. - Consolidated Print Logic: Streamlined the print process to minimize repeated code. - Direct Register Usage: Used registers efficiently for intermediate values. #### Cycles between different versions of assembly code: Run in Single-cycle processor. No matter how large the number is, the same execution cycle can be achieved. - for smaller input(the value of `start` and `goal` are under 32) | Discription | Cycles | | -------- | -------- | | baseline | 194 | | with `count leading zero` | 335 | | with `population count` | 221 | - for larger input(the value of `start` and `goal` are greater than 10000) | Discription | Cycles | | -------- | -------- | | baseline | 338 | | with `count leading zero` | 515 | | with `population count` | 221 | ### 5-stage processor ![{ED236A68-3E05-4530-BC5F-A516F9340263}](https://hackmd.io/_uploads/ryoZDA4k1e.png) - It takes more cycles than in single-cycle processor. | Discription | Cycles | | -------- | -------- | | baseline | 524 | | with `count leading zero` | 717 | | with `population count` | 321 | - Data hazards (when an instruction depends on the result of a previous instruction), control hazards (due to branch instructions), and structural hazards (resource conflicts) can stall the pipeline. These stalls can reduce the theoretical speedup of the pipeline. - There are two `nop` being inserted before the `jal` or `bne` instruction. ![{4A803C9A-3F74-4488-B818-A11E2952AFEE}](https://hackmd.io/_uploads/r1ku66Ny1g.png) - There may be a delay before the new value is available in the pipeline. The `nop` instructions give the processor time to resolve this hazard by ensuring that the necessary data is ready when the called function executes. - When a branch or jump occurs, the pipeline may need to flush or discard some instructions that have already been fetched. Inserting `nop `can help manage this by preventing certain instructions from executing until the branch is resolved. - Instruction Fetch (IF) In this stage, the processor fetches the instruction from memory using the program counter (PC). The fetched instruction is then stored in the instruction register, and the PC is updated to point to the next instruction. - This is the first instruction `li t5, 3 `, it converts to `addi x30, x0, 3` and the `PC` = 0. `0x00300f13` represents a RISC-V instruction encoded in binary format. ![image](https://hackmd.io/_uploads/Syj8-0E1Jl.png =300x) - Instruction Decode (ID) The fetched instruction is decoded in this stage. The processor determines the operation to be performed and identifies the operands. Register values are read from the register file, and immediate values are extracted if needed. This stage also checks for hazards and prepares control signals for the execution stage. - passing `0x00300f13` to ID, the immediate value is `3` ![image](https://hackmd.io/_uploads/ryvRf0EJkl.png =300x) - Execution (EX) In this stage, the actual operation specified by the instruction is performed. This could involve arithmetic or logical operations in the ALU (Arithmetic Logic Unit) for arithmetic instructions, calculating effective addresses for load/store instructions, or determining branch targets for control flow instructions. - initialize register x30 with the value 3 ![{8BB6E81E-3A93-400F-B429-F890D33BF9E3}](https://hackmd.io/_uploads/HJsM40Ek1g.png =300x) - Memory Access (MEM) If the instruction involves memory access (like load or store), this stage handles it. For load instructions, data is read from memory, while for store instructions, data is written to memory. The addresses used for these operations are calculated in the previous stage. ![{70F5BD69-36E6-40D8-B5D9-6BA79635DAC6}](https://hackmd.io/_uploads/HJvzHAVy1e.png =200x) - Write Back (WB) The final stage writes the result of the operation back to the register file. For load instructions, the data retrieved from memory is written to the appropriate register. For other instructions, the result from the ALU is written back. ![{393C6BE7-2152-4B3D-B549-9DBCA8A100FF}](https://hackmd.io/_uploads/rJY_BCEJyl.png =200x) ## Reference - [Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance#:~:text=The%20function%20hamming_distance(),%20implemented%20in%20Python%203,%20computes%20the%20Hamming) - [Ripes introduction](https://github.com/mortbopet/Ripes/blob/master/docs/introduction.md)