# Complex Multiplication in a Simple Way
contributed by <[`yishuo0802`](https://github.com/yishuo0802)>
> [Source code](https://github.com/yishuo0802/computer-arch-2024/tree/main/01.HW1).
## Motivation
:::success
Inspired by [LeetCode 537. Complex Number Multiplication](https://leetcode.com/problems/complex-number-multiplication/description/)
:::
The original LeetCode problem deals with integer-valued complex numbers. Since complex arithmetic in practice usually involves fractional values, I modified the problem to operate on those instead.
My C code demonstrates that operands can be passed as the FP16 data type and converted to FP32 during the calculation; for most inputs, the product is the same whether the operands start in FP16 or FP32. The conversion between FP16 and FP32 uses concepts from [Quiz1 ProblemA](https://hackmd.io/@sysprog/arch2024-quiz1-sol#Problem-A).
My goal is to perform complex multiplication with FP16 operands without precision loss in most scenarios.
This way, <font color=lightblue>**the memory access required is reduced to half of what FP32 needs**</font>.
## FP16 Format
$$
FP16 = (-1)^S \cdot 2^{E - 15} \cdot (1 + 2^{-10}M)
$$
```
┌ sign
│
│ ┌ exponent
│ │
│ │ ┌ mantissa
│ │ │
│┌──┴─┐┌───┴───┐
0b0000000000000000 fp16
```
## FP32 Format
$$
FP32 = (-1)^S \cdot 2^{E - 127} \cdot \left( 1 + 2^{-23} M \right)
$$
```
┌ sign
│
│ ┌ exponent
│ │
│ │ ┌ mantissa
│ │ │
│┌──┴───┐┌──────────┴──────────┐
0b00000000000000000000000000000000 fp32
```
## Conversion from FP16 to FP32
`fp16_to_fp32` is designed to convert a 16-bit half-precision floating-point number (fp16_t h) into a 32-bit single-precision floating-point number (float).
The code employs bit manipulation techniques to efficiently handle both ==normalized== and ==denormalized== numbers according to the IEEE 754 floating-point standard.
### C code
``` c
float fp16_to_fp32(fp16_t h)
{
const uint32_t w = (uint32_t) h << 16;
const uint32_t sign = w & UINT32_C(0x80000000);
const uint32_t two_w = w + w;
const uint32_t exp_offset = UINT32_C(0xE0) << 23;
const float exp_scale = 0x1.0p-112f;
const float normalized_value =
bits_to_fp32((two_w >> 4) + exp_offset) * exp_scale;
const uint32_t mask = UINT32_C(126) << 23;
const float magic_bias = 0.5f;
const float denormalized_value =
bits_to_fp32((two_w >> 17) | mask) - magic_bias;
const uint32_t denormalized_cutoff = UINT32_C(1) << 27;
const uint32_t result =
sign | (two_w < denormalized_cutoff ? fp32_to_bits(denormalized_value)
: fp32_to_bits(normalized_value));
return bits_to_fp32(result);
}
```
### Assembly code
``` riscv
fp16_to_fp32:
# a0: input fp16_t h
fp16_to_fp32__prologue:
addi sp, sp, -12
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
fp16_to_fp32__body:
mv s0, a0 # w
slli s0, s0, 16
li s2, 0x80000000
and t0, s0, s2 # t0: sign
add s0, s0, s0 # two_w
li t1, 0xE0 # t1: exp_offset
slli t1, t1, 23
li t2, 0x07800000 # t2: exp_scale
srli t3, s0, 4 # t3: normalized_value
add t3, t3, t1
addi sp, sp, -20
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
sw t3, 16(sp)
mv a0, t3
mv a1, t2
jal ra, mul_float32
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
addi sp, sp, 20
mv t3, a0 # t3: normalized_value returned by mul_float32
li t4, 126
slli t4, t4, 23 # t4: mask
li t5, 0x3F000000 # t5: magic_bias
srli t6, s0, 17 # t6: denormalized_value
or t6, t6, t4
sub t6, t6, t5
li s1, 1
slli s1, s1, 27 # s1: denormalized_cutoff
bltu s0, s1, fp16_to_fp32__taken
fp16_to_fp32__nottaken:
    mv a0, t3 # normalized path
    j fp16_to_fp32__if_end
fp16_to_fp32__taken:
    mv a0, t6 # denormalized path
fp16_to_fp32__if_end:
or a0, t0, a0 # a0: result
fp16_to_fp32__epilogue:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
addi sp, sp, 12
ret
```
### Challenge
The conversion process uses floating-point addition and multiplication. However, because this assignment only allows RV32I instructions,
we have to implement floating-point multiplication and addition ourselves using RV32I instructions.
#### Floating point ADD
``` riscv
add_float32:
#a0: src1 #a1: src2
add_float32__prologue:
addi sp, sp, -12
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
add_float32__body:
mv s0, a0
mv s1, a1
#Extract sign, exponent, and mantissa
srli t0, s0, 31 # t0: src1 sign
srli t1, s1, 31 # t1: src2 sign
srli t2, s0, 23 # t2: src1 exponent
andi t2, t2, 0xFF
srli t3, s1, 23 # t3: src2 exponent
andi t3, t3, 0xFF
li s2, 0x7FFFFF
and t4, s0, s2 # t4: src1 mantissa
and t5, s1, s2 # t5: src2 mantissa
    # Set the implicit leading one
li s2, 0x800000
or t4, t4, s2
or t5, t5, s2
# Compare exponents and align mantissas
bge t2, t3, add_float32__align_a1
add_float32__align_a0:
sub t6, t3, t2 # t6: exponent difference
srl t4, t4, t6 # shift a0's mantissa
add t2, t2, t6 # Set exponent of a0 to exponent of a1
j add_float32__add_or_sub
add_float32__align_a1:
sub t6, t2, t3 # t6: exponent difference
srl t5, t5, t6 # shift a1's mantissa
    add t3, t3, t6 # Set exponent of a1 to exponent of a0
add_float32__add_or_sub:
beq t0, t1, add_float32__add
add_float32__sub:
blt t4, t5, add_float32_change_sign
sub t4, t4, t5
j add_float32__normalize
add_float32_change_sign:
sub t4, t5, t4
xori t0, t0, 1
j add_float32__normalize
add_float32__add:
add t4, t4, t5
add_float32__normalize:
li s2, 0x800000
blt t4, s2, add_float32__normalize_loop
li s2, 0x1000000
bge t4, s2, add_float32__shift_right
j add_float32__pack_result
add_float32__normalize_loop:
and t3, t4, s2 # check if leading bit is 1
bne t3, zero, add_float32__pack_result
slli t4, t4, 1 # mantissa left shift 1
addi t2, t2, -1 # exponent - 1
j add_float32__normalize_loop
add_float32__shift_right:
srli t4, t4, 1 # mantissa right shift 1
addi t2, t2, 1 # exponent + 1
add_float32__pack_result:
slli t0, t0, 31
slli t2, t2, 23
or t0, t0, t2
li s2, 0x7fffff
and t4, t4, s2
or a0, t0, t4
add_float32__epilogue:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
addi sp, sp, 12
ret
```
#### Floating point MUL
```riscv
mul_float32:
#a0 = a0 * a1
mul_float32__prologue:
addi sp, sp, -12
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
mv s0, a0
mv s1, a1
mul_float32__body:
# Extract sign bits
srli t0, s0, 31
andi t0, t0, 1 # t0: sign of a0
srli t1, s1, 31
andi t1, t1, 1 # t1: sign of a1
# Extract exponent
slli t2, s0, 1
srli t2, t2, 24 # t2: exponent of a0
slli t3, s1, 1
srli t3, t3, 24 # t3: exponent of a1
# Extract mantissa
li s2, 0x7FFFFF
and t4, s0, s2
li s2, 0x800000
or t4, t4, s2 # t4: mantissa of a0
li s2, 0x7FFFFF
and t5, s1, s2
li s2, 0x800000
or t5, t5, s2 # t5: mantissa of a1
# Determine sign
xor t0, t0, t1 # t0: sign of result
# Add exponent
add t2, t2, t3 # t2: exponent of result
li t6, 127 # t6: bias
sub t2, t2, t6
# Multiply mantissa
addi sp, sp, -28
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
sw t3, 16(sp)
sw t4, 20(sp)
sw t5, 24(sp)
srli a0, t4, 8
srli a1, t5, 8
jal ra, mul_uint32
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
lw t3, 16(sp)
lw t4, 20(sp)
lw t5, 24(sp)
addi sp, sp, 28
srli t4, a0, 7 # t4: mantissa of result
# Normalize
srai t5, t4, 24
andi t5, t5, 1
beq t5, zero, mul_float32__norm_done
addi t2, t2, 1 # exponent + 1
srli t4, t4, 1 # mantissa normalize
mul_float32__norm_done:
# Pack the result
slli t0, t0, 31
slli t2, t2, 23
li s2, 0x7fffff
and t4, t4, s2
or a0, t0, t2
or a0, a0, t4
mul_float32__epilogue:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
addi sp, sp, 12
ret
```
When multiplying floating-point numbers, the mantissa product can be treated as a fixed-point multiplication, so it can be computed with integer arithmetic. We therefore also need an integer multiplication implemented with RV32I instructions.
#### Integer MUL
```riscv
mul_uint32:
addi sp, sp, -12
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
li s0, 0 # s0: current result
li t0, 0 # t0: loop index
li t1, 16 # t1: loop bound
mul_uint32_LOOP1_BEGIN:
bge t0, t1, mul_uint32_LOOP_END
srl s1, a1, t0
andi s1, s1, 1
beqz s1, mul_uint32_IF_END
sll s2, a0, t0
add s0, s0, s2
mul_uint32_IF_END:
addi t0, t0, 1
j mul_uint32_LOOP1_BEGIN
mul_uint32_LOOP_END:
mv a0, s0
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
addi sp, sp, 12
ret
```
## Conversion from FP32 to FP16
`fp32_to_fp16` converts a 32-bit single-precision floating-point number (float) to a 16-bit half-precision floating-point number. The function carefully handles special cases like NaN.
### C code
```c
fp16_t fp32_to_fp16(float f)
{
const float scale_to_inf = 0x1.0p+112f;
const float scale_to_zero = 0x1.0p-110f;
float base = (fabsf(f) * scale_to_inf) * scale_to_zero;
const uint32_t w = fp32_to_bits(f);
const uint32_t shl1_w = w + w;
const uint32_t sign = w & UINT32_C(0x80000000);
uint32_t bias = shl1_w & UINT32_C(0xFF000000);
if (bias < UINT32_C(0x71000000))
bias = UINT32_C(0x71000000);
base = bits_to_fp32((bias >> 1) + UINT32_C(0x07800000)) + base;
const uint32_t bits = fp32_to_bits(base);
const uint32_t exp_bits = (bits >> 13) & UINT32_C(0x00007C00);
const uint32_t mantissa_bits = bits & UINT32_C(0x00000FFF);
const uint32_t nonsign = exp_bits + mantissa_bits;
return (sign >> 16) |
(shl1_w > UINT32_C(0xFF000000) ? UINT16_C(0x7E00) : nonsign);
}
```
### Assembly code
```riscv
fp32_to_fp16:
# a0: input float f
fp32_to_fp16__prologue:
addi sp, sp, -16
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
fp32_to_fp16__body:
mv s0, a0
li t0, 0x77800000 # t0: scale_to_inf
li t1, 0x08800000 # t1: scale_to_zero
li s3, 0x7fffffff
and t2, s0, s3 # t2: fabs(f)
addi sp, sp, -16
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
mv a0, t2
mv a1, t0
jal ra, mul_float32
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
mv a0, a0
mv a1, t1
jal ra, mul_float32
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
addi sp, sp, 16
mv t2, a0 # t2: base
mv a0, s0
mv t3, a0 # t3: w
add t4, t3, t3 # t4: shl1_w
li s3, 0x80000000
and t5, t3, s3 # t5: sign
li s3, 0xFF000000
and t6, t4, s3 # t6: bias
li s1, 0x71000000
bgeu t6, s1, fp32_to_fp16__bge_taken
mv t6, s1
fp32_to_fp16__bge_taken:
srli s1, t6, 1
li s3, 0x07800000
add s1, s1, s3
addi sp, sp, -32
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
sw t3, 16(sp)
sw t4, 20(sp)
sw t5, 24(sp)
sw t6, 28(sp)
mv a0, s1
mv a1, t2
jal ra, add_float32
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
lw t3, 16(sp)
lw t4, 20(sp)
lw t5, 24(sp)
lw t6, 28(sp)
addi sp, sp, 32
mv t2, a0 # t2: base, bits
srai t0, t2, 13
li s3, 0x00007C00
and t0, t0, s3 # t0: exp_bits
li s3, 0x00000FFF
and t1, t2, s3 # t1: mantissa_bits
add t0, t0, t1 # t0: nonsign
srli s0, t5, 16
li s1, 0xFF000000
mv s2, zero
bltu s1, t4, fp32_to_fp16__bge_taken2
or s0, s0, t0
j fp32_to_fp16__bge_taken2_end
fp32_to_fp16__bge_taken2:
li s3, 0x7E00
or s0, s0, s3
fp32_to_fp16__bge_taken2_end:
mv a0, s0
fp32_to_fp16__epilogue:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
addi sp, sp, 16
ret
```
## Optimization
> The code size can be found from the previous commits. The original code will not be included here due to space constraints
| Version | Cycle | Code Size |
|:-------:|:---------:|:---------:|
| 1 | 13578094 | 924 |
| 2 | 13256145 | ==708== |
| 3 | ==19147== | 708 |
### Optimization Result
#### Version 1

#### Version 2
:::spoiler
- Redundant Function Elimination
- Redundant Memory Access
- Redundant Stack Allocation/Deallocation
- Multiplication Loop Minimization
:::

#### Version 3 - Booth Algorithm

The following is a detailed explanation of my optimization approaches:
### Redundant Function Elimination
In C code, the `bits_to_fp32` and `fp32_to_bits` functions handle the conversion between float and uint32.
Since the interpretation of 32-bit data in assembly code is determined by how you choose to interpret it (as a float or a uint), these two functions are not necessary in assembly. Therefore, **Redundant Function Elimination** optimization can be applied.
```c
float bits_to_fp32(uint32_t w)
{
union {
uint32_t as_bits;
float as_value;
} fp32 = {.as_bits = w};
return fp32.as_value;
}
uint32_t fp32_to_bits(float f)
{
union {
float as_value;
uint32_t as_bits;
} fp32 = {.as_value = f};
return fp32.as_bits;
}
```
```riscv
addi sp, sp, -32
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
sw t3, 16(sp)
sw t4, 20(sp)
sw t5, 24(sp)
sw t6, 28(sp)
mv a0, t2
jal ra, fp32_to_bits
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
lw t3, 16(sp)
lw t4, 20(sp)
lw t5, 24(sp)
lw t6, 28(sp)
addi sp, sp, 32
```
### Redundant Memory Access
In a function where multiple function calls are made, the `ra` register only needs to be saved once. It can be loaded back from memory just before the `ret` at the end of the function. This allows for **Redundant Memory Access** optimization.
```diff
main:
# ...
addi sp, sp, -4
sw ra, 0(sp)
jal ra, inference
- lw ra, 0(sp)
- addi sp, sp, 4
# ...
- addi sp, sp, -4
- sw ra, 0(sp)
jal ra, inference
- lw ra, 0(sp)
- addi sp, sp, 4
# ...
- addi sp, sp, -4
- sw ra, 0(sp)
jal ra, inference
lw ra, 0(sp)
addi sp, sp, 4
# ...
```
### Redundant Stack Allocation/Deallocation
When calling two consecutive functions, if the same number of registers need to be stored in the stack, the shifting of the stack pointer back and forth can be eliminated.
Furthermore, if the stored registers are not used during the process, both `lw` and `sw` instructions can be removed as a pair, applying **Redundant Stack Allocation/Deallocation optimization**.
```diff
fp32_to_fp16__body:
# ...
addi sp, sp, -16
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
mv a0, t2
mv a1, t0
jal ra, mul_float32
lw t0, 0(sp)
lw t1, 4(sp)
lw t2, 8(sp)
- addi sp, sp, 12
- addi sp, sp, -12
sw t0, 0(sp)
sw t1, 4(sp)
sw t2, 8(sp)
mv a0, a0
mv a1, t1
jal ra, mul_float32
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
addi sp, sp, 16
mv t2, a0 # t2: base
```
### Multiplication Loop Minimization
:::info
After switching to the Booth Algorithm, this optimization will no longer be necessary.
:::
The original integer multiplication is implemented as a loop: `A * B` is computed by adding `A` to an accumulator `B` times. If the function first checks that `B < A` (and swaps `A` and `B` otherwise), the loop always iterates over the smaller operand, achieving **Multiplication Loop Minimization**.
```riscv
mv s0, a0
mv s1, a1
mul_uint32__body:
li t0, 0 # init value
li t1, 0 # loop index
mul_uint32__loop_start:
bge t1, s1, mul_uint32__loop_end
add t0, t0, s0
addi t1, t1, 1
j mul_uint32__loop_start
mul_uint32__loop_end:
```
```riscv
blt a1, a0, mul_uint32__a1_smaller
mv s0, a1
mv s1, a0
j mul_uint32__body
mul_uint32__a1_smaller:
mv s0, a0
mv s1, a1
mul_uint32__body:
li t0, 0 # init value
li t1, 0 # loop index
mul_uint32__loop_start:
bge t1, s1, mul_uint32__loop_end
add t0, t0, s0
addi t1, t1, 1
j mul_uint32__loop_start
mul_uint32__loop_end:
```
### Multiplication Method Optimization
:::warning
Loop -> Booth Algorithm (unsigned version)
:::
Since the multiplication here operates on mantissa bits, even a common value such as 1.5 produces the 16-bit mantissa operand `0xC000`; multiplying with a loop that adds once per unit would take tens of thousands of iterations.
Therefore, the Booth algorithm was adopted instead. Booth's algorithm originally handles signed multiplication, so it was adjusted to work on the unsigned operands used here.
Comparing versions 1 and 3, the cycle count drops from about 13.6 million to about 19 thousand, <font color=lightblue>**a roughly 700x speedup**</font>.
```riscv
mul_uint32:
addi sp, sp, -12
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
li s0, 0 # s0: current result
li t0, 0 # t0: loop index
li t1, 16 # t1: loop bound
mul_uint32_LOOP1_BEGIN:
bge t0, t1, mul_uint32_LOOP_END
srl s1, a1, t0
andi s1, s1, 1
beqz s1, mul_uint32_IF_END
sll s2, a0, t0
add s0, s0, s2
mul_uint32_IF_END:
addi t0, t0, 1
j mul_uint32_LOOP1_BEGIN
mul_uint32_LOOP_END:
mv a0, s0
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
addi sp, sp, 12
ret
```
## Testing Results and Conclusion
==Complex Multiplication==
| Complex1 | Complex2 | Result_FP32 | Result_FP16 |
|:--------------:|:-------------:|:-----------------------:|:-----------------------:|
| 1.5 + 1.75i | -3.0 + -1.5i | -1.875000 + -7.500000i | -1.875000 + -7.500000i |
| 2.25 + 6.5i | 3.375 + 10.5i | -60.656250 + 45.562500i | -60.656250 + 45.562500i |
| -1.125 + -3.5i | -4.5 + 2.25i | 12.937500 + 13.218750i | 12.937500 + 13.218750i |
The FP16 results were computed by converting the FP16 operands to FP32 and multiplying there.
It can be seen that in most cases, FP16 precision is already sufficient for the product to be lossless.
Therefore, applications dominated by complex multiplication can cut their memory access in half.
## Pipeline 5 Stage Explanation

### **IF** (Instruction Fetch)
The processor fetches the 32-bit instruction from memory (typically from the instruction cache, or from memory if it is not cached).
The Program Counter (PC) is used to address the memory, and the fetched instruction is passed on to the next stage.
After fetching, the PC is incremented to point to the next instruction in memory.
### **ID** (Instruction Decode)
The fetched instruction is decoded: the processor identifies the instruction type (e.g., arithmetic, load/store, branch), extracts the opcode, source registers, destination register, and immediate values (if any), and reads the source registers from the register file.
Based on the opcode, the appropriate control signals are generated for the rest of the pipeline.
### **IE** (Instruction Execution)
The actual operation specified by the instruction is performed. This could involve arithmetic or logical operations, address calculation for load/store instructions, or comparison for branch instructions.
Use the ALU to perform operations like addition, subtraction, logical shifts, and comparisons. For load/store instructions, this stage computes the effective memory address.
### ==Forwarding in IE stage==

#### Problems in a pipelined CPU
1. In a pipelined processor like RISC-V, different stages of instruction execution happen in parallel. If an instruction needs to use the result of a previous instruction before it is available (e.g., in the Write Back (WB) stage), this creates a data hazard. Without any mechanism to resolve this, the processor would need to stall the pipeline, resulting in a delay (reduced performance).
2. Load data hazard occurs in a pipelined processor when an instruction depends on data that is being loaded from memory by a previous load instruction. The hazard arises because the data being loaded is not available until the Memory Access (MEM) stage of the pipeline, but the subsequent instruction may need the data in an earlier stage, typically the Instruction Execution (IE) stage.
3. A control hazard, also known as a branch hazard, occurs in a pipelined processor when the pipeline fetches from the wrong path because a branch instruction has not yet been resolved. This happens because the outcome of a branch (taken or not taken) is often not known until later stages of the pipeline. If the processor fetched the wrong instructions in the meantime, they must be flushed and fetching restarted at the correct target, wasting cycles.
#### Solution
1. Forwarding
- Forwarding solves this by directly passing the result from one pipeline stage to another, without waiting for the WB stage to complete. This bypasses the need to write the result back to the register file before it can be used by another instruction.
    - Instead of waiting for the result to be written to the register file in the WB stage, the result is "forwarded" directly from the Memory (MEM) or Write Back (WB) stage.
2. Flush / Keep / Bubble
The following table shows the keep/flush/bubble actions applied to each pipeline register for the control and data hazards described above.
- `stall`: set when there is a load instruction in the `IE` stage and an instruction in the `ID` stage uses the load's writeback data
- `jb`: set when a jump or taken branch is resolved in the `IE` stage
| | stall | jb | others |
|:------:|:------:|:-----:|:------:|
| Reg_PC | keep | flush | X |
| Reg_D | keep | flush | X |
| Reg_E | bubble | flush | X |
| Reg_M | X | X | X |
| Reg_W | X | X | X |
### **MEM** (Memory Access)
For load instructions, this stage accesses the memory to retrieve data into the processor.
For store instructions, this stage writes the calculated data to memory.
### **WB** (Write Back)
The result of the instruction (from the execution or memory access stage) is written back to the destination register in the register file.
## Visualization of Each Instruction Type Using Ripes
### R-type
In an R-type instruction, after fetching the instruction and passing it to the decoder, the `rs1_index` and `rs2_index` are sent to the register file to read the data.
When the data is passed from the ID stage to the IE stage, the ALU performs calculations based on `opcode`.
The result is then passed through the MEM stage and finally written back to the `destination register (rd)` in the register file in the WB stage.

### I-type
In an I-type instruction, after fetching the instruction and passing it to the decoder, the `rs1_index` is sent to the register file to read the data, while the `immediate value` is extracted from the instruction and sign-extended to match the operand size.
When the data and the sign-extended immediate value are passed from the ID stage to the IE stage, the ALU performs calculations based on opcode and the extended immediate value.
The result is then passed through the MEM stage (if required, depending on the instruction type) and finally written back to `destination register (rd)` in the register file in the WB stage.

### I-type Load
In an I-type Load instruction, after fetching the instruction and passing it to the decoder, the `rs1_index` is sent to the register file to read the data, and the `immediate value` (offset) is extracted from the instruction and sign-extended.
When the base address (from rs1) and the sign-extended immediate value (offset) are passed from the ID stage to the IE stage, the ALU calculates the effective memory address by adding the base address and the offset.
The calculated address is then passed to the MEM stage, where the data at the calculated memory address is loaded from memory.
Finally, the loaded data is passed to the WB stage and written back into the `destination register (rd)` in the register file.

### S-type
In an S-type instruction, after fetching the instruction and passing it to the decoder, the `rs1_index` and `rs2_index` are sent to the register file to read the data from the source registers. The `immediate value` (which is split across the instruction fields) is extracted and sign-extended.
When the base address from rs1 and the sign-extended immediate value (offset) are passed from the ID stage to the IE stage, the ALU calculates the effective memory address by adding the base address and the offset.
The calculated address and the data from rs2 are passed to the MEM stage, where the data from rs2 is stored at the calculated memory address.
In an S-type instruction, there is no write-back to the register file during the WB stage, as the operation involves storing data to memory.

### U-type
In a U-type instruction, after fetching the instruction and passing it to the decoder, the upper 20 bits of the instruction represent the immediate value. This immediate is directly used without requiring sign-extension, as it forms the upper bits of the result.
The immediate value is passed from the ID stage to the IE stage, where it is either used directly (as in the case of `LUI`) or added to the program counter (PC) in the case of `AUIPC` (Add Upper Immediate to PC).
In the WB stage, for `LUI`, the immediate value is written directly to the destination register (`rd`). For `AUIPC`, the result of adding the immediate value to the PC is written to `rd`.
U-type instructions do not involve the MEM stage, as they don’t perform memory accesses.

### B-type
In a B-type instruction, after fetching the instruction and passing it to the decoder, the `rs1_index` and `rs2_index` are sent to the register file to read the data from the source registers. The immediate value (which is split across several fields in the instruction) is extracted and sign-extended.
When the data from `rs1`, `rs2`, and the sign-extended immediate value (representing the branch offset) are passed from the ID stage to the IE stage, the ALU performs a comparison between the values in `rs1` and `rs2` based on the specific branch condition (e.g., `BEQ` for equal, `BNE` for not equal).
If the condition is met, the ALU calculates the target address by adding the branch offset (immediate value) to the program counter (PC), and control is passed to the calculated address. If the condition is not met, the program continues executing sequentially.
B-type instructions do not involve the MEM stage or WB stage, as they control the program flow based on comparisons rather than loading, storing, or computing data.

### J-type
In a J-type instruction (e.g., `JAL` - Jump and Link), after fetching the instruction and passing it to the decoder, the 20-bit immediate value (representing the jump offset) is extracted from various fields of the instruction and sign-extended to 32 bits.
The sign-extended immediate value is passed from the ID stage to the IE stage, where it is added to the current program counter (PC) to calculate the target jump address.
Additionally, the return address (PC + 4, which points to the next instruction after the jump) is written to the destination register (`rd`) in the WB stage.
In a J-type instruction, there is no need for the MEM stage, as the instruction involves modifying the control flow by jumping to a calculated address while saving the return address in a register.

## Future Work
1. Support FP16 and BF16 operations
2. Support SIMD operations after conversion to FP16
3. Handle infinity and NaN