# Complex Multiplication in a Simple Way
contributed by <[`yishuo0802`](https://github.com/yishuo0802)>
> [Source code](https://github.com/yishuo0802/computer-arch-2024/tree/main/01.HW1).
## Motivation
:::success
Inspired by [LeetCode 537. Complex Number Multiplication](https://leetcode.com/problems/complex-number-multiplication/description/)
:::
The original LeetCode problem deals with integer-valued complex numbers. Since complex arithmetic in practice usually involves fractional values, I modified the problem to operate on those instead.
My C code demonstrates that operands can be passed as the FP16 data type and converted to FP32 during the calculation; for most inputs, the product is the same whether the operands start in FP16 or FP32. The conversion between FP16 and FP32 uses concepts from [Quiz1 ProblemA](https://hackmd.io/@sysprog/arch2024-quiz1-sol#Problem-A).
My goal is to perform complex multiplication with FP16 operands without precision loss in most scenarios.
This way, <font color=lightblue>**the memory access required is reduced to half of what FP32 needs**</font>.
## FP16 Format
$$
FP16 = (-1)^S \cdot 2^{E - 15} \cdot (1 + 2^{-10}M)
$$
```
┌ sign
│
│ ┌ exponent
│ │
│ │ ┌ mantissa
│ │ │
│┌──┴─┐┌───┴───┐
0b0000000000000000 fp16
```
## FP32 Format
$$
FP32 = (-1)^S \cdot 2^{E - 127} \cdot \left( 1 + 2^{-23} M \right)
$$
```
┌ sign
│
│ ┌ exponent
│ │
│ │ ┌ mantissa
│ │ │
│┌──┴───┐┌──────────┴──────────┐
0b00000000000000000000000000000000 fp32
```
## Conversion from FP16 to FP32
`fp16_to_fp32` is designed to convert a 16-bit half-precision floating-point number (fp16_t h) into a 32-bit single-precision floating-point number (float).
The code employs bit manipulation techniques to efficiently handle both ==normalized== and ==denormalized== numbers according to the IEEE 754 floating-point standard.
### C code
``` c
float fp16_to_fp32(fp16_t h)
{
const uint32_t w = (uint32_t) h << 16;
const uint32_t sign = w & UINT32_C(0x80000000);
const uint32_t two_w = w + w;
const uint32_t exp_offset = UINT32_C(0xE0) << 23;
const float exp_scale = 0x1.0p-112f;
const float normalized_value =
bits_to_fp32((two_w >> 4) + exp_offset) * exp_scale;
const uint32_t mask = UINT32_C(126) << 23;
const float magic_bias = 0.5f;
const float denormalized_value =
bits_to_fp32((two_w >> 17) | mask) - magic_bias;
const uint32_t denormalized_cutoff = UINT32_C(1) << 27;
const uint32_t result =
sign | (two_w < denormalized_cutoff ? fp32_to_bits(denormalized_value)
: fp32_to_bits(normalized_value));
return bits_to_fp32(result);
}
```
### Assembly code
``` riscv
fp16_to_fp32:
# a0: input fp16_t h
fp16_to_fp32__prologue:
addi sp, sp, -12
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
fp16_to_fp32__body:
mv s0, a0 # w
slli s0, s0, 16
li s2, 0x80000000
and t0, s0, s2 # t0: sign
add s0, s0, s0 # two_w
li t1, 0xE0 # t1: exp_offset
slli t1, t1, 23
li t2, 0x07800000 # t2: exp_scale
srli t3, s0, 4 # t3: normalized_value
add t3, t3, t1
addi sp, sp, -20
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
sw t3, 16(sp)
mv a0, t3
mv a1, t2
jal ra, mul_float32
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
addi sp, sp, 20
mv t3, a0 # t3: normalized_value returned by mul_float32
li t4, 126
slli t4, t4, 23 # t4: mask
li t5, 0x3F000000 # t5: magic_bias
srli t6, s0, 17 # t6: denormalized_value
or t6, t6, t4
sub t6, t6, t5
li s1, 1
slli s1, s1, 27 # s1: denormalized_cutoff
bltu s0, s1, fp16_to_fp32__taken
fp16_to_fp32__nottaken:
    mv a0, t3 # normalized path
    j fp16_to_fp32__if_end
fp16_to_fp32__taken:
    mv a0, t6 # denormalized path
fp16_to_fp32__if_end:
or a0, t0, a0 # a0: result
fp16_to_fp32__epilogue:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
addi sp, sp, 12
ret
```
### Challenge
The conversion process uses floating-point addition and multiplication. However, because this assignment only allows RV32I instructions,
we have to implement floating-point multiplication and addition ourselves using RV32I instructions.
#### Floating point ADD
``` riscv
add_float32:
#a0: src1 #a1: src2
add_float32__prologue:
addi sp, sp, -12
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
add_float32__body:
mv s0, a0
mv s1, a1
#Extract sign, exponent, and mantissa
srli t0, s0, 31 # t0: src1 sign
srli t1, s1, 31 # t1: src2 sign
srli t2, s0, 23 # t2: src1 exponent
andi t2, t2, 0xFF
srli t3, s1, 23 # t3: src2 exponent
andi t3, t3, 0xFF
li s2, 0x7FFFFF
and t4, s0, s2 # t4: src1 mantissa
and t5, s1, s2 # t5: src2 mantissa
    # Set the implicit leading one
li s2, 0x800000
or t4, t4, s2
or t5, t5, s2
# Compare exponents and align mantissas
bge t2, t3, add_float32__align_a1
add_float32__align_a0:
sub t6, t3, t2 # t6: exponent difference
srl t4, t4, t6 # shift a0's mantissa
add t2, t2, t6 # Set exponent of a0 to exponent of a1
j add_float32__add_or_sub
add_float32__align_a1:
sub t6, t2, t3 # t6: exponent difference
srl t5, t5, t6 # shift a1's mantissa
    add t3, t3, t6 # Set exponent of a1 to exponent of a0
add_float32__add_or_sub:
beq t0, t1, add_float32__add
add_float32__sub:
blt t4, t5, add_float32_change_sign
sub t4, t4, t5
j add_float32__normalize
add_float32_change_sign:
sub t4, t5, t4
xori t0, t0, 1
j add_float32__normalize
add_float32__add:
add t4, t4, t5
add_float32__normalize:
li s2, 0x800000
blt t4, s2, add_float32__normalize_loop
li s2, 0x1000000
bge t4, s2, add_float32__shift_right
j add_float32__pack_result
add_float32__normalize_loop:
and t3, t4, s2 # check if leading bit is 1
bne t3, zero, add_float32__pack_result
slli t4, t4, 1 # mantissa left shift 1
addi t2, t2, -1 # exponent - 1
j add_float32__normalize_loop
add_float32__shift_right:
srli t4, t4, 1 # mantissa right shift 1
addi t2, t2, 1 # exponent + 1
add_float32__pack_result:
slli t0, t0, 31
slli t2, t2, 23
or t0, t0, t2
li s2, 0x7fffff
and t4, t4, s2
or a0, t0, t4
add_float32__epilogue:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
addi sp, sp, 12
ret
```
#### Floating point MUL
```riscv
mul_float32:
#a0 = a0 * a1
mul_float32__prologue:
addi sp, sp, -12
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
mv s0, a0
mv s1, a1
mul_float32__body:
# Extract sign bits
srli t0, s0, 31
andi t0, t0, 1 # t0: sign of a0
srli t1, s1, 31
andi t1, t1, 1 # t1: sign of a1
# Extract exponent
slli t2, s0, 1
srli t2, t2, 24 # t2: exponent of a0
slli t3, s1, 1
srli t3, t3, 24 # t3: exponent of a1
# Extract mantissa
li s2, 0x7FFFFF
and t4, s0, s2
li s2, 0x800000
or t4, t4, s2 # t4: mantissa of a0
li s2, 0x7FFFFF
and t5, s1, s2
li s2, 0x800000
or t5, t5, s2 # t5: mantissa of a1
# Determine sign
xor t0, t0, t1 # t0: sign of result
# Add exponent
add t2, t2, t3 # t2: exponent of result
li t6, 127 # t6: bias
sub t2, t2, t6
# Multiply mantissa
addi sp, sp, -28
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
sw t3, 16(sp)
sw t4, 20(sp)
sw t5, 24(sp)
srli a0, t4, 8
srli a1, t5, 8
jal ra, mul_uint32
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
lw t3, 16(sp)
lw t4, 20(sp)
lw t5, 24(sp)
addi sp, sp, 28
srli t4, a0, 7 # t4: mantissa of result
# Normalize
srai t5, t4, 24
andi t5, t5, 1
beq t5, zero, mul_float32__norm_done
addi t2, t2, 1 # exponent + 1
srli t4, t4, 1 # mantissa normalize
mul_float32__norm_done:
# Pack the result
slli t0, t0, 31
slli t2, t2, 23
li s2, 0x7fffff
and t4, t4, s2
or a0, t0, t2
or a0, a0, t4
mul_float32__epilogue:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
addi sp, sp, 12
ret
```
When multiplying floating-point numbers, the mantissa product can be treated as a fixed-point multiplication, so it can be computed with integer arithmetic. We therefore also need an integer multiplication implemented with RV32I instructions.
#### Integer MUL
```riscv
mul_uint32:
addi sp, sp, -12
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
li s0, 0 # s0: current result
li t0, 0 # t0: loop index
li t1, 16 # t1: loop bound
mul_uint32_LOOP1_BEGIN:
bge t0, t1, mul_uint32_LOOP_END
srl s1, a1, t0
andi s1, s1, 1
beqz s1, mul_uint32_IF_END
sll s2, a0, t0
add s0, s0, s2
mul_uint32_IF_END:
addi t0, t0, 1
j mul_uint32_LOOP1_BEGIN
mul_uint32_LOOP_END:
mv a0, s0
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
addi sp, sp, 12
ret
```
## Conversion from FP32 to FP16
`fp32_to_fp16` converts a 32-bit single-precision floating-point number (float) to a 16-bit half-precision floating-point number. The function carefully handles special cases like NaN.
### C code
```c
fp16_t fp32_to_fp16(float f)
{
const float scale_to_inf = 0x1.0p+112f;
const float scale_to_zero = 0x1.0p-110f;
float base = (fabsf(f) * scale_to_inf) * scale_to_zero;
const uint32_t w = fp32_to_bits(f);
const uint32_t shl1_w = w + w;
const uint32_t sign = w & UINT32_C(0x80000000);
uint32_t bias = shl1_w & UINT32_C(0xFF000000);
if (bias < UINT32_C(0x71000000))
bias = UINT32_C(0x71000000);
base = bits_to_fp32((bias >> 1) + UINT32_C(0x07800000)) + base;
const uint32_t bits = fp32_to_bits(base);
const uint32_t exp_bits = (bits >> 13) & UINT32_C(0x00007C00);
const uint32_t mantissa_bits = bits & UINT32_C(0x00000FFF);
const uint32_t nonsign = exp_bits + mantissa_bits;
return (sign >> 16) |
(shl1_w > UINT32_C(0xFF000000) ? UINT16_C(0x7E00) : nonsign);
}
```
### Assembly code
```riscv
fp32_to_fp16:
# a0: input float f
fp32_to_fp16__prologue:
addi sp, sp, -16
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
fp32_to_fp16__body:
mv s0, a0
li t0, 0x77800000 # t0: scale_to_inf
li t1, 0x08800000 # t1: scale_to_zero
li s3, 0x7fffffff
and t2, s0, s3 # t2: fabs(f)
addi sp, sp, -16
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
mv a0, t2
mv a1, t0
jal ra, mul_float32
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
mv a0, a0
mv a1, t1
jal ra, mul_float32
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
addi sp, sp, 16
mv t2, a0 # t2: base
mv a0, s0
mv t3, a0 # t3: w
add t4, t3, t3 # t4: shl1_w
li s3, 0x80000000
and t5, t3, s3 # t5: sign
li s3, 0xFF000000
and t6, t4, s3 # t6: bias
li s1, 0x71000000
bgeu t6, s1, fp32_to_fp16__bge_taken
mv t6, s1
fp32_to_fp16__bge_taken:
srli s1, t6, 1
li s3, 0x07800000
add s1, s1, s3
addi sp, sp, -32
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
sw t3, 16(sp)
sw t4, 20(sp)
sw t5, 24(sp)
sw t6, 28(sp)
mv a0, s1
mv a1, t2
jal ra, add_float32
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
lw t3, 16(sp)
lw t4, 20(sp)
lw t5, 24(sp)
lw t6, 28(sp)
addi sp, sp, 32
mv t2, a0 # t2: base, bits
srai t0, t2, 13
li s3, 0x00007C00
and t0, t0, s3 # t0: exp_bits
li s3, 0x00000FFF
and t1, t2, s3 # t1: mantissa_bits
add t0, t0, t1 # t0: nonsign
srli s0, t5, 16
li s1, 0xFF000000
mv s2, zero
bltu s1, t4, fp32_to_fp16__bge_taken2
or s0, s0, t0
j fp32_to_fp16__bge_taken2_end
fp32_to_fp16__bge_taken2:
li s3, 0x7E00
or s0, s0, s3
fp32_to_fp16__bge_taken2_end:
mv a0, s0
fp32_to_fp16__epilogue:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
addi sp, sp, 16
ret
```
## Optimization
> The code size can be found from the previous commits. The original code will not be included here due to space constraints
| Version | Cycle | Code Size |
|:-------:|:---------:|:---------:|
| 1 | 13578094 | 924 |
| 2 | 13256145 | ==708== |
| 3 | ==19147== | 708 |
### Optimization Result
#### Version 1

#### Version 2
:::spoiler
- Redundant Function Elimination
- Redundant Memory Access
- Redundant Stack Allocation/Deallocation
- Multiplication Loop Minimization
:::

#### Version 3 - Booth Algorithm

The following is a detailed explanation of my optimization approaches:
### Redundant Function Elimination
In C code, the `bits_to_fp32` and `fp32_to_bits` functions handle the conversion between float and uint32.
Since the interpretation of 32-bit data in assembly code is determined by how you choose to interpret it (as a float or a uint), these two functions are not necessary in assembly. Therefore, **Redundant Function Elimination** optimization can be applied.
```c
float bits_to_fp32(uint32_t w)
{
union {
uint32_t as_bits;
float as_value;
} fp32 = {.as_bits = w};
return fp32.as_value;
}
uint32_t fp32_to_bits(float f)
{
union {
float as_value;
uint32_t as_bits;
} fp32 = {.as_value = f};
return fp32.as_bits;
}
```
```riscv
addi sp, sp, -32
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
sw t3, 16(sp)
sw t4, 20(sp)
sw t5, 24(sp)
sw t6, 28(sp)
mv a0, t2
jal ra, fp32_to_bits
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
lw t3, 16(sp)
lw t4, 20(sp)
lw t5, 24(sp)
lw t6, 28(sp)
addi sp, sp, 32
```
### Redundant Memory Access
In a function where multiple function calls are made, the `ra` register only needs to be saved once. It can be loaded back from memory just before the `ret` at the end of the function. This allows for **Redundant Memory Access** optimization.
```diff
main:
# ...
addi sp, sp, -4
sw ra, 0(sp)
jal ra, inference
- lw ra, 0(sp)
- addi sp, sp, 4
# ...
- addi sp, sp, -4
- sw ra, 0(sp)
jal ra, inference
- lw ra, 0(sp)
- addi sp, sp, 4
# ...
- addi sp, sp, -4
- sw ra, 0(sp)
jal ra, inference
lw ra, 0(sp)
addi sp, sp, 4
# ...
```
### Redundant Stack Allocation/Deallocation
When calling two consecutive functions, if the same number of registers need to be stored in the stack, the shifting of the stack pointer back and forth can be eliminated.
Furthermore, if the stored registers are not used during the process, both `lw` and `sw` instructions can be removed as a pair, applying **Redundant Stack Allocation/Deallocation optimization**.
```diff
fp32_to_fp16__body:
# ...
addi sp, sp, -16
sw ra, 0(sp)
sw t0, 4(sp)
sw t1, 8(sp)
sw t2, 12(sp)
mv a0, t2
mv a1, t0
jal ra, mul_float32
lw t0, 0(sp)
lw t1, 4(sp)
lw t2, 8(sp)
- addi sp, sp, 12
- addi sp, sp, -12
sw t0, 0(sp)
sw t1, 4(sp)
sw t2, 8(sp)
mv a0, a0
mv a1, t1
jal ra, mul_float32
lw ra, 0(sp)
lw t0, 4(sp)
lw t1, 8(sp)
lw t2, 12(sp)
addi sp, sp, 16
mv t2, a0 # t2: base
```
### Multiplication Loop Minimization
:::info
After switching to the Booth Algorithm, this optimization will no longer be necessary.
:::
The original integer multiplication is implemented as a loop: `A * B` is computed by adding `A` to an accumulator `B` times. If the function first checks that `B < A` (and swaps `A` and `B` otherwise), the loop always iterates over the smaller operand, achieving **Multiplication Loop Minimization**.
```riscv
mv s0, a0
mv s1, a1
mul_uint32__body:
li t0, 0 # init value
li t1, 0 # loop index
mul_uint32__loop_start:
bge t1, s1, mul_uint32__loop_end
add t0, t0, s0
addi t1, t1, 1
j mul_uint32__loop_start
mul_uint32__loop_end:
```
```riscv
blt a1, a0, mul_uint32__a1_smaller
mv s0, a1
mv s1, a0
j mul_uint32__body
mul_uint32__a1_smaller:
mv s0, a0
mv s1, a1
mul_uint32__body:
li t0, 0 # init value
li t1, 0 # loop index
mul_uint32__loop_start:
bge t1, s1, mul_uint32__loop_end
add t0, t0, s0
addi t1, t1, 1
j mul_uint32__loop_start
mul_uint32__loop_end:
```
### Multiplication Method Optimization
:::warning
Loop -> Booth Algorithm (unsigned version)
:::
Since the multiplication here operates on mantissa bits, even a common value such as 1.5 produces the 16-bit mantissa operand `0xC000`; multiplying with a loop that adds once per unit would take tens of thousands of iterations.
Therefore, the Booth algorithm was adopted instead. Booth's algorithm originally handles signed multiplication, so it was adjusted to work on the unsigned operands used here.
Comparing versions 1 and 3, the cycle count drops from about 13.6 million to about 19 thousand, <font color=lightblue>**a roughly 700x speedup**</font>.
```riscv
mul_uint32:
addi sp, sp, -12
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
li s0, 0 # s0: current result
li t0, 0 # t0: loop index
li t1, 16 # t1: loop bound
mul_uint32_LOOP1_BEGIN:
bge t0, t1, mul_uint32_LOOP_END
srl s1, a1, t0
andi s1, s1, 1
beqz s1, mul_uint32_IF_END
sll s2, a0, t0
add s0, s0, s2
mul_uint32_IF_END:
addi t0, t0, 1
j mul_uint32_LOOP1_BEGIN
mul_uint32_LOOP_END:
mv a0, s0
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
addi sp, sp, 12
ret
```
## Testing Results and Conclusion
==Complex Multiplication==
| Complex1 | Complex2 | Result_FP32 | Result_FP16 |
|:--------------:|:-------------:|:-----------------------:|:-----------------------:|
| 1.5 + 1.75i | -3.0 + -1.5i | -1.875000 + -7.500000i | -1.875000 + -7.500000i |
| 2.25 + 6.5i | 3.375 + 10.5i | -60.656250 + 45.562500i | -60.656250 + 45.562500i |
| -1.125 + -3.5i | -4.5 + 2.25i | 12.937500 + 13.218750i | 12.937500 + 13.218750i |
The FP16 results were computed by converting the FP16 operands to FP32 and multiplying there.
It can be seen that in most cases, FP16 precision is already sufficient for the product to be lossless.
Therefore, applications dominated by complex multiplication can cut their memory access in half.
## Pipeline 5 Stage Explanation

### **IF** (Instruction Fetch)
The processor fetches the 32-bit instruction from memory (typically from the instruction cache, or from memory if it is not cached).
The Program Counter (PC) is used to address the memory, and the fetched instruction is passed on to the next stage.
After fetching, the PC is incremented to point to the next instruction in memory.
### **ID** (Instruction Decode)
The fetched instruction is decoded: the processor identifies the instruction type (e.g., arithmetic, load/store, branch), extracts the opcode, source registers, destination register, and immediate values (if any), and reads the source registers from the register file.
Based on the opcode, the appropriate control signals are generated for the rest of the pipeline.
### **IE** (Instruction Execution)
The actual operation specified by the instruction is performed. This could involve arithmetic or logical operations, address calculation for load/store instructions, or comparison for branch instructions.
Use the ALU to perform operations like addition, subtraction, logical shifts, and comparisons. For load/store instructions, this stage computes the effective memory address.
### ==Forwarding in IE stage==

#### Problems in a pipelined CPU
1. In a pipelined processor like RISC-V, different stages of instruction execution happen in parallel. If an instruction needs to use the result of a previous instruction before it is available (e.g., in the Write Back (WB) stage), this creates a data hazard. Without any mechanism to resolve this, the processor would need to stall the pipeline, resulting in a delay (reduced performance).
2. Load data hazard occurs in a pipelined processor when an instruction depends on data that is being loaded from memory by a previous load instruction. The hazard arises because the data being loaded is not available until the Memory Access (MEM) stage of the pipeline, but the subsequent instruction may need the data in an earlier stage, typically the Instruction Execution (IE) stage.
3. A control hazard, also known as a branch hazard, occurs in a pipelined processor when the pipeline fetches from the wrong path because a branch instruction has not yet been resolved. This happens because the outcome of a branch (taken or not taken) is often not known until later stages of the pipeline. If the processor fetched the wrong instructions in the meantime, they must be flushed and fetching restarted at the correct target, wasting cycles.
#### Solution
1. Forwarding
- Forwarding solves this by directly passing the result from one pipeline stage to another, without waiting for the WB stage to complete. This bypasses the need to write the result back to the register file before it can be used by another instruction.
    - Instead of waiting for the result to be written to the register file in the WB stage, the result is "forwarded" directly from the Memory (MEM) or Write Back (WB) stage.
2. Flush / Keep / Bubble
The following table shows the keep/flush/bubble actions applied to each pipeline register for the control and data hazards described above.
- `stall`: set when there is a load instruction in the `IE` stage and an instruction in the `ID` stage uses the load's writeback data
- `jb`: set when a jump or taken branch is resolved in the `IE` stage
| | stall | jb | others |
|:------:|:------:|:-----:|:------:|
| Reg_PC | keep | flush | X |
| Reg_D | keep | flush | X |
| Reg_E | bubble | flush | X |
| Reg_M | X | X | X |
| Reg_W | X | X | X |
### **MEM** (Memory Access)
For load instructions, this stage accesses the memory to retrieve data into the processor.
For store instructions, this stage writes the calculated data to memory.
### **WB** (Write Back)
The result of the instruction (from the execution or memory access stage) is written back to the destination register in the register file.
## Visualization of Each Instruction Type Using Ripes
### R-type
In an R-type instruction, after fetching the instruction and passing it to the decoder, the `rs1_index` and `rs2_index` are sent to the register file to read the data.
When the data is passed from the ID stage to the IE stage, the ALU performs calculations based on `opcode`.
The result is then passed through the MEM stage and finally written back to the `destination register (rd)` in the register file in the WB stage.

### I-type
In an I-type instruction, after fetching the instruction and passing it to the decoder, the `rs1_index` is sent to the register file to read the data, while the `immediate value` is extracted from the instruction and sign-extended to match the operand size.
When the data and the sign-extended immediate value are passed from the ID stage to the IE stage, the ALU performs calculations based on opcode and the extended immediate value.
The result is then passed through the MEM stage (if required, depending on the instruction type) and finally written back to `destination register (rd)` in the register file in the WB stage.

### I-type Load
In an I-type Load instruction, after fetching the instruction and passing it to the decoder, the `rs1_index` is sent to the register file to read the data, and the `immediate value` (offset) is extracted from the instruction and sign-extended.
When the base address (from rs1) and the sign-extended immediate value (offset) are passed from the ID stage to the IE stage, the ALU calculates the effective memory address by adding the base address and the offset.
The calculated address is then passed to the MEM stage, where the data at the calculated memory address is loaded from memory.
Finally, the loaded data is passed to the WB stage and written back into the `destination register (rd)` in the register file.

### S-type
In an S-type instruction, after fetching the instruction and passing it to the decoder, the `rs1_index` and `rs2_index` are sent to the register file to read the data from the source registers. The `immediate value` (which is split across the instruction fields) is extracted and sign-extended.
When the base address from rs1 and the sign-extended immediate value (offset) are passed from the ID stage to the IE stage, the ALU calculates the effective memory address by adding the base address and the offset.
The calculated address and the data from rs2 are passed to the MEM stage, where the data from rs2 is stored at the calculated memory address.
In an S-type instruction, there is no write-back to the register file during the WB stage, as the operation involves storing data to memory.

### U-type
In a U-type instruction, after fetching the instruction and passing it to the decoder, the upper 20 bits of the instruction represent the immediate value. This immediate is directly used without requiring sign-extension, as it forms the upper bits of the result.
The immediate value is passed from the ID stage to the IE stage, where it is either used directly (as in the case of `LUI`) or added to the program counter (PC) in the case of `AUIPC` (Add Upper Immediate to PC).
In the WB stage, for `LUI`, the immediate value is written directly to the destination register (`rd`). For `AUIPC`, the result of adding the immediate value to the PC is written to `rd`.
U-type instructions do not involve the MEM stage, as they don’t perform memory accesses.

### B-type
In a B-type instruction, after fetching the instruction and passing it to the decoder, the `rs1_index` and `rs2_index` are sent to the register file to read the data from the source registers. The immediate value (which is split across several fields in the instruction) is extracted and sign-extended.
When the data from `rs1`, `rs2`, and the sign-extended immediate value (representing the branch offset) are passed from the ID stage to the IE stage, the ALU performs a comparison between the values in `rs1` and `rs2` based on the specific branch condition (e.g., `BEQ` for equal, `BNE` for not equal).
If the condition is met, the ALU calculates the target address by adding the branch offset (immediate value) to the program counter (PC), and control is passed to the calculated address. If the condition is not met, the program continues executing sequentially.
B-type instructions do not involve the MEM stage or WB stage, as they control the program flow based on comparisons rather than loading, storing, or computing data.

### J-type
In a J-type instruction (e.g., `JAL` - Jump and Link), after fetching the instruction and passing it to the decoder, the 20-bit immediate value (representing the jump offset) is extracted from various fields of the instruction and sign-extended to 32 bits.
The sign-extended immediate value is passed from the ID stage to the IE stage, where it is added to the current program counter (PC) to calculate the target jump address.
Additionally, the return address (PC + 4, which points to the next instruction after the jump) is written to the destination register (`rd`) in the WB stage.
In a J-type instruction, there is no need for the MEM stage, as the instruction involves modifying the control flow by jumping to a calculated address while saving the return address in a register.

## Future Work
1. Support FP16 and BF16 operations
2. Support SIMD operations after conversion to FP16
3. Handle infinity and NaN