# Assignment1: RISC-V Assembly and Instruction Pipeline
Contributed by <[moonchin819](https://github.com/moonchin819)>
## Implementation
:::info
**🛈 AI Tools Usage**
I utilized AI tools to support my work in this assignment.
Specifically, Claude was used for code explanations, and Perplexity for clarifying technical concepts.
:::
Here's the [source code](https://github.com/moonchin819/ca2025-quizzes).
## [Problem B](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-B)
### uf8_decode
1. Extract last four bits as mantissa, and the high fours bit as the exponent part by shift logically right.
2. Generate 1's according to the number of exponent
* e.g. exponent = 3, 0x7FFF >> 12 = 0x0007, and there's actual three 1's
* And then move logically left for four bits to align offset to the mantissa part, to avoid overlapping and the continuity of the decode results
\begin{gather*}D(b) = m \cdot 2^e + (2^e - 1) \cdot 16\end{gather*}
3. Mantissa controls the detail in every segments of exponent
### Method 1 - Arithmetic Computation
```asm=
uf8_decode:
andi t0, a0, 0x0F # mantissa = f1 & 0xF
srli t1, a0, 4 # exponent = f1 >> 4
li t2, 15
sub t2, t2, t1 # 15 - exponent
li t3, 0x7FFF
srl t3, t3, t2 # 0x7fff >> (15 - exponent)
slli t3, t3, 4 # offset << 4
sll t2, t0, t1 # mantissa << exponent
add a0, t2, t3 # mantissa + offset
ret
```
### Method 2 - Read-only Memory (ROM) Lookup Table
1. Calculate the value of each exponent and place into a table
* In this problem, the range of data is not so large, so I can try sacrifice some memory space to have better execution efficiency
2. We won't have to calculate the offset in the execution, using the value of exponent as the index to look up the table.
\begin{gather*}\text{Offset}(e) =(2^{e}-1)\cdot16 \end{gather*}
| Exponent | Offset | Range |Step Size|
|:---------:|:-------:|:----------------------|:-----:|
| 0 | 0 | [0, 15] |1|
| 1 | 16 | [16, 47] |2|
| 2 | 48 | [48, 111] |4|
| 3 | 112 | [112, 239] |8|
| 4 | 240 | [240, 495] |16|
| 5 | 496 | [496, 1007] |32|
|...|...|...|$2^{e}$|
| 15 | 524272 | [524272, 1048575] |32768|
```asm=
offset_table:
.word 0, 16, 48, 112, 240, 496, 1008, 2032, 4080, 8176, 16368, 32752, 65520, 131056, 262128, 524272
uf8_decode:
andi t0, a0, 0x0F # mantissa = f1 & 0xF
srli t1, a0, 4 # exponent = f1 >> 4
la t2, offset_table
slli t3, t1, 2 # offset << 4
add t2, t2, t3
lw t2, 0(t2)
sll t0, t0, t1 # mantissa << exponent
add a0, t0, t2 # mantissa + offset
ret
```
### Pros and Cons
Pros :
* It has faster access time
* It's ideal for frequent reuse
Cons :
* It requires more memory space
* Not suitable for very large data ranges
### clz
* n represents "how many leading zeros are still there" in current prediction (set 32 to assume there's at most 32 0's)
* c is how many bits to shift right now (set 16 to test first)
* Use a loop to check the higher 16 bits first to check whether all of them are zero -> if not, check the second half part.
* Use the core logic of binary
* If the result of shifting right ```y!=0```
-> indicates that the leading zeros are less than what is expected -> ```n-=c```
in the mean time, update ```x=y``` to keep on dealing with the right half
* If ```y==0```, it means the part that have been moved out are all zero, so we don't have to minus ```n```
* ```c >>=1``` implies we narrow down the search range by half each time, and it's the main concept of binary search.
### Method 1 - Directly Translation
```asm=
clz:
li t0, 32 # n = 32
li t1, 16 # c = 16
clz_loop:
srl t2, a0, t1 # y = x >> c
beqz t2, clz_skip # if (y == 0) skip
sub t0, t0, t1 # n -= c
mv a0, t2 # x = y
clz_skip:
srli t1, t1, 1 # c >>= 1
bnez t1, clz_loop # while (c != 0)
sub a0, t0, a0 # return n - x
ret
```
### Method 2 - Loop Expansion
* Since the loop is only executed 5 times (c decades from 16 to 1), we could expand the loop to eliminate all the cost of branches
* Implement the code from 16 to 1 in order, keep dividing c by 2 until c be equal to 0 then exit the loop and return the result value.
```asm=
clz:
li t0, 32 # n = 32
# c = 16
srli t2, a0, 16 # y = x >> 16
beqz t2, c_equal_8 # branch if y is equal to zero
addi t0, t0, -16 # n -= 16
mv a0, t2 # x = y
c_equal_8:
srli t2, a0, 8 # y = x >> 8
beqz t2, c_equal_4
addi t0, t0, -8 # n -= 8
mv a0, t2 # x = y
c_equal_4:
srli t2, a0, 4 # y = x >> 4
beqz t2, c_equal_2
addi t0, t0, -4 # n -= 4
mv a0, t2 # x = y
c_equal_2:
srli t2, a0, 2 # y = x >> 2
beqz t2, c_equal_1
addi t0, t0, -2 # n -= 2
mv a0, t2 # x = y
c_equal_1:
srli t2, a0, 1 # y = x >> 1
beqz t2, iter_end
addi t0, t0, -1 # n -= 1
mv a0, t2 # x = y
iter_end:
sub a0, t0, a0 # return n - x
ret
```
* Although number of code lines has more than doubled, this approach reduces approximately 5000 cycles.

### uf8_encode
:::spoiler Assembly Code
```asm=
uf8_encode:
addi sp, sp, -4
sw ra, 0(sp) # will call clz later
mv t5, a0 # t5 = value
li t0, 16
blt t5, t0, ret_value # branch if value is less than 16
jal ra, clz # call function clz, a0 = lz
li t0, 31
sub t0, t0, a0 # msb = 31 - lz
li t1, 0 # exponent = 0
li t2, 0 # overflow = 0
li t3, 5
blt t0, t3, find_exact_exp # branch to find exact exponent if msb is less than 5
addi t1, t0, -4 # exponent = msb - 4
li t3, 15
ble t1, t3, over # branch to the nearest 1 function if exponent is less than or equal to 15
li t1, 15 # exponent = 15
over:
li t3, 0 # e (counter)
overflow_loop:
slli t6, t2, 1 # overflow << 1
addi t2, t6, 16 # (overflow << 1) + 16
addi t3, t3, 1 # e++
blt t3, t1, overflow_loop # branch to the begining of the loop if e is still less than exponent
adjust_loop:
blez t1, find_exact_exp # branch if exponent is less than or equal to zero
bge t5, t2, find_exact_exp # branch if overflow is greater than or equal to value (unsigned)
addi t6, t5, -16 # overflow - 16
srli t5, t6, 1 # (overflow - 16) >> 1
addi t1, t1, -1 # exponent--
j adjust_loop
find_exact_exp:
li t6, 15
find_exact_exp_loop:
bge t1, t6, exact_done # branch if exponent is greater than or equal to 15
slli t3, t2, 1 # overflow << 1
addi t3, t3, 16 # next_overflow = (oveflow << 1) + 16
blt t5, t3, exact_done # branch if value is less than next_overflow
mv t2, t3 # overflow = next_overflow
addi t1, t1, 1 # exponent++
j find_exact_exp_loop
exact_done:
sub t0, t5, t2 # value - overflow
srl t0, t0, t1 # (value - overflow) >> exponent
slli t3, t1, 4 # exponent << 4
or a0, t3, t0 # (expoment << 4) | mantissa
ret_value:
lw ra, 0(sp)
addi sp, sp, 4
jr ra
```
:::
### Full Assembly Code
:::spoiler Assembly Code
```asm=
.data
msg1: .asciz ": produce value "
msg2: .asciz " but encodes back to "
msg3: .asciz ": value\n"
msg4: .asciz " <= previous_value "
msg5: .asciz "All tests passed.\n"
msg6: .asciz "Test failed.\n"
line_break: .asciz "\n"
.align 2
.text
.global main
main:
jal ra, test # start testing
beq a0, x0, fail # didn't pass the test
la a0, msg5
li a7, 4 # test passed
ecall
li a7, 10 # set exit code
li a0, 0
ecall # successful if exit code is 0
fail:
la a0, msg6
li a7, 4
ecall
li a7, 10
li a0, 1
ecall # unsuccessful if exit code is 1
test:
addi sp, sp, -4
sw ra, 0(sp)
li s0, -1 # s0 = previous_value
li s1, 1 # s1 = passed (true)
li s2, 0 # s2 = i, counter initialized
li s3, 256 # end of counter
begin_test_loop:
mv a0, s2
jal ra, uf8_decode
mv s4, a0 # return the value of uf8_decode
mv a0, s4
jal ra, uf8_encode
mv s5, a0 # return the value of uf8_encode
test_if1:
beq s2, s5, test_if2 # branch if fl==fl2
mv a0, s2 # printf(fl)
li a7, 34
ecall
la a0, msg1
li a7, 4
ecall
mv a0, s4
li a7, 1
ecall
la a0, msg2
li a7, 4
ecall
mv a0, s5
li a7, 34
ecall
la a0, line_break
li a7, 4
ecall
li s1, 0 # passed = false
test_if2:
bgt s4, s0, end_test_loop
mv a0, s2 # printf(fl)
li a7, 34
ecall
la a0, msg3
li a7, 4
ecall
mv a0, s4 # printf(value)
li a7, 1
ecall
la a0, msg4
li a7, 4
ecall
mv a0, s0 # printf(previous_value)
li a7, 34
ecall
la a0, line_break
li a7, 4
ecall
li s1, 0 # passed = false
end_test_loop:
mv s0, s4 # previous_value = value
addi s2, s2, 1
blt s2, s3, begin_test_loop
mv a0, s1 # return passed
lw ra, 0(sp)
addi sp, sp, 4
jr ra
clz:
li t0, 32 # n = 32
li t1, 16 # c = 16
clz_loop:
srl t2, a0, t1 # y = x >> c
beqz t2, clz_skip # if (y == 0) skip
sub t0, t0, t1 # n -= c
mv a0, t2 # x = y
clz_skip:
srli t1, t1, 1 # c >>= 1
bnez t1, clz_loop # while (c != 0)
sub a0, t0, a0 # return n - x
ret
offset_table:
.word 0, 16, 48, 112, 240, 496, 1008, 2032, 4080, 8176, 16368, 32752, 65520, 131056, 262128, 524272
uf8_decode:
andi t0, a0, 0x0F # mantissa = f1 & 0xF
srli t1, a0, 4 # exponent = f1 >> 4
la t2, offset_table
slli t3, t1, 2 # offset << 4
add t2, t2, t3
lw t2, 0(t2)
sll t0, t0, t1 # mantissa << exponent
add a0, t0, t2 # mantissa + offset
ret
uf8_encode:
addi sp, sp, -4
sw ra, 0(sp) # will call clz later
mv t5, a0 # t5 = value
li t0, 16
blt t5, t0, ret_value # branch if value is less than 16
jal ra, clz # call function clz, a0 = lz
li t0, 31
sub t0, t0, a0 # msb = 31 - lz
li t1, 0 # exponent = 0
li t2, 0 # overflow = 0
li t3, 5
blt t0, t3, find_exact_exp # branch to find exact exponent if msb is less than 5
addi t1, t0, -4 # exponent = msb - 4
li t3, 15
ble t1, t3, over # branch to the nearest 1 function if exponent is less than or equal to 15
li t1, 15 # exponent = 15
over:
li t3, 0 # e (counter)
overflow_loop:
slli t6, t2, 1 # overflow << 1
addi t2, t6, 16 # (overflow << 1) + 16
addi t3, t3, 1 # e++
blt t3, t1, overflow_loop # branch to the begining of the loop if e is still less than exponent
adjust_loop:
blez t1, find_exact_exp # branch if exponent is less than or equal to zero
bge t5, t2, find_exact_exp # branch if overflow is greater than or equal to value (unsigned)
addi t6, t5, -16 # overflow - 16
srli t5, t6, 1 # (overflow - 16) >> 1
addi t1, t1, -1 # exponent--
j adjust_loop
find_exact_exp:
li t6, 15
find_exact_exp_loop:
bge t1, t6, exact_done # branch if exponent is greater than or equal to 15
slli t3, t2, 1 # overflow << 1
addi t3, t3, 16 # next_overflow = (oveflow << 1) + 16
blt t5, t3, exact_done # branch if value is less than next_overflow
mv t2, t3 # overflow = next_overflow
addi t1, t1, 1 # exponent++
j find_exact_exp_loop
exact_done:
sub t0, t5, t2 # value - overflow
srl t0, t0, t1 # (value - overflow) >> exponent
slli t3, t1, 4 # exponent << 4
or a0, t3, t0 # (expoment << 4) | mantissa
ret_value:
lw ra, 0(sp)
addi sp, sp, 4
jr ra
```
:::
### Analysis
In this assignment we use [Ripes simulator](https://ripes.me/) to test our code.
* Method 1 - Arithmetic Computation
* Assembly code size : 165 lines
Cycle counts : 40107

* Method 2.1 - ROM Lookup Table without Loop Expansion
* Assembly code size : 166 lines
Cycle counts : 39851

* Method 2.2 - ROM Lookup Table with Loop Expansion
* Assembly code size : 182 lines
Cycle counts : 35291

Method 2.2
## [Problem C](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-C)
BFloat16 is a 16-bit floating point format designed for keeping the dynamic range of float32 but reduce the precision.
The value of BFloat16 is calculated as :
\begin{gather*}v = (-1)^S \times 2^{E-127} \times \left(1 + \frac{M}{128}\right)\end{gather*}
Determine it's positive or negative by sign, and decide multiple by exponent, then finetune the precision of decimals part by mantissa.
### Initial Part
:::spoiler Assembly Code
```asm=
.globl bf16_isnan
bf16_isnan:
li t0, 0x7F80
and t1, a0, t0
bne t1, t0, isnan_false
li t2, 0x007F
and t3, a0, t2
snez a0, t3
ret
isnan_false:
li a0, 0
ret
bf16_isinf:
li t0, 0x7F80
and t1, a0, t0
bne t1, t0, isinf_false
li t2, 0x007F
and t3, a0, t2
seqz a0, t3
ret
isinf_false:
li a0, 0
ret
bf16_iszero:
li t0, 0x7FFF
and t1, a0, t0
seqz a0, t1
ret
.globl f32_to_bf16
f32_to_bf16:
addi sp, sp, -4
sw s0, 0(sp)
mv s0, a0
srli t0, s0, 23
andi t0, t0, 0xFF
li t1, 0xFF
bne t0, t1, unspecial
srli a0, s0, 16
li t0, 0xFFFF
and a0, a0, t0
j f32_to_bf16_done
unspecial:
srli t0, s0, 16
andi t0, t0, 1
li t1, 0x7FFF
add t0, t0, t1
add s0, s0, t0
srli a0, s0, 16
f32_to_bf16_done:
lw s0, 0(sp)
addi sp, sp, 4
ret
.globl bf16_to_f32
bf16_to_f32:
slli a0, a0, 16
ret
.globl BF16_NAN
BF16_NAN:
li a0, 0x7FC0
ret
.globl BF16_ZERO
BF16_ZERO:
li a0, 0x0000
ret
```
:::
### Add

Result with three test cases
:::spoiler Assembly Code
```asm=
.globl bf16_add
bf16_add:
srli t0, a0, 15 # a.bits >> 15
andi t0, t0, 1 # & 1
srli t1, a1, 15 # b.bits >> 15
andi t1, t0, 1 # & 1
srli t2, a0, 7 # a.bits >> 7
andi t2, t2, 0xFF # & 0xFF
srli t3, a1, 7 # b.bits >> 7
andi t3, t3, 0xFF # & 0xFF
li t6, 0xFF
beq t2, t6, exp_a_checkall
j check_exp_b
exp_a_checkall:
beqz t4, exp_a_not_nan
mv a0, a0
ret
exp_a_not_nan:
beq t3, t6, check1
mv a0, a0
ret
check1:
bnez t5, return_b1
beq t0, t1, return_b1
li a0, 0x7FC0
ret
return_b1:
mv a0, a1
ret
check_exp_b:
beq t3, t6, return_b2
j check_0_a
return_b2:
mv a0, a1
ret
check_0_a:
bnez t2, check_0_b
bnez t4, check_0_b
mv a0, a1
ret
check_0_b:
bnez t3, norm_a
bnez t5, norm_a
mv a0, a0
ret
norm_a:
beqz t2, norm_b
ori t4, t4, 0x80
norm_b:
beqz t3, end_check1
ori t5, t5, 0x80
end_check1:
addi sp, sp, -20
sw s0, 16(sp) # for exp_diff
sw s1, 12(sp) # for result_sign
sw s2, 8(sp) # for result_exp
sw s3, 4(sp) # for result_mant
sw s4, 0(sp)
sub s0, t2, t3 # exp_diff = exp_a - exp_b
blez s0, diff_neg # branch if exp_diff <= 0
mv s2, t2 # result_exp = exp_a
li t6, 8
bgt s0, t6, return_a
srl t5, t5, s0
j exp_done
diff_neg:
bgez s0, diff_else
mv s2, t3
li t6, -8
blt s0, t6, return_b3
neg s4, s0
srl t4, t4, s4
j exp_done
diff_else:
mv s2, t2
return_a:
mv a0, a0
ret
return_b3:
mv a0, a1
ret
exp_done:
bne t0, t1, diff_sign # branch if sign_a != sign_b
same_sign:
mv s1, t0
add s3, t4, t5
andi t6, s3, 0x100
beqz t6, norm_end
srli s3, s3, 1
addi s2, s2, 1
li t6, 0xFF
bge s1, t6, overflow_inf
j norm_end
overflow_inf:
slli a0, s1, 15
li t6, 0x7F80
or a0, a0, t6
ret
diff_sign:
bge t4, t5, manta_>_mantb
mv s1, t1
sub s3, t5, t4
j mant_result
manta_>_mantb:
mv s1, t0
sub s3, t4, t5
mant_result:
beqz s3, return_zero
norm_loop:
andi t6, s3, 0x80
bnez t6, norm_end
slli s3, s3, 1
addi s2, s2, -1
blez s2, return_zero
j norm_loop
norm_end:
slli t0, s1, 15
andi t1, s2, 0xFF
slli t1, t1, 7
or t0, t0, t1
andi t1, s3, 0x7F
or a0, t0, t1
ret
return_zero:
li a0, 0x0000
ret
```
:::
### Sub

Result with three test cases
:::spoiler Assembly Code
```asm=
.globl bf16_sub
bf16_sub:
li t0, 0x8000
xor a1, a1, t0
j bf16_add
```
:::
### Mul

Result with three test cases
:::spoiler Assembly Code
```asm=
.globl bf16_mul
bf16_mul:
addi sp, sp, -16
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw ra, 12(sp)
srli t0, a0, 15 # a.bits >> 15
andi t0, t0, 1 # t0 = sign_a
srli t1, a1, 15 # b.bits >> 15
andi t1, t0, 1 # t1 = sign_b
srli t2, a0, 7 # a.bits >> 7
andi t2, t2, 0xFF # t2 = exp_a
srli t3, a1, 7 # b.bits >> 7
andi t3, t3, 0xFF # t3 = exp_b
andi t4, a0, 0x7F # t4 = mant_a
andi t5, a1, 0x7F # t5 = mant_b
xor s0, t0, t1 # s0 = result_sign
li t6, 0xFF
bne t2, t6, check_b_exp
bnez t4, return_a
bnez t3, result_1
bnez t5, result_1
result_1:
slli a0, s0, 15
li t6, 0x7F80
or a0, a0, t6
j quit
check_b_exp:
li t6, 0xFF
bne t3, t6, check_0
bnez t5, return_b
bnez t2, result_2
bnez t4, result_2
j return_nan
result_2:
slli a0, s0, 15
li t6, 0x7F80
or a0, a0, t6
j quit
check_0:
bnez t2, a_not_zero
bnez t4, a_not_zero
j return_0
a_not_zero:
bnez t3, norm_mant
bnez t5, norm_mant
return_0:
slli a0, s0, 15
j quit
norm_mant:
li s1, 0 # s1 : exp_adjust = 0
bnez t2, norm_a_else # if(!exp_a)
norm_loop_a:
andi t6, t4, 0x80
bnez t6, norm_loop_a_done
slli t4, t4, 1
addi s1, s1, -1
j norm_loop_a
norm_loop_a_done:
li t2, 1
j check_exp_b_norm
norm_a_else:
ori t4, t4, 0x80
check_exp_b_norm:
bnez t3, else_norm_b
norm_loop_b:
andi t6, t5, 0x80
bnez t6, norm_b_done
slli t5, t5, 1
addi s1, s1, -1
j norm_loop_b
norm_b_done:
li t3, 1
j mul_mant
else_norm_b:
ori t5, t5, 0x80
mul_mant:
mul s2, t4, t5 # s2 = result_mant
add t6, t2, t3
addi t6, t6, -127
add t6, t6, s1
mv s1, t6 # s1 = result_exp
li t6, 0x8000
and t6, s2, t6
beqz t6, mult_else
srli s2, s2, 8
andi s2, s2, 0x7F
addi s1, s1, 1
j check_exp_overflow
mult_else:
srli s2, s2, 7
addi s2, s2, 0x7F
check_exp_overflow:
li t6, 0xFF
blt s1, t6, underflow_check
slli a0, s0, 15
li t6, 0x7F80
or a0, a0, t6
j quit
underflow_check:
bgt s1, zero, final
li t6, -6
blt s1, t6, return_0_udflow
li t6, 1
sub t6, t6, s1
srl s2, s2, t6
li s1, 0
j final
return_0_udflow:
slli a0, s0, 15
j quit
final:
slli a0, s0, 15
andi s1, s1, 0xFF
slli s1, s1, 7
andi s2, s2, 0x7F
or a0, a0, s1
or a0, a0, s2
j quit
return_a:
mv a0, a0
ret
return_b:
mv a0, a1
ret
return_nan:
li a0, 0x7FC0
j quit
quit:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw ra, 12(sp)
addi sp, sp, 16
ret
```
:::
### Div

Result with three test cases
:::spoiler Assembly Code
```asm=
.globl bf16_div
bf16_div:
addi sp, sp, -16
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
srli t0, a0, 15 # a.bits >> 15
andi t0, t0, 1 # & 1 # t0 = sign_a
srli t1, a1, 15 # b.bits >> 15
andi t1, t1, 1 # & 1 # t1 = sign_b
srli t2, a0, 7 # a.bits >> 7
andi t2, t2, 0xFF # & 0xFF # t2 = exp_a
srli t3, a1, 7 # b.bits >> 7
andi t3, t3, 0xFF # & 0xFF # t3 = exp_b
andi t4, a0, 0x7F # t4 = mant_a
andi t5, a1, 0x7F # t5 = mant_b
xor s0, t0, t1 # s0 = result_sign = sign_a ^ sign_b
li t6, 0xFF # t6 is for temporal usage
bne t3, t6, check_zero # branch if(exp_b != 0xFF)
beqz t5, check_inf # branch if mant = 0 => return b
mv a0, a1 # return b
j recover
check_inf:
li t6, 0xFF
bne t2, t6, result_sign_1
bnez t4, result_sign_1
li a0, 0x7FC0 # return Nan
j recover
result_sign_1:
slli a0, s0, 15
j recover
check_zero:
bnez t3, check_2_inf
bnez t5, check_2_inf
bnez t2, result_sign_2
bnez t4, result_sign_2
li a0, 0x7FC0 # return Nan
j recover
result_sign_2:
slli a0, s0, 15
li t6, 0x7F80
or a0, a0, t6
j recover
check_2_inf:
li t6, 0xFF
bne t2, t6, check_div_zero
beqz t4, result_3
mv a0, a0 # return a
j recover
result_3:
slli a0, s0, 15
li t6, 0x7F80
or a0, a0, t6
j recover
check_div_zero:
bnez t2, norm
bnez t4, norm
slli a0, s0, 15
j recover
norm:
beqz t2, norm_b
ori t4, t4, 0x80
norm_b:
beqz t3, norm_end
ori t5, t5, 0x80
norm_end:
slli s1, t4, 15 # s1 : dividend = mant_a << 15
mv s2, t5 # s2 : divisor = mant_b
li s3, 0 # s3 : quotient = 0
li t6, 0 # t6 : i=0
div_loop:
li a2, 16 # a3 = 16
bge t6, a2, end_div_loop # branch if i >= 16
slli s3, s3, 1 # quotient <<= 1
li a3, 15
sub a3, a3, t6 # a2 = 15 - i
sll a4, s2, a3 # a3 = divisor << (15-i)
bltu s1, a4, skip_sub # branch if dividend is less than
sub s1, s1, a4 # dividend -= (divisor << (15-i))
ori s3, s3, 1 # quotient |= 1
skip_sub:
addi t6, t6, 1 # i++
j div_loop
end_div_loop:
sub a2, t2, t3 # a2 = exp_a - exp_b
addi a2, a2, 127 # a2 = result_exp = exp_a - exp_b + BF16_EXP_BIAS
bnez t2, res_b # branch if exp_a != 0
addi a2, a2, -1
res_b:
bnez t3, q_check # branch if exp_b != 0
addi a2, a2, 1
q_check:
li t6, 0x8000
and a4, s3, t6 # quotient & 0x8000
beqz a4, q_else
srli s3, s3, 8
j check_overflow
q_else:
q_loop:
li t6, 0x8000
and a4, s3, t6 # quotient & 0x8000
bnez a4, q_loop_done
li t6, 1
ble a2, t6, q_loop_done
slli s3, s3, 1
addi a2, a2, -1
j q_loop
q_loop_done:
srli s3, s3, 8
check_overflow:
andi s3, s3, 0x7F
li t6, 0xFF
bge a2, t6, overflow
j check_un
overflow:
slli a0, s0, 15
li t6, 0x7F80
or a0, a0, t6
j recover
check_un:
bgt a2, x0, final_result
slli a0, s0, 15
j recover
final_result:
slli a0, s0, 15
andi a2, a2, 0xFF
slli a2, a2, 7
or a0, a0, a2
andi s3, s3, 0x7F
or a0, a0, s3
recover:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
addi sp, sp, 16
ret
```
:::
### Square Root

Result with four test cases
:::spoiler Assembly Code
```asm=
bf16_sqrt:
addi sp, sp, -32
sw ra, 28(sp)
sw s0, 24(sp)
sw s1, 20(sp)
sw s2, 16(sp)
sw s3, 12(sp)
sw s4, 8(sp)
sw s5, 4(sp)
sw s6, 0(sp)
srli t0, a0, 15 # Shift right 15
andi t0, t0, 1 # Extract sign bit by and operation with 1
srli t1, a0, 7 # Shift right 7
andi t1, t1, 0xFF # Extract exponent by and operation with 0xFF (8 bits)
andi t2, a0, 0x7F # And operation with 0x7F to extract mantissa
li t3, 0xFF
bne t1, t3, check_zero # Branch if exp is not equal to 0xFF
bnez t2, return_a # Branch if mantissa is not equal to zero
bnez t0, return_nan # Branch if sign bit is not equal to zero
j return_a # Return the value of a (input)
check_zero: # First special case
or t3, t1, t2 #
bnez t3, check_negative
j return_zero
check_negative: # Second special case
bnez t0, return_nan
bnez t1, compute_sqrt
j return_zero
compute_sqrt:
addi s0, t1, -127
ori s1, t2, 0x80
andi t3, s0, 1
beqz t3, even_exp
slli s1, s1, 1
addi t4, s0, -1
srai t4, t4, 1
addi s2, t4, 127
j binary_search
even_exp:
srai t4, s0, 1
addi s2, t4, 127
binary_search:
li s3, 90
li s4, 256
li s5, 128
search_loop:
bgt s3, s4, search_done
add t3, s3, s4
srli t3, t3, 1
mv a1, t3
mv a2, t3
jal multiply
mv t4, a0
srli t4, t4, 7
bgt t4, s1, search_high
mv s5, t3
addi s3, t3, 1
j search_loop
search_high:
addi s4, t3, -1
j search_loop
search_done:
li t3, 256
blt s5, t3, check_low
srli s5, s5, 1
addi s2, s2, 1
j extract_mant
check_low:
li t3, 128
bge s5, t3, extract_mant
norm_loop:
li t3, 128
bge s5, t3, extract_mant
li t3, 1
ble s2, t3, extract_mant
slli s5, s5, 1
addi s2, s2, -1
j norm_loop
extract_mant:
andi s6, s5, 0x7F
li t3, 0xFF
bge s2, t3, return_inf
blez s2, return_zero
andi t3, s2, 0xFF
slli t3, t3, 7
or a0, t3, s6
j cleanup
return_zero:
li a0, 0
j cleanup
return_nan:
li a0, 0x7FC0
j cleanup
return_inf:
li a0, 0x7F80
j cleanup
return_a:
cleanup:
lw s6, 0(sp)
lw s5, 4(sp)
lw s4, 8(sp)
lw s3, 12(sp)
lw s2, 16(sp)
lw s1, 20(sp)
lw s0, 24(sp)
lw ra, 28(sp)
addi sp, sp, 32
ret
multiply:
li a0, 0
beqz a2, mult_done
mult_loop:
andi t0, a2, 1
beqz t0, mult_skip
add a0, a0, a1
mult_skip:
slli a1, a1, 1
srli a2, a2, 1
bnez a2, mult_loop
mult_done:
ret
```
:::
## 5-Stage Pipeline

Here's the block diagram of the standard RISC-V pipeline, it divides instruction execution into **5 stages**, allowing multiple instructions to overlap in execution, one instruction in each stage.
This mechanism improves throughput by executing several instructions simultaneously.
### Pipeline Principle
In this pipeline mechanism, each instruction passes through five stages in order:
IF -> ID -> EX -> MEM -> WB
Let's go through each stage and see how it works.
### I-type
First, I choose the first instruction after calling the test , ```addi sp, sp, -4``` function to discuss.
1. Instruction Fetch (IF)

* For Program Counter: Currently holds the value **0x00000040**, which points to the address of the current instruction.
* The value is then sent to the **Instruction Memory** as the address input.
* Instruction memory reads the 32-bit instruction stored at **0x00000040**. And **0xFFC10113** is the fetched instruction corresponding to the instruction ```addi sp, sp, -4```.
* Since `addi` is not a branch or jump instruction, the MUX selects the sequential path PC+4. The PC register is updated to 0x00000044 on the next clock cycle.
2. Instruction Decode (ID)

* Instruction is decoded to
* Opcode : `0x13`or`ADDI`
* r1 idx : 0x02 (x2) (sp)
* r2 idx is unused in I-type because I-type instruction adds an immediate to source register.
* rd, rs1 : x2
* Immediate : 0xfffffffc (-4)
* Reg1 = 0x7FFFFFF0 is the value of current stack pointer, and Reg2 is 0x00000000 stands for unused.
* The immediate will be sent into ALU with the output of register.
* All the data was then sent to ID/EX pipeline register waiting for being used in next stage.
3. Execute (EX)

* Since it's `ADDI` I-type instruction, we need not access to memory or jump, just receive the value from register file and execute addition.
* Input source, using two MUX to select the inputs of ALU as register value and immediate value:
* Op1 : 0x7FFFFFF0 (from register x2 (xp)), represents the original stack pointer
* Op2 : 0xFFFFFFFC (equals to immediate -4)
* ALU executes the addition according to the control signal ALUOp, and the result is sp-4 = 0x7FFFFFEC
* And there's a Branch Unit in the figure, but branch taken is set as 0 since this instruction is not a branch instruction, therefore, PC won't change the direction and keep on executing next instruction.
* After EX stage is over, ALU result, destination register and control signals are sent into EX/MEM register.
4. Memory Access (MEM)

* Data Memory module determine whether access or not depends on the control signals.
* For this `ADDI` instruction case, we won't access to memory since it's a pure arithmetic instruction, so the WrEn is red and the output of Data Memory module is 0 since no memory data was read.
* Pipeline register sends the results to the next stage, which meas it's a pass-through stage.
5. Write Back (WB)

* In this stage, there's a crucial MUX to decide whether the write back data is from ALU results or the value read by Data Memory.
* Since `ADDI` is a arithmetic instruction, the control signal is 0 and MUX chose ALU result as input then write back to register file.
* And we can see the WrEn turns into 1 which means it's available to write in, so the value of sp is updated to 0x7FFFFFEC.