# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [`姜冠宇`](https://github.com/Denny0097) >
## Introduction
### 688. Knight Probability in Chessboard
On an $n * n$ chessboard, a knight starts at the cell (row, column) and attempts to make exactly k moves. The rows and columns are 0-indexed, so the top-left cell is $(0, 0)$, and the bottom-right cell is $(n - 1, n - 1)$.
A chess knight has eight possible moves it can make, as illustrated below. Each move is two cells in a cardinal direction, then one cell in an orthogonal direction.

Each time the knight is to move, it chooses one of eight possible moves uniformly at random (even if the piece would go off the chessboard) and moves there.
The knight continues moving until it has made exactly k moves or has moved off the chessboard.
Return the probability that the knight remains on the board after it has stopped moving.
Example 1:
Input: n = 3, k = 2, row = 0, column = 0
Output: 0.06250
Explanation: There are two moves (to $(1,2)$, $(2,1)$) that will keep the knight on the board.
From each of those positions, there are also two moves that will keep the knight on the board.
The total probability the knight stays on the board is 0.0625.
Example 2:
Input: n = 1, k = 0, row = 0, column = 0
Output: 1.00000
Constraints:
$1 <= n <= 25$
$0 <= k <= 100$
$0 <= row, column <= n - 1$
## Implementation
You can find the source code [here](https://github.com/Denny0097).
### Motivation
I use the bf16 format to store those arrays and the fp32 format at calculate time, expecting it will have better space and computing performance.
### DP
(k,x,y): k mean step num, (x,y) mean cell
initially, knight is at cell (i,j)
>$(0,x,y) = 0$, when $x\neq i || y\neq j$
$(0,i,j) = 1$
$(k,x+2,y+1) = (k-1,x,y)/8$
$(k,x+2,y-1) = (k-1,x,y)/8$
$(k,x-2,y+1) = (k-1,x,y)/8$
$(k,x-2,y-1) = (k-1,x,y)/8$
$(k,x+1,y+2) = (k-1,x,y)/8$
$(k,x+1,y-2) = (k-1,x,y)/8$
$(k,x-1,y+2) = (k-1,x,y)/8$
$(k,x-1,y-2) = (k-1,x,y)/8$
$(k,x,y) = (k-1,x,y)$, except above
calculate the sum of prop that in chessboard
### c code for function knightProbabilit without FP transfer
```c=
double knightProbability(int n, int k, int row, int column){
double DPtable[n][n];
memset(DPtable, 0, sizeof(DPtable));
DPtable[row][column] = 1;
double Prop = 0.0;
int moves[8][2] = {{2,1},{2,-1},{-2,1},{-2,-1},{1,2},{1,-2},{-1,2},{-1,-2}};
for(int i = 0; i < k; i++){
double TDP[n][n];
memset(TDP, 0, sizeof(TDP));
for(int r = 0; r < n; r++){
for(int c = 0; c < n; c++){
for(int j = 0; j < 8; j++){
int moveRow = r + moves[j][0];
int moveCol = c + moves[j][1];
if(moveRow >= 0 && moveRow <= n-1 && moveCol >= 0 && moveCol <= n-1)
TDP[moveRow][moveCol] += DPtable[r][c]/8.0;
}
}
}
memcpy(DPtable, TDP, sizeof(DPtable));
}
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++){
Prop += DPtable[i][j];
}
}
return Prop;
}
```
### c code for function fp32_to_bf16/ bf16_to_fp32
```c=
typedef struct {
uint16_t bits;
} bf16_t;
// Inline function to convert FP32 to BF16
static inline bf16_t fp32_to_bf16(float s) {
/*'''*/
}
// Inline function to convert BF16 back to FP32
static inline float bf16_to_fp32(bf16_t h) {
/*'''*/
}
```
### **assembly** code for function fp32_to_bf16/ bf16_to_fp32
```riscv=
.text
.globl fp32_to_bf16
fp32_to_bf16: # return value put in a1
mv fp, sp
addi sp, sp, -16
sw ra, 12(sp)
sw fp, 8(sp)
sw a0, 4(sp) #store a0 argument(fp32 number)
# check NaN(((u.i & 0x7fffffff) > 0x7f800000))
li t0, 0x7FFFFFFF
and t1, a0, t0
li t0, 0x7f800000
bgt t1, t0, IsNaN # if t1 > 01 is NaN
# not Nan
li t0, 0x7fff
srli t6, a0, 0x10
andi t1, t6, 1
add t0, t0, t1
add t0, a0, t0
srli a1, t0, 0x10
addi sp, sp, 16
ret
IsNaN:
srli t0, a0, 0x10
ori a0, t0, 64
addi sp, sp, 16
ret
.text
bf16_to_fp32:
addi sp, sp, -16 # Save space on stack
sw ra, 12(sp) # Save return address
sw a0, 8(sp) # Save argument
# 將16位BF16數值移到高16位
slli t0, a0, 16 # Move BF16 to high 16 bits of 32-bit FP32
# 結果存回 a1,這裡 u.f = u.i (直接返回)
mv a0, t0 # Move the result to return register
lw ra, 12(sp) # Restore return address
addi sp, sp, 16 # Restore stack
jr ra # Return
```
### c code for function knightProbability
```c=
double knightProbability(int n, int k, int row, int column) {
// Define DPtable and TDP as arrays of bf16_t
bf16_t DPtable[n][n];
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
DPtable[i][j].bits = 0;
}
}
// Set the initial probability to 1.0 and convert to BF16
DPtable[row][column] = fp32_to_bf16(1.0f);
float Prop = 0.0f;
int moves[8][2] = {{2,1},{2,-1},{-2,1},{-2,-1},{1,2},{1,-2},{-1,2},{-1,-2}};
for (int i = 0; i < k; i++) {
bf16_t TDP[n][n];
for (int r = 0; r < n; r++) {
for (int c = 0; c < n; c++) {
TDP[r][c].bits = 0;
}
}
for (int r = 0; r < n; r++) {
for (int c = 0; c < n; c++) {
// Use FP32 for computation
if (DPtable[r][c].bits > 0) {
float prob_fp32 = bf16_to_fp32(DPtable[r][c]);
prob_fp32 /= 8.0f;
for (int j = 0; j < 8; j++) {
int moveRow = r + moves[j][0];
int moveCol = c + moves[j][1];
if (moveRow >= 0 && moveRow < n && moveCol >= 0 && moveCol < n) {
float temp_fp32 = bf16_to_fp32(TDP[moveRow][moveCol]);
temp_fp32 += prob_fp32;
TDP[moveRow][moveCol] = fp32_to_bf16(temp_fp32);
}
}
}
}
}
memcpy(DPtable, TDP, sizeof(DPtable));
}
for the final result
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
Prop += bf16_to_fp32(DPtable[i][j]);
}
}
return (double)Prop;
}
```
### assembly code for function knightProbability
Using ```Input: n = 3, k = 2, row = 0, column = 0``` as an example.
```riscv=
.text
.globl main
main:
li a0, 3
li a1, 2
li a2, 0
li a3, 0
jal knightProbability
#...略過
mv t0, a0
li a7, 2
ecall
li a7, 93
li a0, 0
ecall
# Input: n = 3, k = 2, row = 0, column = 0
# knightProbability: Calculate knight move probability
.text
.globl knightProbability
knightProbability:
# Arguments: a0 = n, a1 = k, a2 = row, a3 = column
addi sp, sp, -32
sw ra, 28(sp)
sw a0, 20(sp)
sw a1, 16(sp)
sw a2, 12(sp)
sw a3, 8(sp)
# Initialize DPtable with zeros
DP_initial:
# i: t2, j: t3, t0 = n, t7 as curr position
# Initialize DPtable with zeros
li t0, 0
la t1, DPtable # Load DPtable base address into t1
li t2, 18 # Set the size of DPtable (3x3 table, adjust accordingly for other sizes)
DP_zero_full:
sh t0, 0(t1) # Store zero into DPtable
addi t1, t1, 2 # Move to next word in DPtable
addi t2, t2, -2 # Decrease counter
bnez t2, DP_zero_full # Repeat until DPtable is filled
# Set DPtable[row][column] = 1.0 (in BF16 format)
Set_one_to_DPrc:
# Call the custom multiplication function my_mul
mv a1, a2 # Copy row into a1 (first argument for my_mul), a0 = n (second argument for my_mul)
call my_mul # Call my_mul function to calculate row * n, result in a0
mv t2, a0 # Store result back to t2
lw a0, 20(sp)
lw a1, 16(sp)
add t2, t2, a3 # Add column index
slli t2, t2, 1 # Each element is 2 bytes, so multiply index by 2
la t1, DPtable # Reload base address of DPtable into t1
add t1, t1, t2 # t1 now points to DPtable[row][column]
li t0, 0x3f80 # Load BF16 representation of 1.0 into t0
sh t0, 0(t1) # Store 1.0 (in BF16 format) to DPtable[row][column]
# Main loop: Iterate k times
li s3, 0 # Set loop counter i = 0
main_loop:
bge s3, a1, loop_end # if i >= k, exit loop
# Initialize TDPtable with zeros
li t0, 0
la t1, TDPtable # Load TDPtable base address into t1
li t2, 18 # Set the size of TDPtable (3x3 table)
TDP_zero_full:
sh t0, 0(t1) # Store zero into TDPtable
addi t1, t1, 2 # Move to next word in TDPtable
addi t2, t2, -2 # Decrease counter
bnez t2, TDP_zero_full # Repeat until TDPtable is filled
# Iterate over DPtable and update TDPtable
li s4, 0 # Set r = 0
iter_row: # Check if r == n, r: s4, c: s5, r*n: t6
bge s4, a0, row_done # if r >= n, finish processing rows
li s5, 0 # Set c = 0
iter_col: # Check if c == n
bge s5, a0, col_done # if c >= n, finish processing columns
# Check if DPtable[r][c] > 0
la s1, DPtable # Load DPtable base address into t1
# Call the custom multiplication function my_mul
mv a0, s4 # Copy row r into a0 (first argument for my_mul)
mv a1, a0 # Copy n into a1 (second argument for my_mul)
sw s3, 4(sp)
call my_mul # Call my_mul function to calculate r * n, result in a0
lw s3, 4(sp)
mv t6, a0 # Store result back to t6
lw a0, 20(sp)
lw a1, 16(sp)
add t6, t6, s5 # Add column index
slli t6, t6, 1 # Each element is 2 bytes, so multiply index by 2
# t0: DP[r][c]'s position, s1: DPtalbe[r][c]
add s1, s1, t6 # s1 now points to DPtable[r][c]
lw t0, 0(s1) # Load DPtable[r][c] into t0
slli t0, t0, 16 # trans bf16 to fp32
beqz t0, skip_calculate # If DPtable[r][c] == 0, skip to next iteration
# prob_fp32 /= 8.0f;
mv a0, t0
li a1, 3
call Fdiv
mv a4, a0 # a4: prob_fp32 /= 8.0f
lw a0, 20(sp)
lw a1, 16(sp)
# update TDPtable
# Convert BF16 to FP32 and divide by 8
# s8: 8, j: s9
li s9, 0
li s8, 8
move_loop:
sw t0, 4(sp) # temp store DPtable[r][c] val
slli t1, s9, 3 # j*2*4
la t0, moves # use t0 for move array's position
add t0, t0, t1 # t0 = move + j*2*4,t0 now points to move[j][0]
lw t1, 0(t0) # Load move[j][0]
addi t0, t0, 4
lw t2, 0(t0) # Load move[j][1]
add t1, t1, s4 # t1 = r + move[j][0]
add t2, t2, s5 # t2 = c + move[j][1]
## if (moveRow >= 0 && moveRow < n && moveCol >= 0 && moveCol < n)
bge t1, a0, next_move
blt t1, x0, next_move
bge t2, a0, next_move
blt t2, x0, next_move
prop_calcu:
#####
mv a1, t1 # Copy row into a1 (first argument for my_mul), a0 = n (second argument for my_mul)
call my_mul # Call my_mul function to calculate row * n, result in a0
mv s6, a0 # Store result back to s6
lw a0, 20(sp)
lw a1, 16(sp)
add s6, s6, t2 # Add column index
slli s6, s6, 1 # Each element is 2 bytes, so multiply index by 2
la s7, TDPtable # s7 now points to TDPtable
add s7, s7, s6 # s7 now points to TPtable[r][c]
lw t0, 0(s7)
mv a0, t0
mv a1, a4
call Fadd # jump to Fadd
mv t0, a0 # temp_fp32 += prob_fp32, a4: prob_fp32 /= 8.0f
lw a0, 20(sp)
lw a1, 16(sp)
srli t0, t0, 16 # tran temp_fp32 to bf16 format
sh t0, 0(s7) # Store fp32_to_bf16(temp_fp32) to DPtable[row][column]
#####
lw t0, 4(sp)
next_move:
addi t3, t3, 1
addi s9, s9, 1
bge s9, s8, skip_calculate
j move_loop
# (For simplicity, floating point emulation should be handled here)
# Store results back to TDPtable, updating neighboring cells based on knight moves
# ... (Process knight moves logic and update TDPtable)
skip_calculate:
addi s5, s5, 1 # c++
j iter_col
col_done:
addi s4, s4, 1 # r++
j iter_row
row_done:
# memcpy(DPtable, TDPtable, sizeof(DPtable))
la t1, DPtable # Load DPtable base address into t1
la t2, TDPtable # Load TDPtable base address into t2
li t3, 18 # Copy 18 bytes (3x3 table)
copy_loop:
lw t0, 0(t2) # Load word from TDPtable
sh t0, 0(t1) # Store word to DP
addi t1, t1, 2 # Move to next word in DPtable
addi t2, t2, 2 # Move to next word in TDPtable
addi t3, t3, -2 # Decrease counter
bnez t3, copy_loop # Repeat until DPtable is updated
addi s3, s3, 1 # i++
j main_loop # Continue main loop
loop_end:
sum_dp_table:
# Arguments:
# a0 = n
li t6, 0 # t6 as Prop sum
# Outer loop (i = 0)
li s1, 0 # s1 = i
outer_loop:
lw a0, 20(sp)
bgt s1, a0, return
# Inner loop (j = 0)
li s2, 0 # t2 = j
inner_loop:
lw a0, 20(sp)
bgt s2, a0, next_i # if j >= n, branch to next outer loop
# Calculate DPtable[i][j] address: base + (i * n + j) * 2
mv a1, s1
call my_mul # t3 = i * n
mv t3, a0
lw a0, 20(sp)
lw a1, 16(sp)
# test t3 val
la a0, str2
li a7, 4
ecall
mv a0, t3
li a7, 1
ecall
la a0, str5
li a7, 4
ecall
#
add t3, t3, s2 # t3 = i * n + j
slli t3, t3, 1 # t3 = (i * n + j) * 2
# test t3 val
la a0, str2
li a7, 4
ecall
mv a0, t3
li a7, 1
ecall
la a0, str5
li a7, 4
ecall
#
la t0, DPtable
add t3, t3, t0 # t3 = DPtable[i][j]'s position
# Load the BF16 value from DPtable[i][j]
lh t4, 0(t3) # load BF16 to t4
# Call bf16_to_fp32
mv a0, t4 # a0 = bf val
jal bf16_to_fp32 # call bf16_to_fp32, save return val at a0
mv t5, a0 # store fp val in t5
mv a1, t6
call Fadd # t6 = t6 + t5
mv t6, a0
lw a0, 20(sp)
lw a1, 16(sp)
# Increment j
addi s2, s2, 1 # j++
j inner_loop
next_i:
# Increment i
addi s1, s1, 1 # i++
# test
la a0, str6
li a7, 4
ecall
mv a0, s1
li a7, 1
ecall
la a0, str3
li a7, 4
ecall
mv a0, t6
li a7, 2
ecall
la a0, str5
li a7, 4
ecall
# test
j outer_loop
return:
mv a0, t6
lw ra, 28(sp)
addi sp, sp, 32
jr ra # Return final result
.text
.globl bf16_to_fp32
bf16_to_fp32:
addi sp, sp, -16 # Save space on stack
sw ra, 12(sp) # Save return address
sw a0, 8(sp) # Save argument
# 將16位BF16數值移到高16位
slli a0, a0, 16 # Move BF16 to high 16 bits of 32-bit FP32
# 結果存回 a1,這裡 u.f = u.i (直接返回)
lw ra, 12(sp) # Restore return address
addi sp, sp, 16 # Restore stack
jr ra # Return
.text
.globl my_mul
my_mul:
# Arguments:
# Return value in a0
# Initialize result register
li t0, 0 # t0 will store the result (初始化結果為 0)
li t1, 0 # t1 is the bit position counter (初始位移量)
mul_loop:
# Check if the least significant bit of a1 (multiplier) is 1
andi t2, a1, 1 # Extract the least significant bit of multiplier
# If the bit is 1, add (multiplicand << t1) to the result
beqz t2, skip_add # If the bit is 0, skip addition
sll t3, a0, t1 # Shift multiplicand by t1
add t0, t0, t3 # Add shifted multiplicand to result
skip_add:
# Shift multiplier right by 1
srli a1, a1, 1
addi t1, t1, 1 # Increment the bit position counter
# Repeat for all bits
bnez a1, mul_loop # If a1 is not zero, repeat the loop
# Store the result in a0 and return
mv a0, t0 # Move result to a0
ret
# a0: float val, a1: exp of 2
# Function to perform floating-point division
.text
.globl Fdiv
Fdiv:
addi sp, sp, -16
sw ra, 12(sp)
sw a0, 8(sp)
sw a1, 4(sp)
li t3, 0x7f800000
and t1, a0, t3
srli t1, t1, 23
li t2, 0x80
sub t0, t1, a1
slli t0, t0, 23
li t3, 0x800fffff
and t1, a0, t3
or a0, t1, t0
lw ra, 12(sp)
lw a1, 4(sp)
addi sp, sp, 16
ret
.text
.globl fadd
Fadd:
addi sp, sp, -24
sw ra, 20(sp)
sw t0, 16(sp)
sw t1, 12(sp)
sw t2, 8(sp)
sw t3, 4(sp)
sw s0, 0(sp)
beq a0, zero, return_num1
beq a1, zero, return_num2
j not_zero
return_num1: # when f1 = 0, return a1
mv a0, a1
j return_f
return_num2: # when f2 = 0, return a0
j return_f
not_zero:
mv t0, a0
mv t1, a1
slli t4, t0, 1 # don't need to get sign
srli t4, t4, 24 # exponent of f1
slli t5, t1, 1
srli t5, t5, 24 # exponent of f2
li s0, 0x800000
andi s1, t0, 0x1ff # fraction of f1
or s1, s1, s0 # add hidden bit
andi s2, t1, 0x1ff # fraction of f2
or s2, s2, s0 # add hidden bit
bge t4, t5, not_taken1 #if f1 > f2, branch to not_tacken1
sub t6, t5, t4 # else, shift = e2 - e1
srl s1, s1, t6 # right shift fraction1
mv t4, t5 # let t4 stroes the bigger exponent
not_taken1:
sub t6, t4, t5 # shift = e1 - e2
srl s2, s2, t6 # right shift fraction2
add s3, s1, s2 # result = fraction1 + fraction2
li s0, 0x1000000 # normalize the number
and s1, s3, s0
beq s1, zero, not_taken2
srli s3, s3, 1 # right shift the fraction
addi t4, t4, 1 # exponent+=1
not_taken2:
slli t4, t4, 23 # exponent << 23
li s0, 0x7FFFFF
and s3, s3, s0 # fraction & 0x7FFFFF
or a0, t4, s3 # concat the number
li a7, 2
return_f:
lw ra, 20(sp)
lw t0, 16(sp)
lw t1, 12(sp)
lw t2, 8(sp)
lw t3, 4(sp)
lw s0, 0(sp)
addi sp, sp, 24
ret
.data
.align 2
# moves: Knight's possible moves
moves:
.word 2, 1, 2, -1, -2, 1, -2, -1
.word 1, 2, 1, -2, -1, 2, -1, -2
.data
DPtable:
.half 0, 0, 0, 0, 0, 0, 0, 0, 0
.data
TDPtable:
.half 0, 0, 0, 0, 0, 0, 0, 0, 0
str2: .string "Test t3: : "
str3: .string " , "
str4: .string " test answer : "
str5: .string "\n"
str6: .string "round "
```
## Analysis
### Ripes Simulator
Testing the code using Ripes simulator.
| Execution info | use BF16 to FP32 |
|------------|----|
| Cycles| 2648 |
| Instrs. retired | 1997|
| CPI| 1.33|
| IPC| 0.754|
| Clock rate | 0 Hz
```=
00000000 <main>:
0: 00300513 addi x10 x0 3
4: 00200593 addi x11 x0 2
8: 00000613 addi x12 x0 0
c: 00000693 addi x13 x0 0
10: 010000ef jal x1 16 <knightProbability>
14: 00050293 addi x5 x10 0
18: 00200893 addi x17 x0 2
1c: 00000073 ecall
00000020 <knightProbability>:
20: fe010113 addi x2 x2 -32
24: 00112e23 sw x1 28 x2
28: 00a12a23 sw x10 20 x2
2c: 00b12823 sw x11 16 x2
30: 00c12623 sw x12 12 x2
34: 00d12423 sw x13 8 x2
00000038 <DP_initial>:
38: 00000293 addi x5 x0 0
3c: 10000317 auipc x6 0x10000
40: 00430313 addi x6 x6 4
44: 01200393 addi x7 x0 18
00000048 <DP_zero_full>:
48: 00531023 sh x5 0 x6
4c: 00230313 addi x6 x6 2
50: ffe38393 addi x7 x7 -2
54: fe039ae3 bne x7 x0 -12 <DP_zero_full>
00000058 <Set_one_to_DPrc>:
58: 00060593 addi x11 x12 0
5c: 00000097 auipc x1 0x0 <main>
60: 24c080e7 jalr x1 x1 588
64: 00050393 addi x7 x10 0
68: 01412503 lw x10 20 x2
6c: 01012583 lw x11 16 x2
70: 00d383b3 add x7 x7 x13
74: 00139393 slli x7 x7 1
78: 10000317 auipc x6 0x10000
7c: fc830313 addi x6 x6 -56
80: 00730333 add x6 x6 x7
84: 000042b7 lui x5 0x4
88: f8028293 addi x5 x5 -128
8c: 00531023 sh x5 0 x6
90: 00000993 addi x19 x0 0
00000094 <main_loop>:
94: 18b9d063 bge x19 x11 384 <sum_dp_table>
98: 00000293 addi x5 x0 0
9c: 10000317 auipc x6 0x10000
a0: fb630313 addi x6 x6 -74
a4: 01200393 addi x7 x0 18
000000a8 <TDP_zero_full>:
a8: 00531023 sh x5 0 x6
ac: 00230313 addi x6 x6 2
b0: ffe38393 addi x7 x7 -2
b4: fe039ae3 bne x7 x0 -12 <TDP_zero_full>
b8: 00000a13 addi x20 x0 0
000000bc <iter_row>:
bc: 12aa5263 bge x20 x10 292 <row_done>
c0: 00000a93 addi x21 x0 0
000000c4 <iter_col>:
c4: 10aada63 bge x21 x10 276 <col_done>
c8: 10000497 auipc x9 0x10000
cc: f7848493 addi x9 x9 -136
d0: 000a0513 addi x10 x20 0
d4: 00050593 addi x11 x10 0
d8: 01312223 sw x19 4 x2
dc: 00000097 auipc x1 0x0 <main>
e0: 1cc080e7 jalr x1 x1 460
e4: 00412983 lw x19 4 x2
e8: 00050f93 addi x31 x10 0
ec: 01412503 lw x10 20 x2
f0: 01012583 lw x11 16 x2
f4: 015f8fb3 add x31 x31 x21
f8: 001f9f93 slli x31 x31 1
fc: 01f484b3 add x9 x9 x31
100: 0004a283 lw x5 0 x9
104: 01029293 slli x5 x5 16
108: 0c028463 beq x5 x0 200 <skip_calculate>
10c: 00028513 addi x10 x5 0
110: 00300593 addi x11 x0 3
114: 00000097 auipc x1 0x0 <main>
118: 1c0080e7 jalr x1 x1 448
11c: 00050713 addi x14 x10 0
120: 01412503 lw x10 20 x2
124: 01012583 lw x11 16 x2
128: 00000c93 addi x25 x0 0
12c: 00800c13 addi x24 x0 8
00000130 <move_loop>:
130: 00512223 sw x5 4 x2
134: 003c9313 slli x6 x25 3
138: 10000297 auipc x5 0x10000
13c: ec828293 addi x5 x5 -312
140: 006282b3 add x5 x5 x6
144: 0002a303 lw x6 0 x5
148: 00428293 addi x5 x5 4
14c: 0002a383 lw x7 0 x5
150: 01430333 add x6 x6 x20
154: 015383b3 add x7 x7 x21
158: 06a35463 bge x6 x10 104 <next_move>
15c: 06034263 blt x6 x0 100 <next_move>
160: 06a3d063 bge x7 x10 96 <next_move>
164: 0403ce63 blt x7 x0 92 <next_move>
00000168 <prop_calcu>:
168: 00030593 addi x11 x6 0
16c: 00000097 auipc x1 0x0 <main>
170: 13c080e7 jalr x1 x1 316
174: 00050b13 addi x22 x10 0
178: 01412503 lw x10 20 x2
17c: 01012583 lw x11 16 x2
180: 007b0b33 add x22 x22 x7
184: 001b1b13 slli x22 x22 1
188: 10000b97 auipc x23 0x10000
18c: ecab8b93 addi x23 x23 -310
190: 016b8bb3 add x23 x23 x22
194: 000ba283 lw x5 0 x23
198: 00028513 addi x10 x5 0
19c: 00070593 addi x11 x14 0
1a0: 00000097 auipc x1 0x0 <main>
1a4: 1a8080e7 jalr x1 x1 424
1a8: 00050293 addi x5 x10 0
1ac: 01412503 lw x10 20 x2
1b0: 01012583 lw x11 16 x2
1b4: 0102d293 srli x5 x5 16
1b8: 005b9023 sh x5 0 x23
1bc: 00412283 lw x5 4 x2
000001c0 <next_move>:
1c0: 001e0e13 addi x28 x28 1
1c4: 001c8c93 addi x25 x25 1
1c8: 018cd463 bge x25 x24 8 <skip_calculate>
1cc: f65ff06f jal x0 -156 <move_loop>
000001d0 <skip_calculate>:
1d0: 001a8a93 addi x21 x21 1
1d4: ef1ff06f jal x0 -272 <iter_col>
000001d8 <col_done>:
1d8: 001a0a13 addi x20 x20 1
1dc: ee1ff06f jal x0 -288 <iter_row>
000001e0 <row_done>:
1e0: 10000317 auipc x6 0x10000
1e4: e6030313 addi x6 x6 -416
1e8: 10000397 auipc x7 0x10000
1ec: e6a38393 addi x7 x7 -406
1f0: 01200e13 addi x28 x0 18
000001f4 <copy_loop>:
1f4: 0003a283 lw x5 0 x7
1f8: 00531023 sh x5 0 x6
1fc: 00230313 addi x6 x6 2
200: 00238393 addi x7 x7 2
204: ffee0e13 addi x28 x28 -2
208: fe0e16e3 bne x28 x0 -20 <copy_loop>
20c: 00198993 addi x19 x19 1
210: e85ff06f jal x0 -380 <main_loop>
00000214 <sum_dp_table>:
214: 00000f93 addi x31 x0 0
218: 00000313 addi x6 x0 0
0000021c <outer_loop>:
21c: 04a35e63 bge x6 x10 92 <return>
220: 00000393 addi x7 x0 0
00000224 <inner_loop>:
224: 04a3d663 bge x7 x10 76 <next_i>
228: 02a30e33 mul x28 x6 x10
22c: 007e0e33 add x28 x28 x7
230: 001e1e13 slli x28 x28 1
234: 10000297 auipc x5 0x10000
238: e0c28293 addi x5 x5 -500
23c: 005e0e33 add x28 x28 x5
240: 000e1e83 lh x29 0 x28
244: 000e8513 addi x10 x29 0
248: 040000ef jal x1 64 <bf16_to_fp32>
24c: 00050f13 addi x30 x10 0
250: 000f8593 addi x11 x31 0
254: 00000097 auipc x1 0x0 <main>
258: 0f4080e7 jalr x1 x1 244
25c: 00050f93 addi x31 x10 0
260: 01412503 lw x10 20 x2
264: 01012583 lw x11 16 x2
268: 00138393 addi x7 x7 1
26c: fb9ff06f jal x0 -72 <inner_loop>
00000270 <next_i>:
270: 00130313 addi x6 x6 1
274: fa9ff06f jal x0 -88 <outer_loop>
00000278 <return>:
278: 000f8513 addi x10 x31 0
27c: 01c12083 lw x1 28 x2
280: 02010113 addi x2 x2 32
284: 00008067 jalr x0 x1 0
00000288 <bf16_to_fp32>:
288: ff010113 addi x2 x2 -16
28c: 00112623 sw x1 12 x2
290: 00a12423 sw x10 8 x2
294: 01051293 slli x5 x10 16
298: 00028513 addi x10 x5 0
29c: 00c12083 lw x1 12 x2
2a0: 01010113 addi x2 x2 16
2a4: 00008067 jalr x0 x1 0
000002a8 <my_mul>:
2a8: 00000293 addi x5 x0 0
2ac: 00000313 addi x6 x0 0
000002b0 <mul_loop>:
2b0: 0015f393 andi x7 x11 1
2b4: 00038663 beq x7 x0 12 <skip_add>
2b8: 00651e33 sll x28 x10 x6
2bc: 01c282b3 add x5 x5 x28
000002c0 <skip_add>:
2c0: 0015d593 srli x11 x11 1
2c4: 00130313 addi x6 x6 1
2c8: fe0594e3 bne x11 x0 -24 <mul_loop>
2cc: 00028513 addi x10 x5 0
2d0: 00008067 jalr x0 x1 0
000002d4 <Fdiv>:
2d4: ff010113 addi x2 x2 -16
2d8: 00112623 sw x1 12 x2
2dc: 00a12423 sw x10 8 x2
2e0: 00b12223 sw x11 4 x2
2e4: 7f800e37 lui x28 0x7f800
2e8: 01c57333 and x6 x10 x28
2ec: 01735313 srli x6 x6 23
2f0: 08000393 addi x7 x0 128
2f4: 02735663 bge x6 x7 44 <Exp_post>
2f8: 40b302b3 sub x5 x6 x11
2fc: 01729293 slli x5 x5 23
300: 80100e37 lui x28 0x80100
304: fffe0e13 addi x28 x28 -1
308: 01c57333 and x6 x10 x28
30c: 00536533 or x10 x6 x5
310: 00c12083 lw x1 12 x2
314: 00412583 lw x11 4 x2
318: 01010113 addi x2 x2 16
31c: 00008067 jalr x0 x1 0
00000320 <Exp_post>:
320: 00b302b3 add x5 x6 x11
324: 01729293 slli x5 x5 23
328: 80100e37 lui x28 0x80100
32c: fffe0e13 addi x28 x28 -1
330: 01c57333 and x6 x10 x28
334: 00536533 or x10 x6 x5
338: 00c12083 lw x1 12 x2
33c: 00412583 lw x11 4 x2
340: 01010113 addi x2 x2 16
344: 00008067 jalr x0 x1 0
00000348 <Fadd>:
348: fe810113 addi x2 x2 -24
34c: 00112a23 sw x1 20 x2
350: 00512823 sw x5 16 x2
354: 00612623 sw x6 12 x2
358: 00712423 sw x7 8 x2
35c: 01c12223 sw x28 4 x2
360: 00812023 sw x8 0 x2
364: 00050663 beq x10 x0 12 <return_num1>
368: 00058863 beq x11 x0 16 <return_num2>
36c: 0100006f jal x0 16 <not_zero>
00000370 <return_num1>:
370: 00058513 addi x10 x11 0
374: 07c0006f jal x0 124 <return_f>
00000378 <return_num2>:
378: 0780006f jal x0 120 <return_f>
0000037c <not_zero>:
37c: 00050293 addi x5 x10 0
380: 00058313 addi x6 x11 0
384: 00129e93 slli x29 x5 1
388: 018ede93 srli x29 x29 24
38c: 00131f13 slli x30 x6 1
390: 018f5f13 srli x30 x30 24
394: 00800437 lui x8 0x800
398: 1ff2f493 andi x9 x5 511
39c: 0084e4b3 or x9 x9 x8
3a0: 1ff37913 andi x18 x6 511
3a4: 00896933 or x18 x18 x8
3a8: 01eed863 bge x29 x30 16 <not_taken1>
3ac: 41df0fb3 sub x31 x30 x29
3b0: 01f4d4b3 srl x9 x9 x31
3b4: 000f0e93 addi x29 x30 0
000003b8 <not_taken1>:
3b8: 41ee8fb3 sub x31 x29 x30
3bc: 01f95933 srl x18 x18 x31
3c0: 012489b3 add x19 x9 x18
3c4: 01000437 lui x8 0x1000
3c8: 0089f4b3 and x9 x19 x8
3cc: 00048663 beq x9 x0 12 <not_taken2>
3d0: 0019d993 srli x19 x19 1
3d4: 001e8e93 addi x29 x29 1
000003d8 <not_taken2>:
3d8: 017e9e93 slli x29 x29 23
3dc: 00800437 lui x8 0x800
3e0: fff40413 addi x8 x8 -1
3e4: 0089f9b3 and x19 x19 x8
3e8: 013ee533 or x10 x29 x19
3ec: 00200893 addi x17 x0 2
000003f0 <return_f>:
3f0: 01412083 lw x1 20 x2
3f4: 01012283 lw x5 16 x2
3f8: 00c12303 lw x6 12 x2
3fc: 00812383 lw x7 8 x2
400: 00412e03 lw x28 4 x2
404: 00012403 lw x8 0 x2
408: 01810113 addi x2 x2 24
40c: 00008067 jalr x0 x1 0
```
### 5-stage pipelined processor
Ripes is a *five-stage*
#### IF

#### ID

#### EX

#### MEM

#### WB

## Reference
* [Quiz1 of Computer Architecture (2024 Fall)](https://hackmd.io/@sysprog/arch2024-quiz1-sol)
* [LeetCode : 688. Knight Probability in Chessboard](https://leetcode.com/problems/knight-probability-in-chessboard/description/)
* [RISC-V 指令集架構介紹 - RV32I](https://tclin914.github.io/16df19b4/)
* [Lab1: RV32I Simulator](https://hackmd.io/@sysprog/H1TpVYMdB)