# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed < [`vestata`](https://github.com/vestata) >
## Problem B in [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol)
### C code
**fp32_to_bf16**
```c
static inline bf16_t fp32_to_bf16(float s)
{
bf16_t h;
union {
float f;
uint32_t i;
} u = {.f = s};
if ((u.i & 0x7fffffff) > 0x7f800000) { /* NaN */
h.bits = (u.i >> 16) | 64; /* force to quiet */
return h;
}
h.bits = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10;
return h;
}
```
This code is a function written in C that converts a 32-bit floating-point number (FP32) into a 16-bit floating-point format called bfloat16 (BF16).
Recall that:
```c
┌ sign
│
│ ┌ exponent
│ │
│ │ ┌ mantissa
│ │ │
│┌──┴───┐┌─┴───┐
0b0000000000000000 bfloat16
```
**Handling special case**: In the bfloat16 format, a value is considered NaN (Not a Number) when the exponent is all 1s (`11111111` in binary), and the mantissa (fraction) is not entirely zero. This is similar to how NaNs are represented in the IEEE 754 floating-point formats.
If the exponent is all 1s but the mantissa is completely zero, the value represents infinity (positive or negative, depending on the sign bit). To distinguish NaN from infinity, at least one bit in the mantissa must be set to 1.
In particular, setting the first bit of the mantissa to 1 by `| 64` ensures that the value is treated as NaN instead of infinity. [Single-precision_floating-point wiki](https://en.wikipedia.org/wiki/Single-precision_floating-point_format#Exponent_encoding)
**In normal case**: The expression `(u.i >> 0x10) & 1` checks whether the bfloat16 result would be even or odd. It does this by examining the least significant bit of the lower 16 bits that will be discarded during the conversion.If the value is **odd** (i.e., the result of the check is 1), the code adds an extra 1 to `u.i` to help with rounding.After this adjustment, the value is right-shifted by 16 bits (`>> 0x10`), effectively converting the 32-bit float to bfloat16 by keeping the upper 16 bits.
This ensures that the conversion properly handles rounding by following the "round-to-nearest-even" rule.
The bfloat16 is stored in the lower 16 bits of a 32-bit word.
### RISC-V Assembly Implementation
handling union in risc-v
```c
fp32_to_bf16:
# Input: a0 = 32-bit float value
# Output: a0 = 16-bit bfloat16 value in the lower section of uint32_t
add t0, x0, a0
# special case
li t1, 0x7fffffff # set t1 to 0x7fffffff
and t2, t0, t1 # set t2 to u.i & 0x7fffffff
li t3, 0x7f800000 # set t3 to 0x7f800000
bltu t3, t2, handle_nan
# normal case
srli t5, a0, 0x10 # (u.i >> 0x10)
andi t5, t5, 1 # & 1
li t6, 0x7fff # set t6 to 0x7fff
add t5, t5, t6 # (0x7fff + ((u.i >> 0x10) & 1))
add a0, a0, t5 # apply round-to-nearest-even
srli a0, a0, 16 # move bf16 to less significant 16 bit
jr ra
handle_nan:
add t4, x0, a0 # set t4 to a0
srli t4, t4, 16 # u.i >> 16
ori t4, t4, 64 # | 64, /* force to quiet */
add a0, x0, t4 # set a0 to return
jr ra
```
<!-- :::warning
I initially aimed to implement bfloat16 vector dot product, but the C code implementation turned out to be quite complex. I'm not confident enough to translate it into assembly from scratch or meet the requirements for this assignment. As a result, I decided to switch to a different topic. If you're interested, you can check it out on my [GitHub](https://github.com/vestata/Computer-Architecture-2024.git).
::: -->
## Problem C in [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol)
### version 1
#### C code
```c
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
```
#### Assembly code
```assembly
main:
li a0, 0x13
jal ra, my_clz
li a7, 10
ecall
my_clz:
# Input: a0 = uint32_t
# Output: a0 = clz
li t0, 0 # count = 0
li t1, 31 # i = 31
clz_loop:
li t2, 1
sll t2, t2, t1 # (1U << i)
and t2, t2, a0 # x & (1U << i)
bnez t2, clz_end # if (x & (1U << i)) break
addi t0, t0, 1 # count++
addi t1, t1, -1 # i--
bgez t1, clz_loop # for loop
clz_end:
mv a0, t0 # set output
ret
```
| Info | Value |
|:----------------|:--------|
| Cycles | 267 |
| Instrs. retired | 201 |
| CPI | 1.33 |
| IPC | 0.753 |
| Clock rate | 0 Hz |
### version 2
Use bitmask to implement clz is an easy way to boost the speed of the code.
#### C code
```c
int my_clz(uint32_t x) {
if (x == 0) return 32;
int n = 0;
if (x <= 0x0000FFFF) { n += 16; x <<= 16; }
if (x <= 0x00FFFFFF) { n += 8; x <<= 8; }
if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; }
if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; }
if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; }
return n;
}
```
#### Assembly
```assembly
my_clz:
# Input: a0 = uint32_t
# Output: a0 = clz
beqz a0, clz_zero
li t0, 0 # n = 0
li t1, 0x0000FFFF
blt t1, a0, shift16_skip
addi t0, t0, 16 # n += 16;
slli a0, a0, 16 # x <<= 16;
shift16_skip:
li t1, 0x00FFFFFF
blt t1, a0, shift8_skip
addi t0, t0, 8 # n += 8;
slli a0, a0, 8 # x <<= 8;
shift8_skip:
li t1, 0x0FFFFFFF
blt t1, a0, shift4_skip
addi t0, t0, 4 # n += 4;
slli a0, a0, 4 # x <<= 4;
shift4_skip:
li t1, 0x3FFFFFFF
blt t1, a0, shift2_skip
addi t0, t0, 2 # n += 2;
slli a0, a0, 2 # x <<= 2;
shift2_skip:
li t1, 0x7FFFFFFF
blt t1, a0, clz_done
addi t0, t0, 1 # n += 1;
slli a0, a0, 1 # x <<= 1;
j clz_done
clz_zero:
li a0, 32
ret
clz_done:
li a0, t0
ret
```
| Info | Value |
|:----------------|:--------|
| Cycles | 46 |
| Instrs. retired | 32 |
| CPI | 1.44 |
| IPC | 0.696 |
| Clock rate | 0 Hz |
## [3307. Find the K-th Character in String Game II](https://leetcode.com/problems/find-the-k-th-character-in-string-game-ii/)
>**Description**:
Alice and Bob are playing a game. Initially, Alice has a string `word = "a"`.
>You are given a positive integer `k`. You are also given an integer array `operations`, where `operations[i]` represents the type of the `ith` operation.
>Now Bob will ask Alice to perform all operations in sequence:
If `operations[i] == 0`, append a copy of word to itself.
If `operations[i] == 1`, generate a new string by changing each character in `word` to its next character in the English alphabet, and append it to the original `word`. For example, performing the operation on `"c"` generates `"cd"` and performing the operation on `"zb"` generates `"zbac"`.
Return the value of the `kth` character in `word` after performing all the operations.
>Note that the character `'z'` can be changed to `'a'` in the second type of operation.
**Solution**:
So we need to find how many times `a` changes to reach the input position `k`, the first step is to calculate the word length of the final operation. Since the word initially starts with `a`, in the first iteration, the word length is two. In the second iteration, the word length becomes four. The word length grows exponentially. Therefore, to determine the word size at position `k`, it can be represented as $2^{\log_2(k)+1}$, which corresponds to $\log_2(k)+1$ operations.
The next step is to determine in which half of the string position `k` is located at each operation. If `k` is in the right half of the word, we increase the change count by one if `operation[i] == 1`. Then, we subtract the position of `k` to find its parent character. Use a for loop to go through all the operations.
In the final step, apply the accumulated changes to the character `a`.
```c
char kthCharacter(long long k, int* operations, int operationsSize) {
int mutations = 0;
for (int op = log2(k); op >= 0; --op)
if (k > 1LL << op) {
k -= 1LL << op;
mutations += operations[op];
}
return 'a' + mutations % 26;
}
```
The relationship between `ilog2` (integer log base 2) and `clz` (count leading zeros) stems from the fact that finding the logarithm base 2 of an integer can be directly related to identifying the position of the highest set bit in the binary representation of the number.
The `log2()` function can be defined by custom function, `my_clz()`.
```c
int ilog2(int n){
return 31 - my_clz(n);
}
```
Notice that the input of this question is in `long long`, so we hath to modify `my_clz()` and `ilog2()` to support 64-bit format.
```c
int my_clz64(uint64_t x) {
if (x == 0) return 64;
int n = 0;
if (x <= 0x00000000FFFFFFFF) { n += 32; x <<= 32; }
if (x <= 0x0000FFFFFFFFFFFF) { n += 16; x <<= 16; }
if (x <= 0x00FFFFFFFFFFFFFF) { n += 8; x <<= 8; }
if (x <= 0x0FFFFFFFFFFFFFFF) { n += 4; x <<= 4; }
if (x <= 0x3FFFFFFFFFFFFFFF) { n += 2; x <<= 2; }
if (x <= 0x7FFFFFFFFFFFFFFF) { n += 1; x <<= 1; }
return n;
}
int ilog2(long long n){
return 63 - my_clz64(n);
}
```
The full code is right [here](https://github.com/vestata/Computer-Architecture-2024/tree/main/hw1).
Since we have to handle `long long`, which is 64-bit data, we need to modify the `rv32i` code to use two registers to handle `long long`. Following the C code, we have defined three `long long` operations: `shift_left_64`, `blt_64`, and `sub_64`.
```assembly
my_clz64:
# Input:
# a0 (lower 32 bits of x)
# a1 (upper 32 bits of x)
# Output:
# a0 (clz)
or t0, a0, a1
beqz t0, zero_case # if x == 0
li t1, 0 # set n = 0
bnez a1, shift32_skip
addi t1, t1, 32 # n += 32
mv a1, a0 # x <<= 32
li a0, 0 # x_low = 0
shift32_skip:
# if (x_hi <= 0x0000ffff)
li t2, 0xffff
sltu t3, t2, a1 # t3(flag) = (0xfff < x_hi)
bnez t3, shift16_skip # if (x_hi <= 0000ffff)
addi t1, t1, 16 # n += 16
slli a1, a1, 16 # x_hi <<= 16
mv t3, a0
srli t3, t3, 16
or a1, a1, t3 # x_hi = (x_hi << 16) | (x_lo >> 16)
slli a0, a0, 16 #
shift16_skip:
# if (x_hi <= 0x00ffffff)
li t2, 0xffffff
sltu t3, t2, a1
bnez t3, shift8_skip
addi t1, t1, 8 # n += 8
slli a1, a1, 8 # x_hi <<= 8
mv t3, a0 # set a extra register to preserve the bits
# being shifted
srli t3, t3, 8
or a1, a1, t3 # x_hi = (x_hi << 8) | (x_lo >> 8)
slli a0, a0, 8 #
shift8_skip:
# if (x_hi <= 0x0fffffff)
li t2, 0xfffffff
sltu t3, t2, a1
bnez t3, shift4_skip
addi t1, t1, 4 # n += 4
slli a1, a1, 4 # x_hi <<= 4
mv t3, a0
srli t3, t3, 4
or a1, a1, t3 # x_hi = (x_hi << 4) | (x_lo >> 4)
slli a0, a0, 4 #
shift4_skip:
# if (x_hi <= 0x3fffffff)
li t2, 0x3fffffff
sltu t3, t2, a1
bnez t3, shift2_skip
addi t1, t1, 2 # n += 2
slli a1, a1, 2 # x_hi <<= 2
mv t3, a0
srli t3, t3, 2
or a1, a1, t3 # x_hi = (x_hi << 2) | (x_lo >> 2)
slli a0, a0, 2 #
shift2_skip:
# if (x_hi <= 0x7fffffff)
li t2, 0x7fffffff
sltu t3, t2, a1
bnez t3, clz_end
addi t1, t1, 1 # n += 1
slli a1, a1, 1 # x_hi <<= 1
mv t3, a0
srli t3, t3, 1
or a1, a1, t3 # x_hi = (x_hi << 1) | (x_lo >> 1)
slli a0, a0, 1 #
clz_end:
mv a0, t1 # return n
ret
zero_case:
li a0 64 # retrun 64
ret
ilog2:
addi sp, sp, -4
sw ra, 0(sp)
jal ra, my_clz64
lw ra, 0(sp)
addi sp, sp, 4
li t0, 63
sub a0, t0, a0
ret
kthCharacter:
# Inputs:
# a0: k_low (lower 32 bits of k)
# a1: k_high (upper 32 bits of k)
# a2: operations (pointer to array)
# a3: operationsSize
# Output:
# a0: resulting character
addi sp, sp, -28
sw ra, 24(sp) # save return address
sw s0, 20(sp) # save mutations
sw s1, 16(sp) # save op
sw s2, 12(sp) # save k_low
sw s3, 8(sp) # save k_high
sw s4, 4(sp) # save operations address
sw s5, 0(sp) # save operationsSize
mv s4, a2
mv s5, s3
li s0, 0 # set s0 to mutation = 0
mv s2, a0
mv s3, a1
jal ra, ilog2 # call ilog2, result in a0
mv s1, a0 # set s1 = op
loop:
bltz s1, loop_end # branch if op < 0
mv a0, s1 # a0 = op
jal ra, shift_left_64 # return a0, a1 for 64 bit data
mv t0, a0 # set t0 = shift_reult_low
mv t1, a1 # set t1 = shift_reult_high
mv a0, t0 # set a0 = shift_result_low
mv a1, t1 # set a1 = shift_result_high
mv a2, s2 # set a2 = k_low
mv a3, s3 # set a3 = k_high
jal ra, blt_64 # blt 1LL << op, k, return boolean result
# in a0
beqz a0, skip_if
mv a0, s2 # a0 = k_low
mv a1, s3 # a1 = k_high
mv a2, t0 # a2 = shift_result_low
mv a3, t1 # a3 = shift_result_high
jal ra, sub_64 # return result in a0, a1
# update k
mv s2, a0 # s2 = k_low
mv s3, a1 # s3 = k_high
# operation[op]
slli t0, s1, 2
add t1, s4, t0
lw t2, 0(t1)
add s0, s0, t2 # s0(mutations) += operations[op]
skip_if:
addi s1, s1, -1 # op--
j loop
loop_end:
li t0, 26
rem t1, s0, t0 # set t1 = mutations % 26
li t0, 0x61 # set t0 = ascii of 'a'
add a0, t0, t1 # a0 = 'a' + (mutations % 26)
lw s5, 0(sp)
lw s4, 4(sp)
lw s3, 8(sp)
lw s2, 12(sp)
lw s1, 16(sp)
lw s0, 20(sp)
lw ra, 24(sp)
addi sp, sp, 28
ret
# To handle left shift for 64-bit long long
shift_left_64:
# Inputs(long long):
# a0: shift amount(op)
# Outputs(long long):
# a0: lower 32 bits
# a1: upper 32 bits
li t4, 32
blt a0, t4, shift_less_32 # check if shift over the
# representation of 32-bit
# left shift more than 31bits
addi a0, a0, -32 #
li a1, 1 #
sll a1, a1, a0 # set x_hi = 1 << (op - 32)
li a0, 0 # set x_low = 0
j end
shift_less_32:
# left shift more than 32 bits
li a1, 1 # temp set 1
sll a0, a1, a0 # set x_low = 1LL << op
li a1, 0 # set x_hi = 0
j end
end:
ret
# To handle comaprison of two long long data
blt_64:
# Inputs(two long long):
# a0: x1_low
# a1: x1_high
# a2: x2_low
# a3: x2_high
# Outputs(long long):
# a0: boolean
bltu a1, a3, true
beq a1, a3, equal
false:
li a0, 0 # x1_high > x2_high, return false
ret
true:
# x1 < x2, return true
li a0, 1
ret
equal:
# if x1_high = x2_high, compare lower
bltu a0, a2, true # x1_low < x2_low, return true
j false
# To handle substraction of two long long data
sub_64:
# Inputs(two long long):
# a0: x1_low
# a1: x1_high
# a2: x2_low
# a3: x2_high
# Outputs(long long):
# a0: r_low
# a1: r_high
sltu t0, a0, a2 # t0 = (a0 < a2) ? 1 : 0, check if borrow
sub a0, a0, a2 # a0 = a0 - a2
sub a1, a1, a3 # a1 = a1 - a3
sub a1, a1, t0 # a1 = a1 - t0
ret
```
### Test cases
**Test case 1:**
>Input: k = 12145134613, operations = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1]
Output: "i" (0x69)
**Test case 2:**
>Input: k = 10, operations = [0, 1, 0, 1]
Output: "b" (0x62)
**Test case 3:**
>Input: k = 25, operations = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]
Output: "b" (0x62)
The test result as below:
```c
All tests pass!
Program exited with code: 0
```
The execution info as below:
| Info | Value |
|-------------------|--------|
| Cycles | 2687 |
| Instrs. retired | 1709 |
| CPI | 1.57 |
| IPC | 0.636 |
| Clock rate | 0 Hz |
## Analysis
### Pseudo instruction
```assembly
00000000 <main>:
0: 10000297 auipc x5 0x10000
4: 08c28293 addi x5 x5 140
8: 0002a503 lw x10 0 x5
c: 10000317 auipc x6 0x10000
10: 08430313 addi x6 x6 132
14: 00032583 lw x11 0 x6
18: 10000617 auipc x12 0x10000
1c: fe860613 addi x12 x12 -24
20: 10000397 auipc x7 0x10000
24: 07438393 addi x7 x7 116
28: 0003a683 lw x13 0 x7
2c: 1e0000ef jal x1 480 <kthCharacter>
30: 10000e17 auipc x28 0x10000
34: 068e0e13 addi x28 x28 104
38: 000e2e83 lw x29 0 x28
3c: 09c000ef jal x1 156 <check_result>
40: 10000297 auipc x5 0x10000
44: 06c28293 addi x5 x5 108
48: 0002a503 lw x10 0 x5
4c: 10000317 auipc x6 0x10000
50: 06430313 addi x6 x6 100
54: 00032583 lw x11 0 x6
58: 10000617 auipc x12 0x10000
5c: 04460613 addi x12 x12 68
60: 10000397 auipc x7 0x10000
64: 05438393 addi x7 x7 84
68: 0003a683 lw x13 0 x7
6c: 1a0000ef jal x1 416 <kthCharacter>
70: 10000e17 auipc x28 0x10000
74: 048e0e13 addi x28 x28 72
78: 000e2e83 lw x29 0 x28
7c: 05c000ef jal x1 92 <check_result>
80: 10000297 auipc x5 0x10000
84: 06428293 addi x5 x5 100
88: 0002a503 lw x10 0 x5
8c: 10000317 auipc x6 0x10000
90: 05c30313 addi x6 x6 92
94: 00032583 lw x11 0 x6
98: 10000617 auipc x12 0x10000
9c: 02460613 addi x12 x12 36
a0: 10000397 auipc x7 0x10000
a4: 04c38393 addi x7 x7 76
a8: 0003a683 lw x13 0 x7
ac: 160000ef jal x1 352 <kthCharacter>
b0: 10000e17 auipc x28 0x10000
b4: 040e0e13 addi x28 x28 64
b8: 000e2e83 lw x29 0 x28
bc: 01c000ef jal x1 28 <check_result>
c0: 00400893 addi x17 x0 4
c4: 10000517 auipc x10 0x10000
c8: 03050513 addi x10 x10 48
cc: 00000073 ecall
d0: 00a00893 addi x17 x0 10
d4: 00000073 ecall
000000d8 <check_result>:
d8: 01d51463 bne x10 x29 8 <fail>
dc: 00008067 jalr x0 x1 0
000000e0 <fail>:
e0: 00400893 addi x17 x0 4
e4: 10000517 auipc x10 0x10000
e8: 02150513 addi x10 x10 33
ec: 00000073 ecall
f0: 00a00893 addi x17 x0 10
f4: 00000073 ecall
000000f8 <my_clz64>:
f8: 00b562b3 or x5 x10 x11
fc: 0e028463 beq x5 x0 232 <zero_case>
100: 00000313 addi x6 x0 0
104: 00059863 bne x11 x0 16 <shift32_skip>
108: 02030313 addi x6 x6 32
10c: 00050593 addi x11 x10 0
110: 00000513 addi x10 x0 0
00000114 <shift32_skip>:
114: 000103b7 lui x7 0x10
118: fff38393 addi x7 x7 -1
11c: 00b3be33 sltu x28 x7 x11
120: 000e1e63 bne x28 x0 28 <shift16_skip>
124: 01030313 addi x6 x6 16
128: 01059593 slli x11 x11 16
12c: 00050e13 addi x28 x10 0
130: 010e5e13 srli x28 x28 16
134: 01c5e5b3 or x11 x11 x28
138: 01051513 slli x10 x10 16
0000013c <shift16_skip>:
13c: 010003b7 lui x7 0x1000
140: fff38393 addi x7 x7 -1
144: 00b3be33 sltu x28 x7 x11
148: 000e1e63 bne x28 x0 28 <shift8_skip>
14c: 00830313 addi x6 x6 8
150: 00859593 slli x11 x11 8
154: 00050e13 addi x28 x10 0
158: 008e5e13 srli x28 x28 8
15c: 01c5e5b3 or x11 x11 x28
160: 00851513 slli x10 x10 8
00000164 <shift8_skip>:
164: 100003b7 lui x7 0x10000
168: fff38393 addi x7 x7 -1
16c: 00b3be33 sltu x28 x7 x11
170: 000e1e63 bne x28 x0 28 <shift4_skip>
174: 00430313 addi x6 x6 4
178: 00459593 slli x11 x11 4
17c: 00050e13 addi x28 x10 0
180: 004e5e13 srli x28 x28 4
184: 01c5e5b3 or x11 x11 x28
188: 00451513 slli x10 x10 4
0000018c <shift4_skip>:
18c: 400003b7 lui x7 0x40000
190: fff38393 addi x7 x7 -1
194: 00b3be33 sltu x28 x7 x11
198: 000e1e63 bne x28 x0 28 <shift2_skip>
19c: 00230313 addi x6 x6 2
1a0: 00259593 slli x11 x11 2
1a4: 00050e13 addi x28 x10 0
1a8: 002e5e13 srli x28 x28 2
1ac: 01c5e5b3 or x11 x11 x28
1b0: 00251513 slli x10 x10 2
000001b4 <shift2_skip>:
1b4: 800003b7 lui x7 0x80000
1b8: fff38393 addi x7 x7 -1
1bc: 00b3be33 sltu x28 x7 x11
1c0: 000e1e63 bne x28 x0 28 <clz_end>
1c4: 00130313 addi x6 x6 1
1c8: 00159593 slli x11 x11 1
1cc: 00050e13 addi x28 x10 0
1d0: 001e5e13 srli x28 x28 1
1d4: 01c5e5b3 or x11 x11 x28
1d8: 00151513 slli x10 x10 1
000001dc <clz_end>:
1dc: 00030513 addi x10 x6 0
1e0: 00008067 jalr x0 x1 0
000001e4 <zero_case>:
1e4: 04000513 addi x10 x0 64
1e8: 00008067 jalr x0 x1 0
000001ec <ilog2>:
1ec: ffc10113 addi x2 x2 -4
1f0: 00112023 sw x1 0 x2
1f4: f05ff0ef jal x1 -252 <my_clz64>
1f8: 00012083 lw x1 0 x2
1fc: 00410113 addi x2 x2 4
200: 03f00293 addi x5 x0 63
204: 40a28533 sub x10 x5 x10
208: 00008067 jalr x0 x1 0
0000020c <kthCharacter>:
20c: fe410113 addi x2 x2 -28
210: 00112c23 sw x1 24 x2
214: 00812a23 sw x8 20 x2
218: 00912823 sw x9 16 x2
21c: 01212623 sw x18 12 x2
220: 01312423 sw x19 8 x2
224: 01412223 sw x20 4 x2
228: 01512023 sw x21 0 x2
22c: 00060a13 addi x20 x12 0
230: 00098a93 addi x21 x19 0
234: 00000413 addi x8 x0 0
238: 00050913 addi x18 x10 0
23c: 00058993 addi x19 x11 0
240: fadff0ef jal x1 -84 <ilog2>
244: 00050493 addi x9 x10 0
00000248 <loop>:
248: 0604c063 blt x9 x0 96 <loop_end>
24c: 00048513 addi x10 x9 0
250: 08c000ef jal x1 140 <shift_left_64>
254: 00050293 addi x5 x10 0
258: 00058313 addi x6 x11 0
25c: 00028513 addi x10 x5 0
260: 00030593 addi x11 x6 0
264: 00090613 addi x12 x18 0
268: 00098693 addi x13 x19 0
26c: 0a0000ef jal x1 160 <blt_64>
270: 02050863 beq x10 x0 48 <skip_if>
274: 00090513 addi x10 x18 0
278: 00098593 addi x11 x19 0
27c: 00028613 addi x12 x5 0
280: 00030693 addi x13 x6 0
284: 0a8000ef jal x1 168 <sub_64>
288: 00050913 addi x18 x10 0
28c: 00058993 addi x19 x11 0
290: 00249293 slli x5 x9 2
294: 005a0333 add x6 x20 x5
298: 00032383 lw x7 0 x6
29c: 00740433 add x8 x8 x7
000002a0 <skip_if>:
2a0: fff48493 addi x9 x9 -1
2a4: fa5ff06f jal x0 -92 <loop>
000002a8 <loop_end>:
2a8: 01a00293 addi x5 x0 26
2ac: 02546333 rem x6 x8 x5
2b0: 06100293 addi x5 x0 97
2b4: 00628533 add x10 x5 x6
2b8: 00012a83 lw x21 0 x2
2bc: 00412a03 lw x20 4 x2
2c0: 00812983 lw x19 8 x2
2c4: 00c12903 lw x18 12 x2
2c8: 01012483 lw x9 16 x2
2cc: 01412403 lw x8 20 x2
2d0: 01812083 lw x1 24 x2
2d4: 01c10113 addi x2 x2 28
2d8: 00008067 jalr x0 x1 0
000002dc <shift_left_64>:
2dc: 02000e93 addi x29 x0 32
2e0: 01d54c63 blt x10 x29 24 <shift_less_32>
2e4: fe050513 addi x10 x10 -32
2e8: 00100593 addi x11 x0 1
2ec: 00a595b3 sll x11 x11 x10
2f0: 00000513 addi x10 x0 0
2f4: 0140006f jal x0 20 <end>
000002f8 <shift_less_32>:
2f8: 00100593 addi x11 x0 1
2fc: 00a59533 sll x10 x11 x10
300: 00000593 addi x11 x0 0
304: 0040006f jal x0 4 <end>
00000308 <end>:
308: 00008067 jalr x0 x1 0
0000030c <blt_64>:
30c: 00d5e863 bltu x11 x13 16 <true>
310: 00d58a63 beq x11 x13 20 <equal>
00000314 <false>:
314: 00000513 addi x10 x0 0
318: 00008067 jalr x0 x1 0
0000031c <true>:
31c: 00100513 addi x10 x0 1
320: 00008067 jalr x0 x1 0
00000324 <equal>:
324: fec56ce3 bltu x10 x12 -8 <true>
328: fedff06f jal x0 -20 <false>
0000032c <sub_64>:
32c: 00c532b3 sltu x5 x10 x12
330: 40c50533 sub x10 x10 x12
334: 40d585b3 sub x11 x11 x13
338: 405585b3 sub x11 x11 x5
33c: 00008067 jalr x0 x1 0
```
### 5-stage pipelined processor
Now we can choose a processor to run this code. Ripes provide four kinds of processor for us to choose, including **single cycle processor**, **5-stage processor**, **5-stage with hazard detection** and **5-stage with forward and hazard detection**. Here we choose the **5 stage processor**. Its block diagram look like this:

The "5-stage" means this processor using five-stage pipeline to parallelize instructions. The stages are:
1. Instruction fetch (IF)
2. Instruction decode and register fetch (ID)
3. Execute (EX)
4. Memory access (MEM)
5. Register write back (WB)
You can see that each stage is separated by pipeline registers (the rectangle block with stage names on its each side) in the block diagram.
#### I-Format load
Detailed Analysis of `lw t2, 0(t1)` in RISC-V Pipeline:
>instruction format : `lw t2, 0(t1)`
The register `t1` stores the address of the input operation array plus the offset of the variable `op`. We can then check whether the data of operation array is `0` or `1`, and store the result in `t2`.
#### 1. Instruction fetch (IF)

- Follow [online tool for RISC-V Instruction Encoder/Decoder](https://luplab.gitlab.io/rvcodecjs/) we can find the machine code to `lw t2, 0(t1)` is 0x00032383.
- PC will increment by 4 automatically using the above adder.
- Because there is no branch occur, next instruction will be at PC + 4, so the multiplexer before PC choose input come from adder.
#### Instruction decode and register fetch (ID)

| imm[11:0] | rs1[5] | funct3[3] | rd[5] | opcode[7] | Instruction |
| ------------ | ------ | --------- | ----- | --------- | ------------- |
| imm[11:0] | rs1 | 010 | rd | 0000011 | lw |
| 000000000000 | 00110 | 010 | 00111 | 0000011 | lw(in binary) |
| 0x00000000 | 0x06 | 0x02 | 0x07 | 0x03 | lw(in hex) |
- Recall I-format load, we can split `lw t2, 0(t1)` into five parts. Which is shown above.
- `R2 idx` is not used in `lw` so it is `0x00`
- Current PC value (`0x00000298`) and next PC value (`0x0000029c`) are just send through this stage, we don't use them.
#### 3. Execute (EX)

- Since this is an I-type instruction, the first-level multiplexers (which normally select values from register 1 and register 2) are not used. These register values are filtered out by the second-level multiplexers, as the instruction does not rely on register-to-register operations.
- `Reg 1` and `Reg 2` are send to branch block, but no branch is taken.
- The second-level multiplexers select the current value of the program counter (PC) and the immediate value (`0`) to be used as the operands for the ALU.
- The ALU performs an addition operation between the base address in `0x06` and the immediate value (`0`). The result is the effective memory address, which is then passed to the memory access stage (EX/MEM).
#### 4. Memory access (MEM)

- Use as data memory address. Memory read data at address `0x10000084`, so Read `out` is equal to `0x00000001`. The table below denotes the data section of memory.
- 
- Next PC value (`0x0000029c`) and `Wr idx` (`0x07`) are just send through this stage, we don't use them.
- `Reg 2` is send to `Data in`, but memory doesn't enable writing.
#### 5. Register write back (WB)

- The multiplexer choose `Read out` from data memory as final output. So the output value is `0x00000001`.
- The output value and `Wr idx` are send back to registers block. Finally, the value `0x00000001` will be write into `x7` register, whose ABI name is `t2`.
After all these stage are done, the register is updated like this:

<!-- ## Matrix multiplication
The range of representation for 32-bit floating point numbers.
| Type | Exp | Fraction | Sign |
| -------------------- | ------- | -------- | ---- |
| Positive Zero | 0 | 0 | 0 |
| Negative Zero | 0 | 0 | 1 |
| Denormalised numbers | 0 | non zero | 0 |
| -Denormalised numbers | 0 | non zero | 1 |
| Normalised numbers | 1...254 | any | any |
| Infinities | $2^8$-1 | 0 | any |
| NaN | $2^8$-1 | non zero | any |
So, we need to handle cases where floating-point calculations result in underflow and overflow. Following previous work, both floating-point addition and floating-point multiplication are required to perform matrix multiplication. Some robust implementations can be referenced from [com-arch2023-quiz](https://hackmd.io/@sysprog/arch2023-quiz1#Problem-C).
<!-- ---
```c
float f32vecdot(float *A, float *B, int n) {
float ret = 0.0f;
for (int i = 0; i < n; i++) {
ret += A[i] * B[i];
}
return ret;
}
void f32tobf16vecdot(float *A, int n, uint32_t *result) {
int packed_size = n >> 1;
for (int i = 0; i < packed_size; i++) {
bf16_t bf16_a = fp32_to_bf16(A[i << 1]);
bf16_t bf16_b = fp32_to_bf16(A[i << 1 + 1]);
packed_result[i] = bf16_encode(bf16_a, bf16_b);
}
}
```
---
### `f_mul`
```c
float fmul32(float a, float b)
{
/* TODO: Special values like NaN and INF */
int32_t ia = *(int32_t *) &a, ib = *(int32_t *) &b;
/* sign */
int sa = ia >> 31;
int sb = ib >> 31;
/* mantissa */
int32_t ma = (ia & 0x7FFFFF) | 0x800000;
int32_t mb = (ib & 0x7FFFFF) | 0x800000;
/* exponent */
int32_t ea = ((ia >> 23) & 0xFF);
int32_t eb = ((ib >> 23) & 0xFF);
// handle zeros
if(ea == 0 || eb == 0){
int32_t f_zero = 0 | (sa ^ sb) << 31;
return *(float *) &f_zero;
}
// handle
/* 'r' = result */
int64_t mrtmp = imul32(ma, mb) >> 23;
int mshift = getbit(mrtmp, C01);
int64_t mr = mrtmp >> mshift;
int32_t ertmp = ea + eb - C02;
int32_t er = mshift ? inc(ertmp) : ertmp;
/* TODO: Overflow ^ */
int sr = sa ^ sb;
int32_t r = (sr << C03) | ((er & 0xFF) << C04) | (mr & 0x7FFFFF);
return *(float *) &r;
}
``` -->
<!-- Since we are using ripes for analysis, so we will focus at single processor. The data will focus on bfloat16 so the bottleneck of the matrix multiplication will be the the memory access. -->
<!-- matrix multiplication 主要分三個步驟
1. 從 main memory load 到 register
2. A * B
3. 累加到 C -->
<!-- The first thing we have to deal with is floating point manipulation, since no M or F/D extensions are allowed. To instruction have to be written from scratch, which are `f_add` and `f_mul`.
### `f_add`
```c
# Floating-point addition function (f_add)
f_add:
# Extract significand, exponent, and sign of A
li t6, 0x7FFFFF
and t0, a0, t6 # t0 = Significand of A (23 bits)
srli t2, a0, 23 # t2 = Exponent of A (8 bits)
andi t2, t2, 0xFF # Mask the exponent
srli t4, a0, 31 # t4 = Sign of A (1 bit)
# Extract significand, exponent, and sign of B
and t1, a1, t6 # t1 = Significand of B (23 bits)
srli t3, a1, 23 # t3 = Exponent of B (8 bits)
andi t3, t3, 0xFF # Mask the exponent
srli t5, a1, 31 # t5 = Sign of B (1 bit)
# Add implicit leading one to significands
li t6, 0x800000 # t6 = 1 << 23
or t0, t0, t6 # A's significand with implicit 1
or t1, t1, t6 # B's significand with implicit 1
# Compare exponents and align significands
bge t2, t3, align_B # If A's exponent >= B's exponent, align B
# Else, align A
sub t6, t3, t2 # t6 = Exponent difference
srl t0, t0, t6 # Shift A's significand
mv t2, t3 # Set exponent to the larger one
j add_significands
align_B:
sub t6, t2, t3 # t6 = Exponent difference
srl t1, t1, t6 # Shift B's significand
# Exponent t2 is already the larger one
add_significands:
# Handle sign and add/subtract significands
beq t4, t5, same_sign # If signs are the same, add
# Signs are different; subtract smaller from larger
blt t0, t1, swap_operands
sub t0, t0, t1 # t0 = t0 - t1
j normalize
swap_operands:
sub t0, t1, t0 # t0 = t1 - t0
mv t4, t5 # Result sign is sign of B
j normalize
same_sign:
add t0, t0, t1 # Add significands
normalize:
# Normalize result
li t6, 0x800000 # MSB position
normalize_loop:
blt t0, t6, shift_left
j check_overflow
shift_left:
slli t0, t0, 1 # Shift left
addi t2, t2, -1 # Decrease exponent
j normalize_loop
check_overflow:
srli t3, t0, 24 # Check for overflow
beqz t3, finalize
srli t0, t0, 1 # Shift right
addi t2, t2, 1 # Increase exponent
j check_overflow
finalize:
# Pack result into IEEE 754 format
li t6, 0x7FFFFF
and t0, t0, t6 # Mask significand to 23 bits
slli t2, t2, 23 # Position exponent
or t0, t0, t2 # Combine significand and exponent
slli t4, t4, 31 # Position sign bit
or a0, t0, t4 # Combine all parts
jr ra # Return to caller
```
-->
<!-- ### fp32 matmul -->
<!-- The naive way for baseline.
```c
void matmul(float* A, float* B, float* C, int M, int K, int N) {
for (int i = 0; i < M; i++) {
for (int j = 0; j < N; j++) {
for (int k = 0; k < K; k++) {
C[i * N + j] += A[i * K + k] * B[k * N + j];
}
}
}
}
```
```c
matmul:
# input
# a0 - float* A
# a1 - float* B
# a2 - float* C
# a3 - int M
# a4 - int K
# a5 - int N
# stack register
addi sp, sp, -24
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw s3, 12(sp)
sw s4, 16(sp)
sw s5, 20(sp)
mv s3, a3 # set s3 = M
mv s4, a4 # set s4 = K
mv s5, a5 # set s5 = N
# initial i = 0
li s0, 0 # set s0 = i
Loop_i:
bge s0, s3, Loop_i_end # if i >= M, go to Loop_i_end
# initial j = 0
li s1, 0 # set s1 = j
Loop_j:
bge s1, s5, Loop_j_end # if j >= N, go to Loop_j_end
mul t0, s0, s5 # t0 = i * N
add t0, t0, s1 # t0 = i * N + j
slli t0, t0, 2 # t0 = (i * N + j) * 4
add t0, t0, a2 # t0 = address of C[i * N + j]
lw t0, 0(t0) # t0 = C[i * N + j]
# initial k = 0
li s2, 0 # s2 = k
Loop_k:
bge s2, s4, Loop_k_end # if k >= K, go to Loop_k_end
mul t1, s0, s4 # t1 = i * K
add t1, t1, s2 # t1 = i * K + k
slli t1, t1, 2 # t1 = (i * K + k) * 4
add t1, t1, a0 # t1 = address of A[i * K + k]
lw t1, 0(t1) # t1 = A[i * K + k]
mul t2, s2, s5 # t2 = k * N
add t2, t2, s1 # t2 = k * N + j
slli t2, t2, 2 # t2 = (k * N + j) * 4
add t2, t2, a1 # t2 = address of B[k * N + j]
lw t2, 0(t2) # t2 = B[k * N + j]
mul t3, t1, t2 # t3 = t1 * t2
add t0, t0, t3 # t0 = t0 + t3
addi s2, s2, 1 # k++
j Loop_k # back to Loop_k
Loop_k_end:
# store the result to C[i * N + j]
sw t0, 0(t0) # C[i * N + j] = t0
addi s1, s1, 1 # j++
j Loop_j # back to Loop_j
Loop_j_end:
addi s0, s0, 1 # i++
j Loop_i # back to Loop_i
Loop_i_end:
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw s3, 12(sp)
lw s4, 16(sp)
lw s5, 20(sp)
addi sp, sp, 24
ret # return
``` -->
## TODO
:::warning
1. Check if popcount version clz can go faster
2. Improve `blt_64`, `shift_left_64`, `sub_64`
:::
## Reference
[Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol)
[Leetcode 3307. Find the K-th Character in String Game II](https://leetcode.com/problems/find-the-k-th-character-in-string-game-ii/description/)
[online tool for RISC-V Instruction Encoder/Decoder](https://luplab.gitlab.io/rvcodecjs/)