# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [vicLin8712](https://github.com/vicLin8712/ca2025-quizzes#) >
## Problem B
Refer to [Quiz1 of Computer Architecture (2025 Fall) Problem `B`](https://hackmd.io/@sysprog/arch2025-quiz1-sol)
### Assembly Implementation
#### First approach clz
To calculate leading zero of given unsigned integer `x`, I left shift `x` until the first non-zero bit position at the MSB. The implementation code as below showed:
```c
main:
lw a0, test
clz:
li t0, 0 # counts of leading zero
li t1, 0x80000000 # MSB set to 1
clz_loop:
and t2, t1, a0 # Check MSB
bnez t2, clz_end # branch
addi t0,t0,1
slli a0, a0,1
j clz_loop
clz_end:
mv a0, t0
li a7,1
ecall
```
The time complexity obvious is $O(n)$ which determined by the position of the first non-zero bit. The corresponding performance tests as below showed with best, worst, and general case:

Fig. Best case when `x=0x80000000`

Fig. Worst case when `x=0x0000001`

Fig. General case when `x=0x00010000`
#### Second approach clz
Refer to the algorithm how problem `B` approaches its `clz`, the implementation assembly showed below:
```c
main:
lw a0, test
clz:
li t0, 32 # n
li t1, 16 # c
clz_loop:
srl t2, a0, t1 # y = x >> c
beqz t2, skip_sub # if (y==0)
sub t0, t0, t1 # n -= c
mv a0, t2 # x = y
skip_sub:
srli t1, t1, 1 # c >>= 1
bnez t1, clz_loop
end_clz:
sub a0, t0, a0 # return n - x
li a7, 1
ecall
```
The above assembly executed in $O(log_2n)$ time complexity, only depended on the bit length of given data.

Fig. `x=0x80000000`

Fig. `x=0x00000001`
As above execution information showed, total cycles are fixed nomatter where the first non-zero bit at.
#### branchless CLZ
#### uf8_decode
I compiled `uf8_decode` with `RISC-V (32-bits) gcc -o2` on [Compiler Explorer](https://godbolt.org/). The corresponding assembly code and performance is:
```c
uf8_decode:
srli a3,a0,4
li a4,15
li a5,32768
sub a4,a4,a3
addi a5,a5,-1
sra a5,a5,a4
andi a0,a0,15
sll a0,a0,a3
slli a5,a5,4
add a0,a5,a0
ret
```

In my version, I tried to modified origin mapping function as:
$$
\begin{aligned}
D(b) &= m \cdot 2^e + (2^e - 1) \cdot 16
\\&= (m+16)\cdot2^e - 16
\end{aligned}
$$
And the following assembly code and performance are:
```c
uf8_decode:
srli t0, a0, 4 # e = b >> 4
andi t1, a0, 0x0F # m = b & 0xF
addi t1, t1, 16 # t1 = m + 16
sll t2, t1, t0 # t2 = (m+16) << e = (m+16)*2^e
addi a0, t2, -16 # a0 = (m+16)*2^e - 16
ret
```

#### uf8_encode
Consider the encoding function as problem `B` showed:
$$
\begin{aligned}
E(v) = \begin{cases}
v, & \text{if } v < 16 \\
16e + \lfloor(v - \text{offset}(e))/2^e\rfloor, & \text{otherwise}
\end{cases}
\end{aligned}
$$
When given a 32-bit integer $v$, we wish to get an 8-bit integer $x$ to represent its origin value. The structure of 8-bit utf8 code is:
<div align="right">
<img src=https://hackmd.io/_uploads/rJI4oH8hle.png
width="400">
</div>
There is only 4 bit available for exponent, which means that the maximum exponent value is $2^4-1=15$. The value $v$ will located at a range of corresponding exponent value. Based on the `utf_decode` function, we can construct corresponding range table for different exponent as problem `B` showed:

For example, if exponenet is $1$, the minimum and maximum value are:
$$
\begin{aligned}
&D(00010000) = 0 + (2^1-1)\cdot 16 = 16, m=0000 \, e=0001\\
&D(00011111) = 15\cdot 2^1 + (2^1-1)\cdot 16 = 46,m=1111 \, e=0001
\end{aligned}
$$
The first step is to decided the exponent value where $v$ located at. Assume that we have correct exponent value $e$ now, we are going to figure out mantissa value. Refer to decode function, we can obtain what mantissa value should be:
$$
\begin{aligned}
&D(b) = m \cdot 2^e + (2^e - 1) \cdot 16\\
&\Rightarrow \frac {D(b)-(2^e-1)\cdot16}{2^e} = m
\end{aligned}
$$
The below pictures illustrate how a utf-8 data decoded and restore. The test program should verify wether uf8 remaining the same after decoded and encoded.
:::info
utf8 -> int32 -> utf8: No infomation loss
int32 -> utf8 -> int32: Information loss due to data compaction
:::

#### approach 1
The fisrt task is to figure out the exponent of given value.
We can assume that the exponenet can be calculated by `clz` method because the first non-zero value decide the minimum value of given 32-bit integer.
```asm
# a0: Passed in value, the value to be encoded.
uf8_encode:
# a1: exponent
# a2: offset
# Retrun if a0 < 16
li t0, 16
blt a0, t0, r
# clz, caller save process
addi sp, sp, -8
sw a0, 0(sp)
sw ra, 4(sp)
jal clz
mv a1, a0 # Store leading zero into a1
lw a0, 0(sp)
lw ra, 4(sp)
addi sp, sp, 8
# Initial exponent gauss
li t0, 27 # exp max = msb-4 = 31-a1-4 = 27 -4
li t1, 15 # Max exp limit
sub a1, t0, a1
blt a1, t1, skip_max_exp
mv a1, t1
skip_max_exp:
# Caller save process for offset calculation
addi sp, sp, -12
sw a0, 0(sp)
sw a1, 4(sp)
sw ra,8(sp)
mv a0, a1
jal cal_offset # Put exp a1 into a0, call cal offset
mv a2, a0 # Store offset value based on gaussed exponent into a2
lw a0, 0(sp)
lw a1, 4(sp)
lw ra, 8(sp)
addi sp,sp,12
reset_over:
# Check offset over over value
blt a2, a0, skip_overflow
addi a2, a2, -16
srli a2, a2,1
addi a1,a1,-1
j reset_over
skip_overflow:
# Check offset less than next offset value
li t1, 15
bge a1,t1, end_uf8_encode
slli t0,a2,1
addi t0,t0,16
blt a0, t0, end_uf8_encode
mv a2,t0
addi a1,a1,1
j skip_overflow
end_uf8_encode:
sub t0, a0, a2 # Mantissa = value(a0)-a2(overflow)
srl t0, t0, a1 # Mantissa >> exponent(a1)
slli a1, a1, 4 # Exp(a1) << 4
or a0, a1,t0 # Store exp << 4 | Mantissa
ret
# Calculate offset for given a0
cal_offset:
# a0: exponenet value for offset calculating
li t0, 0 # exp index
li t1, 0 # offset value
offset_loop:
bge t0, a0, end_cal_offset
slli t1,t1,1
addi t1, t1,16
addi t0,t0,1
j offset_loop
end_cal_offset:
mv a0, t1
jr ra
```
#### approach 2
For a given value `x` to be encoded into utf-8, I tried another approach, searching pre-build table to figure out exponenet directly. The code size comparing to **approach 1** can be redueced almost third of it:
```asm
table: .word 16,48,112,240,496,1008,2032,4048,8176
,16368,32752,65520,131056,262128,524272
# a0: The value to be encoded
uf8_encode:
# a1: exponent
# a2: offset
la a2, table # Load offset table
li a1, 0 # Initial exponent value
li t1, 15 # Maximum exp value
lw t2, 0(a2) # Load data of offset table
next_offset:
lw t2, 0(a2) # Load data of offset table
blt a0, t2, end_uf8_encode # If value less than next offset
# Go to next offset value
addi a1, a1, 1 # Exponent++
addi a2, a2, 4 # Next offset address
mv t3, t2 # Store previous data
bge a1, t1, end_uf8_encode
j next_offset
end_uf8_encode:
mv a2, t3 # Store offset value into a2
sub t0, a0, a2 # Mantissa = value(a0)-a2(overflow)
srl t0, t0, a1 # Mantissa >> exponent(a1)
slli a1, a1, 4 # Exp(a1) << 4
or a0, a1,t0 # Store exp << 4 | Mantissa
ret
```
### Validation
Now we have two different uf8 encode and decode approach.
First one is using gcc compiler -02 optimization and the other is using pre-build table without `clz`.
The below two results all executed on the `Rips` simulator. With case from `uf8 = 0x00` to `uf8 = 0xFF`.
#### gcc
I used [Compiler Explorer](https://godbolt.org/) to compile source code and run on the Ripes. Of course the number of whole cycles includs string output and relative operations.

#### table
> commit [2f2dbf9](https://github.com/vicLin8712/ca2025-quizzes/commit/2f2dbf906f35261c3becc57ce526cc0da910db7f)
The cycles almost reduced $28$ % using table approach! Although the fixed size memory required for this table.

> commit [4ec3276](https://github.com/vicLin8712/ca2025-quizzes/commit/4ec32764d419968e5d4a42f993bb5cdef8a0e517) support [rv32emu](https://github.com/sysprog21/rv32emu) system calls
#### Analysis
## Problem C
In this problem, four operations are implemented for `bf16` data type, `add`, `minus`, `mul`, and `div`. The bit field of `bf16` is:
```
┌─────────┬──────────────┬──────────────┐
│Sign (1) │ Exponent (8) │ Mantissa (7) │
└─────────┴──────────────┴──────────────┘
15 14 7 6 0
S: Sign bit (0 = positive, 1 = negative)
E: Exponent bits (8 bits, bias = 127)
M: Mantissa/fraction bits (7 bits)
```
The relation between real value $v$ and `bf16` representation can be calculated as:
\begin{aligned}
v=(-1)^S\times 2^{E-127} \times \bigg (1+\frac M{128} \bigg )
\end{aligned}
where
\begin{aligned}
&S:sign\\
&E:Exponent\\
&M:Mantissa
\end{aligned}
Some special case defined as follow:
* Zero: $E=0$ and $M=0$
* Infinity: $E=255$ and $M=0$
* NaN: $E=255$ and $M \ne 0$
* Denormals: Not supported (flush to zero)
### bf16_add
There are several edge conditions that must be considered first. Let us denote the parameters `a` and `b` as the two values to be added. We need to follow the rules of the addition operation described in **Section 6 of the IEEE 754-2019 Standard**.
| Operand a | Operand b | Result | Explanation |
| --------- |:---------:| ------ | --------------------------------------------------------------------------------- |
| +∞ | finite | +∞ | Adding any finite number to positive infinity still results in positive infinity. |
| -∞ | finite | -∞ | Adding any finite number to negative infinity still results in negative infinity. |
| finite | +∞ | +∞ | Symmetric case; finite plus positive infinity is positive infinity. |
| finite | -∞ | -∞ | Symmetric case; finite plus negative infinity is negative infinity. |
| +∞ | +∞ | +∞ | Same sign infinities remain the same (positive infinity). |
| -∞ | -∞ | -∞ | Same sign infinities remain the same (negative infinity). |
| +∞ | -∞ | NaN | Opposite sign infinities cause invalid operation → Not a Number (NaN). |
| -∞ | +∞ | NaN | Opposite sign infinities cause invalid operation → Not a Number (NaN). |
Based on the table, we first check whether a or b is `NaN`; if so, return `NaN`.
Otherwise, check whether either operand is `infinity`.
If both are `infinity`, return `NaN` when their signs differ, otherwise return that `infinity`.
If exactly one is infinity, return that infinity (preserving its sign).
If neither is `NaN` or `infinity`, proceed with the normal finite addition.
We can use flag to handle such problem
```c
/* value a: bit 0 1 2, mantissa > 0, exp full, sign bit
* value b: bit 3 4 5, mantissa > 0, exp full, sign bit
*/
uint8_t flag;
```