# Assignment 1: RISC-V Assembly and Instruction Pipeline
contributed by < [xwt](@xwt) >
> **AI tools usage**
## Problem b in [Quiz1](https://hackmd.io/@sysprog/arch2025-quiz1-sol)
### Overview
Assume uf8 implements a logarithmic 8-bit codec that maps 20-bit unsigned integers (
[0,1,015,792]) to 8-bit symbols via logarithmic quantization, delivering 2.5:1 compression and ≤6.25% relative error.
#### Decoding
$$ D(b) = m \cdot 2^e + (2^e - 1) \cdot 16 $$
#### Encoding
$$ E(v) = \begin{cases}
v, & \text{if } v < 16 \\
16e + \lfloor(v - \text{offset}(e))/2^e\rfloor, & \text{otherwise}
\end{cases} $$
Where $\text{offset}(e) = (2^e - 1) \cdot 16 $
#### Error Analysis
Absolute Error: $\Delta_{\max} = 2^e - 1$
Relative Error: $\varepsilon_{\max} = 1/16 = 6.25\%$
Expected Error: $\mathbb{E}[\varepsilon] \approx 3\%$
#### Information Theory
Input Entropy: 20 bits
Output Entropy: 8 bits
Theoretical Minimum: 7.6 bits (for 6.25% error bound)
Efficiency: 8/7.6 = 95% optimal
| Exponent | Range | Step Size |
| -------- |:-------------------- | --------- |
| 0 | [0, 15] | 1 |
| 1 | [16, 46] | 2 |
| 2 | [48, 108] | 4 |
| 3 | [112, 232] | 8 |
| ... | ... | $2^e$ |
| 15 | [524,272, 1,015,792] | 32,768 |
#### clz
A key part of the uf8_encode process is the use of the CLZ (Count Leading Zeros) operation. CLZ quickly determines the position of the most significant bit in a 32-bit integer. This allows the encoder to efficiently estimate the exponent that represents the magnitude of the input value. Instead of using slow logarithmic or floating-point operations, clz provides a fast integer-based way to identify how large a number is.
1. clz stackless
Uses conditional branches (beq /bne) to progressively shift and test bits. It shifts the number right by 16, 8, 4, 2, 1 bits iteratively, halving c each time, to find where the first 1 appears.
2. clz branchless
The code counts leading zeros (CLZ) of a 32-bit unsigned value without any branch that depends on the data.
It performs a fixed sequence of “probe & shift” on chunk sizes 16, 8, 4, 2, 1 bits:
| Feature | clz_stackless | clz_branchless |
|:------------------------ | ------------------ |:----------------------------------------- |
| **Uses branches** | Yes | No |
| **Timing behavior** | Variable | Constant |
| **Code size** | Smaller | Larger |
| **Impact** | Pipeline may flush | No flush |
| **Applicable Scenarios** | low-end MCUs | CPUs with pipelines such as RISC-V or ARM |
* Pipeline flush
A pipeline flush occurs when the instructions currently in the CPU pipeline must be discarded because the processor’s control flow has changed unexpectedly, usually due to a branch misprediction, exception, or interrupt. When this happens, all the instructions that were fetched or partially executed after the mispredicted branch (or before the correct target is known) become invalid, so the processor flushes them out of the pipeline and starts fetching the correct instructions again. This process causes the CPU to discard partially executed instructions, refill the pipeline from the correct instruction address, and results in a performance penalty due to several lost cycles. Pipeline flushes are commonly caused by branch mispredictions, exceptions, interrupts, self-modifying code, or pipeline hazards that invalidate prior assumptions. They lead to reduced instruction throughput and decreased efficiency, especially in deep pipelines such as modern RISC-V or ARM processors.
To avoid a pipeline flush, I implemented the clz_branchless version.
### clz
#### stackless
The code provided by the professor is a stackless version. I first implemented this version, and then compared it with the branchless version.
* c code
```c=
static inline unsigned clz(uint32_t x)
{
int n = 32, c = 16;
do {
uint32_t y = x >> c;
if (y) {
n -= c;
x = y;
}
c >>= 1;
} while (c);
return n - x;
}
```
* assebly code
```risc-v=
clz:
li t0, 32 # n = 32
li t1, 16 # c = 16
c:
srl t2, a0, t1 # y = x >> c
beqz t2, y # if (y == 0) skip
sub t0, t0, t1 # n -= c
mv a0, t2 # x = y
y:
srli t1, t1, 1 # c >>= 1
bnez t1, c # while (c)
sub a0, t0, a0 # return n - x
ret
```

#### branchless
```risc-v=
clz_branchless:
li t0, 32 # n = 32
mv t1, a0 # u = x
# step 16
srli t2, t1, 16 # t2 = u >> 16
snez t2, t2 # t2 = (u>>16)!=0 ? 1:0
slli t3, t2, 4 # t3 = t2<<4 = 0 or 16
srl t1, t1, t3 # u >>= t3
sub t0, t0, t3 # n -= t3
# step 8
srli t2, t1, 8
snez t2, t2 # 0/1
slli t3, t2, 3 # 0 or 8
srl t1, t1, t3
sub t0, t0, t3
# step 4
srli t2, t1, 4
snez t2, t2 # 0/1
slli t3, t2, 2 # 0 or 4
srl t1, t1, t3
sub t0, t0, t3
# step 2
srli t2, t1, 2
snez t2, t2 # 0/1
slli t3, t2, 1 # 0 or 2
srl t1, t1, t3
sub t0, t0, t3
# step 1
srli t2, t1, 1
snez t2, t2 # 0/1
mv t3, t2 # 0 or 1
srl t1, t1, t3
sub t0, t0, t3
# return n - u
sub a0, t0, t1
ret
```

In the Ripes simulator, the execution results show that the stackless version required 56 cycles to complete 28 instructions, giving a CPI of 2.0 and IPC of 0.5. In contrast, the branchless version finished in only 43 cycles while executing 33 instructions, achieving a CPI of 1.3 and IPC of 0.767.
This means the branchless implementation, though slightly longer in instruction count, executed more efficiently because it avoided conditional branches that often cause pipeline flushes. Without these branches, the instruction flow was smoother, minimizing stalls and allowing better pipeline utilization.
As a result, the branchless version improved performance by reducing wasted cycles, increasing instruction throughput, and achieving a higher execution efficiency — an expected outcome on pipelined architectures like RISC-V.
### uf8_decode
``` risc-v=
uf8_decode:
andi t0, a0, 0x0F # t0 = mantissa
srli t1, a0, 4 # t1 = exponent (0..15)
li t2, 15
sub t2, t2, t1 # t2 = 15 - exponent
li t3, 0x7FFF
srl t3, t3, t2 # t3 = 0x7FFF >> (15 - exponent)
slli t3, t3, 4 # t3 <<= 4 = offset
sll t4, t0, t1 # t4 = mantissa << exponent
add a0, t4, t3 # a0 = t4 + offset
ret
```
### uf8_encode
``` risc-v=
uf8_encode:
addi sp, sp, -4 # Allocate stack space# Allocate stack space
sw ra, 0(sp) # Save return address of test
mv s3, a0
li t0, 16
blt a0, t0, encode_done
jal ra, clz # a0 = lz
li t1, 31
sub t1, t1, a0 # t1 = msb
li t2, 0 # exponent
li t3, 0 # overflow
li t4, 5
blt t1, t4, exact # t4=(msb<5)
addi t2, t1, -4 # exponent = msb-4
li t4, 16
blt t2, t4 , overflow #exponent < 16
li t2, 15 #exponent = 15
overflow:
li t5, 0 # e=0
overflow_loop:
bge t5, t2, estimate #e >= exponent
slli t3, t3, 1 #overflow << 1
addi t3, t3, 16 #(overflow << 1) + 16
addi t5, t5, 1 #e++
j overflow_loop
estimate:
beqz t2, exact #exponent = 0
bge s3, t3, exact #value < overflow
addi t3, t3, -16 #overflow - 16
srli t3, t3, 1 #(overflow - 16) >> 1
addi t2, t2, -1 # exponent--
j estimate
exact:
li t5, 15
bgeu t2, t5, exact_done # if (exponent >= 15) break
slli t4, t3, 1 #(overflow << 1)
addi t4, t4, 16 #next_overflow=(overflow << 1) + 16
bltu s3, t4, exact_done ## value < next_overflow break
mv t3, t4 #overflow = next_overflo
addi t2, t2, 1 #exponent++
j exact
exact_done:
sub t4, s3, t3 #value - overflow
srl t4, t4, t2 #(value - overflow) >> exponent
slli t2, t2, 4 #exponent << 4
or a0, t2, t4 #(exponent << 4) | mantissa
encode_done:
lw ra, 0(sp)
addi sp, sp, 4
ret
```
### Result




### Problem b full code
:::spoiler full c code
```c=
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
typedef uint8_t uf8;
static inline unsigned clz(uint32_t x)
{
int n = 32, c = 16;
do {
uint32_t y = x >> c;
if (y) {
n -= c;
x = y;
}
c >>= 1;
} while (c);
return n - x;
}
/* Decode uf8 to uint32_t */
uint32_t uf8_decode(uf8 fl)
{
uint32_t mantissa = fl & 0x0f;
uint8_t exponent = fl >> 4;
uint32_t offset = (0x7FFF >> (15 - exponent)) << 4;
return (mantissa << exponent) + offset;
}
/* Encode uint32_t to uf8 */
uf8 uf8_encode(uint32_t value)
{
/* Use CLZ for fast exponent calculation */
if (value < 16)
return value;
/* Find appropriate exponent using CLZ hint */
int lz = clz(value);
int msb = 31 - lz;
/* Start from a good initial guess */
uint8_t exponent = 0;
uint32_t overflow = 0;
if (msb >= 5) {
/* Estimate exponent - the formula is empirical */
exponent = msb - 4;
if (exponent > 15)
exponent = 15;
/* Calculate overflow for estimated exponent */
for (uint8_t e = 0; e < exponent; e++)
overflow = (overflow << 1) + 16;
/* Adjust if estimate was off */
while (exponent > 0 && value < overflow) {
overflow = (overflow - 16) >> 1;
exponent--;
}
}
/* Find exact exponent */
while (exponent < 15) {
uint32_t next_overflow = (overflow << 1) + 16;
if (value < next_overflow)
break;
overflow = next_overflow;
exponent++;
}
uint8_t mantissa = (value - overflow) >> exponent;
return (exponent << 4) | mantissa;
}
/* Test encode/decode round-trip */
static bool test(void)
{
int32_t previous_value = -1;
bool passed = true;
for (int i = 0; i < 256; i++) {
uint8_t fl = i;
int32_t value = uf8_decode(fl);
uint8_t fl2 = uf8_encode(value);
if (fl != fl2) {
printf("%02x: produces value %d but encodes back to %02x\n", fl,
value, fl2);
passed = false;
}
if (value <= previous_value) {
printf("%02x: value %d <= previous_value %d\n", fl, value,
previous_value);
passed = false;
}
previous_value = value;
}
return passed;
}
int main(void)
{
if (test()) {
printf("All tests passed.\n");
return 0;
}
return 1;
}
```
:::
:::spoiler full assembly code
```risc-v=
.data
str0: .string ":produces value "
str1: .string " but encodes back to "
str2: .string ":value "
str3: .string " <= previous_value "
str4: .string "\n All tests passed.\n"
str5: .string "\n somthing wrong.\n"
nl: .string "\n"
.text
.globl main
#
main:
addi sp, sp, -4 # adjust stack for item
sw ra, 0(sp) # save return address of main
jal ra, test # go to test
beqz a0, fail # tests failed
la a0, str4 # load string4
li a7, 4 # Call system call (printString)
ecall
j exit
fail:
la a0, str5 # load string4
li a7, 4 # Call system call (printString)
ecall
exit:
lw ra, 0(sp) # restore register for caller
addi sp, sp, 4 # adjust stack to delete item
li a7, 10 # exit the program
ecall
test:
addi sp, sp, -20 # allocate stack space
sw ra, 16(sp) # save return address of test
sw s0, 12(sp) # save s0
sw s1, 8(sp) # save s1
sw s1, 4(sp) # save s2
sw s1, 0(sp) # save s3
li s0, -1 # previous_value = -1
li s1, 1 # bool passed = true
li s2, 0 # i = 0
li s3, 0 # value = 0
loop1:
li t0, 256 # t0 = 256
bge s2, t0, true # i = 256 go to true
mv a0, s2 # a0 = i
jal ra, uf8_decode # call uf8_decode
mv s3, a0 # s3 = value = uf8_decode(fl)
jal ra, uf8_encode # call uf8_encode
mv t2, a0 # t2 = fl2 = uf8_encode(value)
beq s2, t2, next # fl = fl2 go to next
li s1, 0 # bool passed = 0 (False)
mv a0, s2 # fl
jal ra, print_int # print int (fl)
la a0, str0 # ":produces value "
jal ra, print_str # print string (str0)
mv a0, s3 # value
jal ra, print_int # print int (value)
la a0, str1 # "but encodes back to "
jal ra, print_str # print string (str1)
mv a0, t2 # fl2
jal ra, print_int # print int (fl2)
jal ra, print_nl # print string (nl)
next:
blt s0, s3, update # previous < value go to update
li s1, 0 # bool passed = 0 (False)
mv a0, s2 # fl
jal ra, print_int
la a0, str2 # ":value "
jal ra, print_str
mv a0, s3 # value
jal ra, print_int
la a0, str3 # "<= previous_value"
jal ra, print_str
mv a0, s0 # previous_value
jal ra, print_int
la a0, str3
jal ra, print_nl
update:
mv s0, s3 # previous_value = value
addi s2, s2, 1 # i++
j loop1 #go to loop1
true:
mv a0, s1 # move bool value into a0
lw s3, 0(sp) # restore s3-0 from stack
lw s2, 4(sp)
lw s1, 8(sp)
lw s0, 12(sp)
lw ra, 16(sp) # restore ra from stack
addi sp, sp, 20 # release this function's stack frame
ret # return to caller
uf8_decode:
andi t0, a0, 0x0F # mantissa = fl & 0x0f
srli t1, a0, 4 # exponent = fl >> 4
li t2, 15
sub t2, t2, t1 # t2 = 15 - exponent
li t3, 0x7FFF
srl t3, t3, t2 # t3 = 0x7FFF >> (15 - exponent)
slli t3, t3, 4 # offset= t3 <<= 4
sll t4, t0, t1 # t4 = mantissa << exponent
add a0, t4, t3 # a0 = t4 + offset
ret # return to caller
uf8_encode:
addi sp, sp, -4
sw ra, 0(sp)
li t0, 16
blt a0, t0, encode_done #value < 16 go to encode_done
jal ra, clz # call clz,return a0(lz)
li t1, 31
sub t1, t1, a0 # msb = 31 - lz
li t2, 0 # exponent = 0
li t3, 0 # overflow = 0
li t4, 5
blt t1, t4, exact # msb < 5 go to exact
addi t2, t1, -4 # exponent = msb-4
li t4, 16
blt t2, t4, overflow # exponent < 16 go to overflow
li t2, 15 # exponent = 15
overflow:
li t5, 0 # e = 0
overflow_loop:
bge t5, t2, estimate # e >= exponent go to estimate
slli t3, t3, 1 # overflow << 1
addi t3, t3, 16 # (overflow << 1) + 16
addi t5, t5, 1 # e++
j overflow_loop #go to overflow_loop
estimate:
beqz t2, exact #exponent = 0
bge s3, t3, exact #value < overflow go to exact
addi t3, t3, -16 #overflow - 16
srli t3, t3, 1 #(overflow - 16) >> 1
addi t2, t2, -1 # exponent--
j estimate #go to estimate
exact:
li t5, 15
bgeu t2, t5, exact_done # if (exponent >= 15) go to exact_done
slli t4, t3, 1 # (overflow << 1)
addi t4, t4, 16 # next_overflow = (overflow << 1) + 16
bltu s3, t4, exact_done # value < next_overflow go to exact_done
mv t3, t4 # overflow = next_overflo
addi t2, t2, 1 # exponent++
j exact
exact_done:
sub t4, s3, t3 # value - overflow
srl t4, t4, t2 # (value - overflow) >> exponent
slli t2, t2, 4 # exponent << 4
or a0, t2, t4 # (exponent << 4) | mantissa
encode_done:
lw ra, 0(sp)
addi sp, sp, 4
ret
clz:
li t0, 32 # n = 32
li t1, 16 # c = 16
c:
srl t2, a0, t1 # y = x >> c
beqz t2, y # if (y == 0) skip
sub t0, t0, t1 # n -= c
mv a0, t2 # x = y
y:
srli t1, t1, 1 # c >>= 1
bnez t1, c # while (c)
sub a0, t0, a0 # return n - x
ret
print_int: # a0 = int
li a7, 1
ecall
ret
print_str: # a0 = &str
li a7, 4
ecall
ret
print_nl: # a0 = &str
la a0, nl
li a7, 4
ecall
ret
```
:::
## Problem c in [Quiz1](https://hackmd.io/@sysprog/arch2025-quiz1-sol)
### Overview
The bfloat16 format (16-bit, from Google Brain) preserves float32’s dynamic range by keeping the same 8-bit exponent, but reduces precision to a 7-bit significand (vs. 23).
Bit Layout
```
┌─────────┬──────────────┬──────────────┐
│Sign (1) │ Exponent (8) │ Mantissa (7) │
└─────────┴──────────────┴──────────────┘
15 14 7 6 0
S: Sign bit (0 = positive, 1 = negative)
E: Exponent bits (8 bits, bias = 127)
M: Mantissa/fraction bits (7 bits)
```
The value 𝑣 of a BFloat16 number is calculated as:
$$v = (-1)^S \times 2^{E-127} \times \left(1 + \frac{M}{128}\right)$$
where:
* $S \in \{0, 1\}$ is the sign bit
* $E \in [1, 254]$ is the biased exponent
* $M \in [0, 127]$ is the mantissa value
Special Cases
* Zero:$E = 0, M = 0 \Rightarrow v = (-1)^S \times 0$
* Infinity: $E = 255, M = 0 \Rightarrow v = (-1)^S \times \infty$
* NaN: $E = 255, M \neq 0 \Rightarrow v = \text{NaN}$
* Denormals: Not supported (flush to zero)
### Special Value Check Function
:::spoiler c code
``` c=
static inline bool bf16_isnan(bf16_t a)
{
return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
(a.bits & BF16_MANT_MASK);
}
static inline bool bf16_isinf(bf16_t a)
{
return ((a.bits & BF16_EXP_MASK) == BF16_EXP_MASK) &&
!(a.bits & BF16_MANT_MASK);
}
static inline bool bf16_iszero(bf16_t a)
{
return !(a.bits & 0x7FFF);
}
```
:::
:::spoiler assebly code
``` risc-v=
# ----------------------------------------
# bf16_isnan(a0 = bf16 bits) -> a0 = 1/0
# ----------------------------------------
bf16_isnan:
li t0, BF16_EXP_MASK # t0 = 0x7F80
and t1, a0, t0 # t1 = a & BF16_EXP_MASK
bne t1, t0, return_F # if ((a & EXP_MASK) != EXP_MASK) -> 0
li t0, BF16_MANT_MASK # t0 = 0x007F
and t1, a0, t0 # t1 = a & BF16_MANT_MASK
snez a0, t1 # a0 = (t1 != 0) ? 1 : 0
ret
# ----------------------------------------
# bf16_isinf(a0 = bf16 bits) -> a0 = 1/0
# ----------------------------------------
bf16_isinf:
li t0, BF16_EXP_MASK # t0 = 0x7F80
and t1, a0, t0 # t1 = a & BF16_EXP_MASK
bne t1, t0, return_F # if ((a & EXP_MASK) != EXP_MASK) -> 0
li t0, BF16_MANT_MASK # t0 = 0x007F
and t1, a0, t0 # t1 = a & BF16_MANT_MASK
seqz a0, t1 # a0 = (t1 == 0) ? 1 : 0
ret
# ----------------------------------------
# bf16_iszero(a0 = bf16 bits) -> a0 = 1/0
# ----------------------------------------
bf16_iszero:
li t0, 0x7FFF # t0 = 0x7FFF
and t1, a0, t0 # t1 = a & 0x7FFF
seqz a0, t1 # a0 = (t1 == 0) ? 1 : 0
ret
# ----------------------------------------
# return 0
# ----------------------------------------
return_F:
li a0, 0
ret
```
:::
### Conversion Function
:::spoiler c code
``` c=
```
:::
:::spoiler assebly code
``` risc-v=
# ----------------------------------------
# f32_to_bf16 / bf16_to_f32
# ----------------------------------------
f32_to_bf16:
srli t0, a0, 23 # f32bits >> 23
andi t0, t0, 0xFF # (f32bits >> 23) & 0xFF
li t1, 0xFF
beq t0, t1, bit16 # exponent == 0xFF
srli t2, a0, 16 # t2 = f32bits >> 16
andi t2, t2, 1 # t2 = ((f32bits >> 16) & 1)
li t3, 0x7FFF
add t2, t2, t3
add a0, a0, t2
srli a0, a0, 16
ret
bit16: # exponent == 0xFF
srli a0, a0, 16 # f32bits >> 16
ret
bf16_to_f32:
slli a0, a0, 16
ret
```
:::
### Addition and Substraction
:::spoiler c code
``` c=
```
:::
:::spoiler assebly code
``` risc-v=
```
:::
### Multiplication
:::spoiler c code
``` c=
```
:::
:::spoiler assebly code
``` risc-v=
```
:::
### Division
:::spoiler c code
``` c=
```
:::
:::spoiler assebly code
``` risc-v=
```
:::
### Square
:::spoiler c code
``` c=
```
:::
:::spoiler assebly code
``` risc-v=
```
:::
## 5-Stage Pipeline Processor
| Stage | full name | Description |
| ------- |:---------------------------------- |:-------------------------------------------------------- |
| **IF** | Instruction Fetch | Fetch the instruction from memory |
| **ID** | Instruction Decode & Register Read | Decode the instruction and read register operands |
| **EX** | Execution or Address Calculation | Execute the operation or calculate the effective address |
| **MEM** | Data Memory Access | Access data memory for load or store instructions |
| **WB** | Write Back | Write the result back into the destination register |
1. IF
2. ID
3. EX
4. MEM
5. WB
| Metric | Full Name | Description |
|:------------------ |:---------------------- |:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Cycles** | Clock Cycles | The total number of clock cycles elapsed since program execution began. |
| **Instrs Retired** | Instructions Retired | The total number of instructions that have successfully passed through all pipeline stages (IF → ID → EX → MEM → WB) and completed execution. Flushed or canceled instructions are not counted. |
| **CPI** | Cycles Per Instruction | The average number of clock cycles required to complete one instruction. Calculated as **CPI = Cycles ÷ Instrs Retired**. A lower CPI indicates better performance and fewer stalls. |
| **IPC** | Instructions Per Cycle | The average number of instructions completed per clock cycle. Calculated as **IPC = Instrs Retired ÷ Cycles**. A higher IPC indicates better parallelism and pipeline efficiency. |
## Reference
[Quiz1 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz1-sol)
[Pipeline](https://hackmd.io/@ExcitedMail/rynNyOKJY)