# **Assignment 1– UF8 (Problem B), RV32I (Ripes)**
**1. Problem Statement**
In this assignment, we are tasked with implementing a variable-length encoding and decoding scheme (UF8) using RISC-V RV32I assembly. The program will handle a set of integers (test inputs), encode them, decode them, and check the accuracy of the decoding using predefined error bounds.
**Encoding/Decoding Details:**
uf8_encode(a0=v) → a0=byte: Encodes the integer v into a variable-length encoding byte.
uf8_decode(a0=b) → a0=value: Decodes the byte b into the corresponding integer value.
---
**2. Algorithm**
**UF8 Encoding:**
The encoding process involves converting a value v into a byte. The byte is split into two parts:
- The exponent e determines the number of bits used for the mantissa m.
- The mantissa is shifted and combined with an offset derived from the exponent.
The encoding algorithm can be described as:
e = floor(log2((v >> 4) + 1))
m = v - offset
The result is encoded_byte = (e << 4) | m
**UF8 Decoding:**
The decoding process reverses the encoding operation : Extract the exponent e and mantissa m from the byte.
Compute the value as : decoded_value = (m << e) + offset
**Error Bound:**
The error is checked by comparing the absolute difference between the decoded value and the original value, scaled by a factor of 16, to the original value:
|decoded_value - original_value| * 16 <= original_value
If the error condition is met, the result is valid.
---
**3. Implementation (RV32I)**
Encoding
```
uf8_encode:
li t0, 16
blt a0, t0, Lret_v # v<16 -> v
addi t1, x0, 0 # e
addi t2, x0, 0 # offset
li t3, 16 # const 16
Lfind_e:
slli t4, t2, 1
add t4, t4, t3 # next_offset = (offset<<1)+16
blt a0, t4, Lfound_e
addi t2, t4, 0 # offset = next_offset
addi t1, t1, 1 # e++
li t5, 15
bge t1, t5, Lfound_e # guard e<=15
j Lfind_e
Lfound_e:
sub t6, a0, t2
srl t6, t6, t1 # mantissa
slli t4, t1, 4
or a0, t4, t6 # (e<<4)|mantissa
ret
Lret_v:
ret
```
Decoding
```
uf8_decode:
srli t0, a0, 4 # e
andi t1, a0, 0x0F # m
li t2, 1
sll t2, t2, t0 # 1<<e
addi t2, t2, -1 # (1<<e)-1
slli t2, t2, 4 # offset = ((1<<e)-1)<<4
sll t3, t1, t0 # m<<e
add a0, t3, t2 # value
ret
```
**Design Decisions**
I chose to implement the while loop for determining e in the initial version (baseline). However, I later optimized the algorithm by using bit manipulation (log2-based method) to avoid the iterative process and reduce cycle time.
The program uses li and addi instructions for constants and offsets, which are important for both encoding and decoding steps.
---
**4. Pipeline Analysis**
**Observations**
Branch Resolution: In uf8_encode, the branch instructions are resolved in the EX stage, and any misprediction results in a flush. This is visible in the pipeline when the blt instruction is executed.
Load-Use Hazard: After the lw instruction in uf8_encode, there is a stall because the sw instruction is consuming the same value, which is resolved by inserting a 1-cycle bubble.
**Pipeline Stages**
IF: Fetches the next instruction.
ID: Decodes the instruction and reads registers.
EX: Performs arithmetic, logic, or branch decision.
MEM: Reads from or writes to memory.
WB: Writes the result back to the register file.
The primary pipeline hazard that we are dealing with is branch misprediction.
**Exeample**
jal ra, uf8_decode


1.When the jal ra, uf8_decode instruction enters the EX (Execute) stage, the target address is calculated. However, the subsequent two instructions, sw a0, 0(s1) and addi s1, s1, 4, have already been sent into the pipeline in the IF (Instruction Fetch) and ID (Instruction Decode) stages.


2.These instructions are now out of sync with the new program counter and need to be flushed (i.e., removed) from the pipeline to avoid executing instructions based on the wrong PC.The processor realizes that the control flow has changed and will discard the already-fetched instructions that are no longer relevant, ensuring that the correct instruction at the new target address (uf8_decode) is executed.
---
**5. Tests & Verification**
test_inputs (Expanded with Boundary Values and Random Large Values)
```
test_inputs:
.word 0, 15, 16, 47, 48, 108, 1000000, 9999999
```
Boundary Values: 15/16, 47/48 to test the transitions in the encoding process.
Random Large Values: 1000000, 9999999 to test the program's handling of large numbers.
After running the program, we verified the results using check_out:

All values in check_out should be 1, indicating that all tests passed within the error bounds. It means that the algorithm works correctly for all test cases, including boundary values and large inputs. The implementation passed all checks, and the error bound condition was met for all test inputs.
---
**6.Optimization & Results**
The baseline version used a while loop to find e, which required multiple comparisons for each value. In the optimized version, I used bit manipulation to calculate e directly, reducing the number of comparisons and improving the overall performance.
| Version | `uf8_encode` Inst. | `uf8_decode` Inst. | Total Cycles (8 items) | Notes |
| ------------- | -----------------: | -----------------: | ---------------------: | --------------------------------------- |
| **Baseline** | 24 | 24 | 994 cycles | While loop search for e, more cycles |
| **Optimized** | 14 | 12 | 732 cycles | Direct bit manipulation for e, faster |
The optimized version significantly reduced the number of cycles by eliminating the while loop and using bit manipulation, improving the performance of the encoding and decoding functions.
---
**7.Reproduce**
1.Clone the repository.
2.Open the uf8_baseline.s or uf8_opt.s file in Ripes.
3.Assemble and run the program.
4.Check the memory section .data to verify the results in enc_out, dec_out, and check_out.
github:
https://github.com/larry920623/hw1/tree/main
# **Assignment 1– Problem C: BFloat16 (bf16) Conversion**
code
```
##############################################################
# BF16 Arithmetic Emulator (RV32IMC, no FPU)
# 完整版:含 stack、巢狀呼叫安全、四則運算與 debug 輸出
##############################################################
.data
.align 2
f32_data: .word 0x41200000, 0x41A00000, 0x41F00000, 0x42200000 # 10, 20, 30, 40
out_add: .word 0,0,0,0
out_sub: .word 0,0,0,0
out_mul: .word 0,0,0,0
out_div: .word 0,0,0,0
msg_i: .asciz "[DEBUG] i="
msg_add: .asciz " ADD="
msg_sub: .asciz " SUB="
msg_mul: .asciz " MUL="
msg_div: .asciz " DIV="
msg_nl: .asciz "\n"
msg_done: .asciz "DONE\n"
.bss
.data
.align 4
stack_area:
.word 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 # 16×4 = 64 bytes
.word 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 # 共 32 個 word ≈ 128 bytes
stack_top:
.text
.global _start
_start:
# 初始化 stack
la sp, stack_top
la s0, f32_data
li s3, 4
li t6, 0
loop:
bge t6, s3, done
slli t0, t6, 2
add t1, s0, t0
lw a0, 0(t1)
# f32 → bf16
jal ra, f32_to_bf16
mv s1, a0 # bf16_A
addi t6, t6, 1
bge t6, s3, done
slli t0, t6, 2
add t1, s0, t0
lw a0, 0(t1)
jal ra, f32_to_bf16
mv s2, a0 # bf16_B
# === ADD ===
mv a0, s1
mv a1, s2
jal ra, bf16_add
la t1, out_add
slli t2, t6, 2
add t1, t1, t2
sw a0, -4(t1)
# === SUB ===
mv a0, s1
mv a1, s2
jal ra, bf16_sub
la t1, out_sub
slli t2, t6, 2
add t1, t1, t2
sw a0, -4(t1)
# === MUL ===
mv a0, s1
mv a1, s2
jal ra, bf16_mul
la t1, out_mul
slli t2, t6, 2
add t1, t1, t2
sw a0, -4(t1)
# === DIV ===
mv a0, s1
mv a1, s2
jal ra, bf16_div
la t1, out_div
slli t2, t6, 2
add t1, t1, t2
sw a0, -4(t1)
# === Debug Print ===
la a0, msg_i
li a7, 4
ecall
addi a0, t6, -1
li a7, 1
ecall
# print ADD/SUB/MUL/DIV
la a0, msg_add
li a7, 4
ecall
la t1, out_add
slli t2, t6, 2
add t1, t1, t2
lw a0, -4(t1)
li a7, 34
ecall
la a0, msg_sub
li a7, 4
ecall
la t1, out_sub
add t1, t1, t2
lw a0, -4(t1)
li a7, 34
ecall
la a0, msg_mul
li a7, 4
ecall
la t1, out_mul
add t1, t1, t2
lw a0, -4(t1)
li a7, 34
ecall
la a0, msg_div
li a7, 4
ecall
la t1, out_div
add t1, t1, t2
lw a0, -4(t1)
li a7, 34
ecall
la a0, msg_nl
li a7, 4
ecall
j loop
done:
la a0, msg_done
li a7, 4
ecall
li a7, 10
ecall
##############################################################
# f32_to_bf16: round-to-even
##############################################################
f32_to_bf16:
srli t0, a0, 16 # high16
li t1, 0x00008000 # rounding bit mask
and t2, a0, t1
beq t2, x0, no_round
addi t0, t0, 1
no_round:
li t3, 0xFFFF
and a0, t0, t3
ret
.align 2
##############################################################
# bf16_to_f32
##############################################################
bf16_to_f32:
slli a0, a0, 16
ret
.align 2
##############################################################
# bf16_to_int
##############################################################
bf16_to_int:
srli t0, a0, 7
andi t0, t0, 0xFF
beq t0, x0, bf16_int_zero
andi t1, a0, 0x7F
ori t1, t1, 0x80
li t2, 134
sub t3, t0, t2
bge t3, x0, bf16_int_shl
neg t4, t3
srl a0, t1, t4
ret
bf16_int_shl:
sll a0, t1, t3
ret
bf16_int_zero:
mv a0, x0
ret
.align 2
##############################################################
# int_to_bf16
##############################################################
int_to_bf16:
beq a0, x0, int_bf16_zero
mv t1, a0
li t0, 0
1:
srli t1, t1, 1
beq t1, x0, 2f
addi t0, t0, 1
j 1b
2:
li t2, 127
add t2, t2, t0
li t5, 7
sub t3, t0, t5
blt t3, x0, m8_left
srl t4, a0, t3
andi t4, t4, 0xFF
j m8_done
m8_left:
neg t5, t3
sll t4, a0, t5
andi t4, t4, 0xFF
m8_done:
andi t4, t4, 0x7F
slli t2, t2, 7
or a0, t2, t4
ret
int_bf16_zero:
mv a0, x0
ret
.align 2
##############################################################
# bf16_add / sub / mul / div (with stack frames)
##############################################################
bf16_add:
addi sp, sp, -8
sw ra, 4(sp)
sw t0, 0(sp)
jal ra, bf16_to_int
mv t0, a0
mv a0, a1
jal ra, bf16_to_int
add a0, t0, a0
jal ra, int_to_bf16
lw ra, 4(sp)
lw t0, 0(sp)
addi sp, sp, 8
ret
.align 2
bf16_sub:
addi sp, sp, -8
sw ra, 4(sp)
sw t0, 0(sp)
jal ra, bf16_to_int
mv t0, a0
mv a0, a1
jal ra, bf16_to_int
sub a0, t0, a0
jal ra, int_to_bf16
lw ra, 4(sp)
lw t0, 0(sp)
addi sp, sp, 8
ret
.align 2
bf16_mul:
addi sp, sp, -8
sw ra, 4(sp)
sw t0, 0(sp)
jal ra, bf16_to_int
mv t0, a0
mv a0, a1
jal ra, bf16_to_int
mul a0, t0, a0
jal ra, int_to_bf16
lw ra, 4(sp)
lw t0, 0(sp)
addi sp, sp, 8
ret
.align 2
bf16_div:
addi sp, sp, -8
sw ra, 4(sp)
sw t0, 0(sp)
jal ra, bf16_to_int
mv t0, a0
mv a0, a1
jal ra, bf16_to_int
beq a0, x0, div_zero
div a0, t0, a0
jal ra, int_to_bf16
j end_div
div_zero:
mv a0, x0
end_div:
lw ra, 4(sp)
lw t0, 0(sp)
addi sp, sp, 8
ret
```
---
**Purpose**
This project implements BFloat16 (bf16) conversion and basic arithmetic (add/sub/mul/div) entirely using integer operations.
It is designed for the Ripes RV32I simulator and does not use any floating-point instructions.
The goal is to understand how 32-bit IEEE 754 floating-point numbers can be approximated in 16-bit BFloat16 format and how simple arithmetic can be simulated through integer manipulation.
**Features**
| Function | Description |
| ---------------------- | ----------------------------------------------------------------------------------------------- |
| `f32_to_bf16` | Converts 32-bit float bit pattern to 16-bit bf16, using **round-to-nearest-even**. |
| `bf16_to_f32` | Expands 16-bit bf16 bits back into a 32-bit float bit pattern. |
| `bf16_to_int` | Approximates a bf16 value as an integer by extracting exponent/mantissa. |
| `int_to_bf16` | Encodes an integer back into bf16 format by locating the MSB and packing exponent/mantissa. |
| `bf16_add/sub/mul/div` | Performs arithmetic using **integer conversion**, safe for nested calls (stack protected). |
| `main` | Loads test data, performs all four arithmetic operations, and prints results with debug labels. |
---
**Algorithm**
1.f32_to_bf16:
Extract high 16 bits of f32 input.
If low 16 bits ≥ 0x8000, round up.
Output only the 16-bit bf16 value.
```
bf16 = (f32 >> 16)
if (f32 & 0x00008000) != 0 → bf16 += 1
```
2.bf16_to_f32:
Reconstructs an f32 by left-shifting bf16 bits 16 positions.
```
f32_bits = bf16 << 16
```
3.bf16_to_int:
Extract exponent and mantissa fields.
Remove bias (127).
Rebuilds approximate integer via shifting.
```
e_raw = (bits >> 7) & 0xFF
mant8 = ((bits & 0x7F) | 0x80)
shift = e_raw - 134
int = mant8 << shift (if shift >= 0)
int = mant8 >> -shift (if shift < 0)
```
4.int_to_bf16:
Finds the most significant bit (MSB) position of the integer.
Encodes exponent = 127 + MSB index.
Scales mantissa to 7 bits and packs bf16 bits.
```
n = floor(log2(int))
e_raw = 127 + n
mant7 = bits from 1.xxx pattern
bf16 = (e_raw << 7) | (mant7 & 0x7F)
```
5.arithmetic functions (add/sub/mul/div)
All four operations share the same structure:
1.Convert both bf16 operands → integer.
2.Perform integer arithmetic.
3.Convert result → bf16.
Example (bf16_add)
```
jal ra, bf16_to_int # a0 = int(A)
mv t0, a0
mv a0, a1
jal ra, bf16_to_int # a0 = int(B)
add a0, t0, a0
jal ra, int_to_bf16 # a0 = result_bf16
```
Each arithmetic routine uses stack frames:
```
addi sp, sp, -8
sw ra, 4(sp)
sw t0, 0(sp)
...
lw ra, 4(sp)
lw t0, 0(sp)
addi sp, sp, 8
ret
```
---
**Test Data**
```
.data
f32_data: .word 0x41200000, 0x41A00000, 0x41F00000, 0x42200000 # 10, 20, 30, 40
out_add: .word 0,0,0,0
out_sub: .word 0,0,0,0
out_mul: .word 0,0,0,0
out_div: .word 0,0,0,0
```
The program automatically converts each adjacent pair (A[i], A[i+1]) into bf16 and performs:ADD, SUB, MUL, DIV
| Operation | Expression | Expected bf16 | Approx. Decimal |
| --------- | ------------------------- | ---------------------- | ----------------- |
| **ADD** | (10+20), (20+30), (30+40) | 0x4317, 0x4321, 0x432C | ≈ 30.5, 50, 70 |
| **SUB** | (10−20), (20−30), (30−40) | 0x42DE, 0x42CA, 0x42B8 | ≈ −10, −10, −10 |
| **MUL** | (10×20), (20×30), (30×40) | 0x4523, 0x4575, 0x45A5 | ≈ 200, 600, 1200 |
| **DIV** | (10/20), (20/30), (30/40) | 0x40C0, 0x4080, 0x4040 | ≈ 0.75, 0.66, 0.5 |
**Run in Ripes**


| Address Range | Description |
| ------------------------- | --------------------- |
| `0x10000000 ~ 0x1000000C` | Input data (f32_data) |
| `0x10000010 ~ 0x1000001C` | out_add results |
| `0x10000020 ~ 0x1000002C` | out_sub results |
| `0x10000030 ~ 0x1000003C` | out_mul results |
| `0x10000040 ~ 0x1000004C` | out_div results |
# **LeetCode 1342 — Number of Steps to Reduce a Number to Zero**
Example:
Input: num = 14
Output: 6
Explanation:
Step 1) 14 is even; divide by 2 and obtain 7.
Step 2) 7 is odd; subtract 1 and obtain 6.
Step 3) 6 is even; divide by 2 and obtain 3.
Step 4) 3 is odd; subtract 1 and obtain 2.
Step 5) 2 is even; divide by 2 and obtain 1.
Step 6) 1 is odd; subtract 1 and obtain 0.
C
```
#include <stdio.h>
int numberOfSteps(int num) {
int steps = 0;
while (num > 0) {
if (num % 2 == 0)
num /= 2;
else
num -= 1;
steps++;
}
return steps;
}
int main() {
int test_inputs[] = {14, 8, 123};
int n = 3;
int outputs[3];
for (int i = 0; i < n; i++) {
outputs[i] = numberOfSteps(test_inputs[i]);
printf("num=%d, steps=%d\n", test_inputs[i], outputs[i]);
}
return 0;
}
```
Assembly
```
.text
j main
# ---------------------------------------------------
# numberOfSteps using integer operations
# optimized: steps = bit_length(num)-1 + popcount(num)
# ---------------------------------------------------
numberOfSteps:
addi t0, x0, 0 # steps
add t1, a0, x0 # copy of num
beq t1, x0, Ldone_steps
# count number of 1 bits (popcount)
addi t2, x0, 0 # ones_count
Lpop_loop:
beq t1, x0, Lpop_done
andi t3, t1, 1
add t2, t2, t3
srli t1, t1, 1
j Lpop_loop
Lpop_done:
add t0, t0, t2 # add ones_count
# compute bit_length-1
add t1, a0, x0 # reset t1 = num
li t3, 0 # bitlen
Lbitlen_loop:
beq t1, x0, Lbitlen_done
srli t1, t1, 1
addi t3, t3, 1
j Lbitlen_loop
Lbitlen_done:
addi t3, t3, -1
add t0, t0, t3
add a0, t0, x0
ret
Ldone_steps:
li a0, 0
ret
# ---------------------------------------------------
# main
# ---------------------------------------------------
main:
la s0, test_inputs
la s1, outputs
li s2, 3 # n=3
addi s3, x0, 0
Lloop:
beq s3, s2, Ldone
slli t0, s3, 2
add t1, s0, t0
lw a0, 0(t1)
jal ra, numberOfSteps
la t2, outputs
add t2, t2, t0
sw a0, 0(t2)
addi s3, s3, 1
j Lloop
Ldone:
j Ldone
# ---------------------------------------------------
# Data
# ---------------------------------------------------
.data
test_inputs:
.word 14,8,123
outputs:
.word 0,0,0
```
**Optimization Explanation**
using bitwise operations inspired by the Problem Bstrategy:
1.Observation:
Every 1 in the binary representation requires a subtract 1 operation.Every bit shift (divide by 2) corresponds to moving through the binary digits until the number becomes zero.
2.Formula:
steps = (number of 1 bits in num) + (bit_length(num) - 1)
number of 1 bits → counts how many subtract-1 operations are needed.
bit_length(num) - 1 → counts how many divide-by-2 operations are needed.
3.Assembly Implementation:
Compute popcount of num using bitwise AND and shift loop.
Compute bit length by shifting num right until it becomes zero.
Sum both to get total steps.
4.Advantages:
Avoids repeated conditional checks in the loop.
Each iteration of the loops is predictable, making assembly translation simpler.
Efficient in RV32I, since only integer operations and bit shifts are used.