# Assignment 1 - RISC-V
## SUMMARY
- **Opening**
- **UTF8 Encoding/Decoding**
- **Study of the Processor**
- **BF16 All Operations (add/sub/mul/div/sqrt)**
---
# Opening
Before reading this work, I want to clarify a few things:
- First, I'm not entirely certain what the professor expects, which is why this report explains each part of the code in detail, even if it becomes borring. (I will adjust the approach for the next homework if needed)
- The more I worked on this assignment, the more I wanted to refactor everything. Unfortunately, there are constraints:
- **Better naming conventions than "skip" and "loop" (currently quite unreadable)**
- **More comprehensive explanations throughout**
- **This document reads more like a development journal than a formal report , I may make statements and revise them shortly after. I apologize for any inconsistencies.**
# UTF8 Encoding/Decoding
## Overview & Summary
Implementation of an encoder/decoder for the UTF8 format in RISC-V assembly.
- **Part 1: Pure C to RISC-V Translation** (encode & decode functions)
- **Part 2: Validation & Testing** (test suite and error handling)
- **Part 3: Performance Analysis** (instruction count and optimization)
---
# Part 1: Pure C to RISC-V Translation
**Main registers convention:**
- `x1`: input value / addresses
- `x2`: return result
- `x3-x19`: working registers
- `x20`: saved `ra` (return address)
- `x21`: stack pointer (`sp`)
---
## encode_uf8 Function
This section shows the step-by-step translation from C to RISC-V assembly.
### Save and Initialize
```riscv
encode_uf8:
addi x21, x21, -4
sw x20, 0(x21)
jal x20, init_registers_encode
```
- Saves `ra` on stack for function call preservation
```riscv
init_registers_encode:
li x2, 0 # return value
li x3, 16 # comparison threshold
li x4, 32 # CLZ counter
li x5, 16 # CLZ step
jr x20
```
- Initializes working registers with constants
```riscv
blt x1, x3, skip_encode
```
If value < 16, no encoding is needed; the value is returned directly.
### CLZ (Count Leading Zeros)
```riscv
add x6, x1, x0
loop:
beq x5, x0, end_loop
srl x7, x6, x5
beq x7, x0, skip2
sub x4, x4, x5
addi x6, x7, 0
skip2:
li x10, 1
srl x5, x5, x10
j loop
```
**Binary search algorithm** to count leading zeros:
- `x4`: counter
- `x5`: shift step
- `x6`: working value
- Shifts and tests to find the first 1 bit
### MSB Calculation (Most Significant Bit)
```riscv
sub x3, x4, x6
li x4, 31
sub x4, x4, x3
```
- `x3` = lz (leading zeros)
- `x4` = MSB position (31 - LZ)
### Exponent Calculation
```riscv
li x5, 0
blt x4, x11, skip3
sub x5, x4, x12
blt x5, x13, skip4
li x5, 15
```
- If MSB < 5: exponent = 0
- else: exponent = MSB - 4
- maximum value: 15
### Overflow Construction
```riscv
loop3:
bge x7, x5, loop3_end
sll x6, x6, x10
addi x6, x6, 16
addi x7, x7, 1
j loop3
```
overflow = (16 << exp) - 16
### Overflow Adjustment
```riscv
loop4:
blt x5, x10, loop4_end
blt x6, x1, loop4_end
beq x6, x1, loop4_end
addi x6, x6, -16
srl x6, x6, x10
addi x5, x5, -1
j loop4
```
Reduces exponent if overflow is too large compared to the input value.
### Overflow Expansion
```riscv
loop5:
bge x5, x15, loop5_end
sll x7, x6, x10
addi x7, x7, 16
bge x7, x1, loop5_end
addi x6, x7, 0
addi x5, x5, 1
j loop5
```
Increases exponent if possible to maximize precision.
### Mantissa Calculation
```riscv
sub x7, x1, x6
srl x7, x7, x5
sll x2, x5, x14
or x2, x2, x7
```
- Mantissa = (value - overflow) >> exponent
- result = (exponent << 4) | mantissa
### Return
```riscv
skip_encode:
add x2, x0, x1 # if value < 16, just return value
encode_return:
lw x20, 0(x21)
addi x21, x21, 4
jr x20
```
Uses the stack to handle multiple values (for testing purposes).
---
## decode_uf8 Function
### Register Initialization (decode)
```riscv
init_decode:
li x2, 0 # return value
li x3, 0 # mantissa
li x4, 0 # exponent
li x6, 0x0F # mantissa mask
li x7, 4 # shift amount
li x8, 15 # max exponent
li x9, 0x7FFF # offset mask
jr x20
```
### Field Extraction
```riscv
and x3, x1, x6 # mantissa = value & 0x0F
srl x4, x1, x7 # exponent = value >> 4
```
Separates 4 bits of mantissa and 4 bits of exponent.
### Offset Calculation
```riscv
sub x5, x8, x4 # 15 - exponent
srl x5, x9, x5 # 0x7FFF >> (15 - exponent)
sll x5, x5, x7 # << 4
```
Calculates offset = `((0x7FFF >> (15 - exp)) << 4)`.
### Final Decoding
```riscv
sll x2, x3, x4 # mantissa << exponent
add x2, x2, x5 # + offset
```
Result = `(mantissa << exponent) + offset`.
---
# Part 2: Validation & Testing
## Test Data Structure
### Test Values
```riscv
initial_value1: .word 976
initial_value2: .word 15
initial_value3: .word 0
```
**Test values**: 976, 15, and 0 to verify functionality.
These values test three different code paths:
- **0 and 15**: pass through without calculation (value < 16, direct return via `skip_encode`)
- **976**: goes through the full encode/decode process and should return exactly the same value
**Note**: Different numbers can be used, but keep in mind there will be an error rate for certain values due to the lossy nature of UF8 encoding.
### Result Management
```riscv
test1_pass:
la a0, msg_pass
li a7, 4
ecall
la x10, test_results
li x11, 1
sw x11, 0(x10)
```
Prints "PASS" and stores 1 in `test_results[0]`.
If test fails:
```riscv
la a0, msg_fail
li a7, 4
ecall
la x10, test_results
sw x0, 0(x10)
```
Prints "FAIL" and stores 0 in `test_results[0]`.
### Final Summary
```riscv
final_summary:
la x10, test_results
lw x11, 0(x10)
lw x12, 4(x10)
lw x13, 8(x10)
and x14, x11, x12
and x14, x14, x13
beqz x14, failed
```
- Loads all 3 test results
- Performs logical AND: all must be 1 to pass
- If x14 = 0, at least one test failed
```riscv
la a0, msg_all_pass
li a7, 4
ecall
```
If all tests passed, prints "All tests PASSED!".
```riscv
failed:
la a0, msg_final_fail
li a7, 4
ecall
```
If any test failed, prints "Some tests FAILED!".
---
# Part 3: Performance Analysis
## Instruction Count Analysis
To measure performance, **976** was chosen as a representative value that exercises the full encoding path (976 > 16).
### Methodology:
1. Count instructions in the RISC-V implementation using Ripes step-by-step execution
2. Estimate instructions in the C implementation
3. Identify optimization opportunities
### encode_uf8 function (value = 976):
**Total: 102 instructions**
- **Initialization**: 10 instructions
- **CLZ**: 35 instructions (binary search loop, 5 iterations)
- **MSB calculation**: 3 instructions
- **Exponent and overflow setup**: 11 instructions
- **Overflow construction loops**: 26 instructions (loop3 + loop4 + loop5)
- **Mantissa calculation and combine**: 6 instructions
- **Return**: 3 instructions
- **Other**: 8 instructions (branches, misc, etc.)
The analysis reveals that CLZ is expensive, as are the various loops.
### decode_uf8 function (value = 976 encoded):
**Total: 23 instructions**
- **Initialization**: 15 instructions
- **Decode formula**: 8 instructions
Initialization is the most costly component.
## C vs RISC-V Comparison
| Operation | RISC-V | C language |
|-----------|--------|---------------|
| Encode | 102 | 77-90 |
| Decode | 23 | 12 |
Note: These are estimated values for **ONE NUMBER ONLY**.
## Optimization Opportunities
### 1. CLZ Optimization
**Current**: Simple translation from the C program.
**Potential improvements**:
- Use built-in RISC-V modules (not allowed in this assignment)
- Use a lookup table stored in memory
### 2. Loop Fusion
**Current**: Three separate loops (loop3, loop4, loop5)
**Attempted optimization**: Merge into a single loop with combined conditions
Multiple fusion attempts were made, but no significant improvement was achieved, so the original version was retained.
One attempt to merge loop4 and loop5 saved only 3 instructions:
```riscv
merge_loop:
# Loop4
blt x6, x1, skip00 # if overflow < value, try expand
beq x6, x1, skip3 # if overflow == value, done
# Overflow too big, decrease
addi x6, x6, -16 # overflow -= 16
srl x6, x6, x10 # overflow >>= 1
addi x5, x5, -1 # exponent--
j merge_loop
skip00:
# Loop5
bge x5, x15, skip3 # if exponent >= 15, done
sll x7, x6, x10 # next_overflow = overflow << 1
addi x7, x7, 16 # next_overflow += 16
bge x7, x1, skip3 # if next_overflow >= value, done
addi x6, x7, 0 # overflow = next_overflow
addi x5, x5, 1 # exponent++
j merge_loop # continue adjusting
```
## Optimized Performance Estimate
| Optimization | Encode (before) | Encode (after) | Improvement |
|--------------|-----------------|----------------|-------------|
| Loop merging | 102 | 98 | Minimal |
No improvements were found for decode; initialization remains the most expensive component and cannot be easily optimized.
## Conclusion
After these unsuccessful optimization attempts, research was conducted on C compilation and efficiency. The key finding is that **cycle count is more important than instruction count for efficiency**. The next section compares C assembly with RISC-V assembly to better understand this principle.
---
# Processor Study with Ripes
Using the Ripes simulator with RISC-V 32-bit 5-stage processor, the flow of instructions through the pipeline and control signals at the hardware level are analyzed (based on the understanding of question 3 in homework 1).
### Pipeline Configuration:
- **Processor**: RISC-V 32-bit Single-cycle / 5-stage pipeline
- **ISA**: RV32I
- **Stages**: IF → ID → EX → MEM → WB
### Pipeline Stages Overview:
1. **IF** (Instruction Fetch): Fetch instruction from instruction memory using PC
2. **ID** (Instruction Decode): Decode instruction, read source registers from register file
3. **EX** (Execute): Perform ALU operation or calculate memory address
4. **MEM** (Memory Access): Access data memory for load/store instructions
5. **WB** (Write Back): Write result back to destination register
---
## Selected Instructions for Analysis
Two representative instructions covering big part of instruction are analyzed: `lw`, `sw`.
---
## Instruction 1: Load Word - `lw x1, 0(x2)`
**Context**: Loading test value 976 from memory at the beginning of main program
**Purpose**: Read the test value from data memory into register x1 for encoding
### Initial State


- Memory address `0x10000000`: contains **976** (0x000003D0)
- Register x1: contains **0x00000000**
- Register x2: contains address **0x10000000**
---
### Stage 1: Instruction Fetch (IF)

**What happens**:
- PC outputs current instruction address:

- Instruction Memory fetches the encoded `lw` instruction:

- 0x00012083 = 0000000000000001001000001**0000011**

---
### Stage 2: Instruction Decode (ID)

**What happens**:
Control Unit decodes instruction 
- `0x02` source address (2)
- `0x00` add offset (nothing to add)
- `LW` instruction
- `0x01` destination register (1)
---
### Stage 3: Execute (EX)

**What happens**:
- ALU performs address calculation: base_address + offset

Calculation: 0x10000000 + 0 = 0x10000000
Result is the memory address to access.
---
### Stage 4: Memory Access (MEM)

**What happens**:
- Data Memory receives address from ALU (0x10000000)

- Signal Write red → Data memory reading only
- Data at address 0x10000000 is read: **976** (0x000003D0)

**Control Signals Active**:
- **WR en = 0** (inactive), indicating read operation
---
### Stage 5: Write Back (WB)

**What happens**:

- Wr data: 976 (0x000003d0)
- Wr idx: 0x01 (register x1)
- Wr En: 1 (enable) will write to the register

### Result
- **Register x1 now contains the test value for use throughout the program**

---
## Instruction 2: Store Word - `sw x20, 0(x21)`
Now that the process is understood, this analysis will proceed more quickly.
**Context**: Storing encoded result back to memory after encoding
**Purpose**: Store the return address (RA) in the stack (this technique is used when there are nested function calls)
### Initial State

Expected state:
- Register x20: contains the return address
- Register x21: contains the current stack address
---
### Stage 1: IF

---

- PC: line number 0x00000200 
- Instructions: 0x014aa023=0000000101001010101000000**0100011**


---
### Stage 2: ID


**What happens**:
- `0x15` Reg1 destination address (x21)
- `0x14` Reg2 source address (x20)
- `SW` instruction
- `0x00` offset (0)

- Reading register contents:
- Reg 1: 0x00010000
- Reg 2: 0x00000028

---
### Stage 3: EX

**What happens**:

- ALU calculates memory address: x21 + 0
---
### Stage 4: MEM

**What happens**:

- Addr: x21 (location to write)
- Data in: x20 (data to write)
**Control Signals Active**:
- **Wr en = 1 (green) = x20 → x21**

---
### Stage 5: WB

**What happens**:
- Data is written to the correct location!

- No register will be updated
- Wr en = 0 (inactive)

---
## Signal Summary Table
To avoid repetition, here is a summary table of all instruction types:
| Instruction Type | Reg Wr | Mem Wr | Mem to Reg | ALU | Branch |
|-----------------|----------|----------|----------|--------|--------|
| `lw` (Load) | 1 | 0 | 1 | 1 | 0 |
| `sw` (Store) | 0 | 1 | X | 1 | 0 |
| `sub` (R-type) | 1 | 0 | 0 | 1 | 0 |
| `beq` (B-type) | 0 | 0 | X | 0 | 1 |
| `jal` (Jump) | 1 | 0 | 1 | X | 0 |
**Legend**: 1 = active, 0 = inactive, X = don't care
---
## Processor Hazard Handling
### Data Hazard Resolution
This section was not originally planned, but an important observation was made during the SW instruction presentation. This differs from the processor configuration used at my home institution.
The code was initially modified to present SW in an ideal situation:
```riscv
encode_uf8:
addi x21, x21, 0
sw x20, 0(x21)
```
But the actual code is:
```riscv
encode_uf8:
addi x21, x21, -4
sw x20, 0(x21)
```
This modifies the stack pointer to avoid overwriting data (standard stack behavior).
**Problem**: The instruction needs the updated value of x21 before it has been written back.
**Solution in Ripes**:

- The `addi x21 x21 -4` instruction is in the MEM stage, so the new value of x21 is already computed
- The ALU accepts the forwarding value from the next pipeline stage
- The old value from the ID/EX pipeline is bypassed
---
### Branch Hazard Resolution
Further investigation revealed another type of hazard handling in the processor.
**Code sequence**:
```riscv
beq x5, x0, end_loop
srl x7, x6, x5
```
**Problem**: Branch decision is not made until the EX stage, but the processor has already fetched the next instruction.
**Solution**:
- The next instruction is fetched to prepare for both cases:

- If the branch is taken, the entire pipeline is flushed:

---
# BF16 Operations
## SUMMARY
- **BF16 to F32 Conversion**
- **F32 to BF16 Conversion**
- **BF16 Addition**
- **BF16 Subtraction**
- **BF16 Multiplication**
- **BF16 Division**
- **BF16 Square Root**
---
# Part 1: Pure C to RISC-V Translation
## BF16 Format Overview
- **16 bits total: 1 sign bit + 8 exponent bits + 7 mantissa bits**
- Special values: Inf (exp=0xFF, mant=0), NaN (exp=0xFF, mant≠0), Zero (exp=0, mant=0) (Unlike UF8, these special values exist in BF16)
**Register Convention:**
- `x1`: result value
- `x2`: return address (ra)
- `x3`: stack pointer (sp)
- `x4-x9`: extracted information (sign, exp, mantissa)
- `x10-x15`: temporaries
- `x16-x22`: working registers
*The register convention has been updated from the UF8 implementation for improved clarity.*
---
## 1. BF16 to F32 Conversion
### Code
```riscv
main:
la x1, bfloat16
lh x1, 0(x1)
slli x4, x1, 16
la x7, float32
sw x4, 0(x7)
j end
```
**Explanation:** FP32 = BF16 << 16.
---
## 2. F32 to BF16 Conversion
### Handle Special Cases
```riscv
f32_to_bf16:
la x1, bfloat32
lw x1, 0(x1)
srli x4, x1, 23
andi x4, x4, 0x7FF
li x5, 0xFF
bne x4, x5, normal_case
srli x6, x1, 16
andi x6, x6, 0xFFFF
j return_bf16
```
### Code
```riscv
normal_case:
srli x6, x1, 16
andi x6, x6, 1
addi x6, x6, 0x7FFF
add x6, x1, x6
srli x6, x6, 16
```
---
## 3. BF16 Addition
### Initialize
```riscv
init:
srli x4, x14, 15 # sign_a
srli x5, x15, 15 # sign_b
srli x6, x14, 7 # exp_a
andi x6, x6, 0xFF
srli x7, x15, 7 # exp_b
andi x7, x7, 0xFF
andi x8, x14, 0x7F # mant_a
andi x9, x15, 0x7F # mant_b
jr x2
```
*This initialization pattern is reused for most operations.*
### Handle Special Cases
```riscv
add:
jal x2, init
li x16, 0xFF
bne x6, x16, skip1
bne x8, x0, return_a # a is NaN
bne x7, x16, skip2
bne x9, x0, return_b # b is NaN
beq x4, x5, return_b # Inf + Inf (same sign)
li x1, 0xFFC0 # Inf + (-Inf) = NaN
j return
skip2:
return_a:
add x1, x14, x0
j return
skip1:
li x16, 0xFF
beq x7, x16, return_b # b is Inf/NaN
or x16, x6, x8
beq x16, x0, return_b # a is zero
or x16, x7, x9
beq x16, x0, return_a # b is zero
```
### Add Implicit Bit
```riscv
beq x6, x0, skip3
ori x8, x8, 0x80 # add implicit 1
skip3:
beq x7, x0, skip4
ori x9, x9, 0x80 # add implicit 1
skip4:
```
### Align Exponents
```riscv
sub x10, x6, x7 # exp_diff
blez x10, skip5
add x12, x6, x0 # result_exp = exp_a
li x16, 8
bge x10, x16, return_a # diff too large
srl x9, x9, x10 # align mant_b
j skip7
skip5:
bgez x10, skip6
add x12, x7, x0 # result_exp = exp_b
li x16, -8
blt x10, x16, return_b # diff too large
sub x16, x0, x10
srl x8, x8, x16 # align mant_a
j skip7
skip6:
add x12, x6, x0 # exponents equal
skip7:
```
### Add/Sub Mantissas
```riscv
bne x4, x5, skip8 # different signs
add x11, x4, x0 # result_sign
add x13, x8, x9 # add mantissas
andi x16, x13, 0x100
beq x16, x0, skip9
srli x13, x13, 1 # normalize overflow
addi x12, x12, 1
li x16, 0xFF
blt x12, x16, skip9
slli x1, x11, 15 # overflow to Inf
li x16, 0x7F80
or x1, x1, x16
j return
skip9:
j combine_result
skip8:
blt x8, x9, skip10
add x11, x4, x0
sub x13, x8, x9
j skip11
skip10:
add x11, x5, x0
sub x13, x9, x8
skip11:
beq x13, x0, return_zero
skip12:
andi x16, x13, 0x80
bne x16, x0, skip13
slli x13, x13, 1 # normalize
addi x12, x12, -1
blez x12, return_zero
j skip12
skip13:
```
### Combine Result
```riscv
combine_result:
slli x1, x11, 15 # sign << 15
andi x16, x12, 0xFF
slli x16, x16, 7 # exp << 7
or x1, x1, x16
andi x16, x13, 0x7F # mantissa
or x1, x1, x16
j return
```
---
## 4. BF16 Subtraction
### Code
```riscv
sub:
li x16, 0x8000
xor x15, x15, x16 # flip sign of b
j add # reuse addition
```
**Context:** Subtraction is in the same program as addition (selected during test execution).
**Explanation:** Subtraction is implemented as addition with the sign of the second operand flipped.
---
## 5. BF16 Multiplication
### Initialize
```riscv
init:
srli x4, x10, 15 # sign_a
srli x5, x11, 15 # sign_b
srli x6, x10, 7
andi x6, x6, 0xFF # exp_a
srli x7, x11, 7
andi x7, x7, 0xFF # exp_b
andi x8, x10, 0x7F # mant_a
andi x9, x11, 0x7F # mant_b
xor x12, x4, x5 # result_sign
jr x2
```
### Handle Special Cases
```riscv
mul:
jal x2, init
li x16, 0xFF
bne x6, x16, skip1
bnez x8, return_a # a is NaN
or x17, x7, x9
beqz x17, return_NAN # Inf * 0 = NaN
slli x1, x12, 15
li x16, 0x7F80
or x1, x1, x16 # return Inf
j return
skip1:
li x16, 0xFF
bne x7, x16, skip2
bnez x9, return_b # b is NaN
or x17, x6, x8
beqz x17, return_NAN # 0 * Inf = NaN
slli x1, x12, 15
li x16, 0x7F80
or x1, x1, x16 # return Inf
j return
skip2:
or x17, x6, x8
or x16, x7, x9
bnez x17, skip3
slli x1, x12, 15 # return signed zero
j return
skip3:
bnez x16, skip4
slli x1, x12, 15
j return
skip4:
```
### Normalize
```riscv
li x13, 0 # exponent adjustment
bnez x6, skip5
normalize_a:
andi x16, x8, 0x80
bnez x16, skip6
slli x8, x8, 1
addi x13, x13, -1
j normalize_a
skip6:
li x6, 1
j skip7
skip5:
ori x8, x8, 0x80 # add implicit bit
skip7:
bnez x7, skip8
normalize_b:
andi x16, x9, 0x80
bnez x16, skip9
slli x9, x9, 1
addi x13, x13, -1
j normalize_b
skip9:
li x7, 1
j skip10
skip8:
ori x9, x9, 0x80
skip10:
```
### Multiply Mantissas
```riscv
li x14, 0
li x17, 0
mul_loop: # result_mant = mant_a * mant_b
li x18, 8
beq x17, x18, mul_end
andi x18, x9, 1
beqz x18, mul_skip
add x14, x14, x8
mul_skip:
slli x8, x8, 1
srli x9, x9, 1
addi x17, x17, 1
j mul_loop
mul_end:
add x15, x6, x7
li x16, 127
sub x15, x15, x16 # result_exp = exp_a + exp_b - 127
add x15, x15, x13
```
### Normalize Result
```riscv
li x16, 0x8000
and x17, x14, x16
beqz x17, skip11
srli x14, x14, 8
andi x14, x14, 0x7F
addi x15, x15, 1 # bit 15 set, shift right
j skip12
skip11:
srli x14, x14, 7 # bit 15 not set
andi x14, x14, 0x7F
skip12:
```
### Handle Overflow/Underflow
```riscv
li x16, 0xFF
blt x15, x16, skip13
slli x1, x12, 15
li x16, 0x7F80
or x1, x1, x16 # overflow to Inf
j return
skip13:
bgtz x15, skip14
li x17, -6
blt x15, x17, skip15
li x16, 1
sub x16, x16, x15
srl x14, x14, x16 # denormalize
li x15, 0
j skip14
skip15:
slli x1, x12, 15 # underflow to zero
j return
skip14:
```
### Combine Result
```riscv
slli x1, x12, 15
andi x16, x15, 0xFF
slli x16, x16, 7
or x1, x1, x16
andi x16, x14, 0x7F
or x1, x1, x16
j return
```
---
## 6. BF16 Division
### Initialize
```riscv
init:
srli x4, x10, 15
srli x5, x11, 15
srli x6, x10, 7
andi x6, x6, 0xFF
srli x7, x11, 7
andi x7, x7, 0xFF
andi x8, x10, 0x7F
andi x9, x11, 0x7F
xor x12, x4, x5 # result_sign
jr x2
```
### Handle Special Cases
```riscv
div:
jal x2, init
li x16, 0xFF
bne x7, x16, skip1
bnez x9, return_b # b is NaN
li x16, 0xFF
bne x6, x16, skip1_1
bnez x8, skip1_1
li x1, 0x7FC0 # Inf / Inf = NaN
j return
skip1_1:
slli x1, x12, 15 # x / Inf = 0
j return
skip1:
or x17, x7, x9
bnez x17, skip2
or x17, x6, x8
bnez x17, skip2_1
li x1, 0x7FC0 # 0 / 0 = NaN
j return
skip2_1:
slli x1, x12, 15
li x16, 0x7F80
or x1, x1, x16 # x / 0 = Inf
j return
skip2:
li x16, 0xFF
bne x6, x16, skip3
bnez x8, return_a # a is NaN
slli x1, x12, 15
li x16, 0x7F80
or x1, x1, x16 # Inf / x = Inf
j return
skip3:
or x17, x6, x8
bnez x17, skip4
slli x1, x12, 15 # 0 / x = 0
j return
skip4:
```
### Add Implicit Bit
```riscv
bnez x6, skip5
ori x8, x8, 0x80
j skip6
skip5:
ori x8, x8, 0x80
skip6:
bnez x7, skip7
ori x9, x9, 0x80
j skip8
skip7:
ori x9, x9, 0x80
skip8:
```
### Integer Division
```riscv
slli x13, x8, 15 # dividend = mant_a << 15
mv x14, x9 # divisor = mant_b
li x15, 0 # quotient = 0
li x18, 0 # loop counter
loop:
li x19, 16
beq x18, x19, end_loop
slli x15, x15, 1
li x20, 15
sub x20, x20, x18
sll x21, x14, x20
bltu x13, x21, skip9
sub x13, x13, x21
ori x15, x15, 1
skip9:
addi x18, x18, 1
j loop
end_loop:
```
### Calculate Exponent
```riscv
sub x22, x6, x7
addi x22, x22, 127 # result_exp = exp_a - exp_b + 127
bnez x6, skip10
addi x22, x22, -1 # adjust for denormal a
skip10:
bnez x7, skip11
addi x22, x22, 1 # adjust for denormal b
skip11:
```
### Normalize Result
```riscv
li x16, 0x8000
and x17, x15, x16
beqz x17, normalize_loop_start
srli x15, x15, 8 # already normalized
j skip12
normalize_loop_start:
loop_2:
li x16, 0x8000
and x17, x15, x16
bnez x17, end_loop_2
li x19, 1
ble x22, x19, end_loop_2
slli x15, x15, 1
addi x22, x22, -1
j loop_2
end_loop_2:
srli x15, x15, 8
skip12:
andi x15, x15, 0x7F
```
### Handle Overflow/Underflow
```riscv
li x16, 0xFF
blt x22, x16, skip13
slli x1, x12, 15
li x16, 0x7F80
or x1, x1, x16 # overflow to Inf
j return
skip13:
bgtz x22, skip14
slli x1, x12, 15 # underflow to zero
j return
skip14:
```
### Combine Result
```riscv
slli x1, x12, 15
andi x22, x22, 0xFF
slli x22, x22, 7
or x1, x1, x22
andi x15, x15, 0x7F
or x1, x1, x15
j return
```
---
## 7. BF16 Square Root
**Disclaimer:**
AI assistance was used for this implementation due to difficulties encountered. However, even with AI support, challenges remained because both the AI and I struggled with certain RISC-V assembly aspects.
A working version using the RV32M extension (multiplication/division instructions) is presented here, as it may be useful for future assignments.
**Note: If you wish to review the non-working pure RV32I version with infinite loops, please check the GitHub repository.**
### Initialize
```riscv
init:
srli x4, x10, 15 # sign
srli x5, x10, 7
andi x5, x5, 0xFF # exponent
andi x6, x10, 0x7F # mantissa
jr x2
```
### Handle Special Cases
```riscv
sqrt:
jal x2, init
li x7, 0xFF
bne x5, x7, skip1
bnez x6, return_a # NaN
bnez x4, return_nan # sqrt(-Inf)
j return_a # sqrt(+Inf)
skip1:
or x7, x5, x6
bnez x7, skip2
li x1, 0 # sqrt(0) = 0
j return
skip2:
bnez x4, return_nan # sqrt(negative)
bnez x5, skip3
li x1, 0 # sqrt(denormal) ≈ 0
j return
skip3:
```
### Calculate Square Root
```riscv
li x7, 127
sub x8, x5, x7 # e = exp - 127
ori x11, x6, 0x80 # m = mantissa with implicit bit
andi x7, x8, 1
beqz x7, skip4
slli x11, x11, 1 # if e is odd, m *= 2
addi x8, x8, -1
srai x9, x8, 1
addi x9, x9, 127 # new_exp = (e-1)/2 + 127
j skip5
skip4:
srai x9, x8, 1
addi x9, x9, 127 # new_exp = e/2 + 127
skip5:
```
### Binary Search for Square Root
```riscv
li x12, 90 # low = 90
li x13, 256 # high = 256
li x14, 128 # result_sqrt = 128
while_loop:
bltu x13, x12, end_loop
add x15, x12, x13
srli x15, x15, 1 # mid = (low + high) / 2
mul x18, x15, x15
li x7, 128
divu x18, x18, x7 # sq = mid^2 / 128
bltu x11, x18, skip6
mv x14, x15
addi x12, x15, 1 # low = mid + 1
j skip7
skip6:
addi x13, x15, -1 # high = mid - 1
skip7:
j while_loop
end_loop:
```
### Normalize Result
```riscv
li x7, 256
blt x14, x7, skip8
srli x14, x14, 1
addi x9, x9, 1
j skip9
skip8:
li x7, 128
bge x14, x7, skip9
loop_2:
li x7, 128
bge x14, x7, skip10
li x16, 1
ble x9, x16, skip10
slli x14, x14, 1
addi x9, x9, -1
j loop_2
skip10:
skip9:
andi x14, x14, 0x7F
```
### Handle Overflow/Underflow
```riscv
li x7, 0xFF
bge x9, x7, return_inf
blez x9, return_zero
```
### Combine Result
```riscv
andi x9, x9, 0xFF
slli x9, x9, 7
or x1, x9, x14
j return
```
---
# Part 2: Validation & Testing
## Test Program
```riscv
.data
test_a: .word 0x3F80 # 1.0
test_b: .word 0x4000 # 2.0
result_add1: .word 0
msg_test1: .string "Test Add 1.0+2.0: "
msg_pass: .string "PASS\n"
msg_fail: .string "FAIL\n"
.text
.global main
main:
la a0, msg_test1
li a7, 4
ecall
la x14, test_a
lw x14, 0(x14)
la x15, test_b
lw x15, 0(x15)
jal x2, add
la x7, result_add1
sw x1, 0(x7)
li x16, 0x4040
beq x1, x16, test1_pass
la a0, msg_fail
li a7, 4
ecall
j test2
test1_pass:
la a0, msg_pass
li a7, 4
ecall
test2: ...
```
**Note: Eight different tests can be executed in the q1-bf16-test.s file (excluding sqrt).**
---
# Part 3: Performance Analysis
**Initially, this analysis was intended to simply compare instruction counts. However, the goal evolved to surpass C compiler performance by leveraging RISC-V architecture-specific optimizations.** This addresses the assignment's directive: *"Leverage the strengths of the RISC-V architecture for efficiency."*
## BF16 Addition Performance Comparison
The following comparison uses a simple test case:
- **Operation: result = a + b**
- **a = 1.0 (BF16: 0x3F80)**
- **b = 2.0 (BF16: 0x4000)**
### Initial Attempt: C with -O2 (Inline Disabled)
The first approach was to examine the assembly generated with -O2 optimization. However, the compiler inlined the function by default. To obtain a fair comparison, inlining was disabled using compiler directives.
The resulting x86-64 assembly (truncated for brevity):
```assembly
bf16_add:
.LFB19:
.cfi_startproc
movl %edi, %ecx
movl %esi, %edx
movl %edi, %r11d
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
shrw $7, %cx
pushq %rbx
.cfi_def_cfa_offset 24
.cfi_offset 3, -24
shrw $7, %dx
movl %esi, %ebx
movl %esi, %r10d
movzbl %cl, %ecx
movl %edi, %r8d
movl %edi, %eax
shrw $15, %r11w
shrw $15, %bx
movzbl %dl, %edx
andl $127, %edi
andl $127, %r10d
cmpw $255, %cx
je .L55
...
.cfi_endproc
```
### Performance Results
After counting instructions and estimating cycle counts with AI assistance and manual verification:
| Implementation | Instructions | Cycles |
|----------------|--------------|--------|
| **RISC-V (manual)** 👑 | **21** | **37** |
| C (-O2, no inline) | 51 | 81 |
**The RISC-V implementation wins!** However, this comparison has significant caveats:
1. **Non-inline compilation** was forced for analysis purposes
2. **-O2 optimization** was used instead of maximum -O3
3. **x86-64 architecture** uses many registers and is designed for 64-bit operations with additional security features compared to the minimal RISC-V implementation
### Adding Inline Optimization
| Implementation | Instructions | Cycles |
|----------------|--------------|--------|
| **RISC-V (manual)** 👑 | **21** | **37** |
| C (-O2, no inline) | 51 | 81 |
| C (-O2, inline) | 58 | 72 |
**Interesting observation:** More instructions but fewer cycles due to better instruction scheduling and elimination of function call overhead.
### Maximum Optimization: -O3
| Implementation | Instructions | Cycles |
|----------------|--------------|--------|
| **RISC-V (manual)** 👑 | **21** | **37** |
| C (-O2, no inline) | 51 | 81 |
| C (-O2, inline) | 58 | 72 |
| C (-O3) | 42 | 48 |
**Why does -O3 perform better?**
1. **The compiler integrates the function directly into main**, eliminating all function call overhead
2. **Common case optimization**: The compiler reorders code to place the most frequent execution paths first, minimizing branch mispredictions
### Lessons Learned
The performance advantage of the manual RISC-V implementation comes from:
1. **Minimal overhead**: No function call prologue/epilogue when integrated properly
2. **RISC-V simplicity**: The reduced instruction set means each operation is optimized for the architecture
3. **Hand-tuned paths**: Critical paths (common cases) can be manually optimized to avoid unnecessary branches
However, to achieve truly superior performance, the implementation should:
- **Place rare cases (like infinity/NaN handling) at the end** to avoid polluting the instruction cache
- **Optimize for the common case first** (normal additions) with minimal branching
- **Use pipeline-friendly instruction sequences** that avoid data hazards
---
# Conclusion
## Summary of Achievements
This assignment successfully implemented:
1. **UTF8 encoding/decoding** with complete test coverage
2. **Comprehensive processor pipeline analysis** demonstrating understanding of hazards and forwarding
3. **Complete BF16 floating-point operations** (conversion, arithmetic, and square root)
## Insights
### 1. Assembly Programming Complexity
Writing assembly code revealed the complexity. Even simple operations require extensive error handling, normalization, and edge case management.
### 2. Performance Characteristics
The analysis demonstrated that:
- **Instruction count alone is misleading** - cycle count matter more
- **Compiler optimizations are sophisticated** - achieving better performance than -O3 requires deep architectural knowledge (but not in tose simple case)
- **Hand-optimization has limits** - while the RISC-V implementation was competitive, the margin diminishes with higher compiler optimization levels
### 3. Architecture-Specific Design
The RISC-V architecture's strength lies in:
- **Simple, uniform instruction format** enabling fast decode
- **Large register file** reducing memory traffic
- **Explicit pipeline visibility** allowing manual optimization of data hazards
## Future Improvements
For future assignments:
1. **Code Organization**: Implement better naming conventions and modular structure
2. **Optimization Strategy**: Focus on common-case performance rather than exhaustive edge case
3. **Documentation**: Maintain clearer separation between implementation notes and formal analysis
## Final Reflection
This assignment was enlightening.
While initially frustrating (Due to simple translation of C to risc ,especially the sqrt implementation), it provided invaluable insight into low-level computation, compiler optimization, and computer architecture. The iterative process of writing, testing, and optimizing assembly code, though time-consuming, has deepened my understanding of low levels languages.
The realization that manual RISC-V code could compete with (and occasionally exceed) compiled C code was unexpected, even if this advantage diminishes at higher optimization levels.
I don't consider this homerwork as good (I mean that i didn't provided a pretty good homework) but this exercise has prepared me well for future work i think.
---
**End of Report**