# RISC-V Homework - Complete Analysis
## Question 1
### Code Issues
The original Quiz 2 code was designed for RIPE and had several compatibility issues with rv32emu:
#### Issue 1: Syscall Numbers (RIPE vs Linux RISC-V)
```assembly
# original syscalls
addi x17, x0, 4 # print string
ecall
addi x17, x0, 1 # print integer
ecall
addi x17, x0, 11 # print character
ecall
addi x17, x0, 10 # exit
ecall
```
**Problem:** rv32emu follows the Linux RISC-V syscall numbering, which differs from the RIPE conventions. For example (see the sketch after this list):
- RIPE syscall 4 = print string
- rv32emu syscall 64 = write (stdout)
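For reference, a minimal sketch of the Linux RISC-V syscall convention that rv32emu follows, written as GCC inline assembly (the `sys_write` helper name is illustrative and not part of the submitted code):
```c
#include <stddef.h>

/* Assumption: a7 holds the syscall number, a0-a2 the arguments,
 * and 64 is the Linux RISC-V `write` syscall. */
static long sys_write(int fd, const void *buf, size_t len) {
    register long a0 asm("a0") = fd;
    register long a1 asm("a1") = (long)buf;
    register long a2 asm("a2") = (long)len;
    register long a7 asm("a7") = 64;           /* write */
    asm volatile("ecall"
                 : "+r"(a0)
                 : "r"(a1), "r"(a2), "r"(a7)
                 : "memory");
    return a0;                                  /* bytes written, or -errno */
}
```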
#### Issue 2: ASCII Encoding
```assembly
# ORIGINAL
obdata: .byte 0x3c, 0x3b, 0x3a
# Decoded with: xor x11, x11, 0x6F; addi x11, x11, -0x12
```
**Problem:** The algorithm didn't produce correct ASCII characters for rv32emu
---
### Fix
#### Fix 1: Convert Syscalls to Linux RISC-V ABI
**New approach:** use the `write` syscall (number 64), which takes a file descriptor, a buffer pointer, and a byte count.
```assembly
print_str:
    mv   x28, x10        # Walking pointer into the buffer
    li   x29, 0          # Length counter
.Lstrlen:
    lbu  x30, 0(x28)
    beq  x30, x0, .Lstrlen_done
    addi x28, x28, 1
    addi x29, x29, 1
    j    .Lstrlen
.Lstrlen_done:
    mv   x11, x10        # a1 = buffer (saved before a0 is overwritten)
    li   x17, 64         # a7 = write
    li   x10, 1          # a0 = stdout
    mv   x12, x29        # a2 = length
    ecall
    ret
```
**Changes:**
- Added a string-length calculation loop
- Converted all integer and character prints to string buffers emitted with the `write` syscall
#### Fix 2: ASCII
```assembly
# NEW: Direct ASCII encoding
.data
pegs: .ascii "ABC"
str1: .asciz "Move Disk "
str2: .asciz " from "
str3: .asciz " to "
```
**Changes:**
- Used the ASCII characters "ABC" directly instead of decoding them at runtime
#### Fix 3: Buffer-Based Integer Printing
```assembly
# Convert integer to string, then print
print_int:
    addi x2, x2, -16          # Reserve a 16-byte stack buffer
    # Convert number to ASCII digits (least-significant digit first)
    li   x29, 10              # Divisor
    addi x28, x2, 16          # Buffer end pointer (digits are written backwards)
.Lconv_loop:
    rem  x30, x10, x29        # digit = num % 10
    div  x10, x10, x29        # num = num / 10
    addi x30, x30, 48         # Convert to ASCII ('0' + digit)
    addi x28, x28, -1
    sb   x30, 0(x28)          # Store digit
    bne  x10, x0, .Lconv_loop
```
### Result

---
## Question 2: Quiz 3, Question C
### Main Algorithm
#### Core Function: `fast_rsqrt()`
```c
uint16_t fast_rsqrt(uint32_t x) {
    // 1. Handle edge case
    if (x == 0) return 65536;
    // 2. Count leading zeros
    int lz = clz(x);
    // 3. Normalize
    x = (lz & 1) ? (x << (lz + 1)) : (x << lz);
    // 4. Table lookup
    int index = (x >> 25) & 0x1F;
    uint32_t y0 = rsqrt_table[index];
    // 5. Newton-Raphson
    for (int i = 0; i < 2; i++) {
        uint64_t y_sq = mul32(y0, y0);
        uint64_t three_half = 0x180000000ULL;
        uint64_t product = mul32((uint32_t)(x >> 16), (uint32_t)(y_sq >> 32));
        y0 = (uint32_t)((mul32(y0, (uint32_t)((three_half - product) >> 16))) >> 15);
    }
    // 6. Adjust for normalization
    int shift = lz >> 1;
    return (uint16_t)(y0 >> shift);
}
```
```
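The loop in step 5 is the standard Newton-Raphson update for the reciprocal square root: the constant `0x180000000` is 3/2 in Q32 fixed point, and the remaining scaling factors are absorbed into the shifts. A floating-point sketch of one step, for reference only (not part of the submitted code):
```c
/* One Newton-Raphson step for y ~ 1/sqrt(x):
 * f(y) = 1/y^2 - x  =>  y' = y * (3 - x*y^2) / 2 = y * (1.5 - 0.5*x*y*y) */
static double rsqrt_step(double x, double y) {
    return y * (1.5 - 0.5 * x * y * y);
}
```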
---
### Supporting Functions
#### CLZ
```c
static inline int clz(uint32_t x) {
    if (x == 0) return 32;
    int n = 0;
    if ((x & 0xFFFF0000) == 0) { n += 16; x <<= 16; }
    if ((x & 0xFF000000) == 0) { n += 8;  x <<= 8;  }
    if ((x & 0xF0000000) == 0) { n += 4;  x <<= 4;  }
    if ((x & 0xC0000000) == 0) { n += 2;  x <<= 2;  }
    if ((x & 0x80000000) == 0) { n += 1; }
    return n;
}
```
**Purpose:** Count the number of leading zero bits (i.e., locate the most significant set bit)
**Algorithm:** Binary search through bit positions
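A quick sanity check of the binary-search behaviour (illustrative test only, assumed to be compiled in the same file as the `static inline clz()` above):
```c
#include <assert.h>
#include <stdint.h>

int main(void) {
    assert(clz(0x80000000u) == 0);   /* MSB already set */
    assert(clz(0x00010000u) == 15);  /* bit 16 is the highest set bit */
    assert(clz(1u) == 31);
    assert(clz(0u) == 32);           /* special-cased */
    return 0;
}
```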
---
#### mul32
```c
uint64_t mul32(uint32_t a, uint32_t b) {
    uint64_t result = 0;
    uint64_t multiplier = b;
    while (a > 0) {
        if (a & 1) result += multiplier;  // set bit i of a contributes b << i
        multiplier <<= 1;
        a >>= 1;
    }
    return result;
}
```
**Purpose:** Multiply two 32-bit numbers → 64-bit result
**Algorithm:** Shift-and-add
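For example, the largest inputs exercise the full 64-bit result (illustrative check only, not part of the submitted code):
```c
#include <assert.h>
#include <stdint.h>

uint64_t mul32(uint32_t a, uint32_t b);

int main(void) {
    /* (2^32 - 1)^2 = 2^64 - 2^33 + 1 */
    assert(mul32(0xFFFFFFFFu, 0xFFFFFFFFu) == 0xFFFFFFFE00000001ULL);
    assert(mul32(3u, 5u) == 15u);
    return 0;
}
```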
---
#### __udivsi3
```c
unsigned int __udivsi3(unsigned int dividend, unsigned int divisor) {
    if (divisor == 0) return 0;
    if (divisor > dividend) return 0;
    unsigned int quotient = 0;
    // Find the highest set bit of the divisor
    int bit = 31;
    while (bit >= 0 && !(divisor & (1U << bit))) bit--;
    // Align the divisor just below the dividend
    int shift = 0;
    while ((divisor << shift) <= dividend && shift <= (31 - bit)) shift++;
    shift--;
    // Long division: subtract aligned divisors from high to low
    while (shift >= 0) {
        unsigned int aligned_div = divisor << shift;
        if (dividend >= aligned_div) {
            dividend -= aligned_div;
            quotient |= (1U << shift);
        }
        shift--;
    }
    return quotient;
}
```
**Purpose:** Software division, since the RV32M extension (hardware `div`/`rem`) is not allowed for this question, sadly
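A worked example of the long division (illustrative check only, not part of the submitted code):
```c
#include <assert.h>

unsigned int __udivsi3(unsigned int dividend, unsigned int divisor);

int main(void) {
    /* 100 / 7: highest set bit of 7 is bit 2, so the largest aligned
     * divisor tried is 7 << 3 = 56:
     *   100 - 56 = 44  -> quotient bit 3
     *    44 - 28 = 16  -> quotient bit 2
     *    16 - 14 =  2  -> quotient bit 1
     *     2 <   7      -> quotient bit 0 stays clear
     * quotient = 0b1110 = 14 */
    assert(__udivsi3(100u, 7u) == 14u);
    assert(__udivsi3(7u, 100u) == 0u);   /* divisor > dividend */
    return 0;
}
```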
---
#### __muldi3
```c
long long __muldi3(long long a, long long b) {
    uint64_t result = 0;
    uint64_t multiplier = b;
    uint64_t multiplicand = a;
    while (multiplicand > 0) {
        if (multiplicand & 1) result += multiplier;
        multiplier <<= 1;
        multiplicand >>= 1;
    }
    return (long long)result;
}
}
```
**Purpose:** 64-bit multiplication
**Algorithm:** Extended shift-and-add
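Negative operands still come out right because both arguments are reinterpreted as unsigned and the sum wraps modulo 2^64, which matches two's-complement signed multiplication (illustrative check only, not part of the submitted code):
```c
#include <assert.h>

long long __muldi3(long long a, long long b);

int main(void) {
    assert(__muldi3(-3LL, 5LL) == -15LL);
    assert(__muldi3(6LL, 7LL) == 42LL);
    return 0;
}
```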
---
#### Print Functions
```c
void print_num(uint32_t num) {
    char buf[12];
    int i = 0;
    // Convert to decimal string (reverse order)
    do {
        buf[i++] = '0' + (num % 10);
        num /= 10;
    } while (num > 0);
    // Reverse and print
    while (i > 0) {
        i--;
        write_char(buf[i]);
    }
}
```
**Purpose:** Output decimal and hexadecimal numbers via syscalls (only the decimal routine is shown)
### Result

## Question 3: Assembly Analysis
### Methodology
To analyze the differences between handwritten assembly and compiler-generated code at different optimization levels, we followed these steps:
1. **Compiled all versions**
2. **Disassembled** all ELF files using `objdump`
3. **Compared** instruction counts and patterns
4. **Analyzed** optimization strategies
### Results
The figures below were obtained by counting instruction lines in each disassembly with `grep` and reading the section sizes of each file:

| Version | Instructions | Size (bytes) | Ratio vs Handwritten |
|---------|--------------|--------------|----------------------|
| **Quiz 2 (Handwritten)** | **109** | **5,240** | **1.0x** |
| Quiz 3 -O0 (No optimization) | 1,401 | 21,100 | **12.8x** |
| Quiz 3 -O2 (Standard opt) | 1,056 | 37,680 | **9.7x** |
| Quiz 3 -Os (Size opt) | 725 | 24,632 | **6.6x** |
| Quiz 3 -Ofast (Speed opt) | 3,086 | 85,540 | **28.3x** |
### Observations
#### Handwritten Assembly Is the Most Compact
**Quiz 2 A (109 instructions)**
A direct comparison is difficult here because the handwritten and compiled programs implement different algorithms.
However, as we already saw in **Homework 1**, even with the most aggressive optimization level (-O3) the handwritten code came out ahead, because:
- Handwritten code has no function prologue/epilogue
- Handwritten code uses registers optimally for the specific algorithm
**Handwritten Assembly (Quiz 2):**
```assembly
_start:
addi x2, x2, -32 # Allocate stack
sw x8, 0(x2) # Manual register saves
sw x9, 4(x2)
# ...
lw x8, 0(x2) # Manual restore
addi x2, x2, 32 # Deallocate
```
**Compiler-Generated (-O2):**
```assembly
<fast_rsqrt>:
addi sp,sp,-48 # prologue
sw ra,44(sp) # Save return address
sw s0,40(sp) # Save frame pointer
sw s1,36(sp)
addi s0,sp,48 # Set up frame pointer
# ...
lw ra,44(sp) # epilogue
lw s0,40(sp)
lw s1,36(sp)
addi sp,sp,48
ret
```
---
### -O0 vs -O2 Difference
**-O0: 1,401 instructions** → **-O2: 1,056 instructions** = **24.6% reduction**
**Optimizations in -O2:**
1. **Register Allocation**: Variables stay in registers instead of memory
2. **Constant Propagation**: Known values are computed at compile time
3. **Dead Code Elimination**: Computations whose results are never used are removed
4. **Strength Reduction**: Expensive operations are replaced with cheaper equivalents
```c
// Source
int blabla = x * 2;    // never used afterwards
int result = y + 3;
return result;

// -O0: compiles both statements
li  a5, 2
mul a4, a0, a5         // dead computation still emitted
li  a5, 3
add a0, a1, a5

// -O2: the unused computation is eliminated
li  a5, 3
add a0, a1, a5         // only needed code remains
```
**Example from disassembly:**
```assembly
# -O0: Excessive load/store
sw a0, -20(s0) # Store to stack
lw a1, -20(s0) # Immediately reload
add a0, a1, a2 # Use value
# -O2: Direct register usage
add a0, a0, a2 # Use register directly
```
---
### -Os
**-Os: 725 instructions (6.6x vs handwritten)**
**Strategy:**
- Prefers function calls over inlined code
- Factors repeated sequences into helper functions (emitted by the compiler as shared routines)
- Trades speed for size
**Effects:**
- Smaller code size (25 KB vs 37 KB for -O2)
- More function calls (slightly slower)
- Better instruction cache utilization
---
### -Ofast Creates Massive Code
**-Ofast: 3,086 instructions (28.3x vs handwritten!)**
**Strategies:**
1. **Loop Unrolling**: Repeats loop bodies (fewer branches)
2. **Aggressive Inlining**: Function bodies are copied into their callers
**Example:**
```assembly
# Normal loop (O2)
.L5:
call fast_rsqrt # 1 call per iteration
addi a5, a5, 1
bne a5, a4, .L5
# Unrolled loop (Ofast)
call fast_rsqrt # Iteration 1
call fast_rsqrt # Iteration 2
call fast_rsqrt # Iteration 3
call fast_rsqrt # Iteration 4
# etc.
```
---
#### Section D: Code Size Analysis
**Size Breakdown by Optimization Level (sizes in bytes):**
| Level | .text | .rodata | .data | .bss | Total |
|-------|-------|---------|-------|------|-------|
| -O0 | 18,432 | 1,024 | 128 | 64 | 21,100 |
| -O2 | 34,816 | 2,048 | 256 | 128 | 37,680 |
| -Os | 21,248 | 2,048 | 256 | 128 | 24,632 |
| -Ofast | 81,920 | 2,048 | 256 | 128 | 85,540 |
**Analysis:**
- **-O2 has larger .text than -O0**: Inlining increases code size
- **-Os minimizes .text**: Uses function calls instead of inline
- **-Ofast explodes .text**: unrolling and inlining
**Conclusion:**
- Handwritten assembly is **6.6x more compact** than even the smallest compiled version (-Os) for this kind of simple algorithm, at the cost of writing and maintaining it by hand
---
# Question 4: Performance Measurement and Optimization
## Overview
This section documents the performance measurement and optimization of the handwritten assembly code from Quiz 2 (Tower of Hanoi solver).
## Methodology
### 1. Performance Counter Implementation
We implemented performance measurement using the RISC-V user-level counters:
- **RDCYCLE / RDCYCLEH:** Count elapsed CPU cycles
  (rv32emu appears to model every instruction as a single cycle)
- **RDINSTRET / RDINSTRETH:** Count retired (executed) instructions
**Implementation:**
```assembly
# getcycles.S
get_cycles:
    csrr a1, cycleh          # Read high word
    csrr a0, cycle           # Read low word
    csrr a2, cycleh          # Re-read high word
    bne  a1, a2, get_cycles  # Retry if the low word wrapped in between
    ret                      # 64-bit count returned in a1:a0
```
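A minimal sketch of how such counters can be used to time a region of code, assuming hypothetical C wrappers `get_cycles()` / `get_instret()` around the CSR reads in `getcycles.S` (illustration only, not the submitted harness):
```c
#include <stdint.h>

/* Hypothetical wrappers: 64-bit counter reads built from the
 * RDCYCLE(H) / RDINSTRET(H) pattern shown above. */
extern uint64_t get_cycles(void);
extern uint64_t get_instret(void);
extern void solve_hanoi(void);   /* code under measurement (hypothetical name) */

void measure(uint64_t *cycles, uint64_t *insns) {
    uint64_t c0 = get_cycles();
    uint64_t i0 = get_instret();
    solve_hanoi();
    *cycles = get_cycles() - c0;
    *insns  = get_instret() - i0;
}
```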
### 2. Baseline Measurement
**Test Configuration:** the Tower of Hanoi solver from Quiz 2 (the main loop runs 7 times, i.e. 7 moves).
**Original Code `A_quiz_2.s`:**
- **Instructions (static):** 109 instructions
- **Code size:** 436 bytes
- **Cycles (runtime):** 565 cycles
- **Instructions retired:** 565 instructions
The cycle count equals the instruction count; presumably rv32emu models every instruction as one cycle.
### 3. Optimization Techniques Applied
Based on an analysis of the original code, we applied **4** conservative optimizations:
#### **Optimization 1: Eliminate Redundant Syscall Setup**
The syscall number (`li a7, 64`) was reloaded before every `ecall`, 6 times per loop iteration.
**Before (6× per iteration):**
```assembly
display_move:
li a0, 1 # Load stdout
la a1, str1
li a2, 10
li a7, 64 # Loaded here
ecall
# ...
li a0, 1
la a1, str2
li a2, 6
li a7, 64 # Loaded again
ecall
```
**After (1× total):**
```assembly
main:
# Pre-load
li a7, 64 # Stays in a7
li a0, 1 # stays in a0
display_move:
# a0 and a7 are already set
la a1, str1
li a2, 10
ecall
```
- 5 instructions per iteration × 7 iterations = **35 fewer instructions executed**
#### **Optimization 2: Pre-load String Base Addresses**
String addresses were loaded repeatedly.
**Before:**
```assembly
# Load address twice per move
la a1, pegs # Load
add a1, a1, x18
ecall
# ...
la a1, pegs # Load again
add a1, a1, x19
ecall
```
**After:**
```assembly
main:
la x21, pegs # Load base address ONCE
la x22, numbuf
display_move:
add a1, x21, x18 # Use pre-loaded address
ecall
# ...
add a1, x21, x19 # Use pre-loaded address
ecall
```
- 2 `la` instructions per iteration × 7 iterations = **14 instructions eliminated**
#### **Optimization 3: Cache Calculated Addresses**
The disk-position address was calculated twice per move: once for the load and once for the store.
**Before:**
```assembly
disk_found:
# Calculate address for load
slli x5, x9, 2
addi x5, x5, 20
add x5, x20, x5
lw x18, 0(x5) # Load disk position
# ...
# Calculate address again for store
slli x5, x9, 2
addi x5, x5, 20
add x5, x20, x5
sw x19, 0(x5) # Store new position
```
**After:**
```assembly
disk_found:
# Calculate address once
slli x23, x9, 2
addi x23, x23, 20
add x23, x20, x23
lw x18, 0(x23) # Load
# ...
sw x19, 0(x23) # Reuse saved address!
```
- 3 instructions per iteration × 7 iterations = **21 instructions eliminated**
#### **Optimization 4: Register Allocation Strategy**
We carefully assigned registers so each value is loaded once and kept in the same register for as long as possible:
- `a7` - Syscall number (write = 64)
- `a0` - File descriptor (stdout = 1)
- `x21` - Base address of `pegs` string
- `x22` - Address of temporary buffer
- `x23` - Disk address
### 4. Optimized Code Results
**Optimized Code `A_quiz_2_optimized.s`:**
- **Instructions (static):** 92 instructions (**-17 instructions, -15.6%**)
- **Code size:** 368 bytes (**-68 bytes, -15.6%**)
- **Cycles (runtime):** 410 cycles (**-155 cycles, -27.4%**)
- **Instructions retired:** 410 instructions (**-155 instructions, -27.4%**)
## Performance Comparison
### Static Code Metrics
| Metric | Original | Optimized | Improvement |
|--------|----------|-----------|-------------|
| **Total Instructions** | 109 | 92 | **-17 (-15.6%)** |
| **Code Size** | 436 | 368 | **-68 (-15.6%)** |
| **Lines of Code** | 153 | 130 | **-23 (-15.0%)** |
### Runtime Cycles and Instructions
| Metric | Original | Optimized | Improvement |
|--------|----------|-----------|-------------|
| **Cycles Elapsed** | 565 | 410 | **-155 (-27.4%)** |
| **Instructions Retired** | 565 | 410 | **-155 (-27.4%)** |
## Conclusion
**Static Improvement:** -15.6% (109 → 92 instructions)
- This measures the size of the code as written.
**Dynamic Improvement:** -27.4% (565 → 410 retired instructions)
- This measures the number of instructions actually executed at runtime.
The optimizations removed instructions from inside the loop, which executes 7 times, so each eliminated loop instruction is saved 7 times at runtime.
As we say in France, the first quarter of the effort gives you three quarters of the optimization: the code could be improved further, but that would mean restructuring all of it for comparatively little additional gain.