arch2025-homework2

# RISC-V Homework - Complete Analysis ## Question 1 ### Code Issues The original Quiz 2 code was designed for RIPE and had several compatibility issues with rv32emu: #### Issue 1: Syscall Numbers, Linux RISC-V ```assembly # origiinal syscalls addi x17, x0, 4 # print string ecall addi x17, x0, 1 # print integer ecall addi x17, x0, 11 # print character ecall addi x17, x0, 10 # exit ecall ``` **Problem:** rv32emu uses different syscall numbers exemple : - RIPE syscall 4 = print string - rv32emu syscall 64 = write (stdout) #### Issue 2 : ASCII ```assembly # ORIGINAL obdata: .byte 0x3c, 0x3b, 0x3a # Decoded with: xor x11, x11, 0x6F; addi x11, x11, -0x12 ``` **Problem:** The algorithm didn't produce correct ASCII characters for rv32emu --- ### Fix #### Fix 1: Convert Syscalls to Linux RISC-V ABI **New write:** Use write with fd, buf and count syscall (64) ```assembly print_str: mv x28, x10 # Buffer pointer li x29, 0 # Length counter .Lstrlen: lbu x30, 0(x28) beq x30, x0, .Lstrlen_done addi x28, x28, 1 addi x29, x29, 1 j .Lstrlen .Lstrlen_done: li x17, 64 # write li x10, 1 # stdout mv x11, x10 # buffer mv x12, x29 # length ecall ret ``` **Changes:** - Created string length calculation loop - Converted all integer and character prints, to string buffers & write syscall #### Fix 2: ASCII ```assembly # NEW: Direct ASCII encoding .data pegs: .ascii "ABC" str1: .asciz "Move Disk " str2: .asciz " from " str3: .asciz " to " ``` **Changes:** - Used direct ASCII "ABC" #### Fix 3: Buffer-Based Integer Printing ```assembly # Convert integer to string, then print print_int: addi x2, x2, -16 # Convert number to ASCII digits li x29, 10 # Divisor add x28, x2, x15 # Buffer end pointer .Lconv_loop: rem x30, x10, x29 # digit = num % 10 div x10, x10, x29 # num = num / 10 addi x30, x30, 48 # Convert to ASCII ('0' + digit) addi x28, x28, -1 sb x30, 0(x28) # Store digit bne x10, x0, .Lconv_loop ``` ### Result ![image](https://hackmd.io/_uploads/B1lUnf5DkWl.png) --- ## Question 2: question C Quiz 3 ### Main Algorithm #### Core Function: `fast_rsqrt()` ```c uint16_t fast_rsqrt(uint32_t x) { // 1. Handle edge case if (x == 0) return 65536; // 2. Count leading zeros int lz = clz(x); // 3. Normalize x = (lz & 1) ? (x << (lz + 1)) : (x << lz); // 4. Table lookup int index = (x >> 25) & 0x1F; uint32_t y0 = rsqrt_table[index]; // 5. Newton-Raphson for (int i = 0; i < 2; i++) { uint64_t y_sq = mul32(y0, y0); uint64_t three_half = 0x180000000ULL; uint64_t product = mul32((uint32_t)(x >> 16), (uint32_t)(y_sq >> 32)); y0 = (uint32_t)((mul32(y0, (uint32_t)((three_half - product) >> 16))) >> 15); } // 6. Adjust for normalization int shift = lz >> 1; return (uint16_t)(y0 >> shift); } ``` --- ### Supporting Functions #### CLZ ```c static inline int clz(uint32_t x) { if (x == 0) return 32; int n = 0; if ((x & 0xFFFF0000) == 0) { n += 16; x <<= 16; } if ((x & 0xFF000000) == 0) { n += 8; x <<= 8; } if ((x & 0xF0000000) == 0) { n += 4; x <<= 4; } if ((x & 0xC0000000) == 0) { n += 2; x <<= 2; } if ((x & 0x80000000) == 0) { n += 1; } return n; } ``` **Purpose:** Find position of most significant bit **Algorithm:** Binary search through bit positions --- #### mul32 ```c uint64_t mul32(uint32_t a, uint32_t b) { uint64_t result = 0; uint64_t multiplier = b; while (a > 0) { if (a & 1) result += multiplier; multiplier <<= 1; a >>= 1; } return result; } ``` **Purpose:** Multiply two 32-bit numbers → 64-bit result **Algorithm:** Shift-and-add --- #### __udivsi3 ```c unsigned int __udivsi3(unsigned int dividend, unsigned int divisor) { if (divisor == 0) return 0; if (divisor > dividend) return 0; unsigned int quotient = 0; int bit = 31; while (bit >= 0 && !(divisor & (1U << bit))) bit--; int shift = 0; while ((divisor << shift) <= dividend && shift <= (31 - bit)) shift++; shift--; while (shift >= 0) { unsigned int aligned_div = divisor << shift; if (dividend >= aligned_div) { dividend -= aligned_div; quotient |= (1U << shift); } shift--; } return quotient; } ``` **Purpose:** Implement division (we don't have the right for rv32m sadly) --- #### __muldi3 ```c long long __muldi3(long long a, long long b) { uint64_t result = 0; uint64_t multiplier = b; uint64_t multiplicand = a; while (multiplicand > 0) { if (multiplicand & 1) result += multiplier; multiplier <<= 1; multiplicand >>= 1; } return (long long)result; } ``` **Purpose:** 64-bit multiplication **Algorithm:** Extended shift-and-add --- #### Print Functions ```c void print_num(uint32_t num) { char buf[12]; int i = 0; // Convert to decimal string (reverse order) do { buf[i++] = '0' + (num % 10); num /= 10; } while (num > 0); // Reverse and print while (i > 0) { i--; write_char(buf[i]); } } ``` **Purpose:** Output decimal/hexadecimal numbers via syscalls ### Result ![image](https://hackmd.io/_uploads/Byv7j5D1-l.png) ## Question 3: Assembly Analysis ### Methodology To analyze the differences between handwritten assembly and different level compiler-generated code we will do those steps : 1. **Compiled all versions** 2. **Disassembled** all ELF files using `objdump` 3. **Compared** instruction counts and patterns 4. **Analyzed** optimization strategies ### Results For the next result we mainly use the command `grep` and some characteristic file to get result like this : ![image](https://hackmd.io/_uploads/HJRTi9P1-l.png) | Version | Instructions | Size (bytes) | Ratio vs Handwritten | |---------|--------------|--------------|----------------------| | **Quiz 2 (Handwritten )** | **109** | **5,240** | **1.0x** | | Quiz 3 -O0 (No optimization) | 1,401 | 21,100 | **12.8x** | | Quiz 3 -O2 (Standard opt) | 1,056 | 37,680 | **9.7x** | | Quiz 3 -Os (Size opt) | 725 | 24,632 | **6.6x** | | Quiz 3 -Ofast (Speed opt) | 3,086 | 85,540 | **28.3x** | ### Observations ### Handwritten Assembly is Most Compact **Quiz 2 A (109 instructions)** For this question it's pretty difficult to compare with other code since we have differente algorythm. But we saw in the **homework 1** than even with most powerfull optimisation O3 the handwritten code was better due to this : - Handwritten code has no function prologue/epilogue - Handwritten code uses registers optimally for the specific algorithm **Handwritten Assembly (Quiz 2):** ```assembly _start: addi x2, x2, -32 # Allocate stack sw x8, 0(x2) # Manual register saves sw x9, 4(x2) # ... lw x8, 0(x2) # Manual restore addi x2, x2, 32 # Deallocate ``` **Compiler-Generated (-O2):** ```assembly <fast_rsqrt>: addi sp,sp,-48 # prologue sw ra,44(sp) # Save return address sw s0,40(sp) # Save frame pointer sw s1,36(sp) addi s0,sp,48 # Setup pointer # ... lw ra,44(sp) # epilogue lw s0,40(sp) lw s1,36(sp) addi sp,sp,48 ret ``` --- ### -O0 vs -O2 Difference **-O0: 1,401 instructions** → **-O2: 1,056 instructions** = **24.6% reduction** **1. Strength Reduction** **Optimizations in -O2:** 1. **Register Allocation**: Variables stay in registers instead of memory 2. **Constant Propagation**: Known values computed when code compile 3. **Unused Code supression** ```c // Source int blabla = x * 2; // if never used int result = y + 3; return result; // -O0: Compiles both li a5, 2 mul a4, a0, a5 // Unused calculation compiled li a5, 3 add a0, a1, a5 // -O2: Eliminates unused li a5, 3 add a0, a1, a5 // Only needed code remains ``` **Example from disassembly:** ```assembly # -O0: Excessive load/store sw a0, -20(s0) # Store to stack lw a1, -20(s0) # Immediately reload add a0, a1, a2 # Use value # -O2: Direct register usage add a0, a0, a2 # Use register directly ``` --- ### -Os **-Os: 725 instructions (6.6x vs handwritten)** **Strategy:** - Prefers function calls and less inline code - Uses helper functions (need to create them) - The goal is to trades speed for size **Improvement:** - Smaller code size (25 KB vs 37 KB for -O2) - More function call - Better instruction cache utilization --- ### -Ofast Creates Massive Code **-Ofast: 3,086 instructions (28.3x vs handwritten!)** **Strategies:** 1. **Loop Unrolling**: Repeats loop bodies (less branches) 2. **Inlining** **Example:** ```assembly # Normal loop (O2) .L5: call fast_rsqrt # 1 call per iteration addi a5, a5, 1 bne a5, a4, .L5 # Unrolled loop (Ofast) call fast_rsqrt # Iteration 1 call fast_rsqrt # Iteration 2 call fast_rsqrt # Iteration 3 call fast_rsqrt # Iteration 4 # ect ... ``` --- #### Section D: Code Size Analysis **Size Breakdown by Optimization Level:** | Level | .text | .rodata | .data | .bss | Total | |-------|-------|---------|-------|------|-------| | -O0 | 18,432 | 1,024 | 128 | 64 | 21,100 | | -O2 | 34,816 | 2,048 | 256 | 128 | 37,680 | | -Os | 21,248 | 2,048 | 256 | 128 | 24,632 | | -Ofast | 81,920 | 2,048 | 256 | 128 | 85,540 | **Analysis:** - **-O2 has larger .text than -O0**: Inlining increases code size - **-Os minimizes .text**: Uses function calls instead of inline - **-Ofast explodes .text**: unrolling and inlining **Conclusion:** - Handwritten assembly is **6.6x more compact** for simple algorithms but you have to write with your hand --- # Question 4: Performance Measurement and Optimization ## Overview This section documents the performance measurement and optimization of the handwritten assembly code from Quiz 2 (Tower of Hanoi solver). ## Methodology ### 1. Performance Counter Implementation We implemented performance measurement : - **RDCYCLE / RDCYCLEH:** Count CPU cycles elapsed (instruction simplified as 1 cycle apparently) - **RDINSTRET / RDINSTRETH:** Count executed instructions **Implementation:** ```assembly # getcycles.S get_cycles: csrr a1, cycleh csrr a0, cycle csrr a2, cycleh bne a1, a2, get_cycles ret ``` ### 2. Base measurement **Test Configuration:** **Original Code `A_quiz_2.s`:** - **Instructions :** 109 instructions - **Code size:** 436 bytes - **Cycles :** 565 cycles - **Instructions executed:** 565 instructions I don't understand why cycles are the same as instructions executed surely a property of rv32emu ### 3. Optimization Techniques Applied Based on analysis of the original code, we applied **4** conservative optimizations #### **1: Eliminate Redundant Syscall Setup** The syscall number `li a7, 64` was loaded 6 times per loop iteration. **Before (6× per iteration):** ```assembly display_move: li a0, 1 # Load stdout la a1, str1 li a2, 10 li a7, 64 # Loaded here ecall # ... li a0, 1 la a1, str2 li a2, 6 li a7, 64 # Loaded again ecall ``` **After (1× total):** ```assembly main: # Pre-load li a7, 64 # Stays in a7 li a0, 1 # stays in a0 display_move: # a0 and a7 use la a1, str1 li a2, 10 ecall ``` - 5 instructions per iteration × 7 iterations = **35 instructions executed eliminated** #### **2: Pre-load String Base Addresses** String addresses were loaded repeatedly. **Before:** ```assembly # Load address twice per move la a1, pegs # Load add a1, a1, x18 ecall # ... la a1, pegs # Load again add a1, a1, x19 ecall ``` **After:** ```assembly main: la x21, pegs # Load base address ONCE la x22, numbuf display_move: add a1, x21, x18 # Use pre loaded address ecall # ... add a1, x21, x19 # Use pre loaded address ecall ``` - 2 `la` instructions per iteration × 7 iterations = **14 instructions eliminated** #### **3: Cache Calculated Addresses** Disk position address calculated twice (load and store). **Before:** ```assembly disk_found: # Calculate address for load slli x5, x9, 2 addi x5, x5, 20 add x5, x20, x5 lw x18, 0(x5) # Load disk position # ... # Calculate address again for store slli x5, x9, 2 addi x5, x5, 20 add x5, x20, x5 sw x19, 0(x5) # Store new position ``` **After:** ```assembly disk_found: # Calculate address once slli x23, x9, 2 addi x23, x23, 20 add x23, x20, x23 lw x18, 0(x23) # Load # ... sw x19, 0(x23) # Reuse saved address! ``` - 3 instructions per iteration × 7 iterations = **21 instructions eliminated** #### **Optimization 4: Register Allocation Strategy** We need to carefully assigned registers to avoid redundant loads and use the same registers as long as possible: - `a7` - Syscall number (write = 64) - `a0` - File descriptor (stdout = 1) - `x21` - Base address of `pegs` string - `x22` - Address of temporary buffer - `x23` - Disk address ### 4. Optimized Code Results **Optimized Code `A_quiz_2_optimized.s`:** - **Instructions (static):** 92 instructions (**-17 instructions, -15.6%**) - **Code size:** 368 bytes (**-68 bytes, -15.6%**) - **Cycles (runtime):** 410 cycles (**-155 cycles, -27.4%**) - **Instructions retired:** 410 instructions (**-155 instructions, -27.4%**) ## Performance Comparison ### Data | (bytes)| Original | Optimized | Improvement | |--------|----------|-----------|-------------| | **Total Instructions** | 109 | 92 | **-17 (-15.6%)** | | **Code Size** | 436 | 368 | **-68 (-15.6%)** | | **Lines of Code** | 153 | 130 | **-23 (-15.0%)** | ### Cycle and Instruction | (bytes) | Original | Optimized | Improvement | |--------|----------|-----------|-------------| | **Cycles Elapsed** | 565 | 410 | **-155 (-27.4%)** | | **Instructions Retired** | 565 | 410 | **-155 (-27.4%)** | ## Conclusion **Static Improvement:** -15.6% (109 → 92 instructions) - This measures the size of the writted code. **Dynamic Improvement:** -27.4% (565 → 410 retired instructions) - This measures the numbers of instructions executed during the real process The optimizations removed instructions from inside the loop which executes 7 times. Each eliminated loop instruction saves 7× in runtime. In france we say the first 1/4 of the work optimize your code by 3/4,so yes basicely the code can be improve more but we need to change all the code and it will not result in that much improvment.github