Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment 1: RISC-V Assembly and Instruction Pipeline

contributed by < [phyrexxxxx](https://github.com/phyrexxxxx/ca2025-quizzes) >

## Prerequisite

### Installing RISC-V Toolchain

```bash
$ sudo apt update
$ sudo apt install gcc-riscv64-unknown-elf
```

This installs the RISC-V GCC compiler and related tools.

### Verifying Installation

```bash
# Check RISC-V toolchain
riscv64-unknown-elf-gcc --version
riscv64-unknown-elf-as --version
riscv64-unknown-elf-ld --version

# Check tools
which riscv64-unknown-elf-size
which riscv64-unknown-elf-objdump
which riscv64-unknown-elf-nm
```

Output:

```
$ riscv64-unknown-elf-gcc --version
riscv64-unknown-elf-gcc (13.2.0-11ubuntu1+12) 13.2.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ riscv64-unknown-elf-as --version
GNU assembler (2.42-1ubuntu1+6) 2.42
Copyright (C) 2024 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `riscv64-unknown-elf'.

$ riscv64-unknown-elf-ld --version
GNU ld (2.42-1ubuntu1+6) 2.42
Copyright (C) 2024 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) a later version.
This program has absolutely no warranty.

$ which riscv64-unknown-elf-size
/usr/bin/riscv64-unknown-elf-size

$ which riscv64-unknown-elf-objdump
/usr/bin/riscv64-unknown-elf-objdump

$ which riscv64-unknown-elf-nm
/usr/bin/riscv64-unknown-elf-nm
```

The Ubuntu package installs `riscv64-unknown-elf-gcc`, but it can still compile RV32I programs when given `-march=rv32i`:

- `riscv64-unknown-elf-gcc` + `-march=rv32i` + `-mabi=ilp32` = compile RV32I programs
- Although the toolchain is named `riscv64`, it can compile both 32-bit and 64-bit programs
- This is standard practice for the official RISC-V GNU toolchain

## Preparing Ripes Simulator

Visit: https://github.com/mortbopet/Ripes/releases

> ⚠️ Note: **WSL2 cannot directly open GUI windows**, so you must either:
>
> - Install an **X Server for Windows** (e.g., [VcXsrv](https://sourceforge.net/projects/vcxsrv/) or [X410](https://x410.dev/))
> - Or download the **Windows version of Ripes** directly on the Windows host

Click: `Ripes-v2.2.6-70-gc8e0412-win-x86_64.zip`

After extraction, double-click `Ripes.exe` to launch the Ripes GUI.

# Problem B: CLZ (v1-basic)

> [commit 4ef4eb1](https://github.com/phyrexxxxx/ca2025-quizzes/blob/6082c8d70486b21cfe842e6cd79ec32aa92e3e37/q1b-uf8/clz/v1-basic/clz_v1.s)

```asm
    .text
    .globl clz_v1
    .type clz_v1, @function

clz_v1:
    li   t0, 32             # n = 32
    li   t1, 16             # c = 16

    # Iteration 1: c = 16
    srl  t2, a0, t1         # y = x >> 16
    beqz t2, .L1_skip       # if (y == 0), skip the update
    sub  t0, t0, t1         # n -= 16
    mv   a0, t2             # x = y
.L1_skip:

    # Iteration 2: c = 8
    srli t1, t1, 1          # c >>= 1 → c = 8
    srl  t2, a0, t1         # y = x >> 8
    beqz t2, .L2_skip       # if (y == 0), skip the update
    sub  t0, t0, t1         # n -= 8
    mv   a0, t2             # x = y
.L2_skip:

    # Iteration 3: c = 4
    srli t1, t1, 1          # c >>= 1 → c = 4
    srl  t2, a0, t1         # y = x >> 4
    beqz t2, .L3_skip       # if (y == 0), skip the update
    sub  t0, t0, t1         # n -= 4
    mv   a0, t2             # x = y
.L3_skip:

    # Iteration 4: c = 2
    srli t1, t1, 1          # c >>= 1 → c = 2
    srl  t2, a0, t1         # y = x >> 2
    beqz t2, .L4_skip       # if (y == 0), skip the update
    sub  t0, t0, t1         # n -= 2
    mv   a0, t2             # x = y
.L4_skip:

    # Iteration 5: c = 1
    srli t1, t1, 1          # c >>= 1 → c = 1
    srl  t2, a0, t1         # y = x >> 1
    beqz t2, .L5_skip       # if (y == 0), skip the update
    sub  t0, t0, t1         # n -= 1
    mv   a0, t2             # x = y
.L5_skip:

    sub  a0, t0, a0         # return n - x
    ret

    .size clz_v1, .-clz_v1
```

```bash
cd q1b-uf8/clz/v1-basic
```

## Step 1: Compile CLZ Versions

```bash
# Assemble v1-hand (hand-written assembly)
riscv64-unknown-elf-as \
    -march=rv32i \
    -mabi=ilp32 \
    -o clz_v1.o clz_v1.s

# Compile GCC -O0 (no optimization)
riscv64-unknown-elf-gcc \
    -march=rv32i \
    -mabi=ilp32 \
    -ffunction-sections \
    -O0 -c -o clz_gcc_O0.o clz.c

# Compile GCC -O2 (standard optimization)
riscv64-unknown-elf-gcc \
    -march=rv32i \
    -mabi=ilp32 \
    -ffunction-sections \
    -O2 -c -o clz_gcc_O2.o clz.c

# Compile GCC -O3 (aggressive optimization)
riscv64-unknown-elf-gcc \
    -march=rv32i \
    -mabi=ilp32 \
    -ffunction-sections \
    -O3 -c -o clz_gcc_O3.o clz.c

# Assemble test framework
riscv64-unknown-elf-as \
    -march=rv32i \
    -mabi=ilp32 \
    -o ../test_framework/test_framework.o \
    ../test_framework/test_framework.s
```

## Step 2: Link Test Programs

Errors encountered during the process:

```
$ riscv64-unknown-elf-ld \
    -nostdlib \
    -o test_v1_hand.elf \
    ../test_framework/test_framework.o \
    clz_v1.o
riscv64-unknown-elf-ld: ../test_framework/test_framework.o: ABI is incompatible with that of the selected emulation:
  target emulation `elf32-littleriscv' does not match `elf64-littleriscv'
riscv64-unknown-elf-ld: failed to merge target specific data of file ../test_framework/test_framework.o
riscv64-unknown-elf-ld: clz_v1.o: ABI is incompatible with that of the selected emulation:
  target emulation `elf32-littleriscv' does not match `elf64-littleriscv'
riscv64-unknown-elf-ld: failed to merge target specific data of file clz_v1.o
riscv64-unknown-elf-ld: ../test_framework/test_framework.o: in function `test_loop':
(.text+0x30): undefined reference to `clz'
riscv64-unknown-elf-ld: test_v1_hand.elf(.text): relocation ".L11+0x0 (type R_RISCV_JAL)" goes out of range
riscv64-unknown-elf-ld: ../test_framework/test_framework.o: file class ELFCLASS32 incompatible with ELFCLASS64
riscv64-unknown-elf-ld: final link failed: file in wrong format
```

Error reason:

```
target emulation `elf32-littleriscv' does not match `elf64-littleriscv'
```

That is, `elf32-littleriscv` and `elf64-littleriscv` don't match. Two separate problems show up in the log:

- ABI mismatch: the object file `clz_v1.o` is 32-bit, but the linker defaults to its 64-bit emulation
- Symbol name mismatch: the assembly defines `clz_v1`, but the test framework looks for `clz`

```asm
# clz_v1.s
clz_v1:
    li   t0, 32             # n = 32
    li   t1, 16             # c = 16
```

Fixed by re-linking with the following commands:

```bash
# Link v1-hand version
riscv64-unknown-elf-ld \
    -melf32lriscv \
    -nostdlib \
    --defsym clz=clz_v1 \
    -o test_v1_hand.elf \
    ../test_framework/test_framework.o \
    clz_v1.o

# Link GCC -O0 version
riscv64-unknown-elf-ld \
    -melf32lriscv \
    -nostdlib \
    --defsym clz=clz \
    -o test_gcc_O0.elf \
    ../test_framework/test_framework.o \
    clz_gcc_O0.o

# Link GCC -O2 version
riscv64-unknown-elf-ld \
    -melf32lriscv \
    -nostdlib \
    --defsym clz=clz \
    -o test_gcc_O2.elf \
    ../test_framework/test_framework.o \
    clz_gcc_O2.o

# Link GCC -O3 version
riscv64-unknown-elf-ld \
    -melf32lriscv \
    -nostdlib \
    --defsym clz=clz \
    -o test_gcc_O3.elf \
    ../test_framework/test_framework.o \
    clz_gcc_O3.o
```

Key parameter explanation:

| Parameter             | Purpose                                                    |
|-----------------------|------------------------------------------------------------|
| `-melf32lriscv`       | Tell the 64-bit linker to use its 32-bit RISC-V emulation  |
| `--defsym clz=clz_v1` | Create a symbol alias so that `clz` points to `clz_v1`     |

Generated ELF executable:

- ELF 32-bit LSB executable
- RISC-V soft-float ABI
- Contains `_start`, `clz`, and `clz_v1` symbols

For the "Link GCC -O0 version", "Link GCC -O2 version", and "Link GCC -O3 version" commands above, `--defsym clz=clz` is not necessary (I have verified this): the GCC-compiled object files already define the `clz` symbol, so the option is redundant but harmless. The most critical parameter is `-melf32lriscv`.
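The `ELFCLASS32` / `ELFCLASS64` wording in the linker error refers to a single byte in the ELF header. As a rough illustration (not part of the original workflow), the sketch below decodes that byte and the machine field. The helper names `describe_elf_header` and `fake_header` are made up for this example, and the headers are synthetic so the snippet runs without the toolchain.

```python
import struct

ELFCLASS = {1: "ELFCLASS32", 2: "ELFCLASS64"}
EM_RISCV = 243  # e_machine value assigned to RISC-V


def describe_elf_header(header: bytes) -> dict:
    """Decode the ELF header fields that matter for the class mismatch error."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class = header[4]   # e_ident[EI_CLASS]: 1 = 32-bit, 2 = 64-bit
    ei_data = header[5]    # e_ident[EI_DATA]: 1 = little endian, 2 = big endian
    byte_order = "<" if ei_data == 1 else ">"
    (e_machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return {
        "class": ELFCLASS.get(ei_class, "unknown"),
        "little_endian": ei_data == 1,
        "is_riscv": e_machine == EM_RISCV,
    }


def fake_header(ei_class: int) -> bytes:
    """Hypothetical helper: build a synthetic 20-byte little-endian RISC-V header."""
    e_ident = b"\x7fELF" + bytes([ei_class, 1, 1]) + bytes(9)  # 16 bytes in total
    return e_ident + struct.pack("<HH", 1, EM_RISCV)           # e_type, e_machine


print(describe_elf_header(fake_header(1)))  # what a 32-bit object looks like
print(describe_elf_header(fake_header(2)))  # what the default 64-bit emulation expects
```

To inspect a real object file, the first 20 bytes can be read with `open("clz_v1.o", "rb").read(20)` and passed to `describe_elf_header`; `riscv64-unknown-elf-objdump -h` reports the same information as `file format elf32-littleriscv`.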
### Check Generated Test Programs

```
$ ls -lh test_*.elf
-rwxr-xr-x 1 phyrexxxxx phyrexxxxx 1.7K Oct 10 14:32 test_gcc_O0.elf
-rwxr-xr-x 1 phyrexxxxx phyrexxxxx 1.7K Oct 10 14:34 test_gcc_O2.elf
-rwxr-xr-x 1 phyrexxxxx phyrexxxxx 1.7K Oct 10 14:35 test_gcc_O3.elf
-rwxr-xr-x 1 phyrexxxxx phyrexxxxx 1.7K Oct 10 14:31 test_v1_hand.elf
```

## Step 3: Verify Correctness

**Important**: Verify program correctness first, before measuring code size!

The test framework runs 12 test cases automatically and is expected to print `correct` or `wrong` in the Ripes console.

### Launch Ripes

1. Start Ripes
2. Select Processor: **5-stage pipeline (RV32I)**

### Test v1-hand Version

Load program: File → Load Program → Select `test_v1_hand.elf`

#### Check Results

**Problem:** After execution in Ripes, the console doesn't show `correct` or `wrong`. It only displays:

```
Program exited with code: 0
```

**Reason:** When loading an ELF file, Ripes doesn't process its "educational syscalls" (a7=1/4/10); ecall 10 is just treated as "exit", so we only see `Program exited with code: 0`. In Ripes' "Assembly Program" mode (pasting the `.s` source directly), it supports simplified syscalls such as a7=4 (print string).

**Fix:** `q1b-uf8/clz/test_framework/test_framework.s`:

```diff
 all_correct:
-    # Output "correct\n"
-    la a0, correct_str
-    li a7, 4            # syscall: print_string
-    ecall
-    j exit
+    li a0, 0            # exit code 0 = PASS
+    li a7, 10           # ecall: exit
+    ecall

 test_failed:
-    # Output "wrong\n"
-    la a0, wrong_str
-    li a7, 4            # syscall: print_string
-    ecall
+    li a0, 1            # exit code 1 = FAIL
+    li a7, 10
+    ecall
```

Then recompile and link.

Later, however, I merged `test_framework.s` and `clz_v1.s` into a single file, `clz_v1_standalone.s`, loaded that file in Ripes, and ran it directly. This is the fastest way to verify the correctness of a RISC-V program.

![image](https://hackmd.io/_uploads/SJxkAIL6gl.png)

Only versions that pass correctness testing are worth measuring for code size and performance.
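Before trusting the Ripes run, it can help to know what answers to expect. The following is a minimal reference model of the v1 algorithm, written as a sketch that mirrors the comments in `clz_v1.s`. The sample inputs are hypothetical and are not the 12 cases from `test_framework.s`, which are not reproduced in this document.

```python
def clz_v1_model(x: int) -> int:
    """Binary-search count of leading zeros, following clz_v1.s step by step."""
    x &= 0xFFFFFFFF        # treat the input as a 32-bit register value
    n, c = 32, 16
    while c:
        y = x >> c
        if y:              # the assembly skips this update when y == 0 (beqz)
            n -= c
            x = y
        c >>= 1
    return n - x           # x is now either 0 or 1


def clz_reference(x: int) -> int:
    """Independent check based on Python's int.bit_length()."""
    return 32 - (x & 0xFFFFFFFF).bit_length()


# Hypothetical sample inputs, chosen only to cover the boundary cases.
for value in (0x00000000, 0x00000001, 0x0000FFFF, 0x00010000, 0x80000000, 0xFFFFFFFF):
    assert clz_v1_model(value) == clz_reference(value)
    print(f"clz(0x{value:08x}) = {clz_v1_model(value)}")
```

The loop form is equivalent to the five unrolled iterations in the assembly, because `c` takes the values 16, 8, 4, 2, 1 in both.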
## Step 4: Measure Code Size

The prerequisite for measuring code size and runtime performance is **passing the program logic correctness test** in the previous step.

### Method 1: Measure Using size Tool

```
$ riscv64-unknown-elf-size clz_v1.o clz_gcc_O0.o clz_gcc_O2.o clz_gcc_O3.o
   text    data     bss     dec     hex filename
    112       0       0     112      70 clz_v1.o
    128       0       0     128      80 clz_gcc_O0.o
     48       0       0      48      30 clz_gcc_O2.o
     92       0       0      92      5c clz_gcc_O3.o
```

### Method 2: Using objdump

> https://dokk.org/manpages/debian/12/binutils-riscv64-unknown-elf/riscv64-unknown-elf-objdump.1.en
>
> objdump - display information from object files
> - [-h|--section-headers|--headers]
> - [-d|--disassemble[=symbol]]

```
$ riscv64-unknown-elf-objdump -h clz_v1.o

clz_v1.o:     file format elf32-littleriscv

Sections:
Idx Name              Size      VMA       LMA       File off  Algn
  0 .text             00000070  00000000  00000000  00000034  2**2
                      CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 .data             00000000  00000000  00000000  000000a4  2**0
                      CONTENTS, ALLOC, LOAD, DATA
  2 .bss              00000000  00000000  00000000  000000a4  2**0
                      ALLOC
  3 .riscv.attributes 0000001a  00000000  00000000  000000a4  2**0
                      CONTENTS, READONLY
```

To measure code size, read the `Size` field of the `.text` section in the `riscv64-unknown-elf-objdump -h` output (here `0x70` = 112 bytes, matching the `size` tool).

### Code Size Analysis

Sorted (smallest to largest), with percentages relative to GCC -O2:

1. **GCC -O2**: 48 bytes
2. **GCC -O3**: 92 bytes (+92%)
3. **v1-hand**: 112 bytes (+133%)
4. **GCC -O0**: 128 bytes (+167%)

**Observations:**

- GCC -O2 is the smallest, so the compiler's size-aware optimization is very effective here
- The hand-written assembly is about **2.3 times** the size of GCC -O2

## Step 5: Generate Disassembly Files

### Disassemble All Versions

```bash
riscv64-unknown-elf-objdump -d clz_v1.o > clz_v1.dis
riscv64-unknown-elf-objdump -d clz_gcc_O0.o > clz_gcc_O0.dis
riscv64-unknown-elf-objdump -d clz_gcc_O2.o > clz_gcc_O2.dis
riscv64-unknown-elf-objdump -d clz_gcc_O3.o > clz_gcc_O3.dis
```

### View Disassembly Content

```
$ cat clz_v1.dis

clz_v1.o:     file format elf32-littleriscv


Disassembly of section .text:

00000000 <clz_v1>:
   0:   02000293    li      t0,32
   4:   01000313    li      t1,16
   8:   006553b3    srl     t2,a0,t1
   c:   00038663    beqz    t2,18 <.L1_skip>
  10:   406282b3    sub     t0,t0,t1
  14:   00038513    mv      a0,t2

00000018 <.L1_skip>:
  18:   00135313    srli    t1,t1,0x1
  1c:   006553b3    srl     t2,a0,t1
  20:   00038663    beqz    t2,2c <.L2_skip>
  24:   406282b3    sub     t0,t0,t1
  28:   00038513    mv      a0,t2

0000002c <.L2_skip>:
  2c:   00135313    srli    t1,t1,0x1
  30:   006553b3    srl     t2,a0,t1
  34:   00038663    beqz    t2,40 <.L3_skip>
  38:   406282b3    sub     t0,t0,t1
  3c:   00038513    mv      a0,t2

00000040 <.L3_skip>:
  40:   00135313    srli    t1,t1,0x1
  44:   006553b3    srl     t2,a0,t1
  48:   00038663    beqz    t2,54 <.L4_skip>
  4c:   406282b3    sub     t0,t0,t1
  50:   00038513    mv      a0,t2

00000054 <.L4_skip>:
  54:   00135313    srli    t1,t1,0x1
  58:   006553b3    srl     t2,a0,t1
  5c:   00038663    beqz    t2,68 <.L5_skip>
  60:   406282b3    sub     t0,t0,t1
  64:   00038513    mv      a0,t2

00000068 <.L5_skip>:
  68:   40a28533    sub     a0,t0,a0
  6c:   00008067    ret
```

### Count Static Instructions

```
$ grep -E '^\s+[0-9a-f]+:' clz_v1.dis | wc -l
28

$ grep -E '^\s+[0-9a-f]+:' clz_gcc_O0.dis | wc -l
32

$ grep -E '^\s+[0-9a-f]+:' clz_gcc_O2.dis | wc -l
12

$ grep -E '^\s+[0-9a-f]+:' clz_gcc_O3.dis | wc -l
23
```

Record results:

| Version | Static Instructions |
|---------|---------------------|
| v1-hand | 28                  |
| GCC -O0 | 32                  |
| GCC -O2 | 12                  |
| GCC -O3 | 23                  |

## Step 6: Measure Runtime Performance (Using Ripes)

**Prerequisite**: Passed correctness test + measured code size

Assemble the hand-written `clz_v1.s` into `clz_v1.o`, then link it and the GCC-compiled `clz_gcc_O0.o`, `clz_gcc_O2.o`, and `clz_gcc_O3.o` with `test_framework.o` to generate, respectively:

- `test_v1_hand.elf`
- `test_gcc_O0.elf`
- `test_gcc_O2.elf`
- `test_gcc_O3.elf`

```bash
# Link v1-hand version
riscv64-unknown-elf-ld \
    -melf32lriscv \
    -nostdlib \
    --defsym clz=clz_v1 \
    -o test_v1_hand.elf \
    ../test_framework/test_framework.o \
    clz_v1.o

# Link GCC -O0 version
riscv64-unknown-elf-ld \
    -melf32lriscv \
    -nostdlib \
    --defsym clz=clz \
    -o test_gcc_O0.elf \
    ../test_framework/test_framework.o \
    clz_gcc_O0.o

# Link GCC -O2 version
riscv64-unknown-elf-ld \
    -melf32lriscv \
    -nostdlib \
    --defsym clz=clz \
    -o test_gcc_O2.elf \
    ../test_framework/test_framework.o \
    clz_gcc_O2.o

# Link GCC -O3 version
riscv64-unknown-elf-ld \
    -melf32lriscv \
    -nostdlib \
    --defsym clz=clz \
    -o test_gcc_O3.elf \
    ../test_framework/test_framework.o \
    clz_gcc_O3.o
```

Execute the following steps for each version:

1. Start Ripes
2. Select Processor: **5-stage pipeline (RV32I)**
3. File → Load Program, testing each file in turn:
    - `test_v1_hand.elf`
    - `test_gcc_O0.elf`
    - `test_gcc_O2.elf`
    - `test_gcc_O3.elf`

Run the performance test. After completion, check the **Execution info** panel in the lower right:

`test_v1_hand.elf`:

![image](https://hackmd.io/_uploads/rJuihvUpgx.png)

`test_gcc_O0.elf`:

![image](https://hackmd.io/_uploads/HyNREFL6xg.png)

`test_gcc_O2.elf`:

![image](https://hackmd.io/_uploads/B12zStI6ex.png)

`test_gcc_O3.elf`:

![image](https://hackmd.io/_uploads/ByJ4HK8ale.png)

| ELF Filename         | Cycles | Instrs. retired | CPI  | IPC   | Clock rate |
| -------------------- | ------ | --------------- | ---- | ----- | ---------- |
| `test_v1_hand.elf`   | 644    | 415             | 1.55 | 0.644 | 0 Hz       |
| `test_gcc_O0.elf`    | 1520   | 967             | 1.57 | 0.636 | 0 Hz       |
| `test_gcc_O2.elf`    | 788    | 487             | 1.62 | 0.618 | 0 Hz       |
| `test_gcc_O3.elf`    | 488    | 283             | 1.72 | 0.58  | 0 Hz       |

# Problem B: CLZ (v2-branchless)

> [commit e28af9e](https://github.com/sysprog21/ca2025-quizzes/commit/e28af9e1427b27d60ab63e34e4ede40da3424f50)
> [commit eb6fd76](https://github.com/sysprog21/ca2025-quizzes/commit/eb6fd7620def011f171f6f2a7693f3a5a71b9243)

```asm
    .text
    .globl clz_v2
    .type clz_v2, @function

clz_v2:
    li   t0, 32             # n = 32
    li   t1, 16             # c = 16

    # Iteration 1: c = 16
    srl  t2, a0, t1         # y = x >> 16

    # Create conditional mask
    sltu t3, x0, t2         # t3 = (y != 0) ? 1 : 0
    neg  t4, t3             # t4 = (y != 0) ? -1 : 0 = 0xFFFFFFFF : 0x00000000

    # Conditional update n: n -= (y != 0) ? c : 0
    and  t5, t1, t4         # t5 = (y != 0) ? c : 0
    sub  t0, t0, t5         # n -= t5

    # Conditional update x: x = (y != 0) ? y : x
    and  t5, t2, t4         # t5 = (y != 0) ? y : 0
    not  t6, t4             # t6 = (y != 0) ? 0 : 0xFFFFFFFF
    and  t6, a0, t6         # t6 = (y != 0) ? 0 : x
    or   a0, t5, t6         # x = t5 | t6 = (y != 0) ? y : x

    # Iteration 2: c = 8
    srli t1, t1, 1          # c = 8
    srl  t2, a0, t1         # y = x >> 8
    sltu t3, x0, t2         # condition
    neg  t4, t3             # mask
    and  t5, t1, t4         # conditional c
    sub  t0, t0, t5         # n -= (y != 0) ? c : 0
    and  t5, t2, t4         # conditional y
    not  t6, t4             # inverse mask
    and  t6, a0, t6         # conditional x
    or   a0, t5, t6         # x = (y != 0) ? y : x

    # Iteration 3: c = 4
    srli t1, t1, 1          # c = 4
    srl  t2, a0, t1         # y = x >> 4
    sltu t3, x0, t2
    neg  t4, t3
    and  t5, t1, t4
    sub  t0, t0, t5
    and  t5, t2, t4
    not  t6, t4
    and  t6, a0, t6
    or   a0, t5, t6

    # Iteration 4: c = 2
    srli t1, t1, 1          # c = 2
    srl  t2, a0, t1         # y = x >> 2
    sltu t3, x0, t2
    neg  t4, t3
    and  t5, t1, t4
    sub  t0, t0, t5
    and  t5, t2, t4
    not  t6, t4
    and  t6, a0, t6
    or   a0, t5, t6

    # Iteration 5: c = 1
    srli t1, t1, 1          # c = 1
    srl  t2, a0, t1         # y = x >> 1
    sltu t3, x0, t2
    neg  t4, t3
    and  t5, t1, t4
    sub  t0, t0, t5
    and  t5, t2, t4
    not  t6, t4
    and  t6, a0, t6
    or   a0, t5, t6

    # return n - x
    sub  a0, t0, a0
    ret

    .size clz_v2, .-clz_v2
```

The testing procedure is identical to v1-basic.

```bash
cd q1b-uf8/clz/v2-branchless
```

Code Size:

```
$ riscv64-unknown-elf-size clz_v2.o clz_gcc_O0.o clz_gcc_O2.o clz_gcc_O3.o
   text    data     bss     dec     hex filename
    212       0       0     212      d4 clz_v2.o
    344       0       0     344     158 clz_gcc_O0.o
    108       0       0     108      6c clz_gcc_O2.o
    108       0       0     108      6c clz_gcc_O3.o
```

Static instructions:

| Version | Static Instructions |
|---------|---------------------|
| v2-hand | 53                  |
| GCC -O0 | 86                  |
| GCC -O2 | 27                  |
| GCC -O3 | 27                  |

### ⭐ Observation: `clz_gcc_O2.dis` vs `clz_gcc_O3.dis`

```diff
diff -u clz_gcc_O2.dis clz_gcc_O3.dis
--- clz_gcc_O2.dis	2025-10-10 22:50:57.562537064 +0800
+++ clz_gcc_O3.dis	2025-10-10 22:50:57.566537061 +0800
@@ -1,5 +1,5 @@
 
-clz_gcc_O2.o:     file format elf32-littleriscv
+clz_gcc_O3.o:     file format elf32-littleriscv
 
 
 Disassembly of section .text.clz:
```

The diff shows that the two disassembly files (`.dis`) are identical apart from the filename in the header, which changes from `clz_gcc_O2.o` to `clz_gcc_O3.o`. This indicates that the v2 version of `clz.c` has reached its optimization limit at `-O2`, and `-O3` has nothing more to do.
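The mask-based selection used in `clz_v2` above can be easier to follow in a higher-level form. The sketch below is a Python model of the same instruction sequence, with explicit 32-bit masking because Python integers are unbounded. It is meant as a reading aid under that assumption, not as the author's `clz.c`.

```python
MASK32 = 0xFFFFFFFF


def clz_v2_model(x: int) -> int:
    """Branchless count of leading zeros, one loop pass per unrolled iteration."""
    x &= MASK32
    n = 32
    for c in (16, 8, 4, 2, 1):
        y = x >> c                               # srl  t2, a0, t1
        nonzero = int(y != 0)                    # sltu t3, x0, t2
        mask = (-nonzero) & MASK32               # neg  t4, t3 -> 0xFFFFFFFF or 0
        n -= c & mask                            # and + sub: n -= (y != 0) ? c : 0
        x = (y & mask) | (x & ~mask & MASK32)    # and, not, and, or: select y or x
    return n - x


for value in (0x00000000, 0x00000001, 0x00008000, 0x80000000):
    print(f"clz(0x{value:08x}) = {clz_v2_model(value)}")
```

Each pass performs the same work whether or not `y` is zero. That matches the measurements: v2-hand retires more instructions than v1-hand (835 vs 415) but has a lower CPI (1.13 vs 1.55).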
Runtime Performance (Using Ripes):

`test_v2_hand.elf`:

![image](https://hackmd.io/_uploads/ryHR1jUTlx.png)

`test_gcc_O0.elf`:

![image](https://hackmd.io/_uploads/r1_eljI6xl.png)

`test_gcc_O2.elf`:

![image](https://hackmd.io/_uploads/B15blj8pee.png)

`test_gcc_O3.elf`:

![image](https://hackmd.io/_uploads/rkRzloIale.png)

| ELF Filename         | Cycles | Instrs. retired | CPI  | IPC   | Clock rate |
| -------------------- | ------ | --------------- | ---- | ----- | ---------- |
| `test_v2_hand.elf`   | 944    | 835             | 1.13 | 0.885 | 0 Hz       |
| `test_gcc_O0.elf`    | 1532   | 1159            | 1.32 | 0.757 | 0 Hz       |
| `test_gcc_O2.elf`    | 536    | 451             | 1.19 | 0.841 | 0 Hz       |
| `test_gcc_O3.elf`    | 536    | 451             | 1.19 | 0.841 | 0 Hz       |

# Problem B: CLZ (v3-table)

> [Commit 958167e](https://github.com/sysprog21/ca2025-quizzes/commit/958167e51b3246d988e8f6393e473a7cca8dda8d)
> [Commit 2828c7c](https://github.com/sysprog21/ca2025-quizzes/commit/2828c7cdb464b7a5000521a994f49cae68a407be)

```asm
    .data
    .align 2
# 8-bit lookup table: stores the leading-zero count for each value from 0 to 255
clz_table_8:
    .byte 8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4
    .byte 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3
    .byte 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
    .byte 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
    .byte 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
    .byte 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
    .byte 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
    .byte 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
    .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
    .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
    .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
    .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
    .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
    .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
    .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
    .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

    .text
    .globl clz_v3
    .type clz_v3, @function

# a0 = input value & return value
# a1 = extracted byte value (0-255) used to index the lookup table
# a2 = base address of clz_table_8
# a3 = offset value (0, 8, 16, or 24)
clz_v3:
    # Special case: x = 0
    beqz a0, .L_return_32

    # Load the base address of the lookup table into a2
    la   a2, clz_table_8

    # Check [31:24]
    srli a1, a0, 24         # a1 = x >> 24
    bnez a1, .L_byte3       # if (byte != 0), offset = 0

    # Check [23:16]
    srli a1, a0, 16         # a1 = x >> 16
    andi a1, a1, 0xFF       # a1 = (x >> 16) & 0xFF
    bnez a1, .L_byte2       # if (byte != 0), offset = 8

    # Check [15:8]
    srli a1, a0, 8          # a1 = x >> 8
    andi a1, a1, 0xFF       # a1 = (x >> 8) & 0xFF
    bnez a1, .L_byte1       # if (byte != 0), offset = 16

    # Check [7:0]
    andi a1, a0, 0xFF       # a1 = x & 0xFF
    li   a3, 24             # offset = 24
    j    .L_lookup

.L_byte3:                   # [31:24], offset = 0
    li   a3, 0              # offset = 0
    j    .L_lookup

.L_byte2:                   # [23:16], offset = 8
    li   a3, 8              # offset = 8
    j    .L_lookup

.L_byte1:                   # [15:8], offset = 16
    li   a3, 16             # offset = 16

.L_lookup:
    # result = offset + clz_table_8[byte]
    add  a2, a2, a1         # a2 = &clz_table_8[byte]
    lbu  a0, 0(a2)          # a0 = clz_table_8[byte]
    add  a0, a0, a3         # a0 = clz_table_8[byte] + offset
    ret

.L_return_32:
    # Special case: x = 0
    li   a0, 32
    ret
```

The testing procedure is identical to v1-basic and v2-branchless.
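As a cross-check of the table contents and the byte-selection order in `clz_v3`, here is a small Python sketch. It regenerates the 256-entry table from `int.bit_length()` instead of copying the `.byte` rows, so a typo in either place would show up as a mismatch. The function name `clz_v3_model` is chosen for this example only.

```python
# Leading zeros within one byte: 8 for 0, 7 for 1, 6 for 2..3, ..., 0 for 128..255.
CLZ_TABLE_8 = [8 - i.bit_length() for i in range(256)]


def clz_v3_model(x: int) -> int:
    """Table-driven count of leading zeros, checking bytes from high to low."""
    x &= 0xFFFFFFFF
    if x == 0:
        return 32                                 # .L_return_32
    for offset, shift in ((0, 24), (8, 16), (16, 8), (24, 0)):
        byte = (x >> shift) & 0xFF
        if byte:                                  # first non-zero byte decides
            return offset + CLZ_TABLE_8[byte]     # .L_lookup
    raise AssertionError("unreachable: x is non-zero")


print(CLZ_TABLE_8[:16])   # should match the first .byte row of clz_table_8
for value in (0x00000001, 0x00000100, 0x00FF0000, 0x80000000):
    print(f"clz(0x{value:08x}) = {clz_v3_model(value)}")
```

At most three byte checks and one table load are needed per call, which is consistent with v3 having the lowest cycle count of the three hand-written versions.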
```bash
cd q1b-uf8/clz/v3-table
```

Code Size:

```
$ riscv64-unknown-elf-size clz_v3.o clz_gcc_O0.o clz_gcc_O2.o clz_gcc_O3.o
   text    data     bss     dec     hex filename
    100     256       0     356     164 clz_v3.o
    488       0       0     488     1e8 clz_gcc_O0.o
    384       0       0     384     180 clz_gcc_O2.o
    384       0       0     384     180 clz_gcc_O3.o
```

Static Instructions:

| Version | Static Instructions |
|---------|---------------------|
| v3-hand | 25                  |
| GCC -O0 | 58                  |
| GCC -O2 | 32                  |
| GCC -O3 | 32                  |

### ⭐ Observation: `clz_gcc_O2.dis` vs `clz_gcc_O3.dis`

```diff
diff -u clz_gcc_O2.dis clz_gcc_O3.dis
--- clz_gcc_O2.dis	2025-10-11 13:32:58.711310893 +0800
+++ clz_gcc_O3.dis	2025-10-11 13:32:58.715258710 +0800
@@ -1,5 +1,5 @@
 
-clz_gcc_O2.o:     file format elf32-littleriscv
+clz_gcc_O3.o:     file format elf32-littleriscv
 
 
 Disassembly of section .text.clz:
```

This is the same situation as the v2 version: the v3 version of `clz.c` has reached its optimization limit at `-O2`, and `-O3` has nothing more to do.

Runtime Performance (Using Ripes):

`test_v3_hand.elf`:

![image](https://hackmd.io/_uploads/rkxHCPDTgx.png)

`test_gcc_O0.elf`:

![image](https://hackmd.io/_uploads/BytU0vDaeg.png)

`test_gcc_O2.elf`:

![image](https://hackmd.io/_uploads/HJkOCvvplx.png)

`test_gcc_O3.elf`:

![image](https://hackmd.io/_uploads/SyfKAvPpel.png)

| ELF Filename         | Cycles | Instrs. retired | CPI  | IPC   | Clock rate |
| -------------------- | ------ | --------------- | ---- | ----- | ---------- |
| `test_v3_hand.elf`   | 272    | 163             | 1.67 | 0.599 | 0 Hz       |
| `test_gcc_O0.elf`    | 392    | 271             | 1.45 | 0.691 | 0 Hz       |
| `test_gcc_O2.elf`    | 272    | 163             | 1.67 | 0.599 | 0 Hz       |
| `test_gcc_O3.elf`    | 272    | 163             | 1.67 | 0.599 | 0 Hz       |

# Analysis

`q1b-uf8/clz/v1-basic/clz_v1.s`:

```asm
clz_v1:
    li   t0, 32             # n = 32
    li   t1, 16             # c = 16

    # Iteration 1: c = 16
    srl  t2, a0, t1         # y = x >> 16
    beqz t2, .L1_skip       # if (y == 0), skip the update
    sub  t0, t0, t1         # n -= 16
    mv   a0, t2             # x = y
```

Test input: `0x80000000` (`a0 = 0x80000000`)

```
.word 0x80000000  # 10: highest bit -> 0
```

## IF Stage (Instruction Fetch)

The main goal of the IF stage is to fetch the next instruction to execute from memory. The PC (Program Counter) provides the address, the instruction is read from instruction memory, and it is then prepared for the next stage (ID).

![image](https://hackmd.io/_uploads/SyV9I2vplg.png)

![image](https://hackmd.io/_uploads/S16c8nvTgl.png)

![image](https://hackmd.io/_uploads/r1Zi8hDpgg.png)

![image](https://hackmd.io/_uploads/S1DujnD6gg.png)

At PC = `0x000001e8`, the instruction memory fetches `0x006553b3`, which decodes to:

```
srl x7, x10, x6
```

The PC adder outputs `0x000001ec` (the address of the next instruction). The IF/ID register has `enable = green = 1` and `clear = red = 0`, indicating normal pipeline flow. This stage doesn't update memory because it only fetches instructions.

Reference: https://www.rose-hulman.edu/class/csse/csse232/pdf/RISCV_Green_Card.pdf

![image](https://hackmd.io/_uploads/rkMCb1d6lx.png)

![image](https://hackmd.io/_uploads/SynJzk_Tgg.png)

```
srl x7, x10, x6
```

According to the reference card above, `srl` is an R-type instruction.
The 32-bit instruction format for R-type is:

```
| funct7  |  rs2  |  rs1  | funct3 |  rd   | opcode |
 31     25 24   20 19   15 14    12 11    7 6       0
```

- `opcode = 0110011`
- `funct3 = 101`
- `funct7 = 0000000`
- `rd = x7 = 7`
- `rs1 = x10 = 10`
- `rs2 = x6 = 6`

| Field    | Bit Range | Value (Binary) | Decimal/Note    |
| -------- | --------- | -------------- | --------------- |
| `funct7` | 31..25    | `0000000`      | SRL             |
| `rs2`    | 24..20    | `00110`        | x6 (=6)         |
| `rs1`    | 19..15    | `01010`        | x10 (=10)       |
| `funct3` | 14..12    | `101`          | SRL/SRA group   |
| `rd`     | 11..7     | `00111`        | x7 (=7)         |
| `opcode` | 6..0      | `0110011`      | OP (R-type)     |

Combining the fields into a 32-bit word:

```
0000000 00110 01010 101 00111 0110011
= 0000 0000 0110 0101 0101 0011 1011 0011
= 0x006553b3
```

The leading zeros can be dropped (`0x6553b3`), but tools usually display all 8 hex digits:

- In little-endian memory, the 4 bytes are stored in the order `b3 53 65 00`
- Ripes' "instr" column displays the combined 32-bit value `0x006553b3`

### Multiplexer: PC Selector

![image](https://hackmd.io/_uploads/Hk0Ghl_ale.png)

- Bottom input: `0x000001e8` (current PC)
- Top input: Branch or jump target (currently unused)
- Output: `0x000001e8` (still current PC)

This multiplexer determines the source of the next PC:

- During normal execution: select `PC + 4`
- On a taken branch or a jump: select the branch/jump target

During this cycle (the `srl` instruction) there is no branch or jump, so the multiplexer selects the sequential input rather than a branch target, and the pipeline proceeds in order.

### Adder

![image](https://hackmd.io/_uploads/H1oS3x_Tel.png)

The figure shows:

* Bottom input: `PC = 0x000001e8`
* Top input: `0x00000004`
* Output: `0x000001ec`

RV32I instructions are 4 bytes each, so the adder calculates the address of the next instruction:

```
PC = PC + 4 = 0x000001e8 + 0x4 = 0x000001ec
```

So the next instruction will be fetched from `0x000001ec`.
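The field packing worked out by hand in this section can also be reproduced mechanically. The sketch below assembles the R-type fields of `srl x7, x10, x6` into a word, prints its little-endian byte order, and repeats the `PC + 4` calculation. The helper names are illustrative and not taken from any toolchain.

```python
def encode_r_type(funct7: int, rs2: int, rs1: int, funct3: int, rd: int, opcode: int) -> int:
    """Pack the six R-type fields into a 32-bit instruction word."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def decode_r_type(word: int) -> dict:
    """Split a 32-bit R-type instruction word back into its fields."""
    return {
        "funct7": (word >> 25) & 0x7F,
        "rs2": (word >> 20) & 0x1F,
        "rs1": (word >> 15) & 0x1F,
        "funct3": (word >> 12) & 0x7,
        "rd": (word >> 7) & 0x1F,
        "opcode": word & 0x7F,
    }


# srl x7, x10, x6
word = encode_r_type(funct7=0b0000000, rs2=6, rs1=10, funct3=0b101, rd=7, opcode=0b0110011)
print(f"0x{word:08x}")                       # 0x006553b3
print(word.to_bytes(4, "little").hex(" "))   # b3 53 65 00
print(decode_r_type(word))

next_pc = 0x000001E8 + 4                     # what the IF-stage adder computes
print(f"0x{next_pc:08x}")                    # 0x000001ec
```

The same `decode_r_type` helper can be pointed at other words from the disassembly listing, for example `0x406282b3` (`sub t0, t0, t1`), to practice reading the fields.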
### Instruction Memory

![image](https://hackmd.io/_uploads/BJID2eu6lx.png)

The figure shows:

- Input address: `addr = 0x000001e8`
- Output instruction: `instr = 0x006553b3`

The instruction memory reads the 32-bit instruction stored at the address provided by the PC. Here it fetches the machine code of `srl x7, x10, x6`.

### Compressed Decoder Below Instruction Memory

![image](https://hackmd.io/_uploads/H13u3gd6lg.png)

The figure shows:

- Input: `0x006553b3`
- Output: `0x006553b3`

RISC-V supports 16-bit compressed instructions (RVC) as well as regular 32-bit instructions. Therefore, this module will:

- Check whether the instruction is in compressed format
- If it is a compressed (16-bit) instruction, expand it into a 32-bit instruction here
- If not (as in this example, which is a standard 32-bit instruction), pass it through unchanged

Since `0x006553b3` in the figure is already a 32-bit instruction, the decoder doesn't change the content, and the output equals the input.

### IF/ID Pipeline Register

![image](https://hackmd.io/_uploads/Hk7shgd6ll.png)

- enable = green (1) → allow data to proceed to the next stage
- clear = red (0) → no flush (meaning no branch or exception)
- The IF/ID register buffers:
    - Instruction content (`0x006553b3`)
    - PC value (`0x000001e8`)

## ID Stage (Instruction Decode)

![image](https://hackmd.io/_uploads/rJGSI6Pagg.png)

![image](https://hackmd.io/_uploads/Syyk0gdpel.png)

Main tasks of the ID stage:

- Decode the instruction (the Control Unit generates control signals)
- Read source register values (Register File)
- Generate the immediate value (Immediate Generator)
- Pass the results to the ID/EX pipeline register for the next stage

### Decode Module (Control Unit)

![image](https://hackmd.io/_uploads/rJC4Bg_age.png)

- The Decode Module identifies this instruction as SRL (Shift Right Logical)
    - `opcode = 0x0a` (R-type; this is the value shown in the Ripes figure, whereas the 7-bit opcode field in the encoding is `0110011`)
    - `rs1 = 0x0a` (`x10 = 10`)
    - `rs2 = 0x06` (`x6 = 6`)
    - `rd = 0x07` (`x7 = 7`)
- In the ID stage, the Control Unit generates control signals based on opcode / funct3 / funct7:
    - `RegWrite = 1` (because the result will be written back to x7)
    - `ALUSrc = 0` (because it is a register-to-register operation, not an immediate one)
    - `MemRead = 0`
    - `MemWrite = 0`
    - `Branch = 0`
    - `ALUOp = "SRL"`

### Register File

![image](https://hackmd.io/_uploads/HkgJKedpee.png)

Instruction `srl x7, x10, x6`:

- R1 idx: `rs1 = 0x0a` (`x10 = 10`)
- R2 idx: `rs2 = 0x06` (`x6 = 6`)

Based on the decode results, the Register File reads the two source registers `rs1` and `rs2`:

![image](https://hackmd.io/_uploads/SJPJoe_axx.png)

- Reg 1: 0x80000000
- Reg 2: 0x1000007d

These values are sent to the ALU input multiplexers.

### Immediate Generator

![image](https://hackmd.io/_uploads/rJa9jg_Tll.png)

Value in the figure: `0xdeadbeef`

- This is Ripes' default display value (SRL is R-type and doesn't use an immediate)
- This module's output is not used for R-type instructions
- For an I-type instruction (e.g., `addi`), it would generate a sign-extended immediate value

### ID/EX Pipeline Register

![image](https://hackmd.io/_uploads/Sy51ngdpee.png)

The figure shows:

- `rs1 = 0x80000000`
- `rs2 = 0x1000007d`

The ID/EX pipeline register passes the decoded results to the next stage (EX):

- `enable = green = 1` → pipeline proceeds normally
- `clear = red = 0` → no flush

## EX Stage (Execute)

![image](https://hackmd.io/_uploads/Sy3FaluTeg.png)

![image](https://hackmd.io/_uploads/Skwnpxdplx.png)

EX stage function:

- Goal: the ALU (Arithmetic Logic Unit) performs the actual operation (shifts, addition/subtraction, AND/OR, etc.), and the stage determines whether to branch
- Input source: the ID/EX pipeline register (register values and control signals)
- Output result: sent to the EX/MEM pipeline register (for subsequent MEM or WB use)

### ID/EX Pipeline Register

![image](https://hackmd.io/_uploads/BkCHrW_agx.png)

The figure shows:

- `rs1 = 0x80000000`
- `rs2 = 0x1000007d`
- `PC = 0x000001e8`
- Control signals:
    - `enable = green = 1`
    - `clear = red = 0`

The ID/EX pipeline register is the data handoff point from ID → EX. It buffers:

- Register input values (`rs1`, `rs2`)
- Control signals corresponding to the instruction (such as `RegWrite`, `ALUOp`, `Branch`, etc.)
- Current PC (for branch calculation)
- `enable = 1` indicates the pipeline proceeds normally
- `clear = 0` indicates no flush (no taken branch or jump)

### ALU

![image](https://hackmd.io/_uploads/S1G9Obdpll.png)

![image](https://hackmd.io/_uploads/rkecjb_Tel.png)

```
srl x7, x10, x6
```

Its function is:

```
x7 = x10 >> (x6[4:0])
   = x10 >> (x6 & 0x1F)
```

The RV32I specification describes the shift amount as follows:

> SLL, SRL, and SRA perform logical left, logical right, and arithmetic right shifts on the value in register rs1 by the shift amount held in the lower 5 bits of register rs2.
> [The RISC-V Instruction Set Manual, Volume I: Unprivileged Architecture, "Integer Register-Register Operations"]

Meaning: even if `rs2` contains a large value, only its lower 5 bits are used. In RV32I the valid shift range for a 32-bit integer is 0 to 31, which is 32 possibilities, so 5 bits are sufficient.
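The masking rule above can be checked with a few lines of Python. This is a minimal model of the SRL semantics only, assuming 32-bit register values. It also evaluates the two `rs2` values that appear in this walkthrough: the register-file value `0x1000007d` and the value `0x00000010` that `li t1, 16` produces.

```python
def srl(rs1_value: int, rs2_value: int) -> int:
    """RV32I SRL: logical right shift of rs1 by the low 5 bits of rs2."""
    shamt = rs2_value & 0x1F
    return (rs1_value & 0xFFFFFFFF) >> shamt


print(hex(srl(0x80000000, 16)))          # shift by 16
print(hex(srl(0x80000000, 48)))          # 48 & 0x1F == 16, so the same result
print(hex(0x1000007D & 0x1F))            # effective shift amount of the register-file value
print(hex(srl(0x80000000, 0x1000007D)))  # result if that value were used
```

Comparing the first and last results shows how much the ALU output depends on which `rs2` value actually reaches the ALU.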
Values in the figure:

- `Op1 = 0x80000000`
    - `rs1` → data to be shifted
- `Op2 = 0x00000010`
    - Only the lower 5 bits are used as the shift amount
    - With the value read from the register file, the shift amount would be `x6[4:0] = 0x1000007d[4:0] = x6 & 0x1F = 0x1d` (29)
- `Res = 0x00008000`
    - With that register-file value, the result would be `x7 = x10 >> (x6 & 0x1F) = 0x80000000 >> 0x1d = 0x00000004`
    - However, `Res` here shows `0x00008000` instead of `0x00000004`, because a **data hazard** occurred and was resolved by forwarding

```
clz_v1:
    li   t0, 32             # n = 32
    li   t1, 16             # c = 16

    # Iteration 1: c = 16
    srl  t2, a0, t1         # y = x >> 16
    beqz t2, .L1_skip       # if (y == 0), skip the update
    sub  t0, t0, t1         # n -= 16
    mv   a0, t2             # x = y
```

| Cycle            |  1  |  2  |  3  |  4  |  5  |  6  |
|:---------------- |:---:|:---:|:---:|:---:|:---:|:---:|
| `li t1, 16`      | IF  | ID  | EX  | MEM | WB  |     |
| `srl t2, a0, t1` |     | IF  | ID  | EX  | MEM | WB  |

- Hazard type occurring here: **RAW (Read After Write)**
    - The `t1` that `srl` needs to read is precisely the destination register of the previous `li` (i.e., `addi t1, x0, 16`), which has not been written back yet
- **Forwarding path:** from `li` EX/MEM → `srl` ALU
    - When the pipeline detects that `srl` needs the new value of `t1`, it forwards the result of `li` from the output of the EX/MEM pipeline register straight to the ALU input, without waiting for `li` to complete writeback (WB)
    - This way, `srl` gets the correct `t1 = 16 = 0x00000010` in its EX stage in cycle 4, which avoids a pipeline stall while keeping execution correct
    - That is: `x7 = x10 >> (x6 & 0x1F) = 0x80000000 >> (0x00000010 & 0x1F) = 0x80000000 >> 16 = 0x00008000`

### ALU Multiplexers

![image](https://hackmd.io/_uploads/r1hXXf_axx.png)

MUXes with three green dots in the figure:

- Left MUX: select Op1 = rs1 (not an immediate)
- Right MUX: select Op2 = rs2 (not an immediate)

**Reason:**

- Control signal `ALUSrc = 0` (because this `srl` instruction is R-type and doesn't use an immediate)
- Therefore, both operands are register operands:
    - `Op1` ← `rs1 (x10)`
    - `Op2` ← `rs2 (x6)`, supplied in this cycle through the forwarding path described above rather than from the stale register-file value

### Branch Module

![image](https://hackmd.io/_uploads/BJxMEfOaex.png)

The figure shows: `Branch taken = red = 0`

This instruction is `srl`, not a branch instruction, therefore:

- `Branch = 0`
- The `Branch taken` signal is red (false)
- For instructions like `beq` or `bne`, this module would decide whether the branch is taken by comparing the two source operands, while the ALU computes the branch target

### EX/MEM Pipeline Register

![image](https://hackmd.io/_uploads/ByBO4G_Tge.png)

The EX/MEM pipeline register will:

- Buffer the ALU result (`0x00008000`)
- Buffer control signals (such as whether to write back to a register)
- Buffer the destination register number (`rd = x7`)

## MEM Stage (Memory Access)

![image](https://hackmd.io/_uploads/HywM2MuTxg.png)

![image](https://hackmd.io/_uploads/H1KGBfd6el.png)

### EX/MEM Pipeline Register

- This stage's input comes from the previous cycle's EX stage
- `ALU Res = 0x00008000`: this is the ALU result from the previous stage (EX)
    - For a branch instruction (e.g., `beq`) in the MEM stage, this value represents the branch target address (`PC + imm`)
    - For a general arithmetic instruction (e.g., the `srl` in the figure) in the MEM stage, it is just passed through to the next stage (WB) and doesn't touch data memory
- The ALU result is buffered in the EX/MEM pipeline register, waiting for the WB stage to write it to x7

### Data Memory

Although the `srl` instruction is in the MEM stage, `srl` is not a load/store instruction, therefore:

- Data memory is not activated
- `Wr en = 0` (red)
- `MemRead = 0`
- `MemWrite = 0`

Values displayed at the memory ports:

- `Addr. = 0x00008000`
- `Data in = 0x00000010`
- `Read out = 0x00000000`

These are all don't-care values (signals flowing through the pipeline, not actual memory accesses).

## WB Stage (Write Back)

![image](https://hackmd.io/_uploads/SJ4SizOpel.png)

![image](https://hackmd.io/_uploads/rJ3riM_aeg.png)

In the WB (Write Back) stage, the execution result of the instruction `srl x7, x10, x6` is ready to be written back to the register file.
### MEM/WB Pipeline Register

- This stage's input is passed from the MEM stage
    - `ALU result = 0x00008000` (obtained from `srl x7, x10, x6`)
    - Buffered control signals (such as `RegWrite`) are also passed on from the MEM stage
- The main functions of this register are to:
    - Buffer the value to be written back
    - Buffer the target register number (`rd`) to write back to
    - Carry the control signal that determines whether the writeback actually happens

### Write Back Multiplexer

![image](https://hackmd.io/_uploads/r1P8pGd6xe.png)

- The WB stage must determine the source of the data to write back:
    - For a load instruction → write back the value read from data memory (`MemRead=1`)
    - For a general ALU instruction (e.g., the `srl` instruction currently in WB) → write back the ALU result (`MemRead=0`)
- In the figure:
    - Bottom input (from data memory Read out) = `0x00000000`
    - Top input (from ALU result) = `0x00008000`
- Since the instruction in this cycle is R-type (SRL), the writeback source is the ALU result → `0x00008000`

### Write Back Register File

![image](https://hackmd.io/_uploads/SynZ0Gd6ll.png)

- According to the information in the MEM/WB pipeline register:
    - `rd = x7`
    - `Write data = 0x00008000`
    - `RegWrite = 1` (allow write)

Therefore, the next cycle will write `0x00008000` to register x7, as shown in the figure below:

![image](https://hackmd.io/_uploads/rkFtAfOpxg.png)
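Looking back at the RAW hazard resolved in the EX stage, the operand selection can be summarized with a small model. This sketch uses the textbook forwarding conditions for a 5-stage pipeline and is not taken from the Ripes source. The `StageReg` class and `select_operand` function are hypothetical names, and it assumes that `li t0, 32` occupies MEM/WB while `li t1, 16` occupies EX/MEM, with no stalls.

```python
from dataclasses import dataclass


@dataclass
class StageReg:
    """Subset of a pipeline register that matters for forwarding (hypothetical model)."""
    rd: int           # destination register index
    reg_write: bool   # whether the instruction writes a register
    value: int        # result carried by this pipeline register


def select_operand(rs: int, regfile_value: int, ex_mem: StageReg, mem_wb: StageReg):
    """Pick the newest available value for source register rs."""
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == rs:
        return ex_mem.value, "EX/MEM"
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == rs:
        return mem_wb.value, "MEM/WB"
    return regfile_value, "register file"


# Cycle in which `srl x7, x10, x6` is in EX (assumed pipeline contents):
ex_mem = StageReg(rd=6, reg_write=True, value=16)   # li t1, 16  (x6)
mem_wb = StageReg(rd=5, reg_write=True, value=32)   # li t0, 32  (x5)

op1, src1 = select_operand(10, 0x80000000, ex_mem, mem_wb)   # rs1 = x10
op2, src2 = select_operand(6, 0x1000007D, ex_mem, mem_wb)    # rs2 = x6
result = op1 >> (op2 & 0x1F)

print(f"Op1 = 0x{op1:08x} from {src1}")
print(f"Op2 = 0x{op2:08x} from {src2}")
print(f"Res = 0x{result:08x}")
```

With these inputs the model reproduces the `Op1`, `Op2`, and `Res` values seen in the ALU figure, which supports the reading that `Op2` came from forwarding rather than from the register file.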