# Assignment 1: RISC-V Assembly and Instruction Pipeline
contributed by < [phyrexxxxx](https://github.com/phyrexxxxx/ca2025-quizzes) >
## Prerequisite
### Installing RISC-V Toolchain
```bash
$ sudo apt update
$ sudo apt install gcc-riscv64-unknown-elf
```
The commands above install the RISC-V GCC compiler and its related tools.
### Verifying Installation
```bash
# Check RISC-V toolchain
riscv64-unknown-elf-gcc --version
riscv64-unknown-elf-as --version
riscv64-unknown-elf-ld --version
# Check tools
which riscv64-unknown-elf-size
which riscv64-unknown-elf-objdump
which riscv64-unknown-elf-nm
```
Output:
```
$ riscv64-unknown-elf-gcc --version
riscv64-unknown-elf-gcc (13.2.0-11ubuntu1+12) 13.2.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ riscv64-unknown-elf-as --version
GNU assembler (2.42-1ubuntu1+6) 2.42
Copyright (C) 2024 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `riscv64-unknown-elf'.
$ riscv64-unknown-elf-ld --version
GNU ld (2.42-1ubuntu1+6) 2.42
Copyright (C) 2024 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) a later version.
This program has absolutely no warranty.
$ which riscv64-unknown-elf-size
/usr/bin/riscv64-unknown-elf-size
$ which riscv64-unknown-elf-objdump
/usr/bin/riscv64-unknown-elf-objdump
$ which riscv64-unknown-elf-nm
/usr/bin/riscv64-unknown-elf-nm
```
The Ubuntu package installs a toolchain named `riscv64-unknown-elf-gcc`, but it can also compile RV32I programs when given `-march=rv32i`:
- `riscv64-unknown-elf-gcc` + `-march=rv32i` + `-mabi=ilp32` = compile RV32I programs
- Although the toolchain is named `riscv64`, it can compile both 32-bit and 64-bit programs
- This is common practice with the RISC-V GNU toolchain
## Preparing Ripes Simulator
Visit: https://github.com/mortbopet/Ripes/releases
> ⚠️ Note: **WSL2 without WSLg cannot directly open GUI windows**, so you must either:
>
> - Install **X Server for Windows** (e.g., [VcXsrv](https://sourceforge.net/projects/vcxsrv/) or [X410](https://x410.dev/))
> - Or download **Windows version of Ripes** directly on the Windows host
Download: `Ripes-v2.2.6-70-gc8e0412-win-x86_64.zip`
After extraction, double-click `Ripes.exe` to launch the Ripes GUI.
# Problem B: CLZ (v1-basic)
> [commit 4ef4eb1](https://github.com/phyrexxxxx/ca2025-quizzes/blob/6082c8d70486b21cfe842e6cd79ec32aa92e3e37/q1b-uf8/clz/v1-basic/clz_v1.s)
```asm
.text
.globl clz_v1
.type clz_v1, @function
clz_v1:
li t0, 32 # n = 32
li t1, 16 # c = 16
# Iteration 1: c = 16
srl t2, a0, t1 # y = x >> 16
beqz t2, .L1_skip # if (y == 0)
sub t0, t0, t1 # n -= 16
mv a0, t2 # x = y
.L1_skip:
# Iteration 2: c = 8
srli t1, t1, 1 # c >>= 1 → c = 8
srl t2, a0, t1 # y = x >> 8
beqz t2, .L2_skip # if (y == 0)
sub t0, t0, t1 # n -= 8
mv a0, t2 # x = y
.L2_skip:
# Iteration 3: c = 4
srli t1, t1, 1 # c >>= 1 → c = 4
srl t2, a0, t1 # y = x >> 4
beqz t2, .L3_skip # if (y == 0)
sub t0, t0, t1 # n -= 4
mv a0, t2 # x = y
.L3_skip:
# Iteration 4: c = 2
srli t1, t1, 1 # c >>= 1 → c = 2
srl t2, a0, t1 # y = x >> 2
beqz t2, .L4_skip # if (y == 0)
sub t0, t0, t1 # n -= 2
mv a0, t2 # x = y
.L4_skip:
# Iteration 5: c = 1
srli t1, t1, 1 # c >>= 1 → c = 1
srl t2, a0, t1 # y = x >> 1
beqz t2, .L5_skip # if (y == 0)
sub t0, t0, t1 # n -= 1
mv a0, t2 # x = y
.L5_skip:
sub a0, t0, a0 # return n - x
ret
.size clz_v1, .-clz_v1
```
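For readers who prefer a high-level view, the control flow of `clz_v1` can be modelled in a few lines of Python. This is only an illustrative sketch, not part of the assignment code: the function name `clz_binary_search` is hypothetical, and the 32-bit masking is added here because Python integers are unbounded.

```python
def clz_binary_search(x: int) -> int:
    """Model of clz_v1: test shift amounts c = 16, 8, 4, 2, 1 in turn."""
    x &= 0xFFFFFFFF          # emulate a 32-bit register (a0)
    n = 32                   # t0
    c = 16                   # t1
    while c != 0:
        y = x >> c           # srl t2, a0, t1
        if y != 0:           # beqz t2, .Lk_skip falls through when y != 0
            n -= c           # sub t0, t0, t1
            x = y            # mv a0, t2
        c >>= 1              # srli t1, t1, 1
    return n - x             # sub a0, t0, a0


if __name__ == "__main__":
    for value in (0x00000000, 0x00000001, 0x00008000, 0x80000000):
        print(f"clz(0x{value:08x}) = {clz_binary_search(value)}")
```

After the last iteration `x` is either 0 or 1, so the final `n - x` handles both the all-zero input (result 32) and the last remaining bit.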
```bash
cd q1b-uf8/clz/v1-basic
```
## Step 1: Compile CLZ Versions
```bash
# Compile v1-hand (hand-written assembly)
riscv64-unknown-elf-as \
-march=rv32i \
-mabi=ilp32 \
-o clz_v1.o clz_v1.s
# Compile GCC -O0 (no optimization)
riscv64-unknown-elf-gcc \
-march=rv32i \
-mabi=ilp32 \
-ffunction-sections \
-O0 -c -o clz_gcc_O0.o clz.c
# Compile GCC -O2 (standard optimization)
riscv64-unknown-elf-gcc \
-march=rv32i \
-mabi=ilp32 \
-ffunction-sections \
-O2 -c -o clz_gcc_O2.o clz.c
# Compile GCC -O3 (aggressive optimization)
riscv64-unknown-elf-gcc \
-march=rv32i \
-mabi=ilp32 \
-ffunction-sections \
-O3 -c -o clz_gcc_O3.o clz.c
# Compile test framework
riscv64-unknown-elf-as \
-march=rv32i \
-mabi=ilp32 \
-o ../test_framework/test_framework.o \
../test_framework/test_framework.s
```
## Step 2: Link Test Programs
Errors encountered during the process:
```
$ riscv64-unknown-elf-ld \
-nostdlib \
-o test_v1_hand.elf \
../test_framework/test_framework.o \
clz_v1.o
riscv64-unknown-elf-ld: ../test_framework/test_framework.o: ABI is incompatible with that of the selected emulation:
target emulation `elf32-littleriscv' does not match `elf64-littleriscv'
riscv64-unknown-elf-ld: failed to merge target specific data of file ../test_framework/test_framework.o
riscv64-unknown-elf-ld: clz_v1.o: ABI is incompatible with that of the selected emulation:
target emulation `elf32-littleriscv' does not match `elf64-littleriscv'
riscv64-unknown-elf-ld: failed to merge target specific data of file clz_v1.o
riscv64-unknown-elf-ld: ../test_framework/test_framework.o: in function `test_loop':
(.text+0x30): undefined reference to `clz'
riscv64-unknown-elf-ld: test_v1_hand.elf(.text): relocation ".L11+0x0 (type R_RISCV_JAL)" goes out of range
riscv64-unknown-elf-ld: ../test_framework/test_framework.o: file class ELFCLASS32 incompatible with ELFCLASS64
riscv64-unknown-elf-ld: final link failed: file in wrong format
```
Error reason:
```
target emulation `elf32-littleriscv' does not match `elf64-littleriscv'
```
In other words, the object files are `elf32-littleriscv` while the linker defaults to `elf64-littleriscv`. The log shows two separate problems:
- ABI mismatch: Object file `clz_v1.o` is 32-bit, but the linker is using 64-bit default mode
- Symbol name mismatch: Assembly defines `clz_v1`, but test framework looks for `clz`
```asm
# clz_v1.s
clz_v1:
li t0, 32 # n = 32
li t1, 16 # c = 16
```
Fixed using the following commands to re-link:
```bash
# Link v1-hand version
riscv64-unknown-elf-ld \
-melf32lriscv \
-nostdlib \
--defsym clz=clz_v1 \
-o test_v1_hand.elf \
../test_framework/test_framework.o \
clz_v1.o
# Link GCC -O0 version
riscv64-unknown-elf-ld \
-melf32lriscv \
-nostdlib \
--defsym clz=clz \
-o test_gcc_O0.elf \
../test_framework/test_framework.o \
clz_gcc_O0.o
# Link GCC -O2 version
riscv64-unknown-elf-ld \
-melf32lriscv \
-nostdlib \
--defsym clz=clz \
-o test_gcc_O2.elf \
../test_framework/test_framework.o \
clz_gcc_O2.o
# Link GCC -O3 version
riscv64-unknown-elf-ld \
-melf32lriscv \
-nostdlib \
--defsym clz=clz \
-o test_gcc_O3.elf \
../test_framework/test_framework.o \
clz_gcc_O3.o
```
Key parameter explanation:
| Parameter | Purpose |
|---------------------|---------------------------------------|
| `-melf32lriscv` | Tell 64-bit linker to use 32-bit RISC-V emulation mode |
| `--defsym clz=clz_v1` | Create symbol alias, making `clz` point to `clz_v1` |
Generated ELF executable:
- ELF 32-bit LSB executable
- RISC-V soft-float ABI
- Contains `_start`, `clz`, and `clz_v1` symbols
For the GCC -O0, -O2, and -O3 link commands above, `--defsym clz=clz` is not necessary (I have verified this): GCC-compiled object files already define the correct `clz` symbol, although adding the option does no harm. The critical parameter is `-melf32lriscv`.
### Check Generated Test Programs
```
$ ls -lh test_*.elf
-rwxr-xr-x 1 phyrexxxxx phyrexxxxx 1.7K Oct 10 14:32 test_gcc_O0.elf
-rwxr-xr-x 1 phyrexxxxx phyrexxxxx 1.7K Oct 10 14:34 test_gcc_O2.elf
-rwxr-xr-x 1 phyrexxxxx phyrexxxxx 1.7K Oct 10 14:35 test_gcc_O3.elf
-rwxr-xr-x 1 phyrexxxxx phyrexxxxx 1.7K Oct 10 14:31 test_v1_hand.elf
```
## Step 3: Verify Correctness
**Important**: Verify program correctness first before measuring code size!
The test framework automatically runs 12 test cases and is expected to print `correct` or `wrong` in the Ripes console.
### Launch Ripes
1. Start Ripes
2. Select Processor: **5-stage pipeline (RV32I)**
### Test v1-hand Version
Load program: File → Load Program → Select `test_v1_hand.elf`
#### Check Results
**Problem:** After execution in Ripes, the console doesn't show `correct` or `wrong`, only displays:
```
Program exited with code: 0
```
**Reason:** When an ELF file is loaded, Ripes doesn't process its educational print syscalls (`a7` = 1 and 4); only `ecall` 10 takes effect, as exit, so all we see is `Program exited with code: 0`.
In Ripes' assembly-program mode (pasting the `.s` source directly into the editor), simplified syscalls such as `a7` = 4 (print string) are supported.
**Fix:** `q1b-uf8/clz/test_framework/test_framework.s`:
```diff
all_correct:
- # Output "correct\n"
- la a0, correct_str
- li a7, 4 # syscall: print_string
- ecall
- j exit
+ li a0, 0 # exit code 0 = PASS
+ li a7, 10 # ecall: exit
+ ecall
test_failed:
- # Output "wrong\n"
- la a0, wrong_str
- li a7, 4 # syscall: print_string
- ecall
+ li a0, 1 # exit code 1 = FAIL
+ li a7, 10
+ ecall
```
Then reassemble and relink.
Later, however, I merged `test_framework.s` and `clz_v1.s` into a single file, `clz_v1_standalone.s`.
I then loaded `clz_v1_standalone.s` in Ripes and ran it directly. For me, this was the fastest way to verify the program's correctness.

Only versions that pass correctness testing are worth continuing to measure code size and performance.
## Step 4: Measure Code Size
The prerequisite for measuring code size and runtime performance is: **passing the program logic correctness test** in the previous step.
### Method 1: Measure Using size Tool
```
$ riscv64-unknown-elf-size clz_v1.o clz_gcc_O0.o clz_gcc_O2.o clz_gcc_O3.o
text data bss dec hex filename
112 0 0 112 70 clz_v1.o
128 0 0 128 80 clz_gcc_O0.o
48 0 0 48 30 clz_gcc_O2.o
92 0 0 92 5c clz_gcc_O3.o
```
### Method 2: Using objdump
> https://dokk.org/manpages/debian/12/binutils-riscv64-unknown-elf/riscv64-unknown-elf-objdump.1.en
>
> objdump - display information from object files
> - [-h|--section-headers|--headers]
> - [-d|--disassemble[=symbol]]
```
$ riscv64-unknown-elf-objdump -h clz_v1.o
clz_v1.o: file format elf32-littleriscv
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000070 00000000 00000000 00000034 2**2
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
1 .data 00000000 00000000 00000000 000000a4 2**0
CONTENTS, ALLOC, LOAD, DATA
2 .bss 00000000 00000000 00000000 000000a4 2**0
ALLOC
3 .riscv.attributes 0000001a 00000000 00000000 000000a4 2**0
CONTENTS, READONLY
```
To measure code size, read the `Size` field of the `.text` section in the `riscv64-unknown-elf-objdump -h` output (here `0x70` = 112 bytes).
### Code Size Analysis
Sorted (smallest to largest), with increases relative to GCC -O2:
1. **GCC -O2**: 48 bytes
2. **GCC -O3**: 92 bytes (+92%)
3. **v1-hand**: 112 bytes (+133%)
4. **GCC -O0**: 128 bytes (+167%)
**Observations:**
- GCC -O2 produces the smallest code, so compiler optimization is very effective here
- Hand-written assembly is about **2.3 times** the size of GCC -O2
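The relative sizes can be recomputed from the `size` output. The snippet below is a small sketch that hard-codes the `.text` byte counts reported earlier; the dictionary and function names are illustrative only.

```python
# .text sizes in bytes, copied from the riscv64-unknown-elf-size output
text_bytes = {
    "GCC -O2": 48,
    "GCC -O3": 92,
    "v1-hand": 112,
    "GCC -O0": 128,
}


def size_ratio(size: int, baseline: int) -> float:
    """Return how many times larger `size` is than `baseline`."""
    return size / baseline


baseline = text_bytes["GCC -O2"]
for name, size in sorted(text_bytes.items(), key=lambda item: item[1]):
    ratio = size_ratio(size, baseline)
    print(f"{name:8s} {size:4d} bytes  {ratio:.2f}x  ({100 * (ratio - 1):+.1f}%)")
```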
## Step 5: Generate Disassembly Files
### Disassemble All Versions
```bash
riscv64-unknown-elf-objdump -d clz_v1.o > clz_v1.dis
riscv64-unknown-elf-objdump -d clz_gcc_O0.o > clz_gcc_O0.dis
riscv64-unknown-elf-objdump -d clz_gcc_O2.o > clz_gcc_O2.dis
riscv64-unknown-elf-objdump -d clz_gcc_O3.o > clz_gcc_O3.dis
```
### View Disassembly Content
```
$ cat clz_v1.dis
clz_v1.o: file format elf32-littleriscv
Disassembly of section .text:
00000000 <clz_v1>:
0: 02000293 li t0,32
4: 01000313 li t1,16
8: 006553b3 srl t2,a0,t1
c: 00038663 beqz t2,18 <.L1_skip>
10: 406282b3 sub t0,t0,t1
14: 00038513 mv a0,t2
00000018 <.L1_skip>:
18: 00135313 srli t1,t1,0x1
1c: 006553b3 srl t2,a0,t1
20: 00038663 beqz t2,2c <.L2_skip>
24: 406282b3 sub t0,t0,t1
28: 00038513 mv a0,t2
0000002c <.L2_skip>:
2c: 00135313 srli t1,t1,0x1
30: 006553b3 srl t2,a0,t1
34: 00038663 beqz t2,40 <.L3_skip>
38: 406282b3 sub t0,t0,t1
3c: 00038513 mv a0,t2
00000040 <.L3_skip>:
40: 00135313 srli t1,t1,0x1
44: 006553b3 srl t2,a0,t1
48: 00038663 beqz t2,54 <.L4_skip>
4c: 406282b3 sub t0,t0,t1
50: 00038513 mv a0,t2
00000054 <.L4_skip>:
54: 00135313 srli t1,t1,0x1
58: 006553b3 srl t2,a0,t1
5c: 00038663 beqz t2,68 <.L5_skip>
60: 406282b3 sub t0,t0,t1
64: 00038513 mv a0,t2
00000068 <.L5_skip>:
68: 40a28533 sub a0,t0,a0
6c: 00008067 ret
```
### Count Static Instructions
```
$ grep -E '^\s+[0-9a-f]+:' clz_v1.dis | wc -l
28
$ grep -E '^\s+[0-9a-f]+:' clz_gcc_O0.dis | wc -l
32
$ grep -E '^\s+[0-9a-f]+:' clz_gcc_O2.dis | wc -l
12
$ grep -E '^\s+[0-9a-f]+:' clz_gcc_O3.dis | wc -l
23
```
Record results:
| Version | Static Instructions |
|---------|---------------------|
| v1-hand | 28 |
| GCC -O0 | 32 |
| GCC -O2 | 12 |
| GCC -O3 | 23 |
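As a sanity check, the instruction counts should agree with the `.text` sizes measured in Step 4, because every RV32I instruction (without the C extension) occupies 4 bytes. The sketch below hard-codes the numbers from both tables; the variable names are illustrative.

```python
INSTRUCTION_BYTES = 4  # fixed-width RV32I encoding, no compressed instructions

# Values copied from the static-instruction table and the size output
static_instructions = {"v1-hand": 28, "GCC -O0": 32, "GCC -O2": 12, "GCC -O3": 23}
text_bytes = {"v1-hand": 112, "GCC -O0": 128, "GCC -O2": 48, "GCC -O3": 92}


def expected_text_size(instruction_count: int) -> int:
    """Size of a code-only .text section holding `instruction_count` instructions."""
    return instruction_count * INSTRUCTION_BYTES


for name, count in static_instructions.items():
    status = "ok" if expected_text_size(count) == text_bytes[name] else "MISMATCH"
    print(f"{name:8s} {count:2d} instr x 4 = {expected_text_size(count):3d} bytes  {status}")
```

A mismatch would suggest that `.text` also contains non-instruction bytes (for example, read-only data counted by `size`) or that the `grep` pattern missed some lines.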
## Step 6: Measure Runtime Performance (Using Ripes)
**Prerequisite**: Passed correctness test + measured code size
Link `clz_v1.o` (assembled from the hand-written `clz_v1.s`) and the GCC-compiled `clz_gcc_O0.o`, `clz_gcc_O2.o`, and `clz_gcc_O3.o`, each with `test_framework.o`.
This generates, respectively:
- `test_v1_hand.elf`
- `test_gcc_O0.elf`
- `test_gcc_O2.elf`
- `test_gcc_O3.elf`
```bash
# Link v1-hand version
riscv64-unknown-elf-ld \
-melf32lriscv \
-nostdlib \
--defsym clz=clz_v1 \
-o test_v1_hand.elf \
../test_framework/test_framework.o \
clz_v1.o
# Link GCC -O0 version
riscv64-unknown-elf-ld \
-melf32lriscv \
-nostdlib \
--defsym clz=clz \
-o test_gcc_O0.elf \
../test_framework/test_framework.o \
clz_gcc_O0.o
# Link GCC -O2 version
riscv64-unknown-elf-ld \
-melf32lriscv \
-nostdlib \
--defsym clz=clz \
-o test_gcc_O2.elf \
../test_framework/test_framework.o \
clz_gcc_O2.o
# Link GCC -O3 version
riscv64-unknown-elf-ld \
-melf32lriscv \
-nostdlib \
--defsym clz=clz \
-o test_gcc_O3.elf \
../test_framework/test_framework.o \
clz_gcc_O3.o
```
Execute the following steps for each version:
1. Start Ripes
2. Select Processor: **5-stage pipeline (RV32I)**
3. File → Load Program, test sequentially:
- `test_v1_hand.elf`
- `test_gcc_O0.elf`
- `test_gcc_O2.elf`
- `test_gcc_O3.elf`
Run performance test. After completion, check the **Execution info** panel in the lower right:
`test_v1_hand.elf`:

`test_gcc_O0.elf`:

`test_gcc_O2.elf`:

`test_gcc_O3.elf`:

| ELF Filename | Cycles | Instrs. retired | CPI | IPC | Clock rate |
| -------------------- | ------ | --------------- | ---- | ----- | ---------- |
| `test_v1_hand.elf` | 644 | 415 | 1.55 | 0.644 | 0 Hz |
| `test_gcc_O0.elf` | 1520 | 967 | 1.57 | 0.636 | 0 Hz |
| `test_gcc_O2.elf` | 788 | 487 | 1.62 | 0.618 | 0 Hz |
| `test_gcc_O3.elf` | 488 | 283 | 1.72 | 0.58 | 0 Hz |
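The CPI and IPC columns follow directly from the cycle and instruction counts (CPI = cycles / instructions, IPC = instructions / cycles). The following sketch recomputes them from the numbers in the table; the helper names are illustrative and not part of Ripes.

```python
# (cycles, instructions retired) copied from the Ripes measurements above
measurements = {
    "test_v1_hand.elf": (644, 415),
    "test_gcc_O0.elf": (1520, 967),
    "test_gcc_O2.elf": (788, 487),
    "test_gcc_O3.elf": (488, 283),
}


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction."""
    return cycles / instructions


def ipc(cycles: int, instructions: int) -> float:
    """Instructions per cycle (the reciprocal of CPI)."""
    return instructions / cycles


for name, (cycles, instrs) in measurements.items():
    print(f"{name:18s} CPI={cpi(cycles, instrs):.2f}  IPC={ipc(cycles, instrs):.3f}")

# Total cycles, not CPI, determine which version finishes first.
fastest = min(measurements, key=lambda name: measurements[name][0])
print("fewest cycles:", fastest)
```

Note that `test_gcc_O3.elf` has the highest CPI in the table yet the lowest cycle count, so CPI alone does not rank the versions.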
# Problem B: CLZ (v2-branchless)
> [commit e28af9e](https://github.com/sysprog21/ca2025-quizzes/commit/e28af9e1427b27d60ab63e34e4ede40da3424f50)
> [commit eb6fd76](https://github.com/sysprog21/ca2025-quizzes/commit/eb6fd7620def011f171f6f2a7693f3a5a71b9243)
```asm
.text
.globl clz_v2
.type clz_v2, @function
clz_v2:
li t0, 32 # n = 32
li t1, 16 # c = 16
# Iteration 1: c = 16
srl t2, a0, t1 # y = x >> 16
# Create conditional mask
sltu t3, x0, t2 # t3 = (y != 0) ? 1 : 0
neg t4, t3 # t4 = (y != 0) ? -1 : 0 = 0xFFFFFFFF : 0x00000000
# Conditional update n: n -= (y != 0) ? c : 0
and t5, t1, t4 # t5 = (y != 0) ? c : 0
sub t0, t0, t5 # n -= t5
# Conditional update x: x = (y != 0) ? y : x
and t5, t2, t4 # t5 = (y != 0) ? y : 0
not t6, t4 # t6 = (y != 0) ? 0 : 0xFFFFFFFF
and t6, a0, t6 # t6 = (y != 0) ? 0 : x
or a0, t5, t6 # x = t5 | t6 = (y != 0) ? y : x
# Iteration 2: c = 8
srli t1, t1, 1 # c = 8
srl t2, a0, t1 # y = x >> 8
sltu t3, x0, t2 # condition
neg t4, t3 # mask
and t5, t1, t4 # conditional c
sub t0, t0, t5 # n -= (y != 0) ? c : 0
and t5, t2, t4 # conditional y
not t6, t4 # inverse mask
and t6, a0, t6 # conditional x
or a0, t5, t6 # x = (y != 0) ? y : x
# Iteration 3: c = 4
srli t1, t1, 1 # c = 4
srl t2, a0, t1 # y = x >> 4
sltu t3, x0, t2
neg t4, t3
and t5, t1, t4
sub t0, t0, t5
and t5, t2, t4
not t6, t4
and t6, a0, t6
or a0, t5, t6
# Iteration 4: c = 2
srli t1, t1, 1 # c = 2
srl t2, a0, t1 # y = x >> 2
sltu t3, x0, t2
neg t4, t3
and t5, t1, t4
sub t0, t0, t5
and t5, t2, t4
not t6, t4
and t6, a0, t6
or a0, t5, t6
# Iteration 5: c = 1
srli t1, t1, 1 # c = 1
srl t2, a0, t1 # y = x >> 1
sltu t3, x0, t2
neg t4, t3
and t5, t1, t4
sub t0, t0, t5
and t5, t2, t4
not t6, t4
and t6, a0, t6
or a0, t5, t6
# return n - x
sub a0, t0, a0
ret
.size clz_v2, .-clz_v2
```
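The mask-based select used in each iteration of `clz_v2` can be modelled in Python as follows. This is a sketch for illustration only: the name `clz_branchless` and the explicit 32-bit masking are assumptions introduced here, since Python integers are not fixed-width.

```python
MASK32 = 0xFFFFFFFF


def clz_branchless(x: int) -> int:
    """Model of clz_v2: every update is selected with an all-ones/all-zeros mask."""
    x &= MASK32
    n = 32
    for c in (16, 8, 4, 2, 1):
        y = x >> c                               # srl  t2, a0, t1
        cond = int(y != 0)                       # sltu t3, x0, t2  -> 0 or 1
        mask = (-cond) & MASK32                  # neg  t4, t3      -> 0xFFFFFFFF or 0
        n -= c & mask                            # and/sub: n -= (y != 0) ? c : 0
        x = (y & mask) | (x & ~mask & MASK32)    # x = (y != 0) ? y : x
    return n - x


if __name__ == "__main__":
    for value in (0x00000000, 0x00000001, 0x00F00000, 0x80000000):
        print(f"clz(0x{value:08x}) = {clz_branchless(value)}")
```

The trade-off is visible in the model: every iteration performs all of the work regardless of the input, which removes branches at the cost of more executed instructions.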
The testing procedure is identical to v1-basic.
```bash
cd q1b-uf8/clz/v2-branchless
```
Code Size:
```
$ riscv64-unknown-elf-size clz_v2.o clz_gcc_O0.o clz_gcc_O2.o clz_gcc_O3.o
text data bss dec hex filename
212 0 0 212 d4 clz_v2.o
344 0 0 344 158 clz_gcc_O0.o
108 0 0 108 6c clz_gcc_O2.o
108 0 0 108 6c clz_gcc_O3.o
```
Static instructions:
| Version | Static Instructions |
|---------|---------------------|
| v2-hand | 53 |
| GCC -O0 | 86 |
| GCC -O2 | 27 |
| GCC -O3 | 27 |
### ⭐ Observation: `clz_gcc_O2.dis` vs `clz_gcc_O3.dis`
```diff
diff -u clz_gcc_O2.dis clz_gcc_O3.dis
--- clz_gcc_O2.dis 2025-10-10 22:50:57.562537064 +0800
+++ clz_gcc_O3.dis 2025-10-10 22:50:57.566537061 +0800
@@ -1,5 +1,5 @@
-clz_gcc_O2.o: file format elf32-littleriscv
+clz_gcc_O3.o: file format elf32-littleriscv
Disassembly of section .text.clz:
```
From the diff results above, we can see that the two disassembly results (`.dis`) are almost identical, with the only difference being: the filename changes from `clz_gcc_O2.o` to `clz_gcc_O3.o`.
This indicates that, for the v2 version of `clz.c`, `-O3` finds nothing further to optimize beyond `-O2`.
Runtime Performance (Using Ripes):
`test_v2_hand.elf`:

`test_gcc_O0.elf`:

`test_gcc_O2.elf`:

`test_gcc_O3.elf`:

| ELF Filename | Cycles | Instrs. retired | CPI | IPC | Clock rate |
| -------------------- | ------ | --------------- | ---- | ----- | ---------- |
| `test_v2_hand.elf` | 944 | 835 | 1.13 | 0.885 | 0 Hz |
| `test_gcc_O0.elf` | 1532 | 1159 | 1.32 | 0.757 | 0 Hz |
| `test_gcc_O2.elf` | 536 | 451 | 1.19 | 0.841 | 0 Hz |
| `test_gcc_O3.elf` | 536 | 451 | 1.19 | 0.841 | 0 Hz |
# Problem B: CLZ (v3-table)
> [Commit 958167e](https://github.com/sysprog21/ca2025-quizzes/commit/958167e51b3246d988e8f6393e473a7cca8dda8d)
> [Commit 2828c7c](https://github.com/sysprog21/ca2025-quizzes/commit/2828c7cdb464b7a5000521a994f49cae68a407be)
```asm
.data
.align 2
# 8-bit lookup table: stores leading zeros for each number from 0-255
clz_table_8:
.byte 8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4
.byte 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3
.byte 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
.byte 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
.byte 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
.byte 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
.byte 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
.byte 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.text
.globl clz_v3
.type clz_v3, @function
# a0 = input value & return value
# a1 = extracted byte value (0–255) used for lookup table
# a2 = base address of clz_table_8
# a3 = offset value (0, 8, 16, or 24)
clz_v3:
# Special case: x = 0
beqz a0, .L_return_32
# Load the base address of the lookup table into a2
la a2, clz_table_8
# Check [31:24]
srli a1, a0, 24 # a1 = x >> 24
bnez a1, .L_byte3 # if (byte), offset=0
# Check [23:16]
srli a1, a0, 16 # a1 = x >> 16
andi a1, a1, 0xFF # a1 = (x >> 16) & 0xFF
bnez a1, .L_byte2 # if (byte), offset=8
# Check [15:8]
srli a1, a0, 8 # a1 = x >> 8
andi a1, a1, 0xFF # a1 = (x >> 8) & 0xFF
bnez a1, .L_byte1 # if (byte), offset=16
# Check [7:0]
andi a1, a0, 0xFF # a1 = x & 0xFF
li a3, 24 # offset = 24
j .L_lookup
.L_byte3:
# [31:24], offset = 0
li a3, 0 # offset = 0
j .L_lookup
.L_byte2:
# [23:16], offset = 8
li a3, 8 # offset = 8
j .L_lookup
.L_byte1:
# [15:8], offset = 16
li a3, 16 # offset = 16
.L_lookup:
# result = offset + clz_table_8[byte]
add a2, a2, a1 # a2 = &clz_table_8[byte]
lbu a0, 0(a2) # a0 = clz_table_8[byte]
add a0, a0, a3 # a0 = clz_table_8[byte] + offset
ret
.L_return_32:
# Special case: x = 0
li a0, 32
ret
```
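The table contents and the byte-by-byte search can be cross-checked with a short Python model. This is a sketch under stated assumptions: `build_clz_table_8` regenerates the 256 entries with `int.bit_length()` instead of listing them, and both function names are hypothetical.

```python
def build_clz_table_8() -> list:
    """Leading-zero count of every 8-bit value (entry 0 is 8)."""
    return [8 - value.bit_length() for value in range(256)]


def clz_table_lookup(x: int, table: list) -> int:
    """Model of clz_v3: find the highest non-zero byte, then look it up."""
    x &= 0xFFFFFFFF
    if x == 0:                       # beqz a0, .L_return_32
        return 32
    # (offset, shift) pairs for bytes [31:24], [23:16], [15:8], [7:0]
    for offset, shift in ((0, 24), (8, 16), (16, 8), (24, 0)):
        byte = (x >> shift) & 0xFF
        if byte != 0:
            return offset + table[byte]
    raise AssertionError("unreachable: x != 0 implies some byte is non-zero")


if __name__ == "__main__":
    table = build_clz_table_8()
    print(table[:16])                # first .byte row of clz_table_8
    print(clz_table_lookup(0x00012345, table))
```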
The testing procedure is identical to v1-basic and v2-branchless.
```bash
cd q1b-uf8/clz/v3-table
```
Code Size:
```
$ riscv64-unknown-elf-size clz_v3.o clz_gcc_O0.o clz_gcc_O2.o clz_gcc_O3.o
text data bss dec hex filename
100 256 0 356 164 clz_v3.o
488 0 0 488 1e8 clz_gcc_O0.o
384 0 0 384 180 clz_gcc_O2.o
384 0 0 384 180 clz_gcc_O3.o
```
Static Instructions:
| Version | Static Instructions |
|---------|---------------------|
| v3-hand | 25 |
| GCC -O0 | 58 |
| GCC -O2 | 32 |
| GCC -O3 | 32 |
### ⭐ Observation: `clz_gcc_O2.dis` vs `clz_gcc_O3.dis`
```diff
diff -u clz_gcc_O2.dis clz_gcc_O3.dis
--- clz_gcc_O2.dis 2025-10-11 13:32:58.711310893 +0800
+++ clz_gcc_O3.dis 2025-10-11 13:32:58.715258710 +0800
@@ -1,5 +1,5 @@
-clz_gcc_O2.o: file format elf32-littleriscv
+clz_gcc_O3.o: file format elf32-littleriscv
Disassembly of section .text.clz:
```
Same situation as the v2 version: for the v3 version of `clz.c`, `-O3` produces the same code as `-O2`.
Runtime Performance (Using Ripes):
`test_v3_hand.elf`:

`test_gcc_O0.elf`:

`test_gcc_O2.elf`:

`test_gcc_O3.elf`:

| ELF Filename | Cycles | Instrs. retired | CPI | IPC | Clock rate |
| -------------------- | ------ | --------------- | ---- | ----- | ---------- |
| `test_v3_hand.elf` | 272 | 163 | 1.67 | 0.599 | 0 Hz |
| `test_gcc_O0.elf` | 392 | 271 | 1.45 | 0.691 | 0 Hz |
| `test_gcc_O2.elf` | 272 | 163 | 1.67 | 0.599 | 0 Hz |
| `test_gcc_O3.elf` | 272 | 163 | 1.67 | 0.599 | 0 Hz |
# Analysis
`q1b-uf8/clz/v1-basic/clz_v1.s`:
```asm
clz_v1:
li t0, 32 # n = 32
li t1, 16 # c = 16
# Iteration 1: c = 16
srl t2, a0, t1 # y = x >> 16
beqz t2, .L1_skip # if (y == 0)
sub t0, t0, t1 # n -= 16
mv a0, t2 # x = y
```
Test input: `0x80000000` (`a0 = 0x80000000`)
```
.word 0x80000000 # 10: highest bit -> 0
```
## IF Stage (Instruction Fetch)
The main goal of the IF stage is to fetch the next instruction from memory: the PC (Program Counter) provides the address, the instruction is read, and it is passed on to the next stage (ID).




At PC = `0x000001e8`, the instruction memory fetches `0x006553b3`, which decodes to:
```
srl x7, x10, x6
```
The PC adder outputs `0x000001ec` (next instruction).
The IF/ID register has `enable = green = 1`, `clear = red = 0`, indicating normal pipeline flow.
This stage doesn't update memory because it's only fetching instructions.
Reference: https://www.rose-hulman.edu/class/csse/csse232/pdf/RISCV_Green_Card.pdf


```
srl x7, x10, x6
```
According to the reference card above, `srl` is an R-type instruction. The 32-bit R-type instruction format is:
```
| funct7  | rs2     | rs1     | funct3  | rd     | opcode |
| 31..25  | 24..20  | 19..15  | 14..12  | 11..7  | 6..0   |
```
- `opcode = 0110011`
- `funct3 = 101`
- `funct7 = 0000000`
- `rd = x7 = 7`
- `rs1 = x10 = 10`
- `rs2 = x6 = 6`
| Field | Bit Range | Value (Binary) | Decimal/Note |
| -------- | --------- | -------------- | --------------- |
| `funct7` | 31..25 | `0000000` | SRL |
| `rs2` | 24..20 | `00110` | x6 (=6) |
| `rs1` | 19..15 | `01010` | x10 (=10) |
| `funct3` | 14..12 | `101` | SRL/SRA group |
| `rd` | 11..7 | `00111` | x7 (=7) |
| `opcode` | 6..0 | `0110011` | OP (R-type) |
Combine the above into 32-bit:
```
0000000 00110 01010 101 00111 0110011
= 0000 0000 0110 0101 0101 0011 1011 0011
= 0x006553b3
```
The leading zero can be omitted (`0x6553b3`), but tools usually display all 8 hex digits:
- In Little Endian memory, the 4 bytes would be sequentially: `b3 53 65 00`
- Ripes' "instr" column will display the combined 32-bit `0x006553b3`
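The field packing and the little-endian byte order described above can be reproduced with a few shifts. The helper `encode_r_type` is a hypothetical name used for this sketch; the field positions are those of the R-type format.

```python
import struct


def encode_r_type(funct7: int, rs2: int, rs1: int, funct3: int, rd: int, opcode: int) -> int:
    """Pack the six R-type fields into one 32-bit instruction word."""
    return (
        (funct7 << 25)
        | (rs2 << 20)
        | (rs1 << 15)
        | (funct3 << 12)
        | (rd << 7)
        | opcode
    )


# srl x7, x10, x6
word = encode_r_type(funct7=0b0000000, rs2=6, rs1=10, funct3=0b101, rd=7, opcode=0b0110011)
little_endian = struct.pack("<I", word)   # byte order as stored in memory

print(f"0x{word:08x}")           # 0x006553b3
print(little_endian.hex(" "))    # b3 53 65 00
```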
### Multiplexer: PC Selector

- Bottom input: `0x000001e8` (current PC)
- Top input: Branch or jump target (currently unused)
- Output: `0x000001e8` (still current PC)
This multiplexer determines "the source of the next PC":
- If normal execution: select `PC + 4`
- If encountering branch or jump instruction: select branch target
During this cycle (`srl` instruction) there is no taken branch or jump, so the multiplexer keeps the sequential path selected instead of a branch target, and the pipeline proceeds in order.
### Adder

The figure shows:
* Bottom input: `PC = 0x000001e8`
* Top input: `0x00000004`
* Output: `0x000001ec`
RISC-V instructions are 4 bytes each, so the adder calculates "the address of the next instruction":
```
PC = PC + 4
= 0x000001e8 + 0x4
= 0x000001ec
```
So the next instruction will be fetched from `0x000001ec`.
### Instruction Memory

The figure shows:
- Input address: `addr = 0x000001e8`
- Output instruction: `instr = 0x006553b3`
Instruction Memory's function is to read the corresponding 32-bit instruction content based on the address provided by PC. Here it fetches the machine code of `srl x7, x10, x6`.
### Compressed Decoder Below Instruction Memory

The figure shows:
- Input: `0x006553b3`
- Output: `0x006553b3`
RISC-V supports 16-bit "compressed instructions (RVC)" and 32-bit normal instructions. Therefore, this module will:
- Check if the instruction is in compressed format
- If it's a compressed instruction (16 bit), expand it to a 32-bit instruction here
- If not (like this example is standard 32-bit), pass it through unchanged
Since `0x006553b3` in the figure is already a 32-bit instruction, the decoder doesn't change the content, and input equals output.
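The length check itself is simple: in the RISC-V encoding, a standard 32-bit instruction has its two lowest bits set to `11`, while a 16-bit compressed instruction has any other value there. A minimal Python sketch of that test follows (the function name is hypothetical, and this models only the length check, not the expansion logic inside Ripes):

```python
def is_compressed(instr: int) -> bool:
    """True if the lowest two bits mark a 16-bit RVC instruction."""
    return (instr & 0b11) != 0b11


print(is_compressed(0x006553B3))  # srl x7, x10, x6 -> False (passed through unchanged)
print(is_compressed(0x0001))      # c.nop           -> True  (would be expanded)
```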
### IF/ID Pipeline Register

- enable = green (1) → allow data to proceed to the next stage
- clear = red (0) → no flush (meaning no branch or exception)
- IF/ID Register buffers:
- Instruction content (`0x006553b3`)
- PC value (`0x000001e8`)
## ID Stage (Instruction Decode)


ID Stage main tasks:
- Decode instruction (Control Unit generates control signals)
- Read source register values (Register File)
- Generate immediate value (Immediate Generator)
- Pass results to the next stage IDEX pipeline register
### Decode Module (Control Unit)

- Decode Module identifies this instruction as SRL (Shift Right Logical)
- `opcode = 0110011` (`0x33`, R-type)
- `rs1 = 0x0a` (`x10 = 10`)
- `rs2 = 0x06` (`x6 = 6`)
- `rd = 0x07` (`x7 = 7`)
- In ID stage, Control Unit generates control signals based on opcode / funct3 / funct7:
- `RegWrite = 1` (because result will be written back to x7)
- `ALUSrc = 0` (because it's a register-to-register operation, not immediate value)
- `MemRead = 0`
- `MemWrite = 0`
- `Branch = 0`
- `ALUOp = "SRL"`
### Register File

Instruction `srl x7, x10, x6`:
- R1 idx: `rs1 = 0x0a` (`x10 = 10`)
- R2 idx: `rs2 = 0x06` (`x6 = 6`)
Register File automatically reads the two corresponding source registers `rs1` and `rs2` based on decode results:

- Reg 1: 0x80000000
- Reg 2: 0x1000007d
These values will be sent to the ALU input multiplexer.
### Immediate Generator

Value in the figure: `0xdeadbeef`
- This is Ripes' default display value (because SRL is R-type and doesn't use immediate value)
- Actually, this module won't be used for R-type instructions
- For I-type (e.g., `addi`), this would generate a sign-extended immediate value
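To illustrate what the Immediate Generator would produce for an I-type instruction, here is a small Python sketch that extracts and sign-extends the 12-bit immediate. It uses `li t1, 16` (machine code `0x01000313`, i.e. `addi x6, x0, 16`) from the disassembly as input; the function names are hypothetical.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def i_type_immediate(instr: int) -> int:
    """Immediate of an I-type instruction: bits 31..20, sign-extended."""
    return sign_extend((instr & 0xFFFFFFFF) >> 20, 12)


print(i_type_immediate(0x01000313))  # addi x6, x0, 16  -> 16
print(i_type_immediate(0xFFF00293))  # addi x5, x0, -1  -> -1
```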
### ID/EX Pipeline Register

The figure shows:
- `rs1 = 0x80000000`
- `rs2 = 0x1000007d`
ID/EX Pipeline Register passes decoded results to the next stage (EX):
- `enable = green = 1` → pipeline proceeds normally
- `clear = red = 0` → no flush
## EX Stage (Execute)


EX Stage function:
- Goal: ALU (Arithmetic Logic Unit) performs actual operations (shifts, addition/subtraction, AND/OR, etc.) and determines whether to branch
- Input source: From ID/EX Pipeline Register (includes Register values and Control Signals)
- Output result: Sent to EX/MEM Pipeline Register (for subsequent MEM or WB use)
### ID/EX Pipeline Register

The figure shows:
- `rs1 = 0x80000000`
- `rs2 = 0x1000007d`
- `PC = 0x000001e8`
- Control signals:
- `enable = green = 1`
- `clear = red = 0`
ID/EX Pipeline Register is the data handoff point from ID → EX, it buffers:
- Register input values (`rs1`, `rs2`)
- Control signals corresponding to the instruction (such as `RegWrite`, `ALUOp`, `Branch`, etc.)
- Current PC (for branch calculation)
- `enable = 1` indicates pipeline proceeds normally
- `clear = 0` indicates no flush (non-jump instruction stage)
### ALU


```
srl x7, x10, x6
```
Its function is:
```
x7 = x10 >> (x6[4:0])
= x10 >> (x6 & 0x1F)
```
In the official RV32I specification:
> SLL, SRL, and SRA perform logical left, logical right, and arithmetic right shifts on the value in register rs1 by the shift amount held in the lower 5 bits of register rs2.
> [The RISC-V Instruction Set Manual, Volume I: Unprivileged ISA, §2.4 Integer Computational Instructions]
Meaning: even if `rs2` contains a large value, only its lower 5 bits are used, because in RV32I the valid shift range for a 32-bit integer is 0 to 31. That is 32 possibilities, which 5 bits are sufficient to represent.
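Values in the figure:
- `Op1 = 0x80000000`
  - `rs1` → data to be shifted
- `Op2 = 0x00000010`
  - Only the lower 5 bits are used as shift amount
  - With the stale register-file value read in ID, the shift amount would be `x6[4:0] = 0x1000007d & 0x1F = 0x1d` (29)
- `Res = 0x00008000`
  - With the stale `x6`, the result would be `0x80000000 >> 29 = 0x00000004`
  - However, `Res` shows `0x00008000` instead, because there is a **data hazard** on `x6` (`t1`) that is resolved by forwarding: `Op2` carries the forwarded value `0x00000010` rather than the stale register-file value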
```
clz_v1:
li t0, 32 # n = 32
li t1, 16 # c = 16
# Iteration 1: c = 16
srl t2, a0, t1 # y = x >> 16
beqz t2, .L1_skip # if (y == 0)
sub t0, t0, t1 # n -= 16
mv a0, t2 # x = y
```
| Cycle | 1 | 2 | 3 | 4 | 5 | 6 |
|:---------------- |:---:|:---:|:---:|:---:|:---:|:---:|
| `li t1, 16` | IF | ID | EX | MEM | WB | |
| `srl t2, a0, t1` | | IF | ID | EX | MEM | WB |
- Hazard type occurring here: **RAW (Read After Write)**
- The `t1` that `srl` needs to read is precisely the destination register that the previous `li` (i.e., `addi t1, x0, 16`) is writing back to
- **Forwarding path:** From `li` EX/MEM → `srl` ALU
- When the pipeline detects that `srl` needs to use the result of `t1`, it directly forwards the execution result of `li` from the output of the EX/MEM Pipeline Register back to the ALU input, without waiting for `li` to complete writeback (WB)
- This way, `srl` can get the correct `t1 = 16 = 0x00000010` in the EX stage of cycle 4,
avoiding pipeline stall and maintaining correct execution
- That is: `x7 = x10 >> (x6 & 0x1F) = 0x80000000 >> (0x00000010 & 0x1F) = 0x80000000 >> 16 = 0x00008000`
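The effect of forwarding on this particular `srl` can be mimicked with a simplified Python model. It is a sketch under stated assumptions: only the EX/MEM forwarding path is modelled (a real forwarding unit also considers MEM/WB), and all names are hypothetical rather than taken from Ripes.

```python
def srl32(value: int, shift_register: int) -> int:
    """RV32I srl: logical right shift by the low 5 bits of the shift register."""
    return (value & 0xFFFFFFFF) >> (shift_register & 0x1F)


def select_operand(reg_file_value: int, src_idx: int,
                   exmem_rd: int, exmem_reg_write: bool, exmem_result: int) -> int:
    """Pick the forwarded EX/MEM result when it targets the source register."""
    if exmem_reg_write and exmem_rd != 0 and exmem_rd == src_idx:
        return exmem_result
    return reg_file_value


x10 = 0x80000000          # value of a0
stale_x6 = 0x1000007D     # old t1 value read from the register file in ID

# `li t1, 16` is in MEM, so EX/MEM holds rd = x6 and result = 16.
op2 = select_operand(stale_x6, src_idx=6, exmem_rd=6,
                     exmem_reg_write=True, exmem_result=16)

print(hex(srl32(x10, stale_x6)))  # without forwarding: 0x4
print(hex(srl32(x10, op2)))       # with forwarding:    0x8000
```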
### ALU Multiplexers

MUX with three green dots in the figure:
- Left MUX: Select Op1 = rs1 (non-immediate value)
- Right MUX: Select Op2 = rs2 (non-immediate value)
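**Reason:**
- Control Signal `ALUSrc = 0` (because this `srl` instruction is R-type, doesn't use immediate value)
- Therefore, both ALU inputs are register operands rather than an immediate:
  - `Op1` ← `rs1 (x10)`
  - `Op2` ← `rs2 (x6)`, which in this cycle arrives through the forwarding path rather than from the stale register-file value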
### Branch Module

The figure shows: `Branch taken = red = 0`
This instruction is `srl`, not a branch instruction, therefore:
- `Branch = 0`
- `Branch taken` signal is red (false)
- For branch instructions like `beq` or `bne`, this module would compare the two register operands to decide whether the branch is taken, while the ALU computes the branch target (`PC + imm`)
### EX/MEM Pipeline Register

EX/MEM Pipeline Register will:
- Buffer ALU result (`0x00008000`)
- Buffer Control Signal (such as whether to write back to register)
- Buffer destination register number (`rd = x7`)
## MEM Stage (Memory Access)


### EX/MEM Pipeline Register
- This stage's input comes from the previous cycle's EX stage
- `ALU Res = 0x00008000`: This is the ALU operation result from the previous stage (EX)
- For a branch instruction (e.g., `beq`) in the MEM stage, this represents the branch target address (`PC + imm`)
- For a general arithmetic instruction (e.g., `srl` in the figure) in the MEM stage, it's just passed through to the next stage (WB), doesn't touch data memory
- ALU operation result is buffered in the EX/MEM Pipeline Register, waiting for WB stage to write to x7
### Data Memory
Although the `srl` instruction is in the MEM stage, `srl` is not a Load/Store instruction, therefore:
- Data memory is not activated
- `Wr en = 0` (red)
- `MemRead = 0`
- `MemWrite = 0`
Values displayed in memory:
- `Addr. = 0x00008000`
- `Data in = 0x00000010`
- `Read out = 0x00000000`
These are all don't care values (just signals flowing in the pipeline, not actual access operations).
## WB Stage (Write Back)


In the WB (Write Back) stage, the execution result of instruction `srl x7, x10, x6` is ready to be written back to the register file.
### MEM/WB Pipeline Register
- This stage's input is passed from the MEM stage
- `ALU result = 0x00008000` (obtained from `srl x7, x10, x6`)
- Buffered control signals (such as `RegWrite`) are also passed from the MEM stage
- The main function of this register is:
- Buffer the value to be written back
- Buffer the target register number (`rd`) to write back to
- Control signal, determines whether to actually perform writeback
### Write Back Multiplexer

- WB stage must determine the source of data to write back:
- If Load instruction → write back the value read from Data Memory (`MemRead=1`)
- If general ALU instruction (e.g., the `srl` instruction currently in WB) → write back from ALU result (`MemRead=0`)
- In the figure:
- Bottom input (from Data memory Read out) = `0x00000000`
- Top input (from ALU result) = `0x00008000`
- Since this cycle is an R-type instruction (SRL), the writeback source is ALU result → `0x00008000`
### Write Back Register File

- According to the information in MEM/WB Pipeline Register:
- `rd = x7`
- `Write data = 0x00008000`
- `RegWrite = 1` (allow write)
Therefore, the next cycle will write `0x00008000` to Register x7, as shown in the figure below:
