# 2025 Computer Architecture homework 2
## Prerequisites
* Ubuntu Linux 24.04 LTS or later, or macOS 15+
* An rv32emu environment; refer to [Lab2: RISC-V Instruction Set Simulator and System Emulator](https://hackmd.io/@sysprog/Sko2Ja5pel)
### WSL (Windows Subsystem for Linux)
WSL is my preferred choice over traditional virtual machines (like VirtualBox) for development on Windows. Here's a simple breakdown of why:
* Lightweight & Fast: WSL starts in seconds. It uses far fewer system resources (RAM and CPU) because it's not booting a full, separate operating system.
* Excellent Performance: WSL 2 runs a real Linux kernel, offering near-native file system performance (for files stored inside the Linux file system) and full system call compatibility.
In short, WSL gives me the power of a native Linux command line without the overhead or isolation of a traditional virtual machine.
## Differences between rv32emu and Ripes
When working with rv32emu, it is essential to understand that it is not a bare-metal hardware simulator (like QEMU in system-mode). Instead, rv32emu is a Linux user-space emulator.
This fundamental design choice dictates how ecall (Environment Call) instructions are handled. You cannot use ecall for arbitrary, custom traps as you would in a true bare-metal environment.
For example:
* exit
Ripes assembly
```
li a7, 10    # Ripes selects the service through a7; 10 = exit
ecall
```
rv32emu
```
li a7, 93
li a0, 0
ecall
```
* write
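Ripes assembly
```
li a7, 4     # print string: a0 = address of a NUL-terminated string
```
rv32emu
```
li a7, 64    # write: a0 = file descriptor, a1 = buffer, a2 = length
```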
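The mapping between the two conventions can be summarized as a small lookup, sketched here in C. The Ripes numbers are the ones used in the assembly listings of this write-up (1, 4, 10, 11), and 64 and 93 are the Linux RISC-V numbers for `write` and `exit`. The helper name `linux_syscall_for` and the enum names are hypothetical, introduced only for illustration.

```c
/* Ripes environment-call numbers (placed in a7), as used in this write-up. */
enum ripes_ecall {
    RIPES_PRINT_INT    = 1,
    RIPES_PRINT_STRING = 4,
    RIPES_EXIT         = 10,
    RIPES_PRINT_CHAR   = 11
};

/* Linux RISC-V system-call numbers (also placed in a7). */
enum linux_syscall {
    LINUX_SYS_WRITE = 64,
    LINUX_SYS_EXIT  = 93
};

/*
 * Hypothetical helper: which Linux system call replaces a given Ripes ecall.
 * Returns -1 for services that are not covered by this table.
 */
static int linux_syscall_for(int ripes_number)
{
    switch (ripes_number) {
    case RIPES_PRINT_INT:    /* convert to text first, then write */
    case RIPES_PRINT_STRING: /* compute the length first, then write */
    case RIPES_PRINT_CHAR:   /* write a 1-byte buffer */
        return LINUX_SYS_WRITE;
    case RIPES_EXIT:
        return LINUX_SYS_EXIT;
    default:
        return -1;
    }
}
```

All three printing services collapse into `write`, which is why the rv32emu port needs extra helper code around each print.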
## C code without stdio.h
>When you are programming in a bare-metal or minimal environment (like for rv32emu), you often cannot use printf. This is not because printf is a magical part of the C language, but because it is a complex library function.
```c
#define printstr(ptr, length)                   \
    do {                                        \
        asm volatile(                           \
            "add a7, x0, 0x40;"                 \
            "add a0, x0, 0x1;" /* stdout */     \
            "add a1, x0, %0;"                   \
            "mv a2, %1;" /* length character */ \
            "ecall;"                            \
            :                                   \
            : "r"(ptr), "r"(length)             \
            : "a0", "a1", "a2", "a7");          \
    } while (0)

#define TEST_OUTPUT(msg, length) printstr(msg, length)

#define TEST_LOGGER(msg)                     \
    {                                        \
        char _msg[] = msg;                   \
        TEST_OUTPUT(_msg, sizeof(_msg) - 1); \
    }
```
This set of C macros (TEST_LOGGER, TEST_OUTPUT, printstr) provides a custom, low-level way to print strings in an environment where the standard C library (stdio.h) is not available.
The primary function is to bypass printf and directly execute a SYS_write (system call 64) via inline assembly.
This is a common technique for bare-metal programming or in minimal emulators like rv32emu that simulate a Linux kernel's ABI.
TEST_LOGGER is a convenient wrapper that prints a string literal by calling the write system call directly (the argument must be a literal because it initializes the `_msg` array). This works in rv32emu because rv32emu is a Linux user-space emulator that understands and executes these system calls.
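For comparison, on a hosted Linux or macOS system the same system call is reachable through the POSIX `write()` wrapper. The sketch below is only meant to show the three arguments that the inline assembly loads into a0, a1, and a2; the helper name `log_str` is hypothetical, and it relies on `<unistd.h>` and `<string.h>`, which the minimal rv32emu setup described here deliberately avoids.

```c
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a NUL-terminated string to a file descriptor.
 * write() needs an explicit byte count, so the length is computed first.
 * Returns the number of bytes written, or -1 on error.
 */
static ssize_t log_str(int fd, const char *msg)
{
    return write(fd, msg, strlen(msg));
}
```

Calling `log_str(1, "hello\n")` corresponds to `TEST_LOGGER("hello\n")`: file descriptor 1 is stdout, and the length excludes the terminating NUL, just like `sizeof(_msg) - 1` in the macro.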
## Environment
* Ubuntu 24.04.2 LTS

* [rv32emu](https://github.com/sysprog21/rv32emu)
## Problems
The [main.c](https://github.com/rainbow0212/ca2025-homework2/blob/main/ca2025-homework2/system/playground/main.c) file is used to print the results for all three problems at once, as well as to calculate their respective cycle and instruction counts.
* For quiz2a and hw1 (is_power_of_two), the computation is performed by making external calls to assembly language functions.
* quiz3c, in contrast, is computed using a software implementation in C.
### quiz2a (3-disk Tower of Hanoi)
Original assembly code (runs in Ripes)
```
.text
.globl main
main:
addi x2, x2, -32
sw x8, 0(x2)
sw x9, 4(x2)
sw x18, 8(x2)
sw x19, 12(x2)
sw x20, 16(x2)
li x5, 0x15
sw x5, 20(x2)
sw x5, 24(x2)
sw x5, 28(x2)
sw x0, 20(x2)
sw x0, 24(x2)
sw x0, 28(x2)
addi x8, x0, 1
game_loop:
addi x5, x0, 8
beq x8, x5, finish_game
srli x5, x8, 1
xor x6, x8, x5
addi x7, x8, -1
srli x28, x7, 1
xor x7, x7, x28
xor x5, x6, x7
addi x9, x0, 0
andi x6, x5, 1
bne x6, x0, disk_found
addi x9, x0, 1
andi x6, x5, 2
bne x6, x0, disk_found
addi x9, x0, 2
disk_found:
andi x30, x5, 5
addi x31, x0, 5
beq x30, x31, pattern_match
jal x0, continue_move
pattern_match:
continue_move:
slli x5, x9, 2
addi x5, x5, 20
add x5, x2, x5
lw x18, 0(x5)
bne x9, x0, handle_large
addi x19, x18, 2
addi x6, x0, 3
blt x19, x6, display_move
sub x19, x19, x6
jal x0, display_move
handle_large:
lw x6, 20(x2)
addi x19, x0, 3
sub x19, x19, x18
sub x19, x19, x6
display_move:
la x20, obdata
add x5, x20, x18
lbu x11, 0(x5)
li x6, 0x6F
xor x11, x11, x6
addi x11, x11, -0x12
add x7, x20, x19
lbu x12, 0(x7)
xor x12, x12, x6
addi x12, x12, -0x12
la x10, str1
addi x17, x0, 4
ecall
addi x10, x9, 1
addi x17, x0, 1
ecall
la x10, str2
addi x17, x0, 4
ecall
addi x10, x11, 0
addi x17, x0, 11
ecall
la x10, str3
addi x17, x0, 4
ecall
addi x10, x12, 0
addi x17, x0, 11
ecall
addi x10, x0, 10
addi x17, x0, 11
ecall
slli x5, x9, 2
addi x5, x5, 20
add x5, x2, x5
sw x19, 0(x5)
addi x8, x8, 1
jal x0, game_loop
finish_game:
lw x8, 0(x2)
lw x9, 4(x2)
lw x18, 8(x2)
lw x19, 12(x2)
lw x20, 16(x2)
addi x2, x2, 32
addi x17, x0, 10
ecall
.data
obdata: .byte 0x3c, 0x3b, 0x3a
str1: .asciz "Move Disk "
str2: .asciz " from "
str3: .asciz " to "
```
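The move-selection logic in the listing above can be sketched in C as follows. This is my reading of the assembly, not code from the repository: the type `hanoi_move_t` and the function names are hypothetical, pegs are numbered 0 to 2, and the character decoding through `obdata` is left out.

```c
#include <stdint.h>

#define HANOI_DISKS 3
#define HANOI_MOVES ((1u << HANOI_DISKS) - 1u) /* 7 moves for 3 disks */

typedef struct {
    uint8_t disk; /* 0 = smallest disk */
    uint8_t from; /* source peg, 0..2 */
    uint8_t to;   /* target peg, 0..2 */
} hanoi_move_t;

/* Binary-reflected Gray code of n. */
static uint32_t gray(uint32_t n)
{
    return n ^ (n >> 1);
}

/*
 * Fill moves[0..6] with the 3-disk solution and return the move count.
 * peg[] holds the current peg of each disk; all disks start on peg 0.
 */
static unsigned hanoi3_solve(hanoi_move_t moves[HANOI_MOVES])
{
    uint8_t peg[HANOI_DISKS] = {0, 0, 0};

    for (uint32_t n = 1; n <= HANOI_MOVES; n++) {
        /* Exactly one bit differs between consecutive Gray codes;
         * its position is the disk to move. */
        uint32_t changed = gray(n) ^ gray(n - 1);
        uint8_t disk = (changed & 1u) ? 0 : (changed & 2u) ? 1 : 2;
        uint8_t from = peg[disk];
        uint8_t to;

        if (disk == 0)
            to = (uint8_t)((from + 2) % 3);    /* smallest disk cycles */
        else
            to = (uint8_t)(3 - from - peg[0]); /* peg that is neither `from`
                                                  nor the smallest disk's peg */
        moves[n - 1] = (hanoi_move_t){disk, from, to};
        peg[disk] = to;
    }
    return HANOI_MOVES;
}
```

Because the peg indices 0, 1, and 2 sum to 3, `3 - from - peg[0]` picks the one remaining peg, which is what the `handle_large` branch computes with two `sub` instructions.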
[assembly code on rv32emu](https://github.com/rainbow0212/ca2025-homework2/blob/main/ca2025-homework2/system/playground/quiz2a.S)
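Unlike the Ripes print-string ecall, which takes a NUL-terminated string, the `write` system call in rv32emu needs an explicit byte count, so an additional function is needed to print a string. This function uses a loop to calculate the string's length and then prints it. A second helper, `print_char`, stores one character on the stack and writes that 1-byte buffer.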
```
print_str:
mv x5, x10
mv x6, x10
# calculate string length
strlen_loop:
lbu x7, 0(x6)
beq x7, x0, strlen_done
addi x6, x6, 1
j strlen_loop
strlen_done:
# calculate length: x6 - x5 (t1 - t0)
sub x12, x6, x5
# write system call
li x17, 0x40
li x10, 1
mv x11, x5
ecall
ret
# print char
# x10 = char's ASCII code
print_char:
addi x2, x2, -4
sb x10, 0(x2)
# write system call
li x17, 0x40
li x10, 1
mv x11, x2
li x12, 1
ecall
addi x2, x2, 4
ret
```
The Gray code calculation part remains the same as in the original code.
As mentioned before, the printing portion needs to be converted separately:
```
la x10, str1
jal x1, print_str
# print disk number (x9+1)
lw x5, 4(x2)
addi x10, x5, 1
addi x10, x10, 0x30
jal x1, print_char
# print " from "
la x10, str2
jal x1, print_str
# print src peg (x11)
lw x10, 8(x2)
jal x1, print_char
# print " to "
la x10, str3
jal x1, print_str
# print target peg (x12)
lw x10, 12(x2)
jal x1, print_char
# newline
la x10, newline
jal x1, print_str
```
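The `addi x10, x10, 0x30` step above turns a single-digit disk number into its ASCII character. Values with more than one digit (a cycle count, for example) need a conversion loop before they can be passed to `write`. A minimal sketch in C, assuming a hypothetical helper named `u32_to_dec`; note that `%` and `/` need the M extension or a software division routine on RV32.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert v to decimal ASCII in buf (no terminator).
 * buf must hold at least 10 bytes, the length of the largest 32-bit value.
 * Returns the number of characters written.
 */
static size_t u32_to_dec(uint32_t v, char *buf)
{
    char tmp[10];
    size_t n = 0;

    /* Produce digits least-significant first; '0' is 0x30. */
    do {
        tmp[n++] = (char)('0' + v % 10u);
        v /= 10u;
    } while (v != 0);

    /* Reverse into the caller's buffer. */
    for (size_t i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];
    return n;
}
```

The returned length can be handed straight to the `printstr(ptr, length)` macro shown earlier.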
Here are the results of the performance analysis

I used loop unrolling to reduce the overhead of the string length calculation, since that part is implemented with a loop that spends one pointer increment and one backward jump on every character.
[quiz2a with loop unrolling](https://github.com/rainbow0212/ca2025-homework2/blob/main/ca2025-homework2/system/playground/quiz2a_loop_unroll.S)
```
print_str:
mv x5, x10
strlen_loop_4x:
lbu x6, 0(x10)
beqz x6, strlen_done
lbu x7, 1(x10)
beqz x7, found_at_1
lbu x28, 2(x10)
beqz x28, found_at_2
lbu x29, 3(x10)
beqz x29, found_at_3
addi x10, x10, 4
j strlen_loop_4x
found_at_1:
addi x10, x10, 1
j strlen_done
found_at_2:
addi x10, x10, 2
j strlen_done
found_at_3:
addi x10, x10, 3
strlen_done:
sub x12, x10, x5
li x17, 0x40
li x10, 1
mv x11, x5
ecall
ret
# print char
# x10 = char's ASCII code
print_char:
addi x2, x2, -4
sb x10, 0(x2)
li x17, 0x40
li x10, 1
mv x11, x2
li x12, 1
ecall
addi x2, x2, 4
ret
# print string
# a0 = str pointer a1 = length
print_str_n:
mv x12, x11
mv x11, x10
li x17, 0x40
li x10, 1
ecall
ret
```

A significant reduction in the cycle count can be observed.
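The same transformation can be sketched in C to make the saving easier to see. Both function names are hypothetical; the second one mirrors the structure of `strlen_loop_4x`, where the pointer update and the backward jump are paid once per four characters instead of once per character.

```c
#include <stddef.h>

/* Baseline: one load, one test, one increment, and one jump per character. */
static size_t strlen_simple(const char *s)
{
    const char *p = s;
    while (*p)
        p++;
    return (size_t)(p - s);
}

/* 4x unrolled: four loads and tests per pointer update and backward jump.
 * Each byte is read only after the previous one was non-zero, so the
 * function never reads past the terminating NUL. */
static size_t strlen_unrolled4(const char *s)
{
    const char *p = s;
    for (;;) {
        if (!p[0]) return (size_t)(p - s);
        if (!p[1]) return (size_t)(p - s) + 1;
        if (!p[2]) return (size_t)(p - s) + 2;
        if (!p[3]) return (size_t)(p - s) + 3;
        p += 4;
    }
}
```

The strings printed by quiz2a are short (at most 10 characters), so the unrolled loop body runs only a few times per call.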
### quiz3c (rsqrt)
Original C code
```c
#include <stdint.h>

/* Count leading zeros with a binary search over the bit positions */
static int clz(uint32_t x) {
    if (!x) return 32;
    int n = 0;
    if (!(x & 0xFFFF0000)) { n += 16; x <<= 16; }
    if (!(x & 0xFF000000)) { n += 8; x <<= 8; }
    if (!(x & 0xF0000000)) { n += 4; x <<= 4; }
    if (!(x & 0xC0000000)) { n += 2; x <<= 2; }
    if (!(x & 0x80000000)) { n += 1; }
    return n;
}

/* 32x32 -> 64-bit multiply by shift-and-add (no hardware multiplier needed) */
static uint64_t mul32(uint32_t a, uint32_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 32; i++) {
        if (b & (1U << i))
            r += (uint64_t)a << i;
    }
    return r;
}

/* rsqrt_table[i] ~ 65536 / sqrt(2^i); uint32_t because 65536 does not fit in uint16_t */
static const uint32_t rsqrt_table[32] = {
    65536, 46341, 32768, 23170, 16384,
    11585, 8192, 5793, 4096, 2896,
    2048, 1448, 1024, 724, 512,
    362, 256, 181, 128, 90,
    64, 45, 32, 23, 16,
    11, 8, 6, 4, 3,
    2, 1
};

uint32_t fast_rsqrt(uint32_t x) {
    if (x == 0) return 0xFFFFFFFF;
    if (x == 1) return 65536;
    int exp = 31 - clz(x);
    uint32_t y = rsqrt_table[exp];
    /* Linear interpolation between two table entries */
    if (x > (1u << exp)) {
        uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0;
        uint32_t delta = y - y_next;
        uint32_t frac = (uint32_t) ((((uint64_t)x - (1UL << exp)) << 16) >> exp);
        y -= (uint32_t) ((delta * frac) >> 16);
    }
    /* Two Newton-Raphson refinement steps */
    for (int iter = 0; iter < 2; iter++) {
        uint32_t y2 = (uint32_t)mul32(y, y);
        uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16);
        y = (uint32_t)(mul32(y, (3u << 16) - xy2) >> 17);
    }
    return y;
}
```
You can run it directly from main.c.
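The refinement loop in `fast_rsqrt` is a Newton-Raphson step, y_next = y * (3 - x * y * y) / 2, carried out in Q16 fixed point (65536 represents 1.0). The sketch below isolates one step so the shifts are easier to follow. It is an illustration only: the function name is hypothetical, and it uses the compiler's 64-bit multiply in place of `mul32` for brevity.

```c
#include <stdint.h>

/*
 * One Newton-Raphson step for y ~ 65536 / sqrt(x), with y in Q16.
 * Assumes x >= 2 and y < 65536 so that y * y fits in 32 bits, which is
 * the situation in fast_rsqrt() after its early returns.
 */
static uint32_t rsqrt_newton_step(uint32_t x, uint32_t y)
{
    uint32_t y2  = (uint32_t)((uint64_t)y * y);          /* Q16 * Q16 = Q32 */
    uint32_t xy2 = (uint32_t)(((uint64_t)x * y2) >> 16); /* back to Q16 */
    /* Q16 * Q16 = Q32; >> 16 returns to Q16, one more bit divides by 2 */
    return (uint32_t)(((uint64_t)y * ((3u << 16) - xy2)) >> 17);
}
```

For x = 4 the exact answer is 0.5, or 32768 in Q16; starting from that value the step returns 32768 again, and starting from a lower guess such as 30000 it moves toward 32768.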
Here are the results of the performance analysis

### [hw1 (is_power_of_two)](https://github.com/rainbow0212/ca2025-homework2/blob/main/ca2025-homework2/system/playground/is_power_of_two.S)
```c
typedef struct {
    int input;
    bool expected;
} power_of_two_test_case_t;

static void test_power_of_two(void)
{
    static const power_of_two_test_case_t power_of_two_tests[] = {
        {0x1, true},         // 1
        {0x10, true},        // 16
        {0x3, false},        // 3
        {0xFFFFFFFF, false}, // -1
        {0x0, false},        // 0
        {0x800, true}        // 2048
    };
    bool all_passed = true;
    int num_tests = sizeof(power_of_two_tests) / sizeof(power_of_two_tests[0]);
    for (int i = 0; i < num_tests; i++) {
        power_of_two_test_case_t test = power_of_two_tests[i];
        bool actual_result = is_power_of_two(test.input);
        if (actual_result != test.expected) {
            all_passed = false;
            break;
        }
    }
    TEST_LOGGER("is_power_of_two completed!\n");
}
```
The test data is embedded in main.c, and the code only checks the results for correctness; it does not print the input numbers or the results. This is done to reduce printing overhead, allowing for a more accurate performance measurement.
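The expected values in the test table can be cross-checked against a C reference. The sketch below is an assumption on my part and not the assembly from the repository: it uses the common `x & (x - 1)` trick, and the name `is_power_of_two_ref` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Reference implementation (an assumption, not the assembly version):
 * a positive value is a power of two when exactly one bit is set, so
 * clearing the lowest set bit with x & (x - 1) must leave zero.
 */
static bool is_power_of_two_ref(int32_t x)
{
    if (x <= 0)
        return false;
    uint32_t u = (uint32_t)x;
    return (u & (u - 1u)) == 0;
}
```

Zero and negative inputs are rejected up front, which matches the `false` entries for `0x0` and `0xFFFFFFFF` (-1) in the table.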

## Reference
:::warning
This assignment was completed with assistance from [Claude](https://claude.ai/new) for grammar checking, translation and initial brainstorming. All final analysis and conclusions are my own.
:::
* [Quiz2 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz2-sol#Problem-A)
* [Quiz3 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz3-sol#Problem-C)
* [Assignment2: Complete Applications](https://hackmd.io/@sysprog/2025-arch-homework2)
* [Lab2: RISC-V Instruction Set Simulator and System Emulator](https://hackmd.io/@sysprog/Sko2Ja5pel)
* [rv32emu](https://github.com/sysprog21/rv32emu)