# Assignment 2: Complete Applications
Contributed by <`ryanycs`>
[TOC]
## Bare-metal Programming
### Key Components
#### start.S
```asm
# Startup code for bare metal RISC-V
.section .text._start
.globl _start
.type _start, @function
_start:
# Set up stack pointer
la sp, __stack_top
# Clear BSS
la t0, __bss_start
la t1, __bss_end
1:
bge t0, t1, 2f
sw zero, 0(t0)
addi t0, t0, 4
j 1b
2:
# Call main
call main
# Exit syscall (if main returns)
li a7, 93 # exit syscall number
li a0, 0 # exit code
ecall
# Infinite loop (should never reach here)
3:
j 3b
.size _start, .-_start
# Provide BSS markers if linker script doesn't define them
.weak __bss_start
.weak __bss_end
.weak __stack_top
```
The `start.S` is used to setup the program execute environment:
- Initialize stack pointer, where `__stack_top` is specified in linker script.
- Zero the `.bss` section manually by looping from `__bss_start` to `__bss_end`.
- Call the `main` function.
- When the `main` function retun, perform an `exit` system call.
:::info
##### `.weak` assembly directive
The `.weak` assembly in the startup code here directive is used to prevent undefined reference error if linker script does not provide these symbols.
:::
#### linker.ld
```c
OUTPUT_ARCH( "riscv" )
ENTRY(_start)
SECTIONS
{
. = 0x10000;
.text : {
*(.text._start)
*(.text)
}
.data : { *(.data) }
.bss : {
__bss_start = .;
*(.bss)
__bss_end = .;
}
.stack (NOLOAD) : {
. = ALIGN(16);
. += 4096;
__stack_top = .;
}
}
```
This linker script specify the following code layout:
```
┌───────────────────┐
│.text._start │0x10000
│.text │
├───────────────────┤
│.data │
├───────────────────┤
│__bss_start │
│.bss │
│__bss_end │
├───────────────────┤ ─┐
│.stack │ │
│ │ │4096
│__stack_top │ │
└───────────────────┘ ─┘
```
Therefore, the `start.S` will be place first at the address `0x10000`. This make sure the startup code will be execute first, setting the stack pointer and clear the bss section.
Other text sections like main function are followed the `_start` section. The `.data` section stores initialized global and static variables, while the `.bss` section, located between `__bss_start` and `__bss_end`, holds uninitialized variables that are cleared during startup. Finally, the .stack section reserves 4096 bytes of memory for the program stack.
### Build & Run
After we complete all key components of the bare-metal programming. It time to make these file run on the rv32emu emulator.
To simplify the compile/assemble and link process. We can write a makefile:
:::spoiler Makefile
```shell
include ../../../mk/toolchain.mk
ARCH = -march=rv32izicsr
LINKER_SCRIPT = linker.ld
EMU ?= ../../../build/rv32emu
AFLAGS = -g $(ARCH)
CFLAGS = -g -march=rv32i_zicsr
LDFLAGS = -T $(LINKER_SCRIPT)
EXEC = test.elf
CC = $(CROSS_COMPILE)gcc
AS = $(CROSS_COMPILE)as
LD = $(CROSS_COMPILE)ld
OBJDUMP = $(CROSS_COMPILE)objdump
OBJS = start.o perfcounter.o main.o
.PHONY: all run dump clean
all: $(EXEC)
$(EXEC): $(OBJS) $(LINKER_SCRIPT)
$(LD) $(LDFLAGS) -o $@ $(OBJS)
%.o: %.S
$(AS) $(AFLAGS) $< -o $@
%.o: %.c
$(CC) $(CFLAGS) $< -o $@ -c
run: $(EXEC)
@test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1)
@grep -q "ENABLE_ELF_LOADER=1" ../../../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1)
@grep -q "ENABLE_SYSTEM=1" ../../../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1)
$(EMU) $<
dump: $(EXEC)
$(OBJDUMP) -Ds $< | less
clean:
rm -f $(EXEC) $(OBJS)
```
:::
After creating the Makefile, we can use the following command to build the bare-metal target:
```
make
```
This command first assembles or compiles each source file into an object file, and then links them together to produce the final ELF executable.
To run the program on the rv32emu emulator, we can use the following command:
```
make run
```
This command executes the generated ELF file using the rv32emu emulator.
## Homework 1 program - int32_to_bf16
You can find the [source code](https://github.com/ryanycs/ca2025-quizzes/tree/main/hw2/int32_to_bf16) here.
:::info
The code should be place into `rv32emu/tests/system/` directory.
:::
you can refer [Convert int32 to bfloat16](https://hackmd.io/@ryanycs/arch2025-homework1#Convert-int32-to-bfloat16) to see how this code work.
### Modification between Ripes and rv32emu
Since the system code between [Ripes]() and [rv32emu]() has a different calling convention, when we need to use system code, we need to modify the asm code.
For instance, if we want to print a string where `pass_message` is the address of this string. The corresponding RISC-V program is shown below:
#### Ripes
```asm
li a7, 4 # print_string
la a0, pass_message
ecall
```
#### rv32emu
```asm
li a7, 64 # system call: write
li a0, 1 # fd: stdout
la a1, pass_message # buf
li a2, 18 # count
ecall
```
### Result
This program includes a test function that compares the custom integer-to-bfloat16 conversion with the standard float-to-bfloat16 approach using a list of test integers. If all test value are passed, it will print the following message:
```
All tests passed.
```
## Quiz 2 Problem A - Tower of Hanoi
You can find the [source code](https://github.com/ryanycs/ca2025-quizzes/tree/main/hw2/hanoi) here.
:::info
The code should be place into `rv32emu/tests/system/` directory.
:::
### Result
To execute the program, use the folling command:
```shell
make run
```
You will see the output shown as below:
```
=== RISC-V Assembly ===
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 727
Instructions: 727
=== C Implementation ===
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 10764
Instructions: 10764
```
We can see that the C implementation is about ten times slower than the assembly version.
After that, I tried to use different compilation options to see the performacne of compiler-optimized code.
My first attempt was simply adding flags such as `-O1`, `-O2`, and so on to `CFLAGS` in the Makefile. However, this introduced an unexpected issue: the compiler optimized away the inline-assembly macro, even though it was marked volatile. As a result, the program produced no output at all.
To slove the problem, I rewrote the macro as a function. Specifically, I modify the `printstr(ptr, length)` from:
```c
#define printstr(ptr, length) \
do { \
asm volatile( \
"add a7, x0, 0x40;" \
"add a0, x0, 0x1;" /* stdout */ \
"add a1, x0, %0;" \
"mv a2, %1;" /* length character */ \
"ecall;" \
: \
: "r"(ptr), "r"(length) \
: "a0", "a1", "a2", "a7"); \
} while (0)
```
to
```c
static inline void printstr(char* ptr, int length) {
register const char *a1 asm("a1") = (ptr);
register long a2 asm("a2") = (length);
asm volatile(
"add a7, x0, 0x40;"
"add a0, x0, 0x1;" /* stdout */
"ecall;"
:
: "r"(a1), "r"(a2)
);
}
```
The result of `O1` optimization is:
#### O1
```
=== C Implementation ===
Move Disk 10Movem A to C
Move Disk 10Movem A to B
Move Disk 10Movem C to B
Move Disk 10Movem A to C
Move Disk 10Movem B to A
Move Disk 10Movem B to C
Move Disk 10Movem A to C
Cycles: 7219
Instructions: 7219
```
We can see that this approach introduces a new issue: the expected output format is like `Move Disk x from x to x`. However, the output of the result is `Move Disk 10Movem x to x`. The disk number and string from are gone.
> This is because compiler optimizations may allocate local variables to registers or stack slots in ways that conflict with the inline assembly used for printing. Such behavior can lead to unintended memory overwrites, resulting in incorrect output.
> \- ChatGPT
Therefore, I added the memory clobber into `printstr(ptr, length)`:
```c
static inline void printstr(char* ptr, int length) {
register const char *a1 asm("a1") = (ptr);
register long a2 asm("a2") = (length);
asm volatile(
"add a7, x0, 0x40;"
"add a0, x0, 0x1;" /* stdout */
"ecall;"
:
: "r"(a1), "r"(a2)
: "memory", "a0"
);
}
```
After that, it work fine if using compiler optimization. The experienment result is shown as below:
#### O1
```
=== C Implementation ===
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 3978
Instructions: 3978
```
#### O2
```
=== C Implementation ===
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 4147
Instructions: 4147
```
#### O3
```
=== C Implementation ===
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 4114
Instructions: 4114
```
#### Ofast
```
=== C Implementation ===
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 4114
Instructions: 4114
```
#### Summary
| | asm | C | O1 | O2 | O3 | Ofast |
|--------|-----|-------|------|------|------|-------|
| Cycles | 727 | 10764 | 3978 | 4147 | 4114 | 4114 |
| Instrs | 727 | 10764 | 3978 | 4147 | 4114 | 4114 |
## Quiz 3 Problem C - rsqrt
You can find the [source code](https://github.com/ryanycs/ca2025-quizzes/tree/main/hw2/rsqrt) here.
:::info
The code should be place into `rv32emu/tests/system/` directory.
:::
### Result
Since we use Q16.16 format, the actual result of the ouput need to be devided by $2^{16}$. For instance, `rsqrt(2) = 46341`, indicate `rsqrt(2) = 46341 / 65536 = 0.7071075439`, where the real result is `0.7071067811`.
```
rsqrt(1) = 65536
Cycles: 7733
Instructions: 7733
rsqrt(2) = 46341
Cycles: 11068
Instructions: 11068
rsqrt(4) = 32768
Cycles: 10465
Instructions: 10465
rsqrt(16) = 16384
Cycles: 11662
Instructions: 11662
rsqrt(100) = 6553
Cycles: 13367
Instructions: 13367
rsqrt(4294967295) = 1
Cycles: 18280
Instructions: 18280
```
#### O1
```
rsqrt(1) = 65536
Cycles: 2950
Instructions: 2950
rsqrt(2) = 46341
Cycles: 4336
Instructions: 4336
rsqrt(4) = 32768
Cycles: 4102
Instructions: 4102
rsqrt(16) = 16384
Cycles: 4556
Instructions: 4556
rsqrt(100) = 6553
Cycles: 5230
Instructions: 5230
rsqrt(4294967295) = 1
Cycles: 7131
Instructions: 7131
```
#### O2
```
rsqrt(1) = 65536
Cycles: 3070
Instructions: 3070
rsqrt(2) = 46341
Cycles: 4660
Instructions: 4660
rsqrt(4) = 32768
Cycles: 4543
Instructions: 4543
rsqrt(16) = 16384
Cycles: 5032
Instructions: 5032
rsqrt(100) = 6553
Cycles: 5616
Instructions: 5616
rsqrt(4294967295) = 1
Cycles: 7586
Instructions: 7586
```
#### O3
```
rsqrt(1) = 65536
Cycles: 3070
Instructions: 3070
rsqrt(2) = 46341
Cycles: 4643
Instructions: 4643
rsqrt(4) = 32768
Cycles: 4527
Instructions: 4527
rsqrt(16) = 16384
Cycles: 5016
Instructions: 5016
rsqrt(100) = 6553
Cycles: 5597
Instructions: 5597
rsqrt(4294967295) = 1
Cycles: 7565
Instructions: 7565
```
#### Ofast
```
rsqrt(1) = 65536
Cycles: 3070
Instructions: 3070
rsqrt(2) = 46341
Cycles: 4643
Instructions: 4643
rsqrt(4) = 32768
Cycles: 4527
Instructions: 4527
rsqrt(16) = 16384
Cycles: 5016
Instructions: 5016
rsqrt(100) = 6553
Cycles: 5597
Instructions: 5597
rsqrt(4294967295) = 1
Cycles: 7565
Instructions: 7565
```
#### Summary
We only compare the cycle and instruction counts of `rsqrt(INT_MAX)`
| | C | O1 | O2 | O3 | Ofast |
|--------|-------|------|------|------|-------|
| Cycles | 18280 | 7131 | 7586 | 7565 | 7565 |
| Instrs | 18280 | 7131 | 7586 | 7565 | 7565 |
### Optimize
If we replace `clz(uint32_t x)`:
```c
static inline unsigned clz(uint32_t x)
{
int n = 32, c = 16;
do {
uint32_t y = x >> c;
if (y) {
n -= c;
x = y;
}
c >>= 1;
} while (c);
return n - x;
}
```
with byte-shift version:
```c
static inline unsigned clz(uint32_t x)
{
if (x == 0)
return 32;
uint32_t n = 0;
if ((x >> 16) == 0) { n += 16; x <<= 16; }
if ((x >> 24) == 0) { n += 8; x <<= 8; }
if ((x >> 28) == 0) { n += 4; x <<= 4; }
if ((x >> 30) == 0) { n += 2; x <<= 2; }
if ((x >> 31) == 0) { n += 1; }
return n;
}
```
| | C | O1 | O2 | O3 | Ofast |
|--------|-------|------|------|------|-------|
| Cycles | 18206 | 7099 | 7559 | 7560 | 7560 |
| Instrs | 18206 | 7099 | 7559 | 7560 | 7560 |
We take the `-O1` version as an example, the cycle count is decreases from $7131$ to $7099$.
## Acknowledgement
This assignment was completed with assistance from [Github Copilot](https://github.com/features/copilot) auto completions for code/comments writing (Agent mode does not use). [ChatGPT](https://chatgpt.com/) is used to refine the writing of this document.
## Reference
[Quiz2 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz2-sol)
[Quiz3 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz3-sol)