# Assignment 2: Complete Applications Contributed by <`ryanycs`> [TOC] ## Bare-metal Programming ### Key Components #### start.S ```asm # Startup code for bare metal RISC-V .section .text._start .globl _start .type _start, @function _start: # Set up stack pointer la sp, __stack_top # Clear BSS la t0, __bss_start la t1, __bss_end 1: bge t0, t1, 2f sw zero, 0(t0) addi t0, t0, 4 j 1b 2: # Call main call main # Exit syscall (if main returns) li a7, 93 # exit syscall number li a0, 0 # exit code ecall # Infinite loop (should never reach here) 3: j 3b .size _start, .-_start # Provide BSS markers if linker script doesn't define them .weak __bss_start .weak __bss_end .weak __stack_top ``` The `start.S` is used to setup the program execute environment: - Initialize stack pointer, where `__stack_top` is specified in linker script. - Zero the `.bss` section manually by looping from `__bss_start` to `__bss_end`. - Call the `main` function. - When the `main` function retun, perform an `exit` system call. :::info ##### `.weak` assembly directive The `.weak` assembly in the startup code here directive is used to prevent undefined reference error if linker script does not provide these symbols. ::: #### linker.ld ```c OUTPUT_ARCH( "riscv" ) ENTRY(_start) SECTIONS { . = 0x10000; .text : { *(.text._start) *(.text) } .data : { *(.data) } .bss : { __bss_start = .; *(.bss) __bss_end = .; } .stack (NOLOAD) : { . = ALIGN(16); . += 4096; __stack_top = .; } } ``` This linker script specify the following code layout: ``` ┌───────────────────┐ │.text._start │0x10000 │.text │ ├───────────────────┤ │.data │ ├───────────────────┤ │__bss_start │ │.bss │ │__bss_end │ ├───────────────────┤ ─┐ │.stack │ │ │ │ │4096 │__stack_top │ │ └───────────────────┘ ─┘ ``` Therefore, the `start.S` will be place first at the address `0x10000`. This make sure the startup code will be execute first, setting the stack pointer and clear the bss section. Other text sections like main function are followed the `_start` section. The `.data` section stores initialized global and static variables, while the `.bss` section, located between `__bss_start` and `__bss_end`, holds uninitialized variables that are cleared during startup. Finally, the .stack section reserves 4096 bytes of memory for the program stack. ### Build & Run After we complete all key components of the bare-metal programming. It time to make these file run on the rv32emu emulator. To simplify the compile/assemble and link process. We can write a makefile: :::spoiler Makefile ```shell include ../../../mk/toolchain.mk ARCH = -march=rv32izicsr LINKER_SCRIPT = linker.ld EMU ?= ../../../build/rv32emu AFLAGS = -g $(ARCH) CFLAGS = -g -march=rv32i_zicsr LDFLAGS = -T $(LINKER_SCRIPT) EXEC = test.elf CC = $(CROSS_COMPILE)gcc AS = $(CROSS_COMPILE)as LD = $(CROSS_COMPILE)ld OBJDUMP = $(CROSS_COMPILE)objdump OBJS = start.o perfcounter.o main.o .PHONY: all run dump clean all: $(EXEC) $(EXEC): $(OBJS) $(LINKER_SCRIPT) $(LD) $(LDFLAGS) -o $@ $(OBJS) %.o: %.S $(AS) $(AFLAGS) $< -o $@ %.o: %.c $(CC) $(CFLAGS) $< -o $@ -c run: $(EXEC) @test -f $(EMU) || (echo "Error: $(EMU) not found" && exit 1) @grep -q "ENABLE_ELF_LOADER=1" ../../../build/.config || (echo "Error: ENABLE_ELF_LOADER=1 not set" && exit 1) @grep -q "ENABLE_SYSTEM=1" ../../../build/.config || (echo "Error: ENABLE_SYSTEM=1 not set" && exit 1) $(EMU) $< dump: $(EXEC) $(OBJDUMP) -Ds $< | less clean: rm -f $(EXEC) $(OBJS) ``` ::: After creating the Makefile, we can use the following command to build the bare-metal target: ``` make ``` This command first assembles or compiles each source file into an object file, and then links them together to produce the final ELF executable. To run the program on the rv32emu emulator, we can use the following command: ``` make run ``` This command executes the generated ELF file using the rv32emu emulator. ## Homework 1 program - int32_to_bf16 You can find the [source code](https://github.com/ryanycs/ca2025-quizzes/tree/main/hw2/int32_to_bf16) here. :::info The code should be place into `rv32emu/tests/system/` directory. ::: you can refer [Convert int32 to bfloat16](https://hackmd.io/@ryanycs/arch2025-homework1#Convert-int32-to-bfloat16) to see how this code work. ### Modification between Ripes and rv32emu Since the system code between [Ripes]() and [rv32emu]() has a different calling convention, when we need to use system code, we need to modify the asm code. For instance, if we want to print a string where `pass_message` is the address of this string. The corresponding RISC-V program is shown below: #### Ripes ```asm li a7, 4 # print_string la a0, pass_message ecall ``` #### rv32emu ```asm li a7, 64 # system call: write li a0, 1 # fd: stdout la a1, pass_message # buf li a2, 18 # count ecall ``` ### Result This program includes a test function that compares the custom integer-to-bfloat16 conversion with the standard float-to-bfloat16 approach using a list of test integers. If all test value are passed, it will print the following message: ``` All tests passed. ``` ## Quiz 2 Problem A - Tower of Hanoi You can find the [source code](https://github.com/ryanycs/ca2025-quizzes/tree/main/hw2/hanoi) here. :::info The code should be place into `rv32emu/tests/system/` directory. ::: ### Result To execute the program, use the folling command: ```shell make run ``` You will see the output shown as below: ``` === RISC-V Assembly === Move Disk 1 from A to C Move Disk 2 from A to B Move Disk 1 from C to B Move Disk 3 from A to C Move Disk 1 from B to A Move Disk 2 from B to C Move Disk 1 from A to C Cycles: 727 Instructions: 727 === C Implementation === Move Disk 1 from A to C Move Disk 2 from A to B Move Disk 1 from C to B Move Disk 3 from A to C Move Disk 1 from B to A Move Disk 2 from B to C Move Disk 1 from A to C Cycles: 10764 Instructions: 10764 ``` We can see that the C implementation is about ten times slower than the assembly version. After that, I tried to use different compilation options to see the performacne of compiler-optimized code. My first attempt was simply adding flags such as `-O1`, `-O2`, and so on to `CFLAGS` in the Makefile. However, this introduced an unexpected issue: the compiler optimized away the inline-assembly macro, even though it was marked volatile. As a result, the program produced no output at all. To slove the problem, I rewrote the macro as a function. Specifically, I modify the `printstr(ptr, length)` from: ```c #define printstr(ptr, length) \ do { \ asm volatile( \ "add a7, x0, 0x40;" \ "add a0, x0, 0x1;" /* stdout */ \ "add a1, x0, %0;" \ "mv a2, %1;" /* length character */ \ "ecall;" \ : \ : "r"(ptr), "r"(length) \ : "a0", "a1", "a2", "a7"); \ } while (0) ``` to ```c static inline void printstr(char* ptr, int length) { register const char *a1 asm("a1") = (ptr); register long a2 asm("a2") = (length); asm volatile( "add a7, x0, 0x40;" "add a0, x0, 0x1;" /* stdout */ "ecall;" : : "r"(a1), "r"(a2) ); } ``` The result of `O1` optimization is: #### O1 ``` === C Implementation === Move Disk 10Movem A to C Move Disk 10Movem A to B Move Disk 10Movem C to B Move Disk 10Movem A to C Move Disk 10Movem B to A Move Disk 10Movem B to C Move Disk 10Movem A to C Cycles: 7219 Instructions: 7219 ``` We can see that this approach introduces a new issue: the expected output format is like `Move Disk x from x to x`. However, the output of the result is `Move Disk 10Movem x to x`. The disk number and string from are gone. > This is because compiler optimizations may allocate local variables to registers or stack slots in ways that conflict with the inline assembly used for printing. Such behavior can lead to unintended memory overwrites, resulting in incorrect output. > \- ChatGPT Therefore, I added the memory clobber into `printstr(ptr, length)`: ```c static inline void printstr(char* ptr, int length) { register const char *a1 asm("a1") = (ptr); register long a2 asm("a2") = (length); asm volatile( "add a7, x0, 0x40;" "add a0, x0, 0x1;" /* stdout */ "ecall;" : : "r"(a1), "r"(a2) : "memory", "a0" ); } ``` After that, it work fine if using compiler optimization. The experienment result is shown as below: #### O1 ``` === C Implementation === Move Disk 1 from A to C Move Disk 2 from A to B Move Disk 1 from C to B Move Disk 3 from A to C Move Disk 1 from B to A Move Disk 2 from B to C Move Disk 1 from A to C Cycles: 3978 Instructions: 3978 ``` #### O2 ``` === C Implementation === Move Disk 1 from A to C Move Disk 2 from A to B Move Disk 1 from C to B Move Disk 3 from A to C Move Disk 1 from B to A Move Disk 2 from B to C Move Disk 1 from A to C Cycles: 4147 Instructions: 4147 ``` #### O3 ``` === C Implementation === Move Disk 1 from A to C Move Disk 2 from A to B Move Disk 1 from C to B Move Disk 3 from A to C Move Disk 1 from B to A Move Disk 2 from B to C Move Disk 1 from A to C Cycles: 4114 Instructions: 4114 ``` #### Ofast ``` === C Implementation === Move Disk 1 from A to C Move Disk 2 from A to B Move Disk 1 from C to B Move Disk 3 from A to C Move Disk 1 from B to A Move Disk 2 from B to C Move Disk 1 from A to C Cycles: 4114 Instructions: 4114 ``` #### Summary | | asm | C | O1 | O2 | O3 | Ofast | |--------|-----|-------|------|------|------|-------| | Cycles | 727 | 10764 | 3978 | 4147 | 4114 | 4114 | | Instrs | 727 | 10764 | 3978 | 4147 | 4114 | 4114 | ## Quiz 3 Problem C - rsqrt You can find the [source code](https://github.com/ryanycs/ca2025-quizzes/tree/main/hw2/rsqrt) here. :::info The code should be place into `rv32emu/tests/system/` directory. ::: ### Result Since we use Q16.16 format, the actual result of the ouput need to be devided by $2^{16}$. For instance, `rsqrt(2) = 46341`, indicate `rsqrt(2) = 46341 / 65536 = 0.7071075439`, where the real result is `0.7071067811`. ``` rsqrt(1) = 65536 Cycles: 7733 Instructions: 7733 rsqrt(2) = 46341 Cycles: 11068 Instructions: 11068 rsqrt(4) = 32768 Cycles: 10465 Instructions: 10465 rsqrt(16) = 16384 Cycles: 11662 Instructions: 11662 rsqrt(100) = 6553 Cycles: 13367 Instructions: 13367 rsqrt(4294967295) = 1 Cycles: 18280 Instructions: 18280 ``` #### O1 ``` rsqrt(1) = 65536 Cycles: 2950 Instructions: 2950 rsqrt(2) = 46341 Cycles: 4336 Instructions: 4336 rsqrt(4) = 32768 Cycles: 4102 Instructions: 4102 rsqrt(16) = 16384 Cycles: 4556 Instructions: 4556 rsqrt(100) = 6553 Cycles: 5230 Instructions: 5230 rsqrt(4294967295) = 1 Cycles: 7131 Instructions: 7131 ``` #### O2 ``` rsqrt(1) = 65536 Cycles: 3070 Instructions: 3070 rsqrt(2) = 46341 Cycles: 4660 Instructions: 4660 rsqrt(4) = 32768 Cycles: 4543 Instructions: 4543 rsqrt(16) = 16384 Cycles: 5032 Instructions: 5032 rsqrt(100) = 6553 Cycles: 5616 Instructions: 5616 rsqrt(4294967295) = 1 Cycles: 7586 Instructions: 7586 ``` #### O3 ``` rsqrt(1) = 65536 Cycles: 3070 Instructions: 3070 rsqrt(2) = 46341 Cycles: 4643 Instructions: 4643 rsqrt(4) = 32768 Cycles: 4527 Instructions: 4527 rsqrt(16) = 16384 Cycles: 5016 Instructions: 5016 rsqrt(100) = 6553 Cycles: 5597 Instructions: 5597 rsqrt(4294967295) = 1 Cycles: 7565 Instructions: 7565 ``` #### Ofast ``` rsqrt(1) = 65536 Cycles: 3070 Instructions: 3070 rsqrt(2) = 46341 Cycles: 4643 Instructions: 4643 rsqrt(4) = 32768 Cycles: 4527 Instructions: 4527 rsqrt(16) = 16384 Cycles: 5016 Instructions: 5016 rsqrt(100) = 6553 Cycles: 5597 Instructions: 5597 rsqrt(4294967295) = 1 Cycles: 7565 Instructions: 7565 ``` #### Summary We only compare the cycle and instruction counts of `rsqrt(INT_MAX)` | | C | O1 | O2 | O3 | Ofast | |--------|-------|------|------|------|-------| | Cycles | 18280 | 7131 | 7586 | 7565 | 7565 | | Instrs | 18280 | 7131 | 7586 | 7565 | 7565 | ### Optimize If we replace `clz(uint32_t x)`: ```c static inline unsigned clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return n - x; } ``` with byte-shift version: ```c static inline unsigned clz(uint32_t x) { if (x == 0) return 32; uint32_t n = 0; if ((x >> 16) == 0) { n += 16; x <<= 16; } if ((x >> 24) == 0) { n += 8; x <<= 8; } if ((x >> 28) == 0) { n += 4; x <<= 4; } if ((x >> 30) == 0) { n += 2; x <<= 2; } if ((x >> 31) == 0) { n += 1; } return n; } ``` | | C | O1 | O2 | O3 | Ofast | |--------|-------|------|------|------|-------| | Cycles | 18206 | 7099 | 7559 | 7560 | 7560 | | Instrs | 18206 | 7099 | 7559 | 7560 | 7560 | We take the `-O1` version as an example, the cycle count is decreases from $7131$ to $7099$. ## Acknowledgement This assignment was completed with assistance from [Github Copilot](https://github.com/features/copilot) auto completions for code/comments writing (Agent mode does not use). [ChatGPT](https://chatgpt.com/) is used to refine the writing of this document. ## Reference [Quiz2 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz2-sol) [Quiz3 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz3-sol)