[my github](https://github.com/Stanley0915/assignment_2/blob/eedc9bd17a184335ba1f00e862083ff6d98a3241/Q2problemA.S)

## AI Collaboration (Transparency)

---

# HW2-1

## Part 1: Setting up the Bare-Metal Environment

My first step was to set up the rv32emu bare-metal environment as specified in the assignment.

1. Build rv32emu: I built the emulator with system emulation enabled:
   ```
   make ENABLE_ELF_LOADER=1 ENABLE_SYSTEM=1
   ```
2. Create Project: I copied the playground template to a new directory for my assignment:
   ```
   cp -r tests/system/playground hw_2
   cd hw_2
   ```
3. Modify Makefile: This was a critical step, as the playground's Makefile had hardcoded relative paths. I had to adjust them to fit the new location, as the prompt instructed.
    * Error 1 (Toolchain): `Makefile:1: ../../../mk/toolchain.mk: No such file or directory`
    * Fix 1: Changed `include ../../../mk/toolchain.mk` to `include ../mk/toolchain.mk` (only one level up).
    * Error 2 (Emulator Path): `Error: ../../../build/rv32emu not found`
    * Fix 2: Changed the `EMU` variable from `../../../build/rv32emu` to `../build/rv32emu`.

## Part 2: Porting a Pure Assembly Program (HW1 - problemB.S)

For the first part of the assignment, I chose to port my problemB.S (a uf8 round-trip tester) from Ripes to rv32emu.

1. I added problemB.S to the hw_baremetal directory.
2. I modified the Makefile `OBJS` list to replace main.o with problemB.o.
   ```
   # OBJS = start.o main.o ...
   OBJS = start.o problemB.o perfcounter.o ...
   ```

## Part 3: Debugging Syscall Incompatibility

This is where I encountered the "potential incompatibility" mentioned in the assignment. My program crashed immediately.

* **Initial Error**: `FATAL src/syscall.c:538: Unknown syscall: 4`
* **Analysis**: My problemB.S was written for the Ripes/Venus environment, which uses a different syscall ABI (Application Binary Interface).
    * Ripes ecall 4 = Print String
    * Ripes ecall 10 = Exit
* **rv32emu's ABI**: After checking docs/syscall.md, I found that rv32emu uses a different, low-level ABI (the syscall number is passed in `a7`):
    * ecall 64 = SYS_write (file_descriptor, buffer_address, length)
    * ecall 93 = SYS_exit (exit_code)
* **Decision**: I decided to modify the assembly to use rv32emu's native syscalls directly, avoiding the C printf library entirely for this first part.

## Part 4: Debugging ecall 64 Argument Registers

My first attempt to fix this still failed. The program ran but printed nothing.

* **Problem**: I had simply replaced ecall 4 with ecall 64 but kept the same arguments (a0=string_addr, a2=length).
* **Analysis**: The ecall 64 (SYS_write) specification requires a different register setup:
    * a0: File Descriptor (must be 1 for stdout)
    * a1: String Address
    * a2: String Length
* **Solution**: I had to rewrite all ecall blocks in problemB.S to follow this new convention.

Example: modifying the `exit:` block

* Before (Venus-style ecall 4):
```
exit:
    lui     a0, %hi(LC2)
    addi    a0, a0, %lo(LC2)
    li      a7, 4
    ecall
    li      a7, 10
    ecall
```
* After (rv32emu-style ecall 64/93):
```
exit:
    li      a0, 1                # a0 = stdout
    lui     a1, %hi(LC2)         # a1 = string address
    addi    a1, a1, %lo(LC2)
    li      a2, 17               # a2 = string length
    li      a7, 64               # a7 = 64 (SYS_write)
    ecall
    li      a0, 0                # a0 = exit code
    li      a7, 93               # a7 = 93 (SYS_exit)
    ecall
```

## Part 5: Final Result and Observation

After these changes, the program executed successfully and terminated cleanly:

![image](https://hackmd.io/_uploads/rkFSpRfgWl.png)

# Problem A from Quiz 2

**Task Objective**

To port a pure RISC-V assembly program for "Tower of Hanoi" (Q2problemA.S), originally written for a Ripes environment, to run successfully in the rv32emu bare-metal environment.
### Challenge 1: System Call Incompatibility

The first problem I encountered was that rv32emu, as a bare-metal emulator, does not recognize the high-level system calls used by Ripes.

* Ripes (Old Environment):
    * ecall 4: Print String (e.g., " from ")
    * ecall 1: Print Integer (e.g., the value in x9)
    * ecall 11: Print Character (e.g., 'A' from x11)
    * ecall 10: Exit Program
* rv32emu (New Environment):
    * ecall 64 (SYS_write): The only printing method. It writes raw bytes starting at a memory address and cannot format numbers.
    * ecall 93 (SYS_exit): The only exit method.

### Challenge 2: The Limitations of ecall 64

The ecall 64 (SYS_write) interface is rigid. It requires three arguments:

* a0: File Descriptor (must be 1 for stdout)
* a1: The memory address of the string
* a2: The length of the string

This created two new problems: how do I print a single character held in a register, and how do I print a number?

### Solution: Printing a Single Character (x11, x12, \n)

I used a "**Stack Buffer**" technique:

1. Modify main: I enlarged the stack frame at the start of the main function, changing `addi x2, x2, -32` to `addi x2, x2, -48`. This reserved extra space, and I designated `32(x2)` (i.e., `32(sp)`) as my 1-byte character buffer.
2. Modify "Print Char" Code: I replaced every `ecall 11` call (such as printing `x11`) with the following sequence:
   ```
   # Store the character (from x11) into memory.
   # sw writes the whole word; on little-endian RISC-V the character
   # byte lands at 32(x2), so sb would work here as well.
   sw x11, 32(x2)
   li a0, 1          # a0 = stdout
   addi a1, x2, 32   # a1 = address of the buffer
   li a2, 1          # a2 = length (1 byte)
   li a7, 64         # a7 = 64 (SYS_write)
   ecall
   ```

(I applied this same logic to print the destination peg x12 and the newline \n character.)

### Solution: Printing a Number (x9, the Disk ID)

This was the biggest challenge, as ecall 64 cannot convert a number to a string.

**My Solution: ASCII Math**

I realized the disk ID in x9 was a number (0, 1, 2), but I needed to print the characters ('1', '2', '3').
I found the ASCII table pattern:

* '1' = ASCII code 49
* '2' = ASCII code 50
* '3' = ASCII code 51

Formula: `ASCII = Number + 49`. Since '0' is ASCII code 48, adding 49 both converts the value to a digit character and shifts the 0-based disk ID to a 1-based label. This only works while the result is a single digit.

```
# Convert the number to an ASCII character
addi x10, x9, 49
# Use the "Stack Buffer" technique to print this new character
sw x10, 44(x2)
li a0, 1          # a0 = stdout
addi a1, x2, 44   # a1 = address of the buffer
li a2, 1          # a2 = length (1 byte)
li a7, 64         # a7 = 64 (SYS_write)
ecall
```

### Final Result

![image](https://hackmd.io/_uploads/B1vxyv_-We.png)

# Problem C from Quiz 3

### Fast Reciprocal Square Root (rsqrt) Implementation

The objective of this program is to compute $\mathbf{1/\sqrt{x}}$ efficiently. Because we are operating in an RV32I (No Hardware Multiplier/FPU) environment, the code employs two major optimization strategies:

* **Structural Optimization:** Uses a Lookup Table (LUT) for the initial guess plus Newton's Method for refinement. The iteration needs only multiplications and shifts, so it avoids computing $\sqrt{x}$ and then dividing.
* **Hardware Adaptation:** All mathematical operations are implemented through bitwise manipulation and Q16 fixed-point arithmetic, avoiding reliance on the non-existent hardware mul instruction.

### Interpolation and Correction (Linear and Exponential)

This section contains the logic for correcting the initial guess obtained from the lookup table (rsqrt_table).

```
frace:
    # frac = ((x - (1 << exp)) << 16) >> exp
    sub a5,a7,a5
    sltu a4,a7,a5
    neg a4,a4
    srli a6,a5,16
    slli a4,a4,16
    addi a1,a3,-32
    add a4,a6,a4
    slli a5,a5,16
    blt a1,zero,exp_smaller_32
    srl a5,a4,a1
Correction_frace:
    mul a5,a2,a5    # a5 = delta * frac (mul is an M-extension instruction)
    srli a5,a5,16
    sub a0,a0,a5    # y -= (delta * frac) >> 16
```

### Software Multiplication Loop Expansion (Newton Iteration)

Due to hardware limitations, multiplication must be simulated using shifts and additions.
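As a rough reference model of the shift-and-add idea, here is a small C sketch. It is not the author's code: the function names `mul32_shift_add` and `mul32x32_64` are hypothetical, and C is used only so the logic can be checked on a host machine. Each set bit `i` of the multiplier contributes the multiplicand shifted left by `i`. The second function keeps the 64-bit product in two 32-bit halves and propagates the carry by hand, in the spirit of an `sltu`-based carry on RV32I.

```c
#include <stdint.h>

/* Low 32 bits of a * b using only shifts, ANDs and adds. */
uint32_t mul32_shift_add(uint32_t a, uint32_t b)
{
    uint32_t acc = 0;
    for (unsigned i = 0; i < 32; i++) {
        if ((b >> i) & 1u)      /* bit i of the multiplier is set */
            acc += a << i;      /* add the shifted multiplicand   */
    }
    return acc;
}

/* Full 64-bit product of a * b, kept as two 32-bit halves. */
void mul32x32_64(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t acc_lo = 0, acc_hi = 0;
    for (unsigned i = 0; i < 32; i++) {
        if (!((b >> i) & 1u))
            continue;
        uint32_t part_lo = a << i;
        /* Shifting a 32-bit value by 32 is undefined in C, so i == 0
           is handled explicitly. */
        uint32_t part_hi = (i == 0) ? 0u : a >> (32 - i);
        uint32_t sum = acc_lo + part_lo;
        uint32_t carry = sum < acc_lo;   /* unsigned overflow test, like sltu */
        acc_lo = sum;
        acc_hi += part_hi + carry;
    }
    *hi = acc_hi;
    *lo = acc_lo;
}
```

For Q16 values, squaring 1.0 (`65536`) overflows the low word, which is why the wider `x * y^2` product needs the two-word form with an explicit carry.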
```
Iteration:
    li t0,0
    li a5,0
    # y * y
y_mul_y:
    addi a2,a5,-32
    srl a3,a0,a5
    sll a4,a0,a5
    srai a2,a2,31
    andi a3,a3,1
    and a4,a4,a2
    addi a5,a5,1
    beq a3,zero,x_mul_y_2
    add t0,t0,a4
x_mul_y_2:
    bne a5,a1,y_mul_y
    # x * y^2
    li a2,0
    li t6,0
    li a5,0
    j check_bit
bit_is_1:
    li a4,0
    sll a3,a7,a3
_1_xy2_add:
    add a4,a2,a4
    sltu t2,a4,a2
    add t6,t6,a3
    mv a2,a4
    add t6,t2,t6
next_bit:
    addi a5,a5,1
    beq a5,a1,first
check_bit:
    srl a4,t0,a5
    andi a4,a4,1
    addi a3,a5,-32
    beq a4,x0,next_bit
    sub a4,a6,a5
    bge a3,x0,bit_is_1
    srl a3,t1,a4
    sll a4,a7,a5
    j _1_xy2_add
```

### Count Leading Zeros (CLZ)

Used to quickly determine the input's magnitude (the exponent). It performs a binary search over shift widths (16 bits, then 8, 4, 2, 1) instead of testing one bit at a time.

```
#clz_begin
    mv t1, a0
    li t2, 32
    li t3, 16
1:
    srl t4, t1, t3
    beqz t4, 2f
    sub t2, t2, t3
    mv t1, t4
2:
    srai t3, t3, 1
    bnez t3, 1b
    sub t2, t2, t1
    mv a4,t2
#clz_end
```

### Lookup Table (LUT)

Stores pre-calculated initial values ($1/\sqrt{2^E}$) in Q16 fixed-point format ($1.0 = 65536$). The entry for the input's exponent serves as the initial guess ($y_0$) for the Newton iteration.

```
.data
.align 2
rsqrt_table:
    .word 65536 # 2^0
    .word 46341 # 2^1
    .word 32768 # 2^2
    .word 23170 # 2^3
    .word 16384 # 2^4
    .word 11585 # 2^5
    .word 8192 # 2^6
    .word 5793 # 2^7
    .word 4096 # 2^8
    .word 2896 # 2^9
    .word 2048 # 2^10
    .word 1448 # 2^11
    .word 1024 # 2^12
    .word 724 # 2^13
    .word 512 # 2^14
    .word 362 # 2^15
    .word 256 # 2^16
    .word 181 # 2^17
    .word 128 # 2^18
    .word 90 # 2^19
    .word 64 # 2^20
    .word 45 # 2^21
    .word 32 # 2^22
    .word 23 # 2^23
    .word 16 # 2^24
    .word 11 # 2^25
    .word 8 # 2^26
    .word 6 # 2^27
    .word 4 # 2^28
    .word 3 # 2^29
    .word 2 # 2^30
    .word 1 # 2^31
```

### Newton Iteration (Final Quadratic Convergence)

* Objective: Executes 2 iterations of Newton's method to achieve final accuracy.
* Core: Implements the formula $y_{n+1} = y_n \left( \frac{3}{2} - \frac{x y_n^2}{2} \right)$.
* Fixed-Point Management: The final result is right-shifted by 17 bits (>> 17).
  This shift simultaneously performs the division by $2$ and completes the final $Q16$ format normalization.

**Implementation:** I completed the assembly implementation of Problem C and measured the cycle count. The automated test data generation code has not been added yet.

---

# reference