[my github](https://github.com/Stanley0915/assignment_2/blob/eedc9bd17a184335ba1f00e862083ff6d98a3241/Q2problemA.S)
## AI Collaboration (Transparency)
---
# HW2-1
## Part 1: Setting up the Bare-Metal Environment
My first step was to set up the rv32emu bare-metal environment as specified in the assignment.
1. Build rv32emu: I built the emulator with system emulation enabled:
```
make ENABLE_ELF_LOADER=1 ENABLE_SYSTEM=1
```
2. Create Project: I copied the playground template to a new directory for my assignment:
```
cp -r tests/system/playground hw_2
cd hw_2
```
3. Modify Makefile: This was a critical step, as the playground's Makefile had hardcoded relative paths. I had to modify it accordingly (as the prompt instructed) to fit the new location.
    * Error 1 (Toolchain): `Makefile:1: ../../../mk/toolchain.mk: No such file or directory`
    * Fix 1: Changed `include ../../../mk/toolchain.mk` to `include ../mk/toolchain.mk` (only one level up).
    * Error 2 (Emulator Path): `Error: ../../../build/rv32emu not found`
    * Fix 2: Changed the `EMU` variable from `../../../build/rv32emu` to `../build/rv32emu`.
## Part 2: Porting a Pure Assembly Program (HW1 - problemB.S)
For the first part of the assignment, I chose to port my problemB.S (a uf8 round-trip tester) from Ripes to rv32emu.
1. I added `problemB.S` to the new project directory.
2. I modified the Makefile `OBJS` list to replace `main.o` with `problemB.o`:
```
# OBJS = start.o main.o ...
OBJS = start.o problemB.o perfcounter.o ...
```
## Part 3: Debugging Syscall Incompatibility
This is where I encountered the "potential incompatibility" mentioned in the assignment. My program immediately crashed.
* **Initial Error**: `FATAL src/syscall.c:538: Unknown syscall: 4`
* **Analysis**: My `problemB.S` was written for the Ripes/Venus environment, which uses a different syscall ABI (Application Binary Interface):
    * Ripes ecall 4 = Print String
    * Ripes ecall 10 = Exit
* **rv32emu's ABI**: After checking `docs/syscall.md`, I found that rv32emu uses a different, lower-level ABI:
    * ecall 64 = `SYS_write` (file_descriptor, buffer_address, length)
    * ecall 93 = `SYS_exit` (exit_code)
* **Decision**: I decided to modify the assembly to use rv32emu's native syscalls directly, avoiding the C printf library entirely for this first part.
## Part 4: Debugging ecall 64 Argument Registers
My first attempt to fix this still failed. The program ran but printed nothing.
* **Problem**: I had simply replaced ecall 4 with ecall 64 but kept the same arguments (a0=string_addr, a2=length). With that layout, the string address in a0 was interpreted as the file descriptor, so nothing reached stdout.
* **Analysis**: The ecall 64 (`SYS_write`) specification requires a different register setup:
    * a0: File descriptor (must be 1 for stdout)
    * a1: String address
    * a2: String length in bytes
* **Solution**: I had to rewrite all ecall blocks in `problemB.S` to follow this new convention.

Example: modifying the `exit:` block
* Before (Venus-style ecall 4):
```
exit:
lui a0, %hi(LC2)
addi a0, a0, %lo(LC2)
li a7, 4
ecall
li a7, 10
ecall
```
* After (rv32emu-style ecall 64/93):
```
exit:
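li a0, 1 # a0 = file descriptor 1 (stdout)
lui a1, %hi(LC2) # a1 = string address (upper 20 bits)
addi a1, a1, %lo(LC2) # a1 += lower 12 bits of the address
li a2, 17 # a2 = string length in bytes
li a7, 64 # a7 = 64 (SYS_write)
ecall
li a0, 0 # a0 = exit code 0
li a7, 93 # a7 = 93 (SYS_exit)
ecall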
```
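For readers more familiar with POSIX than with RISC-V registers, the `SYS_write` argument order mirrors the `write(fd, buffer, length)` call. The Python sketch below is only an analogy and is not part of the assignment code; `sys_write` and the sample message are hypothetical names chosen for illustration.

```python
import os

SYS_WRITE = 64  # syscall numbers quoted in this report
SYS_EXIT = 93


def sys_write(fd, buf, length):
    """Model of syscall 64: a0 = fd, a1 = buffer, a2 = length.

    Returns the number of bytes written, as the real call does.
    """
    return os.write(fd, buf[:length])


if __name__ == "__main__":
    msg = b"hello from rv32emu\n"  # hypothetical message, not the real LC2 string
    sys_write(1, msg, len(msg))    # fd 1 is stdout
```

Passing a length shorter than the buffer truncates the output, which is why the hardcoded value in `li a2, 17` has to match the byte length of the string at `LC2`.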
## Part 5: Final Result and Observation
After these changes, the program executed successfully and terminated cleanly.

# Problem A from Quiz 2
**Task Objective**
To port a pure RISC-V assembly program for "Tower of Hanoi" (Q2problemA.S), originally written for a Ripes environment, to run successfully in the rv32emu bare-metal environment.
### Challenge 1: System Call Incompatibility
The first problem I encountered was that rv32emu, as a bare-metal emulator, does not recognize the high-level system calls used by Ripes.
* Ripes (Old Environment):
    * ecall 4: Print String (e.g., " from ")
    * ecall 1: Print Integer (e.g., the value in x9)
    * ecall 11: Print Character (e.g., 'A' from x11)
    * ecall 10: Exit Program
* rv32emu (New Environment):
    * ecall 64 (`SYS_write`): The only printing method. It writes a byte buffer identified by a memory address and a length, and it does not format numbers.
    * ecall 93 (`SYS_exit`): The only exit method.
### Challenge 2: The Limitations of ecall 64
The ecall 64 (`SYS_write`) interface is strict. It requires three arguments:
* a0: File descriptor (must be 1 for stdout)
* a1: The memory address of the string
* a2: The length of the string in bytes

This created two new problems: how do I print a single character held in a register, and how do I print a number?
### Solution: Printing a Single Character (x11, x12, \n)
I used a "**Stack Buffer**" technique:
1. Modify main: I enlarged the stack frame at the start of the main function by changing `addi x2, x2, -32` to `addi x2, x2, -48`. This reserved extra space, and I designated `32(x2)` (i.e., `32(sp)`) as my 1-byte character buffer.
2. Modify "Print Char" code: I replaced every `ecall 11` call (such as the one printing `x11`) with the following sequence:
```
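# Store the character (from x11) into memory; sw writes the whole word, and the low byte lands at 32(x2)
sw x11, 32(x2)
li a0, 1 # a0 = file descriptor 1 (stdout)
addi a1, x2, 32 # a1 = address of the character buffer
li a2, 1 # a2 = write exactly one byte
li a7, 64 # a7 = 64 (SYS_write)
ecall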
```
(I applied this same logic to print the destination peg x12 and the newline \n character.)
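The sequence above stores a full 32-bit word with `sw` but asks `SYS_write` for only one byte. That works because RISC-V is little-endian, so the least significant byte of the register is the one stored at the buffer address. A small Python sketch of that byte layout follows; `sw_bytes` is a hypothetical helper used only for illustration.

```python
import struct


def sw_bytes(reg_value):
    """Return the 4 bytes that `sw` places in memory, lowest address first."""
    return struct.pack("<I", reg_value & 0xFFFFFFFF)


def first_byte_written(reg_value):
    """The single byte SYS_write sees when a2 = 1 and a1 points at the word."""
    return sw_bytes(reg_value)[:1]


if __name__ == "__main__":
    print(sw_bytes(ord("A")))            # low byte first, then three zero bytes
    print(first_byte_written(ord("A")))
```

Using `sb` instead of `sw` would store only the one byte that is actually printed; both choices produce the same output here.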
### Solution: Printing a Number (x9, the Disk ID)
This was the biggest challenge, as ecall 64 cannot convert a number to a string.
**My Solution: ASCII Math**
The disk ID in x9 is a zero-based number (0, 1, 2), but I needed to print the characters '1', '2', '3'. The ASCII digits are consecutive:
* '1' = ASCII code 49
* '2' = ASCII code 50
* '3' = ASCII code 51

Formula: ASCII = Number + 49 (that is, '0' + 1 + Number). This works only while the result is a single digit, i.e., for at most nine disks.
```
#Convert the "Number" to an "ASCII Character"
addi x10, x9, 49
#Use the "Stack Buffer" technique to print this new character
sw x10, 44(x2)
li a0, 1
addi a1, x2, 44
li a2, 1
li a7, 64
ecall
```
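The `+ 49` offset can be checked with a short Python sketch. The first function mirrors `addi x10, x9, 49`. The second is an assumption-labeled extension that is not in the original program: a general decimal conversion by repeated division by 10, which is what a multi-digit value would need. Both function names are hypothetical.

```python
def disk_char(index):
    """Zero-based disk index -> one-based ASCII digit, as `addi x10, x9, 49` does."""
    if not 0 <= index <= 8:
        raise ValueError("single-digit trick only covers disks 1..9")
    return chr(index + 49)  # 49 == ord('0') + 1


def to_decimal_ascii(n):
    """Non-negative integer -> ASCII digits (not in the original program)."""
    if n < 0:
        raise ValueError("negative values are not handled in this sketch")
    digits = []
    while True:
        n, remainder = divmod(n, 10)
        digits.append(chr(ord("0") + remainder))  # least significant digit first
        if n == 0:
            break
    return "".join(reversed(digits))


if __name__ == "__main__":
    print([disk_char(i) for i in range(3)])
    print(to_decimal_ascii(1234))
```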
### Final Result

# Problem C from Quiz 3
### Fast Reciprocal Square Root (rsqrt) Implementation
The goal of this program is to compute $\mathbf{1/\sqrt{x}}$ efficiently. Because the target is RV32I (no hardware multiplier or FPU), the code uses two optimization strategies:
* **Structural Optimization:** A lookup table (LUT) supplies an initial guess, which Newton's method then refines. This avoids computing $\sqrt{x}$ and a division directly.
* **Hardware Adaptation:** The arithmetic uses bitwise operations and Q16 fixed-point values, so no FPU is needed, and the Newton iteration replaces the missing hardware mul instruction with shift-and-add.
### Interpolation and Correction (Linear and Exponential)
This section contains the logic for correcting the initial guess obtained from the lookup table (rsqrt_table).
```
frace:
# frac = ((x - (1 << exp)) << 16) >> exp
sub a5,a7,a5
sltu a4,a7,a5
neg a4,a4
srli a6,a5,16
slli a4,a4,16
addi a1,a3,-32
add a4,a6,a4
slli a5,a5,16
blt a1,zero,exp_smaller_32
srl a5,a4,a1
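Correction_frace:
# Note: mul belongs to the M extension; a strict RV32I build would need a shift-and-add routine here
mul a5,a2,a5 # a5 = delta * frac
srli a5,a5,16
sub a0,a0,a5 # y -= (delta * frac) >> 16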
```
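The comments in the assembly describe a linear interpolation between two adjacent table entries. The Python sketch below restates that step under stated assumptions: `y` and `y_next` are the table entries for `exp` and `exp + 1`, `delta = y - y_next`, and `interpolate_q16` is a hypothetical name. It is a reference model of the commented formulas, not a bit-exact copy of the assembly.

```python
def interpolate_q16(x, exp, y, y_next):
    """Correct the table value y for an x that lies between 2**exp and 2**(exp+1).

    frac is the position of x inside that interval as a Q16 fraction (0..65535).
    """
    frac = ((x - (1 << exp)) << 16) >> exp   # frac = ((x - (1 << exp)) << 16) >> exp
    delta = y - y_next                       # table values decrease as exp grows
    return y - ((delta * frac) >> 16)        # y -= (delta * frac) >> 16


if __name__ == "__main__":
    # x = 3 sits halfway between 2 and 4, so the result is halfway
    # between the table entries 46341 and 32768 (rounded down).
    print(interpolate_q16(3, 1, 46341, 32768))
```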
### Software Multiplication Loop Expansion (Newton Iteration)
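Because RV32I has no multiply instruction, each product is computed by shift-and-add: for every set bit of one operand, the other operand is shifted left by that bit position and added to an accumulator.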
```
Iteration:
li t0,0
li a5,0
# y * y
y_mul_y:
addi a2,a5,-32
srl a3,a0,a5
sll a4,a0,a5
srai a2,a2,31
andi a3,a3,1
and a4,a4,a2
addi a5,a5,1
beq a3,zero,x_mul_y_2
add t0,t0,a4
x_mul_y_2:
bne a5,a1,y_mul_y
# x * y^2
li a2,0
li t6,0
li a5,0
j check_bit
bit_is_1:
li a4,0
sll a3,a7,a3
_1_xy2_add:
add a4,a2,a4
sltu t2,a4,a2
add t6,t6,a3
mv a2,a4
add t6,t2,t6
next_bit:
addi a5,a5,1
beq a5,a1,first
check_bit:
srl a4,t0,a5
andi a4,a4,1
addi a3,a5,-32
beq a4,x0,next_bit
sub a4,a6,a5
bge a3,x0,bit_is_1
srl a3,t1,a4
sll a4,a7,a5
j _1_xy2_add
```
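The shift-and-add idea behind the loops above can be stated compactly in Python. This is a minimal sketch of the technique, not a transcription of the assembly; `mul32_shift_add` is a hypothetical name, and the 64-bit result is modeled with an explicit mask because Python integers are unbounded.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def mul32_shift_add(a, b):
    """Multiply two unsigned 32-bit values using only shifts and adds."""
    a &= MASK32
    b &= MASK32
    acc = 0
    for bit in range(32):
        if (b >> bit) & 1:          # is this bit of the multiplier set?
            acc += a << bit         # add the shifted multiplicand
    return acc & MASK64             # the product fits in 64 bits


if __name__ == "__main__":
    print(mul32_shift_add(46341, 46341) == 46341 * 46341)
```

In the assembly the 64-bit accumulator is split across two registers, which is why the loop also tracks a carry with `sltu`.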
### Count Leading Zeros (CLZ)
Determines the magnitude of the input (its exponent, i.e., the index of the highest set bit). It uses a binary search over shift widths (16, then 8, 4, 2, 1), so the loop always runs five iterations regardless of the input.
```
#clz_begin
mv t1, a0
li t2, 32
li t3, 16
1: srl t4, t1, t3
beqz t4, 2f
sub t2, t2, t3
mv t1, t4
2: srai t3, t3, 1
bnez t3, 1b
sub t2, t2, t1
mv a4,t2
#clz_end
```
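The same binary-search CLZ can be written as a Python reference model, which is handy for checking the assembly against known values. The variable names below map to the registers in the listing (`n` is `t2`, `c` is `t3`, `y` is `t4`); the function name `clz32` is a hypothetical choice.

```python
def clz32(x):
    """Count leading zeros of a 32-bit value by halving the shift width."""
    x &= 0xFFFFFFFF
    n = 32          # t2: running answer
    c = 16          # t3: current shift width
    while c:
        y = x >> c  # t4: upper part of x
        if y:       # something is set in the upper part
            n -= c
            x = y
        c >>= 1
    return n - x    # x is now 0 or 1


if __name__ == "__main__":
    for value in (0, 1, 0x00010000, 0x80000000):
        print(hex(value), clz32(value))
```

The exponent used to index the lookup table is then `31 - clz32(x)` for any nonzero `x`.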
### Lookup Table (LUT)
Stores precomputed values of $1/\sqrt{2^E}$ for $E = 0, \dots, 31$ in Q16 fixed-point format ($1.0 = 65536$). The entry selected by the exponent serves as the initial guess ($y_0$) for the Newton iteration.
```
.data
.align 2
rsqrt_table:
.word 65536 # 2^0
.word 46341 # 2^1
.word 32768 # 2^2
.word 23170 # 2^3
.word 16384 # 2^4
.word 11585 # 2^5
.word 8192 # 2^6
.word 5793 # 2^7
.word 4096 # 2^8
.word 2896 # 2^9
.word 2048 # 2^10
.word 1448 # 2^11
.word 1024 # 2^12
.word 724 # 2^13
.word 512 # 2^14
.word 362 # 2^15
.word 256 # 2^16
.word 181 # 2^17
.word 128 # 2^18
.word 90 # 2^19
.word 64 # 2^20
.word 45 # 2^21
.word 32 # 2^22
.word 23 # 2^23
.word 16 # 2^24
.word 11 # 2^25
.word 8 # 2^26
.word 6 # 2^27
.word 4 # 2^28
.word 3 # 2^29
.word 2 # 2^30
.word 1 # 2^31
```
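The table values can be regenerated with a few lines of Python, which is a convenient cross-check. This sketch assumes the entries are meant to be $65536/\sqrt{2^E}$ rounded to the nearest integer; `rsqrt_table_q16` is a hypothetical name.

```python
import math


def rsqrt_table_q16(entries=32):
    """Return round(65536 / sqrt(2**e)) for e = 0 .. entries-1."""
    return [round(65536 / math.sqrt(2 ** e)) for e in range(entries)]


if __name__ == "__main__":
    for e, value in enumerate(rsqrt_table_q16()):
        print(f".word {value} # 2^{e}")
```

Under that rounding assumption, every entry agrees with the table above except $2^{19}$, where the exact value is about 90.51: the table stores 90 and rounding gives 91. Either value is a usable starting guess, since Newton's method refines it afterwards.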
### Newton Iteration (Final Quadratic Convergence)
* Objective: Executes 2 iterations of Newton's method to achieve final accuracy.
* Core: Implements the formula $y_{n+1} = y_n \left( \frac{3}{2} - \frac{x y_n^2}{2} \right)$.
* Fixed-Point Management: The final result is right-shifted by 17 bits (>> 17). This shift simultaneously performs the division by $2$ and completes the final $Q16$ format normalization.
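
The fixed-point bookkeeping is easier to follow in a small reference model. The Python sketch below assumes `y` is Q16, `x` is a plain integer, and the intermediate `x * y^2` is shifted back to Q16 before the subtraction from `3 << 16`; it uses unbounded Python integers, so it does not reproduce any 32-bit overflow behavior of the assembly. The name `newton_rsqrt_q16` is hypothetical.

```python
def newton_rsqrt_q16(x, y, iterations=2):
    """Refine a Q16 estimate y of 1/sqrt(x) with Newton's method."""
    for _ in range(iterations):
        y2 = y * y                          # Q32: y squared
        xy2 = (x * y2) >> 16                # Q16: x * y^2
        # (3 - x*y^2) is twice (3/2 - x*y^2/2), so the product is shifted by
        # 16 (back to Q16) plus 1 (divide by 2), giving >> 17.
        y = (y * ((3 << 16) - xy2)) >> 17
    return y


if __name__ == "__main__":
    # 39555 is the interpolated starting guess for x = 3.
    print(newton_rsqrt_q16(3, 39555), 65536 / 3 ** 0.5)
```

For exact powers of four the table value is already correct, so the iteration leaves it unchanged, for example `newton_rsqrt_q16(4, 32768)` stays at 32768.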
**Implementation:**
Completed the assembly implementation of Problem C and measured the cycle count.
The automated test data generation code has not been added yet.
---
# reference