# Assignment2: Complete Applications
Contributed by <[moonchin819](https://github.com/moonchin819)>
## Implementation
Here's the [source code](https://github.com/moonchin819/computer_architecture_hw2)
## [Homework 1 uf8](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-B)
To make the program from First Quiz Problem B run in a bare-metal environment using rv32emu’s system emulation, we need five files.
* Makefile - To specify the cross compiler and link script (linker.ld)
* Define all .S/.c files to be compiled in the project
* Control build target : `all`, `clean`, `run`, `dump`, etc...
* .S, C programs → link → generate elf file which can be executed.
* linker.ld - Define memory layout of the program
* To tell linker
* Where to put .text(program)
* .data (initialized data)
* .bss (Uninitialized global variables)
* Where are the stack and heap
* Replace the operating system loader in bare-metal
Because there's no OS to help allocate memory, so this linker.ld file is the only one which tells CPU where to start, matching the `_start` entry point in start.S file.
* start.S - Startup code for the project, responsible for all the initialization
* `_start` entry point assigned by linker.ld
* Initialize stack pointer, clear .bss section
* Call `main()` after setting
* perfcounter.S - Performance counters
* To read CSR counter of RISC-V
* Help measure execution information in the situation without OS.
* main.c
* Provide function `main()` and called by start.S
* As a high-level control, to call the assembly language I write
* Can also call the function provided by `perfcounter.S` to measure the efficacy.
* hw2.S - Main algorithm implementation
* Include all the assembly code related to the quiz, assignment, or whatever the project is going to deal with.
* We usually define it as a function, so that it could be called by `main.c`
* Might include some bitwise operation and results printing.
### Bare-metal programming
I have to modify the original program.
In the first line, `.section .text` or `.text` should be added.
It tells compiler and linker, the following instructions belong to "executable code section" of the program.
Under the OS like Linux, compiler and linker will automatically put the program into the section such as `.text`, `.data`, `.bss`. But in bare-metal environment, there's no OS and standart startup code, so we need to clearly tell the compiler and linker "which section is code, and which is data"
:::success
I only show the modified parts from Homework 1 Problem B, so the encoding part will not be present since the changes are same as decoding part.
:::
```asm
.section .text
clz:
li t0, 32 # n = 32, c = 16
srli t2, a0, 16 # y = x >> 16
beqz t2, c_equal_8 # branch if y is equal to zero
addi t0, t0, -16 # n -= 16
add a0, t2, x0 # x = y
c_equal_8:
srli t2, a0, 8 # y = x >> 8
beqz t2, c_equal_4
addi t0, t0, -8 # n -= 8
add a0, t2, x0 # x = y
c_equal_4:
srli t2, a0, 4 # y = x >> 4
beqz t2, c_equal_2
addi t0, t0, -4 # n -= 4
add a0, t2, x0 # x = y
c_equal_2:
srli t2, a0, 2 # y = x >> 2
beqz t2, c_equal_1
addi t0, t0, -2 # n -= 2
add a0, t2, x0 # x = y
c_equal_1:
srli t2, a0, 1 # y = x >> 1
beqz t2, iter_end
addi t0, t0, -1 # n -= 1
add a0, t2, x0 # x = y
iter_end:
sub a0, t0, a0 # return n - x
jr ra
offset_table:
.word 0, 16, 48, 112, 240, 496, 1008, 2032, 4080, 8176, 16368, 32752, 65520, 131056, 262128, 524272
```
`.type uf8_decode, %function` is a symbol type label for the linker and debugger. It tells linker that symbol `uf8_decode` represents a "function", but not "variable"
The instruction `.align 2` aligns the starting address of the following code to a 2^2 = 4 bytes boundary.
* Because the length of each instruction in RISC-V is 4 bytes, if the starting address of the function didn't align to 4 byte, CPU might occur some misaligned fetch exception.
* So, here we use `.align 2`, since it's the safest, and most common approach.
* It can also avoid unnecessary instruction fetching overhead.
```asm
.globl uf8_decode
.type uf8_decode,%function
.align 2
uf8_decode:
andi t0, a0, 0x0F # mantissa = f1 & 0xF
srli t1, a0, 4 # exponent = f1 >> 4
la t2, offset_table
slli t3, t1, 2 # offset << 4
add t2, t2, t3
lw t2, 0(t2)
sll t0, t0, t1 # mantissa << exponent
add a0, t0, t2 # mantissa + offset
jr ra
.size uf8_decode,.-uf8_decode
```
At the end of the code, I add `.size uf8_decode,.-uf8_encode`, although it is unnecessary line.
`.size` is a symbol table information instruction for the linker, it tells linker the length of the function, that is, from where to where.
### Test Result
```tex
16:18:00 INFO src/main.c:315: RISC-V emulator is created and ready to run
hw1b: PASSED
Cycles: 28490
Instructions: 28490
=== All Tests Completed ===16:18:00 INFO src/main.c:338: RISC-V emulator is destroyed
```
## [Quiz 2 Problem A](https://hackmd.io/@sysprog/arch2025-quiz2-sol#Problem-A)
### Hanoi Tower
In the game loop, we need to iterate 8 steps, because 3 disks for hanoi tower need 2^3-1=7 steps, with initial state we have 8 conditions in total.
```asm
addi x8, x0, 1 # Counter starts from 1
game_loop:
# BLANK 4: Check loop termination (2^3 moves).
addi x5, x0, 8 # 8 steps
beq x8, x5, finish_game # If x8 == 8, skip to the end
```
Gray Code calculation : It's an binary encoding, the number nearby has only one bit difference, which is used for deciding which disk to move every time.
```asm
# BLANK 5: What is k for Gray code?
srli x5, x8, 1
# BLANK 6: Complete Gray(n) calculation
xor x6, x8, x5
```
Gray(n) XOR Gray(n-1) will get the number with only one "1", it indicates which disk has to move.
```asm
addi x7, x8, -1
srli x28, x7, 1
# BLANK 9: Generate Gray(n-1)
xor x7, x7, x28
# BLANK 10: Which bits changed?
xor x5, x6, x7
```
To determine which disk to move, first we set the number of the disk to default 0.
```asm
addi x9, x0, 0
# BLANK 11: Mask for testing LSB
andi x6, x5, 1
# BLANK 12: Branch if disk 0 moves
bne x6, x0, disk_found
# BLANK 13: Set disk 1
addi x9, x0, 1
# BLANK 14: Test second bit with proper mask
andi x6, x5, 2
bne x6, x0, disk_found
# BLANK 15: Last disk number
addi x9, x0, 2
```
It's wrong if we have bit 0 and bit 2 at the same time.
But this will not happen under normal circumstances.
```asm
disk_found:
# BLANK 16: Check impossible pattern (multiple bits)
andi x30, x5, 5
addi x31, x0, 5
beq x30, x31, pattern_match
jal x0, continue_move
```
To calculate the position of the target, we load the current position, and then we calculate the new position.
```asm
continue_move:
# BLANK 17: Word-align disk index (multiply by what?)
slli x5, x9, 2
# BLANK 18: Base offset for disk array
addi x5, x5, 20
add x5, x2, x5
lw x18, 0(x5)
```
Calculate the new place for smallest disk 0.
The smallest disk will always jump over a peg when moved, and that's why add 2 in the second instruction here.
The fourth instruction means that if the new position is less than 3, we don't have to adjust that, otherwise, we keep on to the next line to subtract 3 to adjust.
(Because we only have three pegs.)
After the adjustment, we can move on to the display the movement.
```asm
bne x9, x0, handle_large
# BLANK 19: Small disk moves by how many positions?
addi x19, x18, 2
# BLANK 20: Number of pegs
addi x6, x0, 3
blt x19, x6, display_move
sub x19, x19, x6
jal x0, display_move
```
For two larger disks : disk 1 and disk 2.
Move the large disks to the only peg which is not occupied by the smallest disk.
```asm
handle_large:
# BLANK 21: Load reference disk position
lw x6, 20(x2)
# BLANK 22: Sum of all peg indices (0+1+2)
addi x19, x0, 3
sub x19, x19, x18
sub x19, x19, x6
```
```asm
display_move:
la x20, obdata # Load the address of obdata section to x20
add x5, x20, x18 # Calculate the address of the source peg
# x5 points to the item number "x18" in obdata array
# BLANK 23: Load byte from obfuscated data
lbu x11, 0(x5) # Load 1 byte, which means the lowest 8 bits+
# BLANK 24: Decode constant (0x6F)
li x6, 0x6F # Load decode constant 0x6F
xor x11, x11, x6 # Decoding process。The origin obdata was fused,so we need to revert in to actual ascii code.
```
Same process as above to decode target peg, x12 here is the ascii code of target peg.
```asm
add x7, x20, x19
lbu x12, 0(x7)
xor x12, x12, x6
addi x12, x12, -0x12
```
Here we calculate the offset in a stack, to record "Where is each disk currently located in peg ? "
```asm
# BLANK 26: Calculate storage offset
slli x5, x9, 2
addi x5, x5, 20
add x5, x2, x5
```
### Test Result
```tex
16:19:03 INFO src/main.c:315: RISC-V emulator is created and ready to run
=====Hanoi Tower Test=====
Origin code test with only print format adjustment.
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
Cycles: 674
Instructions: 674
=== Test Completed ===16:19:03 INFO src/main.c:338: RISC-V emulator is destroyed
```
## [Quiz 3 Problem C](https://hackmd.io/@sysprog/arch2025-quiz3-sol#Problem-C)
Fixed-point format
`Actual value = Integer value / 2^N`
* 2^0^ : No scaling
* 2^16^ : 16 times scaling
* 2^32^ : 32 times scaling
Because y represents 1/sqrt(x), theoretically, its range is (0, 1], and scaled by 2^16^ the range become [0, 65536].
So it's stored as uint32_t, but the valid values are all below 17 bits.
```c
static uint64_t mul32(uint32_t a, uint32_t b)
{
uint64_t r = 0;
for (int i = 0; i < 32; i++){
if (b & (1U << i))
r += (uint64_t)a << i;
}
return r;
}
```
### rsqrt_org.h
```cpp
#ifndef RSQRT_ORG // if not defined RSQRT_ORG (if RSQRT macro isn't defined yet). To check if this header file has already been included.
#define RSQRT_ORG // define the macro RSQRT_ORG(its content is empty), indicates that the header file has already been included.
//If the file is included again the following time, #ifndef will check that it's already defined and skip the entire file's contents.
static uint64_t mul32(uint32_t a, uint32_t b);
static int clz(uint32_t x);
uint32_t fast_rsqrt(uint32_t x);
#endif
```
In `main.c` file, we have to use the function `TEST_OUTPUT` to print the number from the table, so I wrote this simple function to calculate the length of the string.
```c
static uint32_t str_len(const char *s)
{
const char *p = s;
while (*p) p++;
return p - s;
}
```
```tex
16:23:42 INFO src/main.c:315: RISC-V emulator is created and ready to run
=====Fast Reciprocal sqrt Test=====
Original version
fast_rsqrt(1) = 65536Exact: 65536), Approx. Error = 0%, Remainder = 0/6553600
Cycles: 61 Instructions: 61
fast_rsqrt(4) = 32768Exact: 32768), Approx. Error = 0%, Remainder = 0/3276800
Cycles: 2783 Instructions: 2783
fast_rsqrt(16) = 16384Exact: 16384), Approx. Error = 0%, Remainder = 0/1638400
Cycles: 2789 Instructions: 2789
fast_rsqrt(24) = 13377Exact: 13377.48), Approx. Error = 0%, Remainder = 4800/1337748
Cycles: 4203 Instructions: 4203
fast_rsqrt(87) = 7026Exact: 7026.2), Approx. Error = 0%, Remainder = 2000/702620
Cycles: 4432 Instructions: 4432
fast_rsqrt(269) = 3995Exact: 3995.8), Approx. Error = 0%, Remainder = 8000/399580
Cycles: 4375 Instructions: 4375
fast_rsqrt(666) = 2539Exact: 2539.47), Approx. Error = 0%, Remainder = 4700/253947
Cycles: 4331 Instructions: 4331
Total Cycles: 236297
Total Instructions: 236297
=== All Test Passed ===16:23:42 INFO src/main.c:338: RISC-V emulator is destroyed
```
### -Os (Optimized for size)
I encounter this error while changing Makefile for optimized for size.
This error might relate to the implementation way of 64-bit operation under RV32 environment.
In `-march=rv32i` environment, the processor only have 32-bit integer register, so there's no native arithmetic operations for 64-bit.
Therefore, if I utilize arithmetic operations for `uint64_t`, the compiler will automatically insert function like `__ashldi.3`.
```tex
riscv-none-elf-ld: rsqrt_org.o: in function `fast_rsqrt':
/home/david0819/riscv-none-elf-gcc/rv32emu/calHW2/quiz3c/rsqrt_org.c:48:(.text+0xf4): undefined reference to `__lshrdi3'
riscv-none-elf-ld: rsqrt_org.o: in function `mul32':
/home/david0819/riscv-none-elf-gcc/rv32emu/calHW2/quiz3c/rsqrt_org.c:32:(.text+0x140): undefined reference to `__ashldi3'
/home/david0819/riscv-none-elf-gcc/rv32emu/calHW2/quiz3c/rsqrt_org.c:32:(.text+0x1d0): undefined reference to `__ashldi3'
make: *** [Makefile:26: test.elf] Error 1
riscv-none-elf-ld -T linker.ld -o test.elf start.o main.o perfcounter.o rsqrt_org.o
riscv-none-elf-ld: warning: test.elf has a LOAD segment with RWX permissions
riscv-none-elf-ld: rsqrt_org.o: in function `fast_rsqrt':
/home/david0819/riscv-none-elf-gcc/rv32emu/calHW2/quiz3c/rsqrt_org.c:48:(.text+0xf4): undefined reference to `__lshrdi3'
riscv-none-elf-ld: rsqrt_org.o: in function `mul32':
/home/david0819/riscv-none-elf-gcc/rv32emu/calHW2/quiz3c/rsqrt_org.c:32:(.text+0x140): undefined reference to `__ashldi3'
/home/david0819/riscv-none-elf-gcc/rv32emu/calHW2/quiz3c/rsqrt_org.c:32:(.text+0x1d0): undefined reference to `__ashldi3'
make: *** [Makefile:26: test.elf] Error 1
```
That means I can only use 32-bit operation for this
```c
static void mul32(uint64_result_t *result, uint32_t a, uint32_t b) // can't use uint64_t for size optimization
{
result->lo = 0; // initialization
result->hi = 0;
uint32_t a_lo = a; // low 32 bits of a
uint32_t a_hi = 0; // high 32 bits of a, it's 0 at the beginning because 'a' have only 32 bits
// so 'a' is then seen as 64-bit number
for (int i = 0; i < 32; i++){ // check every single bit of 'b' from low to high
if (b & 1U){ // check if the LSB of 'b' is 1
uint32_t old_lo = result->lo; // save previous results of result->lo
result->lo += a_lo; // if true, this indicates that we should add " the currently shifted 'a' "
if (result->lo < old_lo){ // carry check, if result->lo become smaller after adding a_lo
// it means overflow has occured
result->hi++; // so we have to add the carry to high 32 bits
}
result->hi += a_hi; // add high 32 bits of a to high 32 bits of the result
}
// then shift left 'a' to prepare for the next round
a_hi = (a_hi << 1) | (a_lo >> 31); // this step puts the MSB of a_lo to the LSB of a_hi
a_lo = a_lo << 1;
b >>= 1; // also shift right 'b' for the next round
}
```
```c
uint32_t fast_rsqrt(uint32_t x)
{
if (x == 0) return 0xFFFFFFFF; // Return Inf
if (x == 1) return 65536;
int exp = 31 - clz(x);
uint32_t y = rsqrt_table[exp];
if (x > (1u << exp))
{
uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0;
uint32_t delta = y - y_next;
// uint32_t frac = (uint32_t) ((((uint64_t)x - (1UL << exp)) << 16) >> exp); // 改寫
uint32_t frac = ((x - (1u << exp)) << 16) >> exp;
y -= (uint32_t) ((delta * frac) >> 16);
}
for (int iter = 0; iter < 2; iter++){
uint64_result_t y2_result;
mul32(&y2_result, y, y);
uint32_t y2 = y2_result.lo;
uint64_result_t xy2_temp;
mul32(&xy2_temp, x, y2);
uint32_t xy2 = (xy2_temp.hi << 16) | (xy2_temp.lo >> 16);
uint64_result_t y_temp;
mul32(&y_temp, y, (3u << 16) - xy2);
uint32_t result = (y_temp.hi << 15) | (y_temp.lo >> 17);
y = result;
}
return y;
}
```
That means in some optimization modes (e.g. -Os、-Ofast), gcc will automatically use uint64_t for simulation, which isn't provided in bare-metal.
```tex
=====Fast Reciprocal sqrt Test=====
Original version
fast_rsqrt(1) = 65536(Exact: 65536), Approx. Error = 0%, Remainder = 0/6553600 Cycles: 43
Instructions: 43
fast_rsqrt(4) = 32768(Exact: 32768), Approx. Error = 0%, Remainder = 0/3276800 Cycles: 1907
Instructions: 1907
fast_rsqrt(16) = 16384(Exact: 16384), Approx. Error = 0%, Remainder = 0/1638400 Cycles: 1911
Instructions: 1911
fast_rsqrt(24) = 13377(Exact: 13377.48), Approx. Error = 0%, Remainder = 4800/1337748 Cycles: 2345
Instructions: 2345
fast_rsqrt(87) = 7026(Exact: 7026.2), Approx. Error = 0%, Remainder = 2000/702620 Cycles: 2398
Instructions: 2398
fast_rsqrt(269) = 3995(Exact: 3995.8), Approx. Error = 0%, Remainder = 8000/399580 Cycles: 2385
Instructions: 2385
fast_rsqrt(666) = 2539(Exact: 2539.47), Approx. Error = 0%, Remainder = 4700/253947 Cycles: 2356
Instructions: 2356
Total Cycles: 103769
Total Instructions: 103767
=== All Test Passed ===
```
### Problem for `-Os` and `-Ofast`
After I test the original handwritten code in the requirement, I'd like to test on `-Os` (optimized for size) and `-Ofast` (optimized for speed).
I encounter the same problem in both mode.
The compilter optimized out the function that print strings and numbers.
```c
#define TEST_OUTPUT(msg, length) printstr(msg, length)
#define TEST_LOGGER(msg) \
{ \
char _msg[] = msg; \
TEST_OUTPUT(_msg, sizeof(_msg) - 1); \
}
```
When I `make run`, the results will be like this.
```tex
21:42:54 INFO src/main.c:315: RISC-V emulator is created and ready to run
1655366553600655360036
36
432768327680032768001486
1486
16163841638400163840049
49
241337713377.480480013377481118
1118
8770267026.202000702620209
209
26939953995.808000399580190
190
66625392539.470470025394759
59
d2�i94105
d2�i��
94105
d2�i��
�21:42:54
```
So I have to add clobber argument "memory" in `printstr` to avoid this according to this [online doc](https://gcc.gnu.org/onlinedocs/gcc-8.1.0/gcc/Extended-Asm.html?utm_source=chatgpt.com#Clobbers-and-Scratch-Registers).
It tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands.
### -Ofast (Optimized for speed)
Changing the compilation option to `-Ofast` will make cycle count decrease in most cases.
The reason might be :
* Reduction of instructions
* Constant folding : Some constant arithmetic operation can be solve during compilation.
* Inline expansion and branch simplification : Some small function might go through inline expansion, reducing the cost of function call (call/return) and pipeline flush.
* Rearrange the order of instructions to obscure memory delay or pipeline hazard, which makes cycle count lower.
```tex
=====Fast Reciprocal sqrt Test=====
Original version
fast_rsqrt(1) = 65536 (Exact: 65536), Approx. Error = 0%, Remainder = 0/6553600 Cycles: 36
Instructions: 36
fast_rsqrt(4) = 32768 (Exact: 32768), Approx. Error = 0%, Remainder = 0/3276800 Cycles: 1486
Instructions: 1486
fast_rsqrt(16) = 16384 (Exact: 16384), Approx. Error = 0%, Remainder = 0/1638400 Cycles: 49
Instructions: 49
fast_rsqrt(24) = 13377 (Exact: 13377.48), Approx. Error = 0%, Remainder = 4800/1337748 Cycles: 1118
Instructions: 1118
fast_rsqrt(87) = 7026 (Exact: 7026.2), Approx. Error = 0%, Remainder = 2000/702620 Cycles: 209
Instructions: 209
fast_rsqrt(269) = 3995 (Exact: 3995.8), Approx. Error = 0%, Remainder = 8000/399580 Cycles: 190
Instructions: 190
fast_rsqrt(666) = 2539 (Exact: 2539.47), Approx. Error = 0%, Remainder = 4700/253947 Cycles: 59
Instructions: 59
Total Cycles: 94551
Total Instructions: 94551
=== All Test Passed ===
```
### Section Size Comparison
We can use command `riscv-none-elf-size` to show memory usage breakdown by section.
And we can conclude that `-Os` perform the best in code size, reducing approximately half of it compared to the original one.
The most signficant difference is `Os` is 452 bytes smaller than `Ofast`.
Take a look into data section, they're all zero. Which means there's no initialized global variable in the code, or they might be optimized out by the compiler.
BSS means uninitialized data segment, and they're almost the same in those three results. This part doesn't take up flash memory.
#### For the Original version
```tex
text data bss dec hex filename
5036 0 4100 9136 23b0 test.elf
```
#### For `Osize`
```tex
text data bss dec hex filename
2596 0 4108 6704 1a30 test.elf
```
The reason why `Os` is the smallest :
1. Loop Optimization : The loop in the optimization might be unrolled, and some duplicate code might be merged.
2. Choice of instructions : `Os` will use the instruction which is more compact.
3. Arrangement of registers : Reduce the instructions like load/store
#### For `Ofast`
```tex
text data bss dec hex filename
3048 0 4104 7152 1bf0 test.elf
```
The features of `Ofast` including :
1. More aggressive inlining
2. Loop expansion
### Conclusion
We can conclude from the results :
* `Os` optimization is highly effective, which reduced the code size by 48.5%.
* RAM usage is stable, because we could see that the difference across all optimization levels is minimal
* `Os` is suitable for embedded systems