arch2025-homework2

# Assignment2: Complete Applications Contributed by <[moonchin819](https://github.com/moonchin819)> ## Implementation Here's the [source code](https://github.com/moonchin819/computer_architecture_hw2) ## [Homework 1 uf8](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-B) To make the program from First Quiz Problem B run in a bare-metal environment using rv32emu’s system emulation, we need five files. * Makefile - To specify the cross compiler and link script (linker.ld) * Define all .S/.c files to be compiled in the project * Control build target : `all`, `clean`, `run`, `dump`, etc... * .S, C programs → link → generate elf file which can be executed. * linker.ld - Define memory layout of the program * To tell linker * Where to put .text(program) * .data (initialized data) * .bss (Uninitialized global variables) * Where are the stack and heap * Replace the operating system loader in bare-metal Because there's no OS to help allocate memory, so this linker.ld file is the only one which tells CPU where to start, matching the `_start` entry point in start.S file. * start.S - Startup code for the project, responsible for all the initialization * `_start` entry point assigned by linker.ld * Initialize stack pointer, clear .bss section * Call `main()` after setting * perfcounter.S - Performance counters * To read CSR counter of RISC-V * Help measure execution information in the situation without OS. * main.c * Provide function `main()` and called by start.S * As a high-level control, to call the assembly language I write * Can also call the function provided by `perfcounter.S` to measure the efficacy. * hw2.S - Main algorithm implementation * Include all the assembly code related to the quiz, assignment, or whatever the project is going to deal with. * We usually define it as a function, so that it could be called by `main.c` * Might include some bitwise operation and results printing. ### Bare-metal programming I have to modify the original program. In the first line, `.section .text` or `.text` should be added. It tells compiler and linker, the following instructions belong to "executable code section" of the program. Under the OS like Linux, compiler and linker will automatically put the program into the section such as `.text`, `.data`, `.bss`. But in bare-metal environment, there's no OS and standart startup code, so we need to clearly tell the compiler and linker "which section is code, and which is data" :::success I only show the modified parts from Homework 1 Problem B, so the encoding part will not be present since the changes are same as decoding part. ::: ```asm .section .text clz: li t0, 32 # n = 32, c = 16 srli t2, a0, 16 # y = x >> 16 beqz t2, c_equal_8 # branch if y is equal to zero addi t0, t0, -16 # n -= 16 add a0, t2, x0 # x = y c_equal_8: srli t2, a0, 8 # y = x >> 8 beqz t2, c_equal_4 addi t0, t0, -8 # n -= 8 add a0, t2, x0 # x = y c_equal_4: srli t2, a0, 4 # y = x >> 4 beqz t2, c_equal_2 addi t0, t0, -4 # n -= 4 add a0, t2, x0 # x = y c_equal_2: srli t2, a0, 2 # y = x >> 2 beqz t2, c_equal_1 addi t0, t0, -2 # n -= 2 add a0, t2, x0 # x = y c_equal_1: srli t2, a0, 1 # y = x >> 1 beqz t2, iter_end addi t0, t0, -1 # n -= 1 add a0, t2, x0 # x = y iter_end: sub a0, t0, a0 # return n - x jr ra offset_table: .word 0, 16, 48, 112, 240, 496, 1008, 2032, 4080, 8176, 16368, 32752, 65520, 131056, 262128, 524272 ``` `.type uf8_decode, %function` is a symbol type label for the linker and debugger. It tells linker that symbol `uf8_decode` represents a "function", but not "variable" The instruction `.align 2` aligns the starting address of the following code to a 2^2 = 4 bytes boundary. * Because the length of each instruction in RISC-V is 4 bytes, if the starting address of the function didn't align to 4 byte, CPU might occur some misaligned fetch exception. * So, here we use `.align 2`, since it's the safest, and most common approach. * It can also avoid unnecessary instruction fetching overhead. ```asm .globl uf8_decode .type uf8_decode,%function .align 2 uf8_decode: andi t0, a0, 0x0F # mantissa = f1 & 0xF srli t1, a0, 4 # exponent = f1 >> 4 la t2, offset_table slli t3, t1, 2 # offset << 4 add t2, t2, t3 lw t2, 0(t2) sll t0, t0, t1 # mantissa << exponent add a0, t0, t2 # mantissa + offset jr ra .size uf8_decode,.-uf8_decode ``` At the end of the code, I add `.size uf8_decode,.-uf8_encode`, although it is unnecessary line. `.size` is a symbol table information instruction for the linker, it tells linker the length of the function, that is, from where to where. ### Test Result ```tex 16:18:00 INFO src/main.c:315: RISC-V emulator is created and ready to run hw1b: PASSED Cycles: 28490 Instructions: 28490 === All Tests Completed ===16:18:00 INFO src/main.c:338: RISC-V emulator is destroyed ``` ## [Quiz 2 Problem A](https://hackmd.io/@sysprog/arch2025-quiz2-sol#Problem-A) ### Hanoi Tower In the game loop, we need to iterate 8 steps, because 3 disks for hanoi tower need 2^3-1=7 steps, with initial state we have 8 conditions in total. ```asm addi x8, x0, 1 # Counter starts from 1 game_loop: # BLANK 4: Check loop termination (2^3 moves). addi x5, x0, 8 # 8 steps beq x8, x5, finish_game # If x8 == 8, skip to the end ``` Gray Code calculation : It's an binary encoding, the number nearby has only one bit difference, which is used for deciding which disk to move every time. ```asm # BLANK 5: What is k for Gray code? srli x5, x8, 1 # BLANK 6: Complete Gray(n) calculation xor x6, x8, x5 ``` Gray(n) XOR Gray(n-1) will get the number with only one "1", it indicates which disk has to move. ```asm addi x7, x8, -1 srli x28, x7, 1 # BLANK 9: Generate Gray(n-1) xor x7, x7, x28 # BLANK 10: Which bits changed? xor x5, x6, x7 ``` To determine which disk to move, first we set the number of the disk to default 0. ```asm addi x9, x0, 0 # BLANK 11: Mask for testing LSB andi x6, x5, 1 # BLANK 12: Branch if disk 0 moves bne x6, x0, disk_found # BLANK 13: Set disk 1 addi x9, x0, 1 # BLANK 14: Test second bit with proper mask andi x6, x5, 2 bne x6, x0, disk_found # BLANK 15: Last disk number addi x9, x0, 2 ``` It's wrong if we have bit 0 and bit 2 at the same time. But this will not happen under normal circumstances. ```asm disk_found: # BLANK 16: Check impossible pattern (multiple bits) andi x30, x5, 5 addi x31, x0, 5 beq x30, x31, pattern_match jal x0, continue_move ``` To calculate the position of the target, we load the current position, and then we calculate the new position. ```asm continue_move: # BLANK 17: Word-align disk index (multiply by what?) slli x5, x9, 2 # BLANK 18: Base offset for disk array addi x5, x5, 20 add x5, x2, x5 lw x18, 0(x5) ``` Calculate the new place for smallest disk 0. The smallest disk will always jump over a peg when moved, and that's why add 2 in the second instruction here. The fourth instruction means that if the new position is less than 3, we don't have to adjust that, otherwise, we keep on to the next line to subtract 3 to adjust. (Because we only have three pegs.) After the adjustment, we can move on to the display the movement. ```asm bne x9, x0, handle_large # BLANK 19: Small disk moves by how many positions? addi x19, x18, 2 # BLANK 20: Number of pegs addi x6, x0, 3 blt x19, x6, display_move sub x19, x19, x6 jal x0, display_move ``` For two larger disks : disk 1 and disk 2. Move the large disks to the only peg which is not occupied by the smallest disk. ```asm handle_large: # BLANK 21: Load reference disk position lw x6, 20(x2) # BLANK 22: Sum of all peg indices (0+1+2) addi x19, x0, 3 sub x19, x19, x18 sub x19, x19, x6 ``` ```asm display_move: la x20, obdata # Load the address of obdata section to x20 add x5, x20, x18 # Calculate the address of the source peg # x5 points to the item number "x18" in obdata array # BLANK 23: Load byte from obfuscated data lbu x11, 0(x5) # Load 1 byte, which means the lowest 8 bits+ # BLANK 24: Decode constant (0x6F) li x6, 0x6F # Load decode constant 0x6F xor x11, x11, x6 # Decoding process。The origin obdata was fused，so we need to revert in to actual ascii code. ``` Same process as above to decode target peg, x12 here is the ascii code of target peg. ```asm add x7, x20, x19 lbu x12, 0(x7) xor x12, x12, x6 addi x12, x12, -0x12 ``` Here we calculate the offset in a stack, to record "Where is each disk currently located in peg ? " ```asm # BLANK 26: Calculate storage offset slli x5, x9, 2 addi x5, x5, 20 add x5, x2, x5 ``` ### Test Result ```tex 16:19:03 INFO src/main.c:315: RISC-V emulator is created and ready to run =====Hanoi Tower Test===== Origin code test with only print format adjustment. Move Disk 1 from A to C Move Disk 2 from A to B Move Disk 1 from C to B Move Disk 3 from A to C Move Disk 1 from B to A Move Disk 2 from B to C Move Disk 1 from A to C Cycles: 674 Instructions: 674 === Test Completed ===16:19:03 INFO src/main.c:338: RISC-V emulator is destroyed ``` ## [Quiz 3 Problem C](https://hackmd.io/@sysprog/arch2025-quiz3-sol#Problem-C) Fixed-point format `Actual value = Integer value / 2^N` * 2^0^ : No scaling * 2^16^ : 16 times scaling * 2^32^ : 32 times scaling Because y represents 1/sqrt(x), theoretically, its range is (0, 1], and scaled by 2^16^ the range become [0, 65536]. So it's stored as uint32_t, but the valid values are all below 17 bits. ```c static uint64_t mul32(uint32_t a, uint32_t b) { uint64_t r = 0; for (int i = 0; i < 32; i++){ if (b & (1U << i)) r += (uint64_t)a << i; } return r; } ``` ### rsqrt_org.h ```cpp #ifndef RSQRT_ORG // if not defined RSQRT_ORG (if RSQRT macro isn't defined yet). To check if this header file has already been included. #define RSQRT_ORG // define the macro RSQRT_ORG(its content is empty), indicates that the header file has already been included. //If the file is included again the following time, #ifndef will check that it's already defined and skip the entire file's contents. static uint64_t mul32(uint32_t a, uint32_t b); static int clz(uint32_t x); uint32_t fast_rsqrt(uint32_t x); #endif ``` In `main.c` file, we have to use the function `TEST_OUTPUT` to print the number from the table, so I wrote this simple function to calculate the length of the string. ```c static uint32_t str_len(const char *s) { const char *p = s; while (*p) p++; return p - s; } ``` ```tex 16:23:42 INFO src/main.c:315: RISC-V emulator is created and ready to run =====Fast Reciprocal sqrt Test===== Original version fast_rsqrt(1) = 65536Exact: 65536), Approx. Error = 0%, Remainder = 0/6553600 Cycles: 61 Instructions: 61 fast_rsqrt(4) = 32768Exact: 32768), Approx. Error = 0%, Remainder = 0/3276800 Cycles: 2783 Instructions: 2783 fast_rsqrt(16) = 16384Exact: 16384), Approx. Error = 0%, Remainder = 0/1638400 Cycles: 2789 Instructions: 2789 fast_rsqrt(24) = 13377Exact: 13377.48), Approx. Error = 0%, Remainder = 4800/1337748 Cycles: 4203 Instructions: 4203 fast_rsqrt(87) = 7026Exact: 7026.2), Approx. Error = 0%, Remainder = 2000/702620 Cycles: 4432 Instructions: 4432 fast_rsqrt(269) = 3995Exact: 3995.8), Approx. Error = 0%, Remainder = 8000/399580 Cycles: 4375 Instructions: 4375 fast_rsqrt(666) = 2539Exact: 2539.47), Approx. Error = 0%, Remainder = 4700/253947 Cycles: 4331 Instructions: 4331 Total Cycles: 236297 Total Instructions: 236297 === All Test Passed ===16:23:42 INFO src/main.c:338: RISC-V emulator is destroyed ``` ### -Os (Optimized for size) I encounter this error while changing Makefile for optimized for size. This error might relate to the implementation way of 64-bit operation under RV32 environment. In `-march=rv32i` environment, the processor only have 32-bit integer register, so there's no native arithmetic operations for 64-bit. Therefore, if I utilize arithmetic operations for `uint64_t`, the compiler will automatically insert function like `__ashldi.3`. ```tex riscv-none-elf-ld: rsqrt_org.o: in function `fast_rsqrt': /home/david0819/riscv-none-elf-gcc/rv32emu/calHW2/quiz3c/rsqrt_org.c:48:(.text+0xf4): undefined reference to `__lshrdi3' riscv-none-elf-ld: rsqrt_org.o: in function `mul32': /home/david0819/riscv-none-elf-gcc/rv32emu/calHW2/quiz3c/rsqrt_org.c:32:(.text+0x140): undefined reference to `__ashldi3' /home/david0819/riscv-none-elf-gcc/rv32emu/calHW2/quiz3c/rsqrt_org.c:32:(.text+0x1d0): undefined reference to `__ashldi3' make: *** [Makefile:26: test.elf] Error 1 riscv-none-elf-ld -T linker.ld -o test.elf start.o main.o perfcounter.o rsqrt_org.o riscv-none-elf-ld: warning: test.elf has a LOAD segment with RWX permissions riscv-none-elf-ld: rsqrt_org.o: in function `fast_rsqrt': /home/david0819/riscv-none-elf-gcc/rv32emu/calHW2/quiz3c/rsqrt_org.c:48:(.text+0xf4): undefined reference to `__lshrdi3' riscv-none-elf-ld: rsqrt_org.o: in function `mul32': /home/david0819/riscv-none-elf-gcc/rv32emu/calHW2/quiz3c/rsqrt_org.c:32:(.text+0x140): undefined reference to `__ashldi3' /home/david0819/riscv-none-elf-gcc/rv32emu/calHW2/quiz3c/rsqrt_org.c:32:(.text+0x1d0): undefined reference to `__ashldi3' make: *** [Makefile:26: test.elf] Error 1 ``` That means I can only use 32-bit operation for this ```c static void mul32(uint64_result_t *result, uint32_t a, uint32_t b) // can't use uint64_t for size optimization { result->lo = 0; // initialization result->hi = 0; uint32_t a_lo = a; // low 32 bits of a uint32_t a_hi = 0; // high 32 bits of a, it's 0 at the beginning because 'a' have only 32 bits // so 'a' is then seen as 64-bit number for (int i = 0; i < 32; i++){ // check every single bit of 'b' from low to high if (b & 1U){ // check if the LSB of 'b' is 1 uint32_t old_lo = result->lo; // save previous results of result->lo result->lo += a_lo; // if true, this indicates that we should add " the currently shifted 'a' " if (result->lo < old_lo){ // carry check, if result->lo become smaller after adding a_lo // it means overflow has occured result->hi++; // so we have to add the carry to high 32 bits } result->hi += a_hi; // add high 32 bits of a to high 32 bits of the result } // then shift left 'a' to prepare for the next round a_hi = (a_hi << 1) | (a_lo >> 31); // this step puts the MSB of a_lo to the LSB of a_hi a_lo = a_lo << 1; b >>= 1; // also shift right 'b' for the next round } ``` ```c uint32_t fast_rsqrt(uint32_t x) { if (x == 0) return 0xFFFFFFFF; // Return Inf if (x == 1) return 65536; int exp = 31 - clz(x); uint32_t y = rsqrt_table[exp]; if (x > (1u << exp)) { uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0; uint32_t delta = y - y_next; // uint32_t frac = (uint32_t) ((((uint64_t)x - (1UL << exp)) << 16) >> exp); // 改寫 uint32_t frac = ((x - (1u << exp)) << 16) >> exp; y -= (uint32_t) ((delta * frac) >> 16); } for (int iter = 0; iter < 2; iter++){ uint64_result_t y2_result; mul32(&y2_result, y, y); uint32_t y2 = y2_result.lo; uint64_result_t xy2_temp; mul32(&xy2_temp, x, y2); uint32_t xy2 = (xy2_temp.hi << 16) | (xy2_temp.lo >> 16); uint64_result_t y_temp; mul32(&y_temp, y, (3u << 16) - xy2); uint32_t result = (y_temp.hi << 15) | (y_temp.lo >> 17); y = result; } return y; } ``` That means in some optimization modes (e.g. -Os、-Ofast), gcc will automatically use uint64_t for simulation, which isn't provided in bare-metal. ```tex =====Fast Reciprocal sqrt Test===== Original version fast_rsqrt(1) = 65536(Exact: 65536), Approx. Error = 0%, Remainder = 0/6553600 Cycles: 43 Instructions: 43 fast_rsqrt(4) = 32768(Exact: 32768), Approx. Error = 0%, Remainder = 0/3276800 Cycles: 1907 Instructions: 1907 fast_rsqrt(16) = 16384(Exact: 16384), Approx. Error = 0%, Remainder = 0/1638400 Cycles: 1911 Instructions: 1911 fast_rsqrt(24) = 13377(Exact: 13377.48), Approx. Error = 0%, Remainder = 4800/1337748 Cycles: 2345 Instructions: 2345 fast_rsqrt(87) = 7026(Exact: 7026.2), Approx. Error = 0%, Remainder = 2000/702620 Cycles: 2398 Instructions: 2398 fast_rsqrt(269) = 3995(Exact: 3995.8), Approx. Error = 0%, Remainder = 8000/399580 Cycles: 2385 Instructions: 2385 fast_rsqrt(666) = 2539(Exact: 2539.47), Approx. Error = 0%, Remainder = 4700/253947 Cycles: 2356 Instructions: 2356 Total Cycles: 103769 Total Instructions: 103767 === All Test Passed === ``` ### Problem for `-Os` and `-Ofast` After I test the original handwritten code in the requirement, I'd like to test on `-Os` (optimized for size) and `-Ofast` (optimized for speed). I encounter the same problem in both mode. The compilter optimized out the function that print strings and numbers. ```c #define TEST_OUTPUT(msg, length) printstr(msg, length) #define TEST_LOGGER(msg) \ { \ char _msg[] = msg; \ TEST_OUTPUT(_msg, sizeof(_msg) - 1); \ } ``` When I `make run`, the results will be like this. ```tex 21:42:54 INFO src/main.c:315: RISC-V emulator is created and ready to run 1655366553600655360036 36 432768327680032768001486 1486 16163841638400163840049 49 241337713377.480480013377481118 1118 8770267026.202000702620209 209 26939953995.808000399580190 190 66625392539.470470025394759 59 d2�i94105 d2�i�� 94105 d2�i�� 21:42:54 ``` So I have to add clobber argument "memory" in `printstr` to avoid this according to this [online doc](https://gcc.gnu.org/onlinedocs/gcc-8.1.0/gcc/Extended-Asm.html?utm_source=chatgpt.com#Clobbers-and-Scratch-Registers). It tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands. ### -Ofast (Optimized for speed) Changing the compilation option to `-Ofast` will make cycle count decrease in most cases. The reason might be : * Reduction of instructions * Constant folding : Some constant arithmetic operation can be solve during compilation. * Inline expansion and branch simplification : Some small function might go through inline expansion, reducing the cost of function call (call/return) and pipeline flush. * Rearrange the order of instructions to obscure memory delay or pipeline hazard, which makes cycle count lower. ```tex =====Fast Reciprocal sqrt Test===== Original version fast_rsqrt(1) = 65536 (Exact: 65536), Approx. Error = 0%, Remainder = 0/6553600 Cycles: 36 Instructions: 36 fast_rsqrt(4) = 32768 (Exact: 32768), Approx. Error = 0%, Remainder = 0/3276800 Cycles: 1486 Instructions: 1486 fast_rsqrt(16) = 16384 (Exact: 16384), Approx. Error = 0%, Remainder = 0/1638400 Cycles: 49 Instructions: 49 fast_rsqrt(24) = 13377 (Exact: 13377.48), Approx. Error = 0%, Remainder = 4800/1337748 Cycles: 1118 Instructions: 1118 fast_rsqrt(87) = 7026 (Exact: 7026.2), Approx. Error = 0%, Remainder = 2000/702620 Cycles: 209 Instructions: 209 fast_rsqrt(269) = 3995 (Exact: 3995.8), Approx. Error = 0%, Remainder = 8000/399580 Cycles: 190 Instructions: 190 fast_rsqrt(666) = 2539 (Exact: 2539.47), Approx. Error = 0%, Remainder = 4700/253947 Cycles: 59 Instructions: 59 Total Cycles: 94551 Total Instructions: 94551 === All Test Passed === ``` ### Section Size Comparison We can use command `riscv-none-elf-size` to show memory usage breakdown by section. And we can conclude that `-Os` perform the best in code size, reducing approximately half of it compared to the original one. The most signficant difference is `Os` is 452 bytes smaller than `Ofast`. Take a look into data section, they're all zero. Which means there's no initialized global variable in the code, or they might be optimized out by the compiler. BSS means uninitialized data segment, and they're almost the same in those three results. This part doesn't take up flash memory. #### For the Original version ```tex text data bss dec hex filename 5036 0 4100 9136 23b0 test.elf ``` #### For `Osize` ```tex text data bss dec hex filename 2596 0 4108 6704 1a30 test.elf ``` The reason why `Os` is the smallest : 1. Loop Optimization : The loop in the optimization might be unrolled, and some duplicate code might be merged. 2. Choice of instructions : `Os` will use the instruction which is more compact. 3. Arrangement of registers : Reduce the instructions like load/store #### For `Ofast` ```tex text data bss dec hex filename 3048 0 4104 7152 1bf0 test.elf ``` The features of `Ofast` including : 1. More aggressive inlining 2. Loop expansion ### Conclusion We can conclude from the results : * `Os` optimization is highly effective, which reduced the code size by 48.5%. * RAM usage is stable, because we could see that the difference across all optimization levels is minimal * `Os` is suitable for embedded systems

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.