Assignment2: Complete Applications

# Assignment2: Complete Applications contributed by < [AnnTaiwan](https://github.com/AnnTaiwan/computer_arch/tree/main/caHW2) > >[!Note] AI tools usage >I use ChatGPT to assist with Quiz 2 & 3 by providing code explanations, grammar revisions, pre-work research, code summaries, and explanations of standard RISC-V instruction usage. >[!Tip] Tutorial >Follow instructions to build rv32emu : [Lab2: RISC-V Instruction Set Simulator and System Emulator](https://hackmd.io/@sysprog/Sko2Ja5pel) >* I install on Ubuntu 24.04.3. > >[Assignment2: Complete Applications](https://hackmd.io/@sysprog/2025-arch-homework2) ## Pick one complete program (with test suite) done in Homework 1 and make it run in a bare-metal environment using rv32emu’s system emulation. > I pick [Quiz 1-Problem B: uf8_t decode and encode](https://github.com/AnnTaiwan/ca2025-quizzes/blob/main/q1-uf8_rv32_mine_v1.s). * [souce code for uf8_t decode and encode using rv32emu’s system emulation](https://github.com/AnnTaiwan/computer_arch/tree/main/caHW2/Q2) * Function declaration in `main.c`: (`CLZ`, `uf8_decode`, and `uf8_encode` are assembly code.) ```c= /* ============= q1_uf8 Declaration ============= */ extern uint32_t uf8_decode(uint8_t in); extern uint8_t uf8_encode(uint32_t in); ``` * Disassemble elf: ```c= 0001432c <CLZ>: 1432c: 02000293 li t0,32 14330: 01000313 li t1,16 00014334 <do_while>: 14334: 006553b3 srl t2,a0,t1 14338: 00038663 beqz t2,14344 <SHIFT_TOO_MUCH> 1433c: 406282b3 sub t0,t0,t1 14340: 00038533 add a0,t2,zero 00014344 <SHIFT_TOO_MUCH>: 14344: 00135313 srli t1,t1,0x1 14348: fe0316e3 bnez t1,14334 <do_while> 1434c: 40a28533 sub a0,t0,a0 14350: 00008067 ret 00014354 <uf8_decode>: 14354: 00f57293 andi t0,a0,15 14358: 00455313 srli t1,a0,0x4 1435c: 00f00393 li t2,15 14360: 406383b3 sub t2,t2,t1 14364: 00008e37 lui t3,0x8 14368: fffe0e13 addi t3,t3,-1 # 7fff <_start-0x8001> 1436c: 007e5e33 srl t3,t3,t2 14370: 004e1e13 slli t3,t3,0x4 14374: 006293b3 sll t2,t0,t1 14378: 01c38533 add a0,t2,t3 1437c: 00008067 ret 00014380 <uf8_encode>: 14380: ffc10113 addi sp,sp,-4 14384: 00112023 sw ra,0(sp) 14388: 00050fb3 add t6,a0,zero 1438c: 01000293 li t0,16 14390: 085fc863 blt t6,t0,14420 <RETURN> 14394: f99ff0ef jal 1432c <CLZ> 14398: 01f00293 li t0,31 1439c: 40a282b3 sub t0,t0,a0 143a0: 00050333 add t1,a0,zero 143a4: 00000393 li t2,0 143a8: 00000e13 li t3,0 143ac: 00500e93 li t4,5 143b0: 05d2c063 blt t0,t4,143f0 <Find_exact_exponent> 143b4: ffc28393 addi t2,t0,-4 143b8: 00f00e93 li t4,15 143bc: 007ed463 bge t4,t2,143c4 <Cal_overflow> 143c0: 00f00393 li t2,15 ``` * Run result ![image](https://hackmd.io/_uploads/rkY_S9QJWl.png) It successfully passes all the tests. ## Tower of Hanoi - from Problem A of Quiz 2 [Quiz2 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz2-sol#Problem-A) * **Souce code** for Tower of Hanoi using rv32emu’s system emulation: * [Without optimization](https://github.com/AnnTaiwan/computer_arch/tree/main/caHW2/Q3) * [With `-Ofast` optimization](https://github.com/AnnTaiwan/computer_arch/tree/main/caHW2/Q3_Ofast) * [With `-Os` optimization](https://github.com/AnnTaiwan/computer_arch/tree/main/caHW2/Q3_Osize) ### Purpose Move 3 disks from peg A to peg C, and print the actions. ### Flow by seeing gray code's different bit position | Gray Code (3-bit) | Bit variation | Moving disk | Action | | :-------------: | :-------: | :----: | :---: | | 000 | Start | — | — | | 001 | bit0 flip | A (smallese) | A → C | | 011 | bit1 flip | B | A → B | | 010 | bit0 flip | A | C → B | | 110 | bit2 flip | C (biggest) | A → C | | 111 | bit0 flip | A | B → A | | 101 | bit1 flip | B | B → C | | 100 | bit0 flip | A | A → C | * Gray code formula: $Gray(n) = n\ \oplus \ (n>>1)$ * See different bit position: $diff\_bit\_pos = Gray(n)\ \oplus \ Gray(n-1)$ #### Original Code in quiz 2 - problem A * register usage | Step | Purpose | Register/Concept | | -------- | ------------------ | ---------------------- | | `x8` | Move counter | 1→7 | | `x6` | Gray(n) | Current Gray code | | `x7` | Gray(n-1) | Previous Gray code | | `x5` | XOR diff | Which disk moves | | `x9` | Disk index | 0=A,1=B,2=C | | `x18` | Disk’s current peg | 0=A,1=B,2=C | | `x19` | Disk’s next peg | computed target | | `obdata` | Obfuscated pegs | Decoded to ‘A’,‘B’,‘C’ | ##### How to move disk 1, 2 to the pegs except the one that the smallest disk is at. All pegs is $0,1,2$, so total sum is $0+1+2=3$. Current moving disk can't be on top of the smallest one and can't move to its current peg. Therefore, when moving the disks except the smallest one, the formula is $Target\ peg= 3-self's\ peg-smallest\ disk's\ peg$. ### Code written in assembly explanation * Function declaration in `main.c`: ```c= /* ============= Tower of Hanoi Declaration ============= */ // return 0 (stored in a0) if successful exit extern uint32_t play_toh_v1(void); extern uint32_t play_toh_v2(void); extern uint32_t play_toh_v3(void); ``` #### Version 1 (Original assembly code from Q2-A) I only adjust the `display_move` to apply the print format of rv32emu. * Complete code ```c= # Perform the tower of Hanoi. # Disks: 0,1,2 (0 is the smallest one) # Peg: A,B,C (Try to move all disks from peg A to C) # Here is the original code from quiz 2 - A. # I only adjust the `display_move` to apply the print format of rv32emu. .text .globl play_toh_v1 .type play_toh_v1,%function .align 2 play_toh_v1: addi x2, x2, -32 sw x8, 0(x2) sw x9, 4(x2) sw x18, 8(x2) # disk0's peg sw x19, 12(x2) # disk1's peg sw x20, 16(x2) # disk2's peg li x5, 0x15 sw x5, 20(x2) # 20(x2) is disk 1's peg number sw x5, 24(x2) sw x5, 28(x2) # Fix disk positions (BLANK 1-3: neutralize x5 effect) # BLANK 1: Fix position at x2+20 sw x0, 20(x2) # BLANK 2: Fix position at x2+24 sw x0, 24(x2) # BLANK 3: Fix position at x2+28 sw x0, 28(x2) addi x8, x0, 1 game_loop: # BLANK 4: Check loop termination (2^3 moves) addi x5, x0, 8 beq x8, x5, finish_game # Gray code formula: gray(n) = n XOR (n >> k) # BLANK 5: What is k for Gray code? srli x5, x8, 1 # BLANK 6: Complete Gray(n) calculation xor x6, x8, x5 # BLANK 7-8: Calculate previous value and its shift addi x7, x8, -1 srli x28, x7, 1 # BLANK 9: Generate Gray(n-1) xor x7, x7, x28 # BLANK 10: Which bits changed? xor x5, x6, x7 # Initialize disk number addi x9, x0, 0 # BLANK 11: Mask for testing LSB andi x6, x5, 1 # BLANK 12: Branch if disk 0 moves bne x6, x0, disk_found # BLANK 13: Set disk 1 addi x9, x0, 1 # BLANK 14: Test second bit with proper mask andi x6, x5, 2 bne x6, x0, disk_found # BLANK 15: Last disk number addi x9, x0, 2 disk_found: # BLANK 16: Check impossible pattern (multiple bits) andi x30, x5, 5 addi x31, x0, 5 beq x30, x31, pattern_match jal x0, continue_move pattern_match: continue_move: # BLANK 17: Word-align disk index (multiply by what?) slli x5, x9, 2 # BLANK 18: Base offset for disk array addi x5, x5, 20 add x5, x2, x5 lw x18, 0(x5) bne x9, x0, handle_large # BLANK 19: Small disk moves by how many positions? addi x19, x18, 2 # BLANK 20: Number of pegs addi x6, x0, 3 blt x19, x6, display_move sub x19, x19, x6 jal x0, display_move handle_large: # BLANK 21: Load reference disk position lw x6, 20(x2) # BLANK 22: Sum of all peg indices (0+1+2) addi x19, x0, 3 sub x19, x19, x18 sub x19, x19, x6 display_move: la x20, obdata add x5, x20, x18 # BLANK 23: Load byte from obfuscated data lbu x13, 0(x5) # BLANK 24: Decode constant (0x6F) li x6, 0x6F xor x13, x13, x6 # BLANK 25: Final offset adjustment addi x13, x13, -0x12 add x7, x20, x19 lbu x14, 0(x7) xor x14, x14, x6 addi x14, x14, -0x12 # la x10, str1 # addi x17, x0, 4 # ecall add a7, x0, 0x40 add a0, x0, 0x1 # stdout la a1, str1 li a2, 10 #length character ecall # addi x10, x9, 1 # addi x17, x0, 1 # ecall addi a7, x9, 1 # x9 now is 0,1,2. In order to print 1,2,3 , add 1 la a1, disk # Find corresponding ascii code add a1, a1, a7 add a7, x0, 0x40 add a0, x0, 0x1 # stdout li a2, 1 #length character ecall # la x10, str2 # addi x17, x0, 4 # ecall add a7, x0, 0x40 add a0, x0, 0x1 # stdout la a1, str2 li a2, 6 #length character ecall # addi x10, x13, 0 # addi x17, x0, 11 # print char # ecall add a7, x0, 0x40 add a0, x0, 0x1 # stdout la a1, peg addi x13, x13, -65 # get the index from 0 to 2 add a1, a1, x13 li a2, 1 #length character ecall # la x10, str3 # addi x17, x0, 4 # ecall add a7, x0, 0x40 add a0, x0, 0x1 # stdout la a1, str3 li a2, 4 #length character ecall # addi x10, x14, 0 # addi x17, x0, 11 # ecall add a7, x0, 0x40 add a0, x0, 0x1 # stdout la a1, peg addi x14, x14, -65 # get the index from 0 to 2 add a1, a1, x14 li a2, 1 #length character ecall # addi x10, x0, 10 # addi x17, x0, 11 # ecall add a7, x0, 0x40 add a0, x0, 0x1 # stdout la a1, nextline li a2, 1 #length character ecall # BLANK 26: Calculate storage offset slli x5, x9, 2 addi x5, x5, 20 add x5, x2, x5 # BLANK 27: Update disk position sw x19, 0(x5) # BLANK 28-29: Increment counter and loop addi x8, x8, 1 jal x0, game_loop finish_game: lw x8, 0(x2) lw x9, 4(x2) lw x18, 8(x2) lw x19, 12(x2) lw x20, 16(x2) addi x2, x2, 32 li a0, 0 # return 0 if successful exit jr ra # addi x17, x0, 10 # ecall .data obdata: .byte 0x3c, 0x3b, 0x3a peg: .byte 65, 66, 67 # ascii code: 'A' 'B' 'C' disk: .byte 48, 49, 50, 51 # ascii code: '0' '1' '2' '3' str1: .ascii "Move Disk " str2: .ascii " from " str3: .ascii " to " nextline: .ascii "\n" ``` For printing decimal number in charcter, '1', '2', '3', and disk character, "A", "B", "C", I use ASCII code to represent: ```c= peg: .byte 65, 66, 67 # ascii code: 'A' 'B' 'C' disk: .byte 48, 49, 50, 51 # ascii code: '0' '1' '2' '3' str1: .ascii "Move Disk " str2: .ascii " from " str3: .ascii " to " nextline: .ascii "\n" ``` #### Version 2 (handwritten code) >[Commit bc36823](https://github.com/AnnTaiwan/computer_arch/commit/bc36823c08f84fbf1c5007ff7819d897e88058b3) I use ABI names for usage of register in order to better read the code. There are some lines that needs to calculate the offset in stack to find the disk's peg number: * version 1 ```c= # BLANK 17: Word-align disk index (multiply by what?) slli x5, x9, 2 # BLANK 18: Base offset for disk array addi x5, x5, 20 add x5, x2, x5 lw x18, 0(x5) ``` In version 1, it also needs to add offset `20`, so I stores disk's peg number at `sp[0]`, `sp[4]`, `sp[8]` to prevent that: ```c= addi sp, sp, -20 sw x0, 0(sp) # disk 0's peg number sw x0, 4(sp) # disk 1's peg number sw x0, 8(sp) # disk 2's peg number ``` Therefore, I can directly multiply it by 4, and load data from stack. ```c= slli t0, a5, 2 # see idx in sp, disk idx * 4 add t1, sp, t0 lw a6, 0(t1) # disk idx's current peg num ``` * Complete code ```c= # Perform the tower of Hanoi. # Disks: 0,1,2 (0 is the smallest one) # Peg: A,B,C (Try to move all disks from peg A to C) # Here is the adapted code from quiz 2 - A. # Defined constants first .equ STDOUT, 1 .equ WRITE, 64 .equ EXIT, 93 .text .globl play_toh_v2 .type play_toh_v2,%function .align 2 play_toh_v2: addi sp, sp, -20 sw x0, 0(sp) # disk 0's peg number sw x0, 4(sp) # disk 1's peg number sw x0, 8(sp) # disk 2's peg number sw ra, 12(sp) # ra sw s0, 16(sp) addi s0, x0, 1 # for loop counter, and it is also n in gray(n) game_loop: li t0, 8 beq t0, s0, finish_game # Gray code formula: gray(n) = n XOR (n >> 1) srli t0, s0, 1 xor a1, t0, s0 # gray(n) addi t0, s0, -1 # n-1 srli t1, t0, 1 xor a2, t0, t1 # gray(n-1) li a5, 0 # try to move which disk 0,1,2 xor a3, a1, a2 # diff_bit_pos andi t0, a3, 1 # try to move smallest disk beqz t0, handle_large # handle_smallest lw a6, 0(sp) # disk 0's current peg num # find target disk: current disk idx +2 -3 addi a4, a6, 2 li t0, 3 blt a4, t0, display_action sub a4, a4, t0 # -3 jal x0, display_action handle_large: addi a5, a5, 1 # middle disk 1 srli a3, a3, 1 andi t0, a3, 1 bnez t0, found_disk addi a5, a5, 1 # largest disk 2 found_disk: slli t0, a5, 2 # see idx in sp, disk idx * 4 add t1, sp, t0 lw a6, 0(t1) # disk idx's current peg num # find target disk: 3-(self peg + smallest disk's peg) lw t2, 0(sp) # smallest disk's peg li a4, 3 sub a4, a4, a6 sub a4, a4, t2 jal x0, display_action display_action: # Move disk a5 from a6 to a4 li a7, WRITE li a0, STDOUT # stdout la a1, str1 li a2, 10 #length character ecall li a7, WRITE li a0, STDOUT # stdout la a1, disk addi t0, a5, 1 add a1, a1, t0 li a2, 1 #length character ecall li a7, WRITE li a0, STDOUT # stdout la a1, str2 li a2, 6 #length character ecall li a7, WRITE li a0, STDOUT # stdout la a1, peg add a1, a1, a6 li a2, 1 #length character ecall li a7, WRITE li a0, STDOUT # stdout la a1, str3 li a2, 4 #length character ecall li a7, WRITE li a0, STDOUT # stdout la a1, peg add a1, a1, a4 li a2, 1 #length character ecall li a7, WRITE li a0, STDOUT # stdout la a1, nextline li a2, 1 #length character ecall # Update disk position ( Move disk a5 from a6 to a4 ) slli t0, a5, 2 # * 4 add t0, sp, t0 sw a4, 0(t0) # Increment counter and loop addi s0, s0, 1 jal x0, game_loop finish_game: lw ra, 12(sp) # ra lw s0, 16(sp) addi sp, sp, 20 li a0, 0 # return 0 if successful exit jr ra .data peg: .byte 65, 66, 67 # ascii code: 'A' 'B' 'C' disk: .byte 48, 49, 50, 51 # ascii code: '0' '1' '2' '3' str1: .ascii "Move Disk " str2: .ascii " from " str3: .ascii " to " nextline: .ascii "\n" ``` #### Version 3 (Improved one from version 2 and final one) >[Commit b1dff94](https://github.com/AnnTaiwan/computer_arch/commit/b1dff945cbb45e9fd3404559bb8afebe99e44101) First looking at `version 2`, I check if there is anything fixed command in the `game_loop`, and remove it from the loop to execute just once. * constants: ```c= li t5, 8 # gameloop end situation li t6, 3 # constant ``` * `gray(n-1)` is stored in `s1` to prevent from calculating again in the next loop. In each iteration, it only needs to calculate `gray(n)` once. * When seeing which disks need to move by seeing bit variation of gray code, it will be search from the smallest disk to the biggest one. At each step, I add $4$ to `a5`(represent $n^{th}$ disk) in order to directly access `sp[a5]`, so I don't need to `slli a5, a5, 2` when calculating the offset. ```c= addi a5, a5, 4 # largest disk 2 found_disk: add t1, sp, a5 lw a6, 0(t1) # disk idx's current peg num ``` * complete code ```c= # Perform the tower of Hanoi. # Disks: 0,1,2 (0 is the smallest one) # Peg: A,B,C (Try to move all disks from peg A to C) # Here is the adapted code from quiz 2 - A. # Defined constants first .equ STDOUT, 1 .equ WRITE, 64 .equ EXIT, 93 .text .globl play_toh_v3 .type play_toh_v3,%function .align 2 play_toh_v3: addi sp, sp, -24 sw x0, 0(sp) # disk 0's peg number sw x0, 4(sp) # disk 1's peg number sw x0, 8(sp) # disk 2's peg number sw ra, 12(sp) # ra sw s0, 16(sp) sw s1, 20(sp) # s0: n # s1: store previous calculated gray(n-1) li s1, 0 # gray(0) li t5, 8 # gameloop end situation li t6, 3 # constant addi s0, x0, 1 # for loop counter, and it is also n in gray(n) game_loop: beq t5, s0, finish_game # Gray code formula: gray(n) = n XOR (n >> 1) srli t0, s0, 1 xor a1, t0, s0 # gray(n) li a5, 0 # try to move which disk 0,1,2 xor a3, a1, s1 # diff_bit_pos mv s1, a1 # update gray(n-1) for next round andi t0, a3, 1 # try to move smallest disk beqz t0, handle_large # handle_smallest lw a6, 0(sp) # disk 0's current peg num # find target disk: current disk idx +2 -3 addi a4, a6, 2 blt a4, t6, display_action sub a4, a4, t6 # -3 jal x0, display_action handle_large: addi a5, a5, 4 # middle disk 1, a5 now increases by 4, in order to align with sp[0], sp[4], sp[8] andi t0, a3, 0x2 # bit flip 010: this is bit 1 flip bnez t0, found_disk addi a5, a5, 4 # largest disk 2 found_disk: add t1, sp, a5 lw a6, 0(t1) # disk idx's current peg num # find target disk: 3-(self peg + smallest disk's peg) lw t2, 0(sp) # smallest disk's peg sub a4, t6, a6 sub a4, a4, t2 jal x0, display_action display_action: # Move disk a5 from a6 to a4 li a7, WRITE li a0, STDOUT # stdout la a1, str1 li a2, 10 #length character ecall li a7, WRITE li a0, STDOUT # stdout la a1, disk addi t0, a5, 4 add a1, a1, t0 li a2, 1 #length character ecall li a7, WRITE li a0, STDOUT # stdout la a1, str2 li a2, 6 #length character ecall li a7, WRITE li a0, STDOUT # stdout la a1, peg add a1, a1, a6 li a2, 1 #length character ecall li a7, WRITE li a0, STDOUT # stdout la a1, str3 li a2, 4 #length character ecall li a7, WRITE li a0, STDOUT # stdout la a1, peg add a1, a1, a4 li a2, 1 #length character ecall li a7, WRITE li a0, STDOUT # stdout la a1, nextline li a2, 1 #length character ecall # Update disk position ( Move disk a5 from a6 to a4 ) add t0, sp, a5 sw a4, 0(t0) # Increment counter and loop addi s0, s0, 1 jal x0, game_loop finish_game: lw ra, 12(sp) # ra lw s0, 16(sp) lw s1, 20(sp) addi sp, sp, 24 li a0, 0 # return 0 if successful exit jr ra .data peg: .byte 65, 66, 67 # ascii code: 'A' 'B' 'C' disk: .byte 48, 0, 0, 0, 49, 0, 0, 0, 50, 0, 0, 0, 51 # ascii code: '0' '1' '2' '3', 0 is buffer, not used str1: .ascii "Move Disk " str2: .ascii " from " str3: .ascii " to " nextline: .ascii "\n" ``` ### Result #### Ripes simulation (Original assembly code from Q2-A) ![image](https://hackmd.io/_uploads/SJL5BOEJbl.png =30%x) ![image](https://hackmd.io/_uploads/BkvKH_EJWl.png =30%x) #### Version 1 (Original assembly code from Q2-A) * count = 685 ![image](https://hackmd.io/_uploads/SJ-mBxSkWg.png) #### Version 2 (My handwritten code) * count = 533 ![image](https://hackmd.io/_uploads/SkCEHeH1be.png) #### Version 3 (My handwritten code and improved one)(Final) * ==count = 496== ![image](https://hackmd.io/_uploads/rkUqWCnJ-x.png) ### Disassemble the ELF files produced by the C compiler and contrast the handwritten and **compiler-optimized** assembly listings. * Version 2 :::spoiler ```c= 00010878 <play_toh_v2>: 10878: fec10113 addi sp,sp,-20 1087c: 00012023 sw zero,0(sp) 10880: 00012223 sw zero,4(sp) 10884: 00012423 sw zero,8(sp) 10888: 00112623 sw ra,12(sp) 1088c: 00812823 sw s0,16(sp) 10890: 00100413 li s0,1 00010894 <game_loop>: 10894: 00800293 li t0,8 10898: 14828063 beq t0,s0,109d8 <finish_game> 1089c: 00145293 srli t0,s0,0x1 108a0: 0082c5b3 xor a1,t0,s0 108a4: fff40293 addi t0,s0,-1 108a8: 0012d313 srli t1,t0,0x1 108ac: 0062c633 xor a2,t0,t1 108b0: 00000793 li a5,0 108b4: 00c5c6b3 xor a3,a1,a2 108b8: 0016f293 andi t0,a3,1 108bc: 00028e63 beqz t0,108d8 <handle_large> 108c0: 00012803 lw a6,0(sp) 108c4: 00280713 addi a4,a6,2 108c8: 00300293 li t0,3 108cc: 04574063 blt a4,t0,1090c <display_action> 108d0: 40570733 sub a4,a4,t0 108d4: 0380006f j 1090c <display_action> 000108d8 <handle_large>: 108d8: 00178793 addi a5,a5,1 108dc: 0016d693 srli a3,a3,0x1 108e0: 0016f293 andi t0,a3,1 108e4: 00029463 bnez t0,108ec <found_disk> 108e8: 00178793 addi a5,a5,1 000108ec <found_disk>: 108ec: 00279293 slli t0,a5,0x2 108f0: 00510333 add t1,sp,t0 108f4: 00032803 lw a6,0(t1) 108f8: 00012383 lw t2,0(sp) 108fc: 00300713 li a4,3 10900: 41070733 sub a4,a4,a6 10904: 40770733 sub a4,a4,t2 10908: 0040006f j 1090c <display_action> 0001090c <display_action>: 1090c: 04000893 li a7,64 10910: 00100513 li a0,1 10914: 00000597 auipc a1,0x0 10918: 76a58593 addi a1,a1,1898 # 1107e <str1> 1091c: 00a00613 li a2,10 10920: 00000073 ecall 10924: 04000893 li a7,64 10928: 00100513 li a0,1 1092c: 00000597 auipc a1,0x0 10930: 74e58593 addi a1,a1,1870 # 1107a <disk> 10934: 00178293 addi t0,a5,1 10938: 005585b3 add a1,a1,t0 1093c: 00100613 li a2,1 10940: 00000073 ecall 10944: 04000893 li a7,64 10948: 00100513 li a0,1 1094c: 00000597 auipc a1,0x0 10950: 73c58593 addi a1,a1,1852 # 11088 <str2> 10954: 00600613 li a2,6 10958: 00000073 ecall 1095c: 04000893 li a7,64 10960: 00100513 li a0,1 10964: 00000597 auipc a1,0x0 10968: 71358593 addi a1,a1,1811 # 11077 <peg> 1096c: 010585b3 add a1,a1,a6 10970: 00100613 li a2,1 10974: 00000073 ecall 10978: 04000893 li a7,64 1097c: 00100513 li a0,1 10980: 00000597 auipc a1,0x0 10984: 70e58593 addi a1,a1,1806 # 1108e <str3> 10988: 00400613 li a2,4 1098c: 00000073 ecall 10990: 04000893 li a7,64 10994: 00100513 li a0,1 10998: 00000597 auipc a1,0x0 1099c: 6df58593 addi a1,a1,1759 # 11077 <peg> 109a0: 00e585b3 add a1,a1,a4 109a4: 00100613 li a2,1 109a8: 00000073 ecall 109ac: 04000893 li a7,64 109b0: 00100513 li a0,1 109b4: 00000597 auipc a1,0x0 109b8: 6de58593 addi a1,a1,1758 # 11092 <nextline> 109bc: 00100613 li a2,1 109c0: 00000073 ecall 109c4: 00279293 slli t0,a5,0x2 109c8: 005102b3 add t0,sp,t0 109cc: 00e2a023 sw a4,0(t0) 109d0: 00140413 addi s0,s0,1 109d4: ec1ff06f j 10894 <game_loop> 000109d8 <finish_game>: 109d8: 00c12083 lw ra,12(sp) 109dc: 01012403 lw s0,16(sp) 109e0: 01410113 addi sp,sp,20 109e4: 00000513 li a0,0 109e8: 00008067 ret ``` ::: * Version 3 :::spoiler ```c= 000117cc <play_toh_v3>: 117cc: fe810113 addi sp,sp,-24 117d0: 00012023 sw zero,0(sp) 117d4: 00012223 sw zero,4(sp) 117d8: 00012423 sw zero,8(sp) 117dc: 00112623 sw ra,12(sp) 117e0: 00812823 sw s0,16(sp) 117e4: 00912a23 sw s1,20(sp) 117e8: 00000493 li s1,0 117ec: 00800f13 li t5,8 117f0: 00300f93 li t6,3 117f4: 00100413 li s0,1 000117f8 <game_loop>: 117f8: 128f0263 beq t5,s0,1191c <finish_game> 117fc: 00145293 srli t0,s0,0x1 11800: 0082c5b3 xor a1,t0,s0 11804: 00000793 li a5,0 11808: 0095c6b3 xor a3,a1,s1 1180c: 00058493 mv s1,a1 11810: 0016f293 andi t0,a3,1 11814: 00028c63 beqz t0,1182c <handle_large> 11818: 00012803 lw a6,0(sp) 1181c: 00280713 addi a4,a6,2 11820: 03f74a63 blt a4,t6,11854 <display_action> 11824: 41f70733 sub a4,a4,t6 11828: 02c0006f j 11854 <display_action> 0001182c <handle_large>: 1182c: 00478793 addi a5,a5,4 11830: 0026f293 andi t0,a3,2 11834: 00029463 bnez t0,1183c <found_disk> 11838: 00478793 addi a5,a5,4 0001183c <found_disk>: 1183c: 00f10333 add t1,sp,a5 11840: 00032803 lw a6,0(t1) 11844: 00012383 lw t2,0(sp) 11848: 410f8733 sub a4,t6,a6 1184c: 40770733 sub a4,a4,t2 11850: 0040006f j 11854 <display_action> 00011854 <display_action>: 11854: 04000893 li a7,64 11858: 00100513 li a0,1 1185c: 00001597 auipc a1,0x1 11860: 94758593 addi a1,a1,-1721 # 121a3 <str1> 11864: 00a00613 li a2,10 11868: 00000073 ecall 1186c: 04000893 li a7,64 11870: 00100513 li a0,1 11874: 00001597 auipc a1,0x1 11878: 92258593 addi a1,a1,-1758 # 12196 <disk> 1187c: 00478293 addi t0,a5,4 11880: 005585b3 add a1,a1,t0 11884: 00100613 li a2,1 11888: 00000073 ecall 1188c: 04000893 li a7,64 11890: 00100513 li a0,1 11894: 00001597 auipc a1,0x1 11898: 91958593 addi a1,a1,-1767 # 121ad <str2> 1189c: 00600613 li a2,6 118a0: 00000073 ecall 118a4: 04000893 li a7,64 118a8: 00100513 li a0,1 118ac: 00001597 auipc a1,0x1 118b0: 8e758593 addi a1,a1,-1817 # 12193 <peg> 118b4: 010585b3 add a1,a1,a6 118b8: 00100613 li a2,1 118bc: 00000073 ecall 118c0: 04000893 li a7,64 118c4: 00100513 li a0,1 118c8: 00001597 auipc a1,0x1 118cc: 8eb58593 addi a1,a1,-1813 # 121b3 <str3> 118d0: 00400613 li a2,4 118d4: 00000073 ecall 118d8: 04000893 li a7,64 118dc: 00100513 li a0,1 118e0: 00001597 auipc a1,0x1 118e4: 8b358593 addi a1,a1,-1869 # 12193 <peg> 118e8: 00e585b3 add a1,a1,a4 118ec: 00100613 li a2,1 118f0: 00000073 ecall 118f4: 04000893 li a7,64 118f8: 00100513 li a0,1 118fc: 00001597 auipc a1,0x1 11900: 8bb58593 addi a1,a1,-1861 # 121b7 <nextline> 11904: 00100613 li a2,1 11908: 00000073 ecall 1190c: 00f102b3 add t0,sp,a5 11910: 00e2a023 sw a4,0(t0) 11914: 00140413 addi s0,s0,1 11918: ee1ff06f j 117f8 <game_loop> 0001191c <finish_game>: 1191c: 00c12083 lw ra,12(sp) 11920: 01012403 lw s0,16(sp) 11924: 01412483 lw s1,20(sp) 11928: 01810113 addi sp,sp,24 1192c: 00000513 li a0,0 11930: 00008067 ret ``` ::: #### Observations and explain |name|number of lines| |--|--| |Version 2|104| |Version 2 with `-Ofast`|104| |Version 2 with `-Os`|104| |Version 2|101| |Version 2 with `-Ofast`|101| |Version 2 with `-Os`|101| If they are from same versions, the disassemble results are all the same, so the statistic results are similar. So, I just show the each example from each version. #### Optimize method * Add optimize flag in `CFLAGS` in `Makefile` ```c= # optimized for speed CFLAGS = -g -march=rv32i_zicsr -Ofast # optimized for size CFLAGS = -g -march=rv32i_zicsr -Os ``` * Add `__attribute__((optimize("O0")))` before `print_dec` in order to prevent it from optimizing, so I can see the cycle and instruction count messages. ```c= __attribute__((optimize("O0"))) /* Simple integer to decimal string conversion */ static void print_dec(unsigned long val) { char buf[20]; char *p = buf + sizeof(buf) - 1; *p = '\n'; p--; if (val == 0) { *p = '0'; p--; } else { while (val > 0) { *p = '0' + umod(val, 10); p--; val = udiv(val, 10); } } p++; printstr(p, (buf + sizeof(buf) - p)); } ``` * Version 2 - Running result: * with `-Ofast`: ![image](https://hackmd.io/_uploads/S15dUeHJWl.png) * Count of cycles and instructions are 520. * with `-Os`: ![image](https://hackmd.io/_uploads/rJTDrxr1Zg.png) * Count of cycles and instructions are 520. * Compared to `Version 2` result which is 533, count of cycles and instructions decreases due to optimization. * Version 3 - Running result: * with `-Ofast`: ![image](https://hackmd.io/_uploads/ry9DrC21Wx.png) * Count of cycles and instructions are 483. * with `-Os`: ![image](https://hackmd.io/_uploads/B19oSRh1-g.png) * Count of cycles and instructions are 483. * Compared to `Version 3` result which is 496, count of cycles and instructions decreases due to optimization. ## `fast_rsqrt` - from Problem C of Quiz 3 [Quiz3 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz3) * **Souce code** for `fast_rsqrt` using rv32emu’s system emulation: * [Without optimization](https://github.com/AnnTaiwan/computer_arch/tree/main/caHW2/Q3) * [With `-Ofast` optimization](https://github.com/AnnTaiwan/computer_arch/tree/main/caHW2/Q3_Ofast) * [With `-Os` optimization](https://github.com/AnnTaiwan/computer_arch/tree/main/caHW2/Q3_Osize) The goal of this problem is to approximate the **reciprocal square root function** $$ y = \frac{1}{\sqrt{x}} $$ using three progressively refined methods: 1. **Lookup table (LUT)** — precompute discrete values of ( y ) for selected ( x ). 2. **Linear interpolation** — estimate intermediate values between table entries. 3. **Newton–Raphson iteration** — apply iterative refinement to improve accuracy. These techniques demonstrate the trade-offs between computational cost and numerical precision when implementing mathematical functions on resource-limited systems (e.g., embedded or bare-metal environments). ### My method Follow the c code shown in quiz PDF Problem C of Quiz 3. Try to realize the error rate by using with or without `Newton-Raphson Iteration or linear interpolation` and see how the number of `Newton-Raphson Iteration` affects the result. Besides, using different optimization flag to see difference and explain the result. ### Test data ```c= uint32_t in[] = {1, 4, 16, 20, 100, 258, 650}; char *s_exact_values[] = { "65536", "32768", "16384", "14654.2", "6553.6", "4080", "2570.5" }; uint32_t exact_values[] = { // multiply 10 for saving one-position fractional part 655360, // 65536 * 10 327680, 163840, 146542, 65536, // 6553.6 * 10 40800, 25705 }; ``` * `in`: the `x` input to `fast_rsqrt`, `result = fast_rsqrt(x)`, stores all the test datas. * `s_exact_values`: Hardcode the corresponding exact result values as `char*` strings in order to print them out. * `exact_values`: the exact result value already timed by 10 in order to keep the one-position of fractional part, and it will be used to calculate error. #### Test Method Try to input all `in` data into `fast_rsqrt`, and get the `result`, use this `result` with the `exact_values[i]` to calculate the error in pecentage(%). Notice that my percentage can only appear in integer way, so the `remainder` after the percentage calculation will also be displayed. In the `remainder` part, the value after `/` indicates the divisor used when calculating the percentage, allowing the remainder to be displayed more clearly. e.g.: ```c= fast_rsqrt(2147483648) = 1 (exact: 1.4), Approximately Error = 28%, Remainder = 8/14, Cycles: 71, Instructions: 71 ``` ### Result * The **total** cycle and instruction count includes the process of executing `fast_rsqrt`, printing messages, and calculating error, so I will print the specific cycle and instruction count of each test case in the end of that line. #### Case 1: Just do lookup table ![image](https://hackmd.io/_uploads/SyEJiXh1We.png) #### Case 2: Do lookup table and linear interpolation ![image](https://hackmd.io/_uploads/BJmsLv9JWe.png) * The error rate increases as the input data increases. #### Case 3: Do lookup table, linear interpolation, and ==1 time== Newton–Raphson iteration ![image](https://hackmd.io/_uploads/Hyd1ww91-l.png) #### Case 4: Do lookup table, linear interpolation, and ==2 times== Newton–Raphson iteration ![image](https://hackmd.io/_uploads/ryUNwv5ybg.png) * With two times iteration, it seems it can decrease the error enough. * Compared to case 2 and 3, the cycle and instruction count increases a lot. * Display Section Sizes: by `riscv-none-elf-size test.elf` ``` text data bss dec hex filename 7604 59 4097 11760 2df0 test.elf ``` #### Case 5: Do lookup table, linear interpolation, and ==3 times== Newton–Raphson iteration ![image](https://hackmd.io/_uploads/B10vDP9ybe.png) * It seems the result didn't change, but the cycle and instruction count increase a lot, which means 2 times Newton–Raphson iteration is enough. #### Case 6: Do lookup table, and ==1 times== Newton–Raphson iteration * Without linear interpolation ![image](https://hackmd.io/_uploads/S1ECjm3y-e.png) #### Case 7: Do lookup table, and ==2 times== Newton–Raphson iteration * Without linear interpolation ![image](https://hackmd.io/_uploads/rJpwsQhkbg.png) #### Conclusion of result: >Use **case number** shown above to represent. |name|comparing result| |--|--| |cycle count|5 > 7 > 4 > 3 > 2 > 6 > 1| |error rate|1 > 6 > 2 > 7 > 3 > 4 = 5| I notice that **adding one more time Newton-Raphson iteration** can lead to significant increase in cycle and instruction count. This happens because each additional iteration performs more chained multiplications. The execution time of `mul32` depends on the number of `1` bits in the operand — more `1`s result in more addition operations inside the loop. As Newton-Raphson refines the approximation, the intermediate values tend to have more active bits, increasing the workload for each multiplication. Consequently, each extra iteration significantly raises both cycle and instruction counts. ### Disassemble the ELF files produced by the C compiler #### [Case 4](https://hackmd.io/7CWlPeGeR9yvznJkRTx3WA?both#Case-4-Do-lookup-table-linear-interpolation-and-2-times-Newton%E2%80%93Raphson-iteration) ```c= 000117d8 <fast_rsqrt>: 117d8: fa010113 addi sp,sp,-96 117dc: 04112e23 sw ra,92(sp) 117e0: 04812c23 sw s0,88(sp) 117e4: 05212a23 sw s2,84(sp) 117e8: 05312823 sw s3,80(sp) 117ec: 05412623 sw s4,76(sp) 117f0: 05512423 sw s5,72(sp) 117f4: 05612223 sw s6,68(sp) 117f8: 05712023 sw s7,64(sp) 117fc: 03812e23 sw s8,60(sp) 11800: 03912c23 sw s9,56(sp) 11804: 03a12a23 sw s10,52(sp) 11808: 03b12823 sw s11,48(sp) 1180c: 06010413 addi s0,sp,96 11810: faa42623 sw a0,-84(s0) 11814: fac42783 lw a5,-84(s0) 11818: 00079663 bnez a5,11824 <fast_rsqrt+0x4c> 1181c: fff00793 li a5,-1 11820: 1f40006f j 11a14 <fast_rsqrt+0x23c> 11824: fac42703 lw a4,-84(s0) 11828: 00100793 li a5,1 1182c: 00f71663 bne a4,a5,11838 <fast_rsqrt+0x60> 11830: 000107b7 lui a5,0x10 11834: 1e00006f j 11a14 <fast_rsqrt+0x23c> 11838: fac42503 lw a0,-84(s0) 1183c: eb9ff0ef jal 116f4 <clz> 11840: 00050713 mv a4,a0 11844: 01f00793 li a5,31 11848: 40e787b3 sub a5,a5,a4 1184c: fcf42023 sw a5,-64(s0) 11850: 000127b7 lui a5,0x12 11854: d3478713 addi a4,a5,-716 # 11d34 <rsqrt_table> 11858: fc042783 lw a5,-64(s0) 1185c: 00279793 slli a5,a5,0x2 11860: 00f707b3 add a5,a4,a5 11864: 0007a783 lw a5,0(a5) 11868: fcf42623 sw a5,-52(s0) 1186c: fc042783 lw a5,-64(s0) 11870: 00100713 li a4,1 11874: 00f717b3 sll a5,a4,a5 11878: fac42703 lw a4,-84(s0) 1187c: 10e7f063 bgeu a5,a4,1197c <fast_rsqrt+0x1a4> 11880: fc042703 lw a4,-64(s0) 11884: 01e00793 li a5,30 11888: 02e7c463 blt a5,a4,118b0 <fast_rsqrt+0xd8> 1188c: fc042783 lw a5,-64(s0) 11890: 00178793 addi a5,a5,1 11894: 00012737 lui a4,0x12 11898: d3470713 addi a4,a4,-716 # 11d34 <rsqrt_table> 1189c: 00279793 slli a5,a5,0x2 118a0: 00f707b3 add a5,a4,a5 118a4: 0007a783 lw a5,0(a5) 118a8: fcf42423 sw a5,-56(s0) 118ac: 0080006f j 118b4 <fast_rsqrt+0xdc> 118b0: fc042423 sw zero,-56(s0) 118b4: fcc42703 lw a4,-52(s0) 118b8: fc842783 lw a5,-56(s0) 118bc: 40f707b3 sub a5,a4,a5 118c0: faf42e23 sw a5,-68(s0) 118c4: fac42783 lw a5,-84(s0) 118c8: 00078d13 mv s10,a5 118cc: 00000d93 li s11,0 118d0: fc042783 lw a5,-64(s0) 118d4: 00100713 li a4,1 118d8: 00f717b3 sll a5,a4,a5 118dc: faf42023 sw a5,-96(s0) 118e0: fa042223 sw zero,-92(s0) 118e4: fa042583 lw a1,-96(s0) 118e8: fa442603 lw a2,-92(s0) 118ec: 00058693 mv a3,a1 118f0: 40dd0733 sub a4,s10,a3 118f4: 00070693 mv a3,a4 118f8: 00dd36b3 sltu a3,s10,a3 118fc: 40cd87b3 sub a5,s11,a2 11900: 40d786b3 sub a3,a5,a3 11904: 00068793 mv a5,a3 11908: 01075693 srli a3,a4,0x10 1190c: 01079993 slli s3,a5,0x10 11910: 013689b3 add s3,a3,s3 11914: 01071913 slli s2,a4,0x10 11918: fc042783 lw a5,-64(s0) 1191c: fe078793 addi a5,a5,-32 11920: 0007c863 bltz a5,11930 <fast_rsqrt+0x158> 11924: 00f9da33 srl s4,s3,a5 11928: 00000a93 li s5,0 1192c: 02c0006f j 11958 <fast_rsqrt+0x180> 11930: 00199713 slli a4,s3,0x1 11934: 01f00693 li a3,31 11938: fc042783 lw a5,-64(s0) 1193c: 40f687b3 sub a5,a3,a5 11940: 00f717b3 sll a5,a4,a5 11944: fc042703 lw a4,-64(s0) 11948: 00e95a33 srl s4,s2,a4 1194c: 01478a33 add s4,a5,s4 11950: fc042783 lw a5,-64(s0) 11954: 00f9dab3 srl s5,s3,a5 11958: fb442c23 sw s4,-72(s0) 1195c: fb842583 lw a1,-72(s0) 11960: fbc42503 lw a0,-68(s0) 11964: 931fe0ef jal 10294 <__mulsi3> 11968: 00050793 mv a5,a0 1196c: 0107d793 srli a5,a5,0x10 11970: fcc42703 lw a4,-52(s0) 11974: 40f707b3 sub a5,a4,a5 11978: fcf42623 sw a5,-52(s0) 1197c: fc042223 sw zero,-60(s0) 11980: 0840006f j 11a04 <fast_rsqrt+0x22c> 11984: fcc42583 lw a1,-52(s0) 11988: fcc42503 lw a0,-52(s0) 1198c: c75ff0ef jal 11600 <mul32> 11990: 00050713 mv a4,a0 11994: 00058793 mv a5,a1 11998: fae42a23 sw a4,-76(s0) 1199c: fb442583 lw a1,-76(s0) 119a0: fac42503 lw a0,-84(s0) 119a4: c5dff0ef jal 11600 <mul32> 119a8: 00050713 mv a4,a0 119ac: 00058793 mv a5,a1 119b0: 01079693 slli a3,a5,0x10 119b4: 01075b13 srli s6,a4,0x10 119b8: 01668b33 add s6,a3,s6 119bc: 0107db93 srli s7,a5,0x10 119c0: fb642823 sw s6,-80(s0) 119c4: 00030737 lui a4,0x30 119c8: fb042783 lw a5,-80(s0) 119cc: 40f707b3 sub a5,a4,a5 119d0: 00078593 mv a1,a5 119d4: fcc42503 lw a0,-52(s0) 119d8: c29ff0ef jal 11600 <mul32> 119dc: 00050713 mv a4,a0 119e0: 00058793 mv a5,a1 119e4: 00f79693 slli a3,a5,0xf 119e8: 01175c13 srli s8,a4,0x11 119ec: 01868c33 add s8,a3,s8 119f0: 0117dc93 srli s9,a5,0x11 119f4: fd842623 sw s8,-52(s0) 119f8: fc442783 lw a5,-60(s0) 119fc: 00178793 addi a5,a5,1 11a00: fcf42223 sw a5,-60(s0) 11a04: fc442703 lw a4,-60(s0) 11a08: 00100793 li a5,1 11a0c: f6e7dce3 bge a5,a4,11984 <fast_rsqrt+0x1ac> 11a10: fcc42783 lw a5,-52(s0) 11a14: 00078513 mv a0,a5 11a18: 05c12083 lw ra,92(sp) 11a1c: 05812403 lw s0,88(sp) 11a20: 05412903 lw s2,84(sp) 11a24: 05012983 lw s3,80(sp) 11a28: 04c12a03 lw s4,76(sp) 11a2c: 04812a83 lw s5,72(sp) 11a30: 04412b03 lw s6,68(sp) 11a34: 04012b83 lw s7,64(sp) 11a38: 03c12c03 lw s8,60(sp) 11a3c: 03812c83 lw s9,56(sp) 11a40: 03412d03 lw s10,52(sp) 11a44: 03012d83 lw s11,48(sp) 11a48: 06010113 addi sp,sp,96 11a4c: 00008067 ret ``` >[time=Sat, Nov 8, 2025 2:45 PM] >**Update for checking the usage of `mul32`:** >Notice that there is a `__mulsi3` [:bulb: Link](https://hackmd.io/7CWlPeGeR9yvznJkRTx3WA?both#Optimized-version-of-mul32-__mulsi3) in above assembly code. In `fast_rsqrt.c`, I find this happens at `y -= (uint32_t)((delta * frac) >> 16);`, so I try to replace this line into `y -= (uint32_t)(mul32(delta, frac) >> 16);` to use handwritten `mul32`. And then, check the result of disassembling, `fast_rsqrt` won't use `__mulsi3` anymore. Due to more cycles that my handwritten `mul32` needs, so **some count statistics are larger** than [case 4](https://hackmd.io/7CWlPeGeR9yvznJkRTx3WA?both#Case-4-Do-lookup-table-linear-interpolation-and-2-times-Newton%E2%80%93Raphson-iteration): ```c= Test 3: fast_rsqrt (C code) from Q3-C. Test: fast_rsqrt In below error percentage, it might lose fractional part of error in pecentage due to udiv, so show the remainder to see. fast_rsqrt(1) = 65536 (exact: 65536), Approximately Error = 0%, Remainder = 0/655360, Cycles: 61, Instructions: 61 fast_rsqrt(4) = 32768 (exact: 32768), Approximately Error = 0%, Remainder = 0/327680, Cycles: 2783, Instructions: 2783 fast_rsqrt(16) = 16384 (exact: 16384), Approximately Error = 0%, Remainder = 0/163840, Cycles: 2789, Instructions: 2789 fast_rsqrt(20) = 14654 (exact: 14654.2), Approximately Error = 0%, Remainder = 200/146542, Cycles: 4696, Instructions: 4696 fast_rsqrt(100) = 6553 (exact: 6553.6), Approximately Error = 0%, Remainder = 600/65536, Cycles: 4482, Instructions: 4482 fast_rsqrt(258) = 4080 (exact: 4080), Approximately Error = 0%, Remainder = 0/40800, Cycles: 4540, Instructions: 4540 fast_rsqrt(650) = 2570 (exact: 2570.5), Approximately Error = 0%, Remainder = 500/25705, Cycles: 4251, Instructions: 4251 fast_rsqrt: PASSED Total Cycles: 221334 Total Instructions: 221334 ``` #### With `-Ofast` ```c= 000109ec <fast_rsqrt>: 109ec: 28050663 beqz a0,10c78 <fast_rsqrt+0x28c> 109f0: 00100793 li a5,1 109f4: 28f50863 beq a0,a5,10c84 <fast_rsqrt+0x298> 109f8: fe010113 addi sp,sp,-32 109fc: 00112e23 sw ra,28(sp) 10a00: 00812c23 sw s0,24(sp) 10a04: 000107b7 lui a5,0x10 10a08: 18f56463 bltu a0,a5,10b90 <fast_rsqrt+0x1a4> 10a0c: 00050793 mv a5,a0 10a10: 00800613 li a2,8 10a14: 00000693 li a3,0 10a18: 01000737 lui a4,0x1000 10a1c: 00e7f663 bgeu a5,a4,10a28 <fast_rsqrt+0x3c> 10a20: 00879793 slli a5,a5,0x8 10a24: 00060693 mv a3,a2 10a28: 10000737 lui a4,0x10000 10a2c: 00e7f663 bgeu a5,a4,10a38 <fast_rsqrt+0x4c> 10a30: 00468693 addi a3,a3,4 10a34: 00479793 slli a5,a5,0x4 10a38: 40000737 lui a4,0x40000 10a3c: 00050e13 mv t3,a0 10a40: 1ee7fc63 bgeu a5,a4,10c38 <fast_rsqrt+0x24c> 10a44: 00279713 slli a4,a5,0x2 10a48: 00268693 addi a3,a3,2 10a4c: 00074463 bltz a4,10a54 <fast_rsqrt+0x68> 10a50: 00168693 addi a3,a3,1 10a54: 01f00713 li a4,31 10a58: 40d70733 sub a4,a4,a3 10a5c: 00011637 lui a2,0x11 10a60: 00271593 slli a1,a4,0x2 10a64: fd860613 addi a2,a2,-40 # 10fd8 <rsqrt_table> 10a68: 00100793 li a5,1 10a6c: 00b605b3 add a1,a2,a1 10a70: 00e797b3 sll a5,a5,a4 10a74: 0005a883 lw a7,0(a1) 10a78: 07c7fc63 bgeu a5,t3,10af0 <fast_rsqrt+0x104> 10a7c: 02000593 li a1,32 10a80: 40d586b3 sub a3,a1,a3 10a84: 00269693 slli a3,a3,0x2 10a88: 00d60633 add a2,a2,a3 10a8c: 00062503 lw a0,0(a2) 10a90: 40a88533 sub a0,a7,a0 10a94: 40fe07b3 sub a5,t3,a5 10a98: 00fe36b3 sltu a3,t3,a5 10a9c: 40d006b3 neg a3,a3 10aa0: 0107d593 srli a1,a5,0x10 10aa4: 01069693 slli a3,a3,0x10 10aa8: fe070613 addi a2,a4,-32 # 3fffffe0 <__stack_top+0x3ffedf40> 10aac: 00d586b3 add a3,a1,a3 10ab0: 01079793 slli a5,a5,0x10 10ab4: 00c6d5b3 srl a1,a3,a2 10ab8: 00065e63 bgez a2,10ad4 <fast_rsqrt+0xe8> 10abc: 01f00613 li a2,31 10ac0: 00169693 slli a3,a3,0x1 10ac4: 40e60633 sub a2,a2,a4 10ac8: 00e7d5b3 srl a1,a5,a4 10acc: 00c697b3 sll a5,a3,a2 10ad0: 00b785b3 add a1,a5,a1 10ad4: 01c12623 sw t3,12(sp) 10ad8: 01112423 sw a7,8(sp) 10adc: 9a9ff0ef jal 10484 <__mulsi3> 10ae0: 00812883 lw a7,8(sp) 10ae4: 00c12e03 lw t3,12(sp) 10ae8: 01055793 srli a5,a0,0x10 10aec: 40f888b3 sub a7,a7,a5 10af0: 00200e93 li t4,2 10af4: 00100593 li a1,1 10af8: 01f00313 li t1,31 10afc: 02000813 li a6,32 10b00: 001e5513 srli a0,t3,0x1 10b04: 00030f37 lui t5,0x30 10b08: 00000293 li t0,0 10b0c: 00000793 li a5,0 10b10: fe078613 addi a2,a5,-32 # ffe0 <EXIT+0xff83> 10b14: 00f596b3 sll a3,a1,a5 10b18: 00f89733 sll a4,a7,a5 10b1c: 41f65613 srai a2,a2,0x1f 10b20: 0116f6b3 and a3,a3,a7 10b24: 00c77733 and a4,a4,a2 10b28: 00178793 addi a5,a5,1 10b2c: 00068463 beqz a3,10b34 <fast_rsqrt+0x148> 10b30: 00e282b3 add t0,t0,a4 10b34: fd079ee3 bne a5,a6,10b10 <fast_rsqrt+0x124> 10b38: 00000613 li a2,0 10b3c: 00000f93 li t6,0 10b40: 00000793 li a5,0 10b44: 0280006f j 10b6c <fast_rsqrt+0x180> 10b48: 00000713 li a4,0 10b4c: 00de16b3 sll a3,t3,a3 10b50: 00e60733 add a4,a2,a4 10b54: 00c733b3 sltu t2,a4,a2 10b58: 00df8fb3 add t6,t6,a3 10b5c: 00070613 mv a2,a4 10b60: 01f38fb3 add t6,t2,t6 10b64: 00178793 addi a5,a5,1 10b68: 03078c63 beq a5,a6,10ba0 <fast_rsqrt+0x1b4> 10b6c: 00f59733 sll a4,a1,a5 10b70: 00577733 and a4,a4,t0 10b74: fe078693 addi a3,a5,-32 10b78: fe0706e3 beqz a4,10b64 <fast_rsqrt+0x178> 10b7c: 40f30733 sub a4,t1,a5 10b80: fc06d4e3 bgez a3,10b48 <fast_rsqrt+0x15c> 10b84: 00e556b3 srl a3,a0,a4 10b88: 00fe1733 sll a4,t3,a5 10b8c: fc5ff06f j 10b50 <fast_rsqrt+0x164> 10b90: 01051793 slli a5,a0,0x10 10b94: 01800613 li a2,24 10b98: 01000693 li a3,16 10b9c: e7dff06f j 10a18 <fast_rsqrt+0x2c> 10ba0: 010f9f93 slli t6,t6,0x10 10ba4: 01065613 srli a2,a2,0x10 10ba8: 00cf8633 add a2,t6,a2 10bac: 40cf0633 sub a2,t5,a2 10bb0: 00000293 li t0,0 10bb4: 00000f93 li t6,0 10bb8: 00000793 li a5,0 10bbc: 0018d413 srli s0,a7,0x1 10bc0: 0280006f j 10be8 <fast_rsqrt+0x1fc> 10bc4: 00000713 li a4,0 10bc8: 00d896b3 sll a3,a7,a3 10bcc: 00e28733 add a4,t0,a4 10bd0: 005733b3 sltu t2,a4,t0 10bd4: 00df8fb3 add t6,t6,a3 10bd8: 00070293 mv t0,a4 10bdc: 01f38fb3 add t6,t2,t6 10be0: 00178793 addi a5,a5,1 10be4: 03078463 beq a5,a6,10c0c <fast_rsqrt+0x220> 10be8: 00f59733 sll a4,a1,a5 10bec: 00c77733 and a4,a4,a2 10bf0: fe078693 addi a3,a5,-32 10bf4: fe0706e3 beqz a4,10be0 <fast_rsqrt+0x1f4> 10bf8: 40f30733 sub a4,t1,a5 10bfc: fc06d4e3 bgez a3,10bc4 <fast_rsqrt+0x1d8> 10c00: 00e456b3 srl a3,s0,a4 10c04: 00f89733 sll a4,a7,a5 10c08: fc5ff06f j 10bcc <fast_rsqrt+0x1e0> 10c0c: 00ff9f93 slli t6,t6,0xf 10c10: 0112d293 srli t0,t0,0x11 10c14: 005f88b3 add a7,t6,t0 10c18: 00be8663 beq t4,a1,10c24 <fast_rsqrt+0x238> 10c1c: 00100e93 li t4,1 10c20: ee9ff06f j 10b08 <fast_rsqrt+0x11c> 10c24: 01c12083 lw ra,28(sp) 10c28: 01812403 lw s0,24(sp) 10c2c: 00088513 mv a0,a7 10c30: 02010113 addi sp,sp,32 10c34: 00008067 ret 10c38: e007dce3 bgez a5,10a50 <fast_rsqrt+0x64> 10c3c: 01f00513 li a0,31 10c40: 40d50733 sub a4,a0,a3 10c44: 00011637 lui a2,0x11 10c48: 00271593 slli a1,a4,0x2 10c4c: fd860613 addi a2,a2,-40 # 10fd8 <rsqrt_table> 10c50: 00100793 li a5,1 10c54: 00b605b3 add a1,a2,a1 10c58: 00e797b3 sll a5,a5,a4 10c5c: 0005a883 lw a7,0(a1) 10c60: e9c7f8e3 bgeu a5,t3,10af0 <fast_rsqrt+0x104> 10c64: e0069ce3 bnez a3,10a7c <fast_rsqrt+0x90> 10c68: 00050713 mv a4,a0 10c6c: 800007b7 lui a5,0x80000 10c70: 00088513 mv a0,a7 10c74: e21ff06f j 10a94 <fast_rsqrt+0xa8> 10c78: fff00893 li a7,-1 10c7c: 00088513 mv a0,a7 10c80: 00008067 ret 10c84: 000108b7 lui a7,0x10 10c88: ff5ff06f j 10c7c <fast_rsqrt+0x290> ``` #### With `-Os` ```c= 000109f0 <fast_rsqrt>: 109f0: fd010113 addi sp,sp,-48 109f4: 02812423 sw s0,40(sp) 109f8: 02112623 sw ra,44(sp) 109fc: 02912223 sw s1,36(sp) 10a00: 03212023 sw s2,32(sp) 10a04: 01312e23 sw s3,28(sp) 10a08: 01412c23 sw s4,24(sp) 10a0c: fff00413 li s0,-1 10a10: 12050863 beqz a0,10b40 <fast_rsqrt+0x150> 10a14: 000107b7 lui a5,0x10 10a18: 00100713 li a4,1 10a1c: 00078413 mv s0,a5 10a20: 12e50063 beq a0,a4,10b40 <fast_rsqrt+0x150> 10a24: 14f57063 bgeu a0,a5,10b64 <fast_rsqrt+0x174> 10a28: 01051793 slli a5,a0,0x10 10a2c: 01000713 li a4,16 10a30: 010006b7 lui a3,0x1000 10a34: 00d7f663 bgeu a5,a3,10a40 <fast_rsqrt+0x50> 10a38: 00870713 addi a4,a4,8 10a3c: 00879793 slli a5,a5,0x8 10a40: 100006b7 lui a3,0x10000 10a44: 00d7f663 bgeu a5,a3,10a50 <fast_rsqrt+0x60> 10a48: 00470713 addi a4,a4,4 10a4c: 00479793 slli a5,a5,0x4 10a50: 400006b7 lui a3,0x40000 10a54: 00d7f663 bgeu a5,a3,10a60 <fast_rsqrt+0x70> 10a58: 00270713 addi a4,a4,2 10a5c: 00279793 slli a5,a5,0x2 10a60: fff7c793 not a5,a5 10a64: 01f7d793 srli a5,a5,0x1f 10a68: 00f70733 add a4,a4,a5 10a6c: 00050493 mv s1,a0 10a70: 01f00513 li a0,31 10a74: 40e50533 sub a0,a0,a4 10a78: 000116b7 lui a3,0x11 10a7c: ea068693 addi a3,a3,-352 # 10ea0 <rsqrt_table> 10a80: 00251793 slli a5,a0,0x2 10a84: 00f687b3 add a5,a3,a5 10a88: 0007a403 lw s0,0(a5) # 10000 <_start> 10a8c: 00100793 li a5,1 10a90: 00a797b3 sll a5,a5,a0 10a94: 0297fe63 bgeu a5,s1,10ad0 <fast_rsqrt+0xe0> 10a98: 00000593 li a1,0 10a9c: 00070c63 beqz a4,10ab4 <fast_rsqrt+0xc4> 10aa0: 02000613 li a2,32 10aa4: 40e60733 sub a4,a2,a4 10aa8: 00271713 slli a4,a4,0x2 10aac: 00e686b3 add a3,a3,a4 10ab0: 0006a583 lw a1,0(a3) 10ab4: 40f487b3 sub a5,s1,a5 10ab8: 01079793 slli a5,a5,0x10 10abc: 40b405b3 sub a1,s0,a1 10ac0: 00a7d533 srl a0,a5,a0 10ac4: 96dff0ef jal 10430 <__mulsi3> 10ac8: 01055513 srli a0,a0,0x10 10acc: 40a40433 sub s0,s0,a0 10ad0: 00200913 li s2,2 10ad4: 000309b7 lui s3,0x30 10ad8: 00100a13 li s4,1 10adc: 00040593 mv a1,s0 10ae0: 00040513 mv a0,s0 10ae4: 00c10693 addi a3,sp,12 10ae8: 00810613 addi a2,sp,8 10aec: e1dff0ef jal 10908 <mul32> 10af0: 00c12583 lw a1,12(sp) 10af4: 00c10693 addi a3,sp,12 10af8: 00810613 addi a2,sp,8 10afc: 00048513 mv a0,s1 10b00: e09ff0ef jal 10908 <mul32> 10b04: 00812783 lw a5,8(sp) 10b08: 00e15583 lhu a1,14(sp) 10b0c: 00040513 mv a0,s0 10b10: 01079793 slli a5,a5,0x10 10b14: 00f5e5b3 or a1,a1,a5 10b18: 00c10693 addi a3,sp,12 10b1c: 00810613 addi a2,sp,8 10b20: 40b985b3 sub a1,s3,a1 10b24: de5ff0ef jal 10908 <mul32> 10b28: 00812783 lw a5,8(sp) 10b2c: 00c12403 lw s0,12(sp) 10b30: 00f79793 slli a5,a5,0xf 10b34: 01145413 srli s0,s0,0x11 10b38: 00f46433 or s0,s0,a5 10b3c: 03491a63 bne s2,s4,10b70 <fast_rsqrt+0x180> 10b40: 02c12083 lw ra,44(sp) 10b44: 00040513 mv a0,s0 10b48: 02812403 lw s0,40(sp) 10b4c: 02412483 lw s1,36(sp) 10b50: 02012903 lw s2,32(sp) 10b54: 01c12983 lw s3,28(sp) 10b58: 01812a03 lw s4,24(sp) 10b5c: 03010113 addi sp,sp,48 10b60: 00008067 ret 10b64: 00050793 mv a5,a0 10b68: 00000713 li a4,0 10b6c: ec5ff06f j 10a30 <fast_rsqrt+0x40> 10b70: 00100913 li s2,1 10b74: f69ff06f j 10adc <fast_rsqrt+0xec> ``` #### Observations and explain |name|number of lines| |--|--| |Case 4|159| |With `-Ofast`|169| |With `-Os`|**99**| With `-Os`, code size is the smallest. Compared to case 4, the number of registers (like saved register or `ra`) `fast_rsqrt` used is smaller because `-Ofast` case only needs stack to store 8 registers(`addi sp,sp,-32`). In case 4, it needs to store 24 registers(`addi sp,sp,-96`). In `-Os` case, callee stores 12 registers(`addi sp,sp,-48`). In `-Ofast` case, original `mul32` is be optimized into below assembly code: ```c= 00010484 <__mulsi3>: 10484: 00050713 mv a4,a0 10488: 00000513 li a0,0 1048c: 02058263 beqz a1,104b0 <__mulsi3+0x2c> 10490: 01f59693 slli a3,a1,0x1f 10494: 41f6d793 srai a5,a3,0x1f 10498: 00e7f7b3 and a5,a5,a4 1049c: 0015d593 srli a1,a1,0x1 104a0: 00f50533 add a0,a0,a5 104a4: 00171713 slli a4,a4,0x1 104a8: fe0594e3 bnez a1,10490 <__mulsi3+0xc> 104ac: 00008067 ret 104b0: 00008067 ret ``` ##### Optimized version of `mul32`: `__mulsi3` When compiling with optimization flags such as `-Ofast`, the compiler may replace manually written routines with built-in helper functions. In this case, our handcrafted `mul32` function, which implements a 32×32→64-bit multiplication using shifts and additions, **is replaced by the compiler-generated `__mulsi3` routine**. This function is part of libgcc and implements a software-emulated 32×32 multiply that returns a 64-bit result. The compiler performs this replacement because it recognizes the multiplication pattern and knows a tested, optimized implementation exists in libgcc. On RV32 targets, which do not have a native 64-bit multiplication instruction, `__mulsi3` provides a correct and efficient way to compute the result. The assembly of `__mulsi3` uses a shift-and-add loop to accumulate the product, similar in logic to our original `mul32`, but often more optimized for instruction scheduling and correctness. I notice that in `__mulsi3`, it all uses a0~a5 registers, so it didn't need to store saved registers into stack first, which also means that it didn't require the additional stack space to store the temporary values during the process. In below `mul32` function, the compiler uses the stack to store many local variables and intermediate results because RV32 has a limited number of registers, and this function performs a 32×32→64-bit multiplication that produces multiple intermediate values. Without using the stack, there wouldn’t be enough registers, which could lead to incorrect computation or make the function non-reentrant. Using the stack ensures correct computation, reentrancy, and also stores the return address and frame pointer. * original `mul32` assembly code: ```c= 00011600 <mul32>: 11600: fd010113 addi sp,sp,-48 11604: 02112623 sw ra,44(sp) 11608: 02812423 sw s0,40(sp) 1160c: 03010413 addi s0,sp,48 11610: fca42e23 sw a0,-36(s0) 11614: fcb42c23 sw a1,-40(s0) 11618: 00000513 li a0,0 1161c: 00000593 li a1,0 11620: fea42423 sw a0,-24(s0) 11624: feb42623 sw a1,-20(s0) 11628: fe042223 sw zero,-28(s0) 1162c: 09c0006f j 116c8 <mul32+0xc8> 11630: fe442583 lw a1,-28(s0) 11634: 00100513 li a0,1 11638: 00b51533 sll a0,a0,a1 1163c: fd842583 lw a1,-40(s0) 11640: 00b575b3 and a1,a0,a1 11644: 06058c63 beqz a1,116bc <mul32+0xbc> 11648: fdc42583 lw a1,-36(s0) 1164c: 00058613 mv a2,a1 11650: 00000693 li a3,0 11654: fe442583 lw a1,-28(s0) 11658: fe058593 addi a1,a1,-32 1165c: 0005c863 bltz a1,1166c <mul32+0x6c> 11660: 00b617b3 sll a5,a2,a1 11664: 00000713 li a4,0 11668: 02c0006f j 11694 <mul32+0x94> 1166c: 00165513 srli a0,a2,0x1 11670: 01f00813 li a6,31 11674: fe442583 lw a1,-28(s0) 11678: 40b805b3 sub a1,a6,a1 1167c: 00b555b3 srl a1,a0,a1 11680: fe442503 lw a0,-28(s0) 11684: 00a697b3 sll a5,a3,a0 11688: 00f587b3 add a5,a1,a5 1168c: fe442583 lw a1,-28(s0) 11690: 00b61733 sll a4,a2,a1 11694: fe842803 lw a6,-24(s0) 11698: fec42883 lw a7,-20(s0) 1169c: 00e80533 add a0,a6,a4 116a0: 00050313 mv t1,a0 116a4: 01033333 sltu t1,t1,a6 116a8: 00f885b3 add a1,a7,a5 116ac: 00b30833 add a6,t1,a1 116b0: 00080593 mv a1,a6 116b4: fea42423 sw a0,-24(s0) 116b8: feb42623 sw a1,-20(s0) 116bc: fe442583 lw a1,-28(s0) 116c0: 00158593 addi a1,a1,1 116c4: feb42223 sw a1,-28(s0) 116c8: fe442503 lw a0,-28(s0) 116cc: 01f00593 li a1,31 116d0: f6a5d0e3 bge a1,a0,11630 <mul32+0x30> 116d4: fe842703 lw a4,-24(s0) 116d8: fec42783 lw a5,-20(s0) 116dc: 00070513 mv a0,a4 116e0: 00078593 mv a1,a5 116e4: 02c12083 lw ra,44(sp) 116e8: 02812403 lw s0,40(sp) 116ec: 03010113 addi sp,sp,48 116f0: 00008067 ret ``` #### Optimize method: Use optimization flag to improve the code. >[!Note] >When doing optimization, some **printed message** command will be deleted. So, it won't show all messages. In order to show the numeric result, I put `__attribute__((optimize("O0")))` above the function `print_dec`, so I can see the value of result. Therefore, I think comparing total count isn't fair for non-optimized case. It's better to compare the single case's count which can clearly see the performance difference of `fast_rsqrt`. >In below result, their `fast_rsqrt` is doing case 4: **Do lookup table, linear interpolation, and ==2 times== Newton–Raphson iteration** ##### The statistics of program's execution: **1. Optimize by `-Ofast`** : ==Improve the cycle count.== * cycle and instruction count result ```c= 1 // case index from 1 to 7 65536 // fast_rsqrt's return value 25 // cycle count of calling fast_rsqrt 25 // instruction count of calling fast_rsqrt 2 32768 1483 1483 3 16384 43 43 4 14654 1169 1169 5 6553 40 40 6 4080 40 40 7 2570 207 207 51998 // total cycle count 51998 // total instruction count ``` The `fast_rsqrt`'s result are all correct, and **each case's count statistics are all smaller than [case 4](https://hackmd.io/7CWlPeGeR9yvznJkRTx3WA?both#Case-4-Do-lookup-table-linear-interpolation-and-2-times-Newton%E2%80%93Raphson-iteration) and `-Os` case**, which shows the effect of `-Ofast` optimization. * Display Section Sizes: by `riscv-none-elf-size test.elf` ``` text data bss dec hex filename 4184 59 4109 8352 20a0 test.elf ``` **2. Optimize by `-Os`** : ==Improve the code size.== ```c= 1 // case index from 1 to 7 65536 // fast_rsqrt's return value 40 // cycle count of calling fast_rsqrt 40 // instruction count of calling fast_rsqrt 2 32768 1635 1635 3 16384 1542 1542 4 14654 2481 2481 5 6553 2375 2375 6 4080 2065 2065 7 2570 2233 2233 58394 // total cycle count 58394 // total instruction count ``` The `fast_rsqrt`'s result are all correct, and each case's count statistics are all smaller than [case 4](https://hackmd.io/7CWlPeGeR9yvznJkRTx3WA?both#Case-4-Do-lookup-table-linear-interpolation-and-2-times-Newton%E2%80%93Raphson-iteration). * Display Section Sizes: by `riscv-none-elf-size test.elf` ``` text data bss dec hex filename 3872 59 4101 8032 1f60 test.elf ``` By comparing the [case 4](https://hackmd.io/7CWlPeGeR9yvznJkRTx3WA?both#Case-4-Do-lookup-table-linear-interpolation-and-2-times-Newton%E2%80%93Raphson-iteration), `-Ofast` case, and this case, this case **exists the smallest size in total**, which shows the effect of `-Os` optimization.