# Assignment2: Complete Applications
contributed by < [AnnTaiwan](https://github.com/AnnTaiwan/computer_arch/tree/main/caHW2) >
>[!Note] AI tools usage
>I use ChatGPT to assist with Quiz 2 & 3 by providing code explanations, grammar revisions, pre-work research, code summaries, and explanations of standard RISC-V instruction usage.
>[!Tip] Tutorial
>Follow instructions to build rv32emu : [Lab2: RISC-V Instruction Set Simulator and System Emulator](https://hackmd.io/@sysprog/Sko2Ja5pel)
>* I install on Ubuntu 24.04.3.
>
>[Assignment2: Complete Applications](https://hackmd.io/@sysprog/2025-arch-homework2)
## Pick one complete program (with test suite) done in Homework 1 and make it run in a bare-metal environment using rv32emu’s system emulation.
> I pick [Quiz 1-Problem B: uf8_t decode and encode](https://github.com/AnnTaiwan/ca2025-quizzes/blob/main/q1-uf8_rv32_mine_v1.s).
* [souce code for uf8_t decode and encode using rv32emu’s system emulation](https://github.com/AnnTaiwan/computer_arch/tree/main/caHW2/Q2)
* Function declaration in `main.c`:
(`CLZ`, `uf8_decode`, and `uf8_encode` are assembly code.)
```c=
/* ============= q1_uf8 Declaration ============= */
extern uint32_t uf8_decode(uint8_t in);
extern uint8_t uf8_encode(uint32_t in);
```
* Disassemble elf:
```c=
0001432c <CLZ>:
1432c: 02000293 li t0,32
14330: 01000313 li t1,16
00014334 <do_while>:
14334: 006553b3 srl t2,a0,t1
14338: 00038663 beqz t2,14344 <SHIFT_TOO_MUCH>
1433c: 406282b3 sub t0,t0,t1
14340: 00038533 add a0,t2,zero
00014344 <SHIFT_TOO_MUCH>:
14344: 00135313 srli t1,t1,0x1
14348: fe0316e3 bnez t1,14334 <do_while>
1434c: 40a28533 sub a0,t0,a0
14350: 00008067 ret
00014354 <uf8_decode>:
14354: 00f57293 andi t0,a0,15
14358: 00455313 srli t1,a0,0x4
1435c: 00f00393 li t2,15
14360: 406383b3 sub t2,t2,t1
14364: 00008e37 lui t3,0x8
14368: fffe0e13 addi t3,t3,-1 # 7fff <_start-0x8001>
1436c: 007e5e33 srl t3,t3,t2
14370: 004e1e13 slli t3,t3,0x4
14374: 006293b3 sll t2,t0,t1
14378: 01c38533 add a0,t2,t3
1437c: 00008067 ret
00014380 <uf8_encode>:
14380: ffc10113 addi sp,sp,-4
14384: 00112023 sw ra,0(sp)
14388: 00050fb3 add t6,a0,zero
1438c: 01000293 li t0,16
14390: 085fc863 blt t6,t0,14420 <RETURN>
14394: f99ff0ef jal 1432c <CLZ>
14398: 01f00293 li t0,31
1439c: 40a282b3 sub t0,t0,a0
143a0: 00050333 add t1,a0,zero
143a4: 00000393 li t2,0
143a8: 00000e13 li t3,0
143ac: 00500e93 li t4,5
143b0: 05d2c063 blt t0,t4,143f0 <Find_exact_exponent>
143b4: ffc28393 addi t2,t0,-4
143b8: 00f00e93 li t4,15
143bc: 007ed463 bge t4,t2,143c4 <Cal_overflow>
143c0: 00f00393 li t2,15
```
* Run result

It successfully passes all the tests.
## Tower of Hanoi - from Problem A of Quiz 2
[Quiz2 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz2-sol#Problem-A)
* **Souce code** for Tower of Hanoi using rv32emu’s system emulation:
* [Without optimization](https://github.com/AnnTaiwan/computer_arch/tree/main/caHW2/Q3)
* [With `-Ofast` optimization](https://github.com/AnnTaiwan/computer_arch/tree/main/caHW2/Q3_Ofast)
* [With `-Os` optimization](https://github.com/AnnTaiwan/computer_arch/tree/main/caHW2/Q3_Osize)
### Purpose
Move 3 disks from peg A to peg C, and print the actions.
### Flow by seeing gray code's different bit position
| Gray Code (3-bit) | Bit variation | Moving disk | Action |
| :-------------: | :-------: | :----: | :---: |
| 000 | Start | — | — |
| 001 | bit0 flip | A (smallese) | A → C |
| 011 | bit1 flip | B | A → B |
| 010 | bit0 flip | A | C → B |
| 110 | bit2 flip | C (biggest) | A → C |
| 111 | bit0 flip | A | B → A |
| 101 | bit1 flip | B | B → C |
| 100 | bit0 flip | A | A → C |
* Gray code formula: $Gray(n) = n\ \oplus \ (n>>1)$
* See different bit position: $diff\_bit\_pos = Gray(n)\ \oplus \ Gray(n-1)$
#### Original Code in quiz 2 - problem A
* register usage
| Step | Purpose | Register/Concept |
| -------- | ------------------ | ---------------------- |
| `x8` | Move counter | 1→7 |
| `x6` | Gray(n) | Current Gray code |
| `x7` | Gray(n-1) | Previous Gray code |
| `x5` | XOR diff | Which disk moves |
| `x9` | Disk index | 0=A,1=B,2=C |
| `x18` | Disk’s current peg | 0=A,1=B,2=C |
| `x19` | Disk’s next peg | computed target |
| `obdata` | Obfuscated pegs | Decoded to ‘A’,‘B’,‘C’ |
##### How to move disk 1, 2 to the pegs except the one that the smallest disk is at.
All pegs is $0,1,2$, so total sum is $0+1+2=3$. Current moving disk can't be on top of the smallest one and can't move to its current peg. Therefore, when moving the disks except the smallest one, the formula is $Target\ peg= 3-self's\ peg-smallest\ disk's\ peg$.
### Code written in assembly explanation
* Function declaration in `main.c`:
```c=
/* ============= Tower of Hanoi Declaration ============= */
// return 0 (stored in a0) if successful exit
extern uint32_t play_toh_v1(void);
extern uint32_t play_toh_v2(void);
extern uint32_t play_toh_v3(void);
```
#### Version 1 (Original assembly code from Q2-A)
I only adjust the `display_move` to apply the print format of rv32emu.
* Complete code
```c=
# Perform the tower of Hanoi.
# Disks: 0,1,2 (0 is the smallest one)
# Peg: A,B,C (Try to move all disks from peg A to C)
# Here is the original code from quiz 2 - A.
# I only adjust the `display_move` to apply the print format of rv32emu.
.text
.globl play_toh_v1
.type play_toh_v1,%function
.align 2
play_toh_v1:
addi x2, x2, -32
sw x8, 0(x2)
sw x9, 4(x2)
sw x18, 8(x2) # disk0's peg
sw x19, 12(x2) # disk1's peg
sw x20, 16(x2) # disk2's peg
li x5, 0x15
sw x5, 20(x2) # 20(x2) is disk 1's peg number
sw x5, 24(x2)
sw x5, 28(x2)
# Fix disk positions (BLANK 1-3: neutralize x5 effect)
# BLANK 1: Fix position at x2+20
sw x0, 20(x2)
# BLANK 2: Fix position at x2+24
sw x0, 24(x2)
# BLANK 3: Fix position at x2+28
sw x0, 28(x2)
addi x8, x0, 1
game_loop:
# BLANK 4: Check loop termination (2^3 moves)
addi x5, x0, 8
beq x8, x5, finish_game
# Gray code formula: gray(n) = n XOR (n >> k)
# BLANK 5: What is k for Gray code?
srli x5, x8, 1
# BLANK 6: Complete Gray(n) calculation
xor x6, x8, x5
# BLANK 7-8: Calculate previous value and its shift
addi x7, x8, -1
srli x28, x7, 1
# BLANK 9: Generate Gray(n-1)
xor x7, x7, x28
# BLANK 10: Which bits changed?
xor x5, x6, x7
# Initialize disk number
addi x9, x0, 0
# BLANK 11: Mask for testing LSB
andi x6, x5, 1
# BLANK 12: Branch if disk 0 moves
bne x6, x0, disk_found
# BLANK 13: Set disk 1
addi x9, x0, 1
# BLANK 14: Test second bit with proper mask
andi x6, x5, 2
bne x6, x0, disk_found
# BLANK 15: Last disk number
addi x9, x0, 2
disk_found:
# BLANK 16: Check impossible pattern (multiple bits)
andi x30, x5, 5
addi x31, x0, 5
beq x30, x31, pattern_match
jal x0, continue_move
pattern_match:
continue_move:
# BLANK 17: Word-align disk index (multiply by what?)
slli x5, x9, 2
# BLANK 18: Base offset for disk array
addi x5, x5, 20
add x5, x2, x5
lw x18, 0(x5)
bne x9, x0, handle_large
# BLANK 19: Small disk moves by how many positions?
addi x19, x18, 2
# BLANK 20: Number of pegs
addi x6, x0, 3
blt x19, x6, display_move
sub x19, x19, x6
jal x0, display_move
handle_large:
# BLANK 21: Load reference disk position
lw x6, 20(x2)
# BLANK 22: Sum of all peg indices (0+1+2)
addi x19, x0, 3
sub x19, x19, x18
sub x19, x19, x6
display_move:
la x20, obdata
add x5, x20, x18
# BLANK 23: Load byte from obfuscated data
lbu x13, 0(x5)
# BLANK 24: Decode constant (0x6F)
li x6, 0x6F
xor x13, x13, x6
# BLANK 25: Final offset adjustment
addi x13, x13, -0x12
add x7, x20, x19
lbu x14, 0(x7)
xor x14, x14, x6
addi x14, x14, -0x12
# la x10, str1
# addi x17, x0, 4
# ecall
add a7, x0, 0x40
add a0, x0, 0x1 # stdout
la a1, str1
li a2, 10 #length character
ecall
# addi x10, x9, 1
# addi x17, x0, 1
# ecall
addi a7, x9, 1 # x9 now is 0,1,2. In order to print 1,2,3 , add 1
la a1, disk # Find corresponding ascii code
add a1, a1, a7
add a7, x0, 0x40
add a0, x0, 0x1 # stdout
li a2, 1 #length character
ecall
# la x10, str2
# addi x17, x0, 4
# ecall
add a7, x0, 0x40
add a0, x0, 0x1 # stdout
la a1, str2
li a2, 6 #length character
ecall
# addi x10, x13, 0
# addi x17, x0, 11 # print char
# ecall
add a7, x0, 0x40
add a0, x0, 0x1 # stdout
la a1, peg
addi x13, x13, -65 # get the index from 0 to 2
add a1, a1, x13
li a2, 1 #length character
ecall
# la x10, str3
# addi x17, x0, 4
# ecall
add a7, x0, 0x40
add a0, x0, 0x1 # stdout
la a1, str3
li a2, 4 #length character
ecall
# addi x10, x14, 0
# addi x17, x0, 11
# ecall
add a7, x0, 0x40
add a0, x0, 0x1 # stdout
la a1, peg
addi x14, x14, -65 # get the index from 0 to 2
add a1, a1, x14
li a2, 1 #length character
ecall
# addi x10, x0, 10
# addi x17, x0, 11
# ecall
add a7, x0, 0x40
add a0, x0, 0x1 # stdout
la a1, nextline
li a2, 1 #length character
ecall
# BLANK 26: Calculate storage offset
slli x5, x9, 2
addi x5, x5, 20
add x5, x2, x5
# BLANK 27: Update disk position
sw x19, 0(x5)
# BLANK 28-29: Increment counter and loop
addi x8, x8, 1
jal x0, game_loop
finish_game:
lw x8, 0(x2)
lw x9, 4(x2)
lw x18, 8(x2)
lw x19, 12(x2)
lw x20, 16(x2)
addi x2, x2, 32
li a0, 0 # return 0 if successful exit
jr ra
# addi x17, x0, 10
# ecall
.data
obdata: .byte 0x3c, 0x3b, 0x3a
peg: .byte 65, 66, 67 # ascii code: 'A' 'B' 'C'
disk: .byte 48, 49, 50, 51 # ascii code: '0' '1' '2' '3'
str1: .ascii "Move Disk "
str2: .ascii " from "
str3: .ascii " to "
nextline: .ascii "\n"
```
For printing decimal number in charcter, '1', '2', '3', and disk character, "A", "B", "C", I use ASCII code to represent:
```c=
peg: .byte 65, 66, 67 # ascii code: 'A' 'B' 'C'
disk: .byte 48, 49, 50, 51 # ascii code: '0' '1' '2' '3'
str1: .ascii "Move Disk "
str2: .ascii " from "
str3: .ascii " to "
nextline: .ascii "\n"
```
#### Version 2 (handwritten code)
>[Commit bc36823](https://github.com/AnnTaiwan/computer_arch/commit/bc36823c08f84fbf1c5007ff7819d897e88058b3)
I use ABI names for usage of register in order to better read the code. There are some lines that needs to calculate the offset in stack to find the disk's peg number:
* version 1
```c=
# BLANK 17: Word-align disk index (multiply by what?)
slli x5, x9, 2
# BLANK 18: Base offset for disk array
addi x5, x5, 20
add x5, x2, x5
lw x18, 0(x5)
```
In version 1, it also needs to add offset `20`, so I stores disk's peg number at `sp[0]`, `sp[4]`, `sp[8]` to prevent that:
```c=
addi sp, sp, -20
sw x0, 0(sp) # disk 0's peg number
sw x0, 4(sp) # disk 1's peg number
sw x0, 8(sp) # disk 2's peg number
```
Therefore, I can directly multiply it by 4, and load data from stack.
```c=
slli t0, a5, 2 # see idx in sp, disk idx * 4
add t1, sp, t0
lw a6, 0(t1) # disk idx's current peg num
```
* Complete code
```c=
# Perform the tower of Hanoi.
# Disks: 0,1,2 (0 is the smallest one)
# Peg: A,B,C (Try to move all disks from peg A to C)
# Here is the adapted code from quiz 2 - A.
# Defined constants first
.equ STDOUT, 1
.equ WRITE, 64
.equ EXIT, 93
.text
.globl play_toh_v2
.type play_toh_v2,%function
.align 2
play_toh_v2:
addi sp, sp, -20
sw x0, 0(sp) # disk 0's peg number
sw x0, 4(sp) # disk 1's peg number
sw x0, 8(sp) # disk 2's peg number
sw ra, 12(sp) # ra
sw s0, 16(sp)
addi s0, x0, 1 # for loop counter, and it is also n in gray(n)
game_loop:
li t0, 8
beq t0, s0, finish_game
# Gray code formula: gray(n) = n XOR (n >> 1)
srli t0, s0, 1
xor a1, t0, s0 # gray(n)
addi t0, s0, -1 # n-1
srli t1, t0, 1
xor a2, t0, t1 # gray(n-1)
li a5, 0 # try to move which disk 0,1,2
xor a3, a1, a2 # diff_bit_pos
andi t0, a3, 1 # try to move smallest disk
beqz t0, handle_large
# handle_smallest
lw a6, 0(sp) # disk 0's current peg num
# find target disk: current disk idx +2 -3
addi a4, a6, 2
li t0, 3
blt a4, t0, display_action
sub a4, a4, t0 # -3
jal x0, display_action
handle_large:
addi a5, a5, 1 # middle disk 1
srli a3, a3, 1
andi t0, a3, 1
bnez t0, found_disk
addi a5, a5, 1 # largest disk 2
found_disk:
slli t0, a5, 2 # see idx in sp, disk idx * 4
add t1, sp, t0
lw a6, 0(t1) # disk idx's current peg num
# find target disk: 3-(self peg + smallest disk's peg)
lw t2, 0(sp) # smallest disk's peg
li a4, 3
sub a4, a4, a6
sub a4, a4, t2
jal x0, display_action
display_action:
# Move disk a5 from a6 to a4
li a7, WRITE
li a0, STDOUT # stdout
la a1, str1
li a2, 10 #length character
ecall
li a7, WRITE
li a0, STDOUT # stdout
la a1, disk
addi t0, a5, 1
add a1, a1, t0
li a2, 1 #length character
ecall
li a7, WRITE
li a0, STDOUT # stdout
la a1, str2
li a2, 6 #length character
ecall
li a7, WRITE
li a0, STDOUT # stdout
la a1, peg
add a1, a1, a6
li a2, 1 #length character
ecall
li a7, WRITE
li a0, STDOUT # stdout
la a1, str3
li a2, 4 #length character
ecall
li a7, WRITE
li a0, STDOUT # stdout
la a1, peg
add a1, a1, a4
li a2, 1 #length character
ecall
li a7, WRITE
li a0, STDOUT # stdout
la a1, nextline
li a2, 1 #length character
ecall
# Update disk position ( Move disk a5 from a6 to a4 )
slli t0, a5, 2 # * 4
add t0, sp, t0
sw a4, 0(t0)
# Increment counter and loop
addi s0, s0, 1
jal x0, game_loop
finish_game:
lw ra, 12(sp) # ra
lw s0, 16(sp)
addi sp, sp, 20
li a0, 0 # return 0 if successful exit
jr ra
.data
peg: .byte 65, 66, 67 # ascii code: 'A' 'B' 'C'
disk: .byte 48, 49, 50, 51 # ascii code: '0' '1' '2' '3'
str1: .ascii "Move Disk "
str2: .ascii " from "
str3: .ascii " to "
nextline: .ascii "\n"
```
#### Version 3 (Improved one from version 2 and final one)
>[Commit b1dff94](https://github.com/AnnTaiwan/computer_arch/commit/b1dff945cbb45e9fd3404559bb8afebe99e44101)
First looking at `version 2`, I check if there is anything fixed command in the `game_loop`, and remove it from the loop to execute just once.
* constants:
```c=
li t5, 8 # gameloop end situation
li t6, 3 # constant
```
* `gray(n-1)` is stored in `s1` to prevent from calculating again in the next loop. In each iteration, it only needs to calculate `gray(n)` once.
* When seeing which disks need to move by seeing bit variation of gray code, it will be search from the smallest disk to the biggest one. At each step, I add $4$ to `a5`(represent $n^{th}$ disk) in order to directly access `sp[a5]`, so I don't need to `slli a5, a5, 2` when calculating the offset.
```c=
addi a5, a5, 4 # largest disk 2
found_disk:
add t1, sp, a5
lw a6, 0(t1) # disk idx's current peg num
```
* complete code
```c=
# Perform the tower of Hanoi.
# Disks: 0,1,2 (0 is the smallest one)
# Peg: A,B,C (Try to move all disks from peg A to C)
# Here is the adapted code from quiz 2 - A.
# Defined constants first
.equ STDOUT, 1
.equ WRITE, 64
.equ EXIT, 93
.text
.globl play_toh_v3
.type play_toh_v3,%function
.align 2
play_toh_v3:
addi sp, sp, -24
sw x0, 0(sp) # disk 0's peg number
sw x0, 4(sp) # disk 1's peg number
sw x0, 8(sp) # disk 2's peg number
sw ra, 12(sp) # ra
sw s0, 16(sp)
sw s1, 20(sp)
# s0: n
# s1: store previous calculated gray(n-1)
li s1, 0 # gray(0)
li t5, 8 # gameloop end situation
li t6, 3 # constant
addi s0, x0, 1 # for loop counter, and it is also n in gray(n)
game_loop:
beq t5, s0, finish_game
# Gray code formula: gray(n) = n XOR (n >> 1)
srli t0, s0, 1
xor a1, t0, s0 # gray(n)
li a5, 0 # try to move which disk 0,1,2
xor a3, a1, s1 # diff_bit_pos
mv s1, a1 # update gray(n-1) for next round
andi t0, a3, 1 # try to move smallest disk
beqz t0, handle_large
# handle_smallest
lw a6, 0(sp) # disk 0's current peg num
# find target disk: current disk idx +2 -3
addi a4, a6, 2
blt a4, t6, display_action
sub a4, a4, t6 # -3
jal x0, display_action
handle_large:
addi a5, a5, 4 # middle disk 1, a5 now increases by 4, in order to align with sp[0], sp[4], sp[8]
andi t0, a3, 0x2 # bit flip 010: this is bit 1 flip
bnez t0, found_disk
addi a5, a5, 4 # largest disk 2
found_disk:
add t1, sp, a5
lw a6, 0(t1) # disk idx's current peg num
# find target disk: 3-(self peg + smallest disk's peg)
lw t2, 0(sp) # smallest disk's peg
sub a4, t6, a6
sub a4, a4, t2
jal x0, display_action
display_action:
# Move disk a5 from a6 to a4
li a7, WRITE
li a0, STDOUT # stdout
la a1, str1
li a2, 10 #length character
ecall
li a7, WRITE
li a0, STDOUT # stdout
la a1, disk
addi t0, a5, 4
add a1, a1, t0
li a2, 1 #length character
ecall
li a7, WRITE
li a0, STDOUT # stdout
la a1, str2
li a2, 6 #length character
ecall
li a7, WRITE
li a0, STDOUT # stdout
la a1, peg
add a1, a1, a6
li a2, 1 #length character
ecall
li a7, WRITE
li a0, STDOUT # stdout
la a1, str3
li a2, 4 #length character
ecall
li a7, WRITE
li a0, STDOUT # stdout
la a1, peg
add a1, a1, a4
li a2, 1 #length character
ecall
li a7, WRITE
li a0, STDOUT # stdout
la a1, nextline
li a2, 1 #length character
ecall
# Update disk position ( Move disk a5 from a6 to a4 )
add t0, sp, a5
sw a4, 0(t0)
# Increment counter and loop
addi s0, s0, 1
jal x0, game_loop
finish_game:
lw ra, 12(sp) # ra
lw s0, 16(sp)
lw s1, 20(sp)
addi sp, sp, 24
li a0, 0 # return 0 if successful exit
jr ra
.data
peg: .byte 65, 66, 67 # ascii code: 'A' 'B' 'C'
disk: .byte 48, 0, 0, 0, 49, 0, 0, 0, 50, 0, 0, 0, 51 # ascii code: '0' '1' '2' '3', 0 is buffer, not used
str1: .ascii "Move Disk "
str2: .ascii " from "
str3: .ascii " to "
nextline: .ascii "\n"
```
### Result
#### Ripes simulation (Original assembly code from Q2-A)


#### Version 1 (Original assembly code from Q2-A)
* count = 685

#### Version 2 (My handwritten code)
* count = 533

#### Version 3 (My handwritten code and improved one)(Final)
* ==count = 496==

### Disassemble the ELF files produced by the C compiler and contrast the handwritten and **compiler-optimized** assembly listings.
* Version 2
:::spoiler
```c=
00010878 <play_toh_v2>:
10878: fec10113 addi sp,sp,-20
1087c: 00012023 sw zero,0(sp)
10880: 00012223 sw zero,4(sp)
10884: 00012423 sw zero,8(sp)
10888: 00112623 sw ra,12(sp)
1088c: 00812823 sw s0,16(sp)
10890: 00100413 li s0,1
00010894 <game_loop>:
10894: 00800293 li t0,8
10898: 14828063 beq t0,s0,109d8 <finish_game>
1089c: 00145293 srli t0,s0,0x1
108a0: 0082c5b3 xor a1,t0,s0
108a4: fff40293 addi t0,s0,-1
108a8: 0012d313 srli t1,t0,0x1
108ac: 0062c633 xor a2,t0,t1
108b0: 00000793 li a5,0
108b4: 00c5c6b3 xor a3,a1,a2
108b8: 0016f293 andi t0,a3,1
108bc: 00028e63 beqz t0,108d8 <handle_large>
108c0: 00012803 lw a6,0(sp)
108c4: 00280713 addi a4,a6,2
108c8: 00300293 li t0,3
108cc: 04574063 blt a4,t0,1090c <display_action>
108d0: 40570733 sub a4,a4,t0
108d4: 0380006f j 1090c <display_action>
000108d8 <handle_large>:
108d8: 00178793 addi a5,a5,1
108dc: 0016d693 srli a3,a3,0x1
108e0: 0016f293 andi t0,a3,1
108e4: 00029463 bnez t0,108ec <found_disk>
108e8: 00178793 addi a5,a5,1
000108ec <found_disk>:
108ec: 00279293 slli t0,a5,0x2
108f0: 00510333 add t1,sp,t0
108f4: 00032803 lw a6,0(t1)
108f8: 00012383 lw t2,0(sp)
108fc: 00300713 li a4,3
10900: 41070733 sub a4,a4,a6
10904: 40770733 sub a4,a4,t2
10908: 0040006f j 1090c <display_action>
0001090c <display_action>:
1090c: 04000893 li a7,64
10910: 00100513 li a0,1
10914: 00000597 auipc a1,0x0
10918: 76a58593 addi a1,a1,1898 # 1107e <str1>
1091c: 00a00613 li a2,10
10920: 00000073 ecall
10924: 04000893 li a7,64
10928: 00100513 li a0,1
1092c: 00000597 auipc a1,0x0
10930: 74e58593 addi a1,a1,1870 # 1107a <disk>
10934: 00178293 addi t0,a5,1
10938: 005585b3 add a1,a1,t0
1093c: 00100613 li a2,1
10940: 00000073 ecall
10944: 04000893 li a7,64
10948: 00100513 li a0,1
1094c: 00000597 auipc a1,0x0
10950: 73c58593 addi a1,a1,1852 # 11088 <str2>
10954: 00600613 li a2,6
10958: 00000073 ecall
1095c: 04000893 li a7,64
10960: 00100513 li a0,1
10964: 00000597 auipc a1,0x0
10968: 71358593 addi a1,a1,1811 # 11077 <peg>
1096c: 010585b3 add a1,a1,a6
10970: 00100613 li a2,1
10974: 00000073 ecall
10978: 04000893 li a7,64
1097c: 00100513 li a0,1
10980: 00000597 auipc a1,0x0
10984: 70e58593 addi a1,a1,1806 # 1108e <str3>
10988: 00400613 li a2,4
1098c: 00000073 ecall
10990: 04000893 li a7,64
10994: 00100513 li a0,1
10998: 00000597 auipc a1,0x0
1099c: 6df58593 addi a1,a1,1759 # 11077 <peg>
109a0: 00e585b3 add a1,a1,a4
109a4: 00100613 li a2,1
109a8: 00000073 ecall
109ac: 04000893 li a7,64
109b0: 00100513 li a0,1
109b4: 00000597 auipc a1,0x0
109b8: 6de58593 addi a1,a1,1758 # 11092 <nextline>
109bc: 00100613 li a2,1
109c0: 00000073 ecall
109c4: 00279293 slli t0,a5,0x2
109c8: 005102b3 add t0,sp,t0
109cc: 00e2a023 sw a4,0(t0)
109d0: 00140413 addi s0,s0,1
109d4: ec1ff06f j 10894 <game_loop>
000109d8 <finish_game>:
109d8: 00c12083 lw ra,12(sp)
109dc: 01012403 lw s0,16(sp)
109e0: 01410113 addi sp,sp,20
109e4: 00000513 li a0,0
109e8: 00008067 ret
```
:::
* Version 3
:::spoiler
```c=
000117cc <play_toh_v3>:
117cc: fe810113 addi sp,sp,-24
117d0: 00012023 sw zero,0(sp)
117d4: 00012223 sw zero,4(sp)
117d8: 00012423 sw zero,8(sp)
117dc: 00112623 sw ra,12(sp)
117e0: 00812823 sw s0,16(sp)
117e4: 00912a23 sw s1,20(sp)
117e8: 00000493 li s1,0
117ec: 00800f13 li t5,8
117f0: 00300f93 li t6,3
117f4: 00100413 li s0,1
000117f8 <game_loop>:
117f8: 128f0263 beq t5,s0,1191c <finish_game>
117fc: 00145293 srli t0,s0,0x1
11800: 0082c5b3 xor a1,t0,s0
11804: 00000793 li a5,0
11808: 0095c6b3 xor a3,a1,s1
1180c: 00058493 mv s1,a1
11810: 0016f293 andi t0,a3,1
11814: 00028c63 beqz t0,1182c <handle_large>
11818: 00012803 lw a6,0(sp)
1181c: 00280713 addi a4,a6,2
11820: 03f74a63 blt a4,t6,11854 <display_action>
11824: 41f70733 sub a4,a4,t6
11828: 02c0006f j 11854 <display_action>
0001182c <handle_large>:
1182c: 00478793 addi a5,a5,4
11830: 0026f293 andi t0,a3,2
11834: 00029463 bnez t0,1183c <found_disk>
11838: 00478793 addi a5,a5,4
0001183c <found_disk>:
1183c: 00f10333 add t1,sp,a5
11840: 00032803 lw a6,0(t1)
11844: 00012383 lw t2,0(sp)
11848: 410f8733 sub a4,t6,a6
1184c: 40770733 sub a4,a4,t2
11850: 0040006f j 11854 <display_action>
00011854 <display_action>:
11854: 04000893 li a7,64
11858: 00100513 li a0,1
1185c: 00001597 auipc a1,0x1
11860: 94758593 addi a1,a1,-1721 # 121a3 <str1>
11864: 00a00613 li a2,10
11868: 00000073 ecall
1186c: 04000893 li a7,64
11870: 00100513 li a0,1
11874: 00001597 auipc a1,0x1
11878: 92258593 addi a1,a1,-1758 # 12196 <disk>
1187c: 00478293 addi t0,a5,4
11880: 005585b3 add a1,a1,t0
11884: 00100613 li a2,1
11888: 00000073 ecall
1188c: 04000893 li a7,64
11890: 00100513 li a0,1
11894: 00001597 auipc a1,0x1
11898: 91958593 addi a1,a1,-1767 # 121ad <str2>
1189c: 00600613 li a2,6
118a0: 00000073 ecall
118a4: 04000893 li a7,64
118a8: 00100513 li a0,1
118ac: 00001597 auipc a1,0x1
118b0: 8e758593 addi a1,a1,-1817 # 12193 <peg>
118b4: 010585b3 add a1,a1,a6
118b8: 00100613 li a2,1
118bc: 00000073 ecall
118c0: 04000893 li a7,64
118c4: 00100513 li a0,1
118c8: 00001597 auipc a1,0x1
118cc: 8eb58593 addi a1,a1,-1813 # 121b3 <str3>
118d0: 00400613 li a2,4
118d4: 00000073 ecall
118d8: 04000893 li a7,64
118dc: 00100513 li a0,1
118e0: 00001597 auipc a1,0x1
118e4: 8b358593 addi a1,a1,-1869 # 12193 <peg>
118e8: 00e585b3 add a1,a1,a4
118ec: 00100613 li a2,1
118f0: 00000073 ecall
118f4: 04000893 li a7,64
118f8: 00100513 li a0,1
118fc: 00001597 auipc a1,0x1
11900: 8bb58593 addi a1,a1,-1861 # 121b7 <nextline>
11904: 00100613 li a2,1
11908: 00000073 ecall
1190c: 00f102b3 add t0,sp,a5
11910: 00e2a023 sw a4,0(t0)
11914: 00140413 addi s0,s0,1
11918: ee1ff06f j 117f8 <game_loop>
0001191c <finish_game>:
1191c: 00c12083 lw ra,12(sp)
11920: 01012403 lw s0,16(sp)
11924: 01412483 lw s1,20(sp)
11928: 01810113 addi sp,sp,24
1192c: 00000513 li a0,0
11930: 00008067 ret
```
:::
#### Observations and explain
|name|number of lines|
|--|--|
|Version 2|104|
|Version 2 with `-Ofast`|104|
|Version 2 with `-Os`|104|
|Version 2|101|
|Version 2 with `-Ofast`|101|
|Version 2 with `-Os`|101|
If they are from same versions, the disassemble results are all the same, so the statistic results are similar. So, I just show the each example from each version.
#### Optimize method
* Add optimize flag in `CFLAGS` in `Makefile`
```c=
# optimized for speed
CFLAGS = -g -march=rv32i_zicsr -Ofast
# optimized for size
CFLAGS = -g -march=rv32i_zicsr -Os
```
* Add `__attribute__((optimize("O0")))` before `print_dec` in order to prevent it from optimizing, so I can see the cycle and instruction count messages.
```c=
__attribute__((optimize("O0")))
/* Simple integer to decimal string conversion */
static void print_dec(unsigned long val)
{
char buf[20];
char *p = buf + sizeof(buf) - 1;
*p = '\n';
p--;
if (val == 0) {
*p = '0';
p--;
} else {
while (val > 0) {
*p = '0' + umod(val, 10);
p--;
val = udiv(val, 10);
}
}
p++;
printstr(p, (buf + sizeof(buf) - p));
}
```
* Version 2 - Running result:
* with `-Ofast`:

* Count of cycles and instructions are 520.
* with `-Os`:

* Count of cycles and instructions are 520.
* Compared to `Version 2` result which is 533, count of cycles and instructions decreases due to optimization.
* Version 3 - Running result:
* with `-Ofast`:

* Count of cycles and instructions are 483.
* with `-Os`:

* Count of cycles and instructions are 483.
* Compared to `Version 3` result which is 496, count of cycles and instructions decreases due to optimization.
## `fast_rsqrt` - from Problem C of Quiz 3
[Quiz3 of Computer Architecture (2025 Fall)](https://hackmd.io/@sysprog/arch2025-quiz3)
* **Souce code** for `fast_rsqrt` using rv32emu’s system emulation:
* [Without optimization](https://github.com/AnnTaiwan/computer_arch/tree/main/caHW2/Q3)
* [With `-Ofast` optimization](https://github.com/AnnTaiwan/computer_arch/tree/main/caHW2/Q3_Ofast)
* [With `-Os` optimization](https://github.com/AnnTaiwan/computer_arch/tree/main/caHW2/Q3_Osize)
The goal of this problem is to approximate the **reciprocal square root function**
$$
y = \frac{1}{\sqrt{x}}
$$
using three progressively refined methods:
1. **Lookup table (LUT)** — precompute discrete values of ( y ) for selected ( x ).
2. **Linear interpolation** — estimate intermediate values between table entries.
3. **Newton–Raphson iteration** — apply iterative refinement to improve accuracy.
These techniques demonstrate the trade-offs between computational cost and numerical precision when implementing mathematical functions on resource-limited systems (e.g., embedded or bare-metal environments).
### My method
Follow the c code shown in quiz PDF Problem C of Quiz 3. Try to realize the error rate by using with or without `Newton-Raphson Iteration or linear interpolation` and see how the number of `Newton-Raphson Iteration` affects the result. Besides, using different optimization flag to see difference and explain the result.
### Test data
```c=
uint32_t in[] = {1, 4, 16, 20, 100, 258, 650};
char *s_exact_values[] = {
"65536",
"32768",
"16384",
"14654.2",
"6553.6",
"4080",
"2570.5"
};
uint32_t exact_values[] = { // multiply 10 for saving one-position fractional part
655360, // 65536 * 10
327680,
163840,
146542,
65536, // 6553.6 * 10
40800,
25705
};
```
* `in`: the `x` input to `fast_rsqrt`, `result = fast_rsqrt(x)`, stores all the test datas.
* `s_exact_values`: Hardcode the corresponding exact result values as `char*` strings in order to print them out.
* `exact_values`: the exact result value already timed by 10 in order to keep the one-position of fractional part, and it will be used to calculate error.
#### Test Method
Try to input all `in` data into `fast_rsqrt`, and get the `result`, use this `result` with the `exact_values[i]` to calculate the error in pecentage(%). Notice that my percentage can only appear in integer way, so the `remainder` after the percentage calculation will also be displayed. In the `remainder` part, the value after `/` indicates the divisor used when calculating the percentage, allowing the remainder to be displayed more clearly.
e.g.:
```c=
fast_rsqrt(2147483648) = 1 (exact: 1.4), Approximately Error = 28%, Remainder = 8/14, Cycles: 71, Instructions: 71
```
### Result
* The **total** cycle and instruction count includes the process of executing `fast_rsqrt`, printing messages, and calculating error, so I will print the specific cycle and instruction count of each test case in the end of that line.
#### Case 1: Just do lookup table

#### Case 2: Do lookup table and linear interpolation

* The error rate increases as the input data increases.
#### Case 3: Do lookup table, linear interpolation, and ==1 time== Newton–Raphson iteration

#### Case 4: Do lookup table, linear interpolation, and ==2 times== Newton–Raphson iteration

* With two times iteration, it seems it can decrease the error enough.
* Compared to case 2 and 3, the cycle and instruction count increases a lot.
* Display Section Sizes: by `riscv-none-elf-size test.elf`
```
text data bss dec hex filename
7604 59 4097 11760 2df0 test.elf
```
#### Case 5: Do lookup table, linear interpolation, and ==3 times== Newton–Raphson iteration

* It seems the result didn't change, but the cycle and instruction count increase a lot, which means 2 times Newton–Raphson iteration is enough.
#### Case 6: Do lookup table, and ==1 times== Newton–Raphson iteration
* Without linear interpolation

#### Case 7: Do lookup table, and ==2 times== Newton–Raphson iteration
* Without linear interpolation

#### Conclusion of result:
>Use **case number** shown above to represent.
|name|comparing result|
|--|--|
|cycle count|5 > 7 > 4 > 3 > 2 > 6 > 1|
|error rate|1 > 6 > 2 > 7 > 3 > 4 = 5|
I notice that **adding one more time Newton-Raphson iteration** can lead to significant increase in cycle and instruction count.
This happens because each additional iteration performs more chained multiplications. The execution time of `mul32` depends on the number of `1` bits in the operand — more `1`s result in more addition operations inside the loop. As Newton-Raphson refines the approximation, the intermediate values tend to have more active bits, increasing the workload for each multiplication. Consequently, each extra iteration significantly raises both cycle and instruction counts.
### Disassemble the ELF files produced by the C compiler
#### [Case 4](https://hackmd.io/7CWlPeGeR9yvznJkRTx3WA?both#Case-4-Do-lookup-table-linear-interpolation-and-2-times-Newton%E2%80%93Raphson-iteration)
```c=
000117d8 <fast_rsqrt>:
117d8: fa010113 addi sp,sp,-96
117dc: 04112e23 sw ra,92(sp)
117e0: 04812c23 sw s0,88(sp)
117e4: 05212a23 sw s2,84(sp)
117e8: 05312823 sw s3,80(sp)
117ec: 05412623 sw s4,76(sp)
117f0: 05512423 sw s5,72(sp)
117f4: 05612223 sw s6,68(sp)
117f8: 05712023 sw s7,64(sp)
117fc: 03812e23 sw s8,60(sp)
11800: 03912c23 sw s9,56(sp)
11804: 03a12a23 sw s10,52(sp)
11808: 03b12823 sw s11,48(sp)
1180c: 06010413 addi s0,sp,96
11810: faa42623 sw a0,-84(s0)
11814: fac42783 lw a5,-84(s0)
11818: 00079663 bnez a5,11824 <fast_rsqrt+0x4c>
1181c: fff00793 li a5,-1
11820: 1f40006f j 11a14 <fast_rsqrt+0x23c>
11824: fac42703 lw a4,-84(s0)
11828: 00100793 li a5,1
1182c: 00f71663 bne a4,a5,11838 <fast_rsqrt+0x60>
11830: 000107b7 lui a5,0x10
11834: 1e00006f j 11a14 <fast_rsqrt+0x23c>
11838: fac42503 lw a0,-84(s0)
1183c: eb9ff0ef jal 116f4 <clz>
11840: 00050713 mv a4,a0
11844: 01f00793 li a5,31
11848: 40e787b3 sub a5,a5,a4
1184c: fcf42023 sw a5,-64(s0)
11850: 000127b7 lui a5,0x12
11854: d3478713 addi a4,a5,-716 # 11d34 <rsqrt_table>
11858: fc042783 lw a5,-64(s0)
1185c: 00279793 slli a5,a5,0x2
11860: 00f707b3 add a5,a4,a5
11864: 0007a783 lw a5,0(a5)
11868: fcf42623 sw a5,-52(s0)
1186c: fc042783 lw a5,-64(s0)
11870: 00100713 li a4,1
11874: 00f717b3 sll a5,a4,a5
11878: fac42703 lw a4,-84(s0)
1187c: 10e7f063 bgeu a5,a4,1197c <fast_rsqrt+0x1a4>
11880: fc042703 lw a4,-64(s0)
11884: 01e00793 li a5,30
11888: 02e7c463 blt a5,a4,118b0 <fast_rsqrt+0xd8>
1188c: fc042783 lw a5,-64(s0)
11890: 00178793 addi a5,a5,1
11894: 00012737 lui a4,0x12
11898: d3470713 addi a4,a4,-716 # 11d34 <rsqrt_table>
1189c: 00279793 slli a5,a5,0x2
118a0: 00f707b3 add a5,a4,a5
118a4: 0007a783 lw a5,0(a5)
118a8: fcf42423 sw a5,-56(s0)
118ac: 0080006f j 118b4 <fast_rsqrt+0xdc>
118b0: fc042423 sw zero,-56(s0)
118b4: fcc42703 lw a4,-52(s0)
118b8: fc842783 lw a5,-56(s0)
118bc: 40f707b3 sub a5,a4,a5
118c0: faf42e23 sw a5,-68(s0)
118c4: fac42783 lw a5,-84(s0)
118c8: 00078d13 mv s10,a5
118cc: 00000d93 li s11,0
118d0: fc042783 lw a5,-64(s0)
118d4: 00100713 li a4,1
118d8: 00f717b3 sll a5,a4,a5
118dc: faf42023 sw a5,-96(s0)
118e0: fa042223 sw zero,-92(s0)
118e4: fa042583 lw a1,-96(s0)
118e8: fa442603 lw a2,-92(s0)
118ec: 00058693 mv a3,a1
118f0: 40dd0733 sub a4,s10,a3
118f4: 00070693 mv a3,a4
118f8: 00dd36b3 sltu a3,s10,a3
118fc: 40cd87b3 sub a5,s11,a2
11900: 40d786b3 sub a3,a5,a3
11904: 00068793 mv a5,a3
11908: 01075693 srli a3,a4,0x10
1190c: 01079993 slli s3,a5,0x10
11910: 013689b3 add s3,a3,s3
11914: 01071913 slli s2,a4,0x10
11918: fc042783 lw a5,-64(s0)
1191c: fe078793 addi a5,a5,-32
11920: 0007c863 bltz a5,11930 <fast_rsqrt+0x158>
11924: 00f9da33 srl s4,s3,a5
11928: 00000a93 li s5,0
1192c: 02c0006f j 11958 <fast_rsqrt+0x180>
11930: 00199713 slli a4,s3,0x1
11934: 01f00693 li a3,31
11938: fc042783 lw a5,-64(s0)
1193c: 40f687b3 sub a5,a3,a5
11940: 00f717b3 sll a5,a4,a5
11944: fc042703 lw a4,-64(s0)
11948: 00e95a33 srl s4,s2,a4
1194c: 01478a33 add s4,a5,s4
11950: fc042783 lw a5,-64(s0)
11954: 00f9dab3 srl s5,s3,a5
11958: fb442c23 sw s4,-72(s0)
1195c: fb842583 lw a1,-72(s0)
11960: fbc42503 lw a0,-68(s0)
11964: 931fe0ef jal 10294 <__mulsi3>
11968: 00050793 mv a5,a0
1196c: 0107d793 srli a5,a5,0x10
11970: fcc42703 lw a4,-52(s0)
11974: 40f707b3 sub a5,a4,a5
11978: fcf42623 sw a5,-52(s0)
1197c: fc042223 sw zero,-60(s0)
11980: 0840006f j 11a04 <fast_rsqrt+0x22c>
11984: fcc42583 lw a1,-52(s0)
11988: fcc42503 lw a0,-52(s0)
1198c: c75ff0ef jal 11600 <mul32>
11990: 00050713 mv a4,a0
11994: 00058793 mv a5,a1
11998: fae42a23 sw a4,-76(s0)
1199c: fb442583 lw a1,-76(s0)
119a0: fac42503 lw a0,-84(s0)
119a4: c5dff0ef jal 11600 <mul32>
119a8: 00050713 mv a4,a0
119ac: 00058793 mv a5,a1
119b0: 01079693 slli a3,a5,0x10
119b4: 01075b13 srli s6,a4,0x10
119b8: 01668b33 add s6,a3,s6
119bc: 0107db93 srli s7,a5,0x10
119c0: fb642823 sw s6,-80(s0)
119c4: 00030737 lui a4,0x30
119c8: fb042783 lw a5,-80(s0)
119cc: 40f707b3 sub a5,a4,a5
119d0: 00078593 mv a1,a5
119d4: fcc42503 lw a0,-52(s0)
119d8: c29ff0ef jal 11600 <mul32>
119dc: 00050713 mv a4,a0
119e0: 00058793 mv a5,a1
119e4: 00f79693 slli a3,a5,0xf
119e8: 01175c13 srli s8,a4,0x11
119ec: 01868c33 add s8,a3,s8
119f0: 0117dc93 srli s9,a5,0x11
119f4: fd842623 sw s8,-52(s0)
119f8: fc442783 lw a5,-60(s0)
119fc: 00178793 addi a5,a5,1
11a00: fcf42223 sw a5,-60(s0)
11a04: fc442703 lw a4,-60(s0)
11a08: 00100793 li a5,1
11a0c: f6e7dce3 bge a5,a4,11984 <fast_rsqrt+0x1ac>
11a10: fcc42783 lw a5,-52(s0)
11a14: 00078513 mv a0,a5
11a18: 05c12083 lw ra,92(sp)
11a1c: 05812403 lw s0,88(sp)
11a20: 05412903 lw s2,84(sp)
11a24: 05012983 lw s3,80(sp)
11a28: 04c12a03 lw s4,76(sp)
11a2c: 04812a83 lw s5,72(sp)
11a30: 04412b03 lw s6,68(sp)
11a34: 04012b83 lw s7,64(sp)
11a38: 03c12c03 lw s8,60(sp)
11a3c: 03812c83 lw s9,56(sp)
11a40: 03412d03 lw s10,52(sp)
11a44: 03012d83 lw s11,48(sp)
11a48: 06010113 addi sp,sp,96
11a4c: 00008067 ret
```
>[time=Sat, Nov 8, 2025 2:45 PM]
>**Update for checking the usage of `mul32`:**
>Notice that there is a `__mulsi3` [:bulb: Link](https://hackmd.io/7CWlPeGeR9yvznJkRTx3WA?both#Optimized-version-of-mul32-__mulsi3) in above assembly code. In `fast_rsqrt.c`, I find this happens at `y -= (uint32_t)((delta * frac) >> 16);`, so I try to replace this line into `y -= (uint32_t)(mul32(delta, frac) >> 16);` to use handwritten `mul32`. And then, check the result of disassembling, `fast_rsqrt` won't use `__mulsi3` anymore. Due to more cycles that my handwritten `mul32` needs, so **some count statistics are larger** than [case 4](https://hackmd.io/7CWlPeGeR9yvznJkRTx3WA?both#Case-4-Do-lookup-table-linear-interpolation-and-2-times-Newton%E2%80%93Raphson-iteration):
```c=
Test 3: fast_rsqrt (C code) from Q3-C.
Test: fast_rsqrt
In below error percentage, it might lose fractional part of error in pecentage due to udiv, so show the remainder to see.
fast_rsqrt(1) = 65536 (exact: 65536), Approximately Error = 0%, Remainder = 0/655360, Cycles: 61, Instructions: 61
fast_rsqrt(4) = 32768 (exact: 32768), Approximately Error = 0%, Remainder = 0/327680, Cycles: 2783, Instructions: 2783
fast_rsqrt(16) = 16384 (exact: 16384), Approximately Error = 0%, Remainder = 0/163840, Cycles: 2789, Instructions: 2789
fast_rsqrt(20) = 14654 (exact: 14654.2), Approximately Error = 0%, Remainder = 200/146542, Cycles: 4696, Instructions: 4696
fast_rsqrt(100) = 6553 (exact: 6553.6), Approximately Error = 0%, Remainder = 600/65536, Cycles: 4482, Instructions: 4482
fast_rsqrt(258) = 4080 (exact: 4080), Approximately Error = 0%, Remainder = 0/40800, Cycles: 4540, Instructions: 4540
fast_rsqrt(650) = 2570 (exact: 2570.5), Approximately Error = 0%, Remainder = 500/25705, Cycles: 4251, Instructions: 4251
fast_rsqrt: PASSED
Total Cycles: 221334
Total Instructions: 221334
```
#### With `-Ofast`
```c=
000109ec <fast_rsqrt>:
109ec: 28050663 beqz a0,10c78 <fast_rsqrt+0x28c>
109f0: 00100793 li a5,1
109f4: 28f50863 beq a0,a5,10c84 <fast_rsqrt+0x298>
109f8: fe010113 addi sp,sp,-32
109fc: 00112e23 sw ra,28(sp)
10a00: 00812c23 sw s0,24(sp)
10a04: 000107b7 lui a5,0x10
10a08: 18f56463 bltu a0,a5,10b90 <fast_rsqrt+0x1a4>
10a0c: 00050793 mv a5,a0
10a10: 00800613 li a2,8
10a14: 00000693 li a3,0
10a18: 01000737 lui a4,0x1000
10a1c: 00e7f663 bgeu a5,a4,10a28 <fast_rsqrt+0x3c>
10a20: 00879793 slli a5,a5,0x8
10a24: 00060693 mv a3,a2
10a28: 10000737 lui a4,0x10000
10a2c: 00e7f663 bgeu a5,a4,10a38 <fast_rsqrt+0x4c>
10a30: 00468693 addi a3,a3,4
10a34: 00479793 slli a5,a5,0x4
10a38: 40000737 lui a4,0x40000
10a3c: 00050e13 mv t3,a0
10a40: 1ee7fc63 bgeu a5,a4,10c38 <fast_rsqrt+0x24c>
10a44: 00279713 slli a4,a5,0x2
10a48: 00268693 addi a3,a3,2
10a4c: 00074463 bltz a4,10a54 <fast_rsqrt+0x68>
10a50: 00168693 addi a3,a3,1
10a54: 01f00713 li a4,31
10a58: 40d70733 sub a4,a4,a3
10a5c: 00011637 lui a2,0x11
10a60: 00271593 slli a1,a4,0x2
10a64: fd860613 addi a2,a2,-40 # 10fd8 <rsqrt_table>
10a68: 00100793 li a5,1
10a6c: 00b605b3 add a1,a2,a1
10a70: 00e797b3 sll a5,a5,a4
10a74: 0005a883 lw a7,0(a1)
10a78: 07c7fc63 bgeu a5,t3,10af0 <fast_rsqrt+0x104>
10a7c: 02000593 li a1,32
10a80: 40d586b3 sub a3,a1,a3
10a84: 00269693 slli a3,a3,0x2
10a88: 00d60633 add a2,a2,a3
10a8c: 00062503 lw a0,0(a2)
10a90: 40a88533 sub a0,a7,a0
10a94: 40fe07b3 sub a5,t3,a5
10a98: 00fe36b3 sltu a3,t3,a5
10a9c: 40d006b3 neg a3,a3
10aa0: 0107d593 srli a1,a5,0x10
10aa4: 01069693 slli a3,a3,0x10
10aa8: fe070613 addi a2,a4,-32 # 3fffffe0 <__stack_top+0x3ffedf40>
10aac: 00d586b3 add a3,a1,a3
10ab0: 01079793 slli a5,a5,0x10
10ab4: 00c6d5b3 srl a1,a3,a2
10ab8: 00065e63 bgez a2,10ad4 <fast_rsqrt+0xe8>
10abc: 01f00613 li a2,31
10ac0: 00169693 slli a3,a3,0x1
10ac4: 40e60633 sub a2,a2,a4
10ac8: 00e7d5b3 srl a1,a5,a4
10acc: 00c697b3 sll a5,a3,a2
10ad0: 00b785b3 add a1,a5,a1
10ad4: 01c12623 sw t3,12(sp)
10ad8: 01112423 sw a7,8(sp)
10adc: 9a9ff0ef jal 10484 <__mulsi3>
10ae0: 00812883 lw a7,8(sp)
10ae4: 00c12e03 lw t3,12(sp)
10ae8: 01055793 srli a5,a0,0x10
10aec: 40f888b3 sub a7,a7,a5
10af0: 00200e93 li t4,2
10af4: 00100593 li a1,1
10af8: 01f00313 li t1,31
10afc: 02000813 li a6,32
10b00: 001e5513 srli a0,t3,0x1
10b04: 00030f37 lui t5,0x30
10b08: 00000293 li t0,0
10b0c: 00000793 li a5,0
10b10: fe078613 addi a2,a5,-32 # ffe0 <EXIT+0xff83>
10b14: 00f596b3 sll a3,a1,a5
10b18: 00f89733 sll a4,a7,a5
10b1c: 41f65613 srai a2,a2,0x1f
10b20: 0116f6b3 and a3,a3,a7
10b24: 00c77733 and a4,a4,a2
10b28: 00178793 addi a5,a5,1
10b2c: 00068463 beqz a3,10b34 <fast_rsqrt+0x148>
10b30: 00e282b3 add t0,t0,a4
10b34: fd079ee3 bne a5,a6,10b10 <fast_rsqrt+0x124>
10b38: 00000613 li a2,0
10b3c: 00000f93 li t6,0
10b40: 00000793 li a5,0
10b44: 0280006f j 10b6c <fast_rsqrt+0x180>
10b48: 00000713 li a4,0
10b4c: 00de16b3 sll a3,t3,a3
10b50: 00e60733 add a4,a2,a4
10b54: 00c733b3 sltu t2,a4,a2
10b58: 00df8fb3 add t6,t6,a3
10b5c: 00070613 mv a2,a4
10b60: 01f38fb3 add t6,t2,t6
10b64: 00178793 addi a5,a5,1
10b68: 03078c63 beq a5,a6,10ba0 <fast_rsqrt+0x1b4>
10b6c: 00f59733 sll a4,a1,a5
10b70: 00577733 and a4,a4,t0
10b74: fe078693 addi a3,a5,-32
10b78: fe0706e3 beqz a4,10b64 <fast_rsqrt+0x178>
10b7c: 40f30733 sub a4,t1,a5
10b80: fc06d4e3 bgez a3,10b48 <fast_rsqrt+0x15c>
10b84: 00e556b3 srl a3,a0,a4
10b88: 00fe1733 sll a4,t3,a5
10b8c: fc5ff06f j 10b50 <fast_rsqrt+0x164>
10b90: 01051793 slli a5,a0,0x10
10b94: 01800613 li a2,24
10b98: 01000693 li a3,16
10b9c: e7dff06f j 10a18 <fast_rsqrt+0x2c>
10ba0: 010f9f93 slli t6,t6,0x10
10ba4: 01065613 srli a2,a2,0x10
10ba8: 00cf8633 add a2,t6,a2
10bac: 40cf0633 sub a2,t5,a2
10bb0: 00000293 li t0,0
10bb4: 00000f93 li t6,0
10bb8: 00000793 li a5,0
10bbc: 0018d413 srli s0,a7,0x1
10bc0: 0280006f j 10be8 <fast_rsqrt+0x1fc>
10bc4: 00000713 li a4,0
10bc8: 00d896b3 sll a3,a7,a3
10bcc: 00e28733 add a4,t0,a4
10bd0: 005733b3 sltu t2,a4,t0
10bd4: 00df8fb3 add t6,t6,a3
10bd8: 00070293 mv t0,a4
10bdc: 01f38fb3 add t6,t2,t6
10be0: 00178793 addi a5,a5,1
10be4: 03078463 beq a5,a6,10c0c <fast_rsqrt+0x220>
10be8: 00f59733 sll a4,a1,a5
10bec: 00c77733 and a4,a4,a2
10bf0: fe078693 addi a3,a5,-32
10bf4: fe0706e3 beqz a4,10be0 <fast_rsqrt+0x1f4>
10bf8: 40f30733 sub a4,t1,a5
10bfc: fc06d4e3 bgez a3,10bc4 <fast_rsqrt+0x1d8>
10c00: 00e456b3 srl a3,s0,a4
10c04: 00f89733 sll a4,a7,a5
10c08: fc5ff06f j 10bcc <fast_rsqrt+0x1e0>
10c0c: 00ff9f93 slli t6,t6,0xf
10c10: 0112d293 srli t0,t0,0x11
10c14: 005f88b3 add a7,t6,t0
10c18: 00be8663 beq t4,a1,10c24 <fast_rsqrt+0x238>
10c1c: 00100e93 li t4,1
10c20: ee9ff06f j 10b08 <fast_rsqrt+0x11c>
10c24: 01c12083 lw ra,28(sp)
10c28: 01812403 lw s0,24(sp)
10c2c: 00088513 mv a0,a7
10c30: 02010113 addi sp,sp,32
10c34: 00008067 ret
10c38: e007dce3 bgez a5,10a50 <fast_rsqrt+0x64>
10c3c: 01f00513 li a0,31
10c40: 40d50733 sub a4,a0,a3
10c44: 00011637 lui a2,0x11
10c48: 00271593 slli a1,a4,0x2
10c4c: fd860613 addi a2,a2,-40 # 10fd8 <rsqrt_table>
10c50: 00100793 li a5,1
10c54: 00b605b3 add a1,a2,a1
10c58: 00e797b3 sll a5,a5,a4
10c5c: 0005a883 lw a7,0(a1)
10c60: e9c7f8e3 bgeu a5,t3,10af0 <fast_rsqrt+0x104>
10c64: e0069ce3 bnez a3,10a7c <fast_rsqrt+0x90>
10c68: 00050713 mv a4,a0
10c6c: 800007b7 lui a5,0x80000
10c70: 00088513 mv a0,a7
10c74: e21ff06f j 10a94 <fast_rsqrt+0xa8>
10c78: fff00893 li a7,-1
10c7c: 00088513 mv a0,a7
10c80: 00008067 ret
10c84: 000108b7 lui a7,0x10
10c88: ff5ff06f j 10c7c <fast_rsqrt+0x290>
```
#### With `-Os`
```c=
000109f0 <fast_rsqrt>:
109f0: fd010113 addi sp,sp,-48
109f4: 02812423 sw s0,40(sp)
109f8: 02112623 sw ra,44(sp)
109fc: 02912223 sw s1,36(sp)
10a00: 03212023 sw s2,32(sp)
10a04: 01312e23 sw s3,28(sp)
10a08: 01412c23 sw s4,24(sp)
10a0c: fff00413 li s0,-1
10a10: 12050863 beqz a0,10b40 <fast_rsqrt+0x150>
10a14: 000107b7 lui a5,0x10
10a18: 00100713 li a4,1
10a1c: 00078413 mv s0,a5
10a20: 12e50063 beq a0,a4,10b40 <fast_rsqrt+0x150>
10a24: 14f57063 bgeu a0,a5,10b64 <fast_rsqrt+0x174>
10a28: 01051793 slli a5,a0,0x10
10a2c: 01000713 li a4,16
10a30: 010006b7 lui a3,0x1000
10a34: 00d7f663 bgeu a5,a3,10a40 <fast_rsqrt+0x50>
10a38: 00870713 addi a4,a4,8
10a3c: 00879793 slli a5,a5,0x8
10a40: 100006b7 lui a3,0x10000
10a44: 00d7f663 bgeu a5,a3,10a50 <fast_rsqrt+0x60>
10a48: 00470713 addi a4,a4,4
10a4c: 00479793 slli a5,a5,0x4
10a50: 400006b7 lui a3,0x40000
10a54: 00d7f663 bgeu a5,a3,10a60 <fast_rsqrt+0x70>
10a58: 00270713 addi a4,a4,2
10a5c: 00279793 slli a5,a5,0x2
10a60: fff7c793 not a5,a5
10a64: 01f7d793 srli a5,a5,0x1f
10a68: 00f70733 add a4,a4,a5
10a6c: 00050493 mv s1,a0
10a70: 01f00513 li a0,31
10a74: 40e50533 sub a0,a0,a4
10a78: 000116b7 lui a3,0x11
10a7c: ea068693 addi a3,a3,-352 # 10ea0 <rsqrt_table>
10a80: 00251793 slli a5,a0,0x2
10a84: 00f687b3 add a5,a3,a5
10a88: 0007a403 lw s0,0(a5) # 10000 <_start>
10a8c: 00100793 li a5,1
10a90: 00a797b3 sll a5,a5,a0
10a94: 0297fe63 bgeu a5,s1,10ad0 <fast_rsqrt+0xe0>
10a98: 00000593 li a1,0
10a9c: 00070c63 beqz a4,10ab4 <fast_rsqrt+0xc4>
10aa0: 02000613 li a2,32
10aa4: 40e60733 sub a4,a2,a4
10aa8: 00271713 slli a4,a4,0x2
10aac: 00e686b3 add a3,a3,a4
10ab0: 0006a583 lw a1,0(a3)
10ab4: 40f487b3 sub a5,s1,a5
10ab8: 01079793 slli a5,a5,0x10
10abc: 40b405b3 sub a1,s0,a1
10ac0: 00a7d533 srl a0,a5,a0
10ac4: 96dff0ef jal 10430 <__mulsi3>
10ac8: 01055513 srli a0,a0,0x10
10acc: 40a40433 sub s0,s0,a0
10ad0: 00200913 li s2,2
10ad4: 000309b7 lui s3,0x30
10ad8: 00100a13 li s4,1
10adc: 00040593 mv a1,s0
10ae0: 00040513 mv a0,s0
10ae4: 00c10693 addi a3,sp,12
10ae8: 00810613 addi a2,sp,8
10aec: e1dff0ef jal 10908 <mul32>
10af0: 00c12583 lw a1,12(sp)
10af4: 00c10693 addi a3,sp,12
10af8: 00810613 addi a2,sp,8
10afc: 00048513 mv a0,s1
10b00: e09ff0ef jal 10908 <mul32>
10b04: 00812783 lw a5,8(sp)
10b08: 00e15583 lhu a1,14(sp)
10b0c: 00040513 mv a0,s0
10b10: 01079793 slli a5,a5,0x10
10b14: 00f5e5b3 or a1,a1,a5
10b18: 00c10693 addi a3,sp,12
10b1c: 00810613 addi a2,sp,8
10b20: 40b985b3 sub a1,s3,a1
10b24: de5ff0ef jal 10908 <mul32>
10b28: 00812783 lw a5,8(sp)
10b2c: 00c12403 lw s0,12(sp)
10b30: 00f79793 slli a5,a5,0xf
10b34: 01145413 srli s0,s0,0x11
10b38: 00f46433 or s0,s0,a5
10b3c: 03491a63 bne s2,s4,10b70 <fast_rsqrt+0x180>
10b40: 02c12083 lw ra,44(sp)
10b44: 00040513 mv a0,s0
10b48: 02812403 lw s0,40(sp)
10b4c: 02412483 lw s1,36(sp)
10b50: 02012903 lw s2,32(sp)
10b54: 01c12983 lw s3,28(sp)
10b58: 01812a03 lw s4,24(sp)
10b5c: 03010113 addi sp,sp,48
10b60: 00008067 ret
10b64: 00050793 mv a5,a0
10b68: 00000713 li a4,0
10b6c: ec5ff06f j 10a30 <fast_rsqrt+0x40>
10b70: 00100913 li s2,1
10b74: f69ff06f j 10adc <fast_rsqrt+0xec>
```
#### Observations and explain
|name|number of lines|
|--|--|
|Case 4|159|
|With `-Ofast`|169|
|With `-Os`|**99**|
With `-Os`, code size is the smallest.
Compared to case 4, the number of registers (like saved register or `ra`) `fast_rsqrt` used is smaller because `-Ofast` case only needs stack to store 8 registers(`addi sp,sp,-32`). In case 4, it needs to store 24 registers(`addi sp,sp,-96`). In `-Os` case, callee stores 12 registers(`addi sp,sp,-48`).
In `-Ofast` case, original `mul32` is be optimized into below assembly code:
```c=
00010484 <__mulsi3>:
10484: 00050713 mv a4,a0
10488: 00000513 li a0,0
1048c: 02058263 beqz a1,104b0 <__mulsi3+0x2c>
10490: 01f59693 slli a3,a1,0x1f
10494: 41f6d793 srai a5,a3,0x1f
10498: 00e7f7b3 and a5,a5,a4
1049c: 0015d593 srli a1,a1,0x1
104a0: 00f50533 add a0,a0,a5
104a4: 00171713 slli a4,a4,0x1
104a8: fe0594e3 bnez a1,10490 <__mulsi3+0xc>
104ac: 00008067 ret
104b0: 00008067 ret
```
##### Optimized version of `mul32`: `__mulsi3`
When compiling with optimization flags such as `-Ofast`, the compiler may replace manually written routines with built-in helper functions. In this case, our handcrafted `mul32` function, which implements a 32×32→64-bit multiplication using shifts and additions, **is replaced by the compiler-generated `__mulsi3` routine**. This function is part of libgcc and implements a software-emulated 32×32 multiply that returns a 64-bit result.
The compiler performs this replacement because it recognizes the multiplication pattern and knows a tested, optimized implementation exists in libgcc. On RV32 targets, which do not have a native 64-bit multiplication instruction, `__mulsi3` provides a correct and efficient way to compute the result. The assembly of `__mulsi3` uses a shift-and-add loop to accumulate the product, similar in logic to our original `mul32`, but often more optimized for instruction scheduling and correctness.
I notice that in `__mulsi3`, it all uses a0~a5 registers, so it didn't need to store saved registers into stack first, which also means that it didn't require the additional stack space to store the temporary values during the process.
In below `mul32` function, the compiler uses the stack to store many local variables and intermediate results because RV32 has a limited number of registers, and this function performs a 32×32→64-bit multiplication that produces multiple intermediate values. Without using the stack, there wouldn’t be enough registers, which could lead to incorrect computation or make the function non-reentrant. Using the stack ensures correct computation, reentrancy, and also stores the return address and frame pointer.
* original `mul32` assembly code:
```c=
00011600 <mul32>:
11600: fd010113 addi sp,sp,-48
11604: 02112623 sw ra,44(sp)
11608: 02812423 sw s0,40(sp)
1160c: 03010413 addi s0,sp,48
11610: fca42e23 sw a0,-36(s0)
11614: fcb42c23 sw a1,-40(s0)
11618: 00000513 li a0,0
1161c: 00000593 li a1,0
11620: fea42423 sw a0,-24(s0)
11624: feb42623 sw a1,-20(s0)
11628: fe042223 sw zero,-28(s0)
1162c: 09c0006f j 116c8 <mul32+0xc8>
11630: fe442583 lw a1,-28(s0)
11634: 00100513 li a0,1
11638: 00b51533 sll a0,a0,a1
1163c: fd842583 lw a1,-40(s0)
11640: 00b575b3 and a1,a0,a1
11644: 06058c63 beqz a1,116bc <mul32+0xbc>
11648: fdc42583 lw a1,-36(s0)
1164c: 00058613 mv a2,a1
11650: 00000693 li a3,0
11654: fe442583 lw a1,-28(s0)
11658: fe058593 addi a1,a1,-32
1165c: 0005c863 bltz a1,1166c <mul32+0x6c>
11660: 00b617b3 sll a5,a2,a1
11664: 00000713 li a4,0
11668: 02c0006f j 11694 <mul32+0x94>
1166c: 00165513 srli a0,a2,0x1
11670: 01f00813 li a6,31
11674: fe442583 lw a1,-28(s0)
11678: 40b805b3 sub a1,a6,a1
1167c: 00b555b3 srl a1,a0,a1
11680: fe442503 lw a0,-28(s0)
11684: 00a697b3 sll a5,a3,a0
11688: 00f587b3 add a5,a1,a5
1168c: fe442583 lw a1,-28(s0)
11690: 00b61733 sll a4,a2,a1
11694: fe842803 lw a6,-24(s0)
11698: fec42883 lw a7,-20(s0)
1169c: 00e80533 add a0,a6,a4
116a0: 00050313 mv t1,a0
116a4: 01033333 sltu t1,t1,a6
116a8: 00f885b3 add a1,a7,a5
116ac: 00b30833 add a6,t1,a1
116b0: 00080593 mv a1,a6
116b4: fea42423 sw a0,-24(s0)
116b8: feb42623 sw a1,-20(s0)
116bc: fe442583 lw a1,-28(s0)
116c0: 00158593 addi a1,a1,1
116c4: feb42223 sw a1,-28(s0)
116c8: fe442503 lw a0,-28(s0)
116cc: 01f00593 li a1,31
116d0: f6a5d0e3 bge a1,a0,11630 <mul32+0x30>
116d4: fe842703 lw a4,-24(s0)
116d8: fec42783 lw a5,-20(s0)
116dc: 00070513 mv a0,a4
116e0: 00078593 mv a1,a5
116e4: 02c12083 lw ra,44(sp)
116e8: 02812403 lw s0,40(sp)
116ec: 03010113 addi sp,sp,48
116f0: 00008067 ret
```
#### Optimize method: Use optimization flag to improve the code.
>[!Note]
>When doing optimization, some **printed message** command will be deleted. So, it won't show all messages. In order to show the numeric result, I put `__attribute__((optimize("O0")))` above the function `print_dec`, so I can see the value of result. Therefore, I think comparing total count isn't fair for non-optimized case. It's better to compare the single case's count which can clearly see the performance difference of `fast_rsqrt`.
>In below result, their `fast_rsqrt` is doing case 4: **Do lookup table, linear interpolation, and ==2 times== Newton–Raphson iteration**
##### The statistics of program's execution:
**1. Optimize by `-Ofast`** : ==Improve the cycle count.==
* cycle and instruction count result
```c=
1 // case index from 1 to 7
65536 // fast_rsqrt's return value
25 // cycle count of calling fast_rsqrt
25 // instruction count of calling fast_rsqrt
2
32768
1483
1483
3
16384
43
43
4
14654
1169
1169
5
6553
40
40
6
4080
40
40
7
2570
207
207
51998 // total cycle count
51998 // total instruction count
```
The `fast_rsqrt`'s result are all correct, and **each case's count statistics are all smaller than [case 4](https://hackmd.io/7CWlPeGeR9yvznJkRTx3WA?both#Case-4-Do-lookup-table-linear-interpolation-and-2-times-Newton%E2%80%93Raphson-iteration) and `-Os` case**, which shows the effect of `-Ofast` optimization.
* Display Section Sizes: by `riscv-none-elf-size test.elf`
```
text data bss dec hex filename
4184 59 4109 8352 20a0 test.elf
```
**2. Optimize by `-Os`** : ==Improve the code size.==
```c=
1 // case index from 1 to 7
65536 // fast_rsqrt's return value
40 // cycle count of calling fast_rsqrt
40 // instruction count of calling fast_rsqrt
2
32768
1635
1635
3
16384
1542
1542
4
14654
2481
2481
5
6553
2375
2375
6
4080
2065
2065
7
2570
2233
2233
58394 // total cycle count
58394 // total instruction count
```
The `fast_rsqrt`'s result are all correct, and each case's count statistics are all smaller than [case 4](https://hackmd.io/7CWlPeGeR9yvznJkRTx3WA?both#Case-4-Do-lookup-table-linear-interpolation-and-2-times-Newton%E2%80%93Raphson-iteration).
* Display Section Sizes: by `riscv-none-elf-size test.elf`
```
text data bss dec hex filename
3872 59 4101 8032 1f60 test.elf
```
By comparing the [case 4](https://hackmd.io/7CWlPeGeR9yvznJkRTx3WA?both#Case-4-Do-lookup-table-linear-interpolation-and-2-times-Newton%E2%80%93Raphson-iteration), `-Ofast` case, and this case, this case **exists the smallest size in total**, which shows the effect of `-Os` optimization.