# Assignment 2: Complete Applications
Contributed by [rrrchii](https://github.com/rrrchii/ca_hw2)
## Lab 2: RISC-V Requirement
:::info
I am still getting familiar with bare-metal programming and RISC-V, so this report includes my basic notes.
:::
### Install Ubuntu 24.04 on WSL
The host machine runs Windows, and my only Linux experience so far is Ubuntu installed through WSL, so I use the same setup for Assignment 2.
### Bare-metal
Bare metal means running programs directly on the hardware without any operating system.
There is no Linux, no system calls, and no standard libraries, so all initialization, IO, and arithmetic routines must be implemented manually.
### Install rv32emu on Ubuntu 24.04 according to Lab 2 instructions
- `riscv-none-elf-objdump -d PATH/TO/xxx.elf`: Used to disassemble the ELF file back into assembly code.
- `source $HOME/riscv-none-elf-gcc/setenv`: Load the environment (do this each time you start a new shell).
- `build/rv32emu PATH/TO/xxx.elf`: Run a test program from the `rv32emu` root directory.
- `riscv-none-elf-readelf -h PATH/TO/xxx.elf`: Display header information.
- `riscv-none-elf-size PATH/TO/xxx.elf`: Display section size (the numbers are in bytes).
- `make`, `make clean`, `make run`: Execute these commands within the directory used to generate `.elf` and `.o` files. `make` compiles the files, `make clean` deletes the `.elf` and `.o` files for recompilation, and `make run` executes the `.elf` file (the Makefile must have the corresponding target commands).

The fields printed by `riscv-none-elf-readelf -h` are explained below:

| Field | Meaning | Plain-Language Explanation |
| :--- | :--- | :--- |
| **Magic** | `7f 45 4c 46` = `0x7F 'E' 'L' 'F'` | These four bytes are the magic number, confirming that this is an ELF file (just like how a JPEG starts with `FF D8 FF`). |
| **Class** | ELF32 | Indicates this is a **32-bit ELF**. RISC-V has 32-bit (RV32) and 64-bit (RV64) versions. |
| **Data** | 2’s complement, little endian | The data storage method. RISC-V uses **little endian**: the least significant byte is stored at the lowest memory address. |
| **Version** | 1 (current) | The version of the ELF specification, which is currently 1. |
| **OS/ABI** | UNIX - System V | Specifies which operating system ABI this ELF is compatible with. System V is the standard UNIX ABI. |
| **ABI Version** | 0 | The specific ABI version. Usually, 0 indicates the default. |
| **Type** | EXEC (Executable file) | This is an **executable file**, not a relocatable object file or a shared object. |
| **Machine** | RISC-V | Indicates that this ELF is intended to run on a **RISC-V processor**. |
| **Entry point address** | `0x0` | The starting address of execution. It is usually specified by the linker script (`linker.ld`), such as the `_start` label. In this example, `0x0` means starting from address 0 (a very simplified bare-metal environment). |
| **Start of program headers** | `52` | The Program Header Table starts at byte 52 of the file. It describes the "segments to be loaded." |
| **Start of section headers** | `4240` | The location of the Section Header Table. This describes each section (e.g., .text, .data, .bss). |
| **Flags** | `0x0` | RISC-V architecture flags. Usually 0 for pure RV32I. |
| **Size of this header** | 52 bytes | The fixed length of the ELF header is 52 bytes (for ELF32). |
| **Size of program headers** | 32 bytes | The length of each program header. |
| **Number of program headers** | 2 | This ELF has two "segments to be loaded": typically code/text + data. |
| **Size of section headers** | 40 bytes | The length of each section header. |
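| **Number of section headers** | 4 | There are 4 section headers in total. Index 0 is always the reserved null entry, so at most three of them describe real sections (e.g., `.text`, `.data` or `.bss`, and `.shstrtab`). |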
| **Section header string table index** | 3 | Indicates which section stores the section name string table. This allows the section names (e.g., `.text`) to be parsed. |
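
To make the byte offsets behind the table concrete, here is a minimal sketch that reads the same fields from the first 52 bytes of a little-endian ELF32 file. The names `elf32_summary_t`, `rd16`, `rd32`, and `parse_elf32_header` are hypothetical helpers introduced for illustration; the offsets follow the standard ELF32 header layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical summary of the ELF32 header fields shown by `readelf -h`. */
typedef struct {
    uint8_t elf_class; /* 1 = ELF32, 2 = ELF64 */
    uint8_t data;      /* 1 = little endian, 2 = big endian */
    uint16_t type;     /* 2 = EXEC */
    uint16_t machine;  /* 243 = RISC-V */
    uint32_t entry;    /* entry point address */
    uint32_t phoff;    /* start of program headers */
    uint32_t shoff;    /* start of section headers */
    uint16_t phnum, shnum, shstrndx;
} elf32_summary_t;

/* Little-endian readers: least significant byte at the lowest address. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t) (p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
           ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/* Returns 0 on success, -1 if buf is not a little-endian ELF32 header. */
static int parse_elf32_header(const uint8_t *buf, size_t len,
                              elf32_summary_t *out)
{
    static const uint8_t magic[4] = {0x7f, 'E', 'L', 'F'};
    if (len < 52 || memcmp(buf, magic, 4) != 0)
        return -1;
    out->elf_class = buf[4];
    out->data = buf[5];
    if (out->elf_class != 1 || out->data != 1)
        return -1;
    out->type = rd16(buf + 16);
    out->machine = rd16(buf + 18);
    out->entry = rd32(buf + 24);
    out->phoff = rd32(buf + 28);
    out->shoff = rd32(buf + 32);
    out->phnum = rd16(buf + 44);
    out->shnum = rd16(buf + 48);
    out->shstrndx = rd16(buf + 50);
    return 0;
}
```

Feeding it a header with the values from the table should give `phoff == 52`, `shoff == 4240`, `phnum == 2`, `shnum == 4`, and `shstrndx == 3`.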
### Reference
[The 101 of ELF files on Linux: Understanding and Analysis](https://linux-audit.com/elf-binaries-on-linux-understanding-and-analysis/)
## UF8 to UINT32 in HW1
### Modify `makefile`
- `OBJS = uf8_to_uint32_last_version.o`
### Modify `uf8_to_uint32_last_version.s`
- The makefile's pattern rule is `%.o: %.S` with the recipe `$(AS) $(AFLAGS) $< -o $@`, so it only matches sources whose extension is an uppercase `.S`.
I therefore renamed the file to `uf8_to_uint32_last_version.S` so that it is assembled correctly.
- Added the following directives at the beginning of the `uf8_decode` and `uf8_encode` functions (`func` stands for the function name):
```
.globl func # Make func public so other files can call it
.type func, %function # Tell the linker: func is a function, not a variable
.align 2 # Align the function boundary to 2^2 = 4 bytes
```
- Added the following directive at the end of the `uf8_decode` and `uf8_encode` functions:
```
# .size symbol, expression sets the size of the symbol
# Current position (.) minus the start position of 'func' -> total bytes occupied by the function
.size func, .-func
```
### Add Code to `main.c`
- Declare the `uf8_decode` and `uf8_encode` functions:
```c
extern uint32_t uf8_decode(uint8_t in);
extern uint8_t uf8_encode(uint32_t in);
```
- Integrate the test cases from Quiz 1 Problem B into `main.c`. Because `printf` is unavailable on bare metal, each `printf` call is rewritten with the existing `TEST_LOGGER`, `print_hex`, and `print_dec` helpers.
```c
static void test_uf8_to_uint32(void)
{
    TEST_LOGGER("Test: uf8_to_uint32\n");
    int32_t previous_value = -1;
    bool passed = true;

    /* Test every UF8 code from 0 to 255 */
    for (int i = 0; i < 256; i++) {
        uint8_t fl = i;
        int32_t value = uf8_decode(fl);
        uint8_t fl2 = uf8_encode(value);

        /* Round trip: decoding then encoding must return the original code */
        if (fl != fl2) {
            /* Replaces printf("%02x: produces value %d but encodes back to %02x\n",
             *                 fl, value, fl2); */
            print_hex(fl);
            TEST_LOGGER(": produces value ");
            print_dec(value);
            TEST_LOGGER(" but encodes back to ");
            print_hex(fl2);
            TEST_LOGGER("\n");
            passed = false;
        }

        /* Decoded values must be strictly increasing */
        if (value <= previous_value) {
            /* Replaces printf("%02x: value %d <= previous_value %d\n",
             *                 fl, value, previous_value); */
            print_hex(fl);
            TEST_LOGGER(": value ");
            print_dec(value);
            TEST_LOGGER(" <= previous_value ");
            print_dec(previous_value);
            TEST_LOGGER("\n");
            passed = false;
        }
        previous_value = value;
    }
    /* ... the result is reported based on `passed` ... */
}
```
- Add Test 6 (`uf8_to_uint32`) to `main`, measuring cycles and retired instructions in the same way as the earlier tests:
```c
/* Test 6: uf8_to_uint32 */
TEST_LOGGER("Test 6: uf8_to_uint32\n");
start_cycles = get_cycles();
start_instret = get_instret();
test_uf8_to_uint32();
end_cycles = get_cycles();
end_instret = get_instret();
cycles_elapsed = end_cycles - start_cycles;
instret_elapsed = end_instret - start_instret;
TEST_LOGGER(" Cycles: ");
print_dec((unsigned long) cycles_elapsed);
TEST_LOGGER(" Instructions: ");
print_dec((unsigned long) instret_elapsed);
```
### Result
- Bare metal
```
Test 6: uf8_to_uint32
Test: uf8_to_uint32
PASSED
Cycles: 31667
Instructions: 31667
```
- RISC-V from HW1

### Compare `-Os` and `-Ofast` with original compiler
#### Issue
- Phenomenon
  - Output is correct at the **-O0** optimization level.
  - Output of `printstr` is corrupted at the **-O2** and **-Ofast** optimization levels.
```
# wrong output
��
# correct output
0: value 0 <= previous_value 1015792
```
- Root Cause Analysis
  - Before performing the system call, the program **stores all characters of the output string into a local buffer**. The inline assembly that triggers the `write` syscall receives only the pointer to this buffer, and the syscall reads the contents directly from that memory region. Therefore, the stores that fill the buffer must happen before the assembly statement. If the `"memory"` clobber is missing, the compiler does not know that the assembly reads the buffer, so it is free to **reorder or even delete those stores**, leading to corrupted output at higher optimization levels.
- Solution and Evidence
  - Add `"memory"` to the asm clobber list, informing the compiler that the assembly might read or write memory.
  - The compiler then keeps the stores and orders them before the assembly statement.
  - Disassembling and comparing the two ELF files: the version with `"memory"` contains the `sb` instructions that fill the buffer (marked with `*` in the listing), while the version without `"memory"` contains none. This directly demonstrates the difference in optimization behavior.
```
# without "memory", -Ofast
0001003c <print_dec>:
1003c: fe010113 addi sp,sp,-32
10040: 01e10e13 addi t3,sp,30
10044: 04050e63 beqz a0,100a0 <print_dec+0x64>
10048: 000e0313 mv t1,t3
1004c: 00900893 li a7,9
10050: 00100813 li a6,1
10054: fff00593 li a1,-1
10058: 00030e13 mv t3,t1
1005c: 00000613 li a2,0
10060: fff30313 addi t1,t1,-1
10064: 01f00713 li a4,31
10068: 00000793 li a5,0
1006c: 00e556b3 srl a3,a0,a4
10070: 0016f693 andi a3,a3,1
10074: 00179793 slli a5,a5,0x1
10078: 00f6e7b3 or a5,a3,a5
1007c: 00e816b3 sll a3,a6,a4
10080: fff70713 addi a4,a4,-1
10084: 00f8f663 bgeu a7,a5,10090 <print_dec+0x54>
10088: ff678793 addi a5,a5,-10
1008c: 00d66633 or a2,a2,a3
10090: fcb71ee3 bne a4,a1,1006c <print_dec+0x30>
10094: 00060663 beqz a2,100a0 <print_dec+0x64>
10098: 00060513 mv a0,a2
1009c: fbdff06f j 10058 <print_dec+0x1c>
100a0: 02010793 addi a5,sp,32
100a4: 41c787b3 sub a5,a5,t3
100a8: 04000893 li a7,64
100ac: 00100513 li a0,1
100b0: 01c005b3 add a1,zero,t3
100b4: 00078613 mv a2,a5
100b8: 00000073 ecall
100bc: 02010113 addi sp,sp,32
100c0: 00008067 ret
# with "memory", -Ofast
0001003c <print_dec>:
1003c: fe010113 addi sp,sp,-32
10040: 00a00793 li a5,10
10044: 00f10fa3 sb a5,31(sp) *
10048: 02051a63 bnez a0,1007c <print_dec+0x40>
1004c: 03000793 li a5,48
10050: 00f10f23 sb a5,30(sp) *
10054: 01e10313 addi t1,sp,30
10058: 02010793 addi a5,sp,32
1005c: 406787b3 sub a5,a5,t1
10060: 04000893 li a7,64
10064: 00100513 li a0,1
10068: 006005b3 add a1,zero,t1
1006c: 00078613 mv a2,a5
10070: 00000073 ecall
10074: 02010113 addi sp,sp,32
10078: 00008067 ret
1007c: 01e10313 addi t1,sp,30
10080: 00900593 li a1,9
10084: fff00613 li a2,-1
10088: 00100893 li a7,1
1008c: 01f00713 li a4,31
10090: 00000793 li a5,0
10094: 00e556b3 srl a3,a0,a4
10098: 00179793 slli a5,a5,0x1
1009c: 0016f693 andi a3,a3,1
100a0: 00f6e7b3 or a5,a3,a5
100a4: fff70713 addi a4,a4,-1
100a8: 00f5f463 bgeu a1,a5,100b0 <print_dec+0x74>
100ac: ff678793 addi a5,a5,-10
100b0: fec712e3 bne a4,a2,10094 <print_dec+0x58>
100b4: 03078793 addi a5,a5,48
100b8: 00f30023 sb a5,0(t1) *
100bc: fff30e13 addi t3,t1,-1
100c0: 00000813 li a6,0
100c4: 01f00713 li a4,31
100c8: 00000793 li a5,0
100cc: 00e556b3 srl a3,a0,a4
100d0: 0016f693 andi a3,a3,1
100d4: 00179793 slli a5,a5,0x1
100d8: 00f6e7b3 or a5,a3,a5
100dc: 00e896b3 sll a3,a7,a4
100e0: fff70713 addi a4,a4,-1
100e4: 00f5f663 bgeu a1,a5,100f0 <print_dec+0xb4>
100e8: ff678793 addi a5,a5,-10
100ec: 00d86833 or a6,a6,a3
100f0: fcc71ee3 bne a4,a2,100cc <print_dec+0x90>
100f4: f60802e3 beqz a6,10058 <print_dec+0x1c>
100f8: 000e0313 mv t1,t3
100fc: 00080513 mv a0,a6
10100: f8dff06f j 1008c <print_dec+0x50>
```
- Experimental Result: After adding `"memory"`, the output is correct for all optimization levels (**-O2**, **-Ofast**).
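
The pattern discussed above can be sketched as follows. This is a minimal illustration and not the code in `main.c`: `sketch_write` and `sketch_print_dec` are hypothetical names, and the non-RISC-V branch (which falls back to the host's `write`) exists only so the sketch can be compiled and tried on a development machine. The sketch also uses `%` and `/`, which on RV32I without the M extension rely on software division routines.

```c
#include <stddef.h>

#if !defined(__riscv)
#include <unistd.h> /* host fallback only */
#endif

/* Hypothetical wrapper around the Linux-style write syscall (a7 = 64). */
static long sketch_write(int fd, const void *buf, size_t len)
{
#if defined(__riscv)
    register long a0 __asm__("a0") = fd;
    register long a1 __asm__("a1") = (long) buf;
    register long a2 __asm__("a2") = (long) len;
    register long a7 __asm__("a7") = 64;
    /* "memory" tells the compiler that the asm may read the buffer, so the
     * stores that fill it cannot be deleted or moved after the ecall. */
    __asm__ volatile("ecall"
                     : "+r"(a0)
                     : "r"(a1), "r"(a2), "r"(a7)
                     : "memory");
    return a0;
#else
    return (long) write(fd, buf, len);
#endif
}

/* Fills a local buffer from the end, then hands it to the syscall.
 * Returns the number of bytes written (digits plus the newline). */
static long sketch_print_dec(unsigned long val)
{
    char buf[24];
    char *p = buf + sizeof(buf);

    *--p = '\n';
    do {
        *--p = (char) ('0' + val % 10);
        val /= 10;
    } while (val != 0);

    return sketch_write(1, p, (size_t) (buf + sizeof(buf) - p));
}
```

Without the clobber, the only thing the compiler sees is that the asm uses the pointer value in `a1`; nothing tells it that the bytes behind the pointer matter, which is why the `sb` instructions disappeared in the listing above.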
#### cycle/instruction
```
# -O0
Cycles: 31667
Instructions: 31667
# -Ofast
Cycles: 28318
Instructions: 28318
# -Os
Cycles: 29485
Instructions: 29485
```
- Speed: **-Ofast** > **-Os** > **-O0**
  - **-Ofast** needs about **10.6% fewer cycles** than **-O0** (28318 vs. 31667).
  - **-Os** needs about **6.9% fewer cycles** than **-O0** (29485 vs. 31667).
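
The percentages above are the relative reduction in cycle count with respect to **-O0**:

$$ 1 - \frac{28318}{31667} \approx 0.106 \quad (\text{-Ofast}), \qquad 1 - \frac{29485}{31667} \approx 0.069 \quad (\text{-Os}) $$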
#### Code size
```
# -O0
text data bss dec hex filename
9400 0 4102 13502 34be test.elf
# -Ofast
text data bss dec hex filename
3012 0 4108 7120 1bd0 test.elf
# -Os
text data bss dec hex filename
1850 0 4102 5952 1740 test.elf
```
- Section `.text`
  - Code size: **-Os** (1850 bytes) < **-Ofast** (3012 bytes) < **-O0** (9400 bytes)
- Section `.bss`
  - Optimization barely affects this section (4102 bytes for **-O0** and **-Os**, 4108 bytes for **-Ofast**).
#### Discussion
- **Code Size**: What did **-Os** optimize to reduce the size?
  - A clear difference shows up in how `__mulsi3` is compiled.
```assembly
# -O0 : Redundant STACK operations and unnecessary CALL
00010294 <__mulsi3>:
10294: fe010113 addi sp,sp,-32
10298: 00112e23 sw ra,28(sp)
1029c: 00812c23 sw s0,24(sp)
102a0: 02010413 addi s0,sp,32
// Save a0, a1 to stack, then read them back
102a4: fea42623 sw a0,-20(s0)
102a8: feb42423 sw a1,-24(s0)
102ac: fe842583 lw a1,-24(s0)
102b0: fec42503 lw a0,-20(s0)
// Call another function `umul` to perform the actual multiplication
102b4: f6dff0ef jal 10220 <umul>
102b8: 00050793 mv a5,a0
102bc: 00078513 mv a0,a5
102c0: 01c12083 lw ra,28(sp)
102c4: 01812403 lw s0,24(sp)
102c8: 02010113 addi sp,sp,32
102cc: 00008067 ret
# -Os : No need for STACK, compilation completed using existing REGISTERS
0001019c <__mulsi3>:
// Directly write a small multiplication loop using shift-add
1019c: 00050793 mv a5,a0
101a0: 00000513 li a0,0
101a4: 00059463 bnez a1,101ac <__mulsi3+0x10>
101a8: 00008067 ret
101ac: 0015f713 andi a4,a1,1
101b0: 40e00733 neg a4,a4
101b4: 00f77733 and a4,a4,a5
101b8: 00e50533 add a0,a0,a4
101bc: 00179793 slli a5,a5,0x1
101c0: 0015d593 srli a1,a1,0x1
101c4: fe1ff06f j 101a4 <__mulsi3+0x8>
```
- Under **-O0**, `a0` and `a1` are stored to the stack and immediately loaded back, and the actual multiplication is delegated to a separate `umul` call.
- Under **-Os**, the entire stack frame disappears: the compiler judged the function small enough to keep everything in registers, so the redundant memory operations are removed completely and the shift-add loop is emitted directly.
- **Speed**: What did **-Ofast** optimize to make it faster?
  - The difference was again found in how `__mulsi3` is compiled.
  - Under **-O0**, a stack frame is created and an additional `umul` function is called. Under both **-Os** and **-Ofast**, the calculation uses shift-add and only registers. The main difference is the loop layout: **-Os** tests the condition at the top of the loop and jumps back with an unconditional `j`, so each iteration executes 8 instructions, while **-Ofast** tests it at the bottom with `bnez`, so each iteration executes 7. The gain is small, so I believe this example doesn't clearly show the advantage of **-Ofast**.
```assembly
# -Ofast
000101a8 <__mulsi3>:
101a8: 00050713 mv a4,a0
101ac: 00000513 li a0,0
101b0: 02058263 beqz a1,101d4 <__mulsi3+0x2c>
101b4: 0015f793 andi a5,a1,1
101b8: 40f007b3 neg a5,a5
101bc: 00e7f7b3 and a5,a5,a4
101c0: 0015d593 srli a1,a1,0x1
101c4: 00f50533 add a0,a0,a5
101c8: 00171713 slli a4,a4,0x1
101cc: fe0594e3 bnez a1,101b4 <__mulsi3+0xc>
101d0: 00008067 ret
101d4: 00008067 ret
```
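- A smaller code size does not necessarily mean faster speed: the **-Os** version of `__mulsi3` is one instruction shorter than the **-Ofast** version (11 vs. 12 instructions) but executes one more instruction per loop iteration.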
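
For readers who prefer C to disassembly, the shift-add loop in the **-Os** and **-Ofast** listings corresponds roughly to the sketch below. `mulsi3_sketch` is a hypothetical name; the real `__mulsi3` source in the project may be written differently, this only mirrors the instruction sequence (`andi`, `neg`, `and`, `add`, `slli`, `srli`).

```c
#include <stdint.h>

/* Branch-free inner step: the low bit of b is turned into an all-ones or
 * all-zeros mask, which selects whether a is added to the result. */
static uint32_t mulsi3_sketch(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {
        uint32_t mask = -(b & 1u); /* andi + neg: 0xFFFFFFFF if bit set, else 0 */
        result += a & mask;        /* and + add */
        a <<= 1;                   /* slli */
        b >>= 1;                   /* srli */
    }
    return result;
}
```

Whether the loop test sits at the top or at the bottom is a code-layout decision made by the compiler; both layouts compute the same result.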
## Tower of Hanoi
### Modify `makefile`
- Add `Tower_of_Hanoi.o` to the object list
- The overall logic remains unchanged.
### Modify `Tower_of_Hanoi.s`
- Comment every line to make the assembly logic easier to follow.
- Adapt the system calls to rv32emu.
- Original print string:
```
la x10, str1 # "Move Disk"
addi x17, x0, 4 # a7 == 4 print_str
ecall
```
- Adapted print string for `rv32emu`:
```
# write(fd = a0, buf = a1, len = a2), syscall number in a7 = 64
# write(1, str1, 10);
li a0, 1 # a0 = 1 : stdout
la a1, str1 # a1 : address of "Move Disk"
li a2, 10 # a2 = 10 : output 10 characters
li a7, 64 # a7 = 64 : Linux 'write'
ecall
```
- Original disk number print (1/2/3):
```
addi x10, x9, 1 # disk + 1 = 1/2/3
addi x17, x0, 1 # a7 == 1 print_int
ecall
```
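- Adapted number print for `rv32emu`:
```
li a0, 1 # a0 = 1 : stdout
la a1, disk # a1 : base address of the digit table
add a1, a1, x9 # a1 = &disk[x9], x9 = disk index (0/1/2)
li a2, 1 # a2 = 1 : output 1 character
li a7, 64 # a7 = 64 : Linux 'write'
ecall
.data
# Note: 0x32, 0x33, 0x34 are ASCII '2' '3' '4', so disks are printed as 2/3/4;
# use 0x31, 0x32, 0x33 to keep the original 1/2/3 numbering.
disk: .byte 0x32, 0x33, 0x34
```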
- Original disk position print (A/B/C):
```
addi x10, x11, 0 # disk position = A/B/C
addi x17, x0, 11 # a7 == 11 print_char
ecall
```
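- Adapted position print for `rv32emu`:
```
li a0, 1 # a0 = 1 : stdout
la a1, peg # a1 : base address of the peg table
addi x13, x13, -65 # x13 is expected to hold 'A'/'B'/'C' (0x41/0x42/0x43); subtract 'A' (65) to get index 0/1/2
add a1, a1, x13 # a1 = &peg[index]
li a2, 1 # a2 = 1 : output 1 character
li a7, 64 # a7 = 64 : Linux 'write'
ecall
.data
peg: .byte 0x41, 0x42, 0x43 # ASCII codes: 'A' 'B' 'C'
```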
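
The idea behind both adaptations is the same: `write` needs the address of a byte in memory, so a value held in a register is turned into an index into a small table of ASCII characters. A minimal C sketch of that lookup is shown below; `disk_chars`, `peg_chars`, `disk_byte`, and `peg_byte` are hypothetical names, and the digit table here uses `'1'`, `'2'`, `'3'` to match the numbering of the original `print_int` version.

```c
/* One ASCII byte per disk index and per peg, like the .byte tables. */
static const char disk_chars[3] = {'1', '2', '3'};
static const char peg_chars[3] = {'A', 'B', 'C'};

/* disk_index is 0..2, like x9 in the assembly; returns the address that
 * would be passed to write() in a1. */
static const char *disk_byte(unsigned disk_index)
{
    return &disk_chars[disk_index];
}

/* peg_ascii is 'A'..'C', like x13; subtracting 'A' gives the table index. */
static const char *peg_byte(char peg_ascii)
{
    return &peg_chars[peg_ascii - 'A'];
}
```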
### Modify `main.c`
- Function Declaration `tower_of_hanoi_v1`
`extern uint32_t tower_of_hanoi_v1(void);`
- Test Function Definition `test_tower_of_hanoi_v1`
```c
static void test_tower_of_hanoi_v1(void)
{
    uint32_t ret = tower_of_hanoi_v1();
    (void) ret; /* the return value is not checked in this test */
}
```
- Test Execution and Benchmarking
```c
/* Test : Tower of Hanoi */
TEST_LOGGER("Test : Tower of Hanoi\n");
start_cycles = get_cycles();
start_instret = get_instret();
test_tower_of_hanoi_v1();
end_cycles = get_cycles();
end_instret = get_instret();
cycles_elapsed = end_cycles - start_cycles;
instret_elapsed = end_instret - start_instret;
TEST_LOGGER(" Cycles: ");
print_dec((unsigned long) cycles_elapsed);
TEST_LOGGER(" Instructions: ");
print_dec((unsigned long) instret_elapsed);
```
### Result
- Bare metal
```
Test1 : Tower of Hanoi
Move Disk 2 from A to C
Move Disk 3 from A to B
Move Disk 2 from C to B
Move Disk 4 from A to C
Move Disk 2 from B to A
Move Disk 3 from B to C
Move Disk 2 from A to C
Cycles: 674
Instructions: 674
```
- RISC-V
### Reference
[Insights from the Tower of Hanoi Problem (河內塔問題的啟示)](https://vocus.cc/article/67e2372efd89780001f5ce5a)
[GCC 8.1.0 manual: Extended Asm, Clobbers and Scratch Registers](https://gcc.gnu.org/onlinedocs/gcc-8.1.0/gcc/Extended-Asm.html#Clobbers-and-Scratch-Registers)
## rsqrt
RSQRT is a fast reciprocal square root function optimized for the RV32I architecture. It utilizes a lookup table to find an initial approximation close to the correct value, and then refines the answer using Newton's method (Newton-Raphson iteration).
### Q16.16
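Q16.16 is a fixed-point format with 16 integer bits and 16 fractional bits: a real value $v$ is stored as the integer $v \cdot 2^{16}$, so 1.0 is stored as 65536. Writing the input as $x = m \cdot 2^e$, the reciprocal square root factors as:
$$ \frac{1}{\sqrt{x}} = \frac{1}{\sqrt{m \cdot 2^e}} = \frac{1}{\sqrt{m}} \cdot \frac{1}{\sqrt{2^e}} = \frac{1}{\sqrt{m}} \cdot 2^{-\frac{e}{2}} $$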
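
As a small illustration of the Q16.16 format itself, the sketch below shows conversion and multiplication helpers. The names `q16_from_uint`, `q16_from_ratio`, and `q16_mul` are hypothetical and are not part of the assignment code; the sketch is meant for experimenting on a host machine, since it uses native 64-bit multiplication and division.

```c
#include <stdint.h>

/* Integer n -> Q16.16 (assumes n < 65536 so the shift does not overflow). */
static uint32_t q16_from_uint(uint32_t n)
{
    return n << 16;
}

/* Ratio num/den -> Q16.16, rounded to nearest (assumes den != 0). */
static uint32_t q16_from_ratio(uint32_t num, uint32_t den)
{
    return (uint32_t) ((((uint64_t) num << 16) + den / 2) / den);
}

/* Q16.16 * Q16.16 -> Q16.16: the 64-bit product carries 32 fractional
 * bits, so it is shifted right by 16 to return to 16 fractional bits. */
static uint32_t q16_mul(uint32_t a, uint32_t b)
{
    return (uint32_t) (((uint64_t) a * b) >> 16);
}
```

For example, `q16_from_ratio(1, 2)` is 32768, which is the Q16.16 encoding of $1/\sqrt{4}$, and `q16_from_uint(3)` is the same constant as `3u << 16` in the Newton-Raphson step of `fast_rsqrt`.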
### lookup table for the exponent part
Entry `i` of the table approximates $1/\sqrt{2^i}$ in Q16.16, i.e. $2^{16}/\sqrt{2^i}$:
```c
/* uint32_t rather than uint16_t: the first entry, 65536, does not fit in 16 bits */
static const uint32_t rsqrt_table[32] = {
    65536, 46341, 32768, 23170, 16384, /* 2^0 to 2^4 */
    11585, 8192, 5793, 4096, 2896,     /* 2^5 to 2^9 */
    2048, 1448, 1024, 724, 512,        /* 2^10 to 2^14 */
    362, 256, 181, 128, 90,            /* 2^15 to 2^19 */
    64, 45, 32, 23, 16,                /* 2^20 to 2^24 */
    11, 8, 6, 4, 3,                    /* 2^25 to 2^29 */
    2, 1                               /* 2^30, 2^31 */
};
```
### multiply two 32-bit integers
#### Explanation
This function implements multiplication by shift-and-add, driven by the binary representation of `b`.
#### Algorithm
- For each bit position `i` in `b` (from 0 to 31):
  - If bit `i` of `b` is `1`, add `a << i` to the result
- This is equivalent to: `a × b = a × (b₀×2⁰ + b₁×2¹ + b₂×2² + ... + b₃₁×2³¹)`
#### Key point
Returns `uint64_t` so that the full product fits (32-bit × 32-bit can produce a 64-bit result).
```c
static uint64_t mul32(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
    {
        if (b & (1u << i))
            r += (uint64_t)a << i;
    }
    return r;
}
```
### count leading zeros
#### Explanation
This function counts the number of leading zero bits in a 32-bit integer using binary search.
#### Algorithm
- Check progressively smaller groups of bits, always starting from the most significant end
- If a group is all zeros, add its width to the count and shift left to bring the next group to the top
- The five checks can add at most 16 + 8 + 4 + 2 + 1 = 31, which is the maximum number of leading zeros for a non-zero input; `x = 0` is handled separately and returns 32
```c
static int clz(uint32_t x)
{
    if (!x) return 32;
    int n = 0;
    if (!(x & 0xFFFF0000)) { n += 16; x <<= 16; }
    if (!(x & 0xFF000000)) { n += 8; x <<= 8; }
    if (!(x & 0xF0000000)) { n += 4; x <<= 4; }
    if (!(x & 0xC0000000)) { n += 2; x <<= 2; }
    if (!(x & 0x80000000)) { n += 1; }
    return n;
}
```
### fast rsqrt implementation
#### Explanation
This function computes 1/√x using lookup table + linear interpolation + Newton-Raphson refinement in Q16.16 fixed-point format.
#### Algorithm
- Stage 1: Handle edge cases
  - If `x = 0`, return `0xFFFFFFFF` (the largest value, standing in for infinity)
  - If `x = 1`, return `65536` (1.0 in Q16.16 format)
- Stage 2: Initial approximation from lookup table
  - Find the MSB position: `exp = 31 - clz(x)`, so `2^exp` is the largest power of 2 not exceeding `x`
  - Get the base value from the table: `y = rsqrt_table[exp]` (stores pre-computed 1/√(2^exp))
- Stage 3: Linear interpolation (only if `x` is not an exact power of 2)
  - If `x` lies between `2^exp` and `2^(exp+1)`, refine the estimate
  - Calculate the fraction: how far `x` is from `2^exp` on the way to `2^(exp+1)`
  - Interpolate between `y` and `y_next` using the fraction
- Stage 4: Newton-Raphson refinement (2 iterations)
  - Uses the formula `y_new = y × (3 - x×y²) / 2`
  - Each iteration improves accuracy
  - Steps:
    - `y2 = y²`: square the current estimate
    - `xy2 = x × y²`: multiply by the input
    - `y = y × (3 - xy2) / 2`: Newton-Raphson update
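
The update formula in Stage 4 follows from applying Newton's method to $f(y) = \frac{1}{y^2} - x$, whose positive root is $y = \frac{1}{\sqrt{x}}$. With $f'(y) = -\frac{2}{y^3}$:

$$ y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)} = y_n + \frac{y_n^3}{2}\left(\frac{1}{y_n^2} - x\right) = \frac{y_n \left(3 - x\, y_n^2\right)}{2} $$

In the fixed-point code, $3$ becomes `3u << 16`, and the final `>> 17` combines the rescaling of the Q16.16 product (`>> 16`) with the division by 2 (`>> 1`).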
```c
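uint32_t fast_rsqrt(uint32_t x)
{
    /* Stage 1: edge cases */
    if (x == 0) return 0xFFFFFFFF;
    if (x == 1) return 65536;

    /* Stage 2: table lookup at the position of the most significant set bit */
    int exp = 31 - clz(x);
    uint32_t y = rsqrt_table[exp];

    /* Stage 3: linear interpolation when x is not an exact power of 2 */
    if (x > (1u << exp))
    {
        uint32_t y_next = (exp < 31) ? rsqrt_table[exp + 1] : 0;
        uint32_t delta = y - y_next;
        /* frac = (x - 2^exp) / 2^exp as a 16-bit fraction */
        uint32_t frac = (uint32_t)((((uint64_t)x - (1UL << exp)) << 16) >> exp);
        y -= (uint32_t)((delta * frac) >> 16);
    }

    /* Stage 4: two Newton-Raphson iterations, y = y * (3 - x * y^2) / 2 */
    for (int i = 0; i < 2; i++)
    {
        uint32_t y2 = (uint32_t)mul32(y, y);              /* y^2 with 32 fractional bits */
        uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16);    /* x * y^2 in Q16.16 */
        y = (uint32_t)(mul32(y, (3u << 16) - xy2) >> 17); /* >> 16 rescales, >> 1 halves */
    }
    return y;
}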
```
### Result
```
Test2 : Fast Rsqrt
Cycles: 61 Instructions: 61
Input:1 Output:65536 Expected:65536 Error : 0 %
Cycles: 2783 Instructions: 2783
Input:4 Output:32768 Expected:32768 Error : 0 %
Cycles: 4417 Instructions: 4417
Input:9 Output:21845 Expected:21845 Error : 0 %
Cycles: 4420 Instructions: 4420
Input:25 Output:13107 Expected:13107 Error : 0 %
Cycles: 2788 Instructions: 2788
Input:64 Output:8192 Expected:8192 Error : 0 %
Cycles: 4255 Instructions: 4255
Input:144 Output:5461 Expected:5461 Error : 0 %
Cycles: 2794 Instructions: 2794
Input:256 Output:4096 Expected:4096 Error : 0 %
=== Fast Rsqrt Test Completed ===
Total Cycles: 189895
Total Instructions: 189895
```
## Course discussion
### 1. What do the .text, .data, and .bss sections mean in an ELF file?
Although the original question focuses on the .bss section, I reviewed all three major sections to better understand the structure of an ELF file in Linux and in bare-metal environments.
#### `.text`
The .text section has read + execute permission and stores the program’s machine code generated from our C source.
Since program logic does not change at runtime, this section is typically read-only and loaded only once.
#### `.bss`
The `.bss` section has read + write permission and stores uninitialized global/static variables, including those explicitly initialized to 0.
The key characteristic of `.bss` is that it does not occupy space in the ELF file; only its size is recorded.
At runtime, the startup code clears the RAM reserved for `.bss` to zero.
Conditions for placement in `.bss`:
- Global or static variable
- No initial value, or initial value = 0
**Examples:**
```
int g3; // → .bss
static int g4 = 0; // → .bss
static char buf[1024]; // → .bss
```
```
# from start.S
    la t0, __bss_start # t0 = start of .bss
    la t1, __bss_end # t1 = end of .bss
1:
    bge t0, t1, 2f # loop: clear every word in .bss to 0
    sw zero, 0(t0)
    addi t0, t0, 4
    j 1b
2: # .bss is cleared; startup continues here
```
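
In C terms, the loop from `start.S` does roughly the following. This is only a sketch for illustration: `clear_words` is a hypothetical name, and in a real startup file the two pointers would come from the linker symbols `__bss_start` and `__bss_end`.

```c
#include <stdint.h>

/* Zero every 32-bit word in [start, end), like the sw/addi/j loop.
 * The volatile pointer is a precaution (an assumption about compiler
 * behavior): it keeps the compiler from replacing the loop with a call
 * to memset, which may not exist on bare metal. */
static void clear_words(uint32_t *start, uint32_t *end)
{
    for (volatile uint32_t *p = start; p < end; p++)
        *p = 0;
}
```

On a host machine it can be tried on any array, for example `clear_words(buf + 1, buf + 3)` zeroes `buf[1]` and `buf[2]` and leaves the neighbors untouched.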
#### `.data`
The `.data` section has read + write permission and stores global/static variables with non-zero initial values.
The ELF file contains these initial values in binary form.
When the image runs from ROM or flash, the startup code copies these values into RAM to initialize the `.data` section; when a loader places the ELF segments directly in RAM, the values are already in place.
Conditions for placement in .data:
- Global or static variable
- Has a non-zero initializer
**Examples:**
```
int g1 = 10; // .data
static int g2 = 5; // .data
char message[] = "Hello"; // .data
```
### 2. What is the entry point of `main.c`?
In the linker script:
```
ENTRY(_start)
```
This specifies that the entry point of the ELF file is `_start`, not `main()`.
The script also sets the location counter to `0x10000` and places `.text._start` first in `.text`, so `_start` ends up at address `0x10000`:
```
. = 0x10000;
.text : {
*(.text._start)
*(.text)
}
```
As a result:
- `_start` is located at `0x10000`
- The CPU begins execution at `0x10000`