# Assignment 2: Complete Applications (Grammar Pending)
contributed by < [`eastwillow`](https://github.com/eastWillow) >
You can find the source code [here](https://github.com/eastWillow/ca2025-quizzes). Feel free to fork and modify it.
**Acknowledgment of AI Usage**:
All use of AI tools has been documented transparently.
I used AI tools to correct grammar and to learn the prerequisite knowledge.
I will focus on analyzing performance.
Only RV32I instructions may be used. This means that you MUST build C programs with the `-march=rv32izicsr` or `-march=rv32i_zicsr` flag.
RV32M (multiply and divide) and RV32F (single-precision floating point) are not permitted.
## rv32emu build options
```
make ENABLE_ELF_LOADER=1 ENABLE_EXT_C=0 ENABLE_SYSTEM=1 ENABLE_GDBSTUB=1 misalign-in-blk-emu
```
## Print Hello World
Get the linker script and start script from [here](https://gist.github.com/jserv/5f682ac880773cab69e3564f4f20d60a):
`git clone https://gist.github.com/jserv/5f682ac880773cab69e3564f4f20d60a tests/system/playground`
https://github.com/eastWillow/ca2025-quizzes/commit/1b1ec4eb7ec5dd26bb556f6ecda24c79d68f3a1b
## Print Hello World in ASM Only (syscall)
https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md
```a
.equ STDOUT, 1
.equ WRITE, 64
.equ EXIT, 93
.section .rodata
.align 2
msg:
.ascii "Hello World\n"
.section .text
.align 2
.globl _start
_start:
li a0, STDOUT # file descriptor
la a1, msg # address of string
li a2, 12 # length of string
li a7, WRITE # syscall number for write
ecall
# MISSING: Check for error condition
li a0, 0 # 0 signals success
li a7, EXIT
ecall
```
## Print Hello World in ASM With C
https://github.com/eastWillow/ca2025-quizzes/commit/081d193639b481943c1b211020148101fc21de4b
### Explain the Start Assembly (Draft)
[From ChatGPT](https://chatgpt.com/share/6914a363-5700-8003-bb49-ecbc1bd03af6)
```javascript
.section .text._start
.globl _start
.type _start, @function
# ------------------------------------------------------------
# RISC-V Bare-Metal Startup Code
# ------------------------------------------------------------
# This is the program entry point (_start).
# It sets up the stack, clears the .bss section,
# and then calls main().
# ------------------------------------------------------------
_start:
# --------------------------------------------------------
# 1. Initialize the stack pointer
# --------------------------------------------------------
# Load the address of the top of the stack into sp.
# __stack_top should be defined in the linker script.
la sp, __stack_top
# --------------------------------------------------------
# 2. Zero-initialize the .bss section
# --------------------------------------------------------
# t0 = current address
# t1 = end address of .bss
la t0, __bss_start
la t1, __bss_end
1: bge t0, t1, 2f # if (t0 >= t1) goto 2 (done)
sw zero, 0(t0) # store 0 to *t0
addi t0, t0, 4 # increment pointer by 4 bytes
j 1b # repeat until .bss is cleared
2:
# --------------------------------------------------------
# 3. Call the main() function
# --------------------------------------------------------
# After the environment is ready, call main().
call main
# --------------------------------------------------------
# 4. If main() returns, perform an exit syscall (Linux)
# --------------------------------------------------------
li a7, 93 # syscall number for exit (SYS_exit)
li a0, 0 # exit code = 0
ecall # perform system call
# --------------------------------------------------------
# 5. Infinite loop (should never reach here)
# --------------------------------------------------------
3: j 3b # endless loop for safety
.size _start, .-_start
# ------------------------------------------------------------
# Weak symbols (in case linker script doesn't define them)
# ------------------------------------------------------------
# These prevent linker errors if the symbols are missing,
# but they will resolve to 0 if not defined — which is unsafe.
# Always define them properly in the linker script.
# ------------------------------------------------------------
.weak __bss_start
.weak __bss_end
.weak __stack_top
```
All objects with static storage duration that are not explicitly initialized shall be initialized to zero. — ISO/IEC 9899:2018 §6.7.9/10
The start script therefore has to zero-initialize the variables in the `.bss` section.
### Explain the Linker Script (Draft)
[From ChatGPT](https://chatgpt.com/share/6914a3fb-a138-8003-9bc3-614d6f567d7d)
```javascript=
OUTPUT_ARCH("riscv") /* Specify the target architecture as RISC-V */
ENTRY(_start) /* Define the program entry point symbol (_start) */
SECTIONS
{
. = 0x10000; /* Set the starting address for the first section (program load address) */
.text : {
*(.text._start) /* Place the startup code (_start) first */
*(.text) /* Place all other text (code) sections here */
}
.data : {
*(.data) /* Place all initialized data sections */
}
.bss : {
__bss_start = .; /* Mark the beginning of the BSS section */
*(.bss) /* Place all uninitialized data sections here */
__bss_end = .; /* Mark the end of the BSS section */
}
.stack (NOLOAD) : {
. = ALIGN(16); /* Align stack start address to 16 bytes for proper alignment */
. += 4096; /* Allocate 4 KB for the stack */
__stack_top = .; /* Define symbol for the top of the stack */
}
}
```
The `.bss` section contains global and static variables that are declared but not explicitly initialized, for example:
```c
int counter; // goes into .bss
static int flag; // also in .bss
```
https://man7.org/linux/man-pages/man5/elf.5.html
```
.bss This section holds uninitialized data that contributes to
the program's memory image. By definition, the system
initializes the data with zeros when the program begins to
run. This section is of type SHT_NOBITS. The attribute
types are SHF_ALLOC and SHF_WRITE.
```
These variables exist in memory, but their values are zero-initialized at runtime (the C standard guarantees this).
https://en.wikipedia.org/wiki/.bss
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
```
-fno-zero-initialized-in-bss
If the target supports a BSS section, GCC by default puts variables that are initialized to zero into BSS. This can save space in the resulting code.
This option turns off this behavior because some programs explicitly rely on variables going to the data section—e.g., so that the resulting executable can find the beginning of that section and/or make assumptions based on that.
The default is -fzero-initialized-in-bss except in Ada.
```
The .bss section is defined in ELF (Executable and Linkable Format) as SHT_NOBITS.
```c
//test.c
int globla_a;
int globla_b = 42;
int main(void)
{
static int local_a; //Only This one will go into .bss section
static int local_b = 42;
}
```
b → local .bss symbol
B → global .bss symbol
```shell
$ riscv-none-elf-nm test.o
00000000 B globla_a
00000000 D globla_b
00000000 b local_a.0
00000004 d local_b.1
00000000 T main
```
```shell
$ riscv-none-elf-readelf -s test.o
Num: Value Size Type Bind Vis Ndx Name
0: 00000000 0 NOTYPE LOCAL DEFAULT UND
1: 00000000 0 FILE LOCAL DEFAULT ABS test.c
2: 00000000 0 SECTION LOCAL DEFAULT 1 .text
3: 00000000 0 SECTION LOCAL DEFAULT 2 .data
4: 00000000 0 SECTION LOCAL DEFAULT 3 .bss
5: 00000000 0 SECTION LOCAL DEFAULT 4 .sbss
6: 00000000 0 SECTION LOCAL DEFAULT 5 .sdata
7: 00000000 0 NOTYPE LOCAL DEFAULT 1 $xrv32i2p1_zicsr2p0
8: 00000004 4 OBJECT LOCAL DEFAULT 5 local_b.1
9: 00000000 4 OBJECT LOCAL DEFAULT 3 local_a.0
10: 00000000 0 SECTION LOCAL DEFAULT 7 .note.GNU-stack
11: 00000000 0 SECTION LOCAL DEFAULT 6 .comment
12: 00000000 0 SECTION LOCAL DEFAULT 8 .riscv.attributes
13: 00000000 4 OBJECT GLOBAL DEFAULT 4 globla_a
14: 00000000 4 OBJECT GLOBAL DEFAULT 5 globla_b
15: 00000000 40 FUNC GLOBAL DEFAULT 1 main
```
```shell
$ riscv-none-elf-readelf -S test.o
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 00000000 000034 000028 00 AX 0 0 4
[ 2] .data PROGBITS 00000000 00005c 000000 00 WA 0 0 1
[ 3] .bss NOBITS 00000000 00005c 000004 00 WA 0 0 4
[ 4] .sbss NOBITS 00000000 00005c 000004 00 WA 0 0 4
[ 5] .sdata PROGBITS 00000000 00005c 000008 00 WA 0 0 4
[ 6] .comment PROGBITS 00000000 000064 000034 01 MS 0 0 1
[ 7] .note.GNU-stack PROGBITS 00000000 000098 000000 00 0 0 1
[ 8] .riscv.attributes RISCV_ATTRIBUTE 00000000 000098 000025 00 0 0 1
[ 9] .symtab SYMTAB 00000000 0000c0 000100 10 10 13 4
[10] .strtab STRTAB 00000000 0001c0 000047 00 0 0 1
[11] .shstrtab STRTAB 00000000 000207 000064 00 0 0 1
```
### Optional improvements to the linker and start scripts
1. Clear .sbss section as well
2. Copy .data from flash/ROM to RAM ?
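
The two items above could look roughly like the following C sketch. It is not the code in the repository: `zero_words` and `copy_words` are hypothetical helper names, and the linker symbols named in the comment (`__sbss_start`, `__sbss_end`, `__data_start`, `__data_end`, `__data_load`) are assumptions that the current linker script does not define.

```c
#include <stdint.h>

/* Zero every 32-bit word in [start, end), like the .bss loop in _start. */
void zero_words(uint32_t *start, uint32_t *end)
{
    while (start < end)
        *start++ = 0;
}

/* Copy an initialized-data image word by word from its load address
 * (flash/ROM) to its run address (RAM). */
void copy_words(uint32_t *dst, uint32_t *dst_end, const uint32_t *src)
{
    while (dst < dst_end)
        *dst++ = *src++;
}

/*
 * On the target, and only if the linker script exported such symbols
 * (hypothetical names), the startup code could call:
 *
 *     zero_words(&__sbss_start, &__sbss_end);
 *     copy_words(&__data_start, &__data_end, &__data_load);
 */
```

Both helpers assume word-aligned regions whose size is a multiple of 4 bytes, the same assumption the existing `sw zero, 0(t0)` loop makes.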
## Pick one complete program (with test suite) done in My Homework 1
git sha: [8e190c](https://github.com/eastWillow/ca2025-quizzes/commit/8e190c4124c2bdb49a22684bc3d30db1418feee7)
Result on rv32emu with gcc `-Ofast`:
```
=== uf8 Tests ===
=== uf8 decode ===
correct
correct
correct
correct
correct
correct
correct
correct
correct
correct
correct
Cycles:218
=== uf8 encode ===
correct
correct
correct
correct
correct
correct
correct
correct
correct
correct
correct
correct
Cycles:713
=== All Tests Completed ===
```
Handwritten assembly: [git](https://github.com/eastWillow/ca2025-quizzes/commit/eac44dba41953557bfe1b1430d8422a3b36819c4)

| Result | -O0 | -Ofast | Handwritten assembly |
|:------ |:----:|:------:|:--------------------:|
| Cycles | 3184 | 931 | 1721 |
| Ratio | 100% | 29.24% | 54.05% |
### Contrast the handwritten and compiler-optimized assembly listings.
My observations: `<uf8_decode>` and `<uf8_encode>` are optimized by loop unrolling, because I see the same `ecall` pattern repeated in the disassembled code.
## Adapt Problem A from Quiz 2
https://hackmd.io/@sysprog/arch2025-quiz2-sol
Make it run in a bare-metal environment using rv32emu’s system emulation.
https://chatgpt.com/share/690cb87c-59b0-8003-a4ea-0c75283017cc
1-bit Gray code: 0, 1
```
0 1
```
2-bit Gray Code: 00, 01, 11, 10
```
0 + (0,1) = 00, 01
1 + (1,0) = 11, 10
```
3-bit Gray Code: 000, 001, 011, 010, 110, 111, 101, 100
```
Original sequence: 0 + (00, 01, 11, 10)
Mirrored sequence: 1 + (10, 11, 01, 00)
```
Flip Bit → Determine the disk to move
If it is the smallest disk → Decide **to** according to the rotation direction
If it is not the smallest disk → The only legal pole is **to**
| n | Smallest-disk rotation direction |
| -- | ----------------- |
| ODD | A → C → B → A → … |
| EVEN | A → B → C → A → … |
| Step | Gray code | Flip Bit | Column A | Column B | Column C | Move Disk | from → to |
| :--: | :-------: | :------: | :---------: | :------: | :---------: | :-------: | :-------: |
| 0 | 000 | - | 1<br>2<br>3 | - | - | - | - |
| 1 | 001 | 0 | 2<br>3 | - | 1 | 1 | A → C |
| 2 | 011 | 1 | 3 | 2 | 1 | 2 | A → B |
| 3 | 010 | 0 | 3 | 1<br>2 | - | 1 | C → B |
| 4 | 110 | 2 | - | 1<br>2 | 3 | 3 | A → C |
| 5 | 111 | 0 | 1 | 2 | 3 | 1 | B → A |
| 6 | 101 | 1 | 1 | - | 2<br>3 | 2 | B → C |
| 7 | 100 | 0 | - | - | 1<br>2<br>3 | 1 | A → C |
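
The rules and the table above can be sketched in C as follows. This is a minimal illustration, not the code from the linked commit: `struct move`, `gray`, `hanoi_gray`, and `MAX_DISKS` are hypothetical names, and the caller is assumed to provide room for `2^n - 1` moves. It avoids `*`, `/`, and `%` so the same idea stays within RV32I.

```c
#define MAX_DISKS 16

struct move {
    int disk;   /* 1 = smallest disk */
    char from;  /* 'A', 'B' or 'C' */
    char to;
};

/* Binary-reflected Gray code of i. */
static unsigned gray(unsigned i)
{
    return i ^ (i >> 1);
}

/* Writes the 2^n - 1 moves into moves[] and returns how many were written.
 * Returns 0 when n is outside 1..MAX_DISKS. */
unsigned hanoi_gray(int n, struct move *moves)
{
    int peg[MAX_DISKS] = {0};   /* peg[d]: 0 = A, 1 = B, 2 = C; all start on A */
    int step_dir;
    unsigned total, step;

    if (n < 1 || n > MAX_DISKS)
        return 0;
    step_dir = (n & 1) ? 2 : 1; /* odd n: A -> C -> B; even n: A -> B -> C */
    total = (1u << n) - 1u;

    for (step = 1; step <= total; step++) {
        unsigned flipped = gray(step) ^ gray(step - 1); /* exactly one bit set */
        int d = 0, src, dst;

        while (!(flipped & 1u)) {   /* index of the flipped bit = disk index */
            flipped >>= 1;
            d++;
        }
        src = peg[d];
        if (d == 0) {
            dst = src + step_dir;   /* smallest disk follows its rotation */
            if (dst >= 3)
                dst -= 3;
        } else {
            dst = 3 - src - peg[0]; /* the peg holding neither this disk nor the smallest */
        }
        moves[step - 1].disk = d + 1;
        moves[step - 1].from = (char)('A' + src);
        moves[step - 1].to = (char)('A' + dst);
        peg[d] = dst;
    }
    return total;
}
```

For `n = 3` the sketch yields the same seven moves as the table, in the same order.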
Unlike Ripes, the rv32emu system calls do not include print functions for decimal or hex values.
Ripes Result:
```
Move Disk 1 from A to C
Move Disk 2 from A to B
Move Disk 1 from C to B
Move Disk 3 from A to C
Move Disk 1 from B to A
Move Disk 2 from B to C
Move Disk 1 from A to C
```
The result in rv32emu is identical to the Ripes result.
Commit: https://github.com/eastWillow/ca2025-quizzes/commit/16081f4f251e00844b94fd83d9444e8796c117e9
## Problem C from Quiz 3
Password A5(x15)->A4(x14)->Zero(x0)
Password x15x14x0
Make it run in a bare-metal environment using rv32emu’s system emulation.
x86_64 code : https://github.com/eastWillow/ca2025-quizzes/commit/ca69eb95ccc444aa60675741b3ab52844088bd12
rv32emu code :
https://github.com/eastWillow/ca2025-quizzes/commit/945e773e89044650bc375cb71e7f88239fa49f33
## Focus On Problem C From Quiz 3
### Improve the assembly code generated by gcc with optimizations
Change `-Ofast` (optimized for speed) to `-Os` (optimized for size).
Adding `-Ofast` changes the code behavior.
I asked ChatGPT how to protect the code in a way that works across compilers:
https://github.com/eastWillow/ca2025-quizzes/commit/c47a0b00297c0ee55f37c93ce0bbd3decc4947be
Adding `-Os` changes the compiler and linker behavior.
The build now lacks `__ashldi3` and `__lshrdi3`, libgcc's 64-bit left-shift and logical right-shift routines:
https://gcc.gnu.org/onlinedocs/gccint/Integer-library-routines.html
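
To make clear what these two routines compute, here is a minimal C sketch of a 64-bit left shift and a 64-bit logical right shift built from 32-bit halves. It is not libgcc's implementation: `shl64_sketch` and `shr64_sketch` are hypothetical names, and masking the shift count to 0..63 is a choice made only for this sketch.

```c
#include <stdint.h>

/* 64-bit left shift using 32-bit operations on the two halves. */
uint64_t shl64_sketch(uint64_t value, unsigned shift)
{
    uint32_t lo = (uint32_t)value;
    uint32_t hi = (uint32_t)(value >> 32); /* constant shift: just picks the high word */

    shift &= 63u;
    if (shift == 0)
        return value;                      /* avoids an undefined 32-bit shift by 32 */
    if (shift >= 32) {
        hi = lo << (shift - 32);           /* low word moves entirely into the high word */
        lo = 0;
    } else {
        hi = (hi << shift) | (lo >> (32 - shift)); /* carry the top bits of lo into hi */
        lo <<= shift;
    }
    return ((uint64_t)hi << 32) | lo;
}

/* 64-bit logical right shift using 32-bit operations on the two halves. */
uint64_t shr64_sketch(uint64_t value, unsigned shift)
{
    uint32_t lo = (uint32_t)value;
    uint32_t hi = (uint32_t)(value >> 32);

    shift &= 63u;
    if (shift == 0)
        return value;
    if (shift >= 32) {
        lo = hi >> (shift - 32);           /* high word moves entirely into the low word */
        hi = 0;
    } else {
        lo = (lo >> shift) | (hi << (32 - shift)); /* carry the low bits of hi into lo */
        hi >>= shift;
    }
    return ((uint64_t)hi << 32) | lo;
}
```

On RV32I each branch maps to a handful of `sll`, `srl`, and `or` instructions, which is why a 64-bit shift by a variable amount is commonly routed through a helper call.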
### Try to get some new ideas from valgrind
According to Amdahl's law, I need to find the function that takes the most execution time.
`mul32` accounts for 42.93% on x86_64 with `-O0` [^fig1]
I got the idea from ChatGPT: `mul32` can use a 2D lookup table (Lut2D) method.
https://chatgpt.com/share/690f7f36-5fd4-8003-9b71-72e00129bc4d
git sha1 : [1c4d175](https://github.com/eastWillow/ca2025-quizzes/commit/1c4d1753a87e40294ef637e5106c0ae8306fafc5)
mul32 is 24.05% on x86_64 with `-O0` [^fig2]
mul32 is 11.97% on x86_64 with `-Ofast` [^fig3]
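
The linked commit is the authoritative version of the change. As a rough illustration of the general idea only, the sketch below multiplies two 32-bit values without any multiply instruction by looking up 4-bit by 4-bit partial products in a 2D table. Every name (`nibble_product`, `mul_lut_init`, `mul32_lut`) and the 16 x 16 table layout are assumptions, not taken from the commit.

```c
#include <stdint.h>

/* nibble_product[a][b] == a * b for 0 <= a, b <= 15 (maximum 225). */
static uint8_t nibble_product[16][16];

/* Fill the table by repeated addition, so no multiply is needed on RV32I. */
void mul_lut_init(void)
{
    int a, b;

    for (a = 0; a < 16; a++) {
        uint8_t acc = 0;
        for (b = 0; b < 16; b++) {
            nibble_product[a][b] = acc;
            acc = (uint8_t)(acc + a);
        }
    }
}

/* 32 x 32 -> 64-bit product as the sum of 8 x 8 nibble partial products. */
uint64_t mul32_lut(uint32_t x, uint32_t y)
{
    uint64_t result = 0;
    int i, j;

    for (i = 0; i < 32; i += 4) {
        uint32_t xn = (x >> i) & 0xFu;
        for (j = 0; j < 32; j += 4) {
            uint32_t yn = (y >> j) & 0xFu;
            /* each partial product is weighted by 2^(i + j) */
            result += (uint64_t)nibble_product[xn][yn] << (i + j);
        }
    }
    return result;
}
```

`mul_lut_init()` has to run once before the first `mul32_lut()` call. The sketch trades 256 bytes of table for replacing bit-by-bit shift-and-add with 64 table lookups.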
:warning: We care about CSR cycles at the moment.
| Result | -Ofast+Mul32(Lut2D) | -Ofast | -O0+Mul32(Lut2D) | -O0 |
|:------------ |:-------------------:|:------:|:----------------:|:------:|
| Cycles | 55489 | 58879 | 140277 | 141834 |
| Instructions | 55489 | 58878 | 140277 | 141834 |
| Ticks | 55489 | 58880 | 140277 | 141834 |
| Cycles Ratio | 39.1% | 41.5% | 98.9% | 100% |
### Try to get some new ideas from valgrind
```
make clean && make && ../../rv32emu/build/rv32emu test.elf
valgrind --tool=callgrind ../../rv32emu/build/rv32emu test.elf
```
Relative execution share per opcode [^fig4]
I see a lot of `add` instructions, so I try to reduce them:
https://github.com/eastWillow/ca2025-quizzes/commit/084117251563acc69ea76fe4b09a3fa11e1ebc1e
| op_code | -Ofast+Mul32(Lut2D) | -Ofast+Mul32(Lut2D fast-path zero skip) |
|:------- |:-------------------:|:---------------------------------------:|
| srai | 13.54% | 14.26% |
| srl | 11.20% | 12.43% |
| or | 10.90% | 12.10% |
| bne | 10.82% | 12.05% |
| bgeu | 10.68% | 11.85% |
| addi | 9.63% | 10.86% |
| andi | 7.98% | 9.09% |
| sll | 5.39% | 6.19% |
| **add** | 5.14% | 0.75% |
:warning: We care about CSR cycles at the moment. `-O0` is 100%
| Result | -Ofast+Mul32(Lut2D+ZeroSkip) | -Ofast+Mul32(Lut2D) | -Ofast |
|:------------ |:----------------------------:|:-------------------:|:------:|
| Cycles | 50659 | 55489 | 58879 |
| Instructions | 50656 | 55489 | 58878 |
| Ticks | 50659 | 55489 | 58880 |
| Cycles Ratio | 35.7% | 39.1% | 41.5% |
### Try to reduce the uint64_t manipulation in mul32
https://github.com/eastWillow/ca2025-quizzes/commit/db4754e415d695fd2cbe05b1bfb29632bccaf13a
:warning: We care about CSR cycles at the moment. `-O0` is 100%
The baseline is `-Ofast+Mul32`.
| Result | Lut2D+ZeroSkip+ReduceDataSize | Lut2D+ZeroSkip | Lut2D |
|:------------ |:-----------------------------:|:--------------:|:-----:|
| Cycles | 50132 | 50659 | 55489 |
| Instructions | 50129 | 50656 | 55489 |
| Ticks | 50132 | 50659 | 55489 |
| Cycles Ratio | 35.3% | 35.7% | 39.1% |
## Reference
https://lelouch.dev/blog/you-are-probably-not-dumb/
This is a good blog post; it makes learning new things more convenient for me.
### Prerequisites: what I already have
https://hackmd.io/@eastWillow/Fast-Inverse-Square-Root
https://hackmd.io/@eastWillow/arch2025-homework1
### Prerequisites: what I learned
Every function must end with an explicit `return`, including a `void` function whose body is `asm volatile`.
```c
void printstr(char *ptr, unsigned long length)
{
asm volatile(
"add a7, x0, 0x40;"
"add a0, x0, 0x1;" /* stdout */
"add a1, x0, %0;"
"mv a2, %1;" /* length character */
"ecall;"
:
: "r"(ptr), "r"(length)
: "a0", "a1", "a2", "a7", "memory");
return;
}
```
The last colon-separated section `"a0", "a1", "a2", "a7", "memory"`
tells the compiler which registers or resources your assembly might modify, so that it can generate correct surrounding code.
By adding `"memory"`, you prevent the compiler from:

- Reordering memory accesses across the asm block.
- Caching variables in registers that could have been changed by your assembly.

I do this because I want to call the function in the object file from assembly:
```
la a0, str # address of string
li a1, str_size # length of string
call printstr
```
gdb dump of memory for uf8_encode.S; the start address comes from `la t0, test_values`:
```
> -exec dump memory mem 0x10454 0x10500
```
```shell
$ xxd mem
00000000: 0000 0000 0100 0000 0f00 0000 1000 0000 ................
00000010: 0001 0000 0010 0000 0000 0100 0000 0200 ................
00000020: 0000 0400 0000 0800 f07f 0f00 ffff 0f00 ................
00000030: 0001 0f10 4180 c0d0 e0f0 ffff 636f 7272 ....A.......corr
00000040: 6563 740a 0077 726f 6e67 0a00 4379 636c ect..wrong..Cycl
00000050: 6573 3a00 0a00 0000 0000 0000 0000 0000 es:.............
00000060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000080: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000090: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000a0: 0000 0000 0000 0000 0000 0000 ............
```
gdb print memory
```
> -exec x/12xw 0x10454
0x10454: 0x00000000 0x00000001 0x0000000f 0x00000010
0x10464: 0x00000100 0x00001000 0x00010000 0x00020000
0x10474: 0x00040000 0x00080000 0x000f7ff0 0x000fffff
```
Numeric local labels for jumps in assembly:
```
<number>b → Search upwards for the nearest label with the same number
<number>f → Search downwards for the nearest label with the same number
```
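| Directive | Behavior | Null terminator | Example bytes |
| --------- | -------- | --------------- | ------------- |
| `.ascii` | Store ASCII string(s) as-is | ❌ No null terminator | `H e l l o` |
| `.asciz` | Store ASCII string(s) with null terminator | ✅ Adds 0x00 automatically | `H e l l o \0` |
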
[Lab2: RISC-V Instruction Set Simulator and System Emulator](https://hackmd.io/jmznp2mFREC0fUNTZ1x2XQ)
Check the ticks.c and perfcounter for the statistics of your program's execution.
read more [ticks.c](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c)
read more [perfcounter](https://github.com/sysprog21/rv32emu/tree/master/tests/perfcounter)
## What is the CSR cycle in RISC-V?
From ChatGPT:
> In RISC-V, CSR means Control and Status Register — a special register inside the CPU used to store and report internal state or control bits. They’re accessed using CSRRx instructions (like csrr, csrw, etc.).
| CSR name | Description | RV32 notes |
| ---------- | -------------------------------------- | ------------------------------ |
| `cycle` | Number of CPU clock cycles since start | Lower 32 bits |
| `cycleh` | Upper 32 bits of same counter | RV32-only |
| `instret` | Number of retired instructions | Lower 32 bits |
| `instreth` | Upper 32 bits of same counter | RV32-only |
| `time` | Real-time clock (if implemented) | Often mapped to a timer device |
## How to use valgrind + kcachegrind
```shell
sudo apt-get install valgrind kcachegrind graphviz
valgrind --tool=callgrind ./q3-problemC
```
https://hackmd.io/@jasonmatobo/Linux_Kernel_Note_2021/%2F%40jasonmatobo%2F2021q1_homweork_lab0
## Future Work
I lacked the time to analyze precision mathematically.
I may go deeper into it in future homework.
## Figure
[^fig1]: 
[^fig2]: 
[^fig3]: 
[^fig4]: -O0+Mul32(Lut2D) with rv32emu :

-Ofast+Mul32(Lut2D) in rv32emu :
