ca2025: HW2 - HackMD

# ca2025: HW2 Bare metal programming > [commit fe526ef](https://github.com/Max042004/rv32emu/commit/fe526efaaad4c8ba27526a199e0a750145aadf5e) ## Run with rv32emu build ``` make distclean && make ENABLE_SYSTEM=1 ENABLE_ELF_LOADER=1 ENABLE_GDBSTUB=1 ``` Because rv32emu not supports `printf`, if we want to output string to terminal, using system call `write`. ```c ssize_t write(size_t count; int fd, const void buf[count], size_t count); ``` However, `write` only accept address of string. So it need to translate our output from unsigned integer 32 bits to ASCII code. The formula is $$q=(n*0xCCCCCCCCD)>>35$$ $$r=n-10*q$$ where $n$ is unsigned integer 32 bits, $r$ is least degits of $n$, $q$ is quotient of $\frac{n}{10}$. After each interation, let $n=q$ until $q=0$. For example, $$n=35$$ $$q=(35*0xCCCCCCD)>>35=3$$ $$r=35-10*3=5$$ then $$n=q=3$$ $$q=(3*0xCCCCCCCD)>>3=05$$ $$r=3-0=3$$ So in this method, we obtain every decimal degit for any unsigned integer 32 bits. Then just add $48$ transform them to ASCII representation number. [assembly source code](https://github.com/Max042004/rv32emu/blob/master/tests/quiz3-c/q3-rsqrt.S) `q3-sqrt.ld` ``` ENTRY(_start) SECTIONS { . = 0x00000000; .text : { *(.text*) } .rodata : { *(.rodata*) } .data : { *(.data*) } .bss (NOLOAD) : { *(.bss*) *(COMMON) } } ``` From q3-sqrt.ld, q3-sqrt.S generate q3-sqrt.elf run q3-sqrt.elf with rv32emu ``` cd rv32emu make ./build/rv32emu tests/quiz3-c/q3-rsqrt.elf ``` we can get the result: ``` 22:41:49 INFO src/riscv.c:552: Log level: TRACE 22:41:49 INFO src/riscv.c:565: tests/quiz3-c/q3-rsqrt.elf ELF loaded 22:41:49 INFO src/main.c:315: RISC-V emulator is created and ready to run 4294967295 65536 32768 29308 1638 66 18 1 1 1 22:41:49 INFO src/main.c:338: RISC-V emulator is destroyed ``` ## Run with system emulation ### Quiz3 problem C In example directory: ``` tests/system/playground |- Makefile |- chacha20_asm.S |- main.c |- perfcounter.S |- start.S |- linker.ld ``` chacha20 is a cipher algorithm. It routine is that using a 256 bits key and 64 bits vector build a keystream. Takes keystream and plain text, XOR one by one bit then result is cipher text. To decipher, just XOR cipher text and keystream. chacha20 cipher algorithm is quick and also safe. Our directory: ``` tests/system/fast_rsqrt |- Makefile |- fast_rsqrt.S |- main.c |- perfcounter.S |- start.S |- linker.ld ``` I implement two versions, `fast_rsqrt.S` is totally assembly implement previously and `main.c` is same but a bare metal C program. `linker.ld` defines memory layout for program: ![圖片](https://hackmd.io/_uploads/Bya2KKdJZg.png) It will follow the definition in `linker.ld` to combine `.text` segments instructions, `.data`, `.rodata` and `.bss` segments of `start.o` and `main.o`. `perfcounter.S` utilize CSR (control status registers). cycle and instret are 64 bits, so to read them, it should first read high 32 bits part and then read low 32 bits part. ``` get_cycles: csrr a1, cycleh csrr a0, cycle csrr a2, cycleh bne a1, a2, get_cycles ``` However, while reading low 32 bits part, high part may carry. So it need a loop to avoid carry happens. Expected output: ``` ../../../build/rv32emu test.elf 12:43:56 INFO src/riscv.c:552: Log level: TRACE 12:43:56 INFO src/riscv.c:565: test.elf ELF loaded 12:43:56 INFO src/main.c:315: RISC-V emulator is created and ready to run ======= Fast Reciprocal Square Root Tests ======= Test1: reciprocal square root assembly 4294967295 65536 32768 29308 1638 66 18 1 1 1 Cycles: 30301 Instructions: 30301 Test2: reciprocal square root C == Testcase 0 == fast_rsqrt(0): 4294967295 , numerical diff: 0 == Testcase 1 == fast_rsqrt(1): 65536 , numerical diff: 0 == Testcase 2 == fast_rsqrt(4): 32768 , numerical diff: 0 == Testcase 3 == fast_rsqrt(5): 29308 , numerical diff: 0 == Testcase 4 == fast_rsqrt(1600): 1638 , numerical diff: 0 == Testcase 5 == fast_rsqrt(1000000): 66 , numerical diff: 1 == Testcase 6 == fast_rsqrt(12582912): 18 , numerical diff: 0 == Testcase 7 == fast_rsqrt(4294967293): 1 , numerical diff: 0 == Testcase 8 == fast_rsqrt(4294967294): 1 , numerical diff: 0 == Testcase 9 == fast_rsqrt(4294967295): 1 , numerical diff: 0 Cycles: 152667 Instructions: 152667 === Tests Completed === 12:43:56 INFO src/main.c:338: RISC-V emulator is destroyed ``` rv32emu is not similar to Ripes simulator. rv32emu not simulate CPU processing. It focus on emulate state machine for executing RISC-V instructions. ### Quiz2 problem A ### compared compiler generated assembly compared size under `-Ofast` and `-Os` flags: ``` # -Ofast text data bss dec hex filename 4450 189 4111 8750 222e test.elf # -Os text data bss dec hex filename 3698 189 4111 7998 1f3e test.elf ``` #### `-Ofast` flag while using `-Ofast` flag, initially I got wrong output: ``` ../../../build/rv32emu test.elf 15:08:39 INFO src/riscv.c:552: Log level: TRACE 15:08:39 INFO src/riscv.c:565: test.elf ELF loaded 15:08:39 INFO src/main.c:315: RISC-V emulator is created and ready to run 4294967295 65536 32768 29308 1638 66 18 1 1 1 ��15:08:39 INFO src/main.c:338: RISC-V emulator is destroyed ``` because gcc compiler try to optimiz `print_dec` function, it rearrange instructions. This result in storing value to memory (buffer) for system call 64 `write` not works. To solve this issue, need add clobber `"memory"` to `printstr` According to [gcc docs](https://gcc.gnu.org/onlinedocs/gcc-8.1.0/gcc/Extended-Asm.html?utm_source=chatgpt.com#Clobbers-and-Scratch-Registers): > The "memory" clobber tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands (for example, accessing the memory pointed to by one of the input parameters). To ensure memory contains correct values, GCC may need to flush specific register values to memory before executing the asm. Further, the compiler does not assume that any values read from memory before an asm remain unchanged after that asm; it reloads them as needed. Using the "memory" clobber effectively forms a read/write memory barrier for the compiler. clobber `"memory"` serve as **memory barrier**. Any read of a C variable's value that happens in the source after an `asm` statement must be after the memory-clobbering `asm` statement in the compiler-generated assembly output for the target machine. On the other hand, any read of a C var in the source before an `asm` statement similarly must stay sequenced before, otherwise it might incorrectly read a modified value. ```diff #define printstr(ptr, length) \ do { \ asm volatile( \ "li a7, 0x40;" \ "li a0, 0x1;" /* stdout */ \ "mv a1, %0;" \ "mv a2, %1;" /* length character */ \ "ecall;" \ : \ : "r"(ptr), "r"(length) \ - : "a0", "a1", "a2", "a7"); + : "a0", "a1", "a2", "a7", "memory"); } while (0) ``` Comparing two ELF file one's clobber includes `"memory"` while another not. We can see that the one with `"memory"` clobber (the right one), it contains many `sb` instructions which store byte into memory. However, the one without `"memory"` clobber (the left one) not contain any `store` instructions. This makes `write` system call can not successfully read value in buffer. ``` 000102ac <print_dec>: | 000102ac <print_dec>: 102ac: fe010113 addi sp,sp,-32 | 102ac: fe010113 addi sp,sp,-32 102b0: 01e10e13 addi t3,sp,30 | 102b0: 00010fa3 sb zero,31(sp) 102b4: 04050e63 beqz a0,10310 <pri| 102b4: 02051a63 bnez a0,102e8 <pr 102b8: 000e0313 mv t1,t3 | 102b8: 03000793 li a5,48 102bc: 00900893 li a7,9 | 102bc: 00f10f23 sb a5,30(sp) 102c0: 00100813 li a6,1 | 102c0: 01e10313 addi t1,sp,30 102c4: fff00593 li a1,-1 | 102c4: 02010793 addi a5,sp,32 102c8: 00030e13 mv t3,t1 | 102c8: 406787b3 sub a5,a5,t1 102cc: 00000613 li a2,0 | 102cc: 04000893 li a7,64 102d0: fff30313 addi t1,t1,-1 | 102d0: 00100513 li a0,1 102d4: 01f00713 li a4,31 | 102d4: 00030593 mv a1,t1 102d8: 00000793 li a5,0 | 102d8: 00078613 mv a2,a5 102dc: 00e556b3 srl a3,a0,a4 | 102dc: 00000073 ecall 102e0: 0016f693 andi a3,a3,1 | 102e0: 02010113 addi sp,sp,32 102e4: 00179793 slli a5,a5,0x1 | 102e4: 00008067 ret 102e8: 00f6e7b3 or a5,a3,a5 | 102e8: 01e10313 addi t1,sp,30 102ec: 00e816b3 sll a3,a6,a4 | 102ec: 00900593 li a1,9 102f0: fff70713 addi a4,a4,-1 | 102f0: fff00613 li a2,-1 102f4: 00f8f663 bgeu a7,a5,10300 <| 102f4: 00100893 li a7,1 102f8: ff678793 addi a5,a5,-10 | 102f8: 01f00713 li a4,31 102fc: 00d66633 or a2,a2,a3 | 102fc: 00000793 li a5,0 10300: fcb71ee3 bne a4,a1,102dc <| 10300: 00e556b3 srl a3,a0,a4 10304: 00060663 beqz a2,10310 <pri| 10304: 00179793 slli a5,a5,0x1 10308: 00060513 mv a0,a2 | 10308: 0016f693 andi a3,a3,1 1030c: fbdff06f j 102c8 <print_| 1030c: 00f6e7b3 or a5,a3,a5 10310: 02010793 addi a5,sp,32 | 10310: fff70713 addi a4,a4,-1 10314: 41c787b3 sub a5,a5,t3 | 10314: 00f5f463 bgeu a1,a5,1031c 10318: 04000893 li a7,64 | 10318: ff678793 addi a5,a5,-10 1031c: 00100513 li a0,1 | 1031c: fec712e3 bne a4,a2,10300 10320: 000e0593 mv a1,t3 | 10320: 03078793 addi a5,a5,48 10324: 00078613 mv a2,a5 | 10324: 00f30023 sb a5,0(t1) 10328: 00000073 ecall | 10328: fff30e13 addi t3,t1,-1 1032c: 02010113 addi sp,sp,32 | 1032c: 00000813 li a6,0 10330: 00008067 ret | 10330: 01f00713 li a4,31 -------------------------------------------------------------| 10334: 00000793 li a5,0 -------------------------------------------------------------| 10338: 00e556b3 srl a3,a0,a4 -------------------------------------------------------------| 1033c: 0016f693 andi a3,a3,1 -------------------------------------------------------------| 10340: 00179793 slli a5,a5,0x1 -------------------------------------------------------------| 10344: 00f6e7b3 or a5,a3,a5 -------------------------------------------------------------| 10348: 00e896b3 sll a3,a7,a4 -------------------------------------------------------------| 1034c: fff70713 addi a4,a4,-1 -------------------------------------------------------------| 10350: 00f5f663 bgeu a1,a5,1035c -------------------------------------------------------------| 10354: ff678793 addi a5,a5,-10 -------------------------------------------------------------| 10358: 00d86833 or a6,a6,a3 -------------------------------------------------------------| 1035c: fcc71ee3 bne a4,a2,10338 -------------------------------------------------------------| 10360: f60802e3 beqz a6,102c4 <pr -------------------------------------------------------------| 10364: 000e0313 mv t1,t3 -------------------------------------------------------------| 10368: 00080513 mv a0,a6 -------------------------------------------------------------| 1036c: f8dff06f j 102f8 <print | no_clobber_barrier.txt tex… 5173W 15% ㏑:195 ℅:1 | has_clobber_barrier.txt tex… 0W 18% ㏑:195 ℅:1 ``` Moreover, for other optimization flag such as `-O2` exists same problem. For all of same issues, adds clobber `"memory"` in macro `printstr` solves the problem. #### `-Os` flag using `-Os` flag, I got error message: ``` riscv-none-elf-ld: main.o: in function `fast_rsqrt': /home/max/sysprog/projects/rv32emu/tests/system/fast_rsqrt/main.c:197:(.text+0x1bc): undefined reference to `__lshrdi3' riscv-none-elf-ld: main.o: in function `mul32': /home/max/sysprog/projects/rv32emu/tests/system/fast_rsqrt/main.c:167:(.text+0x208): undefined reference to `__ashldi3' riscv-none-elf-ld: /home/max/sysprog/projects/rv32emu/tests/system/fast_rsqrt/main.c:167:(.text+0x240): undefined reference to `__ashldi3' riscv-none-elf-ld: /home/max/sysprog/projects/rv32emu/tests/system/fast_rsqrt/main.c:167:(.text+0x298): undefined reference to `__ashldi3' ``` `__ashldi3` is library function used when computing 64 bits. But it didn't link correctly. So we manually add it to Makefile command: ```diff + LIBGCC = $(shell riscv-none-elf-gcc -print-libgcc-file-name) $(EXEC): $(OBJS) $(LINKER_SCRIPT) - $(LD) $(LDFLAGS) -o $@ $(OBJS) + $(LD) $(LDFLAGS) -o $@ $(OBJS) $(LIBGCC) ``` Then compile successfully: ``` 21:01:48 INFO src/riscv.c:552: Log level: TRACE 21:01:48 INFO src/riscv.c:565: test.elf ELF loaded 21:01:48 INFO src/main.c:315: RISC-V emulator is created and ready to run ======= Fast Reciprocal Square Root Tests ======= Test1: reciprocal square root assembly 4294967295 65536 32768 29308 1638 66 18 1 1 1 Cycles: 30963 Instructions: 31002 Test2: reciprocal square root C == Testcase 0 == fast_rsqrt(0): 4294967295 , numerical diff = 0 , relative error = 0% Cycles: 45 Instructions: 46 == Testcase 1 == fast_rsqrt(1): 65536 , numerical diff = 0 , relative error = 0% Cycles: 33 Instructions: 33 == Testcase 2 == fast_rsqrt(4): 32768 , numerical diff = 0 , relative error = 0% Cycles: 83 Instructions: 83 == Testcase 3 == fast_rsqrt(5): 29308 , numerical diff = 0 , relative error = 0% Cycles: 2214 Instructions: 2214 == Testcase 4 == fast_rsqrt(1600): 1638 , numerical diff = 0 , relative error = 0% Cycles: 0 Instructions: 0 == Testcase 5 == fast_rsqrt(1000000): 66 , numerical diff = 1 , relative error = 1% Cycles: 1036 Instructions: 1036 == Testcase 6 == fast_rsqrt(12582912): 18 , numerical diff = 0 , relative error = 0% Cycles: 256 Instructions: 256 == Testcase 7 == fast_rsqrt(4294967293): 1 , numerical diff = 0 , relative error = 0% Cycles: 600 Instructions: 600 == Testcase 8 == fast_rsqrt(4294967294): 1 , numerical diff = 0 , relative error = 0% Cycles: 0 Instructions: 0 == Testcase 9 == fast_rsqrt(4294967295): 1 , numerical diff = 0 , relative error = 0% Cycles: 0 Instructions: 0 TEST SUCCUESSFUL === Tests Completed === 21:01:48 INFO src/main.c:338: RISC-V emulator is destroyed ``` ### enable rv32emu GDB debug When I build, ``` make distclean && make ENABLE_SYSTEM=1 ENABLE_ELF_LOADER=1 ENABLE_GDBSTUB=1 ``` I encounter issues, ``` git submodule update --init src/mini-gdbstub/ Submodule 'mini-gdbstub' (https://github.com/RinHizakura/mini-gdbstub) registered for path 'src/mini-gdbstub' Cloning into '/home/max/sysprog/projects/rv32emu/src/mini-gdbstub'... fatal: transport 'file' not allowed fatal: Fetched in submodule path 'src/mini-gdbstub', but it did not contain 78b00cdfb1352ca80117832a898919d58d47b53b. Direct fetching of that commit failed. make: *** [Makefile:238: src/mini-gdbstub/Makefile] Error 128 ``` To fix, I run: ``` git -c protocol.file.allow=always git config -f .gitmodules submodule.mini-gdbstub.shallow false git submodule sync --recursive git submodule deinit -f src/mini-gdbstub rm -rf .git/modules/src/mini-gdbstub src/mini-gdbstub git submodule update --init --recursive src/mini-gdbstub ``` but is still error, so I run another command: ``` git -C src/mini-gdbstub config branch.master.remote origin git -C src/mini-gdbstub fetch --unshallow || git -C src/mini-gdbstub fetch --depth=1000000 ``` then it build successfully with ENABLE_GDBSTUB=1. But while executing `./build/rv32emu tests/system/fast_rsqrt/test.elf`. I encounter `riscv-unknown-gdb not found` then I adapts solution on [this page](https://zhuanlan.zhihu.com/p/638731320) ``` sudo apt-get install libgmp-dev # 安装相关依赖 sudo apt-get install libncurses5-dev python3 python3-dev texinfo libreadline-dev # 从清华大学开源镜像站下载gdb源码(约23MB) wget https://mirrors.tuna.tsinghua.edu.cn/gnu/gdb/gdb-13.1.tar.xz # 解压gdb源码压缩包 tar -xvf gdb-13.1.tar.xz # 进入gdb源码目录 cd gdb-13.1 mkdir build && cd build # 配置编译选项，这里只编译riscv32-unknown-elf一个目标文件 ../configure --prefix=/usr/local --target=riscv32-unknown-elf --enable-tui=yes # 在上面一行编译配置选项中，很多其他的文章会配置一个python选项 # 但我在尝试中发现配置了python选项后后面的编译过程会报错，不添加python选项则没有问题 # 开始编译，这里编译可能消耗较长时间，具体时长取决于机器性能 make -j$(nproc) # 编译完成后进行安装 sudo make install ``` After this, rv32emu success run remote GDB. Then to execute one instruction by one instruction. Using `si` or `ni` which equal to `s` and `n` for executing one line by one line of C code. ``` build/rv32emu -g tests/system/fast_rsqrt/test.elf ``` In another window: ``` $ riscv32-unknown-elf-gdb (gdb) file test.elf (gdb) target remote :1234 # show next executing instrucion (gdb) set disassemble-next-line on # TUI UI (gdb) layout asm # 或 layout split 顯示混合視圖 (gdb) si (gdb) ni ``` compared `-Os` code to mine: first, in `fast_rsqrt` function, it store value of saved registers into stack. ``` 10184: fd010113 addi sp,sp,-48 10188: 02812423 sw s0,40(sp) 1018c: 02112623 sw ra,44(sp) 10190: 02912223 sw s1,36(sp) 10194: 03212023 sw s2,32(sp) 10198: 01312e23 sw s3,28(sp) 1019c: 01412c23 sw s4,24(sp) 101a0: 01512a23 sw s5,20(sp) 101a4: 01612823 sw s6,16(sp) 101a8: 01712623 sw s7,12(sp) 101ac: 01812423 sw s8,8(sp) 101b0: 01912223 sw s9,4(sp) ``` When I was tracing using remote GDB, I find that even after compiler optimazation, the instructions can still find corresponding parts to original C code. Under `-Os` flag, functions `clz` and `mul` are cooperate into `fast_rsqrt` in order to shrink the code size. However, I want to observe under `-Os` flag, why some output cost zero cycle: ``` == Testcase 4 == fast_rsqrt(1600): 1638 , numerical diff = 0 , relative error = 0% Cycles: 0 Instructions: 0 == Testcase 5 == fast_rsqrt(1000000): 66 , numerical diff = 1 , relative error = 1% Cycles: 1038 Instructions: 1038 == Testcase 6 == fast_rsqrt(12582912): 18 , numerical diff = 0 , relative error = 0% Cycles: 258 Instructions: 258 == Testcase 7 == fast_rsqrt(4294967293): 1 , numerical diff = 0 , relative error = 0% Cycles: 602 Instructions: 602 == Testcase 8 == fast_rsqrt(4294967294): 1 , numerical diff = 0 , relative error = 0% Cycles: 0 Instructions: 0 == Testcase 9 == fast_rsqrt(4294967295): 1 , numerical diff = 0 , relative error = 0% Cycles: 0 Instructions: 0 ``` But when I utilize remote GDB to trace the instructions, strange thing happens, the cycle count were not zero: ``` == Testcase 4 == fast_rsqrt(1600): 1638 , numerical diff = 0 , relative error = 0% Cycles: 2024 Instructions: 2024 == Testcase 5 == fast_rsqrt(1000000): 66 , numerical diff = 1 , relative error = 1% Cycles: 1701 Instructions: 1701 == Testcase 6 == fast_rsqrt(12582912): 18 , numerical diff = 0 , relative error = 0% Cycles: 1590 Instructions: 1590 == Testcase 7 == fast_rsqrt(4294967293): 1 , numerical diff = 0 , relative error = 0% Cycles: 1354 Instructions: 1354 == Testcase 8 == fast_rsqrt(4294967294): 1 , numerical diff = 0 , relative error = 0% Cycles: 1354 Instructions: 1354 == Testcase 9 == fast_rsqrt(4294967295): 1 , numerical diff = 0 , relative error = 0% Cycles: 1354 Instructions: 1354 ``` Same issue while `-Ofast` sets. But if I add a operation such as `TEST_LOGGER`: ```diff uint32_t in = *(volatile uint32_t *)&testcase[i]; uint32_t y = fast_rsqrt(in); + TEST_LOGGER("=="); ``` the cycle count become much closer to output of remote GDB. So it must be gcc compiler or gdb do something that I don't understand. ``` testcase 9 first time call csr_ptr, return let rv->csr_cycle = 119177 ``` After I used remote gdb tracing testcases which cycle count are zero, I found that these testcase actually run through same instructions path just like other testcase. So theoritically its cycle count should not be zero. Based on previous observation, I suspects that rv32emu has bug on computing cycle count, specifically `rdcycle rd` instruction may have issue. To find out the problem, I enable GDB to debug rv32emu. First, I re-built rv32emu with `-g` gcc flag to let GDB traces souce C code. Then set break point at `csr_get_ptr` function ``` gdb --args build/rv32emu tests/system/fast_rsqrt/test.elf (gdb) b csr_get_ptr ``` As what I thought, after tracing the execution, I notice that there was another varible called `cycle` which increment everytime after executing an instruction. ``` (gdb) do_csrrs (rv=0x55555559b9e0, ir=0x7ffff7fa36e8, cycle=82870, PC=67316) at src/rv32_template.c:1223 1223 RVOP( ``` For testcase which cycle count is zero. `csr_get_ptr` return value, a pointer to `rv->csr_cycle`, its dereference value didn't change after perform lots of instructions. Moreover, it didn't equal to `cycle`. But we know that `cycle` is actual value representing cycle counter. We stop a little bit here. We organize what we found so far. We found that in `fast_rsqrt` code under `-Ofast` or `-Os` optimization cause some testcase's cycle count zero. This is not allowed, because no matter what happens, we always use `rdcycle` to collect `start_cycle` and `end_cycle`, so `end_cycle - start_cycle` should at least $1$. Acccording to [Five EmbedDev docs](https://five-embeddev.com/riscv-user-isa-manual/Priv-v1.12/counters.html) > RV32I provides a number of 64-bit read-only user-level counters, which are mapped into the 12-bit CSR address space and accessed in 32-bit pieces using CSRRS instructions. > The RDCYCLE pseudoinstruction reads the low XLEN bits of the cycle CSR which holds a count of the number of clock cycles executed by the processor core on which the hart is running from an arbitrary start time in the past. I trace the source code of rv32emu First, `rdcycle` is a pseudo instruction, which implement with `csrrs rd, cycle[h], x0`: In rv32emu, function `csr_csrrs` implement `csrrs` instruction. so for `rdcycle`, actually it was `csr_csrrs (val=0, csr=294966272, rv=0x55555559b9e0)` where `csr=294966272` is bitmask, `rv=0x55555559b9e0` is `struct riscv_interal` containing necessary processor information. ```c CSR_CYCLE = 0xC00, /* Cycle counter for RDCYCLE instruction */ RVOP( csrrs, { uint32_t tmp = csr_csrrs( rv, ir->imm, (ir->rs1 == rv_reg_zero) ? 0U : rv->X[ir->rs1]); rv->X[ir->rd] = ir->rd ? tmp : rv->X[ir->rd]; }, GEN({ assert; /* FIXME: Implement */ })) /* get a pointer to a CSR */ static uint32_t *csr_get_ptr(riscv_t *rv, uint32_t csr) { /* csr & 0xFFF prevent sign-extension in decode stage */ switch (csr & 0xFFF) { ... /* Machine Counter/Timers */ case CSR_CYCLE: /* Cycle counter for RDCYCLE instruction */ return (uint32_t *) &rv->csr_cycle; case CSR_CYCLEH: /* Upper 32 bits of cycle */ return &((uint32_t *) &rv->csr_cycle)[1]; ... default: return NULL; } } /* perform csrrs (atomic read and set) */ static uint32_t csr_csrrs(riscv_t *rv, uint32_t csr, uint32_t val) { uint32_t *c = csr_get_ptr(rv, csr); if (!c) return 0; uint32_t out = *c; #if RV32_HAS(EXT_F) if (csr == CSR_FFLAGS) out &= FFLAG_MASK; #endif *c |= val; return out; } ``` From above code we know that `rdcycle a0`, actually gets `rv->csr_cycle`. However, `rv->csr_cycle` does not update everytime after an instruction finish. We take a look at `RVOP`: ```c /* Interpreter-based execution path */ #define RVOP(inst, code, asm) \ static bool do_##inst(riscv_t *rv, const rv_insn_t *ir, uint64_t cycle, \ uint32_t PC) \ { \ IIF(RV32_HAS(SYSTEM))(rv->timer++;, ) cycle++; \ code; \ IIF(RV32_HAS(SYSTEM)) \ ( \ if (need_handle_signal) { \ need_handle_signal = false; \ return true; \ }, ) nextop : PC += __rv_insn_##inst##_len; \ IIF(RV32_HAS(SYSTEM)) \ (IIF(RV32_HAS(JIT))( \ , if (unlikely(need_clear_block_map)) { \ block_map_clear(rv); \ need_clear_block_map = false; \ rv->csr_cycle = cycle; \ rv->PC = PC; \ return false; \ }), ); \ if (unlikely(RVOP_NO_NEXT(ir))) \ goto end_op; \ const rv_insn_t *next = ir->next; \ MUST_TAIL return next->impl(rv, next, cycle, PC); \ end_op: \ rv->csr_cycle = cycle; \ rv->PC = PC; \ return true; \ } #include "rv32_template.c" #undef RVOP ``` ```c void rv_step(void *arg) { assert(arg); riscv_t *rv = arg; vm_attr_t *attr = PRIV(rv); uint32_t cycles = attr->cycle_per_step; /* find or translate a block for starting PC */ const uint64_t cycles_target = rv->csr_cycle + cycles; /* loop until hitting the cycle target */ while (rv->csr_cycle < cycles_target && !rv->halt) { #if RV32_HAS(SYSTEM) && !RV32_HAS(ELF_LOADER) /* check for any interrupt after every block emulation */ rv_check_interrupt(rv); #endif if (prev && prev->pc_start != last_pc) { /* update previous block */ #if !RV32_HAS(JIT) prev = block_find(&rv->block_map, last_pc); #else prev = cache_get(rv->block_cache, last_pc, false); #endif } /* lookup the next block in block map or translate a new block, * and move onto the next block. */ block_t *block = block_find_or_translate(rv); /* by now, a block should be available */ if (unlikely(!block)) { rv_log_fatal("Failed to allocate or translate block at PC=0x%08x", rv->PC); rv->halt = true; return; } assert(block); #if RV32_HAS(JIT) && RV32_HAS(SYSTEM) assert(block->satp == rv->csr_satp); #endif #if !RV32_HAS(SYSTEM) /* on exit */ if (unlikely(block->ir_head->pc == PRIV(rv)->exit_addr)) PRIV(rv)->on_exit = true; #endif /* After emulating the previous block, it is determined whether the * branch is taken or not. The IR array of the current block is then * assigned to either the branch_taken or branch_untaken pointer of * the previous block. */ #if RV32_HAS(BLOCK_CHAINING) if (prev #if RV32_HAS(JIT) && RV32_HAS(SYSTEM) && prev->satp == rv->csr_satp #endif ) { rv_insn_t *last_ir = prev->ir_tail; /* chain block */ if (!insn_is_unconditional_branch(last_ir->opcode)) { if (is_branch_taken && !last_ir->branch_taken) { last_ir->branch_taken = block->ir_head; } else if (!is_branch_taken && !last_ir->branch_untaken) { last_ir->branch_untaken = block->ir_head; } } else if (insn_is_direct_branch(last_ir->opcode)) { if (!last_ir->branch_taken) { last_ir->branch_taken = block->ir_head; } } } #endif last_pc = rv->PC; #if RV32_HAS(JIT) #if RV32_HAS(T2C) /* executed through the tier-2 JIT compiler */ if (block->hot2) { ((exec_t2c_func_t) block->func)(rv); prev = NULL; continue; } /* check if invoking times of t1 generated code exceed threshold */ else if (!block->compiled && block->n_invoke >= THRESHOLD) { block->compiled = true; queue_entry_t *entry = malloc(sizeof(queue_entry_t)); if (unlikely(!entry)) { /* Malloc failed - reset compiled flag to allow retry later */ block->compiled = false; continue; } entry->block = block; pthread_mutex_lock(&rv->wait_queue_lock); list_add(&entry->list, &rv->wait_queue); pthread_mutex_unlock(&rv->wait_queue_lock); } #endif /* executed through the tier-1 JIT compiler */ struct jit_state *state = rv->jit_state; /* * TODO: We do not explicitly need the translated block. We only need * the program counter as a key for searching the corresponding * entry in compiled binary buffer. */ if (block->hot) { block->n_invoke++; ((exec_block_func_t) state->buf)( rv, (uintptr_t) (state->buf + block->offset)); prev = NULL; continue; } /* check if the execution path is potential hotspot */ if (block->translatable #if !RV32_HAS(ARCH_TEST) && runtime_profiler(rv, block) #endif ) { jit_translate(rv, block); ((exec_block_func_t) state->buf)( rv, (uintptr_t) (state->buf + block->offset)); prev = NULL; continue; } set_reset(&pc_set); has_loops = false; #endif /* execute the block by interpreter */ const rv_insn_t *ir = block->ir_head; if (unlikely(!ir->impl(rv, ir, rv->csr_cycle, rv->PC))) { /* block should not be extended if execption handler invoked */ prev = NULL; break; } #if RV32_HAS(JIT) if (has_loops && !block->has_loops) block->has_loops = true; #endif prev = block; } #ifdef __EMSCRIPTEN__ if (rv_has_halted(rv)) { emscripten_cancel_main_loop(); rv_delete(rv); /* clean up and reuse memory */ rv_log_info("RISC-V emulator is destroyed"); enable_run_button(); } #endif } ``` But interestingly, in [藉由 JIT 編譯加速 rv32emu](https://hackmd.io/@lambert-wu/rv32emu#JIT-%E7%B7%A8%E8%AD%AF%E5%99%A8) `rv_step` contains `rv->csr_cycle++` But currently, this part disapper. Utilize git to trace the previous changes: ``` git log -S 'rv->csr_cycle++' -p --follow -- src/emulate.c ``` Found [commit 78836cf](https://github.com/sysprog21/rv32emu/commit/78836cf7b6ee082307072630d47977116480e82d) remove `rv->csr_cycle++` What this commit do is that on the hot spot, it prevent constantly access memory while modify struct member `csr_cycle` and `pc`. As the result, we should write back `csr_cycle` and `pc` when needed. For example, in `ecall`, we need to manipulate `pc`, so before calling it, we need write back `pc` to memory: ```c /* ECALL: Environment Call */ RVOP( ecall, { rv->compressed = false; rv->csr_cycle = cycle; rv->PC = PC; rv->io.on_ecall(rv); return true; }, GEN({ break; ldimm, TMP, pc; st, S32, TMP, PC; call, ecall; exit; })) ``` However, for `csrr` count need to update, program counter not needed. ```diff RVOP( csrrs, { + rv->csr_cycle = cycle; uint32_t tmp = csr_csrrs( rv, ir->imm, (ir->rs1 == rv_reg_zero) ? 0U : rv->X[ir->rs1]); rv->X[ir->rd] = ir->rd ? tmp : rv->X[ir->rd]; }, GEN({ assert; /* FIXME: Implement */ })) ``` Same for `csrrw`, `csrrc`, `csrrsi`, `csrrwi` ## send PR to rv32emu ### csr cycle not asnyc [PR #630 Fix csr cycle unasyc](https://github.com/sysprog21/rv32emu/pull/630) But, I have a concern. What will happen if `rv->csr_cycle` will update in the middle of block. To answer this question, we need to dive into the operation of rv32emu. ### Question about rv32emu 1. why Domain specific language ## TODO grap code program csr cycle dsp code rv32emu vector extension ### Optimize After observation through remote GDB on fast_rsqrt. I notice that for larger input, such as $\frac{65536}{\sqrt(2^{31})}$, it still go through two times newton iteration. Even under `-Ofast` or `-Osize`. To optimize code execution, we can add early return to those don't need two times newton iteration. Furthermore, for certain cases, it only need one times newton iteration. > [commit 02a759b](https://github.com/Max042004/rv32emu/commit/02a759b3bbc5b42dd84902d4271afd94e6bc1382) ```diff uint32_t frac = (uint32_t)((((uint64_t)x - (1UL << exp)) << 16) >> exp); y -= (uint32_t)(delta * frac) >> 16; - for (int iter = 0; iter < 2; iter++) { - uint32_t y2 = (uint32_t)mul32(y, y); // Q0.32 - uint32_t xy2 = (uint32_t)(mul32(x, y2) >> 16); // Q16.16 - uint64_t tmp = mul32(y, (3u << 16) - xy2); - y = (uint32_t)((tmp >> 17) + ((tmp >> 16) & 1)); + if (x < 512) { // 2^9 + y = newton_iter(x, y); + y = newton_iter(x, y); + } + else if (x < 33554432) { // 2^25 + y = newton_iter(x, y); } } ``` We can see that for those larger parameter, cycle count reduce significantly. ``` == Testcase 4 == fast_rsqrt(1600): 1634 , numerical diff = 4 , relative error = 0% Cycles: 1104 Instructions: 1104 == Testcase 5 == fast_rsqrt(1000000): 65 , numerical diff = 0 , relative error = 0% Cycles: 201 Instructions: 201 == Testcase 6 == fast_rsqrt(12582912): 18 , numerical diff = 0 , relative error = 0% Cycles: 213 Instructions: 213 == Testcase 7 == fast_rsqrt(4294967293): 1 , numerical diff = 0 , relative error = 0% Cycles: 193 Instructions: 193 == Testcase 8 == fast_rsqrt(4294967294): 1 , numerical diff = 0 , relative error = 0% Cycles: 189 Instructions: 189 == Testcase 9 == fast_rsqrt(4294967295): 1 , numerical diff = 0 , relative error = 0% Cycles: 189 Instructions: 189 ``` **peephole optimization** is a kind of optimization inspect instruction code in very small window. Such as for these snippet: ``` sub r3, fp, #8 -- PUSHLA i sub r4, fp, #8 -- PUSHLA i ldr r4, [r4, #0] -- LOAD mov r5, #1 -- PUSHI 1 add r4, r4, r5 -- ADD str r4, [r3, #0] -- POPS ``` Inspect with only two instructions at the same time: ``` mov |a|, #|b| add |c|, |d|, |a| ``` can simplify as: ``` add |c|, |d|, #|b| ``` while this: ``` sub |a|, |b|, #|c| ldr |d|, [|a|, #0] ``` can simplify as: ``` ldr |d|, [|b|, #-|c|] ``` so after peephole optimization, snippet becomes: ``` sub r3, fp, #8 -- PUSHLA i ldr r4, [fp, #-8] -- PUSHLA i / LOAD add r4, r4, #1 -- PUSHI 1 / ADD str r4, [r3, #0] -- POPS ``` peephole optimization often apply to stack-based instructions. ### BOLT BOLT is a post-link optimizer. It utilize profiling, probe, perf tools to collect run time information. These techniques is called **PGO (post time optimization)** Apply collected statistic to optimize code. * inlining: make hotspot function inline Some of its optimizing pipeline: 1. strip-rep-tet: fix AMD branch prediction issues. 2. icf: indentical code folding 3. icp: indirect call promotion 4. peepholes 5. simplify-ro-loads: fetch constant data in .rodata whose address is known statically and mutate a load into a mov. 6. icf (again) 7. plt: remove indirection from PLT calls 8. reorder-bbs: reorder basic blocks and split hot/cold blocks into separate sections (layout optimization) 9. peepholes (again) 10. uce: eliminate unreachable basic blocks 11. fixup-branchs: fix basic block terminator instructions to match the CFG and the current layout (redone by reorder-bbs) 12. reorder-functions: apply HFSort to reorder function. 13. sctc: simplify conditional tail calls 14. frame-opts: removes unnecessary caller-saved register spilling 15. shrink-wrappping: moves callee-saved register spills closer to where they are needed, if profilling data shows it is better to do so. To do PGO, we need collects CPU data. However, PGO maybe have some side effects: * function reordering https://blogs.igalia.com/compilers/2023/06/30/porting-bolt-to-risc-v/ ### Try testing GCC PGO in host computer user-space ``` # O2 $ gcc main_host.c -O2 -o main_host_o2 # O2 with PGO $ gcc main_host.c -O2 -o main_host_gen -fprofile-generate # generate main_host_gen-main_host.gcda $ ./main_host_gen $ gcc main_host.c -O2 -o main_host_gen -fprofile-use ``` Ensuring all output and testing are pass. -O2: ``` === Tests Completed === [bench] N=16777216, window≈282ms, repeats=7 (median) [bench] Throughput: 59288554.07 iter/s, cost: 16.87 ns/iter ``` -O2 with GCC PGO: ``` === Tests Completed === [bench] N=16777216, window≈209ms, repeats=7 (median) [bench] Throughput: 79993271.82 iter/s, cost: 12.50 ns/iter ``` PGO has two types, one is instructment pgo, another is hardware pgo. instrument pgo insert code logic to collect information. while hardware pgo no onsert logic but utilize performance tools like linux perf to collect information. https://ubuntu.com/blog/profile-guided-optimization-a-case-study https://documentation.ubuntu.com/server/explanation/performance/perf-pgo/?_gl=1*14pmskw*_gcl_au*OTc2MjcxNTczLjE3NjI2OTk4MTU. ### Utilize profiling tools rv32emu has several profiling tools to find hotspot. ``` $ build/rv32emu -p build/[test_program].elf $ tools/rv_profiler [--start-address|--stop-address|--graph-ir] [test_program] Example: $ tools/rv_profiler --graph-ir 0x10000 [test_program] ``` Let's draw basic block diagram from begining of `fast_rsqrt` program ![Screenshot from 2025-11-08 14-02-33](https://hackmd.io/_uploads/SJtH0LhyWx.png) > a **basic block** is a straight-line code sequence with no branches in except to the entry and no branches out except at the exit.