# libriscv

## Read-write arena design

libriscv originally managed memory with pages, so every memory access had to check read/write permissions and so on. With the read-write arena design, almost every memory check needs only a single comparison to tell whether the access is legal.

```cpp
// a <= x < b  is equivalent to
// (unsigned)(x - a) < (unsigned)(b - a)
    else if constexpr (flat_readwrite_arena) {
        if (LIKELY(address - RWREAD_BEGIN < memory_arena_read_boundary())) {
#ifdef RISCV_EXT_VECTOR
            if constexpr (sizeof(T) >= 32) {
                // Reads and writes using vectors might have alignment requirements
                auto* arena = (VectorLane *)m_arena.data;
                return arena[RISCV_SPECSAFE(address / sizeof(VectorLane))];
            }
#endif
            return *(T *)&((const char*)m_arena.data)[RISCV_SPECSAFE(address)];
        }
        [[unlikely]];
    }

    const auto& pagedata = cached_readable_page(address, sizeof(T));
    return pagedata.template aligned_read<T>(offset);
} // encompassing_Nbit_arena
```

:::success
rv32emu's approach also looks similar to the read-write arena: it allocates the whole memory block up front. However, it does not seem to have read/write permissions per memory range?
:::

## Binary translation

```mermaid
flowchart TD
    A[Start] --> B[Decode instructions one by one]
    B --> C{Jump or stop instruction?}
    C -->|Yes| D[Split the code block]
    C -->|No| B
    D --> E[Check the cache]
    E -->|Found| F[Use the cached translation]
    E -->|Not found| G[Generate C code]
    G --> H[Store in the cache]
    H --> I[Return]
    F --> I
    I --> J[End]
```

### Splitting code blocks

Code blocks are split at JALR or the (custom) STOP instruction:

```cpp
for (; pc < endbasepc; ) {
    const rv32i_instruction instruction
        = read_instruction(exec.exec_data(), pc, endbasepc);
    if constexpr (compressed_enabled)
        pc += instruction.length();
    else
        pc += 4;
    block_insns++;

    // JALR and STOP are show-stoppers / code-block enders
    if (block_insns >= ITS_TIME_TO_SPLIT && is_stopping_instruction(instruction)) {
        break;
    }
}
```

Pre-generated C code can be used together with the cache:

```cpp
if (options.translate_enable_embedded) {
    for (size_t i = 0; i < registered_embedded_translations<W>.count; i++) {
        auto& translation = registered_embedded_translations<W>.translations[i];
        if (translation.hash == checksum) {
            // Found a matching translation
            *translation.api_table = create_bintr_callback_table(exec);
            // Additional processing...
            return 0;
        }
    }
}
```

libriscv is mainly used alongside game engines, where the same script may be executed many times. Its optimizations therefore focus on low latency (returning quickly to the game engine after a vmcall), rather than on long-running workloads as rv32emu does (especially once system emulation is integrated).

### Custom instruction: STOP

> [Using C++ as a Scripting Language, part 11](https://fwsgonzo.medium.com/the-stop-instruction-9eb536560e95)

When writing scripts, libriscv's dedicated STOP instruction can be used in place of the ret instruction, which saves the several instructions that a normal ret sequence requires.

### prepare call

> [Lowering the latency of the lowest-latency emulator](https://fwsgonzo.medium.com/lowering-the-latency-of-the-lowest-latency-emulator-1751c0d0e4e2)

![image](https://hackmd.io/_uploads/HJk6XA-oC.png)

Preparing the function address, arguments and so on ahead of time avoids repeated computation and improves efficiency.

## JIT compilation with TinyCC

libriscv's TinyCC-based JIT follows a flow similar to the one above:
split code blocks $\to$ generate C code $\to$ run it with TinyCC

```shell
cd emulator
cmake .. -DRISCV_BINARY_TRANSLATION=1
bash build.sh -x --tcc
```

## Focus: analysis of the libriscv interpreter

### Comparison with the rv32emu interpreter

```
$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
Stepping:            7
CPU MHz:             2394.374
BogoMIPS:            4788.74
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           256 KiB
L1i cache:           256 KiB
L2 cache:            32 MiB
L3 cache:            16 MiB
NUMA node0 CPU(s):   0-7
```

---

Build configuration: the C extension is disabled, the read-write arena is enabled (default ON), and experimental features such as the 32-bit unbounded address space are enabled.

```
cmake .. -DRISCV_EXPERIMENTAL=ON -DRISCV_FLAT_RW_ARENA=ON -DRISCV_EXT_C=OFF -DCMAKE_BUILD_TYPE=Release
```

**without experimental features**

![upload_f1c9b479d8883eb685b86fb6aac1c897](https://hackmd.io/_uploads/HJeDHlosC.png)

---

```
gcc (GCC) 14.2.0
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```

Instruction counts:

| Program    | rv32emu     | libriscv    |
| ---------- | ----------- | ----------- |
| Dhrystone  | 71500011737 | 71500011833 |
| AES        | 4258475     | 4276028     |
| Mandelbrot | 3256610     | 3256700     |
| Pi         | 13594914    | 13594999    |
| Richards   | 359441863   | 359441948   |
| N-Queens   | 2428587791  | 2428587876  |
| Primes     | 7114988149  | 7114988162  |
| SHA-512    | 7617544976  | 7617545069  |

Execution time (ms):

| Program    | rv32emu      | libriscv + EXPERIMENTAL |
| ---------- | ------------ | ----------------------- |
| Dhrystone  | 181241.922ms | 130059.465ms            |
| *AES       | 19.372ms     | 10.888ms                |
| Mandelbrot | 12.469ms     | 11.942ms                |
| Pi         | 44.524ms     | 34.392ms                |
| Richards   | 1098.411ms   | 799.831ms               |
| N-Queens   | 6762.061ms   | 4551.411ms              |
| *Primes    | 22713.252ms  | 18506.417ms             |
| *SHA-512   | 19198.100ms  | 18254.654ms             |

![Screenshot 2024-08-24 at 12.32.55 AM](https://hackmd.io/_uploads/ByeZqN8oR.png)

---

### libriscv's decoder cache

```mermaid
graph TD
    A[Load ELF file] --> C[Generate decoder cache]
    C --> D[main loop]
    D --> E[Fetch instruction from decoder cache]
    E --> F[Execute instruction]
    F --> G[Update PC]
    G --> E
```

**After loading the ELF, the entire executable segment is decoded up front to build the cache, which reduces repeated decoding at run time.**

### Implementing a decoder cache for rv32emu

Goal: gain a better understanding of rv32emu's architecture.

#### decoder cache v1

1. After loading the ELF, generate a decoder cache for the exec segment
    * The decoder cache can look up an instruction by PC
    * The exec segment has to be extracted during rv32emu's ELF-loading stage
    * Basic blocks can also be marked while generating the decoder cache
2. Keep the basic_block mechanism
    * Generate basic blocks directly from the decoder cache

---

### uftrace analysis

Build rv32emu and libriscv with the `-finstrument-functions` flag.

```
uftrace record build/rv32emu build/riscv32/mandelbrot
```

The figure below shows the time breakdown reported by uftrace.

![image](https://hackmd.io/_uploads/By15Ga2nA.png)

> Instruction decoding actually accounts for only a small share of the time, and basic blocks already play a role similar to a decoder cache, so this is not our main bottleneck.

rv32emu's call depth appears to be deeper than libriscv's, which looks like a direction worth pursuing.

### Comparing tail-call optimization in libriscv and rv32emu

TCO should be enabled by default. Judging from how rv32emu uses `MUST_TAIL`, the recursion should not be this deep. Investigating ...
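
To make the `MUST_TAIL` idea concrete, here is a minimal sketch in C of a tail-call-dispatched interpreter for a toy accumulator machine. The `insn_t` layout, the `op_addi` / `op_muli` handlers, and `demo_run` are hypothetical names made up for illustration; they are not rv32emu or libriscv code. The macro expands to `__attribute__((musttail))` only when the compiler reports support through `__has_attribute`; otherwise it expands to nothing, and the call turns into a jump only if the optimizer chooses to do so, which is the situation suspected above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Request a guaranteed tail call only where the compiler reports support. */
#if defined(__has_attribute)
#if __has_attribute(musttail)
#define MUST_TAIL __attribute__((musttail))
#endif
#endif
#ifndef MUST_TAIL
#define MUST_TAIL /* no guarantee: the optimizer may or may not emit a jump */
#endif

typedef struct insn insn_t;
typedef int64_t (*handler_t)(const insn_t *ir, int64_t acc);

/* Hypothetical decoded instruction: handler, operand, link to the successor. */
struct insn {
    handler_t impl;
    int64_t imm;
    const insn_t *next; /* NULL marks the end of the block */
};

/* Every handler has the same signature, which musttail requires. */
static int64_t op_addi(const insn_t *ir, int64_t acc)
{
    acc += ir->imm;
    if (!ir->next)
        return acc;
    const insn_t *next = ir->next;
    MUST_TAIL return next->impl(next, acc);
}

static int64_t op_muli(const insn_t *ir, int64_t acc)
{
    acc *= ir->imm;
    if (!ir->next)
        return acc;
    const insn_t *next = ir->next;
    MUST_TAIL return next->impl(next, acc);
}

/* Builds the block "addi 5; muli 4; addi 2" and runs it from acc = 0. */
int64_t demo_run(void)
{
    insn_t block[3];
    block[0] = (insn_t){ op_addi, 5, &block[1] };
    block[1] = (insn_t){ op_muli, 4, &block[2] };
    block[2] = (insn_t){ op_addi, 2, NULL };
    assert(block[0].impl != NULL);
    return block[0].impl(&block[0], 0);
}
```

Calling `demo_run()` evaluates (0 + 5) * 4 + 2 and returns 22. With a guaranteed tail call each handler ends in a jump, so the native stack stays flat no matter how long the block is; without it, every executed instruction may add a stack frame, which would show up as a deep call graph in a profiler.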
---

```
Looking through the gcc-14.2 documentation, no musttail attribute was found
https://gcc.gnu.org/onlinedocs/gcc-14.2.0/gcc.pdf
```

::: info
gcc-15.0 only supports the `[[gnu::musttail]]` form

[Related discussion](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83324)
[Support old style statement attributes](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116545)
[6.40 Statement Attributes ](https://gcc.gnu.org/onlinedocs/gcc/Statement-Attributes.html#index-musttail-statement-attribute)
:::

After switching back to clang, with TCO applied successfully, rv32emu's flame graph looks more normal.

```
clang-18 --version
Ubuntu clang version 18.1.8 (++20240731024944+3b5b5c1ec4a3-1~exp1~20240731145000.144)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

make CC=clang-18
```

![flame3](https://hackmd.io/_uploads/HJzfwL62R.svg)

In the earlier runs, tail calls were not correctly enabled in either libriscv or rv32emu. The experiment is repeated with both enabled.

:::info
Further reading:
[musttail-efficient-interpreters](https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html)
["musttail" statement attribute for GCC?](https://gcc.gnu.org/pipermail/gcc/2021-April/235891.html)
:::

### libriscv dispatch mode

> libriscv only adds tailcall_dispatch.cpp to the build when RISCV_TAILCALL_DISPATCH is enabled
>> Look into the differences between its tailcall_dispatch, threaded_dispatch, and bytecode_dispatch
>

```cmake=
if (RISCV_THREADED OR RISCV_TAILCALL_DISPATCH)
    if (CMAKE_CXX_COMPILER_ID STREQUAL "Clang" AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER_EQUAL 13.0 AND RISCV_TAILCALL_DISPATCH)
        # Experimental tail-call dispatch
        message(STATUS "libriscv: Tail-call dispatch enabled")
        list(APPEND SOURCES
            libriscv/tailcall_dispatch.cpp
        )
    else()
        message(STATUS "libriscv: Threaded dispatch enabled")
        list(APPEND SOURCES
            libriscv/threaded_dispatch.cpp
        )
    endif()
else()
    message(STATUS "libriscv: Switch-based dispatch enabled")
    list(APPEND SOURCES
        libriscv/bytecode_dispatch.cpp
    )
endif()
```

| Property | Tail-call dispatch | Threaded dispatch | Switch-based dispatch |
|------|---------------------|-------------------|------------------------|
| Mechanism | tail call | dispatch table | switch case |
| Performance | High | High | Medium |
| Compiler support | Requires tail-call optimization | Requires computed goto | None |
| Readability | Medium | Lower | High |
| Instruction dispatch cost | Low | Very low | Medium |
| Memory usage | Low | Medium (jump table) | Low |

The earlier experiments ran libriscv in threaded_dispatch mode.

---

#### Benchmark: rv32emu (TCO) vs. libriscv (TCO) vs. libriscv (threaded dispatch)

```
cmake .. -DRISCV_EXPERIMENTAL=ON -DRISCV_FLAT_RW_ARENA=ON -DRISCV_TAILCALL_DISPATCH=ON -DCMAKE_CXX_COMPILER=clang++-18 -DCMAKE_BUILD_TYPE=Release $OPTS -DEMBED_FILES="$EMBED_FILES"
```

:::warning
Some benchmarks hit a Protection fault (data: 0x0) error when run with libriscv, so they are excluded for now.

./rvlinux ~/workspace/rv32emu/build/riscv32/aes --accurate
[0001FDE4] 00000073 SYS ECALL
>>> Machine exception 2: Protection fault (data: 0x0)
[RA 00010130] [SP 01127E20] [GP 00027810] [TP 00000000] [LR 00000000] [TMP1 0000000F] [TMP2 00000000] [SR0 00000000] [SR1 00000000] [A0 00000000] [A1 10000018] [A2 80000000] [A3 00000007] [A4 00000000] [A5 00000006] [A6 0000007E] [A7 000000D6] [SR2 00000000] [SR3 00000000] [SR4 00000000] [SR5 00000000] [SR6 10000000] [SR7 00000000] [SR8 00000000] [SR9 00000000] [SR10 00000000] [SR11 00000000] [TMP3 00000000] [TMP4 00000000] [TMP5 00000000] [TMP6 00000000]
-> [0] 0x0001fda8 + 0x03c: _sbrk
-> [1] 0x000100e0 + 0x050: main
>>> Program exited, exit code = 0 (0x0)
Instructions executed: 1002 Runtime: 3.882ms Insn/s: 0mi/s
Pages in use: 27 (108 kB virtual memory, total 187 kB)
:::

Execution time (ms):

| Program    | rv32emu(tco) | libriscv(tco) | libriscv(threaded dispatch) |
| ---------- | ------------ | ------------- | --------------------------- |
| dhrystone  | 218903.909ms | 231396.677ms  | 164086.052ms                |
| mandelbrot | 12.868ms     | 9.872ms       | 10.576ms                    |
| richards   | 1189.493ms   | 1316.668ms    | 815.390ms                   |
| nqueens    | 7870.810ms   | 6184.693ms    | 4934.192ms                  |
| sha512     | 210645.159ms | 140724.875ms  | 146419.812ms                |
| puzzle     | 19.956ms     | 18.575ms      | 21.897ms                    |
| pi         | 44.796ms     | 51.575ms      | 51.926ms                    |

:::success
Ignoring the benchmarks whose run time is too short, libriscv clearly performs better with threaded dispatch.
The results match the author's description:

RISCV_THREADED Enable threaded dispatch, using computed goto. Fastest dispatch method. When threaded and tailcall are both disabled, fall back to switch-based dispatch.

RISCV_TAILCALL_DISPATCH Enable dispatch using musttail. Clang only. Faster than threaded for simple loops, but on real programs it is always a bit slower.
:::

> [The libriscv author was disappointed by how little tail calls improved performance](https://github.com/libriscv/libriscv/pull/88)
> [Further reading: Threaded Code](http://www.complang.tuwien.ac.at/forth/threaded-code.html)

**A few questions remain to be clarified**

1. Why does libriscv's threaded dispatch outperform TCO?

tail call dispatch
```c++=
#define EXECUTE_INSTR() \
    computed_opcode<W>[d->get_bytecode()](d, exec, cpu, pc, counter)
#define EXECUTE_CURRENT() \
    MUSTTAIL return EXECUTE_INSTR();
```

threaded dispatch
```c++=
#define EXECUTE_INSTR() \
    if constexpr (FUZZING) { \
        if (UNLIKELY(decoder->get_bytecode() >= BYTECODES_MAX)) \
            abort(); \
    } \
    goto *computed_opcode[decoder->get_bytecode()];
```

```
threaded dispatch
 Performance counter stats for './rvlinux /home/youngli/workspace/rv32emu/build/riscv32/nqueens':

    13,763,750,466      cycles
    29,478,400,991      instructions              #    2.14  insn per cycle
           947,414      cache-references
           338,176      cache-misses              #   35.695 % of all cache refs
        42,645,900      branch-misses

       4.668070367 seconds time elapsed

       4.651183000 seconds user
       0.016032000 seconds sys

tail-call dispatch
 Performance counter stats for './rvlinux /home/youngli/workspace/rv32emu/build/riscv32/nqueens':

    20,854,252,292      cycles
    49,336,713,155      instructions              #    2.37  insn per cycle
         2,487,022      cache-references
         1,219,755      cache-misses              #   49.045 % of all cache refs
        39,455,093      branch-misses

       7.082288287 seconds time elapsed

       7.077223000 seconds user
       0.004020000 seconds sys
```

* tail-call dispatch executes about 67% more instructions (49.3 billion vs. 29.5 billion), which is probably the main reason it is slower
* tail-call dispatch has a higher cache-miss rate but fewer branch mispredictions, and achieves a higher insn
per cycle

2. With both using TCO, rv32emu performs better than libriscv on I/O-intensive benchmarks, while the opposite holds for compute-intensive ones. The cause is worth exploring.

So we look at a compute-intensive benchmark (sha512), where libriscv performs better:

```
 Performance counter stats for './rv32emu riscv32/sha512':

   576,661,862,217      cycles                                                          (71.43%)
 1,353,653,303,632      instructions              #    2.35  insn per cycle             (85.72%)
        33,464,988      cache-references                                                (85.71%)
        10,246,043      cache-misses              #   30.617 % of all cache refs        (85.71%)
       510,995,285      branch-misses                                                   (85.72%)
   497,012,433,423      L1-dcache-loads                                                 (85.72%)
     3,750,744,443      L1-dcache-load-misses     #    0.75% of all L1-dcache accesses  (57.14%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

     198.639135358 seconds time elapsed

     198.600216000 seconds user
       0.019998000 seconds sys

 Performance counter stats for './rvlinux /home/youngli/workspace/rv32emu/build/riscv32/sha512':

   359,139,043,501      cycles                                                          (71.44%)
   995,566,163,779      instructions              #    2.77  insn per cycle             (85.72%)
        21,309,210      cache-references                                                (85.71%)
         7,639,837      cache-misses              #   35.852 % of all cache refs        (85.71%)
       296,970,164      branch-misses                                                   (85.72%)
   469,682,864,321      L1-dcache-loads                                                 (85.72%)
        88,723,586      L1-dcache-load-misses     #    0.02% of all L1-dcache accesses  (57.14%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

     123.630150398 seconds time elapsed

     123.589485000 seconds user
       0.016000000 seconds sys
```

In this case rv32emu executes about 36% more instructions but takes about 60% more time.

Next, observe with perf:

```
sudo perf report -i perf.data
```

```
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 732K of event 'cycles'
# Event count (approx.): 577438045690
#
# Children      Self  Command  Shared Object            Symbol
# ........  ........  .......  .......................  ................................................................
#
    87.68%     0.00%  rv32emu  rv32emu                  [.]
on_mem_ifetch.llvm.16934892941773522949
            |
            ---on_mem_ifetch.llvm.16934892941773522949
               |
               |--15.77%--do_fuse5
               |
               |--15.45%--do_add
               |
               |--10.14%--do_xor
               |
               |--9.83%--do_or
               |
               |--8.91%--do_addi
               |
               |--5.51%--do_slli
               |
               |--5.17%--do_sltu
               |
               |--4.38%--do_srli
               |
               |--4.25%--do_lw
               |
               |--3.34%--do_and
               |
               |--1.49%--do_andi
               |
               |--0.72%--do_fuse3
               |
               |--0.59%--do_lui
               |
                --0.56%--do_bne

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 462K of event 'cycles'
# Event count (approx.): 363982144251
#
# Children      Self  Command  Shared Object         Symbol
# ........  ........  .......  ....................  .............................................
#
    98.63%     0.00%  rvlinux  libc.so.6             [.] __libc_start_call_main
            |
            ---__libc_start_call_main
               |
               |--16.46%--riscv::rv32i_ldw<4>
               |
               |--13.31%--riscv::rv32i_op_add<4>
               |
               |--12.39%--riscv::rv32i_slli<4>
               |
               |--10.89%--riscv::rv32i_srli<4>
               |
               |--9.46%--riscv::rv32i_op_xor<4>
               |
               |--8.53%--riscv::rv32i_op_or<4>
               |
               |--7.02%--riscv::rv32i_stw<4>
               |
               |--5.63%--riscv::rv32i_mv<4>
               |
               |--5.05%--riscv::rv32i_op_sltu<4>
               |
               |--2.65%--riscv::rv32i_op_and<4>
               |
               |--2.14%--riscv::rv32i_addi<4>
               |
               |--1.19%--riscv::rv32i_andi<4>
               |
                --1.18%--riscv::rv32i_bne<4>
```

Compare rv32emu's `do_add` with libriscv's `rv32i_op_add`:

```
sudo perf annotate -i perf.data
```

##### do_add

```c=
#define RVOP(inst, code, asm) \
    static bool do_##inst(riscv_t *rv, rv_insn_t *ir, uint64_t cycle, \
                          uint32_t PC) \
    { \
        cycle++; \
        code; \
    nextop: \
        PC += __rv_insn_##inst##_len; \
        if (unlikely(RVOP_NO_NEXT(ir))) { \
            goto end_op; \
        } \
        const rv_insn_t *next = ir->next; \
        MUST_TAIL return next->impl(rv, next, cycle, PC); \
    end_op: \
        rv->csr_cycle = cycle; \
        rv->PC = PC; \
        return true; \
    }

/* ADD */
RVOP(
    add,
    { rv->X[ir->rd] = rv->X[ir->rs1] + rv->X[ir->rs2]; },
    GEN({
        rald2, rs1, rs2;
        map, VR2, rd;
        mov, VR1, TMP;
        mov, VR0, VR2;
        alu32, 0x01, TMP, VR2;
    }))

=========================================================================
Percent│
       │
       │
       │     Disassembly of section .text:
       │
       │     0000000000009b50 <do_add>:
       │     do_add():
  2.40 │       inc    %rdx                      // cycle++;
  0.38 │       movzbl 0x5(%rsi),%eax
  4.38 │       movzbl 0x6(%rsi),%r8d
  9.59 │       mov    0x60(%rdi,%r8,4),%r8d
 22.98 │       add    0x60(%rdi,%rax,4),%r8d
 12.93 │       movzbl 0x4(%rsi),%eax
  0.51 │       mov    %r8d,0x60(%rdi,%rax,4)    // rv->X[ir->rd] = rv->X[ir->rs1] + rv->X[ir->rs2];
 27.68 │       add    $0x4,%ecx                 // PC += __rv_insn_add_len;
  0.58 │       mov    0x20(%rsi),%rsi
  0.15 │       test   %rsi,%rsi
       │     ↓ je     31                        // if (unlikely(RVOP_NO_NEXT(ir))) { goto end_op; }
  0.37 │       mov    0x28(%rsi),%rax
 18.05 │     → jmp    *%rax                     // const rv_insn_t *next = ir->next;
       │                                        // MUST_TAIL return next->impl(rv, next, cycle, PC);
       │31:    mov    %rdx,0x178(%rdi)
       │       mov    %ecx,0xe0(%rdi)
       │       mov    $0x1,%al
       │     ← ret                              // endop
```

The PC-update line looks odd: why would it be a hot spot?
An experiment: rv32emu's ir already stores the pc, so it should not be necessary to update PC after every instruction.
First, change the macro as follows:

```c=
/* Interpreter-based execution path */
#define RVOP(inst, code, asm) \
    static bool do_##inst(riscv_t *rv, rv_insn_t *ir, uint64_t cycle, \
                          uint32_t PC) \
    { \
        cycle++; \
        code; \
    nextop: \
        if (unlikely(RVOP_NO_NEXT(ir))) { \
            PC += __rv_insn_##inst##_len; \
            goto end_op; \
        } \
        const rv_insn_t *next = ir->next; \
        MUST_TAIL return next->impl(rv, next, cycle, next->pc); \
    end_op: \
        rv->csr_cycle = cycle; \
        rv->PC = PC; \
        return true; \
    }

Samples: 482K of event 'cycles', 3500 Hz, Event count (approx.): 376949164434
do_add  /home/youngli/workspace/rv32emu/build/rv32emu [Percent: local period]
Percent│
       │     Disassembly of section .text:
       │
       │     0000000000009b90 <do_add>:
       │     do_add():
  2.84 │       inc    %rdx
  0.56 │       movzbl 0x5(%rsi),%eax
  2.60 │       movzbl 0x6(%rsi),%r8d
 10.40 │       mov    0x60(%rdi,%r8,4),%r8d
 22.69 │       add    0x60(%rdi,%rax,4),%r8d
 13.54 │       movzbl 0x4(%rsi),%eax
  0.23 │       mov    %r8d,0x60(%rdi,%rax,4)
 27.60 │       mov    0x20(%rsi),%rsi
  1.01 │       test   %rsi,%rsi
       │     ↓ je     31
  0.05 │       mov    0x28(%rsi),%rax
  5.15 │       mov    0x18(%rsi),%ecx
 13.32 │     → jmp    *%rax
       │31:    add    $0x4,%ecx
       │       mov    %rdx,0x178(%rdi)
       │       mov    %ecx,0xe0(%rdi)
       │       mov    $0x1,%al
       │     ← ret
```

Now the hot spot has instead moved to `27.60 │ mov 0x20(%rsi),%rsi` (fetching `ir->next`)?
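
The two annotated listings reach the next decoded instruction in different ways: the rv32emu macro above loads `ir->next`, while the libriscv macros in the next subsection advance `d` through an array. The following is a minimal sketch of the two layouts in C. The types and functions (`linked_insn_t`, `packed_insn_t`, `run_linked`, `run_packed`, and the two `demo_*` helpers) are hypothetical and not taken from either project. It only illustrates how the successor address is obtained and makes no claim about which layout is faster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical linked layout: the successor is found by loading a pointer. */
typedef struct linked_insn {
    int32_t imm;
    const struct linked_insn *next; /* NULL marks the end of the block */
} linked_insn_t;

/* Hypothetical contiguous layout: the successor is the next array element. */
typedef struct {
    int32_t imm;
    uint8_t is_last; /* nonzero on the final instruction of the block */
} packed_insn_t;

/* Sums the immediates by following next pointers. */
int64_t run_linked(const linked_insn_t *ir)
{
    int64_t acc = 0;
    assert(ir != NULL);
    while (ir != NULL) {
        acc += ir->imm;
        ir = ir->next; /* the next address depends on this load */
    }
    return acc;
}

/* Sums the immediates by stepping through a contiguous array. */
int64_t run_packed(const packed_insn_t *d)
{
    int64_t acc = 0;
    assert(d != NULL);
    for (;;) {
        acc += d->imm;
        if (d->is_last)
            break;
        d += 1; /* the next address is computed, not loaded */
    }
    return acc;
}

/* Three instructions with immediates 1, 2, 3 in the linked layout. */
int64_t demo_linked(void)
{
    linked_insn_t c = { 3, NULL };
    linked_insn_t b = { 2, &c };
    linked_insn_t a = { 1, &b };
    return run_linked(&a);
}

/* The same three instructions in the contiguous layout. */
int64_t demo_packed(void)
{
    const packed_insn_t block[3] = { { 1, 0 }, { 2, 0 }, { 3, 1 } };
    return run_packed(block);
}
```

Both `demo_linked()` and `demo_packed()` return 6. Whether the extra load in the linked layout explains the samples attributed to `mov 0x20(%rsi),%rsi` would need to be checked with a dedicated measurement.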
##### libriscv add

```c++=
#define INSTRUCTION(bytecode, name) \
    template <int W> \
    static TcoRet<W> name(DecoderData<W>* d, MUNUSED DecodedExecuteSegment<W>* exec, MUNUSED CPU<W>& cpu, MUNUSED address_type<W> pc, MUNUSED InstrCounter& counter)

#define VIEW_INSTR() \
    auto instr = *(rv32i_instruction *)&d->instr;
#define VIEW_INSTR_AS(name, x) \
    auto&& name = *(x *)&d->instr;
#define EXECUTE_INSTR() \
    computed_opcode<W>[d->get_bytecode()](d, exec, cpu, pc, counter)
#define EXECUTE_CURRENT() \
    MUSTTAIL return EXECUTE_INSTR();
#define NEXT_INSTR() \
    d += (compressed_enabled ? 2 : 1); \
    EXECUTE_CURRENT()

#define OP_INSTR() \
    VIEW_INSTR_AS(fi, FasterOpType); \
    auto& dst = REG(fi.get_rd()); \
    const auto src1 = REG(fi.get_rs1()); \
    const auto src2 = REG(fi.get_rs2());

INSTRUCTION(RV32I_BC_OP_ADD, rv32i_op_add) {
    OP_INSTR();
    dst = src1 + src2;
    NEXT_INSTR();
}

#define REG(x) cpu.reg(x)

Percent│
       │ 000000000006bb20 <std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_op_add<4>(riscv::DecoderData<4>*, riscv::DecodedExe
       │ std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_op_add<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, r
 17.24 │   movzwl 0x4(%rsi),%eax
  7.45 │   movzbl 0x7(%rsi),%r10d
  2.70 │   movzbl 0x6(%rsi),%r11d               // OP_INSTR();
 10.82 │   mov    0x4(%rcx,%r11,4),%r11d
 11.35 │   add    0x4(%rcx,%r10,4),%r11d
 14.92 │   mov    %r11d,0x4(%rcx,%rax,4)        // dst = src1 + src2;
 25.44 │   movzbl 0x10(%rsi),%eax
  5.27 │   add    $0x10,%rsi
  1.77 │   lea    riscv::(anonymous namespace)::computed_opcode<4,%r11
  3.03 │ → jmp    *(%r11,%rax,8)
```

**Tools such as gprof or uftrace are a poor fit for an emulator, because `-pg` inserts code into the call path and breaks tail-call optimization.**
So some information still has to be gathered by instrumenting the code by hand.

P.S.
The following uses the puzzle benchmark, because sha512 takes too long under valgrind or gets killed.

```
--------------------------------------------------------------------------------
Profile data file 'callgrind.out.68317' (creator: callgrind-3.22.0)
--------------------------------------------------------------------------------
I1 cache:
D1 cache:
LL cache:
Timerange: Basic block 0 - 12832312
Trigger: Program termination
Profiled target:  ./rvlinux /home/youngli/workspace/rv32emu/build/riscv32/puzzle (PID 68317, part 1)
Events recorded:  Ir Bc Bcm Bi Bim
Events shown:     Ir Bc Bcm Bi Bim
Event sort order: Ir Bc Bcm Bi Bim
Thresholds:       99 0 0 0 0
Include dirs:
User annotated:
Auto-annotation:  on

--------------------------------------------------------------------------------
Ir Bc Bcm Bi Bim
--------------------------------------------------------------------------------
139,869,584 (100.0%) 4,312,959 (100.0%) 121,889 (100.0%) 7,649,488 (100.0%) 2,915,605 (100.0%) PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir Bc Bcm Bi Bim file:function
--------------------------------------------------------------------------------
17,903,080 (12.80%) 0 0 895,154 (11.70%) 164,036 ( 5.63%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_ldbu<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 [/home/youngli/workspace/libriscv/emulator/.build/rvlinux]
14,035,548 (10.03%) 0 0 2,339,258 (30.58%) 2,323,880 (79.70%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_op_add<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
13,421,436 ( 9.60%) 1,487,272
(34.48%) 9,433 ( 7.74%) 746,635 ( 9.76%) 4,013 ( 0.14%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_bge<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 [/home/youngli/workspace/libriscv/emulator/.build/rvlinux] 13,163,761 ( 9.41%) 786,117 (18.23%) 55,506 (45.54%) 786,117 (10.28%) 69,768 ( 2.39%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_beq_fw<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 [/home/youngli/workspace/libriscv/emulator/.build/rvlinux] 7,257,069 ( 5.19%) 0 0 806,341 (10.54%) 38,692 ( 1.33%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_slli<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 [/home/youngli/workspace/libriscv/emulator/.build/rvlinux] 7,046,874 ( 5.04%) 0 0 782,986 (10.24%) 11,097 ( 0.38%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_srai<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 [/home/youngli/workspace/libriscv/emulator/.build/rvlinux] 7,017,774 ( 5.02%) . . . . 
/home/youngli/workspace/libriscv/lib/libriscv/threaded_bytecodes.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_op_add<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 [/home/youngli/workspace/libriscv/emulator/.build/rvlinux] 6,266,078 ( 4.48%) 895,154 (20.75%) 7 ( 0.01%) . . /home/youngli/workspace/libriscv/lib/libriscv/memory_inline.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_ldbu<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 4,663,140 ( 3.33%) 0 0 932,628 (12.19%) 148,890 ( 5.11%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_addi<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 4,616,588 ( 3.30%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/decoder_cache.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_beq_fw<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 3,739,173 ( 2.67%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/decoder_cache.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_bge<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 3,727,177 ( 2.66%) . . . . 
/home/youngli/workspace/libriscv/lib/libriscv/instruction_counter.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_bge<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 3,144,468 ( 2.25%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/instruction_counter.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_beq_fw<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 2,797,884 ( 2.00%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/threaded_bytecodes.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_addi<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 [/home/youngli/workspace/libriscv/emulator/.build/rvlinux] 2,685,462 ( 1.92%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/threaded_bytecodes.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_ldbu<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 2,419,023 ( 1.73%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/threaded_bytecodes.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_slli<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 2,348,958 ( 1.68%) . . . . 
/home/youngli/workspace/libriscv/lib/libriscv/threaded_bytecodes.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_srai<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 2,339,258 ( 1.67%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/decoder_cache.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_op_add<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 2,258,237 ( 1.61%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/threaded_bytecodes.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_beq_fw<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 2,233,907 ( 1.60%) . . . . 
/home/youngli/workspace/libriscv/lib/libriscv/threaded_bytecodes.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_bge<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 1,426,272 ( 1.02%) 225,304 ( 5.22%) 9,807 ( 8.05%) 2 ( 0.00%) 1 ( 0.00%) /home/youngli/workspace/libriscv/lib/libriscv/decoder_cache.cpp:riscv::Memory<4>::generate_decoder_cache(riscv::MachineOptions<4> const&, std::shared_ptr<riscv::DecodedExecuteSegment<4> >&, bool) [/home/youngli/workspace/libriscv/emulator/.build/rvlinux] 1,270,473 ( 0.91%) 113,647 ( 2.64%) 19,645 (16.12%) 77,461 ( 1.01%) 34,708 ( 1.19%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_bne<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 [/home/youngli/workspace/libriscv/emulator/.build/rvlinux] 958,697 ( 0.69%) 117,639 ( 2.73%) 2,255 ( 1.85%) . . ./elf/./elf/dl-lookup.c:_dl_lookup_symbol_x [/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2] 932,628 ( 0.67%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/decoder_cache.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_addi<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 895,154 ( 0.64%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/decoder_cache.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_ldbu<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 806,341 ( 0.58%) . . . . 
/home/youngli/workspace/libriscv/lib/libriscv/decoder_cache.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_slli<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
782,986 ( 0.56%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/decoder_cache.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_srai<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
635,991 ( 0.45%) 42,405 ( 0.98%) 22 ( 0.02%) 42,398 ( 0.55%) 11,122 ( 0.38%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_blt<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 [/home/youngli/workspace/libriscv/emulator/.build/rvlinux]
579,853 ( 0.41%) 86,526 ( 2.01%) 6,476 ( 5.31%) 246 ( 0.00%) 4 ( 0.00%) ./elf/./elf/dl-lookup.c:do_lookup_x [/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2]
505,847 ( 0.36%) 36,900 ( 0.86%) 511 ( 0.42%) 18,407 ( 0.24%) 13,574 ( 0.47%) /home/youngli/workspace/libriscv/lib/libriscv/threaded_rewriter.cpp:riscv::DecodedExecuteSegment<4>::threaded_rewrite(unsigned long, unsigned int, riscv::rv32i_instruction&) [/home/youngli/workspace/libriscv/emulator/.build/rvlinux]
500,441 ( 0.36%) 0 0 26,339 ( 0.34%) 15,194 ( 0.52%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_stw<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 [/home/youngli/workspace/libriscv/emulator/.build/rvlinux]
498,780 ( 0.36%) 0 0 27,710 ( 0.36%) 8,014 ( 0.27%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_stb<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 [/home/youngli/workspace/libriscv/emulator/.build/rvlinux]
449,560 ( 0.32%) 0 0 22,478 ( 0.29%) 10,342 ( 0.35%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_ldw<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 [/home/youngli/workspace/libriscv/emulator/.build/rvlinux]
428,580 ( 0.31%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/decoder_cache.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_bne<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
420,684 ( 0.30%) 36,512 ( 0.85%) 8,875 ( 7.28%) 25,929 ( 0.34%) 3,738 ( 0.13%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_beq<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 [/home/youngli/workspace/libriscv/emulator/.build/rvlinux]
387,220 ( 0.28%) 0 0 19,361 ( 0.25%) 30 ( 0.00%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_ldh<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 [/home/youngli/workspace/libriscv/emulator/.build/rvlinux]
352,979 ( 0.25%) 64,376 ( 1.49%) 2,339 ( 1.92%) 26,745 ( 0.35%) 15,826 ( 0.54%) /home/youngli/workspace/libriscv/lib/libriscv/decode_bytecodes.cpp:riscv::CPU<4>::computed_index_for(riscv::rv32i_instruction)
346,030 ( 0.25%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/instruction_counter.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_bne<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
283,566 ( 0.20%) 54,086 ( 1.25%) 1,107 ( 0.91%) 3,511 ( 0.05%) 75 ( 0.00%) ./elf/../sysdeps/x86_64/dl-machine.h:_dl_relocate_object
225,303 ( 0.16%) 55,282 ( 1.28%) 29 ( 0.02%) . . /home/youngli/workspace/libriscv/lib/libriscv/safe_instr_loader.hpp:riscv::Memory<4>::generate_decoder_cache(riscv::MachineOptions<4> const&, std::shared_ptr<riscv::DecodedExecuteSegment<4> >&, bool)
211,997 ( 0.15%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/decoder_cache.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_blt<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
191,108 ( 0.14%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/threaded_bytecodes.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_bne<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
184,270 ( 0.13%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/rv32i_instr.hpp:riscv::Memory<4>::generate_decoder_cache(riscv::MachineOptions<4> const&, std::shared_ptr<riscv::DecodedExecuteSegment<4> >&, bool)
169,599 ( 0.12%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/instruction_counter.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_blt<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
166,260 ( 0.12%) 27,710 ( 0.64%) 3 ( 0.00%) . . /home/youngli/workspace/libriscv/lib/libriscv/memory_inline.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_stb<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
158,034 ( 0.11%) 26,339 ( 0.61%) 8 ( 0.01%) . . /home/youngli/workspace/libriscv/lib/libriscv/memory_inline.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_stw<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
157,346 ( 0.11%) 22,478 ( 0.52%) 14 ( 0.01%) . . /home/youngli/workspace/libriscv/lib/libriscv/memory_inline.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_ldw<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
140,228 ( 0.10%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/decoder_cache.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_beq<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
135,876 ( 0.10%) 24,692 ( 0.57%) 43 ( 0.04%) 6,173 ( 0.08%) 5,098 ( 0.17%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_jalr<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 [/home/youngli/workspace/libriscv/emulator/.build/rvlinux]
135,527 ( 0.10%) 19,361 ( 0.45%) 1 ( 0.00%) . . /home/youngli/workspace/libriscv/lib/libriscv/memory_inline.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_ldh<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
123,694 ( 0.09%) 17,933 ( 0.42%) 115 ( 0.09%) . . ./elf/./elf/do-rel.h:_dl_relocate_object
118,032 ( 0.08%) 0 0 29,508 ( 0.39%) 16,117 ( 0.55%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_li<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
116,872 ( 0.08%) 15,166 ( 0.35%) 88 ( 0.07%) . . ./elf/./elf/dl-lookup.c:check_match [/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2]
114,299 ( 0.08%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/instruction_counter.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_beq<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
110,840 ( 0.08%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/threaded_bytecodes.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_stb<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
109,728 ( 0.08%) 6,096 ( 0.14%) 0 6,096 ( 0.08%) 4,062 ( 0.14%) /home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_fast_call<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2 [/home/youngli/workspace/libriscv/emulator/.build/rvlinux]
104,291 ( 0.07%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/rv32i_instr.hpp:riscv::CPU<4>::computed_index_for(riscv::rv32i_instruction) [/home/youngli/workspace/libriscv/emulator/.build/rvlinux]
84,803 ( 0.06%) . . . . /home/youngli/workspace/libriscv/lib/libriscv/threaded_bytecodes.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_blt<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
82,304 ( 0.06%) 74,546 ( 1.73%) 106 ( 0.09%) . . ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc.so.6]
79,017 ( 0.06%) . . . .
/home/youngli/workspace/libriscv/lib/libriscv/threaded_bytecodes.hpp:std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_stw<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2
78,366 ( 0.06%) 9,185 ( 0.21%) 673 ( 0.55%) 193 ( 0.00%) 136 ( 0.00%) ./string/../sysdeps/x86_64/strcmp.S:strcmp [/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2]
```

:::success
[ref: Guide to callgrind](https://web.stanford.edu/class/archive/cs/cs107/cs107.1174/guide_callgrind.html)
:::

`rv32i_op_add` appears in three rows of the libriscv profile. All three belong to the same function, `std::tuple<std::conditional<(4)==(4), unsigned int, unsigned long>::type> riscv::rv32i_op_add<4>(riscv::DecoderData<4>*, riscv::DecodedExecuteSegment<4>*, riscv::CPU<4>&, std::conditional<(4)==(4), unsigned int, unsigned long>::type, riscv::InstrCounter&)'2`, but are attributed to different source files:

| Ir | Bc | Bcm | Bi | Bim | Source file | Object |
| --- | --- | --- | --- | --- | --- | --- |
| 14,035,548 (10.03%) | 0 | 0 | 2,339,258 (30.58%) | 2,323,880 (79.70%) | `/home/youngli/workspace/libriscv/lib/libriscv/bytecode_impl.cpp` | |
| 7,017,774 ( 5.02%) | . | . | . | . | `/home/youngli/workspace/libriscv/lib/libriscv/threaded_bytecodes.hpp` | `/home/youngli/workspace/libriscv/emulator/.build/rvlinux` |
| 2,339,258 ( 1.67%) | . | . | . | . | `/home/youngli/workspace/libriscv/lib/libriscv/decoder_cache.hpp` | |

The split is probably a consequence of how libriscv is implemented: presumably the handler pulls in code from these headers, and callgrind attributes each instruction to the file it came from. When looking at the total Ir of `rv32i_op_add`, all three rows should be counted: 14,035,548 + 7,017,774 + 2,339,258 = 23,392,580 instructions (16.72% of total).

```
--------------------------------------------------------------------------------
Profile data file 'callgrind.out.83009' (creator: callgrind-3.22.0)
--------------------------------------------------------------------------------
I1 cache:
D1 cache:
LL cache:
Timerange: Basic block 0 - 26530996
Trigger: Program termination
Profiled target: build/rv32emu build/riscv32/puzzle (PID 83009, part 1)
Events recorded: Ir Bc Bcm Bi Bim
Events shown: Ir Bc Bcm Bi Bim
Event sort order: Ir Bc Bcm Bi Bim
Thresholds: 99 0 0 0 0
Include dirs:
User annotated:
Auto-annotation: on

--------------------------------------------------------------------------------
Ir Bc Bcm Bi Bim
--------------------------------------------------------------------------------
155,003,573 (100.0%) 15,721,179 (100.0%) 258,755 (100.0%) 8,363,467 (100.0%) 2,818,493 (100.0%) PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir Bc Bcm Bi Bim file:function
--------------------------------------------------------------------------------
30,127,643 (19.44%) 2,317,511 (14.74%) 6 ( 0.00%) 2,317,511 (27.71%) 2,302,165 (81.68%) src/rv32_template.c:do_add'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
26,843,040 (17.32%) 894,768 ( 5.69%) 18 ( 0.01%) 1,789,536 (21.40%) 167,994 ( 5.96%) src/rv32_template.c:do_lbu'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
11,477,095 ( 7.40%) 1,623,150 (10.32%) 72,663 (28.08%) 811,558 ( 9.70%) 42,544 ( 1.51%) src/rv32_template.c:do_beq'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
11,051,052 ( 7.13%) 920,921 (
5.86%) 7 ( 0.00%) 920,921 (11.01%) 158,561 ( 5.63%) src/rv32_template.c:do_addi'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
10,930,800 ( 7.05%) 1,639,620 (10.43%) 3,863 ( 1.49%) 546,540 ( 6.53%) 38,325 ( 1.36%) src/rv32_template.c:do_slli'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
10,453,966 ( 6.74%) 1,492,569 ( 9.49%) 6,006 ( 2.32%) 746,280 ( 8.92%) 4,006 ( 0.14%) src/rv32_template.c:do_bge'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
9,941,066 ( 6.41%) 1,046,428 ( 6.66%) 1 ( 0.00%) 523,214 ( 6.26%) . src/rv32_template.c:do_srai'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
9,663,707 ( 6.23%) 1,853,084 (11.79%) 42,813 (16.55%) 246 ( 0.00%) 4 ( 0.00%) ./elf/./elf/dl-lookup.c:do_lookup_x [/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2]
5,754,318 ( 3.71%) 719,294 ( 4.58%) 1 ( 0.00%) . . src/rv32_template.c:do_fuse5'2
5,754,276 ( 3.71%) 959,047 ( 6.10%) 8 ( 0.00%) 239,761 ( 2.87%) 6,006 ( 0.21%) src/emulate.c:do_fuse5'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
2,758,881 ( 1.78%) 220,212 ( 1.40%) 35,101 (13.57%) . . ./math/../sysdeps/ieee754/dbl-64/s_sin.c:__cos_fma [/usr/lib/x86_64-linux-gnu/libm.so.6]
2,685,465 ( 1.73%) . . . . src/io.c:on_mem_read_b.llvm.15422988938575356667 [/home/youngli/workspace/rv32emu/build/rv32emu]
1,978,472 ( 1.28%) 238,901 ( 1.52%) 7,901 ( 3.05%) . . ./elf/./elf/dl-lookup.c:_dl_lookup_symbol_x [/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2]
1,407,044 ( 0.91%) 116,554 ( 0.74%) 16,037 ( 6.20%) . . ./math/../sysdeps/ieee754/dbl-64/s_sin.c:__sin_fma [/usr/lib/x86_64-linux-gnu/libm.so.6]
1,319,517 ( 0.85%) 100,402 ( 0.64%) 3,233 ( 1.25%) . . ???:0x00000000000179f0 [/usr/lib/x86_64-linux-gnu/libmodplug.so.1.0.0]
1,151,057 ( 0.74%) 158,500 ( 1.01%) 17,355 ( 6.71%) 78,737 ( 0.94%) 10,213 ( 0.36%) src/rv32_template.c:do_bne'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
946,058 ( 0.61%) 109,961 ( 0.70%) 12,317 ( 4.76%) 8,695 ( 0.10%) 7,008 ( 0.25%) ./string/../sysdeps/x86_64/strcmp.S:strcmp [/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2]
895,155 ( 0.58%) . . . . src/riscv.c:on_mem_read_b.llvm.15422988938575356667
861,083 ( 0.56%) 164,784 ( 1.05%) 1,904 ( 0.74%) 8,749 ( 0.10%) 415 ( 0.01%) ./elf/../sysdeps/x86_64/dl-machine.h:_dl_relocate_object
792,860 ( 0.51%) 27,340 ( 0.17%) 2 ( 0.00%) 54,680 ( 0.65%) 7,662 ( 0.27%) src/rv32_template.c:do_sb'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
702,889 ( 0.45%) 37,994 ( 0.24%) 1 ( 0.00%) 37,994 ( 0.45%) 7 ( 0.00%) src/rv32_template.c:do_lh'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
681,300 ( 0.44%) 94,042 ( 0.60%) 2,020 ( 0.78%) 23,006 ( 0.28%) 4,011 ( 0.14%) src/emulate.c:do_fuse1'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
630,374 ( 0.41%) 84,050 ( 0.53%) 7 ( 0.00%) 42,024 ( 0.50%) 11,086 ( 0.39%) src/rv32_template.c:do_blt'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
457,738 ( 0.30%) 66,287 ( 0.42%) 6,678 ( 2.58%) 23,651 ( 0.28%) 15 ( 0.00%) src/emulate.c:do_fuse3'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
416,301 ( 0.27%) 59,444 ( 0.38%) 5,682 ( 2.20%) 21,037 ( 0.25%) 3,665 ( 0.13%) src/emulate.c:do_fuse4'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
383,813 ( 0.25%) 61,362 ( 0.39%) 479 ( 0.19%) . . ./elf/./elf/do-rel.h:_dl_relocate_object
382,450 ( 0.25%) 49,836 ( 0.32%) 1,788 ( 0.69%) . . ./elf/./elf/dl-lookup.c:check_match [/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2]
327,796 ( 0.21%) 65,558 ( 0.42%) 5 ( 0.00%) . . ./math/../sysdeps/x86/fpu/fenv_private.h:__cos_fma
293,692 ( 0.19%) 20,978 ( 0.13%) 1 ( 0.00%) 20,978 ( 0.25%) 7,646 ( 0.27%) src/emulate.c:do_fuse2'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
270,240 ( 0.17%) 27,024 ( 0.17%) 12 ( 0.00%) 27,024 ( 0.32%) 15,224 ( 0.54%) src/rv32_template.c:do_lui'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
204,876 ( 0.13%) 11,382 ( 0.07%) 0 11,382 ( 0.14%) 4,989 ( 0.18%) src/rv32_template.c:do_sw'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
201,328 ( 0.13%) 29,849 ( 0.19%) 3,142 ( 1.21%) 8,088 ( 0.10%) 3,347 ( 0.12%) src/emulate.c:rv_step [/home/youngli/workspace/rv32emu/build/rv32emu]
164,008 ( 0.11%) 25,280 ( 0.16%) 521 ( 0.20%) . . ./elf/./elf/dl-misc.c:_dl_name_match_p [/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2]
150,115 ( 0.10%) 31,931 ( 0.20%) 851 ( 0.33%) . . ./elf/./elf/dl-load.c:_dl_map_object [/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2]
149,717 ( 0.10%) 15,340 ( 0.10%) 432 ( 0.17%) . . ???:0x0000000000017670 [/usr/lib/x86_64-linux-gnu/libmodplug.so.1.0.0]
147,510 ( 0.10%) 32,780 ( 0.21%) 3 ( 0.00%) . . ./math/../sysdeps/x86/fpu/fenv_private.h:__sin_fma
138,232 ( 0.09%) 7,472 ( 0.05%) 10 ( 0.00%) 7,472 ( 0.09%) 3,721 ( 0.13%) src/rv32_template.c:do_lw'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
137,825 ( 0.09%) 19,454 ( 0.12%) 37 ( 0.01%) 9,727 ( 0.12%) 8,067 ( 0.29%) src/rv32_template.c:do_jal'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
116,493 ( 0.08%) 28,981 ( 0.18%) 1,184 ( 0.46%) . . ./elf/./elf/dl-version.c:_dl_check_map_versions [/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2]
98,358 ( 0.06%) 11,754 ( 0.07%) 62 ( 0.02%) 5,717 ( 0.07%) 3,695 ( 0.13%) src/rv32_template.c:do_jalr'2 [/home/youngli/workspace/rv32emu/build/rv32emu]
83,287 ( 0.05%) 78,680 ( 0.50%) 98 ( 0.04%) . . ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc.so.6]
83,133 ( 0.05%) . . . .
src/io.c:on_mem_write_b.llvm.15422988938575356667 [/home/youngli/workspace/rv32emu/build/rv32emu]
78,963 ( 0.05%) . . . . src/io.c:on_mem_write_w.llvm.15422988938575356667 [/home/youngli/workspace/rv32emu/build/rv32emu]
70,468 ( 0.05%) 9,550 ( 0.06%) 299 ( 0.12%) . . ./malloc/./malloc/malloc.c:_int_malloc [/usr/lib/x86_64-linux-gnu/libc.so.6]
67,362 ( 0.04%) . . . . src/io.c:on_mem_read_w.llvm.15422988938575356667 [/home/youngli/workspace/rv32emu/build/rv32emu]
65,560 ( 0.04%) 0 0 32,780 ( 0.39%) 1 ( 0.00%) ???:0x00000000052c0830 [???]
61,338 ( 0.04%) 6,652 ( 0.04%) 2 ( 0.00%) . . ./elf/../sysdeps/generic/dl-protected.h:do_lookup_x
60,817 ( 0.04%) 16,453 ( 0.10%) 819 ( 0.32%) . . ./elf/./elf/dl-cache.c:_dl_cache_libcmp [/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2]
60,475 ( 0.04%) 14,438 ( 0.09%) 965 ( 0.37%) . . ./elf/./elf/dl-load.c:_dl_map_object_from_fd [/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2]
```

rv32emu's `do_add`: 30,127,643 Ir (19.44% of total)

### wrap up

For the puzzle benchmark, `do_add` and `rv32i_op_add` were both called 2,339,259 times.
It may be worth analyzing why rv32emu's `do_add` uses 28.79% more Ir (30,127,643 / 23,392,580 ≈ 1.2879).

---

1. The Bi and Bim counts of the two implementations are quite close (2,339,258 / 2,323,880 for `rv32i_op_add` versus 2,317,511 / 2,302,165 for `do_add`), which is reasonable. With tail-call dispatch, every handler ends in a single jump whose destination is not known in advance, so the indirect-branch misprediction count (Bim) is naturally high.
2. libriscv's decoder cache decodes the whole execute segment up front and stores the result. Compared with rv32emu, an ordinary instruction such as `add` can therefore skip the following operations:

```
PC += __rv_insn_##inst##_len;
if (unlikely(RVOP_NO_NEXT(ir))) { \
    goto end_op; \
}
```

rv32emu's basic block is in fact somewhat similar to the decoder cache, so adopting libriscv's approach is worth considering.

---

### wasm3 M3(Massey Meta Machine)

Several interpreter implementations (libriscv, protobuf, ...) refer to the wasm3 interpreter, so it is probably worth studying as well.

> ref: [M3(Massey Meta Machine)](https://github.com/wasm3/wasm3/blob/main/docs/Interpreter.md#m3-massey-meta-machine)

---

### GNU jitter

Analyze the GNU Jitter implementation to see whether anything in it is worth adopting.

> ref: [GNU Jitter and the illusion of simplicity](https://ageinghacker.net/talks/jitter-slides--saiu--bts-2022--2022-03-06.pdf)
> ref: [My virtual machine is faster than yours](https://ageinghacker.net/talks/jitter-slides--saiu--ghm2017--2017-08-25.pdf)

---

### YETI: a graduallY Extensible Trace Interpreter

> ref: [YETI: a graduallY Extensible Trace Interpreter](https://www.cs.toronto.edu/~matz/dissertation/matzDissertation-latex2html/matzDissertation-latex2html.html)

---

:::warning
libriscv's embedded libtcc is nothing special in itself. What matters is whether its handling of specific instructions (such as indirect jumps) has anything worth adopting, and libriscv's performance should be compared against rv32emu's interpreter mode.

For tiered JIT compilation, a sufficiently fast interpreter can locate hotspots early, which lowers the cost of JIT compilation.
:::

So far, libriscv does not appear to include as many instruction-count-reducing designs as rv32emu.
It focuses instead on the latency of vmcall and relies on full binary translation to improve efficiency.

Example use case: the server prebuilds the binary and sends it to the client, which runs it as an embedded binary (shared library).
In a game, the same script may be executed millions of times, so low latency is essential.

---

Most of the earlier posts in the development log are about letting scripts communicate with the game engine and about introducing libriscv.

> [Using C++ as a scripting language, part 8](https://fwsgonzo.medium.com/using-c-as-a-scripting-language-part-8-d366fd98676)

This post lets scripts executed by libriscv call game-engine functions.
When a string has to be passed to the game engine, it uses `string_view` instead, which is the "zero-copy lookups" the author mentions.
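
The zero-copy idea can be sketched as follows. This is only an illustration under stated assumptions, not libriscv's actual API: `GuestArena`, `HandlerTable`, and `lookup_handler` are hypothetical names. The point is that the host builds a `std::string_view` directly over guest memory and uses it as the lookup key, so no temporary `std::string` is allocated per call.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a flat guest memory arena; NOT the libriscv API.
struct GuestArena {
    std::vector<char> bytes;

    // Return a view over guest bytes [addr, addr + len) without copying them.
    // The check is written so that addr + len cannot overflow.
    std::string_view view(std::uint32_t addr, std::uint32_t len) const {
        if (addr > bytes.size() || len > bytes.size() - addr)
            throw std::out_of_range("guest string outside arena");
        return std::string_view(bytes.data() + addr, len);
    }
};

// std::less<> is a transparent comparator, so find() accepts a string_view
// directly and no temporary std::string is built for the lookup.
using HandlerTable = std::map<std::string, int, std::less<>>;

// Hypothetical host-side lookup: returns the handler id registered under the
// name stored in guest memory, or -1 if there is none.
int lookup_handler(const HandlerTable& table, const GuestArena& arena,
                   std::uint32_t addr, std::uint32_t len) {
    const auto it = table.find(arena.view(addr, len));
    return it == table.end() ? -1 : it->second;
}
```

The returned view is only valid while the arena's storage is neither reallocated nor overwritten, so in this sketch the host should finish the lookup before resuming the guest.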