# 2020q3 Project: RISC-V Simulator w/ JIT
###### tags: `sysprog2020` `ria-jit`

> [ria-jit Github](https://github.com/ria-jit/ria-jit)
> [My Github fork](https://github.com/WeiCheng14159/ria-jit/tree/master)
> [ria-jit Paper](https://github.com/ria-jit/ria-jit/blob/master/documentation/paper/paper.pdf)

#### Goal
- [x] Read the ria-jit paper and summarize its key points
- [x] Integrate the RISC-V compliance tests and fix potential incompatibilities
    - [x] Found and fixed a bug in the `DIVW` instruction
        - [x] Merged [here](https://github.com/ria-jit/ria-jit/commit/da807dfcecbf231a0b0931c412a1edff7cceb835)
- [x] Examine the optimizations done by ria-jit and compare them against the latest QEMU
    - [x] Problem resolved after changing the compiler flags

#### Progress
- [x] Run riscv64 Linux on QEMU
    - [x] Install qemu
    - [x] Install riscv64 gnu toolchain
    - [x] Install busybox
- [x] RISC-V `rv64im` compliance tests on `riscvOVPsim`
    - [x] Generate 64-bit RISC-V compliance test binaries
    - [x] Test cases running on `riscvOVPsim` (ALL PASS)
- [x] RISC-V `rv64im` compliance tests on `ria-jit`
    - [x] Generate 64-bit RISC-V compliance test binaries
    - [x] Test cases running on `ria-jit` (ALL PASS)
        - [x] `DIVW.S` failed due to a divide-by-zero problem, fixed now
- [x] RISC-V `rv64im` compliance tests on `QEMU`
    - [x] Generate 64-bit RISC-V compliance test binaries
    - [x] Test cases running on `QEMU` (ALL PASS)
- [x] Running RISC-V CoreMark on `QEMU` (Done)
- [x] Running RISC-V CoreMark on `ria-jit` (Done)

# ria-jit paper study
==[link](https://hackmd.io/@WeiCheng14159/rkCixiYnv)==

# RISC-V compliance test (v1.0) on ria-jit
## Introduction
I learned how to use the RISC-V GNU toolchain, run compliance tests, and verify the results in the computer architecture course at NCKU this semester. By comparing the "signature" (a memory dump) produced by the simulator against a reference, one can verify the correctness of a simulator or a hardware implementation. In that course, I used the `riscvOVPsim` simulator to run the compliance tests. However, the following problems must be solved in order to replace the default `riscvOVPsim` simulator with our target simulator `ria-jit`:
- **Lack of RV64 compliance tests**: Most of the existing compliance tests in the `riscv-compliance` repository target the RISC-V 32-bit architecture. There are only a few (<20) compliance tests designed for the RV64 architecture, and they live in a `wip` (work in progress) folder in the repo, indicating that some work is needed to actually run them. Since `ria-jit` is designed for the 64-bit architecture, the 32-bit compliance tests aren't helpful.
- **Adding ria-jit support to the riscv-compliance repository**: There are multiple RISC-V simulators / hardware implementations (called `riscv-target`s) in the repo. Famous ones like `riscvOVPsim` and `spike` are fully supported. Adding `ria-jit` as one of the target platforms requires some effort, including:
    - Understanding the Makefile compilation logic
    - Coming up with the two files `compliance_io.h` and `compliance_test.h` that define the interaction between the compliance tests and the underlying simulator.
- **Verification of a successful test**: Unlike the `riscvOVPsim` simulator, which is capable of dumping a region of memory, `ria-jit` doesn't implement this feature. This means that, with `ria-jit`, we cannot compare the execution result against the `reference_output` signature files to verify the run. Another mechanism must be implemented to determine the "PASS" or "FAIL" of a given compliance test. Notice that another RISC-V emulator, `NanoRVI`, deals with this problem by returning the number of the failed compliance test to indicate where the simulator failed. However, as pointed out by the author of `NanoRVI` in the README file, that implementation is a "quick and dirty" way to run compliance tests on its simulator and cannot be merged into the official `riscv-compliance` repo. Even worse, we cannot get the return status of a program run under the `ria-jit` simulator (this could be implemented in the future), so that approach doesn't work either.
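As an aside, here is a minimal sketch of how an external harness could classify a run under the scheme I eventually adopted (described below): a clean exit means PASS and a `SIGSEGV` means FAIL. This is my own illustration, not part of `riscv-compliance`, and the test binary path is a placeholder.
```c=1
/* Hypothetical harness sketch: run the simulator on one test binary and
 * classify the result. PASS = the process exits normally; FAIL = it is
 * killed by SIGSEGV (the failure mechanism chosen later in this post). */
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    char *argv[] = {"./ria-jit", "-f", "./DIVW.elf", NULL}; /* placeholder paths */
    pid_t pid = fork();
    if (pid == 0) {
        execv(argv[0], argv);
        _exit(127); /* exec failed */
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV)
        puts("FAIL");
    else
        puts("PASS");
    return 0;
}
```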
## Existing RV64 support in the riscv-compliance repo
- There are 9 tests for the `rv64i` architecture targeting the following instructions (the `W` suffix of each instruction indicates an RV64-only instruction that operates on the lower 32 bits):
    - ADDIW
    - ADDW
    - SLLIW
    - SLLW
    - SRAIW
    - SRAW
    - SRLIW
    - SRLW
    - SUBW
- There are 4 tests for the `rv64im` architecture targeting the following instructions (`m` stands for the integer multiplication and division extension):
    - DIVW
    - MULW
    - REMUW
    - REMW

## Add ria-jit as a riscv-target device
- RISC-V compliance tests make use of various C macros to simplify the process of writing assembly-level compliance tests. The compliance test `MULW.S` is shown as an example here:
```c=1
#include "test_macros.h"
#include "compliance_test.h"
#include "compliance_io.h"

RV_COMPLIANCE_RV32M

RV_COMPLIANCE_CODE_BEGIN
    RVTEST_IO_INIT
    RVTEST_IO_ASSERT_GPR_EQ(x31, x0, 0x00000000)
    RVTEST_IO_WRITE_STR(x31, "Test Begin Reserved regs ra(x1) a0(x10) t0(x5)\n")

    # ---------------------------------------------------------------------------------------------
    RVTEST_IO_WRITE_STR(x31, "# Test number 1 - corner cases\n")

    # address for test results
    la x2, test_1_res

    TEST_RR_SRC2(mulw, x3, x4, 0, 0x0, 0x0, x2, 0)
    TEST_RR_SRC2(mulw, x8, x9, 0, 0x0, 0x1, x2, 8)
    TEST_RR_SRC2(mulw, x11, x12, 0, 0x0, -0x1, x2, 16)
    TEST_RR_SRC2(mulw, x13, x14, 0, 0x0, 0x7fffffffffffffff, x2, 24)
    TEST_RR_SRC2(mulw, x15, x16, 0, 0x0, 0x8000000000000000, x2, 32)

    ..... skip .....

    RV_COMPLIANCE_HALT

RV_COMPLIANCE_CODE_END
```
- In the very beginning, I thought that compiling the compliance tests and running them with the `ria-jit` simulator should be relatively easy, since `ria-jit` has been tested intensively. However, I ran into endless `segmentation fault` & `system call xx not implemented` errors from the `ria-jit` simulator. After days of investigation, I pinned the root cause down to the segment of RISC-V code shown below:
- The `RV_COMPLIANCE_CODE_BEGIN` macro in the code example above expands into the following:
```c=1
#define RVTEST_CODE_BEGIN                                   \
    .section .text.init;                                    \
    .align  6;                                              \
    .weak stvec_handler;                                    \
    .weak mtvec_handler;                                    \
    .globl _start;                                          \
_start:                                                     \
    /* reset vector */                                      \
    j reset_vector;                                         \
    .align 2;                                               \
trap_vector:                                                \
    /* test whether the test came from pass/fail */         \
    csrr t5, mcause;                                        \
    li t6, CAUSE_USER_ECALL;                                \
    beq t5, t6, write_tohost;                               \
    li t6, CAUSE_SUPERVISOR_ECALL;                          \
    beq t5, t6, write_tohost;                               \
    li t6, CAUSE_MACHINE_ECALL;                             \
    beq t5, t6, write_tohost;                               \
    ..... skipped .....
```
- Understanding this code segment is itself quite challenging. As far as I can tell, it is an exception handler written in assembly: the code executes a `CSR` instruction to read the processor state and handles the exception (which indicates that a certain test case failed) by jumping to the corresponding trap vector. Unfortunately, the instruction `csrr t5, mcause;` isn't supported by `ria-jit` and is the root cause of the errors. So this sophisticated, assembly-coded exception handling doesn't work on `ria-jit`. (It doesn't work in `QEMU` either.)
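- For reference, the `CAUSE_*` constants the handler compares against `mcause` are the standard exception codes from the RISC-V privileged spec, shown here as C definitions for clarity:
```c=1
/* Standard RISC-V exception codes checked by the trap handler above. */
#define CAUSE_USER_ECALL        0x8     /* environment call from U-mode */
#define CAUSE_SUPERVISOR_ECALL  0x9     /* environment call from S-mode */
#define CAUSE_MACHINE_ECALL     0xb     /* environment call from M-mode */
```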
- I found a presentation from `SiFive` that introduces RISC-V interrupts [here, page 32](https://cdn2.hubspot.net/hubfs/3020607/An%20Introduction%20to%20the%20RISC-V%20Architecture.pdf). I am reading it right now and hope to fix this bug in the `ria-jit` simulator.
- After being stuck on this bug for weeks, I figured out how to bypass the problem temporarily by going through the code of another RISC-V simulator, `NanoRVI`. Unlike the original implementation, which performs complicated assembly-coded exception handling, `NanoRVI` removes the exception handling scheme entirely, as shown below:
- Notice the `_start` symbol is where execution starts:
```c=1
#define RVTEST_CODE_BEGIN       \
    .section .text.init;        \
    .align  6;                  \
    .weak stvec_handler;        \
    .weak mtvec_handler;        \
    .globl _start;              \
_start:                         \
```
- Eventually, I implemented the missing functions in the two files `compliance_test.h` and `compliance_io.h`, plus other environment setup that is too tedious to cover here. My work incorporating `ria-jit` into the `riscv-compliance` repo as one of the target devices can be found in this [commit](https://github.com/WeiCheng14159/riscv-compliance/commit/0292e7c523daf0030bdb3c4279394d91da154042)

## Verification of a successful test
- Replacing the content of the `RVTEST_CODE_BEGIN` macro with the `NanoRVI` version isn't enough. I had to come up with a different mechanism to determine whether a test is a "PASS" or a "FAIL"
- `ria-jit` isn't capable of dumping memory like the `riscvOVPsim` simulator, so I needed some other way for `ria-jit` to "communicate" with the outside environment and indicate whether a test is a "FAIL" or a "PASS"
- Instead of dumping memory, `NanoRVI` takes advantage of the return code, returning 0 for a "PASS" and the failed test number for a "FAIL"
- Notice that implementing a "memory dump" feature for `ria-jit` is definitely possible, but for now the mechanism I came up with is to **return silently for a PASS** and **write to protected memory at 0x0, triggering a segmentation fault, for a FAIL**. This mechanism is crude and could certainly be handled more elegantly, but everything has to be done at the assembly level. I am currently reading about the RISC-V interrupt handling mechanism and hope to come up with something better.
- Actual implementation
    - `RVTEST_IO_ASSERT_GPR_EQ(_SP, _R, _I)` is a macro that asserts the value of a GPR. If the value of the register is as expected, the test continues; if not, the test stops execution and reports the error.
        - Notice that `_SP` is a scratch register; usually `t6` (aka `x31`) is used
        - `_R` is the register that stores the result
        - `_I` is the expected value of the computation result
    - This macro is implemented as follows:
```c=1
#define RVTEST_IO_ASSERT_GPR_EQ(_SP, _R, _I)    \
    li _SP, _I;                                 \
    beq _SP, _R, 20002f;                        \
    RVTEST_FAIL;                                \
20002:                                          \
```
- `RVTEST_PASS` and `RVTEST_FAIL` are macros that determine the behavior on a PASS or a FAIL
    - Notice that register `a0` is assigned `0` for a PASS and `1` for a FAIL
    - Register `a7` is assigned the value `93`, the syscall number of `exit`
```c=1
#undef RVTEST_PASS
#define RVTEST_PASS     \
    li a7, 93;          \
    li a0, 0;           \
    j end_testcode;     \

#undef RVTEST_FAIL
#define RVTEST_FAIL     \
    li a7, 93;          \
    li a0, 1;           \
    j end_testcode;     \
```
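- Putting these macros together: an assertion such as `RVTEST_IO_ASSERT_GPR_EQ(x31, x3, -1)` expands (by hand, from the definitions above) into roughly the following, which makes the PASS/FAIL control flow easy to see:
```c=1
    li   x31, -1            # load the expected value into the scratch register
    beq  x31, x3, 20002f    # values match: skip the failure path
    li   a7, 93             # RVTEST_FAIL: a7 = exit syscall number
    li   a0, 1              # a0 = 1 marks a FAIL
    j    end_testcode       # RV_COMPLIANCE_CODE_END decides: ecall or segfault
20002:                      # execution continues here on a match
```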
- `RV_COMPLIANCE_CODE_END` is appended at the bottom of each compliance test.
    - If the value of `a0` isn't zero, the compliance test is a FAIL. The program jumps to label `1234` (via the forward reference `1234f`) and executes `sw x0, 0(x0)`, which triggers a segmentation fault.
```c=1
#define RV_COMPLIANCE_CODE_END  \
end_testcode:                   \
    bne a0, x0, 1234f;          \
    ecall;                      \
1234:                           \
    sw x0, 0(x0);               \
```

## System calls implemented on ria-jit (FYI)
:::spoiler
The following Linux syscalls are implemented on ria-jit: getcwd (17), fcntl (25), ioctl (29), unlinkat (35), ftruncate (46), faccessat (48), chdir (49), fchmod (52), fchown (55), pipe2 (59), openat (56), close (57), getdents64 (61), lseek (62), read (63), write (64), writev (66), readlinkat (78), fstatat (79), fstat (80), utimensat (88), exit (93), exit_group (94), set_tid_address (96), futex (98), set_robust_list (99), clock_gettime (113), tgkill (131), rt_sigaction (134), rt_sigprocmask (135), uname (160), gettimeofday (169), getpid (172), getuid (174), geteuid (175), getgid (176), getegid (177), gettid (178), sysinfo (179), brk (214), munmap (215), execve (221), mmap (222), wait4 (260), prlimit64 (261), renameat2 (276), getrandom (278)
:::

## Bug in the RV64IM riscv-compliance repo
I found a bug involving 64-bit compatibility in the `riscv-compliance` repo; the fix can be found [here](https://github.com/WeiCheng14159/riscv-compliance/commit/12073dc20f432419987cc24bdc4218549997251a)

:::warning
Would you send a pull request back to upstream? `SX` might not be a good name since it causes confusion.
:notes: jserv
:::

:::info
Indeed, `SX` isn't satisfactory! I need to find a better way to decouple the compliance tests from the target platform. The solution I provided above originated from how the `riscvOVPsim` simulator deals with the 32/64-bit compatibility issue; their implementation can be found [here](https://github.com/riscv/riscv-compliance/blob/207bc4e3ac3af94fd6759aa9bed7a241692c101c/riscv-target/riscvOVPsim/compliance_io.h#L56)
:::

:::info
Update: A recent update to the `riscv-compliance` repo fundamentally changes the project organization and standardizes the simulator interface for compliance tests. My work cannot be merged directly into the repo; I am porting it to riscv-compliance 2.0 (WIP).
:::

## riscv-compliance tests (v1.0) results
- The following `rv64im` tests PASS on the `ria-jit` simulator
    - :::spoiler Passed 12/13
        - ADDIW
        - ADDW
        - SLLIW
        - SLLW
        - SRAIW
        - SRAW
        - SRLIW
        - SRLW
        - SUBW
        - MULW
        - REMUW
        - REMW
    :::
- The following `rv64im` test FAILS on the `ria-jit` simulator
    - DIVW
- All tests pass on the QEMU simulator

## Fix: DIVW compliance test fails on `ria-jit`
- Methodology: Run `ria-jit` in single-stepping mode with all JIT optimizations disabled
- Problem identification:
    - The DIVW instruction, as described in the spec:
        - > DIVW and DIVUW instructions are only valid for RV64, and divide the lower 32 bits of `rs1` by the lower 32 bits of `rs2`, treating them as signed and unsigned integers respectively, placing the 32-bit quotient in `rd`, sign-extended to 64 bits.
    - The DIVW compliance test fails on this test case:
        - `TEST_RR_SRC2(divw, x3, x4, 0xffffffffffffffff, 0x0, 0x0, x2, 0)` [link](https://github.com/riscv/riscv-compliance/blob/2636302c27557b42d99bed7e0537beffdf8e1ab4/riscv-test-suite/wip/rv64im/src/DIVW.S#L50)
        - This test case loads `0x0` into `x4` and `0x0` into `x3` (see the generated assembly below)
        - It performs `divw x3, x4, x3` and writes the result to the address stored in `x2` (the stack pointer) with offset `0`
        - The expected result is `0xffffffffffffffff`, as specified in the RISC-V spec
    - The following figure is captured from the RISC-V spec v2.2, Vol. 1, Section 6.2 "Division Operations"
        - ![](https://i.imgur.com/1SNnkuw.png)
    - The `DIV` instruction is expected to output `-1 (0xffffffffffffffff)` for the divide-by-zero case. The same applies to the `DIVW` instruction; a minimal reference model is sketched below.
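- To make the spec behavior concrete, here is a minimal C reference model of `DIVW` (my own sketch based on the table above, not code from `ria-jit`):
```c=1
#include <limits.h>
#include <stdint.h>

/* Reference semantics of DIVW according to the RISC-V spec. */
int64_t divw_ref(int64_t rs1, int64_t rs2) {
    int32_t dividend = (int32_t) rs1;       /* lower 32 bits of rs1 */
    int32_t divisor  = (int32_t) rs2;       /* lower 32 bits of rs2 */
    if (divisor == 0)
        return -1;                          /* divide by zero: all bits set */
    if (dividend == INT32_MIN && divisor == -1)
        return (int64_t) INT32_MIN;         /* signed overflow: quotient = dividend */
    return (int64_t) (dividend / divisor);  /* 32-bit quotient, sign-extended */
}
```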
- Generated RISC-V assembly
    - :::spoiler assembly code
```c=1
...
100dc:    00000213    li    tp,0
100e0:    00000193    li    gp,0
100e4:    023241bb    divw  gp,tp,gp
100e8:    00313023    sd    gp,0(sp)
100ec:    fff00f93    li    t6,-1
100f0:    003f8863    beq   t6,gp,10100 <_start+0x40>
100f4:    05d00893    li    a7,93
100f8:    00100513    li    a0,1
100fc:    4080006f    j     10504 <end_testcode>
...
```
    :::
- Generated x86_64 assembly for `divw gp,tp,gp`
    - To be more specific, the `DIVW` RISC-V instruction is mapped to the `IDIV` x86_64 instruction
    - :::spoiler assembly code (~30 lines)
```c=1
...
[jit-gen] [NOP]
[jit-gen] [MOV reg8:r1 mem8:r16+0x7f265e90]
[jit-gen] [MOV reg8:r0 mem8:r16+0x7f265e91]
[jit-gen] [MOV mem8:r16+0x7f265e8a reg8:r0]
[jit-gen] [TEST reg4:r1 reg4:r1]
[jit-gen] [JZ imm8:0x77ff810003e2]
[jit-gen] [C_SEP_4]
[jit-gen] [IDIV reg4:r1]
[jit-gen] [MOVSX reg8:r1 reg4:r0]
[jit-gen] [JMP imm8:0x77ff810003e9]
[jit-gen] [MOV reg8:r0 imm8:0xffffffffffffffff]
[jit-gen] [NOP]
[jit-gen] [NOP]
[jit-gen] [NOP]
[jit-gen] [MOV mem8:r16+0x7f265e65 reg8:r1]
[jit-gen] [LEA reg8:r0 mem0:r16+-0x7]
[jit-gen] [MOV mem8:r16+0x7f265c27 reg8:r0]
[jit-gen] [MOV mem4:r16+0x7f265c25 imm4:0x245]
[jit-gen] [MOV reg8:r0 imm8:0x100e8]
[jit-gen] [MOV mem8:r16+0x7f265f27 reg8:r0]
[jit-gen] [NOP]
[jit-gen] [NOP]
[jit-gen] [NOP]
[jit-gen] [RET_8]
[jit-gen] [NOP]
[jit-gen] [NOP]
[jit-gen] [NOP]
...
```
    :::
- Root cause: the destination register `x3` is expected to store the value of `0 / 0`. According to the RISC-V spec, `x3` should hold `0xffffffffffffffff`; however, it ends up holding `0x0`. This is the root cause of the failure.
- **Why do I think this is a bug?** The divide-by-zero case is actually covered in the test program provided by `ria-jit`, as shown in this [line](https://github.com/WeiCheng14159/ria-jit/blob/9aada51e0b47aa0fa4c93bd0b64183643d5c6777/test/test_programs/arithmetic_test/compiled_test_arithm.c#L76)
```c=1
...
{ //div-zero quotient should have all bits set
    init(++num, "DivZero");
    size_t n = 256;
    size_t m = 0;
    assert_equals(0xFFFFFFFFFFFFFFFF, (n / m), &failed_tests);
}
...
```
- I checked the compiled assembly of this test case and found that `(n / m)` is compiled into an unsigned divide (the `DIVUW` RISC-V instruction) since `m` and `n` are of the unsigned type `size_t`
- Changing `m` and `n` from `size_t` to the signed type `int`, as shown below, results in a failure because the `DIVW` RISC-V instruction is then used. I figured this out by disassembling the compiled binary.
```c=1
...
{ //div-zero quotient should have all bits set
    init(++num, "DivZero");
    int n = 256;
    int m = 0;
    assert_equals(0xFFFFFFFFFFFFFFFF, (n / m), &failed_tests);
}
...
```
- For your reference, changing `m` and `n` from `size_t` to `int64_t` also gives the correct result because the `DIV` RISC-V instruction is used. Notice that `ria-jit` passes the `DIV` and `DIVUW` riscv-compliance test cases, so unsigned division behaves as expected. A quick way to check which divide instruction the compiler selects is sketched below.
- `ria-jit` provides test cases that include divide by zero; **however, the divide-by-zero case of the `DIVW` instruction isn't covered by those test cases. The bug is now exposed by the `riscv-compliance` tests, and it is crucial because 32-bit signed integer (`int`) division is widely used.**
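- As mentioned above, one way to check the instruction selection is to compile a file like the following with the RV64 toolchain and disassemble it. The mappings in the comments are what I would typically expect from GCC under `lp64d`:
```c=1
#include <stdint.h>

int32_t  div_s32(int32_t a, int32_t b)   { return a / b; }  /* -> divw  */
uint32_t div_u32(uint32_t a, uint32_t b) { return a / b; }  /* -> divuw */
int64_t  div_s64(int64_t a, int64_t b)   { return a / b; }  /* -> div   */
uint64_t div_u64(uint64_t a, uint64_t b) { return a / b; }  /* -> divu  */
```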
- **Fix**: The function `void translate_DIVW(...)`, shown [here](https://github.com/WeiCheng14159/ria-jit/blob/9aada51e0b47aa0fa4c93bd0b64183643d5c6777/src/gen/instr/ext/translate_m_ext.c#L408), is responsible for translating the `DIVW` instruction from RISC-V to x86_64
- As mentioned earlier, the `IDIV` instruction is used to emulate `DIVW` on the x86 platform. `IDIV` expects `EDX:EAX` as its implicit input and stores the quotient in `EAX` and the remainder in `EDX`. More info can be found [here](https://mudongliang.github.io/x86/html/file_module_x86_id_137.html)
- However, the `EAX` register is NOT always mapped to the RISC-V `rd` register. Due to the register mapping feature in `ria-jit`, I found that the `rd` register was mapped to the `r11` register instead of `EAX`.
- If not mapped correctly, the division result stays in `EAX` while the translated binary expects the result in the `r11` register.
- To propagate the division result correctly, I added 3 lines of code that move `EAX` into `rd` when the two aren't already mapped to each other:
```c=1
...
if (regDest != FE_AX) {
    // rd is mapped to another register, so move the result from EAX there
    err |= fe_enc64(&current, FE_MOVSXr64r32, regDest, FE_AX);
}
...
```
- **[Update]** The author clarified that this isn't the root cause. Our discussion can be found [here](https://github.com/ria-jit/ria-jit/pull/3)
- As the author pointed out, the actual problem is that, for a zero divisor, the `-1` result was written to `EAX` rather than to the register mapped to `rd`:
```c=1
...
// fe_enc64(&current, FE_MOV64ri, FE_AX, -1);
fe_enc64(&current, FE_MOV64ri, regDest, -1);
...
```
- **Result**
    - Passes ALL `rv64im` `riscv-compliance` tests and ALL test cases that come with `ria-jit`
    - GitHub pull request created [here](https://github.com/ria-jit/ria-jit/pull/3)

# Compare perf. of ria-jit and QEMU
- The performance of `ria-jit` is measured with the CoreMark benchmark and compared against the `QEMU` simulator. This section demonstrates the performance optimizations done by `ria-jit`

## What is CoreMark?
> EEMBC’s CoreMark® is a benchmark that measures the performance of microcontrollers (MCUs) and central processing units (CPUs) used in embedded systems. Replacing the antiquated Dhrystone benchmark, CoreMark contains implementations of the following algorithms: **list processing** (find and sort), **matrix manipulation** (common matrix operations), **state machine** (determine if an input stream contains valid numbers), and **CRC** (cyclic redundancy check). It is designed to run on devices from 8-bit microcontrollers to 64-bit microprocessors.

> It has been designed around **embedded applications** and therefore demonstrates **highly favorable numbers for relatively simple designs** (e.g., dual-issue in-order) while having **weaker performance scaling in complex designs** (e.g., out-of-order superscalar).

## Experiment setup
- CoreMark port to RISC-V
    - The CoreMark binary is ported to RISC-V using the CoreMark EEMBC wrapper [here](https://github.com/riscv-boom/riscv-coremark)
    - The CoreMark benchmark binary is compiled with the following compiler flags: `-march=rv64imafd -mabi=lp64d -O3 -static`. This is the setup suggested by `ria-jit`
    - There are two targets in the CoreMark EEMBC wrapper: `normal` and `bare_metal`. The `bare_metal` target is compiled with the `-nostdlib -nostartfiles` flags, which causes a `segmentation fault` when executed on `ria-jit` since some `CSR` instructions aren't supported. Thus, the CoreMark binary is compiled in `normal` mode.
- Software
    - Compiler: Latest RISC-V GNU toolchain `GCC 10.2.0` compiled from source
    - `ria-jit`: [Tag v1.3.1](https://github.com/ria-jit/ria-jit/releases/tag/v1.3.1)
    - `QEMU`: [Tag 5.1.0](https://github.com/qemu/qemu/releases/tag/v5.1.0) with a Linux v5.4 kernel image
- Hardware
    - Ubuntu 18.04
    - Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz

## Results
There are four optimization techniques implemented in `ria-jit`: return address stack **(RAS)**, block chaining **(BC)**, recursive translation of jump targets, and macro fusion/conversion **(Fusion)**. First, I run the CoreMark benchmark with ONE optimization technique DISABLED at a time. Then, I run it with ONE optimization technique ENABLED at a time to show the performance difference.

### Baseline (QEMU)
- `QEMU` with 2048M memory
- :::spoiler score: 6184.917101
```c=1
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 17972919
Total time (secs): 17.972919
Iterations/Sec   : 6184.917101
Iterations       : 110000
Compiler version : GCC10.2.0
Compiler flags   : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location  : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x33ff
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 6184.917101 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::

### Baseline (ria-jit)
- `ria-jit` with ALL optimizations enabled
- :::spoiler score: 17684.590726
```c=1
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 16963921
Total time (secs): 16.963921
Iterations/Sec   : 17684.590726
Iterations       : 300000
Compiler version : GCC10.2.0
Compiler flags   : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location  : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0xcc42
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 17684.590726 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::

### Disable ONE opt. technique
- Disable **return address stack** by setting `--optimize=no-ras`
- :::spoiler score: 11277.077744
```c=1
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 17735091
Total time (secs): 17.735091
Iterations/Sec   : 11277.077744
Iterations       : 200000
Compiler version : GCC10.2.0
Compiler flags   : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location  : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x4983
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 11277.077744 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::
- Disable **block chaining** by setting `--optimize=no-chain`
- :::spoiler score: 661.703370
```c=1
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 16623763
Total time (secs): 16.623763
Iterations/Sec   : 661.703370
Iterations       : 11000
Compiler version : GCC10.2.0
Compiler flags   : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location  : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x33ff
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 661.703370 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::
- Disable **recursive translation of jump targets** by setting `--optimize=no-jump`
- :::spoiler score: N/A (crash)
```c=1
Segmentation fault
```
:::
- Disable **macro opcode fusion/conversion** by setting `--optimize=no-fusion`
- :::spoiler score: 16393.457404
```c=1
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 12199989
Total time (secs): 12.199989
Iterations/Sec   : 16393.457404
Iterations       : 200000
Compiler version : GCC10.2.0
Compiler flags   : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location  : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x4983
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 16393.457404 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::
- Disable **all optimizations** by setting `--optimize=none`
- :::spoiler score: 612.171699
```c=1
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 17968815
Total time (secs): 17.968815
Iterations/Sec   : 612.171699
Iterations       : 11000
Compiler version : GCC10.2.0
Compiler flags   : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location  : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x33ff
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 612.171699 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::

### Enable ONE opt. technique
**The `jump target` optimization must be enabled to run the benchmark, so it stays enabled in all runs below.**

- Enable **return address stack** by setting `--optimize=no-chain,no-fusion`
- :::spoiler score: 663.241927
```c=1
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 16585200
Total time (secs): 16.585200
Iterations/Sec   : 663.241927
Iterations       : 11000
Compiler version : GCC10.2.0
Compiler flags   : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location  : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x33ff
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 663.241927 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::
- Enable **block chaining** by setting `--optimize=no-ras,no-fusion`
- :::spoiler score: 10980.685578
```c=1
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 18213799
Total time (secs): 18.213799
Iterations/Sec   : 10980.685578
Iterations       : 200000
Compiler version : GCC10.2.0
Compiler flags   : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location  : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x4983
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 10980.685578 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::
- Enable **macro opcode fusion/conversion** by setting `--optimize=no-ras,no-chain`
- :::spoiler score: 650.748298
```c=1
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 16903617
Total time (secs): 16.903617
Iterations/Sec   : 650.748298
Iterations       : 11000
Compiler version : GCC10.2.0
Compiler flags   : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location  : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x33ff
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 650.748298 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::

### Chart
#### ria-jit opt. effect
- Enable 1 ria-jit optimization technique at a time
    - ![](https://i.imgur.com/rhsoasA.png)
- Disable 1 ria-jit optimization technique at a time
    - ![](https://i.imgur.com/lbmPsAI.png)

### Discussion
**Block chaining** seems to be the most crucial optimization technique boosting the performance of `ria-jit` over QEMU. This conclusion agrees with the SPEC CPU 2017 experiments conducted by the author; the results from the original paper are shown below.

#### Compare with results from the original paper
The paper gives some possible reasons why block chaining is the most important optimization; I think the same conclusion can be drawn even though we use a different benchmark (CoreMark) to test the simulator.

![](https://i.imgur.com/MPZ3H8X.png)
![](https://i.imgur.com/70KRwyX.png)

> **Macro operation fusion does not seem to provide a large performance benefit**, in most benchmarks the numbers do not even suggest any performance increase above natural deviation of benchmark runs. This means the **implemented pattern matching does not give the desired effect of a good performance increase**. Further tweaking of the checked patterns might make this optimisation more worthwhile.

:::info
Note that the macro operation fusion optimization also provides little advantage in CoreMark.
:::

> **The return address stack provided for a significant advantage in some benchmarks**. **Especially the function call heavy** 620.omnetpp, 623.xalancbmk, 631.deepsjeng, 641.leela **benchmarks** showed good performance gains of over 50 %. The 600.perlbench, as well as the 648.exchange2 and 657.xz benchmarks where most of the runtime is spent in only a couple loops naturally could not benefit a lot.

:::info
Note that the return address stack optimization provides little advantage in CoreMark.
:::

> Recursive jump translation without also utilising the return address stack only provided a performance increase over disabling both in some benchmarks. The main reason for this might be that this also makes context switches necessary on unconditional jumps that aren’t function calls or returns. **This makes jump-heavy benchmarks take a performance hit while jump-light benchmarks are almost unaffected.**

:::info
I cannot verify this claim due to the segmentation fault that occurs when the recursive jump translation optimization is disabled.
:::

> Expectedly, **the highest performance penalty was incurred by disabling chaining** as well. This **makes a context switch back to the translator necessary for every executed basic block.** The benchmarks that are less impacted by disabling block chaining are the ones where fewer basic blocks were executed relative to their runtime. This correlates with the fact that the most executed blocks of these benchmarks contain more instructions and hence execute for a longer time.

:::info
Once block chaining is disabled, `ria-jit` performs a context switch (not an OS-level context switch) back to the translator for every executed basic block. In other words, `ria-jit` gains its significant performance improvement by reducing the number of context switches between translated code and the translator. This is the core contribution of the paper. A toy model of this effect is sketched below.
:::
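To illustrate the effect with a toy model (my own sketch, unrelated to ria-jit's actual code): without chaining, every block ends with a lookup in the dispatcher; with chaining, each block is patched once to jump directly to its successor, so the dispatcher drops out of the hot path.
```c=1
#include <stddef.h>
#include <stdio.h>

struct block {
    int id;
    struct block *chain;   /* patched direct successor, NULL if unchained */
};

static struct block cache[3] = {{0, NULL}, {1, NULL}, {2, NULL}};
static size_t dispatches = 0;

/* Dispatcher lookup: this models the "context switch" back to the translator. */
static struct block *dispatch(int next_id) {
    dispatches++;
    return &cache[next_id];
}

static void run(int steps, int use_chaining) {
    struct block *b = &cache[0];
    for (int i = 0; i < steps; i++) {
        int next_id = (b->id + 1) % 3;      /* toy control flow: 0 -> 1 -> 2 -> 0 */
        if (b->chain) {                     /* chained: direct jump, no dispatch */
            b = b->chain;
        } else {
            struct block *next = dispatch(next_id);
            if (use_chaining)
                b->chain = next;            /* patch the block once */
            b = next;
        }
    }
}

int main(void) {
    run(1000000, 0);
    printf("no chaining: %zu dispatches\n", dispatches);   /* 1000000 */
    dispatches = 0;
    run(1000000, 1);
    printf("chaining:    %zu dispatches\n", dispatches);   /* 3 */
    return 0;
}
```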
#### Why is `ria-jit` better than QEMU?
From the original paper
> QEMU first translates the instructions into microinstructions in an **intermediate representation** independent of platform and ISA. While this allows easy implementation of new host and guest architectures, it also means that specific advantages of both guest and host can not be used. This leads to needing potentially more host instructions than necessary for a simple task. **QEMU also does not employ a return address stack**, an optimisation which proved to be very worthwhile. One of the **biggest disadvantages of QEMU**, though, is that **it does not use a static register mapping.** Instead it holds all registers in memory and only loads them into one of a few temporary registers when needing them as input operands. The loaded output register may then be reused as input for the next instruction. If the next instruction needs different inputs, though, the register is written back and needs to be reloaded next time it is needed. This obviously leads to a big overhead which we can avoid by statically mapping the most used registers and dynamically allocating temporaries for the rest.

# Profiling & Improvement
## Linux Perf
```c=1
Performance counter stats for 'ria-jit -f coremark.riscv' (5 runs):

                95      context-switches     ( +- 9.55% )
              1523      page-faults          ( +- 0.07% )
    37,949,575,400      branches             ( +- 0.00% )  (83.33%)
       768,217,118      branch-misses        # 2.02% of all branches        ( +- 0.07% )  (83.33%)
         1,161,346      cache-misses         # 9.973 % of all cache refs    ( +- 8.16% )  (83.33%)
        11,644,721      cache-references     ( +- 9.79% )  (66.67%)
   278,814,466,011      instructions         # 2.70 insn per cycle          ( +- 0.01% )  (83.34%)
   103,235,212,972      cycles               ( +- 0.61% )  (83.33%)

            23.675 +- 0.148 seconds time elapsed  ( +- 0.63% )
```
- Most cycle time is spent here:
    - ![](https://i.imgur.com/Ulf8zea.png)
- The corresponding critical path:
```c=1
t_cache_loc lookup_cache_entry(t_risc_addr risc_addr) {
    if (flag_do_profile) profile_cache_access();

    /* fast path: direct-mapped, TLB-like lookup */
    size_t smallHash = smallhash(risc_addr);
    if (tlb[smallHash].risc_addr == risc_addr) {
        return tlb[smallHash].cache_loc;
    }

    /* slow path: linear probing in the cache table */
    size_t index = find_lin_slot(risc_addr);
    if (cache_table[index].cache_loc != 0) {
        //value is cached and exists
        set_tlb(risc_addr, cache_table[index].cache_loc);
        return cache_table[index].cache_loc;
    } else {
        //value does not exist
        return UNSEEN_CODE;
    }
}
```
- I've pinned down the critical path but haven't come up with a good optimization strategy yet. I've tried replacing the existing TLB-like cache system with a Bloom filter (see the sketch below); still debugging.
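For reference, this is the shape of the Bloom-filter experiment (my own sketch, not `ria-jit` code): before the linear probe in the slow path, a filter over all translated addresses can answer "definitely not cached" with two bit tests, which stays correct as long as code-cache entries are never evicted. The hash constants are arbitrary.
```c=1
#include <stdbool.h>
#include <stdint.h>

#define BLOOM_BITS (1u << 16)
static uint64_t bloom[BLOOM_BITS / 64];

/* two cheap hashes derived from the guest address */
static inline uint32_t h1(uint64_t a) { return (uint32_t) (a * 0x9E3779B97F4A7C15ull) % BLOOM_BITS; }
static inline uint32_t h2(uint64_t a) { return (uint32_t) ((a >> 17) * 0xC2B2AE3D27D4EB4Full) % BLOOM_BITS; }

/* call whenever a block is translated and inserted into the code cache */
static inline void bloom_add(uint64_t risc_addr) {
    bloom[h1(risc_addr) / 64] |= 1ull << (h1(risc_addr) % 64);
    bloom[h2(risc_addr) / 64] |= 1ull << (h2(risc_addr) % 64);
}

/* false positives possible, false negatives impossible: a "false" here
 * lets lookup_cache_entry return UNSEEN_CODE without the linear probe */
static inline bool bloom_may_contain(uint64_t risc_addr) {
    return ((bloom[h1(risc_addr) / 64] >> (h1(risc_addr) % 64)) & 1) &&
           ((bloom[h2(risc_addr) / 64] >> (h2(risc_addr) % 64)) & 1);
}
```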