# Contribution to ria-jit (an open source RISC-V to x86 binary translator)

###### tags: `ria-jit` `2023_fall_phd`

> [ria-jit Github](https://github.com/ria-jit/ria-jit)
> [ria-jit My fork](https://github.com/WeiCheng14159/ria-jit/tree/master)
> [ria-jit Paper](https://github.com/ria-jit/ria-jit/blob/master/documentation/paper/paper.pdf)

[TOC]

## Short intro
**ria-jit** is an open-source just-in-time RISC-V to x86 binary translator. It was built by students at the [Technical University of Munich](https://en.wikipedia.org/wiki/Technical_University_of_Munich) and outperforms [QEMU](https://en.wikipedia.org/wiki/QEMU), the well-known free and open-source emulator. In this post, I will go through the details of ria-jit and demonstrate how I found a **divide-by-zero** bug by running the RISC-V compliance tests on it.

## Why JIT (just-in-time) translation is difficult
- Memory addressing modes differ between RISC-V and x86-64.
- The numbers of GPRs (general-purpose registers) and FPRs (floating-point registers) do not match between RISC-V and x86-64:
    - RISC-V: 32 GPRs
    - x86-64: 16 GPRs
- RISC-V is a **[load-store](https://en.wikipedia.org/wiki/Load%E2%80%93store_architecture)** architecture, while x86-64 is a **[register-memory](https://en.wikipedia.org/wiki/Register_memory_architecture)** architecture.
    - Why is this a problem for JIT translation?
    - Consider translating the RISC-V instruction `sub x3, x2, x1` to x86-64 assembly.
    - In a load-store architecture like RISC-V, an instruction commonly has two source registers and a separate destination register, as in `sub x3, x2, x1`.
    - In a register-memory architecture like x86-64, one of the source operands is also the **implicit destination** operand, i.e. `addq %rbx, %rcx` means `%rcx = %rcx + %rbx`. More instructions are therefore needed to emulate RISC-V on x86-64.
    - To emulate `sub x3, x2, x1` on x86-64, we need two instructions: `movq %rbx, %rcx` followed by `subq %rax, %rcx` (assuming `%rax` is mapped to `x1`, `%rbx` to `x2`, and `%rcx` to `x3`).

::: info
Efficient fusion of multiple RISC-V instructions into fewer x86-64 instructions is a very challenging problem.
:::

## ria-jit guest/host memory layout
- The Dynamic Binary Translator (DBT) is responsible for **managing the execution environment** of the guest binary (here, RISC-V) in the address space it shares with the x86 host.
- The header of the **ELF file** (Executable and Linkable Format) specifies which section(s) of the program need to be loaded, and where in memory they must reside.
- The memory is laid out as follows:
![](https://i.imgur.com/JPGbxu1.png)
- How ria-jit loads a RISC-V ELF file is shown by the following declaration (`src/elf/loadElf.h`):
```c=
/**
 * Maps all LOAD segments of the ELF file at the given path into the correct memory regions.
 * @param filePath the path to the ELF file.
 * @return t_risc_elf_map_result the map result containing or INVALID_ELF_MAP if the mapping failed.
 */
t_risc_elf_map_result mapIntoMemory(const char *filePath);
```
- The base address of the ria-jit translator itself is defined in `CMakeLists.txt`:
```c=
set(TRANSLATOR_BASE 0x780000000000)
```

## ria-jit's basic block binary translation
- The DBT must somehow divide the guest code into chunks it can then process for translation and execution. The natural choice is for the translator to partition the code into **basic blocks**.
    - A basic block is a code segment with a single point of entry and a single point of exit.
- For our purposes, a basic block is terminated by any control-flow-altering instruction such as a **jump**, **call**, **return**, or **system call**.

## JIT translation details
:::info
### Summary
This section briefly summarizes what I learned from `Chapter 3: Approach` of the paper. For more detail, please read the following sections or the paper itself. In general, a DBT must ensure the **correctness** of the translation as well as the **performance** of the emulator itself. Note that a DBT sometimes sacrifices some aspects of correctness in exchange for better performance.
- **Correctness** of a DBT covers a broad range of aspects:
    - Precise exception handling
    - What are the differences between RISC-V and x86-64 floating-point exception handling?
- **Performance** of a DBT is extremely challenging. The following optimization strategies are applied:
    - Code cache
        - Has this segment of code already been translated? If so, a redundant translation can be avoided.
        - How can translated code segments be retrieved efficiently? (a lookup problem)
        - How should cached segments be replaced when the code cache is full? (a cache-replacement policy)
    - Handling of host registers
:::

### Translation procedure
- Step 1: Parse RISC-V code into an internal representation until the end of a code block.
- Step 2: Check whether this code block has already been translated and stored in the cache (via a two-level cache lookup).
    - If not, translate it: dispatch the parsed instructions to the translator, which emits the matching x86-64 code into memory allocated for that block.
    - If it has, execute the cached x86-64 code directly.
- Step 3: Before returning to the parsing loop, set the address where program execution should continue, then return to Step 1. (A rough sketch of this loop, combined with the cache lookup described next, is given at the end of this section.)

### Code cache and block handling
- Problem: how can translated code segments be retrieved efficiently?
- Proposed solution: a two-level software cache (found in `src/cache/cache.c`)
- Software cache architecture
    - The large cache contains 8192 entries (resizable). Each entry (`t_cache_entry`) contains:
        - `t_risc_addr`: the RISC-V PC address (unsigned long)
        - `t_cache_loc`: a pointer to a cached code segment (void *)
- Entry layout
```c=1
//cache entries for translated code segments
typedef struct {
    //the full RISC-V pc address
    t_risc_addr risc_addr;
    //cache location of that code segment
    t_cache_loc cache_loc;
} t_cache_entry;
```
- Hash function of the large cache
```c=1
inline size_t hash(t_risc_addr risc_addr) {
    return (risc_addr >> 2u) & (table_size - 1);
}
```
- The small cache contains 32 entries (fixed); its entry layout is the same as for the large cache.
- Hash function of the small cache
```c=1
inline size_t smallhash(t_risc_addr risc_addr) {
    return (risc_addr >> 3u) & (SMALLTLB - 1);
}
```
- Software cache mechanism
    - Given a 64-bit RISC-V address, hash it to a value between 0 and 31.
    - Use the hashed value as the index into the small cache (of size 32).
        - Hit: return the cache location.
        - Miss: do a large-cache lookup.
            - Hash the 64-bit address to a value between 0 and 8191.
            - Use the hashed value as the index into the large cache (of size 8192).
                - Hit: update the corresponding small-cache entry and return the cache location.
                - Miss: translate the code.
- Problem: how should cached segments be replaced when the code cache is full?
- Proposed solutions:
    - Invalidate and purge some or all of the blocks currently residing in the cache.
        - Cons: adds performance overhead if the purged blocks are needed again later.
    - Dynamically resize the cache according to the needs of the guest program.
        - Cons: higher memory usage.
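Putting the translation procedure and the two-level lookup together, the dispatch loop conceptually looks like the sketch below. This is only a hypothetical outline: `lookup_cache_entry()` and `UNSEEN_CODE` are real names from `src/cache/cache.c` (the function itself is shown in the profiling section near the end of this post), while `translate_block()`, `set_cache_entry()`, and `execute_block()` are placeholders for whatever ria-jit actually calls internally.

```c
// Hypothetical sketch of ria-jit's parse/lookup/translate/execute loop.
// Only lookup_cache_entry() and UNSEEN_CODE appear verbatim in src/cache/cache.c;
// the other helpers stand in for the real internals.
void run_guest(t_risc_addr pc) {
    for (;;) {
        t_cache_loc block = lookup_cache_entry(pc);   // two-level cache lookup
        if (block == UNSEEN_CODE) {
            block = translate_block(pc);              // parse the block, emit x86-64 code
            set_cache_entry(pc, block);               // remember the translation
        }
        pc = execute_block(block);                    // run it; returns the next guest pc
    }
}
```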
### Handling of host registers
- Problem: x86-64 has 16 GPRs, while RISC-V has 32.
- Why not map all 32 RISC-V registers to memory, so that register reads and writes are redirected to memory accesses? (This is how `rv32emu-next` handles guest register reads/writes.)
:::spoiler Why this leads to bad performance
> Keeping a **guest register file exclusively in memory**, and loading them into native registers when needed within the translations of single instructions is technically possible, especially in light of the ability to extensively use memory operands in the instructions on x86–64. However, this necessitates a **large number of memory accesses** for both memory operands in the instructions as well as local register allocation within the translated blocks. Due to the very large performance gain connected to using register operands instead of memory operands, this is also not feasible at scale.
:::
- Proposed solution:
    - We utilise the tools we designed to discover the **most-used registers** in the guest programs, and statically map these to general purpose x86–64 registers.
    - Default mapping:
![](https://i.imgur.com/dhFFaIU.png)
    - The remaining operands are then **dynamically allocated** into **reserved host registers** inside the translation of a single block. … In case the translator requires a value not currently present in a replacement register, **the oldest value is written back to the register file in memory**.
- Implementation: context switch (not the same as an OS context switch)
    - The context switch here is the mechanism that handles the register mapping between the guest (RISC-V) context and the host (x86-64) context.
    - Before switching to the guest context, the statically mapped register values are restored from the guest register file; the DBT then jumps into the translated guest code.
    - Before switching back to the host context, the mapped register values are saved back to the guest register file and the host context is restored.

### System call handling
- For RISC-V, the **ECALL** instruction (environment call, formerly SCALL) handles these requests, with the **system call number** residing in register **a7** and the **arguments** being passed in **a0–a6**.
- The RISC-V guest program expects a **different operating system kernel** than is present natively on the host; with that, the system call interface also differs.
- In order to handle the ECALL instruction correctly, the translator must therefore **build the translated instruction to call a specific handler routine**, not too dissimilar from one that may be found in a kernel.

### Floating point extension
- The main difficulties (and their resolutions) that arise from using the **x86-64 SSE** extensions to translate the RISC-V **F- and D-extensions**:
    - Register handling
        - We utilise the tools we designed to discover the **most-used registers** in the guest programs, and statically map these to **x86–64 SSE registers XMM2–XMM15**.
    - Missing equivalent SSE instructions
        - The instructions that need to be emulated are the **unsigned conversion** instructions.
    - Rounding modes
        - These are handled differently in the RISC-V architecture, where the rounding mode can be set **individually for every instruction**. The rounding mode of the **SSE extension**, however, is controlled by the state of the **MXCSR** control and status register.
    - Exception handling
        - In RISC-V, floating-point exception handling is realized by reading the **fcsr** floating-point control and status register; **traps are not supported**.
        - The **CSR** instructions used to read this register are thus emulated to instead **read and translate the MXCSR exception flags**.
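To make the rounding-mode mismatch concrete, here is a conceptual, host-side C sketch of the save/set/restore pattern around MXCSR (bits 13:14 are its rounding-control field). This is purely an illustration of the idea; it is not how ria-jit's emitted code is structured, and the `fadd_with_rm` helper is invented for this example.

```c
#include <xmmintrin.h>

/* Illustrative only: emulate a per-instruction rounding mode (as RISC-V allows)
 * on top of SSE's single global rounding mode held in MXCSR bits 13:14.
 * rc_bits: 0x0000 = nearest, 0x2000 = down, 0x4000 = up, 0x6000 = toward zero. */
static float fadd_with_rm(float a, float b, unsigned int rc_bits)
{
    unsigned int saved = _mm_getcsr();                      /* save current MXCSR           */
    _mm_setcsr((saved & ~0x6000u) | (rc_bits & 0x6000u));   /* select the rounding mode     */
    volatile float result = a + b;                          /* SSE addition under that mode;
                                                               volatile keeps the add between
                                                               the two MXCSR updates          */
    _mm_setcsr(saved);                                      /* restore the previous mode     */
    return result;
}
```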
# RISC-V compliance test (v1.0) on ria-jit

## Introduction
I learned how to use the RISC-V GNU toolchain, run the compliance tests, and verify the results in the computer architecture course at NCKU this semester. By comparing the "signature" (i.e. a memory dump) produced by the simulator against a reference, one can verify the correctness of a simulator or a hardware implementation. In that course I used the `riscvOVPsim` simulator to run the compliance tests. However, the following problems must be solved in order to replace the default `riscvOVPsim` simulator with our target simulator `ria-jit`:
- **Lack of RV64 compliance tests**: most of the existing compliance tests in the `riscv-compliance` repository target the 32-bit RISC-V architecture. There are only a few (<20) compliance tests written for the RV64 architecture, and they sit in a `wip` (work in progress) folder in the repo, indicating that some work is needed to actually run them. Since `ria-jit` is designed for the 64-bit architecture, the 32-bit compliance tests aren't helpful.
- **Adding ria-jit support to the riscv-compliance repository**: there are multiple RISC-V simulators and hardware implementations (called `riscv-target`s) in the repo. Well-known ones like `riscvOVPsim` and `spike` are fully supported. Adding `ria-jit` as a target platform requires some effort, including:
    - understanding the Makefile compilation logic, and
    - writing the two files `compliance_io.h` and `compliance_test.h` that define the interaction between the compliance tests and the underlying simulator.
- **Verification of a successful test**: unlike the `riscvOVPsim` simulator, which is capable of dumping a region of memory, `ria-jit` doesn't implement this feature. This means that with `ria-jit` we cannot compare the execution result against the `reference_output` signature files. Another mechanism must be implemented to decide whether a given compliance test PASSes or FAILs. Note that another RISC-V emulator, `NanoRVI`, deals with this problem by returning the number of the failed compliance test as the exit status. However, as its author points out in the README, that implementation is a "quick and dirty" way to run the compliance tests and cannot be merged into the official `riscv-compliance` repo. Even worse, we cannot obtain the return status of a program run under the `ria-jit` simulator (this could be implemented in the future), so that approach doesn't work here either.

## Existing RV64 support in riscv-compliance repo
- There are 9 tests for the `rv64i` architecture targeting the following instructions (the `W` suffix marks an RV64-only instruction that operates on the lower 32 bits, i.e. a word):
    - ADDIW
    - ADDW
    - SLLIW
    - SLLW
    - SRAIW
    - SRAW
    - SRLIW
    - SRLW
    - SUBW
- There are 4 tests for the `rv64im` architecture targeting the following instructions (`m` is the integer multiplication and division extension):
    - DIVW
    - MULW
    - REMUW
    - REMW

## Add ria-jit as riscv-target device
- The RISC-V compliance tests make use of various C macros to simplify the process of writing assembly-level compliance tests.
The `MULW.S` compliance test is shown as an example here:
```c=1
#include "test_macros.h"
#include "compliance_test.h"
#include "compliance_io.h"

RV_COMPLIANCE_RV32M

RV_COMPLIANCE_CODE_BEGIN
    RVTEST_IO_INIT
    RVTEST_IO_ASSERT_GPR_EQ(x31, x0, 0x00000000)
    RVTEST_IO_WRITE_STR(x31, "Test Begin Reserved regs ra(x1) a0(x10) t0(x5)\n")

    # ---------------------------------------------------------------------------------------------
    RVTEST_IO_WRITE_STR(x31, "# Test number 1 - corner cases\n")

    # address for test results
    la x2, test_1_res

    TEST_RR_SRC2(mulw, x3, x4, 0, 0x0, 0x0, x2, 0)
    TEST_RR_SRC2(mulw, x8, x9, 0, 0x0, 0x1, x2, 8)
    TEST_RR_SRC2(mulw, x11, x12, 0, 0x0, -0x1, x2, 16)
    TEST_RR_SRC2(mulw, x13, x14, 0, 0x0, 0x7fffffffffffffff, x2, 24)
    TEST_RR_SRC2(mulw, x15, x16, 0, 0x0, 0x8000000000000000, x2, 32)

    ..... skip .....

    RV_COMPLIANCE_HALT

RV_COMPLIANCE_CODE_END
```
- At the very beginning, I thought compiling the compliance tests and running them with the `ria-jit` simulator would be relatively easy, since `ria-jit` has been tested extensively. However, I ran into endless `segmentation fault` and `system call xx not implemented` errors from the `ria-jit` simulator. After days of investigation, I pinned down the root cause to the segment of RISC-V code shown below.
- The `RV_COMPLIANCE_CODE_BEGIN` macro in the code example above expands into the following:
```c=1
#define RVTEST_CODE_BEGIN              \
    .section .text.init;              \
    .align 6;                         \
    .weak stvec_handler;              \
    .weak mtvec_handler;              \
    .globl _start;                    \
_start:                               \
    /* reset vector */                \
    j reset_vector;                   \
    .align 2;                         \
trap_vector:                          \
    /* test whether the test came from pass/fail */ \
    csrr t5, mcause;                  \
    li t6, CAUSE_USER_ECALL;          \
    beq t5, t6, write_tohost;         \
    li t6, CAUSE_SUPERVISOR_ECALL;    \
    beq t5, t6, write_tohost;         \
    li t6, CAUSE_MACHINE_ECALL;       \
    beq t5, t6, write_tohost;         \
    ..... skipped .....
```
- Understanding this code segment is itself challenging. As far as I can tell, it is an exception handler written in assembly: the code executes `CSR` instructions to read the processor state and handles the exception (which indicates that a certain test case failed) by jumping to the corresponding trap/reset vector. Unfortunately, the instruction `csrr t5, mcause` isn't supported by `ria-jit` and is the root cause of the errors, so this sophisticated, assembly-coded exception handling doesn't work on `ria-jit`. (It doesn't work on `QEMU` either.)
- I found a presentation from `SiFive` that introduces RISC-V interrupts [here, page 32](https://cdn2.hubspot.net/hubfs/3020607/An%20Introduction%20to%20the%20RISC-V%20Architecture.pdf). I am reading it right now and hope to fix this in the `ria-jit` simulator.
- After being stuck on this bug for weeks, I figured out how to bypass the problem temporarily by going through the code of another RISC-V simulator, `NanoRVI`. Unlike the original implementation with its complicated assembly-coded exception handling, `NanoRVI` simply removes that scheme; its version is as follows.
- Note that the `_start` symbol is where execution starts:
```c=1
#define RVTEST_CODE_BEGIN   \
    .section .text.init;    \
    .align 6;               \
    .weak stvec_handler;    \
    .weak mtvec_handler;    \
    .globl _start;          \
_start:                     \
```
- Eventually, I implemented the missing functions in the two files `compliance_test.h` and `compliance_io.h`, plus other environment setup that is too tedious to cover here. My work incorporating `ria-jit` into the `riscv-compliance` repo as a target device can be found in this [commit](https://github.com/WeiCheng14159/riscv-compliance/commit/0292e7c523daf0030bdb3c4279394d91da154042).
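To give an idea of what these two files contain: the I/O macros that every test invokes (`RVTEST_IO_INIT`, `RVTEST_IO_WRITE_STR`) can essentially be stubbed out for a target like ria-jit that has no dedicated debug console, while the assertion macro needs a real implementation (shown in the next section). The snippet below is only an illustrative sketch, not a verbatim copy of the files in the commit.

```c
/* compliance_io.h -- illustrative sketch only; the real file in the linked
 * commit differs in detail. */
#ifndef _COMPLIANCE_IO_H
#define _COMPLIANCE_IO_H

#define RVTEST_IO_INIT                  /* nothing to initialize for ria-jit          */
#define RVTEST_IO_WRITE_STR(_SP, _STR)  /* no debug console available: drop the message */

#endif
```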
## Verification of a successful test
- Replacing the content of the `RVTEST_CODE_BEGIN` macro with `NanoRVI`'s version isn't enough; I also have to come up with a different mechanism to determine whether a test is a "PASS" or a "FAIL".
- `ria-jit` isn't capable of dumping memory like the `riscvOVPsim` simulator, so I need some other way for `ria-jit` to "communicate" to the outside environment whether a test FAILed or PASSed.
- Instead of dumping memory, `NanoRVI` takes advantage of the process return code, indicating a "PASS" by returning 0 and a "FAIL" by returning the number of the failed test.
- Note that implementing a "memory dump" feature for `ria-jit` is definitely possible, but for now the mechanism I came up with is to **return silently for a PASS** and to **write to the protected address 0x0, triggering a segmentation fault, for a FAIL**. This mechanism is crude and could certainly be handled more elegantly, but everything has to be done at the assembly level. I am currently reading up on the RISC-V interrupt-handling mechanism and hope to come up with something better.
- Actual implementation
    - `RVTEST_IO_ASSERT_GPR_EQ(_SP, _R, _I)` is a macro that asserts the value of a GPR: if the register holds the expected value, the test continues; if not, the test stops execution and reports the error.
        - `_SP` is a scratch register; usually `t6` (aka `x31`) is used.
        - `_R` is the register that stores the result.
        - `_I` is the expected value of the computation.
    - This macro is implemented as follows:
```c=1
#define RVTEST_IO_ASSERT_GPR_EQ(_SP, _R, _I)  \
    li _SP, _I;                               \
    beq _SP, _R, 20002f;                      \
    RVTEST_FAIL;                              \
20002:                                        \
```
- `RVTEST_PASS` and `RVTEST_FAIL` are macros that determine the behavior on a PASS or a FAIL.
    - Register `a0` is assigned `0` for a PASS and `1` for a FAIL.
    - Register `a7` is assigned the value `93`, the number of the `exit` system call.
```c=1
#undef RVTEST_PASS
#define RVTEST_PASS       \
    li a7, 93;            \
    li a0, 0;             \
    j end_testcode;       \

#undef RVTEST_FAIL
#define RVTEST_FAIL       \
    li a7, 93;            \
    li a0, 1;             \
    j end_testcode;       \
```
- `RV_COMPLIANCE_CODE_END` is appended at the bottom of each compliance test.
    - If the value of `a0` isn't zero, the compliance test is a FAIL: the program jumps to the label `1234f` and executes `sw x0, 0(x0)`, which triggers a segmentation fault.
```c=1
#define RV_COMPLIANCE_CODE_END  \
end_testcode:                   \
    bne a0, x0, 1234f;          \
    ecall;                      \
1234:                           \
    sw x0, 0(x0);               \
```

## System calls implemented in ria-jit (FYI)
:::spoiler
The following Linux syscalls are implemented in ria-jit: getcwd (17), fcntl (25), ioctl (29), unlinkat (35), ftruncate (46), faccessat (48), chdir (49), fchmod (52), fchown (55), pipe2 (59), openat (56), close (57), getdents64 (61), lseek (62), read (63), write (64), writev (66), readlinkat (78), fstatat (79), fstat (80), utimensat (88), exit (93), exit_group (94), set_tid_address (96), futex (98), set_robust_list (99), clock_gettime (113), tgkill (131), rt_sigaction (134), rt_sigprocmask (135), uname (160), gettimeofday (169), getpid (172), getuid (174), geteuid (175), getgid (176), getegid (177), gettid (178), sysinfo (179), brk (214), munmap (215), execve (221), mmap (222), wait4 (260), prlimit64 (261), renameat2 (276), getrandom (278)
:::

## Bug in RV64IM riscv-compliance repo
I found a bug involving 64-bit compatibility in the `riscv-compliance` repo; the fix can be found [here](https://github.com/WeiCheng14159/riscv-compliance/commit/12073dc20f432419987cc24bdc4218549997251a)

:::warning
Would you send a pull request back to upstream? `SX` might not be a good name since it causes confusion.
:notes: jserv
:::

:::info
Indeed, `SX` isn't satisfactory! I need to find a better way to decouple the compliance tests from the target platform. The solution I provided above originates from how the `riscvOVPsim` simulator deals with the 32/64-bit compatibility issue; their implementation can be found [here](https://github.com/riscv/riscv-compliance/blob/207bc4e3ac3af94fd6759aa9bed7a241692c101c/riscv-target/riscvOVPsim/compliance_io.h#L56)
:::

::: info
Update: a recent update to the `riscv-compliance` repo fundamentally changes the project organization and standardizes the simulator interface for the compliance tests. My work cannot be merged directly into the repo, so I am porting it to riscv-compliance 2.0 (WIP).
:::

## riscv-compliance tests (v1.0) results
- The following `rv64im` tests PASS on the `ria-jit` simulator
:::spoiler Passed 12/13
- ADDIW
- ADDW
- SLLIW
- SLLW
- SRAIW
- SRAW
- SRLIW
- SRLW
- SUBW
- MULW
- REMUW
- REMW
:::
- The following `rv64im` test FAILs on the `ria-jit` simulator
    - DIVW
- For comparison, all tests pass on the QEMU simulator.

## Fix: DIVW compliance test fails on `ria-jit`
- Methodology: run `ria-jit` in single-stepping mode with all JIT optimizations disabled.
- Problem identification:
    - The DIVW instruction explained
        - > DIVW and DIVUW instructions are only valid for RV64, and divide the lower 32 bits of `rs1` by the lower 32 bits of `rs2`. DIVW and DIVUW treat them as signed and unsigned integers respectively, placing the 32-bit quotient in `rd`, sign-extended to 64 bits.
    - The DIVW compliance test fails on this test case:
        - `TEST_RR_SRC2(divw, x3, x4, 0xffffffffffffffff, 0x0, 0x0, x2, 0)` [link](https://github.com/riscv/riscv-compliance/blob/2636302c27557b42d99bed7e0537beffdf8e1ab4/riscv-test-suite/wip/rv64im/src/DIVW.S#L50)
        - This test macro loads `0x0` into `x3` and `0x0` into `x4`,
        - then performs `divw x4, x3, x4` and writes the result to the address stored in `x2`, at offset `0`.
        - The expected result is `0xffffffffffffffff`, as specified in the RISC-V spec.
    - The following figure is taken from the RISC-V spec v2.2, Vol. I, Section 6.2 "Division Operations":
        - ![](https://i.imgur.com/1SNnkuw.png)
        - The `DIV` instruction is expected to output `-1 (0xffffffffffffffff)` for the divide-by-zero case, and the same applies to `DIVW`.
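To make the expected behavior concrete, here is a small, self-contained C model of the `DIVW` semantics quoted above, covering the divide-by-zero result stated in the spec (and, for completeness, the signed-overflow case defined in the same section). This is plain host-side C written for illustration; it is not code from ria-jit or from the test suite.

```c
#include <inttypes.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

// RV64M DIVW semantics: divide the lower 32 bits of rs1 by the lower 32 bits
// of rs2 as signed integers, then sign-extend the 32-bit quotient to 64 bits.
static int64_t divw(int64_t rs1, int64_t rs2)
{
    int32_t dividend = (int32_t) rs1;
    int32_t divisor  = (int32_t) rs2;
    int32_t quotient;

    if (divisor == 0)
        quotient = -1;                               // divide by zero: all bits set
    else if (dividend == INT32_MIN && divisor == -1)
        quotient = INT32_MIN;                        // signed overflow: quotient = dividend
    else
        quotient = dividend / divisor;

    return (int64_t) quotient;                       // sign-extension happens in the cast
}

int main(void)
{
    // The failing test case: 0 / 0 must yield 0xffffffffffffffff (-1).
    printf("%016" PRIx64 "\n", (uint64_t) divw(0, 0));
    return 0;
}
```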
- Generated RISC-V assembly
:::spoiler assembly code
```c=1
...
100dc: 00000213  li   tp,0
100e0: 00000193  li   gp,0
100e4: 023241bb  divw gp,tp,gp
100e8: 00313023  sd   gp,0(sp)
100ec: fff00f93  li   t6,-1
100f0: 003f8863  beq  t6,gp,10100 <_start+0x40>
100f4: 05d00893  li   a7,93
100f8: 00100513  li   a0,1
100fc: 4080006f  j    10504 <end_testcode>
...
```
:::
- Generated x86-64 assembly for `divw gp,tp,gp`
    - To be more specific, the `DIVW` RISC-V instruction is mapped to the `IDIV` x86-64 instruction.
:::spoiler assembly code (~30 lines)
```c=1
...
[jit-gen] [NOP]
[jit-gen] [MOV reg8:r1 mem8:r16+0x7f265e90]
[jit-gen] [MOV reg8:r0 mem8:r16+0x7f265e91]
[jit-gen] [MOV mem8:r16+0x7f265e8a reg8:r0]
[jit-gen] [TEST reg4:r1 reg4:r1]
[jit-gen] [JZ imm8:0x77ff810003e2]
[jit-gen] [C_SEP_4]
[jit-gen] [IDIV reg4:r1]
[jit-gen] [MOVSX reg8:r1 reg4:r0]
[jit-gen] [JMP imm8:0x77ff810003e9]
[jit-gen] [MOV reg8:r0 imm8:0xffffffffffffffff]
[jit-gen] [NOP]
[jit-gen] [NOP]
[jit-gen] [NOP]
[jit-gen] [MOV mem8:r16+0x7f265e65 reg8:r1]
[jit-gen] [LEA reg8:r0 mem0:r16+-0x7]
[jit-gen] [MOV mem8:r16+0x7f265c27 reg8:r0]
[jit-gen] [MOV mem4:r16+0x7f265c25 imm4:0x245]
[jit-gen] [MOV reg8:r0 imm8:0x100e8]
[jit-gen] [MOV mem8:r16+0x7f265f27 reg8:r0]
[jit-gen] [NOP]
[jit-gen] [NOP]
[jit-gen] [NOP]
[jit-gen] [RET_8]
[jit-gen] [NOP]
[jit-gen] [NOP]
[jit-gen] [NOP]
...
```
:::
- **Root cause**: the destination register is expected to hold the result of `0 / 0`, which per the RISC-V spec should be `0xffffffffffffffff`; however, it ends up holding `0x0`. This is the cause of the failure.
- **Why do I think this is a bug?** A divide-by-zero test is actually covered by the test program shipped with `ria-jit`, as shown in this [line](https://github.com/WeiCheng14159/ria-jit/blob/9aada51e0b47aa0fa4c93bd0b64183643d5c6777/test/test_programs/arithmetic_test/compiled_test_arithm.c#L76)
```c=1
...
{
    //div-zero quotient should have all bits set
    init(++num, "DivZero");
    size_t n = 256;
    size_t m = 0;
    assert_equals(0xFFFFFFFFFFFFFFFF, (n / m), &failed_tests);
}
...
```
- I checked the compiled assembly of this test case and found that `(n / m)` is compiled into the `DIVUW` RISC-V instruction, since `m` and `n` are of the unsigned type `size_t`.
- Changing `m` and `n` from unsigned to signed `int`, as shown below, results in a failure because the `DIVW` RISC-V instruction is then used. I figured this out by disassembling the compiled binary.
```c=1
...
{
    //div-zero quotient should have all bits set
    init(++num, "DivZero");
    int n = 256;
    int m = 0;
    assert_equals(0xFFFFFFFFFFFFFFFF, (n / m), &failed_tests);
}
...
```
- For reference, changing `m` and `n` from `size_t` to `int64_t` still yields the correct result, because the `DIV` RISC-V instruction is used. Note that `ria-jit` passes the `DIV` and `DIVUW` cases in the riscv-compliance tests, so those forms of division behave as expected.
- `ria-jit` does provide test cases that include division by zero; **however, the divide-by-zero case of the `DIVW` instruction isn't covered by those test cases. The bug is now exposed by the `riscv-compliance` tests. It is an important bug because 32-bit signed integer (`int`) division is widely used.**
- **Fix**: the function `void translate_DIVW(...)` is responsible for translating the `DIVW` instruction from RISC-V to x86-64, shown [here](https://github.com/WeiCheng14159/ria-jit/blob/9aada51e0b47aa0fa4c93bd0b64183643d5c6777/src/gen/instr/ext/translate_m_ext.c#L408)
- As mentioned earlier, the `IDIV` instruction is used to emulate `DIVW` on the x86 platform.
`IDIV` expects `EDX:EAX` as its implicit dividend and stores the quotient in `EAX` and the remainder in `EDX`. More info can be found [here](https://mudongliang.github.io/x86/html/file_module_x86_id_137.html)
- However, `EAX` is NOT always mapped to the RISC-V `rd` register. Due to the register-mapping feature in `ria-jit`, I found that `rd` was mapped to the `r11` register instead of `EAX`.
- If this is not handled, the division result stays in `EAX` while the translated RISC-V binary expects the result in `r11`.
- To map the division result correctly, I added three lines of code that move `EAX` into `rd` when they are not already the same register:
```c=1
...
if (regDest != FE_AX) {
    ///rd is mapped so move the result in RAX there.
    err |= fe_enc64(&current, FE_MOVSXr64r32, regDest, FE_AX);
}
...
```
- **[Update]** The author clarified that this isn't the root cause; our discussion can be found [here](https://github.com/ria-jit/ria-jit/pull/3)
    - As the author suggested, the actual problem is that in the zero-divisor path the value `-1` was written to `EAX` instead of the register mapped to `rd`:
```c=1
...
// fe_enc64(&current, FE_MOV64ri, FE_AX, -1);
fe_enc64(&current, FE_MOV64ri, regDest, -1);
...
```
- **Result**
    - ALL `rv64im` `riscv-compliance` tests and ALL test cases that ship with `ria-jit` now pass.
    - GitHub pull request created [here](https://github.com/ria-jit/ria-jit/pull/3)

# Comparing the performance of ria-jit and QEMU
- The performance of `ria-jit` is measured with the `CoreMark` benchmark and compared against the `QEMU` simulator. This section also demonstrates the effect of the performance optimizations implemented in `ria-jit`.

## What is CoreMark?
> EEMBC’s CoreMark® is a benchmark that measures the performance of microcontrollers (MCUs) and central processing units (CPUs) used in embedded systems. Replacing the antiquated Dhrystone benchmark, CoreMark contains implementations of the following algorithms: **list processing** (find and sort), **matrix manipulation** (common matrix operations), **state machine** (determine if an input stream contains valid numbers), and **CRC** (cyclic redundancy check). It is designed to run on devices from 8-bit microcontrollers to 64-bit microprocessors.

> It has been designed around **embedded applications** and therefore demonstrates **highly favorable numbers for relatively simple designs** (e.g., dual-issue in-order) while having **weaker performance scaling in complex designs** (e.g., out-of-order superscalar).

## Experiment setup
- CoreMark port to RISC-V
    - The CoreMark binary is ported to RISC-V using the CoreMark EEMBC wrapper [here](https://github.com/riscv-boom/riscv-coremark)
    - The CoreMark benchmark binary is compiled with the following compiler flags: `-march=rv64imafd -mabi=lp64d -O3 -static`. This is the setup suggested by `ria-jit`.
    - There are two targets in the CoreMark EEMBC wrapper: `normal` and `bare_metal`. The `bare_metal` target is compiled with the `-nostdlib -nostartfiles` flags, which causes a `segmentation fault` when executed on `ria-jit` since some `CSR` instructions aren't supported. Thus, the CoreMark binary is compiled as the `normal` target.
- Software
    - Compiler: latest RISC-V GNU toolchain, `GCC 10.2.0`, compiled from source
    - `ria-jit`: [tag v1.3.1](https://github.com/ria-jit/ria-jit/releases/tag/v1.3.1)
    - `QEMU`: [tag 5.1.0](https://github.com/qemu/qemu/releases/tag/v5.1.0) with a Linux v5.4 kernel image
- Hardware / OS
    - Ubuntu 18.04
    - Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz

## Results
There are four optimization techniques implemented in `ria-jit`: return address stack **(RAS)**, block chaining **(BC)**, recursive translation of jump targets, and macro opcode fusion/conversion **(Fusion)**. First, I run the CoreMark benchmark with ONE optimization technique DISABLED at a time. Then, I run it with ONE optimization technique ENABLED at a time to show the performance difference.

### Baseline (QEMU)
- `QEMU` with 2048M memory
:::spoiler score: 6184.917101
```c=1
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 17972919
Total time (secs): 17.972919
Iterations/Sec : 6184.917101
Iterations : 110000
Compiler version : GCC10.2.0
Compiler flags : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x33ff
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 6184.917101 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::

### Baseline (ria-jit)
- `ria-jit` with ALL optimizations enabled
:::spoiler score: 17684.590726
```c=1
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 16963921
Total time (secs): 16.963921
Iterations/Sec : 17684.590726
Iterations : 300000
Compiler version : GCC10.2.0
Compiler flags : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0xcc42
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 17684.590726 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::

### Disable ONE opt. technique
- Disable the **return address stack** by setting `--optimize=no-ras`
:::spoiler score: 11277.077744
```c=1
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 17735091
Total time (secs): 17.735091
Iterations/Sec : 11277.077744
Iterations : 200000
Compiler version : GCC10.2.0
Compiler flags : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x4983
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 11277.077744 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::
- Disable **block chaining** by setting `--optimize=no-chain`
:::spoiler score: 661.703370
```c=1
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 16623763
Total time (secs): 16.623763
Iterations/Sec : 661.703370
Iterations : 11000
Compiler version : GCC10.2.0
Compiler flags : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x33ff
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 661.703370 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::
- Disable **recursive translation of jump targets** by setting `--optimize=no-jump`
:::spoiler score: N/A
```c=1
Segmentation fault
```
:::
- Disable **macro opcode fusion/conversion** by setting `--optimize=no-fusion`
:::spoiler score: 16393.457404
```c=1
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 12199989
Total time (secs): 12.199989
Iterations/Sec : 16393.457404
Iterations : 200000
Compiler version : GCC10.2.0
Compiler flags : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x4983
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 16393.457404 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::
- Disable **all optimizations** by setting `--optimize=none`
:::spoiler score: 612.171699
```c=1
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 17968815
Total time (secs): 17.968815
Iterations/Sec : 612.171699
Iterations : 11000
Compiler version : GCC10.2.0
Compiler flags : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x33ff
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 612.171699 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::

### Enable ONE opt. technique
**The recursive translation of jump targets must be enabled for the benchmark to run at all, so it remains enabled in all of the following runs.**
- Enable only the **return address stack** by setting `--optimize=no-chain,no-fusion`
:::spoiler score: 663.241927
```c=1
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 16585200
Total time (secs): 16.585200
Iterations/Sec : 663.241927
Iterations : 11000
Compiler version : GCC10.2.0
Compiler flags : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x33ff
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 663.241927 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::
- Enable only **block chaining** by setting `--optimize=no-ras,no-fusion`
:::spoiler score: 10980.685578
```c=1
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 18213799
Total time (secs): 18.213799
Iterations/Sec : 10980.685578
Iterations : 200000
Compiler version : GCC10.2.0
Compiler flags : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x4983
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 10980.685578 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::
- Enable only **macro opcode fusion/conversion** by setting `--optimize=no-ras,no-chain`
:::spoiler score: 650.748298
```c=1
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 16903617
Total time (secs): 16.903617
Iterations/Sec : 650.748298
Iterations : 11000
Compiler version : GCC10.2.0
Compiler flags : -march=rv64imafd -mabi=lp64d -O3 -static
Memory location : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x33ff
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 650.748298 / GCC10.2.0 -march=rv64imafd -mabi=lp64d -O3 -static / Heap
```
:::

### Charts
#### Effect of ria-jit optimizations
- Enable one ria-jit optimization technique at a time
    - ![](https://i.imgur.com/rhsoasA.png)
- Disable one ria-jit optimization technique at a time
    - ![](https://i.imgur.com/lbmPsAI.png)

### Discussion
**Block chaining** appears to be the single most important optimization behind `ria-jit`'s performance advantage over QEMU. This conclusion agrees with the SPEC CPU 2017 experiments conducted by the authors; their results are shown below.

#### Comparison with the results from the original paper
The paper gives several possible reasons why block chaining is the most important optimization, and I believe the same conclusions hold even though I used a different benchmark (CoreMark) to test the simulator.

![](https://i.imgur.com/MPZ3H8X.png)
![](https://i.imgur.com/70KRwyX.png)

> **Macro operation fusion does not seem to provide a large performance benefit**, in most benchmarks the numbers do not even suggest any performance increase above natural deviation of benchmark runs. This means the **implemented pattern matching does not give the desired effect of a good performance increase**. Further tweaking of the checked patterns might make this optimisation more worthwhile.

:::info
In my runs, macro operation fusion likewise provides little advantage in CoreMark.
:::

> **The return address stack provided for a significant advantage in some benchmarks**. **Especially the function call heavy** 620.omnetp, 623.xalancbmk, 631.deepsjeng, 641.leela **benchmarks** showed good performance gains of over 50 %. The 600.perlbench, as well as the 648.exchange2 and 657.xz benchmarks where most of the runtime is spent in only a couple loops naturally could not benefit a lot.

:::info
In my runs, enabling the return address stack alone (without block chaining) provides little advantage in CoreMark, although disabling it from the fully optimized configuration still costs roughly a third of the score (17684 → 11277).
:::

> Recursive jump translation without also utilising the return address stack only provided a performance increase over disabling both in some benchmarks. The main reason for this might be that this also makes context switches necessary on unconditional jumps that aren’t function calls or returns. **This makes jump-heavy benchmarks take a performance hit while jump-light benchmarks are almost unaffected.**

:::info
I cannot verify this because of the segmentation fault that occurs when recursive jump translation is disabled.
:::

> Expectedly, **the highest performance penalty was incurred by disabling chaining** as well. This **makes a context switch back to the translator necessary for every executed basic block.** The benchmarks that are less impacted by disabling block chaining are the ones where fewer basic blocks were executed relative to their runtime. This correlates with the fact that the most executed blocks of these benchmarks contain more instructions and hence execute for a longer time.
:::info
Once block chaining is disabled, `ria-jit` performs a context switch (not an OS-level context switch) back to the translator for every executed basic block. In other words, `ria-jit` gains most of its performance by reducing the number of such context switches; this is the core contribution of the paper.
:::

#### Why is ria-jit faster than QEMU?
From the original paper:
> QEMU first translates the instructions into microinstructions in an **intermediate representation** independent of platform and ISA. While this allows easy implementation of new host and guest architectures, it also means that specific advantages of both guest and host can not be used. This leads to needing potentially more host instructions than necessary for a simple task. **QEMU also does not employ a return address stack**, an optimisation which proved to be very worthwhile. One of the **biggest disadvantages of QEMU**, though, is that **it does not use a static register mapping.** Instead it holds all registers in memory and only loads them into one of a few temporary registers when needing them as input operands. The loaded output register may then be reused as input for the next instruction. If the next instruction needs different inputs, though, the register is written back and needs to be reloaded next time it is needed. This obviously leads to a big overhead which we can avoid by statically mapping the most used registers and dynamically allocating temporaries for the rest.

# Profiling & Improvement

## Linux Perf
```c=1
Performance counter stats for 'ria-jit -f coremark.riscv' (5 runs):

95 context-switches ( +- 9.55% )
1523 page-faults ( +- 0.07% )
379,4957,5400 branches ( +- 0.00% ) (83.33%)
7,6821,7118 branch-misses # 2.02% of all branches ( +- 0.07% ) (83.33%)
116,1346 cache-misses # 9.973 % of all cache refs ( +- 8.16% ) (83.33%)
1164,4721 cache-references ( +- 9.79% ) (66.67%)
2788,1446,6011 instructions # 2.70 insn per cycle ( +- 0.01% ) (83.34%)
1032,3521,2972 cycles ( +- 0.61% ) (83.33%)

23.675 +- 0.148 seconds time elapsed ( +- 0.63% )
```
- Where most of the cycle time is spent
    - ![](https://i.imgur.com/Ulf8zea.png)
- The corresponding critical path
```c=1
t_cache_loc lookup_cache_entry(t_risc_addr risc_addr) {
    if (flag_do_profile) profile_cache_access();

    size_t smallHash = smallhash(risc_addr);

    if (tlb[smallHash].risc_addr == risc_addr) {
        return tlb[smallHash].cache_loc;
    }

    size_t index = find_lin_slot(risc_addr);

    if (cache_table[index].cache_loc != 0) {
        //value is cached and exists
        set_tlb(risc_addr, cache_table[index].cache_loc);
        return cache_table[index].cache_loc;
    } else {
        //value does not exist
        return UNSEEN_CODE;
    }
}
```
- I've pinned down the critical path but haven't come up with a good optimization strategy yet. I've tried augmenting the existing TLB-like cache lookup with a Bloom filter, but I am still debugging it.
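For the record, the direction I am experimenting with looks roughly like the sketch below: a small Bloom filter in front of the table probing, so that addresses that have definitely never been translated can be rejected with two bit tests. Everything here (filter size, the second hash, the function names) is hypothetical and is not code from my work-in-progress branch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Rough sketch of a Bloom-filter fast path for "has this pc ever been translated?".
// A guest address is tracked by setting two bits; a lookup can only be a definite
// miss if either bit is clear. False positives simply fall back to normal probing.
#define BLOOM_BITS (1u << 16)
static uint64_t bloom[BLOOM_BITS / 64];

static inline void bloom_insert(uint64_t addr) {
    size_t h1 = (addr >> 2)  & (BLOOM_BITS - 1);   // same shift as the cache hash
    size_t h2 = (addr >> 13) & (BLOOM_BITS - 1);   // a second, cheap hash
    bloom[h1 / 64] |= 1ull << (h1 % 64);
    bloom[h2 / 64] |= 1ull << (h2 % 64);
}

static inline bool bloom_maybe_translated(uint64_t addr) {
    size_t h1 = (addr >> 2)  & (BLOOM_BITS - 1);
    size_t h2 = (addr >> 13) & (BLOOM_BITS - 1);
    return ((bloom[h1 / 64] >> (h1 % 64)) & 1) &&
           ((bloom[h2 / 64] >> (h2 % 64)) & 1);
}
// Intended use: call bloom_insert() wherever a new translation is stored, and let
// lookup_cache_entry() return UNSEEN_CODE immediately when bloom_maybe_translated()
// is false, skipping both the TLB check and find_lin_slot().
```

Whether this helps in practice depends on how often lookups actually miss; in a hit-dominated workload like CoreMark the small TLB already catches most lookups, so the gain may be limited.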