contributed by < qwe661234 >
Repository: rv32emu
Commit 3305664 is regarded as the baseline for performance evaluation and enhancement. The baseline passes the RISC-V instruction-set compliance tests for RV32I, RV32M, RV32C, and the privileged architecture. We compare the performance of the emulator with and without the computed-goto technique using the benchmark tools CoreMark and Dhrystone.
Ensure that the rv32emu binary is built properly.
Compiler | w/ computed-goto | CoreMark (Iterations/Sec) | Dhrystone (DMIPS) |
---|---|---|---|
clang-15 | NO | 360.3897405 | 325.0 |
clang-15 | YES | 434.6317140 | 391.6 |
gcc-12 | NO | 380.4077495 | 328.0 |
gcc-12 | YES | 444.7024895 | 415.8 |
DMIPS: Dhrystone MIPS (Million Instructions per Second)
According to the benchmark results, the emulator performs better with the computed-goto technique.
We use perf for additional performance analysis and discover that a significant amount of time is spent decoding instructions in the function rv_step. As a result, we introduce an IR and the notion of a basic block in order to speed up instruction decoding.
perf report for commit 3305664

A basic block contains one or more instructions, and we manage blocks efficiently using hash tables and block prediction. In the decode stage, a new block that can hold up to 1024 instructions by default is allocated, and instructions are decoded into IR and stored in the block until it is full or the latest instruction is a branch instruction, after which the block is placed in the block map.
In the execution stage, the emulator executes the instructions in the basic block. If the program executes the same instruction sequence again, we can reuse the basic block without decoding the instructions repeatedly.
We can see from the call graph below that CoreMark and Dhrystone frequently invoke a few specific functions, so it is likely that the basic blocks of these functions can be reused. The results of running the CoreMark and Dhrystone benchmarks show that introducing basic blocks significantly reduces the overhead of instruction decoding in such a situation.
Compiler | w/ computed-goto | CoreMark (Iterations/Sec) | Dhrystone (DMIPS) |
---|---|---|---|
clang-15 | NO | 850.4265125 | 665.2 |
clang-15 | YES | 854.0181165 | 741.6 |
gcc-12 | NO | 858.7766405 | 759.2 |
gcc-12 | YES | 870.4854225 | 799.6 |
DMIPS: Dhrystone MIPS (Million Instructions per Second)
Dispatch is the mechanism a high-level-language emulator uses to transfer control from the code that emulates one instruction to the code that emulates the next.
Two dispatch techniques are used in rv32emu.
Each instruction is implemented by a case in the switch statement. Instruction bodies are simply the compiler-generated code for each case.
As Re: Suggestions on implementing an efficient instruction set simulator in LuaJIT2 mentioned, the control-flow graph of an interpreter with C switch-based dispatch looks like this:
Each individual instruction execution looks like this:
The drawback of switch dispatch is that it causes trouble for the compiler:
We can use a direct- or indirect-threaded interpreter even in C, e.g. with computed goto. The control-flow graph of an interpreter with computed-goto dispatch looks like this:
Why is direct threading more efficient? As stated in Computed goto for efficient dispatch tables, it's faster for two reasons:
The difference between direct threading and switch dispatch is the "bounds check" step of the switch. Why is it required? The compiler is forced to generate a bounds check for the switch statement to conform to the C standard. Quoting from C99:
If no converted case constant expression matches and there is no default label, no part of the switch body is executed.
Therefore, the standard forces the compiler to generate "safe" code for the switch. Safety, as usual, has a cost, so the switch version ends up doing a bit more work per loop iteration.
Modern CPUs have deep instruction pipelines and go to great lengths ensuring that the pipelines stay as full as possible. One thing that can ruin a pipeline's day is a branch, which is why branch predictors exist. Since a CPU can easily pre-fetch instructions from the branch's target, successful prediction can make the pre-fetched instructions valid and there is no need to fully flush the pipeline.
Since the switch statement has a single "master jump" that dispatches all opcodes, predicting its destination is quite difficult. On the other hand, the computed-goto statement is compiled into a separate jump per opcode, so given that instructions often come in pairs, it's much easier for the branch predictor to "home in" on the various jumps correctly.
Think about it this way: for each jump, the branch predictor keeps a prediction of where it will jump next. If there's a jump per opcode, this is equivalent to predicting the second opcode in an opcode pair, which actually has some chance of success from time to time. On the other hand, if there's just a single jump, the prediction is shared between all opcodes and they keep stepping on each other's toes with each iteration.
Unfortunately, even though we use direct threading for dispatch, two issues remain. The first is that direct threading spends considerable time computing the jump address. We refer to wasm3's implementation to solve this issue: wasm3 is a WebAssembly interpreter that uses subroutine threading with tail-call optimization to dispatch instructions. Following it, we separate all instruction emulations into individual functions and add a member impl to struct rv_insn_t, so that each IR entry can be directly assigned its instruction-emulation function.
The second issue is that we must create a function stack frame for each instruction emulation, which TCO can eliminate: it allows every instruction in a basic block to be emulated using the same function stack frame, saving the overhead of repeatedly creating stack frames.
To meet the tail-call-optimization requirement, we must convert the function emulate into a recursive version. To accomplish this, we add a member tailcall to struct rv_insn_t that indicates whether the basic block is terminated. With this variable, we can rewrite emulate as a self-recursive function.
Running the CoreMark and Dhrystone benchmarks now produces faster results than before, as shown below.
Compiler | w/ TCO | CoreMark (Iterations/Sec) | Dhrystone (DMIPS) |
---|---|---|---|
clang-15 | NO | 944.9438388 | 841.4 |
clang-15 | YES | 983.6298667 | 873.4 |
gcc-12 | NO | 952.4544735 | 905.8 |
gcc-12 | YES | 978.6633022 | 923.4 |
DMIPS: Dhrystone MIPS (Million Instructions per Second)