# Linux term project (2023): Improve RISC-V Emulation Performance
contributed by < [`qwe661234`](https://github.com/qwe661234) >
> [Repository](https://github.com/sysprog21/rv32emu)
> [video](https://youtu.be/zGNJPYcpl0s)
## Environment setup
1. Clone the repository
```shell
$ git clone https://github.com/sysprog21/rv32emu.git
```
2. Follow [the instructions of rv32emu](https://github.com/sysprog21/rv32emu/blob/master/README.md)
## Baseline: commit [3305664](https://github.com/sysprog21/rv32emu/commit/33056640b488b6590dcc5641e53a46ea22751e85)
The commit [3305664](https://github.com/sysprog21/rv32emu/commit/33056640b488b6590dcc5641e53a46ea22751e85) is regarded as the baseline for performance evaluation and enhancements. The RISC-V instruction set compliance with RV32I, RV32M, RV32C, and privilege can be passed by the baseline. We compare the performance of the emulator with and without the [computed-goto](https://hackmd.io/@sysprog/c-control-flow#switch-%E8%83%8C%E5%BE%8C%E7%9A%84-goto-%E5%92%8C%E5%AF%A6%E4%BD%9C%E8%80%83%E9%87%8F) technique using the benchmark tool CoreMark and Dhrystone.
### Benchmarking with CoreMark and Dhrystone
Ensure that `rv32emu` binary is built properly.
```shell
$ cd rv32emu
$ make clean all
```
* Run [CoreMark](https://github.com/eembc/coremark)
```shell
$ build/rv32emu build/coremark.elf
```
* Run [Dhrystone](https://github.com/sysprog21/rv32emu/blob/master/tests/dhrystone.c)
```shell
$ build/rv32emu build/dhrystone.elf
```
### Benchmark result
* Mircoprocessor: Core i7-8700
| Compiler | w/ computed-goto | CoreMark (Iterations/Sec) | Dhrystone (DMIPS) |
| -------- | -------- | -------- |-------- |
| clang-15 | NO | 360.3897405 | 325.0 |
| clang-15 | YES | 434.6317140 | 391.6 |
| gcc-12 | NO | 380.4077495 | 328.0 |
| gcc-12 | YES | 444.7024895 | 415.8 |
> DMIPS: Dhrystone MIPS (Million Instructions per Second)
According to benchmark result, emulator has better result with [computed-goto](https://hackmd.io/@sysprog/c-control-flow#switch-%E8%83%8C%E5%BE%8C%E7%9A%84-goto-%E5%92%8C%E5%AF%A6%E4%BD%9C%E8%80%83%E9%87%8F) technique.
## Preliminary introduction of basic block
We utilize `perf` for additional performance analysis and discover that a substantial amount of time is spent on decoding instructions within the `rv_step` function. Consequently, we import intermediate representations (IR) and the corresponding basic block to accelerate the process of instruction decoding.
* CoreMark performance analysis result with `perf` for commit [3305664](https://github.com/sysprog21/rv32emu/commit/33056640b488b6590dcc5641e53a46ea22751e85)
```
30.21% rv32emu rv32emu [.] rv_step
12.71% rv32emu rv32emu [.] op_op_imm
10.33% rv32emu rv32emu [.] op_branch
10.10% rv32emu rv32emu [.] memory_ifetch
7.59% rv32emu rv32emu [.] op_load
6.79% rv32emu rv32emu [.] rv_userdata
5.51% rv32emu rv32emu [.] on_mem_ifetch
```
A basic block consists of one or more instructions, and we efficiently manage blocks using hash tables. During the decode stage, a new block is allocated with a capacity of up to 1024 instructions. These instructions are decoded as intermediate representations (IR) and stored within the block. We continue translating instructions until the block reaches its maximum capacity or encounters a branch instruction.
In the execution stage, the emulator executes the instructions within the basic block sequentially. If the program reaches the program counter where the basic block has already been translated, we can reuse the basic block without repeatedly decoding the instructions.
The call graph below clearly illustrates that `Dhrystone` frequently invokes a specific function. This observation suggests a higher likelihood of reusing the basic blocks within these functions. Running the `Coremark` and `Dhrystone` benchmarks has yielded promising results, showing that importing these basic blocks significantly reduces the overhead of instruction decoding in such scenarios.
### Call graph
* Dhrystone
```graphviz
// Call graph generator. Dec 26 2022 build, written by Kuoping Hsu, MIT license
// https://github.com/kuopinghsu/callgraph-gen
digraph callgraph {
graph [ rankdir = "LR" ];
node [ fontsize = "16" shape = "record" style = "filled" ];
edge [ color = "black" ];
_start_0 [ label = "_start|PROG: 292|STACK: 0|FRAME: 0" fillcolor = "white"];
_start_0 -> _init_0 [ color = "red" ];
_init_0 [ label = "_init|PROG: 172|STACK: 176|FRAME: 176" fillcolor = "white"];
_init_0 -> memcpy_0;
_init_0 -> memset_0;
_init_0 -> thread_entry_0;
_init_0 -> main_0 [ color = "red" ];
_init_0 -> tohost_exit_0;
_init_0 -> sprintf_0;
_init_0 -> printstr_1;
trap_entry_0 [ label = "trap_entry|PROG: 158|STACK: 272|FRAME: 272" fillcolor = "white"];
trap_entry_0 -> handle_trap_0 [ color = "red" ];
handle_trap_0 [ label = "handle_trap|PROG: 16|STACK: 0|FRAME: 272" fillcolor = "white"];
Proc_2_0 [ label = "Proc_2|PROG: 38|STACK: 0|FRAME: 528" fillcolor = "white"];
Proc_3_0 [ label = "Proc_3|PROG: 32|STACK: 0|FRAME: 560" fillcolor = "white"];
Proc_3_0 -> Proc_7_1;
Proc_7_0 [ label = "Proc_7|PROG: 8|STACK: 0|FRAME: 528" fillcolor = "white"];
Proc_1_0 [ label = "Proc_1|PROG: 168|STACK: 32|FRAME: 560" fillcolor = "white"];
Proc_1_0 -> Proc_3_0;
Proc_1_0 -> Proc_6_0;
Proc_1_0 -> Proc_7_2;
Proc_6_0 [ label = "Proc_6|PROG: 112|STACK: 32|FRAME: 592" fillcolor = "white"];
Proc_6_0 -> Func_3_0;
Proc_4_0 [ label = "Proc_4|PROG: 44|STACK: 0|FRAME: 528" fillcolor = "white"];
Proc_5_0 [ label = "Proc_5|PROG: 22|STACK: 0|FRAME: 528" fillcolor = "white"];
Proc_8_0 [ label = "Proc_8|PROG: 74|STACK: 0|FRAME: 528" fillcolor = "white"];
Func_1_0 [ label = "Func_1|PROG: 28|STACK: 0|FRAME: 560" fillcolor = "white"];
Func_2_0 [ label = "Func_2|PROG: 66|STACK: 32|FRAME: 560" fillcolor = "white"];
Func_2_0 -> Func_1_0;
Func_2_0 -> strcmp_0;
strcmp_0 [ label = "strcmp|PROG: 30|STACK: 0|FRAME: 560" fillcolor = "white"];
Func_3_0 [ label = "Func_3|PROG: 8|STACK: 0|FRAME: 592" fillcolor = "white"];
debug_printf_0 [ label = "debug_printf|PROG: 20|STACK: 64|FRAME: 64" fillcolor = "white"];
vprintfmt_0 [ label = "vprintfmt|PROG: 744|STACK: 352|FRAME: 976" fillcolor = "white"];
vprintfmt_0 -> __unknown___0 [ color = "red" ];
__unknown___0 [ label = "__unknown__|PROG: 0|STACK: 0|FRAME: 976" fillcolor = "lightgray"];
sprintf_putch.3143_0 [ label = "sprintf_putch.3143|PROG: 14|STACK: 0|FRAME: 0" fillcolor = "white"];
putchar_0 [ label = "putchar|PROG: 118|STACK: 112|FRAME: 112" fillcolor = "white"];
setStats_0 [ label = "setStats|PROG: 78|STACK: 0|FRAME: 528" fillcolor = "white"];
tohost_exit_0 [ label = "tohost_exit|PROG: 18|STACK: 0|FRAME: 176" fillcolor = "white"];
exit_0 [ label = "exit|PROG: 8|STACK: 16|FRAME: 16" fillcolor = "white"];
exit_0 -> tohost_exit_1 [ color = "red" ];
abort_0 [ label = "abort|PROG: 14|STACK: 0|FRAME: 0" fillcolor = "white"];
printstr_0 [ label = "printstr|PROG: 88|STACK: 112|FRAME: 640" fillcolor = "white"];
thread_entry_0 [ label = "thread_entry|PROG: 6|STACK: 0|FRAME: 176" fillcolor = "white"];
printhex_0 [ label = "printhex|PROG: 70|STACK: 48|FRAME: 48" fillcolor = "white"];
printhex_0 -> printstr_2 [ color = "red" ];
printf_0 [ label = "printf|PROG: 50|STACK: 96|FRAME: 624" fillcolor = "white"];
printf_0 -> vprintfmt_0 [ color = "red" ];
sprintf_0 [ label = "sprintf|PROG: 64|STACK: 96|FRAME: 272" fillcolor = "white"];
sprintf_0 -> vprintfmt_1;
memcpy_0 [ label = "memcpy|PROG: 66|STACK: 0|FRAME: 176" fillcolor = "white"];
memset_0 [ label = "memset|PROG: 70|STACK: 0|FRAME: 176" fillcolor = "white"];
main_0 [ label = "main|PROG: 1780|STACK: 352|FRAME: 528" fillcolor = "white"];
main_0 -> printf_0 [ color = "red" ];
main_0 -> setStats_0;
main_0 -> Proc_5_0;
main_0 -> Proc_4_0;
main_0 -> Func_2_0;
main_0 -> Proc_7_0;
main_0 -> Proc_8_0;
main_0 -> Proc_1_0;
main_0 -> Func_1_1;
main_0 -> Proc_6_1;
main_0 -> Proc_2_0;
main_0 -> printstr_0;
strlen_0 [ label = "strlen|PROG: 26|STACK: 0|FRAME: 0" fillcolor = "white"];
strnlen_0 [ label = "strnlen|PROG: 44|STACK: 0|FRAME: 0" fillcolor = "white"];
strcpy_0 [ label = "strcpy|PROG: 18|STACK: 0|FRAME: 0" fillcolor = "white"];
atol_0 [ label = "atol|PROG: 98|STACK: 0|FRAME: 0" fillcolor = "white"];
Proc_7_1 [ label = "Proc_7|PROG: 8|STACK: 0|FRAME: 560" fillcolor = "white"];
Proc_7_2 [ label = "Proc_7|PROG: 8|STACK: 0|FRAME: 560" fillcolor = "white"];
Func_1_1 [ label = "Func_1|PROG: 28|STACK: 0|FRAME: 528" fillcolor = "white"];
Proc_6_1 [ label = "Proc_6|PROG: 112|STACK: 32|FRAME: 560" fillcolor = "white"];
Proc_6_1 -> Func_3_1;
Func_3_1 [ label = "Func_3|PROG: 8|STACK: 0|FRAME: 560" fillcolor = "white"];
vprintfmt_1 [ label = "vprintfmt|PROG: 744|STACK: 352|FRAME: 624" fillcolor = "white"];
vprintfmt_1 -> __unknown___1;
__unknown___1 [ label = "__unknown__|PROG: 0|STACK: 0|FRAME: 624" fillcolor = "lightgray"];
printstr_1 [ label = "printstr|PROG: 88|STACK: 112|FRAME: 288" fillcolor = "white"];
tohost_exit_1 [ label = "tohost_exit|PROG: 18|STACK: 0|FRAME: 16" fillcolor = "white"];
printstr_2 [ label = "printstr|PROG: 88|STACK: 112|FRAME: 160" fillcolor = "white"];
}
```
### Benchmark result
* Mircoprocessor: Core i7-8700
| Compiler | w/ computed-goto | CoreMark (Iterations/Sec) |Dhrystone (DMIPS)|
| -------- | -------- | -------- |-------- |
| clang-15 | NO | 850.4265125 | 665.2 |
| clang-15 | YES | 854.0181165 | 741.6 |
| gcc-12 | NO | 858.7766405 | 759.2 |
| gcc-12 | YES | 870.4854225 | 799.6 |
> DMIPS: Dhrystone MIPS (Million Instructions per Second)
## Use [tail-call optimization(TCO)](https://en.wikipedia.org/wiki/Tail_call) to lower overhead of instruction dispatching
### Definition of instruction dispatching
Dispatch is the mechanism used by a high level language emulator to transfer control from the code to emulate one instruction to the next.
There are two types of techique using in `rv32emu`.
### Switch dispatch
Each instruction is implemented by a case in the switch statement. Instruction bodies are simply the compiler-generated code for each case.
As [Re: Suggestions on implementing an efficient instruction set simulator in LuaJIT2](http://lua-users.org/lists/lua-l/2011-02/msg00742.html) mentioned, the control-flow graph of an interpreter with C switch-based dispatch looks like this:
```
.------.
V |
| |
L | L = instruction load
D | D = instruction dispatch
/ /|\ \ |
/ / | \ \ |
C C C C C | C = operand decode
X X X X X | X = instruction execution
\ \ | / / |
\ \|/ / |
| |
V |
`------'
```
Each individual instruction execution looks like this:
```
......V......
:X | :
: |\ \ :
: F S S : F = fast path
: |/ / : S = slow paths
: | :
:.....V.....:
```
The drawback of switch dispatch is the compiler would get into some troubles:
* Diamond-shaped control-flow is known to be the worst-case scenario for most optimizations and for register alloction. Nested diamond-shaped control-flow is even worse.
* The compiler doesn't have enough hints to see what the fast paths and what the slow paths are. Even if you'd be able to tell it, it still sees this as a single giant control-flow graph. Anything in this loop could potentially influence anything else, so almost nothing can be hoisted or eliminated. The slow paths kill the opportunities for the fast paths and the complex instructions kill the opportunities for the simpler instructions. As a result, all instruction emulation would be trnslated to a big loop.
* The standard register allocation heuristics fail at this scale, so the compiler has trouble keeping important variables in registers.
### Direct threading (computed-goto)
We can use a direct or indirect-threaded interpreter even in C, e.g. with the `computed-goto`. The control-flow graph of an interpreter with computed-goto looks like this:
```
* * * * *
| | | | |
C C C C C C = operand decode
X X X X X X = instruction execution
L L L L L L = next instruction load
D D D D D D = next instruction dispatch
| | | | |
V V V V V
```
Why is direct threading more efficient? As stated in [Computed goto for efficient dispatch tables](https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables), it's faster for two reasons:
#### 1. The switch does a bit more per iteration because of bounds checking.
The difference between `direct threading` and `switch dispatch` is the "bounds check" step of the switch. Why is it required? the compiler is forced to generate the bounds check for the switch statement to conform to the C standard. Quoting from C99:
>If no converted case constant expression matches and there is no default label, no part of the switch body is executed.
Therefore, the standard forces the compiler to generate "safe" code for the switch. Safety, as usual, has cost, so the switch version ends up doing a bit more per loop iteration.
#### 2. The effects of hardware branch prediction.
Modern CPUs have deep instruction pipelines and go to great lengths ensuring that the pipelines stay as full as possible. One thing that can ruin a pipeline's day is a branch, which is why branch predictors exist. Since a CPU can easily pre-fetch instructions from the branch's target, successful prediction can make the pre-fetched instructions valid and there is no need to fully flush the pipeline.
Since the switch statement has a single "master jump" that dispatches all opcodes, predicting its destination is quite difficult. On the other hand, the computed-goto statement is compiled into a separate jump per opcode, so given that instructions often come in pairs, it's much easier for the branch predictor to "home in" on the various jumps correctly.
Think about it this way: for each jump, the branch predictor keeps a prediction of where it will jump next. If there's a jump per opcode, this is equivalent to predicting the second opcode in an opcode pair, which actually has some chance of success from time to time. On the other hand, if there's just a single jump, the prediction is shared between all opcodes and they keep stepping on each other's toes with each iteration.
### Subroutine threading with tail-call optimization
Unfortunately, despite using direct threading for dispatching, we encountered two issues. The first issue was the significant time taken to calculate the jumping address. To address this, we refered a solution from the implementation of `wasm3`, a webassembly interpreter that utilizes subroutine threading with tail-call optimization for instruction dispatch. Consequently, we followed their approach by separating all instruction emulations and modifying the `rv_insn_t` struct. This modification allowed us to directly assign the instruction emulation function to the intermediate representation (IR) by adding a new member named `impl`.
The second issue is that we must create a function stack frame for each instruction emulation, which TCO can handle. TCO allows any instruction in a basic block to emulate a function by using the same function stack frame, which saves the overhead of creating function stack frames.
To fulfill the tail-call optimization requirement, we need to transform the emulation loop into a recursive version. To achieve this, we introduce a variable `tailcall` to the `rv_insn_t` struct, which helps us determine if the basic block has reached its termination. Consequently, we can rewrite the emulate function as a self-recursive function, utilizing this variable.
As shown in the statsitic below, `Coremark` and `Dhrystone` benchmarks now produces faster results than it did previously.
### Benchmark result
* Mircoprocessor: Core i7-8700
| Compiler | w/ TCO | CoreMark (Iterations/Sec) | Dhrystone (DMIPS)|
| -------- | -------- | -------- |-------- |
| clang-15 | NO | 944.9438388 | 841.4 |
| clang-15 | YES | 983.6298667 | 873.4 |
| gcc-12 | NO | 952.4544735 | 905.8 |
| gcc-12 | YES | 978.6633022 | 923.4 |
> DMIPS: Dhrystone MIPS (Million Instructions per Second)
## TODO: JIT compilation experiments
To optimize performance with JIT, it's crucial to have an adequate number of instructions in a basic block. If the block contains too few instructions, any potential benefits may be lost due to the costs of code generation and compilation. To overcome this limitation, we can consider using an EBB strategy to extend our basic block and improve performance.
### Extended Basic Block(EBB)
* A collection of basic blocks
* Tail call optimizations are more effective on extended basic blocks
To implement EBB, we employ two main strategies: recursive jump translation and retroactive block chaining.
1. recursive jump translation
Target addresses of jal, cj, and cjal instructions are determined after instruction decoding, allowing for recursive translation of jump target instructions into the same basic block.
2. retroactive block chaining
The previous block is chained to the current block after emulation using the obtained branch target during emulation.
![](https://hackmd.io/_uploads/H1TIVSmUh.png)
### The result of EBB
After analyzing the number of instructions in a basic block (BB) or extended basic block (EBB) while running the CoreMark benchmark, we found that the EBB strategy significantly increases the number of instructions from less than 25 to over 300.
![](https://hackmd.io/_uploads/SkhSrBXL3.png)
### Benchmark result of EBB
* Mircoprocessor: Core i7-8700
|Test| commit 1c11b39 |EBB |Speedup|
|--------| -------- | -------- | -------- |
| CoreMark | 1155.174 (Iterations/Sec)| 1351.065 (Iterations/Sec) |+16.64%|
| dhrystone | 1017 DMIPS |1073 DMIPS |+5.5%|
## Baseline JIT compiler
Our objective with the baseline JIT compiler is to integrate it seamlessly with minimal effort. The performance of this version will serve as the foundation for future optimizations.
### The workflow of JIT compiler
In the case of the interpreter, we already have the necessary components, `Extended Basic Block` and `LFU Cache`, in place for managing the basic blocks. Therefore, we can proceed with adding three new components - `Code Generator`, `Compiler`, and `Code Cache` - without requiring any modifications to the existing interpreter architecture.
![](https://hackmd.io/_uploads/ByFxL1g_2.png)
### What are the JIT compiler optimizes
#### 1. Lower the emulation loop overhead and only one function call
In the case of the interpreter, each instruction requires a separate function call for emulation. However, when integrating the JIT compiler, we can reduce this to just one function call to emulate all instructions within an Extended Basic Block (EBB).
#### 2. Immediate and register number are constant
Prior to code generation, all information of the instructions has been decoded.
#### 3. Cache friendly
Most of the emulation functions are located near the subsequent emulation functions.
### Benchmark Result
We integrate two compilers, `clang` and `mir`, for our experiments. Each of these compilers has its own advantages and disadvantages.
The `clang` compiler provides stronger optimization capabilities, but it incurs a significant launch time. On the other hand, the `mir` compiler has a shorter launch time, but its optimization is not as effective as `clang`.
#### clang
![](https://hackmd.io/_uploads/SJDVCyeuh.png)
#### mir
![](https://hackmd.io/_uploads/H1AURkg_3.png)