Try   HackMD

Reduce rv32emu Dispatch Overhead

contributed by < qwe661234 >

Repository: rv32emu

Environment setup

  1. Clone the repository
    ​​​​$ git clone https://github.com/sysprog21/rv32emu.git
    
  2. Follow the instructions of rv32emu

Baseline: commit 3305664

The commit 3305664 is regarded as the baseline for performance evaluation and enhancements. The RISC-V instruction set compliance with RV32I, RV32M, RV32C, and privilege can be passed by the baseline. We compare the performance of the emulator with and without the computed-goto technique using the benchmark tool CoreMark and Dhrystone.

Benchmarking with CoreMark and Dhrystone

Ensure that rv32emu binary is built properly.

$ cd rv32emu
$ make clean all
$ build/rv32emu build/coremark.elf
$ build/rv32emu build/dhrystone.elf

Benchmark result

  • Mircoprocessor: Core i7-8700
Compiler w/ computed-goto CoreMark (Iterations/Sec) Dhrystone (DIMPS)
clang-15 NO 360.3897405 325.0
clang-15 YES 434.6317140 391.6
gcc-12 NO 380.4077495 328.0
gcc-12 YES 444.7024895 415.8

DMIPS: Dhrystone MIPS (Million Instructions per Second)

According to benchmark result, emulator has better result with computed-goto technique.

Preliminary introduction of basic block

We use perf to perform additional performance analysis and discover that we spend a significant amount of time decoding instructions in function rv_step. As a result, we import IR and the basic block in order to speed up instruction decoding.

  • CoreMark performance analysis result with perf for commit 3305664
  30.21%  rv32emu  rv32emu           [.] rv_step
  12.71%  rv32emu  rv32emu           [.] op_op_imm
  10.33%  rv32emu  rv32emu           [.] op_branch
  10.10%  rv32emu  rv32emu           [.] memory_ifetch
   7.59%  rv32emu  rv32emu           [.] op_load
   6.79%  rv32emu  rv32emu           [.] rv_userdata
   5.51%  rv32emu  rv32emu           [.] on_mem_ifetch

A basic block contains one or more instructions, and we manage blocks efficiently using hash tables and block prediction. In the decode stage, a new block is allocated that can hold up to 1024 instructions by default, and the instructions are decoded as IR into the block until it is full or the latest instruction is a branch instruction, after which it is placed in the block map.

In execution stage, the emulator executes the instructions in the basic block. If the program executes the same instruction sequence in a basic block, we can reuse the basic block without decoding the instruction repeatly.

We can see from the call graph below that CoreMark and Dhrystone frequently invoke the specific function, it is more likely that we can reuse the basic block of these functions. The results of running the coremark and dhrystone benchmarks show that importing basic blocks significantly reduces the overhead of instruction decoding in such a situation.

Call graph

  • CoreMark






callgraph



text_0

text

PROG: 92220

STACK: 9648



__unknown___0

__unknown__

PROG: 0

STACK: 0



text_0->__unknown___0





  • Dhrystone






callgraph



_start_0

_start

PROG: 292

STACK: 0

FRAME: 0



_init_0

_init

PROG: 172

STACK: 176

FRAME: 176



_start_0->_init_0





memcpy_0

memcpy

PROG: 66

STACK: 0

FRAME: 176



_init_0->memcpy_0





memset_0

memset

PROG: 70

STACK: 0

FRAME: 176



_init_0->memset_0





thread_entry_0

thread_entry

PROG: 6

STACK: 0

FRAME: 176



_init_0->thread_entry_0





main_0

main

PROG: 1780

STACK: 352

FRAME: 528



_init_0->main_0





tohost_exit_0

tohost_exit

PROG: 18

STACK: 0

FRAME: 176



_init_0->tohost_exit_0





sprintf_0

sprintf

PROG: 64

STACK: 96

FRAME: 272



_init_0->sprintf_0





printstr_1

printstr

PROG: 88

STACK: 112

FRAME: 288



_init_0->printstr_1





Proc_2_0

Proc_2

PROG: 38

STACK: 0

FRAME: 528



main_0->Proc_2_0





Proc_7_0

Proc_7

PROG: 8

STACK: 0

FRAME: 528



main_0->Proc_7_0





Proc_1_0

Proc_1

PROG: 168

STACK: 32

FRAME: 560



main_0->Proc_1_0





Proc_4_0

Proc_4

PROG: 44

STACK: 0

FRAME: 528



main_0->Proc_4_0





Proc_5_0

Proc_5

PROG: 22

STACK: 0

FRAME: 528



main_0->Proc_5_0





Proc_8_0

Proc_8

PROG: 74

STACK: 0

FRAME: 528



main_0->Proc_8_0





Func_2_0

Func_2

PROG: 66

STACK: 32

FRAME: 560



main_0->Func_2_0





setStats_0

setStats

PROG: 78

STACK: 0

FRAME: 528



main_0->setStats_0





printstr_0

printstr

PROG: 88

STACK: 112

FRAME: 640



main_0->printstr_0





printf_0

printf

PROG: 50

STACK: 96

FRAME: 624



main_0->printf_0





Func_1_1

Func_1

PROG: 28

STACK: 0

FRAME: 528



main_0->Func_1_1





Proc_6_1

Proc_6

PROG: 112

STACK: 32

FRAME: 560



main_0->Proc_6_1





vprintfmt_1

vprintfmt

PROG: 744

STACK: 352

FRAME: 624



sprintf_0->vprintfmt_1





trap_entry_0

trap_entry

PROG: 158

STACK: 272

FRAME: 272



handle_trap_0

handle_trap

PROG: 16

STACK: 0

FRAME: 272



trap_entry_0->handle_trap_0





Proc_3_0

Proc_3

PROG: 32

STACK: 0

FRAME: 560



Proc_7_1

Proc_7

PROG: 8

STACK: 0

FRAME: 560



Proc_3_0->Proc_7_1





Proc_1_0->Proc_3_0





Proc_6_0

Proc_6

PROG: 112

STACK: 32

FRAME: 592



Proc_1_0->Proc_6_0





Proc_7_2

Proc_7

PROG: 8

STACK: 0

FRAME: 560



Proc_1_0->Proc_7_2





Func_3_0

Func_3

PROG: 8

STACK: 0

FRAME: 592



Proc_6_0->Func_3_0





Func_1_0

Func_1

PROG: 28

STACK: 0

FRAME: 560



Func_2_0->Func_1_0





strcmp_0

strcmp

PROG: 30

STACK: 0

FRAME: 560



Func_2_0->strcmp_0





debug_printf_0

debug_printf

PROG: 20

STACK: 64

FRAME: 64



vprintfmt_0

vprintfmt

PROG: 744

STACK: 352

FRAME: 976



__unknown___0

__unknown__

PROG: 0

STACK: 0

FRAME: 976



vprintfmt_0->__unknown___0





sprintf_putch

sprintf_putch



.3143

.3143



_0

sprintf_putch.3143

PROG: 14

STACK: 0

FRAME: 0



putchar_0

putchar

PROG: 118

STACK: 112

FRAME: 112



exit_0

exit

PROG: 8

STACK: 16

FRAME: 16



tohost_exit_1

tohost_exit

PROG: 18

STACK: 0

FRAME: 16



exit_0->tohost_exit_1





abort_0

abort

PROG: 14

STACK: 0

FRAME: 0



printhex_0

printhex

PROG: 70

STACK: 48

FRAME: 48



printstr_2

printstr

PROG: 88

STACK: 112

FRAME: 160



printhex_0->printstr_2





printf_0->vprintfmt_0





__unknown___1

__unknown__

PROG: 0

STACK: 0

FRAME: 624



vprintfmt_1->__unknown___1





Func_3_1

Func_3

PROG: 8

STACK: 0

FRAME: 560



Proc_6_1->Func_3_1





strlen_0

strlen

PROG: 26

STACK: 0

FRAME: 0



strnlen_0

strnlen

PROG: 44

STACK: 0

FRAME: 0



strcpy_0

strcpy

PROG: 18

STACK: 0

FRAME: 0



atol_0

atol

PROG: 98

STACK: 0

FRAME: 0



Benchmark result

  • Mircoprocessor: Core i7-8700
Compiler w/ computed-goto CoreMark (Iterations/Sec) Dhrystone (DIMPS)
clang-15 NO 850.4265125 665.2
clang-15 YES 854.0181165 741.6
gcc-12 NO 858.7766405 759.2
gcc-12 YES 870.4854225 799.6

DMIPS: Dhrystone MIPS (Million Instructions per Second)

Use tail-call optimization(TCO) to lower overhead of instruction dispatching

Definition of instruction dispatching

Dispatch is the mechanism used by a high level language emulator to transfer control from the code to emulate one instruction to the next.

There are two types of techique using in rv32emu.

Switch dispatch

Each instruction is implemented by a case in the switch statement. Instruction bodies are simply the compiler-generated code for each case.

As Re: Suggestions on implementing an efficient instruction set simulator in LuaJIT2 mentioned, the control-flow graph of an interpreter with C switch-based dispatch looks like this:

      .------.
      V      |
      |      |
      L      |  L = instruction load
      D      |  D = instruction dispatch
   / /|\ \   |
  / / | \ \  |
  C C C C C  |  C = operand decode
  X X X X X  |  X = instruction execution
  \ \ | / /  |
   \ \|/ /   |
      |      |
      V      |
      `------'

Each individual instruction execution looks like this:

  ......V......
  :X    |     :
  :     |\ \  :
  :     F S S :  F = fast path
  :     |/ /  :  S = slow paths
  :     |     :
  :.....V.....:

The drawback of switch dispatch is the compiler would get into some troubles:

  • Diamond-shaped control-flow is known to be the worst-case scenario for most optimizations and for register alloction. Nested diamond-shaped control-flow is even worse.
  • The compiler doesn't have enough hints to see what the fast paths and what the slow paths are. Even if you'd be able to tell it, it still sees this as a single giant control-flow graph. Anything in this loop could potentially influence anything else, so almost nothing can be hoisted or eliminated. The slow paths kill the opportunities for the fast paths and the complex instructions kill the opportunities for the simpler instructions. As a result, all instruction emulation would be trnslated to a big loop.
  • The standard register allocation heuristics fail at this scale, so the compiler has trouble keeping important variables in registers.

Direct threading (computed-goto)

We can use a direct or indirect-threaded interpreter even in C, e.g. with the computed-goto. The control-flow graph of an interpreter with computed-goto looks like this:

  * * * * *
  | | | | |
  C C C C C    C = operand decode
  X X X X X    X = instruction execution
  L L L L L    L = next instruction load
  D D D D D    D = next instruction dispatch
  | | | | |
  V V V V V

Why is direct threading more efficient? As stated in Computed goto for efficient dispatch tables, it's faster for two reasons:

1. The switch does a bit more per iteration because of bounds checking.

The difference between direct threading and switch dispatch is the "bounds check" step of the switch. Why is it required? the compiler is forced to generate the bounds check for the switch statement to conform to the C standard. Quoting from C99:

If no converted case constant expression matches and there is no default label, no part of the switch body is executed.

Therefore, the standard forces the compiler to generate "safe" code for the switch. Safety, as usual, has cost, so the switch version ends up doing a bit more per loop iteration.

2. The effects of hardware branch prediction.

Modern CPUs have deep instruction pipelines and go to great lengths ensuring that the pipelines stay as full as possible. One thing that can ruin a pipeline's day is a branch, which is why branch predictors exist. Since a CPU can easily pre-fetch instructions from the branch's target, successful prediction can make the pre-fetched instructions valid and there is no need to fully flush the pipeline.

Since the switch statement has a single "master jump" that dispatches all opcodes, predicting its destination is quite difficult. On the other hand, the computed-goto statement is compiled into a separate jump per opcode, so given that instructions often come in pairs, it's much easier for the branch predictor to "home in" on the various jumps correctly.

Think about it this way: for each jump, the branch predictor keeps a prediction of where it will jump next. If there's a jump per opcode, this is equivalent to predicting the second opcode in an opcode pair, which actually has some chance of success from time to time. On the other hand, if there's just a single jump, the prediction is shared between all opcodes and they keep stepping on each other's toes with each iteration.

Subroutine threading with tail-call optimization

Unfortunately, even though we used direct threading to dispatch, we still have two issues. The first issue is that direct threading took a long time to calculate the jumping address. We refer to wasm3's implementation to solve this issue. Wasm3 is a webassembly interpreter that uses subroutine threading with tail-call optimization to dispatch instructions. As a result, we follow it to separate all instruction emulations and modify struct rv_insn_t so that we can directly assign instruction emulation to IR by adding member impl.

The second issue is that we must create a function stack frame for each instruction emulation, which TCO can handle. TCO allows any instruction in a basic block to emulate a function by using the same function stack frame, which saves the overhead of creating function stack frames.

To meet the tail-call optimization requirement, we must convert the function emulate into a recursive version. To accomplish this, we add a variable tailcall to the struct rv_insn_t to assist us in determining whether or not the basic block is terminated. As a result, we can rewrite function emulate into a self-recursive function using this variable.

Running coremark and dhrystone benchmarks now produces faster results than it did previously, and the test results show below.

Benchmark result

  • Mircoprocessor: Core i7-8700
Compiler w/ TCO CoreMark (Iterations/Sec) Dhrystone (DMIPS)
clang-15 NO 944.9438388 841.4
clang-15 YES 983.6298667 873.4
gcc-12 NO 952.4544735 905.8
gcc-12 YES 978.6633022 923.4

DMIPS: Dhrystone MIPS (Million Instructions per Second)