Improve rv32emu performance

蔡文賓

Introduction

Background

Chen et al. (2024) report that tiered JIT compilation boosts simulation speed and efficiency, most noticeably for long, computation-intensive tasks. However, significant challenges remain in achieving good performance across varied workloads while keeping execution fast, given the complexity and overhead of advanced compilation techniques. Despite these issues, the tiered JIT approach used in rv32emu provides a solid base for further research on and improvement of high-performance simulation tools.

The rv32emu interpreter, however, lags behind that of libriscv, which points to a significant opportunity for improvement. This project aims to optimize the rv32emu interpreter by studying the optimization techniques used in libriscv and applying them to rv32emu.

Benchmarking separately on rv32emu and libriscv

Hardware information

$ lscpu | grep Model
Model name:                           Intel(R) Core(TM) i7-14700K
Model:                                183

After checking the CPU frequency scaling, I tried to pin the CPU at its maximum frequency. However, most methods failed with an Operation not permitted error, possibly because of Secure Boot or kernel lockdown.
Nonetheless, according to the documentation section on the Intel performance and energy bias hint, Intel provides an interface called EPB (Energy Performance Bias) that lets users specify the desired power-performance tradeoff. As the documentation states, a value of 0 corresponds to maximum performance, and the default is 6 (normal). The following command sets the EPB to performance mode.

$ sudo x86_energy_perf_policy --epb 0
$ sudo x86_energy_perf_policy
cpu0: EPB 0
cpu0: HWP_REQ: min 11 max 70 des 0 epp 128 window 0x0 (0*10^0us) use_pkg 0
cpu0: HWP_CAP: low 1 eff 21 guar 44 high 77

Turbo Boost also needs to be disabled.

$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

If the BIOS settings can be adjusted appropriately, the CPU frequency can be constrained even more strictly, so that the processor runs at a fixed clock speed regardless of workload.
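Before each run it can help to confirm that both settings are actually in effect. The following is a minimal sketch that reads them back from sysfs. The `no_turbo` path is the one written above; the per-CPU `power/energy_perf_bias` attribute, and the helper names `read_cpu_settings` and `ready_for_benchmark`, are assumptions made for illustration and may not apply to every kernel or CPU.

```python
from pathlib import Path

# Assumed sysfs root; the test below substitutes a temporary directory.
SYSFS_CPU = Path("/sys/devices/system/cpu")


def read_cpu_settings(root=SYSFS_CPU, cpu=0):
    """Return (epb, no_turbo) as ints, or None where a file cannot be read."""

    def read_int(path):
        try:
            return int(path.read_text().strip())
        except (OSError, ValueError):
            return None

    root = Path(root)
    # Assumed location of the EPB attribute for one logical CPU.
    epb = read_int(root / f"cpu{cpu}" / "power" / "energy_perf_bias")
    # Same file that the `echo 1 > .../no_turbo` command writes.
    no_turbo = read_int(root / "intel_pstate" / "no_turbo")
    return epb, no_turbo


def ready_for_benchmark(epb, no_turbo):
    """True only when EPB is 0 (performance) and turbo boost is disabled."""
    return epb == 0 and no_turbo == 1


if __name__ == "__main__":
    epb, no_turbo = read_cpu_settings()
    print(f"EPB={epb} no_turbo={no_turbo} "
          f"ready={ready_for_benchmark(epb, no_turbo)}")
```

Only cpu0 is checked here; on a hybrid CPU it may be worth looping over every logical CPU that the benchmark can be scheduled on.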

  • TODO: Redo benchmarking

libriscv and rv32emu must be built first.

Build libriscv

$ git clone https://github.com/libriscv/libriscv
$ cd libriscv/emulator
$ ./build.sh

Build rv32emu

$ git clone https://github.com/sysprog21/rv32emu
$ cd rv32emu
$ make

Then I ran CoreMark on rv32emu and on libriscv separately and recorded the results.

Run coremark

$ ./build/rv32emu ./build/riscv32/coremark

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 42008898
Total time (secs): 42.008898
Iterations/Sec   : 1904.358453
Iterations       : 80000
Compiler version : GCC14.2.0
Compiler flags   : -O2
Memory location  : Main memory (heap)
Correct operation validated.
CoreMark 1.0 : 1904.358453 / GCC14.2.0 -O2  / Main memory (heap)
inferior exit code 0

$ ./emulator/rvlinux ~/project/rv32emu/build/riscv32/coremark

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 20029360
Total time (secs): 20.029360
Iterations/Sec   : 3994.136607
Iterations       : 80000
Compiler version : GCC14.2.0
Compiler flags   : -O2
Memory location  : Main memory (heap)
Correct operation validated.
CoreMark 1.0 : 3994.136607 / GCC14.2.0 -O2  / Main memory (heap)
>>> Program exited, exit code = 0 (0x0)
Runtime: 20030.141ms   (Use --accurate for instruction counting)
Pages in use: 27 (108 kB virtual memory, total 195 kB)
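As a quick sanity check on the two logs above, the reported scores can be recomputed from the iteration count and the total time, and the gap between the two emulators expressed as a ratio. This is only a worked calculation on the numbers printed above; the variable names are illustrative.

```python
# Values copied from the two CoreMark logs above.
ITERATIONS = 80000
rv32emu_secs = 42.008898
libriscv_secs = 20.029360

# CoreMark's score is iterations divided by total time.
rv32emu_rate = ITERATIONS / rv32emu_secs
libriscv_rate = ITERATIONS / libriscv_secs

# How many times faster libriscv finished the same workload.
speedup = libriscv_rate / rv32emu_rate

print(f"rv32emu : {rv32emu_rate:.1f} iterations/s")
print(f"libriscv: {libriscv_rate:.1f} iterations/s")
print(f"libriscv / rv32emu = {speedup:.2f}x")
```

In this single run, libriscv completes CoreMark roughly 2.1 times faster than rv32emu in interpreter mode.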

After completing the remaining tests, I used gnuplot to plot the normalized elapsed time from the recorded data, as shown below.

[Figure: normalized elapsed time of rv32emu and libriscv in interpreter mode; image unavailable]

The preliminary benchmarking results, as shown in the chart, indicate that when using the interpreter, the execution time for libriscv is shorter than that for rv32emu in all cases.
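The normalization behind such a chart can be sketched as follows. Each benchmark's elapsed time is divided by the time of a baseline emulator, and the result is written as whitespace-separated rows that gnuplot can read. Choosing libriscv as the baseline, the row layout, and the function names are assumptions for illustration; only the CoreMark times come from the logs above.

```python
def normalize(times, baseline):
    """Divide every elapsed time by the baseline emulator's time."""
    base = times[baseline]
    return {name: t / base for name, t in times.items()}


def gnuplot_rows(results, emulators, baseline):
    """Build data-file rows: one benchmark per row, one column per emulator."""
    # Lines starting with '#' are treated as comments in gnuplot data files.
    rows = ["# benchmark " + " ".join(emulators)]
    for bench, times in results.items():
        norm = normalize(times, baseline)
        values = " ".join(f"{norm[e]:.3f}" for e in emulators)
        rows.append(f"{bench} {values}")
    return rows


# Elapsed seconds taken from the CoreMark runs recorded above.
results = {"coremark": {"rv32emu": 42.008898, "libriscv": 20.029360}}

for row in gnuplot_rows(results, ["libriscv", "rv32emu"], baseline="libriscv"):
    print(row)
```

With libriscv as the baseline, a value above 1.0 in the rv32emu column means rv32emu took longer on that benchmark.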

Replicate the experiment from Issue #288

This section compares the performance of libriscv and rv32emu in interpreter mode and in binary translation mode, evaluating each mode separately.

To enable binary translation, both libriscv and rv32emu must be built with specific options.

For libriscv

$ ./build.sh -b

For rv32emu

$ make ENABLE_JIT=1

I used perf stat to record the execution time of libriscv and rv32emu for different tests and used gnuplot to create a chart based on the results.

$ perf stat ./libriscv/emulator/rvlinux ./rv32emu/build/riscv32/dhrystone 
$ perf stat ./rv32emu/build/rv32emu ./rv32emu/build/riscv32/dhrystone
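To collect the timings for plotting, the elapsed time can be extracted from the summary that perf stat prints. The sketch below assumes the default human-readable summary with a `seconds time elapsed` line using a decimal point, and that the summary goes to stderr. The helper names and the sample text (including its number) are made up for illustration and are not measured results.

```python
import re
import subprocess

# Matches e.g. "       1.234567890 seconds time elapsed".
ELAPSED_RE = re.compile(r"([0-9]+\.[0-9]+)\s+seconds time elapsed")


def parse_elapsed(perf_output):
    """Return the wall-clock seconds reported in a perf stat summary."""
    match = ELAPSED_RE.search(perf_output)
    if match is None:
        raise ValueError("no 'seconds time elapsed' line found")
    return float(match.group(1))


def time_with_perf(command):
    """Run `perf stat <command...>` and return the elapsed seconds."""
    proc = subprocess.run(
        ["perf", "stat", *command],
        stdout=subprocess.DEVNULL,  # discard the benchmark's own output
        stderr=subprocess.PIPE,     # the summary is assumed to be on stderr
        text=True,
    )
    return parse_elapsed(proc.stderr)


# Hypothetical summary text, used only to exercise the parser.
SAMPLE = """
 Performance counter stats for './rv32emu/build/rv32emu ./rv32emu/build/riscv32/dhrystone':

       1.234567890 seconds time elapsed
"""

print(parse_elapsed(SAMPLE))
```

A call such as `time_with_perf(["./rv32emu/build/rv32emu", "./rv32emu/build/riscv32/dhrystone"])` would then return one data point; repeating each measurement several times and taking the median is a common way to reduce noise.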

The following figure shows the result in interpreter mode.

[Figure: execution time of libriscv and rv32emu in interpreter mode; image unavailable]

The result in binary translation mode.

The research paper "Accelerate RISC-V Instruction Set Simulation by Tiered JIT Compilation" evaluates system performance with a diverse set of benchmark programs that exercise different aspects of the system: numeric sorting, string sorting, bitfield operations, floating-point emulation, variable assignment, IDEA encryption, Huffman compression, Dhrystone synthetic benchmarking, prime number calculations, and SHA-512 hashing.

Prebuilt RV32I ELF files: https://github.com/sysprog21/rv32emu-prebuilt/releases


To replicate the benchmark procedure of "Accelerate RISC-V Instruction Set Simulation by Tiered JIT Compilation" on libriscv and rv32emu, download and extract the prebuilt ELF files.

$ wget https://github.com/sysprog21/rv32emu-prebuilt/releases/download/2024.12.20-ELF/rv32emu-prebuilt.tar.gz
$ tar zxvf rv32emu-prebuilt.tar.gz

The result in interpreter mode

[Figure: benchmark results in interpreter mode; image unavailable]

The result in binary translation mode

[Figure: benchmark results in binary translation mode; image unavailable]

The benchmarking results show that in interpreter mode rv32emu is indeed slower than libriscv, whereas in binary translation mode the ranking is reversed. The one exception is the primes test, where rv32emu consistently trails libriscv by a significant margin regardless of the mode.

Benchmark after setting EPB to performance and turning off turbo boost

After setting EPB to performance and turning off turbo boost, the CPU frequency stays at the officially rated base frequency under load, with no dynamic adjustment.
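One way to check this claim is to sample the current frequency while a benchmark is running and look at how much it varies. The sketch below is a minimal illustration: the `scaling_cur_freq` path for cpu0, the 1% tolerance, the helper names, and the sample values in the usage lines are all assumptions, not measurements from this machine.

```python
import time
from pathlib import Path

# Assumed cpufreq attribute reporting the current frequency in kHz.
CUR_FREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")


def sample_freq_khz(path=CUR_FREQ, count=10, interval=0.1):
    """Read the current frequency `count` times, `interval` seconds apart."""
    samples = []
    for _ in range(count):
        samples.append(int(Path(path).read_text().strip()))
        time.sleep(interval)
    return samples


def relative_spread(samples):
    """(max - min) / max over the samples; 0.0 means perfectly constant."""
    return (max(samples) - min(samples)) / max(samples)


def is_fixed(samples, tolerance=0.01):
    """Treat the clock as fixed if it varies by at most `tolerance`."""
    return relative_spread(samples) <= tolerance


# Made-up samples in kHz, only to show how the helpers behave.
print(is_fixed([3400000, 3400000, 3399000]))
print(is_fixed([3400000, 800000]))
```

Running `sample_freq_khz()` in a second terminal during a benchmark and passing the result to `is_fixed` gives a rough indication of whether the frequency really stayed constant.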


The result in interpreter mode

[Figure: benchmark results in interpreter mode with EPB set to performance and turbo boost off; image unavailable]

The result in binary translation mode

[Figure: benchmark results in binary translation mode with EPB set to performance and turbo boost off; image unavailable]

Identify the techniques that give the libriscv interpreter its performance advantage

Reference

Accelerate RISC-V Instruction Set Simulation by Tiered JIT Compilation
rv32emu development notes
Using C++ as a game engine scripting language
Final Project: Improve rv32emu performance
rv32emu issue #288
A close reading of the libriscv development timeline