contributed by < HenryChiang >
Tags: RISC-V, Computer Architecture, Virtual Machine
This is a simulator designed around the RISC-V architecture, capable of executing ELF-format files. It first uses an interpreter to translate the instructions read from the ELF file into corresponding C code, and a JIT compiler then compiles the generated C code into the corresponding machine language.
The goal is to create special functions for common instructions so that they can be translated into faster C code. This reduces the overhead of these common instructions in the execution stage and indirectly speeds up the interpreter. We apply this technique in this project to accelerate the execution speed of rv32emu.
We will apply the technique of instruction specialization to rv32emu to improve its efficiency when executing coremark.elf, specifically the iterations/sec metric. This requires the following preparations:
- Add checks for special opcodes in the decode stage. To achieve this, we introduce special opcodes and additional conditional checks that select the corresponding special functions.
- Integrate special functions into the execution stage. Because the generated functions are numerous and largely similar, we use another Python script to generate them; developing this script is also part of the preparation.
`addi` is the most frequently used instruction in RISC-V assembly programs, so it is the focal point of our specialization effort. We use `addi` below to give a simple introduction to our generation targets and the methods employed in the generation process.
| Object | Description |
|---|---|
| special opcode | For example, in the case of `addi`, if the specialized instruction is `addi a0, a0, 1`, the opcode becomes `addi10` (the `a0` register corresponds to `x10`). |
| selection | The specialized opcode is selected based on `rd` and `rs1`, e.g. `if (rd == 10 && rs1 == 10) { ir->opcode = addi10; }` |
| special function | Originally, the execution stage had to fetch the instruction's `rd` and `rs1` before executing. Now the C code in the function simplifies to `rv[10] = rv[10] + ir->imm;` |
Python makes it easy to write simple scripts, thanks to its powerful string operations (e.g. `format`) and file I/O. We use Python to generate the members of the three instruction-specialization components listed above.
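As a minimal sketch (the struct, field, and opcode names here are hypothetical, not rv32emu's actual identifiers), the three generated members for the `addi10` case would look roughly like this:

```c
#include <stdint.h>

/* Minimal decoded-instruction record (IR); field names are illustrative. */
typedef struct {
    int     opcode;
    uint8_t rd, rs1, rs2;
    int32_t imm;
} ir_t;

/* 1. special opcode: one extra enumerator per specialized form. */
enum { OPC_ADDI, OPC_ADDI10 /* addi x10, x10, imm */ };

/* 2. selection: runs at decode time, once the instruction is known to be
 *    addi, and switches to the specialized opcode when it applies. */
static void select_addi10(ir_t *ir)
{
    if (ir->rd == 10 && ir->rs1 == 10)
        ir->opcode = OPC_ADDI10;
}

/* 3. special function: rd and rs1 are baked into the handler, so it no
 *    longer reads them from the IR. */
static void do_addi10(uint32_t rv[32], const ir_t *ir)
{
    rv[10] = rv[10] + ir->imm;
}
```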
Here we explain the process by following the interpreter's flow. rv32emu executes on a per-block basis, and the steps below are applied to each block.
First, we create a block and store the decoded data structures (IRs) in it. Until it reaches a jump instruction, a block can accumulate a large number of IRs. We focus on the IR and perform the first optimization there.
An IR is the result of instruction decoding and contains the `opcode`, `rd`, `rs1`, `rs2`, and `imm` fields, which makes it easy to manipulate in the execution stage.
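For illustration, a simplified decoder for an I-type word such as `addi` might fill these fields as follows (a sketch reusing the `ir_t` layout from the earlier example; rv32emu's real decoder is more elaborate):

```c
/* Unpack the I-type fields of a 32-bit instruction word into the IR.
 * The caller is assumed to have already identified the opcode (e.g. addi). */
static void decode_itype(uint32_t word, ir_t *ir)
{
    ir->rd  = (word >> 7) & 0x1f;    /* bits 11:7                 */
    ir->rs1 = (word >> 15) & 0x1f;   /* bits 19:15                */
    ir->imm = (int32_t) word >> 20;  /* bits 31:20, sign-extended */
}
```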
Finally, execution takes place. We take the IR's `opcode` and look it up in the dispatch table to find the corresponding execution function. For example, `addi` has a corresponding `addi` function that performs `rd = rs1 + imm`.
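A minimal sketch of this dispatch step, again with hypothetical names and the `ir_t` layout from the earlier sketch:

```c
typedef void (*handler_t)(uint32_t rv[32], const ir_t *ir);

/* Generic addi handler: rd and rs1 are read from the IR at run time. */
static void do_addi(uint32_t rv[32], const ir_t *ir)
{
    rv[ir->rd] = rv[ir->rs1] + ir->imm;   /* rd = rs1 + imm */
}

/* Dispatch table: the IR's opcode indexes the matching handler. */
static const handler_t dispatch[] = {
    [OPC_ADDI] = do_addi,
    /* ... one entry per opcode ... */
};

/* Execute every IR stored in a decoded block. */
static void exec_block(uint32_t rv[32], const ir_t *irs, int n)
{
    for (int i = 0; i < n; i++)
        dispatch[irs[i].opcode](rv, &irs[i]);
}
```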
The reason instruction specialization works is that programs are executed block by block. If the same block is executed repeatedly, it does not need to be decoded again; we can enter the execution stage directly. Therefore, if we can reduce the overhead of the execution stage, we can further improve the execution speed of rv32emu. This is why we implement multiple specialized instructions in the execution stage: they reduce dispatch overhead more effectively than conventional handlers.
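To make the block-reuse argument concrete, a toy block cache keyed by the block's start PC might look like this (a sketch only; rv32emu's real lookup structure differs):

```c
#include <stddef.h>

typedef struct {
    uint32_t pc;       /* start address of the block          */
    int      n_insn;   /* number of decoded IRs it holds      */
    ir_t    *ir;       /* the decoded IRs (see sketch above)  */
} block_t;

#define MAP_SIZE 1024
static block_t *block_map[MAP_SIZE];

/* If this PC was decoded before, reuse the block and skip decoding. */
static block_t *block_find(uint32_t pc)
{
    block_t *b = block_map[(pc >> 2) & (MAP_SIZE - 1)];
    return (b && b->pc == pc) ? b : NULL;
}
```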
Here we apply the special opcodes, selections, and functions generated in the previous chapter, complemented by the information in Section 3.2.
First, while decoding into IR, if the instruction is known to be `addi`, we further determine whether it is a special instruction such as `addi x10, x10, imm`. This is done by the selection step, which assigns a special opcode to the IR; in this example the special opcode is `addi10`.
The optimization shows up in the execution stage, where each special opcode corresponds to a special function. Because `rd` and `rs1` are already known, these functions avoid two memory accesses and reduce the dispatch overhead of the execution stage.
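Putting the generic and specialized handlers side by side (same `ir_t` sketch as before) makes the saving visible:

```c
/* Generic: rd and rs1 must be fetched from the IR before the add. */
static void do_addi_generic(uint32_t rv[32], const ir_t *ir)
{
    rv[ir->rd] = rv[ir->rs1] + ir->imm;
}

/* Specialized: rd and rs1 are fixed to x10, so those two reads disappear. */
static void do_addi10_special(uint32_t rv[32], const ir_t *ir)
{
    rv[10] = rv[10] + ir->imm;
}
```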
Following the instructor's advice, I analyzed the assembly produced by GCC and found that in special cases the ADDI instruction stands in for other instructions. For instance, when `rs1 == x0`, the instruction simply assigns `imm` to `rd`, which is how GCC expresses the `li` pseudo-instruction. Another special case is `imm == 0`, which simply copies `rs1` to `rd`, i.e. the `mv` pseudo-instruction. I specialized my ADDI handling for these two cases, resulting in a significant speed improvement due to the reduced number of operands.
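A sketch of these two extra checks in the selection step (hypothetical opcode names, reusing the `ir_t` sketch from earlier):

```c
enum { OPC_ADDI_LI = 1, OPC_ADDI_MV };   /* illustrative opcode ids */

static void select_addi_li_mv(ir_t *ir)
{
    if (ir->rs1 == 0)        /* addi rd, x0, imm : pure load-immediate (li) */
        ir->opcode = OPC_ADDI_LI;
    else if (ir->imm == 0)   /* addi rd, rs1, 0  : pure register move (mv)  */
        ir->opcode = OPC_ADDI_MV;
}
```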
`inc` and `dec` are also specialized: if an `addi` instruction's immediate value is 1 or -1 and `rd` is the same register as `rs1`, the instruction is specialized into `inc` or `dec` respectively.
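A corresponding sketch for the `inc`/`dec` selection:

```c
enum { OPC_INC = 1, OPC_DEC };       /* illustrative opcode ids */

static void select_inc_dec(ir_t *ir)
{
    if (ir->rd == ir->rs1 && ir->imm == 1)
        ir->opcode = OPC_INC;        /* addi rN, rN, 1  -> inc */
    else if (ir->rd == ir->rs1 && ir->imm == -1)
        ir->opcode = OPC_DEC;        /* addi rN, rN, -1 -> dec */
}
```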
I specialized the `jal` instruction by observing GCC's assembly output. In most cases the target register is `x1`, yet the decode stage still looks up the target register `rd`. Therefore, after assigning the opcode as `jal`, I separately check whether `rd == x1`; if so, I specialize it into `jalx1`.
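A sketch of the extra check performed after the opcode has been set to `jal`:

```c
enum { OPC_JALX1 = 1 };              /* illustrative opcode id */

static void select_jal_x1(ir_t *ir)
{
    if (ir->rd == 1)                 /* x1 is the return-address register */
        ir->opcode = OPC_JALX1;
}
```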
Standard RV32I provides the `beq` and `bne` instructions, but the zero-compare branches that C code needs frequently, `beqz` and `bnez`, exist only as pseudo-instructions. Hence, I aimed to specialize these cases. I implemented `beqz` and `bnez` in rv32emu separately from the RVC (compressed) instructions: I check whether `rs2` is `x0`, and if so I specialize `beq` into `beqz` and `bne` into `bnez`.
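A sketch of the branch checks, again with hypothetical opcode names:

```c
enum { OPC_BEQZ = 1, OPC_BNEZ };     /* illustrative opcode ids */

static void select_beqz(ir_t *ir)
{
    if (ir->rs2 == 0)                /* beq rs1, x0, off -> beqz */
        ir->opcode = OPC_BEQZ;
}

static void select_bnez(ir_t *ir)
{
    if (ir->rs2 == 0)                /* bne rs1, x0, off -> bnez */
        ir->opcode = OPC_BNEZ;
}
```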
I specialize ALU-related instructions where `rd == rs1` into new instructions. For example, `addi x10, x10, 1` becomes a new instruction named `addix10`, and empirical results show an acceleration in execution speed.
We analyze `build/rv32emu` to examine the effect of the specialized instructions. First we look at `addi`, comparing the non-specialized `do_addi` with the specialized `do_addi025025` (`addi x25, x25, imm`). Disassembling both on an x86_64 machine, we observe a difference of four instructions between them: two `mov` instructions and two `movzbl` instructions.
We also compared `beq` and `beqz`; after specialization there is a reduction of four instructions in total.
Counting interpreter function utilization with the pipe command `grep ADDI | sort | uniq -c`
Whenever `build/rv32emu` executes an `addi` instruction, it prints out which `addi` variant it is currently executing. Our goal is to use the Linux pipe command to run the benchmark and, at the same time, collect statistics on the occurrences of each ADDI instruction.
We specialize based on the most frequently occurring `(rd, rs1)` combinations for ADDI. For example, in coremark the most common are ADDI(12,12), ADDI(13,13), and ADDI(15,15); specializing these instructions accelerates coremark's execution speed.
Because the decode phase uses an if/else-if/else structure to determine whether an instruction is specialized, the order of the checks becomes crucial. We place the most frequently occurring combinations from Section 5.2 first, which effectively speeds up decoding for frequently encountered instructions. An example is provided below:
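(The opcode names below are hypothetical; the `(rd, rs1)` pairs and their ordering come from the counts in Section 5.2.)

```c
enum { OPC_ADDI12 = 1, OPC_ADDI13, OPC_ADDI15 };   /* illustrative ids */

static void select_addi_by_frequency(ir_t *ir)
{
    if (ir->rd == 12 && ir->rs1 == 12)        /* most frequent pair first */
        ir->opcode = OPC_ADDI12;
    else if (ir->rd == 13 && ir->rs1 == 13)
        ir->opcode = OPC_ADDI13;
    else if (ir->rd == 15 && ir->rs1 == 15)
        ir->opcode = OPC_ADDI15;
    /* otherwise keep the generic addi opcode */
}
```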
We use CoreMark as the benchmark for our performance evaluation. CoreMark is commonly used to test CPU performance in embedded systems. Its advantage is that library runtime is excluded from the measurement, so only the code under test is timed; moreover, it specifies the compiler settings, making comparisons across platforms and compilers convenient.
Compiler version: GCC 13.2.0
Compiler flags: `-Ofast -std=c99 -march=rv32im -mabi=ilp32 -DPERFORMANCE_RUN=1`
| Compiler | w/ specialized insn. | CoreMark (Iterations/Sec) | Dhrystone (DMIPS) |
|---|---|---|---|
| gcc-13 | No | 299.805127 | 343.0 |
| gcc-13 | Yes | 320.179300 | 359.0 |
DMIPS: Dhrystone MIPS (Million Instructions per Second)
During implementation, I attempted to add specialized instructions for every possible `addi` combination (approximately 1024 variants), and I changed the selection in the decode stage from the original `if` chain to a `switch` to try to reduce overhead further. However, when running the CoreMark performance analysis again under these conditions, the results were worse than both the version described above and the original rv32emu.
I believe the reasons are as follows:
- Although many specialized instructions were implemented, most of them rarely appear, so instead of optimizing anything they only increased the overhead of the decode stage.
- In the final implementation, I chose the more frequently occurring instructions as optimization targets, and I believe this is a key factor in improving performance.
After practical performance testing, execution with instruction specialization shows a growth of 6%. I believe the possible reasons the improvement is modest are:
- Additional decode-stage overhead: as mentioned earlier, the selection process takes up a significant share of the time and eats into the performance gained in the execution stage.
- Limited scope of instruction specialization: my optimization targets only frequently occurring instructions whose `rd` and `rs1` are identical. This property appears only in ALU-related instructions, not in instructions such as `beq` and `lw`, which limits the scope of the optimization.
- Although `addi` is the most common instruction in CoreMark, the next most frequent instructions, `lw` and `sw`, were not effectively optimized. Coupled with the decode-stage overhead, this limits the overall effectiveness. In my opinion, if the benchmark were compute-intensive, instruction specialization focused mainly on the ALU should exhibit better performance.