陳柏儒
My task is to optimize the performance of rv32emu in the interpreter and extend its existing macro-operation fusion to accelerate the execution of CoreMark and Dhrystone benchmark programs.
Write down your thoughts and questions here. Also, summarize the talk.
Traditional interpretation is switch-based, where each instruction execution involves checking all conditions in the switch-case statement. The detailed code is available in switch.c.
In this program, the following declaration is observed:
This indicates that a union value contains three members of different types:
enum insn in
: This is an enumeration representing the "type of instruction"long i
: Used to store an integer or parameters involved in the computation process.union value *p
: A pointer to another union value of the same typeThen there is a declaration in the program:
This is an array of type union value, where each element is a single union value. This array can be thought of as "a sequence of instructions and parameters," with every three or four elements forming one complete instruction statement. The comments in the code provide a mapping to:
Main program execution flow:
Instruction Decoding:
In each iteration, theswitch (insn->in)
statement checks the value ofinsn->in
, which is an enumerated type indicating the kind of instruction to execute (e.g.,insn_set
,insn_add
,insn_bnz
, etc.).Instruction Execution:
The corresponding case in the switch statement executes the appropriate logic for the instruction.
For certain operations (e.g., addition), they are commutative
add r0, r1, r2
andadd r0, r2, r1
produce the same result.
Assume the original instruction is:
The slide suggests rewriting it into two instructions, with the general logic as follows:
r2
to r0
.r1
and the value just copied into r0
, and store the result back into r0
.In RISC-V, we could write it as:
However, this rewrite has limitations. It is only effective when r2 ≠ r0
.
I don't understand the explanation in this section.
From the changes, it seems that a single instruction using three registers is split into two instructions, each using two registers.
What are the advantages of doing this?
Accelerate RISC-V Instruction Set Simulation by Tiered JIT Compilation
4.1 Macro-Operation Fusion (MOP Fusion)
QuickInterp - Improving interpreter performance with superinstructions
In rv32emu/build/riscv32
, you can find the executable files for coremark and dhrystone.
After compilation, I can use the following commands to view the RV32 instructions/registers in the coremark and dhrystone programs:
-r: output usage of registers(default is usage of instructions)
To compare the differences before and after optimization, I must first execute coremark and dhrystone to check their original performance.
emulate.c
Here, we can see the block_translate
function. Its purpose is to read instructions and group consecutive instructions into individual blocks.
In decode.h, within the #define RV_INSN_LIST
, we can see that instructions such as jal
, jalr
, beq
, and bne
have their can_branch
property set to 1.
Therefore, when dividing blocks, as soon as an instruction with can_branch
set to 1 is encountered, it is treated as the endpoint of the current block.
The focus of my task this time is on modifying the Macro-Operation Fusion logic. This part of the program's logic is located within the match_pattern
function.
We can try implementing a new fusion by identifying the most frequently occurring patterns that are not currently fused in rv32emu, as the more frequent the pattern, the greater the potential benefit.
lui
instructionsCompile and convert to a .elf file executable on rv32emu.
The result here does not meet my expectations.
In reality, there are only five lui instructions, but the result shows seven.
I suspect that the li instruction might have been expanded, but I could not find the corresponding implementation in the project.
The problem has been solved, please refer here
To verify my hypothesis, I removed the li instruction.
But the result remains the same.
I used additional test data and discovered that the add instruction is also identified as consecutive lui instructions.
The problem has been solved, please refer here
I tested the original hello.elf in the project.
However, I actually disassembled hello.elf using the following command:
Currently, I need to use GDB to understand the overall execution lifecycle of the program for this issue.
Based on the GDB output, we can analyze the contents of ir, ir->next, and ir->next->next step by step and compare them with the disassembled code:
Contents of ir
pc = 0
:
This corresponds to instruction at address 0x0
.
In the disassembly, the instruction at 0x0
is:
However, the opcode = 1
corresponds to lui
(based on the impl field pointing to do_lui
), indicating the instruction was misdecoded. It should have been addi
.
imm = 0
:
The immediate value is 0, which matches addi t0, zero, 0
.
However, the impl field points to do_lui
instead of do_addi
, further confirming the decoding issue.
rd = 5
:
The destination register is x5
(which corresponds to t0
in RISC-V), and this part is correct.
Contents of ir->next
pc = 0
:
This corresponds to instruction at address 0x4
.
In the disassembly, the instruction at 0x4
is:
Again, the opcode = 1
is decoded as lui
instead of addi
, indicating another decoding issue.
imm = 5
:
The immediate value is 5, which matches addi t0, zero, 0
.
rd = 5
:
The destination register is x6
(which corresponds to t1
in RISC-V), and this part is correct.
Contents of ir->next->next
In this test,I inserted breakpoints in block_translate
and match_pattern
.
The following are the observed results:
To facilitate observation, you can set a breakpoint at b src/emulate.c:683
.
This way, you can monitor a complete block.
Additionally, I wrote a GDB script to make it easier to print out the instruction information of a block during the GDB testing process:
Before using GDB test commands, you need to load the corresponding GDB script file (in this case, commands.gdb).
The effect is as follows:
After a series of tests, it was finally discovered that the opcodes of various instructions were modified in rv32_constopt.c
, which caused counting errors in the previous tests.
The reason behind this implementation is still unclear, and further modifications to the code are currently underway.
Do send pull request(s) once you figure out the fixes.
This part uses constant folding technique. When the instruction requires register contents that are compile-time constants, the macro CONSTOPT
will change the opcode of the instruction. After changing the opcode, it will be detected as consecutive lui instructions.
From this test data, it can be observed that no consecutive lui instructions were detected before the NOP
instruction(addi x0, x0, 0
). This is because the register values are loaded into memory only after compilation is complete, so the macro CONSTOPT
does not change their opcodes.
lui
+ add
Register Dependency:
The following test data meets the fuse condition:
The destination register of lw has a dependency relationship with the source register of add.
Adjust the following test data so that it does not meet the fuse condition:
The destination register of lw does not have a dependency relationship with the source register of add.
For the final test data, I replaced the add instruction with the sub instruction
to verify whether any false-positive cases would occur.
Based on the results, it is correct. However, this also indicates that the lui + sub sequence has potential for extension. Next, I will proceed to improve the implementation of this part.
sw
& fused4 - multiple lw
COMBINE_MEM_OPS(0)
and COMBINE_MEM_OPS(1)
correspond to different types of memory accesses (specifically, sw
or lw
).
Assuming that IIF(0)(rv_insn_lw, rv_insn_sw)
is called.
IIF(0)
The definition of IIF(C)
is:
So when applying IFF(0)
, we get:
PRIMITIVE_CAT(IIF_,0)
The definition of PRIMITIVE_CAT(IIF_,0)
is:
Here, a =
IIF_
, and__VA_ARGS__
= 0.
So we get:
##
is token pasting operator, it will concatenateIIF_
and0
to formIIF_0
.
Therefore, IIF(0)
ultimately expands to IIF_0
.
IIF_0(rv_insn_lw, rv_insn_sw)
:The definition of IIF_0
is:
Here,
t
=rv_insn_lw
,
__VA_ARGS__
=rv_insn_sw
Only the "second argument" or the "remaining arguments" are retained.
For IIF_0(rv_insn_lw, rv_insn_sw)
, only rv_insn_sw
is retained.
The entire macro essentially acts as a conditional expression, ultimately resulting in rv_insn_sw
.
For testing purposes, I made the following modifications to the program.
And generated the corresponding test data for verification:
The result was different from what I expected; the order of the sw and lw instructions was reversed.
After individually testing the lw
and sw
instructions, I have confirmed that there are some issues. However, I am not yet certain of the cause and am currently further narrowing down the problem.
I found out that my debug program was wrong:
And my debug program:
This should be modified:
slli
, srli
, srai
instructionslui
+ sub
Compared to the baseline, both performances have significantly declined. This indicates that the expansion is not suitable.
While following the environment setup instructions in the r32emu README.md,
I encountered an issue where the following command did not work:
My solution was to follow the instructions described in the LLVM Debian/Ubuntu nightly packages:
You don't have to enable JIT for this project. Concentrate on the improvements of RISC-V interpreter.
okay.