# Speed up rv32emu interpreter
> 陳柏儒
## Introduction
My task is to optimize the performance of rv32emu in the interpreter and extend its existing macro-operation fusion to accelerate the execution of CoreMark and Dhrystone benchmark programs.
### The art of the language VM
[Jitter GHM2017](https://ageinghacker.net/talks/jitter-slides--saiu--ghm2017--2017-08-25.pdf) / [Video](https://audio-video.gnu.org/video/ghm2017/2017-08--saiu--jitter--ghm.webm)
:::warning
Write down your thoughts and questions here. Also, summarize the talk.
:::
#### Traditional interpretation & Direct-threaded interpretation
Traditional interpretation is switch-based, where each instruction execution involves checking all conditions in the switch-case statement. The detailed code is available in [switch.c](https://ageinghacker.net/projects/jitter-ghm-2017/c-examples/switch.c).
In this program, the following declaration is observed:
```c
union value {
enum insn in;
long i;
union value *p;
};
```
This indicates that a union value contains three members of different types:
* `enum insn in`: This is an enumeration representing the "type of instruction"
* `long i`: Used to store an integer or parameters involved in the computation process.
* `union value *p`: A pointer to another union value of the same type
Then there is a declaration in the program:
```c
union value program[] = {
// Assign 2000000000 into %r0.
{.in = insn_set}, // 0
{.i = 2000000000}, // 1
{.i = 0}, // 2
// Assign -1 into %r1.
{.in = insn_set}, // 3
{.i = -1}, // 4
{.i = 1}, // 5
// Sum %r0 and %r1 into %r0.
{.in = insn_add}, // 6
{.i = 0}, // 7
{.i = 1}, // 8
{.i = 0}, // 9
```
This is an array of type union value, where each element is a single union value. This array can be thought of as "a sequence of instructions and parameters," with every three or four elements forming one complete instruction statement. The comments in the code provide a mapping to:
* Instruction Type: Represented by .in = something. This specifies the operation (e.g., insn_set, insn_add, etc.).
Parameters: Represented by .i = number or .p = program + offset. These parameters are the operands or addresses associated with the instruction.
* In direct-threaded interpretation, each instruction contains the execution address of the next instruction and immediately jumps to it after execution (using goto * to the code pointer).
Main program execution flow:
1. Register Declaration
```c
long regs[10];
```
2. Interpreter Loop
```c
const union value *insn = program;
while (true)
switch (insn->in)
{
...
}
```
> **Instruction Decoding:**
> In each iteration, the `switch (insn->in)` statement checks the value of `insn->in`, which is an enumerated type indicating the kind of instruction to execute (e.g., `insn_set`, `insn_add`, `insn_bnz`, etc.).
>
> **Instruction Execution:**
The corresponding case in the switch statement executes the appropriate logic for the instruction.
___
#### Direct-threaded interpretation
___
#### VM specialized instructions
##### Commutativity of Instructions
For certain operations (e.g., addition), they are commutative
> `add r0, r1, r2` and `add r0, r2, r1` produce the same result.
##### Instruction Rewriting
Assume the original instruction is:
```c
add r0, r1, r2 # r0 <- r1 + r2
```
The slide suggests rewriting it into two instructions, with the general logic as follows:
1. First, copy the content of `r2` to `r0`.
2. Then, perform addition: add `r1` and the value just copied into `r0`, and store the result back into `r0`.
In RISC-V, we could write it as:
```c
mv r0, r2 # r0 <- r2
add r0, r1, r0 # r0 <- r1 + r0
```
However, this rewrite has limitations. It is only effective when `r2 ≠ r0`.
:::info
I don't understand the explanation in this section.
From the changes, it seems that a single instruction using three registers is split into two instructions, each using two registers.
What are the advantages of doing this?
:::
___
[Accelerate RISC-V Instruction Set Simulation by Tiered JIT Compilation](https://dl.acm.org/doi/pdf/10.1145/3689490.3690399)
4.1 Macro-Operation Fusion (MOP Fusion)
[QuickInterp - Improving interpreter performance with superinstructions](https://essay.utwente.nl/81408/1/Miedema_MA_EEMCS.pdf)
## Benchmarks
In `rv32emu/build/riscv32`, you can find the executable files for **coremark** and **dhrystone**.
```shell
$ make tool
CC build/rv_histogram.o
LD build/rv_histogram
```
After compilation, I can use the following commands to view the RV32 instructions/registers in the **coremark** and **dhrystone** programs:
```shell
$ ./build/rv_histogram ./build/riscv32/coremark
$ ./build/rv_histogram -r ./build/riscv32/coremark
```
>-r: output usage of registers(default is usage of instructions)


```shell
$ ./build/rv_histogram ./build/riscv32/dhrystone
$ ./build/rv_histogram -r ./build/riscv32/dhrystone
```


### Baselines
To compare the differences before and after optimization, I must first execute coremark and dhrystone to check their original performance.
#### coremark
```shell
$ ./build/rv32emu ./build/riscv32/coremark
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 43442522
Total time (secs): 43.442522
Iterations/Sec : 1841.513713
Iterations : 80000
Compiler version : GCC14.2.0
Compiler flags : -O2
Memory location : Main memory (heap)
Correct operation validated.
CoreMark 1.0 : 1841.513713 / GCC14.2.0 -O2 / Main memory (heap)
inferior exit code 0
```
#### dhrystone
```shell
$ ./build/rv32emu ./build/riscv32/dhrystone
Dhrystone(1.1-mc), 500000000 passes, 91538320 microseconds, 3108 DMIPS
inferior exit code 0
```
## rv32emu
### `emulate.c`
Here, we can see the `block_translate` function. Its purpose is to read instructions and group consecutive instructions into individual blocks.
```c
FORCE_INLINE bool insn_is_branch(uint8_t opcode)
{
switch (opcode) {
#define _(inst, can_branch, insn_len, translatable, reg_mask) \
IIF(can_branch) \
(case rv_insn_##inst:, )
RV_INSN_LIST
#undef _
return true;
}
return false;
}
static void block_translate(riscv_t *rv, block_t *block)
{
...
/* stop on branch */
if (insn_is_branch(ir->opcode)) {
if (insn_is_indirect_branch(ir->opcode)) {
ir->branch_table = calloc(1, sizeof(branch_history_table_t));
assert(ir->branch_table);
memset(ir->branch_table->PC, -1,
sizeof(uint32_t) * HISTORY_SIZE);
}
break;
}
...
}
```
In [decode.h](https://github.com/sysprog21/rv32emu/blob/master/src/decode.h), within the `#define RV_INSN_LIST`, we can see that instructions such as `jal`, `jalr`, `beq`, and `bne` have their `can_branch` property set to 1.
Therefore, when dividing blocks, as soon as an instruction with `can_branch` set to 1 is encountered, it is treated as the endpoint of the current block.
The focus of my task this time is on modifying the Macro-Operation Fusion logic. This part of the program's logic is located within the `match_pattern` function.
We can try implementing a new fusion by identifying the most frequently occurring patterns that are not currently fused in rv32emu, as the more frequent the pattern, the greater the potential benefit.
### Test the existing macro-operation fusion
#### fused1 - Consecutive `lui` instructions
```diff
case rv_insn_lui:
next_ir = ir->next;
switch (next_ir->opcode) {
case rv_insn_lui:
count = 1;
while (1) {
if (!IF_insn(next_ir, lui))
break;
count++;
if (!next_ir->next)
break;
next_ir = next_ir->next;
}
if (count > 1) {
+ printf("[DEBUG] fuse1 triggered: %d consecutive LUI instructions.\n", count);
ir->opcode = rv_insn_fuse1;
ir->fuse = malloc(count * sizeof(opcode_fuse_t));
assert(ir->fuse);
ir->imm2 = count;
memcpy(ir->fuse, ir, sizeof(opcode_fuse_t));
ir->impl = dispatch_table[ir->opcode];
next_ir = ir->next;
for (int j = 1; j < count; j++, next_ir = next_ir->next)
memcpy(ir->fuse + j, next_ir, sizeof(opcode_fuse_t));
remove_next_nth_ir(rv, ir, block, count - 1);
}
break;
}
break;
```
##### Test data 1_1
```c
.section .text
.global _start
_start:
lui x10, 0x12345
lui x11, 0x23456
lui x12, 0x34567
lui x13, 0x45678
lui x14, 0x56789
li a0, 0
li a7, 93
ecall
```
Compile and convert to a .elf file executable on rv32emu.
```shell
$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -nostartfiles -nostdlib -o <file_name>.elf <file_name>.s
$ ./build/rv32emu test_fuse1.elf
[DEBUG] fuse1 triggered: 7 consecutive LUI instructions.
```
:::info
The result here does not meet my expectations.
In reality, there are only five lui instructions, but the result shows seven.
I suspect that the li instruction might have been expanded, but I could not find the corresponding implementation in the project.
> The problem has been solved, please refer [here](#####fused1_explanation_of_previous_question)
:::
##### Test data 1_2
To verify my hypothesis, I removed the li instruction.
```diff
.section .text
.global _start
_start:
lui x10, 0x12345
lui x11, 0x23456
lui x12, 0x34567
lui x13, 0x45678
lui x14, 0x56789
- li a0, 0
- li a7, 93
+ addi a0, x0, 0
+ addi a7, x0, 93
ecall
```
```shell
$ ./build/rv32emu test_fuse1.elf
[DEBUG] fuse1 triggered: 7 consecutive LUI instructions.
```
But the result remains the same.
##### Test data 1_3
```diff
.section .text
.global _start
_start:
lui x10, 0x12345
lui x11, 0x23456
lui x12, 0x34567
lui x13, 0x45678
lui x14, 0x56789
+ add x15, x10, x11
+ add x15, x10, x11
li a0, 0
li a7, 93
ecall
```
```shell
$ ./build/rv32emu test_fuse1.elf
[DEBUG] fuse1 triggered: 9 consecutive LUI instructions.
```
:::info
I used additional test data and discovered that the add instruction is also identified as consecutive lui instructions.
> The problem has been solved, please refer [here](#####fused1_explanation_of_previous_question)
:::
I tested the original hello.elf in the project.
```shell
$ ./build/rv32emu ./build/hello.elf
[DEBUG] fuse1 triggered: 2 consecutive LUI instructions.
[DEBUG] fuse1 triggered: 4 consecutive LUI instructions.
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
[DEBUG] fuse1 triggered: 2 consecutive LUI instructions.
inferior exit code 0
```
However, I actually disassembled hello.elf using the following command:
```shell
$ riscv64-unknown-elf-objdump -d -M no-aliases ./build/hello.elf
```
```shell
Disassembly of section .text:
00000000 <_start>:
0: 00000293 addi t0,zero,0
4: 00500313 addi t1,zero,5
8: 0040006f jal zero,c <_start+0xc>
c: 00000013 addi zero,zero,0
00000010 <loop>:
10: 02628063 beq t0,t1,30 <end>
14: 04000893 addi a7,zero,64
18: 00100513 addi a0,zero,1
1c: 03c00593 addi a1,zero,60
20: 00d00613 addi a2,zero,13
24: 00000073 ecall
28: 00128293 addi t0,t0,1
2c: fe5ff06f jal zero,10 <loop>
00000030 <end>:
30: 05d00893 addi a7,zero,93
34: 00000513 addi a0,zero,0
38: 00000073 ecall
```
Currently, I need to use GDB to understand the overall execution lifecycle of the program for this issue.
```shell
$ gdb --args ./build/rv32emu ./build/hello.elf
(gdb) b match_pattern
Breakpoint 1 at 0x55555556b7bb: file src/emulate.c, line 732.
(gdb) r
Starting program: /home/boju/rv32emu/build/rv32emu ./build/hello.elf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, match_pattern (rv=0x0, block=0x0) at src/emulate.c:732
732 {
```
```shell
(gdb) p ir
$1 = (rv_insn_t *) 0x7ffff6093008
(gdb) p ir->next
$2 = (struct rv_insn *) 0x7ffff6093058
(gdb) p ir->next->next
$3 = (struct rv_insn *) 0x7ffff60930a8
(gdb) p *ir
$4 = {{imm = 0, rs3 = 0 '\000'}, rd = 5 '\005', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 1 '\001', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 0,
next = 0x7ffff6093058, impl = 0x55555555f25f <do_lui>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) p *ir->next
$5 = {{imm = 5, rs3 = 5 '\005'}, rd = 6 '\006', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 1 '\001', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 4,
next = 0x7ffff60930a8, impl = 0x55555555f25f <do_lui>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) p *ir->next->next
$6 = {{imm = 4, rs3 = 4 '\004'}, rd = 0 '\000', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 3 '\003', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 8, next = 0x0,
impl = 0x55555555f3d7 <do_jal>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
```
Based on the GDB output, we can analyze the contents of ir, ir->next, and ir->next->next step by step and compare them with the disassembled code:
Contents of `ir`
```shell
$4 = {
{imm = 0, rs3 = 0 '\000'},
rd = 5 '\005',
rs1 = 0 '\000',
rs2 = 0 '\000',
opcode = 1 '\001',
shamt = 0 '\000',
rm = 0 '\000',
imm2 = 0,
fuse = 0x0,
pc = 0,
next = 0x7ffff6093058,
impl = 0x55555555f25f <do_lui>,
branch_taken = 0x0,
branch_untaken = 0x0,
branch_table = 0x0
}
```
* `pc = 0`:
This corresponds to instruction at address `0x0`.
In the disassembly, the instruction at `0x0` is:
```shell
0: 00000293 addi t0, zero, 0
```
However, the `opcode = 1` corresponds to `lui` (based on the impl field pointing to `do_lui`), indicating the instruction was misdecoded. It should have been `addi`.
* `imm = 0`:
The immediate value is 0, which matches `addi t0, zero, 0`.
However, the impl field points to `do_lui` instead of `do_addi`, further confirming the decoding issue.
* `rd = 5`:
The destination register is `x5` (which corresponds to `t0` in RISC-V), and this part is correct.
Contents of `ir->next`
```shell
$5 = {
{imm = 5, rs3 = 5 '\005'},
rd = 6 '\006',
rs1 = 0 '\000',
rs2 = 0 '\000',
opcode = 1 '\001',
shamt = 0 '\000',
rm = 0 '\000',
imm2 = 0,
fuse = 0x0,
pc = 4,
next = 0x7ffff60930a8,
impl = 0x55555555f25f <do_lui>,
branch_taken = 0x0,
branch_untaken = 0x0,
branch_table = 0x0
}
```
* `pc = 0`:
This corresponds to instruction at address `0x4`.
In the disassembly, the instruction at `0x4` is:
```shell
4: 00500313 addi t1, zero, 5
```
Again, the `opcode = 1` is decoded as `lui` instead of `addi`, indicating another decoding issue.
* `imm = 5`:
The immediate value is 5, which matches `addi t0, zero, 0`.
* `rd = 5`:
The destination register is `x6` (which corresponds to `t1` in RISC-V), and this part is correct.
Contents of `ir->next->next`
```shell
$6 = {
{imm = 4, rs3 = 4 '\004'},
rd = 0 '\000',
rs1 = 0 '\000',
rs2 = 0 '\000',
opcode = 3 '\003',
shamt = 0 '\000',
rm = 0 '\000',
imm2 = 0,
fuse = 0x0,
pc = 8,
next = 0x0,
impl = 0x55555555f3d7 <do_jal>,
branch_taken = 0x0,
branch_untaken = 0x0,
branch_table = 0x0
}
```
In this test,I inserted breakpoints in `block_translate` and `match_pattern`.
The following are the observed results:
```shell
Breakpoint 1, block_translate (rv=0x55555555efeb <block_alloc+29>, block=0x7fffffffd7c0) at src/emulate.c:607
607 {
...
(gdb) p *ir
$10 = {{imm = 0, rs3 = 0 '\000'}, rd = 5 '\005', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 19 '\023', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 0, next = 0x0,
impl = 0x5555555608ea <do_addi>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) c
Continuing.
Breakpoint 2, match_pattern (rv=0x0, block=0x0) at src/emulate.c:713
713 {
...
(gdb) n
719 rv_insn_t *next_ir = NULL;
(gdb) p *ir
$11 = {{imm = 0, rs3 = 0 '\000'}, rd = 5 '\005', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 1 '\001', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 0,
next = 0x7ffff6093058, impl = 0x55555555f25f <do_lui>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
```
```shell
(gdb) p *block->ir_head
$28 = {{imm = 0, rs3 = 0 '\000'}, rd = 5 '\005', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 19 '\023', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 0,
next = 0x7ffff6093058, impl = 0x5555555608ea <do_addi>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) p *block->ir_head->next
$29 = {{imm = 5, rs3 = 5 '\005'}, rd = 6 '\006', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 19 '\023', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 4,
next = 0x7ffff60930a8, impl = 0x5555555608ea <do_addi>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) p *block->ir_head->next->next
$30 = {{imm = 4, rs3 = 4 '\004'}, rd = 0 '\000', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 3 '\003', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 0, next = 0x0,
impl = 0x55555555f3d7 <do_jal>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) p *block->ir_head->next->next->next
Cannot access memory at address 0x0
(gdb) p *block->ir_head->next->next->next
Cannot access memory at address 0x0
(gdb) n
641 block->pc_end += is_compressed(insn) ? 2 : 4;
...
(gdb) n
872 match_pattern(rv, next_blk);
(gdb) n
Breakpoint 2, match_pattern (rv=0x0, block=0x0) at src/emulate.c:713
713 {
(gdb) n
716 for (i = 0, ir = block->ir_head; i < block->n_insn - 1;
(gdb) p *block->ir_head
$31 = {{imm = 0, rs3 = 0 '\000'}, rd = 5 '\005', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 1 '\001', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 0,
next = 0x7ffff6093058, impl = 0x55555555f25f <do_lui>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) p *block->ir_head->next
$32 = {{imm = 5, rs3 = 5 '\005'}, rd = 6 '\006', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 1 '\001', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 4,
next = 0x7ffff60930a8, impl = 0x55555555f25f <do_lui>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) p *block->ir_head->next->next
$33 = {{imm = 4, rs3 = 4 '\004'}, rd = 0 '\000', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 3 '\003', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 8, next = 0x0,
impl = 0x55555555f3d7 <do_jal>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
```
To facilitate observation, you can set a breakpoint at `b src/emulate.c:683`.
This way, you can monitor a complete block.
Additionally, I wrote a GDB script to make it easier to print out the instruction information of a block during the GDB testing process:
```shell
define print_ir_list
set $node = $arg0
while $node != 0
printf "Node at %p:\n", $node
p *$node
set $node = $node->next
end
end
```
Before using GDB test commands, you need to load the corresponding GDB script file (in this case, commands.gdb).
```shell
$ gdb -x ./commands.gdb --args ./build/rv32emu ./test.elf
```
The effect is as follows:
```shell
(gdb) print_ir_list block->ir_head
Node at 0x7ffff6093008:
$183 = {{imm = 0, rs3 = 0 '\000'}, rd = 5 '\005', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 1 '\001', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 65652,
next = 0x7ffff6093058, impl = 0x55555555f25f <do_lui>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
Node at 0x7ffff6093058:
$184 = {{imm = 5, rs3 = 5 '\005'}, rd = 5 '\005', rs1 = 5 '\005', rs2 = 0 '\000', opcode = 1 '\001', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 65656,
next = 0x7ffff60930a8, impl = 0x55555555f25f <do_lui>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
Node at 0x7ffff60930a8:
$185 = {{imm = 10, rs3 = 10 '\n'}, rd = 5 '\005', rs1 = 5 '\005', rs2 = 0 '\000', opcode = 19 '\023', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 65660,
next = 0x7ffff60930f8, impl = 0x5555555608ea <do_addi>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
Node at 0x7ffff60930f8:
$186 = {{imm = -3, rs3 = 253 '\375'}, rd = 5 '\005', rs1 = 5 '\005', rs2 = 0 '\000', opcode = 19 '\023', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 65664,
next = 0x7ffff6093148, impl = 0x5555555608ea <do_addi>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
Node at 0x7ffff6093148:
$187 = {{imm = 93, rs3 = 93 ']'}, rd = 17 '\021', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 19 '\023', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 65668,
next = 0x7ffff6093198, impl = 0x5555555608ea <do_addi>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
Node at 0x7ffff6093198:
$188 = {{imm = 0, rs3 = 0 '\000'}, rd = 10 '\n', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 19 '\023', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 65672,
next = 0x7ffff60931e8, impl = 0x5555555608ea <do_addi>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
Node at 0x7ffff60931e8:
$189 = {{imm = 0, rs3 = 0 '\000'}, rd = 0 '\000', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 39 '\'', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 65676,
next = 0x0, impl = 0x555555561cc9 <do_ecall>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
```
After a series of tests, it was finally discovered that the opcodes of various instructions were modified in `rv32_constopt.c`, which caused counting errors in the previous tests.
```c
CONSTOPT(andi, {
if (info->is_constant[ir->rs1]) {
ir->imm &= info->const_val[ir->rs1];
info->is_constant[ir->rd] = true;
info->const_val[ir->rd] = ir->imm;
ir->opcode = rv_insn_lui; //??
ir->impl = dispatch_table[ir->opcode];
} else
info->is_constant[ir->rd] = false;
})
CONSTOPT(add, {
if (info->is_constant[ir->rs1] && info->is_constant[ir->rs2]) {
info->is_constant[ir->rd] = true;
ir->imm = info->const_val[ir->rs1] + info->const_val[ir->rs2];
info->const_val[ir->rd] = ir->imm;
ir->opcode = rv_insn_lui; //??
ir->impl = dispatch_table[ir->opcode];
} else
info->is_constant[ir->rd] = false;
})
```
:::info
The reason behind this implementation is still unclear, and further modifications to the code are currently underway.
> Do send pull request(s) once you figure out the fixes.
:::
#### fused1_explanation_of_previous_question
This part uses constant folding technique. When the instruction requires register contents that are compile-time constants, the macro `CONSTOPT` will change the opcode of the instruction. After changing the opcode, it will be detected as consecutive lui instructions.
```c
.section .text
.global _start
_start:
lui x10, 0x12345
lui x11, 0x23456
add x12, x10, x11
addi x0, x0, 0
li a0, 0
li a7, 93
ecall
```
```shell
$ ./build/rv32emu add.elf
[DEBUG] fuse1 triggered: 3 consecutive LUI instructions.
[DEBUG] fuse1 triggered: 2 consecutive LUI instructions.
inferior exit code 0
```
From this test data, it can be observed that no consecutive lui instructions were detected before the `NOP` instruction(`addi x0, x0, 0`). This is because the register values are loaded into memory only after compilation is complete, so the macro `CONSTOPT` does not change their opcodes.
```c
.section .text
.global _start
value: .word 10
_start:
lui x10, 0x12345
lw x10, 0(x10)
add x11, x10, x10
addi x0, x0, 0
li a0, 0
li a7, 93
ecall
```
```shell
$ ./build/rv32emu add_NCF.elf
[DEBUG] fuse1 triggered: 2 consecutive LUI instructions.
inferior exit code 0
```
#### fused2 - `lui` + `add`
Register Dependency:
* In this scenario, the destination register of the lui instruction (ir->rd) is checked against the source registers of the add instruction (next_ir->rs1 and next_ir->rs2).
* If there is a dependency between lui and add, these two instructions are fused into a single instruction (rv_insn_fuse2) to optimize execution.
```diff
switch (ir->opcode) {
case rv_insn_lui:
next_ir = ir->next;
switch (next_ir->opcode) {
case rv_insn_add:
if (ir->rd == next_ir->rs2 || ir->rd == next_ir->rs1) {
+ printf("[DEBUG] fuse2 triggered: LUI + ADD\n");
ir->opcode = rv_insn_fuse2;
ir->rs2 = next_ir->rd;
if (ir->rd == next_ir->rs2)
ir->rs1 = next_ir->rs1;
else
ir->rs1 = next_ir->rs2;
ir->impl = dispatch_table[ir->opcode];
remove_next_nth_ir(rv, ir, block, 1);
}
break;
}
+ else {
+ printf("[DEBUG] fuse2 not triggered\n");
}
```
##### Test data 1_1
The following test data meets the fuse condition:
> The destination register of lw has a dependency relationship with the source register of add.
```c
.section .text
.global _start
_start:
lui x1, 0xABCDE
add x3, x1, x2
li a0, 0
li a7, 93
ecall
```
```shell
$ (base) boju@BAISP rv32emu % build/rv32emu test_fuse2.elf
[DEBUG] fuse2 triggered: LUI + ADD
inferior exit code 0
```
##### Test data 1_2
Adjust the following test data so that it does not meet the fuse condition:
> The destination register of lw does not have a dependency relationship with the source register of add.
```c
.section .text
.global _start
_start:
lui x1, 0xABCDE
add x3, x2, x2
li a0, 0
li a7, 93
ecall
```
```shell
$ (base) boju@BAISP rv32emu % build/rv32emu test_fuse2_2.elf
[DEBUG] fuse2 not triggered
inferior exit code 0
```
##### Test data 1_3
For the final test data, I replaced the add instruction with the sub instruction
to verify whether any false-positive cases would occur.
```shell
.section .text
.global _start
_start:
lui x1, 0xABCDE
sub x3, x1, x2
li a0, 0
li a7, 93
ecall
```
```shell
$ (base) boju@BAISP rv32emu % build/rv32emu test_fuse2_3.elf
inferior exit code 0
```
Based on the results, it is correct. However, this also indicates that the lui + sub sequence has potential for extension. Next, I will proceed to improve the implementation of this part.
#### fused3 - multiple `sw` & fused4 - multiple `lw`
`COMBINE_MEM_OPS(0)` and `COMBINE_MEM_OPS(1)` correspond to different types of memory accesses (specifically, `sw` or `lw`).
Assuming that `IIF(0)(rv_insn_lw, rv_insn_sw)` is called.
##### step1 : Expanding `IIF(0)`
The definition of `IIF(C)` is:
```c
#define IIF(c) PRIMITIVE_CAT(IIF_, c)
```
So when applying `IFF(0)`, we get:
```c
IIF(0) -> PRIMITIVE_CAT(IIF_,0)
```
##### step2 : Expanding `PRIMITIVE_CAT(IIF_,0)`
The definition of `PRIMITIVE_CAT(IIF_,0)` is:
```c
#define PRIMITIVE_CAT(a, ...) FIX_VC_BUG(a##__VA_ARGS__)
```
> Here, a = `IIF_`, and `__VA_ARGS__` = 0.
So we get:
```c
PRIMITIVE_CAT(IIF_, 0) -> FIX_VC_BUG(IIF_##0)
```
> `##` is token pasting operator, it will concatenate `IIF_` and `0` to form `IIF_0`.
Therefore, `IIF(0)` ultimately expands to `IIF_0`.
##### step3 : Expanding `IIF_0(rv_insn_lw, rv_insn_sw)`:
The definition of `IIF_0` is:
```c
#define IIF_0(t, ...) __VA_ARGS__
```
>Here, `t` = `rv_insn_lw`,
`__VA_ARGS__` = `rv_insn_sw`
Only the "second argument" or the "remaining arguments" are retained.
For `IIF_0(rv_insn_lw, rv_insn_sw)`, only `rv_insn_sw` is retained.
##### step4 : result
The entire macro essentially acts as a conditional expression, ultimately resulting in `rv_insn_sw`.
For testing purposes, I made the following modifications to the program.
```diff
#define COMBINE_MEM_OPS(RW) \
next_ir = ir->next; \
count = 1; \
while (1) { \
if (next_ir->opcode != IIF(RW)(rv_insn_lw, rv_insn_sw)) \
break; \
count++; \
if (!next_ir->next) \
break; \
next_ir = next_ir->next; \
} \
if (count > 1) { \
+ printf("[DEBUG] Triggered: %d consecutive %s instructions.\n", count, IIF(RW)("sw", "lw"));\
ir->opcode = IIF(RW)(rv_insn_fuse4, rv_insn_fuse3); \
ir->fuse = malloc(count * sizeof(opcode_fuse_t)); \
assert(ir->fuse); \
ir->imm2 = count; \
memcpy(ir->fuse, ir, sizeof(opcode_fuse_t)); \
ir->impl = dispatch_table[ir->opcode]; \
next_ir = ir->next; \
for (int j = 1; j < count; j++, next_ir = next_ir->next) \
memcpy(ir->fuse + j, next_ir, sizeof(opcode_fuse_t)); \
remove_next_nth_ir(rv, ir, block, count - 1); \
}
```
And generated the corresponding test data for verification:
##### Test data 1_1
```c
.section .text
.global _start
_start:
sw t1, 0(t0)
sw t1, 0(t0)
sw t1, 0(t0)
sw t1, 0(t0)
lw t2, 0(t0)
lw t2, 0(t0)
lw t2, 0(t0)
lw t2, 0(t0)
li a0, 0
li a7, 93
ecall
```
```shell
$ boju@BAISPdeMacBook-Air-138 rv32emu % ./build/rv32emu ./test.elf
[DEBUG] Triggered: 4 consecutive lw instructions.
[DEBUG] Triggered: 4 consecutive sw instructions.
inferior exit code 0
```
The result was different from what I expected; the order of the sw and lw instructions was reversed.
##### Test data 1_2
```c
.section .text
.global _start
_start:
lw t2, 0(t0)
lw t2, 0(t0)
lw t2, 0(t0)
lw t2, 0(t0)
li a0, 0
li a7, 93
ecall
```
```shell
$ boju@BAISPdeMacBook-Air-138 rv32emu % ./build/rv32emu ./test_lw.elf
[DEBUG] Triggered: 4 consecutive sw instructions.
inferior exit code 0
```
##### Test data 1_3
```c
.section .text
.global _start
_start:
sw t1, 0(t0)
sw t1, 0(t0)
sw t1, 0(t0)
sw t1, 0(t0)
li a0, 0
li a7, 93
ecall
```
```shell
$ boju@BAISPdeMacBook-Air-138 rv32emu % ./build/rv32emu ./test_sw.elf
[DEBUG] Triggered: 4 consecutive lw instructions.
```
:::info
After individually testing the `lw` and `sw` instructions, I have confirmed that there are some issues. However, I am not yet certain of the cause and am currently further narrowing down the problem.
:::
I found out that my debug program was wrong:
```c
/* When the condition is 0, select the second parameter */
#define IIF_0(t, ...) __VA_ARGS__
/* When the condition is 1, select the first parameter */
#define IIF_1(t, ...) t
```
And my debug program:
```c
printf("[DEBUG] Triggered: %d consecutive %s instructions.\n", count, IIF(RW)("sw", "lw"));
```
* If RW is 1: "sw" will be printed.
* If RW is 0: "lw" is printed.
This should be modified:
```diff
#define COMBINE_MEM_OPS(RW) \
next_ir = ir->next; \
count = 1; \
while (1) { \
if (next_ir->opcode != IIF(RW)(rv_insn_lw, rv_insn_sw)) \
break; \
count++; \
if (!next_ir->next) \
break; \
next_ir = next_ir->next; \
} \
if (count > 1) { \
- printf("[DEBUG] Triggered: %d consecutive %s instructions.\n", count, IIF(RW)("sw", "lw"));\
+ printf("[DEBUG] Triggered: %d consecutive %s instructions.\n", count, IIF(RW)("lw", "sw"));\
ir->opcode = IIF(RW)(rv_insn_fuse4, rv_insn_fuse3); \
ir->fuse = malloc(count * sizeof(opcode_fuse_t)); \
assert(ir->fuse); \
ir->imm2 = count; \
memcpy(ir->fuse, ir, sizeof(opcode_fuse_t)); \
ir->impl = dispatch_table[ir->opcode]; \
next_ir = ir->next; \
for (int j = 1; j < count; j++, next_ir = next_ir->next) \
memcpy(ir->fuse + j, next_ir, sizeof(opcode_fuse_t)); \
remove_next_nth_ir(rv, ir, block, count - 1); \
}
```
#### fused5 - Consecutive `slli`, `srli`, `srai` instructions
### Extending the Existing Macro-Operation Fusion
#### fused6 - `lui` + `sub`
> [commit 1faf723](https://github.com/yy214123/rv32emu/commit/1faf72307239c976d3e2f5e32b38c5b03f7b420e)
##### coremark
```shell
$ ./build/rv32emu ./build/riscv32/coremark
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 82814707
Total time (secs): 82.814707
Iterations/Sec : 966.011991
Iterations : 80000
Compiler version : GCC14.2.0
Compiler flags : -O2
Memory location : Main memory (heap)
Correct operation validated.
CoreMark 1.0 : 966.011991 / GCC14.2.0 -O2 / Main memory (heap)
inferior exit code 0
```
##### dhrystone
```shell
$ ./build/rv32emu ./build/riscv32/dhrystone
Dhrystone(1.1-mc), 500000000 passes, 240199144 microseconds, 1184 DMIPS
inferior exit code 0
```
Compared to the [baseline](###Baselines), both performances have significantly declined. This indicates that the expansion is not suitable.
## Problem Rocord
While following the environment setup instructions in the r32emu README.md,
I encountered an issue where the following command did not work:
```shell
$ sudo apt-get install llvm-18
```
My solution was to follow the instructions described in the [LLVM Debian/Ubuntu nightly packages](https://apt.llvm.org/):
```shell
$ wget https://apt.llvm.org/llvm.sh
$ chmod +x llvm.sh
$ sudo ./llvm.sh 18
```
:::warning
You don't have to enable JIT for this project. Concentrate on the improvements of RISC-V interpreter.
> okay.
:::