Try   HackMD

Speed up rv32emu interpreter

陳柏儒

Introduction

My task is to optimize the performance of rv32emu in the interpreter and extend its existing macro-operation fusion to accelerate the execution of CoreMark and Dhrystone benchmark programs.

The art of the language VM

Jitter GHM2017 / Video

Write down your thoughts and questions here. Also, summarize the talk.

Traditional interpretation & Direct-threaded interpretation

Traditional interpretation is switch-based, where each instruction execution involves checking all conditions in the switch-case statement. The detailed code is available in switch.c.

In this program, the following declaration is observed:

union value {
  enum insn in;
  long i;
  union value *p;
};

This indicates that a union value contains three members of different types:

  • enum insn in: This is an enumeration representing the "type of instruction"
  • long i: Used to store an integer or parameters involved in the computation process.
  • union value *p: A pointer to another union value of the same type

Then there is a declaration in the program:

union value program[] = {
    // Assign 2000000000 into %r0.
    {.in = insn_set},  // 0
    {.i = 2000000000}, // 1
    {.i = 0},          // 2

    // Assign -1 into %r1.
    {.in = insn_set},  // 3
    {.i = -1},         // 4
    {.i = 1},          // 5

    // Sum %r0 and %r1 into %r0.
    {.in = insn_add},  // 6
    {.i = 0},          // 7
    {.i = 1},          // 8
    {.i = 0},          // 9

This is an array of type union value, where each element is a single union value. This array can be thought of as "a sequence of instructions and parameters," with every three or four elements forming one complete instruction statement. The comments in the code provide a mapping to:

  • Instruction Type: Represented by .in = something. This specifies the operation (e.g., insn_set, insn_add, etc.).
    Parameters: Represented by .i = number or .p = program + offset. These parameters are the operands or addresses associated with the instruction.
  • In direct-threaded interpretation, each instruction contains the execution address of the next instruction and immediately jumps to it after execution (using goto * to the code pointer).

Main program execution flow:

  1. Register Declaration
long regs[10];
  1. Interpreter Loop
const union value *insn = program;

while (true)
  switch (insn->in)
  {
    ...
  }

Instruction Decoding:
In each iteration, the switch (insn->in) statement checks the value of insn->in, which is an enumerated type indicating the kind of instruction to execute (e.g., insn_set, insn_add, insn_bnz, etc.).

Instruction Execution:
The corresponding case in the switch statement executes the appropriate logic for the instruction.


Direct-threaded interpretation


VM specialized instructions

Commutativity of Instructions

For certain operations (e.g., addition), they are commutative

add r0, r1, r2 and add r0, r2, r1 produce the same result.

Instruction Rewriting

Assume the original instruction is:

add r0, r1, r2   # r0 <- r1 + r2

The slide suggests rewriting it into two instructions, with the general logic as follows:

  1. First, copy the content of r2 to r0.
  2. Then, perform addition: add r1 and the value just copied into r0, and store the result back into r0.

In RISC-V, we could write it as:

mv r0, r2        # r0 <- r2
add r0, r1, r0   # r0 <- r1 + r0

However, this rewrite has limitations. It is only effective when r2 ≠ r0.

I don't understand the explanation in this section.
From the changes, it seems that a single instruction using three registers is split into two instructions, each using two registers.
What are the advantages of doing this?


Accelerate RISC-V Instruction Set Simulation by Tiered JIT Compilation
4.1 Macro-Operation Fusion (MOP Fusion)
QuickInterp - Improving interpreter performance with superinstructions

Benchmarks

In rv32emu/build/riscv32, you can find the executable files for coremark and dhrystone.

$ make tool                        

CC	build/rv_histogram.o
LD	build/rv_histogram

After compilation, I can use the following commands to view the RV32 instructions/registers in the coremark and dhrystone programs:

$ ./build/rv_histogram  ./build/riscv32/coremark
$ ./build/rv_histogram -r ./build/riscv32/coremark

-r: output usage of registers(default is usage of instructions)

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

$ ./build/rv_histogram  ./build/riscv32/dhrystone
$ ./build/rv_histogram -r ./build/riscv32/dhrystone

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

Baselines

To compare the differences before and after optimization, I must first execute coremark and dhrystone to check their original performance.

coremark

$ ./build/rv32emu ./build/riscv32/coremark

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 43442522
Total time (secs): 43.442522
Iterations/Sec   : 1841.513713
Iterations       : 80000
Compiler version : GCC14.2.0
Compiler flags   : -O2 
Memory location  : Main memory (heap)
Correct operation validated.
CoreMark 1.0 : 1841.513713 / GCC14.2.0 -O2  / Main memory (heap)
inferior exit code 0

dhrystone

$ ./build/rv32emu ./build/riscv32/dhrystone

Dhrystone(1.1-mc), 500000000 passes, 91538320 microseconds, 3108 DMIPS
inferior exit code 0

rv32emu

emulate.c

Here, we can see the block_translate function. Its purpose is to read instructions and group consecutive instructions into individual blocks.

FORCE_INLINE bool insn_is_branch(uint8_t opcode)
{
    switch (opcode) {
#define _(inst, can_branch, insn_len, translatable, reg_mask) \
    IIF(can_branch)                                           \
    (case rv_insn_##inst:, )
        RV_INSN_LIST
#undef _
        return true;
    }
    return false;
}

static void block_translate(riscv_t *rv, block_t *block)
{
...
    /* stop on branch */
        if (insn_is_branch(ir->opcode)) {
            if (insn_is_indirect_branch(ir->opcode)) {
                ir->branch_table = calloc(1, sizeof(branch_history_table_t));
                assert(ir->branch_table);
                memset(ir->branch_table->PC, -1,
                       sizeof(uint32_t) * HISTORY_SIZE);
            }
            break;
        }
...
}

In decode.h, within the #define RV_INSN_LIST, we can see that instructions such as jal, jalr, beq, and bne have their can_branch property set to 1.
Therefore, when dividing blocks, as soon as an instruction with can_branch set to 1 is encountered, it is treated as the endpoint of the current block.

The focus of my task this time is on modifying the Macro-Operation Fusion logic. This part of the program's logic is located within the match_pattern function.

We can try implementing a new fusion by identifying the most frequently occurring patterns that are not currently fused in rv32emu, as the more frequent the pattern, the greater the potential benefit.

Test the existing macro-operation fusion

fused1 - Consecutive lui instructions

case rv_insn_lui:
    next_ir = ir->next;
    switch (next_ir->opcode) {
    case rv_insn_lui:
        count = 1;
        while (1) {
            if (!IF_insn(next_ir, lui))
                break;
            count++;
            if (!next_ir->next)
                break;
            next_ir = next_ir->next;
        }
        if (count > 1) {
+       printf("[DEBUG] fuse1 triggered: %d consecutive LUI instructions.\n", count);
            ir->opcode = rv_insn_fuse1;
            ir->fuse = malloc(count * sizeof(opcode_fuse_t));
            assert(ir->fuse);
            ir->imm2 = count;
            memcpy(ir->fuse, ir, sizeof(opcode_fuse_t));
            ir->impl = dispatch_table[ir->opcode];
            next_ir = ir->next;
            for (int j = 1; j < count; j++, next_ir = next_ir->next)
                memcpy(ir->fuse + j, next_ir, sizeof(opcode_fuse_t));
            remove_next_nth_ir(rv, ir, block, count - 1);
        }
        break;
    }
    break;
Test data 1_1
.section .text
.global _start

_start:
    lui x10, 0x12345
    lui x11, 0x23456
    lui x12, 0x34567
    lui x13, 0x45678
    lui x14, 0x56789
 
    li a0, 0
    li a7, 93
    ecall

Compile and convert to a .elf file executable on rv32emu.

$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -nostartfiles -nostdlib -o <file_name>.elf <file_name>.s

$ ./build/rv32emu test_fuse1.elf      

[DEBUG] fuse1 triggered: 7 consecutive LUI instructions.

The result here does not meet my expectations.
In reality, there are only five lui instructions, but the result shows seven.
I suspect that the li instruction might have been expanded, but I could not find the corresponding implementation in the project.

The problem has been solved, please refer here

Test data 1_2

To verify my hypothesis, I removed the li instruction.

.section .text
.global _start

_start:
    lui x10, 0x12345
    lui x11, 0x23456
    lui x12, 0x34567
    lui x13, 0x45678
    lui x14, 0x56789
 
-   li a0, 0
-   li a7, 93
+   addi a0, x0, 0
+   addi a7, x0, 93
    ecall
$ ./build/rv32emu test_fuse1.elf       

[DEBUG] fuse1 triggered: 7 consecutive LUI instructions.

But the result remains the same.

Test data 1_3
.section .text
.global _start

_start:
    lui x10, 0x12345
    lui x11, 0x23456
    lui x12, 0x34567
    lui x13, 0x45678
    lui x14, 0x56789
    
+   add x15, x10, x11
+   add x15, x10, x11
 
    li a0, 0
    li a7, 93
    ecall
$ ./build/rv32emu test_fuse1.elf       

[DEBUG] fuse1 triggered: 9 consecutive LUI instructions.

I used additional test data and discovered that the add instruction is also identified as consecutive lui instructions.

The problem has been solved, please refer here

I tested the original hello.elf in the project.

$ ./build/rv32emu ./build/hello.elf

[DEBUG] fuse1 triggered: 2 consecutive LUI instructions.
[DEBUG] fuse1 triggered: 4 consecutive LUI instructions.
Hello World!
Hello World!
Hello World!
Hello World!
Hello World!
[DEBUG] fuse1 triggered: 2 consecutive LUI instructions.
inferior exit code 0

However, I actually disassembled hello.elf using the following command:

$ riscv64-unknown-elf-objdump -d -M no-aliases ./build/hello.elf
Disassembly of section .text:

00000000 <_start>:
   0:   00000293                addi    t0,zero,0
   4:   00500313                addi    t1,zero,5
   8:   0040006f                jal     zero,c <_start+0xc>
   c:   00000013                addi    zero,zero,0

00000010 <loop>:
  10:   02628063                beq     t0,t1,30 <end>
  14:   04000893                addi    a7,zero,64
  18:   00100513                addi    a0,zero,1
  1c:   03c00593                addi    a1,zero,60
  20:   00d00613                addi    a2,zero,13
  24:   00000073                ecall
  28:   00128293                addi    t0,t0,1
  2c:   fe5ff06f                jal     zero,10 <loop>

00000030 <end>:
  30:   05d00893                addi    a7,zero,93
  34:   00000513                addi    a0,zero,0
  38:   00000073                ecall

Currently, I need to use GDB to understand the overall execution lifecycle of the program for this issue.

$ gdb --args ./build/rv32emu ./build/hello.elf
(gdb) b match_pattern
Breakpoint 1 at 0x55555556b7bb: file src/emulate.c, line 732.
(gdb) r
Starting program: /home/boju/rv32emu/build/rv32emu ./build/hello.elf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, match_pattern (rv=0x0, block=0x0) at src/emulate.c:732
732     {
(gdb) p ir
$1 = (rv_insn_t *) 0x7ffff6093008
(gdb) p ir->next
$2 = (struct rv_insn *) 0x7ffff6093058
(gdb) p ir->next->next
$3 = (struct rv_insn *) 0x7ffff60930a8
(gdb) p *ir
$4 = {{imm = 0, rs3 = 0 '\000'}, rd = 5 '\005', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 1 '\001', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 0, 
  next = 0x7ffff6093058, impl = 0x55555555f25f <do_lui>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) p *ir->next
$5 = {{imm = 5, rs3 = 5 '\005'}, rd = 6 '\006', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 1 '\001', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 4, 
  next = 0x7ffff60930a8, impl = 0x55555555f25f <do_lui>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) p *ir->next->next
$6 = {{imm = 4, rs3 = 4 '\004'}, rd = 0 '\000', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 3 '\003', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 8, next = 0x0, 
  impl = 0x55555555f3d7 <do_jal>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}

Based on the GDB output, we can analyze the contents of ir, ir->next, and ir->next->next step by step and compare them with the disassembled code:

Contents of ir

$4 = {
    {imm = 0, rs3 = 0 '\000'},
    rd = 5 '\005',
    rs1 = 0 '\000',
    rs2 = 0 '\000',
    opcode = 1 '\001',
    shamt = 0 '\000',
    rm = 0 '\000',
    imm2 = 0,
    fuse = 0x0,
    pc = 0,
    next = 0x7ffff6093058,
    impl = 0x55555555f25f <do_lui>,
    branch_taken = 0x0,
    branch_untaken = 0x0,
    branch_table = 0x0
}
  • pc = 0:
    This corresponds to instruction at address 0x0.
    In the disassembly, the instruction at 0x0 is:

    ​​​​0: 00000293  addi t0, zero, 0
    

    However, the opcode = 1 corresponds to lui (based on the impl field pointing to do_lui), indicating the instruction was misdecoded. It should have been addi.

  • imm = 0:
    The immediate value is 0, which matches addi t0, zero, 0.
    However, the impl field points to do_lui instead of do_addi, further confirming the decoding issue.

  • rd = 5:
    The destination register is x5 (which corresponds to t0 in RISC-V), and this part is correct.

Contents of ir->next

$5 = {
    {imm = 5, rs3 = 5 '\005'},
    rd = 6 '\006',
    rs1 = 0 '\000',
    rs2 = 0 '\000',
    opcode = 1 '\001',
    shamt = 0 '\000',
    rm = 0 '\000',
    imm2 = 0,
    fuse = 0x0,
    pc = 4,
    next = 0x7ffff60930a8,
    impl = 0x55555555f25f <do_lui>,
    branch_taken = 0x0,
    branch_untaken = 0x0,
    branch_table = 0x0
}
  • pc = 0:
    This corresponds to instruction at address 0x4.
    In the disassembly, the instruction at 0x4 is:

    ​​​​4: 00500313 addi t1, zero, 5
    

    Again, the opcode = 1 is decoded as lui instead of addi, indicating another decoding issue.

  • imm = 5:
    The immediate value is 5, which matches addi t0, zero, 0.

  • rd = 5:
    The destination register is x6 (which corresponds to t1 in RISC-V), and this part is correct.

Contents of ir->next->next

$6 = {
    {imm = 4, rs3 = 4 '\004'},
    rd = 0 '\000',
    rs1 = 0 '\000',
    rs2 = 0 '\000',
    opcode = 3 '\003',
    shamt = 0 '\000',
    rm = 0 '\000',
    imm2 = 0,
    fuse = 0x0,
    pc = 8,
    next = 0x0,
    impl = 0x55555555f3d7 <do_jal>,
    branch_taken = 0x0,
    branch_untaken = 0x0,
    branch_table = 0x0
}

In this test,I inserted breakpoints in block_translate and match_pattern.
The following are the observed results:

Breakpoint 1, block_translate (rv=0x55555555efeb <block_alloc+29>, block=0x7fffffffd7c0) at src/emulate.c:607
607     {

...

(gdb) p *ir
$10 = {{imm = 0, rs3 = 0 '\000'}, rd = 5 '\005', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 19 '\023', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 0, next = 0x0, 
  impl = 0x5555555608ea <do_addi>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) c
Continuing.

Breakpoint 2, match_pattern (rv=0x0, block=0x0) at src/emulate.c:713
713     {

...

(gdb) n
719             rv_insn_t *next_ir = NULL;
(gdb) p *ir
$11 = {{imm = 0, rs3 = 0 '\000'}, rd = 5 '\005', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 1 '\001', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 0, 
  next = 0x7ffff6093058, impl = 0x55555555f25f <do_lui>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) p *block->ir_head
$28 = {{imm = 0, rs3 = 0 '\000'}, rd = 5 '\005', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 19 '\023', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 0, 
  next = 0x7ffff6093058, impl = 0x5555555608ea <do_addi>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) p *block->ir_head->next
$29 = {{imm = 5, rs3 = 5 '\005'}, rd = 6 '\006', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 19 '\023', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 4, 
  next = 0x7ffff60930a8, impl = 0x5555555608ea <do_addi>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) p *block->ir_head->next->next
$30 = {{imm = 4, rs3 = 4 '\004'}, rd = 0 '\000', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 3 '\003', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 0, next = 0x0, 
  impl = 0x55555555f3d7 <do_jal>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) p *block->ir_head->next->next->next
Cannot access memory at address 0x0
(gdb) p *block->ir_head->next->next->next
Cannot access memory at address 0x0
(gdb) n
641             block->pc_end += is_compressed(insn) ? 2 : 4;

...

(gdb) n
872         match_pattern(rv, next_blk);
(gdb) n

Breakpoint 2, match_pattern (rv=0x0, block=0x0) at src/emulate.c:713
713     {
(gdb) n
716         for (i = 0, ir = block->ir_head; i < block->n_insn - 1;
(gdb) p *block->ir_head
$31 = {{imm = 0, rs3 = 0 '\000'}, rd = 5 '\005', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 1 '\001', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 0, 
  next = 0x7ffff6093058, impl = 0x55555555f25f <do_lui>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) p *block->ir_head->next
$32 = {{imm = 5, rs3 = 5 '\005'}, rd = 6 '\006', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 1 '\001', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 4, 
  next = 0x7ffff60930a8, impl = 0x55555555f25f <do_lui>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
(gdb) p *block->ir_head->next->next
$33 = {{imm = 4, rs3 = 4 '\004'}, rd = 0 '\000', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 3 '\003', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 8, next = 0x0, 
  impl = 0x55555555f3d7 <do_jal>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}

To facilitate observation, you can set a breakpoint at b src/emulate.c:683.
This way, you can monitor a complete block.

Additionally, I wrote a GDB script to make it easier to print out the instruction information of a block during the GDB testing process:

define print_ir_list
  set $node = $arg0
  while $node != 0
    printf "Node at %p:\n", $node
    p *$node
    set $node = $node->next
  end
end

Before using GDB test commands, you need to load the corresponding GDB script file (in this case, commands.gdb).

$ gdb -x ./commands.gdb --args ./build/rv32emu ./test.elf

The effect is as follows:

(gdb) print_ir_list block->ir_head
Node at 0x7ffff6093008:
$183 = {{imm = 0, rs3 = 0 '\000'}, rd = 5 '\005', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 1 '\001', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 65652, 
  next = 0x7ffff6093058, impl = 0x55555555f25f <do_lui>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
Node at 0x7ffff6093058:
$184 = {{imm = 5, rs3 = 5 '\005'}, rd = 5 '\005', rs1 = 5 '\005', rs2 = 0 '\000', opcode = 1 '\001', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 65656, 
  next = 0x7ffff60930a8, impl = 0x55555555f25f <do_lui>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
Node at 0x7ffff60930a8:
$185 = {{imm = 10, rs3 = 10 '\n'}, rd = 5 '\005', rs1 = 5 '\005', rs2 = 0 '\000', opcode = 19 '\023', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 65660, 
  next = 0x7ffff60930f8, impl = 0x5555555608ea <do_addi>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
Node at 0x7ffff60930f8:
$186 = {{imm = -3, rs3 = 253 '\375'}, rd = 5 '\005', rs1 = 5 '\005', rs2 = 0 '\000', opcode = 19 '\023', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 65664, 
  next = 0x7ffff6093148, impl = 0x5555555608ea <do_addi>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
Node at 0x7ffff6093148:
$187 = {{imm = 93, rs3 = 93 ']'}, rd = 17 '\021', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 19 '\023', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 65668, 
  next = 0x7ffff6093198, impl = 0x5555555608ea <do_addi>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
Node at 0x7ffff6093198:
$188 = {{imm = 0, rs3 = 0 '\000'}, rd = 10 '\n', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 19 '\023', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 65672, 
  next = 0x7ffff60931e8, impl = 0x5555555608ea <do_addi>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}
Node at 0x7ffff60931e8:
$189 = {{imm = 0, rs3 = 0 '\000'}, rd = 0 '\000', rs1 = 0 '\000', rs2 = 0 '\000', opcode = 39 '\'', shamt = 0 '\000', rm = 0 '\000', imm2 = 0, fuse = 0x0, pc = 65676, 
  next = 0x0, impl = 0x555555561cc9 <do_ecall>, branch_taken = 0x0, branch_untaken = 0x0, branch_table = 0x0}

After a series of tests, it was finally discovered that the opcodes of various instructions were modified in rv32_constopt.c, which caused counting errors in the previous tests.

CONSTOPT(andi, {
    if (info->is_constant[ir->rs1]) {
        ir->imm &= info->const_val[ir->rs1];
        info->is_constant[ir->rd] = true;
        info->const_val[ir->rd] = ir->imm;
        ir->opcode = rv_insn_lui; //??
        ir->impl = dispatch_table[ir->opcode];
    } else
        info->is_constant[ir->rd] = false;
})
    
CONSTOPT(add, {
    if (info->is_constant[ir->rs1] && info->is_constant[ir->rs2]) {
        info->is_constant[ir->rd] = true;
        ir->imm = info->const_val[ir->rs1] + info->const_val[ir->rs2];
        info->const_val[ir->rd] = ir->imm;
        ir->opcode = rv_insn_lui; //??
        ir->impl = dispatch_table[ir->opcode];
    } else
        info->is_constant[ir->rd] = false;
})

The reason behind this implementation is still unclear, and further modifications to the code are currently underway.

Do send pull request(s) once you figure out the fixes.

fused1_explanation_of_previous_question

This part uses constant folding technique. When the instruction requires register contents that are compile-time constants, the macro CONSTOPT will change the opcode of the instruction. After changing the opcode, it will be detected as consecutive lui instructions.

.section .text
.global _start

_start:

    lui x10, 0x12345
    lui x11, 0x23456
    add x12, x10, x11   
    
    addi x0, x0, 0
   
    li a0, 0
    li a7, 93
    ecall
$ ./build/rv32emu add.elf
[DEBUG] fuse1 triggered: 3 consecutive LUI instructions.
[DEBUG] fuse1 triggered: 2 consecutive LUI instructions.
inferior exit code 0

From this test data, it can be observed that no consecutive lui instructions were detected before the NOP instruction(addi x0, x0, 0). This is because the register values are loaded into memory only after compilation is complete, so the macro CONSTOPT does not change their opcodes.

.section .text
.global _start
value: .word 10

_start:
    lui x10, 0x12345
    lw x10, 0(x10)
    add x11, x10, x10   
    
    addi x0, x0, 0
    
    li a0, 0
    li a7, 93
    ecall
$ ./build/rv32emu add_NCF.elf
[DEBUG] fuse1 triggered: 2 consecutive LUI instructions.
inferior exit code 0

fused2 - lui + add

Register Dependency:

  • In this scenario, the destination register of the lui instruction (ir->rd) is checked against the source registers of the add instruction (next_ir->rs1 and next_ir->rs2).
  • If there is a dependency between lui and add, these two instructions are fused into a single instruction (rv_insn_fuse2) to optimize execution.
switch (ir->opcode) {
    case rv_insn_lui:
        next_ir = ir->next;
        switch (next_ir->opcode) {
        case rv_insn_add:
            if (ir->rd == next_ir->rs2 || ir->rd == next_ir->rs1) {
+               printf("[DEBUG] fuse2 triggered: LUI + ADD\n");
                ir->opcode = rv_insn_fuse2;
                ir->rs2 = next_ir->rd;
                if (ir->rd == next_ir->rs2)
                    ir->rs1 = next_ir->rs1;
                else
                    ir->rs1 = next_ir->rs2;
                ir->impl = dispatch_table[ir->opcode];
                remove_next_nth_ir(rv, ir, block, 1);
            }
            break;
        }
+       else {
+           printf("[DEBUG] fuse2 not triggered\n");
        }
Test data 1_1

The following test data meets the fuse condition:

The destination register of lw has a dependency relationship with the source register of add.

.section .text
.global _start

_start:
    lui x1, 0xABCDE           
    add x3, x1, x2            

    li a0, 0
    li a7, 93
    ecall
$ (base) boju@BAISP rv32emu % build/rv32emu test_fuse2.elf

[DEBUG] fuse2 triggered: LUI + ADD
inferior exit code 0
Test data 1_2

Adjust the following test data so that it does not meet the fuse condition:

The destination register of lw does not have a dependency relationship with the source register of add.

.section .text
.global _start

_start:
    lui x1, 0xABCDE           
    add x3, x2, x2            

    li a0, 0
    li a7, 93
    ecall
$ (base) boju@BAISP rv32emu % build/rv32emu test_fuse2_2.elf

[DEBUG] fuse2 not triggered
inferior exit code 0
Test data 1_3

For the final test data, I replaced the add instruction with the sub instruction
to verify whether any false-positive cases would occur.

.section .text
.global _start

_start:
    lui x1, 0xABCDE           
    sub x3, x1, x2            

    li a0, 0
    li a7, 93
    ecall
$ (base) boju@BAISP rv32emu % build/rv32emu test_fuse2_3.elf

inferior exit code 0

Based on the results, it is correct. However, this also indicates that the lui + sub sequence has potential for extension. Next, I will proceed to improve the implementation of this part.

fused3 - multiple sw & fused4 - multiple lw

COMBINE_MEM_OPS(0) and COMBINE_MEM_OPS(1) correspond to different types of memory accesses (specifically, sw or lw).

Assuming that IIF(0)(rv_insn_lw, rv_insn_sw) is called.

step1 : Expanding IIF(0)

The definition of IIF(C) is:

#define IIF(c) PRIMITIVE_CAT(IIF_, c)

So when applying IFF(0), we get:

IIF(0) -> PRIMITIVE_CAT(IIF_,0)
step2 : Expanding PRIMITIVE_CAT(IIF_,0)

The definition of PRIMITIVE_CAT(IIF_,0) is:

#define PRIMITIVE_CAT(a, ...) FIX_VC_BUG(a##__VA_ARGS__)

Here, a = IIF_, and __VA_ARGS__ = 0.

So we get:

PRIMITIVE_CAT(IIF_, 0) -> FIX_VC_BUG(IIF_##0)

## is token pasting operator, it will concatenate IIF_ and 0 to form IIF_0.

Therefore, IIF(0) ultimately expands to IIF_0.

step3 : Expanding IIF_0(rv_insn_lw, rv_insn_sw):

The definition of IIF_0 is:

#define IIF_0(t, ...) __VA_ARGS__

Here, t = rv_insn_lw,
__VA_ARGS__ = rv_insn_sw

Only the "second argument" or the "remaining arguments" are retained.

For IIF_0(rv_insn_lw, rv_insn_sw), only rv_insn_sw is retained.

step4 : result

The entire macro essentially acts as a conditional expression, ultimately resulting in rv_insn_sw.

For testing purposes, I made the following modifications to the program.

#define COMBINE_MEM_OPS(RW)                                       \
    next_ir = ir->next;                                           \
    count = 1;                                                    \
    while (1) {                                                   \
        if (next_ir->opcode != IIF(RW)(rv_insn_lw, rv_insn_sw))   \
            break;                                                \
        count++;                                                  \
        if (!next_ir->next)                                       \
            break;                                                \
        next_ir = next_ir->next;                                  \
    }                                                             \
    if (count > 1) {                                              \
+       printf("[DEBUG] Triggered: %d consecutive %s instructions.\n", count, IIF(RW)("sw", "lw"));\
        ir->opcode = IIF(RW)(rv_insn_fuse4, rv_insn_fuse3);       \
        ir->fuse = malloc(count * sizeof(opcode_fuse_t));         \
        assert(ir->fuse);                                         \
        ir->imm2 = count;                                         \
        memcpy(ir->fuse, ir, sizeof(opcode_fuse_t));              \
        ir->impl = dispatch_table[ir->opcode];                    \
        next_ir = ir->next;                                       \
        for (int j = 1; j < count; j++, next_ir = next_ir->next)  \
            memcpy(ir->fuse + j, next_ir, sizeof(opcode_fuse_t)); \
        remove_next_nth_ir(rv, ir, block, count - 1);             \
    }

And generated the corresponding test data for verification:

Test data 1_1
.section .text
.global _start

_start:
           
    sw t1, 0(t0)
    sw t1, 0(t0) 
    sw t1, 0(t0) 
    sw t1, 0(t0)           

    lw t2, 0(t0)
    lw t2, 0(t0) 
    lw t2, 0(t0) 
    lw t2, 0(t0)           

    li a0, 0              
    li a7, 93             
    ecall
$ boju@BAISPdeMacBook-Air-138 rv32emu % ./build/rv32emu ./test.elf
[DEBUG] Triggered: 4 consecutive lw instructions.
[DEBUG] Triggered: 4 consecutive sw instructions.
inferior exit code 0

The result was different from what I expected; the order of the sw and lw instructions was reversed.

Test data 1_2
.section .text
.global _start

_start:
           
    lw t2, 0(t0)
    lw t2, 0(t0) 
    lw t2, 0(t0) 
    lw t2, 0(t0)           

    li a0, 0              
    li a7, 93             
    ecall
$ boju@BAISPdeMacBook-Air-138 rv32emu % ./build/rv32emu ./test_lw.elf
[DEBUG] Triggered: 4 consecutive sw instructions.
inferior exit code 0
Test data 1_3
.section .text
.global _start

_start:
           
    sw t1, 0(t0)
    sw t1, 0(t0) 
    sw t1, 0(t0) 
    sw t1, 0(t0)           
        
    li a0, 0              
    li a7, 93             
    ecall
$ boju@BAISPdeMacBook-Air-138 rv32emu % ./build/rv32emu ./test_sw.elf
[DEBUG] Triggered: 4 consecutive lw instructions.

After individually testing the lw and sw instructions, I have confirmed that there are some issues. However, I am not yet certain of the cause and am currently further narrowing down the problem.

I found out that my debug program was wrong:

/* When the condition is 0, select the second parameter */
#define IIF_0(t, ...) __VA_ARGS__
/* When the condition is 1, select the first parameter */
#define IIF_1(t, ...) t

And my debug program:

printf("[DEBUG] Triggered: %d consecutive %s instructions.\n", count, IIF(RW)("sw", "lw"));
  • If RW is 1: "sw" will be printed.
  • If RW is 0: "lw" is printed.

This should be modified:

#define COMBINE_MEM_OPS(RW)                                       \
    next_ir = ir->next;                                           \
    count = 1;                                                    \
    while (1) {                                                   \
        if (next_ir->opcode != IIF(RW)(rv_insn_lw, rv_insn_sw))   \
            break;                                                \
        count++;                                                  \
        if (!next_ir->next)                                       \
            break;                                                \
        next_ir = next_ir->next;                                  \
    }                                                             \
    if (count > 1) {                                              \
-       printf("[DEBUG] Triggered: %d consecutive %s instructions.\n", count, IIF(RW)("sw", "lw"));\
+       printf("[DEBUG] Triggered: %d consecutive %s instructions.\n", count, IIF(RW)("lw", "sw"));\
        ir->opcode = IIF(RW)(rv_insn_fuse4, rv_insn_fuse3);       \
        ir->fuse = malloc(count * sizeof(opcode_fuse_t));         \
        assert(ir->fuse);                                         \
        ir->imm2 = count;                                         \
        memcpy(ir->fuse, ir, sizeof(opcode_fuse_t));              \
        ir->impl = dispatch_table[ir->opcode];                    \
        next_ir = ir->next;                                       \
        for (int j = 1; j < count; j++, next_ir = next_ir->next)  \
            memcpy(ir->fuse + j, next_ir, sizeof(opcode_fuse_t)); \
        remove_next_nth_ir(rv, ir, block, count - 1);             \
    }

fused5 - Consecutive slli, srli, srai instructions

Extending the Existing Macro-Operation Fusion

fused6 - lui + sub

commit 1faf723

coremark
$ ./build/rv32emu ./build/riscv32/coremark

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 82814707
Total time (secs): 82.814707
Iterations/Sec   : 966.011991
Iterations       : 80000
Compiler version : GCC14.2.0
Compiler flags   : -O2 
Memory location  : Main memory (heap)
Correct operation validated.
CoreMark 1.0 : 966.011991 / GCC14.2.0 -O2  / Main memory (heap)
inferior exit code 0
dhrystone
$ ./build/rv32emu ./build/riscv32/dhrystone

Dhrystone(1.1-mc), 500000000 passes, 240199144 microseconds, 1184 DMIPS
inferior exit code 0

Compared to the baseline, both performances have significantly declined. This indicates that the expansion is not suitable.

Problem Rocord

While following the environment setup instructions in the r32emu README.md,
I encountered an issue where the following command did not work:

$ sudo apt-get install llvm-18

My solution was to follow the instructions described in the LLVM Debian/Ubuntu nightly packages:

$ wget https://apt.llvm.org/llvm.sh
$ chmod +x llvm.sh
$ sudo ./llvm.sh 18

You don't have to enable JIT for this project. Concentrate on the improvements of RISC-V interpreter.

okay.