Final Project: Improve rv32emu performance

contributed by < HenryChiang >

tags: `RISC-V` `Computer Architecture` `Virtual Machine`

1. Introduction to Project

1.1. rv32emu

This is a simulator designed based on the RISC-V architecture, capable of executing ELF format files. The operation involves initially using an interpreter to translate the instructions read from the ELF file into corresponding C code. Finally, a JIT compiler is employed to compile the generated C code into the corresponding machine language.

1.2. Instruction Specialization

The goal is to create special functions for common instructions, allowing them to be translated into faster C code. The objective is to reduce the overhead during the execution stage of these common instructions, indirectly speeding up the translation speed of the interpreter. We will also apply this technology to this project to accelerate the execution speed of rv32emu.

1.3. Purpose

We will apply the technique of Instruction Specialization to rv32emu, improving its efficiency when executing coremark.elf, specifically the iteration/sec metric. For this, we need the following preparations:

Add decision for special opcode in the decode stage. To achieve this, we must introduce special opcodes and additional conditional checks to determine the corresponding special functions.
Integrate special functions into the execution stage. Due to the large number and certain similarity among the generated functions, we will use another Python script to generate these functions. Of course, the development of this Python script is also part of the preparation.

2.Generate Specialize Instruction

2.1. Example for addi instruction

addi is the most frequently used instruction in RISC-V assembly language programs. Therefore, it is the focal point of our specialization efforts. We will use addi to provide a simple introduction to our generation targets and the methods employed in the generation process.

2.2 Generate Object

Object	Description
special opcode	For example, in the case of addi, if the specialized instruction is `addi a0, a0, 1`, then the opcode would be `addi10`. (This is because the a0 register corresponds to the x10 register.)
selection	Selecting the opcode corresponding to the instruction is determined based on rd and rs1, for example, `if( rd == 10 && rs1 == 10){ ir->opcode = addi10}`
special function	Originally, the execution stage required fetching the instruction's rd and rs1 for execution. Now, the C code in the function can be simplified to: `rv[10] = rv[10] + ir->imm;`

2.3 Generate Method

Python can achieve simple script writing through powerful string operations (e.g., format) and File I/O functionality. This time, we use Python to generate the members of the three Instruction Specialization implementations mentioned above. Here are some examples of simple code:

Generate function


















#sample code

class Gen:
    
    self.Function = []
    
    def gen_RVFunc_instance(self, r0, r1):
        self.Function.append('{rv->X['+r0+'] = (int32_t) (rv->X['+r1+']) + ir->imm;}\n')


    def gen_RVFunc(self):
        for r0 in self.place(fmt, 0):
            for r1 in self.place(fmt, 1):
                    self.gen_RVFunc_instance(r0, r1)
                    

generator = Gen()
generator.gen_RVFunc()

Output part of C program


















#sample code

C_code_src_fmt = '''
    {{self.Function}}
'''

class Gen:
    
    self.Function = []
    
    def write(self, name='code'):        
        with open(name + '.c', 'w') as f:
            self.name = name
            self.Function = ''.join(self.Function)
            f.write(C_code_src_fmt.format(self=self))
            
generator.write()

3. Operating Process

3.1 Operational Process Overview

Here, we will explain the process according to the flow of the Interpreter. The execution of rv32emu is based on blocks. We then follow the steps for each block.

Firstly, we generate a block and store the decoded data structure (IR) into our block. A block, before encountering a jump instruction, stores a large number of IRs. We focus on IR and perform the first optimization.
IR is the result of instruction decoding, including opcode, rd, rs1, rs2, imm. This allows easy manipulation in the execution stage.
Finally, it enters the execution stage. We take the opcode of IR, match it with the dispatch table to find the corresponding execution function. For example, addi has a corresponding addi function, performing rd = rs1 + imm.

3.2 Operating Optimization Principles

First, the reason why instruction specialization works is that when we execute a program, it is done in blocks. If the same block is executed repeatedly, we don't need to decode that block again. We can directly enter the execution stage. Therefore, if we can reduce the overhead of the execution stage, we can further improve the execution speed of rv32emu. This is why we implement multiple specialized instructions in the execution stage because they can more effectively reduce dispatch overhead than conventional methods.

3.3 Implement Instruction Specialization

Here, we will apply the generated special opcode, selection, and function from the previous chapter, complemented by the information in Section 3.2.

First, during the decoding process into IR, if the instruction is known to be addi, we will further determine if this instruction is a special instruction, such as addi x10, x10, imm. This process is performed using selection, assigning a special opcode to the IR. In this example, the special opcode is addi10.
Optimization will be demonstrated in the execution stage. Here, the special opcode corresponds to a special function. These functions, as rd and rs1 are already known, can save time by avoiding two memory accesses, reducing the dispatch overhead in the execution stage.

3.4 Visualization of the Process

4. Specialize Strategy

4.1 specialized addi

Upon the teacher's advice, I analyzed the results of the GCC assemble the assembly language, and I found that in special cases, the ADDI instruction can be translated into other instructions. For instance, when rs1 == x0, the instruction is equivalent to assigning imm to rd, so GCC translates it into the li instruction. Another special case is when imm == 0, which is equivalent to assigning rs1 to rd, so it is specialized into the mv instruction. I optimized my ADDI instruction based on these two examples, resulting in a significant speed improvement due to the reduced number of operands.

inc and dec are also been specialized, if addi instruction immediate value is 1 or -1, and the rd register is as same as rs1,then the instruction will be specialized as inc and dec.

4.2 specialized jump

I specialized the jal instruction by observing the results of GCC assembly. In most cases, the target register is x1, a common scenario. Despite this, I noticed that the decode stage still searches for the target register rd. Therefore, after assigning the opcode as jal, I separately check if rd == x1. If true, I specialize it into jalx1.

4.3 specialized branch

In the standard RISCV32I, we have beq and bne instructions, but there is a lack of commonly needed instructions in C, such as beqz and bnez. Hence, I aimed to specialize these instructions. I implemented beqz and bnez in rv32emu separately from the RISCV-C instructions. I check if rs2 is x0, and if true, I specialize beq into beqz and bne into bnez.

4.4 other instruction

For instructions where rd == rs1 and are related to the ALU, I specialize them into new instructions. For example, in the case of addi x10, x10, 1, the new instruction is named addix10, and empirical results show an acceleration in execution speed.

4.5 presentation and comparison (on x86_64 machine)

We analyze the build/rv32emu to examine the effects of specialized instructions. First, we look at the addi instruction. We compare the non-specialized do_addi with the specialized do_addi025025 (addi x25, x25, imm). We run this on an x86_64 architecture computer, and here are the disassembled results. We observe a difference of four instructions between them, including 2 mov instructions and 2 movzbl instructions.

do_addi

000000000000a420 <do_addi.lto_priv.0>:
    a420:	44 0f b6 4e 05       	movzbl 0x5(%rsi),%r9d
    a425:	8b 06                	mov    (%rsi),%eax
    a427:	48 83 c2 01          	add    $0x1,%rdx
    a42b:	83 c1 04             	add    $0x4,%ecx
    a42e:	44 0f b6 46 04       	movzbl 0x4(%rsi),%r8d
    a433:	48 8b 76 20          	mov    0x20(%rsi),%rsi
    a437:	42 03 44 8f 68       	add    0x68(%rdi,%r9,4),%eax
    a43c:	42 89 44 87 68       	mov    %eax,0x68(%rdi,%r8,4)
    a441:	48 85 f6             	test   %rsi,%rsi
    a444:	74 0a                	je     a450 <do_addi.lto_priv.0+0x30>
    a446:	ff 66 28             	jmp    *0x28(%rsi)
    a449:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
    a450:	48 89 97 80 01 00 00 	mov    %rdx,0x180(%rdi)
    a457:	b8 01 00 00 00       	mov    $0x1,%eax
    a45c:	89 8f e8 00 00 00    	mov    %ecx,0xe8(%rdi)
    a462:	c3                   	ret    
    a463:	66 66 2e 0f 1f 84 00 	data16 cs nopw 0x0(%rax,%rax,1)
    a46a:	00 00 00 00 
    a46e:	66 90                	xchg   %ax,%ax

do_addi025025

0000000000013b00 <do_addi025025.lto_priv.0>:
   13b00:	8b 06                	mov    (%rsi),%eax
   13b02:	48 8b 76 20          	mov    0x20(%rsi),%rsi
   13b06:	48 83 c2 01          	add    $0x1,%rdx
   13b0a:	83 c1 04             	add    $0x4,%ecx
   13b0d:	01 87 cc 00 00 00    	add    %eax,0xcc(%rdi)
   13b13:	48 85 f6             	test   %rsi,%rsi
   13b16:	74 08                	je     13b20 <do_addi025025.lto_priv.0+0x20>
   13b18:	ff 66 28             	jmp    *0x28(%rsi)
   13b1b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
   13b20:	48 89 97 80 01 00 00 	mov    %rdx,0x180(%rdi)
   13b27:	b8 01 00 00 00       	mov    $0x1,%eax
   13b2c:	89 8f e8 00 00 00    	mov    %ecx,0xe8(%rdi)
   13b32:	c3                   	ret    
   13b33:	66 66 2e 0f 1f 84 00 	data16 cs nopw 0x0(%rax,%rax,1)
   13b3a:	00 00 00 00 
   13b3e:	66 90                	xchg   %ax,%ax

We also compared beq and beqz, and after specialization, there is a reduction of four instructions in total.

beq

0000000000009dc0 <do_beq.lto_priv.0>:
    9dc0:	0f b6 46 06          	movzbl 0x6(%rsi),%eax
    9dc4:	44 0f b6 46 05       	movzbl 0x5(%rsi),%r8d
    9dc9:	48 83 c2 01          	add    $0x1,%rdx
    9dcd:	8b 44 87 68          	mov    0x68(%rdi,%rax,4),%eax
    9dd1:	42 39 44 87 68       	cmp    %eax,0x68(%rdi,%r8,4)
    9dd6:	74 28                	je     9e00 <do_beq.lto_priv.0+0x40>
    9dd8:	4c 8b 46 38          	mov    0x38(%rsi),%r8
    9ddc:	c6 05 41 d2 01 00 00 	movb   $0x0,0x1d241(%rip)        # 27024 <is_branch_taken.lto_priv.0>
    9de3:	83 c1 04             	add    $0x4,%ecx
    9de6:	4d 85 c0             	test   %r8,%r8
    9de9:	74 35                	je     9e20 <do_beq.lto_priv.0+0x60>
    9deb:	89 0d 37 d2 01 00    	mov    %ecx,0x1d237(%rip)        # 27028 <last_pc.lto_priv.0>
    9df1:	49 8b 40 28          	mov    0x28(%r8),%rax
    9df5:	4c 89 c6             	mov    %r8,%rsi
    9df8:	ff e0                	jmp    *%rax
    9dfa:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
    9e00:	03 0e                	add    (%rsi),%ecx
    9e02:	48 8b 76 30          	mov    0x30(%rsi),%rsi
    9e06:	c6 05 17 d2 01 00 01 	movb   $0x1,0x1d217(%rip)        # 27024 <is_branch_taken.lto_priv.0>
    9e0d:	48 85 f6             	test   %rsi,%rsi
    9e10:	74 1e                	je     9e30 <do_beq.lto_priv.0+0x70>
    9e12:	89 0d 10 d2 01 00    	mov    %ecx,0x1d210(%rip)        # 27028 <last_pc.lto_priv.0>
    9e18:	ff 66 28             	jmp    *0x28(%rsi)
    9e1b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
    9e20:	48 8b 76 20          	mov    0x20(%rsi),%rsi
    9e24:	48 85 f6             	test   %rsi,%rsi
    9e27:	74 07                	je     9e30 <do_beq.lto_priv.0+0x70>
    9e29:	ff 66 28             	jmp    *0x28(%rsi)
    9e2c:	0f 1f 40 00          	nopl   0x0(%rax)
    9e30:	48 89 97 80 01 00 00 	mov    %rdx,0x180(%rdi)
    9e37:	b8 01 00 00 00       	mov    $0x1,%eax
    9e3c:	89 8f e8 00 00 00    	mov    %ecx,0xe8(%rdi)
    9e42:	c3                   	ret    
    9e43:	66 66 2e 0f 1f 84 00 	data16 cs nopw 0x0(%rax,%rax,1)
    9e4a:	00 00 00 00 
    9e4e:	66 90                	xchg   %ax,%ax

beqz

0000000000009e50 <do_beqz2.lto_priv.0>:
    9e50:	0f b6 46 05          	movzbl 0x5(%rsi),%eax
    9e54:	48 83 c2 01          	add    $0x1,%rdx
    9e58:	8b 44 87 68          	mov    0x68(%rdi,%rax,4),%eax
    9e5c:	85 c0                	test   %eax,%eax
    9e5e:	74 28                	je     9e88 <do_beqz2.lto_priv.0+0x38>
    9e60:	4c 8b 46 38          	mov    0x38(%rsi),%r8
    9e64:	c6 05 b9 d1 01 00 00 	movb   $0x0,0x1d1b9(%rip)        # 27024 <is_branch_taken.lto_priv.0>
    9e6b:	83 c1 04             	add    $0x4,%ecx
    9e6e:	4d 85 c0             	test   %r8,%r8
    9e71:	74 4d                	je     9ec0 <do_beqz2.lto_priv.0+0x70>
    9e73:	89 0d af d1 01 00    	mov    %ecx,0x1d1af(%rip)        # 27028 <last_pc.lto_priv.0>
    9e79:	49 8b 40 28          	mov    0x28(%r8),%rax
    9e7d:	4c 89 c6             	mov    %r8,%rsi
    9e80:	ff e0                	jmp    *%rax
    9e82:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
    9e88:	03 0e                	add    (%rsi),%ecx
    9e8a:	48 8b 76 30          	mov    0x30(%rsi),%rsi
    9e8e:	c6 05 8f d1 01 00 01 	movb   $0x1,0x1d18f(%rip)        # 27024 <is_branch_taken.lto_priv.0>
    9e95:	48 85 f6             	test   %rsi,%rsi
    9e98:	74 0e                	je     9ea8 <do_beqz2.lto_priv.0+0x58>
    9e9a:	89 0d 88 d1 01 00    	mov    %ecx,0x1d188(%rip)        # 27028 <last_pc.lto_priv.0>
    9ea0:	ff 66 28             	jmp    *0x28(%rsi)
    9ea3:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
    9ea8:	48 89 97 80 01 00 00 	mov    %rdx,0x180(%rdi)
    9eaf:	b8 01 00 00 00       	mov    $0x1,%eax
    9eb4:	89 8f e8 00 00 00    	mov    %ecx,0xe8(%rdi)
    9eba:	c3                   	ret    
    9ebb:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
    9ec0:	48 8b 76 20          	mov    0x20(%rsi),%rsi
    9ec4:	48 85 f6             	test   %rsi,%rsi
    9ec7:	74 df                	je     9ea8 <do_beqz2.lto_priv.0+0x58>
    9ec9:	ff 66 28             	jmp    *0x28(%rsi)
    9ecc:	0f 1f 40 00          	nopl   0x0(%rax)

5. Specialization for Benchmarks

5.1 Observing all addi instructions in the benchmark.

Pipe command

grep ADDI | sort | uniq -c

Interpreter function

do_addi(riscv_t *rv, rv_insn_t *ir){
     
     rv[ir->rd] = rv[ir->rs1] + ir->imm;
     printf("ADDI :(%d,%d)",ir->rd,ir->rs1);

}

utilization
Whenever build/rv32emu executes an addi instruction, it prints out the current set of addi instructions it is executing. Our goal is to use the Linux pipe command to concurrently run the benchmark and collect statistics on the occurrences of each ADDI instruction.

5.2 Observation Results

We will specialize based on the most frequently occurring (rd, rs1) combinations in ADDI. For example, in coremark, the most common occurrences are ADDI(12,12), ADDI(13,13), and ADDI(15,15). Specializing for these instructions can accelerate the execution speed of coremark.

5.3 Prioritize based on frequency of occurrence.

Because the decode phase uses an if-elif-else structure to determine whether it is a specialized instruction, the order of determination becomes crucial. We prioritize the most frequently occurring ones mentioned in section 5.2 as the first to be determined. This effectively speeds up the decoding process for frequently encountered instructions. An example is provided below:







  if(rd==12 && rs1==12){          //most frequent
      ir->opcode = addi012012;
  }else if(rd==16 && rs1==16){    //second frequent
      ir->opcode = addi016016;
  }
  ...                            //others

6. Performance Evaluation

6.1 CoreMark

We use CoreMark as the benchmark for our performance evaluation. CoreMark is commonly used for testing CPU performance in embedded systems. Its advantage lies in the fact that it does not include the runtime of libraries during testing; it focuses solely on the original code being tested. Moreover, it specifies the compiler specifications, making it convenient for comparisons across different platforms and compilers.

6.2 Compare

Compiler version : GCC13.2.0
Compiler flags : -Ofast -std=c99 -march=rv32im -mabi=ilp32 -DPERFORMANCE_RUN=1

Mircoprocessor: Core i7-6700

Compiler	w/ specialize insn.	CoreMark (Iterations/Sec)	Dhrystone (DIMPS)
gcc-13	NO	299.805127	343.0
gcc-13	YES	320.179300	359.0

DMIPS: Dhrystone MIPS (Million Instructions per Second)

6.3 Discussion and non-adapted solution

During the implementation process, I attempted to implement specialized instructions for various possible addi instructions (approximately 1024 types). The selection in the decode stage was changed from the original if structure to a switch structure when implementing more optimized instructions and reducing overhead. However, upon running the performance analysis for CoreMark again under these conditions, it was observed that the results were worse than the aforementioned version and the original rv32emu.

I believe the reasons are as follows:

Although multiple optimized instructions were implemented, these instructions did not appear frequently. Therefore, instead of optimizing, they increased the overhead in the decode stage.

In this implementation, I chose more frequently occurring instructions as optimization targets, and I believe this is a key factor in improving performance.

7. Evaluate Conclusion

After practical performance testing, the actual execution performance using instruction specialization shows a growth of 6%. I believe the possible reasons for the modest improvement in performance are:

Additional decode stage overhead, as mentioned earlier, the selection process, which occupies a significant portion and decrease the performance gained in the execution stage.
Limited optimization of instructions through instruction specialization. The author's optimization focuses only on instructions with a higher occurrence and the same rd and rs1 values. This characteristic is only present in instructions related to the ALU, and not in instructions like beq and lw, limiting the scope of optimization.

Although addi is the most common instruction in CoreMark, the following lw and sw instructions, which are next in frequency, were not effectively optimized. Coupled with the decode stage overhead, this leads to limited optimization effectiveness. In my opinion, if the benchmark test is computing-intensive, instruction specialization primarily focusing on the ALU should exhibit better performance.