# Final Project: Improve rv32emu Performance

contributed by < [HenryChiang](https://github.com/HenryChaing/rv32emu) >

###### tags: `RISC-V` `Computer Architecture` `Virtual Machine`

## 1. Introduction to the Project

### 1.1. rv32emu

rv32emu is a `simulator` designed around the `RISC-V` architecture, capable of executing ELF files. During operation, an interpreter first translates the instructions read from the ELF file into the corresponding C code; a JIT compiler can then compile the generated code into the corresponding machine language.

### 1.2. Instruction Specialization

The goal is to create `special functions` for common instructions, allowing them to be translated into `faster C code`. By reducing the overhead of these common instructions in the execution stage, we indirectly speed up the interpreter. We apply this technique in this project to accelerate the execution speed of rv32emu.

### 1.3. Purpose

We apply instruction specialization to rv32emu to improve its efficiency when executing `coremark.elf`, specifically the `Iterations/Sec` metric. This requires the following preparations:

* Add a check for the `special opcode` in the `decode stage`. To achieve this, we must introduce special opcodes and additional conditional checks that select the corresponding special functions.
* Integrate the `special functions` into the `execution stage`. Because the generated functions are numerous and largely similar, we use another `Python script` to generate them. Developing this Python script is also part of the preparation.

## 2. Generate Specialized Instructions

### 2.1. Example: the addi instruction

`addi` is the most frequently used instruction in RISC-V assembly programs, so it is the focal point of our specialization effort. We use `addi` to give a short introduction to our `generation targets` and the `methods` used in the generation process.

### 2.2 Generated Objects

| Object | Description |
|:---------:|:----------------------------------------------------------------:|
| special opcode | For example, if the specialized instruction is `addi a0, a0, 1`, the opcode becomes `addi10` (the a0 register corresponds to x10). |
| selection | The opcode for the instruction is selected based on rd and rs1, for example `if (rd == 10 && rs1 == 10) { ir->opcode = addi10; }`. |
| special function | Originally, the execution stage had to fetch the instruction's rd and rs1 before executing. Now the C code in the handler simplifies to `rv[10] = rv[10] + ir->imm;`. |
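To make the table concrete, the sketch below contrasts a generic handler with a specialized one. It is only an illustration: the `insn_t` and `cpu_t` types are simplified stand-ins for rv32emu's `rv_insn_t` and `riscv_t`, and the handler bodies are reduced to the arithmetic itself.

```c
#include <stdint.h>

/* Simplified types for illustration only; the real rv32emu structures
 * (rv_insn_t, riscv_t) carry more fields than shown here. */
typedef struct {
    uint8_t rd, rs1;
    int32_t imm;
} insn_t;

typedef struct {
    uint32_t X[32]; /* register file; X[10] is a0 */
} cpu_t;

/* Generic handler: rd and rs1 must be read from the IR at run time. */
void do_addi(cpu_t *rv, const insn_t *ir)
{
    rv->X[ir->rd] = rv->X[ir->rs1] + ir->imm;
}

/* Specialized handler for "addi a0, a0, imm": rd and rs1 are hard-coded
 * as x10, so the two IR field accesses disappear. */
void do_addi10(cpu_t *rv, const insn_t *ir)
{
    rv->X[10] = rv->X[10] + ir->imm;
}
```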
### 2.3 Generation Method

Python makes this kind of `script writing` simple thanks to its powerful string operations (e.g., `format`) and file I/O. We use Python to generate the three kinds of objects listed above. Here are some simplified excerpts:

* Generate function

```python=
# sample code
class Gen:
    def __init__(self):
        self.Function = []
        # sample register set to specialize on (x10..x15)
        self.regs = [str(r) for r in range(10, 16)]

    def gen_RVFunc_instance(self, r0, r1):
        self.Function.append(
            '{rv->X[' + r0 + '] = (int32_t) (rv->X[' + r1 + ']) + ir->imm;}\n')

    def gen_RVFunc(self):
        for r0 in self.regs:
            for r1 in self.regs:
                self.gen_RVFunc_instance(r0, r1)

generator = Gen()
generator.gen_RVFunc()
```

* Output part of the C program

```python=
# sample code
C_code_src_fmt = '''
{self.Function}
'''

class Gen:
    def __init__(self):
        self.Function = []

    def write(self, name='code'):
        with open(name + '.c', 'w') as f:
            self.name = name
            self.Function = ''.join(self.Function)
            f.write(C_code_src_fmt.format(self=self))

generator.write()
```

(In the full script, both excerpts are part of the same `Gen` class.)

## 3. Operating Process

### 3.1 Operational Process Overview

Here, we explain the process following `the flow of the interpreter`. rv32emu executes programs block by block, so we walk through the steps for each `block`.

1. First, we create a block and store the decoded `data structure (IR)` of each instruction in it. A block collects a large number of IRs until it reaches a jump instruction. The IR is where we perform the first optimization.
2. The IR is the result of instruction decoding and contains the `opcode`, `rd, rs1, rs2`, and `imm` fields, which makes it easy to manipulate in the execution stage.
3. Finally, execution begins. We take the `opcode` of each IR and look it up in the dispatch table to find the corresponding execution function. For example, `addi` has a corresponding addi function that performs `rd = rs1 + imm`.

### 3.2 Operating Optimization Principles

Instruction specialization works because programs are executed in blocks. If the same block is executed repeatedly, it does not need to be decoded again; we can enter the execution stage directly. Therefore, if we reduce the overhead of the execution stage, we further improve the execution speed of rv32emu. This is why we implement many specialized instructions in the execution stage: they reduce dispatch overhead more effectively than the conventional handlers.

### 3.3 Implement Instruction Specialization

Here, we apply the `special opcode`, `selection`, and `function` generated in the previous chapter, guided by the observations in Section 3.2.

1. First, while decoding into the IR, if the instruction is known to be `addi`, we further determine whether it is a special instruction such as `addi x10, x10, imm`. This is done by the `selection`, which assigns a `special opcode` to the IR. In this example, the `special opcode` is `addi10`.
2. The optimization shows up in the execution stage. There, the `special opcode` maps to a `special function`. Because `rd` and `rs1` are already known, these functions save time by avoiding two memory accesses, reducing the dispatch overhead of the execution stage.

### 3.4 Visualization of the Process

```flow
st=>start: rv_step start
e=>end: rv_step end
op=>operation: decode instruction to data structure IR
op3=>operation: assign special opcode
op4=>operation: assign opcode
op5=>operation: execute special function (faster)
op6=>operation: execute function
cond2=>condition: special instruction?
cond3=>condition: is instruction jump or branch?

st->op->cond2
cond2(yes)->op3
cond2(no)->op4
op3->op5->cond3
op4->op6->cond3
cond3(yes)->e
cond3(no)->op
```
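To make the block-based flow above concrete, here is a minimal, self-contained sketch. It is not the actual rv32emu code (the real emulator decodes into `rv_insn_t` objects and chains handlers within a block); the `insn_t`, `block_t`, and `cpu_t` types and the plain dispatch table are simplifications used only to show that a block decoded once pays just the execution-stage cost on every later run.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical, simplified types; rv32emu's real IR type is rv_insn_t. */
typedef enum { OP_ADDI, OP_ADDI10, OP_COUNT } opcode_t;

typedef struct {
    opcode_t opcode;
    uint8_t rd, rs1;
    int32_t imm;
} insn_t;

typedef struct { uint32_t X[32]; } cpu_t;

typedef struct {
    insn_t *ir;    /* IRs decoded once and cached for this block */
    size_t n_insn;
} block_t;

static void do_addi(cpu_t *rv, const insn_t *ir)
{
    rv->X[ir->rd] = rv->X[ir->rs1] + ir->imm;
}

static void do_addi10(cpu_t *rv, const insn_t *ir)
{
    rv->X[10] += ir->imm;   /* rd and rs1 are known to be x10 */
}

/* Dispatch table: the IR's opcode indexes the handler to run. */
static void (*const dispatch[OP_COUNT])(cpu_t *, const insn_t *) = {
    [OP_ADDI]   = do_addi,
    [OP_ADDI10] = do_addi10,
};

/* A block that has already been decoded is executed without re-decoding,
 * so shrinking the handler cost directly speeds up repeated blocks. */
void exec_block(cpu_t *rv, const block_t *blk)
{
    for (size_t i = 0; i < blk->n_insn; i++)
        dispatch[blk->ir[i].opcode](rv, &blk->ir[i]);
}
```

Shrinking what each handler does, as the specialized `do_addi10` illustrates, therefore directly shortens the hot path taken by frequently repeated blocks.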
## 4. Specialize Strategy

### 4.1 Specialized addi

Following the teacher's advice, I analyzed the assembly code produced by GCC and found that in special cases the ADDI instruction can be translated into other instructions. For instance, when `rs1 == x0`, the instruction is equivalent to assigning `imm` to `rd`, so GCC translates it into the **li** instruction. Another special case is `imm == 0`, which is equivalent to assigning `rs1` to `rd`, so it is specialized into the **mv** instruction. I optimized my ADDI instruction based on these two examples, which brought a significant speed improvement due to the reduced number of operands.

**inc** and **dec** are also specialized: if the immediate value of an addi instruction is `1` or `-1` and `rd` is the same register as `rs1`, the instruction is specialized into inc or dec, respectively.

### 4.2 Specialized jump

I specialized the `jal` instruction after observing GCC's assembly output. In most cases the target register is `x1`, which is a very common scenario, yet the decode stage still looks up the destination register rd. Therefore, after assigning the opcode as jal, I separately check whether `rd == x1`; if so, I specialize it into `jalx1`.

### 4.3 Specialized branch

The standard RV32I base ISA provides `beq` and `bne`, but it lacks dedicated encodings for instructions that C code needs frequently, such as `beqz` and `bnez`. Hence, I aimed to specialize these cases. I implemented beqz and bnez in rv32emu separately from the RISC-V C-extension instructions. I check whether `rs2 is x0`, and if so, **I specialize beq into beqz and bne into bnez**.

### 4.4 Other instructions

For ALU-related instructions where `rd == rs1`, I specialize them into new instructions. For example, for `addi x10, x10, 1`, the new instruction is named `addix10`, and empirical results show an acceleration in execution speed.
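A minimal sketch of the selection checks described in 4.1–4.3 is shown below. The opcode constants and the `specialize()` helper are hypothetical names used only for illustration; rv32emu performs these checks inside its decoder while filling in the IR.

```c
#include <stdint.h>

/* Illustrative opcode values; rv32emu defines its own opcode enumeration. */
enum {
    OP_ADDI, OP_LI, OP_MV, OP_INC, OP_DEC,
    OP_JAL, OP_JALX1,
    OP_BEQ, OP_BEQZ, OP_BNE, OP_BNEZ,
};

typedef struct {
    int opcode;
    uint8_t rd, rs1, rs2;
    int32_t imm;
} insn_t;

/* Decode-stage selection: once the base opcode is known, check whether
 * the operands match one of the special cases from 4.1-4.3. */
void specialize(insn_t *ir)
{
    switch (ir->opcode) {
    case OP_ADDI:
        if (ir->rs1 == 0)                           /* addi rd, x0, imm -> li  */
            ir->opcode = OP_LI;
        else if (ir->imm == 0)                      /* addi rd, rs1, 0  -> mv  */
            ir->opcode = OP_MV;
        else if (ir->rd == ir->rs1 && ir->imm == 1) /* addi rd, rd, 1   -> inc */
            ir->opcode = OP_INC;
        else if (ir->rd == ir->rs1 && ir->imm == -1)/* addi rd, rd, -1  -> dec */
            ir->opcode = OP_DEC;
        break;
    case OP_JAL:
        if (ir->rd == 1)                            /* jal x1, offset -> jalx1 */
            ir->opcode = OP_JALX1;
        break;
    case OP_BEQ:
        if (ir->rs2 == 0)                           /* beq rs1, x0 -> beqz */
            ir->opcode = OP_BEQZ;
        break;
    case OP_BNE:
        if (ir->rs2 == 0)                           /* bne rs1, x0 -> bnez */
            ir->opcode = OP_BNEZ;
        break;
    }
}
```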
### 4.5 Presentation and comparison (on an x86_64 machine)

We analyze `build/rv32emu` to examine the effect of the specialized instructions. First, we look at the `addi` instruction and compare the non-specialized `do_addi` with the specialized `do_addi025025` (addi x25, x25, imm). We run this on an `x86_64` machine, and the disassembled results are shown below. We observe a difference of four instructions between them, including 2 `mov` instructions and 2 `movzbl` instructions.

* do_addi

```
000000000000a420 <do_addi.lto_priv.0>:
    a420: 44 0f b6 4e 05          movzbl 0x5(%rsi),%r9d
    a425: 8b 06                   mov (%rsi),%eax
    a427: 48 83 c2 01             add $0x1,%rdx
    a42b: 83 c1 04                add $0x4,%ecx
    a42e: 44 0f b6 46 04          movzbl 0x4(%rsi),%r8d
    a433: 48 8b 76 20             mov 0x20(%rsi),%rsi
    a437: 42 03 44 8f 68          add 0x68(%rdi,%r9,4),%eax
    a43c: 42 89 44 87 68          mov %eax,0x68(%rdi,%r8,4)
    a441: 48 85 f6                test %rsi,%rsi
    a444: 74 0a                   je a450 <do_addi.lto_priv.0+0x30>
    a446: ff 66 28                jmp *0x28(%rsi)
    a449: 0f 1f 80 00 00 00 00    nopl 0x0(%rax)
    a450: 48 89 97 80 01 00 00    mov %rdx,0x180(%rdi)
    a457: b8 01 00 00 00          mov $0x1,%eax
    a45c: 89 8f e8 00 00 00       mov %ecx,0xe8(%rdi)
    a462: c3                      ret
    a463: 66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
    a46a: 00 00 00 00
    a46e: 66 90                   xchg %ax,%ax
```

* do_addi025025

```
0000000000013b00 <do_addi025025.lto_priv.0>:
    13b00: 8b 06                  mov (%rsi),%eax
    13b02: 48 8b 76 20            mov 0x20(%rsi),%rsi
    13b06: 48 83 c2 01            add $0x1,%rdx
    13b0a: 83 c1 04               add $0x4,%ecx
    13b0d: 01 87 cc 00 00 00      add %eax,0xcc(%rdi)
    13b13: 48 85 f6               test %rsi,%rsi
    13b16: 74 08                  je 13b20 <do_addi025025.lto_priv.0+0x20>
    13b18: ff 66 28               jmp *0x28(%rsi)
    13b1b: 0f 1f 44 00 00         nopl 0x0(%rax,%rax,1)
    13b20: 48 89 97 80 01 00 00   mov %rdx,0x180(%rdi)
    13b27: b8 01 00 00 00         mov $0x1,%eax
    13b2c: 89 8f e8 00 00 00      mov %ecx,0xe8(%rdi)
    13b32: c3                     ret
    13b33: 66 66 2e 0f 1f 84 00   data16 cs nopw 0x0(%rax,%rax,1)
    13b3a: 00 00 00 00
    13b3e: 66 90                  xchg %ax,%ax
```

We also compared `beq` and `beqz`; after specialization, there is a reduction of four instructions in total.

:::spoiler beq
```
0000000000009dc0 <do_beq.lto_priv.0>:
    9dc0: 0f b6 46 06             movzbl 0x6(%rsi),%eax
    9dc4: 44 0f b6 46 05          movzbl 0x5(%rsi),%r8d
    9dc9: 48 83 c2 01             add $0x1,%rdx
    9dcd: 8b 44 87 68             mov 0x68(%rdi,%rax,4),%eax
    9dd1: 42 39 44 87 68          cmp %eax,0x68(%rdi,%r8,4)
    9dd6: 74 28                   je 9e00 <do_beq.lto_priv.0+0x40>
    9dd8: 4c 8b 46 38             mov 0x38(%rsi),%r8
    9ddc: c6 05 41 d2 01 00 00    movb $0x0,0x1d241(%rip)   # 27024 <is_branch_taken.lto_priv.0>
    9de3: 83 c1 04                add $0x4,%ecx
    9de6: 4d 85 c0                test %r8,%r8
    9de9: 74 35                   je 9e20 <do_beq.lto_priv.0+0x60>
    9deb: 89 0d 37 d2 01 00       mov %ecx,0x1d237(%rip)   # 27028 <last_pc.lto_priv.0>
    9df1: 49 8b 40 28             mov 0x28(%r8),%rax
    9df5: 4c 89 c6                mov %r8,%rsi
    9df8: ff e0                   jmp *%rax
    9dfa: 66 0f 1f 44 00 00       nopw 0x0(%rax,%rax,1)
    9e00: 03 0e                   add (%rsi),%ecx
    9e02: 48 8b 76 30             mov 0x30(%rsi),%rsi
    9e06: c6 05 17 d2 01 00 01    movb $0x1,0x1d217(%rip)   # 27024 <is_branch_taken.lto_priv.0>
    9e0d: 48 85 f6                test %rsi,%rsi
    9e10: 74 1e                   je 9e30 <do_beq.lto_priv.0+0x70>
    9e12: 89 0d 10 d2 01 00       mov %ecx,0x1d210(%rip)   # 27028 <last_pc.lto_priv.0>
    9e18: ff 66 28                jmp *0x28(%rsi)
    9e1b: 0f 1f 44 00 00          nopl 0x0(%rax,%rax,1)
    9e20: 48 8b 76 20             mov 0x20(%rsi),%rsi
    9e24: 48 85 f6                test %rsi,%rsi
    9e27: 74 07                   je 9e30 <do_beq.lto_priv.0+0x70>
    9e29: ff 66 28                jmp *0x28(%rsi)
    9e2c: 0f 1f 40 00             nopl 0x0(%rax)
    9e30: 48 89 97 80 01 00 00    mov %rdx,0x180(%rdi)
    9e37: b8 01 00 00 00          mov $0x1,%eax
    9e3c: 89 8f e8 00 00 00       mov %ecx,0xe8(%rdi)
    9e42: c3                      ret
    9e43: 66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
    9e4a: 00 00 00 00
    9e4e: 66 90                   xchg %ax,%ax
```
:::

:::spoiler beqz
```
0000000000009e50 <do_beqz2.lto_priv.0>:
    9e50: 0f b6 46 05             movzbl 0x5(%rsi),%eax
    9e54: 48 83 c2 01             add $0x1,%rdx
    9e58: 8b 44 87 68             mov 0x68(%rdi,%rax,4),%eax
    9e5c: 85 c0                   test %eax,%eax
    9e5e: 74 28                   je 9e88 <do_beqz2.lto_priv.0+0x38>
    9e60: 4c 8b 46 38             mov 0x38(%rsi),%r8
    9e64: c6 05 b9 d1 01 00 00    movb $0x0,0x1d1b9(%rip)   # 27024 <is_branch_taken.lto_priv.0>
    9e6b: 83 c1 04                add $0x4,%ecx
    9e6e: 4d 85 c0                test %r8,%r8
    9e71: 74 4d                   je 9ec0 <do_beqz2.lto_priv.0+0x70>
    9e73: 89 0d af d1 01 00       mov %ecx,0x1d1af(%rip)   # 27028 <last_pc.lto_priv.0>
    9e79: 49 8b 40 28             mov 0x28(%r8),%rax
    9e7d: 4c 89 c6                mov %r8,%rsi
    9e80: ff e0                   jmp *%rax
    9e82: 66 0f 1f 44 00 00       nopw 0x0(%rax,%rax,1)
    9e88: 03 0e                   add (%rsi),%ecx
    9e8a: 48 8b 76 30             mov 0x30(%rsi),%rsi
    9e8e: c6 05 8f d1 01 00 01    movb $0x1,0x1d18f(%rip)   # 27024 <is_branch_taken.lto_priv.0>
    9e95: 48 85 f6                test %rsi,%rsi
    9e98: 74 0e                   je 9ea8 <do_beqz2.lto_priv.0+0x58>
    9e9a: 89 0d 88 d1 01 00       mov %ecx,0x1d188(%rip)   # 27028 <last_pc.lto_priv.0>
    9ea0: ff 66 28                jmp *0x28(%rsi)
    9ea3: 0f 1f 44 00 00          nopl 0x0(%rax,%rax,1)
    9ea8: 48 89 97 80 01 00 00    mov %rdx,0x180(%rdi)
    9eaf: b8 01 00 00 00          mov $0x1,%eax
    9eb4: 89 8f e8 00 00 00       mov %ecx,0xe8(%rdi)
    9eba: c3                      ret
    9ebb: 0f 1f 44 00 00          nopl 0x0(%rax,%rax,1)
    9ec0: 48 8b 76 20             mov 0x20(%rsi),%rsi
    9ec4: 48 85 f6                test %rsi,%rsi
    9ec7: 74 df                   je 9ea8 <do_beqz2.lto_priv.0+0x58>
    9ec9: ff 66 28                jmp *0x28(%rsi)
    9ecc: 0f 1f 40 00             nopl 0x0(%rax)
```
:::
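As a rough C-level picture of where the saved work comes from, the comparison functions below are a simplified sketch; they are not the actual `do_beq`/`do_beqz2` handler code in rv32emu, which also updates the PC and chains to the next block. The `beq_taken`/`beqz_taken` helpers and the `insn_t`/`cpu_t` types are hypothetical names for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint8_t rs1, rs2;
} insn_t;

typedef struct { uint32_t X[32]; } cpu_t;

/* Generic beq: loads both rs1 and rs2 indices from the IR, then reads two
 * registers before comparing. */
bool beq_taken(const cpu_t *rv, const insn_t *ir)
{
    return rv->X[ir->rs1] == rv->X[ir->rs2];
}

/* Specialized beqz: rs2 is known to be x0 (always zero), so one IR field
 * load and one register read disappear, matching the shorter prologue of
 * do_beqz2 in the disassembly above. */
bool beqz_taken(const cpu_t *rv, const insn_t *ir)
{
    return rv->X[ir->rs1] == 0;
}
```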
## 5. Specialization for Benchmarks

### 5.1 Observing all addi instructions in the benchmark

* Pipe command

  > grep ADDI | sort | uniq -c

* Interpreter function (temporarily instrumented for profiling)

```c
void do_addi(riscv_t *rv, rv_insn_t *ir)
{
    rv->X[ir->rd] = rv->X[ir->rs1] + ir->imm;
    printf("ADDI :(%d,%d)\n", ir->rd, ir->rs1);
}
```

* Utilization

Whenever `build/rv32emu` executes an `addi` instruction, the instrumented handler prints the (rd, rs1) pair of that instruction. Our goal is to run the benchmark and pipe the emulator's output through the Linux commands above to `collect statistics` on the occurrences of each ADDI operand combination.

### 5.2 Observation Results

We specialize based on the most frequently occurring (rd, rs1) combinations of ADDI. For example, in `coremark`, the most common occurrences are ADDI(12,12), ADDI(13,13), and ADDI(15,15). Specializing these instructions accelerates the execution of coremark.

### 5.3 Prioritize based on frequency of occurrence

Because the decode stage uses an `if`/`else if` chain to determine whether an instruction is a specialized one, the order of the checks matters. We put the most frequently occurring combinations from Section 5.2 first, which speeds up decoding for the instructions encountered most often. An example is shown below:

```c=
if (rd == 12 && rs1 == 12) {        // most frequent
    ir->opcode = addi012012;
} else if (rd == 16 && rs1 == 16) { // second most frequent
    ir->opcode = addi016016;
}
... // others
```

## 6. Performance Evaluation

### 6.1 CoreMark

We use `CoreMark` as the benchmark for our performance evaluation. CoreMark is commonly used to test `CPU performance` in embedded systems. Its advantage is that library runtime is excluded from the measurement; only the benchmark code itself is timed. Moreover, it requires the compiler configuration to be reported, which makes comparisons across platforms and compilers convenient.

### 6.2 Comparison

> Compiler version: GCC 13.2.0
> Compiler flags: -Ofast -std=c99 -march=rv32im -mabi=ilp32 -DPERFORMANCE_RUN=1

* Microprocessor: Core i7-6700

| Compiler | w/ specialized insn. | CoreMark (Iterations/Sec) | Dhrystone (DMIPS) |
| -------- | -------------------- | ------------------------- | ----------------- |
| gcc-13   | NO                   | 299.805127                | 343.0             |
| gcc-13   | YES                  | 320.179300                | 359.0             |

> DMIPS: Dhrystone MIPS (Million Instructions per Second)

### 6.3 Discussion and a non-adopted solution

During implementation, I attempted to add specialized instructions for the various possible `addi` instructions (approximately 1024 variants).
To implement this many specialized instructions while keeping the overhead down, the `selection` in the decode stage was changed from the original `if` chain to a `switch` statement. However, when the CoreMark performance analysis was run again under these conditions, the results were worse than both the version described above and the original rv32emu. I believe the reason is the following:

> Although many specialized instructions were implemented, most of them did not appear frequently. Instead of speeding things up, they increased the overhead of the decode stage.

In this implementation, I therefore chose the more frequently occurring instructions as specialization targets, and I believe this is a key factor in improving performance.

## 7. Conclusion

After practical performance testing, instruction specialization improves the actual execution performance by about `6%`. I believe the possible reasons the improvement is modest are:

* Additional decode-stage overhead: as mentioned earlier, the `selection` process takes a significant share of the time and eats into the performance gained in the execution stage.
* Limited coverage of instruction specialization: the optimization targets only frequently occurring instructions whose `rd` and `rs1` are the same register. This pattern appears only in ALU-related instructions, not in instructions like `beq` and `lw`, which limits the scope of the optimization.

Although `addi` is the most common instruction in CoreMark, the next most frequent instructions, `lw` and `sw`, were not effectively optimized. Coupled with the `decode-stage overhead`, this leads to limited overall gains. In my opinion, if the benchmark were more `computing-intensive`, instruction specialization focused primarily on the ALU should show better performance.