# Contribute to rv32emu-next
## Goal

Building on sammer1077's contribution, make rv32emu-next pass riscv-arch-test completely.
- If technical problems come up, solve them and open a pull request.
- Provide rv32emu run-time information (CPU cycle, branch, and instruction counts) and the top 10 most frequent instructions.
- Optional: integrate RV32C into rv32emu-next based on an existing pull request (PR#3 or PR#4).
## Important Note

I found that the compressed extension is needed by some of the instruction tests, so I started integrating RV32C into rv32emu-next. I chose PR#3 as the base for the integration.
## RISCV-Arch-Test
The RISC-V architecture test checks that a CPU model meets the RISC-V specification.
1. The test suite includes a reference signature.
2. Our model runs a test-specific program that overwrites data in a designated memory region called the test signature.
3. The test signature is compared with the reference signature. If they are exactly equal, the compliance test passes.

After compiling and running the tests, the ELF files can be found in the `riscv-arch-test/work/C` directories. Use `riscv-none-embed-objdump -D` to get more information.
```assembly=
...
800000f8 <rvtest_code_begin>:
800000f8: 00003417 auipc s0,0x3
800000fc: f1840413 addi s0,s0,-232 # 80003010 <begin_signature>
80000100 <inst_0>:
80000100: 80000e37 lui t3,0x80000
80000104: 00000b93 li s7,0
80000108: 9bf2 add s7,s7,t3
8000010a: 01742023 sw s7,0(s0)
...
```
This is the test code for `c.add`. It executes `c.add` and stores the result to `0(s0)`, which is the `begin_signature` address.
```assembly
..
Disassembly of section .data:
80003000 <rvtest_data_begin>:
80003000: cafe sw t6,84(sp)
80003002: babe fsd fa5,368(sp)
80003004 <rvtest_data_end>:
...
80003010 <begin_signature>:
80003010: deadbeef jal t4,7ffde5fa <offset+0x7ffde532>
80003014: deadbeef jal t4,7ffde5fe <offset+0x7ffde536>
80003018: deadbeef jal t4,7ffde602 <offset+0x7ffde53a>
8000301c: deadbeef jal t4,7ffde606 <offset+0x7ffde53e>
...
```
This is the `.data` section of the test code. The region after `begin_signature` stores the test signature; it is initialized to `deadbeef` and overwritten at run time.
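
The comparison step can be sketched in C as follows. This is only an illustration: the function names and `SIG_FILL` are hypothetical, and the sketch compares words in memory, whereas the real framework works on dumped signature files.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill pattern seen in the dump above; the name SIG_FILL is hypothetical. */
#define SIG_FILL 0xdeadbeefu

/* Count the words where the test signature differs from the reference. */
static size_t count_signature_mismatches(const uint32_t *test_sig,
                                         const uint32_t *ref_sig,
                                         size_t n_words)
{
    assert(test_sig && ref_sig);
    size_t mismatches = 0;
    for (size_t i = 0; i < n_words; i++) {
        if (test_sig[i] != ref_sig[i])
            mismatches++;
    }
    return mismatches;
}

/* Count the words that still hold the fill pattern, i.e. were never written. */
static size_t count_untouched_words(const uint32_t *test_sig, size_t n_words)
{
    assert(test_sig);
    size_t untouched = 0;
    for (size_t i = 0; i < n_words; i++) {
        if (test_sig[i] == SIG_FILL)
            untouched++;
    }
    return untouched;
}
```

A test passes only when the mismatch count is zero. A large number of untouched words usually means the program stopped before it reached the stores.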
### The Key of Compliance Test
The compliance test checks the following for each specific instruction:
- Functionality
- Correctness of the result
- `x0` must never be modified
- Decode correctness
- In special cases the decode result is "Reserved" or "HINT"
- For a reserved encoding or a HINT, the instruction must do nothing

If the signature differs in any way, the compliance test fails.
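
As an illustration of the `x0` rule, here is a minimal sketch of a register write helper. `write_reg` and `N_REGS` are hypothetical names; rv32emu-next may enforce the rule in a different way.

```c
#include <assert.h>
#include <stdint.h>

#define N_REGS 32 /* RV32 has 32 integer registers, x0..x31 */

/* Hypothetical helper: write `value` to register `rd`, dropping writes to x0
 * so that x0 always reads as zero. */
static void write_reg(uint32_t regs[N_REGS], uint32_t rd, uint32_t value)
{
    assert(rd < N_REGS);
    if (rd != 0)
        regs[rd] = value;
}
```

Routing every destination write through one helper like this means a HINT encoded with rd=x0 leaves the register file unchanged without any extra check in the handler.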
## RV32C, the standard extension for compressed instructions
A compressed instruction is a shorter encoding of an instruction, available in these special cases:
- The immediate or address offset is small
- One of the registers is x0, x1, or x2
- The destination register is the same as the first source register
- The registers used are the 8 most popular ones

The C extension is compatible with the other standard instructions: compressed and 32-bit instructions can be intermixed, and instructions are aligned on 16-bit boundaries.
- The next PC is PC+2 after a compressed instruction runs
- An RV32C instruction is only 16 bits long and exists only for these specific cases
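
To make the encoding concrete, the sketch below decodes the `9bf2` halfword (`c.add s7,t3`) from the test listing above and expands it to the equivalent 32-bit `add`. The struct and function names are hypothetical and not part of rv32emu-next.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a CR-format compressed instruction (hypothetical struct). */
typedef struct {
    uint8_t funct4; /* bits 15:12 */
    uint8_t rd_rs1; /* bits 11:7, destination and first source register */
    uint8_t rs2;    /* bits 6:2 */
    uint8_t op;     /* bits 1:0, never 0b11 for a compressed instruction */
} cr_fields_t;

static cr_fields_t decode_cr(uint16_t inst)
{
    cr_fields_t f;
    f.funct4 = (inst >> 12) & 0xf;
    f.rd_rs1 = (inst >> 7) & 0x1f;
    f.rs2 = (inst >> 2) & 0x1f;
    f.op = inst & 0x3;
    assert(f.op != 0x3); /* 0b11 marks a 32-bit instruction */
    return f;
}

/* C.ADD expands to `add rd, rd, rs2`
 * (R-type, opcode 0x33, funct3 = funct7 = 0). */
static uint32_t expand_c_add(cr_fields_t f)
{
    return ((uint32_t) f.rs2 << 20) | ((uint32_t) f.rd_rs1 << 15) |
           ((uint32_t) f.rd_rs1 << 7) | 0x33u;
}
```

For `0x9bf2` the decoder yields register 23 (`s7`) as both destination and first source and register 28 (`t3`) as the second source. One shared 5-bit field covers rd and rs1, which is how the 16-bit form saves space.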
## Progress of pull request PR#3
PR#3 contributors:
- ccs100203 <ccs100203@gmail.com>
- Uduru0522 <kurasiki.homura@gmail.com>

The following tests fail:
- c.lui
- c.srli
- c.srai
- c.andi
- c.jalr
- c.jal
- c.ebreak
- c.addi16sp

PR#3 also does not support the computed goto implementation.
## How is RV32C compatible with the other instruction sets?
For every instruction that rv32emu-next fetches, we can check the lowest 2 bits.
- If they are 3 (`0b11`), it is a standard uncompressed instruction, and the next PC should be PC+4.
- Otherwise, it is a compressed instruction, and the next PC should be PC+2.

In the implementation:
```C
while (rv->csr_cycle < cycles_target && !rv->halt) {
    // fetch the next instruction
    inst = rv->io.mem_ifetch(rv, rv->PC);
    if ((inst & 3) == 3) {
        // standard uncompressed instruction
        index = (inst & INST_6_2) >> 2;
        // dispatch this opcode
        TABLE_TYPE op = jump_table[index];
        assert(op);
        rv->inst_len = INST_32;
        if (!op(rv, inst))
            break;
        // increment the cycles csr
        rv->csr_cycle++;
    } else {
        // TODO: compressed instruction
        // index = funct3 (bits 15:13) moved to bits 4:2, plus opcode (bits 1:0);
        // the inner parentheses are required because >> binds tighter than &
        const uint16_t c_index = ((inst & FR_C_15_13) >> 11) | (inst & FR_C_1_0);
        // TODO: table implement
        const c_opcode_t op = c_opcodes[c_index];
        assert(op);
        rv->inst_len = INST_16;
        if (!op(rv, inst))
            break;
        // increment the cycles csr
        rv->csr_cycle++;
    }
}
```
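
The length check and the index computation can be tried in isolation with the sketch below. The mask names come from the loop above, but their values here are assumptions, and the helper functions are hypothetical. The last function spells out how the index expression is parsed when the inner parentheses are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed mask values; only the names come from the loop above. */
#define FR_C_15_13 0xE000u /* funct3 of a compressed instruction */
#define FR_C_1_0 0x0003u   /* opcode (quadrant) bits */

/* Instruction length in bytes, decided by the lowest two bits. */
static uint32_t inst_length(uint32_t inst)
{
    return ((inst & 3) == 3) ? 4 : 2;
}

/* Table index: funct3 in bits 4:2, opcode in bits 1:0. */
static uint16_t rvc_index(uint32_t inst)
{
    assert((inst & 3) != 3);
    return (uint16_t) (((inst & FR_C_15_13) >> 11) | (inst & FR_C_1_0));
}

/* How `inst & FR_C_15_13 >> 11` is parsed: the mask is shifted, not the
 * masked value, so funct3 is lost. */
static uint16_t rvc_index_as_parsed_without_parens(uint32_t inst)
{
    return (uint16_t) ((inst & (FR_C_15_13 >> 11)) | (inst & FR_C_1_0));
}
```

For the `9082` halfword (`c.jalr ra`) that appears in the c.jalr test, the correct index is 18, while the unparenthesized form yields 2 and would dispatch to the wrong handler.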
## How to debug with the Compliance Test
After running the compliance test, we can see which instruction tests failed. To debug them:

1. Check and record each failed instruction, then go to the work directories in riscv-arch-test and find the corresponding `.elf` and `.diff` files.
   - `.diff`: shows the differences between the test signature and the reference signature.
   - `.elf`: open it with `riscv-none-embed-objdump -D` to see the instructions of the test code.
2. Run `./rv32emu --trace FailInstructionTest.elf`. The `--trace` argument makes rv32emu print the number of each instruction it runs.
   - I recommend adding some prints of values such as rs1, rs2, rd, and imm to the code as debug messages.
3. The command prints all the run-time information. Check each line, especially the instruction under test.
   - Use the instruction decode rules to calculate the result by hand and compare it with the printed message.
   - If something is wrong, use that conclusion to locate the faulty code.
4. Note that some instruction tests use other instructions, so make sure the other instructions used in the test code are also correct.
5. An exception test such as c.ebreak jumps to the trap handler and resolves the trap inside the test.
   - We cannot stop immediately when the trap occurs.
   - If we stop the program immediately, the test signature contains many `deadbeef` words compared with the reference signature.
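
A debug message of the kind recommended in step 2 could look like the sketch below, which decodes a `c.andi` instruction and formats its fields. All names are hypothetical and the output format is only a suggestion; the sample value `99f1` is the `andi a1,a1,-4` halfword from the c.jalr test listing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Decoded fields of C.ANDI (hypothetical struct for debugging). */
typedef struct {
    unsigned rd; /* rd' field mapped to x8..x15 */
    int32_t imm; /* sign-extended 6-bit immediate */
} c_andi_fields_t;

static c_andi_fields_t decode_c_andi(uint16_t inst)
{
    c_andi_fields_t f;
    /* inst[12] -> imm[5], inst[6:2] -> imm[4:0] */
    const uint32_t raw = ((inst >> 7) & 0x20u) | ((inst >> 2) & 0x1fu);
    f.rd = 8 + ((inst >> 7) & 0x7u);
    f.imm = (raw & 0x20u) ? (int32_t) raw - 64 : (int32_t) raw;
    return f;
}

/* Write one trace line into `buf`; returns the snprintf result. */
static int format_c_andi(char *buf, size_t len, uint32_t pc, uint16_t inst)
{
    assert(buf && len > 0);
    const c_andi_fields_t f = decode_c_andi(inst);
    return snprintf(buf, len, "pc=%08x inst=%04x rd=x%u imm=%d",
                    (unsigned) pc, (unsigned) inst, f.rd, (int) f.imm);
}
```

Comparing such a line with the objdump listing shows quickly whether the decoder or the arithmetic is at fault.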
## Example
Fixing arithmetic errors.
### c.lui

The `c.lui` test failed in the compliance test.
Note that:
> - c.lui loads the non-zero 6-bit immediate field into bits 17-12 of the destination register, clears the bottom 12 bits, and sign-extends bit 17 into all higher bits of the destination
> - The code points with nzimm=0 are reserved
> - The remaining code points with rd=x0 are HINTs
> - The remaining code points with rd=x2 correspond to c.addi16sp
```C
static bool c_op_lui(struct riscv_t *rv, uint16_t inst)
{
    const uint16_t rd = c_dec_rd(inst);
    if (rd == 2) {
        // C.ADDI16SP
        uint32_t tmp = (inst & 0x1000) >> 3;
        tmp |= (inst & 0x40) >> 2;  // fixed; the failing version was: tmp |= (inst & 0x40)
        tmp |= (inst & 0x20) << 1;
        tmp |= (inst & 0x18) << 4;
        tmp |= (inst & 0x4) << 3;
        const uint32_t imm = (tmp & 0x200) ? (0xfffffc00 | tmp) : tmp;
        if (imm != 0)
            rv->X[rd] += imm;
        else {
            // imm == 0 is reserved (branch added)
        }
    } else if (rd != 0) {
        // C.LUI
        uint32_t tmp = (inst & 0x1000) << 5 | (inst & 0x7c) << 10;
        const int32_t imm = (tmp & 0x20000) ? (0xfffc0000 | tmp) : tmp;
        if (imm != 0)
            rv->X[rd] = imm;
        else {
            // imm == 0 is reserved
        }
    } else {
        // HINTS
    }
    rv->PC += rv->inst_len;
    return true;
}
```
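
The two immediate decoders can be checked by hand against known encodings. The sketch below isolates them as stand-alone functions; the function names are hypothetical and the sample halfwords were encoded by hand for this note.

```c
#include <assert.h>
#include <stdint.h>

/* C.LUI: inst[12] -> imm[17], inst[6:2] -> imm[16:12], then sign-extend. */
static int32_t c_lui_imm(uint16_t inst)
{
    const uint32_t tmp = ((uint32_t) (inst & 0x1000) << 5) |
                         ((uint32_t) (inst & 0x7c) << 10);
    return (int32_t) ((tmp & 0x20000) ? (0xfffc0000u | tmp) : tmp);
}

/* C.ADDI16SP: inst[12|6|5|4:3|2] -> imm[9|4|6|8:7|5], then sign-extend. */
static int32_t c_addi16sp_imm(uint16_t inst)
{
    uint32_t tmp = (uint32_t) (inst & 0x1000) >> 3; /* imm[9]   */
    tmp |= (uint32_t) (inst & 0x40) >> 2;           /* imm[4]   */
    tmp |= (uint32_t) (inst & 0x20) << 1;           /* imm[6]   */
    tmp |= (uint32_t) (inst & 0x18) << 4;           /* imm[8:7] */
    tmp |= (uint32_t) (inst & 0x4) << 3;            /* imm[5]   */
    return (int32_t) ((tmp & 0x200) ? (0xfffffc00u | tmp) : tmp);
}
```

For example, `0x6141` encodes `c.addi16sp` with only imm[4] set, so the decoder returns 16. Without the `>> 2` shift that bit would stay at position 6 and the result would be 64.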
### c.jalr
> c.jalr expands to jalr x1,0(rs1)

The relevant part of the test code:
```
80000230 <inst_8>:
80000230: 00000097 auipc ra,0x0
80000234: 01008093 addi ra,ra,16 # 80000240 <inst_8+0x10>
80000238: 9082 jalr ra
8000023a: 0020c093 xori ra,ra,2
8000023e: a019 j 80000244 <inst_8+0x14>
80000240: 0030c093 xori ra,ra,3
80000244: 00000597 auipc a1,0x0
80000248: fec58593 addi a1,a1,-20 # 80000230 <inst_8>
8000024c: 99f1 andi a1,a1,-4
8000024e: 40b080b3 sub ra,ra,a1
80000252: 02152023 sw ra,32(a0)
```
`c.jalr x` saves the next PC to `ra` and jumps to the address held in register `x`. The test exercises the special case `c.jalr ra`, where the source register is `ra` itself.
If we save the next PC to `ra` first, the value of `ra` **is replaced by the next PC**, and the emulator then **jumps to the next PC** instead of the original target.
#### Erroneous part
The next PC is saved directly to `ra`, and only then is the jump target read from register `x`, which has already been overwritten.
```C=
// C.JALR
rv->X[1] = rv->PC + 2;
// bug: when rs1 is ra, X[rs1] already holds PC + 2 at this point
rv->PC = rv->X[rs1];
if (rv->PC & 1) {
    rv_except_inst_misaligned(rv, rv->PC);
    return false;
}
// can branch
return false;
```
#### Corrected part
The variable `rs_value` stores the value of register `x` first; then the next PC is saved to `ra`, and the jump goes to `rs_value`.
```C=
// C.JALR
// read the jump target before the link register is overwritten
const int32_t rs_value = rv->X[rs1];
rv->X[rv_reg_ra] = rv->PC + rv->inst_len;
rv->PC = rs_value;
if (rv->PC & 0x1) {
    rv_except_inst_misaligned(rv, rv->PC);
    return false;
}
// can branch
return false;
```
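
The difference between the two orderings can be reproduced with a tiny stand-alone model. `mini_cpu_t` and both functions are hypothetical and much simpler than the rv32emu-next state; the addresses are taken from the `inst_8` listing above.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal machine state for this sketch; not the rv32emu-next struct. */
typedef struct {
    uint32_t X[32];
    uint32_t PC;
} mini_cpu_t;

enum { REG_RA = 1 };

/* Buggy order: the link write clobbers the target when rs1 is ra. */
static void c_jalr_link_first(mini_cpu_t *cpu, unsigned rs1)
{
    assert(rs1 != 0 && rs1 < 32);
    cpu->X[REG_RA] = cpu->PC + 2;
    cpu->PC = cpu->X[rs1];
}

/* Fixed order: read the target before writing the link register. */
static void c_jalr_target_first(mini_cpu_t *cpu, unsigned rs1)
{
    assert(rs1 != 0 && rs1 < 32);
    const uint32_t target = cpu->X[rs1];
    cpu->X[REG_RA] = cpu->PC + 2;
    cpu->PC = target;
}
```

With PC = `0x80000238` and ra = `0x80000240`, the buggy order continues at `0x8000023a` (the `xori ra,ra,2` path), while the fixed order reaches `0x80000240` as the test expects.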
## Integrate RV32C with Computed Goto
### Computed Goto
Computed goto is an optimization technique that improves branch prediction accuracy.
> For each jump, the branch predictor keeps a prediction of where it will jump next. [Computed goto for efficient dispatch tables](https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables)

rv32emu-next uses computed goto as follows:
- Every instruction handler is followed by the same DISPATCH step, but the code is duplicated at each position instead of being shared (a `#define` keeps the duplication easy to maintain)
- The macro defines a label `op_xxxx` for each instruction, so every instruction handler has its own "jump" (not shared)
```C=
#define TARGET(instr) \
op_##instr : EXEC(instr); \
DISPATCH();
```
This is why the computed goto technique can speed up rv32emu-next. Note that it improves the branch prediction of the **program (rv32emu-next)** on the host, not the branch prediction of the RISC-V model.
To use computed goto for RV32C, we need to prepare:
- A jump table that stores the pointers to the handlers
- DISPATCH
- EXEC: uses the jump table entry
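
The three pieces fit together as in the toy interpreter below. It is unrelated to the RISC-V encodings: the opcodes and `run_toy` are invented for illustration, and the code relies on the GNU C "labels as values" extension, so it assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>

/* Toy opcodes for this sketch; they are not RISC-V encodings. */
enum { OP_INC = 0, OP_DEC = 1, OP_HALT = 2 };

/* Run a toy program and return the accumulator. The program must contain
 * only valid opcodes and end with OP_HALT. */
static int run_toy(const unsigned char *code)
{
    /* Jump table: one label address per opcode. */
    static void *const jump_table[] = {&&op_inc, &&op_dec, &&op_halt};
    size_t pc = 0;
    int acc = 0;

    assert(code != NULL);

/* Expanded after every handler, so each handler ends with its own
 * indirect jump instead of returning to one shared dispatch point. */
#define DISPATCH() goto *jump_table[code[pc++]]

    DISPATCH();
op_inc:
    acc++;
    DISPATCH();
op_dec:
    acc--;
    DISPATCH();
op_halt:
    return acc;
#undef DISPATCH
}
```

Because `DISPATCH()` is expanded in three places, the compiled function contains three separate indirect jumps, and the host branch predictor can keep a separate prediction for each.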
### RV32C Implementation with Computed Goto
**Rename the functions from `c_op_*` to `op_c*`**
- This lets RV32C reuse the same macros that already implement computed goto for the standard instructions
- It keeps the changes to the code structure minimal (only the function renames and several lines)
#### jump table
The jump table is filled with `OP`; every RV32C entry has a 'c' prefix to distinguish it from the standard instructions.
```C=
TABLE_TYPE_RVC jump_table_rvc[] = {
//00 01 10 11
OP(caddi4spn), OP(caddi), OP(cslli), OP(unimp), // 000
OP(cfld), OP(cjal), OP(cfldsp), OP(unimp), // 001
OP(clw), OP(cli), OP(clwsp), OP(unimp), // 010
OP(cflw), OP(clui), OP(cflwsp), OP(unimp), // 011
OP(unimp), OP(cmisc_alu), OP(ccr), OP(unimp), // 100
OP(cfsd), OP(cj), OP(cfsdsp), OP(unimp), // 101
OP(csw), OP(cbeqz), OP(cswsp), OP(unimp), // 110
OP(cfsw), OP(cbnez), OP(cfswsp), OP(unimp), // 111
};
```
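
The table layout can be sanity-checked against halfwords from the objdump listings earlier in this note. The sketch mirrors the table with handler names as strings; `rvc_names` and the two helpers are hypothetical, and the masks `0xE000` and `0x3` are assumed values for the funct3 and opcode fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Handler names in the same row and column order as jump_table_rvc. */
static const char *const rvc_names[32] = {
    /* 00           01           10        11 */
    "caddi4spn", "caddi",     "cslli",  "unimp", /* 000 */
    "cfld",      "cjal",      "cfldsp", "unimp", /* 001 */
    "clw",       "cli",       "clwsp",  "unimp", /* 010 */
    "cflw",      "clui",      "cflwsp", "unimp", /* 011 */
    "unimp",     "cmisc_alu", "ccr",    "unimp", /* 100 */
    "cfsd",      "cj",        "cfsdsp", "unimp", /* 101 */
    "csw",       "cbeqz",     "cswsp",  "unimp", /* 110 */
    "cfsw",      "cbnez",     "cfswsp", "unimp", /* 111 */
};

/* Look up the handler name for a compressed instruction. */
static const char *rvc_handler_name(uint16_t inst)
{
    assert((inst & 3) != 3);
    const unsigned index = ((inst & 0xE000u) >> 11) | (inst & 0x3u);
    return rvc_names[index];
}

/* True when `inst` dispatches to the handler called `name`. */
static bool rvc_is_handler(uint16_t inst, const char *name)
{
    return strcmp(rvc_handler_name(inst), name) == 0;
}
```

With this layout `9bf2` (`c.add`) and `9082` (`c.jalr`) both land on `ccr`, `99f1` (`c.andi`) on `cmisc_alu`, and `a019` (`c.j`) on `cj`, so several instructions share one table entry and are told apart inside the handler.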
#### DISPATCH
1. Check whether the CPU has halted or the cycle count has reached the target
2. Fetch the instruction
3. Determine whether the instruction is compressed or uncompressed
   - If it is a compressed instruction:
     - Change the instruction length to INST_16 (2)
     - Decode the compressed instruction
```C=
#define DISPATCH() \
    { \
        if (rv->csr_cycle >= cycles_target || rv->halt) \
            goto quit; \
        /* fetch the next instruction */ \
        inst = rv->io.mem_ifetch(rv, rv->PC); \
        if ((inst & 3) == 3) { \
            /* standard uncompressed instruction */ \
            uint32_t index = (inst & INST_6_2) >> 2; \
            rv->inst_len = INST_32; \
            goto *jump_table[index]; \
        } else { \
            /* compressed extension instruction: keep the low 16 bits */ \
            inst &= 0x0000FFFF; \
            int16_t c_index = ((inst & FC_FUNC3) >> 11) | (inst & FC_OPCODE); \
            rv->inst_len = INST_16; \
            goto *jump_table_rvc[c_index]; \
        } \
    }
```
#### EXEC
The EXEC part is unchanged. This is why I renamed the functions instead of adding another macro that does the same thing.
```C=
#define EXEC(instr) \
{ \
/* dispatch this opcode */ \
if (!op_##instr(rv, inst)) \
goto quit; \
/* increment the cycles csr*/ \
rv->csr_cycle++; \
}
```
Then add the TARGET entries. This part makes sure that the next instruction is fetched after each instruction is handled (TARGET = EXEC + DISPATCH).
```C=
#ifdef ENABLE_RV32C
TARGET(caddi4spn)
TARGET(caddi)
TARGET(cslli)
TARGET(cjal)
TARGET(clw)
TARGET(cli)
TARGET(clwsp)
TARGET(clui)
TARGET(cmisc_alu)
TARGET(ccr)
TARGET(cj)
TARGET(csw)
TARGET(cbeqz)
TARGET(cswsp)
TARGET(cbnez)
#endif
```
## Pull Request
The pull request for this project is still open, and following the reviewer's advice I need to make major changes to my commit messages. I will keep working on it until the end of the project.
## Conclusion
In this project, I integrated RV32C with computed goto into rv32emu-next, based on the contributions of ccs100203 and Uduru0522; RV32C.F is not supported.
RV32C is the standard extension for compressed instructions. The instruction length can be reduced when:
- Registers are identical (e.g. rs1 == rd)
  - Fewer register bits are needed (the same field represents rs1 and rd)
- Only the most commonly used registers are involved
  - Fewer bits are needed to represent each register
- The immediate value is small
  - It can be represented in fewer bits

To make the model pass the compliance test, I focused on:
- Instruction functionality
- Whether any decode result is "reserved" or a "HINT"

Computed goto:
- Gives each handler its own "jump" to improve branch prediction accuracy
- Forces the "jumps" to be unshared, so the branch predictor keeps an independent history of the last outcomes for each of them
## Reference
- [Specifications - RISC-V International](https://riscv.org/technical/specifications/)
- [Computed goto for efficient dispatch tables](https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables)
- [2020q3 project: RISC-V simulator](https://hackmd.io/@sysprog/HJOpsvFqP?fbclid=IwAR1taUEhwvhZIkOXeAKXdDtorOILa8JeIYJIVeoYpSaplCiX99w7JSkX3WE#%E5%88%A9%E7%94%A8-computed-go-%E5%8A%A0%E9%80%9F-main-loop-%E9%81%8B%E4%BD%9C)