# Contribute to rv32emu-next

## Goal
![](https://i.imgur.com/Xa2R7Ix.png)

Building on sammer1077's contribution, make rv32emu-next pass riscv-arch-test completely.
- If technical problems come up, solve them and open a pull request
- Provide rv32emu run-time information (CPU cycle/branch/instruction/count) and the 10 most frequently executed instructions
- Optional: integrate RV32C into rv32emu-next based on an existing pull request (PR#3 or PR#4)

## Important Note
![](https://i.imgur.com/idJkfWr.png)

I found that the compressed extension is needed for some of the instruction tests, so I started integrating RV32C into rv32emu-next. I chose PR#3 as the base for the RV32C integration.

## RISCV-Arch-Test
The RISC-V architecture test checks that a CPU model meets the RISC-V specification.
1. The test suite includes a reference signature.
2. Our model runs a test-specific program that overwrites data in a specific region called the test signature.
3. The test signature is compared with the reference signature. If the two are exactly equal, the compliance test passes.

Once the tests have been compiled and run, the ELF files can be found under the `riscv-arch-test/work/C` directory. Use `riscv-none-embed-objdump -D` to get more information.

```assembly=
...
800000f8 <rvtest_code_begin>:
800000f8: 00003417 auipc s0,0x3
800000fc: f1840413 addi s0,s0,-232 # 80003010 <begin_signature>

80000100 <inst_0>:
80000100: 80000e37 lui t3,0x80000
80000104: 00000b93 li s7,0
80000108: 9bf2 add s7,s7,t3
8000010a: 01742023 sw s7,0(s0)
...
```

This is the test code for `c.add`: it executes `c.add` (the 16-bit `add s7,s7,t3` at `80000108`) and stores the result to `0(s0)`, which is the `begin_signature` position.

```assembly
..
Disassembly of section .data:

80003000 <rvtest_data_begin>:
80003000: cafe sw t6,84(sp)
80003002: babe fsd fa5,368(sp)

80003004 <rvtest_data_end>:
...
80003010 <begin_signature>:
80003010: deadbeef jal t4,7ffde5fa <offset+0x7ffde532>
80003014: deadbeef jal t4,7ffde5fe <offset+0x7ffde536>
80003018: deadbeef jal t4,7ffde602 <offset+0x7ffde53a>
8000301c: deadbeef jal t4,7ffde606 <offset+0x7ffde53e>
...
```

This is the `.data` section of the test program. The test signature is stored from `begin_signature` onward; every word is initialized to `deadbeef` and is overwritten at run time.

### The Key of Compliance Test
The compliance test checks each instruction under test for:
- Functionality
  - Correctness of the result
  - `x0` must never be modified
- Decode correctness
  - In special cases the decode result is "Reserved" or "HINT"
  - Reserved and HINT encodings must not do anything

If the signature differs in any way, the compliance test fails.

## RV32C, the standard extension for compressed instructions
A compressed instruction is a shorter encoding of an instruction, available in special cases:
- The immediate or address offset is small
- One of the registers is x0, x1 or x2
- The destination register is the same as the source register
- Only the 8 most popular registers are used

![](https://i.imgur.com/DAOYhyf.png)

The C extension is compatible with the other standard instructions: 16-bit and 32-bit instructions can be intermixed, aligned on 16-bit boundaries.
- After a compressed instruction runs, the next PC is PC+2
- An RV32C encoding exists only for the specific cases above and is always 16 bits wide

## The progress on pull request of PR#3
PR#3 contributors:
- ccs100203 <ccs100203@gmail.com>
- Uduru0522 <kurasiki.homura@gmail.com>

The failing tests:
- c.lui
- c.srli
- c.srai
- c.andi
- c.jalr
- c.jal
- c.ebreak
- c.addi16sp

PR#3 also does not support the computed goto implementation.

## How is RV32C compatible with the other instruction sets?
For every instruction that rv32emu-next fetches, we can check its lowest 2 bits.
- If they are 3 (binary 11), it is a standard uncompressed instruction and the next PC should be PC+4.
- Otherwise, it is a compressed instruction and the next PC should be PC+2.

In the implementation:

```C
while (rv->csr_cycle < cycles_target && !rv->halt) {
    // fetch the next instruction
    inst = rv->io.mem_ifetch(rv, rv->PC);

    // standard uncompressed instruction
    if ((inst & 3) == 3) {
        index = (inst & INST_6_2) >> 2;

        // dispatch this opcode
        TABLE_TYPE op = jump_table[index];
        assert(op);
        rv->inst_len = INST_32;
        if (!op(rv, inst))
            break;

        // increment the cycles csr
        rv->csr_cycle++;
    } else {
        // TODO: compressed instruction
        // funct3 (bits 15:13) moves down to bits 4:2, the opcode (bits 1:0)
        // stays in place; the inner parentheses are required because >>
        // binds tighter than &
        const uint16_t c_index =
            ((inst & FR_C_15_13) >> 11) | (inst & FR_C_1_0);

        // TODO: table implement
        const c_opcode_t op = c_opcodes[c_index];
        assert(op);
        rv->inst_len = INST_16;
        if (!op(rv, inst))
            break;

        // increment the cycles csr
        rv->csr_cycle++;
    }
}
```

## How to debug with Compliance Test
After running the compliance test, we can see:

![](https://i.imgur.com/Xad9CJz.png =50%x)

1. Check and record each failed instruction, then go to the work directory of riscv-arch-test and find the corresponding `.elf` and `.diff` files.
   > .diff

   The diff shows the differences between the test signature and the reference signature.
   > .elf

   Opening this file with `riscv-none-embed-objdump -D` shows the instructions of the test code.
2. Then run `./rv32emu --trace FailInstructionTest.elf`; the `--trace` argument makes rv32emu print the number of each instruction as it runs.
   - I recommend adding some prints of values such as rs1, rs2, rd and imm to the code as debug messages.
3. The command prints all of the run-time information. Check each line, especially the instruction under test.
   - Use the instruction decode rules to calculate the result by hand and compare it with the printed message.
   - If something is wrong, use that comparison to locate the faulty code.
4. Note that a test for one instruction also uses other instructions, so make sure the other instructions used in the test code are correct as well.
5. In exception tests such as c.ebreak, the program jumps to a trap handler and the test itself handles the trap.
   - We cannot stop immediately when the trap occurs.
   - If we stop the program immediately, our test signature still contains many `deadbeef` words where the reference signature has real values.

## Example
Fix the arithmetic error.

### c.lui
![](https://i.imgur.com/1VmBCEv.png)

c.lui failed the compliance test. Note that:
> - c.lui loads the non-zero 6-bit immediate field into bits 17-12 of the destination register, clears the bottom 12 bits, and sign-extends bit 17 into all higher bits of the destination
> - The code points with nzimm=0 are reserved
> - The remaining code points with rd=x0 are HINTS

```C
static bool c_op_lui(struct riscv_t *rv, uint16_t inst)
{
    const uint16_t rd = c_dec_rd(inst);
    if (rd == 2) {
        // C.ADDI16SP: nzimm[9|4|6|8:7|5] comes from inst[12|6|5|4:3|2]
        uint32_t tmp = (inst & 0x1000) >> 3;
        // fixed; the erroneous version was: tmp |= (inst & 0x40)
        tmp |= (inst & 0x40) >> 2;
        tmp |= (inst & 0x20) << 1;
        tmp |= (inst & 0x18) << 4;
        tmp |= (inst & 0x4) << 3;
        // sign-extend from bit 9
        const uint32_t imm = (tmp & 0x200) ? (0xfffffc00 | tmp) : tmp;
        if (imm != 0)
            rv->X[rd] += imm;
        else {
            /* imm == 0 is reserved */
            // handling for the reserved case can be added here
        }
    } else if (rd != 0) {
        // C.LUI: nzimm[17|16:12] comes from inst[12|6:2]
        uint32_t tmp = (inst & 0x1000) << 5 | (inst & 0x7c) << 10;
        // sign-extend from bit 17
        const int32_t imm = (tmp & 0x20000) ? (0xfffc0000 | tmp) : tmp;
        if (imm != 0)
            rv->X[rd] = imm;
        else {
            /* imm == 0 is reserved */
        }
    } else {
        // HINTS
    }
    rv->PC += rv->inst_len;
    return true;
}
```

### c.jalr
> c.jalr expands to jalr x1,0(rs1)

The special part of the test code:

```
80000230 <inst_8>:
80000230: 00000097 auipc ra,0x0
80000234: 01008093 addi ra,ra,16 # 80000240 <inst_8+0x10>
80000238: 9082 jalr ra
8000023a: 0020c093 xori ra,ra,2
8000023e: a019 j 80000244 <inst_8+0x14>
80000240: 0030c093 xori ra,ra,3
80000244: 00000597 auipc a1,0x0
80000248: fec58593 addi a1,a1,-20 # 80000230 <inst_8>
8000024c: 99f1 andi a1,a1,-4
8000024e: 40b080b3 sub ra,ra,a1
80000252: 02152023 sw ra,32(a0)
```

The instruction `c.jalr x` saves the next PC to `ra` and jumps to the address held in register `x`. This test, however, exercises the special case `c.jalr ra`.
If we save the next PC to `ra` first, the value of `ra` **is replaced by the next PC**, and because the jump target is read from `ra` afterwards, the emulator **jumps to the next PC** instead of the intended target.

#### Error parts
The next PC is saved to `ra` directly, and only then is the jump target read from register `x`.

```C=
// C.JALR
rv->X[1] = rv->PC + 2;
rv->PC = rv->X[rs1];
if (rv->PC & 1) {
    rv_except_inst_misaligned(rv, rv->PC);
    return false;
}
// can branch
return false;
```

#### Corrected parts
The variable `rs_value` holds the value of register `x` before `ra` is written. The next PC is then saved to `ra`, and the jump goes to `rs_value`.

```C=
// C.JALR
const int32_t rs_value = rv->X[rs1];
rv->X[rv_reg_ra] = rv->PC + rv->inst_len;
rv->PC = rs_value;
if (rv->PC & 0x1) {
    rv_except_inst_misaligned(rv, rv->PC);
    return false;
}
// can branch
return false;
```

## Integrate RV32C with Computed Goto
## Computed Goto
Computed goto is an optimization technique that improves branch prediction accuracy.
> For each jump, the branch predictor keeps a prediction of where it will jump next. [Computed goto for efficient dispatch table](https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables)

rv32emu-next uses computed goto as follows:
- After every instruction handler, the same DISPATCH step runs, but each handler has its own copy of that code at a different position (the code is not shared at run time; a macro keeps the maintenance effort low)
- A macro defines a label `op_xxxx` for each instruction, so that every instruction handler ends with its own "jump" (not shared)

```C=
#define TARGET(instr)         \
    op_##instr : EXEC(instr); \
    DISPATCH();
```

This is why the computed goto technique can speed up rv32emu-next. Note that it improves branch prediction for the **program (rv32emu-next)** itself, not the branch prediction of the emulated RISC-V model.

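The per-handler dispatch described above can be sketched with a toy interpreter. This is only an illustration under stated assumptions: `run_toy` and the `OP_*` opcodes are hypothetical names that do not exist in rv32emu-next, and the `&&label` / `goto *` syntax is the GCC/Clang "labels as values" extension rather than standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy opcodes for illustration, not rv32emu-next's encoding. */
enum { OP_INC = 0, OP_DEC = 1, OP_DOUBLE = 2, OP_HALT = 3 };

/* Runs a toy bytecode program and returns the final accumulator value. */
int32_t run_toy(const uint8_t *code, size_t len)
{
    /* Jump table of label addresses (GCC/Clang extension). */
    static void *jump_table[] = {
        &&op_inc, &&op_dec, &&op_double, &&op_halt,
    };
    int32_t acc = 0;
    size_t pc = 0;

/* Expanded after every handler, so each handler ends with its own
 * indirect jump instead of returning to one shared dispatch point. */
#define DISPATCH()                        \
    do {                                  \
        if (pc >= len)                    \
            goto quit;                    \
        assert(code[pc] <= OP_HALT);      \
        goto *jump_table[code[pc++]];     \
    } while (0)

    DISPATCH();

op_inc:
    acc += 1;
    DISPATCH();
op_dec:
    acc -= 1;
    DISPATCH();
op_double:
    acc *= 2;
    DISPATCH();
op_halt:
quit:
    return acc;
#undef DISPATCH
}
```

Calling `run_toy` on the program `{OP_INC, OP_INC, OP_DOUBLE, OP_DEC, OP_HALT}` returns 3. Each `DISPATCH()` expansion is a separate indirect jump in the generated code, which is the property the `TARGET` macro relies on.
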
To use computed goto for RV32C, we need to prepare:
- A jump table that holds the address of each handler label
- DISPATCH
- EXEC: runs the handler selected through the jump table

### RV32C Implementation with Computed Goto
**Rename the functions from `c_op_*` to `op_c*`**
- This lets RV32C reuse the macros that the computed goto code for the standard instructions already provides
- It keeps changes to the code structure to a minimum (only the function names and a few lines change)

#### jump table
The jump table is filled with OP. Every RV32C instruction name has a 'c' character in front to distinguish it from the standard instructions.

```C=
TABLE_TYPE_RVC jump_table_rvc[] = {
//  00             01             10          11
    OP(caddi4spn), OP(caddi),     OP(cslli),  OP(unimp),  // 000
    OP(cfld),      OP(cjal),      OP(cfldsp), OP(unimp),  // 001
    OP(clw),       OP(cli),       OP(clwsp),  OP(unimp),  // 010
    OP(cflw),      OP(clui),      OP(cflwsp), OP(unimp),  // 011
    OP(unimp),     OP(cmisc_alu), OP(ccr),    OP(unimp),  // 100
    OP(cfsd),      OP(cj),        OP(cfsdsp), OP(unimp),  // 101
    OP(csw),       OP(cbeqz),     OP(cswsp),  OP(unimp),  // 110
    OP(cfsw),      OP(cbnez),     OP(cfswsp), OP(unimp),  // 111
};
```

#### DISPATCH
1. Check whether the CPU has halted or the cycle count has reached the target
2. Fetch the instruction
3. 
   Determine whether the instruction is compressed or uncompressed
   - If it is a compressed instruction
     - set the instruction length to INST_16 (2)
     - decode it as a compressed instruction

```C=
#define DISPATCH() \
    { \
        if (rv->csr_cycle >= cycles_target || rv->halt) \
            goto quit; \
        /* fetch the next instruction */ \
        inst = rv->io.mem_ifetch(rv, rv->PC); \
        /* standard uncompressed instruction */ \
        if ((inst & 3) == 3) { \
            uint32_t index = (inst & INST_6_2) >> 2; \
            rv->inst_len = INST_32; \
            goto *jump_table[index]; \
        } else { \
            /* Compressed Extension Instruction */ \
            inst &= 0x0000FFFF; \
            int16_t c_index = (inst & FC_FUNC3) >> 11 | (inst & FC_OPCODE); \
            rv->inst_len = INST_16; \
            goto *jump_table_rvc[c_index]; \
        } \
    }
```

#### EXEC
The EXEC part is unchanged. This is why I renamed the functions instead of adding another macro that does the same thing.

```C=
#define EXEC(instr) \
    { \
        /* dispatch this opcode */ \
        if (!op_##instr(rv, inst)) \
            goto quit; \
        /* increment the cycles csr */ \
        rv->csr_cycle++; \
    }
```

Then add the TARGET entries. They make sure the next instruction is fetched after each handler finishes (TARGET = EXEC + DISPATCH).

```C=
#ifdef ENABLE_RV32C
TARGET(caddi4spn)
TARGET(caddi)
TARGET(cslli)
TARGET(cjal)
TARGET(clw)
TARGET(cli)
TARGET(clwsp)
TARGET(clui)
TARGET(cmisc_alu)
TARGET(ccr)
TARGET(cj)
TARGET(csw)
TARGET(cbeqz)
TARGET(cswsp)
TARGET(cbnez)
#endif
```

## Pull Request
This work is still under review as a pull request, and following the reviewer's advice I need to make major changes to my commit messages. I will keep doing my best until the end of the project.

## Conclusion
In this project, I integrated RV32C (based on the contributions of ccs100203 and Uduru0522; RV32C.F is not supported) with computed goto into rv32emu-next.

RV32C is the part of the standard extensions that covers compressed instructions. The instruction length can be reduced when:
- The registers are identical (e.g. rs1 == rd)
  - Fewer register bits are needed (the same field represents rs1/rd)
- Only the most commonly used registers are involved
  - Fewer bits are needed to represent each register
- The immediate value is small
  - It can be represented in fewer bits

To make the model pass the compliance test, I paid attention to:
- Instruction functionality
- Whether any decode result is "reserved" or "HINT"

Computed goto:
- Gives each handler its own "jump" to improve branch prediction accuracy
- Because the "jumps" are not shared, the branch predictor keeps a separate record of where each of them went last time

## Reference
- [Specifications - RISC-V International](https://riscv.org/technical/specifications/)
- [Computed goto for efficient dispatch tables](https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables)
- [2020q3 project: RISC-V simulator](https://hackmd.io/@sysprog/HJOpsvFqP?fbclid=IwAR1taUEhwvhZIkOXeAKXdDtorOILa8JeIYJIVeoYpSaplCiX99w7JSkX3WE#%E5%88%A9%E7%94%A8-computed-go-%E5%8A%A0%E9%80%9F-main-loop-%E9%81%8B%E4%BD%9C)