
# Implement RISC-V Compressed Instruction Set for [Reindeer](https://github.com/PulseRain/Reindeer)

:::danger
TODO: make a list for project members
:::

contributed by < 林家葦 >, < 黃俞紘 >, < 曾士峰 >, < 潘家瑞 >

## Requirements
- Extend the pull request [RV32C support](https://github.com/PulseRain/Reindeer/pull/25) to ensure a functional pipeline design;
- Validate with a simulator;
- (Optional) validate on the Step CYC10 FPGA board.

###### tags: `Computer Architecture`

## RV32C
* RVC uses a simple compression scheme that offers shorter 16-bit versions of common 32-bit RISC-V instructions when:
    * the immediate or address offset is small, or
    * one of the registers is the zero register (x0), the ABI link register (x1), or the ABI stack pointer (x2), or
    * the destination register and the first source register are identical, or
    * the registers used are the 8 most popular ones.
* How compressed instructions are encoded
![](https://i.imgur.com/Nzx74ad.png)
* How to generate compressed instructions
    * We add the `c` extension to the architecture flag (-march=rv32im**c**) when compiling the original source code, and the toolchain then emits compressed instructions.
```
riscv-none-embed-gcc -march=rv32imc -mabi=ilp32 -static -mcmodel=medany -fvisibility=hidden -nostdlib -nostartfiles \
    -I/home/wayen/riscv-compliance/riscv-test-env/ \
    -I/home/wayen/riscv-compliance/riscv-test-env/p/ \
    -I/home/wayen/riscv-compliance/riscv-target/riscvOVPsim/ \
    -T/home/wayen/riscv-compliance/riscv-test-env/p/link.ld \
    src/C-J.S -o /home/wayen/riscv-compliance/work/rv32imc/C-J.elf; \
riscv-none-embed-objdump -D /home/wayen/riscv-compliance/work/rv32imc/C-J.elf > /home/wayen/riscv-compliance/work/rv32imc/C-J.elf.objdump
```
* Source code
```
#include "riscv_test_macros.h"
#include "compliance_test.h"
#include "compliance_io.h"

RV_COMPLIANCE_RV32M

RV_COMPLIANCE_CODE_BEGIN

    RVTEST_IO_INIT
    RVTEST_IO_ASSERT_GPR_EQ(x31, x0, 0x00000000)
    RVTEST_IO_WRITE_STR(x31, "Test Begin\n")

    # ---------------------------------------------------------------------------------------------
    RVTEST_IO_WRITE_STR(x31, "# Test number 1\n")

    # address for test results
    la x5, test_1_res

    TEST_RR_OP(add, x0, x31, x16, 0x0, -0x1, 0x0, x5, 0, x6) # Testcase 0
    TEST_RR_OP(add, x1, x30, x15, 0xfffff802, 0x1, -0x7ff, x5, 4, x6) # Testcase 1
    TEST_RR_OP(add, x2, x29, x14, 0xffffffff, 0x0, -0x1, x5, 8, x6) # Testcase 2
    TEST_RR_OP(add, x3, x28, x13, 0xfffff5cb, 0x7ff, -0x1234, x5, 12, x6) # Testcase 3
    TEST_RR_OP(add, x4, x27, x12, 0x80000000, 0x0, 0x80000000, x5, 16, x6) # Testcase 4

    # ---------------------------------------------------------------------------------------------
    RVTEST_IO_WRITE_STR(x31, "# Test number 2\n")

    # address for test results
    la x1, test_2_res

    TEST_RR_OP(add, x5, x26, x11, 0x1a34, 0x800, 0x1234, x1, 0, x2) # Testcase 5
    TEST_RR_OP(add, x6, x25, x10, 0x7654320, 0x7654321, 0xffffffff, x1, 4, x2) # Testcase 6
    TEST_RR_OP(add, x7, x24, x9, 0x80000000, 0x7fffffff, 0x1, x1, 8, x2) # Testcase 7
    TEST_RR_OP(add, x8, x23, x8, 0x80000000, 0x1, 0x7fffffff, x1, 12, x2) # Testcase 8
    TEST_RR_OP(add, x9, x22, x7, 0x7654320, 0xffffffff, 0x7654321, x1, 16, x2) # Testcase 9

    # ---------------------------------------------------------------------------------------------
    RVTEST_IO_WRITE_STR(x31, "# Test number 3\n")

    # address for test results
    la x1, test_3_res

    TEST_RR_OP(add, x10, x21, x6, 0x1a34, 0x1234, 0x800, x1, 0, x7) # Testcase 10
    TEST_RR_OP(add, x11, x20, x5, 0x80000000, 0x80000000, 0x0, x1, 4, x7) # Testcase 11
    TEST_RR_OP(add, x12, x19, x4, 0xfffff5cb, -0x1234, 0x7ff, x1, 8, x7) # Testcase 12
    TEST_RR_OP(add, x13, x18, x3, 0xfffffffe, -0x1, -0x1, x1, 12, x7) # Testcase 13
    TEST_RR_OP(add, x14, x17, x2, 0xfffff802, -0x7ff, 0x1, x1, 16, x7) # Testcase 14

    # ---------------------------------------------------------------------------------------------
    RVTEST_IO_WRITE_STR(x31, "# Test number 4\n")

    # address for test results
    la x2, test_4_res

    TEST_RR_OP(add, x15, x16, x1, 0x0, 0x0, 0x0, x2, 0, x3) # Testcase 15
    TEST_RR_OP(add, x16, x15, x0, 0xffffffff, -0x1, 0x0, x2, 4, x3) # Testcase 16
    TEST_RR_OP(add, x17, x14, x31, 0xfffff802, 0x1, -0x7ff, x2, 8, x3) # Testcase 17
    TEST_RR_OP(add, x18, x13, x30, 0xffffffff, 0x0, -0x1, x2, 12, x3) # Testcase 18
    TEST_RR_OP(add, x19, x12, x29, 0xfffff5cb, 0x7ff, -0x1234, x2, 16, x3) # Testcase 19

    # ---------------------------------------------------------------------------------------------
    RVTEST_IO_WRITE_STR(x31, "# Test number 5\n")

    # address for test results
    la x1, test_5_res

    TEST_RR_OP(add, x20, x11, x28, 0x80000000, 0x0, 0x80000000, x1, 0, x2) # Testcase 20
    TEST_RR_OP(add, x21, x10, x27, 0x1a34, 0x800, 0x1234, x1, 4, x2) # Testcase 21
    TEST_RR_OP(add, x22, x9, x26, 0x7654320, 0x7654321, 0xffffffff, x1, 8, x2) # Testcase 22
    TEST_RR_OP(add, x23, x8, x25, 0x80000000, 0x7fffffff, 0x1, x1, 12, x2) # Testcase 23
    TEST_RR_OP(add, x24, x7, x24, 0x80000000, 0x1, 0x7fffffff, x1, 16, x2) # Testcase 24

    # ---------------------------------------------------------------------------------------------
    RVTEST_IO_WRITE_STR(x31, "# Test number 6\n")

    # address for test results
    la x1, test_6_res

    TEST_RR_OP(add, x25, x6, x23, 0x7654320, 0xffffffff, 0x7654321, x1, 0, x7) # Testcase 25
    TEST_RR_OP(add, x26, x5, x22, 0x1a34, 0x1234, 0x800, x1, 4, x7) # Testcase 26
    TEST_RR_OP(add, x27, x4, x21, 0x80000000, 0x80000000, 0x0, x1, 8, x7) # Testcase 27
    TEST_RR_OP(add, x28, x3, x20, 0xfffff5cb, -0x1234, 0x7ff, x1, 12, x7) # Testcase 28
    TEST_RR_OP(add, x29, x2, x19, 0xfffffffe, -0x1, -0x1, x1, 16, x7) # Testcase 29

    # ---------------------------------------------------------------------------------------------
    RVTEST_IO_WRITE_STR(x31, "# Test number 7\n")

    # address for test results
    la x2, test_7_res

    TEST_RR_OP(add, x30, x1, x18, 0xfffff802, -0x7ff, 0x1, x2, 0, x3) # Testcase 30
    TEST_RR_OP(add, x31, x0, x17, 0x0, 0x0, 0x0, x2, 4, x3) # Testcase 31

    # ---------------------------------------------------------------------------------------------
    RVTEST_IO_WRITE_STR(x31, "Test End\n")

    # ---------------------------------------------------------------------------------------------
    RV_COMPLIANCE_HALT

RV_COMPLIANCE_CODE_END

# Input data section.
    .data

# Output data section.
RV_COMPLIANCE_DATA_BEGIN

test_1_res:
    .fill 5, 4, -1
test_2_res:
    .fill 5, 4, -1
test_3_res:
    .fill 5, 4, -1
test_4_res:
    .fill 5, 4, -1
test_5_res:
    .fill 5, 4, -1
test_6_res:
    .fill 5, 4, -1
test_7_res:
    .fill 5, 4, -1

RV_COMPLIANCE_DATA_END
```
* Excerpts from the objdump output
    * RV32C
```
800000f0 <begin_testcode>:
800000f0:	00002117          	auipc	sp,0x2
800000f4:	f1010113          	addi	sp,sp,-240 # 80002000 <begin_signature>
800000f8:	4201                	li	tp,0
800000fa:	4181                	li	gp,0
800000fc:	9192                	add	gp,gp,tp
800000fe:	c00e                	sw	gp,0(sp)
80000100:	4481                	li	s1,0
80000102:	4405                	li	s0,1
80000104:	9426                	add	s0,s0,s1
80000106:	c222                	sw	s0,4(sp)
80000108:	4601                	li	a2,0
8000010a:	fff00593          	li	a1,-1
8000010e:	95b2                	add	a1,a1,a2
.
.
.
```
    * RV32I
```
80000108:	00002097          	auipc	ra,0x2
8000010c:	ef808093          	addi	ra,ra,-264 # 80002000 <test_A1_data>
80000110:	00002117          	auipc	sp,0x2
80000114:	f2010113          	addi	sp,sp,-224 # 80002030 <begin_signature>
80000118:	0000a183          	lw	gp,0(ra)
8000011c:	00000213          	li	tp,0
80000120:	00100293          	li	t0,1
80000124:	fff00313          	li	t1,-1
80000128:	800003b7          	lui	t2,0x80000
8000012c:	fff38393          	addi	t2,t2,-1 # 7fffffff <_end+0xffffdf1f>
.
.
.
```

## How it works
* We add a compressed-instruction decoder that converts 16-bit instructions into their 32-bit equivalents.
    1. First, we fetch the instruction addressed by the program counter.
    2. The fetched 32-bit or 16-bit instruction is sent to the **Compress Decoder Unit**. If the instruction is not compressed, it passes through unchanged; if it is compressed, it is converted to the equivalent 32-bit instruction.
    3. The decoded 32-bit instruction is sent to the instruction decoder unit.
    4. The instruction is executed.
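The conversion in step 2 can be sketched as a small combinational module. This is only an illustration under stated assumptions: the module name `rvc_expand_sketch`, its ports, and the `unsupported` flag are hypothetical and are not taken from Reindeer or the pull request, and only two compressed instructions (C.LI and C.ADD) are expanded. The bit layouts follow the RVC encodings in the RISC-V specification, and the two test values are the `4201` and `9192` halfwords from the objdump listing.

```verilog
// Hypothetical sketch, not Reindeer's decoder: expands C.LI and C.ADD only.
module rvc_expand_sketch (
    input  wire [31:0] instr_in,      // fetched word, instruction in the low bits
    output reg  [31:0] instr_out,     // equivalent 32-bit instruction
    output wire        is_compressed, // 1 when instr_in[15:0] is a 16-bit instruction
    output reg         unsupported    // 1 for compressed forms this sketch omits
);
    wire [15:0] c      = instr_in[15:0];
    wire [1:0]  op     = c[1:0];
    wire [2:0]  funct3 = c[15:13];
    wire [4:0]  rd     = c[11:7];
    wire [4:0]  rs2    = c[6:2];

    // Every 32-bit instruction has both low bits set; anything else is 16-bit.
    assign is_compressed = (op != 2'b11);

    always @(*) begin
        instr_out   = instr_in;  // 32-bit instructions pass through unchanged
        unsupported = 1'b0;
        if (is_compressed) begin
            if (op == 2'b01 && funct3 == 3'b010) begin
                // C.LI rd, imm  ->  addi rd, x0, imm (6-bit immediate, sign-extended)
                instr_out = {{7{c[12]}}, c[6:2], 5'd0, 3'b000, rd, 7'b0010011};
            end else if (op == 2'b10 && funct3 == 3'b100 && c[12] &&
                         rd != 5'd0 && rs2 != 5'd0) begin
                // C.ADD rd, rs2  ->  add rd, rd, rs2
                instr_out = {7'b0000000, rs2, rd, 3'b000, rd, 7'b0110011};
            end else begin
                instr_out   = 32'h0000_0013; // nop placeholder (addi x0, x0, 0)
                unsupported = 1'b1;
            end
        end
    end
endmodule

// Self-checking testbench; prints FAIL lines only on a mismatch.
module rvc_expand_sketch_tb;
    reg  [31:0] instr_in;
    wire [31:0] instr_out;
    wire        is_compressed, unsupported;

    rvc_expand_sketch dut (
        .instr_in(instr_in), .instr_out(instr_out),
        .is_compressed(is_compressed), .unsupported(unsupported));

    initial begin
        instr_in = 32'h0000_4201; #1;  // c.li tp,0     -> addi tp,x0,0
        if (instr_out !== 32'h0000_0213) $display("FAIL C.LI  %h", instr_out);
        instr_in = 32'h0000_9192; #1;  // c.add gp,tp   -> add gp,gp,tp
        if (instr_out !== 32'h0041_81b3) $display("FAIL C.ADD %h", instr_out);
        instr_in = 32'hf101_0113; #1;  // 32-bit addi passes through
        if (instr_out !== 32'hf101_0113 || is_compressed) $display("FAIL pass-through");
        $display("done");
        $finish;
    end
endmodule
```

Keeping the expander purely combinational means the existing 32-bit instruction decoder needs no changes; a complete decoder would add one branch per RVC instruction in the same style.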
![](https://i.imgur.com/ihSjnnf.png)

## Problems we encountered
### **Problem 1: Instruction Misalignment**
* RV32C reduces static and dynamic code size by adding short 16-bit encodings for common operations. As a consequence, the fetch unit must handle a mix of 32-bit and 16-bit instructions, and a 32-bit instruction may start on a halfword boundary.
* **Memory misalignment**
    * When a computer reads from or writes to a memory address, it does so in word-sized chunks. Data alignment means placing the data at a memory offset equal to some multiple of the word size, which improves performance because of the way the CPU accesses memory.
    * For example

| Memory address | Alignment (32-bit) |
| -------- | -------- |
| 0x0000_0000 | Aligned |
| 0x0000_0001 | Not aligned |
| 0x0000_0002 | Not aligned |
| 0x0000_0003 | Not aligned |
| 0x0000_0004 | Aligned |
| 0x0000_0005 | Not aligned |
| 0x0000_0006 | Not aligned |
| 0x0000_0007 | Not aligned |
| 0x0000_0008 | Aligned |

* Suppose we execute the code below. The word size is 32 bits, so every fetch returns 32 bits of data.
    * Code to execute
```
800000fc:	9192                	add	gp,gp,tp
800000fe:	c00e                	sw	gp,0(sp)
80000100:	4481                	li	s1,0
80000102:	00002117          	auipc	sp,0x2
80000106:	f1010113          	addi	sp,sp,-240 # 80002000
```
    * Memory content

| Memory address | Content |
| -------- | -------- |
| 0x8000_00fc | 92 |
| 0x8000_00fd | 91 |
| 0x8000_00fe | 0e |
| 0x8000_00ff | c0 |
| 0x8000_0100 | 81 |
| 0x8000_0101 | 44 |
| 0x8000_0102 | 17 |
| 0x8000_0103 | 21 |
| 0x8000_0104 | 00 |
| 0x8000_0105 | 00 |
| 0x8000_0106 | 13 |
| 0x8000_0107 | 01 |
| 0x8000_0108 | 01 |
| 0x8000_0109 | f1 |

* In cycle 0, we fetch the word **c00e_9192**, but the instruction at PC 8000_00fc is a 16-bit instruction, so we take only the half we need, **9192**. In cycle 3, however, PC 8000_0102 sends the word address 8000_0100 to memory and gets the wrong instruction **2117_4481** instead of the correct instruction **0000_2117**.
| Cycle | Instruction width | Program counter | Address sent to memory | Fetched word | Expected instruction | Result |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 0 | 16-bit | 0x8000_00fc | 0x8000_00fc | 0xc00e_9192 | 0x9192 | Correct |
| 1 | 16-bit | 0x8000_00fe | 0x8000_00fc | 0xc00e_9192 | 0xc00e | Correct |
| 2 | 16-bit | 0x8000_0100 | 0x8000_0100 | 0x2117_4481 | 0x4481 | Correct |
| 3 | 32-bit | 0x8000_0102 | 0x8000_0100 | 0x2117_4481 | 0x0000_2117 | Incorrect |

* **How to solve memory misalignment**
    * Reindeer splits its memory into a high memory unit and a low memory unit, each 16 bits wide. We can use this property to fetch the correct instruction in one cycle.
![](https://i.imgur.com/MOxh32t.png)
    * The original source code in PulseRain_RV2T_MCU.v:
```verilog
single_port_ram #(.ADDR_WIDTH (`MEM_ADDR_BITS), .DATA_WIDTH (16) ) ram_high_i (
    .addr (mem_addr),
    .din (mem_write_data [31 : 16]),
    .write_en (mem_write_en[3 : 2]),
    .clk (clk),
    .dout (dout_high));

single_port_ram #(.ADDR_WIDTH (`MEM_ADDR_BITS), .DATA_WIDTH (16) ) ram_low_i (
    .addr (mem_addr),
    .din (mem_write_data [15 : 0]),
    .write_en (mem_write_en[1 : 0]),
    .clk (clk),
    .dout (dout_low));

// Both banks share one address; the word is {high half, low half}.
assign mem_read_data = {dout_high, dout_low};
```
    * When the memory receives an address, we check whether bit 1 of the byte address is set, that is, whether the access starts in the upper half of a word. If it is set, we send the address of the next word to the low memory unit and the original word address to the high memory unit, then swap the two halves on output. This yields the correct instruction in one cycle.
| Cycle | Program counter | Address sent to memory | High address | Low address | High data | Low data | Fetched instruction |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 0 | 8000_00fc | 8000_00fc | 8000_00fc | 8000_00fc | c00e | 9192 | c00e_9192 |
| 1 | 8000_00fe | 8000_00fc | 8000_00fc | 8000_0100 | c00e | 4481 | 4481_c00e |
| 2 | 8000_0100 | 8000_0100 | 8000_0100 | 8000_0100 | 2117 | 4481 | 2117_4481 |
| 3 | 8000_0102 | 8000_0100 | 8000_0100 | 8000_0104 | 2117 | 0000 | 0000_2117 |
| 4 | 8000_0106 | 8000_0104 | 8000_0104 | 8000_0108 | 0113 | f101 | f101_0113 |

* We modified PulseRain_RV2T_MCU.v to handle misaligned fetches:
```verilog
single_port_ram #(.ADDR_WIDTH (`MEM_ADDR_BITS), .DATA_WIDTH (16) ) ram_high_i (
    .addr (mem_addr_high),
    .din (mem_write_data [31 : 16]),
    .write_en (mem_write_en[3 : 2]),
    .clk (clk),
    .dout (dout_high));

single_port_ram #(.ADDR_WIDTH (`MEM_ADDR_BITS), .DATA_WIDTH (16) ) ram_low_i (
    .addr (mem_addr_low),
    .din (mem_write_data [15 : 0]),
    .write_en (mem_write_en[1 : 0]),
    .clk (clk),
    .dout (dout_low));

// The high bank always reads the word that contains the requested halfword.
assign mem_addr_high = mem_addr[`MEM_ADDR_BITS:1];
// When the access starts in the upper halfword, the low bank reads the next word.
assign mem_addr_low = (mem_addr[0]) ? mem_addr[`MEM_ADDR_BITS:1] + 1 : mem_addr[`MEM_ADDR_BITS:1];
// In that case the halves are swapped so the first halfword lands in bits [15:0].
assign mem_read_data = (mem_addr_d1[0]) ? {dout_low,dout_high} : {dout_high,dout_low};
```
* Original memory access
![](https://i.imgur.com/ZSL2dMh.png)
* Our memory access
![](https://i.imgur.com/nUzQvPx.png)

:::info
This method also resolves misaligned branch target addresses in one cycle.
:::

### **Problem 2: Program Counter**
* After fetching a 16-bit instruction, the program counter must advance by 2; after fetching a 32-bit instruction, it must advance by 4.
* Original code
```verilog
always @(posedge clk, negedge reset_n) begin : read_mem_proc
    if (!reset_n) begin
        read_mem_enable <= 0;
        read_mem_addr <= 0;
        read_mem_enable_d1 <= 0;
    end else begin
        read_mem_enable <= ctl_read_mem_enable;
        read_mem_enable_d1 <= read_mem_enable;

        if (ctl_load_start_addr) begin
            read_mem_addr <= start_addr;
        end else if (ctl_inc_read_addr) begin
            // Every instruction is assumed to be 32 bits wide.
            read_mem_addr <= read_mem_addr + 4;
        end
    end
end
```
* Before fetching the next instruction, we must check whether the previous instruction was 16-bit or 32-bit.
```verilog
// An instruction is compressed when its two lowest bits are not both 1.
assign compress = (mem_read_done & read_mem_enable_d1 & ctl_inc_read_addr) ? (~(mem_data[0] & mem_data[1])) :
                  (ctl_inc_read_addr) ? (~(tmp_out[0] & tmp_out[1])) : 1'b0;

// Hold the most recently read data for use when no new read has completed.
always @(posedge clk, negedge reset_n) begin
    if (!reset_n)
        tmp_out <= 0;
    else if (mem_read_done)
        tmp_out <= mem_data;
end

always @(posedge clk, negedge reset_n) begin : read_mem_proc
    if (!reset_n) begin
        read_mem_enable <= 0;
        read_mem_addr <= 0;
        read_mem_enable_d1 <= 0;
    end else begin
        read_mem_enable <= ctl_read_mem_enable;
        read_mem_enable_d1 <= read_mem_enable;

        if (ctl_load_start_addr) begin
            read_mem_addr <= start_addr;
        end else if (ctl_inc_read_addr) begin
            // Advance by 2 after a compressed instruction, by 4 otherwise.
            read_mem_addr <= (compress) ? read_mem_addr + 2 : read_mem_addr + 4;
        end
    end
end
```
### **Problem 3: Branch Exception**
* When the condition of a branch instruction is satisfied, the controller checks whether the branch target address is misaligned. With the RV32C instruction set, however, a branch target may legitimately be only halfword-aligned, so this check must be relaxed. We modified the source code as shown below.

Original code
```verilog
ctl_fetch_init_branch = branch_active & (~(branch_addr[1]));

if (branch_active & branch_addr[1]) begin
    ctl_instruction_addr_misalign_exception = 1'b0;
    next_state [S_EXCEPTION] = 1'b1;
end
```
Modified code
```verilog
ctl_fetch_init_branch = branch_active;
```
* Load and store instructions have the same problem.
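The length rule from Problem 2 and the relaxed alignment rule from Problem 3 can be summarized in one small combinational sketch. It is a simplified illustration, not Reindeer's controller: the module name `rvc_next_pc_sketch` and all port names are hypothetical, and `pc` and `branch_addr` are assumed to be byte addresses.

```verilog
// Hypothetical sketch: next-PC selection for a core that supports RV32C.
module rvc_next_pc_sketch (
    input  wire [31:0] pc,            // byte address of the current instruction
    input  wire [15:0] instr_low,     // low halfword of the current instruction
    input  wire        branch_active, // 1 when a branch or jump is taken
    input  wire [31:0] branch_addr,   // byte address of the branch target
    output wire [31:0] next_pc,
    output wire        misaligned     // 1 when the taken target is not 2-byte aligned
);
    // A 16-bit instruction is any encoding whose two lowest bits are not 2'b11.
    wire is_compressed = (instr_low[1:0] != 2'b11);

    // Sequential flow: step over 2 or 4 bytes depending on instruction length.
    wire [31:0] seq_pc = pc + (is_compressed ? 32'd2 : 32'd4);

    // With the C extension a target needs only 2-byte alignment,
    // so bit 1 may be set and only bit 0 is checked here.
    assign misaligned = branch_active & branch_addr[0];

    assign next_pc = branch_active ? branch_addr : seq_pc;
endmodule
```

In this sketch a target such as 0x8000_0102 is accepted, which matches the modified `ctl_fetch_init_branch` assignment that no longer tests `branch_addr[1]`.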
### **Problem 4: FPGA**
* [Reindeer](https://github.com/PulseRain/Reindeer) targets the [MAX10 board](https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=218&No=998), but we only have a [step cyc10 board](https://www.stepfpga.com/doc/step-cyc10), so we could not program Reindeer onto the step cyc10.

## How to validate
* We use Verilator to simulate the Verilog design. A makefile reduces the testing effort: a single command runs a test and reports its result.
* Taking C-ADD as an example test program:
```
make test C-ADD
====> Testing ./obj_dir/VPulseRain_RV2T_MCU
TEST CASE: C-ADD
testing ../compliance/C-ADD.elf -r ../compliance/references/C-ADD.reference_output
=============================================================
=== PulseRain Technology, RISC-V RV32IM Test Bench
=============================================================
elf file : ../compliance/C-ADD.elf
reference : ../compliance/references/C-ADD.reference_output
```
![](https://i.imgur.com/lvksMZg.png)

## Simulation Result
* Test cases
    - [x] C-ADDI16SP
    - [x] C-ADDI4SPN
    - [x] C-ADDI
    - [x] C-ADD
    - [x] C-ANDI
    - [x] C-AND
    - [x] C-BEQZ
    - [x] C-BNEZ
    - [x] C-JALR
    - [x] C-JAL
    - [x] C-JR
    - [x] C-J
    - [x] C-LI
    - [x] C-LUI
    - [x] C-LW
    - [x] C-LWSP
    - [x] C-MV
    - [x] C-OR
    - [x] C-SLLI
    - [x] C-SRAI
    - [x] C-SRLI
    - [x] C-SUB
    - [x] C-SW
    - [x] C-SWSP
    - [x] C-XOR
    - [x] I-ADD-01
    - [x] I-ADDI-01
    - [x] I-AND-01
    - [x] I-ANDI-01
    - [x] I-AUIPC-01
    - [x] I-BEQ-01
    - [x] I-BGE-01
    - [x] I-BGEU-01
    - [x] I-BLT-01
    - [x] I-BLTU-01
    - [x] I-BNE-01
    - [x] I-CSRRC-01
    - [x] I-CSRRCI-01
    - [x] I-CSRRS-01
    - [x] I-CSRRSI-01
    - [x] I-CSRRW-01
    - [x] I-CSRRWI-01
    - [x] I-DELAY_SLOTS-01
    - [x] I-EBREAK-01
    - [x] I-ECALL-01
    - [x] I-FENCE.I-01
    - [x] I-IO
    - [x] I-JAL-01
    - [x] I-JALR-01
    - [x] I-LUI-01
    - [x] I-LW-01
    - [x] I-NOP-01
    - [x] I-OR-01
    - [x] I-ORI-01
    - [x] I-RF_size-01
    - [x] I-RF_width-01
    - [x] I-RF_x0-01
    - [x] I-SB-01
    - [x] I-SH-01
    - [x] I-SLL-01
    - [x] I-SLLI-01
    - [x] I-SLT-01
    - [x] I-SLTI-01
    - [x] I-SLTIU-01
    - [x] I-SLTU-01
    - [x] I-SRA-01
    - [x] I-SRAI-01
    - [x] I-SRL-01
    - [x] I-SRLI-01
    - [x] I-SUB-01
    - [x] I-SW-01
    - [x] I-XOR-01
    - [x] I-XORI-01
    - [x] I-LB-01
    - [x] I-LBU-01
    - [x] I-LH-01
    - [x] I-LHU-01
    - [x] MUL
    - [x] MULH
    - [x] MULHSU
    - [x] MULHU
    - [x] REM
    - [x] REMU
* Running `make test_all` makes the makefile run every test program listed above.
```
make test_all
====> Testing ALL
```
![](https://i.imgur.com/ivSRaiU.png)