owned this note
                
                
                     
                     owned this note
                
                
                     
                    
                
                
                     
                    
                
                
                     
                    
                        
                            
                            Published
                        
                        
                            
                                
                                Linked with GitHub
                            
                            
                                
                                
                            
                        
                     
                
            
            
                
                    
                    
                
                
                    
                
                
                
                    
                        
                    
                    
                    
                
                
                
                    
                
            
            
         
        
        # Implement RISC-V Compressed Instruction Set for [Reindeer](https://github.com/PulseRain/Reindeer)
:::danger
TODO: make a list for project members
:::
contributed by < 林家葦 >,< 黃俞紘 >,< 曾士峰>,< 潘家瑞 >
## Requirements
- extend pull request [RV32C support](https://github.com/PulseRain/Reindeer/pull/25) to ensure functional pipeline design;
- Validate with simulator;
- (optional) validate on Step CYC10 FPGA board;
###### tags: `Computer Architecture`
## RV32C 
* RVC uses a simple compression scheme that offers shorter 16-bit versions of common 32-bit RICS-V instructions 
    * the immediate or address offset is small
    * one of the registers is the zero register (x0), the ABI link register (x1), or the ABI stack pointer
    * the destination register and the first source register are identical
    * the registers used are the 8 most popular ones
* How to encoder Compress Instruction

* How to generate compress instruction 
    * We add extra flag -march=rv32im**c** to compile original source code, and we can get the compress instruction
        ```
        riscv-none-embed-gcc -march=rv32imc -mabi=ilp32 -static -mcmodel=medany -fvisibility=hidden -nostdlib -nostartfiles -I/home/wayen/riscv-compliance/riscv-test-env/ -I/home        /wayen/riscv-compliance/riscv-test-env/p/ -I/home/wayen/riscv-compliance/riscv-target/riscvOVPsim/ -T/home/wayen/riscv-compliance/riscv-test-env/p/link.ld src/C-J.S -o /h        ome/wayen/riscv-compliance/work/rv32imc/C-J.elf; riscv-none-embed-objdump -D /home/wayen/riscv-compliance/work/rv32imc/C-J.elf > /home/wayen/riscv-compliance/work/rv32imc        /C-J.elf.objdump    
        ```
        
    * Source code 
        ``` = assember
        #include "riscv_test_macros.h"
        #include "compliance_test.h"
        #include "compliance_io.h"
        RV_COMPLIANCE_RV32M
        RV_COMPLIANCE_CODE_BEGIN
        RVTEST_IO_INIT
        RVTEST_IO_ASSERT_GPR_EQ(x31, x0, 0x00000000)
        RVTEST_IO_WRITE_STR(x31, "Test Begin\n")
        # ---------------------------------------------------------------------------------------------
        RVTEST_IO_WRITE_STR(x31, "# Test number 1\n")
        # address for test results
        la	x5, test_1_res
        TEST_RR_OP(add, x0, x31, x16, 0x0, -0x1, 0x0, x5, 0, x6)   # Testcase 0
        TEST_RR_OP(add, x1, x30, x15, 0xfffff802, 0x1, -0x7ff, x5, 4, x6)   # Testcase 1
        TEST_RR_OP(add, x2, x29, x14, 0xffffffff, 0x0, -0x1, x5, 8, x6)   # Testcase 2
        TEST_RR_OP(add, x3, x28, x13, 0xfffff5cb, 0x7ff, -0x1234, x5, 12, x6)   # Testcase 3
        TEST_RR_OP(add, x4, x27, x12, 0x80000000, 0x0, 0x80000000, x5, 16, x6)   # Testcase 4
        # ---------------------------------------------------------------------------------------------
        RVTEST_IO_WRITE_STR(x31, "# Test number 2\n")
        # address for test results
        la	x1, test_2_res
        TEST_RR_OP(add, x5, x26, x11, 0x1a34, 0x800, 0x1234, x1, 0, x2)   # Testcase 5
        TEST_RR_OP(add, x6, x25, x10, 0x7654320, 0x7654321, 0xffffffff, x1, 4, x2)   # Testcase 6
        TEST_RR_OP(add, x7, x24, x9, 0x80000000, 0x7fffffff, 0x1, x1, 8, x2)   # Testcase 7
        TEST_RR_OP(add, x8, x23, x8, 0x80000000, 0x1, 0x7fffffff, x1, 12, x2)   # Testcase 8
        TEST_RR_OP(add, x9, x22, x7, 0x7654320, 0xffffffff, 0x7654321, x1, 16, x2)   # Testcase 9
        # ---------------------------------------------------------------------------------------------
        RVTEST_IO_WRITE_STR(x31, "# Test number 3\n")
        # address for test results
        la	x1, test_3_res
        TEST_RR_OP(add, x10, x21, x6, 0x1a34, 0x1234, 0x800, x1, 0, x7)   # Testcase 10
        TEST_RR_OP(add, x11, x20, x5, 0x80000000, 0x80000000, 0x0, x1, 4, x7)   # Testcase 11
        TEST_RR_OP(add, x12, x19, x4, 0xfffff5cb, -0x1234, 0x7ff, x1, 8, x7)   # Testcase 12
        TEST_RR_OP(add, x13, x18, x3, 0xfffffffe, -0x1, -0x1, x1, 12, x7)   # Testcase 13
        TEST_RR_OP(add, x14, x17, x2, 0xfffff802, -0x7ff, 0x1, x1, 16, x7)   # Testcase 14
        # ---------------------------------------------------------------------------------------------
        RVTEST_IO_WRITE_STR(x31, "# Test number 4\n")
        # address for test results
        la	x2, test_4_res
        TEST_RR_OP(add, x15, x16, x1, 0x0, 0x0, 0x0, x2, 0, x3)   # Testcase 15
        TEST_RR_OP(add, x16, x15, x0, 0xffffffff, -0x1, 0x0, x2, 4, x3)   # Testcase 16
        TEST_RR_OP(add, x17, x14, x31, 0xfffff802, 0x1, -0x7ff, x2, 8, x3)   # Testcase 17
        TEST_RR_OP(add, x18, x13, x30, 0xffffffff, 0x0, -0x1, x2, 12, x3)   # Testcase 18
        TEST_RR_OP(add, x19, x12, x29, 0xfffff5cb, 0x7ff, -0x1234, x2, 16, x3)   # Testcase 19
        # ---------------------------------------------------------------------------------------------
        RVTEST_IO_WRITE_STR(x31, "# Test number 5\n")
        # address for test results
        la	x1, test_5_res
        TEST_RR_OP(add, x20, x11, x28, 0x80000000, 0x0, 0x80000000, x1, 0, x2)   # Testcase 20
        TEST_RR_OP(add, x21, x10, x27, 0x1a34, 0x800, 0x1234, x1, 4, x2)   # Testcase 21
        TEST_RR_OP(add, x22, x9, x26, 0x7654320, 0x7654321, 0xffffffff, x1, 8, x2)   # Testcase 22
        TEST_RR_OP(add, x23, x8, x25, 0x80000000, 0x7fffffff, 0x1, x1, 12, x2)   # Testcase 23
        TEST_RR_OP(add, x24, x7, x24, 0x80000000, 0x1, 0x7fffffff, x1, 16, x2)   # Testcase 24
        # ---------------------------------------------------------------------------------------------
        RVTEST_IO_WRITE_STR(x31, "# Test number 6\n")
        # address for test results
        la	x1, test_6_res
        TEST_RR_OP(add, x25, x6, x23, 0x7654320, 0xffffffff, 0x7654321, x1, 0, x7)   # Testcase 25
        TEST_RR_OP(add, x26, x5, x22, 0x1a34, 0x1234, 0x800, x1, 4, x7)   # Testcase 26
        TEST_RR_OP(add, x27, x4, x21, 0x80000000, 0x80000000, 0x0, x1, 8, x7)   # Testcase 27
        TEST_RR_OP(add, x28, x3, x20, 0xfffff5cb, -0x1234, 0x7ff, x1, 12, x7)   # Testcase 28
        TEST_RR_OP(add, x29, x2, x19, 0xfffffffe, -0x1, -0x1, x1, 16, x7)   # Testcase 29
        # ---------------------------------------------------------------------------------------------
        RVTEST_IO_WRITE_STR(x31, "# Test number 7\n")
        # address for test results
        la	x2, test_7_res
        TEST_RR_OP(add, x30, x1, x18, 0xfffff802, -0x7ff, 0x1, x2, 0, x3)   # Testcase 30
        TEST_RR_OP(add, x31, x0, x17, 0x0, 0x0, 0x0, x2, 4, x3)   # Testcase 31
        # ---------------------------------------------------------------------------------------------
        RVTEST_IO_WRITE_STR(x31, "Test End\n")
        # ---------------------------------------------------------------------------------------------
        RV_COMPLIANCE_HALT
        RV_COMPLIANCE_CODE_END
        # Input data section.
            .data
        # Output data section.
        RV_COMPLIANCE_DATA_BEGIN
        test_1_res:
            .fill 5, 4, -1
        test_2_res:
            .fill 5, 4, -1
        test_3_res:
            .fill 5, 4, -1
        test_4_res:
            .fill 5, 4, -1
        test_5_res:
            .fill 5, 4, -1
        test_6_res:
            .fill 5, 4, -1
        test_7_res:
            .fill 5, 4, -1
        RV_COMPLIANCE_DATA_END
        ```
    *  We fetch some objdump information
        *  RV32C
         ``` 
        800000f0 <begin_testcode>:
        800000f0:	00002117          	auipc	sp,0x2
        800000f4:	f1010113          	addi	sp,sp,-240 # 80002000 <begin_signature>
        800000f8:	4201                	li	tp,0
        800000fa:	4181                	li	gp,0
        800000fc:	9192                	add	gp,gp,tp
        800000fe:	c00e                	sw	gp,0(sp)
        80000100:	4481                	li	s1,0
        80000102:	4405                	li	s0,1
        80000104:	9426                	add	s0,s0,s1
        80000106:	c222                	sw	s0,4(sp)
        80000108:	4601                	li	a2,0
        8000010a:	fff00593          	li	a1,-1
        8000010e:	95b2                	add	a1,a1,a2
                        .
                        .
                        .
         ```
        *  RV32I
         ``` 
        80000108:	00002097          	auipc	ra,0x2
        8000010c:	ef808093          	addi	ra,ra,-264 # 80002000 <test_A1_data>
        80000110:	00002117          	auipc	sp,0x2
        80000114:	f2010113          	addi	sp,sp,-224 # 80002030 <begin_signature>
        80000118:	0000a183          	lw	gp,0(ra)
        8000011c:	00000213          	li	tp,0
        80000120:	00100293          	li	t0,1
        80000124:	fff00313          	li	t1,-1
        80000128:	800003b7          	lui	t2,0x80000
        8000012c:	fff38393          	addi	t2,t2,-1 # 7fffffff <_end+0xffffdf1f>
                        .
                        .
                        .
         ```
## How to work
* We add instruction compress decoder to convert 16-bit instruction to 32-bit instructions.
    1. First,we fetch the instruction according to program counter.
    2. And I get the 32-bit/16bit instruction, and we send to **Compress Decoder Unit** to decode compress instruction. If the instruction is not compress,it will be passed by. But if it is compresss,it will be convert to 32-bit instruction.
    3. send the already decoder instruction to Instruction decoder unit.
    4. execute the instruction.

## Problem we encountered
### **Problem 1: Instruction Unalignment**
* RV32C reduces static and dynamic code size by adding short 16-bit instruction encoding for common operation. But because of adding short 16-bit instruction encoding for common operation, we must deal with fetching 32-bit and 16-bit instruction.
* **memory unalignment**
    * When a computer reads from or writes to a memory address, it will do this in word sized chunks. Data alignment means putting the data at a memory offset equal to some multiple of the word size, which increases the system's performance due to the way the CPU handles memory
    * For example
        | Memory address | Alignment(32bit) |
        | --------       | --------         | 
        | 0x0000_0000    |Aligned           |
        | 0x0000_0001    |not Aligned       | 
        | 0x0000_0002    |not Aligned       | 
        | 0x0000_0003    |not Aligned       | 
        | 0x0000_0004    |Aligned           | 
        | 0x0000_0005    |not Aligned       | 
        | 0x0000_0006    |not Aligned       | 
        | 0x0000_0007    |not Aligned       | 
        | 0x0000_0008    |Aligned           | 
        
      
        
* And we execute this code. The word sized equal to 32-bits. So we fetch 32-bit data every time.
    * Execute this code
        ```
            800000fc:	9192                	add	gp,gp,tp
            800000fe:	c00e                	sw	gp,0(sp)
            80000100:	4481                	li	s1,0
            80000102:	00002117          	auipc	sp,0x2
            80000106:	f1010113          	addi	sp,sp,-240 # 80002000 
        ```
    * Memory content
        | Memory address | content |
        | --------       | --------|
        | 0x8000_00fc    |92       |
        | 0x8000_00fd    |91       |
        | 0x8000_00fe    |0e       |
        | 0x8000_0100    |c0       |
        | 0x8000_0101    |81       |
        | 0x8000_0102    |44       |
        | 0x8000_0103    |17       |
        | 0x8000_0104    |21       |
        | 0x8000_0105    |00       |
        | 0x8000_0106    |00       |
        | 0x8000_0107    |13       |
        | 0x8000_0108    |01       |
        | 0x8000_0109    |01       |
        | 0x8000_010a    |f1       |
    * In cycle 0, we get **c00e_9192** instruction but PC 8000_00fc is 16-bit instruction. So we just take we need instruction **9192**. But in cycle 3 the PC 8000_0102 sends 8000_0100 memory address to memory, getting wrong instruction **2117_4481** instead of correct instruction **0000_2117**.
    
        | cycle    | 16bit or 32bit instruction|Program Counter  | Memory address to Memory |Fetch Instruction  |correct Instrucion|correct|
        | -------- | --------         |--------  |--------            | --------|   --------|     -------- |    
        | 0        |16-bit|   0x8000_00fc      |0x8000_00fc |0xc00e_9192          |0x9192|correct|
        | 1        | 16-bit|  0x8000_00fe      |0x8000_00fc |0xc00e_9192          |0xc00e|correct
        | 2        | 16-bit  |0x8000_0100      |0x8000_0100 |0x2117_4481          |0x4481|correct
        | 3        | 32-bit  |0x8000_0102      |0x8000_0100 |0x2117_4481          |0x0000_2117|not correct|
* **How to solve memory unalignment**
    * Reindeer splits one memory to high memory unit and low memory unit, we can use this characteristic to get correct instruction in 1 cycle.
    
        * In PulseRain_RV2T_MCU.v, we can see the origanl source code.
        ```= 
        single_port_ram #(.ADDR_WIDTH (`MEM_ADDR_BITS), .DATA_WIDTH (16) ) ram_high_i (
                            .addr (mem_addr),
                            .din (mem_write_data [31 : 16]),
                            .write_en (mem_write_en[3 : 2]),
                            .clk (clk),
                            .dout (dout_high));
        single_port_ram #(.ADDR_WIDTH (`MEM_ADDR_BITS), .DATA_WIDTH (16) ) ram_low_i (
                            .addr (mem_addr),
                            .din (mem_write_data [15 : 0]),
                            .write_en (mem_write_en[1 : 0]),
                            .clk (clk),
                            .dout (dout_low));
                            assign mem_read_data = {dout_high, dout_low};
        assign mem_read_data = {dout_high, dout_low};
        ```
    * When memory receive the address, we can check the memory address second bit whether equals to 1. If equaling to 1 , we send next address to low memory unit, and send orignal address to high memory unit. And we can get correct instruction in 1 cycle.
         | cycle    | Program Counter  | Memory address to Memory |High Address|Low Address |High data | low                     data|Fetch Instruction  |
         | -------- | --------         |--------  |--------|--------|--------|-------- |--------|
         | 0        |   8000_00fc      |8000_00fc |8000_00fc|8000_00fc|c00e|9192|c00e_9192|
         | 1        |   8000_00fe      |8000_00fc |8000_00fc|8000_0100|c00e|4481|4481_c00e|
         | 2        |   8000_0100      |8000_0100 |8000_0100|8000_0100|4481|2117|4481_2117|
         | 3        |   8000_0102      |8000_0100 |8000_0100|8000_0102|4481|0000|0000_2117|
         | 4        |8000_0106|8000_0104|8000_0104|8000_0108|0113|f101|f101_0113|
         
         
         
        * And we modify PulseRain_RV2T_MCU.v to solve unalignment data.
    ```=verilog 
      single_port_ram #(.ADDR_WIDTH (`MEM_ADDR_BITS), .DATA_WIDTH (16) ) ram_high_i (
                .addr (mem_addr_high),
                .din (mem_write_data [31 : 16]),
                .write_en (mem_write_en[3 : 2]),
                .clk (clk),
                .dout (dout_high));
              
            single_port_ram #(.ADDR_WIDTH (`MEM_ADDR_BITS), .DATA_WIDTH (16) ) ram_low_i (
                .addr (mem_addr_low),
                .din (mem_write_data [15 : 0]),
                .write_en (mem_write_en[1 : 0]),
                .clk (clk),
                .dout (dout_low));
                
    assign mem_addr_high = mem_addr[`MEM_ADDR_BITS:1];
    assign mem_addr_low = (mem_addr[0])? mem_addr[`MEM_ADDR_BITS:1] + 1 : mem_addr[`MEM_ADDR_BITS:1];
    assign mem_read_data = (mem_addr_d1[0]) ? {dout_low,dout_high} : {dout_high,dout_low};
    ```
    * Original memory access 
    
    * Our memory access
    
:::info 
And this method also can solve the branch address unalignment in one cycle.
:::
        
### **Problem 2:  Program Counter**
* When we fetch the 16-bit instruction, the program counter should be add 2. And we fetch the 32-bit instruction, the program counter should be add 4.
* original code 
```=verilog
 always @(posedge clk, negedge reset_n) begin : read_mem_proc
                    if (!reset_n) begin
                        read_mem_enable    <= 0;
                        read_mem_addr      <= 0;
                        read_mem_enable_d1 <= 0;
                    end else begin
                        read_mem_enable    <= ctl_read_mem_enable;
                        read_mem_enable_d1 <= read_mem_enable;
                        
                        if (ctl_load_start_addr) begin
                            read_mem_addr <= start_addr;
                        end else if (ctl_inc_read_addr) begin
                            read_mem_addr <= read_mem_addr + 4;
                        end
                        
                    end
                end
```
* Before we fetch next instruction, we should check previous instruction whether is 32-bit.
``` =verilog 
assign compress = (mem_read_done & read_mem_enable_d1 & ctl_inc_read_addr )? (~(mem_data[0] & mem_data[1]))  : 
                              (ctl_inc_read_addr   )?  (~(tmp_out[0] & tmp_out[1])) : 1'b0;
                              
always@(posedge clk,negedge reset_n) begin
    if (!reset_n) 
        tmp_out <= 0;
    else if (mem_read_done)
        tmp_out <= mem_data;
end
always @(posedge clk, negedge reset_n) begin : read_mem_proc
    if (!reset_n) begin
        read_mem_enable    <= 0;
        read_mem_addr      <= 0;
        read_mem_enable_d1 <= 0;
    end else begin
        read_mem_enable    <= ctl_read_mem_enable;
        read_mem_enable_d1 <= read_mem_enable;
        if (ctl_load_start_addr) begin
            read_mem_addr <= start_addr;
        end else if (ctl_inc_read_addr) begin
            read_mem_addr <= (compress) ? read_mem_addr + 2 : read_mem_addr + 4;
        end
    end
end
```
### **Problem 3: Branch Exception**
* If conditions of branch instruction is satisfied,the controller will check the branch address whether is misalignment. But in rv32c instruction set, the branch address may unalign. So we must deal with this problem. We modify source code to below.
source code 
``` =verilog 
ctl_fetch_init_branch = branch_active & (~(branch_addr[1]));
if (branch_active & branch_addr[1]) begin
    ctl_instruction_addr_misalign_exception = 1'b0;
    next_state [S_EXCEPTION] = 1'b1;
end
```
modify code 
```
    ctl_fetch_init_branch = branch_active;
```
* Load and store instruction have the same problem.
### **Problem 4: FPGA**
* Because the [Reindeer](https://github.com/PulseRain/Reindeer) is for [MAX10 board](https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=218&No=998). But we just have [step cyc10 board](https://www.stepfpga.com/doc/step-cyc10). So we can't program the Reindeer in step cyc10.
## How to validate 
* I use verilator to simulation verilog. And we have makefile to help reduce test effort, we just use one instruction to get all test result.
    * take C-ADD for our test program
```
make test C-ADD
====> Testing ./obj_dir/VPulseRain_RV2T_MCU
TEST CASE: C-ADD
testing ../compliance/C-ADD.elf  -r ../compliance/references/C-ADD.reference_output
=============================================================
=== PulseRain Technology, RISC-V RV32IM Test Bench
=============================================================
     elf file   : ../compliance/C-ADD.elf
     reference  : ../compliance/references/C-ADD.reference_output
```

## Simulation Result
* Test case 
- [x] C-ADDI16SP
- [x] C-ADDI4SPN
- [x] C-ADDI
- [x] C-ADD
- [x] C-ANDI
- [x] C-AND
- [x] C-BEQZ
- [x] C-BNEZ
- [x] C-JALR
- [x] C-JAL
- [x] C-JR
- [x] C-J
- [x] C-LI
- [x] C-LUI
- [x] C-LW
- [x] C-LWSP
- [x] C-MV
- [x] C-OR
- [x] C-SLLI
- [x] C-SRAI
- [x] C-SRLI
- [x] C-SUB
- [x] C-SW
- [x] C-SWSP
- [x] C-XOR
- [x] I-ADD-01 
- [x] I-ADDI-01 
- [x] I-AND-01 
- [x] I-ANDI-01 
- [x] I-AUIPC-01 
- [x] I-BEQ-01 
- [x] I-BGE-01 
- [x] I-BGEU-01 
- [x] I-BLT-01 
- [x] I-BLTU-01 
- [x] I-BNE-01 
- [x] I-CSRRC-01 
- [x] I-CSRRCI-01 
- [x] I-CSRRS-01 
- [x] I-CSRRSI-01 
- [x] I-CSRRW-01 
- [x] I-CSRRWI-01 
- [x] I-DELAY_SLOTS-01 
- [x] I-EBREAK-01 
- [x] I-ECALL-01 
- [x] I-FENCE.I-01 
- [x] I-IO 
- [x] I-JAL-01 
- [x] I-JALR-01 
- [x] I-LUI-01 
- [x] I-LW-01 
- [x] I-NOP-01 
- [x] I-OR-01 
- [x] I-ORI-01 
- [x] I-RF_size-01 
- [x] I-RF_width-01 
- [x] I-RF_x0-01 
- [x] I-SB-01 
- [x] I-SH-01 
- [x] I-SLL-01 
- [x] I-SLLI-01 
- [x] I-SLT-01 
- [x] I-SLTI-01 
- [x] I-SLTIU-01 
- [x] I-SLTU-01 
- [x] I-SRA-01 
- [x] I-SRAI-01 
- [x] I-SRL-01 
- [x] I-SRLI-01 
- [x] I-SUB-01 
- [x] I-SW-01 
- [x] I-XOR-01 
- [x] I-XORI-01 
- [x] I-LB-01 
- [x] I-LBU-01 
- [x] I-LH-01 
- [x] I-LHU-01 
- [x] MUL 
- [x] MULH 
- [x] MULHSU 
- [x] MULHU 
- [x] REM 
- [x] REMU
* use making test_all,the makefile will help us to test all test program in above.
    ```
    make test_all
    ====> Testing ALL
    ```
