Implement Vector extension for rv32emu

黃士昕

vestata/rv32emu:vector

Goal

Implement RVV instruction decoding and an interpreter on top of the latest rv32emu codebase (keeping the branch rebased). The implementation must ultimately pass the riscv-vector-tests.

Environment

$ riscv64-unknown-elf-gcc --version
riscv64-unknown-elf-gcc (g04696df09) 14.2.0
$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   24
  On-line CPU(s) list:    0-23
Vendor ID:                GenuineIntel
  Model name:             13th Gen Intel(R) Core(TM) i7-13700
    CPU family:           6
    Model:                183
    Thread(s) per core:   2
    Core(s) per socket:   16
    Socket(s):            1
    Stepping:             1
    CPU(s) scaling MHz:   20%
    CPU max MHz:          5200.0000
    CPU min MHz:          800.0000
    BogoMIPS:             4224.00

[riscv-gnu-toolchain]

[riscv-isa-sim]

[riscv-pk]

Implementation

The vector extension adds 32 vector registers, and seven unprivileged CSRs (vstart, vxsat, vxrm, vcsr, vtype, vl, vlenb) to a base scalar RISC-V ISA.


A vector register is VLEN bits wide, where VLEN must be a power of two and no smaller than ELEN. To emulate one, we use an array of uint32_t. This implementation fixes VLEN at 128 bits, but the code is designed to scale to other lengths.

 #if RV32_HAS(EXT_F)
 typedef softfloat_float32_t riscv_float_t;
 #endif
+#if RV32_HAS(EXT_V)
+/* Fixme:Temporary set vl as 128 */
+typedef uint32_t vector128_t[4];
+#endif
 
 /* memory read handlers */
 typedef riscv_word_t (*riscv_mem_ifetch)(riscv_t *rv, riscv_word_t addr);
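
To illustrate how elements of different widths can live inside such an array, here is a minimal sketch. The helpers `vreg_get` and `vreg_set` and the constant `VLEN_BITS` are hypothetical names, not part of rv32emu, and the sketch assumes elements are packed little-endian within each 32-bit word.

```c
#include <assert.h>
#include <stdint.h>

#define VLEN_BITS 128                 /* assumed VLEN, matching the text */
#define VREG_WORDS (VLEN_BITS / 32)

typedef uint32_t vector128_t[VREG_WORDS];

/* Read element `idx` of width `sew` bits (8, 16 or 32) from a register. */
static inline uint32_t vreg_get(const vector128_t v, uint32_t idx, uint32_t sew)
{
    assert(sew == 8 || sew == 16 || sew == 32);
    assert(idx < VLEN_BITS / sew);
    uint32_t per_word = 32 / sew;            /* elements per 32-bit word */
    uint32_t shift = (idx % per_word) * sew; /* bit offset inside the word */
    uint32_t mask = (sew == 32) ? 0xFFFFFFFFu : ((1u << sew) - 1);
    return (v[idx / per_word] >> shift) & mask;
}

/* Write element `idx` of width `sew` bits, leaving its neighbours intact. */
static inline void vreg_set(vector128_t v, uint32_t idx, uint32_t sew,
                            uint32_t val)
{
    assert(sew == 8 || sew == 16 || sew == 32);
    assert(idx < VLEN_BITS / sew);
    uint32_t per_word = 32 / sew;
    uint32_t shift = (idx % per_word) * sew;
    uint32_t mask = (sew == 32) ? 0xFFFFFFFFu : ((1u << sew) - 1);
    uint32_t w = idx / per_word;
    v[w] = (v[w] & ~(mask << shift)) | ((val & mask) << shift);
}
```

With sew = 8, element 5 lands in byte 1 of word 1; with sew = 16, element 1 occupies the upper half of word 0.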


The vtype and vl CSRs play a crucial role in RVV, because vsew and vlmul can change dynamically during emulation. Together with vlen they determine vlmax = lmul * vlen / sew, the maximum number of sew-wide elements a single instruction can operate on.
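
As a rough sketch of how these parameters are packed, the following splits a 32-bit vtype value into its fields using the layout in the RVV 1.0 spec (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7, vill in the top bit). The struct and function names are hypothetical and do not come from rv32emu. For example, with vlen = 128, sew = 32 and lmul = 2, vlmax = 2 * 128 / 32 = 8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the decoded vtype fields (XLEN = 32). */
typedef struct {
    uint8_t vlmul; /* vtype[2:0] */
    uint8_t vsew;  /* vtype[5:3] */
    bool vta;      /* vtype[6]   */
    bool vma;      /* vtype[7]   */
    bool vill;     /* vtype[31]  */
} vtype_fields_t;

static inline vtype_fields_t decode_vtype(uint32_t vtype)
{
    vtype_fields_t f;
    f.vlmul = vtype & 0x7;
    f.vsew = (vtype >> 3) & 0x7;
    f.vta = (vtype >> 6) & 0x1;
    f.vma = (vtype >> 7) & 0x1;
    f.vill = (vtype >> 31) & 0x1;
    return f;
}

/* Convert the vsew encoding (0..3) into an element width in bits. */
static inline uint32_t sew_bits(uint8_t vsew)
{
    assert(vsew <= 3); /* 8, 16, 32 or 64 bits; larger encodings are reserved */
    return 8u << vsew;
}
```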

Decode

In the RISC-V Vector Extension version 1.0 specification, the exact number of instructions in the extension instruction set is not explicitly stated. Based on my own analysis of the specification, I have identified a total of 616 instructions. These instructions can be broadly categorized into three main groups: Configuration-Setting Instructions, Vector Loads/Stores, and Vector Arithmetic Instructions.

The Vector Arithmetic Instructions can be further divided into six categories based on their functionality: Vector Integer Arithmetic Instructions, Vector Fixed-Point Arithmetic Instructions, Vector Floating-Point Instructions, Vector Reduction Operations, Vector Mask Instructions, and Vector Permutation Instructions.

Configuration-Setting Instructions

The Configuration-Setting Instructions and Vector Arithmetic Instructions share the same major opcode, OP-V (1010111), while the Vector Load Instructions use the same opcode as LOAD-FP (0000111) and the Vector Store Instructions use the same opcode as STORE-FP (0100111). In the first step of decoding, a jump table classifies instructions by opcode.

First, instructions with the opcode 1010111 are handled. Configuration-setting instructions have funct3 = 111 and go through the regular jump table; everything else is dispatched through rvv_jump_table by funct6:

+#if RV32_HAS(EXT_V)
+    /* Handle vector operations */
+    if (index == 0b10101) {
+        /* Since vcfg and vop uses the same opcode */
+        if (decode_funct3(insn) == 0b111) {
+            const decode_t op = rv_jump_table[index];
+            return op(ir, insn);
+        }
+        const uint32_t v_index = (insn >> 26) & 0x3F;
+        const decode_t op = rvv_jump_table[v_index];
+        return op(ir, insn);
+    /* Fixme:VMEM */
+    }
+#endif
+
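
For reference, here is a small sketch of the bit slicing behind this dispatch. It assumes the jump-table index is taken from instruction bits 6:2, which matches the row and column labels of rv_jump_table; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bits [6:2] select the major opcode; bits [1:0] are 11 for every
 * 32-bit encoding, so they carry no information here. */
static inline uint32_t major_opcode_index(uint32_t insn)
{
    assert((insn & 0x3) == 0x3); /* only 32-bit encodings are expected */
    return (insn >> 2) & 0x1F;
}

/* funct3 occupies bits [14:12]. */
static inline uint32_t funct3_of(uint32_t insn)
{
    return (insn >> 12) & 0x7;
}

/* funct6 occupies bits [31:26] of an OP-V instruction. */
static inline uint32_t funct6_of(uint32_t insn)
{
    return (insn >> 26) & 0x3F;
}
```

For the OP-V opcode 1010111 (0x57) the index is 0b10101, which is the value tested in the diff above. LOAD-FP (0x07) and STORE-FP (0x27) map to indices 0b00001 and 0b01001.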

By extending the original jump table to include OP(vcfg), the decoding of Configuration-Setting Instructions can be completed seamlessly.

static const decode_t rv_jump_table[] = {
     //  000         001           010        011           100         101        110        111
         OP(load),   OP(load_fp),  OP(unimp), OP(misc_mem), OP(op_imm), OP(auipc), OP(unimp), OP(unimp), // 00
         OP(store),  OP(store_fp), OP(unimp), OP(amo),      OP(op),     OP(lui),   OP(unimp), OP(unimp), // 01
-        OP(madd),   OP(msub),     OP(nmsub), OP(nmadd),    OP(op_fp),  OP(unimp), OP(unimp), OP(unimp), // 10
+        OP(madd),   OP(msub),     OP(nmsub), OP(nmadd),    OP(op_fp),  OP(vcfg), OP(unimp), OP(unimp),  // 10
         OP(branch), OP(jalr),     OP(unimp), OP(jal),      OP(system), OP(unimp), OP(unimp), OP(unimp), // 11
     };

Vector Arithmetic Instructions

The Vector Arithmetic Instructions category contains the largest number of instructions, with a total of 305 types. Following the instruction table in the spec (inst-table.adoc), decoding is based on the 6-bit funct6 field and the 3-bit funct3 field. I have created an rvv_jump_table, which works like the existing rv_jump_table. Since the same funct6 value can map to instructions from the OPI, OPM, and OPF categories, each handler is named directly after the binary value of funct6, such as op_111111.

+#if RV32_HAS(EXT_V)
+    /* RVV vector opcode map */
+    static const decode_t rvv_jump_table[] = {
+    /* According to https://github.com/riscvarchive/riscv-v-spec/blob/master/inst-table.adoc, this table is indexed by funct6. */
+    //  000        001        010        011        100        101        110        111
+        OP(000000), OP(000001), OP(000010), OP(000011), OP(000100), OP(000101), OP(000110), OP(000111),  // 000
+        OP(001000), OP(001001), OP(001010), OP(001011), OP(001100), OP(unimp), OP(001110), OP(001111),  // 001
+        OP(010000), OP(010001), OP(010010), OP(010011), OP(010100), OP(unimp), OP(unimp), OP(010111),  // 010
+        OP(011000), OP(011001), OP(011010), OP(011011), OP(011100), OP(011101), OP(011110), OP(011111),  // 011
+        OP(100000), OP(100001), OP(100010), OP(100011), OP(100100), OP(100101), OP(100110), OP(100111),  // 100
+        OP(101000), OP(101001), OP(101010), OP(101011), OP(101100), OP(101101), OP(101110), OP(101111),  // 101
+        OP(110000), OP(110001), OP(110010), OP(110011), OP(110100), OP(110101), OP(110110), OP(110111),  // 110
+        OP(111000), OP(unimp), OP(111010), OP(111011), OP(111100), OP(111101), OP(111110), OP(111111)   // 111
+    };

Here is a demonstration of how the decoding process works. Other instructions within the Vector Arithmetic Instructions category follow the same decoding pattern.

+static inline bool op_101001(rv_insn_t *ir, const uint32_t insn)
+{
+    switch (decode_funct3(insn)) {
+    case 0:
+        decode_vvtype(ir, insn);
+        ir->opcode = rv_insn_vsra_vv;
+        break;
+    case 1:
+        decode_vvtype(ir, insn);
+        ir->opcode = rv_insn_vfnmadd_vv;
+        break;
+    case 2:
+        decode_vvtype(ir, insn);
+        ir->opcode = rv_insn_vmadd_vv;
+        break;
+    case 3:
+        decode_vitype(ir, insn);
+        ir->opcode = rv_insn_vsra_vi;
+        break;
+    case 4:
+        decode_vxtype(ir, insn);
+        ir->opcode = rv_insn_vsra_vx;
+        break;
+    case 5:
+        decode_vxtype(ir, insn);
+        ir->opcode = rv_insn_vfnmadd_vf;
+        break;
+    case 6:
+        decode_vxtype(ir, insn);
+        ir->opcode = rv_insn_vmadd_vx;
+        break;
+    default: /* illegal instruction */
+        return false;
+    }
+    return true;
+}
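
The decode_vvtype, decode_vitype and decode_vxtype helpers are not shown in this note. As a hedged sketch of what such helpers need to extract, the following pulls the operand fields out of an OP-V arithmetic encoding using the field positions from the spec (vd in bits 11:7, vs1/rs1/imm in bits 19:15, vs2 in bits 24:20, vm in bit 25). The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the operand fields of an OP-V instruction. */
typedef struct {
    uint8_t vd;  /* bits [11:7]  */
    uint8_t vs1; /* bits [19:15], also rs1 or imm[4:0] depending on funct3 */
    uint8_t vs2; /* bits [24:20] */
    uint8_t vm;  /* bit  [25], 1 = unmasked */
} varith_fields_t;

static inline varith_fields_t decode_varith_fields(uint32_t insn)
{
    assert((insn & 0x7F) == 0x57); /* caller is expected to have seen OP-V */
    varith_fields_t f;
    f.vd = (insn >> 7) & 0x1F;
    f.vs1 = (insn >> 15) & 0x1F;
    f.vs2 = (insn >> 20) & 0x1F;
    f.vm = (insn >> 25) & 0x1;
    return f;
}

/* Sign-extend the 5-bit immediate used by most .vi forms. */
static inline int32_t decode_simm5(uint32_t insn)
{
    uint32_t raw = (insn >> 15) & 0x1F;
    return (int32_t) (raw ^ 0x10) - 0x10;
}
```

For example, 0x022180D7 (vadd.vv v1, v2, v3, unmasked) yields vd = 1, vs1 = 3, vs2 = 2 and vm = 1.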

Vector Loads/Stores

Vector loads and stores are encoded within the scalar floating-point load and store major opcodes (LOAD-FP/STORE-FP).


Since Vector Load and LOAD-FP share the same opcode, decoding must rely on funct3 (the width field). The scalar LOAD-FP instruction flw is an I-type instruction with width 010, so that value is used to tell it apart from the vector loads.

Load

The Vector Load instructions consist of 177 types, making them highly complex. However, they can be systematically decoded by following the hierarchy of mop, lumop, nf, and eew. Below is a tree diagram categorizing the Vector Load instructions:

vlxxx<nf>e<i><eew><ff>_v
└── mop
    ├── 00
    │   └── lumop
    │       ├── 00000
    │       │   └── nf
    │       │       ├── 000 -> vle<eew>.v
    │       │       ├── 001 -> vlseg<2>e<eew>.v
    │       │       └── 111 -> vlseg<8>e<eew>.v
    │       ├── 01000
    │       │   └── nf
    │       │       ├── 000 -> vl<1>r<eew>.v
    │       │       └── 111 -> vl<8>r<eew>.v
    │       ├── 01011 -> vlm.v
    │       └── 10000
    │           ├── 000 -> vle<eew>ff.v
    │           ├── 001 -> vlseg<2>e<eew>ff.v
    │           └── 111 -> vlseg<8>e<eew>ff.v
    ├── 01
    │   └── nf
    │       ├── 000 -> vluxei<eew>.v
    │       └── 111 -> vluxseg<8>ei<eew>.v
    ├── 10
    │   └── nf
    │       ├── 000 -> vlse<eew>.v
    │       └── 111 -> vlsseg<8>e<eew>.v
    └── 11
        └── nf
            ├── 000 -> vloxei<eew>.v
            └── 111 -> vloxseg<8>ei<eew>.v
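
The fields that drive this tree can be extracted as in the sketch below. The bit positions follow the vector load/store encoding in the spec (nf in bits 31:29, mew in bit 28, mop in bits 27:26, vm in bit 25, lumop in bits 24:20, width in bits 14:12). The names are hypothetical and are not the decode_mop, decode_nf or decode_eew helpers used in rv32emu.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the vector load/store encoding fields. */
typedef struct {
    uint8_t nf;    /* bits [31:29], number of fields minus one */
    uint8_t mew;   /* bit  [28] */
    uint8_t mop;   /* bits [27:26], addressing mode */
    uint8_t vm;    /* bit  [25] */
    uint8_t lumop; /* bits [24:20], unit-stride variant when mop == 00 */
    uint8_t width; /* bits [14:12] */
} vmem_fields_t;

static inline vmem_fields_t decode_vmem_fields(uint32_t insn)
{
    uint32_t opcode = insn & 0x7F;
    assert(opcode == 0x07 || opcode == 0x27); /* LOAD-FP or STORE-FP */
    vmem_fields_t f;
    f.nf = (insn >> 29) & 0x7;
    f.mew = (insn >> 28) & 0x1;
    f.mop = (insn >> 26) & 0x3;
    f.vm = (insn >> 25) & 0x1;
    f.lumop = (insn >> 20) & 0x1F;
    f.width = (insn >> 12) & 0x7;
    return f;
}

/* Map the width field to EEW in bits; 0 means a scalar FP width. */
static inline uint32_t width_to_eew(uint8_t width)
{
    switch (width) {
    case 0: return 8;
    case 5: return 16;
    case 6: return 32;
    case 7: return 64;
    default: return 0; /* 001..100 belong to scalar flh/flw/fld/flq */
    }
}
```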

To handle decoding efficiently, an ir list is established in decode.h as an enum structure. By following the sequence in the tree diagram, the instructions are mapped into the ir list, enabling the opcode of each instruction to be derived based on relative bit values. This approach minimizes branching and reduces overall program size.

Below is the code for decoding Vector Load instructions:

+static inline bool op_load_fp(rv_insn_t *ir, const uint32_t insn)
+{
+#if RV32_HAS(EXT_V)
+    /* Fixme: The implementation currently just uses switch statements, since
+     * there are many duplicated patterns in the vector load/store
+     * instructions. I'm hoping to build cleaner and more efficient code. */
+    /* inst nf mew mop vm   rs2/vs1  rs1   width vd  opcode
+     * ----+---+---+---+--+---------+-----+-----+---+--------
+     * VL*   nf mew mop vm    lumop  rs1   width vd  0000111
+     * VLS*  nf mew mop vm    rs2    rs1   width vd  0000111
+     * VLX*  nf mew mop vm    vs2    rs1   width vd  0000111
+     */
+    if (decode_funct3(insn) != 0b010) {
+        uint8_t eew = decode_eew(insn);
+        uint8_t nf = decode_nf(insn);
+        switch (decode_mop(insn)) {
+        case 0:
+            decode_VL(ir, insn);
+            /* check lumop */
+            switch (decode_24_20(insn)) {
+            case 0b00000:
+                if (!nf) {
+                    ir->opcode = rv_insn_vle8_v + eew;
+                } else {
+                    ir->opcode = rv_insn_vlseg2e8_v + 7 * eew + nf - 1;
+                }
+                break;
+            case 0b10000: /* lumop for fault-only-first, as in the tree above */
+                if (!nf) {
+                    ir->opcode = rv_insn_vle8ff_v + eew;
+                } else {
+                    ir->opcode = rv_insn_vlseg2e8ff_v + 7 * eew + nf - 1;
+                }
+                break;
+            default:
+                return false;
+            }
+            break;
+        case 1:
+            decode_VLX(ir, insn);
+            if (!nf) {
+                ir->opcode = rv_insn_vluxei8_v + eew;
+            } else {
+                ir->opcode = rv_insn_vluxseg2ei8_v + 7 * eew + nf - 1;
+            }
+            break;
+        case 2:
+            decode_VLS(ir, insn);
+            if (!nf) {
+                ir->opcode = rv_insn_vlse8_v + eew;
+            } else {
+                ir->opcode = rv_insn_vlsseg2e8_v + 7 * eew + nf - 1;
+            }
+            break;
+        case 3:
+            decode_VLX(ir, insn);
+            if (!nf) {
+                ir->opcode = rv_insn_vloxei8_v + eew;
+            } else {
+                ir->opcode = rv_insn_vloxseg2ei8_v + 7 * eew + nf - 1;
+            }
+            break;
+        default:
+            return false;
+        }
+        return true;
+    }
+
+#endif
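
The arithmetic on opcode values above only works if the enum entries are declared in a matching order. Here is a minimal, self-contained sketch of that idea for the unit-stride loads. The enum names and values are hypothetical and much shorter than the real list in decode.h; eew_idx is assumed to be 0..3 for EEW = 8, 16, 32, 64.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode list: plain loads first, then for each EEW the
 * seven segment variants (2 to 8 fields) in consecutive slots. */
enum {
    OPC_VLE8 = 0, OPC_VLE16, OPC_VLE32, OPC_VLE64,
    OPC_VLSEG2E8,
    OPC_VLSEG8E8 = OPC_VLSEG2E8 + 6,
    OPC_VLSEG2E16,
    OPC_VLSEG8E16 = OPC_VLSEG2E16 + 6,
    OPC_VLSEG2E32,
    OPC_VLSEG8E32 = OPC_VLSEG2E32 + 6,
    OPC_VLSEG2E64,
    OPC_VLSEG8E64 = OPC_VLSEG2E64 + 6,
};

/* Derive the opcode from the EEW index and the nf field with no
 * per-instruction branching: nf = 0 is a plain load, nf = 1..7 selects
 * the 2..8 field segment variant. */
static inline int unit_stride_opcode(uint32_t eew_idx, uint32_t nf)
{
    assert(eew_idx < 4 && nf < 8);
    if (nf == 0)
        return OPC_VLE8 + (int) eew_idx;
    return OPC_VLSEG2E8 + 7 * (int) eew_idx + (int) nf - 1;
}
```

The trade-off is that the enum order becomes part of the decoder's contract: inserting or reordering entries silently breaks the mapping.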

The Vector Store Instructions (opcode 0100111) are decoded with the same approach as the Vector Load Instructions.

Verification

Since RVV contains so many vector instructions, it is worth verifying the correctness of instruction decoding at the decode stage. Currently, I use a custom Python test script to validate RVV instruction decoding. The script automatically generates .s files containing vector instructions, which are then compiled into .elf files using GCC. I then run objdump and rv32emu on each file to produce .txt outputs and compare the decoded instructions and registers to ensure correctness. See the test case.

Note that, because the rules for immediate values in vector arithmetic instructions are complex and varied, the current test program does not yet cover the full range of immediate values.

Interpreter

The goal is to pass the vector-test, where the most frequently used vector instructions are vsetvli, vle<eew>.v, and vse<eew>.v. Priority will be given to implementing and optimizing these instructions.

Since these instructions appear so frequently, the fields they need are added directly to rv_insn_t:

+#if RV32_HAS(EXT_V)
+    int32_t zimm;
+
+    uint8_t vd, vs1, vs2, vs3, eew;
+    uint8_t vm;
+#endif
+

Configuration-Setting Instructions

One important point from the spec: the vl and vtype CSRs can only be updated by the vset{i}vl{i} instructions (fault-only-first loads may additionally modify vl). The current implementation does not include vma and vta functionality, as these features are not part of the vector-test.


When configuring vsetvli and vsetivli, specific rules apply to ensure proper settings.


If an invalid configuration is encountered, the vill bit is set to 1 and all other bits in vtype are cleared to 0. The maximum number of elements that can be operated on, VLMAX, is given by VLMAX = LMUL * VLEN / SEW and therefore depends on the current configuration. For the detailed constraints on setting vl, refer to section 6.3 of the spec, "Constraints on Setting vl". Additionally, an AVL of -1 (all ones) in rs1 sets vl to VLMAX directly.
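
One simple policy that satisfies the constraints in section 6.3 is vl = min(AVL, VLMAX). The sketch below shows that policy only; it is not the vl_setting macro used in this project, whose definition is not shown here, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pick vl for a requested application vector length (AVL).
 * The spec requires vl = AVL when AVL <= VLMAX and vl = VLMAX when
 * AVL >= 2 * VLMAX; in between, any value from ceil(AVL / 2) up to
 * VLMAX is allowed, so clamping to VLMAX is a valid choice. */
static inline uint32_t compute_vl(uint32_t avl, uint32_t vlmax)
{
    assert(vlmax > 0);
    return (avl < vlmax) ? avl : vlmax;
}
```

An AVL of 0xFFFFFFFF (that is, -1) therefore yields VLMAX, which is the shortcut mentioned above.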

The upper limit on VLEN allows software to know that indices will fit into 16 bits (largest VLMAX of 65,536 occurs for LMUL=8 and SEW=8 with VLEN=65,536).

The concept of lmul defines how vector registers are grouped. However, since vl already controls the number of sew-wide elements processed, and vl is bounded by vlmax, the primary role of lmul here is to determine vlmax. The fractional lmul values in the specification (1/2, 1/4, 1/8) do not need to be represented directly: vlmax can be computed with shifts alone, which avoids any fractional or floating-point arithmetic.

uint16_t vlmax = (v_lmul < 4)
                     ? ((1 << v_lmul) * vlen) >> (3 + v_sew)
                     : (vlen >> (3 + v_sew) >> (3 - (v_lmul - 5)));
RVOP(
    vsetvli,
    {
        uint8_t v_lmul = ir->zimm & 0b111;
        uint8_t v_sew = (ir->zimm >> 3) & 0b111;

        if (v_lmul == 4 || v_sew >= 4) {
            /* Illegal setting */
            rv->csr_vl = 0;
            rv->csr_vtype = 0x80000000;
            return true;
        }
        uint16_t vlmax = (v_lmul < 4)
                             ? ((1 << v_lmul) * vlen) >> (3 + v_sew)
                             : (vlen >> (3 + v_sew) >> (3 - (v_lmul - 5)));
        if (ir->rs1) {
            vl_setting(vlmax, rv->X[ir->rs1], rv->csr_vl);
            rv->csr_vtype = ir->zimm;
        } else {
            if (!ir->rd) {
                rv->csr_vtype = ir->zimm;
            } else {
                rv->csr_vl = vlmax;
                rv->csr_vtype = ir->zimm;
            }
        }
        rv->X[ir->rd] = rv->csr_vl; /* per the spec, rd receives the new vl */
    },
    GEN({/* no operation */}))

vsetvl and vsetivli are implemented in the same fashion.

zimm represents the configuration value for vtype.

Vector Loads/Stores

The eew in load instructions is closely related to the sew specified by vset{i}vl{i}, with three possible scenarios:

  1. eew > sew: This is considered an illegal operation.
  2. eew < sew: Data will be loaded based on the size specified by eew.
  3. eew == sew: The load operation will proceed based on the allocated size.

Using eew == sew is recommended, but operations with eew < sew are also valid.

Another illegal case occurs when lmul > 1 and the register group extends past vector register 31. While such an instruction passes GCC, it results in errors when run on Spike, so an assert is inserted to handle this case.
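
A stricter check than the assert is possible, because the spec also reserves register groups whose base register number is not a multiple of the group size. The following is a sketch of such a check under that rule; the function name is hypothetical, and for loads and stores the group size would be derived from EMUL rather than LMUL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Check whether `vreg` is a legal base for a register group.
 * vlmul uses the vtype encoding: 0..3 mean LMUL = 1, 2, 4, 8 and
 * 5..7 mean fractional LMUL, which occupies a single register. */
static inline bool vreg_group_ok(uint32_t vreg, uint32_t vlmul)
{
    assert(vreg < 32);
    assert(vlmul < 8 && vlmul != 4); /* encoding 4 is reserved */
    uint32_t nregs = (vlmul < 4) ? (1u << vlmul) : 1u;
    /* The base must be aligned to the group size and the whole group
     * must fit inside v0..v31. */
    return (vreg % nregs) == 0 && vreg + nregs <= 32;
}
```

With LMUL = 8, for example, only v0, v8, v16 and v24 are valid bases, so a group starting at v28 is rejected before any register past v31 can be touched.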

The implementation adopts a loop-unrolling style to handle word lengths when eew = 8 or 16. The variable j keeps track of the current lmul, indicating the specific vector register being operated on, represented as v * n + j. The variable i corresponds to the current array index being processed (e.g., for vlen = 128, i will take values 0, 1, 2, 3). Additionally, the variable cnt serves as an iterator referencing csr_vl.

RVOP(
    vle8_v,
    {
        uint8_t sew = 8 << ((rv->csr_vtype >> 3) & 0b111);
        uint32_t addr = rv->X[ir->rs1];

        if (ir->eew > sew) {
            /* Illegal */
            rv->csr_vtype = 0x80000000;
            rv->csr_vl = 0;
            return true;
        } else {
            uint8_t i = 0;
            uint8_t j = 0;
            /* Handles a word at every loop. */
            for (uint32_t cnt = 0; rv->csr_vl - cnt >= 4;) {
                i %= LEN;
                /* Accessing a vector register above 31 is illegal. */
                assert(ir->vd + j < 32);
                /* Clear the destination word before OR-ing in the bytes. */
                rv->V[ir->vd + j][i] = 0;
                rv->V[ir->vd + j][i] |= rv->io.mem_read_b(rv, addr);
                rv->V[ir->vd + j][i] |= rv->io.mem_read_b(rv, addr + 1) << 8;
                rv->V[ir->vd + j][i] |= rv->io.mem_read_b(rv, addr + 2) << 16;
                rv->V[ir->vd + j][i] |= rv->io.mem_read_b(rv, addr + 3) << 24;
                cnt += 4;
                i++;

                /* Move to next vector register. */
                if (!(cnt & ((LEN << 2) - 1))) {
                    j++;
                    i = 0;
                }
                addr += 4;
            }
            /* vl is not a multiple of 4: clear the low bytes that the tail
             * loop below will fill. */
            if (rv->csr_vl % 4) {
                rv->V[ir->vd + j][i] &= 0xFFFFFFFF << ((rv->csr_vl % 4) << 3);
            }
            /* Handles data that is narrower than a word. */
            for (uint32_t cnt = 0; cnt < (rv->csr_vl % 4); cnt++) {
                assert(ir->vd + j < 32); /* Illegal */
                rv->V[ir->vd + j][i] |= rv->io.mem_read_b(rv, addr + cnt)
                                        << (cnt << 3);
            }
        }
    },
    GEN({/* no operation */}))

All other vector unit-stride instruction implementations are handled in a similar fashion, following the same loop-unrolling approach.

Vector Masking

This implementation uses a "mask-undisturbed" approach in which the vector register v0 stores the mask. In a vlen = 128 configuration, v0 can hold a mask for 128 elements. Note that vector instructions other than vset{i}vl{i} carry a vm bit that controls whether this mask is active.

vop.v*    v1, v2, v3, v0.t      # enabled where v0.mask[i]=1, vm=0
vop.v*    v1, v2, v3            # unmasked vector operation, vm=1

Another point to pay attention to is:

The destination vector register group for a masked vector instruction cannot overlap the source mask register (v0), unless the destination vector register is being written with a mask value (e.g., compares) or the scalar result of a reduction. These instruction encodings are reserved.

A quick demonstration is:

// vadd.vi v16, v11, 5, v0.t
// Initial register contents (one word shown as an example):
// v16: 0xdddddddd
// v11: 0x01010101
// v0:  0xaaaaaaaa
// sew = 8, lmul = vlmax
------------------------
v16 | dd | dd | dd | dd
------------------------
v11 |  1 |  1 |  1 |  1
------------------------
mask|  0 |  1 |  0 |  1
------------------------
vadd| dd |  6 | dd |  6
------------------------

In the interpreter implementation, we first check the ir->vm bit and then the corresponding bit in v0 to decide, for each element, whether to write the sew-wide result computed from op1 and op2 or to keep the existing sew-wide element of the destination vector register.
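
The per-element decision can be sketched as follows, assuming the mask bit for element i is bit i of v0, as in the RVV 1.0 mask layout. The helper name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t vector128_t[4]; /* same representation as in the text */

/* Return true if element `idx` should be written by the instruction.
 * vm = 1 means the instruction is unmasked; otherwise bit `idx` of v0
 * decides, and inactive elements keep their old value. */
static inline bool elem_active(const vector128_t v0, uint32_t idx, uint32_t vm)
{
    assert(idx < 128);
    if (vm)
        return true;
    return (v0[idx / 32] >> (idx % 32)) & 1u;
}
```

With v0 = 0xaaaaaaaa in its lowest word, the odd-numbered elements are active and the even-numbered ones are left undisturbed.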

Vector Arithmetic Instructions

Since the RVV instruction set includes various instructions, my approach is to define macros to simplify the code-writing process. The design flow is outlined as follows:

[Figure: flowchart of the macro design flow]

OPT(vd, vs2, vs1/imm/rs1, operation, operation type)

To handle every sew with XLEN=32, each element width has to be executed separately.

#define OPT(des, op1, op2, op, op_type) {                                    \
    switch (8 << ((rv->csr_vtype >> 3) & 0b111)) {                           \
    case 8:                                                                  \
        sew_8b_handler(des, op1, op2, op, op_type);                          \
        break;                                                               \
    case 16:                                                                 \
        sew_16b_handler(des, op1, op2, op, op_type);                         \
        break;                                                               \
    case 32:                                                                 \
        sew_32b_handler(des, op1, op2, op, op_type);                         \
        break;                                                               \
    default:                                                                 \
        break;                                                               \
    }                                                                        \
}

Next, we retrieve csr_vl, which determines how many sew units need to be executed. Additionally, when lmul > 1 produces a register group, we may need to access the next vector register. The distance to the required vector register is recorded in __j. Since the vector registers are defined as arrays of several 32-bit elements, __i keeps track of the array index.

Note that an assert is inserted to catch the case where lmul > 1 causes an access to a vector register above 31.

Two macros will be used to handle different vl cases. If the length of sew multiplied by the remaining vl exceeds one word length, we use V*_LOOP to process one word at a time. If the product of the remaining vl and sew is less than one word, V*_LOOP_LEFT is used instead. Note that in V*_LOOP_LEFT, it is necessary to clear bits in the partial word to correctly update the computed result.

#define VI_LOOP(des, op1, op2, op, SHIFT, MASK, i, j, itr, vm)              \
    uint32_t tmp_1 = rv->V[op1 + j][i];                                     \
    uint32_t tmp_d = rv->V[des + j][i];                                     \
    uint32_t ans = 0;                                                       \
    rv->V[des + j][i] = 0;                                                  \
    for (uint8_t ___cnt = 0; ___cnt < itr; ___cnt++) {                      \
        if (ir->vm) {                                                       \
            ans = ((((tmp_1 >> (___cnt << (SHIFT))) op (op2)) & (MASK))     \
                   << (___cnt << (SHIFT)));                                 \
        } else {                                                            \
            ans = (vm & (0x1 << ___cnt))                                    \
                      ? ((((tmp_1 >> (___cnt << (SHIFT))) op (op2)) & (MASK))\
                         << (___cnt << (SHIFT)))                            \
                      : (tmp_d & (MASK << (___cnt << (SHIFT))));            \
        }                                                                   \
        rv->V[des + j][i] += ans;                                           \
    }

#define VI_LOOP_LEFT(des, op1, op2, op, SHIFT, MASK, i, j, itr, vm)         \
    uint32_t tmp_1 = rv->V[op1 + j][i];                                     \
    uint32_t tmp_d = rv->V[des + j][i];                                     \
    if (rv->csr_vl % itr) {                                                \
        rv->V[des + j][i] &=                                               \
            (0xFFFFFFFF << ((rv->csr_vl % itr) << SHIFT));                  \
    }                                                                       \
    uint32_t ans = 0;                                                       \
    for (uint8_t __cnt = 0; __cnt < (rv->csr_vl % itr); __cnt++) {          \
        assert((des + j) < 32);                                             \
        if (ir->vm) {                                                       \
            ans = ((((tmp_1 >> (__cnt << (SHIFT))) op (op2)) & (MASK))      \
                   << (__cnt << (SHIFT)));                                  \
        } else {                                                            \
            ans = (vm & (0x1 << __cnt))                                     \
                      ? ((((tmp_1 >> (__cnt << (SHIFT))) op (op2)) & (MASK))\
                         << (__cnt << (SHIFT)))                             \
                      : (tmp_d & (MASK << (__cnt << (SHIFT))));             \
        }                                                                   \
        rv->V[des + j][i] += ans;                                           \
    }

V*_LOOP steps:

  1. Copy the operand op1 and op2.
  2. Shift op1 (and op2) right so that the current element is aligned to bit 0.
  3. Perform the specified operation between op1 and op2.
  4. Mask the result to the corresponding sew.
  5. Shift the result left to align with the corresponding position.
  6. Accumulate the result into vd.
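
The steps above can be written as a plain function for a single 32-bit word, which may be easier to follow than the macro form. This is a sketch for addition with a scalar or immediate second operand only; the function name and parameters are hypothetical and it is not the VI_LOOP macro itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Add `op2` to every sew-wide element of `src` within one 32-bit word.
 * `mask_bits` holds one v0 bit per element of this word (bit 0 for the
 * lowest element); inactive elements keep the value from `dst_old`. */
static inline uint32_t word_add_masked(uint32_t src, uint32_t dst_old,
                                       uint32_t op2, uint32_t sew,
                                       uint32_t mask_bits, bool vm)
{
    assert(sew == 8 || sew == 16 || sew == 32);
    uint32_t elem_mask = (sew == 32) ? 0xFFFFFFFFu : ((1u << sew) - 1);
    uint32_t out = 0;
    for (uint32_t k = 0; k < 32 / sew; k++) {
        uint32_t shift = k * sew;
        uint32_t lane;
        if (vm || ((mask_bits >> k) & 1u)) {
            /* Steps 2 to 5: align right, operate, mask to sew, and
             * shift back into position. */
            lane = (((src >> shift) + op2) & elem_mask) << shift;
        } else {
            /* Mask-undisturbed: keep the old destination element. */
            lane = dst_old & (elem_mask << shift);
        }
        out |= lane; /* Step 6: accumulate into the destination word. */
    }
    return out;
}
```

Masking after the addition discards the carry out of each element, so neighbouring elements never affect each other.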

Verification with riscv-vector-tests

This part is quite annoying. It took me tons of time to set up correctly.

Setup

To generate the assembly and binary files for the vector tests, check the Prerequisite section of riscv-vector-tests: a specific version of Spike is required. Remember to reset to that version.

# riscv-gnu-toolchain
$ git clone https://github.com/riscv-collab/riscv-gnu-toolchain
$ cd riscv-gnu-toolchain
$ ./configure --prefix=/opt/riscv/ --with-arch=rv32gcv --with-abi=ilp32
$ make -j16

# riscv-isa-sim
$ git clone https://github.com/riscv-software-src/riscv-isa-sim.git
$ cd riscv-isa-sim
$ git reset --hard 91793ed7d964aa0031c5a9a31fa05ec3d11b3b0f
$ mkdir build
$ cd build
$ ../configure --prefix=$RISCV
$ make -j16
$ sudo make install

# riscv-pk
$ git clone https://github.com/riscv-software-src/riscv-pk.git
$ cd riscv-pk
$ mkdir build
$ cd build
$ ../configure --prefix=$RISCV --host=riscv64-unknown-elf
$ make -j16
$ sudo make install

Get ready for rv32emu vector test:

$ cd tests
$ git clone https://github.com/chipsalliance/riscv-vector-tests.git
$ cd riscv-vector-tests
$ make all -j16 --environment-overrides VLEN=128 XLEN=32

You might encounter an issue such as:

riscv_test.h: No such file or directory
5 | #include "riscv_test.h"
| ^~~~~~~~~~~~~~

Check #52.

We can use riscv32-unknown-elf-objdump to check the .elf in tests/riscv-vector-tests/out/v128x32machine/bin/stage2/vadd_vi-0 for example.

8001eed4 <fail>:
8001eed4:	0ff0000f          	fence
8001eed8:	00018063          	beqz	gp,8001eed8 <fail+0x4>
8001eedc:	0186                	slli	gp,gp,0x1
8001eede:	0011e193          	ori	gp,gp,1
8001eee2:	05d00893          	li	a7,93
8001eee6:	850e                	mv	a0,gp
8001eee8:	00000073          	ecall

8001eeec <pass>:
8001eeec:	0ff0000f          	fence
8001eef0:	4185                	li	gp,1
8001eef2:	05d00893          	li	a7,93
8001eef6:	4501                	li	a0,0
8001eef8:	00000073          	ecall

According to the instructions above, vadd_vi-0 will set a0 to zero if all tests pass. Recall that in rv32emu:

#ifdef __EMSCRIPTEN__
    if (rv_has_halted(rv)) {
        printf("inferior exit code %d\n", attr->exit_code);
        emscripten_cancel_main_loop();
        rv_delete(rv); /* clean up and reuse memory */
    }
#endif

If the command build/rv32emu tests/riscv-vector-tests/out/v128x32machine/bin/stage2/vadd_vi-0 passes the test successfully, the expected output will be inferior exit code 0. If the test fails, a non-zero exit code will be printed instead.
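
A failing exit code also carries information. In the fail path shown above, gp is shifted left by one and OR-ed with 1 before being moved into a0. Assuming gp holds the number of the test case being run, as in the riscv-tests convention, the failing case can be recovered from the exit code. The helper below is a hypothetical sketch of that decoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Interpret the exit code produced by the pass/fail stubs above.
 * Returns true on pass. On failure, *test_num receives the value gp
 * had before the stub computed (gp << 1) | 1. */
static inline bool decode_exit_code(uint32_t exit_code, uint32_t *test_num)
{
    assert(test_num != NULL);
    if (exit_code == 0) {
        *test_num = 0;
        return true;
    }
    *test_num = exit_code >> 1; /* undo the slli/ori pair */
    return false;
}
```

For instance, an exit code of 11 would point at test case 5 under this assumption.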

A Python script can be written to quickly test all items in the stage2/ directory, skipping vector floating-point instructions (files starting with vf) as floating-point vector registers are not implemented.

...
vxor_vx-0...............................pass
vxor_vx-1...............................pass
vxor_vx-2...............................pass
vxor_vx-3...............................pass
vxor_vx-4...............................pass
vzext_vf2-0.............................fail
vzext_vf4-0.............................fail
test:71 / 1247

Only 71 out of 1247 test cases passed, mostly involving single-width arithmetic instructions. When the interpreter was implemented, the behavior of instructions turned out to be far more diverse than expected. Refactoring the implementation with more customized instruction behaviors is necessary.

Note that some of the test cases generated by make all cover vector instructions that are not included in the RVV specification, as they come from different branches. These should be excluded as well.

Conclusion

Building the RVV emulator has deepened my understanding of RVV. Starting with the specification, I calculated the instruction set one by one and completed a functional, albeit imperfect, RVV decoder (585/616). This progress allowed me to advance to the Interpreter stage. While the Interpreter results are not ideal, I have successfully handled different sew, lmul, and vector masking based on the current design. Moving forward, refactoring the operation-related parts of the code will enable support for more RVV instructions.

Reference

The RISC-V Instruction Set Manual Volume I: Unprivileged ISA
riscv-v-spec