# Implement Vector extension for rv32emu
> 黃士昕

[vestata/rv32emu:vector](https://github.com/vestata/rv32emu/tree/vector)

## Goal
Implement RVV instruction decoding and an interpreter on the latest rv32emu codebase (ensuring it's rebased). The implementation must ultimately pass the [riscv-vector-tests](https://github.com/chipsalliance/riscv-vector-tests.git).

## Environment
```
$ riscv64-unknown-elf-gcc --version
riscv64-unknown-elf-gcc (g04696df09) 14.2.0
```
```
$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
```
```
$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   24
  On-line CPU(s) list:    0-23
Vendor ID:                GenuineIntel
  Model name:             13th Gen Intel(R) Core(TM) i7-13700
    CPU family:           6
    Model:                183
    Thread(s) per core:   2
    Core(s) per socket:   16
    Socket(s):            1
    Stepping:             1
    CPU(s) scaling MHz:   20%
    CPU max MHz:          5200.0000
    CPU min MHz:          800.0000
    BogoMIPS:             4224.00
```
### [riscv-gnu-toolchain]
### [riscv-isa-sim]
### [riscv-pk]

## Implementation
> The vector extension adds 32 vector registers, and seven unprivileged CSRs (vstart, vxsat, vxrm, vcsr, vtype, vl, vlenb) to a base scalar RISC-V ISA.

![image](https://hackmd.io/_uploads/HyIq97VDkl.png)

A vector register must have a length (`VLEN`) that is a power of two and greater than or equal to `ELEN`. To emulate one, we use an array of `uint32_t`. This implementation uses a `VLEN` of 128 bits, but the code is designed to scale to other lengths.
```diff
 #if RV32_HAS(EXT_F)
 typedef softfloat_float32_t riscv_float_t;
 #endif

+#if RV32_HAS(EXT_V)
+/* FIXME: VLEN is temporarily fixed at 128 bits */
+typedef uint32_t vector128_t[4];
+#endif
 /* memory read handlers */
 typedef riscv_word_t (*riscv_mem_ifetch)(riscv_t *rv, riscv_word_t addr);
```

![image](https://hackmd.io/_uploads/Sy2yxN4Dke.png)

The `vtype` and `vl` CSRs play a crucial role in RVV, as `vsew` and `vlmul` can change dynamically during emulation.
With the equation `vlmax = lmul * vlen / sew`, where `vlmax` is the maximum `vl` an instruction can handle in `sew` units, these parameters are essential for controlling vector operations.

## Decode
The [RISC-V Vector Extension version 1.0 specification](https://github.com/riscvarchive/riscv-v-spec) does not explicitly state how many instructions the extension contains. Based on my own analysis of the specification, I have identified a total of 616 instructions. They fall into three main groups: Configuration-Setting Instructions, Vector Loads/Stores, and Vector Arithmetic Instructions. The Vector Arithmetic Instructions can be further divided into six categories based on their functionality: Vector Integer Arithmetic Instructions, Vector Fixed-Point Arithmetic Instructions, Vector Floating-Point Instructions, Vector Reduction Operations, Vector Mask Instructions, and Vector Permutation Instructions.

### Configuration-Setting Instructions
The Configuration-Setting Instructions and Vector Arithmetic Instructions share the same opcode, `1010111`, while the Vector Load Instructions use the same opcode as LOAD-FP, `0000111`, and the Vector Store Instructions use the same opcode as STORE-FP, `0100111`.

In the first step of decoding, a jump table can be used to classify instructions based on their opcode.
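
As a rough illustration of that first step, the sketch below extracts the values a decoder would dispatch on from a 32-bit instruction word. It assumes the jump-table index is taken from bits 6:2 of the instruction (the two lowest bits are always `11` for 32-bit encodings). The helper names `opcode_index`, `funct3_of`, and `funct6_of` are hypothetical and are not taken from rv32emu.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: jump-table index = opcode bits 6:2. */
static inline uint32_t opcode_index(uint32_t insn)
{
    assert((insn & 0x3) == 0x3); /* 32-bit encodings end in 0b11 */
    return (insn >> 2) & 0x1F;
}

/* Hypothetical helper: funct3 lives in bits 14:12. */
static inline uint32_t funct3_of(uint32_t insn)
{
    return (insn >> 12) & 0x7;
}

/* Hypothetical helper: funct6 of a vector arithmetic instruction, bits 31:26. */
static inline uint32_t funct6_of(uint32_t insn)
{
    return (insn >> 26) & 0x3F;
}
```

For example, the hand-encoded word `0x022180D7` (`vadd.vv v1, v2, v3`, unmasked) has opcode `1010111`, so `opcode_index` yields `0b10101`, and both `funct3_of` and `funct6_of` yield 0.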
Initially, instructions with the opcode `1010111` (jump-table index `0b10101`) can be categorized:
```diff
+#if RV32_HAS(EXT_V)
+    /* Handle vector operations */
+    if (index == 0b10101) {
+        /* Since vcfg and vop uses the same opcode */
+        if (decode_funct3(insn) == 0b111) {
+            const decode_t op = rv_jump_table[index];
+            return op(ir, insn);
+        }
+        const uint32_t v_index = (insn >> 26) & 0x3F;
+        const decode_t op = rvv_jump_table[v_index];
+        return op(ir, insn);
+        /* Fixme:VMEM */
+    }
+#endif
+
```
By extending the original jump table to include `OP(vcfg)`, the decoding of Configuration-Setting Instructions is handled by the existing dispatch.
```diff
 static const decode_t rv_jump_table[] = {
 //  000         001           010        011           100         101        110        111
     OP(load),   OP(load_fp),  OP(unimp), OP(misc_mem), OP(op_imm), OP(auipc), OP(unimp), OP(unimp), // 00
     OP(store),  OP(store_fp), OP(unimp), OP(amo),      OP(op),     OP(lui),   OP(unimp), OP(unimp), // 01
-    OP(madd),   OP(msub),     OP(nmsub), OP(nmadd),    OP(op_fp),  OP(unimp), OP(unimp), OP(unimp), // 10
+    OP(madd),   OP(msub),     OP(nmsub), OP(nmadd),    OP(op_fp),  OP(vcfg),  OP(unimp), OP(unimp), // 10
     OP(branch), OP(jalr),     OP(unimp), OP(jal),      OP(system), OP(unimp), OP(unimp), OP(unimp), // 11
 };
```

### Vector Arithmetic Instructions
The Vector Arithmetic Instructions category contains the largest number of instructions, with a total of 305 types. Based on the [instruction table](https://github.com/riscvarchive/riscv-v-spec/blob/master/inst-table.adoc), decoding is done with the 6-bit `funct6` and the 3-bit `funct3` fields. I have created an `rvv_jump_table`, which works like the existing `rv_jump_table`. Since the same `funct6` can cover instructions from the `OPI`, `OPM`, and `OPF` categories, each handler is named directly after the binary value of its `funct6`, such as `op_111111`.
```diff
+#if RV32_HAS(EXT_V)
+    /* RVV vector opcode map */
+    static const decode_t rvv_jump_table[] = {
+    /* According to https://github.com/riscvarchive/riscv-v-spec/blob/master/inst-table.adoc
+     * this table is indexed by funct6.
+     */
+    //  000         001         010         011         100         101         110         111
+        OP(000000), OP(000001), OP(000010), OP(000011), OP(000100), OP(000101), OP(000110), OP(000111), // 000
+        OP(001000), OP(001001), OP(001010), OP(001011), OP(001100), OP(unimp),  OP(001110), OP(001111), // 001
+        OP(010000), OP(010001), OP(010010), OP(010011), OP(010100), OP(unimp),  OP(unimp),  OP(010111), // 010
+        OP(011000), OP(011001), OP(011010), OP(011011), OP(011100), OP(011101), OP(011110), OP(011111), // 011
+        OP(100000), OP(100001), OP(100010), OP(100011), OP(100100), OP(100101), OP(100110), OP(100111), // 100
+        OP(101000), OP(101001), OP(101010), OP(101011), OP(101100), OP(101101), OP(101110), OP(101111), // 101
+        OP(110000), OP(110001), OP(110010), OP(110011), OP(110100), OP(110101), OP(110110), OP(110111), // 110
+        OP(111000), OP(unimp),  OP(111010), OP(111011), OP(111100), OP(111101), OP(111110), OP(111111)  // 111
+    };
```
Here is a demonstration of how the decoding process works. Other instructions within the Vector Arithmetic Instructions category follow the same decoding pattern.
```diff
+static inline bool op_101001(rv_insn_t *ir, const uint32_t insn)
+{
+    switch (decode_funct3(insn)) {
+    case 0:
+        decode_vvtype(ir, insn);
+        ir->opcode = rv_insn_vsra_vv;
+        break;
+    case 1:
+        decode_vvtype(ir, insn);
+        ir->opcode = rv_insn_vfnmadd_vv;
+        break;
+    case 2:
+        decode_vvtype(ir, insn);
+        ir->opcode = rv_insn_vmadd_vv;
+        break;
+    case 3:
+        decode_vitype(ir, insn);
+        ir->opcode = rv_insn_vsra_vi;
+        break;
+    case 4:
+        decode_vxtype(ir, insn);
+        ir->opcode = rv_insn_vsra_vx;
+        break;
+    case 5:
+        decode_vxtype(ir, insn);
+        ir->opcode = rv_insn_vfnmadd_vf;
+        break;
+    case 6:
+        decode_vxtype(ir, insn);
+        ir->opcode = rv_insn_vmadd_vx;
+        break;
+    default: /* illegal instruction */
+        return false;
+    }
+    return true;
+}
```

### Vector Loads/Stores
> Vector loads and stores are encoded within the scalar floating-point load and store major opcodes (LOAD-FP/STORE-FP).
![HJmQ4dcrke](https://hackmd.io/_uploads/ByyZuENDkl.png)
![SJNV4OcHJg](https://hackmd.io/_uploads/ByPWuE4vJg.png)

Since Vector Load and LOAD-FP share the same opcode, decoding must rely on `funct3` (the `width` field). The scalar LOAD-FP instruction `flw` is an I-type instruction identified by the `funct3` value `010`; any other value is decoded as a vector load.

#### Load
The Vector Load instructions consist of 177 types, making them highly complex. However, they can be systematically decoded by following the hierarchy of mop, lumop, nf, and eew. Below is a tree diagram categorizing the Vector Load instructions:
```
vlxxx<nf>e<i><eew><ff>_v
└── mop
    ├── 00
    │   └── lumop
    │       ├── 00000
    │       │   └── nf
    │       │       ├── 000 -> vle<eew>.v
    │       │       ├── 001 -> vlseg<2>e<eew>.v
    │       │       └── 111 -> vlseg<8>e<eew>.v
    │       ├── 01000
    │       │   └── nf
    │       │       ├── 000 -> vl<1>r<eew>.v
    │       │       └── 111 -> vl<8>r<eew>.v
    │       ├── 01011 -> vlm.v
    │       └── 10000
    │           ├── 000 -> vle<eew>ff.v
    │           ├── 001 -> vlseg<2>e<eew>ff.v
    │           └── 111 -> vlseg<8>e<eew>ff.v
    ├── 01
    │   └── nf
    │       ├── 000 -> vluxei<eew>.v
    │       └── 111 -> vluxseg<8>ei<eew>.v
    ├── 10
    │   └── nf
    │       ├── 000 -> vlse<eew>.v
    │       └── 111 -> vlsseg<8>e<eew>.v
    └── 11
        └── nf
            ├── 000 -> vloxei<eew>.v
            └── 111 -> vloxseg<8>ei<eew>.v
```
To handle decoding efficiently, an `ir` list is established in `decode.h` as an enum. The instructions are placed in this list in the same order as the tree diagram, so the opcode of each instruction can be derived from a base entry plus an offset computed from the `eew` and `nf` bit fields. This approach minimizes branching and reduces overall program size. Below is the code for decoding Vector Load instructions:
```diff
+static inline bool op_load_fp(rv_insn_t *ir, const uint32_t insn)
+{
+#if RV32_HAS(EXT_V)
+    /* Fixme: The implementation now is just using switch statement, since there
+     * are multiple duplicate elements in vector load/store instruction. I'm
+     * hoping to build clean and efficient code.
+     */
+    /*  inst  nf  mew  mop  vm  rs2/vs1  rs1  width  vd  opcode
+     * ----+---+---+---+--+---------+-----+-----+---+--------
+     * VL*   nf  mew  mop  vm  lumop    rs1  width  vd  0000111
+     * VLS*  nf  mew  mop  vm  rs2      rs1  width  vd  0000111
+     * VLX*  nf  mew  mop  vm  vs2      rs1  width  vd  0000111
+     */
+    if (decode_funct3(insn) != 0b010) {
+        uint8_t eew = decode_eew(insn);
+        uint8_t nf = decode_nf(insn);
+        switch (decode_mop(insn)) {
+        case 0:
+            decode_VL(ir, insn);
+            /* check lumop */
+            switch (decode_24_20(insn)) {
+            case 0b00000: /* unit-stride */
+                if (!nf) {
+                    ir->opcode = rv_insn_vle8_v + eew;
+                } else {
+                    ir->opcode = rv_insn_vlseg2e8_v + 7 * eew + nf - 1;
+                }
+                break;
+            case 0b10000: /* unit-stride fault-only-first (see tree above) */
+                if (!nf) {
+                    ir->opcode = rv_insn_vle8ff_v + eew;
+                } else {
+                    ir->opcode = rv_insn_vlseg2e8ff_v + 7 * eew + nf - 1;
+                }
+                break;
+            default:
+                return false;
+            }
+            break;
+        case 1:
+            decode_VLX(ir, insn);
+            if (!nf) {
+                ir->opcode = rv_insn_vluxei8_v + eew;
+            } else {
+                ir->opcode = rv_insn_vluxseg2ei8_v + 7 * eew + nf - 1;
+            }
+            break;
+        case 2:
+            decode_VLS(ir, insn);
+            if (!nf) {
+                ir->opcode = rv_insn_vlse8_v + eew;
+            } else {
+                ir->opcode = rv_insn_vlsseg2e8_v + 7 * eew + nf - 1;
+            }
+            break;
+        case 3:
+            decode_VLX(ir, insn);
+            if (!nf) {
+                ir->opcode = rv_insn_vloxei8_v + eew;
+            } else {
+                ir->opcode = rv_insn_vloxseg2ei8_v + 7 * eew + nf - 1;
+            }
+            break;
+        default:
+            return false;
+        }
+        return true;
+    }
+
+#endif
```
The Vector Store Instructions, which use opcode `0100111`, are decoded following the same approach as the Vector Load Instructions.

### Verification
Since RVV contains so many vector instructions, it is worth verifying the correctness of instruction decoding at the decode stage. Currently, I use a custom Python test script to validate RVV instruction decoding. This script automatically generates `.s` files containing vector instructions, which are then compiled into `.elf` files using GCC.
Subsequently, I use `objdump` and `rv32emu` to generate `.txt` output files and compare the decoded instructions and registers to ensure correctness.

[test case](https://github.com/vestata/rv32emu/tree/test/mytest)

Note that, due to the complexity and variety of rules for immediate values in vector arithmetic instructions, the current testing program does not yet cover all possible ranges of immediate values.

## Interpreter
The goal is to pass the [vector-test](https://github.com/chipsalliance/riscv-vector-tests.git), where the most frequently used vector instructions are `vsetvli`, `vle<eew>.v`, and `vse<eew>.v`. Priority will be given to implementing and optimizing these instructions.

Since these instructions appear so frequently, the operand fields they need are added directly to `rv_insn_t`:
```diff
+#if RV32_HAS(EXT_V)
+    int32_t zimm;
+
+    uint8_t vd, vs1, vs2, vs3, eew;
+    uint8_t vm;
+#endif
+
```

### Configuration-Setting Instructions
There is some important information in the [spec](https://drive.google.com/file/d/1uviu1nH-tScFfgrovvFCrj7Omv8tFtkp/view?usp=drive_link) that you should be aware of:

The `vtype` and `vl` CSRs cannot be written directly; they are updated through the `vset{i}vl{i}` instructions. The current implementation does not include `vma` and `vta` functionality, as these features are not part of the [vector-test](https://github.com/chipsalliance/riscv-vector-tests.git).

![Bk5TWra8Jx](https://hackmd.io/_uploads/ByyM24Nw1x.png)
![Bk5TWra8Jx](https://hackmd.io/_uploads/BybznNEw1l.png)
![HkycZH6Lkl](https://hackmd.io/_uploads/SJQM344wJg.png)
![rkzsZBTL1l](https://hackmd.io/_uploads/H14G2VVPJg.png)

When configuring `vsetvli` and `vsetivli`, specific rules apply to ensure proper settings.

![SJakUHpL1g](https://hackmd.io/_uploads/H1Zjh4Ewkl.png)

If an invalid configuration is encountered, the `vill` bit is set to 1, and all other bits in `vtype` are cleared to 0.
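
The `vtype` layout shown in the figures above can be unpacked with a few shifts and masks. The following is a minimal sketch for RV32, assuming the field positions from the specification (`vlmul` in bits 2:0, `vsew` in bits 5:3, `vta` in bit 6, `vma` in bit 7, `vill` in bit 31). The type `vtype_fields_t` and the function names are hypothetical and do not come from rv32emu.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of vtype on RV32. */
typedef struct {
    uint8_t vlmul; /* bits 2:0, raw encoding */
    uint8_t vsew;  /* bits 5:3, raw encoding */
    bool vta;      /* bit 6 */
    bool vma;      /* bit 7 */
    bool vill;     /* bit 31 (XLEN-1) */
} vtype_fields_t;

static inline vtype_fields_t decode_vtype(uint32_t vtype)
{
    vtype_fields_t f = {
        .vlmul = vtype & 0x7,
        .vsew = (vtype >> 3) & 0x7,
        .vta = (vtype >> 6) & 0x1,
        .vma = (vtype >> 7) & 0x1,
        .vill = (vtype >> 31) & 0x1,
    };
    return f;
}

/* SEW in bits; only meaningful when the configuration is legal. */
static inline uint32_t sew_bits(vtype_fields_t f)
{
    assert(!f.vill);
    return 8u << f.vsew;
}

/* Value stored in vtype for an unsupported configuration:
 * vill = 1 and every other bit 0. */
static inline uint32_t vtype_illegal(void)
{
    return UINT32_C(1) << 31;
}
```

For instance, `decode_vtype(0x51)` describes `SEW = 32`, `LMUL = 2`, tail-agnostic, mask-undisturbed, and `vtype_illegal()` is the same `0x80000000` constant the interpreter code writes on an illegal setting.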
The maximum number of elements that can be operated on, denoted as `VLMAX`, is determined by the formula `VLMAX = LMUL * VLEN / SEW`, which depends on the current configuration. For detailed constraints on setting `vl`, refer to [section 6.3, Constraints on Setting vl](https://github.com/riscvarchive/riscv-v-spec/blob/master/v-spec.adoc#constraints-on-setting-vl). Additionally, passing an AVL of `-1` (all ones, the largest unsigned value) sets `vl` to `VLMAX` directly.
> The upper limit on `VLEN` allows software to know that indices will fit into 16 bits (largest `VLMAX` of 65,536 occurs for `LMUL`=8 and `SEW`=8 with `VLEN`=65,536)

The concept of `lmul` defines how groups of vector registers are utilized. However, since `vl` already controls the number of elements processed in terms of `sew`, and `vl` is set based on `vlmax`, the primary purpose of `lmul` here is to determine `vlmax`.

`vlmax` can be computed with shifts alone, working on the raw `vlmul` encoding instead of the fractional `lmul` values (1/2, 1/4, 1/8) mentioned in the specification. It is less straightforward to read, but it avoids floating-point calculations for fractional `lmul`.
```c
uint16_t vlmax = (v_lmul < 4)
                     ? ((1 << v_lmul) * vlen) >> (3 + v_sew)
                     : (vlen >> (3 + v_sew) >> (3 - (v_lmul - 5)));
```

```c
RVOP(
    vsetvli,
    {
        uint8_t v_lmul = ir->zimm & 0b111;
        uint8_t v_sew = (ir->zimm >> 3) & 0b111;

        if (v_lmul == 4 || v_sew >= 4) {
            /* Illegal setting */
            rv->csr_vl = 0;
            rv->csr_vtype = 0x80000000;
            return true;
        }
        uint16_t vlmax = (v_lmul < 4)
                             ? ((1 << v_lmul) * vlen) >> (3 + v_sew)
                             : (vlen >> (3 + v_sew) >> (3 - (v_lmul - 5)));
        if (ir->rs1) {
            vl_setting(vlmax, rv->X[ir->rs1], rv->csr_vl);
            rv->csr_vtype = ir->zimm;
        } else {
            if (!ir->rd) {
                /* rs1 = x0 and rd = x0: keep vl, only change vtype */
                rv->csr_vtype = ir->zimm;
            } else {
                /* rs1 = x0 and rd != x0: vl = VLMAX */
                rv->csr_vl = vlmax;
                rv->csr_vtype = ir->zimm;
            }
        }
        /* Per the spec, rd receives the new vl (not the requested AVL) */
        rv->X[ir->rd] = rv->csr_vl;
    },
    GEN({/* no operation */}))
```
`vsetvl` and `vsetivli` are implemented in the same fashion. `zimm` represents the configuration value for `vtype`.
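
To make the shift-based `vlmax` computation easier to check, here is a small standalone sketch of the same idea. It assumes the raw `vsew` and `vlmul` encodings from the specification and mirrors the reserved-encoding check used above. The function name `compute_vlmax` is hypothetical, and it returns a 32-bit value so that the largest possible `VLMAX` of 65,536 still fits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: VLMAX from the raw vsew/vlmul encodings.
 * vlen is in bits and must be a power of two.
 * Returns 0 for the encodings treated as reserved above. */
static inline uint32_t compute_vlmax(uint32_t vlen, uint8_t v_sew, uint8_t v_lmul)
{
    assert(vlen != 0 && (vlen & (vlen - 1)) == 0);
    if (v_lmul == 4 || v_sew >= 4)
        return 0;

    /* SEW = 8 << v_sew = 2^(3 + v_sew), so VLEN / SEW is a right shift. */
    uint32_t per_reg = vlen >> (3 + v_sew);

    if (v_lmul < 4)
        return per_reg << v_lmul;   /* LMUL = 1, 2, 4, 8 */
    return per_reg >> (8 - v_lmul); /* LMUL = 1/8, 1/4, 1/2 */
}
```

With `VLEN = 128`: `SEW = 8, LMUL = 1` gives 16 elements, `SEW = 8, LMUL = 8` gives 128, and `SEW = 32, LMUL = 1/2` (encoding `0b111`) gives 2.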
### Vector Loads/Stores
The `eew` in load instructions is closely related to the `sew` specified by `vset{i}vl{i}`, with three possible scenarios:
1. `eew > sew`: This is considered an illegal operation.
2. `eew < sew`: Data will be loaded based on the size specified by `eew`.
3. `eew == sew`: The load operation will proceed based on the allocated size.

Using `eew == sew` is recommended, but operations with `eew < sew` are also valid.

Another illegal case occurs when `lmul > 1` and the register group extends past vector register 31. Such code is accepted by GCC but produces an error when run on Spike, so an `assert` is inserted to catch this case.

The implementation adopts a loop-unrolling style to handle a full word at a time when `eew = 8` or `16`. The variable `j` keeps track of the current `lmul`, indicating the specific vector register being operated on, represented as `v * n + j`. The variable `i` corresponds to the current array index being processed (e.g., for `vlen = 128`, `i` will take values 0, 1, 2, 3). Additionally, the variable `cnt` serves as an iterator referencing `csr_vl`.
```c
RVOP(
    vle8_v,
    {
        uint8_t sew = 8 << ((rv->csr_vtype >> 3) & 0b111);
        uint32_t addr = rv->X[ir->rs1];

        if (ir->eew > sew) {
            /* Illegal */
            rv->csr_vtype = 0x80000000;
            rv->csr_vl = 0;
            return true;
        } else {
            uint8_t i = 0;
            uint8_t j = 0;
            /* Handles a word at every loop. */
            for (uint32_t cnt = 0; rv->csr_vl - cnt >= 4;) {
                i %= LEN;
                /* Illegal when trying to access a vector register that is
                 * larger than 31 */
                assert(ir->vd + j < 32);
                /* Clear the word before OR-ing in the loaded bytes */
                rv->V[ir->vd + j][i] = 0;
                rv->V[ir->vd + j][i] |= rv->io.mem_read_b(rv, addr);
                rv->V[ir->vd + j][i] |= rv->io.mem_read_b(rv, addr + 1) << 8;
                rv->V[ir->vd + j][i] |= rv->io.mem_read_b(rv, addr + 2) << 16;
                rv->V[ir->vd + j][i] |= rv->io.mem_read_b(rv, addr + 3) << 24;
                cnt += 4;
                i++;

                /* Move to next vector register. */
                if (!(cnt & ((LEN << 2) - 1))) {
                    j++;
                    i = 0;
                }
                addr += 4;
            }
            /* vl smaller than a word: clear only the low bytes that are
             * about to be overwritten (bitwise AND, not modulo). */
            if (rv->csr_vl % 4) {
                rv->V[ir->vd + j][i] &= 0xFFFFFFFF << ((rv->csr_vl % 4) << 3);
            }
            /* Handles data that is narrower than a word. */
            for (uint32_t cnt = 0; cnt < (rv->csr_vl % 4); cnt++) {
                assert(ir->vd + j < 32); /* Illegal */
                rv->V[ir->vd + j][i] |= rv->io.mem_read_b(rv, addr + cnt)
                                        << (cnt << 3);
            }
        }
    },
    GEN({/* no operation */}))
```
All other vector unit-stride instructions are implemented in a similar fashion, following the same loop-unrolling approach.

### Vector Masking
This implementation uses a "mask-undisturbed" approach where the vector register `v0` stores the mask. In a `vlen = 128` configuration, `v0` can hold a mask for 128 elements. It's important to note that all instructions, except for `vset{i}vl{i}`, include a `vm` bit that controls whether this mask is active or not.
```
vop.v* v1, v2, v3, v0.t # enabled where v0.mask[i]=1, vm=0
vop.v* v1, v2, v3       # unmasked vector operation, vm=1
```
Another point to pay attention to is:
> The destination vector register group for a masked vector instruction cannot overlap the source mask register (v0), unless the destination vector register is being written with a mask value (e.g., compares) or the scalar result of a reduction. These instruction encodings are reserved.

A quick demonstration is:
```
// vadd.vi v16 v11 5 v0.t
// The data in the registers is initialized as (using a word as example)
// v16: 0xdddddddd
// v11: 0x01010101
// v0:  0xaaaaaaaa
// sew = 8, vl = vlmax
------------------------
v16 | dd | dd | dd | dd
------------------------
v11 |  1 |  1 |  1 |  1
------------------------
mask|  0 |  1 |  0 |  1
------------------------
vadd| dd | 06 | dd | 06
------------------------
```
In the interpreter implementation, we first check the `ir->vm` bit. If masking is enabled (`vm = 0`), we then check the corresponding bit in `v0` to decide whether each `sew`-wide element takes the computed result of `op1` and `op2` or keeps the existing value from the destination vector register.
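
The mask-undisturbed rule can be sketched on a single 32-bit word as follows. This is an illustrative standalone function, not the rv32emu macro: the name `masked_add_imm_word` and its parameters are hypothetical, and it assumes the mask bit for element `e` of the word is bit `e` of `mask_bits`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: apply "vs2 + imm" to every sew-wide element of one
 * 32-bit word. Inactive elements (vm == 0 and mask bit == 0) keep the old
 * destination value, i.e. mask-undisturbed. */
static inline uint32_t masked_add_imm_word(uint32_t vd_word,
                                           uint32_t vs2_word,
                                           uint32_t imm,
                                           uint32_t sew,
                                           int vm,
                                           uint32_t mask_bits)
{
    assert(sew == 8 || sew == 16 || sew == 32);
    uint32_t lane_mask = (sew == 32) ? 0xFFFFFFFFu : ((1u << sew) - 1);
    uint32_t result = 0;

    for (uint32_t e = 0; e < 32 / sew; e++) {
        uint32_t shift = e * sew;
        int active = vm || ((mask_bits >> e) & 1u);
        uint32_t lane = active
                            ? (((vs2_word >> shift) + imm) & lane_mask)
                            : ((vd_word >> shift) & lane_mask);
        result |= lane << shift;
    }
    return result;
}
```

Using the register contents from the demonstration (`vd = 0xdddddddd`, `vs2 = 0x01010101`, `imm = 5`, `sew = 8`, `vm = 0`, mask `0xaaaaaaaa`), elements 1 and 3 are active, so the word becomes `0x06dd06dd`. With `vm = 1` every element is updated and the result is `0x06060606`.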
### Vector Arithmetic Instructions
Since the RVV instruction set includes so many instructions, my approach is to define macros to simplify the code-writing process. The design flow is outlined as follows:

![Flowcharts (3)](https://hackmd.io/_uploads/rkZQbaRP1l.png)

```c
OPT(vd, vs2, vs1/imm/rs1, operation, operation type)
```
To handle every `sew` with `XLEN=32`, each width has to be executed separately.
```c
#define OPT(des, op1, op2, op, op_type)                   \
    {                                                     \
        switch (8 << ((rv->csr_vtype >> 3) & 0b111)) {    \
        case 8:                                           \
            sew_8b_handler(des, op1, op2, op, op_type);   \
            break;                                        \
        case 16:                                          \
            sew_16b_handler(des, op1, op2, op, op_type);  \
            break;                                        \
        case 32:                                          \
            sew_32b_handler(des, op1, op2, op, op_type);  \
            break;                                        \
        default:                                          \
            break;                                        \
        }                                                 \
    }
```
Next, we retrieve `rv->csr_vl`, which determines how many `sew` units need to be executed. Additionally, when `lmul > 1` produces a register group, we may need to access the next vector register. The distance to the required vector register is recorded using `__j`. Since the vector registers were initially defined as arrays composed of several 32-bit elements, `__i` is used to keep track of the array index. Note that an assert is inserted to catch the case where `lmul > 1` would access a vector register larger than 31.

Two macros are used to handle different `vl` cases. If `sew` multiplied by the remaining `vl` covers at least one word, we use `V*_LOOP` to process one word at a time. If the product of the remaining `vl` and `sew` is less than one word, `V*_LOOP_LEFT` is used instead. Note that in `V*_LOOP_LEFT`, it is necessary to clear bits in the partial word to correctly update the computed result.
```c
#define VI_LOOP(des, op1, op2, op, SHIFT, MASK, i, j, itr, vm)                 \
    uint32_t tmp_1 = rv->V[op1 + j][i];                                        \
    uint32_t tmp_d = rv->V[des + j][i];                                        \
    uint32_t ans = 0;                                                          \
    rv->V[des + j][i] = 0;                                                     \
    for (uint8_t ___cnt = 0; ___cnt < itr; ___cnt++) {                         \
        if (ir->vm) {                                                          \
            ans = ((((tmp_1 >> (___cnt << (SHIFT))) op (op2)) & (MASK))        \
                   << (___cnt << (SHIFT)));                                    \
        } else {                                                               \
            ans = (vm & (0x1 << ___cnt))                                       \
                      ? ((((tmp_1 >> (___cnt << (SHIFT))) op (op2)) & (MASK))  \
                         << (___cnt << (SHIFT)))                               \
                      : (tmp_d & (MASK << (___cnt << (SHIFT))));               \
        }                                                                      \
        rv->V[des + j][i] += ans;                                              \
    }

#define VI_LOOP_LEFT(des, op1, op2, op, SHIFT, MASK, i, j, itr, vm)            \
    uint32_t tmp_1 = rv->V[op1 + j][i];                                        \
    uint32_t tmp_d = rv->V[des + j][i];                                        \
    if (rv->csr_vl % itr) {                                                    \
        rv->V[des + j][i] &=                                                   \
            (0xFFFFFFFF << ((rv->csr_vl % itr) << SHIFT));                     \
    }                                                                          \
    uint32_t ans = 0;                                                          \
    for (uint8_t __cnt = 0; __cnt < (rv->csr_vl % itr); __cnt++) {             \
        assert((des + j) < 32);                                                \
        if (ir->vm) {                                                          \
            ans = ((((tmp_1 >> (__cnt << (SHIFT))) op (op2)) & (MASK))         \
                   << (__cnt << (SHIFT)));                                     \
        } else {                                                               \
            ans = (vm & (0x1 << __cnt))                                        \
                      ? ((((tmp_1 >> (__cnt << (SHIFT))) op (op2)) & (MASK))   \
                         << (__cnt << (SHIFT)))                                \
                      : (tmp_d & (MASK << (__cnt << (SHIFT))));                \
        }                                                                      \
        rv->V[des + j][i] += ans;                                              \
    }
```
`V*_LOOP` steps:
1. Copy the operands `op1` and `op2`.
2. Shift `op1` (`op2`) right so the current element is aligned to bit 0.
3. Perform the specified operation between `op1` and `op2`.
4. Mask the result to the corresponding `sew`.
5. Shift the result left to align with the corresponding position.
6. Accumulate the result into `vd`.

## Verification with [riscv-vector-tests](https://github.com/chipsalliance/riscv-vector-tests.git)
This part is quite annoying. It took me tons of time to set up correctly.

### Setup
To generate the assembly and binary files for the vector tests, check the [Prerequisite](https://github.com/chipsalliance/riscv-vector-tests/tree/main#prerequisite) section: a specific version of Spike is required. Remember to reset to that version.
```
// riscv-gnu-toolchain
$ git clone https://github.com/riscv-collab/riscv-gnu-toolchain
$ cd riscv-gnu-toolchain
$ ./configure --prefix=/opt/riscv/ --with-arch=rv32gcv --with-abi=ilp32
$ make -j16

// riscv-isa-sim
$ git clone https://github.com/riscv-software-src/riscv-isa-sim.git
$ cd riscv-isa-sim
$ git reset --hard 91793ed7d964aa0031c5a9a31fa05ec3d11b3b0f
$ mkdir build
$ cd build
$ ../configure --prefix=$RISCV
$ make -j16
$ sudo make install

// riscv-pk
$ git clone https://github.com/riscv-software-src/riscv-pk.git
$ cd riscv-pk
$ mkdir build
$ cd build
$ ../configure --prefix=$RISCV --host=riscv64-unknown-elf
$ make -j16
$ sudo make install
```
Get ready for the `rv32emu` vector test:
```
$ cd tests
$ git clone https://github.com/chipsalliance/riscv-vector-tests.git
$ cd riscv-vector-tests
$ make all -j16 --environment-overrides VLEN=128 XLEN=32
```
You might encounter an error such as:
```
riscv_test.h: No such file or directory
    5 | #include "riscv_test.h"
      |          ^~~~~~~~~~~~~~
```
Check [#52](https://github.com/chipsalliance/riscv-vector-tests/issues/52).

We can use `riscv32-unknown-elf-objdump` to inspect an `.elf`, for example `tests/riscv-vector-tests/out/v128x32machine/bin/stage2/vadd_vi-0`.
```
8001eed4 <fail>:
8001eed4:   0ff0000f    fence
8001eed8:   00018063    beqz    gp,8001eed8 <fail+0x4>
8001eedc:   0186        slli    gp,gp,0x1
8001eede:   0011e193    ori     gp,gp,1
8001eee2:   05d00893    li      a7,93
8001eee6:   850e        mv      a0,gp
8001eee8:   00000073    ecall

8001eeec <pass>:
8001eeec:   0ff0000f    fence
8001eef0:   4185        li      gp,1
8001eef2:   05d00893    li      a7,93
8001eef6:   4501        li      a0,0
8001eef8:   00000073    ecall
```
According to the instructions above, `vadd_vi-0` will set `a0` to zero if all tests pass. Recall that in `rv32emu`:
```c
#ifdef __EMSCRIPTEN__
    if (rv_has_halted(rv)) {
        printf("inferior exit code %d\n", attr->exit_code);
        emscripten_cancel_main_loop();
        rv_delete(rv); /* clean up and reuse memory */
    }
#endif
```
If the command `build/rv32emu tests/riscv-vector-tests/out/v128x32machine/bin/stage2/vadd_vi-0` passes the test successfully, the expected output will be `inferior exit code 0`.
If the test fails, a non-zero exit code will be printed instead.

A Python script can be written to quickly test all items in the `stage2/` directory, skipping vector floating-point instructions (files starting with `vf`) as floating-point vector registers are not implemented.
```bash
...
vxor_vx-0...............................pass
vxor_vx-1...............................pass
vxor_vx-2...............................pass
vxor_vx-3...............................pass
vxor_vx-4...............................pass
vzext_vf2-0.............................fail
vzext_vf4-0.............................fail
test:71 / 1247
```
Only `71` out of `1247` test cases passed, mostly involving single-width arithmetic instructions. Once the interpreter was implemented, the behavior of the instructions turned out to be far more diverse than expected, so the implementation needs to be refactored to support more instruction-specific behavior. Note that `make all` also generates test cases for some vector instructions that are not in the RVV specification, as they come from different branches. These should be excluded as well.

## Conclusion
Building the RVV emulator has deepened my understanding of RVV. Starting with the specification, I counted the instructions one by one and completed a functional, albeit imperfect, RVV decoder (585/616). This progress allowed me to advance to the Interpreter stage. While the Interpreter results are not ideal, I have successfully handled different `sew`, `lmul`, and vector masking based on the current design. Moving forward, refactoring the operation-related parts of the code will enable support for more RVV instructions.

## Reference
[The RISC-V Instruction Set Manual Volume I: Unprivileged ISA](https://drive.google.com/file/d/1uviu1nH-tScFfgrovvFCrj7Omv8tFtkp/view?usp=drive_link)

[riscv-v-spec](https://drive.google.com/file/d/1uviu1nH-tScFfgrovvFCrj7Omv8tFtkp/view?usp=drive_link)