黃士昕
Implement RVV instruction decoding and an interpreter on the latest rv32emu codebase (ensuring it's rebased). The implementation must ultimately pass the riscv-vector-tests.
The vector extension adds 32 vector registers, and seven unprivileged CSRs (vstart, vxsat, vxrm, vcsr, vtype, vl, vlenb) to a base scalar RISC-V ISA.
To emulate a vector register, whose length VLEN must be a power of two and greater than or equal to ELEN, we use an array of uint32_t. This implementation uses a VLEN of 128 bits, but the code is designed to scale to other lengths.
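The register layout described above can be sketched as follows. This is only an illustration under stated assumptions: the names `vreg_file_t`, `VREG_WORDS`, and `vreg_get_u8` are hypothetical and are not taken from the rv32emu source.

```c
#include <stdint.h>

/* Assumed parameters: VLEN is the vector register width in bits. */
#define VLEN 128
#define VREG_WORDS (VLEN / 32) /* number of 32-bit words per register */
#define N_VREGS 32

/* Hypothetical register file: 32 registers, each an array of uint32_t. */
typedef struct {
    uint32_t V[N_VREGS][VREG_WORDS];
} vreg_file_t;

/* Read the idx-th 8-bit element of one register (little-endian packing). */
static inline uint8_t vreg_get_u8(const vreg_file_t *rf,
                                  unsigned vreg,
                                  unsigned idx)
{
    return (uint8_t) (rf->V[vreg][idx / 4] >> ((idx % 4) * 8));
}
```

Changing `VLEN` to another power of two that is a multiple of 32 resizes every register without touching the accessor.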
The vtype
and vl
CSRs play a crucial role in RVV, as vsew
and vlmul
can change dynamically during emulation. With the equation vlmax = lmul * vlen / sew
, where vlmax
represents the maximum vl
an instruction can handle in sew
units, these parameters are essential for controlling vector operations effectively.
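As a worked example of the equation, VLEN = 128 with SEW = 32 and LMUL = 2 gives VLMAX = 2 * 128 / 32 = 8. The fields involved can be pulled out of vtype as in the sketch below, which assumes RV32 (so vill is bit 31). The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields of the vtype CSR on RV32, following the RVV 1.0 bit layout. */
typedef struct {
    uint32_t vlmul; /* vtype[2:0], encoded LMUL */
    uint32_t sew;   /* element width in bits: 8 << vtype[5:3] */
    bool vta;       /* vtype[6] */
    bool vma;       /* vtype[7] */
    bool vill;      /* vtype[31] */
} vtype_fields_t;

static vtype_fields_t decode_vtype(uint32_t vtype)
{
    vtype_fields_t f;
    f.vlmul = vtype & 0x7;
    /* vsew values 4..7 are reserved; the caller is expected to reject them. */
    f.sew = 8u << ((vtype >> 3) & 0x7);
    f.vta = (vtype >> 6) & 1;
    f.vma = (vtype >> 7) & 1;
    f.vill = (vtype >> 31) & 1;
    return f;
}
```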
In the RISC-V Vector Extension version 1.0 specification, the exact number of instructions in the extension instruction set is not explicitly stated. Based on my own analysis of the specification, I have identified a total of 616 instructions. These instructions can be broadly categorized into three main groups: Configuration-Setting Instructions, Vector Loads/Stores, and Vector Arithmetic Instructions.
The Vector Arithmetic Instructions can be further divided into six categories based on their functionality: Vector Integer Arithmetic Instructions, Vector Fixed-Point Arithmetic Instructions, Vector Floating-Point Instructions, Vector Reduction Operations, Vector Mask Instructions, and Vector Permutation Instructions.
The Configuration-Setting Instructions and Vector Arithmetic Instructions share the same opcode, 1010111
, while the Vector Load Instructions use the same opcode as LOAD-FP, 0000111
, and the Vector Store Instructions use the same opcode as STORE-FP, 0100111
. In the first step of decoding, a jump table can be used to classify instructions based on their opcode.
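A minimal sketch of this first classification step is shown below. It assumes a table indexed by bits [6:2] of the instruction, since bits [1:0] are always 11 for 32-bit encodings. The enum and table names are hypothetical and do not reproduce the actual rv32emu jump table.

```c
#include <stdint.h>

typedef enum {
    CLASS_OTHER = 0, /* anything this sketch does not cover */
    CLASS_LOAD_FP,   /* 0000111: scalar FP loads and vector loads */
    CLASS_STORE_FP,  /* 0100111: scalar FP stores and vector stores */
    CLASS_OP_V,      /* 1010111: vector arithmetic and configuration */
} insn_class_t;

/* Indexed by insn[6:2]; unlisted entries default to CLASS_OTHER. */
static const insn_class_t opcode_table[32] = {
    [0x01] = CLASS_LOAD_FP,  /* 00001 */
    [0x09] = CLASS_STORE_FP, /* 01001 */
    [0x15] = CLASS_OP_V,     /* 10101 */
};

static insn_class_t classify_opcode(uint32_t insn)
{
    return opcode_table[(insn >> 2) & 0x1f];
}
```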
Initially, instructions with the opcode 0100111
can be categorized:
By extending the original jump table to include OP(vcfg)
, the decoding of Configuration-Setting Instructions can be completed seamlessly.
The Vector Arithmetic Instructions category contains the largest number of instructions, with a total of 305 types. Based on the instruction table mentioned above, decoding OPI is achieved using a 6-bit function6
and a 3-bit function3
. I have created an rvv_jump_table
, which functions similarly to the existing rv_jump_table
. Since the same function6
can include instructions from the OPI
, OPM
, and OPF
categories, the function6
is directly named using its binary representation, such as op_111111
.
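To make the two-level lookup concrete, the sketch below extracts the fields of an OP-V instruction. The field positions and funct3 categories follow the RVV 1.0 encoding; the struct and function names are hypothetical, and this is not the actual rvv_jump_table code.

```c
#include <stdint.h>

/* funct3 categories under the OP-V major opcode (1010111). */
enum {
    OPIVV = 0, OPFVV = 1, OPMVV = 2, OPIVI = 3,
    OPIVX = 4, OPFVF = 5, OPMVX = 6, OPCFG = 7,
};

typedef struct {
    uint8_t funct6; /* insn[31:26] */
    uint8_t vm;     /* insn[25], 0 means masked by v0 */
    uint8_t vs2;    /* insn[24:20] */
    uint8_t vs1;    /* insn[19:15], also used for rs1 or imm[4:0] */
    uint8_t funct3; /* insn[14:12] */
    uint8_t vd;     /* insn[11:7] */
} opv_fields_t;

static opv_fields_t decode_opv(uint32_t insn)
{
    opv_fields_t f;
    f.funct6 = (insn >> 26) & 0x3f;
    f.vm = (insn >> 25) & 0x1;
    f.vs2 = (insn >> 20) & 0x1f;
    f.vs1 = (insn >> 15) & 0x1f;
    f.funct3 = (insn >> 12) & 0x7;
    f.vd = (insn >> 7) & 0x1f;
    return f;
}
```

For example, `vadd.vv v1, v2, v3` (encoded as 0x022180D7) yields funct6 = 000000 and funct3 = OPIVV, so a handler selected by funct6 can then branch on funct3.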
Here is a demonstration of how the decoding process works. Other instructions within the Vector Arithmetic Instructions category follow the same decoding pattern.
Vector loads and stores are encoded within the scalar floating-point load and store major opcodes (LOAD-FP/STORE-FP).
Since Vector Load and LOAD-FP share the same opcode, decoding must rely on function3. As LOAD-FP is an I-type instruction, it is distinguished using the function3 value 010.
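The check can be sketched as below. It relies on the width encodings from the specification: vector loads use 000, 101, 110, and 111 (EEW of 8, 16, 32, and 64 bits), while the remaining values belong to scalar floating-point loads. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true when a LOAD-FP encoding is a vector load. */
static bool is_vector_load(uint32_t insn)
{
    if ((insn & 0x7f) != 0x07) /* not the LOAD-FP major opcode */
        return false;
    switch ((insn >> 12) & 0x7) { /* width field, insn[14:12] */
    case 0: /* EEW = 8 */
    case 5: /* EEW = 16 */
    case 6: /* EEW = 32 */
    case 7: /* EEW = 64 */
        return true;
    default: /* 001..100 are scalar FP widths, e.g. 010 for FLW */
        return false;
    }
}
```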
The Vector Load instructions consist of 177 types, making them highly complex. However, they can be systematically decoded by following the hierarchy of mop, lumop, nf, and eew. Below is a tree diagram categorizing the Vector Load instructions:
To handle decoding efficiently, an ir
list is established in decode.h
as an enum structure. By following the sequence in the tree diagram, the instructions are mapped into the ir
list, enabling the opcode of each instruction to be derived based on relative bit values. This approach minimizes branching and reduces overall program size.
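The idea of deriving the enum entry from relative bit values can be sketched as follows. This is a reduced illustration: it covers only nf = 0 unit-stride and strided loads, and the enum names and their ordering are assumptions rather than the real list in decode.h.

```c
#include <stdint.h>

/* Hypothetical enum: entries of one addressing mode are laid out
 * consecutively in eew order so an offset selects the instruction. */
typedef enum {
    INSN_VLE8_V, INSN_VLE16_V, INSN_VLE32_V, INSN_VLE64_V,
    INSN_VLSE8_V, INSN_VLSE16_V, INSN_VLSE32_V, INSN_VLSE64_V,
    INSN_ILLEGAL,
} vload_ir_t;

/* Map the width field to 0..3 for EEW = 8, 16, 32, 64. */
static int eew_index(uint32_t width)
{
    switch (width) {
    case 0: return 0;
    case 5: return 1;
    case 6: return 2;
    case 7: return 3;
    default: return -1;
    }
}

static vload_ir_t decode_vload(uint32_t insn)
{
    uint32_t nf = (insn >> 29) & 0x7;
    uint32_t mop = (insn >> 26) & 0x3;
    uint32_t lumop = (insn >> 20) & 0x1f;
    int e = eew_index((insn >> 12) & 0x7);

    if (e < 0 || nf != 0) /* segment loads are left out of this sketch */
        return INSN_ILLEGAL;
    switch (mop) {
    case 0: /* unit-stride; only the plain lumop = 00000 form is covered */
        return lumop == 0 ? (vload_ir_t) (INSN_VLE8_V + e) : INSN_ILLEGAL;
    case 2: /* strided */
        return (vload_ir_t) (INSN_VLSE8_V + e);
    default: /* indexed forms are left out of this sketch */
        return INSN_ILLEGAL;
    }
}
```

With this layout, adding the offset replaces a per-width branch, which is the size and branching benefit the paragraph describes.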
Below is the code for decoding Vector Load instructions:
The Vector Store Instructions, with opcode 0100111, are processed using the same decoding function, following a similar approach to the Vector Load Instructions.
Since RVV contains so many vector instructions, it is worth verifying the correctness of instruction decoding at the decode stage. Currently, I use a custom Python test script to validate RVV instruction decoding. This script automatically generates .s
files containing vector instructions, which are then compiled into .elf
files using GCC. Subsequently, I use objdump
and rv32emu
to generate .txt
output files and compare the results of instructions and registers to ensure correctness. test case
Notice that, due to the complexity and variety of rules in immediate values in vector arithmetic instructions, the current testing program does not yet fully cover all possible ranges of immediate values.
The goal is to pass the vector-test, where the most frequently used vector instructions are vsetvli
, vle<eew>.v
, and vse<eew>.v
. Priority will be given to implementing and optimizing these instructions.
Since these instructions appear so frequently, it would be ideal to include them in rv_insn_t.
There is some important information in the spec that you should be aware of: The vtype and vl CSRs can only be updated using the vset{i}vl{i}
instructions. The current implementation does not include vma
and vta
functionality, as these features are not part of the vector-test.
When configuring vsetvli
and vsetivli
, specific rules apply to ensure proper settings.
If an invalid configuration is encountered, the vill
bit is set to 1, and all other bits in vtype
are cleared to 0. The maximum number of elements that can be operated on, denoted as VLMAX
, is determined by the formula VLMAX = LMUL * VLEN / SEW
, which depends on the current configuration. For detailed constraints on setting vl, refer to section 6.3, Constraints on Setting vl. Additionally, setting the value of rs1 to -1 sets vl to VLMAX directly.
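The rules above can be sketched as a small state update. It assumes RV32 (vill is bit 31) and assumes the caller passes vlmax = 0 to signal an unsupported configuration; `vcsr_state_t` and `apply_vset` are hypothetical names. Taking the minimum of AVL and VLMAX is one choice that satisfies the constraints on vl, not the only one the specification permits.

```c
#include <stdint.h>

typedef struct {
    uint32_t vtype;
    uint32_t vl;
} vcsr_state_t;

/* Update vtype and vl for a vset{i}vl{i} instruction.
 * vlmax == 0 is this sketch's convention for "invalid configuration". */
static void apply_vset(vcsr_state_t *s,
                       uint32_t new_vtype,
                       uint32_t avl,
                       uint32_t vlmax)
{
    if (vlmax == 0) {
        s->vtype = 1u << 31; /* vill = 1, all other bits cleared */
        s->vl = 0;           /* vl is also zeroed on an illegal vtype */
        return;
    }
    s->vtype = new_vtype;
    s->vl = avl < vlmax ? avl : vlmax;
}
```

An AVL of -1 (0xFFFFFFFF as unsigned) is always at least VLMAX, so it yields vl = VLMAX.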
The upper limit on VLEN allows software to know that indices will fit into 16 bits (largest VLMAX of 65,536 occurs for LMUL=8 and SEW=8 with VLEN=65,536).
The concept of lmul
defines how groups of vector registers are utilized. However, since vl
already controls the number of elements processed in terms of sew, and vl
is set based on vlmax, the primary purpose of lmul is to determine vlmax
. Alternatively, vlmax can be computed in a less straightforward way that does not rely on the fractional values of lmul (e.g., 1/2, 1/4, 1/8) mentioned in the specification. This approach avoids the complexities associated with floating-point calculations for fractional lmul values.
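One way to do this with integers only is to use shifts on the encoded vlmul field, as sketched below. The encodings follow the specification (000 to 011 for LMUL = 1, 2, 4, 8; 101 to 111 for 1/8, 1/4, 1/2; 100 reserved). The function name and the convention of returning 0 for an unsupported setting are assumptions and may differ from the author's code.

```c
#include <stdint.h>

/* Compute VLMAX = LMUL * VLEN / SEW without fractional arithmetic.
 * vsew and vlmul are the raw vtype fields; returns 0 when the
 * combination is reserved or leaves no room for a single element. */
static uint32_t vlmax_from_vtype(uint32_t vlen, uint32_t vsew, uint32_t vlmul)
{
    if (vsew > 3 || vlmul == 4) /* reserved encodings */
        return 0;
    uint32_t per_reg = vlen >> (3 + vsew); /* VLEN / SEW, SEW = 8 << vsew */
    if (vlmul < 4)
        return per_reg << vlmul;           /* LMUL = 1, 2, 4, 8 */
    return per_reg >> (8 - vlmul);         /* LMUL = 1/8, 1/4, 1/2 */
}
```

For VLEN = 128, SEW = 8, and LMUL = 1/2 this gives 16 >> 1 = 8 elements.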
vsetvl and vsetivli are done in the same fashion.
zimm
represents the configuration value for vtype
.
The eew
in load instructions is closely related to the sew
specified by vset{i}vl{i}
, with three possible scenarios:
eew > sew: This is considered an illegal operation.
eew < sew: Data will be loaded based on the size specified by eew.
eew == sew: The load operation will proceed based on the allocated size.
Using eew == sew is recommended, but operations with eew < sew are also valid.
Another case of an illegal instruction occurs when lmul > 1 and a vector register greater than 31 is accessed. While such an instruction can pass GCC validation, it results in errors when tested using Spike, so an assert is inserted to handle this case.
The implementation adopts a loop-unrolling style to handle word lengths when eew = 8
or 16
. The variable j
keeps track of the current lmul
, indicating the specific vector register being operated on, represented as v * n + j
. The variable i
corresponds to the current array index being processed (e.g., for vlen = 128
, i
will take values 0, 1, 2, 3). Additionally, the variable cnt
serves as an iterator referencing csr_vl
.
All other vector unit-stride instruction implementations are handled in a similar fashion, following the same loop-unrolling approach.
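The roles of j, i, and cnt can be illustrated with a simplified eew = 8 unit-stride load. This sketch uses plain nested loops instead of the unrolled form described above, takes memory as a byte pointer, and leaves elements past vl undisturbed. The function name and signature are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VLEN 128
#define VREG_WORDS (VLEN / 32)

/* Load vl bytes from mem into the register group starting at vd. */
static void vle8_sketch(uint32_t V[32][VREG_WORDS],
                        unsigned vd,
                        unsigned lmul,
                        const uint8_t *mem,
                        uint32_t vl)
{
    assert(vd + lmul <= 32); /* the group must not run past v31 */
    uint32_t cnt = 0;        /* elements loaded so far, bounded by vl */
    for (unsigned j = 0; j < lmul && cnt < vl; j++) {
        for (unsigned i = 0; i < VREG_WORDS && cnt < vl; i++) {
            uint32_t word = V[vd + j][i];
            for (unsigned b = 0; b < 4 && cnt < vl; b++, cnt++) {
                word &= ~(0xFFu << (b * 8));            /* clear old byte */
                word |= (uint32_t) mem[cnt] << (b * 8); /* insert new byte */
            }
            V[vd + j][i] = word;
        }
    }
}
```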
This implementation utilizes a "mask-undisturbed" approach where the vector register v0
stores the mask. In a vlen = 128
configuration, v0
can hold a mask for 128 elements. It's important to note that all instructions, except for vset{i}vl{i}
, include a vm
bit that controls whether this mask is active or not.
Another point to pay attention to is:
The destination vector register group for a masked vector instruction cannot overlap the source mask register (v0), unless the destination vector register is being written with a mask value (e.g., compares) or the scalar result of a reduction. These instruction encodings are reserved.
A quick demonstration is:
In the interpreter implementation, we will first check the ir->vm
bit, followed by checking the bit in v0
to determine whether to use the computed sew
result of op1
and op2
or the sew
from the destination vector register.
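The two checks can be sketched per element as follows, assuming sew = 8 and an addition as the example operation. In the encoding, vm = 1 means unmasked and vm = 0 means the mask bit in v0 decides. The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* An element is active when the instruction is unmasked (vm = 1)
 * or when bit idx of v0 is set. v0 is an array of 32-bit words. */
static bool elem_active(uint32_t vm, const uint32_t *v0, uint32_t idx)
{
    if (vm)
        return true;
    return (v0[idx / 32] >> (idx % 32)) & 1;
}

/* Mask-undisturbed: inactive elements keep the old destination value. */
static uint8_t masked_add_u8(uint32_t vm,
                             const uint32_t *v0,
                             uint32_t idx,
                             uint8_t op1,
                             uint8_t op2,
                             uint8_t old_vd)
{
    return elem_active(vm, v0, idx) ? (uint8_t) (op1 + op2) : old_vd;
}
```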
Since the RVV instruction set includes various instructions, my approach is to define macros to simplify the code-writing process. The design flow is outlined as follows:
To handle all the sews with XLEN=32, we have to execute them separately.
Next, we retrieve csr->vl
, which determines how many sew
units need to be executed. Additionally, to handle cases where lmul > 1
results in group registers, we may need to access the next vector register. The distance to the required vector register is recorded using __j
. Since the vector registers were initially defined as arrays composed of several 32-bit elements, __i
is used to keep track of the array index.
Note that an assert is inserted to reject cases where lmul > 1 would cause an access to a vector register numbered above 31.
Two macros will be used to handle different vl
cases. If the length of sew multiplied by the remaining vl
exceeds one word length, we use V*_LOOP
to process one word at a time. If the product of the remaining vl
and sew
is less than one word, V*_LOOP_LEFT
is used instead. Note that in V*_LOOP_LEFT
, it is necessary to clear bits in the partial word to correctly update the computed result.
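The split between full words and a trailing partial word can be sketched with functions instead of macros. The example assumes sew = 8, a single vector register, and addition as the operation; the function names are hypothetical stand-ins for V*_LOOP and V*_LOOP_LEFT.

```c
#include <stdint.h>

/* Add n 8-bit elements (1..4) of words a and b, merging into old.
 * Bytes beyond n keep their previous value from old. */
static uint32_t add_word_sew8(uint32_t a, uint32_t b, uint32_t old, unsigned n)
{
    uint32_t out = old;
    for (unsigned k = 0; k < n; k++) {
        unsigned sh = k * 8;
        uint32_t x = (a >> sh) & 0xFF; /* shift the operand down */
        uint32_t y = (b >> sh) & 0xFF;
        uint32_t r = (x + y) & 0xFF;   /* keep only sew bits */
        out = (out & ~(0xFFu << sh)) | (r << sh); /* clear, then write */
    }
    return out;
}

static void vadd_vv_sew8(uint32_t *vd,
                         const uint32_t *vs2,
                         const uint32_t *vs1,
                         uint32_t vl)
{
    uint32_t i = 0;
    for (; vl >= 4; vl -= 4, i++) /* full words, like V*_LOOP */
        vd[i] = add_word_sew8(vs2[i], vs1[i], vd[i], 4);
    if (vl) /* partial word, like V*_LOOP_LEFT */
        vd[i] = add_word_sew8(vs2[i], vs1[i], vd[i], vl);
}
```

With vl = 5, the first word is fully rewritten and only the lowest byte of the second word changes.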
V*_LOOP steps:
1. Shift op1 (op2) to the right.
2. Compute the result of op1 and op2.
3. Mask the result to sew.
4. Write the result to vd.
This part is quite annoying; it took me a lot of time to set up correctly.
To generate the assembly and binary files for the vector tests, check Prerequisite: a specific version of Spike is required. Remember to reset to that version.
Get ready for rv32emu
vector test:
You might encounter some issues such as:
Check #52.
We can use riscv32-unknown-elf-objdump
to check the .elf
in tests/riscv-vector-tests/out/v128x32machine/bin/stage2/vadd_vi-0
for example.
According to the instructions above, vadd_vi-0
will set a0
to zero if all tests pass. Recall that in rv32emu
:
If the command build/rv32emu tests/riscv-vector-tests/out/v128x32machine/bin/stage2/vadd_vi-0
passes the test successfully, the expected output will be inferior exit code 0
. If the test fails, a non-zero exit code will be printed instead.
A Python script can be written to quickly test all items in the stage2/
directory, skipping vector floating-point instructions (files starting with vf
) as floating-point vector registers are not implemented.
Only 71
out of 1247
test cases passed, mostly involving single-width arithmetic instructions. When the interpreter was implemented, the behavior of instructions turned out to be far more diverse than expected. Refactoring the implementation with more customized instruction behaviors is necessary.
Note that some of the vector instruction test cases built with make all are not part of the RVV specification, as they come from different branches. These should be further excluded.
Building the RVV emulator has deepened my understanding of RVV. Starting with the specification, I counted the instructions one by one and completed a functional, albeit imperfect, RVV decoder (585/616). This progress allowed me to advance to the interpreter stage. While the interpreter results are not ideal, I have successfully handled different sew
, lmul
, and vector masking based on the current design. Moving forward, refactoring the operation-related parts of the code will enable support for more RVV instructions.
The RISC-V Instruction Set Manual Volume I: Unprivileged ISA
riscv-v-spec