廖奕凱
My task is to study the QOY format and leverage RISC-V Vector (RVV) extensions to accelerate its operations. The project cover the following aspects:
The RISC-V Vector Extension (RVV) is designed to enhance the RISC-V architecture with powerful vector computation capabilities, enabling efficient data-parallel processing for a wide range of applications such as high-performance computing, machine learning, and signal processing. RVV's flexible and scalable design allows it to cater to diverse hardware implementations and application requirements.
Instruction Set Size
: Vector ISAs are typically large due to the need for vector equivalents of scalar instructions, specialized memory access operations, and vector manipulation instructions.Predication Support
: Modern vector ISAs, including RVV, incorporate predication (masking) to enable conditional execution of vector elements.Instruction Encoding
: The complexity of vector operations often exceeds the capacity of 32-bit instruction encoding, necessitating the use of CPU state registers to manage vector operations.Number of Registers
: RVV defines 32 vector registers named v0 to v31.Register Size (VLEN)
: Each vector register is VLEN bits wide, where VLEN is a power of two (e.g., 64, 128, 256, 512 bits). The exact VLEN is determined by the implementer.Standard Constraints
: The Zv* standard extensions require VLEN to be at least 64 or 128 bits, similar in size to Intel's AVX-512 when VLEN is 512 bits.Element Size (ELEN)
: Elements within a vector are at least 8 bits and up to ELEN bits, where ELEN is also a power of two (8 ≤ ELEN ≤ VLEN).Standard Constraints
: The Zv* extensions constrain ELEN to be at least 32 or 64 bits.vtype (Vector Type Register)
: Describes the type of vector operation, including:
SEW (Standard Element Width)
: Size in bits of each vector element (8 ≤ SEW ≤ ELEN).LMUL (Length Multiplier)
: Determines the grouping of vector registers, allowing multipliers of 1/8, 1/4, 1/2, 1, 2, 4, or 8.vl (Vector Length Register)
: Specifies the number of elements to operate on, ranging from 0 to vlmax(SEW, LMUL), where:
Variable Vector Length
: RVV's design allows different hardware implementations to support varying vector lengths, enhancing scalability.Element Manipulation
: Supports operations for loading, storing, and manipulating vector elements, including non-contiguous memory accesses through scatter/gather operations.Mask Registers
: Enable conditional execution of vector operations, allowing certain elements to be processed based on predicate conditions.Predicate Operations
: Includes instructions for forming predicates through comparisons and other logical operations.Comprehensive Instruction Set
: RVV includes a wide range of arithmetic (e.g., add, subtract, multiply, divide) and logical (e.g., AND, OR, NOT) instructions tailored for vector processing.Vector-Specific Instructions
: Additional instructions are provided for tasks unique to vector processing, such as vector compression and decompression.QOI (Quite OK Image) is a simple and efficient lossless image compression format designed to provide fast encoding and decoding speeds. The main features of QOI include:
The encoding process of QOI primarily involves the following steps:
QOY (Quite OK YCbCr420A) is an extension of QOI, with the following notable features:
The encoding process of QOY resembles that of QOI but introduces enhancements for handling color space and alpha channels:
git clone https://github.com/riscv-collab/riscv-gnu-toolchain.git --recursive
cd riscv-gnu-toolchain
mkdir build
cd build
../configure --prefix=$HOME/riscv-gnu-toolchain/build --with-arch=rv32gcv --with-abi=ilp32d --enable-multilib
This command configures the build process for the RISC-V toolchain.
make liunux
make-qemu
example for test rvv
I referenced the rvv_example repository on GitHub to test whether my environment is functioning properly.
main.c
#include <stdio.h>
#include <math.h>
struct pt {
float x;
float y;
float z;
};
void vec_len_rvv(float *r, struct pt *v, int n);
void vec_len(float *r, struct pt *v, int n){
for (int i=0; i<n; ++i){
struct pt p = v[i];
r[i] = sqrtf(p.x*p.x + p.y*p.y + p.z*p.z);
}
}
#define N 6
struct pt v[N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}, {13, 14, 15}, {16, 17, 18}};
int main(){
float lens[N], lens_rvv[N];
vec_len(lens, v, N);
vec_len_rvv(lens_rvv, v, N);
for (int i=0; i<N; ++i){
printf("%f %f\n", lens[i], lens_rvv[i]);
}
return 0;
}
vec.S
# void vec_len_rvv(float *r, struct pt *pts, int n)
#define r a0
#define pts a1
#define n a2
#define vl a3
#define Xs v0
#define Ys v1
#define Zs v2
#define lens v3
.globl vec_len_rvv
vec_len_rvv:
# 32 bit elements, don't care (Agnostic) how tail and mask are handled
vsetvli vl, n, e32, ta,ma
vlseg3e32.v Xs, (pts) # loads interleaved Xs, Ys, Zs into 3 registers
vfmul.vv lens, Xs, Xs
vfmacc.vv lens, Ys, Ys
vfmacc.vv lens, Zs, Zs
vfsqrt.v lens, lens
vse32.v lens, (r)
sub n, n, vl
sh2add r, vl, r # bump r ptr 4 bytes per float
sh1add vl, vl, vl # multiply vl by 3 floats per point
sh2add pts, vl, pts # bump v ptr 4 bytes per float (12 per pt)
bnez n, vec_len_rvv
ret
makefile
go: main
qemu-riscv32 -cpu rv32,v=true,zba=true,vlen=128,rvv_ta_all_1s=on,rvv_ma_all_1s=on ./main
main: main.c vec.S makefile
riscv32-unknown-elf-gcc -O main.c vec.S -o main -march=rv32gcv_zba -lm
Just type "make"
Expected output:
$ make
riscv32-unknown-elf-gcc -O main.c vec.S -o main -march=rv32gcv_zba -lm
qemu-riscv32 -cpu rv32,v=true,zba=true,vlen=128 ./main
3.741657 3.741657
8.774964 8.774964
13.928389 13.928389
19.104973 19.104973
24.289915 24.289915
29.478806 29.478806
Output in my Ubuntu 22.04 ARM64:
The output matches the expected result, and it compiles successfully. Currently, I plan to follow a similar strategy as in the rvv_example to modify certain functions in the source code into RVV versions.
The following code, qoy_rgba_to_ycbcra_two_lines(), is one of the core blocks in the QOY project used to convert RGBA image data into the YCbCr 4:2:0 A format. It processes "1 or 2 lines" of pixels at a time and handles two pixels at once (each pixel's r, g, b, a).
static inline int qoy_rgba_to_ycbcra_two_lines(const void* rgba_in, int width, int lines, int channels_in, int channels_out, void *ycbcr420a_out) {
if (channels_in != 4) channels_in = 3;
if (channels_out != 4) channels_out = 3;
unsigned char *line1 = (unsigned char *)rgba_in;
unsigned char *line2 = lines == 2 ? line1 + width * channels_in : line1;
unsigned char *out = ycbcr420a_out;
int size_out = (channels_out == 4) ? 10 : 6;
int written = 0;
for (int i = 0; i < width; i += 2, line1 += channels_in * 2, line2 += channels_in * 2, out += size_out) {
qoy_rgba_t *p1 = (qoy_rgba_t *)line1;
qoy_rgba_t *p2 = (qoy_rgba_t *)line2;
qoy_rgba_t *p3 = (qoy_rgba_t *)(((width & 0x01) == 1 && i == width - 1) ? line1 : line1 + channels_in);
qoy_rgba_t *p4 = (qoy_rgba_t *)(((width & 0x01) == 1 && i == width - 1) ? line2 : line2 + channels_in);
qoy_ycbcr420a_t *pout = (qoy_ycbcr420a_t *)out;
pout->y[0] = ((1254097 * p1->r) + (2462056 * p1->g) + (478151 * p1->b)) >> 22;
pout->y[1] = ((1254097 * p2->r) + (2462056 * p2->g) + (478151 * p2->b)) >> 22;
pout->y[2] = ((1254097 * p3->r) + (2462056 * p3->g) + (478151 * p3->b)) >> 22;
pout->y[3] = ((1254097 * p4->r) + (2462056 * p4->g) + (478151 * p4->b)) >> 22;
unsigned int r4 = p1->r + p2->r + p3->r + p4->r;
unsigned int g4 = p1->g + p2->g + p3->g + p4->g;
unsigned int b4 = p1->b + p2->b + p3->b + p4->b;
pout->cb = qoy_8bit_clamp((134217728 - (44233 * r4) - (86839 * g4) + (b4 << 17) + (1 << 19)) >> 20);
pout->cr = qoy_8bit_clamp((134217728 + (r4 << 17) - (109757 * g4) - (21315 * b4) + (1 << 19)) >> 20);
if (channels_out == 4) {
if (channels_in == 4) {
pout->a[0] = p1->a;
pout->a[1] = p2->a;
pout->a[2] = p3->a;
pout->a[3] = p4->a;
} else {
pout->a[0] = 0xff;
pout->a[1] = 0xff;
pout->a[2] = 0xff;
pout->a[3] = 0xff;
}
}
written += size_out;
}
return written;
}
rgba_in
: Points to the input buffer containing RGBA (or RGB) data.
width
: The width of the image line(s) to be processed.
lines
: Indicates whether to process 1 or 2 lines.
lines == 2
, line2
points to the next line of pixels (line1 + width * channels_in
).lines == 1
(processing the boundary), line2
points back to line1
, effectively duplicating the last line.channels_in
, channels_out
: Specify the number of channels in the input (3 or 4) and output data (3 or 4).
channels_in
is forced to 3 or 4:
if (channels_in != 4) channels_in = 3;
channels_out
is handled the same way.ycbcr420a_out
: The output buffer where the converted YCbCrA (or YCbCr) blocks will be written.
size_out
: Specifies the output block size:
In the YCbCr 4:2:0(A) format:
unsigned char *line1 = (unsigned char *)rgba_in;
unsigned char *line2 = lines == 2 ? line1 + width * channels_in : line1;
Loop:
for (int i = 0; i < width; i += 2, ...)
line1 += channels_in * 2;
line2 += channels_in * 2;
out += size_out;
qoy_rgba_t *p1 = (qoy_rgba_t *)line1;
qoy_rgba_t *p2 = (qoy_rgba_t *)line2;
qoy_rgba_t *p3 = (qoy_rgba_t *)(((width & 0x01) == 1 && i == width - 1) ? line1 : line1 + channels_in);
qoy_rgba_t *p4 = (qoy_rgba_t *)(((width & 0x01) == 1 && i == width - 1) ? line2 : line2 + channels_in);
pout->y[0] = ((1254097 * p1->r) + (2462056 * p1->g) + (478151 * p1->b)) >> 22;
pout->y[1] = ((1254097 * p2->r) + (2462056 * p2->g) + (478151 * p2->b)) >> 22;
pout->y[2] = ((1254097 * p3->r) + (2462056 * p3->g) + (478151 * p3->b)) >> 22;
pout->y[3] = ((1254097 * p4->r) + (2462056 * p4->g) + (478151 * p4->b)) >> 22;
1254097
, 2462056
, and 478151
are integer approximations of the coefficients (0.299), (0.587), and (0.114), respectively. These constants are scaled and then reduced using a right shift (>> 22
).unsigned int r4 = p1->r + p2->r + p3->r + p4->r;
unsigned int g4 = p1->g + p2->g + p3->g + p4->g;
unsigned int b4 = p1->b + p2->b + p3->b + p4->b;
pout->cb = qoy_8bit_clamp((134217728 - (44233 * r4) - (86839 * g4) + (b4 << 17) + (1 << 19)) >> 20);
pout->cr = qoy_8bit_clamp((134217728 + (r4 << 17) - (109757 * g4) - (21315 * b4) + (1 << 19)) >> 20);
if (channels_out == 4) {
if (channels_in == 4) {
pout->a[0] = p1->a;
pout->a[1] = p2->a;
pout->a[2] = p3->a;
pout->a[3] = p4->a;
} else {
pout->a[0] = 0xff;
pout->a[1] = 0xff;
pout->a[2] = 0xff;
pout->a[3] = 0xff;
}
}
written += size_out;
return written;
The current strategy involves rewriting qoy_rgba_to_ycbcra and qoy_ycbcra_to_rgba.These functions involve batch processing of a large number of pixels (color space conversion for RGBA, block-based 4:2:0 processing, clamping), which is similar to the previous vec_len logic. Both perform extensive per-pixel or per-block calculations.
Vector instructions can be utilized within the functions to replace the intensive calculations (addition, multiplication, shifting, clamping, etc.), enabling the processing of multiple pixels at once, and validate its output on QEMU to ensure it achieves the same functionality as the original C code.
RISC-V Vector (RVV) intrinsics provide a straightforward way to use RVV instructions directly in C/C++ without requiring assembly knowledge. Intrinsics are low-level functions defined by the compiler, offering a nearly one-to-one mapping with RVV instructions, allowing programmers to leverage vector operations in a high-level language.
Key Points:
Intrinsic: A low-level function defined by the compiler to expose individual instructions to a higher-level language.
Benefits: Simplifies low-level RVV programming without requiring in-depth knowledge of assembly.
Example: To perform vector addition (vadd.vv), the intrinsic function for 32-bit integer vectors (i32) in one vector register group (m1) is:
vint32m1_t __riscv_vadd_vv_i32m1(vint32m1_t, vint32m1_t, size_t);
Naming Scheme: Intrinsics follow a structured naming pattern:
This structured approach bridges low-level hardware instructions with high-level programming, making RVV accessible and efficient.
int qoy_rgba_to_ycbcra_rvv(
const void* rgba_in,
int width,
int height,
int channels_in,
int channels_out,
void *ycbcr420a_out
){
// (1) If channels_in != 4, set to 3
// If channels_out != 4, set to 3
if (channels_in != 4) channels_in = 3;
if (channels_out != 4) channels_out = 3;
const uint8_t* src = (const uint8_t*)rgba_in;
uint8_t* dst = (uint8_t*)ycbcr420a_out;
int block_size = (channels_out == 4) ? 10 : 6;
int written = 0;
for(int y = 0; y < height; y += 2){
int lineCount = 2;
if((y == height - 1) && (height & 1)) lineCount = 1; // Odd height => last line
// Allocate buffers:
uint8_t* R1 = (uint8_t*)malloc(width);
uint8_t* G1 = (uint8_t*)malloc(width);
uint8_t* B1 = (uint8_t*)malloc(width);
uint8_t* A1 = (uint8_t*)malloc(width);
uint8_t* R2 = (uint8_t*)malloc(width);
uint8_t* G2 = (uint8_t*)malloc(width);
uint8_t* B2 = (uint8_t*)malloc(width);
uint8_t* A2 = (uint8_t*)malloc(width);
const uint8_t* line1 = src + (y * width * channels_in);
const uint8_t* line2 = (lineCount == 2) ? (line1 + width * channels_in) : line1;
// ----------(A) Separate channels (can be scalar or vector)-----------
for(int x = 0; x < width; x++){
const uint8_t* p1 = line1 + x * channels_in;
R1[x] = p1[0];
G1[x] = p1[1];
B1[x] = p1[2];
A1[x] = (channels_in == 4) ? p1[3] : 0xff;
if(lineCount == 2){
const uint8_t* p2 = line2 + x * channels_in;
R2[x] = p2[0];
G2[x] = p2[1];
B2[x] = p2[2];
A2[x] = (channels_in == 4) ? p2[3] : 0xff;
} else {
// If lines = 1 => same as line1
R2[x] = R1[x];
G2[x] = G1[x];
B2[x] = B1[x];
A2[x] = A1[x];
}
}
// ----------(B) RVV Computation of Y-----------
uint8_t* Y1 = (uint8_t*)malloc(width);
uint8_t* Y2 = (uint8_t*)malloc(width);
int idx = 0;
while(idx < width){
size_t vl = __riscv_vsetvl_e8m1(width - idx);
// Load R1
vuint8m1_t vr_in = __riscv_vle8_v_u8m1(&R1[idx], vl);
vuint8m1_t vg_in = __riscv_vle8_v_u8m1(&G1[idx], vl);
vuint8m1_t vb_in = __riscv_vle8_v_u8m1(&B1[idx], vl);
// Zero-extend => i32
// First step: Zero-extend 8-bit unsigned integers to 16-bit
vuint16m2_t vr_temp = __riscv_vwcvtu_x_x_v_u16m2(vr_in, vl);
vuint16m2_t vg_temp = __riscv_vwcvtu_x_x_v_u16m2(vg_in, vl);
vuint16m2_t vb_temp = __riscv_vwcvtu_x_x_v_u16m2(vb_in, vl);
// Second step: Zero-extend 16-bit unsigned integers to 32-bit
vuint32m4_t vr = __riscv_vwcvtu_x_x_v_u32m4(vr_temp, vl);
vuint32m4_t vg = __riscv_vwcvtu_x_x_v_u32m4(vg_temp, vl);
vuint32m4_t vb = __riscv_vwcvtu_x_x_v_u32m4(vb_temp, vl);
// Multiply => c_r = vr * 1254097
vuint32m4_t c_r = __riscv_vmul_vx_u32m4(vr, 1254097, vl);
vuint32m4_t c_g = __riscv_vmul_vx_u32m4(vg, 2462056, vl);
vuint32m4_t c_b = __riscv_vmul_vx_u32m4(vb, 478151, vl);
// ysum = c_r + c_g + c_b => shift right by 22 => clamp
vuint32m4_t ysum = __riscv_vadd_vv_u32m4(c_r, c_g, vl);
ysum = __riscv_vadd_vv_u32m4(ysum, c_b, vl);
ysum = __riscv_vsrl_vx_u32m4(ysum, 22, vl); // >>22
// Clamp to range 0..255
ysum = __riscv_vmaxu_vx_u32m4(ysum, 0, vl);
ysum = __riscv_vminu_vx_u32m4(ysum, 255, vl);
printf("Index %d: c_r=%d, c_g=%d, c_b=%d, ysum=%d\n", idx, c_r, c_g, c_b, ysum);
// Cast to uint8 => vnclipu_wx
// First step: Narrow from vuint32m4_t to vuint16m2_t
vuint16m2_t ysum_16 = __riscv_vnclipu_wx_u16m2(ysum, 0, 0, vl);
// Second step: Narrow from vuint16m2_t to vuint8m1_t
vuint8m1_t vy = __riscv_vnclipu_wx_u8m1(ysum_16, 0, 0, vl);
// Store => Y1
__riscv_vse8_v_u8m1(&Y1[idx], vy, vl);
// Process line2
if(lineCount == 2){
vuint8m1_t vr2_in = __riscv_vle8_v_u8m1(&R2[idx], vl);
vuint8m1_t vg2_in = __riscv_vle8_v_u8m1(&G2[idx], vl);
vuint8m1_t vb2_in = __riscv_vle8_v_u8m1(&B2[idx], vl);
// Zero-extend => i32
// First step: Zero-extend 8-bit unsigned integers to 16-bit
vuint16m2_t vr2_temp = __riscv_vwcvtu_x_x_v_u16m2(vr2_in, vl);
vuint16m2_t vg2_temp = __riscv_vwcvtu_x_x_v_u16m2(vg2_in, vl);
vuint16m2_t vb2_temp = __riscv_vwcvtu_x_x_v_u16m2(vb2_in, vl);
// Second step: Zero-extend 16-bit unsigned integers to 32-bit
vuint32m4_t vr2 = __riscv_vwcvtu_x_x_v_u32m4(vr2_temp, vl);
vuint32m4_t vg2 = __riscv_vwcvtu_x_x_v_u32m4(vg2_temp, vl);
vuint32m4_t vb2 = __riscv_vwcvtu_x_x_v_u32m4(vb2_temp, vl);
vuint32m4_t c_r2 = __riscv_vmul_vx_u32m4(vr2, 1254097, vl);
vuint32m4_t c_g2 = __riscv_vmul_vx_u32m4(vg2, 2462056, vl);
vuint32m4_t c_b2 = __riscv_vmul_vx_u32m4(vb2, 478151, vl);
vuint32m4_t ysum2 = __riscv_vadd_vv_u32m4(c_r2, c_g2, vl);
ysum2 = __riscv_vadd_vv_u32m4(ysum2, c_b2, vl);
ysum2 = __riscv_vsrl_vx_u32m4(ysum2, 22, vl);
ysum2 = __riscv_vmaxu_vx_u32m4(ysum2, 0, vl);
ysum2 = __riscv_vminu_vx_u32m4(ysum2, 255, vl);
// First step: Narrow from vuint32m4_t to vuint16m2_t
vuint16m2_t ysum2_16 = __riscv_vnclipu_wx_u16m2(ysum2, 0, 0, vl);
// Second step: Narrow from vuint16m2_t to vuint8m1_t
vuint8m1_t vy2 = __riscv_vnclipu_wx_u8m1(ysum2_16, 0, 0, vl);
__riscv_vse8_v_u8m1(&Y2[idx], vy2, vl);
} else {
// If lineCount = 1 => same as Y1
__riscv_vse8_v_u8m1(&Y2[idx], vy, vl);
}
idx += vl;
}
// ----------(C) Block-based sum => yoy-----------
for(int x = 0; x < width; x += 2){
int x2 = ((x + 1) < width ? x + 1 : x); // If odd, repeat last
int r4 = R1[x] + R1[x2] + R2[x] + R2[x2];
int g4 = G1[x] + G1[x2] + G2[x] + G2[x2];
int b4 = B1[x] + B1[x2] + B2[x] + B2[x2];
// Cb= ...
int cb = 134217728 - 44233 * r4 - 86839 * g4 + (b4 << 17) + (1 << 19);
cb >>= 20;
if(cb < 0) cb = 0;
else if(cb > 255) cb = 255;
// Cr= ...
int cr = 134217728 + (r4 << 17) - 109757 * g4 - 21315 * b4 + (1 << 19);
cr >>= 20;
if(cr < 0) cr = 0;
else if(cr > 255) cr = 255;
printf("Block %d-%d: Cb=%d, Cr=%d\n", x, x2, cb, cr);
// Y => y0=Y1[x], y1=Y1[x2], y2=Y2[x], y3=Y2[x2]
uint8_t y0 = Y1[x], y1 = Y1[x2], y2 = Y2[x], y3 = Y2[x2];
// Alpha => A1[x], A1[x2], A2[x], A2[x2]
uint8_t a0 = A1[x], a1 = A1[x2], a2 = A2[x], a3 = A2[x2];
qoy_ycbcr420a_t* pout = (qoy_ycbcr420a_t*)dst;
pout->y[0] = y0;
pout->y[1] = y1;
pout->y[2] = y2;
pout->y[3] = y3;
pout->cb = cb;
pout->cr = cr;
if(channels_out == 4){
pout->a[0] = a0;
pout->a[1] = a1;
pout->a[2] = a2;
pout->a[3] = a3;
}
dst += block_size;
written += block_size;
printf("Final Output (Y, Cb, Cr):\n");
for (int i = 0; i < written; i++) {
printf("Byte %d: %02X\n", i, ((uint8_t *)ycbcr420a_out)[i]);
}
}
free(R1); free(G1); free(B1); free(A1);
free(R2); free(G2); free(B2); free(A2);
free(Y1); free(Y2);
}
return written;
}
The function qoy_rgba_to_ycbcra_rvv converts an RGBA image to YCbCrA format, potentially reducing the number of channels based on input parameters. The conversion process is divided into three main sections:
Section B: RVV Computation of Y
This is where RVV plays a crucial role in accelerating the computation of the Y component. Here's a step-by-step breakdown:
Vector Length Configuration:
size_t vl = __riscv_vsetvl_e8m1(width - idx);
Sets the vector length (vl) based on the remaining width. This allows dynamic adjustment to handle the remaining pixels that might not fit exactly into vector registers.
Loading R, G, B Channels:
vuint8m1_t vr_in = __riscv_vle8_v_u8m1(&R1[idx], vl);
vuint8m1_t vg_in = __riscv_vle8_v_u8m1(&G1[idx], vl);
vuint8m1_t vb_in = __riscv_vle8_v_u8m1(&B1[idx], vl);
Loads chunks of R, G, and B data into vector registers. Each vector can hold multiple pixel values, enabling parallel processing.
Zero-Extension of Data Types:
vuint16m2_t vr_temp = __riscv_vwcvtu_x_x_v_u16m2(vr_in, vl);
vuint16m2_t vg_temp = __riscv_vwcvtu_x_x_v_u16m2(vg_in, vl);
vuint16m2_t vb_temp = __riscv_vwcvtu_x_x_v_u16m2(vb_in, vl);
vuint32m4_t vr = __riscv_vwcvtu_x_x_v_u32m4(vr_temp, vl);
vuint32m4_t vg = __riscv_vwcvtu_x_x_v_u32m4(vg_temp, vl);
vuint32m4_t vb = __riscv_vwcvtu_x_x_v_u32m4(vb_temp, vl);
First Step: Extends the 8-bit unsigned integers to 16-bit to prevent overflow during multiplication.
Second Step: Further extends the 16-bit integers to 32-bit to accommodate the results of the multiplication operations.
Multiplication with Constants:
vuint32m4_t c_r = __riscv_vmul_vx_u32m4(vr, 1254097, vl);
vuint32m4_t c_g = __riscv_vmul_vx_u32m4(vg, 2462056, vl);
vuint32m4_t c_b = __riscv_vmul_vx_u32m4(vb, 478151, vl);
Multiplies each channel by specific constants that are part of the YCbCr conversion formula. These operations are performed in parallel for multiple pixels.
Summation and Shifting:
vuint32m4_t ysum = __riscv_vadd_vv_u32m4(c_r, c_g, vl);
ysum = __riscv_vadd_vv_u32m4(ysum, c_b, vl);
ysum = __riscv_vsrl_vx_u32m4(ysum, 22, vl); // Shift right by 22 bits
Sums the multiplied values and shifts right by 22 bits to scale down the result appropriately, as per the YCbCr conversion formula.
6. Clamping:
ysum = __riscv_vmaxu_vx_u32m4(ysum, 0, vl);
ysum = __riscv_vminu_vx_u32m4(ysum, 255, vl);
Ensures that the Y values are within the 0-255 range to fit into an 8-bit unsigned integer.
Narrowing to 8-bit and Storing Results:
vuint16m2_t ysum_16 = __riscv_vnclipu_wx_u16m2(ysum, 0, 0, vl);
vuint8m1_t vy = __riscv_vnclipu_wx_u8m1(ysum_16, 0, 0, vl);
__riscv_vse8_v_u8m1(&Y1[idx], vy, vl);
Processing the Second Line (if applicable):
If lineCount == 2, the same RVV operations are performed on the second line (R2, G2, B2) to compute Y2.
1. Logic Overview in Pure C
Function structure:
// Similar to:
// int qoy_rgba_to_ycbcra(const void* rgba_in, int width, int height, int channels_in, int channels_out, void *ycbcr420a_out)
{
// Main structure: process "two lines" at a time => call qoy_rgba_to_ycbcra_two_lines
// for (int y = 0; y < height; y += 2, pin += width*channels_in*2, pout += size_out * (width >> 1)) {
// written += qoy_rgba_to_ycbcra_two_lines( pin, width, lines=2 or 1, ... , pout );
// }
}
The outer function (qoy_rgba_to_ycbcra) iterates through the height with a step of y += 2, processing two lines at a time (or just one line if only one line is left). The rgba_in pointer passed to qoy_rgba_to_ycbcra_two_lines corresponds to the starting address of the two lines. Similarly, the pout pointer moves forward as blocks are processed.
1.1 Logic of qoy_rgba_to_ycbcra_two_lines
static inline int qoy_rgba_to_ycbcra_two_lines(
const void* rgba_in, int width, int lines, int channels_in, int channels_out, void *ycbcr420a_out
){
// line1 = (unsigned char *)rgba_in;
// line2 = (lines==2) ? (line1 + width*channels_in) : line1; // Two lines or repeat the same line
// ...
// for (int i=0; i<width; i+=2,
// line1 += channels_in*2, line2 += channels_in*2, out += size_out)
// {
// p1 = (qoy_rgba_t *) line1;
// p2 = (qoy_rgba_t *) line2;
// p3 = ( (width&1)&& (i==width-1) ) ? line1 : (line1+channels_in);
// p4 = ( (width&1)&& (i==width-1) ) ? line2 : (line2+channels_in);
// ...
// // Calculate Y0..Y3, then compute cb, cr, and a
// }
// return written;
}
Each iteration processes two pixels at a time (i += 2, a left-right pair). Both line1 and line2 pointers move forward by 2 * channels_in in each loop, pointing to the starting address of the next two pixels.
If the width is odd, when i == width-1, p3 and p4 are repeated as p1 and p2 (preventing out-of-bound reads). This ensures that data is accessed safely and processing progresses by moving pointers through the lines in pairs of two pixels.
2. Logic Overview in the RVV intrinsic Version
int qoy_rgba_to_ycbcra_rvv(
const void* rgba_in,
int width,
int height,
int channels_in,
int channels_out,
void *ycbcr420a_out
){
// Similar outer loop: iterate with y += 2
for(int y=0; y<height; y+=2){
// lineCount=2 or 1
// (A) Channel separation => extract each pixel of the entire row(s) into R1[x], G1[x], B1[x], A1[x] / R2[x], ...
// (B) Use RVV to compute Y1[x], Y2[x]
// (C) for (int i=0; i<width; i+=2) { ... assemble Y0..Y3 + cb/cr + alpha }
}
}
2.1 Channel Separation (Part (A))
uint8_t* R1= malloc(width); // ...
const uint8_t* line1 = src + (y*width*channels_in);
const uint8_t* line2 = line1 + (width*channels_in); // or same line if 1 line
for(int x=0; x<width; x++){
// R1[x] = line1[x*channels_in+0];
// ...
// R2[x] = line2[x*channels_in+0];
// ...
}
This step separates all pixels in a row (from x = 0 to x = width-1) into R1[], G1[], B1[], and A1[]. If lineCount = 2, line2 is also separated into R2[], G2[], B2[], and A2[].
Result:
2.2 Compute Y Using RVV (Part (B))
Process R1[idx..idx+vl-1], G1[idx..idx+vl-1], and B1[idx..idx+vl-1] using RVV instructions to calculate Y1[]. Similarly, compute Y2[] for the bottom row (R2, G2, B2).
2.3 Assemble Y, Cb, Cr, and A (Part ©)
for(int i=0; i<width; i+=2){
int i_next = i+1;
if((width&1)&& i==width-1){
i_next = i; // Repeat
}
// p1 => R1[i], p2 => R2[i], p3 => R1[i_next], p4 => R2[i_next]
// y0=>Y1[i], y1=>Y2[i], y2=>Y1[i_next], y3=>Y2[i_next]
// sum_r= r1+r2+r3+r4 => cb, cr => clamp => output
}
This step processes two pixels at a time (i += 2). For odd-width cases, when i == width-1, i_next = i ensures the last pixel is repeated, avoiding out-of-bounds access.
The final block is written to dst, and the number of processed blocks (written) is updated.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define QOI_IMPLEMENTATION
#include "qoi.h"
// Generate test input data
void generate_test_data(unsigned char *data, int width, int height, int channels) {
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
int idx = (y * width + x) * channels;
data[idx] = x % 256; // R
data[idx + 1] = y % 256; // G
data[idx + 2] = (x + y) % 256; // B
if (channels == 4) {
data[idx + 3] = 255; // A
}
}
}
}
// Compare output results and output "match" or "mismatch" per block
void compare_outputs(const uint8_t *ref_out, const uint8_t *rvv_out, int num_bytes, int block_size) {
printf("Comparing Outputs...\n");
int num_blocks = num_bytes / block_size;
for (int block = 0; block < num_blocks; block++) {
int match = 1; // Assume match initially
for (int i = 0; i < block_size; i++) {
if (ref_out[block * block_size + i] != rvv_out[block * block_size + i]) {
match = 0; // Found a mismatch
break;
}
}
if (match) {
printf("Block %d: match\n", block);
} else {
printf("Block %d: mismatch\n", block);
}
}
}
// Execute test function
void run_test(int width, int height, int channels_in, int channels_out) {
printf("Testing with Width=%d, Height=%d, Channels=%d->%d\n", width, height, channels_in, channels_out);
int block_size = (channels_out == 4) ? 10 : 6; // Output size per block
// Allocate input and output buffers
uint8_t *rgba_input = (uint8_t *)malloc(width * height * channels_in);
uint8_t *ref_output = (uint8_t *)malloc(width * height * block_size / 2);
uint8_t *rvv_output = (uint8_t *)malloc(width * height * block_size / 2);
// Initialize test input data
generate_test_data(rgba_input, width, height, channels_in);
// Execute reference function and RVV function
int ref_bytes = qoy_rgba_to_ycbcra(rgba_input, width, height, channels_in, channels_out, ref_output);
int rvv_bytes = qoy_rgba_to_ycbcra_rvv(rgba_input, width, height, channels_in, channels_out, rvv_output);
// Verify that the output sizes are consistent
if (ref_bytes != rvv_bytes) {
printf("Error: Output sizes differ! Ref=%d bytes, RVV=%d bytes\n", ref_bytes, rvv_bytes);
free(rgba_input);
free(ref_output);
free(rvv_output);
return;
}
// Compare outputs
compare_outputs(ref_output, rvv_output, ref_bytes, block_size);
// Free memory
free(rgba_input);
free(ref_output);
free(rvv_output);
}
int main() {
printf("Running QOY Conversion Tests with Various Sizes...\n");
// Test multiple input sizes
int test_sizes[][2] = {
{4, 4}, // Basic test
{5, 4}, // Odd width
{4, 5}, // Odd height
{5, 5}, // Both width and height odd
};
int num_tests = sizeof(test_sizes) / sizeof(test_sizes[0]);
for (int i = 0; i < num_tests; i++) {
int width = test_sizes[i][0];
int height = test_sizes[i][1];
run_test(width, height, 4, 4); // Test RGBA to YCbCrA
}
printf("All Tests Completed.\n");
return 0;
}
riscv32-unknown-linux-gnu-gcc test.c -std=gnu99 -march=rv32gcv -mabi=ilp32d -O0 -lpng -lz -o test.out
qemu-riscv32 -L $HOME/riscv-gnu-toolchain/build_linux/sysroot ./qoy_rvvintrinsic/test.out
Running QOY Conversion Tests with Various Sizes...
Testing with Width=4, Height=4, Channels=4->4
Comparing Outputs...
Block 0: match
Block 1: match
Block 2: match
Block 3: match
Testing with Width=5, Height=4, Channels=4->4
Comparing Outputs...
Block 0: match
Block 1: match
Block 2: mismatch
Block 3: mismatch
Block 4: mismatch
Block 5: mismatch
Testing with Width=4, Height=5, Channels=4->4
Comparing Outputs...
Block 0: match
Block 1: match
Block 2: match
Block 3: match
Block 4: match
Block 5: match
Testing with Width=5, Height=5, Channels=4->4
Comparing Outputs...
Block 0: match
Block 1: match
Block 2: mismatch
Block 3: mismatch
Block 4: mismatch
Block 5: mismatch
Block 6: mismatch
Block 7: mismatch
Block 8: mismatch
All Tests Completed.
parallels@ubuntu-linux-22-04-desktop:~/riscv-gnu-toolchain$
Since I am temporarily unable to identify the issue, the current code can process images with even widths. Therefore, I would like to use it to test whether there is an improvement in the efficiency of image conversion.
I have confirmed that the images within the folder are identical after being compared between two versions of the conversion before proceeding with further testing.
Simple benchmark suite for qoy
Requires libpng, "stb_image.h" and "stb_image_write.h", "qoi.h"
Set Up the Sysroot Environment
The sysroot is a directory that mimics the root filesystem of the target architecture (RISC-V in this case). It contains all the necessary headers and libraries required for cross-compilation.
a. Locate the Sysroot
If you built the toolchain as shown above, the sysroot is typically located at $HOME/riscv32/sysroot. If not, you might need to specify or create one.
b. Prepare the Sysroot with Necessary Libraries
You need to ensure that libpng and zlib are available in the RISC-V sysroot. Here's how to do it:
i. Install Dependencies for Building Libraries
sudo apt install cmake libtool
ii. Cross-Compile zlib for RISC-V
wget https://zlib.net/zlib-1.2.13.tar.gz
tar -xvf zlib-1.2.13.tar.gz
cd zlib-1.3.1
Configure and Build
# Set environment variables for cross-compilation
export CC=riscv32-unknown-linux-gnu-gcc
export AR=riscv32-unknown-linux-gnu-ar
export RANLIB=riscv32-unknown-linux-gnu-ranlib
# Configure with sysroot
./configure --prefix=$HOME/riscv32/sysroot/usr --static
# Build and install
make
make install
iii. Cross-Compile libpng for RISC-V
Download libpng Source Code
cd ..
wget https://download.sourceforge.net/libpng/libpng-1.6.39.tar.gz
tar -xvf libpng-1.6.39.tar.gz
cd libpng-1.6.39
Configure and Build
# Set environment variables for cross-compilation
export CC=riscv32-unknown-linux-gnu-gcc
export AR=riscv32-unknown-linux-gnu-ar
export RANLIB=riscv32-unknown-linux-gnu-ranlib
# Configure with sysroot and zlib
./configure --prefix=$HOME/riscv-gnu-toolchain/build_linux/sysroot/usr --host=riscv32-unknown-linux-gnu --with-zlib-prefix=$HOME/riscv-gnu-toolchain/build_linux/sysroot/usr --enable-static --disable-shared
# Build and install
make install
/*******************************************************************************
* benchmark.c
*
* A benchmark that:
* 1. Takes either a .png file or a directory containing .png files.
* 2. For each PNG found:
* a) Reads it via stb_image (=> RGBA).
* b) Calls qoy_rgba_to_ycbcra (pure C) multiple times, measures time.
* c) Calls qoy_rgba_to_ycbcra_rvv (RVV) multiple times, measures time.
* d) Compares outputs and prints times for that file.
* 3. At the end, prints a "global average" across all files.
*
*******************************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <dirent.h> // For opendir, readdir
//------------------------------[ STB_IMAGE ]-----------------------------------
// Define STB_IMAGE_IMPLEMENTATION and include only PNG support
#define STB_IMAGE_IMPLEMENTATION
#define STBI_ONLY_PNG
#include "stb_image.h"
//------------------------------[ QOY ]-----------------------------------------
// Define QOY_IMPLEMENTATION to include the implementation
#define QOY_IMPLEMENTATION
#include "qoy.h"
//------------------------------[ Timer ]---------------------------------------
#if defined(__APPLE__)
#include <mach/mach_time.h>
#elif defined(__linux__)
#include <time.h>
#elif defined(_WIN32)
#include <windows.h>
#endif
static uint64_t ns(void) {
#if defined(__APPLE__)
static mach_timebase_info_data_t info;
static int init=0;
if(!init) {
mach_timebase_info(&info);
init=1;
}
uint64_t now = mach_absolute_time();
now = now * info.numer / info.denom;
return now;
#elif defined(__linux__)
struct timespec spec;
clock_gettime(CLOCK_MONOTONIC, &spec);
return (uint64_t)spec.tv_sec * 1000000000ULL + (uint64_t)spec.tv_nsec;
#elif defined(_WIN32)
static LARGE_INTEGER freq;
static int init=0;
if(!init){
QueryPerformanceFrequency(&freq);
init=1;
}
LARGE_INTEGER now;
QueryPerformanceCounter(&now);
return (uint64_t)(1000000000ULL* now.QuadPart / freq.QuadPart);
#else
return (uint64_t)clock();
#endif
}
// -----------------------------------------------------------------------------
// Global variables for accumulating the total processing times (C and RVV)
static uint64_t g_sum_c_time_ns = 0; // Total "pure C conversion" time across all files (in ns)
static uint64_t g_sum_rvv_time_ns = 0; // Total "RVV conversion" time across all files (in ns)
static int g_image_count = 0; // Number of PNG files processed
static int g_runs = 1; // Global variable to store runs (optional)
/*
benchmark_image():
Reads a PNG => converts to RGBA; performs multiple (runs) pure C + RVV
conversions, measures time, prints the results for that file, and accumulates
the times into the global statistics (g_sum_c_time_ns / g_sum_rvv_time_ns).
*/
static void benchmark_image(const char* path, int runs) {
int width, height, comp;
unsigned char* rgba = stbi_load(path, &width, &height, &comp, 4);
if(!rgba) {
printf("Error: failed to load PNG: %s\n", path);
return;
}
printf("[File] %s => %dx%d, forced RGBA=4\n", path, width, height);
// (1) Allocate output buffer
int block_size = (4 == 4)? 10: 6;
int outbuf_size = (width>>1)*height* block_size; // Approximate upper limit
unsigned char* out_c = (unsigned char*)malloc(outbuf_size);
unsigned char* out_rvv = (unsigned char*)malloc(outbuf_size);
// (2) Perform warm-up in advance (optional)
qoy_rgba_to_ycbcra(rgba, width, height, 4, 4, out_c);
qoy_rgba_to_ycbcra_rvv(rgba, width, height, 4, 4, out_rvv);
// (3) Measure the times for each conversion
uint64_t sum_c=0, sum_rvv=0;
for(int i=0; i<runs; i++){
// C version
memset(out_c, 0, outbuf_size);
uint64_t t0 = ns();
qoy_rgba_to_ycbcra(rgba, width, height, 4, 4, out_c);
uint64_t t1 = ns();
sum_c += (t1 - t0);
// RVV version
memset(out_rvv, 0, outbuf_size);
uint64_t t2 = ns();
qoy_rgba_to_ycbcra_rvv(rgba, width, height, 4, 4, out_rvv);
uint64_t t3 = ns();
sum_rvv += (t3 - t2);
}
double avg_c = (double)sum_c / runs / 1.0e6; // ms
double avg_rvv = (double)sum_rvv / runs / 1.0e6;
printf("Runs=%d | C=%.3f ms, RVV=%.3f ms\n", runs, avg_c, avg_rvv);
// (4) Add to global statistics (sum_c, sum_rvv)
g_sum_c_time_ns += sum_c; // sum_c is still the total for "runs" times (in ns)
g_sum_rvv_time_ns += sum_rvv;
g_image_count++;
free(rgba);
free(out_c);
free(out_rvv);
}
/*
benchmark_directory():
Opens a directory (recursively or non-recursively), finds .png files,
and calls benchmark_image() for each PNG.
*/
static void benchmark_directory(const char* dirpath, int runs) {
DIR* dp = opendir(dirpath);
if(!dp) {
printf("Could not open directory: %s\n", dirpath);
return;
}
struct dirent* ent;
while((ent = readdir(dp)) != NULL) {
if(!strcmp(ent->d_name, ".") || !strcmp(ent->d_name, "..")) {
continue;
}
char filepath[1024];
snprintf(filepath, sizeof(filepath), "%s/%s", dirpath, ent->d_name);
// Check if the file extension is .png
size_t len = strlen(ent->d_name);
if(len>4 && strcmp(ent->d_name + (len-4), ".png")==0) {
benchmark_image(filepath, runs);
}
// If recursion for subdirectories is needed => if(ent->d_type==DT_DIR)...(call benchmark_directory)
}
closedir(dp);
}
//------------------------------[ main ]----------------------------------------
int main(int argc, char** argv) {
if(argc < 3) {
printf("Usage: %s <file_or_directory> <runs>\n", argv[0]);
return 0;
}
const char* input_path = argv[1];
g_runs = atoi(argv[2]); // Store in global variable
if(g_runs <= 0) g_runs=1;
// Determine whether input_path is a file or directory
struct stat st;
if(stat(input_path, &st)==0) {
if(S_ISDIR(st.st_mode)) {
// If it's a directory => traverse it
benchmark_directory(input_path, g_runs);
}
else if(S_ISREG(st.st_mode)) {
// If it's a file => treat it as a PNG
benchmark_image(input_path, g_runs);
}
else {
printf("Input path is neither file nor directory???\n");
}
} else {
printf("Cannot stat: %s\n", input_path);
}
// ---------- Print "global average" here ----------
if(g_image_count>0) {
// g_sum_c_time_ns, g_sum_rvv_time_ns are the "runs total" for all files
double avg_c = (double)g_sum_c_time_ns / (double)(g_image_count*g_runs);
double avg_rvv = (double)g_sum_rvv_time_ns / (double)(g_image_count*g_runs);
// Convert to milliseconds
avg_c /= 1.0e6;
avg_rvv /= 1.0e6;
printf("===== Global Average across %d PNG(s) =====\n", g_image_count);
printf("C version: %.3f ms\n", avg_c);
printf("RVV version: %.3f ms\n", avg_rvv);
} else {
printf("No PNG files were processed.\n");
}
return 0;
}
riscv32-unknown-linux-gnu-gcc benchmark.c -std=gnu99 -march=rv32gcv -mabi=ilp32d -O0 -lpng -lz -lm -o benchmark.out
qemu-riscv32 -L $HOME/riscv-gnu-toolchain/build_linux/sysroot ./qoy_rvvintrinsic/benchmark.out ./qoi_benchmark_suite/images/textures_pk 1
icon_512
===== Global Average across 213 PNG(s) =====
C version: 5.885 ms
RVV version: 126.008 ms
If compiled with -O0, pure C code may still benefit from simple optimizations by the compiler (or its logic might be straightforward). However, RVV intrinsic code often requires higher optimization levels (-O2, -O3) to fully optimize instruction sequences, register allocation, and similar aspects.
It is recommended to use at least -O2 or -O3 for RVV code.
Under -O0, the compiler may not perform sufficient instruction merging or redundancy elimination for vector operations, leading to excessive and unnecessary load/store operations and VL setting overhead.
===== Global Average across 213 PNG(s) =====
C version: 2.162 ms
RVV version: 12.286 ms
===== Global Average across 213 PNG(s) =====
C version: 63.954 ms
RVV version: 13.177 ms
===== Global Average across 1002 PNG(s) =====
C version: 0.999 ms
RVV version: 21.944 ms
===== Global Average across 1002 PNG(s) =====
C version: 0.363 ms
RVV version: 2.163 ms
===== Global Average across 1002 PNG(s) =====
C version: 10.647 ms
RVV version: 2.332 ms
Pure C Generated with -O3
Under -O3, the compiler often performs automatic vectorization, larger-scale function inlining, and loop unrolling, which can significantly alter the "instruction count" or "memory access patterns."
For simulators, more or more complex RISC-V instruction sequences may require "additional steps" for interpretation and execution, potentially making it slower than -O2 (or even -O0).
This does not mean it would be slower on actual hardware. On a real CPU, executing "more but more efficient instruction sequences" is likely faster than -O2. However, due to factors like the "cost of instruction interpretation" and "cache simulation," software simulators might show worse performance for the -O3 version of pure C code.
RVV Under -O3
The compiler applies the most aggressive optimizations to code containing RVV intrinsics under -O3, such as significantly reducing vsetvli, merging load/store operations, and unrolling loops. These optimizations may not always lighten the burden on the simulator (and can sometimes increase it), but they can significantly reduce the "number" of vector instructions or unnecessary "overhead" in certain cases.
Under -O2, the compiler might not optimize vector intrinsics as aggressively, leaving some redundant operations. This can result in better or worse performance depending on the generated instruction patterns.
In summary, interpreting RVV instructions is inherently expensive for simulators. If -O3 indeed "reduces" the instruction count or optimizes loops, it can outperform -O2. Conversely, if -O3 generates large and complex function inlining that worsens vector instruction arrangement, it may perform worse than -O2.
However, it is more likely that the way the RVV version is written has a significant impact on performance. Even on hardware or under high optimization levels, if RVV still performs worse than pure C, it is likely that the program structure, the use of intrinsics, or the compiler-generated code has considerable room for improvement.
QOY - The "Quite OK YCbCr420A" format for fast, lossless* image compression
QOI - The “Quite OK Image Format” for fast, lossless image compression
Simple RISC-V Vector example
RISC-V Vector Intrinsic Document