Vectorize Quite OK YCbCr420A
廖奕凱
GitHub
Objective
My task is to study the QOY format and leverage the RISC-V Vector (RVV) extension to accelerate its operations. The project covers the following aspects:
- Investigate the "Quite OK YCbCr420A" format and provide a concise comparison with QOI.
- Identify potential areas in QOY for vectorization and propose strategies to use the RISC-V Vector extension for performance improvement.
- Test the vectorized implementation using QEMU to ensure the QOY format functions as expected.
- Provide suggestions for further optimizing the performance of QOY.
What is QOY?
- QOY encodes and decodes images in a lossless format, as QOI does. But where QOI encodes RGBA, QOY encodes YCbCr 4:2:0 A (YCbCrA 4:2:0:4).
- QOY can work with both RGBA and YCbCrA pixel data. In case of RGBA data, it is converted to and from the YCbCrA colorspace on-the-fly, which is a lossy operation. QOY is only lossless when input and output are YCbCrA.
- QOY performance with RGBA pixel data is similar to QOI, and is about 1.5x (encoding) to 2.0x (decoding) faster when using YCbCrA pixel data.
- QOY encoded size is about half of QOI's, while the differences in decoded RGBA output are virtually indistinguishable.
- QOY's aim is to remain small, simple, and fast, while offering "quite ok" compression levels.
What is the RISC-V Vector Extension?
The RISC-V Vector Extension (RVV) is designed to enhance the RISC-V architecture with powerful vector computation capabilities, enabling efficient data-parallel processing for a wide range of applications such as high-performance computing, machine learning, and signal processing. RVV's flexible and scalable design allows it to cater to diverse hardware implementations and application requirements.
RVV Design Considerations
- Vector ISA Complexity:
Instruction Set Size
: Vector ISAs are typically large due to the need for vector equivalents of scalar instructions, specialized memory access operations, and vector manipulation instructions.
Predication Support
: Modern vector ISAs, including RVV, incorporate predication (masking) to enable conditional execution of vector elements.
Instruction Encoding
: The complexity of vector operations often exceeds the capacity of 32-bit instruction encoding, necessitating the use of CPU state registers to manage vector operations.
RVV Parameters and Basic State
- Vector Registers:
Number of Registers
: RVV defines 32 vector registers named v0 to v31.
Register Size (VLEN)
: Each vector register is VLEN bits wide, where VLEN is a power of two (e.g., 64, 128, 256, 512 bits). The exact VLEN is determined by the implementer.
Standard Constraints
: The Zv* standard extensions require VLEN to be at least 64 or 128 bits. For comparison, a register with VLEN = 512 is as wide as an Intel AVX-512 register.
- Vector Elements:
Element Size (ELEN)
: Elements within a vector are at least 8 bits and up to ELEN bits, where ELEN is also a power of two (8 ≤ ELEN ≤ VLEN).
Standard Constraints
: The Zv* extensions constrain ELEN to be at least 32 or 64 bits.
- Operational State Registers:
vtype (Vector Type Register)
: Describes the type of vector operation, including:
SEW (Standard Element Width)
: Size in bits of each vector element (8 ≤ SEW ≤ ELEN).
LMUL (Length Multiplier)
: Determines the grouping of vector registers, allowing multipliers of 1/8, 1/4, 1/2, 1, 2, 4, or 8.
vl (Vector Length Register)
: Specifies the number of elements to operate on, ranging from 0 to vlmax(SEW, LMUL), where:
- vlmax(SEW, LMUL) = (VLEN / SEW) × LMUL
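The vlmax formula above can be checked with a few lines of plain C. This is only a sketch: the function name and the choice to pass LMUL as a fraction (lmul_num / lmul_den) are my own assumptions, made so that fractional multipliers such as 1/2 stay in integer arithmetic.

```c
#include <stddef.h>

/* vlmax(SEW, LMUL) = (VLEN / SEW) * LMUL.
 * LMUL is passed as the fraction lmul_num / lmul_den (hypothetical
 * convention) so that 1/8, 1/4 and 1/2 need no floating point. */
static size_t vlmax(size_t vlen_bits, size_t sew_bits,
                    size_t lmul_num, size_t lmul_den)
{
    return (vlen_bits / sew_bits) * lmul_num / lmul_den;
}
```

With VLEN = 128, vlmax(128, 8, 1, 1) and vlmax(128, 32, 4, 1) both give 16 elements. Widening the element from 8 to 32 bits while raising LMUL from 1 to 4 keeps the element count unchanged, which is why the RVV code later in this report pairs u8m1 with u32m4.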
Key Features of RVV
- Scalability and Flexibility:
Variable Vector Length
: RVV's design allows different hardware implementations to support varying vector lengths, enhancing scalability.
Element Manipulation
: Supports operations for loading, storing, and manipulating vector elements, including non-contiguous memory accesses through scatter/gather operations.
- Predication and Masking:
Mask Registers
: Enable conditional execution of vector operations, allowing certain elements to be processed based on predicate conditions.
Predicate Operations
: Includes instructions for forming predicates through comparisons and other logical operations.
- Arithmetic and Logical Operations:
Comprehensive Instruction Set
: RVV includes a wide range of arithmetic (e.g., add, subtract, multiply, divide) and logical (e.g., AND, OR, NOT) instructions tailored for vector processing.
Vector-Specific Instructions
: Additional instructions are provided for tasks unique to vector processing, such as vector compression and decompression.
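To make the predication idea above concrete, here is a scalar C model of what a compare followed by a masked add does. It is not RVV code and the function names are hypothetical. It models the case where inactive elements keep their old destination value; real hardware can also be configured to treat those elements differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of forming a predicate: mask[i] = (a[i] > threshold). */
static void compare_gt_u8(uint8_t *mask, const uint8_t *a,
                          uint8_t threshold, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        mask[i] = (uint8_t)(a[i] > threshold);
    }
}

/* Scalar model of a masked add: only elements whose mask bit is set
 * are written; the others keep their previous dst value.
 * The 8-bit sum wraps around, like a non-saturating vector add. */
static void masked_add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                          const uint8_t *mask, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        if (mask[i]) {
            dst[i] = (uint8_t)(a[i] + b[i]);
        }
    }
}
```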
QOI (Quite OK Image) is a simple and efficient lossless image compression format designed to provide fast encoding and decoding speeds. The main features of QOI include:
- Lossless Compression: Ensures image quality is preserved without degradation.
- Simple Encoding Algorithm: Utilizes basic differential encoding and a hash table to store colors.
- Efficient Memory Usage: Minimizes memory allocation and access operations.
The encoding process of QOI primarily involves the following steps:
- Initialize a hash table with 64 entries to store previously encountered colors.
- Iterate through the image pixels, calculating the difference (R, G, B, A) between the current pixel and the previous one.
- Select an appropriate opcode to encode the data based on the difference values.
- Add a termination marker at the end of the file to indicate the end of the image.
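The 64-entry table from the first step is indexed by a small hash of the pixel. The sketch below uses the hash given in the QOI specification; the struct and function names are my own and only for illustration.

```c
#include <stdint.h>

/* Hypothetical pixel type for this sketch. */
typedef struct { uint8_t r, g, b, a; } rgba_t;

/* QOI index position: (r*3 + g*5 + b*7 + a*11) % 64. */
static unsigned qoi_hash_index(rgba_t px)
{
    return (px.r * 3u + px.g * 5u + px.b * 7u + px.a * 11u) % 64u;
}
```

Because the table has only 64 slots, different colors can collide; the encoder simply overwrites the slot, and the decoder rebuilds the same table as it goes.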
QOY (Quite OK YCbCr420A) is an extension of QOI, with the following notable features:
- YCbCr 4:2:0 Color Space: Converts the RGB color space to YCbCr 4:2:0, reducing chroma information and improving compression efficiency.
- Alpha Channel Support: Supports transparency, making it suitable for images with transparent backgrounds.
- Differential Encoding: Similar to QOI, it uses differential encoding to reduce redundant data.
- Multiple Compression Levels: Adopts different opcodes based on pixel difference ranges to balance compression ratio and encoding efficiency.
The encoding process of QOY resembles that of QOI but introduces enhancements for handling color space and alpha channels:
- Convert RGB pixels to the YCbCr 4:2:0 color space.
- Use differential encoding to compress information for Y, Cb, Cr, and the alpha channel.
- Employ various compression opcodes to represent differences in a range of values, thereby improving compression efficiency.
Environment setting
- Clone the riscv-gnu-toolchain repo from the official GitHub repo.
- Navigate to the riscv-gnu-toolchain folder, create a folder named build, enter the build folder, and configure the necessary options in the Makefile for the compilation process.
This command configures the build process for the RISC-V toolchain.
- --with-arch=rv32gcv: Targets the 32-bit RISC-V architecture with the general-purpose (G), compressed (C), and vector (V) extensions (RV32GCV).
- --with-abi=ilp32d: Sets the ABI (Application Binary Interface) to ilp32d, which uses 32-bit integers, longs, and pointers, with double-precision floating-point support.
- --enable-multilib: Enables support for multiple library variants (multilib), allowing the toolchain to generate code for multiple configurations of architecture and ABI.
- Build the RISC-V toolchain from the configuration set by the ../configure command, then build the QEMU emulator for RISC-V:
- Compiles the RISC-V GNU toolchain based on the configuration provided by the ../configure command. This step builds the compiler, assembler, linker, and related tools.
- Builds the QEMU emulator for RISC-V, allowing you to simulate and test RISC-V programs without requiring physical hardware.
Example for testing RVV
I referenced the rvv_example repository on GitHub to test whether my environment is functioning properly.
main.c
vec.S
makefile
Just type "make"
Expected output:
Output in my Ubuntu 22.04 ARM64:

The output matches the expected result, and it compiles successfully. Currently, I plan to follow a similar strategy as in the rvv_example to modify certain functions in the source code into RVV versions.
RGBA to YCbCrA Conversion Process in QOY
The following code, qoy_rgba_to_ycbcra_two_lines(), is one of the core blocks in the QOY project used to convert RGBA image data into the YCbCr 4:2:0 A format. It processes "1 or 2 lines" of pixels at a time and handles two pixels at once (each pixel's r, g, b, a).
1. Parameters and Initial Setup
- rgba_in: Points to the input buffer containing RGBA (or RGB) data.
- width: The width of the image line(s) to be processed.
- lines: Indicates whether to process 1 or 2 lines.
- If lines == 2, line2 points to the next line of pixels (line1 + width * channels_in).
- If lines == 1 (processing the boundary), line2 points back to line1, effectively duplicating the last line.
- channels_in, channels_out: Specify the number of channels in the input (3 or 4) and output data (3 or 4).
- At the start of the function, channels_in is forced to 3 or 4: any value other than 4 is treated as 3. channels_out is handled the same way.
- ycbcr420a_out: The output buffer where the converted YCbCrA (or YCbCr) blocks will be written.
- size_out: Specifies the output block size:
- 10 bytes: If the output includes alpha (4 channels), each 2×2 block writes 10 bytes.
- 6 bytes: If the output excludes alpha (3 channels), each 2×2 block writes 6 bytes.
In the YCbCr 4:2:0(A) format:
- Each 2×2 block contains 4 Y + 1 Cb + 1 Cr (= 6 bytes).
- Adding 4 alpha values results in a total of 10 bytes.
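The 6-byte and 10-byte figures follow directly from the block layout. The struct below is a hypothetical mirror of that layout, written from the description above and from the field names used later in this report (y, cb, cr, a); the real definition lives in qoy.h and may differ in detail.

```c
#include <stdint.h>

/* Hypothetical mirror of one 2x2 YCbCr 4:2:0 A block. */
typedef struct {
    uint8_t y[4];  /* one luma value per pixel            */
    uint8_t cb;    /* one blue-difference value per block */
    uint8_t cr;    /* one red-difference value per block  */
    uint8_t a[4];  /* one alpha value per pixel           */
} block420a_t;

/* Bytes written per block: the alpha tail is skipped for 3 channels. */
static int block_size_bytes(int channels_out)
{
    return (channels_out == 4) ? 10 : 6;
}
```

Four RGBA pixels occupy 16 bytes, so even before any entropy coding the 10-byte block is already smaller than its input.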
2. Obtaining Line Pointers (line1, line2) and Starting Output Pointer (out)
- If lines == 2, indicating the presence of a second line, line2 points to the next line of pixels.
- If lines == 1, line2 points back to line1, effectively duplicating the last line.
Loop:
- Processes 2 pixels at a time.
- Inside the loop, pointers are updated:
3. Retrieving Pixel Data from Line1 and Line2
- Each 2×2 block corresponds to four pixel pointers: p1, p2, p3, p4.
- p1 / p3: From line1 (same row, 2 pixels).
- p2 / p4: From line2 (next row, 2 pixels).
- For odd widths ((width & 0x01) == 1) and i == width - 1 (last pixel), line1 or line2 duplicates the last pixel to avoid out-of-bounds reads.
4. Calculating the Y Component
- This corresponds to an integer approximation of the JPEG color conversion formula Y = 0.299 R + 0.587 G + 0.114 B.
- The constants 1254097, 2462056, and 478151 are integer approximations of the coefficients 0.299, 0.587, and 0.114, respectively. Each coefficient is scaled by 2^22, and the sum is scaled back down with a right shift (>> 22).
- The values (y[0..3]) correspond to the four pixels in the 2×2 block.
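As a quick check of the fixed-point arithmetic, here is a scalar sketch of the luma computation using the three constants above. The function name is hypothetical.

```c
#include <stdint.h>

/* Fixed-point luma: the three constants are 0.299, 0.587 and 0.114
 * scaled by 2^22, so the weighted sum is shifted right by 22 bits. */
static uint8_t luma_fixed(uint8_t r, uint8_t g, uint8_t b)
{
    uint32_t sum = 1254097u * r + 2462056u * g + 478151u * b;
    return (uint8_t)(sum >> 22);
}
```

The three constants add up to exactly 4194304 (2^22), so the result for any 8-bit input already lies in 0..255 and the 32-bit intermediate cannot overflow.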
5. Calculating Blended Cb/Cr
- In YCbCr 4:2:0, Cb and Cr are stored once for each 2×2 block by averaging pixel values.
- r4, g4, and b4 are the sums of the R, G, and B components for the four pixels.
- The integer approximation formulas calculate Cb and Cr, followed by qoy_8bit_clamp() to ensure values remain within the 0–255 range.
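The chroma step can be sketched in scalar C as follows. The constants are the ones used in the implementation later in this report; the helper names are my own. The coefficients are scaled by 2^18, and the four-pixel sums contribute the remaining factor of 4, so a single shift by 20 bits both averages the block and removes the scaling.

```c
#include <stdint.h>

/* Clamp an int to the 0..255 range. */
static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* r4, g4, b4 are the sums of R, G, B over the four pixels of a block.
 * 134217728 is 128 << 20 (the chroma offset) and 1 << 19 rounds to
 * nearest before the shift. */
static void chroma_from_sums(int r4, int g4, int b4,
                             uint8_t *cb, uint8_t *cr)
{
    int cbv = 134217728 - 44233 * r4 - 86839 * g4 + (b4 << 17) + (1 << 19);
    int crv = 134217728 + (r4 << 17) - 109757 * g4 - 21315 * b4 + (1 << 19);
    *cb = clamp_u8(cbv >> 20);
    *cr = clamp_u8(crv >> 20);
}
```

For a uniformly gray block the color terms cancel and both outputs are 128. A pure red block gives an unclamped Cr of 256, which shows why the clamp is needed.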
6. Handling the Alpha Channel
- If the output requires alpha (channels_out == 4), check if the input also has alpha:
- If yes (channels_in == 4), copy the corresponding a values from p1, p2, p3, and p4.
- If not (3-channel RGB), set the alpha values to 0xff (fully opaque).
7. Updating the Counters After Each Block
- size_out is the number of bytes written for this 2×2 block:
- 6 bytes: Without alpha.
- 10 bytes: With alpha.
- The loop advances with i += 2 and stops once i reaches or exceeds width.
8. Returning the Total Written Bytes
- Returns the total number of bytes written, allowing the caller to know how much data was processed.
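Since each pair of columns and each pair of rows produces exactly one block, the total byte count can be predicted from the image size. The helper below is a hypothetical sketch derived from the loop structure described above; it is not part of qoy.h.

```c
/* Expected output size: one block per 2x2 tile, with odd widths and
 * heights rounded up because the last column or row is duplicated. */
static int expected_output_bytes(int width, int height, int channels_out)
{
    int blocks_x = (width + 1) / 2;
    int blocks_y = (height + 1) / 2;
    int block_size = (channels_out == 4) ? 10 : 6;
    return blocks_x * blocks_y * block_size;
}
```

Comparing this value against the function's return value is a cheap sanity check before comparing the output bytes themselves.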
The current strategy involves rewriting qoy_rgba_to_ycbcra and qoy_ycbcra_to_rgba. These functions process a large number of pixels in batches (color space conversion for RGBA, block-based 4:2:0 processing, clamping), which is similar to the vec_len logic from the earlier rvv_example. Both perform extensive per-pixel or per-block calculations.
Vector instructions can replace the intensive calculations inside these functions (addition, multiplication, shifting, clamping, etc.) so that multiple pixels are processed at once. The output is then validated on QEMU to ensure it matches the original C code.
RVV intrinsic
RISC-V Vector (RVV) intrinsics provide a straightforward way to use RVV instructions directly in C/C++ without requiring assembly knowledge. Intrinsics are low-level functions defined by the compiler, offering a nearly one-to-one mapping with RVV instructions, allowing programmers to leverage vector operations in a high-level language.
Key Points:
Intrinsic: A low-level function defined by the compiler to expose individual instructions to a higher-level language.
Benefits: Simplifies low-level RVV programming without requiring in-depth knowledge of assembly.
Example: To perform vector addition (vadd.vv), the intrinsic function for 32-bit integer vectors (i32) in one vector register group (m1) is __riscv_vadd_vv_i32m1, which takes two vint32m1_t operands and the vector length vl.
This structured approach bridges low-level hardware instructions with high-level programming, making RVV accessible and efficient.
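As an illustration of how such an intrinsic is used, the sketch below adds two int32_t arrays. The function name add_i32 is hypothetical. The RVV path is only compiled when the compiler defines __riscv_vector (that is, when targeting something like rv32gcv); otherwise a scalar fallback is used so the same file also builds on a host machine.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

/* dst[i] = a[i] + b[i] for i in [0, n). */
static void add_i32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
#if defined(__riscv_vector)
    while (n > 0) {
        /* Ask the hardware how many 32-bit elements it takes this round. */
        size_t vl = __riscv_vsetvl_e32m1(n);
        vint32m1_t va = __riscv_vle32_v_i32m1(a, vl);
        vint32m1_t vb = __riscv_vle32_v_i32m1(b, vl);
        vint32m1_t vsum = __riscv_vadd_vv_i32m1(va, vb, vl);
        __riscv_vse32_v_i32m1(dst, vsum, vl);
        a += vl;
        b += vl;
        dst += vl;
        n -= vl;
    }
#else
    /* Scalar fallback for targets without the V extension. */
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
#endif
}
```

The intrinsic name encodes the instruction (vadd), the operand form (vv), the element type (i32), and the register grouping (m1), which is the naming pattern used throughout the rest of this report.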
Implementing qoy_rgba_to_ycbcra_rvv
int qoy_rgba_to_ycbcra_rvv(
    const void* rgba_in,
    int width,
    int height,
    int channels_in,
    int channels_out,
    void *ycbcr420a_out
){
    /* Anything other than 4 channels is treated as 3 (no alpha). */
    if (channels_in != 4) channels_in = 3;
    if (channels_out != 4) channels_out = 3;
    const uint8_t* src = (const uint8_t*)rgba_in;
    uint8_t* dst = (uint8_t*)ycbcr420a_out;
    int block_size = (channels_out == 4) ? 10 : 6;
    int written = 0;

    for(int y = 0; y < height; y += 2){
        /* The last row of an odd-height image is processed on its own. */
        int lineCount = 2;
        if((y == height - 1) && (height & 1)) lineCount = 1;

        /* (A) Channel separation: de-interleave both rows into planar buffers. */
        uint8_t* R1 = (uint8_t*)malloc(width);
        uint8_t* G1 = (uint8_t*)malloc(width);
        uint8_t* B1 = (uint8_t*)malloc(width);
        uint8_t* A1 = (uint8_t*)malloc(width);
        uint8_t* R2 = (uint8_t*)malloc(width);
        uint8_t* G2 = (uint8_t*)malloc(width);
        uint8_t* B2 = (uint8_t*)malloc(width);
        uint8_t* A2 = (uint8_t*)malloc(width);
        const uint8_t* line1 = src + (y * width * channels_in);
        const uint8_t* line2 = (lineCount == 2) ? (line1 + width * channels_in) : line1;
        for(int x = 0; x < width; x++){
            const uint8_t* p1 = line1 + x * channels_in;
            R1[x] = p1[0];
            G1[x] = p1[1];
            B1[x] = p1[2];
            A1[x] = (channels_in == 4) ? p1[3] : 0xff;
            if(lineCount == 2){
                const uint8_t* p2 = line2 + x * channels_in;
                R2[x] = p2[0];
                G2[x] = p2[1];
                B2[x] = p2[2];
                A2[x] = (channels_in == 4) ? p2[3] : 0xff;
            } else {
                /* Single line: the second row duplicates the first. */
                R2[x] = R1[x];
                G2[x] = G1[x];
                B2[x] = B1[x];
                A2[x] = A1[x];
            }
        }

        /* (B) Y computation with RVV, vl pixels per iteration. */
        uint8_t* Y1 = (uint8_t*)malloc(width);
        uint8_t* Y2 = (uint8_t*)malloc(width);
        int idx = 0;
        while(idx < width){
            size_t vl = __riscv_vsetvl_e8m1(width - idx);
            /* Load one chunk of each channel of the first row. */
            vuint8m1_t vr_in = __riscv_vle8_v_u8m1(&R1[idx], vl);
            vuint8m1_t vg_in = __riscv_vle8_v_u8m1(&G1[idx], vl);
            vuint8m1_t vb_in = __riscv_vle8_v_u8m1(&B1[idx], vl);
            /* Zero-extend 8 -> 16 -> 32 bits so the products fit. */
            vuint16m2_t vr_temp = __riscv_vwcvtu_x_x_v_u16m2(vr_in, vl);
            vuint16m2_t vg_temp = __riscv_vwcvtu_x_x_v_u16m2(vg_in, vl);
            vuint16m2_t vb_temp = __riscv_vwcvtu_x_x_v_u16m2(vb_in, vl);
            vuint32m4_t vr = __riscv_vwcvtu_x_x_v_u32m4(vr_temp, vl);
            vuint32m4_t vg = __riscv_vwcvtu_x_x_v_u32m4(vg_temp, vl);
            vuint32m4_t vb = __riscv_vwcvtu_x_x_v_u32m4(vb_temp, vl);
            /* Weighted sum, then scale down by 2^22. */
            vuint32m4_t c_r = __riscv_vmul_vx_u32m4(vr, 1254097, vl);
            vuint32m4_t c_g = __riscv_vmul_vx_u32m4(vg, 2462056, vl);
            vuint32m4_t c_b = __riscv_vmul_vx_u32m4(vb, 478151, vl);
            vuint32m4_t ysum = __riscv_vadd_vv_u32m4(c_r, c_g, vl);
            ysum = __riscv_vadd_vv_u32m4(ysum, c_b, vl);
            ysum = __riscv_vsrl_vx_u32m4(ysum, 22, vl);
            /* Clamp to 0..255 (the lower bound is a no-op for unsigned values). */
            ysum = __riscv_vmaxu_vx_u32m4(ysum, 0, vl);
            ysum = __riscv_vminu_vx_u32m4(ysum, 255, vl);
            /* Narrow 32 -> 16 -> 8 bits and store. */
            vuint16m2_t ysum_16 = __riscv_vnclipu_wx_u16m2(ysum, 0, 0, vl);
            vuint8m1_t vy = __riscv_vnclipu_wx_u8m1(ysum_16, 0, 0, vl);
            __riscv_vse8_v_u8m1(&Y1[idx], vy, vl);
            if(lineCount == 2){
                /* Same sequence for the second row. */
                vuint8m1_t vr2_in = __riscv_vle8_v_u8m1(&R2[idx], vl);
                vuint8m1_t vg2_in = __riscv_vle8_v_u8m1(&G2[idx], vl);
                vuint8m1_t vb2_in = __riscv_vle8_v_u8m1(&B2[idx], vl);
                vuint16m2_t vr2_temp = __riscv_vwcvtu_x_x_v_u16m2(vr2_in, vl);
                vuint16m2_t vg2_temp = __riscv_vwcvtu_x_x_v_u16m2(vg2_in, vl);
                vuint16m2_t vb2_temp = __riscv_vwcvtu_x_x_v_u16m2(vb2_in, vl);
                vuint32m4_t vr2 = __riscv_vwcvtu_x_x_v_u32m4(vr2_temp, vl);
                vuint32m4_t vg2 = __riscv_vwcvtu_x_x_v_u32m4(vg2_temp, vl);
                vuint32m4_t vb2 = __riscv_vwcvtu_x_x_v_u32m4(vb2_temp, vl);
                vuint32m4_t c_r2 = __riscv_vmul_vx_u32m4(vr2, 1254097, vl);
                vuint32m4_t c_g2 = __riscv_vmul_vx_u32m4(vg2, 2462056, vl);
                vuint32m4_t c_b2 = __riscv_vmul_vx_u32m4(vb2, 478151, vl);
                vuint32m4_t ysum2 = __riscv_vadd_vv_u32m4(c_r2, c_g2, vl);
                ysum2 = __riscv_vadd_vv_u32m4(ysum2, c_b2, vl);
                ysum2 = __riscv_vsrl_vx_u32m4(ysum2, 22, vl);
                ysum2 = __riscv_vmaxu_vx_u32m4(ysum2, 0, vl);
                ysum2 = __riscv_vminu_vx_u32m4(ysum2, 255, vl);
                vuint16m2_t ysum2_16 = __riscv_vnclipu_wx_u16m2(ysum2, 0, 0, vl);
                vuint8m1_t vy2 = __riscv_vnclipu_wx_u8m1(ysum2_16, 0, 0, vl);
                __riscv_vse8_v_u8m1(&Y2[idx], vy2, vl);
            } else {
                __riscv_vse8_v_u8m1(&Y2[idx], vy, vl);
            }
            idx += vl;
        }

        /* (C) Per 2x2 block: sum the channels, compute Cb/Cr, write the block. */
        for(int x = 0; x < width; x += 2){
            /* For odd widths the last column is duplicated. */
            int x2 = ((x + 1) < width ? x + 1 : x);
            int r4 = R1[x] + R1[x2] + R2[x] + R2[x2];
            int g4 = G1[x] + G1[x2] + G2[x] + G2[x2];
            int b4 = B1[x] + B1[x2] + B2[x] + B2[x2];
            int cb = 134217728 - 44233 * r4 - 86839 * g4 + (b4 << 17) + (1 << 19);
            cb >>= 20;
            if(cb < 0) cb = 0;
            else if(cb > 255) cb = 255;
            int cr = 134217728 + (r4 << 17) - 109757 * g4 - 21315 * b4 + (1 << 19);
            cr >>= 20;
            if(cr < 0) cr = 0;
            else if(cr > 255) cr = 255;
            uint8_t y0 = Y1[x], y1 = Y1[x2], y2 = Y2[x], y3 = Y2[x2];
            uint8_t a0 = A1[x], a1 = A1[x2], a2 = A2[x], a3 = A2[x2];
            qoy_ycbcr420a_t* pout = (qoy_ycbcr420a_t*)dst;
            pout->y[0] = y0;
            pout->y[1] = y1;
            pout->y[2] = y2;
            pout->y[3] = y3;
            pout->cb = cb;
            pout->cr = cr;
            if(channels_out == 4){
                pout->a[0] = a0;
                pout->a[1] = a1;
                pout->a[2] = a2;
                pout->a[3] = a3;
            }
            dst += block_size;
            written += block_size;
        }
        free(R1); free(G1); free(B1); free(A1);
        free(R2); free(G2); free(B2); free(A2);
        free(Y1); free(Y2);
    }
    return written;
}
The function qoy_rgba_to_ycbcra_rvv converts an RGBA image to YCbCrA format, potentially reducing the number of channels based on input parameters. The conversion process is divided into three main sections:
- Channel Separation (Section A): Separates the RGBA channels into individual R, G, B, and A buffers.
- Y Component Computation Using RVV (Section B): Calculates the Y (luma) component using RISC-V Vector Extensions (RVV) for parallel processing.
- Block-Based Sum and Cb/Cr Computation (Section C): Aggregates pixel values to compute the Cb and Cr (chrominance) components and assembles the final output.
Section B: RVV Computation of Y
This is where RVV plays a crucial role in accelerating the computation of the Y component. Here's a step-by-step breakdown:
1. Vector Length Configuration: Sets the vector length (vl) based on the remaining width. This allows dynamic adjustment to handle the remaining pixels that might not fit exactly into vector registers.
2. Loading R, G, B Channels: Loads chunks of R, G, and B data into vector registers. Each vector can hold multiple pixel values, enabling parallel processing.
3. Zero-Extension of Data Types:
- First Step: Extends the 8-bit unsigned integers to 16-bit to prevent overflow during multiplication.
- Second Step: Further extends the 16-bit integers to 32-bit to accommodate the results of the multiplication operations.
4. Multiplication with Constants: Multiplies each channel by specific constants that are part of the YCbCr conversion formula. These operations are performed in parallel for multiple pixels.
5. Summation and Shifting: Sums the multiplied values and shifts right by 22 bits to scale down the result appropriately, as per the YCbCr conversion formula.
6. Clamping: Ensures that the Y values are within the 0-255 range to fit into an 8-bit unsigned integer.
7. Narrowing to 8-bit and Storing Results:
- First Narrowing Step: Converts the 32-bit results back to 16-bit.
- Second Narrowing Step: Further narrows the 16-bit values to 8-bit.
- Storing: Writes the final 8-bit Y values back to the Y1 buffer.
8. Processing the Second Line (if applicable): If lineCount == 2, the same RVV operations are performed on the second line (R2, G2, B2) to compute Y2.
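The loop structure in Section B (ask for a vector length, process that many pixels, advance) can be modeled in plain C without any RVV hardware. In the sketch below, SIM_VLMAX and sim_vsetvl are hypothetical stand-ins: the model always grants min(remaining, VLMAX), which is a simplification of what a real vsetvl may return.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical VLMAX: VLEN = 128, SEW = 8, LMUL = 1. */
#define SIM_VLMAX 16

/* Simplified model of vsetvl: grant at most SIM_VLMAX elements. */
static size_t sim_vsetvl(size_t remaining)
{
    return remaining < SIM_VLMAX ? remaining : SIM_VLMAX;
}

/* Strip-mined luma for one row of planar R, G, B data. */
static void luma_row_stripmined(const uint8_t *r, const uint8_t *g,
                                const uint8_t *b, uint8_t *y, size_t width)
{
    size_t idx = 0;
    while (idx < width) {
        size_t vl = sim_vsetvl(width - idx);
        /* This inner loop stands in for one sequence of vector
         * instructions operating on vl elements at once. */
        for (size_t i = 0; i < vl; i++) {
            uint32_t sum = 1254097u * r[idx + i]
                         + 2462056u * g[idx + i]
                         + 478151u * b[idx + i];
            sum >>= 22;
            if (sum > 255u) sum = 255u;
            y[idx + i] = (uint8_t)sum;
        }
        idx += vl;
    }
}
```

Because the last iteration simply receives a smaller vl, no separate scalar tail loop is needed for widths that are not a multiple of the vector length.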
Comparison of pure C and RVV intrinsic version
1. Logic Overview in Pure C
Function structure:
The outer function (qoy_rgba_to_ycbcra) iterates through the height with a step of y += 2, processing two lines at a time (or just one line if only one line is left). The rgba_in pointer passed to qoy_rgba_to_ycbcra_two_lines corresponds to the starting address of the two lines. Similarly, the pout pointer moves forward as blocks are processed.
1.1 Logic of qoy_rgba_to_ycbcra_two_lines
Each iteration processes two pixels at a time (i += 2, a left-right pair). Both line1 and line2 pointers move forward by 2 * channels_in in each loop, pointing to the starting address of the next two pixels.
If the width is odd, then when i == width-1, p3 and p4 duplicate p1 and p2 (preventing out-of-bounds reads). This ensures that data is accessed safely while the pointers advance through the lines two pixels at a time.
2. Logic Overview in the RVV Intrinsic Version
2.1 Channel Separation (Part (A))
This step separates all pixels in a row (from x = 0 to x = width-1) into R1[], G1[], B1[], and A1[]. If lineCount = 2, line2 is also separated into R2[], G2[], B2[], and A2[].
Result:
- R1[i], G1[i] correspond to the i-th pixel in the top row.
- R2[i], G2[i] correspond to the i-th pixel in the bottom row.
2.2 Compute Y Using RVV (Part (B))
Process R1[idx..idx+vl-1], G1[idx..idx+vl-1], and B1[idx..idx+vl-1] using RVV instructions to calculate Y1[]. Similarly, compute Y2[] for the bottom row (R2, G2, B2).
2.3 Assemble Y, Cb, Cr, and A (Part (C))
This step processes two pixels at a time (x += 2). For odd-width cases, when x == width-1, x2 = x ensures the last pixel is repeated, avoiding out-of-bounds access.
The final block is written to dst, and the number of bytes written (written) is updated.
test.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#define QOY_IMPLEMENTATION
#include "qoy.h"

/* Fill the buffer with a deterministic gradient so results are reproducible. */
void generate_test_data(unsigned char *data, int width, int height, int channels) {
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int idx = (y * width + x) * channels;
            data[idx] = x % 256;
            data[idx + 1] = y % 256;
            data[idx + 2] = (x + y) % 256;
            if (channels == 4) {
                data[idx + 3] = 255;
            }
        }
    }
}

/* Compare the two outputs block by block and report each result. */
void compare_outputs(const uint8_t *ref_out, const uint8_t *rvv_out, int num_bytes, int block_size) {
    printf("Comparing Outputs...\n");
    int num_blocks = num_bytes / block_size;
    for (int block = 0; block < num_blocks; block++) {
        int match = 1;
        for (int i = 0; i < block_size; i++) {
            if (ref_out[block * block_size + i] != rvv_out[block * block_size + i]) {
                match = 0;
                break;
            }
        }
        if (match) {
            printf("Block %d: match\n", block);
        } else {
            printf("Block %d: mismatch\n", block);
        }
    }
}

/* Run the reference and RVV conversions on the same input and compare. */
void run_test(int width, int height, int channels_in, int channels_out) {
    printf("Testing with Width=%d, Height=%d, Channels=%d->%d\n", width, height, channels_in, channels_out);
    int block_size = (channels_out == 4) ? 10 : 6;
    uint8_t *rgba_input = (uint8_t *)malloc(width * height * channels_in);
    uint8_t *ref_output = (uint8_t *)malloc(width * height * block_size / 2);
    uint8_t *rvv_output = (uint8_t *)malloc(width * height * block_size / 2);
    generate_test_data(rgba_input, width, height, channels_in);
    int ref_bytes = qoy_rgba_to_ycbcra(rgba_input, width, height, channels_in, channels_out, ref_output);
    int rvv_bytes = qoy_rgba_to_ycbcra_rvv(rgba_input, width, height, channels_in, channels_out, rvv_output);
    if (ref_bytes != rvv_bytes) {
        printf("Error: Output sizes differ! Ref=%d bytes, RVV=%d bytes\n", ref_bytes, rvv_bytes);
        free(rgba_input);
        free(ref_output);
        free(rvv_output);
        return;
    }
    compare_outputs(ref_output, rvv_output, ref_bytes, block_size);
    free(rgba_input);
    free(ref_output);
    free(rvv_output);
}

int main(void) {
    printf("Running QOY Conversion Tests with Various Sizes...\n");
    /* Even and odd widths and heights. */
    int test_sizes[][2] = {
        {4, 4},
        {5, 4},
        {4, 5},
        {5, 5},
    };
    int num_tests = sizeof(test_sizes) / sizeof(test_sizes[0]);
    for (int i = 0; i < num_tests; i++) {
        int width = test_sizes[i][0];
        int height = test_sizes[i][1];
        run_test(width, height, 4, 4);
    }
    printf("All Tests Completed.\n");
    return 0;
}
1. Compilation Command
riscv32-unknown-linux-gnu-gcc test.c -std=gnu99 -march=rv32gcv -mabi=ilp32d -O0 -lpng -lz -o test.out
2. Execution Command
qemu-riscv32 -L $HOME/riscv-gnu-toolchain/build_linux/sysroot ./qoy_rvvintrinsic/test.out
3. Output
I have not yet been able to identify the remaining issue, so for now the code only handles images with even widths. I will therefore use such images to test whether the image conversion becomes more efficient.
Benchmark
Before proceeding with further testing, I confirmed that both versions of the conversion produce identical output for the images in the folder.
Simple benchmark suite for qoy
Requires libpng, "stb_image.h" and "stb_image_write.h", "qoi.h"
- Set Up the Sysroot Environment
The sysroot is a directory that mimics the root filesystem of the target architecture (RISC-V in this case). It contains all the necessary headers and libraries required for cross-compilation.
a. Locate the Sysroot
If you built the toolchain as shown above, the sysroot is typically located at $HOME/riscv32/sysroot. If not, you might need to specify or create one.
b. Prepare the Sysroot with Necessary Libraries
You need to ensure that libpng and zlib are available in the RISC-V sysroot. Here's how to do it:
i. Install Dependencies for Building Libraries
ii. Cross-Compile zlib for RISC-V
- Download zlib Source Code
- Configure and Build
iii. Cross-Compile libpng for RISC-V
- Download libpng Source Code
- Configure and Build
benchmark.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <dirent.h>
#define STB_IMAGE_IMPLEMENTATION
#define STBI_ONLY_PNG
#include "stb_image.h"
#define QOY_IMPLEMENTATION
#include "qoy.h"
#if defined(__APPLE__)
#include <mach/mach_time.h>
#elif defined(__linux__)
#include <time.h>
#elif defined(_WIN32)
#include <windows.h>
#endif

/* Monotonic timestamp in nanoseconds. */
static uint64_t ns(void) {
#if defined(__APPLE__)
    static mach_timebase_info_data_t info;
    static int init=0;
    if(!init) {
        mach_timebase_info(&info);
        init=1;
    }
    uint64_t now = mach_absolute_time();
    now = now * info.numer / info.denom;
    return now;
#elif defined(__linux__)
    struct timespec spec;
    clock_gettime(CLOCK_MONOTONIC, &spec);
    return (uint64_t)spec.tv_sec * 1000000000ULL + (uint64_t)spec.tv_nsec;
#elif defined(_WIN32)
    static LARGE_INTEGER freq;
    static int init=0;
    if(!init){
        QueryPerformanceFrequency(&freq);
        init=1;
    }
    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    return (uint64_t)(1000000000ULL* now.QuadPart / freq.QuadPart);
#else
    return (uint64_t)clock();
#endif
}

/* Totals accumulated across all benchmarked images. */
static uint64_t g_sum_c_time_ns = 0;
static uint64_t g_sum_rvv_time_ns = 0;
static int g_image_count = 0;
static int g_runs = 1;

/* Load one PNG as RGBA and time both conversions over `runs` iterations. */
static void benchmark_image(const char* path, int runs) {
    int width, height, comp;
    unsigned char* rgba = stbi_load(path, &width, &height, &comp, 4);
    if(!rgba) {
        printf("Error: failed to load PNG: %s\n", path);
        return;
    }
    printf("[File] %s => %dx%d, forced RGBA=4\n", path, width, height);
    /* Output always includes alpha here, so each 2x2 block takes 10 bytes. */
    int block_size = 10;
    /* Round the width up so odd widths still get room for the last block. */
    int outbuf_size = ((width + 1) >> 1) * height * block_size;
    unsigned char* out_c = (unsigned char*)malloc(outbuf_size);
    unsigned char* out_rvv = (unsigned char*)malloc(outbuf_size);
    /* Warm-up pass, not timed. */
    qoy_rgba_to_ycbcra(rgba, width, height, 4, 4, out_c);
    qoy_rgba_to_ycbcra_rvv(rgba, width, height, 4, 4, out_rvv);
    uint64_t sum_c=0, sum_rvv=0;
    for(int i=0; i<runs; i++){
        memset(out_c, 0, outbuf_size);
        uint64_t t0 = ns();
        qoy_rgba_to_ycbcra(rgba, width, height, 4, 4, out_c);
        uint64_t t1 = ns();
        sum_c += (t1 - t0);
        memset(out_rvv, 0, outbuf_size);
        uint64_t t2 = ns();
        qoy_rgba_to_ycbcra_rvv(rgba, width, height, 4, 4, out_rvv);
        uint64_t t3 = ns();
        sum_rvv += (t3 - t2);
    }
    double avg_c = (double)sum_c / runs / 1.0e6;
    double avg_rvv = (double)sum_rvv / runs / 1.0e6;
    printf("Runs=%d | C=%.3f ms, RVV=%.3f ms\n", runs, avg_c, avg_rvv);
    g_sum_c_time_ns += sum_c;
    g_sum_rvv_time_ns += sum_rvv;
    g_image_count++;
    free(rgba);
    free(out_c);
    free(out_rvv);
}

/* Benchmark every .png file directly inside dirpath. */
static void benchmark_directory(const char* dirpath, int runs) {
    DIR* dp = opendir(dirpath);
    if(!dp) {
        printf("Could not open directory: %s\n", dirpath);
        return;
    }
    struct dirent* ent;
    while((ent = readdir(dp)) != NULL) {
        if(!strcmp(ent->d_name, ".") || !strcmp(ent->d_name, "..")) {
            continue;
        }
        char filepath[1024];
        snprintf(filepath, sizeof(filepath), "%s/%s", dirpath, ent->d_name);
        size_t len = strlen(ent->d_name);
        if(len>4 && strcmp(ent->d_name + (len-4), ".png")==0) {
            benchmark_image(filepath, runs);
        }
    }
    closedir(dp);
}

int main(int argc, char** argv) {
    if(argc < 3) {
        printf("Usage: %s <file_or_directory> <runs>\n", argv[0]);
        return 0;
    }
    const char* input_path = argv[1];
    g_runs = atoi(argv[2]);
    if(g_runs <= 0) g_runs=1;
    struct stat st;
    if(stat(input_path, &st)==0) {
        if(S_ISDIR(st.st_mode)) {
            benchmark_directory(input_path, g_runs);
        }
        else if(S_ISREG(st.st_mode)) {
            benchmark_image(input_path, g_runs);
        }
        else {
            printf("Input path is neither file nor directory???\n");
        }
    } else {
        printf("Cannot stat: %s\n", input_path);
    }
    if(g_image_count>0) {
        /* Average per conversion call, in milliseconds. */
        double avg_c = (double)g_sum_c_time_ns / (double)(g_image_count*g_runs);
        double avg_rvv = (double)g_sum_rvv_time_ns / (double)(g_image_count*g_runs);
        avg_c /= 1.0e6;
        avg_rvv /= 1.0e6;
        printf("===== Global Average across %d PNG(s) =====\n", g_image_count);
        printf("C version: %.3f ms\n", avg_c);
        printf("RVV version: %.3f ms\n", avg_rvv);
    } else {
        printf("No PNG files were processed.\n");
    }
    return 0;
}
1. Compilation Command
riscv32-unknown-linux-gnu-gcc benchmark.c -std=gnu99 -march=rv32gcv -mabi=ilp32d -O0 -lpng -lz -lm -o benchmark.out
2. Execution Command
qemu-riscv32 -L $HOME/riscv-gnu-toolchain/build_linux/sysroot ./qoy_rvvintrinsic/benchmark.out ./qoi_benchmark_suite/images/textures_pk 1
Result
- icon_512
At -O0 the compiler performs essentially no optimization. The pure C code is straightforward and tolerates this reasonably well, whereas RVV intrinsic code usually needs a higher optimization level (-O2 or -O3) before its instruction sequences and register allocation become efficient.
It is recommended to use at least -O2 for RVV code.
Under -O0, the compiler does not merge vector operations or eliminate redundant ones, leading to excessive and unnecessary load/store operations and VL setting overhead.
optimization level : -O2
===== Global Average across 213 PNG(s) =====
C version: 2.162 ms
RVV version: 12.286 ms
optimization level : -O3
===== Global Average across 213 PNG(s) =====
C version: 63.954 ms
RVV version: 13.177 ms
- textures_pk
optimization level : -O0
optimization level : -O2
optimization level : -O3
Analysis
- Pure C Generated with -O3
Under -O3, the compiler often performs automatic vectorization, more aggressive function inlining, and loop unrolling, which can significantly change the instruction count and the memory access patterns.
Under an emulator, longer or more complex RISC-V instruction sequences take more work to translate and execute, so the -O3 build can run slower than the -O2 (or even the -O0) build.
This does not mean it would be slower on actual hardware. On a real CPU, a longer but more efficient instruction sequence is likely to be faster than the -O2 code. QEMU timings reflect emulation cost rather than hardware timing, so the emulator can show worse results for the -O3 version of the pure C code.
- RVV Under -O3
The compiler applies the most aggressive optimizations to code containing RVV intrinsics under -O3, such as significantly reducing vsetvli, merging load/store operations, and unrolling loops. These optimizations may not always lighten the burden on the simulator (and can sometimes increase it), but they can significantly reduce the "number" of vector instructions or unnecessary "overhead" in certain cases.
Under -O2, the compiler might not optimize vector intrinsics as aggressively, leaving some redundant operations. This can result in better or worse performance depending on the generated instruction patterns.
In summary, emulating RVV instructions is inherently expensive. If -O3 does reduce the instruction count or tighten the loops, it can outperform -O2. Conversely, if -O3 inlines so aggressively that the arrangement of vector instructions gets worse, it may perform worse than -O2.
However, it is more likely that the way the RVV version is written has a significant impact on performance. Even on hardware or under high optimization levels, if RVV still performs worse than pure C, it is likely that the program structure, the use of intrinsics, or the compiler-generated code has considerable room for improvement.
Reference
QOY - The "Quite OK YCbCr420A" format for fast, lossless* image compression
QOI - The “Quite OK Image Format” for fast, lossless image compression
Simple RISC-V Vector example
RISC-V Vector Intrinsic Document