Vectorize Quite OK YCbCr420A

廖奕凱

Objective

My task is to study the QOY format and leverage the RISC-V Vector (RVV) extension to accelerate its operations. The project covers the following aspects:

Investigate the "Quite OK YCbCr420A" format and provide a concise comparison with QOI.
Identify potential areas in QOY for vectorization and propose strategies to use the RISC-V Vector extension for performance improvement.
Test the vectorized implementation using QEMU to ensure the QOY format functions as expected.
Provide suggestions for further optimizing the performance of QOY.

What is QOY?

QOY encodes and decodes images in a lossless format, as QOI does. But where QOI encodes RGBA, QOY encodes YCbCr 4:2:0 A (YCbCrA 4:2:0:4).
QOY can work with both RGBA and YCbCrA pixel data. In case of RGBA data, it is converted to and from the YCbCrA colorspace on-the-fly, which is a lossy operation. QOY is only lossless when input and output are YCbCrA.
QOY performance with RGBA pixel data is similar to QOI, and is about 1.5x (encoding) to 2.0x (decoding) faster when using YCbCrA pixel data.
QOY encoded size is about half of QOI's, while the differences in decoded RGBA output are virtually indistinguishable.
QOY's aim is to remain small, simple, and fast, while offering "quite ok" compression levels.

What is the RISC-V Vector Extension?

The RISC-V Vector Extension (RVV) is designed to enhance the RISC-V architecture with powerful vector computation capabilities, enabling efficient data-parallel processing for a wide range of applications such as high-performance computing, machine learning, and signal processing. RVV's flexible and scalable design allows it to cater to diverse hardware implementations and application requirements.

RVV Design Considerations

Vector ISA Complexity:
- Instruction Set Size: Vector ISAs are typically large due to the need for vector equivalents of scalar instructions, specialized memory access operations, and vector manipulation instructions.
- Predication Support: Modern vector ISAs, including RVV, incorporate predication (masking) to enable conditional execution of vector elements.
- Instruction Encoding: The full description of a vector operation often does not fit in a 32-bit instruction encoding, so RVV keeps part of the configuration (element width, register grouping, active length) in state registers such as vtype and vl instead of in each instruction.

RVV Parameters and Basic State

Vector Registers:
- Number of Registers: RVV defines 32 vector registers named v0 to v31.
- Register Size (VLEN): Each vector register is VLEN bits wide, where VLEN is a power of two (e.g., 64, 128, 256, 512 bits). The exact VLEN is determined by the implementer.
- Standard Constraints: The standard extensions set a minimum VLEN: 128 bits for the full V extension, and as little as 32 or 64 bits for the embedded Zve* subsets. For comparison, a register with VLEN = 512 is as wide as an Intel AVX-512 register.
Vector Elements:
- Element Size (ELEN): Elements within a vector are at least 8 bits and up to ELEN bits, where ELEN is also a power of two (8 ≤ ELEN ≤ VLEN).
- Standard Constraints: The Zv* extensions constrain ELEN to be at least 32 or 64 bits.
Operational State Registers:
- vtype (Vector Type Register): Describes the type of vector operation, including:
  - SEW (Standard Element Width): Size in bits of each vector element (8 ≤ SEW ≤ ELEN).
  - LMUL (Length Multiplier): Determines the grouping of vector registers, allowing multipliers of 1/8, 1/4, 1/2, 1, 2, 4, or 8.
- vl (Vector Length Register): Specifies the number of elements to operate on, ranging from 0 to vlmax(SEW, LMUL), where:
  - vlmax(SEW, LMUL) = (VLEN / SEW) × LMUL
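
To make the vlmax formula concrete, here is a minimal sketch that evaluates it in plain C. The function name rvv_vlmax and the choice to pass LMUL as a numerator/denominator pair are assumptions made for this illustration; they are not part of any RVV API.

```c
#include <assert.h>
#include <stddef.h>

/* vlmax(SEW, LMUL) = (VLEN / SEW) * LMUL, with all widths in bits.
 * LMUL is passed as a fraction lmul_num / lmul_den (e.g. 1/2 or 4/1).
 * Hypothetical helper for illustration only. */
size_t rvv_vlmax(unsigned vlen, unsigned sew, unsigned lmul_num, unsigned lmul_den)
{
    assert(sew >= 8 && sew <= vlen);
    assert(lmul_num >= 1 && lmul_den >= 1);
    return (size_t)vlen / sew * lmul_num / lmul_den;
}
```

With VLEN = 128 (the value passed to QEMU later in this report), rvv_vlmax(128, 8, 1, 1) gives 16 byte elements per register, and rvv_vlmax(128, 32, 4, 1) gives 16 elements of 32 bits spread over a group of four registers.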

Key Features of RVV

Scalability and Flexibility:
- Variable Vector Length: RVV's design allows different hardware implementations to support varying vector lengths, enhancing scalability.
- Element Manipulation: Supports operations for loading, storing, and manipulating vector elements, including non-contiguous memory accesses through scatter/gather operations.
Predication and Masking:
- Mask Registers: Enable conditional execution of vector operations, allowing certain elements to be processed based on predicate conditions.
- Predicate Operations: Includes instructions for forming predicates through comparisons and other logical operations.
Arithmetic and Logical Operations:
- Comprehensive Instruction Set: RVV includes a wide range of arithmetic (e.g., add, subtract, multiply, divide) and logical (e.g., AND, OR, NOT) instructions tailored for vector processing.
- Vector-Specific Instructions: Additional instructions are provided for tasks unique to vector processing, such as vector compression and decompression.

Comparison of QOY and QOI Formats

Overview of the QOI Format

QOI (Quite OK Image) is a simple and efficient lossless image compression format designed to provide fast encoding and decoding speeds. The main features of QOI include:

Lossless Compression: Ensures image quality is preserved without degradation.
Simple Encoding Algorithm: Utilizes basic differential encoding and a hash table to store colors.
Efficient Memory Usage: Minimizes memory allocation and access operations.

The encoding process of QOI primarily involves the following steps:

Initialize a hash table with 64 entries to store previously encountered colors.
Iterate through the image pixels, calculating the difference (R, G, B, A) between the current pixel and the previous one.
Select an appropriate opcode to encode the data based on the difference values.
Add a termination marker at the end of the file to indicate the end of the image.
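
As an illustration of the 64-entry table mentioned in the steps above, the sketch below computes the table position of a color using the hash given in the QOI specification. The function name qoi_index_position is an assumption for this example; the reference implementation expresses the same formula as a macro.

```c
#include <stdint.h>

/* Position of a color in QOI's 64-entry table of previously seen pixels:
 * (r * 3 + g * 5 + b * 7 + a * 11) % 64, as defined by the QOI specification.
 * The function name is hypothetical. */
unsigned qoi_index_position(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return (r * 3u + g * 5u + b * 7u + a * 11u) % 64u;
}
```

For example, opaque black (0, 0, 0, 255) maps to slot 53. When the encoder meets a pixel whose slot already holds the same color, it can emit a one-byte index instead of the full pixel.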

Overview of the QOY Format

QOY (Quite OK YCbCr420A) is an extension of QOI, with the following notable features:

YCbCr 4:2:0 Color Space: Converts the RGB color space to YCbCr 4:2:0, reducing chroma information and improving compression efficiency.
Alpha Channel Support: Supports transparency, making it suitable for images with transparent backgrounds.
Differential Encoding: Similar to QOI, it uses differential encoding to reduce redundant data.
Multiple Compression Levels: Adopts different opcodes based on pixel difference ranges to balance compression ratio and encoding efficiency.

The encoding process of QOY resembles that of QOI but introduces enhancements for handling color space and alpha channels:

Convert RGB pixels to the YCbCr 4:2:0 color space.
Use differential encoding to compress information for Y, Cb, Cr, and the alpha channel.
Employ various compression opcodes to represent differences in a range of values, thereby improving compression efficiency.

Environment setup

Clone the riscv-gnu-toolchain repository from its official GitHub page.

git clone https://github.com/riscv-collab/riscv-gnu-toolchain.git --recursive

Navigate to the riscv-gnu-toolchain folder, create a folder named build, enter it, and run configure with the options needed for the build.

cd riscv-gnu-toolchain
mkdir build 
cd build
../configure --prefix=$HOME/riscv-gnu-toolchain/build --with-arch=rv32gcv --with-abi=ilp32d --enable-multilib

This command configures the build process for the RISC-V toolchain.

--with-arch=rv32gcv: Targets the 32-bit RISC-V architecture with the general-purpose (G), compressed (C), and vector (V) extensions (RV32GCV).
--with-abi=ilp32d: Sets the ABI (Application Binary Interface) to ilp32d, which uses 32-bit integers, longs, and pointers, with double-precision floating-point support.
--enable-multilib: Enables support for multiple library variants (multilib), allowing the toolchain to generate code for multiple configurations of architecture and ABI.

Build the RISC-V toolchain from this configuration, then build the QEMU emulator for RISC-V:

make linux
make build-qemu

make linux compiles the compiler, assembler, linker, and related tools, producing the riscv32-unknown-linux-gnu-* binaries. The riscv32-unknown-elf-gcc compiler used in the makefile below belongs to the bare-metal (newlib) toolchain, which is built by running plain make.
make build-qemu builds the QEMU emulator for RISC-V, which lets you run and test RISC-V programs without physical hardware.

Example for testing RVV
I referenced the rvv_example repository on GitHub to test whether my environment is functioning properly.
main.c

#include <stdio.h>
#include <math.h>

struct pt {
  float x;
  float y;
  float z;
};

void vec_len_rvv(float *r, struct pt *v, int n);

void vec_len(float *r, struct pt *v, int n){
  for (int i=0; i<n; ++i){
    struct pt p = v[i];
    r[i] = sqrtf(p.x*p.x + p.y*p.y + p.z*p.z);
  }
}

#define N 6
struct pt v[N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}, {13, 14, 15}, {16, 17, 18}};

int main(){
  float lens[N], lens_rvv[N];

  vec_len(lens, v, N);
  vec_len_rvv(lens_rvv, v, N);

  for (int i=0; i<N; ++i){
    printf("%f %f\n", lens[i], lens_rvv[i]);
  }
  return 0;
}

vec.S

    # void vec_len_rvv(float *r, struct pt *pts, int n)

    #define r a0
    #define pts a1
    #define n a2
    #define vl a3
    #define Xs v0
    #define Ys v1
    #define Zs v2
    #define lens v3

    .globl vec_len_rvv
vec_len_rvv:
    # 32 bit elements, don't care (Agnostic) how tail and mask are handled
    vsetvli vl, n, e32, ta,ma
    vlseg3e32.v Xs, (pts) # loads interleaved Xs, Ys, Zs into 3 registers
    vfmul.vv lens, Xs, Xs
    vfmacc.vv lens, Ys, Ys
    vfmacc.vv lens, Zs, Zs
    vfsqrt.v lens, lens
    vse32.v lens, (r)
    sub n, n, vl
    sh2add r, vl, r # bump r ptr 4 bytes per float
    sh1add vl, vl, vl # multiply vl by 3 floats per point
    sh2add pts, vl, pts # bump v ptr 4 bytes per float (12 per pt)
    bnez n, vec_len_rvv
    ret

makefile

go: main
    qemu-riscv32 -cpu rv32,v=true,zba=true,vlen=128,rvv_ta_all_1s=on,rvv_ma_all_1s=on ./main

main: main.c vec.S makefile
    riscv32-unknown-elf-gcc -O main.c vec.S -o main -march=rv32gcv_zba -lm

Just type "make"
Expected output:

$ make
riscv32-unknown-elf-gcc -O main.c vec.S -o main -march=rv32gcv_zba -lm
qemu-riscv32 -cpu rv32,v=true,zba=true,vlen=128 ./main
3.741657 3.741657
8.774964 8.774964
13.928389 13.928389
19.104973 19.104973
24.289915 24.289915
29.478806 29.478806

On my Ubuntu 22.04 ARM64 machine the program compiles successfully and its output matches the expected result. I plan to follow a strategy similar to rvv_example: rewrite selected functions of the QOY source code as RVV versions.

RGBA to YCbCrA Conversion Process in QOY

The following function, qoy_rgba_to_ycbcra_two_lines(), is one of the core blocks of the QOY project. It converts RGBA image data into the YCbCr 4:2:0 A format, taking one or two lines of pixels per call and producing one 2×2 block (two horizontally adjacent pixels from each line) per loop iteration.

static inline int qoy_rgba_to_ycbcra_two_lines(const void* rgba_in, int width, int lines, int channels_in, int channels_out, void *ycbcr420a_out) {
    if (channels_in != 4) channels_in = 3;
    if (channels_out != 4) channels_out = 3;
    unsigned char *line1 = (unsigned char *)rgba_in;
    unsigned char *line2 = lines == 2 ? line1 + width * channels_in : line1;
    unsigned char *out = ycbcr420a_out;
    int size_out = (channels_out == 4) ? 10 : 6;
    int written = 0;
    for (int i = 0; i < width; i += 2, line1 += channels_in * 2, line2 += channels_in * 2, out += size_out) {
        qoy_rgba_t *p1 = (qoy_rgba_t *)line1;
        qoy_rgba_t *p2 = (qoy_rgba_t *)line2;
        qoy_rgba_t *p3 = (qoy_rgba_t *)(((width & 0x01) == 1 && i == width - 1) ? line1 : line1 + channels_in);
        qoy_rgba_t *p4 = (qoy_rgba_t *)(((width & 0x01) == 1 && i == width - 1) ? line2 : line2 + channels_in);
        qoy_ycbcr420a_t *pout = (qoy_ycbcr420a_t *)out;
        pout->y[0] = ((1254097 * p1->r) + (2462056 * p1->g) + (478151 * p1->b)) >> 22;
        pout->y[1] = ((1254097 * p2->r) + (2462056 * p2->g) + (478151 * p2->b)) >> 22;
        pout->y[2] = ((1254097 * p3->r) + (2462056 * p3->g) + (478151 * p3->b)) >> 22;
        pout->y[3] = ((1254097 * p4->r) + (2462056 * p4->g) + (478151 * p4->b)) >> 22;
        unsigned int r4 = p1->r + p2->r + p3->r + p4->r;
        unsigned int g4 = p1->g + p2->g + p3->g + p4->g;
        unsigned int b4 = p1->b + p2->b + p3->b + p4->b;
        pout->cb = qoy_8bit_clamp((134217728 - (44233 * r4) - (86839 * g4) + (b4 << 17) + (1 << 19)) >> 20);
        pout->cr = qoy_8bit_clamp((134217728 + (r4 << 17) - (109757 * g4) - (21315 * b4) + (1 << 19)) >> 20);
        if (channels_out == 4) {
            if (channels_in == 4) {
                pout->a[0] = p1->a;
                pout->a[1] = p2->a;
                pout->a[2] = p3->a;
                pout->a[3] = p4->a;
            } else {
                pout->a[0] = 0xff;
                pout->a[1] = 0xff;
                pout->a[2] = 0xff;
                pout->a[3] = 0xff;
            }
        }
        written += size_out;
    }
    return written;
}

1. Parameters and Initial Setup

rgba_in: Points to the input buffer containing RGBA (or RGB) data.
width: The width of the image line(s) to be processed.
lines: Indicates whether to process 1 or 2 lines.
- If lines == 2, line2 points to the next line of pixels (line1 + width * channels_in).
- If lines == 1 (processing the boundary), line2 points back to line1, effectively duplicating the last line.
channels_in, channels_out: Specify the number of channels in the input (3 or 4) and output data (3 or 4).
- At the start of the function, channels_in is forced to 3 or 4:
```
if (channels_in != 4) channels_in = 3;
```
  Similarly, channels_out is handled the same way.
ycbcr420a_out: The output buffer where the converted YCbCrA (or YCbCr) blocks will be written.
size_out: Specifies the output block size:
- 10 bytes: If the output includes alpha (4 channels), each 2×2 block writes 10 bytes.
- 6 bytes: If the output excludes alpha (3 channels), each 2×2 block writes 6 bytes.
In the YCbCr 4:2:0(A) format:
- Each 2×2 block contains 4 Y + 1 Cb + 1 Cr (= 6 bytes).
- Adding 4 alpha values results in a total of 10 bytes.
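
The block sizes above determine how large the output buffer must be. The following is a minimal sketch of that calculation, assuming odd dimensions round up to whole 2×2 blocks (which is how the loop in qoy_rgba_to_ycbcra_two_lines steps through the width). The function name ycbcr420a_size is hypothetical and is not part of the QOY API.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of YCbCr 4:2:0(A) pixel data for a width x height image.
 * Odd dimensions round up because edge pixels are duplicated to fill a 2x2 block.
 * Hypothetical helper for illustration only. */
size_t ycbcr420a_size(int width, int height, int channels_out)
{
    assert(width > 0 && height > 0);
    size_t blocks_x = (size_t)(width + 1) / 2;
    size_t blocks_y = (size_t)(height + 1) / 2;
    size_t block_bytes = (channels_out == 4) ? 10 : 6;
    return blocks_x * blocks_y * block_bytes;
}
```

A 4×4 image with alpha needs 40 bytes in this layout, compared with 64 bytes as raw RGBA; without alpha it needs 24 bytes, compared with 48 bytes as raw RGB.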

2. Obtaining Line Pointers (line1, line2) and Starting Output Pointer (out)

unsigned char *line1 = (unsigned char *)rgba_in;
unsigned char *line2 = lines == 2 ? line1 + width * channels_in : line1;

If lines == 2, indicating the presence of a second line, line2 points to the next line of pixels.
If lines == 1, line2 points back to line1, effectively duplicating the last line.

Loop:

for (int i = 0; i < width; i += 2, ...)

Each iteration handles one pair of adjacent pixel columns.

At the end of each iteration, the loop's increment expression advances the pointers:

line1 += channels_in * 2;
line2 += channels_in * 2;
out += size_out;

3. Retrieving Pixel Data from Line1 and Line2

qoy_rgba_t *p1 = (qoy_rgba_t *)line1;
qoy_rgba_t *p2 = (qoy_rgba_t *)line2;
qoy_rgba_t *p3 = (qoy_rgba_t *)(((width & 0x01) == 1 && i == width - 1) ? line1 : line1 + channels_in);
qoy_rgba_t *p4 = (qoy_rgba_t *)(((width & 0x01) == 1 && i == width - 1) ? line2 : line2 + channels_in);

Each 2×2 block corresponds to four pixel pointers: p1, p2, p3, p4.
p1 / p3: From line1 (same row, 2 pixels).
p2 / p4: From line2 (next row, 2 pixels).
For odd widths ((width & 0x01) == 1) and i == width - 1 (last pixel), line1 or line2 duplicates the last pixel to avoid out-of-bounds reads.
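
The odd-width rule can be isolated into a small helper that returns the column used for the right-hand pixels (p3 and p4). This is only a sketch that mirrors the condition in the code above; the name right_column is an assumption for illustration.

```c
#include <assert.h>

/* Column of the right-hand pixel (p3/p4) for the block that starts at column i.
 * On the last column of an odd-width image the left pixel is reused,
 * so no read goes past the end of the line. Hypothetical helper. */
int right_column(int i, int width)
{
    assert(i >= 0 && i < width);
    return ((width & 0x01) == 1 && i == width - 1) ? i : i + 1;
}
```

For width = 5 the blocks start at columns 0, 2, and 4, and their right-hand columns are 1, 3, and 4.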

4. Calculating the Y Component

pout->y[0] = ((1254097 * p1->r) + (2462056 * p1->g) + (478151 * p1->b)) >> 22;
pout->y[1] = ((1254097 * p2->r) + (2462056 * p2->g) + (478151 * p2->b)) >> 22;
pout->y[2] = ((1254097 * p3->r) + (2462056 * p3->g) + (478151 * p3->b)) >> 22;
pout->y[3] = ((1254097 * p4->r) + (2462056 * p4->g) + (478151 * p4->b)) >> 22;

This corresponds to an approximation of the JPEG color conversion formula:

$Y = 0.299 R + 0.587 G + 0.114 B$
The constants 1254097, 2462056, and 478151 are integer approximations of the coefficients (0.299), (0.587), and (0.114), respectively. These constants are scaled and then reduced using a right shift (>> 22).
The values (y[0..3]) correspond to the four pixels in the 2×2 block.
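
A short check makes the fixed-point scaling explicit. The sketch below uses the same expression as pout->y[k] and shows that the three constants sum to exactly 1 << 22, so the result for any 8-bit input stays within 0..255 and the intermediate sum fits in 32 bits. The names luma_q22 and luma_coeff_sum are assumptions made for this example.

```c
#include <stdint.h>

/* Luma coefficients scaled by 2^22 (0.299, 0.587, 0.114). */
enum { Y_COEFF_R = 1254097, Y_COEFF_G = 2462056, Y_COEFF_B = 478151 };

/* Same arithmetic as pout->y[k] in qoy_rgba_to_ycbcra_two_lines. */
uint8_t luma_q22(uint8_t r, uint8_t g, uint8_t b)
{
    uint32_t sum = (uint32_t)Y_COEFF_R * r + (uint32_t)Y_COEFF_G * g + (uint32_t)Y_COEFF_B * b;
    return (uint8_t)(sum >> 22);
}

/* Sum of the three coefficients, which equals 1 << 22. */
uint32_t luma_coeff_sum(void)
{
    return (uint32_t)Y_COEFF_R + Y_COEFF_G + Y_COEFF_B;
}
```

Because the largest possible sum is 255 × 2^22, which is below 2^31, a 32-bit element width is sufficient when this computation is vectorized.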

5. Calculating Blended Cb/Cr

unsigned int r4 = p1->r + p2->r + p3->r + p4->r;
unsigned int g4 = p1->g + p2->g + p3->g + p4->g;
unsigned int b4 = p1->b + p2->b + p3->b + p4->b;
pout->cb = qoy_8bit_clamp((134217728 - (44233 * r4) - (86839 * g4) + (b4 << 17) + (1 << 19)) >> 20);
pout->cr = qoy_8bit_clamp((134217728 + (r4 << 17) - (109757 * g4) - (21315 * b4) + (1 << 19)) >> 20);

In YCbCr 4:2:0, Cb and Cr are stored once for each 2×2 block by averaging pixel values.
r4, g4, and b4 are the sums of the R, G, and B components for the four pixels.
The integer approximation formulas calculate Cb and Cr, followed by qoy_8bit_clamp() to ensure values remain within the 0–255 range.
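
The chroma formulas can be tried out in isolation. In this sketch, clamp_u8 stands in for qoy_8bit_clamp(), whose body is not shown in this report, and the function names chroma_cb and chroma_cr are hypothetical. The inputs are the four-pixel sums r4, g4, and b4, each in the range 0..1020.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for qoy_8bit_clamp(): limits a value to 0..255. */
static int clamp_u8(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

/* 134217728 is 128 << 20 (the chroma offset) and 1 << 19 rounds to nearest. */
uint8_t chroma_cb(int r4, int g4, int b4)
{
    assert(r4 >= 0 && r4 <= 1020 && g4 >= 0 && g4 <= 1020 && b4 >= 0 && b4 <= 1020);
    int v = 134217728 - 44233 * r4 - 86839 * g4 + (b4 << 17) + (1 << 19);
    return (uint8_t)clamp_u8(v >> 20);
}

uint8_t chroma_cr(int r4, int g4, int b4)
{
    assert(r4 >= 0 && r4 <= 1020 && g4 >= 0 && g4 <= 1020 && b4 >= 0 && b4 <= 1020);
    int v = 134217728 + (r4 << 17) - 109757 * g4 - 21315 * b4 + (1 << 19);
    return (uint8_t)clamp_u8(v >> 20);
}
```

A block of four gray pixels (r4 = g4 = b4 = 512) yields Cb = Cr = 128. A block of four pure red pixels (r4 = 1020) produces a Cr value of 256 before clamping, which shows why the clamp is needed for chroma.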

6. Handling the Alpha Channel

if (channels_out == 4) {
    if (channels_in == 4) {
        pout->a[0] = p1->a;
        pout->a[1] = p2->a;
        pout->a[2] = p3->a;
        pout->a[3] = p4->a;
    } else {
        pout->a[0] = 0xff;
        pout->a[1] = 0xff;
        pout->a[2] = 0xff;
        pout->a[3] = 0xff;
    }
}

If the output requires alpha (channels_out == 4), check if the input also has alpha:
- If yes (channels_in == 4), copy the corresponding a values from p1, p2, p3, and p4.
- If not (3-channel RGB), set the alpha values to 0xff (fully opaque).

7. Updating the Counters After Each Block

written += size_out;

size_out is the number of bytes written for this 2×2 block:
- 6 bytes: Without alpha.
- 10 bytes: With alpha.
The loop continues until i reaches or exceeds width.

8. Returning the Total Written Bytes

return written;

Returns the total number of bytes written, allowing the caller to know how much data was processed.

The current strategy is to rewrite qoy_rgba_to_ycbcra and qoy_ycbcra_to_rgba. These functions process large batches of pixels (RGBA color space conversion, block-based 4:2:0 processing, clamping) and, like the earlier vec_len example, perform the same arithmetic on every pixel or block.
Vector instructions can replace these intensive calculations (addition, multiplication, shifting, clamping) so that multiple pixels are processed at once. The output is then validated on QEMU to confirm that it matches the original C code.

RVV intrinsics

RISC-V Vector (RVV) intrinsics provide a straightforward way to use RVV instructions directly in C/C++ without requiring assembly knowledge. Intrinsics are low-level functions defined by the compiler, offering a nearly one-to-one mapping with RVV instructions, allowing programmers to leverage vector operations in a high-level language.
Key Points:
Intrinsic: A low-level function defined by the compiler to expose individual instructions to a higher-level language.

Benefits: Simplifies low-level RVV programming without requiring in-depth knowledge of assembly.

Example: To perform vector addition (vadd.vv), the intrinsic function for 32-bit integer vectors (i32) in one vector register group (m1) is:

vint32m1_t __riscv_vadd_vv_i32m1(vint32m1_t, vint32m1_t, size_t);

Naming Scheme: Intrinsics follow a structured naming pattern:
- __riscv – Prefix indicating the RISC-V architecture.
- vadd – The operation (e.g., vector addition).
- _vv – Operand configuration.
- i32 – Element format (32-bit integers).
- m1 – LMUL (vector group size).
- Optional suffix – Mask or tail configuration.

This structured approach bridges low-level hardware instructions with high-level programming, making RVV accessible and efficient.
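
To show the naming scheme in use, here is a minimal sketch of an element-wise addition built around __riscv_vadd_vv_i32m1. The wrapper function add_i32 is a hypothetical name. The scalar fallback is included only so that the file also compiles on a host without the vector extension; it is not part of the RVV technique itself.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

/* out[i] = a[i] + b[i] for i in [0, n). Hypothetical wrapper for illustration. */
void add_i32(int32_t *out, const int32_t *a, const int32_t *b, size_t n)
{
#if defined(__riscv_vector)
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m1(n);            /* elements handled in this pass */
        vint32m1_t va = __riscv_vle32_v_i32m1(a, vl);   /* load vl elements of a */
        vint32m1_t vb = __riscv_vle32_v_i32m1(b, vl);   /* load vl elements of b */
        vint32m1_t vs = __riscv_vadd_vv_i32m1(va, vb, vl);
        __riscv_vse32_v_i32m1(out, vs, vl);             /* store vl results */
        a += vl;
        b += vl;
        out += vl;
        n -= vl;
    }
#else
    /* Scalar fallback for hosts without RVV */
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

Every intrinsic in the loop follows the same pattern: operation (vle32, vadd, vse32), operand form (_v or _vv), element type (i32), and LMUL (m1).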

Implementing qoy_rgba_to_ycbcra_rvv

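#include <stdint.h>        // uint8_t
#include <stdio.h>         // printf (debug output)
#include <stdlib.h>        // malloc, free
#include <riscv_vector.h>  // RVV intrinsics
// qoy_ycbcr420a_t comes from the QOY header

int qoy_rgba_to_ycbcra_rvv(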
    const void* rgba_in,
    int width,
    int height,
    int channels_in,
    int channels_out,
    void *ycbcr420a_out
){
    // (1) If channels_in != 4, set to 3
    //     If channels_out != 4, set to 3
    if (channels_in != 4) channels_in = 3;
    if (channels_out != 4) channels_out = 3;

    const uint8_t* src = (const uint8_t*)rgba_in;
    uint8_t* dst       = (uint8_t*)ycbcr420a_out;
    int block_size = (channels_out == 4) ? 10 : 6;
    int written = 0;

    for(int y = 0; y < height; y += 2){
        int lineCount = 2;
        if((y == height - 1) && (height & 1)) lineCount = 1; // Odd height => last line

        // Allocate buffers:
        uint8_t* R1 = (uint8_t*)malloc(width);
        uint8_t* G1 = (uint8_t*)malloc(width);
        uint8_t* B1 = (uint8_t*)malloc(width);
        uint8_t* A1 = (uint8_t*)malloc(width);
        uint8_t* R2 = (uint8_t*)malloc(width);
        uint8_t* G2 = (uint8_t*)malloc(width);
        uint8_t* B2 = (uint8_t*)malloc(width);
        uint8_t* A2 = (uint8_t*)malloc(width);

        const uint8_t* line1 = src + (y * width * channels_in);
        const uint8_t* line2 = (lineCount == 2) ? (line1 + width * channels_in) : line1;

        // ----------(A) Separate channels (can be scalar or vector)-----------
        for(int x = 0; x < width; x++){
            const uint8_t* p1 = line1 + x * channels_in;
            R1[x] = p1[0];
            G1[x] = p1[1];
            B1[x] = p1[2];
            A1[x] = (channels_in == 4) ? p1[3] : 0xff;
            if(lineCount == 2){
                const uint8_t* p2 = line2 + x * channels_in;
                R2[x] = p2[0];
                G2[x] = p2[1];
                B2[x] = p2[2];
                A2[x] = (channels_in == 4) ? p2[3] : 0xff;
            } else {
                // If lines = 1 => same as line1
                R2[x] = R1[x];
                G2[x] = G1[x];
                B2[x] = B1[x];
                A2[x] = A1[x];
            }
        }

        // ----------(B) RVV Computation of Y-----------
        uint8_t* Y1 = (uint8_t*)malloc(width);
        uint8_t* Y2 = (uint8_t*)malloc(width);

        int idx = 0;
        while(idx < width){
            size_t vl = __riscv_vsetvl_e8m1(width - idx); 
            // Load R1
            vuint8m1_t vr_in = __riscv_vle8_v_u8m1(&R1[idx], vl);
            vuint8m1_t vg_in = __riscv_vle8_v_u8m1(&G1[idx], vl);
            vuint8m1_t vb_in = __riscv_vle8_v_u8m1(&B1[idx], vl);

            // Zero-extend => u32
            // First step: Zero-extend 8-bit unsigned integers to 16-bit
            vuint16m2_t vr_temp = __riscv_vwcvtu_x_x_v_u16m2(vr_in, vl);
            vuint16m2_t vg_temp = __riscv_vwcvtu_x_x_v_u16m2(vg_in, vl);
            vuint16m2_t vb_temp = __riscv_vwcvtu_x_x_v_u16m2(vb_in, vl);

            // Second step: Zero-extend 16-bit unsigned integers to 32-bit
            vuint32m4_t vr = __riscv_vwcvtu_x_x_v_u32m4(vr_temp, vl);
            vuint32m4_t vg = __riscv_vwcvtu_x_x_v_u32m4(vg_temp, vl);
            vuint32m4_t vb = __riscv_vwcvtu_x_x_v_u32m4(vb_temp, vl);

            // Multiply => c_r = vr * 1254097
            vuint32m4_t c_r = __riscv_vmul_vx_u32m4(vr, 1254097, vl);
            vuint32m4_t c_g = __riscv_vmul_vx_u32m4(vg, 2462056, vl);
            vuint32m4_t c_b = __riscv_vmul_vx_u32m4(vb, 478151,  vl);

            // ysum = c_r + c_g + c_b => shift right by 22 => clamp
            vuint32m4_t ysum = __riscv_vadd_vv_u32m4(c_r, c_g, vl);
            ysum = __riscv_vadd_vv_u32m4(ysum, c_b, vl);
            ysum = __riscv_vsrl_vx_u32m4(ysum, 22, vl); // >>22
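            // Clamp to range 0..255 (the lower bound is a no-op for unsigned values)
            ysum = __riscv_vmaxu_vx_u32m4(ysum, 0, vl);
            ysum = __riscv_vminu_vx_u32m4(ysum, 255, vl);

            // Debug note: vector values cannot be printed with printf("%d");
            // store them to a buffer first (e.g. with vse32) to inspect them.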

            // Cast to uint8 => vnclipu_wx
            // First step: Narrow from vuint32m4_t to vuint16m2_t
            vuint16m2_t ysum_16 = __riscv_vnclipu_wx_u16m2(ysum, 0, 0, vl);

            // Second step: Narrow from vuint16m2_t to vuint8m1_t    
            vuint8m1_t vy = __riscv_vnclipu_wx_u8m1(ysum_16, 0, 0, vl);

            // Store => Y1
            __riscv_vse8_v_u8m1(&Y1[idx], vy, vl);

            // Process line2
            if(lineCount == 2){
                vuint8m1_t vr2_in = __riscv_vle8_v_u8m1(&R2[idx], vl);
                vuint8m1_t vg2_in = __riscv_vle8_v_u8m1(&G2[idx], vl);
                vuint8m1_t vb2_in = __riscv_vle8_v_u8m1(&B2[idx], vl);
                // Zero-extend => u32
                // First step: Zero-extend 8-bit unsigned integers to 16-bit
                vuint16m2_t vr2_temp = __riscv_vwcvtu_x_x_v_u16m2(vr2_in, vl);
                vuint16m2_t vg2_temp = __riscv_vwcvtu_x_x_v_u16m2(vg2_in, vl);
                vuint16m2_t vb2_temp = __riscv_vwcvtu_x_x_v_u16m2(vb2_in, vl);

                // Second step: Zero-extend 16-bit unsigned integers to 32-bit
                vuint32m4_t vr2 = __riscv_vwcvtu_x_x_v_u32m4(vr2_temp, vl);
                vuint32m4_t vg2 = __riscv_vwcvtu_x_x_v_u32m4(vg2_temp, vl);
                vuint32m4_t vb2 = __riscv_vwcvtu_x_x_v_u32m4(vb2_temp, vl);
                
                vuint32m4_t c_r2 = __riscv_vmul_vx_u32m4(vr2, 1254097, vl);
                vuint32m4_t c_g2 = __riscv_vmul_vx_u32m4(vg2, 2462056, vl);
                vuint32m4_t c_b2 = __riscv_vmul_vx_u32m4(vb2, 478151,  vl);

                vuint32m4_t ysum2 = __riscv_vadd_vv_u32m4(c_r2, c_g2, vl);
                ysum2 = __riscv_vadd_vv_u32m4(ysum2, c_b2, vl);
                ysum2 = __riscv_vsrl_vx_u32m4(ysum2, 22, vl);
                ysum2 = __riscv_vmaxu_vx_u32m4(ysum2, 0, vl);
                ysum2 = __riscv_vminu_vx_u32m4(ysum2, 255, vl);
                // First step: Narrow from vuint32m4_t to vuint16m2_t
                vuint16m2_t ysum2_16 = __riscv_vnclipu_wx_u16m2(ysum2, 0, 0, vl);

                // Second step: Narrow from vuint16m2_t to vuint8m1_t
                vuint8m1_t vy2 = __riscv_vnclipu_wx_u8m1(ysum2_16, 0, 0, vl);

                __riscv_vse8_v_u8m1(&Y2[idx], vy2, vl);
            } else {
                // If lineCount = 1 => same as Y1
                __riscv_vse8_v_u8m1(&Y2[idx], vy, vl);
            }

            idx += vl;
        }

        // ----------(C) Block-based sum => QOY block output-----------
        for(int x = 0; x < width; x += 2){
            int x2 = ((x + 1) < width ? x + 1 : x); // If odd, repeat last
            int r4 = R1[x] + R1[x2] + R2[x] + R2[x2];
            int g4 = G1[x] + G1[x2] + G2[x] + G2[x2];
            int b4 = B1[x] + B1[x2] + B2[x] + B2[x2];
            // Cb= ...
            int cb = 134217728 - 44233 * r4 - 86839 * g4 + (b4 << 17) + (1 << 19);
            cb >>= 20; 
            if(cb < 0) cb = 0; 
            else if(cb > 255) cb = 255;
            // Cr= ...
            int cr = 134217728 + (r4 << 17) - 109757 * g4 - 21315 * b4 + (1 << 19);
            cr >>= 20; 
            if(cr < 0) cr = 0; 
            else if(cr > 255) cr = 255;
            
            printf("Block %d-%d: Cb=%d, Cr=%d\n", x, x2, cb, cr);
            
            // Match the scalar layout: [0] = line1[x], [1] = line2[x], [2] = line1[x2], [3] = line2[x2]
            uint8_t y0 = Y1[x], y1 = Y2[x], y2 = Y1[x2], y3 = Y2[x2];
            // Alpha uses the same ordering as Y
            uint8_t a0 = A1[x], a1 = A2[x], a2 = A1[x2], a3 = A2[x2];

            qoy_ycbcr420a_t* pout = (qoy_ycbcr420a_t*)dst;
            pout->y[0] = y0;
            pout->y[1] = y1;
            pout->y[2] = y2;
            pout->y[3] = y3;
            pout->cb = cb;
            pout->cr = cr;
            if(channels_out == 4){
                pout->a[0] = a0;
                pout->a[1] = a1;
                pout->a[2] = a2;
                pout->a[3] = a3;
            }
            dst += block_size;
            written += block_size;
            
            printf("Final Output (Y, Cb, Cr):\n");
            for (int i = 0; i < written; i++) {
                printf("Byte %d: %02X\n", i, ((uint8_t *)ycbcr420a_out)[i]);
            }

        }

        free(R1); free(G1); free(B1); free(A1);
        free(R2); free(G2); free(B2); free(A2);
        free(Y1); free(Y2);
    }

    return written;
}

The function qoy_rgba_to_ycbcra_rvv converts an RGBA image to YCbCrA format, potentially reducing the number of channels based on input parameters. The conversion process is divided into three main sections:

Channel Separation (Section A): Separates the RGBA channels into individual R, G, B, and A buffers.
Y Component Computation Using RVV (Section B): Calculates the Y (luma) component using RISC-V Vector Extensions (RVV) for parallel processing.
Block-Based Sum and Cb/Cr Computation (Section C): Aggregates pixel values to compute the Cb and Cr (chrominance) components and assembles the final output.

Section B: RVV Computation of Y
This is where RVV plays a crucial role in accelerating the computation of the Y component. Here's a step-by-step breakdown:

Vector Length Configuration:
```
size_t vl = __riscv_vsetvl_e8m1(width - idx);
```
Sets the vector length (vl) based on the remaining width. This allows dynamic adjustment to handle the remaining pixels that might not fit exactly into vector registers.
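
The effect of calling vsetvl in every pass (strip-mining) can be emulated on any machine. This sketch assumes the common behaviour vl = min(remaining, vlmax). The specification also permits other values when the remaining count lies between vlmax and 2 × vlmax, so real hardware may split the tail differently. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates the usual result of vsetvl: vl = min(remaining, vlmax).
 * Assumption: the implementation does not use the alternative split that
 * the specification allows when vlmax < remaining < 2 * vlmax. */
size_t emulated_vsetvl(size_t remaining, size_t vlmax)
{
    return remaining < vlmax ? remaining : vlmax;
}

/* Number of loop passes needed to cover `width` elements. */
size_t strip_mine_passes(size_t width, size_t vlmax)
{
    assert(vlmax > 0);
    size_t passes = 0;
    size_t idx = 0;
    while (idx < width) {
        idx += emulated_vsetvl(width - idx, vlmax);
        passes++;
    }
    return passes;
}
```

With VLEN = 128 and e8m1 (vlmax = 16), a 37-pixel row is covered in three passes of 16, 16, and 5 elements, with no separate scalar tail loop.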

Loading R, G, B Channels:

vuint8m1_t vr_in = __riscv_vle8_v_u8m1(&R1[idx], vl);
vuint8m1_t vg_in = __riscv_vle8_v_u8m1(&G1[idx], vl);
vuint8m1_t vb_in = __riscv_vle8_v_u8m1(&B1[idx], vl);

Loads chunks of R, G, and B data into vector registers. Each vector can hold multiple pixel values, enabling parallel processing.

Zero-Extension of Data Types:

vuint16m2_t vr_temp = __riscv_vwcvtu_x_x_v_u16m2(vr_in, vl);
vuint16m2_t vg_temp = __riscv_vwcvtu_x_x_v_u16m2(vg_in, vl);
vuint16m2_t vb_temp = __riscv_vwcvtu_x_x_v_u16m2(vb_in, vl);

vuint32m4_t vr = __riscv_vwcvtu_x_x_v_u32m4(vr_temp, vl);
vuint32m4_t vg = __riscv_vwcvtu_x_x_v_u32m4(vg_temp, vl);
vuint32m4_t vb = __riscv_vwcvtu_x_x_v_u32m4(vb_temp, vl);

First Step: Widens the 8-bit unsigned integers to 16 bits. vwcvtu doubles the element width, so reaching 32 bits takes two steps.
Second Step: Widens the 16-bit integers to 32 bits, which is needed because the products with the 22-bit fixed-point constants do not fit in 16 bits.

Multiplication with Constants:

vuint32m4_t c_r = __riscv_vmul_vx_u32m4(vr, 1254097, vl);
vuint32m4_t c_g = __riscv_vmul_vx_u32m4(vg, 2462056, vl);
vuint32m4_t c_b = __riscv_vmul_vx_u32m4(vb, 478151,  vl);

Multiplies each channel by specific constants that are part of the YCbCr conversion formula. These operations are performed in parallel for multiple pixels.

Summation and Shifting:

vuint32m4_t ysum = __riscv_vadd_vv_u32m4(c_r, c_g, vl);
ysum = __riscv_vadd_vv_u32m4(ysum, c_b, vl);
ysum = __riscv_vsrl_vx_u32m4(ysum, 22, vl); // Shift right by 22 bits

Sums the multiplied values and shifts right by 22 bits to scale down the result appropriately, as per the YCbCr conversion formula.

Clamping:

ysum = __riscv_vmaxu_vx_u32m4(ysum, 0, vl);
ysum = __riscv_vminu_vx_u32m4(ysum, 255, vl);

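Ensures that the Y values are within the 0-255 range so they fit into an 8-bit unsigned integer. For unsigned values the lower bound always holds, and because the three coefficients sum to exactly 2^22 the shifted sum never exceeds 255, so this step acts as a safeguard rather than a requirement.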

Narrowing to 8-bit and Storing Results:
```
vuint16m2_t ysum_16 = __riscv_vnclipu_wx_u16m2(ysum, 0, 0, vl);
vuint8m1_t vy = __riscv_vnclipu_wx_u8m1(ysum_16, 0, 0, vl);
__riscv_vse8_v_u8m1(&Y1[idx], vy, vl);
```
- First Narrowing Step: Converts the 32-bit results back to 16-bit.
- Second Narrowing Step: Further narrows the 16-bit values to 8-bit.
- Storing: Writes the final 8-bit Y values back to the Y1 buffer.

Processing the Second Line (if applicable):
If lineCount == 2, the same RVV operations are performed on the second line (R2, G2, B2) to compute Y2.
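
As one possible simplification of Section B, the sketch below computes a row of Y values with fewer instructions per pass: vzext.vf4 widens 8-bit data to 32 bits in one step, vmacc fuses each multiply with the addition, and two narrowing shifts (vnsrl) replace the shift, clamp, and vnclipu sequence. It is an untested suggestion rather than the implementation used in this project, it has not been benchmarked, and it relies on the Y result never exceeding 255. The function name luma_row is hypothetical, and the scalar fallback exists only so the file compiles on hosts without RVV.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

/* y[i] = (1254097*r[i] + 2462056*g[i] + 478151*b[i]) >> 22 for i in [0, n). */
void luma_row(uint8_t *y, const uint8_t *r, const uint8_t *g, const uint8_t *b, size_t n)
{
#if defined(__riscv_vector)
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e8m1(n);
        /* vzext.vf4 widens u8 to u32 in a single instruction */
        vuint32m4_t vr = __riscv_vzext_vf4_u32m4(__riscv_vle8_v_u8m1(r, vl), vl);
        vuint32m4_t vg = __riscv_vzext_vf4_u32m4(__riscv_vle8_v_u8m1(g, vl), vl);
        vuint32m4_t vb = __riscv_vzext_vf4_u32m4(__riscv_vle8_v_u8m1(b, vl), vl);
        /* acc = r*Kr, then acc += g*Kg and acc += b*Kb */
        vuint32m4_t acc = __riscv_vmul_vx_u32m4(vr, 1254097u, vl);
        acc = __riscv_vmacc_vx_u32m4(acc, 2462056u, vg, vl);
        acc = __riscv_vmacc_vx_u32m4(acc, 478151u, vb, vl);
        /* (acc >> 16) >> 6 equals acc >> 22; each vnsrl also halves the element width */
        vuint16m2_t y16 = __riscv_vnsrl_wx_u16m2(acc, 16, vl);
        vuint8m1_t y8 = __riscv_vnsrl_wx_u8m1(y16, 6, vl);
        __riscv_vse8_v_u8m1(y, y8, vl);
        r += vl;
        g += vl;
        b += vl;
        y += vl;
        n -= vl;
    }
#else
    /* Scalar fallback for hosts without RVV */
    for (size_t i = 0; i < n; i++) {
        uint32_t sum = 1254097u * r[i] + 2462056u * g[i] + 478151u * b[i];
        y[i] = (uint8_t)(sum >> 22);
    }
#endif
}
```

Calling luma_row once for line 1 and once for line 2 would cover both branches of Section B with the same code.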

Comparison of the Pure C and RVV Intrinsic Versions

1. Logic Overview in Pure C
Function structure:

// Similar to:
// int qoy_rgba_to_ycbcra(const void* rgba_in, int width, int height, int channels_in, int channels_out, void *ycbcr420a_out)
{
    // Main structure: process "two lines" at a time => call qoy_rgba_to_ycbcra_two_lines
    //   for (int y = 0; y < height; y += 2, pin += width*channels_in*2, pout += size_out * (width >> 1)) {
    //       written += qoy_rgba_to_ycbcra_two_lines( pin, width, lines=2 or 1, ... , pout );
    //   }
}

The outer function (qoy_rgba_to_ycbcra) iterates through the height with a step of y += 2, processing two lines at a time (or just one line if only one line is left). The rgba_in pointer passed to qoy_rgba_to_ycbcra_two_lines corresponds to the starting address of the two lines. Similarly, the pout pointer moves forward as blocks are processed.

1.1 Logic of qoy_rgba_to_ycbcra_two_lines

static inline int qoy_rgba_to_ycbcra_two_lines(
    const void* rgba_in, int width, int lines, int channels_in, int channels_out, void *ycbcr420a_out
){
    // line1 = (unsigned char *)rgba_in;
    // line2 = (lines==2) ? (line1 + width*channels_in) : line1; // Two lines or repeat the same line
    // ...
    // for (int i=0; i<width; i+=2,
    //      line1 += channels_in*2, line2 += channels_in*2, out += size_out)
    // {
    //    p1 = (qoy_rgba_t *) line1;
    //    p2 = (qoy_rgba_t *) line2;
    //    p3 = ( (width&1)&& (i==width-1) ) ? line1 : (line1+channels_in);
    //    p4 = ( (width&1)&& (i==width-1) ) ? line2 : (line2+channels_in);
    //    ...
    //    // Calculate Y0..Y3, then compute cb, cr, and a
    // }
    // return written;
}

Each iteration processes a left-right pair of pixels (i += 2). Both line1 and line2 advance by 2 * channels_in per iteration, so they always point at the start of the next pixel pair.

If the width is odd, then at i == width-1 there is no right-hand pixel. In that case p3 and p4 fall back to p1 and p2, so the last column is duplicated instead of being read past the end of the row.
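
The pointer selection for one block can be sketched as follows. The struct `block_px_t` and the function `block_pixels` are hypothetical names introduced for illustration; the selection logic follows the commented outline above (p1 top-left, p2 bottom-left, p3 top-right, p4 bottom-right).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the four source pixels of one 2x2 block. */
typedef struct {
    const uint8_t *p1, *p2, *p3, *p4;
} block_px_t;

/* Select the four pixels of the block that starts at column i.
   rgba points at the first of the two lines being processed. */
static block_px_t block_pixels(const uint8_t *rgba, int width, int lines,
                               int channels_in, int i)
{
    const uint8_t *line1 = rgba + (size_t)i * (size_t)channels_in;
    /* With a single line left, the bottom row repeats the top row. */
    const uint8_t *line2 = (lines == 2)
        ? line1 + (size_t)width * (size_t)channels_in
        : line1;
    /* True only for the last column of an odd-width image. */
    int last_odd = (width & 1) && (i == width - 1);

    block_px_t b;
    b.p1 = line1;
    b.p2 = line2;
    b.p3 = last_odd ? line1 : line1 + channels_in;
    b.p4 = last_odd ? line2 : line2 + channels_in;
    return b;
}
```

With a 3-pixel-wide RGBA image, the block at i = 2 gets p3 == p1 and p4 == p2, so no pointer ever refers to a pixel beyond the row.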

2. Logic Overview in the RVV intrinsic Version

int qoy_rgba_to_ycbcra_rvv( 
    const void* rgba_in,
    int width,
    int height,
    int channels_in,
    int channels_out,
    void *ycbcr420a_out
){
    // Similar outer loop: iterate with y += 2
    for(int y=0; y<height; y+=2){
        // lineCount=2 or 1
        // (A) Channel separation => extract each pixel of the entire row(s) into R1[x], G1[x], B1[x], A1[x] / R2[x], ...
        // (B) Use RVV to compute Y1[x], Y2[x]
        // (C) for (int i=0; i<width; i+=2) { ... assemble Y0..Y3 + cb/cr + alpha }
    }
}

2.1 Channel Separation (Part (A))

uint8_t* R1= malloc(width); // ...
const uint8_t* line1 = src + (y*width*channels_in);
const uint8_t* line2 = line1 + (width*channels_in); // or same line if 1 line
for(int x=0; x<width; x++){
    // R1[x] = line1[x*channels_in+0];
    // ...
    // R2[x] = line2[x*channels_in+0];
    // ...
}

This step separates all pixels in a row (from x = 0 to x = width-1) into R1[], G1[], B1[], and A1[]. If lineCount = 2, line2 is also separated into R2[], G2[], B2[], and A2[].

Result:

R1[i], G1[i] correspond to the i-th pixel in the top row.
R2[i], G2[i] correspond to the i-th pixel in the bottom row.
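
A complete version of the separation loop might look like the sketch below. `split_row` is a hypothetical helper, and filling alpha with 255 for 3-channel input is an assumption made here for illustration rather than something taken from the project code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split one interleaved row into planar channels.
   A may be NULL if the caller does not need alpha. */
static void split_row(const uint8_t *line, int width, int channels_in,
                      uint8_t *R, uint8_t *G, uint8_t *B, uint8_t *A)
{
    for (int x = 0; x < width; x++) {
        const uint8_t *px = line + (size_t)x * (size_t)channels_in;
        R[x] = px[0];
        G[x] = px[1];
        B[x] = px[2];
        if (A != NULL) {
            /* Assumption: opaque alpha when the input has no alpha channel. */
            A[x] = (channels_in == 4) ? px[3] : 255;
        }
    }
}
```

Calling it once for line1 and once for line2 fills R1/G1/B1/A1 and R2/G2/B2/A2. The planar buffers depend only on the width, so they could be allocated once before the y loop and reused for every pass.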

2.2 Compute Y Using RVV (Part (B))
Process R1[idx..idx+vl-1], G1[idx..idx+vl-1], and B1[idx..idx+vl-1] using RVV instructions to calculate Y1[]. Similarly, compute Y2[] for the bottom row (R2, G2, B2).
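
The idx and vl bookkeeping behind R1[idx..idx+vl-1] follows the usual RVV strip-mining pattern. The sketch below models it in portable C so that it can run without vector hardware. `HYPOTHETICAL_VLMAX`, `model_vsetvl`, and `strip_lengths` are made-up names, and the model is simplified: a real `vsetvl` is allowed to return a smaller value than this in some cases.

```c
#include <stddef.h>

/* Assumed maximum number of elements per strip; on real hardware this
   depends on VLEN, SEW and LMUL. */
#define HYPOTHETICAL_VLMAX 16

/* Simplified stand-in for vsetvl: grant min(avl, VLMAX) elements. */
static size_t model_vsetvl(size_t avl)
{
    return (avl < HYPOTHETICAL_VLMAX) ? avl : HYPOTHETICAL_VLMAX;
}

/* Walk a row of `width` elements strip by strip and record each strip's
   length in vl_out. Returns the number of strips recorded. */
static size_t strip_lengths(size_t width, size_t *vl_out, size_t max_strips)
{
    size_t strips = 0;
    for (size_t idx = 0; idx < width && strips < max_strips; ) {
        size_t vl = model_vsetvl(width - idx);
        /* A real loop would load, compute and store elements
           idx .. idx + vl - 1 here. */
        vl_out[strips++] = vl;
        idx += vl;
    }
    return strips;
}
```

With a width of 37 and the assumed maximum of 16, the row is covered by strips of 16, 16 and 5 elements, so no separate scalar tail loop is needed.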

2.3 Assemble the Output Blocks (Part (C))

for(int i=0; i<width; i+=2){
    int i_next = i+1;
    if((width&1)&& i==width-1){
        i_next = i; // Repeat
    }
    // p1 => R1[i], p2 => R2[i], p3 => R1[i_next], p4 => R2[i_next]
    // y0=>Y1[i], y1=>Y2[i], y2=>Y1[i_next], y3=>Y2[i_next]
    // sum_r= r1+r2+r3+r4 => cb, cr => clamp => output
}

This step processes two pixels at a time (i += 2). For odd-width cases, when i == width-1, i_next = i ensures the last pixel is repeated, avoiding out-of-bounds access.

The final block is written to dst, and the number of processed blocks (written) is updated.

test.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>   // uint8_t
#include <string.h>
#define QOY_IMPLEMENTATION
#include "qoy.h"      // Declares qoy_rgba_to_ycbcra and qoy_rgba_to_ycbcra_rvv

// Generate test input data
void generate_test_data(unsigned char *data, int width, int height, int channels) {
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int idx = (y * width + x) * channels;
            data[idx] = x % 256;           // R
            data[idx + 1] = y % 256;       // G
            data[idx + 2] = (x + y) % 256; // B
            if (channels == 4) {
                data[idx + 3] = 255;       // A
            }
        }
    }
}

// Compare output results and output "match" or "mismatch" per block
void compare_outputs(const uint8_t *ref_out, const uint8_t *rvv_out, int num_bytes, int block_size) {
    printf("Comparing Outputs...\n");
    int num_blocks = num_bytes / block_size;
    for (int block = 0; block < num_blocks; block++) {
        int match = 1; // Assume match initially
        for (int i = 0; i < block_size; i++) {
            if (ref_out[block * block_size + i] != rvv_out[block * block_size + i]) {
                match = 0; // Found a mismatch
                break;
            }
        }
        if (match) {
            printf("Block %d: match\n", block);
        } else {
            printf("Block %d: mismatch\n", block);
        }
    }
}

// Execute test function
void run_test(int width, int height, int channels_in, int channels_out) {
    printf("Testing with Width=%d, Height=%d, Channels=%d->%d\n", width, height, channels_in, channels_out);

    int block_size = (channels_out == 4) ? 10 : 6; // Output size per block

    // Allocate input and output buffers
    uint8_t *rgba_input = (uint8_t *)malloc(width * height * channels_in);
    uint8_t *ref_output = (uint8_t *)malloc(width * height * block_size / 2);
    uint8_t *rvv_output = (uint8_t *)malloc(width * height * block_size / 2);

    // Initialize test input data
    generate_test_data(rgba_input, width, height, channels_in);

    // Execute reference function and RVV function
    int ref_bytes = qoy_rgba_to_ycbcra(rgba_input, width, height, channels_in, channels_out, ref_output);
    int rvv_bytes = qoy_rgba_to_ycbcra_rvv(rgba_input, width, height, channels_in, channels_out, rvv_output);

    // Verify that the output sizes are consistent
    if (ref_bytes != rvv_bytes) {
        printf("Error: Output sizes differ! Ref=%d bytes, RVV=%d bytes\n", ref_bytes, rvv_bytes);
        free(rgba_input);
        free(ref_output);
        free(rvv_output);
        return;
    }

    // Compare outputs
    compare_outputs(ref_output, rvv_output, ref_bytes, block_size);

    // Free memory
    free(rgba_input);
    free(ref_output);
    free(rvv_output);
}

int main() {
    printf("Running QOY Conversion Tests with Various Sizes...\n");

    // Test multiple input sizes
    int test_sizes[][2] = {
        {4, 4},   // Basic test
        {5, 4},   // Odd width
        {4, 5},   // Odd height
        {5, 5},   // Both width and height odd
    };

    int num_tests = sizeof(test_sizes) / sizeof(test_sizes[0]);
    for (int i = 0; i < num_tests; i++) {
        int width = test_sizes[i][0];
        int height = test_sizes[i][1];
        run_test(width, height, 4, 4); // Test RGBA to YCbCrA
    }

    printf("All Tests Completed.\n");
    return 0;
}

1. Compilation Command

riscv32-unknown-linux-gnu-gcc test.c -std=gnu99 -march=rv32gcv -mabi=ilp32d -O0 -lpng -lz -o test.out

2. Execution Command

qemu-riscv32 -L $HOME/riscv-gnu-toolchain/build_linux/sysroot ./qoy_rvvintrinsic/test.out

3. Output

Running QOY Conversion Tests with Various Sizes...
Testing with Width=4, Height=4, Channels=4->4
Comparing Outputs...
Block 0: match
Block 1: match
Block 2: match
Block 3: match
Testing with Width=5, Height=4, Channels=4->4
Comparing Outputs...
Block 0: match
Block 1: match
Block 2: mismatch
Block 3: mismatch
Block 4: mismatch
Block 5: mismatch
Testing with Width=4, Height=5, Channels=4->4
Comparing Outputs...
Block 0: match
Block 1: match
Block 2: match
Block 3: match
Block 4: match
Block 5: match
Testing with Width=5, Height=5, Channels=4->4
Comparing Outputs...
Block 0: match
Block 1: match
Block 2: mismatch
Block 3: mismatch
Block 4: mismatch
Block 5: mismatch
Block 6: mismatch
Block 7: mismatch
Block 8: mismatch
All Tests Completed.

The outputs match for even widths but diverge for odd widths, and I have not yet identified the cause. Since the current code handles even-width images correctly, I use it as is to test whether it improves the efficiency of image conversion.
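
To narrow down a mismatch like the odd-width case above, it can help to know which byte differs first, not just which block. The helpers below are a hedged sketch: `first_mismatch` and `report_first_mismatch` are hypothetical names and are not part of test.c.

```c
#include <stdint.h>
#include <stdio.h>

/* Return the index of the first differing byte, or -1 if the buffers match. */
static long first_mismatch(const uint8_t *ref, const uint8_t *out, long num_bytes)
{
    for (long i = 0; i < num_bytes; i++) {
        if (ref[i] != out[i]) {
            return i;
        }
    }
    return -1;
}

/* Print where the first difference sits: absolute byte, block index,
   and offset inside that block. */
static void report_first_mismatch(const uint8_t *ref, const uint8_t *out,
                                  long num_bytes, int block_size)
{
    long i = first_mismatch(ref, out, num_bytes);
    if (i < 0) {
        printf("outputs identical\n");
        return;
    }
    printf("first mismatch: byte %ld (block %ld, offset %ld): ref=%u out=%u\n",
           i, i / block_size, i % block_size,
           (unsigned)ref[i], (unsigned)out[i]);
}
```

It could be called in run_test with the same arguments as compare_outputs. The offset inside the block shows which field of the block differs first, which helps to tell a wrong value in one field apart from a block written at a shifted position.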

Benchmark

Before benchmarking, I confirmed that both versions of the conversion produce identical output for every image in the folder.

Simple benchmark suite for qoy
Requires libpng, "stb_image.h" and "stb_image_write.h", "qoi.h"

Set Up the Sysroot Environment
The sysroot is a directory that mimics the root filesystem of the target architecture (RISC-V in this case). It contains all the necessary headers and libraries required for cross-compilation.

a. Locate the Sysroot
If you built the toolchain as shown above, the sysroot is typically located at $HOME/riscv32/sysroot. If not, you might need to specify or create one.

b. Prepare the Sysroot with Necessary Libraries
You need to ensure that libpng and zlib are available in the RISC-V sysroot. Here's how to do it:

i. Install Dependencies for Building Libraries

sudo apt install cmake libtool

ii. Cross-Compile zlib for RISC-V

Download zlib Source Code

wget https://zlib.net/zlib-1.3.1.tar.gz
tar -xvf zlib-1.3.1.tar.gz
cd zlib-1.3.1

Configure and Build

# Set environment variables for cross-compilation
export CC=riscv32-unknown-linux-gnu-gcc
export AR=riscv32-unknown-linux-gnu-ar
export RANLIB=riscv32-unknown-linux-gnu-ranlib
# Configure with sysroot
./configure --prefix=$HOME/riscv32/sysroot/usr --static

# Build and install
make
make install

iii. Cross-Compile libpng for RISC-V

Download libpng Source Code

cd ..
wget https://download.sourceforge.net/libpng/libpng-1.6.39.tar.gz
tar -xvf libpng-1.6.39.tar.gz
cd libpng-1.6.39

Configure and Build

# Set environment variables for cross-compilation
export CC=riscv32-unknown-linux-gnu-gcc
export AR=riscv32-unknown-linux-gnu-ar
export RANLIB=riscv32-unknown-linux-gnu-ranlib

# Configure with sysroot and zlib
./configure --prefix=$HOME/riscv-gnu-toolchain/build_linux/sysroot/usr --host=riscv32-unknown-linux-gnu --with-zlib-prefix=$HOME/riscv-gnu-toolchain/build_linux/sysroot/usr --enable-static --disable-shared

# Build and install
make install

benchmark.c

/*******************************************************************************
 * benchmark.c
 *
 * A benchmark that:
 *   1. Takes either a .png file or a directory containing .png files.
 *   2. For each PNG found:
 *       a) Reads it via stb_image (=> RGBA).
 *       b) Calls qoy_rgba_to_ycbcra (pure C) multiple times, measures time.
 *       c) Calls qoy_rgba_to_ycbcra_rvv (RVV) multiple times, measures time.
 *       d) Prints the average times for that file.
 *   3. At the end, prints a "global average" across all files.
 *
 *******************************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <dirent.h>     // For opendir, readdir

//------------------------------[ STB_IMAGE ]-----------------------------------
// Define STB_IMAGE_IMPLEMENTATION and include only PNG support
#define STB_IMAGE_IMPLEMENTATION
#define STBI_ONLY_PNG
#include "stb_image.h"

//------------------------------[ QOY ]-----------------------------------------
// Define QOY_IMPLEMENTATION to include the implementation
#define QOY_IMPLEMENTATION
#include "qoy.h"

//------------------------------[ Timer ]---------------------------------------
#if defined(__APPLE__)
  #include <mach/mach_time.h>
#elif defined(__linux__)
  #include <time.h>
#elif defined(_WIN32)
  #include <windows.h>
#endif

static uint64_t ns(void) {
#if defined(__APPLE__)
    static mach_timebase_info_data_t info;
    static int init=0;
    if(!init) {
        mach_timebase_info(&info);
        init=1;
    }
    uint64_t now = mach_absolute_time();
    now = now * info.numer / info.denom;
    return now;
#elif defined(__linux__)
    struct timespec spec;
    clock_gettime(CLOCK_MONOTONIC, &spec);
    return (uint64_t)spec.tv_sec * 1000000000ULL + (uint64_t)spec.tv_nsec;
#elif defined(_WIN32)
    static LARGE_INTEGER freq;
    static int init=0;
    if(!init){
        QueryPerformanceFrequency(&freq);
        init=1;
    }
    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    return (uint64_t)(1000000000ULL* now.QuadPart / freq.QuadPart);
#else
    return (uint64_t)clock();
#endif
}

// -----------------------------------------------------------------------------
// Global variables for accumulating the total processing times (C and RVV)
static uint64_t g_sum_c_time_ns   = 0;  // Total "pure C conversion" time across all files (in ns)
static uint64_t g_sum_rvv_time_ns = 0;  // Total "RVV conversion" time across all files (in ns)
static int      g_image_count     = 0;  // Number of PNG files processed
static int      g_runs            = 1;  // Global variable to store runs (optional)

/* 
   benchmark_image():
   Reads a PNG => converts to RGBA; performs multiple (runs) pure C + RVV 
   conversions, measures time, prints the results for that file, and accumulates
   the times into the global statistics (g_sum_c_time_ns / g_sum_rvv_time_ns).
*/
static void benchmark_image(const char* path, int runs) {
    int width, height, comp;
    unsigned char* rgba = stbi_load(path, &width, &height, &comp, 4);
    if(!rgba) {
        printf("Error: failed to load PNG: %s\n", path);
        return;
    }
    printf("[File] %s => %dx%d, forced RGBA=4\n", path, width, height);

    // (1) Allocate output buffer
    int block_size = 10; // Output bytes per block when channels_out == 4
    int outbuf_size = ((width + 1) >> 1) * height * block_size; // Upper bound; odd widths round up
    unsigned char* out_c   = (unsigned char*)malloc(outbuf_size);
    unsigned char* out_rvv = (unsigned char*)malloc(outbuf_size);

    // (2) Perform warm-up in advance (optional)
    qoy_rgba_to_ycbcra(rgba, width, height, 4, 4, out_c);
    qoy_rgba_to_ycbcra_rvv(rgba, width, height, 4, 4, out_rvv);

    // (3) Measure the times for each conversion
    uint64_t sum_c=0, sum_rvv=0;
    for(int i=0; i<runs; i++){
        // C version
        memset(out_c, 0, outbuf_size);
        uint64_t t0 = ns();
        qoy_rgba_to_ycbcra(rgba, width, height, 4, 4, out_c);
        uint64_t t1 = ns();
        sum_c += (t1 - t0);

        // RVV version
        memset(out_rvv, 0, outbuf_size);
        uint64_t t2 = ns();
        qoy_rgba_to_ycbcra_rvv(rgba, width, height, 4, 4, out_rvv);
        uint64_t t3 = ns();
        sum_rvv += (t3 - t2);
    }

    double avg_c   = (double)sum_c   / runs / 1.0e6; // ms
    double avg_rvv = (double)sum_rvv / runs / 1.0e6;
    printf("Runs=%d | C=%.3f ms, RVV=%.3f ms\n", runs, avg_c, avg_rvv);

    // (4) Add to global statistics (sum_c, sum_rvv)
    g_sum_c_time_ns   += sum_c;    // sum_c is still the total for "runs" times (in ns)
    g_sum_rvv_time_ns += sum_rvv;
    g_image_count++; 

    free(rgba);
    free(out_c);
    free(out_rvv);
}

/* 
   benchmark_directory():
   Opens a directory (non-recursively), finds .png files, 
   and calls benchmark_image() for each PNG.
*/
static void benchmark_directory(const char* dirpath, int runs) {
    DIR* dp = opendir(dirpath);
    if(!dp) {
        printf("Could not open directory: %s\n", dirpath);
        return;
    }

    struct dirent* ent;
    while((ent = readdir(dp)) != NULL) {
        if(!strcmp(ent->d_name, ".") || !strcmp(ent->d_name, "..")) {
            continue;
        }
        char filepath[1024];
        snprintf(filepath, sizeof(filepath), "%s/%s", dirpath, ent->d_name);

        // Check if the file extension is .png
        size_t len = strlen(ent->d_name);
        if(len>4 && strcmp(ent->d_name + (len-4), ".png")==0) {
            benchmark_image(filepath, runs);
        }
        // If recursion for subdirectories is needed => if(ent->d_type==DT_DIR)...(call benchmark_directory)
    }

    closedir(dp);
}

//------------------------------[ main ]----------------------------------------
int main(int argc, char** argv) {
    if(argc < 3) {
        printf("Usage: %s <file_or_directory> <runs>\n", argv[0]);
        return 1;
    }

    const char* input_path = argv[1];
    g_runs = atoi(argv[2]);  // Store in global variable
    if(g_runs <= 0) g_runs=1;

    // Determine whether input_path is a file or directory
    struct stat st;
    if(stat(input_path, &st)==0) {
        if(S_ISDIR(st.st_mode)) {
            // If it's a directory => traverse it
            benchmark_directory(input_path, g_runs);
        }
        else if(S_ISREG(st.st_mode)) {
            // If it's a file => treat it as a PNG
            benchmark_image(input_path, g_runs);
        }
        else {
            printf("Input path is neither file nor directory???\n");
        }
    } else {
        printf("Cannot stat: %s\n", input_path);
    }

    // ---------- Print "global average" here ----------
    if(g_image_count>0) {
        // g_sum_c_time_ns, g_sum_rvv_time_ns are the "runs total" for all files
        double avg_c   = (double)g_sum_c_time_ns / (double)(g_image_count*g_runs);
        double avg_rvv = (double)g_sum_rvv_time_ns / (double)(g_image_count*g_runs);
        // Convert to milliseconds
        avg_c   /= 1.0e6;
        avg_rvv /= 1.0e6;

        printf("===== Global Average across %d PNG(s) =====\n", g_image_count);
        printf("C   version: %.3f ms\n", avg_c);
        printf("RVV version: %.3f ms\n", avg_rvv);
    } else {
        printf("No PNG files were processed.\n");
    }

    return 0;
}

1. Compilation Command

riscv32-unknown-linux-gnu-gcc benchmark.c -std=gnu99 -march=rv32gcv -mabi=ilp32d -O0 -lpng -lz -lm -o benchmark.out

2. Execution Command

qemu-riscv32 -L $HOME/riscv-gnu-toolchain/build_linux/sysroot ./qoy_rvvintrinsic/benchmark.out ./qoi_benchmark_suite/images/textures_pk 1

Result

icon_512

optimization level : -O0
```
===== Global Average across 213 PNG(s) =====
C   version: 5.885 ms
RVV version: 126.008 ms
```
At -O0 the compiler performs almost no optimization. Straightforward scalar C code tolerates this comparatively well, but RVV intrinsic code does not: without optimization, intermediate vector values are typically stored to and reloaded from memory, and redundant VL-setting instructions are not eliminated. The result is a large number of unnecessary load/store operations and extra vsetvli overhead.

RVV intrinsic code therefore needs a higher optimization level before its instruction sequences and register allocation become reasonable. It is recommended to compile RVV code with at least -O2.

optimization level : -O2
```
  ===== Global Average across 213 PNG(s) =====
  C   version: 2.162 ms
  RVV version: 12.286 ms
```
optimization level : -O3
```
  ===== Global Average across 213 PNG(s) =====
  C   version: 63.954 ms
  RVV version: 13.177 ms
```

textures_pk

optimization level : -O0

===== Global Average across 1002 PNG(s) =====
C   version: 0.999 ms
RVV version: 21.944 ms

optimization level : -O2

===== Global Average across 1002 PNG(s) =====
C   version: 0.363 ms
RVV version: 2.163 ms

optimization level : -O3

===== Global Average across 1002 PNG(s) =====
C   version: 10.647 ms
RVV version: 2.332 ms
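
The six measurements are easier to compare as ratios. The snippet below is a small sketch that computes the RVV time divided by the C time for each run. The type `result_t` and the function names are hypothetical, and the first icon_512 entry is assumed to be the -O0 build from the compilation command above.

```c
#include <stdio.h>

/* One benchmark result, copied from the "Global Average" outputs. */
typedef struct {
    const char *image_set;
    const char *opt_level;
    double c_ms;
    double rvv_ms;
} result_t;

static const result_t results[] = {
    {"icon_512",    "-O0",  5.885, 126.008}, /* level assumed from the compile command */
    {"icon_512",    "-O2",  2.162,  12.286},
    {"icon_512",    "-O3", 63.954,  13.177},
    {"textures_pk", "-O0",  0.999,  21.944},
    {"textures_pk", "-O2",  0.363,   2.163},
    {"textures_pk", "-O3", 10.647,   2.332},
};

/* Ratio above 1 means the RVV version was slower than the C version. */
static double rvv_over_c(const result_t *r)
{
    return r->rvv_ms / r->c_ms;
}

static void print_ratios(void)
{
    size_t n = sizeof(results) / sizeof(results[0]);
    for (size_t i = 0; i < n; i++) {
        printf("%-12s %s  RVV/C = %.2f\n",
               results[i].image_set, results[i].opt_level,
               rvv_over_c(&results[i]));
    }
}
```

Under QEMU, the RVV version is about 21 to 22 times slower than the C version at -O0 and about 6 times slower at -O2. At -O3 the ratio drops to about 0.2, but only because the C time rises sharply while the RVV time stays close to its -O2 value.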

Analysis

Pure C Generated with -O3
Under -O3, the compiler applies automatic vectorization, more aggressive function inlining, and loop unrolling, all of which can significantly change the instruction count and the memory access pattern. Because the build targets rv32gcv, the auto-vectorizer may also emit RVV instructions for the pure C code.

A simulator has to translate or interpret every guest instruction, so longer or more complex RISC-V instruction sequences can require more host work. This is a plausible reason why the -O3 build of the pure C code runs slower under QEMU than the -O2 build, and even slower than the -O0 build.

This does not mean that -O3 would be slower on actual hardware. On a real CPU, a longer but more efficient instruction sequence is likely to be faster than the -O2 code. QEMU's run time mainly reflects the cost of emulating each instruction rather than hardware cycle counts, so it can rank the builds differently from real hardware.

RVV Under -O3
Under -O3, the compiler applies its most aggressive optimizations to code containing RVV intrinsics, such as removing redundant vsetvli instructions, merging load/store operations, and unrolling loops. These optimizations do not always reduce the simulator's workload, and can sometimes increase it, but in some cases they noticeably cut the number of vector instructions and the associated overhead.

Under -O2, the compiler might optimize vector intrinsics less aggressively and leave some redundant operations in place. Depending on the generated instruction patterns, this can perform better or worse than -O3. In these measurements the two levels are close: 12.286 ms versus 13.177 ms on icon_512, and 2.163 ms versus 2.332 ms on textures_pk.

In summary, emulating RVV instructions is inherently expensive for a simulator. If -O3 reduces the instruction count or tightens the loops, it can outperform -O2. Conversely, if -O3 inlines large functions in a way that worsens the arrangement of vector instructions, it can perform worse than -O2.

However, the way the RVV version is written probably matters more than the optimization level. If RVV still performs worse than pure C on real hardware or at high optimization levels, then the program structure, the use of intrinsics, or the compiler-generated code likely has considerable room for improvement.
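
One structural candidate is the channel separation step, which the current version performs with a scalar loop before any vector work starts. RVV segment loads can deinterleave RGBA data directly into registers. The sketch below is an untested suggestion, not the project's code: `split_rgba_row` is a hypothetical helper, it assumes exactly 4 input channels, and it assumes a toolchain whose intrinsics provide the tuple type `vuint8m1x4_t`. On a target without the vector extension it falls back to a scalar loop so that it still compiles.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

/* Hypothetical helper: deinterleave one RGBA row (4 channels) into
   planar R, G, B and A buffers of `width` bytes each. */
static void split_rgba_row(const uint8_t *line, size_t width,
                           uint8_t *R, uint8_t *G, uint8_t *B, uint8_t *A)
{
#if defined(__riscv_vector)
    for (size_t x = 0; x < width; ) {
        size_t vl = __riscv_vsetvl_e8m1(width - x);
        /* One segment load reads vl RGBA pixels and splits the four
           channels into four vector registers. */
        vuint8m1x4_t px = __riscv_vlseg4e8_v_u8m1x4(line + 4 * x, vl);
        __riscv_vse8_v_u8m1(R + x, __riscv_vget_v_u8m1x4_u8m1(px, 0), vl);
        __riscv_vse8_v_u8m1(G + x, __riscv_vget_v_u8m1x4_u8m1(px, 1), vl);
        __riscv_vse8_v_u8m1(B + x, __riscv_vget_v_u8m1x4_u8m1(px, 2), vl);
        __riscv_vse8_v_u8m1(A + x, __riscv_vget_v_u8m1x4_u8m1(px, 3), vl);
        x += vl;
    }
#else
    /* Scalar fallback for targets without the vector extension. */
    for (size_t x = 0; x < width; x++) {
        R[x] = line[4 * x + 0];
        G[x] = line[4 * x + 1];
        B[x] = line[4 * x + 2];
        A[x] = line[4 * x + 3];
    }
#endif
}
```

Whether this helps would have to be measured, ideally on real hardware, since QEMU timings say little about vector memory performance. A further step in the same direction would be to keep the channels in registers and compute Y directly from the segment-load result, which would avoid the intermediate planar buffers altogether.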

Reference

QOY - The "Quite OK YCbCr420A" format for fast, lossless* image compression
QOI - The “Quite OK Image Format” for fast, lossless image compression
Simple RISC-V Vector example
RISC-V Vector Intrinsic Document