RVV-accelerated Image Codec

洪至謙, 曾遠哲

What is RVV-accelerated?

The RISC-V Vector Extension (RVV) is a key component of the RISC-V instruction set architecture, providing efficient vector computation capabilities. "RVV-accelerated" means that the hot loops of a program are written with RVV instructions.

Scalable Vector Length

Supports various vector register lengths (VLEN), allowing flexibility across different hardware platforms.
Dynamically sets the vector length using the vsetvli instruction.
Enables support for vector operations of varying lengths within the same code.
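
The points above describe what is usually called a strip-mined loop. The following is a minimal sketch in portable C that only emulates the idea; `emulated_vsetvl` and `strip_mined_add` are hypothetical names, `vlmax` stands in for the hardware limit, and the inner `for` loop plays the role of a single vector instruction. Real hardware may return a different length per iteration than this emulation does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for vsetvli: how many elements the next iteration
 * may process, given the remaining count and the hardware maximum. */
static size_t emulated_vsetvl(size_t remaining, size_t vlmax) {
    return remaining < vlmax ? remaining : vlmax;
}

/* Adds two byte arrays in strips of at most vlmax elements (vlmax > 0). */
static void strip_mined_add(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                            size_t n, size_t vlmax) {
    while (n > 0) {
        size_t vl = emulated_vsetvl(n, vlmax);
        for (size_t i = 0; i < vl; i++) {   /* one "vector instruction" */
            dst[i] = (uint8_t)(a[i] + b[i]);
        }
        a += vl;
        b += vl;
        dst += vl;
        n -= vl;                            /* the last strip may be shorter */
    }
}
```

The same loop body works unchanged for any `vlmax`, which is the property that lets one binary run on hardware with different VLEN.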

Vector Operation Instructions

Supports basic arithmetic operations.
Bitwise operations.
Load/store instructions.
Vector reduction operations.
Vector masking operations.

What is QOI?

QOI (Quite OK Image Format) is a lossless image compression format designed for simplicity and speed.

Speed: Encoding is reported to be 20x-50x faster than stb_image_write, and decoding 3x-4x faster than stb_image.

Supports RGB and RGBA: Handles images with and without an alpha channel.

QOI file structure

Header (14 bytes):
1. magic bytes ("qoif")
  The string "qoif" identifies the file as a valid QOI file.
2. image width
3. height
4. number of channels
  3 indicates that the image uses the RGB color mode.
  4 indicates that the image uses the RGBA color mode.
5. colorspace info
  0 indicates sRGB.
  1 indicates linear RGB.

qoi_header {    
	char magic[4];       // magic bytes "qoif"
	uint32_t width;      // image width in pixels (BE)    
	uint32_t height;     // image height in pixels (BE)    
	uint8_t  channels;   // 3 = RGB, 4 = RGBA    
	uint8_t  colorspace; // 0 = sRGB with linear alpha
	                     // 1 = all channels linear 
	};
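
To make the byte layout concrete, here is a small sketch that packs and unpacks the 14-byte header with big-endian width and height. The function names (`put_be32`, `get_be32`, `qoi_header_pack`, `qoi_header_unpack`) are hypothetical helpers for illustration and are not part of qoi.h.

```c
#include <stdint.h>
#include <string.h>

#define QOI_HEADER_SIZE 14

/* Writes v as 4 big-endian bytes. */
static void put_be32(unsigned char *out, uint32_t v) {
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Reads 4 big-endian bytes. */
static uint32_t get_be32(const unsigned char *in) {
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}

/* Fills out[0..13] with the header fields in file order. */
static void qoi_header_pack(unsigned char out[QOI_HEADER_SIZE], uint32_t width,
                            uint32_t height, uint8_t channels, uint8_t colorspace) {
    memcpy(out, "qoif", 4);
    put_be32(out + 4, width);
    put_be32(out + 8, height);
    out[12] = channels;
    out[13] = colorspace;
}

/* Returns 1 and fills the fields if the header is valid, 0 otherwise. */
static int qoi_header_unpack(const unsigned char in[QOI_HEADER_SIZE], uint32_t *width,
                             uint32_t *height, uint8_t *channels, uint8_t *colorspace) {
    if (memcmp(in, "qoif", 4) != 0) {
        return 0;
    }
    *width = get_be32(in + 4);
    *height = get_be32(in + 8);
    *channels = in[12];
    *colorspace = in[13];
    return *width != 0 && *height != 0 &&
           (*channels == 3 || *channels == 4) && *colorspace <= 1;
}
```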

Images are encoded row by row, left to right, top to bottom.

The encoder and decoder both start with {r: 0, g: 0, b: 0, a: 255} as the previous pixel value.
When all $width \times height$ pixels have been filled, the image is complete.

Pixels are encoded as

a run of the previous pixel
(run length encoding)
an index into an array of previously seen pixels
a difference to the previous pixel value in r,g,b
(difference must be very small, images with anti-aliasing)
full r,g,b or r,g,b,a values

Note: the color channels are assumed not to be premultiplied with the alpha channel.

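The position of a pixel in the 64-entry array of previously seen pixels is computed as:

$index\_position = (r \times 3 + g \times 5 + b \times 7 + a \times 11) \bmod 64$

This is a simple hash function. It is cheap to compute and is intended to spread colors across the array, although different colors can still collide.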
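
As a worked example, the sketch below computes the index position in two ways: with ordinary integer arithmetic, and with 8-bit wraparound arithmetic followed by a mask. Because 64 divides 256, both give the same result, which is what makes the hash convenient for 8-bit vector lanes. The function names are hypothetical.

```c
#include <stdint.h>

/* Index position using full-width integer arithmetic. */
static unsigned qoi_index_position(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
    return (r * 3u + g * 5u + b * 7u + a * 11u) % 64u;
}

/* Same value using only the low 8 bits of the sum, then masking to 6 bits. */
static unsigned qoi_index_position_8bit(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
    uint8_t sum = (uint8_t)(r * 3u + g * 5u + b * 7u + a * 11u);
    return sum & 63u;
}
```

For the initial pixel {r: 0, g: 0, b: 0, a: 255} the sum is 2805, so the index position is 53.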

Every chunk starts with a 2-bit or 8-bit tag, followed by some data bits.
All chunks are byte aligned (the bit length of every chunk is divisible by 8).
All values are encoded with the most significant bit on the left.
The 8-bit tags have precedence over the 2-bit tags.
(A decoder must check for the presence of an 8-bit tag first.)
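
The precedence rule can be sketched as a small classifier that looks at the first byte of a chunk. The enum and function names are hypothetical; the tag values are the ones listed in this document.

```c
/* Chunk kinds, named after the QOI_OP_* tags. */
enum qoi_chunk_kind {
    CHUNK_INDEX,
    CHUNK_DIFF,
    CHUNK_LUMA,
    CHUNK_RUN,
    CHUNK_RGB,
    CHUNK_RGBA
};

/* Classifies a chunk from its first byte. The two 8-bit tags are tested
 * first because their top two bits (11) would otherwise match QOI_OP_RUN. */
static enum qoi_chunk_kind qoi_classify(unsigned char b1) {
    if (b1 == 0xfe) return CHUNK_RGB;
    if (b1 == 0xff) return CHUNK_RGBA;
    switch (b1 & 0xc0) {
    case 0x00: return CHUNK_INDEX;
    case 0x40: return CHUNK_DIFF;
    case 0x80: return CHUNK_LUMA;
    default:   return CHUNK_RUN;
    }
}
```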

Data Chunks:

QOI_OP_RGBA/QOI_OP_RGB

QOI_OP_RGB

Byte[0]	Byte[1]	Byte[2]	Byte[3]
7 6 5 4 3 2 1 0	$7 \dots 0$	$7 \dots 0$	$7 \dots 0$
1 1 1 1 1 1 1 0	red	green	blue

QOI_OP_RGBA

Byte[0]	Byte[1]	Byte[2]	Byte[3]	Byte[4]
7 6 5 4 3 2 1 0	$7 \dots 0$	$7 \dots 0$	$7 \dots 0$	$7 \dots 0$
1 1 1 1 1 1 1 1	red	green	blue	alpha

QOI_OP_INDEX:

Byte[0]	-	-	-	-	-	-	-
7	6	5	4	3	2	1	0
0	0	index	-	-	-	-	-

QOI_OP_DIFF:

Byte[0]	-	-	-	-	-	-	-
7	6	5	4	3	2	1	0
0	1	dr	-	dg	-	db	-

2-bit tag b01

2-bit red channel difference from the previous pixel -2..1

2-bit green channel difference from the previous pixel -2..1

2-bit blue channel difference from the previous pixel -2..1

Differences to the current channel values use wraparound (modulo 256) arithmetic.
E.g.:
1 - 2 -> 255
255 + 1 -> 0

Values are stored as unsigned integers with a bias of 2.
E.g.:
-2 -> 0 (b00)
1 -> 3 (b11)
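
The bias and wraparound rules can be combined in a short sketch. The helper names (`wrap_diff`, `qoi_diff_encode`, `qoi_diff_decode`) are hypothetical; only the RGB channels are handled, since QOI_OP_DIFF leaves alpha unchanged.

```c
#include <stdint.h>

/* Difference cur - prev modulo 256, reinterpreted as a value in -128..127. */
static int wrap_diff(uint8_t cur, uint8_t prev) {
    uint8_t u = (uint8_t)(cur - prev);
    return u >= 128 ? (int)u - 256 : (int)u;
}

/* Returns 1 and stores the chunk byte in *out if every channel difference
 * lies in -2..1; returns 0 if QOI_OP_DIFF cannot represent the change. */
static int qoi_diff_encode(const uint8_t prev[3], const uint8_t cur[3], uint8_t *out) {
    int dr = wrap_diff(cur[0], prev[0]);
    int dg = wrap_diff(cur[1], prev[1]);
    int db = wrap_diff(cur[2], prev[2]);
    if (dr < -2 || dr > 1 || dg < -2 || dg > 1 || db < -2 || db > 1) {
        return 0;
    }
    /* Tag b01, then each difference stored with a bias of 2. */
    *out = (uint8_t)(0x40 | ((dr + 2) << 4) | ((dg + 2) << 2) | (db + 2));
    return 1;
}

/* Applies a QOI_OP_DIFF byte to px in place (wraparound via uint8_t cast). */
static void qoi_diff_decode(uint8_t px[3], uint8_t chunk) {
    px[0] = (uint8_t)(px[0] + ((chunk >> 4) & 0x03) - 2);
    px[1] = (uint8_t)(px[1] + ((chunk >> 2) & 0x03) - 2);
    px[2] = (uint8_t)(px[2] + (chunk & 0x03) - 2);
}
```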

QOI_OP_LUMA:

Byte[0]	-	-	-	-	-	-	-
7	6	5	4	3	2	1	0
1	0	diff green	-	-	-	-	-

Byte[1]	-	-	-	-	-	-	-
7	6	5	4	3	2	1	0
dr-dg	-	-	-	db-dg	-	-	-

2-bit tag b10

6-bit green channel difference from the previous pixel -32..31

4-bit red channel difference minus green channel difference -8..7

4-bit blue channel difference minus green channel difference -8..7

The green channel indicates the general direction of change and is encoded in 6 bits.

The red and blue channels (dr and db) base their differences on the green channel difference.
I.e.:
dr_dg = (cur_px.r - prev_px.r) - (cur_px.g - prev_px.g)
db_dg = (cur_px.b - prev_px.b) - (cur_px.g - prev_px.g)

Differences to the current channel values use wraparound (modulo 256) arithmetic.
E.g.:
10 - 13 -> 253
250 + 7 -> 1

Values are stored as unsigned integers with a bias of 32 for the green channel and a bias of 8 for the red and blue channel.
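
The following sketch encodes and decodes the two-byte chunk described above, using the same formulas for dr_dg and db_dg. The helper names are hypothetical, and alpha is assumed to be unchanged.

```c
#include <stdint.h>

/* Difference cur - prev modulo 256, reinterpreted as a value in -128..127. */
static int wrap_diff(uint8_t cur, uint8_t prev) {
    uint8_t u = (uint8_t)(cur - prev);
    return u >= 128 ? (int)u - 256 : (int)u;
}

/* Returns 1 and fills out[0..1] if the change fits: dg in -32..31 and
 * both dr_dg and db_dg in -8..7. Returns 0 otherwise. */
static int qoi_luma_encode(const uint8_t prev[3], const uint8_t cur[3], uint8_t out[2]) {
    int dg = wrap_diff(cur[1], prev[1]);
    int dr_dg = wrap_diff(cur[0], prev[0]) - dg;
    int db_dg = wrap_diff(cur[2], prev[2]) - dg;
    if (dg < -32 || dg > 31 || dr_dg < -8 || dr_dg > 7 || db_dg < -8 || db_dg > 7) {
        return 0;
    }
    out[0] = (uint8_t)(0x80 | (dg + 32));               /* tag b10, bias 32 */
    out[1] = (uint8_t)(((dr_dg + 8) << 4) | (db_dg + 8)); /* bias 8 each */
    return 1;
}

/* Applies a two-byte chunk to px in place (wraparound via uint8_t cast). */
static void qoi_luma_decode(uint8_t px[3], const uint8_t chunk[2]) {
    int dg = (chunk[0] & 0x3f) - 32;
    px[0] = (uint8_t)(px[0] + dg - 8 + ((chunk[1] >> 4) & 0x0f));
    px[1] = (uint8_t)(px[1] + dg);
    px[2] = (uint8_t)(px[2] + dg - 8 + (chunk[1] & 0x0f));
}
```

For example, going from (100, 100, 100) to (105, 110, 108) gives dg = 10, dr_dg = -5 and db_dg = -2, which is stored as the bytes 0xaa 0x36.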

QOI_OP_RUN:

Byte[0]	-	-	-	-	-	-	-
7	6	5	4	3	2	1	0
1	1	run	-	-	-	-	-

2-bit tag b11

6-bit run-length repeating the previous pixel 1..62

The run-length is stored with a bias of -1.

Note that the run-lengths 63 and 64 (b111110 and b111111) are illegal as they are occupied by the QOI_OP_RGB and QOI_OP_RGBA tags.
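
Because a single chunk can hold at most 62 repeats, longer runs have to be split. A minimal sketch, with hypothetical helper names:

```c
#include <stddef.h>

/* Writes the QOI_OP_RUN chunks needed for run_length repeats of the
 * previous pixel and returns the number of bytes written. The caller
 * must provide room for (run_length + 61) / 62 bytes. */
static size_t qoi_emit_run(unsigned char *out, unsigned run_length) {
    size_t n = 0;
    while (run_length > 0) {
        unsigned chunk = run_length > 62 ? 62 : run_length;
        out[n++] = (unsigned char)(0xc0 | (chunk - 1)); /* tag b11, bias -1 */
        run_length -= chunk;
    }
    return n;
}

/* Recovers the run-length (1..62) from a QOI_OP_RUN byte. */
static unsigned qoi_run_decode(unsigned char chunk) {
    return (chunk & 0x3fu) + 1u;
}
```

A run of 100 identical pixels therefore becomes two chunks: 0xfd (62 pixels) followed by 0xe5 (38 pixels).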

End Marker (8 bytes): seven 0x00 bytes followed by a single 0x01 byte.

QOI encoder - 曾遠哲

The following code is untested.
The program cannot read the binary file when run under the QEMU emulator. The cause is not yet clear. QEMU is used in user mode rather than system mode, so there should be no fully isolated virtual hardware separating the guest environment from the host.

Baseline implementation

qoi.h

// qoi.h
#ifndef QOI_H
#define QOI_H

#ifdef __cplusplus
extern "C" {
#endif

#define QOI_SRGB   0  // Standard RGB colorspace with linear alpha
#define QOI_LINEAR 1  // All channels are linear

// Description of the image - width, height, channels, and colorspace
typedef struct {
    unsigned int width;
    unsigned int height;
    unsigned char channels;    // 3 = RGB, 4 = RGBA
    unsigned char colorspace;  // 0 = sRGB, 1 = linear
} qoi_desc;

// Core encoding function: converts raw pixels to QOI format
void *qoi_encode(const void *data, const qoi_desc *desc, int *out_len);

// Core decoding function: converts QOI format back to raw pixels
void *qoi_decode(const void *data, int size, qoi_desc *desc, int channels);

// File handling convenience functions
int qoi_write(const char *filename, const void *data, const qoi_desc *desc);
void *qoi_read(const char *filename, qoi_desc *desc, int channels);

#ifdef __cplusplus
}
#endif
#endif // QOI_H

#ifdef QOI_IMPLEMENTATION

// Include necessary headers
#include <stdlib.h>
#include <string.h>

// If stdio functions are needed
#ifndef QOI_NO_STDIO
    #include <stdio.h>
#endif

// Allow custom memory management
#ifndef QOI_MALLOC
    #define QOI_MALLOC(sz) malloc(sz)
    #define QOI_FREE(p)    free(p)
#endif

// Allow custom array zeroing
#ifndef QOI_ZEROARR
    #define QOI_ZEROARR(a) memset((a),0,sizeof(a))
#endif

// Chunk type tags
#define QOI_OP_INDEX  0x00 // 00xxxxxx - 6-bit index into color array
#define QOI_OP_DIFF   0x40 // 01xxxxxx - 2-bit RGB channel differences
#define QOI_OP_LUMA   0x80 // 10xxxxxx - Larger RGB differences
#define QOI_OP_RUN    0xc0 // 11xxxxxx - Run of pixels
#define QOI_OP_RGB    0xfe // 11111110 - Full RGB values
#define QOI_OP_RGBA   0xff // 11111111 - Full RGBA values

#define QOI_MASK_2    0xc0 // Mask for 2-bit tag

// Hash function for the color index array
#define QOI_COLOR_HASH(C) (C.rgba.r*3 + C.rgba.g*5 + C.rgba.b*7 + C.rgba.a*11)

// Magic bytes for file identification
#define QOI_MAGIC \
    (((unsigned int)'q') << 24 | ((unsigned int)'o') << 16 | \
     ((unsigned int)'i') <<  8 | ((unsigned int)'f'))

#define QOI_HEADER_SIZE 14

// Maximum image size (400 million pixels) for safety
#define QOI_PIXELS_MAX ((unsigned int)400000000)

// Union for RGBA pixel manipulation
typedef union {
    struct { unsigned char r, g, b, a; } rgba;
    unsigned int v;
} qoi_rgba_t;

// End-of-stream marker
static const unsigned char qoi_padding[8] = {0,0,0,0,0,0,0,1};

// Helper functions for handling 32-bit values
static void qoi_write_32(unsigned char *bytes, int *p, unsigned int v) {
    bytes[(*p)++] = (0xff000000 & v) >> 24;
    bytes[(*p)++] = (0x00ff0000 & v) >> 16;
    bytes[(*p)++] = (0x0000ff00 & v) >> 8;
    bytes[(*p)++] = (0x000000ff & v);
}

static unsigned int qoi_read_32(const unsigned char *bytes, int *p) {
    unsigned int a = bytes[(*p)++];
    unsigned int b = bytes[(*p)++];
    unsigned int c = bytes[(*p)++];
    unsigned int d = bytes[(*p)++];
    return a << 24 | b << 16 | c << 8 | d;
}

// The core encoding function
void *qoi_encode(const void *data, const qoi_desc *desc, int *out_len) {
    int i, max_size, p, run;
    int px_len, px_end, px_pos, channels;
    unsigned char *bytes;
    const unsigned char *pixels;
    qoi_rgba_t index[64];
    qoi_rgba_t px, px_prev;

    // Validate input parameters
    if (data == NULL || out_len == NULL || desc == NULL ||
        desc->width == 0 || desc->height == 0 ||
        desc->channels < 3 || desc->channels > 4 ||
        desc->colorspace > 1 ||
        desc->height >= QOI_PIXELS_MAX / desc->width)
    {
        return NULL;
    }

    // Calculate maximum possible size
    max_size = desc->width * desc->height * (desc->channels + 1) +
               QOI_HEADER_SIZE + sizeof(qoi_padding);

    // Allocate output buffer
    bytes = (unsigned char *) QOI_MALLOC(max_size);
    if (!bytes) {
        return NULL;
    }

    // Write file header
    p = 0;
    qoi_write_32(bytes, &p, QOI_MAGIC);
    qoi_write_32(bytes, &p, desc->width);
    qoi_write_32(bytes, &p, desc->height);
    bytes[p++] = desc->channels;
    bytes[p++] = desc->colorspace;

    // Initialize encoding state
    pixels = (const unsigned char *)data;
    QOI_ZEROARR(index);
    run = 0;
    px_prev.rgba.r = 0;
    px_prev.rgba.g = 0;
    px_prev.rgba.b = 0;
    px_prev.rgba.a = 255;
    px = px_prev;

    // Calculate pixel parameters
    px_len = desc->width * desc->height * desc->channels;
    px_end = px_len - desc->channels;
    channels = desc->channels;

    // Main encoding loop
    for (px_pos = 0; px_pos < px_len; px_pos += channels) {
        // Read pixel values
        px.rgba.r = pixels[px_pos + 0];
        px.rgba.g = pixels[px_pos + 1];
        px.rgba.b = pixels[px_pos + 2];
        if (channels == 4) {
            px.rgba.a = pixels[px_pos + 3];
        }

        // Check for run of identical pixels
        if (px.v == px_prev.v) {
            run++;
            if (run == 62 || px_pos == px_end) {
                bytes[p++] = QOI_OP_RUN | (run - 1);
                run = 0;
            }
        }
        else {
            // End any current run
            if (run > 0) {
                bytes[p++] = QOI_OP_RUN | (run - 1);
                run = 0;
            }

            // Check index for previously seen pixel
            int index_pos = QOI_COLOR_HASH(px) % 64;

            if (index[index_pos].v == px.v) {
                bytes[p++] = QOI_OP_INDEX | index_pos;
            }
            else {
                // Store pixel in index
                index[index_pos] = px;

                // Check if we can encode a small difference
                if (px.rgba.a == px_prev.rgba.a) {
                    signed char vr = px.rgba.r - px_prev.rgba.r;
                    signed char vg = px.rgba.g - px_prev.rgba.g;
                    signed char vb = px.rgba.b - px_prev.rgba.b;

                    signed char vg_r = vr - vg;
                    signed char vg_b = vb - vg;

                    if (vr > -3 && vr < 2 &&
                        vg > -3 && vg < 2 &&
                        vb > -3 && vb < 2)
                    {
                        // Small difference - use QOI_OP_DIFF
                        bytes[p++] = QOI_OP_DIFF | 
                                   ((vr + 2) << 4) | 
                                   ((vg + 2) << 2) | 
                                   (vb + 2);
                    }
                    else if (vg_r > -9 && vg_r < 8 &&
                             vg > -33 && vg < 32 &&
                             vg_b > -9 && vg_b < 8)
                    {
                        // Larger difference - use QOI_OP_LUMA
                        bytes[p++] = QOI_OP_LUMA | (vg + 32);
                        bytes[p++] = ((vg_r + 8) << 4) | (vg_b + 8);
                    }
                    else {
                        // Full RGB values needed
                        bytes[p++] = QOI_OP_RGB;
                        bytes[p++] = px.rgba.r;
                        bytes[p++] = px.rgba.g;
                        bytes[p++] = px.rgba.b;
                    }
                }
                else {
                    // Alpha changed - need full RGBA
                    bytes[p++] = QOI_OP_RGBA;
                    bytes[p++] = px.rgba.r;
                    bytes[p++] = px.rgba.g;
                    bytes[p++] = px.rgba.b;
                    bytes[p++] = px.rgba.a;
                }
            }
        }
        px_prev = px;
    }

    // Write end marker
    for (i = 0; i < (int)sizeof(qoi_padding); i++) {
        bytes[p++] = qoi_padding[i];
    }

    *out_len = p;
    return bytes;
}

// Core decoding function implementing the inverse operations
void *qoi_decode(const void *data, int size, qoi_desc *desc, int channels) {
    const unsigned char *bytes;
    unsigned int header_magic;
    unsigned char *pixels;
    qoi_rgba_t index[64];
    qoi_rgba_t px;
    int px_len, chunks_len, px_pos;
    int p = 0, run = 0;

    // Input validation
    if (data == NULL || desc == NULL ||
        (channels != 0 && channels != 3 && channels != 4) ||
        size < QOI_HEADER_SIZE + (int)sizeof(qoi_padding))
    {
        return NULL;
    }

    // Parse header
    bytes = (const unsigned char *)data;
    header_magic = qoi_read_32(bytes, &p);
    desc->width = qoi_read_32(bytes, &p);
    desc->height = qoi_read_32(bytes, &p);
    desc->channels = bytes[p++];
    desc->colorspace = bytes[p++];

    // Validate header
    if (desc->width == 0 || desc->height == 0 ||
        desc->channels < 3 || desc->channels > 4 ||
        desc->colorspace > 1 ||
        header_magic != QOI_MAGIC ||
        desc->height >= QOI_PIXELS_MAX / desc->width)
    {
        return NULL;
    }

    // Set output channels
    if (channels == 0) {
        channels = desc->channels;
    }

    // Allocate pixel buffer
    px_len = desc->width * desc->height * channels;
    pixels = (unsigned char *) QOI_MALLOC(px_len);
    if (!pixels) {
        return NULL;
    }

    // Initialize decoder state
    QOI_ZEROARR(index);
    px.rgba.r = 0;
    px.rgba.g = 0;
    px.rgba.b = 0;
    px.rgba.a = 255;

    // Main decoding loop
    chunks_len = size - (int)sizeof(qoi_padding);
    for (px_pos = 0; px_pos < px_len; px_pos += channels) {
        if (run > 0) {
            run--;
        }
        else if (p < chunks_len) {
            int b1 = bytes[p++];

            if (b1 == QOI_OP_RGB) {
                px.rgba.r = bytes[p++];
                px.rgba.g = bytes[p++];
                px.rgba.b = bytes[p++];
            }
            else if (b1 == QOI_OP_RGBA) {
                px.rgba.r = bytes[p++];
                px.rgba.g = bytes[p++];
                px.rgba.b = bytes[p++];
                px.rgba.a = bytes[p++];
            }
            else if ((b1 & QOI_MASK_2) == QOI_OP_INDEX) {
                px = index[b1];
            }
            else if ((b1 & QOI_MASK_2) == QOI_OP_DIFF) {
                px.rgba.r += ((b1 >> 4) & 0x03) - 2;
                px.rgba.g += ((b1 >> 2) & 0x03) - 2;
                px.rgba.b += ( b1       & 0x03) - 2;
            }
            else if ((b1 & QOI_MASK_2) == QOI_OP_LUMA) {
                int b2 = bytes[p++];
                int vg = (b1 & 0x3f) - 32;
                px.rgba.r += vg - 8 + ((b2 >> 4) & 0x0f);
                px.rgba.g += vg;
                px.rgba.b += vg - 8 + (b2 & 0x0f);
            }
            else if ((b1 & QOI_MASK_2) == QOI_OP_RUN) {
                run = (b1 & 0x3f);
            }

            index[QOI_COLOR_HASH(px) % 64] = px;
        }

        // Write pixel values
        pixels[px_pos + 0] = px.rgba.r;
        pixels[px_pos + 1] = px.rgba.g;
        pixels[px_pos + 2] = px.rgba.b;
        if (channels == 4) {
            pixels[px_pos + 3] = px.rgba.a;
        }
    }

    return pixels;
}

// File handling functions if stdio is enabled
#ifndef QOI_NO_STDIO
// File I/O functions continued...
int qoi_write(const char *filename, const void *data, const qoi_desc *desc) {
    FILE *f = fopen(filename, "wb");
    int size, err;
    void *encoded;

    if (!f) {
        return 0;
    }

    // Encode the pixel data into QOI format
    encoded = qoi_encode(data, desc, &size);
    if (!encoded) {
        fclose(f);
        return 0;
    }

    // Write the encoded data to file
    fwrite(encoded, 1, size, f);
    fflush(f);
    err = ferror(f);
    fclose(f);

    QOI_FREE(encoded);
    return err ? 0 : size;
}

void *qoi_read(const char *filename, qoi_desc *desc, int channels) {
    FILE *f = fopen(filename, "rb");
    int size, bytes_read;
    void *pixels, *data;

    if (!f) {
        return NULL;
    }

    // Get file size
    fseek(f, 0, SEEK_END);
    size = ftell(f);
    if (size <= 0 || fseek(f, 0, SEEK_SET) != 0) {
        fclose(f);
        return NULL;
    }

    // Read entire file into memory
    data = QOI_MALLOC(size);
    if (!data) {
        fclose(f);
        return NULL;
    }

    // Read file content and decode
    bytes_read = fread(data, 1, size, f);
    fclose(f);
    pixels = (bytes_read != size) ? NULL : qoi_decode(data, bytes_read, desc, channels);
    QOI_FREE(data);
    return pixels;
}

#endif /* QOI_NO_STDIO */
#endif /* QOI_IMPLEMENTATION */

Read a PNG file (using stb_image.h) and convert it to QOI format (using vec.S).
main.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>

#define STB_IMAGE_IMPLEMENTATION
#define STBI_ONLY_PNG
#define STBI_NO_LINEAR
#include "stb_image.h"

#define QOI_IMPLEMENTATION
#include "qoi.h"

struct rgba_pixel {
    unsigned char r, g, b, a;
};

void encode_pixels_rvv(unsigned char *out, const struct rgba_pixel *pixels, int n);

static unsigned char* read_file(const char* filename, size_t* size_out) {
    FILE* f = fopen(filename, "rb");
    if (!f) {
        fprintf(stderr, "Failed to open %s: %s\n", filename, strerror(errno));
        return NULL;
    }

    struct stat st;
    if (fstat(fileno(f), &st) != 0) {
        fprintf(stderr, "Failed to stat %s: %s\n", filename, strerror(errno));
        fclose(f);
        return NULL;
    }

    unsigned char* buffer = malloc(st.st_size);
    if (!buffer) {
        fprintf(stderr, "Failed to allocate %ld bytes\n", (long)st.st_size);
        fclose(f);
        return NULL;
    }

    size_t bytes_read = fread(buffer, 1, st.st_size, f);
    if (bytes_read != (size_t)st.st_size) {
        fprintf(stderr, "Failed to read file: expected %ld bytes, got %ld\n",
                (long)st.st_size, (long)bytes_read);
        free(buffer);
        fclose(f);
        return NULL;
    }

    fclose(f);
    *size_out = st.st_size;
    return buffer;
}

static int write_file(const char* filename, const unsigned char* data, size_t size) {
    FILE* f = fopen(filename, "wb");
    if (!f) {
        fprintf(stderr, "Failed to create %s: %s\n", filename, strerror(errno));
        return 0;
    }

    size_t written = fwrite(data, 1, size, f);
    if (written != size) {
        fprintf(stderr, "Failed to write file: expected %ld bytes, wrote %ld\n",
                (long)size, (long)written);
        fclose(f);
        return 0;
    }

    fclose(f);
    return 1;
}

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "Usage: %s <input.png> <output.qoi>\n", argv[0]);
        return 1;
    }

    printf("Reading input file: %s\n", argv[1]);

    int width, height, channels;
    if (!stbi_info(argv[1], &width, &height, &channels)) {
        fprintf(stderr, "Failed to read PNG header: %s\n", stbi_failure_reason());
        return 1;
    }

    printf("Image: %dx%d, %d channels\n", width, height, channels);
    channels = 4; // Force RGBA

    unsigned char *png_data = stbi_load(argv[1], &width, &height, NULL, channels);
    if (!png_data) {
        fprintf(stderr, "Failed to load PNG: %s\n", stbi_failure_reason());
        return 1;
    }

    int pixel_count = width * height;
    struct rgba_pixel *pixels = malloc(pixel_count * sizeof(struct rgba_pixel));
    if (!pixels) {
        fprintf(stderr, "Failed to allocate pixel buffer\n");
        stbi_image_free(png_data);
        return 1;
    }

    // Convert to RGBA struct format
    for (int i = 0; i < pixel_count; i++) {
        pixels[i].r = png_data[i * 4 + 0];
        pixels[i].g = png_data[i * 4 + 1];
        pixels[i].b = png_data[i * 4 + 2];
        pixels[i].a = png_data[i * 4 + 3];
    }

    stbi_image_free(png_data);

    unsigned char *processed = malloc(pixel_count * sizeof(struct rgba_pixel));
    if (!processed) {
        fprintf(stderr, "Failed to allocate processing buffer\n");
        free(pixels);
        return 1;
    }

    printf("Processing pixels with RVV...\n");
    encode_pixels_rvv(processed, pixels, pixel_count);

    qoi_desc desc = {
        .width = width,
        .height = height,
        .channels = 4,
        .colorspace = QOI_SRGB
    };

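    // Encode the original RGBA pixels. `processed` holds the per-pixel hash
    // and run metadata written by encode_pixels_rvv, not pixel data.
    int qoi_size;
    void *qoi_data = qoi_encode(pixels, &desc, &qoi_size);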
    if (!qoi_data) {
        fprintf(stderr, "QOI encoding failed\n");
        free(pixels);
        free(processed);
        return 1;
    }

    printf("Writing output file: %s\n", argv[2]);
    if (!write_file(argv[2], qoi_data, qoi_size)) {
        free(pixels);
        free(processed);
        free(qoi_data);
        return 1;
    }

    free(pixels);
    free(processed);
    free(qoi_data);
    printf("Conversion successful\n");
    return 0;
}

RVV QOI encoder implementation
vec.S

# vec.S - RISC-V Vector Extension (RVV 1.0) pre-pass for QOI encoding
#
# For every pixel this routine writes two bytes of metadata:
#   - the QOI index hash (r*3 + g*5 + b*7 + a*11) % 64
#   - a run flag: 1 if the pixel equals its predecessor, 0 otherwise
# Output layout for each strip of vl pixels: vl hash bytes, then vl flag bytes.
# The output buffer must hold at least 2 * n bytes.

# Register Conventions (RV32, little-endian):
#   a0 = output buffer pointer
#   a1 = input pixel array pointer (assumed 4-byte aligned)
#   a2 = number of pixels to process
#   t0 = remaining pixels counter
#   t1 = vector length after vsetvli
#   t2 = scratch (addresses and scalar coefficients)
#   t3 = byte stride between pixels (4)
#   t4 = previous pixel packed into a 32-bit word
#   v0 = run detection mask
#   v4-v7 = R, G, B, A components (8-bit)
#   v8-v11 = current pixels as 32-bit words
#   v12-v15 = predecessor of each pixel as 32-bit words
#   v16 = pixel hash results
#   v24 = run flags expanded to one byte per pixel

    .text
    .balign 4
    .global encode_pixels_rvv

# void encode_pixels_rvv(unsigned char *out, const struct rgba_pixel *pixels, int n)
encode_pixels_rvv:
    # Preserve return address and callee-saved registers
    addi    sp, sp, -16
    sw      ra, 12(sp)
    sw      s0, 8(sp)
    sw      s1, 4(sp)

    # Initialize our working registers
    mv      t0, a2          # Copy pixel count to counter
    mv      s0, a0          # Save output buffer pointer
    mv      s1, a1          # Save input pixel pointer
    li      t3, 4           # Stride: sizeof(struct rgba_pixel)
    li      t4, 0xff000000  # Initial previous pixel {r:0, g:0, b:0, a:255}
    blez    t0, done        # Nothing to do for n <= 0

process_loop:
    # Set vector length based on remaining pixels (8-bit elements)
    vsetvli t1, t0, e8, m1, ta, ma

    # Load RGBA components using strided loads.
    # The stride must be in a scalar register; x4 (tp) is not the value 4.
    vlse8.v v4, (s1), t3     # Load R components
    addi    t2, s1, 1
    vlse8.v v5, (t2), t3     # Load G components
    addi    t2, s1, 2
    vlse8.v v6, (t2), t3     # Load B components
    addi    t2, s1, 3
    vlse8.v v7, (t2), t3     # Load A components

    # Calculate QOI hash: (r*3 + g*5 + b*7 + a*11) % 64
    # 8-bit wraparound arithmetic is enough because 64 divides 256.
    li      t2, 3
    vmul.vx  v16, v4, t2     # v16  = r * 3
    li      t2, 5
    vmacc.vx v16, t2, v5     # v16 += g * 5
    li      t2, 7
    vmacc.vx v16, t2, v6     # v16 += b * 7
    li      t2, 11
    vmacc.vx v16, t2, v7     # v16 += a * 11
    li      t2, 63
    vand.vx  v16, v16, t2    # Modulo 64 (63 does not fit a .vi immediate)

    # Detect runs: compare each pixel, as a 32-bit word, with its predecessor.
    # e32/m4 has the same VLMAX as e8/m1, so vl stays equal to t1.
    vsetvli zero, t1, e32, m4, ta, ma
    vle32.v      v8, (s1)        # Current pixels
    vslide1up.vx v12, v8, t4     # v12[0] = previous pixel, v12[i] = v8[i-1]
    vmseq.vv     v0, v8, v12     # Mask bit i set if pixel i repeats

    # Expand the mask to one byte per pixel (0 or 1)
    vsetvli zero, t1, e8, m1, ta, ma
    vmv.v.i     v24, 0
    vmerge.vim  v24, v24, 1, v0

    # Store results for the C code to process
    vse8.v  v16, (s0)       # Store hash values
    add     t2, s0, t1
    vse8.v  v24, (t2)       # Store run flags

    # Advance pointers
    slli    t2, t1, 2       # Multiply vector length by 4 (RGBA)
    add     s1, s1, t2      # Update input pointer
    lw      t4, -4(s1)      # Last pixel of this strip is the next predecessor
    add     s0, s0, t1      # Skip the hash values
    add     s0, s0, t1      # Skip the run flags

    # Update remaining pixel count and continue if there are more pixels
    sub     t0, t0, t1
    bnez    t0, process_loop

done:
    # Restore registers and return
    lw      ra, 12(sp)
    lw      s0, 8(sp)
    lw      s1, 4(sp)
    addi    sp, sp, 16
    ret

# Additional helper functions if needed (not called above).
# Both expect SEW=8 with current R, G, B, A in v4-v7 and the previous
# pixels' R, G, B, A in v20-v23, which the caller must fill.
compute_differences:
    # Compute wraparound differences between consecutive pixels
    vsub.vv v28, v4, v20    # R differences
    vsub.vv v29, v5, v21    # G differences
    vsub.vv v30, v6, v22    # B differences
    vsub.vv v31, v7, v23    # A differences
    ret

detect_small_diffs:
    # Check if differences are within small range (-2 to 1)
    vmsle.vi v1, v28, 1     # Upper bound for R: dr <= 1
    vmsgt.vi v2, v28, -3    # Lower bound for R: dr > -3
    vmand.mm v1, v1, v2     # Combine R bounds
    # Repeat for G and B...
    ret
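
Since the assembly above is untested, a scalar reference can help when comparing outputs. The sketch below is an assumption about the intended output format, not a confirmed specification: for each strip it writes the hash bytes followed by one run-flag byte per pixel, and it takes the strip size as a parameter because the real vector length depends on the hardware. `encode_pixels_ref` and `strip` are hypothetical names.

```c
#include <stddef.h>

struct rgba_pixel {
    unsigned char r, g, b, a;
};

/* Scalar model of the RVV pre-pass. strip is the assumed number of pixels
 * handled per vector iteration and must be greater than 0. out must hold
 * at least 2 * n bytes. */
static void encode_pixels_ref(unsigned char *out, const struct rgba_pixel *pixels,
                              size_t n, size_t strip) {
    struct rgba_pixel prev = {0, 0, 0, 255};   /* QOI initial previous pixel */
    while (n > 0) {
        size_t vl = n < strip ? n : strip;
        for (size_t i = 0; i < vl; i++) {
            struct rgba_pixel px = pixels[i];
            /* First half of the strip: index hash */
            out[i] = (unsigned char)((px.r * 3 + px.g * 5 + px.b * 7 + px.a * 11) % 64);
            /* Second half of the strip: 1 if the pixel repeats its predecessor */
            out[vl + i] = (px.r == prev.r && px.g == prev.g &&
                           px.b == prev.b && px.a == prev.a) ? 1 : 0;
            prev = px;
        }
        pixels += vl;
        out += 2 * vl;
        n -= vl;
    }
}
```

To compare against hardware, `strip` would have to match the vector length that vsetvli actually returns in each iteration, which this sketch assumes is constant except for the last strip.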

QOI decoder - 洪至謙

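#include <stdlib.h>
#include <string.h>

// Constants shared with the encoder (same values as in qoi.h above)
#define QOI_MAGIC \
	(((unsigned int)'q') << 24 | ((unsigned int)'o') << 16 | \
	 ((unsigned int)'i') <<  8 | ((unsigned int)'f'))
#define QOI_HEADER_SIZE 14
#define QOI_PIXELS_MAX ((unsigned int)400000000)

// Zero-fill an array
#ifndef QOI_ZEROARR
	#define QOI_ZEROARR(a) memset((a),0,sizeof(a))
#endif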
#define QOI_OP_DIFF   0x40 /* 01xxxxxx */
#define QOI_OP_LUMA   0x80 /* 10xxxxxx */
#define QOI_OP_INDEX  0x00 /* 00xxxxxx */
#define QOI_OP_RUN    0xc0 /* 11xxxxxx */
#define QOI_OP_RGB    0xfe /* 11111110 */
#define QOI_OP_RGBA   0xff /* 11111111 */
#define QOI_MASK_2    0xc0 /* 11000000 */
#define QOI_COLOR_HASH(C) (C.rgba.r*3 + C.rgba.g*5 + C.rgba.b*7 + C.rgba.a*11)

// Allocation wrappers; the RISC-V version does not need to free
#ifndef QOI_MALLOC
	#define QOI_MALLOC(sz) malloc(sz)
	#define QOI_FREE(p)    free(p)
#endif
static const unsigned char qoi_padding[8] = {0,0,0,0,0,0,0,1};

// A union overlays its members in the same memory, so a pixel can be read either as four bytes or as one 32-bit value.
typedef union {
	struct { unsigned char r, g, b, a; } rgba;
	unsigned int v;
} qoi_rgba_t;

// Read four bytes at position *p as a big-endian 32-bit value and advance *p
static unsigned int qoi_read_32(const unsigned char *bytes, int *p) {
	unsigned int a = bytes[(*p)++];
	unsigned int b = bytes[(*p)++];
	unsigned int c = bytes[(*p)++];
	unsigned int d = bytes[(*p)++];
	return a << 24 | b << 16 | c << 8 | d;
}

typedef struct {
	unsigned int width;
	unsigned int height;
	unsigned char channels;
	unsigned char colorspace;
} qoi_desc;


void *qoi_decode(const void *data, int size, qoi_desc *desc, int channels) {
	const unsigned char *bytes;
	unsigned int header_magic;
	unsigned char *pixels;
	qoi_rgba_t index[64];
	qoi_rgba_t px;
	int px_len, chunks_len, px_pos;
	int p = 0, run = 0;
    
    // Reject invalid arguments
	if (
		data == NULL || desc == NULL ||
		(channels != 0 && channels != 3 && channels != 4) ||
		size < QOI_HEADER_SIZE + (int)sizeof(qoi_padding)
	) {
		return NULL;
	}
    
    // View the input as raw bytes
	bytes = (const unsigned char *)data;
    
    // Parse the QOI header
	header_magic = qoi_read_32(bytes, &p);
	desc->width = qoi_read_32(bytes, &p);
	desc->height = qoi_read_32(bytes, &p);
	desc->channels = bytes[p++];
	desc->colorspace = bytes[p++];

    // Reject an invalid header
	if (
		desc->width == 0 || desc->height == 0 ||
		desc->channels < 3 || desc->channels > 4 ||
		desc->colorspace > 1 ||
		header_magic != QOI_MAGIC ||
		desc->height >= QOI_PIXELS_MAX / desc->width
	) {
		return NULL;
	}

	if (channels == 0) {
		channels = desc->channels;
	}
    
    // Compute the output size and allocate the pixel buffer
	px_len = desc->width * desc->height * channels;
	pixels = (unsigned char *) QOI_MALLOC(px_len);
	if (!pixels) {
		return NULL;
	}

	QOI_ZEROARR(index);
	px.rgba.r = 0;
	px.rgba.g = 0;
	px.rgba.b = 0;
	px.rgba.a = 255;

    // Decode chunk by chunk; the 8-byte end marker is excluded
	chunks_len = size - (int)sizeof(qoi_padding);
	for (px_pos = 0; px_pos < px_len; px_pos += channels) {
		if (run > 0) {
			run--;
		}
		else if (p < chunks_len) {
			int b1 = bytes[p++];

			if (b1 == QOI_OP_RGB) {
				px.rgba.r = bytes[p++];
				px.rgba.g = bytes[p++];
				px.rgba.b = bytes[p++];
			}
			else if (b1 == QOI_OP_RGBA) {
				px.rgba.r = bytes[p++];
				px.rgba.g = bytes[p++];
				px.rgba.b = bytes[p++];
				px.rgba.a = bytes[p++];
			}
			else if ((b1 & QOI_MASK_2) == QOI_OP_INDEX) {
				px = index[b1];
			}
			else if ((b1 & QOI_MASK_2) == QOI_OP_DIFF) {
				px.rgba.r += ((b1 >> 4) & 0x03) - 2;
				px.rgba.g += ((b1 >> 2) & 0x03) - 2;
				px.rgba.b += ( b1       & 0x03) - 2;
			}
			else if ((b1 & QOI_MASK_2) == QOI_OP_LUMA) {
				int b2 = bytes[p++];
				int vg = (b1 & 0x3f) - 32;
				px.rgba.r += vg - 8 + ((b2 >> 4) & 0x0f);
				px.rgba.g += vg;
				px.rgba.b += vg - 8 +  (b2       & 0x0f);
			}
			else if ((b1 & QOI_MASK_2) == QOI_OP_RUN) {
				run = (b1 & 0x3f);
			}

			index[QOI_COLOR_HASH(px) % 64] = px;
		}

		pixels[px_pos + 0] = px.rgba.r;
		pixels[px_pos + 1] = px.rgba.g;
		pixels[px_pos + 2] = px.rgba.b;
		
		if (channels == 4) {
			pixels[px_pos + 3] = px.rgba.a;
		}
	}

	return pixels;
}

RISC-V

# void *qoi_decode(const void *data, int size, qoi_desc *desc, int channels);
# define data a0
# define size a1
# define desc a2
# define channels a3
# define pixels s4
# Opcode constants are defined with .equ so that `li reg, NAME` loads the value itself
.equ QOI_OP_DIFF,  0x40
.equ QOI_OP_LUMA,  0x80
.equ QOI_OP_INDEX, 0x00
.equ QOI_OP_RUN,   0xc0
.equ QOI_OP_RGB,   0xfe
.equ QOI_OP_RGBA,  0xff
.equ QOI_MASK_2,   0xc0

.data
    .align 4
qoi_padding:                   
    .byte 0, 0, 0, 0, 0, 0, 0, 1 


.bss
    index:    .space 256           # index[64]

.text

    qoi_decode:
        #data in a0
        li s5, 2147483647
        li a1, 1234
        mv t1, a1 #size

         # header_magic = qoi_read_32(t2, &p);
        jal qoi_read_32         # call qoi_read_32
        mv   t1, a0             # result a0 to t1

        mv t2,a2 #desc
        # desc->width = qoi_read_32(t2, &p);
        jal qoi_read_32         # call qoi_read_32
        sw   a0, 0(t2)          # store a0 to 0(t2)


        mv t2,a2 #desc
        # desc->height = qoi_read_32(t2, &p);
        jal qoi_read_32         # qoi_read_32
        sw   a0, 4(t2)          # store a0 to 4(t2)
        
        # desc->channels = t2[p++];
        lb   t0, 0(t5)          # load t2[p]
        addi t5, t5, 1          # p++
        sb   t0, 8(t2)          # store to desc->channels

        # desc->colorspace = t2[p++];
        lb   t0, 0(t5)          # load t2[p]
        addi t5, t5, 1          # p++
        sb   t0, 12(t2)          # store to desc->colorspace

        # if (channels == 0) { channels = desc->channels; }
        mv t3,a3 #channels
        addi sp, sp, -4
        auipc ra, 0
        sw ra, 0(sp)
        beqz t3, set_channels      # if t3 (channels) == 0, j set_channels
        lw ra, 0(sp)            
        addi sp, sp, 4 

        #px_len = desc->width * desc->height * channels;
        lw      t0, 0(t2)      # t0 = desc->width
        lw      t1, 4(t2)      # t1 = desc->height
        lw      t2, 8(t2)      # t2 = desc->channels
        mul     t3, t0, t1     # t3 = width * height
        mul     t3, t3, t2     # t3 = (width * height) * channels

        sw      t3, 0(a0)      # px_len = t3
        mv a1,t3
        jal QOI_MALLOC
        mv s4, a0              #store pixels' address in s4

        #s0=px.rgba.r,s1=px.rgba.g,s2=px.rgba.b,s3=px.rgba.a
        li s0,0
        li s1,0
        li s2,0
        li s3,255

        # chunks_len = size - sizeof(qoi_padding)

        li t1, 0                  # px_pos = 0

    loop_start:
        bge t1, t3, loop_end  # if px_pos >= px_len, end

        li a4, 0              # run = 0
        beqz a4, process_chunks   #  run == 0, process_chunks
        addi a4, a4, -1           # run--
        j loop_continue

    process_chunks:
        # b1 = byte[p++]
        lb t1, 0(s5)          
        addi p, p, 1              # p++
        li t2, QOI_OP_RGB
        beq t1, t2, handle_rgb    # if b1 == QOI_OP_RGB, handle_rgb

        li t2, QOI_OP_RGBA
        beq t1, t2, handle_rgba   # if b1 == QOI_OP_RGBA, handle_rgba

        li t2, QOI_MASK_2
        and t3, t1, t2
        li t4, QOI_OP_INDEX
        beq t3, t4, handle_index  # if (b1 & QOI_MASK_2) == QOI_OP_INDEX, handle_index

        li t4, QOI_OP_DIFF
        beq t3, t4, handle_diff   # if (b1 & QOI_MASK_2) == QOI_OP_DIFF, handle_diff

        li t4, QOI_OP_LUMA
        beq t3, t4, handle_luma   # if (b1 & QOI_MASK_2) == QOI_OP_LUMA, handle_luma

        li t4, QOI_OP_RUN
        beq t3, t4, handle_run    # if (b1 & QOI_MASK_2) == QOI_OP_RUN, handle_run

        j main_loop

    main_loop:
        jal QOI_COLOR_HASH #return t4
        li t3,64
        rem t2,t2,t3              #t2,hash_px%64
        
        slli t2,t2,2              #t2*=4
        add t3,t1,t2              #t3 = index + offest

        sw t0,0(index)
        j loop_continue

    loop_continue:
        sw s0, 0(pixels)          # px.rgba.r
        sw s1, 1(pixels)          # px.rgba.g
        sw s2, 2(pixels)          # px.rgba.b
        beqz a1, skip_alpha       # if channels == 3, skip alpha 
        sw s3, 3(pixels)          # px.rgba.a

    skip_alpha:
        addi t0, t0, a1           # px_pos += channels
        j loop_start              

    loop_end:
        ret


    set_channels:
        lw ra, 0(sp) 
        lw t3, 8(t2)               # desc->channels to t0 (channels)
        jr ra


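    QOI_COLOR_HASH:
        # in:  s0..s3 = px.rgba.r, px.rgba.g, px.rgba.b, px.rgba.a
        # out: t4 = r*3 + g*5 + b*7 + a*11 (the caller applies % 64)
        li t0, 3
        mul t1, s0, t0         # t1 = C.rgba.r * 3
        li t0, 5
        mul t2, s1, t0         # t2 = C.rgba.g * 5
        li t0, 7
        mul t3, s2, t0         # t3 = C.rgba.b * 7
        li t0, 11
        mul t4, s3, t0         # t4 = C.rgba.a * 11

        add t2, t2, t1         # t2 = r*3 + g*5
        add t3, t3, t2         # t3 = r*3 + g*5 + b*7
        add t4, t4, t3         # t4 = r*3 + g*5 + b*7 + a*11

        ret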

    QOI_MALLOC:
        li a7, 214          # sbrk
        mv a0, a1           # a1=sz
        ecall   
        ret            

    # QOI_FREE:
    #     ret        

    qoi_read_32:
    addi sp, sp, -16           
    sw   ra, 12(sp)             # save ra
    sw   t5, 8(sp)              # save p
    mv   t5, s5                 # p = data pointer

    lbu  t0, 0(t5)              # t0 = bytes[p] (lbu avoids sign extension)
    addi t5, t5, 1              # p++
    mv   t1, t0                 # a

    lbu  t0, 0(t5)              # t0 = bytes[p]
    addi t5, t5, 1              # p++
    mv   t2, t0                 # b

    lbu  t0, 0(t5)              # t0 = bytes[p]
    addi t5, t5, 1              # p++
    mv   t3, t0                 # c

    lbu  t0, 0(t5)              # t0 = bytes[p]
    addi t5, t5, 1              # p++
    mv   t4, t0                 # d

    slli  t1, t1, 24             # a << 24
    slli  t2, t2, 16             # b << 16
    slli  t3, t3, 8              # c << 8
    or   t1, t1, t2             # a << 24 | b << 16
    or   t1, t1, t3             # a << 24 | b << 16 | c << 8
    or   t1, t1, t4             # a << 24 | b << 16 | c << 8 | d
    mv   a0, t1                 # return value in a0

    lw   ra, 12(sp)             # restore ra
    lw   t5, 8(sp)              # restore p
    addi sp, sp, 16             

    ret

    handle_rgb:
    lbu s0, 0(s5)      # px.rgba.r = bytes[p++]
    lbu s1, 1(s5)      # px.rgba.g = bytes[p++]
    lbu s2, 2(s5)      # px.rgba.b = bytes[p++]
    addi s5, s5, 3     # p advanced by the three bytes read
    j main_loop        

    handle_rgba:
    lbu s0, 0(s5)      # px.rgba.r = bytes[p++]
    lbu s1, 1(s5)      # px.rgba.g = bytes[p++]
    lbu s2, 2(s5)      # px.rgba.b = bytes[p++]
    lbu s3, 3(s5)      # px.rgba.a = bytes[p++]
    addi s5, s5, 4     # p advanced by the four bytes read
    j main_loop        

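    handle_index:
    slli t0, t1, 2     # t0 = b1 * 4 (the QOI_OP_INDEX tag bits are 00, so b1 is the slot)
    la   t2, index
    add  t2, t2, t0    # t2 = &index[b1]
    lbu  s0, 0(t2)     # px = index[b1], one byte per channel
    lbu  s1, 1(t2)
    lbu  s2, 2(t2)
    lbu  s3, 3(t2)
    j main_loop        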

    handle_diff:
    srli t0, t1, 4     # (b1 >> 4) & 0x03
    andi t0, t0, 0x03
    add  s0, s0, t0
    addi s0, s0, -2    # px.rgba.r += ((b1 >> 4) & 0x03) - 2
    andi s0, s0, 0xff  # channels wrap around at 8 bits

    srli t0, t1, 2     # (b1 >> 2) & 0x03
    andi t0, t0, 0x03
    add  s1, s1, t0
    addi s1, s1, -2    # px.rgba.g += ((b1 >> 2) & 0x03) - 2
    andi s1, s1, 0xff

    andi t0, t1, 0x03  # b1 & 0x03
    add  s2, s2, t0
    addi s2, s2, -2    # px.rgba.b += (b1 & 0x03) - 2
    andi s2, s2, 0xff

    j main_loop        

    handle_luma:
    lbu t0, 0(s5)      # b2 = bytes[p++]
    addi s5, s5, 1

    andi t2, t1, 0x3f  # vg = (b1 & 0x3f) - 32
    addi t2, t2, -32

    srli t4, t0, 4     # (b2 >> 4) & 0x0f
    andi t4, t4, 0x0f
    add  t4, t4, t2    # vg + ((b2 >> 4) & 0x0f)
    add  s0, s0, t4
    addi s0, s0, -8    # px.rgba.r += vg - 8 + ((b2 >> 4) & 0x0f)
    andi s0, s0, 0xff  # channels wrap around at 8 bits

    add  s1, s1, t2    # px.rgba.g += vg
    andi s1, s1, 0xff

    andi t4, t0, 0x0f  # b2 & 0x0f
    add  t4, t4, t2    # vg + (b2 & 0x0f)
    add  s2, s2, t4
    addi s2, s2, -8    # px.rgba.b += vg - 8 + (b2 & 0x0f)
    andi s2, s2, 0xff

    j main_loop        

    handle_run:
    andi t0, t1, 0x3f  # run = b1 & 0x3f (t1 holds b1; t3 only holds the tag bits)
    mv a4, t0          # run to a4
    j main_loop
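
To check the handlers above against known behaviour, a small reference model can help. The following is a minimal Python sketch of the chunk decoding loop, using the opcode values and hash from this document. The names `decode_chunks` and `color_hash` are illustrative and not part of the project code, and the sketch assumes the input is the chunk stream only (header and end padding already stripped).

```python
QOI_OP_INDEX, QOI_OP_DIFF, QOI_OP_LUMA, QOI_OP_RUN = 0x00, 0x40, 0x80, 0xC0
QOI_OP_RGB, QOI_OP_RGBA, QOI_MASK_2 = 0xFE, 0xFF, 0xC0


def color_hash(px):
    """Index slot for a pixel: (r*3 + g*5 + b*7 + a*11) % 64."""
    r, g, b, a = px
    return (r * 3 + g * 5 + b * 7 + a * 11) % 64


def decode_chunks(chunks, n_pixels):
    """Decode a QOI chunk stream into a list of (r, g, b, a) tuples."""
    index = [(0, 0, 0, 0)] * 64
    px = (0, 0, 0, 255)          # start pixel
    out, p, run = [], 0, 0
    while len(out) < n_pixels:
        if run > 0:
            run -= 1             # repeat the previous pixel
        else:
            b1 = chunks[p]
            p += 1
            tag = b1 & QOI_MASK_2
            r, g, b, a = px
            if b1 == QOI_OP_RGB:
                r, g, b = chunks[p:p + 3]
                p += 3
            elif b1 == QOI_OP_RGBA:
                r, g, b, a = chunks[p:p + 4]
                p += 4
            elif tag == QOI_OP_INDEX:
                r, g, b, a = index[b1]
            elif tag == QOI_OP_DIFF:
                r = (r + ((b1 >> 4) & 0x03) - 2) & 0xFF
                g = (g + ((b1 >> 2) & 0x03) - 2) & 0xFF
                b = (b + (b1 & 0x03) - 2) & 0xFF
            elif tag == QOI_OP_LUMA:
                b2 = chunks[p]
                p += 1
                vg = (b1 & 0x3F) - 32
                r = (r + vg - 8 + ((b2 >> 4) & 0x0F)) & 0xFF
                g = (g + vg) & 0xFF
                b = (b + vg - 8 + (b2 & 0x0F)) & 0xFF
            else:                # QOI_OP_RUN: this pixel plus `run` repeats
                run = b1 & 0x3F
            px = (r, g, b, a)
            index[color_hash(px)] = px
        out.append(px)
    return out
```

Feeding the same chunk bytes to this model and to the assembly decoder gives a quick way to spot a handler that drifts, for example one that forgets the 8-bit wraparound.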

RISC-V with vector extension

# Vector registers used:
# v0-v3: RGBA components
# v4: temporary calculations
# v8-v11: for index operations

.data
    QOI_OP_DIFF:   .word 0x40     
    QOI_OP_LUMA:   .word 0x80     
    QOI_OP_INDEX:  .word 0x00     
    QOI_OP_RUN:    .word 0xc0     
    QOI_OP_RGB:    .word 0xfe     
    QOI_OP_RGBA:   .word 0xff     
    QOI_MASK_2:    .word 0xc0     
    .align 4                  
qoi_padding:                   
    .byte 0, 0, 0, 0, 0, 0, 0, 1 

.bss
    index:    .space 256          # index[64]

.text
    qoi_decode:
        # Configure vector unit
        li t0, 32                # Request up to 32 byte-sized elements (AVL)
        vsetvli t1, t0, e8, m1, ta, ma   # 8-bit elements, single vector register; t1 = granted vl (at most VLMAX)

        # Save original arguments
        mv s5, a0                # Save data pointer
        mv s6, a1                # Save size
        mv s7, a2                # Save desc pointer
        mv s8, a3                # Save channels

        # Read header as before
        jal qoi_read_32
        mv t1, a0

        # Process pixels in vector mode
    process_pixels_vector:
        # Load multiple pixels into vector registers
        # A segment load de-interleaves RGBA bytes; four plain vle8.v loads
        # from the same address would put identical data in v0..v3.
        vlseg4e8.v v0, (s4)      # v0 = R, v1 = G, v2 = B, v3 = A components

    handle_diff_vector:
        # Vector version of diff handling
        li t2, 0x3f              # 0x3f does not fit the 5-bit immediate of vand.vi
        vand.vx v4, v0, t2       # Mask for diff
        vadd.vi v4, v4, -2       # Subtract 2 (RVV has no vsub.vi)
        vadd.vv v0, v0, v4       # Add to red channel

        vand.vx v4, v1, t2
        vadd.vi v4, v4, -2
        vadd.vv v1, v1, v4       # Add to green channel

        vand.vx v4, v2, t2
        vadd.vi v4, v4, -2
        vadd.vv v2, v2, v4       # Add to blue channel

    QOI_COLOR_HASH_vector:
        # Vector version of color hash (RVV has no vmul.vi, so use vmul.vx)
        li t2, 3
        vmul.vx v8, v0, t2       # r * 3
        li t2, 5
        vmul.vx v9, v1, t2       # g * 5
        li t2, 7
        vmul.vx v10, v2, t2      # b * 7
        li t2, 11
        vmul.vx v11, v3, t2      # a * 11
        
        vadd.vv v8, v8, v9       # Add components
        vadd.vv v8, v8, v10
        vadd.vv v8, v8, v11
        
        # Store results back
        vsseg4e8.v v0, (s4)      # Re-interleave v0..v3 as R, G, B, A bytes

    process_chunks_vector:
        # Load chunk of bytes into vector register
        vsetvli t0, a1, e8, m1, ta, ma   # Set vector length for byte operations
        vle8.v v4, (s5)           # Load chunk of bytes
        
        # Check for different opcodes in parallel
        li t2, 0xC0
        vand.vx v5, v4, t2        # Apply QOI_MASK_2 to all elements
        
        # Create masks for different opcodes
        # (vmseq.vi takes a 5-bit immediate, so larger constants go through a register)
        li t3, 0xFE
        vmseq.vx v6, v4, t3       # Mask for QOI_OP_RGB
        li t3, 0xFF
        vmseq.vx v7, v4, t3       # Mask for QOI_OP_RGBA
        vmseq.vi v8, v5, 0x00     # Mask for QOI_OP_INDEX
        li t3, 0x40
        vmseq.vx v9, v5, t3       # Mask for QOI_OP_DIFF
        li t3, 0x80
        vmseq.vx v10, v5, t3      # Mask for QOI_OP_LUMA
        vmseq.vx v11, v5, t2      # Mask for QOI_OP_RUN (tag 0xC0)
        vmor.mm v16, v6, v7       # 0xFE and 0xFF also carry tag 0xC0 ...
        vmandn.mm v11, v11, v16   # ... so remove them from the RUN mask

        # Handle RGB chunks
        vcompress.vm v12, v0, v6   # Gather RGB chunks
        vrgather.vv v0, v12, v6    # Load R components
        vrgather.vv v1, v12, v6    # Load G components
        vrgather.vv v2, v12, v6    # Load B components

        # Handle RGBA chunks
        vcompress.vm v12, v0, v7   # Gather RGBA chunks
        vrgather.vv v0, v12, v7    # Load R components
        vrgather.vv v1, v12, v7    # Load G components
        vrgather.vv v2, v12, v7    # Load B components
        vrgather.vv v3, v12, v7    # Load A components

        # Handle INDEX chunks
        vcompress.vm v12, v4, v8   # Gather INDEX chunk bytes
        vsll.vi v13, v12, 2        # Multiply by 4 for index lookup
        la t2, index               # vluxei8.v needs the base address in a register
        vluxei8.v v14, (t2), v13   # Load the first byte of each entry from the index array

        # Handle DIFF chunks (similar to original but vectorized)
        vcompress.vm v12, v4, v9   # Gather DIFF chunk bytes
        vsrl.vi v15, v12, 4        # (b1 >> 4) & 0x03
        vand.vi v15, v15, 0x03
        vadd.vi v15, v15, -2       # -2 (RVV has no vsub.vi)
        vadd.vv v0, v0, v15        # Add to R

        vsrl.vi v15, v12, 2        # (b1 >> 2) & 0x03
        vand.vi v15, v15, 0x03
        vadd.vi v15, v15, -2
        vadd.vv v1, v1, v15        # Add to G

        vand.vi v15, v12, 0x03     # b1 & 0x03
        vadd.vi v15, v15, -2
        vadd.vv v2, v2, v15        # Add to B

        # Continue to main_loop_vector
    
    main_loop_vector:
        # Vector version of main processing loop
        vsetvli t0, a1, e8, m1, ta, ma   # Set vector length based on remaining pixels
        vle8.v v0, (s5)          # Load chunk of pixels
        
        # Process op codes in vector mode
        # (constants above 15 do not fit the .vi forms, so they go through t2)
        li t2, 0xc0
        vand.vx v4, v0, t2       # Mask for op codes
        li t2, 0x40
        vmseq.vx v0, v4, t2      # Check for QOI_OP_DIFF
        li t2, 0x80
        vmseq.vx v1, v4, t2      # Check for QOI_OP_LUMA
        vmseq.vi v2, v4, 0x00    # Check for QOI_OP_INDEX
        
        # Parallel processing based on op codes
        vrgather.vi v8, v0, 0    # Gather DIFF operations
        vrgather.vi v9, v1, 0    # Gather LUMA operations
        vrgather.vi v10, v2, 0   # Gather INDEX operations

        # Continue with the rest of the decoder logic
        ret

    # Helper functions remain mostly unchanged
    qoi_read_32:
        addi sp, sp, -16           
        sw   ra, 12(sp)             # save ra
        sw   t5, 8(sp)              # save p
        mv   t5, s5                 # p = data pointer

        # Could potentially use vector load for 4 bytes at once
        # but keeping scalar for header reading since it's not performance critical
        lbu  t0, 0(t5)              # t0 = data[*p] (lbu avoids sign extension)
        addi t5, t5, 1              # p++
        mv   t1, t0                 # q = first byte

        lbu  t0, 0(t5)              # t0 = data[*p]
        addi t5, t5, 1              # p++
        mv   t2, t0                 # b = second byte

        lbu  t0, 0(t5)              # t0 = data[*p]
        addi t5, t5, 1              # p++
        mv   t3, t0                 # c = third byte

        lbu  t0, 0(t5)              # t0 = data[*p]
        addi t5, t5, 1              # p++
        mv   t4, t0                 # d = fourth byte

        # Combine bytes into 32-bit value
        slli  t1, t1, 24            # q << 24
        slli  t2, t2, 16            # b << 16
        slli  t3, t3, 8             # c << 8
        or    t1, t1, t2            # combine q and b
        or    t1, t1, t3            # combine with c
        or    t1, t1, t4            # combine with d to get final 32-bit value

        mv    a0, t1                # move result to return register
        
        # Restore stack and return
        lw    ra, 12(sp)          
        lw    t5, 8(sp)            
        addi  sp, sp, 16             
        ret

    QOI_MALLOC:
        li a7, 214                  # sbrk syscall number
        mv a0, a1                   # move size to a0
        ecall                       # make syscall
        ret                         # return allocated address in a0
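
The mask-building step in process_chunks_vector can be modelled lane by lane. The following is a minimal Python sketch, not the project's implementation: the function names `opcode_masks` and `color_hash_lanes` are made up for illustration, and plain lists stand in for vector registers. It treats every byte as the first byte of a chunk, which only holds for streams of one-byte chunks (INDEX, DIFF, RUN); payload bytes of RGB, RGBA and LUMA chunks would be misclassified, which is one reason a real decoder cannot classify a whole register blindly.

```python
QOI_MASK_2 = 0xC0


def opcode_masks(lanes):
    """Return one boolean mask per opcode, one entry per byte lane."""
    tag = [b & QOI_MASK_2 for b in lanes]
    rgb = [b == 0xFE for b in lanes]
    rgba = [b == 0xFF for b in lanes]
    # 0xFE and 0xFF share tag 0xC0 with RUN, so they are excluded explicitly.
    run = [t == 0xC0 and not (x or y) for t, x, y in zip(tag, rgb, rgba)]
    return {
        "rgb": rgb,
        "rgba": rgba,
        "index": [t == 0x00 for t in tag],
        "diff": [t == 0x40 for t in tag],
        "luma": [t == 0x80 for t in tag],
        "run": run,
    }


def color_hash_lanes(r, g, b, a):
    """Per-lane (r*3 + g*5 + b*7 + a*11) % 64 for four channel 'registers'."""
    return [(x * 3 + y * 5 + z * 7 + w * 11) % 64
            for x, y, z, w in zip(r, g, b, a)]
```

Each list comprehension corresponds to one vector instruction in the listing (a compare producing a mask, or a multiply-accumulate over lanes), so the model can be used to predict what each mask register should contain for a given input.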

RISC-V Vector Extension in 32-bit for the encoder and decoder in Quite OK Image Format

Enhanced version

Environment Installation

Careful! These steps only work on Ubuntu 22.04 or above. On 20.04 you can lose several days trying to get the toolchain installed. (That is exactly what happened to me. QQ, by 至謙)

If your environment is brand new, you must first update it to get the required prerequisites.
Update

sudo apt update && sudo apt upgrade -y

Install some prerequisites (Ubuntu)

sudo apt-get install autoconf automake autotools-dev curl python3 python3-pip python3-tomli libmpc-dev libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev ninja-build git cmake libglib2.0-dev libslirp-dev

Clone the riscv-gnu-toolchain repo from the official GitHub repo.

mkdir tool && cd tool
git clone https://github.com/riscv-collab/riscv-gnu-toolchain.git --recursive

Enter the riscv-gnu-toolchain folder, create and enter a folder called build, then run configure to select the target architecture and ABI for the generated makefile.

cd riscv-gnu-toolchain
mkdir build && cd build
../configure --prefix=$HOME/riscv-gnu-toolchain/build --with-arch=rv32gcv --with-abi=ilp32d --enable-multilib

Start to compile the 32-bit RISC-V GNU Toolchain

This step takes a while on my shabby ASUS Mini PN41. (Around 3 hours)

make

Setting up the path permanently

echo 'export PATH=$PATH:~/riscv-gnu-toolchain/build/bin' >> ~/.bashrc
source ~/.bashrc

Next, build the QEMU that is compatible with riscv-gnu-toolchain.

Stay in the directory where_you_git_clone_repo/riscv-gnu-toolchain/build for this step; running the command from anywhere else fails.

make build-qemu
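
Before moving on, it can be worth confirming that the tools used later in this section are reachable through PATH. The following is a small Python sketch for that check. The helper name `locate_tools` is made up for illustration, and the tool names are simply the ones invoked later in this document; whether they were installed into the configured prefix depends on how the builds above went.

```python
import shutil

# Tool names taken from the compile and run commands later in this document.
TOOLS = ("riscv32-unknown-elf-as", "riscv32-unknown-elf-ld", "qemu-riscv32")


def locate_tools(names=TOOLS):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If a tool is reported as not found, re-check the PATH line added to ~/.bashrc and open a new shell before retrying.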

Here is a test file for the vector extension.

vector-test.s

.section .text
.global _start

_start:
    # Initialize stack pointer
    lui sp, %hi(stack_top)
    addi sp, sp, %lo(stack_top)
    
    # Print start message
    lui a0, %hi(msg_start)
    addi a0, a0, %lo(msg_start)
    jal print_string
    
    # Initialize vector configuration with explicit configuration
    vsetvli t0, x0, e32, m1, ta, ma    # SEW=32, LMUL=1, tail agnostic, mask agnostic
    
    # Load vector register v0 with data
    la a0, vector_data
    vle32.v v0, (a0)    # Load 32-bit elements into v0
    
    # Add 1 to each element
    li t0, 1
    vadd.vx v0, v0, t0  # Add scalar t0 to each element
    
    # Store result back to memory
    la a0, vector_result
    vse32.v v0, (a0)    # Store 32-bit elements from v0
    
    # Print results
    la s0, vector_result     # Load result address
    li s1, 4                 # Counter for 4 numbers
    
print_loop:
    # Print "Element "
    lui a0, %hi(msg_result)
    addi a0, a0, %lo(msg_result)
    jal print_string
    
    # Load and print original value
    lui a0, %hi(msg_orig)
    addi a0, a0, %lo(msg_orig)
    jal print_string
    la t0, vector_data
    la t2, vector_result       # Load address of vector_result
    sub t1, s0, t2            # Now subtract registers
    add t0, t0, t1
    lw a0, 0(t0)
    jal print_num
    
    # Print arrow
    lui a0, %hi(msg_arrow)
    addi a0, a0, %lo(msg_arrow)
    jal print_string
    
    # Load and print result value
    lw a0, 0(s0)
    jal print_num
    
    # Print newline
    lui a0, %hi(msg_newline)
    addi a0, a0, %lo(msg_newline)
    jal print_string
    
    # Move to next number
    addi s0, s0, 4
    addi s1, s1, -1
    bnez s1, print_loop
    
    # Print completion message
    lui a0, %hi(msg_done)
    addi a0, a0, %lo(msg_done)
    jal print_string
    
    # Exit success
    li a0, 0
    li a7, 93
    ecall

# Print string function - expects pointer in a0
print_string:
    addi sp, sp, -4
    sw ra, 0(sp)
    mv t0, a0
1:  lbu t1, 0(t0)
    beqz t1, 2f
    addi t0, t0, 1
    j 1b
2:  sub t0, t0, a0
    mv a2, t0
    mv a1, a0
    li a0, 1
    li a7, 64
    ecall
    lw ra, 0(sp)
    addi sp, sp, 4
    ret

# Print number function - expects number in a0
# Stack frame: 0..15(sp) is the digit buffer, 16..31(sp) holds saved registers,
# so writing digits can no longer overwrite the saved values.
print_num:
    addi sp, sp, -32
    sw ra, 28(sp)
    sw s0, 24(sp)
    sw s1, 20(sp)
    sw s2, 16(sp)
    
    mv s0, a0           # Save original number
    li s1, 10           # Divisor
    
    # Handle negative numbers
    bgez s0, positive
    neg s0, s0
    li t0, '-'
    sb t0, 0(sp)        # Use the buffer for the sign character
    li a0, 1
    mv a1, sp
    li a2, 1
    li a7, 64
    ecall
    
positive:
    # Convert number to string, filling the buffer from its end
    # so the digits come out in reading order
    addi s2, sp, 16     # One past the end of the digit buffer
digit_loop:
    rem t1, s0, s1      # Get remainder (current digit)
    addi t1, t1, '0'    # Convert to ASCII
    addi s2, s2, -1     # Move buffer pointer towards the start
    sb t1, 0(s2)        # Store digit
    div s0, s0, s1      # Divide number by 10
    bnez s0, digit_loop
    
    # Print the number
    mv a1, s2           # First digit
    addi t0, sp, 16
    sub a2, t0, s2      # Calculate length
    li a0, 1
    li a7, 64
    ecall
    
    # Restore registers and return
    lw ra, 28(sp)
    lw s0, 24(sp)
    lw s1, 20(sp)
    lw s2, 16(sp)
    addi sp, sp, 32
    ret

.section .rodata
msg_start:
    .string "Starting vector test...\n"
msg_done:
    .string "\nVector operations completed.\n"
msg_result:
    .string "Element "
msg_orig:
    .string "Original: "
msg_arrow:
    .string " -> Result: "
msg_newline:
    .string "\n"

.section .data
.align 4
vector_data:
    .word 1, 2, 3, 4    # Pre-initialized input data
.align 4
vector_result:
    .word 0, 0, 0, 0    # Space for results

.section .bss
.align 4
    .space 4096         # Stack
stack_top:

Compile it

riscv32-unknown-elf-as -march=rv32gcv_zba vector-test.s -o vector-test.o
riscv32-unknown-elf-ld -nostdlib vector-test.o -o vector-test

Now! It's time to witness the miracle.

qemu-riscv32 -cpu rv32,v=true,zba=true,vlen=128 ./vector-test

Result

Starting vector test...
Element Original: 1 -> Result: 2
Element Original: 2 -> Result: 3
Element Original: 3 -> Result: 4
Element Original: 4 -> Result: 5

Vector operations completed.
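
The test handles exactly four words because of how the vector length is derived: with vlen=128 on the QEMU command line, SEW=32 and LMUL=1, one register holds 128 / 32 = 4 elements, and `vsetvli t0, x0, ...` requests that maximum. The following Python sketch models this arithmetic and the usual strip-mining loop. The function names are illustrative, LMUL is assumed to be a whole number, and `min(remaining, VLMAX)` is a simplification: the RVV specification allows an implementation to grant a smaller vl when the request lies between VLMAX and 2 * VLMAX.

```python
def vlmax(vlen_bits, sew_bits, lmul=1):
    """Maximum number of elements per vector operation (integer LMUL only)."""
    return vlen_bits * lmul // sew_bits


def strip_mine(n, vlen_bits, sew_bits, lmul=1):
    """Return the (start, vl) pairs a vsetvli-driven loop would process."""
    limit = vlmax(vlen_bits, sew_bits, lmul)
    steps, start = [], 0
    while start < n:
        vl = min(n - start, limit)   # simplified model of vsetvli
        steps.append((start, vl))
        start += vl
    return steps
```

For example, ten 32-bit elements at VLEN=128 would take three iterations of four, four and two elements, while the same loop on byte elements could cover up to sixteen per iteration.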

PNG to binary

What is PNG?

PNG (Portable Network Graphics) is a raster graphics file format that uses lossless compression, designed to replace the older GIF format.

PNG Format

1-bit: Black-and-white images.
2/4/8-bit: Palette-based images (up to 256 colors).
24-bit: True color images (16,777,216 possible colors).
32-bit: True color + alpha channel (supports transparency).

Structure of a PNG File

File Header (Signature) = 89 50 4E 47 0D 0A 1A 0A
PNG files are composed of multiple chunks, each including:
- Length (4 bytes): Length of the chunk's data.
- Type (4 bytes): Name of the chunk (e.g., IHDR).
- Data (variable length): Actual content of the chunk.
- CRC (4 bytes): Checksum for verifying data integrity.
Key Chunk Types:
- IHDR (Image Header):
  - Defines basic attributes like width, height, color depth, compression method, etc.
- PLTE (Palette):
  - Palette data (used for palette-based images only).
- IDAT (Image Data):
  - Contains compressed pixel data. Multiple IDAT chunks can be present.
- IEND (Image End):
  - Marks the end of the file, with no data.
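
The chunk layout above can be walked with a few lines of Python. This is a minimal sketch using only the standard library; the helper names `make_chunk` and `iter_chunks` are made up for illustration. It relies on the fact that the CRC of a PNG chunk covers the type and data fields but not the length field.

```python
import struct
import zlib

PNG_SIGNATURE = bytes.fromhex("89504E470D0A1A0A")


def make_chunk(ctype, data=b""):
    """Build one chunk: length, type, data, CRC (CRC covers type + data)."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)


def iter_chunks(png_bytes):
    """Yield (type, data, crc_ok) for each chunk, stopping after IEND."""
    if png_bytes[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file: bad signature")
    pos = 8
    while pos < len(png_bytes):
        (length,) = struct.unpack_from(">I", png_bytes, pos)
        ctype = png_bytes[pos + 4:pos + 8]
        data = png_bytes[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack_from(">I", png_bytes, pos + 8 + length)
        crc_ok = crc == (zlib.crc32(ctype + data) & 0xFFFFFFFF)
        yield ctype.decode("ascii"), data, crc_ok
        pos += 12 + length           # 4 length + 4 type + data + 4 CRC
        if ctype == b"IEND":
            break
```

Running `iter_chunks` over the bytes of a real PNG file lists its chunks in order; the first eight bytes of the IHDR data are the big-endian width and height.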

Here is a Python script that converts a PNG file into a simple binary file, and back again to verify the conversion.

import numpy as np
from PIL import Image
import struct
import argparse

def png_to_binary(input_png, output_binary):
    """
    Convert a PNG file to a binary format with the following structure:
    - First 4 bytes: width (int32)
    - Next 4 bytes: height (int32)
    - Next 1 byte: channels (uint8)
    - Remaining bytes: pixel data in row-major order
    """
    try:
        # Open and read the PNG file
        with Image.open(input_png) as img:
            # Convert to RGB or RGBA if not already
            if img.mode not in ['RGB', 'RGBA']:
                img = img.convert('RGB')
            
            # Get image dimensions and channel count
            width, height = img.size
            channels = len(img.getbands())  # 3 for RGB, 4 for RGBA
            
            # Convert image to numpy array
            img_array = np.array(img)
            
            # Write to binary file
            with open(output_binary, 'wb') as f:
                # Write header information
                f.write(struct.pack('>I', width))    # Big-endian uint32
                f.write(struct.pack('>I', height))   # Big-endian uint32
                f.write(struct.pack('B', channels))  # uint8
                
                # Write pixel data as raw uint8 bytes in row-major (C) order
                pixel_bytes = np.ascontiguousarray(img_array, dtype=np.uint8).tobytes(order='C')
                f.write(pixel_bytes)
                
        return True, f"Successfully converted {input_png} to {output_binary}"
    
    except FileNotFoundError:
        return False, f"Error: Input file {input_png} not found"
    except Exception as e:
        return False, f"Error during conversion: {str(e)}"

def binary_to_png(input_binary, output_png):
    """
    Convert our binary format back to PNG to verify the conversion worked correctly.
    """
    try:
        with open(input_binary, 'rb') as f:
            # Read header
            width = struct.unpack('>I', f.read(4))[0]
            height = struct.unpack('>I', f.read(4))[0]
            channels = struct.unpack('B', f.read(1))[0]
            
            # Read pixel data
            mode = 'RGBA' if channels == 4 else 'RGB'
            size = width * height * channels
            data = f.read(size)
            
            # Convert to numpy array and reshape
            img_array = np.frombuffer(data, dtype=np.uint8)
            img_array = img_array.reshape((height, width, channels))
            
            # Create and save image
            img = Image.fromarray(img_array, mode)
            img.save(output_png)
            
        return True, f"Successfully converted {input_binary} to {output_png}"
    
    except FileNotFoundError:
        return False, f"Error: Input file {input_binary} not found"
    except Exception as e:
        return False, f"Error during conversion: {str(e)}"

def main():
    parser = argparse.ArgumentParser(description='Convert between PNG and binary format')
    parser.add_argument('input_file', help='Input file path')
    parser.add_argument('output_file', help='Output file path')
    parser.add_argument('--to-png', action='store_true', 
                        help='Convert from binary to PNG (default is PNG to binary)')
    
    args = parser.parse_args()
    
    if args.to_png:
        success, message = binary_to_png(args.input_file, args.output_file)
    else:
        success, message = png_to_binary(args.input_file, args.output_file)
    
    print(message)
    return 0 if success else 1

if __name__ == "__main__":
    # Propagate main()'s return value as the process exit code
    raise SystemExit(main())

Run the script as follows. Add --to-png to convert a binary file back to PNG.

python3 png2bin.py "input_file_name" "output_file_name"

Example

python3 png2bin.py A.png A.bin
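
To sanity-check a generated file, the 9-byte header written by the script (big-endian width, big-endian height, one channel byte) can be read back directly. The following is a minimal Python sketch; the function name `read_bin` is illustrative and not part of png2bin.py.

```python
import struct

# width (uint32 BE), height (uint32 BE), channels (uint8): 9 bytes in total
HEADER = struct.Struct(">IIB")


def read_bin(blob):
    """Split the binary format into (width, height, channels, pixel_bytes)."""
    width, height, channels = HEADER.unpack_from(blob, 0)
    pixels = blob[HEADER.size:]
    expected = width * height * channels
    if len(pixels) != expected:
        raise ValueError(f"expected {expected} pixel bytes, got {len(pixels)}")
    return width, height, channels, pixels
```

For a file on disk, pass the result of `open("A.bin", "rb").read()` to `read_bin`; a length mismatch usually means the file was truncated or written with a different channel count.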

Verifying the QOI format

Here is an online QOI viewer. You may drag and drop a QOI format image to test the result.

QOI Viewer - The Brain Dump - GitHub Pages

Original Page

Drag and Drop the image dice.qoi from floooh's GitHub repository.

You may download the QOI format image from floooh's GitHub using the following link.

qoiviewer