# RVV-accelerated Image Codec
> 洪至謙, 曾遠哲

## What is RVV-accelerated?
The RISC-V Vector Extension (RVV) is a key component of the RISC-V instruction set architecture, providing efficient vector computation capabilities.

### Scalable Vector Length
- Supports various vector register lengths (VLEN), allowing flexibility across different hardware platforms.
- Dynamically sets the vector length using the `vsetvli` instruction.
- Enables vector operations of varying lengths within the same code.

### Vector Operation Instructions
- Basic arithmetic operations.
- Bitwise operations.
- Load/store instructions.
- Vector reduction operations.
- Vector masking operations.

## What is QOI?
QOI (Quite OK Image Format) is a lossless image compression format designed for simplicity and speed.

- Speed: Offers significantly faster encoding and decoding compared to stb_image_write (20x-50x) and stb_image (3x-4x).
- Supports RGB and RGBA: Handles images with and without an alpha channel.

### QOI file structure
1. Header (14 bytes):
    1. magic bytes ("qoif")
       The string "qoif" identifies a valid QOI file.
    2. image width
    3. image height
    4. number of channels
       **3** indicates that the image uses the RGB color mode.
       **4** indicates that the image uses the RGBA color mode.
    5. colorspace info
       **0** indicates sRGB with linear alpha.
       **1** indicates that all channels are linear.

```c
struct qoi_header {
    char     magic[4];   // magic bytes "qoif"
    uint32_t width;      // image width in pixels (BE)
    uint32_t height;     // image height in pixels (BE)
    uint8_t  channels;   // 3 = RGB, 4 = RGBA
    uint8_t  colorspace; // 0 = sRGB with linear alpha
                         // 1 = all channels linear
};
```

Images are encoded row by row, left to right, top to bottom.

The encoder and decoder start with `{r:0, g:0, b:0, a:255}` as the previous pixel value. The image is complete once all $\text{width} \times \text{height}$ pixels have been covered.

Pixels are encoded as
1. a run of the previous pixel (run-length encoding)
2. an index into an array of previously seen pixels
3. a difference to the previous pixel value in `r,g,b` (the difference must be very small, e.g. in images with anti-aliasing)
4. full `r,g,b` or `r,g,b,a` values

Note: the color channels are assumed not to be premultiplied with the alpha channel.

The encoder and decoder both keep an array of 64 previously seen pixels. Each pixel is stored at the position given by:

$$ \mathrm{index\_position} = (r \times 3 + g \times 5 + b \times 7 + a \times 11) \bmod 64 $$

:::info
This is a simple hash function chosen to keep hash collisions low.
:::

- Every chunk starts with a 2-bit or 8-bit tag, followed by some data bits.
- All chunks are byte aligned (the bit length of every chunk is divisible by 8).
- The most significant bit of every value is on the left.
- The 8-bit tags take precedence over the 2-bit tags, so a decoder must check for the presence of an 8-bit tag first.

:::danger
Reduce the indentation.
:::

:::info
Finished, please check it out. Thank you.
:::

2. Data Chunks:

- QOI_OP_RGBA/QOI_OP_RGB

QOI_OP_RGB

|Byte[0]|Byte[1]|Byte[2]|Byte[3]|
|---|---|---|---|
|7 6 5 4 3 2 1 0|$7 \dots 0$|$7 \dots 0$|$7 \dots 0$|
|1 1 1 1 1 1 1 0|red|green|blue|

QOI_OP_RGBA

|Byte[0]|Byte[1]|Byte[2]|Byte[3]|Byte[4]|
|---|---|---|---|---|
|7 6 5 4 3 2 1 0|$7 \dots 0$|$7 \dots 0$|$7 \dots 0$|$7 \dots 0$|
|1 1 1 1 1 1 1 1|red|green|blue|alpha|

- QOI_OP_INDEX:

|Byte[0]|-|-|-|-|-|-|-|
|---|---|---|---|---|---|---|---|
|7|6|5|4|3|2|1|0|
|0|0|index|-|-|-|-|-|

- 2-bit tag b00
- 6-bit index into the array of previously seen pixels: 0..63

- QOI_OP_DIFF:

|Byte[0]|-|-|-|-|-|-|-|
|---|---|---|---|---|---|---|---|
|7|6|5|4|3|2|1|0|
|0|1|dr|-|dg|-|db|-|

- 2-bit tag b01
- 2-bit red channel difference from the previous pixel: -2..1
- 2-bit green channel difference from the previous pixel: -2..1
- 2-bit blue channel difference from the previous pixel: -2..1

:::info
The differences are applied to the current channel values with wraparound.
E.g.: 1 - 2 -> 255, 255 + 1 -> 0

Values are stored as unsigned integers with a bias of 2.
E.g.: -2 -> 0 (b00), 1 -> 3 (b11)
:::

- QOI_OP_LUMA:

|Byte[0]|-|-|-|-|-|-|-|
|---|---|---|---|---|---|---|---|
|7|6|5|4|3|2|1|0|
|1|0|diff green|-|-|-|-|-|

|Byte[1]|-|-|-|-|-|-|-|
|---|---|---|---|---|---|---|---|
|7|6|5|4|3|2|1|0|
|dr-dg|-|-|-|db-dg|-|-|-|

- 2-bit tag b10
- 6-bit green channel difference from the previous pixel: -32..31
- 4-bit red channel difference minus green channel difference: -8..7
- 4-bit blue channel difference minus green channel difference: -8..7

:::info
The `green` channel
1. indicates the general direction of change
2. is encoded in 6 bits

The `red` and `blue` channels (`dr` and `db`) base their differences on the green channel difference. I.e.:

- `dr_dg = (cur_px.r - prev_px.r) - (cur_px.g - prev_px.g)`
- `db_dg = (cur_px.b - prev_px.b) - (cur_px.g - prev_px.g)`

The differences are applied to the current channel values with wraparound.
E.g.: 10 - 13 -> 253, 250 + 7 -> 1

Values are stored as unsigned integers with a bias of 32 for the green channel and a bias of 8 for the red and blue channels.
:::

- QOI_OP_RUN:

|Byte[0]|-|-|-|-|-|-|-|
|---|---|---|---|---|---|---|---|
|7|6|5|4|3|2|1|0|
|1|1|run|-|-|-|-|-|

- 2-bit tag b11
- 6-bit run-length repeating the previous pixel: 1..62

:::warning
The run-length is stored with a bias of -1. Note that the run-lengths 63 and 64 (b111110 and b111111) are illegal as they are occupied by the `QOI_OP_RGB` and `QOI_OP_RGBA` tags.
:::

3. End Marker (8 bytes): seven `0x00` bytes followed by a single `0x01` byte.

## QOI encoder - 曾遠哲
:::danger
The following code is untested. The program cannot read the binary file under the QEMU emulator. The cause is still unclear: I am using QEMU user mode rather than system mode, so there should be no fully isolated hardware separating the guest environment from the host.
:::

### Baseline implementation
`qoi.h`
```c
// qoi.h
#ifndef QOI_H
#define QOI_H

#ifdef __cplusplus
extern "C" {
#endif

#define QOI_SRGB   0 // Standard RGB colorspace with linear alpha
#define QOI_LINEAR 1 // All channels are linear

// Description of the image - width, height, channels, and colorspace
typedef struct {
    unsigned int width;
    unsigned int height;
    unsigned char channels;   // 3 = RGB, 4 = RGBA
    unsigned char colorspace; // 0 = sRGB, 1 = linear
} qoi_desc;

// Core encoding function: converts raw pixels to QOI format
void *qoi_encode(const void *data, const qoi_desc *desc, int *out_len);

// Core decoding function: converts QOI format back to raw pixels
void *qoi_decode(const void *data, int size, qoi_desc *desc, int channels);

// File handling convenience functions
int qoi_write(const char *filename, const void *data, const qoi_desc *desc);
void *qoi_read(const char *filename, qoi_desc *desc, int channels);

#ifdef __cplusplus
}
#endif
#endif // QOI_H

#ifdef QOI_IMPLEMENTATION

// Include necessary headers
#include <stdlib.h>
#include <string.h>

// If stdio functions are needed
#ifndef QOI_NO_STDIO
#include <stdio.h>
#endif

// Allow custom memory management
#ifndef QOI_MALLOC
#define QOI_MALLOC(sz) malloc(sz)
#define QOI_FREE(p)    free(p)
#endif

// Allow custom array zeroing
#ifndef QOI_ZEROARR
#define QOI_ZEROARR(a) memset((a),0,sizeof(a))
#endif

// Chunk type tags
#define QOI_OP_INDEX 0x00 // 00xxxxxx - 6-bit index into color array
#define QOI_OP_DIFF  0x40 // 01xxxxxx - 2-bit RGB channel differences
#define QOI_OP_LUMA  0x80 // 10xxxxxx - Larger RGB differences
#define QOI_OP_RUN   0xc0 // 11xxxxxx - Run of pixels
#define QOI_OP_RGB   0xfe // 11111110 - Full RGB values
#define QOI_OP_RGBA  0xff // 11111111 - Full RGBA values

#define QOI_MASK_2   0xc0 // Mask for 2-bit tag

// Hash function for the color index array
#define QOI_COLOR_HASH(C) (C.rgba.r*3 + C.rgba.g*5 + C.rgba.b*7 + C.rgba.a*11)

// Magic bytes for file identification
#define QOI_MAGIC \
    (((unsigned int)'q') << 24 | ((unsigned int)'o') << 16 | \
     ((unsigned int)'i') <<  8 | ((unsigned int)'f'))

#define QOI_HEADER_SIZE 14

// Maximum image size (400 million pixels) for safety
#define QOI_PIXELS_MAX ((unsigned int)400000000)

// Union for RGBA pixel manipulation
typedef union {
    struct { unsigned char r, g, b, a; } rgba;
    unsigned int v;
} qoi_rgba_t;

// End-of-stream marker
static const unsigned char qoi_padding[8] = {0,0,0,0,0,0,0,1};

// Helper functions for handling 32-bit values
static void qoi_write_32(unsigned char *bytes, int *p, unsigned int v) {
    bytes[(*p)++] = (0xff000000 & v) >> 24;
    bytes[(*p)++] = (0x00ff0000 & v) >> 16;
    bytes[(*p)++] = (0x0000ff00 & v) >> 8;
    bytes[(*p)++] = (0x000000ff & v);
}

static unsigned int qoi_read_32(const unsigned char *bytes, int *p) {
    unsigned int a = bytes[(*p)++];
    unsigned int b = bytes[(*p)++];
    unsigned int c = bytes[(*p)++];
    unsigned int d = bytes[(*p)++];
    return a << 24 | b << 16 | c << 8 | d;
}

// The core encoding function
void *qoi_encode(const void *data, const qoi_desc *desc, int *out_len) {
    int i, max_size, p, run;
    int px_len, px_end, px_pos, channels;
    unsigned char *bytes;
    const unsigned char *pixels;
    qoi_rgba_t index[64];
    qoi_rgba_t px, px_prev;

    // Validate input parameters
    if (data == NULL || out_len == NULL || desc == NULL ||
        desc->width == 0 || desc->height == 0 ||
        desc->channels < 3 || desc->channels > 4 ||
        desc->colorspace > 1 ||
        desc->height >= QOI_PIXELS_MAX / desc->width) {
        return NULL;
    }

    // Calculate maximum possible size
    max_size = desc->width * desc->height * (desc->channels + 1) +
               QOI_HEADER_SIZE + sizeof(qoi_padding);

    // Allocate output buffer
    bytes = (unsigned char *) QOI_MALLOC(max_size);
    if (!bytes) {
        return NULL;
    }

    // Write file header
    p = 0;
    qoi_write_32(bytes, &p, QOI_MAGIC);
    qoi_write_32(bytes, &p, desc->width);
    qoi_write_32(bytes, &p, desc->height);
    bytes[p++] = desc->channels;
    bytes[p++] = desc->colorspace;

    // Initialize encoding state
    pixels = (const unsigned char *)data;
    QOI_ZEROARR(index);
    run = 0;
    px_prev.rgba.r = 0;
    px_prev.rgba.g = 0;
    px_prev.rgba.b = 0;
    px_prev.rgba.a = 255;
    px = px_prev;

    // Calculate pixel parameters
    px_len = desc->width * desc->height * desc->channels;
    px_end = px_len - desc->channels;
    channels = desc->channels;

    // Main encoding loop
    for (px_pos = 0; px_pos < px_len; px_pos += channels) {
        // Read pixel values
        px.rgba.r = pixels[px_pos + 0];
        px.rgba.g = pixels[px_pos + 1];
        px.rgba.b = pixels[px_pos + 2];
        if (channels == 4) {
            px.rgba.a = pixels[px_pos + 3];
        }

        // Check for run of identical pixels
        if (px.v == px_prev.v) {
            run++;
            if (run == 62 || px_pos == px_end) {
                bytes[p++] = QOI_OP_RUN | (run - 1);
                run = 0;
            }
        } else {
            // End any current run
            if (run > 0) {
                bytes[p++] = QOI_OP_RUN | (run - 1);
                run = 0;
            }

            // Check index for previously seen pixel
            int index_pos = QOI_COLOR_HASH(px) % 64;
            if (index[index_pos].v == px.v) {
                bytes[p++] = QOI_OP_INDEX | index_pos;
            } else {
                // Store pixel in index
                index[index_pos] = px;

                // Check if we can encode a small difference
                if (px.rgba.a == px_prev.rgba.a) {
                    signed char vr = px.rgba.r - px_prev.rgba.r;
                    signed char vg = px.rgba.g - px_prev.rgba.g;
                    signed char vb = px.rgba.b - px_prev.rgba.b;
                    signed char vg_r = vr - vg;
                    signed char vg_b = vb - vg;

                    if (vr > -3 && vr < 2 &&
                        vg > -3 && vg < 2 &&
                        vb > -3 && vb < 2) {
                        // Small difference - use QOI_OP_DIFF
                        bytes[p++] = QOI_OP_DIFF | ((vr + 2) << 4) | ((vg + 2) << 2) | (vb + 2);
                    } else if (vg_r > -9 && vg_r < 8 &&
                               vg > -33 && vg < 32 &&
                               vg_b > -9 && vg_b < 8) {
                        // Larger difference - use QOI_OP_LUMA
                        bytes[p++] = QOI_OP_LUMA | (vg + 32);
                        bytes[p++] = ((vg_r + 8) << 4) | (vg_b + 8);
                    } else {
                        // Full RGB values needed
                        bytes[p++] = QOI_OP_RGB;
                        bytes[p++] = px.rgba.r;
                        bytes[p++] = px.rgba.g;
                        bytes[p++] = px.rgba.b;
                    }
                } else {
                    // Alpha changed - need full RGBA
                    bytes[p++] = QOI_OP_RGBA;
                    bytes[p++] = px.rgba.r;
                    bytes[p++] = px.rgba.g;
                    bytes[p++] = px.rgba.b;
                    bytes[p++] = px.rgba.a;
                }
            }
        }
        px_prev = px;
    }

    // Write end marker
    for (i = 0; i < (int)sizeof(qoi_padding); i++) {
        bytes[p++] = qoi_padding[i];
    }

    *out_len = p;
    return bytes;
}

// Core decoding function implementing the inverse operations
void *qoi_decode(const void *data, int size, qoi_desc *desc, int channels) {
    const unsigned char *bytes;
    unsigned int header_magic;
    unsigned char *pixels;
    qoi_rgba_t index[64];
    qoi_rgba_t px;
    int px_len, chunks_len, px_pos;
    int p = 0, run = 0;

    // Input validation
    if (data == NULL || desc == NULL ||
        (channels != 0 && channels != 3 && channels != 4) ||
        size < QOI_HEADER_SIZE + (int)sizeof(qoi_padding)) {
        return NULL;
    }

    // Parse header
    bytes = (const unsigned char *)data;
    header_magic = qoi_read_32(bytes, &p);
    desc->width = qoi_read_32(bytes, &p);
    desc->height = qoi_read_32(bytes, &p);
    desc->channels = bytes[p++];
    desc->colorspace = bytes[p++];

    // Validate header
    if (desc->width == 0 || desc->height == 0 ||
        desc->channels < 3 || desc->channels > 4 ||
        desc->colorspace > 1 ||
        header_magic != QOI_MAGIC ||
        desc->height >= QOI_PIXELS_MAX / desc->width) {
        return NULL;
    }

    // Set output channels
    if (channels == 0) {
        channels = desc->channels;
    }

    // Allocate pixel buffer
    px_len = desc->width * desc->height * channels;
    pixels = (unsigned char *) QOI_MALLOC(px_len);
    if (!pixels) {
        return NULL;
    }

    // Initialize decoder state
    QOI_ZEROARR(index);
    px.rgba.r = 0;
    px.rgba.g = 0;
    px.rgba.b = 0;
    px.rgba.a = 255;

    // Main decoding loop
    chunks_len = size - (int)sizeof(qoi_padding);
    for (px_pos = 0; px_pos < px_len; px_pos += channels) {
        if (run > 0) {
            run--;
        } else if (p < chunks_len) {
            int b1 = bytes[p++];

            if (b1 == QOI_OP_RGB) {
                px.rgba.r = bytes[p++];
                px.rgba.g = bytes[p++];
                px.rgba.b = bytes[p++];
            } else if (b1 == QOI_OP_RGBA) {
                px.rgba.r = bytes[p++];
                px.rgba.g = bytes[p++];
                px.rgba.b = bytes[p++];
                px.rgba.a = bytes[p++];
            } else if ((b1 & QOI_MASK_2) == QOI_OP_INDEX) {
                px = index[b1];
            } else if ((b1 & QOI_MASK_2) == QOI_OP_DIFF) {
                px.rgba.r += ((b1 >> 4) & 0x03) - 2;
                px.rgba.g += ((b1 >> 2) & 0x03) - 2;
                px.rgba.b += ( b1       & 0x03) - 2;
            } else if ((b1 & QOI_MASK_2) == QOI_OP_LUMA) {
                int b2 = bytes[p++];
                int vg = (b1 & 0x3f) - 32;
                px.rgba.r += vg - 8 + ((b2 >> 4) & 0x0f);
                px.rgba.g += vg;
                px.rgba.b += vg - 8 + (b2 & 0x0f);
            } else if ((b1 & QOI_MASK_2) == QOI_OP_RUN) {
                run = (b1 & 0x3f);
            }

            index[QOI_COLOR_HASH(px) % 64] = px;
        }

        // Write pixel values
        pixels[px_pos + 0] = px.rgba.r;
        pixels[px_pos + 1] = px.rgba.g;
        pixels[px_pos + 2] = px.rgba.b;
        if (channels == 4) {
            pixels[px_pos + 3] = px.rgba.a;
        }
    }

    return pixels;
}

// File handling functions if stdio is enabled
#ifndef QOI_NO_STDIO

int qoi_write(const char *filename, const void *data, const qoi_desc *desc) {
    FILE *f = fopen(filename, "wb");
    int size, err;
    void *encoded;

    if (!f) {
        return 0;
    }

    // Encode the pixel data into QOI format
    encoded = qoi_encode(data, desc, &size);
    if (!encoded) {
        fclose(f);
        return 0;
    }

    // Write the encoded data to file
    fwrite(encoded, 1, size, f);
    fflush(f);
    err = ferror(f);
    fclose(f);

    QOI_FREE(encoded);
    return err ? 0 : size;
}

void *qoi_read(const char *filename, qoi_desc *desc, int channels) {
    FILE *f = fopen(filename, "rb");
    int size, bytes_read;
    void *pixels, *data;

    if (!f) {
        return NULL;
    }

    // Get file size
    fseek(f, 0, SEEK_END);
    size = ftell(f);
    if (size <= 0 || fseek(f, 0, SEEK_SET) != 0) {
        fclose(f);
        return NULL;
    }

    // Read entire file into memory
    data = QOI_MALLOC(size);
    if (!data) {
        fclose(f);
        return NULL;
    }

    // Read file content and decode
    bytes_read = fread(data, 1, size, f);
    fclose(f);
    pixels = (bytes_read != size) ? NULL : qoi_decode(data, bytes_read, desc, channels);
    QOI_FREE(data);
    return pixels;
}

#endif /* QOI_NO_STDIO */
#endif /* QOI_IMPLEMENTATION */
```

Read PNG (using `stb_image.h`) and convert the PNG file to QOI format (using `vec.s`).
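Before the driver program, the wraparound and bias rules that `qoi_encode` relies on can be exercised in isolation. The following is a minimal sketch, not part of the project code: the helper names `qoi_index_position`, `qoi_pack_diff`, and `qoi_unpack_diff` are hypothetical, and only the arithmetic follows the format description (hash modulo 64, `QOI_OP_DIFF` tag b01, bias of 2, 8-bit wraparound).

```c
#include <stdint.h>

/* Hypothetical helper: slot of a pixel in the 64-entry array of
 * previously seen pixels. */
static inline unsigned qoi_index_position(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
    return (r * 3u + g * 5u + b * 7u + a * 11u) % 64u;
}

/* Hypothetical helper: try to pack the step prev -> cur into one
 * QOI_OP_DIFF byte. Returns 1 and writes *out when every channel
 * difference (after 8-bit wraparound) lies in -2..1, otherwise 0. */
static int qoi_pack_diff(const uint8_t prev[3], const uint8_t cur[3], uint8_t *out) {
    int8_t d[3];
    for (int i = 0; i < 3; i++) {
        /* Subtract modulo 256, then reinterpret as a signed byte. */
        d[i] = (int8_t)(uint8_t)(cur[i] - prev[i]);
        if (d[i] < -2 || d[i] > 1) {
            return 0;
        }
    }
    /* Tag b01, then dr, dg, db stored with a bias of 2. */
    *out = (uint8_t)(0x40 | ((d[0] + 2) << 4) | ((d[1] + 2) << 2) | (d[2] + 2));
    return 1;
}

/* Hypothetical helper: apply a QOI_OP_DIFF byte to px in place,
 * wrapping each channel modulo 256. */
static void qoi_unpack_diff(uint8_t byte, uint8_t px[3]) {
    px[0] = (uint8_t)(px[0] + ((byte >> 4) & 0x03) - 2);
    px[1] = (uint8_t)(px[1] + ((byte >> 2) & 0x03) - 2);
    px[2] = (uint8_t)(px[2] + ( byte       & 0x03) - 2);
}
```

For instance, stepping from `{1, 255, 10}` to `{255, 0, 10}` needs differences of -2, +1, and 0, so it fits in a single `QOI_OP_DIFF` byte even though both the red and green channels wrap around.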
`main.c`
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>

#define STB_IMAGE_IMPLEMENTATION
#define STBI_ONLY_PNG
#define STBI_NO_LINEAR
#include "stb_image.h"

#define QOI_IMPLEMENTATION
#include "qoi.h"

struct rgba_pixel {
    unsigned char r, g, b, a;
};

// Implemented in vec.S
void encode_pixels_rvv(unsigned char *out, const struct rgba_pixel *pixels, int n);

// Read a whole file into a freshly allocated buffer
static unsigned char* read_file(const char* filename, size_t* size_out) {
    FILE* f = fopen(filename, "rb");
    if (!f) {
        fprintf(stderr, "Failed to open %s: %s\n", filename, strerror(errno));
        return NULL;
    }

    struct stat st;
    if (fstat(fileno(f), &st) != 0) {
        fprintf(stderr, "Failed to stat %s: %s\n", filename, strerror(errno));
        fclose(f);
        return NULL;
    }

    unsigned char* buffer = malloc(st.st_size);
    if (!buffer) {
        fprintf(stderr, "Failed to allocate %ld bytes\n", (long)st.st_size);
        fclose(f);
        return NULL;
    }

    size_t bytes_read = fread(buffer, 1, st.st_size, f);
    if (bytes_read != (size_t)st.st_size) {
        fprintf(stderr, "Failed to read file: expected %ld bytes, got %ld\n",
                (long)st.st_size, (long)bytes_read);
        free(buffer);
        fclose(f);
        return NULL;
    }

    fclose(f);
    *size_out = st.st_size;
    return buffer;
}

// Write a buffer to a file; returns 1 on success, 0 on failure
static int write_file(const char* filename, const unsigned char* data, size_t size) {
    FILE* f = fopen(filename, "wb");
    if (!f) {
        fprintf(stderr, "Failed to create %s: %s\n", filename, strerror(errno));
        return 0;
    }

    size_t written = fwrite(data, 1, size, f);
    if (written != size) {
        fprintf(stderr, "Failed to write file: expected %ld bytes, wrote %ld\n",
                (long)size, (long)written);
        fclose(f);
        return 0;
    }

    fclose(f);
    return 1;
}

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "Usage: %s <input.png> <output.qoi>\n", argv[0]);
        return 1;
    }

    printf("Reading input file: %s\n", argv[1]);

    int width, height, channels;
    if (!stbi_info(argv[1], &width, &height, &channels)) {
        fprintf(stderr, "Failed to read PNG header: %s\n", stbi_failure_reason());
        return 1;
    }
    printf("Image: %dx%d, %d channels\n", width, height, channels);

    channels = 4; // Force RGBA
    unsigned char *png_data = stbi_load(argv[1], &width, &height, NULL, channels);
    if (!png_data) {
        fprintf(stderr, "Failed to load PNG: %s\n", stbi_failure_reason());
        return 1;
    }

    int pixel_count = width * height;
    struct rgba_pixel *pixels = malloc((size_t)pixel_count * sizeof(struct rgba_pixel));
    if (!pixels) {
        fprintf(stderr, "Failed to allocate pixel buffer\n");
        stbi_image_free(png_data);
        return 1;
    }

    // Convert to RGBA struct format
    for (int i = 0; i < pixel_count; i++) {
        pixels[i].r = png_data[i * 4 + 0];
        pixels[i].g = png_data[i * 4 + 1];
        pixels[i].b = png_data[i * 4 + 2];
        pixels[i].a = png_data[i * 4 + 3];
    }
    stbi_image_free(png_data);

    // Scratch buffer for the RVV pass (per-pixel hash bytes and run flags)
    unsigned char *processed = malloc((size_t)pixel_count * sizeof(struct rgba_pixel));
    if (!processed) {
        fprintf(stderr, "Failed to allocate processing buffer\n");
        free(pixels);
        return 1;
    }

    printf("Processing pixels with RVV...\n");
    encode_pixels_rvv(processed, pixels, pixel_count);

    qoi_desc desc = {
        .width = width,
        .height = height,
        .channels = 4,
        .colorspace = QOI_SRGB
    };

    // The baseline encoder needs the raw RGBA pixels; `processed` only holds
    // hashes and run flags, which qoi_encode does not consume.
    int qoi_size;
    void *qoi_data = qoi_encode(pixels, &desc, &qoi_size);
    if (!qoi_data) {
        fprintf(stderr, "QOI encoding failed\n");
        free(pixels);
        free(processed);
        return 1;
    }

    printf("Writing output file: %s\n", argv[2]);
    if (!write_file(argv[2], qoi_data, qoi_size)) {
        free(pixels);
        free(processed);
        free(qoi_data);
        return 1;
    }

    free(pixels);
    free(processed);
    free(qoi_data);

    printf("Conversion successful\n");
    return 0;
}
```

RVV QOI encoder implementation
`vec.S`
```asm
# vec.S - RISC-V Vector Extension helper for QOI encoding (RV32 + V)
#
# Register conventions:
#   a0 = output buffer pointer
#   a1 = input pixel array pointer
#   a2 = number of pixels to process
#   t0 = remaining pixels counter
#   t1 = vector length after vsetvli
#   t3 = byte stride between pixels (4)
#   t4, t5, t6, a3 = R, G, B, A of the pixel preceding the current strip
#   v0      = run detection mask (pixel equals its predecessor)
#   v4-v7   = temporary calculations
#   v8-v11  = R, G, B, A components of the current strip
#   v12-v15 = R, G, B, A components of each pixel's predecessor
#   v16     = pixel hash results
#   v24     = run flags as bytes (0 or 1)
#
# Output per strip of vl pixels: vl hash bytes, then vl run-flag bytes.

    .text
    .balign 4
    .global encode_pixels_rvv

# void encode_pixels_rvv(unsigned char *out, const struct rgba_pixel *pixels, int n)
encode_pixels_rvv:
    blez    a2, encode_return       # Nothing to do for n <= 0

    # Preserve return address and callee-saved registers
    addi    sp, sp, -16
    sw      ra, 12(sp)
    sw      s0, 8(sp)
    sw      s1, 4(sp)

    # Initialize our working registers
    mv      t0, a2                  # Copy pixel count to counter
    mv      s0, a0                  # Save output buffer pointer
    mv      s1, a1                  # Save input pixel pointer
    li      t3, 4                   # Each component is 4 bytes apart
    li      t4, 0                   # Initial previous pixel: r = 0
    li      t5, 0                   #                         g = 0
    li      t6, 0                   #                         b = 0
    li      a3, 255                 #                         a = 255

process_loop:
    # Set vector length based on remaining pixels
    vsetvli t1, t0, e8, m1, ta, ma  # 8-bit elements

    # Load RGBA components using strided loads
    vlse8.v v8, (s1), t3            # Load R components
    addi    t2, s1, 1
    vlse8.v v9, (t2), t3            # Load G components
    addi    t2, s1, 2
    vlse8.v v10, (t2), t3           # Load B components
    addi    t2, s1, 3
    vlse8.v v11, (t2), t3           # Load A components

    # Calculate QOI hash: (r*3 + g*5 + b*7 + a*11) % 64
    # 8-bit arithmetic is enough: wrapping at 256 does not change the
    # result modulo 64, because 256 is a multiple of 64.
    li      t2, 3
    vmul.vx v16, v8, t2             # v16 = r * 3
    li      t2, 5
    vmul.vx v4, v9, t2              # v4 = g * 5
    vadd.vv v16, v16, v4            # Add g component
    li      t2, 7
    vmul.vx v4, v10, t2             # v4 = b * 7
    vadd.vv v16, v16, v4            # Add b component
    li      t2, 11
    vmul.vx v4, v11, t2             # v4 = a * 11
    vadd.vv v16, v16, v4            # Add a component
    li      t2, 63
    vand.vx v16, v16, t2            # Modulo 64 via AND (64 is a power of 2)

    # Detect runs of identical pixels: compare each pixel with its predecessor.
    # Element 0 is compared with the last pixel of the previous strip.
    vslide1up.vx v12, v8, t4        # Predecessor R components
    vslide1up.vx v13, v9, t5        # Predecessor G components
    vslide1up.vx v14, v10, t6       # Predecessor B components
    vslide1up.vx v15, v11, a3       # Predecessor A components
    vmseq.vv v0, v8, v12            # Compare R components
    vmseq.vv v1, v9, v13            # Compare G components
    vmand.mm v0, v0, v1
    vmseq.vv v1, v10, v14           # Compare B components
    vmand.mm v0, v0, v1
    vmseq.vv v1, v11, v15           # Compare A components
    vmand.mm v0, v0, v1
    vmv.v.i v24, 0
    vmerge.vim v24, v24, 1, v0      # Run flag byte: 1 where pixel == predecessor

    # Store the hash values and run flags for the C code to process
    vse8.v  v16, (s0)               # Store hash values
    add     t2, s0, t1
    vse8.v  v24, (t2)               # Store run flags

    # Advance the input pointer and remember the last pixel of this strip
    slli    t2, t1, 2               # Multiply vector length by 4 (RGBA)
    add     s1, s1, t2              # Update input pointer
    lbu     t4, -4(s1)
    lbu     t5, -3(s1)
    lbu     t6, -2(s1)
    lbu     a3, -1(s1)
    add     s0, s0, t1              # Update output pointer for hash values
    add     s0, s0, t1              # Update output pointer for run flags

    # Update remaining pixel count and continue if there are more pixels
    sub     t0, t0, t1
    bnez    t0, process_loop

    # Restore registers and return
    lw      ra, 12(sp)
    lw      s0, 8(sp)
    lw      s1, 4(sp)
    addi    sp, sp, 16
encode_return:
    ret

# Additional helper functions if needed
compute_differences:
    # Compute differences between consecutive pixels
    vsub.vv v4, v8, v12             # R differences
    vsub.vv v5, v9, v13             # G differences
    vsub.vv v6, v10, v14            # B differences
    vsub.vv v7, v11, v15            # A differences
    ret

detect_small_diffs:
    # Check if differences are within the small range (-2 to 1)
    vmsle.vi v20, v4, 1             # Check upper bound for R
    vmsgt.vi v21, v4, -3            # Check lower bound for R
    vmand.mm v20, v20, v21          # Combine R bounds
    # Repeat for G and B...
    ret
```

## QOI decoder - 洪至謙
```c
#include <stdlib.h>
#include <string.h>

// Zero an array
#ifndef QOI_ZEROARR
#define QOI_ZEROARR(a) memset((a),0,sizeof(a))
#endif

#define QOI_OP_DIFF  0x40 /* 01xxxxxx */
#define QOI_OP_LUMA  0x80 /* 10xxxxxx */
#define QOI_OP_INDEX 0x00 /* 00xxxxxx */
#define QOI_OP_RUN   0xc0 /* 11xxxxxx */
#define QOI_OP_RGB   0xfe /* 11111110 */
#define QOI_OP_RGBA  0xff /* 11111111 */
#define QOI_MASK_2   0xc0 /* 11000000 */

#define QOI_COLOR_HASH(C) (C.rgba.r*3 + C.rgba.g*5 + C.rgba.b*7 + C.rgba.a*11)

// Same definitions as in qoi.h
#define QOI_MAGIC \
    (((unsigned int)'q') << 24 | ((unsigned int)'o') << 16 | \
     ((unsigned int)'i') <<  8 | ((unsigned int)'f'))
#define QOI_HEADER_SIZE 14
#define QOI_PIXELS_MAX ((unsigned int)400000000)

// malloc and free; the RISC-V version does not need free
#ifndef QOI_MALLOC
#define QOI_MALLOC(sz) malloc(sz)
#define QOI_FREE(p)    free(p)
#endif

static const unsigned char qoi_padding[8] = {0,0,0,0,0,0,0,1};

// A union is like a struct, but its members share the same memory:
// the pixel can be accessed either as four bytes or as one 32-bit integer.
typedef union {
    struct { unsigned char r, g, b, a; } rgba;
    unsigned int v;
} qoi_rgba_t;

// Read four bytes and return them as one big-endian 32-bit value
static unsigned int qoi_read_32(const unsigned char *bytes, int *p) {
    unsigned int a = bytes[(*p)++];
    unsigned int b = bytes[(*p)++];
    unsigned int c = bytes[(*p)++];
    unsigned int d = bytes[(*p)++];
    return a << 24 | b << 16 | c << 8 | d;
}

typedef struct {
    unsigned int width;
    unsigned int height;
    unsigned char channels;
    unsigned char colorspace;
} qoi_desc;

void *qoi_decode(const void *data, int size, qoi_desc *desc, int channels) {
    const unsigned char *bytes;
    unsigned int header_magic;
    unsigned char *pixels;
    qoi_rgba_t index[64];
    qoi_rgba_t px;
    int px_len, chunks_len, px_pos;
    int p = 0, run = 0;

    // Reject NULL pointers, unsupported channel counts and truncated input
    if (
        data == NULL || desc == NULL ||
        (channels != 0 && channels != 3 && channels != 4) ||
        size < QOI_HEADER_SIZE + (int)sizeof(qoi_padding)
    ) {
        return NULL;
    }

    // Access the encoded image as bytes
    bytes = (const unsigned char *)data;

    // Parse the QOI header
    header_magic = qoi_read_32(bytes, &p);
    desc->width = qoi_read_32(bytes, &p);
    desc->height = qoi_read_32(bytes, &p);
    desc->channels = bytes[p++];
    desc->colorspace = bytes[p++];

    // Return NULL if the header is malformed
    if (
        desc->width == 0 || desc->height == 0 ||
        desc->channels < 3 || desc->channels > 4 ||
        desc->colorspace > 1 ||
        header_magic != QOI_MAGIC ||
        desc->height >= QOI_PIXELS_MAX / desc->width
    ) {
        return NULL;
    }

    if (channels == 0) {
        channels = desc->channels;
    }

    // Size of the decoded pixel buffer in bytes
    px_len = desc->width * desc->height * channels;
    pixels = (unsigned char *) QOI_MALLOC(px_len);
    if (!pixels) {
        return NULL;
    }

    QOI_ZEROARR(index);
    px.rgba.r = 0;
    px.rgba.g = 0;
    px.rgba.b = 0;
    px.rgba.a = 255;

    // Chunk data ends where the 8-byte end marker begins
    chunks_len = size - (int)sizeof(qoi_padding);
    for (px_pos = 0; px_pos < px_len; px_pos += channels) {
        if (run > 0) {
            run--;
        } else if (p < chunks_len) {
            int b1 = bytes[p++];

            if (b1 == QOI_OP_RGB) {
                px.rgba.r = bytes[p++];
                px.rgba.g = bytes[p++];
                px.rgba.b = bytes[p++];
            } else if (b1 == QOI_OP_RGBA) {
                px.rgba.r = bytes[p++];
                px.rgba.g = bytes[p++];
                px.rgba.b = bytes[p++];
                px.rgba.a = bytes[p++];
            } else if ((b1 & QOI_MASK_2) == QOI_OP_INDEX) {
                px = index[b1];
            } else if ((b1 & QOI_MASK_2) == QOI_OP_DIFF) {
                px.rgba.r += ((b1 >> 4) & 0x03) - 2;
                px.rgba.g += ((b1 >> 2) & 0x03) - 2;
                px.rgba.b += ( b1       & 0x03) - 2;
            } else if ((b1 & QOI_MASK_2) == QOI_OP_LUMA) {
                int b2 = bytes[p++];
                int vg = (b1 & 0x3f) - 32;
                px.rgba.r += vg - 8 + ((b2 >> 4) & 0x0f);
                px.rgba.g += vg;
                px.rgba.b += vg - 8 + (b2 & 0x0f);
            } else if ((b1 & QOI_MASK_2) == QOI_OP_RUN) {
                run = (b1 & 0x3f);
            }

            index[QOI_COLOR_HASH(px) % 64] = px;
        }

        pixels[px_pos + 0] = px.rgba.r;
        pixels[px_pos + 1] = px.rgba.g;
        pixels[px_pos + 2] = px.rgba.b;
        if (channels == 4) {
            pixels[px_pos + 3] = px.rgba.a;
        }
    }

    return pixels;
}
```

RISC-V
```asm
# void *qoi_decode(const void *data, int size, qoi_desc *desc, int channels);
#   a0 = data, a1 = size, a2 = desc, a3 = channels (0, 3 or 4)
#   returns a0 = decoded pixels, or 0 (NULL) on failure
# Assumes RV32IM and a C runtime that provides malloc.
#
# Register usage:
#   s0-s3 = px.rgba.r, px.rgba.g, px.rgba.b, px.rgba.a
#   s4 = pixels (start of output)   s5 = read pointer (bytes + p)
#   s6 = end of chunk data          s7 = channels
#   s8 = write pointer              s9 = end of output
#   s10 = run

    .equ QOI_OP_INDEX, 0x00
    .equ QOI_OP_DIFF,  0x40
    .equ QOI_OP_LUMA,  0x80
    .equ QOI_OP_RUN,   0xc0
    .equ QOI_OP_RGB,   0xfe
    .equ QOI_OP_RGBA,  0xff
    .equ QOI_MASK_2,   0xc0
    .equ QOI_MAGIC,    0x716f6966   # "qoif"

    .bss
    .balign 4
index:
    .space 256                      # index[64], 4 bytes per entry

    .text
    .globl qoi_decode
qoi_decode:
    addi    sp, sp, -48
    sw      ra, 44(sp)
    sw      s0, 40(sp)
    sw      s1, 36(sp)
    sw      s2, 32(sp)
    sw      s3, 28(sp)
    sw      s4, 24(sp)
    sw      s5, 20(sp)
    sw      s6, 16(sp)
    sw      s7, 12(sp)
    sw      s8, 8(sp)
    sw      s9, 4(sp)
    sw      s10, 0(sp)

    # Input validation
    beqz    a0, fail                # data == NULL
    beqz    a2, fail                # desc == NULL
    li      t0, 22                  # QOI_HEADER_SIZE + sizeof(qoi_padding)
    blt     a1, t0, fail
    beqz    a3, channels_ok         # channels must be 0, 3 or 4
    li      t0, 3
    beq     a3, t0, channels_ok
    li      t0, 4
    bne     a3, t0, fail
channels_ok:
    mv      s5, a0                  # read pointer = data
    add     s6, a0, a1
    addi    s6, s6, -8              # chunk data ends before the end marker
    mv      s7, a3

    # header_magic = qoi_read_32(bytes, &p);
    jal     qoi_read_32
    li      t0, QOI_MAGIC
    bne     a0, t0, fail
    # desc->width = qoi_read_32(bytes, &p);
    jal     qoi_read_32
    beqz    a0, fail
    sw      a0, 0(a2)
    mv      t3, a0                  # t3 = width
    # desc->height = qoi_read_32(bytes, &p);
    jal     qoi_read_32
    beqz    a0, fail
    sw      a0, 4(a2)
    mv      t4, a0                  # t4 = height
    # desc->channels = bytes[p++]; desc->colorspace = bytes[p++];
    lbu     t1, 0(s5)
    lbu     t2, 1(s5)
    addi    s5, s5, 2
    sb      t1, 8(a2)
    sb      t2, 9(a2)
    addi    t0, t1, -3              # channels must be 3 or 4
    sltiu   t0, t0, 2
    beqz    t0, fail
    sltiu   t0, t2, 2               # colorspace must be 0 or 1
    beqz    t0, fail
    li      t0, 400000000           # QOI_PIXELS_MAX
    divu    t0, t0, t3
    bgeu    t4, t0, fail            # height >= QOI_PIXELS_MAX / width

    # if (channels == 0) { channels = desc->channels; }
    bnez    s7, alloc_pixels
    mv      s7, t1
alloc_pixels:
    # px_len = desc->width * desc->height * channels;
    mul     t0, t3, t4
    mul     s9, t0, s7
    mv      a0, s9
    call    malloc
    beqz    a0, fail
    mv      s4, a0                  # pixels
    mv      s8, a0                  # write pointer
    add     s9, a0, s9              # end of output

    # QOI_ZEROARR(index);
    la      t0, index
    addi    t1, t0, 256
zero_index:
    sw      zero, 0(t0)
    addi    t0, t0, 4
    bltu    t0, t1, zero_index

    li      s0, 0                   # px = {0, 0, 0, 255}
    li      s1, 0
    li      s2, 0
    li      s3, 255
    li      s10, 0                  # run = 0

loop_start:
    bgeu    s8, s9, loop_end        # if px_pos >= px_len, end
    beqz    s10, process_chunks
    addi    s10, s10, -1            # run--
    j       loop_continue

process_chunks:
    bgeu    s5, s6, loop_continue   # no chunk data left
    lbu     t1, 0(s5)               # b1 = bytes[p++]
    addi    s5, s5, 1
    li      t0, QOI_OP_RGB
    beq     t1, t0, handle_rgb
    li      t0, QOI_OP_RGBA
    beq     t1, t0, handle_rgba
    andi    t2, t1, QOI_MASK_2
    beqz    t2, handle_index        # (b1 & QOI_MASK_2) == QOI_OP_INDEX
    li      t0, QOI_OP_DIFF
    beq     t2, t0, handle_diff
    li      t0, QOI_OP_LUMA
    beq     t2, t0, handle_luma
    andi    s10, t1, 0x3f           # QOI_OP_RUN: run = b1 & 0x3f

main_loop:
    # index[QOI_COLOR_HASH(px) % 64] = px;
    li      t0, 3
    mul     t2, s0, t0              # r * 3
    li      t0, 5
    mul     t3, s1, t0              # g * 5
    add     t2, t2, t3
    li      t0, 7
    mul     t3, s2, t0              # b * 7
    add     t2, t2, t3
    li      t0, 11
    mul     t3, s3, t0              # a * 11
    add     t2, t2, t3
    andi    t2, t2, 63              # % 64
    slli    t2, t2, 2               # 4 bytes per entry
    la      t0, index
    add     t0, t0, t2
    sb      s0, 0(t0)
    sb      s1, 1(t0)
    sb      s2, 2(t0)
    sb      s3, 3(t0)

loop_continue:
    sb      s0, 0(s8)               # px.rgba.r
    sb      s1, 1(s8)               # px.rgba.g
    sb      s2, 2(s8)               # px.rgba.b
    li      t0, 4
    bne     s7, t0, skip_alpha      # if channels == 3, skip alpha
    sb      s3, 3(s8)               # px.rgba.a
skip_alpha:
    add     s8, s8, s7              # px_pos += channels
    j       loop_start

loop_end:
    mv      a0, s4                  # return pixels
    j       epilogue
fail:
    li      a0, 0                   # return NULL
epilogue:
    lw      ra, 44(sp)
    lw      s0, 40(sp)
    lw      s1, 36(sp)
    lw      s2, 32(sp)
    lw      s3, 28(sp)
    lw      s4, 24(sp)
    lw      s5, 20(sp)
    lw      s6, 16(sp)
    lw      s7, 12(sp)
    lw      s8, 8(sp)
    lw      s9, 4(sp)
    lw      s10, 0(sp)
    addi    sp, sp, 48
    ret

# Reads a big-endian 32-bit value at s5 and advances s5 by 4.
# Returns the value in a0; clobbers only t0.
qoi_read_32:
    lbu     a0, 0(s5)
    slli    a0, a0, 24              # a << 24
    lbu     t0, 1(s5)
    slli    t0, t0, 16              # b << 16
    or      a0, a0, t0
    lbu     t0, 2(s5)
    slli    t0, t0, 8               # c << 8
    or      a0, a0, t0
    lbu     t0, 3(s5)               # d
    or      a0, a0, t0
    addi    s5, s5, 4
    ret

handle_rgb:
    lbu     s0, 0(s5)               # px.rgba.r = bytes[p++]
    lbu     s1, 1(s5)               # px.rgba.g = bytes[p++]
    lbu     s2, 2(s5)               # px.rgba.b = bytes[p++]
    addi    s5, s5, 3
    j       main_loop

handle_rgba:
    lbu     s0, 0(s5)               # px.rgba.r = bytes[p++]
    lbu     s1, 1(s5)               # px.rgba.g = bytes[p++]
    lbu     s2, 2(s5)               # px.rgba.b = bytes[p++]
    lbu     s3, 3(s5)               # px.rgba.a = bytes[p++]
    addi    s5, s5, 4
    j       main_loop

handle_index:
    slli    t2, t1, 2               # b1 * 4 (b1 is 0..63 here)
    la      t0, index
    add     t0, t0, t2
    lbu     s0, 0(t0)               # px = index[b1]
    lbu     s1, 1(t0)
    lbu     s2, 2(t0)
    lbu     s3, 3(t0)
    j       main_loop

handle_diff:
    srli    t0, t1, 4               # px.rgba.r += ((b1 >> 4) & 0x03) - 2
    andi    t0, t0, 0x03
    addi    t0, t0, -2
    add     s0, s0, t0
    andi    s0, s0, 0xff
    srli    t0, t1, 2               # px.rgba.g += ((b1 >> 2) & 0x03) - 2
    andi    t0, t0, 0x03
    addi    t0, t0, -2
    add     s1, s1, t0
    andi    s1, s1, 0xff
    andi    t0, t1, 0x03            # px.rgba.b += (b1 & 0x03) - 2
    addi    t0, t0, -2
    add     s2, s2, t0
    andi    s2, s2, 0xff
    j       main_loop

handle_luma:
    lbu     t2, 0(s5)               # b2 = bytes[p++]
    addi    s5, s5, 1
    andi    t3, t1, 0x3f            # vg = (b1 & 0x3f) - 32
    addi    t3, t3, -32
    add     s1, s1, t3              # px.rgba.g += vg
    andi    s1, s1, 0xff
    addi    t3, t3, -8              # t3 = vg - 8
    srli    t0, t2, 4               # px.rgba.r += vg - 8 + ((b2 >> 4) & 0x0f)
    add     t0, t0, t3
    add     s0, s0, t0
    andi    s0, s0, 0xff
    andi    t0, t2, 0x0f            # px.rgba.b += vg - 8 + (b2 & 0x0f)
    add     t0, t0, t3
    add     s2, s2, t0
    andi    s2, s2, 0xff
    j       main_loop
```

RISC-V with vector extension
```asm
# Vector registers used:
# v0-v3: RGBA components
# v4: temporary calculations
# v8-v11: for index operations

    .equ QOI_OP_INDEX, 0x00
    .equ QOI_OP_DIFF,  0x40
    .equ QOI_OP_LUMA,  0x80
    .equ QOI_OP_RUN,   0xc0
    .equ QOI_OP_RGB,   0xfe
    .equ QOI_OP_RGBA,  0xff
    .equ QOI_MASK_2,   0xc0

    .bss
    .balign 4
index:
    .space 256                      # index[64]

    .text
qoi_decode:
    # Configure vector unit
    li      t0, 32                  # Request 32 byte elements
    vsetvli t1, t0, e8, m1, ta, ma  # 8-bit elements, single vector register

    # Save original arguments
    mv      s5, a0                  # Save data pointer
    mv      s6, a1                  # Save size
    mv      s7, a2                  # Save desc pointer
    mv      s8, a3                  # Save channels

    # Read header as before
    jal     qoi_read_32
    mv      t1, a0

# Process pixels in vector mode
process_pixels_vector:
    # Load multiple RGBA pixels and split them into R, G, B, A registers
    vlseg4e8.v v0, (s4)             # v0 = R, v1 = G, v2 = B, v3 = A

handle_diff_vector:
    # Vector version of diff handling
    li      t2, 0x3f
    vand.vx v4, v0, t2              # Mask for diff
    vadd.vi v4, v4, -2              # Subtract 2
    vadd.vv v0, v0, v4              # Add to red channel

    vand.vx v4, v1, t2
    vadd.vi v4, v4, -2
    vadd.vv v1, v1, v4              # Add to green channel

    vand.vx v4, v2, t2
    vadd.vi v4, v4, -2
    vadd.vv v2, v2, v4              # Add to blue channel

QOI_COLOR_HASH_vector:
    # Vector version of color hash
    li      t2, 3
    vmul.vx v8, v0, t2              # r * 3
    li      t2, 5
    vmul.vx v9, v1, t2              # g * 5
    li      t2, 7
    vmul.vx v10, v2, t2             # b * 7
    li      t2, 11
    vmul.vx v11, v3, t2             # a * 11

    vadd.vv v8, v8, v9              # Add components
    vadd.vv v8, v8, v10
    vadd.vv v8, v8, v11

    # Store results back as interleaved RGBA pixels
    vsseg4e8.v v0, (s4)

process_chunks_vector:
    # Load chunk of bytes into vector register
    vsetvli t0, a1, e8, m1, ta, ma  # Set vector length for byte operations
    vle8.v  v4, (s5)                # Load chunk of bytes

    # Check for different opcodes in parallel
    li      t2, QOI_MASK_2
    vand.vx v5, v4, t2              # Apply QOI_MASK_2 to all elements

    # Create masks for different opcodes
    li      t2, QOI_OP_RGB
    vmseq.vx v6, v4, t2             # Mask for QOI_OP_RGB
    li      t2, QOI_OP_RGBA
    vmseq.vx v7, v4, t2             # Mask for QOI_OP_RGBA
    vmseq.vi v8, v5, 0              # Mask for QOI_OP_INDEX
    li      t2, QOI_OP_DIFF
    vmseq.vx v9, v5, t2             # Mask for QOI_OP_DIFF
    li      t2, QOI_OP_LUMA
    vmseq.vx v10, v5, t2            # Mask for QOI_OP_LUMA
    li      t2, QOI_OP_RUN
    vmseq.vx v11, v5, t2            # Mask for QOI_OP_RUN

    # Handle RGB chunks
    vcompress.vm v12, v4, v6        # Gather RGB chunks
    vrgather.vv v0, v12, v6         # Load R components
    vrgather.vv v1, v12, v6         # Load G components
    vrgather.vv v2, v12, v6         # Load B components

    # Handle RGBA chunks
    vcompress.vm v12, v4, v7        # Gather RGBA chunks
    vrgather.vv v0, v12, v7         # Load R components
    vrgather.vv v1, v12, v7         # Load G components
    vrgather.vv v2, v12, v7         # Load B components
    vrgather.vv v3, v12, v7         # Load A components

    # Handle INDEX chunks
    vcompress.vm v12, v4, v8        # Gather INDEX chunks
    vsll.vi v13, v12, 2             # Multiply by 4 for index lookup
    la      t2, index
    vluxei8.v v14, (t2), v13        # Load from index array

    # Handle DIFF chunks (similar to original but vectorized)
    vcompress.vm v12, v4, v9
    vsrl.vi v15, v12, 4             # (b1 >> 4) & 0x03
    vand.vi v15, v15, 0x03
    vadd.vi v15, v15, -2            # -2
    vadd.vv v0, v0, v15             # Add to R

    vsrl.vi v15, v12, 2             # (b1 >> 2) & 0x03
    vand.vi v15, v15, 0x03
    vadd.vi v15, v15, -2
    vadd.vv v1, v1, v15             # Add to G

    vand.vi v15, v12, 0x03          # b1 & 0x03
    vadd.vi v15, v15, -2
    vadd.vv v2, v2, v15             # Add to B

    # Continue to main_loop_vector

main_loop_vector:
    # Vector version of main processing loop
    vsetvli t0, a1, e8, m1, ta, ma  # Set vector length based on remaining pixels
    vle8.v  v0, (s5)                # Load chunk of pixels

    # Process op codes in vector mode
    li      t2, QOI_MASK_2
    vand.vx v4, v0, t2              # Mask for op codes
    li      t2, QOI_OP_DIFF
    vmseq.vx v0, v4, t2             # Check for QOI_OP_DIFF
    li      t2, QOI_OP_LUMA
    vmseq.vx v1, v4, t2             # Check for QOI_OP_LUMA
    vmseq.vi v2, v4, 0              # Check for QOI_OP_INDEX

    # Parallel processing based on op codes
    vrgather.vi v8, v0, 0           # Gather DIFF operations
    vrgather.vi v9, v1, 0           # Gather LUMA operations
    vrgather.vi v10, v2, 0          #
Gather INDEX operations # Continue with the rest of the decoder logic ret # Helper functions remain mostly unchanged qoi_read_32: addi sp, sp, -16 sw ra, 12(sp) # save ra sw t5, 8(sp) # save p mv t5, s5 # p = data pointer # Could potentially use vector load for 4 bytes at once # but keeping scalar for header reading since it's not performance critical lb t0, 0(t5) # t0 = data[*p] addi t5, t5, 1 # p++ mv t1, t0 # q = first byte lb t0, 0(t5) # t0 = data[*p] addi t5, t5, 1 # p++ mv t2, t0 # b = second byte lb t0, 0(t5) # t0 = data[*p] addi t5, t5, 1 # p++ mv t3, t0 # c = third byte lb t0, 0(t5) # t0 = data[*p] addi t5, t5, 1 # p++ mv t4, t0 # d = fourth byte # Combine bytes into 32-bit value slli t1, t1, 24 # q << 24 slli t2, t2, 16 # b << 16 slli t3, t3, 8 # c << 8 or t1, t1, t2 # combine q and b or t1, t1, t3 # combine with c or t1, t1, t4 # combine with d to get final 32-bit value mv a0, t1 # move result to return register # Restore stack and return lw ra, 12(sp) lw t5, 8(sp) addi sp, sp, 16 ret QOI_MALLOC: li a7, 214 # sbrk syscall number mv a0, a1 # move size to a0 ecall # make syscall ret # return allocated address in a0 ``` ## RISC-V Vector Extension in 32-bit for the encoder and decoder in Quite OK Image Format ### Enhance version ## Environment Installation #### **Careful! This method is only applicable to version 22.04 or above. Otherwise, you might end up wasting several days installing it on version 20.04, losing a lot of time.(Just like me QQ by至謙)** 0. If your environment is brand new, you must first update it to get the required prerequisites. Update ```bash sudo apt update && sudo apt upgrade -y ``` Install some prerequisites (Ubuntu) ```bash sudo apt-get install autoconf automake autotools-dev curl python3 python3-pip python3-tomli libmpc-dev libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev ninja-build git cmake libglib2.0-dev libslirp-dev ``` 1. 
   Clone the riscv-gnu-toolchain repository from the official GitHub repo.

```bash
mkdir tool && cd tool
git clone https://github.com/riscv-collab/riscv-gnu-toolchain.git --recursive
```

2. Enter the folder `riscv-gnu-toolchain`, create and enter a folder called `build`, and then run `configure` to select what will be compiled.

```bash
cd riscv-gnu-toolchain
mkdir build && cd build
../configure --prefix=$HOME/riscv-gnu-toolchain/build --with-arch=rv32gcv --with-abi=ilp32d --enable-multilib
```

3. Compile the 32-bit RISC-V GNU Toolchain.

:::warning
This step takes a while on my shabby ASUS Mini PN41 (around 3 hours).
:::

```bash
make
```

4. Set up the path permanently.

```bash
echo 'export PATH=$PATH:~/riscv-gnu-toolchain/build/bin' >> ~/.bashrc
source ~/.bashrc
```

5. Now compile the QEMU that ships with riscv-gnu-toolchain and is compatible with it.

:::warning
Keep working in the directory `where_you_git_clone_repo/riscv-gnu-toolchain/build`, or the build will fail.
:::

```bash
make build-qemu
```

6.
   Here is the vector test file, `vector-test.s`.

```c
.section .text
.global _start

_start:
    # Initialize stack pointer
    lui sp, %hi(stack_top)
    addi sp, sp, %lo(stack_top)

    # Print start message
    lui a0, %hi(msg_start)
    addi a0, a0, %lo(msg_start)
    jal print_string

    # Configure the vector unit for exactly 4 elements, so the test
    # does not depend on VLEN being 128
    li t1, 4
    vsetvli t0, t1, e32, m1, ta, ma  # SEW=32, LMUL=1, tail agnostic, mask agnostic

    # Load vector register v0 with data
    la a0, vector_data
    vle32.v v0, (a0)                 # Load 32-bit elements into v0

    # Add 1 to each element
    li t0, 1
    vadd.vx v0, v0, t0               # Add scalar t0 to each element

    # Store result back to memory
    la a0, vector_result
    vse32.v v0, (a0)                 # Store 32-bit elements from v0

    # Print results
    la s0, vector_result             # Load result address
    li s1, 4                         # Counter for 4 numbers

print_loop:
    # Print "Element "
    lui a0, %hi(msg_result)
    addi a0, a0, %lo(msg_result)
    jal print_string

    # Load and print original value
    lui a0, %hi(msg_orig)
    addi a0, a0, %lo(msg_orig)
    jal print_string

    la t0, vector_data
    la t2, vector_result             # Load address of vector_result
    sub t1, s0, t2                   # Byte offset of the current element
    add t0, t0, t1
    lw a0, 0(t0)
    jal print_num

    # Print arrow
    lui a0, %hi(msg_arrow)
    addi a0, a0, %lo(msg_arrow)
    jal print_string

    # Load and print result value
    lw a0, 0(s0)
    jal print_num

    # Print newline
    lui a0, %hi(msg_newline)
    addi a0, a0, %lo(msg_newline)
    jal print_string

    # Move to next number
    addi s0, s0, 4
    addi s1, s1, -1
    bnez s1, print_loop

    # Print completion message
    lui a0, %hi(msg_done)
    addi a0, a0, %lo(msg_done)
    jal print_string

    # Exit success
    li a0, 0
    li a7, 93
    ecall

# Print string function - expects pointer in a0
print_string:
    addi sp, sp, -4
    sw ra, 0(sp)

    mv t0, a0
1:  lbu t1, 0(t0)                    # Scan for the terminating NUL
    beqz t1, 2f
    addi t0, t0, 1
    j 1b
2:  sub t0, t0, a0                   # Length of the string
    mv a2, t0
    mv a1, a0
    li a0, 1                         # stdout
    li a7, 64                        # write
    ecall

    lw ra, 0(sp)
    addi sp, sp, 4
    ret

# Print number function - expects number in a0
print_num:
    addi sp, sp, -32
    sw ra, 28(sp)
    sw s0, 24(sp)
    sw s1, 20(sp)
    sw s2, 16(sp)
    # Bytes 0..15(sp) are a scratch buffer, kept clear of the saved registers

    mv s0, a0                        # Save original number
    li s1, 10                        # Divisor

    # Handle negative numbers
    bgez s0, positive
    neg s0, s0
    li t0, '-'
    sb t0, 0(sp)
    li a0, 1
    mv a1, sp
    li a2, 1
    li a7, 64
    ecall

positive:
    # Convert number to string, filling the buffer from its end
    # so the digits come out in reading order
    addi s2, sp, 16                  # One past the end of the buffer
    mv t0, s2                        # Current buffer position
digit_loop:
    rem t1, s0, s1                   # Get remainder (current digit)
    addi t1, t1, '0'                 # Convert to ASCII
    addi t0, t0, -1                  # Move buffer pointer back
    sb t1, 0(t0)                     # Store digit
    div s0, s0, s1                   # Divide number by 10
    bnez s0, digit_loop

    # Print the number
    mv a1, t0                        # First digit
    sub a2, s2, t0                   # Calculate length
    li a0, 1
    li a7, 64
    ecall

    # Restore registers and return
    lw ra, 28(sp)
    lw s0, 24(sp)
    lw s1, 20(sp)
    lw s2, 16(sp)
    addi sp, sp, 32
    ret

.section .rodata
msg_start:   .string "Starting vector test...\n"
msg_done:    .string "\nVector operations completed.\n"
msg_result:  .string "Element "
msg_orig:    .string "Original: "
msg_arrow:   .string " -> Result: "
msg_newline: .string "\n"

.section .data
.align 4
vector_data:   .word 1, 2, 3, 4     # Pre-initialized input data
.align 4
vector_result: .word 0, 0, 0, 0     # Space for results

.section .bss
.align 4
.space 4096                          # Stack
stack_top:
```

7. Compile it.

```bash
riscv32-unknown-elf-as -march=rv32gcv_zba vector-test.s -o vector-test.o
riscv32-unknown-elf-ld -nostdlib vector-test.o -o vector-test
```

8. Now! It's time to witness the miracle.

```bash
qemu-riscv32 -cpu rv32,v=true,zba=true,vlen=128 ./vector-test
```

9. Result

```bash
Starting vector test...
Element Original: 1 -> Result: 2
Element Original: 2 -> Result: 3
Element Original: 3 -> Result: 4
Element Original: 4 -> Result: 5

Vector operations completed.
```

## PNG to binary

### What is PNG?

PNG (Portable Network Graphics) is a raster graphics file format that uses lossless compression, designed to replace the older GIF format.

#### PNG Format

- 1-bit: Black-and-white images.
- 2/4/8-bit: Palette-based images (up to 256 colors).
- 24-bit: True color images (16,777,216 possible colors).
- 32-bit: True color + alpha channel (supports transparency).

#### Structure of a PNG File

- File Header (Signature) = `89 50 4E 47 0D 0A 1A 0A`
- PNG files are composed of multiple chunks, each including:
    - Length (4 bytes): Length of the chunk's data.
    - Type (4 bytes): Name of the chunk (e.g., IHDR).
    - Data (variable length): Actual content of the chunk.
    - CRC (4 bytes): Checksum for verifying data integrity.
- Key Chunk Types:
    - IHDR (Image Header):
        - Defines basic attributes like width, height, color depth, compression method, etc.
    - PLTE (Palette):
        - Palette data (used for palette-based images only).
    - IDAT (Image Data):
        - Contains compressed pixel data. Multiple IDAT chunks can be present.
    - IEND (Image End):
        - Marks the end of the file, with no data.

Here is the Python script that converts a PNG file to a binary file.

```python
import argparse
import struct
import sys

import numpy as np
from PIL import Image


def png_to_binary(input_png, output_binary):
    """
    Convert a PNG file to a binary format with the following structure:
    - First 4 bytes: width (big-endian uint32)
    - Next 4 bytes: height (big-endian uint32)
    - Next 1 byte: channels (uint8)
    - Remaining bytes: pixel data in row-major order
    """
    try:
        # Open and read the PNG file
        with Image.open(input_png) as img:
            # Convert to RGB or RGBA if not already
            if img.mode not in ['RGB', 'RGBA']:
                img = img.convert('RGB')

            # Get image dimensions and channel count
            width, height = img.size
            channels = len(img.getbands())  # 3 for RGB, 4 for RGBA

            # Convert image to numpy array
            img_array = np.array(img)

            # Write to binary file
            with open(output_binary, 'wb') as f:
                # Write header information
                f.write(struct.pack('>I', width))     # Big-endian uint32
                f.write(struct.pack('>I', height))    # Big-endian uint32
                f.write(struct.pack('B', channels))   # uint8

                # Write pixel data as uint8 in row-major (C) order
                f.write(img_array.astype(np.uint8).tobytes(order='C'))

        return True, f"Successfully converted {input_png} to {output_binary}"

    except FileNotFoundError:
        return False, f"Error: Input file {input_png} not found"
    except Exception as e:
        return False, f"Error during conversion: {str(e)}"


def binary_to_png(input_binary, output_png):
    """
    Convert our binary format back to PNG to verify the conversion worked correctly.
    """
    try:
        with open(input_binary, 'rb') as f:
            # Read header
            width = struct.unpack('>I', f.read(4))[0]
            height = struct.unpack('>I', f.read(4))[0]
            channels = struct.unpack('B', f.read(1))[0]

            # Read pixel data
            mode = 'RGBA' if channels == 4 else 'RGB'
            size = width * height * channels
            data = f.read(size)

        # Convert to numpy array and reshape
        img_array = np.frombuffer(data, dtype=np.uint8)
        img_array = img_array.reshape((height, width, channels))

        # Create and save image
        img = Image.fromarray(img_array, mode)
        img.save(output_png)

        return True, f"Successfully converted {input_binary} to {output_png}"

    except FileNotFoundError:
        return False, f"Error: Input file {input_binary} not found"
    except Exception as e:
        return False, f"Error during conversion: {str(e)}"


def main():
    parser = argparse.ArgumentParser(description='Convert between PNG and binary format')
    parser.add_argument('input_file', help='Input file path')
    parser.add_argument('output_file', help='Output file path')
    parser.add_argument('--to-png', action='store_true',
                        help='Convert from binary to PNG (default is PNG to binary)')

    args = parser.parse_args()

    if args.to_png:
        success, message = binary_to_png(args.input_file, args.output_file)
    else:
        success, message = png_to_binary(args.input_file, args.output_file)

    print(message)
    return 0 if success else 1


if __name__ == "__main__":
    # Propagate the status code so the shell can see failures
    sys.exit(main())
```

Here is how you can use this Python script.

```bash
python3 png2bin.py "input_file_name" "output_file_name"
```

Example

```bash
python3 png2bin.py A.png A.bin
```

## QOI Format Verification

Here is an online QOI viewer. You may drag and drop a QOI format image to test the result.
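Independently of the online viewer, a file can also be sanity-checked locally. The following is a minimal sketch, assuming the 14-byte header layout described earlier (magic `qoif`, big-endian width and height, channels, colorspace) and the 8-byte end marker of seven `0x00` bytes followed by `0x01`, the same bytes the decoder stores as `qoi_padding`. The function name `check_qoi` and the shape of its return value are hypothetical and chosen only for illustration; the check does not decode any pixel data.

```python
import struct

QOI_MAGIC = b"qoif"
QOI_HEADER_SIZE = 14
QOI_END_MARKER = bytes([0, 0, 0, 0, 0, 0, 0, 1])


def check_qoi(data: bytes) -> dict:
    """Validate the QOI header and end marker, then return the header fields.

    Hypothetical helper: it only inspects the first 14 and last 8 bytes.
    """
    if len(data) < QOI_HEADER_SIZE + len(QOI_END_MARKER):
        raise ValueError("too short to be a QOI file")

    # ">" selects big-endian with no padding: 4s + I + I + B + B = 14 bytes
    magic, width, height, channels, colorspace = struct.unpack(
        ">4sIIBB", data[:QOI_HEADER_SIZE]
    )

    if magic != QOI_MAGIC:
        raise ValueError("bad magic bytes, expected 'qoif'")
    if width == 0 or height == 0:
        raise ValueError("width and height must be non-zero")
    if channels not in (3, 4):
        raise ValueError("channels must be 3 (RGB) or 4 (RGBA)")
    if colorspace not in (0, 1):
        raise ValueError("colorspace must be 0 or 1")
    if not data.endswith(QOI_END_MARKER):
        raise ValueError("missing end marker")

    return {
        "width": width,
        "height": height,
        "channels": channels,
        "colorspace": colorspace,
    }
```

For a file on disk, the bytes can be read with `open(path, "rb").read()` and passed to `check_qoi`; a `ValueError` indicates which part of the container looks wrong before the image is dropped into the viewer.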
Online viewer: [QOI Viewer - The Brain Dump - GitHub Pages](https://floooh.github.io/qoiview/qoiview.html)

Original Page
![image](https://hackmd.io/_uploads/H1zzkgEvkx.png)

Drag and drop the image `dice.qoi` from `floooh`'s GitHub repository.
![image](https://hackmd.io/_uploads/ry7wJxVDJe.png)

You may download the QOI format image from `floooh`'s GitHub using the following link.
[qoiview](https://github.com/floooh/qoiview/tree/main)

## Reference

1. [QOI official website](https://qoiformat.org/)
2. [QOI specification](https://qoiformat.org/qoi-specification.pdf)
3. [Building an ENCODER for the "Quite OK Image Format" (QOI) - from YT](https://www.youtube.com/watch?v=GgsRQuGSrc0)
4. [Building an DECODER for QOI Images (Quite OK Image Format)](https://www.youtube.com/watch?v=5bWopQj-oQs&list=PLP29wDx6QmW4bMSK8a7rZnhPo4pjqVdQT)
5. [RVV in QEMU setting tutorial](https://github.com/brucehoult/rvv_example)
6. [libpng_rvv: A RISC-V Vector Optimized libpng](https://github.com/mschlaegl/libpng_rvv-doc)
7. [Simple RISC-V Vector example in 64 bit](https://github.com/brucehoult/rvv_example)
8. [QOI Viewer - The Brain Dump - GitHub Pages](https://floooh.github.io/qoiview/qoiview.html)
9. [qoiview](https://github.com/floooh/qoiview/tree/main)
10. [riscv-gnu-toolchain](https://github.com/riscv-collab/riscv-gnu-toolchain)