
RVV-accelerated Audio Codec

李宗翰, 洪靖睿

Introduction


What is RVV?

RVV-accelerated refers to leveraging the RISC-V Vector Extension (RVV) to speed up computations, particularly for tasks that involve large-scale data processing. RVV introduces hardware support for vectorized operations, allowing multiple data elements to be processed simultaneously in a single instruction, greatly improving performance in parallel workloads.

Vectorized Data Processing:

  • RVV operates on vectors (arrays of numbers) instead of single data points, enabling Single Instruction Multiple Data (SIMD) execution.

Variable-Length Vectors:

  • Unlike fixed-width SIMD, RVV supports dynamically configurable vector lengths, making it adaptable to different hardware implementations.

Dynamic Scalability:

  • Developers can write "vector-length agnostic" programs that automatically adjust to the hardware's capabilities without needing code changes.
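The vector-length-agnostic idea can be sketched in plain C: the strip-mining loop below models what `vsetvli` does in RVV, with `fake_vsetvl` (a made-up helper for this sketch) standing in for the hardware's reported vector length. The same loop works unchanged whatever maximum the "hardware" reports.

```c
#include <stddef.h>

/* Hypothetical stand-in for RVV's vsetvli: the hardware reports how many
   elements it will process this iteration, capped at its VLMAX. */
static size_t fake_vsetvl(size_t remaining) {
    const size_t VLMAX = 4; /* assumed hardware maximum for this sketch */
    return remaining < VLMAX ? remaining : VLMAX;
}

/* Strip-mined elementwise add: vector-length agnostic, since it only
   advances by whatever vl the "hardware" granted this pass. */
void vla_add(float *dst, const float *a, const float *b, size_t n) {
    while (n > 0) {
        size_t vl = fake_vsetvl(n);      /* elements handled this pass */
        for (size_t i = 0; i < vl; i++)  /* models one vector instruction */
            dst[i] = a[i] + b[i];
        dst += vl; a += vl; b += vl; n -= vl;
    }
}
```

On real RVV hardware the inner loop collapses into a handful of vector instructions, but the control structure stays exactly this shape.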

What is QOA?

QOA is the “Quite OK Audio Format” for fast, lossy audio compression.

QOA is fast. It decodes audio 3x faster than Ogg Vorbis, while offering better quality and compression (278 kbit/s for 44 kHz stereo) than ADPCM.

QOA is simple. The reference en-/decoder fits in about 400 lines of C. The file format specification is a single page PDF.

QOA file

A QOA file consists of an 8 byte file header, followed by a number of frames. Each frame contains an 8 byte frame header, the current 16 byte en-/decoder state per channel, and 256 slices per channel. Each slice is 8 bytes wide and encodes 20 samples of audio data.

All values, including the slices, are big endian. The file layout is as follows:

struct {
	struct {
		char     magic[4];         // magic bytes "qoaf"
		uint32_t samples;          // samples per channel in this file
	} file_header;

	struct {
		struct {
			uint8_t  num_channels; // no. of channels
			uint24_t samplerate;   // samplerate in hz
			uint16_t fsamples;     // samples per channel in this frame
			uint16_t fsize;        // frame size (includes this header)
		} frame_header;

		struct {
			int16_t history[4];    // most recent last
			int16_t weights[4];    // most recent last
		} lms_state[num_channels];

		qoa_slice_t slices[256][num_channels];

	} frames[ceil(samples / (256 * 20))];
} qoa_file_t;
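As a sanity check of this layout, the size of a completely filled frame and the resulting bitrate can be computed directly (a small sketch; the helper names are ours, not part of QOA):

```c
#include <math.h>

/* Size in bytes of one completely filled frame:
   8-byte frame header + 16 bytes of LMS state per channel
   + 256 slices of 8 bytes per channel. */
int qoa_full_frame_bytes(int channels) {
    return 8 + 16 * channels + 256 * 8 * channels;
}

/* Resulting bitrate in kbit/s for a stream of full frames. */
double qoa_bitrate_kbits(int channels, int samplerate) {
    int samples_per_channel = 256 * 20;  /* 256 slices x 20 samples each */
    double seconds = (double)samples_per_channel / samplerate;
    return qoa_full_frame_bytes(channels) * 8.0 / seconds / 1000.0;
}
```

For 44.1 kHz stereo this comes out to 4136 bytes per frame and roughly 285 kbit/s including all frame headers and LMS state, in the same ballpark as the ~278 kbit/s figure quoted above.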

Each qoa_slice_t contains a quantized scalefactor sf_quant and 20 quantized residuals qrNN:

.- QOA_SLICE -- 64 bits, 20 samples --------------------------/  /------------.
|        Byte[0]         |        Byte[1]         |  Byte[2]  \  \  Byte[7]   |
| 7  6  5  4  3  2  1  0 | 7  6  5  4  3  2  1  0 | 7  6  5   /  /    2  1  0 |
|------------+--------+--------+--------+---------+---------+-\  \--+---------|
|  sf_quant  |  qr00  |  qr01  |  qr02  |  qr03   |  qr04   | /  /  |  qr19   |
`-------------------------------------------------------------\  \------------`
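The bit packing above can be exercised with a short C sketch: 4 bits of sf_quant followed by twenty 3-bit residuals, most significant first (`slice_pack`/`slice_unpack` are illustrative names, not part of the reference codec):

```c
#include <stdint.h>

/* Pack sf_quant (4 bits) and 20 residuals (3 bits each) into one slice. */
uint64_t slice_pack(int sf_quant, const int qr[20]) {
    uint64_t slice = (uint64_t)(sf_quant & 0xf);
    for (int i = 0; i < 20; i++)
        slice = (slice << 3) | (uint64_t)(qr[i] & 0x7);
    return slice;
}

/* Unpack a slice; mirrors how the decoder consumes it, top bits first. */
void slice_unpack(uint64_t slice, int *sf_quant, int qr[20]) {
    *sf_quant = (int)((slice >> 60) & 0xf);
    for (int i = 0; i < 20; i++)
        qr[i] = (int)((slice >> (57 - 3 * i)) & 0x7);
}
```

After packing, sf_quant sits in bits 63..60 and qr00 in bits 59..57, matching the diagram.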

Frames and Slices:

  • Frames: 256 slices per channel, except the last frame (1-256 slices).
  • Last slice: still 8 bytes wide, with unused samples zeroed.

Channel Interleaving:

  • Channels alternate per slice (e.g., stereo: L, R, L, R).

Valid QOA File:

  • At least 1 frame, 1 channel, 1 sample.
  • Samplerate: 1 to 16,777,215.
  • Static files: Consistent channels and samplerate.
  • Streaming: Samples field 0x00000000, variable channels/samplerate.

Decoder:

  • Supports up to 8 channels with predefined layouts.

Prediction:

  • Uses an LMS filter to predict samples; residuals correct the prediction to produce the output.

Installing the RISC-V Vector Extension Toolchain (64-bit)

  1. Install the required packages:
sudo apt update
sudo apt install autoconf automake autotools-dev curl libmpc-dev libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev
  2. Clone the riscv-gnu-toolchain repository from the official GitHub repo and enter the folder riscv-gnu-toolchain:
git clone https://github.com/riscv-collab/riscv-gnu-toolchain.git
cd riscv-gnu-toolchain
  3. Configure the RISC-V toolchain (e.g., the GCC compiler):
./configure --prefix=/opt/riscv --with-arch=rv64gcv --with-abi=lp64d
  4. Run make linux to build and install the Linux cross-compiler for RISC-V (takes around 2 hours):
make linux
  5. Add the installed RISC-V toolchain (located in /opt/riscv/bin) to the system's PATH environment variable:
echo "export PATH=/opt/riscv/bin:\$PATH" >> ~/.bashrc
source ~/.bashrc
  6. Install QEMU:
sudo apt install qemu qemu-user qemu-system-misc
sudo apt install gcc-riscv64-unknown-elf
  7. Check QEMU and the riscv-gnu-toolchain:
riscv64-unknown-linux-gnu-gcc --version
riscv64-unknown-elf-gcc --version
qemu-system-riscv64 --version
  8. Write a test file for the vector extension:
    vec.S
    # void vec_len_rvv(float *r, struct pt *pts, int n)

    #define r a0
    #define pts a1
    #define n a2
    #define vl a3
    #define Xs v0
    #define Ys v1
    #define Zs v2
    #define lens v3
    
    .globl vec_len_rvv
vec_len_rvv:
    # 32 bit elements, don't care (Agnostic) how tail and mask are handled
    vsetvli vl, n, e32, ta,ma
    vlseg3e32.v Xs, (pts) # loads interleaved Xs, Ys, Zs into 3 registers
    vfmul.vv lens, Xs, Xs
    vfmacc.vv lens, Ys, Ys
    vfmacc.vv lens, Zs, Zs
    vfsqrt.v lens, lens
    vse32.v lens, (r)
    sub n, n, vl
    sh2add r, vl, r # bump r ptr 4 bytes per float
    sh1add vl, vl, vl # multiply vl by 3 floats per point
    sh2add pts, vl, pts # bump v ptr 4 bytes per float (12 per pt)
    bnez n, vec_len_rvv
    ret

main.c

#include <stdio.h>
#include <math.h>

struct pt {
  float x;
  float y;
  float z;
};

void vec_len_rvv(float *r, struct pt *v, int n);

void vec_len(float *r, struct pt *v, int n){
  for (int i=0; i<n; ++i){
    struct pt p = v[i];
    r[i] = sqrtf(p.x*p.x + p.y*p.y + p.z*p.z);
  }
}

#define N 6
struct pt v[N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}, {13, 14, 15}, {16, 17, 18}};

int main(){
  float lens[N], lens_rvv[N];
  
  vec_len(lens, v, N);
  vec_len_rvv(lens_rvv, v, N);

  for (int i=0; i<N; ++i){
    printf("%f %f\n", lens[i], lens_rvv[i]);
  }
  return 0;
}
  9. Compile and run it:
riscv64-unknown-elf-gcc -O main.c vec.S -o main -march=rv64gcv_zba -lm
qemu-riscv64 -cpu rv64,v=true,zba=true,vlen=128 ./main

Output:

3.741657 3.741657
8.774964 8.774964
13.928389 13.928389
19.104973 19.104973
24.289915 24.289915
29.478806 29.478806

Baseline implementation

qoa.h

#ifndef QOA_H
#define QOA_H

#ifdef __cplusplus
extern "C" {
#endif

#define QOA_MIN_FILESIZE 16
#define QOA_MAX_CHANNELS 8

#define QOA_SLICE_LEN 20
#define QOA_SLICES_PER_FRAME 256
#define QOA_FRAME_LEN (QOA_SLICES_PER_FRAME * QOA_SLICE_LEN)
#define QOA_LMS_LEN 4
#define QOA_MAGIC 0x716f6166 /* 'qoaf' */

#define QOA_FRAME_SIZE(channels, slices) \
	(8 + QOA_LMS_LEN * 4 * channels + 8 * slices * channels)

typedef struct {
	int history[QOA_LMS_LEN];
	int weights[QOA_LMS_LEN];
} qoa_lms_t;

typedef struct {
	unsigned int channels;
	unsigned int samplerate;
	unsigned int samples;
	qoa_lms_t lms[QOA_MAX_CHANNELS];
	#ifdef QOA_RECORD_TOTAL_ERROR
		double error;
	#endif
} qoa_desc;

unsigned int qoa_encode_header(qoa_desc *qoa, unsigned char *bytes);
unsigned int qoa_encode_frame(const short *sample_data, qoa_desc *qoa, unsigned int frame_len, unsigned char *bytes);
void *qoa_encode(const short *sample_data, qoa_desc *qoa, unsigned int *out_len);

unsigned int qoa_max_frame_size(qoa_desc *qoa);
unsigned int qoa_decode_header(const unsigned char *bytes, int size, qoa_desc *qoa);
unsigned int qoa_decode_frame(const unsigned char *bytes, unsigned int size, qoa_desc *qoa, short *sample_data, unsigned int *frame_len);
short *qoa_decode(const unsigned char *bytes, int size, qoa_desc *qoa);

#ifndef QOA_NO_STDIO

int qoa_write(const char *filename, const short *sample_data, qoa_desc *qoa);
void *qoa_read(const char *filename, qoa_desc *qoa);

#endif /* QOA_NO_STDIO */


#ifdef __cplusplus
}
#endif
#endif /* QOA_H */


/* -----------------------------------------------------------------------------
	Implementation */

#ifdef QOA_IMPLEMENTATION
#include <stdlib.h>

#ifndef QOA_MALLOC
	#define QOA_MALLOC(sz) malloc(sz)
	#define QOA_FREE(p) free(p)
#endif

typedef unsigned long long qoa_uint64_t;


/* The quant_tab provides an index into the dequant_tab for residuals in the
range of -8 .. 8. It maps this range to just 3bits and becomes less accurate at 
the higher end. Note that the residual zero is identical to the lowest positive 
value. This is mostly fine, since the qoa_div() function always rounds away 
from zero. */

static const int qoa_quant_tab[17] = {
	7, 7, 7, 5, 5, 3, 3, 1, /* -8..-1 */
	0,                      /*  0     */
	0, 2, 2, 4, 4, 6, 6, 6  /*  1.. 8 */
};


/* We have 16 different scalefactors. Like the quantized residuals these become
less accurate at the higher end. In theory, the highest scalefactor that we
would need to encode the highest 16bit residual is (2**16)/8 = 8192. However we
rely on the LMS filter to predict samples accurately enough that a maximum 
residual of one quarter of the 16 bit range is sufficient. I.e. with the 
scalefactor 2048 times the quant range of 8 we can encode residuals up to 2**14.

The scalefactor values are computed as:
scalefactor_tab[s] <- round(pow(s + 1, 2.75)) */

static const int qoa_scalefactor_tab[16] = {
	1, 7, 21, 45, 84, 138, 211, 304, 421, 562, 731, 928, 1157, 1419, 1715, 2048
};


/* The reciprocal_tab maps each of the 16 scalefactors to their rounded 
reciprocals 1/scalefactor. This allows us to calculate the scaled residuals in 
the encoder with just one multiplication instead of an expensive division. We 
do this in .16 fixed point with integers, instead of floats.

The reciprocal_tab is computed as:
reciprocal_tab[s] <- ((1<<16) + scalefactor_tab[s] - 1) / scalefactor_tab[s] */

static const int qoa_reciprocal_tab[16] = {
	65536, 9363, 3121, 1457, 781, 475, 311, 216, 156, 117, 90, 71, 57, 47, 39, 32
};


/* The dequant_tab maps each of the scalefactors and quantized residuals to 
their unscaled & dequantized version.

Since qoa_div rounds away from the zero, the smallest entries are mapped to 3/4
instead of 1. The dequant_tab assumes the following dequantized values for each 
of the quant_tab indices and is computed as:
float dqt[8] = {0.75, -0.75, 2.5, -2.5, 4.5, -4.5, 7, -7};
dequant_tab[s][q] <- round_ties_away_from_zero(scalefactor_tab[s] * dqt[q])

The rounding employed here is "to nearest, ties away from zero",  i.e. positive
and negative values are treated symmetrically.
*/

static const int qoa_dequant_tab[16][8] = {
	{   1,    -1,    3,    -3,    5,    -5,     7,     -7},
	{   5,    -5,   18,   -18,   32,   -32,    49,    -49},
	{  16,   -16,   53,   -53,   95,   -95,   147,   -147},
	{  34,   -34,  113,  -113,  203,  -203,   315,   -315},
	{  63,   -63,  210,  -210,  378,  -378,   588,   -588},
	{ 104,  -104,  345,  -345,  621,  -621,   966,   -966},
	{ 158,  -158,  528,  -528,  950,  -950,  1477,  -1477},
	{ 228,  -228,  760,  -760, 1368, -1368,  2128,  -2128},
	{ 316,  -316, 1053, -1053, 1895, -1895,  2947,  -2947},
	{ 422,  -422, 1405, -1405, 2529, -2529,  3934,  -3934},
	{ 548,  -548, 1828, -1828, 3290, -3290,  5117,  -5117},
	{ 696,  -696, 2320, -2320, 4176, -4176,  6496,  -6496},
	{ 868,  -868, 2893, -2893, 5207, -5207,  8099,  -8099},
	{1064, -1064, 3548, -3548, 6386, -6386,  9933,  -9933},
	{1286, -1286, 4288, -4288, 7718, -7718, 12005, -12005},
	{1536, -1536, 5120, -5120, 9216, -9216, 14336, -14336},
};


/* The Least Mean Squares Filter is the heart of QOA. It predicts the next
sample based on the previous 4 reconstructed samples. It does so by continuously
adjusting 4 weights based on the residual of the previous prediction.

The next sample is predicted as the sum of (weight[i] * history[i]).

The adjustment of the weights is done with a "Sign-Sign-LMS" that adds or
subtracts the residual to each weight, based on the corresponding sample from 
the history. This, surprisingly, is sufficient to get worthwhile predictions.

This is all done with fixed point integers. Hence the right-shifts when updating
the weights and calculating the prediction. */

static int qoa_lms_predict(qoa_lms_t *lms) {
	int prediction = 0;
	for (int i = 0; i < QOA_LMS_LEN; i++) {
		prediction += lms->weights[i] * lms->history[i];
	}
	return prediction >> 13;
}

static void qoa_lms_update(qoa_lms_t *lms, int sample, int residual) {
	int delta = residual >> 4;
	for (int i = 0; i < QOA_LMS_LEN; i++) {
		lms->weights[i] += lms->history[i] < 0 ? -delta : delta;
	}

	for (int i = 0; i < QOA_LMS_LEN-1; i++) {
		lms->history[i] = lms->history[i+1];
	}
	lms->history[QOA_LMS_LEN-1] = sample;
}


/* qoa_div() implements a rounding division, but avoids rounding to zero for 
small numbers. E.g. 0.1 will be rounded to 1. Note that 0 itself still 
returns as 0, which is handled in the qoa_quant_tab[].
qoa_div() takes an index into the .16 fixed point qoa_reciprocal_tab as an
argument, so it can do the division with a cheaper integer multiplication. */

static inline int qoa_div(int v, int scalefactor) {
	int reciprocal = qoa_reciprocal_tab[scalefactor];
	int n = (v * reciprocal + (1 << 15)) >> 16;
	n = n + ((v > 0) - (v < 0)) - ((n > 0) - (n < 0)); /* round away from 0 */
	return n;
}

static inline int qoa_clamp(int v, int min, int max) {
	if (v < min) { return min; }
	if (v > max) { return max; }
	return v;
}

/* This specialized clamp function for the signed 16 bit range improves decode
performance quite a bit. The extra if() statement works nicely with the CPUs
branch prediction as this branch is rarely taken. */

static inline int qoa_clamp_s16(int v) {
	if ((unsigned int)(v + 32768) > 65535) {
		if (v < -32768) { return -32768; }
		if (v >  32767) { return  32767; }
	}
	return v;
}

static inline qoa_uint64_t qoa_read_u64(const unsigned char *bytes, unsigned int *p) {
	bytes += *p;
	*p += 8;
	return 
		((qoa_uint64_t)(bytes[0]) << 56) | ((qoa_uint64_t)(bytes[1]) << 48) |
		((qoa_uint64_t)(bytes[2]) << 40) | ((qoa_uint64_t)(bytes[3]) << 32) |
		((qoa_uint64_t)(bytes[4]) << 24) | ((qoa_uint64_t)(bytes[5]) << 16) |
		((qoa_uint64_t)(bytes[6]) <<  8) | ((qoa_uint64_t)(bytes[7]) <<  0);
}

static inline void qoa_write_u64(qoa_uint64_t v, unsigned char *bytes, unsigned int *p) {
	bytes += *p;
	*p += 8;
	bytes[0] = (v >> 56) & 0xff;
	bytes[1] = (v >> 48) & 0xff;
	bytes[2] = (v >> 40) & 0xff;
	bytes[3] = (v >> 32) & 0xff;
	bytes[4] = (v >> 24) & 0xff;
	bytes[5] = (v >> 16) & 0xff;
	bytes[6] = (v >>  8) & 0xff;
	bytes[7] = (v >>  0) & 0xff;
}


/* -----------------------------------------------------------------------------
	Encoder */

unsigned int qoa_encode_header(qoa_desc *qoa, unsigned char *bytes) {
	unsigned int p = 0;
	qoa_write_u64(((qoa_uint64_t)QOA_MAGIC << 32) | qoa->samples, bytes, &p);
	return p;
}

unsigned int qoa_encode_frame(const short *sample_data, qoa_desc *qoa, unsigned int frame_len, unsigned char *bytes) {
	unsigned int channels = qoa->channels;

	unsigned int p = 0;
	unsigned int slices = (frame_len + QOA_SLICE_LEN - 1) / QOA_SLICE_LEN;
	unsigned int frame_size = QOA_FRAME_SIZE(channels, slices);
	int prev_scalefactor[QOA_MAX_CHANNELS] = {0};

	/* Write the frame header */
	qoa_write_u64((
		(qoa_uint64_t)qoa->channels   << 56 |
		(qoa_uint64_t)qoa->samplerate << 32 |
		(qoa_uint64_t)frame_len       << 16 |
		(qoa_uint64_t)frame_size
	), bytes, &p);

	
	for (unsigned int c = 0; c < channels; c++) {
		/* Write the current LMS state */
		qoa_uint64_t weights = 0;
		qoa_uint64_t history = 0;
		for (int i = 0; i < QOA_LMS_LEN; i++) {
			history = (history << 16) | (qoa->lms[c].history[i] & 0xffff);
			weights = (weights << 16) | (qoa->lms[c].weights[i] & 0xffff);
		}
		qoa_write_u64(history, bytes, &p);
		qoa_write_u64(weights, bytes, &p);
	}

	/* We encode all samples with the channels interleaved on a slice level.
	E.g. for stereo: (ch-0, slice 0), (ch 1, slice 0), (ch 0, slice 1), ...*/
	for (unsigned int sample_index = 0; sample_index < frame_len; sample_index += QOA_SLICE_LEN) {

		for (unsigned int c = 0; c < channels; c++) {
			int slice_len = qoa_clamp(QOA_SLICE_LEN, 0, frame_len - sample_index);
			int slice_start = sample_index * channels + c;
			int slice_end = (sample_index + slice_len) * channels + c;			

			/* Brute force search for the best scalefactor. Just go through all
			16 scalefactors, encode all samples for the current slice and 
			measure the total squared error. */
			qoa_uint64_t best_rank = -1;
			#ifdef QOA_RECORD_TOTAL_ERROR
				qoa_uint64_t best_error = -1;
			#endif
			qoa_uint64_t best_slice = 0;
			qoa_lms_t best_lms;
			int best_scalefactor = 0;

			for (int sfi = 0; sfi < 16; sfi++) {
				/* There is a strong correlation between the scalefactors of
				neighboring slices. As an optimization, start testing
				the best scalefactor of the previous slice first. */
				int scalefactor = (sfi + prev_scalefactor[c]) % 16;

				/* We have to reset the LMS state to the last known good one
				before trying each scalefactor, as each pass updates the LMS
				state when encoding. */
				qoa_lms_t lms = qoa->lms[c];
				qoa_uint64_t slice = scalefactor;
				qoa_uint64_t current_rank = 0;
				#ifdef QOA_RECORD_TOTAL_ERROR
					qoa_uint64_t current_error = 0;
				#endif

				for (int si = slice_start; si < slice_end; si += channels) {
					int sample = sample_data[si];
					int predicted = qoa_lms_predict(&lms);

					int residual = sample - predicted;
					int scaled = qoa_div(residual, scalefactor);
					int clamped = qoa_clamp(scaled, -8, 8);
					int quantized = qoa_quant_tab[clamped + 8];
					int dequantized = qoa_dequant_tab[scalefactor][quantized];
					int reconstructed = qoa_clamp_s16(predicted + dequantized);


					/* If the weights have grown too large, we introduce a penalty
					here. This prevents pops/clicks in certain problem cases */
					int weights_penalty = ((
						lms.weights[0] * lms.weights[0] + 
						lms.weights[1] * lms.weights[1] + 
						lms.weights[2] * lms.weights[2] + 
						lms.weights[3] * lms.weights[3]
					) >> 18) - 0x8ff;
					if (weights_penalty < 0) {
						weights_penalty = 0;
					}

					long long error = (sample - reconstructed);
					qoa_uint64_t error_sq = error * error;

					current_rank += error_sq + weights_penalty * weights_penalty;
					#ifdef QOA_RECORD_TOTAL_ERROR
						current_error += error_sq;
					#endif
					if (current_rank > best_rank) {
						break;
					}

					qoa_lms_update(&lms, reconstructed, dequantized);
					slice = (slice << 3) | quantized;
				}

				if (current_rank < best_rank) {
					best_rank = current_rank;
					#ifdef QOA_RECORD_TOTAL_ERROR
						best_error = current_error;
					#endif
					best_slice = slice;
					best_lms = lms;
					best_scalefactor = scalefactor;
				}
			}

			prev_scalefactor[c] = best_scalefactor;

			qoa->lms[c] = best_lms;
			#ifdef QOA_RECORD_TOTAL_ERROR
				qoa->error += best_error;
			#endif

			/* If this slice was shorter than QOA_SLICE_LEN, we have to left-
			shift all encoded data, to ensure the rightmost bits are the empty
			ones. This should only happen in the last frame of a file as all
			slices are completely filled otherwise. */
			best_slice <<= (QOA_SLICE_LEN - slice_len) * 3;
			qoa_write_u64(best_slice, bytes, &p);
		}
	}
	
	return p;
}

void *qoa_encode(const short *sample_data, qoa_desc *qoa, unsigned int *out_len) {
	if (
		qoa->samples == 0 || 
		qoa->samplerate == 0 || qoa->samplerate > 0xffffff ||
		qoa->channels == 0 || qoa->channels > QOA_MAX_CHANNELS
	) {
		return NULL;
	}

	/* Calculate the encoded size and allocate */
	unsigned int num_frames = (qoa->samples + QOA_FRAME_LEN-1) / QOA_FRAME_LEN;
	unsigned int num_slices = (qoa->samples + QOA_SLICE_LEN-1) / QOA_SLICE_LEN;
	unsigned int encoded_size = 8 +                    /* 8 byte file header */
		num_frames * 8 +                               /* 8 byte frame headers */
		num_frames * QOA_LMS_LEN * 4 * qoa->channels + /* 4 * 4 bytes lms state per channel */
		num_slices * 8 * qoa->channels;                /* 8 byte slices */

	unsigned char *bytes = QOA_MALLOC(encoded_size);

	for (unsigned int c = 0; c < qoa->channels; c++) {
		/* Set the initial LMS weights to {0, 0, -1, 2}. This helps with the 
		prediction of the first few ms of a file. */
		qoa->lms[c].weights[0] = 0;
		qoa->lms[c].weights[1] = 0;
		qoa->lms[c].weights[2] = -(1<<13);
		qoa->lms[c].weights[3] =  (1<<14);

		/* Explicitly set the history samples to 0, as we might have some
		garbage in there. */
		for (int i = 0; i < QOA_LMS_LEN; i++) {
			qoa->lms[c].history[i] = 0;
		}
	}


	/* Encode the header and go through all frames */
	unsigned int p = qoa_encode_header(qoa, bytes);
	#ifdef QOA_RECORD_TOTAL_ERROR
		qoa->error = 0;
	#endif

	int frame_len = QOA_FRAME_LEN;
	for (unsigned int sample_index = 0; sample_index < qoa->samples; sample_index += frame_len) {
		frame_len = qoa_clamp(QOA_FRAME_LEN, 0, qoa->samples - sample_index);		
		const short *frame_samples = sample_data + sample_index * qoa->channels;
		unsigned int frame_size = qoa_encode_frame(frame_samples, qoa, frame_len, bytes + p);
		p += frame_size;
	}

	*out_len = p;
	return bytes;
}



/* -----------------------------------------------------------------------------
	Decoder */

unsigned int qoa_max_frame_size(qoa_desc *qoa) {
	return QOA_FRAME_SIZE(qoa->channels, QOA_SLICES_PER_FRAME);
}

unsigned int qoa_decode_header(const unsigned char *bytes, int size, qoa_desc *qoa) {
	unsigned int p = 0;
	if (size < QOA_MIN_FILESIZE) {
		return 0;
	}


	/* Read the file header, verify the magic number ('qoaf') and read the 
	total number of samples. */
	qoa_uint64_t file_header = qoa_read_u64(bytes, &p);

	if ((file_header >> 32) != QOA_MAGIC) {
		return 0;
	}

	qoa->samples = file_header & 0xffffffff;
	if (!qoa->samples) {
		return 0;
	}

	/* Peek into the first frame header to get the number of channels and
	the samplerate. */
	qoa_uint64_t frame_header = qoa_read_u64(bytes, &p);
	qoa->channels   = (frame_header >> 56) & 0x0000ff;
	qoa->samplerate = (frame_header >> 32) & 0xffffff;

	if (qoa->channels == 0 || qoa->samples == 0 || qoa->samplerate == 0) {
		return 0;
	}

	return 8;
}

unsigned int qoa_decode_frame(const unsigned char *bytes, unsigned int size, qoa_desc *qoa, short *sample_data, unsigned int *frame_len) {
	unsigned int p = 0;
	*frame_len = 0;

	if (size < 8 + QOA_LMS_LEN * 4 * qoa->channels) {
		return 0;
	}

	/* Read and verify the frame header */
	qoa_uint64_t frame_header = qoa_read_u64(bytes, &p);
	unsigned int channels   = (frame_header >> 56) & 0x0000ff;
	unsigned int samplerate = (frame_header >> 32) & 0xffffff;
	unsigned int samples    = (frame_header >> 16) & 0x00ffff;
	unsigned int frame_size = (frame_header      ) & 0x00ffff;

	unsigned int data_size = frame_size - 8 - QOA_LMS_LEN * 4 * channels;
	unsigned int num_slices = data_size / 8;
	unsigned int max_total_samples = num_slices * QOA_SLICE_LEN;

	if (
		channels != qoa->channels || 
		samplerate != qoa->samplerate ||
		frame_size > size ||
		samples * channels > max_total_samples
	) {
		return 0;
	}


	/* Read the LMS state: 4 x 2 bytes history, 4 x 2 bytes weights per channel */
	for (unsigned int c = 0; c < channels; c++) {
		qoa_uint64_t history = qoa_read_u64(bytes, &p);
		qoa_uint64_t weights = qoa_read_u64(bytes, &p);
		for (int i = 0; i < QOA_LMS_LEN; i++) {
			qoa->lms[c].history[i] = ((signed short)(history >> 48));
			history <<= 16;
			qoa->lms[c].weights[i] = ((signed short)(weights >> 48));
			weights <<= 16;
		}
	}


	/* Decode all slices for all channels in this frame */
	for (unsigned int sample_index = 0; sample_index < samples; sample_index += QOA_SLICE_LEN) {
		for (unsigned int c = 0; c < channels; c++) {
			qoa_uint64_t slice = qoa_read_u64(bytes, &p);

			int scalefactor = (slice >> 60) & 0xf;
			slice <<= 4;

			int slice_start = sample_index * channels + c;
			int slice_end = qoa_clamp(sample_index + QOA_SLICE_LEN, 0, samples) * channels + c;

			for (int si = slice_start; si < slice_end; si += channels) {
				int predicted = qoa_lms_predict(&qoa->lms[c]);
				int quantized = (slice >> 61) & 0x7;
				int dequantized = qoa_dequant_tab[scalefactor][quantized];
				int reconstructed = qoa_clamp_s16(predicted + dequantized);
				
				sample_data[si] = reconstructed;
				slice <<= 3;

				qoa_lms_update(&qoa->lms[c], reconstructed, dequantized);
			}
		}
	}

	*frame_len = samples;
	return p;
}

short *qoa_decode(const unsigned char *bytes, int size, qoa_desc *qoa) {
	unsigned int p = qoa_decode_header(bytes, size, qoa);
	if (!p) {
		return NULL;
	}

	/* Calculate the required size of the sample buffer and allocate */
	int total_samples = qoa->samples * qoa->channels;
	short *sample_data = QOA_MALLOC(total_samples * sizeof(short));

	unsigned int sample_index = 0;
	unsigned int frame_len;
	unsigned int frame_size;

	/* Decode all frames */
	do {
		short *sample_ptr = sample_data + sample_index * qoa->channels;
		frame_size = qoa_decode_frame(bytes + p, size - p, qoa, sample_ptr, &frame_len);

		p += frame_size;
		sample_index += frame_len;
	} while (frame_size && sample_index < qoa->samples);

	qoa->samples = sample_index;
	return sample_data;
}



/* -----------------------------------------------------------------------------
	File read/write convenience functions */

#ifndef QOA_NO_STDIO
#include <stdio.h>

int qoa_write(const char *filename, const short *sample_data, qoa_desc *qoa) {
	FILE *f = fopen(filename, "wb");
	unsigned int size;
	void *encoded;

	if (!f) {
		return 0;
	}

	encoded = qoa_encode(sample_data, qoa, &size);
	if (!encoded) {
		fclose(f);
		return 0;
	}

	fwrite(encoded, 1, size, f);
	fclose(f);

	QOA_FREE(encoded);
	return size;
}

void *qoa_read(const char *filename, qoa_desc *qoa) {
	FILE *f = fopen(filename, "rb");
	int size, bytes_read;
	void *data;
	short *sample_data;

	if (!f) {
		return NULL;
	}

	fseek(f, 0, SEEK_END);
	size = ftell(f);
	if (size <= 0) {
		fclose(f);
		return NULL;
	}
	fseek(f, 0, SEEK_SET);

	data = QOA_MALLOC(size);
	if (!data) {
		fclose(f);
		return NULL;
	}

	bytes_read = fread(data, 1, size, f);
	fclose(f);

	sample_data = qoa_decode(data, bytes_read, qoa);
	QOA_FREE(data);
	return sample_data;
}

#endif /* QOA_NO_STDIO */
#endif /* QOA_IMPLEMENTATION */

QOA decode

In the decode function, we utilize qoa_read_u64, qoa_lms_predict, qoa_clamp_s16, and qoa_lms_update. Accordingly, we optimized these functions to accelerate decoding and analyzed their efficiency against the original implementation.

Performance Optimization Analysis of qoa_clamp_s16

In this document, we compare the original implementation of the qoa_clamp_s16 function with its optimized version. Our goal is to demonstrate how the optimized method improves efficiency while maintaining functional correctness.

Original Method

static inline int qoa_clamp_s16(int v) {
    if ((unsigned int)(v + 32768) > 65535) {
        if (v < -32768) { return -32768; }
        if (v >  32767) { return  32767; }
    }
    return v;
}

In the original method, when the unsigned range check evaluates as true, the function performs two additional comparisons to determine the clamped value. This introduces multiple branches in the code.

The optimized method expresses the clamp as a single ternary chain. Compilers typically lower this pattern to branchless min/max (conditional-move) instructions, so out-of-range inputs no longer take mispredicted branches.

Optimized method

static inline int qoa_clamp_s16_optimized(int v) {
	return (v < -32768) ? -32768 : (v > 32767 ? 32767 : v);
}
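Before benchmarking, it is worth confirming the two versions are bit-identical; a small harness (ours, not part of QOA) compares them over a range wider than the int16 domain:

```c
static inline int qoa_clamp_s16(int v) {
    if ((unsigned int)(v + 32768) > 65535) {
        if (v < -32768) { return -32768; }
        if (v >  32767) { return  32767; }
    }
    return v;
}

static inline int qoa_clamp_s16_optimized(int v) {
    return (v < -32768) ? -32768 : (v > 32767 ? 32767 : v);
}

/* Returns 1 if both clamps agree on every value in [-100000, 100000]. */
int clamp_versions_agree(void) {
    for (int v = -100000; v <= 100000; v++)
        if (qoa_clamp_s16(v) != qoa_clamp_s16_optimized(v)) return 0;
    return 1;
}
```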

Compare Execution Time

[Execution-time comparison chart]

RVV-accelerated method

.globl qoa_clamp_s16_rvv

qoa_clamp_s16_rvv:
    li t0, 32767               # Upper limit 32767
    li t1, -32768              # Lower limit -32768

1:
    # Re-set the vector length each iteration from the remaining count,
    # so the final (partial) pass does not read or write past the arrays
    vsetvli t2, a2, e32        # Set vector length for 32-bit elements
    vle32.v v0, (a0)           # Load source array into vector register v0
    vmax.vx v1, v0, t1         # v1 = max(v0, -32768) (clamp to lower limit)
    vmin.vx v2, v1, t0         # v2 = min(v1, 32767) (clamp to upper limit)
    vse32.v v2, (a1)           # Store result from v2 to destination array

    slli t3, t2, 2             # t3 = t2 * 4 bytes (each element is 4 bytes)
    add a0, a0, t3             # Update source array pointer
    add a1, a1, t3             # Update destination array pointer
    sub a2, a2, t2             # Update remaining length
    bnez a2, 1b                # If remaining length is not zero, repeat loop

    ret                        # Return

Compare Execution Time with RVV

[Execution-time comparison chart]
We can observe that the RVV-accelerated method is faster than the original method.

Performance Optimization Analysis of qoa_lms_predict

In the original approach, the loop introduces runtime overhead from iteration control and indexing. When the function is invoked thousands of times per second, these overheads accumulate and noticeably impact performance.

Since QOA_LMS_LEN is a fixed constant, the optimized approach unrolls the loop explicitly. This removes the loop-control overhead and lets the compiler schedule the four multiply-accumulates freely, delivering faster and more predictable performance.

Original method

static int qoa_lms_predict(qoa_lms_t *lms) {
	int prediction = 0;
	for (int i = 0; i < QOA_LMS_LEN; i++) {
		prediction += lms->weights[i] * lms->history[i];
	}
	return prediction >> 13;
}

Optimized method

static int qoa_lms_predict(qoa_lms_t *lms) {
	return (
		lms->weights[0] * lms->history[0] +
		lms->weights[1] * lms->history[1] +
		lms->weights[2] * lms->history[2] +
		lms->weights[3] * lms->history[3]
	) >> 13;
}

Compare Execution Time with C

[Execution-time comparison chart]

RVV-accelerated method

.section .text
.globl qoa_lms_predict_rvv

qoa_lms_predict_rvv:
    li t0, 4                # Vector length request: QOA_LMS_LEN = 4
    # a0 holds the address of the qoa_lms_t structure:
    # history[4] lives at offset 0, weights[4] at offset 16 (4 x 4-byte ints)
    addi t3, a0, 0          # t3 = address of lms->history
    addi t4, a0, 16         # t4 = address of lms->weights

    # Configure vector registers
    vsetvli t5, t0, e32     # 4 elements with an element size of 32 bits
    vle32.v v0, (t3)        # Load history into vector register v0
    vle32.v v1, (t4)        # Load weights into vector register v1

    # Perform the dot product
    vmul.vv v2, v0, v1      # v2 = element-wise product of history and weights
    vmv.s.x v3, zero        # Zero the accumulator for the reduction
    vredsum.vs v3, v2, v3   # v3[0] = sum of all elements of v2

    # Extract the result into t1
    vmv.x.s t1, v3          # Copy the scalar sum from v3 to t1

    # Right-shift by 13 bits to scale the fixed-point result
    srai t1, t1, 13

    # Return the prediction
    mv a0, t1
    ret

Compare Execution Time with RVV

[Execution-time comparison chart]

A statistical model of the measured execution times is shown above.

Performance Optimization Analysis of qoa_lms_update

Original method

static void qoa_lms_update(qoa_lms_t *lms, int sample, int residual) {
	int delta = residual >> 4;
	for (int i = 0; i < QOA_LMS_LEN; i++) {
		lms->weights[i] += lms->history[i] < 0 ? -delta : delta;
	}

	for (int i = 0; i < QOA_LMS_LEN-1; i++) {
		lms->history[i] = lms->history[i+1];
	}
	lms->history[QOA_LMS_LEN-1] = sample;
}

Optimized method

static void qoa_lms_update(qoa_lms_t *lms, int sample, int residual) {
    int delta = residual >> 4;

    // Update weights (unrolled loop for fixed QOA_LMS_LEN = 4)
    lms->weights[0] += lms->history[0] < 0 ? -delta : delta;
    lms->weights[1] += lms->history[1] < 0 ? -delta : delta;
    lms->weights[2] += lms->history[2] < 0 ? -delta : delta;
    lms->weights[3] += lms->history[3] < 0 ? -delta : delta;

    // Update history (unrolled loop for fixed QOA_LMS_LEN = 4)
    lms->history[0] = lms->history[1];
    lms->history[1] = lms->history[2];
    lms->history[2] = lms->history[3];
    lms->history[3] = sample;
}

Compare Execution Time with C

image

A statistical model is employed to illustrate the output.

We can observe that the optimized method is faster than the original one.

RVV_Accelerated method

.section .text
.globl qoa_lms_update_rvv
qoa_lms_update_rvv:
    # a0 = pointer to lms structure, a1 = sample, a2 = residual
    # Set vector length to QOA_LMS_LEN (assuming QOA_LMS_LEN = 4)
    li      t0, 4                   # t0 = QOA_LMS_LEN
    vsetvli t0, t0, e32, m1, ta, ma # Set vector length to 4, with 32-bit elements

    # delta = residual >> 4
    srai t1, a2, 4                  # t1 = residual >> 4

    # Load weights and history
    vle32.v v1, (a0)                # Load weights into vector register v1
    addi t3, a0, 16                 # t3 points to the history (assuming history follows weights in memory)
    vle32.v v2, (t3)                # Load history into vector register v2

    # Broadcast delta and calculate -delta
    vmv.v.x v3, t1                  # Broadcast delta to all elements in vector register v3
    vneg.v v4, v3                   # Calculate -delta and store in vector register v4

    # Compare history < 0, generate mask
    vmslt.vx v0, v2, zero           # Mask register v0 = (v2 < 0); vmslt has no .vi form

    # Select -delta or delta based on mask
    vmerge.vvm v5, v3, v4, v0       # v5 = (v2 < 0) ? -delta : delta

    # Update weights
    vadd.vv v1, v1, v5              # v1 = v1 + v5

    # Store updated weights
    vse32.v v1, (a0)                # Save v1 back to weights in memory

    # Shift history array (loop unrolling because QOA_LMS_LEN = 4 is small)
    lw t2, 20(a0)                   # t2 = lms->history[1]
    sw t2, 16(a0)                   # lms->history[0] = lms->history[1]

    lw t2, 24(a0)                   # t2 = lms->history[2]
    sw t2, 20(a0)                   # lms->history[1] = lms->history[2]

    lw t2, 28(a0)                   # t2 = lms->history[3]
    sw t2, 24(a0)                   # lms->history[2] = lms->history[3]

    # Set lms->history[QOA_LMS_LEN - 1] = sample
    sw a1, 28(a0)                   # lms->history[3] = sample

    # Return
    ret

Compare Execution Time with RVV

image

We can observe that the RVV-accelerated method is faster than the original one.

Performance Optimization Analysis of qoa_read_u64

We also tried to accelerate the qoa_read_u64 function, but in our tests its execution time turned out slightly longer than the original's.

Original method

static inline qoa_uint64_t qoa_read_u64(const unsigned char *bytes, unsigned int *p) {
	bytes += *p;
	*p += 8;
	return 
		((qoa_uint64_t)(bytes[0]) << 56) | ((qoa_uint64_t)(bytes[1]) << 48) |
		((qoa_uint64_t)(bytes[2]) << 40) | ((qoa_uint64_t)(bytes[3]) << 32) |
		((qoa_uint64_t)(bytes[4]) << 24) | ((qoa_uint64_t)(bytes[5]) << 16) |
		((qoa_uint64_t)(bytes[6]) <<  8) | ((qoa_uint64_t)(bytes[7]) <<  0);
}

RVV_Accelerated method

.section .text
.globl qoa_read_u64_rvv
qoa_read_u64_rvv:
    # Advance the offset
    lw      t0, 0(a1)                 # Load the value pointed to by a1 into t0 (*p)
    add     a0, a0, t0                # Advance bytes by *p
    addi    t0, t0, 8                 # Increment *p by 8
    sw      t0, 0(a1)                 # Store the updated offset back to memory (*p)

    # Load 8 bytes using vector instructions
    li      t1, 8
    vsetvli t1, t1, e8, m1, ta, ma    # vl = 8, element size 8 bits
    vle8.v  v0, (a0)                  # Load 8 bytes into vector register v0

    # Widen to 64 bits and build the shift vector {56, 48, ..., 0}
    vsetvli t1, t1, e64, m8, ta, ma   # vl = 8, element size 64 bits (m8 so 8 elements fit even at VLEN = 64)
    vzext.vf8 v8, v0                  # Zero-extend each 8-bit byte to 64 bits (vf8: octuple widening)
    vid.v   v16                       # v16 = {0, 1, ..., 7}
    vsll.vi v16, v16, 3               # v16 = {0, 8, ..., 56}
    li      t2, 56
    vrsub.vx v16, v16, t2             # v16 = 56 - v16 = {56, 48, ..., 0}

    # Shift each byte into position and sum
    vsll.vv v8, v8, v16               # Shift each extended byte left by its shift amount
    vmv.s.x v24, zero                 # Clear the accumulation register (element 0)
    vredsum.vs v24, v8, v24           # Reduce-sum the elements of v8 into v24[0]

    # Return the result
    vmv.x.s a0, v24                   # Extract the accumulated result into a0
    ret                               # Return

Compare Execution Time with RVV

image

In the qoa_read_u64 function, we attempted to use RVV to accelerate its execution. However, across multiple tests the accelerated version still performed slightly worse than the original method, so we achieved no speedup for this function. The workload is likely too small to benefit: each call reads only 8 bytes, so the vsetvli/load/widen/reduction sequence costs more than the handful of scalar shifts and ORs it replaces.

Reference