RVV-accelerated Audio Codec
李宗翰, 洪靖睿
Introduction
qoa
What is RVV?
RVV-accelerated refers to leveraging the RISC-V Vector Extension (RVV) to speed up computations, particularly for tasks that involve large-scale data processing. RVV introduces hardware support for vectorized operations, allowing multiple data elements to be processed simultaneously in a single instruction, greatly improving performance in parallel workloads.
Vectorized Data Processing:
- RVV operates on vectors (arrays of numbers) instead of single data points, enabling Single Instruction Multiple Data (SIMD) execution.
Variable-Length Vectors:
- Unlike fixed-width SIMD, RVV supports dynamically configurable vector lengths, making it adaptable to different hardware implementations.
Dynamic Scalability:
- Developers can write "vector-length agnostic" programs that automatically adjust to the hardware's capabilities without needing code changes.
What is QOA?
QOA is the “Quite OK Audio Format” for fast, lossy audio compression
QOA is fast. It decodes audio 3x faster than Ogg-Vorbis, while offering better quality and compression than ADPCM (278 kbits/s for 44khz stereo).
QOA is simple. The reference en-/decoder fits in about 400 lines of C. The file format specification is a single page PDF.
QOA file
A QOA file consists of an 8 byte file header, followed by a number of frames.Each frame contains an 8 byte frame header, the current 16 byte en-/decoder state per channel and 256 slices per channel. Each slice is 8 bytes wide and encodes 20 samples of audio data.
All values, including the slices, are big endian. The file layout is as follow
Each qoa_slice_t
contains a quantized scalefactor sf_quant
and 20 quantized residuals qrNN
:
Frames and Slices:
- Frames: 256 slices per channel, except the last (1-256 slices).
- Last slice: 8 bytes wide, unused samples zeroed.
Channel Interleaving:
- Channels alternate per slice (e.g., stereo: L, R, L, R).
Valid QOA File:
- At least 1 frame, 1 channel, 1 sample.
- Samplerate: 1 to 16,777,215.
- Static files: Consistent channels and samplerate.
- Streaming: Samples field 0x00000000, variable channels/samplerate.
Decoder:
- Supports up to 8 channels with predefined layouts.
Prediction:
- Uses LMS filter to predict samples with residuals for output.
RISC-V Vector Extension in 64-bit Environment Installation
- Setting some need package
- Clone the riscv-gnu-toolchain repo from the official GitHub repo and enter the folder riscv-gnu-toolchain
- To configure the RISC-V toolchain (e.g., GCC compiler).
- Running
make linux
in the context of building the RISC-V toolchain typically compiles and installs the Linux cross-compiler for RISC-V.(Around 2 hours)
- Add the installed RISC-V toolchain (located in /opt/riscv/bin) to the system's PATH environment variable.
- Install QEMU
- Check QEMU and Riscv-Gnu-Toolchain
- Test file for the vector
vec.S
main.c
- Compile it
output
Baseline implementation
qoa.h
extern "C" {
(8 + QOA_LMS_LEN * 4 * channels + 8 * slices * channels)
typedef struct {
int history[QOA_LMS_LEN];
int weights[QOA_LMS_LEN];
} qoa_lms_t;
typedef struct {
unsigned int channels;
unsigned int samplerate;
unsigned int samples;
qoa_lms_t lms[QOA_MAX_CHANNELS];
double error;
} qoa_desc;
unsigned int qoa_encode_header(qoa_desc *qoa, unsigned char *bytes);
unsigned int qoa_encode_frame(const short *sample_data, qoa_desc *qoa, unsigned int frame_len, unsigned char *bytes);
void *qoa_encode(const short *sample_data, qoa_desc *qoa, unsigned int *out_len);
unsigned int qoa_max_frame_size(qoa_desc *qoa);
unsigned int qoa_decode_header(const unsigned char *bytes, int size, qoa_desc *qoa);
unsigned int qoa_decode_frame(const unsigned char *bytes, unsigned int size, qoa_desc *qoa, short *sample_data, unsigned int *frame_len);
short *qoa_decode(const unsigned char *bytes, int size, qoa_desc *file);
int qoa_write(const char *filename, const short *sample_data, qoa_desc *qoa);
void *qoa_read(const char *filename, qoa_desc *qoa);
}
typedef unsigned long long qoa_uint64_t;
static const int qoa_quant_tab[17] = {
7, 7, 7, 5, 5, 3, 3, 1,
0,
0, 2, 2, 4, 4, 6, 6, 6
};
static const int qoa_scalefactor_tab[16] = {
1, 7, 21, 45, 84, 138, 211, 304, 421, 562, 731, 928, 1157, 1419, 1715, 2048
};
static const int qoa_reciprocal_tab[16] = {
65536, 9363, 3121, 1457, 781, 475, 311, 216, 156, 117, 90, 71, 57, 47, 39, 32
};
static const int qoa_dequant_tab[16][8] = {
{ 1, -1, 3, -3, 5, -5, 7, -7},
{ 5, -5, 18, -18, 32, -32, 49, -49},
{ 16, -16, 53, -53, 95, -95, 147, -147},
{ 34, -34, 113, -113, 203, -203, 315, -315},
{ 63, -63, 210, -210, 378, -378, 588, -588},
{ 104, -104, 345, -345, 621, -621, 966, -966},
{ 158, -158, 528, -528, 950, -950, 1477, -1477},
{ 228, -228, 760, -760, 1368, -1368, 2128, -2128},
{ 316, -316, 1053, -1053, 1895, -1895, 2947, -2947},
{ 422, -422, 1405, -1405, 2529, -2529, 3934, -3934},
{ 548, -548, 1828, -1828, 3290, -3290, 5117, -5117},
{ 696, -696, 2320, -2320, 4176, -4176, 6496, -6496},
{ 868, -868, 2893, -2893, 5207, -5207, 8099, -8099},
{1064, -1064, 3548, -3548, 6386, -6386, 9933, -9933},
{1286, -1286, 4288, -4288, 7718, -7718, 12005, -12005},
{1536, -1536, 5120, -5120, 9216, -9216, 14336, -14336},
};
static int qoa_lms_predict(qoa_lms_t *lms) {
int prediction = 0;
for (int i = 0; i < QOA_LMS_LEN; i++) {
prediction += lms->weights[i] * lms->history[i];
}
return prediction >> 13;
}
static void qoa_lms_update(qoa_lms_t *lms, int sample, int residual) {
int delta = residual >> 4;
for (int i = 0; i < QOA_LMS_LEN; i++) {
lms->weights[i] += lms->history[i] < 0 ? -delta : delta;
}
for (int i = 0; i < QOA_LMS_LEN-1; i++) {
lms->history[i] = lms->history[i+1];
}
lms->history[QOA_LMS_LEN-1] = sample;
}
static inline int qoa_div(int v, int scalefactor) {
int reciprocal = qoa_reciprocal_tab[scalefactor];
int n = (v * reciprocal + (1 << 15)) >> 16;
n = n + ((v > 0) - (v < 0)) - ((n > 0) - (n < 0));
return n;
}
static inline int qoa_clamp(int v, int min, int max) {
if (v < min) { return min; }
if (v > max) { return max; }
return v;
}
static inline int qoa_clamp_s16(int v) {
if ((unsigned int)(v + 32768) > 65535) {
if (v < -32768) { return -32768; }
if (v > 32767) { return 32767; }
}
return v;
}
static inline qoa_uint64_t qoa_read_u64(const unsigned char *bytes, unsigned int *p) {
bytes += *p;
*p += 8;
return
((qoa_uint64_t)(bytes[0]) << 56) | ((qoa_uint64_t)(bytes[1]) << 48) |
((qoa_uint64_t)(bytes[2]) << 40) | ((qoa_uint64_t)(bytes[3]) << 32) |
((qoa_uint64_t)(bytes[4]) << 24) | ((qoa_uint64_t)(bytes[5]) << 16) |
((qoa_uint64_t)(bytes[6]) << 8) | ((qoa_uint64_t)(bytes[7]) << 0);
}
static inline void qoa_write_u64(qoa_uint64_t v, unsigned char *bytes, unsigned int *p) {
bytes += *p;
*p += 8;
bytes[0] = (v >> 56) & 0xff;
bytes[1] = (v >> 48) & 0xff;
bytes[2] = (v >> 40) & 0xff;
bytes[3] = (v >> 32) & 0xff;
bytes[4] = (v >> 24) & 0xff;
bytes[5] = (v >> 16) & 0xff;
bytes[6] = (v >> 8) & 0xff;
bytes[7] = (v >> 0) & 0xff;
}
unsigned int qoa_encode_header(qoa_desc *qoa, unsigned char *bytes) {
unsigned int p = 0;
qoa_write_u64(((qoa_uint64_t)QOA_MAGIC << 32) | qoa->samples, bytes, &p);
return p;
}
unsigned int qoa_encode_frame(const short *sample_data, qoa_desc *qoa, unsigned int frame_len, unsigned char *bytes) {
unsigned int channels = qoa->channels;
unsigned int p = 0;
unsigned int slices = (frame_len + QOA_SLICE_LEN - 1) / QOA_SLICE_LEN;
unsigned int frame_size = QOA_FRAME_SIZE(channels, slices);
int prev_scalefactor[QOA_MAX_CHANNELS] = {0};
qoa_write_u64((
(qoa_uint64_t)qoa->channels << 56 |
(qoa_uint64_t)qoa->samplerate << 32 |
(qoa_uint64_t)frame_len << 16 |
(qoa_uint64_t)frame_size
), bytes, &p);
for (unsigned int c = 0; c < channels; c++) {
qoa_uint64_t weights = 0;
qoa_uint64_t history = 0;
for (int i = 0; i < QOA_LMS_LEN; i++) {
history = (history << 16) | (qoa->lms[c].history[i] & 0xffff);
weights = (weights << 16) | (qoa->lms[c].weights[i] & 0xffff);
}
qoa_write_u64(history, bytes, &p);
qoa_write_u64(weights, bytes, &p);
}
for (unsigned int sample_index = 0; sample_index < frame_len; sample_index += QOA_SLICE_LEN) {
for (unsigned int c = 0; c < channels; c++) {
int slice_len = qoa_clamp(QOA_SLICE_LEN, 0, frame_len - sample_index);
int slice_start = sample_index * channels + c;
int slice_end = (sample_index + slice_len) * channels + c;
qoa_uint64_t best_rank = -1;
qoa_uint64_t best_error = -1;
qoa_uint64_t best_slice = 0;
qoa_lms_t best_lms;
int best_scalefactor = 0;
for (int sfi = 0; sfi < 16; sfi++) {
int scalefactor = (sfi + prev_scalefactor[c]) % 16;
qoa_lms_t lms = qoa->lms[c];
qoa_uint64_t slice = scalefactor;
qoa_uint64_t current_rank = 0;
qoa_uint64_t current_error = 0;
for (int si = slice_start; si < slice_end; si += channels) {
int sample = sample_data[si];
int predicted = qoa_lms_predict(&lms);
int residual = sample - predicted;
int scaled = qoa_div(residual, scalefactor);
int clamped = qoa_clamp(scaled, -8, 8);
int quantized = qoa_quant_tab[clamped + 8];
int dequantized = qoa_dequant_tab[scalefactor][quantized];
int reconstructed = qoa_clamp_s16(predicted + dequantized);
int weights_penalty = ((
lms.weights[0] * lms.weights[0] +
lms.weights[1] * lms.weights[1] +
lms.weights[2] * lms.weights[2] +
lms.weights[3] * lms.weights[3]
) >> 18) - 0x8ff;
if (weights_penalty < 0) {
weights_penalty = 0;
}
long long error = (sample - reconstructed);
qoa_uint64_t error_sq = error * error;
current_rank += error_sq + weights_penalty * weights_penalty;
current_error += error_sq;
if (current_rank > best_rank) {
break;
}
qoa_lms_update(&lms, reconstructed, dequantized);
slice = (slice << 3) | quantized;
}
if (current_rank < best_rank) {
best_rank = current_rank;
best_error = current_error;
best_slice = slice;
best_lms = lms;
best_scalefactor = scalefactor;
}
}
prev_scalefactor[c] = best_scalefactor;
qoa->lms[c] = best_lms;
qoa->error += best_error;
best_slice <<= (QOA_SLICE_LEN - slice_len) * 3;
qoa_write_u64(best_slice, bytes, &p);
}
}
return p;
}
void *qoa_encode(const short *sample_data, qoa_desc *qoa, unsigned int *out_len) {
if (
qoa->samples == 0 ||
qoa->samplerate == 0 || qoa->samplerate > 0xffffff ||
qoa->channels == 0 || qoa->channels > QOA_MAX_CHANNELS
) {
return NULL;
}
unsigned int num_frames = (qoa->samples + QOA_FRAME_LEN-1) / QOA_FRAME_LEN;
unsigned int num_slices = (qoa->samples + QOA_SLICE_LEN-1) / QOA_SLICE_LEN;
unsigned int encoded_size = 8 +
num_frames * 8 +
num_frames * QOA_LMS_LEN * 4 * qoa->channels +
num_slices * 8 * qoa->channels;
unsigned char *bytes = QOA_MALLOC(encoded_size);
for (unsigned int c = 0; c < qoa->channels; c++) {
qoa->lms[c].weights[0] = 0;
qoa->lms[c].weights[1] = 0;
qoa->lms[c].weights[2] = -(1<<13);
qoa->lms[c].weights[3] = (1<<14);
for (int i = 0; i < QOA_LMS_LEN; i++) {
qoa->lms[c].history[i] = 0;
}
}
unsigned int p = qoa_encode_header(qoa, bytes);
qoa->error = 0;
int frame_len = QOA_FRAME_LEN;
for (unsigned int sample_index = 0; sample_index < qoa->samples; sample_index += frame_len) {
frame_len = qoa_clamp(QOA_FRAME_LEN, 0, qoa->samples - sample_index);
const short *frame_samples = sample_data + sample_index * qoa->channels;
unsigned int frame_size = qoa_encode_frame(frame_samples, qoa, frame_len, bytes + p);
p += frame_size;
}
*out_len = p;
return bytes;
}
unsigned int qoa_max_frame_size(qoa_desc *qoa) {
return QOA_FRAME_SIZE(qoa->channels, QOA_SLICES_PER_FRAME);
}
unsigned int qoa_decode_header(const unsigned char *bytes, int size, qoa_desc *qoa) {
unsigned int p = 0;
if (size < QOA_MIN_FILESIZE) {
return 0;
}
qoa_uint64_t file_header = qoa_read_u64(bytes, &p);
if ((file_header >> 32) != QOA_MAGIC) {
return 0;
}
qoa->samples = file_header & 0xffffffff;
if (!qoa->samples) {
return 0;
}
qoa_uint64_t frame_header = qoa_read_u64(bytes, &p);
qoa->channels = (frame_header >> 56) & 0x0000ff;
qoa->samplerate = (frame_header >> 32) & 0xffffff;
if (qoa->channels == 0 || qoa->samples == 0 || qoa->samplerate == 0) {
return 0;
}
return 8;
}
unsigned int qoa_decode_frame(const unsigned char *bytes, unsigned int size, qoa_desc *qoa, short *sample_data, unsigned int *frame_len) {
unsigned int p = 0;
*frame_len = 0;
if (size < 8 + QOA_LMS_LEN * 4 * qoa->channels) {
return 0;
}
qoa_uint64_t frame_header = qoa_read_u64(bytes, &p);
unsigned int channels = (frame_header >> 56) & 0x0000ff;
unsigned int samplerate = (frame_header >> 32) & 0xffffff;
unsigned int samples = (frame_header >> 16) & 0x00ffff;
unsigned int frame_size = (frame_header ) & 0x00ffff;
unsigned int data_size = frame_size - 8 - QOA_LMS_LEN * 4 * channels;
unsigned int num_slices = data_size / 8;
unsigned int max_total_samples = num_slices * QOA_SLICE_LEN;
if (
channels != qoa->channels ||
samplerate != qoa->samplerate ||
frame_size > size ||
samples * channels > max_total_samples
) {
return 0;
}
for (unsigned int c = 0; c < channels; c++) {
qoa_uint64_t history = qoa_read_u64(bytes, &p);
qoa_uint64_t weights = qoa_read_u64(bytes, &p);
for (int i = 0; i < QOA_LMS_LEN; i++) {
qoa->lms[c].history[i] = ((signed short)(history >> 48));
history <<= 16;
qoa->lms[c].weights[i] = ((signed short)(weights >> 48));
weights <<= 16;
}
}
for (unsigned int sample_index = 0; sample_index < samples; sample_index += QOA_SLICE_LEN) {
for (unsigned int c = 0; c < channels; c++) {
qoa_uint64_t slice = qoa_read_u64(bytes, &p);
int scalefactor = (slice >> 60) & 0xf;
slice <<= 4;
int slice_start = sample_index * channels + c;
int slice_end = qoa_clamp(sample_index + QOA_SLICE_LEN, 0, samples) * channels + c;
for (int si = slice_start; si < slice_end; si += channels) {
int predicted = qoa_lms_predict(&qoa->lms[c]);
int quantized = (slice >> 61) & 0x7;
int dequantized = qoa_dequant_tab[scalefactor][quantized];
int reconstructed = qoa_clamp_s16(predicted + dequantized);
sample_data[si] = reconstructed;
slice <<= 3;
qoa_lms_update(&qoa->lms[c], reconstructed, dequantized);
}
}
}
*frame_len = samples;
return p;
}
short *qoa_decode(const unsigned char *bytes, int size, qoa_desc *qoa) {
unsigned int p = qoa_decode_header(bytes, size, qoa);
if (!p) {
return NULL;
}
int total_samples = qoa->samples * qoa->channels;
short *sample_data = QOA_MALLOC(total_samples * sizeof(short));
unsigned int sample_index = 0;
unsigned int frame_len;
unsigned int frame_size;
do {
short *sample_ptr = sample_data + sample_index * qoa->channels;
frame_size = qoa_decode_frame(bytes + p, size - p, qoa, sample_ptr, &frame_len);
p += frame_size;
sample_index += frame_len;
} while (frame_size && sample_index < qoa->samples);
qoa->samples = sample_index;
return sample_data;
}
int qoa_write(const char *filename, const short *sample_data, qoa_desc *qoa) {
FILE *f = fopen(filename, "wb");
unsigned int size;
void *encoded;
if (!f) {
return 0;
}
encoded = qoa_encode(sample_data, qoa, &size);
if (!encoded) {
fclose(f);
return 0;
}
fwrite(encoded, 1, size, f);
fclose(f);
QOA_FREE(encoded);
return size;
}
void *qoa_read(const char *filename, qoa_desc *qoa) {
FILE *f = fopen(filename, "rb");
int size, bytes_read;
void *data;
short *sample_data;
if (!f) {
return NULL;
}
fseek(f, 0, SEEK_END);
size = ftell(f);
if (size <= 0) {
fclose(f);
return NULL;
}
fseek(f, 0, SEEK_SET);
data = QOA_MALLOC(size);
if (!data) {
fclose(f);
return NULL;
}
bytes_read = fread(data, 1, size, f);
fclose(f);
sample_data = qoa_decode(data, bytes_read, qoa);
QOA_FREE(data);
return sample_data;
}
QOA decode
In the decode function, we utilize qoa_read_u64
, qoa_clamp_predict
, qoa_clamp_s16
, and qoa_lms_update
. Accordingly, we have optimized these functions to accelerate their performance and conducted an analysis comparing their efficiency with the original implementation.
In this document, we compare the original implementation of the qoa_clamp_s16
function with its optimized version. Our goal is to demonstrate how the optimized method improves efficiency while maintaining functional correctness.
Original Method
In the original method, if the unsigned comparison evaluates as true, the function performs two additional comparisons to determine the clamped value. This introduces multiple branches in the code.
In the optimized method, the unsigned comparison alone directly determines whether the value is within the valid range or not. If it's out of range, the clamped value is calculated without additional comparisons. This reduces branching and improves efficiency.
Optimized method
Compare Execution Time

RVV_Accelerated method
Compare Execution Time with RVV

We can observe using RVV accelerated method is more faster than origin method.
In the original approach, the loop introduces additional runtime overhead due to repeated iterations and indexing operations. When the function is invoked thousands of times per second, these overheads can accumulate and negatively impact performance.
Since QOA_LMS_LEN
is a fixed constant, the optimized approach takes advantage of this by explicitly writing out the calculation process. This allows the structure of the prediction calculation to be precomputed at compile time, eliminating unnecessary runtime computations. As a result, this approach improves efficiency and delivers faster and more predictable performance.
Original method
Optimized method
Compare Execution Time with C

RVV_Accelerated method
Compare Execution Time with RVV

Employ statistics model to illustrate the output.
Original method
Optimized method
Compare Execution Time with C

Employ statistics model to illustrate the output.
We can observe using RVV accelerated method is more faster than origin method.
RVV_Accelerated method
Compare Execution Time with RVV

We can observe using RVV accelerated method is more faster than origin method.
We try to acclerated the qoa_read_u64
function, but in our test result, we find our execution time is little longer than origin one.
Original method
RVV_Accelerated method
Compare Execution Time with RVV

In the qoa_read_u64
function, we attempted to use RVV to accelerate its execution time. However, after multiple tests, the results showed that the accelerated version still performed slightly worse than the original method. Therefore, for this function, we did not achieve any acceleration.
Reference