
Use cases for flexible abstractions around target features

This document collects a few examples of situations in which achieving optimal performance for equivalent functionality on different SIMD platforms requires more than choosing different instructions (whether manually, by special-casing some platforms, or by letting the compiler take care of it).

The examples below come from real use cases in the context of compression libraries (specifically, image and text codecs).

Given that the "native" way to build abstractions in Rust is through types and generics, in the author's opinion the best way to express these behavioural differences is to encapsulate them in the type system. This in turn makes an ergonomic way to tie types to target features an important characteristic of any approach to function multiversioning.
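For concreteness, here is a minimal sketch of what "tying types to target features" could look like. This is not a concrete proposal; all names (`Simd`, `Avx2`, `new_unchecked`) are invented for illustration:

```rust
// A zero-sized "token" type proves that a given set of target features is
// available, and a trait ties that token to the native vector types of
// the corresponding instruction set.
pub trait Simd: Copy {
    /// Native-width vector of f32 for this instruction set.
    type F32: Copy;
    /// Number of f32 lanes in `Self::F32`.
    const LANES: usize;

    /// # Safety
    /// The caller must guarantee that the required target features are
    /// available on the CPU this code will run on.
    unsafe fn new_unchecked() -> Self;
}

/// Hypothetical token for AVX2-capable x86_64 targets.
#[cfg(target_arch = "x86_64")]
#[derive(Clone, Copy)]
pub struct Avx2(());

#[cfg(target_arch = "x86_64")]
impl Simd for Avx2 {
    type F32 = core::arch::x86_64::__m256; // 8 x f32
    const LANES: usize = 8;
    unsafe fn new_unchecked() -> Self {
        Avx2(())
    }
}
```

Functions generic over such a token can then be monomorphized once per instruction set, with the token acting as proof that the corresponding features are available.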

A simple case: ensuring use of the native vector width

In many cases, generating the optimal code boils down to ensuring the correct vector size is used. One such example is the 2d (Inverse) Discrete Cosine Transform, quite common in image codecs; modern codecs use DCTs of various sizes, up to 256x256 in some cases.

A typical 2d DCT implementation (see e.g. here) on an NxN block works in four stages:

  • 1d DCT on each column in the block
  • NxN matrix transposition
  • 1d DCT on each column of the transposed block (i.e. a row-wise DCT of the original block)
  • NxN matrix transposition

Both the 1d DCT on columns and the matrix transposition can be SIMDified (see the sketch after this list):

  • the 1d DCT can be executed on a set of consecutive columns at once
  • the matrix transposition can be done block-by-block
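Putting the two lists together, the overall structure might look like the following sketch, generic over the hypothetical `Simd` token from above; the actual DCT butterflies and shuffles are target-specific and elided here:

```rust
/// 1d DCT applied in-place to S::LANES consecutive columns at once; each
/// vector in `block` holds one row fragment of the block. Body elided.
fn dct_1d_columns<S: Simd>(_simd: S, _block: &mut [S::F32], _n: usize) {
    todo!("per-target DCT butterflies")
}

/// In-place NxN transposition, done one LANES x LANES sub-block at a
/// time with single-vector shuffles. Body elided.
fn transpose<S: Simd>(_simd: S, _block: &mut [S::F32], _n: usize) {
    todo!("per-target shuffles")
}

/// The four-stage 2d DCT described above.
fn dct_2d<S: Simd>(simd: S, block: &mut [S::F32], n: usize) {
    dct_1d_columns(simd, block, n); // column-wise 1d DCT
    transpose(simd, block, n);      // NxN transposition
    dct_1d_columns(simd, block, n); // row-wise DCT of the original block
    transpose(simd, block, n);      // transpose back
}
```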

Both of those functions benefit greatly from knowing the native vector size:

  • 1d DCTs use a large number of registers; operating on, say, vectors twice the native width would effectively double the registers required, unless LLVM's instruction scheduling manages to de-interleave the operations on the two halves; even then, it would still double code size and make code reuse between different DCT sizes harder. Since the machine code for even a size-16 DCT contains a significant number of instructions (>200), these drawbacks are quite significant.
  • For matrix transposition, the natural block size to operate on with K-long vectors is KxK. Doubling the vector size would mean emulating vector-pair-to-vector-pair shuffles instead of doing 4 single-vector shuffles; this is more expensive and increases register usage by a factor of 4, likely leading to spills.

Moreover, 1d DCTs are often implemented recursively (again to improve code reuse, see e.g. here), which means that a function applying a 1d DCT to a list of vectors of native size is a natural API for such functions to have. Similarly, "transpose a KxK block where K is the native vector size" is also a natural API on which to implement arbitrary matrix transpositions.
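Under the same assumptions, the recursive 1d DCT naturally takes "a list of native-width vectors" as its argument. The sketch below only shows the shape of this API; the split and merge steps stand in for the butterflies of whichever recursive DCT factorization is used:

```rust
/// Recursive 1d DCT over `data.len()` rows of S::LANES columns each.
fn dct_1d<S: Simd>(simd: S, data: &mut [S::F32]) {
    let n = data.len();
    if n <= 1 {
        return;
    }
    // Rearrange the input into two half-size sub-problems (elided).
    split_step(simd, data);
    let (lo, hi) = data.split_at_mut(n / 2);
    dct_1d(simd, lo);
    dct_1d(simd, hi);
    // Combine the two half-size results (elided).
    merge_step(simd, data);
}

fn split_step<S: Simd>(_simd: S, _data: &mut [S::F32]) {
    todo!("butterflies for the chosen DCT factorization")
}

fn merge_step<S: Simd>(_simd: S, _data: &mut [S::F32]) {
    todo!("recombination for the chosen DCT factorization")
}
```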

More complex cases: feature-dependent algorithms that need different precomputed data

Consider the case of wanting to compute pow(x, a) for many values of x in a predefined range (say, [0.1, 1]) and a fixed a that is not known at compile time. This comes up, for example, when implementing Color Management Systems, which need to apply user-provided transformations of the form x -> (ax+b)^e to pixel data to convert between color profiles, in the middle of a somewhat flexible transformation pipeline. Typically, such a transformation involves many millions of values of x and a single fixed a.

Two efficient approaches for this transformation are the following (a sketch of the first follows the list):

  • Compute exp2(a * log2(x)) for each x. Note that computing exp2 and log2 usually involves evaluating rational polynomials (see here) and fiddling with floating-point representations.
  • Pre-compute a lookup table with L entries, entry i holding the value of pow(2, a*i). Also pre-compute a rational polynomial approximation of pow(x, a) for x in e.g. [0.5, 2). Evaluation then involves reducing each input value x to the form 2^i * x' with x' in the appropriate range (fiddling with floating-point representations), evaluating the rational polynomial approximation for pow(x', a), doing a table lookup by i, and multiplying the results.
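A scalar sketch of the first approach; a SIMD version would replace the scalar `f32::log2` and `f32::exp2` calls with vectorized rational-polynomial evaluations operating on whole native vectors:

```rust
/// Computes pow(x, a) = exp2(a * log2(x)) for every x in `xs`.
/// `a` is fixed across the whole batch.
fn pow_batch(xs: &mut [f32], a: f32) {
    for x in xs.iter_mut() {
        *x = (a * x.log2()).exp2();
    }
}
```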

The second approach can be faster if the number of evaluations of pow(x, a) is large enough and the relevant target supports fast lookups in sufficiently large tables; however, the optimal size of the lookup table and the degree of the rational polynomials depend on the target. Moreover, computing rational polynomial approximations is quite time-consuming, so it is best done only once.

Thus, the data that should be pre-computed to make this evaluation efficient varies significantly with the SIMD instruction set the code will run with. A natural API is then a struct that stores this pre-computed data and is passed to all the functions that need to compute pow(x, a); such a struct is heavily dependent on the target features of the functions it is used in.
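Sketching this with the hypothetical `Simd` token from above (the field layout is invented for illustration):

```rust
/// Pre-computed data for evaluating pow(x, a). Table size and polynomial
/// degrees are chosen per instruction set, so the struct carries the
/// token of the target it was built for and can only be used with
/// functions compiled for that target.
struct PowTables<S: Simd> {
    simd: S,
    /// Entry i holds pow(2, a*i); the optimal length depends on how
    /// large a table the target can gather from efficiently.
    table: Vec<f32>,
    /// Coefficients of the rational approximation of pow(x', a) on the
    /// reduced range; the degrees are also target-dependent.
    num_coeffs: Vec<f32>,
    den_coeffs: Vec<f32>,
}

impl<S: Simd> PowTables<S> {
    /// Expensive, done once up front: fits the polynomials and fills
    /// the table for the given exponent `a`.
    fn new(simd: S, a: f32) -> Self {
        todo!("fit rational approximation and build the 2^(a*i) table")
    }

    /// Range-reduce, evaluate the polynomial, look up 2^(a*i), multiply.
    fn eval_batch(&self, xs: &mut [f32]) {
        todo!("vectorized evaluation using self.table and coefficients")
    }
}
```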

Similar situations also arise when implementing fast Huffman coding, which is relevant both for image (1, 2) and for text compressors. In particular, if the SIMD hardware efficiently supports lookups in tables of up to L entries, Huffman encoding of an alphabet of up to L symbols can be done extremely quickly. A compressor might therefore constrain the Huffman alphabet to use only L distinct symbols, and pass such a feature-dependent lookup table down through the image and text processing steps to the parts of the code that do the actual Huffman coding. Again, a natural API uses feature-dependent types.
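As before, the natural way to express this is a type tagged by the target token. A sketch (names hypothetical); a real implementation would perform the lookup with the target's byte-shuffle or gather instructions rather than the scalar loop shown here:

```rust
/// Huffman codes for an alphabet constrained to at most L symbols, where
/// L is the largest table the target tagged by `S` can look up from
/// in-register. Shared between the image/text pipeline stages and the
/// actual entropy coder.
struct HuffmanTable<S: Simd> {
    simd: S,
    /// (bit pattern, bit length) for each symbol; at most L entries.
    codes: Vec<(u32, u8)>,
}

impl<S: Simd> HuffmanTable<S> {
    /// Scalar stand-in: the SIMD version would gather several
    /// (bits, length) pairs per instruction.
    fn lookup(&self, symbols: &[u8], out: &mut Vec<(u32, u8)>) {
        for &s in symbols {
            out.push(self.codes[s as usize]);
        }
    }
}
```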