This document collects a few examples of situations in which achieving optimal performance for equivalent functionality on different SIMD platforms requires more than just choosing different instructions (either manually, by special-casing some platforms, or by letting the compiler take care of it).
The examples below come from real use cases in the context of compression libraries (specifically, image and text codecs).
Given that the "native" way to build abstractions in Rust is through types and generics, in the author's opinion the best way to express the different behaviours is through encapsulation in the type system. This in turn makes an ergonomic way to tie types and target features together an important characteristic of any approach to function multiversioning.
In many cases, generating optimal code boils down to ensuring the correct vector size is used. One such example is the 2d (Inverse) Discrete Cosine Transform, quite common in image codecs. Modern codecs use DCTs of various sizes, up to 256x256 in some cases.
A typical 2d DCT implementation (see e.g. here) on an NxN block works in four stages:

- apply a 1d DCT to each column;
- transpose the block;
- apply a 1d DCT to each column of the transposed block;
- transpose the block again.
Both the 1d DCT on columns and the matrix transposition can be SIMDfied, and both of these operations benefit greatly from knowing the native vector size.
Moreover, 1d DCTs are often implemented recursively (again to improve code reuse, see e.g. here), which means that a function that applies a 1d DCT to a list of vectors of native size is a natural API for such implementations. Similarly, "transpose a KxK block, where K is the native vector size" is also a natural API on top of which to implement arbitrary matrix transpositions.
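As a rough illustration of that building block, a KxK transpose written against a compile-time vector width might look like the following. This is a plain-Rust sketch with a hypothetical constant `K`; a real implementation would use per-target vector types and shuffle instructions instead of scalar loops:

```rust
// Sketch only: K stands in for the native vector width (e.g. 4 f32 lanes on
// SSE, 8 on AVX2). Real code would pick K per target and use shuffles.
const K: usize = 4;

/// Transposes a KxK block in place by swapping elements across the diagonal.
fn transpose_kxk(block: &mut [[f32; K]; K]) {
    for i in 0..K {
        for j in (i + 1)..K {
            let tmp = block[i][j];
            block[i][j] = block[j][i];
            block[j][i] = tmp;
        }
    }
}
```

An arbitrary NxN transpose (for N a multiple of K) can then be built by moving KxK tiles around and transposing each one with this primitive, which is why the KxK transpose is a convenient target-dependent building block.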
Consider the case of wanting to compute `pow(x, a)` for many values of `x` in a predefined range (say, `[0.1, 1]`) and a fixed, but not known at compile time, `a`. This comes up for example when implementing Color Management Systems, which need to apply user-provided transformations of the form `x -> (ax+b)^e` to transform pixel data between color profiles, in the middle of a somewhat flexible transformation pipeline. Typically, such transformations involve multiple millions of values for `x` and a single fixed `a`.
Two efficient approaches for this transformation are the following:

- Compute `exp2(log2(x)*a)` directly. Note that computing `exp2` and `log2` usually involves evaluating rational polynomials (see here) and fiddling with floating point representations.
- Pre-compute a lookup table of `L` entries, say each with the value of `pow(2, a*i)`. Also pre-compute a rational polynomial approximation of `pow(x, a)` for `x` in e.g. `[0.5, 2)`. Then, evaluation involves reducing the input values `x` to the form `2^i * x'` with `x'` in the appropriate range (fiddling with floating point representations), evaluating the rational polynomial approximation for `pow(x', a)`, doing a table lookup by `i`, and multiplying the results.

The second approach can be faster if the number of evaluations of `pow(x, a)` is large enough and the relevant target supports fast table lookups in sufficiently-large tables; however, the optimal size of the lookup table and the degree of the rational polynomials to be used depend on the target. Moreover, computing rational polynomial approximations is quite time-consuming, so it is best done only once.
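For concreteness, here is a scalar sketch of the first approach; the function name is assumed, and a real SIMD version would replace `log2`/`exp2` with vectorized rational-polynomial approximations evaluated on whole vectors at once:

```rust
// Scalar sketch of the first approach: for x > 0,
// pow(x, a) == exp2(a * log2(x)).
// A tuned implementation would evaluate polynomial approximations of
// log2/exp2 across SIMD lanes instead of calling libm per element.
fn pow_fixed_exponent(xs: &mut [f32], a: f32) {
    for x in xs.iter_mut() {
        *x = (a * x.log2()).exp2();
    }
}
```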
Thus, the data that should be pre-computed to make this evaluation efficient varies significantly with the SIMD instruction set that the code is meant to be used with. A natural API would then be to have a struct that stores this pre-computed data and is passed around to all functions that need to compute `pow(x, a)`. Such a struct is naturally heavily dependent on the target features of the function it is used in.
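A minimal sketch of such a struct follows; all field names and sizes are illustrative, and a real version would choose the table size and polynomial degree per target (and would likely be generic over some target-feature token type):

```rust
// Hypothetical pre-computed data for evaluating pow(x, a).
// Field names and sizes are illustrative only.
struct PowData {
    a: f32,
    // table[i] = pow(2, a*i), for the table-lookup approach.
    table: Vec<f32>,
}

impl PowData {
    /// Builds the lookup table once, up front; `table_size` would be
    /// chosen based on the target's fast-lookup capabilities.
    fn new(a: f32, table_size: usize) -> Self {
        let table = (0..table_size).map(|i| (a * i as f32).exp2()).collect();
        PowData { a, table }
    }
}
```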
Similar situations also arise when implementing fast Huffman coding, which is relevant both for image (1, 2) and text compressors. In particular, if the SIMD hardware supports efficient lookups in tables of size up to `L`, Huffman encoding of an alphabet of up to `L` symbols can be done extremely quickly; thus, a compressor might decide to constrain the Huffman alphabet so that only `L` different symbols are used, and pass such a feature-dependent lookup table down to the parts of the code that do the actual Huffman coding, through the image and text processing steps. Again, a natural API uses feature-dependent types.
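As a sketch of what such a feature-dependent type could look like, the alphabet size can be carried as a const parameter. Everything here is hypothetical (a real encoder would emit packed bits, not booleans, and `L` would be tied to the target's fast-lookup width rather than chosen freely):

```rust
// Hypothetical Huffman code table for an alphabet constrained to L symbols,
// where L would be chosen to match the target's fast table-lookup size.
struct HuffmanTable<const L: usize> {
    code: [u32; L], // code bits for each symbol, stored in the low `len` bits
    len: [u8; L],   // code length in bits for each symbol
}

impl<const L: usize> HuffmanTable<L> {
    /// Encodes `symbols` by table lookup, pushing bits MSB-first.
    /// A real implementation would pack bits into a byte buffer.
    fn encode(&self, symbols: &[u8], out: &mut Vec<bool>) {
        for &s in symbols {
            let (code, n) = (self.code[s as usize], self.len[s as usize]);
            for k in (0..n).rev() {
                out.push((code >> k) & 1 == 1);
            }
        }
    }
}
```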