
Design meeting: writing code once for multiple SIMD extension sets

Even within the same architecture, CPUs vary significantly in which SIMD ISA extensions they implement. For example, x86 CPUs currently have about 12 (!) different possible configurations with respect to AVX-512 support alone.

Most libraries that are shipped to users in binary form thus tend to generate code for different sets of features, and then use runtime dispatch to choose between the different implementations depending on the specific extensions available.

This approach is very common, for example in modern image and video codec implementations.

Traditionally, this has been done by explicitly writing a version of the relevant code for every architecture and set of SIMD instructions, often even in hand-coded ASM, with obvious downsides for both developer convenience and safety properties of the resulting code.

A different approach is to write the code once and have the compiler generate multiple versions; the C++ library highway is a widely used example of this pattern, adopted by multiple projects across different domains.

The topic of how to best support this pattern in Rust has been discussed extensively in this Zulip thread, without reaching a consensus on which of the three approaches proposed below to proceed with (nor on many of the other design questions discussed below). The main question for this design meeting would be to figure out the best approach(es), among the variants of the proposals below or even a completely different one.

Current state in Rust

As of writing this document, efficiently using SIMD instructions locally (i.e. in a single function, as opposed to program-wide) requires annotating the function with an attribute, and making the function unsafe, for example:

#[target_feature(enable="avx")]
unsafe fn with_avx() {
    // can use AVX intrinsics here.
}

This attribute influences how LLVM does code generation in multiple ways:

  • LLVM might apply automatic vectorization in a way that causes AVX to be used, even if the source code of the function does not explicitly indicate that.
  • Other AVX-using functions can be inlined into with_avx (in particular, functions from core::arch that explicitly map to AVX instructions).
  • The ABI of calls to extern "C" functions that take AVX registers as arguments changes; this was a potential source of surprising UB, and situations that could trigger this are being made into errors. See #116558 for details.

Since this attribute is required for efficient code to be generated, libraries that want to provide a way to generate multiple copies of a function for different SIMD instruction sets (such as pulp and multiversion) use one of the following methods:

  • They require intermediate functions to be declared #[inline(always)], causing all the relevant SIMD code to be inlined into a single function (which is then multiversioned); this has obvious downsides in terms of code size.
  • They require the user to perform dynamic dispatch again, which has a meaningful performance impact for medium-to-small sized functions.
  • They synthesize an intermediate function (for callees that the user does not wish to be inlined) that enables the appropriate features and into which the user-provided callee is inlined; see the sketch below this list. This requires significant boilerplate.
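
As a concrete illustration of the third option, the following is a hand-written sketch of the shape of the code such crates generate (the names kernel, kernel_avx2 and dispatch are illustrative; this is not the actual expansion produced by pulp or multiversion):

// User-provided code, written once.
#[inline(always)]
fn kernel(data: &mut [f32]) {
    for x in data.iter_mut() {
        *x *= 2.0;
    }
}

// Synthesized intermediate function: it enables the features, and the kernel
// is inlined into it, so the kernel body is compiled (and can be vectorized)
// with AVX2 available.
#[target_feature(enable = "avx2")]
unsafe fn kernel_avx2(data: &mut [f32]) {
    kernel(data);
}

fn dispatch(data: &mut [f32]) {
    if is_x86_feature_detected!("avx2") {
        // SAFETY: AVX2 availability was just checked at runtime.
        unsafe { kernel_avx2(data) }
    } else {
        kernel(data);
    }
}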

Requiring unsafe

Apart from the ABI issues mentioned above, unsafe is required (regardless of the source of the function) because CPUs that do not understand a certain SIMD ISA extension might interpret instructions as arbitrary other instructions.
target feature 1.1 removes the need for unsafe in free functions, making the functions safe to call in a context-dependent way. In particular, the call is safe iff the caller has a superset of the features of the callee.

Note that trait methods still require unsafe even with target feature 1.1.
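
A minimal sketch of these rules, assuming target feature 1.1 as described (the runtime check uses the existing is_x86_feature_detected! macro):

#[target_feature(enable = "avx2")]
fn uses_avx2() {
    // AVX2 intrinsics could be used here; no unsafe is needed to declare this.
}

#[target_feature(enable = "avx2")]
fn also_avx2() {
    uses_avx2(); // safe: the caller enables a superset of the callee's features
}

fn baseline() {
    // uses_avx2(); // would require unsafe here: the caller lacks "avx2"
    if is_x86_feature_detected!("avx2") {
        // SAFETY: AVX2 was just detected at runtime.
        unsafe { uses_avx2() };
    }
}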

SIMD abstraction choices

Assuming one does not want to rely on autovectorization (which is typically the case for the use cases in question), supporting this kind of function multiversioning inherently requires having some form of abstraction over SIMD ISAs.

Multiple such abstraction layers have been written over time, with somewhat different objectives:

  • Efforts such as std::simd provide vector types that offer identical behaviour across different platforms. When the behaviour is not identical across platforms, one standard behaviour is chosen and emulated on platforms for which it would not be the native behaviour.
  • pulp and highway aim to be as close to the hardware as possible. This implies that operations might have slightly different semantics across platforms; they also have a concept of "vector of native length", and may not implement some vector lengths for some targets.

The second approach typically results in higher performance, at the cost of somewhat more cumbersome code.

Moreover, the most efficient implementation of an algorithm for a specific SIMD architecture can differ substantially from the one for another architecture, for example by requiring different kinds of pre-computation or storage, or simply by adapting to the native vector size; thus, it is desirable for the multiversioning/abstraction mechanism to allow for some amount of platform-specific adaptation. See this document for some examples of how the SIMD ISA might affect the optimal implementation.
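
As a rough sketch of the second, hardware-oriented style (this is not the actual pulp or highway API, just an illustration of the "vector of native length" idea):

trait SimdBackend {
    // Native-width vector of f32; the number of lanes differs per backend.
    type F32s: Copy;
    const LANES: usize;
    fn splat(x: f32) -> Self::F32s;
    fn add(a: Self::F32s, b: Self::F32s) -> Self::F32s;
}

// Scalar fallback backend; an AVX2 backend would use __m256 with LANES = 8.
struct Scalar;

impl SimdBackend for Scalar {
    type F32s = f32;
    const LANES: usize = 1;
    fn splat(x: f32) -> f32 { x }
    fn add(a: f32, b: f32) -> f32 { a + b }
}

// Generic code written against the trait adapts to the native vector width,
// e.g. by processing B::LANES elements per iteration.
fn splat_add<B: SimdBackend>(x: f32, y: f32) -> B::F32s {
    B::add(B::splat(x), B::splat(y))
}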

Proposal: struct_target_features

(this proposal is mostly based on RFC 3525)

One possible approach to ease function multiversioning is to allow attaching features to types. Such types would gain a validity invariant that requires the appropriate features to be present for an instance of the given type to exist.

A function taking such a type as an argument would then be able to assume the features to be present, and the compiler could thus safely instruct LLVM to generate code accordingly.

Note that this implies that a type explicitly marked with features should be unsafe to construct, unless the caller already has the required features. This is consistent with the behaviour proposed by target_feature_11.

This allows using the existing infrastructure for generics to achieve multiversioning, for example like so:

// See below for a discussion of other possible ways to define such structs
#[target_features(enable = "avx2")]
struct WithAvx2;
#[target_features(enable = "avx512f")]
struct WithAvx512;


fn do_something<S: TargetSpecificOperation>(simd: S) {
    // code that can work under any set of target features here,
    // presumably using some form of SIMD abstraction.

    // Specializing the implementation of some operations depending
    // on the available instruction set can be done through traits.
    simd.target_specific_op();
}

trait TargetSpecificOperation {
    fn target_specific_op(self);
}

impl TargetSpecificOperation for WithAvx2 {
    // See below for considerations on how to inherit features
    // from arguments.
    fn target_specific_op(self) {
        // AVX2-specific code
    }
}

impl TargetSpecificOperation for WithAvx512 {
    fn target_specific_op(self) {
        // AVX512-specific code
    }
}

fn main() {
    if has_avx512() {
        do_something(unsafe { WithAvx512::new_unchecked() });
    } else if has_avx2() {
        do_something(unsafe { WithAvx2::new_unchecked() });
    }
    
    // *ALTERNATIVE*, with appropriate (user-defined) methods on
    // `WithAvx*` types that do feature detection:
    if let Some(avx512) = WithAvx512::new() {
        do_something(avx512);
    } else if let Some(avx2) = WithAvx2::new() {
        do_something(avx2);
    }
}

Beyond this basic idea, there are multiple specific design choices that could considerably change the shape of the resulting feature:

  • Should user-provided types be annotatable this way, or only stdlib types?
  • Should non-ZST types be annotatable?
  • Should types inherit target features from nested types?
  • How should a function inheriting features from its arguments work?

What types can be annotated?

  • The main advantage of allowing non-ZSTs to be annotated is that it allows using features in the implementation of existing traits; for example, an Avx2U8Vec type can implement std::ops::Add using avx2 intrinsics, as long as self is not passed by reference (see below).
  • Potentially, applying such attributes to #[repr(simd)] types could allow passing them by value even in the Rust ABI, as any such pass-by-value would implicitly enable the required features and thus ABI incompatibilities should no longer be possible.
  • Allowing any struct to be annotated with #[target_features(enable = ...)] lets user code define, with a familiar syntax, the subsets of features it wants to dispatch over.
  • Restricting the creation of such types to the stdlib (or core) allows for more freedom in the design of, say, zero-sized witness types to be used by user code. For example, witness types could be designed as a lang-item type Witness<const FEATURES: TargetFeatures>, where TargetFeatures { avx: bool, avx2: bool, ... } has fields that depend on the current target; an impl of such a type could then be provided to offer functionality that would otherwise require new auto traits or other compiler support (see the sketch below).
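
A hypothetical sketch of that last option: neither Witness nor TargetFeatures exists today, and outside of a dedicated lang item a struct-typed const parameter would currently need the nightly adt_const_params feature.

#![feature(adt_const_params)] // nightly-only, unless Witness becomes a lang item
use std::marker::ConstParamTy;

#[derive(ConstParamTy, PartialEq, Eq)]
pub struct TargetFeatures {
    pub avx: bool,
    pub avx2: bool,
    // ... one field per feature of the current target
}

// Zero-sized witness: owning a value of this type proves the features are enabled.
pub struct Witness<const FEATURES: TargetFeatures>;

impl<const FEATURES: TargetFeatures> Witness<FEATURES> {
    // Runtime-checked constructor (detection body omitted in this sketch).
    pub fn new() -> Option<Self> {
        unimplemented!("runtime detection of the features listed in FEATURES")
    }

    // Safety: the caller must guarantee that all features in FEATURES are available.
    pub unsafe fn new_unchecked() -> Self {
        Witness
    }
}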

Inheriting features from nested types

Some reasonable options are:

  • Not inheriting features at all
  • Only inheriting features from fields if an attribute is present, e.g. #[target_features(inherit)]
  • Unconditionally inheriting features for built-in composite types such as tuples, but not for user-defined types
  • Unconditionally inheriting features

If feature inheritance is desired, it should be compatible with validity invariants. As such, the following should probably be true, given a type T with features that is in some way contained in a type S (see the sketch after this list for an illustration):

  • S containing T by value should cause features to be inherited
  • S containing &T should probably not cause features to be inherited (see here for rationale on not recurring through references for validity invariants)
  • S containing [T; N] with N=0 should definitely not cause features to be inherited. With N>0, it probably should (unless doing so would be too confusing).
  • S containing *const T or *mut T should definitely not cause features to be inherited.
  • If S is a union or enum, it should not inherit features.
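
A sketch of these rules, assuming the #[target_features] attribute from this proposal and by-value inheritance (all type names are illustrative):

#[target_features(enable = "avx2")]
struct Avx2Token;

struct ByValue(Avx2Token);        // inherits "avx2": a valid Avx2Token must exist
struct ByRef<'a>(&'a Avx2Token);  // does not inherit: validity does not recurse through &
struct Empty([Avx2Token; 0]);     // does not inherit: no Avx2Token value needs to exist
enum Maybe { Yes(Avx2Token), No } // does not inherit: the Yes variant may never be used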

Note that inherited features do not require the struct to be unsafe to construct.

The rules that apply to inheriting features from struct fields should likely be the same as the rules for inheriting from function arguments.

Behaviour in function definitions

The main question is whether to require functions that inherit features to have a #[target_features(from_args)] attribute (or equivalent).

Such an attribute could also be made more specific, e.g. #[target_features(from_args(0, 1))] or #[target_features(from_args(arg.field))].

The main arguments for requiring such an attribute are twofold:

  • Not having such an attribute might be surprising for people who are used to having codegen controlled by attributes.
  • Currently, the MIR inliner refuses to inline functions that differ in features. Without such an attribute, features would effectively be unknowable until monomorphization, effectively disabling MIR inlining for generics; this is expected to cause an unacceptable performance degradation. However, this check is arguably overly conservative: for example, LLVM happily performs inlining between functions that differ in features if it can determine that doing so is sound, and doing similar checks on MIR would be expected to remove this drawback.

Note: not inheriting features from values passed by reference would make it slightly more cumbersome to implement trait methods that take &self and wish to use features: it would require either annotating the trait method with #[inline(always)] and having it call a function taking self by value, or something along the lines of #[unsafe(target_feature(enable = ...))], which does not require the annotated function to be unsafe (by moving the burden of proof to a safety invariant of one of the function arguments).

Proposal: #[target_features(caller)]

(this proposal is mostly based on this draft RFC)

The main idea of this proposal is that annotating functions with #[target_features(caller)] would cause them to inherit the features of the caller. The proposal also involves adding macros is_{arch}_feature_enabled! to generate different code depending on which instruction set the current function is being compiled for.

Example:

#[target_features(caller)]
fn do_something() {
    // code that can work under any set of target features here,
    // presumably using some form of SIMD abstraction.
    
    // Specializing the implementation of some operations depending
    // on the available instruction set
    if is_x86_feature_enabled!("avx512f") {
        // target-specific code with AVX512
    } else {
        // target-specific code when AVX512 is not available
    }
}

#[target_features(enable = "avx512f")]
fn do_something_avx512() {
    do_something()
}

#[target_features(enable = "avx2")]
fn do_something_avx2() {
    do_something()
}

fn main() {
    if has_avx512() {
        unsafe { do_something_avx512(); }
    } else if has_avx2() {
        unsafe { do_something_avx2(); }
    }
}

This proposal comes with a few open questions, partially discussed below:

  • What should happen when the address of such a function is taken?
  • Should this attribute be allowed on trait method implementations?
  • At which point of the compilation process should functions with this annotation be multiversioned?
  • How should such a function interact with name mangling?
  • A caller might have features that a callee does not benefit from; how could such features be kept from propagating to callees, to save binary size?
  • What should happen if the call graph contains cycles between #[target_features(caller)] functions?

Note: #[target_features(caller)] on non-trait methods could be implemented as sugar for struct_target_features in the presence of a features_of_current_function!() macro that constructs an appropriately-annotated ZST. Many of the discussion points below are also relevant to such an approach.

Taking addresses

Since there are now multiple copies of the function sharing the same name, converting the function to a function pointer (or Fn trait) needs special care.

There are two reasonable approaches:

  • Converting to a function pointer/Fn trait only uses the baseline features for the current architecture. This could cause some very surprising behaviours (e.g. calling iterator.map(fn_with_caller_features) would use baseline features).
  • Converting to a function pointer/Fn trait uses the features of the function that the call appears in. Note that this should ignore inlining.

If this attribute is allowed on trait method implementations of type T for trait Trait, similar considerations then apply to converting &T to &dyn Trait.

Time at which multiversioning happens

The main question is whether multiversioning should happen before MIR creation or before codegen.

Doing multiversioning before MIR creation might impact compilation times negatively, as potentially significantly more MIR would be generated.

Doing multiversioning at codegen time (effectively making #[target_features(caller)] functions equivalent to generics, from MIR's point of view) also comes with its own challenges. Most notably, even if inlining a #[target_features(caller)] function is by definition always fine, inlining callers of #[target_features(caller)]-annotated functions during MIR optimizations would change program behaviour in an observable way if implemented naively. Thus, calls to #[target_features(caller)] functions would need to be annotated to keep track of the original set of features.

Note that similar considerations apply to inlining function bodies that contain is_{arch}_feature_enabled! macro calls.

Proposal: #[target_features(generic)]

This proposal can be seen as a more explicit version of #[target_features(caller)].

The main idea of this proposal is that annotating a const generic function with #[target_features(generic)] would cause the function to obtain the features specified in the first const generic argument of appropriate type.

Example:


trait Simd {
    fn do_something(&self);
}

struct S<const FEATURES: str>();

impl Simd for S<{*"avx2"}> {
    // Note: as written, this function would not have target features. See below
    // for a proposal related to the main discussion, but somewhat orthogonal,
    // that would fix this issue.
    fn do_something(&self) {
        // target-specific code with AVX2
    }
}

impl Simd for S<{*"avx512f"}> {
    fn do_something(&self) {
        // target-specific code with AVX512
    }
}

#[target_features(generic)]
fn do_something<const FEATURES: str>(s: S<FEATURES>) where S<FEATURES>: Simd {
    // code that can work under any set of target features here,
    // presumably using some form of SIMD abstraction.
    
    // Specializing the implementation of some operations depending
    // on the available instruction set
    s.do_something();
}

fn main() {
    if has_avx512() {
        unsafe { do_something(S::<{*"avx512f"}>()); }
    } else if has_avx2() {
        unsafe { do_something(S::<{*"avx2"}>()); }
    }
}

This proposal comes with a few open questions:

  • What type should the generic argument have? Some reasonable options are str (requires unsized_const_params, but can re-use existing syntax) or a new lang item (e.g. with boolean fields for each feature).
  • Is "first generic argument" the best semantics for the attribute? Should we allow specifying the index of the generic?

Comparison between the proposals

In the author's opinion, the additional control granted by #[target_features(generic)] and struct_target_features is highly desirable.
It however comes at the cost of a more verbose syntax (somewhat mitigated by generic argument deduction).

With #[target_features(caller)], it is possible to achieve similar levels of flexibility by making feature-generic functions also generic over an appropriate witness type. However, this duplicates the feature information and introduces the possibility of the two disagreeing.

struct_target_features suffers from not being able to rely on recursive reference validity, as having significantly different behaviour when passing by value versus by reference is a major footgun. However, after some discussion with t-opsem / t-lang, it appears that it would be reasonable to rely on safety invariants (instead of validity invariants) for enabling features, bypassing this issue.

Neither #[target_features(caller)] nor #[target_features(generic)] offers a way to ergonomically create vector types that implement arbitrary traits; see the next section for a possible solution.

#[unsafe(target_feature(enable = ...))]

Consider creating an Avx2u32x8 type that contains 8 u32 values, has a safety invariant of "avx2 is available", and relies on avx2 to implement std::ops::Add.

Today, the implementation of Add would look like this:

use core::arch::x86_64::*;

struct Avx2u32x8(__m256i);

impl std::ops::Add<Avx2u32x8> for Avx2u32x8 {
    type Output = Avx2u32x8;
    #[inline(always)]
    fn add(self, other: Self) -> Self {
        unsafe {
            Self(_mm256_add_epi32(self.0, other.0))
        }
    }
}

This relies on the caller function to have the appropriate features (which is reasonable), and forces inlining (which might be less reasonable for more complex functions).

Even with target_features_11, the following is not possible, because target_feature is not allowed on safe trait methods:

impl std::ops::Add<Avx2u32x8> for Avx2u32x8 {
    type Output = Avx2u32x8;
    #[target_feature(enable = "avx2")]
    fn add(self, other: Self) -> Self {
        unsafe {
            Self(_mm256_add_epi32(self.0, other.0))
        }
    }
}

A workaround to still leave inlining decisions up to the compiler is the following:


impl std::ops::Add<Avx2u32x8> for Avx2u32x8 {
    type Output = Avx2u32x8;
    #[inline(always)]
    fn add(self, other: Self) -> Self {
        #[target_feature(enable = "avx2")]
        unsafe fn inner(this: Avx2u32x8, other: Avx2u32x8) -> Avx2u32x8 {
            Avx2u32x8(_mm256_add_epi32(this.0, other.0))
        }
        unsafe {
            inner(self, other)
        }
    }
}

However, this is somewhat cumbersome.

#[unsafe(target_feature(enable = ...))] has the same semantics as #[target_feature(enable = ...)], except that the annotated function (or trait method) is entirely safe: it is up to the person adding the attribute to ensure that the safety invariants of the function arguments guarantee that it can only be called when the features are present.

Example:

impl std::ops::Add<Avx2u32x8> for Avx2u32x8 {
    type Output = Avx2u32x8;
    #[unsafe(target_feature(enable = "avx2"))]
    fn add(self, other: Self) -> Self {
        // no unsafe with https://github.com/rust-lang/libs-team/issues/494
        unsafe {
            Self(_mm256_add_epi32(self.0, other.0))
        }
    }
}
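
To complete the picture, a user-defined constructor can establish the safety invariant "avx2 is available", making the safe Add impl above sound. A possible sketch (the constructor name and the use of an unaligned load are illustrative choices, not part of the proposal):

impl Avx2u32x8 {
    fn new(values: [u32; 8]) -> Option<Self> {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 was just detected at runtime, and
            // _mm256_loadu_si256 has no alignment requirement.
            Some(unsafe { Self(_mm256_loadu_si256(values.as_ptr().cast())) })
        } else {
            None
        }
    }
}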