Summary

Adding AVX10 target features to Rust

Motivation

(For reference, names in CAPS refer to CPU ISAs, and lowercase names refer to Rust or LLVM target features.)

Intel has decided to streamline the mess of ISA extensions that AVX512 had become, and has started the AVX10 project.
It is a versioned system with a maximum vector width - e.g., AVX10.N-256 means that this is the Nth iteration of the ISA, and this implementation supports only vectors (and operations) of up to 256 bits.
There will be 2 ISAs in each iteration - AVX10.N-256 and AVX10.N-512.
The first iteration doesn't add any new instructions, it just augments AVX512 - if a CPU supports AVX10.1-512, it supports all AVX512 instructions, and if the CPU supports only AVX10.1-256, it supports all AVX512 instructions that operate on 128- and 256-bit vectors.

This is where the problem arises - Rust has target features avx512f etc. to indicate support for AVX512.
LLVM, on the other hand, uses a different mechanism.
The LLVM feature evex512 indicates the presence of 512-bit instructions, and the LLVM feature avx512f only means support for AVX512F instructions on vectors of up to 256 bits - so LLVM uses avx512f+evex512 to indicate full AVX512F support.
For this reason, the Rust feature avx512f actually maps to the LLVM features avx512f+evex512 - so currently Rust code cannot target AVX10.N-256.

This proposal is to add AVX10.N-256 to Rust, and discuss the complications.

In this document, we propose 2 possible ways to add AVX10 to Rust.

evex512 Approach

Guide-level explanation

This proposal designates the target feature evex512 for 512-bit vectors - if a function uses 512-bit vectors, it must have this target feature enabled.
For example:

#[target_feature(enable = "avx512f")]
fn do_256_bit_things() {
    // Only 128- and 256-bit instructions/intrinsics from AVX512F (e.g. _mm256_reduce_add_pd) can be used here
}

#[target_feature(enable = "avx512f,evex512")]
fn do_512_bit_things() {
    // All instructions/intrinsics from AVX512F can be used here
}

This also changes the avx512* target features to mean only that at least the 128- and 256-bit versions of that CPU feature are available.
So, to use 512-bit AVX512F intrinsics, a function must have both avx512f and evex512 (as shown in the example).

Reference-level explanation

This proposal changes the semantics of all avx512* target features and adds a new target feature, evex512, which indicates the presence of 512-bit instructions in the CPU.

This mechanism would mean that the target feature avx10.1-256, when introduced, will imply avx512f, avx512vl etc., but not evex512.
And the target feature avx10.1-512 will imply avx10.1-256 and evex512.
This system essentially mirrors LLVM's solution to this issue.
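These implication rules can be modelled with a minimal sketch. The helper below is hypothetical and lists only a handful of avx512* features for brevity, not the real compiler implication tables:

```rust
// Hypothetical model of the proposed feature implications under the
// evex512 approach. Only a few avx512* features are shown.
fn implied_features(feature: &str) -> Vec<&'static str> {
    match feature {
        // avx10.1-256 implies the avx512* features, but not evex512.
        "avx10.1-256" => vec!["avx512f", "avx512vl", "avx512bw"],
        // avx10.1-512 implies everything avx10.1-256 does, plus evex512.
        "avx10.1-512" => {
            let mut v = implied_features("avx10.1-256");
            v.push("avx10.1-256");
            v.push("evex512");
            v
        }
        _ => vec![],
    }
}
```

Under this model, only avx10.1-512 pulls in evex512, which is exactly how LLVM distinguishes the two ISAs.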

A function with target feature avx512f will work with AVX512, AVX10.1-256 and AVX10.1-512 CPUs.
But a function with target features avx512f and evex512 will work only with AVX512 and AVX10.1-512 CPUs.
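This compatibility matrix can be captured in a small sketch, assuming the illustrative (non-exhaustive) per-CPU feature sets below:

```rust
// Which proposed target features each CPU class would report.
// These are illustrative subsets, not the full feature lists.
fn cpu_features(cpu: &str) -> &'static [&'static str] {
    match cpu {
        "avx512" | "avx10.1-512" => &["avx512f", "evex512"],
        "avx10.1-256" => &["avx512f"],
        _ => &[],
    }
}

// A function is safe to call iff the CPU reports every feature it requires.
fn runs_on(required: &[&str], cpu: &str) -> bool {
    required.iter().all(|f| cpu_features(cpu).contains(f))
}
```

A function requiring only avx512f runs on all three CPU classes, while one also requiring evex512 is rejected on an AVX10.1-256 CPU.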

The is_x86_feature_detected macro will also have to change - even though this means changing a stable API (AVX512 feature detection was stabilized in 1.27.0).
Detection would behave according to these target-feature semantics:

// On an AVX512 or AVX10.1-512 CPU
println!("{}", is_x86_feature_detected!("avx512f")); // true
println!("{}", is_x86_feature_detected!("evex512")); // true

// On an AVX10.1-256 CPU
println!("{}", is_x86_feature_detected!("avx512f")); // true
println!("{}", is_x86_feature_detected!("evex512")); // false

Only functions with the target feature evex512 will be able to use 512-bit vectors (even when not using AVX512 instructions) due to ABI constraints.
This is similar to the restriction on 256-bit vectors with avx.

Rationale

This preserves parity with the LLVM features, and introduces just one new target feature.
Using it should be intuitive after a little experience, as evex512 simply means that 512-bit vectors are available.

Drawbacks

The main counterargument to this proposal seems to be the loss of association between CPU features and Rust target features.
This proposal would imply that a function with target feature avx512f will only be able to use a subset of AVX512F instructions.
This may be counterintuitive to programmers.

Also, introducing this might break a lot of code in the ecosystem, as many crates use AVX512 to accelerate computation.
This proposal would mean that many of their uses of AVX512 target features would have to be augmented with evex512, otherwise a SIGILL may result.
This might be acceptable, though, because the AVX512 target features are still unstable.

[Ralf: I would also mention that using avx512f as the name for something that indicates 256-bit operations is terrible naming. If everybody else jumps off a cliff (names something in a totally nonsensical way), do we follow, or do we pick sensible names instead?]

avx256 Approach

Guide-level explanation

This proposal adds 14 new target features:

  • avx256f
  • avx256vl
  • avx256bw
  • avx256dq
  • avx256ifma
  • avx256vnni
  • avx256vbmi
  • avx256vbmi2
  • avx256cd
  • avx256vpopcntdq
  • avx256vp2intersect
  • avx256bitalg
  • avx256bf16
  • avx256fp16

(There are other possible naming conventions, e.g. avx512f-256.)

These will function like the corresponding avx512* target features, but only allow operations on vectors of up to 256 bits.
For example:

#[target_feature(enable = "avx256f")]
fn do_256_bit_things() {
    // Only 128- and 256-bit instructions/intrinsics from AVX512F (e.g. _mm256_reduce_add_pd) can be used here
}

#[target_feature(enable = "avx512f")]
fn do_512_bit_things() {
    // All instructions/intrinsics from AVX512F can be used here
}

Reference-level explanation

This doesn't change the semantics of any of the avx512* target features - it just adds, for each AVX512 ISA, a new target feature restricted to 256-bit vectors.

This mechanism would mean that the target feature avx10.1-256, when introduced, will imply avx256f, avx256bw etc., but not avx512f.
And the target feature avx10.1-512 will imply avx10.1-256, avx512f, avx512bw etc.
This creates a parallel ISA system to the existing AVX512 system, but for 256-bit vectors.
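As with the other approach, the implication rules can be modelled with a minimal sketch. The helper is hypothetical and lists only a few features for brevity:

```rust
// Hypothetical model of the proposed feature implications under the
// avx256 approach. Only a few features are shown.
fn implied_features(feature: &str) -> Vec<&'static str> {
    match feature {
        // avx10.1-256 implies only the new avx256* features.
        "avx10.1-256" => vec!["avx256f", "avx256bw"],
        // avx10.1-512 additionally implies the full avx512* features.
        "avx10.1-512" => {
            let mut v = implied_features("avx10.1-256");
            v.extend(["avx10.1-256", "avx512f", "avx512bw"]);
            v
        }
        _ => vec![],
    }
}
```

Here the 256-bit ISA never implies an avx512* feature, so existing avx512* semantics are untouched.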

A function with target feature avx256f will work with AVX512, AVX10.1-256 and AVX10.1-512 CPUs.
But a function with target feature avx512f will work only with AVX512 and AVX10.1-512 CPUs.
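A runtime dispatcher under this proposal could then look like the following sketch. Here `detected` stands in for what feature detection would report on a given CPU, and the kernel names are placeholders:

```rust
// Sketch of runtime dispatch under the avx256 proposal.
fn pick_kernel(detected: &[&str]) -> &'static str {
    let has = |f: &str| detected.contains(&f);
    if has("avx512f") {
        // Full AVX512 or AVX10.1-512 CPU.
        "512-bit AVX512F kernel"
    } else if has("avx256f") {
        // Also covers AVX10.1-256-only CPUs.
        "256-bit kernel"
    } else {
        "scalar fallback"
    }
}
```

Because avx512f implies avx256f in this scheme, the 256-bit branch is only reached on CPUs without 512-bit support.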

The is_x86_feature_detected macro will keep its current behaviour, only adding detection for the new avx256* target features.

// On an AVX512 or AVX10.1-512 CPU
println!("{}", is_x86_feature_detected!("avx256f")); // true
println!("{}", is_x86_feature_detected!("avx512f")); // true

// On an AVX10.1-256 CPU
println!("{}", is_x86_feature_detected!("avx256f")); // true
println!("{}", is_x86_feature_detected!("avx512f")); // false

Only functions with an avx512* target feature will be able to use 512-bit vectors (even when not using AVX512 instructions) due to ABI constraints.
This is similar to the restriction on 256-bit vectors with avx.

In effect, the avx512f target feature will map to LLVM feature avx512f+evex512, and the avx256f target feature will map to the LLVM feature avx512f.
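This mapping can be sketched as a simple translation table. Only the F pair is shown; the other avx256*/avx512* pairs would follow the same pattern:

```rust
// Sketch of the Rust-to-LLVM target-feature mapping described above.
fn llvm_features(rust_feature: &'static str) -> &'static str {
    match rust_feature {
        // The Rust avx512f feature keeps requesting 512-bit support.
        "avx512f" => "avx512f,evex512",
        // The new avx256f feature maps to bare LLVM avx512f (no evex512).
        "avx256f" => "avx512f",
        // Other features pass through unchanged in this sketch.
        other => other,
    }
}
```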

AVX512VL is somewhat different, as it consists of only 128- and 256-bit instructions, so the only difference between the two variants is what they imply. avx512vl implies avx512f, so it can use all AVX512F intrinsics as well as all AVX512VL intrinsics.
On the other hand, avx256vl only implies avx256f, so it can use only 128- and 256-bit AVX512F instructions along with all AVX512VL intrinsics.
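The VL special case boils down to a single implication difference, sketched here with a hypothetical helper:

```rust
// Model of the VL difference: avx512vl pulls in full avx512f
// (and therefore 512-bit support), while avx256vl pulls in only avx256f.
fn vl_implies(feature: &str) -> &'static [&'static str] {
    match feature {
        "avx512vl" => &["avx512f"],
        "avx256vl" => &["avx256f"],
        _ => &[],
    }
}
```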

Rationale

This is completely backward-compatible, and preserves the association of CPU features and Rust features - avx512f simply means the existence of AVX512F.

Drawbacks

The main criticism seems to be that the addition of that many target features can confuse users.
Also, this creates target features that do not directly map to a CPU feature, which some people might object to.

Prior art

Both GCC and Clang use an evex512-style solution - but they didn't break any code when they introduced it, thanks to negative target features.
Their target attribute avx512f enables evex512 by default, and functions that don't need 512-bit vectors can use no-evex512 to disable it.
This can't be used in Rust because negative target features are capable of causing ABI mismatches and unsoundness (e.g. #116573).

So, we need to come up with our own solution to this.

Future possibilities

This would unblock the stabilization of the AVX512 target features and intrinsics, helping programmers leverage AVX512 capabilities from stable Rust.
This could improve performance throughout the ecosystem, as CPUs supporting AVX512 (or at least the VEX variants of AVX512) are becoming commonplace.
Also, in the future this would allow smooth integration of further AVX10 iterations (at the time of writing, AVX10.2 support has already landed in GCC).