# Adding AVX10 target features to Rust
(For reference: names in CAPS refer to a CPU ISA, and lowercase names refer to a Rust or LLVM target feature.)
Intel has decided to streamline the mess of ISAs that AVX512 had become, and started the AVX10 project. It is a versioned system with a maximum vector width: e.g., AVX10.N-256 means the Nth iteration of the ISA, in an implementation that supports only vectors (and operations) of up to 256 bits. There will be two ISAs in each iteration: AVX10.N-256 and AVX10.N-512. The first iteration doesn't add any new instructions; it just repackages AVX512: if a CPU supports AVX10.1-512, it supports all AVX512 instructions, and if a CPU supports only AVX10.1-256, it supports all AVX512 instructions on 128- and 256-bit vectors.
Here lies the problem: Rust has the target features `avx512f` etc. to indicate support for AVX512. LLVM, on the other hand, uses a different mechanism. The LLVM feature `evex512` indicates the presence of 512-bit instructions, and the LLVM feature `avx512f` on its own only means support for AVX512F instructions on vectors of up to 256 bits - so LLVM uses `avx512f+evex512` to indicate full AVX512F support. For this reason, the Rust feature `avx512f` actually maps to the LLVM features `avx512f+evex512` - so currently Rust code cannot use AVX10.N-256.
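Concretely, the current mapping means code like the following sketch (the `avx512f` target feature is still unstable, and the function is hypothetical) cannot safely run on an AVX10.N-256 CPU:

```rust
// Rust's avx512f maps to LLVM's avx512f+evex512, so the compiler is
// free to use 512-bit registers anywhere in this function - even if
// the source only uses 128- and 256-bit intrinsics. Calling it on a
// CPU that supports only AVX10.1-256 is therefore unsound.
#[target_feature(enable = "avx512f")]
unsafe fn do_avx512f_things() {
    // ... 128/256-bit AVX512F intrinsics ...
}
```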
This proposal is to add AVX10.N-256 support to Rust and to discuss the complications. In this document, we propose two possible ways to add AVX10 to Rust.
## `evex512` Approach

This approach designates the target feature `evex512` for 512-bit vectors: if a function uses 512-bit vectors, it must have this target feature enabled.
For example:

```rust
#[target_feature(enable = "avx512f")]
fn do_256_bit_things() {
    // Only 128- and 256-bit instructions/intrinsics from AVX512F
    // (e.g. _mm256_reduce_add_pd) can be used here
}

#[target_feature(enable = "avx512f,evex512")]
fn do_512_bit_things() {
    // All instructions/intrinsics from AVX512F can be used here
}
```
This also changes the `avx512*` target features to mean only that the 128- and 256-bit versions of that CPU feature are available. So, to use 512-bit AVX512F intrinsics, a function must have both `avx512f` and `evex512` enabled (as shown in the example). In short, this approach changes the semantics of all `avx512*` target features and adds a new target feature, `evex512`, which indicates the presence of 512-bit instructions in the CPU.
This mechanism would mean that the target feature `avx10.1-256`, when introduced, will imply `avx512f`, `avx512vl`, etc., but not `evex512`. And the target feature `avx10.1-512` will imply `avx10.1-256` and `evex512`. This system essentially mirrors LLVM's solution to the issue.
A function with target feature `avx512f` will work on AVX512, AVX10.1-256, and AVX10.1-512 CPUs. But a function with target features `avx512f` and `evex512` will work only on AVX512 and AVX10.1-512 CPUs.
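Runtime dispatch under this approach could then look like the following sketch (illustrative only: `evex512` does not exist as a Rust target feature today, and `sum_512`/`sum_256` are hypothetical functions):

```rust
fn sum(data: &[f64]) -> f64 {
    if is_x86_feature_detected!("avx512f") {
        if is_x86_feature_detected!("evex512") {
            // AVX512 or AVX10.1-512 CPU: take the 512-bit path
            unsafe { sum_512(data) }
        } else {
            // AVX10.1-256 CPU: only 128/256-bit AVX512F instructions
            unsafe { sum_256(data) }
        }
    } else {
        // Scalar fallback
        data.iter().sum()
    }
}
```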
The `is_x86_feature_detected!` macro will also have to change, even though this means changing a stable API (AVX512 feature detection was stabilized in 1.27.0). It would behave according to the target feature:
```rust
// On an AVX512 or AVX10.1-512 CPU
println!("{}", is_x86_feature_detected!("avx512f")); // true
println!("{}", is_x86_feature_detected!("evex512")); // true

// On an AVX10.1-256 CPU
println!("{}", is_x86_feature_detected!("avx512f")); // true
println!("{}", is_x86_feature_detected!("evex512")); // false
```
Only functions with the target feature `evex512` will be able to use 512-bit vectors (even when not using AVX512 intrinsics), due to ABI constraints. This is similar to the existing restriction on 256-bit vectors with `avx`. This approach preserves parity with LLVM features and introduces just one new target feature. Using it should be fairly intuitive, as `evex512` simply means that 512-bit vectors exist.
The main counterargument to this proposal seems to be the loss of association between CPU features and Rust target features. Under it, a function with target feature `avx512f` can use only a subset of AVX512F instructions, which may be counterintuitive to programmers. Also, introducing this might break a lot of code in the ecosystem, as many crates use AVX512 to accelerate computation: many of their uses of AVX512 target features would have to be augmented with `evex512`, otherwise a SIGILL may result. This might still be acceptable because the AVX512 target features are unstable.
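For instance, an existing 512-bit kernel would have to change roughly like this under the proposal (a sketch; the function names are made up):

```rust
// Before: today, enabling avx512f turns on LLVM's avx512f+evex512,
// so 512-bit intrinsics are available.
#[target_feature(enable = "avx512f")]
unsafe fn kernel_old() { /* 512-bit intrinsics */ }

// After: under the evex512 approach, the same code must also enable
// evex512, or its 512-bit intrinsics would no longer be usable.
#[target_feature(enable = "avx512f,evex512")]
unsafe fn kernel_new() { /* 512-bit intrinsics */ }
```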
[Ralf: I would also mention that using `avx512f` as the name for something that indicates 256-bit operations is terrible naming. If everybody else jumps off a cliff (names something in a totally nonsensical way), do we follow, or do we pick sensible names instead?]
## `avx256` Approach

This approach adds 14 new target features:
- `avx256f`
- `avx256vl`
- `avx256bw`
- `avx256dq`
- `avx256ifma`
- `avx256vnni`
- `avx256vbmi`
- `avx256vbmi2`
- `avx256cd`
- `avx256vpopcntdq`
- `avx256vp2intersect`
- `avx256bitalg`
- `avx256bf16`
- `avx256fp16`
(There are other possible naming conventions, e.g. `avx512f-256`.) These will function like the corresponding `avx512*` target features, but only allow operations on vectors of up to 256 bits.
For example:

```rust
#[target_feature(enable = "avx256f")]
fn do_256_bit_things() {
    // Only 128- and 256-bit instructions/intrinsics from AVX512F
    // (e.g. _mm256_reduce_add_pd) can be used here
}

#[target_feature(enable = "avx512f")]
fn do_512_bit_things() {
    // All instructions/intrinsics from AVX512F can be used here
}
```
This doesn't change the semantics of any of the `avx512*` target features; it just adds, for each AVX512 feature, a new target feature restricted to 256-bit vectors. This mechanism would mean that the target feature `avx10.1-256`, when introduced, will imply `avx256f`, `avx256bw`, etc., but not `avx512f`. And the target feature `avx10.1-512` will imply `avx10.1-256`, `avx512f`, `avx512bw`, etc.
This creates a feature system parallel to the existing AVX512 one, but capped at 256-bit vectors. A function with target feature `avx256f` will work on AVX512, AVX10.1-256, and AVX10.1-512 CPUs. But a function with target feature `avx512f` will work only on AVX512 and AVX10.1-512 CPUs.
The `is_x86_feature_detected!` macro keeps its current behaviour, only gaining new behaviour for the `avx256*` target features:
```rust
// On an AVX512 or AVX10.1-512 CPU
println!("{}", is_x86_feature_detected!("avx256f")); // true
println!("{}", is_x86_feature_detected!("avx512f")); // true

// On an AVX10.1-256 CPU
println!("{}", is_x86_feature_detected!("avx256f")); // true
println!("{}", is_x86_feature_detected!("avx512f")); // false
```
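Runtime dispatch under this approach could look like the following sketch (illustrative only: `avx256f` is a proposed feature, and `sum_512`/`sum_256` are hypothetical functions):

```rust
fn sum(data: &[f64]) -> f64 {
    if is_x86_feature_detected!("avx512f") {
        // AVX512 or AVX10.1-512 CPU: take the 512-bit path
        unsafe { sum_512(data) }
    } else if is_x86_feature_detected!("avx256f") {
        // AVX10.1-256 CPU: 128/256-bit AVX512F instructions only
        unsafe { sum_256(data) }
    } else {
        // Scalar fallback
        data.iter().sum()
    }
}
```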
Only functions with an `avx512*` target feature will be able to use 512-bit vectors (even when not using AVX512 intrinsics), due to ABI constraints. This is similar to the existing restriction on 256-bit vectors with `avx`. In effect, the `avx512f` target feature will map to the LLVM features `avx512f+evex512`, and the `avx256f` target feature will map to the LLVM feature `avx512f`.
AVX512VL is somewhat different, as it contains only 128- and 256-bit instructions. Here, the only difference is that `avx512vl` implies `avx512f`, so it can use all AVX512F intrinsics as well as all AVX512VL intrinsics; `avx256vl`, on the other hand, implies only `avx256f`, so it can use only the 128- and 256-bit AVX512F instructions along with all AVX512VL intrinsics.
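In code, the VL pairing would look like this sketch (proposed features, not valid Rust today):

```rust
#[target_feature(enable = "avx256vl")] // implies avx256f only
fn vl_narrow() {
    // 128- and 256-bit AVX512F intrinsics, plus all AVX512VL intrinsics
}

#[target_feature(enable = "avx512vl")] // implies avx512f
fn vl_full() {
    // all AVX512F intrinsics, plus all AVX512VL intrinsics
}
```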
This approach is completely backward-compatible, and preserves the association between CPU features and Rust features: `avx512f` simply means the existence of AVX512F. The main criticism seems to be that adding this many target features can confuse users. Also, it creates target features that do not directly map to a CPU feature, which some people might object to.
Both GCC and Clang use an `evex512`-style solution, but they didn't break any code when they introduced it because of negative target features: their target attribute `avx512f` enables `evex512` by default, and functions that don't need 512-bit vectors can use `no-evex512` to disable it. This can't be done in Rust because negative target features are capable of causing ABI mismatches and unsoundness (e.g. #116573). So, we need to come up with our own solution to this.
Either approach would unblock the stabilization of the AVX512 target features and intrinsics, helping programmers leverage AVX512 capabilities from stable Rust. It has the potential to improve performance throughout the ecosystem, as CPUs supporting AVX512 (or at least the 256-bit variants of AVX512) are becoming commonplace. Also, in the future this would allow smooth integration of further AVX10 iterations (at the time of writing, AVX10.2 support has already landed in GCC).