Feature Name: f16_and_f128
Start Date: 2023-07-02
RFC PR: rust-lang/rfcs#3453
Rust Issue: rust-lang/rfcs#2629

Summary

This RFC proposes adding new IEEE-compliant floating point types f16 and f128 into the core language and standard library. We will provide a soft float implementation for all targets, and use hardware support where possible.

Motivation

The IEEE 754 standard defines many binary floating point formats. The most common of these types are the binary32 and binary64 formats, available in Rust as f32 and f64. However, other formats are useful in various uncommon scenarios. The binary16 format is useful for situations where storage compactness is important and low precision is acceptable, such as HDR images, mesh quantization, and AI neural networks.^[1] The binary128 format is useful for situations where high precision is needed, such as scientific computing contexts.

The proposal is to add f16 and f128 types in Rust to represent IEEE 754 binary16 and binary128 respectively. Having f16 and f128 types in the Rust language would make Rust an optimal environment for more advanced use cases. Unlike third-party crates, this enables the compiler to perform optimizations for hardware with native support for these types, allows defining literals for these types, and would provide one single canonical data type for these floats, making it easier to exchange data between libraries.

This RFC does not aim to cover the entire IEEE 754 standard: it does not include f256 or the decimal-float types. It also does not aim to add existing platform-specific float types such as x86's 80-bit double-extended-precision. This RFC makes no judgement on whether those types should be added in the future; that discussion is left to a future RFC.

Guide-level explanation

f16 and f128 are primitive floating-point types that can be used like f32 or f64. They always conform to the binary16 and binary128 formats defined in the IEEE 754 standard: f16 is always 16 bits wide, f128 is always 128 bits wide, the number of exponent and mantissa bits follows the standard, and all operations are IEEE 754-compliant.

let val1 = 1.0; // Default type is still f64
let val2: f128 = 1.0; // Explicit f128 type
let val3: f16 = 1.0; // Explicit f16 type
let val4 = 1.0f128; // Suffix of f128 literal
let val5 = 1.0f16; // Suffix of f16 literal

println!("Size of f128 in bytes: {}", std::mem::size_of_val(&val2)); // 16
println!("Size of f16 in bytes: {}", std::mem::size_of_val(&val3)); // 2

Every target should support f16 and f128, either in hardware or software. Most platforms do not have hardware support and therefore will need to use a software implementation.

All operators, constants, and math functions defined for f32 and f64 in core must also be defined for f16 and f128 in core. Similarly, all functionality defined for f32 and f64 in std must also be defined for f16 and f128.

Reference-level explanation

`f16` type

f16 consists of 1 bit of sign, 5 bits of exponent, 10 bits of mantissa. It is exactly equivalent to the 16-bit IEEE 754 binary16 half-precision floating-point format.
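To make the 1/5/10 bit layout concrete, the following is a minimal sketch that decodes a binary16 bit pattern into an f32 on stable Rust. The helper name binary16_bits_to_f32 is hypothetical and is not part of the API proposed here; it only illustrates how the sign, exponent (bias 15), and mantissa fields combine.

```rust
/// Decode an IEEE 754 binary16 bit pattern into an `f32`.
/// Hypothetical helper for illustration; not part of this RFC's API.
fn binary16_bits_to_f32(bits: u16) -> f32 {
    let sign = if bits >> 15 == 1 { -1.0_f32 } else { 1.0_f32 };
    let exponent = ((bits >> 10) & 0x1f) as i32; // 5 exponent bits, bias 15
    let mantissa = (bits & 0x3ff) as f32; // 10 explicit mantissa bits

    match exponent {
        // Zero and subnormals: no implicit leading 1, fixed scale of 2^-24.
        0 => sign * mantissa * (1.0 / 16_777_216.0),
        // All-ones exponent: infinity when the mantissa is zero, otherwise NaN.
        0x1f => {
            if mantissa == 0.0 {
                sign * f32::INFINITY
            } else {
                f32::NAN
            }
        }
        // Normal numbers: implicit leading 1 and a biased exponent.
        e => sign * (1.0 + mantissa / 1024.0) * 2.0_f32.powi(e - 15),
    }
}

fn main() {
    println!("0x3c00 -> {}", binary16_bits_to_f32(0x3c00));
    println!("0x7bff -> {}", binary16_bits_to_f32(0x7bff));
}
```

Every binary16 value is exactly representable in f32, so this decoding loses no information, which is consistent with the lossless From<f16> for f32 conversion listed in this section.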

The following traits will be implemented for conversion between f16 and other types:

impl From<f16> for f32 { /* ... */ }
impl From<f16> for f64 { /* ... */ }
impl From<bool> for f16 { /* ... */ }
impl From<u8> for f16 { /* ... */ }
impl From<i8> for f16 { /* ... */ }

Conversions to f16 will also be available with as casts, which allow for truncated conversions.
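The split between lossless From conversions and lossy as casts already exists between f32 and f64 on stable Rust. The sketch below uses that pair as a stand-in, on the assumption that f16 and f128 would follow the same pattern; the narrow helper is hypothetical.

```rust
/// Narrow an `f64` to `f32` with an `as` cast.
/// Hypothetical helper; f32/f64 stand in for the proposed types.
fn narrow(x: f64) -> f32 {
    // Rounds to the nearest representable f32; overflow becomes infinity.
    x as f32
}

fn main() {
    // Widening never loses information, so it is offered through `From`.
    let widened = f64::from(0.1_f32);
    assert_eq!(narrow(widened), 0.1_f32);

    // Narrowing can change the value, so it requires an explicit `as` cast.
    assert_ne!(f64::from(narrow(0.1)), 0.1_f64);

    // Values outside the target's range become infinity.
    assert!(narrow(1e40).is_infinite());
}
```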

f16 will generate the half type in LLVM IR. It is also equivalent to C++ std::float16_t, C _Float16, and GCC __fp16. f16 is ABI-compatible with all of these. f16 values must be aligned in memory on a multiple of 16 bits, or 2 bytes.

On the hardware level, f16 can be accelerated on RISC-V via the Zfh or Zfhmin extensions, on x86 with AVX-512 via the FP16 instruction set, on some Arm platforms, and on PowerISA via VSX on PowerISA v3.1B and later. Most platforms do not have hardware support and therefore will need to use a software implementation.
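Code that wants to know whether a hardware path is available can already query CPU features at run time on stable Rust. The sketch below checks for the x86 F16C extension, which provides half-precision conversion instructions. Using F16C here is an assumption made for illustration: this RFC names the AVX-512 FP16 instruction set, not F16C, and the helper half_conversion_backend is hypothetical.

```rust
/// Report which kind of half-precision conversion path is available.
/// Hypothetical helper; the "f16c" check is an illustrative assumption.
fn half_conversion_backend() -> &'static str {
    // The detection macro only exists on x86 targets, so guard it with cfg.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("f16c") {
            return "hardware";
        }
    }
    // Every other case falls back to a software implementation.
    "software"
}

fn main() {
    println!("half-precision conversions: {}", half_conversion_backend());
}
```

A check like this only selects between code paths; it does not change how the compiler lowers ordinary arithmetic on the type.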

`f128` type

f128 consists of 1 bit of sign, 15 bits of exponent, 112 bits of mantissa. It is exactly equivalent to the 128-bit IEEE 754 binary128 quadruple-precision floating-point format.
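The 1/15/112 bit layout can be illustrated by splitting a raw 128-bit pattern into its fields. This is a minimal sketch on stable Rust that works on u128 bit patterns; the type Binary128Fields and the function split_binary128 are hypothetical names, and the exponent bias of 16383 comes from the IEEE 754 binary128 definition.

```rust
/// Fields of an IEEE 754 binary128 bit pattern (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct Binary128Fields {
    negative: bool,
    biased_exponent: u16, // 15 bits, bias 16383
    mantissa: u128,       // 112 explicit bits
}

/// Split a raw binary128 bit pattern into sign, exponent, and mantissa.
fn split_binary128(bits: u128) -> Binary128Fields {
    Binary128Fields {
        negative: (bits >> 127) == 1,
        biased_exponent: ((bits >> 112) & 0x7fff) as u16,
        mantissa: bits & ((1u128 << 112) - 1),
    }
}

fn main() {
    // 1.0 has a biased exponent equal to the bias and an all-zero mantissa.
    let one = 0x3fff_u128 << 112;
    println!("{:?}", split_binary128(one));
}
```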

The following traits will be implemented for conversion between f128 and other types:

impl From<f16> for f128 { /* ... */ }
impl From<f32> for f128 { /* ... */ }
impl From<f64> for f128 { /* ... */ }
impl From<bool> for f128 { /* ... */ }
impl From<u8> for f128 { /* ... */ }
impl From<i8> for f128 { /* ... */ }
impl From<u16> for f128 { /* ... */ }
impl From<i16> for f128 { /* ... */ }
impl From<u32> for f128 { /* ... */ }
impl From<i32> for f128 { /* ... */ }
impl From<u64> for f128 { /* ... */ }
impl From<i64> for f128 { /* ... */ }

Conversions from i128/u128 to f128 will also be available with as casts, which allow for truncated conversions.
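Which integer types get a From impl can be read off the significand width: an unsigned integer converts exactly when its bit width does not exceed the explicit mantissa bits plus the implicit leading bit. The sketch below encodes that rule in a hypothetical helper, unsigned_fits_exactly, and assumes the exponent range is wide enough, which holds for the combinations listed in this RFC.

```rust
/// Returns true if every value of an unsigned integer with `int_bits` bits
/// is exactly representable in a binary float with `explicit_mantissa_bits`
/// stored mantissa bits. Hypothetical helper; assumes the exponent range
/// of the float is large enough for the integer's maximum value.
fn unsigned_fits_exactly(int_bits: u32, explicit_mantissa_bits: u32) -> bool {
    // The implicit leading 1 adds one bit of precision.
    int_bits <= explicit_mantissa_bits + 1
}

fn main() {
    // f16 has 10 explicit bits: u8 fits, u16 does not.
    println!("u8 -> f16:    {}", unsigned_fits_exactly(8, 10));
    println!("u16 -> f16:   {}", unsigned_fits_exactly(16, 10));
    // f128 has 112 explicit bits: u64 fits, u128 does not.
    println!("u64 -> f128:  {}", unsigned_fits_exactly(64, 112));
    println!("u128 -> f128: {}", unsigned_fits_exactly(128, 112));

    // The same effect is visible today with u64 and f64 (52 explicit bits):
    // 2^53 + 1 is not representable and rounds to 2^53.
    assert_eq!(((1u64 << 53) + 1) as f64, (1u64 << 53) as f64);
}
```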

f128 will generate the fp128 type in LLVM IR. It is also equivalent to C++ std::float128_t, C _Float128, and GCC __float128. f128 is ABI-compatible with all of these. f128 values must be aligned in memory on a multiple of 128 bits, or 16 bytes.
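For comparison, a crate without a primitive type can only approximate these layout requirements with explicit repr attributes. The sketch below shows storage wrappers with the sizes and alignments stated in this RFC; the type names F128Storage and F16Storage are hypothetical, and the alignment values are taken from the text above rather than from any particular platform ABI.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical crate-side storage for a 128-bit float:
/// 16 bytes of payload, forced to 16-byte alignment.
#[repr(C, align(16))]
#[derive(Clone, Copy)]
struct F128Storage([u8; 16]);

/// Hypothetical crate-side storage for a 16-bit float:
/// 2 bytes of payload, forced to 2-byte alignment.
#[repr(C, align(2))]
#[derive(Clone, Copy)]
struct F16Storage([u8; 2]);

fn main() {
    println!(
        "F128Storage: size {}, align {}",
        size_of::<F128Storage>(),
        align_of::<F128Storage>()
    );
    println!(
        "F16Storage: size {}, align {}",
        size_of::<F16Storage>(),
        align_of::<F16Storage>()
    );
}
```

Matching size and alignment does not by itself give a wrapper the same calling convention as a float type, which is one reason a primitive is preferable for FFI.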

On the hardware level, f128 can be accelerated on RISC-V via the Q extension, on IBM S/390x G5 and later, and on PowerISA via BFP128, an optional part of PowerISA v3.0C and later. Most platforms do not have hardware support and therefore will need to use a software implementation.

Drawbacks

While f32 and f64 have very broad support in most hardware, hardware support for f16 and f128 is more niche. On most systems software emulation will be required. Therefore, the main drawback is implementation difficulty.

Rationale and alternatives

There are some crates aiming for similar functionality:

f128 provides binding to the __float128 type in GCC.
half provides an implementation of binary16 and bfloat16 types.

However, besides the disadvantage of usage inconsistency between primitive types and types from a crate, there are still issues around those bindings.

The ability to accelerate additional float types depends heavily on the CPU, OS, ABI, and features of each target. As LLVM evolves, it may become possible to accelerate these types on new targets. Implementing them in the compiler allows the compiler to perform optimizations for hardware with native support for these types.

Crates may define their type on top of a C binding, but extended float type definition in C is complex and confusing. The meaning of C types may vary by target and/or compiler options. Implementing f16 and f128 in the Rust compiler helps to maintain a stable codegen interface and ensures that all users have one single canonical definition of 16-bit and 128-bit float types, making it easier to exchange data between crates and languages.

Prior art

As noted above, there are crates that provide these types, one for f16 and one for f128. Another piece of prior art is RFC 1504 for int128.

Many other languages and compilers have support for these proposed float types. As mentioned above, C has _Float16 and _Float128, and C++ has std::float16_t and std::float128_t. Glibc supports 128-bit floats in software on many architectures.

This RFC is designed as a subset of RFC 3451, which proposes adding a variety of float types, including ones not in this RFC designed for interoperability with other languages.

Both this RFC and RFC 3451 are built upon the discussion in issue 2629.

The main consensus of the discussion thus far is that more float types would be useful, especially the IEEE 754 types proposed in this RFC as f16 and f128. Other types can be discussed in a future RFC.

Unresolved questions

The main unresolved parts of this RFC are the implementation details in the context of the Rust compiler and standard library. The behavior of f16 and f128 is well-defined by the IEEE 754 standard, and is not up for debate. Whether these types should be included in the language is the main question of this RFC, which will be resolved when this RFC is accepted.

Several future questions are intentionally left unresolved, and should be handled by another RFC. This RFC does not have the goal of covering the entire IEEE 754 standard, since it does not include f256 and the decimal-float types. This RFC also does not have the goal of adding existing platform-specific float types such as x86's 80-bit double-extended-precision.

Future possibilities

See RFC 3451 for discussion about adding more float types. RFC 3451 is mostly a superset of this RFC.

Design meeting minutes

Attendance: tmandry, TC, eholk, yosh, waffle, Trevor Gross

Question: How widespread is hardware support?

eholk: The RFC says we'd provide softfloat versions if the hardware doesn't support f16 and f128. Do x86_64 and arm64 support these? Do we want a way to require hardware support so we don't get accidental performance traps?

tmandry: The section for the individual types describes it. It seems pretty sparse.

TC: Slow is relative. LLVM routines can help make it as fast as it can be.

yosh: Clearly also, for platforms that do support it in hardware, it will be faster.

waffle: Part of the advantage for f16 in particular is its compactness rather than just its speed.

eholk: Would a crate's naive implementation be much worse than LLVM softfloat?

tmandry: If you are going to implement a crate to do this, you could add inline asm to the crate. I can still see LLVM being significantly faster, since it has to treat inline asm as opaque.

eholk: Maybe the people who have the use-case for these types are more likely to be running on hardware that has this built in.

TC: +1

yosh: The ML ecosystem relies heavily on f16.

TC: The f16 type is probably more heavily used these days in image processing and other fields. ML has moved strongly toward bf16 which is included in the other related RFC. The balance between mantissa and exponent bits in bf16 works better for ML models, and there are hardware advantages as bf16 is simply a truncation of f32.

Comment: more prior art

yosh: decimal128 was needed to correctly implement MongoDB's BSON (Binary JSON) serialization format. (RustConf 2019 talk).

yosh: It's difficult to get this right in a crate. I know someone who tried. It's hard.

TC: Maybe we could say that it's hard, so we should probably just do it once, in the compiler, the right way, so that people don't have to repeat this work.

yosh: I do believe that.

late note from trevor: this talks about decimal128 but the focus is on binary128. The sentiment is still the same though.

Question: Does alignment vary by platform?

f16 values must be aligned in memory on a multiple of 16 bits, or 2 bytes.

f128 values must be aligned in memory on a multiple of 128 bits, or 16 bytes.

eholk: Is this universal for f16 and f128 on all platforms, or can this vary?

trevor: f128 can be platform dependent but I believe LLVM does the correct thing unlike i128. It has the alignment that i128 should.

Question: Do vector extensions even count for "normal" usage?

tmandry: The doc mentions that f16 is supported under AVX512, but these seem to all be vector instructions? Are those even useful at all for scalar values (unless a loop has been auto-vectorized)?

eholk: I think these days compilers use the vector instructions for floating point even when working on single values.

waffle: It would make sense for the compiler to use vector instructions even for single values in case it's faster than softfloats.

TC: Even when doing the vectorization by hand, you have to load things in and out of the vector registers. It's a pain not to have a primitive for a single value of that type when doing that. Without that, e.g., after doing all the things you need to be fast with intrinsics, you end up having to do a bunch of bit manipulation just to pull out the result and print it when you don't even need that part to be fast.

TC: Also, of course, you're hoping that even if you do write a simple loop with f16 that the compiler will kick in and autovectorize it using the available vectorized instructions on the platform. That's only possible if the primitive type exists.

waffle: The autovectorization probably wouldn't work as well with an external crate, right?

Question: Given sparse hardware acceleration, should these types be built in? What's the motivation?

tmandry: It seems like 95%+ of our users are not getting hardware acceleration. At that point the only motivation I can see is having built-in support for converting to the binary representation, which doesn't seem that compelling to me. Who's being blocked by not having these types?

eholk: The RFC does say LLVM can optimize better if we generate 16-bit and 128-bit types that it knows about, but I'm not sure how much better or how much it matters if you have to do a f128 multiply in software.

waffle: To me it seems like the main motivation is to have "standardized" types:

[…] ensures that all users have one single canonical definition of 16-bit and 128-bit float types, making it easier to exchange data between crates and languages.

trevor: On quite a few targets for C, long double is already f128. So we don't have a good way to interface for some things.

yosh: f16/f128 are supported in C23. In order to carry these types over the FFI boundary, having a primitive for these would be helpful.

TC: There are philosophical questions here about what Rust is or should be. How big is our tent? How much paternalism do we want in language design? We're discussing these same questions now regarding the AFIT/RPITIT stabilization.

TC: As applied here, it asks, "do we want to let people use Rust to write code for e.g. GPUs?" Most users won't do that, but some will, and those people will get value from it. On paternalism, the question is, "how much should we worry that some people might misuse something that we stabilize (even if others will use it correctly and get value from it)?"

yosh: I look at the upsides / downsides. The downsides are small. But for certain niches, it can be really important.

TC/waffle: +1

TC: The upsides are big for the people that need it and the downsides are small for the people who don't use it.

eholk: Performance predictability can be important. We could add a config flag.

TC: +1, no objection to a config flag.

waffle: f16 and f128 are niche enough that users are likely to be aware of the caveats on performance.

tmandry: Performance predictability is important.

yosh: some of these flags already exist in std. For example: is_x86_feature_detected!.

yosh: We're not lacking the tools to conditionally support numeric types.

TC: Part of the question here on how to move forward is whether we want to block this RFC on having the full story on those tools ready.

eholk: That's an important point. These are orthogonal. We could stabilize this, and if it becomes a problem, we can stabilize the other things.

yosh: It's also possible to be too careful. Sometimes we go looking for problems that may not really be a problem. Niko wrote a blog post about this recently.

trevor: We could document that these types might not be as fast as others.

tmandry: On balance, I think this feature makes sense to land. Not sure I agree that if we did block this feature that we would be moving too slowly. The big question is whether some set of users would get value out of this. The RFC could benefit from laying out the users who would get value out of this. We can address with documentation the problem of predictability.

yosh: Agreed

TC: One problem is that if a tool doesn't support a thing, people will use other tools, so it can be hard to find existing users of the tool who would benefit. I.e., in this case, people who need this, e.g. to write code for GPUs, would keep using C and C++, so it would be difficult to find Rust users.

tmandry: Agree with that.

Question: Are existing crates ABI-compatible with C/C++/LLVM?

waffle: Is it possible to make sure that the ABI is compatible for existing types in C/C++/LLVM given implementation in an external crate? (Asking this question because function ABI can depend on types; see Ralf's new findings with broken ABIs.)

tmandry: I think it has to be built in if it's going to affect what register it's passed in based on the type.

waffle: If we don't have a primitive, then there's nothing to wrap around.

trevor: u16 and f16 and u128 and f128 are supposed to have the same ABI on all targets.

yosh: Gankra's ABI Cafe found some fun incompatibilities in compilers for int128. This might be a good tool to validate compiler output if we're worried about that.

trevor: We should find a way to pull that into regular CI testing.

trevor: Because we punt to LLVM on this, we should be compatible with the C/C++ types.

Question: Should we make them available based on target / compiler flags, or lint in cases where they are unoptimized?

tmandry: We actually already discussed this above.

Checking boxes

TC: I haven't looked, tmandry, did you already check your box?

tmandry: Yes, my box is already checked. I still think the motivation could be written better, but I've checked off.

TC: Then it looks like we've done what we can in this meeting given the attendance to move this forward.

(The meeting ended here.)

[1] Existing AI neural networks often use the 16-bit brain float format (bfloat16), a truncated version of 32-bit single precision, instead of 16-bit half precision. This allows operations to be performed with 32-bit floats and the results to be quickly converted to 16-bit for storage.
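The truncation described in this note can be sketched in a few lines of stable Rust: the brain float bit pattern is simply the upper 16 bits of the f32 pattern. The helper names below are hypothetical, and plain truncation is an assumption made for simplicity; production converters typically round rather than truncate.

```rust
/// Truncate an `f32` to brain-float bits by keeping its upper 16 bits.
/// Hypothetical helper; real converters usually round instead.
fn f32_to_bf16_bits_truncate(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16
}

/// Widen brain-float bits back to `f32` by zero-filling the lower 16 bits.
fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

fn main() {
    let pi = std::f32::consts::PI;
    let stored = f32_to_bf16_bits_truncate(pi);
    // Only the top 7 explicit mantissa bits survive the round trip.
    println!("{} -> {:#06x} -> {}", pi, stored, bf16_bits_to_f32(stored));
}
```

Because the sign and the full 8-bit exponent are preserved, the round trip keeps the range of f32 and only discards low-order mantissa bits.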