
Motivation

It is sometimes useful to perform loads past the end of an object, where the out-of-bounds bytes end up not being used, and there are hardware-specific guarantees that such a load will not trap, e.g., because it does not cross a page boundary.

For example, to handle tail bytes in a memchr implementation, I might want to perform a wide load that covers some out-of-bounds bytes, which later get masked off.

However, I believe that currently, there is no well-defined way to actually achieve this on the LLVM IR level. Using plain loads for this is UB (even if it may usually work out in practice, and I'm sure plenty of C code just does that).

The closest we have are masked vector loads (llvm.masked.load), but these provide guarantees that are too strong: for them, our lowering has to ensure that the masked-off bytes will not produce a trapping access, rather than this being a promise by the programmer. As such, if there is no native masked load support, they need to be scalarized, with conditional accesses.
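For concreteness, a masked load of this shape (a hypothetical sketch using the existing llvm.masked.load intrinsic) only counts the enabled lanes as accessed, but the lowering still has to guarantee that the disabled lane cannot fault:

; Hypothetical sketch: load 3 of 4 i32 lanes, with the last lane masked off.
; Without native masked load support, this has to be scalarized into
; conditional accesses so that the masked-off lane can never fault.
%v = call <4 x i32> @llvm.masked.load.v4i32.p0(ptr %p, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 false>, <4 x i32> poison)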

Proposal

The proposal is to introduce an overloaded llvm.load.out.of.bounds intrinsic (name subject to discussion), with the following signature and semantics:

declare i128 @llvm.load.out.of.bounds.i128.p0.i64(ptr nocapture noundef readonly %p, i64 immarg %defined_size) memory(argmem: read)

Generalizing i128 to an arbitrary type %T with store size T_size, the immarg %defined_size must be greater than zero and less than or equal to T_size; otherwise, the IR is malformed.

A physical load of size T_size at pointer %p must not trap (based on some target-specific guarantee); otherwise, the behavior is undefined.

From the perspective of the aliasing and memory model, the load will only access the first %defined_size bytes. The return value will be of type %T, with the first %defined_size bytes matching the result of an ordinary load of that size. The remaining T_size - %defined_size bytes will be undef.

The alignment of %p can be specified using the align call-site attribute, as usual. By default, an unaligned pointer (align 1) is assumed.
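As a concrete (hypothetical) example using the proposed intrinsic, a call that physically reads 16 bytes from an 8-byte-aligned pointer, of which only the first 10 bytes are considered accessed, could look like this:

; Physically reads 16 bytes starting at %p. Only the first 10 bytes
; participate in the memory model; the remaining 6 result bytes are undef.
%v = call i128 @llvm.load.out.of.bounds.i128.p0.i64(ptr align 8 %p, i64 10)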

Lowering

On the machine code level, this intrinsic should get lowered to a plain non-masked load (or possibly multiple loads, depending on size/alignment restrictions). The %defined_size parameter can be ignored for lowering purposes.
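Conceptually, the call from the previous example would then lower the same way as a plain full-width load (a sketch, assuming no size or alignment legalization is needed):

; %defined_size is dropped; the machine-level access covers all 16 bytes.
%v = load i128, ptr %p, align 8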

I hope it would be possible to lower this to a plain load already on the SDAG level, but I'm not familiar with what kind of object-size-based assumptions we still make at that level. Feedback from someone more familiar with the backend would be appreciated.

Why the %defined_size parameter?

At a high level, the idea behind this intrinsic is that we want any loaded bytes past the end of "the memory we're allowed to access" to be undef. However, this notion is not really well-defined and isn't as simple as the bounds of the underlying allocated object once provenance considerations come into play.

Consider the following example and assume that p2 == p + 4:

define i64 @test(ptr noalias %p, ptr noalias %p2) {
  store i32 0, ptr %p2
  %v = call i64 @llvm.load.out.of.bounds.i64.p0.i64(ptr %p, i64 4)
  ret i64 %v
}

With the %defined_size parameter this function is clearly well defined, because only the first 4 bytes of the load interact with the memory model, and as such there is no noalias violation with the overlapping store to %p2.

Without the %defined_size parameter, it's unclear what the semantics of this code are supposed to be, especially once we consider that the write to %p2 might be from another thread, not part of the same function.