[RFC] A Provenance Model for LLVM

# [RFC] A Provenance Model for LLVM ## Introduction Pointer provenance refers to the notion that pointers are not merely an integral address: They also carry (virtual) information about which memory they are allowed to access. Provenance is the reason why performing an out-of-bounds access, or a use-after-free, is always undefined behavior -- even in cases where the physical memory access does not trap. LLVM has always had a concept of provenance (often referred to with terms such as "underlying object" or "based on"). However, any parts of the provenance model that relate to casts between pointers and integers, or type punning between pointers and integers, are either unspecified, or specified in a way that is known to be incorrect. This makes it hard/impossible to reason about the correctness of certain transforms, and results in miscompilations. The goal of this RFC is to specify a provenance model for LLVM. The proposal is split into three parts: First, a description of the provenance model. Second, rationale for certain choices, their implications, as well as possible alternatives. Third, which changes will be necessary to LLVM's current implementation to comply with the proposed model. This RFC makes reference to the [byte type][byte_type] as an important ingredient in the provenance model, but does not cover it in detail (esp. omitting operations on the byte type), as these will be covered by a separate RFC. ## Model > This is a concise description of the provenance model, not intended as normative wording. Pointers in LLVM consist of multiple components: * An integral address. * Optional non-address bits and optional external state. * A provenance, which describes which memory it has the permission to access. The provenance may be absent, also known as nullary provenance. The access permission granted by the pointer provenance has two primary components: * Spatial: The set of memory addresses that the pointer is allowed to access. * Temporal: The timespan during which the pointer is allowed to access those memory addresses. The spatial component of the provenance is not allowed to span across multiple allocated objects. Provenance is a virtual property, which is not retained when lowering to hardware. While some architectures like CHERI have a notion of "capabilities" that is similar to provenance, they are not equivalent to LLVM provenance. The basic rules for determining the provenance of a pointer value are as follows. **Base cases:** * The pointer returned by an allocation has provenance to accesses all addresses that are part of the allocated object for the lifetime of that object. This applies to all allocated objects, including allocas, global variables and the return value of allocation functions. * The null pointer has nullary provenance in the default address space, and in absence of the `null_pointer_is_defined` attribute. Otherwise, it acts like `inttoptr(0)`. * The undef/poison pointers always have nullary provenance. **Derived cases:** * The return value of `getelementptr` has the same provenance as its pointer argument. This holds regardless of whether the `getelementptr` is `inbounds` or not. * Instructions that merely copy values (`select`, `phi`, `insertvalue`, `extractvalue`, `insertelement`, `extractelement`, `shufflevector`, function return/argument) retain the provenance of the copied value. Provenance is tracked per-member and per-lane for structs/vectors. * `bitcast` and `addrspacecast` retain the provenance of their argument. * Semantics of `ptrtoint`, `inttoptr`, `load` and `store` are discussed below. **Provenance-checking operations:** * If memory is accessed using a pointer which does not have provenance to access the address range, the behavior is undefined. * `getelementptr inbounds` with a non-zero offsets results in poison if either the pointer has nullary provenance, or the address arithmetic falls outside the spatial provenance of the pointer. The temporal provenance is disregarded. ### Provenance exposure Integers do not have provenance. The `ptrtoint` operation *exposes* the provenance of the pointer argument. The `ptrtoaddr` and `icmp` operations do not. The `inttoptr` operation picks a provenance from either: * The set of previously exposed provenances. * Memory that is outside the control of the LLVM abstract machine, such as MMIO registers. This memory must be disjoint from all allocated objects known to LLVM. The provenance is picked using angelic non-determinism. That is, some provenance (from the above set) that makes the resulting program execution well-defined will be chosen. If no such provenance exists, the behavior is undefined. Exposure is not considered part of the visible behavior of the program. Some architectures like CHERI may not return pointers with nullary provenance from `inttoptr`. ### Type punning Each byte in memory consists of an integral value and provenance (which may be nullary). For the purpose of this document, we'll disregard special memory values like undef or poison as irrelevant to the provenance model. * Storing a pointer stores the provenance of the pointer (in all stored bytes). * Storing an integer (or other non-pointer value) stores nullary provenance. * When loading a pointer, if all bytes have the same provenance, return a pointer with that provenance. Otherwise, return a pointer with nullary provenance. * When loading an integer, ignore any provenance associated with the loaded bytes. * Loading or storing a [byte type][byte_type] will represent the contents of memory exactly, including byte-wise provenance, without any exposure effects. * `memcpy` performs the copy via the byte type, i.e. preserves provenance precisely. Notably, loading an integer from a location that has provenance does not expose that provenance. Loading a pointer from a location that has no provenance does not recover a previously exposed provenance. ## Rationale and Implications ### Integer provenance The pointer aliasing rules in LangRef currently make this statement: > A pointer value formed by an inttoptr is based on all pointer values that contribute (directly or indirectly) to the computation of the pointer’s value. This implies that provenance is tracked through integers, with an unspecified "contribution" mechanism. This kind of provenance model is generally considered non-viable. The problem with making integers carry provenance is that many desirable transforms become illegal. For example, it may be invalid to optimize `x * 0` to `0`, because this would lose the contribution of `x` towards the provenance of the result. It also makes replacements of `x` with `y` under a dominating condition of `x == y` invalid, because the integers may have different provenance (and indeed, this replacement is invalid for pointers, unless certain conditions are met). The proposed provenance model does not give integers provenance, and is modeled around expose instead. This matches, at least at a high level, the provenance models adopted by Rust and C. This also means that there does not have to exist any kind of "dependence" between a `ptrtoint` exposing a provenance and a `inttoptr` recovering it. Consider the following simple example: ```llvm %i = ptrtoint ptr %p to i64 %a = ptrtoaddr ptr %p to i64 %p2 = inttoptr i64 %a to ptr ``` The pointer `%p2` here may have the provenance of `%p`, even though the `inttoptr` operation is performed on the result of `ptrtoaddr`, which explicitly does not expose provenance. However, the provenance of `%p` was already exposed by a preceding `ptrtoint` operation. The fact that the return value of that `ptrtoint` is not used is irrelevant. ### ptrtoint and inttoptr optimization implications Most of the optimization implications for `ptrtoint` and `inttoptr` can be derived by modeling `ptrtoint` as having a `memory(exposed: write)` effect and `inttoptr` as having a `memory(exposed: read)` effect, where `exposed` is a new inaccessible memory location. In particular: * `ptrtoint` with unused result cannot be trivially DCEd, because this would drop the exposure effect. (`inttoptr` can be trivially DCEd.) * `ptrtoint` and `inttoptr` cannot be freely moved. For example, we generally cannot move a `inttoptr` before a `ptrtoint` instruction, as this may reduce the set of exposed pointers at the time the `inttoptr` executes. However, some optimizations are still possible, beyond what may be implied by the basic effect description. A `ptrtoint` can generally be moved upwards, even across `inttoptr`. It is legal to expose a provenance earlier than in the original program. Similarly, an `inttoptr` can be moved downwards, even across `ptrtoint`. This increases the possible provenances it may recover. A `ptrtoint` with unused result can be changed to use a different pointer argument that has the same provenance. For example `ptrtoint(ptradd(p,i))` can be converted to `ptrtoint(p)`. Multiple `ptrtoint` operations with the same argument can be CSEd into one `ptrtoint` operation. (For CSE of `inttoptr` see the multi-object provenance section.) Round-trip casts of the form `inttoptr(ptrtoint(p))` generally cannot be folded to `p`, as the resulting pointer may have a different provenance. `ptrtoint(inttoptr(i))` round-trips can be folded to `i` and DCEd, as `ptrtoint` here could only expose an already-exposed provenance. ### Address-nondeterminism and ptrtoint exposure elision While unused `ptrtoint` can not be DCEd in general, provenance exposure of objects with a never observed, non-deterministic address can be elided. Consider the following example and for simplicity assume no other exposed pointers exist: ```llvm %a = alloca i8 ptrtoint ptr %a to i64 %p = inttoptr i64 u0xdeadbeef to ptr store i8 0, ptr %p ``` The program execution will be well-defined if and only if the `alloca` happens to be allocated at address `0xdeadbeef`. As the address of the `alloca` is not observed, we can make the assumption that the `alloca` will be allocated a different address, in which case the program is always UB, and the provenance exposure can be elided. It is worth noting that address non-determinism reasoning in general is [somewhat problematic][twinsem] in the presence of finite memory, but this is a general problem with all address non-determinism transforms (which LLVM does perform), and not specific to this case. ### Angelic non-determinism This proposal specifies the semantics of `inttoptr` with an angelically non-determistic choice from the set of exposed pointers. A way to look at this from the perspective of an IR interpreter is that the result of `inttoptr` is an abstract pointer associated with the set of all previously exposed provenances, and that set then collapses to a single provenance once it is used in an operation with a provenance restriction (such as a memory access). Once that has happened, it is not possible for a later operation to pick a different provenance from the original set. There are some alternatives to this choice. One is to limit `inttoptr` to chose from a smaller set of pointers. For example, one could limit `inttoptr` to only recover the provenance of an object that the address is part of. This would limit the number of possible provenances, but there would still generally be more than one. This is in part because the address at the end of an object may coincide with the address at the start of another object. And in part because there may be various notions of derived provenance (like subobject provenance, noalias provenance, etc.) As such, something like angelic choice would still be necessary to chose between the remaining options. And advantage of limiting the provenance set in this way is that `inttoptr(0)` would always return a nullary provenance, and as such be equivalent to `null`. LLVM currently treats these as equivalent, but under the proposed model they would not be, as `inttoptr(0)` may recover a different, non-nullary provenance. A downside of limiting the provenance set is that it is no longer legal to introduce a `inttoptr(ptrtoint(p))` round-trip cast for an arbitrary pointer. This would only be valid for pointers that are known to be in bounds of the allocated object they are based on. ### Multi-object provenance and inttoptr CSE Another alternative to angelic choice is to give the result of `inttoptr` the provenance to access *all* previously exposed objects. This would introduce two substantially different kinds of pointers. "Logical" pointers only have provenance to access a single object. "Physical" pointers may access many. To support the merging of multiple adjacent accesses in such a model, a single memory access would have to be able to access multiple allocated objects (as long as the pointer has provenance for all of them). Similarly, in order to support merging a sequence of multiple `getelementptr inbounds` in sequence, `inbounds` would have to allow crossing multiple allocated objects -- however, at that point it would no longer be capable of providing the no-wrap guarantees that are the primary purpose of the `inbounds` flag. Alternatively, if `inbounds` refers to a single object only, `inbounds` would have to always be dropped when merging `getelementptr`s. However, while a single-provenance model avoids these complications, it comes with the downside that two `inttoptr` instruction with the same argument cannot be freely CSEd. Consider this example: ```llvm %p1 = inttoptr i64 %i to ptr %p2 = inttoptr i64 %i to ptr store i8 0, ptr %p1 %gep = getelementptr i8, ptr %p2, i64 -1 store i8 0, ptr %gep ``` Assume that `%i` is the address at the end of one allocated object and the start of another, and that the provenance of both objects has been exposed. In that case, the program is well-defined, as `%p1` will have the provenance of one object and `%p2` the provenance of the other. However, if we CSE the instructions: ```llvm %p = inttoptr i64 %i to ptr store i8 0, ptr %p %gep = getelementptr i8, ptr %p, i64 -1 store i8 0, ptr %gep ``` Then the behavior will be undefined, as there is no possible (single-object) provenance for `%p` which would allow access to both allocated objects. ### Non-exposing type punning While `inttoptr` and `ptrtoint` are defined in terms of exposed provenance, in-memory type punning between integers and pointers is not. Storing a pointer and then loading it as an integer does not result in provenance exposure. The reason for this choice is that otherwise integer loads could no longer be trivially DCEd, without proving that the loaded bytes do not carry provenance. It is worth noting that in a model where integer loads may expose provenance, we cannot DCE integer loads even in the presence of integer TBAA: ```c uintptr_t x; void *p = /* ... */; memcpy(&x, &p, sizeof(uintptr_t)); *x; ``` Even though the load `*x` has integer TBAA, the memory it loads from has provenance, and as such the load needs to expose that provenance. Unfortunately, non-exposing type punning is not fully compatible with the current version of the [C provenance TS][c_provenance_ts] (at least without making all C loads go through the byte type). Ideally, the type punning semantics in the provenance TS would be adjusted to be non-exposing in a future revision. LLVM may choose to continue using ptrtoint/inttoptr instructions when performing store to load forwarding as a matter of QoI, as this makes the program more defined. ### Provenance monotonicity An important principle the provenance model is intended to follow is that of provenance monotonicity. This means that adding provenance to a pointer or byte in memory that previously had nullary provenance, should never make the program less well-defined. There should be no operations for which a pointer/byte with provenance will cause UB or return poison where a pointer/byte with nullary provenance would not. Apart from being a nice conceptual property, provenance monotonicity ensures that it's possible to eliminate integer load store pairs like the following: ```llvm store ptr %p1, ptr %p %v = load i64, ptr %p store i64 %v, ptr %p %p2 = load ptr, ptr %p ; transformed to store ptr %p1, ptr %p %p2 = load ptr, ptr %p ``` In the original program, `%p2` will have nullary provenance, because it is reading the result of an integer store. If the integer load/store pair is optimized away, the memory will instead have the provenance of `%p1`, which may be non-nullary. Provenance montonicity ensures that this can only make the program more well-defined. ### Derived provenance It is possible to derive a new provenance which has a subset of the permissions of the original provenance. This proposal does not attempt to specify these in any detail, but merely points out some cases that likely should be modeled using derived provenance: * Subobject provenance: A derived provenance that restricts access to a sub-object of a larger object. The `getelementptr inrange` modifier can be modeled in this fashion, as well as the proposed [memory region intrinsic][memory_region_intrinsic]. * `captures` provenance: A pointer with `captures(address)` can be modeled with a derived provenance which becomes invalid on return from the function. Similarly, `captures(address, read_provenance)` can be modeled as a derived provenance that only allows read-only accesses after return from the function. * `noalias`: This should be modeled as *some* kind of derived provenance -- how to do this is left as an exercise for the reader. The existence of derived provenance is important insofar as it implies that there can always be multiple provenances for a given address, even outside the object boundary edge case. It also means that on CHERI architectures, two pointers with equal "capabilities" do not necessarily have the same provenance as far as the LLVM abstract machine is concerned. ### Pointer comparisons Pointer `icmp` does not expose or inspect provenance. Both sides of the comparison may have different provenance and may point into different objects. (The semantics of `icmp` are not changed by this proposal and only discussed for completeness.) A consequence of this is that a dominating `p1 == p2` condition can generally not be used to replace pointer `p1` with `p2` in the dominated code. This requires that either `p1` and `p2` have the same provenance, or that the provenance does not matter at the point of replacement (e.g. in the operand of another `icmp`). It is not possible to define an `icmp` variant that also compares provenance, because provenance is virtual state that is only tracked at the abstract machine level, but not in hardware. It would be possible to define an `icmp` variant that returns poison (or a non-deterministic result) if both pointers do not point into the same object and/or don't have compatible provenance, though we don't currently have a use-case for this. ### Constant expressions The `ptrtoint` and `inttoptr` instructions can also occur as constant expression. This poses a challenge because: * Constant expressions do not have a specific point where they are evaluated, so it's unclear what the set of exposed addresses at the point of an `inttoptr` expression are. * Unused constant expressions are assumed to be freely removable, which means that unused `ptrtoint` expression may be eliminated, losing their exposure side effect. While many other constant expressions have been removed in recent years, the `ptrtoint`/`inttoptr` expressions do require constant expression support, because they may be part of relocatable expressions in global initializers. In particular, relative lookup tables are expressed as follows: ```llvm @a = ... @lookup = ptr [N x i32] [ i32 trunc (i64 sub (i64 ptrtoint (ptr @a to i64), i64 ptrtoint (ptr @lookup to i64)) to i32), ...] %ptr = call ptr @llvm.load.relative.i32(ptr @lookup, i32 %idx) ``` The handling of constant expressions is an open question. Possibly it would be acceptable to assume that all globals exposed, even without an explicit `ptrtoint`. (For DCE of internal globals, we would fall back on the address non-determinism justification introduced above.) ### Pointers with unstable representation TODO: This needs some analysis. ### Language compatibility #### Rust Strict Provenance The Rust [strict provenance][rust_strict_provenance] model does not rely on provenance being recovered by `inttoptr` operations, and thus elegantly side-steps all the hard parts of a provenance model. As such, it is compatible with the model described in this document, but does not depend on its details. The mapping between Rust strict provenance and LLVM is: * `p.addr()` maps to `ptrtoaddr(p)`. * `p.with_addr(addr)` maps to `ptradd(p, addr - ptrtoaddr(p))`. #### Rust Exposed Provenance While the use of strict provenance APIs is recommended and usually sufficient, interaction with legacy interfaces may require the use of the [exposed provenance][rust_exposed_provenance] APIs. Rust currently leaves the details of the exposed provenance model unspecified, because of uncertainties about LLVM's provenance semantics. The model specified in this document is intended to be compatible with Rust's requirements. The mapping between Rust exposed provenance and LLVM is: * `p.expose_provenance()` maps to `ptrtoint(p)`. * `ptr::with_exposed_provenance(addr)` maps to `inttoptr(addr)`. One aspect that is not directly part of the strict/exposed provenance models, but is rather a general language property, is that `MaybeUninit<T>` maps to the byte type of size `sizeof(T)`. That is, `MaybeUninit<T>` can faithfully preserve provenance. (Currently it maps to an integer type, and relies on LLVM being sufficiently conservative about handling those to not cause practical miscompilations.) #### C provenance TS While in the C standard itself pointer provenance is not explicitly acknowledged, there is a [technical specification][c_provenance_ts], which specifies a provenance model for C. This model is known as **pnvi-ae-udi** (provenance not via integers, address exposed, user-disambiguation). The model is broadly similar to the one proposed here. The key differences are: * The proposed model allows recovering any exposed provenance. pnvi-ae-udi only allows recovering provenance that corresponds to an object at the address, including the address at the end of an object. In this respect, the LLVM model is more general. * The proposed model does not expose or recover pointer provenance when performing type punning between integers and pointers. In this respect, the LLVM model is more restrictive. While the proposed model is not compatible with the letter pnvi-ae-udi, I believe it is compatible with the spirit, in the sense that most of the motivating examples for pnvi-ae-udi remain valid, if sometimes for other reasons. A possible mapping between C with pnvi-ae-udi and LLVM would be: * `(uintptr_t) p` maps to `ptrtoint(p)`. * `char` maps to the byte type. * Pointer-containing union accesses go through the `byte` type and `bytecast`. TODO: Finish writing this section. Most examples look fine, A.4.7 might be problematic. ## Implementation LLVM's current implementation essentially implements the convenient parts of an exposure-based provenance model (no integer provenance), and ignores the inconvenient parts (exposure side-effects). This results in incoherent behavior, which leads to miscompilations. Provenance related miscompilations are often considered "theoretical", in the sense that they can only be produced with artificially constructed code, rather than encountered in the wild. However, issues that were once considered theoretical tend to turn practical over time. Rust in particular has [observed][rust_i2p2i_miscompile] real-world miscompilations related to "theoretical" provenance issues. This section lists some of the larger items that will be needed to move the implementation into compliance with the proposed provenance model. ### ptrtoaddr The new `ptrtoaddr` instruction (which has already been introduced and for which optimization support is in progress) provides a `ptrtoint` alternative without provenance exposure. This does not address any correctness issues by themselves, but allows us to avoid provenance exposures for operations like pointer subtraction. Reducing the amount of exposures is important, as such exposures usually cannot be optimized away and require much more conservative treatment of the exposed object. ### Byte type The byte type serves multiple important functions: * It provides a way to faithfully represent memory in an SSA value. For example, this is important when optimizing `memcpy` to a fixed size load and store. Currently integers are used for this purpose, but this is not correct if integers do not carry provenance. * We currently get away with using integers for this purpose because we insert ptrtoint/inttoptr casts during store-load forwarding between integer and pointer accesses. However, this can introduce `inttoptr(ptrtoint(p))` round-trip casts which can then no longer be optimized away. The byte type avoids introduction of such casts. * Unrelated to the provenance model, the byte type is also necessary to make storage of poison in memory viable. The implementation for the byte type has recently been rebased and evaluated as part of [GSoC 2025][byte_type]. As such, what is needed now is a new RFC for it, and an upstreaming of the implementation. ### ptrtoint side effect Once we have reduced the amount of `inttoptr`/`ptrtoint` casts LLVM introduces as a result of optimization (as opposed to casts inserted by the frontend), we can take the previously described optimization restrictions on `ptrtoint` and `inttoptr` casts more seriously. The most important change will be to make `ptrtoint` a side-effecting instruction, no longer permitting trivial DCE for it. ### `inttoptr(ptrtoint(p))` round-trip Finally, we need to stop optimizing `inttoptr(ptrtoint(p))` to `p`. This is already available behind the `-disable-i2p-p2i-opt` option. This option currently still has non-trivial optimization impact, but the hope is that the byte type (which will avoid introduction of these roundtrip casts by the optimizer) will ameliorate the situation. [byte_type]: https://blog.llvm.org/posts/2025-08-29-gsoc-byte-type/ [twinsem]: https://research.ralfj.de/papers/2018-oopsla-twinsem.pdf [c_provenance_ts]: https://www.open-std.org/JTC1/SC22/WG14/www/docs/n3057.pdf [memory_region_intrinsic]: https://discourse.llvm.org/t/rfc-memory-region-declaration-intrinsic/59419 [rust_i2p2i_miscompile]: https://github.com/rust-lang/rust/issues/147265 [rust_strict_provenance]: https://doc.rust-lang.org/std/ptr/index.html#strict-provenance [rust_exposed_provenance]: https://doc.rust-lang.org/std/ptr/index.html#exposed-provenance