# Julep: Redesigning `Array` using a new lower-level container type

## History and motivation

From its inception, Julia has had an `Array` type intended to be used any time you need "a bunch of things together in some order". This is one of the most important and most unusual types in the language.

When `Array` was designed, it was believed (possibly by multiple people, but Jeff is willing to take the blame) that the easiest-to-use interface would be a single array type that does most of what users need. Many Julia programmers now feel that the language and community have outgrown this trade-off, and problems have emerged:

* `Array` has many features, with an unusually large amount of functionality written in C. This makes the code harder to maintain and harder for the compiler to optimize.
* Constructing an `Array` is slower than it should be, since it is implemented with branches in C instead of compiler specialization.
* The `Array` struct layout is opaque, inhibiting optimizations and understanding. One notable example is `push!`, which is expected to be fast almost all of the time, yet has significant overhead even when no resizing of the array is needed.
* The extra overhead of all the features of `Array`, even if tolerable in most cases, is a significant cost when arrays are used to implement other types like `Dict`.
* Julia does not currently have an `Array`-like type that allows per-element atomic operations.

`Array` has the following "fancy" features beyond the obvious function of storing a series of elements:

* Multiple dimensions
* 1-d arrays can be resized efficiently at either end
* Multiple arrays can share the same underlying storage, mostly to support no-copy reshaping
* Foreign pointers can be wrapped as `Array`s to share data without copying. Optionally, we will call `free` on the pointer when the `Array` is garbage collected.
* An `Array` can use a `String` as storage to allow zero-copy string initialization.
Here we propose a new lower-level storage type, and to reimplement `Array` in Julia using it.

## New GenericMemory type

_Also `Core.GenericPtr` and `Core.AddrSpace`_

We propose a new type with the following representation:

```julia
primitive type AddrSpace{From::Module} 8 end
const CPU = bitcast(AddrSpace{Core}, 0)

primitive type GenericPtr{T, AS::AddrSpace} sizeof(Int) end
const Ptr{T} = GenericPtr{T, CPU}

mutable struct GenericMemory{kind::Symbol, T, AS::AddrSpace} <: AbstractVector{T}
    const length::Int
    const ptr::GenericPtr{Cvoid, AS}
    # The following fields are hidden:
    # Either:
    #   elements::T...  # `length` elements stored inline
    # Or:
    #   owner::Any      # owning object of `ptr`
end
const Memory{T} = GenericMemory{:not_atomic, T, CPU}
```

The intention of this abstraction is to represent "a range of memory beginning at a fixed location". The memory can either be located directly after the metadata fields (in which case the `ptr` field is redundant) or refer to foreign memory via the `ptr` field. In both cases `ptr` points to the data. When the data is stored inline, we predict no extra overhead from loading the `ptr` field, since it shares a cache line with the data.

`Memory` has constructors similar to those of the `Vector` type. `GenericMemory` generally does not have any constructors.

### The `kind` parameter

The `kind` type parameter specifies the required access style of the data, currently intended to support either atomic (`:atomic`) or "plain" (`:not_atomic`) access. The details and implementation for `:atomic` have not yet been written. In brief, though, this is expected to be similar to `@atomic` fields, with an implied `:monotonic` order on `getindex` (unless specified), but explicit ordering required on `setindex!`. The alignment will be increased to support atomic access as required for the current hardware, and locks will be added (per element) if necessary. In the future, there are other possible uses for `kind`, such as `:const`.
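Since `Memory` is proposed to have `Vector`-like constructors and to subtype `AbstractVector`, basic usage might look like the following sketch (written against the proposed API, not a final implementation):

```julia
# Sketch of basic `Memory` usage under the proposed API.
# Allocate a fixed-size block of 4 uninitialized Ints:
m = Memory{Int}(undef, 4)

# As an AbstractVector, it supports ordinary indexing:
for i in 1:length(m)
    m[i] = i^2
end

# Unlike Vector, a Memory has a fixed length: there is
# no push!/resize! on Memory itself.
```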
### The `addrspace` parameter

The `addrspace` parameter is required to be `Core.CPU` for allocations in main memory. It is present to permit extension by other systems such as GPUs, which might define values such as:

```julia
module CUDA
    const Generic = bitcast(Core.AddrSpace{CUDA}, 0)
    const Global = bitcast(Core.AddrSpace{CUDA}, 1)
end
```

The exact semantics of these other address spaces are defined by the specific backend, but attempting to access them on the CPU will raise an error.

### Type parameter order

The order of the type parameters in `GenericMemory` might be surprising. We did it this way for two reasons:

* If the order were reversed, it would be tempting to write `GenericMemory{T}`, but that would create an abstract type, which we want to discourage. The `Memory{T}` alias solves this.
* Since atomic data allows only atomic store operations, a lot of code will not be generic with respect to atomic-ness by default. This parameter order reminds low-level users to specialize on the access type.

### Resizing

To handle very large data, it is useful to expose the ability to expand a region of memory by extending the OS kernel's mapping. This is supported because one can construct a new `Memory` struct with a larger `length` which points to the same data as the original `Memory`. Growing blocks by moving, or any form of shrinking blocks, is left to higher-level abstractions to implement. Note this type does not and cannot implement `resize!` or related functions like `push!`, because it has different semantics. Several specific interface functions have been proposed here (names are provisional):

* `expand(mem, n)`: Grow size to `n` elements if possible, otherwise do nothing. Returns a new `Memory` or `nothing`.
* `expand_upto(mem, n)`: Grow size to *at most* `n` elements, possibly fewer, returning the new (or same) `Memory`.
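As a sketch of how a higher-level container might build on the provisional interface above (`expand` and its fall-back behavior here are assumptions about the eventual API, and `grow_to` is a hypothetical helper, not part of this proposal):

```julia
# Hypothetical sketch: grow a Memory to at least `n` elements,
# preferring in-place expansion via the provisional `expand`,
# and falling back to allocate-and-copy when that fails.
function grow_to(mem::Memory{T}, n::Int) where {T}
    expanded = expand(mem, n)       # provisional API: new Memory or nothing
    expanded !== nothing && return expanded
    fresh = Memory{T}(undef, n)     # fall back to a new allocation
    copyto!(fresh, 1, mem, 1, length(mem))
    return fresh
end
```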
## New GenericMemoryRef type

```julia
struct GenericMemoryRef{kind::Symbol, T, AS::AddrSpace} <: Ref{T}
    # Note that since this is an immutable struct it will be stored inline in `Array`.
    mem::GenericMemory{kind, T, AS}  # `Memory` holding the data
    # Points to a location within `mem`; either an actual pointer
    # or an offset (if T is zero size or a Union).
    ptr_or_offset::GenericPtr{Cvoid, AS}
end
const MemoryRef{T} = GenericMemoryRef{:not_atomic, T, CPU}
```

The `GenericMemoryRef` type is similar to the existing `RefArray` type, in that it points to a location within another object. Unlike that type, it has special codegen support so that the element offset computation can be elided. Note that a `GenericMemoryRef` always refers to a valid index (unlike a C/C++/LLVM GEP, which may point one element past the end), except when the `GenericMemory` is empty, in which case accessing it can throw a `BoundsError`.

## New builtin functions

    memoryref(mem::Memory)::MemoryRef

Return a `MemoryRef` constructed from a `Memory`, pointing to the first element.

    memoryref(mem::Memory, idx::Int, [boundscheck::Bool=true])::MemoryRef
    memoryref(mem::MemoryRef, idx::Int, [boundscheck::Bool=true])::MemoryRef

Return a `MemoryRef` constructed from an existing `Memory` or `MemoryRef` and an offset, with the bounds check optionally disabled (unsafe).

    memoryrefget(mem::MemoryRef{kind, T}, order::Symbol, boundscheck::Bool)::T

Load the element of a `MemoryRef` and return it, with the given memory order and with the `isempty` check optionally disabled (unsafe). The `order` must be compatible with the `kind` parameter.

    memoryrefset(mem::MemoryRef{kind, T}, x::T, order::Symbol, boundscheck::Bool)

Store `x` to a `MemoryRef` with the given memory order and with the `isempty` check optionally disabled (unsafe). The `order` must be compatible with the `kind` parameter.

    memoryoffset(mem::MemoryRef)

Get the integer index of a `MemoryRef`'s location in its underlying `Memory`.
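Put together, scalar element access on top of `Memory` might be implemented along these lines (a sketch against the proposed builtin signatures, not the final implementation):

```julia
# Sketch: getindex/setindex! for Memory in terms of the proposed
# builtins. Bounds checking happens during memoryref construction;
# the access itself only needs the empty-Memory check disabled.
function getindex(mem::Memory{T}, i::Int) where {T}
    ref = memoryref(mem, i, true)                # bounds check here
    return memoryrefget(ref, :not_atomic, false) # unchecked load
end

function setindex!(mem::Memory{T}, x::T, i::Int) where {T}
    ref = memoryref(mem, i, true)                # bounds check here
    memoryrefset(ref, x, :not_atomic, false)     # unchecked store
    return mem
end
```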
The process of getting or setting elements is decomposed into two operations: constructing a `MemoryRef`, and then accessing it. The reason for this is that constructing a `MemoryRef` is consistent and effect-free, but loading and storing elements are not. In particular, the bounds check is part of `MemoryRef` construction, and these properties make code transformations easier.

## Deprecated builtin functions

`arraylen`, `arrayref`, `arrayset`

## New `Array` definition

```julia
mutable struct Array{T, N} <: AbstractArray{T, N}
    ref::MemoryRef{T}
    axes::NTuple{N, Int}
end
```

For `N == 1`, the `ref` and `axes` fields are mutable and `axes[1]` is the length. For other `N`, the fields are `const`, and the length is `.ref.mem.length` or `prod(.axes)`.

### String-backed arrays

`Base.StringVector` returns a `Vector` whose storage is a `String`, allowing conversion via the `String` constructor without copying. This is used to initialize (immutable) strings efficiently. Declaring the data field of `Array` as `Union{Memory, String}` is not a good idea. Instead, we handle this by making an `Array` whose memory is backed by a `Memory{UInt8}` whose storage is owned by a new `String`.

> [!IMPORTANT]
> When calling `String(::Vector)`, we set the `Vector`'s length to 0 and replace its `Memory` with an empty one. The `String`-backed `Memory` will still exist, but we assume code will not reference it directly.

## Implementation considerations

### Dimension-dependent const-ness

Compilers always want to know when a value is constant. `Array` has the complication that its size is constant only when `N ≠ 1`. Our layout system cannot express this, so it needs to be hacked into the compiler, and will continue to be.

### Batched allocation of `Array` and `Memory`

This design requires allocating two objects (an `Array` and a `Memory`) for most arrays, giving it a slight disadvantage over the current implementation in some cases where the `Array` does not later change size.
We think this could be improved by adding a feature to the GC to allocate two objects at once:

1. Space big enough to hold both objects is allocated.
2. A header bit in the first object indicates that it is followed by another.
3. The compiler recognizes this pattern:

```julia
Array(length) = new(Core.Memory(length))
```

and replaces it with a single allocation call that contains both objects. We do not currently believe this is necessary, however.

### Zero-sized `Memory`

Because all zero-sized memory has no actual mutable content, we can allocate just one such object and share it among all consumers. It can be stored in the type's `.instance` field when first allocated and reused for all future allocations as an optimization (like the zero-size `String` optimization). Secondly, this is the one case where a `MemoryRef` may point to invalid memory, so we have a `boundscheck` parameter on read/write access to verify this case is not present.

### Element layout access

It is important (especially in the GC) to have fast access to the layout metadata for an array's element type. Thus, we set `Memory{T}.layout` to the `layout` of the element type, but with the opaque layout flag set. `unwrap_unionall(GenericMemory).layout` holds the layout metadata for the `GenericMemory` type itself. Previously, `Array` did layout computations on array construction, which was one reason allocating arrays was surprisingly slow. `Union` types were especially a problem since, unlike `DataType`, they do not have a field to store a layout.

### Allocation optimizations

We need to consider how this design will affect advanced array optimizations, including SROA of arrays, allocation elision, and allocation hoisting. The two-layer object structure might make this more difficult.
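The zero-copy conversion described under "String-backed arrays" can be sketched with `Base.StringVector` (a sketch of the intended semantics; per the note above, the `Vector` is emptied when the `String` takes ownership):

```julia
# Sketch: zero-copy construction of a String from a byte buffer.
v = Base.StringVector(5)        # Vector{UInt8} whose storage is String-owned
copyto!(v, codeunits("hello"))  # fill in the bytes
s = String(v)                   # takes ownership of the storage: no copy

# After this, `v` has length 0, so the bytes cannot be mutated
# behind the back of the (immutable) String.
```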
### GenericMemoryRef representation

For efficiency, the `GenericMemoryRef` data field is a `GenericPtr{Cvoid, AS}` to the referenced element for most arrays, but needs to be interpreted as an `Int` index for bitsunion and ghost-value (`elsize == 0`) elements. The `Base.memoryoffset`, `pointer`, and `Core.memoryref` functions can be used to convert between the two representations.

### New compiler features

The addrspace `Loaded` will now have a function form, which combines a `Tracked` pointer and a raw pointer to return a `Loaded` pointer.

```llvm
; memory(none) nosync nounwind speculatable willreturn norecurse
declare nonnull noundef ptr(Loaded) @"julia.gc_loaded"(
        ptr(Tracked) nocapture nonnull noundef readnone,
        ptr nonnull noundef readnone)
{
top:
    ; the GC base pointer for rooting is %0
    ret (addrspacecast ptr %1 to ptr(Loaded))
}
```

The codegen for it will attempt to move this over some GEP call chains, and later the `-llvm-julia-licm` pass will do more of that as well.

## Initial results

* `push!` is ~2.2x faster in the common case:

```julia
function f()
    X = Int[1, 2]
    sizehint!(X, 100)
    for n in 3:100
        push!(X, X[n-1] + X[n-2])
    end
    X[end]
end
```

* Creating empty arrays is roughly 3x faster (e.g. `Int[]`).
* `Dict`, `IdDict`, and `Base64`'s buffer type use `Memory` to save space (for `Dict` it saves 72 bytes per instance).
* Faster initialization for types that switch to `Memory` (80% faster for an empty `Dict{Int, Int}`).
* Ability to construct empty arrays with capacity all at once (and much faster than the previous way of making an array with a size and then `resize!`ing it). This makes `BitSet()` roughly 2x faster.
* The sysimage is slightly larger. This happens because the serialization now assumes data may be stored in the overallocation of a `Vector`, which was previously inaccessible to Julia.