
Julep: Redesigning Array using a new lower-level container type

History and motivation

From its inception, Julia has had an Array type intended to be used any time you need "a bunch of things together in some order". This is one of the most important and most unusual types in the language. When Array was designed, it was believed (possibly by multiple people, but Jeff is willing to take the blame) that the easiest-to-use interface would be a single array type that does most of what users need. Many Julia programmers now feel that the language and community have outgrown this trade-off, and problems have emerged:

  • Array has many features, with an unusually large amount of functionality written in C. This makes the code harder to maintain and harder for the compiler to optimize.
  • Constructing an Array is slower than it should be, since it is implemented with branches in C instead of compiler specialization.
  • The Array struct layout is opaque, inhibiting optimizations and understanding. One notable example is push!, which is expected to be fast almost all of the time, but has significant overhead even when resizing the array is not needed.
  • The extra overhead of all the features of Array, even if tolerable in most cases, is a significant cost when arrays are used to implement other types like Dict.
  • Julia does not currently have an Array-like type that allows per-element atomic operations.

Array has the following "fancy" features beyond the obvious function of storing a series of elements:

  • Multiple dimensions
  • 1-d arrays can be resized efficiently at either end
  • Multiple arrays can share the same underlying storage, mostly to support no-copy reshaping
  • Foreign pointers can be wrapped as Arrays to share data without copying. Optionally, we will call free on the pointer when the Array is garbage collected.
  • An Array can use a String as storage to allow zero-copy string initialization.

Here we propose a new lower-level storage type and to reimplement Array in Julia using it.

New GenericMemory type

Also Core.GenericPtr and Core.AddrSpace

We propose a new type with the following representation:

primitive type AddrSpace{From::Module} 8 end

const CPU = bitcast(AddrSpace{Core}, 0)

primitive type GenericPtr{T, AS::AddrSpace} sizeof(Int) end

const Ptr{T} = GenericPtr{T, CPU}

mutable struct GenericMemory{kind::Symbol, T, AS::AddrSpace} <: AbstractVector{T}
   const length::Int
   const ptr::GenericPtr{Cvoid, AS}   # pointer to the data
   # The following fields are hidden:
   # Either:
       elements::T...  # `length` elements stored inline
   # Or:
       owner::Any      # owning object of `ptr`
end

const Memory{T} = GenericMemory{:not_atomic, T, CPU}

The intention of this abstraction is to represent "a range of memory beginning at a fixed location". The memory space can either be directly after the metadata fields (in which case the ptr field is redundant) or refer to foreign memory via the ptr field. In both cases ptr points to the data. When the data is stored in-line, we predict no extra overhead from loading the ptr field since it shares a cache line with the data.

Memory has constructors similar to those of the Vector type. GenericMemory generally does not have any constructors.
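
Assuming constructors mirroring those of Vector, as the proposal intends, basic usage would look like the following sketch:

```julia
# Allocate 4 uninitialized Int slots; Memory <: AbstractVector, so the
# usual indexing and length functions apply.
m = Memory{Int}(undef, 4)
for i in 1:4
    m[i] = i^2
end
m[2]       # 4
length(m)  # 4
```

Unlike Vector, a Memory has a fixed length for its whole lifetime; there is no push! or resize! at this level.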

The kind parameter

The kind type parameter specifies the required access style of the data, currently intended to support either atomic (:atomic) or "plain" (:not_atomic) access. The details and implementation for :atomic have not yet been written. In brief though, this is expected to be similar to @atomic fields, with implied :monotonic order on getindex (unless specified), but explicit ordering required on setindex!. The alignment will be increased to support atomic access as required for the current hardware, and locks will be added (per element) if necessary.
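
A hypothetical sketch of what :atomic access could look like; the AtomicMemory alias, its constructor, and the exact @atomic syntax below are illustrative assumptions, since this part of the design has not been written:

```julia
# Illustrative only: none of this is implemented yet.
const AtomicMemory{T} = GenericMemory{:atomic, T, Core.CPU}

m = AtomicMemory{Int}(undef, 1)   # assumed constructor
x = m[1]                          # load: implied :monotonic ordering
@atomic :release m[1] = 42        # store: explicit ordering required
```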

In the future, there are other possible uses for kind such as :const.

The addrspace parameter

The addrspace parameter is required to be Core.CPU for allocations in main memory. It is present to permit extension by other systems such as GPUs, which might define values such as:

module CUDA
const Generic = bitcast(Core.AddrSpace{CUDA}, 0)
const Global = bitcast(Core.AddrSpace{CUDA}, 1)
end

The exact semantics of these other address spaces are defined by the specific backend; attempting to access them from the CPU will error.

Type parameter order

The order of the type parameters in GenericMemory might be surprising. We did it this way for two reasons:

  • If the order were reversed, it would be tempting to write GenericMemory{T}, but that would create an abstract type, which we want to discourage. The Memory{T} alias solves this.
  • Since atomic data allows only atomic store operations, a lot of code will not be generic with respect to atomic-ness by default. This parameter order reminds low-level users to specialize on the access type.

Resizing

To handle very large data, it is useful to expose the ability to expand a region of memory by extending the OS kernel's mapping. This is supported because one can construct a new Memory struct with a larger length which points to the same data as the original Memory. Growing blocks by moving, or any form of shrinking blocks, is left to higher-level abstractions to implement.

Note this type does not and cannot implement resize! or related functions like push!, because it has different semantics.

Several specific interface functions have been proposed here (names are provisional):

  • expand(mem, n): Grow size to n elements if possible, otherwise do nothing. Returns a new Memory or nothing.
  • expand_upto(mem, n): Grow size to at most n elements, possibly fewer, returning the new (or same) Memory.
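
A sketch of how these provisional functions might be used; the names come from the proposal above and nothing here is implemented:

```julia
# Provisional interface sketch (not implemented).
mem = Memory{UInt8}(undef, 4096)

bigger = expand(mem, 8192)       # nothing if the OS mapping cannot grow
if bigger !== nothing
    mem = bigger                 # same underlying data, larger length
end

mem = expand_upto(mem, 1 << 20)  # grows as far as possible, up to 2^20
```

Because the returned Memory aliases the original data, callers must treat the old struct and the new one as views of the same storage.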

New GenericMemoryRef type

struct GenericMemoryRef{kind::Symbol, T, AS::AddrSpace} <: Ref{T}
    # Note that since this is an immutable struct it will be stored inline in `Array`.
    mem::GenericMemory{kind, T, AS}    # `Memory` holding data
    # Points to a location within `mem`; either an actual pointer
    # or an offset (if T is zero size or a Union).
    ptr_or_offset::GenericPtr{Cvoid, AS}
end
const MemoryRef{T} = GenericMemoryRef{:not_atomic, T, CPU}

The GenericMemoryRef type is similar to the existing RefArray type, in that it points to a location within another object. Unlike that type, it has special codegen support so that the element offset computation can be elided.

Note that GenericMemoryRef here is always a valid index (unlike C/C++/LLVM GEP which lets it point one element past the end), except when the GenericMemory is empty, in which case accessing it can throw a BoundsError.

New Builtin functions

    memoryref(mem::Memory)::MemoryRef

Return a MemoryRef constructed from a Memory, pointing to the first element.

    memoryref(mem::Memory, idx::Int, [boundscheck::Bool=true])::MemoryRef
    memoryref(mem::MemoryRef, idx::Int, [boundscheck::Bool=true])::MemoryRef

Return a MemoryRef constructed from an existing Memory or MemoryRef and an offset, with the bounds check optionally disabled (unsafe).

    memoryrefget(mem::MemoryRef{kind, T}, order::Symbol, boundscheck::Bool)::T

Load the element of a MemoryRef and return it, with the given memory order and with the isempty check optionally disabled (unsafe). The `order` must be compatible with the `kind` parameter.

    memoryrefset(mem::MemoryRef{kind, T}, x::T, order::Symbol, boundscheck::Bool)

Store `x` to a MemoryRef with the given memory order and with the isempty check optionally disabled (unsafe). The `order` must be compatible with the `kind` parameter.

    memoryoffset(mem::MemoryRef)

Get the integer index of a MemoryRef's location in its underlying Memory.

The process of getting or setting elements is decomposed into two operations: constructing a MemoryRef, and then accessing it. The reason for this is that constructing a MemoryRef is consistent and effect-free, but loading and storing elements are not. In particular the bounds check is part of MemoryRef construction, and these properties make code transformations easier.
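
The decomposition above can be sketched as follows, using the proposed builtin names and the proposed Array layout (illustrative only, not the actual Base implementation):

```julia
# Sketch: how getindex on the redesigned Array could decompose.
function getindex_sketch(A, i::Int)
    # Step 1: construct the MemoryRef. The bounds check lives here, and the
    # construction is consistent and effect-free, so the compiler may move it.
    ref = memoryref(A.ref, i, true)
    # Step 2: perform the load. This part has effects and is handled separately.
    return memoryrefget(ref, :not_atomic, false)
end
```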

Deprecated builtin functions

arraylen, arrayref, arrayset

New Array definition

mutable struct Array{T, N} <: AbstractArray{T, N}
    ref::MemoryRef{T}
    axes::NTuple{N, Int}
end

For N==1, the axes and ref are mutable and axes[1] is the length. For other N, the fields are const, and the length is .ref.mem.length or prod(.axes).
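
As a sketch, size and length fall out of this layout as follows (field names taken from the definition above; illustrative only):

```julia
# Sketch of size/length for the proposed layout.
size_sketch(A) = A.axes
length_sketch(A::Array{T,1}) where {T} = A.axes[1]       # mutable field for N == 1
length_sketch(A::Array{T,N}) where {T,N} = prod(A.axes)  # const fields otherwise
```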

String-backed arrays

Base.StringVector returns a Vector whose storage is a String, allowing conversion via the String constructor without copying. This is used to initialize (immutable) strings efficiently. Declaring the data field of Array as Union{Memory, String} to support this is not a good idea.

We handle this by making an Array whose memory is backed by a Memory{UInt8} whose storage is owned by a new String.

Important

When calling String(::Vector), we set the Vector's length to 0 and replace its Memory with an empty one. The String-backed Memory will still exist, but we assume code will not reference it directly.
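
The round trip looks like this today (Base.StringVector is unexported but long-standing); under the new design the emptied Vector gives up its Memory as described above:

```julia
# Build a string without copying: fill a String-backed buffer, then donate it.
buf = Base.StringVector(5)          # Vector{UInt8} whose storage is a String
copyto!(buf, codeunits("hello"))
s = String(buf)                     # takes ownership of the bytes; no copy
# The Vector is emptied by the call, so its storage cannot be mutated later:
s            # "hello"
length(buf)  # 0
```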

Implementation considerations

Dimension-dependent const-ness

Compilers always want to know when a value is constant. Array has the complication that the size is constant only when N≠1. Our layout system cannot express this, so it needs to be hacked into the compiler and will continue to be.

Batched allocation of Array and Memory

This design requires allocating two objects (an Array and a Memory) for most arrays, giving it a slight disadvantage over the current implementation in some cases where the Array does not later change size. We think this could be improved by adding a feature to the GC to allocate two objects at once:

  • Space big enough to hold both objects is allocated.
  • A header bit in the first object indicates that it is followed by another.
  • The compiler recognizes this pattern:

Array(length) = new(Core.Memory(length))

and replaces it with a single allocation call that contains both objects. We do not currently believe this is necessary however.

Zero-sized Memory

Because all zero-sized memory has no actual mutable content, we can instead allocate only one of these for all consumers. This can be stored in the .instance field when first allocated and reused for all future allocations as an optimization (like the zero-size String optimization).

Secondly, this is the one case where MemoryRef may point to invalid memory, so we have a boundscheck parameter on read/write access to verify this case is not present.

Element layout access

It is important (especially in the GC) to have fast access to the layout metadata for an array's element type. Thus, we set Memory{T}.layout to the layout of the element type but with the opaque layout flag set. unwrap_unionall(GenericMemory).layout holds the layout metadata for the GenericMemory type itself.

Previously, Array did layout computations on array construction, which was one reason allocating arrays was surprisingly slow. Union types were especially a problem since unlike DataType they do not have a field to store a layout.

Allocation optimizations

We need to consider how this design will affect advanced array optimizations including SROA of arrays, allocation elision, and allocation hoisting. The two-layer object structure might make this more difficult.

GenericMemoryRef representation

For efficiency, the GenericMemoryRef data field is a GenericPtr{Cvoid, AS} to the referenced element for most arrays, but needs to be interpreted as an Int index for bitsunion and ghost-value (elsize == 0) elements. The Base.memoryoffset, pointer, and Core.memoryref functions can be used to convert between the two representations.

New compiler features

The Loaded address space will now have a function form, which combines a Tracked pointer and a raw pointer to return a Loaded pointer.

; memory(none) nosync nounwind speculatable willreturn norecurse
define nonnull noundef ptr(Loaded) @"julia.gc_loaded"(
   ptr(Tracked) nocapture nonnull noundef readnone %0,
   ptr nonnull noundef readnone %1) {
 top:
   ; the GC base pointer for rooting is %0
   ret (addrspacecast ptr %1 to ptr(Loaded))
}

The codegen for it will attempt to move this over some GEP call chains, and later the -llvm-julia-licm pass will do more of that as well.

Initial results

  • push! is ~2.2x faster in the common case
function f()
    X = Int[1,2]
    sizehint!(X, 100)
    for n in 3:100
        push!(X, X[n-1] + X[n-2])
    end
    X[end]
end
  • Creating empty arrays is roughly 3x faster (e.g. Int[])
  • Dict, IDDict, and Base64's buffer type use Memory to save space (for Dict it saves 72 bytes per instance).
  • Faster initialization for types that switch to Memory (80% faster for empty Dict{Int, Int})
  • Ability to construct empty arrays with pre-allocated capacity all at once, which is much faster than the previous way of making an array with a size and then resize!ing it. This makes BitSet() roughly 2x faster.
  • The sysimage is slightly larger. This happens because serialization now assumes data may be stored in the overallocated portion of a Vector, memory that was previously inaccessible to Julia.