
Julia Escape Analysis Project

  • Team: Shuhei Kadowaki, Jameson Nash
  • Meeting:
  • Scope:
    • implement a strong Julia-level escape analysis
    • implement memory-optimizations based on the analysis
    • and more ?
      • feedback escape information to inference ?
      • summarize performance improvements, and write up a paper ?
  • Timeline: upstream to Julia base by the end of the year

Targets

As midterm objectives of this project, we will try to implement the following optimizations:

  • Enhanced SROA & DCE
  • Heap-to-stack allocation conversion
  • (Allocation movement ?)
  • finalizer elision

Enhanced SROA & DCE

The current state

Julia's compiler does SROA (Scalar-Replacement-of-Aggregates) optimization on two levels:

  1. on Julia-level IR: getfield_elim_pass!
  2. on LLVM-IR: llvm-alloc-opt.cpp (Julia's own pass over heap allocations; LLVM's stock SROA pass additionally handles stack memory)

The purpose of this optimization is to enhance the Julia-level SROA with escape analysis.

The current SROA pass can be described as follows (per #42357):

​​​​getfield_elim_pass!(ir::IRCode) -> newir::IRCode

getfield elimination pass, a.k.a. Scalar Replacements of Aggregates optimization.

This pass is based on a local alias analysis that collects field information by def-use chain walking. It looks for struct allocation sites ("definitions"), and getfield calls as well as :foreigncalls that preserve the structs ("usages"). If "definitions" have enough information, then this pass will replace corresponding usages with lifted values. mutable structs require additional care and need to be handled separately from immutables. For mutable structs, setfield! calls also account for "definitions", and the pass should conservatively give up the lifting when there are any "intermediate usages" that may escape the mutable struct (e.g. a non-inlined generic function call that takes the mutable struct as its argument).

When all usages are fully eliminated, the struct allocation may also be erased as a result of dead code elimination.

The important observations are:

  • successful SROA can sometimes even eliminate allocations entirely as a result of dead code elimination
  • getfield_elim_pass! is limited in a sense that:
    • it can't reason about anything inter-procedural
    • it requires "safe" allocation sites to kick off the whole optimization

Let's see successful examples first, and then understand limitations.

Here, the allocation site r = Ref{Int}(0) is "safe" and has enough information: r[] (inlined to getfield) will be eliminated, and even more, the allocation and setfield! are voided out too:

julia> code_typed() do
           r = Ref{Int}(0)
           r[] = 42
           b = sin(r[])
           return b
       end |> only
CodeInfo(
1 ─ %1 = Base.sitofp(Float64, 42)::Float64
│   %2 = invoke Base.Math.sin(%1::Float64)::Float64
└──      return %2
) => Float64

getfield_elim_pass! requires a "safe" allocation. So if we initialize r = Ref{Int}() (leaving the field undefined), then the whole optimization won't work:

julia> code_typed() do
           r = Ref{Int}()
           r[] = 42
           b = sin(r[])
           return b
       end |> only
CodeInfo(
1 ─ %1 = %new(Base.RefValue{Int64})::Base.RefValue{Int64}
│        Base.setfield!(%1, :x, 42)::Int64
│   %3 = Base.getfield(%1, :x)::Int64
│   %4 = Base.sitofp(Float64, %3)::Float64
│   %5 = invoke Base.Math.sin(%4::Float64)::Float64
└──      return %5
) => Float64

Still, the later LLVM-level SROA may be able to optimize it further, and it seems to be more powerful than the Julia-level one in general; e.g. it can "fully" optimize the above code (the argument of @j_sin_4761 is replaced with the scalar value double 4.200000e+01, and we don't have any allocation):

julia> code_llvm(()) do
           r = Ref{Int}()
           r[] = 42
           b = sin(r[])
           return b
       end
;  @ none:2 within `#49`
define double @"julia_#49_4759"() #0 {
top:
;  @ none:4 within `#49`
; ┌ @ math.jl:1228 within `sin`
   %0 = call double @j_sin_4761(double 4.200000e+01) #0
; └
;  @ none:5 within `#49`
  ret double %0
}

Now let's bother LLVM: if we just add a non-inlined function call, then LLVM can't help giving up the optimization:

julia> @noinline setter!(r, x) = r[] = x
setter! (generic function with 1 method)

julia> code_llvm(()) do
           r = Ref{Int}()
           setter!(r, 42)
           b = sin(r[])
           return b
       end
;  @ none:2 within `#45`
define double @"julia_#45_4943"() #0 {
top:
  %gcframe7 = alloca [3 x {}*], align 16
  %gcframe7.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 0
  %0 = bitcast [3 x {}*]* %gcframe7 to i8*
  call void @llvm.memset.p0i8.i32(i8* nonnull align 16 dereferenceable(24) %0, i8 0, i32 24, i1 false)
  %1 = call {}*** inttoptr (i64 140733735208165 to {}*** (i64)*)(i64 261) #6
; ┌ @ refpointer.jl:135 within `Ref`
; │┌ @ refvalue.jl:7 within `RefValue`
    %2 = bitcast [3 x {}*]* %gcframe7 to i64*
    store i64 4, i64* %2, align 16
    %3 = load {}**, {}*** %1, align 8
    %4 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 1
    %5 = bitcast {}** %4 to {}***
    store {}** %3, {}*** %5, align 8
    %6 = bitcast {}*** %1 to {}***
    store {}** %gcframe7.sub, {}*** %6, align 8
    %ptls_field3 = getelementptr inbounds {}**, {}*** %1, i64 2305843009213693954
    %7 = bitcast {}*** %ptls_field3 to i8**
    %ptls_load45 = load i8*, i8** %7, align 8
    %8 = call noalias nonnull {}* @jl_gc_pool_alloc(i8* %ptls_load45, i32 1392, i32 16) #2
    %9 = bitcast {}* %8 to i64*
    %10 = getelementptr inbounds i64, i64* %9, i64 -1
    store atomic i64 4617487696, i64* %10 unordered, align 8
    %11 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 2
    store {}* %8, {}** %11, align 16
; └└
;  @ none:3 within `#45`
  %12 = call i64 @"j_setter!_4945"({}* nonnull %8, i64 signext 42) #0
;  @ none:4 within `#45`
; ┌ @ refvalue.jl:56 within `getindex`
; │┌ @ Base.jl:38 within `getproperty`
    %13 = load i64, i64* %9, align 8
; └└
; ┌ @ math.jl:1232 within `sin`
; │┌ @ float.jl:269 within `float`
; ││┌ @ float.jl:243 within `AbstractFloat`
; │││┌ @ float.jl:146 within `Float64`
      %14 = sitofp i64 %13 to double
; │└└└
; │ @ math.jl:1234 within `sin`
   %15 = call double @j_sin_4946(double %14) #0
   %16 = load {}*, {}** %4, align 8
   %17 = bitcast {}*** %1 to {}**
   store {}* %16, {}** %17, align 8
; └
;  @ none:5 within `#45`
  ret double %15
}

LLVM can also be confused by Julia-level semantics other than inter-procedural calls. For example, typeassert can pretty much null out LLVM SROA. In the following code snippet, neither Julia-level nor LLVM-level SROA can eliminate the closure construction:

julia> doit(f, args...) = f(args...)

julia> code_typed() do
           local arr::Vector{Any}
           doit() do
               arr = Any[]
           end
           arr
       end |> only
CodeInfo(
1 ─ %1 = Core.Box::Type{Core.Box}
│   %2 = %new(%1)::Core.Box
│   %3 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Vector{Any}, svec(Any, Int64), 0, :(:ccall), Vector{Any}, 0, 0))::Vector{Any}
│        Core.setfield!(%2, :contents, %3)::Vector{Any}
│   %5 = Core.isdefined(%2, :contents)::Bool
└──      goto #3 if not %5
2 ─      goto #4
3 ─      $(Expr(:throw_undef_if_not, :arr, false))::Any
4 ┄ %9 = Core.getfield(%2, :contents)::Any
│        Core.typeassert(%9, Vector{Any})::Vector{Any}
│   %11 = π (%9, Vector{Any})
└──      return %11
) => Vector{Any}

julia> code_llvm(()) do
           local arr::Vector{Any}
           doit() do
               arr = Any[]
           end
           arr
       end
;  @ none:2 within `#55`
define nonnull {}* @"julia_#55_4772"() #0 {
pass:
;  @ none:3 within `#55`
; ┌ @ none:1 within `doit`
; │┌ @ none:4 within `#56`
; ││┌ @ array.jl:423 within `getindex`
; │││┌ @ boot.jl:471 within `Array` @ boot.jl:452
      %0 = call nonnull {}* inttoptr (i64 4483450916 to {}* ({}*, i64)*)({}* inttoptr (i64 4648194800 to {}*), i64 0)
; └└└└
;  @ none:6 within `#55`
  %1 = bitcast {}* %0 to i64*
  %2 = getelementptr inbounds i64, i64* %1, i64 -1
  %3 = load atomic i64, i64* %2 unordered, align 8
  %4 = and i64 %3, -16
  %5 = inttoptr i64 %4 to {}*
  %6 = icmp eq {}* %5, inttoptr (i64 4648194800 to {}*)
  br i1 %6, label %pass3, label %fail2

fail2:                                            ; preds = %pass
  call void @jl_type_error(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @_j_str1, i64 0, i64 0), {}* inttoptr (i64 4648194800 to {}*), {}* %0)
  unreachable

pass3:                                            ; preds = %pass
  ret {}* %0
}

# interesting fact:
# LLVM can optimize it away if we strip out the type declaration !

julia> code_llvm(()) do
           local arr
           doit() do
               arr = Any[]
           end
           arr
       end
;  @ none:2 within `#55`
define nonnull {}* @"julia_#55_4959"() #0 {
pass:
;  @ none:3 within `#55`
; ┌ @ none:1 within `doit`
; │┌ @ none:4 within `#56`
; ││┌ @ array.jl:423 within `getindex`
; │││┌ @ boot.jl:471 within `Array` @ boot.jl:452
      %0 = call nonnull {}* inttoptr (i64 4431128612 to {}* ({}*, i64)*)({}* inttoptr (i64 4614886128 to {}*), i64 0)
; └└└└
;  @ none:6 within `#55`
  ret {}* %0
}

Enhanced SROA & More aggressive DCE

We want "better" Julia-level SROA and DCE. Especially, it should be:

  • (inter-procedural) new SROA should analyze aliases and escapes better, and work better with non-inlined calls
  • (more aggressive) new DCE should be able to null out typeassert/isdefined/isa checks more aggressively so that we can eliminate more allocations[1]

Target examples:

  • propagate inter-procedural memory-effect:
    ​​julia> @noinline setter!(r, x) = r[] = x
    ​​setter! (generic function with 1 method)
    
    ​​julia> code_typed() do
    ​​           r = Ref{Int}()
    ​​           setter!(r, 42)
    ​​           b = sin(r[])   # should be turned into `sin(42)`
    ​​           return b
    ​​       end |> only
    ​​CodeInfo(
    ​​1 ─ %1 = %new(Base.RefValue{Int64})::Base.RefValue{Int64}
    ​​│        invoke Main.setter!(%1::Base.RefValue{Int64}, 42::Int64)::Any
    ​​│   %3 = Base.getfield(%1, :x)::Int64
    ​​│   %4 = Base.sitofp(Float64, %3)::Float64
    ​​│   %5 = invoke Base.Math.sin(%4::Float64)::Float64
    ​​└──      return %5
    ​​) => Float64
    
  • (even local analysis improvements):
    ​​# `RefValue{String}` allocation won't be eliminated
    ​​(s::String) -> broadcast(identity, Ref(s))
    
    ​​# abstract-interpretation based field information propagation
    ​​# should allow us to optimize this IR
    ​​2 1 ─ %1 = %new(Base.RefValue{String}, _2)::Base.RefValue{String}                                                                                                                               │╻╷╷     Ref
    ​​│   %2 = Core.tuple(%1)::Tuple{Base.RefValue{String}}                                                                                                                                         │╻       broadcast
    ​​│   %3 = Core.getfield(%2, 1)::Base.RefValue{String}                                                                                                                                          ││
    ​​└──      goto #3 if not true                                                                                                                                                                  ││╻╷      materialize
    ​​2 ─      nothing::Nothing                                                                                                                                                                     │
    ​​3 ┄ %6 = Base.getfield(%3, :x)::String                                                                                                                                                        │││╻╷╷╷╷   copy
    ​​└──      goto #4                                                                                                                                                                              ││││┃       getindex
    ​​4 ─      goto #5                                                                                                                                                                              ││││
    ​​5 ─      goto #6                                                                                                                                                                              │││
    ​​6 ─      goto #7                                                                                                                                                                              ││
    ​​7 ─      return %6                                                                                                                                                                            │
    
  • nullify typeassert check:
    ​​julia> code_typed() do
    ​​           r = Ref{Any}()
    ​​           r[] = 42
    ​​           # this assertion is useful for inference
    ​​           # but "harmful" for LLVM
    ​​           return r[]::Int
    ​​       end
    ​​1-element Vector{Any}:
    ​​ CodeInfo(
    ​​1 ─ %1 = %new(Base.RefValue{Any})::Base.RefValue{Any}
    ​​│        Base.setfield!(%1, :x, 42)::Int64
    ​​│   %3 = Base.getfield(%1, :x)::Any
    ​​│        Core.typeassert(%3, Main.Int)::Int64
    ​​│   %5 = π (%3, Int64)
    ​​└──      return %5
    ​​) => Int64
    

Heap to stack allocation conversion

The idea

Julia allocates every (actually allocated) mutable object on the heap, while in an ideal world we would place them on the stack if their lifetime doesn't escape the execution frame.

An obvious example would be the following: Julia will allocate MyRef("foo") even though it doesn't escape anywhere, because the @noinline'd getx call will prevent both Julia-level and LLVM-level SROA, and they can't eliminate the allocation:

julia> mutable struct MyRef{T}; x::T; end

julia> @noinline getx(x::MyRef) = x.x
getx (generic function with 1 method)

julia> code_typed(()) do
           r = MyRef("foo")
           x = getx(r)
           sizeof(x)
       end |> only
CodeInfo(
1 ─ %1 = %new(MyRef{String}, "foo")::MyRef{String}
│   %2 = invoke Main.getx(%1::MyRef{String})::String
│   %3 = Core.sizeof::typeof(Core.sizeof)
│   %4 = (%3)(%2)::Int64
└──      return %4
) => Int64

julia> code_llvm(()) do
           r = MyRef("foo")
           x = getx(r)
           sizeof(x)
       end
;  @ none:2 within `#75`
define i64 @"julia_#75_4792"() #0 {
top:
  %0 = alloca {}*, align 8
  %gcframe7 = alloca [3 x {}*], align 16
  %gcframe7.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 0
  %1 = bitcast [3 x {}*]* %gcframe7 to i8*
  call void @llvm.memset.p0i8.i32(i8* nonnull align 16 dereferenceable(24) %1, i8 0, i32 24, i1 false)
  %2 = call {}*** inttoptr (i64 140733733901541 to {}*** (i64)*)(i64 261) #7
; ┌ @ none:1 within `MyRef` @ none:1
   %3 = bitcast [3 x {}*]* %gcframe7 to i64*
   store i64 4, i64* %3, align 16
   %4 = load {}**, {}*** %2, align 8
   %5 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 1
   %6 = bitcast {}** %5 to {}***
   store {}** %4, {}*** %6, align 8
   %7 = bitcast {}*** %2 to {}***
   store {}** %gcframe7.sub, {}*** %7, align 8
   %ptls_field3 = getelementptr inbounds {}**, {}*** %2, i64 2305843009213693954
   %8 = bitcast {}*** %ptls_field3 to i8**
   %ptls_load45 = load i8*, i8** %8, align 8
   %9 = call noalias nonnull {}* @jl_gc_pool_alloc(i8* %ptls_load45, i32 1392, i32 16) #2
   %10 = bitcast {}* %9 to i64*
   %11 = getelementptr inbounds i64, i64* %10, i64 -1
   store atomic i64 5398336928, i64* %11 unordered, align 8
   %12 = bitcast {}* %9 to {}**
   store {}* inttoptr (i64 5376165704 to {}*), {}** %12, align 8
   %13 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 2
   store {}* %9, {}** %13, align 16
; └
;  @ none:3 within `#75`
  store {}* %9, {}** %0, align 8
  %14 = call nonnull {}* @j1_getx_4794({}* inttoptr (i64 5377254840 to {}*), {}** nonnull %0, i32 1)
;  @ none:4 within `#75`
; ┌ @ Base.jl:175 within `sizeof`
   %15 = bitcast {}* %14 to i64*
   %16 = load i64, i64* %15, align 8
   %17 = load {}*, {}** %5, align 8
   %18 = bitcast {}*** %2 to {}**
   store {}* %17, {}** %18, align 8
; └
  ret i64 %16
}

The basic idea of this optimization is to modify the LLVM IR so that we can allocate MyRef("foo") on the stack rather than the heap.

Do we really need this ?

We worked on this optimization in the first project iteration (the GSoC term), but for now, I'd like to put a lower priority on it:

  • (SROA and DCE) as explained above, successful SROA can eliminate allocations entirely, so enhanced SROA can subsume a good part of this optimization; e.g. for this code snippet, we can just convert
    ​​let
    ​​    r = MyRef("foo")
    ​​    x = getx(r)
    ​​    sizeof(x)
    ​​end
    
    to
    ​​let
    ​​    sizeof("foo")
    ​​end
    
    in the enhanced SROA pass, and then there is no need for this heap-to-stack conversion
  • (GC interaction) we need to make (complicated) design choices, mainly around interactions with the GC (Jameson can elaborate on this)
  • (throw exception) the current throwness check is not so accurate, and we often need to conservatively account for escapes through exceptions, which makes stack allocation harder[2]

Allocation movement

TODO: elaborate this.

finalizer elision

Julia finalizers are managed by the GC, so they may hang around for a while as the GC lets the mutable object hang around. This means we may end up consuming much memory when finalizers are associated with e.g. resource management, which is a common practice seen e.g. in GPU programming.

The idea of this optimization is to call a finalizer early by inserting a finalize(obj) call when we can prove the lifetime of obj finishes within a frame:

julia> mutable struct WithFinalizer
           v
           function WithFinalizer(v)
               x = new(v)
               f(t) = @async println("Finalizing $t.")
               return finalizer(f, x)
           end
       end

julia> make_obj(v = 10) = WithFinalizer(v)
make_obj (generic function with 1 method)

julia> let
           o = make_obj()
           copy(o.v) # <= automatically insert `finalize(o)` here
       end
10

Considerations

  • can we just call finalize(obj) ?
    • finalize(obj) is super costly ! (see the sketch after this list)
      1. we want to "enqueue" the finalizer (per Valentin)
      2. make it an intrinsic and make it fast
  • how can we find objects that have finalizers ?
  • do we need to account for possible exceptions within the finalizer call itself ?
  • do we want to have flow-sensitivity (i.e. insert finalize at the return site) ?
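For illustration, a hand-written sketch of the intended transformation for the example above (this uses Base's existing finalize; the cheap "enqueue" intrinsic is still hypothetical):

function use_obj()
    o = make_obj()
    r = copy(o.v)
    # inserted by the compiler once it proves the lifetime of `o` ends here;
    # ideally this would lower to a cheap "enqueue the finalizer" intrinsic
    # rather than the eager (and super costly) `finalize(o)`
    finalize(o)
    return r
end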

Pipeline Design

  1. type inference
  2. optimization
    1. preps
    2. inlining
    3. escape analysis
    4. optimization passes
      1. enhanced SROA
      2. aggressive dead code elimination
      3. finalizer elision
  3. caching

Even if we have a useful inter-procedural cache, escape analysis and SROA are better run after inlining, simply because inter-procedural handling is much easier there: we can just look for :invoke expressions that the inliner has resolved.
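As a minimal sketch, the envisioned ordering could look like the following (all function names here are hypothetical placeholders mirroring the list above, not the actual Base.Compiler API):

# a sketch of the optimization pipeline after type inference
function run_passes_with_ea(ir)
    ir = inline_pass(ir)               # 2-2. inlining resolves calls into `:invoke`s
    estate = analyze_escapes(ir)       # 2-3. escape analysis on post-inlining IR
    ir = enhanced_sroa(ir, estate)     # 2-4-1. enhanced SROA
    ir = aggressive_dce(ir, estate)    # 2-4-2. aggressive DCE
    ir = finalizer_elision(ir, estate) # 2-4-3. finalizer elision
    return ir
end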

Caching:

  • create a new field of CodeInstance, and
  • pack (minimized) escape information into it, as we do for the inference result.
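A minimal sketch of what such a cache entry could look like (the struct and its encoding are invented for illustration; the real format should be minimized, e.g. bit-packed):

# hypothetical per-CodeInstance escape summary
struct EscapeCache
    retescape::UInt8           # how the returned value escapes
    argescapes::Vector{UInt8}  # per-argument summary, bit-packed, e.g.
                               # 0x01 = returned, 0x02 = thrown, 0x04 = global
end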

Development Plan

Schedule

  • 21/09 (kick-off): re-define targets, setup milestones
  • 21/10 (stage 1): implement as "dummy" AbstractInterpreter
    • optimizations
      • enhanced SROA
      • aggressive DCE
      • finalizer elision prototyping
    • analysis
  • 21/11: (stage 2) executable AbstractInterpreter
    • keep working on the stage 1 leftovers
    • build compiler-plugin infrastructure from scratch to actually execute that pipeline
  • 21/12: (stage 3) make a PR against Julia Base, finish reviews

Discussions

NOTE: the agenda for the most recent meeting comes first.

2021/12/17

  • For ImmutableArray

    • mutating_arrayfreeze
    • lifetime: whether or not mutable array is "mutated"
  • alias analysis?

    • about accuracy
    • may not need so much?
  • BoundsError

    • if only escape is from BoundsError
    • language semantics requires "non-semantic copy"
  • lifetime information?

    • might not be too important?
    • but it's very rare!
  • focus:

    • phase 1: reviving ImmutableArray
    • phase 2: LLVM escape
      • limitations: Union-type, inter-procedural
  • ImmutableArray PR

  • taint analysis

2021/11/24

Agenda

  • Implementation of field/alias analysis for enhanced SROA
  • "Non-semantic" copy for stack allocation

Implementation of field/alias analysis for enhanced SROA

Allocation-elimination targets

:arrow_double_up: easy / :arrow_double_down: hard:

  • simplest SROA
    ​​code_typed((Any,)) do a
    ​​     r = Ref{Any}(a)
    ​​     r[] # => a
    ​​end
    
  • domination-aware SROA
    ​​code_typed((Any,Any)) do a, b
    ​​     r = Ref{Any}(a)
    ​​     r[] = b
    ​​     r[] # => b
    ​​end
    ​​
    ​​# uninitialized field
    ​​# https://github.com/JuliaLang/julia/pull/42834
    ​​code_typed((Any,)) do a
    ​​     r = Ref{Any}()
    ​​     r[] = a
    ​​     r[] # => a
    ​​end
    ​​
    ​​code_typed((Bool,Any,Any)) do cond, a, b
    ​​     r = Ref{Any}()
    ​​     if cond
    ​​         r[] = a
    ​​         r[] # => a
    ​​     else
    ​​         r[] = b
    ​​         r[] # => b
    ​​     end
    ​​end
    
  • φnode-fying SROA
    ​​code_typed((Bool,Any,Any)) do cond, a, b
    ​​     r = Ref{Any}(nothing)
    ​​     if cond
    ​​         r[] = a
    ​​     else
    ​​         r[] = b
    ​​     end
    ​​     r[] # => φ(a, b)
    ​​end
    ​​
    ​​# uninitialized field
    ​​# https://github.com/JuliaLang/julia/pull/43208
    ​​code_typed((Bool,Any,Any)) do cond, a, b
    ​​     r = Ref{Any}()
    ​​     if cond
    ​​         r[] = a
    ​​     else
    ​​         r[] = b
    ​​     end
    ​​     r[] # => φ(a, b)
    ​​end
    
  • alias-aware SROA (nested load elimination)
    ​​# `s -> broadcast(identity, Ref(s))`
    ​​code_typed((String,)) do s
    ​​    r = Ref(s)
    ​​    t = (r,)
    ​​    t[1][] # => s
    ​​end
    ​​
    ​​code_typed() do
    ​​    local arr::Vector{Any}
    ​​    alloc! = () -> arr = Any[]
    ​​    alloc!()
    ​​    arr
    ​​end
    
    ​​code_typed((Bool,Any,Any)) do cond, a, b
    ​​    r1 = Ref{Any}(a)
    ​​    r2 = Ref{Any}(b)
    ​​    (cond ? r1 : r2)[] # φ(a, b)
    ​​    # or => (cond ? r1[] : r2[])
    ​​end
    
  • "partial" SROA
    ​​code_typed((Bool,Any,)) do rare_cond, a
    ​​    r = Ref{Any}(a)
    ​​    if rare_cond
    ​​        throw(r)   # move `r` allocation here
    ​​    else
    ​​        return r[] # => a
    ​​    end
    ​​end
    
  • stack allocation
    ​​@noinline safe_op(a) = ...
    ​​code_typed((Any,)) do a
    ​​    r = Ref{Any}(a) # can be allocated on stack
    ​​    safe_op(r)
    ​​end
    

Design of analysis/optimization

  1. current Julia Base SROA
​​[SROA]
​​    [simple analysis]
​​    [domination analysis]
​​    [actual code transformation]
  2. multiple SROA: benchmark

  3. backward escape/alias/field analysis & load forwarding by domination analysis

    • https://github.com/aviatesk/EscapeAnalysis.jl/pull/54
    • overview
      ​​​​​​# analysis
      ​​​​​​while [converge]
      ​​​​​​    [backward escape/alias/field analysis]
      ​​​​​​end
      ​​​​​​
      ​​​​​​# transformation
      ​​​​​​[domination analysis]
      ​​​​​​[SROA]
      ​​​​​​    [load forwarding]
      ​​​​​​    [DCE]
      
  4. backward escape/alias analysis & forward field propagation

    • EscapeAnalysis.jl/avi/flowsensitive-aliasanalysis
    • overview:
      ​​​​​​# analysis
      ​​​​​​while [converge]
      ​​​​​​    [backward escape/alias analysis]
      ​​​​​​    while [converge]
      ​​​​​​        [forward field information propagation]
      ​​​​​​    end
      ​​​​​​end
      ​​​​​​
      ​​​​​​# transformation
      ​​​​​​[SROA]
      ​​​​​​    [simple load replacement]
      ​​​​​​    [DCE]
      
  5. forward escape/alias/field analysis

    • JVM compilers' approach
    • overview
      ​​​​​​# analysis
      ​​​​​​while [converge]
      ​​​​​​    [forward escape/alias analysis]
      ​​​​​​    [forward field information propagation]
      ​​​​​​end
      ​​​​​​
      ​​​​​​# transformation
      ​​​​​​[SROA]
      ​​​​​​    [simple load replacement]
      ​​​​​​    [DCE]
      
  • Points
    • complexity of domination/control-flow
      • "last" definition
      • field merge (φnode-fying)
      • uninitialized field
    • conflicts between escape information and field information
      • escape information: flow from usage to definition
      • field information: flow from definition to usage
    • versatility of EA
      • Valentin: separate SROA from EA

Discussion

  • Valentin: we want to have a "slot" in IRCode?
    • represents a mutable memory
    • might be useful for:
      • Tapir integration
      • stack allocation
    • i.e. may be able to reuse the slot2ssa system?
  • Valentin: array?
    • array optimization
    • target?
      • ref/load is all local
      • constant index
    • approaches?
      • Valentin: don't make array special!
      • LLVM can handle it nicely
      • Julia-level analysis: tell inter-procedural information!

"Non-semantic" copy for stack allocation

  • Problem:

    • so many escapes are through exceptions
    • in Julia an exception means "global escape"
  • Details:

    • conditions of "non-semantic" copy
      • no escape (including finalize call)
      • no lifetime overlap
      ​​​​​​local o
      ​​​​​​# finalizer(f, o) # may escape in the finalizer
      ​​​​​​try
      ​​​​​​    unsafe_op(o)
      ​​​​​​catch err
      ​​​​​​    # o.x = "bar" # lifetime overlap
      ​​​​​​end
      
    • but these two are NOT enough: other threads can access err through the evil jl_current_exception
  • Ideas ?

    • special errors ?
      • MethodError: static
      • BoundsError: runtime
    • move throw ?
      • as a LICM: but need to preserve the order
      • unroll loop by 1, and use the obtained constants
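For the loop example from the earlier discussion, a hand-written version of the "unroll by 1" idea (purely illustrative; whether the checked facts really carry over to the remaining iterations is the open question):

# hypothetical: peel the first iteration so its throwing path runs once,
# outside the loop; each iteration allocates the same shape, so the checks
# in the remainder could then be dropped together with the allocation
function sum_peeled(n)
    s = 0
    if n >= 1
        r = Ref(1)
        r[] < 0 && throw(r)  # rare throwing path, executed once up front
        s += r[]
        for i in 2:n
            s += i           # `Ref(i)` allocation and check eliminated here
        end
    end
    return s
end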

2021/11/10

Agenda

  • When to run EA
  • [ ] discussion on "non-semantic" copy

When to run EA

  • tl;dr: do escape analysis during optimization

  • reasons:

    • inference works on type domain vs. optimization works on SSA form
    • the current applications of escape analysis are mostly about optimizations:
      • SROA
      • ImmutableArray: copy elision
      • AD: sensitivity analysis
    • as a first step, it would be easier and cleaner if they are just separated
  • more thoughts:

    • inference already does some sort of alias analysis in forms of Const and PartialStruct
    • but they don't encode anything about CFG
    • encoding CFG information into type lattice would be far more complicated (like Conditional)
    • also, if we reason about inter-procedural mutation analysis, we also need to think about cycle detection
  • more details:

  • inter-procedural

    • how to cache the inter-procedural information ?
      • new field of CodeInstance ?
    • what kind of information
      • argument escapes ?
      • returned value escapes ?
      • side effect analysis ?
        • global side effect: read/write on global memory
          • e.g. if this loop writes to global memory it can't be hoisted; if not, start thinking about local effects
          • obvious example: stdout
        • local side effect: mutation on field
          • LICM
          • better DCE by load forwarding ?
        • query (a sketch follows at the end of this section):
          • input: MethodInstance, "x"
          • output: does this MethodInstance affect the state of "x"
        • side effect: a restriction on possible optimizations
        • synchronization optimization ?
      • what kind of optimizations should side-effect analysis target ?
      • reuse LLVM ?
        • LLVM-level purity vs. Julia-level purity
      • TODO: think more and more about applications of "side-effect" !
    • "recursion" matters ?
      • "recursion" means the same MethodInstance
      • it won't be very ideal if recursion => escape
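A sketch of the shape the mutation query above could take (every name below is hypothetical):

# hypothetical inter-procedural side-effect query
# input: a `MethodInstance` and the argument index standing for "x"
# output: may this call mutate state reachable from "x"?
function may_mutate(mi::Core.MethodInstance, argidx::Int)
    cache = lookup_escape_cache(mi)   # hypothetical: cached summary, if any
    cache === nothing && return true  # not analyzed yet: assume the worst
    return has_mutation(cache, argidx)
end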

2021/10/13

Agenda

  • more optimization ideas
  • how to get Ian involved with this project ?
  • aggressive DCE
    • what's needed ?
      • make clear the cost of updating domtree dynamically (benchmarking)
        • AbstractInterpreter can help benchmarking
      • if it turns out to be costly, how can we mitigate it ?
        • iterative processing to batch processing ?
        • just compute entire domtree in certain points ?
          • c.f. LLVM can reason about relations of various analyses to mitigate the cost of dynamic domtree update
      • correctness
  • (if time remains) Vararg handling is complicated, any hints ?

Optimization ideas

  • "allocation movement" (I'm not quite sure this naming is right)
    • idea: move allocation to enable further optimization (elaborated below)
  • memory-reuse:
    • idea: try to reuse the same memory for allocations that happen in a loop
      • we can optimize it when it doesn't escape the loop (use scope marker, like gcpreserve)
      • we want to make it control-flow sensitive
  • provide alias information to LLVM
    • background: LLVM-level TBAA pass: it just checks types
    • idea: enrich LLVM-level optimizations with Julia-level alias information
    • do we want to tell lifetime information also ?
    • how do we want to communicate to LLVM ?
      • attach metadata to the LLVM IR
      • customizable: metadata may be consumed by our own optimization passes
  • ImmutableArray

"allocation movement"

  • observations: we may not be able to prove effect-free-ness of something (e.g. typeassert)
    • but we may sometimes know it doesn't escape
    • and then we can try to "delay" allocation to where it can be thrown, and do optimization (maybe SROA ?) more aggressively
  • idea: "move allocation site to where it can be really needed"
    • should enable further SROA
    • we can have dynamic check to enable stack allocation ?
      • make "type-tag" keep information about allocation, and have a query to check if an object is on stack or heap
    • can be extended to optimize Arrays ?

successful SROA

julia> code_typed((Int,)) do n
           s = 0
           for i in 1:n
               r = Ref(i)
               # r[] < 0 && throw(r) # very rare condition
               s += r[]
           end
           s
       end |> only
... # no allocation of `Base.RefValue{Int}`

unsuccessful SROA because of the rare throw case (actually the throw is impossible in this case):

julia> code_typed((Int,)) do n
           s = 0
           for i in 1:n
               r = Ref(i)
               r[] < 0 && throw(r) # very rare condition
               s += r[]
           end
           s
       end |> only
CodeInfo(
1 ── %1  = Base.sle_int(1, n)::Bool
│    %2  = Base.ifelse(%1, n, 0)::Int64
│    %3  = Base.slt_int(%2, 1)::Bool
└───       goto #3 if not %3
2 ──       Base.nothing::Nothing
└───       goto #4
3 ──       goto #4
4 ┄─ %8  = φ (#2 => true, #3 => false)::Bool
│    %9  = φ (#3 => 1)::Int64
│    %10 = φ (#3 => 1)::Int64
│    %11 = Base.not_int(%8)::Bool
└───       goto #12 if not %11
5 ┄─ %13 = φ (#4 => %9, #11 => %30)::Int64
│    %14 = φ (#4 => %10, #11 => %31)::Int64
│    %15 = φ (#4 => 0, #11 => %23)::Int64
│    %16 = %new(Base.RefValue{Int64}, %13)::Base.RefValue{Int64}
│    %17 = Base.getfield(%16, :x)::Int64
│    %18 = Base.slt_int(%17, 0)::Bool
└───       goto #7 if not %18
6 ──       Main.throw(%16)::Union{}
└───       unreachable
7 ── %22 = Base.getfield(%16, :x)::Int64
│    %23 = Base.add_int(%15, %22)::Int64
│    %24 = (%14 === %2)::Bool
└───       goto #9 if not %24
8 ──       Base.nothing::Nothing
└───       goto #10
9 ── %28 = Base.add_int(%14, 1)::Int64
└───       goto #10
10 ┄ %30 = φ (#9 => %28)::Int64
│    %31 = φ (#9 => %28)::Int64
│    %32 = φ (#8 => true, #9 => false)::Bool
│    %33 = Base.not_int(%32)::Bool
└───       goto #12 if not %33
11 ─       goto #5
12 ┄ %36 = φ (#10 => %23, #4 => 0)::Int64
└───       return %36
) => Int64

fun fact: unoptimize_throw_blocks makes SROA sad

julia> code_typed((Int,)) do n
           s = 0
           for i in 1:n
               r = Ref(i)
               r[] < 0 && throw(r[]) # very rare condition
               s += r[]
           end
           s
       end |> only
CodeInfo(
1 ── %1  = Base.sle_int(1, n)::Bool
│    %2  = Base.ifelse(%1, n, 0)::Int64
│    %3  = Base.slt_int(%2, 1)::Bool
└───       goto #3 if not %3
2 ──       Base.nothing::Nothing
└───       goto #4
3 ──       goto #4
4 ┄─ %8  = φ (#2 => true, #3 => false)::Bool
│    %9  = φ (#3 => 1)::Int64
│    %10 = φ (#3 => 1)::Int64
│    %11 = Base.not_int(%8)::Bool
└───       goto #12 if not %11
5 ┄─ %13 = φ (#4 => %9, #11 => %31)::Int64
│    %14 = φ (#4 => %10, #11 => %32)::Int64
│    %15 = φ (#4 => 0, #11 => %24)::Int64
│    %16 = %new(Base.RefValue{Int64}, %13)::Base.RefValue{Int64}
│    %17 = Base.getfield(%16, :x)::Int64
│    %18 = Base.slt_int(%17, 0)::Bool
└───       goto #7 if not %18
6 ── %20 = Base.getindex(%16)::Any
│          Main.throw(%20)::Union{}
└───       unreachable
7 ── %23 = Base.getfield(%16, :x)::Int64
│    %24 = Base.add_int(%15, %23)::Int64
│    %25 = (%14 === %2)::Bool
└───       goto #9 if not %25
8 ──       Base.nothing::Nothing
└───       goto #10
9 ── %29 = Base.add_int(%14, 1)::Int64
└───       goto #10
10 ┄ %31 = φ (#9 => %29)::Int64
│    %32 = φ (#9 => %29)::Int64
│    %33 = φ (#8 => true, #9 => false)::Bool
│    %34 = Base.not_int(%33)::Bool
└───       goto #12 if not %34
11 ─       goto #5
12 ┄ %37 = φ (#10 => %24, #4 => 0)::Int64
└───       return %37
) => Int64

so what is needed ?

  • these cases are really similar in the sense that what they need is "lifetime" information ?
  • is the current EscapeLattice enough, or does it need more props ?
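As a strawman for the second question, one possible extension (field names invented for illustration; this is not the actual EscapeLattice of EscapeAnalysis.jl):

# hypothetical lattice element extended with lifetime information
struct EscapeLatticeWithLifetime
    Analyzed::Bool      # this value has been analyzed at all
    ReturnEscape::Bool  # may escape via the return value
    ThrownEscape::Bool  # may escape via an exception
    Liveness::BitSet    # statement indices where the value may still be live
end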

Vararg handling

julia> @noinline function foo(s...)
           string(s)
       end
foo (generic function with 1 method)

# when is `Vararg` introduced ?

julia> code_typed((String,)) do s
           foo(s)
       end
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = invoke Main.foo(_2::String)::String
└──      return %1
) => String

julia> code_typed((String,)) do s
           foo(s, s)
       end
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = invoke Main.foo(_2::String, _2::Vararg{String})::String
└──      return %1
) => String

julia> code_typed((String,)) do s
           foo(s, s, s)
       end
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = invoke Main.foo(_2::String, _2::Vararg{String}, _2)::String
└──      return %1
) => String

# it seems like inference/optimizer has dealt with `foo(::String, ::String)` !?

julia> only(methods(foo)).specializations
svec(MethodInstance for foo(::String), MethodInstance for foo(::String, ::String), MethodInstance for foo(::String, ::String, ::String), nothing, nothing, nothing, nothing, MethodInstance for foo(::String, ::Vararg{String}))

2021/10/06

  • plans
  • updates:
    • (mostly) debuginfo is available for EscapeAnalysis.jl

Next actions

  • Xuanda will take a look at more aggressive DCE
  • Shuhei will learn much more about the Julia runtime, try to fully understand optimization targets other than SROA and finalizer elision, and re-define the project targets and update the timeline if necessary
  • Shuhei will keep working on field analysis and enhanced SROA

DCE plan

  • step 1: typeassert can simply be eliminated
  • step 2: fix #27547 ( #37882 can be revived)
  • step 3: eliminate dead isdefined / isa / === branches (maybe <: too ?)
julia> code_typed() do
           local arr::Vector{Any}
           doit() do
               arr = Any[]
           end
           arr
       end |> only
CodeInfo(
1 ─ %1 = Core.Box::Type{Core.Box}
│   %2 = %new(%1)::Core.Box
│   %3 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Vector{Any}, svec(Any, Int64), 0, :(:ccall), Vector{Any}, 0, 0))::Vector{Any}
│        Core.setfield!(%2, :contents, %3)::Vector{Any}
│   %5 = Core.isdefined(%2, :contents)::Bool       # <= step 2
└──      goto #3 if not %5                         # <= step 2
2 ─      goto #4
3 ─      $(Expr(:throw_undef_if_not, :arr, false))::Any # <= step 2
4 ┄ %9 = Core.getfield(%2, :contents)::Any
│        Core.typeassert(%9, Vector{Any})::Vector{Any}  # <= step 1
│   %11 = π (%9, Vector{Any})
└──      return %11
) => Vector{Any}

More optimization ideas

  • move allocation (by copy) to where it can be thrown
    • observations: we may not be able to prove effect-free-ness of something (e.g. typeassert)

      • but we may sometimes know it doesn't escape
      • and we can try to "delay" allocation to where it can be thrown, and do optimization (maybe SROA ?) more aggressively
    • when it does escape in a limited scope, only allocate object when thrown

      ​​​​​​s = 0
      ​​​​​​for i in 1:n
      ​​​​​​    arr = [i]   # want to eliminate this allocation
      ​​​​​​    s += arr[1] # may throw,
      ​​​​​​end
      ​​​​​​s
      
      ​​​​​​let
      ​​​​​​    r = Ref(x)
      ​​​​​​    s = r::Base.RefValue{String}
      ​​​​​​    r[]
      ​​​​​​end
      
    • conditions ?

      • it shouldn't alias to anything
      • lifetime should end in optimization scope
    • dynamic check to enable stack allocation ?

      • make "type-tag" keep information about allocation
      • "pool-allocation"
        • optimize if an object is in the pool
        • make query to check whether it's allocated in the pool
        • when the GC allocates, it's going to allocate in the pool
        • (as an implementation detail of heap-to-stack alloc opt)
  • immutables don't need analysis ?
    ​​let
    ​​    # type-tag needs to be on stack (next to the object) so that GC can see it ..?
    ​​    # 1. immutable points to reference to heap object (`Any`)
    ​​    #    but the `Any` object might be able to be allocated on stack as well
    ​​    #    type-tag (size unknown ?)
    ​​    # 2. if immutable escapes, allocate it on the heap
    ​​    s = Some{Any}(10) # <= points to heap-allocated (if escapes)
    ​​end
    
  • AD may also possibly benefit from mutation analysis
    • Zygote: doesn't support
    • Diffractor: ???
    • Enzyme: use alias/escape analysis to find objects whose mutation order should be recorded
  • Julia's memory management/optimization
    • llvm-alloc-opt.cpp: SROA pass on heap memory
      • some LLVM has its own SROA pass on stack memory
    • gc.c: memory management (allocation, finalization)
      • defines standard runtime semantics
      • keywords:
        • mark-and-sweep
        • non-moving
        • tri-color
        • precise: never changes
        • generation
        • stop-the-world (vs. concurrent garbage collection)

2021/09/29

Agenda

  • agree on target optimizations (Targets)
  • plan collaboration (Development)
    • meeting schedule
    • discuss roles

more optimization ideas ?

  • alias analysis on Julia level
    • we have LLVM-level TBAA pass: it just checks types
    • we can tell more information to it
  • memory-reuse: try to reuse the same memory for allocations that happen in a loop
    • we can optimize it when it doesn't escape the loop (use scope marker, like gcpreserve)
    • we want to make it control-flow sensitive
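For illustration, the hand-written form of the reuse we would like the compiler to derive automatically (a sketch; the scope-marker API mentioned above is not designed yet):

# hypothetical: hoist one allocation out of the loop and reuse it, which is
# valid only if `buf` provably never escapes a single iteration
function sum_with_reuse(n)
    buf = Ref(0)   # one allocation instead of `n`
    s = 0
    for i in 1:n
        buf[] = i  # reuse the same memory every iteration
        s += buf[]
    end
    return s
end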

  1. This optimization can be developed standalone. For example, we can eliminate the else branch in the following code snippet w/o escape analysis if we have more aggressive DCE:

    ​​​​​​julia> code_typed(()) do
    ​​​​​​           r = Ref{Any}(42)
    ​​​​​​           a = r[]
    ​​​​​​           if isa(a, Int)
    ​​​​​​               return 0
    ​​​​​​           else
    ​​​​​​               return nothing
    ​​​​​​           end
    ​​​​​​       end
    ​​​​​​1-element Vector{Any}:
    ​​​​​​CodeInfo(
    ​​​​​​1 ─ %1 = (42 isa Main.Int)::Bool
    ​​​​​​└──      goto #3 if not %1
    ​​​​​​2 ─      return 0
    ​​​​​​3 ─      return Main.nothing
    ​​​​​​) => Union{Nothing, Int64}
    
    ↩︎
  2. There seems to be a way to circumvent this escape. Elaborated here. ↩︎
