
Julia Escape Analysis Project

  • Team: Shuhei Kadowaki, Jameson Nash
  • Meeting:
  • Scope:
    • implement a strong Julia-level escape analysis
    • implement memory-optimizations based on the analysis
    • and more ?
      • feedback escape information to inference ?
      • summarize performance improvements, and write up a paper ?
  • Timeline: upstream to Julia base by the end of the year

Targets

As midterm objectives of this project, we will try to implement the following optimizations:

  • Enhanced SROA & DCE
  • Heap-to-stack allocation conversion
  • (Allocation movement ?)
  • finalizer elision

Enhanced SROA & DCE

The current state

Julia's compiler does SROA (Scalar-Replacement-of-Aggregates) optimization on two levels:

  1. on Julia-level IR: getfield_elim_pass!
  2. on LLVM-IR: llvm-alloc-opt.cpp (Julia's own pass over heap allocations; LLVM's stock SROA pass additionally handles stack memory)

The purpose of this optimization is to enhance the Julia-level SROA with escape analysis.

The current SROA pass can be described as follows (per #42357):

​​​​getfield_elim_pass!(ir::IRCode) -> newir::IRCode

getfield elimination pass, a.k.a. Scalar Replacements of Aggregates optimization.

This pass is based on a local alias analysis that collects field information by def-use chain walking. It looks for struct allocation sites ("definitions"), and getfield calls as well as :foreigncalls that preserve the structs ("usages"). If "definitions" have enough information, then this pass will replace corresponding usages with lifted values. mutable structs require additional care and need to be handled separately from immutables. For mutable structs, setfield! calls also account for "definitions", and the pass should conservatively give up the lifting when there are any "intermediate usages" that may escape the mutable struct (e.g. a non-inlined generic function call that takes the mutable struct as its argument).

When all usages are fully eliminated, the struct allocation may also be erased as a result of dead code elimination.

The important observations are:

  • successful SROA can sometimes even eliminate allocations entirely as a result of dead code elimination
  • getfield_elim_pass! is limited in a sense that:
    • it can't reason about anything inter-procedural
    • it requires "safe" allocation sites to kick off the whole optimization

Let's see successful examples first, and then understand limitations.

Here, the allocation site r = Ref{Int}(0) is "safe" and has enough information: r[] (inlined to getfield) will be eliminated, and even more, the allocation and setfield! are voided out too:

julia> code_typed() do
           r = Ref{Int}(0)
           r[] = 42
           b = sin(r[])
           return b
       end |> only
CodeInfo(
1 ─ %1 = Base.sitofp(Float64, 42)::Float64
│   %2 = invoke Base.Math.sin(%1::Float64)::Float64
└──      return %2
) => Float64

getfield_elim_pass! requires a "safe" allocation. So if we initialize r = Ref{Int}() (leaving the field undefined), then the whole optimization won't work:

julia> code_typed() do
           r = Ref{Int}()
           r[] = 42
           b = sin(r[])
           return b
       end |> only
CodeInfo(
1 ─ %1 = %new(Base.RefValue{Int64})::Base.RefValue{Int64}
│        Base.setfield!(%1, :x, 42)::Int64
│   %3 = Base.getfield(%1, :x)::Int64
│   %4 = Base.sitofp(Float64, %3)::Float64
│   %5 = invoke Base.Math.sin(%4::Float64)::Float64
└──      return %5
) => Float64

Still, the later LLVM-level SROA may be able to optimize it further, and it seems to be more powerful than the Julia-level one in general; e.g. it can "fully" optimize the above code (the argument of @j_sin_4761 is replaced with the scalar value double 4.200000e+01, and we don't have any allocation):

julia> code_llvm(()) do
           r = Ref{Int}()
           r[] = 42
           b = sin(r[])
           return b
       end
;  @ none:2 within `#49`
define double @"julia_#49_4759"() #0 {
top:
;  @ none:4 within `#49`
; ┌ @ math.jl:1228 within `sin`
   %0 = call double @j_sin_4761(double 4.200000e+01) #0
; └
;  @ none:5 within `#49`
  ret double %0
}

Now let's bother LLVM: if we just add a non-inlined function call, then LLVM can't help giving up the optimization:

julia> @noinline setter!(r, x) = r[] = x
setter! (generic function with 1 method)

julia> code_llvm(()) do
           r = Ref{Int}()
           setter!(r, 42)
           b = sin(r[])
           return b
       end
;  @ none:2 within `#45`
define double @"julia_#45_4943"() #0 {
top:
  %gcframe7 = alloca [3 x {}*], align 16
  %gcframe7.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 0
  %0 = bitcast [3 x {}*]* %gcframe7 to i8*
  call void @llvm.memset.p0i8.i32(i8* nonnull align 16 dereferenceable(24) %0, i8 0, i32 24, i1 false)
  %1 = call {}*** inttoptr (i64 140733735208165 to {}*** (i64)*)(i64 261) #6
; ┌ @ refpointer.jl:135 within `Ref`
; │┌ @ refvalue.jl:7 within `RefValue`
    %2 = bitcast [3 x {}*]* %gcframe7 to i64*
    store i64 4, i64* %2, align 16
    %3 = load {}**, {}*** %1, align 8
    %4 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 1
    %5 = bitcast {}** %4 to {}***
    store {}** %3, {}*** %5, align 8
    %6 = bitcast {}*** %1 to {}***
    store {}** %gcframe7.sub, {}*** %6, align 8
    %ptls_field3 = getelementptr inbounds {}**, {}*** %1, i64 2305843009213693954
    %7 = bitcast {}*** %ptls_field3 to i8**
    %ptls_load45 = load i8*, i8** %7, align 8
    %8 = call noalias nonnull {}* @jl_gc_pool_alloc(i8* %ptls_load45, i32 1392, i32 16) #2
    %9 = bitcast {}* %8 to i64*
    %10 = getelementptr inbounds i64, i64* %9, i64 -1
    store atomic i64 4617487696, i64* %10 unordered, align 8
    %11 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 2
    store {}* %8, {}** %11, align 16
; └└
;  @ none:3 within `#45`
  %12 = call i64 @"j_setter!_4945"({}* nonnull %8, i64 signext 42) #0
;  @ none:4 within `#45`
; ┌ @ refvalue.jl:56 within `getindex`
; │┌ @ Base.jl:38 within `getproperty`
    %13 = load i64, i64* %9, align 8
; └└
; ┌ @ math.jl:1232 within `sin`
; │┌ @ float.jl:269 within `float`
; ││┌ @ float.jl:243 within `AbstractFloat`
; │││┌ @ float.jl:146 within `Float64`
      %14 = sitofp i64 %13 to double
; │└└└
; │ @ math.jl:1234 within `sin`
   %15 = call double @j_sin_4946(double %14) #0
   %16 = load {}*, {}** %4, align 8
   %17 = bitcast {}*** %1 to {}**
   store {}* %16, {}** %17, align 8
; └
;  @ none:5 within `#45`
  ret double %15
}

LLVM can also be confused by Julia-level semantics other than inter-procedural calls. For example, typeassert can pretty much null out LLVM SROA. In the following code snippet, neither Julia-level nor LLVM-level SROA can eliminate the closure construction:

julia> doit(f, args...) = f(args...)

julia> code_typed() do
           local arr::Vector{Any}
           doit() do
               arr = Any[]
           end
           arr
       end |> only
CodeInfo(
1 ─ %1 = Core.Box::Type{Core.Box}
│   %2 = %new(%1)::Core.Box
│   %3 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Vector{Any}, svec(Any, Int64), 0, :(:ccall), Vector{Any}, 0, 0))::Vector{Any}
│        Core.setfield!(%2, :contents, %3)::Vector{Any}
│   %5 = Core.isdefined(%2, :contents)::Bool
└──      goto #3 if not %5
2 ─      goto #4
3 ─      $(Expr(:throw_undef_if_not, :arr, false))::Any
4 ┄ %9 = Core.getfield(%2, :contents)::Any
│        Core.typeassert(%9, Vector{Any})::Vector{Any}
│   %11 = π (%9, Vector{Any})
└──      return %11
) => Vector{Any}

julia> code_llvm(()) do
           local arr::Vector{Any}
           doit() do
               arr = Any[]
           end
           arr
       end
;  @ none:2 within `#55`
define nonnull {}* @"julia_#55_4772"() #0 {
pass:
;  @ none:3 within `#55`
; ┌ @ none:1 within `doit`
; │┌ @ none:4 within `#56`
; ││┌ @ array.jl:423 within `getindex`
; │││┌ @ boot.jl:471 within `Array` @ boot.jl:452
      %0 = call nonnull {}* inttoptr (i64 4483450916 to {}* ({}*, i64)*)({}* inttoptr (i64 4648194800 to {}*), i64 0)
; └└└└
;  @ none:6 within `#55`
  %1 = bitcast {}* %0 to i64*
  %2 = getelementptr inbounds i64, i64* %1, i64 -1
  %3 = load atomic i64, i64* %2 unordered, align 8
  %4 = and i64 %3, -16
  %5 = inttoptr i64 %4 to {}*
  %6 = icmp eq {}* %5, inttoptr (i64 4648194800 to {}*)
  br i1 %6, label %pass3, label %fail2

fail2:                                            ; preds = %pass
  call void @jl_type_error(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @_j_str1, i64 0, i64 0), {}* inttoptr (i64 4648194800 to {}*), {}* %0)
  unreachable

pass3:                                            ; preds = %pass
  ret {}* %0
}

# interesting fact:
# LLVM can optimize it away if we strip out the type declaration !

julia> code_llvm(()) do
           local arr
           doit() do
               arr = Any[]
           end
           arr
       end
;  @ none:2 within `#55`
define nonnull {}* @"julia_#55_4959"() #0 {
pass:
;  @ none:3 within `#55`
; ┌ @ none:1 within `doit`
; │┌ @ none:4 within `#56`
; ││┌ @ array.jl:423 within `getindex`
; │││┌ @ boot.jl:471 within `Array` @ boot.jl:452
      %0 = call nonnull {}* inttoptr (i64 4431128612 to {}* ({}*, i64)*)({}* inttoptr (i64 4614886128 to {}*), i64 0)
; └└└└
;  @ none:6 within `#55`
  ret {}* %0
}

Enhanced SROA & More aggressive DCE

We want "better" Julia-level SROA and DCE. Especially, it should be:

  • (inter-procedural) new SROA should analyze aliases and escapes better, and work better with non-inlined calls
  • (more aggressive) new DCE should be able to null out typeassert/isdefined/isa checks more aggressively so that we can eliminate more allocations[1]

Target examples:

  • propagate inter-procedural memory-effect:
    ​​julia> @noinline setter!(r, x) = r[] = x
    ​​setter! (generic function with 1 method)
    
    ​​julia> code_typed() do
    ​​           r = Ref{Int}()
    ​​           setter!(r, 42)
    ​​           b = sin(r[])   # should be turned into `sin(42)`
    ​​           return b
    ​​       end |> only
    ​​CodeInfo(
    ​​1 ─ %1 = %new(Base.RefValue{Int64})::Base.RefValue{Int64}
    ​​│        invoke Main.setter!(%1::Base.RefValue{Int64}, 42::Int64)::Any
    ​​│   %3 = Base.getfield(%1, :x)::Int64
    ​​│   %4 = Base.sitofp(Float64, %3)::Float64
    ​​│   %5 = invoke Base.Math.sin(%4::Float64)::Float64
    ​​└──      return %5
    ​​) => Float64
    
  • (even local analysis improvements):
    ​​# `RefValue{String}` allocation won't be eliminated
    ​​(s::String) -> broadcast(identity, Ref(s))
    
    ​​# abstract-interpretation based field information propagation
    ​​# should allow us to optimize this IR
    ​​2 1 ─ %1 = %new(Base.RefValue{String}, _2)::Base.RefValue{String}                                                                                                                               │╻╷╷     Ref
    ​​│   %2 = Core.tuple(%1)::Tuple{Base.RefValue{String}}                                                                                                                                         │╻       broadcast
    ​​│   %3 = Core.getfield(%2, 1)::Base.RefValue{String}                                                                                                                                          ││
    ​​└──      goto #3 if not true                                                                                                                                                                  ││╻╷      materialize
    ​​2 ─      nothing::Nothing                                                                                                                                                                     │
    ​​3 ┄ %6 = Base.getfield(%3, :x)::String                                                                                                                                                        │││╻╷╷╷╷   copy
    ​​└──      goto #4                                                                                                                                                                              ││││┃       getindex
    ​​4 ─      goto #5                                                                                                                                                                              ││││
    ​​5 ─      goto #6                                                                                                                                                                              │││
    ​​6 ─      goto #7                                                                                                                                                                              ││
    ​​7 ─      return %6                                                                                                                                                                            │
    
  • nullify typeassert check:
    ​​julia> code_typed() do
    ​​           r = Ref{Any}()
    ​​           r[] = 42
    ​​           # this assertion is useful for inference
    ​​           # but "harmful" for LLVM
    ​​           return r[]::Int
    ​​       end
    ​​1-element Vector{Any}:
    ​​ CodeInfo(
    ​​1 ─ %1 = %new(Base.RefValue{Any})::Base.RefValue{Any}
    ​​│        Base.setfield!(%1, :x, 42)::Int64
    ​​│   %3 = Base.getfield(%1, :x)::Any
    ​​│        Core.typeassert(%3, Main.Int)::Int64
    ​​│   %5 = π (%3, Int64)
    ​​└──      return %5
    ​​) => Int64
    

Heap to stack allocation conversion

The idea

Julia allocates every (actually allocated) mutable object on the heap, while in an ideal world we would place them on the stack if their lifetime doesn't escape the execution frame.

An obvious example would be the following: Julia will allocate MyRef("foo") even though it doesn't escape anywhere, because the @noinline'd getx call will prevent both Julia-level and LLVM-level SROA, and they can't eliminate the allocation:

julia> mutable struct MyRef{T}; x::T; end

julia> @noinline getx(x::MyRef) = x.x
getx (generic function with 1 method)

julia> code_typed(()) do
           r = MyRef("foo")
           x = getx(r)
           sizeof(x)
       end |> only
CodeInfo(
1 ─ %1 = %new(MyRef{String}, "foo")::MyRef{String}
│   %2 = invoke Main.getx(%1::MyRef{String})::String
│   %3 = Core.sizeof::typeof(Core.sizeof)
│   %4 = (%3)(%2)::Int64
└──      return %4
) => Int64

julia> code_llvm(()) do
           r = MyRef("foo")
           x = getx(r)
           sizeof(x)
       end
;  @ none:2 within `#75`
define i64 @"julia_#75_4792"() #0 {
top:
  %0 = alloca {}*, align 8
  %gcframe7 = alloca [3 x {}*], align 16
  %gcframe7.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 0
  %1 = bitcast [3 x {}*]* %gcframe7 to i8*
  call void @llvm.memset.p0i8.i32(i8* nonnull align 16 dereferenceable(24) %1, i8 0, i32 24, i1 false)
  %2 = call {}*** inttoptr (i64 140733733901541 to {}*** (i64)*)(i64 261) #7
; ┌ @ none:1 within `MyRef` @ none:1
   %3 = bitcast [3 x {}*]* %gcframe7 to i64*
   store i64 4, i64* %3, align 16
   %4 = load {}**, {}*** %2, align 8
   %5 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 1
   %6 = bitcast {}** %5 to {}***
   store {}** %4, {}*** %6, align 8
   %7 = bitcast {}*** %2 to {}***
   store {}** %gcframe7.sub, {}*** %7, align 8
   %ptls_field3 = getelementptr inbounds {}**, {}*** %2, i64 2305843009213693954
   %8 = bitcast {}*** %ptls_field3 to i8**
   %ptls_load45 = load i8*, i8** %8, align 8
   %9 = call noalias nonnull {}* @jl_gc_pool_alloc(i8* %ptls_load45, i32 1392, i32 16) #2
   %10 = bitcast {}* %9 to i64*
   %11 = getelementptr inbounds i64, i64* %10, i64 -1
   store atomic i64 5398336928, i64* %11 unordered, align 8
   %12 = bitcast {}* %9 to {}**
   store {}* inttoptr (i64 5376165704 to {}*), {}** %12, align 8
   %13 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 2
   store {}* %9, {}** %13, align 16
; └
;  @ none:3 within `#75`
  store {}* %9, {}** %0, align 8
  %14 = call nonnull {}* @j1_getx_4794({}* inttoptr (i64 5377254840 to {}*), {}** nonnull %0, i32 1)
;  @ none:4 within `#75`
; ┌ @ Base.jl:175 within `sizeof`
   %15 = bitcast {}* %14 to i64*
   %16 = load i64, i64* %15, align 8
   %17 = load {}*, {}** %5, align 8
   %18 = bitcast {}*** %2 to {}**
   store {}* %17, {}** %18, align 8
; └
  ret i64 %16
}

The basic idea of this optimization is to modify the LLVM IR so that we can allocate MyRef("foo") on the stack rather than the heap.

Do we really need this ?

We worked on this optimization in the first project iteration (the GSoC term), but for now, I'd like to put a lower priority on it:

  • (SROA and DCE) as explained above, successful SROA can eliminate allocations entirely, so enhanced SROA can subsume a good part of this optimization; e.g. for this code snippet, we can just convert
    ​​let
    ​​    r = MyRef("foo")
    ​​    x = getx(r)
    ​​    sizeof(x)
    ​​end
    
    to
    ​​let
    ​​    sizeof("foo")
    ​​end
    
    in the enhanced SROA pass, and then there is no need for this heap-to-stack conversion
  • (GC interaction) we need to make (complicated) design choices, mainly around interactions with the GC (Jameson can elaborate on this)
  • (throw exception) the current throwness check is not so accurate, and we often need to conservatively account for escapes through exceptions, which makes stack allocation harder[2]

Allocation movement

TODO: elaborate this.

finalizer elision

Julia finalizers are managed by the GC, so they may hang around for a while as the GC lets the mutable object hang around. This means we may end up consuming much memory when finalizers are associated with e.g. resource management, which is a common practice seen e.g. in GPU programming.

The idea of this optimization is to call a finalizer early by inserting a finalize(obj) call when we can prove the lifetime of obj finishes within a frame:

julia> mutable struct WithFinalizer
           v
           function WithFinalizer(v)
               x = new(v)
               f(t) = @async println("Finalizing $t.")
               return finalizer(f, x)
           end
       end

julia> make_obj(v = 10) = WithFinalizer(v)
make_obj (generic function with 1 method)

julia> let
           o = make_obj()
           copy(o.v) # <= automatically insert `finalize(o)` here
       end
10

Considerations

  • can we just call finalize(obj) ?
    • finalize(obj) is super costly ! (see the sketch after this list)
      1. we want to "enqueue" the finalizer (per Valentin)
      2. make it an intrinsic and make it fast
  • how can we find objects that have finalizers ?
  • do we need to account for possible exceptions within the finalizer call itself ?
  • do we want to have flow-sensitivity (i.e. insert finalize at the return site) ?
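For illustration, a hand-written sketch of the intended transformation for the example above (this uses Base's existing finalize; the cheap "enqueue" intrinsic is still hypothetical):

function use_obj()
    o = make_obj()
    r = copy(o.v)
    # inserted by the compiler once it proves the lifetime of `o` ends here;
    # ideally this would lower to a cheap "enqueue the finalizer" intrinsic
    # rather than the eager (and super costly) `finalize(o)`
    finalize(o)
    return r
end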

Pipeline Design

  1. type inference
  2. optimization
    1. preps
    2. inlining
    3. escape analysis
    4. optimization passes
      1. enhanced SROA
      2. aggressive dead code elimination
      3. finalizer elision
  3. caching

Even if we have a useful inter-procedural cache, escape analysis and SROA are better run after inlining, simply because inter-procedural handling is much easier there: we can just look for :invoke expressions that the inliner has resolved.
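As a minimal sketch, the envisioned ordering could look like the following (all function names here are hypothetical placeholders mirroring the list above, not the actual Base.Compiler API):

# a sketch of the optimization pipeline after type inference
function run_passes_with_ea(ir)
    ir = inline_pass(ir)               # 2-2. inlining resolves calls into `:invoke`s
    estate = analyze_escapes(ir)       # 2-3. escape analysis on post-inlining IR
    ir = enhanced_sroa(ir, estate)     # 2-4-1. enhanced SROA
    ir = aggressive_dce(ir, estate)    # 2-4-2. aggressive DCE
    ir = finalizer_elision(ir, estate) # 2-4-3. finalizer elision
    return ir
end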

Caching:

  • create a new field of CodeInstance, and
  • pack (minimized) escape information into it, as we do for the inference result.
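A minimal sketch of what such a cache entry could look like (the struct and its encoding are invented for illustration; the real format should be minimized, e.g. bit-packed):

# hypothetical per-CodeInstance escape summary
struct EscapeCache
    retescape::UInt8           # how the returned value escapes
    argescapes::Vector{UInt8}  # per-argument summary, bit-packed, e.g.
                               # 0x01 = returned, 0x02 = thrown, 0x04 = global
end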

Development Plan

Schedule

  • 21/09 (kick-off): re-define targets, setup milestones
  • 21/10 (stage 1): implement as "dummy" AbstractInterpreter
    • optimizations
      • enhanced SROA
      • aggressive DCE
      • finalizer elision prototyping
    • analysis
  • 21/11: (stage 2) executable AbstractInterpreter
    • keep working on the stage 1 leftovers
    • build compiler-plugin infrastructure from scratch to actually execute that pipeline
  • 21/12: (stage 3) make a PR against Julia Base, finish reviews

Discussions

NOTE: the agenda for the most recent meeting comes first.

2021/12/17

  • For ImmutableArray

    • mutating_arrayfreeze
    • lifetime: whether or not mutable array is "mutated"
  • alias analysis?

    • about accuracy
    • may not need so much?
  • BoundsError

    • if only escape is from BoundsError
    • language semantics requires "non-semantic copy"
  • lifetime information?

    • might not be too important?
    • but it's very rare!
  • focus:

    • phase 1: reviving ImmutableArray
    • phase 2: LLVM escape
      • limitations: Union-type, inter-procedural
  • ImmutableArray PR

  • taint analysis

2021/11/24

Agenda

  • Implementation of field/alias analysis for enhanced SROA
  • "Non-semantic" copy for stack allocation

Implementation of field/alias analysis for enhanced SROA

Allocation-elimination targets

:arrow_double_up: easy / :arrow_double_down: hard:

  • simplest SROA
    ​​code_typed((Any,)) do a
    ​​     r = Ref{Any}(a)
    ​​     r[] # => a
    ​​end
    
  • domination-aware SROA
    ​​code_typed((Any,Any)) do a, b
    ​​     r = Ref{Any}(a)
    ​​     r[] = b
    ​​     r[] # => b
    ​​end
    ​​
    ​​# uninitialized field
    ​​# https://github.com/JuliaLang/julia/pull/42834
    ​​code_typed((Any,)) do a
    ​​     r = Ref{Any}()
    ​​     r[] = a
    ​​     r[] # => a
    ​​end
    ​​
    ​​code_typed((Bool,Any,Any)) do cond, a, b
    ​​     r = Ref{Any}()
    ​​     if cond
    ​​         r[] = a
    ​​         r[] # => a
    ​​     else
    ​​         r[] = b
    ​​         r[] # => b
    ​​     end
    ​​end
    
  • φnode-fying SROA
    ​​code_typed((Bool,Any,Any)) do cond, a, b
    ​​     r = Ref{Any}(nothing)
    ​​     if cond
    ​​         r[] = a
    ​​     else
    ​​         r[] = b
    ​​     end
    ​​     r[] # => φ(a, b)
    ​​end
    ​​
    ​​# uninitialized field
    ​​# https://github.com/JuliaLang/julia/pull/43208
    ​​code_typed((Bool,Any,Any)) do cond, a, b
    ​​     r = Ref{Any}()
    ​​     if cond
    ​​         r[] = a
    ​​     else
    ​​         r[] = b
    ​​     end
    ​​     r[] # => φ(a, b)
    ​​end
    
  • alias-aware SROA (nested load elimination)
    ​​# `s -> broadcast(identity, Ref(s))`
    ​​code_typed((String,)) do s
    ​​    r = Ref(s)
    ​​    t = (r,)
    ​​    t[1][] # => s
    ​​end
    ​​
    ​​code_typed() do
    ​​    local arr::Vector{Any}
    ​​    alloc! = () -> arr = Any[]
    ​​    alloc!()
    ​​    arr
    ​​end
    
    ​​code_typed((Bool,Any,Any)) do cond, a, b
    ​​    r1 = Ref{Any}(a)
    ​​    r2 = Ref{Any}(b)
    ​​    (cond ? r1 : r2)[] # φ(a, b)
    ​​    # or => (cond ? r1[] : r2[])
    ​​end
    
  • "partial" SROA
    ​​code_typed((Bool,Any,)) do rare_cond, a
    ​​    r = Ref{Any}(a)
    ​​    if rare_cond
    ​​        throw(r)   # move `r` allocation here
    ​​    else
    ​​        return r[] # => a
    ​​    end
    ​​end
    
  • stack allocation
    ​​@noinline safe_op(a) = ...
    ​​code_typed((Any,)) do a
    ​​    r = Ref{Any}(a) # can be allocated on stack
    ​​    safe_op(r)
    ​​end
    

Design of analysis/optimization

  1. current Julia Base SROA
​​[SROA]
​​    [simple analysis]
​​    [domination analysis]
​​    [actual code transformation]
  2. multiple SROA: benchmark

  3. backward escape/alias/field analysis & load forwarding by domination analysis

    • https://github.com/aviatesk/EscapeAnalysis.jl/pull/54
    • overview
      ​​​​​​# analysis
      ​​​​​​while [converge]
      ​​​​​​    [backward escape/alias/field analysis]
      ​​​​​​end
      ​​​​​​
      ​​​​​​# transformation
      ​​​​​​[domination analysis]
      ​​​​​​[SROA]
      ​​​​​​    [load forwarding]
      ​​​​​​    [DCE]
      
  4. backward escape/alias analysis & forward field propagation

    • EscapeAnalysis.jl/avi/flowsensitive-aliasanalysis
    • overview:
      ​​​​​​# analysis
      ​​​​​​while [converge]
      ​​​​​​    [backward escape/alias analysis]
      ​​​​​​    while [converge]
      ​​​​​​        [forward field information propagation]
      ​​​​​​    end
      ​​​​​​end
      ​​​​​​
      ​​​​​​# transformation
      ​​​​​​[SROA]
      ​​​​​​    [simple load replacement]
      ​​​​​​    [DCE]
      
  5. forward escape/alias/field analysis

    • JVM compilers' approach
    • overview
      ​​​​​​# analysis
      ​​​​​​while [converge]
      ​​​​​​    [forward escape/alias analysis]
      ​​​​​​    [forward field information propagation]
      ​​​​​​end
      ​​​​​​
      ​​​​​​# transformation
      ​​​​​​[SROA]
      ​​​​​​    [simple load replacement]
      ​​​​​​    [DCE]
      
  • Points
    • complexity of domination/control-flow
      • "last" definition
      • field merge (φnode-fying)
      • uninitialized field
    • conflicts between escape information and field information
      • escape information: flow from usage to definition
      • field information: flow from definition to usage
    • versatility of EA
      • Valentin: separate SROA from EA

Discussion

  • Valentin: we want to have a "slot" in IRCode?
    • represents a mutable memory
    • might be useful for:
      • Tapir integration
      • stack allocation
    • i.e. may be able to reuse the slot2ssa system?
  • Valentin: array?
    • array optimization
    • target?
      • ref/load is all local
      • constant index
    • approaches?
      • Valentin: don't make array special!
      • LLVM can handle it nicely
      • Julia-level analysis: tell inter-procedural information!

"Non-semantic" copy for stack allocation

  • Problem:

    • so many escapes are through exceptions
    • in Julia an exception means "global escape"
  • Details:

    • conditions of "non-semantic" copy
      • no escape (including finalize call)
      • no lifetime overlap
      ​​​​​​local o
      ​​​​​​# finalizer(f, o) # may escape in the finalizer
      ​​​​​​try
      ​​​​​​    unsafe_op(o)
      ​​​​​​catch err
      ​​​​​​    # o.x = "bar" # lifetime overlap
      ​​​​​​end
      
    • but these two are NOT enough: other threads can access err through the evil jl_current_exception
  • Ideas ?

    • special errors ?
      • MethodError: static
      • BoundsError: runtime
    • move throw ?
      • as a LICM: but need to preserve the order
      • unroll loop by 1, and use the obtained constants
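For the loop example from the earlier discussion, a hand-written version of the "unroll by 1" idea (purely illustrative; whether the checked facts really carry over to the remaining iterations is the open question):

# hypothetical: peel the first iteration so its throwing path runs once,
# outside the loop; each iteration allocates the same shape, so the checks
# in the remainder could then be dropped together with the allocation
function sum_peeled(n)
    s = 0
    if n >= 1
        r = Ref(1)
        r[] < 0 && throw(r)  # rare throwing path, executed once up front
        s += r[]
        for i in 2:n
            s += i           # `Ref(i)` allocation and check eliminated here
        end
    end
    return s
end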

2021/11/10

Agenda

  • When to run EA
  • [ ] discussion on "non-semantic" copy

When to run EA

  • tl;dr: do escape analysis during optimization

  • reasons:

    • inference works on type domain vs. optimization works on SSA form
    • the current applications of escape analysis are mostly about optimizations:
      • SROA
      • ImmutableArray: copy elision
      • AD: sensitivity analysis
    • as a first step, it would be easier and cleaner if they are just separated
  • more thoughts:

    • inference already does some sort of alias analysis in forms of Const and PartialStruct
    • but they don't encode anything about CFG
    • encoding CFG information into type lattice would be far more complicated (like Conditional)
    • also, if we reason about inter-procedural mutation analysis, we also need to think about cycle detection
  • more details:

  • inter-procedural

    • how to cache the inter-procedural information ?
      • new field of CodeInstance ?
    • what kind of information
      • argument escapes ?
      • returned value escapes ?
      • side effect analysis ?
        • global side effect: read/write on global memory
          • e.g. if this loop writes to global memory it can't be hoisted; if not, start thinking about local effects
          • obvious example: stdout
        • local side effect: mutation on field
          • LICM
          • better DCE by load forwarding ?
        • query (a sketch follows at the end of this section):
          • input: MethodInstance, "x"
          • output: does this MethodInstance affect the state of "x"
        • side effect: a restriction on possible optimizations
        • synchronization optimization ?
      • what kind of optimizations should side-effect analysis target ?
      • reuse LLVM ?
        • LLVM-level purity vs. Julia-level purity
      • TODO: think more and more about applications of "side-effect" !
    • "recursion" matters ?
      • "recursion" means the same MethodInstance
      • it won't be very ideal if recursion => escape
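A sketch of the shape the mutation query above could take (every name below is hypothetical):

# hypothetical inter-procedural side-effect query
# input: a `MethodInstance` and the argument index standing for "x"
# output: may this call mutate state reachable from "x"?
function may_mutate(mi::Core.MethodInstance, argidx::Int)
    cache = lookup_escape_cache(mi)   # hypothetical: cached summary, if any
    cache === nothing && return true  # not analyzed yet: assume the worst
    return has_mutation(cache, argidx)
end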

2021/10/13

Agenda

  • more optimization ideas
  • how to get Ian involved with this project ?
  • aggressive DCE
    • what's needed ?
      • make clear the cost of updating domtree dynamically (benchmarking)
        • AbstractInterpreter can help benchmarking
      • if it turns out to be costly, how can we mitigate it ?
        • iterative processing to batch processing ?
        • just compute entire domtree in certain points ?
          • c.f. LLVM can reason about relations of various analyses to mitigate the cost of dynamic domtree update
      • correctness
  • (if time remains) Vararg handling is complicated, any hints ?

Optimization ideas

  • "allocation movement" (I'm not quite sure this naming is right)
    • idea: move allocation to enable further optimization (elaborated below)
  • memory-reuse:
    • idea: try to reuse the same memory for allocations that happen in a loop
      • we can optimize it when it doesn't escape the loop (use scope marker, like gcpreserve)
      • we want to make it control-flow sensitive
  • provide alias information to LLVM
    • background: LLVM-level TBAA pass: it just checks types
    • idea: enrich LLVM-level optimizations with Julia-level alias information
    • do we want to tell lifetime information also ?
    • how do we want to communicate to LLVM ?
      • attach metadata to the LLVM IR
      • customizable: metadata may be consumed by our own optimization passes
  • ImmutableArray

"allocation movement"

  • observations: we may not be able to prove effect-free-ness of something (e.g. typeassert)
    • but we may sometimes know it doesn't escape
    • and then we can try to "delay" allocation to where it can be thrown, and do optimization (maybe SROA ?) more aggressively
  • idea: "move allocation site to where it can be really needed"
    • should enable further SROA
    • we can have dynamic check to enable stack allocation ?
      • make "type-tag" keep information about allocation, and have a query to check if an object is on stack or heap
    • can be extended to optimize Arrays ?

successful SROA

julia> code_typed((Int,)) do n
           s = 0
           for i in 1:n
               r = Ref(i)
               # r[] < 0 && throw(r) # very rare condition
               s += r[]
           end
           s
       end |> only
... # no allocation of `Base.RefValue{Int}`

unsuccessful SROA because of the rare throw case (actually the throw is impossible in this case):

julia> code_typed((Int,)) do n
           s = 0
           for i in 1:n
               r = Ref(i)
               r[] < 0 && throw(r) # very rare condition
               s += r[]
           end
           s
       end |> only
CodeInfo(
1 ── %1  = Base.sle_int(1, n)::Bool
│    %2  = Base.ifelse(%1, n, 0)::Int64
│    %3  = Base.slt_int(%2, 1)::Bool
└───       goto #3 if not %3
2 ──       Base.nothing::Nothing
└───       goto #4
3 ──       goto #4
4 ┄─ %8  = φ (#2 => true, #3 => false)::Bool
│    %9  = φ (#3 => 1)::Int64
│    %10 = φ (#3 => 1)::Int64
│    %11 = Base.not_int(%8)::Bool
└───       goto #12 if not %11
5 ┄─ %13 = φ (#4 => %9, #11 => %30)::Int64
│    %14 = φ (#4 => %10, #11 => %31)::Int64
│    %15 = φ (#4 => 0, #11 => %23)::Int64
│    %16 = %new(Base.RefValue{Int64}, %13)::Base.RefValue{Int64}
│    %17 = Base.getfield(%16, :x)::Int64
│    %18 = Base.slt_int(%17, 0)::Bool
└───       goto #7 if not %18
6 ──       Main.throw(%16)::Union{}
└───       unreachable
7 ── %22 = Base.getfield(%16, :x)::Int64
│    %23 = Base.add_int(%15, %22)::Int64
│    %24 = (%14 === %2)::Bool
└───       goto #9 if not %24
8 ──       Base.nothing::Nothing
└───       goto #10
9 ── %28 = Base.add_int(%14, 1)::Int64
└───       goto #10
10 ┄ %30 = φ (#9 => %28)::Int64
│    %31 = φ (#9 => %28)::Int64
│    %32 = φ (#8 => true, #9 => false)::Bool
│    %33 = Base.not_int(%32)::Bool
└───       goto #12 if not %33
11 ─       goto #5
12 ┄ %36 = φ (#10 => %23, #4 => 0)::Int64
└───       return %36
) => Int64

fun fact: unoptimize_throw_blocks makes SROA sad

julia> code_typed((Int,)) do n
           s = 0
           for i in 1:n
               r = Ref(i)
               r[] < 0 && throw(r[]) # very rare condition
               s += r[]
           end
           s
       end |> only
CodeInfo(
1 ── %1  = Base.sle_int(1, n)::Bool
│    %2  = Base.ifelse(%1, n, 0)::Int64
│    %3  = Base.slt_int(%2, 1)::Bool
└───       goto #3 if not %3
2 ──       Base.nothing::Nothing
└───       goto #4
3 ──       goto #4
4 ┄─ %8  = φ (#2 => true, #3 => false)::Bool
│    %9  = φ (#3 => 1)::Int64
│    %10 = φ (#3 => 1)::Int64
│    %11 = Base.not_int(%8)::Bool
└───       goto #12 if not %11
5 ┄─ %13 = φ (#4 => %9, #11 => %31)::Int64
│    %14 = φ (#4 => %10, #11 => %32)::Int64
│    %15 = φ (#4 => 0, #11 => %24)::Int64
│    %16 = %new(Base.RefValue{Int64}, %13)::Base.RefValue{Int64}
│    %17 = Base.getfield(%16, :x)::Int64
│    %18 = Base.slt_int(%17, 0)::Bool
└───       goto #7 if not %18
6 ── %20 = Base.getindex(%16)::Any
│          Main.throw(%20)::Union{}
└───       unreachable
7 ── %23 = Base.getfield(%16, :x)::Int64
│    %24 = Base.add_int(%15, %23)::Int64
│    %25 = (%14 === %2)::Bool
└───       goto #9 if not %25
8 ──       Base.nothing::Nothing
└───       goto #10
9 ── %29 = Base.add_int(%14, 1)::Int64
└───       goto #10
10 ┄ %31 = φ (#9 => %29)::Int64
│    %32 = φ (#9 => %29)::Int64
│    %33 = φ (#8 => true, #9 => false)::Bool
│    %34 = Base.not_int(%33)::Bool
└───       goto #12 if not %34
11 ─       goto #5
12 ┄ %37 = φ (#10 => %24, #4 => 0)::Int64
└───       return %37
) => Int64

so what is needed ?

  • these cases are really similar in the sense that what they need is "lifetime" information ?
  • is the current EscapeLattice enough, or does it need more props ?
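As a strawman for the second question, one possible extension (field names invented for illustration; this is not the actual EscapeLattice of EscapeAnalysis.jl):

# hypothetical lattice element extended with lifetime information
struct EscapeLatticeWithLifetime
    Analyzed::Bool      # this value has been analyzed at all
    ReturnEscape::Bool  # may escape via the return value
    ThrownEscape::Bool  # may escape via an exception
    Liveness::BitSet    # statement indices where the value may still be live
end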

Vararg handling

julia> @noinline function foo(s...)
           string(s)
       end
foo (generic function with 1 method)

# when is `Vararg` introduced ?

julia> code_typed((String,)) do s
           foo(s)
       end
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = invoke Main.foo(_2::String)::String
└──      return %1
) => String

julia> code_typed((String,)) do s
           foo(s, s)
       end
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = invoke Main.foo(_2::String, _2::Vararg{String})::String
└──      return %1
) => String

julia> code_typed((String,)) do s
           foo(s, s, s)
       end
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = invoke Main.foo(_2::String, _2::Vararg{String}, _2)::String
└──      return %1
) => String

# it seems like inference/optimizer has dealt with `foo(::String, ::String)` !?

julia> only(methods(foo)).specializations
svec(MethodInstance for foo(::String), MethodInstance for foo(::String, ::String), MethodInstance for foo(::String, ::String, ::String), nothing, nothing, nothing, nothing, MethodInstance for foo(::String, ::Vararg{String}))

2021/10/06

  • plans
  • updates:
    • (mostly) debuginfo is available for EscapeAnalysis.jl

Next actions

  • Xuanda will take a look at more aggressive DCE
  • Shuhei will learn much more about the Julia runtime, try to fully understand optimization targets other than SROA and finalizer elision, and re-define the project targets and update the timeline if necessary
  • Shuhei will keep working on field analysis and enhanced SROA

DCE plan

  • step 1: typeassert can simply be eliminated
  • step 2: fix #27547 ( #37882 can be revived)
  • step 3: eliminate dead isdefined / isa / === branches (maybe <: too ?)
julia> code_typed() do
           local arr::Vector{Any}
           doit() do
               arr = Any[]
           end
           arr
       end |> only
CodeInfo(
1 ─ %1 = Core.Box::Type{Core.Box}
│   %2 = %new(%1)::Core.Box
│   %3 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Vector{Any}, svec(Any, Int64), 0, :(:ccall), Vector{Any}, 0, 0))::Vector{Any}
│        Core.setfield!(%2, :contents, %3)::Vector{Any}
│   %5 = Core.isdefined(%2, :contents)::Bool       # <= step 2
└──      goto #3 if not %5                         # <= step 2
2 ─      goto #4
3 ─      $(Expr(:throw_undef_if_not, :arr, false))::Any # <= step 2
4 ┄ %9 = Core.getfield(%2, :contents)::Any
│        Core.typeassert(%9, Vector{Any})::Vector{Any}  # <= step 1
│   %11 = π (%9, Vector{Any})
└──      return %11
) => Vector{Any}

More optimization ideas

  • move allocation (by copy) to where it can be thrown
    • observations: we may not be able to prove effect-free-ness of something (e.g. typeassert)

      • but we may sometimes know it doesn't escape
      • and we can try to "delay" allocation to where it can be thrown, and do optimization (maybe SROA ?) more aggressively
    • when it does escape in a limited scope, only allocate object when thrown

      ​​​​​​s = 0
      ​​​​​​for i in 1:n
      ​​​​​​    arr = [i]   # want to eliminate this allocation
      ​​​​​​    s += arr[1] # may throw,
      ​​​​​​end
      ​​​​​​s
      
      ​​​​​​let
      ​​​​​​    r = Ref(x)
      ​​​​​​    s = r::Base.RefValue{String}
      ​​​​​​    r[]
      ​​​​​​end
      
    • conditions ?

      • it shouldn't alias to anything
      • lifetime should end in optimization scope
    • dynamic check to enable stack allocation ?

      • make "type-tag" keep information about allocation
      • "pool-allocation"
        • optimize if an object is in the pool
        • make query to check whether it's allocated in the pool
        • when the GC allocates, it's going to allocate in the pool
        • (as an implementation detail of heap-to-stack alloc opt)
  • immutables don't need analysis ?
    ​​let
    ​​    # type-tag needs to be on stack (next to the object) so that GC can see it ..?
    ​​    # 1. immutable points to reference to heap object (`Any`)
    ​​    #    but the `Any` object might be able to be allocated on stack as well
    ​​    #    type-tag (size unknown ?)
    ​​    # 2. if immutable escapes, allocate it on the heap
    ​​    s = Some{Any}(10) # <= points to heap-allocated (if escapes)
    ​​end
    
  • AD may also possibly benefit from mutation analysis
    • Zygote: doesn't support
    • Diffractor: ???
    • Enzyme: use alias/escape analysis to find objects whose mutation order should be recorded
  • Julia's memory management/optimization
    • llvm-alloc-opt.cpp: SROA pass on heap memory
      • some LLVM has its own SROA pass on stack memory
    • gc.c: memory management (allocation, finalization)
      • defines standard runtime semantics
      • keywords:
        • mark-and-sweep
        • non-moving
        • tri-color
        • precise: never changes
        • generation
        • stop-the-world (vs. concurrent garbage collection)

2021/09/29

Agenda

  • agree on target optimizations (Targets)
  • plan collaboration (Development)
    • meeting schedule
    • discuss roles

more optimization ideas ?

  • alias analysis on Julia level
    • we have LLVM-level TBAA pass: it just checks types
    • we can tell more information to it
  • memory-reuse: try to reuse the same memory for allocations that happen in a loop
    • we can optimize it when it doesn't escape the loop (use scope marker, like gcpreserve)
    • we want to make it control-flow sensitive
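For illustration, the hand-written form of the reuse we would like the compiler to derive automatically (a sketch; the scope-marker API mentioned above is not designed yet):

# hypothetical: hoist one allocation out of the loop and reuse it, which is
# valid only if `buf` provably never escapes a single iteration
function sum_with_reuse(n)
    buf = Ref(0)   # one allocation instead of `n`
    s = 0
    for i in 1:n
        buf[] = i  # reuse the same memory every iteration
        s += buf[]
    end
    return s
end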

  1. This optimization can be developed standalone. For example, we can eliminate the else branch in the following code snippet w/o escape analysis if we have more aggressive DCE:

    ​​​​​​julia> code_typed(()) do
    ​​​​​​           r = Ref{Any}(42)
    ​​​​​​           a = r[]
    ​​​​​​           if isa(a, Int)
    ​​​​​​               return 0
    ​​​​​​           else
    ​​​​​​               return nothing
    ​​​​​​           end
    ​​​​​​       end
    ​​​​​​1-element Vector{Any}:
    ​​​​​​CodeInfo(
    ​​​​​​1 ─ %1 = (42 isa Main.Int)::Bool
    ​​​​​​└──      goto #3 if not %1
    ​​​​​​2 ─      return 0
    ​​​​​​3 ─      return Main.nothing
    ​​​​​​) => Union{Nothing, Int64}
    
    ↩︎
  2. There seems to be a way to circumvent this escape. Elaborated here. ↩︎
