---
breaks: false
---
<!--{%hackmd theme-dark %}-->
# Julia Escape Analysis Project
- Team: _Shuhei Kadowaki_, _Jameson Nash_
- Meeting:
- zoom link: <https://mit.zoom.us/j/93134458604>
- schedule:
- 2022: TBD
- 2021: (every Wed.) 22:00-23:00 (JST), 9:00-10:00 (EDT)
- Scope:
- implement a strong Julia-level escape analysis
- implement memory-optimizations based on the analysis
- prototype them as a testbed of [compiler-plugin](https://hackmd.io/bVhb97Q4QTWeBQw8Rq4IFw?both)
- ... and more ?
- feedback escape information to inference ?
- summarize performance improvements, and write up a paper ?
- [Timeline](#Development-Plan): upstream to Julia base by the end of the year
# Targets
As midterm objectives of this project, we will try to implement the following optimizations:
- [ ] Enhanced SROA & DCE
- [ ] ~~Heap-to-stack allocation conversion~~
- [ ] (Allocation movement ?)
- [ ] `finalize`r elision
## Enhanced SROA & DCE
### The current state
Julia's compiler does SROA (Scalar-Replacement-of-Aggregates) optimization on two levels:
1. on Julia-level IR: [`getfield_elim_pass!`](https://github.com/JuliaLang/julia/blob/f8d3bd22f68f10bd2dc3f1be3b74bfa7cb125bc5/base/compiler/ssair/passes.jl#L527)
2. on LLVM-IR: COMBAK (which one does LLVM SROA ?)
The purpose of this optimization is to enhance the Julia-level SROA with escape analysis.
The current SROA pass would be described as (per [#42357](https://github.com/JuliaLang/julia/pull/42357)):
> getfield_elim_pass!(ir::IRCode) -> newir::IRCode
>
> `getfield` elimination pass, a.k.a. Scalar Replacements of Aggregates optimization.
>
> This pass is based on a local alias analysis that collects field information by def-use chain walking.
> It looks for struct allocation sites ("definitions"), and `getfield` calls as well as
> `:foreigncall`s that preserve the structs ("usages"). If "definitions" have enough information,
> then this pass will replace corresponding usages with lifted values.
> `mutable struct`s require additional cares and need to be handled separately from immutables.
> For `mutable struct`s, `setfield!` calls account for "definitions" also, and the pass should
> give up the lifting conservatively when there are any "intermediate usages" that may escape
> the mutable struct (e.g. non-inlined generic function call that takes the mutable struct as
> its argument).
>
> In a case when all usages are fully eliminated, `struct` allocation may also be erased as
> a result of dead code elimination.
The important observations are:
- successful SROA can sometimes eliminate allocations entirely as a result of dead code elimination
- `getfield_elim_pass!` is limited in the sense that:
- it can't reason about anything inter-procedural
- it requires "safe" allocation sites to kick off the whole optimization
Let's see successful examples first, and then understand the limitations.
Here, the allocation site `r = Ref{Int}(0)` is "safe" and has enough information:
`r[]` (inlined to `getfield`) will be eliminated, and even more,
the allocation and the `setfield!` are voided out too:
```julia
julia> code_typed() do
r = Ref{Int}(0)
r[] = 42
b = sin(r[])
return b
end |> only
CodeInfo(
1 ─ %1 = Base.sitofp(Float64, 42)::Float64
│ %2 = invoke Base.Math.sin(%1::Float64)::Float64
└── return %2
) => Float64
```
`getfield_elim_pass!` requires a "safe" allocation site.
So if we instead create an uninitialized `r = Ref{Int}()`, the whole optimization won't work:
```julia
julia> code_typed() do
r = Ref{Int}()
r[] = 42
b = sin(r[])
return b
end |> only
CodeInfo(
1 ─ %1 = %new(Base.RefValue{Int64})::Base.RefValue{Int64}
│ Base.setfield!(%1, :x, 42)::Int64
│ %3 = Base.getfield(%1, :x)::Int64
│ %4 = Base.sitofp(Float64, %3)::Float64
│ %5 = invoke Base.Math.sin(%4::Float64)::Float64
└── return %5
) => Float64
```
Still, the later LLVM-level SROA may be able to optimize it further, and it seems to be more powerful than the Julia-level one in general, e.g. it can "fully" optimize the above code (the argument of `@j_sin_4761` is replaced with the scalar value `double 4.200000e+01`,
and there is no allocation at all):
```julia
julia> code_llvm(()) do
r = Ref{Int}()
r[] = 42
b = sin(r[])
return b
end
; @ none:2 within `#49`
define double @"julia_#49_4759"() #0 {
top:
; @ none:4 within `#49`
; ┌ @ math.jl:1228 within `sin`
%0 = call double @j_sin_4761(double 4.200000e+01) #0
; └
; @ none:5 within `#49`
ret double %0
}
```
Now let's confuse LLVM: if we just add a non-inlined function call, LLVM can't help giving up the optimization:
```julia
julia> @noinline setter!(r, x) = r[] = x
setter! (generic function with 1 method)
julia> code_llvm(()) do
r = Ref{Int}()
setter!(r, 42)
b = sin(r[])
return b
end
; @ none:2 within `#45`
define double @"julia_#45_4943"() #0 {
top:
%gcframe7 = alloca [3 x {}*], align 16
%gcframe7.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 0
%0 = bitcast [3 x {}*]* %gcframe7 to i8*
call void @llvm.memset.p0i8.i32(i8* nonnull align 16 dereferenceable(24) %0, i8 0, i32 24, i1 false)
%1 = call {}*** inttoptr (i64 140733735208165 to {}*** (i64)*)(i64 261) #6
; ┌ @ refpointer.jl:135 within `Ref`
; │┌ @ refvalue.jl:7 within `RefValue`
%2 = bitcast [3 x {}*]* %gcframe7 to i64*
store i64 4, i64* %2, align 16
%3 = load {}**, {}*** %1, align 8
%4 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 1
%5 = bitcast {}** %4 to {}***
store {}** %3, {}*** %5, align 8
%6 = bitcast {}*** %1 to {}***
store {}** %gcframe7.sub, {}*** %6, align 8
%ptls_field3 = getelementptr inbounds {}**, {}*** %1, i64 2305843009213693954
%7 = bitcast {}*** %ptls_field3 to i8**
%ptls_load45 = load i8*, i8** %7, align 8
%8 = call noalias nonnull {}* @jl_gc_pool_alloc(i8* %ptls_load45, i32 1392, i32 16) #2
%9 = bitcast {}* %8 to i64*
%10 = getelementptr inbounds i64, i64* %9, i64 -1
store atomic i64 4617487696, i64* %10 unordered, align 8
%11 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 2
store {}* %8, {}** %11, align 16
; └└
; @ none:3 within `#45`
%12 = call i64 @"j_setter!_4945"({}* nonnull %8, i64 signext 42) #0
; @ none:4 within `#45`
; ┌ @ refvalue.jl:56 within `getindex`
; │┌ @ Base.jl:38 within `getproperty`
%13 = load i64, i64* %9, align 8
; └└
; ┌ @ math.jl:1232 within `sin`
; │┌ @ float.jl:269 within `float`
; ││┌ @ float.jl:243 within `AbstractFloat`
; │││┌ @ float.jl:146 within `Float64`
%14 = sitofp i64 %13 to double
; │└└└
; │ @ math.jl:1234 within `sin`
%15 = call double @j_sin_4946(double %14) #0
%16 = load {}*, {}** %4, align 8
%17 = bitcast {}*** %1 to {}**
store {}* %16, {}** %17, align 8
; └
; @ none:5 within `#45`
ret double %15
}
```
LLVM can also be confused by Julia-level semantics other than inter-procedural calls.
For example, `typeassert` can pretty much null out LLVM SROA.
In the following code snippet, neither Julia-level nor LLVM-level SROA manages to eliminate the closure construction:
```julia=
julia> doit(f, args...) = f(args...)
julia> code_typed() do
local arr::Vector{Any}
doit() do
arr = Any[]
end
arr
end |> only
CodeInfo(
1 ─ %1 = Core.Box::Type{Core.Box}
│ %2 = %new(%1)::Core.Box
│ %3 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Vector{Any}, svec(Any, Int64), 0, :(:ccall), Vector{Any}, 0, 0))::Vector{Any}
│ Core.setfield!(%2, :contents, %3)::Vector{Any}
│ %5 = Core.isdefined(%2, :contents)::Bool
└── goto #3 if not %5
2 ─ goto #4
3 ─ $(Expr(:throw_undef_if_not, :arr, false))::Any
4 ┄ %9 = Core.getfield(%2, :contents)::Any
│ Core.typeassert(%9, Vector{Any})::Vector{Any}
│ %11 = π (%9, Vector{Any})
└── return %11
) => Vector{Any}
julia> code_llvm(()) do
local arr::Vector{Any}
doit() do
arr = Any[]
end
arr
end
; @ none:2 within `#55`
define nonnull {}* @"julia_#55_4772"() #0 {
pass:
; @ none:3 within `#55`
; ┌ @ none:1 within `doit`
; │┌ @ none:4 within `#56`
; ││┌ @ array.jl:423 within `getindex`
; │││┌ @ boot.jl:471 within `Array` @ boot.jl:452
%0 = call nonnull {}* inttoptr (i64 4483450916 to {}* ({}*, i64)*)({}* inttoptr (i64 4648194800 to {}*), i64 0)
; └└└└
; @ none:6 within `#55`
%1 = bitcast {}* %0 to i64*
%2 = getelementptr inbounds i64, i64* %1, i64 -1
%3 = load atomic i64, i64* %2 unordered, align 8
%4 = and i64 %3, -16
%5 = inttoptr i64 %4 to {}*
%6 = icmp eq {}* %5, inttoptr (i64 4648194800 to {}*)
br i1 %6, label %pass3, label %fail2
fail2: ; preds = %pass
call void @jl_type_error(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @_j_str1, i64 0, i64 0), {}* inttoptr (i64 4648194800 to {}*), {}* %0)
unreachable
pass3: ; preds = %pass
ret {}* %0
}
# interesting fact:
# LLVM can optimize it away if we strip out the type declaration !
julia> code_llvm(()) do
local arr
doit() do
arr = Any[]
end
arr
end
; @ none:2 within `#55`
define nonnull {}* @"julia_#55_4959"() #0 {
pass:
; @ none:3 within `#55`
; ┌ @ none:1 within `doit`
; │┌ @ none:4 within `#56`
; ││┌ @ array.jl:423 within `getindex`
; │││┌ @ boot.jl:471 within `Array` @ boot.jl:452
%0 = call nonnull {}* inttoptr (i64 4431128612 to {}* ({}*, i64)*)({}* inttoptr (i64 4614886128 to {}*), i64 0)
; └└└└
; @ none:6 within `#55`
ret {}* %0
}
```
### Enhanced SROA & More aggressive DCE
We want a "better" Julia-level SROA and DCE.
In particular:
- (_**inter-procedural**_) the new SROA should analyze aliases and escapes better, and work well with non-inlined calls
- (_**more aggressive**_) the new DCE should be able to null out `typeassert`/`isdefined`/`isa` checks more aggressively so that we can eliminate more allocations[^1]
[^1]: This optimization can be developed standalone.
For example, we can eliminate the `else` branch in the following code snippet w/o escape analysis
if we have more aggressive DCE:
```julia
julia> code_typed(()) do
r = Ref{Any}(42)
a = r[]
if isa(a, Int)
return 0
else
return nothing
end
end
1-element Vector{Any}:
CodeInfo(
1 ─ %1 = (42 isa Main.Int)::Bool
└── goto #3 if not %1
2 ─ return 0
3 ─ return Main.nothing
) => Union{Nothing, Int64}
```
Target examples:
- propagate inter-procedural memory-effect:
```julia
julia> @noinline setter!(r, x) = r[] = x
setter! (generic function with 1 method)
julia> code_typed() do
r = Ref{Int}()
setter!(r, 42)
b = sin(r[]) # should be turned into `sin(42)`
return b
end |> only
CodeInfo(
1 ─ %1 = %new(Base.RefValue{Int64})::Base.RefValue{Int64}
│ invoke Main.setter!(%1::Base.RefValue{Int64}, 42::Int64)::Any
│ %3 = Base.getfield(%1, :x)::Int64
│ %4 = Base.sitofp(Float64, %3)::Float64
│ %5 = invoke Base.Math.sin(%4::Float64)::Float64
└── return %5
) => Float64
```
- (even local analysis improvement):
```julia
# `RefValue{String}` allocation won't be eliminated
(s::String) -> broadcast(identity, Ref(s))
# abstract-interpretation based field information propagation
# should allow us to optimize this IR
2 1 ─ %1 = %new(Base.RefValue{String}, _2)::Base.RefValue{String} │╻╷╷ Ref
│ %2 = Core.tuple(%1)::Tuple{Base.RefValue{String}} │╻ broadcast
│ %3 = Core.getfield(%2, 1)::Base.RefValue{String} ││
└── goto #3 if not true ││╻╷ materialize
2 ─ nothing::Nothing │
3 ┄ %6 = Base.getfield(%3, :x)::String │││╻╷╷╷╷ copy
└── goto #4 ││││┃ getindex
4 ─ goto #5 ││││
5 ─ goto #6 │││
6 ─ goto #7 ││
7 ─ return %6 │
```
- nullify `typeassert` check:
```julia
julia> code_typed() do
r = Ref{Any}()
r[] = 42
# this assertion is useful for inference
# but "harmful" for LLVM
return r[]::Int
end
1-element Vector{Any}:
CodeInfo(
1 ─ %1 = %new(Base.RefValue{Any})::Base.RefValue{Any}
│ Base.setfield!(%1, :x, 42)::Int64
│ %3 = Base.getfield(%1, :x)::Any
│ Core.typeassert(%3, Main.Int)::Int64
│ %5 = π (%3, Int64)
└── return %5
) => Int64
```
## Heap to stack allocation conversion
### The idea
Julia allocates every actually-allocated mutable object on the heap,
while in an ideal world we would place them on the stack if their lifetime doesn't escape the execution frame.
An obvious example: Julia will allocate `MyRef("foo")` even though it doesn't escape anywhere,
because the `@noinline`d `getx` call prevents both Julia-level and LLVM-level SROA,
so neither can eliminate the allocation:
```julia
julia> mutable struct MyRef{T}; x::T; end
julia> @noinline getx(x::MyRef) = x.x
getx (generic function with 1 method)
julia> code_typed(()) do
r = MyRef("foo")
x = getx(r)
sizeof(x)
end |> only
CodeInfo(
1 ─ %1 = %new(MyRef{String}, "foo")::MyRef{String}
│ %2 = invoke Main.getx(%1::MyRef{String})::String
│ %3 = Core.sizeof::typeof(Core.sizeof)
│ %4 = (%3)(%2)::Int64
└── return %4
) => Int64
julia> code_llvm(()) do
r = MyRef("foo")
x = getx(r)
sizeof(x)
end
; @ none:2 within `#75`
define i64 @"julia_#75_4792"() #0 {
top:
%0 = alloca {}*, align 8
%gcframe7 = alloca [3 x {}*], align 16
%gcframe7.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 0
%1 = bitcast [3 x {}*]* %gcframe7 to i8*
call void @llvm.memset.p0i8.i32(i8* nonnull align 16 dereferenceable(24) %1, i8 0, i32 24, i1 false)
%2 = call {}*** inttoptr (i64 140733733901541 to {}*** (i64)*)(i64 261) #7
; ┌ @ none:1 within `MyRef` @ none:1
%3 = bitcast [3 x {}*]* %gcframe7 to i64*
store i64 4, i64* %3, align 16
%4 = load {}**, {}*** %2, align 8
%5 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 1
%6 = bitcast {}** %5 to {}***
store {}** %4, {}*** %6, align 8
%7 = bitcast {}*** %2 to {}***
store {}** %gcframe7.sub, {}*** %7, align 8
%ptls_field3 = getelementptr inbounds {}**, {}*** %2, i64 2305843009213693954
%8 = bitcast {}*** %ptls_field3 to i8**
%ptls_load45 = load i8*, i8** %8, align 8
%9 = call noalias nonnull {}* @jl_gc_pool_alloc(i8* %ptls_load45, i32 1392, i32 16) #2
%10 = bitcast {}* %9 to i64*
%11 = getelementptr inbounds i64, i64* %10, i64 -1
store atomic i64 5398336928, i64* %11 unordered, align 8
%12 = bitcast {}* %9 to {}**
store {}* inttoptr (i64 5376165704 to {}*), {}** %12, align 8
%13 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe7, i64 0, i64 2
store {}* %9, {}** %13, align 16
; └
; @ none:3 within `#75`
store {}* %9, {}** %0, align 8
%14 = call nonnull {}* @j1_getx_4794({}* inttoptr (i64 5377254840 to {}*), {}** nonnull %0, i32 1)
; @ none:4 within `#75`
; ┌ @ Base.jl:175 within `sizeof`
%15 = bitcast {}* %14 to i64*
%16 = load i64, i64* %15, align 8
%17 = load {}*, {}** %5, align 8
%18 = bitcast {}*** %2 to {}**
store {}* %17, {}** %18, align 8
; └
ret i64 %16
}
```
The basic idea of this optimization is to modify LLVM IR so that we can allocate `MyRef("foo")` on stack rather than heap.
### Do we really need this ?
We worked on this optimization in the first project iteration (the GSoC term), but for now
I'd like to give it lower priority:
- (_**SROA and DCE**_) as explained above, successful SROA can eliminate allocations entirely, so enhanced SROA subsumes much of this optimization
e.g. for this code snippet, we can just convert
```julia
let
r = MyRef("foo")
x = getx(r)
sizeof(x)
end
```
to
```julia
let
sizeof("foo")
end
```
in the enhanced SROA pass, leaving no need for this heap-to-stack conversion
- (_**GC interaction**_) we need to make (complicated) design choices, mainly around interactions with the GC (Jameson can elaborate on this)
- (_**thrown exceptions**_) the current throwness check is not very accurate, and we often need to account for escapes through exceptions conservatively,
which makes stack allocation harder[^2]
[^2]: There seems to be a way to circumvent this escape. Elaborated [here](https://github.com/aviatesk/EscapeAnalysis.jl/issues/15).
## Allocation movement
TODO: elaborate this.
## `finalize`r elision
Julia finalizers are managed by the GC, so they may hang around for a while, for as long as the GC lets the mutable object linger.
This means we may end up consuming a lot of memory when finalizers are associated with e.g. resource management, a common practice in e.g. GPU programming.
The idea of this optimization is to call a finalizer early by inserting a `finalize(obj)` call when we can prove the lifetime of `obj` ends within a frame:
```julia
julia> mutable struct WithFinalizer
v
function WithFinalizer(v)
x = new(v)
f(t) = @async println("Finalizing $t.")
return finalizer(f, x)
end
end
julia> make_obj(v = 10) = WithFinalizer(v)
make_obj (generic function with 1 method)
julia> let
o = make_obj()
copy(o.v) # <= automatically insert `finalize(o)` here
end
10
```
### Considerations
- [ ] can we just call `finalize(obj)` ?
- `finalize(obj)` is super costly !
- 1. we want to "enqueue" the finalizer (per Valentin)
- 2. make it intrinsic and make it fast
- [ ] how can we find objects that have a `finalizer` ?
- [ ] do we need to account for possible exceptions within the finalizer call itself ... ?
- [ ] do we want to have flow-sensitivity (i.e. insert `finalize` on return site) ?
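As a concrete illustration of the intended transformation, here is a minimal hand-written sketch of what the inserted `finalize` call amounts to. This is purely illustrative: `Res`, `make_and_use`, and the `finalized` flag are made-up names, and a real implementation would enqueue the finalizer rather than call the (costly) `finalize`:
```julia
mutable struct Res
    v::Int
end

const finalized = Ref(false)

function make_and_use()
    # register a finalizer at construction, as user code typically does
    o = finalizer(_ -> (finalized[] = true), Res(10))
    v = o.v
    # <= the call the optimizer would insert once it proves `o`'s lifetime
    #    ends here; it runs the finalizer eagerly instead of waiting for GC
    finalize(o)
    return v
end
```
Calling `make_and_use()` returns `10` with the finalizer already run, which is exactly the early release of resources this optimization is after.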
# Pipeline Design
1. type inference
2. optimization
1. preps
2. inlining
3. escape analysis
4. optimization passes
1. enhanced SROA
2. aggressive dead code elimination
3. finalizer elision
3. caching
Even if we have a useful inter-procedural cache,
escape analysis and SROA are better run after inlining,
simply because inter-procedural handling is much easier there:
we can just look for the `:invoke` expressions that the inliner has resolved.
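To make the `:invoke` point concrete, here is a small sketch (assuming only the public `code_typed` API; `callee` is an illustrative helper, not part of the pipeline): after inlining, each non-inlined call site appears as an `Expr(:invoke, ...)` whose first argument identifies the resolved callee, so an inter-procedural pass can simply scan for them:
```julia
# a call that will survive inlining
@noinline callee(x) = x + 1

# typed (post-inference) IR of a closure that calls `callee`
ci = first(only(code_typed((Int,)) do x
    callee(x) + 2
end))

# the resolved inter-procedural call sites: `stmt.args[1]` identifies the
# callee (a `MethodInstance`, or a `CodeInstance` on recent Julia versions)
invokes = [stmt for stmt in ci.code if Meta.isexpr(stmt, :invoke)]
```
Here `invokes` contains exactly the `callee` call site, while the inlined integer `+` leaves no `:invoke` behind.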
Caching:
- create a new field in `CodeInstance`, and
- pack (minimized) escape information into it, as we do for inference results.
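One possible shape for that minimized encoding (purely an assumption for illustration: the bit names, the 3-bit-per-argument layout, and `pack`/`unpack` are not the actual `CodeInstance` design):
```julia
# hypothetical per-argument escape properties, one bit each
const RETURN_ESCAPE = 0x01  # argument may alias the return value
const THROWN_ESCAPE = 0x02  # argument may escape through an exception
const GLOBAL_ESCAPE = 0x04  # argument may escape to global memory

# pack one 3-bit summary per argument into a single UInt64
pack(args::Vector{UInt8}) =
    foldl((acc, a) -> (acc << 3) | a, args; init = UInt64(0))

# recover the per-argument summaries for an `n`-argument signature
unpack(packed::UInt64, n::Int) =
    UInt8[(packed >> (3 * (n - i))) & 0x07 for i in 1:n]

packed = pack(UInt8[RETURN_ESCAPE, THROWN_ESCAPE | GLOBAL_ESCAPE])
```
The point is only that a per-signature escape summary compresses into a few bytes, so storing it on `CodeInstance` should be cheap.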
# Development Plan
## Schedule
- [x] 21/09 (kick-off): re-define targets, setup milestones
- [ ] 21/10 (stage 1): implement as "dummy" `AbstractInterpreter`
- optimizations
- enhanced SROA
- aggressive DCE
- finalizer elision prototyping
- analysis
- [ ] 21/11: (stage 2) executable `AbstractInterpreter`
- keep working on the stage-1 leftovers
- scratch compiler-plugin infrastructure to actually execute that pipeline
- [ ] 21/12: (stage 3) make a PR on Julia Base, finish reviews
---
# Discussions
**NOTE: the agenda for the most recent meeting comes first**
## 2021/12/17
- For `ImmutableArray`
- `mutating_arrayfreeze`
- lifetime: whether or not mutable array is "mutated"
- alias analysis?
- about accuracy
- may not need so much?
- `BoundsError`
- if only escape is from `BoundsError`
- language semantics requires "non-semantic copy"
- lifetime information?
- might not be too important?
- but it's very rare!
- focus:
- phase 1: reviving `ImmutableArray`
- phase 2: LLVM escape
- limitations: `Union`-type, inter-procedural
- `ImmutableArray` PR
- taint analysis
## 2021/11/24
### Agenda
- [ ] Implementation of field/alias analysis for enhanced SROA
- [ ] "Non-semantic" copy for stack allocation
### Implementation of field/alias analysis for enhanced SROA
#### Allocation-elimination targets
:arrow_double_up: easy / :arrow_double_down: hard:
- [x] simplest SROA
```julia
code_typed((Any,)) do a
r = Ref{Any}(a)
r[] # => a
end
```
- [x] domination-aware SROA
```julia
code_typed((Any,Any)) do a, b
r = Ref{Any}(a)
r[] = b
r[] # => b
end
# uninitialized field
# https://github.com/JuliaLang/julia/pull/42834
code_typed((Any,)) do a
r = Ref{Any}()
r[] = a
r[] # => a
end
code_typed((Bool,Any,Any)) do cond, a, b
r = Ref{Any}()
if cond
r[] = a
r[] # => a
else
r[] = b
r[] # => b
end
end
```
- [x] φnode-fying SROA
```julia
code_typed((Bool,Any,Any)) do cond, a, b
r = Ref{Any}(nothing)
if cond
r[] = a
else
r[] = b
end
r[] # => φ(a, b)
end
# uninitialized field
# https://github.com/JuliaLang/julia/pull/43208
code_typed((Bool,Any,Any)) do cond, a, b
r = Ref{Any}()
if cond
r[] = a
else
r[] = b
end
r[] # => φ(a, b)
end
```
- [ ] alias-aware SROA (nested load elimination)
```julia
# `s -> broadcast(identity, Ref(s))`
code_typed((String,)) do s
r = Ref(s)
t = (r,)
t[1][] # => s
end
code_typed() do
local arr::Vector{Any}
alloc! = () -> arr = Any[]
alloc!()
arr
end
code_typed((Bool,Any,Any)) do cond, a, b
r1 = Ref{Any}(a)
r2 = Ref{Any}(b)
(cond ? r1 : r2)[] # φ(a, b)
# or => (cond ? r1[] : r2[])
end
```
- [ ] "partial" SROA
```julia
code_typed((Bool,Any,)) do rare_cond, a
r = Ref{Any}(a)
if rare_cond
throw(r) # move `r` allocation here
else
return r[] # => a
end
end
```
- [ ] stack allocation
```julia
@noinline safe_op(a) = ...
code_typed((Any,)) do a
r = Ref{Any}(a) # can be allocated on stack
safe_op(r)
end
```
#### Design of analysis/optimization
0. current Julia Base SROA
```
[SROA]
[simple analysis]
[domination analysis]
[actual code transformation]
```
0. multiple SROA: **benchmark**
1. backward escape/alias/field analysis & load forwarding by domination analysis
- <https://github.com/aviatesk/EscapeAnalysis.jl/pull/54>
- overview
```
# analysis
while [converge]
[backward escape/alias/field analysis]
end
# transformation
[domination analysis]
[SROA]
[load forwarding]
[DCE]
```
2. backward escape/alias analysis & forward field propagation
- [`EscapeAnalysis.jl/avi/flowsensitive-aliasanalysis`](https://github.com/aviatesk/EscapeAnalysis.jl/tree/avi/flowsensitive-aliasanalysis)
- overview:
```
# analysis
while [converge]
[backward escape/alias analysis]
while [converge]
[forward field information propagation]
end
end
# transformation
[SROA]
[simple load replacement]
[DCE]
```
3. forward escape/alias/field analysis
- JVM compilers' approach
- overview
```
# analysis
while [converge]
[forward escape/alias analysis]
[forward field information propagation]
end
# transformation
[SROA]
[simple load replacement]
[DCE]
```
- Points
- complexity of domination/control-flow
- "last" definition
- field merge (φnode-fying)
- uninitialized field
- conflicts between escape information and field information
- escape information: flow from usage to definition
- field information: flow from definition to usage
- versatility of EA
- Valentin: separate SROA from EA
### Discussion
- Valentin: we want to have a "slot" in IRCode?
- represents a mutable memory
- might be useful for:
- Tapir integration
- stack allocation
- i.e. may be able to reuse the slot2ssa system?
- Valentin: array?
- array optimization
- target?
- ref/load is all local
- constant index
- approaches?
- Valentin: don't make array special!
- LLVM can handle it nicely
- Julia-level analysis: tell inter-procedural information!
### "Non-semantic" copy for stack allocation
- Problem:
- so many escapes happen through exceptions
- in Julia an exception means "global escape"
- Details:
- conditions of "non-semantic" copy
- no escape (including `finalize` call)
- no lifetime overlap
```julia
local o
# finalizer(f, o) # may escape in the finalizer
try
unsafe_op(o)
catch err
# o.x = "bar" # lifetime overlap
end
```
- but these two are **NOT** enough: other threads can access `err` through the evil `jl_current_exception`
- Ideas ?
- special errors ?
- `MethodError`: static
- `BoundsError`: runtime
- move `throw` ?
- as a LICM: but need to preserve the order
- unroll loop by 1, and use the obtained constants
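One way to read "unroll the loop by 1" is classic loop peeling. The sketch below is hypothetical (`sum_peeled` is an illustrative name, and it peels a bounds check by hand rather than a compiler-moved `throw`): the potentially-throwing access runs once up front, and the steady-state loop proceeds without it:
```julia
function sum_peeled(xs::Vector{Int})
    isempty(xs) && return 0
    s = xs[1]                    # peeled first iteration: the only place a
                                 # bounds check is still needed
    @inbounds for i in 2:length(xs)
        s += xs[i]               # checks elided in the steady-state loop
    end
    return s
end
```
The same shape would let a rare `throw` be ordered before the hot loop while preserving evaluation order.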
## 2021/11/10
### Agenda
- [x] When to run EA
- ~~[ ] discussion on "non-semantic" copy~~
### When to run EA
- tl;dr, **do escape analysis at optimization**
- reasons:
- inference works on type domain vs. optimization works on SSA form
- current applications of escape analysis are mostly about optimizations:
- SROA
- `ImmutableArray`: copy elision
- AD: sensitivity analysis
- as a first step, it would be easier and cleaner if they are just separated
- more thoughts:
- inference already does some sort of alias analysis in forms of `Const` and `PartialStruct`
- but they don't encode anything about CFG
- encoding CFG information into type lattice would be far more complicated (like `Conditional`)
- also, if we reason about inter-procedural mutation analysis, we also need to think about cycle detection
- more details:
- [Partial Escape Analysis and Scalar Replacement for Java](https://ssw.jku.at/Research/Papers/Stadler14/Stadler2014-CGO-PEA.pdf) can be a good reference
- local analysis:
- make EscapeAnalysis.jl control-flow sensitive
- introduce alias (field) analysis (revive <https://github.com/aviatesk/EscapeAnalysis.jl/pull/43>)
- inter-procedural analysis:
- easy ?
- can be used for
- stack allocation
- `ImmutableArray`
- SROA is better separated from escape analysis
- escape analysis: backward analysis
- SROA: forward iteration of CFG transformation
- other applications: some transformation
- inter-procedural
- how to cache the inter-procedural information ?
- new field of `CodeInstance` ?
- what kind of information
- argument escapes ?
- returned value escapes ?
- side effect analysis ?
- _global_ side effect: read/write on global memory
- e.g. if this loop writes to global memory it can't be hoisted; if not, start thinking about local effects
- obvious example: `stdout`
- _local_ side effect: mutation on field
- LICM
- better DCE by load forwarding ?
- query:
- input: `MethodInstance`, "x"
- output: does this `MethodInstance` affect the state of "x"
- side effect: a restriction on possible optimizations
- synchronization optimization ?
- what kind of optimization should side-effect target ?
- reuse LLVM ?
- LLVM-level purity vs. Julia-level purity
- TODO: think more and more about applications of "side-effect" !
- "recursion" matters ?
- "recursion" means the same `MethodInstance`
- it won't be very ideal if recursion => escape
## 2021/10/13
### Agenda
- [x] more optimization ideas
- [x] how to get Ian involved with this project ?
- we will collaborate on finishing [`ImmutableArray` PR](https://github.com/JuliaLang/julia/pull/42465)
- aggressive DCE
- what's needed ?
- make clear the cost of updating domtree dynamically (benchmarking)
- `AbstractInterpreter` can help benchmarking
- if it turns out to be costly, how can we mitigate it ?
- iterative processing to batch processing ?
- just compute entire domtree in certain points ?
- c.f. LLVM can reason about relations of various analyses to mitigate the cost of dynamic domtree update
- correctness
- [ ] (if time remains) `Vararg` handling is complicated, any hints ?
### Optimization ideas
- **"allocation movement" (I'm not quite sure this naming is right)**
* idea: move allocation to enable further optimization (elaborated below)
- **memory-reuse**:
- idea: try to reuse the same memory across loop iterations
* we can optimize it when it doesn't escape the loop (use scope marker, like gcpreserve)
* we want to make it control-flow sensitive
- **provide alias information to LLVM**
- background: LLVM-level TBAA pass: it just checks types
- idea: enrich LLVM-level optimizations with Julia-level alias information
- do we want to tell lifetime information also ?
- how do we want to communicate to LLVM ?
- attach metadata to the LLVM IR
- customizable: metadata may be consumed by our own optimization passes
- **`ImmutableArray`**
- <https://github.com/JuliaLang/julia/pull/42465>
- we want to figure out what information will be needed
- provide nice API to query object lifetime ?
#### "allocation movement"
- observations: we may not be able to prove effect-free-ness of something (e.g. `typeassert`)
- but we may sometimes know it doesn't escape
- and then we can try to "delay" allocation to where it can be thrown, and do optimization (maybe SROA ?) more aggressively
- idea: "move allocation site to where it can be really needed"
* should enable further SROA
* we can have dynamic check to enable stack allocation ?
- make "type-tag" keep information about allocation, and have a query to check if an object is on stack or heap
* can be extended to optimize `Array`s ?
> successful SROA
```julia
julia> code_typed((Int,)) do n
s = 0
for i in 1:n
r = Ref(i)
# r[] < 0 && throw(r) # very rare condition
s += r[]
end
s
end |> only
... # no allocation of `Base.RefValue{Int}`
```
> unsuccessful SROA because of the rare `throw` case (actually the throw is impossible in this case)
```julia
julia> code_typed((Int,)) do n
s = 0
for i in 1:n
r = Ref(i)
r[] < 0 && throw(r) # very rare condition
s += r[]
end
s
end |> only
CodeInfo(
1 ── %1 = Base.sle_int(1, n)::Bool
│ %2 = Base.ifelse(%1, n, 0)::Int64
│ %3 = Base.slt_int(%2, 1)::Bool
└─── goto #3 if not %3
2 ── Base.nothing::Nothing
└─── goto #4
3 ── goto #4
4 ┄─ %8 = φ (#2 => true, #3 => false)::Bool
│ %9 = φ (#3 => 1)::Int64
│ %10 = φ (#3 => 1)::Int64
│ %11 = Base.not_int(%8)::Bool
└─── goto #12 if not %11
5 ┄─ %13 = φ (#4 => %9, #11 => %30)::Int64
│ %14 = φ (#4 => %10, #11 => %31)::Int64
│ %15 = φ (#4 => 0, #11 => %23)::Int64
│ %16 = %new(Base.RefValue{Int64}, %13)::Base.RefValue{Int64}
│ %17 = Base.getfield(%16, :x)::Int64
│ %18 = Base.slt_int(%17, 0)::Bool
└─── goto #7 if not %18
6 ── Main.throw(%16)::Union{}
└─── unreachable
7 ── %22 = Base.getfield(%16, :x)::Int64
│ %23 = Base.add_int(%15, %22)::Int64
│ %24 = (%14 === %2)::Bool
└─── goto #9 if not %24
8 ── Base.nothing::Nothing
└─── goto #10
9 ── %28 = Base.add_int(%14, 1)::Int64
└─── goto #10
10 ┄ %30 = φ (#9 => %28)::Int64
│ %31 = φ (#9 => %28)::Int64
│ %32 = φ (#8 => true, #9 => false)::Bool
│ %33 = Base.not_int(%32)::Bool
└─── goto #12 if not %33
11 ─ goto #5
12 ┄ %36 = φ (#10 => %23, #4 => 0)::Int64
└─── return %36
) => Int64
```
> fun fact: `unoptimize_throw_blocks` makes SROA sad
```julia
julia> code_typed((Int,)) do n
s = 0
for i in 1:n
r = Ref(i)
r[] < 0 && throw(r[]) # very rare condition
s += r[]
end
s
end |> only
CodeInfo(
1 ── %1 = Base.sle_int(1, n)::Bool
│ %2 = Base.ifelse(%1, n, 0)::Int64
│ %3 = Base.slt_int(%2, 1)::Bool
└─── goto #3 if not %3
2 ── Base.nothing::Nothing
└─── goto #4
3 ── goto #4
4 ┄─ %8 = φ (#2 => true, #3 => false)::Bool
│ %9 = φ (#3 => 1)::Int64
│ %10 = φ (#3 => 1)::Int64
│ %11 = Base.not_int(%8)::Bool
└─── goto #12 if not %11
5 ┄─ %13 = φ (#4 => %9, #11 => %31)::Int64
│ %14 = φ (#4 => %10, #11 => %32)::Int64
│ %15 = φ (#4 => 0, #11 => %24)::Int64
│ %16 = %new(Base.RefValue{Int64}, %13)::Base.RefValue{Int64}
│ %17 = Base.getfield(%16, :x)::Int64
│ %18 = Base.slt_int(%17, 0)::Bool
└─── goto #7 if not %18
6 ── %20 = Base.getindex(%16)::Any
│ Main.throw(%20)::Union{}
└─── unreachable
7 ── %23 = Base.getfield(%16, :x)::Int64
│ %24 = Base.add_int(%15, %23)::Int64
│ %25 = (%14 === %2)::Bool
└─── goto #9 if not %25
8 ── Base.nothing::Nothing
└─── goto #10
9 ── %29 = Base.add_int(%14, 1)::Int64
└─── goto #10
10 ┄ %31 = φ (#9 => %29)::Int64
│ %32 = φ (#9 => %29)::Int64
│ %33 = φ (#8 => true, #9 => false)::Bool
│ %34 = Base.not_int(%33)::Bool
└─── goto #12 if not %34
11 ─ goto #5
12 ┄ %37 = φ (#10 => %24, #4 => 0)::Int64
└─── return %37
) => Int64
```
#### so what is needed ?
- they are really similar in the sense that what they need is "lifetime" information ?
- is the current `EscapeLattice` enough, or does it need more props ?
### `Vararg` handling ...
```julia
julia> @noinline function foo(s...)
string(s)
end
foo (generic function with 1 method)
# when is `Vararg` introduced ?
julia> code_typed((String,)) do s
foo(s)
end
1-element Vector{Any}:
CodeInfo(
1 ─ %1 = invoke Main.foo(_2::String)::String
└── return %1
) => String
julia> code_typed((String,)) do s
foo(s, s)
end
1-element Vector{Any}:
CodeInfo(
1 ─ %1 = invoke Main.foo(_2::String, _2::Vararg{String})::String
└── return %1
) => String
julia> code_typed((String,)) do s
foo(s, s, s)
end
1-element Vector{Any}:
CodeInfo(
1 ─ %1 = invoke Main.foo(_2::String, _2::Vararg{String}, _2)::String
└── return %1
) => String
# it seems like inference/optimizer has dealt with `foo(::String, ::String)` !?
julia> only(methods(foo)).specializations
svec(MethodInstance for foo(::String), MethodInstance for foo(::String, ::String), MethodInstance for foo(::String, ::String, ::String), nothing, nothing, nothing, nothing, MethodInstance for foo(::String, ::Vararg{String}))
```
## 2021/10/06
- [x] plans
- [x] updates:
* (mostly) debuginfo is available for EscapeAnalysis.jl
### Next actions
- Xuanda will take a look at more aggressive DCE
- first step: read through [#27547](https://github.com/JuliaLang/julia/issues/27547) and [#37882](https://github.com/JuliaLang/julia/pull/37882)
- Shuhei will learn much more about the Julia runtime, try to fully understand optimization targets other than SROA and finalizer elision, and re-define project targets and update the timeline if necessary
- Shuhei will keep working on field analysis and enhanced SROA
### DCE plan
- step 1: `typeassert` can simply be eliminated
- step 2: fix [#27547](https://github.com/JuliaLang/julia/issues/27547) ( [#37882](https://github.com/JuliaLang/julia/pull/37882) can be revived)
- step 3: eliminate dead `isdefined` / `isa` / `===` branches (maybe `<:` too ?)
```julia=
julia> code_typed() do
local arr::Vector{Any}
doit() do
arr = Any[]
end
arr
end |> only
CodeInfo(
1 ─ %1 = Core.Box::Type{Core.Box}
│ %2 = %new(%1)::Core.Box
│ %3 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Vector{Any}, svec(Any, Int64), 0, :(:ccall), Vector{Any}, 0, 0))::Vector{Any}
│ Core.setfield!(%2, :contents, %3)::Vector{Any}
│ %5 = Core.isdefined(%2, :contents)::Bool # <= step 2
└── goto #3 if not %5 # <= step 2
2 ─ goto #4
3 ─ $(Expr(:throw_undef_if_not, :arr, false))::Any # <= step 2
4 ┄ %9 = Core.getfield(%2, :contents)::Any
│ Core.typeassert(%9, Vector{Any})::Vector{Any} # <= step 1
│ %11 = π (%9, Vector{Any})
└── return %11
) => Vector{Any}
```
### More optimization ideas
- move allocation (by copy) to where it can be thrown
- observations: we may not be able to prove effect-free-ness of something (e.g. `typeassert`)
- but we may sometimes know it doesn't escape
- and we can try to "delay" allocation to where it can be thrown, and do optimization (maybe SROA ?) more aggressively
- when it does escape in a limited scope, only allocate object when thrown
```julia
s = 0
for i in 1:n
arr = [i] # want to eliminate this allocation
s += arr[1] # may throw,
end
s
```
```julia
let
r = Ref(x)
s = r::Base.RefValue{String}
r[]
end
```
- conditions ... ?
- it shouldn't alias to anything
- lifetime should end in optimization scope
- dynamic check to enable stack allocation ?
- make "type-tag" keep information about allocation
- "pool-allocation"
- optimize if an object is in the pool
- make query to check whether it's allocated in the pool
- when the GC allocates, it's going to be allocated in the pool
- (as an implementation detail of heap-to-stack alloc opt)
- immutables ~~don't need analysis ...?~~
```julia
let
# type-tag needs to be on stack (next to the object) so that GC can see it ..?
# 1. immutable points to reference to heap object (`Any`)
# but the `Any` object might be able to be allocated on stack as well
# type-tag (size unknown ?)
# 2. if immutable escapes, allocate it on the heap
s = Some{Any}(10) # <= points to heap allocated (if escapes)
end
```
- AD may also possibly benefit from mutation analysis
- Zygote: doesn't support
- Diffractor: ???
- Enzyme: use alias/escape analysis to find objects whose mutation order should be recorded
- Julia's memory management/optimization
- `llvm-alloc-opt.cpp`: SROA pass on heap memory
- some LLVM has its own SROA pass on stack memory
- `gc.c`: memory management (allocation, finalization)
- defines standard runtime semantics
- keywords:
- mark-and-sweep
- non-moving
- tri-color
- precise: never changes
- generation
- stop-the-world: => concurrent garbage collection
## 2021/09/29
### Agenda
- [x] agree on target optimizations ([Targets](#Targets))
- [x] plan collaboration ([Development](#Development-Plan))
* [x] meeting schedule
* [x] discuss roles
### more optimization ideas ?
- alias analysis on Julia level
- we have LLVM-level TBAA pass: it just checks types
- we can tell more information to it
- memory-reuse: try to reuse the same memory across loop iterations
- we can optimize it when it doesn't escape the loop (use scope marker, like gcpreserve)
- we want to make it control-flow sensitive