# Intensive Prior Research on Escape Analysis
<!--{%hackmd theme-dark %}-->
[TOC]
## Prior Research
**Format**:
1. what ?
1. what is the point of the idea ?
1. how is it interesting compared to prior works ?
1. how is it validated ?
1. is there any remaining discussion ?
1. what's next to read ?
### (JVM) Escape Analysis in the Context of Dynamic Compilation and Deoptimization
- info: _paper_, _2005_
- <https://www.usenix.org/legacy/events/vee05/full_papers/p111-kotzmann.pdf>
- Thomas Kotzmann
Institute for System Software
Johannes Kepler University Linz
Linz, Austria
- Hanspeter Mössenböck
Institute for System Software
Johannes Kepler University Linz
Linz, Austria
1. what ?
- an intra-procedural and inter-procedural escape analysis algorithm
- with the support for dynamic class loading and "deoptimization"[^1]
- target optimization: SROA / synchronization removal[^3] / stack allocation
- by-product: a better inlining heuristic based on escape information
[^1]: This seems to be a concept specific to JIT compilers that perform speculative optimizations,
where VM execution transfers control back to the interpreter
whenever a speculative assumption made for optimization gets invalidated.
[^3]: "synchronization removal" here seems to simply means "lock elision"
1. what is the point of the idea ?
- _intra-procedural_ analysis: _forward_ abstract interpretation on SSA form
- escape state:
- local objects (SSA allocation): fixed-length array
- fields: growing-length array
- NOTE: escapes of locals are propagated to their fields
- control flow handling (i.e. phi-node handling)
- if any operand of a phi-function escapes, propagate the escape state to all other operands (since then "global SROA doesn't make any sense"[^4])
- use an _equi-escape set (EES)_ to propagate escape information among the related phi-operands (see the sketch after this list)
- the EES is implemented with a union-find algorithm with path compression
- _inter-procedural_ analysis: propagate the escape information of the ("formal"[^2]) arguments as part of the method descriptor
- purpose: stack allocation (i.e. identifies objects that can be allocated on the caller stack)
- benefits of stack allocation:
- cheap allocation (reuse the same stack slot for allocation in a loop)
- efficient field load (access can be relative to frame's base pointer)
- (efficient field store: possible but note that actual argument might be heap-allocated)
- by-product: inter-procedural escape information turned out to be a good inlining heuristic
- > inline methods up to twice as large as the normal threshold if there is at least one parameter that does neither escape the caller nor the callee
- runtime support
- support for deoptimization:
- debugging information keeps a record of how SROA and synchronization removal were performed by the compiler
- GC modification for stack allocation:
- > A stack object must not be moved in memory, but the heap objects referenced by its fields need to be visited and kept alive. Therefore, we modified the garbage collector so that it does not visit stack objects but treats pointer fields within stack objects as root pointers
[^4]: This statement seems to reflect their purpose of SROA: SROA is done for allocation elimination, and an object allocation can't be eliminated if there is any site where it escapes.
[^2]: They seem to use "formal" when referring to declared method arguments, and "actual" to those actually passed at callsite.
1. how is it interesting compared to prior works ?
- a good summary of a basic escape analysis design and of basic target optimization design
- Their idea to use _equi-escape set_ to handle control flow effects
1. how is it validated ?
- SROA validation: insert runtime assertion (run without SROA but with assertions on otherwise-replaced values)
- performance: benchmarks
1. is there any remaining discussion ?
- Is "graph" (EES) is really needed to handle control flow ? Couldn't backward analysis make it more simpler ?
- ~~How they limit the lattice of field escapes (since its size isn't limited) ?~~ # of SSA statements is finite, and # of allocations is finite, thus field states is also finite
1. what's next to read ?
- F. Vivien and M. Rinard. Incrementalized pointer and escape analysis. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 35–46, Snowbird, June 2001.
- general observation on allocations; proposes an incremental analysis
### (JVM) Partial Escape Analysis and Scalar Replacement for Java
- info: _paper_, _2014_
- <https://ssw.jku.at/Research/Papers/Stadler14/Stadler2014-CGO-PEA.pdf>
- Lukas Stadler
Johannes Kepler University
Linz, Austria
- Thomas Würthinger
Oracle Labs
- Hanspeter Mössenböck
Johannes Kepler University
Linz, Austria
1. what ?
- flow-sensitive escape analysis and scalar replacements
- (with a support for deoptimization from speculative execution)
- target optimizations: SROA, lock elision (_on individual branches_)
1. what is the point of the idea ?
- > In many cases, however, an object escapes just in a single unlikely branch.
```java
Object getValue(int idx, Object ref) {
    Key key = new Key(idx, ref);
    if (key.equals(cacheKey)) {
        return cacheValue; // `key` does not escape on this path
    } else {
        cacheKey = key; // escapes to the static field
        cacheValue = createValue(...);
        return cacheValue;
    }
}
```
- the basic idea is to transform the above snippet into something like:
```java
Object getValue(int idx, Object ref) {
    if (/* comparison against the scalar-replaced fields of `key` */ ...) {
        return cacheValue;
    } else {
        Key key = new Key(idx, ref); // "materialization"
        cacheKey = key;
        cacheValue = createValue(...);
        return cacheValue;
    }
}
```
- concepts
- _"partial escape analysis"_: flow-sensitive escape analysis
- _"materialization"_: materialize allocation to where it escapes
- analysis: _forward_, _flow-sensitive_, _intra-procedural_
- allocation state
```java
class Id extends Node {...} // allocation identity
class ObjectState {}
class VirtualState extends ObjectState {
    int lockCount;
    Node[] fields;
}
class EscapedState extends ObjectState {
    Node materializedValue;
}
class State { // each Node has this state
    Map<Id, ObjectState> states;
    /**
     * Loading an object with `VirtualState` from a field of another
     * `VirtualState` object inserts a new entry into this map, so
     * that the `Load` node can be recognized as referring to the
     * `VirtualState` object during further processing.
     *
     * If any of the operands of a node is a key in this `aliases` map,
     * then the node needs to be examined as well.
     */
    Map<Node, Id> aliases;
}
```
- state propagation
- each "node" has the state (=> flow-sensitivity)
- each node updates its state using the states from previous nodes
- `Merge` node: merge control-flows
- loop entry node: merge loop backedges (could be more than two ?)
- others: just propagate
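Below is a much simplified sketch (in Julia, not Graal code) of what merging two predecessor states at a `Merge` node could look like, given the data structures above; `materialize!` and `make_phi` are hypothetical hooks that emit a materialization on a given predecessor path and create a phi of two values, respectively.
```julia
# Simplified sketch of state merging for partial escape analysis.
abstract type ObjectState end
struct VirtualState <: ObjectState
    lock_count::Int
    fields::Vector{Any}          # SSA values of the scalar-replaced fields
end
struct EscapedState <: ObjectState
    materialized_value::Any
end

function merge_states(a::Dict{Int,ObjectState}, b::Dict{Int,ObjectState},
                      materialize!, make_phi)
    merged = Dict{Int,ObjectState}()
    for (id, sa) in a
        haskey(b, id) || continue    # only consider objects tracked on both paths
        sb = b[id]
        if sa isa VirtualState && sb isa VirtualState
            # still virtual on both paths: phi together any field that differs
            fields = Any[fa === fb ? fa : make_phi(fa, fb)
                         for (fa, fb) in zip(sa.fields, sb.fields)]
            merged[id] = VirtualState(sa.lock_count, fields)  # assumes lock depths match
        else
            # escaped on at least one path: materialize on the still-virtual path
            va = sa isa VirtualState ? materialize!(:a, id, sa) : sa.materialized_value
            vb = sb isa VirtualState ? materialize!(:b, id, sb) : sb.materialized_value
            merged[id] = EscapedState(va === vb ? va : make_phi(va, vb))
        end
    end
    return merged
end
```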
1. how is it interesting compared to prior works ?
- flow-sensitive SROA (and "materialization")
1. how is it validated ?
- performance benchmarks: reduced allocated memory by up to 58.5%, improves performance by up to 33%
1. is there any remaining discussion ?
- in the worst case, could it lead to bigger code size if "materialization" happens in both branches ?
- how can it be extended to eliminate array allocations ?
1. what's next to read ?
- A. Shankar, M. Arnold, and R. Bodik. Jolt: lightweight dynamic analysis and removal of object churn. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), pages 127–142. ACM Press, 2008.
- a lightweight dynamic analysis to reduce excessive creation of short-lived objects, by guiding inlining so that escape analysis is more effective
- P. Molnar, A. Krall, and F. Brandner. Stack allocation of objects in the CACAO virtual machine. In Proceedings of the International Conference on the Principles and Practice of Programming in Java (PPPJ), pages 153–161. ACM Press, 2009.
- a similar approach to that of Kotzmann and Mössenböck, performing stack allocation in the CACAO VM
- V8: very local analysis (looking only at the usages of the allocation, _independently_)
Notes:
- "Zone": heap areas with a known, limited lifetime. The whole area can be freed in bulk when a certain scope is left.
### (JVM) Current state of EA and its uses in the JVM
- info: _talk_, _2020_
- <https://www.youtube.com/watch?v=p1MhRBYS-k0>
- Charlie Gracie
This talk summarizes the status of escape analysis and related optimizations in the Java community:
- 3 main optimizations:
1. monitor elision (lock elision)
2. stack allocation
3. scalar replacement
- C2 vs. Graal vs. OpenJ9
- C2: flow-insensitive escape analysis, global SROA
- Graal: flow-sensitive escape analysis, partial SROA
- OpenJ9: flow-sensitive escape analysis, partial SROA & stack allocation !
- SROA vs. stack allocation: stack allocation can still be applied even when the control flow is too complicated for SROA, e.g. in the following motivating example:
```java
public static int simple_math_myobject_valueof(int val1, int val2) {
    // these allocations can't be eliminated because they are represented as
    // phi nodes of (global object, scalar-replaceable object)
    MyObject int1 = MyObject.valueOf(val1);
    MyObject int2 = MyObject.valueOf(val2);
    return int1.add(int2);
}

// `_cache` is assumed to be a small statically initialized MyObject[] cache
public static MyObject valueOf(int val) {
    if (0 <= val && val <= 2) {
        return _cache[val];
    } else {
        return new MyObject(val);
    }
}
```
- challenges with stack allocation
- GC extension
- lifetime overlap detection
- "heapification" at deoptimization
- prototyped code can be found at: <https://github.com/eclipse-openj9/openj9/blob/master/runtime/compiler/optimizer/EscapeAnalysis.cpp>
- complicated, `9432` lines of C++ code
### (LuaJIT) Allocation Sinking Optimization
- info: _documentation_, _2012_
- <http://wiki.luajit.org/Allocation-Sinking-Optimization#implementation>
- Mike Pall
1. what ?
- a variant of code motion, which moves an allocation to where it's really needed (e.g. in a fallback path), built on top of flow-sensitive SROA enhanced by LuaJIT's "advanced alias analysis"
1. what is the point of the idea ?
- observations:
- global escape analysis is pointless, especially in dynamic languages, where code typically has many fallback paths that often escape things
- escapes often happen in uncommon paths, and allocations are avoidable in fast path
- implementation: "mark & sweep"
- (mark) marks all allocations that cannot be sunk
- this phase is essentially one instance of escape analysis
- _intra-procedural_, _backward_, _flow-insensitive_[^5]
- (sweep) tags the unmarked allocations and the related stores as sunk
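A schematic sketch (in Julia, over a made-up toy trace IR rather than LuaJIT's actual IR) of the two-phase structure described above:
```julia
# Toy instruction kinds; `obj`/`val` are indices of earlier instructions in the trace.
abstract type Inst end
struct Alloc     <: Inst end
struct Store     <: Inst; obj::Int; val::Int; end  # store the value of `val` into `obj`
struct EscapeUse <: Inst; obj::Int; end            # any use that lets `obj` escape

function allocation_sinking(trace::Vector{Inst})
    escaped = Set{Int}()
    # (mark) walk backward and mark every allocation that cannot be sunk
    for idx in reverse(eachindex(trace))
        inst = trace[idx]
        if inst isa EscapeUse
            push!(escaped, inst.obj)
        elseif inst isa Store && inst.obj in escaped
            # storing a value into an escaped object escapes that value as well
            trace[inst.val] isa Alloc && push!(escaped, inst.val)
        end
    end
    # (sweep) tag the unmarked allocations and the related stores as sunk
    sunk = Set{Int}(idx for idx in eachindex(trace)
                    if trace[idx] isa Alloc && !(idx in escaped))
    sunk_stores = Set{Int}(idx for idx in eachindex(trace)
                           if trace[idx] isa Store && trace[idx].obj in sunk)
    return sunk, sunk_stores
end
```
(A single backward pass suffices here only because the toy escape propagation is shallow; chains of stores may require iteration in general.)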
1. how is it interesting compared to prior works ?
- one of the earliest approaches to flow-sensitive SROA ?
1. how is it validated ?
1. is there any remaining discussion ?
1. what's next to read ?
[^5]: "How LuaJIT encodes flow-sensitivity ?"
LuaJIT's tracing-JIT nature separates an unlikely branch into a separate snapshot.
It allows escapes that happens in the rare snapshots not to escape the objects in the main trace.
In other word, necessary flow-sensitivity is encoded within the nature of tracing JIT.
### (JavaScript) Escape Analysis in V8
- info: _talk_, _2018_
- <https://www.youtube.com/watch?v=KiWEWLwQ3oI>
- Tobias Tebbi
1. what ?
- speculative escape analysis and allocation elimination with load replacement (SROA)
- _forward_, _flow-sensitive_, allocation analysis
- each escape site refines the analysis state
- state updates can be costly since each "node" carries a state and the # of nodes is large
- => a "functional map data structure" solved it (see the sketch below)
1. how is it interesting compared to prior works ?
- implement escape analysis on "unscheduled" SSA form
- "unscheduled"
- scheduled graph: one state per basic block
- unscheduled: one state per "effectful node"
- "functional map data structure" turns out to be successful
- handle V8 deoptimization
1. how is it validated ?
1. is there any remaining discussion ?
1. what's next to read ?
### (LLVM) Alias Analysis
- [LLVM Alias Analysis Infrastructure](https://llvm.org/docs/AliasAnalysis.html)
- `alias` method
- `getModRefInfo(CS1, CS2)`: memory effect test between function calls
- helper methods: `doesNotAccessMemory` / `onlyReadsMemory`
- interesting: interface methods to keep the analysis state up-to-date even after other consumer transformations
- analysis passes:
* `basic-aa`: walks through domination
* TBAA: walk through type domain
### Other references
- [JOLT: Lightweight Dynamic Analysis and Removal of Object Churn](https://www.boopidy.com/aj/cs/jolt.pdf)
- [Escape Analysis for Java](https://www.researchgate.net/publication/2804836_Escape_Analysis_for_Java)
- [Escape Analysis in Java: Theory and Practice](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.463.1254&rep=rep1&type=pdf)
- [Incrementalized Pointer and Escape Analysis](https://people.csail.mit.edu/rinard/paper/pldi01.pdf)
## Observations
### General transition of memory optimization:
1. SROA: _eliminates_ allocations
1. partial SROA (allocation sinking): _eliminates_ allocations in the presence of simple branches
1. stack allocation: _eases_ allocation cost in the presence of complicated branches
### Analysis design
**General design choices**:
- _flow-insensitive_ vs. _flow-sensitive_: simply, analysis complexity vs. analysis accuracy
- _intra-procedural_ vs. _inter-procedural_[^6]:
* intra-procedural for SROA
* inter-procedural for stack allocation
- _forward_ vs. _backward_:
* forward: easier to follow control-flow, suited for SROA
* forward analysis will invalidate previous state whenever it sees new escape information
(which may lead to more iterations to reach a fixpoint)
* backward: easier to propagate escape information, suited for general escape analysis
**Escape analyses in the wild**:
- (JVM C2) SROA: _flow-insensitive_, _intra-procedural_, _forward_
- (JVM Graal) partial SROA: _flow-sensitive_, _intra-procedural_, _forward_
- (JVM OpenJ9) partial SROA & stack allocation: _flow-sensitive_, _inter-procedural_, ???
- (JavaScript V8) partial SROA: _flow-sensitive_, _intra-procedural_, _forward_
- (LuaJIT) allocation sinking (partial SROA): _flow-insensitive_[^5], _intra-procedural_, two-phase analysis
1. _backward_ analysis for escape analysis (collect escape information about allocation)
2. _forward_ analysis for partial SROA (to tag sunk allocations)
[^6]: Well, usually the "inter-procedural" part isn't so challenging: basically, the inter-procedural
information is encoded in the callee's method descriptor, and the main lifting is to translate
that information into the caller's context.
## What's for Julia
### status quo, and future
- SROA and stack allocation
- flow-insensitive SROA :+1: (:warning:[^7])
```julia
struct Point
    x::Float64
    y::Float64
end
add(a::Point, b::Point) = Point(a.x + b.x, a.y + b.y)
function compute()
    a = Point(1.5, 2.5)
    b = Point(2.25, 4.75)
    for i in 0:(100000000-1)
        a = add(add(a, b), b)
    end
    a.x, a.y
end
```
- flow-sensitive SROA :disappointed:
```julia
struct Key
    idx::Int
    ref::Symbol
end
const CACHED_KEY = Ref(Key(0, :orig))
@inline eqkey(k1, k2) = k1.idx == k2.idx && k1.ref === k2.ref
function getref(idx, ref)
    key = Key(idx, ref) # want to move this allocation into the `if` branch
    if !eqkey(key, CACHED_KEY[])
        CACHED_KEY[] = key
    end
    return key.ref
end
```
- stack allocation :drooling_face:
```julia
let
    if cond
        key = CACHED_KEY[]
    else
        key = Key(idx, ref) # can we avoid this allocation ?
        CACHED_KEY[] = key
    end
    return key.ref # key::ϕ(global object, allocation)
end
```
- DCE, more constant-folding/inference: SROA can derive new type information (see the example below)
* Java: every field is statically-typed (mostly ?)
* Julia: untyped, `Union`-typed fields
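A tiny example of that point: once the allocation is scalar-replaced, a load from an untyped field becomes the original SSA value, so inference can recover a concrete type that the field declaration alone cannot provide.
```julia
mutable struct Box
    x   # untyped field: a load from it is inferred as `Any`
end
function f(a::Int)
    b = Box(a)
    return b.x + 1   # after SROA this is just `a + 1`, which infers as `Int`
end
```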
- third consumers of escape information
* `ImmutableArray`
* some taint analysis for efficient contribution propagation for Diffractor
[^7]: Even flow-insensitive SROA sometimes fails at the Julia level.
For example, if `Point` is declared as a `mutable struct`, Julia-level SROA fails on the above example.
But for such cases, the succeeding LLVM passes will optimize the allocations away.
- some implementation ideas
1. separate "state" into two parts
- _backward_ escape analysis: general usages, allocation movement (for partial SROA)
- with escape state _per basic block_ (rather than _per SSA statement_)
- _forward_ allocation analysis: for SROA
1. SROA at inference time:
- insight: we already have a decent forward analysis
- inference time
- (effect analysis) inter-procedural memory-effect propagation
- form `PartialStruct` for mutables
- (SROA, tagging): constant-folding or as encoded lattice property _at inference_
- make `PartialStruct.fields` contain `Replaceable(::SlotNumber)` (a rough sketch follows this list)
- (DCE, constant-folding): inference is already flow-sensitive
- optimization time
- (SROA, replacement): look for `getfield(...)::Union{Const,Replaceable}`
- (SROA) allocation sinking: new code motion pass
- escape analysis: just summarizes the result of inference for succeeding optimizations
- what's needed ?
* lifetime query
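A very rough sketch of the hypothetical lattice elements mentioned above (`Replaceable` and a `PartialStruct`-like wrapper for mutables); this only illustrates the shape of the idea and is not actual Julia compiler code.
```julia
# Hypothetical lattice elements for the "SROA at inference time" idea.
# `Replaceable` records which SSA value (or slot) would provide the field value
# if the allocation ends up being scalar-replaced.
struct Replaceable
    def::Int                 # SSA value / slot number holding the field value
end

# a `PartialStruct`-like element for mutable allocations: each field entry is
# either a known constant or a `Replaceable` pointing back at its definition
struct MutablePartialStruct
    typ::DataType
    fields::Vector{Any}
end

# at optimization time, a `getfield` whose inferred lattice element is a
# `Replaceable` (or a constant) can be rewritten to the recorded definition
is_replaceable_load(inferred) = inferred isa Replaceable
```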
---
- Q. "SROA at inference"?
- Valentin: _what kind of information do you want to propagate around ?_
- Shuhei (revisit)
- memory effect analysis should be presented to inference
- SROA tagging during inference:
- may complicate things:
- brings a part of optimization into inference
- the domain of inference lattice gets complicated: it now encodes CFG information
- we need to keep tags up-to-date throughout many optimizations, including slot2ssa, inlining
- idea: reuse existing flow-sensitive, forward, analysis to implement more "decent" SROA
- stack allocation:
- non-escape argument: ok
- return value: (change calling convention)
- calling convention:
- LLVM level: struct pointer (`*`)
- mutability is still very hard to handle
- how information propagates ?
- LLVM: rerun analysis
- on high-level Julia IR: how to handle inter-procedural information
- why on LLVM ?
- existed for long
- opportunities for more optimizations using LLVM infrastructure
- problems:
- intra-procedural
- stack allocation
- more research
- publish review paper: before something novel
- focus on compactness ?
- other compilers (tracing JIT) vs. Julia (ahead of runtime)
* predictability
* implementation complexities of tracing-JIT:
- complexity vs. optimization opportunities