# Intensive Prior Research on Escape Analysis
<!--{%hackmd theme-dark %}-->
[TOC]
## Prior Research
**Format**:
1. what ?
1. what is the point of the idea ?
1. how is it interesting compared to prior works ?
1. how is it validated ?
1. is there any remaining discussion ?
1. what's next to read ?
### (JVM) Escape Analysis in the Context of Dynamic Compilation and Deoptimization
- info: _paper_, _2005_
- <https://www.usenix.org/legacy/events/vee05/full_papers/p111-kotzmann.pdf>
- Thomas Kotzmann
Institute for System Software
Johannes Kepler University Linz
Linz, Austria
- Hanspeter Mössenböck
Institute for System Software
Johannes Kepler University Linz
Linz, Austria
1. what ?
- an intra-procedural and inter-procedural escape analysis algorithm
- with the support for dynamic class loading and "deoptimization"[^1]
- target optimization: SROA / synchronization removal[^3] / stack allocation
- by-product: a better inlining heuristic based on escape information
[^1]: This seems to be a concept specific to JIT compilers that perform speculative optimizations,
where VM execution transfers control back to the interpreter
whenever a speculative assumption made for optimization gets invalidated.
[^3]: "synchronization removal" here seems to simply means "lock elision"
1. what is the point of the idea ?
- _intra-procedural_ analysis: _forward_ abstract interpretation on SSA form
- escape state:
- local objects (SSA allocation): fixed-length array
- fields: growing-length array
- NOTE: escapes of locals are propagated to their fields
- control flow handling (i.e. phi-node handling)
- if any operand of a phi-function escapes, propagate the escape state to all other operands (since then "global SROA doesn't make any sense"[^4])
- use an _equi-escape set (EES)_ to propagate escape information among the related phi-operands (see the sketch after this list)
- the EES is implemented with a union-find algorithm with path compression
- _inter-procedural_ analysis: propagate the escape information of the ("formal"[^2]) arguments as part of the method descriptor
- purpose: stack allocation (i.e. identifies objects that can be allocated on the caller stack)
- benefits of stack allocation:
- cheap allocation (reuse the same stack slot for allocation in a loop)
- efficient field load (access can be relative to frame's base pointer)
- (efficient field store: possible but note that actual argument might be heap-allocated)
- by-product: inter-procedural escape information turned out to be a good inlining heuristic
- > inline methods up to twice as large as the normal threshold if there is at least one parameter that does neither escape the caller nor the callee
- runtime support
- support for deoptimization:
- debugging information keeps a record of how SROA and synchronization removal were performed by the compiler
- GC modification for stack allocation:
- > A stack object must not be moved in memory, but the heap objects referenced by its fields need to be visited and kept alive. Therefore, we modified the garbage collector so that it does not visit stack objects but treats pointer fields within stack objects as root pointers
[^4]: This statement seems to reflect their purpose of SROA: SROA is done for allocation elimination, and an object allocation can't be eliminated if there is any site where it escapes.
[^2]: They seem to use "formal" when referring to declared method arguments, and "actual" to those actually passed at callsite.
1. how is it interesting compared to prior works ?
- a good summary of a basic escape analysis design and of basic target optimization design
- Their idea to use _equi-escape set_ to handle control flow effects
1. how is it validated ?
- SROA validation: insert runtime assertion (run without SROA but with assertions on otherwise-replaced values)
- performance: benchmarks
1. is there any remaining discussion ?
- Is "graph" (EES) is really needed to handle control flow ? Couldn't backward analysis make it more simpler ?
- ~~How they limit the lattice of field escapes (since its size isn't limited) ?~~ # of SSA statements is finite, and # of allocations is finite, thus field states is also finite
1. what's next to read ?
- F. Vivien and M. Rinard. Incrementalized pointer and escape analysis. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 35–46, Snowbird, June 2001.
- general observation on allocations; proposes an incremental analysis
### (JVM) Partial Escape Analysis and Scalar Replacement for Java
- info: _paper_, _2014_
- <https://ssw.jku.at/Research/Papers/Stadler14/Stadler2014-CGO-PEA.pdf>
- Lukas Stadler
Johannes Kepler University
Linz, Austria
- Thomas Würthinger
Oracle Labs
- Hanspeter Mössenböck
Johannes Kepler University
Linz, Austria
1. what ?
- flow-sensitive escape analysis and scalar replacements
- (with a support for deoptimization from speculative execution)
- target optimizations: SROA, lock elision (_on individual branches_)
1. what is the point of the idea ?
- > In many cases, however, an object escapes just in a single unlikely branch.
```java
Object getValue(int idx, Object ref) {
    Key key = new Key(idx, ref);
    if (key.equals(cacheKey)) {
        return cacheValue; // `key` does not escape on this path
    } else {
        cacheKey = key; // escapes to the static field
        cacheValue = createValue(...);
        return cacheValue;
    }
}
```
- the basic idea is to transform the above snippet into something like:
```java
Object getValue(int idx, Object ref) {
    if (/* comparison against the scalar-replaced fields of `key` */ ...) {
        return cacheValue;
    } else {
        Key key = new Key(idx, ref); // "materialization"
        cacheKey = key;
        cacheValue = createValue(...);
        return cacheValue;
    }
}
```
- concepts
- _"partial escape analysis"_: flow-sensitive escape analysis
- _"materialization"_: materialize allocation to where it escapes
- analysis: _forward_, _flow-sensitive_, _intra-procedural_
- allocation state
```java
class Id extends Node {...} // allocation identity
class ObjectState {}
class VirtualState extends ObjectState {
    int lockCount;
    Node[] fields;
}
class EscapedState extends ObjectState {
    Node materializedValue;
}
class State { // each Node has this state
    Map<Id, ObjectState> states;
    /**
     * Loading an object with `VirtualState` from a field of another
     * `VirtualState` object inserts a new entry into this map, so
     * that the `Load` node can be recognized as referring to the
     * `VirtualState` object during further processing.
     *
     * If any of the operands of a node is a key in this `aliases` map,
     * then the node needs to be examined as well.
     */
    Map<Node, Id> aliases;
}
```
- state propagation
- each "node" has the state (=> flow-sensitivity)
- each node updates its state using the states from previous nodes
- `Merge` node: merge control-flows
- loop entry node: merge loop backedges (could be more than two ?)
- others: just propagate
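Below is a much simplified sketch (in Julia, not Graal code) of what merging two predecessor states at a `Merge` node could look like, given the data structures above; `materialize!` and `make_phi` are hypothetical hooks that emit a materialization on a given predecessor path and create a phi of two values, respectively.
```julia
# Simplified sketch of state merging for partial escape analysis.
abstract type ObjectState end
struct VirtualState <: ObjectState
    lock_count::Int
    fields::Vector{Any}          # SSA values of the scalar-replaced fields
end
struct EscapedState <: ObjectState
    materialized_value::Any
end

function merge_states(a::Dict{Int,ObjectState}, b::Dict{Int,ObjectState},
                      materialize!, make_phi)
    merged = Dict{Int,ObjectState}()
    for (id, sa) in a
        haskey(b, id) || continue    # only consider objects tracked on both paths
        sb = b[id]
        if sa isa VirtualState && sb isa VirtualState
            # still virtual on both paths: phi together any field that differs
            fields = Any[fa === fb ? fa : make_phi(fa, fb)
                         for (fa, fb) in zip(sa.fields, sb.fields)]
            merged[id] = VirtualState(sa.lock_count, fields)  # assumes lock depths match
        else
            # escaped on at least one path: materialize on the still-virtual path
            va = sa isa VirtualState ? materialize!(:a, id, sa) : sa.materialized_value
            vb = sb isa VirtualState ? materialize!(:b, id, sb) : sb.materialized_value
            merged[id] = EscapedState(va === vb ? va : make_phi(va, vb))
        end
    end
    return merged
end
```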
1. how is it interesting compared to prior works ?
- flow-sensitive SROA (and "materialization")
1. how is it validated ?
- performance benchmarks: reduced allocated memory by up to 58.5%, improves performance by up to 33%
1. is there any remaining discussion ?
- in the worst case, could it lead to bigger code size if "materialization" happens in both branches ?
- how can it be extended to eliminate array allocations ?
1. what's next to read ?
- A. Shankar, M. Arnold, and R. Bodik. Jolt: lightweight dynamic analysis and removal of object churn. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), pages 127–142. ACM Press, 2008.
- a lightweight dynamic analysis to reduce excessive creation of short-lived objects, by guiding inlining so that escape analysis is more effective
- P. Molnar, A. Krall, and F. Brandner. Stack allocation of objects in the CACAO virtual machine. In Proceedings of the International Conference on the Principles and Practice of Programming in Java (PPPJ), pages 153–161. ACM Press, 2009.
- a similar approach to that of Kotzmann and Mössenböck, performing stack allocation in the CACAO VM
- V8: very local analysis (looking only at the usages of the allocation, _independently_)
Notes:
- "Zone": heap areas with a known, limited lifetime. The whole area can be freed in bulk when a certain scope is left.
### (JVM) Current state of EA and its uses in the JVM
- info: _talk_, _2020_
- <https://www.youtube.com/watch?v=p1MhRBYS-k0>
- Charlie Gracie
This talk summarizes the status of escape analysis and related optimizations in the Java community:
- 3 main optimizations:
1. monitor elision (lock elision)
2. stack allocation
3. scalar replacement
- C2 vs. Graal vs. OpenJ9
- C2: flow-insensitive escape analysis, global SROA
- Graal: flow-sensitive escape analysis, partial SROA
- OpenJ9: flow-sensitive escape analysis, partial SROA & stack allocation !
- SROA vs. stack allocation: stack allocation can still be applied even when the control flow is too complicated for SROA, e.g. in the following motivating example:
```java
public static int simple_math_myobject_valueof(int val1, int val2) {
    // these allocations can't be eliminated because they are represented as
    // phi nodes of (global object, scalar-replaceable object)
    MyObject int1 = MyObject.valueOf(val1);
    MyObject int2 = MyObject.valueOf(val2);
    return int1.add(int2);
}

// `_cache` is assumed to be a small statically initialized MyObject[] cache
public static MyObject valueOf(int val) {
    if (0 <= val && val <= 2) {
        return _cache[val];
    } else {
        return new MyObject(val);
    }
}
```
- challenges with stack allocation
- GC extension
- lifetime overlap detection
- "heapification" at deoptimization
- prototyped code can be found at: <https://github.com/eclipse-openj9/openj9/blob/master/runtime/compiler/optimizer/EscapeAnalysis.cpp>
- complicated, `9432` lines of C++ code
### (LuaJIT) Allocation Sinking Optimization
- info: _documentation_, _2012_
- <http://wiki.luajit.org/Allocation-Sinking-Optimization#implementation>
- Mike Pall
1. what ?
- a variant of code motion, which moves an allocation to where it's really needed (e.g. in a fallback path), built on top of flow-sensitive SROA enhanced by LuaJIT's "advanced alias analysis"
1. what is the point of the idea ?
- observations:
- global escape analysis is pointless, especially in dynamic languages, where code typically has many fallback paths that often escape things
- escapes often happen in uncommon paths, and allocations are avoidable in fast path
- implementation: "mark & sweep"
- (mark) marks all allocations that cannot be sunk
- this phase is essentially one instance of escape analysis
- _intra-procedural_, _backward_, _flow-insensitive_[^5]
- (sweep) tags the unmarked allocations and the related stores as sunk
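A schematic sketch (in Julia, over a made-up toy trace IR rather than LuaJIT's actual IR) of the two-phase structure described above:
```julia
# Toy instruction kinds; `obj`/`val` are indices of earlier instructions in the trace.
abstract type Inst end
struct Alloc     <: Inst end
struct Store     <: Inst; obj::Int; val::Int; end  # store the value of `val` into `obj`
struct EscapeUse <: Inst; obj::Int; end            # any use that lets `obj` escape

function allocation_sinking(trace::Vector{Inst})
    escaped = Set{Int}()
    # (mark) walk backward and mark every allocation that cannot be sunk
    for idx in reverse(eachindex(trace))
        inst = trace[idx]
        if inst isa EscapeUse
            push!(escaped, inst.obj)
        elseif inst isa Store && inst.obj in escaped
            # storing a value into an escaped object escapes that value as well
            trace[inst.val] isa Alloc && push!(escaped, inst.val)
        end
    end
    # (sweep) tag the unmarked allocations and the related stores as sunk
    sunk = Set{Int}(idx for idx in eachindex(trace)
                    if trace[idx] isa Alloc && !(idx in escaped))
    sunk_stores = Set{Int}(idx for idx in eachindex(trace)
                           if trace[idx] isa Store && trace[idx].obj in sunk)
    return sunk, sunk_stores
end
```
(A single backward pass suffices here only because the toy escape propagation is shallow; chains of stores may require iteration in general.)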
1. how is it interesting compared to prior works ?
- one of the earliest approaches to flow-sensitive SROA ?
1. how is it validated ?
1. is there any remaining discussion ?
1. what's next to read ?
[^5]: "How LuaJIT encodes flow-sensitivity ?"
LuaJIT's tracing-JIT nature separates an unlikely branch into a separate snapshot.
It allows escapes that happens in the rare snapshots not to escape the objects in the main trace.
In other word, necessary flow-sensitivity is encoded within the nature of tracing JIT.
### (JavaScript) Escape Analysis in V8
- info: _talk_, _2018_
- <https://www.youtube.com/watch?v=KiWEWLwQ3oI>
- Tobias Tebbi
1. what ?
- speculative escape analysis and allocation elimination with load replacement (SROA)
- _forward_, _flow-sensitive_, allocation analysis
- each escape site refines the analysis state
- state updates can be costly since each "node" carries a state and the # of nodes is large
- => a "functional map data structure" solved it (see the sketch below)
1. how is it interesting compared to prior works ?
- implement escape analysis on "unscheduled" SSA form
- "unscheduled"
- scheduled graph: one state per basic block
- unscheduled: one state per "effectful node"
- "functional map data structure" turns out to be successful
- handle V8 deoptimization
1. how is it validated ?
1. is there any remaining discussion ?
1. what's next to read ?
### (LLVM) Alias Analysis
- [LLVM Alias Analysis Infrastructure](https://llvm.org/docs/AliasAnalysis.html)
- `alias` method
- `getModRefInfo(CS1, CS2)`: memory effect test between function calls
- helper methods: `doesNotAccessMemory` / `onlyReadsMemory`
- interesting: interface methods to keep the analysis state up-to-date even after other consumer transformations
- analysis passes:
* `basic-aa`: walks through domination
* TBAA: walk through type domain
### Other references
- [JOLT: Lightweight Dynamic Analysis and Removal of Object Churn](https://www.boopidy.com/aj/cs/jolt.pdf)
- [Escape Analysis for Java](https://www.researchgate.net/publication/2804836_Escape_Analysis_for_Java)
- [Escape Analysis in Java: Theory and Practice](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.463.1254&rep=rep1&type=pdf)
- [Incrementalized Pointer and Escape Analysis](https://people.csail.mit.edu/rinard/paper/pldi01.pdf)
## Observations
### General transition of memory optimization:
1. SROA: _eliminates_ allocations
1. partial SROA (allocation sinking): _eliminates_ allocations in the presence of simple branches
1. stack allocation: _eases_ allocation cost in the presence of complicated branches
### Analysis design
**General design choices**:
- _flow-insensitive_ vs. _flow-sensitive_: simply, analysis complexity vs. analysis accuracy
- _intra-procedural_ vs. _inter-procedural_[^6]:
* intra-procedural for SROA
* inter-procedural for stack allocation
- _forward_ vs. _backward_:
* forward: easier to follow control-flow, suited for SROA
* forward analysis will invalidate previous state whenever it sees new escape information
(which may lead to more iterations to reach a fixpoint)
* backward: easier to propagate escape information, suited for general escape analysis
**Escape analyses in the wild**:
- (JVM C2) SROA: _flow-insensitive_, _intra-procedural_, _forward_
- (JVM Graal) partial SROA: _flow-sensitive_, _intra-procedural_, _forward_
- (JVM OpenJ9) partial SROA & stack allocation: _flow-sensitive_, _inter-procedural_, ???
- (JavaScript V8) partial SROA: _flow-sensitive_, _intra-procedural_, _forward_
- (LuaJIT) allocation sinking (partial SROA): _flow-insensitive_[^5], _intra-procedural_, two-phase analysis
1. _backward_ analysis for escape analysis (collect escape information about allocation)
2. _forward_ analysis for partial SROA (to tag sunk allocations)
[^6]: Well, usually the "inter-procedural" part isn't so challenging: basically, the inter-procedural
information is encoded in the callee's method descriptor, and the main lifting is to translate
that information into the caller's context.
## What's for Julia
### status quo, and future
- SROA and stack allocation
- flow-insensitive SROA :+1: (:warning:[^7])
```julia
struct Point
    x::Float64
    y::Float64
end
add(a::Point, b::Point) = Point(a.x + b.x, a.y + b.y)
function compute()
    a = Point(1.5, 2.5)
    b = Point(2.25, 4.75)
    for i in 0:(100000000-1)
        a = add(add(a, b), b)
    end
    a.x, a.y
end
```
- flow-sensitive SROA :disappointed:
```julia
struct Key
    idx::Int
    ref::Symbol
end
const CACHED_KEY = Ref(Key(0, :orig))
@inline eqkey(k1, k2) = k1.idx == k2.idx && k1.ref === k2.ref
function getref(idx, ref)
    key = Key(idx, ref) # want to move this allocation into the `if` branch
    if !eqkey(key, CACHED_KEY[])
        CACHED_KEY[] = key
    end
    return key.ref
end
```
- stack allocation :drooling_face:
```julia
let
    if cond
        key = CACHED_KEY[]
    else
        key = Key(idx, ref) # can we avoid this allocation ?
        CACHED_KEY[] = key
    end
    return key.ref # key::ϕ(global object, allocation)
end
```
- DCE, more constant-folding/inference: SROA can derive new type information (see the example below)
* Java: every field is statically-typed (mostly ?)
* Julia: untyped, `Union`-typed fields
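A tiny example of that point: once the allocation is scalar-replaced, a load from an untyped field becomes the original SSA value, so inference can recover a concrete type that the field declaration alone cannot provide.
```julia
mutable struct Box
    x   # untyped field: a load from it is inferred as `Any`
end
function f(a::Int)
    b = Box(a)
    return b.x + 1   # after SROA this is just `a + 1`, which infers as `Int`
end
```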
- third consumers of escape information
* `ImmutableArray`
* some taint analysis for efficient contribution propagation for Diffractor
[^7]: Even flow-insensitive SROA sometimes fails at the Julia level.
For example, if `Point` is declared as a `mutable struct`, Julia-level SROA fails on the above example.
But for such cases, the succeeding LLVM passes will optimize the allocations away.
- some implementation ideas
1. separate "state" into two parts
- _backward_ escape analysis: general usages, allocation movement (for partial SROA)
- with escape state _per basic block_ (rather than _per SSA statement_)
- _forward_ allocation analysis: for SROA
1. SROA at inference time:
- insight: we already have a decent forward analysis
- inference time
- (effect analysis) inter-procedural memory-effect propagation
- form `PartialStruct` for mutables
- (SROA, tagging): constant-folding or as encoded lattice property _at inference_
- make `PartialStruct.fields` contain `Replaceable(::SlotNumber)` (a rough sketch follows this list)
- (DCE, constant-folding): inference is already flow-sensitive
- optimization time
- (SROA, replacement): look for `getfield(...)::Union{Const,Replaceable}`
- (SROA) allocation sinking: new code motion pass
- escape analysis: just summarizes the result of inference for succeeding optimizations
- what's needed ?
* lifetime query
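A very rough sketch of the hypothetical lattice elements mentioned above (`Replaceable` and a `PartialStruct`-like wrapper for mutables); this only illustrates the shape of the idea and is not actual Julia compiler code.
```julia
# Hypothetical lattice elements for the "SROA at inference time" idea.
# `Replaceable` records which SSA value (or slot) would provide the field value
# if the allocation ends up being scalar-replaced.
struct Replaceable
    def::Int                 # SSA value / slot number holding the field value
end

# a `PartialStruct`-like element for mutable allocations: each field entry is
# either a known constant or a `Replaceable` pointing back at its definition
struct MutablePartialStruct
    typ::DataType
    fields::Vector{Any}
end

# at optimization time, a `getfield` whose inferred lattice element is a
# `Replaceable` (or a constant) can be rewritten to the recorded definition
is_replaceable_load(inferred) = inferred isa Replaceable
```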
---
- Q. "SROA at inference"?
- Valentin: _what kind of information do you want to propagate around ?_
- Shuhei (revisit)
- memory effect analysis should be presented to inference
- SROA tagging during inference:
- may complicate things:
- brings a part of optimization into inference
- the domain of inference lattice gets complicated: it now encodes CFG information
- we need to keep tags up-to-date throughout many optimizations, including slot2ssa, inlining
- idea: reuse existing flow-sensitive, forward, analysis to implement more "decent" SROA
- stack allocation:
- non-escape argument: ok
- return value: (change calling convention)
- calling convention:
- LLVM level: struct pointer (`*`)
- mutability is still very hard to handle
- how information propagates ?
- LLVM: rerun analysis
- on high-level Julia IR: how to handle inter-procedural information
- why on LLVM ?
- existed for long
- opportunities for more optimizations using LLVM infrastructure
- problems:
- intra-procedural
- stack allocation
- more research
- publish review paper: before something novel
- focus on compactness ?
- other compilers (tracing JIT) vs. Julia (ahead of runtime)
* predictability
* implementation complexities of tracing-JIT:
- complexity vs. optimization opportunities