Julia's staged programming is quite powerful: it allows arbitrary transformations to be applied to a whole call graph, and those transformations remain composable, just like ordinary functions.
On top of the `@generated` function, the main entry point of Julia's staged programming, Cassette.jl and IRTools.jl implement higher-level interfaces for code transformations at staging time, and they are widely used in the community for advanced computations like automatic differentiation.
Still, there are downsides to the current staged programming infrastructure based on the `@generated` function.
We will discuss them in more detail later, but to put it briefly, with the `@generated` function we end up facing 1.) limited stages, 2.) looser inference, and 3.) high compilation cost.
The purpose of this project is to provide a new, unlimited and more extensive staged programming interface for Julia.
With such an interface, we will be able to hook into more stages of Julia's compilation pipeline than just the pre-inference stage, freely use type information derived from inference, and intervene in the optimization phase.
Even more, combined with the `AbstractInterpreter` interface, we will be able to customize any part of the compilation pipeline.
Along the way, we will also find common utilities for working on `Core.Compiler`, and will want to pack them into a library. We may also want to refactor some of the `Core.Compiler` data structures so that we can manipulate them more easily. We may end up wanting some "composability" of different `AbstractInterpreter`s, which will enable yet more complex but much fancier program analysis and code generation.
So this project will consist of two main sub-projects (N.B. the names are just tentative – we will find better names as the project goes on and things get sorted out):

- CompilerPlugin.jl: the new staged programming / compiler plugin interface
- CompilerTools.jl: a library of common utilities for working with `Core.Compiler`

The final goal of this project is to make them "first-class" libraries, and eventually publish them as standard ones.
It would be helpful to understand the state of the art of customized code execution techniques, and recognize the problems we would like to solve with our compiler-plugin project.
The existing approaches would fall into the following two categories:
- `@generated` function based approach: e.g. Cassette.jl
- `AbstractInterpreter` × LLVM ORC API: e.g. GPUCompiler.jl

### `@generated` Function Based Approach

Julia offers first-class support for staged programming via the `@generated` function interface, and we can exploit it to enable a customized compilation pass.
There are several different package implementations following this approach, e.g. Cassette.jl and IRTools.jl, and they share the same key idea: use the `@generated` function to transform the generic function calls involved in the entire call graph into "overdubbing" versions, whose IRs are manipulated by an external transformation pass.
Since the `@generated` function is fully integrated with Julia's dispatch system, dynamic dispatch is inherently supported. "Overdubbing" is essentially an invocation of a special, dedicated method, which is completely different from the original "overdubbed" method, so the customized code execution stays separate from execution in other contexts.
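For a concrete sense of this style, here is a minimal usage sketch following Cassette.jl's documented API (the `PrintCtx` context and the traced call are made up for illustration):

```julia
using Cassette

# define a new context type; every call overdubbed under it can be customized
Cassette.@context PrintCtx

# `prehook` runs before each call made while overdubbing under `PrintCtx`
Cassette.prehook(::PrintCtx, f, args...) = println("calling ", f, args)

# recursively re-dispatches every call reachable from `sum` through the context
Cassette.overdub(PrintCtx(), sum, [1.0, 2.0, 3.0])
```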
Having said that, this approach has the following "serious" issues:

- limited customization: `@generated` is a well-maintained but still limited interface for staged programming – it can only manipulate pre-inference IRs at best, and it can't customize the behavior of inference and/or optimization, nor tune codegen options. So, for example, IR manipulation can't make full use of inter-procedural information that may be necessary for certain code generation or optimization.
- limited inference: `@generated` functions can be inferred only when their argument types are known to be fully concrete (more specifically, when the used arguments are fully concrete) – this is a very strict restriction that real-world complex programs often can't satisfy – and it obviously leads to suboptimal runtime performance.
- compile-time latency: a custom compilation pass rewrites every generic function call in order to create a separate execution context, so it has to compile everything from ground zero and thus can't avoid a certain compile-time cost.

### `AbstractInterpreter` × LLVM ORC API

This approach is the only existing (at least published) approach to execute code generated from an external `AbstractInterpreter`.
The notable packages are:

- GPUCompiler.jl: compiles code inferred by its own `AbstractInterpreter`, supporting both GPU- and CPU-execution

Within this approach, GPUCompiler.jl's `AbstractInterpreter` manages its own, Julia-level and per-function code cache for inference and optimization (i.e. `CodeInstance`s associated with `MethodInstance`s), and then GPUCompiler.jl transforms those caches into an LLVM function and finally uses LLVM's ORC (On-Request-Compilation) API to execute it.
When transformed into an LLVM function, all code reachable from a given entry point is bundled into a single execution image[1].
This approach already solves many difficulties: GPUCompiler.jl's code cache is totally separated from the native code cache (since it doesn't rely on Julia's native code execution at all), and the `AbstractInterpreter` interface allows flexible tuning of inference and optimization behaviors suited for GPU computation[2].
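To make the `AbstractInterpreter` side of this concrete, here is a rough sketch of what defining an external interpreter looks like – `MyInterpreter` is a made-up name, this is not GPUCompiler.jl's actual code, and the exact set of required overloads varies across Julia versions:

```julia
using Core.Compiler: AbstractInterpreter, NativeInterpreter, InferenceParams, OptimizationParams

struct MyInterpreter <: AbstractInterpreter
    native::NativeInterpreter  # wrap the native interpreter and forward to it
end
MyInterpreter(world::UInt = Base.get_world_counter()) = MyInterpreter(NativeInterpreter(world))

# forward the basic queries to the wrapped native interpreter;
# a real implementation would return its own parameters and its own code cache here
Core.Compiler.InferenceParams(interp::MyInterpreter)     = InferenceParams(interp.native)
Core.Compiler.OptimizationParams(interp::MyInterpreter)  = OptimizationParams(interp.native)
Core.Compiler.get_world_counter(interp::MyInterpreter)   = Core.Compiler.get_world_counter(interp.native)
Core.Compiler.get_inference_cache(interp::MyInterpreter) = Core.Compiler.get_inference_cache(interp.native)
Core.Compiler.code_cache(interp::MyInterpreter)          = Core.Compiler.code_cache(interp.native)
```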
Still, there are some aspects that may not be desirable for our future compiler-plugin:

- it doesn't reuse Julia's native code execution infrastructure, and the usual reflection utilities like `code_typed` don't directly apply to the customized execution
- dynamic dispatch is not supported as-is[4]
- code caching is limited, since everything reachable from an entry point is bundled into a single image
- since there is no useful Julia-level information appearing in the final target[3], we may need additional effort to collect stacktraces accurately at runtime, etc.

Here are two different approaches for the compiler plugin infrastructure:

1. the `AbstractInterpreter` × LLVM ORC API approach
2. `AbstractInterpreter` delegation

I'd prefer the second approach as the first choice, but still welcome discussions.
### Why not the `AbstractInterpreter` × LLVM ORC API Approach?

So the LLVM-based approach has the following issues:

- it can't easily reuse Julia's native code execution infrastructure
- dynamic dispatch is not supported as-is
- code caching is limited
As for the dynamic dispatch handling, there are actually several ways to solve it, e.g. we can employ Cassette.jl-like context separation, or we can use an implicit codegen "context" hook. This problem will be revisited in the following section anyway, so let's leave it for now.
The main concern is the first one; in particular, not being able to easily reuse Julia's native code execution infrastructure is unfortunate – a GPU computation kernel can't run on Julia's native code execution engine in the first place, but there is no reason for CPU computation to follow that restriction.
As for the limited code caching problem, we could naively solve it by generating and caching LLVM functions per `CodeInstance` – but that's effectively what the native codegen already does.
TL;DR: the idea of transforming Julia-level code caches into an LLVM function is oriented toward GPU computation (especially CUDA), and we want to find an "easier" way to support customized code execution.
### `AbstractInterpreter` Delegation

The idea consists of two parts:

1. a built-in `AbstractCompilerPlugin` type with hooks into the native compilation pipeline
2. delegation of `typeinf` to an external `AbstractInterpreter`
Part 1 enables built-in support for Cassette.jl-like customized compilation (i.e. customization via IR manipulation); such plugins don't need to interact with the `AbstractInterpreter` interface at all.
When a plugin needs more advanced customization, e.g. tuning of inference/optimization/codegen parameters, its own optimization passes, etc., the `AbstractInterpreter` interface comes in.
We will implement hooks to delegate inference to an external `AbstractInterpreter`, which should also sync with part 1 to enable code cache separation and dynamic dispatch.
The idea is to have a special built-in abstract type that enables multi-staged programming through pre-defined hooks, which can be invoked at each "interesting" point of the native compilation pipeline whenever the first argument type of an inferred method is a subtype of it.
Let's call this special abstract type `Core.AbstractCompilerPlugin` (it will be defined in `Core` since it needs to be built-in) in the rest of this document.
There will be two different kinds of hook:

1. a "context hook", which forms and propagates the `AbstractCompilerPlugin` context (i.e. decides whether a callee keeps being compiled within the plugin context)
2. transformation hooks, invoked at the "interesting" points of the native compilation pipeline, e.g.:
   - `retrieve_code_info` (pre-inference)
   - `finish(::InferenceState, ::AbstractInterpreter)` (post-inference)
   - `optimize`
   - `finish(::AbstractInterpreter, ::OptimizationState, ::OptimizationParams, ::IRCode, ::Any)` (post-optimization)
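As a rough illustration (the names `context_hook` and `preopt_hook` are placeholders matching the prototype sketch further below, not a finalized API), the two kinds of hook could look like:

```julia
# "context hook": given the signature of an edge call, decide whether the plugin context
# should keep applying to it; returning `nothing` bails out of the plugin context
context_hook(plugin::AbstractCompilerPlugin, @nospecialize(sigt)) = plugin

# transformation hook invoked right before optimization, on inferred code;
# the default is the identity transformation
preopt_hook(plugin::AbstractCompilerPlugin, inferred) = inferred
```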
Here is an overview of how the compiler plugin mechanism will work:

1. when the first argument type of an inferred call signature is a subtype of `AbstractCompilerPlugin`, before getting a `MethodInstance`, compose a plugin context from the inherited context and the context specified by that first argument, and then form a `MethodInstance` whose first argument is the composed plugin context
2. `retrieve_code_info` will try to retrieve the source for the original method
3. the transformation hooks are applied at each point of the compilation pipeline
4. the context hook forms the plugin context for each edge frame

Note that this way enables us to keep the customized code cache separated from the native one and to keep supporting dynamic dispatch within a plugin context.
Code might make it clearer. Here is a sketch of the prototype:
```julia
# NOTE: assume all `AbstractCompilerPlugin`s are singletons
struct PluginContext{Ctxs<:Tuple{Vararg{AbstractCompilerPlugin}}}
    ctxs::Ctxs
end

# the initial (empty) context
const DEFAULT_CONTEXT = PluginContext(())

# step 1. form a plugin context for the incoming inference, and get a `MethodInstance` with it
compose_context(::PluginContext{Tuple{}}, ctx::AbstractCompilerPlugin) = PluginContext((ctx,))
compose_context(inherit::PluginContext, ctx::AbstractCompilerPlugin) = PluginContext((inherit.ctxs..., ctx))

function typeinf_edge(interp::AbstractInterpreter, method::Method, @nospecialize(atypes), sparams::SimpleVector, caller::InferenceState)
    maybe_plugin_ctx = atypes.parameters[1]
    if maybe_plugin_ctx <: AbstractCompilerPlugin
        # compose the context inherited from the caller frame (simplified here to the initial context)
        # with the plugin context specified by the first argument
        plugin_ctx = compose_context(DEFAULT_CONTEXT, maybe_plugin_ctx.instance)
        atypes = Tuple{typeof(plugin_ctx), atypes.parameters[2:end]...}
    end
    mi = specialize_method(method, atypes, sparams)::MethodInstance
    ...
end

# step 2. `retrieve_code_info` will just try to retrieve the source for the original method

# step 3. apply transformations
# e.g. apply pre-optimization hooks within `optimize`
function apply_preopt_hook(ctx::PluginContext, unopted::IRCode)
    foldl(ctx.ctxs; init=unopted) do unopted, ctx
        return preopt_hook(ctx, unopted)
    end
end

# step 4. form the plugin context for the edge frame
function apply_context_hook(ctx::PluginContext, sigt)
    foldl(ctx.ctxs; init=()) do ctxs, ctx
        newctx = context_hook(ctx, sigt)
        if isnothing(newctx)
            return ctxs
        end
        return (ctxs..., newctx)
    end |> PluginContext
end
```
A sample implementation of a `sin`-to-`fast_sin` rewriter plugin:
```julia
struct FastSinPlugin <: Core.AbstractCompilerPlugin end

# always recurse into callees within the plugin context ...
context_hook(::FastSinPlugin, ::Type{<:Tuple{Vararg{Any}}}) = FastSinPlugin()
# ... except when inferring `fast_sin` itself
context_hook(::FastSinPlugin, ::Type{Tuple{typeof(fast_sin),Float64}}) = nothing

function preopt_hook(::FastSinPlugin, inferred::CodeInfo)
    ... # rewrite calls of `sin(::Float64)` into calls of `fast_sin(::Float64)`
end

# entry point
entry(f, args...) = FastSinPlugin()(f, args...)

entry(10) do x
    sin(x)
    get_sin()(x)
    Base.inferencebarrier(sin)(x)
    ...
end
```
### Why not `@generated`?

- limited customization: we could say that `@generated` is a well-maintained but still limited interface for staged programming – it can only manipulate pre-inference IRs at best, and it can't customize the behavior of inference and/or optimization, nor tune codegen options. So, for example, IR manipulation can't make full use of inter-procedural information that may be necessary for certain code generation or optimization.

We will be able to hook into more stages than with `@generated` (which can only hook into the pre-inference stage). So, for example, we can manipulate IRs using type information derived by inference by hooking into the optimization stage.
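As a small illustration of the extra information available at the later stages (this just uses standard reflection, not the proposed hooks; `f` is a made-up example):

```julia
f(x) = sin(x) + one(x)

# pre-inference view: untyped IR with generic calls to `sin`, `one` and `+`
code_lowered(f, (Float64,))

# post-inference view: calls are resolved (and possibly inlined) and every statement
# carries a type like `Float64` – exactly the kind of information an optimization-stage
# hook could exploit, but a pre-inference `@generated` body never sees
code_typed(f, (Float64,))
```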
Part 2 (i.e. the `AbstractInterpreter` integration) combined with this design will enable further customization, including customized inference and codegen options, etc.
- limited inference: in order to preserve the soundness of inference, `@generated` functions can be inferred only when their argument types are known to be fully concrete (more specifically, when the used arguments are fully concrete) – this is a very strict restriction that real-world complex programs often can't satisfy – and it obviously leads to suboptimal runtime performance.
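To see the restriction in action, consider this minimal (made-up) example – when the argument type isn't concrete at the call site, inference can't expand the generator and the call is inferred as `Any`:

```julia
@generated function getsecond(x)
    # at staging time `x` is bound to the argument's *type*; the returned expression becomes the body
    return :(x[2])
end

concrete_caller(t::Tuple{Int,Float64}) = getsecond(t)      # generator expanded for the concrete type
abstract_caller(xs::Vector{Any})       = getsecond(xs[1])   # argument type is `Any`, generator can't run

code_typed(concrete_caller, (Tuple{Int,Float64},))  # inferred return type: Float64
code_typed(abstract_caller, (Vector{Any},))         # inferred return type: Any
```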
The core idea of this part is that we can circumvent the problem of `@generated` by decoupling the staging interface from dispatch; rather, code transformations are applied whenever a plugin context is given.
The key observation here is that the inference limitation of `@generated` stems from the fact that it is too tightly integrated with the dispatch system, where the generator itself needs to be dispatched.
With the proposed compiler plugin design, IR manipulations can happen regardless of whether an inferred frame signature is a leaf type or not, and thus we can expect the customized compilation to infer types as well as the native inference does. In other words, a compiler plugin will be able to manipulate IRs in the same way Julia's native compiler can.
While Julia has to invoke `@generated` generators with leaf types – since a generator itself is just a function, and it is Julia's semantics to always call a function with leaf types – with this compiler plugin the Julia compiler is only responsible for invoking the hooks with a plugin context, no matter whether the frame signature is a leaf type or not, and it is the plugin's responsibility to preserve its own semantics in the presence of uninferred code.
- compile-time latency: a custom compilation pass rewrites every generic function call in order to create a separate execution context, so it has to compile everything from ground zero and thus can't avoid a certain compile-time cost.
Similarly to above, the "context hook" will be invoked even within uninferred frame, and so a plugin can bail out from its own plugin context more freely that with @generated
. Again, it's a plugin's responsibility to reduce compile-time cost while preserving its semantics by "safely" bailing out from its context.
FIXME: After discussing this part with @vtjnash, I concluded the best approach for this issue would be to allow plugins to directly customize inference heuristics, rather than offering this limited interface to cut down latency by bailing out. In this view, this part will essentially be postponed to part 2, i.e. the integration with `AbstractInterpreter`, where we can directly intervene in inference.
With this compiler plugin interface, we can fully implement Keno's example target plugin: the `sin`-to-`fast_sin` rewriter.
We can also rewrite Cassette.jl's `println` slicer as a compiler plugin.
### Part 2: `AbstractInterpreter` Delegation

If a plugin wants to tweak inference/optimization options, or implement its own abstract interpretation routine and/or optimization pass, it needs to use the `AbstractInterpreter` interface.
Switching to an external `AbstractInterpreter` can also be done by adding another hook on `AbstractCompilerPlugin`.
We will set the hooks at appropriate inference entries, and each hook will decide which `AbstractInterpreter` to use.
```julia
# by default, don't switch the interpreter
get_interpreter(@nospecialize(::Any), interp::AbstractInterpreter) = interp

# a plugin can opt in to provide a custom interpreter
struct XInterpreterPlugin <: AbstractCompilerPlugin end
struct XInterpreter <: AbstractInterpreter end
# ... customizations of inference and/or optimization
get_interpreter(::Type{XInterpreterPlugin}, interp::AbstractInterpreter) = XInterpreter(interp, ...)
get_interpreter(::Type{XInterpreterPlugin}, interp::XInterpreter) = interp
```
Probably we want to have an API to allow global delegation.
```julia
# insert an implicit first argument `plugin` into every top-level call in the succeeding program execution
set_plugin!(plugin::AbstractCompilerPlugin)
```
Some `AbstractInterpreter`s may want cache separation and others may not:

- if `MyNativeInterpreter` just wants to enable `aggressive_constant_prop` globally, it's arguably okay to just share the cache with `NativeInterpreter`
- `AutoDiffInterpreter` should definitely separate its cache from the other `AbstractInterpreter`s'

It's up to each `AbstractInterpreter` whether to separate its cache from the native code cache or not, but if it does, it should use the part 1 approach.
This is because, under Julia's current native code execution infrastructure, code can be inserted into the native cache at many different points of the codegen pipeline, and that machinery is not parameterized by `AbstractInterpreter` – so managing only `code_cache(::AbstractInterpreter)` is really not enough to enable cache separation.
So such an interpreter should pass its own context throughout the entire call graph and avoid letting its code cache (in other words, its customized program semantics) leak into the native one.
### Composability of `AbstractInterpreter`s

We need to think about what kinds of "contracts" are enough to express and restrict the composition of `AbstractInterpreter`s.
- the plugin interface may need to live within `Core` – then we should maybe call it the `Core.Compiler.Plugin` module rather than CompilerPlugin.jl
- bridges between `Core.Compiler` and `Base`
- refactoring of `Core.Compiler` data structures

It's quite common for a package to implement its own bridges between `Base` and `Core.Compiler`, and this can produce (pre)compilation problems when multiple such packages are used in the same project (since they may try to overload the same methods).
So essentially this part will just do something very similar to what this PR tried to do:
```julia
Base.iterate(x::Core.Compiler.BitSet) = Core.Compiler.iterate(x)
Base.length(matches::Core.Compiler.MethodLookupResult) = Core.Compiler.length(matches)
...
```
And probably we also want to define `missing` within `Core`, so that we don't need to distinguish between `Core.Compiler.missing` and `Base.missing`.
Maybe there is no "urgent" need for this, but this should still be valuable.
This part may also include some refactoring of `Core.Compiler` itself.
Here are some (random) ideas of library functionalities:

- a unified way to work with `CodeInfo` and `IRCode`?
- make `argextype` work for typed lowered code as well?

Probably we want to set up a repository named CompilerTools.jl under the JuliaLang organization.
Since there is a somewhat urgent need for part 1, we want to implement it soon and actually make use of it in existing packages like JET.jl and Mixtape.jl. As for part 2, a slower and more gradual approach would be appropriate – maybe just collect ideas like the above in GitHub issues at the initial stage, and then triage and implement them one by one?
Valentin said the background for this approach is that the programming model of CUDA – the main user of GPUCompiler.jl – expects a single image containing all functions. ↩︎
Probably the most obvious example of this is this one – `GPUCompiler.GPUInterpreter <: AbstractInterpreter` intentionally disables a heuristic that eases compile-time latency for Julia's native code execution; GPU computation requires everything to be resolved at compile time, no matter whether a code block is dedicated to error handling. ↩︎
We can technically reconstruct Julia-level semantics from the code cache, but it would take some effort. ↩︎
This means CPU execution based on GPUCompiler.jl can break its own semantics as is. ↩︎