
Julia Compiler-Plugin Project

Julia's staged programming is quite powerful: it allows arbitrary transformations to be applied to a whole call graph, and those transformations compose just as functions do. On top of @generated function, the main entry point of Julia's staged programming, Cassette.jl and IRTools.jl implement higher-level interfaces for code transformations at staging time, and they are widely used in the community for advanced computations such as automatic differentiation.
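
To make the staging model concrete, here is a minimal, self-contained illustration (not taken from any of the packages above): the body of a @generated function runs at code-generation time and sees only the argument types, so it can build specialized code per type.

# a minimal `@generated` function: the body runs at staging time and sees only the
# argument *types*, so it can build a specialized expression per type
@generated function tuple_sum(t::Tuple)
    n = length(t.parameters) # number of elements, known from the type
    # build `t[1] + t[2] + ... + t[n]` at staging time (assumes a non-empty tuple)
    return Expr(:call, :+, (:(t[$i]) for i in 1:n)...)
end

tuple_sum((1, 2.0, 3)) # == 6.0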

Still, there are downsides to the current staged programming infrastructure based on @generated function. We will discuss them in detail later, but briefly, with @generated function we end up facing 1.) limited stages, 2.) looser inference, and 3.) high compilation cost.

The purpose of this project is to provide a new, unrestricted, and more extensive staged programming interface for Julia.
With such an interface, we will be able to hook into stages of Julia's compilation pipeline beyond the pre-inference stage, freely use type information derived from inference, and intervene in the optimization phase. Even more, combined with the AbstractInterpreter interface, we will be able to customize any part of the compilation pipeline.

Along the way, we will also find common utilities for working on Core.Compiler, and will want to pack them into a library. We may also want to refactor some of the Core.Compiler data structures so that we can manipulate them more easily. We may end up wanting some "composability" of different AbstractInterpreters, which will enable yet more complex but much fancier program analysis and code generation.

So this project will consist of two main sub-projects (N.B. the names are just tentative; we will find better names as the project goes on and things get sorted out):

  • CompilerPlugin.jl: provide high-level interface to create a "compiler plugin"
  • CompilerTools.jl: provide common utilities for working on Core.Compiler

The final goal of this project is to make them "first-class" libraries, and eventually publish them as standard ones.


CompilerPlugin.jl

Scope

  • provide high-level interfaces to generate and execute code with customized compilation behavior
  • provide some interfaces to compose different "compiler-plugin"s

State of The Art

It would be helpful to understand the state of the art of customized code execution techniques, and recognize the problems we would like to solve with our compiler-plugin project.

The existing approaches would fall into the following two categories:

@generated Function Based Approach

Julia offers first-class support for staged programming via the @generated function interface, and we can exploit it to implement a customized compilation pass. There are several package implementations following this approach, e.g. Cassette.jl and IRTools.jl, and they share the same key idea: use @generated function to transform generic function calls throughout the entire call graph into "overdubbing" versions, whose IRs are manipulated by an external transformation pass.

Since @generated function is fully integrated with Julia's dispatch system, dynamic dispatch is inherently supported. "Overdubbing" is essentially an invocation of a special and dedicated method, which is completely different from the original "overdubbed" method, and so the customized code execution is naturally separated from execution in other contexts.
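
To make the shape of this approach concrete, here is a heavily simplified, dispatch-only sketch; the real packages perform the call replacement by rewriting lowered IR inside an @generated function, and the names below (overdub, Context, etc. in this toy form) are only for illustration:

# toy sketch of the "overdub" idea, without the IR-rewriting part: every call
# `f(args...)` inside overdubbed code is turned into `overdub(ctx, f, args...)`
abstract type Context end
struct SinIsFastSin <: Context end

fast_sin(x::Float64) = sin(x) # stand-in for a faster `sin`

# default: call through (a real implementation would instead recurse into `f`'s
# rewritten IR so that nested calls are also overdubbed)
overdub(::Context, f, args...) = f(args...)
# interception point for this particular context
overdub(::SinIsFastSin, ::typeof(sin), x::Float64) = fast_sin(x)

overdub(SinIsFastSin(), sin, 1.0) # calls `fast_sin` instead of `sin`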

Having said that, this approach has the following "serious" issues:

  1. limited customization: @generated is a well-maintained but still limited interface for staged programming; at best it can only manipulate pre-inference IRs, and it can't customize the behavior of inference and/or optimization, nor tune codegen options. So, for example, IR manipulation can't make full use of inter-procedural information that may be necessary for certain code generation or optimization.
  2. limited inference: in order to preserve the soundness of inference, @generated functions can be inferred only when their argument types are known to be fully concrete (more specifically, when the used arguments are fully concrete). This is actually a very strict restriction: real-world complex programs often can't satisfy this criterion, and that obviously leads to suboptimal runtime performance (see the small illustration after this list).
  3. compile-time latency: a custom compilation pass rewrites every generic function call in order to create a separate execution context, so it has to compile everything from ground zero and thus can't avoid a certain compile-time cost.
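
As a small, runnable illustration of issue 2 (the function names here are made up): the generator can only be expanded when the used argument types are concrete, so inference has to give up on such calls otherwise.

# issue 2 illustrated: the generator body can only be expanded when the argument
# types are concrete, so inference gives up on `gen(x)` when `x` is abstract
@generated function gen(x)
    return x <: Integer ? :(x + 1) : :(float(x))
end

code_typed(gen, (Int,))   # the generator runs with `x = Int`; the call is fully inferred
code_typed((Any,)) do x
    gen(x)                # the generator can't run for abstract `x`; the return type is `Any`
end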

AbstractInterpreter × LLVM ORC API

This approach is the only existing (at least published) approach to execute code generated from an external AbstractInterpreter. The notable packages are:

  • GPUCompiler.jl: implements the infrastructure to execute code generated by an AbstractInterpreter, supporting both GPU and CPU execution
  • Mixtape.jl: provides a Cassette.jl-like high-level interface to write a customized compilation pass on top of the GPUCompiler.jl infrastructure

Within this approach, GPUCompiler.jl's AbstractInterpreter manages its own, Julia-level and per-function code cache for inference and optimization (i.e. CodeInstances associated with MethodInstances); GPUCompiler.jl then transforms the caches into an LLVM function and finally uses LLVM's ORC (On-Request Compilation) API to execute it. When transformed to an LLVM function, all code reachable from a given entry point is bundled into a single execution image[1].

This approach already solves many difficulties: GPUCompiler.jl's code cache is totally separated from the native code cache (since it doesn't rely on Julia's native code execution at all), and the AbstractInterpreter interface allows flexible tuning of inference and optimization behaviors suited for GPU computation[2].

Still there are some aspects that may not be desired for our future compiler-plugin:

  • direct execution via LLVM: this approach targets an LLVM function, which will be called externally, and it involves the following problems:
    • no reuse of Julia's code execution infrastructure: this means we can't easily debug inference issues with code_typed since no useful Julia-level information appears in the final target[3], we may need additional effort to collect stacktraces accurately at runtime, etc.
    • limited code cache: since there is only a single executable code image per entry point, we have to re-compile an executable image for every different entry point and on any method invalidation, which might not be desirable since latency is an important aspect when working in a dynamic computation environment like the REPL.
  • support for dynamic dispatch: dynamic dispatch handling seems to have very low priority in the context of GPU computation, and accordingly GPUCompiler.jl doesn't provide complete support for it at the moment of writing; it just bails out of the execution context when encountering dynamic dispatch.[4]

Prototype Design

Here are two different approaches for the compiler plugin infrastructure:

  • Enhanced AbstractInterpreter × LLVM ORC API Approach
  • Built-in IR-rewrite hooks × AbstractInterpreter delegation

I prefer the second approach as the first choice, but discussions are still welcome.

Enhanced AbstractInterpreter × LLVM ORC API Approach ?

So the LLVM-based approach has the following issues:

  • direct execution via LLVM
  • support for dynamic dispatch

As for dynamic dispatch handling, there are actually several ways to solve it, e.g. we can employ Cassette.jl-like context separation, or we can use an implicit codegen "context" hook. This problem will be revisited in the following section anyway, so let's leave it for now.

The main concern is the first one; in particular, not being able to easily reuse Julia's native code execution infrastructure is unfortunate. A GPU computation kernel can't run on Julia's native code execution engine in the first place, but there is no reason for CPU computation to follow that restriction. As for the limited code caching problem, we could naively solve it by generating and caching LLVM functions per CodeInstance, but that is effectively what the native codegen already does.

TL;DR: the idea of transforming Julia-level code caches into an LLVM function is oriented toward GPU computation (especially CUDA), and we want to find an "easier" way to support customized code execution.

Built-in IR-Rewrite Hooks × AbstractInterpreter Delegation

The idea consists of two parts:

  1. provide hooks to implement Cassette.jl-like code rewrite
  2. implement a hook to delegate typeinf to an external AbstractInterpreter

Part 1 enables built-in support for Cassette.jl-like customized compilation (i.e. customization via IR manipulation); such plugins don't need to interact with the AbstractInterpreter interface at all.

When a plugin needs more advanced customization, e.g. tuning of inference/optimization/codegen parameters, its own optimization passes, etc., then the AbstractInterpreter interface comes in. We will implement some hooks to delegate inference to an external AbstractInterpreter, which should also sync with part 1 to enable code cache separation and dynamic dispatch.

Part 1: Cassette.jl-like Code Rewrite

The idea is to have a special built-in abstract type that enables multi-stage programming through pre-defined hooks invoked at each "interesting" point of the native compilation pipeline, whenever the first argument type of an inferred method is a subtype of it.

Let's call this special abstract type Core.AbstractCompilerPlugin in the rest of this document (it will be defined in Core since it needs to be built-in).

There will be two different kinds of hooks (a sketch of their tentative signatures follows this list):

  • context hook: decides whether inference should recurse into edge inference with the same AbstractCompilerPlugin context
  • rewrite hooks: take the IR at each "interesting" point of compilation and return a modified IR, e.g.
    • pre-inference: within retrieve_code_info
    • post-inference: within finish(::InferenceState, ::AbstractInterpreter)
    • pre-optimization: within optimize
    • post-optimization: within finish(::AbstractInterpreter, ::OptimizationState, ::OptimizationParams, ::IRCode, ::Any)
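
As a sketch, the hooks could be declared with default definitions along the following lines; all names are tentative and only meant to mirror the prototype code below, the defaults shown here (bail out of the plugin context, leave the code unchanged) are just one possible choice, and whether each rewrite hook receives a CodeInfo or an IRCode at a given stage is still an open detail.

# context hook: return the context to propagate into a call edge, or `nothing` to bail out
context_hook(ctx::AbstractCompilerPlugin, @nospecialize(sigt)) = nothing

# rewrite hooks: take the IR at the given stage and return it, possibly modified
preinf_hook(ctx::AbstractCompilerPlugin, src) = src   # within `retrieve_code_info`
postinf_hook(ctx::AbstractCompilerPlugin, src) = src  # within `finish(::InferenceState, ::AbstractInterpreter)`
preopt_hook(ctx::AbstractCompilerPlugin, ir) = ir     # within `optimize`
postopt_hook(ctx::AbstractCompilerPlugin, ir) = ir    # within the post-optimization `finish`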

Here is an overview of how a compiler plugin mechanism will work:

  1. when the first argument type is a subtype of AbstractCompilerPlugin, before getting the MethodInstance, compose a plugin context from the inherited context and the context specified by the first argument, and then form a MethodInstance whose first argument is the composed plugin context
  2. retrieve_code_info will try to retrieve the source for the original method
  3. at each stage of the compilation of the method instance, apply the "rewrite hooks"
  4. when inference recurses into a call edge, the "context hook" specifies a context that the edge inference will inherit, and we go back to step 1.

Note that this design:

  • naturally gives us composability of different plugins
  • keeps the code cache separated from that of the native execution

Code might make it clearer. Here is a sketch of the prototype:

# NOTE: assume all AbstractCompilerPlugin subtypes are singletons
struct PluginContext{Ctxs<:Tuple{Vararg{AbstractCompilerPlugin}}}
    ctxs::Ctxs
end

# the initial context
const DEFAULT_CONTEXT = PluginContext(())

# step 1. form a plugin context for the incoming inference, and get a `MethodInstance` with it
compose_context(::PluginContext{Tuple{}}, ctx::AbstractCompilerPlugin) = PluginContext((ctx,))
compose_context(inherit::PluginContext, ctx::AbstractCompilerPlugin) = PluginContext((inherit.ctxs..., ctx))

function typeinf_edge(interp::AbstractInterpreter, method::Method, @nospecialize(atypes), sparams::SimpleVector, caller::InferenceState)
    maybe_plugin_ctx = atypes.parameters[1]
    if maybe_plugin_ctx <: AbstractCompilerPlugin
        # (sketch) compose the context inherited from the caller frame with the incoming one
        plugin_ctx = compose_context(plugin_ctx, maybe_plugin_ctx)
        atypes = Tuple{typeof(plugin_ctx), atypes.parameters[2:end]...}
    end
    mi = specialize_method(method, atypes, sparams)::MethodInstance
    ...
end

# step 2. `retrieve_code_info` will just try to retrieve the source for the original method

# step 3. apply transformations
# e.g. apply pre-optimization hooks within `optimize`
function apply_preopt_hook(ctx::PluginContext, unopted::IRCode)
    foldl(ctx.ctxs; init=unopted) do unopted, ctx
        return preopt_hook(ctx, unopted)
    end
end

# step 4. form the plugin context for the edge frame
function apply_context_hook(ctx::PluginContext, sigt)
    foldl(ctx.ctxs; init=()) do ctxs, ctx
        newctx = context_hook(ctx, sigt)
        if isnothing(newctx)
            return ctxs
        end
        return (ctxs..., newctx)
    end |> PluginContext
end
API Look

A sample implementation of a sin to fast_sin rewriter plugin:

struct FastSinPlugin <: Core.AbstractCompilerPlugin end

# always recurse with the same context, except ...
context_hook(::FastSinPlugin, ::Type{<:Tuple{Vararg{Any}}}) = return FastSinPlugin()
# ... when inferring `fast_sin` itself
context_hook(::FastSinPlugin, ::Type{Tuple{typeof(fast_sin),Float64}}) = return nothing

function preopt_hook(::FastSinPlugin, inferred::CodeInfo)
    ... # rewrite calls of `sin(::Float64)` into calls of `fast_sin(::Float64)`
end

# entry
entry(f, args...) = FastSinPlugin()(f, args...)

entry(10) do x
    sin(x)
    get_sin()(x)
    Base.inferencebarrier(sin)(x)
    ...
end
How This Design Solves the Issues of @generated ?
  1. limited customization: @generated is a well-maintained but still limited interface for staged programming; at best it can only manipulate pre-inference IRs, and it can't customize the behavior of inference and/or optimization, nor tune codegen options. So, for example, IR manipulation can't make full use of inter-procedural information that may be necessary for certain code generation or optimization.

We will be able to hook into more stages than with @generated (where we can only hook into the pre-inference stage). So, for example, we can manipulate IRs using type information derived by inference by hooking into the optimization stage.
Part 2 (i.e. the AbstractInterpreter integration) combined with this design will enable further customization, including customized inference, codegen options, etc.

  2. limited inference: in order to preserve the soundness of inference, @generated functions can be inferred only when their argument types are known to be fully concrete (more specifically, when the used arguments are fully concrete). This is actually a very strict restriction: real-world complex programs often can't satisfy this criterion, and that obviously leads to suboptimal runtime performance.

The core idea of this part is that we can circumvent the problem of @generated by decoupling staging interfaces from dispatch: code transformation will instead be applied whenever a plugin context is given. The key observation here is that the inference limit of @generated stems from the fact that it is too tightly integrated with the dispatch system, where the generator itself needs to be dispatched.

With the proposed compiler plugin design, IR manipulations can happen regardless of whether an inferred frame signature is a leaf type or not, and thus we can expect the customized compilation to infer types as well as native inference does. In other words, a compiler plugin will be able to manipulate IRs in the same way Julia's native compiler can.

While Julia has to invoke @generated with leaf types (since a generator itself is just a function, and it's Julia's semantics to always call a function with leaf types), with this compiler plugin the Julia compiler is only responsible for invoking the hooks with a plugin context, no matter whether the frame signature is a leaf type or not; it's the plugin's responsibility to preserve its own semantics in the presence of uninferred code.

  3. compile-time latency: a custom compilation pass rewrites every generic function call in order to create a separate execution context, so it has to compile everything from ground zero and thus can't avoid a certain compile-time cost.

Similarly to the above, the "context hook" will be invoked even within an uninferred frame, so a plugin can bail out of its own plugin context more freely than with @generated. Again, it's the plugin's responsibility to reduce compile-time cost while preserving its semantics by "safely" bailing out of its context.

FIXME: After discussing this part with @vtjnash, I concluded the best approach for this issue would be to allow plugins to directly customize inference heuristics, rather than offering this limited interface to cut latency by bailing out. In this view, this part will essentially be postponed to part 2, i.e. the integration with AbstractInterpreter, where we can directly intervene in inference.

Plugin Examples

With this compiler plugin interface, we can fully implement Keno's example target plugin: sin to fast_sin rewriter.

We can also rewrite Cassette.jl's println slicer as a compiler plugin.
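
As a rough, hypothetical sketch of the latter (hook names as in Part 1; this only shows the general shape of such a rewrite rather than the exact semantics of the Cassette.jl example, and it ignores :invoke statements, keyword calls, and the dead-code cleanup a real pass would need):

struct PrintlnSlicerPlugin <: Core.AbstractCompilerPlugin end

# propagate the context into every callee so that nested `println` calls are also caught
context_hook(::PrintlnSlicerPlugin, @nospecialize(sigt)) = PrintlnSlicerPlugin()

function preopt_hook(::PrintlnSlicerPlugin, inferred::CodeInfo)
    for (i, stmt) in enumerate(inferred.code)
        Meta.isexpr(stmt, :call) || continue
        callee = stmt.args[1]
        # a real pass would resolve the callee with `argextype`; this only handles the
        # simple `GlobalRef` case
        if callee isa GlobalRef && callee.name === :println
            inferred.code[i] = nothing # `println` returns `nothing`, so this is a safe replacement
        end
    end
    return inferred
end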

Part 2: AbstractInterpreter Delegation

If a plugin wants to tweak inference/optimization options, or implement its own abstract interpretation routine and/or optimization pass, it needs to use the AbstractInterpreter interface.

Switching to an external AbstractInterpreter can also be done by making another hook on AbstractCompilerPlugin. We will set the hooks in appropriate inference entries and the hook will decide what AbstractInterpreter to use.

API Look
# by default, don't switch the interpreter
get_interpreter(@nospecialize(::Any), interp::AbstractInterpreter) = return interp

# a plugin can opt in to provide a custom interpreter
struct XInterpreterPlugin <: AbstractCompilerPlugin end
struct XInterpreter <: AbstractInterpreter end
# ... customizations of inference and/or optimization
get_interpreter(::Type{XInterpreterPlugin}, interp::AbstractInterpreter) = return XInterpreter(interp, ...)
get_interpreter(::Type{XInterpreterPlugin}, interp::XInterpreter) = return interp

Probably we want to have an API to allow global delegation.

# insert an implicit first argument `plugin` for every top-level call in succeeding program execution
set_plugin!(plugin::AbstractCompilerPlugin)
Cache Separation and Dynamic Dispatch

Some AbstractInterpreters may want cache separation and others may not:

  • if MyNativeInterpreter wants to enable aggressive_constant_prop globally, it's arguably okay to just share the cache with NativeInterpreter
  • AutoDiffInterpreter should definitely separate its cache from the other AbstractInterpreter's

It's up to each AbstractInterpreter whether to separate its cache from the native code cache, but if it does, it will use the part 1 approach. This is because, under Julia's current native code execution infrastructure, code can be inserted into the native cache at many different points of the codegen pipeline, and that pipeline is not parameterized by AbstractInterpreter, so managing only code_cache(::AbstractInterpreter) is really not enough to enable cache separation.

So it should pass its own context throughout the entire call graph and prevent its code cache (in other words, its customized program semantics) from leaking into the native one.
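
For reference, the Julia-level side of such cache separation conventionally follows the pattern sketched below (which, as noted above, is not sufficient on its own). The sketch assumes a hypothetical MyInterp interpreter carrying a cache and a world age, and it ignores the world-age validity checks a real cache must perform:

using Core: CodeInstance, MethodInstance
using Core.Compiler: AbstractInterpreter, WorldRange, WorldView

# per-interpreter Julia-level cache, keyed by `MethodInstance`
struct MyCache
    dict::IdDict{MethodInstance,CodeInstance}
end
MyCache() = MyCache(IdDict{MethodInstance,CodeInstance}())

struct MyInterp <: AbstractInterpreter
    cache::MyCache
    world::UInt
    # NOTE: a complete interpreter also needs `InferenceParams`/`OptimizationParams` state
    # and the other mandatory `AbstractInterpreter` overloads
end

# route inference's cache queries to our own cache instead of the native one
Core.Compiler.code_cache(interp::MyInterp) = WorldView(interp.cache, WorldRange(interp.world, interp.world))

Core.Compiler.haskey(wvc::WorldView{MyCache}, mi::MethodInstance) = haskey(wvc.cache.dict, mi)
Core.Compiler.get(wvc::WorldView{MyCache}, mi::MethodInstance, default) = get(wvc.cache.dict, mi, default)
Core.Compiler.getindex(wvc::WorldView{MyCache}, mi::MethodInstance) = getindex(wvc.cache.dict, mi)
Core.Compiler.setindex!(wvc::WorldView{MyCache}, ci::CodeInstance, mi::MethodInstance) =
    setindex!(wvc.cache.dict, ci, mi)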

TODO: Compose AbstractInterpreters

We need to think of what kind of "contracts" are enough to express and restrict AbstractInterpreters composition.

Next Actions

  • we need to decide what approach we'd like to take for CompilerPlugin.jl
  • if we're going to support built-in Cassette.jl-like IR transformation, this module will likely live inside Core; then we should probably call it the Core.Compiler.Plugin module rather than CompilerPlugin.jl
  • and the actual implementation will likely happen at JuliaLang/julia

CompilerTools.jl

Scope

  1. define and export "standardized" bridges between Core.Compiler and Base
  2. provide common utilities to inspect/manipulate Core.Compiler data structures

1. "Standardized" Bridges

Packages quite often implement their own bridges between Base and Core.Compiler, which can produce (pre)compilation problems when multiple such packages are used in the same project (since they may try to overload the same methods).

So essentially this part will just do something very similar to what this PR tried to do:

Base.iterate(x::Core.Compiler.BitSet) = Core.Compiler.iterate(x)
Base.length(matches::Core.Compiler.MethodLookupResult) = Core.Compiler.length(matches)
...

And probably we also want to define missing within Core, so that we don't need to distinguish Core.Compiler.missing and Base.missing.

2. Utilities for IR Manipulation

Maybe there is no "urgent" need for this, but this should still be valuable. This part may include some refactoring of Core.Compiler itself.

Here are some (random) ideas of library functionalities:

  • lowered code: some of what is implemented within MacroTools.jl and/or LoweredCodeUtils.jl might be better maintained as stdlib(s)
  • optimized IRs:
    • automatic numbering of SSA statements and basic block labels, easier basic block replacement, etc. are desirable
    • IRTools.jl is fancy, but not compatible with Julia's base IR
  • unified interface for lowered/optimized IRs ?
    • implement a unified interface for CodeInfo and IRCode ?
    • or even "re-merge" them ?
    • make argextype work for typed lowered code as well ?

Next Actions

Probably we want to set up a repository named CompilerTools.jl under the JuliaLang organization.

Since there is a sort of urgent need for part 1, we want to implement it soonish and actually make use of it in existing packages like JET.jl and Mixtape.jl. As for part 2, a slower and more gradual approach would be appropriate: maybe just collect ideas like the above in GitHub issues at the initial stage, and then triage and implement them one by one ?


  1. Valentin said the background for this approach is that the programming model of CUDA (the main user of GPUCompiler.jl) expects a single image containing all functions. ↩︎

  2. Probably the most obvious example of this is that GPUCompiler.GPUInterpreter <: AbstractInterpreter intentionally disables a heuristic that Julia's native code execution uses to ease compile-time latency; GPU computation requires everything to be resolved at compile time, no matter whether a code block is dedicated to error handling. ↩︎

  3. We can technically reconstruct Julia-level semantics from the code cache, but it would take some effort. ↩︎

  4. This means CPU execution based on GPUCompiler.jl can break its own semantics as is. ↩︎
