---
breaks: false
---
**NOTE**: some comments may look weird because HackMD seems unable to sync comments together with text movements ...
# Julia Compiler-Plugin Project
<!--{%hackmd theme-dark %}-->
Julia's staged programming is quite powerful; it allows arbitrary transformations to be applied to a whole call graph, and those transformations remain composable just as functions are.
On top of `@generated` functions, the main entry point of Julia's staged programming, [Cassette.jl](https://github.com/JuliaLabs/Cassette.jl) and [IRTools.jl](https://github.com/FluxML/IRTools.jl) implement higher-level interfaces for code transformations at staging time, and they are widely used in the community for advanced computations like automatic differentiation.
Still, there are downsides to the current staged programming infrastructure based on `@generated` functions.
We will discuss them in detail later, but to put it briefly, with `@generated` functions we end up facing 1.) _limited stages_, 2.) _looser inference_, and 3.) _high compilation cost_.
The purpose of this project is to provide a new, unlimited and more extensive staged programming interface for Julia. \
With such an interface, we will be able to hook into stages of Julia's compilation pipeline other than the pre-inference stage, freely use type information derived from inference, and intervene in the optimization phase.
Moreover, combined with the `AbstractInterpreter` interface, we will be able to customize any part of the compilation pipeline.
<!--
The problem is, currently there is no established way to actually execute code generated by an external `AbstractInterpreter`. We have a working example like [GPUCompiler.jl](https://github.com/JuliaGPU/GPUCompiler.jl), but we _may_ want to take a different approach than that -- we may want to avoid direct interactions with LLVM, but rather want to do everything on Julia-level and stay on Julia's native code execution infrastructure.
So in other word, we want more high-level, stable and well-documented interfaces for `AbstractInterpreter`, which allows us easily implement "compiler plugin"s, that take an ordinal Julia program and produce a runnable code but under their own program semantics, compilation strategy, and code execution.
-->
Along the way, we will also find common utilities for working on `Core.Compiler`, and will want to pack them into a library. We may also want to refactor some of the `Core.Compiler` data structures so that we can manipulate them more easily. We may end up wanting some "composability" of different `AbstractInterpreter`s, which would enable yet more complex but much fancier program analysis and code generation.
So this project will consist of two main sub-projects (N.B. the names are just tentative -- we will find better names as the project goes on and things get sorted out):
- **CompilerPlugin.jl**: provide high-level interface to create a "compiler plugin"
- **CompilerTools.jl**: provide common utilities for working on `Core.Compiler`
The final goal of this project is to make them "first-class" libraries, and eventually publish them as standard ones.
# Index
[toc]
# CompilerPlugin.jl
## Scope
- provide high-level interfaces to generate and execute code with customized compilation behavior
- provide some interfaces to compose different "compiler-plugin"s
## State of The Art
It would be helpful to understand the state of the art of customized code execution techniques, and recognize the problems we would like to solve with our compiler-plugin project.
The existing approaches would fall into the following two categories:
- `@generated` function based approach: e.g. [Cassette.jl](https://github.com/JuliaLabs/Cassette.jl)
- `AbstractInterpreter` × LLVM ORC API: e.g. [GPUCompiler.jl](https://github.com/JuliaGPU/GPUCompiler.jl)
### `@generated` Function Based Approach
Julia offers first-class support for staged programming via the `@generated` function interface, and we can exploit it to enable a customized compilation pass.
There are several package implementations following this approach, e.g. [Cassette.jl](https://github.com/JuliaLabs/Cassette.jl) and [IRTools.jl](https://github.com/FluxML/IRTools.jl), and they share the same key idea: use `@generated` functions to transform the generic function calls involved within the entire call graph into "overdubbing" versions, whose IRs are manipulated by an external transformation pass.
Since `@generated` functions are fully integrated with Julia's dispatch system, dynamic dispatch is inherently supported. "Overdubbing" is essentially an invocation of a special, dedicated method, which is completely different from the original "overdubbed" method, and so the customized code execution is cleanly separated from execution in other contexts.
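To make the approach concrete, here is roughly what this looks like from the user side with Cassette.jl's contextual-dispatch API (the `fast_sin` definition is just a placeholder for this sketch):
```julia=
using Cassette

fast_sin(x) = sin(x) # placeholder for an actual fast implementation

# define a context, and contextually dispatch `sin(::Float64)` to `fast_sin` within it
Cassette.@context FastSinCtx
Cassette.overdub(::FastSinCtx, ::typeof(sin), x::Float64) = fast_sin(x)

# every call reachable from this entry point is recursively overdubbed
Cassette.overdub(FastSinCtx(), x -> sin(x) + cos(x), 1.0)
```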
Having said that, this approach has the following "serious" issues:
1. _**limited customization**_: we could say that `@generated` is a well-maintained but still limited interface for staged programming -- it can only manipulate pre-inference IRs at best, and it can't customize the behavior of inference and/or optimization, nor tune codegen options. So, for example, IR manipulation can't make full use of inter-procedural information that may be necessary for certain code generation or optimization.
2. _**limited inference**_: in order to preserve the soundness of inference, `@generated` functions can be inferred only when their argument types are known to be fully concrete (more specifically, when the _used_ arguments are fully concrete) -- this is actually a very strict restriction that real-world complex programs often can't satisfy -- and obviously this leads to suboptimal runtime performance (see the sketch after this list)
3. _**compile-time latency**_: a custom compilation pass rewrites every generic function call in order to create a separate execution context, and so it has to compile everything from ground zero, and thus can't avoid a certain compile-time cost
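To illustrate point 2., here is a minimal, self-contained example (the names are made up for this sketch) of a generator whose expansion, and hence inference of its callers, requires concrete argument types:
```julia=
# a generator only sees the *types* of its arguments
@generated function static_sum(args...)
    if all(T -> T <: Number, args)
        return :(+(args...))
    else
        return :(error("unsupported argument types"))
    end
end

static_sum(1, 2.0) # the generator expands for the concrete signature (Int, Float64)

# with an abstract signature such as (Integer, Integer), inference cannot safely
# expand the generator, so a caller passing abstractly-typed arguments infers poorly
code_typed((a, b) -> static_sum(a, b), (Integer, Integer))
```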
### `AbstractInterpreter` × LLVM ORC API
This approach is the only existing (at least published) approach to execute code generated from an external `AbstractInterpreter`.
The notable packages are:
- [GPUCompiler.jl](https://github.com/JuliaGPU/GPUCompiler.jl): implements the infrastructure to execute code generated by an `AbstractInterpreter`, supporting both GPU and CPU execution
- [Mixtape.jl](https://github.com/JuliaCompilerPlugins/Mixtape.jl): provides a Cassette.jl-like high-level interface to write a customized compilation pass on top of the GPUCompiler.jl infrastructure
Within this approach, GPUCompiler.jl's `AbstractInterpreter` manages its own, Julia-level, per-function code cache for inference and optimization (i.e. `CodeInstance`s associated with `MethodInstance`s), and then GPUCompiler.jl transforms those caches into an LLVM function and finally uses LLVM's ORC (On-Request Compilation) API to execute it.
When transformed into an LLVM function, all code reachable from a given entry point is bundled into a single execution image[^1].
[^1]: Valentin said the background for this approach is that the programming model of CUDA -- the main user of GPUCompiler.jl -- expects a single image containing all functions.
This approach already solves many difficulties: GPUCompiler.jl's code cache is totally separated from the native code cache (since it doesn't rely on Julia's native code execution at all), and the `AbstractInterpreter` interface allows flexible tuning of inference and optimization behaviors suited for GPU computation[^2].
[^2]: Probably the most obvious example of this is [this one](https://github.com/JuliaGPU/GPUCompiler.jl/blob/50725e007dc64ddeb1855d3d9194f85ca66858e8/src/jlgen.jl#L198-L199) -- `GPUCompiler.GPUInterpreter <: AbstractInterpreter` intentionally disables [the heuristic](https://github.com/JuliaLang/julia/pull/35982) that eases compile-time latency for Julia's native code execution; GPU computation requires everything to be resolved at compile time, no matter whether the code block is dedicated to error handling.
Still there are some aspects that may not be desired for our future compiler-plugin:
- _**direct execution via LLVM**_: this approach targets an LLVM function, which will be called externally, and it involves the following problems:
    - _no reuse of Julia's code execution infrastructure_: this means we can't easily debug inference issues with `code_typed` since there is no useful Julia-level information appearing in the final target[^2.5], we may need additional effort to collect stacktraces accurately at runtime, etc.
    - _limited code cache_: since there is only a single executable image per entry point, we have to re-compile the image for every different entry point and on any method invalidation -- which might not be desirable since latency is an important aspect when working in a dynamic computation environment like the REPL.
- _**support for dynamic dispatch**_: dynamic dispatch handling seems to have very low priority in the context of GPU computation, and accordingly GPUCompiler.jl doesn't provide complete support for it at the time of writing; it will just bail out of the execution context when encountering dynamic dispatch.[^3]
[^2.5]: We can technically reconstruct Julia-level semantics from the code cache, but that would take some effort.
[^3]: This means CPU execution based on GPUCompiler.jl can break its own semantics as is.
## Prototype Design
Here are two different approaches for the compiler plugin infrastructure:
- Enhanced `AbstractInterpreter` × LLVM ORC API Approach
- Built-in IR-rewrite hooks × `AbstractInterpreter` delegation
I prefer the second approach as the first choice, but discussions are still welcome.
### Enhanced `AbstractInterpreter` × LLVM ORC API Approach ?
So the LLVM-based approach has the following issues:
- _direct execution via LLVM_
- _support for dynamic dispatch_
As for dynamic dispatch handling, there are actually several ways to solve it, e.g. we can employ Cassette.jl-like context separation, or we can use the [implicit codegen "context" hook](https://github.com/JuliaLang/julia/pull/36398). This problem will be revisited in the following section anyway, so let's leave it for now.
The main concern is the first one; in particular, not being able to easily reuse Julia's native code execution infrastructure is unfortunate -- a GPU computation kernel can't run on Julia's native code execution engine in the first place, but there is no reason for CPU computation to inherit that restriction.
As for the limited code caching problem, we could naively solve it by generating and caching LLVM functions per `CodeInstance` -- but that is effectively what the native codegen already does.
TL;DR: the idea of transforming Julia-level code caches into an LLVM function is oriented toward GPU computation (especially CUDA), and we want to find an "easier" way to support customized code execution.
### Built-in IR-Rewrite Hooks × `AbstractInterpreter` Delegation
The idea consists of two parts:
1. provide hooks to implement Cassette.jl-like code rewrites
2. implement a hook to delegate `typeinf` to an external `AbstractInterpreter`
Part 1 enables built-in support for Cassette.jl-like customized compilation (i.e. customization via IR manipulation), and such plugins don't need to interact with the `AbstractInterpreter` interface at all.
When a plugin needs more advanced customization, e.g. tuning of inference/optimization/codegen parameters, its own optimization passes, etc., the `AbstractInterpreter` interface comes in.
We will implement hooks to delegate inference to an external `AbstractInterpreter`, which should also sync with Part 1 to enable code cache separation and dynamic dispatch.
#### Part 1: Cassette.jl-like Code Rewrite
The idea is to have a special built-in abstract type which enables multi-stage programming through pre-defined hooks that can be invoked at each "interesting" point of the native compilation pipeline whenever the first argument type of an inferred method is a subtype of it.
In the rest of this document, let's call this special abstract type `Core.AbstractCompilerPlugin` (it will be defined in `Core` because it needs to be built-in).
There will be two different kinds of hook:
- _**context hook**_: decides whether inference should recurse into edge inference with the same `AbstractCompilerPlugin` context
- _**rewrite hooks**_: take the IR at each "interesting" point of compilation and return a modified IR (a sketch of default hook definitions follows this list), e.g.
- pre-inference: within `retrieve_code_info`
- post-inference: within `finish(::InferenceState, ::AbstractInterpreter)`
- pre-optimization: within `optimize`
- post-optimization: within `finish(::AbstractInterpreter, ::OptimizationState, ::OptimizationParams, ::IRCode, ::Any)`
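For concreteness, the default (identity) definitions of the rewrite hooks could look like the following sketch; the names `preinf_hook`, `postinf_hook` and `postopt_hook` are tentative placeholders (only `context_hook` and `preopt_hook` appear in the sketches below):
```julia=
# each rewrite hook takes the IR at the given point and returns it, possibly modified;
# by default every hook is the identity transformation
preinf_hook(::AbstractCompilerPlugin, src::CodeInfo)  = src # within `retrieve_code_info`
postinf_hook(::AbstractCompilerPlugin, src::CodeInfo) = src # within `finish(::InferenceState, ::AbstractInterpreter)`
preopt_hook(::AbstractCompilerPlugin, ir)             = ir  # within `optimize`
postopt_hook(::AbstractCompilerPlugin, ir::IRCode)    = ir  # within the post-optimization `finish`
```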
Here is an overview of how a compiler plugin mechanism will work:
1. when the first argument type is a subtype of `AbstractCompilerPlugin`, before getting the `MethodInstance`, compose a plugin context from the inherited context and the context specified by the first argument, and then form a `MethodInstance` whose first argument is the composed plugin context
2. `retrieve_code_info` will just try to retrieve the source for the original method
3. at each stage of the compilation of the method instance, apply the "rewrite hooks"
4. when inference recurses into a call edge, the "context hook" specifies a context that the edge inference will inherit, and we go back to step 1.
Note that this approach enables us to:
- naturally obtain composability of different plugins
- keep each plugin's code cache separated from that of the native execution
Code might make it clearer. Here is a sketch of the prototype:
```julia=
# NOTE: assume every AbstractCompilerPlugin is a singleton
struct PluginContext{Ctxs<:Tuple{Vararg{AbstractCompilerPlugin}}}
ctxs::Ctxs
end
# the initial context
const DEFAULT_CONTEXT = PluginContext(())
# step 1. form a plugin context for the incoming inference, and get a `MethodInstance` with it
compose_context(::PluginContext{Tuple{}}, ctx::AbstractCompilerPlugin) = PluginContext((ctx,))
compose_context(inherit::PluginContext, ctx::AbstractCompilerPlugin) = PluginContext((inherit.ctxs..., ctx))
function typeinf_edge(interp::AbstractInterpreter, method::Method, @nospecialize(atypes), sparams::SimpleVector, caller::InferenceState)
    maybe_plugin_ctx = atypes.parameters[1]
    if maybe_plugin_ctx <: AbstractCompilerPlugin
        # compose the caller's (inherited) context with the one specified by the first argument;
        # `plugin_context` is a hypothetical accessor for the caller frame's plugin context
        inherited_ctx = plugin_context(caller) # `DEFAULT_CONTEXT` for an entry frame
        plugin_ctx = compose_context(inherited_ctx, maybe_plugin_ctx()) # plugins are singletons
        atypes = Tuple{typeof(plugin_ctx), atypes.parameters[2:end]...}
    end
mi = specialize_method(method, atypes, sparams)::MethodInstance
...
end
# step 2. `retrieve_code_info` will just try to retrieve the source for the original method
# step 3. apply transformations
# e.g. apply pre optimization hooks within `optimize`
function apply_preopt_hook(ctx::PluginContext, unopted::IRCode)
foldl(ctx.ctxs; init=unopted) do unopted, ctx
return preopt_hook(ctx, unopted)
end
end
# step 4. form the plugin context for the edge frame
function apply_context_hook(ctx::PluginContext, sigt)
foldl(ctx.ctxs; init=()) do ctxs, ctx
newctx = context_hook(ctx, sigt)
if isnothing(newctx)
return ctxs
end
return (ctxs..., newctx)
end |> PluginContext
end
```
##### API Look
A sample implementation of a `sin` to `fast_sin` rewriter plugin:
```julia=
struct FastSinPlugin <: Core.AbstractCompilerPlugin end
context_hook(::FastSinPlugin, ::Type{<:Tuple{Vararg{Any}}}) =
return FastSinPlugin() # always recurse except,
context_hook(::FastSinPlugin, ::Type{Tuple{typeof(fast_sin),Float64}}) =
return nothing # when inferring `fast_sin`
function preopt_hook(::FastSinPlugin, inferred::CodeInfo)
... # rewrite the call of `sin(::Float64)` with the call of `fast_sin(::Float64)`
end
# entry
entry(f, args...) = FastSinPlugin()(f, args...)
entry(10) do x
sin(x)
get_sin()(x)
Base.inferencebarrier(sin)(x)
...
end
```
##### How This Design Solves the Issues of `@generated` ?
> 1. _**limited customization**_: we could say that `@generated` is a well-maintained but still limited interface for staged programming -- it can only manipulate pre-inference IRs at best, and it can't customize the behavior of inference and/or optimization, nor tune codegen options. So, for example, IR manipulation can't make full use of inter-procedural information that may be necessary for certain code generation or optimization.
We will be able to hook into more stages than with `@generated` (which can only hook into the pre-inference stage). So, for example, we can manipulate IRs using type information derived by inference, by hooking into the optimization stage. \
Part 2 (i.e. the `AbstractInterpreter` integration) combined with this design will enable further customization, including customized inference, codegen options, etc.
> 2. _**limited inference**_: in order to preserve the soundness of inference, `@generated` functions can be inferred only when their argument types are known to be fully concrete (more specifically, when the _used_ arguments are fully concrete) -- this is actually a very strict restriction that real-world complex programs often can't satisfy -- and obviously this leads to suboptimal runtime performance
The core idea of this part is that we can circumvent the problem of `@generated` by decoupling the staging interface from dispatch. Instead, code transformation is applied whenever a plugin context is given.
The key observation here is that the inference limit of `@generated` stems from the fact that it is too tightly integrated with the dispatch system, where a generator itself needs to be dispatched.
With the proposed compiler plugin design, IR manipulation can happen regardless of whether an inferred frame signature is a leaf type or not, and thus we can expect the customized compilation to infer types as well as native inference does. In other words, a compiler plugin will be able to manipulate IRs in the same way as Julia's native compiler can.
While Julia has to invoke `@generated` with leaf types, since a generator itself is just a function and it's Julia's semantics to always call a function with leaf types, with this compiler plugin the Julia compiler is only responsible for invoking the hooks with a plugin context, no matter whether the frame signature is a leaf type or not, and it's the plugin's responsibility to preserve its own semantics in the presence of uninferred code.
> 3. _**compile-time latency**_: a custom compilation pass rewrites every generic function call in order to create a separate execution context, and so it has to compile everything from ground zero, and thus can't avoid a certain compile-time cost
Similarly to the above, the "context hook" will be invoked even within an uninferred frame, and so a plugin can bail out of its own plugin context more freely than with `@generated`. Again, it's a plugin's responsibility to reduce compile-time cost while preserving its semantics by "safely" bailing out of its context.
**FIXME**: After discussing this part with @vtjnash , I concluded the best approach for this issue would be to allow plugins to directly customize inference heuristics, rather than offering this limited interface to cut latency by bailing out. In this view, this part will essentially be postponed to Part 2, i.e. the integration with `AbstractInterpreter`, where we can directly intervene in inference.
<!--
##### Soundness and Termination of Inference ?
We may need to prohibit overloading of source/context hooks, since otherwise we may end up with the same inference limitation as `@generated` function, i.e. hooks can only be safely invoked when the signature is fully concrete.
The essential problem of `@generated` function is that its code generation is _too much_ integrated with the dispatch system (i.e. a generator itself needs to be dispatched) -- and if we design the source/context hooks in the same way (as I did in above), we will face the exactly same problem:
```julia
@generated function foo(args...)
if all(t->t<:Number, args)
return :(sum(args))
else
return nothing
end
end
code_typed((args...)->foo(args...), (Int, Int,)) # inferred
# currently compiler doesn't understand that the subtyping rule, which could be known for this abstract signature, is enough to generate code
code_typed((args...)->foo(args...), (Integer, Integer,)) # can't be inferrred
function source_hook(::Type{T}) where T<:Tuple{SomeCompilerPlugin,typeof(foo),Vararg{Any}}
if all(t->t<:Number, T.parameters[2:end])
return source_to_sumup
else
return nothing
end
end
code_typed((args...)->SomePlugin()(foo, args...), (Int, Int,)) # inferred
code_typed((args...)->SomePlugin()(foo, args...), (Integer, Integer,)) # can't be inferrred
```
We can avoid the problem by prohibiting customization of the source hook entirely and limit the customization of the context hook as `@generated` function, i.e. trigger it only when an edge signature is known to be leaf type, and otherwise just recurse the context. Since the context can be passed throughout no matter how inference is successful, the dispatch system will ensure inference sees the right sources.
Rewrite hooks can also break program semantics in arbitrary ways, but they're not involved with inference. It's a responsibility of a plugin to define and form the expected program semantics by IR transformations.
-->
##### Plugin Examples
With this compiler plugin interface, we can fully implement Keno's example target plugin: [`sin` to `fast_sin` rewriter](https://gist.github.com/Keno/d0c2df947f67be543036238a0caeb1c6).
We can also rewrite [Cassette.jl's `println` slicer](https://julia.mit.edu/Cassette.jl/latest/contextualpass.html) as a compiler plugin.
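Here is a rough sketch of what such a `println` slicer could look like with the hooks above; it glosses over the exact form a `println` call takes in the inferred `CodeInfo` (a robust implementation would also need to handle `:invoke` expressions, constant-folded callees, etc.):
```julia=
struct PrintlnSlicerPlugin <: Core.AbstractCompilerPlugin end

# keep the context for every call edge so that the whole call graph gets sliced
context_hook(::PrintlnSlicerPlugin, @nospecialize(sigt)) = PrintlnSlicerPlugin()

is_println(@nospecialize(f)) = f === println || (f isa GlobalRef && f.name === :println)

function preopt_hook(::PrintlnSlicerPlugin, inferred::CodeInfo)
    for (i, stmt) in enumerate(inferred.code)
        if Meta.isexpr(stmt, :call) && is_println(stmt.args[1])
            inferred.code[i] = nothing # turn the `println` call into a no-op
        end
    end
    return inferred
end
```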
#### Part 2: `AbstractInterpreter` Delegation
If a plugin wants to tweak inference/optimization options, or implement its own abstract interpretation routine and/or optimization pass, it needs to use the `AbstractInterpreter` interface.
Switching to an external `AbstractInterpreter` can also be done by adding another hook on `AbstractCompilerPlugin`.
We will set the hooks at the appropriate inference entries, and the hook will decide which `AbstractInterpreter` to use.
<!--
Julia's code execution engine will perform external calls to enter inference, and by default the `NativeInterpreter` performs inference. The idea is to implement API(s) to delegate inference entries to an external `AbstractInterpreter`.
To get a grasp of it, here is a hacky way to hijack (only top-level) the inference entry:
```julia=
function with(@nospecialize(f), interp::XInterpreter)
x_typeinf_ext_toplevel(mi::MethodInstance, world::UInt) =
typeinf_ext_toplevel(interp, mi)
try
ccall(:jl_set_typeinf_func, Cvoid, (Any,), x_typeinf_ext_toplevel)
return f()
catch err
rethrow(err)
finally
ccall(:jl_set_typeinf_func, Cvoid, (Any,), typeinf_ext_toplevel)
end
end
```
With a `typeinf` delegation API, we will do something like above, but more gracefully and completely.
-->
##### API Look
```julia=
# by default, don't switch the interpreter
get_interpreter(@nospecialize(::Any), interp::AbstractInterpreter) =
return interp
# a plugin can opt-in to provide a custom interpreter
struct XInterpreterPlugin <: AbstractCompilerPlugin end
struct XInterpreter <: AbstractInterpreter
    # ... state for customized inference and/or optimization
end
get_interpreter(::Type{XInterpreterPlugin}, interp::AbstractInterpreter) =
return XInterpreter(interp, ...)
get_interpreter(::Type{XInterpreterPlugin}, interp::XInterpreter) =
return interp
```
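Hypothetical usage, following the Part 1 calling convention where a plugin instance is invoked with the target function and its arguments:
```julia=
compute(xs) = sum(abs2, xs)

# the whole call graph of `compute` is inferred and optimized by `XInterpreter`
XInterpreterPlugin()(compute, rand(10))
```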
Probably we want to have an API to allow global delegation.
```julia=
# insert the implicit first argument `plugin` for every top-level call in the succeeding program execution
set_plugin!(plugin::AbstractCompilerPlugin)
```
##### Cache Separation and Dynamic Dispatch
Some `AbstractInterpreter`s may want cache separation and others may not:
- if `MyNativeInterpreter` wants to enable `aggressive_constant_prop` globally, it's arguably okay to just share the cache with `NativeInterpreter`
- `AutoDiffInterpreter` should definitely separate its cache from the other `AbstractInterpreter`s'
It's up to each `AbstractInterpreter` whether to separate its cache from the native code cache, and if it does, it will use the Part 1 approach.
This is because under Julia's current native code execution infrastructure, code can be inserted into the native cache at many different points of the codegen pipeline, and that is not actually parameterized by `AbstractInterpreter` -- so only managing `code_cache(::AbstractInterpreter)` is really not enough to enable cache separation.
So such a plugin should pass its own context throughout the entire call graph and prevent its code cache (in other words, its customized program semantics) from leaking into the native one.
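For reference, here is a minimal sketch of the inference-side half of cache separation as it can be done today (roughly the pattern GPUCompiler.jl follows): a plugin-owned `MethodInstance => CodeInstance` mapping exposed through `Core.Compiler.code_cache`. `MyInterp` is assumed to be an already-defined custom `AbstractInterpreter` carrying `cache` and `world` fields, and, as argued above, this alone does not stop codegen from populating the native cache, which is why the Part 1 mechanism is needed on top of it:
```julia=
const CC = Core.Compiler

struct PluginCodeCache
    dict::IdDict{Core.MethodInstance,Core.CodeInstance}
end
PluginCodeCache() = PluginCodeCache(IdDict{Core.MethodInstance,Core.CodeInstance}())

# expose the plugin-owned cache to inference (assumes `MyInterp` has `cache::PluginCodeCache` and `world::UInt` fields)
CC.code_cache(interp::MyInterp) = CC.WorldView(interp.cache, CC.WorldRange(interp.world))

CC.haskey(wvc::CC.WorldView{PluginCodeCache}, mi::Core.MethodInstance) = haskey(wvc.cache.dict, mi)
CC.get(wvc::CC.WorldView{PluginCodeCache}, mi::Core.MethodInstance, default) = get(wvc.cache.dict, mi, default)
CC.getindex(wvc::CC.WorldView{PluginCodeCache}, mi::Core.MethodInstance) = getindex(wvc.cache.dict, mi)
CC.setindex!(wvc::CC.WorldView{PluginCodeCache}, ci::Core.CodeInstance, mi::Core.MethodInstance) = setindex!(wvc.cache.dict, ci, mi)
```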
##### TODO: Compose `AbstractInterpreter`s
We need to think of what kind of "contracts" are enough to express and restrict `AbstractInterpreter`s composition.
## Next Actions
- we need to decide what approach we'd like to take for CompilerPlugin.jl
- if we're going to support built-in Cassette.jl-like IR transformation, this module will likely live inside `Core` -- in that case we should perhaps call it the `Core.Compiler.Plugin` module rather than CompilerPlugin.jl
- and the actual implementation will likely happen at JuliaLang/julia
# CompilerTools.jl
## Scope
1. define and export "standardized" bridges between `Core.Compiler` and `Base`
2. provide common utilities to inspect/manipulate `Core.Compiler` data structures
## 1. "Standardized" Bridges
It's quite common for a package to implement its own bridges between `Base` and `Core.Compiler`, and this can produce (pre)compilation problems when multiple such packages are used in the same project (since they may try to overload the same method).
So essentially this part will just do something very similar to what [this PR](https://github.com/JuliaLang/julia/pull/40571) tried to do:
```julia=
Base.iterate(x::Core.Compiler.BitSet) = Core.Compiler.iterate(x)
Base.length(matches::Core.Compiler.MethodLookupResult) = Core.Compiler.length(matches)
...
```
And probably we also want to define `missing` within `Core`, so that we don't need to distinguish `Core.Compiler.missing` and `Base.missing`.
## 2. Utilities for IR Manipulation
Maybe there is no "urgent" need for this, but this should still be valuable.
This part may include some refactoring of `Core.Compiler` itself.
Here are some (random) ideas of library functionalities:
- lowered code: some of what is implemented within [MacroTools.jl](https://github.com/FluxML/MacroTools.jl) and/or [LoweredCodeUtils.jl](https://github.com/JuliaDebug/LoweredCodeUtils.jl) might be better maintained as stdlib(s) (see the small example after this list)
- optimized IRs:
    - automatic renumbering of SSA statements and basic block labels, easier basic block replacement, etc. are desirable
    - [IRTools.jl](https://github.com/FluxML/IRTools.jl) is fancy, but not compatible with Julia's base IR
- unified interface for lowered/optimized IRs ?
    - implement a unified interface for `CodeInfo` and `IRCode` ?
    - or even "re-merge" them ?
    - make `argextype` work for typed lowered code as well ?
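As a tiny example of the kind of expression-level utilities meant by the first bullet, MacroTools.jl's `postwalk` is the sort of building block that currently lives outside the standard libraries:
```julia=
using MacroTools

ex = :(f(x) + g(y))

# rewrite every reference to `f` into `h` by walking the expression tree
MacroTools.postwalk(s -> s === :f ? :h : s, ex) # :(h(x) + g(y))
```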
## Next Actions
Probably we want to set up a repository named CompilerTools.jl under [the JuliaLang organization](https://github.com/JuliaLang).
Since there is a somewhat urgent need for part 1. (the "standardized" bridges), we want to implement it soonish and actually make use of it in existing packages like [JET.jl](https://github.com/aviatesk/JET.jl) and [Mixtape.jl](https://github.com/JuliaCompilerPlugins/Mixtape.jl).
As for part 2. (the IR manipulation utilities), a slower and more gradual approach would be appropriate -- maybe just collect ideas like the above in GitHub issues at the initial stage, and then triage them and implement them one by one ?