# How pgx creates PostgreSQL extension SQL definitions that bind to Rust
PostgreSQL offers an extension interface, and it's my belief that Rust is a fantastic language to write extensions for it. Eric Ridge thought so too, and started `pgx` a while back. I've been working with him to improve the toolkit, and wanted to share one of our latest hacks: improving the generation of extension SQL code to interface with Rust.
This post is more on the advanced side, as it assumes knowledge of both Rust and PostgreSQL. We'll approach topics like foreign functions, dynamic linking, procedural macros, and linkers.
<!-- more -->
`pgx` based PostgreSQL extensions ship as a few objects: a `*.control` file which the user defines, a `*.so` cdylib that contains the compiled code, and the `*.sql` file which PostgreSQL loads the extension's SQL entities from. We'll talk about generating this SQL today!
These `*.sql` files must be generated from data within the Rust code. The SQL generator `pgx` provides needs to:

* Create a SQL enum definition for each `#[derive(PostgresEnum)]` marked enum.
* Create a SQL function definition for each `#[pg_extern]` marked function.
* Create a SQL operator definition for each `#[pg_operator]` marked function.
* Create a SQL type definition for each `#[derive(PostgresType)]` marked type.
* Create a hash operator family for each `#[derive(PostgresHash)]` marked type.
* Create an ordering operator family for each `#[derive(PostgresOrd)]` marked type.
An earlier version of `cargo-pgx` had a `cargo pgx schema` command that would read your Rust files and generate SQL corresponding to them.
This worked okay! But gosh, it's not fun to do that, and there are a lot of complications such as trying to resolve types!
So, what's more fun than parsing Rust source code and generating SQL? Parsing Rust code in procedural macros to inject metadata-returning foreign functions, then later creating a binary which re-exports those functions via linker tricks, dynamically loads itself, and calls them all to collect metadata, then builds a dependency graph of them to drive the output!
Wait… What, that was not your answer? Oh no… Well, bear with me because that's what we're doing.
So, why else should we do this other than fun?

* Mapping Rust types to SQL types is easy for things like `i32` to `integer`, but it starts to break down when you're mapping things like `Array<Floofer>` to `Floofer[]`. `Array` is from `pgx::datum::Array`, and we'd need to start reading into `use` statements if we were parsing code… and also, what about macros (which may create `#[pg_extern]`s)? …Oops! We're a compiler!
* Working from inside the compiled extension means we can reach for tools like `core::any::TypeId`, or build up accurate Rust to SQL type maps.
* It also means we can work with the fully expanded `#[pg_extern]` (or similar) definitions.

Okay, did I convince you? … No? Dangit. Oh… well, let's explore anyways!
Here are some fun ideas we pondered (and, in some cases, tried):

* Parse the code in `build.rs`: We'd be right back to code parsing like before! The `build.rs` of a crate is invoked before macro expansion or any type resolution, so we'd have the same problems, too!
* Write metadata to `$OUT_DIR`: We could output metadata, such as JSON files, to some `$OUT_DIR` instead, and have `cargo pgx schema` read them, but that doesn't give us the last pass where we can call `core::any::TypeId`, etc.
* Use `rust-analyzer` to inspect the code: This would work fine, but we couldn't depend on it directly since it's not on crates.io. We'd need to use the command line interface, and the way we thought of seemed reasonable without depending on more external tools.
* Use `inventory`: We could sprinkle `inventory::submit! { T::new(/* ... */) }` calls around our codebase, and then at runtime call `inventory::iter::<T>`. We could get this approach working by setting `codegen-units = 1`, but it was not satisfying or ideal.
* `dlopen` the extension's `.so`: We can't load the `.so` without also having the postgres headers around, and that's … Oof! We don't really want to make `cargo-pgx` depend on specific PostgreSQL headers.

But wait! It turns out, that can work! We can have the binary re-export the functions and be very careful with what we use!
We're going to produce a binary during the build. We'll have macros output some foreign functions, then the binary will re-export and call them to build up a structure representing our extension. `cargo pgx schema`'s job will be to orchestrate that.
Roughly, we're slipping into the build process this way:
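A rough sketch of the flow (the exact orchestration lives inside `cargo-pgx`):

```
cargo pgx schema
 └─ cargo build .............. proc macros expand, emitting __pgx_internals_* functions
     └─ build sql-generator .. a binary that re-exports those functions
         └─ run it ............ it dlopens itself, calls each function, builds the
                                dependency graph, and writes out the extension SQL
```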
During the proc macro expansion process, we can append the `proc_macro2::TokenStream` with some metadata functions. For example:
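Something in this spirit, for a `#[derive(PostgresEnum)]` on an `Animals` enum (a simplified sketch; `SqlGraphEntity` and `PostgresEnumEntity` fields here are illustrative, and the real expansion records more):

```rust
// Appended by the derive macro's expansion (illustrative):
#[no_mangle]
pub extern "C" fn __pgx_internals_enum_Animals() -> SqlGraphEntity {
    SqlGraphEntity::Enum(PostgresEnumEntity {
        full_path: "my_extension::Animals",
        variants: vec!["Dog", "Cat"],
    })
}
```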
How about functions? Same idea:
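Roughly (again simplified, with illustrative entity names):

```rust
// Appended next to the user's `known_animals` function by the attribute macro:
#[no_mangle]
pub extern "C" fn __pgx_internals_fn_known_animals() -> SqlGraphEntity {
    SqlGraphEntity::Function(PgExternEntity {
        full_path: "my_extension::known_animals",
        arguments: vec![],
        returns: core::any::TypeId::of::<Vec<Animals>>(),
    })
}
```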
Then, in our `sql-generator` binary, we need to re-export them! We can do this by setting `linker` in `.cargo/config` to a custom script which includes `--dynamic-list`:
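Something like this (the target triple and script path are examples; the template `cargo-pgx` generates may differ):

```toml
# .cargo/config
[target.x86_64-unknown-linux-gnu]
linker = "./.cargo/pgx-linker-script.sh"
```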
> Note: We can't use `rustflags` here since it can't handle environment variables or relative paths.
In `.cargo/pgx-linker-script.sh`:
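A sketch of what such a script can look like. It hands the linker a `--dynamic-list` file so every `__pgx_internals_*` symbol stays exported from the binary (the script `cargo-pgx` actually generates handles more platforms and edge cases):

```bash
#!/usr/bin/env bash
# .cargo/pgx-linker-script.sh (illustrative sketch)
set -e

# Keep all __pgx_internals_* symbols in the dynamic symbol table so the
# SQL generator can look them up after dlopen-ing itself.
DYNAMIC_LIST=$(mktemp)
echo "{ __pgx_internals_*; };" > "${DYNAMIC_LIST}"
cc -Wl,--dynamic-list="${DYNAMIC_LIST}" "$@"
rm -f "${DYNAMIC_LIST}"
```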
Since these functions all have a particular naming scheme, we can scan for them in `cargo pgx schema` like so:
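One way to get that list (a hypothetical sketch, not necessarily the exact code in `cargo-pgx`) is to list the exported symbols of the freshly built binary and keep the ones with our prefix:

```rust
use std::path::Path;
use std::process::Command;

/// Hypothetical helper: ask `nm` for the dynamic symbols and keep the pgx metadata ones.
fn find_internal_symbols(binary: &Path) -> Vec<String> {
    let output = Command::new("nm")
        .args(["--dynamic", "--defined-only"])
        .arg(binary)
        .output()
        .expect("failed to run nm");
    String::from_utf8_lossy(&output.stdout)
        .lines()
        .filter_map(|line| line.split_whitespace().last())
        .filter(|name| name.starts_with("__pgx_internals_"))
        .map(str::to_string)
        .collect()
}
```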
This list gets passed into the binary, which dynamically loads the code using something like:
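For instance, with the `libloading` crate (simplified; imagine this inside the generator's `main`):

```rust
// Sketch: the sql-generator binary loads itself, then calls every discovered
// __pgx_internals_* function to collect the entities.
let lib = unsafe { libloading::Library::new(std::env::current_exe()?)? };
let mut entities = Vec::new();
for name in &symbols_to_call {
    let func: libloading::Symbol<unsafe extern "C" fn() -> SqlGraphEntity> =
        unsafe { lib.get(name.as_bytes())? };
    entities.push(unsafe { func() });
}
```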
Then the outputs of that get passed to our `PgxSql` structure, something like:
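In spirit, it's something like this (the constructor and method names here are hypothetical; the real `PgxSql` API differs in its details):

```rust
// Sketch: fold the control file, the Rust-to-SQL type mapping, and all the
// collected entities into one structure, then write the SQL out.
let pgx_sql = PgxSql::build(
    control_file,  // the parsed *.control file, which becomes the graph root
    type_mappings, // the TypeId -> SQL type name mapping
    entities,      // every SqlGraphEntity returned by the __pgx_internals_* calls
)?;
pgx_sql.to_file("sql/my_extension--1.0.sql")?;
```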
Since SQL is very order dependent, and Rust is largely not, our SQL generator must build up a dependency graph; we use `petgraph` for this. Once a graph is built, we can topologically sort from the root (the control file) and get out an ordered set of SQL entities.
Then, all we have to do is turn them into SQL and write them out!
Let's talk more about some of the moving parts. There are a few interacting concepts, but everything starts in a few proc macros.
A procedural macro in Rust can slurp up a `TokenStream` and output another one. My favorite way to parse a token stream is with `syn`; to output one? Well, that's `quote`!
`syn::Parse` and `quote::ToTokens` do most of the work here.
`syn` contains a large number of predefined structures, such as `syn::DeriveInput`, which can be used too. Often, your structures will be a combination of several of those predefined structures.
You can call `parse`, `parse_str`, or `parse_quote` to create these types:
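For example, parsing an `ItemFn` from a string, or quasi-quoting one directly:

```rust
use syn::{parse_quote, ItemFn};

// From a string of Rust source:
let parsed: ItemFn = syn::parse_str("fn answer() -> i32 { 42 }").expect("valid Rust");

// Or quasi-quoted in place; `parse_quote!` panics on a parse failure
// instead of returning a `Result`:
let quoted: ItemFn = parse_quote! {
    fn answer() -> i32 { 42 }
};
```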
`parse_quote` works along with `quote::ToTokens`; its use is similar to how the `quote` macro works!
They get called by proc macro declarations:
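For example, an attribute macro entry point looks roughly like this (heavily simplified compared to the real `#[pg_extern]`):

```rust
use proc_macro::TokenStream;

#[proc_macro_attribute]
pub fn pg_extern(attr: TokenStream, item: TokenStream) -> TokenStream {
    let func = syn::parse_macro_input!(item as syn::ItemFn);
    let _ = attr; // attribute arguments (like `immutable`) would be parsed here
    // ... build the wrapper and the __pgx_internals_* metadata function here ...
    quote::quote!(#func).into()
}
```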
In the case of custom derives, often something like this works:
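And a sketch of the derive side:

```rust
use proc_macro::TokenStream;

#[proc_macro_derive(PostgresEnum)]
pub fn postgres_enum(input: TokenStream) -> TokenStream {
    let ast = syn::parse_macro_input!(input as syn::DeriveInput);
    // ... inspect `ast.ident` and its variants, then emit the metadata function ...
    let _ = ast;
    TokenStream::new() // a derive only *adds* items; the original type stays as-is
}
```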
## TypeIds & Type Names

Rust's standard library is full of treasures, including `core::any::TypeId` and `core::any::type_name`.
Created via `TypeId::of::<T>()`, a `TypeId` is unique for any given `T`.
We can use this to determine two types are indeed the same, even if we don't have an instance of the type itself.
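A quick illustration:

```rust
use core::any::TypeId;

// Equal for the same type, different for different types, no instances needed.
assert_eq!(TypeId::of::<String>(), TypeId::of::<String>());
assert_ne!(TypeId::of::<String>(), TypeId::of::<&str>());
```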
During the macro expansion phase, we can write out `TypeId::of::<#ty>()` for each type `pgx` interacts with (including 'known' types and user types). Later in the build phase these calls exist as `TypeId::of::<MyType>()`; then, during the binary phase, these `TypeId`s get evaluated and registered into a mapping, so they can be queried.
`core::any::type_name::<T>()` is a diagnostic function available in `core` that makes a 'best-effort' attempt to describe the type.
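For example:

```rust
// Best-effort and human-readable; the exact text is not guaranteed to be stable.
println!("{}", core::any::type_name::<Vec<Option<i32>>>());
// Typically prints something like: alloc::vec::Vec<core::option::Option<i32>>
```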
Unlike `TypeId`s, which produce the same ID in any part of the code, `type_name` cannot promise this. From the docs:
> The returned string must not be considered to be a unique identifier of a type as multiple types may map to the same type name. Similarly, there is no guarantee that all parts of a type will appear in the returned string: for example, lifetime specifiers are currently not included. In addition, the output may change between versions of the compiler.
So, `type_name` is only somewhat useful, but it's our best tool for inspecting the names of the types we're working with. We can't depend on it, but we can use it to infer things, leave human-friendly documentation, or provide our own diagnostics.
The above gets parsed, and new tokens are quasi-quoted atop a template like this:
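Something in this spirit (the mapping struct and its fields here are illustrative, not `pgx`'s exact internals):

```rust
// Inside the proc macro: `ty` is a `syn::Type` the extension touches, and
// `sql_name` is the SQL type we believe it maps to.
let mapping = quote::quote! {
    RustSqlMapping {
        rust_id: core::any::TypeId::of::<#ty>(),
        rust_name: core::any::type_name::<#ty>(),
        sql: #sql_name,
    }
};
```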
The proc macro passes will make it into:
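For a user type `Floofer`, the quasi-quoted template above ends up as ordinary Rust along these lines:

```rust
RustSqlMapping {
    rust_id: core::any::TypeId::of::<Floofer>(),
    rust_name: core::any::type_name::<Floofer>(),
    sql: "Floofer",
}
```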
Next, let's talk about how the `TypeId` mapping is constructed!
We can build a `TypeId` mapping of every type `pgx` itself has built-in support for. For example, we could do:
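A sketch of the idea:

```rust
use core::any::TypeId;
use std::collections::HashMap;

// Map the TypeIds of types pgx has built-in support for to their SQL names.
let mut map: HashMap<TypeId, &'static str> = HashMap::new();
map.insert(TypeId::of::<i32>(), "integer");
map.insert(TypeId::of::<String>(), "text");
map.insert(TypeId::of::<Vec<i32>>(), "integer[]");
map.insert(TypeId::of::<Option<i32>>(), "integer"); // nullable in SQL
```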
This works fine, until we get to extension defined types. They're a bit different!
Since `#[pg_extern]` decorated functions can use not only some custom type `T`, but also other types like `PgBox<T>`, `Option<T>`, `Vec<T>`, or `pgx::datum::Array<T>`, we want to also create mappings for those. So we also need `TypeId`s for those… if they exist.
Types such as `Vec<T>` require a type to be `Sized`, and `pgx::datum::Array<T>` requires a type to implement `IntoDatum`. These are complications, since we can't create those mappings unless `MyType` meets the requirements. Unfortunately, Rust doesn't give our macros a way to simply ask whether `MyType` is `Sized` or implements `IntoDatum`, the way the `impls` crate lets you ask in ordinary code.
Thankfully, we can use the same strategy as `impls`:

> Inherent implementations are a higher priority than trait implementations.
First, we'll create a trait, and define a blanket implementation:
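A simplified sketch (the real `pgx` version tracks more ids, such as for `Option<T>` and array types, and it uses `once_cell`'s `Lazy` because `TypeId::of` isn't a `const fn`):

```rust
use core::any::TypeId;
use once_cell::sync::Lazy;

pub trait WithTypeIds {
    const ITEM_ID: Lazy<TypeId>;
    const VEC_ID: Lazy<Option<TypeId>>;
}

// Blanket implementation: every 'static type gets an ITEM_ID, but from here we
// can never name `Vec<T>` (T might not be Sized), so VEC_ID stays None.
impl<T: ?Sized + 'static> WithTypeIds for T {
    const ITEM_ID: Lazy<TypeId> = Lazy::new(|| TypeId::of::<T>());
    const VEC_ID: Lazy<Option<TypeId>> = Lazy::new(|| None);
}
```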
This lets us do `<T as WithTypeIds>::ITEM_ID` for any `T`, but the `VEC_ID` won't ever be populated. Next, we'll create a 'wrapper' holding only a `core::marker::PhantomData<T>`:
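Continuing the sketch:

```rust
use core::marker::PhantomData;

pub struct WithSizedTypeIds<T: ?Sized>(PhantomData<T>);

// This inherent impl only applies when `T` is actually Sized. Inherent items
// are found before trait items, so for sized `T` this VEC_ID wins; for unsized
// `T` the lookup falls back to the blanket trait impl's `None` above.
impl<T: 'static> WithSizedTypeIds<T> {
    pub const VEC_ID: Lazy<Option<TypeId>> = Lazy::new(|| Some(TypeId::of::<Vec<T>>()));
}
```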
Now we can do `WithSizedTypeIds::<T>::VEC_ID` for any `T` to get the `TypeId` for `Vec<T>`, and only get `Some(item)` if that type is indeed sized.
Using this strategy, we can have our `__pgx_internals` functions build up a mapping of `TypeId`s and the SQL they map to.
Once we have a set of SQL entities and a mapping of how different Rust types can be represented in SQL, we need to figure out how to order it all.
While this is perfectly fine in Rust:
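A small illustration:

```rust
// The function can appear before the type it mentions is defined.
fn make_floofer() -> Floofer {
    Floofer { legs: 4 }
}

struct Floofer {
    legs: u32,
}
```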
The same in SQL is not valid.
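For example (illustrative SQL, not `pgx`'s exact output):

```sql
-- Fails: the type doesn't exist yet when the function referencing it is created.
CREATE FUNCTION make_floofer() RETURNS Floofer
    LANGUAGE sql AS $$ SELECT ROW(4)::Floofer $$;

CREATE TYPE Floofer AS (legs integer);
```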
We use a `petgraph::stable_graph::StableGraph`, inserting all of the SQL entities, then looping back and connecting them together.
If an extension has something like this:
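For example (trimmed down; the exact derives a real extension needs may vary a little):

```rust
use pgx::*;

#[derive(PostgresEnum)]
enum Animals {
    Dog,
    Cat,
}

#[pg_extern]
fn known_animals() -> Vec<Animals> {
    vec![Animals::Dog, Animals::Cat]
}
```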
We need to go and build an edge reflecting that the function `known_animals` requires the type `Animals` to exist. The graph also needs edges reflecting that these entities depend on the extension root.
Building up the graph is a two-step process: first, we populate it with all the `SqlGraphEntity`s we found.
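In sketch form, using `petgraph`'s API (the entity variant names are illustrative):

```rust
use petgraph::stable_graph::StableGraph;

// One node per SQL entity, plus a root node representing the control file.
let mut graph: StableGraph<SqlGraphEntity, ()> = StableGraph::new();
let root = graph.add_node(SqlGraphEntity::ExtensionRoot(control_file));
let mut nodes = Vec::new();
for entity in entities {
    nodes.push(graph.add_node(entity));
}
```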
This process involves adding the entity, as well as doing things like ensuring the return value has a node in the graph even if it's not defined by the user. For example, `&str` is a builtin value `pgx` knows how to make into SQL and back.
Once the graph is fully populated, we can circle back and connect the rest of the things together! This process includes doing things like connecting the arguments and returns of our `#[pg_extern]` marked functions.
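Continuing the sketch, the second pass adds edges so dependencies sort ahead of the entities that need them:

```rust
// Everything depends on the extension root existing first.
for &node in &nodes {
    graph.add_edge(root, node, ());
}

// And, for example, `known_animals` depends on the `Animals` type, so the edge
// points from the type's node to the function's node.
graph.add_edge(animals_node, known_animals_node, ());
```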
With a graph built, we can topologically sort it and transform each entity into its SQL representation:
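Roughly (assuming a `ToSql` trait shaped like the one described next):

```rust
use petgraph::algo::toposort;

let mut sql = String::new();
for index in toposort(&graph, None).expect("SQL dependency graph should have no cycles") {
    let entity = &graph[index];
    sql.push_str(&entity.to_sql(&context)?);
    sql.push('\n');
}
```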
In order to generate SQL for an entity, we define a `ToSql` trait which our `SqlGraphEntity` enum implements, like so:
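A simplified sketch of that shape (the real trait in `pgx` carries a richer context and error type):

```rust
pub trait ToSql {
    fn to_sql(&self, context: &PgxSql) -> Result<String, Box<dyn std::error::Error>>;
}

impl ToSql for SqlGraphEntity {
    fn to_sql(&self, context: &PgxSql) -> Result<String, Box<dyn std::error::Error>> {
        // Each variant just delegates to the entity it wraps.
        match self {
            SqlGraphEntity::Enum(item) => item.to_sql(context),
            SqlGraphEntity::Function(item) => item.to_sql(context),
            SqlGraphEntity::ExtensionRoot(item) => item.to_sql(context),
            // ... and so on for every other variant ...
        }
    }
}
```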
Then the same trait goes on the types which `SqlGraphEntity` wraps; here's what the function looks like for enums:
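Sketched for an enum entity (field names illustrative):

```rust
impl ToSql for PostgresEnumEntity {
    fn to_sql(&self, _context: &PgxSql) -> Result<String, Box<dyn std::error::Error>> {
        // e.g. CREATE TYPE Animals AS ENUM ('Dog', 'Cat');
        let variants = self
            .variants
            .iter()
            .map(|variant| format!("'{}'", variant))
            .collect::<Vec<_>>()
            .join(", ");
        Ok(format!("CREATE TYPE {} AS ENUM ({});", self.name, variants))
    }
}
```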
Other implementations, such as on functions, are a bit more complicated.
With the toolkit `pgx` provides, users of Rust are able to develop PostgreSQL extensions using familiar tooling and workflows. There are still a lot of things we'd love to refine and add to the toolkit, but we think you can start using it, like we do in production at TCDI.
Thanks to @dtolnay for making many of the crates discussed here, as well as being such a kind and wise person to exist in the orbit of. Also to Eric Ridge for all the PostgreSQL knowledge, and TCDI for employing me to work on this exceptionally fun stuff!