Try   HackMD
tags: Juvix juvix-project

Juvix and Serialization

Context

Juvix as a project has decided upon the Berlin Pipeline as a means of unifying the codebase (see the writeup for motivation). The Berlin also enforces a strict structure on what is a pass. Namely, the data that gets sent between compiler passes must have the following properties:

  1. Be one consistent type that can express any form (statement, expression, declaration, etc.) that the compiler may contain.
  2. Can be easily manipulated, as to be able to write tooling ontop and operate generically over it.
  3. Be able to reason over and operate over subsets of the given form that we wish to operate upon.
  4. Be easily serializable and de-serializable to whatever structure is most convenient to work over.

Only a few representations fit the bill for all four of these properties, and out of those we have chosen S-expressions as the best candidate for this. This document will outline how S-expressions interact with 4., covering prior work and what we wish to do with it.

Previous Work

S-expression have a long storied history as programming language, a configuration language, and as a generic serializable format. Thus, what we present here is a tiny subset of total previous work, focusing on the use of s-expressions as a serialization format in OCaml.

The most user prominant use of S-expressions in OCaml, can be found in Jane Street's work. Here is a few places where they use it to serialize data.

  1. Erorr Handling

    • In error construction they use it as a generic error type, allowing extra freedom of combining error types between different parts of the codebase which may wish to use custom error types.
    • It also allows the programmer to not worry about converting their error message into a string on the spot, as the structure of the Error ADT is saved in a generic structure.
    • Lastly, it allows easy tagging of error messages to better express where the errors come from (I.E. did it come from the desugaring part of the compiler or the type checker or maybe an invalid form).
  2. IPC

    • Here S-expressions serve as a cross language communication device.
    • We can write ADT's in OCaml, and serialize them to S-expressions passing them off to Ruby or whatever language has implemented an S-expression parser. Do some operations over it and serialize it back to OCaml for further processing.
  3. Meta and Generic Programming

    • Here are some Quotes from the Jane Street talk about how s-expressions have helped the generic process of OCaml's AST and allow more powerful tooling to be made.

      So again, this was, I think people elsewhere in the OCaml community realized that this was important, and that it was specifically important to rationalize the tooling story and to get rid of a lot of the technical depth that had built up on top of Camlp4. And I think one of the reasons that we were nowhere near realizing this was important is our own engagement with the tooling at that point was still fairly limited, right? We were still much more in the user stance.

      That was an incredibly powerful tool for us, because it meant all sorts of cases where we would otherwise need to do a bunch of boring and error-prone hand-coding of these serializers and de-serializers, we could replace with a system where we just once carefully wrote out a thing that automatically generates those for us. So it’s great, what is there to complain about? We have a metaprogramming facility, we can extend the language in whatever way we want.

  4. automatic serialization

    • Further since the structure is so regular, and ADT's have a defined shape, there exists automated ways of taking an ADT and deriving the serialization of it for free.
    • The link also gives more details on deriving error messages and preserving invariants upon this transformation as well.

Even firms outside of jane street find use of this serialization format

  1. Cross Language Communication
    • Here is another exmaple of using OCaml's S-expressions as a means to serialize data to and from OCaml.

Serializing Data Structures

With the previous work demonstrated in OCaml, we can formulate our own serialization and de-serialization of datastructures with the following signature.

class Serializeable a where toSexp :: a -> Sexp.T fromSexp :: Sexp.T -> Maybe a

Thus if we had a minimal ADT like:

data Expression = Int Integer | Add Expression Expression | Multiply Expression Expression

Then we can write the serialization as follows:

instance Serializable Expression where toSexp (Int integer) = Sexp.Num integer toSexp (Add e1 e2) = Sexp.list [Sexp.symbol "Add", toSexp e1, toSexp e2] toSexp (Multiply e1 e2) = Sexp.list [Sexp.symbol "Multiply", toSexp e1, toSexp e2] fromSexp (Sexp.Num i) = Int i fromSexp (Sexp.List [Sexp.Symbol "Add", e1, e2]) = Add <$> fromSexp e1 <*> fromSexp e2 fromSexp (Sexp.List [Sexp.Symbol "Multiply", e1, e2]) = Multiply <$> fromSexp e1 <*> fromSexp e2 fromSexp _ = Nothing

This is by doing the structure by hand, however for a function like fromSexp what we may wish to use instead is our Sexp library which can generate toSexp and fromSexp for forms like

module Structure where data Add = Add Sexp.T Sexp.T data Multiply = Multiply Sexp.T Sexp.T newtype Int = Int Integer -- Now we can write instance Serializable Expression where fromSexp (Structure.Add e1 e2 <- Sturcture.toAdd) = Add <$> fromSexp e1 <*> fromSexp e2 fromSexp (Structure.Multiply e1 e2 <- Structure.toMultiply) Multiply <$> fromSexp e1 <*> fromSexp e2 fromSexp (Structure.Int i <- Structure.toInt) = Int i fromSexp _ = Nothing

In the future, we could write a library like Jane streets to automate this process entirely, thus making this completely automated.

This is, however, not the end of the work that we can do. The given Structure example, showcases other abilities this format can bring to working on serialized data. Namely the Structure example shows transforming a shallow layer to and from s-expressions. However what if we could write something like this

data ExpressionSubset = Add (Sexp.T ExpressionSubset) (Sexp.T ExpressionSubset) | Multiply (Sexp.T ExpressionSubset) (Sexp.T ExpressionSubset) deriving Sexp

What this means is that if Sexp.T can be paramartirized over some primitive, then we can have functions which operate over subsets of the full expression. Like

-- recurse allows the function to work through the arguments -- of Add and Multiply to have this automatically run on those linearize :: ExpressionSubset -> (Sexp.T ExpressionSubset -> Sexp.T ExpressionSubset) -> ExpressionSusbset linearize (Add (Sexp.Param (Add a b)) y) recurse = lineraize (Add a (Sexp.Param (Add b y))) recurse linearize (Add a b) recurse = Add (recurse a) (recurse b) lineraize (Multiply ...) recurse = ... flattenMath :: Sexp.T ExpressionSubset -> Sexp.T ExpressionSubset flattenMath = Sexp.onParamarizations linearize toExpressionSubset :: Sexp.T () -> Sexp.T ExpressionSubset toExpressionSubset = Sexp.convertForms toExpressionSubset linerizeTransform :: Sexp.T () -> Sexp.T () linearizeTransform sexp = toExpressionSubset sexp |> flattenMath |> toUnitSexpression lineraizeTransform <$> Sexp.parse "(Add (Add 4 (Sub 5 2)) 7)" --> Just "(Add 4 (Add (Sub 5 2) 7))" lineraizeTransform <$> Sexp.parse "(Add (Add 4 (Sub (Add (Add 5 8) 6) 2)) 7)" --> Just "(Add 4 (Add (Sub (Add 5 (Add 8 6)) 2) 7))"

With the following Sexpression library additions

module Sexpression where -- | @toUnitSexpression@, serializes out the parameterization of the S-expression itself -- Giving back an unparameterized form which is marked by the `Unit` type. toUnitSexpression :: Serializeable a => Sexp.T a -> Sexp.T () -- | @convertForms@ takes an un-parameterized s-expression and tries to run the given -- transformation. If the function returns a `Just` then the Sexp is converted to -- Sexp.Param a, otherwise it stays as the origional s-expressions and tries recursively. convertForms :: (Sexp.T () -> Maybe a) -> Sexp.T () -> Sexp.T a -- | @onParamrizations@ takes a transformation on the parameterized s-expression -- and runs it every instance of every paramaretized atom in the given S-expression. -- This function can not be ran automatically inside the paramaretized atom, and thu -- a recursive function to continue this process is handed to it. -- In the end we get back out a new paramaretized S-expression being changed by the -- transofmration that was passed. onParamarizations :: (a -> (Sexp.T a -> Sexp.T b) -> b) -- ^ The parameterized transformation function. -- it takes the original parameterization `a` -- along with a function which recurses on the `Sexp.T a` -- that it may carry inside the `a` form. to allow nested changes. -- Finally it produces a new parameterization with the desired -- modifications preformed on the parameterization. -> Sexp.T a -> Sexp.T b

However this kind of transformation needs a Sexp.T which can take a paramatirzation to work effectively.

Thus what we have seen in this section is three fold:

  1. Full serialization and de-serialization of a Haskell ADT
  2. Shallow serialization and de-serialization through a single layer ADT
  3. A partial deep serialization and de-serialization through a sturcture containing S-expression
      1. looks very promising for most of the passes we have.

Use in Foreign Data-Stores

As the Prior Work has hinted at, Serialization is not just useful for unifying various intereconnected parts of code, it also acts as a great communication mechanism between languages and across the wired. This is less important right this second, however in the future the Context data structure that is threaded through the Berlin Pipeline will be an in memory LLVM data structure. For when this time comes, S-expressions serve as an easy way to communicate between the code Juvix (The programming language) can store and the code that Haskell will operate and send data to.