Persistent Libraries

# Persistent Libraries

This document explains a design for persisting *Libraries* and the *Module*s that comprise them so that we can stage them once, store them in Templandia, and reload them when a recompile needs them again.

As the Temper library ecosystem expands, most libraries will be downloaded on demand based on *import* instead of residing under the current work root. Not re-compiling the universe every time `temper build` is run is a nice performance boost.

This document is divided into a few pieces:

- How to find a persisted module given a *LibraryName*.
- Freshness metadata: how to determine quickly, from a persisted library, whether it is up-to-date with the current build environment:
  - the version of the compiler that persisted it,
  - a hash of its source file paths and content,
  - hashes of persisted libraries it depends upon.
- Persisted format overview: what are the main pieces of data in a persisted library file, how are they generated when persisting, and how are they consumed when un-persisting.
- The flow of persisting and unpersisting libraries.
- Code-generation: how we annotate Kotlin classes used by the compiler and generate code from them that handles persisting and un-persisting them.
- Provenance: which library does a thing belong to?

## Goals

Allow for quick recompiles by saving work done by earlier invocations of `temper build` and reusing it in subsequent builds.

For example, we should be able to compile *std* once, save it to local disk, and load it back into memory in less time than it takes to pass its modules through the staging process.

## Non-goals

It is not a goal of this design to allow persisted modules created by one version of the Temper toolchain to be reused by a later version of the Temper toolchain.

Specifically, this does not establish a binary distribution format for Temper libraries, just a local cache of work done.

## Templandia file layout: How to find a persisted library given a *LibraryName*

For each Temper library, we might have several versions in play.

- If the library is in a work-root on the local machine, we can use the canonical work-root path as an identifier.
- If the library was downloaded into Templandia, we can use the semver identifier.

```kt
sealed class LibraryNameAndVersion {
    abstract val libraryName: DashedIdentifier
}

data class RemoteLibraryNameAndVersion(
    override val libraryName: DashedIdentifier,
    // Not declared on the base class, so this is not an override.
    val version: SemVer,
) : LibraryNameAndVersion()

data class LocalLibraryNameAndVersion(
    override val libraryName: DashedIdentifier,
    // The canonical work-root path stands in for a version.
    val workRoot: SystemPath,
) : LibraryNameAndVersion()
```

In addition to the library name and version, it might be good to record which version of the Temper toolchain is using them, because the file format might differ between versions of the toolchain.

We might, in the future, generate code differently for different backends. If we don't need to, we can use a generic backend-id like `-any`, but either way, designing a place for the backend id into the persist/unpersist flow will avoid headaches down the road.

A very tentative file layout would be:

    .templandia/.built/<toolchain-version-tag>/<backend-id>/<library-name>.temper-prebaked.json

The rest of this document assumes the file format is JSON based, but a goal of the use of code generation below is to allow experimentation with and benchmarking of different approaches to persisting.

## Version tagging the toolchain

While a Temper team contributor is working on a new version of the compiler, they shouldn't be bothered by frequent complaints that their changes to the Kotlin source files that make up the compiler cause attempts to load *std* to fail with an exception because it can no longer be unpacked.

But the *\<toolchain-version-tag\>* used by stable distributions of the compiler should map to something recognizable to users so that they can file bugs.

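
As a rough illustration of the tentative file layout above, the path for a persisted library could be assembled as in the sketch below. The function name `prebakedFilePath` and the use of plain strings for the version tag, backend id, and library name are assumptions made for this example, not part of the design.

```kotlin
import java.nio.file.Path

// Hypothetical helper (not part of the design): computes where a persisted
// library file would live under the tentative layout
//   .templandia/.built/<toolchain-version-tag>/<backend-id>/<library-name>.temper-prebaked.json
fun prebakedFilePath(
    templandiaRoot: Path,
    toolchainVersionTag: String,
    libraryName: String,
    // The generic backend id mentioned above is used when none is given.
    backendId: String = "-any",
): Path = templandiaRoot
    .resolve(".built")
    .resolve(toolchainVersionTag)
    .resolve(backendId)
    .resolve("$libraryName.temper-prebaked.json")
```

Keeping the toolchain tag and backend id as directory levels means that a whole toolchain version's cache can be discarded by deleting one directory.
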
A stable version has a resource file, loadable from a well-known location, that specifies the semver.

`gradle cli:deploy` creates a temporary version tag based on a hash of the toolchain's files, and bundles it with the deployed gradle application.

Running tests locally comes up with a similar temporary version tag but does not persist it. That tag is based on `git rev-parse <LOCAL-REPO_ROOT> HEAD` but also includes untracked-but-not-gitignored files.

In short, there is a Kotlin class, *ToolchainVersion*, that exposes a string which is one of:

- `stable-<semver>` when the stable version semver tag resource file exists,
- `deployed-<date>-<hash>` when `gradle cli:deploy` was run and bundled a resource file with the hash,
- `development-<date>-<hash>` otherwise, falling back to invoking `git` to compute a fast hash of content for the current JVM run.

## Persisted file format overview

Assuming a JSON-like format for explanatory purposes, the outer layer looks like the below:

```json
{
  "persisted-by": "<stable-1.0.0>", // toolchain-version-tag
  "library-name": "<my-library>",   // redundant self identifier in case people
                                    // upload a file in a bug report without its path
  "source-hash": "<HASH>", // SHA hash of relative file paths and file content
                           // of files under this library's root.
                           // Files are sorted lexicographically by OS-independent
                           // file path.
  "backend-id": "<backend-id>",
  "depends-on": [
    // key names have the same meaning as above, but for other libraries
    { "library-name": "<other-library>", "source-hash": "<HASH>" },
    ...
  ],
  "ref-table": {
    // ref-tables explained below
    "<reference-key>": {<reference-value>},
    ...
  }
}
```

## Freshness: Can a persisted module be reused?

The `depends-on` key above allows us to check whether a group of persisted files is internally coherent, that is, whether each file's `source-hash` matches the hash recorded for it in the `depends-on` lists of the files that depend on it.
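
A minimal sketch of that coherence check follows. It assumes a hypothetical in-memory `PersistedHeader` that mirrors the `persisted-by`, `library-name`, `source-hash`, and `depends-on` keys; all names here are invented for illustration. It only checks that the persisted files agree with each other and with the current toolchain tag; comparing `source-hash` against the sources on disk would be a separate step.

```kotlin
// Hypothetical in-memory view of the outer layer of a persisted library file.
data class DependencyRef(val libraryName: String, val sourceHash: String)

data class PersistedHeader(
    val persistedBy: String,
    val libraryName: String,
    val sourceHash: String,
    val dependsOn: List<DependencyRef>,
)

/**
 * Returns the names of libraries whose persisted files were written by
 * [currentToolchainTag] and whose dependencies are, transitively, present
 * with matching source hashes.
 */
fun coherentLibraries(
    headers: Collection<PersistedHeader>,
    currentToolchainTag: String,
): Set<String> {
    val byName = headers.associateBy { it.libraryName }
    // Start from every file written by the current toolchain version.
    val coherent = headers
        .filter { it.persistedBy == currentToolchainTag }
        .map { it.libraryName }
        .toMutableSet()
    // Drop libraries with a missing, already dropped, or mismatched
    // dependency, and repeat until nothing more is dropped.
    var changed = true
    while (changed) {
        changed = coherent.removeAll { name ->
            byName.getValue(name).dependsOn.any { dep ->
                dep.libraryName !in coherent ||
                    byName.getValue(dep.libraryName).sourceHash != dep.sourceHash
            }
        }
    }
    return coherent
}
```

Libraries left out of the result are the ones that would be restaged and persisted again.
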
Inconsistencies might happen if we have libraries with dependencies like the below:

    depends-on-lots --depends-on--> depends-on-some --depends-on--> depends-on-none

Consider the following sequence of events:

- `temper build` builds `depends-on-lots` and its two transitive dependencies.
- Source files for `depends-on-none` change.
- `temper build` builds `depends-on-some`, which generates two persisted library files but does not update `depends-on-lots`'s persisted library.
- Some temper toolchain command tries to load `depends-on-lots` from the persisted file, but it's out-of-sync with its dependencies' persisted library files, so we rebuild it (using the persisted library files for its dependencies) and repersist it.

So it's fine if hashes don't match: we can solve the problem by aborting unpersisting of the stale library, restaging its modules, and then persisting a newly-consistent library file.

## Ref-table architecture

We have Kotlin classes like the below that we will probably have to persist.

```kotlin
data class Value<T : Any>(
    val stateVector: T,
    val typeTag: TypeTag<T>,
) : Result() { ... }
```

And persisting that might require persisting `content`, for example, if the value has a *UserFunctionValue*, or if the *typeTag* is a *TClass* for a user-defined type. Maybe one library constructs a *TClass* instance using a *TypeShape* defined in another library.

When persisting we have a persisting context:

```kotlin
class PersistingContext {
    // One ref-table per library touched while persisting.
    val refTables: MutableMap<LibraryName, RefTable> = mutableMapOf()
}
```

That lets us look up a reference table. If we know, for each thing we persist, which library it comes from, we can look up the ref table, allocate a *reference key* if necessary, and store a *pre-persisted form* in the table.

A pre-persisted form is just a list of key/value pairs, where a value is a ref-table entry or an *unowned value* (see below).
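
The "look up the ref table, allocate a reference key if necessary, store a pre-persisted form" step could look roughly like the sketch below. The `Sketch*` names, the use of `String` library names, and the use of `Any` for persisted values are simplifications assumed for this example only. Reserving the key before building the pre-persisted form is also an assumption of the sketch, made so that self-referential values do not recurse forever.

```kotlin
// Simplified, hypothetical stand-ins for the design's types.
typealias SketchRefKey = Int
typealias PrePersisted = List<Pair<String, Any?>>

class SketchRefTable {
    private val keyByValue = mutableMapOf<Any, SketchRefKey>()
    val rows = mutableListOf<PrePersisted?>()

    /**
     * Returns the key already assigned to [value], or allocates the next key
     * and stores the pre-persisted form produced by [prePersist].
     */
    fun intern(value: Any, prePersist: () -> PrePersisted): SketchRefKey {
        keyByValue[value]?.let { return it }
        val key = rows.size
        // Reserve the row first so that a nested intern of the same value
        // finds the key instead of allocating a second row.
        keyByValue[value] = key
        rows.add(null)
        rows[key] = prePersist()
        return key
    }
}

class SketchPersistingContext {
    private val refTables = mutableMapOf<String, SketchRefTable>()

    /** Looks up, or lazily creates, the ref-table for the owning library. */
    fun refTableFor(libraryName: String): SketchRefTable =
        refTables.getOrPut(libraryName) { SketchRefTable() }
}
```

Interning the same value twice yields the same key, which is what keeps duplicate entries out of the table.
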
A ref-table entry is identified by:

- a library name (or in the JSON form, a small integer index into the dependencies list)
- a key into that library's ref-table (or in the JSON form, an int index into the ref-table list)

It doesn't make sense to store some values in a particular library's ref-table: values like `null`, `false`, `true`. If a ref-table entry value is not a JSON object, then it's an *unowned* value, and we assume the un-persister knows how to deal with it.

```kotlin
typealias RefTableKey = Int

class RefTable {
    // Cache used to avoid allocating two keys for equivalent values.
    val valueToRefTableKey: MutableMap<Persistable, RefTableKey> = mutableMapOf()
    val refs: MutableMap<RefTableKey, ComplexPersistResult> = mutableMapOf()
}
```

As noted, a *Persisted* is either:

- a simple, unowned value, represented in JSON as a value that is not a JSON object,
- or it's complex, so it needs a ref-table key and is representable as a series of string-key/value pairs. (More on how those are derived and converted back into values later.)

And the values in a series of string-key/value pairs are themselves either simply persisted, or are references to a row in a ref-table (*Persisted.ByReference*).

```kotlin
sealed interface PersistResult

/** A form that can be easily serialized to an entry in a persisted library file */
sealed interface Persisted {
    sealed interface SimplePersisted : Persisted, PersistResult
    object NullValue : SimplePersisted
    data class BooleanValue(val b: Boolean) : SimplePersisted
    data class BytesValue(val x: WrappedByteArray) : SimplePersisted
    data class DoubleValue(val x: Double) : SimplePersisted
    data class FloatValue(val x: Float) : SimplePersisted
    data class IntValue(val x: Int) : SimplePersisted
    data class LongValue(val x: Long) : SimplePersisted
    data class StringValue(val x: String) : SimplePersisted

    data class ByReference(
        val libraryName: DashedIdentifier,
        val refTableKey: RefTableKey,
    ) : Persisted
}

data class ComplexPersistResult(
    /**
     * Optionally allows pairing an arbitrary value with a class that knows
     * how to un-persist it.
     */
    val typeTag: KClass<out Unpersister<*>>?,
    val keysAndValues: List<Pair<String, Persisted>>,
) : PersistResult
```

As can be seen in *RefTable* above, we also keep a cache from *Persistable* to an assigned integer key. A *Persistable* is something that can be hashed in a ref table for equivalence, so we can avoid generating duplicate entries.

Farther below we talk about how we generate *Persistable* implementations so these need not be hand maintained, and so that they unpersist correctly.

```kotlin
/**
 * A persister converts a value that needs to be persisted to
 * a [PersistResult], and also allows checking whether
 */
interface Persister<T> {
    // The value to persist is passed explicitly, matching keyFor below.
    fun persist(x: T, pc: PersistingContext): PersistResult

    /** Maybe box a */
    fun keyFor(x: T): Persistable
}

/**
 * Keys into a reference key table.
 */
interface Persistable {
    /**
     * Which library *owns* this for the purpose
     * of Temper library persistence
     */
    val persistProvenance: DashedIdentifier
}
```

## Unpersisting and relationship to persisting flow

An un-persister knows how to reverse persistence by a *Persister*.

```kotlin
interface Unpersister<T> {
    fun unpersist(pc: UnpersistingContext): RResult<T, MalformedPersistFileException>
}
```

For a library, we are starting with some library metadata and a list of staged modules.

For a set of co-compiled libraries, we do the following:

1. create a blank *PersistingContext*,
2. get `Persisters.getPersister<LibraryModulesAndMetadataPersister>()`, an inline method that accesses a generated class,
3. reserve key 0 for each library's persistable *LibraryModulesAndMetadata* instance,
4. pass each *LibraryModulesAndMetadata* instance to the persister from 2 to fill in the ref-tables,
5. look at the reference values to figure out the *dependencies* section of each ref-table file,
6. convert each ref-table, along with its library metadata and dependencies, to a file, and
7. write those files to disk using the file path convention above.

Unpersisting involves a similar process:

1. read in a persisted file,
2. read in more persisted files by looking at the *dependencies* list,
3. store a failure result for any library with a dependency hash mismatch or a missing dependency file,
4. let *toUnpersist* be the set of files with coherent (*transitive*) hashes,
5. create a blank *UnpersistingContext*,
6. for each entry in each *toUnpersist* file's ref-table, create a *RefTable* entry pointing to a *PendingUnpersist\<\*\>* wrapping it,
7. for each *RefTable*, access the *PendingUnpersist* at row 0 and pass it to the *Unpersister* for *LibraryModulesAndMetadata*, accessed similarly to (2) above via `Unpersisters.getUnpersister<...>()`,
8. if any application from (7) resulted in something other than *RSuccess*, report errors and exit with overall failure,
9. fold the library metadata from the persisted files into *LibraryModulesAndMetadata* and return an indicator of broken files from (3) and the successfully unpersisted libraries from (8).

## Code-generation: Making it easy to persist and un-persist Kotlin class instances

There are many Kotlin classes that need to be persisted and un-persisted, and we need the flexibility to evolve those classes without writing and re-writing persisting and un-persisting code.

We also need to be sure, at compile time, that libraries can be persisted; that there's not some rarely used class that the system doesn't know how, via reflection, to persist or un-persist.

We use annotations and code-generation to produce *Persister*s and *Unpersister*s for code types.

A gradle task, `gradle kcodegen:u`, updates Kotlin files that define maps used by `Persisters.getPersister<T>()` and `Unpersisters.getUnpersister<T>()` which relate [*KType*s](https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.reflect/-k-type/) to implementations.
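
To make the idea of a map from *KType* to implementation concrete, here is a small hand-written sketch of what such a generated registry might resemble. The `SketchPersister` interface, the `SketchPersisters` object, and the string output format are all invented for this example; the real generated code would produce *PersistResult*s and cover the compiler's own classes.

```kotlin
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// Hypothetical, simplified persister interface used only by this sketch.
fun interface SketchPersister<T> {
    fun persist(x: T): String
}

object SketchPersisters {
    // In the design a map like this would be generated; here it is written
    // by hand with two example entries.
    val byType: Map<KType, SketchPersister<*>> = mapOf(
        typeOf<Boolean>() to SketchPersister<Boolean> { "bool:$it" },
        typeOf<List<String>>() to
            SketchPersister<List<String>> { it.joinToString(",", "[", "]") },
    )

    /** Looks up the persister registered for the full generic type [T]. */
    @Suppress("UNCHECKED_CAST")
    inline fun <reified T> getPersister(): SketchPersister<T> {
        val persister = byType[typeOf<T>()]
            ?: error("No persister registered for ${typeOf<T>()}")
        // Safe as long as every entry pairs a KType with a matching persister.
        return persister as SketchPersister<T>
    }
}
```

Keying on *KType* rather than on a class lets `List<String>` and, say, `List<Int>` map to different persisters. Note that this sketch only fails at lookup time for an unregistered type, whereas the design above wants such gaps caught when the code is generated.
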

Unaided persisters require no additional annotations:

- If a KType with `@GeneratePersister` is sealed, then we generate a persister by looking at each sub-type and applying un-aided persisters as below, or, for concrete types, if the key name sets are distinct, we generate a persister by keeping a map from sets of key strings to the unpersisters for those types.
- For `object` (singleton) sub-types, we unpersist a string like `object:lang.temper.KotlinObjectClassName` to the object value.
- For `enum` sub-types, we relate a string like `enum:<EnumMemberName>` to each enum member.

Aided persisters are required for complex classes with fields.

- `val` fields in the constructor and type body and zero-argument methods may be annotated with `@PersistedField`, optionally with a field name.
- If the field name differs from the parameter name in the constructor (or factory, see below), `@UnpersistParamName("name")` may be used.
- If the field needs to be set after construction, `@UnpersistLate` may be used to exclude it from the constructor/factory parameter list, and to have the unpersister generate a field assignment.
- To disambiguate the type from others that might have the same key set, `@PersistNeedsTypeTag` allows adding a field with a type tag as in *ComplexPersistResult.typeTag*.
- To specify a factory function for use by the unpersister, specify `@UnpersistFactory`. With no argument, it applies to a static method of the unpersisted type.
- `@UnpersistFactory` can also be passed an `object` with an `operator invoke` method that can be referenced by class name in generated code and used to construct a value on unpersisting, as in `@UnpersistFactory(object : UnpersistFactory<T> { operator fun invoke(args): T { ... } })`.

Going back to the *class Value* example from above, there are some complications because we'd like to persist *Value*s of type *TString* using unowned strings; the type tag knows how to persist/unpersist the state vector.

```kotlin
// This is sealed, so we need a Persister and an Unpersister implementation
// based on the sealed type rules above.
sealed class PartialResult : Persistable {...}
sealed class Result : PartialResult() {...}

// Falls into the singleton branch of the unaided rules.
object NotYet : PartialResult() {...}

// For this type, we need to specify that both constructor fields are persisted.
// That way the generated persister produces something like
//     { "stateVector": ..., "typeTag": ... }
// and the generated unpersister uses a call like
//     Value(typeTag = ..., stateVector = ...)
data class Value<T : Any>(
    @PersistedField
    // We need to unpersist the typeTag first
    // so that the typeTag can specify the persister/unpersister for
    // stateVector.
    @UnpersistAfter("typeTag")
    @PersistUsing(object : Persister<T> by typeTag.persistorForValue)
    @UnpersistUsing(
        /* invokable object that takes type tag and fetches its value unpersister */,
        // Which parameters to pass to the unpersister getter
        "typeTag"
    )
    val stateVector: T,
    @PersistedField
    val typeTag: TypeTag<T>,
) : Result() { ... }
```

For generic Kotlin types like *Map* and *List* we need to generate persisters that wrap the element type persister on demand.

## Provenance rules

`@PersistProvenancer(object ...)` also allows us to figure out which library a persistable is part of, its *provenance*.

In many cases, the answer is simply the larger structure we're persisting, so we need multiple views of *PersistingContext* from the point of view of each *current* library.

For some *Value*s the answer is more nuanced. We need to rewrite *Interpreter* to store information with *Value*s so that we can avoid simplifying to a constant across library boundaries.

But we have some good rules of thumb:

- For *TType* values, the library of the declaring module owns it.
- For *TClass* values, the library of the module that constructed the value owns it.
- For *UserFunction* values, the library of the declaring module owns it.
- For *BuiltinFun* values, the provenance can be the current library; most will be unaided `object`s anyway.
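
The rules of thumb above can be read as a single dispatch over the kind of value. The sketch below shows that reading with invented, heavily simplified value kinds (`SketchTType`, `SketchTClass`, and so on) that carry only a library name; the real classes and the information the rewritten *Interpreter* would store are not modeled here.

```kotlin
// Hypothetical, simplified value kinds used only for this sketch.
sealed interface SketchValue
data class SketchTType(val declaringLibrary: String) : SketchValue
data class SketchTClass(val constructingLibrary: String) : SketchValue
data class SketchUserFunction(val declaringLibrary: String) : SketchValue
object SketchBuiltinFun : SketchValue

/**
 * Picks the owning library for [value] following the rules of thumb;
 * [currentLibrary] is the library whose structures are being persisted.
 */
fun provenanceOf(value: SketchValue, currentLibrary: String): String =
    when (value) {
        is SketchTType -> value.declaringLibrary
        is SketchTClass -> value.constructingLibrary
        is SketchUserFunction -> value.declaringLibrary
        is SketchBuiltinFun -> currentLibrary
    }
```

Because the hierarchy is sealed, adding a new value kind without deciding its provenance would be a compile error in the `when`.
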