# Codebase v2 format v1

## Notes

1. Abilities and effects are only distinguished in the term functor, not the type blob.
2. I think we're not quite ready for "patches as unison terms", because I think updating a patch should work differently from updating a term; e.g.,
   * updating a patch shouldn't necessarily update another patch — or should it?
   * merging two branches containing a patch should merge the patches, not create a name conflict — or should it?
3. Codebase objects have an `object_type`, sometimes a `version`, and a `format`. The difference is a bit subjective, but `object_type` is meant to be a broad category, like _term_, _type_, _namespace_, _patch_, etc. Maybe _causal_ too. `version` is like the semantic version; e.g. if an old term's meaning has no representation in the latest version, then it could be stored using an older version. `format` distinguishes different representations of the same semantics; i.e. an object can be rewritten from one format to another without affecting its hash. The hash is based on a canonical version of the object, but what's stored may be different; specifically, what's stored may be smaller, like a diff.
4. It looks like sqlite will [optimize prefix searches into range searches](https://www.sqlite.org/optoverview.html#:~:text=The%20string%20literal%20must%20not%20begin%20with%20a%20wildcard;%20if%20the%20right-hand%20side%20begins%20with%20a%20wildcard%20character%20then%20this%20optimization%20is%20not%20attempted.).

## Todo

- [ ] Tentatively, we'll retain an index from hashes to unison repo URLs to download on demand if not available locally.
- [ ] Where's the index for the packed causal, or else are we not packing causals?
- [ ] Idea: Flip the packed Causal around, so that you can append to it? Unfortunately you would have to read it backwards to go back in time though.
- [ ] Do a performance test on the sqlite loose causal representation. How long does it take to do 26k row lookups? 100k? 1m? Is there any sqlite way to speed it up? Maybe it's fine, per [this](http://sqlite.1065341.n5.nabble.com/Is-it-possible-to-create-the-Stored-Procedure-SP-in-Sqlite-tp95348p95349.html):

  > With SQLite, each statement is just a procedure call. There is no network traffic, no IPC, and hence very little latency. Applications that use SQLite can be very "chatty" with the database and that is not a problem. For example, the SQLite website is backed by SQLite (duh!) and a typical page request involves 200 to 300 separate queries. That would be a performance killer with a client/server database, but with SQLite it is not a problem and the pages render in about 5 milliseconds.

## Implementation plan

1. Implement the sqlite schemas for
   - [ ] hash
   - [ ] object
   - [ ] object_type
   - [ ] old_hash
   - [ ] reference
   - [ ] causal (after determining which schema will give us faster LCAs)
   - [ ] namespace_root, maybe; taking the place of `paths/_head`

   and haskell de/serialization for
   - [ ] HashIndex
   - [ ] Term/TypeComponent; they can probably share some code, otherwise separate
   - [ ] Patch
   - [ ] NSDiffSlice
   - [ ] Namespace
2. Implement the v1 to v2 conversion and see how `oldbase` and `base` look space-wise.
3. Implement the v2 sync algorithm and see how fast or slow it is.
   * Variations
     * [ ] `HashIndex.format` uses `embedded` vs `external`
     * [ ] implement `Namespace.format = diff` and use it to test different history compression schemes.
     * [ ] shallow sync: just fetch the latest namespace and its current dependencies
     * [ ] Flesh out and implement an index from object hashes to internet mirrors, and work out how to do sparse syncs that only include the causal spine and only sync namespaces on demand
4. ... more stuff goes here ...
5. Implement codebase v2 interface
6. Reimplement a bunch of commands, simple versions at first.
7. Maybe Later: Implement packed files and their respective indices

## Assumptions

## Ideas

If we end up making the codebase interface into a daemon, it would be straightforward to imagine a remote server computing the LCA for you, when you don't have (or otherwise need) a full copy of a namespace's history locally. Or maybe, again, the `Causal`s won't be too big. (5MB of Causal for base before the reset.)

## Desiderata and summary of solution:

- "Old name" lookups should be fast, like O(1)
  * We will eliminate old name lookups in favor of working to limit currently-unnamed entities
- Dependency lookups (namespaces, definitions) should be O(_dependencies(query)_)
  * _i.e._, index them separately, compactly, so probably O(log(_n_) + _dependencies(query)_)
- Merges should be fast (essentially: identifying and materializing the LCA)
  - We will store the causal spine separately from the data; LCA lookup will still be O(_history_), but not O(_history_ \* _namespace\_size_)
  - Q: Heuristics to minimize the typical case of materializing the LCA, in light of sometimes-diff-representation? (discussed in next section)
    A: Maybe not an issue, because LCAs which exist may not be too far away, as opposed to LCAs which do not exist.
- Codebases should be small on disk (goal: comparable to a text-based git repo; will need to measure, or at least do a back-of-napkin calculation?)
  * We can make significant use of sharing / deltas, specifically for older versions of things.
  * We'll have an option to store a namespace as a delta relative to another namespace; we may or may not implement this immediately, as the two representations should be interchangeable. e.g. a namespace could be switched from a delta to a full representation or vice versa depending on access patterns (weak analogy to mpeg I-frames vs P-frames)
  * Future: We don't have a diff algorithm for ABTs yet, but we could add one later in a future-proof way.
  * Future: Some shared representation for similar sets of attached metadata
  * Causal spines are amenable to a packed, semi-fixed-width format to support using just a small number of random-access `read` calls, with `Cons` chains laid out together.
    * We can compute LCA directly in this format without loading individual nodes onto the heap: load each cons string into memory, and put them onto the BFS queue. Whenever a merge node or the end of a cons string is encountered, its parents' cons strings belong on the queue at this level.
  * Make it not-fatal for old namespaces & definitions to be missing / discarded, although they could be pinned to the codebase if needed; TBD
    * Maybe user-tagged versions + versions that UCM remembers pushing; relevant for 3-way merge calculations.
    * Maybe later some exponential drop-off schedule for retaining recent changes.
      * How much local history does IntelliJ keep?
  * Tentatively, we'll retain an index from hashes to unison repo URLs to download on demand if not available locally.
- Codebases should be small in memory
  - We can avoid ever loading most namespace files into memory.
  - Because we compute LCA using the causal spine only (and perhaps directly from disk), the full history might be able to fit fully into memory.
    - The full history spine of bloated old `base` was 265 elements long, with approximately 26k namespaces total — approximately 265kB and 26MB respectively, as compared to the 1.4GB size of the V1 causal files.
  - If reconstructing namespace values from diffs, a strict fold can avoid ever loading more than a full namespace's worth or two into memory.
  - All the types and terms of old `base` together make up 13MB on disk; probably not much needs to change here, but we can compact them further using a shared dependency index. (See [HashIndex](#hashindex) under "On-disk representations".)
- It would be nice if we could inherit the old reflog, and refer to old namespace hashes in URLs?
  * Option 1: create an indexed mapping from old hashes to new hashes during codebase upgrade?
  * Option 2: maintain the mapping only during upgrade, and rewrite the reflog :-1:
  * Option 3: delete the old reflog, break old namespace links.

## Final checks:

## Paul & Arya Questions:

- What's the path to smart syncing? Off the cuff ideas:
  - Some random access web api?
  - We think we can just use git like rsync, and with an optimized codebase representation, hopefully it's small enough for a long time.
- Estimate codebase size?

## Questions:

- [x] **Q: Should we try to convert old codebases on sync, or try to read them directly?**
  A: Let's read them directly.
- [ ] **Q: How should we represent the root branch? It could just be a directory/files like now.**
  A:
- [x] **Q: Can we get away with having no in-memory `Causal` at all? Don't we just use it for LCA?**
  A: No, we currently do need it for `syncFromDirectory`/`syncToDirectory` as well, which currently take branches containing in-memory elements that may not have been written to disk yet. I think we could find ourselves wanting to find an LCA for a branch that only exists in memory too. Maybe all that will change in the future, but it hasn't yet.
- [ ] **Q: Will sqlite be fast enough? At least for implementing certain indexes?**
  We will hide it behind a Haskell interface and try.
- [ ] **Q: Continue with "off-by-one" or no?**
  On the whole I think switching "off-by-one" to "on-by-one" makes sense:

  Simpler:
  * A definition and its supporting definitions are all part of the same namespace entity, and thus can be moved or duplicated together as a single unit. _e.g._, a type declaration, its constructors, and its related functions; a term and helper functions namespaced below it
  * A namespace definition and namespace diff definition become simpler; they no longer need `NameSegment` maps, except to represent `children`.

  Yucky:
  * A bunch of branch helpers need to be rewritten (unclear whether they will be simpler on the whole or more complicated).
  * Certain operations may no longer make sense, like `reset-root` or `push` or `fork` on a namespace that is a `Term` or a `Type` as well.
  * Forking a definition is also a little odd?
    * Maybe it makes sense if definitions are `unique`? It would copy a definition but with a new salt.
    * Side note: we could store `unique` definitions with an extra level of indirection to enable structural sharing again while still maintaining distinct identities, i.e. a `unique` definition would be represented by a pair of a GUID and the hash of the structural definition (see the sketch after this list).
  * Where does the patch go, and what scope does it represent?
    * Can you just `edit` the current `Type`/`Term`?
    * Or maybe it just works out that users naturally don't try to do weird stuff? :joy:
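
A minimal sketch of that extra level of indirection for `unique` definitions (the type names here are illustrative, not settled):

```haskell
import Data.Text (Text)

newtype GUID = GUID Text
newtype StructuralHash = StructuralHash Text -- hash of the underlying structural definition

-- A `unique` definition keeps its own identity (the GUID/salt) while pointing
-- at a structurally-shared definition by hash, so two `unique` types with the
-- same structure can still share storage for the structural component.
data DefinitionId
  = Structural StructuralHash
  | Unique GUID StructuralHash
```
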
## Intro

The big changes in v2 are:

* The values stored in a Causal are stored separately from the causal spine, and aren't loaded into memory automatically. `Causal` contains a `valueHash` that represents its value, and a `causalHash` that represents the value + parents. (We'll still fully load the head `Branch0`.)
* We can now distinguish whether a `Causal`'s parents already exist in memory (pure vs not loaded), which I think had been important for some reason that I can't recall.
* `Branch0.terms`/`types` don't store `Metadata.Type`; we can still have some `deepSomething` of the root that does though.
* `TermEdits` can be `Referent`s
* A serialized `Branch.Raw` may sometimes be represented as a diff relative to another referenced branch.
* Indexes are not stored in marker files; they will tentatively be stored in sqlite.

## In-memory representations

`Causal` contains a `valueHash` that represents its value, and a `causalHash` that represents the value + parents. It has a `value` field which may be either a pure value (constructed in-memory) or a value loader. The `parents` are the "causal hash" of each parent, sometimes associated with a pure `Causal`. We generally don't need either of these loaded. The `parents` are only needed for computing LCA, something that the codebase interface can perform more efficiently than we can out here; and any old `value` is only needed for doing a 3-way merge. When a pure value or pure parent is present, it may not have been serialized at all yet.

```haskell
newtype CausalHash h = CausalHash Hash
newtype ValueHash h = ValueHash Hash

-- Causal doesn't necessarily pre-load anything other than some hashes.
data Causal m h e = Causal
  { causalHash :: CausalHash h,
    valueHash :: ValueHash h,
    value :: m e,
    parents :: Map (RawHash (Causal m h e)) (m (Causal m h e))
  }
```

Nothing too interesting here. `MdValues` is structured a little differently; `Metadata.Type`s probably belong somewhere else, like the loop state, I'm guessing.

```haskell
-- Branch has to be Causal to cons onto it in memory.
data Branch m = Branch { history :: Causal m Branch.Raw (Branch0 m) }

data MdValues = MdValues (Set Reference)

data Branch0 m = Branch0
  { terms :: Map Referent MdValues,
    types :: Map Reference MdValues,
    termEdits :: Map Referent TermEdit,
    typeEdits :: Map Reference TypeEdit,
    -- wait, do I need to load children?
    -- will I want everything loaded for the root?
    -- will I want everything loaded for the current branch?
    children :: Map NameSegment Branch.CausalHash
  }

-- or, continuing with "off-by-one"
data Branch0 m = Branch0
  { terms :: Map NameSegment (Map Referent MdValues),
    types :: Map NameSegment (Map Reference MdValues),
    patches :: Map NameSegment (EditHash, m Patch),
    -- same questions here
    children :: Map NameSegment Branch.CausalHash
  }

-- DeepTypes / DeepTerms etc. formed separately,
-- possibly taking archived namespaces into account

type Branch.CausalHash = Causal.RawHash (Causal n Branch.Raw)
type Branch.Hash = Causal.RawHash Branch.Raw

data Codebase m v a = Codebase
  { getTerm :: Reference.Id -> m (Maybe (Term v a)),
    getTypeOfTermImpl :: Reference.Id -> m (Maybe (Type v a)),
    getTypeDeclaration :: Reference.Id -> m (Maybe (Decl v a)),
    putTerm :: Reference.Id -> Term v a -> Type v a -> m (),
    putTypeDeclaration :: Reference.Id -> Decl v a -> m (),
    getBranch :: Branch.Hash -> m (Maybe (Branch m)),
    getRootBranch :: m (Either GetRootBranchError (Branch m)),
    putRootBranch :: Branch m -> m (),
    rootBranchUpdates :: m (m (), m (Set Branch.Hash)),
    getBranchForCausal :: Branch.CausalHash -> m (Maybe (Branch m)),
    -- |Supports syncing from a current or older codebase format
    syncFromDirectory :: CodebasePath -> SyncMode -> Branch m -> m (),
    -- |Only writes the latest codebase format
    syncToDirectory :: CodebasePath -> SyncMode -> Branch m -> m (),
    -- ^ maybe return type needs to reflect failure if remote codebase has an old version
    -- |Watch expressions are part of the codebase, the `Reference.Id` is
    -- the hash of the source of the watch expression, and the `Term v a`
    -- is the evaluated result of the expression, decompiled to a term.
    watches :: UF.WatchKind -> m [Reference.Id],
    getWatch :: UF.WatchKind -> Reference.Id -> m (Maybe (Term v a)),
    putWatch :: UF.WatchKind -> Reference.Id -> Term v a -> m (),
    getReflog :: m [Reflog.Entry],
    appendReflog :: Text -> Branch m -> Branch m -> m (),
    -- |the nicely-named versions will utilize these, and add builtins to the result set
    termsHavingType_impl :: Reference -> m (Set Referent.Id),
    termsMentioningType_impl :: Reference -> m (Set Referent.Id),
    -- |number of base58 characters needed to distinguish any two hashes in the codebase;
    -- we don't have to compute it separately for different namespaces
    hashLength :: m Int,
    termReferencesByPrefix :: ShortHash -> m (Set Reference.Id),
    typeReferencesByPrefix :: ShortHash -> m (Set Reference.Id),
    termReferentsByPrefix :: ShortHash -> m (Set Referent.Id),
    branchHashesByPrefix :: ShortBranchHash -> m (Set Branch.Hash),
    -- lca :: [Causal m Branch.Raw e] -> m (Maybe Branch.Hash)
    dependentsImpl :: Reference -> m (Maybe (Set Reference.Id)),
    termDependencies :: Reference.Id -> m (Maybe (Set Reference.Id)),
    declDependencies :: Reference.Id -> m (Maybe (Set Reference.Id)),
    -- |terms, types, patches, and branches
    branchDependencies :: Branch.Hash -> m (Maybe (Branch.CausalHash, BD.Dependencies)),
    -- |the "new" terms and types mentioned in a patch
    patchDependencies :: EditHash -> m (Set Reference, Set Reference)
  }

data Causal.Raw e = Causal.Raw
  { valueHash :: Hash e
  , parents :: [Hash (Causal.Raw e)]
  }

data Branch.Raw
  = Full [Hash Term] [Hash Type] ...
  | Delta { reference :: Hash Branch.Raw, ... }

sources :: Relation Hash RemoteRepo
```
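
As a usage sketch only (the helper below is hypothetical, not part of the interface), the prefix-query fields above are meant to be enough to resolve a short hash typed by a user:

```haskell
import qualified Data.Set as Set

-- Hypothetical caller-side helper: resolve a user-supplied branch hash
-- prefix to a unique branch, if exactly one hash in the codebase matches.
resolveBranchPrefix :: Monad m => Codebase m v a -> ShortBranchHash -> m (Maybe (Branch m))
resolveBranchPrefix codebase prefix = do
  matches <- branchHashesByPrefix codebase prefix
  case Set.toList matches of
    [h] -> getBranch codebase h
    _   -> pure Nothing -- zero matches, or an ambiguous prefix
```
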
## On-disk representations

I like the `git` model of having a representation for "loose" and "packed" objects. Packed objects are faster to query and more compact, but slower to write. Loose objects are the opposite :) In the v1 codebase format, we use loose objects exclusively.

> Unfortunately, `git pull` prefers loose user objects (that's why we chose them to begin with), even though git generally uses packed objects under the hood. (It doesn't version them, like we will.) If we start sharing large packed objects with git, it won't be long before we need our own protocol — analogous to git's — to exchange the underlying objects efficiently, without the extra layer of versioning.

### sqlite

I'm thinking sqlite would store a hash index, codebase object blobs, as well as the dependents index, type index, and type mentions index.

```sqlite
-- actually stores the full 512-bit hashes
create table hash (
  id integer primary key,
  -- used to look up by prefix:
  base32 text unique not null
);
create index hash_base32 on hash(base32);

create table object (
  id integer primary key,
  hash_id integer references hash(id),
  type integer not null references object_type(id),
  bytes blob not null
);

-- we could also just hard-code this in haskell,
-- but it helps self-document the database.
create table object_type (
  id integer primary key,
  name text unique not null -- term, decl, patch, namespace
);
create index object_type_name on object_type(name);
```

- [ ] Q: What about the old hash lookup for namespaces (reflog, urls, etc)? Separate table? Or integrated somehow?

  ```sql
  create table oldhash (
    id integer primary key,
    -- omitted bytes blob; would we ever need it?
    base32 text unique not null,
    hash_id integer references hash(id)
  )
  ```

I'm introducing this table to support the indexes (immediate dependents, term/constructor types, and type mentions):

```sqlite
-- store references/referents
create table reference (
  id integer primary key,
  builtin text,             -- a builtin name
  hash_id integer,          -- or a derived id
  component_idx integer,    -- with its position in its component
  constructor_id integer,   -- and optional constructor id
  constructor_type integer, -- let's say 0 for data, 1 for effect
  foreign key (hash_id) references hash(id)
);
create index reference_hash on reference (hash_id);
```

### object types

| `object_type` | description |
| ------------- | ----------- |
| 0             | term        |
| 1             | decl        |
| 2             | patch       |
| 3             | namespace   |

#### HashIndex

The `HashIndex` is a type we'll use all over. It's an array or int-keyed dictionary (or reference to an int-keyed dictionary) of hashes. Note that the `empty`, `embedded`, and possibly the future `blob` formats are amenable to modular transfer between codebases, e.g. during syncing. The `db` (i.e. `external`) format will require an extra step, since its ids are local to a particular codebase's `hash` table.

```c
union HashIndex {
  struct empty {
    format : varint = 0;
  }
  struct embedded {
    format : varint = 1;
    values : [hash512];
  }
  // e.g. the sqlite `hash` table
  struct external {
    format : varint = 2;
    // the ids embedded in the referring object are indices into this
    // list of database ids; so two dereferences will be needed to
    // arrive at the 512-bit hash: first look up by index in this array,
    // and then in the external `hash` table.
    hash_ids : [hash.id]
  }
  // another format idea for the future: a reference to some other blob of
  // hashes, which is itself a codebase object. useful if many objects refer
  // to the same set of hashes, e.g. a large library where every definition
  // has the same metadata; authors, copyright holders, license, etc.
}
```

`term : * -> *` (and the like) takes a type argument that represents how hashes are referenced, generally `hash512` or `varint`. `term` etc. doesn't have to be a reified data type; we can just perform the mapping from index to `Hash` at load time with a `Map Int Hash` whose values go through a `PinBoard`. `decl_component` is analogous, but with `object_type = 1`.
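
For illustration only, here is a rough Haskell rendering of the three formats above, plus a hypothetical loader that produces the `Map Int Hash` just mentioned (the resolver for database ids is assumed, e.g. backed by `select base32 from hash where id = ?`):

```haskell
import Data.ByteString (ByteString)
import Data.Int (Int64)
import Data.Map (Map)
import qualified Data.Map as Map

newtype Hash = Hash ByteString -- the 512-bit hash itself
newtype HashId = HashId Int64  -- a row id in the sqlite `hash` table

-- Mirrors the union above: no hashes, hashes stored inline, or local db ids.
data HashIndex
  = Empty
  | Embedded [Hash]
  | External [HashId]

-- Hypothetical loader: turn a HashIndex into the Map Int Hash mentioned
-- above, given some way to resolve db ids. (Deduplicating hashes through a
-- PinBoard is omitted here.)
loadHashIndex :: Monad m => (HashId -> m Hash) -> HashIndex -> m (Map Int Hash)
loadHashIndex resolveId index = case index of
  Empty        -> pure Map.empty
  Embedded hs  -> pure (Map.fromList (zip [0 ..] hs))
  External ids -> Map.fromList . zip [0 ..] <$> traverse resolveId ids
```
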
#### Term/Type blobs

A hash should represent one object (e.g. in the case of `ABT`s, a whole component).

| term `format` | description     |
| ------------- | --------------- |
| 0             | structural      |
| 1             | unique/embedded |
| 2             | unique/database |

```c
struct term_component {
  object_type : varint = 0; // term
  // `version` is like which semantic version this is; which
  // typechecker admits it? Once we have more than one, we would
  // split it into a union. You would expect most terms to be of
  // the latest version, and possibly a few more that could no
  // longer be represented in the new format.
  term_version : varint = 1;
  union {
    struct structural {
      format : varint = 0;
      hash_index : struct HashIndex;
      body : [term varint]; // elements form the component
    };
    struct unique_embedded {
      format : varint = 1;
      guid : text;
      // hash of the structural term we are newtyping
      structure : hash512;
    };
    struct unique_database {
      format : varint = 2;
      guid : text;
      // db index of the hash of the component we are newtyping
      structure : varint;
    };
  };
};
```

#### Patch blobs

```c++
struct patch {
  object_type : varint = 2;
  format : varint = 1; // we'll certainly have more
  term_edits : Map Referent TermEdit;
  type_edits : Map Reference TypeEdit;
}
```

#### Namespace blobs

A namespace can be represented by a `Namespace` or an `NSDiff`. I believe `git` uses full objects for the current generation, and packs older generations into diffs.

> I could imagine doing something like using full objects whenever they'd be not much bigger than their supporting chain of diffs. I don't know what the right heuristic should be, but we should be able to switch between representations at runtime with minimal user impact.

```c
struct namespace {
  object_type : varint = 3;
  namespace_version : varint = 1;
  union {
    struct full {
      format : byte = 0;
      hash_index : HashIndex; // varints index into hash_index
      term : Map (Referent varint) HashIndex;
      type : Map (Reference varint) MdValues;
      patch : Maybe hash512;
      children : Map NameSegment hash512(causal namespace);
    }
    struct diff {
      format : byte = 1;
      // which namespace this is a diff of:
      reference : hash512(namespace);
      add : NSDiffSlice;
      remove : NSDiffSlice;
    }
  }
};

struct NSDiffSlice {
  terms : Map Referent MdValues;
  types : Map Reference MdValues;
  termEdits : Map Reference TermEdit;
  typeEdits : Map Reference TypeEdit;
  children : Map NameSegment hash512(causal branch);
}
```

* A diff can be reversed by simply swapping add/remove fields! I think?
* Two diffs can be appended associatively.
* [ ] Todo: Although append isn't commutative in general, prove or disprove that diffs generated in an n-way merge can be applied in any order with the same result.

## Packed Causal

Here's a proposed layout for a big file containing all the causal spine nodes, optimized for lookups.

### Packed Causal data

```c++
union causal {
  struct one {
    type : byte = 0
    value_hash : word512
  }
  struct cons {
    type : byte = 1
    // how many Cons immediately follow this one;
    // they'll have fixed length (131 bytes as of now)
    // they'll have descending values for this field, to zero.
    chain_length : word16
    value_hash : word512
    parent_hash : word512
  }
  struct merge {
    type : byte = 'm'
    value_hash : word512
    parent_hashes : [word512] // 2 or more
  }
}
```
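
As a concrete reading of that layout, here is a sketch of decoding a record with `Data.Binary.Get`; it assumes big-endian `word16`s and, for `merge` nodes, a `word16` parent count, neither of which is pinned down above:

```haskell
import Control.Monad (replicateM)
import Data.Binary.Get (Get, getByteString, getWord16be, getWord8)
import Data.ByteString (ByteString)
import Data.Word (Word16)

newtype Hash512 = Hash512 ByteString -- 64 raw bytes

data PackedCausal
  = One { valueHash :: Hash512 }
  | Cons
      { chainLength :: Word16 -- how many fixed-width Cons records follow this one
      , valueHash :: Hash512
      , parentHash :: Hash512
      }
  | Merge { valueHash :: Hash512, parentHashes :: [Hash512] }

getHash512 :: Get Hash512
getHash512 = Hash512 <$> getByteString 64

getPackedCausal :: Get PackedCausal
getPackedCausal = do
  tag <- getWord8
  case tag of
    0 -> One <$> getHash512 -- 65 bytes
    1 -> Cons <$> getWord16be <*> getHash512 <*> getHash512 -- 131 bytes, matching the note above
    0x6d -> do
      -- 'm': a merge node; the parent-count prefix here is an assumption
      value <- getHash512
      n <- getWord16be
      Merge value <$> replicateM (fromIntegral n) getHash512
    _ -> fail "unknown packed causal tag"
```
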
### Packed Causal index

We need some sort of index to help us find records within the packed file. It could be a trie or hash table on disk, or maybe sorted fixed-width records, which you could do binary search on, or in sqlite:

```sqlite
create table causal_file (
  id integer primary key, -- related to the filename
  filename text unique not null
);
create index causal_file_filename on causal_file(filename);

create table causal_index (
  id integer primary key,
  causal_file_id integer references causal_file(id),
  causal_hash_id integer references hash(id),
  file_offset integer not null,
  -- any causal should only appear once in a file
  unique (causal_file_id, causal_hash_id, file_offset)
);
```

### Loose causal

```c
struct Causal {
  self_hash : word512 // or associated somehow
  value_hash : word512
  parent_hashes : [word512]
}
```

As these will be quite small again, they could still be packed into files and appended to existing ones. It could be: all hashes that start with the same prefix (prefix length and encoding tbd) go into one file; a file could be split as needed at a certain record count threshold.

> Note: At a guesstimated 1.2 parents per `Causal`, this would be about 205 bytes per entry, about 1/20th of the typical 4k minimum file size. How many entries could you have before the file was so big that linear search through it is too slow?
>
> Before the reset, base had ~26k causal+namespace entries, which would be ~5MB instead of 1.4G. If they're distributed uniformly, ... something something

Or these could go into sqlite until they are packed. Undecided on whether the hashes should be entered into the `hash` table, although I don't see why not at the moment.

```sqlite
create table causal (
  self_hash blob not null,
  value_hash blob not null,
  parent_hashes blob not null
);
create unique index causal_self_hash_index on causal (self_hash);
```

Oh, the reason might be because `parent_hashes` is a blob containing an array of hashes, which is weird; or we could repeat the relation and hope for the best. And most nodes won't have multiple parents, so it's really nbd:

```sqlite
create table causal (
  id integer primary key,
  causal_hash integer references hash(id),
  value_hash integer references hash(id),
  parent_hash integer references hash(id)
);
create unique index causal_hash_index on causal(causal_hash);
```

I'm assuming that having to do iterative queries will be unnecessarily slow, and that's why we need the packed format, but maybe it isn't too slow!

https://www.sqlite.org/c3ref/blob_reopen.html - relevant for reading blobby parent hashes

## Performing a repo format upgrade at runtime:

1. On startup, look in the .unison directory for `v<N>` directories.
2. If none exists, or more than one exists, or one exists with a version greater than the current version, explain and quit.
3. If `v<N>` = current version, then done.
4. Else (`v<N>` is an older version), call the `v<N>To<Current>` upgrade procedure, copying `v<N>` elements to `v<Current>`.
5. Delete `v<N>`.
   * After a failed lookup of a hash of unknown version, check the old→new hash index for a match and retry.
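
A rough sketch of steps 1-4 (the directory scan and names here are assumptions, not something the plan above specifies):

```haskell
import Data.List (stripPrefix)
import Data.Maybe (mapMaybe)
import System.Directory (listDirectory)
import Text.Read (readMaybe)

currentVersion :: Int
currentVersion = 2

data StartupAction
  = UseCurrent            -- step 3: already on the current version
  | Upgrade Int           -- step 4: run v<N>To<Current>, then delete v<N> (step 5)
  | ExplainAndQuit String -- step 2: zero, several, or too-new versions found

-- Look for `v<N>` entries under the .unison directory and decide what to do.
chooseStartupAction :: FilePath -> IO StartupAction
chooseStartupAction dotUnison = do
  entries <- listDirectory dotUnison
  let versions = mapMaybe (\e -> stripPrefix "v" e >>= readMaybe) entries :: [Int]
  pure $ case versions of
    [v]
      | v == currentVersion -> UseCurrent
      | v < currentVersion -> Upgrade v
      | otherwise -> ExplainAndQuit "codebase was written by a newer ucm"
    [] -> ExplainAndQuit "no v<N> directory found"
    _ -> ExplainAndQuit "multiple v<N> directories found"
```
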