
# XID: Totally not CIDv2

*pronounced "ekks-eye-dee"*

A format for self-describing content identifiers.

#### Bit of history of this format:

The inspiration for this format comes from conversations around CIDv2 and how to save bytes when attaching metadata to CIDs in general (mostly trying to avoid just gluing two CIDv1s together). The short version of XID does not aim to solve the goals of CIDv2 and is instead intended to live as an alternative to CIDv1. However, the long version is the logical extension of the short XID that puts it into the same category as CIDv2.

Note: This format is a *tad bit* opinionated.

## Goals

The goals of XID are similar to the goals of [CIDs](https://github.com/multiformats/cid#design-considerations), with a few exceptions.

### Compactness

While the short XID wastes 4 more bytes than CIDv1 in the worst case, it can save up to 11 bytes in the best case (if the varint lengths of the codec and multihash code are 8 bytes each and the multihash length is 2 bytes). This is still attempting to be a reasonable size for its purposes, as XIDs are still meant to be part of longer path identifiers or URIs. The short XID in particular should always fit on the stack.

### Fixed sized prefixes!

If I have a whole bunch of XIDs, I don't want to parse every single one in order to discard the prefix! This means no more varints, and therefore no plaintext multicodec codes.

### Less Managed Tables

Speaking of the multicodec table, it sucks to maintain! Better to just not :^)

Instead of maintaining a canonical table of all the protocols and what they mean, and fighting for the "special" single-byte codes, generate the table by hashing! It's a bit more wasteful, but every protocol uses the same number of bytes.

"It's better to standardize methodology than magic numbers"

This doesn't fully erase the need for table governance or trusted sources, as there are bound to be collisions, and a method of resolving table conflicts should still be around.

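To make the fixed-size prefix goal concrete, here is a minimal Python sketch contrasting the two ways of discarding a prefix. It assumes the short XID layout given in the XID section of this document (1-byte multicodec code, 4-byte data context, 4-byte id context) and the usual CIDv1 binary layout of four unsigned varints (version, codec, multihash code, multihash length) before the digest. The function names are made up for illustration, and the XID multicodec code byte `0x00` is a placeholder since the real code is still TODO.

```python
from typing import Tuple

# Assumed short XID layout (see the XID section of this document):
# 1-byte multicodec code + 4-byte data context + 4-byte id context.
XID_PREFIX_LEN = 1 + 4 + 4


def strip_xid_prefix(xid: bytes) -> bytes:
    """Drop the fixed 9-byte prefix with a single slice, no parsing."""
    if len(xid) < XID_PREFIX_LEN:
        raise ValueError("buffer is shorter than the XID prefix")
    return xid[XID_PREFIX_LEN:]


def read_uvarint(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one unsigned LEB128 varint; return (value, next offset)."""
    value = 0
    shift = 0
    while True:
        if offset >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, offset
        shift += 7


def strip_cidv1_prefix(cid: bytes) -> bytes:
    """Walk four varints (version, codec, hash code, hash length)."""
    offset = 0
    for _ in range(4):
        _, offset = read_uvarint(cid, offset)
    return cid[offset:]


digest = bytes(range(32))
# CIDv1 prefix: version 1, dag-cbor (0x71), sha2-256 (0x12), length 32.
cid = bytes([0x01, 0x71, 0x12, 0x20]) + digest
# Placeholder XID: hypothetical code byte 0x00 and all-zero contexts.
xid = bytes([0x00]) + bytes(4) + bytes(4) + digest

assert strip_cidv1_prefix(cid) == digest
assert strip_xid_prefix(xid) == digest
```

The CIDv1 path has to inspect every prefix byte to find where the digest starts, while the XID path is a constant-offset slice.
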
# Format

## Context Hashes

In current CIDv1 there are two main value fields: Codec and Multihash.

- Codec: how to interpret the content (data) the CID points to (e.g. DAG-CBOR, UnixFS, etc.).
- Multihash: how to verify the data in the CID itself (e.g. SHA-256:32, Blake2b-224:28, etc.).

XID keeps that design by defining two fields of hashed metadata, the **data context** and the **id context** (collectively and individually *"context"*). But instead of allocating magic number codes defined in a publicly managed table, contexts are generated by hashing the relevant information for a given context. This idea of hashing the meta info to generate context codes is the core part of XIDs.

### Hash generation

*I selected the hash function and preimage encoding format with moderately weak motivation. Objections are welcome.*

Context information is encoded as DAG-CBOR and then hashed with blake3. The first 4 bytes of the hash are used as a unique identifier for said context.

For the existing multicodec table, there should be a defined method of encoding, but it can be any DAG-CBOR as far as this spec is concerned. Field names and values are left as conventions; this spec only serves to define the method of construction.

**HOWEVER**, while arbitrary metadata may be created for either context, adding special metadata for every application is very much discouraged. 4 bytes of the hash is **hardly** enough data to verify the uniqueness of YourSpecialContext(tm) in the grand scheme of everything.

id context examples:

```
0x55e568ec = {
  name: "sha256",
  length: 32,
}
```

```
0x9121cc1f = {
  name: "blake3",
  length: 64,
}
```

data context examples:

```
0xfa3b70c4 = {
  codec: "git-raw",
}
```

```
0x5ce052d0 = {
  codec: "dag-json",
}
```

### Reasonings

#### Why Blake3?

Blake3 produces the same leading output bytes regardless of the requested output size (shorter outputs are prefixes of longer ones), meaning there is no confusion like blake2b-256 truncated to 4 bytes versus blake2b-32.
An argument could be made for using SHA instead, but if long data context ids (TODO) are used, then it's not secure to fetch from a network.

#### Implementation Optimization

I don't expect most implementations to care about the context pre-image at all during runtime; instead they would pre-calculate a table of all the context hashes they understand. It is often the case that an implementation only supports a few hashing algorithms (and codecs) and will simply fail if it doesn't recognize one anyway.

#### How future proof is this?

For as long as DAG-CBOR is considered an OK choice and blake3 isn't broken! New multicodec codes can (and should!) be allocated for new versions when the time comes.

XIDs are prefixed with a single-byte code allocated in the Multicodec table (TODO). If either blake3 or DAG-CBOR stops being a reasonable choice for generating context hashes, a new version (XIDv2) should be made to update the context hashing methodology (or whatever else needs to be changed).

## XID

The data context and id context are both the first 4 bytes of the blake3 hash.

```
order:   [<multicodec code>, <data context>, <id context>, <id bytes>]
buffers: [u8,                [u8; 4],        [u8; 4],      [u8; <up to 64>]]
```

The ID byte length should not exceed 64 bytes. This nudges users away from trying to inline data into `id bytes`. It also allows implementations to _always_ put the XID on the stack. The justification is that most digests are no greater than 512 bits.

### ID context

The ID context *may* specify a length of `id bytes` shorter than 64 bytes in a special field `length`. If not, implementations must allocate all 64 bytes.

```
{
  length: 12
}
```

`length` values larger than 64 must give an error, since this might be an encoding error when working alongside long XIDs.

## (Maybe) Long XID

Long XID follows the same encoding, hashing, and field order as the shorter version, with a few modifications.

#### Longer data context hash

Data context uses 32 bytes instead of the normal 4 bytes.
This is done in order to provide a method of verification when creating completely arbitrary and dynamic contexts. This allows the context of the data to be safely fetched from a network instead of embedded inside the program. While this provides much greater flexibility for attaching metadata to content, .

#### Larger max id length

Note: Max length must be less than 2048 bytes (in order to allow for "fun" hacking by encapsulating this in a CIDv1)

```
order:   [<multicodec code>, <block context>, <id context>, <id bytes>]
buffers: [u8,                [u8; 32],        [u8; 4],      [u8; <up to 1024>]]
```

# Wait, isn't the long version just CIDv2?

Shhhh...

Well, yes, though it's not as flexible and is 2 bytes shorter. The core thinking of this format comes from conversations about CIDv2 after all. 32 bytes lets block contexts be fetched as needed and verified. Many of the common uses for CIDv2 and proposals like [IPIP-305](https://github.com/multiformats/cid/pull/49) and [IPIP-49](https://github.com/multiformats/cid/pull/49) are focused on attaching a fetchable metadata CID next to the content CID (really just two CIDv1s glued together), and the long version allows for the same thing but excludes the codec and multihash prefix, since that is already hard-coded.

### Why the longer block context hash (aka should the long version even be a thing)?

TODO fun long discussion about why this is isomorphic to a structural way of attaching metadata.

##### Pros

- Don't need to reach into IPLD in order to modify the metadata, it's right there in the XID
- Code path separation

##### Cons

- You can't sign metadata (brain dump: you COULD put the signature of the block in the metadata, then inline the signature of the metadata inside of the XID; possible with short XID but limits the hash and signature length to 32 bytes)

#### Uses

- Auto-codec!
Simply put the CID (or XID) of the wasm you want to use to interpret the block in the block context, then fetch it at runtime (though this is a round trip to the network before you can even tell if it's an auto-codec or not)
- Tree structure encoding; sort of like a GraphQL "this is the data structure you should expect" but for tree structures
- Most of the other use cases people wanted CIDv2 for

## Encoding

Textual representations of XID (like CIDs) must be multibase prefixed and encoded, since this is a binary format.

Some examples (short XID with 32 bytes):

hex

```
f0a7bedbfaf30e15c54814d9f43146f0d3fad56825ff12cf2813d0e5156c7e7441997364437830ec760
```

base36lower

```
k78f6g7j5za36adlbb9al412jbskw4a4du8x5v9xwtarcel28rvktknn3lkl802o
```

base58flickr

```
Z3imKaYVrVdXXvUw3RH2sVKJpn397tfr4RxVZcRRg2jxtMj7cRiZzaoyQ
```

While this spec doesn't force a given encoding, it does recommend using base32lower. Hex is often too long, and all base64 variants have delimiters that prevent selecting the entirety with a double click. Base58flickr is space efficient and avoids the delimiters, but is much more expensive to decode than power-of-2 encodings.

## Block inlining

Note: only really applicable to the long XID's larger buffer

Simply set the id context hash to 0's.

IMO this might better be left out of the spec, since if you _really_ want to inline a block, using a CIDv1 might be better, but if we ever live in a world with only XIDs then this could be useful.

Note: This also opens the ~~gates of hell~~ door for doing something insanely silly by inlining a block in an XID which is also inlined into a CIDv1.

# Aside

- Although a bit silly, XIDs can be encoded directly into a CIDv1 with the identity "hash"
- I plan on using (at least the short XIDs) for things that don't involve IPLD at all, just blocks of DAG-CBOR
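
To tie the short XID layout, the pre-calculated context table idea, and the hex example from the Encoding section together, here is a rough Python sketch. It is only an illustration under assumptions: the names `ShortXid`, `parse_short_xid`, `decode_hex_multibase` and `id_length_for` are hypothetical, the table holds just the two id context examples listed under Hash generation, and the parser treats everything after the 9-byte prefix as `id bytes`.

```python
from typing import NamedTuple

CONTEXT_LEN = 4                    # truncated context hash size (short XID)
MAX_ID_LEN = 64                    # cap on id bytes for the short XID
PREFIX_LEN = 1 + 2 * CONTEXT_LEN   # code byte + data context + id context

# Pre-calculated table of understood id contexts, using only the example
# values from the Hash generation section: context hash -> (name, length).
KNOWN_ID_CONTEXTS = {
    bytes.fromhex("55e568ec"): ("sha256", 32),
    bytes.fromhex("9121cc1f"): ("blake3", 64),
}


class ShortXid(NamedTuple):
    code: int
    data_context: bytes
    id_context: bytes
    id_bytes: bytes


def decode_hex_multibase(text: str) -> bytes:
    """Decode multibase base16 (prefix 'f'); other bases are not handled."""
    if not text.startswith("f"):
        raise ValueError("only the multibase hex prefix 'f' is supported")
    return bytes.fromhex(text[1:])


def parse_short_xid(raw: bytes) -> ShortXid:
    """Split a short XID at fixed offsets; no varint parsing needed."""
    if len(raw) < PREFIX_LEN:
        raise ValueError("buffer is shorter than the XID prefix")
    id_bytes = raw[PREFIX_LEN:]
    if len(id_bytes) > MAX_ID_LEN:
        raise ValueError("id bytes exceed 64 bytes")
    return ShortXid(raw[0], raw[1:5], raw[5:9], id_bytes)


def id_length_for(id_context: bytes) -> int:
    """Look up the expected id length; fail on unrecognized contexts."""
    if id_context not in KNOWN_ID_CONTEXTS:
        raise ValueError("unknown id context: " + id_context.hex())
    return KNOWN_ID_CONTEXTS[id_context][1]


# The hex example from the Encoding section.
example = (
    "f0a7bedbfaf30e15c54814d9f43146f0d3fad56825ff12cf28"
    "13d0e5156c7e7441997364437830ec760"
)
xid = parse_short_xid(decode_hex_multibase(example))
print(hex(xid.code), xid.data_context.hex(), xid.id_context.hex(),
      len(xid.id_bytes))
```

The example's id context (`30e15c54`) is not one of the two listed examples, so `id_length_for` would reject it here, which matches the "simply fail if it doesn't recognize it" behaviour described under Implementation Optimization.
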