---
tags: IPLD, Multiformats
---
CIDv2
=====
The idea is to provide context to some data.
Proposals
---------
### IPIP proposal
One proposal was started as IPIP [TODO vmx 2022-10-31: link]. This was spawned by needs from [Lurk], but other people also asked for more context [TODO vmx 2022-10-31: look through multicodec and find those].
Use cases:
- Lurk: the exact same data might be traversed in different ways, depending on the context
- Software heritage: [NOTE vmx 2022-10-31: this might be the routing information case, but I cannot recall]
- Routing information:
### Hashing properties proposal
#### Multihash
Related to the context problem are hash functions that hash differently depending on certain parameters. Examples are [Blake2], [Skein] or [Poseidon] [TODO vmx 2022-10-31: expand with examples].
As there can be a very large number of possible parameters, it's infeasible to put them all into the multicodec table. A possible solution is to define those parameters as structured data and hash them. That hash can then be used as an identifier for those parameters.
Example:
[TODO vmx 2022-10-31: write example]
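
A minimal sketch of this idea, assuming BLAKE2b with a personalization string as the parameterized hash function. The parameter names, the use of canonical JSON as the deterministic encoding, and the choice of SHA2-256 for the identifier are all assumptions made for illustration, not part of the proposal.

```python
import hashlib
import json


def params_id(params: dict) -> bytes:
    """Hash a set of hash-function parameters into a fixed-size identifier."""
    # Assumption: canonical JSON stands in for a deterministic encoding.
    encoded = json.dumps(params, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).digest()


def hash_with(params: dict, data: bytes) -> bytes:
    """Hash `data` with BLAKE2b configured from the given parameters."""
    hasher = hashlib.blake2b(
        data,
        digest_size=params["digest_size"],
        person=params["person"].encode("utf-8"),  # at most 16 bytes for BLAKE2b
    )
    return hasher.digest()


# Two hypothetical parameter sets that differ only in the personalization.
params_a = {"function": "blake2b", "digest_size": 32, "person": "app-a"}
params_b = {"function": "blake2b", "digest_size": 32, "person": "app-b"}

id_a = params_id(params_a)
id_b = params_id(params_b)

# The same bytes hash differently under each parameter set, so the
# identifier is what tells a verifier which configuration to use.
digest_a = hash_with(params_a, b"some block")
digest_b = hash_with(params_b, b"some block")
```

Each identifier is 32 bytes regardless of how many parameters the function takes, which is what keeps the parameters out of the multicodec table.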
#### Relation to CID
The same idea from the multihashes can be expanded to the codec of the CID as well.
Example:
[TODO vmx 2022-10-31: write example]
With this idea a CID would look like:
`<cid-version><codec-properties-hash><multihash-properties-hash><content-hash>`
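
The layout can be sketched as a simple concatenation. This sketch assumes a single-byte version and fixed 32-byte SHA2-256 digests for all three hashes; a real encoding would more likely use varints and self-describing lengths, and the content hash would be computed as the multihash properties dictate. The function names are hypothetical.

```python
import hashlib

HASH_LEN = 32  # assumption: every hash field is a 32-byte SHA2-256 digest


def build_cid(version: int, codec_props: bytes, multihash_props: bytes, content: bytes) -> bytes:
    """Concatenate version, codec properties hash, multihash properties hash and content hash."""
    return (
        bytes([version])
        + hashlib.sha256(codec_props).digest()
        + hashlib.sha256(multihash_props).digest()
        + hashlib.sha256(content).digest()
    )


def parse_cid(cid: bytes):
    """Split a CID built by build_cid back into its four fields."""
    version = cid[0]
    fields = [cid[1 + i * HASH_LEN: 1 + (i + 1) * HASH_LEN] for i in range(3)]
    return version, fields[0], fields[1], fields[2]


cid = build_cid(2, b"codec properties", b"multihash properties", b"block bytes")
version, codec_id, multihash_id, content_hash = parse_cid(cid)
```

Under these assumptions the CID is 97 bytes long, compared with 36 bytes for a binary CIDv1 with a SHA2-256 multihash.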
##### Implicit CIDs
This proposal would lead to larger CIDs than we currently have. One possible optimization is to introduce the concept of implicit CIDs.
DAGs mostly consist of blocks encoded with the same codec. There are of course cases, like UnixFS, where DAG-PB blocks point to raw blocks. But still, all internal nodes have the same codec. This opens up possible optimizations that reduce repetition.
When a block is encoded, its own codec is known, and so are the codecs of the blocks it links to. This means that an implicit CID could be used to state that a linked CID has the same codec as the current node. This would especially help the hashing properties proposal.
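
One way to picture an implicit CID is a link whose codec field is left empty when it matches the codec of the block that contains it. The `Link` type and the helper names below are hypothetical; only the codec codes (`0x70` for DAG-PB, `0x55` for raw) come from the multicodec table.

```python
from typing import NamedTuple, Optional

DAG_PB = 0x70
RAW = 0x55


class Link(NamedTuple):
    codec: Optional[int]  # None means "same codec as the containing block"
    digest: bytes


def make_link(parent_codec: int, target_codec: int, digest: bytes) -> Link:
    """Drop the codec from the link when it repeats the parent's codec."""
    implicit = target_codec == parent_codec
    return Link(None if implicit else target_codec, digest)


def resolve_codec(parent_codec: int, link: Link) -> int:
    """Recover the full codec of a link during traversal."""
    return parent_codec if link.codec is None else link.codec


# A DAG-PB directory node linking to another DAG-PB node and to a raw leaf.
to_dir = make_link(DAG_PB, DAG_PB, b"\x00" * 32)
to_leaf = make_link(DAG_PB, RAW, b"\x01" * 32)
```

Only the link to the raw leaf has to carry codec information; the link between the two DAG-PB nodes inherits it from its parent.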
### New identifier proposal
So far the outlined proposals would lead to larger CIDs. This proposal takes the ideas from the [hashing properties proposal], but combines those into a single structure. That structure describes the codec as well as the hash and results in a single identifier.
Example:
[TODO vmx 2022-10-31: write example]
To further reduce the size, a non-cryptographic hash function can be used at a smaller bit-size, e.g. 4 bytes. So the total size of the identifier would be around 5 bytes plus the content hash itself.
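
The size estimate can be checked with a small sketch. It assumes CRC-32 as the non-cryptographic 4-byte hash, canonical JSON as the encoding of the combined structure, a single-byte version, and SHA2-256 for the content hash; none of these choices are fixed by the proposal, and the field names are hypothetical.

```python
import hashlib
import json
import zlib


def properties_id(properties: dict) -> bytes:
    """Reduce the combined codec and hash description to a 4-byte identifier."""
    encoded = json.dumps(properties, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return zlib.crc32(encoded).to_bytes(4, "big")


def make_identifier(version: int, properties: dict, content: bytes) -> bytes:
    """Build <version><properties-id><content-hash>."""
    content_hash = hashlib.sha256(content).digest()
    return bytes([version]) + properties_id(properties) + content_hash


# Hypothetical structure describing both the codec and the hash function.
properties = {"codec": "dag-cbor", "hash": {"function": "sha2-256", "length": 32}}

identifier = make_identifier(2, properties, b"block bytes")
prefix, content_hash = identifier[:5], identifier[5:]
```

With these assumptions the prefix is 5 bytes and the whole identifier is 37 bytes. A 4-byte non-cryptographic hash offers no collision resistance against an adversary, so this only works if collisions between property sets are acceptable or handled elsewhere.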
Desired properties
------------------
**The hash of a block can be verified**
- IPIP proposal: As today
- Hashing properties proposal: The hash of the properties of the multihash is used as an identifier. You'll use that identifier to know how to hash the block.
- New identifier proposal: The hash of the properties is used as an identifier. This identifier isn't unique to the hash function, as the properties also depend on other, unrelated properties. Hence there can be many identifiers that relate to the same hash function. The properties would need to be read in and the hash information would be extracted. With heavy caching this shouldn't be a problem.
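
The lookup and caching described for the new identifier proposal might look like the sketch below. The in-memory `PROPERTIES` table, the property field names, and the use of SHA2-256 over canonical JSON for the identifier are assumptions for illustration.

```python
import hashlib
import json
from functools import lru_cache

PROPERTIES = {}  # identifier -> properties; stands in for wherever properties are stored


def register(properties: dict) -> bytes:
    """Store a property set and return its identifier."""
    encoded = json.dumps(properties, sort_keys=True, separators=(",", ":")).encode("utf-8")
    ident = hashlib.sha256(encoded).digest()
    PROPERTIES[ident] = properties
    return ident


@lru_cache(maxsize=1024)
def hash_name_for(ident: bytes) -> str:
    """Read the properties once and extract only the hash information."""
    return PROPERTIES[ident]["hash"]


def verify_block(ident: bytes, expected_digest: bytes, block: bytes) -> bool:
    """Verify a block using the hash function named by the identifier."""
    return hashlib.new(hash_name_for(ident), block).digest() == expected_digest


# Two identifiers that differ in the codec but share the same hash function.
id_cbor = register({"codec": "dag-cbor", "hash": "sha256"})
id_json = register({"codec": "dag-json", "hash": "sha256"})

block = b"block bytes"
digest = hashlib.sha256(block).digest()
```

Both identifiers verify the same block against the same digest, which illustrates why an identifier is not unique to a hash function.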
**Links can be extracted from a block**
- IPIP proposal: As today
- Hashing properties proposal: The hash of the properties of the codec is used as an identifier. You'll use that identifier to know how to decode the block.
- New identifier proposal: The hash of the properties is used as an identifier. This identifier isn't unique to the codec, as the properties also depend on other, unrelated properties. Hence there can be many identifiers that relate to the same codec. The properties would need to be read in and the codec information would be extracted. With heavy caching this shouldn't be a problem.
### Application context proposal
We've identified 3 layers.
- Block verification: On the lowest layer are the multihashes. With those you can verify that some bytes actually match a certain hash.
- Link extraction: The codec of the CID makes extraction of links possible, hence traversal of a DAG.
- Computation: That layer currently only exists implicitly, e.g. in UnixFS. You request a CID together with some path. The path is the context.
This proposal tries to generalize this concept of computation.
#### Concrete proposal
On the conceptual level, the application context proposal is similar to the IPIP proposal. We have two parts, the CID as well as some context. The context points to some structured data. The difference is that the result is not a new version of CID.
The application context has the form of:
`<application-context-code><hash><cid>`
- `application-context-code` is a multicodec code.
- `hash` is the SHA2-256 hash of the structured data, which is encoded as DAG-CBOR.
- `cid` is a CIDv1 that points to the content.
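
A sketch of how these three parts could be assembled and split again. The value of `APP_CONTEXT_CODE` is a placeholder chosen for illustration and is not taken from the multicodec table, the `hash` field is assumed to be a bare 32-byte digest, and canonical JSON is used in place of DAG-CBOR only because the Python standard library has no DAG-CBOR encoder.

```python
import hashlib
import json

APP_CONTEXT_CODE = 0x300000  # hypothetical placeholder, not a registered code


def encode_varint(n: int) -> bytes:
    """Unsigned varint (LEB128) as used for multicodec codes."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, offset: int = 0):
    """Return (value, offset just past the varint)."""
    value, shift = 0, 0
    while True:
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, offset
        shift += 7


def cidv1_raw_sha256(content: bytes) -> bytes:
    """Binary CIDv1: version 1, raw codec (0x55), sha2-256 multihash (0x12, 32 bytes)."""
    return bytes([0x01, 0x55, 0x12, 0x20]) + hashlib.sha256(content).digest()


def make_application_context(context: dict, cid: bytes) -> bytes:
    # Assumption: canonical JSON stands in for DAG-CBOR here.
    encoded = json.dumps(context, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return encode_varint(APP_CONTEXT_CODE) + hashlib.sha256(encoded).digest() + cid


def parse_application_context(buf: bytes):
    code, offset = decode_varint(buf)
    return code, buf[offset:offset + 32], buf[offset + 32:]


cid = cidv1_raw_sha256(b"content")
pointer = make_application_context({"interpret-as": "example"}, cid)
code, context_hash, parsed_cid = parse_application_context(pointer)
```

The trailing CID is an unmodified CIDv1, so existing tooling can still fetch and verify the content once the prefix is removed.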
This solves two different problems we've identified: the dynamic and the static invocation. The dynamic invocation is used e.g. by Lurk, where you want to be able to interpret the same data in different ways, based on the context at run-time.
The static invocation happens when the information is stored within an IPLD DAG.
TODO vmx 2022-11-01: Ask [@expede] to fill in the details about the static invocation.
:wave: [@expede] :writing_hand::point_down:
### Static Invocation
It is often useful to represent contextual information about a CID directly in a DAG (e.g. decryption keys, Wasm to run as a codec over data, etc). There is nothing special about this layout: it's all graph data! Also that old chestnut: code is data!
In the following diagram, we represent the same information as in the fat pointer as a regular graph. This is lifted from early designs for both IPVM and Autocodec, as well as "decryption pointers" in WNFS. An application can take this and run the Wasm over the data to convert it to bytes:
![](https://ipfs.runfission.com/ipns/expede-strange-loop.files.fission.name/p/tmp.png)
You can think about this like an AST: there's a function and its argument, waiting to be applied at calltime. The top CID ("`Invocation`" above) is the CID of the same info as the fat pointer.
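
The AST view can be sketched with a toy block store. Everything here is a stand-in: hex SHA2-256 digests of canonical JSON play the role of CIDs, Python callables in `FUNCTIONS` play the role of Wasm modules, and the node field names `function`, `argument` and `name` are hypothetical.

```python
import hashlib
import json

STORE = {}  # toy block store: key -> node

# Assumption: plain Python callables stand in for Wasm code.
FUNCTIONS = {"upper": str.upper, "reverse": lambda s: s[::-1]}


def put(node) -> str:
    """Store a node and return its content-derived key."""
    encoded = json.dumps(node, sort_keys=True, separators=(",", ":")).encode("utf-8")
    key = hashlib.sha256(encoded).hexdigest()
    STORE[key] = node
    return key


def run(key: str, memo: dict) -> str:
    """Apply an invocation at call time, memoizing results by invocation key."""
    if key in memo:
        return memo[key]
    node = STORE[key]
    if isinstance(node, dict) and "function" in node:
        fn = FUNCTIONS[STORE[node["function"]]["name"]]
        result = fn(run(node["argument"], memo))
    else:
        result = node  # plain data
    memo[key] = result
    return result


data = put("hello")
upper_fn = put({"name": "upper"})
reverse_fn = put({"name": "reverse"})

# An invocation is just graph data: a link to a function and a link to its argument.
inner = put({"function": upper_fn, "argument": data})
outer = put({"function": reverse_fn, "argument": inner})  # nested application

memo = {}
result = run(outer, memo)  # "OLLEH"
```

Because an invocation is addressed by its own key, a result computed once can be looked up by anyone holding the same invocation, and replacing only the top node recontextualizes the same underlying data.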
Using the same layout, you can provide decryption keys in a cryptree:
![](https://ipfs.runfission.com/ipns/expede-strange-loop.files.fission.name/p/Screenshot%202022-11-07%20at%2023.11.51.png)
In my (@expede's) conversation with @vmx, it sounded like a fat pointer shouldn't be kept in links in a DAG. I think that this is a really clear delineation between the two (compatible) approaches! A fat pointer is not a CID (because you can't/shouldn't put it in a DAG), but _can_ be passed around easily e.g. in URIs, and is "dynamic" in the sense that it's lightweight & one-off. The "static" DAG version is great for passing around this information in a repeatable way, including with nested application, continuations, static traversal, distributed memoization by invocation CID, and so on.
![](https://ipfs.runfission.com/ipns/expede-strange-loop.files.fission.name/p/Screenshot%202022-11-07%20at%2022.48.14.png)
Static invocation also gives you the ability to recontextualize data by transforming this AST. There are two simple examples: one is that the higher invocation will generally have control, and will decide to yield to the inner invocation -- but doesn't have to! In fact it can rewrite the DAG (like a macro) before processing the subgraph. The other example is recontextualizing simply by replacing the top invocation.
![](https://ipfs.runfission.com/ipns/expede-strange-loop.files.fission.name/p/Screenshot%202022-11-07%20at%2022.55.44.png)
Thank you
---------
Most of those ideas were discussed at the LabWeek event in Lisbon 2022. I'd like to thank everyone who has contributed to all this, especially:
- [@johnchandlerburnham] for the fruitful discussions on LurkDay, fleshing out the Multihash ideas
- [@porcuquine] for spending hours and hours explaining to me how Lurk works and what the needs are
- [@mriise] for discussing this with me for hours, bringing in an IPVM perspective and the implicit CIDs idea
- [@stebalien] for his thoughts on chainable multicodec combinations
- [@dignifiedquire] for coming up with the identifier proposal
- [@rvagg] for talking through this and finding overlap with his ideas
- [@ribasushi] for listening and playing devil's advocate
- [@expede] for getting the IPVM perspective
[Lurk]:
[Blake2]:
[Skein]:
[Poseidon]:
[@johnchandlerburnham]: https://github.com/johnchandlerburnham
[@porcuquine]: https://github.com/porcuquine
[@mriise]: https://github.com/mriise
[@stebalien]: https://github.com/stebalien
[@dignifiedquire]: https://github.com/dignifiedquire
[@rvagg]: https://github.com/rvagg
[@ribasushi]: https://github.com/ribasushi
[@expede]: https://github.com/expede
[hashing properties proposal]