
Note @vmx: Thinking about it again, the whole proposal isn't really different from just using two CIDs, one for the context, one for the content. There was another idea floating around about generalizing the idea of multicodec code + bytes, I'll try to find some time to write that down as well.

IPIP 0000: Application Context

Summary

This proposal adds another layer on top of CIDs to describe application-specific context. This can range from semantic information about the data to auxiliary data that is needed for traversal. It can be considered a fat pointer.

Motivation

The Application Context tries to solve several problems.

  1. Lurk: This is about additional information needed to make things traversable. You can have some content-addressed data that could be interpreted in different ways. How to interpret it must be encoded somewhere. One option would be to put it into the CID directly somehow, but that's beyond the scope of CIDs. This use case spawned the CIDv2 - Tagged Pointers IPIP.
  2. Software Heritage: This is about the semantics of the data. The data is a Git object, but the desire is to attach the semantics that it originates from a specific software system. Again, putting it into a CID somehow would be a workaround, and it's out of scope for CIDs.
  3. UnixFS pathing: This is about traversing a DAG in a special way. UnixFS pathing is built into IPFS. You give the gateway a CID and a path and it will return you a file. That path can be seen as application specific (in this case IPFS) context.
  4. IPLDVM: This is about attaching executable code, which will be applied to the given data.
  5. IPLD Schemas: TODO vmx 2023-01-16 write about, with Filecoin as example

Detailed design

The Application Context is arbitrary structured data that describes some application-specific information. It could range from a simple identifier that attaches some semantic meaning to a full WASM module including its interface definitions. It is expected that common schemas for those contexts will arise.

The Application Context is then encoded as:

    <application-context-code><hash><cid>
  • application-context-code is a Multicodec code.
  • hash is the SHA2-256 hash of the structured data, encoded as DAG-CBOR.
  • cid is a CIDv1 that points to the content.
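
As a rough sketch of the byte layout above, the following Python snippet concatenates the three parts. It makes assumptions the proposal does not state: that the code is written as an unsigned varint like other Multicodec codes, that the hash is the raw 32-byte SHA2-256 digest (not wrapped in a Multihash), and that the CID is in its binary form. `APPLICATION_CONTEXT_CODE` is a hypothetical placeholder taken from the Multicodec private-use range; no code has been assigned. The context is passed in already encoded as DAG-CBOR.

```python
import hashlib

# Hypothetical placeholder: the proposal does not assign a Multicodec code.
APPLICATION_CONTEXT_CODE = 0x300000


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint (LEB128)."""
    if value < 0:
        raise ValueError("varints are unsigned")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_application_context(context_dag_cbor: bytes, cid: bytes) -> bytes:
    """Build <application-context-code><hash><cid> (assumed layout)."""
    # Assumption: the hash is the bare SHA2-256 digest of the DAG-CBOR bytes.
    context_hash = hashlib.sha256(context_dag_cbor).digest()
    return encode_varint(APPLICATION_CONTEXT_CODE) + context_hash + cid


# Example inputs: DAG-CBOR for the map {"a": 1}, and a CIDv1 (raw codec,
# SHA2-256 multihash) for the bytes b"hello".
context_dag_cbor = bytes.fromhex("a1616101")
cid = bytes([0x01, 0x55, 0x12, 0x20]) + hashlib.sha256(b"hello").digest()
encoded = encode_application_context(context_dag_cbor, cid)
```

With a fixed-size digest, a reader can split the result again without extra length prefixes: read the varint, take the next 32 bytes as the hash, and treat the rest as the CID.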

There are two slightly different cases termed dynamic and static invocation.

Dynamic invocation

Dynamic invocation is when the data is interpreted in different ways at run time. The UnixFS pathing use case from above is a good example. You have fixed data stored, but you want to be able to traverse it in different ways. Static invocation, by contrast, fits e.g. the IPLDVM use case from above, where you want to store the executable code directly with the data.

A gateway implementation of the dynamic use case would introduce a new endpoint called context. The endpoint name already makes it clear that we are requesting an application context, hence we don't need to supply the Application Context code. It takes two parameters that are represented like a path. The first one is a Multibase-encoded version of the SHA-256 hash of the context, the second one is a Multibase-encoded CIDv1. This means the endpoint would be /context/{multibase-encoded-sha-256-hash}/{multibase-encoded-cidv1}.

The reason to keep the context and the CID separate, as opposed to Multibase-encoding them together, is to make caching easier. If the application context is a commonly used one, then the gateway can easily cache the context and use the Multibase-encoded hash as an identifier. There could then be several requests with different data (CIDs).

It is represented as a path as the context has a single argument, which is the CID.
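
The path construction described in this section can be sketched in Python as follows. The choice of Multibase base32 (prefix `b`, lowercase, unpadded) is an assumption, as the proposal does not name a specific base, and the function names are hypothetical.

```python
import base64
import hashlib


def multibase_base32(data: bytes) -> str:
    """Multibase base32: 'b' prefix, lowercase RFC 4648 alphabet, no padding."""
    return "b" + base64.b32encode(data).decode("ascii").lower().rstrip("=")


def context_path(context_dag_cbor: bytes, cid: bytes) -> str:
    """Build /context/{hash}/{cid} for a DAG-CBOR context and a binary CIDv1."""
    context_hash = hashlib.sha256(context_dag_cbor).digest()
    return f"/context/{multibase_base32(context_hash)}/{multibase_base32(cid)}"


# Example inputs: DAG-CBOR for the map {"a": 1}, and a CIDv1 (raw codec,
# SHA2-256 multihash) for the bytes b"hello".
cid = bytes([0x01, 0x55, 0x12, 0x20]) + hashlib.sha256(b"hello").digest()
path = context_path(bytes.fromhex("a1616101"), cid)
```

Because the first path segment depends only on the context, a gateway could use it as a cache key across requests that differ only in the CID.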

Static invocation

Static invocation is for storing the context directly in the DAG. Again, it should not be part of a link itself, but rather provide additional information. This could be things like decryption keys. The application can decide how to store it within the IPLD Data Model. Likely it would be a two-element tuple, where the first element is the hash of the structured data describing the application context. The second element is then the CID.
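
One possible shape for such a tuple, sketched in Python as a hand-encoded DAG-CBOR list. This is only an illustration under assumptions: the proposal leaves the storage format to the application, the hash is taken to be the raw SHA2-256 digest stored as a byte string, and the helper names are hypothetical. The link encoding (CBOR tag 42 around a byte string of `0x00` followed by the binary CID) follows the DAG-CBOR codec.

```python
import hashlib


def cbor_bytes(data: bytes) -> bytes:
    """Encode a CBOR byte string (major type 2) for lengths below 256."""
    n = len(data)
    if n < 24:
        return bytes([0x40 | n]) + data
    if n < 256:
        return bytes([0x58, n]) + data
    raise ValueError("sketch only handles byte strings shorter than 256 bytes")


def cbor_link(cid: bytes) -> bytes:
    """Encode a CID as a DAG-CBOR link: tag 42 around 0x00 + binary CID."""
    return bytes([0xD8, 0x2A]) + cbor_bytes(b"\x00" + cid)


def static_context_tuple(context_dag_cbor: bytes, cid: bytes) -> bytes:
    """Encode [hash-of-context, link-to-content] as a two-element CBOR array."""
    context_hash = hashlib.sha256(context_dag_cbor).digest()
    return bytes([0x82]) + cbor_bytes(context_hash) + cbor_link(cid)


# Example inputs: DAG-CBOR for the map {"a": 1}, and a CIDv1 (raw codec,
# SHA2-256 multihash) for the bytes b"hello".
cid = bytes([0x01, 0x55, 0x12, 0x20]) + hashlib.sha256(b"hello").digest()
node = static_context_tuple(bytes.fromhex("a1616101"), cid)
```

Storing the hash as plain bytes rather than as a link keeps it out of the set of links a codec would extract, which matches the intent that the context is not part of a link itself.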

Design rationale

In the Multiformats/IPLD world, three layers can be identified that serve different purposes:

  • Block verification: On the lowest layer are the Multihashes. With those you can verify that some bytes actually match a certain hash.
  • Structured data/link extraction: The codec of the CID makes extraction of links possible, hence enables traversal of a DAG. In some cases it also enables processing of the structured data (IPLD) it contains.
  • Computation/semantics: That layer currently only exists implicitly, e.g. in UnixFS. You request a CID together with some path. The path is the context. This layer can also be used to supply semantics to the data.
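
To make the lowest of these layers concrete, here is a minimal Python sketch of block verification against a Multihash. It only handles SHA2-256 (Multihash code `0x12`, digest length `0x20`), and the function name is hypothetical.

```python
import hashlib


def verify_sha2_256_multihash(multihash: bytes, block: bytes) -> bool:
    """Check a block against a SHA2-256 multihash (0x12, 0x20, digest)."""
    if len(multihash) != 34 or multihash[0] != 0x12 or multihash[1] != 0x20:
        raise ValueError("sketch only supports SHA2-256 multihashes")
    return multihash[2:] == hashlib.sha256(block).digest()


# Example: a multihash built for b"hello" verifies only that exact block.
multihash = bytes([0x12, 0x20]) + hashlib.sha256(b"hello").digest()
matches = verify_sha2_256_multihash(multihash, b"hello")
mismatch = verify_sha2_256_multihash(multihash, b"goodbye")
```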

The missing piece is the computation/semantics layer, which this proposal tries to fill.

Introducing a new concept can be costly. The decision to not extend current CIDs (creating a CIDv2) was made due to several reasons:

  • CIDs are already established in the InterPlanetary ecosystem; introducing a new breaking version would be as costly as introducing a new concept.
  • CIDs are already good at what they do: they provide a mechanism to get links out of bytes. Overloading this concept would make the stack even harder to learn/use.
  • CIDs are a core part of the IPLD Data Model; changes to what CIDs are or how they work would directly impact that data model. It would even bleed into the IPLD Codecs. Existing implementations would need to support it, although it's totally viable to build systems with the current version of CID.
  • Building a new concept on top of CIDs makes it optional. Your use case might not even need all that. You might be happy with today's CIDs/IPLD.
  • It makes the concept of CIDs easier/more concrete, as existing functionality like UnixFS pathing could be interpreted as an Application Context. Requesting data via a CID would always mean that a single block is returned. If an application wanted to return more than one block at a time, it would need an Application Context for that.

User benefit

It will enable new use cases for the IPLD world. On the Multicodec repository, there are several issues where people wanted to bend CIDs to their needs. This proposal makes many of those cases possible.

Compatibility

It's an optional new feature that gateways are free to implement. It doesn't have any compatibility issues.

Security

The application context itself is just content-addressed data; the security implications depend on the individual context. Not all gateways will implement all contexts.

Alternatives

CIDv2

The CIDv2 - Tagged Pointers IPIP is similar, but extends the CID itself.

Fitting it into CIDv1

One could do tricks with inline CIDs and put some data there. Inline CIDs are highly problematic as they lead to many special cases. Ideally they are not used at all, hence using them for new cases is not a good idea.

Copyright and related rights waived via CC0.

Thank you

Most of those ideas were discussed at the LabWeek event in Lisbon 2022. I'd like to thank everyone who has contributed to all this, especially:

  • @johnchandlerburnham for the fruitful discussions on LurkDay, fleshing out the Multihash ideas
  • @porcuquine for spending hours and hours explaining to me how Lurk works and what the needs are
  • @mriise for discussing this with me for hours, bringing in an IPVM perspective and the implicit CIDs idea
  • @stebalien for his thoughts on chainable multicodec combinations
  • @dignifiedquire for coming up with the identifier proposal
  • @rvagg for talking through this and finding overlap with his ideas
  • @ribasushi for listening and playing devil's advocate
  • @expede for getting the IPVM perspective
  • @gozala for thinking about how gateways would work
  • @RangerMauve for another IPLD perspective