# Content Claims Protocol
UCAN based protocol allowing actors to share information about specific content (identified by CID).
> We base the protocol on top of [UCAN invocation specification 0.2](https://github.com/ucan-wg/invocation/blob/v0.2/README.md#23-ipld-schema)
## NextGen IPFS Content Discovery
Content discovery in IPFS today is “peer based,” meaning that an IPFS client first performs a “content discovery” step and then a “peer discovery” step to read data in the following way:
1. Client finds peer IDs that have announced they have a CID.
2. Client finds transport protocols for the given peer ID that it can use to read IPFS data from that peer.
3. Client retrieves data from that peer.
![](https://hackmd.io/_uploads/r1pmgLuVh.png)
File reads in this system require that two entities, the client and the
peer serving the data, both support an IPFS-aware transport protocol.
The peer serving a given CID is required to serve data at a specific
endpoint and keep that endpoint operational for the data to be available.
When a publisher wants to hire a remote storage device, this protocol design presents a challenge: these remote storage devices don’t speak native IPFS protocols, so IPFS clients can’t read content from them directly. We end up having to place a node between these remote storage devices and the client, proxying the data and incurring unnecessary egress.
![](https://i.imgur.com/Jan0Opg.png)
Content Claims change this.
Publishers post **verifiable claims** about IPFS data they publish in remote storage devices into Content Discovery networks and services.
From these claims, IPFS clients can request data directly from remote storage devices over standard transports like HTTP.
![](https://i.imgur.com/4iutSwA.png)
With content claims, any actor can provide data into the network without making the data available over an IPFS aware transport protocol. Rather than "serving" the data, content providers post verifiable claims about the data and its location.
Clients can use these claims to read directly, over any existing transport (mostly HTTP), and in the act of reading the data will verify the related claims.
This removes a **substantial** source of cost from providing content into the network. Content can be published into the network "at rest" on any permanent or temporary storage device that supports reading over HTTP.
## Claim Types
The requirements for content claims break down into a few isolated components, each representing a specific claim. These claims are assembled together to represent the proof information necessary for retrieving content.
All claim types map a **single** CID to "claim information."
While you can derive block indexes from these claims (see: Inclusion Claims), each individual claim is indexed by a single **significant** CID (file root, dir root, etc.) referenced by the `content` field.
The examples below show a unixfs file encoded into CAR files, but the protocol itself makes heavy use of CIDs in order to support a variety of future protocols and other use cases.
Since this protocol builds upon the UCAN Invocation Specification, all of these claims are contained within a message from an *issuer* to a *destination*. This means it can be used to send specific actors unique route information apart from public content discovery networks, and can also be used to send messages to public content discovery networks by simply addressing them to a DID representing the discovery network.
### Location Claims
* Claims that a CID is available at a URL.
* Block Level Interface,
* GET Body MUST match multihash digest in CID.
* Using CAR CID, or any future BlockSet() identifier, allows this to be a multi-block interface.
```javascript
{
"op": "assert/location",
"rsc": "https://web3.storage",
"input": {
"content" : CID /* CAR CID */,
"location": "https://r2.cf/bag...car",
"range" : [ start, end ] /* Optional: Byte Range in URL
}
}
```
From this, we can derive a verifiable Block interface over HTTP or any other URL based address. This could even be used with `BitSwap` using `bitswap://`.
Filecoin Storage Providers (running boost) would be addressable via the [lowest cost read method available (HTTP GET Range in Piece)](https://boost.filecoin.io/http-retrieval#retrieving-a-full-piece).
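As an illustration, a client could verify such a claim by fetching the claimed location and checking the bytes against the digest in the `content` CID (as required above). This is a minimal sketch, assuming sha2-256 CIDs, the js-multiformats library, a single location URL, and that the optional `range` maps directly onto an HTTP `Range` header:

```javascript
import { CID } from 'multiformats/cid'
import { sha256 } from 'multiformats/hashes/sha2'

const sameBytes = (a, b) => a.length === b.length && a.every((v, i) => v === b[i])

async function verifyLocationClaim (claim) {
  const { content, location, range } = claim.input
  const headers = range ? { Range: `bytes=${range[0]}-${range[1]}` } : {}
  const res = await fetch(location, { headers })
  if (!res.ok) throw new Error(`fetch failed: ${res.status}`)

  // The body (or the claimed range of it) must hash to the digest in the content CID.
  const bytes = new Uint8Array(await res.arrayBuffer())
  const digest = await sha256.digest(bytes)
  const cid = typeof content === 'string' ? CID.parse(content) : CID.asCID(content)
  return sameBytes(digest.digest, cid.multihash.digest)
}
```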
And you can provide multiple locations for the same content using a list.
```javascript
{
"op": "assert/location",
"rsc": "https://web3.storage",
"input": {
"content": CID /* CAR CID */,
"location": [ "https://r2.cf/bag...car", "s3://bucket/bag...car" ],
"range": [ start, end ] /* Optional: Byte Range in URL
}
}
```
### Equivalency Claims
We also have cases in which the same data is referred to by another CID and/or multihash. Equivalency claims represent this association as a verifiable claim.
```javascript
{
"op": "assert/equals",
"rsc": "https://web3.storage",
"input": {
"content": CID /* CAR CID */,
"equals": CID /* Commp CID */
}
}
```
We should expect content discovery services to index these claims by the `content` CID, since that field is standard across all claims. But since this is a cryptographic equivalency, systems that are aware of equivalency claims are encouraged to index both CIDs.
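A minimal sketch of what that double indexing might look like in a claim-aware service; the in-memory `Map` is only a stand-in for whatever index a real content discovery service uses:

```javascript
// CID string -> claims that mention it
const claimIndex = new Map()

function indexEquivalencyClaim (claim) {
  const { content, equals } = claim.input
  // Index the claim under both the content CID and its claimed equivalent.
  for (const cid of [content, equals]) {
    const key = cid.toString()
    const existing = claimIndex.get(key) ?? []
    existing.push(claim)
    claimIndex.set(key, existing)
  }
}
```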
### Inclusion Claims
* Claims that a CID includes the contents claimed in another CID.
* Multi-Block Level Interface,
* One CID is included in another CID.
* When the included CID is a CARv2 index CID, that CID also provides a block-level index of the referenced CAR CID's contents.
* Using CAR CIDs and CARv2 indexes, we get a multi-block interface.
* Combined with HTTP location information for the CAR CID, this means we can read individual block sections using HTTP Ranges.
```javascript
{
"op": "assert/inclusion",
"rsc": "https://web3.storage",
"input": {
"content": CID /* CAR CID */,
"includes": CID /* CARv2 Index CID */,
"proof": CID /* Optional: zero-knowledge proof */
}
}
```
This can also be used to provide verifiable claims for sub-deal inclusions in Filecoin.
```javascript
{
"op": "assert/inclusion",
"rsc": "https://web3.storage",
"input": {
"content": CID /* PieceCID (CommP) */,
"includes": CID /* Sub-Deal CID (CommP) */,
"proof": CID /* Sub-Deal Inclusion Proof */
}
}
```
### Partition Claims
* Claims that a CID’s graph can be read from the blocks found in parts,
* `content` (Root CID)
* `blocks` (List of ordered CIDs)
* `parts` (List of archives [CAR CIDs] containing the blocks)
```javascript
{
"op": "assert/partition",
"rsc": "https://web3.storage",
"input": {
"content": CID /* Content Root CID */,
"blocks": CID, /* CIDs CID */
"parts": [
CID /* CAR CID */,
CID /* CAR CID */,
...
]
}
}
```
# Consuming Claims
An IPFS client wishing to perform a verifiable `read()` of IPFS data can construct one from verifiable claims. Given the amount of cryptography and protocol expertise necessary to perform these operations, a few examples are detailed below.
## IPFS File Publishing
An IPFS File is a continuous set of bytes (source_file) that has been encoded into a merkle-dag. The resulting tree is referenced by CID using the `dag-pb` codec.
When files are encoded and published into the network, they are often packed into CAR files. One CAR file might contain **many** IPFS Files, and one large file could be encoded into **many** CAR Files. CAR files can be referenced by CID (digest of the CAR) and are typically exchanged transactionally (block level interface). This can get confusing, as transaction (block level) interfaces now ***contain*** multi-block interfaces.
### Large File
Large file uploads can be problematic when performed as a single pass/fail transaction, so it has become routine to break large upload transactions into smaller chunks. The encoder for web3.storage (w3up) has a configurable shard size that defaults to ~100MB, so as the file is encoded into the unixfs block structure it is streamed into CAR encodings of the configured size. Each of these is uploaded transactionally, and if the operation is aborted and started again it effectively resumes, as long as the file and settings haven't changed.
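For illustration only, a rough sketch of that sharding behaviour: encoded blocks are grouped into shards of roughly the configured size, and each shard becomes one CAR upload. The ~100MB target mirrors the text; the CAR encoding itself (e.g. via `@ipld/car`) and the upload are omitted.

```javascript
const SHARD_SIZE = 100 * 1024 * 1024 // ~100MB default, as described above

// Group encoded blocks ({ cid, bytes }) into shards; each yielded shard
// would then be written to its own CAR and uploaded transactionally.
function * shardBlocks (blocks, shardSize = SHARD_SIZE) {
  let shard = []
  let size = 0
  for (const block of blocks) {
    if (shard.length > 0 && size + block.bytes.length > shardSize) {
      yield shard
      shard = []
      size = 0
    }
    shard.push(block)
    size += block.bytes.length
  }
  if (shard.length > 0) yield shard
}
```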
When the encode is complete the root CID of the unixfs merkle tree will be available as a `dag-pb` CID. This means we need a claim structure that can cryptographically route from this `dag-pb` CID to the data ***inside*** the resulting CAR files.
Content Claims allow us to expose the cryptography in the IPFS encoding to clients, such that clients can read **directly from the encoded CAR** rather than requiring an intermediary to *assemble* the unixfs merkle tree for them.
So, after we stream encode a bunch of CARs and upload them, we encode and publish (see the sketch after this list):
* A **Partition Claim** for the ***Content Root CID***, which includes:
* An ordered list of every CID we encoded.
* A list of CAR CIDs where those CIDs can be found.
* **Inclusion Claims** for every ***CAR CID***, which includes:
* A CARv2 index of the CAR file.
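A hedged sketch of assembling those claims for an upload that produced a content root, an ordered block list, and a set of CAR shards. The claim shapes follow the examples above, while the input names (`rootCid`, `blockCidsCid`, `carShards`) are assumptions, and signing/sending the claims as UCAN invocations is omitted here:

```javascript
function buildUploadClaims ({ rootCid, blockCidsCid, carShards, service = 'https://web3.storage' }) {
  // One partition claim for the content root.
  const partition = {
    op: 'assert/partition',
    rsc: service,
    input: {
      content: rootCid,       // dag-pb content root CID
      blocks: blockCidsCid,   // CID of the ordered block CID list
      parts: carShards.map(shard => shard.carCid)
    }
  }

  // One inclusion claim per CAR shard, pointing at its CARv2 index.
  const inclusions = carShards.map(shard => ({
    op: 'assert/inclusion',
    rsc: service,
    input: { content: shard.carCid, includes: shard.indexCid }
  }))

  return [partition, ...inclusions]
}
```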
Now, if we publish these CAR files to Filecoin, we're going to want to capture another address (CommP) for each CAR file. We then include an
* **Equivalency Claim** for every ***CAR CID***, which claims that CAR CID is equivalent to the CommP for that CAR file.
Once that CAR data is aggregated into Filecoin deals, the list of CommP addresses included in each deal can be used to compute a sub-inclusion proof. At this time you may also publish:
* **Inclusion Claims** for every ***Sub-Deal CID*** (CommP), claiming each Sub-Deal CID is "included" in its deal's Piece CID (also a CommP), which include:
* The referenced sub-deal inclusion proof.
From this point forward, you can look up Filecoin Storage Providers on chain 😁
The preferred (lowest cost) method of reading data from Storage Providers is through an HTTP interface that requests data by Piece CID. You could describe these locations, perhaps playing the role of chain oracle, as:
* **Location Claims** for every ***CAR CID*** in every Storage Provider, which includes:
* The offsets in the Piece addressable by HTTP Range Header.
And of course, you can also use **Location Claims** for any other HTTP accessible storage system you ever decide to put a CAR into.
### *`read(DagPBCID, offset, length)`*
Now that we have a better idea of what sorts of claims are being published, let's construct a read operation from the claims.
It's out of scope for this specification to define the exact means by which you **discover** claims, but content discovery systems are expected to index these claims by `content` CID (in theory, you could index every block in the CARv2 indexes, but we should assume many actors don't want to pay for that, so everything in this specification works while only requiring that the `content` CID in each claim be indexed). However, once a publisher sends these claims to public networks, they should presume all exposed addresses in the claims are "discoverable" from a security and privacy perspective.
Once you have the claims for a given `dag-pb` CID, like the claim examples above, you can:
* Build a Set() of CAR CIDs claimed to be holding relevant blocks and
* Build a Map() of inclusions indexed by CAR CID and
* depending on your CARv2 library you may want to parse the CARv2 indexes into something you can read quickly from, and
* Build a Map() of locations indexed by CAR CID,
* and you may even be able to build those concurrently.
Take the list of blocks from the partition claim and find the smallest number of CAR CIDs containing a complete Set.
Using the CARv2 indexes, you can now determine the byte offsets within every CAR for every block. You should also take the time to improve read performance by closing the distance between many of those offsets so that you have fewer individual reads for contiguous sections of data; this also reduces the burden your client puts on data providers, since the same data is fetched in fewer requests.
You're now free to request this data from all CARs concurrently in a single round-trip, wherever they are.
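Putting those steps together, here is a sketch of the planning stage, under these assumptions: `blockCids` is the ordered block list referenced by the partition claim's `blocks` link (already fetched), `parts` is the partition claim's list of CAR CIDs, and `indexes` maps each CAR CID (as a string) to a parsed CARv2 index exposing `get(blockCid) -> { offset, length }`. The CAR selection is a greedy approximation of "smallest number of CARs"; none of the names here are a fixed API.

```javascript
function planRead (blockCids, parts, indexes, coalesceGap = 1024) {
  const needed = new Set(blockCids.map(String))
  const plan = new Map() // CAR CID (string) -> [{ offset, length }]

  // Pick CARs until every block in the partition claim is covered
  // (a greedy pass, not a guaranteed minimum).
  for (const carCid of parts.map(String)) {
    const index = indexes.get(carCid)
    if (!index) continue
    const reads = []
    for (const cid of [...needed]) {
      const entry = index.get(cid)
      if (entry) {
        reads.push({ offset: entry.offset, length: entry.length })
        needed.delete(cid)
      }
    }
    if (reads.length > 0) plan.set(carCid, reads)
    if (needed.size === 0) break
  }
  if (needed.size > 0) throw new Error(`${needed.size} blocks have no claimed CAR`)

  // Coalesce nearby reads within each CAR so contiguous sections of data
  // are fetched in fewer requests.
  for (const [carCid, reads] of plan) {
    reads.sort((a, b) => a.offset - b.offset)
    const merged = []
    for (const r of reads) {
      const last = merged[merged.length - 1]
      if (last && r.offset - (last.offset + last.length) <= coalesceGap) {
        last.length = Math.max(last.length, r.offset + r.length - last.offset)
      } else {
        merged.push({ offset: r.offset, length: r.length })
      }
    }
    plan.set(carCid, merged)
  }
  return plan // pair each CAR's ranges with its location claim for ranged reads
}
```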
#### From Remote HTTP Endpoint
You can use HTTP Range headers to request the required block sections over HTTP. Keep in mind that location claims for CAR CIDs may carry a byte range, which must be added to the offsets you compute *within* the CAR.
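A small sketch of one such ranged read, assuming a location claim with a single URL and a coalesced `{ offset, length }` read planned as above:

```javascript
async function fetchSection (locationClaim, read) {
  // If the claim carries a range, the CAR starts at range[0] within the
  // resource, so offsets within the CAR must be shifted by that amount.
  const base = locationClaim.input.range ? locationClaim.input.range[0] : 0
  const start = base + read.offset
  const end = start + read.length - 1 // HTTP ranges are inclusive
  const res = await fetch(locationClaim.input.location, {
    headers: { Range: `bytes=${start}-${end}` }
  })
  if (!res.ok) throw new Error(`range request failed: ${res.status}`)
  return new Uint8Array(await res.arrayBuffer())
}
```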
#### From Filecoin Sub-Inclusion Proofs
Pretty simple really:
* The **Equivalency Claims** tell you the sub-deal CID.
* The **Inclusion Claims** for those sub-deal CIDs give you the Piece CIDs that Storage Providers commit to on chain.
#### From BitSwap 🤯
One amazing thing about these protocols is that by simply using a BitSwap URL, we have BitSwap peers as "locations" for CAR CIDs in the same **Location Claim** protocol.
One could build a system/protocol for "loading" CAR files into BitSwap peers and then publish Location Claims.
Even over BitSwap, this would represent a performance improvement over the current client protocols as you avoid round-trips traversing the graph.
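For illustration, such a claim could look like any other location claim; the exact `bitswap://` URL form for addressing a peer is not specified here and is shown only as a placeholder:

```javascript
{
  "op": "assert/location",
  "rsc": "https://web3.storage",
  "input": {
    "content": CID /* CAR CID */,
    "location": "bitswap://<peer address>" /* placeholder URL form */
  }
}
```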
## IPLD Schema
```ipldsch
# For UCAN IPLD Schema see
# https://github.com/ucan-wg/ucan-ipld
type URL string
type URLs [URL]
type CIDs [&Any]
type Assertion union {
| AssertLocation "assert/location"
| AssertInclusion "assert/inclusion"
| AssertPartition "assert/partition"
} representation inline {
discriminantKey "op"
}
type AssertLocation struct {
on URL
input ContentLocation
}
type AssertInclusion struct {
on URL
input ContentInclusion
}
type AssertPartition struct {
on URL
input ContentPartition
}
type Range struct {
start Int
end Int
} representation tuple
type Locations union {
| URL string
| URLs list
} representation kinded
type ContentLocation struct {
content &Any
location Locations
range optional Range
}
type ContentInclusion struct {
content &Any
includes &Any
proof optional &Any
}
type ContentPartition struct {
# Content that is partitioned
content &Any
# Links to the CARs in which content is contained
parts [&ContentArchive]
# Block addresses in read order
blocks &CIDs
}
```
# Encodingless IPFS
Here we will describe some general cryptography that arrives at an "Encodingless IPFS" as well as a specific use of such cryptography in the form of UCAN based content claims.
By "encodingless" we mean that IPFS file trees are never actually encoded and written to disc or network. This removal of the file tree as a primary point of representation allows us to treat all virtual memory space, and space within that space, as essentially open ended and appearing in the form of numerous encodings and transports including all existing encodings and transports.
Perhaps the most obvious feature of this cryptography is that by not being described as a new encoding it works with all existing data on the internet in whatever system that data is already in as it already appears. This conquers many "second system" challenges inherent in IPFS today.
## Generic Inclusions
To the extent that any new memory is written by way of cryptographic process, it is merely the packing together of hashes of the same algorithm and length. We don't assume or require persistence of the packed entity, as it is often only necessary to build it in order to arrive at a local proof.
When referencing these proofs, the address is assumed to use the same hash algorithm and length as the hashes appearing in it, and is referred to by a RAW CID when encoded as a CID.
The reason we stick with RAW is that the `multihash` already contains the ONLY variable we have consumed in this representation. Having only one possible representation (per hash function) means we don't take on the indeterminism of the CID's codec field unnecessarily (as we have no useful codec information to signal).
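A sketch of building such a proof with js-multiformats, assuming sha2-256 (32-byte digests) as the shared hash algorithm:

```javascript
import { CID } from 'multiformats/cid'
import * as raw from 'multiformats/codecs/raw'
import { sha256 } from 'multiformats/hashes/sha2'

async function packProof (chunks) {
  // Hash each chunk with the same algorithm used for the content address.
  const digests = []
  for (const chunk of chunks) {
    digests.push((await sha256.digest(chunk)).digest) // 32 bytes each
  }

  // Pack the digests together; persisting this buffer is optional, it only
  // needs to exist long enough to arrive at the proof address.
  const packed = new Uint8Array(digests.length * 32)
  digests.forEach((digest, i) => packed.set(digest, i * 32))

  // Address the packed proof with the RAW codec, as described above.
  const proofDigest = await sha256.digest(packed)
  return { packed, cid: CID.createV1(raw.code, proofDigest) }
}
```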
## Multihash Inclusion Claims
First, we need a simple inclusion claim.
```javascript
{
"op": "assert/multihash-inclusions",
"rsc": "https://web3.storage",
"input": {
"content": RawCID /* Hash of Content */,
"includes": RawCID /* Hash of Proof */
}
}
```
This claims the included proof occurs *within* the content.
In the case of a raw file, the "content" hash would be the regular hash of that file, and "includes" would be the hash of a proof that is a bunch of hashes that are claimed to *be included* in the content.
If consumers of the claim already have information they can use to arrive at these inclusions, this is all you need. But most consumers will benefit greatly from such information being encoded into the claim, so the following optional parameters exist.
### Byte Length and Range
The following options define information that allows clients to easily find inclusions in byte representations.
```javascript
{
"op": "assert/multihash-inclusions",
"rsc": "https://web3.storage",
"input": {
"content": RawCID /* Hash of Content */,
"includes": RawCID /* Hash of Proof */,
"lengths": [ 1024, 1024, 5 ]
}
}
```
```javascript
{
"op": "assert/multihash-inclusions",
"rsc": "https://web3.storage",
"input": {
"content": RawCID /* Hash of Content */,
"includes": RawCID /* Hash of Proof */,
"ranges": [ [0, 3], [17, 99], [4, 7] ]
}
}
```
Note that, because the proofs and the information used to arrive at those proofs are separate, both the utility of the inclusion proofs and the flexibility to encode and transmit proof-assembly information are expanded, without any loss of determinism, because the encoding of numbers is left to other encoding layers that are presumed to already have a consensus regarding number encodings.
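A sketch of verifying a lengths-based claim like the one above, reusing the hypothetical `packProof` helper from the earlier sketch and assuming the whole content is available locally:

```javascript
import { CID } from 'multiformats/cid'

async function verifyInclusionClaim (claim, contentBytes) {
  const { includes, lengths } = claim.input

  // Slice the content at the claimed lengths.
  const chunks = []
  let offset = 0
  for (const length of lengths) {
    chunks.push(contentBytes.subarray(offset, offset + length))
    offset += length
  }
  if (offset !== contentBytes.length) return false // lengths must cover the content

  // Rebuild the proof locally and compare against the claimed proof CID.
  const { cid } = await packProof(chunks)
  const claimed = typeof includes === 'string' ? CID.parse(includes) : CID.asCID(includes)
  return cid.equals(claimed)
}
```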
## Diffing and Updating Files
Consider a use case in which a game has two downloadable assets: Asset-A and Asset-B.
Asset-B is a variant of Asset-A.
Asset-A and Asset-B may be updated in the future and these updates may or may not coincide.
Here we provide the means for the theoretically smallest possible data throughput necessary to distribute these assets, assuming the publisher, or *any consumer of the data in a position to write their own client*, is willing to publish a claim using their preferred [content defined chunking algorithm](https://joshleeb.com/posts/content-defined-chunking.html). Clients are presumed to hold a consensus about chunking outside this specification, but claim authors are encouraged to use the open address space in UCAN claims to describe these chunkers as best they can in-protocol.
We will also presume that files are being distributed in their original form. This may mean being distributed along with the original game, or independently from an HTTP server. If the asset location appears in a claim then it would look like other location claims.
```javascript
{
"op": "assert/location",
"rsc": "https://web3.storage",
"input": {
"content": RawCID /* Hash of Content */,
"location": 'http://assert.server/Asset-A.fbx'
}
}
```
### Diffing
In the case that a user has already downloaded Asset-A, they can apply the chunker locally and arrive at a local proof.
If they acquire the proof for Asset-B and a location claim for Asset-B, then they should be able to read only the sectors of the file that differ between Asset-A and Asset-B.
The computation of the diff never has to be done ahead of time so long as there is consensus about the chunker settings.
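A rough sketch of that decision, assuming the client has chunked its local Asset-A with the agreed chunker and has fetched Asset-B's proof digests along with the `lengths` from its inclusion claim (digests shown as hex strings for simplicity):

```javascript
function rangesToFetch (localDigests, assetBDigests, assetBLengths) {
  const have = new Set(localDigests) // hex digests of chunks we already hold locally
  const ranges = []
  let offset = 0
  for (let i = 0; i < assetBDigests.length; i++) {
    const length = assetBLengths[i]
    if (!have.has(assetBDigests[i])) {
      ranges.push([offset, offset + length - 1]) // inclusive byte range within Asset-B
    }
    offset += length
  }
  return ranges // feed these to ranged reads against Asset-B's location claim
}
```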
### Updating
Since updates to local files are computed the same way as diffs, there is nothing else to do.
In fact, given the likelihood that many assets share sectors, it's best to gain as much knowledge through local proofs as possible whenever updating or looking to acquire new files.
All of this can be done without any new network or consensus protocol, with data wherever it already exists in whatever protocol of whatever generation of the internet it happens to be in (FTP is back! #probablynot).
## Equivalencies
The above examples have been cases in which the local client still arrives at a complete state, which puts them in a position to verify the entire claim.
A hazard exists in content addressed networks if claims are trusted in scenarios in which they are never completely verified.
Rather than a game asset, let's use the example of a video.
```javascript
{
"op": "assert/multihash-inclusions",
"rsc": "https://web3.storage",
"input": {
"content": RawCID /* Hash of Content */,
"includes": RawCID /* Hash of Proof */,
"lengths": [ 38, 1024, 1024, ... ]
}
}
{
"op": "assert/location",
"rsc": "https://web3.storage",
"input": {
"content": RawCID /* Hash of Content */,
"location": 'http://assert.server/video.mkv'
}
}
```
Now, in a content addressed network, if I were to play this video I'd want to ask for the first couple of blocks' worth of data in the inclusion proof first. As when I read data in the diffing and updating use cases, I can read the early ranges of data and verify each block against the proof.
However, if I start playing this video to a user, I will have compromised the security of the entire system of cryptography everyone built to make this safe 😅
Until we have the entire video, we don't know for certain that any of these blocks are actually included in the "content." It's still a "claim" and while it is verifiable it has no means of **partial** or **incremental** verification because the only thing that connects the content hash and the proof information is an as-yet-to-be-verified claim.
We have, at this point, the cryptography to create hash addressed virtual memory ("content") and can even go as far as to understand that address as a larger address space with verifiable parts as long as we verify the entirety of the inclusion before we consider it authenticated.
For a video, the distance between the sub-inclusion authentication (part) and the parent (file) is not something a user can wait on before trust is established.
Numerous systems exist to solve this with a merkle tree encoding, IPFS being an obvious example. Since the root hash is derived from the leaves, readers have a means of *incremental verification* that begins at the first block.
We will solve those same problems here **not** by defining a *new* fully deterministic tree encoding but by simply posting verifiable claims to deterministically generatable encodings. Any new encoding will, by definition, be indeterministic, since its own choice of encoding is a branch of determinism from the single conceptual root of the un-encoded state.
Posting a cryptographic equivalency between the hash of a file and an identifier for an encoded tree representation of that file has the result of offering **all** the benefit of that tree *without creating a persistent secondary encoding of the original data*. Even better, since such trees are derived from an entirely deterministic representation of the proof, they do not have the same indeterminism problems IPFS alone has. In fact, this should give content authors the ability to post equivalencies between variants of such representations.
And this isn't limited to IPFS. Torrent files can also be generated in much the same way, as can any future merkle encoding of files.
Folks have noticed that blake3 hashes have some interesting characteristics that can substitute for many of the aforementioned IPFS tree characteristics. Such characteristics will appear here next to **any** content hashed under **any** algorithm the moment an equivalency claim is posted.
```javascript
{
"op": "assert/equals",
"rsc": "https://web3.storage",
"input": {
"content": RawCID /* Hash of Content */,
"equals": Blake3CID
}
}
```
### Deterministic UnixFS File
Since there are so many potential tree encodings, we will concern ourselves with only one in this specification: a simple, flat, deterministic unixfs encoding.
Rather than consider this an exercise in either unixfs tree shaping or protobuf encoding, let's just arrange bytes around their equivalent proof information to arrive at the smallest byte representation available in valid unixfs as a single block. Since the tree is not expected to be persisted or transported, there's no basis from which to form a block size or appropriate tree shape, so "smallest bytes" seems the only relevant factor to consider.
```javascript
{
"op": "assert/equals",
"rsc": "https://web3.storage",
"input": {
"content": RawCID /* Hash of Content */,
"equals": DagPbCID /* IPFS File CID */,
"deterministic": true
}
}
```
TODO: maybe gozala can author this encoding :)