# rdf inclusion proofs
'ok but rdf is esoteric': yeah it is. 1) some might already have data in that format and be eager to try it out (e.g. https://openlibrary.org) and, more importantly, 2) you can [represent any bytes in rdf](https://www.w3.org/TR/Content-in-RDF10/#bytesProperty) or [any http in rdf](https://www.w3.org/TR/HTTP-in-RDF10/), meaning any http-based protocol, which is a lot, including much of web archives and any [libp2p-http](https://github.com/libp2p/go-libp2p-http) or maybe even [coap][] protocol traffic. you can also describe some of [ethereum in rdf](https://ethon.consensys.io/entities-tree-classes.html) and [ERC-4824: Common Interfaces for DAOs](https://eips.ethereum.org/EIPS/eip-4824#why-json-ld).
[coap]: https://en.wikipedia.org/wiki/Constrained_Application_Protocol
## overview
let D be a [dataset][]
[dataset]: https://www.w3.org/TR/rdf11-concepts/#dfn-rdf-dataset
let C(D) be [rdfcanon1][](D), i.e. the canonicalization of the dataset. the output of C(D) is n-quads, which is a newline-delimited sequence of 4-tuples.
let Q(C(D)) be the sequence of quads obtained by splitting the n-quads representation of C(D) on newlines. the elements of Q(C(D)) are 4-tuples of terms that correspond to the edges in a graph, like [source, link, target, context].
note each 4-tuple is a sequence, so you can get a unique comm(D) using something like [@mikeal/pkv][]. another comm(D) method, described below, reuses only rdfcanon1.
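a minimal sketch of extracting Q(C(D)) from canonical n-quads text. the naive whitespace split here is illustrative only (real n-quads literals can contain spaces and escapes, so a proper parser is needed in practice):

```python
# sketch: split canonical n-quads text into a sequence of 4-tuples.
# NOTE: the naive split below is illustrative only; real n-quads terms
# (literals containing spaces or escapes) need a proper parser.
def quads(nquads_text: str) -> list[tuple[str, str, str, str]]:
    out = []
    for line in nquads_text.strip().splitlines():
        # drop the trailing ' .' terminator, then split into the four terms
        s, p, o, g = line.rsplit(" .", 1)[0].split(" ", 3)
        out.append((s, p, o, g))
    return out
```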
### example id(D)
there are many ways of 'summarizing' D into an identifier that is like a cryptographic commitment to D.
## map(pkv, Q(C(D)))
the 4-tuples are themselves elements within a sequence, so you can compute pkv(map(pkv, Q(C(D)))) to get a verifiable identifier of D.
## H(C(map(C, Q(C(D)))))
for each quad/4-tuple in the canonicalized dataset, represent it as an rdf `@list` and then canonize it. Then you have a sequence of canonized quads. Represent *that* as an rdf `@list` and then canonize it.
Then if you hash it (`H(C(map(C, Q(C(D)))))`), you have an `id(D)` that can be used in an inclusion proof.
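a sketch of that composition. `C` here is a PLACEHOLDER for rdf-canon (a real implementation would run the RDFC-1.0 algorithm and emit canonical n-quads); joining with newlines just makes the shape of `H(C(map(C, Q(C(D)))))` visible:

```python
import hashlib

# PLACEHOLDER for rdf-canon: a real C would run the RDFC-1.0 algorithm
# on an rdf @list and return canonical n-quads text. joining with
# newlines here only keeps the composition readable.
def C(items) -> str:
    return "\n".join(items) + "\n"

def id_of(quads) -> str:
    per_quad = [C(list(q)) for q in quads]   # map(C, Q(C(D)))
    combined = C(per_quad)                   # C(map(C, Q(C(D))))
    return hashlib.sha256(combined.encode()).hexdigest()  # H(...)
```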
## proveInclusion(id(D), E, P)
D: dataset
E: element
P: proof
[rdfcanon1][] basically says how to sha256 the n-quads text and use that as the id of D, but also leaves open other methods.
It could be useful to have an id of D that also commits to selective disclosure of elements of D that verify against that id, i.e. 'inclusion proofs' of D where the provably included elements of D are quads, i.e. 4-tuples, i.e. sequences of values.
ok so if Q(C(D)) is of type `Seq<[,,,]>` i.e. `Seq<Tuple<4>>`
this can also be seen as a dag rooted in the sequence of 4-tuples, whose children are each of the 4-tuples, and each of those nodes has 4 ordered items which are the tuple's values (which are binary but may themselves be hashes of other values, e.g. if the quad links to an `ipfs:` or `dweb:` or [`ni:`](https://www.rfc-editor.org/rfc/rfc6920) or [`hl:`](https://datatracker.ietf.org/doc/html/draft-sporny-hashlink) URI). But you can also use `data:` URIs, soft links to non-hash-links, or RDF literals.
You can merklize this dag. and then do inclusion proofs of any quad or any element of any quad (leaf in dag).
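a minimal sketch of merklizing the quads and proving inclusion of one quad. the tree shape, the leaf encoding, and the duplicate-last-node trick for odd levels are assumptions for illustration; any agreed-upon scheme works:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def _level_up(level):
    # pair up nodes, duplicating the last one on odd-length levels
    if len(level) % 2:
        level = level + [level[-1]]
    return [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)], level

def merkle_root(leaves) -> bytes:
    level = [h(l) for l in leaves]
    while len(level) > 1:
        level, _ = _level_up(level)
    return level[0]

def prove(leaves, index):
    """Sibling-hash path from leaf `index` up to the root."""
    level = [h(l) for l in leaves]
    path = []
    while len(level) > 1:
        nxt, padded = _level_up(level)
        sibling = padded[index ^ 1]
        path.append((sibling, index % 2 == 0))  # (hash, our node is on the left?)
        level, index = nxt, index // 2
    return path

def verify(root: bytes, leaf: bytes, path) -> bool:
    acc = h(leaf)
    for sibling, we_are_left in path:
        acc = h(acc + sibling) if we_are_left else h(sibling + acc)
    return acc == root
```

leaves here would be some fixed byte-encoding of each quad; proving inclusion of a single *element* of a quad works the same way, one level deeper.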
When the leaves are URIs that link to merkle roots (e.g. ipfs uris), this can be virtualized into a new node that allows deeper traversal or inclusion proofing, dependent on the tree structure and hashing scheme of the hashlink referent.
All of this is pretty agnostic to:
* whether/which hash functions
* how you make addresses for the sequences (i.e. [@mikeal/pkv][], dcbor, or [rdfcanon1][] on an rdf list)
but what it maybe allows you to do is selective disclosure of any part of any edge (i.e. source, type, destination, context) in a graph dataset, of which a dag is one kind, and a merkle dag is one kind where source and destination are always verifiable identifiers (e.g. CIDs, PKVs, etc etc).
So I could e.g. record the http traffic of me interacting with twitter.com to scroll through all my tweets. That becomes D. Do the above, and then you 1) get an id(D) and 2) can do proveInclusion(id(D), E, P).
you could also export all of openlibrary.org's existing RDF API and then use the dataset as D. Changes to the dataset can also be expressed as just new merkle roots largely reusing leaves.
## G() => D
Previously we assumed D was an RDF dataset. You may not want to ever store that at rest. Maybe instead you just have a sqlite table, or a big CSV, or a bunch of raw HTTP traffic or something. You can just encode a description of a generator that goes from that source to an rdf dataset as G() => D. Then the rest of the above can kick in to add comm(G()), proof of inclusion, etc.
e.g. a G could use [RML][] datasets to convert from an excel table to a D dataset.
e.g. G could be [more complex dataset generation](https://wikidataworkshop.github.io/2023/papers/4__novel_keysearchwiki_an_automa%5B1%5D.pdf)
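a toy sketch of a G: CSV rows to quads. the `urn:example:` predicate and graph IRIs are made up for illustration; a real G might instead be a declarative [RML][] mapping:

```python
import csv
import io

# toy generator G: CSV text -> sequence of rdf quads.
# the urn:example: subject/predicate/graph IRIs are hypothetical.
def G(csv_text: str, graph: str = '<urn:example:g>'):
    quads = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        subj = f'<urn:example:row{i}>'
        for key, val in row.items():
            quads.append((subj, f'<urn:example:{key}>', f'"{val}"', graph))
    return quads
```

the output of G can then feed the id(D) / merklization steps above unchanged.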
[rdfcanon1]: https://www.w3.org/TR/rdf-canon/
[@mikeal/pkv]: https://github.com/mikeal/pkv
[RML]: https://rml.io/features/