Exploration: Incremental verification of content-addressable streams
====================================================================
Content-addressed data has several interesting properties. This exploration report concentrates on the incremental verifiability of streams. Such a system does not need to buffer all the data in memory before verification can start. The approach extends to all kinds of content addressing and is not limited to a specific implementation or protocol.
Background
----------
A common form of verifiable data is a checksum provided next to the data. The [Debian ISOs] are an example: a file called [SHA256SUMS] provides the checksums of the ISOs in that directory. You can then verify locally that the download was correct and that it was the expected file. The problem is that you first need to download the full file, several GB of data.
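
To illustrate that workflow, here is a minimal sketch of whole-file verification, assuming the Rust `sha2` and `hex` crates; the path and checksum are placeholders that would come from a file like SHA256SUMS:

```rust
// Sketch: whole-file verification as done with a SHA256SUMS file.
// Assumes the `sha2` and `hex` crates; path and checksum are placeholders.
use sha2::{Digest, Sha256};
use std::fs::File;
use std::io::{self, Read};

fn verify_whole_file(path: &str, expected_hex: &str) -> io::Result<bool> {
    let mut file = File::open(path)?;
    let mut hasher = Sha256::new();
    let mut buf = [0u8; 64 * 1024];
    loop {
        let read = file.read(&mut buf)?;
        if read == 0 {
            break;
        }
        // Every byte passes through the hasher ...
        hasher.update(&buf[..read]);
    }
    // ... but the digest can only be checked after the *whole* file was read.
    let digest = hasher.finalize();
    Ok(hex::encode(digest) == expected_hex.to_lowercase())
}
```
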
With incremental verification, you can start verifying while the download is in progress and, in case there is wrong/bad data, you can stop early. Sometimes you don't even want to download the file at all, but just process it on-the-fly, like listening to a podcast or watching a video. Or even skip some data and start somewhere in the middle.
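
As an illustration of the idea (not of any specific protocol), here is a hedged sketch of per-chunk verification in the style of BitTorrent pieces; the chunk size, the digest list and the `sink` callback are hypothetical inputs that a real protocol would define:

```rust
// Sketch: incremental, per-chunk verification (BitTorrent-piece style).
// Assumes the `sha2` crate; chunk size and digest list are hypothetical.
use sha2::{Digest, Sha256};
use std::io::Read;

fn verify_incrementally<R: Read>(
    mut source: R,
    chunk_size: usize,
    expected: &[[u8; 32]],       // one SHA-256 digest per chunk
    mut sink: impl FnMut(&[u8]), // only ever receives verified bytes
) -> Result<(), String> {
    let mut buf = vec![0u8; chunk_size];
    for (index, want) in expected.iter().enumerate() {
        // Fill one chunk (the last one may be shorter).
        let mut filled = 0;
        while filled < chunk_size {
            let n = source.read(&mut buf[filled..]).map_err(|e| e.to_string())?;
            if n == 0 {
                break;
            }
            filled += n;
        }
        if filled == 0 {
            return Err(format!("stream ended before chunk {index}"));
        }
        // Verify the chunk before handing it on; stop at the first bad chunk.
        if Sha256::digest(&buf[..filled]).as_slice() != want {
            return Err(format!("chunk {index} failed verification"));
        }
        sink(&buf[..filled]);
    }
    Ok(())
}
```
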
There are already systems out there that do such verification, such as [Helia's verified fetch], [Amazon S3 object integrity checks], [BitTorrent] or [iroh's BAO based verification]. The goal is to explore a system that can be used as a library and supports all those cases as well as future ones.
Scope
-----
- System/Protocol agnostic: Don't be bound to a specific system, protocol or ecosystem. As long as the data is content-addressed and incrementally verifiable, it should be possible to make it part of this system.
- Platform agnostic: As long as the input is bytes, the system can operate. It is independent of the platform; it can run on a local computer, in the browser or in the cloud.
- Only store verified bytes: When processing an incoming stream, only verified data is written to disk.
- Low memory footprint: Different protocols will require different amounts of auxiliary data to be kept in memory in order to perform the verification. Those requirements should be on the order of megabytes.
- Extendable: The core should get the abstractions right and make it easy to extend to other (future) incrementally verifiable streams with ideally little effort (see the sketch after this list).
- Return early: Returning early also means erroring early. If bytes cannot be verified, return an error. Clearly signal what the problem was, e.g. whether it was a failed verification, missing data that could be re-requested, or anything else.
- Range requests: If the underlying protocol supports it, allow for seeking and range requests.
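
To make the extendability and error-signalling goals more concrete, here is a hedged sketch of what such an abstraction could look like in Rust. Every name in it is hypothetical; it is not the API of any existing library:

```rust
// Sketch of a possible protocol-agnostic abstraction. Every name here is
// hypothetical; it is not the API of any existing library.

/// Why verification could not continue. Callers can distinguish data that
/// is provably wrong from data that is merely missing and could be
/// re-requested.
pub enum VerifyError {
    /// The bytes do not match the expected hash; abort, do not retry.
    Corrupt { offset: u64 },
    /// More data (e.g. a proof node or a linked object) is needed first.
    Missing { description: String },
    /// Anything else, e.g. an I/O problem in the underlying stream.
    Other(String),
}

/// Result of feeding more bytes into a verifier.
pub enum Verified<'a> {
    /// These bytes are verified and safe to write to disk.
    Bytes(&'a [u8]),
    /// Nothing could be verified yet; keep feeding data.
    NeedMore,
    /// The stream is complete and fully verified.
    Done,
}

/// One implementation per protocol (flat checksums, BitTorrent pieces,
/// BAO trees, ...), all consumed through the same interface.
pub trait IncrementalVerifier {
    /// Push the next bytes from the transport and get back verified bytes.
    fn push(&mut self, bytes: &[u8]) -> Result<Verified<'_>, VerifyError>;

    /// Restrict verification to a byte range, if the underlying protocol
    /// supports range requests; otherwise return an error.
    fn seek(&mut self, start: u64, length: u64) -> Result<(), VerifyError>;
}
```

A push-based design like this keeps the library transport-agnostic: the caller decides how bytes arrive (HTTP, BitTorrent, a local file) and only hands them over for verification.
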
Out of scope
-------------
- Transport: The design outlined here is only concerned with the data, i.e. the bytes, not where they originate from. Other tooling is expected to be used to create a stream of bytes as input.
- Best effort: Don't try to be smart about specific error cases. Don't do things like retrying or reprocessing in case of an error. See the in-scope "return early" item for more information.
- Transparent traversal across protocols: Multiple protocols are supported, but every instantiation is for one specific protocol. If one protocol links to another (e.g. an IPLD DAG links to an S3 file), the system would return information about the S3 link, which can then be used for another instantiation.
[Debian ISOs]: https://cdimage.debian.org/debian-cd/current/amd64/iso-bd/
[SHA256SUMS]: https://cdimage.debian.org/debian-cd/current/amd64/iso-bd/SHA256SUMS
[Helia's verified fetch]: https://blog.ipfs.tech/verified-fetch/
[Amazon S3 object integrity checks]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html
[BitTorrent]: https://en.wikipedia.org/wiki/BitTorrent#Creating_and_publishing
[iroh's BAO based verification]: https://www.iroh.computer/blog/blob-store-design-challenges