# Thesis

The "thin waist" of the "merkle forest" needs a stable, ubiquitous, and nearly-universally agreed-upon stream chunking algorithm that is sufficiently good to not require a switchover for several decades. The term "good" here covers performance/resource usage in terms of:

- computation-phase space/time (RAM/CPU)
- requirements at rest (deduplication potential)
- requirements at transport (overhead for subdag identification/communication)

**Provided** [such a beast exists](https://xkcd.com/2268/) (and I firmly believe it does, similar to Zooko's ["local maximum for hash functions" conjecture](https://twitter.com/zooko/status/835294257888002050)), it would apply simultaneously at the following spots **without** needing to work within the IPFS ecosystem at all, but rather parallel to it:

- A boon for FLOSS mirrors: duplication in packaging archives is rampant
- A true "holy-grail" ETag in the HTTP world
- A block-deduplication atom for ZFS/BTRFS/whoever-will-listen

## Part 1: how we got here

### 2016

Objective(s):

- Ability to serve both
  - https://kernel.org/pub/linux/kernel/v5.x/ and
  - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/

  utilizing the same backing leaf nodes for everything
- Derive a stable stream hashing function that converges with the above, but at the same time is suitable for inclusion in something like [`rhash`](https://www.mankier.com/1/rhash#Hash_Sums_Options)
- {{ Insert 30-second demo of the toolkit from that era }}

### Problems identified while doing the above

TL;DR: If one wants to interoperate with current and *future* web gateways, one is essentially at the mercy of PL. Moreover, it is **really** difficult to advocate for any spec work ahead of time, as "the code is not written yet".
Which basically means that you either:

- dump stuff into the DHT and hope you got it right / that PL will bend to your worldview, or
- wait (what I've been doing, because the above alternative is a dick move)

Specific issues: [I wrote this list back in December; some of it may have changed since](https://hackmd.io/EX5uh93eRfuivmfNbZYgsA)

## Part 2: go-ipfs-time

In late January real work on https://github.com/ipfs/specs/issues/227 started (followed by rapidly diverging expectations of what the end result should be).

Mini-thesis:

- The time for "experimental interfaces" has passed: islands of stability are **absolutely necessary** at the tooling boundaries to ooze a confident "yes, you can use this for your data" at every turn
- "systemd is an architectural mistake": building ever more intricate, stateful, multi-gigabyte-eating daemons does not help adoption. In other words, this *should not* be written: https://github.com/ipfs/notes/issues/434
- Lean on simple "unix-style" streaming interfaces as much as possible, while allowing extensibility
- "Do one thing and do it well" components
- Keep an eye on runtime (RAM/CPU) performance envelopes
- {{ 5-minute demo of what :dagger_knife: does as it is right now }}

## Part 3: where do we go from here

As it stands, :dagger_knife: is the perfect tool to bridge:

- the "we want it fast but don't know much about the scientific method" crowd with
- the "we know information theory inside out, but can not write real code to save our lives" crowd, with
- the "I just want to mirror my ubuntu archive quickly" crowd.

Sticking points:

- If this were back in my pre-PL time, I would know exactly how to market/run/extend this project
- After joining PL everything became **really** "complicated"
- {{ the remaining 45 mins go towards figuring out what we do next }}
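## Appendix: what "stream chunking" means here

For anyone unfamiliar with the family of algorithms the opening thesis wants a "sufficiently good" member of, here is a toy sketch of content-defined chunking (CDC). Everything in it is an illustrative stand-in, not a proposal: the window size, the cut mask, and the use of SHA-256 over the sliding window are arbitrary (a real chunker — Rabin, Buzhash, FastCDC, etc. — would use an O(1) rolling hash update instead of rehashing the window at every byte). The point it demonstrates is the "deduplication potential" bullet: because cut points depend only on local content, a one-byte edit invalidates only the chunk(s) around it, and everything downstream re-converges.

```python
import hashlib
import random

WINDOW = 16            # bytes in the sliding window (arbitrary for this demo)
MASK = (1 << 12) - 1   # cut when the low 12 bits match -> ~4 KiB average chunk
MIN_CHUNK = 256        # suppress pathologically small chunks

def _is_boundary(window: bytes) -> bool:
    # Cut decision based only on the window's content. A production
    # chunker would maintain this hash incrementally, not recompute it.
    h = int.from_bytes(hashlib.sha256(window).digest()[:4], "big")
    return (h & MASK) == MASK

def chunk(data: bytes) -> list:
    """Split `data` at content-defined boundaries."""
    chunks, start = [], 0
    for i in range(len(data)):
        if i - start + 1 < MIN_CHUNK:
            continue  # enforce the minimum chunk size
        if _is_boundary(data[i - WINDOW + 1:i + 1]):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks

# Demo: a single-byte insertion near the front of a 64 KiB blob.
random.seed(7)
original = bytes(random.getrandbits(8) for _ in range(64 * 1024))
edited = original[:100] + b"\x00" + original[100:]

orig_ids = {hashlib.sha256(c).hexdigest() for c in chunk(original)}
edit_ids = {hashlib.sha256(c).hexdigest() for c in chunk(edited)}
# With fixed-size blocks the insertion would shift and invalidate every
# subsequent block; with CDC most chunk hashes survive unchanged.
```

The thesis question is which *specific* instance of this scheme has good enough properties (RAM/CPU cost, dedup ratio, subdag-identification overhead) to be frozen for decades, the way SHA-256 was for hashing.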