# Equivalent DAG Representations

## Background

We are designing IPLD to be a general framework for working with content-addressed and content-linked data. While we have designed some of our own formats and preferences (e.g. DagPB, DagCBOR, etc.), we also endeavor to be compatible with existing formats as long as they use content-based linking (e.g. Bitcoin, Ethereum, Git, etc.).

## Problem

In trying to interoperate with existing systems we sometimes come across content-addressed data which IPLD can technically handle but IPFS cannot. The biggest offender in this regard is block size. IPFS has a maximum block size of 1MiB (it's actually more like 2MiB, but we recommend people go with 1MiB to be safe), while some systems like Git have blocks that are much larger (apparently GitHub allows up to 100 MiB). This means that while all content-addressed data can be added to a local IPFS node, it cannot necessarily be communicated over the network.

### Why do we have a (1MiB) maximum block size?

Without a maximum block size there's nothing stopping an attacker from sending us lots of bogus data. With a maximum block size we are (theoretically, since AFAIK we don't actually implement this in go-ipfs) able to just block peers that send us blocks we don't want, at a fixed cost. Without a maximum block size we could ask for CID bafyabc and be sent a never-ending stream of data. While this attack costs the attacker upload bandwidth, it's still something we'd like to prevent.

Additionally, having many blocks for a single logical object gives us other nice benefits, like being able to split up queries between multiple peers and potentially benefitting from deduplication of the data within the object.

### Flat data

Aside from the case of existing content-linked systems, there is also a huge amount of data out there that is content-linked flat data. For example, many software distributors, antivirus vendors, torrents, etc. refer to a file by the SHA256 of the entire file. These files can be much larger than the maximum block size.

## General Proposal

Given the existence of some legacy content-addressed data DagLegacy that we (the protocol designers) wish were represented in some other way DagPreferred, let us give the user some information that would convince them that, while they asked for DagLegacy, DagPreferred would suffice.

Restrictions:
1) The information (or proof) that we give the user must cost them resources (bandwidth, CPU, etc.) that are at most O(FileSize)
2) The user should never have to expend more than O(1) (or perhaps O(log(FileSize))) resources during the proof verification without evidence that they are making progress
   - Sorry, the definition of evidence is flaky here. What I'm trying to get at is that the attacker should have the same flexibility as with a Merkle tree, where someone can give you the whole tree except for the leaves, which sucks but is valid.

Nice to Haves:
1) Is conveyable/non-interactive
2) The proof is the same/comparable no matter who computed it

It would be nice if parties could coalesce around storing the same DAG instead of everyone storing their own. It seems likely that this would only happen if people could pass around the proofs and compare them.

### Strawman Proposal

Overview: Create a new MerkleDAG where every link is represented by a combination of a CID of the graph and the SHA256 intermediate hash of the leaf nodes underneath it. Structure the DAG as a line so we can slowly unroll the SHA256, since it's a streaming hash.
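To make the overview a bit more concrete before the worked example below, here is a rough sketch (in Go, since the strawman code linked later in this document is also Go) of what such a non-leaf node might carry. The type and field names are purely illustrative and not an actual IPLD schema:

```go
// Purely illustrative: one possible shape for the non-leaf ("intermediate")
// nodes of the chain-structured DAG sketched above. This is not an actual
// IPLD schema; the type and field names are made up for this sketch.
package strawman

import cid "github.com/ipfs/go-cid"

// IntermediateNode introduces one leaf chunk and carries the partial
// (unfinalized) SHA-256 state over all leaf bytes before that chunk, so
// the streaming hash can be resumed from this point during verification.
type IntermediateNode struct {
	Prev    cid.Cid // CID of the previous intermediate node (cid.Undef for the first node)
	Leaf    cid.Cid // CID of the leaf chunk this node introduces
	SHA256P []byte  // serialized partial SHA-256 state of the preceding leaves
}
```

Structuring the DAG as a line means each node has exactly one `Prev` link, which is what lets the SHA-256 state be unrolled incrementally.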
Example: We have a 4MiB file `F` with `H_F=SHA256(F)`, so we divide the file up into 4 chunks `F0...F3`. We then put them into a tree with non-leaf nodes `I0...I2` and a root node `Root`, as below.

```graphviz
digraph hierarchy {
  nodesep=1.0 // increases the separation between nodes
  node [color=Red,fontname=Courier,shape=box] // all nodes have this shape and colour
  edge [color=Blue, style=dashed]             // all the edges look like this

  i0 [label = "I0 \n CID(F0), SHA256-P(nil)"];
  i1 [label = "I1 \n CID(I0) \n CID(F1) SHA256-P(F0)"];
  i2 [label = "I2 \n CID(I1) \n CID(F2) SHA256-P(F0||F1)"];
  i3 [label = "Root \n CID(I2) \n CID(F3) SHA256-P(F0||F1||F2)"];
  f0 [label = "F0"];
  f1 [label = "F1"];
  f2 [label = "F2"];
  f3 [label = "F3"];

  i3 -> {i2, f3}
  i2 -> {i1, f2}
  i1 -> {i0, f1}
  i0 -> {f0}
}
```

Where SHA256-P(X) (P for partial) is defined as the partial SHA256 of X that does not include the checksum (i.e. it is the state before adding the extra 1 bit + padding). We also define SHA256-P(X, Y) as SHA256-P(X || Y).

Notes:
1) Because SHA256 is a streaming function we can compute SHA256-P(X, Y) knowing only SHA256-P(X) and Y (see the sketch after the analysis below).
2) SHA256(X) can be computed knowing only SHA256-P(X).

To download the graph we:
1) Get the Root block, knowing that H=SHA256(F0||F1||F2||F3) should be the hash of the concatenated leaf blocks. The Root block points us at both I2 and F3.
2) Fetch F3 and check that SHA256-P(F0 || F1 || F2, F3), once finalized, gives us H.
3) Fetch I2 and then F2 and check that SHA256-P(F0 || F1, F2) gives the SHA256-P(F0 || F1 || F2) that we found in Root.
4) Fetch I1 and then F1 and check that SHA256-P(F0, F1) gives the SHA256-P(F0 || F1) that we found in I2.
5) Fetch I0 and then F0 and check that SHA256-P(nil, F0) gives the SHA256-P(F0) that we found in I1.
   - I0 can obviously be optimized away

Some strawman code for the Gophers: https://play.golang.org/p/TJvnI4gc2Dk

#### Strawman Analysis

At each step in the process we're downloading no more than 2 blocks (e.g. Root + F3) without knowing that the blocks we received can be used to assemble the file we're looking for. This is based on the assumption that it is hard to find a Y' != Y such that SHA256-P(X || Y) == SHA256-P(X || Y'), which seems about as strong as SHA256 itself, although I would like some confirmation of that.

While this works, it unfortunately requires a bit of a security/performance tradeoff if using a request-response protocol like Bitswap instead of a streaming protocol like GraphSync. This is because in order to verify that F~i~ is a block we want we need to have already downloaded F~i+1~...F~n~, and the request-response protocol is latency bound on each request (i.e. ask for F~i~, receive and verify it, then ask for F~i-1~). However, we can choose to relax our security constraints a bit and allow downloading several blocks ahead.
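Notes 1) and 2) above boil down to the fact that SHA-256's internal state can be checkpointed before finalization and resumed later. Below is a minimal sketch of that property (separate from the playground code linked above); it only assumes a reasonably recent Go toolchain (1.10+, if I recall correctly), where the standard library's hash implementations gained encoding.BinaryMarshaler/BinaryUnmarshaler, so the marshaled digest state can stand in for SHA256-P:

```go
// Sketch of the streaming/resumable property behind SHA256-P. Chunk
// contents are placeholders; the point is only that the partial state
// after F0||F1 can be exported, resumed, and extended with F2.
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding"
	"fmt"
)

func main() {
	f0, f1, f2 := []byte("chunk F0"), []byte("chunk F1"), []byte("chunk F2")

	// SHA256-P(F0 || F1): hash the first two chunks, then export the
	// intermediate (unfinalized) state.
	h := sha256.New()
	h.Write(f0)
	h.Write(f1)
	state, err := h.(encoding.BinaryMarshaler).MarshalBinary()
	if err != nil {
		panic(err)
	}

	// Later, possibly on another machine: resume from SHA256-P(F0 || F1)
	// and append F2, giving SHA256-P(F0 || F1, F2), then finalize.
	resumed := sha256.New()
	if err := resumed.(encoding.BinaryUnmarshaler).UnmarshalBinary(state); err != nil {
		panic(err)
	}
	resumed.Write(f2)

	// Reference: SHA256(F0 || F1 || F2) computed in one pass.
	direct := sha256.Sum256(bytes.Join([][]byte{f0, f1, f2}, nil))

	fmt.Println("resumed == direct:", bytes.Equal(resumed.Sum(nil), direct[:]))
}
```

Each verification step in the walkthrough above is then just: restore the partial state claimed by a node, write the next chunk, and check that it reproduces the partial state (or, at the root, the finalized H) that we already trust.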
## Questions

1) Is the strawman proposal sufficiently secure?
2) Is there a proposal that works better with Bitswap? (e.g. one where we can prove a block F~i~ is good while knowing only O(log(N)) data, as with Merkle proofs, instead of the O(N) data we need here)
3) How would we talk about this object (and its components) in such a way that the data verification mechanism is self-descriptive?

Some illustrative (definitely not perfect or complete) examples:

- Use the multihash: We could create a new multihash type for the non-leaf nodes that is: \{(multi)hash of block data || SHA256-P of F~0~...F~n-1~ || SHA256-P of F~0~...F~n~\}
  - While this is hash-like functionality that confirms the content-addressing property, it feels pretty gross to make a multihash that's really 3 hashes. It also bloats potential CID size.
- Use the codec: Make a new codec type that handles verification of the data
  - Requires that the Bitswap layer understand, or receive feedback from, the IPLD/codec layer in order to know whether received blocks are actually valid. Perhaps there is flexibility here, but it might be a bit tricky.
- Use an ADL: Use an existing codec but layer on some externally defined code to indicate how to validate
  - Similar downsides to the codec approach
  - Not sure if ADLs support this behavior
  - Nicer than the codec approach in that it is format agnostic
  - Seems to run into problems around how the downloading node knows what ADL to use for processing

## Acknowledgements

This proposal is almost identical to the one @Stebalien proposed here: https://discuss.ipfs.io/t/git-on-ipfs-links-and-references/730/6