# Questions about multicodecs, UnixFS, CBOR, MerkleDAG...
1. `dag-pb` is protobuf, which decodes to `merkledag.ProtoNode`. It usually (always?) contains UnixFSv1 protobuf, right?
-> No, not always, but today it mostly does. And if it doesn't, the payload is ignored.
2. `dag-cbor` is CBOR, which decodes to `cbornode.Node`.
That also has a payload, which is `interface{}`, but the implementation assumes it to be one of
- `map[string]interface{}`
- `map[interface{}]interface{}`
- `[]interface{}`
- `cid.Cid`
- IPLD DataModel specification: https://github.com/ipld/specs/blob/master/data-model-layer/data-model.md
- Specification of a data stream within a dag-cbor/DM node-tree https://github.com/ipld/specs/blob/master/data-structures/data.md
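To make the shape of such a decoded payload concrete, here is a minimal sketch of walking a node-tree over exactly those Go types. The `Link` struct is a hypothetical placeholder standing in for `cid.Cid`, so the example stays self-contained:

```go
package main

import "fmt"

// Link is a hypothetical stand-in for cid.Cid in a decoded dag-cbor payload.
type Link struct{ Target string }

// countLeaves recursively walks a decoded payload, tallying scalar leaves
// and links, using the same type set the implementation assumes.
func countLeaves(v interface{}) (scalars, links int) {
	switch x := v.(type) {
	case map[string]interface{}:
		for _, child := range x {
			s, l := countLeaves(child)
			scalars, links = scalars+s, links+l
		}
	case map[interface{}]interface{}:
		for _, child := range x {
			s, l := countLeaves(child)
			scalars, links = scalars+s, links+l
		}
	case []interface{}:
		for _, child := range x {
			s, l := countLeaves(child)
			scalars, links = scalars+s, links+l
		}
	case Link:
		links++
	default: // strings, ints, floats, bools, nil, []byte, ...
		scalars++
	}
	return
}

func main() {
	doc := map[string]interface{}{
		"name": "example",
		"refs": []interface{}{Link{"bafyA"}, Link{"bafyB"}},
		"size": 42,
	}
	s, l := countLeaves(doc)
	fmt.Println(s, l) // 2 2
}
```

Anything that decodes into this shape is a valid Data Model tree; anything else falls out of the type switch.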
2.1 When building a `Node`, however, it could wrap anything that can be CBOR-encoded, so what is actually going on here?
-> It can actually wrap anything that can be represented in the above data model
2.2 I don't think it contains UnixFSv1, ever, is that correct? (Also UnixFSv1 refuses to decode itself from anything but `dag-pb`?)
-> Not the protobuf, but rebuilding the graph would be possible
2.3 So it seems to me like it functions mostly as nodes for the IPLD graph, is that correct?
-> Yup
3. `raw` is just raw bytes, without encoding. For IPFS this is usually file contents, right?
-> https://github.com/ipld/specs/pull/273/files
When moving to UnixFSv2, all of the leaves will be encoded as raw leaves. (Of course, the default can change before UnixFSv2.)
4. How does `raw` relate to `identity`?
-> `raw`: The stuff in the block is just bytes
`identity`: the data is in the CID
5. How do `dag-pb` and `dag-cbor` relate to `protobuf` and `cbor`?
-> `dag-cbor` is any CBOR that is valid IPLD DataModel.
6. I was under the impression that large files are sharded into blocks (`raw` or `dag-pb`-wrapped UnixFS Files or Raws) of 256 KiB.
I've queried block sizes we've seen for `raw` blocks and it turns out that, yes, most of them are 256 KiB, but the second most common size is 1 MiB:
```
block_size | cnt
------------+---------
262144 | 3584900
1048576 | 521763
85 | 27206
84 | 26528
393216 | 8619
8192 | 7032
4096 | 5662
83 | 5096
...
```
So now I'm wondering what is going on here.
-> The default is 256 KiB, but there are different ways of chunking
6.1 Is there an upper limit for block size?
-> The soft limit is 1 MiB. The hard limit is 2 MiB - 1 B (enforced by libp2p).
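The simplest of those chunking strategies is fixed-size splitting, sketched below with the 256 KiB default and the 1 MiB soft limit from above. (go-ipfs also ships content-defined chunkers such as rabin and buzhash, which produce variable sizes; the constants here just mirror the numbers in this discussion.)

```go
package main

import "fmt"

const (
	DefaultChunkSize = 256 << 10 // default chunk size discussed above (256 KiB)
	SoftBlockLimit   = 1 << 20   // 1 MiB soft limit on block size
)

// chunkFixed splits data into fixed-size chunks; the final chunk may be
// shorter. Content-defined chunkers differ only in where they cut.
func chunkFixed(data []byte, size int) [][]byte {
	var chunks [][]byte
	for len(data) > 0 {
		n := size
		if len(data) < n {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

func main() {
	payload := make([]byte, 600<<10) // a 600 KiB payload
	chunks := chunkFixed(payload, DefaultChunkSize)
	fmt.Println(len(chunks)) // 3 (256 KiB + 256 KiB + 88 KiB)
	for _, c := range chunks {
		if len(c) > SoftBlockLimit {
			panic("chunk exceeds soft block limit")
		}
	}
}
```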
6.2 Were there experiments with other default "sharding block" sizes, or where are the 1MiB and 384 KiB blocks coming from?
-> Yes, there is also ongoing research into this :)
(6.3 Interestingly, almost all of the 85, 84, 83, 82, and 81 byte blocks are tiny HTML documents that look something like this:
`<html><head><meta http-equiv="Refresh" content="0; url=../35343.html"></head></html>`
)
Also interestingly, this is the data for `dag-pb`, where the block size includes the MerkleDAG wrapper protobuf:
```
block_size | cnt
------------+----------
262158 | 44591517
112 | 348465
104 | 324419
8362 | 267480
131086 | 182752
164 | 155494
152 | 145850
108 | 130065
102 | 129437
```
Two clear spikes at 262158 bytes (= 256 KiB + 14 bytes of `dag-pb`/UnixFS wrapper) and 131086 bytes (= 128 KiB + 14 bytes).
6.4 Were there experiments with other default sharding block sizes for UnixFS without raw leaves, or where do these come from?
7. What is the situation with large files w.r.t. how they are stored?
I can think of three ways:
1. A `dag-pb` parent which is a UnixFS `File`, with children being UnixFS `File`s themselves.
2. A `dag-pb` parent which is a UnixFS `File`, with children being UnixFS `Raw` blocks. (This is old)
3. A `dag-pb` parent which is a UnixFS `File`, with children being `raw` blocks.
(From the documentation of the balanced importer for UnixFS, I understand that this can have depth > 1, in which case the intermediary nodes are `dag-pb`-wrapped UnixFS `File`s.)
Which one(s) of these are used, and is it always necessary to have the UnixFS parent block?
((The idea behind my question is something like: Why not construct an empty `dag-pb` or `dag-cbor` or any other MerkleDAG node that links to the parts of the file?))
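The depth > 1 case from the parenthetical above can be sketched numerically. Assuming the balanced layout groups leaves under parents with a maximum fanout (go-ipfs uses `DefaultLinksPerBlock = 174` in its importer helpers, hedged here as an assumption), the tree depth grows logarithmically with file size:

```go
package main

import "fmt"

// dagDepth returns how many levels of intermediary dag-pb File nodes a
// balanced layout needs above the leaves, given a maximum fanout per node.
func dagDepth(leaves, maxLinks int) int {
	depth := 0
	for leaves > 1 {
		leaves = (leaves + maxLinks - 1) / maxLinks // parents for this level
		depth++
	}
	return depth
}

func main() {
	const chunkSize = 256 << 10 // 256 KiB leaves
	fileSize := 10 << 30        // a 10 GiB file
	leaves := (fileSize + chunkSize - 1) / chunkSize
	// assumed fanout of 174 links per dag-pb node
	fmt.Println(leaves, dagDepth(leaves, 174)) // 40960 3
}
```

So a 10 GiB file at 256 KiB per leaf needs three levels of UnixFS `File` parents; a file that fits in 174 leaves or fewer needs only one.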
8. The multicodecs table thingy lists a bunch of different MerkleDAG-related codecs:
- `dag-pb`, as discussed above I think this only ever holds UnixFS data? -> See above
- `dag-cbor` and `dag-json`, which are implemented, but I'm not sure what they're used for. -> See above
- `dag-jose` and `dag-cose`, which I believe are not implemented, is that correct? -> Someone might be working on an implementation
-> Could ask in the PR if someone is working on it. (https://github.com/ipld/specs/pull/269)
9. What's the status of UnixFSv1 symlinks? Metadata?
-> Symlinks work (not via gateway)
metadata: UnixFSv1.5: js-ipfs supports this, go-ipfs does not yet
https://github.com/ipfs/specs/blob/master/UNIXFS.md#metadata (There may be variations to this in the future)
10. Are Bitswap Sessions being used?
-> These might be less about the INV messaging and more about batching up a unit of work that needs to be done
Because of selectors (in the future), the depth of a tree might not matter too much, because we can fetch multiple levels together. (This leads to interesting issues about verification)
Chunking is one decision, building a tree from that is another (arranging), and what kind of nodes to build for that is yet another.
Work on UnixFSv2 is somewhat slow because changing this means we get different hashes.
```
go run ./cmd/stream-dagger/ \
--chunkers="pad-finder_max-pad-run=65535_pad-static-hex=00_static-pad-min-repeats=42_static-pad-literal-max=1285__pigz_min-size=288_max-size=65535_state-mask-bits=10_state-target=0" \
--collectors=shrubber_max-payload=65535_static-pad-repeater-nodes=3_cid-subgroup-target=0_cid-subgroup-min-nodes=3_cid-subgroup-mask-bits=5__fixed-cid-refs-size_max-cid-refs-size=$(( 16 * 1024)) \
--node-encoder=unixfsv1_non-standard-lean-links \
--hash=sha3-512 \
--inline-max-size=0 --multipart
```
```
Ran on 4-core/8-thread Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
Processing took 5.94 seconds using 2.22 vCPU and 55.21 MiB peak memory
Performing 22,922 system reads using 0.18 vCPU at about 172.02 MiB/s
Ingesting payload of: 1,071,741,825 bytes from 1 substreams
Forming DAG covering: 1,072,770,214 bytes of 25,411 logical nodes
Dataset deduped into: 16,203,172 bytes over 407 unique leaf nodes
Linked as streams by: 722,130 bytes over 302 unique DAG-PB nodes
Taking a grand-total: 16,925,302 bytes, 1.58% of original, 63.3x smaller
Roots\Counts\Sizes: 3% 10% 25% 50% 95% | Avg
{1} 1 L3: 1,490 | 1,490
37 L2: 18,209 18,209 18,209 18,209 18,209 | 18,055
260 L1: 168 168 168 208 208 | 199
4 PS: 127 207 207 247 | 197
4 PB: 516 613 1,188 1,285 | 900
403 DB: 13 58 16,437 49,165 65,535 | 40,197
```