# Questions about multicodecs, UnixFS, CBOR, MerkleDAG...
1. `dag-pb` is protobuf, which decodes to `merkledag.ProtoNode`. It usually (always?) contains UnixFSv1 protobuf, right?
-> No, not always, but today it mostly does. And if it doesn't, the payload is ignored.
2. `dag-cbor` is CBOR, which decodes to `cbornode.Node`.
That also has a payload, which is `interface{}`, but the implementation assumes it to be one of
- `map[string]interface{}`
- `map[interface{}]interface{}`
- `[]interface{}`
- `cid.Cid`
- IPLD DataModel specification: https://github.com/ipld/specs/blob/master/data-model-layer/data-model.md
- Specification of a data stream within a dag-cbor/DM node-tree https://github.com/ipld/specs/blob/master/data-structures/data.md
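To make the shape of such a decoded payload concrete, here is a minimal sketch of walking a node-tree over exactly those Go types. The `Link` struct is a hypothetical placeholder standing in for `cid.Cid`, so the example stays self-contained:

```go
package main

import "fmt"

// Link is a hypothetical stand-in for cid.Cid in a decoded dag-cbor payload.
type Link struct{ Target string }

// countLeaves recursively walks a decoded payload, tallying scalar leaves
// and links, using the same type set the implementation assumes.
func countLeaves(v interface{}) (scalars, links int) {
	switch x := v.(type) {
	case map[string]interface{}:
		for _, child := range x {
			s, l := countLeaves(child)
			scalars, links = scalars+s, links+l
		}
	case map[interface{}]interface{}:
		for _, child := range x {
			s, l := countLeaves(child)
			scalars, links = scalars+s, links+l
		}
	case []interface{}:
		for _, child := range x {
			s, l := countLeaves(child)
			scalars, links = scalars+s, links+l
		}
	case Link:
		links++
	default: // strings, ints, floats, bools, nil, []byte, ...
		scalars++
	}
	return
}

func main() {
	doc := map[string]interface{}{
		"name": "example",
		"refs": []interface{}{Link{"bafyA"}, Link{"bafyB"}},
		"size": 42,
	}
	s, l := countLeaves(doc)
	fmt.Println(s, l) // 2 2
}
```

Anything that decodes into this shape is a valid Data Model tree; anything else falls out of the type switch.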
2.1 When building a `Node`, however, it could wrap anything that can be CBOR-encoded, so what is actually going on here?
-> It can actually wrap anything that can be represented in the above data model
2.2 I don't think it contains UnixFSv1, ever, is that correct? (Also UnixFSv1 refuses to decode itself from anything but `dag-pb`?)
-> Not the protobuf, but rebuilding the graph would be possible
2.3 So it seems to me like it functions mostly as nodes for the IPLD graph, is that correct?
-> Yup
3. `raw` is just raw bytes, without encoding. For IPFS this is usually file contents, right?
-> https://github.com/ipld/specs/pull/273/files
When moving to UnixFSv2, all of the leaves will be encoded as raw leaves. (Of course, the default can change before UnixFSv2.)
4. How does `raw` relate to `identity`?
-> `raw`: The stuff in the block is just bytes
`identity`: the data is in the CID
5. How do `dag-pb` and `dag-cbor` relate to `protobuf` and `cbor`?
-> `dag-cbor` is any CBOR that is valid IPLD DataModel.
6. I was under the impression that large files are sharded into blocks (`raw` or `dag-pb`-wrapped UnixFS Files or Raws) of 256 KiB.
I've queried block sizes we've seen for `raw` blocks and it turns out that, yes, most of them are 256 KiB, but the second most common size is 1 MiB:
```
block_size | cnt
------------+---------
262144 | 3584900
1048576 | 521763
85 | 27206
84 | 26528
393216 | 8619
8192 | 7032
4096 | 5662
83 | 5096
...
```
So now I'm wondering what is going on here.
-> The default is 256 KiB, but there are different ways of chunking
6.1 Is there an upper limit for block size?
-> The soft limit is 1 MiB. The hard limit is 2 MiB - 1 B (enforced by libp2p).
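The simplest of those chunking strategies is fixed-size splitting, sketched below with the 256 KiB default and the 1 MiB soft limit from above. (go-ipfs also ships content-defined chunkers such as rabin and buzhash, which produce variable sizes; the constants here just mirror the numbers in this discussion.)

```go
package main

import "fmt"

const (
	DefaultChunkSize = 256 << 10 // default chunk size discussed above (256 KiB)
	SoftBlockLimit   = 1 << 20   // 1 MiB soft limit on block size
)

// chunkFixed splits data into fixed-size chunks; the final chunk may be
// shorter. Content-defined chunkers differ only in where they cut.
func chunkFixed(data []byte, size int) [][]byte {
	var chunks [][]byte
	for len(data) > 0 {
		n := size
		if len(data) < n {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

func main() {
	payload := make([]byte, 600<<10) // a 600 KiB payload
	chunks := chunkFixed(payload, DefaultChunkSize)
	fmt.Println(len(chunks)) // 3 (256 KiB + 256 KiB + 88 KiB)
	for _, c := range chunks {
		if len(c) > SoftBlockLimit {
			panic("chunk exceeds soft block limit")
		}
	}
}
```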
6.2 Were there experiments with other default "sharding block" sizes, or where are the 1MiB and 384 KiB blocks coming from?
-> Yes, there is also ongoing research into this :)
(6.3 Interestingly, almost all of the 85, 84, 83, 82, and 81 byte blocks are tiny HTML documents that look something like this:
`<html><head><meta http-equiv="Refresh" content="0; url=../35343.html"></head></html>`
)
Also interestingly, this is the data for `dag-pb`, where the block size includes the MerkleDAG wrapper protobuf:
```
block_size | cnt
------------+----------
262158 | 44591517
112 | 348465
104 | 324419
8362 | 267480
131086 | 182752
164 | 155494
152 | 145850
108 | 130065
102 | 129437
```
Two clear spikes at 262158 bytes (= 256 KiB + 14 bytes of `dag-pb`/UnixFS wrapper) and 131086 bytes (= 128 KiB + 14 bytes).
6.4 Were there experiments with other default sharding block sizes for UnixFS without raw leaves, or where do these come from?
7. What is the situation with large files w.r.t. how they are stored?
I can think of three ways:
1. A `dag-pb` parent which is a UnixFS `File`, with children being UnixFS `File`s themselves.
2. A `dag-pb` parent which is a UnixFS `File`, with children being UnixFS `Raw` blocks. (This is old)
3. A `dag-pb` parent which is a UnixFS `File`, with children being `raw` blocks.
(From the documentation of the balanced importer for UnixFS, I understand that this can have depth > 1, in which case the intermediary nodes are `dag-pb`-wrapped UnixFS `File`s.)
Which one(s) of these are used, and is it always necessary to have the UnixFS parent block?
((The idea behind my question is something like: Why not construct an empty `dag-pb` or `dag-cbor` or any other MerkleDAG node that links to the parts of the file?))
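The depth > 1 case from the parenthetical above can be sketched numerically. Assuming the balanced layout groups leaves under parents with a maximum fanout (go-ipfs uses `DefaultLinksPerBlock = 174` in its importer helpers, hedged here as an assumption), the tree depth grows logarithmically with file size:

```go
package main

import "fmt"

// dagDepth returns how many levels of intermediary dag-pb File nodes a
// balanced layout needs above the leaves, given a maximum fanout per node.
func dagDepth(leaves, maxLinks int) int {
	depth := 0
	for leaves > 1 {
		leaves = (leaves + maxLinks - 1) / maxLinks // parents for this level
		depth++
	}
	return depth
}

func main() {
	const chunkSize = 256 << 10 // 256 KiB leaves
	fileSize := 10 << 30        // a 10 GiB file
	leaves := (fileSize + chunkSize - 1) / chunkSize
	// assumed fanout of 174 links per dag-pb node
	fmt.Println(leaves, dagDepth(leaves, 174)) // 40960 3
}
```

So a 10 GiB file at 256 KiB per leaf needs three levels of UnixFS `File` parents; a file that fits in 174 leaves or fewer needs only one.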
8. The multicodecs table thingy lists a bunch of different MerkleDAG-related codecs:
- `dag-pb`, as discussed above I think this only ever holds UnixFS data? -> See above
- `dag-cbor` and `dag-json`, which are implemented, but I'm not sure what they're used for. -> See above
- `dag-jose` and `dag-cose`, which I believe are not implemented, is that correct? -> Someone might be working on an implementation
-> Could ask in the PR if someone is working on it. (https://github.com/ipld/specs/pull/269)
9. What's the status of UnixFSv1 symlinks? Metadata?
-> Symlinks work (not via gateway)
metadata: UnixFSv1.5: js-ipfs supports this, go-ipfs does not yet
https://github.com/ipfs/specs/blob/master/UNIXFS.md#metadata (There may be variations to this in the future)
10. Are Bitswap Sessions being used?
-> These might be less about the INV messaging and more about batching up a unit of work that needs to be done
Because of selectors (in the future), the depth of a tree might not matter too much, because we can fetch multiple levels together. (This leads to interesting issues about verification)
Chunking is one decision, building a tree from that is another (arranging), and what kind of nodes to build for that is yet another.
Work on UnixFSv2 is somewhat slow because changing this means we get different hashes.
```
go run ./cmd/stream-dagger/ \
--chunkers="pad-finder_max-pad-run=65535_pad-static-hex=00_static-pad-min-repeats=42_static-pad-literal-max=1285__pigz_min-size=288_max-size=65535_state-mask-bits=10_state-target=0" \
--collectors=shrubber_max-payload=65535_static-pad-repeater-nodes=3_cid-subgroup-target=0_cid-subgroup-min-nodes=3_cid-subgroup-mask-bits=5__fixed-cid-refs-size_max-cid-refs-size=$(( 16 * 1024)) \
--node-encoder=unixfsv1_non-standard-lean-links \
--hash=sha3-512 \
--inline-max-size=0 --multipart
```
```
Ran on 4-core/8-thread Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
Processing took 5.94 seconds using 2.22 vCPU and 55.21 MiB peak memory
Performing 22,922 system reads using 0.18 vCPU at about 172.02 MiB/s
Ingesting payload of: 1,071,741,825 bytes from 1 substreams
Forming DAG covering: 1,072,770,214 bytes of 25,411 logical nodes
Dataset deduped into: 16,203,172 bytes over 407 unique leaf nodes
Linked as streams by: 722,130 bytes over 302 unique DAG-PB nodes
Taking a grand-total: 16,925,302 bytes, 1.58% of original, 63.3x smaller
Roots\Counts\Sizes: 3% 10% 25% 50% 95% | Avg
{1} 1 L3: 1,490 | 1,490
37 L2: 18,209 18,209 18,209 18,209 18,209 | 18,055
260 L1: 168 168 168 208 208 | 199
4 PS: 127 207 207 247 | 197
4 PB: 516 613 1,188 1,285 | 900
403 DB: 13 58 16,437 49,165 65,535 | 40,197
```