# Deterministic CID grouping
Users who want to store relatively small DAGs on Filecoin often find it difficult to have their deals accepted, even with Fil+. One solution to this is grouping multiple root CIDs into a "batching structure", which is then used to make a deal with a specific miner. The downside is that the batching structure naturally has a new root CID of its own, which complicates both discovery and retrieval. Moreover, there are limits to how large such a structure can grow ( in practice ~4GiB or less after deduplication ) before becoming unwieldy in other IPLD contexts ( e.g. IPFS ).
The design of the "batching structure" needs to address the following:
- Must have
- Structure is something existing tooling can understand. This effectively limits it to UnixFSv1.
- Structure must cater to the least-common-denominator among available transports. This means individual blocks must stay within a 1MiB limit.
- Structure must present reasonably browsable names for the individual customer-provided DAG roots within it
- Structure is simple to produce and navigate with minimal or no tooling. This effectively means no IPLD-level sharding
- For the clearly necessary "logical sharding", the choice of indirection should be such that a user, knowing just their own CID and the resulting final bundle-root CID, can craft a selector for partial retrieval of only their content.
- Nice to have
- Structure has a placeholder or mark of some sort, hinting to miners what is and isn't reasonably indexable within it.
- Limits in consideration
- A car file accepted by the Filecoin network can currently have a maximum payload size of about 65,024MiB ( 64<<30 / 128 * 127 ). For the sake of argument let's assume a future with 128GiB sectors, which gives an exact maximum payload size of 136,365,211,648 bytes. Assuming ~60 bytes for a car header, the **shortest safe CID** representation ( a cidv0 sha2-256 at 34 bytes ), and a payload of 1024 bytes per block ( ridiculously small NFTs ), gives us (136365211648 - 60) / ( 2 + 34 + 1024 ) ≈ 128,646,426: an upper bound of ~2^27 individual CIDs that could be in a deal and all be individually addressable.
- The **longest textual representation** of a "tenuously common" CID would be the `b`-multibase base32 representation of a [blake2b-512](https://cid.ipfs.io/#bafymbzacid777777777777777777777777777777777777777777777777777777777777777777777777777777777777777777777777777777), which clocks in at 113 characters. This in turn means that a typical UnixFS directory can contain:
1048576 = ( 4b dir hdr ) + N * ( 2b pbframe hdr + 2b CID prefix + 70b CID + 2b name prefix + 113b text-CID + ~5b prefixed size ), giving N ≈ 5405 such names without sharding, before going over the 1MiB libp2p limit. In order to be super-conservative, assume a target of 2^12 entries per "batch shard". If we go with the more reasonable 256-bit hashes instead, we arrive at ~9363 names, which translates to a 2^13-entry target per batch shard.
- The common textual representation of CIDs is base32, each character of which carries exactly 5 bits. This means that sharding on 2 base32 characters gives a fanout of 32^2 = 2^10 shards, fitting comfortably within the above considerations.
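The arithmetic behind the limits above can be sanity-checked directly (the byte counts are the text's estimates, not exact encodings):

```python
# 1) Per-deal CID upper bound: 128 GiB sector, 127/128 of it usable as
#    payload (Fr32 padding), ~60b CAR header, and per-block cost of
#    2b varint frame + 34b cidv0 + 1024b of data.
max_payload = (128 << 30) // 128 * 127
assert max_payload == 136_365_211_648

max_cids = (max_payload - 60) // (2 + 34 + 1024)
assert max_cids == 128_646_426
assert max_cids < 2 ** 27          # hence the ~2^27 upper bound

# 2) Links per unsharded UnixFS directory, worst-case blake2b-512 CIDs:
#    2b pbframe hdr + 2b CID prefix + 70b CID + 2b name prefix
#    + 113b text-CID name + ~5b size varint, after a 4b directory header.
per_link = 2 + 2 + 70 + 2 + 113 + 5        # 194 bytes per link
max_links = ((1 << 20) - 4) // per_link
assert max_links == 5405                   # matches the figure above
assert max_links > 1 << 12                 # the conservative 2^12 target fits
```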
## Rough spec
The above limits, combined with the near-perfect distribution of hashes in CIDs, mean that one can safely store a "super-sector" full of addressable CIDs using 2 layers of directories: the first layer sharded on 2 base32 characters, the second on 4, and the final container holding the full CIDs pointing to the content ( 2^(10+10+10) > 2^27 ). This does not even take into account the vast overestimation of sector and CID sizes above.
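A quick check of that capacity claim, under the reading that each of the two sublevels contributes 2 fresh base32 characters of fanout and each leaf directory holds on the order of 2^10 links:

```python
# Each pair of base32 characters contributes 32**2 == 2**10 of fanout;
# the second sublevel's 4 characters extend the first sublevel's 2, so it
# adds another 2**10. Leaf directories then hold ~2**10 links each,
# comfortably below the ~5405-link unsharded-directory ceiling.
level_fanout = 32 ** 2
assert level_fanout == 2 ** 10

addressable = level_fanout * level_fanout * (2 ** 10)
assert addressable == 2 ** 30
assert addressable > 2 ** 27       # exceeds the per-deal CID upper bound
```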
Therefore a "common batching directory" has the following rough spec:
- Every CID is represented as base32 cidv1 ( upgrading Qm's... this shouldn't be a problem )
- For browsability/recognizability purposes the **leading** 3 characters of a cidv1 are combined with the 2 and 4 sharding characters for each directory sublevel
This means a directory roughly like:
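A plausible sketch of such a tree, assuming the 2 and 4 shard characters are drawn from the tail of the base32 CID, where the hash digest guarantees uniform distribution (their exact position is an assumption here, not fixed by the spec above):

```
<bundle root>
├── bafdi/                  ← "baf" + 2 shard characters
│   ├── bafbzdi/            ← "baf" + 4 shard characters
│   │   ├── bafybeig…fbzdi  ← full base32 cidv1, linking to a user DAG root
│   │   └── …
│   └── …
└── …
```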
When a user wants to construct a retrieval selector, knowing the DealID and their "starting CID", they reverse the process to determine the 2 individual directory sublevels. In practice all they would need is the sequential numeric DealID assigned at on-chain publish, which in turn contains the final payload CID.
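A minimal sketch of that reverse derivation, again assuming the shard characters are the trailing 2 and 4 characters of the base32 CID (the `shard_path` helper and the character positions are illustrative assumptions, not part of the spec):

```python
# Rebuild the two sharded directory names from a user's own base32 cidv1.
# ASSUMPTION: the 2/4 shard characters are the *trailing* characters of the
# CID (the digest end, where distribution is uniform); the rough spec above
# does not pin their exact position down.
def shard_path(cid_b32: str) -> str:
    lead = cid_b32[:3]                       # "baf" for a typical cidv1
    level1 = lead + cid_b32[-2:]             # first sublevel: 2 shard chars
    level2 = lead + cid_b32[-4:]             # second sublevel: 4 shard chars
    return "/".join((level1, level2, cid_b32))

# An arbitrary illustrative CID:
cid = "bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi"
print(shard_path(cid))   # bafdi/bafbzdi/bafybeig…
```

The three path segments can then be used directly as the named-link steps of a partial-retrieval selector.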
Additionally, this does not preclude a bundler service from adding a top-level METADATA.json or somesuch with various miner instructions ( e.g. what to put in the deal `Label` field ).