Wasm split

Since this is largely about size efficiency, splitting is defined only for the binary format (not WAT).

Design goals:

- The hash of a "top-level split Wasm" should validate the contents of all (recursively) split sections
- The split algorithm is deterministic, so signatures/hashes of both the original and the split content can be validated
- Split Wasm can be spliced without knowing the canonical split algorithm
- The original size can be determined from the split container (allowing space to be pre-allocated while splicing)
- Split "fragments" contain just the "main content" of their original section/segment

Candidate splits

Components
- Components (recursive)
- Custom sections
- Core modules
  - Data segments
  - Custom sections
- Core data segments (if added)

Proposal

TBD: preamble layer "split" bit

Split algorithm

Taking binary Wasm as input:

Set the preamble layer "split" bit and emit the updated preamble
For each section:
- If the section is not to be split:
  - Emit the section, unchanged
- If the section is to be split:
  - Hash the section's "content" (see "Split sections") and write the content to a store, keyed by the digest
  - Emit the corresponding "split section"
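
The steps above can be sketched in Python. This is a minimal illustration under several assumptions, not a definitive implementation: the split section id (0x7F) and the position of the "split" bit (top bit of the preamble's last byte) are hypothetical placeholders for values this proposal leaves TBD, the whole section body is treated as the hashed "content", and only the component and core module cases (whose split payload is a bare typeddigest) are covered. The should_split callback and the dict-based store are also illustrative names.

```python
import hashlib

SPLIT_SECTION_ID = 0x7F  # HYPOTHETICAL: the real split section id (X) is TBD
SPLIT_BIT = 0x80         # HYPOTHETICAL: top bit of the preamble's last byte

def encode_u32(value: int) -> bytes:
    """Unsigned LEB128, the encoding Wasm uses for u32 values."""
    out = bytearray()
    while True:
        byte, value = value & 0x7F, value >> 7
        out.append(byte | 0x80 if value else byte)
        if not value:
            return bytes(out)

def decode_u32(buf: bytes, pos: int):
    """Decode unsigned LEB128 at buf[pos]; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7

def split(wasm: bytes, should_split, store: dict) -> bytes:
    """Split selected top-level sections; content is stored keyed by typeddigest."""
    out = bytearray(wasm[:8])
    out[7] |= SPLIT_BIT                       # set the "split" bit, emit preamble
    pos = 8
    while pos < len(wasm):
        start, sec_id = pos, wasm[pos]
        size, pos = decode_u32(wasm, pos + 1)
        content = wasm[pos:pos + size]
        pos += size
        if not should_split(sec_id):
            out += wasm[start:pos]            # emit the section unchanged
            continue
        digest = b"\x00" + hashlib.sha256(content).digest()  # typeddigest
        store[digest] = content
        body = bytes([sec_id]) + encode_u32(size) + digest
        out += bytes([SPLIT_SECTION_ID]) + encode_u32(len(body)) + body
    return bytes(out)
```

Calling split(wasm, lambda sec_id: sec_id == 1, store) on a component would split out its nested core modules and leave every other section inline. Recursive splitting of the stored content is deliberately left out of this sketch.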

Digest calculation algorithm

Digest calculation is a variation of the split algorithm in which every possible (recursive) section split is performed. Split contents do not need to be stored; they only need to be included in the recursive digest calculations.
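
A rough Python sketch of the recursive idea follows. It makes strong simplifying assumptions that the proposal does not confirm: the split section id (0x7F) and "split" bit position are hypothetical placeholders, only nested core modules (id 1) and components (id 4) inside a component are treated as splittable, and a nested binary's digest is taken over its own fully split form, preamble included. Custom sections and data segments, which the full algorithm would also have to split, are left inline here.

```python
import hashlib

SPLIT_SECTION_ID = 0x7F  # HYPOTHETICAL: the real split section id (X) is TBD
SPLIT_BIT = 0x80         # HYPOTHETICAL: top bit of the preamble's last byte
NESTED_IDS = {1, 4}      # core module and component sections inside a component

def encode_u32(value: int) -> bytes:
    """Unsigned LEB128, the encoding Wasm uses for u32 values."""
    out = bytearray()
    while True:
        byte, value = value & 0x7F, value >> 7
        out.append(byte | 0x80 if value else byte)
        if not value:
            return bytes(out)

def decode_u32(buf: bytes, pos: int):
    """Decode unsigned LEB128 at buf[pos]; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7

def canonical_digest(wasm: bytes) -> bytes:
    """typeddigest of `wasm` as if every nested module/component were split."""
    out = bytearray(wasm[:8])
    out[7] |= SPLIT_BIT
    is_component = wasm[6] == 1   # layer field: 1 for components, 0 for core
    pos = 8
    while pos < len(wasm):
        start, sec_id = pos, wasm[pos]
        size, pos = decode_u32(wasm, pos + 1)
        content = wasm[pos:pos + size]
        pos += size
        if is_component and sec_id in NESTED_IDS:
            # Recurse: nothing is stored, the nested digest is just folded in.
            body = bytes([sec_id]) + encode_u32(size) + canonical_digest(content)
            out += bytes([SPLIT_SECTION_ID]) + encode_u32(len(body)) + body
        else:
            out += wasm[start:pos]
    return b"\x00" + hashlib.sha256(bytes(out)).digest()
```

Because every allowed split is always performed, two implementations that chose different splits for storage would still compute the same digest for the same original binary.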

Splice algorithm

Taking binary Wasm as input:

Check the preamble layer "split" bit
- If unset, emit the entire binary and quit
- If set, unset it and emit the updated preamble
For each section:
- If the section is not a "split section":
  - Emit the section unchanged and continue with the next section
- If the section is a "split section":
  - Reconstruct and emit the corresponding section header
  - Look up the section contents by digest and emit them
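
The splice steps can be sketched in Python as follows. The split section id (0x7F) and the "split" bit position are hypothetical placeholders for values the proposal leaves TBD, the store is assumed to be a dict keyed by the encoded typeddigest, and only split sections whose payload is a bare typeddigest (components and core modules) are handled. The sketch does not recurse into spliced content.

```python
import hashlib

SPLIT_SECTION_ID = 0x7F  # HYPOTHETICAL: the real split section id (X) is TBD
SPLIT_BIT = 0x80         # HYPOTHETICAL: top bit of the preamble's last byte

def encode_u32(value: int) -> bytes:
    """Unsigned LEB128, the encoding Wasm uses for u32 values."""
    out = bytearray()
    while True:
        byte, value = value & 0x7F, value >> 7
        out.append(byte | 0x80 if value else byte)
        if not value:
            return bytes(out)

def decode_u32(buf: bytes, pos: int):
    """Decode unsigned LEB128 at buf[pos]; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7

def splice(wasm: bytes, store: dict) -> bytes:
    """Rebuild the original binary from a split binary and a content store."""
    if not wasm[7] & SPLIT_BIT:
        return wasm                            # not split: emit as-is
    out = bytearray(wasm[:8])
    out[7] &= 0xFF ^ SPLIT_BIT                 # unset the "split" bit
    pos = 8
    while pos < len(wasm):
        start, sec_id = pos, wasm[pos]
        size, pos = decode_u32(wasm, pos + 1)
        end = pos + size
        if sec_id != SPLIT_SECTION_ID:
            out += wasm[start:end]             # ordinary section, unchanged
        else:
            orig_id = wasm[pos]
            orig_size, digest_at = decode_u32(wasm, pos + 1)
            digest = wasm[digest_at:end]
            content = store[digest]
            actual = b"\x00" + hashlib.sha256(content).digest()
            if len(content) != orig_size or actual != digest:
                raise ValueError("stored content does not match split section")
            # Reconstruct the section header, then emit the content.
            out += bytes([orig_id]) + encode_u32(orig_size) + content
        pos = end
    return bytes(out)
```

Note that splice needs only the recorded id, size, and digest; it never has to know which sections a splitter is allowed to split, which matches the design goal of splicing without knowing the canonical split algorithm.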

Typed hash digest

Content is hashed using a supported hash algorithm to produce a content digest. This is encoded along with an identifier of the hash algorithm:

typeddigest  ::= 0x00 sha256digest:byte[32]

For split sections with optional splitting of inner sections/segments, the typeddigest must be computed "as if" all allowed splitting has been done.
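
As a small illustration of the typeddigest encoding, the Python sketch below prefixes a SHA-256 digest with the 0x00 algorithm identifier from the grammar and parses it back. The function names are illustrative only, and the sketch hashes the bytes it is given; producing the "as if fully split" form of nested content is outside its scope.

```python
import hashlib

SHA256_ID = 0x00   # algorithm identifier from the typeddigest grammar
SHA256_LEN = 32

def typed_digest(content: bytes) -> bytes:
    """Return the algorithm id byte followed by the 32-byte SHA-256 digest."""
    return bytes([SHA256_ID]) + hashlib.sha256(content).digest()

def parse_typed_digest(buf: bytes, pos: int = 0):
    """Return (raw digest, next position); reject unknown algorithm ids."""
    end = pos + 1 + SHA256_LEN
    if end > len(buf):
        raise ValueError("truncated typeddigest")
    if buf[pos] != SHA256_ID:
        raise ValueError("unsupported hash algorithm id")
    return buf[pos + 1:end], end
```

Keeping the identifier byte in front of the digest leaves room for additional hash algorithms without changing the surrounding split section layout.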

Split section

Split section ID (X) TBD; should be reserved in component and core specs

splitsection_N(S) ::= section_X(
                        sectionid:N
                        sectionsize:u32
                        S
                      )

sectionid and sectionsize record the original section's id and size
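
The production above could be serialized as in the following Python sketch. It assumes the usual Wasm section framing (one id byte, a LEB128 u32 size, then the body), encodes sectionid as a single byte, and uses 0x7F as a hypothetical stand-in for the TBD id X.

```python
SPLIT_SECTION_ID = 0x7F  # HYPOTHETICAL: the real split section id (X) is TBD

def encode_u32(value: int) -> bytes:
    """Unsigned LEB128, the encoding Wasm uses for u32 values."""
    if not 0 <= value < 2**32:
        raise ValueError("value out of u32 range")
    out = bytearray()
    while True:
        byte, value = value & 0x7F, value >> 7
        out.append(byte | 0x80 if value else byte)
        if not value:
            return bytes(out)

def encode_split_section(orig_id: int, orig_size: int, payload: bytes) -> bytes:
    """Encode splitsection_N(S): section_X(sectionid sectionsize S)."""
    body = bytes([orig_id]) + encode_u32(orig_size) + payload
    return bytes([SPLIT_SECTION_ID]) + encode_u32(len(body)) + body
```

For example, encode_split_section(1, 300, digest) with a 33-byte typeddigest yields a 38-byte section, regardless of how large the original core module was; the recorded size of 300 is what lets a splicer pre-allocate space.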

Component

componentsplit  ::= splitsection_4(typeddigest)

Core module

coremodulesplit ::= splitsection_1(typeddigest)

Custom

customsplit     ::= splitsection_0(customsplitdata)
customsplitdata ::= customname:name typeddigest

customname is copied from the original custom section's name
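
A minimal Python sketch of building customsplitdata from an original custom section body is shown below. It assumes the hashed "main content" is everything after the length-prefixed name, which this proposal implies but does not state outright; the function name is illustrative.

```python
import hashlib

def decode_u32(buf: bytes, pos: int):
    """Decode unsigned LEB128 at buf[pos]; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7

def custom_split_data(section_body: bytes):
    """Return (customsplitdata, fragment) for a custom section's body."""
    name_len, name_start = decode_u32(section_body, 0)
    name_end = name_start + name_len
    name = section_body[:name_end]        # length-prefixed name, copied verbatim
    fragment = section_body[name_end:]    # ASSUMED "main content": bytes after the name
    digest = b"\x00" + hashlib.sha256(fragment).digest()   # typeddigest
    return name + digest, fragment
```

Keeping the name inline means tools can still identify a custom section (for example "name" or "producers") without fetching its content from the store.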

Data

datasplit ::= splitsection_11(segments:datasegmentsplitopt*)

datasegmentsplitopt ::= 0:u8 segmentdata:vec<byte>
                      | 1:u8 segmentheader:vec<byte> 
                             segmentdatasize:u32
                             segmentdatadigest:typeddigest

The first variant of datasegmentsplitopt lets a splitting implementation leave a data segment inline (segmentdata), but this variant may not be used when calculating digests for the parent core module.
segmentheader records the original data segment's "header": the segment type and any other fields that come before the segment's actual data. This allows the splice algorithm to mechanically reconstruct segments without "knowing" about segment types.
segmentdatasize and segmentdatadigest record the original segment data's size and digest
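
The two datasegmentsplitopt variants, and the mechanical reconstruction they enable, might look like this in Python. The sketch assumes vec<byte> is a LEB128 u32 length followed by the bytes, and that a data segment ends with its data as such a vector (true of the current core data segment forms); the function names and the dict-based store are illustrative.

```python
import hashlib

def encode_u32(value: int) -> bytes:
    """Unsigned LEB128, the encoding Wasm uses for u32 values."""
    out = bytearray()
    while True:
        byte, value = value & 0x7F, value >> 7
        out.append(byte | 0x80 if value else byte)
        if not value:
            return bytes(out)

def encode_vec(data: bytes) -> bytes:
    """vec<byte>: u32 length prefix followed by the bytes."""
    return encode_u32(len(data)) + data

def encode_inline(segment: bytes) -> bytes:
    """Variant 0: the segment is left inline."""
    return b"\x00" + encode_vec(segment)

def encode_split(header: bytes, data: bytes) -> bytes:
    """Variant 1: header kept inline, data replaced by its size and typeddigest."""
    digest = b"\x00" + hashlib.sha256(data).digest()
    return b"\x01" + encode_vec(header) + encode_u32(len(data)) + digest

def rebuild_segment(header: bytes, data_size: int, digest: bytes, store: dict) -> bytes:
    """Splice side: header + length-prefixed data, with no knowledge of segment types."""
    data = store[digest]
    if len(data) != data_size:
        raise ValueError("stored segment data has the wrong size")
    return header + encode_u32(data_size) + data
```

For a passive segment, the header is just the single byte 0x01, so rebuild_segment(b"\x01", 5, digest, store) with b"hello" in the store returns the original five-byte segment preceded by its mode byte and length.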

Other thoughts

MIME Types

You could think of "split" Wasm as just another kind of binary Wasm with particular features, in which case it could be covered by the existing application/wasm type. However, it may be useful to allow consumers to differentiate between "split" and "unsplit" forms. We could specify a separate type (e.g. application/split+wasm) or a parameter (e.g. application/wasm; split=1) to distinguish between them.