# Hyperlight Sandbox Images — Proposed Design
## Overview
### Why Sandbox Images?
Hyperlight sandboxes are fast to create, but runtime initialisation — booting a language runtime (QuickJS, Wasmtime), loading modules, parsing code — can take tens of milliseconds. For workloads that create thousands of sandboxes or need sub-millisecond cold start, this initialisation cost dominates.
Hyperlight's [forthcoming](https://github.com/hyperlight-dev/hyperlight/pull/1373) snapshot files (`.hls`) solve part of this: you can snapshot an initialised sandbox and restore it later, skipping the boot+init phase entirely. But these snapshots are monolithic standalone files — they capture everything into a single blob with no structure. This creates several problems:
- **No composability** — each configuration (different JS handlers, different WASM modules) needs its own full snapshot. There's no way to say "start from this base and apply these changes."
- **No sharing** — three module configurations mean three copies of the base runtime on disk and in memory. The OS can't deduplicate across different files.
- **No distribution story** — you have opaque binary files with no standard way to push, pull, version, or share them across machines.
- **No security ecosystem** — no signing, no provenance, no access control. Just files on disk.
**Hyperlight Sandbox Images** solve all of these by packaging sandbox state as [OCI Images](https://github.com/opencontainers/image-spec) — the same standard used by container images. Each memory region (snapshot, scratch/diff, mapped modules) becomes a separate OCI layer, stored as a raw mmap-able blob in a content-addressed blob store. Metadata (sregs, layout, file mappings) lives in a JSON config blob. One format, used everywhere — local save, file transfer, registry push/pull.
This design delivers:
- **Composable snapshots (diffs)** — save a base runtime once, then capture small diffs for each specialisation. Diffs are separate layers that reference the shared base by content digest. For example, 100 hyperlight-wasm sandboxes with 3 different WASM modules/components loaded all share one base Wasmtime runtime blob in the OS page cache.
- **Revert to saved state** — after executing untrusted code, revert to a non-zero pre-call state instead of just zeroing scratch. Currently, KVM uses `MADV_DONTNEED` (which reverts to zeroes) and MSHV/WHP use a `fill(0)` memset (O(scratch_size)). File-backed mmap changes what revert *restores to*: `MADV_DONTNEED` on a file-backed mapping faults back from file content, not zeroes. The main performance win is on MSHV/WHP, where re-mmap replaces the O(scratch_size) memset with an O(1) remap.
- **Efficient sharing** — OCI's content-addressed blob store means identical content is stored once. The base snapshot, each module, and each diff are separate blobs. Same digest = same file = shared page cache. No custom cache, no reflink/copy logic.
- **Distribution for free** — any OCI-compliant tool ([ORAS](https://oras.land/), `crane`, `skopeo`, etc.) or the `oci-distribution` crate can push/pull sandbox images. Pulling a new specialisation when the base is already cached = download only the small diff layer.
- **Security ecosystem** — image signing (cosign/notation), provenance attestation (SLSA), vulnerability scanning, registry auth (OAuth2/RBAC), and admission control (Gatekeeper/Kyverno) all work out of the box because the format is OCI. These are the host's responsibility to integrate; we provide the format that makes it possible.
- **Tooling compatibility** — `oras discover`, `crane manifest`, or any OCI-aware tool can examine sandbox images. No custom CLI, no proprietary format.
> **Relationship to PR #1373**: This proposal builds directly on the snapshot foundations laid in PR #1373. The core snapshot mechanics — `Snapshot::new()` walking page tables, collecting CoW'd pages, rebuilding PTEs, capturing sregs — are all preserved and reused. Nothing is thrown away. What changes is the **packaging**: instead of writing the memory blob into a standalone `.hls` file with a custom binary header, we write it as an OCI layer blob and describe it with a JSON config. The snapshot engine stays the same; it just gets a better container.
### What
Hyperlight separates immutable state (snapshot region, low GPAs) from mutable state (scratch region, high GPAs). All guest writes go through copy-on-write (CoW) into scratch — the snapshot blob is never modified.
All sandbox artifacts are packaged as [OCI Image Layout](https://github.com/opencontainers/image-spec/blob/main/image-layout.md) directories. Each memory region is a separate OCI layer — a raw, uncompressed, directly mmap-able blob. Metadata lives in the OCI config JSON. This means every artifact is natively distributable via any OCI-compliant registry with zero format conversion.
Diff snapshots capture the scratch region state relative to a base snapshot. Loading is done by mmap'ing the diff blob directly as the scratch memory slot. Reverting to pre-call state is achieved via `MADV_DONTNEED` (KVM) or re-mmap (MSHV/WHP). No deltas, no stacking, no merge logic. Every diff is a complete scratch dump, directly mmap-able.
For terminology used in this document (GPA, GVA, CoW, PEB, EPT/NPT, scratch region), see the [glossary](glossary.md) and [hyperlight execution details](hyperlight-execution-details.md).
### Why OCI?
Using OCI as the native artifact format — not just a distribution mechanism — was a deliberate choice. The alternative would be to design custom binary file formats for each artifact type and build a bespoke content-addressed cache, then bolt on OCI distribution later. That approach creates two problems: you build and maintain custom infrastructure that OCI already provides, and you inevitably face a format migration when distribution becomes a requirement.
| Concern | Custom format approach | OCI-native approach |
|---|---|---|
| Local storage | Proprietary binary files + custom cache directory | OCI layout directory with `blobs/sha256/` |
| Layer sharing | Custom content-addressed cache with reflink/copy | Content-addressed by construction — same digest = same file |
| Distribution | Build your own, or bolt on OCI later | Already IS OCI — any OCI client works day one |
| Signing | Separate effort, deferred | cosign/notation work now (host integrates) |
| Provenance | Not possible with proprietary format | SLSA attestations attach as OCI referrers |
| Tooling | Custom CLI for everything | ORAS, crane, or any OCI client works out of the box |
| Format migration | Inevitable when distribution arrives | Never — one format from day one to production |
**OCI Image Layout unlocks capabilities we need today — content-addressed sharing for density, composable layers for fast specialisation, and a security/distribution ecosystem for production readiness**. These aren't future benefits we're pre-investing in; they're requirements for meeting performance, efficiency, and scale goals now. The fact that we also unlock registry integration for free is not the sole motivation.
Registry integration (auth, signing, provenance) is the **host's responsibility**. Hyperlight produces and consumes OCI Image Layouts. The host application decides whether to use a registry, how to authenticate, whether to verify signatures, and which provenance attestations to require. Because the format is natively OCI, all of this is possible — no custom protocol, no proprietary format standing in the way.
### Relationship to the Snapshot-First Lifecycle Requirements
This design doc implements the artifact format, persistence, and revert mechanics described in the [Snapshot-First Sandbox Lifecycle Requirements Document](https://hackmd.io/@8sYwbcKoR36MNDyyFvtecQ/HJnxgWyqWe). It does NOT cover everything in that requirements doc — the following are out of scope here but still required:
- **Unified `Sandbox` type and builder API** (R10) — replacing `UninitializedSandbox` / `MultiUseSandbox` / `evolve()` with a single `Sandbox` type constructed via `SandboxBuilder`. This is a sequencing problem: API examples in this doc use current type names (`MultiUseSandbox::from_oci()`, `MultiUseSandbox::save_diff()`). When the unified type lands, these become methods on `Sandbox` / `SandboxBuilder`. The format and mechanics are independent of which type exposes them.
- **CLI tooling** (R2, R7.3) — `hl snap bake`, `hl snap push/pull/inspect`. Not covered here. The OCI format makes these straightforward to implement as a separate crate or binary.
- **`map_external_file` API rename** (R11.3) — the requirements doc unifies `map_file_cow()` / `map_region()` into `map_external_file(path, gpa, expected_hash)`. This doc uses the current `map_file_cow()` name. The rename is an API surface change, not a format change.
- **Heap grow-only append** (R6) — extending heap beyond snapshot size at load time. Orthogonal to the image format.
- **Metrics and tracing** (R8) — snapshot load/save/revert timings. Will be added during implementation.
**Intentional divergences** from the requirements doc:
| Topic | Requirements Doc | This Design | Rationale |
|---|---|---|---|
| **Artifact format** | Single-file custom binary with magic bytes + blake3 hash (R1.1-R1.3) | OCI Image Layout with sha256 digests | OCI-native eliminates the need for a custom format, custom cache, and a format migration when distribution becomes a requirement. See [Why OCI?](#why-oci) |
| **Sregs** | Not persisted — reconstructed via `standard_64bit_defaults()` (R1.4) | Persisted in config JSON per hypervisor | PR #1373 showed that sregs must be persisted: guests can modify control register bits, and different hypervisors have different sreg layouts. Reconstruction is only safe for a strict subset of configurations |
| **Layer stacking** | Full overlay stack `[base, overlay_1, overlay_2, ...]` (R5.1) | One base + at most one diff | Keeps load and revert O(1). Stacking is deferred — see [Excluded](#excluded-future-work) for conditions under which it could be revisited |
| **Layer composition** | `memcpy` dirty pages from overlays on load (R5.3) | Direct mmap of scratch blob — no copy | File-backed mmap avoids the composition step entirely. The diff IS the scratch region, directly mmap-able |
## Design Principles
1. **OCI Image Layout as native format** — all artifacts (base, diff, cow mapped files) are OCI images with raw blob layers. One format for local use and registry distribution. No custom file formats, no custom cache.
2. **Directly mmap-able layers** — OCI layers are uncompressed raw blobs on disk, page-aligned and ready for file-backed CoW mapping (`mmap MAP_PRIVATE` on Linux, `MapViewOfFile(FILE_MAP_COPY)` on Windows). Compression is only used for registry transfer.
3. **Diff layers are captured directly from the scratch region** — the scratch region already contains all mutable guest state (CoW'd pages, page tables, allocator state). Saving a diff is just dumping scratch to a blob — no diffing algorithm, no delta computation. Each diff is a **complete, standalone scratch dump**, not a delta against a previous diff. Diffs are never stacked or merged: loading always takes one base + at most one diff. The diff layer is optional — a base-only or base + modules image works without one. To capture a new state, save a new diff — it replaces the previous one, not layers on top of it. To build on an intermediate state (e.g., base runtime → framework → handler), flatten the intermediate into a new base image, then diff from there. See [Excluded: Stacked Diffs](#excluded-future-work) for the rationale and conditions under which this could be revisited.
4. **Diff layers are sparse on disk** — a diff layer's logical size spans the full scratch region, but only pages below the allocator watermark (`scratch_used`) consume disk blocks; unused space is a sparse hole.
5. **Layers can be flattened** — a multi-layer image (base + diff + mapped files) can be collapsed into a single-layer image containing all state. This produces a self-contained, portable image at the cost of losing layer sharing. Useful for simple deployment or when the host doesn't need to manage shared blob stores.
6. **Simple save/load API** — saving and loading is as simple as providing a path. Hyperlight supports two standard OCI forms:
- **OCI layout directory** (primary) — e.g., `./my-sandbox/`, addressed by OCI tools as `oci:./my-sandbox/`. This is the runtime format: blobs are individual files, directly mmap-able, supports layer sharing.
- **OCI archive** (convenience) — e.g., `./my-sandbox.tar`, addressed by OCI tools as `oci-archive:./my-sandbox.tar`. Standard tar of the layout directory, no compression. For easy file transfer without a registry. On load, extracted to a temporary directory first. No custom file extension — standard OCI archive format that ORAS, `skopeo`, and `crane` already understand.
7. **In-memory snapshots remain available** — the existing `snapshot()` / `restore()` mechanism still works for transient state capture without touching disk. `save_diff()` is for when you want to persist state as an OCI image for sharing, distribution, or cross-process reuse. The two mechanisms are complementary: use in-memory snapshots for fast runtime checkpoints within a single process, OCI images for durable, shareable state.
## Trade-offs and Drawbacks
Adopting OCI Image Layout as the native format is not without costs. These are the known trade-offs we're accepting:
| Drawback | Impact | Mitigation |
|---|---|---|
| **Directory, not a single file** | An OCI layout is a directory tree, not a single file like `.hls`. Less convenient for simple file transfer. Requires the host to have writable local storage for the layout. | OCI archive (`.tar`) provides a single-file transfer format. The host extracts to a directory before loading. Standard OCI workflow. |
| **sha256 hashing cost on save** | Every blob must be sha256'd to compute its digest. sha256 is ~3-5x slower than blake3 for large blobs. For a 256 MB snapshot blob, this could add 50-100ms to save time. | Save is an offline operation — not on the hot path. If sha256 becomes a measured bottleneck, we could explore caching digests or parallel hashing. |
| **More files per artifact** | A base-only image has 5 files (oci-layout, index.json, manifest blob, config blob, snapshot blob) vs 1 file for `.hls`. | The extra files are small JSON metadata (< 1 KB each). The blob is the same size either way. Inode overhead is negligible. |
| **No mmap direct from image** | Cannot mmap directly from an OCI archive (tar). Must extract first. This requires write access to local storage and adds first-load latency. | The host has full flexibility: pre-extract the base image once at deploy time (amortised to zero), then load only new diff/module layers dynamically (small, fast extractions since the base is already cached). Or extract everything upfront. Or accept the extraction cost at runtime for simplicity. The host decides based on their latency vs. convenience trade-off. |
| **OCI spec complexity** | The OCI Image spec has concepts (manifests, descriptors, media types, annotations) that are more complex than a flat binary header. | We use a minimal subset of the spec. The implementation is a single module (`oci_layout.rs`) that reads/writes the subset we need. No full OCI runtime dependency. |
## Architecture
### What's in a Sandbox Image?
A Hyperlight Sandbox Image is an [OCI Image Layout](https://github.com/opencontainers/image-spec/blob/main/image-layout.md) directory containing up to three types of layer, plus a config blob:
```
my-sandbox/
oci-layout ← {"imageLayoutVersion": "1.0.0"}
index.json ← points to manifest(s)
blobs/
sha256/
<manifest-digest> ← OCI manifest JSON
<config-digest> ← Hyperlight config JSON (sregs, layout, file mappings)
<snapshot-blob-digest> ← raw snapshot memory blob (Layer 0, mmap-able)
<module-blob-digest> ← raw .aot module (or other CoW mapped file) (Layer 1, optional, mmap-able)
<scratch-blob-digest> ← raw scratch/diff memory blob (Layer N, optional, mmap-able)
```
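The directory tree above can be sketched in code. The following is an illustrative, self-contained sketch (not the Hyperlight implementation) of writing the minimal OCI Image Layout skeleton with `std::fs`; the `index.json` content here is a placeholder — the real file carries a descriptor pointing at the image manifest blob, and all blobs are named by their sha256 content digest:

```rust
use std::fs;
use std::path::Path;

/// Write the minimal file skeleton of an OCI Image Layout directory.
/// Illustrative only: digests and manifest descriptors are elided.
fn write_layout_skeleton(root: &Path) -> std::io::Result<()> {
    // Blob store: every layer, config, and manifest lands under blobs/sha256/.
    fs::create_dir_all(root.join("blobs/sha256"))?;
    // Marker file required by the OCI Image Layout spec.
    fs::write(root.join("oci-layout"), r#"{"imageLayoutVersion": "1.0.0"}"#)?;
    // Top-level index; in a real image this references the manifest by digest.
    fs::write(
        root.join("index.json"),
        r#"{"schemaVersion": 2, "manifests": []}"#,
    )?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let root = std::env::temp_dir().join("hl-oci-skeleton-demo");
    write_layout_skeleton(&root)?;
    assert!(root.join("oci-layout").is_file());
    assert!(root.join("blobs/sha256").is_dir());
    println!("layout skeleton written to {}", root.display());
    Ok(())
}
```

Any OCI-aware tool that understands `oci:` transport can address a directory with this shape.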
**Layer types**:
| Layer Type | Media Type | Purpose | When present |
|---|---|---|---|
| Snapshot | `application/vnd.hyperlight.snapshot.v1` | Initialised VM memory — code, heap, globals, PTEs | Always |
| Mapped file | `application/vnd.hyperlight.mapped-file.v1` | External file (`.aot` module, component) mapped at a specific GPA | When sandbox has `map_file_cow` regions |
| Scratch (diff) | `application/vnd.hyperlight.scratch.v1` | Mutable scratch region state — CoW'd pages, page tables, allocator | When saving a diff on top of a base |
All layers are **uncompressed raw blobs** — directly mmap-able from disk. Compression is only used for registry transfer, not local storage.
**Image products** — all the same OCI format, different layer combinations:
| Product | Layers | Use case |
|---|---|---|
| **Base snapshot** | Snapshot only | Initialised runtime, ready to specialise |
| **Base + modules** | Snapshot + mapped file(s) | Runtime with modules mapped, no scratch state captured yet |
| **Base + diff** | Snapshot + scratch | Specialised sandbox (e.g., handlers loaded), base shared across configs |
| **Base + modules + diff** | Snapshot + mapped file(s) + scratch | Fully specialised sandbox with module/file sharing |
| **Flattened** | Single snapshot (all state baked in) | Portable, self-contained, no layer sharing |
**Config blob** (JSON) describes how to reconstitute the sandbox from its layers:
```json
{
"hyperlight_version": "0.5.0", // Hyperlight version that produced this image
"arch": "x86_64", // Target architecture (x86_64, aarch64)
"abi_version": 1, // Guest ABI version — must match between base and diff
"hypervisor": "kvm", // Hypervisor that captured sregs (kvm, mshv, whp)
"scratch_size": 268435456, // Total scratch region size in bytes (must match base layout)
"scratch_used": 2097152, // Bytes of scratch actually used (derived from allocator_gpa at save time)
"pt_size": 16384, // Page table size within scratch
"allocator_gpa": "0xFFFFFFFFFFE02000", // Bump allocator position (read from scratch bookkeeping at save time)
"snapshot_generation": 1, // Counter incremented on each save — used to detect stale scratch bookkeeping on restore
"entrypoint": "RunFromSnapshot", // Next action on restore (RunFromSnapshot or CallGuestFunction)
"stack_top_gva": "0xDFFFEFFF", // Guest virtual address of stack top
"sregs": { ... }, // Hypervisor-specific special registers (format matches PR #1373)
"host_functions": [ ... ], // Names and signatures of registered host functions (validated on load)
"file_mappings": [ // map_file_cow regions — one entry per mapped file layer
{
"layer_index": 1, // Index into the OCI manifest's layer array
"guest_base": "0x100000000", // GPA where this file is mapped in the guest
"size": 4194304 // Expected file size (sanity check on load)
}
]
}
```
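Several of these fields exist so the loader can reject incompatible base/diff pairings up front. The following sketch shows what such a compatibility check might look like; the struct and function names are hypothetical (they mirror the JSON schema above, not Hyperlight's actual types):

```rust
/// Hypothetical mirror of a few config-blob fields relevant to load-time checks.
#[derive(Debug, Clone)]
struct ImageConfig {
    hypervisor: String, // "kvm" | "mshv" | "whp" — sregs are hypervisor-specific
    arch: String,       // "x86_64" | "aarch64"
    abi_version: u32,   // guest ABI — must match between base and diff
    scratch_size: u64,  // must match the base layout exactly
}

/// Illustrative validation: a diff blob is a complete scratch dump, so its
/// layout parameters must agree with the base it will be mapped over.
fn check_diff_compatible(base: &ImageConfig, diff: &ImageConfig) -> Result<(), String> {
    if base.arch != diff.arch {
        return Err(format!("arch mismatch: {} vs {}", base.arch, diff.arch));
    }
    if base.abi_version != diff.abi_version {
        return Err("guest ABI version mismatch between base and diff".into());
    }
    if base.scratch_size != diff.scratch_size {
        return Err("scratch_size mismatch: diff cannot be mapped over this base".into());
    }
    if base.hypervisor != diff.hypervisor {
        // Memory layers are portable across hypervisors; the sregs in the
        // config blob are not.
        return Err("sregs were captured on a different hypervisor".into());
    }
    Ok(())
}

fn main() {
    let base = ImageConfig {
        hypervisor: "kvm".into(),
        arch: "x86_64".into(),
        abi_version: 1,
        scratch_size: 256 * 1024 * 1024,
    };
    let mut diff = base.clone();
    assert!(check_diff_compatible(&base, &diff).is_ok());
    diff.abi_version = 2; // e.g. diff saved against a newer guest ABI
    assert!(check_diff_compatible(&base, &diff).is_err());
    println!("compatibility checks behave as expected");
}
```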
All metadata lives in this JSON — no binary headers, no magic bytes. The layer blobs are pure data.
### How Layers Map to VM Memory
Each layer in the image maps to a hypervisor memory slot. Layers are loaded by mmap'ing the blob file directly — file-backed, lazy-faulted, shared via page cache:
| Layer | Mapping | Slot | GPA Range | Access |
|---|---|---|---|---|
| Snapshot blob | File-backed CoW¹ | Snapshot slot | `0x1000` → snapshot end | Immutable (writes redirect to scratch via guest CoW) |
| Mapped file(s) | `map_file_cow()` | One slot per file | Per config | Read + Execute, immutable |
| Scratch blob | File-backed CoW¹ | Scratch slot | `scratch_base_gpa` → `MAX_GPA` | Mutable (guest writes here) |
¹ Linux: `mmap MAP_PRIVATE`. Windows: `CreateFileMappingA(PAGE_WRITECOPY)` + `MapViewOfFile(FILE_MAP_COPY)`. Both provide copy-on-write semantics — writes go to private pages, the underlying blob file is never modified.
If no scratch layer is present (base-only image), the scratch region starts as anonymous zero-filled memory.
### Mapped Files (`map_file_cow`)
`map_file_cow()` maps external files (e.g., WASM `.aot` modules) at separate GPAs as read-only + executable memory, each in its own hypervisor memory slot. Mapped files are independent of scratch/diff layers — a sandbox image can have mapped files with or without a scratch layer. They are stored as separate OCI layers, addressed by sha256 digest. On load, each file is resolved from the blob store and re-mapped via `map_file_cow()`. On revert, mapped files are immutable and untouched.

**Flatten (alternative)**: Collapses all layers into a single blob — portable but loses sharing. Use when portability matters more than density.
### Layer Sharing
Since layers are OCI blobs addressed by sha256 digest, sharing is automatic:
- **Same base across images** — multiple sandbox images referencing the same snapshot blob share one file on disk and one set of page cache pages
- **Same module across images** — 100 sandboxes loading the same `.aot` module blob share one file in the page cache
- **Same diff reused** — loading the same scratch blob across sandboxes shares the file backing (each gets isolated writes via CoW)
This is the core density advantage: instead of N monolithic files duplicating shared content, N images reference shared blobs. The OS page cache handles the rest.
### Guest Memory Layout (x86-64)
> **Note**: x86-64 only initially. aarch64 support is deferred — `scratch_base_gpa()` is currently `unimplemented!()` on that architecture. The config JSON includes an `arch` field for forward compatibility.
The 64 GiB guest physical address space (`MAX_GPA = 0xF_FFFF_FFFF`) is split into two non-overlapping regions. The snapshot region starts at the bottom (`0x1000`) and the scratch region is placed at the top. For example, a 256 MB scratch region starts at GPA `0xF_F000_0000` (i.e., `MAX_GPA - scratch_size + 1`).
**Snapshot slot** (immutable, file-backed from snapshot blob):
| Section | Access | Notes |
|---|---|---|
| Code | R/X | Guest executable code |
| PEB | CoW | Process Environment Block — writes redirect to scratch |
| Heap | CoW | Guest heap — writes redirect to scratch |
| Init data | R/O | Initialisation data |
**Scratch slot** (mutable, file-backed from scratch blob or zero-filled):
| Section | Grows | Notes |
|---|---|---|
| I/O buffers | — | Transient host↔guest communication (zeroed on save) |
| Page tables | — | Guest page table entries |
| CoW pages | ↓ (bump allocator) | Pages allocated when guest writes to CoW-mapped snapshot pages |
| Free zeros | — | Unallocated space — sparse hole on disk, faults in as zeroes |
| Bookkeeping | — | Allocator state at top of scratch (includes `allocator_gpa`, read at save time to derive `scratch_used`) |
The **allocator watermark** (`scratch_used`) sits between CoW pages and free zeros. Everything below it gets written to disk; everything above is a sparse hole.
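The watermark rule can be demonstrated with plain file operations. The sketch below (illustrative, standard library only) writes only the used prefix of the scratch region and then extends the file's logical length to the full scratch size; `set_len` guarantees the extended tail reads back as zeroes, and on filesystems with sparse-file support it occupies no disk blocks:

```rust
use std::fs::File;
use std::io::Write;
use std::path::Path;

/// Illustrative scratch-blob writer: only pages below the allocator
/// watermark are written; the rest of the logical size is a hole.
fn write_scratch_blob(path: &Path, used: &[u8], scratch_size: u64) -> std::io::Result<()> {
    let mut f = File::create(path)?;
    f.write_all(used)?;       // pages below the watermark (scratch_used bytes)
    f.set_len(scratch_size)?; // extend to full scratch size — tail reads as zeroes
    Ok(())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("hl-scratch-demo.blob");
    let used = vec![0xABu8; 4096]; // pretend 4 KiB of CoW pages are in use
    write_scratch_blob(&path, &used, 256 * 1024 * 1024)?;
    // Logical size spans the full scratch region, ready for mmap.
    assert_eq!(std::fs::metadata(&path)?.len(), 256 * 1024 * 1024);
    println!("blob logical size: {} bytes", std::fs::metadata(&path)?.len());
    Ok(())
}
```

Whether the hole actually stays sparse on disk depends on the filesystem (ext4, xfs, NTFS all support it); the logical-size and zero-fill behaviour does not.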
### Operations
| Operation | What happens |
|---|---|
| **Load** | Parse OCI manifest → mmap snapshot blob → mmap scratch blob (if present) → `map_file_cow()` each mapped file layer → register all slots with hypervisor |
| **Execute** | Guest runs. Writes to snapshot pages trigger guest CoW → pages allocated in scratch. Mapped file regions are R/X only. |
| **Revert** | Discard dirty scratch pages, restore from file backing. KVM: `MADV_DONTNEED`. MSHV: re-mmap (pins prevent `MADV_DONTNEED`). WHP: re-`MapViewOfFile`. Mapped files untouched. |
| **Save diff** | Dump scratch as new blob → new OCI image referencing same base + mapped file blobs + new scratch blob |
| **Flatten** | Walk page tables, collect all pages into single blob → new single-layer OCI image |
### Multi-Hypervisor Images
Memory layers (snapshot, scratch, mapped files) are identical across KVM/MSHV/WHP — same bytes, same sha256 digest. Only the config blob differs (sregs are hypervisor-specific).
**Locally**, `save_diff()` always produces a single-platform image for the current hypervisor. There's no multi-hypervisor concern at save or load time — you save on KVM, you load on KVM.
**In a registry**, multiple single-platform images can be combined under one tag using an [OCI Image Index](https://github.com/opencontainers/image-spec/blob/main/image-index.md) (fat manifest). This is assembled at push time by the host's CI/CD or tooling — not by Hyperlight:
```
# Build time: save on each hypervisor (or cross-generate configs)
save_diff("./my-sandbox-kvm/") ← on a KVM machine
save_diff("./my-sandbox-mshv/") ← on an MSHV machine
# Push time: host assembles index and pushes (e.g., using ORAS or crane)
# registry.io/myapp:v1 → OCI Image Index
# ├── linux/kvm → manifest with kvm config (same layer digests)
# ├── linux/mshv → manifest with mshv config (same layer digests)
# └── windows/whp → manifest with whp config (same layer digests)
```
On pull, OCI clients resolve the correct platform variant automatically — the same way container runtimes resolve `linux/amd64` vs `linux/arm64`. Since the memory layers share the same digests, the registry stores them once regardless of how many platform variants exist.
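For concreteness, an image index for the layout above might look like the fragment below. This is one possible encoding, not a committed schema: the OCI `platform` object has no hypervisor field, so the hypervisor is carried here in a hypothetical annotation key, and the digests are placeholders:

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "digest": "sha256:<kvm-manifest-digest>",
      "size": 1024,
      "platform": { "os": "linux", "architecture": "amd64" },
      "annotations": { "vnd.hyperlight.hypervisor": "kvm" }
    },
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "digest": "sha256:<whp-manifest-digest>",
      "size": 1024,
      "platform": { "os": "windows", "architecture": "amd64" },
      "annotations": { "vnd.hyperlight.hypervisor": "whp" }
    }
  ]
}
```

The per-platform manifests differ only in their config blob digest — the memory layer digests they reference are identical.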
### Distribution
Because the native format IS OCI, distribution is just transport:
| Method | How |
|---|---|
| **Local files** | Copy the OCI layout directory, or pack as `.tar` |
| **Registry push** | `oras copy ./my-sandbox --to registry.io/app:v1` (or via `oci-distribution` crate, `crane`, `skopeo`, etc.) |
| **Registry pull** | `oras copy registry.io/app:v1 --to ./my-sandbox` → blobs land in local layout, ready for mmap |
| **File transfer** | `tar cf my-sandbox.oci.tar my-sandbox/` → single file, standard OCI archive |
Layer sharing happens automatically — the registry deduplicates layers by content digest. Two images referencing the same base share the layer. Standard OCI behaviour. Registry integration details are covered in [Registry Distribution](#registry-distribution-host-responsibility).
### Security Benefits of OCI-Native Format
Adopting OCI as the native format unlocks the container security ecosystem for sandbox images:
| Capability | Tooling | Status |
|---|---|---|
| **Image signing** | [cosign](https://github.com/sigstore/cosign), [notation](https://github.com/notaryproject/notation) | Available now — host integrates at pull time |
| **Provenance attestation** | [SLSA](https://slsa.dev/), in-toto | Attach build provenance as OCI referrers |
| **Registry auth** | OAuth2, token-based, RBAC | Standard registry access control |
| **Admission control** | Gatekeeper, Kyverno | Policy-based approval before loading |
| **Supply chain integrity** | `sha256` digests, manifest signatures | Tamper detection from build to deploy |
None of these require Hyperlight to implement anything — they work because the format is OCI. A host that deploys sandbox images to production gets the same supply chain security guarantees as container images. The path from "local dev" to "signed, attested, policy-checked production deployment" is a configuration change, not a format migration.
### What Doesn't Work (By Design)
1. **Cannot swap base on a running sandbox** — must create a new sandbox. This prevents accidental state carryover between incompatible bases.
2. **Cannot revert to a DIFFERENT diff** — revert always returns to the scratch blob that was loaded. To switch diffs, create a new sandbox from a different image.
3. **Mapped file layers must be present in the OCI blob store at load time** — the config references layers by sha256 digest. If a blob is missing, load fails with a clear error. For self-contained distribution without layer sharing, use flatten to produce a single-layer image.
## Excluded (Future Work)
- **Stacked / composable diffs** — currently, loading always takes one base + at most one diff. There is no runtime merging of multiple diff layers. The workaround is to flatten an intermediate state into a new base image and diff from there. This costs a flatten step but keeps load and revert O(1) in complexity. Stacking could enable sharing intermediate states (e.g., base runtime → shared framework layer → per-handler diffs) without flattening, which would improve sharing density for deep specialisation trees. However, it introduces significant complexity: page table merge semantics (which PTE wins when two diffs touch the same page?), allocator state reconciliation across layers, multi-layer revert semantics (revert to which layer?), and verification cost scaling with layer count. **Revisit if**: flatten cost becomes a bottleneck in a real workload, or if sharing density for 3+ level specialisation hierarchies becomes a measured requirement.
- Offline flatten without VM (defer until demand)
- aarch64 support (after x86-64 stabilises)
---
## Implementation
### Dependencies
OCI manifest/descriptor handling uses the [`oci-spec`](https://crates.io/crates/oci-spec) crate (Apache-2.0) — typed Rust structs for OCI Image spec objects (manifests, descriptors, media types, image index). Digest computation uses the [`sha2`](https://crates.io/crates/sha2) crate. Both are lightweight and have no async/network dependencies.
Hyperlight already has `serde` and `serde_json` — no new serialisation deps needed. The Hyperlight-specific config JSON (sregs, scratch layout, file mappings) is our own struct, not part of `oci-spec`.
### Artifact Format (OCI Image Layout)
All snapshot artifacts use the OCI Image Layout specification. Memory blobs are stored as uncompressed layers in `blobs/sha256/`, identified by content digest. Metadata lives in the config JSON blob.
**Layer blobs** are raw memory content — no headers, no magic bytes. Each blob is directly mmap-able:
- **Snapshot blob**: byte-for-byte snapshot region content. Page-aligned.
- **Scratch blob**: byte-for-byte scratch region content. Logical size = `scratch_size` + trailing guard page (4096 bytes). Physical disk: only `scratch_used` bytes consume blocks (sparse file). The on-disk size of a scratch blob is proportional to `scratch_used` (the allocator watermark), NOT to `scratch_size`. A sandbox with 256 MB scratch that has only allocated 2 MB of CoW pages produces a ~2 MB blob on disk. The remaining 254 MB exists only as a sparse hole — it is never written, occupies no disk blocks, and faults in as zeroes on access.
- **Mapped file blob**: byte-for-byte file content (e.g., `.aot` module). Raw, uncompressed.
**Config blob** (JSON) contains all metadata — see [What's in a Sandbox Image?](#whats-in-a-sandbox-image) for the full schema. Key fields: `scratch_size`, `scratch_used`, `allocator_gpa`, `pt_size`, `sregs`, `hypervisor`, `arch`, `abi_version`, `file_mappings[]`.
**Integrity**: OCI provides content-addressed integrity by construction — each blob is named by its sha256 digest. If a blob exists at a given digest path, its content matches. The OCI manifest lists all layer digests and the config digest. Tampering with any blob invalidates the manifest. The manifest itself can be signed via cosign/notation (host responsibility).
### Save Base — `snapshot().to_oci()`
This is the foundation that `save_diff()` builds on. The existing `Snapshot::new()` from PR #1373 is **unchanged** — it walks page tables, collects all CoW'd pages, rebuilds PTEs, and captures sregs. What changes is only the serialisation step: `to_file()` (custom binary) is replaced by `to_oci()` (OCI layout directory).
`Snapshot::to_oci(&self, path: impl AsRef<Path>) -> Result<()>`
| Step | PR #1373 `to_file()` | OCI `to_oci()` |
|---|---|---|
| Write memory | Single blob with binary header (magic, version, arch, hashes) | Raw blob to `blobs/sha256/<digest>` — no header, no magic bytes |
| Hash | blake3 embedded in binary header | sha256 = blob filename (content-addressed by construction) |
| Metadata | Binary structs (`SnapshotHeaderV1`, `RawSregs`, `RawHashes`) | JSON config blob (sregs, arch, layout, version, host_funcs) |
| Host functions | Serialised into binary prefix, covered by header hash | Serialised into config JSON blob |
| Format marker | Magic bytes `0x484C53` at offset 0 | `oci-layout` file + `index.json` |
| Manifest | None — single file | OCI manifest JSON referencing config + snapshot layer by digest (`oci-spec` crate) |
| Atomicity | Write-then-rename single file | Assemble temp dir, rename directory |
| Guard page padding | Trailing `PAGE_SIZE` zeros appended for Windows mmap | Same — blob includes trailing guard page bytes |
Steps:
1. Write snapshot memory blob to temporary file (page-aligned, with trailing guard page)
2. Compute sha256 digest of blob (`sha2` crate) → blob filename
3. Move blob to `blobs/sha256/<digest>`
4. Generate config JSON from `Snapshot` fields: sregs, layout, arch, hypervisor, entrypoint, stack_top_gva, host function metadata, snapshot_generation
5. Compute sha256 of config JSON → config blob filename
6. Generate OCI manifest JSON referencing config + snapshot layer (`oci-spec` crate — `ImageManifestBuilder`, `DescriptorBuilder`)
7. Write `index.json` and `oci-layout`
8. Atomic directory rename to final path
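For a single-layer base image, the resulting manifest might look like the following sketch. The `mediaType` strings are assumptions for illustration — the actual artifact types Hyperlight would register are not decided here:

```json
{
  "schemaVersion": 2,
  "config": {
    "mediaType": "application/vnd.hyperlight.sandbox.config.v1+json",
    "digest": "sha256:...",
    "size": 742
  },
  "layers": [
    {
      "mediaType": "application/vnd.hyperlight.sandbox.snapshot.v1",
      "digest": "sha256:...",
      "size": 134221824
    }
  ]
}
```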
**Mapped files are absorbed (flattened)**: If the sandbox has `map_file_cow` regions when `snapshot()` is called, `Snapshot::new()` walks their page tables and copies the mapped file content into the snapshot blob. The file references are lost — the resulting OCI image is a single flattened snapshot layer with no separate mapped file layers. This is the current PR #1373 behaviour and is intentional for `snapshot().to_oci()` (it's how flatten works). To preserve mapped files as separate shareable layers, use `save_diff()` instead, which records file mappings in the config JSON and stores each file as its own OCI layer.
> **Future work**: `Snapshot::new()` should be modified to skip `map_file_cow` regions during page table walking and instead record them as separate layers. This would enable `snapshot().to_oci()` to produce a "Base + modules" image product (snapshot + mapped file layers, no scratch). Until then, the only way to get separate mapped file layers is via `save_diff()`.
**What's preserved from PR #1373**: `Snapshot::new()`, `Snapshot::from_file()` (reading old `.hls` files), and the in-memory `snapshot()`/`restore()` path. The `to_file()` method should be removed.
### Save Diff
`MultiUseSandbox::save_diff(&mut self, path: impl AsRef<Path>) -> Result<()>`
**Prerequisite**: The sandbox MUST have been loaded from an existing OCI image (via `from_oci()`). `save_diff()` on a sandbox created from a guest binary (via `evolve()`) is an error — there is no base snapshot to reference. The caller must first save a base image via `snapshot().to_oci()`, then load from that image, before `save_diff()` is available.
Requires `&mut self` — ensures exclusive access at compile time. Must not be called while guest is running. Gets all internal state from `self`:
- Scratch memory from `self.mem_mgr.scratch_mem`
- Sregs from `self.vm.get_snapshot_sregs()`
- **Base snapshot layer digest** — stored on the sandbox at `from_oci()` load time. This is the sha256 digest of the snapshot blob that was mmap'd as the snapshot slot. The diff image references this same digest in its OCI manifest, so the base blob is shared, not duplicated.
- Layout, scratch_size from `self.mem_mgr.layout`
Steps:
1. Read allocator position from scratch bookkeeping at offset `scratch_size - SCRATCH_TOP_ALLOCATOR_OFFSET` (constant `0x10`, defined in `hyperlight_common::layout`)
2. Compute `scratch_used = allocator_gpa - scratch_base_gpa`; validate ≤ scratch_size
3. Zero the I/O buffer region in the output by writing around it with selective `pwrite` (skipping the I/O offsets avoids leaking buffered guest data without an extra copy)
4. Write scratch blob to a **temporary file**, `ftruncate` to full logical size (sparse zeros above watermark)
5. Compute sha256 digest of scratch blob (`sha2` crate) → this is the OCI blob name
6. For each active `map_file_cow` region: compute sha256 of the file, write as a blob if not already present
7. **Include the base snapshot blob** in the output layout's `blobs/sha256/` by hardlinking from the source OCI layout using the stored digest. The source blob must exist — we are mmap'd from it. If the source file is missing (deleted/moved after load), this is a terminal error.
8. Generate config JSON (sregs, layout metadata, file mapping entries)
9. Generate OCI manifest JSON referencing all layer and config digests (`oci-spec` crate — `ImageManifestBuilder`, `DescriptorBuilder`). The snapshot layer descriptor uses the stored base digest; the scratch layer descriptor uses the newly computed digest.
10. Write `index.json` and `oci-layout` — **atomic via directory rename on POSIX**
**Atomicity**: The OCI layout is assembled in a temporary directory, then `rename()` to the final path. Guarantees no partial layouts on crash. On Windows, `MoveFileEx(MOVEFILE_REPLACE_EXISTING)`.
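Steps 7 and 10 can be sketched as follows — a minimal illustration on POSIX, assuming a hypothetical `assemble_layout` helper (the real `save_diff()` writes real blobs, config, and manifest JSON where the placeholders are):

```rust
use std::fs;
use std::path::Path;

/// Sketch: assemble the layout in a temp dir, share the base blob by
/// hardlink, then publish atomically via rename(). Hypothetical helper,
/// not the real save_diff() internals.
fn assemble_layout(
    tmp: &Path,
    final_path: &Path,
    base_blob: &Path,
    base_digest: &str,
) -> std::io::Result<()> {
    let blobs = tmp.join("blobs/sha256");
    fs::create_dir_all(&blobs)?;

    // Step 7: share the base snapshot blob by hardlink — no data copied.
    // Fall back to a copy if the layouts are on different filesystems.
    let dst = blobs.join(base_digest);
    if fs::hard_link(base_blob, &dst).is_err() {
        fs::copy(base_blob, &dst)?;
    }

    // Steps 8–10 would write the config, manifest, and index here.
    fs::write(tmp.join("oci-layout"), br#"{"imageLayoutVersion":"1.0.0"}"#)?;
    fs::write(tmp.join("index.json"), b"{}")?; // placeholder

    // Atomic publish: a crash before this line leaves nothing at
    // `final_path`; rename() is atomic within a filesystem on POSIX.
    fs::rename(tmp, final_path)
}
```

A crash mid-assembly leaves only an orphaned temp directory, never a half-written layout at the final path.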
**Saving from a diff-loaded sandbox**: Produces a new OCI layout with a complete scratch blob of the current state. The base snapshot layer is referenced by the same digest — not duplicated. The new diff replaces the old one; diffs do not stack.
Sparse file support:
- Linux: holes are implicit after `ftruncate`
- Windows: `DeviceIoControl(FSCTL_SET_SPARSE)` before writing, then `SetEndOfFile`
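Steps 1, 2, and 4 above can be sketched in isolation. This is illustrative only: a `Vec<u8>` stands in for the mmap'd scratch region, the bookkeeping word is assumed to be a little-endian u64, and `write_scratch_blob` is a hypothetical helper, not the real API:

```rust
use std::convert::TryInto;
use std::fs::File;
use std::io::Write;
use std::path::Path;

const PAGE_SIZE: u64 = 4096;
// Stand-in for the hyperlight_common::layout constant (0x10).
const SCRATCH_TOP_ALLOCATOR_OFFSET: usize = 0x10;

/// Write a scratch blob: only `scratch_used` bytes hit the disk; the rest
/// of the logical size (scratch_size + guard page) is a sparse hole.
fn write_scratch_blob(
    scratch: &[u8],
    scratch_base_gpa: u64,
    out: &Path,
) -> std::io::Result<u64> {
    let scratch_size = scratch.len() as u64;

    // Step 1: read the allocator watermark from the bookkeeping slot at
    // scratch_size - SCRATCH_TOP_ALLOCATOR_OFFSET.
    let off = scratch.len() - SCRATCH_TOP_ALLOCATOR_OFFSET;
    let allocator_gpa =
        u64::from_le_bytes(scratch[off..off + 8].try_into().unwrap());

    // Step 2: used = watermark - base, validated against the region size.
    let scratch_used = allocator_gpa
        .checked_sub(scratch_base_gpa)
        .filter(|&u| u <= scratch_size)
        .expect("allocator watermark out of bounds");

    // Step 4: write only the used prefix, then extend to the full logical
    // size (scratch_size + trailing guard page). set_len() is the portable
    // ftruncate — bytes past the write position read back as zeroes and,
    // on sparse-capable filesystems, occupy no blocks.
    let mut f = File::create(out)?;
    f.write_all(&scratch[..scratch_used as usize])?;
    f.set_len(scratch_size + PAGE_SIZE)?;
    Ok(scratch_used)
}
```

On filesystems without sparse support the hole would be materialised as zeroes — still correct, but without the disk saving.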
### Load
`MultiUseSandbox::from_oci(path: impl AsRef<Path>, host_funcs, config) -> Result<Self>`
Steps:
1. Read `oci-layout` and `index.json` from the OCI layout directory
2. Parse manifest JSON (`oci-spec` crate — `ImageManifest::from_reader()`) → list of layer digests + config digest
3. Read and parse config JSON from `blobs/sha256/<config-digest>`
4. Validate config fields (bounds, relationships)
5. Resolve snapshot blob: `blobs/sha256/<snapshot-digest>` → mmap `MAP_PRIVATE` as snapshot slot
6. If scratch blob present: `blobs/sha256/<scratch-digest>` → mmap `MAP_PRIVATE` as scratch slot
7. **Restore file mappings**: For each `file_mappings[]` entry in config:
- Resolve blob: `blobs/sha256/<layer-digest>`
- Verify blob size matches config entry
- `map_file_cow(fd, guest_base)` → additional hypervisor slot (file-backed, READ|EXEC, lazy)
8. Wire mmap'd regions into `SandboxMemoryManager`, write scratch bookkeeping, and register all slots with the hypervisor
**Validation** — the following are verified during load:
- OCI layout structure: `oci-layout` file present, `index.json` valid
- Manifest: all referenced blobs exist in `blobs/sha256/`
- Config: all integer fields bounds-checked (`scratch_size` > 0 and ≤ `MAX_MEMORY_SIZE`, `scratch_used` ≤ `scratch_size`, etc.)
- Base compatibility: if diff present, `scratch_size` must match base layout
- Blob sizes: snapshot blob size matches expected layout size, scratch blob size ≥ `scratch_used`
All blobs verified by sha256 digest (OCI guarantees content matches filename). No separate hash computation needed — content-addressed by construction.
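A sketch of the load-time validation step — `ImageConfig`, `validate`, and the value of `MAX_MEMORY_SIZE` are hypothetical stand-ins for illustration, not the real types:

```rust
/// Hypothetical parsed config (the real config is a JSON blob; field
/// names here are assumptions).
struct ImageConfig {
    scratch_size: u64,
    scratch_used: u64,
    snapshot_blob_size: u64,
    scratch_blob_size: Option<u64>, // None for a base-only image
}

const MAX_MEMORY_SIZE: u64 = 1 << 40; // placeholder bound

fn validate(cfg: &ImageConfig, expected_snapshot_size: u64) -> Result<(), String> {
    // scratch_size > 0 and <= MAX_MEMORY_SIZE
    if cfg.scratch_size == 0 || cfg.scratch_size > MAX_MEMORY_SIZE {
        return Err("scratch_size out of bounds".into());
    }
    // scratch_used <= scratch_size
    if cfg.scratch_used > cfg.scratch_size {
        return Err("scratch_used exceeds scratch_size".into());
    }
    // Snapshot blob must match the expected layout size exactly.
    if cfg.snapshot_blob_size != expected_snapshot_size {
        return Err("snapshot blob size mismatch".into());
    }
    // Scratch blob may be sparse, but must cover the watermark.
    if let Some(sz) = cfg.scratch_blob_size {
        if sz < cfg.scratch_used {
            return Err("scratch blob smaller than scratch_used".into());
        }
    }
    Ok(())
}
```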
**Linux mmap**: `mmap(MAP_PRIVATE | PROT_READ | PROT_WRITE | MAP_NORESERVE)` with guard pages via anonymous underlay.
**Windows mmap**: `CreateFileMappingA(PAGE_WRITECOPY)` + `MapViewOfFile(FILE_MAP_COPY)` — CoW semantics, file is not modified. Guard pages via `VirtualProtect(PAGE_NOACCESS)`.
### Revert
After a guest call, reset scratch to the pre-call state loaded from the scratch blob. File mappings (`map_file_cow` regions) are read-only and immutable — they are left untouched during revert.
> **How revert works today (without OCI images)**: Scratch is anonymous memory (`MAP_ANONYMOUS`). On KVM, `MADV_DONTNEED` discards pages and they fault back as **zeroes** — this only works because the "clean" state IS all zeroes. On MSHV and Windows, `fill(0)` memsets the region. Both approaches are O(scratch_size) and can only revert to a zeroed state.
>
> **How revert works with OCI images**: Scratch is file-backed (`MAP_PRIVATE` from the scratch blob). On KVM, `MADV_DONTNEED` discards dirty pages and they fault back from the **file content** — restoring to the saved diff state, not zeroes. This is a fundamentally different mechanism from the current anonymous-memory path, even though it uses the same syscall. On MSHV/WHP, re-mmap or re-`MapViewOfFile` achieves the same result.
**KVM (Linux)**:
```rust
madvise(scratch_ptr, scratch_size, MADV_DONTNEED);
// Kernel discards private (dirty) pages — O(resident_pages)
// Next access faults back from file backing (MAP_PRIVATE semantics)
// Same host VA range → no hypervisor re-registration → no EPT flush
update_scratch_bookkeeping()?;
```
**MSHV (Linux)**: Cannot use `MADV_DONTNEED` — MSHV pins pages with `pin_user_pages(FOLL_PIN|FOLL_WRITE)`, making `MADV_DONTNEED` a no-op on those pages.
```rust
munmap(scratch_ptr, scratch_total_size); // releases pins
let new_mapping = ExclusiveSharedMemory::from_file_cow(&file, 0, total)?;
let (host, guest) = new_mapping.build();
self.scratch_mem = host;
self.vm.update_scratch_mapping(guest)?; // re-register → re-pin
update_scratch_bookkeeping()?;
```
**WHP (Windows)**: No `MADV_DONTNEED` equivalent. `DiscardVirtualMemory()` does not restore file backing.
```rust
UnmapViewOfFile(scratch_ptr);
// Re-map from the stored file mapping handle (kept in SandboxMemoryManager
// for the lifetime of the sandbox — survives across revert cycles)
let addr = MapViewOfFile(self.scratch_file_handle, FILE_MAP_COPY, ...);
VirtualProtect(...); // guards
update_scratch_mapping(new_guest_shared_mem)?;
update_scratch_bookkeeping()?;
```
**Handle lifetime**: The `CreateFileMappingA` handle for the scratch blob is stored in `SandboxMemoryManager` at load time and kept alive for the sandbox's lifetime. It is used for each re-map during revert. Dropped when the sandbox is dropped.
**Performance comparison** (to be benchmarked — no existing benchmarks cover the file-backed path):
| | Current (anonymous scratch) | Proposed (file-backed scratch) |
|---|---|---|
| **KVM** | `MADV_DONTNEED` — O(dirty_pages), pages fault back as **zeroes** | `MADV_DONTNEED` — O(dirty_pages), pages fault back from **file** |
| **MSHV** | `fill(0)` memset — O(scratch_size) | re-mmap from file — O(1) mmap + hypervisor re-registration |
| **WHP** | `fill(0)` memset — O(scratch_size) | re-`MapViewOfFile` — O(1) map + hypervisor re-registration |
For KVM, the syscall cost is the same — `MADV_DONTNEED` walks the page table and discards dirty pages regardless of whether the mapping is anonymous or file-backed. The difference is **capability, not speed**: anonymous can only revert to zeroes, file-backed restores to saved diff state. The performance win on KVM is eliminating the need to memcpy saved state back after zeroing.
For MSHV and WHP, the improvement is both capability and performance: current `fill(0)` touches every byte in the scratch region (O(scratch_size)), while re-mmap is O(1) plus a fixed hypervisor re-registration cost. This is where the biggest gain is expected — but is unvalidated. Existing benchmarks only test scratch sizes up to 1 MB (default 288 KB).
### Revert Benchmarks
Benchmark current and file-backed revert at 1 MB, 8 MB, 64 MB, 256 MB scratch sizes. Run per hypervisor (KVM, MSHV, WHP).
### Flatten (base + diff → new base)
Compose a base snapshot + diff into a single new base snapshot OCI image.
**Option A — From running sandbox**:
```rust
let sb = MultiUseSandbox::from_oci("./my-sandbox/", funcs, cfg)?;
sb.snapshot()?.to_oci("./flattened/")?;
```
`Snapshot::new()` walks both regions, collects all pages, produces a dense blob as a single-layer OCI image. Requires booting a VM but reuses all existing code.
**Option B — From OCI layout, no VM**:
```rust
Snapshot::flatten("./my-sandbox/")?.to_oci("./flattened/")?;
```
Would need an offline page table walker; the required information is already present in the snapshot and scratch blobs. Deferred until demand exists.
The flattened image can be used as a new base for further diff capture.
---
## Threat Model
OCI layouts contain executable guest state (page tables + CoW'd memory). They should be treated with the same trust level as container images.
- **Crafted PTEs**: Can only reference GPAs within allocated hypervisor memory slots. EPT/NPT is the security boundary, not guest PTEs. No VM escape possible via crafted page tables.
- **Data integrity**: OCI content-addressed digests (sha256) catch corruption and tampering. Each blob is named by its hash — modification invalidates the digest. The OCI manifest lists all digests.
- **Blob substitution**: The OCI manifest binds all layers together by digest. Swapping one valid blob for another changes the manifest digest. Manifest itself can be signed (cosign/notation) for tamper-proof provenance.
- **No path traversal possible**: Blobs are addressed by sha256 digest, not file paths. No paths to traverse, no symlinks to follow.
- **No mapped file TOCTOU**: Blobs in the OCI store are immutable by construction (content-addressed). No verification-to-mmap race window.
- **Atomic writes**: `save_diff()` assembles the OCI layout in a temporary directory, then `rename()`. No partial layouts on crash.
---
## Registry Distribution (Host Responsibility)
Hyperlight does NOT implement registry push/pull. The OCI Image Layout is already the native format — any OCI-compliant tool can distribute sandbox images without Hyperlight involvement.
The host application is responsible for:
- **Pushing images to registries** — using [ORAS](https://oras.land/), `crane`, `skopeo`, the `oci-distribution` crate, or whatever OCI-compliant tool fits their infrastructure
- **Pulling images from registries** — decompressing layers into a local OCI layout directory, then calling `from_oci()`
- **Authentication** — OAuth2, token-based, workload identity — whatever the registry requires
- **Signing and verification** — cosign, notation, or other tools at push/pull time
- **Provenance** — attaching SLSA attestations as OCI referrers
- **Admission control** — Gatekeeper, Kyverno, or custom policy checks before loading
This is the same model as container runtimes: the runtime reads OCI layouts, the orchestrator handles distribution. Hyperlight is the runtime.
```
Host workflow:
  oras cp registry.io/app:v1 --to-oci-layout ./my-sandbox:v1    ← host pulls
  let mut sb = MultiUseSandbox::from_oci("./my-sandbox/", ...)?; ← Hyperlight loads
sb.save_diff("./updated/")?; ← Hyperlight saves
  oras cp --from-oci-layout ./updated:v2 registry.io/app:v2     ← host pushes
```
If demand arises for a convenience wrapper (e.g., `from_registry()` that pulls + loads in one call), it can be added as a **separate crate** or behind a feature gate — not in core Hyperlight.