Initially we will just support two basic virtualizations:

1. Environment variables
2. Filesystem

The assumption of WASI Virt is that it is a single Wasm component exporting the WASI interfaces, which can be composed with a WASI importer. The virtualization code will be static, but the way it interfaces with memory will allow active and passive memory injections to provide the virtualization data used at runtime.

# API

The main API would take virtualization options and return the final virtualization module with the virtualization data embedded, something like:

```rs=
/// Each WASI subsystem can be individually virtualized.
/// Virtualizations are all optional but compose together.
struct VirtOpts {
    env: Option<VirtEnv>,
    fs: Option<VirtFs>,
}

/// Create the component virtualizer for the given
/// virtualization options.
/// This component exports WASI interfaces as selected,
/// while having pass-through imports for non-virtualized
/// WASI interfaces, to be used in composition.
pub fn create_virt(opts: VirtOpts) -> Vec<u8> {
    // ...
    component
}
```

# Environment Variables

`wasi:cli-base/environment` implementation.

`VirtEnv` would be defined:

```rs=
use std::collections::BTreeMap;

enum HostEnv {
    All,
    Allow(Vec<String>),
    Deny(Vec<String>),
    None,
}

struct VirtEnv {
    /// Which environment variables should come from
    /// the host. `HostEnv::None` means there is no
    /// host fallthrough.
    host_env_vars: HostEnv,
    /// Env vars. Can use $NAME substitutions
    /// to depend on host vars.
    env_vars: BTreeMap<String, String>,
}
```

The `getEnvironment` function returns a list of pairs of strings. The data layout in memory for this function return can be a simple binary embedding of the pair count followed by each pair, with each string prefixed by its length. We can use LEBs or even just plain u32s.

We can either preallocate a section of linear memory for the environment variables, up to a maximum length, or support a custom pointer for this data.

With the pointer option, we could have another function export specific to the virtualizer - `getEnvironmentPtr` - which would provide the address of the pointer to the environment data. An allocation can then be statically provided for the environment variables without a limit, and the memory value at this pointer address set to this new memory location. The virtualizer then simply has to mutate the global stack pointer and add the active memory segments for the environment variables.
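As a rough illustration of this convention (assuming the plain-`u32` option with little-endian lengths and the pair count first; the function name here is hypothetical, and the exact encoding is not settled), the virtualization-time embedding could look like:

```rs=
use std::collections::BTreeMap;

/// Illustrative encoder for the layout described above:
/// a u32 pair count, then for each pair the name and the
/// value, each prefixed by its u32 byte length.
fn encode_env(env_vars: &BTreeMap<String, String>) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(env_vars.len() as u32).to_le_bytes());
    for (name, value) in env_vars {
        for s in [name, value] {
            out.extend_from_slice(&(s.len() as u32).to_le_bytes());
            out.extend_from_slice(s.as_bytes());
        }
    }
    out
}
```

The exported `getEnvironment` implementation would then perform the mirror-image walk over this data to produce its list of pairs.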
# Filesystem

`wasi:filesystem/filesystem`, `wasi:cli-base/preopens`, `wasi:io/streams` implementations.

`VirtFs` might be defined:

```rs=
use std::collections::BTreeMap;

struct VirtFs {
    /// Filesystem state to virtualize
    preopen_dirs: BTreeMap<String, FsRoot>,
    /// A cutoff size in bytes, above which
    /// files will be treated as passive segments.
    /// Per-file control may also be provided.
    passive_cutoff: u32,
}

// where FsEntry in turn defines all the data in the
// file system that is available statically
struct FsRoot {
    files: BTreeMap<String, FsEntry>,
}

enum FsEntry {
    File(FsFile),
    Dir(FsDir),
    HostFile(FsHostFile),
    HostDir(FsHostDir),
}
```

The above API is entirely about user convenience and isn't an internal or runtime data structure at all. Validation and processing would be handled at virtualization time. The data would then be statically arranged into the memory convention, against which the interfaces would be implemented.

The use of host file and host directory entries in the configuration allows constructing fall-through paths to the final host-native filesystem. If none are provided, then full encapsulation is used without any host filesystem imports. If any files defer to the host, then a host path substitutes into the provided position.

## Memory Convention

While environment variables were a simple case, the filesystem has two separate primary memory structures: the filesystem index and the filesystem data. The data (above a certain size threshold) would be in passive memory, while the index would remain in active memory.

### File Data

All file data would be separate from the index in injected memory. Files under a certain threshold might be included as active segments; otherwise they would be passive segments. FS operations would determine this by checking the index, which would provide the correct reference information for the given file.

In the passive case, a `read_passive_segment` core function would be inlined that can be called from Rust to handle the segment read, taking three `u32` arguments. Allocations by these functions would be "component-model-owned" by the callee and freed by the component model ABI calls themselves. In the active case, the pointer would be read directly.

Evaluating the best strategy for batching of reads would be a larger effort, but initially the simplest implementation should achieve the required goals.
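For the passive case, the Rust side might bind and use the inlined function along these lines. The signature, argument order, and names below are assumptions for illustration only; the actual convention would be fixed at virtualization time:

```rs=
/// Index entry for a passive-segment file (the full entry
/// layout is sketched under the Static Index section below).
struct PassiveFileEntry {
    memory_segment_idx: u32,
}

extern "C" {
    /// Inlined at virtualization time. Hypothetical argument
    /// order: passive segment index, offset into the segment,
    /// and byte length. The returned allocation is
    /// "component-model-owned" and freed by the ABI itself.
    fn read_passive_segment(segment_idx: u32, offset: u32, len: u32) -> *const u8;
}

/// Sketch of a file read resolved through the index to a
/// passive segment; the slice is only valid until the ABI
/// frees the underlying allocation after the call returns.
unsafe fn passive_file_read<'a>(entry: &PassiveFileEntry, offset: u32, len: u32) -> &'a [u8] {
    let ptr = read_passive_segment(entry.memory_segment_idx, offset, len);
    core::slice::from_raw_parts(ptr, len as usize)
}
```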
The layout of the index data structure would need to support:

* Storing all file and directory structure, including permissions
* Runtime mutations for writes and deletions

### Filesystem Index

One suggested approach would be to treat the index as consisting of two components: the initial static index, and a dynamic mutation index, which can be a runtime structure with garbage collection provided by Rust's memory management directly.

The static index can be created using the same global stack pointer bumping technique used by the environment variables, most likely also exporting a `getFsIndexPtr` function at virtualization time for population. The dynamic index would then be an entirely runtime-specific implementation.

This dual static / dynamic index approach allows us to use Rust data structures for the dynamic filesystem, while keeping to simple static layouts for the static filesystem.

#### Static Index

The static index should at least be able to provide optimized lookups, so it should consist of entries of fixed length. Top-level directory entries should be listed first, in alphabetical order, followed by applying the same listing to each child. This way path lookups remain optimized binary searches over fixed lists.

For example, for the directory structure:

```
- dirA
  - fileA
  - dirB
    - sub
- file
- zDir
  - file
```

The linearized index would be `[dirA, file, zDir, fileA, dirB, sub, file]`. In addition, the top-level count should always be known.

The fixed-size entry data structure can then be something along the lines of:

```
StaticEntry {
  name: *str,
  type: 'active-file' | 'passive-file' | 'dir',
  perms: u16,
  segment_data: ActiveFileEntry | PassiveFileEntry | DirEntry
}

DirEntry {
  // flat index at which these directory entries start
  index: u32,
  // number of direct directory entries
  dir_entry_cnt: u32,
}

ActiveFileEntry {
  ptr: u32,
  len: u32,
}

PassiveFileEntry {
  memory_segment_idx: u32,
}
```

#### Dynamic Index

The optional dynamic index could be treated as an overlay over the static index, in the case of mutable filesystem support, where the dynamic index is first consulted before falling back to the static index.

Entries in the dynamic index would be Rust data structures (`BTreeMap` etc), which then need to include additional overlay state. For example:

* Deletion entries
* Overwrite entries
* Move entries

If done properly, the dynamic index should be able to consult the static index as needed, providing layering without duplication, while supporting garbage collection of dynamically created files.
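To make this layering concrete, the following sketch resolves a path against the static half, using a simplified in-memory mirror of the fixed-size entry layout from the Static Index section (field and function names here are hypothetical; the real implementation would read the fixed-size records out of linear memory directly):

```rs=
/// Simplified in-memory mirror of a `StaticEntry`, reduced to
/// the fields needed for lookup.
struct StaticEntry {
    name: &'static str,
    /// For directories: the flat index at which the child
    /// entries start, and their count; `None` for files.
    dir: Option<(u32, u32)>,
}

/// Resolve a `/`-separated path against the linearized index.
/// Each directory's entries form a contiguous sorted run, so
/// every path segment is a binary search over a fixed slice.
fn static_lookup<'a>(
    entries: &'a [StaticEntry],
    top_level_cnt: u32,
    path: &str,
) -> Option<&'a StaticEntry> {
    let (mut start, mut cnt) = (0u32, top_level_cnt);
    let mut found = None;
    for segment in path.split('/') {
        let run = &entries[start as usize..(start + cnt) as usize];
        let idx = run.binary_search_by(|e| e.name.cmp(segment)).ok()?;
        found = Some(&run[idx]);
        match run[idx].dir {
            Some((child_start, child_cnt)) => {
                start = child_start;
                cnt = child_cnt;
            }
            // A file matched: any further segment will fail the
            // next (empty) search.
            None => cnt = 0,
        }
    }
    found
}
```

Because each directory's entries form a contiguous run, lookups need no allocation or pointer chasing beyond the fixed entry list itself.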
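The dynamic overlay could then sit in front of that lookup. Continuing the previous sketch (`StaticEntry` and `static_lookup` as defined there; the overlay entry kinds mirror the list above and are illustrative, not a settled design):

```rs=
use std::collections::BTreeMap;

/// Hypothetical overlay entry kinds.
enum OverlayEntry {
    /// Masks a static entry deleted at runtime.
    Deleted,
    /// Overwrites (or newly creates) file contents at runtime,
    /// owned by Rust and garbage collected normally.
    Overwrite(Vec<u8>),
    /// Records a move; the entry resolves via its original path.
    MovedFrom(String),
}

struct DynamicIndex {
    overlay: BTreeMap<String, OverlayEntry>,
}

enum Resolved<'a> {
    Dynamic(&'a [u8]),
    Static(&'a StaticEntry),
}

impl DynamicIndex {
    /// Consult the overlay first, then fall back to the static
    /// index, layering over the static data without duplicating it.
    fn resolve<'a>(
        &'a self,
        entries: &'a [StaticEntry],
        top_level_cnt: u32,
        path: &str,
    ) -> Option<Resolved<'a>> {
        match self.overlay.get(path) {
            Some(OverlayEntry::Deleted) => None,
            Some(OverlayEntry::Overwrite(bytes)) => Some(Resolved::Dynamic(bytes.as_slice())),
            Some(OverlayEntry::MovedFrom(orig)) => {
                static_lookup(entries, top_level_cnt, orig).map(Resolved::Static)
            }
            None => static_lookup(entries, top_level_cnt, path).map(Resolved::Static),
        }
    }
}
```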