# RepoStore Quotas - Part 2

This document summarizes our current understanding of how to address the outstanding issues with quotas in the RepoStore and Marketplace. The latest discussion led us to consider:

1. simplifying the Marketplace availability structure (Sec. 1);
2. adding support for a two-partition RepoStore with block tags, so that partially filled slots can be rolled back (Sec. 2).

### 1. Dropping support for multiple availabilities.

Multiple availabilities add complexity at several levels:

1. they require quotas for each availability to be tracked separately;
2. they place a larger burden on the user by adding an extra concept and requiring them to size and create availabilities before they can start selling storage.

We argue this complexity is unwarranted at the present moment. The main reason we envisioned multiple availabilities was to allow a provider to sell different pieces of their quota at different prices. Codex, however, does not currently support multiple volumes (e.g. SSD and magnetic storage that could be sold at different prices), which makes this unnecessary complexity for now.

Having a single availability would make things simpler. In principle, it could let us get away with implementing a two-partition RepoStore in which one partition is used for caching and the other for slot storage, and it would allow us to drop all the code needed to manage multiple availabilities (add/delete/list)^[I think we will still need a way to allow a provider to update the price at which it is willing to sell _space that is currently free_ so it can follow market prices, but that might be a lot simpler than tracking availabilities.].

### 2. Active and passive management of slot blocks.

Slot blocks begin entering the blockstore as soon as the download process for a slot starts. The moment they exit the blockstore, however, is a lot blurrier. One would be willing to assume, in the happy path, that:

1. a slot will download to completion;
2. the storage provider will submit proofs until the storage contract finishes;
3. once the contract finishes, the blocks will be deleted and space will be reclaimed.

Currently, however:

1. there is no active removal of slot blocks. When a slot finishes downloading, the blocks are left to be reclaimed by the block maintenance process;
2. there is no active handling for slots that fail. This means those blocks should also, eventually, be garbage-collected by the maintenance process.

Let us discuss the issues with these mechanisms, as they are currently implemented, next.

**Completed Slots.** The [BlockMaintainer](https://github.com/codex-storage/nim-codex/blob/9d7b521519329766cee675ece4013614adc7c6ad/codex/stores/maintenance.nim) currently runs every $\delta$ instants, with $\delta = 10$ minutes being the default. Moreover, the BlockMaintainer will delete at most $d$ blocks per cycle, with the current default being $d = 100$. This means a $1\text{GB}$ file will currently take $163$ cycles to be deleted, or about $1$ day. Deleting $1$ terabyte of data will take about $3.2$ years (see the sanity check below). This is clearly not what we want.

**Failed Slots.** Blocks belonging to failed slots will currently linger in storage until their original expiry date. If $l_{\max}$ is the maximum duration for a storage contract, then blocks for a failed slot which, say, never even downloaded to completion, can linger for up to $l_{\max}$ instants in storage. After that, they should be reclaimed by block maintenance at the rate described before ($3.2$ years for a terabyte).
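As a sanity check on the figures above, here is a short calculation. It assumes a $64\text{KiB}$ block size (an assumption, though consistent with the numbers quoted) and the BlockMaintainer defaults:

```python=
# Sanity check for the deletion-latency figures above, assuming 64 KiB
# blocks and the current BlockMaintainer defaults (delta = 10 min, d = 100).
BLOCK_SIZE = 64 * 1024    # bytes per block (assumed)
DELTA_MIN = 10            # maintenance cycle interval, in minutes
D = 100                   # maximum blocks deleted per cycle

def deletion_latency_days(dataset_bytes: int) -> float:
    blocks = dataset_bytes // BLOCK_SIZE
    cycles = -(-blocks // D)              # ceiling division
    return cycles * DELTA_MIN / (60 * 24)

print(deletion_latency_days(1 << 30))     # 1 GiB -> ~1.1 days (164 cycles)
print(deletion_latency_days(1 << 40))     # 1 TiB -> ~1165 days (~3.2 years)
```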
I argue we would have a much simpler and more predictable system if:

1. slot blocks were actively deleted, by default, when a slot expires or fails;
2. we relied on maintenance or _active orphan pruning_ as a fallback for crashes.

In this context, _active orphan pruning_ refers to a primitive that:

1. identifies blocks not belonging to currently active slots (failed slots, aborted slots, finished slots);
2. deletes them immediately.

Such a primitive could be run by default on startup. Since a node needs to be restarted after a crash, we would be guaranteed to never miss blocks orphaned by crashes. Recovery from errors not resulting from crashes, on the other hand, would fall on the Marketplace, but this is also simple: a "finally" state that runs regardless of the outcome of a slot and invokes this exact same cleanup primitive could suffice. In pseudocode:

```python=
try:
    await download_slot()
    await submit_initial_proof()
    while not contract_expired():
        await next_proof_interval()
        await submit_proof()
finally:
    # Runs regardless of whether the slot succeeded, failed, or was aborted.
    await prune_orphan_blocks()
```

Making such a primitive work would require us to _tag_ blocks so we can tell which blocks belong to which slots. It would likely also require us to efficiently retrieve all tags, so we can tell which ones correspond to slots we are tracking and which do not.

**Alternative: fix block maintenance.** The alternative to orphan pruning is to fix block maintenance so that it actually does what is expected. To me this means reducing the deletion latency for blocks _significantly_, by removing the parameter $d$ and making the process fast enough that it can run with a small $\delta$. This will _still_ require active deletion of blocks for failed slots in non-crash scenarios, however, as otherwise blocks for failed slots can potentially linger for large amounts of time before being reclaimed.

### 3. Primitives

With that in mind, we need to change the RepoStore such that:

**It supports two partitions.** The first partition is used for caching, the second is used for slots. From the point of view of the RepoStore, these are simply two partitions for which it tracks quotas separately. This would require providing a `putBlock` primitive as follows:

```nim=
method putBlock*(
    self: BlockStore,
    blk: Block,
    ttl = Duration.none,
    partition: Partition = Zero
): Future[?!void] {.base, gcsafe.}
```

where partition `Zero` is the one used for caching, and partition `One` is used for the single availability^[We could also use a range type defined on $[0, 1]$, but range types are buggy.].

**It allows blocks to be tagged.** The format for tags needs to be investigated. In its most general form, tags could be arrays of strings, or a single string. If needed for efficiency, we could make do with an integer, at the cost of adding complexity to the Marketplace, which would need to map `(dataset root, slot index)` tuples onto integers (see the sketch below). Block tags are passed when blocks are stored. Our final primitive becomes:

```nim=
method putBlock*(
    self: BlockStore,
    blk: Block,
    ttl = Duration.none,
    partition: Partition = Zero,
    tags: BlockTags = None,
): Future[?!void] {.base, gcsafe.}
```

By default, the RepoStore stores untagged blocks in partition zero.
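To illustrate the tag formats mentioned above, a hypothetical mapping from `(dataset root, slot index)` to either a string or an integer tag could look as follows. The helper names are placeholders, not existing Codex APIs:

```python=
import hashlib

def slot_tag(dataset_root: str, slot_index: int) -> str:
    # Single-string variant: trivially derivable from the slot and
    # human-readable, but produces larger keys than an integer.
    return f"{dataset_root}/{slot_index}"

def slot_tag_int(dataset_root: str, slot_index: int) -> int:
    # Integer variant: hash the pair down to 64 bits. Cheaper to store,
    # but the Marketplace must keep a (dataset root, slot index) -> tag
    # table around if it ever needs to reverse the mapping.
    digest = hashlib.sha256(slot_tag(dataset_root, slot_index).encode()).digest()
    return int.from_bytes(digest[:8], "big")
```

Either variant keeps tag derivation stateless on the write path; the integer one trades readability for compactness.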
We then need to add support for deleting all blocks with a given tag, for the slot cleanups. The precise format for this primitive depends on what is more efficient. At the simplest end of the spectrum, we could have something like:

```nim=
method delBlocks*(tags: BlockTags): Future[?!void] {.base, gcsafe.}
```

which simply deletes all blocks with a given set of tags. Note that I am omitting the partition, as blocks are not identified by partition: partitions are just for tracking quotas.

### 4. Recovery

Cleanup of dead blocks can happen either:

1. **At startup.** Assuming we keep enough information around to figure out that during our last run we were in the process of filling a slot, but that process never completed, we could simply clean the blocks for every partially filled slot as part of the initialization of the sales module^[This could be too aggressive if we expect that nodes can perhaps resume downloading a slot, but it is simple enough and works, so that is what I would start with.] (a sketch follows below).
2. **Immediately upon realizing that a slot cannot be filled.** For instance, if the node realizes halfway through filling the slot that it can no longer download blocks for it, or if the sale fails for any other reason, then the node simply purges all blocks belonging to that slot. Note that the node can still crash while doing this, so we need to be careful about how we implement it.

Most likely, we will need _both_ mechanisms: nodes can crash, and crashes will defeat any attempt at resolving the issue purely by relying on error signaling like (2).
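To make the startup path in (1) concrete, here is a minimal pseudocode sketch in the same style as before; all helper names are placeholders rather than existing APIs:

```python=
# Pseudocode sketch of startup orphan pruning; helper names are placeholders.
async def prune_orphans_on_startup():
    # Tags of slots the sales module still considers active
    # (queued, downloading, or under contract).
    active = set(await marketplace_active_slot_tags())
    # Tags currently present in the RepoStore's slot partition.
    stored = set(await repo_store_list_tags())
    # Anything stored but no longer active belongs to a failed, aborted,
    # or finished slot: delete it and reclaim its quota before the node
    # starts selling space again.
    for tag in stored - active:
        await del_blocks(tag)
```

The same cleanup primitive can then be invoked from the "finally" state of the Marketplace pseudocode in Sec. 2, which covers the non-crash path in (2).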