# Storing and Cleaning Up Slots
$$
\newcommand{\state}[1]{\text{#1}}
$$
Failure to clean up blocks for slots that have reached a termination state results in wasted space and loss of storage provider efficiency. This document attempts to define a more precise strategy for such cleanup.
**When is cleanup needed?** Looking at the state machine for a slot at a storage provider, depicted in Fig. 1, any slot that has ever reached $\state{downloading}$ or a later state should be assumed to require cleanup.

**Figure 1.** Sales state machine.
**When should cleanup happen?** Cleanup should happen when:
1. a slot reaches a termination state; i.e., one of the coloured states in Fig. 1;
2. the state of the slot can no longer be determined.
Cleanup must also happen regardless of node crashes or unexpected exceptions: as soon as the node is able to restore normal operation, it _must_:
1. restore its knowledge about slot state;
2. determine which blocks need to be cleaned up, if any.
This means we need to perform _active_ cleanup of failed slots, and _corrective_ cleanup after node failure states. Any time the regular operation of the node is interrupted, we must assume there is cleanup that might need to be done.
## Achieving Proper Cleanup
To achieve cleanup, we must:
1. make sure that every termination state for a slot ends up in cleanup (_active_ cleanup);
2. make sure that every time a node restores proper operation, it cleans up any terminated slots that could not be cleaned up because the node crashed or was shut down (_corrective_ cleanup).
I will not deal with lack of cleanup due to bugs. If a bug results in blocks not being cleaned up, then that bug needs to be addressed.
### Active Cleanup
We currently have several options for implementing active cleanup.
**Implement a `delSlotBlocks` primitive.** Looking at [how slots are built](https://github.com/codex-storage/nim-codex/blob/9d7b521519329766cee675ece4013614adc7c6ad/codex/node.nim#L595), one could envision a primitive like:
```nim=
proc delSlotBlocks(manifestCid: Cid, slotIndex: uint64): Future[?!void] {.async.} =
  # Recover the slot's block indices via the dataset manifest and
  # delete each block from the local store.
  for index in slotIndices(manifestCid, slotIndex):
    await delBlock(manifestCid.treeCid, index)
```
**Listing 1.** Pseudocode for a primitive for deleting slot blocks.
which relies on the dataset manifest to recover the block indices for a slot, and simply deletes them. This has the advantage of not requiring any changes to the RepoStore (no need to tag blocks).
The main disadvantage is that it requires the dataset manifest. In fact, if the manifest were to disappear from the network and from our local store for some reason, we would no longer be able to identify the slot or delete its blocks. There is also no "proper" place to implement such a primitive: it does not belong in the RepoStore, but implementing it in CodexNode would also look strange.
**Implement block tagging and tagged deletes.** We can do away with the requirement of having a manifest if we can explicitly tag blocks. As discussed, we can use a simple $32$-byte identifier derived by hashing the `(treeCid, slotIndex)` tuple, thus uniquely identifying a slot.
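As a concrete illustration, the tag could be a plain SHA-256 hash of the tuple. The `slotTag` helper below is a sketch, not existing codex API, and the exact serialization (including the byte order of the index) is an open choice:
```nim=
import pkg/nimcrypto/sha2

# Sketch only: derive a slot tag by hashing the raw tree CID bytes
# together with the slot index.
proc slotTag(treeCid: Cid, slotIndex: uint64): BlockTag =
  var ctx: sha256
  ctx.init()
  ctx.update(treeCid.data.buffer)              # raw CID bytes
  ctx.update(cast[array[8, byte]](slotIndex))  # host byte order; a real
                                               # implementation would fix
                                               # the endianness
  BlockTag(ctx.finish().data)
```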
This requires allowing blocks to be tagged as they are stored:
```nim=
type BlockTag = distinct array[32, byte] # or something like that

method putBlock*(
  self: BlockStore,
  blk: Block,
  ttl = Duration.none,
  tag: ?BlockTag = BlockTag.none,
): Future[?!void] {.base.}
```
**Listing 2.** Signature for the `putBlock` primitive with block tags.
and a delete primitive:
```nim=
method delBlocks*(self: BlockStore, tag: BlockTag): Future[?!void] {.base.}
```
**Listing 3.** Signature for a primitive for deleting block sets by tag.
which wipes all blocks with a given tag.
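To make such deletes efficient, the store can index blocks by tag when they are put. Below is a minimal sketch, assuming the RepoStore keeps per-block metadata in a key-value datastore; `metaDs`, `queryKeys`, `tagKey`, and `cidFromKey` are illustrative names, not existing APIs:
```nim=
# Sketch: blocks are indexed under "tags/<tag>/<cid>" at put time, so a
# tagged delete becomes a prefix query plus per-block deletes, with no
# full scan of the store.
proc delBlocks(self: RepoStore, tag: BlockTag): Future[?!void] {.async.} =
  for key in await self.metaDs.queryKeys(tagKey(tag)):
    let cid = cidFromKey(key)
    await self.delBlock(cid)      # delete the block itself
    await self.metaDs.delete(key) # drop the index entry
  success()
```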
### Corrective Cleanup
Corrective cleanup requires extra information. Since the delete primitives require us to identify a slot either by a `(manifestCid, slotIndex)` tuple, in the case of `delSlotBlocks`, or by a hash of the `(treeCid, slotIndex)` tuple, in the case of `delBlocks`, we need to know the list of such values that refer to slots that might require cleaning up to begin with.
We have two options.
**Persist information on slots we process.** This requires keeping extra state, but is relatively easy to manage safely. We can write an entry to a local, persistent `pendingSlots` table which identifies every slot we process _before we download its first block_; i.e., at the beginning of the $\state{downloading}$ state.
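A sketch of what such an entry could carry follows; `SlotMeta` and `trackSlot` are illustrative names, holding just enough to drive either delete primitive:
```nim=
# Illustrative only: one pendingSlots entry per slot being processed.
type SlotMeta = object
  manifestCid: Cid    # needed by delSlotBlocks (Listing 1)
  slotIndex: uint64
  blockTag: BlockTag  # needed by delBlocks (Listing 3)

proc trackSlot(meta: SlotMeta) {.async.} =
  # Written before the first block is fetched, so a crash mid-download
  # still leaves a record behind for corrective cleanup.
  await pendingSlots.put(meta)
```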
We then assume, conservatively, that as long as an identifier remains in that table, the slot might need cleanup. We can then implement a pair of `cleanupSlots`/`delSlot` primitives as follows:
```nim=
proc cleanupSlots() {.async.} =
  for slotMeta in pendingSlots:
    try:
      await delSlot(slotMeta)
    except CatchableError as err:
      error "Failed to clean up blocks for slot",
        slot = slotMeta, err = err.msg
```
**Listing 4.** Pseudocode for a primitive that cleans up orphan blocks.
```nim=
proc delSlot(meta: SlotMeta) {.async.} =
  when defined(delBlocks):
    await delBlocks(meta.blockTag)
  else:
    await delSlotBlocks(meta.manifestCid, meta.slotIndex)
  # Assuming exceptions signal errors, we only get here if nothing
  # failed, so the slot can be safely dropped from the table.
  await pendingSlots.delete(meta)
```
**Listing 5.** Pseudocode for a primitive that deletes all blocks for a slot.
We can then have `delSlot` called by the active cleanup sequence in the Marketplace state machine as well.
The main downside of this approach is having to keep an extra table, namely `pendingSlots`. The upside is that we can implement an efficient cleanup, and that managing `pendingSlots` safely, even across faults and crashes, is easy.
The only point of concern is if we happen to start downloading a slot that is listed in `pendingSlots` _while_ a cleanup operation is running. If this is a possibility, we might need to add a bit more machinery to ensure race conditions are dealt with properly, but this is a situation we probably want to avoid anyway: cleaning up while downloading would wreak havoc on the whole process, cause slots to go unfilled and, in the worst case, could even lead to the node being slashed.
**Determine orphan blocks from current state.** A Codex node can query the list of its currently _active_ slots by [inspecting on-chain state](https://github.com/codex-storage/nim-codex/blob/9d7b521519329766cee675ece4013614adc7c6ad/codex/contracts/market.nim#L155). We could then construct a `getStoredTags` primitive which computes the set of all block tags currently in the RepoStore with an aggregation query. This would allow us to implement `cleanupSlots` as follows:
```nim=
proc cleanupSlots() {.async.} =
  let storedTags = await getStoredTags()
  let mySlots = await market.contract.mySlots()
  # Anything stored locally but not active on chain is an orphan.
  for slotId in setMinus(storedTags, mySlots):
    await delSlot(await slotMeta(slotId))
```
**Listing 6.** Pseudocode for a corrective cleanup primitive driven by on-chain state.
where `delSlot` is the same primitive defined in Listing 5, except that it does not need to update the `pendingSlots` table, which does not exist in this approach. This will wipe the contents of all slots that are not recorded on chain as being ours; i.e., anything that went past $\state{downloading}$ but did not reach $\state{filled}$.
If we opt for `delSlotBlocks` as the slot cleanup primitive, this will require an extra mapping step to convert the `slotId` returned by the contract into the corresponding `(manifestCid, slotIndex)` tuple, which I am not actually sure is currently possible, as the contract stores the tree CID and not the manifest CID. If we opt for `delBlocks`, we can use the slot ID as the block tag and call `delBlocks` directly with it, without any mapping.
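For illustration, assuming `SlotId` is a $32$-byte identifier (it is a hash), reusing it as the tag would make corrective cleanup mapping-free; the conversion below is a sketch:
```nim=
# Sketch: a distinct 32-byte SlotId can double as the block tag.
let tag = BlockTag(array[32, byte](slotId))
await delBlocks(tag)
```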
The main pro of this approach is that it does not require an extra table. The cons are:
1. it requires an aggregation query, which may not be efficiently supported by all backends, and which is currently not supported at the `nim-datastore` level. Naively implementing this by traversing all blocks and collecting the tags will likely be very slow;
2. if we want to go with the `delSlotBlocks` approach, it requires a way for the node to map a slot ID, as returned by the smart contract, onto a manifest CID, which I am not sure is possible. It might therefore constrain us to implementing block tags.
The improvements described here would be implemented _in addition_ to the quota management improvements we have already discussed.
## Dealing with Renewals^[After a chat with Dmitriy, we converged on having dataset expiries to deal with this instead, where "dataset" is a treeCid that would work both for slots and entire datasets. Instead of tracking the number of active state machines, we would have the marketplace renew the expiries for the slot treeCid, as it does now, except those would not need to be set for every single block. Maintenance would then happen at the dataset-level and not at the block level anymore. We would need locking to properly order dataset expiry extensions with concurrent deletes; i.e., a delete and a request for extending the expiry of a treeCid are always ordered and the extension of the expiry either succeeds, in which case the delete fails, or the delete succeeds, in which case the extension of the expiry fails. The mechanisms above describing how active cleanup deals with failed/aborted slots still apply.]
Renewals are regular storage requests issued against slots that are currently being handled by another storage request. They concern us when the same node that is currently servicing the request starts handling the renewal, as that means we can have more than one state machine instance running against the same slot. We start by reasoning against the simplest case of a renewal, in which we have _two_ state machines, $s_o$ and $s_r$, which are handling the original and the renewal requests, respectively.
If $s_o$ is allowed to delete the data for the slot _while_ $s_r$ is running, this can break $s_r$'s invariants. Indeed, the state machine assumes that the data will not disappear from the local disk once it moves past $\state{downloading}$. If this invariant, which we refer to as the _download invariant_, gets violated, the node will be slashed and $s_r$ will transition into a failure state.
We can preserve this invariant by taking a few precautions. We assume a renewal needs to be issued _before_ the current request expires; i.e., for a renewal to succeed, we require $s_o$ to _not_ have reached a termination state by the time $s_r$ reaches $\state{prepared}$. In other words, if $s_o$ has reached a termination state by the time we get a renewal request, then the renewal should fail, and we should expect that the data will need to be downloaded again.
In practical terms, we must distinguish between two situations:
1. $s_o$ is active and running a termination state;
2. $s_o$ has reached a termination state and is no longer running.
In both cases, $s_r$ should be handled as a new request and not a renewal, but at the implementation level we need to ensure that $s_r$ is not allowed to run _until_ $s_o$ terminates so we can preserve $s_r$'s download invariant. With that intuition in mind, we tackle the general case.
**General case.** In the general case, let $S = \left\{s_1, \cdots, s_n\right\}$ be a set of state machine instances running for a given slot, such that:
* a machine $s$ enters $S$ when it reaches $\state{preparing}$; and
* a machine $s \in S$ leaves $S$ when it terminates; i.e., when it is done running the actions for its termination state.
To prevent the download invariant from being broken at any given machine while achieving proper cleanup, we must ensure that:
* a machine $s$ cannot join $S$ if any of the $s_i \in S$ is currently running a cleanup operation;
* cleanup is always performed by the last machine to leave $S$.
Note that this means that whenever $|S| = 0$, we should have no data on disk for that slot. It might happen that we have to re-download all the data as a new machine joins $S$, but invariants will not be broken, and this situation should not be common. In fact, in the common case we will have $|S| \leq 2$; i.e., we will have only $s_o$ and $s_r$, and we will likely never even block $s_r$.
We can implement this behavior by keeping some extra information on $|S|$, and a lock:
```nim=
var activeFSMs: Table[SlotId, tuple[lock: AsyncLock, n: uint]]
```
We can then add the following to the cleanup state for a state machine:
```nim=
proc cleanup(self) {.async.} =
  let lock = activeFSMs[self.slotId].lock
  await lock.acquire()
  try:
    activeFSMs[self.slotId].n.dec()
    if activeFSMs[self.slotId].n == 0:
      # Last machine out cleans up the slot's blocks.
      await delSlot(self.slotId)
  finally:
    lock.release()
```
At the same time, we need to prevent new machines from starting while there is a cleanup running:
```nim=
proc downloading(self) {.async.} =
  let lock = activeFSMs[self.slotId].lock
  # Blocks while a cleanup for this slot is in flight.
  await lock.acquire()
  try:
    activeFSMs[self.slotId].n.inc()
  finally:
    lock.release()
```
Note that none of this needs to be persisted. If the node crashes halfway through a cleanup, the remainder of the slot will be evicted at the next startup, and the node will simply pick up any renewal requests as new requests.
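To tie things together, a sketch of how corrective cleanup could be hooked into node startup; `CodexNode.start` and `sales` are illustrative names:
```nim=
# Sketch: run corrective cleanup before the sales module starts
# accepting new slots, so pending cleanups cannot race new downloads.
proc start(node: CodexNode) {.async.} =
  await node.cleanupSlots()  # Listing 4 or Listing 6
  await node.sales.start()
```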