# Quota reservations at the RepoStore
**What's Broken and How To Fix It**
## Problem
Slot quota reservations may fail to be honored. In a nutshell, this happens because the current implementation is open to race conditions, as well as vulnerable to inconsistencies in case of faults/crashes.
To understand how that happens, we start by looking at the code that actually creates the RepoStore reservations:
```nim=
proc createAvailability*(
    self: Reservations,
    size: uint64,
    duration: uint64,
    minPricePerBytePerSecond: UInt256,
    totalCollateral: UInt256,
): Future[?!Availability] {.async.} =
  trace "creating availability",
    size, duration, minPricePerBytePerSecond, totalCollateral
  let availability =
    Availability.init(size, size, duration, minPricePerBytePerSecond, totalCollateral)
  let bytes = availability.freeSize
  if reserveErr =? (await self.repo.reserve(bytes.NBytes)).errorOption:
    return failure(reserveErr.toErr(ReserveFailedError))
  if updateErr =? (await self.update(availability)).errorOption:
    # rollback the reserve
    trace "rolling back reserve"
    if rollbackErr =? (await self.repo.release(bytes.NBytes)).errorOption:
      rollbackErr.parent = updateErr
      return failure(rollbackErr)
    return failure(updateErr)
  return success(availability)
```
reservations are created by the call to `self.repo.reserve` (line $13$). The first problem is already visible in the rollback path (lines $15$-$21$): if the call to `update(availability)` fails, we attempt to release the reserved bytes (line $18$). But if the node crashes, is terminated, or an exception is thrown between the reserve and a successful update, the rollback never runs and we simply end up with an orphaned reservation.
If we then look at the code that does the actual reservation:
```nim=
proc reserve*(self: RepoStore, bytes: NBytes): Future[?!void] {.async.} =
  ## Reserve bytes
  ##
  trace "Reserving bytes", bytes
  await self.updateQuotaUsage(plusReserved = bytes)
```
we realize that all it does is update a counter, as can be seen in `updateQuotaUsage` below:
```nim=
proc updateQuotaUsage*(
    self: RepoStore,
    plusUsed: NBytes = 0.NBytes,
    minusUsed: NBytes = 0.NBytes,
    plusReserved: NBytes = 0.NBytes,
    minusReserved: NBytes = 0.NBytes,
): Future[?!void] {.async.} =
  await self.metaDs.modify(
    QuotaUsedKey,
    proc(maybeCurrUsage: ?QuotaUsage): Future[?QuotaUsage] {.async.} =
      var usage: QuotaUsage
      if currUsage =? maybeCurrUsage:
        usage = QuotaUsage(
          used: currUsage.used + plusUsed - minusUsed,
          reserved: currUsage.reserved + plusReserved - minusReserved,
        )
      else:
        usage =
          QuotaUsage(used: plusUsed - minusUsed, reserved: plusReserved - minusReserved)
      if usage.used + usage.reserved > self.quotaMaxBytes:
        raise newException(
          QuotaNotEnoughError,
          "Quota usage would exceed the limit. Used: " & $usage.used & ", reserved: " &
            $usage.reserved & ", limit: " & $self.quotaMaxBytes,
        )
      else:
        self.quotaUsage = usage
        codex_repostore_bytes_used.set(usage.used.int64)
        codex_repostore_bytes_reserved.set(usage.reserved.int64)
        return usage.some,
  )
```
in effect, whatever is reserved is semantically subtracted from the available quota. But one cannot, by design, use the reserved quota for writing. Indeed, if we look at how `putBlock` is implemented:
```nim=
method putBlock*(
    self: RepoStore, blk: Block, ttl = Duration.none
): Future[?!void] {.async.} =
  ## Put a block to the blockstore
  ##
  # ...
  if res.kind == Stored:
    trace "Block Stored"
    if err =? (await self.updateQuotaUsage(plusUsed = res.used)).errorOption:
      # rollback changes
      without delRes =? await self.tryDeleteBlock(blk.cid), err:
        return failure(err)
      return failure(err)
    if err =? (await self.updateTotalBlocksCount(plusCount = 1)).errorOption:
      return failure(err)
    if onBlock =? self.onBlockStored:
      await onBlock(blk.cid)
  else:
    trace "Block already exists"
  return success()
```
we see that the call to `updateQuotaUsage` at line $9$ checks the block size against the available quota, and that is it. If one reserves the entire blockstore and then tries to write a block, that write will fail because there is no way of saying "use the bytes from what I have reserved to write this block".
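A worked example with made-up numbers shows why: the write is charged against `used` while `reserved` stays untouched, so a fully reserved quota rejects even a one-byte block.
```nim=
# made-up numbers for illustration
let
  quotaMaxBytes = 100'u64
  used = 0'u64       # nothing written yet
  reserved = 100'u64 # one availability spanning the whole quota
  blockSize = 1'u64  # the block we now try to store

# the condition that makes updateQuotaUsage raise QuotaNotEnoughError
doAssert (used + blockSize) + reserved > quotaMaxBytes
```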
Indeed, using a reserved quota relies on a _release-then-use_ pattern, which we can see in the Marketplace code that fills a slot (`downloading.nim`):
```nim=
proc onBlocks(blocks: seq[bt.Block]): Future[?!void] {.async.} =
  # release batches of blocks as they are written to disk and
  # update availability size
  var bytes: uint = 0
  for blk in blocks:
    if not blk.cid.isEmpty:
      bytes += blk.data.len.uint
  trace "Releasing batch of bytes written to disk", bytes
  return await reservations.release(reservation.id, reservation.availabilityId, bytes)
```
this has a number of issues:
1. there is no locking between the release and the actual storage of the blocks, but there are calls to `await` in between. This means you could release some quota and, before you can use it, some other process (e.g. an ongoing upload, or a new availability request) might race you to it, resulting in [failures to store a slot due to lack of quota](https://github.com/codex-storage/nim-codex/issues/1142) (a minimal sketch of this interleaving follows this list);
2. the node can crash after you release the reservation but before you store anything. You'll then be left with an availability not backed by a quota reservation, and will end up in the same situation as before.
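To make issue 1 concrete, here is a minimal, self-contained sketch of the interleaving. `Quota`, `fillSlot` and `upload` are made-up names for illustration; the only dependency is `chronos`, which the codebase already uses.
```nim=
import std/strformat
import pkg/chronos

type Quota = ref object
  free: uint64 # bytes neither used nor reserved

proc fillSlot(quota: Quota, reservedBytes: uint64) {.async.} =
  # step 1: release the reservation back into the shared pool
  quota.free += reservedBytes
  echo &"slot: released {reservedBytes} bytes (free = {quota.free})"
  # step 2: suspend, e.g. while blocks are fetched from the network
  await sleepAsync(10.milliseconds)
  # step 3: try to use the bytes we just released
  if quota.free >= reservedBytes:
    quota.free -= reservedBytes
    echo &"slot: stored {reservedBytes} bytes (free = {quota.free})"
  else:
    echo &"slot: FAILED, quota was taken while suspended (free = {quota.free})"

proc upload(quota: Quota, bytes: uint64) {.async.} =
  # a concurrent upload (or another availability) racing for the same quota
  if quota.free >= bytes:
    quota.free -= bytes
    echo &"upload: took {bytes} bytes (free = {quota.free})"

proc main() {.async.} =
  # a 100-byte quota, 80 of which are reserved for the slot: 20 are free
  let quota = Quota(free: 20)
  let slot = fillSlot(quota, reservedBytes = 80)
  await upload(quota, bytes = 90) # runs while fillSlot is suspended
  await slot                      # fillSlot now fails for lack of quota

waitFor main()
```
Running it, the slot task fails to store its 80 bytes because the concurrent upload consumed the freed quota while the slot task was suspended on `await`.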
## How To Improve Things
Which improvements to make largely boils down to what we want. There are several options.
### 1. **No reservations at all.**
In this case, we explicitly stop tracking any sort of reservation at the RepoStore level. The main pro of this approach is that it is the simplest. The main cons are:
1. **Confusing semantics.** An operator can create availabilities that are carefully sized to their quota but end up overcommitting because the effective quota is actually smaller. Add to that the fact that slots will be competing for whatever quota is left, and you might run into head-scratching situations like the node trying to fill two, or even three, slots at once and not being able to fill any due to lack of space.
2. **It mixes caching and storage quotas, compounding confusion.** A slot might fail to fill because of seemingly unrelated operations, like the operator having used the node to download a file from the network. If the user does not have a very clear idea of the Codex storage model, this is unexpected. While we could add active garbage collection to purge expired blocks when storage pressure mounts, this adds the problems of defining a trigger for the active garbage collector, dealing with issues like thrashing, and possibly doing extra bookkeeping to keep the performance of GC cycles acceptable.
### 2. **Separate reserved quotas from caching.**
In this case, we would have two pools (or partitions) -- reserved and non-reserved -- and we do not allow those to mix. This means adding a primitive that allows selecting to which pool we want to write, like:
```nim=
method putBlock*(
    self: RepoStore, blk: Block, ttl = Duration.none, reserved: bool = false
): Future[?!void] {.async.} =
```
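To illustrate the intended semantics, here is a self-contained toy model (not Codex code; `TwoPoolQuota` and `tryPut` are made-up names): cache writes can only consume the non-reserved pool, while slot writes draw from the reserved pool.
```nim=
type TwoPoolQuota = object
  quotaMaxBytes: uint64
  reservedBytes: uint64 # set aside for availabilities
  usedReserved: uint64  # written with `reserved = true`
  usedCache: uint64     # written with `reserved = false`

proc tryPut(self: var TwoPoolQuota, bytes: uint64, reserved: bool): bool =
  if reserved:
    # slot data draws from the reserved pool only
    if self.usedReserved + bytes > self.reservedBytes:
      return false
    self.usedReserved += bytes
  else:
    # cache data draws from whatever is left outside the reserved pool
    if self.reservedBytes + self.usedCache + bytes > self.quotaMaxBytes:
      return false
    self.usedCache += bytes
  return true

when isMainModule:
  var quota = TwoPoolQuota(quotaMaxBytes: 100, reservedBytes: 80)
  doAssert quota.tryPut(80, reserved = true)      # slot fill succeeds
  doAssert not quota.tryPut(30, reserved = false) # caching cannot eat the reserve
  doAssert quota.tryPut(20, reserved = false)     # only the free 20 bytes remain
```
The point of the flag is that a fully reserved quota no longer blocks slot writes, and caching can no longer eat into availabilities.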
This approach allows us to fix the "mixing of caching and storage quotas" problem described earlier, so that reservations are never used up by caching. It also fixes accidental overcommitting to some extent, because caching will not eat into carefully sized availabilities. It still leaves room for issues, however:
1. **Marketplace has to track its use of the reserved quotas very carefully.** In practice, it needs to correctly implement partitioning of the reserved space from outside the RepoStore. This includes keeping track of how many bytes are used per partition in the face of failures and unexpected conditions, and keeping that in sync with the blocks actually in the RepoStore. Since you cannot tag blocks as part of a partition from the outside, there is ample room for corner cases leading to orphaned/leaked blocks that linger in the RepoStore forever.
### 3. **Implement a partitioned RepoStore.**
As the name implies, this would allow RepoStore quotas to be dynamically partitioned into disjoint spaces, and the RepoStore would be responsible for keeping track of per-partition quotas. The primitive from earlier would become something like:
```nim=
method putBlock*(
    self: RepoStore, blk: Block, ttl = Duration.none, partition: uint32 = 0
): Future[?!void] {.async.} =
```
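To make this more concrete, here is a self-contained toy sketch of per-partition bookkeeping (hypothetical names such as `PartitionedQuota`, `createPartition` and `putBytes`; not an actual Codex API): partitions are carved out of the global quota, and writes are charged against their own partition only.
```nim=
import std/tables

type
  PartitionId = uint32
  Partition = object
    size, used: uint64
  PartitionedQuota = object
    quotaMaxBytes: uint64
    partitions: Table[PartitionId, Partition]

proc carved(self: PartitionedQuota): uint64 =
  ## Total bytes already handed out to partitions.
  for p in self.partitions.values:
    result += p.size

proc createPartition(self: var PartitionedQuota, id: PartitionId, size: uint64): bool =
  ## Carve a new partition out of the remaining global quota.
  if id in self.partitions or self.carved() + size > self.quotaMaxBytes:
    return false
  self.partitions[id] = Partition(size: size)
  return true

proc putBytes(self: var PartitionedQuota, id: PartitionId, bytes: uint64): bool =
  ## Charge a write to its own partition; other partitions are unaffected.
  if id notin self.partitions:
    return false
  if self.partitions[id].used + bytes > self.partitions[id].size:
    return false
  self.partitions[id].used += bytes
  return true

proc dropPartition(self: var PartitionedQuota, id: PartitionId) =
  ## Dropping a partition releases its whole quota share, so the tallies
  ## cannot drift from what the partition actually holds.
  self.partitions.del(id)

when isMainModule:
  var quota = PartitionedQuota(quotaMaxBytes: 100)
  doAssert quota.createPartition(1, size = 80)     # an 80-byte availability
  doAssert not quota.createPartition(2, size = 30) # only 20 bytes are left
  doAssert quota.putBytes(1, 80)                   # slot data fits its partition
  doAssert not quota.putBytes(1, 1)                # and cannot overflow it
  quota.dropPartition(1)                           # the availability expires
  doAssert quota.createPartition(2, size = 30)     # its quota is free again
```
In the real RepoStore these tallies would presumably live in the metadata datastore, as `QuotaUsage` does today, and blocks would carry their partition ID in their metadata (see the first advantage below).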
This approach has several advantages:
* we could tag blocks with their partition IDs in their metadata, which would mean any orphaned blocks would live at most as long as their availabilities;
* I don't think this would make the RepoStore any slower, since we need to do quota checks at every store operation anyway;
* it would ensure, by construction, that the quota tallies and what is actually in the RepoStore are always in sync;
* it would free Marketplace from implementing this logic, and let it focus instead on making sure that the contents of each partition are managed properly. Blocks might still leak if, say, a slot fails to fill but Marketplace does not clean the blocks up from the RepoStore partition (or does not drop the partition itself), but those leaks are easier to detect and recover from; e.g., a sweep at startup that looks for orphaned RepoStore partitions could be enough to make this work OK-ish.