$$
\newcommand{\repostore}{\text{RepoStore}}
\newcommand{\networkstore}{\text{NetworkStore}}
$$
# Codex RepoStore - Pending Issues and Future Directions
Codex manages data as fixed-size chunks called _blocks_. Block management is handled by a series of _block stores_, which form a layered architecture (Figure 1). At the top -- and exposed to the remainder of Codex -- we have the $\networkstore$, which leverages a local $\repostore$[^1] for actual block storage and retrieval.
<center>
<img src="https://hackmd.io/_uploads/rkzwEUtYkg.png" width="70%" style="padding: 10px 0px 25px 0px"/>
</center>
**Figure 1.** Architecture of the Codex block store.
$\networkstore$ delegates most operations to the underlying $\repostore$, with the exception of block lookup: if a client issues a `getBlock` for a block that is not known locally, $\networkstore$ will attempt to fetch it over the network by activating the _block exchange engine_. This process might involve a DHT lookup, as well as joining a swarm and connecting to other peers to obtain the block.
Once the block exchange engine returns the block, it gets placed onto the local $\repostore$, from where it can then be retrieved as many times as desired. Although the whole structure is deserving of attention, the focus of this document will be on the local $\repostore$.
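The lookup-with-network-fallback flow can be sketched as follows. This is a minimal illustration in Python, not the actual Codex API; `get`, `put`, and `fetch` are hypothetical method names:

```python
import asyncio

class NetworkStore:
    """Illustrative sketch of the getBlock fallback path: try the local
    RepoStore first, then activate the block exchange engine."""

    def __init__(self, repo_store, exchange_engine):
        self.repo = repo_store          # local block storage
        self.engine = exchange_engine   # DHT lookup + swarm fetch

    async def get_block(self, cid: str) -> bytes:
        local = self.repo.get(cid)
        if local is not None:
            return local
        # Not known locally: fetch over the network, then place the
        # block in the RepoStore so later reads are purely local.
        block = await self.engine.fetch(cid)
        self.repo.put(cid, block)
        return block
```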
**$\repostore$ Responsibilities.** In addition to storing blocks, the $\repostore$ keeps track of:
1. $\repostore$-wide metadata such as total blocks stored and total bytes stored;
1. a _storage quota_, which represents the maximum number of bytes that the $\repostore$ is willing to store;
1. per-block expiry times;
1. whether a block $b$ is an _orphan block_, or a part of a larger dataset;
1. Merkle inclusion proofs attesting that non-orphan blocks belong to datasets. Orphan blocks will have no Merkle inclusion proofs associated with them;
1. a per-block counter of how many datasets a block $b$ participates in; i.e., how many Merkle inclusion proofs we know for $b$. Orphan blocks will have this set to zero.
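The per-block and store-wide bookkeeping above could, hypothetically, be modeled as follows. Field and type names are assumptions, not the actual Codex schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BlockMetadata:
    """Hypothetical per-block record mirroring items 3-6 above."""
    expiry: int  # expiry time, e.g. unix seconds (item 3)
    proofs: List[bytes] = field(default_factory=list)  # Merkle proofs (item 5)

    @property
    def ref_count(self) -> int:
        # item 6: number of datasets the block participates in,
        # i.e. how many Merkle inclusion proofs we know for it
        return len(self.proofs)

    @property
    def is_orphan(self) -> bool:
        # item 4: orphan blocks have no inclusion proofs
        return self.ref_count == 0

@dataclass
class RepoMetadata:
    """Hypothetical store-wide counters (items 1-2)."""
    total_blocks: int = 0
    total_bytes: int = 0
    quota_bytes: int = 0  # maximum bytes the store is willing to hold
```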
**Design goals.** The $\repostore$ needs to be, in order of importance:
1. **Correct.** Node crashes, power outages, and heavy workloads should not put the $\repostore$ into unrecoverable states, and should not corrupt the underlying data or metadata;
2. **Scalable.** The $\repostore$ must be able to handle large amounts of data, on the order of terabyte-sized files.
3. **Fast.** Block retrieval should be fast and, ideally, retrieving a file from the local blockstore should be almost as fast as reading the equivalent file from a regular filesystem.
## Current Issues
### Non-transactional Updates
As Figure 1 shows, the $\repostore$ relies on a pair of key-value stores (kv-stores) (based on [nim-datastore](https://github.com/codex-storage/nim-datastore)) to actually store data and metadata. Block data is stored in a filesystem-backed kv-store where each block corresponds to a file. Metadata, instead, is stored into a kv-store backed by [LevelDB](https://github.com/google/leveldb).
A critical issue with the current $\repostore$ implementation is the widespread use of naive, non-transactional updates. The following code for `delBlock` illustrates the problem well:
```nim=
without res =? await self.tryDeleteBlock(cid, self.clock.now()), err:
return failure(err)
if res.kind == Deleted:
trace "Block deleted"
if err =? (await self.updateTotalBlocksCount(minusCount = 1)).errorOption:
return failure(err)
if err =? (await self.updateQuotaUsage(minusUsed = res.released)).errorOption:
return failure(err)
```
There are three separate kv-store updates going on here: the deletion of the block's content (`tryDeleteBlock`), the update of the total block count (`updateTotalBlocksCount`), and the update of the quota (`updateQuotaUsage`). A failure in between these updates will leave the block store in an inconsistent state from which it will never recover.
There are two issues at play here:
* **Problem 1. Lack of single-store transactional updates.** Although LevelDB supports atomic updates by means of [batched updates](https://github.com/google/leveldb/blob/main/doc/index.md#atomic-updates), those are not enough. This is because we are _incrementing_ and _decrementing_ counters. Increment and decrement are read-modify-write operations which require either transactional support or the use of safe primitives (e.g. atomic counters) at the kv-store level. Neither of those is generally available;
* **Problem 2. Lack of multi-store transactional updates.** Applying updates across kv-store implementations requires another type of transactional logic which is currently not implemented.
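A minimal illustration of Problem 1: with no transaction around the read-modify-write sequence, interleaved counter updates lose increments. The kv-store is simulated by a plain dictionary:

```python
# Two writers each read the counter, then write back read_value + 1.
# Without a transaction around the read-modify-write, one increment
# is lost when the reads interleave.
store = {"totalBlocks": 10}

def naive_increment(snapshot: int) -> None:
    # 'snapshot' is the value the writer read before writing back
    store["totalBlocks"] = snapshot + 1

a = store["totalBlocks"]   # writer A reads 10
b = store["totalBlocks"]   # writer B reads 10, before A writes
naive_increment(a)         # A writes 11
naive_increment(b)         # B also writes 11: A's increment is lost

assert store["totalBlocks"] == 11  # two increments, but we expected 12
```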
In the next subsections, we will discuss some options for addressing both problems.
#### Possible Solutions to Problem 1: Lack of Single-Store Transactional Updates
The current kv-store implementations protect against racing updates to the same key by means of specialized read-modify-write primitives which internally rely on locking or transactions, but there is no support for multi-row operations. Further, these read-modify-write primitives have been criticized for introducing complexity while not solving all of the issues.
One could envision a simpler way of dealing with the problem by noticing that the $\repostore$ is actually quite simple, and that multi-row updates are typically counter updates and quota checks. _Record-level locking_ might be enough to prevent concurrent updates, but we would still require rollback logic for when failures occur. Using transactions, particularly with SQLite, could provide an affordable and simple alternative.
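A sketch of what this could look like with SQLite, wrapping the two metadata updates of `delBlock` in a single transaction. Table and key names are hypothetical; note that pushing the decrement into the `UPDATE` statement also sidesteps the read-modify-write problem:

```python
import sqlite3

# Hypothetical metadata table holding the RepoStore counters.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta (key TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO meta VALUES ('totalBlocks', 5), ('quotaUsed', 5120)")
conn.commit()

def del_block_metadata(conn: sqlite3.Connection, released_bytes: int) -> None:
    # One transaction covers both counter updates: either both are
    # applied or, on failure, neither is (rolled back on exception).
    with conn:
        conn.execute(
            "UPDATE meta SET value = value - 1 WHERE key = 'totalBlocks'")
        conn.execute(
            "UPDATE meta SET value = value - ? WHERE key = 'quotaUsed'",
            (released_bytes,))

del_block_metadata(conn, 1024)
```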
The drawback is transaction overhead, which might inadvertently introduce performance issues in the uncontended path, especially for high-throughput writes; i.e., trying to write many blocks in a row. Since we expect actual contention to be rare, however, introducing a batched `putBlock` operation - in which metadata remains locked for the duration of a batched insert and block writes go as fast as they can - could amortize transaction costs to the point of making them negligible.
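A sketch of such a batched `putBlock`, again assuming a hypothetical SQLite-backed store: a single transaction covers the whole batch, so the metadata counters are updated once per batch rather than once per block:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blocks (cid TEXT PRIMARY KEY, data BLOB)")
conn.execute("CREATE TABLE meta (key TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO meta VALUES ('totalBlocks', 0), ('quotaUsed', 0)")
conn.commit()

def put_blocks(conn: sqlite3.Connection, blocks) -> None:
    """Insert many (cid, data) blocks under one transaction, amortizing
    the commit cost and the counter updates over the whole batch."""
    total = sum(len(data) for _, data in blocks)
    with conn:
        conn.executemany("INSERT INTO blocks VALUES (?, ?)", blocks)
        conn.execute("UPDATE meta SET value = value + ? "
                     "WHERE key = 'totalBlocks'", (len(blocks),))
        conn.execute("UPDATE meta SET value = value + ? "
                     "WHERE key = 'quotaUsed'", (total,))

put_blocks(conn, [("cid%d" % i, b"x" * 64) for i in range(100)])
```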
#### Possible Solutions to Problem 2: Lack of Multi-Store Transactional Updates
Multi-store transactional updates are trickier because of the different nature of the loads on the stores. One could argue that using a single transactional store would fix the problem, but transaction overhead would likely create significant performance problems, particularly for high-throughput writes.
What I envision as a more palatable approach, therefore, is a mixture of real transactions for metadata (with the cost amortization strategies described earlier) and [compensating transactions](https://en.wikipedia.org/wiki/Compensating_transaction) for blocks. By keeping enough information around to tell when something is inconsistent, we might be able to implement lightweight recovery strategies that work even when the node crashes and block content no longer matches metadata content.
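A toy sketch of the idea, with dictionaries standing in for the metadata store and the block file store: an intent record is written before the block content, and startup recovery applies a compensating action for any write that never reached the committed state. Names and states are hypothetical:

```python
# The metadata store is the source of truth; block files are
# reconciled against it at startup.

def put_block(meta: dict, files: dict, cid: str, data: bytes) -> None:
    meta[cid] = "pending"      # 1. record intent (durable metadata write)
    files[cid] = data          # 2. write the block content
    meta[cid] = "committed"    # 3. mark the write complete

def recover(meta: dict, files: dict) -> None:
    """Compensate for writes interrupted between steps 1 and 3:
    delete the partial block and drop the pending record."""
    for cid in [c for c, state in meta.items() if state == "pending"]:
        files.pop(cid, None)   # compensating action: undo the block write
        del meta[cid]

meta, files = {}, {}
put_block(meta, files, "a", b"ok")
# Simulate a crash midway through a second put: intent recorded and
# block written, but never marked committed.
meta["b"] = "pending"
files["b"] = b"partial"
recover(meta, files)
```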
Actually solving these issues will require developing these solutions further, however, as well as a careful investigation of their tradeoffs.
#### Issue Summary
Although the $\repostore$ is fairly small, the use of unsafe data access patterns is widespread. We need safe, atomic updates for _all_ block and metadata operations. Those need to be implemented in a way that does not incur unacceptable overhead, especially in the uncontended paths.
### Performance
The current storage layer does not perform well. Deleting a $2\text{GB}$ file, for instance, takes $4$ minutes on my laptop's SSD. The likely culprits are a mixture of:
1. small blocks;
2. fine-grained metadata updates;
3. blocking I/O running on a single Chronos thread.
The impact of each of those needs to be _measured_. Optimizing them will then require a mixture of techniques. Block sizes can be increased for larger files. Batched operations, like the batched `putBlock` described earlier and batched deletes, could be used to amortize metadata update costs. Blocking I/O could be dealt with by introducing threads at the kv-store layer, or by doing much larger writes. The extent to which those optimizations will be required will ultimately depend on the performance we expect to have, and how far we are willing to go to get it.
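As a starting point for such measurements, a small harness along these lines could compare many small fsynced writes against one large write of the same total size, isolating the cost of small blocks. This is illustrative only; the actual numbers are machine-dependent:

```python
import os
import tempfile
import time

def time_writes(chunk_size: int, total: int) -> float:
    """Write `total` bytes in `chunk_size` pieces, fsyncing each piece
    the way a one-file-per-block store effectively does, and return
    the elapsed seconds."""
    start = time.perf_counter()
    with tempfile.TemporaryDirectory() as d:
        for i in range(total // chunk_size):
            path = os.path.join(d, f"block-{i}")
            with open(path, "wb") as f:
                f.write(b"\0" * chunk_size)
                f.flush()
                os.fsync(f.fileno())
    return time.perf_counter() - start

small = time_writes(64 * 1024, 8 * 1024 * 1024)       # 128 x 64 KiB blocks
large = time_writes(8 * 1024 * 1024, 8 * 1024 * 1024)  # one 8 MiB write
```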
#### Issue Summary
The current $\repostore$ does not perform well. We need to define a performance target, measure the impact of the various sources of inefficiency, and implement the optimizations needed to address them as required.
[^1]: Which might have an additional `CacheStore` layered on top of it but, for simplicity, we will assume it does not.