# Codex EC and placement requirements
## Abstract
There are several concurrent requirements that our erasure coding and data placement need to satisfy, so we first need to understand what these requirements are before engaging in the minutiae of specific layouts. I'll try to outline these requirements at a high level and provide a few directions that I believe might be interesting to explore in this context.
Our requirements are concurrent and in some cases conflicting, so some level of compromise is expected. However, we should be careful to make the **right kind** of compromises. I've been using the following framework to decide where compromises are most acceptable (in order of importance):
1. Integrity - the EC structure and block placement SHOULD NOT compromise the integrity of the dataset and, by extension, its durability
2. Performance - the EC structure and block placement SHOULD avoid creating bottlenecks during retrieval
1. Retrieval can also be significantly accelerated by the EC code, thus the structure of the code SHOULD be maximally exploited
3. Fairness - the EC structure and block placement SHOULD strive for maximum fairness given the economic constraints of the system
1. Since retrieval is incentivized, unfair placement would lead to marketplace imbalances and create privileged and disadvantaged participants
4. Complexity - increasing complexity is justifiable only in the context of the other factors. In other words, complexity can be increased only if the alternative would be compromising on integrity, performance or fairness
## Requirements
We have several high-level requirements which can be classified as **immediate** and **future**.
Our immediate requirements are:
1. The EC structure SHOULD maximally guarantee the integrity of the dataset
2. The block placement SHOULD avoid correlated failures (see the sketch after this list)
3. The block placement SHOULD avoid creating retrieval bottlenecks
4. The block placement SHOULD maximize download fairness due to the retrieval economics
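To make the correlated-failure requirement concrete, here is a purely illustrative Python sketch (not a proposed Codex layout; the node names, the `(K, M)` values and the `group_survives` helper are assumptions chosen for exposition). With a `(K, M)` erasure-coded group, any K of the K + M blocks are enough to recover the data, so a group survives as long as at most M blocks are lost. If one node holds several blocks of the same group, a single node failure consumes several of those M "spares" at once:

```python
def group_survives(placement: dict, failed_nodes: set, M: int) -> bool:
    """`placement` maps block index -> node id for one EC group of K + M blocks."""
    lost = sum(1 for node in placement.values() if node in failed_nodes)
    return lost <= M

K, M = 4, 2
spread_out = {0: "n0", 1: "n1", 2: "n2", 3: "n3", 4: "n4", 5: "n5"}
piled_up   = {0: "n0", 1: "n0", 2: "n0", 3: "n1", 4: "n1", 5: "n2"}

assert group_survives(spread_out, {"n0", "n1"}, M)   # two blocks lost: still recoverable
assert not group_survives(piled_up, {"n0"}, M)       # three blocks lost at once: unrecoverable
```

The same spread-out placement also speaks to the bottleneck and fairness requirements above: when no single node holds a large share of a group, no single node gates retrieval or captures a disproportionate share of the incentivized downloads.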
Our future requirements are:
NOTE: We should future-proof our solution with these requirements in mind.
1. The EC code and placement SHOULD support dynamic datasets.
    1. The simplest and most effective strategy for dynamic datasets is appending; it supports most (or all) types of modifications to the source data, so, at a minimum, we should allow appending data (blocks) to an existing dataset
2. It SHOULD be possible to adjust the EC parameters of an already deployed dataset. For example, it SHOULD be possible to adjust the parity block parameter of the dataset (M)
3. In the context of the above two requirements, it follows that we SHOULD allow increasing the set of nodes storing a dataset
    1. Removing nodes from the storage set is not contemplated at this time, as it seems infeasible
4. Lastly, appending data, changing the EC parameter (M) and/or adding new nodes SHOULD NOT lead to a reordering of the blocks in the storage set. In other words, new nodes or a changed M should not lead to a reshuffle of the blocks across all the storage nodes (see the sketch after this list).
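The "no reshuffle" invariant can be sketched as follows (again purely illustrative Python with assumed names and a naive round-robin rule, not a proposed placement algorithm): appending blocks or growing the storage set only ever assigns the *new* blocks; existing block-to-node assignments are never moved.

```python
class Placement:
    def __init__(self, nodes):
        self.nodes = list(nodes)   # hypothetical node identifiers
        self.assignment = {}       # block index -> node id (append-only)

    def place_new_blocks(self, block_indices):
        """Assign only newly appended blocks; never move an existing one."""
        for i in block_indices:
            if i in self.assignment:
                continue
            self.assignment[i] = self.nodes[i % len(self.nodes)]

    def add_node(self, node):
        """A new node becomes a candidate for future appends only."""
        self.nodes.append(node)


placement = Placement(["n0", "n1", "n2"])
placement.place_new_blocks(range(6))        # initial dataset
before = dict(placement.assignment)

placement.add_node("n3")                    # grow the storage set
placement.place_new_blocks(range(6, 10))    # append more blocks

# The original assignments are unchanged -- no reshuffle occurred.
assert all(placement.assignment[i] == before[i] for i in before)
```

Note the trade-off this invariant implies: a newly added node only receives blocks from future appends, so balance (and hence fairness) may degrade unless it is addressed elsewhere.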
## Dataset Metadata
Another factor to keep in mind is metadata. Metadata has significant storage overhead, so it can become a complicating factor if not managed carefully.
In our current system, all the metadata is kept in a manifest file, which is stored and distributed as a regular block in the Codex network. The manifest contains the list of hashes of the blocks that comprise the dataset. A manifest can be "protected", which means that it additionally contains the erasure coding parameters and the hashes of the parity blocks alongside those of the original blocks.
Note that the manifest doesn't contain the actual blocks, only their hashes. Blocks are stored independently from the manifest, and several manifests could point to the same subset of blocks, which could be stored uniquely or duplicated across storage sets or other unrelated nodes.
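For illustration, the manifest described above could be sketched roughly as follows (Python for exposition only; the actual Codex manifest is defined elsewhere, and the field names used here, e.g. `ec_k`, `ec_m`, `parity`, are assumptions):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Manifest:
    # Hashes of the original blocks; their position in this list determines
    # their position in the dataset.
    blocks: List[str]
    # The fields below are only set on "protected" manifests (assumed names):
    ec_k: Optional[int] = None     # number of original blocks per EC group
    ec_m: Optional[int] = None     # number of parity blocks per EC group (M)
    parity: List[str] = field(default_factory=list)   # hashes of the parity blocks

    @property
    def protected(self) -> bool:
        return self.ec_m is not None
```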
### Future-proofing Manifests
So far, we've been able to keep the complexity of the metadata to a minimum. Concretely, block hashes are recorded sequentially and their position in the blocks list/array determines their position in the dataset. This approach has kept the complexity of retrieval to a minimum: downloading a dataset simply means reading all the hashes in the `blocks` array and requesting them sequentially from the network or the local store. This might no longer hold once we factor in all the requirements. For example, the actual layout of the data might change based on the specific EC layout we choose, i.e. blocks might no longer be laid out sequentially, which would require some level of correction at the access level. This seems like a reasonable overhead/complexity given all of the above requirements.
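The difference between today's access path and the "corrected" one can be sketched as follows, reusing the `Manifest` sketch above (the `fetch_block` and `logical_to_stored` helpers are hypothetical; `fetch_block` stands for resolving a hash from the network or the local store):

```python
def download_sequential(manifest, fetch_block):
    """Current behaviour: logical order == order of hashes in `manifest.blocks`."""
    return [fetch_block(h) for h in manifest.blocks]

def download_with_mapping(manifest, fetch_block, logical_to_stored):
    """Hypothetical correction layer: `logical_to_stored(i)` translates a
    logical block index into the index used by the chosen EC layout."""
    return [fetch_block(manifest.blocks[logical_to_stored(i)])
            for i in range(len(manifest.blocks))]
```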
Given all the current and future requirements, it seems reasonable to think that we need some level of future-proofing for our manifests. In this context, `Cid`s already provide some level of control, but would versioning be a better way of signaling features?
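As a sketch of what explicit versioning could look like (the `version` field name and the version values are assumptions, not the current manifest format), a decoder could branch on the version to decide which features, e.g. an EC/layout descriptor, to expect:

```python
MANIFEST_V1 = 1   # current sequential layout
MANIFEST_V2 = 2   # hypothetical future version with an explicit layout descriptor

def decode_manifest(payload: dict) -> dict:
    version = payload.get("version", MANIFEST_V1)
    if version == MANIFEST_V1:
        return {"blocks": payload["blocks"]}
    if version == MANIFEST_V2:
        return {"blocks": payload["blocks"], "layout": payload["layout"]}
    raise ValueError(f"unsupported manifest version: {version}")
```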
## Proposed Roadmap
Given the complexity of the requirements, I propose we address them iteratively. The current approach documented in [#85](https://github.com/status-im/codex-research/pull/85) has some obvious flaws that conflict with the requirements as laid out above. Some ideas are proposed in [#119](https://github.com/status-im/codex-research/discussions/119).
## Current Resources and Discussions
- [#85](https://github.com/status-im/codex-research/pull/85) - description of the currently implemented model and some initial discussion
- [#119](https://github.com/status-im/codex-research/discussions/119) - Continued layout discussions