$$
\newcommand{\id}[1]{\text{id}\left(#1\right)}
\newcommand{\pdur}{p_{\text{dur}}}
\newcommand{\ploss}{p_{\text{loss}}}
$$
# An Altruistic Mode for Codex
## Terminology
**Data.** We assume our storage system stores _files_. In our context, a file $F = \left(b_1, \cdots, b_n\right)$ is an ordered sequence of fixed-length blocks. This is without loss of generality, as a single block can be represented as a single-block file $F_b = \left(b\right)$. We can assign a unique identifier $\id{F}$ to a file by, say, taking the Merkle root of the tree built over its blocks. A block can then be identified by a tuple $\left(\id{F}, i\right)$; $1 \leq i \leq n$.
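As a concrete illustration, the file identifier can be computed as a Merkle root over the blocks. The sketch below is one possible construction, assuming SHA-256 and duplication of the last node at odd-length levels (both arbitrary choices, not a Codex specification):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def file_id(blocks: list[bytes]) -> bytes:
    """Merkle root over a file's blocks, duplicating the last node
    whenever a tree level has odd length (an illustrative convention)."""
    level = [h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# A block is then addressed by the tuple (file_id(F), i), 1 <= i <= n.
```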
**Node roles.** Nodes in our network can take two roles:
* **Storage Providers (SPs).** Provide storage that can be used by other nodes in the network to store their files.
* **Storage Clients (SCs).** Publish data on the network by uploading files to SPs; e.g., an individual wishing to back up their photos onto the network would take the role of a Storage Client when doing so.
## Durability in Altruism
One of the key properties of a storage system is durability. Durability relates to how likely a storage system is to lose data, and can typically be quantified as the probability $\pdur$ that some piece of data $F$ survives over a given period of time.
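Under a simple illustrative model, for instance, where a file is held by $k$ providers failing independently, each losing its copy over the period with probability $\ploss$, we would have

$$
\pdur = 1 - \ploss^k
$$

so that $k = 3$ replicas with $\ploss = 0.1$ already yield $\pdur = 0.999$.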
Ensuring that data does not get lost means countering the factors that incur loss. In the case of decentralized storage, those can be roughly categorized as:
1. **natural faults;** e.g., disk and hardware failures due to wear, defects, and external factors like natural disasters;
2. **byzantine behavior;** e.g., participants that commit to holding $F$ but then do not. This might be malicious, as with participants trying to forge storage proofs for data they do not hold to make a profit, or actively destroying data, but it might also be something much more mundane, like participants dropping data because they are no longer interested in serving it, or because it became economically uninteresting to do so.
While (1) is an oft-cited source of data loss, we argue it is not the _primary_ one in decentralized storage. Rather, incentivizing providers not to drop data is likely the biggest hurdle to achieving workable decentralized storage at large.
**Cryptoeconomic security and unstable honesty.** Codex has so far hinged on the assumption of cryptoeconomic security as its pillar for durability. The idea is simple: every piece of data $F$ that a Storage Provider (SP) commits to store is covered by a _storage contract_, which locks some amount of _economically valuable collateral_ provided by the SP upfront when it accepts the contract (staking).
The SP then needs to provide proofs, at random intervals, that it is indeed holding the data it has committed to hold (remote auditing). If an SP fails to provide such a proof in a timely fashion, it will be slashed (slashing); i.e., it will lose part of its collateral. If enough such proofs are missed, the data itself will be declared lost, and another SP will be allowed to claim it. Since data has been lost, this might involve reconstructing it from erasure-coded pieces still present at other SPs (repair).
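The proof-and-slash loop above can be sketched as follows. The state layout and the concrete parameters (`slash_fraction`, `max_missed`) are illustrative assumptions for exposition, not Codex's actual mechanism or values:

```python
from dataclasses import dataclass

@dataclass
class StorageContract:
    """Illustrative state of one SP's commitment to one dataset."""
    collateral: float
    slash_fraction: float = 0.1  # hypothetical share of collateral lost per miss
    max_missed: int = 3          # hypothetical misses before data is declared lost
    missed: int = 0
    lost: bool = False

    def on_proof_round(self, proof_ok: bool) -> None:
        if proof_ok:
            self.missed = 0
            return
        # Missed proof: slash part of the remaining collateral.
        self.collateral *= 1 - self.slash_fraction
        self.missed += 1
        if self.missed >= self.max_missed:
            # Data declared lost; another SP may claim the slot and
            # trigger repair from erasure-coded pieces held elsewhere.
            self.lost = True
```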
The cost of slashing acts as a strong incentive to maintain correct behavior. This comes, however, with two critical, related, assumptions:
1. that the collateral remains economically valuable over time;
2. that the cost of operating a node is always smaller than the cost of slashing.
Collateral is typically backed by a token, as are storage payments. If that token loses value, neither condition (1) nor (2) may continue to hold, which means providers can, and most likely _will_, drop data.
Since otherwise honest nodes might turn adversarial at any time due to market forces, we say this induces _unstable honesty_.
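The two conditions can be made concrete with a toy rationality check (all names and numbers below are hypothetical, chosen only to show how the decision flips with the token price):

```python
def keeps_storing(token_price: float,
                  collateral_tokens: float,
                  payment_tokens: float,
                  operating_cost_fiat: float) -> bool:
    """Toy model: an SP keeps the data while dropping it would cost more
    (slashed collateral plus forgone payments, priced in fiat) than the
    cost of continuing to operate. All parameters are hypothetical."""
    cost_of_dropping = (collateral_tokens + payment_tokens) * token_price
    return cost_of_dropping > operating_cost_fiat
```

With collateral and payments fixed in tokens, the same contract flips from "keep" to "drop" purely because the token price fell; this is the market-driven instability referred to above.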
The contingency of $\pdur$ on market forces that are _not_ captured in the current reliability models we have outlined for Codex means that the actual durability guarantees of Codex are much less clear than one might be led to believe.
**Intrinsic data value.** In _intrinsic data value_, one postulates that honest SPs will volunteer to store data because they value such data; i.e., because they do not want to see such data lost. If said interest is _stable_, this provides a much stronger incentive to SPs than cryptoeconomic security, as it would induce _stable honesty_ instead.
In principle, this would appear to be an easier problem to solve, as it excludes market forces from the durability equation. The downside is that altruistic nodes will likely be much fewer in number and, depending on the underlying mechanics, selective about which types of datasets they want to store, which immediately excludes use cases like personal backups and other types of commercial usage.
## Key Elements
Disclaimer: I'm trying to reason about this as Jacek suggested. Success is not guaranteed! :-)
There are several ways in which one could conceive an altruistic mode for Codex. One way of framing it is to picture Codex as a modular structure within a general framework.
**Dissemination.** In broad terms, SCs introduce new files onto the network, whereas SPs decide whether or not to replicate them. Assuming, wlog, that each file has a CID attached to it, the first question becomes how SPs learn about datasets:
1. **Out-of-band.** SPs learn them from the publisher directly, or through some third-party directory (e.g. PirateBay);
2. **Automatically.** SPs learn them from a broadcast channel (Fig. 1). This could be a global pubsub channel, or group-based.
(1) looks like BitTorrent. (2) looks like the Marketplace, but we could equally envision a P2P pubsub channel in which such CIDs are gossiped. Learning about _existing_ CIDs at the time one joins the network can be slightly more challenging - in Marketplace Codex, those are stored on-chain. One could equally find it reasonable that nodes in the network store all CIDs, or some bounded subset of them, and that joining nodes learn those from their neighbors.
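A pubsub-flavored variant of (2) could look roughly like the sketch below. The bounded CID cache and the neighbor-sync bootstrap are assumptions for illustration, not a specified protocol:

```python
from collections import OrderedDict

class CidDirectory:
    """Bounded set of recently seen CIDs, populated from a broadcast
    channel. The bound and eviction order are illustrative choices."""

    def __init__(self, max_cids: int = 10_000):
        self.max_cids = max_cids
        self.cids: OrderedDict[str, None] = OrderedDict()

    def on_gossip(self, cid: str) -> bool:
        """Record a CID heard on the channel; returns True when it is
        new, signalling that it should be re-gossiped to neighbors."""
        if cid in self.cids:
            return False
        self.cids[cid] = None
        if len(self.cids) > self.max_cids:
            self.cids.popitem(last=False)  # evict the oldest entry
        return True

    def sync_from(self, neighbor: "CidDirectory") -> None:
        """A joining node bootstraps its view of existing CIDs from a
        neighbor, as an alternative to reading them on-chain."""
        for cid in neighbor.cids:
            self.on_gossip(cid)
```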
<center>
<img src="https://hackmd.io/_uploads/BJLKS1vIll.png" width="65%"/>
</center>
**Figure 1.** Learning CIDs from a broadcast channel.
Once an SP learns about a CID, it must decide whether or not it wants to store the dataset corresponding to that CID as well. The SP SHOULD download the dataset's metadata from the network and apply a decision function to evaluate whether or not the dataset is valuable enough to store.
The SP MAY evict an existing dataset to make room for the new one, should the _eviction penalty_ be low enough that the value gained from storing the new dataset offsets it. If dataset value is completely decoupled from dataset size, this is equivalent to a knapsack problem for which the provider runs a greedy approximation.
```python
def on_cid(self: SP, cid: CID) -> None:
    # Fetch the dataset's metadata from the network before deciding.
    meta = fetch_meta(cid)
    # A dataset the SP assigns no value to is never stored.
    if self.dataset_value(meta) == 0:
        return
    if not self.has_free_space(meta) and \
       not self.can_make_room(meta):
        return
    self.store(cid)

def can_make_room(self: SP, meta: Metadata) -> bool:
    required_space = meta.dataset_size - self.free_space
    # Consider evicting the least valuable datasets first.
    datasets = sorted(self.datasets, key=lambda x: -self.dataset_value(x))
    to_evict = []
    penalty = 0
    while required_space > 0 and datasets:
        candidate = datasets.pop()  # lowest-valued remaining dataset
        to_evict.append(candidate)
        required_space -= candidate.dataset_size
        penalty += self.eviction_penalty(candidate)
    # Evict only if enough space was found and freeing it costs less
    # than the value gained from the new dataset.
    if required_space <= 0 and penalty < self.dataset_value(meta):
        self.evict_all(to_evict)
        return True
    return False
```
<<To be continued>>