# Why "Low Durability" is a Bad Name
### (or why we must know what we aim for)
Durability is a probability. In its simplest form[^2], it can be written down as a family of random variables $X_t$, indexed by a real number $t$ representing some length of time. We can then look at a random instant in time[^1] and at any piece of data $C$ that currently exists in the system, and define:
$$
\begin{equation}
X_t =
\begin{cases}
1, \text{if}\ C\ \text{is lost within the next}\ t\ \text{time instants} \\
0, \text{otherwise}
\end{cases}
\end{equation}
$$
We are, then, interested in $\mathbb{P}(X_t = 0)$[^3]. This is our durability. We could estimate this empirically if we could get the data, or we could build a best-guess model based on existing models and data. We can then pick a threshold $\delta$ and declare that:
* if $\mathbb{P}(X_t = 0) \leq \delta$, then durability is _low_;
* if $\mathbb{P}(X_t = 0) > \delta$, then durability is _high_.
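The threshold test above can be made concrete with a small sketch. The replica model here (an object stored as $k$ independent replicas, each lost within the window with probability $p$, and no repair) is an assumption chosen purely for illustration, not something the argument depends on; under it, $\mathbb{P}(X_t = 1) = p^k$ and durability is $1 - p^k$.

```python
# Sketch of the "low vs. high durability" classification from the text.
# Assumed model (for illustration only): k independent replicas, each lost
# within the time window with probability p, and no repair process.

def durability(p_loss_per_replica: float, replicas: int) -> float:
    """P(X_t = 0): C survives the window unless ALL replicas are lost."""
    return 1.0 - p_loss_per_replica ** replicas

def classify(d: float, delta: float) -> str:
    """Apply the threshold rule: 'low' if P(X_t = 0) <= delta, else 'high'."""
    return "high" if d > delta else "low"

d = durability(p_loss_per_replica=0.01, replicas=3)  # 1 - 1e-6
print(d, classify(d, delta=0.99999))
```

The point of the sketch is that the classification only means something once a model (or empirical estimate) for $\mathbb{P}(X_t = 0)$ and a concrete $\delta$ have been fixed; change either and the same system flips between "low" and "high".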
"Low durability", then, refers to an explicit assumption that $\mathbb{P}(X_t = 0)$ is below $\delta$. What we have been referring to as "low durability" in our discussions, however, is actually "unknown durability". I would rather call this "best-effort" durability, as in, we are trying to store it but have no idea what the guarantees are. It could be low, it could be high, we just do not know.
And this also puts the actual problem of operating this way in the spotlight: without more work on establishing what properties we want, any system can be said to guarantee best-effort durability. This means that either:
1. we must refine the durability properties we are trying to provide;
2. we must focus on properties other than durability to distinguish ourselves from existing best-effort durability systems (e.g. BitTorrent, Freenet, IPFS, Hypercore). This could be e.g. storage efficiency, bandwidth efficiency, privacy, or a more qualitative property like usability. But we need to know what we are aiming for.
Otherwise, I argue, we just do not know what we are building and why, and that is not a good way to operate.
[^1]: Assuming memorylessness, for simplicity.
[^2]: Simplest that I can think of right now. :-)
[^3]: Note that we make very few assumptions on the nature of $X_t$. It could mean that all replicas in a replicated or erasure-coded system have failed while repair has not managed to keep up. It could mean someone attacked the system and destroyed $C$. Regardless of how the system works and how loss occurs, we can just assign a probability that it happens within $t$ time instants of any random time instant within the system's existence.