# *N* copies needed for given SLA
## Part of research note on perpetual storage
---
**Context:** discussing perpetual storage with jnthnvctr and Axel. Meeting [notes](https://docs.google.com/document/d/1aWYwpN9lvgEDHtfnDmMWRtFlOTb6iRtf7FjVBJeYRvM/edit)
**Question:** How many copies must be stored to be confident that a minimum number of copies remains persisted?
**Approach:** target some SLA
### Introduction
Every storage system (e.g. [AWS](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/example-implementations-for-availability-goals.html)) has a set of guarantees regarding data resiliency. Where some clients may prefer to pay for higher redundancy factors to increase the 9s of resiliency for their data, others may prefer to save on cost and operate with weaker guarantees. Unique to the design of Filecoin is the flexibility for users and protocol designers to specify replication factors - allowing builders to tune risk and resiliency appropriately for their use case.
In this analysis, we aim to provide a framework for selecting replication factors on Filecoin and empower clients to reason about the trade-offs between cost and resiliency.
**Definitions and Assumptions:**
* Definitions:
* ***Permanent loss***: We define permanent loss to be the scenario where all copies of data are dropped off the network within a given timeframe.
* ***Temporary loss***: We define temporary loss to be the scenario where a single copy of data is lost within a given timeframe, but upon detection can be restored from one of the remaining copies.
* ***Faults***: A fault occurs when a storage provider fails to provide a proof for their sector in time. A fault does not necessarily mean data is lost (repeated faults would be a better indication of this) - but for our analysis, we take the most conservative case and assume that a fault means the data is lost.
* Assumptions:
* Failure is binomially distributed – i.e. selection with replacement. Given we're able to restore datasets, this seems reasonable.
* Failures are independent. While there is some evidence sector failure events occur in clumps, we assume a strategy in which clients intentionally store their data across distinct storage provider actors and across distinct geographies to mitigate this.
* This can be manually tuned today using reputation systems and the [FVM](https://www.fvm.filecoin.io) can enable more on-chain information to enable automated processing.
* While not explicit, data recovery requires storing additional copies of data on the network. Today, this involves manual intervention, but the [FVM](https://www.fvm.filecoin.io) can enable simple bounty systems for maintaining minimum levels of redundancy. We assume this function exists either via manual replication or bounties.
_Additional Notes_
* Filecoin is unique in that:
* (a.) Storage providers have collateral at stake for keeping data online.
* (b.) Zk-proofs are run over all sectors every 24 hours.
* (c.) Proof of replication makes it possible to verifiably store multiple distinct copies of data (making it irrational for an attacker to pretend to store multiple copies).
* Given (a.), we can restrict our analysis to "inadvertent" data loss: because storage providers have "skin in the game", simply dropping off the network comes at a heavy cost - making it irrational to do so even in periods of token volatility.
* Given (b.), discovery of data loss on the Filecoin network happens at a granular level, allowing recovery actions to take place.
### How many copies of data need to be stored to achieve a target SLA?
If there are $n$ copies, the probability of losing $k$ copies on a given day is
\begin{align*}
p\left(k\,\text{fail}\right)=\binom{n}{k}p_{1}^{k}\left(1-p_{1}\right)^{n-k}
\end{align*}
where $p_{1}$ is the probability one fails on a given day.
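As a quick numerical illustration (a minimal sketch; the values of $n$ and $p_1$ below are placeholders, not recommendations), the daily loss distribution can be evaluated directly:
```python
from math import comb

def p_k_fail(k: int, n: int, p1: float) -> float:
    """Binomial probability that exactly k of n copies fault on a given day."""
    return comb(n, k) * p1**k * (1 - p1) ** (n - k)

# Placeholder values for illustration only.
n, p1 = 10, 0.000246
for k in (0, 1, 2, n):
    print(f"P({k} of {n} fail) = {p_k_fail(k, n, p1):.3e}")
```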
The probability all $n$ fail on a given day (catastrophic failure) is
\begin{align*}
p\left(n\,\text{fail}\right)=p_{1}^{n}\,.
\end{align*}
To achieve a five-9s SLA we therefore require $p\left(n\,\text{fail}\right)=p_{1}^{n}\leq1-0.99999$. Taking logs (and noting that $\log p_{1}<0$, so the inequality flips when dividing) gives
\begin{align*}
n=\left\lceil \frac{\log\left(1-0.99999\right)}{\log p_{1}}\right\rceil \,.
\end{align*}
Simply put, we can calculate the required copies of data to store, $n$, based on our target resiliency and the probability of losing a single copy, $p_1$.
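The calculation is easy to script. The following is a minimal sketch (the helper name `copies_needed` and the example value of $p_1$ are ours, not part of the note):
```python
import math

def copies_needed(p1: float, nines: int = 5) -> int:
    """Copies needed so that P(all copies fault on a given day) <= 10**-nines."""
    target = 10 ** (-nines)                # e.g. 1e-5 for a five-9s SLA
    return math.ceil(math.log(target) / math.log(p1))

# Illustrative call with an assumed daily per-copy fault probability of 1%.
print(copies_needed(0.01, nines=5))        # -> 3
```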
### Strategies for selecting storage providers and finding a value for $p_1$
Given the above, we must now pick a value for $p_1$.
_Note - we can conservatively approach this calculation by assuming a fault (a miner failing to submit a proof on time) is equivalent to data loss. In practice, faults can occur for a number of reasons - e.g. during an upgrade._
One method for calculating $p_1$ might be to look at the network in aggregate - assuming the average case for faults can be applied equally to all miners. To do this calculation, we can take the ratio of the *current faults outstanding* (around 61 PiB) to *network capacity* (around 15 EiB) as our $p_1$ - this comes out at 0.000246 (about 99.97% 'uptime'). Therefore the number of copies needed (using our formula above) for a five-9s SLA is $n=2$. Low number!
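To see how sensitive $n$ is to the choice of $p_1$, here is a small self-contained check that uses the aggregate estimate above alongside a few purely hypothetical alternatives:
```python
import math

# 0.000246 is the aggregate estimate above; the other values are hypothetical,
# included only to show how n moves with the assumed daily loss probability.
for p1 in (0.000246, 0.001, 0.01, 0.05):
    n = math.ceil(math.log(1e-5) / math.log(p1))
    print(f"p1 = {p1:<8} -> copies for five 9s: {n}")
```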
As a base framework this seems like a reasonable approach, and likely a conservative one - assuming the client applies some strategy in choosing the SPs with whom they store their data. Based on network stats, we can do even better at selecting for risk:

Since the vast majority of miners have negligible faults (note the log scale for counts), and a few outliers with higher fault rates drive the average up, users can easily adopt a strategy based on the verifiable history of miner performance. For example, the top quintile of miners by fewest faults per unit of storage power had zero faults in the month of January. As the network matures, the reputational history grows, giving clients a stronger signal of the operational excellence of the SPs with whom they store data.
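A provider-selection strategy of this kind is straightforward to express in code. The sketch below is purely illustrative: the miner IDs, fault counts, and field names are hypothetical, and in practice the inputs would come from chain data or a reputation system.
```python
# Hypothetical per-miner records; field names and values are illustrative only.
miners = [
    {"miner_id": "f01001", "faults_30d": 0,  "power_tib": 512},
    {"miner_id": "f01002", "faults_30d": 0,  "power_tib": 2048},
    {"miner_id": "f01003", "faults_30d": 3,  "power_tib": 1024},
    {"miner_id": "f01004", "faults_30d": 0,  "power_tib": 256},
    {"miner_id": "f01005", "faults_30d": 41, "power_tib": 4096},
]

# Rank providers by faults per unit of storage power (lower is better)
# and keep the best fifth as the candidate set for new deals.
ranked = sorted(miners, key=lambda m: m["faults_30d"] / m["power_tib"])
top_quintile = ranked[: max(1, len(ranked) // 5)]
print([m["miner_id"] for m in top_quintile])
```
In practice one would also diversify across distinct operators and geographies, per the independence assumption above.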
<!-- ### What if lost copies aren't replaced?
**My bias is to maybe kill this section. I'm not sure if it's convincing**
One assumption in the above analysis was that lost copies are replaced the next day. However if lost data is not replaced the next day, time is expected to be a much more important factor.
Consider the ***hypothetical*** scenario of long-term storage without replacement of lost data. Let's store for 100 years, and let $n_\text{o}$ be the number of opportunities to lose data. With Filecoin's daily fault detection, this is $365\cdot100$ more chances than before. In the limit of $n_\text{o}$ being very large, which it is, the binomial distribution is a good approximation to the without-replacement problem of counting failures.
<!-- , so $n_\text{o} = 365100\cdot k$. In the large $n_o$ limit, -->
<!-- \begin{align*}
\left(\begin{array}{c}
n_\text{o}\\
k
\end{array}\right)p_{1}^{k}\left(1-p_{1}\right)^{n_\text{o}-k}
\end{align*}
with $p_1$ being daily loss probability for a single copy. -->
<!-- Now to guarantee five 9s SLA for 100 years --- 99.999% assurance data is retained to 2122 --- we need to keep 111 copies of the data. Assuming the previous daily loss probability of $p_1=0.000246$, the service level vs number of copies is:

If the daily loss probability for a single copy can be improved, for example by 10x to $p_{1}=0.0000246$, then only 21 copies of the data are required. Of course, this whole scenario represents the unlikely service agreement of lost copies not being replaced, so is really a worst case scenario.
<!-- If they're never replaced, we have $365\cdot100$ more daily chances to lose, then
\begin{align*}
p\left(n\,\text{fail}\right)=36500\,p_{1}^{n}\,.
\end{align*}
and if again we assume $p_1=0.000246$, then
\begin{align*}
n =\left\lceil \frac{\text{log}\left(1-0.99999\right)-\text{log}(36500)}{\text{log}\,p_{1}}\right\rceil \,
=3
\end{align*}
i.e. we need only three copies! This seems remarkably few. Of course, the probability of losing a single copy on a given day might vary widely. Let's run a basic sensitivity test: check the number of copies vs the number of 9s for different probabilities of data loss on a given day:
<!-- [](https://i.imgur.com/O0x6aEU.png) -->