# Optimal disk replacement policy under crypto economic failure penalisation constraints
## Problem statement
Being a storage provider relies on having operational hard disks. Sometimes disks fail. The probability of failure typically follows what's known in reliability engineering as a *bathtub curve*:

When a disk fails, the storage provider pays for its replacement. Because the failure is unpredictable, it also carries the risk of missing the winning PoST, which is a block-reward opportunity cost, and of missing window PoST with faulted sectors, which results in collateral slashing. These costs can be mitigated by planning a sufficiently early replacement date, but replacing too early is also suboptimal. Furthermore, disks typically come with a 1, 2 or 3 year warranty. The question then arises: given these factors and a distributional assumption about disk lifetime, what is the optimal policy for replacing the disk?
## Approach
Optimal policy means the time $\tau$ at which it is best to replace the disk. To define what best means, we can note that the problem is a *renewal-reward process*.
Whereas a *renewal process* is a generalization of the Poisson arrival process to arbitrary non-negative inter-arrival times, a *renewal-reward process* introduces additional structure, placing rewards (or costs) on top of arrivals. This is precisely what happens with disk replacement: disks fail at random times, with time-dependent costs, and are then renewed.
To proceed we'll use the renewal-reward theorem, which relates the long-run behaviour of the process to the ratio of the expected reward (here, cost) per cycle and the expected disk lifetime. The expected cost per unit time gives us an expression to minimize in order to find the most cost-effective replacement time.
## Solution
Let $R(t)$ be the cumulative cost of the renewal-reward process for disk replacement up to chronological time $t$. The renewal-reward theorem (in its expected-value form) states
\begin{align*}
\underset{t\to\infty}{\text{lim}}\frac{\mathbb{E}\left[R(t)\right]}{t}=\frac{\mathbb{E}\left[C\right]}{\mathbb{E}\left[\Lambda\right]}
\end{align*}
where $C$ is the random cost incurred in one replacement cycle and $\Lambda$ is the random lifetime of the disk in that cycle.
To find the optimal time $\tau$ we minimize the expected cost per unit time, treating both expectations as functions of the candidate replacement time $t$:
\begin{align*}
\tau=\underset{t}{\text{argmin}}\frac{\mathbb{E}\left[C(t)\right]}{\mathbb{E}\left[\Lambda(t)\right]}\,.
\end{align*}
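To make the ratio concrete, here is a minimal simulation sketch of the renewal-reward idea. It assumes a simplified two-outcome cost rule (the values `cost_unplanned` and `cost_planned` are hypothetical placeholders, not the cost model developed below) and draws lifetimes from a Beta distribution rescaled to $[0,T]$. With $\alpha=\beta=1$ the lifetime is uniform, so the ratio $\mathbb{E}[C]/\mathbb{E}[\Lambda]$ can be written by hand and compared with the simulated long-run cost per unit time.

```python
import random

def simulate_cost_rate(tau, T=5.0, alpha=1.0, beta=1.0,
                       cost_unplanned=800.0, cost_planned=500.0,
                       horizon=200_000.0, seed=0):
    """Simulated long-run cost per unit time when disks are replaced at age tau."""
    rng = random.Random(seed)
    clock = 0.0
    total_cost = 0.0
    while clock < horizon:
        lifetime = T * rng.betavariate(alpha, beta)  # failure age on [0, T]
        if lifetime < tau:
            # Unplanned failure: the cycle ends at the failure time.
            clock += lifetime
            total_cost += cost_unplanned
        else:
            # Planned replacement: the cycle ends at age tau.
            clock += tau
            total_cost += cost_planned
    return total_cost / clock

def analytic_cost_rate_uniform(tau, T=5.0,
                               cost_unplanned=800.0, cost_planned=500.0):
    """E[C] / E[Lambda] for a uniform lifetime on [0, T] (alpha = beta = 1)."""
    p_fail = tau / T
    expected_cost = cost_unplanned * p_fail + cost_planned * (1.0 - p_fail)
    expected_life = tau - tau ** 2 / (2.0 * T)
    return expected_cost / expected_life

if __name__ == "__main__":
    tau = 3.0
    print(simulate_cost_rate(tau), analytic_cost_rate_uniform(tau))
```

The two printed numbers should agree closely for a long horizon, which is the content of the theorem: the time average of the accumulated cost converges to the per-cycle ratio.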
The expectations must be defined over appropriate distributions. We model the time of disk failure, with a density shaped like the bathtub curve above, as a generalized Beta distribution: the failure time $t$ is Beta distributed, but on an arbitrary finite interval $[0,T]$, where $T$ is the maximum retirement time. For this distribution (instead of the usual one on $[0,1]$), the pdf is given by
\begin{align*}
p\left(\text{fail}(t)\right)=T^{1-\alpha-\beta}\left(T-t\right)^{\beta-1}\left(t\right)^{\alpha-1}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,.
\end{align*}
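As an illustration, the density can be transcribed directly into code. This is a minimal sketch using only the Python standard library; the function names `gbeta_pdf` and `midpoint_integral` are introduced here for illustration, and the midpoint rule is just one simple choice of quadrature that avoids evaluating the density at the endpoints $0$ and $T$, where it can be unbounded when $\alpha<1$ or $\beta<1$.

```python
import math

def gbeta_pdf(t, alpha, beta, T):
    """Density of a Beta(alpha, beta) variable rescaled to the interval [0, T]."""
    if not 0.0 < t < T:
        return 0.0
    norm = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return (T ** (1.0 - alpha - beta)
            * (T - t) ** (beta - 1.0)
            * t ** (alpha - 1.0)
            * norm)

def midpoint_integral(f, a, b, n=20_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

if __name__ == "__main__":
    # The density should integrate to approximately one over [0, T].
    total = midpoint_integral(lambda t: gbeta_pdf(t, 0.99, 0.98, 5.0), 0.0, 5.0)
    print(total)
```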
By the tower property of total expectation we can expand the expected disk lifetime over different modes ($m_{\Lambda}$) of failure:
\begin{align*}
\mathbb{E}\left[\Lambda\right] & =\mathbb{E}\left[\mathbb{E}\left[\Lambda|\text{fail}_{i}\right]\right]\\
& =\sum_{i\in m_{\Lambda}}\mathbb{E}\left[\Lambda|\text{fail}_{i}\right]p\left(\text{fail}_{i}\right)
\end{align*}
For the disk lifetime, we consider two failure modes. The disk fails at time $t$ before we replace it, or we replace it preemptively:
\begin{align*}
m_{\Lambda}=\begin{cases}
m_{\Lambda}(t<\tau) & \text{unplanned failure}\\
m_{\Lambda}(t\ge\tau) & \text{replaced before failure}
\end{cases}\,.
\end{align*}
The probability of each scenario, the $p\left(\text{fail}_{i}\right)$ in each case, is given by
\begin{align*}
p\left(\text{fail}_{t<\tau}\right) & =\int_{0}^{\tau}p\left(\text{fail}(t)\right)dt\\
& =\text{B}(\frac{\tau}{T},\alpha,\beta)\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
\end{align*}
where $\text{B}(\frac{\tau}{T},\alpha,\beta)$ is the incomplete beta function, and
\begin{align*}
p\left(\text{fail}_{t\ge\tau}\right) & =\int_{\tau}^{T}p\left(\text{fail}(t)\right)dt\\
& =\frac{\Gamma(\alpha)\Gamma(\beta)-\text{B}(\frac{\tau}{T},\alpha,\beta)\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,.
\end{align*}
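As a quick sanity check on these two expressions (a worked special case added for illustration), take the uniform lifetime $\alpha=\beta=1$. Then $\frac{\Gamma(2)}{\Gamma(1)\Gamma(1)}=1$ and $\text{B}(x,1,1)=\int_{0}^{x}du=x$, so
\begin{align*}
p\left(\text{fail}_{t<\tau}\right) & =\frac{\tau}{T}\,,\\
p\left(\text{fail}_{t\ge\tau}\right) & =1-\frac{\tau}{T}\,,
\end{align*}
which sum to one, as the probabilities of the two modes must.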
The contribution of failure before retirement to the expected disk lifetime, that is, the conditional expectation already weighted by the probability of this mode, is
\begin{align*}
\mathbb{E}\left[\Lambda|\text{fail}_{t<\tau}\right]p\left(\text{fail}_{t<\tau}\right) & =\int_{0}^{\tau}t\,p\left(\text{fail}(t)\right)dt\\
& =T\,\text{B}(\frac{\tau}{T},1+\alpha,\beta)\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,.
\end{align*}
If failure happens after the disk is retired, the expected lifetime is simply $\tau$.
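Putting the two modes together gives the denominator of the objective. The following is a sketch that reads the integral above as the probability-weighted contribution of unplanned failures (so it is not multiplied by $p\left(\text{fail}_{t<\tau}\right)$ a second time) and weights the retirement lifetime $\tau$ by $p\left(\text{fail}_{t\ge\tau}\right)$:
\begin{align*}
\mathbb{E}\left[\Lambda\right] & =T\,\text{B}(\frac{\tau}{T},1+\alpha,\beta)\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}+\tau\left(1-\text{B}(\frac{\tau}{T},\alpha,\beta)\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\right)\,.
\end{align*}
For the uniform special case $\alpha=\beta=1$ we have $\text{B}(x,2,1)=\frac{x^{2}}{2}$, so this reduces to $\frac{\tau^{2}}{2T}+\tau\left(1-\frac{\tau}{T}\right)=\tau-\frac{\tau^{2}}{2T}$.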
For costs, we can similarly expand over different modes $m_{C}$ of cost realization:
\begin{align*}
\mathbb{E}\left[C\right] & =\mathbb{E}\left[\mathbb{E}\left[C|\text{fail}_{i}\right]\right]\\
& =\sum_{i\in m_{C}}\mathbb{E}\left[C|\text{fail}_{i}\right]p\left(\text{fail}_{i}\right)
\end{align*}
How the cost varies with failure time depends on the warranty period $W$, the replacement policy time $\tau$, the cost of disk replacement $D$, the slashing incurred $S$, and the block-reward opportunity cost $B$. In this model, three scenarios are considered:
* Disk failure during warranty, at time $t$ where $t<W<\tau$, in which case the $\mathbb{E}\left[C|\text{fail}_{i}\right]$ cost is given by slashing $S$ $+$ block-reward $B$.
* Failure after warranty, at time $t$ where $W<t<\tau$, giving an $\mathbb{E}\left[C|\text{fail}_{i}\right]$ cost of disk replacement $D$ $+$ slashing $S$ $+$ block-reward $B$.
* The disk is retired before it fails, i.e. $W<\tau<t$, in which case the cost is just disk replacement $D$, since a planned replacement avoids slashing and missed block rewards.
For each of the $\mathbb{E}\left[C|\text{fail}_{i}\right]p\left(\text{fail}_{i}\right)$ terms in $\mathbb{E}\left[C\right]$, analytic expressions are available but omitted here for brevity.
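For reference, here is a sketch of what the sum looks like under the three scenarios above, assuming $W<\tau$ and writing $F(x)=\text{B}(\frac{x}{T},\alpha,\beta)\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}$ for the probability of failure before age $x$ ($F$ is shorthand introduced here for illustration):
\begin{align*}
\mathbb{E}\left[C\right] & =\left(S+B\right)F(W)+\left(D+S+B\right)\left(F(\tau)-F(W)\right)+D\left(1-F(\tau)\right)\\
& =D+\left(S+B\right)F(\tau)-D\,F(W)\,.
\end{align*}
Here $B$ without arguments is the block-reward opportunity cost, while $\text{B}(\cdot,\cdot,\cdot)$ is the incomplete beta function.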
## Computation
If we assume the lifetime distribution is generalized Beta with maximum retirement time of $T=5$ years, and shape parameters $\alpha=0.99$ and $\beta=0.98$, i.e. $\text{gBeta}(0.99,0.98,5)$, then our bathtub curve looks like

If:
* the warranty is $W=3$ years
* disk replacement cost is $D=$\$500
* slashing cost from unplanned failure is $S=$\$200
* block reward opportunity cost is $B=$\$100
then our optimisation problem is to find the minimum of $\frac{\mathbb{E}\left[C(t)\right]}{\mathbb{E}\left[\Lambda(t)\right]}$ with these specific parameters. This cost function looks like:

The optimal cost policy is to replace the disk at time $\tau=3.3$ years, that is, slightly after the $3$ year warranty expires.
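The search for the minimum can be sketched numerically as follows. This is a minimal, standard-library-only illustration rather than the computation actually used for the figures: it assumes the three cost scenarios listed in the Solution section, restricts the search to $W\le\tau<T$ (the only case those scenarios cover), and approximates the integrals with a midpoint rule on a fixed grid. The function names and the grid size `n` are choices made here.

```python
import math

def cost_rate_curve(alpha=0.99, beta=0.98, T=5.0, W=3.0,
                    D=500.0, S=200.0, B=100.0, n=5000):
    """Return (tau, expected cost per unit time) pairs for tau on a grid in [W, T)."""
    norm = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    h = T / n
    cdf = [0.0]           # cdf[k] approximates P(failure before age k*h)
    partial_mean = [0.0]  # partial_mean[k] approximates the integral of t*pdf(t) on [0, k*h]
    for i in range(n):
        t = (i + 0.5) * h  # cell midpoint, so the endpoints 0 and T are never evaluated
        pdf = (T ** (1 - alpha - beta) * (T - t) ** (beta - 1)
               * t ** (alpha - 1) * norm)
        cdf.append(cdf[-1] + pdf * h)
        partial_mean.append(partial_mean[-1] + t * pdf * h)

    k_w = round(W / h)  # grid index at which the warranty ends
    curve = []
    for k in range(k_w, n):
        tau = k * h
        p_warranty = cdf[k_w]         # failure during warranty: cost S + B
        p_late = cdf[k] - cdf[k_w]    # failure after warranty, before tau: cost D + S + B
        p_retired = 1.0 - cdf[k]      # disk retired at tau: cost D
        expected_cost = ((S + B) * p_warranty
                         + (D + S + B) * p_late
                         + D * p_retired)
        expected_life = partial_mean[k] + tau * p_retired
        curve.append((tau, expected_cost / expected_life))
    return curve

def optimal_tau(**params):
    """Grid point in [W, T) with the smallest expected cost per unit time."""
    return min(cost_rate_curve(**params), key=lambda pair: pair[1])[0]

if __name__ == "__main__":
    print(optimal_tau())
```

As a cross-check that does not rely on the code, the nearly uniform case $\alpha=\beta=1$ can be solved by hand: for $\tau\ge W$ the objective becomes $\frac{200+60\tau}{\tau-\tau^{2}/10}$, whose minimum is at $\tau=\frac{10}{3}$ years, consistent with the value reported above.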
## Next
Improvements:
* using lifetime distribution parameters $\alpha$ and $\beta$ that are actually based on data for the types of disks SPs are using
* using the actual cost of disk replacement instead of the placeholder \$500
* working out actual numbers for the costs associated with slashing and missed block rewards due to disk failure
Furthermore this type of optimal policy model can potentially be fed into other research regarding long-term storage e.g. pricing perpetual storage.