Tom Mellan
# Faults analysis

*Work in progress*

## Short summary

While faults are going up on average, the same can't be said for fault volatility. Fault occurrence has near-perfect adherence to the 48hr fault fee rule. **Suggestion**: halve the fault fee, and change the no-fee time from 48hrs to 24hrs. Since it's saturated, there's scope to tighten the time and loosen the fee.

## Key points

* Average faults have been creeping up, whereas the magnitude of fault fluctuations has been slightly decreasing.
* Filecoin fault events show highly complex and non-stationary dynamics. For example, with exploratory modeling we see some evidence of jump diffusion and volatility clustering.
* Despite the nuance needed for a principled representation of the data, a simple stochastic volatility model accounts for much of the observed variation. This model may serve as a basis for risk-neutral valuation in order to price faults.
* Sector failure events have fat tails. This is partially attributable to miner-level failures, based on analysis of how faults are distributed across miner_ids.
* Consecutive daily sector faults, grouping by miner_id + sector_id, are incredibly rare. This suggests: i) faults tend to get fixed promptly to avoid fees (and by extension the fault fee is too high, if consecutive faults effectively never occur); ii) terminations are voluntary (plausible, as the fee is lower); iii) alternatively, there's a bug in the Sentinel data.
* On the question of fault fees, non-linear fees have attractive features, albeit at the cost of more complexity.
* Miner-level simulation indicates a FIL-on-FIL breakeven uptime of approximately 65%.

## Context, risk and psychology

Filecoin relies on protocol parameters to regulate individual SP actions and promote growth of the network ecosystem. Several mechanisms are key in this:

* long-term stability is encouraged through scheduled vesting,
* [EIP-1559](https://github.com/ethereum/EIPs/blob/master/EIPS/eip-1559.md) style [gas fees](https://github.com/filecoin-project/FIPs/blob/master/FIPS/fip-0013.md) close specific vectors of attack and reduce circulating issuance,
* and fault fees encourage consistent storage through rewards and penalties levied according to [fee parameters](https://github.com/filecoin-project/FIPs/blob/master/FIPS/fip-0002.md).

The initial tokenomics spec was created through extensive simulation prior to launch. Mainnet has now been [live since 15-Oct-2020](https://filecoin.io/blog/posts/filecoin-mainnet-is-live/) with much empirical data collected, so it's a good time to reassess all round. In this document we examine fault fees in particular, but before dropping into the details, it's worthwhile to pause for context.

Filecoin mainnet supports thousands of SPs, and FIL trades around 500M USD in volume per day --- suggesting, and moreover actually changing, protocol parameters carries some degree of risk. True, but successful examples of pulling levers to improve an economy have precedent. In traditional finance the bag is mixed --- the [FED](https://www.federalreserve.gov/newsevents/speech/bowman20220221a.htm) and [BOE](https://www.bankofengland.co.uk/-/media/boe/files/monetary-policy-report/2021/august/monetary-policy-report-august-2021.pdf) continue to push up interest rates to quieten widespread inflation. But elsewhere, web3, defi, and L1s have had more success with recent major tokenomics updates. Perhaps the most well-known example is Eth's introduction of gas fees in the [London fork](https://github.com/ethereum/EIPs/blob/master/EIPS/eip-1559.md).
Subsequent [theoretical](https://www.youtube.com/watch?v=ndNyx-Oj9Wk) and [empirical](https://arxiv.org/pdf/2201.05574.pdf) analyses give strong evidence of lower fee volatility with minimal downside. Meanwhile in gaming coins, Axie's tokenomic spec for the SLP token has been re-tuned to change the minting rate, which was [previously seen](https://www.axieworld.com/en/economics/charts?chart=slpIssuance) to be vastly outstripping burning. Long-term effects remain to be measured and understood --- worth checking in on this in a few months --- but at the time of writing market sentiment is positive. In part, signs of active management are often positive signaling. More directly, circulating supply dynamics encouraging a deflationary economy are an attractive prospect for many, which introduces the final comment.

The purpose of this document is to begin a re-examination of faults and fault fee parameters --- for the most part, a mathematical and data-driven scientific exercise. But context is everything, and with fees part of this framing is perception, specifically *prospects* of future revenue. Actors in the network are not ideal; they're not free from irrationality, random externalities, or the misperception of risk. Recall the cognitive fallacy of systematically miscalibrating the relative size of future gains compared to losses --- the distinction between utility theory and the prospect theory of [Kahneman and Tversky](http://www.dklevine.com/archive/refs47656.pdf). Prospect theory specifically updates the concept of utility to distinguish losses from gains, recognizing our tendency to place more value on loss avoidance, for example from fault fees, than on achieving rational gains from rewards. Of course, risk aversion and perceived value asymmetry as per prospect theory apply to individuals, and SPs as businesses *are* different. But it is a plausible assumption that many aspects of individual behavior flow up to [enterprise](https://hbr.org/2012/02/why-companies-are-betting-agai) level too. While behavioral economics is *not* the focus here, the difference between rational utility and prospect of revenue serves to color the limitations of any and all analyses based on rewards and fault fees. Precautions over!

## Fault trends analysis

### The basic empirical view

The analysis considers 130,741,899 unique sector fault events between June 2021 and Jan 2022, pulled from SECTOR_FAULTED in miner_sector_events from the Sentinel [Lily](https://lilium.sh/lily/) visor.

Faults are increasing. Increasing in terms of *total faults per month*, which from June 2021 to Jan 2022 are up 50% (except for dips in August and September). And increasing in terms of *median hourly faults per month*, a metric that's robust to outliers, which increased from 10K/hour in June 2021 to 20K/hour in Jan 2022.

![](https://i.imgur.com/NPOSzg4.png)

![](https://i.imgur.com/t23BiGH.png)

![](https://i.imgur.com/A9B57wI.png)

The number of hourly fault events fluctuates a lot. To separate drift in location from variation in scale, we fit a stochastic volatility model (specification in the supplementary). The model shows substantial variation from hour to hour and day to day:

![](https://i.imgur.com/bNlTXtb.png)

Summarising by month, volatility is typically decreasing slightly month on month:

![](https://i.imgur.com/jSeseIF.png)

In terms of the distribution of fault event counts aggregated by hour, we observe a monotonically decreasing distribution: exponential with a power-law tail. If the hourly faults were simply exponentially distributed, we'd expect daily faults to be Erlang. Visually, Erlang/Gamma/Lognormal are all plausible descriptions for daily faults, though in reality the tail is fatter. Which begs the question: what's causing the power-law tail?
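The monthly totals and medians above come from a straightforward aggregation of the event extract. As a minimal sketch of that step (the file name and column names here are hypothetical, not the actual Lily schema):

```python
import pandas as pd

# Hypothetical extract of SECTOR_FAULTED rows from Lily's miner_sector_events;
# the file and column names (timestamp, miner_id, sector_id) are assumptions.
events = pd.read_parquet("sector_faulted_events.parquet")
events["timestamp"] = pd.to_datetime(events["timestamp"])

# Hourly fault counts, then monthly totals and monthly medians of the hourly series.
hourly = events.set_index("timestamp").resample("1H").size()
monthly_total = hourly.resample("1M").sum()
monthly_median_hourly = hourly.resample("1M").median()

print(monthly_total)
print(monthly_median_hourly)
```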
### Faults concentrate by miner

One answer to this can be found by appealing to information entropy and how faulted sectors are distributed across `miner_id`. Take January for example. Here we find the days with the largest number of daily faults are also the days with the lowest entropy, when faults are most concentrated:

![](https://i.imgur.com/sqNgy2K.png)

This means on days with many faults, the faults are more concentrated than usual --- indicating miner-level failures. But is this effect significant? Entropy as a metric of the spread-out-ness of faults across miners has an effect size of -0.4 (90% CI [-0.20, -0.60]) when predicting the total number of faults. Which means this *is* significant evidence for miner-level failure as a factor in fault spikes.

### Fault distributions

While we've established some evidence of faults concentrating by `miner_id`, a curious behavior is observed in the distribution of faults by `sector_id`. Look at the distribution for miner_sectors that record multiple fault events in January:

![](https://i.imgur.com/uEUOfSL.png)

The first thing that's remarkable is that the largest number of faults reported by any miner_sector is 11. This means terminations generated through consecutive faults are **far** from happening (42 needed). And January is not an exception; similar observations are made in other months. This implies the termination-by-consecutive-faults punishment is completely effective. Complete effectiveness is not a good thing. It suggests inefficiency through excessive penalty, and that the fault fee could be lowered and still be effective.

The second remarkable point concerns consecutive faults. Define consecutive faults as miner_sectors recording a second fault within 48hrs of the first. This **almost never** happens. For example, out of the miner_sectors that had the most faults (the ones with 11 in January), the longest chain of consecutive faults ever observed is 2. That seems somewhat low, but is it actually? If we compare it to 11 randomly distributed faults across a 31-day month, the expected length of the longest consecutive run is 3.6. So 2 is definitely less than expected at random. Furthermore, if we look at the thousands of miner_sectors with e.g. 5 total faults in the month, *none* are consecutive. Consecutive fault days are strongly repulsive.

The distribution of faults implies three things:

* the 42-day rule for terminations could be shorter and still effective
* terminations are voluntary (to get lower fees)
* miners carefully manage faulted sectors to avoid fees.

### More models

For more depth on the faults picture, a series of models of progressively increasing complexity were developed. The aim was to build understanding through inference by seeing what fits, updating assumptions, and iterating to a new model. In short, we find models that include jumps fit the non-stationary fault time-series best, and that aspects of history-dependent self-excitation in the faults also improve the fits. An example is shown here for the first few hundred hours in January:

![](https://i.imgur.com/kowMdrY.png)

All models fitted are specified in the Supplementary Information.
### Faults and terminations

What about the connection between faults and terminations --- is there one? In descriptive terms, terminations are bimodal, distinct from faults, as shown by the empirical joint distribution here:

![](https://i.imgur.com/HS3JhhY.png)

Looking at the hourly terminations time-series, again the behavior is quite different:

![](https://i.imgur.com/uZlmHRN.png)

At the protocol level, 42 days of faults trigger termination. We've already established this isn't happening frequently, but is there any statistical evidence of a causal link between fault levels and future terminations nonetheless? Looking at the time-lagged cross-correlation between faults and terminations does suggest a weak leader-follower relationship:

![](https://i.imgur.com/ALc64VT.png)

We can take a closer look using time-localized cross-correlation. Let's assume the resolution of time chunks is 42 days. Now the leader-follower relationship all but disappears:

![](https://i.imgur.com/n2oTmXD.png)

But we do find a clear shift in correlation in Sept-Oct. Since it's observed across all lags, it indicates the leader-follower relationship was an artifact, and that there's no evidence at this level of analysis for faults now causing terminations in the future.

## A phenomenological model of breakeven

The following section examines the simplest possible model for breakeven. The model is analytically tractable and allows us to build intuition, then explore some non-linear variations.

*Symbols*

* $T$ is the term, the total period of time of consideration
* $R_{T}$ is cumulative reward
* $E_{T}$ is cumulative expenses, generated from fees
* $r_{t}$ is daily reward
* $e_{t}$ is daily expenses
* $r$ is term-averaged daily reward, $R_{T}/T$
* $e$ is term-averaged daily expenses, $E_{T}/T$
* $f$ is a constant fault fee factor: $e_{t}=fr_{t}$
* $f_{d}$ is a variable fault fee factor. Depends on downtime.

Let breakeven for term $T$ be defined by $R_{T}=E_{T}$, where $R_{T}$ is cumulative reward, $R_{T}=\sum_{t\leq T}r_{t}$, and $E_{T}$ is cumulative expenses, $E_{T}=\sum_{t\leq T}e_{t}$. For average daily reward $r=R_{T}/T$ and average daily expense $e=E_{T}/T$, uptime $u$ as a fraction of the term is $u=U/T$, and downtime is $d=D/T$. The point of breakeven must satisfy

\begin{align*}
ru & =ed\\
 & =e(1-u)\,^{\dagger}\\
 & =fr(1-u)\,^{\dagger\dagger}
\end{align*}

$\dagger$: assume only two possible states, $d+u=1$.
$\dagger\dagger$: assume the expense rate is proportional to the reward rate, $e_{t}=fr_{t}$.

It follows that the level of uptime $u$ implied by a given level of fault fee factor $f$ is

\begin{align}
u=\frac{f}{1+f}\,.
\end{align}

Conversely, the fault fee factor implied by an observed level of uptime is

\begin{align}
f=\frac{u}{1-u}\,.
\end{align}

What does this imply? If the uptime is two-thirds ($u=\frac{2}{3}$), the implied breakeven fault fee factor is $f=2$. If 50% sector uptime is satisfactory, a fault fee factor of 1 is sufficient. If we're zealous about disincentivizing sector faults and desire an uptime of 90%, then a fee factor of $9\times$ is required.

The idealization presented is stripped back but captures the fundamental balance at the heart of the problem. We can develop this further by relaxing the assumption $e=fr$, that expenses are a fixed factor of rewards, and having non-linear fault fees.

<!-- Non-linearity as presented here has the potential to achieve more economically desirable behavior. -->
<!-- This is for example seen with the [quadratic voting](https://arxiv.org/pdf/1809.06421.pdf) solution to the free-rider problem. -->
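These linear relations are small enough to sanity-check numerically before moving to the non-linear variants; a minimal sketch:

```python
def implied_uptime(f: float) -> float:
    """Breakeven uptime implied by a constant fault fee factor f (e_t = f * r_t)."""
    return f / (1.0 + f)

def implied_fee_factor(u: float) -> float:
    """Fault fee factor implied by an observed breakeven uptime u."""
    return u / (1.0 - u)

assert abs(implied_uptime(2.0) - 2.0 / 3.0) < 1e-12   # u = 2/3  <->  f = 2
assert abs(implied_fee_factor(0.5) - 1.0) < 1e-12     # 50% uptime <->  f = 1
assert abs(implied_fee_factor(0.9) - 9.0) < 1e-12     # 90% uptime <->  f = 9
```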
![](https://i.imgur.com/HXReTBe.png)

Expense-downtime relation with fault fee factor $f=1$, for the current linear, and proposed quadratic and geometric scalings.

![](https://i.imgur.com/3mZ5GJo.png)

Expense-uptime relation, with $f=\{2.14,\,4/3,\,6\}$ in each instance.

![](https://i.imgur.com/oJWSJ0b.png)

Fault fee scaling vs uptime implied by each mechanism.

If the fault fee factor depends on downtime, the following breakeven equilibrium exists:

\begin{align}
u=f_{d}\,\left(1-u\right)\,.
\end{align}

We concentrate on specific forms of $f_{d}$ that may give desirable behavior and simultaneously permit closed-form solutions for the breakeven boundary. Example: to overweight the penalty as downtime increases, we can use the following form

\begin{align}
f_{d}^{\text{geometric}}=\frac{f}{1-d}
\end{align}

which scales geometrically as $\frac{1}{1-d}=\sum_{i=0}^{\infty}d^{i}$ for $\left|d\right|<1$. If downtime $d$ is small, behavior is almost the same as the current linear implementation, but as $d\to1$ geometric scaling of the fault fee dominates. This specific form yields the breakeven boundary

\begin{align*}
u & =f\,\frac{1-u}{1-d}\\
 & =f\,\frac{1-u}{u}
\end{align*}

which has a closed-form solution for uptime:

\begin{align}
u=\frac{1}{2}\left(-f+f^{\frac{1}{2}}\left(4+f\right)^{\frac{1}{2}}\right)\,.
\end{align}

Now, an uptime of two-thirds ($u=\frac{2}{3}$) can be achieved with $f=1.3$, compared to $f=2$ before. For an individual miner this means smaller fees to begin with, but as downtime accumulates the fees increase. As long as downtime is less than 35%, miners incur smaller daily fees than with a fixed fee factor of 2, but above this they're higher. A behavioral economic interpretation is that the fee scaling mechanism acts to incentivize 'good' miners: those with occasional faults get substantially lower fees, while less reliable SPs face non-linear penalties, with more downtime causing progressively higher fees.

Alternatively, one might incentivize a different form of behavior. If our rationale is to [support miners](https://github.com/filecoin-project/FIPs/blob/master/FIPS/fip-0026.md) who are having medium-term reliability issues, we can choose to introduce quadratic downtime scaling, as $f_{d}^{\text{quadratic}}=fd$, which yields the breakeven boundary:

\begin{align*}
u & =f_{d}^{\text{quadratic}}\,d\\
 & =f\,d^{2}\\
 & =f\left(1-u\right)^{2}
\end{align*}

This has the closed-form solution

\begin{align}
u=\frac{1+2f-\left(1+4f\right)^{\frac{1}{2}}}{2f}\,.
\end{align}

The structure achieves the same asymptotes as the other models (0% breakeven uptime as $f\to0$ and 100% uptime as $f\to\infty$), but this scaling now softens the blow more as downtime accumulates. This mechanism could be valuable to SPs suffering long-term outages, but the cost of a single down day early on is much higher as the latter discounts must be paid for by earlier ones.
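A minimal numerical check of the three breakeven boundaries above (pure arithmetic; no assumptions beyond the closed forms just derived):

```python
import numpy as np

def uptime_linear(f):
    # u = f (1 - u)      ->  u = f / (1 + f)
    return f / (1.0 + f)

def uptime_geometric(f):
    # u^2 = f (1 - u)    ->  u = (-f + sqrt(f^2 + 4 f)) / 2
    return 0.5 * (-f + np.sqrt(f * (4.0 + f)))

def uptime_quadratic(f):
    # u = f (1 - u)^2    ->  u = (1 + 2 f - sqrt(1 + 4 f)) / (2 f)
    return (1.0 + 2.0 * f - np.sqrt(1.0 + 4.0 * f)) / (2.0 * f)

# u = 2/3 is reached with f = 2 (linear) and f ~ 1.3 (geometric), matching the
# values discussed in the text; the quadratic form needs f = 6.
for f in (2.0, 4.0 / 3.0, 6.0):
    print(f, uptime_linear(f), uptime_geometric(f), uptime_quadratic(f))
```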
## A simulation of breakeven

Miner-level simulation of FIL-on-FIL revenue. [TODO: Discussion to be extended]

![](https://i.imgur.com/UfVFkvV.png)

A single simulation of miner revenue across a year.

![](https://i.imgur.com/1B5y5mz.png)

Breakeven occurs around 65% with current protocol parameters and chain state.

## Pricing faults

One way to price fault fees is risk-neutral replication. Volatility can be fitted with a model and used to price a geometric Brownian motion in revenue via the risk-free rate of return. For example, assume revenue $\mathcal{R}_{t}$ is a geometric Brownian motion following the stochastic differential equation

\begin{align}
d\mathcal{R}_{t}=\mu\mathcal{R}_{t}dt+\sigma\mathcal{R}_{t}dW_{t}
\end{align}

with drift $\mu$, volatility $\sigma$ and Wiener process $W_{t}$. Under the change to the risk-neutral measure, where $r$ is the risk-free rate of return, $d\tilde{W}_{t}=dW_{t}+\frac{\mu-r}{\sigma}dt$, and therefore

\begin{align}
d\mathcal{R}_{t}=r\mathcal{R}_{t}dt+\sigma\mathcal{R}_{t}d\tilde{W}_{t}\,.
\end{align}

The gain (or loss) in revenue $g_{t}=\frac{\mathcal{R}_{t}}{\mathcal{R}_{t_0}}$ up to time $t$ is

\begin{align}
g_{t}(r,\sigma)=\text{exp}\left(\left(r-\frac{\sigma^{2}}{2}\right)t+\sigma\tilde{W}_{t}\right)\,.
\end{align}

Since revenue is rewards less expenses, $\mathcal{R}_{t}=R_{t}-E_{t}$, we have

\begin{align}
\frac{R_{t}-E_{t}}{R_{t_0}-E_{t_0}}=g_{t}(r,\sigma)\,.
\end{align}

If a fair fault fee factor is given by the ratio of expected future rewards to expected future expenses, weighted by the ratio of uptime $u$ to downtime $1-u$, and using $R_{t}=E_{t}+\left(R_{t_0}-E_{t_0}\right)g_{t}(r,\sigma)$ from above, then

\begin{align*}
\text{faultfeefactor}_{t} & =\frac{\mathbb{E}\left[R_{t}\right]u}{\mathbb{E}\left[E_{t}\right](1-u)}\\
 & =\frac{\mathbb{E}\left[E_{t}+\left(R_{t_0}-E_{t_0}\right)g_{t}\right]u}{\mathbb{E}\left[E_{t}\right](1-u)}\\
 & =\frac{u}{1-u}+\frac{\left(R_{t_0}-E_{t_0}\right)\mathbb{E}\left[g_{t}\right]u}{\mathbb{E}\left[E_{t}\right](1-u)}\,.
\end{align*}

Here current rewards and expenses $R_{t_0},\,E_{t_0}$ are both assumed to be known. The expected gain in revenue $\mathbb{E}\left[g_{t}\right]$ is modelled in terms of the risk-free rate of return, with coefficients for volatility fitted to data. The risk-free return $r$ must be equated with a plausible source, although there is a substantial question of what constitutes a plausible value of $r$. The expected expenses $\mathbb{E}\left[E_{t}\right]$ must be modelled, which may be possible based on the historical trend. A question remains as to the best model for $\mathbb{E}\left[E_{t}\right]$, but certainly one option is the stochastic volatility model with hourly innovations to the variance and daily updates to the mean.
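A minimal Monte Carlo sketch of this recipe. The numbers for $r$, $\sigma$, the horizon, uptime, and current/expected rewards and expenses are illustrative placeholders, not calibrated values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative placeholder inputs -- not calibrated to chain data.
r      = 0.03          # assumed risk-free rate (annualised)
sigma  = 0.8           # volatility, e.g. taken from a stochastic volatility fit
t      = 0.5           # horizon in years
u      = 0.9           # target uptime fraction
R0, E0 = 100.0, 40.0   # current cumulative rewards and expenses (illustrative)
E_exp  = 55.0          # modelled E[E_t] at the horizon (illustrative)

# Risk-neutral gain  g_t = exp((r - sigma^2 / 2) t + sigma * W_t),  W_t ~ N(0, t).
n_paths = 100_000
W_t = rng.normal(0.0, np.sqrt(t), size=n_paths)
g_t = np.exp((r - 0.5 * sigma**2) * t + sigma * W_t)

E_g = g_t.mean()    # Monte Carlo estimate of E[g_t], close to exp(r * t)
fee_factor = u / (1.0 - u) + (R0 - E0) * E_g * u / (E_exp * (1.0 - u))
print(E_g, np.exp(r * t), fee_factor)
```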
## Notebooks

[TO DO add]

* Consecutive faults
* Fault probabilistic models
* Fault fluctuations
* Terminations and faults
* Breakeven simulation
* Nonlinear fees

## Open questions and future directions

* Development and discussion of plausible pricing models
* rust-cad-cad level simulations
* Explore whether faults affect $\alpha-\beta$ filter state variables
* For the miner-level simulation, priors are sampled independently. MCMC simulation to robustly explore the joint posterior geometry is non-trivial to implement, and in my estimation requires weeks of work
* A queueing-theoretic analysis of the population of faulted sectors

## Supplementary sections

* A mental model is mapped out, from the perspective of SPs engaging in cheating vs truthful behavior, in terms of a causal decision tree and minimax optimization.
* Specification of statistical and probabilistic models used to explore the fault time series.

### Pathologies of providers

This section attempts to think about when to cheat, and what this means for network growth and stability. SP behavior exists on a continuous scale, but as an idealization consider two types:

* The aligned provider. Minimizes faults to maximize rewards and optimize returns; acts as envisaged by protocol designers. Self-interest and network goals aligned. Social norm respecter --- won't actively try to cheat. If sectors become faulted, faults are reported.
* The selfish provider. Rent-seeking, extracts maximal value through myopic optimization of individual revenue. May engage in behavior that damages the network for personal utility, and strategize to evade detection, e.g. choose to maximize undetected downtime.

Together the types give an SP-centric angle to ask what a successful protocol looks like. One view is that it minimizes the distance between the types, for example in expected revenue, or KL divergence over revenue distributions. In other words, a successful protocol is one in which the selfish provider by some metric has no economic advantage over the aligned provider. From another perspective the goal is to promote stability and growth of useful storage, which may not be the same thing. For example, in human biology, adaptive sociopathy is a stable equilibrium. So one question we may ask concerns the possibility and value of an idealized protocol --- one with no cheating, perhaps at the cost of burdensome and complex rules --- versus whether a stable and growing ecosystem is sufficient and primary.

To formalize this, consider the expected return $\mathcal{R}$ for an aligned provider. This depends on the probability of receiving a reward or fee, which are distributed according to the protocol and the SP actions. At a high level, the probability of a negative outcome (receiving fault fees) can be mitigated through economic effort, for example expenditure on quality hardware or contingency planning. Denote this cost $c$, and consider the distributions of rewards and fees as parameterized in $c$. Then the expected return for an aligned provider is

\begin{align}
\mathcal{R}_{\text{aligned}}(c)=rp_{\text{reward}}^{\text{aligned}}(c)-fp_{\text{fee}}^{\text{aligned}}(c)-c\,.
\end{align}

For the selfish provider, the expected revenue is more complex. It depends on the strategy with cost $s$ the SP uses to evade detection by the network $n$, as:

\begin{align}
\mathcal{R}_{\text{selfish}}(c,s,n)=rp_{\text{reward}}^{\text{selfish}}(c,s)-fp_{\text{fee}}^{\text{selfish}}(c,n)-c-s
\end{align}

It's worth looking at the reward and fee probabilities more closely. The probability of getting a reward, $p_{\text{reward}}^{\text{selfish}}(c,s)$, depends on successfully evading detection given a fault, or simply receiving a reward given no fault:

\begin{align}
p_{\text{reward}}^{\text{selfish}}(c,s)=p_{\text{evade}|\text{fault}}^{\text{selfish}}(c,s)p_{\text{fault}}^{\text{selfish}}(c,s)+p_{\text{nofault}}^{\text{selfish}}(c,s)
\end{align}

The probability of receiving a fee depends on the probability of detection given a fault, or self-reporting given a fault:

\begin{align}
p_{\text{fee}}^{\text{selfish}}(c,s,n)=p_{\text{detection}|\text{fault}}^{\text{selfish}}(s,n)p_{\text{fault}}^{\text{selfish}}(c,s,n)+p_{\text{selfreport|fault}}^{\text{selfish}}(c,s).
\end{align}

It's plausible that these distributions, as per the decision tree, can be estimated from data.

![](https://i.imgur.com/sIG1fUF.png)

Reward vs fee decision tree. Rewards are generated through actually having no faults, or having faults but evading detection. Fees are generated through self-reporting or network detection.
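To make the decision tree concrete, here is a toy numerical sketch. The functional forms for the fault, evasion, and detection probabilities are entirely hypothetical placeholders (nothing here is estimated from data); the point is only to illustrate the structure of the two expected-return expressions, and that the selfish optimum can never fall below the aligned one, since not cheating ($s=0$) is always available:

```python
import numpy as np

r_reward, f_fee = 1.0, 2.0      # illustrative per-period reward and fee

def p_fault(c):
    # Hypothetical: reliability spending c reduces the fault probability.
    return 0.3 * np.exp(-c)

def aligned_return(c):
    # R_aligned(c) = r * p_reward - f * p_fee - c, with faults always self-reported.
    pf = p_fault(c)
    return r_reward * (1.0 - pf) - f_fee * pf - c

def selfish_return(c, s, n=1.0):
    # Hypothetical: evasion spending s lowers detection; network effort n raises it.
    pf = p_fault(c)
    p_evade = 1.0 - np.exp(-s / (1.0 + n))
    p_rwd = p_evade * pf + (1.0 - pf)
    p_fee = (1.0 - p_evade) * pf
    return r_reward * p_rwd - f_fee * p_fee - c - s

c_grid = np.linspace(0.0, 2.0, 201)
best_aligned = max(aligned_return(c) for c in c_grid)
best_selfish = max(selfish_return(c, s) for c in c_grid for s in c_grid)
print(best_aligned, best_selfish)   # best_selfish >= best_aligned by construction
```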
To maximize the revenue of the selfish SP subject to constraints $g$, we can formulate the Lagrange optimization

\begin{align}
\mathcal{L}_{\text{selfish}}(c,n,s,\lambda)=\mathcal{R}_{\text{selfish}}(c,s,n)+\lambda g_{\text{selfish}}
\end{align}

\begin{align}
\nabla_{c,n,s,\lambda}\mathcal{L}_{\text{selfish}}(c,n,s,\lambda)=0
\end{align}

Here $g$ represents arbitrary constraints, e.g., that probabilities of different outcomes sum to one, or conditions on CAPEX distribution. For the aligned SP we can similarly write

\begin{align}
\mathcal{L}_{\text{aligned}}(c,\lambda)=\mathcal{R}_{\text{aligned}}(c)+\lambda g_{\text{aligned}}\,.
\end{align}

One cryptoeconomic goal is to formalize what is optimal in relation to cheating at the SP level. In terms of expected revenue, the optimization problem has the minimax form

\begin{align}
\text{min}\left(\text{max}\,\mathcal{L}_{\text{aligned}}(c,\lambda)-\text{max}\,\mathcal{L}_{\text{selfish}}(c,n,s,\lambda)\right)\,.
\end{align}

This functional appears to be bounded from above at 0. I don't think it's possible for the aligned provider to be better than the selfish one. For example, the aligned SP is strictly a subset of the selfish SP, who can always choose a strategy that's not to cheat.

Further directions:

1. Develop the minimax interpretation.
2. Map the optimization geometry, using parametric assumptions to build further intuition. What fraction of SP funds distributed between mining and cheating strategies is optimal?
3. Can we be more concrete? What does $p_{\text{detection}|\text{fault}}(c,n)$, the probability of detection given a fault parameterized in terms of miner economic costs $c$ and network state $n$, look like?
4. Can we actually implement cheating SPs in a simulation?
5. Can we generalize to the distribution level (e.g. KL divergence) rather than point estimates (differences in expectations)?

### Probabilistic and statistical model specifications

To reason about fluctuations in faults, we propose different probabilistic models and see how they actually fit the data to test our intuition. This lets us iteratively build up a picture of the structure of fluctuations in the faults data in a systematic way. For the aggregated fault time-series I consider several variants of models fitted to counts. The story we find is that faults follow a multi-type process with non-stationary fluctuations. Fault counts are non-iid with respect to Poisson or Negative Binomial distributions, and show evidence of jump diffusion and stochastic volatility features as well as temporal clustering. A generic practical issue is that as models become more complex they become more time-consuming and brittle to fit. Though simple, a lognormal stochastic volatility model fits satisfactorily, allowing instantaneous estimates of volatility decoupled from drift, which may be useful for future pricing models.

**Model 1:** Cox process model with Gaussian process intensity

The first model is a Cox process, which generalizes Poisson-distributed faults by replacing the rate with a random time-varying measure. In practice we use a non-parametric intensity with a Gaussian process (GP).

\begin{align*}
k(t,t') & =\text{SquareExponential}(t,t')\\
\text{log}\left(\lambda_{t}\right) & \sim\mathcal{GP}(\cdot,k)\\
\text{faults}_{t} & \sim\text{Poisson}(\lambda_{t})
\end{align*}

Universal function approximation provides a rationale for using the GP as a reasonable basis for time-varying, autocorrelated data.
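A prior-predictive sketch of this generative model in NumPy (the kernel lengthscale, amplitude, and mean log-rate are illustrative choices, not fitted values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hourly grid over two weeks; hyperparameters are illustrative.
t = np.arange(336.0)
lengthscale, amplitude, mean_log_rate = 24.0, 0.5, np.log(10_000.0)

# Squared-exponential covariance on the time grid (jitter for numerical stability).
d = t[:, None] - t[None, :]
K = amplitude**2 * np.exp(-0.5 * (d / lengthscale) ** 2) + 1e-8 * np.eye(t.size)

# log(lambda_t) ~ GP(mean, K);  faults_t ~ Poisson(lambda_t)
log_lam = rng.multivariate_normal(np.full(t.size, mean_log_rate), K)
faults = rng.poisson(np.exp(log_lam))
print(faults[:24])
```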
The model is fitted using Hamiltonian Monte Carlo, and while it is satisfactory for short times, it struggles to fit more than a few days of hourly aggregated data. The reasons are twofold: 1) $\mathcal{O}(n^{3})$ scaling in the number of data points for standard implementations; 2) a more fundamental issue is the non-stationarity of the fluctuations. Using a negative binomial likelihood to allow more flexibility in the faults variance doesn't cure the issue. One approach could be to generalize the intensity measure with a deep GP, which would bring [much more flexibility](http://inverseprobability.com/talks/notes/deep-gps.html), but opens a can of worms in terms of scaling, complexity and interpretability. So we try a simple approach next.

**Model 2:** Negative binomial stochastic volatility

This model generalizes a Poisson process for counts to a negative binomial to provide more flexibility in the fluctuations. The fluctuations in variance are made stochastic by parameterizing the negative binomial over-dispersion with a first-order autoregressive process with daily innovations. The mean of the negative binomial is also allowed to vary, but on a long timescale to avoid overparameterization.

\begin{align*}
\text{log}\left(\mu_{t}\right) & \sim\text{AR}(1)\\
\phi & \sim\text{LogNormal}(0,1)\\
\sigma_{t}^{2} & \sim\mu_{t}+\frac{\mu_{t}^{2}}{\phi}\\
\text{faults}_{t} & \sim\text{NegativeBinomial}(\mu_{t},\sigma_{t})
\end{align*}

This model works for short periods of up to a few hundred hours. But it still breaks down where there is a substantial jump in the mean of the process. This suggests there's some underlying structure we're failing to understand.

**Model 3:** Negative binomial Hawkes process

We can generalize to fault clustering. The underlying premise is that clusters arise because of underlying power-law-type events, which can be modeled as a latent self-excitation --- faults right now increase the probability of faults soon after, but the intensity decays with time. Due to the clustering, the model is slightly more involved.

\begin{align*}
\delta & \sim\text{HalfNormal}(0,1)\\
\alpha_{i} & \sim\text{HS}(\sigma_{i})\\
\text{log}\left(\mu_{t}^{\text{baseline}}\right) & \sim\text{AR}(1)\\
\mu_{t} & \sim\mu_{t}^{\text{baseline}}+\sum_{0\le\tau_{i}<t}\alpha_{i}^{2}e^{-\delta(t-\tau_{i})}\\
\phi & \sim\text{LogNormal}(0,1)\\
\sigma_{t}^{2} & \sim\mu_{t}+\frac{\mu_{t}^{2}}{\phi}\\
\text{faults}_{t} & \sim\text{NegativeBinomial}(\mu_{t},\sigma_{t})
\end{align*}

To avoid letting the model run away in terms of overparameterization, we implement regularization with a horseshoe prior on the Hawkes intensities $\alpha_{i}^{2}$, i.e. $\tau\sim\text{HalfCauchy}(1);\,\lambda_{i}\sim\text{HalfCauchy}(1);\,\sigma_{i}\sim\tau^{2}\lambda_{i}^{2}$. For efficiency, the time convolution with an exponentially decaying kernel is implemented as a dot product over a pre-computed stencil. In terms of fit, this model does fit better, which suggests there is some clustering structure in faults. But the model still completely fails to fit the jump-style dynamics observed at medium timescales.
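The exponential-decay convolution mentioned above can be written as a dot product over a pre-computed stencil; a minimal NumPy illustration of the mechanics (the grid, event times, and parameter values are illustrative, not the fitted model):

```python
import numpy as np

def hawkes_intensity(baseline, alpha, event_times, t_grid, delta):
    """mu_t = baseline_t + sum_{tau_i < t} alpha_i^2 * exp(-delta * (t - tau_i))."""
    # Pre-compute the decay stencil: one row per grid time, one column per event.
    lags = t_grid[:, None] - event_times[None, :]
    stencil = np.where(lags > 0.0, np.exp(-delta * lags), 0.0)
    return baseline + stencil @ (alpha**2)

t_grid = np.arange(200.0)                    # hours
baseline = np.full_like(t_grid, 5_000.0)     # illustrative baseline intensity
event_times = np.array([20.0, 21.0, 90.0])   # illustrative excitation times
alpha = np.array([60.0, 40.0, 80.0])
print(hawkes_intensity(baseline, alpha, event_times, t_grid, delta=0.2)[:25])
```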
**Model 4:** Negative binomial Hawkes process with baseline jumps

This model generalizes to clustering of faults and includes jumps in the baseline intensity.

\begin{align*}
j_{t} & \sim\text{HorseShoe}(\sigma_{t}^{j})\\
\alpha_{i} & \sim\text{HorseShoe}(\sigma_{i}^{\alpha})\\
\text{log}\left(\mu_{t}^{\text{baseline}}\right) & \sim\text{AR}(1)\\
\mu_{t} & \sim j_{t}\mu_{t}^{\text{baseline}}+\sum_{0\le\tau_{i}<t}\alpha_{i}e^{-\delta(t-\tau_{i})}\\
\phi & \sim\text{LogNormal}(0,1)\\
j_{t} & \sim\text{AR}(1)\\
\sigma_{t}^{2} & \sim\mu_{t}+\frac{\mu_{t}^{2}}{\phi j_{t}^{2}}\\
\text{faults}_{t} & \sim\text{NegativeBinomial}(\mu_{t},\sigma_{t})
\end{align*}

This is the only model so far that can produce a visually acceptable fit to the data, which indicates the fluctuations in faults arise from a complex multi-type process. However, this model is somewhat cumbersome, and from a statistical perspective the fits are not satisfactory, with chains mixing poorly.

**Model 5:** Lognormal stochastic volatility

To be clear, it's not advisable in general to use a continuous likelihood for count data. With this in mind, but also understanding the nature of the data distribution for faults in this particular instance, we proceed to fit a lognormal stochastic volatility model. This has the advantage of simplicity, and of interpretability for potential derivatives pricing. The model is

\begin{align*}
\text{log}(\sigma_{t}) & \sim\text{AR}(1,\,\text{hourly})\\
\text{log}(\mu_{t}) & \sim\text{AR}(1,\,\text{weekly})\\
\text{faults}_{t} & \sim\text{LogNormal}(\mu_{t},\sigma_{t})
\end{align*}

Volatility is allowed to change on an hourly basis through an AR(1) prior on the log standard deviation, and the mean is allowed to change with weekly innovations.
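A generative sketch of Model 5 as a forward simulation (the AR(1) coefficients, innovation scales, and long-run levels below are illustrative choices, not the fitted posterior):

```python
import numpy as np

rng = np.random.default_rng(2)

hours = 24 * 7 * 8                                # ~8 weeks of hourly data
mu_bar, sig_bar = np.log(15_000.0), np.log(0.3)   # long-run log-mean and log-volatility
rho_sig, s_sig = 0.98, 0.05                       # AR(1) on log-volatility, hourly innovations
rho_mu, s_mu = 0.9, 0.1                           # AR(1) on log-mean, weekly innovations

log_sigma = np.full(hours, sig_bar)
log_mu = np.full(hours, mu_bar)

for t in range(1, hours):
    log_sigma[t] = sig_bar + rho_sig * (log_sigma[t - 1] - sig_bar) + s_sig * rng.normal()
    if t % (24 * 7) == 0:                         # weekly innovation to the mean
        log_mu[t] = mu_bar + rho_mu * (log_mu[t - 1] - mu_bar) + s_mu * rng.normal()
    else:
        log_mu[t] = log_mu[t - 1]

# faults_t ~ LogNormal(mu_t, sigma_t)
faults = rng.lognormal(mean=log_mu, sigma=np.exp(log_sigma))
print(faults[:10].round())
```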
