# **BASED**: Trustless and Efficient Gas Price Oracle
## Base Fees
### Background
The well-known [EIP-1559](https://eips.ethereum.org/EIPS/eip-1559) dramatically changes the transaction fee mechanism for Ethereum. In short, transaction fees are now split into two parts:
1. Base fee: a base gas price for the entire block. In order for a block to be valid, every transaction in the block *must* pay at least the base fee for its gas, and the base fee portion of every transaction's fee is burned.
2. Priority fee: a "tip" on top of the base fee which is sent to the block proposer as an incentive for transaction inclusion. Unlike the base fee, the priority fee varies between transactions, but is almost always substantially smaller than the base fee.
In order to enforce the new rules for block validity, the Ethereum protocol natively tracks the base fee and has deterministic rules for updating it based on network activity. The base fee for each block is included in its header, and thus all historical base fees can be quickly accessed by any Ethereum node.
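For concreteness, the update rule can be sketched in a few lines of Python (the constants and formula follow EIP-1559; this is an illustrative reimplementation, not consensus code):

```python
BASE_FEE_MAX_CHANGE_DENOMINATOR = 8   # base fee moves by at most 1/8 = 12.5% per block
ELASTICITY_MULTIPLIER = 2             # gas target is half of the block gas limit

def next_base_fee(parent_base_fee: int, parent_gas_used: int, parent_gas_limit: int) -> int:
    """Compute the next block's base fee from the parent block, per EIP-1559."""
    gas_target = parent_gas_limit // ELASTICITY_MULTIPLIER
    if parent_gas_used == gas_target:
        return parent_base_fee
    if parent_gas_used > gas_target:
        # a fuller-than-target block pushes the base fee up (by at most 12.5%)
        delta = parent_base_fee * (parent_gas_used - gas_target) \
            // gas_target // BASE_FEE_MAX_CHANGE_DENOMINATOR
        return parent_base_fee + max(delta, 1)
    # an emptier-than-target block lets the base fee decay (by at most 12.5%)
    delta = parent_base_fee * (gas_target - parent_gas_used) \
        // gas_target // BASE_FEE_MAX_CHANGE_DENOMINATOR
    return parent_base_fee - delta
```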
### On-Chain Access
Unfortunately, smart contracts can only directly query the _current_ base fee. Previous base fees can still be extracted from historical block headers and validated against an on-chain cache of verified historical block hashes (for instance, using the ZK-SNARK powered [BlockHistory](https://etherscan.io/address/0xa00f3c13d60ad6a1a6e6d8ebb4529408d7e87343) contract from Relic Protocol).
However, accessing individual base fees is insufficient for use cases which want a higher-level view of how gas prices vary over time. For instance, an Ethereum gas derivative may want to use a time-weighted average price (TWAP) to avoid concerns around short-term gas price manipulation, or to more accurately represent a fair price for a given time range. Unfortunately, computing a TWAP for a nontrivial block range on-chain by verifying each block header individually is prohibitively expensive.
## The Solution: **BASED**
### Overview
**BASED** (**B**ase-fee **A**ggregated **S**tatistics **E**fficiently **D**iscoverable) is a ZK-SNARK powered oracle for Ethereum base fees created by Relic Protocol, enabling smart contracts to trustlessly and cheaply compute various base fee statistics. For any range of blocks, **BASED** enables computation of the first 4 statistical moments of the distribution of base fees, from which the *mean*, *variance*, *skewness*, and *kurtosis* can be derived. Price oracles typically report some sort of *mean* (usually a TWAP or VWAP), but lack these higher-order statistics, which reveal the volatility and skew of the price feed.
Combined, these statistics give a decent approximation of the distribution of base fees and enable computation of approximate upper and lower bounds on the base fees in any time window. To see how these bounds are computed, check out the [bound analysis](#Appendix-1-Bound-Analysis) below.
### Implementation Details
#### Relic Protocol Circuits
**BASED** is implemented by augmenting Relic Protocol's original ZK-circuits with additional aggregation logic. In short, Relic's original ZK-circuits take in a range of consecutive blocks, validate their connectedness, and "output" three values:
* The parent block hash of the first block in the range
* The block hash of the last block in the range
* The merkle root of all block hashes in the range
These three values allow Relic to aggregate proofs of adjacent block chunks into a larger proof of the combined block range.
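As an illustration of how such aggregation might work (a sketch only, not Relic's actual circuit or proof format; the `ChunkProof` structure and the hash-based combination of the two merkle roots are assumptions made for exposition):

```python
from dataclasses import dataclass
from hashlib import sha256

@dataclass
class ChunkProof:
    parent_hash: bytes   # parent hash of the first block in the chunk
    last_hash: bytes     # hash of the last block in the chunk
    merkle_root: bytes   # merkle root of all block hashes in the chunk

def aggregate(left: ChunkProof, right: ChunkProof) -> ChunkProof:
    """Combine the outputs of two adjacent chunk proofs into outputs for the full range."""
    # adjacency check: the right chunk must start where the left chunk ends
    assert right.parent_hash == left.last_hash, "chunks are not adjacent"
    return ChunkProof(
        parent_hash=left.parent_hash,
        last_hash=right.last_hash,
        # assumed: two equal-sized subtrees combine into a parent merkle node
        merkle_root=sha256(left.merkle_root + right.merkle_root).digest(),
    )
```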
#### Base Fee Statistics
For some block $B$ and positive integer $i$, let $\Sigma^i(B)$ be the sum of the $i^{\text{th}}$ powers of all base fees in the blocks up to and including $B$. The raw data provided by **BASED** consists of $\Sigma^1(B), \Sigma^2(B), \Sigma^3(B),$ and $\Sigma^4(B)$ for all blocks $B$.
Let $B_k$ denote the $k^\text{th}$ block, and let $X$ be a random variable sampled uniformly from the base fees of the $n$ blocks $B_{k+1}, ..., B_{k+n}$. Then
$$
E[X] = \frac{1}{n} \left( \Sigma^1(B_{k+n}) - \Sigma^1(B_k) \right)
$$
$$
E[X^2] = \frac{1}{n} \left( \Sigma^2(B_{k+n}) - \Sigma^2(B_k) \right)
$$
$$
E[X^3] = \frac{1}{n} \left( \Sigma^3(B_{k+n}) - \Sigma^3(B_k) \right)
$$
$$
E[X^4] = \frac{1}{n} \left( \Sigma^4(B_{k+n}) - \Sigma^4(B_k) \right)
$$
And thus these $\Sigma^i(B)$ values are sufficient to compute the first 4 moments for any range of base fees.
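To make the arithmetic concrete, here is a small Python sketch which builds the $\Sigma^i$ prefix sums from a list of base fees and derives the mean, variance, skewness, and kurtosis for a block range (purely illustrative; on-chain, the $\Sigma^i$ values come from proofs rather than from iterating over blocks):

```python
from typing import List, Tuple

def prefix_sums(base_fees: List[int]) -> List[Tuple[int, int, int, int]]:
    """sigma[b][i-1] = Sigma^i(B_b): sum of i-th powers of base fees up to and including block b."""
    sigma, acc = [], [0, 0, 0, 0]
    for fee in base_fees:
        for i in range(4):
            acc[i] += fee ** (i + 1)
        sigma.append(tuple(acc))
    return sigma

def range_stats(sigma, k: int, n: int):
    """Statistics of the base fees in blocks k+1 .. k+n."""
    lo, hi = sigma[k], sigma[k + n]
    m1, m2, m3, m4 = ((hi[i] - lo[i]) / n for i in range(4))   # raw moments E[X^i]
    var = m2 - m1 ** 2                                         # 2nd central moment
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3                       # 3rd central moment
    mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4    # 4th central moment
    return m1, var, mu3 / var ** 1.5, mu4 / var ** 2           # mean, variance, skewness, kurtosis
```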
#### New Circuits
In order to aggregate the base fee statistics, we add the following outputs to the Relic circuits:
* The base fee statistics for the parent block
* The base fee statistics for the last block
* The merkle root of all base fee statistics in the block range
Similar to the original circuits, these values allow for aggregation of proofs of adjacent chunks into a new proof for the combined range.
In order to extract the base fees and timestamps from the block headers, our circuits must implement RLP parsing, which is somewhat difficult to do efficiently in a circuit. For details of how we tackled this, see the appendix on [RLP parsing](#Appendix-2-RLP-Parsing).
#### Smart Contracts
Once all the historical blocks are aggregated, the proofs are submitted to an on-chain verifier, which then stores the base fee statistics merkle roots on-chain. As a result, the base fee stats for any block can now be proven using a short merkle proof.
Then, querying the aggregated statistics for any block range requires only two merkle proofs (one for the block just before the range and one for the last block in the range) and some quick arithmetic to compute the TWAP and variance.
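The merkle verification itself is standard; the sketch below assumes a hypothetical binary tree with `sha256` hashing and a leaf-index-plus-siblings proof format, which may differ from the actual on-chain layout:

```python
from hashlib import sha256
from typing import List

def verify_merkle_proof(root: bytes, leaf: bytes, index: int, siblings: List[bytes]) -> bool:
    """Check that `leaf` sits at position `index` in the tree committed to by `root`."""
    node = sha256(leaf).digest()
    for sibling in siblings:
        if index % 2 == 0:
            node = sha256(node + sibling).digest()   # current node is a left child
        else:
            node = sha256(sibling + node).digest()   # current node is a right child
        index //= 2
    return node == root
```

Once the $\Sigma^i$ snapshots for the two boundary blocks are verified against the stored roots, the range statistics follow from the subtraction formulas above.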
## Use Cases and Future Work
The most obvious use case for a base fee TWAP oracle is settling Ethereum gas price perpetuals. That said, we believe the rich data set provided by **BASED** enables additional use cases not previously considered. In particular, the use of time-weighted variance as a measure of volatility is an under-explored primitive, which may have applications to novel on-chain derivatives and beyond.
Notably, the approach used by **BASED** can be adapted to build similar oracles for other price feeds, enabling equally powerful trustless oracles for any on-chain pricing data.
Additionally, we would like to explore how valuable it would be to add higher moments to the aggregated statistics. For instance, adding the third central moment would offer tighter upper and lower bounds on the base fees as well as information about the skewness of the base fee distribution.
We encourage anybody interested in building on **BASED** (or exploring variants of this design) to reach out to us!
## Appendix 1: Bound Analysis
### Trivial Bound
To give some intuition for how bounds can be derived from only the mean and variance, we'll first explain the trivial bound implied by Chebyshev's inequality. Let $X$ be a random variable drawn uniformly from the base fees of some $n$ consecutive blocks. Then if $\mu = \text{E}[X]$ and $\sigma = \sqrt{\text{Var}[X]}$, Chebyshev's inequality states that for any $k > 0$,
$$
\text{Pr}\left(| X - \mu | \ge k\right) \le \frac{\sigma^2}{k^2}
$$
But since $X$ is drawn uniformly from $n$ discrete values, if the probability that $X$ takes on some values is smaller than $\frac{1}{n}$, then $X$ must never take on those values. Thus, picking any $k > \sqrt{n \cdot \sigma^2}$ and applying Chebyshev's inequality gives $\text{Pr}\left(| X - \mu | \ge k\right) < \frac{1}{n}$, implying upper and lower bounds on $X$: every base fee in the range must lie strictly between $\mu - k$ and $\mu + k$.
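As a quick sketch of the resulting computation (plain Python, with made-up example numbers):

```python
import math

def trivial_bounds(mean: float, variance: float, n: int):
    """Chebyshev-style bounds on every base fee among n uniformly sampled blocks."""
    # any k > sqrt(n * variance) gives Pr(|X - mean| >= k) < 1/n, hence |X - mean| < k;
    # we use the limiting value of k for simplicity
    k = math.sqrt(n * variance)
    return max(mean - k, 0.0), mean + k

# e.g. one day of blocks (~7200) with mean 20 gwei and standard deviation 5 gwei
low, high = trivial_bounds(20e9, (5e9) ** 2, 7200)
```

Since $k$ grows like $\sqrt{n}$, this trivial bound becomes very loose for long block ranges, which is exactly what the analysis below improves on.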
### Improving the Trivial Bound
The intuitive reason our bounds can be tighter than those implied by Chebyshev's inequality is simple: base fees change at a maximum rate $r$ (in Ethereum, the base fee changes by at most 12.5% per block, so the ratio between two adjacent base fees is at most $r = 1/0.875 \approx 1.143$). Thus, in order for the base fee to significantly deviate from the mean, it must deviate significantly for "many" blocks, which must contribute significantly to the higher-order moments.
Without loss of generality, we can analyze the "worst case", i.e. the case where the base fees deviate maximally from the mean while having minimal impact on the higher moments. Let $p$ be the peak base fee in the given $n$ blocks. Then the worst case consists of base fees equal to some constant value $\beta$ for all blocks, except for a brief spike where the fee steadily climbs from $\beta \rightarrow p$ at rate $r$ and then steadily falls from $p \rightarrow \beta$ at rate $r$.
Write the peak as $p = r^m \mu$. Since $\mu \ge \beta$ (all base fees are at least $\beta$) and adjacent base fees differ by a factor of at most $r$, the climb from $\beta$ up to $p$ must pass through every level $r \mu, r^2 \mu, \ldots, r^{m-1} \mu$, and so must the fall back down. Hence the values of $X$ which deviate from the mean are at least
$$
r \mu,\, r^2 \mu,\, \ldots,\, r^{m-1} \mu,\, r^m \mu,\, r^{m-1} \mu,\, \ldots,\, r \mu
$$
Each such value $r^i \mu$ deviates from the mean by at least $\mu (r^i - 1)$, so $\mu_4$ can be lower bounded as:
$$
\mu_4 = \text{E}\left[(X - \mu)^4\right] \ge
\frac{\mu^4}{n} \left((r^m - 1)^4 + 2 \sum_{i=1}^{m-1} (r^i - 1)^4 \right)
$$
This equation can be used to derive closed form upper bounds on $p = r^m \mu$ in terms of $\mu_4$, though the expressions become quite messy. However, since **BASED** computes these bounds in EVM bytecode, we can instead iteratively compute the right side of the above equation until it exceeds the left side, and then simply compute the associated value of $p$.
The same argument applies symmetrically for low base fees, giving an equivalent lower bound.
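In code, the iteration might look like the following plain-Python sketch (illustrative only; **BASED** performs the equivalent computation in EVM bytecode):

```python
def peak_upper_bound(mean: float, mu4: float, n: int, r: float = 1 / 0.875) -> float:
    """Largest peak p = r^m * mean consistent with the observed fourth central moment mu4."""
    m = 0
    while True:
        # lower bound on mu4 implied by a spike reaching r^(m+1) * mean
        spike = (r ** (m + 1) - 1) ** 4 + 2 * sum((r ** i - 1) ** 4 for i in range(1, m + 1))
        if mean ** 4 / n * spike > mu4:
            break        # a taller spike would contradict the observed mu4
        m += 1
    return r ** m * mean
```

The symmetric argument for low base fees leads to an analogous loop for the lower bound.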
Note that if stronger bounds are needed, a large block range can be split into smaller chunks and tighter bounds computed for each chunk, though this comes at the cost of extra gas.
## Appendix 2: RLP Parsing
[RLP](https://ethereum.org/en/developers/docs/data-structures-and-encoding/rlp/) is a very simple encoding scheme used throughout Ethereum's execution layer. Although RLP decoding is typically simple to implement, the fact that it uses variable-length encodings adds significant overhead to a naive RLP parser circuit. Specifically, the simple approach to parsing RLP in a circuit requires explicit handling for *every* length for *every* variable length field, leading to quadratic blowup in the number of gates required.
Thankfully, there's a common ZK-circuit trick which helps dramatically reduce the required gates, namely _random linear combinations_. The trick is to realize that parsing a piece of RLP data is essentially splitting the input data into three components:
1. The _prefix_, consisting of the RLP prefix byte (if it exists) and the encoded length (if it exists)
2. The _data_, containing the actual data contained in the RLP field
3. The _remainder_, containing all remaining bytes after the data
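For reference, this split is easy to express outside of a circuit; the following is a standard plain-Python decoder step (not circuit code):

```python
def split_rlp(buf: bytes):
    """Split an RLP-encoded buffer into (prefix, data, remainder) for its first item."""
    b = buf[0]
    if b < 0x80:                      # single byte below 0x80 encodes itself (no prefix)
        prefix_len, data_len = 0, 1
    elif b < 0xb8:                    # short string: 0-55 bytes of data
        prefix_len, data_len = 1, b - 0x80
    elif b < 0xc0:                    # long string: length-of-length, then the length itself
        length_of_length = b - 0xb7
        prefix_len = 1 + length_of_length
        data_len = int.from_bytes(buf[1:1 + length_of_length], "big")
    elif b < 0xf8:                    # short list: 0-55 bytes of payload
        prefix_len, data_len = 1, b - 0xc0
    else:                             # long list
        length_of_length = b - 0xf7
        prefix_len = 1 + length_of_length
        data_len = int.from_bytes(buf[1:1 + length_of_length], "big")
    end = prefix_len + data_len
    return buf[:prefix_len], buf[prefix_len:end], buf[end:]
```

The difficulty in a circuit is not computing these lengths, but proving that the three output slices really are this splicing of the input, which is where the random linear combination below comes in.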
The lengths of the prefix, data, and remainder can all be easily constrained by looking at a small constant number of bytes at the beginning. However, constraining that our three components are actually the correct splicing of the input is the challenge which typically leads to quadratic gate blowup. To avoid this, we will use a *probabilistic* constraint instead. We first take the unconstrained values in the slices and generate a cryptographic commitment to them (for instance, using an arithmetization-friendly hash function). This yields a "random" (in the [Fiat-Shamir](https://en.wikipedia.org/wiki/Fiat%E2%80%93Shamir_heuristic) sense) challenge value $\gamma$, which we can use to constrain the values via a random linear combination. Now for some vector of field elements $\mathtt{vec}$, we define
$$
\mathtt{RLC(vec)} = \sum_{i = 0}^{\mathtt{|vec| - 1}} \mathtt{vec[i]} \cdot \gamma^i
$$
Now we constrain our parsed slices by checking that:
$$
\mathtt{RLC}(\mathtt{prefix}) +
\mathtt{RLC}(\mathtt{data}) \cdot \gamma^{\mathtt{|prefix|}} +
\mathtt{RLC}(\mathtt{remainder}) \cdot \gamma^{\mathtt{|prefix| + |data|}}
= \mathtt{RLC(input)}
$$
It should be clear that if we correctly partitioned the input string, the above equality will hold. Conversely, if the equality holds, then notice that $\gamma$ is a root of the polynomial
$$
p(x) = \sum_{i=0}^{\mathtt{|input|} - 1} (\mathtt{input[i] - parsed[i]}) \cdot x^i
$$
where $\mathtt{parsed} = \mathtt{prefix} \,\|\, \mathtt{data} \,\|\, \mathtt{remainder}$ denotes the concatenation of our three slices. But since $\gamma$ was a "random" value, the [Schwartz-Zippel Lemma](https://en.wikipedia.org/wiki/Schwartz%E2%80%93Zippel_lemma) implies that, with overwhelming probability, $p(x)$ is identically zero, and hence our parsed slices are exactly equal to the input.
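The following toy sketch demonstrates the constraint over a small prime field (real circuits work over the proof system's native field and derive $\gamma$ from a commitment as described above):

```python
P = 2**61 - 1  # toy prime field for illustration

def rlc(vec: bytes, gamma: int) -> int:
    """Random linear combination: sum of vec[i] * gamma^i."""
    return sum(b * pow(gamma, i, P) for i, b in enumerate(vec)) % P

def splice_check(prefix: bytes, data: bytes, remainder: bytes, full: bytes, gamma: int) -> bool:
    """The probabilistic constraint: the three slices recombine to the full input."""
    lhs = (
        rlc(prefix, gamma)
        + rlc(data, gamma) * pow(gamma, len(prefix), P)
        + rlc(remainder, gamma) * pow(gamma, len(prefix) + len(data), P)
    ) % P
    return lhs == rlc(full, gamma)

# the RLP encoding of "dog" is 0x83 'd' 'o' 'g': prefix 0x83, data b"dog", empty remainder
assert splice_check(b"\x83", b"dog", b"", b"\x83dog", gamma=123456789)
```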
While these RLC constraints are still expensive, they reduce the number of required gates from $O(n^2)$ to $O(n)$, where $n$ is the number of bytes in the input. For all but very small inputs, this works out to a significant decrease in the gate requirements.