# Blob sharing protocol

_Thanks @eduadiez, @aliasjesus, @kevaundray, @dankrad, @protolambda, @n1shantd, @rauljordaneth, @dmarzz for helpful discussions_

## Motivation

Blobs have a more rigid pricing model than transaction calldata: you can only buy data in chunks of 131072 bytes. To be economically efficient, rollups today can:

- Wait time $t$ to buffer 131072 bytes and fill an entire blob
- Post more frequently at intervals $t / k$, sharing the blob with other participants

Also, to perform atomic operations ZK rollups need to publish data together with a validity proof to achieve instant finality. If UX requirements demand frequent submissions, rollups may want to perform a state update before being able to fill an entire blob.

<img src="https://hackmd.io/_uploads/r1CEeWpE6.png" alt="drawing" width="450"/>

_usage pattern example_

## Overview

In its most basic form, the usage of the blob multiplexer service looks like the following sequence.

```sequence
Publishers->Multiplx: I am willing to pay X_1\nto include data_1 on chain
Publishers->Multiplx: I am willing to pay X_2\nto include data_2 on chain
Multiplx->Chain: Concat data_1 + data_2\n send as blob tx
Consumers->Chain: Filter and read\nPublisher's data
```

Let's introduce some requirements first. We'll discuss in detail later why they are necessary:

1. Minimum (ideally zero) overhead incurred by the user compared to self-publishing full blobs
2. Data submissions and their location should be authenticated by the user
3. Users should be able to filter and derive their published data from L1 cheaply

This protocol is _initially_ focused on sequencers as users. Today, all major rollups live require **permissioned sequencers**, i.e. only an authorized party is able to publish data. Thus, when relaying data publishing to a 3rd party there must exist some link back to the authorized sequencer entity.

Where possible a trustless protocol is preferred over a trusted solution. However, the requirements above make it quite hard to offer an economically attractive trustless protocol.

| | Trusted multiplexer | Trustless multiplexer |
| - | - | - |
| Payment settlement | off-chain, after the fact | on-chain, atomic |
| Cost per byte overhead | zero, or negligible | significant |
| Censorship resistance | Somewhat, possible with inclusion proofs | Somewhat, possible with inclusion proofs |
| Data origin authentication | Somewhat, with data signatures | Yes, required |

The design of a trustless protocol requires a key primitive: can you succinctly prove that $data_i$ belongs to the versioned hash of a commitment of concatenated data? Sections below describe attempts at solving this issue.

<img src="https://hackmd.io/_uploads/BkYGlba4p.png" alt="drawing" width="450"/>

_._

Without this primitive it's very difficult to build a trustless multiplexer layer, since you can not build trustless payments.

# Trusted multiplexer

## Payment settlement

It is trivial to offer multiple settlement options since the multiplexer is a trusted party. This could include:

- Pre-paid credits via on-chain or off-chain transactions
- Payment channels, updated after each successful inclusion
- FIAT subscription model, i.e. Infura or Pinata

## Data origin authentication

Rollups live today require permissioned sequencing. If data is published by a 3rd party (blob sharing protocol), there must exist a way to authenticate the data origin. Adding more complex authentication protocols increases the cost per byte.
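As a rough illustration of what authenticating the data origin could look like (the helper names and the `(hash, offset, length)` payload layout are assumptions for this sketch, not a spec), a publisher can sign its chunk together with the location it expects inside the shared blob, and a consumer can check the recovered address against the authorized sequencer:

```python
# Sketch: publisher signs (hash of chunk, offset, length); consumer re-derives the payload
# from the shared blob and verifies the signer is the rollup's authorized sequencer.
from hashlib import sha256

from eth_account import Account
from eth_account.messages import encode_defunct


def sign_chunk(private_key: str, chunk: bytes, offset: int, length: int) -> bytes:
    """Publisher side: commit to the chunk and its expected location with the sequencer key."""
    payload = sha256(chunk).digest() + offset.to_bytes(4, "big") + length.to_bytes(4, "big")
    return Account.sign_message(encode_defunct(primitive=payload), private_key).signature


def authenticate_chunk(expected_sequencer: str, blob: bytes, offset: int, length: int, signature: bytes) -> bytes:
    """Consumer side: extract the chunk from the shared blob and check the data origin."""
    chunk = blob[offset : offset + length]
    payload = sha256(chunk).digest() + offset.to_bytes(4, "big") + length.to_bytes(4, "big")
    recovered = Account.recover_message(encode_defunct(primitive=payload), signature=signature)
    if recovered.lower() != expected_sequencer.lower():
        raise ValueError("chunk not signed by the authorized sequencer")
    return chunk
```

In the pointer variant below the publisher authenticates data and location itself in a second transaction, while in the single-tx variant the multiplexer forwards the publisher's signature along with the data.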
### 2nd transaction to authenticate pointers

The multiplexer just posts data, and expects consuming users to perform a second action to include an authenticated pointer to the previously posted data:

```sequence
Publisher->Multiplx: Post data X, Y
Note right of Multiplx: Buffer data
Multiplx->Chain: Send blob tx
Chain->Publisher: Verify data in blob
Publisher->Chain: Authenticate data + location
Chain->Consumer: Iterate publisher\npointers
```

To derive the L2 chain, simply iterate the authenticated pointers and resolve their data.

### Single tx with signature over data

No need for a 2nd transaction or additional communication roundtrip. However, an observer can't be convinced of blob authenticity without downloading the full blob.

```sequence
Publisher->Multiplx: Post data X + sig
Note right of Multiplx: Buffer data
Multiplx->Chain: Send blob tx
Chain->Consumer: Iterate Multiplx\ntxs with Publisher\nsignatures
```

To derive the L2 chain, iterate Multiplx txs and filter the relevant blob transactions with a valid signature from the user (rollup) address. The multiplexer is not a trusted party, and can publish invalid data. In section *TODO* we explore how to authenticate the data, but it is currently required to download the full blob to conclusively assert its correctness. The only link to the data is the blob versioned hash.

## Censorship resistance

In both a trusted and a trustless multiplexer, the publisher can't explicitly force the multiplexer into sending a blob transaction. Therefore the multiplexer can censor a publisher by ignoring its data publishing requests.

A multiplexer is positively incentivized to include data publishing requests to collect fees. But this may not be sufficient to prevent profitable censorship attacks. In the trusted model, the social reputation of the company running the multiplexer adds an extra layer of protection.

A multiplexer can also be incentivized via penalties to not exclude data publishing requests. Consider the following protocol:

- Multiplexer stakes some capital
- Publisher makes a data publishing request available (i.e. includes it in a blockchain)
- After some timeout, the publisher triggers an optimistic claim of censorship for the request made available above
- The multiplexer has some sufficiently long interval to prove on-chain that the data of the publishing request is part of a versioned hash of a past canonical block
- If the multiplexer fails to dispute the censorship claim, its stake is slashed

The multiplexer is now disincentivized (up to its stake value) from censoring publishers. However, publishers should build their protocol such that both the multiplexer and they themselves can publish data.

# Trustless multiplexer

_WIP TODO_

Requires some mechanism to execute user payments conditional on blob inclusion.

**Trusted execution on Intel SGX**

An EthGlobal Istanbul '23 project is attempting this with the Flashbots SUAVE architecture: https://ethglobal.com/showcase/blob-merger-k7m1f

**On-chain proofs**

Requires some not-yet-figured-out primitives, and will likely be too expensive to do on L1.
- TODO: Figure out a protocol to prove that a versioned hash includes the data of a commitment signed by the user, before knowledge of the full blob

# Usage examples

## ZK rollup (Polygon ZK-evm)

ZK rollups have some extra requirements to optimize prover costs:

- The sequencer must be able to form a hash chain at the smart contract level, i.e. the aggregator must not be able to submit the batches out of order, ideally checked trustlessly
- The sequencer's on-chain smart contract must be able to test data integrity

Refer to [**How does Polygon ZK-evm publish / consume data**](#How-does-Polygon-ZK-evm-publish--consume-data) for details on their current architecture.

To optimize costs, on-chain ZK provers minimize the public inputs, typically using very few commitments (or a single one) to all data needed for the computation. Part of the verifier circuit includes checking that the commitment is correct. With EIP-4844, this single commitment must include the versioned hash as the sole link to transaction data.

**How to verify a data subset on an aggregated blob**

The cheapest strategy is to compute the commitment for the full data in the proof system's native field, and then do an equivalence proof (see [@dankrad/kzg_commitments_in_proofs](https://notes.ethereum.org/@dankrad/kzg_commitments_in_proofs)). Then extract your subset data of interest for the rest of the circuit execution.

While it requires ingesting some factor $k$ more data, according to @eduadiez this is not significant. Thus, the logic to handle DA on full blobs or partial blobs is the same: implement partial blob reading, and for the full-blob case set the offset to 0 and the length to the full blob.

_TODO rest of integration @aliasjesus, @eduadiez_

## Optimistic rollup (OP stack)

Currently the OP stack uses a very simple strategy to post data: send a blob transaction to some predetermined address. It authenticates the data origin with the transaction's signature. This architecture must change slightly to accommodate data published from a 3rd party (untrusted) account.

_TODO integration, @protolambda_

# Appendix: blob retrieval

### Versioned hash from EVM

https://eips.ethereum.org/EIPS/eip-4844#opcode-to-get-versioned-hashes

The versioned hash is available inside the EVM exclusively during the execution of its transaction. The `BLOBHASH` instruction reads an `index` argument and returns `tx.blob_versioned_hashes[index]`.

### Blob data

The Beacon API route [`getBlobSidecars`](https://github.com/ethereum/beacon-APIs/blob/ad873a0f8d2c2a7c587dd667274199b2eb00f3ba/apis/beacon/blob_sidecars/blob_sidecars.yaml) allows retrieving `BlobSidecar`s for a `block_id` and a set of blob `indices`.

```python
class BlobSidecar(Container):
    index: BlobIndex  # Index of blob in block
    blob: Blob
    kzg_commitment: KZGCommitment
    kzg_proof: KZGProof  # Allows for quick verification of kzg_commitment
    ...
```
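For illustration, a minimal retrieval sketch using this route (the local beacon node URL and the `requests` dependency are assumptions; field names follow the Beacon API `BlobSidecar` response). The route also accepts an `indices` query parameter to fetch only specific blobs, as noted above.

```python
# Fetch all blob sidecars of a block from a beacon node and pick out the blobs of interest.
import requests

BEACON_API = "http://localhost:5052"   # assumption: any beacon node exposing the standard Beacon API
block_id = "head"                      # or a slot number / block root
wanted_indices = {0, 1}

resp = requests.get(f"{BEACON_API}/eth/v1/beacon/blob_sidecars/{block_id}", timeout=10)
resp.raise_for_status()

for sidecar in resp.json()["data"]:
    if int(sidecar["index"]) not in wanted_indices:
        continue
    blob = bytes.fromhex(sidecar["blob"][2:])   # 0x-prefixed hex, 131072 bytes per blob
    print(sidecar["index"], sidecar["kzg_commitment"], len(blob))
```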
# Appendix: data authentication protocols

Append an authentication suffix to the blob transaction to get rid of the second consume tx and minimize intrinsic costs. Data readers need the capacity to discard invalid data.

Append in the send-blob-tx calldata a signature of the user to authenticate the data being submitted, and prove to the chain that the blob includes that data, i.e. that data[128:256] belongs to address Y.

**Rationale**

- **Invalid offset problem**: Since the sequencer initiates a later transaction after the multiplexer posts the data, it can just verify the integrity of the data off-chain, and publish the correct offset and data length

**Questions**:

- Could the sequencer contract just reference the original transaction with a proof to the historical block? More expensive, but would bypass the payment mechanism.
- Is the consume transaction really necessary?
- Proto is thinking strongly about full nodes also following the chain: how does your derive-L2-from-L1 function look, and is it efficient in terms of data relevant to you vs data downloaded?

### _construction 1_

Read process:

- Read send_blob_tx calldata
- Verify that the range proof matches the KZG commitment of the blob, without loading the full blob
  - Required to avoid replay of the header without including the data
- Verify that the sequencer signature is correct
  - Required to not read junk data from others
- Read the blob, only extracting the data that was proven in the range-proof
- `call_data = [ Signed KZG range-proof over chunk ] = sign_by_rollup_sequencer(range_proof(range info, KZG proof data))`
- `blob = [ chunk 0 ][ chunk 1 ] ...`
- `chunk = [ data ]`

# Appendix: how rollups publish / consume data

Overview of how some major rollups live today publish and consume data, pre EIP-4844 and post EIP-4844.

Relevant to the topic, Dankrad's notes on ideas to integrate EIP-4844 into rollups: https://notes.ethereum.org/@dankrad/kzg_commitments_in_proofs

## How does Polygon ZK-evm publish / consume data

- The sequencer creates batches
- At time $t_0$ the sequencer groups a list of batches and publishes it to the sequencer smart contract (calls [PolygonZkEVM.sequenceBatches()](https://github.com/0xPolygonHermez/zkevm-contracts/blob/aa4608049f65ffb3b9ebc3672b52a5445ea00bde/contracts/PolygonZkEVM.sol#L484))
- The prover watches the on-chain tx and starts producing proofs for those batches in parallel
- At time $t_1$ (~30 min after $t_0$) the prover publishes the final proof to the verifier smart contract

The current architecture can't handle invalid sequencer submissions gracefully. Thus, the sequencer role is permissioned. The smart contract guarantees that the data hash is correct, and that all data is eventually processed by the prover via a hash chain.

In [PolygonZkEVM.sequenceBatches](https://github.com/0xPolygonHermez/zkevm-contracts/blob/aa4608049f65ffb3b9ebc3672b52a5445ea00bde/contracts/PolygonZkEVM.sol#L484C14-L484C29) the hash chain is computed with an accumulator consisting of the previous accumulator hash and the current transactions (_ref [PolygonZkEVM.sol#L572-L581](https://github.com/0xPolygonHermez/zkevm-contracts/blob/aa4608049f65ffb3b9ebc3672b52a5445ea00bde/contracts/PolygonZkEVM.sol#L572-L581)_).
```solidity
// Calculate next accumulated input hash
currentAccInputHash = keccak256(
    abi.encodePacked(
        currentAccInputHash,
        currentTransactionsHash,
        currentBatch.globalExitRoot,
        currentBatch.timestamp,
        l2Coinbase
    )
);
```

The only data persisted to link with the future verifier submission is the accumulator hash of this batch (_ref [PolygonZkEVM.sol#L598-L602](https://github.com/0xPolygonHermez/zkevm-contracts/blob/aa4608049f65ffb3b9ebc3672b52a5445ea00bde/contracts/PolygonZkEVM.sol#L598-L602)_)

```solidity
sequencedBatches[currentBatchSequenced] = SequencedBatchData({
    accInputHash: currentAccInputHash,
    sequencedTimestamp: uint64(block.timestamp),
    previousLastBatchSequenced: lastBatchSequenced
});
```

In [PolygonZkEVM.verifyBatchesTrustedAggregator](https://github.com/0xPolygonHermez/zkevm-contracts/blob/aa4608049f65ffb3b9ebc3672b52a5445ea00bde/contracts/PolygonZkEVM.sol#L709C14-L709C44) a verifier posts a new root with its proof. The actual call to verify the proof is `rollupVerifier.verifyProof(proof, [inputSnark])` ([ref](https://github.com/0xPolygonHermez/zkevm-contracts/blob/aa4608049f65ffb3b9ebc3672b52a5445ea00bde/contracts/PolygonZkEVM.sol#L817)), where `inputSnark` is computed as follows (_ref [PolygonZkEVM.sol#L1646-L1675](https://github.com/0xPolygonHermez/zkevm-contracts/blob/aa4608049f65ffb3b9ebc3672b52a5445ea00bde/contracts/PolygonZkEVM.sol#L1646-L1675)_)

```solidity
bytes32 oldAccInputHash = sequencedBatches[initNumBatch].accInputHash;
bytes32 newAccInputHash = sequencedBatches[finalNewBatch].accInputHash;

bytes memory inputSnark = abi.encodePacked(
    msg.sender,       // to ensure who is the coinbase
    oldStateRoot,     // retrieved from smart contract storage
    oldAccInputHash,  // retrieved from smart contract storage
    initNumBatch,     // submitted by the prover
    newStateRoot,     // submitted by the prover
    newAccInputHash,  // retrieved from smart contract storage
    newLocalExitRoot, // submitted by the prover
    finalNewBatch     // submitted by the prover
);
```

## How does Arbitrum publish / consume data

SequencerInbox.addSequencerL2BatchFromOrigin
https://github.com/OffchainLabs/nitro-contracts/blob/695750067b2b7658556bdf61ec8cf16132d83dd0/src/bridge/SequencerInbox.sol#L195

**Post EIP-4844**

WIP diff https://github.com/OffchainLabs/nitro-contracts/compare/main...4844-blobasefee

Versioned hashes are read from an abstract interface, not yet defined. Ref [src/bridge/SequencerInbox.sol#L316](https://github.com/OffchainLabs/nitro-contracts/blob/21757edbd4b6937b9b7ac223a532cf5cf7d268a7/src/bridge/SequencerInbox.sol#L316)

- TODO: Why is `data` used here? It's not the RLP-serialized transaction but the contract call data

```solidity
bytes32[] memory dataHashes = dataHashReader.getDataHashes();
return (
    keccak256(bytes.concat(header, data, abi.encodePacked(dataHashes))),
    timeBounds
);
```

## How does OP stack publish / consume data

PR for EIP-4844 support (no smart contract changes) https://github.com/ethereum-optimism/optimism/pull/7349

_Optimism construction pre-EIP-4844:_

- inbox address, EOA
- batcher submits tx to inbox address
- verifier traverses tx list for dest: inbox address
- checks that the signature is from the batcher

# Appendix: EIP-4844 Economics and Rollup Strategies

Paper by Davide Crapis, Edward W. Felten, and Akaki Mamageishvili https://arxiv.org/pdf/2310.01155.pdf

_TODO_, but TL;DR:

(1) When should a rollup use the data market versus the main market for sending data to L1?
(2) Is there a substantial efficiency gain in aggregating data from multiple rollups, and what happens to the data market fees?
(3) When would rollups decide to aggregate, and what is the optimal cost-sharing scheme?

# Appendix: KZG commitment proofs for data subset

A data publisher has a data chunk `data_i` encoded as $(w^j, y_j)$ where $j ∈ 0,...,k-1$. It computes a commitment $C_{i}$ which it then signs. Multiple publishers send `data_i` and the signed $C_{i}$ to the aggregator. Each chunk may have a different $k$.

The aggregator concats data chunks of different sizes and computes an overall commitment $C$, which will be posted on chain as part of the blob transaction.

```
[ data_0 ][ data_1 ][ data_2 ]
[            data            ]
```

The aggregator must allow a verifier to convince itself that the signed $C_{i}$ commitment belongs to $C$. Let's define $C_{io}$ as the commitment to a data chunk offset by some $t$. The proof has two steps:

- Prove that $C_{i}$ equals $C_{io}$ with its argument multiplicatively shifted by $w^t$
- Prove that $C_{io}$ belongs to $C$

_Terminology_

- $f(x)$ is the interpolated polynomial over the concatenated data.
- $f_i(x)$ is such that $f_i(w^j) = y_j$ where $j ∈ 0,...,k-1$
- $f_{io}(x)$ is such that $f_{io}(w^{t+j}) = y_j$ where $j ∈ 0,...,k-1$
- $C = [f(τ)]$, and $C_i = [f_i(τ)]$, and $C_{io} = [f_{io}(τ)]$

### Proof that $C_i$ equals $C_{io}$ multiplicatively shifted by $w^t$

The interpolation polynomials are evaluated at the roots of unity, in our example $w^j$ where $j ∈ 0,...,k-1$. A root of unity can be shifted $t$ positions by multiplying it by $w^t$. $C_{i}$ and $C_{io}$ are commitments to polynomials over the same set of values, only with the evaluation domain multiplicatively shifted by $w^t$.

We need to prove that $f_{i}(x) = f_{io}(w^{t}x)$. With the Schwartz–Zippel lemma we can just prove that identity at a deterministic random point $r$, i.e. $f_{i}(r) = f_{io}(w^tr)$, where $r$ is computed from the commitments $C_{i}$ and $C_{io}$.

The verifier is given $t$, $C_{i}$, $C_{io}$, and evaluation proofs for $f_{i}(r)$ and $f_{io}(w^tr)$. Verification routine:

1. Compute $r$ from $C_{i}$ and $C_{io}$
2. Verify the evaluation proofs for $f_{i}(r)$ and $f_{io}(w^tr)$
3. Check $f_{i}(r) == f_{io}(w^tr)$

### Proof that $C_{io}$ belongs to $C$

Given a subset of points $(x_j, y_j)$ where $j ∈ 0,...,k-1$, prove that $f(x_j) = y_j$ for all $j$.

$f_{io}(x)$ is the interpolation polynomial over the point subset such that $f_{io}(x_j) = y_j$ for all $j$. $z(x)$ is the zero polynomial with $z(x_j) = 0$ for all $j$. $τ$ is from the trusted setup. We construct a quotient polynomial $q(x)$:

$$
q(x) = {{ f(x) - f_{io}(x) } \over { z(x) }}
$$

For this polynomial to exist (we can't divide by zero), the numerator must vanish wherever $z(x)$ does, i.e. $f(x_j) - f_{io}(x_j) = 0$ for all $j$. The proof is

$$
π = [q(τ)]
$$

The verifier is given $π$, $C$, $C_{io}$. Verification routine:

1. Compute the zero polynomial $z(x)$ and compute $[z(τ)]$
2. Do the pairing check $e(π, [z(τ)]) == e([f(τ)] - [f_{io}(τ)], H)$, where $H$ is the generator of the second pairing group

_Refer to [Dankrad's notes](https://dankradfeist.de/ethereum/2020/06/16/kate-polynomial-commitments.html) "Multiproofs" section or [arnaucube's batch proof notes](https://arnaucube.com/blog/kzg-batch-proof.html#batch-proof) for more details._
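The shift identity of the first proof can be exercised numerically without any pairing machinery. The sketch below uses the BLS12-381 scalar field constants from EIP-4844; the toy domain size, chunk values, offset $t$, and the hash standing in for the Fiat–Shamir derivation of $r$ are illustrative assumptions. It interpolates the same chunk over its own positions and over its offset positions, then asserts $f_i(r) = f_{io}(w^t r)$.

```python
# Numeric check of f_i(x) = f_io(w^t * x) at a pseudo-random point r.
# Only the polynomial identity is checked; commitments and evaluation proofs are out of scope.
from hashlib import sha256

MODULUS = 0x73EDA753299D7D483339D80809A1D80553BD402FFFE5BFEFFFFFFFF00000001  # BLS12-381 scalar field
PRIMITIVE_ROOT = 7


def root_of_unity(order: int) -> int:
    assert (MODULUS - 1) % order == 0
    return pow(PRIMITIVE_ROOT, (MODULUS - 1) // order, MODULUS)


def lagrange_eval(xs, ys, x):
    """Evaluate the unique degree < len(xs) polynomial through (xs, ys) at x, mod MODULUS."""
    total = 0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for m, xm in enumerate(xs):
            if m != j:
                num = num * (x - xm) % MODULUS
                den = den * (xj - xm) % MODULUS
        total = (total + yj * num * pow(den, -1, MODULUS)) % MODULUS
    return total


# Toy blob domain of size n = 8; the publisher's chunk has k = 3 values placed at offset t = 2.
n, k, t = 8, 3, 2
w = root_of_unity(n)
chunk = [11, 22, 33]

xs_i = [pow(w, j, MODULUS) for j in range(k)]        # publisher's own domain: w^0 .. w^{k-1}
xs_io = [pow(w, t + j, MODULUS) for j in range(k)]   # positions the chunk occupies inside the blob

# Deterministic "random" point; the real protocol would derive r from C_i and C_io.
r = int.from_bytes(sha256(b"C_i || C_io").digest(), "big") % MODULUS

lhs = lagrange_eval(xs_i, chunk, r)                                  # f_i(r)
rhs = lagrange_eval(xs_io, chunk, pow(w, t, MODULUS) * r % MODULUS)  # f_io(w^t * r)
assert lhs == rhs
print("f_i(r) == f_io(w^t * r) holds for honestly placed data")
```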