---
breaks: false
---
Codex dynamic data design space
-------------------------------
This document explores potential approaches to handling small/dynamic data in Codex (spoiler: there aren't many).
### Small data vs. dynamic data
To be economically viable, we need quite big slots. This is because the cost of creating a storage proof is mostly independent of the size of the slot. If the slots are too small, there are too many of them, requiring too many proofs to be created (and checked!). A viable range for slot sizes seems to be 100GB--1TB.
Hence, a natural approach is to have (large) fixed-size slots, which we gradually fill up with data. The proofs are much easier if they have a fixed structure, meaning the slot size/structure does not change; even when a slot is almost completely empty, we prove the whole slot (in which case the unused part is filled with some dummy data, for example zero bytes).
Hence, even if we only ever append, we actually modify data, from the dummy data to the actual data. With this approach, mutable data and small data _are the same thing_, so I usually refer to this as "small/mutable data" (or dynamic data).
### Requirements
There are a few requirements which seem to be non-negotiable in practice. These in turn have consequences, seemingly constraining the design space to be quite small.
- We need large slots (100GB--1TB, see above).
- As a direct consequence of this, we really don't want to read the whole slot when creating proofs, because that's just too slow: a quick Google search [\[1\]](https://www.techradar.com/pro/fastest-hard-drives-of-year) says that the fastest hard drives in 2024 have read speeds below 300MB/sec, so reading 100GB would take at least 5 minutes, which is just not acceptable: we want a proof time below 10 seconds with the above slot sizes (see the back-of-envelope sketch after this list).
- The same probably also applies to updates (adding/modifying data): we don't want to read the whole slot for those either.
- The first consequence rules out 2D erasure coding (because the bottom-right corner depends on every bit of the dataset).
- But we still need slot-level erasure coding to have useful detection probabilities, meaning a 1D erasure code.
- The second consequence (not reading the whole dataset when updating) means that the 1D erasure code must be "parallel" to the updated pieces. If we imagine the data as a matrix, and the erasure coding as "horizontal" (that is, we encode rows independently), this means that we also update rows. Remark: the same also applies to the network-level coding: "rows" must not be mixed together.
- This in turn means that we have to prove some pieces of _each row_ (or at least a significant percentage of the rows) in every proof. In practice, we want to prove _whole columns_.
- If we use Leopard for erasure coding (which makes sense as it is very fast), we have a limitation of at most `2^16 = 65536` columns, including parity columns. For a 128GB slot this in turn means at least 2MB of data per column. Note: there are even faster, but non-optimal codes, especially if we don't care about reconstruction. This could be an interesting direction, though neither the coding nor the $2^{16}$ limitation seems to be a bottleneck.
- Since the columns are quite big, we cannot use the same idea as in the current "big static data" solution, namely, directly proving hashes of columns, as that would be too expensive.
- Thus we need a faster way to prove whole columns. We can use a _polynomial commitment scheme_ for that: Use the data as coefficients of a polynomial, and provide an evaluation proof at a randomly selected point. This can be very efficient, and ideally we can batch several columns together (we will probably need to prove around 20 columns per slot, see below).
- However, to be able to update the column commitments (which we prove against) without reading the whole dataset, we also need a _homomorphic commitment_.
- These two together mean that at the moment, the only serious candidate I'm aware of is KZG for the columns (for example, to update FRI columns we would need to recalculate the column-wise RS encoding, which in turn means reading the whole (or at least a significant percentage of the) dataset, something we really want to avoid).
- Proving columns and erasure coding rows also means that we have to actually store the parity columns: to recalculate a parity column we would need to read the whole dataset. With `rate=1/2` this would mean a storage overhead of 2x. But I think this is totally fine: it makes sense that "small data" is more expensive per byte than "bulk data"; in fact, double the price still sounds very cheap. On the other hand, this also gives extra redundancy to the SP if the data is organized carefully (for example, parity columns on a different disk).
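A quick back-of-envelope sketch of the numbers above. The only inputs are the figures quoted in this section (the ~300MB/sec read speed, the 10-second proof budget, and Leopard's $2^{16}$ column limit applied to a 128GB slot); everything else is arithmetic.

```python
# Back-of-envelope for the numbers above; the 300 MB/sec read speed and the
# 2^16 Leopard limit are the figures quoted in this section, nothing else.
GB, MB = 2**30, 2**20
read_speed = 300 * MB                                    # bytes/sec

print(f"full read of a 100 GB slot: {100 * GB / read_speed / 60:.1f} min")
print(f"readable within a 10 s proof budget: {10 * read_speed / GB:.1f} GB")

# Leopard's 2^16 = 65536 column limit applied to a 128 GB slot:
print(f"minimum column size: {128 * GB // 65536 // MB} MB")
```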
### Summary of constraints
In summary, barring radically new ideas, we seem to be constrained to the following design:
- organize the slot data in a matrix
- erasure code and update individual rows
- KZG commit to the individual columns
- prove whole columns with (batched) KZG evaluation proofs
- prove the KZG commitments of the (randomly) selected columns against a Merkle root of the column commitments
Note: instead of RS coding the rows, we could in theory use any kind of code (we don't even have to be able to reconstruct, though the latter gives extra redundancy).
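The constrained design can be summarized with a small structural sketch. This is only an illustration of the data layout, not an implementation: a hash stands in for the per-column KZG commitment, and the dimensions are arbitrary toy values.

```python
# Structural sketch of the constrained design: an nrows x ncols matrix of
# 31-byte cells, one commitment per column (KZG in the real design; a hash
# is a stand-in here), and a Merkle root over the column commitments.
import hashlib

def H(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:                  # duplicate the last node if odd
            level.append(level[-1])
        level = [H(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

NROWS, NCOLS, CELL = 4, 8, 31               # toy dimensions, not real ones
slot = [[bytes(CELL) for _ in range(NCOLS)] for _ in range(NROWS)]

# One commitment per column; erasure coding extends each *row* with parity
# cells (not shown), so a row update touches every column commitment once.
column_commitments = [H(*(slot[r][c] for r in range(NROWS)))
                      for c in range(NCOLS)]
slot_root = merkle_root(column_commitments)
```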
### Possible alternatives
Hash based:
- have a very large number of columns (say 8 million), presumably with some non-RS code
- so that each column is small enough that we can prove it using a hash instead
- problem: if columns are "at the bottom" of the Merkle tree, updating requires 8 million Merkle updates...
- the numbers don't look very good
Non-homomorphic:
- just commit to each "cell" of the matrix individually, with any commitment scheme whose proofs are batchable
- batch the resulting $20\times \mathsf{nrows}$ "mini-proofs"
- but this looks like it will result in a rather large proof size
### Parameters and tradeoffs
Remark: With an EC rate of $\rho=1/2$ (that is, 2x storage overhead), for a detection probability of 6 nines, we need to prove 20 columns:
$$ \left(\frac{1}{2}\right)^{20} \approx 10^{-6} $$
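A two-line sanity check of this remark. The bound is the missed-detection probability: if the slot is already unrecoverable, at most a $\rho$ fraction of its columns is intact, so each uniformly random column query passes with probability at most $\rho$.

```python
import math

def columns_to_sample(rate: float, miss_prob: float) -> int:
    """Queries needed so that an unrecoverable slot (more than a 1-rate
    fraction of columns lost) passes all random column checks with
    probability at most miss_prob."""
    return math.ceil(math.log(miss_prob) / math.log(rate))

print(columns_to_sample(0.5, 1e-6))   # 20
print(0.5 ** 20)                      # ~9.5e-7, i.e. "six nines" detection
```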
Some possible parameter settings with above constraints:
| data size | slot size | overhead | #cols | col. size | 𝔽 el. / col. |
| ---------:| ---------:|:--------:| --------:|:-----------:| -----------:|
| 64 GB | 128 GB | 2x | 65536 | 2 MB | 68k |
| 128 GB | 256 GB | 2x | 65536 | 4 MB | 135k |
| 256 GB | 512 GB | 2x | 65536 | 8 MB | 270k |
| 64 GB | 128 GB | 2x | 16384 | 8 MB | 270k |
| 128 GB | 256 GB | 2x | 16384 | 16 MB | 541k |
| 256 GB | 512 GB | 2x | 16384 | 32 MB | 1082k |
| 64 GB | 128 GB | 2x | 4096 | 32 MB | 1082k |
| 128 GB | 256 GB | 2x | 4096 | 64 MB | 2165k |
| 256 GB | 512 GB | 2x | 4096 | 128 MB | 4330k |
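The table can be reproduced with the snippet below, assuming binary units and, as elsewhere in this note, 31 bytes of data packed into each field element (the rounding of the last column may differ slightly from the table).

```python
# Recomputes the parameter table: column size = slot size / #columns, and
# field elements per column assuming 31 bytes of data per field element.
GB, MB = 2**30, 2**20

for ncols in (65536, 16384, 4096):
    for data_gb in (64, 128, 256):
        slot_bytes = 2 * data_gb * GB             # rate 1/2 -> 2x overhead
        col_bytes = slot_bytes // ncols
        elems = col_bytes // 31
        print(f"{data_gb:>3} GB data | {ncols:>5} cols | "
              f"{col_bytes // MB:>3} MB/col | ~{elems / 1000:.0f}k elements")
```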
### Tradeoffs
The primary parameter seems to be the number of columns.
A smaller number of columns means fewer commitments to update (but a similar update cost, because the data we commit to is the data we change; still, a smaller number of columns is a _tiny bit_ faster, because MSM is sublinear and we do a small MSM in each column when homomorphically updating the KZG commitments).
However, a smaller number of columns means larger columns, which in turn means larger proof costs (we prove whole columns).
So it seems we should set the number of columns as small as possible while remaining compatible with our proof cost requirements.
For reference, Constantine's single-threaded BN254 MSM speeds (KZG commitment/proof) on my M2 MacBook Pro are approximately:
- size `2^16`: 0.48 sec (4.2 MB/sec)
- size `2^18`: 1.6 sec (5.0 MB/sec)
- size `2^20`: 5.7 sec (5.7 MB/sec)
So if we want around 1 sec proof time on modern multi-core hardware, we can maybe get away with 32 MB column size at most.
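A rough extrapolation of these measurements, under two explicit assumptions that would need to be validated: the ~20 sampled columns can be batched so that the dominant cost is a single column-sized MSM, and MSM throughput scales roughly linearly with the number of cores.

```python
# Rough proof-time extrapolation from the single-threaded MSM throughput
# measured above. Assumes (a) the sampled columns are batched into one
# column-sized MSM and (b) near-linear scaling across cores.
single_thread_mb_per_s = 5.7          # the 2^20 MSM figure above
cores = 8                             # assumed core count

for col_mb in (2, 8, 32, 128):
    t = col_mb / (single_thread_mb_per_s * cores)
    print(f"{col_mb:>3} MB column -> ~{t:.1f} s")
```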
### Fat rows
In theory, a single row can be as "thin" as a single field element. With, say, 8192 columns that would result in row sizes of ~250 kB.
However, it may make sense to have a minimum size allocated to a user. Such "fat rows" (where each cell contains multiple field elements) can still be updated in smaller segments.
A tradeoff here is how much data needs to be re-encoded using the slot-level erasure code.
### Outsourcing KZG
KZG commitment is slow, so we would prefer to outsource it from the client to the presumably more powerful SP nodes (also, when outsourcing, the workload gets distributed among the N storage nodes). However, we don't want to just blindly trust the SPs.
Outsourcing a KZG commitment is relatively easy: the untrusted server just needs to provide an evaluation proof at a random point (which can even be selected deterministically, for example from the SHA256 hash of the data, maybe randomized with the request). The client can (pre-)calculate the expected result quite easily, because evaluating the polynomial only requires field operations (additions and multiplications), which is fast (on the order of GB/sec).
Producing a KZG evaluation proof has approximately the same cost as the commitment itself, so this "only" doubles the SP's initial workload.
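A minimal sketch of the client-side check for an outsourced commitment, assuming (as in this section) that data is packed into 31-byte field elements and that the challenge point is derived from a SHA256 hash of the data; the function names are illustrative.

```python
# Client-side check for an outsourced KZG commitment (sketch). The SP returns
# com(f) together with an evaluation proof that f(zeta) = y; the client only
# needs to recompute y = f(zeta), which is pure field arithmetic.
import hashlib

# BN254 scalar field modulus (the field the data is encoded into).
R = 21888242871839275222246405745257275088548364400416034343698204186575808495617

def derive_zeta(data: bytes) -> int:
    # deterministic challenge, e.g. a SHA256 hash of the data
    # (optionally randomized with the request)
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % R

def coefficients(data: bytes) -> list[int]:
    # 31 bytes of data per field element
    return [int.from_bytes(data[i:i + 31], "big") for i in range(0, len(data), 31)]

def evaluate(coeffs: list[int], zeta: int) -> int:
    # Horner evaluation of f(zeta); additions and multiplications only
    acc = 0
    for a in reversed(coeffs):
        acc = (acc * zeta + a) % R
    return acc

column = bytes(2 * 2**20)                        # a 2 MB column of zero bytes
zeta = derive_zeta(column)
expected = evaluate(coefficients(column), zeta)  # compare with the SP's claim
```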
However, in our case we have to outsource _partial commitments_, because we want to homomorphically update existing commitments. Consider the edge case of updating a single coefficient:
$$
\begin{align*}
f'(x) &= f(x) + (a_k' - a_k)x^k \\
\mathsf{com}(f') &= \mathsf{com}(f) + (a_k' - a_k)*[\tau^k]
\end{align*}
$$
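The following toy illustrates this update rule. A multiplicative group modulo a prime stands in for the real curve group (so $[\tau^k]$ becomes $g^{\tau^k}$ and the addition above becomes multiplication); it is only meant to show that the commitment can be patched without recommitting to the whole polynomial.

```python
# Toy homomorphic update of a KZG-style commitment. The multiplicative group
# mod a prime replaces the pairing-friendly curve group; tau is visible here
# only because this is a toy (a real trusted setup hides it).
P = 2**127 - 1           # toy group modulus (a Mersenne prime), not BN254
G = 3                    # toy generator
TAU = 123456789          # "secret" evaluation point of the SRS

def srs_element(i: int) -> int:
    return pow(G, pow(TAU, i, P - 1), P)        # stands in for [tau^i]

def commit(coeffs: list[int]) -> int:
    com = 1
    for i, a in enumerate(coeffs):              # com(f) = g^{f(tau)}
        com = com * pow(srs_element(i), a, P) % P
    return com

coeffs = [5, 7, 11, 13]
com_f = commit(coeffs)

# Update coefficient k without touching the rest of the polynomial:
k, new_ak = 2, 42
delta = (new_ak - coeffs[k]) % (P - 1)
com_f_updated = com_f * pow(srs_element(k), delta, P) % P

coeffs[k] = new_ak
assert com_f_updated == commit(coeffs)          # matches a fresh commitment
```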
Producing an evaluation proof $f(\zeta) = y$ requires computing and committing to the quotient polynomial:
$$ Q(x) := \frac{f(x) - y}{x-\zeta} $$
or at least the difference:
$$ Q'(x) - Q(x) = \frac{(a_k'-a_k)(x^k-\zeta^k)}{x-\zeta} = (a'_k-a_k)\sum_{i=0}^{k-1} \zeta^{k-1-i}x^i $$ which has degree $k-1$, and thus corresponding cost.
So proving each column update separately is prohibitively expensive. Fortunately, we can still batch these proofs into a single one, amortizing the cost. The verifier (in this case the client) then needs to do a small MSM, as big as the number of updated columns, but for, say, 8192 columns that still takes only a minuscule amount of time.
Unfortunately the same is not true for an onchain verifier.
So what we will presumably need is the client signing off on the updates of all the SP nodes involved, and the on-chain mechanism checking this signature, because the update has to happen atomically (we have to update the on-chain commitments).
### Incentives
Updates are somewhat expensive for a storage provider, and they only work if all other players cooperate. So the question is: how do we ensure that providers actually execute the updates?
### Workflow
What happens when the user wants to upload / modify a piece of data?
The data is allocated to one or several rows of the slot matrix. Since updating several rows is the same process replicated, we describe updating a single row (though updating several rows at the same time could be more efficient).
0. If the user is updating only a fragment of "a full row", then they have to download the rest of that row first, because they need to re-erasure code it
1. The user applies network-level K-to-N erasure coding, and splits the result into N pieces, to be stored on N storage providers
2. Each piece is erasure coded with the slot-level erasure coding parameters, resulting in the N rows to be stored on the N storage nodes. Note: this requires that network level EC is "per row", that is, different users' data is not mixed together.
3. These are uploaded to the nodes
4. An SP splits the row into "cells". If, for example, we have 8192 columns, then we get 8192 field elements. Note: usually we will update several such "thin rows" at the same time
5. The SP calculates the column commitment updates: for row $i$ and column $j$ this will be $x_{i,j}*\mathsf{com}(\mathcal{L}_i)$, where $x_{i,j}\in\mathbb{F}$ is the field element encoding the 31 bytes of data at the cell. When we have multiple rows, this becomes a tiny MSM.
6. the SP evaluates the column polynomials belonging