# On Erasure Coding for Codex
## The idealised model
Let’s start from an idealised world. Assume we have an ideal code that can encode K symbols to N symbols such that any K of the N suffice to restore the original K, for arbitrarily large K and N. We call R = K/N the code rate.
We encode our data (K symbols) to N, and sample Q of them. A false positive is when the data is actually lost (at most K-1 pieces present, i.e. at least N-K+1 missing) but all Q sampled pieces happen to be there. Its probability is P_FP <= ((K-1)/N)^Q < (K/N)^Q = R^Q.
Let’s assume symbols are bits. That sounds cool: with R = 1/2 and a target P_FP < 1e-9, we just need Q = 30 samples, since (1/2)^30 ≈ 9.3e-10.
Either way, if the test passes, we know with high probability that the data is there. If instead the test fails, the only thing we know is that we can’t be sure whether the data is there. Note that there are large regions of losses where the data is still recoverable, yet the test fails almost certainly.
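As a quick sanity check on the numbers above, here is a minimal sketch (function name and structure are mine, not from the note) that computes the smallest Q meeting a target false positive probability:

```python
# Minimal sketch of the idealised sampling bound: the smallest integer Q
# with R^Q <= target is ceil(log(target) / log(R)).
import math

def min_samples(rate: float, target_fp: float) -> int:
    """Smallest Q such that rate**Q <= target_fp, for 0 < rate < 1."""
    return math.ceil(math.log(target_fp) / math.log(rate))

print(min_samples(0.5, 1e-9))  # -> 30, matching the example above
```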
## Centralised vs. distributed
Even in this ideal case, we should distinguish between the centralised and the distributed setting:
- On one extreme, all N symbols are stored on the same server. We need to check Q bits and send them to someone. This ensures that the server can restore the data. But it can still withhold it.
- We could even send a more compact proof (using PoR or SNARKs). Note that in this case we are not transmitting information about which symbol failed.
- On the other extreme, each bit is stored on a different server. In this case we check 30 bits, each in a different place, and we know with high probability that the data is there. There is more communication burden. However:
- We know which symbol failed
- With some “honesty” assumptions we can be sure data is not withheld. TBD quantify: if D is the ratio of dishonest nodes, P_FP <= (R - D)^Q ???
In between the two extremes, there is the case where groups of symbols are stored on each node, and there are multiple such nodes. How do we pick Q in this case? (A first cut is sketched after the list below.)
We should start thinking about:
- Failure model: do symbols on the same server fail together? If so, there is no reason to check more than one symbol from the same node.
- Communication cost: if we can aggregate proofs on-node, we don’t need Q separate communications, but fewer.
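Under the assumption that symbols on one node are lost together (a failure model I am assuming here, not something the note commits to), a node behaves as a single erasure unit, so the sampling bound applies at node granularity and the node count caps the achievable guarantee. A sketch:

```python
# Sketch: if whole nodes fail as units, sample distinct nodes; querying two
# symbols on the same node adds nothing. The number of nodes caps Q.
import math

def node_samples(rate: float, target_fp: float, num_nodes: int) -> int:
    """Distinct nodes to query; capped at num_nodes if the target is out of reach."""
    q = math.ceil(math.log(target_fp) / math.log(rate))
    return min(q, num_nodes)

for nodes in (100, 16):
    q = node_samples(0.5, 1e-9, nodes)
    print(nodes, "nodes -> sample", q, "-> P_FP <=", 0.5 ** q)
# 100 nodes -> sample 30 -> P_FP <= ~9.3e-10
# 16 nodes  -> sample 16 -> P_FP <= ~1.5e-5 (target unreachable)
```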
## Moving towards real code constructs
Now, let’s move away from some of the idealistic assumptions:
1. Unfortunately, K and N are limited in the case of MDS codes (e.g. Reed-Solomon over GF(2^8) is limited to N <= 255). Options are:
- A) Bump up K and N: we can do this with MDS codes, but it slows down encode/decode, increases memory requirements, and makes repairing single erasures more expensive (linear in K).
- B) Move away from MDS codes (e.g. a 2D RS product code, encoding from K^2 to N^2 symbols). Now error patterns start to matter:
- If we are lucky, we can restore from a fraction K^2 / N^2 = R^2 of the data.
- But if we are unlucky, we can’t restore even when approx 1 - (N-K)^2 / N^2 = 1 - (1-R)^2 of the data is present (see the sketch after this list).
- With typical numbers (N = 2K, so R = 1/2), this means we need between 25% and 75% of the data, depending on the error pattern.
- WHAT IF we keep R>=1/2 ??? Isn’t that keeping error patterns at bay?
- In this case it starts to matter how we map coded symbols to nodes.
- C) Increase symbol size
- D) Introduce a symbol “vector”
- C and D are almost the same, just different ways of looking at the problem. They both introduce an “unprotected” dimension.
- How the symbols from the original data are mapped into the vectors is also called interleaving.
- When proving a symbol, in general, we either have to send more data (the whole symbol) or process more data (if we use PoR or SNARKs). Otherwise, we have to accept going probabilistic at this level as well.
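To make the lucky/unlucky gap of option B concrete, here is a toy sketch (construction, names and parameters all mine) of iterative row/column erasure decoding on the N x N grid of a product code; the RS arithmetic itself is abstracted away and only erasure positions are tracked:

```python
import itertools

N, K = 64, 32  # R = 1/2; the 2D code encodes K*K symbols to N*N

def peel(erased):
    """True if iterative row/column decoding clears all erasures.

    A row or column with at most N - K erasures is repairable by the
    underlying RS code; peel such lines until nothing changes.
    """
    erased = set(erased)
    progress = True
    while erased and progress:
        progress = False
        for r in range(N):
            line = {(r, c) for c in range(N) if (r, c) in erased}
            if 0 < len(line) <= N - K:
                erased -= line
                progress = True
        for c in range(N):
            line = {(r, c) for r in range(N) if (r, c) in erased}
            if 0 < len(line) <= N - K:
                erased -= line
                progress = True
    return not erased

# Lucky: keep only an aligned K x K subgrid, i.e. 25% of the symbols.
lucky = {(r, c) for r, c in itertools.product(range(N), range(N))
         if r >= K or c >= K}
print("25% kept, aligned:", peel(lucky))          # True: decodable

# Unlucky: erase an (N-K+1) x (N-K+1) intersection of rows and columns.
unlucky = set(itertools.product(range(N - K + 1), range(N - K + 1)))
print("~73% kept, adversarial:", peel(unlucky))   # False: stuck
```

The adversarial pattern here erases only about (1-R)^2 of the grid, which approaches the 75%-present worst case quoted above as N grows.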
Now, for this last one (C, D), the “unit” has many names: vector, block, cell, segment… but overall it is the “unit of erasure” from the erasure code’s point of view. This means that the guarantees of the erasure code hold only if I treat the whole unit as follows: if any bit of the vector/block/cell/segment is changed, I consider the whole vector/block/cell/segment erased.
This is, however, just the guarantee coming from the erasure code. The data can still be repairable depending on what was actually lost from the underlying vectors, and with assumptions about random erasures we can still quantify what happens.
Importantly, it does NOT mean that I have to reconstruct the whole vector if there is some loss in it. If I can somehow identify which part of the vector is lost, I can restore that part individually. Even if the whole vector is lost, I do not need to restore it as a whole: I can do it symbol by symbol, going through the vector linearly, as in the sketch below.
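A toy illustration of this (XOR parity across vectors, i.e. K = 2, N = 3, standing in for a real MDS code; all names mine): the code runs across vectors, one instance per offset, so offset i of a lost vector needs only offset i of the surviving vectors.

```python
# Two data vectors and one parity vector; the code is applied per offset.
a = bytes(range(8))
b = bytes(reversed(range(8)))
p = bytes(x ^ y for x, y in zip(a, b))  # parity vector

# Vector b is lost. Repair only offset 3, without touching the rest:
assert a[3] ^ p[3] == b[3]

# Or stream through the lost vector linearly, symbol by symbol:
assert bytes(x ^ y for x, y in zip(a, p)) == b
```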
As mentioned before, we can also **go probabilistic** in this dimension; this is what I call **sub-sampling**. And if we pre-commit to an erasure coded version of the vector, the sub-sampling is reinforced: this is what I called **sub-sampling with erasure code**. The important part is to commit to the erasure coded version; it does not matter whether the parity symbols are actually stored or not. A sketch follows below.
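A sketch of that last idea (construction entirely mine: a single XOR parity sub-symbol stands in for a real code, and a flat hash list stands in for a Merkle commitment): we commit to the coded sub-symbols, so challenges can land on parity positions even though the parity is never stored, only recomputed on demand.

```python
import hashlib
from functools import reduce

def code_vector(vector: bytes, k: int) -> list:
    """Split into k sub-symbols and append one XOR parity sub-symbol."""
    size = len(vector) // k
    subs = [vector[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(lambda x, y: bytes(a ^ b for a, b in zip(x, y)), subs)
    return subs + [parity]

def commitment(coded):
    # Flat hash list; a Merkle root would be the compact version.
    return [hashlib.sha256(s).digest() for s in coded]

def verify(index, sub_symbol, com):
    return hashlib.sha256(sub_symbol).digest() == com[index]

vec = bytes(range(64))
coded = code_vector(vec, k=4)
com = commitment(coded)          # published once, before any challenge
assert verify(4, coded[4], com)  # the parity position can be challenged too
```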