[More detailed] musings about "altruistic Codex"
------------------------------------------------
(copied here from my workspace - this is different from, and more detailed than, the other "Musings ..." document here)
I try to collect some ideas about how to achieve the current new priority goal of Codex being essentially an altruistic file-sharing network.
### Starting assumptions
I have some starting assumptions, which will influence potential designs:
- we need to handle a large number of relatively small files
- we need to provide at least some kind of _weak durability_ properties, if we want the system to make any sense
- for that, we need some kind of incentivisation, even if not monetary
- we will probably still need some form of redundancy, data dispersal, lightweight remote verification, and repair to achieve the above
- we need some kind of lightweight consensus mechanism (as the traffic is way too big for any proper blockchain, even L2-s)
In more detail:
I think we should plan for at least 10 million - 1 billion files present in the network. We can probably expect a power-law distribution of file sizes, so I expect that small files will dominate the numbers. 1 petabyte worth of 1-megabyte files means 1 billion ($10^9$) files. Note that this is still _tiny_ compared to say Dropbox (which is exabyte+ sized). However, already this "small" number may cause problems in the DHT, which presumably needs a metadata entry for each individual file.
I strongly believe that if we provide no guarantees at all, then the whole exercise is completely pointless. Also, there are already many such systems, like for example IPFS, bittorrent, `/dev/null`, etc. I mean, bittorrent seems to have better emergent durability properties, at least for popular data, than a "no-guarantees Codex" would have...
Of course, we cannot provide the same level of guarantees as a paid storage system, because, _well obviously_, nobody is paying for it... However, we still need some kind of incentives. It can be reputation, non-monetary tokens, usable maybe to get better quality or larger storage, etc; or some combination of those. Though to be honest I don't see much _technical_ difference between a monetary and a non-monetary token or reputation point.
I don't have any better ideas to achieve some form of durability than we had before: Some kind of redundant data dispersal (it can be replication or erasure coding, or both, or one or the other based on the file size, it doesn't matter that much); periodic checks for the continued existence of data; and replacing failed or misbehaving nodes with new ones.
To do that we will need some consensus mechanism(s), as this is supposed to be a _decentralized_ system. It seems to me that we need consensus for at least two (and a half) different purposes:
- to detect missing data
- to punish / reward participants based on their behaviour (I expect this to be the same underlying system as the first one)
- and to store reputation / tokens / whatever
- (optional: to generate unbiased random entropy)
The first two can be a more lightweight mechanism than the third one, as 1) we don't need to store past state (it's ephemeral); and 2) a temporary failure of it is not a catastrophic event.
I imagine that some kind of aggregated and/or threshold signatures will be a building block of whatever lightweight consensus we choose, but it seems to me that that alone is not enough. Unfortunately I don't know enough about consensus mechanisms.
The reputation / tokens can probably be stored even on an L2, if we can somehow collect and batch "points" offchain (for example, just collect signatures from validators, and go onchain more infrequently with a batched signature).
### Immediate use cases
- Waku message archives
- Waku large message attachments
- Status communities metadata
- File sharing
Some input about expected sizes from Waku (from discord):
> for Status we expect at least 1GB in total of messages per day for average use case. I imagine if Waku is successful some apps may generate 10GB per day.
and
> using Status as reference we should expect archives of 10GB to 100GB (larger obviously possible too, but see this as our opening gambit)
and
> more realistically we aim for a network with a usage/throughput of around 80Mbps in total within a year. Perhaps double that in two years. If 1/10th of all this gets archived with Codex (only some apps will), we expect monthly archives of 500GB or larger.
So let's assume there is a daily message archive somewhere between 1GB and 10GB, as a single file (?). Or maybe several of those for different apps.
Furthermore, there are lots of message attachments like photos, around 1MB or so. And some file sharing in the 1MB-100MB range, maybe.
Now a 10GB file is closer to the low end of the old Codex; while 1MB files are obviously totally different. We may actually need different protocols for these cases (with maybe some kind of "continuity": small files slowly migrating to large files with eventual bundling; see below for details).
### Replication vs. Erasure coding
I think we should support both. For small files which need low latency (sending pictures in realtime chat), maybe replication makes more sense.
For larger files whose availability is more important, erasure coding probably makes more sense. (In theory you could even combine/mix the two, though I'm not sure how much sense that makes...)
In both cases however there is a "central" CID for the file, and the corresponding entry contains links to the pieces, so it's quite similar, shouldn't be hard to support both.
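Just to make the "central CID plus links" idea a bit more concrete, a manifest could look something like the sketch below (the field names are entirely made up by me, not an agreed format):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ShardPointer:
    shard_root: bytes           # Merkle root of this shard / copy
    provider_ids: List[bytes]   # nodes currently claiming to hold it

@dataclass
class Manifest:
    cid: bytes                  # "central" content identifier of the whole file
    file_size: int              # original size in bytes
    coding: str                 # "replication" or "erasure" (or a mix)
    k: int                      # data shards (k == 1 for pure replication)
    m: int                      # total shards / copies
    shards: List[ShardPointer]
```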
Brainfart sidenote: With replication, there could be a minor issue that the storage proofs (see below) for the different copies could in theory be the same, which is a problem with Sybil attacks. This looks easy to counter: Just include the provider's public ID in the randomness input of the proofs. However, a Sybil attacker then still only needs to store a single copy to answer for all of its identities, so you lose the redundancy. To counter this problem, I guess we could slightly alter the copies so that their commitments (to prove against) become different. Unfortunately, if the alteration is easy to compute, this is still not that helpful... A last resort could be organizing the "tokenomics" in a way that just cheating replication itself with a Sybil attack isn't really worth the effort.
### Lightweight remote verification
I think we can reuse the main idea from "old Codex", but simplify it to the extreme, namely:
- commit to (the shards of) a file with a Merkle tree;
- "slot-level" (maybe should be called "shard-level") erasure coding is optional (see below for more details);
- a storage proof would consist of a single Merkle path (with the index selected randomly) - so it's like "1 sample" instead of "100 samples", using the old lingo.
This is very cheap and can detect a file missing. It cannot reliably detect file integrity errors (but of course that will be detected when somebody actually downloads the file, as they can compare against the hash root), or malicious actors.
Such a proof can be between 600-2000 bytes (depending on file size and various design choices), and is very cheap to compute and verify. But we need a consensus of "validator nodes" verifying these proofs, because anybody checking them can lie for any reason.
Also, there is a large number of these proofs! Given a billion files and one check per day on average, we have say 100 validator nodes each verifying 1 billion proofs per day; that's a huge amount of traffic and not a negligible amount of computation either.
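To make the proof shape concrete, here is a minimal sketch, with SHA-256 standing in for whatever hash we actually pick (a ZK-friendly one like Monolith if we want aggregation later), and the challenged index assumed to come from shared entropy:

```python
import hashlib

def h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def build_tree(leaves):
    """Build a full Merkle tree; returns the list of levels, leaves first."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def open_path(levels, index):
    """The Merkle path for the leaf at `index` - this is the storage proof."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])   # sibling at this level
        index //= 2
    return path

def verify_path(root, leaf, index, path):
    acc = leaf
    for sibling in path:
        acc = h(acc + sibling) if index % 2 == 0 else h(sibling + acc)
        index //= 2
    return acc == root

# A validator picks a random leaf index (from shared entropy) and the provider
# answers with (leaf, path); a single path is the "1 sample".
leaves = [h(bytes([i])) for i in range(8)]      # toy 8-leaf file
levels = build_tree(leaves)
root   = levels[-1][0]
proof  = open_path(levels, 5)
assert verify_path(root, leaves[5], 5, proof)
```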
### Dealing with large amounts of small files
I have the following layered proposal:
- opportunistically merge small files (and their Merkle trees) until they reach a critical size (say around 1GB)
- at that point, apply "slot-level" erasure coding in the background, so that the same kind of proofs become more powerful
- aggregate the individual Merkle proofs with the standard recursive proof technique
Note: Compared to the "old Codex" design, this has two levels of bundling: The first level bundles _files_, and the second level bundles _proofs_ (of the file bundles). In the old Codex design, we only had the second level proposed (PoC implementation exists but was not yet integrated).
#### Merging files
For simplicity, I assume that each file has power-of-two size, and that we have some minimum and maximum file size (say between 256kB and 10GB).
It is extremely easy to merge two files of the same size: Just concatenate and do a single hash of the two Merkle roots to compute the concatenated Merkle root.
(Unfortunately, each single Merkle path proof will then completely miss one of the two merged files, but this is one of the reasons why I suggest also doing the slot-level erasure coding at some point.)
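A tiny sketch of the merge step under the above assumptions (equal power-of-two sizes; `h` is the same stand-in hash as before): the new root costs one hash, and any existing proof is lifted into the merged tree by appending one extra sibling.

```python
import hashlib

def h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def merge_roots(root_a: bytes, root_b: bytes) -> bytes:
    # Merged file = A || B; its Merkle root is a single hash of the two roots.
    return h(root_a + root_b)

def lift_proof(path, index, leaves_per_file, in_file_b, other_root):
    # An existing proof inside A (or B) stays valid in the merged tree after
    # appending the *other* file's root as one extra sibling on top.
    new_index = index + leaves_per_file if in_file_b else index
    return path + [other_root], new_index
```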
Now I assume that each file has a DHT entry, which lists its shards and where they can be found; these entries need to be updated at each such merge. Such update requests should come with proofs that they are done correctly - fortunately, this is extremely easy to check (a single hash in the simplest situation). Nodes participating in the DHT can simply ignore update requests where the proof is incorrect.
I also expect this merging not to be done for every single new file (shard) arriving, but in larger batches - this both reduces the traffic and gives more opportunity to organize files of different sizes into a big Merkle tree (without having too much dummy padding between them).
Remark: Normally each file has a different "expiration date". When merging files, one should take that into account. However, unlike "old Codex", we are trying to provide soft guarantees, not hard ones; also keeping some 1MB files for a few extra weeks doesn't sound that problematic. In practice I expect that a typical small file's required lifetime is at least 1 month and not more than a few months, so hopefully this won't be such a big issue.
#### Erasure coding merged files
This is again quite similar to the "old Codex" design, except that instead of 2D erasure coding I propose to do 1D, and instead of using Leopard and the $\mathsf{GF}(2^{16})$ field, I propose to use a "small" prime field, like for example the Goldilocks field ($p=2^{64}-2^{32}+1$). Note that while this is slower than Leopard's RS encoding, the files (even combined) are smaller, and the reason for this choice will be clear below.
The idea here is that the 2:1 erasure coding over these nice prime fields can be done by the storage provider, and it can prove, relatively efficiently, that it was done correctly. So unlike in "old Codex", it's not the uploader who does the slot-level erasure coding (they still do the network level one though, when applicable); and in fact it doesn't happen at the time of uploading, but only after a critical size of merged shards are collected by the provider.
Namely, the so-called FRI protocol (and its variations) does exactly what we need: It proves that an encoded piece of data is (close to) a codeword. Furthermore, it's based on Merkle commitments, and we can organize the data such that the original data is a subtree of the Merkle tree, and the parity data is another subtree; so we can even check that the codeword is an encoding _of the original data_ (and not just some random data), extremely efficiently (basically a single hash; of course this is on top of the FRI proving/verification cost, which is way more expensive).
I imagine that this is done only once, kind of like "sealing" in Filecoin (though way cheaper - I estimate this process to be under 10 seconds for 1GB of data, and it can be done in the background, there is no deadline). But in theory it could even be repeated as one merges more and more data.
Now if we have a 2:1 erasure coding, then each Merkle path proof has a 1/2 chance to detect missing / corrupted data; so even if we only check a single path (or "sample" in the old lingo) each day, after 30 days you essentially have "9 nines". So this is really similar to the old Codex idea.
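For concreteness, under the assumption that a provider who lost (part of) the data fails each independent daily check with probability at least $1/2$, the chance of escaping detection for 30 days is at most
$$\left(\tfrac{1}{2}\right)^{30} = 2^{-30} \approx 9.3\times 10^{-10},$$
which is where the "9 nines" of detection probability come from.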
Remark: We couldn't do exactly the same in the old Codex, for a few reasons:
- there the slots needed to be bigger, because both the on-chain verification and the proof generation are more expensive; and this encoding is quite a bit slower than Leopard (though maybe only by 1 order of magnitude)
- the 2x local overhead of storage is probably too big if you want to compete with the pricing of existing centralized static storage providers, but looks fine to me in this new situation
- before I was planning for an exabyte sized storage network, not a petabyte sized one
Note that the second point was basically solved by 2D erasure coding (we can have as low as 10% overhead), but that makes the proofs even more expensive, as you need many more samples for the same failure detection probability (say 160 instead of 20).
#### Proof aggregation
Fortunately, we can also use the existing proof aggregation strategy to compress these proofs (assuming a ZK-friendly hash function is used for the Merkle trees, like Monolith). However, be aware that a single such "compressed" proof is around 200 kilobytes, so compressing just 100-200 proofs doesn't really help - you need to do one more layer of recursive compression, in total something like 1,000 - 10,000 files (each with a distinct lightweight proof) for this to make sense (and that obviously has quite a bit of cost for the provider).
Such a proof compression is also way more expensive than the individual proofs (maybe around 1 sec of CPU time and consuming 1GB of memory for the initial bundle of 100-200 Merkle proofs, and then the same again for each 2:1 proof compression). Verifying a compressed proof is also more expensive than verifying just a Merkle proof (in the order of 100ms with an FRI-based proof system like Plonky2; could be in theory improved significantly with WHIR).
While in theory we can do a "final compression" step which further reduces the proof size to a few hundred bytes, that's even more expensive and technically even more complicated (we can do it in principle but it comes with a lot of extra complexity and hardware requirements).
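A back-of-the-envelope sketch using the rough numbers above (these are the estimates from this note, not measurements):

```python
import math

MERKLE_PROOF_BYTES  = 1_000    # individual proofs: ~600-2000 bytes each
COMPRESSED_PROOF_KB = 200      # size of one recursively compressed proof
BUNDLE_SIZE         = 200      # Merkle proofs folded in the first layer
SECONDS_PER_STEP    = 1.0      # per initial bundle and per 2:1 merge

def aggregation_estimate(num_files: int):
    raw_kb  = num_files * MERKLE_PROOF_BYTES / 1000
    bundles = math.ceil(num_files / BUNDLE_SIZE)
    merges  = bundles - 1                     # 2:1 merges down to a single proof
    cpu_s   = (bundles + merges) * SECONDS_PER_STEP
    return raw_kb, COMPRESSED_PROOF_KB, cpu_s

for n in (200, 1_000, 10_000):
    raw, final, cpu = aggregation_estimate(n)
    print(f"{n:>6} proofs: {raw:>8.0f} KB raw -> {final} KB aggregated, ~{cpu:.0f} s CPU")
```

With 200 files this gains nothing (200 KB raw vs. 200 KB aggregated), while with 10,000 files it turns roughly 10 MB of Merkle proofs into a single ~200 KB proof at the cost of on the order of 100 seconds of CPU time.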
#### Merkle tree organization
This is a more technical detail.
If for example our field size is 64 bits, that is 8 bytes (for simplicity let's ignore here that normally it's just a bit smaller, so you cannot actually encode 8 bytes into a single field element, only a bit less...) then 1GB of data is $2^{27}$ field elements. That's quite a lot, as FFT scales with $O(N\log(N))$ and requires a lot of memory too.
However, we can take inspiration from Plonky2: Organize the data not as a vector but as a matrix, for example in this case a $2^{20}\times 128$ matrix. Apply erasure coding to each column independently. Hash each row separately, and build a Merkle tree on top of those hashes.
This gets us the following advantages:
- the Merkle tree is much smaller (you presumably want to store it); furthermore, linear hashes can usually be computed a bit more efficiently than Merkle trees
- instead of one FFT of size $2^{27}$, we have to do 128 FFT-s of size $2^{20}$, which is:
    - _always_ faster
    - done serially, requires way less memory
    - trivial to parallelize, giving a large performance boost (assuming we have enough memory and CPU cores)
- instead of executing the FRI protocol for a size $2^{27}$ vector, we can take a random linear combination of the columns (size $2^{20}$), and execute FRI on that. This gives at least a 100x speedup (prolly more) for that part.
Recall that the FRI protocol is required so that the storage provider can prove that they did the erasure coding correctly.
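A structural sketch of this layout (the `rs_encode_column` call is only a placeholder marking where the column-wise 2:1 Reed-Solomon encoding over the Goldilocks field would go; the sizes are illustrative, don't actually run it with $2^{20}$ rows):

```python
import hashlib
from typing import List

ROWS, COLS = 1 << 20, 128          # 2^20 x 128 field elements ~ 1 GB of data

def h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def rs_encode_column(column: List[int]) -> List[int]:
    # Placeholder: the real thing would be a 2:1 Reed-Solomon encoding over the
    # Goldilocks field (interpolate, then evaluate on a domain twice as large),
    # returning 2*ROWS field elements per column.
    raise NotImplementedError

def commit(matrix: List[List[int]]) -> bytes:
    """matrix[r][c]: ROWS x COLS field elements of original data."""
    # 1. erasure-code each of the COLS columns independently
    columns = [[matrix[r][c] for r in range(ROWS)] for c in range(COLS)]
    encoded = [rs_encode_column(col) for col in columns]       # COLS x 2*ROWS
    # 2. hash each of the 2*ROWS rows with a plain (linear) hash
    row_hashes = [
        h(b"".join(encoded[c][r].to_bytes(8, "little") for c in range(COLS)))
        for r in range(2 * ROWS)
    ]
    # 3. Merkle tree only over the 2^21 row hashes (instead of 2^28 leaves)
    level = row_hashes
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```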
### Validator network
I imagine that there is a set of "validator nodes", with probably somewhat better hardware, permanently being online, somehow staked, but also getting more rewards for their service.
The role of these nodes would be primarily:
- reach consensus about which storage proofs failed
- reach consensus about the resulting rewards / punishments (for both the storage providers and themselves)
- (optional: produce unbiasable entropy)
It's my impression that it's a good idea to do "epochs", and in each epoch select a smaller subset of validators (say 100) from a bigger validator pool (I mean, assuming we have enough validator nodes lol).
This seems to help with two things in particular:
- attacks trying to overtake the validator network (??)
- communication bandwidth requirements of the consensus
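One hypothetical way to do the per-epoch selection, assuming we have the shared beacon entropy discussed at the end of this note (all names are made up):

```python
import hashlib

def select_committee(validator_ids, epoch: int, beacon: bytes, size: int = 100):
    """Deterministically pick `size` validators from the pool for this epoch."""
    def ticket(vid: bytes) -> bytes:
        return hashlib.sha256(beacon + epoch.to_bytes(8, "little") + vid).digest()
    # Everybody can recompute the same ordering from public data, so no extra
    # consensus round is needed just for committee selection.
    return sorted(validator_ids, key=ticket)[:size]
```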
#### "Tokenomics"
Again, I'm very much a non-expert on consensus protocols, so what follows is just some braindump:
- I can imagine an "oracle"-type strategy for reaching consensus about the validity of proofs: if more than 66% vote in one direction, that's considered correct, the majority is rewarded and the minority is punished (and probably some action is taken if the vote went for data missing)
- if there is no 2/3 majority, everybody gets slightly punished (so that next time they actually try to reach consensus :), and we simply continue as if the proofs were fine. Fortunately, here this is not a catastrophic failure, unlike say in an L1.
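A toy version of that voting rule, just to fix the semantics (thresholds and outcomes are the ones from the bullet points above):

```python
from collections import Counter

def tally(votes: dict) -> str:
    """votes: validator_id -> True (proof looked valid) / False (it failed)."""
    counts = Counter(votes.values())
    total = len(votes)
    for verdict, n in counts.items():
        if 3 * n >= 2 * total:        # 2/3 supermajority
            # majority gets rewarded, minority punished; if the verdict is
            # "failed", some repair action would be triggered on top
            return "valid" if verdict else "failed"
    # no supermajority: everybody gets slightly punished, and we carry on
    # as if the proofs were fine
    return "no-consensus"
```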
So to me it seems that there are two kinds of rewards / reputation or whatever to be collected:
- one for the validators
- and one for the storage providers
We need stronger incentivization for the validators, as they maintain the consensus (even if it's a weak kind of consensus).
For storage providers, I imagine that the rewards are based on:
- file size
- whether it's erasure coded ("sealed") - significantly more reward for erasure coded merged file bundles, as they provide way more guarantees, and the merging helps the network a lot by decreasing traffic and validator computation
- number of proofs provided
I imagine some kind of system where they can collect some signatures or other offline "authenticated tokens" from the validators, and after enough of these are collected, they can exchange them for reputation or whatever.
#### Freeloaders vs. providers
An explicit requirement from IFT is that this new Codex system supports mobile phones and other lightweight devices (maybe something browser-based for example).
Obviously, such devices cannot provide an actual service to the network:
- they are ephemeral by definition
- web browsers don't have persistent storage
- mobile devices can have limited and very expensive bandwidth
- mobile devices have limited storage too
- mobile devices have limited CPU and electricity (battery) too
So these users are basically freeloaders: They use the system but don't provide anything in exchange.
Thus I think we need something like RLN (rate limiting nullifiers, a technology also used by Waku) to prevent DDOS attacks, by limiting the rate at which these users can make requests to the network.
On the other hand, nodes providing actual services could probably use their reputation / tokens / whatever to get a higher level of service (more storage, higher durability, etc); or possibly a promise of some future monetary tokens (though I don't think the latter is a very nice solution, and I guess legal would also have some words...).
### Data dispersal
Jacek proposed that dispersal is not really different from repair. While this is a kind of nice theoretical viewpoint, I don't think it's true in practice.
Let's assume I have my file (maybe already erasure coded into shards, or just several "copies" of it). I want these pieces to be dispersed over some (different) randomly selected nodes in the network.
First of all, the network needs to know about this request - so first we would have to broadcast to the whole network that we have these pieces (could be quite a bit of bandwidth there). Then two things can happen:
- either the nodes are not incentivised to take it, and nothing will happen
- or they are incentivised, and will race to be the one who gets it
Neither seems to be a good outcome (remark: we planned to solve the second with the "expanding window" idea in the old, paid-for, Codex design).
So instead I imagine the file uploader randomly selecting some candidate nodes (this can even be deterministic, based on the file hash and some external entropy, which we need anyway; or even a local RNG), and trying to send them the pieces, continuing the process until it all succeeds (all pieces are taken up by somebody).
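A hedged sketch of that selection-and-offer loop; `try_send` is a placeholder for the actual shard transfer, everything else is deterministic and recomputable by anyone:

```python
import hashlib

def ranked_candidates(file_hash: bytes, entropy: bytes, node_ids):
    # Deterministic candidate ordering for this file, derived from the file
    # hash and the external entropy.
    key = lambda nid: hashlib.sha256(file_hash + entropy + nid).digest()
    return sorted(node_ids, key=key)

def disperse(shards, file_hash, entropy, node_ids, try_send):
    """Offer shards to candidates in order until every shard is accepted.
    `try_send(node_id, shard) -> bool` stands in for the real transfer."""
    candidates = iter(ranked_candidates(file_hash, entropy, node_ids))
    placement = {}
    for i, shard in enumerate(shards):
        for node_id in candidates:      # each accepted shard goes to a new node
            if try_send(node_id, shard):
                placement[i] = node_id
                break
        else:
            raise RuntimeError("ran out of candidate nodes")
    return placement
```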
Some rather tricky coordination is required here: who / how / when can request the new information to be included in the DHT? What kind of "proof" do we need for this to propagate?
It seems that some new ideas are required here.
#### Repair
Repair is kind of a similar problem after all: We need some mechanism to convince new nodes to take over failed or misbehaving nodes (after the validator consensus detected something going wrong).
For self-initiated repair (instead of consensus-driven, because consensus is _hard_), to avoid racing, we can try and reuse the "expanding window" idea, but it's self-enforced by the (honest) providers?
### Data availability vs. data retrievability
Storage proofs, whether lightweight or strong, can only promise that the data _is there_, but obviously cannot guarantee that the data is _retrievable_ - as any misbehaving provider can keep the data, produce proofs, but refuse to actually serve the data.
Essentially:
- data availability is a _storage problem_
- data retrievability is a _CDN problem_
Let's suppose we want to download a piece of data, but the provider refuses to send it. What can we do?
Not much, unfortunately. We could go to "court" (which would be the validator network) and claim that we didn't get the file, but it looks impossible to prove that.
We can maybe ask a validator node to try to fetch it for us - if they can fetch it, then we get the data; if they cannot, then they can maybe initiate some stricter protocol.
However this can be abused easily:
- just ask some validator all the time, effectively DDOS-ing them
- they cannot check whether I'm DDOS-ing them or whether I am really being ignored by the provider(s)
- the provider can still always ignore everybody except the validators
- in which case even a rate limiting of such requests to the validators won't really help
So more ideas are needed here.
Maybe we should add a caching layer, but it's unclear how to incentivize that, and it's again quite a bit of extra complexity.
### DHT scaling and security
It seems to me that there are two big problems with our usage of DHT: One being scaling (can it handle 1 billion, or even more entries?) and the second being security (anybody can try and poison the metadata, which would completely ruin the system).
Again, I don't know enough about DHT-s, but here are some random thoughts:
- can we use a trie (= prefix tree) structure, so instead of say 1 billion DHT entries, you have let's say 10 million "bundled entries", and each of those entries contains on average 100 "actual entries"? So this is kind of a trade-off between the number of DHT entries and the cost of queries and updates. It seems that [I'm not the first one](https://en.wikipedia.org/wiki/Prefix_hash_tree) to have this idea, maybe worth researching it? (See the rough sketch after this list.)
- for at least certain types of updates, we could require a proof bundled with the update request - well-behaving DHT nodes can reject the request if the proof is invalid
- we should look at existing research about DHT security like S/Kademlia
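A rough sketch of the prefix bucketing idea (the 24-bit prefix is an arbitrary illustration, not a tuned number):

```python
PREFIX_BITS = 24   # ~16.7M buckets; ~60 files per bucket at 1 billion files

def bucket_key(cid: bytes) -> bytes:
    # All CIDs sharing their first PREFIX_BITS bits live under one DHT key.
    full, rem = divmod(PREFIX_BITS, 8)
    key = bytearray(cid[:full + (1 if rem else 0)])
    if rem:
        key[-1] &= 0xFF << (8 - rem)   # zero out the bits past the prefix
    return bytes(key)

# A lookup fetches the whole bucket stored under bucket_key(cid) and finds the
# individual entry inside it; an update rewrites one bucket, ideally carrying a
# small proof so that honest DHT nodes can reject bogus changes.
```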
### A possible incremental rollout plan
Obviously, we need to roll out this stuff incrementally. Here is a possible plan:
1. replicate files to say 10-20 nodes selected randomly. This requires storage provider registration, but we need that anyway because mobile and other light clients are different. Store the info about where the pieces are in the DHT (in the manifest)
2. add erasure coding instead of (in addition to) replication (should be simple)
3. encrypt data before uploading (can be optional, often the application layer already does it)
4. produce very simple Merkle path proofs for the files; gossip them (simple)
5. add opportunistic file bundling so we have less proof traffic (not too complicated)
6. add voluntary repair with an expanding window criteria to self-select nodes doing the repair (easy)
7. add local EC for better guarantees (I think I know how to do it)
8. validator network and lightweight consensus (hard)
9. RLN for freeloading users
10. DHT scaling and security (presumably not easy...)
11. proof aggregation for better scaling (very involved, but most of the research is already done; have PoC implementation)
12. privacy considerations? (apart from simple auto-encryption)
13. real incentivization, smart contract registration
14. strong(er) durability
15. etc
Of course the ordering above is somewhat flexible.
### Unpredictable entropy
The same way as with "old Codex", we need some source of entropy (random numbers) which is unpredictable and more-or-less unbiasable.
So far we used randomness to:
- trigger storage proof requests
- select the random samples the providers need to prove
- possibly to select the "seeds" of the "expanding windows"
- (maybe something I forgot?)
Some further applications could be:
- select the validator subset for the current epoch
- to generate distributed private keys in threshold cryptosystems (eg. for threshold signatures created by the validators)
- etc
In old Codex we planned to use a blockchain block hash sequence for this. Obviously this should still work in theory (though in practice the different cadence of different target blockchains could cause issues).
There are also other third-party random beacon services available, for example:
- https://www.cloudflare.com/leagueofentropy/
- https://csrc.nist.gov/projects/interoperable-randomness-beacons
- etc
However, as this "new Codex" is much less coupled to a blockchain (because blockchains, even L2-s, simply doesn't scale to this amount of traffic) -- in fact, the only reason I can see to rely on a blockchain at all is to store reputation points / reward tokens --, it may make sense to decouple the entropy generation too.
Fortunately, creating a random beacon is maybe _not that hard_, especially as we doesn't seem to have as strong requirements as some other applications.
A possible basic recipe for a distributed random beacon is the following.
In each epoch:
- participants (presumably the validator nodes) all generate some local randomness, and commit to them (normally this is just a hash like SHA256)
- they broadcast the commitments in a given time window
- after the time window expired, in another window they reveal the original (committed-to) random values
- they calculate the XOR of these values (filtering out those which don't match the previous commitment)
- somehow reach consensus about the set of well-behaving participants, and the final combined (XOR-ed) value
- **there is however a problem here:** The last person to reveal can decide whether to reveal or not; they can simply compute both results and select the favourable one
- one solution for this is that the currently elected leader calculates a VDF (Verifiable Delay Function) of the resulting combined value
- a VDF is something like a hash which takes a long time to compute (so it cannot be predicted), but can be checked very efficiently by third parties
- the parameters of the VDF are set such that in the time window to reveal, it's not feasible to compute it; so whatever the last person decides, it cannot meaningfully bias the result
- participants check the computed VDF value, and reach consensus about it
The most complicated component of this approach is the VDF itself, but there are existing published instances.
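A minimal sketch of the commit-reveal core of such a beacon (leaving out the networking, the consensus step and the VDF; participant IDs and sizes are arbitrary):

```python
import hashlib, os
from functools import reduce

def commit(secret: bytes) -> bytes:
    return hashlib.sha256(secret).digest()

# --- each participant, locally ---
secrets     = {vid: os.urandom(32) for vid in (b"v1", b"v2", b"v3")}
commitments = {vid: commit(s) for vid, s in secrets.items()}   # broadcast these first

# --- after the reveal window ---
revealed = dict(secrets)          # whatever actually got revealed in time
valid    = {vid: s for vid, s in revealed.items()
            if commit(s) == commitments.get(vid)}              # drop mismatches

beacon = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                valid.values(), bytes(32))
# `beacon` would then be fed through the VDF and agreed on by the validators,
# to remove the "last revealer" bias described above.
```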