# Pruned Sets
*Contributors: Varun Srinivasan, Paul Fletcher-Hill, Sagar Dhawan, Sanjay Prabhu, Shane da Silva, Dan Romero*
## 1. Problem
A problem with the current Hub design is that there is no limit on how much data a user can add to the network. Unbounded storage will introduce two new problems that apply a centralizing force on the network.
First, most developers will not be able to afford to keep a copy of the network's data if it keeps growing indefinitely. A few "large hub operators" will emerge as intermediaries between users and developers, offering to maintain a full copy of all data in exchange for some fees.
Second, users might accidentally or maliciously add billions of messages to the network. Hub operators would solve this by developing tools to ignore specific users and content. Some operators might wield such tools more aggressively than others and have different opinions on what constitutes spam on the network.
The cumulative effect will be to push the network towards centralization. A few hub operators become the intermediaries between developers, users and their audiences. They can collude to block users and developers (or be forced to do so).
We believe Farcaster can become [sufficiently decentralized](https://www.varunsrinivasan.com/2022/01/11/sufficient-decentralization-for-social-networks) if **all user data is replicated across 100+ hubs** owned by different entities. We also believe that developers building companies on Farcaster have a strong incentive to run hubs and keep the network decentralized. However, they are only likely to run hubs if they are:
1. **Full copies of the network** - a Hub must have all data needed for a useful client.
2. **Easy to set up** - a Hub can be run locally or in the cloud on a single machine.
3. **Cheap to run** - a Hub can store a single user's data for < $0.01 / year.
## 2. Solution
Hubs enforce a limit on the total number of messages within each Set, dropping the oldest known message if the limit is exceeded. They also enforce an age limit, dropping messages older than a rolling window for each Set.
The size and age limit maximize the number of relevant messages stored on a network. A size limit alone would result in old data being kept around forever even if it is never viewed. Reallocating this space to data from a more active, recent user is more desirable for the network.
A future update will explore further changes to allow users to preserve specific messages or pay for increased storage. These are out of scope for the upcoming v2 release since they increase complexity and most users will not be affected by limits for a while.
### 2.1 Set Limits
Limits are intentionally **very conservative** and we expect them to increase in the near future as we develop certainty in our models.
| Collection | Max Size | Max Age |
| -------- | -------- | -------- |
| Casts | 10,000 | 1 year |
| Reactions | 5,000 | 3 months |
| Follows | 5,000 | None |
| Signers | 100 | None |
| Verifications | 50 | None |
| Meta | 100 | None |
*See [Appendix A](#Appendix-A:-Estimated-Per-User-Message-Volumes) and [Appendix B](#Appendix-B:-Estimated-Message-Sizes) for a detailed estimate on volume and sizing of messages*
Message types have different limits based on their frequency, periodicity, value to the author and value to the network:
- **Casts** - moderate volume and medium-term useful, so have high size and age limits
- **Reactions** - frequent and medium-term useful, given medium size limits and low age limits
- **Follows** - low volume and important, given medium size limits with no age limits
- **Signers, Verifications, Meta** - rare but important, given low size limits and no age limits
Based on estimates, only a tiny percentage of users (<1%) will run into size limits. At 10M users, the projected size of the network data is ~10TB, which costs approximately $11k / year on major cloud providers. This comes out to a cost of $0.001 / user, 10x below our target, giving us room to increase limits and add new message types.
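As a back-of-the-envelope check on the cost claim above, the sketch below recomputes the per-user figure. The $0.09 / GB-month storage price is an assumption chosen to roughly match the $11k / year estimate, not a number from the source:

```typescript
// Back-of-the-envelope check of the storage-cost claim above.
// Assumption (not from the source): block storage at ~$0.09 / GB-month.
const users = 10_000_000;
const totalBytes = 10e12;          // ~10 TB projected network size
const gbMonthPrice = 0.09;         // assumed cloud block-storage price

const gb = totalBytes / 1e9;       // 10,000 GB
const yearlyCost = gb * gbMonthPrice * 12; // ~$10,800 / year, i.e. ~$11k
const costPerUser = yearlyCost / users;    // ~$0.0011 / user

console.log(yearlyCost.toFixed(0), costPerUser.toFixed(4));
```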
### 2.2 Known Issues
1. **A malicious actor can bloat the network since fids are cheap**
Fids have a one-time registration cost but guarantee storage of data on the network. A malicious actor can register thousands of fids and bloat the network with dummy data. This will have to be specifically addressed before mainnet launch, but it is not urgent during testnet when registrations are restricted.
2. **A compromised signer can wipe a user's history without a trace**
Previously, a compromised signer could delete earlier messages but it would always leave a tombstone. Now, a signer can issue a large volume of dummy messages to wipe older messages out without a trace. Both attacks can be reversed by revoking the signer and re-issuing the valid messages from a backup.
3. **Hubs can be subject to a Network DDOS by sending millions of valid messages**
A malicious user can spam the network, and intelligent rate limits and a peer reputation system will be needed to prevent this. In the short term, this can be mitigated by only peering with known good actors until a more robust solution is developed.
4. **Users can accidentally wipe their own history**
Similar to (2), a user can accidentally generate many messages that wipe their own history. Such a problem would only affect advanced users and can be handled with checks at the RPC client or server.
5. **A malicious actor can erase another user's messages by issuing signers**
Signers can be submitted from invalid custody addresses, an optimization introduced to smooth custody address migrations. With a limited signer set, this can be exploited by a malicious user to flood the set with addresses causing the older ones to be pruned. A fix will need to be made to modify this behavior.
### 2.3 FAQ
**Can we incentivize more hubs by giving them a share of the network fees?**
While many mechanisms can prove that a Hub is hosting data, there is no practical way to prove that it is serving the data to callers with reasonable latency and a low error rate.
**Can we let users choose how many messages to store?**
Unbounded data growth leads to two problems. First, the network size will grow large enough that the average developer cannot afford to run a hub that maintains a full copy of the network. Second, some users will publish spam, requiring hubs to have the ability to block specific users. The network will end up being controlled by a small oligopoly of hub operators who have the power to remove people from the network.
**Can we have super hubs that store more data?**
See above
**Can we use sharding to scale the network instead of a limit on data?**
See above
**Can we let users pick specific messages to keep forever?**
We plan to explore this in a future upgrade to the protocol. While technically feasible today this adds significant complexity to the sync model and would significantly delay the launch of v2.
**Can network-rate limits be used instead of a storage limit?**
Any form of network-based rate limit to enforce a storage limit is bound to fail since it can be sidestepped in several ways in an untrusted, eventually consistent network.
**Can we implement limits on actual bytes instead of the number of messages?**
Counting bytes might lead to a more predictable outcome for "bytes-on-disk", but it is much harder for people to reason about and more work for Hubs to keep track of.
**Can we evict reactions when a parent message is pruned?**
To do this, we must be able to guarantee that all data will always live on a single logical Hub. While we have a clear path to this until 10M users, we do not yet have a path to our target of 1B and wish to hold off on this optimization for now.
## 3. Consensus Changes
All Sets now have a "prune clause" that triggers when a message is added and the size or age limits are exceeded. The message with the lowest farcaster identifier is removed first, repeating until the size and age clauses are satisfied. This changes an important property of our sets — a member can now revert to a null state from a non-null state.
The prune rules can be defined more formally on a set S which contains some messages, the lowest ordered of which is n. Now assume a new message m is added:
- If `set_size` < `max_set_size`, do nothing
- If `set_size` = `max_set_size`, and
    - `order(m)` <= `order(n)`, reject m
    - `order(m)` > `order(n)`, prune n and add m
The prune rule must also trigger every second on every message n in the set S:
- If `time.now()` - `timestamp(n)` > `age limit`, prune n
- Else, do nothing
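The two prune rules above can be sketched as follows. The `order` and `timestamp` fields, and the `PrunedSet` class itself, are illustrative stand-ins for the Hub's actual message ordering and storage, not the real implementation:

```typescript
// Sketch of the size- and age-based prune rules. Names and types are
// illustrative, not the actual Hub implementation.
type Message = { order: number; timestamp: number };

class PrunedSet {
  private messages: Message[] = []; // kept sorted by ascending order

  constructor(private maxSize: number, private maxAgeSeconds?: number) {}

  // Size-based prune clause: runs on every insert.
  add(m: Message): boolean {
    if (this.messages.length >= this.maxSize) {
      const lowest = this.messages[0]; // the lowest-ordered message n
      if (m.order <= lowest.order) return false; // reject m
      this.messages.shift();                     // prune n
    }
    this.messages.push(m);
    this.messages.sort((a, b) => a.order - b.order);
    return true;
  }

  // Age-based prune clause: runs on a timer (e.g. every second).
  pruneExpired(now: number): void {
    const maxAge = this.maxAgeSeconds;
    if (maxAge === undefined) return; // this set has no age limit
    this.messages = this.messages.filter((n) => now - n.timestamp <= maxAge);
  }

  size(): number {
    return this.messages.length;
  }
}
```

Note that an incoming message older than every message in a full set is rejected outright, which is exactly the "adds sometimes lose" behavior discussed in the next sections.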
### 3.1 Last Write Wins Sets
Reactions, Follows, Verifications, Signers and Meta use LWW sets with an add-set and a rem-set to track state. Adding the prune rules into the LWW rules adds a few new state transitions that were not possible before:
1. rem → null when a later message is added or the clock ticks forward
2. add → null when a later message is added or the clock ticks forward
3. messages may be rejected when order(message) < order of lowest element in set S
![](https://i.imgur.com/fEhDGTi.png)
Users and applications perceive null and rem states similarly, so the only semantic change is that adds sometimes do not win even if they are the last write. An add must be both the last write for a value and within the validity range for the set's messages to win. We call this a *LastValidWriteWins* or LVWW set which has slightly different guarantees to the user.
**Reactions**: LVWW sets work well with reactions, where the value of a reaction decays with age and purging older ones is desirable.
**Follows**: LVWW sets are less ideal for follows since the value of a follow does not necessarily change with age. Applications must take care to track follow set state and warn users when a new follow might expire an older one.
**Signers**: LVWW sets with Signers suffer the same problem as follows since older signers are not necessarily less valuable. However, signer actions are rare and most users are not expected to cross the limits. The purging of a signer does not violate the consistency model since SignerRemoves have the same effect on other sets.
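A minimal sketch of the LastValidWriteWins rule for a single value (e.g. one reaction target). The "lowest admissible order" parameter stands in for the prune boundary of the set; all names are illustrative assumptions, not Hub code:

```typescript
// LVWW resolution for a single value: the last write wins only if it is
// also inside the set's validity range. Illustrative, not the Hub code.
type Op = { order: number; kind: "add" | "rem" };

// Resolve the state of one value given its competing operations and the
// lowest order currently admissible in the set (the prune boundary).
function resolveLVWW(ops: Op[], lowestAdmissibleOrder: number): "add" | "rem" | null {
  const valid = ops.filter((op) => op.order >= lowestAdmissibleOrder);
  if (valid.length === 0) return null; // pruned: non-null -> null transition
  // Last write among the valid operations wins.
  const last = valid.reduce((a, b) => (b.order > a.order ? b : a));
  return last.kind;
}
```

An add that is the last write overall can still resolve to null once the prune boundary moves past it, which is the new state transition LVWW introduces.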
### 3.2 Remove Wins Sets
Casts use RW sets with an add-set and a rem-set to track state. Adding the prune rules into the RW rules adds a few new state transitions that were not possible before:
1. rem → null when a later message is added or the clock ticks forward
2. add → null when a later message is added or the clock ticks forward
3. messages may be rejected when order(message) < order of lowest element in set S
![](https://i.imgur.com/3kQW3QN.png)
If, through user error, the remove message has an earlier order than the add message, a prune may cause the add to win after the remove had won. Similar to LVWW sets, adds sometimes do not win if they are outside the validity range. We define this as a *CausalRemovesWinIfValid* or CRWV set.
**Casts**: CRWV sets work perfectly fine except in the case where the timestamps are set incorrectly. The user can remedy this by waiting for the add message to expire or publishing a remove message with a later timestamp.
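The failure mode above can be sketched for a single cast. As in the LVWW sketch, the prune boundary is passed in explicitly and all names are illustrative:

```typescript
// CausalRemovesWinIfValid resolution for one cast. A remove beats an add
// while both are valid, but a pruned remove can let the add win again.
// Illustrative sketch, not the Hub code.
type CastOp = { order: number; kind: "add" | "rem" };

function resolveCRWV(ops: CastOp[], lowestAdmissibleOrder: number): "add" | "rem" | null {
  const valid = ops.filter((op) => op.order >= lowestAdmissibleOrder);
  if (valid.length === 0) return null;
  // Removes win over adds whenever a valid remove exists.
  return valid.some((op) => op.kind === "rem") ? "rem" : "add";
}
```

With a mis-timestamped remove at a lower order than its add, raising the prune boundary past the remove flips the outcome from "rem" back to "add" — the user-error case noted above.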
## 4. Sync Changes
Hubs maintain a merkle trie with the id of every known message to compare state with each other. Previously, new messages were most likely to be inserted on the right-hand side of the tree. The prefix-exclusion sync algorithm was chosen to optimize for this case.
The introduction of pruned sets makes left-hand side changes very likely, and mid-tree changes somewhat likely. This reduces the efficiency of our sync algorithm though it is almost always better than a naive traversal of the tree. No changes are needed for now, but this will need to be upgraded sooner rather than later.
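The core idea of the comparison can be sketched as follows: two hubs exchange subtree hashes and only descend into prefixes that differ. The node structure below is an illustrative assumption, not the Hub's actual trie:

```typescript
// Illustrative merkle-trie diff: descend only into children whose hashes
// differ, collecting the divergent prefixes that need syncing.
type TrieNode = { hash: string; children: Map<string, TrieNode> };

function divergentPrefixes(a: TrieNode, b: TrieNode, prefix = ""): string[] {
  if (a.hash === b.hash) return []; // identical subtrees: nothing to sync
  if (a.children.size === 0 || b.children.size === 0) return [prefix];
  const out: string[] = [];
  for (const [key, childA] of a.children) {
    const childB = b.children.get(key);
    if (!childB) out.push(prefix + key); // subtree missing on peer b
    else out.push(...divergentPrefixes(childA, childB, prefix + key));
  }
  for (const key of b.children.keys()) {
    if (!a.children.has(key)) out.push(prefix + key); // missing on peer a
  }
  return out;
}
```

When inserts cluster on the right-hand side, most left-side subtree hashes match and are skipped immediately; prune-induced left-side and mid-tree changes defeat that shortcut, which is why efficiency drops.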
## 5. Appendices
### Appendix A: Estimated Per-User Message Volumes
Shows the average, median and maximum number of messages generated by users of Farcaster over a year. Estimates were obtained by sampling the 30-day rates and extrapolating them to a year.
| | **1yr Avg** | **1yr Median** | **1yr Maximum** |
|---------------|-------------|----------------|-----------------|
| **Casts** | 348 | 72 | 28,968 |
| **Reactions** | 864 | 132 | 103,008 |
| **Follows** | 828* | 168* | 47,964* |
*Follows cannot be estimated accurately by extrapolation from a 30 day data set, since users follow often when they sign up and slow down thereafter*
### Appendix B: Estimated Message Sizes
Shows the estimated sizes for the three largest set types by volume on Farcaster. Estimates are based on the add-type, which is usually larger, and factor in both the message size and the size of any indexes needed to look up the message in the hub.
| | **Avg (bytes)** | **Max (bytes)** |
|---------------|-----------------|-----------------|
| **Casts** | 700 | 1364 |
| **Reactions** | 348 | 348 |
| **Follows** | 332 | 332 |