# Decentralization of Hubs

## Problem

Farcaster Hubs are designed to store a copy of all the data on the network for all users. This will require an increasing amount of disk space, which raises the cost of running a Hub. If Hubs’ costs become excessive, we’ll see fewer Hubs, which will lead to:

- The [SMTP problem](https://twitter.com/cfenollosa/status/1566484145446027265) where new entrants are blocked out
- Collusion between Hub operators to de-prioritize or censor content
- Attacks against Hub operators to take down the network or force censorship

These problems are particularly painful when the network is controlled by < 10 actors, since collusion and targeted attacks become very easy. If the network approaches ~ 100 unique, geographically distributed actors, most of these vectors become impractical. **We can state that the network is practically decentralized if we have ~ 100 unique Hub operators.**

### Hub Costs

Before talking about solutions, it’s useful to get a sense of how much it may cost to run a Hub. We start off by making a few assumptions:

- Farcaster plans to grow daily active users at ~ 5% per week
- When Farcaster increases its daily active user count by one, the amount of data generated per year goes up by roughly 4 MB (~ 4,000 messages)
- Hubs must store all data on-disk to comply with the current sync model, and 1 TB of storage on AWS EBS or equivalent costs ~ $1,100 USD / year
- The vast majority (> 80%) of variable costs will come from storage

The projected growth over time is:

| Year Ending | Users | New Data / Yr | Total Hub Storage | Storage Cost / Yr |
| --- | --- | --- | --- | --- |
| 2022 | 4.7k | 18 GB | 18 GB | $20 |
| 2023 | 60k | 234 GB | 252 GB | $270 |
| 2024 | 800k | 3 TB | 3.2 TB | $3,500 |
| 2025 | 10M | 38 TB | 41.2 TB | $45,000 |
| 2026 | 120M | 482 TB | 523 TB | $575k |
| 2027 | 1.5B | 5.58 PB | 6.1 PB | $6.9M |

The actual costs are likely to be higher than these projections, since they don’t account for new features which may increase storage, network effects which increase messages per user, and the additional costs of running Hubs like compute and I/O.

**The good news is that having each Hub store a full copy of the network works well until ~ 2024.** After that, costs grow exponentially with the user base, and scaling solutions will be needed to keep the number of Hubs at ~ 100. These are discussed below.
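As a sanity check, the projections in the table can be reproduced from the stated assumptions. Here is a minimal sketch in TypeScript; the constants come from the assumptions above, while the loop structure and rounding are illustrative:

```typescript
// Rough cost model behind the table above. Assumptions taken from this doc:
// ~5% weekly user growth, ~4 MB of new data per user per year, and
// ~$1,100 per TB-year of storage. Output is approximate, not canonical.

const WEEKLY_GROWTH = 1.05;
const MB_PER_USER_PER_YEAR = 4;
const USD_PER_TB_PER_YEAR = 1_100;

let users = 4_700; // starting point: ~4.7k users at the end of 2022
let totalStorageMb = users * MB_PER_USER_PER_YEAR;

for (let year = 2022; year <= 2027; year++) {
  const storageTb = totalStorageMb / 1_000_000;
  const costUsd = Math.round(storageTb * USD_PER_TB_PER_YEAR);
  console.log(
    `${year}: ${Math.round(users).toLocaleString()} users, ` +
      `${storageTb.toFixed(2)} TB stored, ~$${costUsd.toLocaleString()} / yr`
  );
  users *= WEEKLY_GROWTH ** 52; // compound the weekly growth over one year
  totalStorageMb += users * MB_PER_USER_PER_YEAR; // new data from the larger user base
}
```

Running this lands within roughly 10% of each row in the table, which suggests the projections are driven almost entirely by the compounding growth assumption.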
## Approaches

There are a few obvious strategies that will decrease storage costs, like compressing data (~10% reduction) and keeping the format lean (~2%). However, we will need an order of magnitude improvement to address the problem at scale.

### 👭 Sharding

Hubs can be configured to sync data for specific users by fid and ignore others. Our data model has been designed to explicitly allow this, so only a change to our p2p model is needed to enact it (a minimal sketch of such a filter follows this list).

The benefits are:

- Hubs can be increasingly sharded as the network grows
- Hubs can protect themselves against DDoS attacks by ignoring malicious users spamming messages

The downsides are:

- Hubs can never implement “global” operations
- Peering is much harder, since not all peers are guaranteed to have the data you want
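To make the fid-based filtering concrete, here is a minimal sketch. The `ShardConfig` type, the `shouldSync` hook, and the modulus scheme are hypothetical illustrations, not part of the protocol:

```typescript
// Illustrative shard filter: a Hub stores and gossips a message only if the
// author’s fid falls into one of the shards it is responsible for. The type
// names and modulus scheme below are hypothetical, not protocol-defined.

interface ShardConfig {
  totalShards: number; // how many shards the fid space is split into
  ownedShards: Set<number>; // which shards this Hub syncs
}

interface Message {
  fid: number; // id of the user who signed the message
  // ... payload, signatures, etc.
}

function shouldSync(msg: Message, config: ShardConfig): boolean {
  // A deterministic mapping from fid to shard: every Hub that agrees on
  // totalShards agrees on which shard a given user belongs to.
  const shard = msg.fid % config.totalShards;
  return config.ownedShards.has(shard);
}

// Example: a Hub that stores half the network (shards 0 and 1 of 4).
const config: ShardConfig = { totalShards: 4, ownedShards: new Set([0, 1]) };
console.log(shouldSync({ fid: 5 }, config)); // shard 1 -> true, stored
console.log(shouldSync({ fid: 6 }, config)); // shard 2 -> false, ignored
```

The same predicate could double as DDoS protection: a Hub under load can temporarily shrink `ownedShards` or ignore fids it considers malicious.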
### 🏠 Home Hubs

Users can declare “Home Hubs” by publishing an on-chain event; these are Hubs guaranteed to have a copy of their data. Sharding is a prerequisite for this improvement, since it allows a Home Hub to be configured to store just a single user’s data, which can be run cheaply in the cloud. It’s like the “host your blog” model: you guarantee one copy of your data, and other Hubs can choose to relay it. The sync algorithm is designed so that Hubs fetch all data from their peers when possible and occasionally sync data from Home Hubs as well.

The benefits are:

- Storage can be decreased by 10x-90x, since most data is no longer fully replicated

The downsides are:

- Syncing is more complex, since there is no guarantee that your peers have the data you want
- Users are less protected against powerful actors and are responsible for their own decentralization

Extensions to this idea:

- A “Home Archive” where a user publishes a static copy of historical data to IPFS or other systems for posterity, and only stores recent data on their Hub

### 〰️ Ephemerality

Hubs can switch to an ephemeral (light) mode where they only store recent data for users. The retention window could be defined by time (convenient, but vulnerable to spam) or by message count (predictable, but worse UX). Sync will be straightforward as long as Hubs can connect to other Hubs.

The benefits are:

- Bounds storage cost to a fixed size per user
- Ephemeral Hubs are protected against storage spam

The downsides are:

- Sets will need to be redesigned to work with a moving window
- Some types like Follows and Signers will need to be redesigned entirely to fit this model

Some extensions to this idea:

- Enforced ephemerality, where all Hubs are light Hubs and users must archive their data in other ways
- Ephemerality could be enforced only for users who do not have an fname

## Rejected Approaches

The rejected approaches are ones that require incentivizing Hubs with rewards for storing data. While it is possible to verify proof of storage, it is practically impossible to verify that a Hub is serving callers correctly (discussed further in the downsides of Storage Bounties).

### 🧰 Storage Bounties

Hubs can be compensated for “proof of storage”, which can be accomplished by simple mechanisms like polling for responses or more complex mechanisms like VDFs. The revenue coming in from fnames could be used to compensate Hub operators for this. Hubs could also be made to stake funds which get “slashed” if they fail to serve data, as the “stick” to the carrot. Farcaster could use its protocol revenue to pay Hub operators as long as they prove that they’ve stored data on the network. This would incentivize Hubs through financial means, and if we are charging $10 / user for fnames, that easily covers storage for most users. VDFs could be used to ensure that Hubs really had a copy of the data at the time they claim they did.

The benefits are:

- The incentive model automatically encourages more Hubs, even if the data size is large

The downsides are:

- Cannot verify “proof of delivery” of messages, since malicious Hubs may ignore real users and only respond to the requests that could slash their revenue
- Cannot deterministically verify that a response was delivered, due to the lossiness of the network. A Hub may send a response and a client may never receive it, due to errors outside of either party’s control

### 🪑 Distributed Hash Tables

Hubs act as a distributed hash table, where each Hub stores a partial copy of the data and the overall system ensures that each piece of data is replicated on at least 10 different Hubs (a minimal sketch of one possible assignment scheme follows at the end of this section).

The benefits are:

- Reduces storage costs by ~10x

The downsides are:

- Needs a punishment-reward system to ensure that a specific bit of data is always available
- Data access takes longer and is less reliable, because it is sometimes done over network calls
- Global operations are no longer possible, since no single Hub has a complete copy of the network

Some extensions to this idea:

- Anonymous routing protocols like [RPM](https://eprint.iacr.org/2022/1037.pdf) could be used to add a layer of anonymity to nodes
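One way to realize the “at least 10 replicas” property is rendezvous (highest-random-weight) hashing, sketched below. The hash function, hub ids, and replica count are assumptions for illustration; this document does not specify an assignment scheme:

```typescript
// Illustrative replica assignment for the DHT approach above: each fid is
// mapped to REPLICAS Hubs via rendezvous (highest-random-weight) hashing.
// The hash function, hub ids, and REPLICAS = 10 are assumptions for this
// sketch, not part of any Farcaster specification.

const REPLICAS = 10;

// A small deterministic string hash (FNV-1a); any uniformly distributed
// hash would work here.
function hash(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Rank all hub ids by hash(hubId:fid) and keep the top REPLICAS. Every node
// computes the same answer, so no coordinator is needed, and removing one
// Hub only reassigns the data that Hub actually held.
function replicasFor(fid: number, hubIds: string[]): string[] {
  return [...hubIds]
    .sort((a, b) => hash(`${b}:${fid}`) - hash(`${a}:${fid}`))
    .slice(0, REPLICAS);
}

// Example: with 100 Hubs and 10 replicas, each Hub stores ~10% of the
// network, matching the ~10x storage reduction above.
const hubs = Array.from({ length: 100 }, (_, i) => `hub-${i}`);
console.log(replicasFor(42, hubs)); // a deterministic set of 10 hub ids
```

Note that this only determines where data should live; the punishment-reward system mentioned in the downsides would still be needed to verify that the assigned Hubs actually hold it.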