# Reduce beaconchain database storage size

## Background

The beacon chain database (beaconchain.db) is close to 20 GB on mainnet. That is huge, considering the chain has been running for only ~8 months at the time of writing. Moreover, the state is persisted only every 2048 slots (the default `--slots-per-archive-point`). Some advanced users (like beaconcha.in) want to lower this to 32 slots, i.e. archive states far more frequently, to reduce the response time in their applications; with a 32-slot archive point the DB grows to close to 400 GB.

More context in this issue: https://github.com/prysmaticlabs/prysm/issues/8954

## What we want to achieve

1) Bring down the beaconchain.db size
2) Keep CPU/memory usage in check while doing (1)
3) Don't change the DB interfaces, to limit the impact of the change

## State of the "state" bucket

1) The "state" and "blocks" buckets are the largest in the DB (gist 1). The biggest bang for the buck comes from shrinking the "state" bucket, since it appears to grow exponentially as slots-per-archive-point is reduced.
2) The "state" bucket **value** is the entire state at a given point in time: the whole BeaconState object is stored as one big blob. On average the compressed state value is 9.5 MB, with a maximum of 15 MB (gist 1).
3) The biggest contributor to the "state" bucket values is the validator registry entries (gist 2).
4) The validator list is large, and most of its entries do not change between archived states (gist 3).
5) Basic statistics indicate that most validator entries change infrequently (gist 4).

## Gists

The following analysis was done on the mainnet beaconchain.db with the default slots-per-archive-point of 2048.

1) https://gist.github.com/jmozah/10830b5c019a3f1d90881dbc141e9f76
2) https://gist.github.com/jmozah/ab1abbe9d9035002e6cc1e4daa00f1fb
3) https://gist.github.com/jmozah/b0d4260f47d642a2c8e0666d6c5ad842
4) https://gist.github.com/jmozah/40ceeb47020c4ef9fcd116f6e1fda6dc

Analysis of the "state" bucket with slots-per-archive-point equal to 32 is on the way.

## Potential Solutions

#### 1) Store validator list entries in a separate bucket

The basic idea is to store a validator entry only when it changes (similar to a time-series database). In the state bucket, store only pointers to the actual validator entries.

- Create a new "validator entries" bucket for storing validator entries.
    - Key: Hash(validator entry)
    - Value: validator entry
- Create another "validator index" bucket to store the keys of the validators involved in a given state.
    - Key: block root
    - Value: validator entry hashes (for that state)
- Whenever a state is stored, store an empty validator list in the state entry and store the validators and their hashes in the new buckets.
- When constructing a state, the corresponding validator entries have to be fetched from the validator entries bucket using the validator index.
- To reduce DB reads and increase state construction throughput, a validator entry cache can be introduced.

A minimal code sketch of this bucket layout appears at the end of this section.

![](https://i.imgur.com/AerQu0G.png)

Pros:
- Savings in space (no validator entry is stored twice)

Cons:
- Constructing a state might require more CPU/memory
- Saving a state might take more CPU, as a hash is computed for every entry

#### 2) Represent the entire state as a trie instead of a blob

Another solution, proposed in issue #8954, is to store the entire merkleized state as a trie in BoltDB. I am of the opinion that this will not help much, as the data duplication will not be caught in this scenario: the validator entries within the validator list are essentially random.
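To make solution (1) concrete, below is a minimal sketch of the save path for the two new buckets, assuming bbolt as the underlying store. The bucket names, serialization format, and function names are hypothetical and will not necessarily match what PR #9155 does.

```go
package statestorage

// Minimal sketch of solution (1). Bucket names, serialization, and function
// names are hypothetical; the actual implementation in PR #9155 may differ.

import (
	"crypto/sha256"

	bolt "go.etcd.io/bbolt"
)

var (
	validatorEntriesBucket = []byte("validator-entries") // Hash(entry) -> serialized validator entry
	stateValidatorsBucket  = []byte("validator-index")   // block root  -> concatenated entry hashes
)

// saveValidators writes each (already serialized) validator entry keyed by its
// hash, skipping entries that are already present, and records the ordered
// list of hashes for this state under the state's block root.
func saveValidators(db *bolt.DB, blockRoot [32]byte, encodedValidators [][]byte) error {
	return db.Update(func(tx *bolt.Tx) error {
		entries, err := tx.CreateBucketIfNotExists(validatorEntriesBucket)
		if err != nil {
			return err
		}
		index, err := tx.CreateBucketIfNotExists(stateValidatorsBucket)
		if err != nil {
			return err
		}

		hashes := make([]byte, 0, len(encodedValidators)*32)
		for _, enc := range encodedValidators {
			h := sha256.Sum256(enc)
			hashes = append(hashes, h[:]...)
			// Deduplication: only write the entry if this exact value is new.
			if entries.Get(h[:]) == nil {
				if err := entries.Put(h[:], enc); err != nil {
					return err
				}
			}
		}
		// The state blob itself is persisted with an empty validator list; the
		// hash list stored here is enough to reassemble it on read.
		return index.Put(blockRoot[:], hashes)
	})
}
```

With roughly 160,000 validators, the index costs 32 bytes per validator per state (about 5 MB uncompressed, as discussed in the comments below), which is still far smaller than duplicating full validator entries in every archived state.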
## Migration

PR #9155 does a one-way migration, i.e. it moves the validator entries from the `state` bucket into a couple of new buckets (see the green buckets in figure 1 above). This means the placeholder for the validator registry inside `state` is left empty. When saving/retrieving a state, the new code combines the two and produces a single state object (a read-side sketch of this recombination follows the options below). Once the migration has succeeded, the new code path is followed **even** if the feature flag is disabled afterwards.

Two issues can arise because of this:

1) The DB gets scrambled/corrupted during the migration.
2) The migration succeeds, but some other issue shows up afterwards.

In either case, the user should have an option to go back to the same state of affairs as before the migration. Below are some options to tackle this situation.

1) **The "should we even support migrating back?" argument**: We run this in our cluster for a week and see if we hit any issues. If not, we simply push it to production. This is an experimental feature anyway, and users know the risk. We can also ask them to back up the DB and restore it if required.
    - Pro:
        - Simple: no new code or DB sub-command
        - Actually saves space (which is what the PR is supposed to do :-))
    - Con:
        - The backup takes space.
        - A backup means node downtime (I am not sure about this).
        - If the backup is old, syncing back up takes time and the node is down during that too.
2) **Snapshot method**: Take a DB snapshot/backup before turning on the feature flag. If case (1) or (2) happens, the user restores the DB from the backup.
    - Pro:
        - Simple
        - No code changes required
    - Con:
        - The node has to start syncing from the restored point
        - Extra space for the backup
        - If the backup is lost, then Hail Mary.
3) **Redundant storage method**: During migration, create the new entries in the `validator hashes` and `validator entry` buckets, but **do not modify** the old `state` bucket. Once migrated this way, every persistence of a state object saves the validator entries in both the `state` bucket and the new validator buckets. If the feature flag is enabled, the new buckets are used; otherwise the old bucket with the old code path is used. Later, when the feature is deemed to be working in production, a DB sub-command can clean up the old validator entries inside the state objects.
    - Pro:
        - Reverting is easy: a DB sub-command simply `delete`s the new buckets.
    - Con:
        - Space utilization is higher (~5 GB for today's mainnet), as the validators are stored twice.
        - UX issue: users have to not only disable the flag but also run the DB sub-command to delete the new buckets.
4) **Migrate back**: This only solves problem (2). If the user wants to go back, they run a DB sub-command that pushes all the validator entries back into the `state` bucket and deletes the new buckets.
    - Pro:
        - Saves space (as the state is stored only once).
    - Con:
        - Re-constructing the DB takes time, and the node has to take downtime.
        - Doesn't solve problem (1).
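For completeness, here is the read side mentioned above: recombining a state that was persisted with an empty registry with the separately stored validator entries. This is an illustration under the same hypothetical bucket names as the save sketch earlier, not the code in PR #9155.

```go
package statestorage

// Read-side sketch: rebuild a state's validator list from the buckets written
// by the save sketch above. Bucket names and helpers remain hypothetical.

import (
	"errors"
	"fmt"

	bolt "go.etcd.io/bbolt"
)

// getValidators returns the serialized validator entries for the state stored
// under blockRoot, in the same order they were recorded at save time.
func getValidators(db *bolt.DB, blockRoot [32]byte) ([][]byte, error) {
	var out [][]byte
	err := db.View(func(tx *bolt.Tx) error {
		index := tx.Bucket([]byte("validator-index"))
		entries := tx.Bucket([]byte("validator-entries"))
		if index == nil || entries == nil {
			return errors.New("validator buckets missing: migration has not run")
		}
		hashes := index.Get(blockRoot[:])
		if hashes == nil {
			return fmt.Errorf("no validator index for block root %#x", blockRoot)
		}
		// The index value is a concatenation of 32-byte hashes, one per validator.
		for i := 0; i+32 <= len(hashes); i += 32 {
			enc := entries.Get(hashes[i : i+32])
			if enc == nil {
				return fmt.Errorf("missing validator entry for hash %#x", hashes[i:i+32])
			}
			// Copy the value: bbolt slices are only valid inside the transaction.
			cp := make([]byte, len(enc))
			copy(cp, enc)
			out = append(out, cp)
		}
		return nil
	})
	return out, err
}
```

In practice a validator entry cache would sit in front of the entries bucket (see comment 1 below), so repeated state reconstructions do not pay the full cost of these point lookups.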
----------------------------

Comments:

1) Nishant: If you have 160,000 validators, each time you want to retrieve the validator registry it will basically have to iterate through > 200,000 validator entries to find the correct ones to construct it.
    - Add a validator entry cache. Do some experimentation and find out the resource consumption.
2) Nishant: Also, if we are hashing the validator entries it becomes 32 bytes * 160,000 ≈ 5 MB per state to store (uncompressed), which still seems like a fair bit. Would it be possible to bring that down?
    - Yes, that is a bit too much. We could use a lower-resolution hash (64-bit), but for now I would keep it at 256 bits and see the impact.
3) Nishant: What do you think about doing this for our other big arrays (randao mixes, state roots, block roots, etc.)?
    - Do some tests and see how much savings a suffix-trie-like structure can provide.
4) Radek: Should we delete the validator cache entry when a state is deleted?
    - https://discord.com/channels/476244492043812875/550365651118850081/866975036489334804
    - In the end it was decided to delete the cache entries and check the performance impact.

Check https://github.com/prysmaticlabs/prysm/pull/9155#issuecomment-881672415 for more detailed performance reports.
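Following up on comments 1 and 4, below is one possible shape for the validator entry cache, keyed by entry hash. The LRU library, cache size, and method names are assumptions for illustration, not the cache that was actually added in PR #9155.

```go
package statestorage

// Possible shape of the validator entry cache discussed in comments 1 and 4.
// The LRU library, the cache size, and the method names are assumptions.

import lru "github.com/hashicorp/golang-lru"

type validatorEntryCache struct {
	cache *lru.Cache
}

// newValidatorEntryCache creates an LRU cache holding up to `size` serialized
// validator entries, keyed by their 32-byte hash.
func newValidatorEntryCache(size int) (*validatorEntryCache, error) {
	c, err := lru.New(size)
	if err != nil {
		return nil, err
	}
	return &validatorEntryCache{cache: c}, nil
}

// get returns the cached serialized entry for a hash, if present.
func (v *validatorEntryCache) get(hash [32]byte) ([]byte, bool) {
	if val, ok := v.cache.Get(hash); ok {
		return val.([]byte), true
	}
	return nil, false
}

// put inserts an entry after a DB read or write.
func (v *validatorEntryCache) put(hash [32]byte, entry []byte) {
	v.cache.Add(hash, entry)
}

// evict removes an entry, e.g. when the owning states are pruned, which is
// the behaviour settled on in comment 4.
func (v *validatorEntryCache) evict(hash [32]byte) {
	v.cache.Remove(hash)
}
```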