# Active Balance Cache

## Why do we need it?

`get_total_active_balance` was used infrequently in phase0. It is used once per epoch during:

- Process justification and finalization
- Process rewards
- Calculate base rewards

That was easy to cache: we can calculate the balance at the start of `process_epoch` and reuse it throughout all three phases above, so we only need to calculate it once.

`get_total_active_balance` is used more frequently in Altair. It is used on a per-block basis during:

- Process every sync aggregate for participant and proposer rewards
- Process every attestation for participant and proposer rewards

Let's look at how expensive `get_total_active_balance` is.

- It retrieves the active validator indices. This is cached, so it is cheap: O(1)
- It loops through the active validator indices to sum up the effective balances. This is expensive: O(n)

```python
def get_total_active_balance(state: BeaconState) -> Gwei:
    """
    Return the combined effective balance of the active validators.
    Note: ``get_total_balance`` returns ``EFFECTIVE_BALANCE_INCREMENT`` Gwei minimum to avoid divisions by zero.
    """
    return get_total_balance(state, set(get_active_validator_indices(state, get_current_epoch(state))))
```

```python
def get_total_balance(state: BeaconState, indices: Set[ValidatorIndex]) -> Gwei:
    """
    Return the combined effective balance of the ``indices``.
    ``EFFECTIVE_BALANCE_INCREMENT`` Gwei minimum to avoid divisions by zero.
    Math safe up to ~10B ETH, after which this overflows uint64.
    """
    return Gwei(max(EFFECTIVE_BALANCE_INCREMENT, sum([state.validators[index].effective_balance for index in indices])))
```

This bottleneck is most visible during initial syncing, where the speed is ~50% slower, and it shows up clearly in pprof:

![](https://i.imgur.com/CKOxMyN.png)

## Previous method

The previous [implementation](https://github.com/prysmaticlabs/prysm/pull/9187) has flaws:

- It leverages the committee cache by adding an additional field
- The committee cache accounts for two epochs in most cases, which introduces an asymmetry, since the active balance cache is based on one epoch
- It triggered a consensus failure during a mass slashing event, as documented [here](https://docs.google.com/document/d/10ukjj1AuHabmKwxGLNGPDJCLULPC2iZAUvfb5yC6Lnw/edit#heading=h.d5qpimbr9o7y)

## New method

One short-term solution to alleviate the bottleneck is to move the total balance calculation outside of `process_attestations`. Instead of calculating the total balance for every attestation, we only need to do it once. The PR is [here](https://github.com/prysmaticlabs/prysm/pull/9417), and the pprof below already looks better than before:

![](https://i.imgur.com/AQqd7rf.png)

The short-term solution does not address all issues, especially the case where the block proposer loses time before broadcasting because it has to loop through every validator. The goal of the block proposer is to process all attestations before broadcasting the block. We will need caching. Let's brainstorm how to do this:

- **LRU** cache
- **Max** 4 items. Can be more because it's cheap, but I don't think it's necessary
- **Key** is the epoch + block_root at epoch_start_slot - 1
- **Value** is the total balance in uint64
- The cache gets filled as process epoch gets called
- The cache does not get refilled on a cache miss

Rationale:

- **LRU** is better suited for forks around the epoch boundary
- **Key** uses *block_root at epoch_start_slot - 1* to account for forks
- The cache does not get refilled, to avoid adding more complicated code that pollutes core logic
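
The cache properties listed above can be sketched roughly as follows. This is only an illustration of the idea, not the Prysm implementation: the names `ActiveBalanceCache`, `put`, `get`, and `total_active_balance` are hypothetical, balances are plain integers standing in for `Gwei`, and the block root is assumed to be available as `bytes`.

```python
from collections import OrderedDict
from typing import Callable, Optional, Tuple

# Hypothetical names throughout; an illustration, not Prysm's actual code.
CacheKey = Tuple[int, bytes]  # (epoch, block root at epoch_start_slot - 1)


class ActiveBalanceCache:
    """LRU cache mapping (epoch, block_root) to total active balance in Gwei."""

    def __init__(self, max_items: int = 4) -> None:
        self._max_items = max_items
        self._items: "OrderedDict[CacheKey, int]" = OrderedDict()

    def put(self, epoch: int, block_root: bytes, total_balance: int) -> None:
        """Fill the cache; meant to be called only from epoch processing."""
        key = (epoch, block_root)
        self._items[key] = total_balance
        self._items.move_to_end(key)  # mark as most recently used
        while len(self._items) > self._max_items:
            self._items.popitem(last=False)  # evict the least recently used

    def get(self, epoch: int, block_root: bytes) -> Optional[int]:
        """Return the cached balance, or None on a miss (no refill here)."""
        key = (epoch, block_root)
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # a hit refreshes recency
        return self._items[key]

    def __len__(self) -> int:
        return len(self._items)


def total_active_balance(
    cache: ActiveBalanceCache,
    epoch: int,
    block_root: bytes,
    compute_slow: Callable[[], int],
) -> int:
    """Use the cache when possible, otherwise fall back to the O(n) sum."""
    cached = cache.get(epoch, block_root)
    if cached is not None:
        return cached
    # Cache miss: compute the slow way and deliberately do not store it,
    # so that only epoch processing ever writes to the cache.
    return compute_slow()
```

Keying on the block root as well as the epoch means two forks that diverge before the epoch boundary get separate entries, and the small capacity is enough because only a handful of competing branches are expected around a boundary.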