# Analysis of Caches Used in Prysm

## Objective

To document and understand cache usage in the Prysm codebase and to come up with strategies for more rigorous testing of cache vs. no-cache behavior.

## Background

[Prysm](https://github.com/prysmaticlabs/prysm) implements the [Ethereum specification](https://github.com/ethereum/eth2.0-specs) for proof-of-stake consensus. The current proof-of-stake chain, live today, secures over $15bn USD worth of value, and Prysm runs a non-trivial share of the nodes in the network. The safety of our code is paramount, but we must also strike a good balance with performance and user experience when running our node software. To run on consumer hardware, our code must be efficient. Naively implemented code can be unreasonably slow without caching strategies, for example recomputing an expensive operation many times when the result could be retrieved quickly from memory.

Caching is used extensively within Prysm and most likely within all other eth2 clients and node software. Caching, however, is also the source of many critical bugs. Bugs in caches can arise in different ways, including how the caches are filled, how they are accessed, and how they are invalidated. When using a cache in a critical capacity, it is also **imperative** to audit its performance and its safety.

This document gathers the major use-cases of caches within the Prysm codebase, lists all caches in production today, and explains how they are used. Finally, we explore the potential for more rigorous continuous integration of caches in our project, beyond simple unit tests, to prevent failures.

## Cache Use-Cases

Across Prysm, caches are used for the following purposes:

- To reduce the amount of expensive computation that needs to be done in state transitions `(block, state) -> state'`, improving the speed of applying a block through the state machine
- To reduce the amount of disk I/O via intermediate cache layers for important operations
- For fast access to large data structures in different services across the application, such as the `BeaconState`, the list of validator deposits, and the validator deposit trie
- To allow for "in-progress" requests where, out of `N` workers submitting a request `R`, only a single worker performs an expensive operation and all other workers share the result

## Cache Use-Cases Explained

### For speeding up state transitions and reusing data

`core.ExecuteStateTransition` in Prysm is the meat of the Ethereum proof-of-stake protocol. It is called on every block processed and can also be called many times throughout the codebase: by validators attesting and trying to advance state slots, in initial sync, and in many other places. When performing state transitions, especially when doing the same transition multiple times, several pieces of data are recomputed and used outside of the state transition, such as validator committees. Committees are used across the Prysm codebase to compute block proposers and in other critical computation. Instead of having to recompute them from scratch, we can leverage caches for this across Prysm.

### To reduce disk I/O

Sometimes we might be hitting the database with the same data requests over and over, triggering more I/O. Instead, we could be reading such data from a map in Prysm that is layered thinly above a database read.

### For fast access to data structures, such as `BeaconState`

The `BeaconState` is the largest data structure in Prysm, and arguably the one used the most. It is used as a source of truth for a lot of computation, and almost all services in Prysm need access to it _immediately_. Fetching it from disk every time is unreasonable, and having it always available in-memory is extremely helpful.
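Before listing the individual caches, here is a minimal sketch of the read-through pattern behind the "reduce disk I/O" use-case above: a thin in-memory map layered above a database read. The `readThroughStore` type, the `Block` value, and the injected `readBlockFromDB` accessor are hypothetical stand-ins for illustration only, not Prysm's actual implementation.

```go
package cachesketch

import "sync"

// Block is a stand-in for whatever value we would otherwise read from disk.
type Block struct {
    Slot uint64
}

// readThroughStore layers a small in-memory map above a database read.
// Everything here is illustrative; Prysm's real caches live in beacon-chain/cache.
type readThroughStore struct {
    mu    sync.RWMutex
    cache map[[32]byte]*Block
    // readBlockFromDB is a hypothetical disk accessor injected for the example.
    readBlockFromDB func(root [32]byte) (*Block, error)
}

func newReadThroughStore(read func(root [32]byte) (*Block, error)) *readThroughStore {
    return &readThroughStore{
        cache:           make(map[[32]byte]*Block),
        readBlockFromDB: read,
    }
}

// BlockByRoot first checks the in-memory map and only falls back to disk on a
// miss, filling the map so that subsequent reads avoid the I/O entirely.
func (s *readThroughStore) BlockByRoot(root [32]byte) (*Block, error) {
    s.mu.RLock()
    if blk, ok := s.cache[root]; ok {
        s.mu.RUnlock()
        return blk, nil
    }
    s.mu.RUnlock()

    blk, err := s.readBlockFromDB(root)
    if err != nil {
        return nil, err
    }
    s.mu.Lock()
    s.cache[root] = blk
    s.mu.Unlock()
    return blk, nil
}
```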
## Listing All Caches

Most caches in Prysm can be found under [`beacon-chain/cache`](https://github.com/prysmaticlabs/prysm/tree/develop/beacon-chain/cache). Each has a relatively simple interface, exposes metrics for cache hits/misses, and is designed to be thread-safe unless otherwise needed. Under the hood, caches in Prysm use a FIFO cache from [k8s.io/client-go/tools/cache](https://k8s.io/client-go/tools/cache), an LRU cache from [github.com/hashicorp/golang-lru](https://github.com/hashicorp/golang-lru), or a regular thread-safe Go cache from [github.com/patrickmn/go-cache](https://github.com/patrickmn/go-cache).

All code mentioned below is from our `develop` branch at commit [1f102c256d5f094a540be0ed1c43c7034f9321ef](https://github.com/prysmaticlabs/prysm/commit/1f102c256d5f094a540be0ed1c43c7034f9321ef), August 11, 2021.

The following top-level caches in Prysm are noteworthy and used extensively:

[beacon-chain/cache/attestation_data.go](https://github.com/prysmaticlabs/prysm/blob/1f102c256d5f094a540be0ed1c43c7034f9321ef/beacon-chain/cache/attestation_data.go) used for caching in-progress requests for attestation data by validators when producing attestations

- Uses FIFO from [k8s.io/client-go/tools/cache](https://k8s.io/client-go/tools/cache)
- Cache keys: Slot of the attestation
- Used in production: no :x:
- Notes: combines an "in-progress" cache with a FIFO cache, meaning `Get()` blocks if there is already a request in progress for computing the data. Needs separation of concerns

[beacon-chain/cache/checkpoint_state.go](https://github.com/prysmaticlabs/prysm/blob/1f102c256d5f094a540be0ed1c43c7034f9321ef/beacon-chain/cache/checkpoint_state.go) for saving `BeaconState` instances by `Checkpoint` data structures as keys

- Uses LRU from [github.com/hashicorp/golang-lru](https://github.com/hashicorp/golang-lru)
- Cache keys: Hash of an SSZ-marshaled `*ethpb.Checkpoint`; the strongest kind of key :heavy_check_mark:
- Used in production: yes :heavy_check_mark:
- How it is used: Used in the `getAttPreState()` function, which retrieves the state for an attestation; used in the verification and attestation processing pipelines
- What happens on cache miss: Regenerates a base state with stategen from the target root, then advances slots if necessary. Finally, fills the cache with the end state and returns it
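As an illustration of the keying scheme and miss behavior described above, here is a minimal sketch of an LRU cache keyed by a hash of a marshaled checkpoint. The `Checkpoint` struct, the `key` encoding, and the `regenerateState` fallback are hypothetical stand-ins for illustration, not Prysm's actual types, SSZ marshaling, or stategen logic.

```go
package cachesketch

import (
    "crypto/sha256"
    "encoding/binary"

    lru "github.com/hashicorp/golang-lru"
)

// Checkpoint is a stand-in for *ethpb.Checkpoint (epoch + block root).
type Checkpoint struct {
    Epoch uint64
    Root  [32]byte
}

// key hashes the checkpoint's full contents so that two different
// checkpoints can never collide on the same cache entry.
func key(c Checkpoint) [32]byte {
    buf := make([]byte, 8+32)
    binary.LittleEndian.PutUint64(buf[:8], c.Epoch)
    copy(buf[8:], c.Root[:])
    return sha256.Sum256(buf)
}

// CheckpointStateCache caches per-checkpoint states behind an LRU.
type CheckpointStateCache struct {
    cache *lru.Cache
}

func NewCheckpointStateCache(size int) (*CheckpointStateCache, error) {
    c, err := lru.New(size)
    if err != nil {
        return nil, err
    }
    return &CheckpointStateCache{cache: c}, nil
}

// StateByCheckpoint returns the cached state for a checkpoint, or regenerates
// and fills the cache on a miss, mirroring the miss behavior described above.
func (c *CheckpointStateCache) StateByCheckpoint(cp Checkpoint, regenerateState func(Checkpoint) (interface{}, error)) (interface{}, error) {
    k := key(cp)
    if st, ok := c.cache.Get(k); ok {
        return st, nil
    }
    st, err := regenerateState(cp)
    if err != nil {
        return nil, err
    }
    c.cache.Add(k, st)
    return st, nil
}
```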
[beacon-chain/cache/committee.go](https://github.com/prysmaticlabs/prysm/blob/1f102c256d5f094a540be0ed1c43c7034f9321ef/beacon-chain/cache/committee.go) for fetching validator committees using a 32-byte RANDAO seed, to prevent recomputation across the codebase

- Uses LRU from [github.com/hashicorp/golang-lru](https://github.com/hashicorp/golang-lru)
- Cache keys: 32-byte RANDAO seed, likely strong enough as this prevents collisions with a different validator set
- Used in production: yes :heavy_check_mark:
- How it is used: Used across the `beacon-chain/core` package for computing data from committees
- What happens on cache miss: Returns nil; it is up to the caller to then compute committees or perform different business logic

[beacon-chain/cache/proposer_indices.go](https://github.com/prysmaticlabs/prysm/blob/1f102c256d5f094a540be0ed1c43c7034f9321ef/beacon-chain/cache/proposer_indices.go) creates a new proposer indices cache for storing/accessing proposer index assignments of an epoch

- Uses FIFO from [k8s.io/client-go/tools/cache](https://k8s.io/client-go/tools/cache)
- Cache keys: 32-byte block root
- Used in production: yes :heavy_check_mark:
- How it is used: Used across the `beacon-chain/core` package for fast retrieval of beacon proposer indices
- What happens on cache miss: Returns nil if the entry does not exist, and the caller has to check for this. The caller then computes the proposer index from scratch

[beacon-chain/cache/skip_slot_cache.go](https://github.com/prysmaticlabs/prysm/blob/1f102c256d5f094a540be0ed1c43c7034f9321ef/beacon-chain/cache/skip_slot_cache.go) used for caching in-progress requests for skipped slots when doing state transitions

- Uses LRU from [github.com/hashicorp/golang-lru](https://github.com/hashicorp/golang-lru) along with an in-progress cache using mutexes. Blocks if there is already a request in progress for the action
- Cache keys: A concatenation of the root from the latest block header in the beacon state **and** the slot itself. Considered **high-risk** as it can cause serious failures, but is now under control
- Used in production: yes :heavy_check_mark:
- How it is used: Used extensively in state transitions when processing slots. Used in a critical capacity in Prysm
- What happens on cache miss: Computes the skipped slots normally by advancing the beacon state
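Both the attestation data cache and the skip slot cache above layer an "in-progress" mechanism on top of the underlying cache so that only one caller performs the expensive work while other callers block and share the result. A minimal sketch of that pattern is below; the `compute` callback is hypothetical, and the results live in a plain map here for brevity, whereas in Prysm they live in an LRU or FIFO cache.

```go
package cachesketch

import "sync"

// inProgressCache deduplicates concurrent requests for the same key: the first
// caller computes the value while later callers block on the per-key wait
// channel and then read the shared result.
type inProgressCache struct {
    mu         sync.Mutex
    inProgress map[[32]byte]chan struct{}
    results    map[[32]byte]interface{}
}

func newInProgressCache() *inProgressCache {
    return &inProgressCache{
        inProgress: make(map[[32]byte]chan struct{}),
        results:    make(map[[32]byte]interface{}),
    }
}

// Get returns the cached value for key, waiting for an in-flight computation
// if one exists, and otherwise computing and storing the value itself.
func (c *inProgressCache) Get(key [32]byte, compute func() (interface{}, error)) (interface{}, error) {
    c.mu.Lock()
    if val, ok := c.results[key]; ok {
        c.mu.Unlock()
        return val, nil
    }
    if done, ok := c.inProgress[key]; ok {
        // Another caller is already computing this key: wait, then retry.
        c.mu.Unlock()
        <-done
        return c.Get(key, compute)
    }
    // We are the first caller: mark the request as in progress.
    done := make(chan struct{})
    c.inProgress[key] = done
    c.mu.Unlock()

    val, err := compute()

    c.mu.Lock()
    if err == nil {
        c.results[key] = val
    }
    delete(c.inProgress, key)
    c.mu.Unlock()
    close(done)

    if err != nil {
        return nil, err
    }
    return val, nil
}
```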
[beacon-chain/cache/subnet_ids.go](https://github.com/prysmaticlabs/prysm/blob/1f102c256d5f094a540be0ed1c43c7034f9321ef/beacon-chain/cache/subnet_ids.go) for caching attester and aggregator subnet IDs

- Uses [github.com/patrickmn/go-cache](https://github.com/patrickmn/go-cache) for persistent subnets, and LRU for attester and aggregator subnets
- Cache keys: Slot for the attester/aggregator subnets, and public key for the persistent subnet IDs
- Used in production: yes :heavy_check_mark:
- How it is used: Used in validator RPC functions to retrieve validator subnets and their expiration dates when fetching assignments
- What happens on cache miss: Returns nil and requires the caller to handle this safely

[beacon-chain/cache/sync_committee.go](https://github.com/prysmaticlabs/prysm/blob/1f102c256d5f094a540be0ed1c43c7034f9321ef/beacon-chain/cache/sync_committee.go) for caching actual sync committee data used in the Altair hard fork

- Uses FIFO from [k8s.io/client-go/tools/cache](https://k8s.io/client-go/tools/cache)
- Cache keys: Sync committee roots, which encompass enough information to be safe
- Used in production: no, but will be in Altair :x:
- How it is used: Used in a critical capacity when retrieving sync committee information in validator RPC methods

[beacon-chain/cache/sync_subnet_ids.go](https://github.com/prysmaticlabs/prysm/blob/1f102c256d5f094a540be0ed1c43c7034f9321ef/beacon-chain/cache/sync_subnet_ids.go) for caching sync committee subnets for p2p

- Uses [github.com/patrickmn/go-cache](https://github.com/patrickmn/go-cache)
- Cache keys: Concatenation of public key + epoch
- Used in production: no, but will be in Altair :x:
- How it is used: Used in p2p for refreshing node ENRs with information regarding sync committee subnet IDs
- What happens on cache miss: Returns an empty list of subnets. The return arguments of `GetSyncCommitteeSubnets` are **unused** except for the expiration time
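To make the public-key-plus-epoch key scheme and expiration handling described above concrete, here is a minimal sketch using [github.com/patrickmn/go-cache](https://github.com/patrickmn/go-cache). The `syncSubnetIDs` type, the key format, and the `subnetExpiry` duration are illustrative assumptions, not Prysm's exact implementation or values.

```go
package cachesketch

import (
    "fmt"
    "time"

    gocache "github.com/patrickmn/go-cache"
)

// subnetExpiry is an illustrative TTL; Prysm derives expirations from epochs.
const subnetExpiry = 5 * time.Minute

// syncSubnetIDs caches the sync committee subnets a validator should join,
// keyed by a concatenation of public key and epoch as described above.
type syncSubnetIDs struct {
    cache *gocache.Cache
}

func newSyncSubnetIDs() *syncSubnetIDs {
    // Entries expire after subnetExpiry; expired entries are purged every minute.
    return &syncSubnetIDs{cache: gocache.New(subnetExpiry, time.Minute)}
}

func subnetKey(pubkey [48]byte, epoch uint64) string {
    return fmt.Sprintf("%x-%d", pubkey, epoch)
}

// AddSubnets records the subnets for a (pubkey, epoch) pair.
func (s *syncSubnetIDs) AddSubnets(pubkey [48]byte, epoch uint64, subnets []uint64) {
    s.cache.Set(subnetKey(pubkey, epoch), subnets, gocache.DefaultExpiration)
}

// GetSubnets returns the cached subnets, or an empty slice on a miss,
// mirroring the "returns an empty list of subnets" behavior noted above.
func (s *syncSubnetIDs) GetSubnets(pubkey [48]byte, epoch uint64) []uint64 {
    if v, ok := s.cache.Get(subnetKey(pubkey, epoch)); ok {
        return v.([]uint64)
    }
    return []uint64{}
}
```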
[beacon-chain/cache/depositcache/pending_deposits.go](https://github.com/prysmaticlabs/prysm/blob/1f102c256d5f094a540be0ed1c43c7034f9321ef/beacon-chain/cache/depositcache/pending_deposits.go) for pending deposits into the eth2 validator deposit contract

- Does not use third-party packages
- Cache keys: No keys, just a getter
- Used in production: yes :heavy_check_mark:
- How it is used: Used in block processing when it comes to fetching deposits to put in blocks
- What happens on cache miss: There are no cache misses, just a list of pending deposits kept in memory

[beacon-chain/cache/depositcache/deposits_cache.go](https://github.com/prysmaticlabs/prysm/blob/1f102c256d5f094a540be0ed1c43c7034f9321ef/beacon-chain/cache/depositcache/deposits_cache.go) for all deposits into the eth2 validator deposit contract

- Does not use third-party packages
- Cache keys: No keys, just a getter
- Used in production: yes :heavy_check_mark:
- How it is used: Used for constructing the deposit trie and in the powchain service for processing new deposits
- What happens on cache miss: There are no cache misses, just a list of deposits kept in memory

Other minor caches in Prysm are found in:

[beacon-chain/db/kv/kv.go]() `blockCache`, `validatorEntryCache`, `stateSummaryCache`, used to reduce database reads

- Uses [github.com/dgraph-io/ristretto](https://github.com/dgraph-io/ristretto)
- Cache keys: Roots
- Used in production: yes :heavy_check_mark:
- How it is used:
- What happens on cache miss:

[beacon-chain/blockchain/head.go]() `cacheJustifiedStateBalances`, a complex function which stores the justified state balances

- Cache keys:
- Used in production: yes :heavy_check_mark:
- How it is used:
- What happens on cache miss:

[beacon-chain/core/state/trailing_slot_state_cache.go]() used before processing slots to return the last processed state, advanced by one slot, to save on computation

- Cache keys:
- Used in production: yes :heavy_check_mark:
- How it is used:
- What happens on cache miss:

[beacon-chain/operations/attestations/service.go]() fork choice processed attestation roots cache

- Uses LRU
- Cache keys:
- Used in production: yes :heavy_check_mark:
- How it is used:
- What happens on cache miss:

[beacon-chain/operations/attestations/kv/kv.go]() attestation pool caches protected behind an RW mutex, used for unaggregated, block, aggregated, and fork choice attestations (see the `AttCaches` struct below)

- Cache keys:
- Used in production: yes :heavy_check_mark:
- How it is used:
- What happens on cache miss:

[beacon-chain/operations/synccommittee/kv.go]() sync committee message and contribution caches

- Cache keys:
- Used in production: yes :heavy_check_mark:
- How it is used:
- What happens on cache miss:

```go=
// AttCaches defines the caches used to satisfy attestation pool interface.
// These caches are KV store for various attestations
// such are unaggregated, aggregated or attestations within a block.
type AttCaches struct {
    aggregatedAttLock  sync.RWMutex
    aggregatedAtt      map[[32]byte][]*ethpb.Attestation
    unAggregateAttLock sync.RWMutex
    unAggregatedAtt    map[[32]byte]*ethpb.Attestation
    forkchoiceAttLock  sync.RWMutex
    forkchoiceAtt      map[[32]byte]*ethpb.Attestation
    blockAttLock       sync.RWMutex
    blockAtt           map[[32]byte][]*ethpb.Attestation
    seenAtt            *cache.Cache
}
```

[beacon-chain/powchain/block_cache.go]() cache for block header heights and hashes from eth1 nodes

- Uses FIFO
- Cache keys:
- Used in production: yes :heavy_check_mark:
- How it is used:
- What happens on cache miss:

[beacon-chain/state/stategen/hot_state_cache.go]() cache used for storing hot/cold states as part of optimizations for beacon state management in Prysm

- Uses LRU
- Cache keys:
- Used in production: yes :heavy_check_mark:
- How it is used:
- What happens on cache miss:

[beacon-chain/state/v1/field_roots.go]() root and leaf caches used for beacon state field hashing

- Uses ristretto
- Cache keys:
- Used in production: yes :heavy_check_mark:
- How it is used:
- What happens on cache miss:

[beacon-chain/sync/service.go]() many LRU caches used in the sync service, each with its own lock, tracking objects already seen during sync (see the `Service` struct below)

- Cache keys:
- Used in production: yes :heavy_check_mark:
- How it is used:
- What happens on cache miss:

```go=
type Service struct {
    cfg                       *Config
    ctx                       context.Context
    cancel                    context.CancelFunc
    slotToPendingBlocks       *gcache.Cache
    seenPendingBlocks         map[[32]byte]bool
    blkRootToPendingAtts      map[[32]byte][]*ethpb.SignedAggregateAttestationAndProof
    pendingAttsLock           sync.RWMutex
    pendingQueueLock          sync.RWMutex
    chainStarted              *abool.AtomicBool
    validateBlockLock         sync.RWMutex
    rateLimiter               *limiter
    seenBlockLock             sync.RWMutex
    seenBlockCache            *lru.Cache
    seenAttestationLock       sync.RWMutex
    seenAttestationCache      *lru.Cache
    seenExitLock              sync.RWMutex
    seenExitCache             *lru.Cache
    seenProposerSlashingLock  sync.RWMutex
    seenProposerSlashingCache *lru.Cache
    seenAttesterSlashingLock  sync.RWMutex
    seenAttesterSlashingCache map[uint64]bool
    badBlockCache             *lru.Cache
    badBlockLock              sync.RWMutex
}
```

[beacon-chain/blockchain/service.go]() lots of critical data stored as struct fields, protected by mutexes

- Cache keys:
- Used in production: yes :heavy_check_mark:
- How it is used:
- What happens on cache miss:

```go=
type Service struct {
    genesisTime           time.Time
    head                  *head
    headLock              sync.RWMutex
    genesisRoot           [32]byte
    justifiedCheckpt      *ethpb.Checkpoint
    prevJustifiedCheckpt  *ethpb.Checkpoint
    bestJustifiedCheckpt  *ethpb.Checkpoint
    finalizedCheckpt      *ethpb.Checkpoint
    prevFinalizedCheckpt  *ethpb.Checkpoint
    nextEpochBoundarySlot types.Slot
    boundaryRoots         [][32]byte
    checkpointStateCache  *cache.CheckpointStateCache
    initSyncBlocks        map[[32]byte]block.SignedBeaconBlock
    initSyncBlocksLock    sync.RWMutex
    justifiedBalances     []uint64
    justifiedBalancesLock sync.RWMutex
}
```

[shared/bls/blst/public_key.go]() public key cache used for BLS cryptography

- Uses Ristretto
- Cache keys:
- Used in production: yes :heavy_check_mark:
- How it is used:
- What happens on cache miss:

[validator/client/service.go]() domain data cache used when signing data in the validator client
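Many of the sync service caches listed above follow the same "seen object" pattern: an LRU keyed by a digest of the object, guarded by its own lock, used to cheaply ignore duplicates from gossip. A minimal sketch of that pattern follows; the `seenTracker` type, the `seenBlockCacheSize` constant, and the method names are illustrative, not Prysm's actual code.

```go
package cachesketch

import (
    "sync"

    lru "github.com/hashicorp/golang-lru"
)

// seenBlockCacheSize is an illustrative bound on how many roots we remember.
const seenBlockCacheSize = 1000

// seenTracker remembers which block roots have already been processed from
// gossip so that duplicates can be ignored cheaply.
type seenTracker struct {
    lock sync.RWMutex
    seen *lru.Cache
}

func newSeenTracker() (*seenTracker, error) {
    c, err := lru.New(seenBlockCacheSize)
    if err != nil {
        return nil, err
    }
    return &seenTracker{seen: c}, nil
}

// hasSeen reports whether a root was observed before.
func (t *seenTracker) hasSeen(root [32]byte) bool {
    t.lock.RLock()
    defer t.lock.RUnlock()
    return t.seen.Contains(root)
}

// markSeen records a root as observed.
func (t *seenTracker) markSeen(root [32]byte) {
    t.lock.Lock()
    defer t.lock.Unlock()
    t.seen.Add(root, true)
}
```

Note that `golang-lru` is itself thread-safe, so the extra mutex mirrors the pattern in Prysm's sync service that the "Code health" section below calls into question.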
## Cache Audits

For each cache, questions to ask are:

- Does the cache key encompass enough information required for correct hits?
- What are the undesirable effects of collisions in this cache?
- What happens if there is a cache miss?
- What behavior are we replacing by using a cache?

## Improving Code Health

### Code health

- Minimize the use of third-party caches: just FIFO and LRU. LRU is more complex, but a simple read of the [dependency source code](https://github.com/kubernetes/client-go/blob/master/tools/cache/fifo.go) we use for FIFO reveals it is a simple struct using a mutex. Perhaps we don't need a dependency for FIFO
- Continue using the `cache` package for consolidation and visibility. There are many caches that have no reason to live outside of the `cache` package
- [hashicorp/golang-lru](https://github.com/hashicorp/golang-lru) is thread-safe, but our uses of it are always accompanied by our own mutex. Do we need our own mutex?
- Some ad-hoc caches we use throughout the codebase leak implementation details, such as locking and map access, into places they shouldn't. If they were part of the `cache` package, they could be exposed via an abstraction instead
- Functions can get ugly throughout the Prysm codebase when they combine unabstracted cache code, goroutine spawning, mutexes, and feature flags all at once

### Ease of testing

- Tight coupling of cache implementations in the codebase makes it harder to test code paths that fill caches vs. code without caches. There is no easy way to "disable" any cache we want via integration tests
- Enabling/disabling of caches via feature flag should have those conditionals inside the actual cache function abstractions:

```go
// in cache/committee.go
func (c *CommitteeCache) Get(publicKey [48]byte) (*Committee, error) {
    if !featureconfig.Get().EnableCommitteeCache {
        return nil, cache.ErrNotFound
    }
    ...
}

// elsewhere
committee, err := committeeCache.Get(publicKey)
if err == cache.ErrNotFound {
    // Get from db...
    ...
} else if err != nil {
    log.Error(err)
    return
}
```

- No consistent interface for caches implemented in Prysm
- Inconsistent creation and treatment of cache keys and initialization constants

## Testing

Testing cache behavior is difficult, but observing failures in production due to caches is even more painful, and we should leverage all tools available to us to reduce the potential for error. This section explores existing and new methods for testing our caches.

### Fuzzing Strategies

Our fuzzing strategy involves using Golang build tags to specify different behaviors for libfuzzer. In particular, we maintain a functional implementation of a cache in a file such as `cache/sync_committee.go`, and then maintain a "disabled", or "no-op", version inside of `cache/sync_committee_disabled.go` by specifying a libfuzzer build tag.

```go
// +build libfuzzer

package cache

import (
    types "github.com/prysmaticlabs/eth2-types"
    "github.com/prysmaticlabs/prysm/beacon-chain/state"
)

// FakeSyncCommitteeCache is a fake `SyncCommitteeCache` to satisfy fuzzing.
type FakeSyncCommitteeCache struct{}

// NewSyncCommittee initializes and returns a new SyncCommitteeCache.
func NewSyncCommittee() *FakeSyncCommitteeCache {
    return &FakeSyncCommitteeCache{}
}

// CurrentPeriodIndexPosition -- fake.
func (s *FakeSyncCommitteeCache) CurrentPeriodIndexPosition(root [32]byte, valIdx types.ValidatorIndex) ([]types.CommitteeIndex, error) {
    return nil, nil
}

// NextPeriodIndexPosition -- fake.
func (s *FakeSyncCommitteeCache) NextPeriodIndexPosition(root [32]byte, valIdx types.ValidatorIndex) ([]types.CommitteeIndex, error) {
    return nil, nil
}

// UpdatePositionsInCommittee -- fake.
func (s *FakeSyncCommitteeCache) UpdatePositionsInCommittee(syncCommitteeBoundaryRoot [32]byte, state state.BeaconStateAltair) error {
    return nil
}
```

### Cache vs. No-Cache Benchmarks and Unit Tests

Another way of testing this logic is to run unit tests with simple `got, want` comparisons once with the cache enabled and once with it disabled, and then compare the results. The results should be the same.
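A minimal sketch of what such a test could look like is shown below. The `computeCommittee` function and the boolean flag standing in for a cache feature flag are hypothetical, for illustration only, not Prysm's actual API.

```go
package cachesketch

import (
    "reflect"
    "testing"
)

// computeCommittee is a hypothetical stand-in for the expensive computation
// that a cache would normally short-circuit. In a real test, this would call
// the production code path with the cache feature flag enabled or disabled.
func computeCommittee(seed [32]byte, useCache bool) ([]uint64, error) {
    indices := []uint64{uint64(seed[0]), uint64(seed[1]), uint64(seed[2])}
    return indices, nil
}

// TestCommittee_CacheAndNoCacheAgree checks that enabling the cache never
// changes the observable result of the computation.
func TestCommittee_CacheAndNoCacheAgree(t *testing.T) {
    seeds := [][32]byte{{1}, {2}, {3}}
    for _, seed := range seeds {
        want, err := computeCommittee(seed, false /* cache disabled */)
        if err != nil {
            t.Fatal(err)
        }
        got, err := computeCommittee(seed, true /* cache enabled */)
        if err != nil {
            t.Fatal(err)
        }
        if !reflect.DeepEqual(got, want) {
            t.Errorf("cache vs. no-cache mismatch for seed %x: got %v, want %v", seed, got, want)
        }
    }
}
```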
### Integration Tests for Cache Fills

Performing integration tests for cache vs. no-cache behavior.

## Prior Bugs

## References

- [Booby trapping the Ethereum blockchain](https://samczsun.com/booby-trapping-the-ethereum-blockchain/), an example of a critical bug in geth caused by not encompassing enough information in cache keys
- [How can QA staff test caching logic that they can't see](https://softwareengineering.stackexchange.com/questions/276554/how-can-qa-staff-test-caching-logic-that-they-cant-see)
- [Best practices for unit testing methods that use cache heavily](https://softwareengineering.stackexchange.com/questions/185138/best-practices-for-unit-testing-methods-that-use-cache-heavily)