# Block Processing Bottlenecks
### Introduction
When mainnet launched two months ago, validator voting was near perfect, with a significant majority of validators in the network achieving 100% head/target/source voting.
However, over the past month or so the results have become more varied, with many users reporting that their head voting percentage has dropped from 100% to 95% and even 93%. This is a worrying development and warrants a deeper look so that it can be properly rectified.
### Steps that have already been taken
1) Improvements in block gossip validation. A series of PRs from December have been merged into the repo and shipped in our v1.1 release. These changes should have had a positive effect on gossip validation of blocks, reducing the time spent on each hop and preventing late votes on blocks.
https://github.com/prysmaticlabs/prysm/pull/8096
https://github.com/prysmaticlabs/prysm/pull/8107
https://github.com/prysmaticlabs/prysm/pull/8094
2) Adding the ability for a validator to attest early, improving its performance and inclusion slot. This issue was discovered by potuz, and the change was added behind a flag (a rough sketch of the idea follows this list).
https://github.com/prysmaticlabs/prysm/pull/8197
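A rough sketch of the idea, assuming a hypothetical helper and mainnet timing (12 second slots, attestations due a third of the way in): the validator signs as soon as the block for its slot arrives instead of always waiting the full 4 seconds.
```go
package attester

import (
	"context"
	"time"
)

// block is a hypothetical stand-in for the beacon block the validator is waiting on.
type block struct{ slot uint64 }

// attestEarlyIfPossible waits for either the block of the assigned slot to
// arrive or for the 4 second mark of the slot, whichever comes first, before
// producing the attestation. This mirrors the idea behind the flag-gated early
// attesting; the real implementation lives in the validator client.
func attestEarlyIfPossible(ctx context.Context, slotStart time.Time, blockArrived <-chan block, attest func()) {
	oneThird := 4 * time.Second // 12s mainnet slots, attestation due at slot_start + 1/3
	deadline := time.NewTimer(time.Until(slotStart.Add(oneThird)))
	defer deadline.Stop()

	select {
	case <-blockArrived:
		// Block seen early: attest right away and improve the inclusion slot.
		attest()
	case <-deadline.C:
		// No block yet: attest at the usual 4 second mark with the best known head.
		attest()
	case <-ctx.Done():
	}
}
```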
While these steps have led to a network-wide improvement in gossip validation times for Prysm nodes, head voting has continued to get worse as the validator set grows larger day by day. Despite higher peer counts and big refactors in subnet search, a validator's head voting performance continues to worsen over time. This document illustrates the trace profiling undertaken to uncover block processing bottlenecks.
### Instrumentation of Prysm
In order to identify the bottlenecks in block processing for a Prysm node, Jaeger has been used to trace object processing and lifetime spans. Jaeger can be used to measure any instrumented method, but for the purposes of this document we will focus on block processing. To look at bottlenecks, we have traced the beacon node with a 100% sample rate in order to capture all the outliers effectively.
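As a rough illustration of the setup (not Prysm's actual wiring; the endpoint and service name below are placeholders), an OpenCensus tracer can be pointed at a Jaeger collector with an always-on sampler:
```go
package main

import (
	"log"

	"contrib.go.opencensus.io/exporter/jaeger"
	"go.opencensus.io/trace"
)

func main() {
	// Export spans to a locally running Jaeger collector (placeholder endpoint).
	exporter, err := jaeger.NewExporter(jaeger.Options{
		CollectorEndpoint: "http://localhost:14268/api/traces",
		Process:           jaeger.Process{ServiceName: "beacon-chain"},
	})
	if err != nil {
		log.Fatal(err)
	}
	trace.RegisterExporter(exporter)

	// Sample every span so that rare outliers (e.g. a 4s deposit insertion) are never missed.
	trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})
}
```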
## Bottlenecks

These are the recorded spans over a period of 1.5 hours. As can be seen, block processing times have a large median and mean at 60 - 100 ms, with a few prominent outliers of up to 4 seconds. These spans give a glimpse of how bad votes can occur.
#### 1) Finalized Deposit Insertion
Digging into the outliers, one method stands out:

Inserting finalized deposits takes a huge amount of time, growing to as much as 4.15 seconds! Given that attesters have only 4 seconds until they have to decide which head to vote for, a spike like this can lead to a validator casting a bad head vote.
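For context, the 4 second budget comes from the attestation deadline at one third of a slot; a quick sanity check with mainnet parameters shows why a 4.15 second span alone is fatal for that slot's vote:
```go
package main

import (
	"fmt"
	"time"
)

func main() {
	secondsPerSlot := 12 * time.Second        // mainnet SECONDS_PER_SLOT
	attestationDeadline := secondsPerSlot / 3 // attesters must pick a head by 1/3 of the slot
	depositInsertionSpike := 4150 * time.Millisecond

	fmt.Println(attestationDeadline)                         // 4s
	fmt.Println(depositInsertionSpike > attestationDeadline) // true: the spike alone blows the budget
}
```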
#### 2) Creating Attestation Signature Sets

This is a smaller bottleneck, but still a significant chunk. When processing a block and creating its overall attestation signature set, this method can take up a significant portion of block processing time. For example, in this particular span the block took 600ms to create the signature set. This is a harder issue to solve, as there is no direct way to aggregate pubkeys 'faster'; the blst library is already highly optimized.
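To make the cost concrete, here is a minimal sketch of what building such a signature set involves, using hypothetical `PublicKey`/`Signature` types in place of the real blst bindings: every attestation requires aggregating the pubkeys of all its attesters before the batch can be verified.
```go
package blssketch

// Hypothetical BLS types standing in for the actual blst bindings; the real
// API differs, and this only illustrates where the cost comes from.
type PublicKey interface{ Aggregate(PublicKey) PublicKey }

type Signature []byte

// SignatureSet batches (aggregate pubkey, message, signature) triples from
// every attestation in a block so they can all be verified in one pass.
type SignatureSet struct {
	PublicKeys []PublicKey
	Messages   [][32]byte
	Signatures []Signature
}

// attestationSignatureSet aggregates the pubkeys of all attesting validators
// for each attestation. The aggregation is what grows with committee size;
// there is no shortcut beyond what the optimized BLS library already does.
func attestationSignatureSet(attesterPubkeys [][]PublicKey, roots [][32]byte, sigs []Signature) SignatureSet {
	set := SignatureSet{}
	for i, keys := range attesterPubkeys {
		agg := keys[0]
		for _, k := range keys[1:] {
			agg = agg.Aggregate(k) // O(attesters) point additions per attestation
		}
		set.PublicKeys = append(set.PublicKeys, agg)
		set.Messages = append(set.Messages, roots[i])
		set.Signatures = append(set.Signatures, sigs[i])
	}
	return set
}
```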
#### 3) State By Root Calls

In this span, a state-by-root call spiked to 500 ms. While we have a number of caches covering all our respective states from the finalized state onwards, there is always the possibility of a cache miss. In the above case a cache miss increased our block processing time by 500ms, which in the context of head voting is non-trivial.
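A minimal sketch of the general pattern, with hypothetical cache and regeneration helpers standing in for Prysm's state management: the fast path is a cache hit, and a miss forces a DB load plus block replay, which is where the 500ms goes.
```go
package statesketch

import "context"

// Hypothetical types standing in for Prysm's state and cache internals.
type BeaconState struct{}

type stateCache interface {
	Get(root [32]byte) (*BeaconState, bool)
}

type stateGen struct {
	hot stateCache
	// loadNearestSaved and replayBlocks stand in for the DB lookup and the
	// block replay that a cache miss forces us to do.
	loadNearestSaved func(ctx context.Context, root [32]byte) (*BeaconState, error)
	replayBlocks     func(ctx context.Context, base *BeaconState, target [32]byte) (*BeaconState, error)
}

// StateByRoot returns a cached state when possible; on a miss it falls back to
// regenerating the state, which is the slow path seen in the 500ms span.
func (g *stateGen) StateByRoot(ctx context.Context, root [32]byte) (*BeaconState, error) {
	if st, ok := g.hot.Get(root); ok {
		return st, nil // fast path: cache hit
	}
	base, err := g.loadNearestSaved(ctx, root)
	if err != nil {
		return nil, err
	}
	return g.replayBlocks(ctx, base, root) // slow path: replay blocks to reach the target root
}
```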
#### 4) Epoch Metrics Collection
While a screenshot of this case could not be captured, the issue has been observed across multiple runs. At the start of a new epoch we perform a metrics sweep of the previous epoch, and this metrics collection exercise can unfortunately also increase block processing time, as we create a new precompute instance and process all the attestations in the epoch, which is not cheap.
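One mitigation, anticipating the decoupling discussed under Potential Solutions, would be to run the sweep on a copy of the state in the background; the sketch below uses hypothetical stand-ins for the state and the metrics pass.
```go
package metricsketch

import (
	"context"
	"time"
)

// Hypothetical stand-ins for the beacon state and the precompute based metrics pass.
type BeaconState struct{}

func (b *BeaconState) Copy() *BeaconState { return &BeaconState{} }

// reportEpochMetrics stands in for the sweep that builds a precompute instance
// and walks every attestation of the previous epoch.
func reportEpochMetrics(ctx context.Context, st *BeaconState) error {
	return nil
}

// collectEpochMetricsAsync runs the epoch metrics sweep on a copy of the state
// in the background, so the precompute pass never sits inside block processing.
func collectEpochMetricsAsync(st *BeaconState) {
	stCopy := st.Copy()
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), 8*time.Second)
		defer cancel()
		_ = reportEpochMetrics(ctx, stCopy)
	}()
}
```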
Across repeated observations, the outliers above have consistently been identified as bottlenecks in our block processing at certain instances.
#### 5) State Migration
While this only happens every 2048 slots, this method still has the potential to lock up our routine for critical periods. State migration is generally an expensive process, as it involves regenerating a state and persisting that state to disk, which are CPU-heavy and disk-I/O-heavy activities respectively. While not yet observed (due to issues with Jaeger), it is likely this method can lead to a bad head vote.
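A minimal sketch of the shape of this work, with hypothetical stand-ins for the state regeneration and persistence steps: both run under a lock, so a concurrent block can be stalled for the whole duration.
```go
package migrationsketch

import (
	"context"
	"sync"
)

// Hypothetical stand-ins for the state and DB layers involved in migration.
type BeaconState struct{}

type stateManager struct {
	sync.Mutex
	regenerateState func(ctx context.Context, slot uint64) (*BeaconState, error) // CPU heavy
	persistState    func(ctx context.Context, st *BeaconState) error             // disk I/O heavy
}

const migrationInterval = 2048 // slots between migrations of hot states to the cold section

// migrateToCold regenerates the state at the migration boundary and persists
// it. Both steps are expensive, and the lock means a concurrent onBlock call
// can be stalled for the whole duration.
func (m *stateManager) migrateToCold(ctx context.Context, slot uint64) error {
	if slot%migrationInterval != 0 {
		return nil
	}
	m.Lock()
	defer m.Unlock()

	st, err := m.regenerateState(ctx, slot)
	if err != nil {
		return err
	}
	return m.persistState(ctx, st)
}
```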
### Underlying Cause
The underlying issue that causes all these bottlenecks to influence our block processing time and therefore head votes is how our `onBlock` method is structured. Currently:
```go=
func (s *Service) ReceiveBlock(ctx context.Context, block *ethpb.SignedBeaconBlock, blockRoot [32]byte) error {
	ctx, span := trace.StartSpan(ctx, "blockChain.ReceiveBlock")
	defer span.End()
	blockCopy := stateTrie.CopySignedBeaconBlock(block)
	// Apply state transition on the new block.
	if err := s.onBlock(ctx, blockCopy, blockRoot); err != nil {
		err := errors.Wrap(err, "could not process block")
		traceutil.AnnotateError(span, err)
		return err
	}
	// Update and save head block after fork choice.
	if err := s.updateHead(ctx, s.getJustifiedBalances()); err != nil {
		log.WithError(err).Warn("Could not update head")
	}
	// Send notification of the processed block to the state feed.
	s.stateNotifier.StateFeed().Send(&feed.Event{
		Type: statefeed.BlockProcessed,
		Data: &statefeed.BlockProcessedData{
			Slot:        blockCopy.Block.Slot,
			BlockRoot:   blockRoot,
			SignedBlock: blockCopy,
			Verified:    true,
		},
	})
	// Handle post block operations such as attestations and exits.
	if err := s.handlePostBlockOperations(blockCopy.Block); err != nil {
		return err
	}
	// Have we been finalizing? Should we start saving hot states to db?
	if err := s.checkSaveHotStateDB(ctx); err != nil {
		return err
	}
	// Reports on block and fork choice metrics.
	reportSlotMetrics(blockCopy.Block.Slot, s.HeadSlot(), s.CurrentSlot(), s.finalizedCheckpt)
	// Log block sync status.
	logBlockSyncStatus(blockCopy.Block, blockRoot, s.finalizedCheckpt)
	// Log state transition data.
	logStateTransitionData(blockCopy.Block)
	return nil
}
```
And `onBlock` :
```go=
func (s *Service) onBlock(ctx context.Context, signed *ethpb.SignedBeaconBlock, blockRoot [32]byte) error {
	ctx, span := trace.StartSpan(ctx, "blockChain.onBlock")
	defer span.End()
	if signed == nil || signed.Block == nil {
		return errors.New("nil block")
	}
	b := signed.Block
	preState, err := s.getBlockPreState(ctx, b)
	if err != nil {
		return err
	}
	set, postState, err := state.ExecuteStateTransitionNoVerifyAnySig(ctx, preState, signed)
	if err != nil {
		return errors.Wrap(err, "could not execute state transition")
	}
	valid, err := set.Verify()
	if err != nil {
		return errors.Wrap(err, "could not batch verify signature")
	}
	if !valid {
		return errors.New("signature in block failed to verify")
	}
	if err := s.savePostStateInfo(ctx, blockRoot, signed, postState, false /* reg sync */); err != nil {
		return err
	}
	// Updating next slot state cache can happen in the background. It shouldn't block rest of the process.
	if featureconfig.Get().EnableNextSlotStateCache {
		go func() {
			// Use a custom deadline here, since this method runs asynchronously.
			// We ignore the parent method's context and instead create a new one
			// with a custom deadline, therefore using the background context instead.
			slotCtx, cancel := context.WithTimeout(context.Background(), slotDeadline)
			defer cancel()
			if err := state.UpdateNextSlotCache(slotCtx, blockRoot[:], postState); err != nil {
				log.WithError(err).Debug("could not update next slot state cache")
			}
		}()
	}
	// Update justified check point.
	if postState.CurrentJustifiedCheckpoint().Epoch > s.justifiedCheckpt.Epoch {
		if err := s.updateJustified(ctx, postState); err != nil {
			return err
		}
	}
	// Update finalized check point.
	if postState.FinalizedCheckpointEpoch() > s.finalizedCheckpt.Epoch {
		if err := s.beaconDB.SaveBlocks(ctx, s.getInitSyncBlocks()); err != nil {
			return err
		}
		s.clearInitSyncBlocks()
		if err := s.updateFinalized(ctx, postState.FinalizedCheckpoint()); err != nil {
			return err
		}
/* Rest of code removed for reduced verbosity */
```
As can be observed in our main block receiver method, we only update our head after we fully carry out `onBlock`. However, `onBlock` does not only carry out the state transition; it also carries out all the important post-finalization bookkeeping. This post-justification and post-finalization bookkeeping has been recorded as having a significantly negative effect on block processing time.
### Potential Solutions
#### 1) Refactor Block Processing Routine
Update head immediately after a successful state transition occurs and the important post-state-transition data is persisted. This should resolve all our bottlenecks. However, this would be a very big change, as the current pattern of processing blocks has been in place for nearly two years, and the side effects of a change this large are not fully known yet.
Some of these post-finalization cleanups are important and might be relied upon as unintended invariants by other important services/routines in the client. Changing the ordering of the head update might break invariants we are not fully cognizant of.
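A rough, non-compilable sketch of the reordering, reusing names from the snippets above; `applyBlock` and `handlePostTransitionBookkeeping` are hypothetical helpers standing in for how `onBlock` might be split.
```go
func (s *Service) ReceiveBlock(ctx context.Context, block *ethpb.SignedBeaconBlock, blockRoot [32]byte) error {
	blockCopy := stateTrie.CopySignedBeaconBlock(block)

	// 1. State transition, signature verification and persisting the post state only.
	postState, err := s.applyBlock(ctx, blockCopy, blockRoot) // hypothetical slimmed-down onBlock
	if err != nil {
		return errors.Wrap(err, "could not process block")
	}

	// 2. Update head immediately so attesters vote on the freshest block.
	if err := s.updateHead(ctx, s.getJustifiedBalances()); err != nil {
		log.WithError(err).Warn("Could not update head")
	}

	// 3. Justification/finalization bookkeeping, deposit insertion, metrics etc.
	//    run only after head is updated (or asynchronously).
	return s.handlePostTransitionBookkeeping(ctx, blockCopy, blockRoot, postState) // hypothetical
}
```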
#### 2) Separate out potentially computationally intensive portions
This is already implemented here:
https://github.com/prysmaticlabs/prysm/pull/8405
By decoupling this from our main block processing routine we can prevent it from leading to bad head votes. However, while some parts of our processing routine can be split out this way, not all portions can be handled in the same manner.
A few methods require strictly sequential ordering in order to proceed, and running them asynchronously might introduce new bugs.
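The pattern mirrors the `UpdateNextSlotCache` goroutine already present in `onBlock`: detach the expensive work from the parent context and bound it with its own deadline. In the sketch below, `insertFinalizedDeposits` and `depositDeadline` are placeholder names, not the actual identifiers from the PR.
```go
// Sketch of the decoupling pattern; insertFinalizedDeposits stands in for the
// work that PR 8405 actually moves off the hot path.
go func() {
	// Detach from the parent context so block processing is never blocked,
	// but still bound the work with its own deadline.
	depCtx, cancel := context.WithTimeout(context.Background(), depositDeadline)
	defer cancel()
	if err := s.insertFinalizedDeposits(depCtx, finalizedCheckpoint.Root); err != nil {
		log.WithError(err).Error("Could not insert finalized deposits")
	}
}()
```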
### Open Questions
1) Is there any way to future-proof our block processing times against an even larger state?
2) Even with all these bottlenecks optimized, how much leeway can we provide to the eth1 engine in block proposals for the merge? Eth1 execution is expensive and time consuming (as it involves heavy disk I/O use).