# Cluster Results: Efficient Historical State Representation

This document records the findings of running PR #9155 (Efficient Historical State Representation) and its follow-up PRs in a cluster for a week.

Issue: https://github.com/prysmaticlabs/prysm/issues/8954
Base PR: https://github.com/prysmaticlabs/prysm/pull/9155
Design: https://hackmd.io/soUUA3-dTYCSeBx4bkRc-Q

This PR is gated behind a flag (`--enable-historical-state-representation`). When enabled, it migrates the validator entries from the old `state` bucket to the new buckets. See the design doc referenced above for more details.

## The cluster setup

The cluster ran 6 nodes, 2 nodes for each of the cases below.

1) `develop` branch code without enabling the flag (**normal**).
2) `develop` branch code with the flag enabled (**flag-enabled**).
3) `develop` branch code without the flag during initial sync, with the flag enabled after normal sync had run for a few hours (**flag-enabled-later**).

### Initial Sync

The following diagram shows resource utilization during the entire period of initial sync. Keep in mind that **normal** and **flag-enabled-later** are one and the same here, since the flag is enabled only after the initial sync.

#### Observations

- The bolt TX writes are high during this period for **flag-enabled-later**. Since this configuration should behave like **normal**, the only explanation we can come up with is that this particular pod may have been scheduled on an IO-heavy VM.
- CPU and memory utilization look the same for all the cases.
- The validator cache is being actively used by the **flag-enabled** instance.

![](https://i.imgur.com/A5m7iES.jpg)

### Normal Sync

The following diagram shows resource utilization during normal sync. Remember that the flag was enabled here for **flag-enabled-later**, so **flag-enabled** and **flag-enabled-later** should behave the same.

#### Observations

- We see a small CPU peak for **flag-enabled**, but since it did not happen on a regular basis and lasted only a few seconds, it can be safely ignored for now. An alternative theory is that the entire vector cache is purged when a state is deleted; when the next state arrives, all the validator entries are stored in the cache again, which might spike the CPU. Since state deletions are also rare, we can safely ignore this spike.
- CPU and memory utilization are almost the same.
- The validator cache was used by both **flag-enabled** and **flag-enabled-later** in one place.
- Bolt utilization was steady, albeit slightly higher for the instances which had the flag enabled, which is expected.

![](https://i.imgur.com/DtFfyeu.jpg)

### Under Load

Load was created by calling the /balance API, which constructs the state. A tool was used to send these API requests with random epochs at a standard rate to all three configurations. Here both **flag-enabled** and **flag-enabled-later** have the flag enabled, so they should behave the same.

#### Observations

- CPU and memory usage are almost the same.
- The validator cache is being very actively used by both flag-enabled instances, 2) and 3).
- The Bolt TX write rate is higher for **flag-enabled-later**. This seems odd at first, but since 2) and 3) are the same configuration, it can be attributed to the pod.

![](https://i.imgur.com/DN3kB3Y.jpg)
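For reference, below is a minimal sketch of the kind of load generator described in the Under Load section: it requests balances for a random epoch at a fixed rate, forcing the node to construct historical state. The node address, endpoint path, `epoch` query parameter, epoch bound, and request rate are assumptions for illustration only; they are not the exact tool or values used in the cluster.

```go
// loadgen.go: a sketch of a random-epoch /balance load generator (assumed
// endpoint and parameters; not the actual tool used in this test).
package main

import (
	"fmt"
	"log"
	"math/rand"
	"net/http"
	"time"
)

func main() {
	const (
		node     = "http://localhost:3500"             // hypothetical beacon node HTTP endpoint
		path     = "/eth/v1alpha1/validators/balances" // hypothetical balances API path
		maxEpoch = 50000                               // hypothetical upper bound on synced epochs
	)

	// Send one request per second (a "standard rate") with a random epoch.
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for range ticker.C {
		epoch := rand.Intn(maxEpoch)
		url := fmt.Sprintf("%s%s?epoch=%d", node, path, epoch)

		resp, err := http.Get(url)
		if err != nil {
			log.Printf("epoch %d: request failed: %v", epoch, err)
			continue
		}
		resp.Body.Close()
		log.Printf("epoch %d: status %s", epoch, resp.Status)
	}
}
```

Each random-epoch request forces a historical state to be constructed, which is consistent with the validator cache and Bolt activity seen in the diagram above.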