# Dec Network Outage - Learning and Next Steps
## Context
A lot has been happening on the network since epoch 51 (ongoing). Capturing some learnings and future action items here.
Consensus goal: reach quorum on every block, i.e., at least two-thirds of validators should have correct configurations, be running, and be producing new blocks.
The network is compromising on liveness (producing new blocks), but not on safety.
Add all your hypotheses; we can strike them off as they are proved incorrect.
## Incorrect configurations on validators
Some of the validators have incorrect configurations, such as wrong IP addresses and closed ports. These configurations are used for communication among the validators in the validator set.
**Learning**
1. Required ports should be open
2. IP address needs to be correct
**Action items**
1. Tool to change configurations; these have to be updated on-chain.
2. Ports 6178, 6179 should be open for communication among validators. Add checks for this in `ol health`.
3. Check for correctness of the advertised IP address. 0.0.0.0 is especially problematic; it should probably fail if that is used. [nazario.eth] (See the manual check sketch after this list.)
4. Document what the correct configurations are; check whether this is already covered in the onboarding document.
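Not a substitute for `ol health` checks, but here is a minimal manual sketch another operator can run against a validator's advertised address. The IP is a placeholder and the ports are the ones listed above; adjust both for the node being checked.

```bash
# Manual reachability check for a validator's advertised address.
# Run from a machine OUTSIDE the validator's own network.
VALIDATOR_IP="203.0.113.10"   # placeholder -- use the IP published in the on-chain config

# 0.0.0.0 is not a routable address; other validators cannot reach it.
if [ "$VALIDATOR_IP" = "0.0.0.0" ]; then
  echo "Advertised IP is 0.0.0.0 -- fix the on-chain configuration"
  exit 1
fi

for port in 6178 6179; do
  if nc -z -w 5 "$VALIDATOR_IP" "$port"; then
    echo "port $port reachable"
  else
    echo "port $port NOT reachable -- check firewall / security group rules"
  fi
done
```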
## Not enough validators up to reach quorum
We need at least two-thirds of the nodes to be synced (at the same block height) and running in order to make new blocks; with the current set of 45 validators, that means at least 30 of them.
**Challenges**
1. Coordination with validator operators from different time zones
**Action items**
1. Run diem-node as a daemon (probably?) - needs investigation. A possible systemd sketch follows.
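A minimal sketch of what the daemon setup could look like with systemd. The unit path, user, binary location, config path, and the `-f` flag are assumptions to verify, not the confirmed way these nodes are run.

```bash
# Sketch: run diem-node under systemd so it restarts after crashes and reboots.
# User, paths, and the -f config flag are assumptions -- adjust for your host.
sudo tee /etc/systemd/system/diem-node.service > /dev/null <<'EOF'
[Unit]
Description=diem-node validator
After=network-online.target

[Service]
User=node
ExecStart=/home/node/bin/diem-node -f /home/node/.0L/validator.node.yaml
Restart=always
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now diem-node
sudo journalctl -u diem-node -f   # tail the node logs
```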
## Network upgrade vs mempool
The network upgrade happened in epoch 51. It was supposed to happen in an earlier epoch but did not go through: the block responsible for the upgrade was rejected with mempool errors.
**Hack**
Turn off, for ~10 minutes, the full nodes that submit the most VDF proofs, i.e., stop the transactions coming from Carpe miners (the check for upgrades happens at block 2 of the epoch).
**Learning**
The mempool needs to be cleared during the epoch; this requires further investigation.
**Action items**
Work out whether this is specific to this upgrade (which includes a state change in stdlib) or whether it can affect future upgrades as well.
## Number of validators down after network upgrade
Hurray, the upgrade happened!
11 out of 45 validators are down after the upgrade. Some were down because AWS was down; the others were not able to sync after the upgrade and could not accept new blocks.
**Reason**
Needs investigation.
1. ~~diem-node needs to be upgraded to the latest version.~~ Observed a running validator that still uses an older version of diem-node.
2. Would this be a problem if the majority of nodes ran the older version?
**Hack**
Failed: use an archive with a waypoint from the start of the epoch.
Worked: use an archive with a waypoint from after the upgrade. But this loses the state history since genesis; having a complete epoch archive helps, but it is a huge file. (Rough sketch of the workaround below.)
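Rough shape of the workaround that worked, assuming `ol restore` pulls the most recent published backup (which at this point carries a post-upgrade waypoint). The exact flags and the diem-node invocation are assumptions; check `ol restore --help` on your version.

```bash
# Sketch of the workaround that worked; behaviour and flags to be confirmed per ol version.
pkill diem-node                            # or stop however the node is being run
ol restore                                 # assumed to fetch the latest backup + waypoint (post-upgrade here)
ol health                                  # sanity-check the node before restarting
diem-node -f ~/.0L/validator.node.yaml     # the -f flag and config path are assumptions
```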
**Action items**
1. Investigate why some of the validators have issues
2. Did these validators vote on block 2 (the upgrade transaction)?
3. Investigate why we need an archive with a waypoint from after the upgrade. What is blocking?
4. Tool/documentation to download the complete epoch archive, not just the most recent waypoint (low-hanging fruit - Zaki is the expert here). Hedged sketch below.
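For item 4, a hedged sketch assuming the complete epoch archives are published in a git repository; the repository URL below is a placeholder, not a confirmed location.

```bash
# Sketch: fetch the complete epoch archive instead of only the most recent waypoint.
# EPOCH_ARCHIVE_REPO is a placeholder -- point it at wherever the archives are actually published.
EPOCH_ARCHIVE_REPO="<epoch-archive-repo-url>"
df -h ~                                   # the full archive is huge; check free disk space first
git clone --depth 1 "$EPOCH_ARCHIVE_REPO" ~/epoch-archive
du -sh ~/epoch-archive                    # see how much space it actually took
```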
## Web-monitor is down
The web-monitor is the tool we need most during times like this.
**Learning**
We need alternate mechanisms to keep access to it.
## Take-aways
1. If you have TEST=y in your .bashrc, this is not going to work; it will always cause problems with `ol restore`. Also, these nodes need more RAM: you should have 16GB. If you have less, state synchronization will cause your diem-node to fail, and you will see a Killed message at the end of your logs. [0o-de-lally]
If 16GB of RAM is difficult or expensive, an alternative is creating a swap file on the root file system. It will run slower, but it shouldn't die. Here is how: https://linuxize.com/post/create-a-linux-swap-file/ [barmalei] (A sketch of both checks follows this list.)
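Two quick checks for the take-aways above: confirm TEST=y is not set in your shell profile, and if 16GB of RAM is not an option, add a swap file. The 8G size is an arbitrary example; the swap steps follow the linuxize guide linked above.

```bash
# 1) Make sure TEST=y is not set (it breaks `ol restore`).
grep -n 'TEST=' ~/.bashrc        # remove or comment out any TEST=y line it finds
unset TEST                       # clear it from the current shell session

# 2) If 16GB of RAM is not available, add swap so diem-node is not killed by the OOM killer.
#    8G is an arbitrary example size; see https://linuxize.com/post/create-a-linux-swap-file/
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist across reboots
free -h                          # verify the swap space is active
```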
## Next steps
1. Get an active rex-testnet back up.
2. Clarify what voting power means.
3. Should there be an upper limit on invalid proofs in an epoch?