# Regen Network v4.1.2 upgrade incident postmortem

## What caused the issue

When the previous software upgrade, **v4.0.0**, was applied at height `6589400`, it created a discrepancy between the Tendermint state and the app state. Validator set changes and other voting power changes happening on chain were not being reflected in the Tendermint state, but were accurately reflected in the app state. Software upgrade **v4.1.0** was scheduled at block height `7479500` to fix the Tendermint issue with a full refresh of the Tendermint validator set. One edge case that was not covered during the testing period led to a panic error and halted the chain at the upgrade block height.

To perform a full validator set refresh, the following steps had to be taken:

- Return the active validator set to Tendermint.
- Return validators who were jailed or went into an inactive state after the v4.0.0 upgrade with zero voting power.
- Don't return validators who were jailed or inactive both before and after the v4.0.0 upgrade (they were already not part of the consensus state).

The edge case that was missed was in this last step: validators that were inactive both before and after the **v4.0.0** upgrade were not being correctly excluded, so Tendermint was asked to remove a validator it did not have.
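To make the three rules concrete, here is a minimal, hypothetical Go sketch of the refresh logic. The `Validator` type, `refreshValidatorSet` function, and `inTendermintSet` map are illustrative stand-ins, not the actual regen-ledger code; the real fix is in the PR linked below.

```go
package main

import "fmt"

// Validator is a simplified stand-in for a staking validator record; the
// real types live in the Cosmos SDK's x/staking module.
type Validator struct {
	ConsAddr string // consensus address
	Power    int64  // voting power in the app state (0 if jailed/inactive)
}

// refreshValidatorSet builds the list of validator updates handed back to
// Tendermint, following the three rules above. Returning a zero-power
// update for a validator Tendermint never had is exactly the panic that
// halted the chain ("failed to find validator ... to remove").
func refreshValidatorSet(appValidators []Validator, inTendermintSet map[string]bool) []Validator {
	var updates []Validator
	for _, v := range appValidators {
		switch {
		case v.Power > 0:
			// Rule 1: active validator, refresh its power.
			updates = append(updates, v)
		case inTendermintSet[v.ConsAddr]:
			// Rule 2: inactive now, but Tendermint still tracks it;
			// a zero-power update removes it from consensus.
			updates = append(updates, Validator{ConsAddr: v.ConsAddr, Power: 0})
		default:
			// Rule 3, the missed edge case: inactive before and after
			// v4.0.0, already absent from Tendermint, so return nothing.
		}
	}
	return updates
}

func main() {
	app := []Validator{
		{ConsAddr: "valcons-active", Power: 100},
		{ConsAddr: "valcons-newly-jailed", Power: 0},
		{ConsAddr: "valcons-always-inactive", Power: 0},
	}
	tmSet := map[string]bool{"valcons-active": true, "valcons-newly-jailed": true}
	fmt.Println(refreshValidatorSet(app, tmSet)) // the always-inactive validator is skipped
}
```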
## What steps were taken to identify the issue

Once validators restarted their nodes with the `v4.1.2` binary and applied the upgrade at the upgrade height, they began reporting a panic error in the logs which sent their nodes into a restart loop.

```
Oct 07 16:17:16 regen-private-rpc cosmovisor[1103]: 4:17PM INF ABCI Replay Blocks appHeight=7479499 module=consensus stateHeight=7479499 storeHeight=7479500
Oct 07 16:17:16 regen-private-rpc cosmovisor[1103]: 4:17PM INF Replay last block using real app module=consensus
Oct 07 16:17:16 regen-private-rpc cosmovisor[1103]: 4:17PM INF applying upgrade "v4.1.0" at height: 7479500
Oct 07 16:17:16 regen-private-rpc cosmovisor[1103]: 4:17PM INF minted coins from module account amount=5922947uregen from=mint module=x/bank
Oct 07 16:17:17 regen-private-rpc cosmovisor[1103]: 4:17PM INF validator would have been slashed for downtime, but was either not found in store or already jailed module=x/slashing validator=regenvalcons1wg08yqmfaqh2u67hka63w0n3daq4p45vq8g4j7
Oct 07 16:17:17 regen-private-rpc cosmovisor[1103]: 4:17PM INF validator would have been slashed for downtime, but was either not found in store or already jailed module=x/slashing validator=regenvalcons1wxcaw3h69yf76laz6z86etc6ayp5wthy6u5td9
Oct 07 16:18:02 regen-private-rpc cosmovisor[1103]: 4:18PM INF executed block height=7479500 module=consensus num_invalid_txs=0 num_valid_txs=0
Oct 07 16:18:02 regen-private-rpc cosmovisor[1103]: Error: error during handshake: error on replay: commit failed for application: error changing validator set: failed to find validator 491F6B7B26FFA0AD9DED95C8A83173D768C49E3C to remove
```

The validator hex address `491F6B7B26FFA0AD9DED95C8A83173D768C49E3C` present in the error logs mapped to this validator: https://www.mintscan.io/regen/validators/regenvaloper153dka3lmwzfzzm6yfmz2tt2wkz6nmanlf80dqx. The missed edge case was identified once the status of this validator was observed.

## What steps were taken to fix the issue

Once the edge case was identified, a PR was raised by @aleem1314 which addressed the issue: https://github.com/regen-network/regen-ledger/pull/1533. After PR reviews by @aaronc, @anilCSE, and @technicallyty were completed, the branch was used to build a binary for testing purposes. This binary was deployed on internal nodes in an isolated dev environment, using the chain data that Cosmovisor had backed up at the upgrade height. Once it was established that the panic error no longer occurred after the upgrade was applied on these test nodes, the PR was merged.

## What steps were taken to upgrade the network

Once the PR was merged, a tag was pushed for `v4.1.3` to the repo, and all internal nodes were first upgraded to this version. An announcement was then made asking validators to upgrade to `v4.1.3`. The network started producing blocks roughly 25 minutes after the announcement, once 67% of the voting power had upgraded to the latest version.

## How we could have avoided the issue

This issue could have been avoided if the testing environment had included inactive validators. An internal devnet was set up to test the `v4.1.2` upgrade process. The genesis for this devnet was generated from a mainnet export. The consensus and account keys of the top 5 validators were replaced, and their voting power was increased in the genesis file itself in order to start the network. This provided a test environment that had the latest mainnet data and 5 validators controlled by the team. As the `max_validators` param of mainnet was set to 75, this devnet also had an active set of 75 validators. This is why the edge case of an inactive validator being present in the network was missed. Similar issues can be avoided in the future if inactive validators are also included in the testing strategy.
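One cheap guard for future upgrade rehearsals is to sanity-check the devnet genesis for inactive validators before testing begins. Below is a minimal, hypothetical Go sketch, assuming the stock Cosmos SDK genesis layout (`app_state.staking.validators` with `status` values such as `BOND_STATUS_UNBONDED`); it is illustrative, not part of the team's actual tooling.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// genesis decodes only the fields we need from a Cosmos SDK genesis
// export: each staking validator's bond status and jailed flag.
type genesis struct {
	AppState struct {
		Staking struct {
			Validators []struct {
				Status string `json:"status"`
				Jailed bool   `json:"jailed"`
			} `json:"validators"`
		} `json:"staking"`
	} `json:"app_state"`
}

func main() {
	raw, err := os.ReadFile("genesis.json") // the mainnet export used to seed the devnet
	if err != nil {
		panic(err)
	}
	var g genesis
	if err := json.Unmarshal(raw, &g); err != nil {
		panic(err)
	}

	byStatus := map[string]int{}
	jailed := 0
	for _, v := range g.AppState.Staking.Validators {
		byStatus[v.Status]++
		if v.Jailed {
			jailed++
		}
	}
	fmt.Printf("validators by status: %v (jailed: %d)\n", byStatus, jailed)

	// If every validator in the genesis is bonded, upgrade tests run on
	// this devnet cannot exercise inactive-validator edge cases.
	if byStatus["BOND_STATUS_UNBONDED"]+byStatus["BOND_STATUS_UNBONDING"] == 0 {
		fmt.Println("warning: no inactive validators; add some before testing the upgrade")
	}
}
```

Running a check like this against the devnet genesis before an upgrade rehearsal would have flagged that the test network contained no inactive validators, which is the gap that let this edge case through.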