Network Post Mortems
This section describes past issues that influenced the evolution of the proposed consensus update workflow, and also acts as a regression prevention mechanism for contributors. Only add details relevant to this process.
During the Byzantium mainnet hard fork, the Parity team made three emergency releases (4, 3 and 2 days before the hard fork), whereas the Geth team made one emergency release 1 day before the hard fork. Both clients were passing all consensus tests.
The last-minute hotfixes were prompted by a differential fuzzer that was created and introduced a week before the scheduled hard fork and proved immensely effective at finding issues. Although the fork was a success, the episode shows that fuzzers need to be an ongoing part of consensus testing.
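To illustrate the idea behind differential fuzzing, here is a minimal sketch. It is not the fuzzer used by the client teams: the two "clients" are hypothetical toy functions (`reference_add` and `buggy_add`), and the 256-bit wrapping addition is only an assumed stand-in for a real state transition. The point is the shape of the technique: feed identical random inputs to two implementations and report every input on which they disagree.

```python
import random

WORD = 2**256  # size of a 256-bit machine word


def reference_add(a: int, b: int) -> int:
    """Toy 'client A' (hypothetical): 256-bit wrapping addition."""
    return (a + b) % WORD


def buggy_add(a: int, b: int) -> int:
    """Toy 'client B' (hypothetical): forgets to wrap on overflow."""
    return a + b


def differential_fuzz(impl_a, impl_b, rounds: int, seed: int = 0):
    """Run both implementations on the same random inputs.

    Returns the list of (a, b) inputs on which the outputs differ.
    """
    rng = random.Random(seed)  # seeded so that failures are reproducible
    mismatches = []
    for _ in range(rounds):
        a = rng.randrange(WORD)
        b = rng.randrange(WORD)
        if impl_a(a, b) != impl_b(a, b):
            mismatches.append((a, b))
    return mismatches


if __name__ == "__main__":
    found = differential_fuzz(reference_add, buggy_add, rounds=1000)
    print(f"{len(found)} disagreeing inputs found")
```

Neither implementation needs to be known-correct in advance: any disagreement is a consensus issue in at least one of them, which is why this approach can find bugs that a fixed test suite passes over.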
During the Constantinople hard fork of the Ropsten test network, Geth and Parity got stuck at the fork block. Later both advanced a few hundred blocks, after which they split into two chains and Geth got stuck for a while before continuing. Some Geth nodes were on the same chain as Parity, while others rejected it. For a couple of days multiple chains were competing, and clients in general were extremely unstable with regard to sync.
The reason for getting stuck at the hard fork originally was that both the Geth and Parity teams assumed some DApp developer would be mining, so neither team started miners. It turned out nobody was mining. The takeaway is that clients must not rely on community mining for testing a hard fork on a testnet.
The reason Geth and Parity split into two chains after getting a miner onboard was a consensus issue in Parity. Since the miner was backed by a Parity node, there was nothing to push the Geth chain forward. The takeaway is that while developers figure out which client is correct on a fork, both chains must be able to move forward, so every client team must run its own miners; one is not enough.
While trying to fix the stuck chains and get some new mining nodes onboard, it became apparent that both Geth's fast sync and Parity's warp sync are useless in this situation, as they sync to the heaviest chain and ignore state transitions (by design). This was the reason why Geth and Parity nodes intermingled on chains they otherwise considered invalid. The takeaway is that once something goes wrong, it can be hard to get new clients onto the correct chains, so it's better to have all nodes deployed before the fork hits. Clients also need to support rolling back chains and blacklisting certain blocks.
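The rollback-and-blacklist requirement can be sketched as follows. This is only an illustration under simplifying assumptions: the chain is modelled as a plain list of block hashes from genesis to head, and `rollback_to_last_good` is a hypothetical helper, not an API of Geth or Parity.

```python
from typing import List, Set


def rollback_to_last_good(chain: List[str], bad_blocks: Set[str]) -> List[str]:
    """Truncate the chain just before the first blacklisted block.

    `chain` is ordered from genesis to head. Everything from the first
    blacklisted hash onward is dropped, since descendants of a bad block
    cannot be valid either.
    """
    for index, block_hash in enumerate(chain):
        if block_hash in bad_blocks:
            return chain[:index]
    return list(chain)  # nothing blacklisted: keep the chain as-is


if __name__ == "__main__":
    local_chain = ["genesis", "a", "b", "c"]
    print(rollback_to_last_good(local_chain, {"b"}))
```

A real client would additionally have to persist the blacklist so that the rejected blocks are not re-imported from peers after the rollback.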
After identifying the consensus issue in the Parity code and stopping mining on that chain, Geth still had a hard time syncing, as the longest chain was invalid. Some internal optimizations did not properly handle long competing but invalid chains. The takeaway is that we need to be able to test scenarios where the consensus rules conflict with the GHOST protocol reorg rules. See the next part for how to do it.
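The conflict described above can be reproduced in miniature. The sketch below assumes a simplified fork choice (highest total difficulty) in place of a client's real reorg logic, and a hypothetical `Block.valid` flag in place of actually executing blocks. It builds a scenario where the heaviest chain contains an invalid block and checks that head selection falls back to the heaviest fully valid chain.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Block:
    hash: str
    parent: Optional[str]  # None for genesis
    difficulty: int
    valid: bool = True  # assumed outcome of local consensus validation


def total_difficulty(blocks: Dict[str, Block], head: str) -> Optional[int]:
    """Sum difficulty from `head` back to genesis.

    Returns None if any block on the path is invalid or unknown.
    """
    total = 0
    current: Optional[str] = head
    while current is not None:
        block = blocks.get(current)
        if block is None or not block.valid:
            return None
        total += block.difficulty
        current = block.parent
    return total


def pick_head(blocks: Dict[str, Block]) -> Optional[str]:
    """Return the tip of the heaviest chain made only of valid blocks."""
    best_hash, best_td = None, -1
    for block_hash in blocks:
        td = total_difficulty(blocks, block_hash)
        if td is not None and td > best_td:
            best_hash, best_td = block_hash, td
    return best_hash


if __name__ == "__main__":
    chain = {
        "g": Block("g", None, 1),
        "a1": Block("a1", "g", 10),               # valid but lighter fork
        "b1": Block("b1", "g", 5, valid=False),   # consensus-invalid block
        "b2": Block("b2", "b1", 100),             # heavy block on the invalid fork
    }
    print(pick_head(chain))
```

In this scenario the fork through `b1` carries far more difficulty, yet `pick_head` selects `a1`, because every descendant of an invalid block is excluded regardless of weight.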
Since the chain split into many forks and clients had a hard time picking any of those and remaining up to date with them, we essentially destroyed the utility of the Ropsten testnet for a few days. The Ropsten transaction load was, however, needed to trigger the issue in the first place. The takeaway is that instead of doing an irreversible fork for the stress test, a "private" rehearsal fork should be done with a limited number of nodes. With a private fork, all the transactions get replayed and consensus issues are hit, but we're not crippling higher layer developers if things go bad. Care must be taken to keep the hash-rate below that of the live network to avoid accidentally pulling newly syncing nodes over to the fork.
Copyright and related rights waived via CC0.