# Post mortem: 9 Jul - 12 Jul '21 `nim-waku` `prod` fleet incident

## Background

On Friday, 9 Jul 2021, the state of the [`wakuv2.prod` fleet](https://fleets.status.im/) was as follows:

1. Nodes were running [`nim-waku` release `v0.4`](https://github.com/status-im/nim-waku/releases/tag/v0.4).
2. `nim-waku` `chat2` clients reported issues with accessing `store` functionality when attempting to connect to `prod`. The issue was reported [here](https://github.com/status-im/nim-waku/issues/659). Importantly, the `js-waku` client did not have any similar issues.
3. All `prod` fleet nodes reported violations of the [GossipSub backoff period](https://github.com/libp2p/specs/blob/master/pubsub/gossipsub/gossipsub-v1.1.md#prune-backoff-and-peer-exchange) that other clients have to respect before attempting a reconnection. The effect was that `prod` nodes failed to connect to each other and form a mesh.
4. There were indications of possible SQLite DB corruption (logs indicated message storage failure, `select *` queries returned unexpected results). This has not been fully investigated yet.

As far as can be established, the above-mentioned issues were present on the `prod` fleet for at least a week.

On the same day, the [`chat2bridge` to Matterbridge/Discord](https://github.com/status-im/infra-nim-waku/issues/8), which had been offline for about a week, was also redeployed off the latest `nim-waku` `master`. This meant that the `prod` fleet nodes, on top of their failure to connect to each other, didn't support the `ping` keep-alive mechanism used by `chat2bridge`.

## Steps taken on 9 Jul

It seemed likely that either DB corruption, unexpected behaviour of the deprecated keep-alive mechanism, or both caused the emergent issues. The exact cause is still under investigation.

Based on the above, the following was done at around 8 AM UTC:

1. As first priority, attempted to get the `prod` fleet in a stable state by redeploying off latest `master`.
   Jenkins job [here](https://ci.status.im/job/nim-waku/job/deploy-v2-prod/).
2. In parallel, tried to debug the issues.

## Impact on `prod` fleet

The redeployment had the following effects on `prod`:

1. `nim-waku` clients could again connect to the `prod` fleet and access `store` functionality.
2. Error logs related to possible DB corruption, backoff violations, and keep-alive issues disappeared.
3. Connection to `chat2bridge` was _not_ restored (this may have been related to the inconsistent `Peer` table).

Overall, the stability of the `prod` fleet was restored after the upgrade. The plan then was to continue debugging the cause of the original issues, fix connectivity to `chat2bridge` and communicate to clients that the fleet was usable again.

## Impact on `js-waku` client

The upgrade changed the `relay` protocol ID advertised by the `prod` fleet to the stable `/vac/waku/relay/2.0.0`. Since the released version of the `js-waku` client does not support this protocol ID, the upgrade caused `js-waku` clients to fail to connect to `prod`. The protocol ID issue is tracked [here](https://github.com/status-im/nim-waku/issues/661).

Franck and external users of the `js-waku` client reported the regression around 8 AM UTC on Monday, 12 Jul. This blocked their progress, as they had previously been able to connect to `prod` nodes despite the issues.

## Steps taken on 12 Jul

The following steps were taken to revert the changes to the `prod` fleet:

1. Hanno: Redeploy release `v0.4` to `prod`
2. Arthur: Recreate the SQLite DB (both `Peer` and `Message` tables)
3. Arthur: Restore connectivity between `prod` fleet nodes

## Current state of `prod` fleet

The current state of the `wakuv2.prod` fleet:

1. Nodes are running [`nim-waku` release `v0.4`](https://github.com/status-im/nim-waku/releases/tag/v0.4), with `relay` protocol ID `/vac/waku/relay/2.0.0-beta2`.
2. Connectivity between nodes has been restored.
3. Connectivity to the `chat2bridge` has _not_ been restored.
   This will require either an upgrade of `prod` or a downgrade of `chat2bridge`.

The redeployment and recreation of the DBs seem to have fixed the earlier keep-alive and connectivity issues. `js-waku` clients report that they can connect to `prod` as before.

## Lessons learned

1. **Waku incident channel:** `prod` _incidents_ and status updates should be clearly communicated. The `#waku-network` Discord channel could be used as "command centre" for incidents.
2. **Strict upgrade procedure:** `prod` _upgrades_ should always be done in a coordinated fashion. This requires general agreement from all clients after informing them of the possible impact.
3. **Only run releases on `prod`:** `prod` should only run released versions of `nim-waku`, unless there is an urgent reason not to (e.g. unforeseen and critical bugs in a release).

## Next steps

1. Determine scope for next `nim-waku` release. Discuss impact with other Waku v2 clients.
2. Upgrade `prod` with release version.
3. Verify that:
   - [ ] All clients connect as expected to the upgraded `prod` fleet
   - [ ] Connectivity between `prod` fleet nodes is stable
   - [ ] `prod` nodes correctly connect and relay to the `chat2bridge` to Discord
4. Continue investigating the causes of the original issues, e.g. https://github.com/status-im/nim-waku/issues/659, https://github.com/status-im/nim-waku/issues/637
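
The protocol ID mismatch behind the `js-waku` regression could be caught by a simple compatibility check before a `prod` upgrade. The following is a minimal sketch, not part of `nim-waku` or `js-waku`: the function names are hypothetical, it assumes protocol IDs are matched as exact strings, and it assumes (as this report implies but does not state outright) that the released `js-waku` client supported only the `2.0.0-beta2` relay ID.

```python
# Hypothetical pre-upgrade check. Names are illustrative only and do not
# correspond to any nim-waku or js-waku API.

# Relay protocol IDs as quoted in this post mortem.
RELAY_STABLE = "/vac/waku/relay/2.0.0"
RELAY_BETA2 = "/vac/waku/relay/2.0.0-beta2"


def common_protocols(node_ids, client_ids):
    """Return the protocol IDs supported by both sides.

    Assumption: IDs are compared as exact strings, so "2.0.0" and
    "2.0.0-beta2" are treated as entirely different protocols.
    """
    return sorted(set(node_ids) & set(client_ids))


def can_relay(node_ids, client_ids):
    """True if the node and client share at least one relay protocol ID."""
    return bool(common_protocols(node_ids, client_ids))


# Assumed protocol sets, inferred from the report.
v04_node = [RELAY_BETA2]        # `prod` running release v0.4
master_node = [RELAY_STABLE]    # `prod` after the 9 Jul redeploy off master
js_waku_client = [RELAY_BETA2]  # released js-waku client at the time

if __name__ == "__main__":
    for name, node in [("v0.4", v04_node), ("master", master_node)]:
        status = "compatible" if can_relay(node, js_waku_client) else "INCOMPATIBLE"
        print(f"prod on {name} vs js-waku: {status}")
```

Running a check of this kind against the protocol sets of every known client would support the first item of the verification checklist above, and would flag an incompatible upgrade before it reaches `prod`.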