# Post mortem: 9 Jul - 12 Jul '21 `nim-waku` `prod` fleet incident
## Background
On Friday, 9 Jul 2021, the state of the [`wakuv2.prod` fleet](https://fleets.status.im/) was as follows:
1. Nodes were running [`nim-waku` release `v0.4`](https://github.com/status-im/nim-waku/releases/tag/v0.4)
2. `nim-waku` `chat2` clients reported issues with accessing `store` functionality when attempting to connect to `prod`. Issue reported [here](https://github.com/status-im/nim-waku/issues/659). Importantly, the `js-waku` client did not have any similar issues.
3. All `prod` fleet nodes reported violations of the [GossipSub backoff period](https://github.com/libp2p/specs/blob/master/pubsub/gossipsub/gossipsub-v1.1.md#prune-backoff-and-peer-exchange) that other clients have to respect before attempting a reconnection. The effect was that `prod` nodes failed to connect to each other and form a mesh.
4. There were indications of possible SQLite DB corruption (logs indicated message storage failure, `select *` queries returned unexpected results). This has not been fully investigated yet.
As far as can be established, the above-mentioned issues were present on the `prod` fleet for at least a week.
On the day, the [`chat2bridge` to Matterbridge/Discord](https://github.com/status-im/infra-nim-waku/issues/8), which had been offline for about a week, was also redeployed off the latest `nim-waku` `master`. This meant that the `prod` fleet nodes, on top of their failure to connect to each other, didn't support the `ping` keep-alive mechanism used by `chat2bridge`.
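The backoff rule referenced in item 3 above can be illustrated with a minimal sketch. It is not the `nim-waku` or `nim-libp2p` implementation: the `BackoffTracker` class, its method names, and the peer and topic strings are hypothetical, and the 60-second value assumes the default prune backoff suggested by the GossipSub v1.1 spec. The idea is that once a node prunes a peer from a topic mesh, any attempt by that peer to rejoin before the backoff expires counts as a violation.

```python
from typing import Dict, Tuple

# Assumption: the GossipSub v1.1 spec's suggested default prune backoff (1 minute).
PRUNE_BACKOFF_SECONDS = 60.0


class BackoffTracker:
    """Hypothetical helper that tracks when a pruned peer may rejoin a topic mesh."""

    def __init__(self, backoff: float = PRUNE_BACKOFF_SECONDS) -> None:
        self.backoff = backoff
        # Maps (peer, topic) to the timestamp at which the backoff expires.
        self._expiry: Dict[Tuple[str, str], float] = {}

    def record_prune(self, peer: str, topic: str, now: float) -> None:
        """Remember that `peer` was pruned from `topic` at time `now`."""
        self._expiry[(peer, topic)] = now + self.backoff

    def is_violation(self, peer: str, topic: str, now: float) -> bool:
        """Return True if `peer` tries to rejoin `topic` before its backoff expires."""
        expiry = self._expiry.get((peer, topic))
        return expiry is not None and now < expiry


tracker = BackoffTracker()
tracker.record_prune("node-a", "example-topic", now=0.0)
early = tracker.is_violation("node-a", "example-topic", now=30.0)  # inside backoff
late = tracker.is_violation("node-a", "example-topic", now=60.0)   # backoff expired
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to reason about.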
## Steps taken on 9 Jul
It seemed likely that either DB corruption, unexpected behaviour of the deprecated keep-alive mechanism, or both caused the emergent issues. The exact cause is still the topic of an ongoing debugging investigation.
Based on the above, the following was done at around 8 AM UTC:
1. As first priority, attempted to get the `prod` fleet in a stable state by redeploying off latest `master`. Jenkins job [here](https://ci.status.im/job/nim-waku/job/deploy-v2-prod/).
2. In parallel, tried to debug the issues.
## Impact on `prod` fleet
The redeployment had the following effects on `prod`:
1. `nim-waku` clients could again connect to the `prod` fleet and access `store` functionality.
2. Error logs related to possible DB corruption, backoff violations, and keep-alive issues disappeared.
3. Connection to `chat2bridge` was _not_ restored (this may have been related to the inconsistent `Peer` table).
Overall, the stability of the `prod` fleet was restored after the upgrade. The plan then was to continue debugging the cause of the original issues, fix connectivity to `chat2bridge` and communicate to clients that the fleet is usable again.
## Impact on `js-waku` client
The upgrade changed the `relay` protocol ID advertised by the `prod` fleet to the stable `/vac/waku/relay/2.0.0`. Since the released version of the `js-waku` client does not support this protocol ID, the upgrade caused `js-waku` clients to fail to connect to `prod`. The protocol ID issue is tracked [here](https://github.com/status-im/nim-waku/issues/661).
Franck and external users of the `js-waku` client reported the regression around 8 AM UTC on Monday, 12 Jul. This blocked their progress, as they had previously been able to connect to `prod` nodes despite the issues.
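The protocol ID mismatch described above can be sketched as a simplified negotiation model. This is not the libp2p or `js-waku` API: the `negotiate` function and the variable names are hypothetical. It also assumes that the released `js-waku` client proposes only the `2.0.0-beta2` relay ID, and that each fleet version advertises exactly one relay ID. Because protocol IDs are compared as exact strings, a peer that only knows the beta ID finds no common protocol with a node that only advertises the stable one.

```python
from typing import Iterable, Optional

RELAY_BETA2 = "/vac/waku/relay/2.0.0-beta2"  # relay ID used by release v0.4
RELAY_STABLE = "/vac/waku/relay/2.0.0"       # relay ID advertised after the upgrade


def negotiate(dialer: Iterable[str], listener: Iterable[str]) -> Optional[str]:
    """Return the first protocol ID proposed by the dialer that the listener
    also supports (exact string match), or None if there is no overlap."""
    supported = set(listener)
    for proto in dialer:
        if proto in supported:
            return proto
    return None


# Assumption: the released js-waku client proposes only the beta2 relay ID.
js_waku_released = [RELAY_BETA2]
prod_v0_4 = [RELAY_BETA2]      # fleet on release v0.4
prod_upgraded = [RELAY_STABLE]  # fleet after redeploying off master

before_upgrade = negotiate(js_waku_released, prod_v0_4)     # common ID found
after_upgrade = negotiate(js_waku_released, prod_upgraded)  # no common ID
```

Under these assumptions, `before_upgrade` is the beta2 ID and `after_upgrade` is `None`, which matches the observed behaviour: `js-waku` clients could reach `prod` on `v0.4` but not after the upgrade.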
## Steps taken on 12 Jul
The following steps were taken to revert the changes to the `prod` fleet:
1. Hanno: Redeploy release `v0.4` to `prod`
2. Arthur: Recreate the SQLite DB (both `Peer` and `Message` tables)
3. Arthur: Restore connectivity between `prod` fleet nodes
## Current state of `prod` fleet
The current state of the `wakuv2.prod` fleet:
1. Nodes are running [`nim-waku` release `v0.4`](https://github.com/status-im/nim-waku/releases/tag/v0.4), with `relay` protocol ID `/vac/waku/relay/2.0.0-beta2`
2. Connectivity between nodes has been restored
3. Connectivity to the `chat2bridge` has _not_ been restored. This will require either an upgrade of `prod` or a downgrade of `chat2bridge`.
The redeployment and recreation of the DBs seem to have fixed the earlier keep-alive and connectivity issues. `js-waku` clients report that they can connect to `prod` as before.
## Lessons learned
1. **Waku incident channel:** `prod` _incidents_ and status updates should be clearly communicated. The `#waku-network` Discord channel could be used as a "command centre" for incidents.
2. **Strict upgrade procedure:** `prod` _upgrades_ should always be done in a coordinated fashion. It requires general agreement from all clients after informing them of possible impact.
3. **Only run releases on `prod`:** `prod` should only run released versions of `nim-waku`, unless there is an urgent reason not to (e.g. unforeseen and critical bugs in a release).
## Next steps
1. Determine scope for next `nim-waku` release. Discuss impact with other Waku v2 clients.
2. Upgrade `prod` with release version.
3. Verify that:
- [ ] All clients connect as expected to the upgraded `prod` fleet
- [ ] Connectivity between `prod` fleet nodes is stable
- [ ] `prod` nodes correctly connect and relay to the `chat2bridge` to Discord
4. Continue investigating the original causal issues, e.g. https://github.com/status-im/nim-waku/issues/659, https://github.com/status-im/nim-waku/issues/637