# Westend finality incident

### The day was 20-09-2022

Finality stalled after a migration deployment issue.

From Artyom:

> I started to migrate Westend validator nodes to K8s from VMs last week. By Monday this week all of the new validator nodes were up and running. I updated the session keys using rotateKeys RPC call on all of the new validator nodes. Everything was working fine until I updated A records of Westend bootnodes to point to the replacement nodes yesterday.
>
> The finalization slowed down right after this. I found out that both ws and tcp p2p ports were closed on that new nodes. So I switched the DNS back to point to the old nodes that were still running. It didn't help.

After the migration was reverted, finality still did not resume. Initial investigation suggested that the approval-voting subsystem was rejecting assignments and approvals from an old session because that session fell outside the session window:

```
Got a bad assignment from peer <snip> error=Unknown session index: 21505
```

With the help of DevOps, we called `sudo->initializer->forceApprove` and finality started catching up, until it stopped again less than an hour later.

##### But why did finality stop again?

We've been seeing a lot of

```
WARN tokio-runtime-worker parachain::availability-distribution: fetch_pov_job err=FetchPoV(NetworkError(NotConnected))
```

Could it be that reputation changes for `bad assignments` stack up? The cost per message is `-300_000` and the banning threshold is `-1760936552`, so a single peer would need to send over 5.8k such messages to reach the threshold. This sounds plausible given the aggression levels for gossip, but the hypothesis still needs to be confirmed with logs.

### Next day 21-09-2022

From Artyom:

> A few minutes ago I stopped the node binary on all of the old VMs. Right now only new validator nodes are up and running. Some of the new nodes stopped validating overnight. So now I'm kicking off validation by invoking Validate call in the polkadot.js.org panel
>
> Today I also opened the necessary ws and tcp p2p ports on the new nodes and switched the DNS to point to that new bootnodes.

#### Failed elections

The next morning, we noticed that elections had failed:

```
🗳 Failed to finalize election round. reason ElectionError::Fallback("NoFallback.")
```

According to Niklas, this might be due to the finality stall:

> as we subscribe to finalized heads we won't submit anything then... but maybe something else?

#### Disabled validators

We also noticed that 11 out of 16 validators had been disabled from producing blocks:

![](https://i.imgur.com/MTeMrYG.png)

The likely reason is an offence reported by im-online: https://westend.subscan.io/extrinsic/12692406-0. The reputation changes hypothesis is also consistent with this.

#### Finality restored

With further help from DevOps, applying `forceApprove` restored finality again. Block production times remain unstable due to the large number of disabled validators.

### Resolved questions:

1. Why didn't elections start working after finality was restored? This looks to be by design.
   > the staking miner only kicks in once the EPM:SignedPhase
2. Why are there no pending slashes for disabled validators according to polkadot js/rpc? `staking->invulnerables` contains only 4 entries. It seems slashes were applied, but the UI didn't display them.

### Mitigation in the future

1. It seems that the aggression levels for approval-distribution (introduced [here](https://github.com/paritytech/polkadot/pull/5164)) might do more harm than good. Consider simplifying them.
2. Consider extending the session window to be more than dispute_period (e.g. double that). https://github.com/paritytech/polkadot/issues/6040
3. ???
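
As a sanity check on the "over 5.8k messages" estimate from the reputation hypothesis, the arithmetic can be reproduced with a few lines of Python. This is only a sketch. It assumes a peer's reputation starts at 0, that each bad assignment costs exactly `-300_000`, and that nothing else (decay or positive reputation changes) moves the reputation in between. The function name is made up for illustration and does not come from the node code.

```python
# Figures quoted in the note; everything else here is illustrative.
COST_PER_BAD_ASSIGNMENT = -300_000
BANNING_THRESHOLD = -1_760_936_552


def messages_to_reach(threshold: int, cost_per_message: int) -> int:
    """Smallest number of messages whose summed cost reaches or passes `threshold`.

    Assumption: reputation starts at 0 and only these messages change it.
    """
    if threshold >= 0 or cost_per_message >= 0:
        raise ValueError("threshold and cost_per_message must both be negative")
    # Ceiling division on the magnitudes, done with integers only:
    # floor-dividing the negative threshold rounds away from zero.
    return -(threshold // -cost_per_message)


if __name__ == "__main__":
    n = messages_to_reach(BANNING_THRESHOLD, COST_PER_BAD_ASSIGNMENT)
    print(f"{n} bad-assignment messages needed per peer")  # prints 5870
```

Under these assumptions the result is 5870 messages per peer, which matches the estimate above. Any reputation recovery between messages would push the real number higher, which is one more reason the hypothesis needs to be checked against logs.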
