# Round #1 Hypothesis

1. Nodes are receiving unexpected messages as a result of a race condition <https://hackmd.io/z3-dPcR7Roaw6jnH8UlBzA?both>
    * too generic, must be split into multiple hypotheses
1. Nodes are receiving unexpected messages as a result of active leaf notifications being missed
    * falsified
    * nodes with additional logging deployed on Rococo, no signs of missed leaves
1. Nodes are incorrectly labeling expected messages as unexpected and ignoring them
    * falsified
    * additional logging and code review do not support this
1. Nodes are receiving many unexpected messages, which leads to large numbers of disconnections. This leads to poor connectivity and thereby a parachain block production slowdown
    * so this implies the view is correct, but what we send out is outdated
    * falsified
    * drops are far too frequent to match up with the reputation changes
    * <https://grafana.parity-mgmt.parity.io/goto/_nDLcBa7k?orgId=1>
    * all collators lose connection to a validator at once, and reconnect at once (~within a few milliseconds)

---

The above assumption of coalescing everything into one issue might not be correct. Splitting the hypotheses further is therefore reasonable, under the assumption that `out of view` messages _might_ not be related to delayed parachain block times.

# Round #2 Hypothesis

1. Peer view update messages are reordered with other messages on the network level.
    * it's TCP, it's an ordered stream
    * the p2p layer does not re-order messages
1. Peer view updates are dropped.
    * the number of messages dropped in Substrate is close to zero, and the drops happen in bursts, unrelated to the logged out-of-view statements
1. Peer view update messages are reordered with other messages on the receiving side.
    * Likely falsified
    * The view is updated before we see an out-of-view blame for that specific peer (a minimal sketch of this ordering follows after this list):

    View update (relay chain block `0d16a9...3f`):

    ```
    2022-02-10 15:07:12.038 TRACE tokio-runtime-worker parachain::bitfield-distribution: Peer view change peerid=PeerId("12D3KooWRK34hnh5QRMw28ad7QdoN98z5mtm71TEBxL5D4aNaatS") view=View { heads: [0x0d16a9bc9b12b0a992eb4521237e8e4e7ea34c442539b87c9221973a21105f3f, 0x57fab7663bc3b1fdf757b0b1dfdbccd9621427ffea3aecc5d6884831afa14fd6], finalized_number: 704004 }
    ```

    Out-of-view blame for the same relay chain head and the same peer:

    ```
    2022-02-10 15:07:12.070 DEBUG tokio-runtime-worker parachain::approval-distribution: Assignment from a peer is out of view peer_id=PeerId("12D3KooWRK34hnh5QRMw28ad7QdoN98z5mtm71TEBxL5D4aNaatS") fingerprint=Assignment(0x0d16a9bc9b12b0a992eb4521237e8e4e7ea34c442539b87c9221973a21105f3f, 0, ValidatorIndex(33))
    2022-02-10 15:07:12.070 DEBUG tokio-runtime-worker parachain::approval-distribution: Assignment from a peer is out of view peer_id=PeerId("12D3KooWRK34hnh5QRMw28ad7QdoN98z5mtm71TEBxL5D4aNaatS") fingerprint=Assignment(0x0d16a9bc9b12b0a992eb4521237e8e4e7ea34c442539b87c9221973a21105f3f, 0, ValidatorIndex(28))
    2022-02-10 15:07:12.070 DEBUG tokio-runtime-worker parachain::approval-distribution: Assignment from a peer is out of view peer_id=PeerId("12D3KooWRK34hnh5QRMw28ad7QdoN98z5mtm71TEBxL5D4aNaatS") fingerprint=Assignment(0x0d16a9bc9b12b0a992eb4521237e8e4e7ea34c442539b87c9221973a21105f3f, 0, ValidatorIndex(10))
    2022-02-10 15:07:12.070 DEBUG tokio-runtime-worker parachain::approval-distribution: Assignment from a peer is out of view peer_id=PeerId("12D3KooWRK34hnh5QRMw28ad7QdoN98z5mtm71TEBxL5D4aNaatS") fingerprint=Assignment(0x0d16a9bc9b12b0a992eb4521237e8e4e7ea34c442539b87c9221973a21105f3f, 0, ValidatorIndex(34))
    2022-02-10 15:07:12.070 DEBUG tokio-runtime-worker parachain::approval-distribution: Assignment from a peer is out of view peer_id=PeerId("12D3KooWRK34hnh5QRMw28ad7QdoN98z5mtm71TEBxL5D4aNaatS") fingerprint=Assignment(0x0d16a9bc9b12b0a992eb4521237e8e4e7ea34c442539b87c9221973a21105f3f, 0, ValidatorIndex(37))
    ```

    But these `out of view` errors disappear right after the following events:

    ```
    2022-02-10 15:07:12.077 DEBUG tokio-runtime-worker parachain::approval-distribution: Importing locally a new assignment fingerprint=Assignment(0x0d16a9bc9b12b0a992eb4521237e8e4e7ea34c442539b87c9221973a21105f3f, 0, ValidatorIndex(32))
    2022-02-10 15:07:12.077 TRACE tokio-runtime-worker parachain::approval-distribution: Sending an assignment to peers block_hash=0x0d16a9bc9b12b0a992eb4521237e8e4e7ea34c442539b87c9221973a21105f3f claimed_candidate_index=1 local_source=Yes num_peers=25
    2022-02-10 15:07:12.077 DEBUG tokio-runtime-worker parachain::approval-distribution: Importing locally a new assignment fingerprint=Assignment(0x0d16a9bc9b12b0a992eb4521237e8e4e7ea34c442539b87c9221973a21105f3f, 1, ValidatorIndex(32))
    ```

    Hence, it seems that `knowledge.known_messages` was not aware of this assignment before it was imported locally, which is why the out-of-view blame was raised.
1. Occupied cores are not freed due to a lack of backing or availability, which would explain both the `Collation wasn't advertised` messages and the observed delays in block times <https://grafana.parity-mgmt.parity.io/goto/sBcsUB-nk?orgId=1>
    * verifiable by measuring time to availability / backing
    * ...
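The following is a minimal, self-contained sketch of the ordering suspected in hypothesis 3. The types and names (`ApprovalDistributionState`, `known_blocks`, `Event`, …) are simplified stand-ins, not Polkadot's actual subsystem code: the point is only that a peer's view update and its assignment can both arrive before the local subsystem has imported the block itself, so the assignment is flagged as out of view even though the peer did nothing wrong.

```rust
// Hypothetical, simplified model of the suspected ordering; not the real
// approval-distribution implementation.
use std::collections::{HashMap, HashSet};

type Hash = &'static str;
type ValidatorIndex = u32;

#[derive(Default)]
struct ApprovalDistributionState {
    /// Relay-chain blocks this subsystem has imported itself.
    known_blocks: HashSet<Hash>,
    /// Latest view advertised by each peer.
    peer_views: HashMap<&'static str, HashSet<Hash>>,
}

enum Event {
    /// Peer advertised a new view.
    PeerViewChange { peer: &'static str, heads: Vec<Hash> },
    /// Peer sent an assignment referencing `block_hash`.
    Assignment { peer: &'static str, block_hash: Hash, validator: ValidatorIndex },
    /// Our own subsystem finally imports the block.
    NewBlock { block_hash: Hash },
}

impl ApprovalDistributionState {
    fn handle(&mut self, ev: Event) {
        match ev {
            Event::PeerViewChange { peer, heads } => {
                self.peer_views.insert(peer, heads.into_iter().collect());
            }
            Event::NewBlock { block_hash } => {
                self.known_blocks.insert(block_hash);
            }
            Event::Assignment { peer, block_hash, validator } => {
                let in_peer_view = self
                    .peer_views
                    .get(peer)
                    .map_or(false, |heads| heads.contains(block_hash));
                if !self.known_blocks.contains(block_hash) {
                    // The local state lags behind, even though the peer's
                    // advertised view already contains the block.
                    println!(
                        "out of view: peer={peer} block={block_hash} validator={validator} \
                         (peer view contains block: {in_peer_view})"
                    );
                } else {
                    println!("accepted: peer={peer} block={block_hash} validator={validator}");
                }
            }
        }
    }
}

fn main() {
    let mut state = ApprovalDistributionState::default();
    // Same ordering as in the logs above: peer view change, then assignments,
    // then our own block import a few milliseconds later.
    state.handle(Event::PeerViewChange { peer: "12D3KooWRK34…", heads: vec!["0x0d16a9…5f3f"] });
    state.handle(Event::Assignment { peer: "12D3KooWRK34…", block_hash: "0x0d16a9…5f3f", validator: 33 });
    state.handle(Event::NewBlock { block_hash: "0x0d16a9…5f3f" });
    state.handle(Event::Assignment { peer: "12D3KooWRK34…", block_hash: "0x0d16a9…5f3f", validator: 28 });
}
```

Running this prints one `out of view` line followed by an `accepted` line, mirroring the 15:07:12.070 → 15:07:12.077 pattern in the logs above.
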
# Round #3 Hypothesis

If multiple out-of-view messages are not being considered due to bugs, these out-of-view messages could cause a delay of availability (and maybe backing).

1. The bugs are in the distribution subsystems
    1. Statement distribution
        * `self.is_known` is not sensible
        * draft fix TBD
    2. Approval distribution
        * draft fix <https://github.com/paritytech/polkadot/pull/4908>
    3. AvailabilityBitfields distribution
        * ...
1. Occupied cores are not freed due to a lack of backing or availability, which would explain both the `Collation wasn't advertised` messages and the observed delays in block times <https://grafana.parity-mgmt.parity.io/goto/sBcsUB-nk?orgId=1>
    * verifiable by measuring time to availability / backing (see the measurement sketch at the end of this note)
    1. Availability time is increased
        * ...
    2. Backing times are increased
        * ...
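For the "measuring time to availability / backing" item, a rough offline check is possible once per-candidate "backed" and "available" timestamps are extracted (e.g. from logs or the Grafana dashboards linked above). The sketch below assumes such a pre-parsed event list; the event format, `Kind`, and `availability_times` are made-up illustrations, not an existing Polkadot API.

```rust
// Hypothetical sketch: compute backing -> availability latency per candidate
// from timestamped events parsed elsewhere.
use std::collections::HashMap;
use std::time::Duration;

/// (candidate hash, event kind, milliseconds since some epoch).
type Event = (&'static str, Kind, u64);

#[derive(Clone, Copy)]
enum Kind {
    Backed,
    Available,
}

fn availability_times(events: &[Event]) -> Vec<(&'static str, Duration)> {
    let mut backed_at: HashMap<&str, u64> = HashMap::new();
    let mut out = Vec::new();
    for &(candidate, kind, at_ms) in events {
        match kind {
            Kind::Backed => {
                backed_at.insert(candidate, at_ms);
            }
            Kind::Available => {
                if let Some(&start) = backed_at.get(candidate) {
                    out.push((candidate, Duration::from_millis(at_ms.saturating_sub(start))));
                }
            }
        }
    }
    out
}

fn main() {
    // Made-up sample data standing in for parsed log/Grafana events.
    let events: [Event; 4] = [
        ("0xaaaa…", Kind::Backed, 1_000),
        ("0xbbbb…", Kind::Backed, 1_200),
        ("0xaaaa…", Kind::Available, 7_300),
        ("0xbbbb…", Kind::Available, 19_800),
    ];
    for (candidate, dt) in availability_times(&events) {
        // Latencies well above a single relay-chain slot (6 s) would support
        // the hypothesis that availability, not collation, is the bottleneck.
        println!("{candidate}: time to availability = {:?}", dt);
    }
}
```

The same pairing approach would work for measuring backing times, using collation-advertised and candidate-backed events instead.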