# Postmortem Astar collator

## Overview

**Wednesday May 26th 2022**

This postmortem gives an overview of my collator and its downtime of 10 hours. I would also like to apply to rejoin the active whitelisted collator set of Astar Network, a parachain on Polkadot. The reason for the downtime and for missing crucial blocks is explained in this postmortem.

## Proven record

Onboarded as collator on March 20th 2022.

Total blocks produced: **25262** ([source](https://astar.subscan.io/transfer?address=aLvRL8nq1FypsvQbZG9kHEJfkfya7oLUDcKvWULTaVrvnNz&startDate=&endDate=&startBlock=&endBlock=&timeType=date&direction=received))

Total blocks in the network produced by my node: 2.28%. The calculation is based on the total number of finalized blocks in the network (as of May 26th, 20:37 CET): 1108787.

## Server specs

| **Main server** | **Backup Server** |
| -------- | -------- |
| Dedicated server | Dedicated server |
| CPU: Intel® (i9 9900K) / **16-core** | CPU: Intel® i7 / **8-core** |
| Memory: 64GB DDR4 | Memory: 64GB DDR4 |
| OS: Linux Ubuntu 20.04 - 64-bit | OS: Linux Ubuntu 20.04 - 64-bit |
| HD: 1TB SSD | HD: 1TB SSD |
| Bandwidth: 1 Gbps | Bandwidth: 1 Gbps |
| Location: Europe | Location: Europe |

## Full analysis

### A high-level summary of what happened

Because of my node's downtime, Astar Network faced block production issues and an increased block production rate.

**Downtime started on 2022/05/15 18:37 (UTC) and was restored on 2022/05/15 05:07 (UTC)**, a total downtime of 10 hours. Our node (PolkaShots) was removed from the active set on 2022/05/18 14:38 (UTC).

I received an email alert during the night. As soon as I noticed it, I got up to check the status. It wasn't possible to access our dedicated server remotely, and I couldn't reach the provider. We immediately built a new collator with a new partner and contacted bLd regarding our downtime.
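
The unreachable-host situation described above can be confirmed from a second machine with a plain TCP probe against the SSH port. Below is a minimal Python sketch of such a check, not the monitoring we actually ran; the function name `is_port_open`, the example host, and the timeout are assumptions for illustration only.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Example call (hypothetical host name, SSH on its default port 22):
#   is_port_open("collator.example.org", 22)
```
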
We regained access to our main server before the backup server had finished syncing, and restarted our main collator to resume producing blocks.

No backup server was in place because everything had gone so smoothly for the first months that I didn't pay attention to a scenario like the one that occurred during this downtime. I didn't expect my provider to completely shut me out of my server.

### The root cause analysis

Node monitoring was working and the alert email was received, but no backup server was ready to be used as a fallback. I also received warnings from my provider:

- Host is unreachable
- SSH is down

The node went down because of CPU overload:

- **alertname** = HostHighCpuLoad
- **instance** = localhost:9100
- **description** = CPU load is > 80% VALUE = 91.16640687499999 LABELS: map[instance:localhost:9100]
- **summary** = Host high CPU load (instance PolkaShots)

### Steps taken to resolve

1. Upgraded my node to the latest build.
2. Restarted my server.
3. Actions to better diagnose and prevent this are described in the section ‘Learnings and next steps’.

### Learnings and next steps

We improved our internal workflow and set up a backup server to be used when the dedicated server is unreachable. We followed the steps described in the documentation: https://docs.astar.network/maintain/collator/node-maintenance

Possible scenarios to investigate include running the same server with the same keys. The problem there is slashing because of double signing.

We now have a complete server synced and ready in case the main server goes down. Multiple DevOps engineers now have access to both servers.

### Contact

Email: polkashots@protonmail.com

Element: @polkashots:matrix.org