# Solana Network Restart Feb 25, 2023 Validator Community Post-Mortem
## 1 - Purpose
* Gather validators involved in the Feb 25, 2023 network restart to discuss root cause analysis, process improvements, and suggested next steps, specific to the v1.14 upgrade and beyond.
* Further improve decentralization of the decision-making process by reducing dependence on the core dev team and the foundation, in both root cause analysis (yes/no?) and decision making (governance).
## 2 - Root Cause Analysis
* Official [Solana release](https://solana.com/news/02-25-23-solana-mainnet-beta-outage-report) (WIP)
Note: Pre-discussions indicate this section may be skipped in favor of relying on the "official" RCA.
* Additional Solana internal discussion [Handling Common Solana Outages](https://docs.google.com/document/d/1RkNAyz-5aKvv5FF44b8SoKifChKB705y5SdcEoqMPIc/edit#heading=h.3n4r2183s43v)
## 3 - Process Improvements
* Goal(s)
* Primary
* Preserve chain integrity
* Reduce downtime
* Secondary
* Successful upgrade completion
* Coherent decision making, particularly under duress, understanding decentralization can be messy
* Determine driving priorities, e.g. bring the network up as safely and quickly as possible, drive upgrades through in the interest of progress, etc.?
* The process around this was fairly ad hoc, with emoji-based voting. Debate is healthy, but the process seemed very chaotic. Do we need some form of stake-weight-based voting?
* Develop process
* Pre-execution
* Execution
* Current process [here](https://docs.google.com/document/d/e/2PACX-1vQVMKVqh1o-qF4Z8ATM-Qxa_FhT96qoCmqGHFroq4e1EbrHkfg-H8Q8Z_FRnX3mYOxRgBbK6yzDols_/pub)
* Self-reported status tracking?
* Identify supporting tooling
* In case of a restart, it is necessary to find the last block after which there were no user transactions. A misstep here led to an additional delay in the restart process. It also felt like not everyone understood what was expected here or what the tooling output actually meant; for the most part, validators were blindly following the doc instructions. It would help validators to know what each command does.
* Validators should be required to build their own snapshots and retain their previous ledger. Downloading snapshots, or trashing the ledger, should be reserved for exceptional cases. In those cases, we need global/CDN snapshot availability. It would be good to have a community-owned/funded CDN (on S3, GCS) to which multiple validators can upload snapshots; it would definitely speed up the process. The people with access to the CDNs are not always the same ones who can take local snapshots.
* Process diagram/playbook?
* Timely participation, i.e. getting people to show up
* Administrative
* Who, i.e. minimum viable set of decision makers
* Where, to maximize signal and minimize noise
* Motivations
* Incentives
* Disincentives
* Do's and Don'ts
* Do
* TBD
* Don't
* Upgrade on a Friday
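
The stake-weight-based voting idea raised above could be sketched roughly as follows. This is a minimal illustration and not an agreed process: the `Vote` record, the `tally` function, and the two-thirds threshold are all assumptions made for the example, and stake figures would have to come from a trusted source such as on-chain data.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Vote:
    validator: str  # hypothetical identity label for a verified validator
    stake: int      # active stake, in any consistent unit
    choice: str     # e.g. "restart" or "wait"


def tally(votes, total_stake, threshold=Fraction(2, 3)):
    """Sum stake per choice and return (winner, totals).

    winner is None if no choice reaches the threshold share of total_stake.
    The 2/3 default is an illustrative assumption, not a network rule.
    """
    if total_stake <= 0:
        raise ValueError("total_stake must be positive")
    seen = set()
    totals = {}
    for vote in votes:
        if vote.validator in seen:
            continue  # only the first vote from each validator counts
        seen.add(vote.validator)
        totals[vote.choice] = totals.get(vote.choice, 0) + vote.stake
    for choice, stake in totals.items():
        if Fraction(stake, total_stake) >= threshold:
            return choice, totals
    return None, totals


votes = [
    Vote("validator-a", 50, "restart"),
    Vote("validator-b", 30, "restart"),
    Vote("validator-c", 20, "wait"),
]
winner, totals = tally(votes, total_stake=100)
print(winner, totals)
```

Measuring each choice against total stake, rather than against stake that voted, means validators who do not show up count against reaching a decision. Whether that is desirable is one of the open questions in this section.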
## 4 - Suggested Next Steps
Note: To be updated by Chris based on call notes.
Reference: [Solana Plans to Improve Network Upgrades](https://solana.com/news/plans-to-improve-the-network-upgrades)
### Specific to v1.14 Upgrade
* Validators & RPC operators to take more control of deployment schedules.
* TBD
### More Generally
* Work on [community-led postmortem](https://docs.google.com/document/d/1cOJwG2zGjttkat0O3lxR5iNBgvGTPcOLqu3bGkVCCJ0/edit?usp=sharing) and document a timeline of the incident from the validators' perspective (ask SolBlaze for edit access)
## 5 - Contributors to the agenda
* Brian Long, BlockLogic and Triton One
* Chris Remus, Chainflow
* SolBlaze
## 6 - Call notes and key takeaways
### Call notes
Call notes taken by Max Sherwood from H20 nodes [here](https://docs.google.com/document/d/13fnbAzKlMQmzWShHl0lIeWhCXpkz26tUNIZOgGOsahk/edit).
### Key takeaways
* Follow the documented procedure; make sure the procedure is documented, accessible, and understood by validators
* Primary goals (decisions should be made in support of these goals)
* Safety and security, i.e. preserving chain integrity
* Reduce downtime
* Receive proof that a restart is unnecessary, otherwise restart
* Determine more objective criteria and guiding principles to timebox the decision-making window, e.g. 20-30 minutes (or less?), understanding that each situation is at least slightly unique
* Use some type of (onchain/offchain/combination) tooling to poll verified validators during the decision-making process
* To-do
* Define the criteria and guiding principles used when deciding when/how to move forward
* Identify tooling to use to make decisions
* Empowering broader validator participation
* Validators have different levels of knowledge and understanding
* Help validators increase understanding
* Possibly test validator understanding
* Develop additional tooling to level the playing field
* Provide more comprehensive documentation
* Validator spec
* Validator monitoring, i.e. which metrics to keep an eye on and how to recognize when they're hitting danger zones
* Develop self-reporting tool to track validator participation during restarts
* Increase visibility of who's working and where they are in the process
* Understand who's available to make decisions
* To-do
* Identify and develop necessary docs
* Identify and develop necessary tooling
* Identify and develop necessary spec and monitoring information
* Develop and implement self-reporting tool
* How do we poll and whose voices are most important?
* Current polling techniques are ineffective; we need something better
* Verified validators and vouched-for RPC operators should have a voice
* How to reach quorum?
* Each situation is at least slightly different, so we need to develop principles rather than binary decisions
* Criteria
* Stake weight
* Number of delegators
* Other (TBD)
* To-do
* Implement a gated Discord channel, i.e. #validators-verified, to filter operators from non-operators
* Define the decision-making process, select an off-chain tool to execute it, and implement the tool
* Determine criteria for a quorum
* On verifying and disseminating snapshots
* A better way to disseminate snapshots is needed
* Snapshots can be used safely
* Ideally validators generate their own
* Second best, validators verify their bank hash then download a snapshot corresponding to the bank hash
* To-do
* Form a team to spec and implement snapshot distribution network/platform
* Better document the process for validators to verify their bank hash
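
As a rough illustration of the "second best" option above, the sketch below filters snapshot offers down to those that match the slot and bank hash a validator has verified locally. Everything here is hypothetical: the `SnapshotOffer` record, the `matching_offers` function, and the placeholder hash strings are assumptions for the example, and how the local bank hash is obtained is left to the documented restart procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotOffer:
    source: str     # hypothetical label for whoever hosts the snapshot
    slot: int       # slot the snapshot was taken at
    bank_hash: str  # bank hash the host reports for that slot


def matching_offers(local_slot, local_bank_hash, offers):
    """Keep only offers whose slot and bank hash both match the local values."""
    return [
        offer
        for offer in offers
        if offer.slot == local_slot and offer.bank_hash == local_bank_hash
    ]


offers = [
    SnapshotOffer("cdn-a", 100, "HASH_X"),
    SnapshotOffer("cdn-b", 100, "HASH_Y"),  # same slot, different bank hash
    SnapshotOffer("cdn-c", 99, "HASH_X"),   # same hash string, different slot
]
for offer in matching_offers(100, "HASH_X", offers):
    print(offer.source)
```

The point of the check is that a validator never downloads a snapshot on trust alone: an offer is only considered if it agrees with state the validator has already verified itself.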
## 7 - Consolidated To-Do's
Found [here](https://hackmd.io/@KFEZk8oMTz6vBlwADz0M4A/Sk7eAvR0j).