Anson Lau
# Secure Validator Setup

Over the last year, we have started to see a number of projects moving from PoW to PoS, which creates a whole new industry for running a staking business. In some sense, running a validator is like operating a cryptocurrency exchange, because the underlying staked assets could potentially be worth more than a few million dollars. Thus, as a node operator you have to maintain a heightened level of security for your validator, as well as a robust network architecture, to make sure that the value at stake is not at risk of being lost.

There are many different ways to architect your network infrastructure. The goal of running a validator is to avoid being slashed, and the two common slashing situations are server downtime and equivocation (i.e., double signing). The amount of effort spent on achieving these goals depends on what level of security you need: someone with a few thousand dollars behind their validator is going to have a different threat model than somebody with several million. Everyone is also encouraged to come up with their own designs, so that validators cannot all be hacked in the same way. That being said, there are some common tips and techniques that will be useful for anyone interested in running a validator. Without further ado, we will first go through some existing validator architecture designs, and then look at each of the different areas individually.

### Existing approaches (Cosmos, Tezos, etc.)

We will focus on the design of the network architecture in this section. No matter what kind of validator you are trying to run, it will have a similar network design, so we will take Cosmos as a reference.

The simplest approach would have only a validator node running without any firewall, with the p2p port accessible from the Internet. This means that anyone can know the IP address of the node, which is not ideal and opens additional attack vectors.

The second approach is to include a firewall in front of the validator.
This means that the firewall can perform deep packet inspection, which can prevent attacks such as SYN floods. The disadvantage of this setup is that there is always a possibility of the validator going down, so when your node has an outage or is otherwise unreachable, your validator could be slashed. Another problem is that this setup has no DoS resistance, which means that it could easily be attacked in this manner.

Another approach is to separate the network architecture into two layers: the first has "sentry" nodes facing the Internet that hide the validator entirely from the public, and the validator sits in a firewalled private network that is only accessible by the sentry nodes. This approach is an effective way to mitigate DDoS attacks by deploying multiple sentry nodes on different cloud environments. You can also add an additional layer of private sentry nodes as a middle layer, in order to increase separation between the public sentry nodes and validators. You can read more about this layered network architecture [here](#Layered-Network-Architecture).

Unfortunately, this design still has an availability problem when the validator is down. One can mitigate this by deploying an additional validator to support the [High Availability (HA)](#High-Availability-(Optional)) feature. This way, even if one of your validators goes down, there is a backup validator which can replace it, thus mitigating potential slashing for unresponsiveness. Of course, if both your primary and secondary validators are unresponsive, you will still be slashed. Thus, your primary and secondary validators should be, at a minimum, geographically separated, and you can vary other aspects (e.g., operating system, hardware, service provider) to help ensure that both validator nodes are unlikely to go down at the same time.
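
To make the backup-validator idea concrete, here is a minimal Python sketch of the decision a standby node might make in an active-standby pair. It assumes the active node sends periodic heartbeats. It also adds a step that the text does not mention: the standby is only promoted once the active node has been *fenced* (confirmed stopped, e.g. powered off). Fencing is a standard HA technique for avoiding two simultaneous signers. `StandbyMonitor`, `record_heartbeat`, `should_promote` and `timeout_s` are hypothetical names and are not part of any validator software.

```python
import time
from dataclasses import dataclass, field


@dataclass
class StandbyMonitor:
    """Hypothetical promotion logic for the standby node of an active-standby pair."""
    timeout_s: float  # silence after which the active node is presumed down
    last_heartbeat: float = field(default_factory=time.monotonic)
    promoted: bool = False

    def record_heartbeat(self, now: float) -> None:
        # Called whenever a heartbeat from the active validator arrives.
        self.last_heartbeat = now

    def should_promote(self, now: float, active_fenced: bool) -> bool:
        # Missing heartbeats alone are not enough: a network partition can make a
        # healthy, still-signing active node look dead. Promoting in that case
        # would mean two nodes signing with the same key (equivocation), so we
        # also require the active node to be fenced (confirmed stopped).
        silent = now - self.last_heartbeat > self.timeout_s
        if silent and active_fenced and not self.promoted:
            self.promoted = True
            return True
        return False


monitor = StandbyMonitor(timeout_s=30.0, last_heartbeat=0.0)
monitor.record_heartbeat(10.0)
print(monitor.should_promote(now=25.0, active_fenced=False))  # False: heartbeat is recent
print(monitor.should_promote(now=50.0, active_fenced=False))  # False: silent, but not fenced
print(monitor.should_promote(now=50.0, active_fenced=True))   # True: silent and fenced
```

Without the fencing check, a standby that merely loses contact with a still-running active node would start signing with the same key. That is exactly the double-signing risk that makes HA setups dangerous for validators.
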
Besides these architectural decisions, there are approaches to improve key management by making keys inaccessible to anyone who intrudes into the validator servers. These approaches include using Hardware Security Modules (HSMs) (see [here](https://cosmos.network/docs/cosmos-hub/validators/validator-faq.html#technical-requirements) for Cosmos' suggested list of HSMs that support Ed25519). Cosmos also introduced a Key Management System (KMS), which provides a unified API for validators that manage their keys from different sources, such as HSMs, and offers double-signing protection. It is recommended to host the KMS on a separate machine for better security and risk management. This helps to ensure that your system does not have a single point of failure. Remember that an active validator must be up and running 24/7 in order to avoid slashing; this means that malicious users can attempt to attack your validator at any time!

There is a great article about [Cosmos Hub Architecture](https://iqlusion.blog/a-look-inside-our-validator-architecture) written by Tony Arcieri and Shella Stephens.

### Layered Network Architecture

The concept of a layered network architecture is discussed in a post in the [Cosmos forum](https://forum.cosmos.network/t/sentry-node-architecture-overview/454). The goal is to mitigate DDoS attacks by running multiple public full nodes on different cloud providers and making those nodes the only way to talk to the validator. The validator itself is secured behind a firewall or in a private network. The public "sentry" nodes can run on cloud providers. They have no stake at risk, and if they are down for a while they can be replaced quickly without any disruption to the validator's work. It is also worth running the public nodes on different providers/availability regions, so that the validator is not affected by any individual provider's outage.
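
As a rough illustration of the sentry layout described above, the following Python sketch checks a hypothetical inventory of nodes against three properties from the text:

- the validator is not reachable from the Internet;
- the sentries span at least two providers;
- the sentries expose only the p2p port.

The `Node` record and the `check_layered_setup` function are assumptions for illustration. In practice this information would come from your own infrastructure inventory or cloud APIs. The example uses 30333, Substrate's default p2p port; adjust it to your configuration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical inventory record for one machine."""
    name: str
    provider: str               # hosting provider label, e.g. "provider-a"
    public: bool                # True if the node is reachable from the Internet
    open_ports: frozenset[int]  # ports exposed to the Internet


def check_layered_setup(validator: Node, sentries: list[Node],
                        p2p_port: int, min_providers: int = 2) -> list[str]:
    """Return the problems found in a sentry layout (an empty list means none)."""
    problems = []
    # The validator must sit in the private network, reachable only by sentries.
    if validator.public or validator.open_ports:
        problems.append(f"validator {validator.name} is reachable from the Internet")
    # Spreading sentries over providers limits the impact of one provider's outage.
    providers = {s.provider for s in sentries}
    if len(providers) < min_providers:
        problems.append(f"sentries span {len(providers)} provider(s), need {min_providers}")
    # Sentries should publicly expose nothing but the p2p port.
    for s in sentries:
        extra = s.open_ports - {p2p_port}
        if extra:
            problems.append(f"sentry {s.name} exposes extra ports {sorted(extra)}")
    return problems


validator = Node("validator-1", "bare-metal-dc", public=False, open_ports=frozenset())
sentries = [
    Node("sentry-a", "provider-a", public=True, open_ports=frozenset({30333})),
    # sentry-b also exposes an extra port (e.g. an RPC port) to the Internet.
    Node("sentry-b", "provider-a", public=True, open_ports=frozenset({30333, 9944})),
]
for problem in check_layered_setup(validator, sentries, p2p_port=30333):
    print(problem)
```

This layout reports two problems. Both sentries run on the same provider, and `sentry-b` exposes more than the p2p port.
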
That said, there are other ways to mitigate DDoS attacks; we encourage everyone to build their own designs to avoid homogeneity.

### High Availability (Optional)

A validator node must be up and running 24/7, with uptime as close to 100% as possible. If an active validator becomes unreachable, a portion of that validator's stake may be slashed. By setting up HA, you can make your validator more robust than a single validator node: even if one of your validator nodes fails to connect, you still have another one to participate in the validation process. Below are two examples of a high-availability setup:

#### Active - Standby

Imagine there are two validators: one is active and the other is on standby (failover). We can keep track of the heartbeat of the active one, and in the case of a problem, the standby will take over immediately. It is important to make sure these two are configured in the same way: whatever you change on the active validator must also be changed on the standby.

#### Active - Active

Imagine we add a load balancer to connect these two validators, with an algorithm that decides which validator handles each request. It may use round-robin or another algorithm. However, great care must be taken to ensure that there is no equivocation (double signing), because two validators are running simultaneously with the same validator key.

Either approach is valid; which one to choose depends entirely on what you want to achieve.

### Hardware Security Module (HSM)

An [HSM](https://en.wikipedia.org/wiki/Hardware_security_module) is a hardware component that stores your keys inside a tamper-resistant secure element, so the keys are never exposed to the file system of your machine during signing. This makes it extremely hard for an attacker to extract the private key. If your validator stores its secret keys in plain text on the same machine, your keys will be easily exposed if your validator is hacked.
At that point, the attacker can double sign and cause your validator's stake to be slashed. It is important to find a hardware component dedicated to storing your keys.

Currently, there are a few types of HSM available on the market. [YubiHSM2](https://www.yubico.com/products/yubihsm/) is the most widely used among Cosmos validators because it supports the Ed25519 curve. Validators can also use a cloud HSM such as [AWS](https://aws.amazon.com/cloudhsm/?nc1=h_ls) or [GCP](https://cloud.google.com/hsm/?hl=en) to store their keys. However, before choosing a cloud HSM, you should check whether the necessary curve is supported. For example, at the time of this writing, because the Ed25519 curve is newer, the only cloud HSM provider that supports it is Microsoft Azure.

That said, validators may be managing millions of dollars' worth of assets, so storing your keys in a cloud HSM is not recommended. When you use a cloud HSM, you must trust the solution provider, so consider how much you trust the provider when deciding whether to set up a secure validator with a cloud HSM. Moreover, a cloud HSM is relatively expensive compared with a YubiHSM2: on [Azure](https://azure.microsoft.com/en-in/pricing/details/azure-dedicated-hsm/), it costs more than $3,000 per month.

Another possibility is to use a custom remote signing server, which offers a similar level of security. With this setup, a separate server is the only system that performs signing. If you want to verify that your architecture and general setup are secure, you can also have a third party audit your setup and publish the result.

#### "Hacking" HSM PEM keys for Tezos

https://blog.polychainlabs.com/tezos/2019/05/28/encoding-tezos-ec-keys.html

Something similar could be done for sr25519 keys.

### Monitoring Tools

- [Telemetry](https://github.com/paritytech/substrate-telemetry): tracks your node details, including the version you are running, block height, CPU and memory usage, block propagation time, etc.
- A [Prometheus](https://prometheus.io/)-based monitoring stack, including [Grafana](https://grafana.com) for dashboards and log aggregation. It provides alerting, querying, visualization and monitoring features, and works for both cloud and on-premises systems. The data from substrate-telemetry can be made available to Prometheus through exporters such as [this one](https://github.com/w3f/substrate-telemetry-exporter).

### Linux Best Practices

- Never use the root user.
- Always apply the latest security patches for your OS.
- Enable and set up a firewall.
- Never allow password-based SSH; only use key-based access.
- Disable non-essential SSH subsystems (banner, motd, scp, X11 forwarding) and harden your SSH configuration ([reasonable enough guide to begin with](https://stribika.github.io/2015/01/04/secure-secure-shell.html)).
- Back up your storage regularly.

## Conclusions and Proposal

* We should not expose validators to the public internet; they should only be accessible by allowed parties. Therefore, we propose a layered approach in which the validators are isolated from the internet and connect to the Polkadot network via an intermediate layer of public-facing nodes.
* At the moment, Polkadot/Substrate can't interact with HSM/SGX, so we need to provide the signing key seeds to the validator machine. These keys are kept in memory for the lifetime of the node.
* With HA setups we would always be at risk of double signing, and there is no built-in mechanism to prevent it. We therefore propose running a single instance of the validator so that we won't be slashed for equivocation. Slashing penalties for being offline are much smaller than those for equivocation.

### Validators

* Should only run the Polkadot/Substrate binary, and should not listen on any port other than the configured p2p port.
* Should run on bare-metal machines, as opposed to VMs. This will prevent some of the availability issues with cloud providers, along with potential attacks from other VMs on the same hardware.
* The provisioning of the validator machine should be automated and defined in code. This code should be kept in private version control, reviewed, audited and tested.
* Signing and node keys should be provided in a secure way. [WIP: Developing RPC to rotate Session keys.]
* Polkadot/Substrate should be started at boot and restarted if it stops for any reason (supervisor process).
* Polkadot/Substrate should run as a non-root user.
* Each validator should connect to the Polkadot network through a set of at least 2 public-facing nodes (set through `--reserved-nodes`). The connection is made through a VPN, and the machine can't access the public internet, so the VPN is the only possible connection.

### Public-Facing Nodes

* At least two nodes are associated with each validator. They run on at least two different cloud providers and only expose the p2p port publicly.
* They can run as containers on Kubernetes, where we can declaratively define the desired state (number of replicas always up, network and storage settings). The connection between the validator and the public-facing nodes is made through a VPN. The nodes have the common Kubernetes security setup in place (restrictive service account, pod security policy and network policy).
* Node keys should be provided in a secure way.
* Only the Substrate container runs, with no additional services. The VPN agent should run as a sidecar in the same pod (sharing the same network stack).

### Monitoring

* Public-facing nodes and the validator are monitored, and alerts are defined for several failure conditions.
* An on-call rotation is defined for managing the alerts.
* There is a clear runbook with actions to perform for each level of each alert, and an escalation policy.
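
The monitoring requirements above (alerts for failure conditions, a runbook action for each alert level) can be sketched as data plus a small evaluator. This is a minimal Python sketch under assumed metric names (`up`, `best_block`, `prev_best_block`, `peers`) and made-up rules. Only the two-peer threshold echoes the "at least 2 public-facing nodes" requirement. A real deployment would express such rules in its monitoring stack, for example as Prometheus alerting rules, rather than in application code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class AlertRule:
    name: str
    severity: str                      # "critical" or "warning" in this sketch
    condition: Callable[[Dict], bool]  # evaluated against one metrics snapshot
    runbook: str                       # first action for the on-call engineer


# Hypothetical rules; metric names, thresholds and actions are assumptions.
RULES = [
    AlertRule("NodeDown", "critical", lambda m: not m["up"],
              "Check the host and the supervisor process; escalate per policy."),
    AlertRule("BlockHeightStalled", "critical",
              lambda m: m["best_block"] <= m["prev_best_block"],
              "Compare block height with the public-facing nodes; check peers and disk."),
    AlertRule("FewPeers", "warning", lambda m: m["peers"] < 2,
              "Check the VPN links to the public-facing nodes."),
]

SEVERITY_ORDER = {"critical": 0, "warning": 1}


def evaluate(rules: List[AlertRule], metrics: Dict) -> List[Tuple[str, str, str]]:
    """Return (severity, name, runbook) for every firing rule, most severe first."""
    fired = [(r.severity, r.name, r.runbook) for r in rules if r.condition(metrics)]
    return sorted(fired, key=lambda alert: SEVERITY_ORDER[alert[0]])


snapshot = {"up": True, "best_block": 1200, "prev_best_block": 1200, "peers": 1}
for severity, name, action in evaluate(RULES, snapshot):
    print(f"[{severity}] {name}: {action}")
```

Keeping the runbook action next to each rule is one way to make sure every alert level has a documented response. The on-call rotation and the escalation policy still live outside the code.
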
## References

- https://medium.com/figment-networks/full-disclosure-figments-cosmos-validator-infrastructure-3bc707283967
- https://kb.certus.one/
- https://github.com/slowmist/eos-bp-nodes-security-checklist
- https://forum.cosmos.network/t/sentry-node-architecture-overview/454
- https://medium.com/loom-network/hsm-policies-and-the-importance-of-validator-security-ec8a4cc1b6f
