Anson Lau
# Secure Validator Setup

Over the last year, we have started to see a number of projects moving from PoW to PoS, which creates a whole new industry for running a staking business. In some sense, running a validator is like operating a cryptocurrency exchange, because the underlying staked asset could potentially be worth more than a few million dollars. Thus, as a node operator you have to maintain a heightened level of security for your validator, as well as a robust network architecture, to make sure that the value at stake is not at risk of being lost.

There are many different ways to architect your network infrastructure. The goal of running a validator is to not be slashed, and the two common slashing situations are server downtime and equivocation (i.e., double signing). The amount of effort spent on achieving these goals depends on the level of security you need: someone with a few thousand dollars behind their validator is going to have a different threat model than somebody with several million. This variety also encourages everyone to come up with their own designs, to avoid being hacked in the same way as other validators.

That being said, there are some common tips and techniques that will be useful for anyone interested in running a validator. Without further ado, we will first go through some existing validator architecture designs, and then look into each of the different areas individually.

### Existing approaches (Cosmos, Tezos, etc)

We will focus on the design of the network architecture in this section. No matter what kind of validator you are trying to run, the network design will be similar, so we will take Cosmos as a reference.

The simplest approach would have only a validator node running without any firewall, with the p2p port accessible from the Internet. This means that anyone can learn the IP address of the node, which is not ideal and opens additional attack vectors.

The second approach is to include a firewall in front of the validator.
This means that the firewall can be used for deep packet inspection, which can prevent attacks such as SYN floods. The disadvantage of this setup is that a validator can always go down, so when your node has an outage or is otherwise unreachable, the validator could be slashed. Another problem is that it does not have DoS resistance, which means that it could easily be attacked in this manner.

Another approach would be to separate the network architecture into two layers: the first has "sentry" nodes facing the Internet that hide the validator entirely from the public, and the validator sits in a firewalled private network that is only accessible by the sentry nodes. This approach is an effective way to mitigate DDoS attacks, by deploying multiple sentry nodes on different cloud environments. You can also add a layer of private sentry nodes as a middle layer, in order to increase the separation between the public sentry nodes and the validators. You can read more about this layered network architecture [here](#Layered-Network-Architecture).

Unfortunately, this design still has an availability problem when the validator is down. One can mitigate this by deploying an additional validator to support the [High Availability (HA)](#High-Availability-(Optional)) feature. This way, even if one of your validators goes down, there is a backup validator which can replace it, thus mitigating potential slashing from nonresponsiveness. Of course, if both your primary and secondary validators are nonresponsive, you will still be slashed. Thus, your primary and secondary validators should be, at a minimum, geographically separate, but you can vary other aspects (e.g. operating system, hardware, service provider) to help ensure that both validator nodes are unlikely to go down at the same time.
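
As a rough illustration of the separation advice above, the following Python sketch compares the deployment attributes of a primary and a secondary validator and reports what they still have in common. The `ValidatorHost` type, its fields, and the example values are all hypothetical and only serve to make the idea concrete; they are not taken from any validator tooling.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ValidatorHost:
    """Deployment attributes of one validator machine (all values illustrative)."""
    name: str
    region: str
    provider: str
    operating_system: str
    hardware: str


def shared_attributes(primary: ValidatorHost, secondary: ValidatorHost) -> list[str]:
    """Return the attributes (other than the name) that both hosts have in common."""
    return [
        f.name
        for f in fields(ValidatorHost)
        if f.name != "name" and getattr(primary, f.name) == getattr(secondary, f.name)
    ]


def meets_minimum_separation(primary: ValidatorHost, secondary: ValidatorHost) -> bool:
    """Minimum requirement from the text: the two hosts are geographically separate."""
    return primary.region != secondary.region


# Hypothetical example hosts.
primary = ValidatorHost("val-a", "eu-west", "provider-1", "debian", "vendor-x")
secondary = ValidatorHost("val-b", "us-east", "provider-1", "ubuntu", "vendor-x")

print(meets_minimum_separation(primary, secondary))
print(shared_attributes(primary, secondary))
```

Every attribute returned by `shared_attributes` is a potential common cause of failure, so a shorter list suggests the two validators are less likely to go down together.
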

Besides these architectural decisions, there are approaches that improve key management by keeping the keys out of reach of an intruder on the validator servers. These include using Hardware Security Modules (HSM) (see [here](https://cosmos.network/docs/cosmos-hub/validators/validator-faq.html#technical-requirements) for Cosmos' suggested list of HSMs that support ed25519). Cosmos also introduced a Key Management System (KMS), which has a unified API to support validators that manage their keys from different sources, such as an HSM, and which provides double-signing protection. It is recommended to host the KMS on a separate machine for better security and risk management. This helps to ensure that your system does not have a single point of failure. Remember that an active validator must be up and running 24/7 in order to avoid slashing; this means that malicious users can attempt to attack your validator at all times!

There is a great article about [Cosmos Hub Architecture](https://iqlusion.blog/a-look-inside-our-validator-architecture) written by Tony Arcieri and Shella Stephens.

### Layered Network Architecture

The concept of a layered network architecture is discussed in a post on the [Cosmos forum](https://forum.cosmos.network/t/sentry-node-architecture-overview/454). The goal is to mitigate DDoS attacks by running multiple public full nodes on different cloud providers, and making those nodes the only way to talk to the validator. The validator itself is secured behind a firewall or private network. The public "sentry" nodes can run on cloud providers; they do not have any stake at risk, and if they are down for a while they can be replaced quickly without any disruption to the validator's work. It is also worth running the public nodes in different providers/availability regions, so that the validator is not affected by any individual provider outage.
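
The properties described above (a hidden validator, sentries as its only peers, and tolerance of a single provider outage) can be checked mechanically on a description of the topology. The sketch below is a minimal illustration in Python; the `Node` type, the function name, and the provider labels are assumptions made for this example, not part of Cosmos or any other project.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    provider: str
    public: bool  # True if the p2p port is reachable from the Internet
    peers: set[str] = field(default_factory=set)


def check_layered_topology(validator: Node, sentries: list[Node]) -> list[str]:
    """Return a list of problems; an empty list means the layout passes."""
    problems = []
    sentry_names = {s.name for s in sentries}

    if validator.public:
        problems.append("validator is exposed to the Internet")
    if not validator.peers:
        problems.append("validator has no sentry peers")
    if not validator.peers <= sentry_names:
        problems.append("validator peers with nodes that are not sentries")

    # Simulate the outage of each provider in turn and check that the
    # validator still has at least one sentry left to talk to.
    for provider in sorted({s.provider for s in sentries}):
        survivors = {s.name for s in sentries if s.provider != provider}
        if validator.peers and not (validator.peers & survivors):
            problems.append(f"outage of {provider} would isolate the validator")
    return problems


# Hypothetical layout: one private validator behind two sentries.
validator = Node("validator", "bare-metal", public=False, peers={"sentry-a", "sentry-b"})
sentries = [Node("sentry-a", "cloud-1", True), Node("sentry-b", "cloud-2", True)]
print(check_layered_topology(validator, sentries))
```

If both sentries were placed with the same provider, the same check would report that an outage of that provider isolates the validator.
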

That said, there are other solutions to mitigate DDoS attacks; we encourage everyone to build their own designs, to avoid homogeneity.

### High Availability (Optional)

A validator node must be up and running 24/7, with as close as possible to 100% uptime. If an active validator becomes unreachable, a portion of that validator's stake will be slashed. By setting up HA, you can make your validator more robust than a single validator node. Even if one of your validators fails to connect, you still have another to participate in the validation process. Below are two examples of a high availability set-up:

#### Active - Standby

Imagine there are two validators: one is active and the other is on standby (failover). We keep track of the heartbeat of the active one and, in the case of a problem, the standby takes over immediately. It is important to make sure these two are configured in the same way: whatever you change on the active validator must also be changed on the standby.

#### Active - Active

Imagine we add a load balancer in front of these two validators, with an algorithm that decides which validator handles each request. It may use round-robin or another algorithm, but you must be really careful to make sure that there is no equivocation (double signing), as two validators are running simultaneously with the same validator key.

Either of these approaches is valid; choosing one depends entirely on what you want to achieve.

### Hardware Security Module (HSM)

An [HSM](https://en.wikipedia.org/wiki/Hardware_security_module) is a hardware component that stores your keys inside a tamper-proof secure element, so that they are never exposed to the file system of your machine during signing. This makes it extremely hard for an attacker to extract the private key. If your validator stores the secret keys in plain-text format on the same machine, your keys will be easily exposed if your validator is hacked.
At this point, the attacker can double sign to cause your validator's stake to be slashed. It is therefore important to find a hardware component that is dedicated to storing your keys.

Currently, there are a few types of HSM available on the market. [YubiHSM2](https://www.yubico.com/products/yubihsm/) is the most widely used among Cosmos validators because it supports the Ed25519 curve. Validators can also use a CloudHSM such as [AWS](https://aws.amazon.com/cloudhsm/?nc1=h_ls) or [GCP](https://cloud.google.com/hsm/?hl=en) to store their keys. However, before choosing a CloudHSM, you should check whether the necessary curve is available. For example, at the time of this writing, because the Ed25519 curve is newer, the only CloudHSM which supports it is Microsoft Azure.

That said, validators manage assets worth millions of dollars, so storing your keys on a CloudHSM is not recommended. When you use a CloudHSM, you trust the solution provider, so consider how much you trust the provider when you decide to set up a secure validator with a CloudHSM. Moreover, compared with YubiHSM2, a CloudHSM is relatively expensive: with [Azure](https://azure.microsoft.com/en-in/pricing/details/azure-dedicated-hsm/), it costs more than $3,000 per month.

Another possibility is to use a custom remote signing server, which offers a similar level of security. With this setup, a separate server is the single system that does the signing.

If you want to verify that your architecture and general setup are secure, you can also have a third party audit your setup and publish the result.

#### "Hacking" HSM PEM keys for Tezos

https://blog.polychainlabs.com/tezos/2019/05/28/encoding-tezos-ec-keys.html

^ Something similar could be done for sr25519.

### Monitoring Tools

- [Telemetry](https://github.com/paritytech/substrate-telemetry): This tracks your node details, including the version you are running, block height, CPU & memory usage, block propagation time, etc.
- [Prometheus](https://prometheus.io/)-based monitoring stack, including [Grafana](https://grafana.com) for dashboards and log aggregation. It includes alerting, query, visualization and monitoring features, and works for both cloud and on-premises systems. The data from substrate-telemetry can be made available to Prometheus through exporters like [this](https://github.com/w3f/substrate-telemetry-exporter).

### Linux Best Practices

- Never use the root user.
- Always apply the security patches for your OS.
- Enable and set up a firewall.
- Never allow password-based SSH; only use key-based access.
- Disable non-essential SSH subsystems (banner, motd, scp, X11 forwarding) and harden your SSH configuration ([reasonable enough guide to begin with](https://stribika.github.io/2015/01/04/secure-secure-shell.html)).
- Back up your storage regularly.

## Conclusions and Proposal

* We should not expose validators to the public internet; they should only be accessible by allowed parties. Therefore, we propose a layered approach in which the validators are isolated from the internet and connect to the Polkadot network via an intermediate layer of public-facing nodes.
* At the moment, Polkadot/Substrate can't interact with HSM/SGX, so we need to provide the signing key seeds to the validator machine. This key is kept in memory during the lifetime of the node.
* Given that with HA setups we would always be at risk of double signing, and there is no built-in mechanism to prevent it, we propose to have a single instance of the validator so that we won't be slashed for this reason. Slashing penalties for being offline are much lower than those for equivocation.

### Validators

* Should only run the Polkadot/Substrate binary, and they should not listen on any port other than the configured p2p port.
* Should run on bare-metal machines, as opposed to VMs. This will prevent some of the availability issues with cloud providers, along with potential attacks from other VMs on the same hardware.
  The provisioning of the validator machine should be automated and defined in code; this code should be kept in private version control, reviewed, audited, and tested.
* Signing and node keys should be provided in a secure way. [WIP: Developing RPC to rotate Session keys.]
* Polkadot/Substrate should be started at boot and restarted if it is stopped for any reason (supervisor process).
* Polkadot/Substrate should run as a non-root user.
* Each validator should connect to the Polkadot network through a set of at least 2 public-facing nodes (set through `--reserved-nodes`); the connection is done through a VPN and the machine can't access the public internet, thus the only possible connection is through the VPN.

### Public Facing Nodes

* At least two nodes associated with each validator run on at least two different cloud providers, and they only publicly expose the p2p port.
* They can run as containers on Kubernetes, where we can declaratively define the desired state (number of replicas always up, network and storage settings); the connection between the validator and the public-facing nodes is done through a VPN. They have the common Kubernetes security setup in place (restrictive service account, pod security policy and network policy).
* Node keys should be provided in a secure way.
* Only run the Substrate container, no additional services. The VPN agent should run in a sidecar in the same pod (sharing the same network stack).

### Monitoring

* Public-facing nodes and the validator are monitored, and alerts for several failure conditions are defined.
* There's an on-call rotation defined for managing the alerts.
* There's a clear runbook with actions to perform for each level of each alert, and an escalation policy.
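
To make the idea of alert levels with an escalation policy a bit more concrete, here is a minimal Python sketch that maps how long a node has gone without producing or importing a block to an alert level and a runbook action. The level names, thresholds, and actions are purely hypothetical placeholders; a real setup would more likely express them as alerting rules in the monitoring stack.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AlertLevel:
    name: str
    min_seconds_stalled: float  # level applies once the node has been stalled this long
    action: str                 # runbook step for this level


# Hypothetical escalation policy, ordered from least to most severe.
POLICY = [
    AlertLevel("warning", 60, "notify the on-call engineer"),
    AlertLevel("critical", 300, "page the on-call engineer and follow the restart runbook"),
    AlertLevel("escalated", 900, "escalate to the secondary on-call"),
]


def classify(seconds_since_last_block: float,
             policy: list[AlertLevel] = POLICY) -> AlertLevel | None:
    """Return the most severe level whose threshold has been reached, or None."""
    reached = [lvl for lvl in policy if seconds_since_last_block >= lvl.min_seconds_stalled]
    if not reached:
        return None
    return max(reached, key=lambda lvl: lvl.min_seconds_stalled)


level = classify(400)
if level is not None:
    print(f"{level.name}: {level.action}")
```

Keeping the policy as data rather than as nested conditions makes it easier to review alongside the runbook and to change when the on-call rotation changes.
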

## References

* https://medium.com/figment-networks/full-disclosure-figments-cosmos-validator-infrastructure-3bc707283967
* https://kb.certus.one/
* https://github.com/slowmist/eos-bp-nodes-security-checklist
* https://forum.cosmos.network/t/sentry-node-architecture-overview/454
* https://medium.com/loom-network/hsm-policies-and-the-importance-of-validator-security-ec8a4cc1b6f
