# Parity WFP Blockchain Support Handover
This document aims to explain how to use the provided automation scripts to manage the creation, maintenance, and monitoring of a Proof-of-Authority (PoA) Ethereum blockchain infrastructure.
<!-- toc -->
- [Concepts](#concepts)
* [Capabilities](#capabilities)
* [Types of nodes](#types-of-nodes)
* [Technology stack](#technology-stack)
* [Persistent Storage, Backups and Restores](#persistent-storage-backups-and-restores)
* [Network topology and security](#network-topology-and-security)
* [Load Balancing and PrivateLink](#load-balancing-and-privatelink)
* [Peering to other VPC, Regions or Accounts](#peering-to-other-vpc-regions-or-accounts)
- [Defining and deploying blockchain networks](#defining-and-deploying-blockchain-networks)
* [Node configuration](#node-configuration)
+ [Node files layout](#node-files-layout)
+ [Chainspec](#chainspec)
+ [Node keys](#node-keys)
+ [Parity Ethereum command-line arguments](#parity-ethereum-command-line-arguments)
+ [Parity Ethereum reserved-peers](#parity-ethereum-reserved-peers)
+ [Parity Ethereum chain data](#parity-ethereum-chain-data)
* [Terraform configuration overview](#terraform-configuration-overview)
* [Ansible configuration overview](#ansible-configuration-overview)
+ [Global variables](#global-variables)
+ [Monitoring server variables](#monitoring-server-variables)
+ [Blockchain Nodes variables](#blockchain-nodes-variables)
+ [Blockchain validators variables](#blockchain-validators-variables)
- [Runbooks](#runbooks)
* [Create a new environment from scratch](#create-a-new-environment-from-scratch)
+ [Terraform](#terraform)
+ [Ansible](#ansible)
+ [Creating the terraform remote state for the current AWS account](#creating-the-terraform-remote-state-for-the-current-aws-account)
+ [Initializing the infrastructure with Terraform](#initializing-the-infrastructure-with-terraform)
+ [Setting up DNS](#setting-up-dns)
+ [Provisioning instances with Ansible](#provisioning-instances-with-ansible)
+ [Final checks](#final-checks)
* [Update an existing environment](#update-an-existing-environment)
+ [Update Terraform managed instances configuration](#update-terraform-managed-instances-configuration)
+ [Update Ansible provisioning](#update-ansible-provisioning)
* [Troubleshooting](#troubleshooting)
- [Emergency SSH access for troubleshooting nodes](#emergency-ssh-access-for-troubleshooting-nodes)
+ [What could go wrong?](#what-could-go-wrong)
+ [What to do when things go wrong?](#what-to-do-when-things-go-wrong)
+ [Previous issues](#previous-issues)
- [Monitoring](#monitoring)
* [Prometheus UI](#prometheus-ui)
* [Loki UI](#loki-ui)
* [Grafana Dashboards](#grafana-dashboards)
+ [Chain Dashboard](#chain-dashboard)
+ [Ethexporter](#ethexporter)
+ [Node Exporter](#node-exporter)
+ [Prometheus](#prometheus)
* [Alerts](#alerts)
+ [AlertManager Configuration](#alertmanager-configuration)
* [Temporarily disable alerts](#temporarily-disable-alerts)
- [Disaster Recovery](#disaster-recovery)
* [Recreate a single node reusing its existing data disk volume](#recreate-a-single-node-reusing-its-existing-data-disk-volume)
* [Recover a node from a snapshot](#recover-a-node-from-a-snapshot)
<!-- tocstop -->
## Concepts
### Capabilities
The blockchain network's infrastructure-as-code configuration is flexible and can be dynamically:
- Scaled up: upgraded to more powerful machines to allow more transactions per second
- Scaled out: additional nodes added to the network (possibly in other regions) to increase resiliency to node failures
Alternatively, the network can also be scaled down by downgrading to smaller machines or reducing the number of nodes.
### Types of nodes
To achieve a production setup, the network is composed of nodes assuming different roles:
- Validator nodes: produce new blocks and participate in the PoA consensus
- Access nodes: handle RPC requests from external applications
- Archive nodes: similar to access nodes, but they don't prune their state (i.e. they keep the state data for all past blocks)
- Backup nodes: periodically taken offline to snapshot their data for backup purposes
### Technology stack
The WFP blockchain deployment automation is composed of two parts:
- The cloud infrastructure configuration developed using Terraform
- Ansible provisioning scripts
Processes are managed consistently on Linux hosts using systemd.
The main pieces of software deployed on blockchain nodes are:
- Parity Ethereum
- Monitoring agents:
- Ethexporter: a custom python service to expose Ethereum RPC metrics for Prometheus
- NodeExporter: a service to expose system metrics for Prometheus
- Promtail: a service to expose logs for Loki
The monitoring stack is deployed to a dedicated server and is composed of:
- Prometheus: metrics collection and querying
- AlertManager: connected to the AWS Simple Email Service for sending alerts based on Prometheus metrics
- Loki: logs collection service
- Grafana: monitoring dashboards
- Nginx: proxy exposing the monitoring tools with IP whitelisting and TLS encryption via Let's Encrypt
### Persistent Storage, Backups and Restores
All blockchain node instances have two Elastic Block Store (EBS) volumes attached:
- The **root** volume (~16 GB), which hosts the OS, installed software and configuration (mounted on `/`)
- The **data-disk** volume (variable size), which hosts the chain data (mounted on `/data/chains`)
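A quick way to verify both volumes on a node (a sketch; device names depend on the instance type):
```shell
# List attached block devices: the root and data-disk volumes should both appear
lsblk
# Check that the data disk is mounted on /data/chains and how much space is left
df -h / /data/chains
```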
### Network topology and security
Each blockchain network is deployed to its own dedicated VPC network which is subdivided into a public and private subnet in two availability zones (eu-central-1a and eu-central-1b).
The only internet-facing entry points into the network are:
- The two bastion hosts in zone-a and zone-b
- (Optionally) The monitoring server
All blockchain nodes are in the private subnet with no internet gateway access. To download software from the internet, they have to go through the proxy server on the bastion host.
Only access and archive nodes allow inbound connections to the Ethereum RPC endpoint on port 8545, which is exposed through a Load Balancer as a VPC Service Endpoint.
This Service Endpoint can then be consumed by applications in other VPCs (possibly even in a different AWS account) through a [PrivateLink](https://aws.amazon.com/privatelink/).

### Load Balancing and PrivateLink
The access and archive nodes are exposed behind an AWS Network load balancer which is exposed cross-account using PrivateLink.
The convention used is that a single Load Balancer is set up and accessible via AWS PrivateLink (VPC Endpoint) to expose both access and archive nodes:
- Access nodes on blockchain.testing|staging|production.wfp.parity.io:8545
- Archive nodes on blockchain.testing|staging|production.wfp.parity.io:8546
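As an illustration, once the consumer side has enabled private DNS, an application in the connected VPC can reach the endpoints like this (a sketch, assuming plain HTTP on these ports):
```shell
# Latest block number from the access nodes (port 8545)
curl -s -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  http://blockchain.testing.wfp.parity.io:8545

# Same request against the archive nodes (port 8546)
curl -s -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  http://blockchain.testing.wfp.parity.io:8546
```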
### Peering to other VPC, Regions or Accounts
To connect the blockchain network to external validators in other AWS VPCs (possibly in different regions or accounts), a peering connection needs to be set up.
The provided Terraform configuration makes it easy to set up one or several such peerings using the `vpc_peerings` and `vpc_peerings_to_accept` blocks to either request or accept a peering connection.
## Defining and deploying blockchain networks
For each environment (wfp-testing, wfp-staging, wfp-production), external configuration files (**tfvars** and **inventory** files) are used to define environment-specific properties that can be applied to the system in an immutable way through Terraform and Ansible.
### Node configuration
#### Operating System
The OS in use is Ubuntu 20.04, set up from its [official AMI](https://cloud-images.ubuntu.com/locator/ec2/). It is provisioned with Ansible using the `wfp-base-setup.yaml` playbook and `base` role, which sets up the following:
- Preserve AWS hostname
- Configure journald logging
- Set up an HTTP proxy for apt caching using apt-cacher-ng
- Set up chrony for NTP management as a replacement of systemd-timesyncd
#### Node files layout
The node configuration layout in `/home/parity/` is the following:
```text
.
├── .local
│   └── share
│       └── io.parity.ethereum
│           ├── keys
│           │   └── wfp
│           │       ├── 0                  # Ethereum Wallet (private key)
│           │       └── address_book.json
│           └── network
│               └── key                    # Node private key
├── chainspec.json                         # Chain configuration
├── parity_pre_start                       # Pre-start script
├── pass                                   # Ethereum Wallet password file
└── reserved-peers                         # Boot nodes list
```
The node data layout in `/data` is:
```text
.
└── chains
    ├── ver.lock                           # parity-ethereum version
    └── wfp
        ├── db
        │   └── c7749611fc14f846
        │       ├── overlayrecent
        │       │   └── db
        │       │       ├── XXXXXX.sst     # RocksDB Static Sorted Table
        │       │       ├── XXXXXX.log     # RocksDB Log
        │       │       └── ...            # Other data files
        │       └── snapshot
        └── network
            └── nodes.json                 # Connected node list
```
#### Chainspec
The Ethereum blockchain itself is configured in the chainspec file present on the node at `/home/parity/chainspec.json`.
The chainspec for each environment is defined in `roles/parity-ethereum-systemd/files/chainspec_$ENVIRONMENT.json`.
#### Node keys
Each Ethereum node has a private key and a public key. To keep the same node identity when recreating instances, these are configured in the Ansible inventory for each node (in the `nodekey` and `public_nodekey` variables).
To create a new set of public/private node keys, install the Geth **bootnode** binary (https://geth.ethereum.org/downloads/) and run the following playbook:
```shell
ansible-playbook misc/generate-node-identity.yaml
```
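For reference, an equivalent key pair can also be produced manually with the `bootnode` binary (a sketch; file names are illustrative and the playbook's exact steps may differ):
```shell
# Generate a new node private key
bootnode -genkey node.key
# Print the corresponding public node key (the enode ID)
bootnode -nodekey node.key -writeaddress
```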
#### Parity Ethereum command-line arguments
Parity Ethereum is configured using command-line arguments provided in `/etc/default/parity-ethereum`.
They are configured from Ansible in `roles/parity-ethereum-systemd/templates/default.j2`.
For example:
```shell
# /etc/default/parity-ethereum
ARGS="\
--chain chainspec.json \
--reserved-peers reserved-peers \
--db-compaction ssd \
--auto-update none \
--jsonrpc-apis personal,web3,eth,parity,parity_set,net,traces,rpc,parity_accounts,signer \
--jsonrpc-hosts all \
--jsonrpc-interface all \
--jsonrpc-cors '*' \
--jsonrpc-server-threads 8 \
--ws-hosts all \
--tx-queue-per-sender 8192 \
--tx-queue-size 16536 \
--scale-verifiers \
--no-periodic-snapshot \
--logging rpc=trace,txqueue=trace \
--cache-size 4096 \
--db-path /data/chains \
"
```
#### Parity Ethereum reserved-peers
The Parity Ethereum reserved peers (the nodes it will attempt to connect to on startup) are configured in `/home/parity/reserved-peers`.
They are generated by Ansible in the `parity-ethereum-systemd` role, from the `public_nodekey` inventory variables and the `external_validators_enodes` list.
For example:
```text
# /home/parity/reserved-peers
enode://ba527c97b7161f2cae9108c20c5624e3c180585d4f1dae56e6becce98caa602cdfa02e7d09513607ea7ea2e35efd715f4c0be486598f43ec07e4ed0e0bf0dc8d@wfp-testing-validator-node-1.testing.wfp.internal:30303
enode://82bcfa5d36a77e1468d9c1f9ac531638ae44e69d6a1b9cc6d65e9b91d6db40016c45660b32e9241526b531e76e3d56059104c5b5e1746b6d36df0f8ac25004b8@wfp-testing-validator-node-2.testing.wfp.internal:30303
enode://1845641d2581abd15c1319f7727cacc9428b059dcf6a4fce2ce5618c4d711232f26487f0253f2180e64e2a51cf0d651502b50103aa5a28ff261db845f8cf412d@wfp-testing-validator-node-3.testing.wfp.internal:30303
enode://3b41b6fce6256378e849c2b7622acf20eed5fa6af997b6ef55a7640d073dea9aca3032fda184fe43148d7dd6ce81543b561d02fc964b1202e7f54dea1bb1a86e@wfp-testing-validator-node-4.testing.wfp.internal:30303
```
#### Parity Ethereum chain data
The chain data is stored on a separate "data-disk" volume which is typically mounted at `/data/chains`.
### Terraform configuration overview
The Terraform code is structured such that it can be used to deploy multiple blockchain networks from the same code.
By providing dedicated `tfvars` property files for each environment, operators can manage deployments to different accounts and regions, with different instance types and variations in public or private exposure as needed.
Changes to the underlying Terraform code should be made only to improve the current setup or extend it to new use cases.
Each environment's infrastructure configuration is defined in the `.tfvars` file with the following variables:
- VPC configuration: `vpc_cidr`, `availability_zones`, `private_subnets`, `public_subnets` are used to define your network
- VPC peering: `vpc_peerings` and `vpc_peerings_to_accept` are used to define the requester and accepter side of a VPC peering
- IP access lists: `ssh_ip_access_list` (which IPs are allowed to SSH to the bastion hosts) and `monitoring_ip_access_list` (which IPs are allowed to access the monitoring host)
- `public_dns_zone` (exposed on the internet) and `private_dns_zone` (available inside the VPC only)
- `ec2_ssh_public_key`: The initial EC2 SSH public key for provisioning
- `initial_database_disk_snapshot_id` : (Optional) an EBS snapshot ID to load on the node data disk
- Machine configuration for every host in the network:
* `ami`: Amazon Machine Image ID (specific to each region, for Ubuntu the IDs are listed at https://cloud-images.ubuntu.com/locator/ec2/)
* `instance_type`: An [EC2 instance type](https://aws.amazon.com/ec2/instance-types/)
* `subnet`: `private` or `public` depending on whether you want the instance to be part of the `public` subnet and as such be exposed to the internet.
* `availability_zone`: The [Availability Zone](https://aws.amazon.com/about-aws/global-infrastructure/regions_az/) in which to create the instance
* `volume_size_gb`: The size of the attached data volume
* `iops`: The number of provisioned IOPS for the data volume
### Ansible configuration overview
Remark: Secrets in the inventory are encrypted with a password using [Ansible Vault](https://docs.ansible.com/ansible/latest/user_guide/vault.html).
To provision instances, 3 playbooks are available:
- `wfp-base-setup.yaml`: add administrators' public SSH keys and run basic setup tasks on hosts
- `wfp-parity-ethereum.yaml`: set up Parity Ethereum on blockchain nodes with systemd
- `wfp-monitoring.yaml`: set up the monitoring server and agents
The Ansible configuration is defined in `inventory_$ENVIRONMENT.yaml`.
#### Global variables
- `ansible_ssh_common_args`: The SSH command flags
- `http_proxy`: The value to set to HTTP_PROXY and HTTPS_PROXY variables for all hosts
- `domain`: The environment name
- `internal_domain`: The internal DNS name (note: DNS records are already set up with Terraform for `${host}.${internal_domain}`)
- `parity_chain_dir`: The directory where the chain data is located
- `parity_ethereum_chainspec`: The chainspec file to use (found in `roles/parity-ethereum-systemd/files/`)
- `parity_ethereum_custom_args`: Custom arguments provided in the `/etc/default/parity-ethereum` file
- `external_validators_enodes`: List of external validator enode addresses (`enode://public_nodekey@host:30303`) to connect to on startup
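As an illustration, a minimal sketch of how these globals might be laid out in `inventory_$ENVIRONMENT.yaml` (all values below are placeholders following the naming conventions in this document, not the real configuration):
```yaml
all:
  vars:
    ansible_ssh_common_args: "-F wfp-testing.ssh_config"
    # Proxy on the bastion host; hostname and port are illustrative
    http_proxy: "http://wfp-testing-bastion-eu-central-1a.testing.wfp.internal:3142"
    domain: "wfp-testing"
    internal_domain: "testing.wfp.internal"
    parity_chain_dir: "/data/chains"
    parity_ethereum_chainspec: "chainspec_wfp-testing.json"
    parity_ethereum_custom_args: ""
    external_validators_enodes: []
```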
#### Monitoring server variables
- `domain_name`: The DNS name at which the monitoring service will be reachable
- `monitoring_allowed_ips`: List of IPs allowed to access the unauthenticated endpoints (e.g. `/prometheus/`)
- `prometheus_retention_days` and `loki_retention_period`: the time periods for which to retain metrics and logs
- `smtp_*`: the SMTP server through which to send alerts and Grafana registration emails
- `smtp_to` and `smtp_to_critical`: the list of emails to receive standard/critical alerts
- `dead_man_snitch_email`: The email to send watchdog alerts to (a snitch needs to be created at https://deadmanssnitch.com/)
- `runbook`: URL which contains troubleshooting instructions (it will be linked to in email alerts)
- `network_elb_regex`: Regex to match AWS ELBs to monitor in Prometheus
- `grafana_admin_password` and `grafana_users`: Users to create in Grafana
#### Blockchain Nodes variables
- `nodekey`: the node private key (can be generated with the `misc/generate-node-identity.yaml` playbook )
- `public_nodekey`: the node public key (used to produce the list of boot nodes in `/home/parity/reserved-peers`)
#### Blockchain validators variables
- `engine_signer`: the address used to participate in block validation consensus
- `keyfile`: the parity wallet key file
- `parity_password`: The parity wallet password saved in `/home/parity/pass`
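Putting the node and validator variables together, a validator host entry might look like the following sketch (group and host names follow the conventions used elsewhere in this document; all key material shown is a placeholder and would normally be encrypted with Ansible Vault):
```yaml
validators:
  hosts:
    wfp-testing-validator-node-1:
      nodekey: "<hex-encoded node private key>"      # vault-encrypted in practice
      public_nodekey: "<hex-encoded public key>"     # used to build reserved-peers
      engine_signer: "0x0000000000000000000000000000000000000000"
      keyfile: "<JSON wallet key file>"              # vault-encrypted in practice
      parity_password: "<wallet password>"           # written to /home/parity/pass
```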
## Runbooks
This section lists the procedures used to create and update the blockchain infrastructure.
### Create a new environment from scratch
#### Terraform
Before running Terraform commands against an AWS account, make sure to load your AWS credentials and check that you are authenticated against the correct account:
```shell
export AWS_ACCESS_KEY_ID=<AWS_ID>; export AWS_SECRET_ACCESS_KEY=<AWS_SECRET>
aws sts get-caller-identity
```
#### Ansible
Before using the `ansible-playbook` command, make sure that you have access to the Ansible Vault password for the project.
For convenience, this can be saved to a file and passed to Ansible with the `--vault-password-file /path/to/wfp_vault_file` flag.
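For example (a sketch; the file location is only a convention):
```shell
# Store the vault password with restrictive permissions
echo 'the-vault-password' > ~/.wfp_vault_file
chmod 600 ~/.wfp_vault_file

# Pass it explicitly on each run...
ansible-playbook -i inventory_wfp-testing.yaml wfp-monitoring.yaml --vault-password-file ~/.wfp_vault_file
# ...or export it once for the session
export ANSIBLE_VAULT_PASSWORD_FILE=~/.wfp_vault_file
```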
#### Creating the terraform remote state for the current AWS account
/!\ These commands should be executed only once per AWS account.
Initialize the S3 bucket and DynamoDB table used to store the Terraform state:
```shell
cd terraform/00_init_remote_terraform_state
terraform init && terraform apply
```
Note down the S3 bucket and DynamoDB table names from the outputs and set them in `wfp-terraform.sh`, together with the region and environment name you want to use:
```bash
  wfp-production)
    export TF_VAR_terraform_state_bucket=wfp-terraform-state-063921503c3a7b1a
    export TF_VAR_terraform_lock_table=wfp-terraform-state-lock-063921503c3a7b1a
    export TF_VAR_aws_region=eu-central-1
    export TF_VAR_environment=wfp-production
    ;;
```
#### Initializing the infrastructure with Terraform
Initialize the environment with:
```shell
cd terraform/01_infrastructure
../wfp-init-env.sh $WFP_ENVIRONMENT $AWS_REGION $AWS_STATE_ID
```
Example for `wfp-production`:
```shell
../wfp-init-env.sh wfp-production eu-central-1 063921503c3a7b1a
```
This script does two things:
- Initialize the environment Terraform workspace in the remote S3 bucket (using Terraform workspaces, the same S3 bucket can be used to manage one or multiple environments)
- Generate a set of SSH keys
Save the private key created in `/tmp/$ENVIRONMENT.pem` to a secure vault; it will be used as the initial EC2 SSH key for new instances.
Create a new Terraform configuration in `terraform/$ENVIRONMENT.tfvars`. For example:
```hcl
# Global parameters
global_tags = {
  Terraform = "true"
  Environment = "wfp-production"
}
# Network configuration
availability_zones = ["eu-central-1a", "eu-central-1b"]
vpc_cidr = "10.10.0.0/16"
private_subnets = ["10.10.1.0/24", "10.10.2.0/24"]
public_subnets = ["10.10.101.0/24", "10.10.102.0/24"]
public_dns_zone = "production.wfp.parity.io"
private_dns_zone = "production.wfp.internal"
// Parity VPN IP
ssh_ip_access_list = ["212.227.252.235/32"]
// Parity VPN IP
monitoring_ip_access_list = ["212.227.252.235/32"]
// Peering requests
vpc_peerings = {
  unw-production = {
    peer_account_id: "****"
    peer_vpc_id: "vpc-****"
    peer_vpc_cidr: "10.1.0.0/16"
  }
}
// Peering accepts
vpc_peerings_to_accept = {
}
// AWS Principals allowed to connect to the VPC Endpoint (AWS PrivateLink)
endpoint_allowed_principals = [
  "arn:aws:iam::472240007126:user/pierre.besson@parity.io",
  "arn:aws:iam::472240007126:user/fabio.tranchitella@wfp.org",
]
# Instances
ec2_ssh_public_key = "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCjY6c971Pk8k2iE4aZZkrnQYyWjDu4N8C9XDN+w4trMVV84Grnt+IyCDigpDSh50y4i7pH3MTds2dm1PCsULqWSEZK8ljocBb/VgLdYoliNI0tOXdrbzFCOudnONSoAr/2xInIJvAafN2tVQ0row53XWlm0/L3X6t16dodEOjMyFTnoZ6UaCm53rbUu1SGoXAIVQIVvw/qrJ6WuArIe7Y9ALsAX4SF1xtug6iDv2afZadc2Vyuf4/BPbEhE7abHxiKeoBY1A5vIZcCkVBMsWXgwgd8G2KZEiGeOoun5560QZDkZyCwnis8lg5mXT+eKsHJwjWLdDXnw39JVDTAVfnzK2Sg52YWZmCtedXtl2RGwmEF9XO3sRKfIYtNOeuDZXtQgFRhjqMKrbpDyv6v0i39d4pBgKwryQdhBTSakk8Wgkb9oFYj3J6Sw1oMubxU/Yb2g2nVl9mhD5l7p83iSS4ju6zvBsY2W5logiXojKkl4buEV7/zgMfZAIFUyw899Y0= wfp-production-key"
//initial_database_disk_snapshot_id = ""
monitoring = {
  "monitoring-1" = {
    ami: "ami-0502e817a62226e03" // Ubuntu 20.04
    instance_type: "t3.medium"
    subnet: "public"
    availability_zone: "eu-central-1a"
    volume_size_gb: 64
  }
}
bastions = {
  "bastion-eu-central-1a": {
    ami: "ami-0502e817a62226e03" // Ubuntu 20.04
    instance_type: "t3.micro"
    subnet: "public"
    availability_zone: "eu-central-1a"
  },
  "bastion-eu-central-1b": {
    ami: "ami-0502e817a62226e03" // Ubuntu 20.04
    instance_type: "t3.micro"
    subnet: "public"
    availability_zone: "eu-central-1b"
  }
}
# Blockchain nodes
validator_nodes = {
  "validator-node-1": {
    ami: "ami-0502e817a62226e03" // Ubuntu 20.04
    instance_type: "m5.large"
    subnet: "private"
    availability_zone: "eu-central-1a"
    volume_size_gb: 100
    provisioned_iops: 1000
  }
  "validator-node-2": {
    ami: "ami-0502e817a62226e03" // Ubuntu 20.04
    instance_type: "m5.large"
    subnet: "private"
    availability_zone: "eu-central-1b"
    volume_size_gb: 100
    provisioned_iops: 1000
  }
  "validator-node-3": {
    ami: "ami-0502e817a62226e03" // Ubuntu 20.04
    instance_type: "m5.large"
    subnet: "private"
    availability_zone: "eu-central-1a"
    volume_size_gb: 100
    provisioned_iops: 1000
  }
  "validator-node-4": {
    ami: "ami-0502e817a62226e03" // Ubuntu 20.04
    instance_type: "m5.large"
    subnet: "private"
    availability_zone: "eu-central-1b"
    volume_size_gb: 100
    provisioned_iops: 1000
  }
}
access_nodes = {
  "access-node-1": {
    ami: "ami-0502e817a62226e03" // Ubuntu 20.04
    instance_type: "m5.large"
    subnet: "private"
    availability_zone: "eu-central-1a"
    volume_size_gb: 100
    provisioned_iops: 1000
  },
  "access-node-2": {
    ami: "ami-0502e817a62226e03" // Ubuntu 20.04
    instance_type: "m5.large"
    subnet: "private"
    availability_zone: "eu-central-1b"
    volume_size_gb: 100
    provisioned_iops: 1000
  }
}
archive_nodes = {
  "archive-node-1": {
    ami: "ami-0502e817a62226e03" // Ubuntu 20.04
    instance_type: "m5.large"
    subnet: "private"
    availability_zone: "eu-central-1a"
    volume_size_gb: 500
    provisioned_iops: 1000
  }
}
backup_nodes = {
  "backup-node-1": {
    ami: "ami-0502e817a62226e03" // Ubuntu 20.04
    instance_type: "m5.large"
    subnet: "private"
    availability_zone: "eu-central-1b"
    volume_size_gb: 100
    provisioned_iops: 1000
  }
}
```
Make sure to define the bastion, monitoring, and blockchain instances you need, then apply it from the `01_infrastructure` directory:
```shell
../wfp-terraform.sh $ENVIRONMENT init
../wfp-terraform.sh $ENVIRONMENT apply
```
This will set up the following components:
- VPC and subnets
- DNS Zone
- Simple Email Service and SMTP credentials
- Service endpoints
#### Setting up DNS
Go to the Route53 AWS console and note down the AWS name servers.

In your registrar, add an NS delegation record pointing to these name servers.

After DNS has propagated over the internet, the following commands should resolve properly:
```shell
dig NS production.wfp.parity.io
dig monitoring.production.wfp.parity.io
```
In the SES Console, the domain should show up as verified (DKIM verification is optional for successfully sending emails).

#### Provisioning instances with Ansible
As a result of applying the Terraform, a `$ENVIRONMENT.ssh_config` file should have been created in `ansible/`. It allows easy SSH access to your hosts:
```shell
ssh -F $ENVIRONMENT.ssh_config $HOST_NAME
```
You will need to provide a comprehensive `inventory_$ENVIRONMENT.yaml` including a section for each host.
After creating the instances with Terraform, you will first need to run the `wfp-base-setup.yaml` playbook.
This has to be done by providing the environment EC2 private key and using the default `ubuntu` user. Note: the bastion hosts have to be provisioned first, as they host a proxy server that the other hosts will use to retrieve software from the internet:
```shell
ansible-playbook -l bastions -i inventory_wfp-staging.yaml --private-key ~/wfp-staging.pem -u ubuntu wfp-base-setup.yaml
ansible-playbook -l accessnodes,backupnodes,archivenodes,validators,monitoring -i inventory_$ENVIRONMENT.yaml --private-key /path/to/$ENVIRONMENT.pem -u ubuntu wfp-base-setup.yaml
```
Then set up Parity Ethereum on all blockchain nodes:
```shell
ansible-playbook -l validators,accessnodes,backupnodes,archivenodes -i inventory_$ENVIRONMENT.yaml wfp-parity-ethereum.yaml --vault-password-file /path/to/wfp_vault_file
```
Finally, set up the monitoring server and agents on all hosts:
```shell
ansible-playbook -i inventory_$ENVIRONMENT.yaml wfp-monitoring.yaml --vault-password-file /path/to/wfp_vault_file
```
#### Final checks
After all the setup steps are complete, perform the following checks:
- [ ] In Prometheus, all targets are UP (See https://monitoring.$ENVIRONMENT.wfp.parity.io/prometheus/targets)
- [ ] In Grafana Ethexporter dashboards, all nodes are visible and the block height graph is increasing for all nodes.
#### Connect the external application to the VPC Service Endpoint (PrivateLink)
##### Setup Private DNS for Private Link
Note: this step has to be done manually as Terraform didn't support this resource as of the end of 2020.
After the VPC endpoint has been set up using Terraform, open the **VPC Endpoint Services** section in the AWS Console:
1. Right-click the VPC endpoint service and select **"modify private dns"**
2. In the popup check **"Associate a private DNS Name with the service"** and set it to `blockchain.$ENVIRONMENT.wfp.parity.io`.
3. Note down the **Domain verification name** and **Domain verification value** strings in the details of your endpoint service.
4. Go to your `production.wfp.parity.io` zone in Route53, add a TXT record, and set the "verification name" string in the **"Record name"** field and the "verification value" in the **"Value"** field.
5. Right-click the VPC endpoint service and select **"verify domain ownership"** and click "Verify".
6. At the end of the process, the **Domain verification status** should be 'Verified'.
##### Connect the VPC endpoint to an external VPC
Before connecting any external VPC to the VPC endpoint, make sure that `endpoint_allowed_principals` includes at least one principal with permission to create the connection in the target VPC's account.
1. On the external VPC account
- Go to VPC -> Endpoints -> Create Endpoint then select "Find service by name".
- Input the `service_name` field retrieved from the Terraform outputs and choose the VPC which will consume the endpoint.
2. On the blockchain account, check the Endpoint service's **"Endpoint Connection"** tab and accept the connection request that should be pending.
3. On the external VPC account, allow the connection once again and make sure to check the **"enable private DNS box"**.
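For reference, the acceptance in step 2 can also be performed with the AWS CLI instead of the console (a sketch; the service and endpoint IDs below are placeholders):
```shell
# List pending connection requests for the endpoint service
aws ec2 describe-vpc-endpoint-connections \
  --filters Name=service-id,Values=vpce-svc-0123456789abcdef0
# Accept the pending connection coming from the external account
aws ec2 accept-vpc-endpoint-connections \
  --service-id vpce-svc-0123456789abcdef0 \
  --vpc-endpoint-ids vpce-0123456789abcdef0
```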
### Update an existing environment
#### Update Terraform managed instances configuration
The Terraform configuration defined in `$ENVIRONMENT.tfvars` can be changed and reapplied with `../wfp-terraform.sh $ENVIRONMENT apply`. However, the following rules apply:
- New instances added will need to be provisioned with Ansible
- Existing instances will be updated in place when changing the following parameters: `instance_type`, `volume_size_gb`, `iops`.
- Existing instances will be destroyed and recreated when changing the following parameters: `ami`, `subnet`, `availability_zone`.
- When an instance is recreated, it will reattach to its existing data disk volume but will lose all its installed software and will need to be reprovisioned with Ansible.
- The `initial_database_disk_snapshot_id` variable is only taken into account when first creating the node and data disk volume.
#### Update Ansible provisioning
To update the Ansible configuration on already provisioned hosts, simply rerun the playbooks. You can target hosts (with `-l`) and tags (with `-t`) to update specific components, e.g.:
```shell
ansible-playbook -l monitoring -t grafana -i inventory_$ENVIRONMENT.yaml wfp-monitoring.yaml --vault-password-file /path/to/wfp_vault_file
```
### Troubleshooting
#### Emergency SSH access for troubleshooting nodes
Administrators can SSH into nodes to troubleshoot them. The following commands may be useful:
```shell
ssh -F wfp-staging.ssh_config wfp-staging-access-node-1  # SSH into a node
sudo journalctl -f -u parity-ethereum                    # Stream logs
sudo systemctl status parity-ethereum                    # See parity-ethereum status
sudo systemctl [stop/start/restart] parity-ethereum      # Stop/Start/Restart parity-ethereum
```
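It can also help to query the node's local RPC endpoint directly to check sync state and peer count (a sketch, assuming the default RPC port 8545 is reachable on localhost):
```shell
# Returns false when the node is fully synced
curl -s -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' http://localhost:8545
# Number of connected peers (hex encoded)
curl -s -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"net_peerCount","params":[],"id":1}' http://localhost:8545
```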
#### What could go wrong?
- Infra and DevOps:
* Not processing incoming transactions
* A missing transaction
* Infrastructure issues such as connectivity or disk space
- Data related issues:
* Backup and restore
#### What to do when things go wrong?
- DevOps and Infra: check the grafana dashboards ([Ethexporter](#ethexporter) and [Node exporter](#node-exporter))
- Data related issues: check the [Loki logs](#loki-ui)
Fixes: refer to the [restore from snapshot disaster recovery process](#recover-a-node-from-a-snapshot)
#### Previous issues
- Disk space issue:
In the event of a disk space issue, it will most likely be on the access node due to log files growing too large (note: log rotation is in place to remove old logs before they can cause an issue).
Alerts are also set up on high disk usage, so such issues should be prevented in the future.
- Missing transaction:
In the event that a transaction is missing, the first place to check is the access node logs in Loki.
## Monitoring
The monitoring tool UIs are available at the following URLs:
- Testing:
* Grafana: https://monitoring.testing.wfp.parity.io/
* Prometheus: https://monitoring.testing.wfp.parity.io/prometheus/graph
* Alertmanager: https://monitoring.testing.wfp.parity.io/alertmanager/
* Loki: https://monitoring.testing.wfp.parity.io/explore
- Staging:
* Grafana: https://monitoring.staging.wfp.parity.io/
* Prometheus: https://monitoring.staging.wfp.parity.io/prometheus/graph
* Alertmanager: https://monitoring.staging.wfp.parity.io/alertmanager/
* Loki: https://monitoring.staging.wfp.parity.io/explore
- Production:
* Grafana: https://monitoring.production.wfp.parity.io/
* Prometheus: https://monitoring.production.wfp.parity.io/prometheus/graph
* Alertmanager: https://monitoring.production.wfp.parity.io/alertmanager/
* Loki: https://monitoring.production.wfp.parity.io/explore
Note: since Prometheus and Alertmanager have no authentication of their own, they are available only to allowed IPs (defined in the inventory).
### Logs and metrics retention
Metrics are kept in Prometheus and logs in Loki for a retention period which can be set in the Ansible inventory file.
For example, in the current configuration metrics are retained for 3 months and logs for 6 weeks:
```yaml
monitoring:
  hosts:
    wfp-staging-monitoring-1:
      prometheus_retention_days: 91 # 3 months
      # must be a multiple of 168h
      loki_retention_period: 1008h # 6 weeks
```
### Prometheus UI
The Prometheus UI can be used to discover metrics and perform a quick analysis of metrics data.

### Loki UI
The Loki UI (part of Grafana Explore view) can be used to view logs from all instances.

### Grafana Dashboards
Grafana is used to provide customizable dashboards. Initial Grafana users are created from the inventory configuration.
To invite additional users by email, go to **Configuration -> Users** and click the **Invite** button.
#### Chain Dashboard
This dashboard lets you view the global state of the blockchain. Make sure to select the node you want to retrieve information from.
- Testing: https://monitoring.testing.wfp.parity.io/d/3ErqZbmGz/chain-dashboard
- Staging: https://monitoring.staging.wfp.parity.io/d/3ErqZbmGz/chain-dashboard
- Production: https://monitoring.production.wfp.parity.io/d/3ErqZbmGz/chain-dashboard
Metrics displayed:
- Highest block number
- Actual block duration over time (30s and 5m rolling average)
- Transactions (total and pending)
- Block size
- Block authors by Validator node ID
- Server response time
- RPC requests per second (aggregated and for each method)

#### Ethexporter
This dashboard lets you view the Ethexporter metrics for each node.
Metrics displayed:
- AWS LB network metrics
- Block height (per instance/author)
- Block time per instance
- Block transactions (total/pending)
- Peer count (the number of peers each node is connected to)


#### Node Exporter
This dashboard lets you view node exporter metrics (cpu, mem, disk, network, ...) for each instance (blockchain, bastion and monitoring hosts). Make sure to select the host(s) you want to show information from.
- Testing: https://monitoring.testing.wfp.parity.io/d/vIJjttzGk/node-exporter-server-metrics
- Staging: https://monitoring.staging.wfp.parity.io/d/vIJjttzGk/node-exporter-server-metrics
- Production: https://monitoring.production.wfp.parity.io/d/vIJjttzGk/node-exporter-server-metrics
Metrics displayed:
- CPU utilization
- Memory utilization
- Load average
- Disk space used/left
- Disk IO/ Throughput
- Network statistics
#### Prometheus
This dashboard is used to monitor the Prometheus server.
### Alerts
Alerts are sent with AlertManager based on Prometheus metrics.
- Testing: https://monitoring.testing.wfp.parity.io/prometheus/alerts
- Staging: https://monitoring.staging.wfp.parity.io/prometheus/alerts
- Production: https://monitoring.production.wfp.parity.io/prometheus/alerts
#### AlertManager Configuration
- Alertmanager receivers are configured to send alerts by email in the `Prometheus-alertmanager` role
- Alerts are defined in the `prometheus` role's `templates/alerting_rules.yml.j2` file.
- The following alerts are defined:
* System-level errors (node exporter unreachable, low disk, high cpu, high mem, time drift, ntp problem)
* Dead Man's Snitch Watchdog: a dummy alert is sent periodically to a third party service (deadmanssnitch.com) which will alert if it receives nothing for 15 minutes
* Cloudwatch: monitor the AWS Elastic Load Balancer state
* Prometheus: monitor Prometheus for issues
* Blockchain errors:
- block production: no new blocks produced in the last minute
- rpc unhealthy
- ethexporter down
- block authoring: check that in the last 5 minutes the correct number of validators were authors of new blocks (this helps monitor external validators)
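As an illustration, a block-production rule of this kind could look roughly like the sketch below (the metric name, threshold, and labels are assumptions for illustration, not the deployed rule):
```yaml
groups:
  - name: blockchain
    rules:
      - alert: NoNewBlocks
        # eth_block_number is an assumed Ethexporter metric name
        expr: increase(eth_block_number[1m]) == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No new block produced on {{ $labels.instance }} in the last minute"
```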
### Temporarily disable alerts
When doing a maintenance operation, it can be useful to disable alerts to avoid email spam. To create a silence, SSH into the monitoring instance and run:
```shell
amtool silence add -d 2h -c "maintenance" instance=~"wfp-staging-.*"
```
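Existing silences can be listed and removed early with `amtool` as well (a sketch; depending on the setup the `--alertmanager.url` flag may need to be passed explicitly):
```shell
# List active silences
amtool silence query
# Expire a silence before its duration ends
amtool silence expire <silence-id>
```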
## Disaster Recovery
### Recreate a single node reusing its existing data disk volume
To recreate a blockchain node from scratch, the instance can be recreated in an immutable way with Terraform and reattached to the node's "data disk" EBS volume.
The process to achieve this is the following:
1. Terminate the EC2 instance manually using the AWS Console.
2. Reapply the Terraform for this environment; it will detect the divergence from its state, recreate the missing EC2 instance,
and automatically reattach the data volume and EIP (if one has been attached to the instance).
Note: if you use Terraform targeting for this step, the ssh_config file will not be updated correctly.
3. Reprovision the node from scratch using Ansible, as the new instance did not keep the installed software (just its data under `/data/chains`):
```shell
ansible-playbook -l node-name -i inventory.yaml --private-key ~/key.pem -u ubuntu wfp-base-setup.yaml
ansible-playbook -l node-name -i inventory.yaml wfp-parity-ethereum.yaml --vault-password-file .wfp_vault_file
```
This process does incur some downtime for the node being recreated.
### Recover a node from a snapshot
1. Prepare a new AWS Terraform workspace with its associated configuration in a `tfvars` file.
2. In the `.tfvars` file, set `initial_database_disk_snapshot_id` to an EBS snapshot ID available in the current AWS account which was created from a data disk volume.
3. Apply the Terraform and provision the nodes using Ansible.
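To find a suitable snapshot ID for `initial_database_disk_snapshot_id`, the snapshots owned by the account can be listed with the AWS CLI (a sketch; the tag filter is an assumption about how snapshots are labelled):
```shell
aws ec2 describe-snapshots \
  --owner-ids self \
  --filters Name=tag:Environment,Values=wfp-production \
  --query 'Snapshots[].{Id:SnapshotId,Started:StartTime,SizeGiB:VolumeSize}' \
  --output table
```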