# Alert brainstorming

## Levels

* critical
  * It is broken, user / tenant impact
* warning
  * It is broken, but no user / tenant impact
  * We need to react right now, it should not be postponed
* informational
  * It is going to be broken soon
  * We can postpone to later

## Brainstorming (unsorted)

* Services on Home DC
  * Master PowerDNS down 5m | critical
    * Updates to the CDN will fail and impact tenants
  * Vault is sealed | critical
    * Tenants cannot upload certificates
    * Edge nodes cannot retrieve certificates
* Services on Service Nodes
  * PowerDNS
    * One or more PowerDNS down for 5m | informational
    * All PowerDNS in one location down 5m | warning
    * All PowerDNS in all locations down 2m | critical
  * Vault agent on the service node down
    * 1 down per PoP | informational
    * All down per PoP | critical
* Services on Edge nodes
  * Nginx/Openresty
    * 1..(n-1) down per PoP | NginxDown | informational
    * All down per PoP | PoPAllNginxDown | critical
* VPN
  * Strongswan:
    * VPN tunnel to Location is down | VPNTunnelLocationDown | critical
  * Frrouting:
    * 10.190.x.0/24 is not reachable from 10.183.5.0/24 | VPNRoutingLocationDown | critical
  * Impacts:
    * SolariaCDN.com DNS zone transfer not working
    * Delivery API cannot fetch new configurations: varnish, nginx and certificates
    * Monitoring of PoP fails
    * Billing metrics and tenant logging get stuck
  * Can we add the Prometheus master/main IPv4 address as a reference, or multiple IPs in the Home DC?
  * Is there any Prometheus exporter for Strongswan and Frrouting?
* Networking
  * IPv4 addresses of switches are unreachable from outside
    * for an individual PoP: warning
    * for all PoPs: critical
  * We need to ensure that the monitoring has an upstream
    * This can be solved later, we are aware of it.
* PoP down alerts
  * If all edge nodes of a PoP are down | critical
    * Because the PoP might be the only one used by a tenant
  * Service nodes
    * 1 down | informational
      * Everything might still be running
    * All down | warning
      * The PoP will not be updated anymore
      * Inconsistencies expected to happen soon
  * Prometheus of a PoP down | warning
  * Promtail
    * for billing metrics
  * PowerDNS slaves

## Discussion / not fully defined

* Service nodes
  * 2 DNS per PoP
* Networking aspects
  * IPv4 addresses of HTTP service unreachable | critical
    * This PoP is not reachable anymore
  * IPv4 addresses of DNS service unreachable | warning

## When is the CDN down

* If one PoP is down
  * Because tenants might have configured a specific PoP
* If all DNS servers are down
  * Even if the rest is still up

## When is a PoP down

* If all HTTP services are down
* If all HTTPS services are down
* If a PoP is not reachable from the outside
  * Can be done implicitly via the HTTP/HTTPS services
  * But should also be done explicitly by pinging PoP-specific IPs
    * Usually I would do this by pinging the upstream ISP-assigned IPs (test1)
    * And our own IPs (test2)
* If its IPv4 network is not announced via BGP
  * A good test would be to check 2-3 looking glass installations and do the logical AND
  * Program: if all_looking_glass do not contain $pop network -> pop down | PoPBGPDown | critical
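
The looking glass check for PoPBGPDown could be sketched roughly as follows. This is only a minimal sketch under assumptions: the function name `pop_bgp_down`, the shape of the `looking_glass_routes` input (looking glass name mapped to the prefixes it currently sees) and the example prefixes (RFC 5737 documentation ranges) are hypothetical, and how the prefixes are actually fetched from each looking glass is left open. It only illustrates the logical AND: the PoP counts as down only if every looking glass is missing the PoP network.

```python
from ipaddress import ip_network


def pop_bgp_down(pop_network, looking_glass_routes):
    """Return True only if NO looking glass sees the PoP network.

    pop_network: prefix announced by the PoP, e.g. "192.0.2.0/24"
                 (hypothetical example value).
    looking_glass_routes: dict mapping a looking glass name to the list
                 of prefixes it currently reports (hypothetical input
                 shape, fetching is out of scope here).
    """
    target = ip_network(pop_network)

    # Without any looking glass data we cannot tell, so do not alert.
    if not looking_glass_routes:
        return False

    # Logical AND over "this looking glass does not contain the prefix".
    return all(
        target not in {ip_network(prefix) for prefix in prefixes}
        for prefixes in looking_glass_routes.values()
    )


if __name__ == "__main__":
    routes = {
        "lg-a": ["192.0.2.0/24", "198.51.100.0/24"],
        "lg-b": ["198.51.100.0/24"],
    }
    # One looking glass still sees the prefix, so no alert fires.
    print(pop_bgp_down("192.0.2.0/24", routes))
```

Requiring all looking glasses to agree keeps a single stale or broken looking glass from firing a critical alert. The sketch compares prefixes for an exact match, so a PoP network that is only visible as part of a larger aggregate would need extra handling.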