# Alert brainstorming
## Levels
* critical
  * It is broken, with user/tenant impact
* warning
  * It is broken, but no user/tenant impact
  * We need to react right now, it should not be postponed
* informational
  * It is going to be broken soon
  * We can postpone to later
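
The three levels can be read as a small decision rule. A minimal sketch in Python, assuming hypothetical names (`Level`, `classify` and the boolean inputs `broken`, `tenant_impact`, `breaks_soon` are illustrative and not part of these notes):

```python
from enum import Enum
from typing import Optional


class Level(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFORMATIONAL = "informational"


def classify(broken: bool, tenant_impact: bool,
             breaks_soon: bool = False) -> Optional[Level]:
    """Map a condition to an alert level, or None if no alert is needed."""
    if broken:
        # Broken with user/tenant impact is critical, broken without is a warning.
        return Level.CRITICAL if tenant_impact else Level.WARNING
    if breaks_soon:
        # Not broken yet, so handling it can be postponed.
        return Level.INFORMATIONAL
    return None
```

For example, `classify(True, False)` yields `Level.WARNING`, matching "broken, but no user/tenant impact".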
## Brainstorming (unsorted)
* Services on Home DC
  * Master PowerDNS down 5m | critical
    * Updates to the CDN will fail and impact tenants
  * Vault is sealed | critical
    * Tenants cannot upload certificates
    * Edge Nodes cannot retrieve certificates
* Services on Service Nodes
  * PowerDNS
    * One or more PowerDNS down for 5m | informational
    * All PowerDNS in one location down 5m | warning
    * All PowerDNS in all locations down 2m | critical
  * Vault agent on the service node down
    * 1 down per PoP | informational
    * All down per PoP | critical
* Services on Edge nodes
  * Nginx/OpenResty
    * 1..(n-1) down per PoP | NginxDown | informational
    * All down per PoP | PoPAllNginxDown | critical
* VPN
  * strongSwan:
    * VPN tunnel to Location is down | VPNTunnelLocationDown | critical
  * FRRouting:
    * 10.190.x.0/24 is not reachable from 10.183.5.0/24 | VPNRoutingLocationDown | critical
  * Impacts:
    * SolariaCDN.com DNS zone transfer not working
    * Delivery API cannot fetch new configurations: varnish, nginx and certificates
    * Monitoring of PoP fails
    * Billing metrics and tenant logging get stuck
  * Can we add the Prometheus master/main IPv4 address as a reference or multiple IPs in the Home DC?
  * Is there any Prometheus exporter for strongSwan and FRRouting?
* Networking
  * IPv4 addresses of switches are unreachable from outside
    * for an individual PoP: warning
    * for all PoPs: critical
  * We need to ensure that the monitoring has an upstream
    * This can be solved later, we are aware of it.
* PoP down alerts
  * If all edge nodes of a PoP are down | critical
    * Because a tenant might be using only this PoP
  * Service nodes
    * 1 down | informational
      * Everything might still be running
    * All down | warning
      * The PoP will not be updated anymore
      * Inconsistencies are expected to happen soon
  * Prometheus of a PoP down | warning
* Promtail
  * for billing metrics
* PowerDNS slaves
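
Several of the rules above share one pattern: some instances down per PoP gives a lower level, all instances down gives a higher one. A minimal sketch of that pattern in Python, assuming a hypothetical helper `level_for_down_count` and that the number of down and total instances per PoP is already known (how those counts are collected is not covered here):

```python
from typing import Optional


def level_for_down_count(down: int, total: int,
                         partial_level: str = "informational",
                         all_level: str = "critical") -> Optional[str]:
    """Return the alert level for `down` failed instances out of `total` in one PoP."""
    if total <= 0 or not 0 <= down <= total:
        raise ValueError("expected total > 0 and 0 <= down <= total")
    if down == 0:
        return None  # nothing is down, no alert
    if down == total:
        return all_level  # e.g. "All down per PoP"
    return partial_level  # e.g. "1..(n-1) down per PoP"
```

With the defaults this matches the Nginx/Openresty and Vault agent rules. For service nodes, where all down is only a warning, the call would be `level_for_down_count(down, total, all_level="warning")`.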
## Discussion / not fully defined
* Service nodes
  * 2 DNS per PoP
* Networking aspects
  * IPv4 addresses of HTTP service unreachable | critical
    * This PoP is not reachable anymore
  * IPv4 addresses of DNS service unreachable | warning
## When is the CDN down
* If one PoP is down
  * Because tenants might have configured a specific PoP
* If all DNS servers are down
  * Even if the rest is still up
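
The two conditions above combine with a logical OR. A minimal sketch in Python, assuming hypothetical inputs: `pop_up` and `dns_up` map a PoP or DNS server name to its current state, and an empty `dns_up` is treated as "unknown" rather than "all down":

```python
from typing import Mapping


def cdn_is_down(pop_up: Mapping[str, bool], dns_up: Mapping[str, bool]) -> bool:
    """The CDN counts as down if any PoP is down or if every DNS server is down."""
    # One PoP down is enough, because tenants might have configured that specific PoP.
    any_pop_down = any(not up for up in pop_up.values())
    # All DNS servers down counts even if the rest is still up.
    all_dns_down = len(dns_up) > 0 and not any(dns_up.values())
    return any_pop_down or all_dns_down
```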
## When is a PoP down
* If all HTTP services are down
* If all HTTPS services are down
* If a PoP is not reachable from the outside
  * Can be done implicitly via the HTTP/HTTPS services
  * But should also be done explicitly by pinging PoP-specific IPs
    * Usually I would do this by pinging the IPs assigned by the upstream ISP (test1)
    * And our own IPs (test2)
* If its IPv4 network is not announced via BGP
  * A good test would be to check 2-3 looking glass installations and take the logical AND of the results
  * Program: if none of the looking glasses contains the $pop network -> PoP down | PoPBGPDown | critical
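
The looking glass check can be sketched as follows. This is a minimal Python sketch under stated assumptions: the function names `announced` and `pop_bgp_down` are hypothetical, each looking glass result is assumed to be already fetched and parsed into a list of prefix strings, and a PoP network counts as announced if a view contains it exactly or via a covering prefix. The example prefixes are documentation ranges, not real PoP networks.

```python
import ipaddress
from typing import Iterable, Sequence


def announced(pop_network: str, prefixes: Iterable[str]) -> bool:
    """True if one looking glass sees the PoP network, exactly or via a covering prefix."""
    net = ipaddress.ip_network(pop_network)
    for prefix in prefixes:
        candidate = ipaddress.ip_network(prefix)
        # subnet_of() raises TypeError on mixed IPv4/IPv6, so compare versions first.
        if candidate.version == net.version and net.subnet_of(candidate):
            return True
    return False


def pop_bgp_down(pop_network: str, views: Sequence[Iterable[str]]) -> bool:
    """Logical AND over all looking glasses: down only if none of them sees the network."""
    if not views:
        # Without any view, all() would be vacuously True and raise a false alert.
        raise ValueError("need at least one looking glass view")
    return all(not announced(pop_network, view) for view in views)


if __name__ == "__main__":
    views = [["192.0.2.0/24", "198.51.100.0/24"], ["198.51.100.0/24"]]
    print(pop_bgp_down("192.0.2.0/24", views))    # one view still sees it
    print(pop_bgp_down("203.0.113.0/24", views))  # no view sees it
```

Requiring every looking glass to agree keeps a single broken or stale looking glass from triggering PoPBGPDown. Note that under this sketch a default route such as `0.0.0.0/0` in a view would count as covering the PoP network, so such entries should be filtered out before the check.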