<style> .reveal { font-size: 18px; font-family: "courier" } </style>

## Bring Your Own Observability

---

## Observability

We're talking about the metrics, telemetry, tracing and logs that help us get notified about, diagnose and improve the state of our applications.

---

#### Observability Options on GOV.UK PaaS

---

#### Pazmin + `cf logs`

* Great for getting started! :+1:
* Zero config! :+1:
* No option for custom metrics
* No option for custom dashboards
* No option for alerting
* Brief log retention

---

#### Export to a SaaS offering

* Zero maintenance! :+1:
* e.g. syslog drain -> Logit
* e.g. statsd -> Hosted Graphite
* Burden of procurement
* May require deploying some kind of adapter
* Fragmented from the PaaS
* Recommending products is a grey area for PaaS team

---

#### The "Observe" Prometheus

* Minimal config for app service discovery! :+1:
* Alerting in-the-box :+1:
* Dashboarding in-the-box :+1:
* Custom metrics collection in-the-box :+1:
* GDS teams only :unamused:
* No solution for logging (punt it to Logit)
* Has some quirks (exposing metrics publicly, service discovery gets out of sync, requires authorizing a "user" outside of the team)

---

#### DIY (Deploy It Yourself)

* InfluxDB backing service available! :+1:
* Many tools will run on the PaaS backed by InfluxDB, e.g.:
    * Deploy your own Prometheus for metrics :+1:
    * Deploy your own Grafana for dashboarding :+1:
    * Deploy your own Alertmanager for alert routing :+1:
    * Deploy your own Telegraf for log collection :+1:
* Burden of choice and configuration :cry:
* Burden of "Day 2" operations
* May require knowledge of more "advanced" PaaS features such as BoshDNS and NetworkPolicies
* Custom "glue" code or static configuration management
* Duplication of effort
* Blurring of the App vs Service model

---

#### DIY++

Can we do better?

* A "PaaS-native" experience available for ALL
* Solution for both metrics and logging in-the-box
* Reduce burden of configuration (minimal/zero)
* Reduce burden of choice
* Reduce burden of "day 2" operations

---

#### Kubernetes, Operators & Sidecars

* An "Operator" is an app that manages the configuration and lifecycle of another app
* A "Sidecar" is a supporting process bolted on to the side of an application or injected by an Operator

---

#### The Prometheus Operator

* A Kubernetes application you can deploy to your namespace that:
    * manages the lifecycle of Prometheus (metrics)
    * manages the lifecycle of Alertmanager (alerts)
    * manages the lifecycle of Grafana (dashboarding)
    * manages sidecars to automate platform specific metric collection
    * manages sidecars to automate configuration / discovery
    * provides Kubernetes-native methods to customise configurations (`kubectl apply -f custom-resource.yml`)

---

#### PaaS, Brokers & Sidecars

* A "Broker" is an app that manages the configuration and lifecycle of another app
* Although usually deployed platform-wide, you can deploy your own via "space scoped brokers"
* A "Sidecar" is a supporting process bolted on to the side of an application or injected by a buildpack

---

#### The BYO Observability Broker

* A GOV.UK PaaS application you can deploy to your own space that:
    * manages the lifecycle of Prometheus (metrics)
    * manages the lifecycle of Grafana (dashboarding)
    * manages the lifecycle of Telegraf (logs)
    * manages the lifecycle of InfluxDB (storage)
    * manages sidecars to automate platform specific metric collection
    * manages sidecars to automate configuration / discovery
    * provides PaaS-native methods to customise configurations (`cf bind my-app prometheus`, `cf bind my-app `)
    * potentially offers other stacks (TICK vs TIPG)

---

#### User Experience

The UX we're aiming for is something like:

```bash
# install the broker
cf push -b "https://github.com/alphagov/paas-byo-observability"

# create an observability stack
cf create-service TIPG observability

# configure app for metric collection and log shipping
cf bind-service my-app observability

# check last-status for useful URLs
cf service observability
...
grafana: https://some-grafana.cloudapps.digital
prometheus: https://some-prometheus.cloudapps.digital
...

# open grafana URL and see my-app metrics and logs in one place
```

---

#### Great! What do we need?

* A broker implementation that can deploy other apps and services to the PaaS
* PaaS-deployable versions of:
    * Prometheus
    * Grafana
    * Telegraf
    * Jaeger maybe?
    * Alertmanager maybe?
* PaaS-deployable exporters for:
    * Container metric collection
* Sidecar processes for each service that watch the Cloud Foundry API and generate configuration based on Bindings and Service parameters
* Automation of network policy where required
* An easy way to install the broker (custom buildpack - self register as broker?)

---

#### So far...

* [x] broker provision code for influx backed prometheus
* [x] broker provision code for grafana
* [x] broker provision code for paas-exporter
* [ ] broker provision code for influx backed telegraf
* [ ] broker provision code for alertmanager
* [x] grafana config sidecar (bindings -> datasources)
* [x] prometheus config sidecar (binding -> DNSSD scrape config)
* [x] telegraf config sidecar (binding -> syslog drain config)
* [ ] grafana authentication config
* [ ] deprovisioning steps
* [ ] deployable broker
* [ ] ...
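
---

#### Sketch: binding -> DNSSD scrape config

A rough idea of what the prometheus config sidecar could generate for each binding. This is a minimal sketch, not the broker's actual code: the `Binding` fields, the `apps.internal` hostnames and the default port are assumptions for illustration. Only `job_name`, `metrics_path` and `dns_sd_configs` (`names`, `type`, `port`) are real Prometheus config keys.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    """Hypothetical view of one service binding, as a sidecar might see it."""
    app_name: str                   # name of the bound app
    internal_host: str              # assumed internal DNS name of the app
    metrics_port: int = 8080        # assumed default port for metrics
    metrics_path: str = "/metrics"  # Prometheus' conventional default path


def scrape_config(bindings):
    """Build a Prometheus config fragment with one DNS-SD job per binding."""
    jobs = []
    # Sort by app name so the generated config is stable between runs.
    for b in sorted(bindings, key=lambda b: b.app_name):
        jobs.append({
            "job_name": b.app_name,
            "metrics_path": b.metrics_path,
            "dns_sd_configs": [
                # Type "A" lookups carry no port, so it must be given here.
                {"names": [b.internal_host], "type": "A", "port": b.metrics_port},
            ],
        })
    return {"scrape_configs": jobs}


if __name__ == "__main__":
    bindings = [
        Binding("my-app", "my-app.apps.internal"),
        Binding("my-api", "my-api.apps.internal", metrics_port=9090),
    ]
    # Printed as JSON for readability; a real sidecar would write YAML.
    print(json.dumps(scrape_config(bindings), indent=2))
```

An `A` record lookup against an internal route should yield one target per app instance, so scaling the app would need no config change.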