How to implement monitoring from scratch?

## History

- [Alerting Proposal](https://docs.google.com/document/d/18_cmwH3FjQHgcL5hQd9VEsVh9xG7HtKCc57S_ibszVM/edit)
- [Observability 2020 - Alerting: integration w/ pagerduty](https://docs.google.com/document/d/1O2yjpff5cfFEOn-UhaqMsMXiJpHRcjjnoJC9tMFnQwk/edit#heading=h.il8wj7nnhlfx)
- [One Pager Q1](https://docs.google.com/document/d/1H1KFLOL30fIgv4HbhX-SebrIVE4P4rf4vK9giBgu8BU/edit?usp=sharing)

## Statement of the problem

# How to implement monitoring from scratch?

Monitoring

> Collecting, processing, aggregating, and displaying real-time quantitative data about a system.

Dashboard

> An application that provides a summary view of a service's core metrics.

## Monitoring Best Practices

### Setting Reasonable Expectations

### Actionable Alert <=> Action

If an alert doesn't make you act to resolve it, you're doing alerting wrong.

Alerting is not a "bird's-eye view" use case. Use dashboards for that.

### Top-down, not Bottom-up

What is a "healthy" state?

A healthy system is a system that continuously generates value for its consumers.

> Keep SLIs within SLOs.

### Choose SLIs

The [RED Method](https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/) is similar to Google's recommendations, except that it skips saturation.

- Rate
- Errors
- Duration

#### Google's Golden signals

[Monitoring Distributed Systems](https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/)

- Latency
- Traffic
- Errors
- Saturation

#### A possible alternative

[USE Method](http://www.brendangregg.com/usemethod.html)

- Utilization
- Saturation
- Errors

Saturation is hard to get right at first. You can live without it.

### Define SLOs

Good news: there's dedicated documentation elsewhere.

### On domain-related metrics

## "Transverse" Alerting

## Preserve the namespace labels

## Still doesn't work?
### As Foundation, on transverse metrics

```
# Memory usage ratio, aggregated per namespace and app.
- record: redis_memory_used_ratio
  expr: |
    max by(namespace, app) (
      redis_memory_used_bytes / redis_memory_max_bytes)

# Fire when more than 90% of the memory has been used for 5 minutes.
- alert: RedisOutOfMemory
  expr: redis_memory_used_ratio > 0.90
  for: 5m
  labels:
    severity: critical
```

#### Be nice to level 1.

```
# Default threshold: one 0.90 series per (app, namespace) exposing redis_up.
- record: redis_memory_used_ratio_threshold
  expr: |
    0.90 + 0 * count by(app, namespace) (redis_up)

# Compare each ratio with the threshold series carrying the same labels.
- alert: RedisOutOfMemory
  expr: |
    max by(namespace, app) (
      redis_memory_used_bytes / redis_memory_max_bytes)
    > redis_memory_used_ratio_threshold
  for: 5m
  labels:
    severity: critical
```

-> configurable thresholds!

[Using time series as alert thresholds](https://www.robustperception.io/using-time-series-as-alert-thresholds)

```
# Override for one service: the labels pin the threshold to this namespace/app.
- record: redis_memory_used_ratio_threshold
  expr: |
    1 + 0 * count by(app, namespace) (redis_up)
  labels:
    namespace: trip
    app: memorystore-search-metrics

# Default threshold for everything else.
- record: redis_memory_used_ratio_threshold
  expr: |
    0.90 + 0 * count by(app, namespace) (redis_up)
```

-> runbook, dashboards

There are some "magic" annotations.

```
- alert: RedisOutOfMemory
  expr: |
    max by(namespace, app) (
      redis_memory_used_bytes / redis_memory_max_bytes)
    > redis_memory_used_ratio_threshold
  for: 5m
  labels:
    severity: critical
  annotations:
    # Links shown with the alert: the matching dashboard panel and the runbook.
    grafana_url: https://grafana.prod-1.blbl.cr/d/TjnRg55Zz/redis?var-namespace={{$labels.namespace}}&var-cluster={{$labels.app}}-metrics&panelId=4&fullscreen
    runbook_url: https://ops-run-book.corp.blablacar.com/search?q=RedisOutOfMemory
```

-> more to come?

## As Dev team, on transverse metrics

Set up thresholds:

```
# Threshold of 1, i.e. 100% of the memory used.
- record: redis_memory_used_ratio_threshold
  expr: 1
  labels:
    namespace: trip
    app: memorystore-search-metrics
```

Or go yolo:

```
# A comparison against NaN never matches, so the alert never fires.
- record: redis_memory_used_ratio_threshold
  expr: NaN
  labels:
    namespace: trip
    app: memorystore-search-metrics
```

Feedback / Help / etc.

## SRE Advice

## Alert on Slack from PagerDuty, not Prometheus

## Use Grafana for "bird's-eye view"

## Other

Slow to iterate via flux.
git -> Flux -> HR -> tiller -> deploy cm -> reload config

Unit testing for alerts is in our Observability 2020 plan.

# Dependencies

# Cleanup

[Top-Down Approach to Monitoring](https://fr.slideshare.net/BigPandaIO/topdown-approach-to-monitoring)
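
As an illustration of the alert unit testing mentioned under "Other", here is a minimal sketch of a `promtool` rule test for `RedisOutOfMemory`. It rests on assumptions that are not in this document: the file names `redis_alerts.yml` and `redis_alerts_test.yml` are hypothetical, and the rule file is assumed to contain only the default `redis_memory_used_ratio_threshold` record (0.90) and the `RedisOutOfMemory` alert without annotations. If the alert carries annotations, they would have to be listed under `exp_annotations` as well.

```yaml
# redis_alerts_test.yml (hypothetical name); run with:
#   promtool test rules redis_alerts_test.yml
rule_files:
  - redis_alerts.yml   # hypothetical: default threshold record + RedisOutOfMemory alert

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # redis_up is needed so that the default threshold series exists.
      - series: 'redis_up{namespace="trip", app="memorystore-search-metrics"}'
        values: '1+0x10'
      # 95 bytes used out of 100: a constant ratio of 0.95, above the 0.90 default.
      - series: 'redis_memory_used_bytes{namespace="trip", app="memorystore-search-metrics"}'
        values: '95+0x10'
      - series: 'redis_memory_max_bytes{namespace="trip", app="memorystore-search-metrics"}'
        values: '100+0x10'
    alert_rule_test:
      # The rule has "for: 5m", so the alert is expected to be firing by 10m.
      - eval_time: 10m
        alertname: RedisOutOfMemory
        exp_alerts:
          - exp_labels:
              severity: critical
              namespace: trip
              app: memorystore-search-metrics
```

A test like this runs locally, so a broken expression or a label mismatch between the ratio and its threshold can be caught before going through the git -> Flux -> reload cycle.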