## History
[Alerting Proposal](https://docs.google.com/document/d/18_cmwH3FjQHgcL5hQd9VEsVh9xG7HtKCc57S_ibszVM/edit)
[Observability 2020 - Alerting: integration w/ pagerduty](https://docs.google.com/document/d/1O2yjpff5cfFEOn-UhaqMsMXiJpHRcjjnoJC9tMFnQwk/edit#heading=h.il8wj7nnhlfx)
[One Pager Q1](https://docs.google.com/document/d/1H1KFLOL30fIgv4HbhX-SebrIVE4P4rf4vK9giBgu8BU/edit?usp=sharing)
## Problem statement
# How to implement monitoring from scratch?
Monitoring
> Collecting, processing, aggregating, and displaying real-time quantitative data about a system.
Dashboard
> An application that provides a summary view of a service’s core metrics.
## Monitoring Best Practices
### Setting Reasonable Expectations
### Actionable
Alert <=> Action
If an alert doesn't prompt you to work on resolving it, you're doing alerting wrong.
Alerts are not for the "bird's-eye view" use case. Use dashboards for that.
### Top-down, not Bottom-up
What is a "healthy" state?
A healthy system is one that continuously generates value for consumers.
> Keep SLIs within SLOs.
### Choose SLIs
The RED method is similar to Google's recommendations, except that it skips saturation.
[RED Method](https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/)
Rate
Errors
Duration
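As a rough illustration, the three RED signals could be captured as Prometheus recording rules. This is only a sketch: the metric names `http_requests_total` and `http_request_duration_seconds_bucket`, the `code` label, and the recorded rule names are assumptions, not metrics taken from this document.

```yaml
# Hypothetical recording rules for the RED signals, per namespace and app.
# Rate: requests per second.
- record: namespace_app:http_requests:rate5m
  expr: sum by(namespace, app) (rate(http_requests_total[5m]))
# Errors: share of requests answered with a 5xx status code.
- record: namespace_app:http_requests_errors:ratio_rate5m
  expr: |
    sum by(namespace, app) (rate(http_requests_total{code=~"5.."}[5m]))
    /
    sum by(namespace, app) (rate(http_requests_total[5m]))
# Duration: 99th percentile latency, assuming a histogram metric.
- record: namespace_app:http_request_duration_seconds:p99
  expr: |
    histogram_quantile(0.99,
      sum by(namespace, app, le) (rate(http_request_duration_seconds_bucket[5m])))
```

Aggregating by `namespace` and `app` follows the Redis rules used later in this document.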
#### Google's Golden signals
[Monitoring Distributed Systems](https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/)
Latency
Traffic
Errors
Saturation
#### A possible alternative
[USE Method](http://www.brendangregg.com/usemethod.html)
Utilization
Saturation
Errors
Saturation is hard to get right at first. You can live without it.
### Define SLOs
Good news: there's dedicated documentation elsewhere.
### On domain-related metrics
## "Transverse" Alerting
## Preserve the namespace labels
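One possible reading of this advice, sketched with the Redis metrics used later in this document: an aggregation without a `by` clause drops every label, while `by(namespace, app)` keeps the labels that identify the owning service. The alert name `RedisOutOfMemoryNoLabels` is hypothetical and only serves as a counter-example.

```yaml
# Counter-example: max() without "by" drops namespace and app,
# so the alert cannot be tied back to a service.
- alert: RedisOutOfMemoryNoLabels   # hypothetical name
  expr: max(redis_memory_used_bytes / redis_memory_max_bytes) > 0.90

# Keeping namespace and app on the aggregated result preserves them on the alert.
- alert: RedisOutOfMemory
  expr: max by(namespace, app) (redis_memory_used_bytes / redis_memory_max_bytes) > 0.90
```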
## Still doesn't work?
### As Foundation, on transverse metrics
```
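# Recording rule: memory usage ratio per Redis instance.
- record: redis_memory_used_ratio
  expr: |
    max by(namespace, app) (redis_memory_used_bytes / redis_memory_max_bytes)
# Alert when the ratio stays above 90% for 5 minutes.
- alert: RedisOutOfMemory
  expr: redis_memory_used_ratio > 0.90
  for: 5m
  labels:
    severity: critical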
```
#### Be nice to level 1
```
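# Threshold as a time series: 0.90 for every (app, namespace) exposing redis_up.
- record: redis_memory_used_ratio_threshold
  expr: |
    0.90
    + 0 * count by(app, namespace) (redis_up)
# The alert compares each instance to the threshold with the same labels.
- alert: RedisOutOfMemory
  expr: |
    max by(namespace, app) (redis_memory_used_bytes / redis_memory_max_bytes)
    > redis_memory_used_ratio_threshold
  for: 5m
  labels:
    severity: critical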
```
-> configurable thresholds!
[Using time series as alert thresholds](https://www.robustperception.io/using-time-series-as-alert-thresholds)
```
- record: redis_memory_used_ratio_threshold
  expr: |
    1
    + 0 * count by(app, namespace) (redis_up)
  labels:
    namespace: trip
    app: memorystore-search-metrics
- record: redis_memory_used_ratio_threshold
  expr: |
    0.90
    + 0 * count by(app, namespace) (redis_up)
```
-> runbook, dashboards
There are some "magic" annotations.
```
- alert: RedisOutOfMemory
  expr: |
    max by(namespace, app) (redis_memory_used_bytes / redis_memory_max_bytes)
    > redis_memory_used_ratio_threshold
  for: 5m
  labels:
    severity: critical
  annotations:
    grafana_url: https://grafana.prod-1.blbl.cr/d/TjnRg55Zz/redis?var-namespace={{$labels.namespace}}&var-cluster={{$labels.app}}-metrics&panelId=4&fullscreen
    runbook_url: https://ops-run-book.corp.blablacar.com/search?q=RedisOutOfMemory
```
-> more to come?
## As a dev team, on transverse metrics
Set up thresholds:
```
- record: redis_memory_used_ratio_threshold
  expr: 1
  labels:
    namespace: trip
    app: memorystore-search-metrics
```
Or go YOLO: a comparison against `NaN` never matches, so the alert never fires.
```
- record: redis_memory_used_ratio_threshold
  expr: NaN
  labels:
    namespace: trip
    app: memorystore-search-metrics
```
Feedback / Help / etc
## SRE Advice
## Alert on Slack from PagerDuty, not Prometheus
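A minimal Alertmanager sketch of this advice, assuming Slack notifications are configured on the PagerDuty side: Alertmanager only knows a PagerDuty receiver and has no `slack_configs`. The receiver name, the `group_by` labels, and the key placeholder are illustrative assumptions.

```yaml
route:
  receiver: pagerduty                      # single default receiver
  group_by: [alertname, namespace, app]    # assumed grouping labels
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"   # placeholder, Events API v2 key
```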
## Use Grafana for the "bird's-eye view"
## Other
Iterating via Flux is slow:
git -> Flux -> HR -> Tiller -> deploy cm -> reload config
Unit testing for alerts is in our Observability 2020 plan.
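As a sketch of what such a unit test might look like, `promtool test rules` accepts a test file like the one below. The file name `redis_alerts.yml` is hypothetical. It is assumed to wrap the `redis_memory_used_ratio_threshold` recording rule and the `RedisOutOfMemory` alert (the version without annotations) in a single rule group.

```yaml
rule_files:
  - redis_alerts.yml        # hypothetical file holding the rules under test
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Memory stays at 95% of the maximum for the whole test.
      - series: 'redis_memory_used_bytes{namespace="trip", app="memorystore-search-metrics"}'
        values: '95+0x10'
      - series: 'redis_memory_max_bytes{namespace="trip", app="memorystore-search-metrics"}'
        values: '100+0x10'
      # redis_up gives the default threshold rule its labels.
      - series: 'redis_up{namespace="trip", app="memorystore-search-metrics"}'
        values: '1+0x10'
    alert_rule_test:
      # After 10 minutes the 5m "for" clause should have elapsed.
      - eval_time: 10m
        alertname: RedisOutOfMemory
        exp_alerts:
          - exp_labels:
              severity: critical
              namespace: trip
              app: memorystore-search-metrics
```

Running `promtool test rules` on this file locally avoids a full trip through the deployment pipeline for each change to an alert expression.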
# Dependencies
# Cleanup
[Top-Down Approach to Monitoring](https://fr.slideshare.net/BigPandaIO/topdown-approach-to-monitoring)