# Philosophy of Observability
{%hackmd @coscup/announcement-2025 %}
> Please start here
### How Human Deals With Data
- Humans have condensed detailed accounts into key events, and key events into numbers, for millennia, again and again and again
- Any established industry is number-first
### Observability and SRE
Monitoring: collecting instead of using data
- extremes
- full text indexing
- data lake
Observability: enabling humans to understand complex systems
Observability, the buzzword
- Cool new term, almost meaningless by now, what does it mean?
- Pitfall alert
- It's about changing the behavior, not about changing the name
- "Monitoring" has taken on a meaning of collecting, not using data
> Observations are approximations to the truth
>
> -- Carl Friedrich Gauß, 1809
Observation: how well the internals of a system are doing
---
### Complexity
- Fake complexity, a.k.a. bad design
### Services
- What's a service
- Why contract
- Other common term: layer
- Other examples
- contract: shared agreement which MUST NOT be broken
### Cloud Native vs Client-Server vs Mainframe
- A mainframe application and a microservice fleet are fundamentally the same
- Microservices broke up old service and system boundaries
### SRE
- At its core: align incentives across the org
- Error budgets allow devs, ops, PMs, etc, to optimize the shared benefits
- Measure it!
- Service Level Indicator (SLI): what you measure
- Service Level Objective (SLO): what you need to hit
- Service Level Agreement (SLA): when you need to pay
- Discern between different SLIs
- Primary: service-relevant, for alerting
- Secondary: informational, debugging, ...
The `SLI` is the important part
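
The relationship between an SLI, an SLO, and the error budget can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not something shown in the talk: the function names, the availability-style SLI, the 99.9% objective, and the request counts are all hypothetical.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """SLI: the measured fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


def error_budget_remaining(total_requests: int, failed_requests: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is missed)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    # The SLO defines how many failures are tolerated in this window.
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
sli = availability_sli(1_000_000, 250)
remaining = error_budget_remaining(1_000_000, 250, slo=0.999)
print(f"SLI={sli:.5f}, error budget remaining={remaining:.0%}")
```

The point of the sketch is that devs, ops, and PMs can all look at the same number: as long as budget remains, there is room to ship changes; once it is spent, reliability work takes priority.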
### Shared Understanding
- Everyone uses the same tools and dashboards
- Shared incentive to invest into tooling
- Pooling of institutional system knowledge
- Shared language & understanding of the service
### Alerting
- Customers care about the service being up, not about individual components
> Anything currently or imminently impacting customer service must be alerted upon
> But nothing else!
- Consider the severity of alerts: something that might break days from now does not necessarily need to be handled immediately
### Prometheus 101
- Inspired by Google's Borgmon
### Time Series
- Time series are recorded values which change over time
- Individual events are usually merged into counters and/or ~~histograms~~
- gauge
- counter
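
How individual events are merged into a counter, and how that differs from a gauge, can be sketched as follows. The classes `SketchCounter` and `SketchGauge` and the `scrape` helper are hypothetical names for illustration; they are not the API of any Prometheus client library.

```python
class SketchCounter:
    """Monotonically increasing value: individual events are merged into a running total."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount


class SketchGauge:
    """A value that can go up and down, such as the number of in-flight requests."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


def scrape(metric, timestamp: int, series: list) -> None:
    """Record one (timestamp, value) sample; a time series is a list of these."""
    series.append((timestamp, metric.value))


requests, in_flight = SketchCounter(), SketchGauge()
requests_series, in_flight_series = [], []

# Hypothetical input: (new requests since last scrape, requests currently in flight).
for timestamp, (new_requests, current) in enumerate([(3, 2), (5, 4), (0, 1)]):
    requests.inc(new_requests)
    in_flight.set(current)
    scrape(requests, timestamp, requests_series)
    scrape(in_flight, timestamp, in_flight_series)
```

The counter series only ever grows or stays flat, so the individual events are gone but their rate can still be derived; the gauge series simply tracks the latest value.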
### Cloud Native Default
- Kubernetes
- Prometheus
### Main Selling Points
- Highly dynamic, built-in service discovery
- No hierarchical model, n-dimensional label set
- PromQL: for processing, graphing, alerting, and export
- Simple operation
- Highly efficient
### Super easy to emit, parse, and ?
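
To give a feel for why emitting and parsing is easy, here is a minimal sketch of the line-oriented `name{label="value"} value` shape used by the Prometheus text exposition format. It covers only a simplified subset (no escaping, timestamps, or `# HELP` / `# TYPE` lines), and the helper names `emit_sample` and `parse_sample` as well as the example metric are assumptions made for illustration.

```python
import re

_SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$"
)
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def emit_sample(name: str, labels: dict, value: float) -> str:
    """Render one sample as a single text line."""
    if labels:
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        return f"{name}{{{label_str}}} {value}"
    return f"{name} {value}"


def parse_sample(line: str):
    """Parse one sample line back into (name, labels, value)."""
    match = _SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


line = emit_sample("http_requests_total", {"method": "GET", "status": "500"}, 3)
print(line)
print(parse_sample(line))
```

A real implementation would use an official client library; the sketch only shows that the format is plain text that a human can read and a few regular expressions can handle.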
### PromQL
What is the ratio of request errors?
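
In PromQL, a question like this is commonly written along the lines of `sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`, where the metric name and the `status` label are assumptions rather than something taken from the talk. The arithmetic behind it can be sketched in Python; the helper names and sample values below are hypothetical, and counter resets are deliberately ignored.

```python
def rate(first: tuple, last: tuple) -> float:
    """Per-second increase between two (timestamp_seconds, counter_value) samples."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


def error_ratio(errors_first, errors_last, total_first, total_last) -> float:
    """Ratio of the error request rate to the total request rate over the same window."""
    total_rate = rate(total_first, total_last)
    if total_rate == 0:
        return 0.0  # no traffic in the window, so nothing failed
    return rate(errors_first, errors_last) / total_rate


# Hypothetical counter samples taken 300 seconds (5 minutes) apart.
ratio = error_ratio(
    errors_first=(0, 10), errors_last=(300, 40),
    total_first=(0, 1000), total_last=(300, 4000),
)
print(f"{ratio:.2%} of requests failed")
```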
### Prometheus scale
- 1,000,000+ samples per second is no problem on current hardware
- ~2,000,000 samples...
### Mimir
- For metrics
- Prometheus -> Cortex -> Grafana Enterprise Metrics, Mimir
### Mimir @ Grafana
### Loki
- For logs
- Follows the same label-based system as Prometheus
- Work with logs at scale, without massive cost
- Access logs ...
### Loki @ Grafana Labs
- 10TiB per day
### Tempo
- For traces
- Historic problem
- Exemplars
- Index and search by label sets available for those who need it
- 100% compatible with OpenTelemetry...
## Data (and cost) savings
#### Log to metrics
- Full text indexing: 10 TiB logs -> 20 TiB index
- Loki: 10 TiB logs -> ~200 MiB index
- Logs @ Grafana: ~600 bytes average per line
- Metrics: ~1.36 bytes per metric sample
- 99.8% reduction in storage size for first log line
- 100% for every??
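
The figures above can be checked with a little arithmetic. The sketch below assumes that one metric sample replaces one log line, which is how the 99.8% figure appears to be derived; the variable and function names are hypothetical.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4

LOG_LINE_BYTES = 600.0       # average log line size quoted above
METRIC_SAMPLE_BYTES = 1.36   # average metric sample size quoted above


def reduction_percent(before: float, after: float) -> float:
    """Percentage saved when `before` bytes are replaced by `after` bytes."""
    return (1.0 - after / before) * 100.0


# Replacing one log line with one metric sample.
first_line_reduction = reduction_percent(LOG_LINE_BYTES, METRIC_SAMPLE_BYTES)

# Index size relative to the logs it indexes.
full_text_index_ratio = (20 * TIB) / (10 * TIB)
loki_index_ratio = (200 * MIB) / (10 * TIB)

print(f"log line -> metric sample: {first_line_reduction:.1f}% smaller")
print(f"full text index: {full_text_index_ratio:.0%} of log volume")
print(f"Loki index: {loki_index_ratio:.4%} of log volume")
```

With these inputs the per-line reduction comes out at roughly 99.8%, matching the bullet above, and the Loki index is several orders of magnitude smaller than a full text index for the same log volume.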
### Grafana
playground: https://play.grafana.org
[playground with query string](https://play.grafana.org/d/bdnahipisghdsa/getting-started-with-grafana-play?orgId=1&from=now-1h&to=now&timezone=browser)
> All of this is open source and you can run it yourself.
---
## Thank you
- https://chaos.social/@RichiH
- https://github.com/RichiH/talks
---
## Slides (will update soon)
https://github.com/RichiH/talks/tree/main/2025/08-COSCUP_Taipei
## QA
Q1: Should we use the monitoring tools from a Cloud provider, or use the Open Source ones?
A1: Monitoring tools from cloud providers are often updated slowly. My recommendation is to either self-host the open source tools or use a dedicated monitoring service provider.
Q2: Can you provide important metrics that are often missing?
A2: Instead of fixating on the current metrics, we need to evolve them constantly and add the missing ones.
https://s.dwave.cc/Zkw8YRQ
### Book Recommendation
https://sre.google/books/
###### tags: `COSCUP2025`, `en`, `elementary`