# Philosophy of Observability

{%hackmd @coscup/announcement-2025 %}

> Start here

### How Humans Deal With Data

- Over millennia, humans have condensed detailed accounts into key events, and key events into numbers
    - Again and again and again
- Every established industry is numbers-first

### Observability and SRE

Monitoring: collecting data instead of using it

- Extremes
    - Full-text indexing
    - Data lake

Observability: enabling humans to understand complex systems

Observability, the buzzword

- Cool new term, almost meaningless by now; what does it mean?
- Pitfall alert
    - It's about changing the behavior, not about changing the name
    - "Monitoring" has taken on a meaning of collecting, not using, data

> Observations are approximations to the truth
>
> -- Carl Friedrich Gauß, 1809

Observation: how well the internals of a system are doing

---

### Complexity

- Fake complexity, a.k.a. bad design

### Services

- What's a service
- Why contract
- Other common term: layer
- Other example
- Contract: shared agreement which MUST NOT be broken

### Cloud Native vs Client-Server vs Mainframe

- A mainframe application and a microservice fleet are fundamentally the same
- Microservices broke up old service and system boundaries

### SRE

- At its core: align incentives across the org
- Error budgets allow devs, ops, PMs, etc. to optimize for the shared benefit
- Measure it!
    - Service Level Indicator (SLI): what you measure
    - Service Level Objective (SLO): what you need to hit
    - Service Level Agreement (SLA): when you need to pay
- Discern between different SLIs
    - Primary: service-relevant, for alerting
    - Secondary: informational, debugging, ...

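
The relationship between an SLI, an SLO, and the error budget comes down to a little arithmetic. The following is a minimal sketch, assuming a request-based availability SLI; the 99.9% objective, the request counts, and the function names are hypothetical illustrations and do not come from the talk.

```python
def sli(good: int, total: int) -> float:
    """Service Level Indicator: the fraction of requests that were good."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as fully compliant
    return good / total


def error_budget_remaining(good: int, total: int, slo: float) -> float:
    """Fraction of the error budget left (1.0 = untouched, below 0 = overspent)."""
    allowed_bad = (1.0 - slo) * total  # failures the SLO tolerates
    actual_bad = total - good          # failures that actually happened
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Hypothetical window: 1,000,000 requests, 400 of them failed, SLO of 99.9%.
total, good, slo = 1_000_000, 999_600, 0.999
print(f"SLI: {sli(good, total):.4%}")
print(f"Error budget left: {error_budget_remaining(good, total, slo):.0%}")
```

In this sketch the SLO permits roughly 1,000 failed requests in the window, so 400 failures leave a bit more than half of the budget for devs, ops, and PMs to spend on releases or experiments.
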
The `SLI` is the important one.

### Shared Understanding

- Everyone uses the same tools and dashboards
- Shared incentive to invest in tooling
- Pooling of institutional system knowledge
- Shared language & understanding of services

### Alerting

- Customers care about the service being up, not about individual components

> Anything currently or imminently impacting customer service must be alerted upon
>
> But nothing else!

- Consider the severity level of alerts: something that might break days from now does not necessarily need to be handled immediately

### Prometheus 101

- Inspired by Google's Borgmon

### Time Series

- Time series are recorded values which change over time
- Individual events are usually merged into counters and/or ~~histograms~~ gauge counter

### Cloud Native Default

- Kubernetes
- Prometheus

### Main Selling Points

- Highly dynamic, built-in service discovery
- No hierarchical model, n-dimensional label set
- PromQL: for processing, graphing, alerting, and export
- Simple operation
- Highly efficient

### Super easy to emit, parse, and ?

### PromQL

What is the ratio of request errors?

### Prometheus scale

- 1,000,000+ samples per second, no problem on current hardware
- ~2,000,000 samples...

### Mimir

- For metrics
- Prometheus -> Cortex -> Grafana Enterprise Metrics, Mimir

### Mimir @ Grafana

### Loki

- For logs
- Follows the same label-based system as Prometheus
- Work with logs at scale, without massive cost
- Access logs ...

### Loki @ Grafana Labs

- 10 TiB per day

### Tempo

- For traces
- Historic problem
- Exemplars
- Index and search by label sets available for those who need it
- 100% compatible with OpenTelemetry...

## Data (and cost) savings

#### Log to metrics

- Full-text indexing: 10 TiB logs -> 20 TiB index
- Loki: 10 TiB logs -> ~200 MiB index
- Logs@Grafana: ~600 bytes average per line
- Metrics: ~1.36 bytes per metric sample
- 99.8% reduction in storage size for first log line
- 100% for every??

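
The storage figures in the list above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted in these notes (~600 bytes per log line, ~1.36 bytes per metric sample, 10 TiB of logs, a 20 TiB full-text index, a ~200 MiB Loki index); the function and variable names are hypothetical.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def reduction(before: float, after: float) -> float:
    """Relative size reduction when `before` bytes shrink to `after` bytes."""
    return 1.0 - after / before


# One ~600-byte log line replaced by one ~1.36-byte metric sample.
line_to_sample = reduction(600, 1.36)

# Index size relative to the 10 TiB of logs it covers.
full_text_ratio = (20 * TIB) / (10 * TIB)
loki_ratio = (200 * MIB) / (10 * TIB)

print(f"log line -> metric sample: {line_to_sample:.1%} smaller")
print(f"full-text index: {full_text_ratio:.0%} of the log volume")
print(f"Loki index: {loki_ratio:.4%} of the log volume")
```

The first result rounds to the 99.8% reduction quoted above, and the two index ratios show why a label-only index is so much cheaper than a full-text one: the full-text index is twice the size of the logs, while the Loki index is a tiny fraction of a percent of them.
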
### Grafana playground

https://play.grafana.org

[Playground with query string](https://play.grafana.org/d/bdnahipisghdsa/getting-started-with-grafana-play?orgId=1&from=now-1h&to=now&timezone=browser)

> All of this is open source and you can run it yourself.

---

## Thank you

- https://chaos.social/@RichiH
- https://github.com/RichiH/talks

---

## Slides (will update soon)

https://github.com/RichiH/talks/tree/main/2025/08-COSCUP_Taipei

## QA

Q1: Should we use the monitoring tools from a cloud provider, or the open source ones?

A1: Monitoring tools from cloud providers are often slow to update. My recommendation is to either self-host the open source ones or use the ones from a monitoring service provider.

Q2: Can you name important metrics that are often missing?

A2: Instead of fixating on the current metrics, we need to evolve them constantly and add the missing ones.

https://s.dwave.cc/Zkw8YRQ

### Book Recommendation

https://sre.google/books/

###### tags: `COSCUP2025`, `en`, `elementary`
