# Learning OpenTelemetry

# Chapter 1: The State of Modern Observability

The "three pillars" of observability were an accident of history: logs, metrics, and traces grew up as separate, siloed tools. To troubleshoot effectively, an operator must be able to correlate and pivot back and forth between signals instead of juggling three browser tabs.

There are two ways to perform correlation:

- Human investigation: labor-intensive and dependent on human memory
- Computer investigation: requires the data itself to be connected

![image](https://hackmd.io/_uploads/BJEKhcoweg.png)

# Chapter 2: Why Use OpenTelemetry?

## The Importance of Telemetry

### Hard and Soft Context

Context is the metadata that helps describe the relationship between system operations and telemetry.

- `hard context`: unique and per-request, and can be propagated to other services. It's a `logical` context.
- `soft context`: various pieces of metadata that the services and infrastructure handling the same request attach to their measurements. For example: a customer ID or the hostname of a load balancer.

![image](https://hackmd.io/_uploads/Sk8projDex.png)

![image](https://hackmd.io/_uploads/HkQfUiiwex.png)

The most commonly used soft context is time. Operators narrow soft contexts until they've identified one useful enough to find results. Hard contexts enable linking different types of instruments.

![image](https://hackmd.io/_uploads/H1puYosDee.png)

### Telemetry Layering

The old way: turn logs into time-series metrics, then build dashboards, alerts, and so on. This is resource- and time-consuming, and maintaining parser rules takes ongoing effort.

A better solution is to `layer` telemetry signals: link signals through context and layer them to pull the right data from overlapping signals.

![image](https://hackmd.io/_uploads/HkaRjssDgg.png)

OpenTelemetry has this built in: signals are linked through hard context.

### Semantic Telemetry

The data needs to be stored and analyzed. Costs come from storage, network bandwidth, telemetry-creation overhead, analysis, alert rate, and so on. The ability to understand a software system is ultimately a cost-optimization exercise.

OpenTelemetry changes this through portable, semantic telemetry:

- portable: works with any observability frontend
- semantic: self-describing

## Why Use OpenTelemetry?

- Universal Standards
- Correlated Data

# Chapter 3: OpenTelemetry Overview

![image](https://hackmd.io/_uploads/Skz3zkTPxl.png)

## Primary Observability Signals

Logs, metrics, and traces, collected through either white-box instrumentation (added inside the code via the API/SDK) or black-box instrumentation (applied externally, without code changes).

### Traces

A trace comprises multiple spans, and each span contains a variety of fields.

![image](https://hackmd.io/_uploads/S14D7yavgg.png)

Each trace represents one user's path through the system. Traces can be converted to metrics: a single trace contains all the information needed to compute the `golden signals` (latency, traffic, errors, saturation) for a single request.

### Metrics

OpenTelemetry metrics include semantic meaning and can be linked to other signals through hard and soft contexts. OpenTelemetry metrics are compatible with StatsD and Prometheus.

`exemplars`: a special type of hard context that links a metric event to a specific span and trace.

### Logs

Traditionally, correlating logs with metrics and traces is done either by aligning time windows or by comparing shared attributes. OpenTelemetry tries to unify this signal by enriching log statements with trace context and links to traces and metrics. In short, it takes an existing log, checks whether there is an active context, and if so, associates the two.

In OpenTelemetry, there are four reasons to use logs:

- To capture signals from services that can't be traced, such as legacy systems
- To correlate infrastructure resources such as managed databases or load balancers with application events
- To understand behavior in a system that isn't tied to a user request (cron jobs, ad hoc work)
- To process them into other signals, such as metrics or traces
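As a rough sketch of how hard and soft context appear in code (Python here; it assumes the `opentelemetry-api` and `opentelemetry-sdk` packages, and the span name, attribute, and values are placeholders), the example below creates a span and prints the trace and span IDs that a correlated log line would carry:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# Register a minimal SDK so the API returns real (recording) spans.
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("example.instrumentation")

with tracer.start_as_current_span("checkout") as span:
    # Soft context: descriptive attributes shared across signals.
    span.set_attribute("customer.id", "12345")

    # Hard context: the IDs a correlated log line or exemplar would carry.
    ctx = span.get_span_context()
    print(f"trace_id={ctx.trace_id:032x} span_id={ctx.span_id:016x}")
```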
## Observability Context

Three types of basic context exist in OpenTelemetry: `time`, `attributes`, and the context object itself.

### The Context Layer

`context`: a propagation mechanism that carries execution-scoped values across API boundaries and between logically associated execution units.

`propagators`: how you actually send values from one process to the next.

![image](https://hackmd.io/_uploads/Sk1aYJawle.png)

![image](https://hackmd.io/_uploads/r1lkckawle.png)

OpenTelemetry maintains a variety of semantic conventions to create a clear and consistent set of metadata for signals.

### Attributes and Resources

Attribute keys cannot be duplicated; values can be a string, boolean, float, int, or an array of a single type. There is a limit of 128 attributes per telemetry item (to bound memory use and avoid cardinality explosion).

![image](https://hackmd.io/_uploads/H1TUaJTDeg.png)

Two ways to handle attribute cardinality:

- Use observability pipelines, views, or tooling to reduce the cardinality of signals
- Omit high-cardinality attributes from metrics and record them on spans and logs instead

`resource`: a special type of attribute. Resources remain the same for the entire life of a process (a hostname, for example).

### Semantic Conventions

Semantic conventions come from two sources:

- The OpenTelemetry project: standard, publicly used conventions
- Platform teams: internal conventions that extend the OpenTelemetry schema

## OpenTelemetry Protocol

OTLP: a standard data format and protocol with both binary and text-based encodings, supported by many integrated producers and consumers.

## Compatibility and Future-Proofing

![image](https://hackmd.io/_uploads/BkX4bl6Dle.png)

![image](https://hackmd.io/_uploads/B1nvZlaDlg.png)

# Chapter 4: The OpenTelemetry Architecture

Components: instrumentation installed within applications, exporters, and pipeline components.

## Application Telemetry

![image](https://hackmd.io/_uploads/SyrkPlaPeg.png)

### Library Instrumentation

Many popular OSS libraries come with OpenTelemetry instrumentation, which covers most needs.

### The OpenTelemetry API

Use the OpenTelemetry API for additional manual instrumentation.

### The OpenTelemetry SDK

The SDK is the implementation behind the API: a plugin framework consisting of sampling algorithms, lifecycle hooks, and exporters (configurable via YAML or environment variables).

## Infrastructure Telemetry

Kubernetes, cloud services, database services, and so on.

## Telemetry Pipelines

OpenTelemetry uses the Collector and OTLP.

## What's Not Included In OpenTelemetry

For the sake of `standardization`: long-term storage, analysis, GUIs, and other frontend components are not included.

# Chapter 5: Instrumenting Applications

## Agents and Automated Setup

In all languages, two parts are needed:

- The SDK, which processes and exports telemetry
- Instrumentation libraries that match your frameworks, database clients, and other components

### Installing the SDK

Construct and configure a set of providers and register them with the OpenTelemetry API.

### Registering Providers

`provider`: an implementation of the OpenTelemetry instrumentation API. Examples: TracerProvider, MeterProvider, LoggerProvider.

## Providers

### TracerProvider

![image](https://hackmd.io/_uploads/BkBpIq4dgx.png)

**Samplers**: define the sampling strategy.

**SpanProcessors**: collect and modify spans. The default is the batching processor. Span processing can also be done in the Collector, which is the preferable place.

**Exporters**: the default is the OTLP exporter.
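A minimal sketch of installing the SDK and registering a TracerProvider in Python, assuming the `opentelemetry-sdk` and `opentelemetry-exporter-otlp-proto-grpc` packages and a Collector listening on the default local OTLP/gRPC endpoint; the `service.name` value is a placeholder:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Build the provider with a service resource, attach a batching processor
# that hands finished spans to the OTLP exporter, then register it globally.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

# Anything obtained via trace.get_tracer(...) now produces exported spans.
tracer = trace.get_tracer("example.instrumentation")
```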
### MeterProvider

![image](https://hackmd.io/_uploads/B1w1Ojr_lx.png)

**MetricReaders**: the metric equivalent of SpanProcessors. The default is the PeriodicExportingMetricReader, which collects and pushes data in batches.

**MetricProducers**: connect third-party instrumentation (e.g., Prometheus) to the OpenTelemetry SDK.

**MetricExporters**: send batches of metrics; OTLP push is the default.

**Views**: customize metric output, such as which instruments are ignored and how an instrument aggregates data. Views can also be created at the Collector level.

### LoggerProvider

![image](https://hackmd.io/_uploads/BJZWYjSuge.png)

**LogRecordProcessors**: like the other processors; the default is the batching processor.

**LogRecordExporters**: emit logging data in multiple common formats; the default is OTLP.

### Shutting Down Providers

It's critical to flush any remaining telemetry before shutting down the application. If you're using automatic instrumentation with agents, this is handled for you.

### Custom Providers

Allowing alternative implementations is one of the reasons the OpenTelemetry API is separate from the SDK.

## Configuration Best Practices

Configure the SDK in three ways:

- In code, when constructing providers
- Environment variables (the most widely supported)
- A YAML config file

### Remote Configuration

OpenTelemetry is developing the Open Agent Management Protocol (OpAMP). It allows Collectors and SDKs to open a port through which they transmit status and receive configuration updates, which is useful for letting a control plane or analysis tool adjust settings dynamically.

## Attaching Resources

A `resource` describes the service, the VM, the platform, the region, the cloud provider, and so on.

### Resource Detectors

Most resources come from the environment: Kubernetes, AWS, GCP, Azure, Linux. `resource detectors` are plugins that discover these resources. Most resources can be detected by a local Collector and attached to the telemetry there; doing this in the Collector reduces the overhead of doing it in the SDK inside application code.

### Service Resources

A critical set of resources can't be gathered from the environment: the resources that describe your service. You need to set them when configuring OpenTelemetry: `service.name`, `service.namespace`, `service.instance.id`, `service.version`.

## Installing Instrumentation

Automatic instrumentation covers common libraries. If it isn't available, you have to set up instrumentation for the library manually.

### Instrumenting Application Code

A best practice is to build instrumentation into shared in-house libraries so application code picks it up consistently.

Decorating spans: get the span created by library instrumentation and add attributes to it; there is no need to create new spans.

### How much is too much?

Unless it's a critical operation, don't add instrumentation until you need it. Focus breadth-first, not depth-first: try to instrument all the services.

### Layering Spans and Metrics

It's good practice to create histogram metrics for high-throughput API endpoints (a sketch follows after the setup checklist below).

### Browser and Mobile Clients

Client telemetry is referred to as RUM (real user monitoring). The OTel Collector is not designed to be exposed publicly, so an additional proxy layer is needed in front of it.

## The Complete Setup Checklist

- Is instrumentation available for every important library?
- Does the SDK register providers for tracing, metrics, and logs?
- Is the exporter correctly installed?
- Are the correct propagators installed?
- Is the SDK sending data to the Collector?
- Is the Collector sending data to the analysis tool?
- Are the correct resources emitted?
- Are all traces complete?
- Are no traces broken?
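To make the metrics pieces above concrete, here is a minimal sketch (Python SDK, with a console exporter standing in for OTLP) of a MeterProvider wired to a PeriodicExportingMetricReader and a latency histogram for a high-throughput endpoint; the instrument name and attributes are illustrative, not prescribed:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# The reader collects and pushes metrics in batches on a fixed interval.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("example.instrumentation")
latency = meter.create_histogram("request.duration", unit="ms")

# Low-cardinality attributes keep a high-throughput metric cheap.
latency.record(12.7, attributes={"http.route": "/checkout", "status_code": 200})
```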
# Chapter 6: Instrumenting Libraries

`native instrumentation`: instrumentation built into the library itself.

## The Importance of Libraries

Most resource usage happens at the library level, not in application code.

### Observability works by default in native instrumentation

It removes the dependency on plugins or hooks for library observability.

**Documentation and playbooks**: provide a schema and instructions for the library's observability.

**Dashboards and alerts**: a default set of dashboards and alerts that users can set up easily.

### Native instrumentation shows that you care about performance

Observability should be treated as a first-class citizen (like tests) for prioritizing production problems: latency, timeouts, resource contention, unexpected behavior under load.

## Why aren't libraries already instrumented?

Almost no libraries emit telemetry today (as of the book's writing, 2023-2024). Library instrumentation is almost always written by someone other than the maintainer. Why? **Composition** and **tracing**.

**Composition**: observability systems vary, so a library may depend on a different tool than the one you use.

![image](https://hackmd.io/_uploads/SkrnOkL_gl.png)

For **tracing**, all the libraries must participate in the same tracing system for context to propagate.

## How OpenTelemetry is designed to support libraries

Instrumentation is a `cross-cutting concern`: a subsystem that ends up everywhere and is used by every part of the codebase (like database calls).

### OpenTelemetry separates the instrumentation API and the implementation

The library maintainer writes instrumentation against the API they own; the application maintainer installs and configures the SDK, plugins, and exporters. The API has no dependencies, so many libraries can safely import it without conflicts. The SDK and its dependencies are referenced only once, by the application developer during setup.

### OTel maintains backward compatibility

The API needs to maintain compatibility across all libraries.

### OTel keeps instrumentation off by default

OTel API calls are always safe: they never throw exceptions. They can be used directly within library code, without any plugins or wrappers.

![image](https://hackmd.io/_uploads/HJdDhkLugg.png)

![image](https://hackmd.io/_uploads/HkCDhy8uxg.png)
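A sketch of what API-only native instrumentation can look like inside a library (Python; the `shardclient` module, span name, and attribute are hypothetical). Because only the API package is imported, every call below is a safe no-op until the application installs an SDK:

```python
# shardclient/query.py -- a hypothetical library module
from opentelemetry import trace

# Only the opentelemetry-api package is imported; installing and configuring
# an SDK is left entirely to the application.
tracer = trace.get_tracer("shardclient", "1.2.0")

def run_query(statement: str) -> list:
    # Safe no-op unless the application has registered a TracerProvider.
    with tracer.start_as_current_span("shardclient.query") as span:
        span.set_attribute("db.query.text", statement)
        return []  # real query logic would go here
```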
## Shared Libraries Checklist

- Have you enabled OpenTelemetry by default?
- Have you avoided wrapping the API?
- Have you used existing semantic conventions?
- Have you created new semantic conventions?
- Have you imported only API packages?
- Have you pinned your library to the major version number?
- Have you provided comprehensive documentation?
- Have you tested performance and shared the results?

## Shared Services Checklist

For databases, proxies, messaging systems, and similar services:

- Have you used the OpenTelemetry config file?
- Are you outputting OTLP by default?
- Have you bundled a local Collector?

# Chapter 7: Observing Infrastructure

## What is Infrastructure Observability?

Infrastructure providers: AWS, Azure, GCP, and so on. Infrastructure platforms: Kubernetes, FaaS, PaaS, CI/CD platforms.

What matters:

- Can we establish context (soft or hard) between specific infrastructure and application signals?
- Does understanding these systems through observability help you achieve specific business or technical goals?

## Observing Cloud Providers

Cloud services fall into two groups:

- `bare infrastructure`: on-demand and scalable services such as VMs, blob storage, API gateways, and managed databases
- `managed services`: on-demand Kubernetes clusters, ML services, stream processors, serverless platforms

Use OTel receivers for cloud services, for example the CloudWatch receiver.

### Collecting Cloud Metrics and Logs

Consider which signals are important to collect and how you will use them. Foundational principles:

- Use semantic conventions to build soft context between metric signals and application telemetry
- The OTel Collector has plugins that convert existing telemetry from many sources to OTLP
- Ask what you actually need and how long you need it

It's recommended to use the `Collector Builder` to build Collectors for production (with custom modules, plugins, and so on).

![image](https://hackmd.io/_uploads/rk6fFbI_ex.png)

![image](https://hackmd.io/_uploads/rJ8EYWLugx.png)

### Metamonitoring

`metamonitoring`: monitoring the Collector's own performance.

`otelcol_processor_refused_spans` and `otelcol_processor_refused_metric_points` are useful metrics for telling whether the limiter is causing data to be dropped; in that case, you need to scale up.

Rules of thumb for planning Collector capacity:

- Experiment per host or per workload to find the correct size of the ballast (memory pre-allocated to the heap)
- For scraped metrics, avoid scrape collisions
- Move heavier processing to later stages of the pipeline
- It's better to overprovision than to lose telemetry

*Note*: the memory ballast extension is deprecated; use `GOMEMLIMIT` and `GOGC` instead.

### Collectors in Containers

A good rule is to use factors of 2 for memory limits and ballast: for example, a ballast of 40% of container memory and a limit of 80%. These percentages are industry practice; too little ballast doesn't stabilize memory, and too high a memory limit risks out-of-memory errors.

## Observing Platforms

### Kubernetes Platforms

**Kubernetes telemetry**: events, metrics, logs, and traces from the cluster's components. Receivers such as `k8sclusterreceiver`, `k8seventsreceiver`, and `k8sobjectsreceiver` listen for these metrics and logs.

**Kubernetes applications**: the Target Allocator can help discover scrape targets. The Operator also provides automatic instrumentation injected into a pod (which may conflict with automatic instrumentation already in the application).

Production deployment tips:

- Use sidecar Collectors to avoid complex development and deployment and to get cleaner pod shutdowns
- Split out Collectors by signal type so they can scale independently
- Separate telemetry creation from telemetry configuration; for example, do redaction and sampling on Collectors

### Serverless Platforms

Besides standard application telemetry, a few more things deserve attention:

- Invocation time (duration)
- Resource usage
- Cold start time

Useful tools: the OpenTelemetry Lambda layer.

Functions must wait for telemetry data to be exported before they return.
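Since functions must wait on export, a common pattern is to flush the provider explicitly before the handler returns. A minimal sketch under assumed conditions (Python, a generic `handler(event, context)` entry point, a console exporter standing in for OTLP, provider configured at cold start):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configured once per cold start; a console exporter stands in for OTLP here.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("example.function")

def handler(event, context):
    with tracer.start_as_current_span("handle-event"):
        result = {"status": "ok"}  # placeholder work
    # Block until buffered spans are exported; otherwise the platform may
    # freeze or tear down the function before the batch is sent.
    provider.force_flush()
    return result
```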
### Queues, Service Buses, and Other Async Workflows

These should be split into multiple subtraces, linked back to an origin by a custom correlation ID (baggage) or a shared trace ID.

# Chapter 8: Designing Telemetry Pipelines

## Common Topologies

### No Collector

If the emitted telemetry requires little to no processing, the Collector can be skipped.

![image](https://hackmd.io/_uploads/SykhfE5Olx.png)

This setup misses host metrics, so they should be collected through other channels (node exporter or a cloud collector).

### Local Collector

![image](https://hackmd.io/_uploads/rJtx7Ncule.png)

Advantages:

- Gathers environment resources and host metrics
- Avoids data loss from crashes: with a local Collector, the application can send small batches at short intervals while the Collector is configured to forward larger batches

Good for separating telemetry creation from telemetry processing, filtering, configuration, and so on.

### Collector Pools

![image](https://hackmd.io/_uploads/ByB8U49dlg.png)

A load balancer spreads traffic across a pool of Collectors; because Collectors are stateless, the pool can absorb `backpressure`.

**Resource management**: the local Collector should only be responsible for collecting host metrics and quickly offloading application signals, to keep resource consumption low. Other processing should happen in the Collector pools.

**Deployment and configuration**: configuration and scaling concerns are kept separate from the application.

**Gateways and specialized workloads**:

![image](https://hackmd.io/_uploads/ByTea_iOex.png)

Reasons for specialized Collector pools:

- Reducing the size of the Collector binary: a stripped-down build of the Collector that fits the use case
- Reducing resource consumption: 1 + 1 may be larger than 2
- Tail-based sampling: a gateway pool uses the load-balancing exporter to route all spans of a trace to the same instance, and a separate pool then performs the sampling
- Backend-specific workloads: separate processing per signal
- Reducing egress costs: the OTel Arrow protocol offers better data compression (it requires a large volume of data transmission and is a stateful protocol)

## Pipeline Operations

### Filtering and Sampling

Three sampling strategies:

- Head-based sampling: the sampling decision is made when a trace starts (e.g., 1 in 10 or 1 in 100); not recommended, because important traces can be missed (see the sketch below)
- Tail-based sampling: wait until the trace is complete before deciding whether to sample it
- Storage-based sampling: implemented in the analysis tool; it does not reduce the cost of sending telemetry but enables features such as more comprehensive queries

**Filtering is easy; sampling is dangerous**: sampling is not recommended until egress and storage costs actually matter.
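If head-based sampling is used despite the caveats, it is typically configured on the SDK's TracerProvider. A minimal Python sketch of a parent-respecting 1-in-10 ratio sampler (the ratio is just an example value):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Roots are sampled roughly 10% of the time; child spans follow their
# parent's decision so sampled traces stay complete.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(1 / 10)))
```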
## Transforming, Scrubbing, and Versioning

The pipeline order is filter -> transform -> sample -> export.

Collectors help with redaction and with transforming signal types, but more transformation means more resource consumption and more latency, so it's better to get the signal right the first time (when the application emits it).

### Transforming Telemetry with OTTL

The OpenTelemetry Transformation Language (OTTL).

![image](https://hackmd.io/_uploads/r1Z-B6Tdgg.png)

### Privacy and Regional Regulations

The Collector is an ideal place to manage the data scrubbing and routing that such regulations often mandate.

### Buffering and Backpressure

The Collector needs enough memory to buffer the data in the pipeline.

## Collector Security

Ensure the Collector listens only for local traffic. Use TLS/SSL and authentication/authorization for receivers.

## Kubernetes

Some deployment options:

- DaemonSet to run a Collector locally on every node
- Sidecar for every pod
- Deployment for a Collector pool
- StatefulSet for a stateful Collector pool

DaemonSets for local Collectors and Deployments for Collector pools are the recommended combination.

The OTel Kubernetes Operator helps inject auto-instrumentation into applications and configure it.

## Managing Telemetry Costs

Don't monitor what doesn't matter.

# Chapter 9: Rolling Out Observability

## The Three Axes of Observability

- Deep versus wide: is it better to collect a little from everything, or a lot of detail from a few parts of the system?
- Rewriting code versus rewriting collection: add new instrumentation, or transform existing data into new formats?
- Centralized versus decentralized: should you create a strong central observability team?

## OpenTelemetry Rollout Checklist

- Is management involved?
- Have you identified a small but important first goal?
- Are you implementing only what you need to accomplish your first goal?
- Have you found a quick win?
- Have you centralized observability?
- Have you created a knowledge base?
- Can your old and new observability systems overlap?