# Basics of Distributed Tracing

The rise of microservices and the public cloud has had a dramatic impact on software architecture. The cost and technical burden of building complex distributed systems is significantly lower than it once was. The resulting increase in the complexity and scale of modern applications led to the invention of distributed tracing, the technology that modern telemetry instrumentation projects like OpenTelemetry are built around.

## Distributed Tracing

Distributed tracing is the process of capturing and analyzing telemetry data about a complex system that may span many environments and services. The collected data can tell administrators how their system is performing, when and why errors occur, what usage patterns look like, and much more.

To implement distributed tracing, three things are needed: an application to trace, a telemetry agent that does the tracing, and a collection service that captures data from the telemetry agents.

### Telemetry Agents

Telemetry agents are applications that gather distributed tracing data inside a host application and propagate tracing context to the other applications it interacts with. In the observability field, the term "instrumentation" describes applying an observability agent to capture telemetry data from an application. Instrumentation is generally provided either by an SDK embedded in the host application, or by a service that runs in the same environment as your application and pulls data from the system and the language runtime. Often it is a combination of the two, depending on what the language supports. For compiled languages like C++, Rust, or Go, the runtime generally does not expose data about running services, so all instrumentation needs to be handled by an SDK.

These agents are responsible for gathering metrics, traces, events, and logs. Each is a different kind of data point for determining the health, performance, and behavior of an application. Depending on how much data you collect, this can introduce performance, network, and data ingestion overhead on your system. As a result, many of these data types are designed to operate within certain performance constraints, or to sample data down into a smaller subset that is statistically representative. More on that later.

### Telemetry Data Types

**Metrics** are statistical measurements of an application's behavior, represented as counters, histograms, and gauges. They can measure things like the number of requests per second a server handles, how many unique users a service serves, or a server's request latency. The volume of this data is low, so telemetry agents are generally able to capture and export it reliably and without sampling by default.
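As a concrete illustration, here is a minimal sketch of recording metrics with the OpenTelemetry Python SDK, assuming the `opentelemetry-sdk` package is installed. The meter and instrument names are hypothetical examples, and a console exporter stands in for the collection pipeline described later.

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Wire the metrics API to an SDK MeterProvider that periodically prints
# collected metrics to stdout. In a real deployment the exporter would
# send the data to a collection pipeline instead.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

# The meter and instrument names below are hypothetical examples.
meter = metrics.get_meter("checkout-service")

request_counter = meter.create_counter(
    "http.server.requests", unit="1", description="HTTP requests handled"
)
latency_histogram = meter.create_histogram(
    "http.server.duration", unit="ms", description="HTTP request latency"
)

# Record one handled request and its latency, tagged with the route.
request_counter.add(1, {"http.route": "/checkout"})
latency_histogram.record(42.5, {"http.route": "/checkout"})
```

Because counters and histograms aggregate in memory and export on an interval, the cost of recording a single measurement stays small even at high request volume, which is part of why metrics are usually exported without sampling.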
**Events** are a way to capture notable single occurrences within a system at a given moment in time, like a runtime error or a user being rate limited. They are generally captured as plain text and can sometimes carry large payloads, such as stack traces. Since events are usually important, telemetry agents try to prioritize them and avoid sampling or dropping them.

**Logs** are an event type optimized for capturing application log output. They give you a more nuanced view of what is going on in your application at a given moment in time. Modern telemetry agents are able to capture logs, but this isn't always preferable depending on the size of your logs and the volume you need to collect. Log forwarding has existed for a long time; the important innovation modern telemetry services bring is the ability to associate logs with other telemetry data by adding linking metadata to the log attributes or message. If a trace captures an error event, for example, the corresponding system logs can be correlated using the timestamp and trace metadata to help you determine the root cause of a failure more quickly.

**Traces** are what tie everything together. A trace represents a single execution or request as it moves across a distributed system. Traces are broken into subsections, which OpenTelemetry calls spans. Data can be tagged with trace and span IDs so it can be associated with a given process observed in the application. If your application interacts with many services, trace headers can be injected into outgoing network traffic to propagate the trace context across your system. For this reason, instrumentation needs to intercept incoming and outgoing network and database traffic for distributed tracing to work. There are both proprietary and open protocols for distributed tracing headers; open source projects like OpenTelemetry follow the [W3C Trace Context specification](https://www.w3.org/TR/trace-context/).

Traces can contain a very large volume of data, and capturing too many of them adds significant overhead to your system, so they are almost always sampled. Head-based sampling happens in each agent: the decision to capture or drop a trace is made at the beginning of the traced request, based on a configured policy. This is very resource efficient, but it can discard high-value data. Tail-based sampling was created to address this. It collects all traces, applies a weighted score to each trace based on its contents, and then, according to the configured policy, keeps the subset of traces with the highest scores. This lets you avoid dropping valuable data. In OpenTelemetry, tail-based sampling is done in the collector, not in the agent.

## Data Collection

In OpenTelemetry, data is collected by the OpenTelemetry Collector, a standalone service that receives telemetry over OTLP (the OpenTelemetry Protocol). The collector aggregates data from all telemetry agents in your system, samples it, optionally applies transformations, and then sends it to the configured collection services. Running the collector as a separate service offloads a lot of memory- and compute-intensive work from the agents. It also allows unsampled data to be collected without being sent to external services, minimizing costly external network traffic.
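Below is a minimal sketch of the agent side of this pipeline using the OpenTelemetry Python SDK: it creates a trace broken into spans, attaches an event, injects the W3C `traceparent` header for propagation, and exports everything over OTLP to a collector assumed to be listening on `localhost:4317` (the default OTLP/gRPC port). The service, span, and attribute names are hypothetical, and the `opentelemetry-sdk` and `opentelemetry-exporter-otlp` packages are assumed to be installed.

```python
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export spans over OTLP/gRPC to a collector assumed to run on localhost:4317.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

# One trace for a single request, broken into spans; trace and span IDs
# are generated automatically and attached to every span.
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("user.id", "1234")

    with tracer.start_as_current_span("charge-card") as child:
        # Events record notable occurrences on the span at a point in time.
        child.add_event("card.retry", {"attempt": 2})

        # Propagate the current trace context to a downstream service by
        # injecting a W3C `traceparent` header into the outgoing headers.
        outgoing_headers = {}
        inject(outgoing_headers)
        # outgoing_headers now holds e.g. {"traceparent": "00-<trace-id>-<span-id>-01"}
```

In practice, instrumentation libraries for HTTP clients, web frameworks, and database drivers perform the span creation and header injection automatically; the manual calls above just make the mechanics visible.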
### Data Collection Architectures

There are many software-as-a-service (SaaS) products designed to reduce the complexity of capturing and creating value from telemetry data. Some of the leaders are Datadog, Splunk, New Relic, Dynatrace, and Honeycomb. In that model, the collector forwards everything to the vendor's collection service:

```mermaid
graph TB
    subgraph "Application Network"
        subgraph Gateway["Gateway"]
            GatewayAgent["OTel Agent"]
        end
        subgraph AppServer["App Server"]
            AppServerAgent["OTel Agent"]
        end
        subgraph Database["Database"]
            DatabaseAgent["OTel Agent"]
        end
        OTLPCollector["OTel Collector"]
    end
    subgraph "External Network"
        SAAS["SaaS Collection Service"]
    end
    GatewayAgent -->|"OTLP"| OTLPCollector
    AppServerAgent -->|"OTLP"| OTLPCollector
    DatabaseAgent -->|"OTLP"| OTLPCollector
    OTLPCollector -->|"Telemetry Data"| SAAS
```

There are also open source tools like Prometheus, Jaeger, Grafana, and Loki that store, process, and visualize telemetry data on self-hosted infrastructure. A simple example architecture could look something like this:

```mermaid
graph TB
    subgraph "Application Network"
        direction TB
        subgraph Gateway["Gateway"]
            GatewayAgent["OTel Agent"]
        end
        subgraph AppServer["App Server"]
            AppServerAgent["OTel Agent"]
        end
        subgraph Database["Database"]
            DatabaseAgent["OTel Agent"]
        end
        OTLPCollector["OTel Collector"]
        Prometheus["Prometheus"]
        Jaeger["Jaeger"]
        Grafana["Grafana"]
        Loki["Loki"]
    end
    GatewayAgent -->|"OTLP"| OTLPCollector
    AppServerAgent -->|"OTLP"| OTLPCollector
    DatabaseAgent -->|"OTLP"| OTLPCollector
    OTLPCollector -->|"Metrics"| Prometheus
    OTLPCollector -->|"Logs"| Loki
    OTLPCollector -->|"Traces + Events"| Jaeger
    Prometheus -->|"Metrics"| Grafana
    Jaeger -->|"Traces + Events"| Grafana
    Loki -->|"Logs"| Grafana
```