Level up your debugging!

You know how tricky it is to develop complex distributed systems. What about debugging them? It’s even trickier!

Think of all the complexity: there are many services, and each of them may fail independently. Some failures cascade into others, hiding the root cause. Plus, there are external services we don't control, and they may introduce vulnerabilities of their own!

You may be wondering, then: how should we debug such complex distributed systems? Distributed tracing for the win!

During my time at Expedia, I worked on a project where each request had to go through a dozen services, each deployed as an independent application. To figure out where a problem occurred, we had to guess and check a variety of logs and dashboards—a time-consuming and frustrating process, not unlike untangling a giant knot woven together with hundreds of threads.

Eventually, our team adopted Haystack, an open-source framework built in-house that made it easier to debug our distributed application. Since then, I’ve been using distributed tracing everywhere, and in this blog post, I’ll show you why.

OpenTracing

Haystack is one of many implementations of OpenTracing, a vendor-neutral, standardized API for distributed tracing that many different tracing backends implement.

The fundamental concepts of OpenTracing are traces, spans, and trace contexts:
A trace is a collection of spans, often visualized as a Gantt chart.

Fig 1. A single trace

A span represents a unit of work (an operation) in the trace. Each span has a name and a time range during which the associated operation happened. It may also carry a reference to its parent span, status codes, or tags added by the user.

Fig 2. A span and its tags
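
To make this concrete, here is a minimal sketch using the OpenTracing Java API. It assumes a concrete tracer (Haystack, for example) was registered with `GlobalTracer` at application startup; the operation names and tag are made up for illustration:

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class SpanExample {
    public static void main(String[] args) {
        // Assumes a concrete tracer (e.g. Haystack, Jaeger) was registered
        // with GlobalTracer at application startup.
        Tracer tracer = GlobalTracer.get();

        // The parent span covers the whole operation...
        Span parent = tracer.buildSpan("handle-request").start();

        // ...and the child span covers one unit of work within it.
        Span child = tracer.buildSpan("query-database")
                .asChildOf(parent)               // reference to the parent span
                .withTag("db.type", "postgres")  // a tag added by the user
                .start();

        doQuery(); // hypothetical unit of work being measured

        // Finishing a span records the end of its time range.
        child.finish();
        parent.finish();
    }

    private static void doQuery() { /* placeholder for real work */ }
}
```

The child span’s time range nests inside the parent’s, which is exactly what the Gantt chart in Fig 1 visualizes.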

How do spans connect into a single trace if they occur in different services? The trace context is the key: a small piece of data that uniquely identifies the trace and the current span, and that is propagated along with every request so each service can attach its spans to the right trace.
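
Here is a rough sketch of that propagation with the OpenTracing Java API, using a plain header map to stand in for an HTTP request between two hypothetical services:

```java
import io.opentracing.Span;
import io.opentracing.SpanContext;
import io.opentracing.Tracer;
import io.opentracing.propagation.Format;
import io.opentracing.propagation.TextMapAdapter;
import io.opentracing.util.GlobalTracer;

import java.util.HashMap;
import java.util.Map;

public class PropagationExample {
    public static void main(String[] args) {
        Tracer tracer = GlobalTracer.get();

        // Service A: inject the trace context into the outgoing request headers.
        Span clientSpan = tracer.buildSpan("call-service-b").start();
        Map<String, String> headers = new HashMap<>();
        tracer.inject(clientSpan.context(), Format.Builtin.HTTP_HEADERS,
                new TextMapAdapter(headers));
        // ... send the HTTP request with these headers attached ...
        clientSpan.finish();

        // Service B: extract the context and continue the same trace.
        SpanContext parentContext = tracer.extract(Format.Builtin.HTTP_HEADERS,
                new TextMapAdapter(headers));
        Span serverSpan = tracer.buildSpan("handle-call")
                .asChildOf(parentContext)
                .start();
        // ... do the work ...
        serverSpan.finish();
    }
}
```

In a real system the two halves run in different processes: the headers travel over the wire, and the extracted context is what ties Service B's spans to Service A's trace.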

So what?

Logging systems do not explicitly track causality between events, whereas tracing systems keep track of the causality relationships between spans. This lets us immediately see potential issues that weren't obvious before.

We can go from viewing global metrics to inspecting a specific customer's errors in seconds. All the data we need for performance tracking is right at our fingertips: latency, errors, connectivity, and more!

Let’s take a look at a few examples.

Critical path

Fig 3. Trace latency

The simplest way to find the critical path is to look at the trace of the longest request and identify which spans contributed most to its duration.

Search operations

Fig 4. Search by name and by error

It is easy to find operations by name, by service, by error, or by custom tags.
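
Searches like these only work if the data is on the spans, so it pays to tag them. Here is a small sketch with the OpenTracing Java API; the operation name and the `customer.id` tag key are made-up examples, while `Tags.ERROR` is the standard OpenTracing error tag:

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.tag.Tags;
import io.opentracing.util.GlobalTracer;

public class TaggingExample {
    public static void bookHotel(String customerId) {
        Tracer tracer = GlobalTracer.get();
        Span span = tracer.buildSpan("book-hotel").start();
        try {
            // A custom tag makes this span findable by customer later.
            span.setTag("customer.id", customerId);
            // ... business logic ...
        } catch (RuntimeException e) {
            // The standard error tag is what "search by error" relies on.
            Tags.ERROR.set(span, true);
            throw e;
        } finally {
            span.finish();
        }
    }
}
```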

Service map

Fig 5. Traffic between services

The service map allows us to reason about the system as a whole and understand traffic patterns within it.

Instrumenting your code

Perhaps you have already figured out that you need to wrap your operations in spans. You may be worried about the amount of work that takes. Fortunately, it's easier than it sounds!
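
For instance, manual instrumentation can be factored into a tiny helper. The sketch below (a hypothetical `traced` method, not part of any library) wraps an arbitrary operation in a span using the OpenTracing Java API:

```java
import io.opentracing.Scope;
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.tag.Tags;
import io.opentracing.util.GlobalTracer;

import java.util.function.Supplier;

public class Tracing {
    // Hypothetical helper: runs the operation inside a span and
    // always finishes the span, even if the operation throws.
    public static <T> T traced(String operationName, Supplier<T> operation) {
        Tracer tracer = GlobalTracer.get();
        Span span = tracer.buildSpan(operationName).start();
        // Activating the span makes it the implicit parent for any
        // spans started on this thread while the operation runs.
        try (Scope scope = tracer.activateSpan(span)) {
            return operation.get();
        } catch (RuntimeException e) {
            Tags.ERROR.set(span, true);
            throw e;
        } finally {
            span.finish();
        }
    }
}

// Usage (hypothetical): Profile p = Tracing.traced("load-profile", () -> repository.load(id));
```

Because the active span becomes the default parent of any span started on the same thread, nested calls line up in the trace automatically.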

The latest distributed tracing frameworks support auto-instrumentation out of the box. Database statements and calls to popular frameworks such as Spring, Akka, or Play are auto-magically traced for you.

Conclusion

At Expedia, we had a latency problem. Using OpenTracing, we discovered that we were performing the same operation several times in different services. Removing the redundant calls sped the system up by 200%!

OpenTracing can save us time because it helps us debug complex distributed systems without manually untangling their giant distributed knots.

Learn more at https://opentracing.io/.

Happy debugging!