# Distributed Tracing in Node.js: How OpenTelemetry and Axiom Help Improve Performance

:::info
:bulb: This article doesn't claim to be academically "correct"; rather, it is an elaborated lessons-learned from the author. Take it with a grain of salt!
:::

![axiom_dashboard](https://hackmd.io/_uploads/BJR-ThWZxg.png)

Developing a backend or fullstack Node.js app has become easier, maybe even trivial, these days. It's "easy" enough to learn, and you have access to a wide range of actively maintained libraries and tools, from UI kits and modern frontend frameworks to REST frameworks, authentication, and even digital payment integration.

But this ease and simplicity come with trade-offs, especially on the backend. JavaScript (Node.js) is fundamentally slower than backend-native languages like Go, Java, or C#. So, if you want performance that's on par with those technologies, you'll need to invest effort into optimization across your development process.

Yes, building a Node.js app is simple, but making it performant is a different story. Not only do you need to understand how Node.js works under the hood, you also have to identify where the bottlenecks are: which parts of your app are slowing things down.

This article will walk you through how distributed tracing and monitoring, using OpenTelemetry and Axiom, can help you uncover performance bottlenecks and optimize your Node.js app effectively.

## Distributed Tracing

Tracing is a method of recording how a process unfolds, usually as a time span or duration: a request being handled, a function call, or even an API response time. Distributed tracing applies the same idea to a distributed system or microservice architecture: you "install" an agent that does the tracing in each service, and a centralized logging or monitoring system accepts and aggregates those traces and logs.

### OpenTelemetry

[OpenTelemetry](https://opentelemetry.io/) is the tracing agent we will "install" in our services. It's a collection of tools, SDKs, and APIs across several programming languages that helps you instrument, collect, and export telemetry data such as traces, logs, and metrics.

Luckily, since Node.js is a popular tech stack, OpenTelemetry already provides an SDK and preconfigured instrumentation. You only need to install them via

```
pnpm install @opentelemetry/auto-instrumentations-node @opentelemetry/sdk-node
```

and the instrumentation will automatically capture traces and spans for each recognized library call. To see all supported automatic instrumentations, refer [here](https://www.npmjs.com/package/@opentelemetry/auto-instrumentations-node).

### Axiom

[Axiom](https://axiom.co) is a cloud-native monitoring, logging, and alerting platform. It is similar to Grafana+Prometheus or the ELK stack, but it is a cloud-managed service, so you don't need to deploy your own dashboard.

In our distributed tracing flow, Axiom is the receiving end. Whenever OpenTelemetry finishes collecting traces, it sends the telemetry data to Axiom as a dataset. In Axiom, a dataset is like a database: you can separate telemetry data by domain into different datasets. These datasets can then be post-processed and visualized as information or insights on Axiom.

![image](https://hackmd.io/_uploads/HkDUlA-Zxg.png)
<p style="text-align:center;"> Example of trace span telemetry visualized on the Axiom dashboard </p>

In the picture above, you can see that the trace is made up of span telemetry: the time or duration taken by each function call. The orange bars represent the spans visually, which gives you an intuitive sense of which functions are slow. To set up OpenTelemetry with Axiom, refer [here](https://axiom.co/docs/guides/opentelemetry-nodejs).
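To make this concrete, below is a minimal bootstrap sketch along the lines of the linked guide, not a verbatim copy of it. It assumes you also install an OTLP trace exporter package such as `@opentelemetry/exporter-trace-otlp-proto`, and that `AXIOM_API_TOKEN`, `AXIOM_DATASET`, and the service name are values you provide yourself; double-check the guide for the exact endpoint and headers for your setup.

```ts
// instrumentation.ts - minimal tracing bootstrap (sketch; names and env vars are placeholders).
// Load it before the rest of the app, e.g. `node --import ./instrumentation.js server.js`
// (or `--require` for CommonJS builds).
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto';

const sdk = new NodeSDK({
  serviceName: 'my-node-service', // appears as the service name in Axiom
  traceExporter: new OTLPTraceExporter({
    // Axiom's OTLP traces endpoint, authenticated with an API token (see the linked guide)
    url: 'https://api.axiom.co/v1/traces',
    headers: {
      Authorization: `Bearer ${process.env.AXIOM_API_TOKEN}`,
      'X-Axiom-Dataset': process.env.AXIOM_DATASET ?? 'my-dataset',
    },
  }),
  // Auto-instrument recognized libraries (http, express, pg, ioredis, ...)
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush any pending spans on shutdown so the last traces still reach Axiom.
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```

With something like this in place, every incoming HTTP request becomes a trace, and each recognized downstream call (a Postgres query, a Redis command, an outgoing HTTP request) shows up as a child span under it.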
## Identifying Bottlenecks

Generally, bottlenecks show up as a ***slow response***, and there are many ways, as well as indicators, to identify them in a Node.js app. The conditions below are a good starting point:

1. Long span duration
2. CPU stall indicators
3. Blocking process

### Long Span Duration

![image](https://hackmd.io/_uploads/BkPcpAbWgl.png)

In the example trace above, the total span duration is around **1800ms**. Notice that around **1600ms** was spent on just two function calls, `redis-GET` and `pg.query.SELECT`, with the former taking most of the span duration. This indicates that the biggest bottleneck is in `redis.get()`, when the system tries to query for cached data.

#### Optimization

Optimization strategies naturally depend on the root cause of the performance issue.

1. If you're dealing with a CPU-bound task, you might need to increase CPU power or core count.
2. If it's a network-bound task, you may need to improve connection handling, such as enabling keep-alive connections or implementing a connection pool.

In our case, we're focusing on optimizing `redis-GET` during peak load. Since Redis is **single-threaded**, using a connection pool won't significantly improve performance, because ultimately all commands are processed one at a time on a single thread per instance. That means increasing concurrent connections to a single Redis instance won't help much.

To handle high read loads more effectively, a better approach is to **enable Redis Cluster with multiple read replicas**. This way, read requests can be distributed across replicas, significantly reducing latency and improving throughput.

In short, if you're hitting **read-performance** limits with Redis, adding **read replicas and distributing load** among them is a logical and scalable optimization, as sketched below.
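As an illustration of that last point, here is a hedged sketch using `ioredis` (which may or may not be the client used in the original code; the host names, ports, and cache key are placeholders). When connecting to a Redis Cluster, the client can route read-only commands such as `GET` to replica nodes via the `scaleReads` option, while writes still go to the masters.

```ts
import { Cluster } from 'ioredis';

// Sketch only: node addresses are placeholders for your own cluster topology.
const redis = new Cluster(
  [
    { host: 'redis-node-1.internal', port: 6379 },
    { host: 'redis-node-2.internal', port: 6379 },
  ],
  {
    // Route read-only commands (GET, MGET, ...) to replica nodes,
    // spreading read load instead of funneling everything to the masters.
    scaleReads: 'slave',
  }
);

// Reads like this cache lookup can now be served by any replica in the cluster.
const cached = await redis.get('products:page:1');
```

Keep in mind this only helps read-heavy workloads: writes are still bound by the single thread on each master, and replica reads can lag slightly behind the latest write.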
### CPU Stall Time

![image](https://hackmd.io/_uploads/Syp4Q1zZle.png)

In the example trace above, the total span duration is around **823ms**. Notice that around **820ms (over 90% of it)** was taken by `request handler - /api/auth/login`, yet the durations of its child spans don't add up to 820ms, not even 10% of it. This can mean several things: either some logic or function isn't instrumented by `@opentelemetry/auto-instrumentations-node`, or a CPU stall is happening.

A CPU stall is a condition where a process (or, more precisely, a thread) can't make progress because there isn't enough computing resource available, so it keeps waiting until it's given enough CPU time. This usually happens when your app experiences spiking or peak load with (almost) maximized CPU usage and no resource scaling in place.

#### Optimization

In theory, we want to eliminate CPU stalls entirely during peak load. But in practice, that's not always feasible.

One of the most straightforward ways to address performance issues is by **increasing computing resources**, for example by upgrading to a machine with more CPU power or cores. However, this approach is **economically expensive**, especially at scale. In cost-sensitive environments, you might intentionally **allow some level of CPU stall** to keep infrastructure costs under control. The key is to strike a balance between performance and cost, depending on your app's priorities and business requirements.

### Blocking Process

![image](https://hackmd.io/_uploads/r1v9ikzbgg.png)

```ts=
const products = await getPaginatedProductsByTenantId(
  SERVER_TENANT_ID,
  page,
  pageSize
);

if (products) {
  // Blocking happens
  await productCache.setPaginatedProducts(cacheKey, products);
}

return {
  data: {
    products,
  },
  status: 200,
};
```

In the example trace above, notice that the `redis-SET` span duration alone takes around **1400ms**. This is clearly a bottleneck. Looking at the code for this logic, line 9 blocks the process before the function returns the actual data.

#### Optimization

To optimize this, rather than awaiting the `redis-SET`, we can **defer its execution** if the cached data isn't needed immediately afterward. Here's how the change looks:

```ts=
if (products) {
  productCache
    .setPaginatedProducts(cacheKey, products)
    .then((isSuccess) => {
      if (!isSuccess)
        logger.warn(`Failed to set cache entry for key ${cacheKey}`);
    })
    .catch((e) => {
      logger.error(
        e,
        `Error while setting cache entry for key ${cacheKey}`
      );
    });
}

return {
  data: {
    products,
  },
  status: 200,
};
```

By **removing the await**, the `redis-SET` becomes a **fire-and-forget** operation. This means the main thread doesn't have to wait for the Promise to resolve before executing the `return` statement, reducing response time slightly, especially under heavy load. The improvement can actually be seen in the traces, where `GET /api/product` already returns at around **1600ms** while the `redis-SET` only finishes at around **1800ms**, as shown below.

![image](https://hackmd.io/_uploads/H1l3pJzZxe.png)

## Conclusion

Distributed tracing with OpenTelemetry and Axiom in a Node.js app can help you identify performance bottlenecks by providing visibility into what's happening inside your application. OpenTelemetry supports multiple programming languages, making it ideal for distributed systems that span different ecosystems or technology stacks.