# FINAL REPORT
---
## TL;DR
This project takes [Grandine](https://github.com/grandinetech/grandine) to a whole new level of observability.
By integrating modern Tokio Tracing, I replaced the old logging pipeline with a structured, async-aware, and runtime-configurable tracing architecture that opens a wide observability gateway for the client.
🔥 **[project proposal](https://github.com/eth-protocol-fellows/cohort-six/blob/master/projects/Grandine-Implementing-Tokio-Tracing-For-Debugging-And-Performance-Analysis.md)** 🔥 - for full technical details.
---
## Content
[TOC]
---
## Challenges
Grandine moves fast - chasing the new hard fork, FUSAKA - and tracing touches almost every file.
That meant constant syncing with new code, chasing merge conflicts, and updating tracing across files that changed daily. It required a sharp eye, speed, and a lot of patience.
For anyone else attempting a tracing integration: I pushed through and executed the [full 9-phase **roadmap**](https://github.com/eth-protocol-fellows/cohort-six/blob/master/projects/Grandine-Implementing-Tokio-Tracing-For-Debugging-And-Performance-Analysis.md#roadmap), making sure that after every phase the tracing integration could compile, pass all tests (including consensus), and remain mergeable with production code.
[tracing integration challenges](https://github.com/eth-protocol-fellows/cohort-six/blob/master/projects/Grandine-Implementing-Tokio-Tracing-For-Debugging-And-Performance-Analysis.md#challenges) - for full technical details
---
## First impressions
How it all started
I joined the cohort and - hey, we’re going to work directly on the protocol and there’s a whole group of us!
Lots of ambitious and smart people.
So we get introduced to the rules, and the Cohort 6 machine is ready to start!
- Mondays — standup meetings, a chance to share and learn from other fellows
- Wednesdays — office hours, a chance to learn from the most experienced ones
We picked our preferred projects and teams, wrote project proposals, presented them, and everything was ready to begin.
but..
---
<details>
<summary>bad luck at start</summary>
Every time I scheduled my proposal presentation, bad luck intervened and I couldn’t present it… ([attempt 1](https://github.com/eth-protocol-fellows/cohort-six/issues/138#issuecomment-3065793701), [attempt 2](https://github.com/eth-protocol-fellows/cohort-six/issues/176#issuecomment-3083694536) ...)
Probably some kind of cosmic balance, just so things wouldn’t go too perfectly.
🔥 Summary of my “bad luck achievements”: 3 full anesthesias in just 2 weeks. 🔥
It all started as a routine surgery, so at that time my EPF contributions were actually made from the hospital..
Then a complication - a doctor’s mistake that required another surgery.
More recovery, then home treatment and small contributions along the way.
And then, guess what? :smile:
The second surgery damaged an artery located near the operated area… and it burst while I was far from the hospital.
Barely made it back, lost a lot of blood…
And so I proudly fell into the tiny percentage of patients who get this kind of “after-surgery bonus complication” from what was supposed to be a simple routine procedure.
This time, contributions had to wait.
It seems the doctors could use some observability too :smile:
</details>
### Motivated Comeback
Even though I missed every possible presentation slot, the first day after leaving the hospital, I sat down, recorded the [missed proposal presentation on YouTube](https://www.youtube.com/watch?v=Y047VIkHe9Q&t=3s), sent it to the EPF group, and got myself back in the game!
At least now I had nowhere to go - one full month of staying at home, just typing.
So the story could continue :smile:
In addition to the usual Monday standup and Wednesday office hours, I got another weekly meeting - Thursday - with the Grandine team lead, Saulius!
A one-hour slot where we dive into technical details, discuss ideas, planning, and implementation possibilities.
They approved my initial proposal - and the contribution could finally begin.
---
# Implementation can start
Tracing is a big feature that needs to be merged directly into production, and it touches almost every file.
So I needed an incremental approach - let’s make a tracing branch and add features to it step by step.
---
## FIRST PR THAT WAS MERGED INTO PRODUCTION
- **[Full Repository Migration to Tokio Tracing & Dynamic Runtime Log Control #422](https://github.com/grandinetech/grandine/pull/422)**
- **[eth2_libp2p migration to Tokio Tracing #21](https://github.com/grandinetech/eth2_libp2p/pull/21) (companion submodule PR)**

The main PR is composed of 3 sub-PRs:
---
### **First sub PR**
- -> [Tracing Migration – Foundational Setup](https://github.com/sntntn/grandine/pull/1)
This PR turned the tracing subscriber into Grandine’s central diagnostic backend and hooked it into the old logging interface.
I set up the initial formatting for existing log events and added support for adjusting which logging levels get collected through ENV variables at node startup.
This means the node can now choose exactly which level of logs the tracing subscriber listens to - so later we can process that information.
But for now, output goes to the console, with automatic TTY detection and ANSI styling.
If the output is redirected to a file, the user can explicitly enable ANSI styling when needed.
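To give a feel for what this phase set up, here’s a minimal sketch of such a startup path using `tracing-subscriber` - the `GRANDINE_LOG` and `FORCE_ANSI` variable names are illustrative assumptions, not Grandine’s actual ones:

```rust
use std::io::IsTerminal;

use tracing_subscriber::{fmt, EnvFilter};

fn main() {
    // Read the level filter from an ENV variable at node startup,
    // falling back to `info` when it is unset or invalid.
    let filter = EnvFilter::try_from_env("GRANDINE_LOG")
        .unwrap_or_else(|_| EnvFilter::new("info"));

    // ANSI styling only when stdout is a real TTY, unless the user
    // explicitly forces it (e.g. when redirecting output to a file).
    let ansi = std::env::var("FORCE_ANSI").is_ok() || std::io::stdout().is_terminal();

    fmt().with_env_filter(filter).with_ansi(ansi).init();

    tracing::info!("tracing subscriber is live");
}
```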
Passed Tumas’ review and **got merged** into the tracing branch!
---
### **Second sub PR** (big one)
- -> [Repository-Wide Tracing Migration & Full Network Layer Instrumentation](https://github.com/sntntn/grandine/pull/9)
The next step was to introduce a set of custom tracing macros that would replace every old logging event across the codebase.
These macros plug directly into the existing tracing system and every event now automatically includes a `PEER METRICS` field - a requirement from the Grandine team.
To make this work smoothly, I ensured full compatibility with all previously used log formats, studied how the original tracing macros were built, and kept the same performance guarantees.
Thanks to how tracing works internally, the extra `PEER METRICS` field isn’t even evaluated unless the subscriber is listening at the corresponding level - so the macros add no runtime overhead when their level is disabled.
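To make that concrete, here’s a hypothetical sketch of such a wrapper macro - the `peer_metrics` field and the `peer_metrics_snapshot()` helper are invented for the example, not Grandine’s real names:

```rust
// Stand-in for whatever peer statistics the real macros attach.
fn peer_metrics_snapshot() -> String {
    "connected=42 syncing=3".to_owned()
}

macro_rules! info_with_peers {
    ($($arg:tt)*) => {
        // The field value is an ordinary expression inside `tracing::info!`,
        // so it is only evaluated when an info-level subscriber is actually
        // listening - the zero-overhead property described above.
        tracing::info!(peer_metrics = %peer_metrics_snapshot(), $($arg)*)
    };
}

fn main() {
    tracing_subscriber::fmt().init();
    info_with_peers!(slot = 12_345, "block imported");
}
```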
With the new macros ready, I went through the codebase and replaced every existing log event.
At the same time, I coordinated updates between the main Grandine repo and the Lighthouse-derived **`eth2_libp2p`** submodule.
The old `slog` bridge, which used serialization to deliver p2p logs and their fields into Grandine, was removed.
All slog events were converted to tracing events, now wired directly to the tracing subscriber - meaning fields can be passed straight through without any serialization at all.
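To give a feel for that conversion, here’s an illustrative before/after (logger, peer id, and message are invented for the example):

```rust
fn main() {
    tracing_subscriber::fmt().init();
    let peer_id = "16Uiu2HAm…"; // illustrative placeholder

    // Before: a slog event whose fields crossed the bridge via serialization:
    //     slog::info!(logger, "peer connected"; "peer_id" => %peer_id);

    // After: the equivalent tracing event - fields flow straight to the
    // subscriber, with no serialization step in between.
    tracing::info!(peer_id = %peer_id, "peer connected");
}
```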
With the new macros in place, it was time to move into **instrumentation**.
I started by instrumenting the P2P layer and the HTTP API endpoints.
Once the trace level is enabled, these instrumented parts begin producing spans that add real context to our tracing events - finally letting us see where and why things happen in this async-heavy environment.
No more scattered logs with no connection to each other. Now every event comes with a story attached.
[(Here’s the PR for the submodule.)](https://github.com/grandinetech/eth2_libp2p/pull/21)
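Here’s roughly what that instrumentation style looks like with tracing’s `#[instrument]` attribute - the handler and types below are hypothetical, not Grandine’s actual endpoints:

```rust
use tracing::instrument;

#[derive(Clone)]
struct AppState;

impl AppState {
    async fn lookup(&self, _slot: u64) -> Result<String, String> {
        Ok("0xabc…".to_owned())
    }
}

// Every call opens a span recording its arguments, so any event emitted
// inside it (or in anything it awaits) is tied to the request that caused
// it instead of floating alone in the log stream.
#[instrument(level = "trace", skip(state))]
async fn get_block_root(state: AppState, slot: u64) -> Result<String, String> {
    tracing::trace!("looking up block root");
    state.lookup(slot).await
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::TRACE)
        .init();
    let _ = get_block_root(AppState, 42).await;
}
```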
In the middle of all this, I hit a **tricky thread-safety issue** that took days to track down - huge thanks to Tumas for help on this.
> Debugging thread safety in a codebase this large, especially after touching almost every file, is… fun.
In the end, the entire issue came from one single span that wasn’t async-friendly, sitting in a place where only an async-safe span should be.
More about this in [issue #14](https://github.com/sntntn/grandine/issues/14).
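For anyone who hasn’t hit this before: the generic failure mode behind bugs like this (a sketch of the pattern, not the exact code from the issue) is holding a span guard across an `.await`:

```rust
use tracing::Instrument;

async fn do_io() {}

// Buggy: the `enter()` guard is a thread-local marker. If the task yields
// at the `.await` and resumes on another worker thread, the span
// bookkeeping is silently corrupted.
async fn process_buggy() {
    let span = tracing::trace_span!("process");
    let _guard = span.enter(); // entered on thread A…
    do_io().await;             // …but the task may resume on thread B
}

// Async-safe: attach the span to the future itself, so it is entered and
// exited correctly on every poll, whichever thread runs it.
async fn process_fixed() {
    do_io().instrument(tracing::trace_span!("process")).await;
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt().init();
    process_buggy().await;
    process_fixed().await;
}
```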
Even with that, timing was still good - validator duties were successfully instrumented after this, and only FCR remained before the whole project could wrap up.
But then Saulius dropped an interesting question:
“**Can we change tracing behavior at runtime?**”
And of course the answer was yes - we just had to find out how.
---
There was still time for more adventure, so I went back to digging through the tracing docs and opened a fresh PR:
### Third sub PR
- -> **[Add runtime API endpoint for dynamic tracing log level control](https://github.com/sntntn/grandine/pull/18)**
I wrapped the tracing subscriber in a reload layer, created a tracing handle, propagated it through the runtime, exposed it via a POST API endpoint - and **BOOM** - now the entire tracing system **can change its behavior at runtime**, all through a simple **`POST`** request. ([video example](https://www.youtube.com/embed/c8RrdKiOQno))
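A minimal sketch of the mechanism, assuming `tracing-subscriber`’s `reload` layer (the HTTP wiring is elided - a real handler would just call `modify` with the level parsed from the request body):

```rust
use tracing_subscriber::{prelude::*, reload, EnvFilter};

fn main() {
    // Wrap the filter in a reload layer and keep the handle.
    let (filter, handle) = reload::Layer::new(EnvFilter::new("info"));

    tracing_subscriber::registry()
        .with(filter)
        .with(tracing_subscriber::fmt::layer())
        .init();

    tracing::debug!("invisible: the filter is still at `info`");

    // The handle is cheap to clone and can be propagated into the HTTP
    // layer; a POST handler then swaps the active filter at runtime.
    handle
        .modify(|f| *f = EnvFilter::new("debug"))
        .expect("failed to reload the filter");

    tracing::debug!("visible: the filter changed without a restart");
}
```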
Perfect. Merge to the tracing branch, and then ship it to production!
---
BUT…
The rest of the Grandine team was pushing hard at the same time.
The submodule had to be synchronized with the PeerDAS branch and the main repository absorbed a large part of Fusaka.
>That left me with thousands of merge conflicts.
And the merge commit looked [like this](https://github.com/sntntn/grandine/commit/e4ca206b9055dfbc41d8b047d3fb3d75ad2e6c35):

Then Grandine shipped its first pre-release, and I was supposed to land in the second.
And the Grandine team kept pushing hard :D
And so began the cycle:
- Fix merge conflicts in the evening → wake up to new ones.
- Fix them again → new commits arrive while the CI workflow tests are still running.
Grandine was moving at full speed, and I was integrating a feature that touched basically everything - so any new code immediately collided with it.
>Evening: fix everything.
Morning: everything broken again.
In the end, after days of this loop, Tumas one morning [manually merged this PR into the main Grandine branch](https://github.com/grandinetech/grandine/pull/422).
### Tracing lands in the next pre-release
- And just like that, **[this version of tracing](https://github.com/grandinetech/grandine/commit/384378e40cfc15fc7b51edc10e7ced298d07d734)** is now officially part of the **Grandine pre-release**!

---
## SECOND PR THAT WAS MERGED INTO PRODUCTION
We still had some time, so why stop?
The `crit!` macros weren’t really capturing what we wanted to highlight.
We needed a way to mark unusual behavior - things that don’t necessarily belong to the normal life of our consensus client. Sometimes it’s something suspicious coming from outside, sometimes just something that breaks the usual pattern.
Enter the `exception!` macro:
- visible on standard output only for developers, enabled either via the `POST API` at runtime or via an `ENV` variable at startup
- invisible to regular users but still there in case something goes wrong - everything is stored in a separate file so it can be provided to us if needed.
To support all of this, a brand-new tracing subscriber design was introduced in the PR.
- **[Two-Layer Tracing Subscriber Design with Dedicated Exception Log File and Local Timezone Support + Developer Controls + Tests #432](https://github.com/grandinetech/grandine/pull/432)**
(with the companion submodule PR) **[rename 'crit' logs in 'exception' logs for better clarity #22](https://github.com/grandinetech/eth2_libp2p/pull/22)**
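A minimal sketch of a two-layer design in that spirit - the `exception.log` file name and the `exception` target used for filtering are assumptions for the example, not the PR’s exact implementation:

```rust
use std::{fs::File, sync::Mutex};

use tracing_subscriber::{filter::filter_fn, fmt, prelude::*};

fn main() {
    let exception_file = Mutex::new(File::create("exception.log").expect("create file"));

    // Layer 1: normal console output for regular users.
    let console_layer = fmt::layer();

    // Layer 2: captures only events tagged as exceptions and writes them,
    // without ANSI escapes, into the dedicated file.
    let exception_layer = fmt::layer()
        .with_ansi(false)
        .with_writer(exception_file)
        .with_filter(filter_fn(|meta| meta.target() == "exception"));

    tracing_subscriber::registry()
        .with(console_layer)
        .with(exception_layer)
        .init();

    tracing::error!(target: "exception", "unexpected peer behaviour");
}
```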
-> **The PR got merged** and officially shipped **in the Grandine FUSAKA release**.
### Tracing landed in the [Fusaka 2.0.0](https://github.com/grandinetech/grandine/releases/tag/2.0.0) release!!!
We chased that, and we made it!


> Meanwhile, I added a full suite of unit tests: verifying tracing from both the developer and regular-user side, behavior across different tracing levels, and all the formats allowed by our custom tracing macros.
## THIRD PR THAT WAS MERGED INTO PRODUCTION
Exception logs don’t happen often - but from my personal life experience I know there is always a “but”! :smile:
The last thing we need is a runaway log file eating disk space..
So I built in a threshold: 1GB per exception log file.
When a file reaches that limit, it gets automatically compressed (shrinking its size by ~5–6x) and a fresh file is opened.
We keep five of these compressed files in rotation - once we hit the sixth, the oldest one simply gets rotated out.
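A back-of-the-envelope sketch of that rotation policy (file names are illustrative, and the compression step is stubbed out - the real thing would use something like the `flate2` crate):

```rust
use std::{fs, io, path::Path};

const MAX_BYTES: u64 = 1024 * 1024 * 1024; // 1GB per active exception log
const KEPT_ARCHIVES: usize = 5;            // compressed files kept in rotation

fn rotate_if_needed(dir: &Path) -> io::Result<()> {
    let active = dir.join("exception.log");
    let len = match fs::metadata(&active) {
        Ok(meta) => meta.len(),
        Err(_) => return Ok(()), // no active file yet, nothing to rotate
    };
    if len < MAX_BYTES {
        return Ok(());
    }

    // Shift exception.log.4.gz -> .5.gz and so on; the oldest archive
    // (the would-be sixth file) simply gets removed.
    let _ = fs::remove_file(dir.join(format!("exception.log.{KEPT_ARCHIVES}.gz")));
    for i in (1..KEPT_ARCHIVES).rev() {
        let from = dir.join(format!("exception.log.{i}.gz"));
        if from.exists() {
            fs::rename(&from, dir.join(format!("exception.log.{}.gz", i + 1)))?;
        }
    }

    compress(&active, &dir.join("exception.log.1.gz"))?; // ~5-6x smaller
    fs::File::create(&active)?; // truncate: start a fresh active file
    Ok(())
}

fn compress(_src: &Path, _dst: &Path) -> io::Result<()> {
    // Stand-in for the real gzip step.
    Ok(())
}

fn main() -> io::Result<()> {
    rotate_if_needed(Path::new("."))
}
```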
The result?
- Up to 5GB of exception history preserved
- But never more than ~2GB total disk usage
- Fully automatic, hands-off log management
A small change, but it turns exception logging into something robust, predictable, and production-safe.
And this PR successfully landed in production. 🚀
-> **[rotate exception.log when it exceeds 1GB #456](https://github.com/grandinetech/grandine/pull/456#event-20727003149)**
## Exploring Tracing Backends and Integration Possibilities
This is an exciting, ongoing exploration.
There are plenty of cool tracing backends and tools out there - from [Adaptive Tracing](https://grafana.com/docs/grafana-cloud/adaptive-telemetry/adaptive-traces/introduction/) to Tempo, and even features that let you query traces using [natural language powered by AI](https://grafana.com/whats-new/2025-08-08-access-tracing-data-using-mcp-server-in-grafana-cloud-traces/) instead of [TraceQL](https://grafana.com/docs/tempo/latest/traceql/).
Thanks to the way our program is instrumented, it’s already fully compatible with these tools - giving us a solid foundation to experiment, integrate, and take observability to the next level.
# Arriving in Buenos Aires
Typing this on the plane, thinking about everything - how far I’ve come…
Getting ready to meet my fellows, who have been growing, learning, exploring, and becoming experts in their areas of blockchain.
Also finally meeting Josh and Mario, the people who brought all of us together from all around the world and led us through this whole journey with guidance and support.
Landing in Buenos Aires to meet visionaries and builders who are, in some ways, like me - and I’m excited to learn from our differences and from their perspectives on the present and the future.
Just arrived!
I’m meeting up with Nando (one of the fellows) to share an Uber.
Our driver, showing up as “Rufino,” tells us he’s actually a writer who uses the people he picks up at the airport as inspiration - and then writes stories about them.
So let our story begin :smile:
> (When he writes ours, I’ll add the link here)
Buenos Aires - here we are! :rocket:
---
# PR summary
<details>
<summary>🔥 List of merged PRs:</summary>
- **[Full Repository Migration to Tokio Tracing & Dynamic Runtime Log Control #422](https://github.com/grandinetech/grandine/pull/422)**
Summary of Related PRs:
- [Tracing Migration – Foundational Setup](https://github.com/sntntn/grandine/pull/1)
- [Repository-Wide Tracing Migration & Full Network Layer Instrumentation](https://github.com/sntntn/grandine/pull/9)
- [Add runtime API endpoint for dynamic tracing log level control](https://github.com/sntntn/grandine/pull/18)
- **[eth2_libp2p migration to Tokio Tracing #21](https://github.com/grandinetech/eth2_libp2p/pull/21) (companion submodule PR)**
- **[Two-Layer Tracing Subscriber Design with Dedicated Exception Log File and Local Timezone Support + Developer Controls + Tests #432](https://github.com/grandinetech/grandine/pull/432)**
- **[rename 'crit' logs in 'exception' logs for better clarity #22](https://github.com/grandinetech/eth2_libp2p/pull/22) (companion submodule PR)**
- **[rotate exception.log when it exceeds 1GB #456](https://github.com/grandinetech/grandine/pull/456#event-20727003149)**
</details>
---
---
> <details>
> <summary>PR on hold</summary>
>
> - PR on hold (in case Grandine ever wants that output customization)
> **[logging format: introduce AlignedFormatter for aligned log output with optional ANSI colors #447](https://github.com/grandinetech/grandine/pull/447)**
> </details>
### Mentors
[Saulius](https://github.com/sauliusgrigaitis)
[Tumas](https://github.com/Tumas)