![image](https://hackmd.io/_uploads/ByB8-YjWWe.png)

Before college, I believed that the only meaningful difference between programming languages was their raw execution speed: they sat in a simple hierarchy based on performance. However, my recent work on the Data Acquisition (DAQ) pipeline built by [Rocket Project @ UCLA](https://www.rocketproject.seas.ucla.edu/) has given me a more nuanced view.

**Context**

Building rocket engines is an iterative process, and the data collected during engine tests, better known in industry as coldflows and hotfires, is critical feedback on the team's engine design. To build the most granular profile of our engines for post-test analysis, we also aim to collect data at the highest rates possible. At a high level, our DAQ was set up like this:

![Screenshot 2025-12-01 at 10.45.04 AM](https://hackmd.io/_uploads/rJkHfPoWbe.png)

**Problem**

Our DAQ collects pressure data at 400 Hz, and after one of our coldflows this year we noticed that some data packets were being dropped at random points during the test. The dropouts lasted for as long as 2 seconds, which equates to more than 1000 dropped packets. We use the pyserial package to communicate with the ESP32, and the code looked something like this:

```python
while True:
    # read one packet, terminated by '\r\n', then strip the terminator
    line_hv = serial_port_hv.read_until(b'\r\n', None).rstrip(b'\r\n')
    data = struct.unpack("8f", line_hv)
    log_entry = process_data(data)
```

**Triage**

Fortunately, the issue was also quite easy to reproduce consistently. Running the same script over a period of 8 minutes, we observed about 100 independent instances where data packets were lost.

![image](https://hackmd.io/_uploads/rJ_PjEjWbl.png)

Some simple profiling, timing the main components of our DAQ pipeline, showed that ingesting an entire packet over the serial port took more time than processing it, confirming that this was an I/O-bound workload.

![Screenshot 2025-12-01 at 10.49.16 AM](https://hackmd.io/_uploads/SkF47PsbZe.png)

**Initial Solution**

Based on the profiling above, it seemed like resolving the packet drops would require running the reading and processing loops in parallel, since each individual component could complete its part of the work before the next packet arrived (<2.5 ms). Incorporating a producer-consumer architecture into our DAQ made it look something like this:

![Screenshot 2025-12-01 at 11.19.47 AM](https://hackmd.io/_uploads/BylD5DjZbg.png)

Nonetheless, even with the individual components split into threads, the packet loss persisted. More research led me to Python's Global Interpreter Lock (GIL), which only allows one thread to execute Python bytecode at a time. This prevented us from achieving true parallelism, which, for our DAQ, meant that packet drops were inevitable.

**Trying out multithreading with Go**

While Python could not give us true parallelism (at least not on Python 3.11), that didn't mean a parallel program was out of reach. I had picked up Go earlier this year, and its suitability for I/O-bound tasks and multithreading made it the perfect candidate for our DAQ. Modelling the same producer-consumer pattern, we started to see promising results in the log files generated by [our Go program](https://github.com/SleepyWoodpecker/Rp-Goes-Serial) for the same 8-minute test as before. We went from ~100 packet drops to just 2!
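For reference, here is a minimal Python sketch of that producer-consumer structure: the same shape we tried with Python threads and then mirrored in Go. It is a sketch rather than our exact code; the port name, baud rate, and `process_data` stub are placeholders.

```python
import queue
import struct
import threading

import serial  # pyserial

# unbounded queue handing raw packets from the reader thread to the processor
packet_queue: "queue.Queue[bytes]" = queue.Queue()


def process_data(values):
    """Stand-in for the real processing/logging step."""
    print(values)


def reader(port: serial.Serial) -> None:
    """Producer: pull raw packets off the serial port as soon as they arrive."""
    while True:
        line = port.read_until(b'\r\n', None).rstrip(b'\r\n')
        packet_queue.put(line)


def processor() -> None:
    """Consumer: unpack and log packets without blocking the reader."""
    while True:
        line = packet_queue.get()
        data = struct.unpack("8f", line)
        process_data(data)


if __name__ == "__main__":
    # port name and baud rate are placeholders, not our actual configuration
    port = serial.Serial("/dev/ttyUSB0", baudrate=115200)
    threading.Thread(target=reader, args=(port,), daemon=True).start()
    processor()  # run the consumer on the main thread
```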
Note: At the recommendation of a [friend](https://github.com/en0mem/), our analysis now used a Perl script instead of a Colab notebook, but the principle remains the same: check that consecutive lines in the log file follow sequential packet numbering (a Python sketch of the same check appears just after the Remarks section below).

![Screenshot 2025-10-24 at 12.45.00 AM](https://hackmd.io/_uploads/SyGJ6vsZZg.png)

**Stumbling on our real bottleneck**

While we had definitely fixed the problem, it still felt like something was amiss. Even though Python's GIL prevented parallelism, I had expected the reading thread to introduce minimal overhead and sleep while waiting for data, allowing the processing thread to run; that setup should have been enough to stop the packet loss. Checking the [source code for `os.read`](https://github.com/python/cpython/blob/9bbb68a52456655f15953ff0e2ee4e2bdbf78c1e/Modules/posixmodule.c#L9932) confirms this, with its `Py_BEGIN_ALLOW_THREADS` and `Py_END_ALLOW_THREADS` macros releasing the GIL around the blocking read.

Still pretty confused, I stumbled across [VizTracer](https://github.com/gaogaotiantian/viztracer) in a bid to better understand how our threaded Python program was behaving with the GIL. Here's a screenshot of one invocation of our reading function in the execution trace:

![Screenshot 2025-11-18 at 12.10.17 AM](https://hackmd.io/_uploads/S1R0f_j--l.png)

And this was where we realized what the problem was. Do you see it too?

Under a single call to `read_until` there were multiple calls to `read` (42 to be exact, since our packets were 42 bytes each). This particular `read` also happened to be the call to `os.read`, which meant that every time we ingested a single packet, we were making 42 system calls! This is consistent with the [implementation of `read_until` in pyserial](https://github.com/pyserial/pyserial/blob/1c760efe790317589e87bc1c091b9705a82d246a/serial/serialutil.py#L668), which reads one byte at a time until it sees the terminator. Since we knew the packet size we were expecting, we could minimize the syscall overhead by writing our reading function like this:

```python
line = b""
# request the remaining bytes of the packet in bulk instead of byte by byte
while (l := len(line)) < expected_packet_length:
    line += serial_port.read(expected_packet_length - l)

if not line.endswith(STOP_SEQUENCE):
    print(f"[Reader {READING_TYPE}] End sequence wrong {line[:-2]}")
    sync(serial_port)
    continue

line = line.removesuffix(STOP_SEQUENCE)
return line
```

A screenshot from the trace of our new implementation confirms this:

![Screenshot 2025-11-18 at 12.01.35 AM](https://hackmd.io/_uploads/SyJHSdo-Wl.png)

Running the same Perl script on our new log files after the same 8-minute test, we also confirmed that there were no packet drops (I regrettably do not have a screenshot of this).

**Remarks**

While Go was definitely flashier to implement and could potentially future-proof our pipeline by accommodating much higher data rates than Python could, we decided to stick with Python to:

- Preserve the existing ecosystem and avoid the significant overhead and technical risk of a full-scale migration.
- Prioritize long-term maintainability by choosing a language that remains accessible to cross-functional team members without more software-oriented backgrounds.

It is also really cool to notice that while software historically has not been a primary focus for Rocket Project, that is rapidly changing. As our hardware pushes boundaries with higher data rates, we are facing novel problems that can only be solved through better software. It is a privilege to be part of this shift, and I am excited for the challenges we will encounter next.
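As a small closing aside, the drop check referenced throughout this post (first a Colab notebook, then the Perl script) boils down to walking the log and flagging any non-sequential jump in the packet counter. Here is a Python sketch of that idea; the CSV layout and the position of the counter field are assumptions, not our actual log schema:

```python
import csv
import sys


def count_drop_events(log_path: str, counter_field: int = 0) -> int:
    """Count contiguous gaps in the packet counter of a CSV log file.

    Assumes each log line is comma-separated with the packet counter in
    `counter_field`; adjust the index for the actual log schema.
    """
    drop_events = 0
    prev = None
    with open(log_path, newline="") as f:
        for row in csv.reader(f):
            current = int(row[counter_field])
            if prev is not None and current != prev + 1:
                drop_events += 1  # one non-sequential jump = one drop event
            prev = current
    return drop_events


if __name__ == "__main__":
    print(count_drop_events(sys.argv[1]))
```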
On a more personal note, I had read somewhere that the drastic improvements attributed to rewriting a program in a faster language are often the result of a better implementation rather than of the language change itself. This experience seems to strengthen that view. Finally, I used to believe that reading source code was the best method for debugging and optimization, but this experience has demonstrated the undeniable power of code tracers. Seeing the value of runtime analysis has motivated me to add tools like `perf` to my skillset over the next few months.

**Acknowledgements**

Thank you to Nathan for helping me through every stage of this debugging process, as well as the amazing Rocket Project team! <3