
# Improving debugging support for JIT'd code in LLVM

## Motivation

For a JIT to have debugging support, it needs to communicate the layout of the JIT'd code to the debugger. Debugging support in GDB and LLDB for JIT'd code in LLVM is currently implemented on top of an [interface](https://sourceware.org/gdb/onlinedocs/gdb/JIT-Interface.html) that was designed more than a decade ago. At a high level, the interface works through knowledge of special globals that is shared between the debugger and the JIT. While the interface works, it is not optimal for performance, and there is room for improvement in both what is shared between the JIT and the debugger and how it is shared.

## Dimensions for improvement

### The "What"

In LLVM, debugging support for JIT'd code is implemented using a [Plugin](https://github.com/llvm/llvm-project/blob/08cf5360c2a5885f3ee879cd4b32fb9a74aae07f/llvm/include/llvm/ExecutionEngine/Orc/ObjectLinkingLayer.h#L60). For every relocatable object that is added to the LLVM JIT, this plugin stores a copy of the entire object in memory and makes the object available to the debugger via the above-mentioned debugger-JIT interface. The debugger then processes the debug sections of the in-memory objects and generates the information that it needs to enable debugging of the object.

#### Open questions

1. **Does the JIT have to share an entire copy of each relocatable object that was added to it with the debugger?**
   * Perhaps the information shared about the object could be determined by which debugging capabilities the user wants?
2. **How does the size of debug info scale as translation units get smaller?**

To find out, I did some initial experimentation with a very simple program.

```c
int add(int a, int b) {
  return a + b;
}

int mul(int a, int b) {
  return a * b;
}

int main() {
  int a = add(4, 2);
  int b = mul(4, 2);
  return a + b;
}
```

First, I compiled the program, with debug info turned on, to LLVM IR.
Then, I compiled from IR to machine code in two different ways:

1. Compile the IR for the whole program to an object file as-is.
2. Use `llvm-nm` and `llvm-extract` to break the program up into a bunch of single-function modules, then compile each of them to an object file.

Finally, I compared the size of the debug info sections between cases (1) and (2). The debug information sizes were as follows:

1. Whole: 1040 bytes
2. Divided: 1962 bytes

Instead of trying this out with more programs right away, I tried to understand the general patterns and causes of bloated debug info. I noticed at least three things:

1. The same types are duplicated across object files.
2. The `.debug_abbrev` section gets duplicated across object files.
3. The `.debug_str` section gets partially duplicated across object files.

I read [this](https://dwarfstd.org/doc/Debugging%20using%20DWARF-2012.pdf) doc on DWARF, and it seems there are some prescriptive ways that we can try out to shrink debug info:

1. DWARF offers ways to further reduce the size of the debugging data. Most strings in the DWARF debugging data are actually references into a separate `.debug_str` section. Duplicate strings can be eliminated when generating this section. Potentially, a linker can merge the `.debug_str` sections from several compilations into a single, smaller string section.
2. Many compilers generate the same abbreviation table and base types for every compilation, independent of whether the compilation actually uses all of the abbreviations or types. These can be saved in a shared library and referenced by each compilation unit, rather than being duplicated in each.
3. Many programs contain declarations which are duplicated in each compilation unit. For example, debugging data describing many (perhaps thousands) declarations of C++ template functions may be repeated in each compilation. These repeated descriptions can be saved in separate compilation units in uniquely named sections.
   The linker can use COMDAT (common data) techniques to eliminate the duplicate sections.
4. Many programs reference a large number of include files which contain many type definitions, resulting in DWARF data which contains thousands of DIEs for these types. A compiler can reduce the size of this data by only generating DWARF for the types which are actually used in the compilation. With DWARF Version 4, type definitions can be saved into a separate `.debug_types` section. The compilation unit contains a DIE which references this separate type unit and a unique 64-bit signature for these types. A linker can recognize compilations which define the same type units and eliminate the duplicates.

### The "How"

The current debugger-JIT interface locks us into the format that we use to specify a relocatable object's debug info.

> Additionally, I've heard it also forces us to send the debug info container (relocatable object) to the process being debugged, but I don't understand how. IIUC, the `DebugObjectManagerPlugin` owns the debug objects and just shares pointers to the memory occupied by its debug objects with the debugger.

#### Open questions

1. GDB offers an [interface](https://sourceware.org/gdb/onlinedocs/gdb/Custom-Debug-Info.html#Custom-Debug-Info) through which JITs decide on a debug info format and provide readers that parse the debug info generated by the JITs. Can we explore this approach for LLDB?
2. Is it feasible to implement a query-based interface, where the debugger can ask the JIT for the debug info that it needs? How granular can the queries be?
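
For reference, the shared-globals mechanism behind the current interface can be sketched in C. The type and symbol declarations follow the GDB JIT interface documentation linked in the Motivation section. The helper `register_debug_object` is a hypothetical name introduced here for illustration and is not LLVM's actual implementation. The sketch shows that the JIT only publishes a pointer and a size for an object that lives in the JIT'd process's own memory, and the debugger reads the object from there.

```c
#include <stdint.h>
#include <stdlib.h>

/* Declarations as described in the GDB JIT interface documentation. */
typedef enum {
  JIT_NOACTION = 0,
  JIT_REGISTER_FN,
  JIT_UNREGISTER_FN
} jit_actions_t;

struct jit_code_entry {
  struct jit_code_entry *next_entry;
  struct jit_code_entry *prev_entry;
  const char *symfile_addr; /* in-process address of the object file */
  uint64_t symfile_size;    /* size of that object file in bytes */
};

struct jit_descriptor {
  uint32_t version;
  uint32_t action_flag; /* holds a jit_actions_t value */
  struct jit_code_entry *relevant_entry;
  struct jit_code_entry *first_entry;
};

/* The debugger places a breakpoint in this function. The empty asm
   statement keeps the compiler from optimizing calls to it away. */
void __attribute__((noinline)) __jit_debug_register_code(void) {
  __asm__ volatile("" ::: "memory");
}

/* The version is set statically because the debugger may read it early. */
struct jit_descriptor __jit_debug_descriptor = {1, JIT_NOACTION, NULL, NULL};

/* Hypothetical helper (assumption, not an LLVM API): link a new entry at
   the head of the list and notify the debugger. The caller keeps ownership
   of the memory at `obj`, which must stay valid while it is registered. */
struct jit_code_entry *register_debug_object(const char *obj, uint64_t size) {
  struct jit_code_entry *entry = malloc(sizeof *entry);
  if (entry == NULL)
    return NULL;

  entry->symfile_addr = obj;
  entry->symfile_size = size;
  entry->prev_entry = NULL;
  entry->next_entry = __jit_debug_descriptor.first_entry;
  if (entry->next_entry != NULL)
    entry->next_entry->prev_entry = entry;

  __jit_debug_descriptor.first_entry = entry;
  __jit_debug_descriptor.relevant_entry = entry;
  __jit_debug_descriptor.action_flag = JIT_REGISTER_FN;
  __jit_debug_register_code(); /* the debugger's breakpoint fires here */
  return entry;
}
```

Because each entry describes a whole object file, this layout is what ties the "what" (entire objects) to the "how" (a linked list of in-memory blobs). A query-based interface would need a different contract than a single pointer and size per object.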