# Compiler: The Program, the Language, & the Computer Work

###### tags: `Compiler` `Tokenizer` `Parser` `Optimization` `Compiler Design` `GCC` `Clang` `LLVM` `JVM` `TurboFan` `V8` `Automata` `GNU` `Linux` `Kernel` `Program Analysis` `Operating System` `Computer Architecture` `ASM` `ABI` `ISA` `Runtime` `Interpreter` `Toolchain` `C` `C++` `Java` `JavaScript` `C#` `Cross Platform` `JIT` `AOT`

These are the learning notes, research, & discussions on compiler design for GNU/Linux systems. Topics include optimization strategies, program analysis, runtime support, cross-compilers, & other advanced subjects.

## Preface

> **"All computer work is just doing string processing."**

By approaching programs with a mathematical & automata mindset, we can understand limitations, optimize processes, and approach ultimate solutions.

:::success
**"Anything can be optimized. Anything can become an exquisite piece of art."**

Whatever path we strive for, mastering the essence of each endeavor allows us to capture the true gem of artistry.

For Example:
- **Metaprogramming:** Viewing the program from a high-level perspective focuses on the essence of the code, not the code itself. The code can then be read, generated, analyzed, transformed into other programs, or even modified by itself at runtime. This flexibility & dynamic manipulation of the code allows us to **==Treat "Code" as "Data"==**: we can ***++inspect "Code" as "Data"++*** & make potential improvements like *code motion*, or do generative programming that lets us ***++manipulate "Code" as "Data"++***. [Metaprogramming - Wiki](https://en.wikipedia.org/wiki/Metaprogramming)
- **Exposing Runtime Environments to Programming Languages with APIs:** For example, the .NET Common Intermediate Language (CIL) emitter. [C# CLR, IL, JIT Compilation & Code Access Security Explained - Kunal Tandon - Medium](https://medium.com/@kunaltandon.kt/c-clr-il-jit-compilation-code-access-security-explained-269124121f5)
![image](https://hackmd.io/_uploads/HkF24wFjA.png =300x)
- **Dynamic Execution of Expressions:** *"Treat Data as Executables"*. Dynamic expressions contain programming commands, often composed from strings, arguments, or context, as in JavaScript. Thus, *"programs can write programs"*.
- **"Meta"-Programming:** View programs from a superior aspect beyond the actual programs. General-purpose program transformation systems (e.g., compilers), which accept language descriptions and carry out arbitrary transformations on those languages, are direct implementations of *general metaprogramming*.
    - [Compiler-compiler - Wiki](https://en.wikipedia.org/wiki/Compiler-compiler)
- **Generic Programming Paradigm:** Generate a program that can *"automatically manufacture software components"*. For example, `POSIX Shell Script`, `IDL` spec files, macros, `template` in `C++`, dynamic function loading, ...
- **Use Case: Domain-Specific Language (DSL) Applications:** `lex` (lexical analyzer) & `yacc` (parser generator) for generating lexical analyzers & parsers let the user describe the language using regular expressions & context-free grammars, & embed the complex algorithms required to efficiently parse the language.

> **Note:** Not all *metaprogramming* involves generative programming.
If programs are *modifiable* at runtime, or if incremental compiling is available (e.g., in `C#`, `JavaScript`, `Lisp`, `Elixir`, `Lua`, `Perl`, `PHP`, `Python`, `Ruby`, `Rust`, `R`), then techniques can be used to perform metaprogramming **without** generating source code.

- **JIT Compilation:** Dynamic translation, or run-time compilation; that is, compilation at *RUNTIME*. [[Intro] Metaprogramming and JIT Compilers - A Brief Overview - VoxelRifts](https://www.youtube.com/watch?v=FFgvV0sA3kU)
- **Metaprogramming from Source Code:** Lacking any further information about the code, there is not much we can do besides *++Generative Programming++*.
- **Metaprogramming from AST / IR / Bytecode:** With extra information beyond the source code (gained via data/control-flow analysis), we can not only do generative programming but also *++inspect & analyze++* the program & take further actions.
- **Metaprogramming from Assembly:** Neither high-level enough to carry informative details nor low-level enough for the machine to identify easily, which limits its use cases for metaprogramming.
- **Metaprogramming from Machine Code:** Here we can turn data into arbitrary executables.
    > **Note:** The generated executables must follow the correct calling convention of the target platform & its machine-code encoding.
- **JIT Compilation Implementation Concept:**
```cpp=
#include <sys/mman.h>

/*
 * Since memory from malloc is not executable by default,
 * we need to use mmap instead.
 * (Assumes x86-64 System V: 1st arg in edi, return value in eax.)
 */
char* buffer = mmap(
    NULL,                                // Start address (let the kernel choose)
    1024,                                // Length of the mapping
    PROT_READ | PROT_WRITE | PROT_EXEC,  // Protection flags
    MAP_PRIVATE | MAP_ANONYMOUS,         // Mapping flags
    -1,                                  // File descriptor (not used here)
    0                                    // Offset (not used here)
);
if (buffer == MAP_FAILED)
    return -1;

// func setup: the mapped buffer itself is the function body
typedef int (*buffer_as_func)(int);
buffer_as_func func = (buffer_as_func) buffer;

// mov eax, edi
buffer[0] = 0x89;
buffer[1] = 0xf8;
// ret
buffer[2] = 0xc3;

// execute
int value = func(42); // returns 42

int result = munmap(buffer, 1024);
buffer = NULL;
if (result == -1)
    return -1;
return 0;
```

> Related Topics:
> - **The JIT Implementation on Browser Engine (JS JIT Entity x Browser Native Environment):** [Binding & Sandboxing - Anatomy of Web Browser Engines - shibarashinu](https://hackmd.io/@shibarashinu/H1wrAk4o0#WebCore---Rendering-Engine)
> - **Exploiting Executable Code:** [Sandboxing the No Data Execution Prevention (DEP) - shibarashinu](https://hackmd.io/@shibarashinu/H1UUR_p-0#Memory-Exploit-in-JavaScript-JIT-Engine)
> - **LLVM:** Intro lies below.

:::info
**Classic Example: Lisp**

In *Lisp metaprogramming*, the unquote operator (typically a comma) introduces code that is evaluated at program definition time rather than at run time. The *metaprogramming* language is thus identical to the *host* programming language, & **existing Lisp routines can be directly reused for *metaprogramming* if desired** *(taking "meta" literally)*.

This approach has been ==implemented in other languages by incorporating an ++interpreter++ in the program==, which works directly with the program's data.
:::

- **Algorithms x Turing Machines:**
    - **Specialized Data Structures:** [Dijkstra's Hidden Prime Finding Algorithm - boo1](https://www.youtube.com/watch?v=fwxjMKBMR7s)
    - **Dynamic Programming:** [The Algorithm Behind Spell Checkers - b001](https://www.youtube.com/watch?v=d-Eq6x1yssU)
- **Computer Architectures x Calculation:**
    - **Fast InvSqrt:** `i = 0x5f37'59df - (i >> 1)` & Newton's Method [没有显卡的年代,这群程序员用4行代码优化游戏 - 量子位](https://www.youtube.com/watch?v=g1r3iLejTw0)
    - **Fast Fourier Transform**
    - Bitwise, Error Correction, Lossy/-less Compression, Encode, Hash, Encryption
    - **Data Structures for Special Scenarios:** Splay Tree, B+ Tree, Bloom Filters, BSP Tree, R* Tree, [你所不知道的 C 語言: linked list 和非連續記憶體 - jserv](https://hackmd.io/@sysprog/c-linked-list), ...
    - **Hardware Acceleration with Lookup Tables:** [Pentium FDIV bug - Wiki](https://en.wikipedia.org/wiki/Pentium_FDIV_bug)
    - **Parallel Computing:** SIMD, SIMT, MIMD
- **Computation Transform:**
    ![image](https://hackmd.io/_uploads/ryGkzHPhR.png =200x) ![image](https://hackmd.io/_uploads/rJ_VF3m90.png =200x)
    - **Logarithm:** Turns multiplications into additions.
    - **Calculus:** Riemann Integral, Taylor Series, Fourier Series, ...
    - **Differential Equations:** ODE, PDE, Laplace Transform, Fourier Transform, Z Transform
    - **Linear Algebra:** linear transformations, transformation compositions, eigenvectors, ...
    - **Dimensional Analysis:** Try to find the critical associated control factors in large-scale complex systems.
    - **Probability, Statistics:** Same as dimensional analysis, but focusing on the "results".
    - **Automatic Control:** transfer functions, PID feedback systems, ...
    - **Apply Geometric Approaches to Solve Algebraic Patterns:** [Why is pi here? And why is it squared? A geometric answer to the Basel problem - 3Blue1Brown](https://www.youtube.com/watch?v=d-o3eB9sfls)
    - **Numerical Analysis:** numerical methods, finite element method, Newton's Method, ...
    - **Complex Analysis:** complex analytic functions, ...
    - **Natural Transformations:** *"Morphisms between Functors"*. Transform one functor into another while preserving the categorical structure.
:::

:::warning
**Hackers & Painters:** *The philosophy of how we think, how we work, how we develop technology, & how we live.*

> Empathy is probably the single most important difference between a good hacker and a great one. Some hackers are quite smart, but when it comes to empathy are practically solipsists. It's hard for such people to design great software, because they can't see things from the user's point of view.

Resources:
- [[Intro] Hackers and Painters - Paul Graham](https://paulgraham.com/hp.html)
- [[Book] Hackers & Painters: Big Ideas from the Computer Age - Paul Graham - Amazon](https://www.amazon.com/Hackers-Painters-Big-Ideas-Computer-ebook/dp/B0026OR2NQ?ref_=ast_author_mpb)
- [[Article] Computer Programming as an Art - Donald E. Knuth, 1974](https://paulgraham.com/knuth.html)
- [[Book] The World Is Flat: A Brief History of the Twenty-first Century - Thomas L. Friedman - Amazon](https://www.amazon.com/World-Flat-History-Twenty-first-Century/dp/0374292884)
- [[Article] The Top Idea in Your Mind - Paul Graham](https://www.paulgraham.com/top.html)
- [Part 1 - Capt.
Grace Hopper on Future Possibilities: Data, Hardware, Software, and People (1982) - National Security Agency](https://www.youtube.com/watch?v=si9iqF5uTFk) [Part 2](https://www.youtube.com/watch?v=AW7ZHpKuqZg)

{%youtube si9iqF5uTFk %}
:::

## Reference

#### Courses
- [OptComp-2021: Optimising Compilers - University of Cambridge](https://www.cl.cam.ac.uk/teaching/2021/OptComp/)
- [COPT-2019: Compiler Optimisation - University of Edinburgh](https://www.inf.ed.ac.uk/teaching/courses/copt/)
- [CS 6120: Advanced Compilers - Cornell University](https://www.cs.cornell.edu/courses/cs6120/2022sp/lesson/)
- [15-819 O: Program Analysis - Carnegie Mellon University](https://www.cs.cmu.edu/~aldrich/courses/15-819O-13sp/)
- [CS 701: Construction of Compilers - University of Wisconsin–Madison](https://pages.cs.wisc.edu/~fischer/cs701.html)
- [CSC 372: Systems II: Compiler Construction - Adelphi University](https://home.adelphi.edu/~siegfried/cs372/notes.html)

#### Lectures
- [你所不知道的 C 語言:編譯器和最佳化原理篇 - jserv](https://hackmd.io/@sysprog/c-compiler-optimization)
- [你所不知道的 C 語言: 執行階段程式庫 (CRT) - jserv](https://hackmd.io/@sysprog/c-runtime)

#### Links
- [[Book] Let’s Build a Compiler - Jack Crenshaw](https://xmonader.github.io/letsbuildacompiler-pretty/about.html)
- [Compiler Design Tutorial - GeeksforGeeks](https://www.geeksforgeeks.org/compiler-design-tutorials/)
- [編譯器設計 - ji3g4al](https://hackmd.io/@ji3g4al/SyBQJy9yq)
- [深入淺出教你寫編譯器 (Compiler) - Jace Ju](https://jaceju.net/simple-compiler/)
- [GNU Compiler Collection (GCC) Internals - Online Docs - GNU GCC](https://gcc.gnu.org/onlinedocs/gccint/)
- [GNU Compiler Collection Internals PDF (for gcc version 4.2.4) - GNU GCC](https://gcc.gnu.org/onlinedocs/gcc-4.2.4/gccint.pdf)
- [The Java® Virtual Machine Specification - Oracle](https://docs.oracle.com/javase/specs/jvms/se8/jvms8.pdf)

<!-- #### Books
- [[Book] Hackers and Painters: Big Ideas from the Computer Age - digtvbg](https://digtvbg.com/files/books-for-hacking/Hackers%20%26%20Painters%20-%20Big%20Ideas%20From%20The%20Computer%20Age%20by%20Paul%20Graham.pdf)
- [[Book] Basics of Compiler Design - Department of Computer Science (University of Copenhagen)](http://hjemmesider.diku.dk/~torbenm/Basics/basics_lulu2.pdf)
- [[Book] Compilers: Principles, Techniques, and Tools - NTHUCS](https://www.cs.nthu.edu.tw/~ychung/slides/CSC4180/Alfred%20V.%20Aho,%20Monica%20S.%20Lam,%20Ravi%20Sethi,%20Jeffrey%20D.%20Ullman-Compilers%20-%20Principles,%20Techniques,%20and%20Tools-Pearson_Addison%20Wesley%20(2006).pdf)
- [[Book] Crafting a Compiler - NTHUCS](https://www.cs.nthu.edu.tw/~ychung/slides/CSC4180/Crafting%20a%20Compiler%20-%202010.pdf)
- [[Book] Engineering a Compiler - R-5: The Game of Life](https://www.r-5.org/files/books/computers/compilers/writing/Keith_Cooper_Linda_Torczon-Engineering_a_Compiler-EN.pdf) -->

#### Conferences
- [IEEE Xplore](https://ieeexplore.ieee.org/Xplore/home.jsp)
- [OSDI '24 - USENIX](https://www.usenix.org/conference/osdi24)
- [The International Symposium on Code Generation and Optimization (CGO) - ACM](https://dl.acm.org/conference/cgo)

#### Reviews
- **Theory of Computation**
    ![image](https://hackmd.io/_uploads/ByAcV9bS0.png =500x)
    - **RE / NFA / DFA:** Ex. $L=\{w : |w| \bmod 3 = 0\}$
    - **PDA:** A state machine with mutable memory (a "stack"), Ex.
$L=\{a^nb^n\}$
- **Compiler Construction**
    - Lexer, Top-Down/Bottom-Up Parsing, Grammar, SDT, IR, Code Gen for VMs, Runtime Support, Target Code Gen
    - [Compiler Design Tutorial - tutorialspoint](https://www.tutorialspoint.com/compiler_design/index.htm)

#### Todos
:::warning
- **Programming Language Theory**
    - Lambda Calculus
    - Formal Semantics
    - Type Theory
    - Program Analysis & Transformation
    - Run-Time Systems
    - Metaprogramming
    - Lisp: *List Processing Languages* for symbolic data processing. [[Book] History of Lisp Parentheses - shaunlebron](https://github.com/shaunlebron/history-of-lisp-parens?tab=readme-ov-file)
- **Category Theory** [Applied Category Theory Course - John Baez](https://math.ucr.edu/home/baez/act_course/)
    - [Natural Transformations - Bartosz Milewski's Programming Cafe](https://bartoszmilewski.com/2015/04/07/natural-transformations/)
- **Functional Programming** [Why is writing a compiler in a functional language easier? - StackOverflow](https://stackoverflow.com/questions/2906064/why-is-writing-a-compiler-in-a-functional-language-easier)

:::info
**Pros of F.P. in Compilation**

A compiler frequently works with trees, parsing source code into a syntax tree. This tree may be transformed for type checking, converted to core language elements, and optimized through various transformations. Eventually, the compiler generates a tree in a normal form, from which it produces the target assembly code.

Functional languages have features like **pattern-matching**, **immutability** (anti-side-effect), a **declarative nature** ("what to" rather than "how to" solve), and good support for **efficient recursion**, which make it easy to work with trees; that's why they're generally considered good languages for writing compilers.
:::

- **C Language Reference**
    - [C Language Reference - C++, C, and Assembler - Microsoft Learn](https://learn.microsoft.com/en-us/cpp/c-language/c-language-reference?view=msvc-170)
:::

## Overviews

### Compiler Applications

- Android Runtime (JVM bytecode compiler: JIT formerly, AOT nowadays)
- Java, .NET, Python, PHP, Ruby
- VMware, QEMU, eBPF
- v8 JavaScript Engine, wasm, regex engine (JIT Compiler)
- HTML, CSS Parser
- Shader Language GLSL (LLVM Backend)
- OpenCL, Android RenderScript (Clang Frontend as ref, LLVM Backend)
- Font FreeType (Bytecode Interpreter): noto, Adobe TTF
- Editor Autocomplete Feature (Clang Frontend)
- Cross-Platform Portable Applications, Ex. "Wine Is Not an Emulator": from simulator program to host-machine executable

### GCC Framework

:+1: [GCC 4.0.2 - The Implementation - GCC Resource Center](https://www.cse.iitb.ac.in/grc/intdocs/gcc-implementation-details.html)

![image](https://hackmd.io/_uploads/HJRTQUPB0.png =500x)
![image](https://hackmd.io/_uploads/rJvs-UDSR.png =500x)

- **Lexer / Tokenizer (Lexical Analyzer):** Uses `RE` -> `NFA` -> `DFA` to turn strings into ==tokens== (e.g., keywords, identifiers, literals, operators, ...).
- **Parser (Syntax (Rule) & Semantic (Type) Analysis):** Uses an `SDD` (Syntax-Directed Definition) to perform a top-down/bottom-up `SDT` (Syntax-Directed Translation) -> `CFG` -> `PDA` to generate an ==AST==.
- **Intermediate Code Generator:** Generates ==IR== from the `AST`.
- **[Machine-Independent] Pre-Code Optimizer:** Uses the `IR` to streamline program logic and apply software-specific techniques.
- **Code Generator:** Uses the `IR` to perform an `SDT` to generate ==target-machine code==.
- **[Machine-Dependent] Post-Code Optimizer:** Checks whether the target-machine code can be improved by applying hardware-specific techniques or optimizations.

Applications:
- [Which GCC optimization flags affect binary size the most? - StackOverflow](https://stackoverflow.com/questions/72030595/which-gcc-optimization-flags-affect-binary-size-the-most)

### LLVM

Mainly contributes its own **language-agnostic** compiler `IR`, optimizer, & backend infrastructure in *C++*, which accepts frontends for both static & dynamic languages (e.g., *JVM bytecode*).

![image](https://hackmd.io/_uploads/ryLdlAWIA.png)
![image](https://hackmd.io/_uploads/Byv0pRW8A.png)

- **Clang:** An "LLVM-native" compiler frontend that transforms C/C++/Objective-C into *LLVM IR*.
- **Core (opt & llc/lli):** Responsible for *LLVM IR* analysis, optimization, & the compilation process.
- **Dragonegg:** Uses the GCC 4.5 compiler frontend with the LLVM Core.
- **LLDB:** LLVM's Debugger

> Back in the day, LLVM stood for *Low-Level Virtual Machine* & used GCC as its compiler frontend (with *GIMPLE* output/input).
> However, after Apple attempted to transition from GCC to LLVM by applying licenses other than GPL v2 to their contributions, particularly for LLVM as a GCC backend, the GCC community responded by updating the GPL license to prevent such actions. In response, LLVM proceeded to develop its own compiler frontend, known as **Clang**.

Intro:
- :+1: [A Gentle Introduction to LLVM IR - mcyoung](https://mcyoung.xyz/2023/08/01/llvm-ir/)
- :+1: [LLVM - The Architecture of Open Source Applications](https://aosabook.org/en/v1/llvm.html)

#### LLVM IR

Exists in these equivalent forms:
1. **JIT in-memory format**
2. **asm format** (textual)
3. **bitcode format** (the same idea as bytecode)

> **Main design concept:** The lower & more detailed the language (IR) is, the more optimization strategies the program gains.

#### LLVM TableGen & TD (Target Description)

Similar to the *GCC RTL patterns* & the corresponding *MD (Machine Description)*.

Ex. Running TableGen
```sh
llvm-tblgen X86.td -print-enums -class=Register
```
- x86.td
```cpp=
// Definition of ADD for the x86 architecture
let Predicates = [HasAVX] in {     // [HasAVX]: asserts AVX support is available
def ADD : I<
    0x01,                          // Defines opcode: 0x01
    MRMSrcMem,                     // Addressing mode: register/memory source operand
    (outs GR32:$dst),              // output: 32-bit general-purpose register
    (ins GR32:$src, Mem:$mem),     // src: 32-bit register & mem: memory operand
    "add{l}\t{$src, $mem|$mem, $src}",            // assembly output format
    [(set GR32:$dst, (add GR32:$src, Mem:$mem))]  // SSA instruction selection format
> {
    let Inst{25-24} = 0b11;   // ModR/M mode: Register/Memory
    let Inst{22-20} = 0b000;  // Opcode for ADD
    let Inst{19-18} = 0b01;   // Operand size: 32-bit
    let Inst{17-16} = 0b00;   // Addressing mode: Direct
    let Inst{3-0}   = 0b0001; // Instruction opcode
}
}
```

#### LLVM Optimization

Implemented as *Analysis and Transform Passes* that traverse some portion of a program to either collect information or transform the program.

- **Analysis passes:** Compute information that following passes can use, or that serves debugging/profiling purposes.
- **Transform passes:** Mutate the program in some way.
- **Utility passes:** Provide some easy-to-use utilities (e.g., extracting functions to bitcode or writing a module to bitcode).
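As a concrete sketch of the taxonomy above, a minimal analysis-style function pass under LLVM's new pass manager could look like the following (`CountInsnsPass` is a hypothetical name; the plugin-registration boilerplate needed to run it under `opt` is omitted, and the API should be checked against your LLVM version):

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// A pass that only collects & reports information, mutating nothing.
struct CountInsnsPass : PassInfoMixin<CountInsnsPass> {
  PreservedAnalyses run(Function &F, FunctionAnalysisManager &) {
    unsigned Count = 0;
    for (BasicBlock &BB : F)  // walk every basic block in the function
      Count += BB.size();     // ... counting its instructions
    errs() << F.getName() << ": " << Count << " instructions\n";
    return PreservedAnalyses::all(); // nothing was transformed
  }
};
```

A transform pass has the same shape, but mutates the IR and must instead report which analyses it preserved (e.g., `PreservedAnalyses::none()`).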
Details:
- [LLVM’s Analysis and Transform Passes - LLVM](https://llvm.org/docs/Passes.html#introduction)

#### LLVM IR Interpreter: lli

Executes *LLVM bitcode* directly using a ==JIT compiler== or ==interpreter== (restricted to the host architecture & unable to emulate other architectures).

[lli - directly execute programs from LLVM bitcode - LLVM](https://llvm.org/docs/CommandGuide/lli.html)

:::info
Ex. Executing a C/C++ program with the LLVM interpreter.
```sh
# Compile C/C++ program to LLVM bitcode
clang -O3 -emit-llvm a.c -c -o a.bc
# Interpret & execute LLVM bitcode
lli a.bc
```
:::

#### LLVM Static Compiler: llc

Compiles LLVM source inputs into assembly language for a specified architecture. The assembly language output can then be passed through a native assembler and linker to generate a native executable.

[llc - LLVM static compiler - LLVM](https://llvm.org/docs/CommandGuide/llc.html)

### JIT/AOT Compiler - 2 Sides of the Coin

![image](https://hackmd.io/_uploads/HkIuPPciC.png =500x)
![image](https://hackmd.io/_uploads/SyxzsTJPR.png =500x)

:+1: [AOT vs. JIT Compilation in Java - César Soto Valero](https://www.cesarsotovalero.net/blog/aot-vs-jit-compilation-in-java.html)

JIT compilation makes programs cross-platform; it is known for "Write Once, Run Anywhere". JIT compilers **reduce latency** thanks to the ability to use *concurrent garbage collectors*, and they increase *resilience under peak throughput conditions*.

AOT compilers run programs more efficiently with **peak performance without warmup**, a **reduced attack surface**, **lower memory consumption**, **smaller size on disk & in container images** thanks to *dead code elimination (classes, fields, methods, branches)*, and **faster startup speed**.

> AOT compilation is particularly suited for *cloud applications* or *embedded*, resource-constrained environments.

#### Java Hotspot JVM Compiler

![image](https://hackmd.io/_uploads/H1V2yA1P0.png =500x)
![image](https://hackmd.io/_uploads/BJ7lHXYiR.png =500x)

- **C1 ByteCode Compiler (Fast, Lightweight):** Performs value numbering, inlining, and class analysis, featuring:
    - **Simple control-flow-graph (CFG) SSA "high-level" `IR`**
    - **Machine-oriented "low-level" `IR`**
    - **Linear scan register allocator**
    - **Template-style code generator**
- **C2 ByteCode Compiler (Highly Optimized):** Performs global value numbering, conditional constant type propagation, constant folding, global code motion, algebraic identities, method inlining (aggressive, optimistic, and/or multi-morphic), intrinsic replacement, loop transformations (unswitching, unrolling), and array range check elimination, featuring:
    - **Sea-of-nodes SSA & machine-specific `IR`**
    - **Graph-coloring register allocator over various kinds of locations (e.g., local, global, and argument registers and stack)**

:::success
**Sea of Nodes SSA Representation**

A graph representation of the static single-assignment (SSA) form of a program that **combines *data flow* and *control flow***, relaxing the control flow from a "total order" to a "partial order" and keeping only the orderings required by data flow, by **merging the traditional IR layers (CFG & Basic Blocks) into a *value [dependency graph](https://en.wikipedia.org/wiki/Dependency_graph) (VDG)***.

Features:
- **Not a Control Flow Graph (CFG):** A CFG imposes a fully-specified evaluation order, even on pure operations (e.g., integer addition) that don't need one.
- **Lack of ordering:** The evaluation order of most operations is relaxed.
- **Easier to reorder instructions:** The skeleton of a CFG remains, but a *global code motion algorithm* is required to convert the graph back into a *control flow graph (CFG)*.
- **Effect edges order stateful operations:** Resulting in better implementations of *dead code elimination*, *code motion*, and *constant propagation*.

> Very well suited for **Speculative Optimization** -- the JIT optimizing compiler.

![image](https://hackmd.io/_uploads/SkeUA1lw0.png =500x)

More:
- :+1: [Sea of Nodes - Fedor Indutny's Blog](https://darksi.de/d.sea-of-nodes/)
- :+1: [Sea-of-nodes 论文阅读笔记 - chuj](https://www.cjovi.icu/compilers/1654.html)
:::

:::info
**[[Paper] A parallel abstract interpreter for JavaScript - K Dewey, 2015](https://vineethk.github.io/files/Dewey2015.pdf)**

A parallel approach to data-flow analysis (DFA) with control-flow graphs (CFG) on the JavaScript language to boost analysis performance.

Features:
1. **Non-Obvious Control Flow:** JavaScript is a highly dynamic language, featuring *prototype-based inheritance*, *higher-order functions*, *implicitly applied type conversions*, *implicit exceptions*, *dynamic memory management of objects, variables, & functions with garbage collection*, *only 2 kinds of variable scope (global & function-internal)*, & *variable hoisting*.
2. **JavaScript’s Inherent Dynamism:** Means that high precision is important to get useful results, implying that any useful analysis will be expensive.

Solutions:
0. **Worklist-Parallel Strategy:** Process data-flow nodes independently & thread-safely until reaching some merge point specified by the merge strategy; the merged states are then inserted back into the global worklist.
1. **Per-Context Parallel Strategy:** To reduce node-ordering issues, limit synchronization between independent parts of the analysis, and increase the granularity of the thread computations, ==context-sensitive analysis== is a suitable option: it clones functions based on some notion of abstract calling context. Each clone is specialized to a particular context and, most importantly, analyzed separately.

![image](https://hackmd.io/_uploads/B1Bvk7gcA.png =400x)

> We observe that many contexts are typically enqueued for processing simultaneously, highlighting the promise of a **per-context strategy** for analyzing JavaScript. Instead of a global worklist, we use one worklist per context, each managed by its own thread. Each worklist has a non-blocking, lock-free backlog queue for other threads to enqueue work. When a thread exhausts its worklist, it transfers its backlog queue to the worklist and continues. If a thread processes a function call for a new context, it enqueues the abstract state into that context’s backlog queue. The analysis reaches a fixpoint when all worklists and backlog queues are empty.
:::

#### V8 TurboFan JIT Optimizing Compiler

Optimizes code for *cases seen in the past* (gathered into the type feedback vector by the interpreter).

- **TurboFan Overview**
![image](https://hackmd.io/_uploads/H1Ow_dCj0.png =500x)
![image](https://hackmd.io/_uploads/SJ1Pv_RjC.png =500x)

:::info
**How to apply the "JIT" Optimization?**

Using *On-Stack Replacement (OSR)*, we can replace the top of the current call stack with one in which the frames have been merged together. OSR takes the contexts of some contiguous range of function applications and the variables that are live in those activations, transforms those inputs into a new context or contexts and live variable set, and splats the new activations onto the stack.
![image](https://hackmd.io/_uploads/rkJApuRjA.png =300x)

Refs:
- [on-stack replacement in v8 - wingolog](https://wingolog.org/archives/2011/06/20/on-stack-replacement-in-v8)
:::

- **TurboFan JIT Optimization Pipeline**
    :+1: [Node.js Under The Hood #7 - The new V8 - Lucas Santos](https://dev.to/_staticvoid/node-js-under-the-hood-7-the-new-v8-4gd6)
    ![image](https://hackmd.io/_uploads/H1Nc9OAsA.png =500x)
    - Example:
```javascript=
// Source Input
function f(a, b, c) {
    var d = c - 100;
    return a + d * b;
}
```
```javascript=
// Bytecode Output after "Ignition"

// Load constant 100 into the accumulator (Smi is Small Integer)
LdaSmi #100
// Subtract the constant loaded from the a2 parameter (i.e. c)
// & store into the accumulator
Sub a2
// Store the value in the accumulator into r0
Star r0
// Read the value of the a1 parameter (b)
// & store into the accumulator
Ldar a1
// Multiply r0 by the accumulator
// & store the result also in the accumulator
Mul r0
// Add the first parameter a0 (a) into the accumulator
// & store the result in the accumulator
Add a0
// Return
Return
```
    - **Type & Range Analysis**
    ![image](https://hackmd.io/_uploads/r1e4IJeP0.png =500x)
    - **Code Reduction**
    ![image](https://hackmd.io/_uploads/BJhRSyxDR.png =500x)
    - **Typed Lowering Reduction**
    ![image](https://hackmd.io/_uploads/H1WcrkeDA.png =500x)

Refs:
- [[Slides] Deoptimization in V8 - Jaroslav Sevcik](https://docs.google.com/presentation/d/1Z6oCocRASCfTqGq1GCo1jbULDGS-w-nzxkbVF7Up0u0/edit?usp=sharing)
- [[Slides] TurboFan: A new code generation architecture for V8 - Benedikt Meurer](https://docs.google.com/presentation/d/1_eLlVzcj94_G4r9j9d_Lj5HRKFnq6jgpuPJtnmIBs88/edit?usp=sharing)
- [[Slides] TurboFan TechTalk Presentation - Ben L. Titzer](https://docs.google.com/presentation/d/1sOEF4MlF7LeO7uq-uThJSulJlTh--wgLeaVibsbb3tc/edit?usp=sharing)
- [從編譯器優化角度初探 Javascript的V8 引擎 - maxchiu](https://tech-blog.cymetrics.io/posts/maxchiu/turbofan/)

#### CPython JIT Optimizing Compiler

[【python】Faster CPython的重要力量 — Specialized Instruction - 码农高天](https://www.youtube.com/watch?v=SNXZPZA8PY8)

### Compiler Optimization

#### Static Code Analysis

- **Data-Flow Analysis (DFA):** Analyzes & streamlines value ranges, logic conditions, constraints, variable types, & dead code at compile time.
    - **Reaching Definitions Analysis:** Determines if a definition "reaches" a particular use of the *variable* or *expression*, or can be safely optimized or eliminated.
    - **Live Variable Analysis:** Determines if a variable or expression is "live", meaning that its value is still needed for some future computation, or can be safely optimized or eliminated.
    - **Available Expressions Analysis:** Determines if a particular expression is "available", meaning that its value has already been computed and can be reused (e.g., common subexpression elimination).
    - **Constant Propagation Analysis:** Tracks the values of constants and determines the points in the program where a particular constant value is used (e.g., constant folding).
- **Control-Flow Analysis:** Analyzes the goto logic of the program (e.g., control dependencies, loops).
- **Memory Dependency Analysis:** Analyzes the memory access patterns (e.g., instructions, array items).
- ==**Points-To Analysis (Pointer Analysis):**== All compiler analyses and optimizations are limited by the potential effects of assignments through pointers and references (see the sketch below).
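A minimal C sketch of that limitation (`alias_demo` is a hypothetical name): unless points-to analysis proves `p` can never point to `x`, the compiler must reload `x` instead of folding the return value into a constant.

```cpp
int x = 1; // global: its address may escape, so pointers can alias it

int alias_demo(int *p)
{
    x = 1;
    *p = 2;   // may or may not store into x -- what does p point to?
    return x; // foldable to 1 only if analysis proves p != &x
}
```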
More:
- :+1: [Lecture Notes: Pointer Analysis - Claire Le Goues - cmu.edu](https://www.cs.cmu.edu/~clegoues/courses/15-819O-16sp/notes/notes07-pointers.pdf)
- [[Book] Pointer Analysis Tutorial - Yannis Smaragdakis - University of Athens](https://yanniss.github.io/points-to-tutorial15.pdf)

:::success
**Procedure & Code Placement (for Spatial Locality):**

Many optimizations aim to reduce the number of executed instructions. Another crucial type concerns memory usage: **programs are often paged in *virtual memory* and exceed the *I-cache size***. Thus, how *procedures* and *Basic Blocks* are positioned in memory is critical, due to costly ==page faults== and ==I-cache misses==.
:::

:::info
[**[Paper] Profile Guided Code Positioning - K Pettis, 1990**](https://pages.cs.wisc.edu/~fischer/cs701.f05/code.positioning.pdf)

> Tips: Concentrate the execution of hot spots in the code to improve cache hits!

Demonstrates strategies for *Code Placement Optimization*:

1. **Procedure Positioning:** Try to keep procedures that often call each other as close together in memory as possible. Ex.
```cpp=
/**
 * We can use this trick by obtaining dynamic call
 * frequencies using a profiling tool like gprof or qpt,
 * then create a call graph annotated with call frequencies:
 *         ┌───────┐
 *    ┌────┤   A   ├────┐
 *   4│    └───────┘    │10
 * ┌───┴───┐        ┌───┴───┐
 * │   B   ├────────┤   C   │
 * └───┬───┘   3    └───┬───┘
 *   8│                 │2
 * ┌───┴───┐        ┌───┴───┐
 * │   D   ├────────┤   E   │
 * └───────┘   1    └───────┘
 *
 * Then group procedures by call frequency:
 * ┌───────┐        ┌───────┐
 * │ [B,D] ├────────┤ [A,C] │
 * └───┬───┘  3,4   └───┬───┘
 *     │                │2
 *     │            ┌───┴───┐
 *     └────────────┤   E   │
 *            1     └───────┘
 *
 * Here we need to merge [B,D] & [A,C], but which nodes should
 * be adjacent? It turns out (A,B) has the max freq: 4.
 * ┌───────────┐        ┌───────┐
 * │ [D,B,A,C] ├────────┤   E   │
 * └───────────┘  1,2   └───────┘
 *
 * Finally, (E,C) is greater than (E,D), so the result is:
 * ┌─────────────┐
 * │ [D,B,A,C,E] │
 * └─────────────┘
 */
```
2. **Basic Block Positioning:** Try to place the most frequently executed series of basic blocks "in sequence". Ex.
```cpp=
/**
 * Since the error_handler is rarely executed, arranging
 * these basic blocks naively in source order leads to
 * frequent jumps over the cold block during normal execution.
 */
if (is_error) {
    error_handler;
}
...
```
3. **Procedure Splitting:** Place infrequently executed "fluff" in a different memory area than heavily executed code.
:::

:::info
:+1: [**[Paper] Fast and Accurate Flow-Insensitive Points-To Analysis - M Shapiro, 1997**](https://www.cs.utexas.edu/users/pingali/CS380C/2007fa/papers/popl97.pdf)

Addresses techniques for flow/context-insensitive ==interprocedural analysis== of stack-based storage; **measures the precision & space/time complexity of different algorithms such as Andersen's, Steensgaard's, & fine-tuned versions of those two**.

![image](https://hackmd.io/_uploads/rk-UUI-UR.png =500x)
:::

## Optimization of Basic Blocks

Uses `IR` to analyze code & streamline the program's logic.
This includes:
- **Structure-Preserving Transformations**
    - Dead Code Elimination
    - Common Subexpression Elimination
    - Renaming of Temporary Variables
    - Interchanging Independent Adjacent Statements
- **Algebraic Transformations**
    - Constant Folding
    - Copy Propagation
    - Variable Propagation
    - Constant Propagation
    - Strength Reduction
- **Loop Optimization**
    - Code Motion & Frequency Reduction
    - Induction Variable Elimination
    - Loop Merging/Combining
    - Loop Unrolling

[Optimization of Basic Blocks - GeeksforGeeks](https://www.geeksforgeeks.org/optimization-of-basic-blocks/)

```sh
# All IR optimization process output
gcc -c -fdump-tree-all a.c
# Each IR optimization process output
gcc -c -fdump-tree-cfg=cfg.txt a.c
gcc -c -fdump-tree-gimple=gimple.txt a.c
...
```

### Gimple IR

In GCC, generates three-address `IR` (Ex. *t1 = A * B*) from the `AST`.

[GCC Internal Representations - D Novillo, 2007](https://www.airs.com/dnovillo/200711-GCC-Internals/200711-GCC-Internals-3-IR.pdf)

Ex.
```gimple=
f (a)
{
    unsigned int i.0;
    char * i.1;
    char * D.1140;
    int D.1141;
    ...
    goto <D1136>;
    <D1135>:
    ...
    D.1140 = a + i.1;
    D.1141 = g * i;
    ...
    <D1136>:
    if (i < n) {goto <D1135>;}
    ...
}
```

### Tree SSA Passes

[Tree SSA passes - Online Docs - GNU GCC](https://gcc.gnu.org/onlinedocs/gccint/Tree-SSA-passes.html)

Ex. Compiling C code:
```cpp=
int main()
{
    int a = 10, b = 20, c = 3000;
    int i, result;
    for(i = 0; i < 10000; i++) {
        result = (a * b + i) / (a * c);
    }
    return result;
}
```

#### 1. Dividing into Basic Blocks

```cpp=
int main()
{
<bb 0> :
    a = 10, b = 20, c = 3000, result = UNDEFINED;
<bb 1> :
    i = 0;
<bb 2> :
    if (i < 10000)
        goto <bb 3>;
    else
        goto <bb 5>;
<bb 3> :
    result = (a * b + i) / (a * c);
<bb 4> :
    i = i + 1;
    goto <bb 2>;
<bb 5> :
    return result;
}
```

#### 2. Optimizing Basic Blocks

#### 2-1. Merging Excessively Divided Basic Blocks (No Goto Reference)

```cpp=
int main()
{
<bb 0> :
    a = 10, b = 20, c = 3000, result = UNDEFINED, i = 0;
<bb 2> :
    if (i < 10000)
        goto <bb 3>;
    else
        goto <bb 5>;
<bb 3> :
    result = (a * b + i) / (a * c);
    i = i + 1;
    goto <bb 2>;
<bb 5> :
    return result;
}
```

#### 2-2. Reducing Branches if Possible

```cpp=
int main()
{
<bb 0> :
    a = 10, b = 20, c = 3000, result = UNDEFINED, i = 0;
    goto <bb 2>;
<bb 3> :
    result = (a * b + i) / (a * c);
    i = i + 1;
    // No need to use goto, because it is next to it.
    //goto <bb 2>;
<bb 2> :
    if (i < 10000)
        goto <bb 3>;
    else
        goto <bb 5>;
<bb 5> :
    return result;
}
```

#### 3. Turning into 3-Address Code

```cpp=
int main()
{
<bb 0> :
    a = 10, b = 20, c = 3000, result = UNDEFINED, i = 0;
    goto <bb 2>;
<bb 3> :
    //result = (a * b + i) / (a * c);
    t0 = a * b;
    t1 = t0 + i;
    t2 = a * c;
    result = t1 / t2;
    i = i + 1;
<bb 2> :
    if (i < 10000)
        goto <bb 3>;
    else
        goto <bb 5>;
<bb 5> :
    return result;
}
```

#### 4. Static Single Assignment (SSA) Form

Whenever a variable is assigned, give that variable a new version, then analyze the relationships between the different versions of the variables.

[SSA Techs - GCC & GNU Toolchain Developers' Summit 2004](http://www.naturalbridge.com/GCC2004Summit.pdf)

![image](https://hackmd.io/_uploads/r1ceYXXS0.png =x200)

#### 4-1. Turning into SSA Form

```cpp=
int main()
{
<bb 0> :
    a0 = 10, b0 = 20, c0 = 3000, result0 = UNDEFINED, i0 = 0;
    goto <bb 2>;
<bb 3> :
    t0 = a0 * b0;
    t1 = t0 + i1;
    t2 = a0 * c0;
    result1 = t1 / t2;
    i2 = i1 + 1;
<bb 2> :
    i1 = Φ (i0, i2); // value assigned from i0 or i2
    if (i1 < 10000)
        goto <bb 3>;
    else
        goto <bb 5>;
<bb 5> :
    result2 = Φ (result0, result1);
    return result2;
}
```
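As a quick illustration of where a Φ node comes from (`phi_demo` is a hypothetical name), a variable assigned on two branches must be merged at the join point:

```cpp
// Both branches assign x, so the join block needs x3 = Φ(x1, x2).
// GCC's SSA form can be dumped with: gcc -c -fdump-tree-ssa=ssa.txt phi.c
int phi_demo(int cond, int a, int b)
{
    int x;         // x0: not yet defined
    if (cond)
        x = a + 1; // x1
    else
        x = b + 2; // x2
    return x * 3;  // join point: x3 = Φ(x1, x2)
}
```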
#### 4-2. Constant Propagation & Constant Folding

Turn deterministic & immutable variables into constants.

```cpp=
int main()
{
<bb 0> :
    a0 = 10, b0 = 20, c0 = 3000, result0 = UNDEFINED, i0 = 0;
    goto <bb 2>;
<bb 3> :
    // No one else uses a0, b0, c0 -- change them to constants
    t0 = 10 * 20;         // a0 * b0
    t1 = 200 + i1;        // t0 + i1
    t2 = 10 * 3000;       // a0 * c0
    result1 = t1 / 30000; // t1 / t2
    i2 = i1 + 1;
<bb 2> :
    i1 = Φ (0, i2); // Φ (i0, i2)
    if (i1 < 10000)
        goto <bb 3>;
    else
        goto <bb 5>;
<bb 5> :
    result2 = Φ (UNDEFINED, result1); // Φ (result0, result1)
    return result2;
}
```

#### 4-3. Dead Code & Induction Variable Elimination

After turning variables into constants, some redundant assignments & logic can be merged or eliminated.

```cpp=
int main()
{
<bb 0> :
    //a0 = 10, b0 = 20, c0 = 3000, result0 = UNDEFINED, i0 = 0;
    goto <bb 2>;
<bb 3> :
    //t0 = 200;
    t1 = 200 + i1;
    //t2 = 30000;
    result1 = t1 / 30000;
    i2 = i1 + 1;
<bb 2> :
    i1 = Φ (i0, i2);
    if (i1 < 10000)
        goto <bb 3>;
    else
        goto <bb 5>;
<bb 5> :
    result2 = Φ (result0, result1);
    return result2;
}
```

#### 4-4. Value Range Propagation

Peek at and assess each remaining variable's value range ahead of actual execution. This may yield deterministic results based on the mathematical equations involving those variables.

```cpp=
/**
 * i1 = [0, 10000]
 * i2 = [0, 10000]
 *
 * t1 = [200, 10200]
 * result1 = [200 / 30000, 10200 / 30000]
 * result2 = 0, [200 / 30000, 10200 / 30000]
 * result1 = 0
 * result2 = 0
 */
int main()
{
<bb 0> :
    goto <bb 2>;
<bb 3> :
    t1 = 200 + i1;
    result1 = 0; // t1 / 30000
    i2 = i1 + 1;
<bb 2> :
    i1 = Φ (i0, i2);
    if (i1 < 10000)
        goto <bb 3>;
    else
        goto <bb 5>;
<bb 5> :
    result2 = 0; // Φ (result0, result1) -> Φ (UNDEFINED, 0)
    return 0;
}
```

#### 4-5. Control Flow Optimization

#### 4-5-1. Branch Forwarding

![image](https://hackmd.io/_uploads/ryogQ0rBA.png =500x)

:::info
**An on-trend research topic.** Problems that need to be solved:
- Expensive
- May not be correct
- Ends up with a lot of extra copy statements
- Loses all information attached to SSA variables
    - Ranges, value numbers, aliasing information
:::

#### 4-5-2. Loop Optimization

Employ multiple strategies to enhance loop performance, including leveraging cache optimization and parallel processing capabilities.

- **Loop-Invariant Code Motion:** Move instructions that do not depend on the loop out of the loop.
- **Loop Unrolling:** Unwrap the loop to get rid of the branch control if the cost is relatively low.
- **Loop Jamming:** Bundle separate loops into one.
- **Loop Fission:** Separate different array operations out of the same loop. This improves locality of reference.
    > The CPU concentrates on processing one array at a time, in a single context.
- **Loop Interchange:** Adjust the access pattern of multi-level arrays to fit the best locality of reference in memory & cache.

More:
- [Code Motion Trick in Python - b001](https://www.youtube.com/shorts/uWEIaF0PNGg)

:::warning
**Loop-Invariant Code Motion & the Pointer Aliasing Problem: Will the optimization lead to unintended behaviour?**

Example:
```cpp=
void func(int *pa, int *pb, int *pc, int d)
{
    while (*pc < d) {
        int i = *pa + *pb; // Can this line be hoisted out of the loop?
        *pc += i;
    }
}
```

It seems that `*pa`, `*pb`, and `i` are independent of the operation in the loop, so the computation could be hoisted out of the loop -- but what if `pc` aliases `pa` or `pb`?
> They become dependent!

To guarantee these are independent pointers:
- **[Choice 1]** Rearrange the program logic ourselves.
- **[Choice 2]** Use "Restricted Pointer Aliasing": the `__restrict__` qualifier in GCC.
:::

#### 5. Result
```sh=
gcc -c -fdump-tree-optimized=cfg.txt a.c
```
```cpp=
int main()
{
    return 0;
}
```

## Optimization of Register Transfer Language (RTL)

Features:
- Uses *S-Expressions* (as in `LISP`)
- Implements GCC built-in & architecture-defined operations
- Peephole optimization

:::success
**Peephole/Window Optimization:** Scan over the current `RTL` logic, finding whether any adjacent `RTL` instructions can be bundled & optimized.

> Find the best operations for the equivalent logic.

![image](https://hackmd.io/_uploads/BJX3APvSR.png =300x)

Ex.
- **Remove redundant logic**
    - Remove `push $rax; pop $rax`
    - From `x = 5; y = x; z = y` to `z = 5`
- **Optimize arithmetic operations**
    - From multiplying a floating-point register by 8, to adding 3 to the floating-point register's exponent.
- **Replace slow instructions with faster ones**
    - From `mul $rax 2` to `add $rax $rax`
    - From `cmp $eax 0` to `test $eax $eax`
:::

### RTL Passes

[RTL Generation and Optimization Passes Overview - Online Docs - GNU GCC](https://gcc.gnu.org/onlinedocs/gccint/RTL-passes.html)

#### Control Flow Graph Cleanup (used to be the Jump Optimization Pass)

Removes unreachable code, and simplifies jumps to next, jumps to jump, jumps across jumps, ...

#### Forward Propagation of Single-Def Values

Removes redundant computation & simplifies the results by substituting variables that come from a single definition, performing *Copy Propagation* and *Optimal Addressing Mode Selection*.

- **Global Constant Propagation:** Propagating constant values across the code to reduce unnecessary computations.
- **Copy Propagation:** Replacing redundant copies with references to the original source.
- **Addressing Mode Selection (AMS):** Selecting the optimal addressing mode for memory operands in the input program.

More:
- [Addressing mode selection - E. Eckstein, 2003](https://dl.acm.org/doi/abs/10.5555/776261.776298)

#### Common Subexpression Elimination (CSE)

Removes redundant computation within basic blocks, and optimizes addressing modes based on cost.

#### Global CSE (GCSE) & Partial Redundancy Elimination (PRE)

![image](https://hackmd.io/_uploads/ryRVzP6HC.png =300x)

:::success
**Methods:**
- **Classic GCSE:** Will only eliminate the computation of "exp1" in block B3, since the evaluation in block B6 can be reached via a path which does not have an earlier computation of "exp1" (e.g., `entry -> B1 -> B7 -> B8 -> B5 -> B6`). **Smaller code size compared to *PRE*.**
- **PRE:** Will eliminate the evaluations of "exp1" in blocks B3 & B6; to make the evaluation in B6 redundant, PRE will insert an evaluation of "exp1" in block B8. **A more powerful optimization that will tend to increase code size to improve runtime performance.**

**Tools:**
- **Lazy Code Motion Optimizer Framework (LCM) based GCSE:** May increase code size for a gain in speed.
    - Performs loop-invariant code motion (moving invariant computations outside loops).
    - Includes load and store motion optimizations to improve performance.
- **Morel-Renvoise Partial Redundancy Elimination (MR PRE):** Saves relatively more code size.
    - Does not move invariants out of loops; this task is handled separately by the *Loop Optimization* pass.
    - Includes code hoisting (unification) and load motion as part of the process.
:::

> Global Constant Propagation & Copy Propagation & GCSE are enabled by default with the `-O2` optimization flag or with the `-fgcse` flag.
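A minimal C sketch of a partial redundancy (`pre_demo` is a hypothetical name): `a * b` is evaluated twice along the if-path but only once along the else-path, so PRE inserts an evaluation on the else-edge to make the later one fully redundant on every path.

```cpp
int pre_demo(int cond, int a, int b)
{
    int x = 0;
    if (cond)
        x = a * b;    // "exp1" computed on this path only
    // PRE inserts t = a * b on the else-edge, so that ...
    return x + a * b; // ... this evaluation can reuse t on every path
}
```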
Refs:
- [Global CSE/Partial Redundancy Elimination - News - GNU GCC](https://gcc.gnu.org/news/gcse.html)
- [Lazy Code Motion Optimizer Framework - News - GNU GCC](https://gcc.gnu.org/news/lcm.html)

#### Loop Optimization

Performs several loop-related optimizations.

> The source files cfgloopanal.cc and cfgloopmanip.cc contain generic loop analysis and manipulation code. Initialization and finalization of loop structures is handled by loop-init.cc. A loop invariant motion pass is implemented in loop-invariant.cc. Basic block level optimizations—unrolling and peeling loops—are implemented in loop-unroll.cc. Replacing of the exit condition of loops by special machine-dependent instructions is handled by loop-doloop.cc.

#### Jump Bypassing

An aggressive form of *GCSE* that transforms the control flow graph of a function by propagating constants into conditional branch instructions.

#### If Conversion

Attempts to replace conditional branches and surrounding assignments with arithmetic, boolean-value-producing comparison instructions, and conditional move instructions.

> In the very last invocation after reload/LRA, it will generate predicated instructions when supported by the target.

#### Web Construction

Splits independent uses of each pseudo-register. This can improve the effect of the other transformations, such as CSE or register allocation.

#### Instruction Combination

Attempts to combine groups of two or three instructions that are related by data flow into single instructions. It combines the RTL expressions for the instructions by substitution, simplifies the result using algebra, and then attempts to match the result against the machine description.

#### Mode Switching Optimization

Looks for instructions that require the processor to be in a specific "mode" and minimizes the number of mode changes required to satisfy all users. What these modes are, and what they apply to, are completely target-specific.

#### Modulo Scheduling

Looks at innermost loops and reorders their instructions by overlapping different iterations. Modulo scheduling is performed immediately before instruction scheduling.

#### Instruction Scheduling

Applies scheduling strategies to mitigate the effects of **instruction stalling** caused by resource dependencies on the CPU.

![image](https://hackmd.io/_uploads/Sygez36BR.png)

> This pass is performed before and after *Register Allocation*. The code for this pass is located in haifa-sched.cc, sched-deps.cc, sched-ebb.cc, sched-rgn.cc & sched-vis.c.

#### Register Allocation

Makes sure that all occurrences of pseudo registers are eliminated, either by allocating them to a hard register, replacing them with an equivalent expression (e.g., a constant), or by placing them on the stack.

Features:
- Register allocation makes scheduling harder by creating extra dependencies between instructions.
- Less aggressive register allocation may be desirable.
- Some processors allocate and schedule dynamically.

:::success
**Integrated Register Allocator (IRA):** Integrates:
- Coalescing (register aliasing)
- Live-range splitting (when two registers' lifetimes conflict)
- Spill/restore code motion (to/from the stack)
- Hard register preferencing

into the ==coloring process==. This approach improves efficiency and compatibility with the *Reload/LRA Pass*. Pseudo-registers spilled during allocation or reload/LRA can reclaim hard registers if others are evicted.
**The allocator prioritizes optimal pseudos for spilling based on their live ranges and consolidates stack slots for spilled pseudos.**

*IRA* acts as a regional register allocator, switching to a ==Chaitin-Briggs allocator== (*a bottom-up graph-coloring register allocation algorithm*) within a single region.

![image](https://hackmd.io/_uploads/SyOunWCHA.png =400x)

> By default, IRA selects regions based on register pressure, but users can enforce single-region or loop-specific allocations.

**Reload Pass:** Transforms non-strict `RTL IR` into strict `RTL IR` by imposing instruction constraints and renumbering pseudo registers with target-machine register numbers. **Pseudo registers without allocated hard registers are substituted with stack slots.**

Then it finds instructions that are invalid because a value has failed to end up in a register, or has ended up in a register of the wrong kind. It fixes up these instructions by reloading the problematical values temporarily into registers. Additional instructions are generated to do the copying.

The reload pass also optionally eliminates the frame pointer and inserts instructions to save and restore call-clobbered registers around calls.

> **"Reload is the GCC equivalent of Satan."**
>
> Reload does everything, and probably no one exactly knows how much that is. But to give you some idea:
> - Spill code generation
> - Instruction/register constraint validation
> - Constant pool building
> - Turning non-strict RTL into strict RTL (doing more of the above in evil ways)
> - Register elimination--changing frame pointer references to stack pointer references
> - Reload inheritance--essentially a builtin CSE pass on spill code
>
> -- [Reload - GCC Wiki](https://gcc.gnu.org/wiki/reload)

**Local Register Allocation (LRA):** A modern replacement for the reload pass. Unlike the reload pass, intermediate *LRA* decisions are reflected in RTL as much as possible (doing more work on separating each module to reduce the number of target-dependent macros and hooks), leaving instruction constraints as the primary source of control.

[[Slides] Local Register Allocator Project - Vladimir Makarov](https://vmakarov.fedorapeople.org/LRA.html)

![image](https://hackmd.io/_uploads/BksMYZCrC.png =500x)

- **Hard Register:** A target machine's register.
- **Virtual Register:** Argument pointer, frame pointer, stack pointer, & other specialized registers (if not available, use [$SP + offset] instead).
- **Pseudo Register:** An unbounded number of pseudo registers can be manipulated by the compiler.
:::

Refs:
- [The Integrated Register Allocator Ira For Gcc Doc - KITO's Lab](http://kito.wikidot.com/the-integrated-register-allocator-ira-for-gcc-doc)
- [Register Allocation in GCC : Reload / LRA - Kito's Lab](https://kitoslab.blogspot.com/2012/01/register-allocation-in-gcc-reload-lra.html)
- [[Slides] Design and Implementation of GCC Register Allocation - Kito Cheng](https://docs.huihoo.com/infoq/hellogcc-design-and-Implementation-of-gcc-register-allocation-20140916.pdf)

:::info
[**[Paper] On Local Register Allocation - M Farach-Colton, 2000**](https://dl.acm.org/doi/pdf/10.5555/314613.314848)

Demonstrates:
- The *Local Register Allocation* problem is **NP-hard**.
- A variant of the furthest-first heuristic achieves a good approximation ratio.
- A **2-approximation algorithm** for LRA.
:::

:::info
[**[Paper] Integrated Register Allocation and Instruction Scheduling with Constraint Programming - RC Lozano, 2014**](https://robcasloz.github.io/publications/TRITA-ICT-ECS-AVH-14-13.pdf)

The goal is to devise a combinatorial approach for integrated register allocation and instruction scheduling that is practical and hence usable in a modern compiler. Such an approach must capture all main subtasks of register allocation and instruction scheduling while scaling to programs of realistic size. Additionally, to be practical a combinatorial approach must be robust; that is, it must deliver consistent code quality and compilation times across a significant range of programs.

- Traditional Code Gen:
![image](https://hackmd.io/_uploads/SyHVOvtSR.png)
- Combinatorial Code Gen:
![image](https://hackmd.io/_uploads/r1LwuDFrA.png)
- Constraint-Based Code Gen:
![image](https://hackmd.io/_uploads/r19wwvYBC.png)
:::

#### Basic Block Reordering

Implements profile-guided code positioning. If profile information is not available, various types of static analysis are performed to make the predictions that would normally come from profile feedback (i.e., execution frequency, branch probability, etc.).

#### Variable Tracking

Computes where the variables are stored at each position in the code, and attaches notes describing those variable locations to the RTL code. The location lists are then generated from these notes into the debug information, if the debugging information format supports location lists.

#### Delayed Branch Scheduling (Optional)

Attempts to find instructions that can go into the delay slots of other instructions, usually jumps and calls.

#### Branch Shortening

On RISC machines, **branch instructions often have restricted ranges**, necessitating longer instruction sequences for longer branches. This compiler pass calculates the distance between instructions to determine whether standard branch instructions or longer sequences are required for each branch.

#### Register-To-Stack Conversion

Conversion from the usage of some hard registers to the usage of a register stack may be done at this point.

:::warning
Currently, this is supported only for the floating-point registers of the Intel 80387 coprocessor.
:::

#### Final

Emits the target assembly code for the function.

> How do I know which instructions should be generated? How can I support an extensive set of instructions?

:::success
**RTL Template:** Used to define which insns match a particular pattern and how to find their operands. For named patterns, the RTL template also says how to construct an insn from specified operands.

Ex.
- source.RTL_IR
```lisp
; b = a - 56
(set (reg:SI 60 [b])
     (plus:SI (reg:SI 61 [a])
              (const_int -56 [0xffffffc8])))
```
- MIPS.md (machine definition file)
```lisp
(define_insn "addsi3" ; matching pattern
  [(set (match_operand:SI 0 "general_operand" "=r,m")
        (plus:SI (match_operand:SI 1 "general_operand" "0,0") ; addu
                 (match_operand:SI 2 "general_operand" "g,r") ; addiu
        ))]
  "TARGET_MIPS16"
  "@
   addu %0, %1, %2
   addiu %0, %1, %2"
)
```

Refs:
- [RTL Template - GNU GCC](https://gcc.gnu.org/onlinedocs/gccint/RTL-Template.html)
- [Target assembly code emission - GCC Resource Center](https://www.cse.iitb.ac.in/grc/intdocs/gcc-implementation-details.html#Target-assembly-code-emission)
:::

#### Debugging Information Output

Runs after final because it must output the stack slot offsets for pseudo registers that did not get hard registers.
> Source files are dwarfout.c for the DWARF symbol table format, dwarf2out.cc and dwarf2asm.cc for the DWARF2 symbol table format, and vmsdbgout.cc for the VMS debug symbol table format.

:::warning
*DWARF* is a debugging information file format used by **GDB**.
:::

## Compiler Support

### Built-In Functions

Support provided by the compiler elevates the compiled code, such as:
- Adorning the `IR` with fine-grained labels.
- Providing miscellaneous control options for the programmer to fine-tune their programs (spanning **OS support**, **machine restrictions**, **machine acceleration**, **compilation policy**, **optimization options**, **security enhancement**, **code specialization**, **customization rules**, ...).

Ex.
- Turn **`memcpy()`** into:
    - **`__builtin_memcpy_inline()`**: Enabling inline expansion or loop unrolling.
    - **`__builtin_memcpy_align()`**: Enabling specific hardware features or alignment requirements.
    - **`__builtin_memcpy_vectorized()`**: Enabling vector instructions for bulk memory operations.
        > *Instruction support*, *data bus bandwidth*, & *memory management in the OS* all matter.
    - **`__builtin_memcpy_check_bounds()`**: Enabling buffer size checks for secure operations.
- **Lambda function closure issues:** [【python】这个十多年的bug,没点黑魔法还真解决不了 - 码农高天](https://www.youtube.com/watch?v=Mp_f1sBckjU)
- **Intrinsic functions:** **`_mm_cmpistrc()`** wraps **CISC SIMD's SSE**, a machine-acceleration instruction, for developer use.
    > [x86 intrinsics list - Microsoft Learn](https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list?view=msvc-170)
- **`__builtin_expect()`** provides the **`likely()`** & **`unlikely()`** options in branch controls.
- Degenerating **`printf(char[])`** calls without format specifiers (e.g., *%s*, *%d*, *%.2f* ...) into **`puts(char[])`**.

:::warning
**Sometimes the developer just wants to use the function directly, without any optimization & transformation.** Then compiler options must be tuned very carefully for such special development scenarios.

Ex. In embedded-system development, if the device doesn't implement the "puts()" function, issues may arise when "printf()" is replaced with "puts()" by the default compiler behavior. To prevent optimization of a specific built-in function, do:
```
gcc -fno-builtin-func_name ...
```
:::

### Whole-Program Modularization & Optimization

The program is generally divided into several procedures to make the job of analyzing, debugging, etc. easier.

> ***Local Optimization*** refers to optimization within a *Basic Block*; ***Regional Optimization*** refers to optimization within an *Extended Basic Block*; ***Global Optimization*** refers to the optimization of a *whole procedure*; ***Interprocedural Optimization*** refers to the optimization of *whole programs*, which includes different procedures.

**Pros:**
- Limits the amount of code that the compiler considers at any particular time.
- Keeps the compile-time data structures small and limits the cost of various compile-time algorithms.

**Cons:**
- Prevents the compiler from understanding what happens inside a call.
- Requires frequent jumps.

:::warning
The same work can be applied in code bundlers like *Webpack Optimization*.
:::

#### Compilation Unit

Each `*.c` file is a compilation unit treated as an independent unit on which the compiler can perform abstraction, encapsulation, optimization, variable visibility (Ex. `static`, `local`, `global`), and source-tree tracking.

:::warning
This is the point where a *Makefile* can do parallel compilation by processing independent modules.
```sh
make -j8 ...
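# each compilation unit builds independently, so -j8 lets make
# run up to 8 compile jobs in parallel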
#### Incremental Compilation
Incremental compilation minimizes recompilation by recompiling only the parts of the code that have changed, while retaining unchanged sections. [Incremental Compilation - Rust Blog](https://blog.rust-lang.org/2016/09/08/incremental.html)
![image](https://hackmd.io/_uploads/S1ZSuIYSR.png =500x)

#### Interprocedural Optimization (IPO)
Separate procedures introduce inefficiencies that cannot be removed by *Global Optimization*, which is an *intraprocedural* optimization. To reduce them, the compiler may analyze and transform multiple procedures together; *Interprocedural Optimization* helps achieve this.

:::success
**Inline Substitution:** The compiler can improve efficiency by replacing a call site with a copy of the callee's body, e.g., via *Link-Time Optimization*. This allows the compiler to avoid most of the procedure linkage code and streamlines the program's call graph.

Ex. Inline the shared object's function
```cpp=
int isOdd(int num) {
    return (num & 0x1);
}
```

More:
- [Link-time optimization for the kernel - LWN](https://lwn.net/Articles/512882/)

**Points-To Analysis:** A compile-time technique that identifies the relationships between pointer variables and the memory locations they point to during program execution.
:::

### Cross-Platform Parallel Computing

#### OpenMP (Open Multi-Processing)
An API standard that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran.
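A minimal OpenMP sketch (the array size and per-element work are arbitrary illustration choices); compiled with `gcc -fopenmp`, the loop below is distributed across threads:
```cpp=
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* Distribute loop iterations across threads; "reduction" gives
     * each thread a private partial sum, combined at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```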
#### C++ AMP *(Deprecated)*
C++ Accelerated Massive Parallelism (C++ AMP) accelerates the execution of C++ code by taking advantage of data-parallel hardware such as a graphics processing unit (GPU) on a discrete graphics card. [C++ AMP Overview - Microsoft](https://learn.microsoft.com/en-us/cpp/parallel/amp/cpp-amp-overview?view=msvc-160)

### Cross Compilation
To build, on one platform, a binary that will run on another platform.
![image](https://hackmd.io/_uploads/B1yVh0-IC.png)

:::warning
**For example, developing embedded systems for the ARM architecture on Intel x86 machines.** *Cross compilation* enables efficient and effective deployment of software onto embedded devices, which are often resource-constrained & have specialized processors or operating systems.
> **It usually requires a "cross-compiler toolchain" to complete the whole task:** runtime support (& perhaps a different linker script) must be taken into account.
:::

### Runtime Support
Maps program elements like "procedure names" and "identifiers" to their actual memory locations at runtime. This is the so-called **dynamic state of the target machine**, which includes *software libraries* and *environment variables*, to support running processes. [Runtime Environments in Compiler Design - GeeksforGeeks](https://www.geeksforgeeks.org/runtime-environments-in-compiler-design/)

#### Procedure Call Stack
Keeps track of the live *procedure activations* (i.e., the procedures whose execution has not yet completed).
![](https://hackmd.io/_uploads/SkgEirMdR.png)

The call stack consists of **activation records** that provide essential runtime-environment components: *local variables*, *temporary values*, *parent/other procedure snapshots*, *machine status*, *return value*, *parameters*, dynamic links, ...

#### Runtime Storage (Section -> Segment)
- **Target Code:** from ELF's `.text`
- **Data Objects:**
    - Static Variables: from ELF's `.bss`, `.data`
    - Automatic Data Objects: Stack
    - Dynamic Data Objects: Heap
- **Access Links:** refer to non-local data held in other activation records.
- **Control Links:** point to the caller's activation record, tracking the procedures' control flow.

#### Application Binary Interface (ABI)
- **Calling Conventions:**
    - How to pass/handle parameters.
        - [How Many Arguments Can Be Passed in Variadic Functions (The GNU C Library) - GNU.org](https://www.gnu.org/software/libc/manual/html_node/How-Many-Arguments.html):
            > There is no general way for a function to determine the number and type of the optional arguments it was called with.

            That is, the developer maintains their own calling convention for each variadic function *(a sketch of the terminator convention appears after this section)*:
            - **Number of optional arguments as a fixed argument:** This convention implies all of the optional arguments are of the same type.
            - **Bitmasks & types of optional arguments as a fixed argument:** If a bit is set, fetch the value of the next argument; otherwise use a default value.
                > For example: the *format string argument* of `printf`.
            - **Special terminator as the last optional argument:** This assumes that the special terminator isn't otherwise meaningful to the function.
                > For example: a null pointer indicates the end of the argument list in `execl`.
    - What tasks the machine should do before & after the call.
    - Functions/syscalls following the API standard.
- **Name Mangling:** provides a means to encode additional information in symbol names (e.g., *function*, *structure*, *class*) **to pass extra semantic information from the compiler to the linker**.
- **Type Representations:** define how different static/dynamic types are stored, accessed, and manipulated (especially useful in object-oriented programming & casting scenarios).
    - **Primitive Types:** *int*, *float*, *char*, *bool*, ...
    - **Composite Types:** *arrays*, *structures*, *unions*, *tables*, *blobs*, *streams*, ...
    - **User-Defined Types:** *int32_t*, *audio* derived from *blobs*, ...
- **Machine Structures & Mechanisms:** define the standards to successfully perform the tasks, such as:
    - **Memory Management:** stacks, heaps, files, libraries, devices, paging, caching, garbage collection, thread-safe shared memory, IPC, brk(), MMU, ASLR, ELF loaders, dynamic linking, ...
    - **Exception/Interrupt Handling:** defines how error/event/signal conditions are communicated and handled between the operating system and programs.
    - **Thread Support & Synchronization:** encapsulates & establishes standards for data sharing and synchronization in parallel processing.

More:
- [How do linkers resolve symbols? Systems Programming CS Lecture - Chris Kanich](https://www.youtube.com/watch?v=6XVUIeAaROU)
- [Application Binary Interface - Derya Cortuk](https://medium.com/@derya.cortuk/application-binary-interface-92c52d46bbf6)
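As referenced in the calling-convention list above, a minimal sketch of the special-terminator convention (function name hypothetical): the callee scans its optional arguments until it sees a sentinel value, since the ABI gives it no other way to know the argument count.
```cpp=
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical variadic function using the "special terminator"
 * convention: a negative value marks the end of the argument list,
 * so the callee can discover the argument count by itself. */
int sum_until_negative(int first, ...) {
    va_list ap;
    int total = 0;

    va_start(ap, first);
    for (int v = first; v >= 0; v = va_arg(ap, int))
        total += v;
    va_end(ap);
    return total;
}

int main(void) {
    /* The caller must honor the convention and pass the terminator. */
    printf("%d\n", sum_until_negative(1, 2, 3, -1)); /* prints 6 */
    return 0;
}
```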
#### C-Runtime & C-Library (glibc on GNU/Linux)
![ry4p1V4NA](https://hackmd.io/_uploads/rk0e7bySA.png =400x)
[[Slides] How A Compiler Works: GNU Toolchain - jserv - slideshare](https://www.slideshare.net/slideshow/how-a-compiler-works-gnu-toolchain/45294966)
- **Load & Return:** Provides support at the beginning & the end of the process.
    ![image](https://hackmd.io/_uploads/S1oi65Gh0.png =300x)
    - **ELF File Format:** [In-depth: ELF - The Extensible & Linkable Format - stacksmashing](https://www.youtube.com/watch?v=nC1U1LJQL8o)
    - **ELF Dynamic Linking:** [Linux Executable Symbol Relocation Explained - Chris Kanich](https://www.youtube.com/watch?v=E804eTETaQs)
- **Exception Handling & Runtime Support:** Supports the process with OS & hardware/software functionalities via C libraries.
    > Ex. FPU software emulation to support hardware operation requirements.

More:
- [Creating a C Library - OSDev Wiki](https://wiki.osdev.org/Creating_a_C_Library)

#### C++ RTTI (Run-Time Type Information)
Determines the data type of a variable at runtime; it is used when invoking the *dynamic_cast* or *typeid* operators.
> The C++ standard does not determine exactly how RTTI is implemented; **its implementation is "delegated" to the application binary interface (ABI)**. Most compilers store the information in a **virtual table (vtable)**. If you need details, look at the vtable implementation on the x86_64 platform.

More:
- [[Slides] Static Analysis of C++ Virtual Tables (from GCC) - James Rowley, 2023](https://hardwear.io/usa-2023/presentation/analyzing-decompiled-c++vtables-and-objects-in-GCC-binaries.pdf)

:::success
**Getting Rid of RTTI's dynamic_cast:**
![image](https://hackmd.io/_uploads/HJPmuCJwA.png =500x)

dynamic_cast is bad because:
1. It works only on polymorphic types.
2. It doesn't work with the `-fno-rtti` flag.
3. ABI-specific information about types is required for it to work.
4. Downcasting works via the vtable view, and this is a slow operation.

> Alternatively, we can implement a simple self-hosted dynamic_cast by wrapping our class hierarchy with a parent class that contains type information that can be determined at runtime *(see the sketch at the end of this section)*.

Refs:
- [Is there life without RTTI or How we wrote our own dynamic_cast - Vladislav Stolyarov](https://pvs-studio.com/en/blog/posts/cpp/0998/)
:::

#### C# .NET CLR (Common Language Runtime)
The *JIT compiler* in *.NET* translates *IL (Intermediate Language)* into native machine code with *CLR* support, enabling *IL* to be dynamically transformed into platform-specific machine code.
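As promised in the RTTI note above, a minimal sketch of a self-hosted downcast (names hypothetical, in the spirit of LLVM's `dyn_cast`): the root class stores a type tag set at construction time, so a downcast reduces to an integer comparison instead of an RTTI/vtable lookup, and works even under `-fno-rtti`.
```cpp=
#include <cstdio>

/* Hypothetical root class carrying its own type tag, determined at
 * construction time, so downcasts reduce to a tag comparison. */
struct Node {
    enum Kind { KIND_NODE, KIND_EXPR, KIND_STMT } kind;
    explicit Node(Kind k) : kind(k) {}

    /* Cheap replacement for dynamic_cast: check the tag, then cast. */
    template <typename T>
    T *dyn_cast() {
        return kind == T::CLASS_KIND ? static_cast<T *>(this) : nullptr;
    }
};

struct Expr : Node {
    static constexpr Kind CLASS_KIND = KIND_EXPR;
    Expr() : Node(CLASS_KIND) {}
};

struct Stmt : Node {
    static constexpr Kind CLASS_KIND = KIND_STMT;
    Stmt() : Node(CLASS_KIND) {}
};

int main() {
    Expr e;
    Node *n = &e;
    std::printf("as Expr: %p\n", (void *)n->dyn_cast<Expr>()); /* non-null */
    std::printf("as Stmt: %p\n", (void *)n->dyn_cast<Stmt>()); /* null     */
    return 0;
}
```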