# Compiler: The Program, the Language, & the Computer Work (1)
###### tags: `Compiler` `Tokenizer` `Parser` `Optimization` `Compiler Design` `GCC` `Clang` `LLVM` `JVM` `TurboFan` `V8` `Automata` `GNU` `Linux` `Kernel` `Program Analysis` `Operating System` `Computer Architecture` `ASM` `ABI` `ISA` `Runtime` `Interpreter` `Toolchain` `C` `C++` `Java` `JavaScript` `C#` `Cross Platform` `JIT` `AOT` `Computing Machine`
These are the learning notes, research, & discussions on compiler design for GNU/Linux systems. Topics include optimization strategies, program analysis, runtime support, cross-compilers, & other advanced subjects.
:::info
:arrow_right: Next article: [Compiler: The Program, the Language, & the Computer Work (2) - shibarashinu](https://hackmd.io/@shibarashinu/SkAhDMLmll).
:::
:::warning
**Program vs. Language vs. System**
:arrow_right: Related topics with compiler design, implementations, & applications: [我所不知道的 JavaScript - shibarashinu](https://hackmd.io/@shibarashinu/BkAHFociR).
:::
> All Linux kernel functions discussed here are primarily based on *version 6.11*.
## Preface
:::info
*"All computer work is just simply doing string processing."*
:::
By approaching running programs with a mathematical & automata mindset, we can understand their limitations, optimize processes, and approach the ultimate solutions.
:::success
**"Anything can be optimized. Anything can become an exquisite piece of art."**
Whatever path we strive for, mastering the essence of each endeavor allows us to capture the true gem of artistry.
For example:
- **Metaprogramming:** Viewing the program from a high-level perspective focuses on the essence of the code, not the code itself. The code can thus be read, generated, analyzed, transformed into other programs, or even modified by itself at runtime. This flexibility & dynamic manipulation lets us **==Treat "Code" as "Data"==**: we can ++inspect "Code" as "Data"++ & make potential improvements like *code motion*, or do generative programming that lets us ++manipulate "Code" as "Data"++.
[Metaprogramming - Wiki](https://en.wikipedia.org/wiki/Metaprogramming)
Metaprogramming can appear in some of the following fields:
- **Exposing Runtime System Environments to Programming Language Execution Context with APIs:** e.g., .NET Common Intermediate Language (CIL) emitter.
[C# CLR, IL, JIT Compilation & Code Access Security Explained - Kunal Tandon - Medium](https://medium.com/@kunaltandon.kt/c-clr-il-jit-compilation-code-access-security-explained-269124121f5)

- **Dynamic Execution of Expressions:** *"Treat Data as Executables"*. Dynamic expressions that contain programming commands, often composed from strings, arguments, or context (as in JavaScript). Thus, *"programs can write programs"*.
- **"Meta(data)"-Programming:** View programs in a superior aspect beyond the actual programs. General purpose program transformation systems (e.g., compilers), which accept language descriptions and carry out arbitrary transformations on those languages, are direct implementations of *general metaprogramming*.
[Compiler-compiler - Wiki](https://en.wikipedia.org/wiki/Compiler-compiler)
- **Generic Programming Paradigm:** Generate programs that can *"automatically manufacture software components"*.
E.g., POSIX shell scripts, IDL system, [PDL](https://github.com/google/pdl) (protocol), devicetree, macro, `template` in `C++`, dynamic function loading, ...
> devicetree system:
> 
- **Use Case: Domain-Specific Languages (DSLs) Applications:** `lex` (lexical analyzer) & `yacc` (parser generator) for generating lexical analyzers & parsers: they let the user describe the language using regular expressions & context-free grammars, & embed the complex algorithms required to efficiently parse the language.
:::warning
**Note:** Not all *metaprogramming* involves generative programming. If programs are *modifiable* at runtime, or if incremental compiling is available (e.g., in `C#`, `JavaScript`, `Lisp`, `Elixir`, `Lua`, `Perl`, `PHP`, `Python`, `Ruby`, `Rust`, `R`), then techniques can be used to perform metaprogramming **without** generating source code.
:::
- **JIT Compilation:** Dynamic translation or run-time compilation, i.e., compilation at *RUNTIME*.
[[Intro] Metaprogramming and JIT Compilers - A Brief Overview - VoxelRifts](https://www.youtube.com/watch?v=FFgvV0sA3kU)
- **Metaprogramming from Source Code:** Since source code alone carries no further information, there is not much we can do besides *++Generative Programming++*.
- **Metaprogramming from AST / IR / Bytecode:** With extra information beyond the source code (via data-/control-flow analysis), we can not only do generative programming but also *++inspect & analyze++* the code & take further operations.
- **Metaprogramming from Assembly:** Not high-level enough to carry informative details, nor low-level enough for the machine to identify easily, which limits its metaprogramming use cases.
- **Metaprogramming from Machine Code:** Here we can turn data into any executables.
> Still, be aware of keeping the executables consistent with the correct calling convention of the target platform & its machine code.
- **JIT Compilation Implementation Concept:**
```cpp=
#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
    /*
     * Since memory from malloc is not executable by default,
     * we need to use mmap instead.
     */
    char *buffer = (char *) mmap(
        NULL,                               // Start address (let the kernel choose)
        1024,                               // Length of the mapping
        PROT_READ | PROT_WRITE | PROT_EXEC, // Protection flags
        MAP_PRIVATE | MAP_ANONYMOUS,        // Mapping flags
        -1,                                 // File descriptor (not used here)
        0                                   // Offset (not used here)
    );
    if (buffer == (char *) MAP_FAILED)
        return -1;
    // func setup: the mapped buffer itself is the function's entry point
    typedef void (*buffer_as_func)(void);
    buffer_as_func func = (buffer_as_func) buffer;
    // mov eax, ecx
    buffer[0] = 0x89;
    buffer[1] = 0xc8;
    // ret (0xc3; 0xc2 is "ret imm16" & would pop extra bytes)
    buffer[2] = 0xc3;
    // execute
    func();
    // clean up
    if (munmap(buffer, 1024) == -1)
        return -1;
    buffer = NULL;
    return 0;
}
```
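> Note that on hardened systems a mapping that is both writable & executable may be denied (W^X policies); a common pattern is to `mmap` with `PROT_READ | PROT_WRITE` only, emit the machine code, then flip the page to `PROT_READ | PROT_EXEC` with `mprotect` before calling it.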
:::info
:arrow_right: See more about JIT implementation on browser engine (JS JIT entity x browser native environment): [Binding & Sandboxing - Anatomy of Web Browser Engines - shibarashinu](https://hackmd.io/@shibarashinu/H1wrAk4o0#WebCore---Rendering-Engine).
:::
:::info
:arrow_right: See more about exploiting executable code: [Sandboxing the No Data Execution Prevention (DEP) - shibarashinu](https://hackmd.io/@shibarashinu/H1UUR_p-0#Memory-Exploit-in-JavaScript-JIT-Engine).
:::
:::info
**Classic Example: Lisp**
In *Lisp metaprogramming*, the unquote operator (typically a comma) introduces code that is evaluated at program definition time rather than at run time. The *metaprogramming* language is thus identical to the *host* programming language, & **existing Lisp routines can be directly reused for *metaprogramming* if desired** *(here takes "meta" literally)*.
This approach has been ==implemented in other languages by incorporating an ++interpreter++ in the program==, which works directly with the program's data.
:::
- **Algorithms x Turing Machines:**
- **Specialized Data Structures:** [Dijkstra's Hidden Prime Finding Algorithm - boo1](https://www.youtube.com/watch?v=fwxjMKBMR7s)
- **Dynamic Programming:** [The Algorithm Behind Spell Checkers - b001](https://www.youtube.com/watch?v=d-Eq6x1yssU)
- **Computer Architectures x Calculation:**
- **Fast InvSqrt:** `i = 0x5f37'59df - (i >> 1)` & Newton's Method
[没有显卡的年代,这群程序员用4行代码优化游戏 - 量子位](https://www.youtube.com/watch?v=g1r3iLejTw0)
- **Fast Fourier Transform**
- [XOR linked list](https://en.wikipedia.org/wiki/XOR_linked_list), Error Correction, Lossy/-less Compression, Encode, Hash, Encryption
- **Data Structures for Special Scenarios:** Splay Tree, B+ Tree, Bloom Filters, BSP Tree, R* Tree, [你所不知道的 C 語言: linked list 和非連續記憶體 - jserv](https://hackmd.io/@sysprog/c-linked-list), ...
- **Hardware Acceleration with Lookup Tables:** [Pentium FDIV bug - Wiki](https://en.wikipedia.org/wiki/Pentium_FDIV_bug)
- **Parallel Computing:** SIMD, SIMT, MIMD
- **Computation Transform:** The essence of math is that it starts from solving actual problems, abstracting their core concepts, generalizing them (& generalizing those again, ...), & finally moving toward generalized truths: transformative wildcard patterns.
<div style='display: grid; grid-template-columns: repeat(2, 1fr); gap: .5em; align-items: center;'>
<img style='display: inline; margin: 0; object-fit: contain; aspect-ratio: 3 / 2; background: #fff;' src='https://hackmd.io/_uploads/SyXKzKArJe.png' />
<img style='display: inline; margin: 0; object-fit: contain; aspect-ratio: 3 / 2; background: #fff;' src='https://hackmd.io/_uploads/rJ_VF3m90.png' />
<img style='display: inline; margin: 0; object-fit: contain; aspect-ratio: 3 / 2; background: #fff;' src='https://hackmd.io/_uploads/SyuM6SPk1l.png' />
</div>
> 
>
> (Source: [所以,到底什么是傅里叶变换?- 漫士沉思录 Meditation Math](https://www.youtube.com/watch?v=nwMKuChwpMo))
- **Powers & Exponents:** e.g., $exp()$, $log()$, ...
- **Geometry, Limits & Calculus:** e.g., Cartesian/cylindrical/spherical coordinates, trigonometry, Riemann integral, Taylor series, Fourier series, $e$, ...
- **Linear Algebra:** linear transformations, transformation compositions, eigenvectors, ...
- **Probability, Statistics:** Same as the dimensional analysis, but focusing on the "results".
- **Automatic Control:** transfer functions, PID feedback systems, ...
- **Complex Systems:** $i$, Euler coordinates, Laplace transform, Fourier transform (FFT, DFT, [Discrete Weighted Transforms & Large Integer Arithmetic](https://www.ams.org/journals/mcom/1994-62-205/S0025-5718-1994-1185244-1/S0025-5718-1994-1185244-1.pdf)), Z-transform, dimensional analysis, ...
- **Commutative Algebra:** algebraic geometry, number theory, ...
- [Why is pi here? And why is it squared? A geometric answer to the Basel problem - 3Blue1Brown](https://www.youtube.com/watch?v=d-o3eB9sfls)
- [Commutative Ring - ScienceDirect](https://www.sciencedirect.com/topics/mathematics/commutative-ring)
- **Natural Transformations:** *"Morphisms between Functors"*. Transform one functor into another while preserving the categorical structure.
:::
:::warning
**Hackers & Painters:** *The philosophy of how we think, how we work, how we develop technology, & how we live.*
> Empathy is probably the single most important difference between a good hacker and a great one. Some hackers are quite smart, but when it comes to empathy are practically solipsists. It's hard for such people to design great software, because they can't see things from the user's point of view.
Resources:
- [[Intro] Hackers and Painters - Paul Graham](https://paulgraham.com/hp.html)
- [[Book] Hackers & Painters: Big Ideas from the Computer Age - Paul Graham - Amazon](https://www.amazon.com/Hackers-Painters-Big-Ideas-Computer-ebook/dp/B0026OR2NQ?ref_=ast_author_mpb)
- [[Article] Computer Programming as an Art - Donald E. Knuth, 1974](https://paulgraham.com/knuth.html)
- [[Book] The World Is Flat: A Brief History of the Twenty-first Century - Thomas L. Friedman - Amazon](https://www.amazon.com/World-Flat-History-Twenty-first-Century/dp/0374292884)
- [[Article] The Top Idea in Your Mind - Paul Graham](https://www.paulgraham.com/top.html)
- [Part 1 - Capt. Grace Hopper on Future Possibilities: Data, Hardware, Software, and People (1982) - National Security Agency](https://www.youtube.com/watch?v=si9iqF5uTFk)
- [Part 2 - Capt. Grace Hopper on Future Possibilities: Data, Hardware, Software, and People (1982) - National Security Agency](https://www.youtube.com/watch?v=AW7ZHpKuqZg)
:::
## Reference
#### Courses
- [OptComp-2021: Optimising Compilers - University of Cambridge](https://www.cl.cam.ac.uk/teaching/2021/OptComp/)
- [COPT-2019: Compiler Optimisation - University of Edinburgh](https://www.inf.ed.ac.uk/teaching/courses/copt/)
- [CS 6120: Advanced Compilers - Cornell University](https://www.cs.cornell.edu/courses/cs6120/2022sp/lesson/)
- [15-819 O: Program Analysis - Carnegie Mellon University](https://www.cs.cmu.edu/~aldrich/courses/15-819O-13sp/)
- [CS 701: Construction of Compilers - University of Wisconsin–Madison](https://pages.cs.wisc.edu/~fischer/cs701.html)
- [CSC 372: Systems II: Compiler Construction - Adelphi University](https://home.adelphi.edu/~siegfried/cs372/notes.html)
#### Lectures
- [你所不知道的 C 語言:編譯器和最佳化原理篇 - jserv](https://hackmd.io/@sysprog/c-compiler-optimization)
- [你所不知道的 C 語言: 執行階段程式庫 (CRT) - jserv](https://hackmd.io/@sysprog/c-runtime)
#### Links
- [[Book] Let’s Build a Compiler - Jack Crenshaw](https://xmonader.github.io/letsbuildacompiler-pretty/about.html)
- [Compiler Design Tutorial - GeeksforGeeks](https://www.geeksforgeeks.org/compiler-design-tutorials/)
- [編譯器設計 - ji3g4al](https://hackmd.io/@ji3g4al/SyBQJy9yq)
- [深入淺出教你寫編譯器 (Compiler) - Jace Ju](https://jaceju.net/simple-compiler/)
- [GNU Compiler Collection (GCC) Internals - Online Docs - GNU GCC](https://gcc.gnu.org/onlinedocs/gccint/)
- [GNU Compiler Collection Internals PDF (for gcc version 4.2.4) - GNU GCC](https://gcc.gnu.org/onlinedocs/gcc-4.2.4/gccint.pdf)
- [The Java® Virtual Machine Specification - Oracle](https://docs.oracle.com/javase/specs/jvms/se8/jvms8.pdf)
<!--
#### Books
- [[Book] Hackers and Painters: Big Ideas from the Computer Age - digtvbg](https://digtvbg.com/files/books-for-hacking/Hackers%20%26%20Painters%20-%20Big%20Ideas%20From%20The%20Computer%20Age%20by%20Paul%20Graham.pdf)
- [[Book] Basics of Compiler Design - Department of Computer Science (University of Copenhagen)](http://hjemmesider.diku.dk/~torbenm/Basics/basics_lulu2.pdf)
- [[Book] Compilers: Principles, Techniques, and Tools - NTHUCS](https://www.cs.nthu.edu.tw/~ychung/slides/CSC4180/Alfred%20V.%20Aho,%20Monica%20S.%20Lam,%20Ravi%20Sethi,%20Jeffrey%20D.%20Ullman-Compilers%20-%20Principles,%20Techniques,%20and%20Tools-Pearson_Addison%20Wesley%20(2006).pdf)
- [[Book] Crafting a Compiler - NTHUCS](https://www.cs.nthu.edu.tw/~ychung/slides/CSC4180/Crafting%20a%20Compiler%20-%202010.pdf)
- [[Book] Engineering a Compiler - R-5: The Game of Life](https://www.r-5.org/files/books/computers/compilers/writing/Keith_Cooper_Linda_Torczon-Engineering_a_Compiler-EN.pdf)
-->
#### Conferences
- [IEEE Xplore](https://ieeexplore.ieee.org/Xplore/home.jsp)
- [OSDI '24 - USENIX](https://www.usenix.org/conference/osdi24)
- [The International Symposium on Code Generation and Optimization (CGO) - ACM](https://dl.acm.org/conference/cgo)
#### Reviews
- **Theory of Computation**

- **NFA / DFA (Regular Language):** The state machine for checking grammar that memorizes the current state only. $L=\{a^*b^+\}$, $L=\{w : |w| \bmod 3 = 0\}$
- **CFG / PDA (Context Free Language):** The state machine for checking grammar with an extra memory of a R/W LIFO stack. $L=\{a^nb^n: n \geq 0 \}$, $L=\{ww^R\}$
- **Turing Machine (Formal Language):** The state machine for checking grammar with an extra memory of an infinite movable R/W tape. $L=\{a^nb^nc^n: n \geq 0 \}$, $L=\{ww\}$
- **Compiler Construction**
- Lexer, Top-Down/Bottom-Up Parsing, Grammar, SDT, IR, Code Gen for VMs, Runtime Support, Target Code Generator
- [Compiler Design Tutorial - tutorialspoint](https://www.tutorialspoint.com/compiler_design/index.htm)
#### Todos
- **Programming Language Theory**
- Lambda Calculus
- Formal Semantics
- Type Theory
- Program Analysis & Transformation
- Run-Time Systems
- Metaprogramming
- Lisp: *List Processing Languages* for symbolic data processing.
[[Book] History of Lisp Parentheses - shaunlebron](https://github.com/shaunlebron/history-of-lisp-parens?tab=readme-ov-file)
- **Category Theory**
[Applied Category Theory Course - John Baez](https://math.ucr.edu/home/baez/act_course/)
- [Natural Transformations - Bartosz Milewski's Programming Cafe](https://bartoszmilewski.com/2015/04/07/natural-transformations/)
- **Functional Programming**
[Why is writing a compiler in a functional language easier? - StackOverflow](https://stackoverflow.com/questions/2906064/why-is-writing-a-compiler-in-a-functional-language-easier)
:::info
**Pros of F.P. in Compilation**
A compiler frequently works with trees, parsing source code into a syntax tree. This tree may be transformed for type checking, converted to core language elements, and optimized through various transformations. Eventually, the compiler generates a tree in a normal form, from which it produces the target assembly code.
Functional languages have features like **pattern-matching**, **immutability** (anti-side-effect), a **declarative nature** ("what to" rather than "how to" solve), and good support for **efficient recursion**, which make it easy to work with trees; that's why they're generally considered good languages for writing compilers.
:::
- **C Language Reference**
- [C Language Reference - C++, C, and Assembler - Microsoft Learn](https://learn.microsoft.com/en-us/cpp/c-language/c-language-reference?view=msvc-170)
## Compiler History
> Compiler's Goal: To enable people who don't know machine code to write programs in natural-language-like or higher-level languages, and have an intelligent compiler translate these languages into the target machine code.
Who was the first creator of the compiler? How does a high-level programming language bootstrap itself?
More on this topic:
- [Who is Grace Hopper? Meet the Queen of Code - Honeypot](https://www.youtube.com/watch?v=5sNuPYJpSCI)
- [COBOL - Wiki](https://en.wikipedia.org/wiki/COBOL)
- [Bootstrapping (compilers) - Wiki](https://en.wikipedia.org/wiki/Bootstrapping_(compilers))
- [Compiler-compiler - Wiki](https://en.wikipedia.org/wiki/Compiler-compiler)
### Compilation Process & Coroutines
In the modern era, a high-level compiler consists of: *lexical analysis*, *syntactic analysis*, *semantic analysis*, *compiler optimization*, *target code generation*, etc.
The compilation process essentially starts from the source code and performs operations sequentially, feeding the output of the previous pass to the next one. The intermediate outputs can be stored in temporary files or RAM. All modern compilers are implemented following these procedures.
But in the old days, computers relied on batch processing, so it was very inconvenient if the process needed to deal with temporary outputs and then carry on (due to the heavy cost of moving the read/write head backward & forward).
*COBOL* (1960s), with its complexity and the various constructs of its syntax, aligns better with the capabilities of ==*LR parsing*== techniques. Even today, it is not quite intuitive to perform lexical analysis for *LR parsing* as a plain subroutine.
This led to the need for a mechanism design that could do all various work on the same temporary data, which gave birth to the *coroutines*, an implementation of *cooperative multitasking*.
As such, *COBOL's* compilation process interleaves lexical & syntactic analysis; that is, when one generates enough information (e.g., tokens, syntax) within the "current window", it ++actively yields++ its control of the processor, records its current status so it can ++resume afterward++, and passes its output immediately to the other as input.
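For a feel of the mechanism, here is a minimal sketch (my own illustration, not COBOL's actual implementation) of this handoff using the POSIX `ucontext` API on GNU/Linux: the lexer yields each token it produces & is resumed by the parser when more input is needed.
```cpp=
#include <stdio.h>
#include <string.h>
#include <ucontext.h>

static ucontext_t parser_ctx, lexer_ctx;
static char current_token[16];
static int has_token;

static void lexer(void)
{
    static char src[] = "a + b * c";          // the source "window" to tokenize
    for (char *tok = strtok(src, " "); tok; tok = strtok(NULL, " ")) {
        strncpy(current_token, tok, sizeof current_token - 1);
        has_token = 1;
        swapcontext(&lexer_ctx, &parser_ctx); // yield: token ready, resume parser
    }
    has_token = 0;                            // end of input; fall back via uc_link
}

int main(void)
{
    static char stack[64 * 1024];
    getcontext(&lexer_ctx);
    lexer_ctx.uc_stack.ss_sp = stack;
    lexer_ctx.uc_stack.ss_size = sizeof stack;
    lexer_ctx.uc_link = &parser_ctx;          // where to go when lexer() returns
    makecontext(&lexer_ctx, lexer, 0);

    for (;;) {                                // the "parser" pulls tokens on demand
        swapcontext(&parser_ctx, &lexer_ctx); // resume the lexer coroutine
        if (!has_token)
            break;
        printf("parser consumed token: %s\n", current_token);
    }
    return 0;
}
```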
:::info
**[Further Thinking] What Do Old-School Cooperative Coroutines & Modern Preemptive Multi-Threading Systems Really Do in These Days?**
*The transfer of task responsibility between fulfillers & the corresponding tactics at different layers.*
We can think of a function call into a shared library, a trap into the kernel (whether via fast syscall or not), or a service routine done by middleware in a multi-layer architecture as all working in a cooperative-multitasking-like flavor.
Whereas a multi-threaded software architecture running on modern multicore multiprocessors typically hands its cooperation & scheduling work to the underlying system (OS or runtimes), which abstracts, undertakes & tunes that coordination from a system-wide perspective & over the long term, aiming for generalized but hardware-friendly system performance.
So, multi-threaded or not? It is good for a cleaner system design pattern, but bad for customization & fine-grained optimization, because the underlying system is dedicated to providing a general "good enough" service to "all users", which may make individual apps compromise or degrade in certain scenarios.
Weighing cooperative vs. preemptive, at the operating-system (or runtime-system) vs. user-app level, all depends on client requirements, system resources, targets, ... A one-size-fits-all solution doesn't exist.
:::
Refs:
- [程式珠璣番外篇 - Q 協程的歷史,現在與未來 - 徐宥](https://blog.youxu.info/2014/12/04/coroutine/)
## Compiler Implementations & Applications
Generally speaking, a compiler works on:
```sh!
Source Code -> IR -> Target Machine Code / Target IR
```
- **Scanner -- Lexical Analysis (Tokens)**
Uses ++RE++ -> ++NFA++ (++DFA++) to transform strings into ==tokens== (e.g., *keywords*, *identifiers*, *literals*, *operators*, ...); see the minimal scanner sketch after this list.
- **Parser -- Syntax Analysis (Rules) & Semantic Analysis (Meanings)**
Uses ++Syntax-Directed Definition (SDD)++ -> ++CFG++ -> ++PDA++ to transform tokens into ==ASTs==.
> By performing a top-down (LL) / bottom-up (LR) *Syntax-Directed Translation (SDT)*.
- **Intermediate Code Generator**
Transform ASTs into high-level ==machine-independent IR==.
> A semifinished product serving as an abstraction layer between source code & machine code, which is convenient for the subsequent *coprocessing* by compiler tools & passes.
- **[Machine-Independent] Code Pre-Optimizer**
Dedicated to optimizing the pure code logic (i.e., performing machine-independent optimization) on the IR in basic-block units.
- **[Machine-Dependent] Code Post-Optimizer**
Dedicated to analyzing the code logic & applying target machine-/platform-dependent techniques to the current IR in basic-block units.
> Besides, from here on, the compiler starts to take the actual running environment of the program into account to do code analysis, optimization, transformation, ... as well (e.g., *ISA*, *register allocation*, *cache layout*, *processor-/system-specific operations*, ...).
- **Machine Code Generator**
Finally, transform IR to ==target machine code==.
> By performing a *Syntax-Directed Translation (SDT)*, just as was done from ASTs to IR.
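To make the Scanner stage above concrete, here is a minimal hand-rolled sketch (an illustration, not any real compiler's lexer) that turns a string into tokens; the loops effectively walk the DFA built from the regular expressions for identifiers, literals & operators:
```cpp=
#include <ctype.h>
#include <stdio.h>

// Emit one token at a time from a tiny expression language:
// identifiers, integer literals, & single-character operators.
static void scan(const char *src)
{
    for (const char *p = src; *p; ) {
        if (isspace((unsigned char)*p)) { p++; continue; }
        if (isalpha((unsigned char)*p)) {          // identifier: [a-zA-Z][a-zA-Z0-9]*
            const char *start = p;
            while (isalnum((unsigned char)*p)) p++;
            printf("IDENT(%.*s)\n", (int)(p - start), start);
        } else if (isdigit((unsigned char)*p)) {   // literal: [0-9]+
            const char *start = p;
            while (isdigit((unsigned char)*p)) p++;
            printf("NUM(%.*s)\n", (int)(p - start), start);
        } else {                                   // operator / punctuation
            printf("OP(%c)\n", *p++);
        }
    }
}

int main(void)
{
    scan("result = (a * b + i) / (a * c);");  // this token stream feeds the parser
    return 0;
}
```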
### GCC Framework Overview

[GCC 4.0.2 - The Implementation - GCC Resource Center](https://www.cse.iitb.ac.in/grc/intdocs/gcc-implementation-details.html)
> 
> 
> - **Source -> AST:** (top-down) recursive descent parser
> 
> - **Generic:** tree-based (syntax) IR
> 
> - **Gimple:** graph-based (CFG) SSA IR
> 
> - SSA form:
> for passes of *loop unrolling*, *constant propagation*, *dead code elimination*, ...
> - variable rename: on each assignment
> - Φ: variable dependency control
> - three-address IR:
> ```arm
> (operator, operand1, operand2, result)
> ```
> - **RTL:** Register-based list style IR
> 
>
> (Source: [[Slides] from Source to Binary: How GNU Toolchain Works - jserv](https://www.slideshare.net/slideshow/from-source-to-binary-how-gnu-toolchain-works/7528437))
More:
- [Which GCC optimization flags affect binary size the most? - StackOverflow](https://stackoverflow.com/questions/72030595/which-gcc-optimization-flags-affect-binary-size-the-most)
### LLVM
Mainly contributing to developing its own **language-agnostic** compiler `IR`, optimizer, & backend infrastructure in *C++*, which accepts both static & dynamic language forms (e.g., *JVM bytecode*).


- **Clang:** A "LLVM Native" compiler frontend that transforms C/C++/Objective-C into *LLVM IR*.
- **Core (opt & llc/lli):** Responsible for *LLVM IR* analysis, optimization, & compilation process.
- **Dragonegg:** Uses GCC 4.5 (GPL v2) compiler frontend & LLVM Core.
- **LLDB:** LLVM's Debugger
> Back in the day, LLVM stood for *Low-Level Virtual Machine* & used GCC as its compiler frontend (with *Gimple* output/input).
> However, after Apple attempted to transition from GCC to LLVM by applying licenses other than GPL v2 to their contributions, particularly for LLVM as a GCC backend, the GCC community responded by updating the GPL license to prevent such actions. In response, LLVM proceeded to develop its own compiler frontend known as **Clang**.
Intro:
- :+1: [A Gentle Introduction to LLVM IR - mcyoung](https://mcyoung.xyz/2023/08/01/llvm-ir/)
- :+1: [LLVM - The Architecture of Open Source Applications](https://aosabook.org/en/v1/llvm.html)
#### LLVM IR
It comes in these equivalent forms:
1. **JIT In-Memory Format**
2. **ASM Format**
3. **Bitcode (Same as Bytecode) Format**
:::warning
**Main Design Concept:** The lower & more detailed the language (IR) is, the more optimization strategies the program gains.
:::
#### LLVM TableGen & TD (Target Description)
Similar to *GCC RTL pattern* & the corresponding *MD (Machine Description)*.
E.g., Running TableGen
```sh
llvm-tblgen X86.td -print-enums -class=Register
```
- X86.td
```cpp=
// Definition of ADD for x86 architecture
let Predicates = [HasAVX] in { // [HasAVX]: asserts AVX support available
def ADD : I<
0x01, // Defines opcode: 0x01
MRMSrcMem, // Addressing mode: register/memory source operand
(outs GR32:$dst), // output: 32-bit general-purpose register
(ins GR32:$src, Mem:$mem), // src: 32-bit register & mem: memory operand
"add{l}\t{$src, $mem|$mem, $src}", // assembly output format
[(set GR32:$dst, (add GR32:$src, Mem:$mem))] // SSA instruction selection format
> {
let Inst{25-24} = 0b11; // ModR/M mode: Register/Memory
let Inst{22-20} = 0b000; // Opcode for ADD
let Inst{19-18} = 0b01; // Operand size: 32-bit
let Inst{17-16} = 0b00; // Addressing mode: Direct
let Inst{3-0} = 0b0001; // Instruction opcode
}
}
```
#### LLVM Optimization
Implemented as *Analysis and Transform Passes* that traverse some portion of a program to either collect information or transform the program.
- **Analysis passes:** Compute information that the following passes can use, or for debugging/profiling purposes.
- **Transform passes:** Mutate the program in some way.
- **Utility passes:** Provide some easy-to-use utilities (e.g., extracting functions to bitcode or writing a module to bitcode).
Details:
- [LLVM’s Analysis and Transform Passes - LLVM](https://llvm.org/docs/Passes.html#introduction)
#### LLVM IR Interpreter: lli
Executes *LLVM bitcode* directly using a ==JIT compiler== or ==interpreter== (restricted to the host architecture & unable to emulate other architectures).
[lli - directly execute programs from LLVM bitcode - LLVM](https://llvm.org/docs/CommandGuide/lli.html)
:::info
E.g., Execute C/C++ program by LLVM interpreter.
```sh
# Compile C/C++ program to LLVM bitcode
clang -O3 -emit-llvm a.c -c -o a.bc
# Interpret & execute LLVM bitcode
lli a.bc
```
:::
#### LLVM Static Compiler: llc
Compiles LLVM source inputs into assembly language for a specified architecture. The assembly language output can then be passed through a native assembler and linker to generate a native executable.
[llc - LLVM static compiler - LLVM](https://llvm.org/docs/CommandGuide/llc.html)
### Valgrind
*Runtime Memory Debugger / Profiler*
Valgrind is a virtual machine with built-in JIT (re)compilation acceleration.
Similar to the *LLVM* compiler backend toolchain: it first translates binaries (its input) into a machine-independent IR in *SSA form*, then passes it to the functioning tools (e.g., *Memcheck*, *Cachegrind*, ...), which can inject extra checkpoints into the IR or do advanced runtime analysis, inspection & profiling. Finally, it recompiles the result (originally binary to binary) into the target machine code (i.e., it is able to do ++cross compilation++).
So the data pipeline in the compilation chains looks like:
```md!
Host Binaries -> Machine-Independent IR (SSA Form) -> Modified IR (Among Tool Passes) -> Target Binaries
```
> Also, *Valgrind* has integrated with *GDB*, which provides further interactive runtime debugging inside the Valgrind test programs.
[Valgrind - Wiki](https://en.wikipedia.org/wiki/Valgrind)
:::info
But anyway, use *language-/system-specific sanitizers* for analysis whenever possible due to some [drawbacks](https://stackoverflow.com/questions/47251533/memory-address-sanitizer-vs-valgrind) & [limitations](https://en.wikipedia.org/wiki/Valgrind#Limitations_of_Memcheck).
:arrow_right: See: [Compiler Sanitizers - Linux Kernel Debugging - shibarashinu](https://hackmd.io/@shibarashinu/ryyKT2wZR#Compiler-Sanitizers).
:::
### NeoVim's Parsers
- **Treesitter (Syntax Parser):** For *syntax highlighting*, *syntax error checking* & *code folding*, ...
> E.g., identifies `He (S.) + is gonna (be V. to) + sleep (V.).` according to the CFG syntax rules.
> E.g.,
> ```javascript=
> if (x > 0) { return x; }
> ```
> ```j=
> if_statement:
> parenthesized_expression:
> binary_expression:
> identifier
> >
> number
>
> statement_block:
> return_statement:
> identifier
> ```
:::info
**Using Bottom-Up Generalized LR (GLR) Parsing**
Enables fast, lightweight incremental parsing analysis while allowing incomplete input forms.
:::
- **LSP Client/Server (Semantic Parser):** For *autocompletion*, *type checking*, *function scope construction*, *formatting* & *refactoring*, ...
> E.g., identifies `He (a person) + is gonna (set the previous subject {the person}'s current status to the following action {sleep}) + sleep (action).` according to the language's semantic rules.
> E.g.,
> ```javascript=
> if (x > 0) { return x; }
> ```
> ```js=
> IfStatement {
> condition: BinaryExpression {
> left: Identifier(x)
> operator: ">"
> right: Number(0)
> }
> body: ReturnStatement {
> expression: Identifier(x)
> }
> }
> ```
:::info
**Parsing Based on Compiler Parsers**
*I.e., Language-Dependently*
Analyzes more thorough language-specific details of the code than syntax parsers do.
E.g.,
- **Top-Down LL-Based (`LL(k)`) Parsing:** JavaScript, TypeScript, Python, ...
- **Bottom-Up LR (`LALR(1)`) Parsing:** C, Rust, Golang, ...
:::
:::warning
**Top-Down (LL) vs. Bottom-Up (LR) in Semantic Parsing**
*Why don't we use ++LR parsing++ once and for all, just as syntax analysis does?*
Well, ++LL (top-down, recursive descent) parsing++ still has some merits:
- It follows the same process in which the code is written / in which the rules are devised, in a human-friendly way.
> *More* straightforward implementation, debugging & maintenance. *Less* additional machinery involved to build parsers (like the LR parser generator yacc).
- ++LR (bottom-up) parsing++ needs to defer semantic actions until the entire structure is complete, which may complicate implementation & runtime analysis.
& also its cons:
- [Review] ++LL parsing++ is not able to parse a *left-recursive context-free grammar*: it would analyze & expand the same rule infinitely.
For example:
```grammar=
Expr ::=
Expr '+' Term |
Expr '-' Term |
Term
Term ::=
Term '*' Factor |
Term '/' Factor |
Factor
Factor ::=
'(' Expr ')' |
num |
ident
```
& this would fail in the top-down parsing:
```c=
1 + 1 + 1
// what LL parser sees is:
// (1 + 1 + 1)
// ok, is it Expr ?
// ok, maybe Expr '+' Term ?
// ok, is it Expr ?
// ok, maybe Expr '+' Term ?
// ok, is it Expr ?
// ...
// (never deduce is it accepted Expr)
// ok, is it '+' ?
// ok, is it Term ?
// ok, maybe Expr '-' Term ?
// ok, maybe Term ?
```
To solve this, adjust the rules to conform to a right-recursive form:
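A sketch of the right-recursive rewrite (using the `Expr'` / `Term'` names that appear in the trace below):
```grammar=
Expr ::=
Term Expr'
Expr' ::=
'+' Term Expr' |
'-' Term Expr' |
(empty)
Term ::=
Factor Term'
Term' ::=
'*' Factor Term' |
'/' Factor Term' |
(empty)
Factor ::=
'(' Expr ')' |
num |
ident
```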

```c=
1 + 1 + 1
// what LL parser sees is:
// ok, is it Expr ?
// (1 + 1 + 1)
// ok, maybe Term Expr' ?
// ok, is it Term ?
// ok, maybe Factor Term' ?
// ok, is it Factor ?
// 1 (num)
// accepted !
// ok, is it Term' ?
// (+ 1 + 1)
// ok, maybe '*' Factor Term' ?
// Failed
// ok, maybe '/' Factor Term' ?
// Failed
// ok, maybe empty ?
// accepted !
// ok, is it Expr' ?
// ...
```
- If the grammar rules are complicated, ++LL parsing++ would need to look ahead (i.e., postpone the speculation & stack the current info) & exhaust all matching possibilities only to find a single solution: which is inefficient as the analysis gets broader & deeper.
To summarize:
- **LL (Top-Down) Parsing:** Compile-time built easily, run-time run heavily.
> Suited for simple, human-readable grammars.
- **LR (Bottom-Up) Parsing:** Compile-time built heavily, run-time run easily.
> Powerful, but requires complicated parser generators or compiler-compilers.
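As a small demonstration of the "built easily" point, here is a minimal recursive-descent (LL) sketch of the right-recursive grammar above, restricted to single-digit operands & with no error handling; each rule maps 1:1 onto a function:
```cpp=
#include <stdio.h>

static const char *p;                 // current input position (lookahead)

static int expr(void);

static int factor(void)              // Factor ::= '(' Expr ')' | num
{
    if (*p == '(') { p++; int v = expr(); p++; return v; }  // skip ')'
    return *p++ - '0';
}

static int term_tail(int left)       // Term' ::= '*' Factor Term' | '/' Factor Term' | (empty)
{
    if (*p == '*') { p++; return term_tail(left * factor()); }
    if (*p == '/') { p++; return term_tail(left / factor()); }
    return left;                      // empty alternative
}

static int term(void) { return term_tail(factor()); }        // Term ::= Factor Term'

static int expr_tail(int left)       // Expr' ::= '+' Term Expr' | '-' Term Expr' | (empty)
{
    if (*p == '+') { p++; return expr_tail(left + term()); }
    if (*p == '-') { p++; return expr_tail(left - term()); }
    return left;                      // empty alternative
}

static int expr(void) { return expr_tail(term()); }           // Expr ::= Term Expr'

int main(void)
{
    p = "1+1+1";
    printf("%d\n", expr());           // prints 3
    return 0;
}
```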
:::
### JIT Compilers - Interpreted Language Accelerators
#### Java Hotspot JVM's Compilers


> JIT & AOT compilers in Java:
> 
> 
> (Source: [AOT vs. JIT Compilation in Java - César Soto Valero](https://www.cesarsotovalero.net/blog/aot-vs-jit-compilation-in-java.html))
- **C1 ByteCode Compiler (Fast, Lightweight):** Performs value numbering, inlining, and class analysis, featuring:
- **Simple control-flow-graph (CFG) SSA "high-level" `IR`**
- **Machine-oriented "low-level" `IR`**
- **Linear scan register allocator**
- **Template-style code generator**
- **C2 ByteCode Compiler (Highly Optimized):** Performs global value numbering, conditional constant type propagation, constant folding, global code motion, algebraic identities, method inlining (aggressive, optimistic, and/or multi-morphic), intrinsic replacement, loop transformations (unswitching, unrolling), array range check elimination, featuring:
- **Sea-of-nodes SSA & machine-specific `IR`**
- **Graph-coloring register allocator in various types (e.g., local, global, and argument registers and stack)**
:::info
**Sea of Node SSA Representation**
A graph representation of the static single assignment (SSA) form of a program that **combines *data flow* and *control flow***, relaxing the control flow from a "total order" to a "partial order", keeping only the orderings required by data flow by **integrating traditional IR layers (CFG & Basic Blocks) into a *value [dependency graph](https://en.wikipedia.org/wiki/Dependency_graph) (VDG)***.
Features:
- **Not Control Flow Graph (CFG):** Fully-specified evaluation order (e.g., pure operations: integer addition).
- **Lack of ordering:** Relaxing evaluation order for most operations.
- **Easier to reorder instructions:** The skeleton of a CFG remains, but a *global code motion algorithm* is required to convert it back into a *control flow graph (CFG)*.
- **Effect edges order stateful operations:** Resulting in better implementation of *dead code elimination*, *code motion*, and *constant propagation*.
> Very suited for **Speculative Optimization** -- JIT optimizing compiler.

More:
- :+1: [Sea of Nodes - Fedor Indutny's Blog](https://darksi.de/d.sea-of-nodes/)
- :+1: [Sea-of-nodes 论文阅读笔记 - chuj](https://www.cjovi.icu/compilers/1654.html)
:::
:::info
**[[Paper] A parallel abstract interpreter for JavaScript - K Dewey, 2015](https://vineethk.github.io/files/Dewey2015.pdf)**
A parallel approach of data-flow analysis (DFA) with control-flow graph (CFG) on JavaScript language to boost up the analysis performance.
Features:
1. **Non-Obvious Control Flow:** JavaScript is a highly dynamic language, featuring *prototype-based inheritance*, *higher-order functions*, *implicitly applied type-conversions*, *implicit exceptions*, *dynamic memory management of objects, variables, & functions with garbage collections*, *only 2 types of variable scope (global & function internal)*, & *hoisting variables*.
2. **JavaScript’s Inherent Dynamism:** Means that high precision is important to get useful results, implying that any useful analysis will be expensive.
Solutions:
1. **Worklist-Parallel Strategy:**
Process data-flow nodes independently & thread-safely until reaching some merge point specified by the merge strategy; merged states are then inserted back into the global worklist.
2. **Per-Context Parallel Strategy:**
To reduce node-ordering issues, limit synchronization between independent parts of the analysis, and increase the granularity of the thread computations, ==context-sensitive analysis== is a suitable option: it clones functions based on some notion of abstract calling context. Each clone is specialized to a particular context and, most importantly, analyzed separately.

> We observe that many contexts are typically enqueued for processing simultaneously, highlighting the promise of a **per-context strategy** for analyzing JavaScript. Instead of a global worklist, we use one worklist per context, each managed by its own thread. Each worklist has a non-blocking, lock-free backlog queue for other threads to enqueue work. When a thread exhausts its worklist, it transfers its backlog queue to the worklist and continues. If a thread processes a function call for a new context, it enqueues the abstract state into that context’s backlog queue. The analysis reaches a fixpoint when all worklists and backlog queues are empty.
:::
#### JavaScript V8 TurboFan JIT Compiler
Optimizes code for *cases seen in the past* (gathered in type feedback vector by the interpreter).
Refs:
- [[Slides] Deoptimization in V8 - Jaroslav Sevcik](https://docs.google.com/presentation/d/1Z6oCocRASCfTqGq1GCo1jbULDGS-w-nzxkbVF7Up0u0/edit?usp=sharing)
- [[Slides] TurboFan: A new code generation architecture for V8 - Benedikt Meurer](https://docs.google.com/presentation/d/1_eLlVzcj94_G4r9j9d_Lj5HRKFnq6jgpuPJtnmIBs88/edit?usp=sharing)
- [[Slides] TurboFan TechTalk Presentation - Ben L. Titzer](https://docs.google.com/presentation/d/1sOEF4MlF7LeO7uq-uThJSulJlTh--wgLeaVibsbb3tc/edit?usp=sharing)
- [從編譯器優化角度初探 Javascript的V8 引擎 - maxchiu](https://tech-blog.cymetrics.io/posts/maxchiu/turbofan/)
:::info
:arrow_right: For more comprehensive anatomy of V8 architecture, please check out: [我所不知道的 JavaScript - shibarashinu](https://hackmd.io/@shibarashinu/BkAHFociR).
:::
- **TurboFan Overview**



:::info
**How to apply the "JIT" Optimization?**
Using *On-Stack Replacement (OSR)*, we can replace the top of the current call stack with one in which the frames have been merged together.
OSR takes the contexts of some contiguous range of function applications and the variables that are live in those activations, and transforms those inputs into a new context or contexts and live variable set, and splats the new activations on the stack.
For example,
```js=
function g() { return 1; }
function f() {
var ret = 0;
for (var i = 1; i < 10_000_000; i++) {
ret += g ();
}
return ret;
}
```

[on-stack replacement in v8 - wingolog](https://wingolog.org/archives/2011/06/20/on-stack-replacement-in-v8)
:::
- **TurboFan JIT Optimization Pipeline**
:+1: [Node.js Under The Hood #7 - The new V8 - Lucas Santos](https://dev.to/_staticvoid/node-js-under-the-hood-7-the-new-v8-4gd6)
For example:
```javascript=
// Source Input
function f(a, b, c) {
var d = c - 100;
return a + d * b;
}
```
```javascript=
// Bytecode Output after "Ignition"
// Load constant 100 into the accumulator (Smi is Small Integer)
LdaSmi #100
// Subtract the constant loaded from the a2 parameter (i.e. c)
// & store into the accumulator
Sub a2
// Store the value in the accumulator into r0
Star r0
// Read the value of the a1 parameter (b)
// & store into the accumulator
Ldar a1
// Multiply r0 by the accumulator
// & store the result also in the accumulator
Mul r0
// Adds the first parameter a0 (a) into the accumulator
// & stores the result in the accumulator
Add a0
// Return
Return
```
- **Type & Range Analysis**

- **Code Reduction**

- **Typed Lowering Reduction**

#### CPython JIT Optimizing Compiler
[【python】Faster CPython的重要力量 — Specialized Instruction - 码农高天](https://www.youtube.com/watch?v=SNXZPZA8PY8)
## Static Code Analysis
*like linters (e.g., clang-tidy) do*
- **Data-Flow Analysis (DFA):** Analyzes & streamlines value ranges, logic conditions, constraints, variable types, dead code at compile time.
- **Reaching Definitions Analysis:** Determines if the definition "reaches" a particular use of the *variable* or *expression* or can be safely optimized or eliminated.
- **Live Variable Analysis:** Determines if a variable or expression is "live", meaning that its value is still needed for some future computation or can be safely optimized or eliminated.
- **Available Expressions Analysis:** Determines if a particular expression is "available", meaning that its value has already been computed and can be reused (e.g., common subexpression elimination).
- **Constant Propagation Analysis:** Tracks the values of constants and determines the points in the program where a particular constant value is used (e.g., constant folding).
- **Control-Flow Analysis:** Analyzes the goto logic of the program (e.g., control dependency, loop).
:::info
**Recursive Tail Call Optimization**
Recursive function calls without growing the call stack.
- [Tail call - Wiki](https://en.wikipedia.org/wiki/Tail_call)
- [你的递归调用是如何被优化的?尾递归优化! - EfficLab](https://www.youtube.com/watch?v=M6Sm5WplSGQ)
- [Tail call optimization - Exploring ES6](https://exploringjs.com/es6/ch_tail-calls.html)
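A minimal C sketch of the difference; GCC/Clang can apply the optimization to the accumulator form (e.g., `-foptimize-sibling-calls`, enabled at `-O2`), though the C language itself does not guarantee it:
```cpp=
// Plain recursion: the addition happens *after* the recursive call returns,
// so every nested call keeps a stack frame alive.
int sum(int n) { return n == 0 ? 0 : n + sum(n - 1); }

// Tail-recursive (accumulator) form: the call is the very last action,
// so the compiler may reuse the current frame, effectively making a loop.
int sum_acc(int n, int acc) { return n == 0 ? acc : sum_acc(n - 1, acc + n); }
```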
:::
- **Memory Dependency Analysis:** Analyzes the memory access patterns (e.g., instructions, array items).
- ==**Points-To Analysis (Pointer Analysis):**== All compiler analyses and optimizations are limited by the potential effects of assignments through pointers and references.
More:
- :+1: [Lecture Notes: Pointer Analysis - Claire Le Goues - cmu.edu](https://www.cs.cmu.edu/~clegoues/courses/15-819O-16sp/notes/notes07-pointers.pdf)
- [[Book] Pointer Analysis Tutorial - Yannis Smaragdakis - University of Athens](https://yanniss.github.io/points-to-tutorial15.pdf)
:::success
**Procedure & Code Placement (for Spatial Locality):**
Many optimizations aim to reduce executed instructions. Another crucial type concerns memory usage: **programs are often paged in *virtual memory* and exceed the *I-cache size***. Thus, how *procedures* and *Basic Blocks* are positioned in memory is critical due to costly ==page faults== and ==I-cache misses==.
:::
:::info
[**[Paper] Profile Guided Code Positioning - K Pettis, 1990**](https://pages.cs.wisc.edu/~fischer/cs701.f05/code.positioning.pdf)
> Tips: Concentrate the execution of hot spots in the code to improve cache hits!
Demonstrates strategies of *Code Placement Optimization*:
1. **Procedure Positioning:** Try to keep procedures that often call each other as close together in memory as possible.
For example:
```cpp=
/**
* We can use this trick by obtaining dynamic call
* frequencies using a profiling tool like gprof or qpt,
* then create a call graph annotated with call frequencies:
* ┌───────┐
* ┌────┤ A ├────┐
* 4│ └───────┘ │10
* ┌───┴───┐ ┌───┴───┐
* │ B ├─────────┤ C │
* └───┬───┘ 3 └───┬───┘
* 8│ │2
* ┌───┴───┐ ┌───┴───┐
* │ D ├─────────┤ E │
* └───────┘ 1 └───────┘
*
* Then group procedures by call frequency:
* ┌───────┐ ┌───────┐
* │ [B,D] ├───────┤ [A,C] │
* └───┬───┘ 3,4 └───┬───┘
* │ │2
* │ ┌───┴───┐
* └───────────┤ E │
* 1 └───────┘
*
* Here we need to merge [B,D] & [A,C], but who should be
* the adjacent? It turns out (A,B) has the max freq: 4.
* ┌───────────┐ ┌───────┐
* │ [D,B,A,C] ├───────┤ E │
* └───────────┘ 1,2 └───────┘
*
* Finally, (E,C) is greater than (E,D), so the result is:
* ┌─────────────┐
* │ [D,B,A,C,E] │
* └─────────────┘
*/
```
2. **Basic Block Positioning:** Try to place the most frequently executed series of basic blocks "in sequence".
For example:
```cpp=
/**
* Since the error_handler is rarely executed, arranging
* these Basic Blocks in a certain sequence may lead to
* frequent jumps to normal execution.
*/
if (is_error) {
error_handler;
}
...
```
3. **Procedure Splitting:** Place infrequently executed "fluff" in a different memory area than heavily executed code.
:::
:::info
:+1: [**[Paper] Fast and Accurate Flow-Insensitive Points-To Analysis - M Shapiro, 1997**](https://www.cs.utexas.edu/users/pingali/CS380C/2007fa/papers/popl97.pdf)
Addresses techniques for flow-/context-insensitive ==interprocedural analysis== of stack-based storage; **measures the preciseness & space/time complexity of different algorithms such as Andersen's, Steensgaard's, & fine-tuned versions of those two**.

:::
## Optimization of Basic Blocks
Using `IR` to analyze code & streamline the program's logic. This includes:
- **Structure-Preserving Transformations**
- Dead Code Elimination
- Common Subexpression Elimination
- Renaming of Temporary Variables
- Interchanging Independent Adjacent statements
- **Algebraic Transformation**
- Constant Folding
- Copy Propagation
- Variable Propagation
- Constant Propagation
- Strength Reduction
- **Loop Optimization**
- Code motion & Frequency Reduction
- Induction variable elimination
- Loop merging/combining
- Loop Unrolling
[Optimization of Basic Blocks - GeeksforGeeks](https://www.geeksforgeeks.org/optimization-of-basic-blocks/)
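A hand-applied before/after sketch of a few of the transformations above (constant folding, common subexpression elimination & strength reduction); the actual passes operate on the IR shown in the sections below, not on C source:
```cpp=
// Before
int before(int x)
{
    int a = 4 * 8;             // constant expression
    int b = (x + 1) * (x + 1); // (x + 1) computed twice
    int c = x * 2;             // multiply by a power of two
    return a + b + c;
}

// After (conceptually): constant folding, common subexpression
// elimination, & strength reduction applied by hand.
int after(int x)
{
    int t = x + 1;
    return 32 + t * t + (x << 1);
}
```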
```sh
# All IR optimization process output
gcc -c -fdump-tree-all a.c
# Each IR optimization process output
gcc -c -fdump-tree-cfg=cfg.txt a.c
gcc -c -fdump-tree-gimple=gimple.txt a.c
...
```
### Gimple IR
Generates 3-address `IR` (e.g., *t1 = A * B*) from `AST` in GCC.
[GCC Internal Representations - D Novillo, 2007](https://www.airs.com/dnovillo/200711-GCC-Internals/200711-GCC-Internals-3-IR.pdf)
For example:
```gimple=
f (a)
{
unsigned int i.0; char * i.1;
char * D.1140; int D.1141;
...
goto <D1136>;
<D1135>: ...
D.1140 = a + i.1;
D.1141 = g * i;
...
<D1136>:
if (i < n) {goto <D1135>;}
...
}
```
### Tree SSA Passes
[Tree SSA passes - Online Docs - GNU GCC](https://gcc.gnu.org/onlinedocs/gccint/Tree-SSA-passes.html)
e.g., Compiling C code:
```cpp=
int main()
{
int a = 10, b = 20, c = 3000;
int i, result;
for(i = 0; i < 10000; i++) {
result = (a * b + i) / (a * c);
}
return result;
}
```
#### 1. Dividing into Basic Blocks
```cpp=
int main() {
<bb 0> :
a = 10, b = 20, c = 3000, result = UNDEFINED;
<bb 1> :
i = 0;
<bb 2> :
if (i < 10000)
goto <bb 3>;
else
goto <bb 5>;
<bb 3> :
result = (a * b + i) / (a * c);
<bb 4> :
i = i + 1;
goto <bb 2>;
<bb 5> :
return result;
}
```
#### 2. Optimizing Basic Blocks
#### 2-1. Merging Excess-Divided Basic Blocks (No Goto Reference)
```cpp=
int main() {
<bb 0> :
a = 10, b = 20, c = 3000, result = UNDEFINED, i = 0;
<bb 2> :
if (i < 10000)
goto <bb 3>;
else
goto <bb 5>;
<bb 3> :
result = (a * b + i) / (a * c);
i = i + 1;
goto <bb 2>;
<bb 5> :
return result;
}
```
#### 2-2. Reducing Branches if Possible
```cpp=
int main() {
<bb 0> :
a = 10, b = 20, c = 3000, result = UNDEFINED, i = 0;
goto <bb 2>;
<bb 3> :
result = (a * b + i) / (a * c);
i = i + 1;
// No need to use goto, because it is next to it.
//goto <bb 2>;
<bb 2> :
if (i < 10000)
goto <bb 3>;
else
goto <bb 5>;
<bb 5> :
return result;
}
```
#### 3. Turning into 3-Address Code
```cpp=
int main() {
<bb 0> :
a = 10, b = 20, c = 3000, result = UNDEFINED, i = 0;
goto <bb 2>;
<bb 3> :
//result = (a * b + i) / (a * c);
t0 = a * b;
t1 = t0 + i;
t2 = a * c;
result = t1 / t2;
i = i + 1;
<bb 2> :
if (i < 10000)
goto <bb 3>;
else
goto <bb 5>;
<bb 5> :
return result;
}
```
#### 4. Static Single Assignment (SSA) Form
Any time a variable gets changed, assign a new version to that variable, and analyze the relationships of the variable across its different versions.
[SSA Techs - GCC & GNU Toolchain Developers' Summit 2004](http://www.naturalbridge.com/GCC2004Summit.pdf)

#### 4-1. Turning into SSA Form
```cpp=
int main() {
<bb 0> :
a0 = 10, b0 = 20, c0 = 3000, result0 = UNDEFINED, i0 = 0;
goto <bb 2>;
<bb 3> :
t0 = a0 * b0;
t1 = t0 + i1;
t2 = a0 * c0;
result1 = t1 / t2;
i2 = i1 + 1;
<bb 2> :
i1 = Φ (i0, i2); // value assigned from i0 or i2
if (i1 < 10000)
goto <bb 3>;
else
goto <bb 5>;
<bb 5> :
result2 = Φ (result0, result1);
return result2;
}
```
#### 4-2. Constant Propagation & Constant Folding
Turn deterministic & immutable variables into constants.
```cpp=
int main() {
<bb 0> :
a0 = 10, b0 = 20, c0 = 3000, result0 = UNDEFINED, i0 = 0;
goto <bb 2>;
<bb 3> :
// No one else here gonna use a0, b0, c0, changing to constant
t0 = 10 * 20; // a0 * b0
t1 = 200 + i1; // t0 + i1
t2 = 10 * 3000; // a0 * c0
result1 = t1 / 30000; // t1 / t2
i2 = i1 + 1;
<bb 2> :
i1 = Φ (0, i2); // Φ (i0, i2)
if (i1 < 10000)
goto <bb 3>;
else
goto <bb 5>;
<bb 5> :
result2 = Φ (UNDEFINED, result1); // Φ (result0, result1)
return result2;
}
```
#### 4-3. Dead Code & Induction Variable Elimination
After turning variables into constant, some redundant assignments & logics can be merged or eliminated.
```cpp=
int main() {
<bb 0> :
//a0 = 10, b0 = 20, c0 = 3000, result0 = UNDEFINED, i0 = 0;
goto <bb 2>;
<bb 3> :
//t0 = 200;
t1 = 200 + i1;
//t2 = 30000;
result1 = t1 / 30000;
i2 = i1 + 1;
<bb 2> :
i1 = Φ (i0, i2);
if (i1 < 10000)
goto <bb 3>;
else
goto <bb 5>;
<bb 5> :
result2 = Φ (result0, result1);
return result2;
}
```
#### 4-4. Value Range Propagation
Peek and assess any remaining variable's value range in advance of actual execution. This may yield deterministic results based on mathematical equations involving the variables.
```cpp=
/**
* i1 = [0, 10000]
* i2 = [0, 10000]
*
* t1 = [200, 10200]
* result1 = [200 / 30000, 10200 / 30000]
* result2 = 0, [200 / 30000, 10200 / 30000]
* result1 = 0
* result2 = 0
*/
int main() {
<bb 0> :
goto <bb 2>;
<bb 3> :
t1 = 200 + i1;
result1 = 0; // t1 / 30000
i2 = i1 + 1;
<bb 2> :
i1 = Φ (i0, i2);
if (i1 < 10000)
goto <bb 3>;
else
goto <bb 5>;
<bb 5> :
result2 = 0; // Φ (result0, result1) -> Φ (UNDEFINED, 0)
return 0;
}
```
#### 4-5. Control Flow Optimization
#### 4-5-1. Branch Forwarding

:::info
**On-trend research topic.**
Problems that need to be solved:
- Expensive
- May not be correct
- End up with a lot of extra copy statements
- Lose all information attached to SSA variables
- Ranges, value numbers, aliasing information
:::
#### 4-5-2. Loop Optimization
Employing multiple strategies to enhance loop performance, including leveraging cache optimization and parallel processing capabilities.
- **Loop-Invariant Code Motion:** Move non-related-to-looping instructions out of the loop.
- **Loop Unrolling:** Unwrap the loop to get rid of the branch control if the cost is relatively low.
- **Loop Jamming:** Bundle the separate loops into one.
- **Loop Fission:** Separate different array operations from the same loop. This improves locality of reference.
> CPU concentrates at processing each array at single time & context.
- **Loop Interchange:** Adjust the access pattern of a multi-dimensional array to best fit locality of reference in memory & cache (see the sketch below).
More:
- [Code Motion Trick in Python - b001](https://www.youtube.com/shorts/uWEIaF0PNGg)
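A concrete sketch of *loop interchange* on a row-major C array (an illustration; the compiler applies this on the IR when it can prove the swap is safe):
```cpp=
#define N 1024
static double a[N][N];

// Before: the inner loop strides down a column of the row-major array,
// touching a different cache line on (almost) every access.
void col_order(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] += 1.0;
}

// After interchange: the inner loop walks along a row, so consecutive
// accesses stay within the same cache lines.
void row_order(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] += 1.0;
}
```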
:::warning
**Loop-Invariant Code Motion & Pointer Aliasing Problem: Will the Compiler Optimization Lead to an Unintended Behaviour?**
Example:
```cpp=
void func(int *pa, int *pb, int *pc, int d) {
while (*pc < d) {
int i = *pa + *pb; // Can this line be moved out of the loop?
*pc += i;
}
}
```
It turns out that `*pa`, `*pb`, and `i` seem to be independent of the operation in the loop & can be moved out of it, but what if `*pc` aliases `*pa` or `*pb`?
> Dependency of the parameters (input) must be considered!
To guarantee these are independent pointers:
- **[Choice 1]** Rearrange program logic by our own.
- **[Choice 2]** Use "Restricting Pointer Aliasing": the `__restrict__` keyword in GCC.
:::
#### 5. Result
```sh=
gcc -c -fdump-tree-optimized=cfg.txt a.c
```
```cpp=
int main() {
return 0;
}
```
## Optimization of Register Transfer Language (RTL)
Features:
- Uses *S-Expression* (same as `LISP`)
- Implements GCC built-in & architecture-defined operations
- Peephole optimization
:::success
**Peephole/Window Optimization**
Scanning over the current `RTL` logic, checking whether any adjacent `RTL` logic can be bundled & optimized.
> Find the best operations for the equivalent logic.

For example:
- **Remove redundant logic**
- Remove `push $rax; pop $rax`
- From `x = 5; y = x; z = y` to `z = 5` (easily implement with SSA form)
- **Optimize arithmetic operations**
- From multiplying a floating point register by 8, to adding 3 to the floating point register's exponent.
- **Replace slow instructions with faster ones**
- From `mul $rax 2` to `add $rax $rax`
- From `cmp $eax 0` to `test $eax $eax`
:::
### RTL Passes
[RTL Generation and Optimization Passes Overview - Online Docs - GNU GCC](https://gcc.gnu.org/onlinedocs/gccint/RTL-passes.html)
#### Control Flow Graph Cleanup (used to be Jump Optimization Pass)
Removes unreachable code, simplifies jumps to next, jumps to jump, jumps across jumps, ...
#### Forward Propagation of Single-Def Values
Removes redundant computation & simplifies the results by substituting variables that come from a single definition by performing *Copy Propagation* and *Optimal Addressing Mode Selection*.
- **Global Constant Propagation:** Propagating constant values across the code to reduce unnecessary computations.
- **Copy Propagation:** Replacing redundant copies with references to the original source.
- **Addressing Mode Selection (AMS):** Selecting the optimal addressing mode for addressing registers in the input program.
More:
- [Addressing mode selection - E. Eckstein, 2003](https://dl.acm.org/doi/abs/10.5555/776261.776298)
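A before/after sketch of what copy & constant propagation achieve on straight-line code (conceptual; the pass itself operates on RTL, and the exact result depends on which passes have already run):
```cpp=
// Before: y is just a copy of x, z only uses the copy.
int before(void)
{
    int x = 5;
    int y = x;
    int z = y + 1;
    return z;
}

// After copy & constant propagation, the copies collapse & fold away.
int after(void)
{
    return 6;
}
```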
#### Common Subexpression Elimination (CSE)
Removes redundant computation within basic blocks, and optimizes addressing modes based on cost.
#### Global CSE (GCSE) & Partial Redundancy Elimination (PRE)

:::success
**Methods:**
- **Classic GCSE:** Will only eliminate the computation of "exp1" in block B3 since the evaluation in block B6 can be reached via a path which does not have an earlier computation of "exp1" (e.g., `entry -> B1 -> B7 -> B8 -> B5 -> B6`). **Less code size compared to *PRE*.**
- **PRE:** Will eliminate the evaluations of "exp1" in blocks B3 & B6; to make the evaluation in B6 redundant, PRE will insert an evaluation of "exp1" in block B8. **More powerful optimization that will tend to increase code size to improve runtime performance.**
**Tools:**
- **Lazy Code Motion Optimizer Framework (LCM) based GCSE:** May increase code size for a gain in speed.
- Performs loop invariant code motion (moving invariant computations outside loops).
- Includes load and store motion optimizations to improve performance.
- **Morel-Renvoise Partial Redundancy Elimination (MR PRE):** Saves relatively more code size.
- Does not move invariants out of loops; this task is handled by the *Loop Optimization* pass separately.
- Includes code hoisting (unification) and load motion as part of the process.
:::
> Global Constant Propagation & Copy Propagation & GCSE are enabled by default with `-O2` optimization flag or with `-fgcse` flag.
Refs:
- [Global CSE/Partial Redundancy Elimination - News - GNU GCC](https://gcc.gnu.org/news/gcse.html)
- [Lazy Code Motion Optimizer Framework - News - GNU GCC](https://gcc.gnu.org/news/lcm.html)
#### Loop Optimization
Performs several loop related optimizations.
> The source files cfgloopanal.cc and cfgloopmanip.cc contain generic loop analysis and manipulation code. Initialization and finalization of loop structures is handled by loop-init.cc. A loop invariant motion pass is implemented in loop-invariant.cc. Basic block level optimizations—unrolling, and peeling loops— are implemented in loop-unroll.cc. Replacing of the exit condition of loops by special machine-dependent instructions is handled by loop-doloop.cc.
#### Jump Bypassing
An aggressive form of *GCSE* that transforms the control flow graph of a function by propagating constants into conditional branch instructions.
#### If Conversion
Attempts to replace conditional branches and surrounding assignments with arithmetic, boolean value producing comparison instructions, and conditional move instructions.
> In the very last invocation after reload/LRA, it will generate predicated instructions when supported by the target.
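A small case where if-conversion can apply (a sketch; whether the branch actually becomes a conditional move depends on the target & the cost model):
```cpp=
int max(int a, int b)
{
    if (a > b)      // if-conversion can turn this branch + returns into a
        return a;   // branchless select (e.g., cmov on x86) when profitable
    return b;
}
```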
#### Web Construction
Splits independent uses of each pseudo-register. This can improve the effect of other transformations, such as CSE or register allocation.
#### Instruction Combination
Attempts to combine groups of two or three instructions that are related by data flow into single instructions. It combines the RTL expressions for the instructions by substitution, simplifies the result using algebra, and then attempts to match the result against the machine description.
#### Mode Switching Optimization
Looks for instructions that require the processor to be in a specific "mode" and minimizes the number of mode changes required to satisfy all users. What these modes are, and what they apply to are completely target-specific.
#### Modulo Scheduling
Looks at innermost loops and reorders their instructions by overlapping different iterations. Modulo scheduling is performed immediately before instruction scheduling.
#### Instruction Scheduling
Applying scheduling strategies to mitigate the effects of **instruction stalling** caused by resource dependencies on the CPU.

> This pass is performed before and after *Register Allocation*. The code for this pass is located in haifa-sched.cc, sched-deps.cc, sched-ebb.cc, sched-rgn.cc & sched-vis.c.
#### Register Allocation
Makes sure that all occurrences of pseudo registers are eliminated, either by allocating them to a hard register, replacing them by an equivalent expression (e.g. a constant) or by placing them on the stack.
Features:
- Register allocation makes scheduling harder by creating extra dependencies between instructions.
- Less aggressive register allocation may be desirable.
- Some processors allocate and schedule dynamically.
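A small illustration of the first point (hypothetical names): once two independent values share one hard register, the second definition must wait for the last use of the first, which constrains the scheduler.
```c=
int ra_dependence(const int *a, const int *b) {
    int x = a[0] + 1;   /* value 1                                           */
    int r1 = x * 3;
    int y = b[0] + 2;   /* value 2, independent of value 1                   */
    int r2 = y * 5;
    /* if the allocator assigns x and y to the same hard register, the write
       of y gains an anti-dependence on the last read of x, so the scheduler
       can no longer freely interleave the two computations                  */
    return r1 + r2;
}
```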
:::success
**Integrated register allocator (IRA):**
Integrates:
- Coalescing (register aliasing)
- Live range splitting (when two registers' live ranges conflict)
- Spill/restore code moving (to/from the stack)
- Hard register preferencing
into the ==coloring process==. This approach improves efficiency and compatibility with the *Reload/LRA Pass*.
Pseudo-registers spilled during allocation or reload/LRA can reclaim hard registers if others are evicted. **The allocator prioritizes optimal pseudos for spilling based on their live ranges and consolidates stack slots for spilled pseudos.**
*IRA* acts as a regional register allocator, switching to a ==Chaitin-Briggs allocator== (*a bottom-up graph coloring register allocation algorithm*) within a single region.

> By default, IRA selects regions based on register pressure, but users can enforce single-region or loop-specific allocations.
**Reload Pass:**
Transforms non-strict `RTL IR` into strict `RTL IR` by imposing instruction constraints and renumbering pseudo registers with target machine register numbers. **Pseudo registers without allocated hard registers are substituted with stack slots.**
Then it finds instructions that are invalid because a value has failed to end up in a register, or has ended up in a register of the wrong kind. It fixes up these instructions by temporarily reloading the problematic values into registers. Additional instructions are generated to do the copying.
The reload pass also optionally eliminates the frame pointer and inserts instructions to save and restore call-clobbered registers around calls.
> **"Reload is the GCC equivalent of Satan."**
>
> Reload does everything, and probably no one exactly knows how much that is. But to give you some idea:
> - Spill code generation
> - Instruction/register constraint validation
> - Constant pool building
> - Turning non-strict RTL into strict RTL (doing more of the above in evil ways)
> - Register elimination--changing frame pointer references to stack pointer references
> - Reload inheritance--essentially a builtin CSE pass on spill code
>
> -- [Reload - GCC Wiki](https://gcc.gnu.org/wiki/reload)
**Local Register Allocation (LRA):**
A modern replacement for the reload pass. Unlike the reload pass, *intermediate LRA* decisions are reflected in RTL as much as possible (doing more work on separating each module to reduce the number of target-dependent macros and hooks), leaving instruction constraints as the primary source of control.
[[Slides] Local Register Allocator Project - Vladimir Makarov](https://vmakarov.fedorapeople.org/LRA.html)

- **Hard Register:** Target-machine's register
- **Virtual Register:** Argument pointer, Frame pointer, Stack pointer, & other specialized registers (if not available, use [$SP + offset] instead).
- **Pseudo Register:** An unlimited (∞) number of pseudo registers can be manipulated by the compiler.
:::
Refs:
- [The Integrated Register Allocator Ira For Gcc Doc - KITO's Lab](http://kito.wikidot.com/the-integrated-register-allocator-ira-for-gcc-doc)
- [Register Allocation in GCC : Reload / LRA - Kito's Lab](https://kitoslab.blogspot.com/2012/01/register-allocation-in-gcc-reload-lra.html)
- [[Slides] Design and Implementation of GCC Register Allocation - Kito Cheng](https://docs.huihoo.com/infoq/hellogcc-design-and-Implementation-of-gcc-register-allocation-20140916.pdf)
:::info
[**[Paper] On Local Register Allocation - M Farach-Colton, 2000**](https://dl.acm.org/doi/pdf/10.5555/314613.314848)
Demonstrates:
- *Local Register Allocation* problem is **NP-hard**.
- Variant of the furthest-first heuristic achieves a good approximation ratio.
- **2-approximation algorithm** for LRA.
:::
:::info
[**[Paper] Integrated Register Allocation and Instruction Scheduling with Constraint Programming - RC Lozano, 2014**](https://robcasloz.github.io/publications/TRITA-ICT-ECS-AVH-14-13.pdf)
The goal is to devise a combinatorial approach for integrated register allocation and instruction scheduling that is practical and hence usable in a modern compiler. Such an approach must capture all main subtasks of register allocation and instruction scheduling while scaling to programs of realistic size. Additionally, to be practical a combinatorial approach must be robust, that is, must deliver consistent code quality and compilation times across a significant range of programs.
- Traditional Code Gen:

- Combinatorial Code Gen:

- Constraint-Based Code Gen:

:::
#### Basic Block Reordering
Implements profile-guided code positioning. If profile information is not available, various types of static analysis are performed to make the predictions normally coming from profile feedback (i.e., execution frequency, branch probability, etc.).
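When no profile is available, branch annotations such as GCC's `__builtin_expect` are one source of the static probabilities this pass can consume; a hypothetical example:
```c=
int hot_path(int x) {
    if (__builtin_expect(x < 0, 0)) {   /* annotated as unlikely            */
        /* cold error path: block reordering can move this block out of the
           fall-through path (e.g. to the end of the function)              */
        return -1;
    }
    return x * 2;                       /* likely path stays fall-through   */
}
```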
#### Variable Tracking
Computes where the variables are stored at each position in code and generates notes describing the variable locations in the RTL code. Location lists are then generated from these notes and emitted into the debug information, if the debugging information format supports location lists.
#### Delayed Branch Scheduling (Optional)
Attempts to find instructions that can go into the delay slots of other instructions, usually jumps and calls.
#### Branch Shortening
On RISC machines, **branch instructions often have restricted ranges**, necessitating longer instruction sequences for longer branches. This compiler pass calculates the distance between instructions to determine whether standard branch instructions or longer sequences are required for each branch.
#### Register-To-Stack Conversion
Conversion from usage of some hard registers to usage of a register stack may be done at this point.
:::warning
Currently, this is supported only for the floating-point registers of the Intel 80387 coprocessor.
:::
#### Final
Emits the target assembly code for the function.
> How does the compiler know which instruction to generate? How can extensive instruction support be implemented?
:::success
**RTL Template:**
Used to define which insns match the particular pattern and how to find their operands. For named patterns, the RTL template also says how to construct an insn from specified operands.
For example:
- source.RTL_IR
```lisp=
; b = a - 56
(set (reg:SI 60 [b])
     (plus:SI (reg:SI 61 [a])
              (const_int -56 [0xffffffc8])))
```
- MIPS.md (machine definition file)
```lisp=
(define_insn "addsi3"
  ; matching pattern
  [(set (match_operand:SI 0 "general_operand" "=r,m")
        (plus:SI (match_operand:SI 1 "general_operand" "0,0")   ; addu
                 (match_operand:SI 2 "general_operand" "g,r")   ; addiu
        ))]
  "TARGET_MIPS16"
  "@
   addu  %0, %1, %2
   addiu %0, %1, %2"
)
```
Refs:
- [RTL Template - GNU GCC](https://gcc.gnu.org/onlinedocs/gccint/RTL-Template.html)
- [Target assembly code emission - GCC Resource Center](https://www.cse.iitb.ac.in/grc/intdocs/gcc-implementation-details.html#Target-assembly-code-emission)
:::
#### Debugging Information Output
Run after final because it must output the stack slot offsets for pseudo registers that did not get hard registers.
> Source files are dwarfout.c for DWARF symbol table format, files dwarf2out.cc and dwarf2asm.cc for DWARF2 symbol table format, and vmsdbgout.cc for VMS debug symbol table format.
:::warning
*DWARF* is a debugging standard used by **GDB**.
More:
- [DWARF Debugging Information Format - DWARF](https://dwarfstd.org/)
- [DWARF for WebAssembly Files - Yury Delendik](https://yurydelendik.github.io/webassembly-dwarf/)

:::