編譯系統 Compiler System

###### tags: `Compiler` # 編譯系統 Compiler System [TOC] ## Lecture-1 Overview ### 1. Overview of Compiler ![](https://i.imgur.com/zlsAmd2.png) ![](https://i.imgur.com/xBvBMJ7.png) ![](https://i.imgur.com/cEHGujP.png) - Front-end： - semantic analyzer 之前（包括） - Back-end： - Intermediate code generator 之後（包括），兩個部分依靠 AST 相連 ### 2. Compiler vs Interpreter - Interpreter 可以直接根據輸入的 program 進行執行，不需要 translation - Compiler 需要經過兩個階段 1. compilation parse: 產生 source program 2. execution parse : 執行 source program ### 3. Semantic error vs Syntax error - Syntax error: - compiler 或 interpreter ***cannot understand the source code*** - Semantic error: - writing ***invalid(無效的) program logic*** that 產生錯誤的結果 ## Lecture-2 Simple Compiler ### 1. Architecture of Compiler ![](https://i.imgur.com/eYQbiCN.png) ## Lecture-3 Scanning ### 0. Overview >A scanner transforms a character stream of source program into a token stream – Also called a lexical analyzer or lexer ![](https://i.imgur.com/iLCqT7A.png) ### 1. Regular Expression (RE) The three essential operations for RE. Let r1 and r2 be REs 1. Catenation. $r1r2$ denotes $r1$ is followed by $r2$ 2. Alternation. $r1 | r2$ denotes either $r1$ or $r2$ 3. Kleene closure. $r1 *$ denotes zero or more occurrences of $r1$ - the symbol ∈ represents set membership `E.g., if s1 ∈ P and s2 ∈ Q, then string s1s2 ∈ (P Q)` `The string s ∈ (P|Q) if, and only if, s ∈ P or s ∈ Q` `The regular expression a|b denotes {a, b}` `(a|b)* denotes the set of all strings of a’s and b’s, also is (a*b*)*` ### 2. Finite Automata >consists of: – set of **states** – **vocabulary**, denoted Σ – **transitions** from one state to another, labeled with characters in Σ – the **start state** – the accepting, or **final states** ![](https://i.imgur.com/sB222aI.png) ![](https://i.imgur.com/hWgD0Ay.png) ![](https://i.imgur.com/EFE7wVQ.png) ### 3. RE to NFA - rule ![](https://i.imgur.com/Qea0Olx.png) - example ![](https://i.imgur.com/7Igw3ps.png) ### 4. NFA to DFA ![](https://i.imgur.com/oTEbOjm.png) ## Lecture-4 Grammar and Parsing ### 0. preliminary - ***Grammar*** is A structured sentence can be diagrammed ![](https://i.imgur.com/R1KIIxP.png) - ***sentential form*** may contain both terminals and nonterminals - ***sentence*** is a sentential form with no nonterminals - ***Parser*** ![](https://i.imgur.com/b81GM6P.png) - leftmost derivation 指從最左邊開始解讀字串 ``` αAβ ⇒ αζβ – is leftmost derivation, if α does not contain a nonterminal – is rightmost derivation, if β does not contain a nonterminal ``` ### 1. Rule Derivations ![](https://i.imgur.com/rq6kaLR.png) ### 2. Context free grammar ![](https://i.imgur.com/3wAw2qo.png) ![](https://i.imgur.com/Nog0fI3.png) - **Reduced Grammar** : include useless symbols - **Faulty Language** : do not belong in the language ### 3. Ambiguity >A grammar that produces more than one parse tree for some sentence is said to be ambiguous ![](https://i.imgur.com/LLzYg1F.png) ![](https://i.imgur.com/s9CNYkL.png) - 兩個方式可以解決 Ambiguity :::info $E → E + E | E * E | ( E ) | id$ ::: 1. Enforce precedence - gives + lower precedence than * (as the most top figure) - more distant from the starting symbol, higher precedence 2. Rewrite grammar $E → E + T | T$ $T → T * F | F$ $F → ( E ) |id$ - left-associativity 變數都留在左邊 $Exp → Exp - Num | Num$ ![](https://i.imgur.com/u4w7vAK.png) - Backus-Naur Form >formal grammar for expressing context-free grammars BNF extends the grammar notation defined above, with syntax for defining **optional** and **repeated symbols** - Optional Symbols $A→α [ X1 . . .Xn ] β$ 表示 X1 . . .Xn 有一次或都沒有 - Repeated Symbols $B→γ \{X1 . . .Xm\} δ$ 表示 X1 . . .Xm 可以有多次或都沒有 ### 4. Two Parsing Approaches ![](https://i.imgur.com/FSeMniJ.png) - Top-down parsers 順序 : 中左右 ![](https://i.imgur.com/S2jHvuO.png) - Bottom-up parsers 順序 : 左右中 ![](https://i.imgur.com/7U0d47V.png) ### 5. First and Follow > two important ***functions*** for the construction of ***top-down*** and ***bottom-up*** parsers - way ![](https://i.imgur.com/I0oY0rD.png) ![](https://i.imgur.com/gCF3hO5.png) - example 1 ![](https://i.imgur.com/ixkooyy.png) - example 2 ![](https://i.imgur.com/v6RJvyF.png) ![](https://i.imgur.com/g0becU7.png) ## Lecture-5 Top-Down parsing ### 0. preliminary 建表 ![](https://i.imgur.com/upAF2gB.png) ### 1. Mutual recursive - 現在位於一個 nonterminal 上 - 根據下一個 token 決定使用哪一個 rule，導致前往不同的 nonterminal - 所以不同 nonterminal 的 predict set 內的 token 不能相同，否則就沒辦法判斷到底要使用哪一個 rule（同時對於生成兩個 nonterminal 都合理） - ![](https://i.imgur.com/ozK4Q04.png) - ![](https://i.imgur.com/BkrP06j.png) ### 2. Table Driven - Mutual recursive 只要遇到程式的 grammar 很複雜，就比較花時間和資源，所以改用 Table Driven - non-recursive parser - ![](https://i.imgur.com/lyP0oxi.png) - ![](https://i.imgur.com/FgJSloM.png) - ![](https://i.imgur.com/i6mHWI8.png) ### 3. 解決 common prefix - ![](https://i.imgur.com/dorDh54.png) ### 4. 解決 left recursion - ![](https://i.imgur.com/MAJXkxV.png) ## Lecture-6 Bottom-Up pasrsing ### 1. Architecture >The parser works its way from the terminal symbols to the grammar’s goal symbol. - bottom-up parsers can handle the largest class of grammars that allow parsing to proceed deterministically. - p.s. 他可以處理 common prefix，而不需要重設計 grammar - Botton Up Parsing 的順序剛好跟 Top down Parsing 相反 ![](https://i.imgur.com/sdP9lAc.png) - Rightmost 的 derivation (右上半)，順序剛好跟下方的 Bottom Up Parsing 相反 ![](https://i.imgur.com/Ui6nuUA.png) - **Shift-reduce** 兩個最主要執行 bottom up 的方法， - shift symbols onto the parse stack - reduce 在 stack 上的 symbols string 成 one of the grammar’s nonterminals. - **LR(k)** The bottom-up parsers scan the input from the left (the “L” in LR) producing a rightmost derivation (the “R” in LR) in reverse, using k symbols of lookahead ![](https://i.imgur.com/rPZAEra.png) ![](https://i.imgur.com/iMW8Ygz.png) - An LR parser makes shift-reduce decisions - by maintaining states to keep track of where we are in a parse - States represent sets of items :::info - 版本1 ![](https://i.imgur.com/sUf06AZ.png) ![](https://i.imgur.com/nY0TcGf.png) - 這裡的 State 5 收到 State 6 reduced 的 E 會回到 State 1，這是因為他還要考慮到 stack 目前的狀況 ::: :::success - 版本2-1 ![](https://i.imgur.com/OsTEJyQ.png) - ***`LR`*** **state 1, 6** 會出現 **conflict** - ***`SLR`*** 可以解決 **conflict** ![](https://i.imgur.com/CJ9SvjV.png) - **state 1** 此時就要觀察 ***`reduce C => A => follow A => B => b`***，故當後面是 b 時，要 ***reduce*** - **state 6** 此時就也是要觀察 ***`reduce C ......`*** - **其他沒有 confict 的state** 也可以根據 follow 是誰而加在正確的黃色位置 ::: :::success - 版本2.2 ![](https://i.imgur.com/0JbuGnj.png) ![](https://i.imgur.com/qf8lRvP.png) ![](https://i.imgur.com/cV83SDO.png) ::: ### 2. Construct Function - **`AdvanceDot(s, X)`** 找出可以從 s 經過 X 到達的 item set ![](https://i.imgur.com/SoPeYgo.png) ![](https://i.imgur.com/Yo7takw.png) - **`ComputeGoto(States, s)`** is responsible for computing all possible states reachable from s for ***all*** the grammar symbol X 給一個 state，每個跑過每個 X, 負責**建表**，ADDSTATE 負責查看是否已經有 state，有就找到回傳，沒有就建一個新的(會用到 AdvanceDot) > ![](https://i.imgur.com/2ZtJFwe.png) - **`ComputeLR0(Grammar)`** 給一個 grmmar，跑過每個 state 去建表(會使用到 ComputeGoto) - **`CompleteTable(Table, Grammar)`** 負責把 reduce 的地方加上去 ### 3. Conflict #### (1) Ambiguous 解法就是使用 ~~人工~~ 工人智慧。 - 原本： ![](https://i.imgur.com/NbC5G2o.png) - 後來： ![](https://i.imgur.com/zcJl6aX.png) #### (2) 非 Ambiguous - **解法一** 使用 `LR(k)`，調整 lookahead 的 k，確定哪一個步驟才是正確的無法解決： - ![](https://i.imgur.com/u9tkUcE.png) - k 沒辦法變成無窮大去判斷最後一個 token 是 a 還是 b - 要改用更 powerful 的 grammar :::info `LR(0)` 的其他缺點 - ![](https://i.imgur.com/dCSKzWX.png) - 在某一個 state，沒辦法根據 follow 去決定要做什麼 - action，要 reduce 就要全部情況都 reduce ::: - **解法二** 更 powerful 的演算法 ![](https://i.imgur.com/7EEUhuH.png) :::info `SLR(1)` - 遇到 ambiguous (不知道要選 reduce 還是 shift)時，check ***reduce 的 follow*** 有沒有包含下一個，如果有就選擇 reduce - *SLR(k): A handle (Right Hand Side) should NOT be reduced to N,if the look ahead token is NOT in Followk(N)* ![](https://i.imgur.com/29ZDO49.png) ::: ## Lecture-7 Intermediate Representations and Runtime Support ### 1. Architecture ![](https://i.imgur.com/EhDLKWX.png) ![](https://i.imgur.com/n3Vp75H.png) ![](https://i.imgur.com/rz64jra.png) - ***Front end*** : parses source code, 確認有沒有 errors、建立 language-specific Abstract Syntax Tree (AST) 去表示 input code - ***Optimizer*** : 負責 doing a broad variety of transformations to try to improve the code's performance, 多多少少 independent of language and target - ***Back-end*** : also known as the code generator(machine code), then maps the intermediate code onto the target instruction set - **Java Virtual Machine (JVM)** is also an implementation of this model, 吃入 **`Java bytecode`**，產生出 **`Machine Code`** - **LLVM** 也是一種 optimizer - Three Phases Design 跟 classical design 最大的差別就是可以支持 **multiple source languages or target architectures** - 中間的 optimizer 使用的語言是 common code representation，所以不管 input 是什麼，都可以 compile to it，output 也是可以被 compiler from it. :::info - **JVM** 就是使用 Java bytecode ![](https://i.imgur.com/w79ZER2.png) - **LLVM** 就是使用 LLVM IR ![](https://i.imgur.com/hlZnJTD.png) ::: ### 2. LLVM >LLVM is designed as a set of libraries >LLVM is an infrastructure, a collection of useful compiler technology that can be brought to bear on specific problems - It 讀進 LLVM IR, chews on it a bit, then 釋放 LLVM IR which hopefully will 執行的快一點 - 它的結構為 **a pipeline of distinct optimization passes**，一個 pass 指的就是一個簡單的功能或是任務 :::info - For example, **-O3** optimizer 裡面就有 67 個 passes ::: - ***Link-Time Optimization*** : 他也可以利用 Linker 對先 compile 完的 .o file 做到**跨檔案**的優化，而不只是一個檔案優化而已 ![](https://i.imgur.com/XsQGk2h.png) - ***Install-time optimization (ITO)***: 當最後要轉成 **`Machine code`** 的時候，也可以進行優化，就是找出 the specifics of the device you're targeting ![](https://i.imgur.com/Dv1FOXJ.png) ### 3. JVM - Key Features of JVM - (1) Compactness 表示指令為 zero-address form : 就是 instruction 裡面沒有列出現 register，而是直接用 stack 的順序取 register ``` Example: Executing the iadd instruction, it 1. pops two items off the stack and 2. pushes their sum onto the TOS – 如此一來這個 operation 只會用到一個 byte，因為 instruction’s operands 就直接從 stack 拿 ``` - (2) Safety - 卻也可以 reference 一個指令的 storage - 只要這個 storage type 是被這個 instruction allowed - 位置也能被合法 access - safety errors 就可以在 executing code 前就被抓到 - 在一個 ***pure zero-address form*** (JVM 不是), a register load 被完成的步驟為 - a) computing a register number that is pushed on TOS - b) pops the register number from the stack - c) accesses the register’s contents - d) pushes the contents on TOS :::info - 但會發生 security 問題，因為 load instruction 這樣只會在 runtime 的時候才能知道究竟要 access 到哪個實際的位置 - security 有問題 -> JVM 不是用 pure zero-address form，JVM 的 load instruction 會指定 register number ::: - Architecture ![](https://i.imgur.com/zdEAKK3.png) - (1) ***class loader*** : loading types (classes and interfaces) given fully qualified names - (2) ***execution engine*** : executing the instructions contained in the methods of loaded classes - (3) ***runtime data areas*** : organizes the memory it needs to execute a program - 可以是每個 threads 共用 - 也可以是 private to individual threads - 但每個 thread 都會有自己的 - pc register - Java stack ![](https://i.imgur.com/scNNt4d.png) - A native method is a Java method 但不是用 java code 所寫 - ***class file*** Java Virtual Machine interprets class files 左半邊除了 constant pool 其實都是 ***`pointer`***，指向右邊的 ***`constant pool`***，object oriented design ![](https://i.imgur.com/u4NBnkp.png) ![](https://i.imgur.com/nS5EZDn.png) 複雜版 ![](https://i.imgur.com/vEivDZT.png) 如果此時往 Constant Pool 裡面的 node 一看，裡面有 type，後面可能接**值**是裝多少，或是接一個 **pointer** 指到另外一個 pool element ![](https://i.imgur.com/h2TKULP.png) - Field >is an instance or a class level variable (property) of the class or interface - Methods >The Methods component host the methods - Attributes >Attribute section contains several attributes about the fields, methods, and class ![](https://i.imgur.com/VNGPumj.png) ## Lecture-8 Code Analysis & Optimization ### 1. Architecture ![](https://i.imgur.com/Jri1qAo.png) >IR 進 IR 出每個 IR 也都有屬於自己的架構，就是另外一種語言所以 Optimization 就是 code transformations - 優點就可能包含 1. ***Space*** 2. ***Time*** 3. ***Energy*** - improve execution time 跟 improve space 通常就是 trade off ### 2. Many possible Optimizations - ***Function inlining*** 內嵌函數 > Replace a function call with the function body ```c int g (int x) { return f(x) – 1; } int f (int n) { int b = 1; while (n--) { b = 2*b; } return b; } ``` change to ```c int g (int x) { int r; int n = x; { int b = 1; while (n--) { b = 2*b; } r = b; } return r -1; } ``` - ***Function cloning*** > when the compiler makes a second copy of a function with some modification - 舉例來說如果 compiler 發現一個 function 被呼叫非常多次 with 相同的 initial parameter, it may clone the function to produce a version which takes one less parameter, and then change all the callers to call the clone. ```clike void f (int x[], int n, int m) { for (int i = 0; i<n; i++) { x[i] = x[i] + i * m; } } ``` For a version of `f(a, 10, 1)`, change to ```clike void f’ (int x[]) { for (int i = 0; i<10; i++) { x[i] = x[i] + i; } } ``` - ***Constant folding*** 常數摺疊 > Transform the computation expression into the corresponding constant - $x = 1.1 * 2; => x = 2.2;$ - ***Constant propagation*** > Replace use of variable with constant ```c Example – n = 10 – c = 2 – for (i=0; i<n; i++) { s = s+i*c; } • Replace n, c: – for (i=0; i<10; i++) { s = s+i*2; } ``` - ***Algebraic Simplification*** > take advantage of usual simplification rules ```c a * 1 => a a / 1 => a b || false => b a * 0 => 0 a + 0 => a b || true => b ``` ```c (y*1+0) / 1 => y*1+0 => y*1 => y ``` - ***Copy Propagation*** > Replace uses of x with y, after seeing assignment x = y ```c x = y; if (x > 1) s = x * f(x - 1); ``` change to ```c x = y; if (y > 1) s = y * f(y - 1); ``` - ***Common Subexpression Elimination*** > Reuse the computed value instead of computing the same expression multiple times ```clike a = b + c; c = b + c; d = b + c; ``` change to ```clike a = b + c; c = a; d = b + c; ``` 這裡 **`第三行`** 不能代換，因為 c 此時已經在 **`第二行`** 改變了！ - ***Unreachable Code Elimination*** ```clike if (false) S; => ; if (true) S; else S’; => S; if (false) S; else S’; => S’; while (false) S; => ; while(2>3) S; => ; ``` - ***Dead code elimination*** > 排除掉 **沒有實際用處** 的那行指令，跟 Unreachable 不相同 ```clike x = y + 1; y = 1; x = 2 * z; ``` change to ```clike y = 1; x = 2 * z; ``` - ***Loop-invariant code motion*** > 最重要，因為是 program hot spots - ***Strength Reduction*** > Replaces expensive operations by cheap ones，如將乘法改成加法、將次方改成左右移 ```clike s = 0; for (i=0; i<n; i++) { v = 4 * i; s = s + v; … } ``` change to ```clike s = 0; v = -4; for (i=0; i<n; i++) { v = v + 4; s = s + v; … } ``` --- ```clike x * 2 i * 2^c i / 2^c ``` change to ```clike x + x i << c i >> c ``` - ***Induction Variable Elimination*** > 減少 variable ```clike s = 0; v = -4; for (i=0; i<n; i++) { v = v + 4; s = s + v; … } ``` change to ```clike s = 0; v = -4; for ( ; v<(4*n-4); ) { v = v + 4; s = s + v; … } ``` - Loop Unrolling > Execute loop body multiple times at each iteration，這樣可以減少時間，但會增加空間用量 ```c /* Copy 20 elements */ for (i=0; i<20; i++) { a[i] = b[i]; } ``` change to ```c /* Unrolled four times */ for (i=0; i<20; i+=4) { a[i] = b[i]; a[i+1] = b[i+1]; a[i+2] = b[i+2]; a[i+3] = b[i+3]; } ```