###### tags: `Compiler`
# 編譯系統 Compiler System
[TOC]
## Lecture-1 Overview
### 1. Overview of Compiler



- Front-end:
- semantic analyzer 之前(包括)
- Back-end:
- Intermediate code generator 之後(包括),兩個部分依靠 AST 相連
### 2. Compiler vs Interpreter
- Interpreter 可以直接根據輸入的 program 進行執行,不需要 translation
- Compiler 需要經過兩個階段
1. compilation parse: 產生 source program
2. execution parse : 執行 source program
### 3. Semantic error vs Syntax error
- Syntax error:
- compiler 或 interpreter ***cannot understand the source code***
- Semantic error:
- writing ***invalid(無效的) program logic*** that 產生錯誤的結果
## Lecture-2 Simple Compiler
### 1. Architecture of Compiler

## Lecture-3 Scanning
### 0. Overview
>A scanner transforms a character stream of source program
into a token stream
– Also called a lexical analyzer or lexer

### 1. Regular Expression (RE)
The three essential operations for RE.
Let r1 and r2 be REs
1. Catenation. $r1r2$ denotes $r1$ is followed by $r2$
2. Alternation. $r1 | r2$ denotes either $r1$ or $r2$
3. Kleene closure. $r1 *$ denotes zero or more occurrences of $r1$
- the symbol ∈ represents set membership
`E.g., if s1 ∈ P and s2 ∈ Q, then string s1s2 ∈ (P Q)`
`The string s ∈ (P|Q) if, and only if, s ∈ P or s ∈ Q`
`The regular expression a|b denotes {a, b}`
`(a|b)* denotes the set of all strings of a’s and b’s, also is (a*b*)*`
### 2. Finite Automata
>consists of:
– set of **states**
– **vocabulary**, denoted Σ
– **transitions** from one state to another, labeled with characters in Σ
– the **start state**
– the accepting, or **final states**



### 3. RE to NFA
- rule

- example

### 4. NFA to DFA

## Lecture-4 Grammar and Parsing
### 0. preliminary
- ***Grammar*** is A structured sentence can be diagrammed

- ***sentential form*** may contain both terminals and nonterminals
- ***sentence*** is a sentential form with no nonterminals
- ***Parser***

- leftmost derivation 指從最左邊開始解讀字串
```
αAβ ⇒ αζβ
– is leftmost derivation, if α does not contain a nonterminal
– is rightmost derivation, if β does not contain a nonterminal
```
### 1. Rule Derivations

### 2. Context free grammar


- **Reduced Grammar** : include useless symbols
- **Faulty Language** : do not belong in the language
### 3. Ambiguity
>A grammar that produces more than one parse tree for some sentence is said to be ambiguous


- 兩個方式可以解決 Ambiguity
:::info
$E → E + E | E * E | ( E ) | id$
:::
1. Enforce precedence
- gives + lower precedence than * (as the most top figure)
- more distant from the starting symbol, higher precedence
2. Rewrite grammar
$E → E + T | T$
$T → T * F | F$
$F → ( E ) |id$
- left-associativity
變數都留在左邊
$Exp → Exp - Num | Num$

- Backus-Naur Form
>formal grammar for expressing context-free grammars
BNF extends the grammar notation defined
above, with syntax for defining **optional** and **repeated
symbols**
- Optional Symbols
$A→α [ X1 . . .Xn ] β$
表示 X1 . . .Xn 有一次 或 都沒有
- Repeated Symbols
$B→γ \{X1 . . .Xm\} δ$
表示 X1 . . .Xm 可以有多次 或 都沒有
### 4. Two Parsing Approaches

- Top-down parsers
順序 : 中左右

- Bottom-up parsers
順序 : 左右中

### 5. First and Follow
> two important ***functions*** for the construction of ***top-down*** and ***bottom-up*** parsers
- way


- example 1

- example 2


## Lecture-5 Top-Down parsing
### 0. preliminary
建表

### 1. Mutual recursive
- 現在位於一個 nonterminal 上
- 根據下一個 token 決定使用哪一個 rule,導致前往不同的 nonterminal
- 所以不同 nonterminal 的 predict set 內的 token 不能相同,否則就沒辦法判斷到底要使用哪一個 rule(同時對於生成兩個 nonterminal 都合理)
- 
- 
### 2. Table Driven
- Mutual recursive 只要遇到程式的 grammar 很複雜,就比較花時間和資源,所以改用 Table Driven
- non-recursive parser
- 
- 
- 
### 3. 解決 common prefix
- 
### 4. 解決 left recursion
- 
## Lecture-6 Bottom-Up pasrsing
### 1. Architecture
>The parser works its way from the terminal symbols to the grammar’s goal symbol.
- bottom-up parsers can handle the largest class of grammars that allow parsing to proceed deterministically.
- p.s. 他可以處理 common prefix,而不需要重設計 grammar
- Botton Up Parsing 的順序剛好跟 Top down Parsing 相反

- Rightmost 的 derivation (右上半),順序剛好跟下方的 Bottom Up Parsing 相反

- **Shift-reduce**
兩個最主要執行 bottom up 的方法,
- shift symbols onto the parse stack
- reduce 在 stack 上的 symbols string 成 one of the grammar’s nonterminals.
- **LR(k)**
The bottom-up parsers scan the input from the left (the “L” in LR) producing a rightmost derivation (the “R” in LR) in reverse, using k symbols of lookahead


- An LR parser makes shift-reduce decisions
- by maintaining states to keep track of where we are in a parse
- States represent sets of items
:::info
- 版本1


- 這裡的 State 5 收到 State 6 reduced 的 E 會回到 State 1,這是因為他還要考慮到 stack 目前的狀況
:::
:::success
- 版本2-1

- ***`LR`*** **state 1, 6** 會出現 **conflict**
- ***`SLR`*** 可以解決 **conflict**

- **state 1** 此時就要觀察 ***`reduce C => A => follow A => B => b`***,故當後面是 b 時,要 ***reduce***
- **state 6** 此時就也是要觀察 ***`reduce C ......`***
- **其他沒有 confict 的state** 也可以根據 follow 是誰而加在正確的黃色位置
:::
:::success
- 版本2.2



:::
### 2. Construct Function
- **`AdvanceDot(s, X)`** 找出可以從 s 經過 X 到達的 item set


- **`ComputeGoto(States, s)`** is responsible for computing all possible states reachable from s for ***all*** the grammar symbol X
給一個 state,每個跑過每個 X, 負責**建表**,ADDSTATE 負責查看是否已經有 state,有就找到回傳,沒有就建一個新的(會用到 AdvanceDot)
> 
- **`ComputeLR0(Grammar)`** 給一個 grmmar,跑過每個 state 去建表(會使用到 ComputeGoto)
- **`CompleteTable(Table, Grammar)`** 負責把 reduce 的地方加上去
### 3. Conflict
#### (1) Ambiguous
解法就是使用 ~~人工~~ 工人智慧。
- 原本:

- 後來:

#### (2) 非 Ambiguous
- **解法一**
使用 `LR(k)`,調整 lookahead 的 k,確定哪一個步驟才是正確的
無法解決:
- 
- k 沒辦法變成無窮大去判斷最後一個 token 是 a 還是 b
- 要改用更 powerful 的 grammar
:::info
`LR(0)` 的其他缺點
- 
- 在某一個 state,沒辦法根據 follow 去決定要做什麼
- action,要 reduce 就要全部情況都 reduce
:::
- **解法二**
更 powerful 的演算法

:::info
`SLR(1)`
- 遇到 ambiguous (不知道要選 reduce 還是 shift)時,check ***reduce 的 follow*** 有沒有包含下一個,如果有就選擇 reduce
- *SLR(k):
A handle (Right Hand Side) should NOT be reduced to N,if the look ahead token is NOT in Followk(N)*

:::
## Lecture-7 Intermediate Representations and Runtime Support
### 1. Architecture



- ***Front end*** : parses source code, 確認有沒有 errors、 建立 language-specific Abstract Syntax Tree (AST) 去表示 input code
- ***Optimizer*** : 負責 doing a broad variety of transformations to try to improve the code's performance, 多多少少 independent of language and target
- ***Back-end*** : also known as the code generator(machine code), then maps the intermediate code onto the target instruction set
- **Java Virtual Machine (JVM)** is also an implementation of this model, 吃入 **`Java bytecode`**,產生出 **`Machine Code`**
- **LLVM** 也是一種 optimizer
- Three Phases Design 跟 classical design 最大的差別就是可以支持 **multiple source languages or target architectures**
- 中間的 optimizer 使用的語言是 common code representation,所以不管 input 是什麼,都可以 compile to it,output 也是可以被 compiler from it.
:::info
- **JVM** 就是使用 Java bytecode

- **LLVM** 就是使用 LLVM IR

:::
### 2. LLVM
>LLVM is designed as a set of libraries
>LLVM is an infrastructure, a collection of useful compiler technology that can be
brought to bear on specific problems
- It 讀進 LLVM IR, chews on it a bit, then 釋放 LLVM IR which hopefully will 執行的快一點
- 它的結構為 **a pipeline of distinct optimization passes**,一個 pass 指的就是一個簡單的功能或是任務
:::info
- For example, **-O3** optimizer 裡面就有 67 個 passes
:::
- ***Link-Time Optimization*** : 他也可以利用 Linker 對先 compile 完的 .o file 做到**跨檔案**的優化,而不只是一個檔案優化而已

- ***Install-time optimization (ITO)***: 當最後要轉成 **`Machine code`** 的時候,也可以進行優化,就是找出 the specifics of the device you're targeting

### 3. JVM
- Key Features of JVM
- (1) Compactness
表示指令為 zero-address form : 就是 instruction 裡面沒有列出現 register,而是直接用 stack 的順序取 register
```
Example:
Executing the iadd instruction, it
1. pops two items off the stack and
2. pushes their sum onto the TOS
– 如此一來 這個 operation 只會用到一個 byte,因為 instruction’s operands 就直接從 stack 拿
```
- (2) Safety
- 卻也可以 reference 一個指令的 storage
- 只要這個 storage type 是被這個 instruction allowed
- 位置也能被合法 access
- safety errors 就可以在 executing code 前就被抓到
- 在一個 ***pure zero-address form*** (JVM 不是), a register load 被完成的步驟為
- a) computing a register number that is pushed on TOS
- b) pops the register number from the stack
- c) accesses the register’s contents
- d) pushes the contents on TOS
:::info
- 但會發生 security 問題,因為 load instruction 這樣只會在 runtime 的時候才能知道究竟要 access 到哪個實際的位置
- security 有問題 -> JVM 不是用 pure zero-address form,JVM 的 load instruction 會指定 register number
:::
- Architecture

- (1) ***class loader*** : loading types (classes and interfaces) given fully qualified names
- (2) ***execution engine*** : executing the instructions contained in the methods of loaded classes
- (3) ***runtime data areas*** : organizes the memory it needs to execute a program
- 可以是每個 threads 共用
- 也可以是 private to individual threads
- 但每個 thread 都會有自己的
- pc register
- Java stack

- A native method is a Java method 但不是用 java code 所寫
- ***class file***
Java Virtual Machine interprets class files
左半邊除了 constant pool 其實都是 ***`pointer`***,指向右邊的 ***`constant pool`***,object oriented design


複雜版

如果此時往 Constant Pool 裡面的 node 一看,裡面有 type,後面可能接**值**是裝多少,或是接一個 **pointer** 指到另外一個 pool element

- Field
>is an instance or a class level variable (property) of the class or interface
- Methods
>The Methods component host the methods
- Attributes
>Attribute section contains several attributes about the fields, methods, and class

## Lecture-8 Code Analysis & Optimization
### 1. Architecture

>IR 進 IR 出
每個 IR 也都有屬於自己的架構,就是另外一種語言
所以 Optimization 就是 code transformations
- 優點就可能包含
1. ***Space***
2. ***Time***
3. ***Energy***
- improve execution time 跟 improve space 通常就是 trade off
### 2. Many possible Optimizations
- ***Function inlining*** 內嵌函數
> Replace a function call with the function body
```c
int g (int x) {
return f(x) – 1;
}
int f (int n) {
int b = 1;
while (n--) { b = 2*b; }
return b;
}
```
change to
```c
int g (int x) {
int r;
int n = x;
{
int b = 1;
while (n--) { b = 2*b; }
r = b;
}
return r -1;
}
```
- ***Function cloning***
> when the compiler makes a second copy of a function with some modification
- 舉例來說 如果 compiler 發現一個 function 被呼叫非常多次 with 相同的 initial parameter, it may clone the function to produce a version which takes one less parameter, and then change all the callers to call the clone.
```clike
void f (int x[], int n, int m) {
for (int i = 0; i<n; i++) { x[i] = x[i] + i * m; }
}
```
For a version of `f(a, 10, 1)`, change to
```clike
void f’ (int x[]) {
for (int i = 0; i<10; i++) { x[i] = x[i] + i; }
}
```
- ***Constant folding*** 常數摺疊
> Transform the computation expression into the corresponding constant
- $x = 1.1 * 2; => x = 2.2;$
- ***Constant propagation***
> Replace use of variable with constant
```c
Example
– n = 10
– c = 2
– for (i=0; i<n; i++) { s = s+i*c; }
• Replace n, c:
– for (i=0; i<10; i++) { s = s+i*2; }
```
- ***Algebraic Simplification***
> take advantage of usual simplification rules
```c
a * 1 => a
a / 1 => a
b || false => b
a * 0 => 0
a + 0 => a
b || true => b
```
```c
(y*1+0) / 1
=> y*1+0
=> y*1
=> y
```
- ***Copy Propagation***
> Replace uses of x with y, after seeing assignment x = y
```c
x = y;
if (x > 1)
s = x * f(x - 1);
```
change to
```c
x = y;
if (y > 1)
s = y * f(y - 1);
```
- ***Common Subexpression Elimination***
> Reuse the computed value instead of computing the same expression multiple times
```clike
a = b + c;
c = b + c;
d = b + c;
```
change to
```clike
a = b + c;
c = a;
d = b + c;
```
這裡 **`第三行`** 不能代換,因為 c 此時已經在 **`第二行`** 改變了!
- ***Unreachable Code Elimination***
```clike
if (false) S; => ;
if (true) S; else S’; => S;
if (false) S; else S’; => S’;
while (false) S; => ;
while(2>3) S; => ;
```
- ***Dead code elimination***
> 排除掉 **沒有實際用處** 的那行指令,跟 Unreachable 不相同
```clike
x = y + 1;
y = 1;
x = 2 * z;
```
change to
```clike
y = 1;
x = 2 * z;
```
- ***Loop-invariant code motion***
> 最重要,因為是 program hot spots
- ***Strength Reduction***
> Replaces expensive operations by cheap ones,如將乘法改成加法、將次方改成左右移
```clike
s = 0;
for (i=0; i<n; i++) {
v = 4 * i;
s = s + v;
…
}
```
change to
```clike
s = 0;
v = -4;
for (i=0; i<n; i++) {
v = v + 4;
s = s + v;
…
}
```
---
```clike
x * 2
i * 2^c
i / 2^c
```
change to
```clike
x + x
i << c
i >> c
```
- ***Induction Variable Elimination***
> 減少 variable
```clike
s = 0;
v = -4;
for (i=0; i<n; i++) {
v = v + 4;
s = s + v;
…
}
```
change to
```clike
s = 0;
v = -4;
for ( ; v<(4*n-4); ) {
v = v + 4;
s = s + v;
…
}
```
- Loop Unrolling
> Execute loop body multiple times at each iteration,這樣可以減少時間,但會增加空間用量
```c
/* Copy 20 elements */
for (i=0; i<20; i++)
{
a[i] = b[i];
}
```
change to
```c
/* Unrolled four times */
for (i=0; i<20; i+=4)
{
a[i] = b[i];
a[i+1] = b[i+1];
a[i+2] = b[i+2];
a[i+3] = b[i+3];
}
```