# 編譯器 Compiler
###### tags: `courses meow` `CSIE-3` `Compiler` `2021 Fall`
:::success
:::spoiler click to open TOC
[TOC]
:::
:::
:::info
### :link: Links
- [上課連結](https://meet.google.com/dzf-jskc-bzd)
- [eeclass](https://ncueeclass.ncu.edu.tw/course/bulletin/9342)
- [9 grid](http://ninegrids.csie.ncu.edu.tw/)
- [online judge](http://domjudge.csie.ncu.edu.tw/)
:::
## Introduction
```graphviz
digraph {
label="Figure 1.1: A user's view of a compiler"
rankdir = LR
proLan [label="Programming Language"]
com [label="Compiler"]
macLan [label="Machine Language"]
proLan -> com -> macLan
}
```
### Front-end and Back-end
- Front-end
- turn source code into **intermediate representation (IR)**
- platform indeqndent (Mac, Linux, Windows...)
- Back-end
- generate the **target code (assembly code)**
- platform dependent (Mac, Linux, Windows...)
```graphviz
digraph {
source [label="Source\ncode"]
ir [label="Intermediate\nrepresentation"]
target [label="target\ncode"]
rankdir = LR
source -> ir
ir -> target
source -> target [style=dashed]
}
```
### Types of machine code
- Pure Machine Code
- without operating system or library routines
- Augmented Machine Code
- *Augmented* by
- `syscall` / `win32API`...
- other
- Virtual Machine Code
- useing **virtual instruction set**

- e.g.
- Linux `/usr/bin/java hello-world.jar`
- Windows `C://programx86//java hello-wrold.jar`
- java -> JVM interpret
### Bootstrapping
We have C compiler on X86 and it was written by C language
How can we generate a compiler on ARM?

> `cc.c` => c compiler source code
> `cc-x86-modified.exe` => an executable c compiler which backend has modified (to produce ARM instruction set)
> `cc-arm.exe` => an executable c compiler in ARM
```graphviz
digraph {
source [label="cc.c"]
cc_x86_arm_exe [label="cc-x86-modified.exe"]
cc_arm_exe [label="cc-arm.exe"]
rankdir = LR
source -> cc_x86_arm_exe -> cc_arm_exe
}
```

#### Our understanding of the graph
Goal: we want a **L-compiler** written in **L**, but we do not have **L-compiler**(but we have K-compiler)
-> So we use **K** to write a compiler for **L**
-> first **L-compiler** get!!
-> use **first L-compiler** to compile **L-compiler** written in **L**
-> **L-compiler** written in **L** get!!
-> discarded the **first L-compiler**
> and ignore the VM part. idk what it means...
:::info
- 假設我今天有一個可以編譯J語言的編譯器J_compiler
- 目標:要寫一個R語言,而且我希望這個R語言是自託管(有專門編譯R語言的編譯器且由R語言寫成)而非代託管(有專門編譯R語言的編譯器但並非由R語言寫成,可能由L語言寫成...)
- 步驟:
- 第一步:首先使用J語言寫出一個代託管的R語言編譯器 -> R_compiler.j
- 第二步:將R_compiler.j放入J語言編譯器中得到執行檔 -> R_compiler.exe
- 第三部:用R語言寫出一個R語言的自託管編譯器 -> R_compiler.r
- 第四部:將R_compiler.r放入R_compiler.exe中得到用R語言寫成的R語言編譯器(完成)
<!--
如果我沒有理解錯的話應該大概是這樣?
像這樣用不同語言產生新的編譯環境似乎又叫Cross Compiler(?我沒有很確定哈哈
-->
:::
```graphviz
digraph {
node [shape="record"]
L_com_K [label="L-compiler.K"]
L_com_K_exe [label="L-compiler.K.exe"]
L_com_K -> L_com_K_exe [label="by K-compiler"]
L_com_L [label="L-compiler.L"]
L_com_L_exe [label="L-compiler.L.exe"]
L_com_L -> L_com_L_exe [label="by L-compiler.K.exe"]
}
```

### Compiler vs. Interpreter
- compiler
- 特點:Compiler會根據不同的系統架構,直接將程式碼轉換成電腦語言然後再轉換成執行檔。
- 優點:
- 因為是轉換成電腦語言,所以電腦在處理電腦語言的時候速度會較快。
- 缺點:
- 編譯過程中需要對應不同的系統環境進行轉換,因此跨平台能力低(0?)。
- Interpreter
- 特點:直譯器會先將程式碼轉換成特定中間語言(OS independent),然後要執行時再逐行轉換並執行。
- 優點:由於事先轉換成中間語言,因此大部分系統都可以共用此中間語言,跨平台能力高。
- 缺點:電腦需要先轉換成中間語言,且執行時才一行一行轉譯,需要耗費較多時間。
### Different types of interpreter
- interpret one line and than execute one line
- > LISP, BASIC
- translate source code into some efficient **intermediate representation** and immediately execute this
- > e.g. Perl, Python, MATLAB, and Ruby
- compiler make a pre-compiled code -> interpreter run them (compiler is part of interpreter)
- > e.g. UCSD, Pascal
### Organization of a Compiler

Scanner, Parser -> frontend
others -> backend
### Scanner
analysis of the source program

### Parser
builds an abstract syntax tree (AST)

### Type checker(semantic analyzer)
- checks and decorates the **semantics** of each AST code
- checks type **correctness**, **reachability** and **termination**, and **exception handling**
> :warning: disclaimer: no guarantee on correctness[name=蛋蕉]
:::info
-> source code
```
int a = 1 + 2 - 3 * 4 - 5
```
-> Parser
```
(assign (a) (sub (sub (add (1) (2)) (mul (3) (4))) (5))
```
> Hey! this is a tree!
-> typechecker
```
(assign (int:a) (int:sub (int:sub (int:add (int:1) (int:2)) (int:mul (int:3) (int:4))) (int:5))*
```
> Hey! I'm tree
:::
:::warning
reachability & termination
reachability: x
correctness:
```c=
// test.c
int a;
a = 5.2;
```

:::
### Translator
- If an AST node is semantically correct then
```graphviz
digraph {
source [label="AST"]
IR_code [label="IR code"]
rankdir = LR
source -> IR_code [label="Tranlator"]
}
```
### Optimizer
- Improves IR code’s performance
- Reomve redundant code
### Code generator
- Mapping IR code or AST into target code
---
# chapter 2
## The grammar for ac
Assumed we have a simple language called ac
:::info
`i a` declare a integer variable a
`f a` declare a float variable a
`a = 1` assign 1 (int num) to variable a
:::
### grammar example
> 1. a program can be split into declarations and statements
> 2. declarations can be split into many single declaration
> 3. a declaration can be split into `intdcl id` or `dloatdcl id`
> 4. ...
```
1 Prog -> Dcls Stmts $
2 Dcls -> Dcl Dcls
| \lambda
4. Dcl -> flaotdcl id
| intdcl id
6. Stmts -> Stmt Stmts
| \lambda
9. Stms -> id assign Val Expr
| print id
10 Expr -> plus Val Expr
| minus Val Expr
13. Val -> id
| inum
| fnum
```
- $\lambda$ (\lambda): empty string
- Prog: Program
- Dcls: declarations
- dcl: single declaration
- Stmts: Statements
- stmt: single statement
- Expr: Expression
- Val: value
- inum: integer numbers
- fnum: floating numbers
### terminal
- terminal
- $\lambda$
- intdcl, floatdcl
- inum, fnum
- id
- assign
- print
- ...
- non-terminal
- stmt
- dcl
- val
- expr
- ...
### decent recursive way
> not all grammar can implemented by recursive way
> many grammar cannot be written as recursive decent parser
just write a function for every non-terminal grammar.
## AST
```
# an example of ac code
f b
i a
a = 5
b = a + 3.2
p b
```
```graphviz
digraph {
node [shape="record"]
p [label="Program"]
fb [label="{floatdcl | b}"]
p -> fb
p -> ia
ia [label="{intdcl | a}"]
ass1 [label="assign"]
ida [label="{id | a}"]
i5 [label="{inum | 5}"]
p -> ass1
ass1 -> ida
ass1 -> i5
ass2 [label="assign"]
idb [label="{id | b}"]
plus [label="plus"]
ida2 [label="{id | a}"]
f3d2 [label="{fnum | 3.2}"]
p -> ass2
ass2 -> idb
ass2 -> plus
plus -> ida2
plus -> f3d2
pb [label="{print | b}"]
p -> pb
}
```
### traversal AST
#### create symbol table
```graphviz
digraph {
node [shape="record"]
p [label="Program"]
fb [label="{floatdcl | b}"]
p -> fb [color="red" label="create symbol table"]
p -> ia [color="red" label=""]
symbol_b[label="register *b* as a float\n in symbol table" color="red"]
fb -> symbol_b
symbol_a[label="register *a* as a int\n in symbol table" color="red"]
ia -> symbol_a
ia [label="{intdcl | a}"]
ass1 [label="assign"]
ida [label="{id | a}"]
i5 [label="{inum | 5}"]
p -> ass1
ass1 -> ida
ass1 -> i5
ass2 [label="assign"]
idb [label="{id | b}"]
plus [label="plus"]
ida2 [label="{id | a}"]
f3d2 [label="{fnum | 3.2}"]
p -> ass2
ass2 -> idb
ass2 -> plus
plus -> ida2
plus -> f3d2
pb [label="{print | b}"]
p -> pb
}
```
#### type
```graphviz
digraph {
label="after type checking"
node [shape="record"]
p [label="Program"]
fb [label="{floatdcl | b}"]
p -> fb
p -> ia
ia [label="{intdcl | a}"]
ass1 [label="{assign | integer}" color="red"]
ida [label="{id | a}"]
i5 [label="{inum | 5}"]
p -> ass1
ass1 -> ida
ass1 -> i5
ass2 [label="{assign | float}" color="red"]
idb [label="{id | b}"]
plus [label="{plus | float}" color="red"]
int2float [label="{int2float | float}" color="red"]
ida2 [label="{id | a}"]
f3d2 [label="{fnum | 3.2}"]
p -> ass2
ass2 -> idb
ass2 -> plus
plus -> int2float
int2float -> ida2
plus -> f3d2
pb [label="{print | b}"]
p -> pb
}
```
### code generation
# Chapter 3
## Regular expression
[JavaScript 的 rex](https://developer.mozilla.org/zh-TW/docs/Web/JavaScript/Guide/Regular_Expressions)
:::danger
### 注意
regular expression 有很多不同版本,每個版本的定義會有點不同
:::
### some boring definition
bla bla bla
### LEX
https://www.computerhope.com/unix/ulex.htm
---
# Chapter 4
> 我不是很確定我聽到了什麼
> 同感
## The definition of a grammar
xxx
## Properties of grammars
- nullable
- useless
- if it can never procudct a terminal
- ambiguous
- more than one parse-tree
## First() & Follow()
> 希望我理解的沒錯
### First set
:::info
FIRST($\alpha$)
> the set of terminals that begin strings derived from $\alpha$
:::
- 第一個不可被替換的terminal們
- EX:
- S -> ABC
- A -> a|$\lambda$
- B -> b|$\lambda$
- C -> c|$\lambda$
- First(S) = {a,b,c}
- 因為A有可能返回 $\lambda$ ,所以當 A 為 $\lambda$ 時第一個就是 b,所以要把 B 考慮進去,同理C。
- 因為不知道怎麼打Landa所以打^當作NULL