【Compiler】 Ch3. Lexical Analysis

# 【Compiler】 Ch3. Lexical Analysis * 先複習一下什麼是Lexical Analysis。 ![](https://i.imgur.com/UE0pZM5.png) * 我們將input讀進lexical analyzer(又稱scanner)，目標是產生出tokens和建立symbol table供parser使用。 ## 名詞解釋 * Tokens * 文法中的terminal symbols。 * Patterns * 描述輸入字串如何對應tokens的東西。 * Lexeme * 就是輸入。 * 範例：輸入字串 `<=` 到scanner後，scanner根據 `<= → leq` 輸出 `leq` 。 * `<=` 就是Lexeme、Patterns就是 `<= → leq` 、而 `leq` 就是Tokens。 ## Operations on Languages * 我們可以透過將Language做運算來得到更複雜的Language。 * 假設要對兩個語言L = {0}和M = {1}做運算： * Union，用$L\bigcup M$表示。 * 結果：{0, 1} * Concatenation，用LM表示。 * 結果：{01} * Closure，為單元運算，假設對L運算則表示為$L^*$。 * 結果：{$\epsilon$, 0, 00, 000, ...} * Positive closure，為單元運算，假設對L運算則表示為$L^+$。 * 結果：{0, 00, 000, ...} * 更多例子： * $L^4$ = {0000} * $(L\bigcup M)^*$ = 所有0和1組成的字串(包含空字元)。 ## Regular Expressions * 正規表達式Regular Expressions `r`可以表示一個語言`L(r)`。 * Regular Expression的好處是它支援類似上面的那些運算，幫助我們簡單產生想要的語言： * `(r)|(s)` 得到 $L(r) \bigcup L(s)$ * `(r)(s)` 得到 $L(r) L(s)$ * `(r)*` 得到 $(L(r))^*$ * `(r)+` 得到 $(L(r))^+$ * 運算優先度為Closure > Concatenation > Union * (a)|((b)*(c)) ≡ a |b*c * Regular Expression中也有交換律、結合律之類的東西，大多數常見、常用的都不難想，這邊就不附了。 ### Regular Definitions * 為了更方便定義符號，在Regular Expression中我們可以使用`→`定義一個字母集。 * 範例： * digit → 0 | 1 | … | 9 * letter → [A-Za-z] * digits → digit+ * optional_fraction → (. digits )? * 這樣就不用每次使用到數字時都打 `0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9`，只需輸入`digit`即可。 ## Recognition of Tokens * Scanner到底怎麼辨識出token呢？最常使用的是State transition diagrams，也就是狀態機。 * 舉個簡單的例子：辨識出 relop → < | <= | = | <> | > | >= ![](https://i.imgur.com/56pk64e.png) * 起始在狀態`0`，依據輸入字串就可以得到不同的結果。 * 在實作練習中，我們使用工具`lex`來自動產生狀態機，詳情會在實作篇講解。 ## Finite Automata * 我們建立一個叫做Finite Automata的狀態機來完成想要的regular expression。 * $M = \{S, \sigma , \delta , s_0, F\}$ * S: 代表狀態的有限集合 * $\sigma$: 使用的字母集 * $\delta$: 狀態轉移關係的集合 * $\delta$(0, a) = {1}，表示在狀態0看到a往狀態1跑。 * $s_0$: 起始狀態 * F: 結束狀態的集合 * 如果一個Finite Automata符合以下特性則又稱deterministic finite automaton (DFA) * 沒有狀態擁有 є-transition (空輸入的轉換) * 狀態s在遇到輸入a時，最多只能有一條狀態轉移出現 * 否則稱作Nondeterministic finite automaton(NFA)。 * 雖然感覺起來DFA比NFA簡單，但通常NFA的圖看起來會比較直覺。 ### NFA 練習題給予一NFA $M = \{S, \sum , \delta , s_0, F\}$ S = {0, 1, 2, 3} $\sum$ = {a, b} $\delta$ = { (0, a) = {0, 1} (0, b) = {0} (1, b) = {2} (2, b) = {3} } $s_0$ = 0 F = {3} 請畫出該NFA 答案 ![](https://i.imgur.com/KwrTlPo.png) #### Operations on NFA States * 這邊介紹幾個NFA狀態中可以使用的運算，在後續的狀態機轉換中會使用到。 * 先上個範例NFA，這次比上面的複雜得多： (a|b)*abb ![](https://i.imgur.com/ZiuwoDR.png) * є-closure(s) * 輸入s為一個狀態，回傳一個狀態集合。 * 回傳所有在狀態s能夠使用є-transitions得到的集合。 * 例如：є-closure(0) = {0, 1, 2, 4, 7} * є-closure(T) * 輸入T為一個狀態集合，回傳一個狀態集合。 * 回傳將所有T裡面的狀態做є-closure(s)並union起來。 * 例如：є-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} * moveto(T, a) * 輸入T為一個狀態集合、a為一個字元，回傳一個狀態集合。 * 回傳所有T裡面的狀態，先遇到a之後做狀態轉移，再對所有狀態做є-closure(s)，最後union起來。 * 例如：moveto(A, a) = є-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} ## Constructing a DFA from an NFA * 因為NFA比較容易建出來(限制比較少)，所以如果要得到DFA，習慣的做法是建出NFA後轉換出DFA。 * 使用Subset construction演算法即可。 1. 先對起始狀態做є-closure，並將狀態集合放入queue中 2. 從queue拿出一個狀態集合，對該狀態集合和所有輸入都做一次moveto()。 3. 得到的狀態集合如果沒有出現過，放入queue中。要記錄這個新的狀態集合是從哪個狀態、哪個輸入moveto過來的。 4. 回到步驟2.直到queue沒有狀態集合了。 5. 將所有出現過的狀態集合當作一個狀態，加上轉移關係(moveto)之後即得到DFA。 --- * 以上面的NFA為例： * є-closure(0) = {0, 1, 2, 4, 7} ≡ A * moveto(A, a) = є-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} ≡ B * moveto(A, b) = є-closure({5}) = {1, 2, 4, 5, 6, 7} ≡ C * moveto(B, a) = є-closure({3, 8}) = B * moveto(B, b) = є-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9} ≡ D * moveto(C, a) = є-closure({3, 8}) = B * moveto(C, b) = є-closure({5}) = C * moveto(D, a) = є-closure({3, 8}) = B * moveto(D, b) = є-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} ≡ E * moveto(E, a) = є-closure({3, 8}) = B * moveto(E, b) = є-closure({5}) = C ![](https://i.imgur.com/unXOjIf.png) ## Constructing NFA from Regular Expression * 前面提過NFA比較容易建出來，那麼我們到底如何從Regular Expression快速得到NFA呢？ * 使用Thompson’s construction演算法即可。 * 所有Regular Expression都可以用Union、Concatenation、Closure來表示。 * 因此我們只要分別學會Union、Concatenation、Closure的NFA，就可以透過組裝得到所有Regular Expression的NFA。 ![](https://i.imgur.com/wFJZ3EL.png) 由上到下分別是Union、Concatenation、Closure。 ## Minimizing the Number of States * 前面我們已經學會從NFA得到DFA，然而這樣的方法建出來的DFA可能不是狀態數最少的DFA。 * 我們可以藉由演算法發現是否有狀態是無法分辨的狀態，進而合併來縮減狀態數。 1. 將狀態分為兩組，結束狀態和非結束狀態 (F, S-F)。 2. 對所有狀態、所有輸入進行轉移，如果轉移後的狀態跑到不同組，將該狀態獨立出一組。 3. 重複步驟2.直到沒有新的組出現。 4. 同樣組的狀態可以合併成一個狀態。 ![](https://i.imgur.com/tKvdlOl.png) ## Building Regular Grammar from NFA * 提醒，Regular Grammar和Regular Expressions不同喔！ * 複習一下，Regular Grammar需要$G = (V_N, V_T, P, S)$四個要素。 * $V_N$就是狀態數量，有幾個狀態就有幾個Nonterminal symbol。 * $V_T$就是輸入，可以從狀態轉移得到。 * Productions比較複雜一點：每個轉移邊可以得到一個Production P = (目前狀態) → (輸入字元)(下個狀態) * S就是起始狀態，可以直接從圖得到。 --- * 反過來從NFA得到Regular Grammar應該不難，這邊就不多說了。 ###### tags: `compiler` `note`