自動機與形式語言筆記

0 Introduction

最終目標是介紹 Complexity，但會先從 Automata、Computability 開始。

Notions

基本概念

集合 (Set)：無序的獨特物件組合。
序列 (Sequence)：有順序的物件排列。（本筆記中盡量 0-based）
- 元組 (Tuple)：有限序列，K 個元素稱 K-元組。
笛卡兒積 (Cartesian product/ Cross product)：
$A \times B = {(a, b) | a \in A, b \in B}$ ，所有有序對的集合。
冪集 (Power Set)：
$P (A) = {S | S \subseteq A}$ ，集合
$A$ 所有子集的集合。

函數與關係

函數 (Function)：每個輸入對應唯一輸出。
述詞 (Predicate / Property)：range 為 {TRUE, FALSE} 的函數。
關係 (Relation)：Domain 為 a set of k tuples (
$A \times \dots \times A$ ) 的 predicate，比如大於(2>1 是 TRUE)。用集合表示會更方便，例如
$P : D \to {TRUE, FALSE}$ 寫成
$(D, S)$ ，其中
$S = {a \in D | P (a) = TRUE}$ 。
- 等價關係 (Equivalence Relation)：二元關係
  $R$ 須滿足：
  1. 自反性 (Reflexive)：
    $\forall x, x R x$ 。
  2. 對稱性 (Symmetric)：
    $\forall x, y, x R y \Rightarrow y R x$ 。
  3. 遞移性 (Transitive)：
    $\forall x, y, z, x R y \land y R z \Rightarrow x R z$ 。

圖 (Graph)

這裡未指名時通常是無向。

Strings 與 Languages

很重要

Alphabet 是任何 nonempty finite set
Alphabet 的成員是 Symbol
String (over an alphabet A) 是 finite sequence of symbols from A。（本筆記中盡量 1-based）
- $ϵ$ 是空字串。
- Shortlex order: 和字典序一樣，但是先比長度再比字典序，例如
  $(ϵ, 0, 1, 00, 01, \dots)$ 是 over
  ${0, 1}$ 的 shortlex order。
Language 是某個 alphabet 上的 string 的集合。

Definitions, Theorems, and Proofs

Definition: 定義概念。
Statement: 命題，未必是 TRUE 或 FALSE。
Proof: 講述命題為真的邏輯論述。
Theorem: 已被證明的 statement。
- Lemma: 輔助用的 theorem。
- Corollary: 從 theorem 推導出的 statement。

Types of Proofs

Proof by Construction: 直接構造出例子。
- 比如要求證明對於大於 2 的偶數 n，存在一個 n 個 node 的 3-regular Graph，就直接寫出對於每個 n 要如何建構 Graph。
Proof by Contradiction: 反證法。
- 假設命題為 FALSE，然後推導出矛盾。
Proof by Induction: 歸納法。
- basis: 基礎 case。
- inductive step: 假設對於 n 成立 (Induction Hypothesis)，證明對於 n+1 也成立。

1 Regular Languages

Finite Automata

Definition of Finite Automaton

(Q, Σ, δ, q_{0}, F)

是一個 finite automaton，其中：

$Q$ 是 states 的有限集合。
$Σ$ 是作為 alphabet 的有限集合。
$δ : Q \times Σ \to Q$ 是 transition function。
$q_{0} \in Q$ 是 start state。
$F \subseteq Q$ 是 accept (final) states 的集合。

運作方式：自動機從初始狀態

q_{0}

開始，依序讀取輸入字串中的每個符號。根據轉移函數

δ

，從當前狀態和讀取的符號決定下一個狀態。讀取完所有符號後，若最終狀態屬於接受狀態集合

F

，則該字串被接受；否則，該字串被拒絕。

機器 M 所辨識的語言 (language of machine M): 所有被 M 接受的字串所組成的集合 A，寫成

L (M) = A

，稱為 M recognizes A (accepts 也行，但是通常用在 string)。

Definition of Computation

Let

$M = (Q, Σ, δ, q_{0}, F)$ be a finite automaton.
$w = a_{1} a_{2} \dots a_{n}$ be a string over
$Σ$ .

M accepts w

⟺

there exists a sequence of states

r_{0}, r_{1}, \dots, r_{n} \in Q

such that:

$r_{0} = q_{0}$
$r_{i + 1} = δ (r_{i}, a_{i + 1}), \forall i \in [0, n - 1]$
$r_{n} \in F$

M recognizes language A

⟺ A = {w | M accepts w}

A Regular Language 是可以被某個 finite automaton 所接受的 language。

Regular Operations

Let A and B be languages.

Union:
$A \cup B = {w | w \in A \lor w \in B}$
Concatenation:
$A \circ B = {w_{1} w_{2} | w_{1} \in A \land w_{2} \in B}$
也可以省略不寫
$\circ$ ，直接寫成
$A B$ 。
Kleene Star:
$A^{*} = {w_{1} w_{2} \dots w_{k} | k \geq 0 and each w_{i} \in A}$ ，也可以寫成
$⋃_{n = 0}^{\infty} A^{n} = {ϵ} \cup A \cup A^{2} \cup A^{3} \cup \dots$ ，其中
$A^{0} = {ϵ}, A^{n} = {w v | w \in A^{n - 1}, v \in A}$ 。

Theorem: The set of regular languages is closed under all three of the regular operations.

For union, prove by construction:
Let
$A_{1}$ and
$A_{2}$ be regular languages recognized by finite automata
$M_{1} = (Q_{1}, Σ, δ_{1}, q_{01}, F_{1})$ and
$M_{2} = (Q_{2}, Σ, δ_{2}, q_{02}, F_{2})$ , we can construct a new finite automaton
$M = (Q, Σ, δ, q_{0}, F)$ that recognizes
$A_{1} \cup A_{2}$ .
- $Q = Q_{1} \times Q_{2}$ (Cartesian product)
- $Σ$ remains the same, but this theorem is still true if the alphabets are different.
- $δ ((q_{1}, q_{2}), a) = (δ_{1} (q_{1}, a), δ_{2} (q_{2}, a))$
- $q_{0} = (q_{01}, q_{02})$
- $F = (F_{1} \times Q_{2}) \cup (Q_{1} \times F_{2}) = {(q_{1}, q_{2}) | q_{1} \in F_{1} \lor q_{2} \in F_{2}}$
For concatenation: 無法簡單構造，因為不知道
$A_{1}$ 和
$A_{2}$ 中間要在哪裡切，所以需要 Nondeterminism。Kleene Star 也是類似的情況。

Nondeterminism

Definition of Nondeterministic Finite Automaton (NFA)

Transition function 由

Q \times Σ \to Q

改成

Q \times Σ_{ϵ} \to P (Q)

，其中

P (Q)

是

Q

的 Power set，

Σ_{ϵ} = Σ \cup {ϵ}

。也就是可能轉移到輸出的子集中的任一個 state，甚至輸入的 symbol 可以是

ϵ

。

其實也能在使輸入維持

Q \times Σ \to P (Q)

，只要把轉移後再經過若干

ϵ

轉移能到的狀態都新增到輸出中就行了。

轉移條件改為

r_{i + 1} \in δ (r_{i}, a_{i + 1}), \forall i \in [0, n - 1]

Nondeterminism is a generalization of determinism: Every DFA (deterministic finite automaton) is automatically an NFA (nondeterministic finite automaton).

Equivalence of NFAs and DFAs

NFA 與 DFA 是等價的，也就是對於任意 NFA，都可以找到與其 recognizes 相同 language 的 DFA。對於有

k

個狀態的 NFA

N = (Q, Σ, δ, q_{0}, F)

，對應的 DFA

D = (Q^{'}, Σ, δ^{'}, q_{0}^{'}, F^{'})

的定義如下：

先定義輔助用的
$E (R) = {q | q$ can be reached from
$R$ by traveling along 0 or more
$ϵ$ arrows
$}$ ，表示從狀態集合
$R$ 可以到達的所有狀態。
state:
$Q^{'} = P (Q)$
transition function:
$δ^{'} (R, a) = ⋃_{r \in R} E (δ (r, a))$
start state:
$q_{0}^{'} = E ({q_{0}})$
accept states:
$F^{'} = {R \in Q^{'} | R \cap F \neq \emptyset}$

因此，以後並不太需要在意 NFA、DFA 的差別。

Corollary: A language is regular

⟺

an NFA recognize it.

\Rightarrow

：Regular 有 DFA recognize，而 DFA 是 NFA 的特例。

\Leftarrow

：NFA recognize 某 language，則對應的 DFA 也 recognize 相同 language。

Closure under regular operations

Let

$N_{1} = (Q_{1}, Σ, δ_{1}, q_{01}, F_{1})$ be an NFA recognizing language
$A_{1}$ .
$N_{2} = (Q_{2}, Σ, δ_{2}, q_{02}, F_{2})$ be an NFA recognizing language
$A_{2}$ .

For each operation, construct a NFA

N = (Q, Σ, δ, q_{0}, F)

that recognizes the resulting languages,

A_{1} \cup A_{2}, A_{1} \circ A_{2}, A_{1}^{*}

Union: 額外建立一個初始狀態
$q_{0}$ ，可以經過
$ϵ$ 隨意到
$q_{01}$ 或
$q_{02}$ 。
- $Q = {q_{0}} \cup Q_{1} \cup Q_{2}$
- $δ (q, a) =$
  - $δ_{1} (q, a)$ if
    $q \in Q_{1}$
  - $δ_{2} (q, a)$ if
    $q \in Q_{2}$
  - ${q_{01}, q_{02}}$ if
    $q = q_{0} \land a = ϵ$
  - $\emptyset$ if
    $q = q_{0} \land a \neq ϵ$
- $F = F_{1} \cup F_{2}$
Concatenation: 因為先經過
$N_{1}$ 再經過
$N_{2}$ ，
$M$ accept 的狀態必須是
$N_{1}, N_{2}$ 都 accept 過，所以把所有
$N_{1}$ 的 accept state 以
$ϵ$ 接到
$N_{2}$ 的 start state。細節是
$N_{1}$ 的 accept state 也可能繼續在
$Q_{1}$ 裡面走，不一定都直接到
$q_{02}$ 。
- $Q = Q_{1} \cup Q_{2}$
- $δ (q, a) =$
  - $δ_{1} (q, a)$ if
    $q \in Q_{1} \land \neg (q \in F_{1} \land a = ϵ)$
  - $δ_{1} (q, a) \cup {q_{02}}$ if
    $q \in F_{1} \land a = ϵ$
  - $δ_{2} (q, a)$ if
    $q \in Q_{2}$
- $q_{0} = q_{01}$
- $F = F_{2}$
Kleene Star: 所有原本的 accept state 可以回到 start state 繼續，且新 start state 也是一種 accept state。
- $Q = {q_{0}} \cup Q_{1}$
- $δ (q, a) =$
  - $δ_{1} (q, a)$ if
    $(q \in Q_{1} \land q \notin F_{1}) \lor (q \in F_{1} \land a \neq ϵ)$
  - $δ_{1} (q, a) \cup {q_{01}}$ if
    $q \in F_{1} \land a = ϵ$
  - ${q_{01}}$ if
    $q = q_{0} \land a = ϵ$
  - $\emptyset$ if
    $q = q_{0} \land a \neq ϵ$
- $F = {q_{0}} \cup F_{1}$

另外，事實上對於一些其他 operation 也是 closed，比如：

Intersection: $A_1\cap A_2。使用 product automaton (之前在 Union 使用的方法)，使得狀態同時在
$N_{1}$ 和
$N_{2}$ 的 accept states 才算 accept。
Set Difference:
$A_{1} - A_{2}$ 。一樣用 product automaton，讓狀態在
$N_{1}$ 的 accept states 但不在
$N_{2}$ 的 accept states 才算 accept。
Complement:
$A_{1}^{c} = Σ^{*} - A_{1}$ 。顯然
$Σ^{*}$ 是 regular，而 Set Difference 下也 closed，所以
$A_{1}^{c}$ 也是 regular。

Regular Expressions

Symbol 或 Alphabet 經過 Regular Operations 即為 Regular Expression。每個 Expression 只是一個記號，代表一個 set of strings 也就是 language，說成 "The expression describes the language"。

Let

Σ

be an alphabet. A regular expression

R

over

Σ

is defined as follows (以下

\to

符號代表 expression 描述的 language):

$\emptyset \to \emptyset$
$ϵ \to {ϵ}$
$a$ for some
$a \in Σ \to {a}$
$R_{1} \cup R_{2}$ for regular expressions
$R_{1}, R_{2} \to L (R_{1}) \cup L (R_{2})$
$R_{1} \circ R_{2}$ for regular expressions
$R_{1}, R_{2} \to L (R_{1}) \circ L (R_{2})$
$R_{1}^{*}$ for regular expression
$R_{1} \to L (R_{1})^{*}$

若沒有括號，順序是 Kleene Star > Concatenation > Union，分別可以類比為次方、乘法、加法。

速記

$R^{+} = R R^{*}$ ，至少重複一次，
$R^{+} \cup ϵ = R^{*}$
$R^{n} = R^{n - 1} \circ R, R^{0} = ϵ$
$L (R)$ 是
$R$ 所代表的 language。

Examples

只挑稍微難理解的

$1^{*} \emptyset$ : 因為是指任何數量的 1 接上「什麼東西都不能」，所以
$1^{*} \emptyset = \emptyset$ ，事實上任何 expression 接上
$\emptyset$ 都是
$\emptyset$ 。
任何 expression 與
$\emptyset$ 的 Union 是本身。
類似的，任何 expression 與
$ϵ$ 的 Concatenation 是本身。
$\emptyset^{*} = ϵ$ ，因為
$L (\emptyset^{*}) = L (\emptyset)^{0} \cup L (\emptyset)^{1} \cup L (\emptyset)^{2} \cup \dots = {ϵ} \cup \emptyset \cup \emptyset \cup \dots = {ϵ}$

注意正規表達式本身是 notation，不直接等於它描述的 language。因此，應寫

\emptyset^{*} = ϵ

（表示這兩個正規表達式描述相同 language），或

L (\emptyset^{*}) = {ϵ}

（表示

\emptyset^{*}

所描述的 language 是只有 epsilon 的集合）。

Equivalence with Finite Automata

A language is regular

⟺

A regular expression describes it.

\Leftarrow

: Proof by Construction. Given a regular expression

R

, we can construct a NFA

M = (Q, Σ, δ, q_{0}, F)

that recognizes the language

L (R)

. (若很明顯會省略)

Base cases
- $R = a, a \in Σ$ :
  $δ (q_{0}, a) = {q_{1}}, F = {q_{1}}$
- $R = ϵ$ :
  $F = {q_{0}}$
- $R = \emptyset$ :
  $F = \emptyset$
Inductive steps. For Union, Concatenation, and Kleene Star, use the proof that the set of regular languages is close under those operations.
Let
$R_{1}$ and
$R_{2}$ are regular expressions such that
$L (R_{1}), L (R_{2})$ are both regular languages so there exist their respective DFAs. For
$(R_{1} \cup R_{2}), (R_{1} \circ R_{2}), (R_{1}^{*})$ , we can use the construction from the proof for closure under regular operations to construct a DFA
$M$ that recognizes
$L (R_{1} \cup R_{2}), L (R_{1} \circ R_{2}), L (R_{1}^{*})$ respectively.

\Rightarrow

: Proof by Construction. Convert a DFA to a regular expression. First to a special form of generalized nondeterministic finite automaton (GNFA) then to a regular expression.

GNFA

GNFA: NFA but can have regular expressions as the input to the transition function.

Use a special form of GNFA:

Start state
- It has a transition going to any other state
- It has no transition coming in from any other state
Accept state
- There is only a single accept state
- It is not the same as the start state
- It has a transition coming in from any other state
- It has no transition going to any other state
Except for the start and accept states
- One transition goes from every state to every other state
- One transition goes from each state to itself

(可以把非 special form 的 GNFA 轉換成 special form 的 GNFA)

Definition of GNFA (Special Form):

(Q, Σ, δ, q_{s t a r t}, q_{a c c e p t})

where:

$Q, Σ$ : same as NFA
$δ :$
$(Q - {q_{a c c e p t}}) \times (Q - {q_{a c c e p t}})$
$\to R$ , where
$R$ is the set of regular expressions over
$Σ$ . (從狀態、symbol 對應到下一個狀態改為狀態對對應到 regular expression) No transition from
$q_{a c c e p t}$ and no transition to
$q_{s t a r t}$ .
$q_{s t a r t}, q_{a c c e p t} \in Q$ are the start and accept states, respectively.

Computation 非常簡單，略過。

證明過程本體

Convert a DFA to a GNFA (Special Form)
- Add a new start state with an
  $ϵ$ transition to the original start state.
- Add a new accept state with an
  $ϵ$ transition from every original accept state.
- 對於每一對狀態的兩個方向，分別使用 Union 整合所有 transition
- 對於每一對狀態的兩個方向，對沒有 transition 的以 label
  $\emptyset$ 連接
Convert a GNFA (Special Form) to a regular expression: Remove states until only the start and accept states remain.
- Select any state
  $q_{r i p} \in Q$ different from
  $q_{s t a r t}, q_{a c c e p t}$ .
- Let G' be the GNFA
  $(Q^{'}, Σ, δ^{'}, q_{s t a r t}, q_{a c c e p t})$
  - $Q^{'} = Q - {q_{r i p}}$
  - $δ^{'} (q_{1}, q_{2}) = δ (q_{1}, q_{r i p}) δ (q_{r i p}, q_{r i p})^{*} δ (q_{r i p}, q_{2}) \cup δ (q_{1}, q_{2})$ for any
    $q_{1} \in Q^{'} - {q_{a c c e p t}}$ and any
    $q_{2} \in Q^{'} - {q_{s t a r t}}$ .

Nonregular Languages

$B = 0^{n} 1^{n} | n \geq 0$ : Nonregular
$C = {w | w has an equal number of 0s and 1s}$ : Nonregular
$D = {w | w has an equal number of occurrences of 01 and 10 as substrings}$ : Regular！
因為其實只要看開頭與結尾即可，若一樣則 01 與 10 數量相同，若 0 開頭 1 結尾則 01 多了一個，反之亦然。重點是因為兩者數量頂多差 1，所以能以有限種狀態表達。

The Pumping Lemma for Regular Languages

A

is a regular language

\Rightarrow

There exists a pumping length

p \geq 1

such that for any string

w \in A

with

| w | \geq p

, we can write

w = x y z

such that:

$x y^{i} z \in A \forall i \geq 0$
$| y | \geq 1$
$| x y | \leq p$

The converse is not true!!! It can only be used to prove a language is nonregular, but some nonregular languages may satisfy the property. To prove a language is regular, the typical way is to construct an FA or a regular expression.

Proof by Construction

Let

$M = (Q, Σ, δ, q_{0}, F)$ be a FA recognizing language
$A$ .
$p = | Q |$ be the pumping length.
$w = a_{1} a_{2} \dots a_{n}$ be a string in
$A$ of length
$n \geq p$ .
$r_{0}, r_{1}, \dots, r_{n}$ be the states visited by
$M$ on input
$w$ . (
$r_{0} = q_{0}$ ,
$r_{i + 1} = δ (r_{i}, a_{i + 1}) \forall i \in [0, n - 1]$ , and
$r_{n} \in F$ )
取前
$p + 1 = | Q | + 1$ 個，因為
$M$ 只有
$| Q |$ 種狀態，by the pigeonhole principle, there exists
$i, j$ such that
$0 \leq i < j \leq p$ and
$r_{i} = r_{j}$ .
$x = a_{1} \dots a_{i}, y = a_{i + 1} \dots a_{j}, z = a_{j + 1} \dots a_{n}$ .

圖示：

r_{0} \overset{x}{\to} r_{i} \overset{y}{\to} r_{j} \overset{z}{\to} r_{n}

Since

r_{i} = r_{j}

, repeating

y

any number of times will not change the state, so

L (x y^{*} z) \subseteq A

.
Also,

| y | = j - i \geq 1

because

i < j

, and

| x y | \leq p

because

j \leq p

Using Pumping Lemma to Prove Nonregularity

P (A is a regular language)
$\Rightarrow$ Q (The Pumping Lemma Condition):
If a language
$A$ is regular, then there exists a pumping length
$p \geq 1$ such that for every string
$w \in A$ with
$| w | \geq p$ , we can write
$w = x y z$ satisfying the three conditions:
Contrapositive Logic for Non-regularity (
$\neg Q \Rightarrow \neg P$ ):
The negation of the Pumping Lemma Condition (
$\neg Q$ ): For every
$p \geq 1$ , there exists a string
$w \in A$ with
$| w | \geq p$ such that
$w$ cannot be written as
$x y z$ satisfying the three conditions.

Example: Prove
$B = {0^{n} 1^{n} | n \geq 0}$ is nonregular

Assume

B

is regular. Let

p

be the pumping length.
Select

w = 0^{p} 1^{p}

. This

w \in B

and

| w | = 2 p \geq p

.
By the Pumping Lemma,

w = x y z

must satisfy the three conditions.
Since

| x y | \leq p

x

and

y

can only contain '0's. So

y = 0^{k}

for some

k \geq 1

.
Consider

x y^{2} z

. This string will have

p + k

'0's and

p

'1's.

Since

k \geq 1

, the number of '0's (

p + k

) is not equal to the number of '1's (

p

).
Therefore,

x y^{2} z \notin B

. This contradicts the Pumping Lemma. Thus,

B

is nonregular.

For

{w | w has an equal number of 0s and 1s}

, we can use a similar argument.

Example: Prove
$C = {0^{n} 1^{m} | n \neq m \land n, m \geq 0}$ is nonregular

(Note: While

L (0^{*} 1^{*}) - B = E

can prove this using the closure property of regular languages under set difference, here's the direct Pumping Lemma proof.)

Assume

E

is regular. Let

p \geq 1

be the pumping length. Select

w = 0^{p} 1^{p + p!}

. This

w \in E

(since

p \neq p + p!

) and

| w | = 2 p + p! \geq p

.
By the Pumping Lemma,

w = x y z

must satisfy the three conditions.
Since

| x y | \leq p

x

and

y

can only contain '0's. Let

y = 0^{k}

for some

k \geq 1

.
Consider

x y^{i} z

. The number of '0's will be

p + (i - 1) k

, while the number of '1's remains

p + p!

.
Choose

i = 1 + \frac{p!}{k}

. (Since

1 \leq k \leq p

k

divides

p!

, so

i

is an integer

\geq 1

.)
For this

i

x y^{i} z = 0^{p + p!} 1^{p + p!}

In this string, the number of '0's equals the number of '1's.
Therefore,

x y^{i} z \notin E

(as

n = m

). This contradicts the Pumping Lemma. Thus,

E

is nonregular.

Example: Prove
$D = {1^{n^{2}} | n \geq 0}$ is nonregular

Assume

D

is regular. Let

p \geq 1

be the pumping length.
Select

w = 1^{p^{2}}

. This

w \in D

and

| w | = p^{2} \geq p

.
By the Pumping Lemma,

w = x y z

must satisfy the three conditions.
Since

| y | \geq 1

| x y^{2} z | = | x y z | + | y | > p^{2}

.
Since

| x y | \leq p \Rightarrow | y | \leq p

| x y^{2} z | = | x y z | + | y | \leq p^{2} + p < (p + 1)^{2}

Since

| x y^{2} z |

is not a perfect square,

x y^{2} z \notin D

. This contradicts the Pumping Lemma. Thus,

D

is nonregular.

Takeaways

NFA、DFA 是等價的
一個 language 是 regular，若且唯若一個 FA recognize 該 language
一個 language 是 regular，若且唯若一個 regular expression 描述該 language
Regular languages 在一些 Operations 下是封閉的
Pumping Lemma 可以用來證明某些 language 是 nonregular

2 Context-Free Languages

Context-Free Grammar: 比 Regular Expression 更強大的描述語言的方式。
Context-Free Language(CFL): 由 Context-Free Grammar 生成的語言，Regular Language 是 Context-Free Language 的子集。
Pushdown Automaton(PDA): 一種有 stack 的 automaton，可以辨識 Context-Free Language。

Context-Free Grammar (CFG)

主要由許多規則組成，把 symbol 分為 variable (包含 start variable) 與 terminal，規則的形式為

V \to w

，其中

V

是 variable，

w

是 terminal 或 variable 的串接。
衍生或生成 (derivation, generation) 一個字串的過程是：

寫下 start variable
對於每個 variable，找到左邊相同的規則
把變數替換為右側，直到整個字串只剩下 terminal 為止

所有某 grammar

G

能生成的 strings 的集合稱為 language of the grammar，一樣是

L (G)

。
能被某 context-free grammar

G

生成的 language 稱為 context-free language。
若同個 variable 有多個規則，可以以 | 串接，比如

A \to 0 A 1 | B

Formal Definition of CFG

(V, Σ, R, S)

是一個 context-free grammar，其中：

$V$ 是 variable 的有限集合。
$Σ$ 是 terminal 的有限集合，
$V \cap Σ = \emptyset$ 。
$R$ 是規則的有限集合，每條規則的形式為
$v \to w$ ，其中
$v \in V$ ，
$w \in (V \cup Σ)^{*}$ 。
$S \in V$ 是 start variable。

以下大寫代表 variable，小寫代表 terminal。
yield 指替換一步，比如

u A v

以

A \to w

替換為

u w v

，

u A v

yields

u w v

，記為

u A v \Rightarrow u w v

。
derivation 指替換任意步，記為

u A v \xRightarrow * u w v

，表示

u A v

經過任意次替換後變成

u w v

。
因此 language of a grammar 寫為

{w \in Σ^{*} | S \xRightarrow * w}

。

Examples of CFG

${0^{n} 1^{n} | n \geq 0}$ :
$S \to 0 S 1 | ϵ$
對 a 的加與乘運算:
- $<EXPR> \to <EXPR> + <TERM> | <TERM>$
- $TERM \to <TERM> \times <FACTOR> | <FACTOR>$
- $FACTOR \to (<EXPR>) | a$

其中第二個例子解釋先乘後加的規則可以由生成規則定義，由生成結構判斷順序：比如

a + a \times a

與

(a + a) \times a

都只有一種生成方式，所以很明顯要哪個先。

Designing CFG

分開定義，比如
$S \to S_{1} | S_{2}$ ，
$S_{1}, S_{2}$ 再各自生成 language。
對於 regular language，可以轉換 DFA 為 CFG:
- 對每個 DFA 的 state
  $q_{i}$ 建立一個 CFG 的 variable
  $A_{i}$
- 對每個
  $δ (q_{i}, a) = q_{j}$ 的 transition，加入規則
  $A_{i} \to a A_{j}$
- 對每個 accept state
  $q_{acc}$ 加入規則
  $A_{acc} \to ϵ$
- $A_{0}$ 作為 start variable
遞迴結構：以
$A \to U A V$ 等方式定義

Parse Tree

每個 variable 與 terminal 作為節點，以 start variable 為根，每個 variable yield 的 variable 或 terminal 為其子節點，terminal 是 leaf，通常直接畫在最下面一排。
比如上方的「對 a 的加與乘運算」：

      <EXPR>
     /      \
<EXPR>    <TERM>
  |       /     \
  |   <TERM>  <FACTOR>
  |     |        |
  |     |      ( <EXPR> )
  |     |       /     \
  |     |   <EXPR>   <TERM>
  |     |     |        |
  a  +  a x ( a   +    a  )

Ambiguity of CFG

例子的第二個使加號由

<EXPR>

生成，而乘號以

<TERM>

生成。

a + a \times a

因為加法外無括號，必然由 start variable

<EXPR>

生成，而不是

<TERM>

先生成乘號再生成加號。

如果同個 string 能以某 grammar 透過多種不同方式 derive，則 the string is derived ambiguously in the grammar。
若對於某 grammar，存在一個 string which is derived ambiguously，則 the grammar is ambiguous。

注意若只是順序不同，不視為不同方式。可以透過一律由 leftmost derivation (每步都替換最左邊的 variable) 來避免替換順序造成的歧義。

有些 CFL 能由 CFG 生成，但必須由 ambiguous CFG 生成，比如

{0^{i} 1^{j} 2^{k} | i = j \lor j = k}

，稱為 inherently ambiguous。

Chomsky Normal Form

特殊形式，加上限制後可更簡化，只包含三類規則：

$A \to B C$ 。
$A \to a$ 。
$S \to ϵ$ 。
$S$ 是 start variable

Start variable 不出現在任意規則的右側。

所有 CFG 都可以被轉換為 Chomsky Normal Form。以下

W

是 variable 或 terminal 的字串。

加上新的 start variable
$S^{'}$ ，以及
$S^{'} \to S$ 。
能避免 start variable 在任意規則右側。
刪除所有
$ϵ$ 規則 (
$A \to ϵ$ ) ，除了
$S \to ϵ$
對於其他 variable 替換為包含
$A$ 的規則
$B \to W$ ，對於
$W$ 中
$A$ 的集合的每個子集，加入刪除後的規則。
比如
$B \to u A v A w$ 變成
$B \to u v A w | u A v w | u v w$ 。
但若
$B \to A$ ，且之前刪過
$B \to ϵ$ ，則不必加入
$B \to ϵ$ ，其下游都以考慮過。
刪除所有單一變數規則 (
$A \to B$ )
對於所有
$B \to W$ 規則，加入
$A \to W$ 規則。
其他轉換
- 處理右側長度
  $k > 2$ 的規則
  $A \to W_{1} W_{2} \dots W_{k}$ ，新增一些中繼變數
  $A_{1} \dots A_{k - 2}$ ，將其轉換為
  - $A \to W_{1} A_{1}$
  - $A_{1} \to W_{2} A_{2}$
  - $⋮$
  - $A_{k - 2} \to W_{k - 1} W_{k}$
- 處理 terminal 出現在一類規則的右側，對
  $u$ 新增一個中繼變數
  $A_{u}$ ，替換右側的
  $u$ ，並新增規則
  $A_{u} \to u$ 。

Pushdown Automata (PDA)

FA 可視為 State Control 保有狀態，有一個 arrow 指向 input tape，並且只能往右移動。
PDA 作為 FA 的修改版，再加上一個 stack 作為額外的記憶體。
Deterministic PDA 與 Nondeterministic PDA 的能力不等價，但兩者都能 recognize 所有 Context-Free Language，所以以下 PDA 指的都是 Nondeterministic PDA，稱 Deterministic PDA 為 DPDA。

Formal Definition of PDA

註：設 stack 的頂端在左側。

PDA:

M = (Q, Σ, Γ, δ, q_{0}, F)

where

$Q$ 是有限狀態集合。
$Σ$ 是有限集合，作為 input alphabet。
$Γ$ 是有限集合，作為 stack alphabet。
$δ : Q \times Σ_{ϵ} \times Γ_{ϵ} \to P (Q \times Γ_{ϵ})$ 。
$q_{0} \in Q$ 是初始狀態。
$F \subseteq Q$ 是 accept states 的集合。

Computation 課本的定義稍微怪怪的，因此我自己寫。

Let:

$M = (Q, Σ, Γ, δ, q_{0}, F)$ be a PDA.
$w = a_{1} a_{2} \dots a_{m}$ be a string over
$Σ$ .

M

accepts

w

⟺

there exists a sequence of instantaneous descriptions (IDs)

(q_{0}, w_{0}, t_{0}), (q_{1}, w_{1}, t_{1}), \dots, (q_{n}, w_{n}, t_{n}) \in Q \times Σ^{*} \times Γ^{*}

such that:

Initial ID:
$q_{0}$ is the start state,
$w_{0} = w$ is the full input string, and
$t_{0} = ϵ$ (Initial stack is empty)。
Transition: For each
$i \in [0, n - 1]$ ,
$(q_{i}, w_{i}, t_{i}) \Rightarrow (q_{i + 1}, w_{i + 1}, t_{i + 1})$ must satisfy the following:
- Let
  $t_{i} = x t^{'}$ , where
  $x \in Γ_{ϵ}$ .
- Let
  $w_{i} = c w^{'}$ , where
  $c \in Σ_{ϵ}$ .
- There must exist a pair
  $δ (q_{i}, c, x) ∋ (q_{i + 1}, y)$ such that
  $w_{i + 1} = w^{'}$ (
  $c$ is read) and
  $t_{i + 1} = y t^{'}$ (stack:
  $x t^{'}$ to
  $y t^{'}$ ).
Final ID:
- $w_{n} = ϵ$ (the entire input string has been consumed)
- $q_{n} \in F$ (in an accepting state).

在圖示時，可能會在邊上標記

a, x \to y

，代表讀取

a

，stack 頂端要是

x

，移除

x

並將

y

放到 stack 頂端。

Shorthand for PDA

使

δ

能 push string，由右而左放入 stack，方便表示 transition。對於簡寫

δ (q, a, s) ∋ (r, w), w = a_{1} a_{2} \dots a_{n}

，新增 state

q_{1} \dots q_{n - 1}

，以及 transition：

$δ (q, a, s) ∋ (q_{1}, a_{n})$
$δ (q_{1}, ϵ, ϵ) ∋ (q_{2}, a_{n - 1})$
$⋮$
$δ (q_{n - 1}, ϵ, ϵ) ∋ (r, a_{1})$

或是

q \overset{a, s \to r}{\to} r

變成：

q \overset{a, s \to a_{n}}{\to} q_{1} \overset{ϵ, ϵ \to a_{n - 1}}{\to} q_{2} \overset{ϵ, ϵ \to a_{n - 2}}{\to} \dots \overset{ϵ, ϵ \to a_{1}}{\to} r

。

實際上維基上的定義及其他課本會使 transition 能直接 push 一個字串進 stack，能用 shorthand 所以是等價的。
因為狀態是有限的，所以自然能 push 的 string 必須是有限的。

Equivalence of PDA and CFG

A language is context-free

⟺

some PDA recognizes it.

\Rightarrow

: Proof by Construction. Given a context-free grammar

G = (V, Σ, R, S)

, we can construct a PDA

M = (Q, Σ, Γ, δ, q_{0}, F)

that recognizes the language

L (G)

$Q = {q_{start}, q_{loop}, q_{acc}} \cup E$ ，其中
$E$ 是用於 shorthand 新增的額外狀態。
$F = {q_{acc}}$ 。
$\delta(q_{\text{start}},\epsilon,\epsilon)\ni (q_{\text{loop}},S$)$
一開始放入 $$$，確保 accept 時 stack 實質上為空。
放入
$S$ 啟動 derivation。
$δ (q_{loop}, ϵ, A) = {(q_{loop}, w) | A \to w is a rule in R}$
依規則展開
$A$ 。
$δ (q_{loop}, a, a) = {(q_{loop}, ϵ)}$
讀取 input，若與 stack 頂端的 terminal 相同，則繼續比對。
$\delta(q_{\text{loop}},\epsilon,$)={(q_{\text{acc}},\epsilon)}$
Input 與 stack 都讀完，代表輸入與
$S$ derive 的其中一個字串相同。

\Leftarrow

: Proof by Construction. Given a PDA

M = (Q, Σ, Γ, δ, q_{0}, F)

, we can construct a context-free grammar

G = (V, Σ, R, S)

that generates the language

L (M)

首先介紹 Special Form of PDA，以及經過一些看起來很蠢的轉換後把任何 PDA 轉換成這個形式：

在 Accept 時 stack 必須是空的
- 新增 state
  $q_{pop}$
- 從每個原本的 accept state 新增 transition
  $ϵ, ϵ \to ϵ$ 到
  $q_{pop}$
- $\forall x \in Γ, δ (q_{pop}, ϵ, x) ∋ (q_{pop}, ϵ)$ ，使
  $q_{pop}$ 能把 stack pop 到空
只有一個 accept state
- 新增 start state 與 $S'\xrightarrow{\epsilon,\epsilon\to $}S$
- 新增 accept state 與 $q_{\text{pop}}\xrightarrow{\epsilon,$\to \epsilon}q_{\text{acc}}$，順便確保 stack 排空
每個 transition 必須 push 或 pop 一個字元
- 把
  $p \overset{a, b \to c}{\to} q$ 替換為
  $p \overset{a, b \to ϵ}{\to} s \overset{ϵ, ϵ \to c}{\to} q$ ，
  $s$ 是新增狀態
- 把
  $p \overset{a, ϵ \to ϵ}{\to} q$ 替換為
  $p \overset{a, ϵ \to b}{\to} s \overset{ϵ, b \to ϵ}{\to} q$ ，
  $s$ 是新增狀態，
  $b$ 是任意的 stack symbol

對於

M

的每對 states

p, q

，Let

A_{p q} \in V

，想要使

A_{p q}

derives

x ⟺

x 能使機器從 (

p

, empty stack) 走到 (

q

, empty stack)。
這些字串

\forall w \in Γ^{*}

也能使機器從 (

p

w

) 走到 (

q

w

)。

Construction for

A_{p q}

任意

A_{p p}

不經任何轉移可產生

ϵ

，即

A_{p p} \to ϵ

。
任意

A_{p q}

產生的非空字串

x

在 special form of PDA 上第一步必然是 push，且最後一步是 pop，以維持空 stack，可以分成 push 與 pop 的字元是否相同的兩種情況：

兩者相同：假設
$p \overset{a}{\to} r \dots s \overset{b}{\to} q$ ，則新增規則
$A_{p q} \to a A_{r s} b$

$\forall p, q, r, s \in Q, u \in Γ, and a, b \in Σ_{ϵ}$ , if
$δ (p, a, ϵ) ∋ (r, u) \land δ (s, b, u) ∋ (q, ϵ)$ , put
$A_{p q} \to a A_{r s} b$ in
$G$ .
如果
$A_{r s}$ 是
$\emptyset$ 也沒差，反正是 Nondeterministic。
兩者不同：在路徑上必然在經過其中一個 state
$r$ 時 stack 是空的，新增規則
$A_{p q} \to A_{p r} A_{r q}$

$\forall p, q, r \in Q$ , put
$A_{p q} \to A_{p r} A_{r q}$ in
$G$ .
一樣，如果
$A_{p r}$ 或
$A_{r q}$ 是
$\emptyset$ 也沒差。

兩個方向的證明都是很 trivial 的 induction。

Regular languages 是 context-free languages 的子集，因為任何 FA 都自動是 PDA，只要完全不使用 stack 即可。

Non-Context-Free Languages

Some Non-Context-Free Languages:

{a^{n} b^{n} c^{n} | n \geq 0}, {a^{i} b^{j} c^{k} | 0 \leq i \leq j \leq k}, {w w | w \in {0, 1}^{*}}

The Pumping Lemma for Context-Free Languages (CFL)

括號內是 Pumping Lemma for RL

A

is a context-free language

\Rightarrow

There exists a pumping length

p \geq 1

such that every string

s \in A

with

| s | \geq p

can be written as

s = w v x y z

(

= x y z

), satisfying:

$\forall i \geq 0, w v^{i} x y^{i} z \in A$ (
$x y^{i} z \in A$ )
$| v y | \geq 1$ (
$| y | \geq 1$ )
$| v x y | \leq p$ (
$| x y | \leq p$ )

證明：

設

G

為一個 CFG，其生成語言為 CFL

A

。令

b

為

G

中所有規則右邊部分的 symbol 數的最大值。若

b = 1

，則

A

為 alphabet 的子集，對於生成的

s

，令

v = s

且

w = x = y = z = ϵ

就滿足條件。以下假設

b \geq 2

。

在 parse tree 中，是已經 Nondeterministically 從眾多替換選項中選出一個，因此所有節點都至多有

b

個子節點。對於一個高度

h

(深度定義改為由 start variable 走幾步，相當於邊的數量、一般定義的高度減一)，至多有

b^{h}

個葉節點，也就是最終的 terminal 數量、字串長度。換句話說，至少

b^{h} + 1

長的字串必須來自高度至少

h + 1

的 parse tree。

設

p = b^{| V | + 1}

，對於某字串

s \in A, | s | \geq p

，其任意 parse tree 的高度至少

| V | + 1

，取其中節點數最少的。
在此 parse tree 中，至少有個葉節點，其從根節點到葉節點的路徑長至少為

| V | + 1

。取此路徑上最底下的

| V | + 2

個節點，也就是

| V | + 1

個 variable 與最後的 terminal。在 variables 中，根據鴿籠原理，必然有某個 variable

A

重複出現，稱上方的為

A_{u}

，下方的為

A_{d}

。
使

A_{d}

的子樹葉節點為

x

，

A_{u}

的子樹葉節點為

v x y

，

s

的剩餘部分給

w, z

。以下分別檢測三個條件：

$\forall i \geq 0, w v^{i} x y^{i} z \in A$
- 對於
  $i > 1$ ，把
  $A_{d}$ 的子樹遞迴地替換為
  $A_{u}$ 的子樹。
- 對於
  $i = 0$ ，把
  $A_{u}$ 的子樹替換為
  $A_{d}$ 的子樹，變成
  $w x z$ 。
$| v y | \geq 1$
- 此條件僅在
  $v = y = ϵ$ 時不成立，但此時
  $A_{u}$ 子樹葉節點與
  $A_{d}$ 子樹葉節點所代表的字串相同。若以
  $A_{d}$ 的子樹替換
  $A_{u}$ 的子樹，則總節點數變少，與節點數最少的假設矛盾。
$| v x y | \leq p$
- 由於代表
  $v x y$ 的
  $A_{u}$ 的子樹高度至多為
  $| V | + 1$ ，因此至多有
  $b^{| V | + 1} = p$ 個葉節點。

Example: Prove
$A = {0^{n} 1^{n} 2^{n} | n \geq 0}$ is non-CFL

For every

p \geq 1

, select

s = 0^{p} 1^{p} 2^{p}

若
$v, y$ 都只有一種 symbol，則必然
$w v^{2} x y^{2} z$ 的三種 symbol 數量不同。
若
$v, y$ 其一包含兩種 symbol，則
$w v^{2} x y^{2} z$ 的順序不對。
若
$v, y$ 其一包含三種 symbol，則其長度至少
$p + 2$ ，違反
$| v x y | \leq p$ 。

自動機與形式語言筆記

0 Introduction

Notions

基本概念

函數與關係

圖 (Graph)

Strings 與 Languages

Definitions, Theorems, and Proofs

Types of Proofs

1 Regular Languages

Finite Automata

Definition of Finite Automaton

Definition of Computation

Regular Operations

Nondeterminism

Definition of Nondeterministic Finite Automaton (NFA)

Equivalence of NFAs and DFAs

Closure under regular operations

Regular Expressions

速記

Examples

Equivalence with Finite Automata

GNFA

證明過程本體

Nonregular Languages

The Pumping Lemma for Regular Languages

Proof by Construction

Using Pumping Lemma to Prove Nonregularity

Example: Prove B={0n1n|n≥0} is nonregular

Example: Prove C={0n1m|n≠m∧n,m≥0} is nonregular

Example: Prove D={1n2|n≥0} is nonregular

Takeaways

2 Context-Free Languages

Context-Free Grammar (CFG)

Formal Definition of CFG

Examples of CFG

Designing CFG

Parse Tree

Ambiguity of CFG

Chomsky Normal Form

Pushdown Automata (PDA)

Formal Definition of PDA

Shorthand for PDA

Equivalence of PDA and CFG

Non-Context-Free Languages

The Pumping Lemma for Context-Free Languages (CFL)

Example: Prove A={0n1n2n|n≥0} is non-CFL

Read more

作業系統筆記

會計學甲一下筆記

會計學甲一上筆記

Systems Programming 筆記 part 2

Example: Prove
$B = {0^{n} 1^{n} | n \geq 0}$ is nonregular

Example: Prove
$C = {0^{n} 1^{m} | n \neq m \land n, m \geq 0}$ is nonregular

Example: Prove
$D = {1^{n^{2}} | n \geq 0}$ is nonregular

Example: Prove
$A = {0^{n} 1^{n} 2^{n} | n \geq 0}$ is non-CFL