Regular Language

# Regular Language Regular Language form **the simplest language class**. It is a formal language that can be defined by a **regular expression**. Alternatively, a regular language can be defined as a language recognized by **finite automaton**. The equivalence of regular expressions and finite automata is known as **Kleene's theorem**. There are three ways to describe reuglar language: DFA, regular expression, regular grammars. ## Regular Expressions Regular expression originated in 1951, when mathematician **Stepgen Cole Kleene** described regular language using his mathematical notation called regular events. Regular expressions are recursive formulation (rules) describing how strings are evolved. 1. Any symbol $\sigma$ in $\Sigma$ is a regular expression. 2. Empty string $\varepsilon$ is a regular expression. Two regular expressions are equivalent if they denote the same language. ### Combinatory Rules If $r_1$ and $r_2$ are regular expressions, then the followings are regular expressions: * Parenthesis: $(r_1)$ * Kleene Star: $(r_1)^*$ * Concatenation: $(r_1 \circ r_2)$ * Union: $(r_1 | r_2)$ Nothing eles is regular expression. > 後來看到的 `+`、`?`、`^`、`$` 都是實作版本的正規表達式規則（有分 PCRE、Javascript version），在計算理論只探討理論上的規則（例如括號、星號、串接跟聯集）。 ### Example * $\Sigma = \{a, b\}$ * Regular Expression = $(a|b)^*a(a|b)^*$ * $L = \{\text{a set of strings containing at least one 'a'}\}$ Regular expression entered popular use from 1968 in two uses: pattern matching in a text editor and lexical（語彙的） analysis in a compiler. Unix 的 `grep` 是由 **Ken Thompson** 開發，該程式的全稱為：globally search a regular expression and print，基本上就是一個正規表達式的使用例子。 ## Regular Grammar 請參見[這裡](https://hackmd.io/@23657689/context-free-language#Regular-Grammer)。 https://en.wikipedia.org/wiki/Regular_grammar # Finite State Machine **Warren McCulloch** and **Walter Pitts** introduced finite automata in 1943. Finite State Machine 可以分兩類： 1. Deterministic Finite State Machine 2. Non-Deterministic Finite State Machine ## Deterministic Finite State Machine ### Introduction The control unit decides which action to take. It refers the current state and the symbol read (pointed) by the head before making an action. input: $(q, a) = (\text{current state, input symbol})$ Possiable actions in one step: 1. Moving right by one position (eating a symbol). 2. Changing the current state. The string is **accepted** if the automata is in one of its final states, otherwise is **rejected** by the automata. ### Components * An infinite tape: keeping input string * A state register: inidcating machine conditions * A control unit: deciding actions * A read head: reading & erasing symbols ### Configuration The configuration of a Finite Automata is decided by: * Its current state, $q$. * The contents in the tape, for example "abc". * The position of the read head. * Denoted as $C_i = (\text{current state, tape contents, head position})$. ### Derivation The Finite Automata changes from one configuration to another, we called a **derivation** or a step, and it is controlled by control unit, denoted as $(q_1, \#\underline{a}bcd\#) \rightarrow (q_2, \#\#\underline{b}cd\#)$ . ### Mathematical Model Deterministic Finite Automata is a quintuple representation: $$ M = (K, \Sigma, \delta, s, F) $$ * $K$ : finite set of states, $\{q_0, q_1, ..., q_{m-1}\}$. * $\Sigma$ : alphabet. * $\delta$ : transition function, $K \times \Sigma \rightarrow K$. * $s$ : start state. * $F$ : set of final states. #### The Set of States The state of set, $K$, contains a finite number of states. * One state is designated as the start (intital) state. * Some of the states are designated as the final states, must at least 1 final state. #### The Transition Function The Transition Function is the program of the Finite Automata, it directs the action of the finite automata, denoted as $\delta(q, a) \rightarrow p$. ```c= if (state == q && symbol == 'a') { state = p; moving right by 1 cell; } ``` Therefore, $\delta$ contains a list of rules. The transition function is fully and uniquely defined over all possible combinations of $\{\text{state, symbol}\}$. The finite automata exactly one action at each step and each configuration, this type of finite automata form the class of **Deterministic Finite Automata**. ### Language of DFA DFA can be used for solving decision problems of language. The language of a DFA $M$ is denoted as $L(M)$. #### Regular Language $L(M)$ is the set of strings $w$ accepted by $M$. A language $L$ is called regular if and only if there exists some DFA $M$ such that, $L = L(M)$. Finite Automata: Language recognizer. Regular Expression (Regular Grammar): Language generator. ### Transition Diagram The mathematical model is tedious（使人厭煩、冗長乏味）. Using **direct graphs** to represent DFA is a better strategy. There are 4 rules: 1. States are represented as vertices. 2. Transition rules: arcs with labels. 3. Final State: double circled vertices. 4. The initial state is marked with ">". ### Example $M = (K, \Sigma, \delta, s, F)$, where: * $K = \{q_0, q_1\}$ * $\Sigma = \{a, b\}$ * $s = q_0$ * $F = \{q_0\}$ * $\delta$ is defined by: | $q$ | $\sigma$ | $\delta(q, \sigma)$ | | -------- | -------- | -------- | | $q_0$ | $a$ | $q_0$ | | $q_0$ | $b$ | $q_1$ | | $q_1$ | $a$ | $q_1$ | | $q_1$ | $b$ | $q_0$ | The transition diagram will be: ![](https://i.imgur.com/69JSTEV.png) ## Non-Deterministic Finite State Machine NFAs were introduced in 1959 by **Michael O. Rabin** and **Dana Scott**. ### Introduction * A new computatinal model. * Easier to design. * Non-deterministic behaviors * Ambiguous computation * Undefined transition rules * $\varepsilon$-transitions > The classes of DFA and NFA are equally powerfully. Design NFA : * Replace the transition function with a **transition relation** $\Delta(q, a)$. * A string $w$ may produce multiple paths in the transition diagram. Result: * If string $w$ is not in $L(M)$, it is always rejected. * if string $w$ in $L(M)$, it may be rejected or accepted by $M$. NFA is illusive（虛幻的）, but it is more "flexibility", and an NFA can always be converted into a DFA. > We can use NFA to prove that regular expressions are equivalent to finite automata. Given an NFA, we can convert it into a DFA in **exponential time** steps. Given an NFA, we can convert it into equivalent regular expressions in **polynomial time**. Given some regular expressions, we can construct an equivalent NFA in **polynomial time**. ### ε-transition An ε-transition = an arc with an ε-label, that meaning: The machine(NFA) can changes its state without refering the input symbol. Denoted as $\Delta(q, \varepsilon) = p$, $\varepsilon$ means no symbol. ### Mathematical Model Non-deterministic Finite Automata is a quintuple representation: $$ M = (K, \Sigma, \Delta, s, F) $$ * $K$ : finite set of states, $\{q_0, q_1, ..., q_{m-1}\}$. * $\Sigma$ : alphabet. * $\Delta$ : transition relation, $K \times (\Sigma \cup \{\varepsilon\}) \rightarrow K$. * $s$ : start state. * $F$ : set of final states. #### Difference between $\Delta$ and $\delta$ $\Delta(q, a)$ is **partially defined**, it may be ambiguous. ## Regular Expression to NFA Thompson's construction https://en.wikipedia.org/wiki/Thompson%27s_construction ## NFA to DFA Theorem: an NFA can always be converted into an equivalent DFA. All DFA are NFA by definition. ### Proof by deduction 1. Design an algorithm which converts any NFA to an equivalent DFA in **exponential time** steps. (Therefore NFA is a subset of DFA.) 2. DFA form a subset of NFA (By definition). 3. Two sets are subsets of each. 4. Therefore, NFA = DFA. #### ε-Closures of States Assume $M$ is an NFA, and $q$ is a state of $M$. $$ M = (K, \Sigma, \Delta, s, F) $$ Definition: $\varepsilon$-closure of state $q$ is defined by: $E(q) = \{ p \quad | \quad p \quad \text{is a state of} \quad M \quad \text{and is reachable from} \quad q \quad \text{via} \quad \varepsilon \text{-transitions} \}$ ### Algorithm 假設我們已經有一個 NFA，根據定義為： $$ M = (K, \Sigma, \Delta, s, F) $$ 而轉換過後的 DFA 為： $$ M' = (K', \Sigma, \delta, s', F') $$ 其中： * $K' =$ Power set of $K$. * $s' =$ $E(s)$ (ε-closure of $s$). * $F' =$ Set of any subsets of $K$ containing some final states = Set of states of $K'$ which contain at least one final state of $K$. 所以演算法的步驟為： 1. Generate the **power set** $K'$ of $K$. 2. Compute ε-closure of all the states in $K$. 3. Identify the new **initial state** and **final states**. 4. Design the transition function of the DFA. For each $Q$ if $K'$, create the following transition rule for the DFA: $\delta(Q, a) = \cup E(p)$, if $\Delta(q, a) = p$, where $Q = E(q)$ and $q \in Q$. > enumerate all states $q$ of $Q$, collect transition rules of $q$, union the reachable states via the transition rules. 5. Draw a directed graph to represent the DFA. #### Power Set 冪集定義請看[這](https://zh.wikipedia.org/wiki/%E5%86%AA%E9%9B%86)。 #### Time Complexity Wost Time: $O(2^N)$, where $N$ is the number of states in the NFA. ### More Efficient Alorithm ```c++= create the inital state s' = E(s); enqueue(s'); while(Q_NOT_EMPTY) { q = dequeue(); for each a in Sigma do { \\ Use Delta-transition + \varepsilon-closures create p = \delta(q, a); if (p has not been created) { add p to the DFA; create transition from q to p; enqueue(p); } } } ``` ### Example ![](https://i.imgur.com/K8TTGjD.png) 1. Compute ε-closure of all the states in $K$. $E(q0) = \{q0, q1, q2, q3, q4\}$ $E(q1) = \{q1, q2, q4, q3\}$ $E(q2) = \{q2, q4, q3\}$ $E(q3) = \{q3\}$ $E(q4) = \{q4, q3\}$ 2. Identify the new **initial state**. Initial state $s' = E(s) = E(q0) = \{q0, q1, q2, q3, q4\}$. ![](https://i.imgur.com/RiOLUbE.png) 3. 執行演算法先看 $\{q0, q1, q2, q4, q3\}$ ，從 $q0$ 只走 $a$ 可以到 $\{q2\}$；從 $q1$ 只走 $a$ 無路可走；從 $q2$ 只走 $a$ 可以到 $\{q3\}$；從 $q3$ 只走 $a$ 無路可走；從 $q4$ 只走 $a$ 可以到 $\{q0\}$，所以總共可以走 $\{q0, q2, q3\}$。然後再來看 $\{q0, q2, q3\}$，把每個 state 的 ε-closure，合併起來，可以得 $\{q0, q1, q2, q3, q4\}$，而該 state 已經被建立過，所以無需再建立一個；結果如下所示： ![](https://i.imgur.com/dX65ENZ.png) 再來看 $b$，從 $q0$ 出發什麼都走不了；從 $q1$ 出發可以到 $\{q4\}$；從 $q2$ 出發什麼都走不了；從 $q3$ 出發什麼都走不了；從 $q4$ 出發什麼都走不了，所以總共可以走 $\{q4\}$。而 $\{q4\}$ 的 ε-closure 為 $\{q3, q4\}$，所以合併起來為 $\{q3, q4\}$。 ![](https://i.imgur.com/uW2OX0C.png) 再來看 $\{q3, q4\}$ state，從 $q3$ 走 $a$ 什麼都走不了；從 $q4$ 走 $a$ 可以到 $q0$，但加上 $q0$ 的 ε-closure，最終會等於 $\{q0, q1, q2, q3, q4\}$，所以: ![](https://i.imgur.com/Kb8eMnU.png) 來看走 $b$，從 $q3$ 和 $q4$ 走什麼都走不了，所以要創建一個空集合的 state，而在這一個 state 不管之後是甚麼字母都會走不出去。 ![](https://i.imgur.com/Fsseahf.png) 目前所有的 state 對於每一個 symbol 都有 arc 出來，最後我們必須補上 finite state，只要有 state 內至少包含一個 NFA 的 finite state，就算是 finite state，所以最終結果如下圖 ![](https://i.imgur.com/BeZzcEo.png) ## Minimize DFA Theorem: Given a DFA $M$, we can minimize the number of states of $M$ in **polynomial time**. To verify whether two DFA are equivalent or not? * Reduce them into the minimum DFA. * Compare their topological structures. * If equal, they're the same; Otherwise, they are different. To verify whether two NFA => converted them into DFA at first. ### Algorithm 01 1. Let n = 0; 2. Compute equivalence classes for $\equiv_{0}$; 3. Increase $n = n+1$; 4. Compute equivalence classes for $\equiv_{n}$; 5. If the equivalence classes are invariant, stop; 6. Otherwise go to step 3; 7. Merge states in the same equivalence class; 8. Create the minimized DFA; #### Equivalence Relation of States * Let $p$ and $q$ be 2 states of DFA $M$, we say: $$ p \equiv_n q $$ if: $$ (p, w) \mapsto^*_M (f_i, \varepsilon) \quad iff \quad (q, w) \mapsto^*_M (f_i, \varepsilon) \\ \forall w \in \Sigma^* \quad |w| \leq n; $$ ![](https://i.imgur.com/bzlbXtH.png) > 如圖，當 $q$ 和 $p$，它們遇到 $w1$ 之後都是會走到非接受的狀態，而遇到 $w2$ 都可以是 final state，這樣就可以說 $q$ 和 $p$ 是 Equivalence，在這樣子的情況下狀態 $q$ 與 $p$ 就可以合併成一個 state。 Meaning: The two states $p$ and $q$ are **n-equivalent**, for any string $w$ in $\Sigma^*$, $|w| \leq n$, and $w$ lead $M$ from $p$ into a final state if and only if $w$ lead $M$ from $q$ into a final state too. But the input string's lengths may $\geq n$, so that we need more observation. **Lemma**: $p \equiv_{n+1} q$ iff $p \equiv_{n} q$ and $\delta(p, a) \equiv_{n} \delta(q, a), \forall a \in \Sigma$. 解釋： 1. $p$ and $q$ are already equivalent under $\equiv_{n}$, and $\delta(p, a) \equiv_{n} \delta(q, a) \rightarrow r_1 \equiv_{n} r_2$. 2. We can say that for any string $y = a \cdot w$, y can bring $M$ from $p$ and $q$ to final or non-final states unambiguously, because $r_1 \equiv_{n} r_2$. 3. $|y| = n + 1$, so for all string with lengths $\leq n + 1$, state $p$ and $q$ still are equivalent. ### Hopcroft's Algorithm 這個演算法是基於 **John Hopcroft** 的論文【An n log n Algorithm for Minmizing States in a Finite Automaton】（1971），基於 Myhill-Nerode equivalence relation 理論得出的，在介紹該算法以前必須要先了解 **Myhill-Nerode Theorem**： #### Distinguishable String ![](https://i.imgur.com/nVF54Ur.png) 根據圖，我們可以知道 $xz \in L$，但 $yz \notin L$，所以說 $x$ 和 $y$ 就是 distinguishable（可區分）；如果 $yz \in L$，則 $x$ 和 $y$ 就是 indistinguishable（不可區分）。 #### Distinguishable State Distinguishable State 的定義有三種情況： 1. $p \in F, q \notin F$ 2. $p \notin F, q \in F$ 3. if $\delta(p, a)$ and $\delta(q, a)$ are distinguishable, for some $a \in \Sigma$. 其算法如下：假設有一個 DFA $M$ 如下圖： ![](https://i.imgur.com/11beFDl.png) 其中 $|K| = 6$，則建立一個 $6 \times 6$ 的表格，但只需要砍斜對角即可，這個表格稱之為 Indistinguishable Table： ![](https://i.imgur.com/dDULsdK.png) 先從 Final State 看起，State 3 跟任何的 State （除了 5）之外都是屬於 Distinguishable；而 State 5 跟任何的 State （除了 3）之外也一樣，所以我們可以記做 $F$ 代表表格上對應的兩個 State 搭配起來是 Distinguishable State。 ![](https://i.imgur.com/GbSlXCG.png) 接著就一個一個看，從 State 2 with 1 看起，走 a 的結果是 {4, 2}，該格尚未紀錄，不做任何事；看走 b 結果是 {3, 4}，而 {3, 4} 已經是 Distinguishable State 了，故 State 2 with 1 是 Distinguishable State。 State 4 with 1，走 a 的結果是 {4, 2}，該格尚未紀錄，不做任何事；看走 b 結果是 {5, 4}，該格有紀錄是 F，所以 State 4 with 1 是 Distinguishable State。 State 4 with 2，走 a 的結果是 {4, 2}，該格尚未紀錄，不做任何事；看走 b 結果是 {3, 5}，該格尚未紀錄，不做任何事。 State 5 with 3，走 a 的結果是 {5, 3}，該格尚未紀錄，不做任何事；看走 b 結果是 {5, 3}，該格尚未紀錄，不做任何事。（這代表這兩個 State 注定是 Indistinguishable 了） State 6 with 1，走 a 的結果是 {6, 2}，該格尚未紀錄，不做任何事；看走 b 結果是 {5, 4}，該格有紀錄是 F，所以 State 6 with 1 是 Distinguishable State。 State 6 with 2，走 a 的結果是 {6, 4}，該格尚未紀錄，不做任何事；看走 b 結果是 {5, 3}，該格尚未紀錄，不做任何事。 State 6 with 4，走 a 的結果是 {6, 4}，該格尚未紀錄，不做任何事；看走 b 結果是 {5, 5}，而任何節點的自己搭配自己一定是 Indistinguishable。最後表格會長這樣: ![](https://i.imgur.com/0AbcKMo.png) 最後還需要在跑一次，確認 {4, 2}、{5, 3}、{6, 2}、{6, 4} 是無法被登記成 F 的話，就代表這些 State 是可以被合併的。最後節點 2、4、6 可以被合併成一個，而 3 與 5 合併成一個，並且為 Final State，所以最後化簡的 DFA 為： ![](https://i.imgur.com/hmKofED.png) ## Closure Properties Regular language are cloesd under the following set operators: * Union * Concatenation * Kleene Star * Complement * Interscrion * Difference The languages generated by applying these operators on regular languages are regular too. 1. Regular Expressions can be converted into NFA. 2. Any FA can be converted into regular expressions. 3. Regular Languages = Regular Expressions = Finite Automata. # Non-Regular Language What is Non-regular language? Try to create a DFA accepting the language, if the number of states which occur is finite, the language is regular, otherwise, it is non-regular. How do we prove a language is regular? 1. Design an FA for the language. 2. Design regular expressions for the language. 3. By using **closure properties** to construct it. How to prove that a given language is not regular? 1. By using the **pumping theory** of regular language. 2. By using the **Myhill-Nerode theorem**. 3. By using **closure properties**. ## Pumping Lemma If $L$ is a regular language, there exists a constant $n$ (pumping length) such that for any string $w$ in $L$ and $|w| \geq n$, $w$ can be divided into $w=xyz$, satisfying the following conditions: 1. $\forall i \geq 0, xy^iz \in L$ 2. $|y| > 0$ 3. $|xy| \leq n$ 基本上 Pumping Lemma 是講說，當你一個語言 $L$ 是正規語言的時候，勢必存在一個 DFA $M$ 來接受這個語言，其狀態數量是有限的（finite），假設 $M$ 的狀態集合 $K$ 的數量為 $|K|$ 個，如果你輸入的字串 $w$ 長度夠長（至少大於等於），根據鴿籠定理（Pigeonhole Principle），勢必會有某狀態重覆，而重覆的那一個子字串就會是 $y$。由於一般來說，我們不會知道該語言的 DFA 狀態數量是多少，所以這個字串 $w$ 要多長也不知道，但至少知道一定存在某數，所以這個長度我們稱呼為 Pumping Length，可以用 $n$ （或是 $p$）來表示。 $y$ 不可以是 $\varepsilon$ 空字串，$y$ 長度至少要大於 0，但 $x$ 跟 $z$ 就可以為空，而且 $|xy| \leq n$，來控制重覆部分只會發生在 $y$ 字串中，而非是 $x$。 Pumping Lemma is used to prove that a language is NOT regular, **it can not be use to prove that a language is regular**. Use this logic: if $P$ then $Q$ $=$ if $\neg Q$ then $\neg P$. > Try to find a string $w$ of $L$, which does not meet the condition, to deny that $L$ is regular. ### Proof Steps 1. Assume $L$ is regular and accepted by a DFA $M$. 2. Select a string $w$ which is in $L$ and long enough. 3. Try all possiable ways to divide $w$ into $xyz$. 4. For each possible division, repeat $y$ and verify that $xy^iz$ is NOT always in $L$. 5. Therefore, $L$ does not possess the previous property and it is NOT regular. ### Example Proof this language is NOT regular: $$ L = \{a^nb^n, n\geq 0\} $$ The proof is by contradiction. 1. Assuming $L$ is regular. Let $p$ be the pumping length given by the pumping lemma. 2. we can choose a string $w = a^pb^p$, that $w \in L$ and $|w| \geq p$. 3. The pumping lemma guarantees that $w$ can be split into three pieces, $w = xyz$, and for any $i \geq 0$ such that $xy^iz \in L$. 4. We consider three cases to show that this result is impossible: (1). $y$ consists only of $a$s, pumping $y$ makes #$(a)$ $>$ #$(b)$. In this case the string $w$ has more $a$s than $b$s and so is not in $L$, violating conditon of the pumping lemma, This case is a contradiction. (2). $y$ consists only of $b$s, This case also gives a contradiction. (3). $y$ contains both $a$s and $b$s, pumping $y$ results in incorrect orders of $a$ and $b$, $y = a^mb^l \rightarrow y^2=a^mb^la^mb^l$, such that $w \notin L$, which is a contradiction. ## Myhill-Nerode Theorem https://en.wikipedia.org/wiki/Myhill%E2%80%93Nerode_theorem https://mropengate.blogspot.com/2015/04/formal-language-ch5-myhill-nerode.html ###### tags: `計算理論`