# Computer Theory # Formal Language 語言(Language)是人類之間經過萬年發展以來用來互相溝通的方法,世界上的語言高達有千種以上(包含方言),而人類平常所使用的語言稱之為自然語言(Natural Languages),人類的語言中文法是非常複雜、有許多曖昧文法的存在(Ambiguous Grammar),難以使用機器或是數學來明確規範或是定義,通常這種屬於人文領域 - 語言學(Linguistics)。 而人類有發展另一套語言系統,基於數學公式的定義下且可以提供電腦(機器)理解的語言,這種語言被稱呼為形式語言(Formal Language)。 在這邊,語言可以看做是許多個的字串(String)所組成;而字串是由許多個符號(Symbol)所組成;文法(Grammar)就是規範這個語言的邏輯形式,可以說文法是語言的資料結構、數學公式;而專門設計一個判定某種字串是否符合該語言文法的機器則會被稱呼為 Automaton。 ## Basic Definition * Symbols: characters to compose strings. * Alphabet: a set of symbols,通常表示: $\Sigma = \{a, b \}$. PS: An alphabet must be a finite set. ## Strings ### Definition A string is series of symbols concatenated together, eg: $w = a_1a_2a_3...a_n$ ### Operations * String length: number of symbols in a string, 記做: $|w| = n$。 * Concatenation: $w = x \circ y$ * Repetition: $w^i = \underbrace{w \circ w \circ w ... w}_{i}$, if $i$ D= 0 then empty string. * Reverse: $w^R = a_{n-1}a_{n-2}...a_1$ * Kleene Star: repeating a string by 0, 1, 2, ... times, denote as: $w^*$。 * $\Sigma^*$: a set of all the strings defined over $\Sigma$. * $\varepsilon$: empty string containing no symbol. > We usually use $w, x, y$ and $z$ to denote string and use $a, b, c$ and $d$ to represent symbols. ## Languages ### Definition A language is a set of strings defined over an alphabet. Example: $$ \Sigma = \{a, b\} \\ L = \{aa, aab, aabb, ... \} $$ $L$: a language defined over the alphabet $\Sigma$. ### Infinite vs Finite Languages It is an infinte language if it contains an infinite number of strings. Otherwise, it's a finite language. ### Countable Languages are usually countable. ### Operations Languages are sets, so that set operators are applicable to languages. **Languages are closed under set operations**. * Union: $L_1 \cap L_2$. * Intersection: $L_1 \cup L_2$. * Difference: $L_1 - L_2$. * Complement: $L^,$. * Concatenation: $L = L_1 \circ L_2$. * Repetition: $L^i$. (Select i strings from L and concatenate the string to form a new string. In such way, we define a new language) * Kleene Star: $L^* = \{L^0, L^1, L^2, L^3, ...\}$ * Empty Set: $L^0$ or denoted as $\Psi$. > Empty string $\varepsilon$ is not in empty language $\Psi$. ## Grammars ### Definition We define a language by specifying a set of rules, these rules are called the **Grammar** of the language; The rules (recursive rules) to produce strings of a language. ### Components * Variables $V$ (Non-Terminals) * Symbols representing substring (to be generated). * To be replaced by substrings in the deviation steps. * Terminals $T$ * Symbols from the alphabet. * To form(建構) strings. * Can't be replaced. * Rules $G$: * The methods (formula) to replace variable with strings. * Denoted as: $xyz \rightarrow abc$ * Meaning: replace "xyz" by "abc", where "xyz" must contain at least one vatiable. ## Membership and Decision Problem Q: Given a string $w$, how to verify whether it is a member of the language $L$? A: We can based on the grammar of language to design a machine $M$, let $M$ decides whether $w$ is in $L$. ## Chomsky Hierarchy 以前的人用數學的方式,研發了四種語言的規律,由嚴謹到寬鬆排列為: * Regular Language * Context-Free Language * Context-Sensitive Language * Recursively Enumerable Language 這種架構被稱之為 Chomsky Hierarchy,由 Noam Chomsky 在 1956 年提出;其中 Regular Language 和 Context-Free Language 因為較於嚴謹,有實務價值;Regular Language 用於字串搜尋、驗證字串格式;Context-Free Language 用於程式設計語言、檢所網頁資料等。 每個語言都有其對應的 Grammar 和 Automaton,其表格如下: | Class | Language | Grammar | Automata | | -------- | -------- | -------- | -------- | | 3 | Regular | Regular | Finite State Machine | | 2 | Context-Free | Context-Free | Pushdown Automaton | | 1 | Context-Sensitive | Context-Sensitive | Linear-Bounded Automata | | 0 | Recursively Enumerable | Unrestricted | Turing Machine | ## 其他章節 [Regular Language](/vtgB8qkzQWqmqdtti1yclQ) [Context-Free Language](/Ct67jKdeQNefWoBPTv1BpA) [Context-Sensitive Language](/5QUtCXdKRNquHwbmeQemoA) [Turing Machine](/sAbsN-t3QweoXA4Oa5FqNA) ## Reference https://web.ntnu.edu.tw/~algo/Language.html https://srhuang.github.io/formal%20language/2019/09/15/formallanguage-001.html ###### tags: `計算理論`