Regular Expressions

--- tags: compiler construction --- $\newcommand{\sem}[1]{[\![#1]\!]}$ (part of [CC 2022](https://github.com/alexhkurz/compiler-construction-2022/blob/master/lecture-by-lecture.md)) (check out https://www.youtube.com/watch?v=ImnZRGCwBrY) # Regular Expressions ("the book" refers to [Introduction to Automata Theory, Languages, and Computation](http://ce.sharif.edu/courses/94-95/1/ce414-2/resources/root/Text%20Books/Automata/John%20E.%20Hopcroft,%20Rajeev%20Motwani,%20Jeffrey%20D.%20Ullman-Introduction%20to%20Automata%20Theory,%20Languages,%20and%20Computations-Prentice%20Hall%20%282006%29.pdf)) Recall that - a DFA can be seen as the specification of an algorithm searching for a certain "regular" pattern encoded by the automaton - one can build larger automata from smaller one using (amongst others) the operations of - sequential composisiton - parallel composition - repetition - these operations can be seen as an algebra of languages/automata, the implementation of which leads us beyond DFA to NFA and $\epsilon$-NFA. In this lecture we will learn - how to use regular expressions to specify languages/automata - that $\epsilon$-NFA, NFA and DFA have the same expressive power - algorithms of how to convert an NFA to DFA and an $\epsilon$-NFA to an NFA. ## Examples As running examples we can take the following two questions from the book. - Exercise 2.3.4.a (p.66). ![](https://i.imgur.com/J8KER6B.png) - Exercise 2.5.3.a (p.79). ![](https://i.imgur.com/UkHngTK.png) Moreover, we want to do this in a systematic fashion so that we can apply the same method to other examples. This method involves to equivalent but different formalisms, non-deterministic finite automata and regular expressions. ## Non-Deterministic Automata Non-determinstic automata were introduced by [Rabin and Scott](https://www.researchgate.net/publication/230876408_Finite_Automata_and_Their_Decision_Problems) in 1959. Both authors made much deeper contributions to computer science (such as zero knowledge proofs and domain theory), but non-determinstic automata were important enough to grant them the "Nobel price of computer science", the Turing award in 1976. Most algorithms are deterministic and, until not so long ago, implementations on a modern computer were deterministic (even those using random numbers). So why are non-determinstic machines so important? - Specifications as opposed to implentations typically do not determine computations completely. - Modelling open systems, the environment is often best modelled non-determinstically. For example, - an operating system interacting with a keyboard cannot predict which keys will be pressed next - in security modelling an attacker must allow the intruder any possible action - In the presence of concurrency or parallelism, the precise timing or even order of actions is not determined. - In algorithms, non-deterministic machines are crucial to describe the possibly most important class of algorithms, namely those in non-deterministic polynomial time (NP). - Non-deterministic automata and deterministic automata are equivalent, but non-determinstic automata can be exponentially smaller. ## Regular Expressions ### Syntax Regular expressions are defined via the abstract syntax $$e ::= \epsilon \mid \varnothing \mid a \mid e+e \mid e\ e \mid e^*$$ where $a$ ranges over an alphabet. **Remark on Notation:** - Sometimes the centerdot $e\cdot e$ is used instead of $e\ e$. - The book writes $\bf 0$ for $\varnothing$ and $\bf 1$ for $\varepsilon$. This is good notation, but bold fonts are difficult to do on the whiteboard (and ordinary fonts could be confused with the letters $0,1\in\Sigma$). We denote the set of regular expressions as RE. ### Denotational Semantics The semantics, or meaning, of a regular expression is a language, that is, a set of words over an alphabet $\Sigma$. \begin{align} \sem{-}: \mathsf{RE} & \to \mathsf{LANG}\\ \sem{\epsilon} & = \{\epsilon\}\\ \sem{\varnothing} & = \{\}\\ \sem{a} & = \{a\}\\ \sem{e_1+e_2} & = \sem{e_1}\cup \sem{e_2}\\ \sem{e_1\ e_2} & = \sem{e_1}\cdot \sem{e_2}\\ \sem{e^\ast} & = \sem{e}^\ast \end{align} where the union and concatenation and Kleene star of languages were defined in the lecture on [composing automata](https://hackmd.io/@alexhkurz/ryV_FU7XI). Note that the mathematical definition is declarative in the sense that it is a direct translation of the English specification; indeed we can read it as - the meaning of $\epsilon$ is the empty word - the meaning of $\varnothing$ is the empty language - the meaning of $a$ is the language that constists just of $a$ - the meaning of $+$ is union - the meaning of $\cdot$ is concatenation - the meaning of $^\ast$ is repetition ### Operational Semantics To interpret regular expressions as algorithms, we translate them into automata. We translate the regular expression first into an NFA and then the NFA into a DFA. (Looking back, this was the reason why we introduced NFAs in the first place.) Once this translation is understood one can introduce various optimisations. For example, one can do the translation in one step instead of two. Moreover, it is not always necessary to construct all of the NFA/DFA first and then search but rather one constructs the automaton "on the fly", that is, one only constructs those parts of the automaton that is actually needed in the search. See [this article](https://swtch.com/~rsc/regexp/regexp1.html) for more details on implementing regular expressions. Let $\mathit{RE}$ be the set of all regular expressions and $\epsilon\mathsf{NFA}$ be the class of all NFAs with $\epsilon$ transitions. We call $\mathit{nfa}$ the function that compiles a regular expression to an $\epsilon$NFA. It is defined by the following pseudocode in a functional programming language. \begin{align} \mathit{nfa}:\mathsf{RE}& \to\epsilon\mathsf{NFA}\\ \hline\mathit{nfa}(\epsilon) & = A_\epsilon\\ \mathit{nfa}(\varnothing) & = A_\varnothing\\ \mathit{nfa}(a) & = A_a\\ \mathit{nfa}(e_1+e_2) & = \mathit{nfa}(e_1)\mid\mathit{nfa}(e_2)\\ \mathit{nfa}(e_1\cdot e_2) & = \mathit{nfa}(e_1)\cdot \mathit{nfa}(e_2)\\ \mathit{nfa}(e^\ast) & = \mathit{nfa}^\ast\\ \end{align} where the notation on the right-hand sides was explained in the lecture on [composing automata](https://hackmd.io/@alexhkurz/ryV_FU7XI). **Remark:** If this mathematical definition is terse, have a look at this excellent [article](https://swtch.com/~rsc/regexp/regexp1.html) where you find essentially the same definition described pictorially: ![](https://i.imgur.com/poMeWBD.png =700x) **Remark:** The construction of an automaton from a regular expression is conceptually simple and elegant. If you are interested in an in-depth study see the survey [From Finite Automata to Regular Expressions and Back](https://arxiv.org/pdf/1405.5594.pdf). Of course, we would expect that the translation from regular expressions to automata preserves the meaning of the regular expression. In more detail, if $e$ is a regular expression with language $\sem{e}$, then the language $\sem{\mathit{nfa}(e)}$ accepted by the automaton $\mathit{nfa}(e)$ should be $\sem{e}$ as well. We state this as a theorem. **Theorem (Kleene):** For every regular expression $e$ there is an equivalent NFA (and hence DFA), in symbols $$\quad\sem{e} = \sem{\mathit{nfa}(e)}.$$ Conversely, for any DFA $A$ there is a regular expession $e$ such that the $\sem{e}$ is the language of $A$, in symbols, $\sem{A}=\sem{e}$. **Remark:** Kleene only considered DFAs, NFAs where introduced later by Rabin and Scott. NFAs with $\epsilon$ transitions are due to McNaughton and Yamada. *Proof:* The proof of the first statement is just an explication of the method we used to find the definition of the operations $\mid$ and $;$ and $^\ast$ on automata. For example, the "asynchronous parallel composition" $A_1\mid A_2$ is defined exactly so that $$\mathit{nfa}(e_1+e_2) = \mathit{nfa}(e_1)\mid\mathit{nfa}(e_2)$$ is true. I leave it as an exercise to convince yourself that this is true in this case and the other cases. For the second statement, the construction of a regular expression is more complicated but you can consult Section 3.2.1 (page 91) of the book. ## <font color=red>Homework - [Homework](https://hackmd.io/@alexhkurz/S1EVYe7bO) on regular expressions. </font> --- (this is how far we came on Thursday in L2.2) ## Converting NFAs to DFAs We cover this in the lectures. For a write-up see the [Short Introduction to Automata and Haskell](https://hackmd.io/zo6eujQmTQyMoFquFxdv0Q?view#Converting-NFAs-to-DFAs) as well as Section 2.3.5 of the book. ## Eliminating $\epsilon$-Transitions We cover this in the lectures. For a write-up see Section 2.5.5 of the book. ## Homework - [Homework](https://hackmd.io/@alexhkurz/HJ1BAFYbd) on converting NFA to DFA. ## Optional Homework and Further Reading - Do the "Exercise on Converting NFAs to DFAs" from the [Short Introduction to Automata and Haskell](https://hackmd.io/zo6eujQmTQyMoFquFxdv0Q?view#Converting-NFAs-to-DFAs) which asks you to implement the algorithm in Haskell (you get a template, so it is just 3 lines of code). - I didn't have time to think about how to do $\epsilon$-transitions in Haskell. I'd be curious to know if you gave it a try. - Have a look at the article [(Rabin and Scott, 1959)](https://www.researchgate.net/publication/230876408_Finite_Automata_and_Their_Decision_Problems) which introduced non-deterministic automata. Both authors later made much deeper contributions to computer science (such as zero knowledge proofs and domain theory), but non-determinstic automata were important enough to grant them the "Nobel price of computer science", the Turing award, in 1976. Most algorithms are deterministic and, until not so long ago, implementations on a modern computer were deterministic (even those using random numbers). So why are non-determinstic machines so important? - Specifications as opposed to implentations typically do not determine computations completely. - Modelling open systems, the environment is often best modelled non-determinstically. For example, - an operating system interacting with a keyboard cannot predict which keys will be pressed next - in security modelling an attacker must allow the intruder any possible action - In the presence of concurrency or parallelism, the precise timing or even order of actions is not determined. - In algorithms, non-deterministic machines are crucial to describe the possibly most important class of algorithms, namely those in non-deterministic polynomial time (NP). - Non-deterministic automata and deterministic automata are equivalent, but non-determinstic automata can be exponentially smaller. - Have a look at Kleene's orginal [Representation of Events in Nerve Nets and Finite Automata](http://www.dlsi.ua.es/~mlf/nnafmc/papers/kleene56representation.pdf) is difficult to follow today. It is still worth looking at. Here are some of my takeaways and questions. - The paper was motivated by what we would call today neural networks. The famous paper [(McCulloch and Pitts, 1943)](https://www.cs.cmu.edu/~./epxing/Class/10715/reading/McCulloch.and.Pitts.pdf) which Kleene takes as a starting point was very influential in many areas including AI and machine learning. - It is also interesting to note that Turing machines are only mentioned in the last paragraph (before the appendix). While we see DFAs and Turing machines as members of the same hierarchy of machines, it is not clear how much Kleene's paper was influenced by Turing. - Kleene's "regular events" (Section 7) are easily recognisable as what we call regular expressions today. Our theorem from the lecture, that each regular expression corresponds to a finite automaton should be Theorem 4. How different is Kleene's proof from ours? Does he allow non-deterministic automata? - Kleene's "finite automata" (Section 8) do not yet have the clean mathematical definition we use today. - Regular languages are defined in Section 9 just before Lemma 7. - I didn't try to understand the proof that every finite automaton can be described by a regular expression (Theorem 5 in Section 9). It would be an interesting exercise to translate the proof in to modern language and then to determine which of the different textbook proofs correspond to it. ## Acknowledgements [Polyphia](https://www.youtube.com/watch?v=2w9hxLYIfO4&list=PLDCdjwiC90TF9x9orSv_voDSiLRt7_i10) for the music (and Donner for the recommendation)