---
tags: compiler construction, algorithm analysis
---

$\newcommand{\sem}[1]{[\![#1]\!]}$

- part of Algorithm Analysis 2024 ... this is part 2, with part 1 being [Regular Expressions Composing Automata](https://hackmd.io/@alexhkurz/Hk9l8poop)
- check out the
  - [regex-visualizer](https://27t.github.io/regex-visualizer/) and the
  - [regular expression debugger](https://regex101.com/r/66xQpj/1)[^visualizer] and
  - [Regular Expression Matching Can Be Simple And Fast](https://swtch.com/~rsc/regexp/regexp1.html)

[^visualizer]: Careful, notation can be different.

Real-world applications of regexes include search and replace, validating user input, extracting information in web crawlers, scraping, data wrangling, parsing and lexing, syntax highlighting, and much more.

---

# Regular Expressions

Recall from [Regular Expressions Composing Automata](https://hackmd.io/@alexhkurz/Hk9l8poop) that

- a DFA can be seen as the specification of an algorithm searching for a certain "regular" pattern encoded by the automaton
- one can build larger automata from smaller ones using (amongst others) the operations of
  - sequential composition
  - parallel composition
  - repetition (Kleene star)
- these operations can be seen as an algebra of languages/automata, the implementation of which leads us beyond DFAs to NFAs and $\epsilon$-NFAs.

We started the lecture by revising the above with the help of the [regex-visualizer](https://27t.github.io/regex-visualizer/).

In this lecture we will learn

- how to use regular expressions to specify languages/automata
- that $\epsilon$-NFAs, NFAs and DFAs have the same expressive power
- algorithms for converting an NFA to a DFA and an $\epsilon$-NFA to an NFA. [^NFAs]

## Examples

As running examples we can take Exercises 2.3.4.a and 2.5.3.a from the book.

- Exercise 2.3.4.a (p.66). ![](https://i.imgur.com/J8KER6B.png)
- Exercise 2.5.3.a (p.79). ![](https://i.imgur.com/UkHngTK.png)

## Syntax of Regular Expressions

Regular expressions are defined via the abstract syntax [^ast]

$$e ::= \epsilon \mid \varnothing \mid a \mid e+e \mid e\ e \mid e^*$$

where $a$ ranges over an alphabet.

**Remark on Notation:**

- For concatenation, instead of juxtaposition $e\ e$, sometimes the centerdot `\cdot` is used, as in $e\cdot e$, or also $e.e$.
- Other texts in theoretical computer science use $\lambda$ instead of $\epsilon$.
- $e^+$ can mean "repeating at least once", that is, $e^+=e\,e^\ast$.
- In software tools and programming languages one finds a wide variety of notations. In Linux, `*` is used as a wildcard for an arbitrarily long string.
- Dropping parentheses: I adopt rules similar to arithmetic. The unary operation binds more strongly than the binary ones; sequential composition (multiplication) binds more strongly than choice (addition). For example, $a+bc^\ast= a+(b(c^\ast))$.

To experiment with various flavours of regular expressions you can use this [regular expression debugger](https://regex101.com/r/66xQpj/1). It allows you to run a regex step by step on a given input string.

We denote the set of regular expressions by $\mathsf{RE}$.

## Outlook and Summary of the Lecture

The notion of language already played a central role in previous lectures. To repeat:

**Definition:** Given an 'alphabet' $\Sigma$, a **language** $L$ over $\Sigma$ is a subset $L\subseteq \Sigma^\ast$ (where $\Sigma^\ast$ is the set of strings over $\Sigma$).

The set of all languages over $\Sigma$ is variously written as $2^{\Sigma^\ast}$ or $\mathcal P(\Sigma^\ast)$. In the following, we simply write $$\sf LANG$$ for the set of all languages.

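Although nothing in this lecture depends on it, it can help to see these two definitions as code. Below is a minimal Haskell sketch (the type and constructor names are my own choice, not official notation): the abstract syntax becomes a data type with one constructor per clause of the grammar, and a language is represented by its characteristic function.

```haskell
-- A sketch: abstract syntax of regular expressions,
-- one constructor per clause of the BNF (names chosen for this sketch).
data RE = Eps          -- the regular expression epsilon
        | Empty        -- the regular expression denoting the empty language
        | Sym Char     -- a single letter a from the alphabet
        | Alt RE RE    -- e + e   (choice)
        | Seq RE RE    -- e e     (concatenation)
        | Star RE      -- e*      (Kleene star)
  deriving (Eq, Show)

-- A language over the alphabet Char, given by its characteristic function.
type Lang = [Char] -> Bool
```

For example, the regular expression $a+bc^\ast$ from above is the value `Alt (Sym 'a') (Seq (Sym 'b') (Star (Sym 'c')))`.
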
At the end of the lecture, we will understand the following commuting diagram[^commute], relating regular expressions, languages and automata:

![](https://hackmd.io/_uploads/BJSVZvxmh.png =400x)

[^commute]: Commuting diagram is a technical term of category theory. For us, it just means that the arrows in the diagram are functions and composing those functions in different ways yields the same results. In particular, for any regular expression $e$ we have $\sem{e}=\sem{nfa(e)}$.

Each of the next three sections is devoted to understanding one of the arrows in the diagram.

## Denotational Semantics of Regular Expressions

The semantics, or meaning, of a regular expression is a language, that is, a set of words over an alphabet $\Sigma$. It is defined by recursion over abstract syntax as follows. [^lang]

\begin{align}
\sem{-}: \mathsf{RE} & \to \mathsf{LANG}\\
\sem{\epsilon} & = \{\epsilon\}\\
\sem{\varnothing} & = \{\}\\
\sem{a} & = \{a\}\\
\sem{e_1+e_2} & = \sem{e_1}\cup \sem{e_2}\\
\sem{e_1\ e_2} & = \sem{e_1}\cdot \sem{e_2}\\
\sem{e^\ast} & = \sem{e}^\ast
\end{align}

where union, concatenation and Kleene star of languages were defined in the lecture on composing automata.

Note that this mathematical definition is declarative in the sense that it is a direct translation of the English specification; indeed, we can read it as

- the meaning of $\epsilon$ is the empty word
- the meaning of $\varnothing$ is the empty language
- the meaning of $a$ is the language that consists just of $a$
- the meaning of $+$ is union
- the meaning of juxtaposition is concatenation
- the meaning of $^\ast$ is repetition

## Operational Semantics of Regular Expressions

To interpret regular expressions as algorithms, we translate them to automata: first the regular expression into an NFA and then the NFA into a DFA. (Looking back, this was the reason why we introduced NFAs in the first place.)

Once this translation is understood, one can introduce various optimisations. For example, one can do the translation in one step instead of two. Moreover, it is not always necessary to construct all of the NFA/DFA first and then search; instead, one can construct the automaton "on the fly", that is, build only the part of the automaton that is actually needed in the search. See [this article](https://swtch.com/~rsc/regexp/regexp1.html) for more details on implementing regular expressions. Finally, one can minimize the DFA to one with the smallest possible number of states.

Let $\mathsf{RE}$ be the set of all regular expressions and $\epsilon\mathsf{NFA}$ be the class of all NFAs with $\epsilon$-transitions. We call $\mathit{nfa}$ the function that compiles a regular expression to an $\epsilon$-NFA. It is defined by the following pseudocode in a functional programming language. [^nfa]

\begin{align}
\mathit{nfa}:\mathsf{RE}& \to\epsilon\mathsf{NFA}\\
\hline
\mathit{nfa}(\epsilon) & = A_\epsilon\\
\mathit{nfa}(\varnothing) & = A_\varnothing\\
\mathit{nfa}(a) & = A_a\\
\mathit{nfa}(e_1+e_2) & = \mathit{nfa}(e_1)\mid\mathit{nfa}(e_2)\\
\mathit{nfa}(e_1\ e_2) & = \mathit{nfa}(e_1)\,; \mathit{nfa}(e_2)\\
\mathit{nfa}(e^\ast) & = \mathit{nfa}(e)^\ast
\end{align}

where the notation on the right-hand sides was explained in the lecture on composing automata.

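To give an impression of how close this definition is to runnable code, here is a Haskell sketch of $\mathit{nfa}$, using the `RE` data type from the earlier sketch. Instead of the combinators $A_\epsilon$, $A_a$, $\mid$, $;$, ${}^\ast$ from the lecture on composing automata, the sketch inlines the corresponding Thompson-style gadgets. The concrete representation of $\epsilon$-NFAs (states are integers, an $\epsilon$-transition is labelled `Nothing`) is my own choice, and the extra integer argument only threads a supply of fresh state names, so the type differs slightly from the mathematical $\mathit{nfa}$.

```haskell
-- A sketch with a hand-rolled representation of eps-NFAs (not from the lecture).
type Q    = Int                  -- states are numbered
type Edge = (Q, Maybe Char, Q)   -- Nothing labels an epsilon-transition

-- An eps-NFA with a single initial and a single accepting state.
data ENFA = ENFA { initial :: Q, accepting :: Q, edges :: [Edge] }
  deriving Show

-- nfa e n builds an automaton for e using the fresh states n, n+1, ...
-- and returns it together with the next unused state number.
nfa :: RE -> Q -> (ENFA, Q)
nfa Eps     n = (ENFA n (n+1) [(n, Nothing, n+1)], n+2)
nfa Empty   n = (ENFA n (n+1) [],                  n+2)
nfa (Sym a) n = (ENFA n (n+1) [(n, Just a, n+1)],  n+2)
nfa (Alt e1 e2) n =
  let (a1, n1) = nfa e1 n
      (a2, n2) = nfa e2 n1
      s = n2
      f = n2 + 1      -- fresh initial and accepting state
  in (ENFA s f (edges a1 ++ edges a2 ++
        [ (s, Nothing, initial a1), (s, Nothing, initial a2)
        , (accepting a1, Nothing, f), (accepting a2, Nothing, f) ]), n2 + 2)
nfa (Seq e1 e2) n =
  let (a1, n1) = nfa e1 n
      (a2, n2) = nfa e2 n1
  in (ENFA (initial a1) (accepting a2)     -- glue a1's accepting state to a2
        (edges a1 ++ edges a2 ++ [(accepting a1, Nothing, initial a2)]), n2)
nfa (Star e) n =
  let (a1, n1) = nfa e n
      s = n1
      f = n1 + 1      -- epsilon-moves: skip, enter, repeat, leave
  in (ENFA s f (edges a1 ++
        [ (s, Nothing, f), (s, Nothing, initial a1)
        , (accepting a1, Nothing, initial a1), (accepting a1, Nothing, f) ]), n1 + 2)
```

For instance, `fst (nfa (Alt (Sym 'a') (Sym 'b')) 0)` builds an automaton for $a+b$. The pictures in the remark below show exactly these building blocks.
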
**Remark:** If the mathematical definition of $\mathit{nfa}$ is terse, have a look at the excellent article [Regular Expression Matching Can Be Simple And Fast](https://swtch.com/~rsc/regexp/regexp1.html) by Russ Cox, where you find essentially the same definition described pictorially:

![](https://i.imgur.com/poMeWBD.png =700x)

**Remark:** The construction of an automaton from a regular expression is conceptually simple and elegant. If you are interested in an in-depth study, see the survey [From Finite Automata to Regular Expressions and Back](https://arxiv.org/pdf/1405.5594.pdf).

## Denotational Semantics of Automata

This section recaps the lecture [Regular Expressions Composing Automata](https://hackmd.io/@alexhkurz/Hk9l8poop) using the mathematical notation introduced in this lecture.

Recall that given a DFA $A=(Q,\Sigma,\delta,q_0,F)$ with transition function $$\delta: Q\times \Sigma \to Q$$ we extended $\delta$ from input symbols in $\Sigma$ to words in $\Sigma^\ast$ $$\hat\delta : Q\times \Sigma^\ast\to Q.$$

**Definition:** Given a DFA with transition function $\delta: Q\times \Sigma \to Q$ and set $F$ of final states, its *denotational semantics* (we write here $2$ for the set of Booleans, $2=\{0,1\}$) $$\sem{-}:Q\to (\Sigma^\ast\to 2)$$ is defined by $\sem{q} = \{w\mid \hat\delta(q,w)\in F\}$.

We can now define the semantics of an automaton $A$ as $$\sem A = \sem{q_0}$$

To summarize, read $\sem A$ as **the language of $A$**.

The details of the following remark are not important. But it is important to know that the above mathematical definitions can be easily implemented. Moreover, in a programming language such as Haskell, the program and the mathematical definition are identical, up to changing some names and fonts. For example, the remark below makes use of the following table; for more details recall our [Short Introduction to Automata (Written in Haskell)](https://hackmd.io/@alexhkurz/HylLKujCP).

|Mathematics|Programming|
|:---:|:---:|
|$Q$|`State`|
|$\Sigma$|`Char`|
|$\Sigma^\ast$|`[Char]`|
|$\delta$|`delta`|
|$\hat\delta$|`run delta`|
|$\epsilon$|`[]`|
|$2$|`Bool`|
|$\sem{-}$|`semantics`|

**Remark:** Recall from [automata01](https://repl.it/@alexhkurz/automata01#main.hs) of our [Short Introduction to Automata (Written in Haskell)](https://hackmd.io/@alexhkurz/HylLKujCP) the Haskell definition

```haskell
run :: (State -> Char -> State) -> State -> [Char] -> State
run delta q []     = q
run delta q (c:cs) = run delta (delta q c) cs
```

The mathematical function $\sem{-}$ can now be defined as

```haskell
semantics :: State -> ([Char] -> Bool)
semantics q w = final (run delta q w)
```

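Here `State`, `delta` and `final` are assumed to be defined elsewhere, as in automata01. To have a self-contained example, here is one hypothetical choice (not taken from automata01): the two-state DFA that accepts exactly the words containing an even number of `'1'`s. With these definitions in scope, `run` and `semantics` work unchanged.

```haskell
-- A hypothetical example automaton (not the one from automata01).
type State = Int

-- Reading '1' toggles between the two states; any other symbol changes nothing.
delta :: State -> Char -> State
delta 0 '1' = 1
delta 1 '1' = 0
delta q  _  = q

-- State 0 records that an even number of '1's has been read so far.
final :: State -> Bool
final q = q == 0

-- Taking 0 as the initial state:
--   semantics 0 "0110"   evaluates to   True
--   semantics 0 "010"    evaluates to   False
```
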
**Remark:** Not all languages arise as languages of DFAs. Those that do are of particular importance and are called regular.

**Definition:** A language is called **regular** if it is the language of a DFA. More precisely, a language $L$ is regular if there is a DFA $A$ such that $L=\sem{A}$.

To summarize, we have formalised our intuition that languages are the meaning of DFAs. In the following we will see the important role played by switching between automata and languages.

## Minimization

**Remark on the exercises:** This section is structured in such a way that it leaves some of the work to exercises. Most exercises are meant to give you an opportunity to practice concepts, definitions and notations that appeared in class. An exercise marked (*) requires a new idea and can be skipped at first.

In this section we exploit the denotational semantics of automata, that is, the mathematical definition of $$\sem{-}: \epsilon\mathsf{NFA}\to {\sf LANG}$$ in order to show that for each automaton there exists a minimal DFA.

The idea is that the minimal DFA of a given automaton $A$ is the image of $A$ under the function $\sem{-}: \epsilon\mathsf{NFA}\to {\sf LANG}$. To work this out we need to

- refine the definition of $\sem{-}$ so that it takes as inputs not only automata but also states of automata and
- turn $\sf LANG$ itself into a (gigantic) deterministic automaton.

We start with the second item. Given a language $L$ we define for each $a\in\Sigma$ a transition $$L\stackrel a \longrightarrow L_a=\{w\in\Sigma^\ast \mid aw \in L\}$$ and we let $L$ be a final state if $\epsilon\in L$. (A small Haskell transcription of this definition follows after the exercises.)

**Exercise 1:** Verify that this turns $\sf LANG$ into a deterministic automaton. [^ex1]

[^ex1]: This automaton does not have an initial state for a reason that will become clear during the exercises.

**Exercise 2 (*):** Start with a language of your choice and use the definition of $L\stackrel a \longrightarrow L_a$ above to construct a DFA for it.

Coming back to the first bullet point, we refine the function $\sem{-}$ so that it assigns a language to each state $q$ of any given automaton $A$.

**Exercise 3:** Give a precise definition of the language $\sem{A,q}$.

Now, given a DFA $A=(Q,\Sigma,\delta,q_0,F)$, we define a minimal automaton $\overline A=(\overline Q, \Sigma,\overline\delta,\overline q_0,\overline F)$ as the image of $A$ in $\sf LANG$. We do this in two steps. We first define the reachable part $A'$ of $A$ and then identify (fuse, merge) all states that accept the same language. In more detail:

First, let $A'=(Q',\Sigma,\delta',q_0,F')$ be the automaton that is obtained from $A$ by removing all states that are not reachable from the initial state $q_0$.

**Exercise 4:** Define reachable.

Second, define the set of states of the minimal DFA $\overline A$ as $\overline Q = \{\sem{A',q} \mid q\in Q' \}$.

**Exercise 5:** Complete the definition of $\overline A$.

**Exercise 6:** Show that $\overline A$ is a DFA.

**Exercise 7 (*):** Show that $\overline A$ is minimal.

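As promised above, the automaton structure on $\sf LANG$ can be transcribed into Haskell, using the representation of languages as Boolean-valued functions from the sketch at the beginning of the lecture. The two functions below (the names `next` and `isFinal` are my own) merely restate the definition of $L\stackrel a \longrightarrow L_a$ and of finality; they are not a solution to the exercises.

```haskell
-- A sketch, assuming: type Lang = [Char] -> Bool (from the earlier sketch).

-- The transition L --a--> L_a = { w | aw in L } ...
next :: Lang -> Char -> Lang
next l a = \w -> l (a : w)

-- ... and L is a final state iff the empty word belongs to L.
isFinal :: Lang -> Bool
isFinal l = l []
```
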
## Summary

Of course, we would expect that the translation from regular expressions to automata preserves the meaning of the regular expression. In more detail, if $e$ is a regular expression with language $\sem{e}$, then the language $\sem{\mathit{nfa}(e)}$ accepted by the automaton $\mathit{nfa}(e)$ should be $\sem{e}$ as well. Using the language of category theory, we can summarize this by saying that the following diagram commutes.

<!-- https://q.uiver.app/?q=WzAsMyxbMCwwLCJcXHNmIFJFIl0sWzEsMiwiXFxlcHNpbG9ue1xcdGV4dCAtfVxcc2YgTkZBIl0sWzIsMCwiXFxzZiBMQU5HIl0sWzAsMiwiW1xcIVstXVxcIV0iXSxbMCwxLCJcXGl0IG5mYSIsMl0sWzEsMiwiW1xcIVstXVxcIV0iLDJdXQ== -->
![](https://hackmd.io/_uploads/BJSVZvxmh.png =400x)

We also state this in more conventional terms:

**Theorem (Kleene):** For every regular expression $e$ there is an equivalent NFA (and hence DFA), in symbols $$\sem{e} = \sem{\mathit{nfa}(e)}.$$ Conversely, for every DFA $A$ there is a regular expression $e$ such that $\sem{e}$ is the language of $A$, in symbols, $\sem{A}=\sem{e}$.

**Remark:** Kleene only considered DFAs; NFAs were introduced later by Rabin and Scott. NFAs with $\epsilon$-transitions are due to McNaughton and Yamada.

*Proof:* The proof of the first statement is just an explication of the method we used to find the definition of the operations $\mid$ and $;$ and $^\ast$ on automata. For example, the "asynchronous parallel composition" $A_1\mid A_2$ is defined exactly so that $$\mathit{nfa}(e_1+e_2) = \mathit{nfa}(e_1)\mid\mathit{nfa}(e_2)$$ is true. I leave it as an exercise to convince yourself that this is true in this case and in the other cases. For the second statement, the construction of a regular expression from a DFA is more complicated; you can consult Section 3.2.1 (page 91) of [IALC].

<!-- ## <font color=red>Homework - [Homework](https://hackmd.io/@alexhkurz/S1EVYe7bO) on regular expressions. </font> -->

<!-- ## Homework - [Homework](https://hackmd.io/@alexhkurz/HJ1BAFYbd) on converting NFA to DFA. -->

## Further Study

- Do the "Exercise on Converting NFAs to DFAs" from the [Short Introduction to Automata and Haskell](https://hackmd.io/zo6eujQmTQyMoFquFxdv0Q?view#Converting-NFAs-to-DFAs), which asks you to implement the algorithm in Haskell (you get a template, so it is just 3 lines of code).
  - I didn't think about how to do $\epsilon$-transitions in Haskell. Let me know if you give it a try.
- Have a look at the article [(Rabin and Scott, 1959)](https://www.researchgate.net/publication/230876408_Finite_Automata_and_Their_Decision_Problems) which introduced non-deterministic automata. Both authors later made much deeper contributions to computer science (such as zero-knowledge proofs and domain theory), but non-deterministic automata were important enough to earn them the "Nobel prize of computer science", the Turing Award, in 1976. Most algorithms are deterministic and, until not so long ago, implementations on a modern computer were deterministic (even those using random numbers). So why are non-deterministic machines so important?
  - Specifications, as opposed to implementations, typically do not determine computations completely.
  - When modelling open systems, the environment is often best modelled non-deterministically. For example,
    - an operating system interacting with a keyboard cannot predict which keys will be pressed next
    - in security modelling, an attacker model must allow the intruder any possible action
  - In the presence of concurrency or parallelism, the precise timing or even order of actions is not determined.
  - In algorithms, non-deterministic machines are crucial to describe the possibly most important class of algorithms, namely those in non-deterministic polynomial time (NP).
  - Non-deterministic automata and deterministic automata are equivalent, but non-deterministic automata can be exponentially smaller.
- Have a look at Kleene's original paper [Representation of Events in Nerve Nets and Finite Automata](http://www.dlsi.ua.es/~mlf/nnafmc/papers/kleene56representation.pdf). It is difficult to follow today, but it is still worth looking at. Here are some of my takeaways and questions.
  - The paper was motivated by what we would today call neural networks. The famous paper [(McCulloch and Pitts, 1943)](https://www.cs.cmu.edu/~./epxing/Class/10715/reading/McCulloch.and.Pitts.pdf), which Kleene takes as a starting point, was very influential in many areas including AI and machine learning.
  - It is also interesting to note that Turing machines are only mentioned in the last paragraph (before the appendix). While we see DFAs and Turing machines as members of the same hierarchy of machines, it is not clear how much Kleene's paper was influenced by Turing.
  - Kleene's "regular events" (Section 7) are easily recognisable as what we call regular expressions today. Our theorem from the lecture, that each regular expression corresponds to a finite automaton, should be Theorem 4. How different is Kleene's proof from ours? Does he allow non-deterministic automata?
  - Kleene's "finite automata" (Section 8) do not yet have the clean mathematical definition we use today.
  - Regular languages are defined in Section 9, just before Lemma 7.
  - I didn't try to understand the proof that every finite automaton can be described by a regular expression (Theorem 5 in Section 9). It would be an interesting exercise to translate the proof into modern language and then to determine which of the different textbook proofs corresponds to it.

## References

- [IALC] Hopcroft, Motwani, Ullman: Introduction to Automata Theory, Languages, and Computation.
- Russ Cox: [Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)](https://swtch.com/~rsc/regexp/regexp1.html)
- Gruber, Holzer: [From Finite Automata to Regular Expressions and Back](https://arxiv.org/pdf/1405.5594.pdf).

<!-- ## Acknowledgements [Polyphia](https://www.youtube.com/watch?v=2w9hxLYIfO4&list=PLDCdjwiC90TF9x9orSv_voDSiLRt7_i10) for the music (and Donner for the recommendation) -->

[^ast]: This notation is called [BNF](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) and is a way to write [context-free grammars](https://en.wikipedia.org/wiki/Context-free_grammar). We study this in detail in Programming Languages. For now you don't need to know the details; just read $e ::= \epsilon \mid \varnothing \mid a \mid e+e \mid e\ e \mid e^*$ as a shorthand for the following set of rules:
    - $\epsilon$ is a regular expression
    - $\varnothing$ is a regular expression
    - $a\in\Sigma$ is a regular expression
    - if $e_1$ and $e_2$ are regular expressions, then
      - $e_1+e_2$ is a regular expression
      - $e_1\,e_2$ is a regular expression
    - if $e$ is a regular expression, then $e^\ast$ is a regular expression.

[^lang]: Recall from the lecture on composing automata
    - $\{\}$
    - $\{\epsilon\}$
    - $\{a\}$
    - $L_1 \cup L_2$
    - $L_1 \cdot L_2$
    - $L^\ast$

[^NFAs]: (Recap on the importance of non-deterministic automata.) Non-deterministic automata were introduced by [Rabin and Scott](https://www.researchgate.net/publication/230876408_Finite_Automata_and_Their_Decision_Problems) in 1959. Both authors made much deeper contributions to computer science (such as zero-knowledge proofs and domain theory), but non-deterministic automata were important enough to earn them the "Nobel prize of computer science", the Turing Award, in 1976. Most algorithms are deterministic and, until not so long ago, implementations on a modern computer were deterministic (even those using random numbers). So why are non-deterministic machines so important?
    - Specifications, as opposed to implementations, typically do not determine computations completely.
    - When modelling open systems, the environment is often best modelled non-deterministically. For example,
      - an operating system interacting with a keyboard cannot predict which keys will be pressed next
      - in security modelling, an attacker model must allow the intruder any possible action
    - In the presence of concurrency or parallelism, the precise timing or even order of actions is not determined.
    - In algorithms, non-deterministic machines are crucial to describe the possibly most important class of algorithms, namely those in non-deterministic polynomial time (NP).
    - Non-deterministic automata and deterministic automata are equivalent, but non-deterministic automata can be exponentially smaller.

[^nfa]: Recall from the lecture on composing automata
    - $A_\varnothing$
    - $A_\epsilon$
    - $A_a$
    - $A_1 \mid A_2$
    - $A_1 \,; A_2$
    - $A^\ast$