# Regular expressions
Regular expressions are a ubiquitous tool in text processing. They are a slick way to describe formal languages. For example, $R=(\mathbf{0}+\mathbf{1})^* \mathbf{0}\mathbf{1}(\mathbf{0}+\mathbf{1})^*$ describes the language $L = L(R)$ that consists of words with $\mathbf{0}\mathbf{1}$ in the middle.
We want to make this precise now.
## Set operations
The semantics of regular expressions will rely on the following operations on sets:
* **union** $\cup$: This is the usual set-theoretic union, i.e., for languages $L,M$ we have: $L \cup M := \{ x \; | \; (x \in L) ~ \text{or} ~ (x \in M) \}$
For example, for $L=\{0,1,10\}$ and $M = \{\varepsilon, 0, 1\}$ we have $L \cup M = \{\varepsilon, 0, 1, 10\}$.
* **contenation** $\cdot$: For languages $L,M$ we define their concatenation as $L \cdot M := LM := \{vw \; | \; v \in L, w \in M\}$.
For example, for $L= \{01,10\}$, $M = \{\varepsilon, 0,1\}$, we have:
$LM = \{01\varepsilon, 010, 011, 10\varepsilon, 100, 101,\ldots\}$
$= \{ 01, 010, 011, 10, 100, 101,\ldots \}$
* **Kleene closure/Kleene star ${}^*$:** For a language $L$, we define $L^* := \bigcup_{i = 0}^\infty L^i$, with $L^0 := \{\varepsilon\}$ and $L^{i+1} := L^i \cdot L$, for $i \geq 1$.
For example, for $L := \{0,1\}$, we have
$L^* = \{ \varepsilon, 0,1, 00, 01, 10, 11,000, 001, \ldots \}$
Now, regular expressions are introduced *inductively* as follows:
## Regular expressions -- Base cases
1. **Syntax:** $\emptyset$ and $\varepsilon$ are regular expressions.
**Semantics:** $L(\emptyset) = \emptyset$ and $L(\varepsilon) = \{\varepsilon\}$.
2. **Syntax:** If $a \in \Sigma$, then $\textbf{a}$ is a regular expression.
**Semantics:** $L(\mathbf a) = \{a\}$.
4. **Syntax:** If $L, K, \ldots$ are *variables* then $L, K, \ldots$ are regular expressions.
**Semantics:** Every variable $L, K, \ldots$ is interpreted as a language.
## Regular expressions -- Step cases
1. **Syntax:** If $E,F$ regular expressions, then their *sum* $E+F$ is a regular expression.
**Semantics:** $L(E+F) = L(E) \cup L(F)$
2. **Syntax:** If $E,F$ regular expressions, then their *concatenation* $EF$ is a regular expression.
**Semantics:** $L(EF) = L(E)L(F)$
3. **Syntax:** If $E$ is a regular expression, then its *Kleene closure* $E^*$ is a regular expression.
**Semantics:** $L(E^*) = L(E)^*$
4. **Syntax:** If $E$ is a regular expression, then $(E)$ is a regular expression.
**Semantics:** $L((E)) = L(E)$
The order of precedence of these operations is (from high to low):
1. $*$: Kleene star
2. $\cdot$: concatenation
3. $+$: sum
For example, $R = 01^*+1 = ((0(1^*)) + 1)$.
# From regular expressions to DFAs/NFAs
We have shown that every DFA can be captured by a regular expression. Conversely, given any regular expression, can we construct a DFA whose language is described by that expression? Yes, we can indeed.
:::warning
**Exercise (From regex to automata).**
1. For each of the following languages $L_i$, $i=1,2,3$, construct an NFA $\mathcal A_i$ accepting $L_i$:
1.1 $L_1 = \emptyset$
1.2 $L_2 = \{\varepsilon\}$
1.3 $L_3 = \{ a \}$ for any $a \in \Sigma$
2. Let $\mathcal A$ and $\mathcal B$ be NFAs. Describe how to construct NFAs $\mathcal C_i$, $i=1,2,3$, such that:
2.1 $L(\mathcal C_1) = L(\mathcal A) \cup L(\mathcal B)$
2.2 $L(\mathcal C_2) = L(\mathcal A) L(\mathcal B)$
2.3 $L(\mathcal C_3) = (L(\mathcal A))^*$
3. Using 1. and 2., argue that for any regular expression $R$ we can find an NFA $\mathcal A_R$ such that $L(\mathcal A_R) = L(R)$. Why does this give rise even to a *DFA* with that property?
:::
# From DFAs/NFAs to regular expressions
See [Kleene's algorithm](https://hackmd.io/@jweinberger/Bk5fTUCOZl).