---
tags: programming languages
---
# Intro to Parsing and Context-Free Grammars
(part of a course on programming languages)
## Introduction
In previous sessions, we found that the ability of Haskell to do
- pattern matching and
- recursion on algebraic data types
makes it easy to implement a simple calculator that operates on abstract syntax. (I also have a course on Compiler Construction which shows how this scales up to real-world programming languages.)
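To recall what this looks like, here is a minimal sketch of such a calculator (reconstructed along the lines of the previous session; the constructor names match the examples below):
```haskell
-- Abstract syntax of arithmetic expressions ...
data Exp = Plus Exp Exp | Times Exp Exp | Num Integer

-- ... and the calculator: pattern matching and recursion on the ADT.
eval :: Exp -> Integer
eval (Plus  e1 e2) = eval e1 + eval e2
eval (Times e1 e2) = eval e1 * eval e2
eval (Num n)       = n
```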
We also found that abstract syntax is not convenient for human readers. Instead of having to work through
```haskell
Plus (Num 1) (Times (Num 2) (Num 3))
```
we generally prefer
```
1 + 2 * 3
```
So can we get the best of both worlds? Elegant and concise programs working on abstract syntax, but reading input created by humans in concrete syntax?
Sure. We only need to find a way to automatically translate concrete syntax into abstract syntax.
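In Haskell terms, we are after a function of, roughly, the following type (a sketch of the interface we want, not of any particular tool; `Exp` is the abstract syntax type from the sketch above):
```haskell
-- Parsing turns concrete syntax into abstract syntax,
-- failing with an error message on ungrammatical input.
parse :: String -> Either String Exp
parse = error "to be replaced by a generated parser later in this session"
```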
The bad news is that this is more difficult than it should be. The good news is that this is a problem that was solved quite some time ago: The transformation of concrete syntax into abstract syntax is known as **parsing**. And the theory (and practice) of parsing is well understood.[^parsing]
## Parsing
Looked at from the point of view of ordinary (1-dimensional) syntax, parsing is about putting the parentheses in the right place. For example,
```
1+2*3
```
is turned into
```
1+(2*3)
```
But this is still a linear expression that is not easy to process automatically. In particular, matching opening and closing parentheses may be arbitrarily far apart.
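To get a feeling for why this is hard, consider just the sub-problem of checking that parentheses match at all. Even this needs an unbounded counter, as in the following hedged sketch (the name `balanced` is mine):
```haskell
-- Checks that parentheses are balanced, ignoring all other characters.
-- The counter can grow without bound: matching parentheses may be
-- arbitrarily far apart, so no fixed amount of lookahead suffices.
balanced :: String -> Bool
balanced = go 0
  where
    go depth []       = depth == 0
    go depth ('(':cs) = go (depth + 1) cs
    go depth (')':cs) = depth > 0 && go (depth - 1) cs
    go depth (_:cs)   = go depth cs
```
Context-free grammars, introduced below, build this kind of bookkeeping in from the start.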
**Discussion:** We can find similar examples in natural language. Use the idea of "putting the parentheses in the right place" to explain how sentences like the following (collect your own examples) are ambiguous.
> Novak Djokovic, five times a champion here, failed to reach the concluding Sunday of the ATP World Tour Finals for a record eighth time [...].
**Discussion:** Have a look at [Fibonacci in LISP](https://lisp-lang.org/learn/functions). Discuss the claim that in LISP the programmer directly writes abstract syntax. Why was LISP designed this way? Thanks to Christopher Chang for providing a link to this sketch of a [history of Lisp](https://twobithistory.org/2018/10/14/lisp.html).
**Optional Activity:** Write a calculator (operating on ordinary concrete syntax) in your favourite language without using a library for parsing. This is an excellent exercise to learn to appreciate the subtleties of parsing.
As discussed already, automatic processing is easy on abstract syntax such as the following (the expression `Plus (Num 1) (Times (Num 2) (Num 3))`, written now in 2-dimensional syntax):
```
      Plus
     /    \
   Num     Times
    |     /     \
    1   Num     Num
         |       |
         2       3
```
The aims of this session are the following.
- Define the concrete syntax of a calculator using a context free grammar.
- Write a context-free grammar in BNFC.
- Use `bnfc` to generate a parser that translates concrete arithmetic expressions to abstract ones.
- Link up the parser with the calculator from the previous session to obtain a calculator written in Haskell and operating on concrete syntax.
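As a preview of the last step in this list, the glue code might look roughly as follows. This is a hedged sketch, not the definitive setup: the module and function names (`AbsCalc`, `LexCalc`, `ParCalc`, `myLexer`, `pExp`, `Err`) follow BNFC's Haskell conventions for a hypothetical grammar file `Calc.cf`, the rule labels are assumed to be `Plus`, `Times` and `Num`, and details vary between BNFC versions.
```haskell
import AbsCalc (Exp (..))  -- generated abstract syntax
import LexCalc (myLexer)   -- generated lexer
import ParCalc (pExp)      -- generated parser
import ErrM    (Err (..))  -- BNFC's error monad

-- The calculator from the previous session.
eval :: Exp -> Integer
eval (Plus  e1 e2) = eval e1 + eval e2
eval (Times e1 e2) = eval e1 * eval e2
eval (Num n)       = n

-- Read one expression in concrete syntax, parse, evaluate.
main :: IO ()
main = do
  s <- getLine
  case pExp (myLexer s) of
    Bad err -> putStrLn ("Parse error: " ++ err)
    Ok  e   -> print (eval e)
```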
This session is meant to be rather introductory. We will dig deeper into these concepts later. In particular, next semester, we will see that all of these concepts scale up to writing a compiler for a programming language such as C++.
## A context-free grammar
A context-free grammar is a set of rules such as the following. (I skip the rules for `Integer`s.)
```
Exp -> Exp '+' Exp1
Exp1 -> Exp1 '*' Exp2
Exp2 -> Integer
Exp2 -> '(' Exp ')'
Exp -> Exp1
Exp1 -> Exp2
```
The purpose of a **context-free grammar** is to define a language, that is, a set of strings (also known as words). In our case, we want to define arithmetic expressions. A particular string is in the language if it can be derived from the **start symbol**, which is `Exp` in the example. The symbols that appear in the strings themselves are the integer literals (whose rules we skipped) and the symbols enclosed in single quotes, that is,
```
+
*
(
)
```
The other symbols, such as `Exp` and `Exp1`, are the **nonterminals**; they control which strings can be derived.
For example, the string `2 * 1` can be derived as follows. (In the derivation I drop the single quotes for readability, and I treat `Integer -> 2` and `Integer -> 1` as given, since we skipped the rules for `Integer`s.)
```
Exp ->
Exp1 ->
Exp1 * Exp2 ->
Exp2 * Exp2 ->
Integer * Exp2 ->
Integer * Integer ->
2 * Integer ->
2 * 1
```
The same string can have many different derivations, as we are free to choose the order in which we apply the rules. On the other hand, for the grammar above, any two different derivations of the same string correspond to the same tree.

**Exercise:** Study the derivation above. How many other derivations can you make just by switching the order in which you apply the rules?

**Exercise:** Draw the derivation above as a tree. How many different derivation trees can you make?
The next exercise is the most important one. Do exercises of this kind until you feel comfortable parsing small examples by hand.
## <font color=red>Homework (for the Report):</font>
Using the context-free grammar
```
Exp -> Exp '+' Exp1
Exp1 -> Exp1 '*' Exp2
Exp2 -> Integer
Exp2 -> '(' Exp ')'
Exp -> Exp1
Exp1 -> Exp2
```
write out the derivation trees (also called **parse trees** or **concrete syntax trees**) for the following strings:
- `2+1`
- `1+2*3`
- `1+(2*3)`
- `(1+2)*3`
- `1+2*3+4*5+6`
## More Exercises
**Exercise:** Why do the following strings not have parse trees (given the context-free grammar above)?
- `2-1`
- `1.0+2`
- `6/3`
- `8 mod 6`
**Exercise:** Can you change the grammar, so that the strings in the previous exercise become parsable?
## More on Order of Operations via Precedence Levels
The grammar above has one subtle feature: it makes sure that the *only* way to parse
```
1+2*3
```
is as $1 + (2\cdot 3)$ and not as $(1+2)\cdot 3$.
Watch the video [Order of Operations in CFGs](https://youtu.be/jf1xhZSpCvg) for an explanation.
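To make the role of the precedence levels concrete, here is a hedged sketch of a hand-written parser for this grammar (all names are mine, and this is *not* what `bnfc` generates; it is one way the optional activity above could be tackled). Each nonterminal `Exp`, `Exp1`, `Exp2` becomes one function, and the left-recursive rules become loops, which also makes `+` and `*` left-associative:
```haskell
-- Tokens are assumed to be produced by a lexer (omitted here).
data Token = TPlus | TTimes | TOpen | TClose | TInt Integer
  deriving (Eq, Show)

data Exp = Plus Exp Exp | Times Exp Exp | Num Integer
  deriving Show

-- A parser consumes a prefix of the token list and returns a result
-- together with the remaining tokens, or fails.
type Parser a = [Token] -> Maybe (a, [Token])

-- Exp -> Exp '+' Exp1 | Exp1
pExp :: Parser Exp
pExp ts = pExp1 ts >>= uncurry go
  where
    go e (TPlus : rest) =
      do (e', rest') <- pExp1 rest
         go (Plus e e') rest'
    go e rest = Just (e, rest)

-- Exp1 -> Exp1 '*' Exp2 | Exp2
pExp1 :: Parser Exp
pExp1 ts = pExp2 ts >>= uncurry go
  where
    go e (TTimes : rest) =
      do (e', rest') <- pExp2 rest
         go (Times e e') rest'
    go e rest = Just (e, rest)

-- Exp2 -> Integer | '(' Exp ')'
pExp2 :: Parser Exp
pExp2 (TInt n : rest) = Just (Num n, rest)
pExp2 (TOpen : rest) =
  do (e, rest') <- pExp rest
     case rest' of
       TClose : rest'' -> Just (e, rest'')
       _               -> Nothing
pExp2 _ = Nothing

-- ghci> pExp [TInt 1, TPlus, TInt 2, TTimes, TInt 3]
-- Just (Plus (Num 1) (Times (Num 2) (Num 3)),[])
```
Because `pExp2` only ever calls back into `pExp` after consuming an opening parenthesis, the parser can only produce `1 + (2*3)` for `1+2*3`, never `(1+2) * 3`.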
**<font color=red>Homework</font>** (Optional): This homework follows up on the video [Uniqueness of Parse Trees](https://youtu.be/3ZLkPwB_c9g).
- With the simplified grammar without precedence levels
  ```
  Exp -> Exp '+' Exp
  Exp -> Exp '*' Exp
  Exp -> Integer
  ```
  how many parse trees can you find for the following expressions?
  ```
  1+2+3
  1*2*3*4
  ```
- Answer the question above using instead the grammar
  ```
  Exp -> Exp '+' Exp1
  Exp -> Exp1
  Exp1 -> Exp1 '*' Exp2
  Exp1 -> Exp2
  Exp2 -> Integer
  ```
**Discussion:** What are the similarities and differences between the grammar
```
Exp -> Exp '+' Exp
Exp -> Exp '*' Exp
Exp -> Integer
```
and the following algebraic data type for the abstract syntax?
```haskell
data Exp = Plus Exp Exp | Times Exp Exp | Num Integer
```
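This correspondence is, in fact, what BNFC exploits: in its labelled BNF, each rule of the grammar carries a label, and the labels become the constructors of the generated data type. As a hedged sketch (the labels are my choice), the grammar with precedence levels from above might be written in LBNF as:
```
Plus.  Exp  ::= Exp "+" Exp1 ;
Times. Exp1 ::= Exp1 "*" Exp2 ;
Num.   Exp2 ::= Integer ;
coercions Exp 2 ;
```
Here `coercions Exp 2` abbreviates the rules `Exp -> Exp1`, `Exp1 -> Exp2` and `Exp2 -> '(' Exp ')'` from the grammar above; these rules carry no labels, so they leave no trace in the abstract syntax.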
**Remark:** Each language defines a level of abstraction that hides detail. We could complicate the picture sketched above to reveal even more languages. The Haskell runtime system is written in C and C--. Haskell can be compiled via LLVM. LLVM is compiled to the target architecture of your computer, which in turn is implemented in microcode. (I should admit that with this remark I am crossing the limits of my knowledge ... while it would be very interesting to dive into more detail, I am glad that we do not need to do this for this course.)
**Discussion**: Discuss the advantages and disadvantages of the concrete and the abstract format. Relate this discussion to the introductory quote by McCarthy.
## General Background
General background is available in the Wikipedia articles on [BNF](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form) and [context-free grammars](https://en.wikipedia.org/wiki/Context-free_grammar). Also consult the article on the [Chomsky hierarchy](https://en.wikipedia.org/wiki/Chomsky_hierarchy), in particular the table that aligns grammars with automata.
## Further Study: A brief look at the history
A number of deep insights into parsing can be gleaned from
- Jeffrey Kegler's [timeline](https://jeffreykegler.github.io/personal/timeline_v3) of the history of parsing (**highly recommended**).
The literature on parsing is vast. We will learn much more about parsing in the course on Compiler Construction. Here, I limit myself to a few early references that continue to have a lasting influence. (It is always worth looking at the classics, something we do not do enough in science.)
- Joachim Lambek: [The Mathematics of Sentence Structure](https://www.cs.cmu.edu/~fp/courses/15816-f16/misc/Lambek58.pdf). 1958.
Read sections 1-4 up to the end of page 158. Terms worth learning here are: sentence, nonsentence, syntactic type, leakage, decision problem, primitive type, context, compound type. Section 4 is important because it explains the idea of *parsing-as-deduction*.
While Lambek's work continues to be influential, and is based on many of the same ideas, his formalism is different from context-free grammars, which were introduced on the first two pages of Chapter 4 of
- Noam Chomsky: [Syntactic Structures](https://www.academia.edu/4073170/Noam_Chomsky_Syntactic_Structure). 1957.
I think the first two pages of Chapter 4 are still an excellent introduction. Also the general background reviewed in Chapter 2 is worth reading today (as is much else in this book). For the historical significance also see its introduction and [Wikipedia on Syntactic Structures](https://en.wikipedia.org/wiki/Syntactic_Structures).
Given how simple it is to write an interpreter on abstract syntax, and how difficult it is to write one on concrete syntax (if we do not use libraries or tools that help us with the parsing), it is perhaps not surprising that in one of the first modern programming languages, LISP, programmers write directly in linearised abstract syntax. (That is why LISP programs have so many parentheses.)
- John McCarthy's original article on Lisp: [Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I](http://jmc.stanford.edu/articles/recursive/recursive.pdf)
Have a look at the classic article
- [Towards a Mathematical Science of Computation](http://www-formal.stanford.edu/jmc/towards.ps) by [McCarthy](https://en.wikipedia.org/wiki/John_McCarthy_%28computer_scientist%29).
McCarthy was a pioneer of Computer Science. In 1955 he coined the term "artificial intelligence"; shortly afterwards he invented LISP, garbage collection, and [time-sharing systems](https://en.wikipedia.org/wiki/Time-sharing); and in 1962 he introduced, with the quote above, the notion of abstract syntax.
Interesting for us is that McCarthy emphasises that BNF makes it easy to generate, write and synthesize programs, whereas abstract syntax makes it easy to analyse, translate, type check, interpret and compile programs. Another way to put it: concrete syntax is good for human readers and abstract syntax is good for automated processing.
Another influential article by McCarthy is
- [Ascribing mental qualities to machines](http://cs.uns.edu.ar/~grs/InteligenciaArtificial/ascribing.pdf), 1979.
(It is always worth looking at the original work of the pioneers. They are mostly remembered for only a very small part of their ideas, so there is often something interesting and forgotten to discover.)
[^parsing]: If you read articles and books about compilers written before approximately 1979, be aware that there was a time when parsing was THE problem of compiler construction. See this excellent [timeline](https://jeffreykegler.github.io/personal/timeline_v3). I found particularly revealing the entries from "Language as of 1965" to "The Parsing Problem as of 1979".