Introduction to Shift-Reduce Parsing

# Introduction to Shift-Reduce Parsing (part of [CC 2022](https://github.com/alexhkurz/compiler-construction-2022)) ## Aims The aim of this and the next lectures is to explain what shift/reduce and reduce/reduce conflicts are and to explore how to avoid them. But first, we need to understand how shift-reduce parsers work. ## Introduction We have already in a previous lecture explained how to construct a parse tree from a grammar. Our explanations followed a top-down strategy. While this strategy is simple and elegant, most parsers prefer a bottom-up, or shift-reduce, strategy for efficency reasons. In this lecture we will look at how to shift-reduce parse by hand. This is still a non-determinstic algorithm. In the next lecture we will see the determinstic shift-reduce parser, a so-called LALR(1) parser, generated by the Haskell parser generator Happy. LALR(1) is a particular kind of shift-reduce parser (terminology will be explained in the following). ## Required Reading Section 3.7 of [Implementing Programming Languages](http://www.cse.chalmers.se/edu/year/2012/course/DAT150/lectures/plt-book.pdf). ## The `.info` file of a Happy parser The aim of this section Let us start again with the calculator example of the BNFC tutorial. We look at a simplification of the grammar [`Calc.cf`](https://github.com/alexhkurz/compiler-construction-2022/blob/master/Sources/Calculator/Calc.cf) EAdd. Exp ::= Exp "+" Exp1 ; EMul. Exp1 ::= Exp1 "*" Integer ; EInt. Exp1 ::= Integer ; coercions Exp 1 ; We are also familiar already with using BNFC to create a parser bnfc -m -haskell Calc.cf make and how to run the parser as for example in echo "1 + 2 * 3" | ./TestCalc We want to understand how the parser works. In order to see a specification of the parser generated by BNFC we need to look at the "info" file.[^option-i] We open the generated `.info` file in a texteditor. I use the editor visual studio code for which I created my own alias `vs` so I type vs ParCalc.info but any editor will be fine. We are looking at [ParCalc.info](https://github.com/alexhkurz/compiler-construction-2022/blob/master/Sources/Calculator/ParCalc.info). To follow the rest of the lecture, you should be looking at the lecture notes and `ParCalc.info` next to each other. ## Shift-reduce parsing LR, or LALR(1) parsing is a special case of shift-reduce parsing. As opposed to top-down parsing, shift-reduce parsing is working bottom-up.  At each step during the parsing we have to decide wether to read from the input (shift) or whether to apply a rule of the grammar (reduce). During the parsing, we maintain a stack that contains a record of what we parsed so far. A shift pushes the current symbol from the input onto the stack. A reduce matches the right hand side of a rule of the grammar against the top of the stack and replaces it by the left-hand side. The current stack together with the input remaining to be read is called a configuration. Let us explain the details by parsing the string `1+2*3` using the grammar from `ParCalc.info`: Integer -> L_integ (2) Exp -> Exp '+' Exp1 (3) Exp -> Exp1 (4) Exp1 -> Exp1 '*' Integer (5) Exp1 -> Integer (6) Exp1 -> '(' Exp ')' (7) There are also two further rules in the `.info`-file which we ignore for simplicity. We take `Exp` as the start symbol for the grammar, that is, we require that all parse trees have `Exp` at their root. There are also rules that are not listed explicitely in the `.info` file including L_integ -> '1' L_integ -> '2' L_integ -> '3' The parse tree is the following Exp / | \ Exp + Exp1 | / | \ Exp1 Exp1 * Integer | | | Integer Integer L_integ | | | L_integ L_integ 3 | | 1 2 It is easy to write down which sequence of rules produces the parse tree in a top-down left-to-right computation: (3), (4), (6), (2), (5), (4), (6), (2), (6), (2). **Exercise:** Anotate the parse tree with the rules listed above. The task now is to discover the parse tree by processing the input bottom-up and left-to-right, which gives the name to LR-parsing (we will see later a variation of LR called LALR(1)). We start with the `configuration` `(<>, 1+2*3)` where <> is the empty stack and `1+2*3` is the input we want to parse. As the stack is empty there is nothing to reduce so we start with a shift `(L_integ, +2*3)` which reads 1 from the input and pushes `L_integ` on the stack (`L_integ` represents all integers). Next we reduce using rule (2) to `(Integer, +2*3)` **Shift or reduce?** In most configurations we can shift or reduce. For example, in the current configuration we can reduce using rule (4) resulting in `(Exp1, +2*3)` or we can shift resulting in `(Integer +, 2*3)`. Which one should we use? In this case, it is better to reduce because shifting would result in a dead end because: - There is no rule in the grammar that would allow us to reduce a stack starting with `Integer +`. The only rule we have to reduce a `+` is rule (3) and it requires an `Exp` and not an `Integer` to the left of the `+`. - **Rules can only be used to reduce the top of the stack.** In particular we are not allowed to apply rule (6) to the stack `Integer +`. This restriction is made to reduce (and, in the end, eliminate any form of) non-determinism, which greatly improves the efficiency of parsing. As the discussion above shows the right thing to do is to reduce using rule (6) resulting in `(Exp1, +2*3)` **Exercise:** Detail the sequence of configurations and shift/reduce steps that would follow the configuration `(Exp1, +2*3)` if one shifted. As the previous exercise shows, shifting in `(Exp1, +2*3)`does not allows us to parse our input. Therefore, we reduce using rule (4) and get `(Exp, +2*3)` **Exercise:** Convince yourself that so far we constructed the following part of the parse tree. Exp | Exp1 | Integer | L_integ | 1 In order to see how to continue do the following exercise. **Exercise:** Detail the sequence of shift/reduce steps which lead from `(Exp, +2*3)` to `(Exp + Exp1, *3)` **Exercise:** Convince yourself that we constructed the following partial parse tree. Exp + Exp1 | / Exp1 Exp1 | | Integer Integer | | L_integ L_integ | | 1 2 *Remark:* The configuration `(Exp + Exp1, *3)` is interesting because it decides whether we parse `1+2*3` to mean `(1+2)*3` or `1+(2*3)`. If we reduce `(Exp + Exp 1, *3)` using rule (3), then we build the parse tree for `1+2`. If we shift then we continue parsing `*3` and will first reduce using (5) and then later reduce using (3), hence we will first build the tree for `2*3` and then the tree for `1+(2*3)`. If we have to make a decision between shift and reduce this is called a **shift-reduce conflict**. **Exercise:** Detail the sequence of configurations that follow from `(Exp + Exp 1, *3)` after doing a reduce using (3). Explain why this sequence of reductions shows that we cannot parse `1+2*3` as `(1+2)*3`. As the previous exercise shows, reducing in `(Exp + Exp1, *3)` does not allow us to parse the input, so we continue with a shift leading to `(Exp + Exp 1 * , 3)` **Exercise:** Detail the sequence of shift/reduce steps which follow from `(Exp + Exp 1 * , 3)` and lead to the configuration `(Exp + Exp 1 * Integer, <>)` where `<>` denotes the empty input. **Exercise:** Which sequence of rules is used to reduce `(Exp + Exp 1 * Integer, <>)` to `(Exp, <>)` Finally, from reaching this configuration we can conclude that we parsed `1+2*3` correctly. *Remark:* See also the [parsing of `1+2*3` according to the IPL book](https://github.com/alexhkurz/compiler-construction-2022/blob/master/Sources/Calculator/shift-reduce.png). **Exercise:** Extract the complete parse tree (concrete syntax tree) from the computation steps leading from `(<>, 1+2*3)` to `(Exp, <>)`. **Exercise** (understanding `coercions`): Simplify the grammar at the beginning of this lecture to `Calc2.cf` as follows. EAdd. Exp ::= Exp "+" Exp1 ; EMul. Exp1 ::= Exp1 "*" Integer ; EInt. Exp1 ::= Integer ; by omitting the line `coercions Exp 1 ;` Parse `1+2*3` using the simplified grammar.[^build] What changes? Which rule can you add to the grammar so that it parses `1+2*3`? What happens if you parse `(1+2)*3`? Which rule can you add to the grammar so that it parses `(1+2)*3`? Explain why our original grammar `Calc.cf` parses `(1+2)*3`. **Exercise** (understanding precedence levels): Simplify the grammar at the beginning of this lecture to `Calc3.cf` as follows. EAdd. Exp ::= Exp "+" Exp ; EMul. Exp ::= Exp "*" Exp ; EInt. Exp ::= Integer ; coercions Exp 1; When you build the parser, how many shift/reduce conflicts do you get? How does it compare with `Calc.cf`? Parse `1+2*3` with the parser generated from `Calc3.cf`. How does it compare with `Calc.cf`? What happens if you parse `1*2+3` instead? ## Summary If set you the problem "Show the steps taken by a shift-reduce parser for the given grammer on input `1+2*3`" the following would be a good answer:[^reductions] [^reductions]: The numbers in the "Action" column refer to the following rules: ``` Integer -> L_integ (2) Exp -> Exp '+' Exp1 (3) Exp -> Exp1 (4) Exp1 -> Exp1 '*' Integer (5) Exp1 -> Integer (6) Exp1 -> '(' Exp ')' (7) ``` The names in the rule column refer to the BNF-grammar ``` EAdd. Exp ::= Exp "+" Exp1 ; EMul. Exp1 ::= Exp1 "*" Integer ; EInt. Exp1 ::= Integer ; ``` |Stack| Input| Action| Rule | |---:|:---|:---:|:---:| | | `1+2*3` | shift `1` | | L_integ | `+2*3` | reduce (2) | | Integer | `+2*3` | reduce (6) | EInt | Exp1 | `+2*3` | reduce (4) | | Exp | `+2*3` | shift `+` | Exp `+` | `2*3` | shift `2` | Exp `+` L_integ | `*3` | reduce (2) | | Exp `+` Integer | `*3` | reduce (6) | EInt | Exp `+` Exp1 | `*3` | shift `*` | Exp `+` Exp1 `*` | `3` | shift `3` | Exp `+` Exp1 `*` L_integ || reduce (2) | | Exp `+` Exp1 `*` Integer || reduce (5) | EMul | Exp `+` Exp1 | | reduce (3) | EAdd | Exp | | | In the "Rule" column, I only put the rules which are explicitely named in the grammar. I did not put built-in rules that convert numbers to integers or that convert `Exp1` to `Exp`. The reason is that only the rules named in the grammar appear in the abstract syntax tree. **Remark:** In Haskell the abstract syntax tree will be ```haskell EAdd (EInt 1) (EMul (EInt 2) 3) ``` **Exercise:** Check the last claim by creating [^creating] the parser `TestCalc` for [`Calc.cf`](https://github.com/alexhkurz/compiler-construction-2022/blob/master/Sources/Calculator/Calc.cf) and running ``` echo "1+2*3" | ./TestCalc ``` [^creating]: ``` bnfc -m --haskell Calc.cf make ``` ## Homework (not assessed but important for Assignment 1) - Read Section 3.7 of [Implementing Programming Languages](http://www.cse.chalmers.se/edu/year/2012/course/DAT150/lectures/plt-book.pdf). - Do the Exercises above. Particularly important are the following, which I repeat here for easy reference (you may have to locate them in context): - **Exercise:** Detail the sequence of configurations that follow from `(Exp + Exp 1, *3)` after doing a reduce using (3). Explain why this sequence of reductions shows that we cannot parse `1+2*3` as `(1+2)*3`. - **Exercise** (understanding `coercions`): Simplify the grammar at the beginning of this lecture to `Calc2.cf` as follows. EAdd. Exp ::= Exp "+" Exp1 ; EMul. Exp1 ::= Exp1 "*" Integer ; EInt. Exp1 ::= Integer ; by omitting the line `coercions Exp 1 ;` Parse `1+2*3` using the simplified grammar.[^build] What changes? Which rule can you add to the grammar so that it parses `1+2*3`? What happens if you parse `(1+2)*3`? Which rule can you add to the grammar so that it parses `(1+2)*3`? Explain why our original grammar `Calc.cf` parses `(1+2)*3`. - **Exercise** (understanding precedence levels): Simplify the grammar at the beginning of this lecture to `Calc3.cf` as follows. EAdd. Exp ::= Exp "+" Exp ; EMul. Exp ::= Exp "*" Exp ; EInt. Exp ::= Integer ; coercions Exp 1; When you `make` the parser, how many shift/reduce conflicts do you get? How does it compare with `Calc.cf`? Parse `1+2*3` with the parser generated from `Calc3.cf`. How does it compare with `Calc.cf`? What happens if you parse `1*2+3` instead? - Go through this lecture again, but now parsing the input (1+2)*3 - Write out the parsing table for `hello.cc` and `Cpp.cf`. ## Summary We summarise the lecture using standard terminology from parsing. For a thorough reference see Chapter 2 of [Grune, Jacobs: Parsing Techniques](https://www.dickgrune.com/Books/PTAPG_1st_Edition/BookBody.pdf) A **grammar** consists of - **non-terminals**; in the example: `Integer`, `Exp`, `Exp1`, `L_integ` - **terminals**; in the example `+` `*` `(` `)` `1` `2` `3` - a **start symbol**, aka axiom; in the example `Exp` - a set of **rules** (aka production rules); for example Exp -> Exp1 Exp -> Exp '+' Exp1 Exp1 -> Exp1 '*' Integer Exp1 -> Integer Integer -> L_integ L_integ -> 1 L_integ -> 2 L_integ -> 3 A grammar is **context-free** if each rule has only a single symbol on the left of the `->`. Context free grammars are good for parsing. In fact, a context-free grammar can be understood as the specification of a parsing algorithm as much as a regular expression can be understood as the specification of a search algorithm. A **sentence** is a string consisting only of terminals. **Intermediate forms** or **sentential forms** are strings that contain terminals and non-terminals. To produce the **parse tree** Exp / | \ Exp + Exp1 | / | \ Exp1 Exp1 * Integer | | | Integer Integer L_integ | | | L_integ L_integ 3 | | 1 2 in a **bottom-up** manner we started with the **sentence** `1+2*3` and worked our way upwards to the start symbol `Exp` by using the production rules. To find the production rule we work from left to right through the input. After applying a production rule, we have an intermediate form or sentential form. The sequential form associated with a configuration such as `(L_integ, +2*3)` is simply the concatenation `L_integ + 2 * 3` of the stack with the remaining input. During any time of the parse there is at most one substring at the top of the stack that needs to be reduced in order to construct a given parse tree. This substring is called the **handle**. If there is no handle on the top of the stack we shift. The difficulty lies in finding the handle. For example, if we want to construct the parse tree above and are in the configuration `(Exp + Exp1, *3)` rule `Exp -> Exp '+' Exp1` matches the top of the stack `Exp + Exp1` but nevertheless is not the handle since reducing according to that rule will not produce the parse tree and, in fact, as we have seen, cannot produce any parse tree (with `Exp` at the root). ## Further Reading Chapter 2 of [Grune, Jacobs: Parsing Techniques](https://www.dickgrune.com/Books/PTAPG_1st_Edition/BookBody.pdf) with Sections 2.1 and 2.2 being the most relevant ones for this lecture. [^option-i]: In earlier versions of BNFC, in order to generate the "info" file, you had to add yourself the `-i` option to the Happy parser happy -i ParCalc.y To make this more permanent, one could make this modification directly into the Makefile by changing the relevant line to happy -gcai ParCalc.y To avoid that a new Makefile gets overwritten next time BNFC is called, call BNFC without the option "-m" as in bnfc -haskell Calc.cf make [^build]: Don't forget to build a new parser using `bnfc`.