Recursive Descent Parsers

--- tags: posts --- Recursive Descent Parsers ========================= A simple way to implement parsers for reasonably simple languages is based on a recursive descent approach. As an example consider writing of a parser for expressions with support for: - numbers - variables - addition, subtraction, multiplication and division - parenthesis Definining the grammar ====================== The first step is defining a precise grammar for our language, in our case something like: number ::= "0".."9" ["0".."9"] variable ::= "a".."z" [{"a".."z" | "0".."9"}] term ::= {number | variable | "(" expression ")"} addendum ::= term [{"*" | "/"} term] expression ::= addendum [{"+" | "-"} addendum] In the above I've used the notation `<term> ::= <definition>`, `[<expr>]` to mean zero or more repetitions and `{<expr1>|<expr2>|...|<exprN>}` to represent alternatives. Note that I've used `addendum` (handling multiplication and division) and `expression` (adding addition and subtraction) to incorporate in the grammar the higher precedence of multiplication and division over addition and subtraction. The simpler definition: expression ::= term [{"+" | "-" | "*" | "/"} term] wouldn't capture this concept and would for example parse the text `1+2*3` as `(1+2)*3` instead of `1+(2*3)`. Note also that `term` is defined in terms of `expression`, and in turn `expression` is defined in terms of `addendum` that is defined in terms of `term`. This is where the recursion comes from. Defining a source of characters =============================== The parser needs to read from a source of characters and produce some output (for example the result, an AST or bytecode for a virtual machine). In our case something that is useful and powerful enough is just the concept of "current character" and "advance to next character". Using C++ as implementation language I'll just use a reference to a pointer to char for the source, assuming a standard NUL-terminated string as input. This approach is not powerful enough to handle multi-character operators (e.g. `>=`) and in that case an intermediate "tokenizer" necessary. In that case the parser would be implemented in terms of "current token" and an "advance to next token". Implementing the parser ======================= Assume we want to compute the numerical result of the expression, given the source code and a mapping between variables and values. #include <stdio.h> #include <stdlib.h> #include <string.h> #include <string> #include <map> std::map<std::string, double> variables; // Evaluate expression pointed by p, returning its value double eval_expression(const char *& p); // Utility to signal an error [[noreturn]] void error(const char *msg) { fprintf(stderr, "Error: '%s'", msg); exit(1); } Whitespace handling =================== The grammar wasn't considering whitespace, but we would like to ignore spaces when they're not relevant. So I'll write a simple function "skip spaces" that will just advance `p` to first intereseting char. void skipSpaces(const char *& p) { while (*p && strchr("\r\t\n ", *p)) p++; } Implementing `number` ===================== Let's start by implementing the `number` element of the grammar; for this the standard library function `strtod` does what we want (actually more, it also handles decimal point, exponent etc...) double eval_number(const char *& p) { skipSpaces(p); if (*p < '0' || *p > '9') error("Not a number"); char *ep = (char *)p; double res = strtod(p, &ep); // Todo: error checking p = ep; return res; } Implementing `variable` ======================= To evaluate a variable we simply collect the name and lookup the dictionary: double eval_variable(const char *& p) { skipSpaces(p); if (*p < 'a' || *p > 'z') error("Not a variable"); const char *p0 = p; while (*p >= 'a' && *p <= 'z') p++; return variables[std::string(p0, p)]; } Implementing `term` =================== For a `term` we need to check a few cases: it could be a number, a variable or a parenthesized sub-expression: double eval_term(const char *& p) { skipSpaces(p); if (*p >= 'a' && *p <= 'z') return eval_variable(p); if (*p >= '0' && *p <= '9') return eval_number(p); if (*p == '(') { p++; double res = eval_expression(p); skipSpaces(p); if (*p != ')') error("')' expected"); p++; return res; } error("Number, variable or sub-expression expected"); } Implementing `addendum` ======================= For the implementation of `addendum` we first evaluate the term, then make a loop multiplying or dividing by a following term, exiting when no more multiplcations or divisions are present: double eval_addendum(const char *& p) { double res = eval_term(p); for(;;) { skipSpaces(p); if (*p != '*' && *p != '/') break; char op = *p++; res = op == '*' ? res * eval_term(p) : res / eval_term(p); } return res; } Implementing `expression` ========================= Finally we got to closing the recursion circle. `expression` implementation is very similar to `addendum`, with the difference that instead of calling `eval_term` we need to call `eval_addendum` and that we need to handle `+` and `-` instead of `*` and `/`. The similarity is a direct consequence of the similar definition in the grammar. double eval_expression(const char *& p) { double res = eval_addendum(p); for(;;) { skipSpaces(p); if (*p != '+' && *p != '-') break; char op = *p++; res = op == '+' ? res + eval_addendum(p) : res - eval_addendum(p); } return res; } Final remarks ============= That's it. This implementation is able to handle the four basic operations, parenthesis, numbers and variables; let's check it out with a small test program: int main(int argc, const char *argv[]) { variables = std::map<std::string, double>{{"x", 10.0}, {"y", 20.0}}; const char *p = "x*x + y*y"; double res = eval_expression(p); printf("result = %.18g (expected %.18g)\n", res, 10.0*10.0 + 20.0*20.0); return 0; }