---
tags: posts
---
Recursive Descent Parsers
=========================
A simple way to implement parsers for reasonably simple languages is based on a recursive descent approach.
As an example consider writing of a parser for expressions with support for:
- numbers
- variables
- addition, subtraction, multiplication and division
- parenthesis
Definining the grammar
======================
The first step is defining a precise grammar for our language, in our case something like:
number ::= "0".."9" ["0".."9"]
variable ::= "a".."z" [{"a".."z" | "0".."9"}]
term ::= {number | variable | "(" expression ")"}
addendum ::= term [{"*" | "/"} term]
expression ::= addendum [{"+" | "-"} addendum]
In the above I've used the notation `<term> ::= <definition>`, `[<expr>]` to mean zero or more repetitions and `{<expr1>|<expr2>|...|<exprN>}` to represent alternatives.
Note that I've used `addendum` (handling multiplication and division) and `expression` (adding addition and subtraction) to incorporate in the grammar the higher precedence of multiplication and division over addition and subtraction. The simpler definition:
expression ::= term [{"+" | "-" | "*" | "/"} term]
wouldn't capture this concept and would for example parse the text `1+2*3` as `(1+2)*3` instead of `1+(2*3)`.
Note also that `term` is defined in terms of `expression`, and in turn `expression` is defined in terms of `addendum` that is defined in terms of `term`.
This is where the recursion comes from.
Defining a source of characters
===============================
The parser needs to read from a source of characters and produce some output (for example the result, an AST or bytecode for a virtual machine). In our case something that is useful and powerful enough is just the concept of "current character" and "advance to next character".
Using C++ as implementation language I'll just use a reference to a pointer to char for the source, assuming a standard NUL-terminated string as input.
This approach is not powerful enough to handle multi-character operators (e.g. `>=`) and in that case an intermediate "tokenizer" necessary. In that case the parser would be implemented in terms of "current token" and an "advance to next token".
Implementing the parser
=======================
Assume we want to compute the numerical result of the expression, given the source code and a mapping between variables and values.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <string>
#include <map>
std::map<std::string, double> variables;
// Evaluate expression pointed by p, returning its value
double eval_expression(const char *& p);
// Utility to signal an error
[[noreturn]] void error(const char *msg) {
fprintf(stderr, "Error: '%s'", msg);
exit(1);
}
Whitespace handling
===================
The grammar wasn't considering whitespace, but we would like to ignore spaces when they're not relevant. So I'll write a simple function "skip spaces" that will just advance `p` to first intereseting char.
void skipSpaces(const char *& p) {
while (*p && strchr("\r\t\n ", *p)) p++;
}
Implementing `number`
=====================
Let's start by implementing the `number` element of the grammar; for this the standard library function `strtod` does what we want (actually more, it also handles decimal point, exponent etc...)
double eval_number(const char *& p) {
skipSpaces(p);
if (*p < '0' || *p > '9') error("Not a number");
char *ep = (char *)p;
double res = strtod(p, &ep);
// Todo: error checking
p = ep;
return res;
}
Implementing `variable`
=======================
To evaluate a variable we simply collect the name and lookup the dictionary:
double eval_variable(const char *& p) {
skipSpaces(p);
if (*p < 'a' || *p > 'z') error("Not a variable");
const char *p0 = p;
while (*p >= 'a' && *p <= 'z') p++;
return variables[std::string(p0, p)];
}
Implementing `term`
===================
For a `term` we need to check a few cases: it could be a number, a variable or a parenthesized sub-expression:
double eval_term(const char *& p) {
skipSpaces(p);
if (*p >= 'a' && *p <= 'z') return eval_variable(p);
if (*p >= '0' && *p <= '9') return eval_number(p);
if (*p == '(') {
p++;
double res = eval_expression(p);
skipSpaces(p);
if (*p != ')') error("')' expected");
p++;
return res;
}
error("Number, variable or sub-expression expected");
}
Implementing `addendum`
=======================
For the implementation of `addendum` we first evaluate the term, then make a loop multiplying or dividing by a following term, exiting when no more multiplcations or divisions are present:
double eval_addendum(const char *& p) {
double res = eval_term(p);
for(;;) {
skipSpaces(p);
if (*p != '*' && *p != '/') break;
char op = *p++;
res = op == '*' ? res * eval_term(p)
: res / eval_term(p);
}
return res;
}
Implementing `expression`
=========================
Finally we got to closing the recursion circle. `expression` implementation is very similar to `addendum`, with the difference that instead of calling `eval_term` we need to call `eval_addendum` and that we need to handle `+` and `-` instead of `*` and `/`.
The similarity is a direct consequence of the similar definition in the grammar.
double eval_expression(const char *& p) {
double res = eval_addendum(p);
for(;;) {
skipSpaces(p);
if (*p != '+' && *p != '-') break;
char op = *p++;
res = op == '+' ? res + eval_addendum(p)
: res - eval_addendum(p);
}
return res;
}
Final remarks
=============
That's it. This implementation is able to handle the four basic operations, parenthesis, numbers and variables; let's check it out with a small test program:
int main(int argc, const char *argv[]) {
variables = std::map<std::string, double>{{"x", 10.0},
{"y", 20.0}};
const char *p = "x*x + y*y";
double res = eval_expression(p);
printf("result = %.18g (expected %.18g)\n", res, 10.0*10.0 + 20.0*20.0);
return 0;
}