# [Compiler2024] Implementation Assignment - Lexical Analyser
> [name=林濬祺 B11032043]
I wrote this project with the help of [this tutorial](https://westes.github.io/flex/manual/). Please note that the original note was written in HackMD [here](https://hackmd.io/@XYQZ/SJLj81If0), then exported to HTML and printed out. Much of the formatting was lost in those conversions, so please check out the original if you can.
## Code Breakdown
First, we set the scanner options and include the necessary headers.
```c=1
%option noyywrap
%{
#include<math.h>
#include<stdio.h>
```
To keep the output verbose and convenient to read, I define a line prefix that is printed before each token the scanner reads. `"\x1b[1;31m"` is the ANSI escape sequence for **bold red colour** and `"\x1b[0m"` resets the formatting.
```c=6
#define head "\tqv-lexical-analyser reads "
#define errorHead "\x1b[1;31mQV-Scanner Error (Line %d)\x1b[0m: "
```
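As a minimal sketch of how these macros are used: C joins adjacent string literals at compile time, so `errorHead "..."` becomes a single format string. The helper `format_error` below is hypothetical and not part of the scanner; it only illustrates the concatenation.

```c
#include <stdio.h>

#define head "\tqv-lexical-analyser reads "
#define errorHead "\x1b[1;31mQV-Scanner Error (Line %d)\x1b[0m: "

/* Hypothetical helper (not in the scanner): writes one error line into buf.
 * errorHead and the literal after it are merged into one format string. */
int format_error(char *buf, size_t size, int line, const char *text)
{
    return snprintf(buf, size, errorHead "unrecognised character %s\n", line, text);
}
```

Calling `format_error(buf, sizeof buf, 3, "@")` fills `buf` with the red prefix for line 3 followed by the message text.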
I carefully categorised the tokens so that only `OP_CMPAR` and `IDENTIFR` require checking the whole string; for the others, checking only the first character is sufficient to tell them apart:
```c=8
enum Token_Type
{
    UNIDNTFD,
    IDENTIFR,
    LIT_ITGR,
    LIT_REAL,
    LIT_BOOL,
    VAR_DECL,
    VAL_DECL,
    TYPENAME,
    KW_CLASS,
    KW_FLOWS,
    KW_FUNCS,
    OP_CMPAR,
    OP_ASIGN,
    OP_BRACK,
    OP_ARITH,
    OP_SEPAR,
};
```
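Because the enumerators have no explicit values, C numbers them from 0 in declaration order, which is where the numbers in parentheses in the logs come from (for example `OP_SEPAR` is 15). The sketch below shows this with a hypothetical `token_name` lookup table that is not part of the scanner; its labels are copied from the scanner's `printf` strings where one exists.

```c
#include <stddef.h>

enum Token_Type
{
    UNIDNTFD, IDENTIFR, LIT_ITGR, LIT_REAL, LIT_BOOL, VAR_DECL, VAL_DECL, TYPENAME,
    KW_CLASS, KW_FLOWS, KW_FUNCS, OP_CMPAR, OP_ASIGN, OP_BRACK, OP_ARITH, OP_SEPAR,
};

/* Hypothetical table (an assumption for illustration), indexed by token type. */
static const char *const token_names[] = {
    [UNIDNTFD] = "unidentified",
    [IDENTIFR] = "identifier",
    [LIT_ITGR] = "lit-integer",
    [LIT_REAL] = "lit-real",
    [LIT_BOOL] = "lit-bool",
    [VAR_DECL] = "variable-decl",
    [VAL_DECL] = "value-decl",
    [TYPENAME] = "typename",
    [KW_CLASS] = "keyword-class",
    [KW_FLOWS] = "keyword-control-flow",
    [KW_FUNCS] = "keyword-function-related",
    [OP_CMPAR] = "operator-compare",
    [OP_ASIGN] = "operator-assignment",
    [OP_BRACK] = "operator-brackets",
    [OP_ARITH] = "operator-arithmatics", /* spelling as printed by the scanner */
    [OP_SEPAR] = "operator-separator",
};

/* Returns the label for a token type, or "unknown" when out of range. */
const char *token_name(enum Token_Type type)
{
    size_t count = sizeof token_names / sizeof token_names[0];
    return (size_t)type < count ? token_names[type] : "unknown";
}
```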
I want to keep track of the line number and the number of failures so that the error print-out is more readable, thus:
```c=28
int lineNum = 1;
int nFail = 0;
```
I wrote the scanner so that when an error happens, it just throws the offending string away and keeps going. The function `myterm` prints the error count for the whole piece of code at the end.
```c=31
void myterm();
%}
```
Here we define some useful regexes and a `<COMMENT>` start condition, and then start to define the scanning rules.
```c=33
id [_a-zA-Z][_a-zA-Z0-9]*
nonDelimiter [^ \t\n+\-*/()\[\]{},;:'"=!<>]
opChar [+\-*/=<>()[\]{},:;'"]
escSeq \\[nt\'\"?\\]
nonEscSeq \\[^nt'"?\\]
/* for multiple-line comment */
%x COMMENT
```
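The `escSeq` definition accepts a backslash followed by one of `n`, `t`, `'`, `"`, `?` or `\`, while `nonEscSeq` covers a backslash followed by anything else. As a rough C sketch of the same check (the function name `is_valid_escape` is hypothetical and does not appear in the scanner):

```c
#include <string.h>

/* Returns 1 if a backslash followed by c would be accepted by escSeq,
 * i.e. c is one of: n t ' " ? \   Returns 0 otherwise (the nonEscSeq case). */
int is_valid_escape(char c)
{
    /* strchr would also match the terminating '\0', so exclude it first. */
    return c != '\0' && strchr("nt'\"?\\", c) != NULL;
}
```

Under this check `\n` and `\\` are valid, while `\m` or `\5` are not.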
I grouped the rules in this chunk by the kind of character they start with, so it should be quite clear and self-explanatory. Note that flex resolves ambiguity between patterns: it takes the longest match, and when two patterns match the same length, the rule listed first wins. For example, `01580` matches both `[0-9]+` and `[0-9]{nonDelimiter}+` with the same length, so the first one is guaranteed to be chosen.
```c=41
%%
    /* spaces */
" "|"\t" {}
"\n" { ++lineNum; }
    /* tokens-started-with-numerals */
[0-9]+ { printf( head "`lit-integer`(%i), `%s`\n", LIT_ITGR, yytext); }
[0-9]*"."[0-9]+|[0-9]+"." { printf( head "`lit-real`(%i), `%s`\n", LIT_REAL, yytext); }
true|false { printf(head "`lit-bool-%s`(%i)\n", yytext, LIT_BOOL); }
[0-9]{nonDelimiter}+ { printf(errorHead "Unidentified symbol(%i) `%s`\n", lineNum, UNIDNTFD, yytext); ++nFail; }
    /* tokens started with alphabetics */
var { printf(head "`variable-decl`(%i)\n", VAR_DECL); }
val { printf(head "`value-decl`(%i)\n", VAL_DECL); }
bool|char|int|real { printf(head "`typename`(%i): %s\n", TYPENAME, yytext); }
class { printf(head "`keyword-class`(%i)\n", KW_CLASS); }
if|else|for|while|do|switch|case { printf(head "`keyword-control-flow`(%i), `%s`\n", KW_FLOWS, yytext); }
fun|ret { printf(head "`keyword-function-related`(%i), `%s`\n", KW_FUNCS, yytext); }
{id} { printf(head "`identifier`(%i), `%s`\n", IDENTIFR, yytext); }
    /* tokens started with symbols */
"'"([^'\n\\]|{escSeq})?"'" { printf(head "`lit-single-quote`, `%s`\n", yytext); }
[']([^'\n\\]|{escSeq})? { printf(errorHead "single-quote was never closed, or contains more than a character: %s\n", lineNum, yytext); ++nFail; }
[']{nonEscSeq}[']? { printf(errorHead "Invalid `\\`-sequence in single-quote: %s\n", lineNum, yytext); ++nFail; }
["]([^"\n\\]|{escSeq})*["] { printf(head "`lit-double-quote`, `%s`\n", yytext); }
["]([^"\n\\]|{escSeq})*\n { printf(errorHead "double-quote was never closed, or contains more than one line: %s", lineNum, yytext); ++lineNum; ++nFail; }
["]([^"\n\\]|{escSeq})*{nonEscSeq} { printf(errorHead "Invalid `\\`-sequence in double-quote:%s", lineNum, yytext); ++nFail; }
"=="|"!="|"<"|">"|"<="|">=" { printf(head "`operator-compare`(%i), `%s`\n", OP_CMPAR, yytext); }
"("|")"|"{"|"}"|"["|"]" { printf(head "`operator-brackets`(%i), `%s`\n", OP_BRACK, yytext); }
"+"|"-"|"*"|"/" { printf(head "`operator-arithmatics`(%i), `%s`\n", OP_ARITH, yytext); }
","|";"|":" { printf(head "`operator-separator`(%i), `%s`\n", OP_SEPAR, yytext); }
"=" { printf(head "`operator-assignment`(%i), `%s`\n", OP_ASIGN, yytext); }
"//"[^\n]*"\n" { printf(head "a single-line comment: %s", yytext); ++lineNum;}
"/*" { BEGIN(COMMENT); }
<COMMENT>"*/" { printf(head "a multiple-line comment (not shown here)\n"); BEGIN(INITIAL); }
<COMMENT>[^\n] {}
<COMMENT>"\n" { ++lineNum; }
    /* unrecognised characters and end-of-file handling */
<INITIAL><<EOF>> { myterm(); yyterminate(); }
<COMMENT><<EOF>> { printf(errorHead "a multiple-line comment was never closed\n", lineNum); ++nFail; myterm(); yyterminate(); }
    /* flex has no `!` negation operator, so a final catch-all `.` handles any character no earlier rule matched */
. { printf(errorHead "unrecognised character %s\n", lineNum, yytext); ++nFail; }
%%
int main(void)
{
    yylex();
    return 0;
}
void myterm()
{
    if (nFail)
        printf("\nThe qv program is scanned with %d errors.\n", nFail);
    else
        printf("\nThe qv program is scanned successfully.\n");
}
```
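To illustrate how the three numeric rules interact, here is a small hand-written C sketch of flex's "longest match, then earliest rule" behaviour. It is not the generated scanner; the names `classify_number`, `match_integer`, `match_real` and `match_bad_number` are hypothetical, and the delimiter set is copied from the `nonDelimiter` definition.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum Kind { NONE, INTEGER, REAL, BAD_NUMBER }; /* same order as the rules */

/* Characters excluded by nonDelimiter; end of string also stops a token. */
static int is_delimiter(char c)
{
    return c == '\0' || strchr(" \t\n+-*/()[]{},;:'\"=!<>", c) != NULL;
}

/* Length matched by [0-9]+ (0 if none). */
static size_t match_integer(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Length matched by [0-9]*"."[0-9]+|[0-9]+"." (0 if none). */
static size_t match_real(const char *s)
{
    size_t whole = match_integer(s);
    if (s[whole] != '.') return 0;
    size_t frac = match_integer(s + whole + 1);
    if (frac > 0) return whole + 1 + frac;  /* digits* . digits+ */
    return whole > 0 ? whole + 1 : 0;       /* digits+ .         */
}

/* Length matched by [0-9]{nonDelimiter}+ (0 if none). */
static size_t match_bad_number(const char *s)
{
    if (!isdigit((unsigned char)s[0])) return 0;
    size_t n = 1;
    while (!is_delimiter(s[n])) n++;
    return n >= 2 ? n : 0;
}

/* Picks the longest match; strict '>' keeps the earlier rule on a tie. */
enum Kind classify_number(const char *s, size_t *len)
{
    size_t lens[] = { 0, match_integer(s), match_real(s), match_bad_number(s) };
    int best = NONE;
    for (int k = INTEGER; k <= BAD_NUMBER; k++)
        if (lens[k] > lens[best])
            best = k;
    *len = lens[best];
    return (enum Kind)best;
}
```

With this sketch, `01580` is classified as `INTEGER` because both candidate rules match five characters and the integer rule comes first, whereas `13.5787.2` is classified as `BAD_NUMBER` because the error rule matches all nine characters and the real-number rule only the first seven.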
## Sample code given in Notion(R)

The result is too long to fit in a screenshot, so I pasted the log here.
```
░▒▓ ~/Codes/compiler ▓▒░ ./lex.o < ./sample1.qv ░▒▓ ✔ at 22:20:56 ▓▒░
qv-lexical-analyser reads a single-line comment: // qv Sample Program No. 1
qv-lexical-analyser reads `keyword-function-related`(10), `fun`
qv-lexical-analyser reads `identifier`(1), `main`
qv-lexical-analyser reads `operator-brackets`(13), `(`
qv-lexical-analyser reads `operator-brackets`(13), `)`
qv-lexical-analyser reads `operator-brackets`(13), `{`
qv-lexical-analyser reads a single-line comment: // Function definition
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `identifier`(1), `i`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): int
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-integer`(2), `10`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // Integers; always signed
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `identifier`(1), `j`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): real
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-real`(3), `3.14159`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // Real numbers; always signed
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `identifier`(1), `k`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): char
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-single-quote`, `'c'`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // Character; in ASCII encoding
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `identifier`(1), `l`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): int
qv-lexical-analyser reads `operator-brackets`(13), `[`
qv-lexical-analyser reads `lit-integer`(2), `5`
qv-lexical-analyser reads `operator-brackets`(13), `]`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // 1D array (/vector) with 5 integers
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `identifier`(1), `m`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): int
qv-lexical-analyser reads `operator-brackets`(13), `[`
qv-lexical-analyser reads `lit-integer`(2), `3`
qv-lexical-analyser reads `operator-brackets`(13), `]`
qv-lexical-analyser reads `operator-brackets`(13), `[`
qv-lexical-analyser reads `lit-integer`(2), `4`
qv-lexical-analyser reads `operator-brackets`(13), `]`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // 2D array with 3 rows, each with 4 integers
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `identifier`(1), `n`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): char
qv-lexical-analyser reads `operator-brackets`(13), `[`
qv-lexical-analyser reads `lit-integer`(2), `10`
qv-lexical-analyser reads `operator-brackets`(13), `]`
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-double-quote`, `"Hello, world!"`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // 1D arrays with characters are strings
qv-lexical-analyser reads `identifier`(1), `println`
qv-lexical-analyser reads `operator-brackets`(13), `(`
qv-lexical-analyser reads `identifier`(1), `i`
qv-lexical-analyser reads `operator-brackets`(13), `)`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // Function call; print i and a new line character
qv-lexical-analyser reads `identifier`(1), `i`
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-integer`(2), `20`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // Assign a new value 20 for i
qv-lexical-analyser reads `identifier`(1), `println`
qv-lexical-analyser reads `operator-brackets`(13), `(`
qv-lexical-analyser reads `identifier`(1), `i`
qv-lexical-analyser reads `operator-brackets`(13), `)`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `identifier`(1), `l`
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `operator-brackets`(13), `{`
qv-lexical-analyser reads `lit-integer`(2), `1`
qv-lexical-analyser reads `operator-separator`(15), `,`
qv-lexical-analyser reads `lit-integer`(2), `2`
qv-lexical-analyser reads `operator-separator`(15), `,`
qv-lexical-analyser reads `lit-integer`(2), `3`
qv-lexical-analyser reads `operator-separator`(15), `,`
qv-lexical-analyser reads `lit-integer`(2), `4`
qv-lexical-analyser reads `operator-separator`(15), `,`
qv-lexical-analyser reads `lit-integer`(2), `5`
qv-lexical-analyser reads `operator-brackets`(13), `}`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // Assign a vector with 5 integers 1, 2, 3, 4, 5 in order
qv-lexical-analyser reads `identifier`(1), `println`
qv-lexical-analyser reads `operator-brackets`(13), `(`
qv-lexical-analyser reads `identifier`(1), `l`
qv-lexical-analyser reads `operator-brackets`(13), `)`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `identifier`(1), `k`
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-single-quote`, `'\\'`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // Assign a char with new value '\\' (backslash)
qv-lexical-analyser reads `identifier`(1), `println`
qv-lexical-analyser reads `operator-brackets`(13), `(`
qv-lexical-analyser reads `identifier`(1), `k`
qv-lexical-analyser reads `operator-brackets`(13), `)`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `identifier`(1), `println`
qv-lexical-analyser reads `operator-brackets`(13), `(`
qv-lexical-analyser reads `identifier`(1), `n`
qv-lexical-analyser reads `operator-brackets`(13), `)`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `identifier`(1), `n`
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-double-quote`, `"Another string"`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a multiple-line comment (not shown here)
qv-lexical-analyser reads `identifier`(1), `n`
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-double-quote`, `"Third string"`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `identifier`(1), `println`
qv-lexical-analyser reads `operator-brackets`(13), `(`
qv-lexical-analyser reads `identifier`(1), `n`
qv-lexical-analyser reads `operator-brackets`(13), `)`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `keyword-function-related`(10), `ret`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // Return nothing to terminate the function body
qv-lexical-analyser reads `operator-brackets`(13), `}`
The qv program is scanned successfully.
```
## Sample code without errors

```
░▒▓ ~/Codes/compiler ▓▒░ ./lex.o < ./pass.qv ░▒▓ ✔ at 22:51:55 ▓▒░
qv-lexical-analyser reads `value-decl`(6)
qv-lexical-analyser reads `identifier`(1), `pass`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): int
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-real`(3), `125.1`
qv-lexical-analyser reads `operator-arithmatics`(14), `/`
qv-lexical-analyser reads `lit-real`(3), `1598613.235`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `identifier`(1), `______q2uh`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): int
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-integer`(2), `1085`
qv-lexical-analyser reads `operator-arithmatics`(14), `/`
qv-lexical-analyser reads `lit-integer`(2), `5`
qv-lexical-analyser reads `operator-arithmatics`(14), `*`
qv-lexical-analyser reads `lit-integer`(2), `1`
qv-lexical-analyser reads `operator-arithmatics`(14), `+`
qv-lexical-analyser reads `operator-arithmatics`(14), `-`
qv-lexical-analyser reads `operator-arithmatics`(14), `+`
qv-lexical-analyser reads `operator-arithmatics`(14), `-`
qv-lexical-analyser reads `operator-arithmatics`(14), `-`
qv-lexical-analyser reads `operator-arithmatics`(14), `+`
qv-lexical-analyser reads `operator-arithmatics`(14), `-`
qv-lexical-analyser reads `operator-arithmatics`(14), `+`
qv-lexical-analyser reads `operator-arithmatics`(14), `-`
qv-lexical-analyser reads `lit-integer`(2), `13`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `value-decl`(6)
qv-lexical-analyser reads `identifier`(1), `qot3298yh`
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `identifier`(1), `pass`
qv-lexical-analyser reads `operator-arithmatics`(14), `/`
qv-lexical-analyser reads `identifier`(1), `______q2uh`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `keyword-function-related`(10), `fun`
qv-lexical-analyser reads `identifier`(1), `main`
qv-lexical-analyser reads `operator-brackets`(13), `(`
qv-lexical-analyser reads `operator-brackets`(13), `)`
qv-lexical-analyser reads `operator-brackets`(13), `{`
qv-lexical-analyser reads `value-decl`(6)
qv-lexical-analyser reads `identifier`(1), `PI`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): real
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-real`(3), `3.14159265358979323846264338`
qv-lexical-analyser reads a single-line comment: //我可以背圓周率到400位,有沒有加分?
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `identifier`(1), `vvwi00ari38`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): char
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-single-quote`, `'4'`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `value-decl`(6)
qv-lexical-analyser reads `identifier`(1), `r1`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `operator-brackets`(13), `[`
qv-lexical-analyser reads `operator-brackets`(13), `]`
qv-lexical-analyser reads `typename`(7): char
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-double-quote`, `"jtgqr9h3t908hwogih4\\13"`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `keyword-function-related`(10), `ret`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `operator-brackets`(13), `}`
The qv program is scanned successfully.
```
## Sample code with errors

```
░▒▓ ~/Codes/compiler ▓▒░ ./lex.o < ./error.qv ░▒▓ ✔ at 22:51:50 ▓▒░
qv-lexical-analyser reads `lit-integer`(2), `114514`
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `value-decl`(6)
qv-lexical-analyser reads `keyword-function-related`(10), `ret`
qv-lexical-analyser reads `keyword-function-related`(10), `fun`
QV-Scanner Error(Line 2): Invalid `\`-sequence in single-quote: '\m'
QV-Scanner Error(Line 3): Invalid `\`-sequence in single-quote: '\i'
qv-lexical-analyser reads `lit-single-quote`, `'\\'`
QV-Scanner Error(Line 5): Invalid `\`-sequence in single-quote: '\!'
QV-Scanner Error(Line 6): single-quote was never closed, or contains more than a character: '9
qv-lexical-analyser reads `lit-integer`(2), `87`
QV-Scanner Error(Line 6): single-quote was never closed, or contains more than a character: '
QV-Scanner Error(Line 7): single-quote was never closed, or contains more than a character: '\'
qv-lexical-analyser reads `lit-real`(3), `1.1`
QV-Scanner Error(Line 9): Unidentified symbol(0) `13.5787.2`
QV-Scanner Error(Line 11): Invalid `\`-sequence in double-quote:"aowih/q3toiha;\t \n\\wr\5 qv-lexical-analyser reads `identifier`(1), `ub`
QV-Scanner Error(Line 11): double-quote was never closed, or contains more than one line: "
QV-Scanner Error(Line 14): a multiple-line comment was never closed
The qv program is scanned with 10 errors.
```
## Quick notes
- We do not recognise `\` at the end of a line as a line continuation.
- We break up any continuous operator sequence that cannot be scanned as a whole. For example, `+-+-+-+-+` will be seen as 9 separate operators, even without spaces between them.
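
The second note can be sketched in C as follows. This is only an illustration under simplifying assumptions: the function names `operator_length` and `count_operators` are hypothetical, only the arithmetic, comparison and assignment operators are handled, and comment openers such as `//` and `/*` are ignored.

```c
#include <stddef.h>
#include <string.h>

/* Length of the operator token at the start of s (0 if there is none).
 * Two-character comparisons are tried first, mirroring longest match. */
size_t operator_length(const char *s)
{
    static const char *const two_char[] = { "==", "!=", "<=", ">=" };
    for (size_t i = 0; i < sizeof two_char / sizeof two_char[0]; i++)
        if (strncmp(s, two_char[i], 2) == 0)
            return 2;
    if (s[0] != '\0' && strchr("+-*/<>=", s[0]) != NULL)
        return 1;
    return 0;
}

/* Counts how many operator tokens a run of operator characters splits into. */
size_t count_operators(const char *s)
{
    size_t count = 0, len;
    while ((len = operator_length(s)) > 0) {
        s += len;
        count++;
    }
    return count;
}
```

Here `count_operators("+-+-+-+-+")` gives 9, while `count_operators("<==")` gives 2 because `<=` is taken as a whole before the remaining `=`.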