# [Compiler2024] Implementation Assignment - Lexical Analyser

> [name=林濬祺 B11032043]

I wrote this project with the help of [this tutorial](https://westes.github.io/flex/manual/). Please note that the original note was written in HackMD [here](https://hackmd.io/@XYQZ/SJLj81If0), then exported to HTML and printed. Much of the formatting is lost in those conversions, so please check out the original if you can.

## Code Breakdown

First we set the scanner option and open the C prologue with the headers we need.

```c
%option noyywrap
%{
#include<math.h>
#include<stdio.h>
```

To keep the output verbose and convenient to read, every printed line starts with a fixed prefix. `"\x1b[1;31m"` is the ANSI escape sequence for **bold red colour** and `"\x1b[0m"` resets it.

```c
#define head "\tqv-lexical-analyser reads "
#define errorHead "\x1b[1;31mQV-Scanner Error (Line %d)\x1b[0m: "
```

I categorised the tokens carefully so that only `OP_CMPAR` and `IDENTIFR` require checking the whole string; for every other category, the first character of the lexeme is enough to tell its members apart:

```c
enum Token_Type {
    UNIDNTFD, /*  0: unidentified symbol */
    IDENTIFR, /*  1: identifier */
    LIT_ITGR, /*  2: integer literal */
    LIT_REAL, /*  3: real literal */
    LIT_BOOL, /*  4: boolean literal */
    VAR_DECL, /*  5: `var` */
    VAL_DECL, /*  6: `val` */
    TYPENAME, /*  7: bool, char, int, real */
    KW_CLASS, /*  8: `class` */
    KW_FLOWS, /*  9: control-flow keywords */
    KW_FUNCS, /* 10: `fun`, `ret` */
    OP_CMPAR, /* 11: comparison operators */
    OP_ASIGN, /* 12: `=` */
    OP_BRACK, /* 13: brackets */
    OP_ARITH, /* 14: arithmetic operators */
    OP_SEPAR, /* 15: `,` `;` `:` */
};
```

I keep track of the current line number and the number of failures so that the error print-out is more readable:

```c
int lineNum = 1;
int nFail = 0;
```

When an error occurs, the scanner simply discards the offending string and keeps going. The function `myterm` therefore prints the error count for the whole program at the end.

```c
void myterm();
%}
```

Next we define some helper regular expressions and a `<COMMENT>` start condition, and then move on to the scanning rules.
```c
id [_a-zA-Z][_a-zA-Z0-9]*
nonDelimiter [^ \t\n+\-*/()\[\]{},;:'"=!<>]
opChar [+\-*/=<>()[\]{},:;'"]
escSeq \\[nt\'\"?\\]
nonEscSeq \\[^nt'"?\\]

/* for multiple-line comment */
%x COMMENT
```

I grouped the rules by the kind of character each token starts with, so this part should be fairly self-explanatory. Note that rule order matters: when several patterns match the same longest piece of input, flex picks the one listed first. For example, `01580` matches both `[0-9]+` and `[0-9]{nonDelimiter}+` with the same length, so the first rule is guaranteed to be chosen.

```c
%%
    /* spaces */
" "|"\t"    {}
"\n"        { ++lineNum; }

    /* tokens starting with numerals (plus the boolean literals) */
[0-9]+      { printf(head "`lit-integer`(%i), `%s`\n", LIT_ITGR, yytext); }
[0-9]*"."[0-9]+|[0-9]+"." { printf(head "`lit-real`(%i), `%s`\n", LIT_REAL, yytext); }
true|false  { printf(head "`lit-bool-%s`(%i)\n", yytext, LIT_BOOL); }
[0-9]{nonDelimiter}+ {
    printf(errorHead "Unidentified symbol(%i) `%s`\n", lineNum, UNIDNTFD, yytext);
    ++nFail;
}

    /* tokens starting with alphabetic characters */
var         { printf(head "`variable-decl`(%i)\n", VAR_DECL); }
val         { printf(head "`value-decl`(%i)\n", VAL_DECL); }
bool|char|int|real { printf(head "`typename`(%i): %s\n", TYPENAME, yytext); }
class       { printf(head "`keyword-class`(%i)\n", KW_CLASS); }
if|else|for|while|do|switch|case { printf(head "`keyword-control-flow`(%i), `%s`\n", KW_FLOWS, yytext); }
fun|ret     { printf(head "`keyword-function-related`(%i), `%s`\n", KW_FUNCS, yytext); }
{id}        { printf(head "`identifier`(%i), `%s`\n", IDENTIFR, yytext); }

    /* tokens starting with symbols */
"'"([^'\n\\]|{escSeq})?"'" { printf(head "`lit-single-quote`, `%s`\n", yytext); }
[']([^'\n\\]|{escSeq})? {
    printf(errorHead "single-quote was never closed, or contains more than a character: %s\n", lineNum, yytext);
    ++nFail;
}
[']{nonEscSeq}[']? {
    printf(errorHead "Invalid `\\`-sequence in single-quote: %s\n", lineNum, yytext);
    ++nFail;
}
["]([^"\n\\]|{escSeq})*["] { printf(head "`lit-double-quote`, `%s`\n", yytext); }
["]([^"\n\\]|{escSeq})*\n {
    /* yytext already ends with the newline, so none is added here */
    printf(errorHead "double-quote was never closed, or contains more than one line: %s", lineNum, yytext);
    ++lineNum;
    ++nFail;
}
["]([^"\n\\]|{escSeq})*{nonEscSeq} {
    printf(errorHead "Invalid `\\`-sequence in double-quote:%s", lineNum, yytext);
    ++nFail;
}
"=="|"!="|"<"|">"|"<="|">=" { printf(head "`operator-compare`(%i), `%s`\n", OP_CMPAR, yytext); }
"("|")"|"{"|"}"|"["|"]" { printf(head "`operator-brackets`(%i), `%s`\n", OP_BRACK, yytext); }
"+"|"-"|"*"|"/" { printf(head "`operator-arithmatics`(%i), `%s`\n", OP_ARITH, yytext); }
","|";"|":"  { printf(head "`operator-separator`(%i), `%s`\n", OP_SEPAR, yytext); }
"="          { printf(head "`operator-assignment`(%i), `%s`\n", OP_ASIGN, yytext); }
"//"[^\n]*"\n" { printf(head "a single-line comment: %s", yytext); ++lineNum; }
"/*"         { BEGIN(COMMENT); }
<COMMENT>"*/" { printf(head "a multiple-line comment (not shown here)\n"); BEGIN(INITIAL); }
<COMMENT>[^\n] {}
<COMMENT>"\n" { ++lineNum; }

    /* end-of-file handling and unrecognised characters */
<INITIAL><<EOF>> { myterm(); yyterminate(); }
<COMMENT><<EOF>> {
    printf(errorHead "a multiple-line comment was never closed\n", lineNum);
    ++nFail;
    myterm();
    yyterminate();
}
    /* `!` is not a negation operator in flex, so the catch-all `.` is used
       instead and placed last, where it only sees characters that no
       earlier rule has matched */
. {
    printf(errorHead "unrecognised character %s\n", lineNum, yytext);
    ++nFail;
}
%%
int main() {
    yylex();
    return 0;
}

void myterm() {
    if (nFail)
        printf("\nThe qv program is scanned with %d errors.\n", nFail);
    else
        printf("\nThe qv program is scanned successfully.\n");
}
```

## Sample code given in Notion(R)

![2024-05-06-22:54:47-screenshot](https://hackmd.io/_uploads/HyNryOLGA.png)

The result is too long to fit in a screenshot, so I pasted the log here.

```
~/Codes/compiler $ ./lex.o < ./sample1.qv
qv-lexical-analyser reads a single-line comment: // qv Sample Program No. 1
qv-lexical-analyser reads `keyword-function-related`(10), `fun`
qv-lexical-analyser reads `identifier`(1), `main`
qv-lexical-analyser reads `operator-brackets`(13), `(`
qv-lexical-analyser reads `operator-brackets`(13), `)`
qv-lexical-analyser reads `operator-brackets`(13), `{`
qv-lexical-analyser reads a single-line comment: // Function definition
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `identifier`(1), `i`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): int
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-integer`(2), `10`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // Integers; always signed
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `identifier`(1), `j`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): real
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-real`(3), `3.14159`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // Real numbers; always signed
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `identifier`(1), `k`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): char
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-single-quote`, `'c'`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // Character; in ASCII encoding
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `identifier`(1), `l`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): int
qv-lexical-analyser reads `operator-brackets`(13), `[`
qv-lexical-analyser reads `lit-integer`(2), `5`
qv-lexical-analyser reads `operator-brackets`(13), `]`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // 1D array (/vector) with 5 integers
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `identifier`(1), `m`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): int
qv-lexical-analyser reads `operator-brackets`(13), `[`
qv-lexical-analyser reads `lit-integer`(2), `3`
qv-lexical-analyser reads `operator-brackets`(13), `]`
qv-lexical-analyser reads `operator-brackets`(13), `[`
qv-lexical-analyser reads `lit-integer`(2), `4`
qv-lexical-analyser reads `operator-brackets`(13), `]`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // 2D array with 3 rows, each with 4 integers
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `identifier`(1), `n`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): char
qv-lexical-analyser reads `operator-brackets`(13), `[`
qv-lexical-analyser reads `lit-integer`(2), `10`
qv-lexical-analyser reads `operator-brackets`(13), `]`
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-double-quote`, `"Hello, world!"`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // 1D arrays with characters are strings
qv-lexical-analyser reads `identifier`(1), `println`
qv-lexical-analyser reads `operator-brackets`(13), `(`
qv-lexical-analyser reads `identifier`(1), `i`
qv-lexical-analyser reads `operator-brackets`(13), `)`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // Function call; print i and a new line character
qv-lexical-analyser reads `identifier`(1), `i`
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-integer`(2), `20`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // Assign a new value 20 for i
qv-lexical-analyser reads `identifier`(1), `println`
qv-lexical-analyser reads `operator-brackets`(13), `(`
qv-lexical-analyser reads `identifier`(1), `i`
qv-lexical-analyser reads `operator-brackets`(13), `)`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `identifier`(1), `l`
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `operator-brackets`(13), `{`
qv-lexical-analyser reads `lit-integer`(2), `1`
qv-lexical-analyser reads `operator-separator`(15), `,`
qv-lexical-analyser reads `lit-integer`(2), `2`
qv-lexical-analyser reads `operator-separator`(15), `,`
qv-lexical-analyser reads `lit-integer`(2), `3`
qv-lexical-analyser reads `operator-separator`(15), `,`
qv-lexical-analyser reads `lit-integer`(2), `4`
qv-lexical-analyser reads `operator-separator`(15), `,`
qv-lexical-analyser reads `lit-integer`(2), `5`
qv-lexical-analyser reads `operator-brackets`(13), `}`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // Assign a vector with 5 integers 1, 2, 3, 4, 5 in order
qv-lexical-analyser reads `identifier`(1), `println`
qv-lexical-analyser reads `operator-brackets`(13), `(`
qv-lexical-analyser reads `identifier`(1), `l`
qv-lexical-analyser reads `operator-brackets`(13), `)`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `identifier`(1), `k`
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-single-quote`, `'\\'`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // Assign a char with new value '\\' (backslash)
qv-lexical-analyser reads `identifier`(1), `println`
qv-lexical-analyser reads `operator-brackets`(13), `(`
qv-lexical-analyser reads `identifier`(1), `k`
qv-lexical-analyser reads `operator-brackets`(13), `)`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `identifier`(1), `println`
qv-lexical-analyser reads `operator-brackets`(13), `(`
qv-lexical-analyser reads `identifier`(1), `n`
qv-lexical-analyser reads `operator-brackets`(13), `)`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `identifier`(1), `n`
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-double-quote`, `"Another string"`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a multiple-line comment (not shown here)
qv-lexical-analyser reads `identifier`(1), `n`
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-double-quote`, `"Third string"`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `identifier`(1), `println`
qv-lexical-analyser reads `operator-brackets`(13), `(`
qv-lexical-analyser reads `identifier`(1), `n`
qv-lexical-analyser reads `operator-brackets`(13), `)`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `keyword-function-related`(10), `ret`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads a single-line comment: // Return nothing to terminate the function body
qv-lexical-analyser reads `operator-brackets`(13), `}`

The qv program is scanned successfully.
```

## Sample code without errors

![2024-05-06-22:53:36-screenshot](https://hackmd.io/_uploads/rkzZydLzC.png)

```
~/Codes/compiler $ ./lex.o < ./pass.qv
qv-lexical-analyser reads `value-decl`(6)
qv-lexical-analyser reads `identifier`(1), `pass`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): int
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-real`(3), `125.1`
qv-lexical-analyser reads `operator-arithmatics`(14), `/`
qv-lexical-analyser reads `lit-real`(3), `1598613.235`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `identifier`(1), `______q2uh`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): int
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-integer`(2), `1085`
qv-lexical-analyser reads `operator-arithmatics`(14), `/`
qv-lexical-analyser reads `lit-integer`(2), `5`
qv-lexical-analyser reads `operator-arithmatics`(14), `*`
qv-lexical-analyser reads `lit-integer`(2), `1`
qv-lexical-analyser reads `operator-arithmatics`(14), `+`
qv-lexical-analyser reads `operator-arithmatics`(14), `-`
qv-lexical-analyser reads `operator-arithmatics`(14), `+`
qv-lexical-analyser reads `operator-arithmatics`(14), `-`
qv-lexical-analyser reads `operator-arithmatics`(14), `-`
qv-lexical-analyser reads `operator-arithmatics`(14), `+`
qv-lexical-analyser reads `operator-arithmatics`(14), `-`
qv-lexical-analyser reads `operator-arithmatics`(14), `+`
qv-lexical-analyser reads `operator-arithmatics`(14), `-`
qv-lexical-analyser reads `lit-integer`(2), `13`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `value-decl`(6)
qv-lexical-analyser reads `identifier`(1), `qot3298yh`
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `identifier`(1), `pass`
qv-lexical-analyser reads `operator-arithmatics`(14), `/`
qv-lexical-analyser reads `identifier`(1), `______q2uh`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `keyword-function-related`(10), `fun`
qv-lexical-analyser reads `identifier`(1), `main`
qv-lexical-analyser reads `operator-brackets`(13), `(`
qv-lexical-analyser reads `operator-brackets`(13), `)`
qv-lexical-analyser reads `operator-brackets`(13), `{`
qv-lexical-analyser reads `value-decl`(6)
qv-lexical-analyser reads `identifier`(1), `PI`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): real
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-real`(3), `3.14159265358979323846264338`
qv-lexical-analyser reads a single-line comment: //我可以背圓周率到400位,有沒有加分?
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `identifier`(1), `vvwi00ari38`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `typename`(7): char
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-single-quote`, `'4'`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `value-decl`(6)
qv-lexical-analyser reads `identifier`(1), `r1`
qv-lexical-analyser reads `operator-separator`(15), `:`
qv-lexical-analyser reads `operator-brackets`(13), `[`
qv-lexical-analyser reads `operator-brackets`(13), `]`
qv-lexical-analyser reads `typename`(7): char
qv-lexical-analyser reads `operator-assignment`(12), `=`
qv-lexical-analyser reads `lit-double-quote`, `"jtgqr9h3t908hwogih4\\13"`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `keyword-function-related`(10), `ret`
qv-lexical-analyser reads `operator-separator`(15), `;`
qv-lexical-analyser reads `operator-brackets`(13), `}`

The qv program is scanned successfully.
```

## Sample code with errors

![2024-05-06-22:52:08-screenshot](https://hackmd.io/_uploads/r1K3Rv8zR.png)

```
~/Codes/compiler $ ./lex.o < ./error.qv
qv-lexical-analyser reads `lit-integer`(2), `114514`
qv-lexical-analyser reads `variable-decl`(5)
qv-lexical-analyser reads `value-decl`(6)
qv-lexical-analyser reads `keyword-function-related`(10), `ret`
qv-lexical-analyser reads `keyword-function-related`(10), `fun`
QV-Scanner Error(Line 2): Invalid `\`-sequence in single-quote: '\m'
QV-Scanner Error(Line 3): Invalid `\`-sequence in single-quote: '\i'
qv-lexical-analyser reads `lit-single-quote`, `'\\'`
QV-Scanner Error(Line 5): Invalid `\`-sequence in single-quote: '\!'
QV-Scanner Error(Line 6): single-quote was never closed, or contains more than a character: '9
qv-lexical-analyser reads `lit-integer`(2), `87`
QV-Scanner Error(Line 6): single-quote was never closed, or contains more than a character: '
QV-Scanner Error(Line 7): single-quote was never closed, or contains more than a character: '\'
qv-lexical-analyser reads `lit-real`(3), `1.1`
QV-Scanner Error(Line 9): Unidentified symbol(0) `13.5787.2`
QV-Scanner Error(Line 11): Invalid `\`-sequence in double-quote:"aowih/q3toiha;\t \n\\wr\5 qv-lexical-analyser reads `identifier`(1), `ub`
QV-Scanner Error(Line 11): double-quote was never closed, or contains more than one line: "
QV-Scanner Error(Line 14): a multiple-line comment was never closed

The qv program is scanned with 10 errors.
```

## Quick note

- We do not treat a `\` at the end of a line as a line continuation.
- We split any continuous operator sequence that cannot be scanned as a single token. For example, `+-+-+-+-+` is read as 9 operators, even though no spaces separate them.
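The operator-splitting behaviour in the last note follows from longest-match scanning. Below is a minimal sketch in plain C, not taken from the flex source: the `OPERATORS` table and the helper names `longest_operator` and `count_operators` are hypothetical and chosen only for illustration, and the table covers just the comparison, assignment and arithmetic operators. It repeatedly consumes the longest operator that prefixes the remaining input, which is roughly how the generated scanner decides where one token ends.

```c
#include <string.h>

/* Hypothetical operator table for this sketch; it mirrors the comparison,
   assignment and arithmetic rules of the scanner but is not part of it. */
static const char *const OPERATORS[] = {
    "==", "!=", "<=", ">=", "<", ">", "=", "+", "-", "*", "/",
};
static const size_t N_OPERATORS = sizeof OPERATORS / sizeof OPERATORS[0];

/* Return the length of the longest operator that is a prefix of s,
   or 0 if no operator in the table starts at s. */
size_t longest_operator(const char *s)
{
    size_t best = 0;
    for (size_t i = 0; i < N_OPERATORS; ++i) {
        size_t len = strlen(OPERATORS[i]);
        if (len > best && strncmp(s, OPERATORS[i], len) == 0)
            best = len;
    }
    return best;
}

/* Split s into operators by repeated longest match. Return the number of
   operators, or -1 if some character does not start a known operator. */
int count_operators(const char *s)
{
    int count = 0;
    while (*s != '\0') {
        size_t len = longest_operator(s);
        if (len == 0)
            return -1;
        s += len;
        ++count;
    }
    return count;
}
```

With this sketch, `count_operators("+-+-+-+-+")` gives 9 because no two-character operator begins with `+` or `-`, while `count_operators("<==")` gives 2 because `<=` is consumed as a whole before the remaining `=`.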