387 lines
12 KiB
Plaintext
387 lines
12 KiB
Plaintext
Lexical Analyzer
|
|
|
|
Definition from [https://en.wikipedia.org/wiki/Lexical_analysis Wikipedia]:
|
|
|
|
: ''Lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning"). A program that performs lexical analysis may be called a lexer, tokenizer, or scanner (though "scanner" is also used to refer to the first stage of a lexer).''
|
|
|
|
Create a lexical analyzer for the simple programming language specified below. The
|
|
program should read input from a file and/or stdin, and write output to a file and/or
|
|
stdout. If the language being used has a lexer module/library/class, it would be great
|
|
if two versions of the solution are provided: One without the lexer module, and one with.
|
|
|
|
{{task heading|Input Specification}}
|
|
|
|
The simple programming language to be analyzed is more or less a subset of [[C]]. It supports the following tokens:
|
|
|
|
;Operators
|
|
|
|
:::{| class="wikitable"
|
|
|-
|
|
! Name !! Common name !! Character sequence
|
|
|-
|
|
| <tt>Op_multiply</tt> || multiply || <tt>*</tt>
|
|
|-
|
|
| <tt>Op_divide</tt> || divide || <tt>/</tt>
|
|
|-
|
|
| <tt>Op_mod</tt> || mod || <tt>%</tt>
|
|
|-
|
|
| <tt>Op_add</tt> || plus || <tt>+</tt>
|
|
|-
|
|
| <tt>Op_subtract</tt> || minus || <tt>-</tt>
|
|
|-
|
|
| <tt>Op_negate</tt> || unary minus || <tt>-</tt>
|
|
|-
|
|
| <tt>Op_less</tt> || less than || <tt><</tt>
|
|
|-
|
|
| <tt>Op_lessequal</tt> || less than or equal || <tt><=</tt>
|
|
|-
|
|
| <tt>Op_greater</tt> || greater than || <tt>></tt>
|
|
|-
|
|
| <tt>Op_greaterequal</tt> || greater than or equal || <tt>>=</tt>
|
|
|-
|
|
| <tt>Op_equal</tt> || equal || <tt>==</tt>
|
|
|-
|
|
| <tt>Op_notequal</tt> || not equal || <tt>!=</tt>
|
|
|-
|
|
| <tt>Op_not</tt> || unary not || <tt>!</tt>
|
|
|-
|
|
| <tt>Op_assign</tt> || assignment || <tt>=</tt>
|
|
|-
|
|
| <tt>Op_and</tt> || logical and || <tt>&&</tt>
|
|
|-
|
|
| <tt>Op_or</tt> || logical or || <tt>¦¦</tt>
|
|
|}
|
|
|
|
* The <code>-</code> token should always be interpreted as <tt>Op_subtract</tt> by the lexer. Turning some <tt>Op_subtract</tt> into <tt>Op_negate</tt> will be the job of the syntax analyzer, which is not part of this task.
|
|
|
|
;Symbols
|
|
|
|
:::{| class="wikitable"
|
|
|-
|
|
! Name !! Common name !! Character
|
|
|-
|
|
| <tt>LeftParen</tt> || left parenthesis || <tt>(</tt>
|
|
|-
|
|
| <tt>RightParen</tt> || right parenthesis || <tt>)</tt>
|
|
|-
|
|
| <tt>LeftBrace</tt> || left brace || <tt>{</tt>
|
|
|-
|
|
| <tt>RightBrace</tt> || right brace || <tt>}</tt>
|
|
|-
|
|
| <tt>Semicolon</tt> || semi-colon || <tt>;</tt>
|
|
|-
|
|
| <tt>Comma</tt> || comma || <tt>,</tt>
|
|
|}
|
|
|
|
;Keywords
|
|
|
|
:::{| class="wikitable"
|
|
|-
|
|
! Name || Character sequence
|
|
|-
|
|
| <tt>Keyword_if</tt> || <tt>if</tt>
|
|
|-
|
|
| <tt>Keyword_else</tt> || <tt>else</tt>
|
|
|-
|
|
| <tt>Keyword_while</tt> || <tt>while</tt>
|
|
|-
|
|
| <tt>Keyword_print</tt> || <tt>print</tt>
|
|
|-
|
|
| <tt>Keyword_putc</tt> || <tt>putc</tt>
|
|
|}
|
|
|
|
;Identifiers and literals
|
|
|
|
These differ from the the previous tokens, in that each occurrence of them has a value associated with it.
|
|
|
|
:::{| class="wikitable"
|
|
|-
|
|
! Name
|
|
! Common name
|
|
! Format description
|
|
! Format regex
|
|
! Value
|
|
|-
|
|
| <tt>Identifier</tt>
|
|
| identifier
|
|
| one or more letter/number/underscore characters, but not starting with a number
|
|
| <code style="white-space:nowrap">[_a-zA-Z][_a-zA-Z0-9]*</code>
|
|
| as is
|
|
|-
|
|
| <tt>Integer</tt>
|
|
| integer literal
|
|
| one or more digits
|
|
| <code>[0-9]+</code>
|
|
| as is, interpreted as a number
|
|
|-
|
|
| <tt>Integer</tt>
|
|
| char literal
|
|
| exactly one character (anything except newline or single quote) or one of the allowed escape sequences, enclosed by single quotes
|
|
| <code><nowiki>'([^'\n]|\\n|\\\\)'</nowiki></code>
|
|
| the ASCII code point number of the character, e.g. 65 for <code>'A'</code> and 10 for <code>'\n'</code>
|
|
|-
|
|
| <tt>String</tt>
|
|
| string literal
|
|
| zero or more characters (anything except newline or double quote), enclosed by double quotes
|
|
| <code>"[^"\n]*"</code>
|
|
| the characters without the double quotes and with escape sequences converted
|
|
|}
|
|
|
|
* For char and string literals, the <code>\n</code> escape sequence is supported to represent a new-line character.
|
|
* For char and string literals, to represent a backslash, use <code>\\</code>.
|
|
* No other special sequences are supported. This means that:
|
|
** Char literals cannot represent a single quote character (value 39).
|
|
** String literals cannot represent strings containing double quote characters.
|
|
|
|
;Zero-width tokens
|
|
|
|
:::{| class="wikitable"
|
|
|-
|
|
! Name || Location
|
|
|-
|
|
| <tt>End_of_input</tt> || when the end of the input stream is reached
|
|
|}
|
|
|
|
;White space
|
|
|
|
* Zero or more whitespace characters, or comments enclosed in <code>/* ... */</code>, are allowed between any two tokens, with the exceptions noted below.
|
|
* "Longest token matching" is used to resolve conflicts (e.g., in order to match '''<=''' as a single token rather than the two tokens '''<''' and '''=''').
|
|
* Whitespace is ''required'' between two tokens that have an alphanumeric character or underscore at the edge.
|
|
** This means: keywords, identifiers, and integer literals.
|
|
** e.g. <code>ifprint</code> is recognized as an identifier, instead of the keywords <tt>if</tt> and <tt>print</tt>.
|
|
** e.g. <code>42fred</code> is invalid, and neither recognized as a number nor an identifier.
|
|
* Whitespace is ''not allowed'' inside of tokens (except for chars and strings where they are part of the value).
|
|
** e.g. <code>& &</code> is invalid, and not interpreted as the <tt>&&</tt> operator.
|
|
|
|
For example, the following two program fragments are equivalent, and should produce the same token stream except for the line and column positions:
|
|
|
|
* <syntaxhighlight lang="c">if ( p /* meaning n is prime */ ) {
|
|
print ( n , " " ) ;
|
|
count = count + 1 ; /* number of primes found so far */
|
|
}</syntaxhighlight>
|
|
* <syntaxhighlight lang="c">if(p){print(n," ");count=count+1;}</syntaxhighlight>
|
|
|
|
;Complete list of token names
|
|
|
|
<pre>
|
|
End_of_input Op_multiply Op_divide Op_mod Op_add Op_subtract
|
|
Op_negate Op_not Op_less Op_lessequal Op_greater Op_greaterequal
|
|
Op_equal Op_notequal Op_assign Op_and Op_or Keyword_if
|
|
Keyword_else Keyword_while Keyword_print Keyword_putc LeftParen RightParen
|
|
LeftBrace RightBrace Semicolon Comma Identifier Integer
|
|
String
|
|
</pre>
|
|
|
|
{{task heading|Output Format}}
|
|
|
|
The program output should be a sequence of lines, each consisting of the following whitespace-separated fields:
|
|
|
|
# the line number where the token starts
|
|
# the column number where the token starts
|
|
# the token name
|
|
# the token value (only for <tt>Identifier</tt>, <tt>Integer</tt>, and <tt>String</tt> tokens)
|
|
# the number of spaces between fields is up to you. Neatly aligned is nice, but not a requirement.
|
|
<br>
|
|
|
|
This task is intended to be used as part of a pipeline, with the other compiler tasks - for example:
|
|
<br><b>lex < hello.t | parse | gen | vm</b>
|
|
|
|
Or possibly:
|
|
<br><b>lex hello.t lex.out</b>
|
|
<br><b>parse lex.out parse.out</b>
|
|
<br><b>gen parse.out gen.out</b>
|
|
<br><b>vm gen.out</b>
|
|
|
|
<br>
|
|
This implies that the output of this task (the lexical analyzer) should be suitable as input to any of the [[Compiler/syntax_analyzer|Syntax Analyzer task]] programs.
|
|
|
|
{{task heading|Diagnostics}}
|
|
|
|
The following error conditions should be caught:
|
|
|
|
:::{| class="wikitable"
|
|
|-
|
|
! Error
|
|
! Example
|
|
|-
|
|
| Empty character constant
|
|
| <code>''</code>
|
|
|-
|
|
| Unknown escape sequence.
|
|
| <code>\r</code>
|
|
|-
|
|
| Multi-character constant.
|
|
| <code>'xx'</code>
|
|
|-
|
|
| End-of-file in comment. Closing comment characters not found.
|
|
|-
|
|
| End-of-file while scanning string literal. Closing string character not found.
|
|
|-
|
|
| End-of-line while scanning string literal. Closing string character not found before end-of-line.
|
|
|-
|
|
| Unrecognized character.
|
|
| <code>|</code>
|
|
|-
|
|
| Invalid number. Starts like a number, but ends in non-numeric characters.
|
|
| <code>123abc</code>
|
|
|}
|
|
|
|
{{task heading|Test Cases}}
|
|
:{| class="wikitable"
|
|
|-
|
|
! Input
|
|
! Output
|
|
|-
|
|
| style="vertical-align:top" |
|
|
Test Case 1:
|
|
<syntaxhighlight lang="c">/*
|
|
Hello world
|
|
*/
|
|
print("Hello, World!\n");</syntaxhighlight>
|
|
|
|
| style="vertical-align:top" |
|
|
<b><pre>
|
|
4 1 Keyword_print
|
|
4 6 LeftParen
|
|
4 7 String "Hello, World!\n"
|
|
4 24 RightParen
|
|
4 25 Semicolon
|
|
5 1 End_of_input
|
|
</pre></b>
|
|
|
|
|-
|
|
| style="vertical-align:top" |
|
|
Test Case 2:
|
|
<syntaxhighlight lang="c">/*
|
|
Show Ident and Integers
|
|
*/
|
|
phoenix_number = 142857;
|
|
print(phoenix_number, "\n");</syntaxhighlight>
|
|
|
|
| style="vertical-align:top" |
|
|
<b><pre>
|
|
4 1 Identifier phoenix_number
|
|
4 16 Op_assign
|
|
4 18 Integer 142857
|
|
4 24 Semicolon
|
|
5 1 Keyword_print
|
|
5 6 LeftParen
|
|
5 7 Identifier phoenix_number
|
|
5 21 Comma
|
|
5 23 String "\n"
|
|
5 27 RightParen
|
|
5 28 Semicolon
|
|
6 1 End_of_input
|
|
</pre></b>
|
|
|
|
|-
|
|
| style="vertical-align:top" |
|
|
Test Case 3:
|
|
<syntaxhighlight lang="c">/*
|
|
All lexical tokens - not syntactically correct, but that will
|
|
have to wait until syntax analysis
|
|
*/
|
|
/* Print */ print /* Sub */ -
|
|
/* Putc */ putc /* Lss */ <
|
|
/* If */ if /* Gtr */ >
|
|
/* Else */ else /* Leq */ <=
|
|
/* While */ while /* Geq */ >=
|
|
/* Lbrace */ { /* Eq */ ==
|
|
/* Rbrace */ } /* Neq */ !=
|
|
/* Lparen */ ( /* And */ &&
|
|
/* Rparen */ ) /* Or */ ||
|
|
/* Uminus */ - /* Semi */ ;
|
|
/* Not */ ! /* Comma */ ,
|
|
/* Mul */ * /* Assign */ =
|
|
/* Div */ / /* Integer */ 42
|
|
/* Mod */ % /* String */ "String literal"
|
|
/* Add */ + /* Ident */ variable_name
|
|
/* character literal */ '\n'
|
|
/* character literal */ '\\'
|
|
/* character literal */ ' '</syntaxhighlight>
|
|
|
|
| style="vertical-align:top" |
|
|
<b><pre>
|
|
5 16 Keyword_print
|
|
5 40 Op_subtract
|
|
6 16 Keyword_putc
|
|
6 40 Op_less
|
|
7 16 Keyword_if
|
|
7 40 Op_greater
|
|
8 16 Keyword_else
|
|
8 40 Op_lessequal
|
|
9 16 Keyword_while
|
|
9 40 Op_greaterequal
|
|
10 16 LeftBrace
|
|
10 40 Op_equal
|
|
11 16 RightBrace
|
|
11 40 Op_notequal
|
|
12 16 LeftParen
|
|
12 40 Op_and
|
|
13 16 RightParen
|
|
13 40 Op_or
|
|
14 16 Op_subtract
|
|
14 40 Semicolon
|
|
15 16 Op_not
|
|
15 40 Comma
|
|
16 16 Op_multiply
|
|
16 40 Op_assign
|
|
17 16 Op_divide
|
|
17 40 Integer 42
|
|
18 16 Op_mod
|
|
18 40 String "String literal"
|
|
19 16 Op_add
|
|
19 40 Identifier variable_name
|
|
20 26 Integer 10
|
|
21 26 Integer 92
|
|
22 26 Integer 32
|
|
23 1 End_of_input
|
|
</pre></b>
|
|
|
|
|-
|
|
| style="vertical-align:top" |
|
|
Test Case 4:
|
|
<syntaxhighlight lang="c">/*** test printing, embedded \n and comments with lots of '*' ***/
|
|
print(42);
|
|
print("\nHello World\nGood Bye\nok\n");
|
|
print("Print a slash n - \\n.\n");</syntaxhighlight>
|
|
|
|
| style="vertical-align:top" |
|
|
<b><pre>
|
|
2 1 Keyword_print
|
|
2 6 LeftParen
|
|
2 7 Integer 42
|
|
2 9 RightParen
|
|
2 10 Semicolon
|
|
3 1 Keyword_print
|
|
3 6 LeftParen
|
|
3 7 String "\nHello World\nGood Bye\nok\n"
|
|
3 38 RightParen
|
|
3 39 Semicolon
|
|
4 1 Keyword_print
|
|
4 6 LeftParen
|
|
4 7 String "Print a slash n - \\n.\n"
|
|
4 33 RightParen
|
|
4 34 Semicolon
|
|
5 1 End_of_input
|
|
</pre></b>
|
|
|
|
|}
|
|
|
|
;Additional examples
|
|
Your solution should pass all the test cases above and the additional tests found '''[[Compiler/Sample_programs|Here]]'''.
|
|
|
|
|
|
{{task heading|Reference}}
|
|
The C and Python versions can be considered reference implementations.
|
|
|
|
|
|
;Related Tasks
|
|
* [[Compiler/syntax_analyzer|Syntax Analyzer task]]
|
|
* [[Compiler/code_generator|Code Generator task]]
|
|
* [[Compiler/virtual_machine_interpreter|Virtual Machine Interpreter task]]
|
|
* [[Compiler/AST_interpreter|AST Interpreter task]]
|
|
<hr>
|
|
<br><br>
|
|
|