RosettaCodeData/Task/Compiler-lexical-analyzer/00-TASK.txt

387 lines
12 KiB
Plaintext

Lexical Analyzer
Definition from [https://en.wikipedia.org/wiki/Lexical_analysis Wikipedia]:
: ''Lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning"). A program that performs lexical analysis may be called a lexer, tokenizer, or scanner (though "scanner" is also used to refer to the first stage of a lexer).''
Create a lexical analyzer for the simple programming language specified below. The
program should read input from a file and/or stdin, and write output to a file and/or
stdout. If the language being used has a lexer module/library/class, it would be great
if two versions of the solution are provided: One without the lexer module, and one with.
{{task heading|Input Specification}}
The simple programming language to be analyzed is more or less a subset of [[C]]. It supports the following tokens:
;Operators
:::{| class="wikitable"
|-
! Name !! Common name !! Character sequence
|-
| <tt>Op_multiply</tt> || multiply || <tt>*</tt>
|-
| <tt>Op_divide</tt> || divide || <tt>/</tt>
|-
| <tt>Op_mod</tt> || mod || <tt>%</tt>
|-
| <tt>Op_add</tt> || plus || <tt>+</tt>
|-
| <tt>Op_subtract</tt> || minus || <tt>-</tt>
|-
| <tt>Op_negate</tt> || unary minus || <tt>-</tt>
|-
| <tt>Op_less</tt> || less than || <tt><</tt>
|-
| <tt>Op_lessequal</tt> || less than or equal || <tt><=</tt>
|-
| <tt>Op_greater</tt> || greater than || <tt>&gt;</tt>
|-
| <tt>Op_greaterequal</tt> || greater than or equal || <tt>&gt;=</tt>
|-
| <tt>Op_equal</tt> || equal || <tt>==</tt>
|-
| <tt>Op_notequal</tt> || not equal || <tt>&#33;=</tt>
|-
| <tt>Op_not</tt> || unary not || <tt>&#33;</tt>
|-
| <tt>Op_assign</tt> || assignment || <tt>=</tt>
|-
| <tt>Op_and</tt> || logical and || <tt>&amp;&amp;</tt>
|-
| <tt>Op_or</tt> || logical or || <tt>&brvbar;&brvbar;</tt>
|}
* The <code>-</code> token should always be interpreted as <tt>Op_subtract</tt> by the lexer. Turning some <tt>Op_subtract</tt> into <tt>Op_negate</tt> will be the job of the syntax analyzer, which is not part of this task.
;Symbols
:::{| class="wikitable"
|-
! Name !! Common name !! Character
|-
| <tt>LeftParen</tt> || left parenthesis || <tt>(</tt>
|-
| <tt>RightParen</tt> || right parenthesis || <tt>)</tt>
|-
| <tt>LeftBrace</tt> || left brace || <tt>{</tt>
|-
| <tt>RightBrace</tt> || right brace || <tt>}</tt>
|-
| <tt>Semicolon</tt> || semi-colon || <tt>&#59;</tt>
|-
| <tt>Comma</tt> || comma || <tt>,</tt>
|}
;Keywords
:::{| class="wikitable"
|-
! Name || Character sequence
|-
| <tt>Keyword_if</tt> || <tt>if</tt>
|-
| <tt>Keyword_else</tt> || <tt>else</tt>
|-
| <tt>Keyword_while</tt> || <tt>while</tt>
|-
| <tt>Keyword_print</tt> || <tt>print</tt>
|-
| <tt>Keyword_putc</tt> || <tt>putc</tt>
|}
;Identifiers and literals
These differ from the the previous tokens, in that each occurrence of them has a value associated with it.
:::{| class="wikitable"
|-
! Name
! Common name
! Format description
! Format regex
! Value
|-
| <tt>Identifier</tt>
| identifier
| one or more letter/number/underscore characters, but not starting with a number
| <code style="white-space:nowrap">[_a-zA-Z][_a-zA-Z0-9]*</code>
| as is
|-
| <tt>Integer</tt>
| integer literal
| one or more digits
| <code>[0-9]+</code>
| as is, interpreted as a number
|-
| <tt>Integer</tt>
| char literal
| exactly one character (anything except newline or single quote) or one of the allowed escape sequences, enclosed by single quotes
| <code><nowiki>'([^'\n]|\\n|\\\\)'</nowiki></code>
| the ASCII code point number of the character, e.g. 65 for <code>'A'</code> and 10 for <code>'\n'</code>
|-
| <tt>String</tt>
| string literal
| zero or more characters (anything except newline or double quote), enclosed by double quotes
| <code>"[^"\n]*"</code>
| the characters without the double quotes and with escape sequences converted
|}
* For char and string literals, the <code>\n</code> escape sequence is supported to represent a new-line character.
* For char and string literals, to represent a backslash, use <code>\\</code>.
* No other special sequences are supported. This means that:
** Char literals cannot represent a single quote character (value 39).
** String literals cannot represent strings containing double quote characters.
;Zero-width tokens
:::{| class="wikitable"
|-
! Name || Location
|-
| <tt>End_of_input</tt> || when the end of the input stream is reached
|}
;White space
* Zero or more whitespace characters, or comments enclosed in <code>/* ... */</code>, are allowed between any two tokens, with the exceptions noted below.
* "Longest token matching" is used to resolve conflicts (e.g., in order to match '''<=''' as a single token rather than the two tokens '''<''' and '''=''').
* Whitespace is ''required'' between two tokens that have an alphanumeric character or underscore at the edge.
** This means: keywords, identifiers, and integer literals.
** e.g. <code>ifprint</code> is recognized as an identifier, instead of the keywords <tt>if</tt> and <tt>print</tt>.
** e.g. <code>42fred</code> is invalid, and neither recognized as a number nor an identifier.
* Whitespace is ''not allowed'' inside of tokens (except for chars and strings where they are part of the value).
** e.g. <code>& &</code> is invalid, and not interpreted as the <tt>&&</tt> operator.
For example, the following two program fragments are equivalent, and should produce the same token stream except for the line and column positions:
* <syntaxhighlight lang="c">if ( p /* meaning n is prime */ ) {
print ( n , " " ) ;
count = count + 1 ; /* number of primes found so far */
}</syntaxhighlight>
* <syntaxhighlight lang="c">if(p){print(n," ");count=count+1;}</syntaxhighlight>
;Complete list of token names
<pre>
End_of_input Op_multiply Op_divide Op_mod Op_add Op_subtract
Op_negate Op_not Op_less Op_lessequal Op_greater Op_greaterequal
Op_equal Op_notequal Op_assign Op_and Op_or Keyword_if
Keyword_else Keyword_while Keyword_print Keyword_putc LeftParen RightParen
LeftBrace RightBrace Semicolon Comma Identifier Integer
String
</pre>
{{task heading|Output Format}}
The program output should be a sequence of lines, each consisting of the following whitespace-separated fields:
# the line number where the token starts
# the column number where the token starts
# the token name
# the token value (only for <tt>Identifier</tt>, <tt>Integer</tt>, and <tt>String</tt> tokens)
# the number of spaces between fields is up to you. Neatly aligned is nice, but not a requirement.
<br>
This task is intended to be used as part of a pipeline, with the other compiler tasks - for example:
<br><b>lex < hello.t | parse | gen | vm</b>
Or possibly:
<br><b>lex hello.t lex.out</b>
<br><b>parse lex.out parse.out</b>
<br><b>gen parse.out gen.out</b>
<br><b>vm gen.out</b>
<br>
This implies that the output of this task (the lexical analyzer) should be suitable as input to any of the [[Compiler/syntax_analyzer|Syntax Analyzer task]] programs.
{{task heading|Diagnostics}}
The following error conditions should be caught:
:::{| class="wikitable"
|-
! Error
! Example
|-
| Empty character constant
| <code>&apos;&apos;</code>
|-
| Unknown escape sequence.
| <code>\r</code>
|-
| Multi-character constant.
| <code>&apos;xx&apos;</code>
|-
| End-of-file in comment. Closing comment characters not found.
|-
| End-of-file while scanning string literal. Closing string character not found.
|-
| End-of-line while scanning string literal. Closing string character not found before end-of-line.
|-
| Unrecognized character.
| <code>&#124;</code>
|-
| Invalid number. Starts like a number, but ends in non-numeric characters.
| <code>123abc</code>
|}
{{task heading|Test Cases}}
:{| class="wikitable"
|-
! Input
! Output
|-
| style="vertical-align:top" |
Test Case 1:
<syntaxhighlight lang="c">/*
Hello world
*/
print("Hello, World!\n");</syntaxhighlight>
| style="vertical-align:top" |
<b><pre>
4 1 Keyword_print
4 6 LeftParen
4 7 String "Hello, World!\n"
4 24 RightParen
4 25 Semicolon
5 1 End_of_input
</pre></b>
|-
| style="vertical-align:top" |
Test Case 2:
<syntaxhighlight lang="c">/*
Show Ident and Integers
*/
phoenix_number = 142857;
print(phoenix_number, "\n");</syntaxhighlight>
| style="vertical-align:top" |
<b><pre>
4 1 Identifier phoenix_number
4 16 Op_assign
4 18 Integer 142857
4 24 Semicolon
5 1 Keyword_print
5 6 LeftParen
5 7 Identifier phoenix_number
5 21 Comma
5 23 String "\n"
5 27 RightParen
5 28 Semicolon
6 1 End_of_input
</pre></b>
|-
| style="vertical-align:top" |
Test Case 3:
<syntaxhighlight lang="c">/*
All lexical tokens - not syntactically correct, but that will
have to wait until syntax analysis
*/
/* Print */ print /* Sub */ -
/* Putc */ putc /* Lss */ <
/* If */ if /* Gtr */ >
/* Else */ else /* Leq */ <=
/* While */ while /* Geq */ >=
/* Lbrace */ { /* Eq */ ==
/* Rbrace */ } /* Neq */ !=
/* Lparen */ ( /* And */ &&
/* Rparen */ ) /* Or */ ||
/* Uminus */ - /* Semi */ ;
/* Not */ ! /* Comma */ ,
/* Mul */ * /* Assign */ =
/* Div */ / /* Integer */ 42
/* Mod */ % /* String */ "String literal"
/* Add */ + /* Ident */ variable_name
/* character literal */ '\n'
/* character literal */ '\\'
/* character literal */ ' '</syntaxhighlight>
| style="vertical-align:top" |
<b><pre>
5 16 Keyword_print
5 40 Op_subtract
6 16 Keyword_putc
6 40 Op_less
7 16 Keyword_if
7 40 Op_greater
8 16 Keyword_else
8 40 Op_lessequal
9 16 Keyword_while
9 40 Op_greaterequal
10 16 LeftBrace
10 40 Op_equal
11 16 RightBrace
11 40 Op_notequal
12 16 LeftParen
12 40 Op_and
13 16 RightParen
13 40 Op_or
14 16 Op_subtract
14 40 Semicolon
15 16 Op_not
15 40 Comma
16 16 Op_multiply
16 40 Op_assign
17 16 Op_divide
17 40 Integer 42
18 16 Op_mod
18 40 String "String literal"
19 16 Op_add
19 40 Identifier variable_name
20 26 Integer 10
21 26 Integer 92
22 26 Integer 32
23 1 End_of_input
</pre></b>
|-
| style="vertical-align:top" |
Test Case 4:
<syntaxhighlight lang="c">/*** test printing, embedded \n and comments with lots of '*' ***/
print(42);
print("\nHello World\nGood Bye\nok\n");
print("Print a slash n - \\n.\n");</syntaxhighlight>
| style="vertical-align:top" |
<b><pre>
2 1 Keyword_print
2 6 LeftParen
2 7 Integer 42
2 9 RightParen
2 10 Semicolon
3 1 Keyword_print
3 6 LeftParen
3 7 String "\nHello World\nGood Bye\nok\n"
3 38 RightParen
3 39 Semicolon
4 1 Keyword_print
4 6 LeftParen
4 7 String "Print a slash n - \\n.\n"
4 33 RightParen
4 34 Semicolon
5 1 End_of_input
</pre></b>
|}
;Additional examples
Your solution should pass all the test cases above and the additional tests found '''[[Compiler/Sample_programs|Here]]'''.
{{task heading|Reference}}
The C and Python versions can be considered reference implementations.
;Related Tasks
* [[Compiler/syntax_analyzer|Syntax Analyzer task]]
* [[Compiler/code_generator|Code Generator task]]
* [[Compiler/virtual_machine_interpreter|Virtual Machine Interpreter task]]
* [[Compiler/AST_interpreter|AST Interpreter task]]
<hr>
<br><br>