RosettaCodeData/Task/Compiler-syntax-analyzer/00-TASK.txt

653 lines
15 KiB
Plaintext

{{task heading|Syntax Analyzer}}
A Syntax analyzer transforms a token stream (from the [[Compiler/lexical_analyzer|Lexical analyzer]])
into a Syntax tree, based on a grammar.
Take the output from the Lexical analyzer [[Compiler/lexical_analyzer|task]],
and convert it to an [https://en.wikipedia.org/wiki/Abstract_syntax_tree Abstract Syntax Tree (AST)],
based on the grammar below. The output should be in a [[Flatten_a_list|flattened format.]]
The program should read input from a file and/or stdin, and write output to a file and/or
stdout. If the language being used has a parser module/library/class, it would be great
if two versions of the solution are provided: One without the parser module, and one
with.
{{task heading|Grammar}}
The simple programming language to be analyzed is more or less a (very tiny) subset of
[[C]]. The formal grammar in
[https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form Extended Backus-Naur Form (EBNF)]:
<syntaxhighlight lang="ebnf">
stmt_list = {stmt} ;
stmt = ';'
| Identifier '=' expr ';'
| 'while' paren_expr stmt
| 'if' paren_expr stmt ['else' stmt]
| 'print' '(' prt_list ')' ';'
| 'putc' paren_expr ';'
| '{' stmt_list '}'
;
paren_expr = '(' expr ')' ;
prt_list = (string | expr) {',' (String | expr)} ;
expr = and_expr {'||' and_expr} ;
and_expr = equality_expr {'&&' equality_expr} ;
equality_expr = relational_expr [('==' | '!=') relational_expr] ;
relational_expr = addition_expr [('<' | '<=' | '>' | '>=') addition_expr] ;
addition_expr = multiplication_expr {('+' | '-') multiplication_expr} ;
multiplication_expr = primary {('*' | '/' | '%') primary } ;
primary = Identifier
| Integer
| '(' expr ')'
| ('+' | '-' | '!') primary
;</syntaxhighlight>
The resulting AST should be formulated as a Binary Tree.
;Example - given the simple program (below), stored in a file called while.t, create the list of tokens, using one of the Lexical analyzer [[Compiler/lexical_analyzer|solutions]]
lex < while.t > while.lex
;Run one of the Syntax analyzer [[Compiler/syntax_analyzer|solutions]]:
parse < while.lex > while.ast
;The following table shows the input to lex, lex output, and the AST produced by the parser:
{| class="wikitable"
|-
! Input to lex
! Output from lex, input to parse
! Output from parse
|-
| style="vertical-align:top" |
<syntaxhighlight lang="c">count = 1;
while (count < 10) {
print("count is: ", count, "\n");
count = count + 1;
}</syntaxhighlight>
| style="vertical-align:top" |
<b><pre>
1 1 Identifier count
1 7 Op_assign
1 9 Integer 1
1 10 Semicolon
2 1 Keyword_while
2 7 LeftParen
2 8 Identifier count
2 14 Op_less
2 16 Integer 10
2 18 RightParen
2 20 LeftBrace
3 5 Keyword_print
3 10 LeftParen
3 11 String "count is: "
3 23 Comma
3 25 Identifier count
3 30 Comma
3 32 String "\n"
3 36 RightParen
3 37 Semicolon
4 5 Identifier count
4 11 Op_assign
4 13 Identifier count
4 19 Op_add
4 21 Integer 1
4 22 Semicolon
5 1 RightBrace
6 1 End_of_input
</pre></b>
| style="vertical-align:top" |
<b><pre>
Sequence
Sequence
;
Assign
Identifier count
Integer 1
While
Less
Identifier count
Integer 10
Sequence
Sequence
;
Sequence
Sequence
Sequence
;
Prts
String "count is: "
;
Prti
Identifier count
;
Prts
String "\n"
;
Assign
Identifier count
Add
Identifier count
Integer 1
</pre></b>
|}
;Specifications
;List of node type names:
<pre>
Identifier String Integer Sequence If Prtc Prts Prti While Assign Negate Not Multiply Divide Mod
Add Subtract Less LessEqual Greater GreaterEqual Equal NotEqual And Or
</pre>
In the text below, Null/Empty nodes are represented by ";".
;Non-terminal (internal) nodes:
For Operators, the following nodes should be created:
Multiply Divide Mod Add Subtract Less LessEqual Greater GreaterEqual Equal NotEqual And Or
For each of the above nodes, the left and right sub-nodes are the operands of the
respective operation.
In pseudo S-Expression format:
(Operator expression expression)
Negate, Not
For these node types, the left node is the operand, and the right node is null.
(Operator expression ;)
Sequence - sub-nodes are either statements or Sequences.
If - left node is the expression, the right node is If node, with it's left node being the
if-true statement part, and the right node being the if-false (else) statement part.
(If expression (If statement else-statement))
If there is not an else, the tree becomes:
(If expression (If statement ;))
Prtc
(Prtc (expression) ;)
Prts
(Prts (String "the string") ;)
Prti
(Prti (Integer 12345) ;)
While - left node is the expression, the right node is the statement.
(While expression statement)
Assign - left node is the left-hand side of the assignment, the right node is the
right-hand side of the assignment.
(Assign Identifier expression)
Terminal (leaf) nodes:
Identifier: (Identifier ident_name)
Integer: (Integer 12345)
String: (String "Hello World!")
";": Empty node
;Some simple examples
Sequences denote a list node; they are used to represent a list. semicolon's represent a null node, e.g., the end of this path.
This simple program:
a=11;
Produces the following AST, encoded as a binary tree:
Under each non-leaf node are two '|' lines. The first represents the left sub-node, the second represents the right sub-node:
(1) Sequence
(2) |-- ;
(3) |-- Assign
(4) |-- Identifier: a
(5) |-- Integer: 11
In flattened form:
(1) Sequence
(2) ;
(3) Assign
(4) Identifier a
(5) Integer 11
This program:
a=11;
b=22;
c=33;
Produces the following AST:
( 1) Sequence
( 2) |-- Sequence
( 3) | |-- Sequence
( 4) | | |-- ;
( 5) | | |-- Assign
( 6) | | |-- Identifier: a
( 7) | | |-- Integer: 11
( 8) | |-- Assign
( 9) | |-- Identifier: b
(10) | |-- Integer: 22
(11) |-- Assign
(12) |-- Identifier: c
(13) |-- Integer: 33
In flattened form:
( 1) Sequence
( 2) Sequence
( 3) Sequence
( 4) ;
( 5) Assign
( 6) Identifier a
( 7) Integer 11
( 8) Assign
( 9) Identifier b
(10) Integer 22
(11) Assign
(12) Identifier c
(13) Integer 33
;Pseudo-code for the parser.
Uses [https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm Precedence Climbing] for expression parsing, and
[https://en.wikipedia.org/wiki/Recursive_descent_parser Recursive Descent] for statement parsing. The AST is also built:
<syntaxhighlight lang="python">def expr(p)
if tok is "("
x = paren_expr()
elif tok in ["-", "+", "!"]
gettok()
y = expr(precedence of operator)
if operator was "+"
x = y
else
x = make_node(operator, y)
elif tok is an Identifier
x = make_leaf(Identifier, variable name)
gettok()
elif tok is an Integer constant
x = make_leaf(Integer, integer value)
gettok()
else
error()
while tok is a binary operator and precedence of tok >= p
save_tok = tok
gettok()
q = precedence of save_tok
if save_tok is not right associative
q += 1
x = make_node(Operator save_tok represents, x, expr(q))
return x
def paren_expr()
expect("(")
x = expr(0)
expect(")")
return x
def stmt()
t = NULL
if accept("if")
e = paren_expr()
s = stmt()
t = make_node(If, e, make_node(If, s, accept("else") ? stmt() : NULL))
elif accept("putc")
t = make_node(Prtc, paren_expr())
expect(";")
elif accept("print")
expect("(")
repeat
if tok is a string
e = make_node(Prts, make_leaf(String, the string))
gettok()
else
e = make_node(Prti, expr(0))
t = make_node(Sequence, t, e)
until not accept(",")
expect(")")
expect(";")
elif tok is ";"
gettok()
elif tok is an Identifier
v = make_leaf(Identifier, variable name)
gettok()
expect("=")
t = make_node(Assign, v, expr(0))
expect(";")
elif accept("while")
e = paren_expr()
t = make_node(While, e, stmt()
elif accept("{")
while tok not equal "}" and tok not equal end-of-file
t = make_node(Sequence, t, stmt())
expect("}")
elif tok is end-of-file
pass
else
error()
return t
def parse()
t = NULL
gettok()
repeat
t = make_node(Sequence, t, stmt())
until tok is end-of-file
return t</syntaxhighlight>
;Once the AST is built, it should be output in a [[Flatten_a_list|flattened format.]] This can be as simple as the following:
<syntaxhighlight lang="python">def prt_ast(t)
if t == NULL
print(";\n")
else
print(t.node_type)
if t.node_type in [Identifier, Integer, String] # leaf node
print the value of the Ident, Integer or String, "\n"
else
print("\n")
prt_ast(t.left)
prt_ast(t.right)</syntaxhighlight>
;If the AST is correctly built, loading it into a subsequent program should be as simple as:
<syntaxhighlight lang="python">def load_ast()
line = readline()
# Each line has at least one token
line_list = tokenize the line, respecting double quotes
text = line_list[0] # first token is always the node type
if text == ";" # a terminal node
return NULL
node_type = text # could convert to internal form if desired
# A line with two tokens is a leaf node
# Leaf nodes are: Identifier, Integer, String
# The 2nd token is the value
if len(line_list) > 1
return make_leaf(node_type, line_list[1])
left = load_ast()
right = load_ast()
return make_node(node_type, left, right)</syntaxhighlight>
Finally, the AST can also be tested by running it against one of the AST Interpreter [[Compiler/AST_interpreter|solutions]].
;Test program, assuming this is in a file called prime.t: lex <prime.t | parse
{| class="wikitable"
|-
! Input to lex
! Output from lex, input to parse
! Output from parse
|-
| style="vertical-align:top" |
<syntaxhighlight lang="c">/*
Simple prime number generator
*/
count = 1;
n = 1;
limit = 100;
while (n < limit) {
k=3;
p=1;
n=n+2;
while ((k*k<=n) && (p)) {
p=n/k*k!=n;
k=k+2;
}
if (p) {
print(n, " is prime\n");
count = count + 1;
}
}
print("Total primes found: ", count, "\n");</syntaxhighlight>
| style="vertical-align:top" |
<b><pre>
4 1 Identifier count
4 7 Op_assign
4 9 Integer 1
4 10 Semicolon
5 1 Identifier n
5 3 Op_assign
5 5 Integer 1
5 6 Semicolon
6 1 Identifier limit
6 7 Op_assign
6 9 Integer 100
6 12 Semicolon
7 1 Keyword_while
7 7 LeftParen
7 8 Identifier n
7 10 Op_less
7 12 Identifier limit
7 17 RightParen
7 19 LeftBrace
8 5 Identifier k
8 6 Op_assign
8 7 Integer 3
8 8 Semicolon
9 5 Identifier p
9 6 Op_assign
9 7 Integer 1
9 8 Semicolon
10 5 Identifier n
10 6 Op_assign
10 7 Identifier n
10 8 Op_add
10 9 Integer 2
10 10 Semicolon
11 5 Keyword_while
11 11 LeftParen
11 12 LeftParen
11 13 Identifier k
11 14 Op_multiply
11 15 Identifier k
11 16 Op_lessequal
11 18 Identifier n
11 19 RightParen
11 21 Op_and
11 24 LeftParen
11 25 Identifier p
11 26 RightParen
11 27 RightParen
11 29 LeftBrace
12 9 Identifier p
12 10 Op_assign
12 11 Identifier n
12 12 Op_divide
12 13 Identifier k
12 14 Op_multiply
12 15 Identifier k
12 16 Op_notequal
12 18 Identifier n
12 19 Semicolon
13 9 Identifier k
13 10 Op_assign
13 11 Identifier k
13 12 Op_add
13 13 Integer 2
13 14 Semicolon
14 5 RightBrace
15 5 Keyword_if
15 8 LeftParen
15 9 Identifier p
15 10 RightParen
15 12 LeftBrace
16 9 Keyword_print
16 14 LeftParen
16 15 Identifier n
16 16 Comma
16 18 String " is prime\n"
16 31 RightParen
16 32 Semicolon
17 9 Identifier count
17 15 Op_assign
17 17 Identifier count
17 23 Op_add
17 25 Integer 1
17 26 Semicolon
18 5 RightBrace
19 1 RightBrace
20 1 Keyword_print
20 6 LeftParen
20 7 String "Total primes found: "
20 29 Comma
20 31 Identifier count
20 36 Comma
20 38 String "\n"
20 42 RightParen
20 43 Semicolon
21 1 End_of_input
</pre></b>
| style="vertical-align:top" |
<b><pre>
Sequence
Sequence
Sequence
Sequence
Sequence
;
Assign
Identifier count
Integer 1
Assign
Identifier n
Integer 1
Assign
Identifier limit
Integer 100
While
Less
Identifier n
Identifier limit
Sequence
Sequence
Sequence
Sequence
Sequence
;
Assign
Identifier k
Integer 3
Assign
Identifier p
Integer 1
Assign
Identifier n
Add
Identifier n
Integer 2
While
And
LessEqual
Multiply
Identifier k
Identifier k
Identifier n
Identifier p
Sequence
Sequence
;
Assign
Identifier p
NotEqual
Multiply
Divide
Identifier n
Identifier k
Identifier k
Identifier n
Assign
Identifier k
Add
Identifier k
Integer 2
If
Identifier p
If
Sequence
Sequence
;
Sequence
Sequence
;
Prti
Identifier n
;
Prts
String " is prime\n"
;
Assign
Identifier count
Add
Identifier count
Integer 1
;
Sequence
Sequence
Sequence
;
Prts
String "Total primes found: "
;
Prti
Identifier count
;
Prts
String "\n"
;
</pre></b>
|}
; Additional examples
Your solution should pass all the test cases above and the additional tests found '''[[Compiler/Sample_programs|Here]]'''.
{{task heading|Reference}}
The C and Python versions can be considered reference implementations.
;Related Tasks
* [[Compiler/lexical_analyzer|Lexical Analyzer task]]
* [[Compiler/code_generator|Code Generator task]]
* [[Compiler/virtual_machine_interpreter|Virtual Machine Interpreter task]]
* [[Compiler/AST_interpreter|AST Interpreter task]]
<hr>
__TOC__