{{task heading|Syntax Analyzer}}

A syntax analyzer transforms a token stream (from the [[Compiler/lexical_analyzer|Lexical analyzer]])
into a syntax tree, based on a grammar.

Take the output from the Lexical analyzer [[Compiler/lexical_analyzer|task]],
and convert it to an [https://en.wikipedia.org/wiki/Abstract_syntax_tree Abstract Syntax Tree (AST)],
based on the grammar below. The output should be in a [[Flatten_a_list|flattened format.]]

The program should read input from a file and/or stdin, and write output to a file and/or
stdout. If the language being used has a parser module/library/class, it would be great
if two versions of the solution were provided: one without the parser module, and one
with.

{{task heading|Grammar}}

The simple programming language to be analyzed is more or less a (very tiny) subset of
[[C]]. The formal grammar, in
[https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form Extended Backus-Naur Form (EBNF)], is:

<syntaxhighlight lang="ebnf">
    stmt_list           =   {stmt} ;

    stmt                =   ';'
                          | Identifier '=' expr ';'
                          | 'while' paren_expr stmt
                          | 'if' paren_expr stmt ['else' stmt]
                          | 'print' '(' prt_list ')' ';'
                          | 'putc' paren_expr ';'
                          | '{' stmt_list '}'
                          ;

    paren_expr          =   '(' expr ')' ;

    prt_list            =   (String | expr) {',' (String | expr)} ;

    expr                =   and_expr            {'||' and_expr} ;
    and_expr            =   equality_expr       {'&&' equality_expr} ;
    equality_expr       =   relational_expr     [('==' | '!=') relational_expr] ;
    relational_expr     =   addition_expr       [('<' | '<=' | '>' | '>=') addition_expr] ;
    addition_expr       =   multiplication_expr {('+' | '-') multiplication_expr} ;
    multiplication_expr =   primary             {('*' | '/' | '%') primary } ;
    primary             =   Identifier
                          | Integer
                          | '(' expr ')'
                          | ('+' | '-' | '!') primary
                          ;</syntaxhighlight>

The resulting AST should be formulated as a Binary Tree.

;Example - given the simple program (below), stored in a file called while.t, create the list of tokens, using one of the Lexical analyzer [[Compiler/lexical_analyzer|solutions]]:

 lex < while.t > while.lex

;Run one of the Syntax analyzer [[Compiler/syntax_analyzer|solutions]]:

 parse < while.lex > while.ast

;The following table shows the input to lex, lex output, and the AST produced by the parser:

{| class="wikitable"
|-
! Input to lex
! Output from lex, input to parse
! Output from parse
|-
| style="vertical-align:top" |
<syntaxhighlight lang="c">count = 1;
while (count < 10) {
    print("count is: ", count, "\n");
    count = count + 1;
}</syntaxhighlight>

| style="vertical-align:top" |
<b><pre>
1 1 Identifier count
1 7 Op_assign
1 9 Integer 1
1 10 Semicolon
2 1 Keyword_while
2 7 LeftParen
2 8 Identifier count
2 14 Op_less
2 16 Integer 10
2 18 RightParen
2 20 LeftBrace
3 5 Keyword_print
3 10 LeftParen
3 11 String "count is: "
3 23 Comma
3 25 Identifier count
3 30 Comma
3 32 String "\n"
3 36 RightParen
3 37 Semicolon
4 5 Identifier count
4 11 Op_assign
4 13 Identifier count
4 19 Op_add
4 21 Integer 1
4 22 Semicolon
5 1 RightBrace
6 1 End_of_input
</pre></b>

| style="vertical-align:top" |
<b><pre>
Sequence
Sequence
;
Assign
Identifier count
Integer 1
While
Less
Identifier count
Integer 10
Sequence
Sequence
;
Sequence
Sequence
Sequence
;
Prts
String "count is: "
;
Prti
Identifier count
;
Prts
String "\n"
;
Assign
Identifier count
Add
Identifier count
Integer 1
</pre></b>
|}
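
The middle column of the table is what the parser reads: one token per line, giving the line number, the column number, the token name, and (for identifiers, integers and strings) a value. A minimal sketch of reading that format follows. The names <code>Token</code> and <code>read_tokens</code> are hypothetical, chosen for illustration, and the sketch assumes fields are separated by whitespace as in the table.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    line: int             # source line number reported by lex
    col: int              # source column number reported by lex
    kind: str             # token name, e.g. "Identifier" or "Op_assign"
    value: Optional[str]  # identifier name, integer text, or quoted string


def read_tokens(text: str) -> List[Token]:
    """Split lex output into Token records, one per non-blank line."""
    tokens = []
    for raw in text.splitlines():
        if not raw.strip():
            continue
        # At most four fields; the fourth keeps embedded spaces in strings.
        parts = raw.split(None, 3)
        value = parts[3].rstrip() if len(parts) > 3 else None
        tokens.append(Token(int(parts[0]), int(parts[1]), parts[2], value))
    return tokens


sample = '''1 1 Identifier count
1 7 Op_assign
3 11 String "count is: "
6 1 End_of_input
'''
for token in read_tokens(sample):
    print(token)
```

Limiting the split to four fields is a deliberate choice: a String token keeps its quotes and any internal spaces, so the parser can copy the value into the AST unchanged.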

;Specifications

;List of node type names:

<pre>
Identifier String Integer Sequence If Prtc Prts Prti While Assign Negate Not Multiply Divide Mod
Add Subtract Less LessEqual Greater GreaterEqual Equal NotEqual And Or
</pre>

In the text below, Null/Empty nodes are represented by ";".

;Non-terminal (internal) nodes:

For Operators, the following nodes should be created:

    Multiply Divide Mod Add Subtract Less LessEqual Greater GreaterEqual Equal NotEqual And Or

For each of the above nodes, the left and right sub-nodes are the operands of the
respective operation.

In pseudo S-Expression format:

    (Operator expression expression)

Negate, Not

For these node types, the left node is the operand, and the right node is null.

    (Operator expression ;)

Sequence - sub-nodes are either statements or Sequences.

If - left node is the expression, the right node is an If node, with its left node being the
if-true statement part, and its right node being the if-false (else) statement part.

    (If expression (If statement else-statement))

If there is no else part, the tree becomes:

    (If expression (If statement ;))

Prtc - left node is the expression, the right node is null.

    (Prtc (expression) ;)

Prts - left node is the String, the right node is null.

    (Prts (String "the string") ;)

Prti - left node is the expression, the right node is null.

    (Prti (Integer 12345) ;)

While - left node is the expression, the right node is the statement.

    (While expression statement)

Assign - left node is the left-hand side of the assignment, the right node is the
right-hand side of the assignment.

    (Assign Identifier expression)

Terminal (leaf) nodes:

    Identifier: (Identifier ident_name)
    Integer:    (Integer 12345)
    String:     (String "Hello World!")
    ";":        Empty node

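
The node shapes described above can be sketched with one small record type. This is only an illustration under stated assumptions: the <code>Node</code> class, its field names, and the <code>sexpr</code> helper are hypothetical, and <code>make_node</code> / <code>make_leaf</code> simply mirror the names used in the task's pseudo-code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_type: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[str] = None   # set only on Identifier/Integer/String leaves


def make_node(node_type, left=None, right=None):
    """Internal node; a missing child is the null node shown as ';'."""
    return Node(node_type, left, right)


def make_leaf(node_type, value):
    """Terminal node carrying a value and no children."""
    return Node(node_type, value=value)


def sexpr(t):
    """Render a tree in the pseudo S-Expression notation of this section."""
    if t is None:
        return ";"
    if t.value is not None:
        return f"({t.node_type} {t.value})"
    return f"({t.node_type} {sexpr(t.left)} {sexpr(t.right)})"


# if (p) a = 1;    (no else part, so the inner If has a null right node)
cond = make_leaf("Identifier", "p")
body = make_node("Assign", make_leaf("Identifier", "a"), make_leaf("Integer", "1"))
if_tree = make_node("If", cond, make_node("If", body, None))

# -x    (unary operator: operand on the left, null on the right)
neg = make_node("Negate", make_leaf("Identifier", "x"))

print(sexpr(if_tree))
print(sexpr(neg))
```

Using a single record for both internal and leaf nodes keeps every node binary, which is what lets the flattened output be reloaded without any extra markers.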
;Some simple examples
Sequences denote a list node; they are used to represent a list. Semicolons represent a null node, i.e., the end of a path.

This simple program:

    a=11;

Produces the following AST, encoded as a binary tree:

Under each non-leaf node are two '|' lines. The first represents the left sub-node, the second represents the right sub-node:

    (1) Sequence
    (2)     |-- ;
    (3)     |-- Assign
    (4)         |-- Identifier: a
    (5)         |-- Integer: 11

In flattened form:

    (1) Sequence
    (2) ;
    (3) Assign
    (4) Identifier a
    (5) Integer 11


This program:

    a=11;
    b=22;
    c=33;

Produces the following AST:

    ( 1) Sequence
    ( 2)     |-- Sequence
    ( 3)     |   |-- Sequence
    ( 4)     |   |   |-- ;
    ( 5)     |   |   |-- Assign
    ( 6)     |   |       |-- Identifier: a
    ( 7)     |   |       |-- Integer: 11
    ( 8)     |   |-- Assign
    ( 9)     |       |-- Identifier: b
    (10)     |       |-- Integer: 22
    (11)     |-- Assign
    (12)         |-- Identifier: c
    (13)         |-- Integer: 33

In flattened form:

    ( 1) Sequence
    ( 2) Sequence
    ( 3) Sequence
    ( 4) ;
    ( 5) Assign
    ( 6) Identifier a
    ( 7) Integer 11
    ( 8) Assign
    ( 9) Identifier b
    (10) Integer 22
    (11) Assign
    (12) Identifier c
    (13) Integer 33

;Pseudo-code for the parser.

Uses [https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm Precedence Climbing] for expression parsing, and
[https://en.wikipedia.org/wiki/Recursive_descent_parser Recursive Descent] for statement parsing. The AST is built as parsing proceeds:

<syntaxhighlight lang="python">def expr(p)
    if tok is "("
        x = paren_expr()
    elif tok in ["-", "+", "!"]
        op = tok                        # remember the unary operator
        gettok()
        y = expr(precedence of op)
        if op is "+"
            x = y                       # unary plus adds no node
        else
            x = make_node(Negate if op is "-" else Not, y)
    elif tok is an Identifier
        x = make_leaf(Identifier, variable name)
        gettok()
    elif tok is an Integer constant
        x = make_leaf(Integer, integer value)
        gettok()
    else
        error()

    while tok is a binary operator and precedence of tok >= p
        save_tok = tok
        gettok()
        q = precedence of save_tok
        if save_tok is not right associative
            q += 1
        x = make_node(Operator save_tok represents, x, expr(q))

    return x

def paren_expr()
    expect("(")
    x = expr(0)
    expect(")")
    return x

def stmt()
    t = NULL
    if accept("if")
        e = paren_expr()
        s = stmt()
        t = make_node(If, e, make_node(If, s, accept("else") ? stmt() : NULL))
    elif accept("putc")
        t = make_node(Prtc, paren_expr())
        expect(";")
    elif accept("print")
        expect("(")
        repeat
            if tok is a string
                e = make_node(Prts, make_leaf(String, the string))
                gettok()
            else
                e = make_node(Prti, expr(0))

            t = make_node(Sequence, t, e)
        until not accept(",")
        expect(")")
        expect(";")
    elif tok is ";"
        gettok()
    elif tok is an Identifier
        v = make_leaf(Identifier, variable name)
        gettok()
        expect("=")
        t = make_node(Assign, v, expr(0))
        expect(";")
    elif accept("while")
        e = paren_expr()
        t = make_node(While, e, stmt())
    elif accept("{")
        while tok not equal "}" and tok not equal end-of-file
            t = make_node(Sequence, t, stmt())
        expect("}")
    elif tok is end-of-file
        pass
    else
        error()
    return t

def parse()
    t = NULL
    gettok()
    repeat
        t = make_node(Sequence, t, stmt())
    until tok is end-of-file
    return t</syntaxhighlight>
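
The expression half of the pseudo-code can be sketched as runnable Python. This is not one of the reference implementations, only an illustration under stated assumptions: it tokenizes raw source text with a small regular expression instead of reading lex output, represents nodes as plain tuples, and uses precedence numbers that are hypothetical (only their relative order is taken from the grammar). All operators are treated as left associative.

```python
import re

# Hypothetical precedence values; a larger number binds more tightly.
BINARY = {
    "||": ("Or", 1), "&&": ("And", 2),
    "==": ("Equal", 3), "!=": ("NotEqual", 3),
    "<": ("Less", 4), "<=": ("LessEqual", 4),
    ">": ("Greater", 4), ">=": ("GreaterEqual", 4),
    "+": ("Add", 5), "-": ("Subtract", 5),
    "*": ("Multiply", 6), "/": ("Divide", 6), "%": ("Mod", 6),
}
UNARY_PREC = 7
TOKEN_RE = re.compile(r"\s*(\|\||&&|[=!<>]=|[-+*/%<>!()]|\d+|[A-Za-z_]\w*)")


def tokenize(src):
    """Split an expression into operator, number and identifier strings."""
    src, pos, tokens = src.rstrip(), 0, []
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens + ["<eof>"]


def parse_expr(src):
    tokens, pos = tokenize(src), 0

    def expr(p):
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            x = expr(0)
            if tokens[pos] != ")":
                raise SyntaxError("expected ')'")
            pos += 1
        elif tok in ("-", "+", "!"):
            y = expr(UNARY_PREC)
            # Unary plus adds no node; Negate/Not get a null right child.
            x = y if tok == "+" else ("Negate" if tok == "-" else "Not", y, None)
        elif tok.isdigit():
            x = ("Integer", tok)
        elif tok[0].isalpha() or tok[0] == "_":
            x = ("Identifier", tok)
        else:
            raise SyntaxError(f"unexpected token {tok!r}")
        # Climb: absorb binary operators whose precedence is at least p.
        while tokens[pos] in BINARY and BINARY[tokens[pos]][1] >= p:
            name, prec = BINARY[tokens[pos]]
            pos += 1
            x = (name, x, expr(prec + 1))   # +1 makes it left associative
        return x

    tree = expr(0)
    if tokens[pos] != "<eof>":
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


def flatten(t):
    """List the node lines in the order the flattened format prints them."""
    if t is None:
        return [";"]
    if len(t) == 2:
        return [f"{t[0]} {t[1]}"]
    return [t[0]] + flatten(t[1]) + flatten(t[2])


print("\n".join(flatten(parse_expr("n/k*k!=n"))))
```

For <code>n/k*k!=n</code> this yields the same node order as the corresponding Assign right-hand side in the prime-number test case of this task.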

;Once the AST is built, it should be output in a [[Flatten_a_list|flattened format.]] This can be as simple as the following:

<syntaxhighlight lang="python">def prt_ast(t)
    if t == NULL
        print(";\n")
    else
        print(t.node_type)
        if t.node_type in [Identifier, Integer, String]     # leaf node
            print the value of the Ident, Integer or String, "\n"
        else
            print("\n")
            prt_ast(t.left)
            prt_ast(t.right)</syntaxhighlight>
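
A runnable version of the flattening routine might look like the sketch below. The <code>Node</code> class and its field names are assumptions made for illustration, and the node type and value are separated by a single space, as in the tables on this page.

```python
import io
import sys
from dataclasses import dataclass
from typing import Optional

LEAF_TYPES = {"Identifier", "Integer", "String"}


@dataclass
class Node:
    node_type: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[str] = None   # used by leaf nodes only


def prt_ast(t, out=sys.stdout):
    """Write the tree in pre-order, one node per line, ';' for null."""
    if t is None:
        out.write(";\n")
    elif t.node_type in LEAF_TYPES:
        out.write(f"{t.node_type} {t.value}\n")
    else:
        out.write(f"{t.node_type}\n")
        prt_ast(t.left, out)
        prt_ast(t.right, out)


# AST for the one-line program:  a=11;
tree = Node("Sequence", None,
            Node("Assign",
                 Node("Identifier", value="a"),
                 Node("Integer", value="11")))

buf = io.StringIO()
prt_ast(tree, buf)
print(buf.getvalue(), end="")
```

Because every internal node writes exactly two children (a null child as ";"), no brackets or indentation are needed for the output to be unambiguous.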

;If the AST is correctly built, loading it into a subsequent program should be as simple as:

<syntaxhighlight lang="python">def load_ast()
    line = readline()
    # Each line has at least one token
    line_list = tokenize the line, respecting double quotes

    text = line_list[0] # first token is always the node type

    if text == ";"   # a terminal node
        return NULL

    node_type = text # could convert to internal form if desired

    # A line with two tokens is a leaf node
    # Leaf nodes are: Identifier, Integer, String
    # The 2nd token is the value
    if len(line_list) > 1
        return make_leaf(node_type, line_list[1])

    left = load_ast()
    right = load_ast()
    return make_node(node_type, left, right)</syntaxhighlight>
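
As a rough, runnable counterpart to the loader pseudo-code, the sketch below reads the flattened format from an iterator of lines. It makes two assumptions that are not part of the task text: nodes are returned as plain tuples, and the "tokenize respecting double quotes" step is reduced to splitting off the first word, so a quoted string stays intact as the second field.

```python
def load_ast(lines):
    """Rebuild a tree from flattened AST lines (consumes the iterator)."""
    line = next(lines).strip()
    # First word is the node type; anything after it is the leaf value.
    parts = line.split(None, 1)
    node_type = parts[0]

    if node_type == ";":          # null node
        return None

    if len(parts) > 1:            # Identifier, Integer or String leaf
        return (node_type, parts[1])

    left = load_ast(lines)        # children follow in pre-order
    right = load_ast(lines)
    return (node_type, left, right)


flat = '''Sequence
Sequence
;
Assign
Identifier a
Integer 11
Prts
String "count is: "
;
'''
tree = load_ast(iter(flat.splitlines()))
print(tree)
```

Passing an iterator instead of a list lets each recursive call pick up exactly where the previous one stopped, which mirrors the repeated <code>readline()</code> in the pseudo-code.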

Finally, the AST can also be tested by running it against one of the AST Interpreter [[Compiler/AST_interpreter|solutions]].

;Test program, assuming this is in a file called prime.t: lex <prime.t | parse

{| class="wikitable"
|-
! Input to lex
! Output from lex, input to parse
! Output from parse
|-
| style="vertical-align:top" |
<syntaxhighlight lang="c">/*
 Simple prime number generator
 */
count = 1;
n = 1;
limit = 100;
while (n < limit) {
    k=3;
    p=1;
    n=n+2;
    while ((k*k<=n) && (p)) {
        p=n/k*k!=n;
        k=k+2;
    }
    if (p) {
        print(n, " is prime\n");
        count = count + 1;
    }
}
print("Total primes found: ", count, "\n");</syntaxhighlight>

| style="vertical-align:top" |
<b><pre>
4 1 Identifier count
4 7 Op_assign
4 9 Integer 1
4 10 Semicolon
5 1 Identifier n
5 3 Op_assign
5 5 Integer 1
5 6 Semicolon
6 1 Identifier limit
6 7 Op_assign
6 9 Integer 100
6 12 Semicolon
7 1 Keyword_while
7 7 LeftParen
7 8 Identifier n
7 10 Op_less
7 12 Identifier limit
7 17 RightParen
7 19 LeftBrace
8 5 Identifier k
8 6 Op_assign
8 7 Integer 3
8 8 Semicolon
9 5 Identifier p
9 6 Op_assign
9 7 Integer 1
9 8 Semicolon
10 5 Identifier n
10 6 Op_assign
10 7 Identifier n
10 8 Op_add
10 9 Integer 2
10 10 Semicolon
11 5 Keyword_while
11 11 LeftParen
11 12 LeftParen
11 13 Identifier k
11 14 Op_multiply
11 15 Identifier k
11 16 Op_lessequal
11 18 Identifier n
11 19 RightParen
11 21 Op_and
11 24 LeftParen
11 25 Identifier p
11 26 RightParen
11 27 RightParen
11 29 LeftBrace
12 9 Identifier p
12 10 Op_assign
12 11 Identifier n
12 12 Op_divide
12 13 Identifier k
12 14 Op_multiply
12 15 Identifier k
12 16 Op_notequal
12 18 Identifier n
12 19 Semicolon
13 9 Identifier k
13 10 Op_assign
13 11 Identifier k
13 12 Op_add
13 13 Integer 2
13 14 Semicolon
14 5 RightBrace
15 5 Keyword_if
15 8 LeftParen
15 9 Identifier p
15 10 RightParen
15 12 LeftBrace
16 9 Keyword_print
16 14 LeftParen
16 15 Identifier n
16 16 Comma
16 18 String " is prime\n"
16 31 RightParen
16 32 Semicolon
17 9 Identifier count
17 15 Op_assign
17 17 Identifier count
17 23 Op_add
17 25 Integer 1
17 26 Semicolon
18 5 RightBrace
19 1 RightBrace
20 1 Keyword_print
20 6 LeftParen
20 7 String "Total primes found: "
20 29 Comma
20 31 Identifier count
20 36 Comma
20 38 String "\n"
20 42 RightParen
20 43 Semicolon
21 1 End_of_input
</pre></b>

| style="vertical-align:top" |
<b><pre>
Sequence
Sequence
Sequence
Sequence
Sequence
;
Assign
Identifier count
Integer 1
Assign
Identifier n
Integer 1
Assign
Identifier limit
Integer 100
While
Less
Identifier n
Identifier limit
Sequence
Sequence
Sequence
Sequence
Sequence
;
Assign
Identifier k
Integer 3
Assign
Identifier p
Integer 1
Assign
Identifier n
Add
Identifier n
Integer 2
While
And
LessEqual
Multiply
Identifier k
Identifier k
Identifier n
Identifier p
Sequence
Sequence
;
Assign
Identifier p
NotEqual
Multiply
Divide
Identifier n
Identifier k
Identifier k
Identifier n
Assign
Identifier k
Add
Identifier k
Integer 2
If
Identifier p
If
Sequence
Sequence
;
Sequence
Sequence
;
Prti
Identifier n
;
Prts
String " is prime\n"
;
Assign
Identifier count
Add
Identifier count
Integer 1
;
Sequence
Sequence
Sequence
;
Prts
String "Total primes found: "
;
Prti
Identifier count
;
Prts
String "\n"
;
</pre></b>
|}

; Additional examples

Your solution should pass all the test cases above and the additional tests found '''[[Compiler/Sample_programs|Here]]'''.

{{task heading|Reference}}

The C and Python versions can be considered reference implementations.

;Related Tasks

* [[Compiler/lexical_analyzer|Lexical Analyzer task]]
* [[Compiler/code_generator|Code Generator task]]
* [[Compiler/virtual_machine_interpreter|Virtual Machine Interpreter task]]
* [[Compiler/AST_interpreter|AST Interpreter task]]

<hr>
__TOC__