LECTURE 2:
|
____|______
____
| Lexical
|
|
| Analysis
|
| Syntax:
|__________|
|
|
| O(|C|)
_____|_____
|
| Syntax
|
|
| Analysis
|
|____
|__________|
|
____
_________
____|_____
_________ |
| Symbol
| | Semantic
| |
Error
| | Semantics
| Table
| |
Analysis
| | Handler
| | O(|C| log |C|)
|________|
|__________|
|__________| |
|
|___
____|_____
| Intermed. |
| Code |
|_________|
____
|
|
____|_____
|
| Optim-
|
| Most of the
| ization
|
| problems here
|__________|
| are NP-complete
|
|
____|_____
| approximate
| Code
|
| algorithms
| Generat'n
|
| are appropriate!
|__________|
|
|
|____
For syntactic analysis there are available techniques which are of
time complexity order the size of the source program. In
the semantic analysis phase (for some languages) one may face a slight
increase in
time complexity (a factor of the log of the program size). In the
area of optimization almost all the problems are NP-complete (register
allocation, optimal flow control, etc.) so that here we will often be
satisfied by
suboptimal but effective techniques.
You should have covered most of the basic material of this section
in
CPSC313. It is likely that I will use slightly different
notation
from what you may have met before so I will give a fairly complete
treatment.
I will motivate this section by the problem of writing an infix to
postfix
converter (see assignment #1).
Many calculators use postfix notation for input: usual mathematical
notation uses infix. We shall consider the problem of doing an
infix
to postfix conversion:
____________
|
|
(13 + 44) * 2 + 6
--------| infx2pofx |---- 13
44 + 2 * 6 +
|____________ |
Arithmetic expressions appear all over the place in compiles so it will be very useful to be extremely familiar with them!
Our objective is to write a program which has as input a (short)
file
containing an arithmetic expression and has as output a (short) file
containing
the postfix translation of the expression.
FAQ: Why is postfix notation popular in compiler writing???
Answer: because it is easy to implement on a stack machine ...
13
44
+
_____ 'PUSH 13' ______ 'PUSH 44'
______ 'ADD' _______
| 13
|
| 44
|
| 57
| etc.
|______|
|______|
|_______|
| 13 |
|______|
Here a single number is interpreted as a 'PUSH' instruction
the
arithmetic operations + is interpreted as an 'ADD' instruction which
removes
the top two elements of the stack adds them and pushes the result back
onto
the stack. Similarly for multiplication ... Thus, postfix
is
a very convenient intermediate form for stack based computations.
The first problem we have is to define the legal input strings for
our little infix to postfix converter. For this we need to be
able to write a grammar. So a reminder on grammars ...
A context free grammar G = (T,N,P,pr,start) consists of
Here is an example of a grammar:
T = { NUM, +,
* }
N = { expr }
P = {
p_1,p_2,p_3}
expr
------> expr +
expr
= pr(p_1)
| expr *
expr
= pr(p_2)
|
NUM
= pr(p_3)
start = expr
Notice that the set P essentially provides a numbering for the production rules.
As we shall realize soon enough this is not a very good grammar for our purposes! However, it will be useful for demonstrating some elementary points.
The first problem given a grammar is to sort out what things it recognizes. Now, in general, it is far from obvious what a grammar recognizes and what it does not recognize: so it is very useful to generate some examples. This can be done quite methodically by writing down the start symbol and methodically replacing (leftmost) nonterminals (in a breadth-first manner) until terminal strings are reached:
(0) expr
(1) expr + expr expr * expr NUM
(2) expr + expr +
expr expr + expr
* expr
expr * expr + expr expr *
expr * expr
NUM +
expr
NUM * expr
Using this method it is not hard to see that strings like:
NUM
NUM + NUM
NUM * NUM
NUM + NUM + NUM
NUM + NUM * NUM
....
can be generated.
Associated with any string which is generated is the sequence of uses of the production rules required to obtain that string. These may be organized into a parse tree. One way to describe a parse tree is as follows:
*** NOTE: this is not quite the same as in the book! ******
A parse tree is a tree whose nodes are labeled by the
productions. Each node has a (possible empty) list of typed
arguments (corresponding to the right-hand side of the production)
which themselves are parse trees of the appropriate type. For
example using the above grammar:
|
| expr
___|___
| p_1 |
|_______|
/ | \
expr / | \ expr
/ | \
__/__ | _\___
| p_3 | | | p_3 |
|_____| | |_____|
|
| |
NUM
| +| | NUM
is a parse tree for 'NUM + NUM'.
More exactly for a grammar we define a parse tree of nonterminal type to be:
Notice that in this grammar that a given string may have more than one parse. For example consider 'NUM + NUM * NUM' the first parse is:
|
| expr
___|___
| p_1 |
|_______|
/ | \
expr / | \ expr
/ | \
_ _/__ | _\___
| p_3 | | |
p_2 |
|_____| | |_____|________
|
|
|
\ \
|
| |
expr
\ \ expr
|
|
_|___
\ _\____
|
| | p_3
|
| | p_3 |
|
|
|_____|
| |_____|
|
| |
|
| |
NUM
| +
| | NUM
*| | NUM
the second parse is:
|
| expr
___|___
| p_2 |
|_______|
/ | \
expr / | \ expr
/ | \
__ _/__ | _\____
| p_1 | \
| p_3 |
|____ _| \ |_____|
/ |
\
\ \
expr / | \expr
\ \
/ |
\
\ \
__/__ |
_\___ | |
| p_3 | | | p_3
| | |
|_____| | |____ _|
| |
|
|
| | |
NUM| *| NUM|
+| | NUM
A grammar with this property is called ambiguous. A grammar is said to be ambiguous if the map:
parse_value
ParseTrees
----------------> T*
which associates to each parse tree the string of its terminals
(obtained from a left right traversal of the parse tree) is NOT one-one
(that is
monomophic or injective depending on where you learnt math :-) ).
In case this function is injective the grammar is said to be unambiguous
.
This problem can be reduced to the post correspondence
problem.
It is one of the best known undecidable problems! This
might
have been covered in CPSC313.
When we are parsing for a compiler we would like the structure of
the
parse tree to approximate the intended semantics so that the subsequent
translation to code is facilitated. This means that ambiguous
grammars are generally rather undesirable for our purposes ... as they
suggest that one might be able to attach more than one meaning to a
given source program. In the case of our little grammar this may
be particularly undesirable ...
For those with functional programming experience parse trees are just (mutually) recursive datatypes. For those without functional programming experience it is time to get some! I will be using functional notation in the lectures so it will be useful to be acquaintance with this early!
Here is an Haskell datatype (for the Haskell program see here) whose values are exactly all the parse trees of the above grammar:
---------------------
data Expr =
P_1
of Expr * Add * Expr
| P_2 of Expr * Mul * Expr
| P_3 of Num
data Add = ADD
data Mul = MUL
data Num = NUM;
---------------------
This is actually just a restatement of the recursive definition of parse trees. Literally it says:
Non-terminals:
a parse tree of type expr is one of
three things
P_1: a parse tree type expr followed
by a parse tree of type add followed by a parse tree of type expr
P_2: a parse tree type expr followed
by a parse tree of type mul followed by a parse tree of type expr
P_3: a parse tree type num
Terminals:
a parse tree of type add is just ADD
a parse tree of type mul is just MUL
a parse tree of type num is just NUM
Examples of values of this datatype are, for example, the two different parses mentioned above!
P_1(P_2(P_3 NUM,MUL,P_3 NUM),ADD,P_3 NUM)
P_2(P_3 NUM,MUL,P_1(P_3 NUM,ADD,P_3 NUM))
We may also write a mutually recursive function "parse_value"
parse_value (P_1(t1,x,t2)) =
(parse_value t1)++(parse_value_add x)++(parse_value t2)
parse_value (P_2(t1,x,t2)) =
(parse_value t1)++(parse_value_mul x)++(parse_value t2)
parse_value (P_3 x) = parse_value_num x
and parse_value_add ADD = " + "
and parse_value_mul MUL = " * "
and parse_value_num NUM = "NUM";
=============== END FIRST FUNCTIONAL INTERLUDE!!!
==================
The first problem therefore is to try and disambiguate this grammar. Here is a new and unambiguous grammar for the same language:
T = { NUM, +,
* }
N = { expr,
factor, term }
P = expr
----> expr +
term
p_1
|
term
p_2
term ----> term *
factor
p_3
|
factor
p_4
factor -->
NUM
p_5
start =
expr
Notice that to gain unambiguity we have had to introduce more
non-terminals. The effect is that, in the parse trees, we
better reflect the usual rules of the intended association for the
infix mathematical terms.
Here is the new (and only) parse tree for the point of ambiguity of the first grammar:
| expr
___|___
| p_1 |
|_______|
expr ____/_ | \
| p_2 |
| \ term
|______| | \
term ___/__
| _\____
| p_3 |
\ | p_4 |
term |______|
\ |______|
_____/ |
\
\ \
| p_4 | | \factor
\ \ factor
|_____| |
\
\ __\___
factor __|__ |
_\___ | | p_5 |
| p_5 | | | p_5 |
| |_____|
|_____| | |_____|
| |
|
|
|
| |
NUM| *| NUM|
+| | NUM
Here is also an SML datatype whose values are exactly all the parse
trees of
this second grammar:
---------------------
data
Expr = P_1 of Expr * Add * Term
| P_2 of Term
Term = P_3 of Term * Mul * Factor
| P_4 of Factor
data Factor = P_5 of Num
data Add = ADD
data Mul = MUL
data Num = NUM;
---------------------
Examples of values of this datatype are, for example, the parse mentioned above:
P_1(P_2(P_3(P_4(P_5 NUM),MUL,P_5 NUM)),ADD,P_4(P_5 NUM))
We may also write a mutually recursive function "parse_value"
parse_value (P_1(t1,x,t2)) =
(parse_value t1)
++ (parse_value_add x) ++ (parse_value_term t2)
parse_value (P_2 x) = parse_value_term x
parse_value_term (P_3(t1,x,t2)) = (parse_value_term t1) ++
(parse_value_mul x) ++ (parse_value_factor t2)
parse_value_term (P_4 t) =
parse_value_factor t
parse_value_factor (P_5 x) = parse_value_num x
parse_value_add ADD = " + "
parse_value_mul MUL = " * "
parse_value_num NUM = "NUM";
==== END SECOND FUNCTIONAL INTERLUDE!!! ======