lecture 2

CPSC411-- COMPILER CONSTRUCTION I

LECTURE 2:

Reading: Chapter 1 Louden and "Dragon book"

Overview of a compiler:

                                       |
                               ____|______                                    ____
                              | Lexical       |                                   |
                              | Analysis     |                                   |   Syntax:
                              |__________|                                   |
                                         |                                             |    O(|C|)
                               _____|_____                                   |
                               | Syntax        |                                  |
                               | Analysis     |                                  |____
                               |__________|
                                         |                                             ____
     _________          ____|_____          _________      |
     | Symbol   |        | Semantic |        | Error         |    | Semantics
     | Table      |        | Analysis     |        | Handler    |    | O(|C| log |C|)
     |________|        |__________|        |__________|    |
                                         |                                            |___
                                 ____|_____
                                 | Intermed. |
                                 | Code        |
                                 |_________|                                  ____
                                        |                                             |
                                ____|_____                                   |
                                |   Optim-   |                                | Most of the
                                |   ization      |                                | problems here
                                |__________|                                | are NP-complete
                                        |                                             |
                                ____|_____                                   | approximate
                               | Code           |                                 | algorithms
                               | Generat'n   |                                 | are appropriate!
                               |__________|                                 |
                                        |                                             |____

For syntactic analysis there are available techniques which are of time complexity order the size of the source program. In the semantic analysis phase (for some languages) one may face a slight increase in time complexity (a factor of the log of the program size). In the area of optimization almost all the problems are NP-complete (register allocation, optimal flow control, etc.) so that here we will often be satisfied by suboptimal but effective techniques.

INTRODUCTION TO GRAMMARS:

Reading: 2.1 - 2.4 "Dragon book"

You should have covered most of the basic material of this section in CPSC313. It is likely that I will use slightly different notation from what you may have met before so I will give a fairly complete treatment. I will motivate this section by the problem of writing an infix to postfix converter (see assignment #1).

Infix to postfix conversion:

Many calculators use postfix notation for input: usual mathematical notation uses infix. We shall consider the problem of doing an infix
to postfix conversion:

                                            ____________
                                           |                         |
(13 + 44) * 2 + 6 --------|      infx2pofx   |---- 13 44 + 2 * 6 +
                                           |____________ |

Arithmetic expressions appear all over the place in compiles so it will be very useful to be extremely familiar with them!

Our objective is to write a program which has as input a (short) file containing an arithmetic expression and has as output a (short) file containing the postfix translation of the expression.

FAQ: Why is postfix notation popular in compiler writing???

Answer: because it is easy to implement on a stack machine ...

       13                 44                 +
_____ 'PUSH 13' ______   'PUSH 44' ______    'ADD' _______
                               | 13      |                     | 44      |                |   57       |    etc.
                               |______|                     |______|                |_______|
                                                                  | 13      |
                                                                  |______|

Here a single number is interpreted as a 'PUSH' instruction the arithmetic operations + is interpreted as an 'ADD' instruction which removes the top two elements of the stack adds them and pushes the result back onto the stack. Similarly for multiplication ... Thus, postfix is a very convenient intermediate form for stack based computations.

The first problem we have is to define the legal input strings for our little infix to postfix converter. For this we need to be able to write a grammar. So a reminder on grammars ...

CONTEXT FREE GRAMMARS:

A context free grammar G = (T,N,P,pr,start) consists of

A finite set T of terminal symbols (or tokens)
A finite set N of nonterminal symbols
A finite set P of production rules with a map pr: P ---> N * list(N \union T) which associates to each production rule a nonterminal head and a body consisting of a (possibly empty) strings of symbols from the union of N and T.
A start symbol start chosen from N, the set of nonterminals.

Note that the productions can (and often are) viewed (not quite equivalently) as a relation: P \subset N * list(N \union T) [where * stands for cartesian product]. This is because we do not expect a grammar to have two rules which are exactly the same except for their name.

Here is an example of a grammar:

          T = { NUM, +, * }
          N = { expr }
          P = { p_1,p_2,p_3}
          expr ------> expr + expr              = pr(p_1)
                                  | expr * expr              = pr(p_2)
                                  | NUM                       = pr(p_3)
      start = expr

Notice that the set P essentially provides a numbering for the production rules.

As we shall realize soon enough this is not a very good grammar for our purposes! However, it will be useful for demonstrating some elementary points.

The first problem given a grammar is to sort out what things it recognizes. Now, in general, it is far from obvious what a grammar recognizes and what it does not recognize: so it is very useful to generate some examples. This can be done quite methodically by writing down the start symbol and methodically replacing (leftmost) nonterminals (in a breadth-first manner) until terminal strings are reached:

(0) expr

(1) expr + expr expr * expr NUM

(2) expr + expr + expr          expr + expr * expr
              expr * expr + expr        expr * expr * expr
                      NUM + expr               NUM * expr

Using this method it is not hard to see that strings like:

    NUM
    NUM + NUM
    NUM * NUM
    NUM + NUM + NUM
    NUM + NUM * NUM
    ....

can be generated.

Associated with any string which is generated is the sequence of uses of the production rules required to obtain that string. These may be organized into a parse tree. One way to describe a parse tree is as follows:

*** NOTE: this is not quite the same as in the book! ******

A parse tree is a tree whose nodes are labeled by the productions. Each node has a (possible empty) list of typed arguments (corresponding to the right-hand side of the production) which themselves are parse trees of the appropriate type. For example using the above grammar:

                            |
                            | expr
                      ___|___
                     |   p_1     |
                     |_______|
                         /    | \
                expr /     |   \ expr
                       /      |    \
                 __/__    |    _\___
               | p_3   |   |   | p_3   |
               |_____|   |   |_____|
                      |       |       |
           NUM |     +|       | NUM

is a parse tree for 'NUM + NUM'.

More exactly for a grammar we define a parse tree of nonterminal type to be:

a terminal TERM is a parse tree of type TERM
if p' is a production rule with pr(p')= (x,[x_1,..,x_n]) and ti is a parse tree of type x_i the p't_1,...,t_n) is a parse tree of type x.

We are interested in parse trees of type start: such a parse tree is called a parse of the grammar. Notice that this definition means that parse trees (whether they are parses or not) always have terminal nodes at their leaves.

Notice that in this grammar that a given string may have more than one parse. For example consider 'NUM + NUM * NUM' the first parse is:

                           |
                           | expr
                      ___|___
                     | p_1      |
                     |_______|
                         /     | \
               expr /      |   \ expr
                       /       |    \
                _ _/__      |    _\___
               | p_3   |     |   | p_2   |
               |_____|     |   |_____|________
                  |             |       |                 \     \
                  |             |       | expr          \     \ expr
                  |             |     _|___             \    _\____
                  |             |   | p_3   |            |   | p_3   |
                  |             |   |_____|            |   |_____|
                  |             |      |       |           |       |
       NUM |         + |     | NUM       *|       | NUM

the second parse is:
                                      |
                                      | expr
                                ___|___
                               | p_2      |
                               |_______|
                                  /   |   \
                        expr /     |    \ expr
                               /      |      \
                       __ _/__    |     _\____
                      | p_1     |    \    | p_3   |
                      |____ _|     \   |_____|
                         /   |   \         \      \
               expr /     |    \expr \      \
                      /       |     \         \      \
                __/__     |     _\___   |      |
               | p_3   |   |   | p_3    | |      |
               |_____|   |   |____ _| |      |
                       |      |            |     |      |
             NUM|    *|   NUM| +|       | NUM

A grammar with this property is called ambiguous. A grammar is said to be ambiguous if the map:

parse_value
ParseTrees ----------------> T*

which associates to each parse tree the string of its terminals (obtained from a left right traversal of the parse tree) is NOT one-one (that is monomophic or injective depending on where you learnt math :-) ). In case this function is injective the grammar is said to be unambiguous .

FACT: To determine whether a grammar is unambiguous is an undecidable problem!

This problem can be reduced to the post correspondence problem. It is one of the best known undecidable problems! This might have been covered in CPSC313.

When we are parsing for a compiler we would like the structure of the parse tree to approximate the intended semantics so that the subsequent translation to code is facilitated. This means that ambiguous grammars are generally rather undesirable for our purposes ... as they suggest that one might be able to attach more than one meaning to a given source program. In the case of our little grammar this may be particularly undesirable ...

=============== FIRST FUNCTIONAL INTERLUDE!!! ==================

For those with functional programming experience parse trees are just (mutually) recursive datatypes. For those without functional programming experience it is time to get some! I will be using functional notation in the lectures so it will be useful to be acquaintance with this early!

Here is an Haskell datatype (for the Haskell program see here) whose values are exactly all the parse trees of the above grammar:

---------------------

data         Expr = P_1 of Expr * Add * Expr
                          | P_2 of Expr * Mul * Expr
                          | P_3 of Num
data Add = ADD
data Mul = MUL
data Num = NUM;

---------------------

This is actually just a restatement of the recursive definition of parse trees. Literally it says:

   Non-terminals:
      a parse tree of type expr is one of three things
      P_1: a parse tree type expr followed by a parse tree of type add followed by a parse tree of type expr
      P_2: a parse tree type expr followed by a parse tree of type mul followed by a parse tree of type expr
      P_3: a parse tree type num

   Terminals:
     a parse tree of type add is just ADD
     a parse tree of type mul is just MUL
     a parse tree of type num is just NUM

Examples of values of this datatype are, for example, the two different parses mentioned above!

P_1(P_2(P_3 NUM,MUL,P_3 NUM),ADD,P_3 NUM)
P_2(P_3 NUM,MUL,P_1(P_3 NUM,ADD,P_3 NUM))

We may also write a mutually recursive function "parse_value"

      parse_value (P_1(t1,x,t2)) = (parse_value t1)++(parse_value_add x)++(parse_value t2)
      parse_value (P_2(t1,x,t2)) = (parse_value t1)++(parse_value_mul x)++(parse_value t2)
      parse_value (P_3 x) = parse_value_num x
and parse_value_add ADD = " + "
and parse_value_mul MUL = " * "
and parse_value_num NUM = "NUM";

=============== END FIRST FUNCTIONAL INTERLUDE!!! ==================

The first problem therefore is to try and disambiguate this grammar. Here is a new and unambiguous grammar for the same language:

          T = { NUM, +, * }
          N = { expr, factor, term }
          P = expr ----> expr + term               p_1
                                     | term                          p_2
                  term ----> term * factor            p_3
                                    | factor                         p_4
                  factor --> NUM                          p_5
           start = expr

Notice that to gain unambiguity we have had to introduce more non-terminals. The effect is that, in the parse trees, we better reflect the usual rules of the intended association for the infix mathematical terms.

Here is the new (and only) parse tree for the point of ambiguity of the first grammar:

                                                 | expr
                                           ___|___
                                          | p_1      |
                                          |_______|
                          expr   ____/_     |    \
                                   | p_2     |   |      \ term
                                   |______|   |       \
                        term   ___/__       |       _\____
                                  | p_3   |       \     | p_4     |
                        term |______|        \   |______|
                            _____/   | \           \         \
                           | p_4 |   |    \factor \         \ factor
                           |_____|   |     \            \     __\___
                 factor __|__    |    _\___     |    | p_5   |
                            | p_5 |   |   | p_5   |   |    |_____|
                           |_____|   |   |_____|   |       |
                                  |       |           |       |       |
                         NUM|    *| NUM|    +|        | NUM

=============== SECOND FUNCTIONAL INTERLUDE!!! ==================

Here is also an SML datatype whose values are exactly all the parse trees of
this second grammar:

---------------------

data
   Expr = P_1 of Expr * Add * Term
             | P_2 of Term
   Term = P_3 of Term * Mul * Factor
             | P_4 of Factor
data Factor = P_5 of Num
data Add = ADD
data Mul = MUL
data Num = NUM;

---------------------

Examples of values of this datatype are, for example, the parse mentioned above:

P_1(P_2(P_3(P_4(P_5 NUM),MUL,P_5 NUM)),ADD,P_4(P_5 NUM))

We may also write a mutually recursive function "parse_value"

parse_value (P_1(t1,x,t2)) = (parse_value t1)
++ (parse_value_add x) ++ (parse_value_term t2)
parse_value (P_2 x) = parse_value_term x
parse_value_term (P_3(t1,x,t2)) = (parse_value_term t1) ++ (parse_value_mul x) ++ (parse_value_factor t2)
parse_value_term (P_4 t) = parse_value_factor t
parse_value_factor (P_5 x) = parse_value_num x
parse_value_add ADD = " + "
parse_value_mul MUL = " * "
parse_value_num NUM = "NUM";

==== END SECOND FUNCTIONAL INTERLUDE!!! ======