Answers for the cpsc510 midterm 2000 fall exam
==============================================
Author: Robin Cockett

[Answers for Qu 1 & 2 & 3]

==========================================================================
Qu 1.
==========================================================================

Describe the main stages of a compiler.  Here is a common scheme:

      | source string (high level programming language)
      v
   ------------------------                                     F
   | Lexical analysis     |------------>------------|           R
   ------------------------                         |           O
      | list of tokens                               |           N
      v                                              |           T
   ------------------------                         |
   | Grammatical analysis |                         |           E
   |      /Parsing        |------------>------------|           N
   ------------------------                         |           D
      | syntax tree                           errors |
      v                                              v
   ------------------------               ------------------
   | Semantic analysis    |--------->-----| Error reporter |---->-----
   ------------------------               ------------------
      | intermediate code
      v
   ------------------------                                     B
   | optimization         |                                     A
   ------------------------                                     C
      | intermediate code (optimized)                           K
      v
   ------------------------                                     E
   | code generation      |                                     N
   ------------------------                                     D
      | target language/assembler/machine code

(1.1) Lexical analysis: the lexemes of the language are extracted (using a
      regular automaton) and made into tokens.
      [Complexity: O(n), where n is the length of the input string.]

(1.2) Grammatical analysis: the tokens are parsed (according to a context
      free grammar) and a structure is built which may not be the actual
      parse tree (but should be fairly closely related) which we call the
      syntax tree.  This structure should be suitable for the next
      phase ...
      [Complexity: O(n), where n is the length of the list of tokens.]

(1.3) Semantic analysis: here type checking (which includes ensuring that
      functions and procedures have the correct number of arguments) and
      the differentiation of polymorphic symbols (using this type
      information) take place.  The output of this stage may be an
      intermediate code.
      [Complexity: O(n) - O(n log n), where n is the size of the syntax
      tree.  The log n factor is necessary as one must look up symbols in
      the symbol table (use of hash tables can reduce this to effectively
      O(n) again) AND one may have to do type unification to complete the
      type checking.]

(1.4) Optimization: most significant code optimization takes place at the
      level of the intermediate code.
      [Complexity: =< O(n log n).  The problems here are NP-hard, but to
      end up with a practical compiler one must keep the complexity to at
      most O(n log n).]

(1.5) Code generation: the intermediate code is turned into the target
      language.
      [Complexity: =< O(n log n).  The register allocation problem is
      equivalent to the graph coloring problem, which is NP-complete.
      Thus one is forced to use heuristic algorithms of low complexity.]

==========================================================================
Qu 2.
==========================================================================

The grammar you were asked to calculate the vital statistics for was:

   stmts       -> stmts label stmts stmt SEMICOLON |.
   label       -> LABEL COLON |.
   stmt        -> ID ASSIGN r_expr.
   r_expr      -> ID params | NUM.
   params      -> LPAR arglist RPAR |.
   arglist     -> r_expr arglist_end |.
   arglist_end -> COMMA r_expr arglist_end |.

All nonterminals are reachable and realizable.

The following nonterminals are nullable (a mechanical check is sketched
below):

   stmts label params arglist arglist_end.

The only endable nonterminal is: stmts.

The nonterminal "stmts" is immediately left recursive: this means the
grammar is not LL(1).
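
[Aside: the nullable set above can be checked mechanically.  The following
is a minimal sketch (in Python; not part of the original answer key, and
the grammar encoding and function name are just illustrative assumptions)
of the usual fixed-point computation: a nonterminal is nullable when some
right-hand side consists entirely of nullable symbols.

# The Qu 2 grammar, one list of symbols per production (the empty list is
# the empty production).
GRAMMAR = {
    "stmts":       [["stmts", "label", "stmts", "stmt", "SEMICOLON"], []],
    "label":       [["LABEL", "COLON"], []],
    "stmt":        [["ID", "ASSIGN", "r_expr"]],
    "r_expr":      [["ID", "params"], ["NUM"]],
    "params":      [["LPAR", "arglist", "RPAR"], []],
    "arglist":     [["r_expr", "arglist_end"], []],
    "arglist_end": [["COMMA", "r_expr", "arglist_end"], []],
}

def nullable_set(grammar):
    """Fixed point: a nonterminal is nullable if some production consists
    entirely of nullable symbols (the empty production trivially so)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            if nt not in nullable and \
               any(all(sym in nullable for sym in rhs) for rhs in productions):
                nullable.add(nt)
                changed = True
    return nullable

print(sorted(nullable_set(GRAMMAR)))
# -> ['arglist', 'arglist_end', 'label', 'params', 'stmts']

Terminals are never added to the set, so any production mentioning a
terminal can never become all-nullable, which is exactly the condition
applied by hand above.]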
That the grammar is not LL(1) should also be apparent from the calculation
of the first sets and follow sets:

   Nonterminal         first set         follow set
   ===========
   label       ====>  { LABEL }         { LABEL ID }
   stmt        ====>  { ID }            { SEMICOLON }
   r_expr      ====>  { ID NUM }        { COMMA RPAR SEMICOLON }
   params      ====>  { LPAR }          { COMMA RPAR SEMICOLON }
   arglist_end ====>  { COMMA }         { RPAR }
   stmts       ====>  { LABEL ID }      { LABEL ID }
   arglist     ====>  { ID NUM }        { RPAR }

To be LL(1) there are three conditions involving these statistics:

(1) Each nonterminal can have at most one nullable production.
    This is satisfied by this grammar!

(2) The first sets of the RHSs of any two distinct productions of the same
    nonterminal must be disjoint.  Notice that this is true (rather
    obviously) for all the nonterminals.

(3) For any nullable nonterminal the first set and the follow set must be
    disjoint.  Here we run into problems!  Only "stmt" and "r_expr" are
    non-nullable, so the remaining nonterminals must be checked: both
    "label" and "stmts" fail, as their first and follow sets overlap.

[If you got all this right you have 3/4 of the marks for the question.]

Now it is worth noting at this stage that the grammar with start state
"stmt" is LL(1), as all three of the above conditions are satisfied for
those nonterminals which can be reached from "stmt".  It is a fact that
every LL(1) grammar is LR(1), and this of course means that we now know
the answer to 3(d) below!

There is something else you might note about this grammar: it is rather
obviously ambiguous!  Look at the rule for "stmts": its right-hand side
starts with three nullable nonterminals, "stmts label stmts".  Now, if the
initial string matches just "stmts", there are two ways of parsing it
starting from this sentential form: one can expand the first "stmts" and
null the next two nonterminals, or null the first two nonterminals and
expand the second "stmts".  So this actually answers the last two parts of
this question!  You do not want an ambiguous grammar for a programming
language (the one exception is when you use precedence parsing - which
actually removes the ambiguity again).  Finally one notes: one cannot
remove ambiguity by transformation, and no ambiguous grammar is LL(1)!!!!

Let us actually transform the grammar to see what happens!!  We know
everything from "stmt" down is non-left-recursive, so really our concern
is the first two rules:

   stmts -> stmts label stmts stmt SEMICOLON |.
   label -> LABEL COLON |.

There is an immediate left recursion, so we need to remove that (no
implied ordering is needed):

   stmts  -> stmts+        % using the null non-left-recursive production
   stmts+ -> label stmts stmt SEMICOLON stmts+ | .
   label  -> LABEL COLON |.

Now we want to make the first production good, so we set stmts < stmts+.
However, examining the second rule, as "label" is nullable we now have an
improvable rule, so we substitute "label" into this rule.  As this is the
only place in which "label" is used, we may then throw that production
away:

   stmts  -> stmts+
   stmts+ -> LABEL COLON stmts stmt SEMICOLON stmts+
           | stmts stmt SEMICOLON stmts+
           | .

Again we can improve these productions to

   stmts  -> stmts+
   stmts+ -> LABEL COLON stmts stmt SEMICOLON stmts+
           | stmts+ stmt SEMICOLON stmts+
           | .

But now we have made the second (hidden) left recursive aspect apparent!
We remove left recursion again:

   stmts   -> stmts+.
   stmts+  -> LABEL COLON stmts stmt SEMICOLON stmts+ stmts++
            | stmts++.
   stmts++ -> stmt SEMICOLON stmts+ stmts++ |.

And now we are done!  Of course, as we have already observed, this grammar
is not LL(1); we may actually recalculate the vital statistics to verify
this (although I was not expecting you to do this) and see what goes
wrong.
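
[Aside: the immediate-left-recursion removal step used repeatedly above is
entirely mechanical.  Here is a small sketch (illustrative Python, not
from the original answers; the grammar representation and the helper name
are assumptions): productions A -> A alpha | beta become A -> beta A+ and
A+ -> alpha A+ | .

def remove_immediate_left_recursion(grammar, nt):
    """Split nt's productions into left-recursive ones (nt alpha) and the
    rest (beta), then rewrite:  nt -> beta nt+   and   nt+ -> alpha nt+ | ."""
    recursive = [rhs[1:] for rhs in grammar[nt] if rhs[:1] == [nt]]
    others    = [rhs     for rhs in grammar[nt] if rhs[:1] != [nt]]
    if not recursive:
        return grammar                      # nothing to do
    new_nt = nt + "+"
    out = dict(grammar)
    out[nt]     = [beta + [new_nt] for beta in others]
    out[new_nt] = [alpha + [new_nt] for alpha in recursive] + [[]]
    return out

# Applied to the first rule alone it reproduces the first step above:
g = {"stmts": [["stmts", "label", "stmts", "stmt", "SEMICOLON"], []]}
print(remove_immediate_left_recursion(g, "stmts"))
# -> {'stmts': [['stmts+']],
#     'stmts+': [['label', 'stmts', 'stmt', 'SEMICOLON', 'stmts+'], []]}
]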
The grammar with left recursion removed now looks like:

=================================================================
   stmts       -> stmts+.
   stmts+      -> LABEL COLON stmts stmt SEMICOLON stmts+ stmts++
                | stmts++.
   stmts++     -> stmt SEMICOLON stmts+ stmts++ |.
   stmt        -> ID ASSIGN r_expr.
   r_expr      -> ID params | NUM.
   params      -> LPAR arglist RPAR |.
   arglist     -> r_expr arglist_end |.
   arglist_end -> COMMA r_expr arglist_end |.
=================================================================

The nullable nonterminals are:

   stmts++ params arglist arglist_end stmts+ stmts.

   Nonterminal         first             follow
   ===========
   stmts+      ====>  { LABEL ID }      { ID }
   stmt        ====>  { ID }            { SEMICOLON }
   r_expr      ====>  { ID NUM }        { COMMA RPAR SEMICOLON }
   params      ====>  { LPAR }          { COMMA RPAR SEMICOLON }
   arglist_end ====>  { COMMA }         { RPAR }
   stmts       ====>  { LABEL ID }      { ID }
   stmts++     ====>  { ID }            { ID }
   arglist     ====>  { ID NUM }        { RPAR }

The three conditions for LL(1) now look like:

(1) satisfied;
(2) satisfied (only the productions for "stmts+" must be checked!);
(3) here "stmts+", "stmts++" and "stmts" all produce problems!

==========================================================================
Qu. 3
==========================================================================

The grammar in question is:

   stmt        -> l_expr ASSIGN r_expr.
   l_expr      -> ID | NUM.
   r_expr      -> ID params | NUM.
   params      -> LPAR arglist RPAR |.
   arglist     -> r_expr arglist_end |.
   arglist_end -> COMMA r_expr arglist_end |.

The follow sets and what is endable can be read off much as in the
previous question.

The endable nonterminals are: params stmt r_expr.

The follow sets are:

   Nonterminal         follow
   ===========
   arglist     ====>  { RPAR }
   arglist_end ====>  { RPAR }
   params      ====>  { COMMA RPAR }
   l_expr      ====>  { ASSIGN }
   r_expr      ====>  { COMMA RPAR }

The states of the LR(0) item DFA are (abbreviating params, arglist,
arglist_end, ASSIGN and COMMA to parms, args, args_end, ASS and COM):

   States:                                Transitions:
   =======                                ============
                                          NT         T
                                          ==         =
   (1)  S -> .stmt                        stmt       ID
        stmt -> .ID ASS r_expr
        ----------------------

   (2)  stmt -> ID.ASS r_expr                        ASS
        ---------------------

   (3)  stmt -> ID ASS.r_expr             r_expr     ID NUM
        r_expr -> .ID parms
        r_expr -> .NUM
        ---------------------

   (4)  r_expr -> ID.parms                parms      LPAR
        parms -> .LPAR args RPAR
        parms -> .
        ------------------------
        [shift/reduce conflict]   follow(parms) = { COM RPAR }

   (5)  parms -> LPAR.args RPAR           args       ID NUM
        args -> .r_expr args_end          r_expr
        args -> .
        r_expr -> .ID parms
        r_expr -> .NUM
        ------------------------
        [shift/reduce conflict]   follow(args) = { RPAR }

   (6)  args -> r_expr.args_end           args_end   COM
        args_end -> .COM r_expr args_end
        args_end -> .
        --------------------------------
        [shift/reduce conflict]   follow(args_end) = { RPAR }

   (7)  args_end -> COM.r_expr args_end   r_expr     ID NUM
        r_expr -> .ID parms
        r_expr -> .NUM
        -------------------------------

   (8)  args_end -> COM r_expr.args_end   args_end   COM
        args_end -> .COM r_expr args_end
        args_end -> .
        --------------------------------
        [shift/reduce conflict]   follow(args_end) = { RPAR }

   (9) through (16)  [the remaining states contain no further conflicts]

We can immediately see that this is not an LR(0) grammar, as there are
shift/reduce conflicts at states (4), (5), (6) and (8).

Is it an SLR(1) language?  In order for this to be so we must be able to
resolve the conflicts using the follow sets.  This can be done, as in each
of these states the completed item's follow set is disjoint from the
terminals on which the state can shift!

Every SLR(1) grammar is LALR(1).  As all the conflicts are shift/reduce
conflicts, they will already be resolved at the LALR(1) level if they are
resolved at the LR(1) level.
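
[Aside: the SLR(1) resolution argument can be made concrete with a small
check (illustrative Python only; the state data simply transcribes states
(4), (5), (6) and (8) above, written with the full token names): in each
conflicting state the reduce item's follow set must be disjoint from the
terminals on which the state can shift.

# Shift terminals and the follow set of the reduce item, per conflicting state.
CONFLICT_STATES = {
    4: {"shift": {"LPAR"},      "reduce_follow": {"COMMA", "RPAR"}},  # params -> .
    5: {"shift": {"ID", "NUM"}, "reduce_follow": {"RPAR"}},           # arglist -> .
    6: {"shift": {"COMMA"},     "reduce_follow": {"RPAR"}},           # arglist_end -> .
    8: {"shift": {"COMMA"},     "reduce_follow": {"RPAR"}},           # arglist_end -> .
}

for state, info in sorted(CONFLICT_STATES.items()):
    ok = info["shift"].isdisjoint(info["reduce_follow"])
    print("state %d: conflict resolved by SLR(1)? %s" % (state, ok))
# Every state prints True, so the grammar is SLR(1).
]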