intro.3.html

CPSC411: COMPILER CONSTRUCTION I

AUTHOR: Robin Cockett
DATE: Jan 17th 2002


Lecture 4: Introduction to parsing


Reading:

Louden: Chapter 3

Dragon: 2.4, 2.5

Recursive descent parsing - add expressions:



Now we shall turn to the problem of producing a parse tree.
There are many methods (some of which we shall study in the next 
sections) but one of the most elegant and simple to understand 
is  "recursive descent parsing".  This was invented by Tony Hoare 
(Oxford).  his idea was to make the parsing program correspond 
almost precisely to the grammar.

We show the process of developing a recursive descent 
parser for the language presented by the grammar:


   expr -> term rest            P_1
                                
   rest -> ADD term rest        P_2
         | SUB term rest        P_3
         | \epsilon             P_4
                                
   term -> NUM                  P_5

Before proceeding any further please recall that this is not the only 
presentation of this grammar which is possible.  Here is another 
presentation which is actually conceptually simpler:

   expr -> expr ADD term
         | expr SUB term
         | term

   term -> NUM

Unfortunately, while this presentation is conceptually simpler, it is 
ambiguous and so not amenable to the sort of implementation we wish to 
discuss.  Therefore I have chosen a grammar which is in a convenient 
(but conceptually less clear!) form.  The main topic in the next 
section concerns how one transformations grammars into a 
convenient form for parsing: we shall go into this in some 
considerable detail as you will need to be able to do these 
transformations.  Basically they involve transforming the grammar 
to eliminate left recursion and then checking that it is LL(1). 
It should be mentioned that all these techniques start with the 
assumption that the grammar is essentially unambiguous (precedence 
parsing is the one exception).

When presented with a grammar one should always try to write down some 
strings in the language to get a better idea of its intent.  For this 
grammar a typical string is:

              1 + 42 - 6 + 10

where we assume the lexical analysis stage does the following 
conversion into tokens:

          lexeme =====> token
          -------------------
               + =====> ADD
               - =====> SUB
              42 =====> NUM


For this grammar it is amazingly simple to write a recursive descent 
parser: here it is beside the grammar in Haskell:


   expr -> term rest            ||     expr L = rest (term L)
                                ||
   rest -> ADD term rest        ||     rest (ADD:L) = rest (term L)
         | SUB term rest        ||     rest (SUB:L) = rest (term L)
         | \epsilon             ||     rest L = L
                                ||
   term -> NUM                  ||     term (NUM:L) = L

This is a direct translation into a recursive descent parser.
(Do not forget such a translation is not possible for an arbitrary 
grammar but is for these special grammars, namely LL(k) grammars.)

Here is a trace of what this code does:

            "1+42-6+10"

                 |
                 |                               LEXICAL ANALYSIS
                 v

   [NUM(1),ADD,NUM(42),SUB,NUM(6),ADD,NUM(10)]
   |      |           
   |      |   term
   |      v
   |      [ADD,NUM(42),SUB,NUM(6),ADD,NUM(10)]
   |      |   |
   |      |   |   rest_1
   |      |   v
   |      |   [NUM(42),SUB,NUM(6),ADD,NUM(10)]
   |      |   |       |
   |      |   |       |   term
   |      |   |       v
   |      |   |       [SUB,NUM(6),ADD,NUM(10)]
   |      |   |       |   |
   |      |   |       |   |   rest_2              RECURSIVE DESCENT PARSING
   |      |   |       |   v
   |      |   |       |   [NUM(6),ADD,NUM(10)]
   |      |   |       |   |      |
   |      |   |       |   |      |   term
   |      |   |       |   |      v
   |      |   |       |   |      [ADD,NUM(10)]
   |      |   |       |   |      |   |
   |      |   |       |   |      |   |   rest_1
   |      |   |       |   |      |   v
   |      |   |       |   |      |   [NUM(10)]
   |      |   |       |   |      |   |       |
   |      |   |       |   |      |   |       |   term
   |      |   |       |   |      |   |       v
   |      |   |       |   |      |   |       []
   |      |   |       |   |      |   |       ||
   |      |   |       |   |      |   |       ||   rest_3
   v      v   v       v   v      v   v       vv
   [] <-- []<-[] <--- []<-[] <-- []<-[] <--- []
                returns

Here we have attached the original numbers to the NUM token only so 
that one can follow the parse steps better: in practice (as we shall
see) it is actually useful if the lexical analysis stage passes back this 
extra information.  However, for this program we are not actually 
doing this.

The list of tokens is only in the language if the program returns the 
empty list.

(Note: it is also possible that this program fails to match an input 
 in which case an exception will be raised  ...and, of course, the parse will have 
 failed!)

We may also wish to build a syntax tree in the attached file is a Haskell program to 
do this ...

Clearly to make this useful one really needs a little program to tokenize the input.  
Here is such a program which uses the Parsec combinators (actually with each token we pass 
back its position ffor error recovery).  Here is an example of a recursive descent program 
which uses this little lexer.


PREVIEW: When does a recursive descent program work?



Unfortunately a recursive descent program does not work for all grammars.
Essentially they must be LL(k) - which is a fancy way of saying that 
at each stage of the parse one must be able to determine the rule to 
apply by looking ahead no more than k input tokens.  We shall consider 
more carefully the case when k=1. As has been mentioned several times 
now: not every grammar is of this form.

There are two obvious conditions which are necessary:
(a) there must be no "left recursion"
(b) the "first sets" for the productions of each nonterminal must 
    be mutually disjoint.


What does no left recursion mean?

A grammar is "not left recursive" if for each nonterminal any leftmost 
series of expansions terminates.  Thus, in our grammar we have:

expr          rest_1           rest_2          rest_3       term
  |             |               |               |            |
  v             v               v               v            v
term rest     ADD term rest    SUB term rest   \epsilon      NUM
  |
  v
NUM rest

where we take a nonterminal act on it by a production and 
then we keep going by repeatedly expanding the first nonterminal.  
We stop when the first symbol is a terminal or the list produced 
is empty (\epsilon).

Why do we want the grammar not to be left recursive?  If the grammar 
is left recursive our recursive descent parser could get stuck in a 
non-terminating recursion.



Example:

If we simply take our grammar and "reverse" it we obtain 
a left recursive grammar which cannot be used with a recursive 
descent parser (and actually has the same language):

   expr -> rest term            P_1
                                
   rest -> rest term ADD        P_2
         | rest term SUB        P_3
         | \epsilon             P_4
                                
   term -> NUM                  P_5


Consider the leftmost expansion:

rest -> rest term ADD -> rest term SUB term ADD -> ....

gives an infinite leftmost expansion.  Just by looking at the next 
symbol (or next k - for fixed k) we cannot determine how many times 
we need to "pump" the rest production in order to allow for all the
operation tokens in the input string.  This would result in an 
infinite recursion.


What are first sets?



In the above expansions we note that each nonterminal expands to have 
only certain terminals occurring as the first terminal in any derived 
string.  These can be read-off from the above expansions:

        nonterminal      first set
        ----------      ---------
        expr            {NUM}
        rest*           {ADD,SUB}
        term            {NUM} 

I have starred the nonterminal 'rest' as it is "nullable" that is it can 
produce the empty list.

FAQ

Question: Can you calculate first sets for grammars which are left 
          recursive? 

Answer: Yes! First sets can be calculated for all context free grammars.
        Of course, you cannot calculate them using the method of leftmost 
        expansion above.
 
        ... see next section.



What do we want in order to be able to do a recursive descent parse? 

At each nonterminal we must be able to choose which production 
we are going to use thus the possible first sets associated with each 
PRODUCTION of a nonterminal must be disjoint so that the next token 
will determine the production chosen.  Suppose 

                A --> A1 A2 ... An            P_i

is a production then the first set of the production (determined by 
the RHS of the production) is 

         first(A1)    if A1 is not nullable
else     first(A1) \union first(A2)   if A2 is not nullable
else     first(A1) \union first(A2) \union first(A3) if A2 is not nullable
                      etc.

Here "nullable" means that the nonterminal is capable of producing an 
empty string of terminals.

Thus, one requirement (besides being non-left recursive) is that each
production for A must have a first set which is disjoint from the 
first sets of all the other productions of A.
  
Furthermore, it is also clear that at most one production can be 
nullable. For we should only use the null derivation when the next 
token is not in the first set of any production.

WARNING: NULLABLE NONTERMINALS MAKE THINGS MORE COMPLEX!!!!



Question: Do these conditions suffice to characterize grammars 
          which can be parsed by recursive descent and a lookahead of 
          one symbol?
Answer: NO!  These are necessary conditions they are not sufficient.

Recall that the reason for these conditions is that when the parser 
sits at the a nonterminal N it will know that the next token is, say, 
T. Now T can occur in at most one first set (of the productions of A).  
If it does not occur in the first set of a (non-nullable) production we 
know we cannot use that production (as it never produces a T as the next 
token).  However, there is a slightly complication when a 
nullable production is present.  

Clearly we have to use the nullable production when the next symbol 
does not occur in any of the first sets.  However, what tells us 
that we should not use the null production EVEN THOUGH the next symbol 
does occur in one of the first set? 

Consider the following example:

Example:
           s -> a X
           a -> X
              | \epsilon

(The only strings recognized are X and XX.)

By looking one symbol ahead how do we distinguish what we do when the 
first symbol is X?  Do we use a -> \epsilon or a -> X?  Clearly 
we cannot decide this on a one token lookahead!

Thus this is not an LL(1) grammar! But it certainly satisfies the 
two conditions we have listed.



To obtain a sufficient condition we also have to check one other 
aspect, namely that the "follow set" of any nonterminal which 
is nullable is disjoint from its first set.  What this means will 
be discussed in more detail shortly.

Here is some Haskell code using the Parsec combinators which inputs a file with and expression and outputs stack machine instructions. The code illustrates:

how to use the Parsec combinators on tokens and how to sequence Parsec parsers
how to read and write from files
how to write out a tree in postfix notation by modifying the show function ...