CPSC411: COMPILER CONSTRUCTION I
AUTHOR: Robin Cockett
DATE: Jan 17th 2002

Lecture 4: Introduction to parsing

Reading:
• Louden:  Chapter 3

• Dragon:  2.4, 2.5

Recursive descent parsing - add expressions:

Now we shall turn to the problem of producing a parse tree.  There are many methods (some of which we shall study in the next sections) but one of the most elegant and simple to understand is "recursive descent parsing".  This was invented by Tony Hoare (Oxford).  His idea was to make the parsing program correspond almost precisely to the grammar.

We show the process of developing a recursive descent parser for the language presented by the grammar:

   expr -> term rest            P_1

   rest -> ADD term rest        P_2
         | SUB term rest        P_3
         | \epsilon             P_4

   term -> NUM                  P_5

Before proceeding any further please recall that this is not the only possible presentation of this grammar.  Here is another presentation which is actually conceptually simpler:

   expr -> expr ADD term
         | expr SUB term
         | term

   term -> NUM

Unfortunately, while this presentation is conceptually simpler, it is left recursive and so not amenable to the sort of implementation we wish to discuss.  Therefore I have chosen a grammar which is in a convenient (but conceptually less clear!) form.  The main topic of the next section concerns how one transforms grammars into a convenient form for parsing: we shall go into this in some considerable detail as you will need to be able to do these transformations.  Basically they involve transforming the grammar to eliminate left recursion and then checking that it is LL(1).  It should be mentioned that all these techniques start with the assumption that the grammar is essentially unambiguous (precedence parsing is the one exception).

When presented with a grammar one should always try to write down some strings in the language to get a better idea of its intent.
For this grammar a typical string is:

              1 + 42 - 6 + 10

where we assume the lexical analysis stage does the following conversion into tokens:

          lexeme =====> token
          -------------------
               + =====> ADD
               - =====> SUB
              42 =====> NUM

For this grammar it is amazingly simple to write a recursive descent parser: here it is beside the grammar in Haskell:

   expr -> term rest            ||     expr L = rest (term L)
                                ||
   rest -> ADD term rest        ||     rest (ADD:L) = rest (term L)
         | SUB term rest        ||     rest (SUB:L) = rest (term L)
         | \epsilon             ||     rest L = L
                                ||
   term -> NUM                  ||     term (NUM:L) = L

This is a direct translation into a recursive descent parser.  (Do not forget such a translation is not possible for an arbitrary grammar, but it is for these special grammars, namely LL(k) grammars.)

Here is a trace of what this code does:

            "1+42-6+10"
                 |
                 |   LEXICAL ANALYSIS
                 v
   [NUM(1),ADD,NUM(42),SUB,NUM(6),ADD,NUM(10)]
                 |
                 |   term
                 v
   [ADD,NUM(42),SUB,NUM(6),ADD,NUM(10)]
                 |
                 |   rest_1
                 v
   [NUM(42),SUB,NUM(6),ADD,NUM(10)]
                 |
                 |   term
                 v
   [SUB,NUM(6),ADD,NUM(10)]
                 |
                 |   rest_2          RECURSIVE DESCENT PARSING
                 v
   [NUM(6),ADD,NUM(10)]
                 |
                 |   term
                 v
   [ADD,NUM(10)]
                 |
                 |   rest_1
                 v
   [NUM(10)]
                 |
                 |   term
                 v
   []
                 |
                 |   rest_3
                 v
   []   <--- returned

Here we have attached the original numbers to the NUM tokens only so that one can follow the parse steps better: in practice (as we shall see) it is actually useful if the lexical analysis stage passes back this extra information.  However, this program does not actually do so.

The list of tokens is in the language only if the program returns the empty list.  (Note: it is also possible that this program fails to match an input, in which case an exception will be raised ... and, of course, the parse will have failed!)

We may also wish to build a syntax tree: in the attached file is a Haskell program to do this.  Clearly to make this useful one really needs a little program to tokenize the input.  Here is such a program which uses the Parsec combinators (actually with each token we pass back its position for error recovery).  Here is an example of a recursive descent program which uses this little lexer.
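The parser above can be packaged as a self-contained Haskell program.  Here is a minimal sketch (the Token type and the accepts wrapper are my own names, not taken from the lecture's attached files); the Int payload on NUM mirrors the NUM(1), NUM(42), ... annotations in the trace:

```haskell
-- Token type produced by the lexical analysis stage.
data Token = NUM Int | ADD | SUB deriving (Eq, Show)

-- Each function consumes the prefix of the token list its nonterminal
-- derives and returns the remaining tokens.
expr :: [Token] -> [Token]
expr l = rest (term l)

rest :: [Token] -> [Token]
rest (ADD : l) = rest (term l)      -- rest -> ADD term rest
rest (SUB : l) = rest (term l)      -- rest -> SUB term rest
rest l         = l                  -- rest -> \epsilon

term :: [Token] -> [Token]
term (NUM _ : l) = l                -- term -> NUM
term _           = error "parse error: expected NUM"

-- A token list is in the language exactly when nothing is left over.
accepts :: [Token] -> Bool
accepts l = null (expr l)
```

For the example input, accepts [NUM 1, ADD, NUM 42, SUB, NUM 6, ADD, NUM 10] returns True, while a list with leftover tokens (or one that triggers the error case) is rejected.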

PREVIEW: When does a recursive descent program work?

Unfortunately a recursive descent program does not work for all grammars.  Essentially they must be LL(k) - which is a fancy way of saying that at each stage of the parse one must be able to determine the rule to apply by looking ahead no more than k input tokens.  We shall consider more carefully the case when k=1.  As has been mentioned several times now: not every grammar is of this form.  There are two obvious conditions which are necessary:

(a) there must be no "left recursion"

(b) the "first sets" for the productions of each nonterminal must
    be mutually disjoint.

What does no left recursion mean?

A grammar is "not left recursive" if for each nonterminal any leftmost series of expansions terminates.  Thus, in our grammar we have:

expr          rest_1           rest_2          rest_3       term
  |             |               |               |            |
  v             v               v               v            v
term rest     ADD term rest    SUB term rest   \epsilon      NUM
  |
  v
NUM rest

where we take a nonterminal, act on it by a production, and then keep going by repeatedly expanding the first nonterminal.  We stop when the first symbol is a terminal or the list produced is empty (\epsilon).

Why do we want the grammar not to be left recursive?  If the grammar is left recursive our recursive descent parser could get stuck in a non-terminating recursion.

Example:

If we simply take our grammar and "reverse" it we obtain a left recursive grammar which cannot be used with a recursive descent parser (and actually has the same language):

   expr -> rest term            P_1

   rest -> rest term ADD        P_2
         | rest term SUB        P_3
         | \epsilon             P_4

   term -> NUM                  P_5

Consider the leftmost expansion:

   rest -> rest term ADD -> rest term SUB term ADD -> ....

which gives an infinite leftmost expansion.  Just by looking at the next symbol (or the next k, for fixed k) we cannot determine how many times we need to "pump" the rest production in order to allow for all the operation tokens in the input string.  This would result in an infinite recursion.
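To see the problem concretely, here is a sketch (my own illustration, not from the notes) of what a naive translation of the left-recursive rest productions would do.  The translated function must first call itself on the *unchanged* input, so it never consumes a token; a fuel counter is added purely so we can run it and watch the recursion bottom out:

```haskell
data Token = NUM Int | ADD | SUB deriving (Eq, Show)

-- Naive translation of  rest -> rest term ADD | rest term SUB | \epsilon.
-- The very first step recurses on the same input, so without the fuel
-- counter this would loop forever; with it, every call eventually
-- exhausts its fuel and fails, whatever the input.
restL :: Int -> [Token] -> Maybe [Token]
restL 0    _ = Nothing                  -- fuel exhausted: the recursion
                                        -- never consumed any input
restL fuel l =
  case restL (fuel - 1) l of            -- left recursion: same input again!
    Just l' -> case termL l' of
                 Just (ADD : l'') -> Just l''
                 Just (SUB : l'') -> Just l''
                 _                -> Just l    -- fall back to \epsilon
    Nothing -> Nothing

termL :: [Token] -> Maybe [Token]
termL (NUM _ : l) = Just l
termL _           = Nothing
```

However much fuel we grant, restL fails: the inner recursive call always runs out of fuel before anything else can happen, which is exactly the non-termination described above.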

What are first sets?

In the above expansions we note that each nonterminal expands to have only certain terminals occurring as the first terminal in any derived string.  These can be read off from the above expansions:

        nonterminal     first set
        -----------     ---------
        expr            {NUM}
        rest*           {ADD,SUB}
        term            {NUM}

I have starred the nonterminal 'rest' as it is "nullable", that is, it can produce the empty list.

FAQ

Question: Can you calculate first sets for grammars which are left
          recursive?

Answer: Yes!  First sets can be calculated for all context free grammars.
        Of course, you cannot calculate them using the method of leftmost
        expansion above.
        ... see next section.

What do we want in order to be able to do a recursive descent parse?  At each nonterminal we must be able to choose which production we are going to use: thus the first sets associated with each PRODUCTION of a nonterminal must be disjoint, so that the next token will determine the production chosen.  Suppose

                A --> A1 A2 ... An            P_i

is a production; then the first set of the production (determined by the RHS of the production) is

         first(A1)                                    if A1 is not nullable
else     first(A1) \union first(A2)                   if A2 is not nullable
else     first(A1) \union first(A2) \union first(A3)  if A3 is not nullable
                      etc.

Here "nullable" means that the nonterminal is capable of producing an empty string of terminals.

Thus, one requirement (besides being non-left recursive) is that each production for A must have a first set which is disjoint from the first sets of all the other productions of A.  Furthermore, it is also clear that at most one production can be nullable, for we should only use the null derivation when the next token is not in the first set of any production.
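These definitions translate directly into a small Haskell calculation.  The sketch below (the encoding of grammars and all names are mine) computes the nullable nonterminals and the first sets for the expression grammar by iterating to a fixed point, which, as the FAQ above notes, works even for left recursive grammars:

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

type Sym  = String
type Prod = (Sym, [Sym])          -- left-hand side, right-hand side

-- the expression grammar from the start of the lecture
grammar :: [Prod]
grammar =
  [ ("expr", ["term", "rest"])
  , ("rest", ["ADD", "term", "rest"])
  , ("rest", ["SUB", "term", "rest"])
  , ("rest", [])                  -- the \epsilon production
  , ("term", ["NUM"]) ]

nonterminals :: S.Set Sym
nonterminals = S.fromList [lhs | (lhs, _) <- grammar]

isTerminal :: Sym -> Bool
isTerminal s = not (s `S.member` nonterminals)

-- Nullable nonterminals, as a fixed point: a nonterminal is nullable
-- if some production's RHS consists only of nullable symbols.
nullables :: S.Set Sym
nullables = go S.empty
  where
    go ns | ns' == ns = ns
          | otherwise = go ns'
      where ns' = S.fromList
                    [lhs | (lhs, rhs) <- grammar, all (`S.member` ns) rhs]

-- First set of a sentential form: take first(A1), and keep absorbing
-- the next symbol's first set as long as the current symbol is nullable
-- (this is exactly the "else ... \union ..." chain above).
firstOf :: M.Map Sym (S.Set Sym) -> [Sym] -> S.Set Sym
firstOf _  []       = S.empty
firstOf fs (x : xs)
  | isTerminal x           = S.singleton x
  | x `S.member` nullables = get x `S.union` firstOf fs xs
  | otherwise              = get x
  where get y = M.findWithDefault S.empty y fs

-- First sets of all nonterminals, again as a fixed point.
firstSets :: M.Map Sym (S.Set Sym)
firstSets = go (M.fromSet (const S.empty) nonterminals)
  where
    go fs | fs' == fs = fs
          | otherwise = go fs'
      where fs' = M.unionWith S.union
                    (M.fromSet (const S.empty) nonterminals)
                    (M.fromListWith S.union
                       [(lhs, firstOf fs rhs) | (lhs, rhs) <- grammar])
```

Running this reproduces the table given earlier: rest is the only nullable nonterminal, expr and term have first set {NUM}, and rest has first set {ADD, SUB}.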

WARNING: NULLABLE NONTERMINALS MAKE THINGS MORE COMPLEX!!!!

Question: Do these conditions suffice to characterize grammars
          which can be parsed by recursive descent and a lookahead of
          one symbol?

Answer: NO!  These are necessary conditions; they are not sufficient.

Recall that the reason for these conditions is that when the parser sits at a nonterminal N it will know that the next token is, say, T.  Now T can occur in at most one first set (of the productions of N).  If it does not occur in the first set of a (non-nullable) production we know we cannot use that production (as it never produces a T as the next token).  However, there is a slight complication when a nullable production is present.  Clearly we have to use the nullable production when the next symbol does not occur in any of the first sets.  However, what tells us that we should not use the null production EVEN THOUGH the next symbol does occur in one of the first sets?  Consider the following example:

Example:

   s -> a X

   a -> X
      | \epsilon

(The only strings recognized are X and XX.)

By looking one symbol ahead how do we distinguish what to do when the first symbol is X?  Do we use a -> \epsilon or a -> X?  Clearly we cannot decide this on a one token lookahead!  Thus this is not an LL(1) grammar!  But it certainly satisfies the two conditions we have listed.
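One can see the failure concretely by writing out the only two deterministic one-token-lookahead choices for a (a sketch with my own names, not from the notes): commit to a -> X whenever the lookahead is X, or always commit to a -> \epsilon.  Each choice accepts one of the two strings in the language and rejects the other:

```haskell
-- The language of  s -> a X ; a -> X | \epsilon  is {X, XX}.
data Tok = X deriving (Eq, Show)

-- choice 1: whenever the lookahead is X, commit to  a -> X
a1 :: [Tok] -> [Tok]
a1 (X : l) = l
a1 l       = l              -- \epsilon (lookahead is not X)

-- choice 2: always commit to  a -> \epsilon
a2 :: [Tok] -> [Tok]
a2 l = l

-- s -> a X, parameterised by the choice made for a
s :: ([Tok] -> [Tok]) -> [Tok] -> Maybe [Tok]
s a l = case a l of
          X : l' -> Just l'
          _      -> Nothing

accepts :: ([Tok] -> [Tok]) -> [Tok] -> Bool
accepts a l = s a l == Just []
```

With choice 1 the parser accepts XX but rejects X (a eats the only X, leaving nothing for s to match); with choice 2 it accepts X but rejects XX.  No fixed one-token-lookahead choice handles both.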

To obtain a sufficient condition we also have to check one other aspect, namely that the "follow set" of any nonterminal which is nullable is disjoint from its first set.  What this means will be discussed in more detail shortly.

Here is some Haskell code using the Parsec combinators which inputs a file with an expression and outputs stack machine instructions.  The code illustrates:
• how to use the Parsec combinators on tokens and how to sequence Parsec parsers
• how to read and write from files
• how to write out a tree in postfix notation by modifying the show function ...