CPSC411 -- Compiler Construction I:

LL(1) grammars and predictive top-down parsing
Lecture 4

Author: Robin Cockett
Date: 21 Jan 2002



LL(1) grammars and predictive top-down parsing:



Being LL(1) is a property of a grammar, not of the language the grammar recognizes.  In other words, two grammars may recognize the same language while one is LL(1) and the other is not!  Thus, being LL(1) is a property of the presentation as a grammar, not of the language being described.

An LL(1) grammar is one which is suitable for recursive descent parsing.  That is, at each step of the parse the rule which must be chosen next is uniquely determined by the current nonterminal and the next token (if there is one).  While it is of great benefit to have a recursive descent parser for an LL(1) grammar, this is not the usual theoretical description of an LL(1) grammar ... nor, in fact, is it the standard way to write a parser for an LL(1) grammar.  The more direct implementation uses a pushdown automaton.  We illustrate how this works on two grammars (below) and discuss how attributes can be added.

We should mention that a pushdown automaton is a conceptually lower level model of the parsing process (when compared to recursive descent parsing) and that this is reflected in the code.  A pushdown automaton implementation will be more efficient than a recursive descent implementation, although in practice there is not very much difference!  The point of the pushdown automaton approach to parsing is that it indicates quite clearly what properties a grammar must have in order to allow a deterministic parse.  In fact, we must be able to build a deterministic action table which provides at most one possible action (i.e. which production to use) at each step of the parse, given the current nonterminal and the next unprocessed input token (if one allows the next k tokens, the grammar would be LL(k)).

An important side effect of showing that a grammar is LL(k) is that this also implies that it is unambiguous.



Context free grammars (again):


A context free grammar G = (T,N,P,pr,start) consists of

  1.    A finite set T of terminal symbols (or tokens)
  2.    A finite set N of nonterminal symbols
  3.    A finite set P of production rules with a map pr: P --->  N * list(N \union T) which associates to each production rule a nonterminal head and a body consisting of a (possibly empty) string of symbols from the disjoint union of N and T.
  4.    A start symbol  start chosen from N, the set of nonterminals.
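
To fix ideas, here is one possible rendering of this definition as Haskell data.  This is a sketch only: the type and field names are my own choice, and grammar (1) from the list below is used as the concrete instance.

-- A minimal Haskell sketch of the definition G = (T, N, P, pr, start);
-- the type and field names are not fixed by the notes.
data Symbol t n = Term t | NonTerm n deriving (Eq, Show)

data Grammar t n p = Grammar
  { terminals    :: [t]
  , nonterminals :: [n]
  , prods        :: [p]
  , pr           :: p -> (n, [Symbol t n])  -- head and body of each production
  , start        :: n
  }

-- Grammar (1) below, encoded with this type:
data Tok = NUM | PLUS | TIMES deriving (Eq, Show)
data NT  = Expr               deriving (Eq, Show)
data Pr  = P1 | P2 | P3       deriving (Eq, Show)

grammar1 :: Grammar Tok NT Pr
grammar1 = Grammar
  { terminals    = [NUM, PLUS, TIMES]
  , nonterminals = [Expr]
  , prods        = [P1, P2, P3]
  , pr           = \p -> case p of
      P1 -> (Expr, [NonTerm Expr, Term PLUS,  NonTerm Expr])  -- expr -> expr + expr
      P2 -> (Expr, [NonTerm Expr, Term TIMES, NonTerm Expr])  -- expr -> expr * expr
      P3 -> (Expr, [Term NUM])                                -- expr -> NUM
  , start        = Expr
  }

main :: IO ()
main = mapM_ (print . pr grammar1) (prods grammar1)
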
Here are three examples of grammars which recognize exactly the same language.  However, the first is not LL(1): indeed, it is ambiguous.  The second grammar is unambiguous, but it is not LL(1).  The third grammar is both LL(1) and unambiguous.

(1)      T = { NUM, +, * }
         N = { expr }
         P = { p_1, p_2, p_3 }
              expr -> expr + expr               = pr(p_1)
                   |  expr * expr               = pr(p_2)
                   |  NUM                       = pr(p_3)
         start = expr

(2)      T = { NUM, +, * }
         N = { expr, factor, term }
         P = { e_1, e_2, t_1, t_2, f_1 }
              expr -> expr + term               = pr(e_1)
                   |  term                      = pr(e_2)
              term -> term * factor             = pr(t_1)
                   |  factor                    = pr(t_2)
              factor -> NUM                     = pr(f_1)
         start = expr
(3)      T = { NUM, +, * }
         N = { expr, term, rest_expr, rest_term, factor }
         P = { k_1, k_2, k_3, k_4, k_5, k_6, k_7 }
              expr -> term rest_expr            = pr(k_1)
              rest_expr -> + expr               = pr(k_2)
                        |                       = pr(k_3)
              term -> factor rest_term          = pr(k_4)
              rest_term -> * term               = pr(k_5)
                        |                       = pr(k_6)
              factor -> NUM                     = pr(k_7)
         start = expr
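
As claimed earlier, an LL(1) grammar such as (3) is suitable for recursive descent parsing: each nonterminal becomes a function which decides which production to use by looking only at the next token.  Here is a minimal Haskell sketch of a recogniser for grammar (3) (the token and function names are my own); each function consumes the tokens its nonterminal derives and returns the rest of the input, or Nothing if it gets stuck.

-- A recursive descent recogniser for grammar (3); names are illustrative.
data Token = TNum | TPlus | TTimes deriving (Eq, Show)

expr, term, factor, restExpr, restTerm :: [Token] -> Maybe [Token]

expr ts = term ts >>= restExpr                -- k_1: expr -> term rest_expr

restExpr (TPlus : ts) = expr ts               -- k_2: rest_expr -> + expr
restExpr ts           = Just ts               -- k_3: rest_expr -> (empty)

term ts = factor ts >>= restTerm              -- k_4: term -> factor rest_term

restTerm (TTimes : ts) = term ts              -- k_5: rest_term -> * term
restTerm ts            = Just ts              -- k_6: rest_term -> (empty)

factor (TNum : ts) = Just ts                  -- k_7: factor -> NUM
factor _           = Nothing

-- A string is accepted when "expr" consumes all of it.
accepts :: [Token] -> Bool
accepts ts = expr ts == Just []

main :: IO ()
main = print (accepts [TNum, TTimes, TNum, TPlus, TNum])   -- NUM * NUM + NUM: True
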
 

Our purpose is to develop some tools, descriptive and analytic, for discussing grammars.  To this end we need to introduce some basic definitions.


Reachable nonterminals:

A nonterminal of a grammar G is reachable if there is a derivation (that is, a series of production steps) from the start symbol which has the nonterminal in the derived string.


Example: In the grammar (2) above "factor" is a reachable nonterminal as we have the following derivation:

start = expr
      =>                    {e_1}
expr + term
      =>                    {t_2}
expr + factor



The reachable nonterminals can be determined most simply by forming a graph whose nodes are the nonterminals, with an edge from n to m whenever m occurs in the body of a production whose head is n.  Reachability in this graph (from the start nonterminal) is then equivalent to reachability in the grammar.  This means that the reachability of nonterminals is easily calculated.


Example:
In the grammar (2) above the reachability graph has the following edges:
start = expr

expr  ->  expr
expr  ->  term
term -> term
term -> factor

The reachable nonterminals are those which are reachable from "expr" in the usual graph theoretic sense: there is a path from "expr" to them.
Clearly, we are not particularly interested in unreachable nonterminals!  It is reasonable, therefore, to remove any productions which employ unreachable nonterminals.
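
This graph construction is easy to code.  The following Haskell sketch records each production only by its head and the nonterminals of its body (all that reachability needs; the representation is my own) and computes the reachable set for grammar (2) by repeatedly adding the body nonterminals of productions whose heads are already known to be reachable:

import qualified Data.Set as Set

-- A production, recorded as its head and the nonterminals of its body
-- (terminals are irrelevant for reachability).
type Prod = (String, [String])

-- Grammar (2): expr -> expr + term | term,  term -> term * factor | factor,
-- factor -> NUM
grammar2 :: [Prod]
grammar2 = [ ("expr",   ["expr", "term"])
           , ("expr",   ["term"])
           , ("term",   ["term", "factor"])
           , ("term",   ["factor"])
           , ("factor", []) ]

-- The reachable nonterminals: start with the start symbol and close under
-- "if the head is reachable then so is every nonterminal in the body".
reachable :: String -> [Prod] -> Set.Set String
reachable startSym ps = go (Set.singleton startSym)
  where
    go seen
      | seen' == seen = seen
      | otherwise     = go seen'
      where
        seen' = Set.union seen $ Set.fromList
                  [ m | (h, body) <- ps, h `Set.member` seen, m <- body ]

main :: IO ()
main = print (reachable "expr" grammar2)
-- fromList ["expr","factor","term"]
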


Realizable nonterminals:


A nonterminal is realizable if there is a derivation from it to a string of terminals.  In other words, the grammar with that nonterminal acting as the start symbol has a nonempty language.  Note that a language containing only the empty string is nonempty (as the empty string is in the language).


Example: In the following grammar with one nonterminal x:
    x -> A B x
       | C x A
The nonterminal x is not realizable!

To determine whether a nonterminal is realizable we may mark the nonterminals using the following procedure:
  1. Mark all nonterminals which have a production whose RHS is a string of terminals (only) - this includes nonterminals with an empty production.
  2. Now repeatedly mark nonterminals with a production whose RHS consists of marked nonterminals and terminals (only).
  3. Repeat (2) until no new nonterminals are marked: all the marked nonterminals are realizable.
Again, clearly we are not particularly interested in nonterminals which cannot be realized.
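
The marking procedure is a fixed point computation and can be coded directly.  A small Haskell sketch (the representation is my own), run on the grammar x -> A B x | C x A from the example above, where nothing ever gets marked:

import qualified Data.Set as Set

-- Symbols are terminals or nonterminals; a production is a head and a body.
data Sym  = T String | N String deriving (Eq, Ord, Show)
type Prod = (String, [Sym])

-- Steps (1)-(3): repeatedly mark any nonterminal having a production whose
-- body contains only terminals and already-marked nonterminals.
realizable :: [Prod] -> Set.Set String
realizable ps = go Set.empty
  where
    go marked
      | marked' == marked = marked
      | otherwise         = go marked'
      where
        ok (T _) = True
        ok (N n) = n `Set.member` marked
        marked'  = Set.union marked $
                     Set.fromList [ h | (h, body) <- ps, all ok body ]

-- The example grammar:  x -> A B x | C x A
example :: [Prod]
example = [ ("x", [T "A", T "B", N "x"])
          , ("x", [T "C", N "x", T "A"]) ]

main :: IO ()
main = print (realizable example)   -- fromList [] : x is not realizable
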



Parsing using a nondeterministic pushdown automaton:


The standard approach to parsing a context free grammar is to use a nondeterministic pushdown automaton. This is a machine consisting of a stack which can hold terminals and nonterminals and a read head on the input stream. The stack initially holds the start symbol. The machine operates as follows:

  1. When a nonterminal is the top element of the stack we may replace the nonterminal by pushing the right-hand side of a production whose head is that nonterminal onto the stack.
  2. When a terminal is the top element of the stack and the next input token is the same terminal we may pop the stack and move the read head one step to the right.
  3. The parse is complete when the stack is empty and the read head is sitting just to the right of the input.
Here is an example parse for the grammar (1) above:
 
 

STACK                 | INPUT                   | ACTION
expr                  | NUM * NUM + NUM * NUM   | push(p_1)
expr + expr           | NUM * NUM + NUM * NUM   | push(p_2)
expr * expr + expr    | NUM * NUM + NUM * NUM   | push(p_3)
NUM * expr + expr     | NUM * NUM + NUM * NUM   | pop
* expr + expr         | * NUM + NUM * NUM       | pop
expr + expr           | NUM + NUM * NUM         | push(p_3)
NUM + expr            | NUM + NUM * NUM         | pop
+ expr                | + NUM * NUM             | pop
expr                  | NUM * NUM               | push(p_2)
expr * expr           | NUM * NUM               | push(p_3)
NUM * expr            | NUM * NUM               | pop
* expr                | * NUM                   | pop
expr                  | NUM                     | push(p_3)
NUM                   | NUM                     | pop


STOP

Note this is not deterministic as there are a number of possible rules we could have tried.  Some choices will result in the parse getting stuck at step (2):
 

STACK                 | INPUT                   | ACTION
expr                  | NUM * NUM + NUM * NUM   | push(p_1)
expr + expr           | NUM * NUM + NUM * NUM   | push(p_1)
expr + expr + expr    | NUM * NUM + NUM * NUM   | push(p_3)
NUM + expr + expr     | NUM * NUM + NUM * NUM   | pop
+ expr + expr         | * NUM + NUM * NUM       | !!


FAIL

The idea of an LL(k) grammar is that one can predict the (only possible) next step from the current nonterminal on top of the stack by looking ahead at most k tokens in the input stream.  Furthermore, to be LL(k), there must be only one possible choice of which production to push: any other choice must fail to lead to a parse.  This means that the grammar is unambiguous.

Note that for the above grammar we could have started with either p_1 or p_2 and obtained a successful parse.  This is because the grammar is not only not LL(k) for any k, it is also ambiguous.
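
The nondeterministic machine itself can be simulated by backtracking: from each configuration try every applicable move and ask whether some sequence of choices empties both the stack and the input.  Here is a Haskell sketch for grammar (1) (the representation is my own).  Note the pruning guard: it is sound here because no production of grammar (1) is nullable, so every stack symbol must eventually consume at least one input token; without it, the left recursion of grammar (1) would send the search into an infinite loop.

-- Backtracking simulation of the nondeterministic pushdown machine for
-- grammar (1):  expr -> expr + expr | expr * expr | NUM.
-- Tokens are single characters: 'n' for NUM, '+' and '*'.
data Sym = T Char | E deriving (Eq, Show)   -- E is the nonterminal expr

bodies :: [[Sym]]
bodies = [ [E, T '+', E]     -- p_1
         , [E, T '*', E]     -- p_2
         , [T 'n'] ]         -- p_3

accepts :: [Sym] -> String -> Bool
accepts stack input
  | length stack > length input = False     -- prune: every stack symbol needs a token
accepts []            []        = True      -- stack empty, input consumed: success
accepts (E : stack)   input     =           -- try each production for expr
  any (\body -> accepts (body ++ stack) input) bodies
accepts (T t : stack) (i : is)
  | t == i                      = accepts stack is   -- match a terminal
accepts _             _         = False     -- stuck

main :: IO ()
main = print (accepts [E] "n*n+n*n", accepts [E] "n+*n")   -- (True,False)
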


Action tables:

Our aim is that the machine should be able to determine its next action by using the value of the top nonterminal (on the stack) and the next input token. This could be achieved by building an action table which would indicate what productions to push depending on these values. If we could do this then we would have an efficient parser.

However, to be able to do this there must be at most one possible production which can be pushed at each stage!  For suppose two different productions both lead to successful parses whose next terminal is the same; assuming that everything is realizable and the grammar is unambiguous, there must then be two different input strings, one of which can only be recognized by pushing the first production and the other of which can only be recognized by pushing the second.  This means that we cannot recognize the whole language if we commit to one choice at this point!

We shall therefore concentrate our efforts on forming a table whose entries are all the possible pushable productions. When this table happens to have at most one production in each position the grammar will be LL(1).

For the above grammar (1) this table looks like this:
 


           | NUM               | +   | *   | $
expr       | { p_1, p_2, p_3 } | { } | { } | { }

Notice that the entry at (expr, NUM) allows any of the three productions to be pushed and, indeed, each may actually be required.  The $ stands for the case when there are no more input tokens.

However, now consider the grammar (3) above. The table for this grammar looks like:
 


           | NUM     | +       | *       | $
expr       | { k_1 } | { }     | { }     | { }
rest_expr  | { }     | { k_2 } | { }     | { k_3 }
term       | { k_4 } | { }     | { }     | { }
rest_term  | { }     | { k_6 } | { k_5 } | { k_6 }
factor     | { k_7 } | { }     | { }     | { }

The second row, for "rest_expr", needs a little explanation: why could we not always use k_3 (the empty production) when "+" is not the next symbol?  The answer is as follows: the only symbol which can possibly follow "rest_expr" in a derivation is the end-of-input ($).  So if the null rule k_3 is fired, the next symbol must be the end-of-input.  Similarly, in the "rest_term" row the empty production k_6 appears under both "+" and "$", as these are precisely the symbols which can follow "rest_term" in a derivation.

A grammar is said to be LL(1) precisely when this action table has at most one production in each entry.
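
Since each entry of the table for grammar (3) holds at most one production, the pushdown machine can be driven directly from it: expand the top nonterminal using the entry selected by the lookahead, match terminals against the input, and accept when stack and input are both empty.  A Haskell sketch follows (the representation is my own; Nothing plays the role of the $ column):

import qualified Data.Map as Map

-- A table-driven LL(1) parser for grammar (3).
data Tok = NUM | PLUS | TIMES        deriving (Eq, Ord, Show)
data Sym = T Tok | N String          deriving (Eq, Show)

-- The action table: (nonterminal, lookahead) -> body to push.
-- A lookahead of Nothing is the end-of-input column ($).
table :: Map.Map (String, Maybe Tok) [Sym]
table = Map.fromList
  [ (("expr",      Just NUM),   [N "term", N "rest_expr"])      -- k_1
  , (("rest_expr", Just PLUS),  [T PLUS, N "expr"])             -- k_2
  , (("rest_expr", Nothing),    [])                             -- k_3
  , (("term",      Just NUM),   [N "factor", N "rest_term"])    -- k_4
  , (("rest_term", Just TIMES), [T TIMES, N "term"])            -- k_5
  , (("rest_term", Just PLUS),  [])                             -- k_6
  , (("rest_term", Nothing),    [])                             -- k_6
  , (("factor",    Just NUM),   [T NUM])                        -- k_7
  ]

-- Run the machine: expand nonterminals via the table, match terminals,
-- accept when both the stack and the input are empty.
run :: [Sym] -> [Tok] -> Bool
run []            []       = True
run (T t : stack) (i : is)
  | t == i                 = run stack is
run (N n : stack) input    =
  case Map.lookup (n, look input) table of
    Just body -> run (body ++ stack) input
    Nothing   -> False
  where look []      = Nothing
        look (t : _) = Just t
run _             _        = False

main :: IO ()
main = print (run [N "expr"] [NUM, TIMES, NUM, PLUS, NUM])   -- NUM * NUM + NUM: True
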



Question:    Describe precisely what an LL(k) action table would look like.  Be careful to consider the effect of the end-of-input!

The next problem is to describe exactly how to create an action table from a grammar.  Clearly, it is crucial to be able to calculate the first terminals which can result from firing a rule: these form the first sets.  However, there is also a complication which arises when a nonterminal is nullable (that is, can generate the empty string): in this case we also need to know what can follow the nonterminal in order to calculate the table.  These are called the follow sets.  The notion of a first set can be generalized to an arbitrary sequence of terminals and nonterminals.

Once we have these properties of a grammar in hand, we may say exactly how to calculate the action table: a production p_i with pr(p_i) = n -> \a, that is with head n and body \a, is in the action set at (n, T) in case:

  1. T is in the first set of \a, or
  2. \a is nullable and T is in the follow set of n.
Once the action table is deterministic we have a way of building a parser whose complexity is determined by the size of the parse tree.  Moreover, the size of the parse tree is O(n), where n is the length of the input.
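
To illustrate the rule above, here is a Haskell sketch which assembles the action table for grammar (3).  The first sets, follow sets and nullability are simply written down as given data, since how to compute them is the next topic; all names are my own.

import qualified Data.Set as Set

-- Building the action table for grammar (3) from first/follow/nullable.
type NT   = String
type Trm  = String                        -- "$" marks end of input
type Body = [Either Trm NT]               -- Left = terminal, Right = nonterminal

prods :: [(String, NT, Body)]             -- (rule name, head, body)
prods =
  [ ("k_1", "expr",      [Right "term", Right "rest_expr"])
  , ("k_2", "rest_expr", [Left "+", Right "expr"])
  , ("k_3", "rest_expr", [])
  , ("k_4", "term",      [Right "factor", Right "rest_term"])
  , ("k_5", "rest_term", [Left "*", Right "term"])
  , ("k_6", "rest_term", [])
  , ("k_7", "factor",    [Left "NUM"])
  ]

-- Taken as given for grammar (3) (computing these comes next):
nullable :: NT -> Bool
nullable n = n `elem` ["rest_expr", "rest_term"]

firstN :: NT -> Set.Set Trm
firstN n | n `elem` ["expr", "term", "factor"] = Set.fromList ["NUM"]
         | n == "rest_expr"                    = Set.fromList ["+"]
         | n == "rest_term"                    = Set.fromList ["*"]
         | otherwise                           = Set.empty

followN :: NT -> Set.Set Trm
followN n | n `elem` ["expr", "rest_expr"]     = Set.fromList ["$"]
          | n `elem` ["term", "rest_term"]     = Set.fromList ["+", "$"]
          | n == "factor"                      = Set.fromList ["*", "+", "$"]
          | otherwise                          = Set.empty

-- First set and nullability generalized to a body (a string of symbols).
firstB :: Body -> Set.Set Trm
firstB []               = Set.empty
firstB (Left t  : _)    = Set.singleton t
firstB (Right n : rest) = if nullable n
                          then firstN n `Set.union` firstB rest
                          else firstN n

nullableB :: Body -> Bool
nullableB = all (either (const False) nullable)

-- The rule above: a production with head n and body a sits at (n, t) when
-- t is in first(a), or a is nullable and t is in follow(n).
action :: NT -> Trm -> [String]
action n t = [ name | (name, h, body) <- prods, h == n
                    , t `Set.member` firstB body
                      || (nullableB body && t `Set.member` followN h) ]

main :: IO ()
main = mapM_ print [ (n, t, action n t)
                   | n <- ["expr", "rest_expr", "term", "rest_term", "factor"]
                   , t <- ["NUM", "+", "*", "$"] ]
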