CPSC411 -- Compiler Construction I:

LL(1) grammars and predictive top-down parsing
Lecture 5

Author: Robin Cockett
Date: 21 Jan 2002

Recognizing LL(1):

Here are two properties we know must be true of a grammar if it is to be LL(1):

the grammar must not be left recursive
the rule which should be chosen when developing a nonterminal must be determined by that nonterminal and the (at most) next token on the input.

Checking that a grammar is not left-recursive is relatively simple and has already been discussed. It is the second condition that is somewhat more tricky. Suppose we are standing at a non-terminal "a" with which are associated a number of productions

a -> \alpha_1
| ...
| \alpha_n

and suppose we have a next input token T. How do we determine which of the productions must be chosen? Essentially what we must do is to determine for each production RHS, \alpha_i, what the first tokens can be: this set is called the first set of \alpha_i and is written First(\alpha_i).

First(\alpha_i) = { A | \alpha_i ==> A \beta }

Clearly a necessary condition is that:

Condition A:
For the productions associated with each nonterminal x

x -> \alpha_1 pr(r_1)
| ...
| \alpha_n pr(r_n)

First(\alpha_i) \intersection First(\alpha_j) is empty whenever i =\= j. That is the bodies of the productions must have disjoint first sets.

This is because if two RHS for x have a first token in common we will not be able to choose between the two productions when that token is the next input symbol.

First sets of sentential forms:

It turns out that first sets are relatively easy to calculate. We say that a nonterminal x is nullable if the empty sequence can be derived from it. There are then various cases:

The first set of the empty sequence is empty
The first set of a sequence starting with a terminal is the set containing just that terminal.
The first set of a sequence starting with a nonterminal is the first set of that nonterminal together with, when the nonterminal is nullable, the first set of the rest of the sequence.

This last condition can be expressed by:

first(x \alpha) = first(x) if x is not nullable
= first(x) \union first(\alpha) if x is nullable

This means that if we can determine which nonterminals are nullable and what the first set of each nonterminal is we are done.

Nullable nonterminals:

The set of nullable nonterminals are the least set such that:

every nonterminal with a null production is in the set,
every nonterminal with a production consisting of only nullable nonterminals is in.

To determine the nullable nonterminals we can use a marking algorithm. We successively mark nullable nonterminals: first we mark those which have an empty production, then we mark those which have productions whose RHS consist of only nullable nonterminals. We keep marking until no new nullable nonterminals are marked.

At this stage it is worth noting another necessary condition for a grammar to be LL(1):

Condition B:
For each nonterminal x and the production set associated with the nonterminal

x -> \alpha_1 pr(r_1)
|...
|\alpha_n pr(r_n)

at most one \alpha_i is nullable.

This is because whenever one rule could be fired the other could equally well serve. Thus, we would have more than one entry in that slot in the table. Unfortunately while the conditions (A) and (B) above are necessary they are not sufficient.

First set of a nonterminal:

The first set of a nonterminal x is the least set such that

for each production x -> \alpha we have First(\alpha) \subset First(x).

To determine the first set of a nonterminal, x:

First calculate the immediate first set of that nonterminal. These are the terminals A which appear in the RHS of a production from x as x -> \alpha A \beta, where \alpha consists only of nullable nonterminals. N
We develop a containment relation on the nonterminals and where the fact that a pair x --> y is in this relation indicates that the first set of y must be included in that of x. Such an arrow is present if and only if there is a production: x -> \alpha y \beta where \alpha consists only of nullable nonterminals.
Form the transitive reflexive closure of this graph and compose this with the immediate first relation. This gives the first set relation.

Exercise: Calculate the first sets for the additive expression grammar.

Necessary and sufficient conditions for LL(1):

Conditions (A) and (B) are not sufficient: consider what happens when a non-terminal, x, is nullable. While the next token T could indicate that we should choose pr(r_i) = x -> \alpha_i as T is in the first set of only \alpha_i, it is quite possible that by choosing the nullable rule there will be a following production whose first set also contains the token T. If this happens we will not know which production to choose: r_i will eat the token but the null rule will allow the token to be eaten by a later production ... which is the correct choice will depend on the rest of the input (which we are not allowed to inspect!).

This problem is avoided if and only if the follow set of a nullable non-terminal never contains tokens in common with the first set. A terminal token T is in the follow set of a nonterminal x if and only if there is a derivation

start ===> \alpha x B \beta

Fortunately the follow set of a nonterminal can also be calculated: we discuss how this is done below. This gives a third necessary condition for being LL(1):

Condition C:
If x is a nullable nonterminal then follow(x) is disjoint from first(x).

In fact it is easy to see that these three conditions are also sufficient as this last condition forces us to choose the production r_i when the next token is in First(\alpha_i) and the (unique) nullable production when it is not in any First(\alpha_i). We have:

THEOREM:

A grammar is LL(1) if and only if conditions (A), (B), and (C) above are satisfied for all reachable nonterminals.

The reason for restricting to reachable nonterminals is that only these will appear on top of the stack! Thus strictly speaking only these nonterminals must satisfy the conditions.

Please note: we have dropped the requirement of not being left recursive! This is not a mistake!! In fact it follows from the above (A), (B), and (C) that the grammar is not left recursive ... although this is definitely not obvious.

Calculating the follow sets:

This means that we must now sort out how to calculate the follow sets! Clearly if we have a production x -> \alpha y \beta then every token in First(\beta) must be in the follow set of y. Furthermore, if \beta is nullable then the follow set of y must contain the follow set of x. In fact, these observations determine the calculation of the follow sets.

The follow sets are the least sets such that:

A production x -> \alpha y \beta implies follow(y) contains First(\beta)
For every production x -> \alpha y \beta in which \beta is nullable follow(y) contains follow(x) .

The best way to calculate this is as follows:

For each nonterminal x calculate the immediate follow set ... that is those tokens which must be included in the follow sets due to first condition above.
For each nonterminal x calculate the nonterminals y whose follow sets must include the follow set of x due to the second rule above. This we call the propogation set. Note this involves finding rules x -> \alpha y \beta such that \beta is nullable.
Propogate the tokens in the follow set of x to all the nonterminals in the propogation set of y. Do this repeatedly until there are no more tokens to propogate.

We say that a nonterminal x is endable in case it is in the propogation set of the start node. This means that after the nonterminal has been developed the input can end. The start nonterminal is always endable. Knowing which nonterminals are endable is important as, recall, we have to fill in the action table for when the string ends (indicated by $).

One way to determine the endable nonterminals is to add the dummy symbol $ to the follow set of the start symbol. In this way the endable nonterminals will all be recognized as they will be precisely those which have the $ in their follow sets.
Recall that an LL(1) parser needs an action table which says, for each nonterminal token pair, what "push" action should be taken. This should be deterministic, which means that at each entry we need to have at most one possible action. Consider the problem for an arbitrary non-left-recursive grammar and suppose we have calculated the first sets and the follow sets. What action(s) should we associate with (x,T) an non-terminal token pair?

Suppose x -> \alpha_1 = pr(r_1)| ...| x -> \alpha_n=pr(r_n) are the productions associated with x, then we must include as a possible action push(r_i) for any \alpha_i with T in First(\alpha_i).
If T \in follow(x) we must include as an action push(r_i) whenever \alpha_i is nullable.

What actions should we associate with the pair (x,$)? If \alpha_i is nullable and x is endable then we must include push(r_i) .

That we note is that conditions (A), (B), and (C) above imply that there is at most one action which can be placed at each entry of the table. Note that if there is no rhs push the parse will fail so we can report an error. Thus, we actually also now have a way of constructing the action table ...

Exercises:
For all the grammars below calculate the first and follow sets, determine which nonterminals are nullable and which are endable.

Show

s -> ( s ) s [r_1]
| [r_2]

is LL(1).

Show that

expr -> term rest [r_1]
rest -> ADD term rest [r_2]
| SUB term rest [r_3]
| [r_4]
term -> NUM [r_5]

is LL(1).

Is the following LL(1)?

expr -> term + expr [r_1]
| term - expr    [r_2]
| term    [r_3]
term -> factor * term [r_4]
   | factor / term [r_5]
   | factor    [r_6]
factor -> NUM    [r_7]

Where the terminals are { +, -, /, NUM}.

Show that the following is not LL(1):

expr -> expr + expr
| term
term -> term * term
| NUM

(Dangling else) Show that the following grammar is not LL(1) ... indeed that it is ambiguous.

stmt -> if_stmt
| OTHER
if_stmt -> IF expr THEN stmt
| IF expr THEN stmt ELSE stmt
expr -> TRUE
| FALSE

terminals = { OTHER, IF, THEN, ELSE, TRUE, FALSE}, start = stmt .

Is this grammar LL(1)?