CPSC411 -- Compiler Construction I:
LL(1) grammars and predictive top-down parsing
Lecture 5
Author: Robin Cockett
Date: 21 Jan 2002
Recognizing LL(1):
Here are two properties we know must be true of a grammar if it is to be
LL(1):
- the grammar must not be left recursive
- the rule which should be chosen when developing a nonterminal must
be determined by that nonterminal and the (at most) next token on the input.
Checking that a grammar is not left-recursive is relatively simple and has
already been discussed. It is the second condition that is somewhat more
tricky. Suppose we are standing at a non-terminal "a" with which
are associated a number of productions
a -> \alpha_1
| ...
| \alpha_n
and suppose we have a next input token T. How do we determine which
of the productions must be chosen? Essentially what we must do is to determine
for each production RHS, \alpha_i, what the first tokens can be: this set
is called the first set of \alpha_i and is written First(\alpha_i).
First(\alpha_i) = { A | \alpha_i ==> A \beta }
Clearly a necessary condition is that:
Condition A:
For the productions associated with each nonterminal x
x -> \alpha_1
pr(r_1)
| ...
| \alpha_n
pr(r_n)
First(\alpha_i) \intersection First(\alpha_j) is empty whenever i
=\= j. That is the bodies of the productions must have disjoint first
sets.
|
This is because if two RHS for x have a first token in common we will not
be able to choose between the two productions when that token is the next
input symbol.
First sets of sentential forms:
It turns out that first sets are relatively easy to calculate. We
say that a nonterminal x is nullable if the empty sequence can be
derived from it. There are then various cases:
- The first set of the empty sequence is empty
- The first set of a sequence starting with a terminal is
the set containing just that terminal.
- The first set of a sequence starting with a nonterminal
is the first set of that nonterminal together with, when the
nonterminal is nullable, the first set of the rest of the sequence.
This last condition can be expressed by:
first(x \alpha) = first(x)
if x is not nullable
= first(x) \union first(\alpha) if x is nullable
This means that if we can determine which nonterminals are nullable and
what the first set of each nonterminal is we are done.
Nullable nonterminals:
The set of nullable nonterminals are the
least set such that:
- every nonterminal with a null production is in the set,
- every nonterminal with a production consisting of only
nullable nonterminals is in.
To determine the nullable nonterminals we can use a marking algorithm.
We successively mark nullable nonterminals: first we mark those which have
an empty production, then we mark those which have productions whose RHS
consist of only nullable nonterminals. We keep marking until no new nullable
nonterminals are marked.
At this stage it is worth noting another necessary condition for a grammar
to be LL(1):
Condition B:
For each nonterminal x and the production set associated with the nonterminal
x -> \alpha_1
pr(r_1)
|...
|\alpha_n
pr(r_n)
at most one \alpha_i is nullable.
|
This is because whenever one rule could be fired the other could equally
well serve. Thus, we would have more than one entry in that slot in the
table. Unfortunately while the conditions (A) and (B) above are necessary
they are not sufficient.
First set of a nonterminal:
The first set of a nonterminal x is the
least set such that
for each production x -> \alpha we have
First(\alpha) \subset First(x).
To determine the first set of a nonterminal, x:
- First calculate the immediate first set of that nonterminal.
These are the terminals A which appear in the RHS of a production from
x as x -> \alpha A \beta, where \alpha
consists only of nullable nonterminals. N
- We develop a containment relation on the nonterminals
and where the fact that a pair x --> y is in this relation indicates
that the first set of y must be included in that of x. Such an
arrow is present if and only if there is a production:
x -> \alpha y \beta where \alpha consists
only of nullable nonterminals.
- Form the transitive reflexive closure of this graph and
compose this with the immediate first relation. This gives the first set
relation.
Exercise: Calculate the first
sets for the additive expression grammar.
Necessary and sufficient conditions for LL(1):
Conditions (A) and (B) are not sufficient: consider what happens when a non-terminal,
x, is nullable. While the next token T could indicate that we should
choose pr(r_i) = x -> \alpha_i as T is in the first set of only
\alpha_i, it is quite possible that by choosing the nullable
rule there will be a following production whose first set also contains
the token T. If this happens we will not know which production to
choose: r_i will eat the token but the null rule will allow the token
to be eaten by a later production ... which is the correct choice will depend
on the rest of the input (which we are not allowed to inspect!).
This problem is avoided if and only if the follow set of a nullable
non-terminal never contains tokens in common with the first set. A terminal
token T is in the follow set of a nonterminal x if and only if there is a
derivation
start ===> \alpha x B \beta
Fortunately the follow set of a nonterminal can also be calculated: we discuss
how this is done below. This gives a third necessary condition for being
LL(1):
Condition C:
If x is a nullable nonterminal then follow(x) is disjoint from
first(x).
|
In fact it is easy to see that these three conditions are also sufficient
as this last condition forces us to choose the production r_i when
the next token is in First(\alpha_i) and the (unique) nullable production
when it is not in any First(\alpha_i). We have:
THEOREM:
A grammar is LL(1) if and only if conditions
(A), (B), and (C) above are satisfied for all reachable nonterminals.
The reason for restricting to reachable nonterminals is that only these
will appear on top of the stack! Thus strictly speaking only these nonterminals
must satisfy the conditions.
Please note: we have dropped the requirement of not
being left recursive! This is not a mistake!! In fact it follows from
the above (A), (B), and (C) that the grammar is not left recursive ...
although this is definitely not obvious.
Calculating the follow sets:
This means that we must now sort out how to calculate the follow sets! Clearly
if we have a production x -> \alpha y \beta then
every token in First(\beta) must be in the follow set of y.
Furthermore, if \beta is nullable then the follow set of y
must contain the follow set of x. In fact, these observations determine
the calculation of the follow sets.
The follow sets are the least sets such that:
- A production x -> \alpha y \beta
implies follow(y) contains First(\beta)
- For every production x -> \alpha y \beta
in which \beta is nullable follow(y) contains follow(x)
.
The best way to calculate this is as follows:
- For each nonterminal x calculate the immediate
follow set ... that is those tokens which must be included in the
follow sets due to first condition above.
- For each nonterminal x calculate the nonterminals
y whose follow sets must include the follow
set of x due to the second rule above. This we call the propogation
set. Note this involves finding rules x -> \alpha y \beta
such that \beta is nullable.
- Propogate the tokens in the follow set of x to
all the nonterminals in the propogation set of y. Do this repeatedly
until there are no more tokens to propogate.
We say that a nonterminal x is endable in case it is in the
propogation set of the start node. This means that after the nonterminal
has been developed the input can end. The start nonterminal is always
endable. Knowing which nonterminals are endable is important as, recall,
we have to fill in the action table for when the string ends (indicated by
$).
One way to determine the endable nonterminals is to add the dummy symbol
$ to the follow set of the start symbol. In this way the endable nonterminals
will all be recognized as they will be precisely those which have the $ in
their follow sets.
Recall that an LL(1) parser needs an action table which says, for each nonterminal
token pair, what "push" action should be taken. This should be deterministic,
which means that at each entry we need to have at most one possible action.
Consider the problem for an arbitrary non-left-recursive grammar and suppose
we have calculated the first sets and the follow sets. What action(s) should
we associate with (x,T) an non-terminal token pair?
- Suppose x -> \alpha_1 = pr(r_1)| ...| x -> \alpha_n=pr(r_n)
are the productions associated with x, then we must include
as a possible action push(r_i) for any \alpha_i with T
in First(\alpha_i).
- If T \in follow(x) we must include as an
action push(r_i) whenever \alpha_i is nullable.
What actions should we associate with the pair (x,$)? If
\alpha_i is nullable and x is endable then we must include push(r_i)
.
That we note is that conditions (A), (B), and (C) above imply that there
is at most one action which can be placed at each entry of the table. Note
that if there is no rhs push the parse will fail so we can report an error.
Thus, we actually also now have a way of constructing the action table
...
Exercises:
For all the grammars below calculate the first and follow sets, determine
which nonterminals are nullable and which are endable.
- Show
s -> ( s ) s
[r_1]
|
[r_2]
|
is LL(1).
- Show that
expr -> term rest
[r_1]
rest -> ADD term rest [r_2]
| SUB term rest
[r_3]
|
[r_4]
term -> NUM
[r_5]
|
is LL(1).
- Is the following LL(1)?
expr -> term + expr
[r_1]
| term - expr
[r_2]
| term
[r_3]
term -> factor * term
[r_4]
| factor / term
[r_5]
| factor
[r_6]
factor -> NUM
[r_7]
|
Where the terminals are { +, -, /, NUM}.
- Show that the following is not LL(1):
expr -> expr + expr
| term
term -> term * term
| NUM
|
- (Dangling else) Show that the following grammar is
not LL(1) ... indeed that it is ambiguous.
stmt -> if_stmt
| OTHER
if_stmt -> IF expr THEN stmt
| IF expr THEN
stmt ELSE stmt
expr -> TRUE
| FALSE
|
terminals = { OTHER, IF, THEN, ELSE, TRUE, FALSE}, start = stmt
.
- Is this grammar LL(1)?
prog -> stmt
stmt -> PRINT expr
expr -> term moreterms
term -> factor morefactors
| SUB term
factor -> LPAR expr RPAR
| NUM
morefactors -> MUL factor morefactors
| DIV factor morefactors
|
moreterms -> ADD term moreterms
| SUB term moreterms
|
|