LL(1) grammars and predictive top-down parsing
Lecture 4
Author: Robin Cockett
Date: 21 Jan 2002
Being LL(1) is a property of a grammar, not of the language the grammar recognizes. In other words, two grammars may recognize the same language while one is LL(1) and the other is not! Thus, being LL(1) is a property of the presentation as a grammar, not of the language being described.
An LL(1) grammar is one which is suitable for recursive descent parsing. That is, at each step in the parse, which rule must be chosen next is uniquely determined by the current nonterminal and the next token (if there is one). While it is of great benefit to have a recursive descent parser for an LL(1) grammar, this is not the usual theoretical description of an LL(1) grammar ... nor, in fact, is it the standard way to write a parser for an LL(1) grammar. The more direct implementation uses a pushdown automaton. We illustrate how this works on two grammars (below) and discuss how attributes can be added.
We should mention that a pushdown automaton is a conceptually lower-level model of the parsing process (when compared to recursive descent parsing) and that this is reflected in the code. A pushdown automaton implementation will be more efficient than a recursive descent implementation; however, in practice there will not be so very much difference! The point of the pushdown automaton approach to parsing is that it indicates quite clearly what properties a grammar must have in order to allow a deterministic parse. In fact, we must be able to build a deterministic action table which provides at most one possible action at each step in the parse (i.e. which production to use), given the current nonterminal and the next unprocessed input token (if one allows the next k tokens, the grammar would be LL(k)).
An important side effect of showing that a grammar is LL(k) is that this also implies the grammar is unambiguous.
A context free grammar G = (T,N,P,pr,start) consists of a set T of terminals, a set N of nonterminals, a set P of production names, a function pr assigning to each production name a production n -> \a (with head n in N and body \a a string of terminals and nonterminals), and a start symbol start in N. Here are three grammars for the same language of arithmetic expressions:
(1) T = { NUM, +, * }
    N = { expr }
    P = { p_1, p_2, p_3 }

    expr ----> expr + expr      = pr(p_1)
             | expr * expr      = pr(p_2)
             | NUM              = pr(p_3)

    start = expr
(2) T = { NUM, +, * }
    N = { expr, term, factor }
    P = { e_1, e_2, t_1, t_2, f_1 }

    expr ----> expr + term      = pr(e_1)
             | term             = pr(e_2)
    term ----> term * factor    = pr(t_1)
             | factor           = pr(t_2)
    factor --> NUM              = pr(f_1)

    start = expr
(3) T = { NUM, +, * }
    N = { expr, rest_expr, term, rest_term, factor }
    P = { k_1, k_2, k_3, k_4, k_5, k_6, k_7 }

    expr -----> term rest_expr   = pr(k_1)
    rest_expr -> + expr          = pr(k_2)
              |                  = pr(k_3)
    term -----> factor rest_term = pr(k_4)
    rest_term -> * term          = pr(k_5)
              |                  = pr(k_6)
    factor ---> NUM              = pr(k_7)

    start = expr
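Grammar (3) is the form suited to recursive descent: each nonterminal becomes one procedure. A minimal sketch in Python (the function names and the representation of tokens as a list of strings are our own choices, not part of the lecture):

```python
# Recursive descent parser for grammar (3).
# Tokens are the strings "NUM", "+", "*".  Each parse function consumes
# tokens from the front of a list and returns the unconsumed remainder.

def parse_expr(toks):          # expr -> term rest_expr         (k_1)
    return parse_rest_expr(parse_term(toks))

def parse_rest_expr(toks):     # rest_expr -> + expr | empty    (k_2, k_3)
    if toks and toks[0] == "+":
        return parse_expr(toks[1:])
    return toks                # k_3: the empty production

def parse_term(toks):          # term -> factor rest_term       (k_4)
    return parse_rest_term(parse_factor(toks))

def parse_rest_term(toks):     # rest_term -> * term | empty    (k_5, k_6)
    if toks and toks[0] == "*":
        return parse_term(toks[1:])
    return toks                # k_6: the empty production

def parse_factor(toks):        # factor -> NUM                  (k_7)
    if not toks or toks[0] != "NUM":
        raise SyntaxError("expected NUM")
    return toks[1:]

def accepts(toks):
    """True when the whole token list is an expression."""
    try:
        return parse_expr(toks) == []
    except SyntaxError:
        return False
```

Notice that each function decides what to do by looking only at the next token, which is exactly the LL(1) property of this grammar.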
Our purpose is to develop some tools, descriptive and analytic, for discussing grammars. To this end we need to introduce some basic definitions.
A nonterminal of a grammar G is reachable if there is a derivation (that is, a sequence of productions) from the start symbol which has the nonterminal in the derived string.
For example, in grammar (2):

    start = expr
          => {e_1}  expr + term
          => {t_2}  expr + factor
The reachable nonterminals are those nonterminals which are reachable from "expr" in the usual graph theoretic sense that there is a path from "expr" to the other node. For grammar (2) the edges of this graph are:

    expr -> expr
    expr -> term
    term -> term
    term -> factor
A nonterminal is realizable if there is a derivation from it which produces a string consisting only of terminals. For example, given the productions

    x -> A B x
       | C x A

the nonterminal x is not realizable: every production for x reintroduces x, so no derivation from x can ever reach a string of terminals.
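Both properties can be computed as fixed points. A sketch (the dictionary representation of a grammar used here is an assumption of this sketch, not the lecture's notation):

```python
# Reachable and realizable nonterminals, computed as fixed points.
# A grammar is a dict: nonterminal -> list of production bodies, where
# each body is a list of symbols; symbols not in the dict are terminals.

def reachable(grammar, start):
    """Nonterminals reachable from the start symbol."""
    seen, frontier = {start}, [start]
    while frontier:
        n = frontier.pop()
        for body in grammar.get(n, []):
            for sym in body:
                if sym in grammar and sym not in seen:
                    seen.add(sym)
                    frontier.append(sym)
    return seen

def realizable(grammar):
    """Nonterminals from which some string of terminals is derivable."""
    real, changed = set(), True
    while changed:
        changed = False
        for n, bodies in grammar.items():
            # n becomes realizable when some body consists entirely of
            # terminals and already-realizable nonterminals.
            if n not in real and any(
                    all(s not in grammar or s in real for s in body)
                    for body in bodies):
                real.add(n)
                changed = True
    return real
```

Running `realizable` on the grammar with x -> A B x | C x A returns the empty set, confirming that x is not realizable.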
The standard approach to parsing a context free grammar is to use a nondeterministic pushdown automaton. This is a machine consisting of a stack, which can hold terminals and nonterminals, and a read head on the input stream. The stack initially holds the start symbol. The machine operates as follows: if the top of the stack is a nonterminal, it (nondeterministically) chooses a production for that nonterminal and replaces the nonterminal by the body of that production (a "push"); if the top of the stack is a terminal matching the next input token, it removes both (a "pop"). The parse succeeds when the stack and the input are exhausted together. Here is a successful parse of NUM * NUM + NUM * NUM with grammar (1):
    STACK              | INPUT                 | ACTION
    expr               | NUM * NUM + NUM * NUM | push(p_1)
    expr + expr        | NUM * NUM + NUM * NUM | push(p_2)
    expr * expr + expr | NUM * NUM + NUM * NUM | push(p_3)
    NUM * expr + expr  | NUM * NUM + NUM * NUM | pop
    * expr + expr      | * NUM + NUM * NUM     | pop
    expr + expr        | NUM + NUM * NUM       | push(p_3)
    NUM + expr         | NUM + NUM * NUM       | pop
    + expr             | + NUM * NUM           | pop
    expr               | NUM * NUM             | push(p_2)
    expr * expr        | NUM * NUM             | push(p_3)
    NUM * expr         | NUM * NUM             | pop
    * expr             | * NUM                 | pop
    expr               | NUM                   | push(p_3)
    NUM                | NUM                   | pop
    STOP
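The trace above can be replayed mechanically: a list of production choices resolves the nondeterminism, and the pop steps are then forced. A sketch for grammar (1) (the replay function and its representation of the machine are illustrative, not part of the lecture):

```python
# Replay a pushdown parse for grammar (1): a list of production choices
# drives the "push" steps; a "pop" happens automatically whenever the
# stack top is a terminal matching the next input token.

PRODUCTIONS = {
    "p_1": ["expr", "+", "expr"],
    "p_2": ["expr", "*", "expr"],
    "p_3": ["NUM"],
}

def replay(choices, tokens):
    """True when the given production choices parse the token list."""
    stack, inp, choices = ["expr"], list(tokens), list(choices)
    while stack:
        top = stack[0]
        if top == "expr":                 # nonterminal: push a production
            if not choices:
                return False              # ran out of guesses
            stack = PRODUCTIONS[choices.pop(0)] + stack[1:]
        elif inp and top == inp[0]:       # matching terminal: pop both
            stack, inp = stack[1:], inp[1:]
        else:
            return False                  # stuck: a wrong guess was made
    return inp == [] and choices == []
```

The successful trace corresponds to the choice sequence p_1, p_2, p_3, p_3, p_2, p_3, p_3; the failing trace below corresponds to starting with p_1, p_1, p_3.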
Note this is not deterministic, as there were a number of possible rules we could have tried at each step. Some such choices will result in the parse getting stuck, as here at step (2):
    STACK              | INPUT                 | ACTION
    expr               | NUM * NUM + NUM * NUM | push(p_1)
    expr + expr        | NUM * NUM + NUM * NUM | push(p_1)
    expr + expr + expr | NUM * NUM + NUM * NUM | push(p_3)
    NUM + expr + expr  | NUM * NUM + NUM * NUM | pop
    + expr + expr      | * NUM + NUM * NUM     | !!
    FAIL
The idea of an LL(k) grammar is that one can predict the (only possible) next step from the current nonterminal on top of the stack and by looking ahead in the input stream at most k tokens. Furthermore, to be LL(k), there must be only one possible choice of which production to push: any other choice must fail to lead to a parse. This means that the grammar is unambiguous.
Note that for the above grammar we could have started with either p_1 or p_2 and still obtained a successful parse. This is because the grammar is not merely not LL(k) for any k: it is also ambiguous.
However, to be able to do this when the grammar is unambiguous, there must be at most one possible production which can be pushed at each stage! For suppose there are two different productions which have subsequent derivations whose next terminals are the same. Assuming that everything is realizable, there must then be two different input strings, one of which can only be recognized by pushing the first production and the other of which can only be recognized by pushing the second. This means that we cannot recognize the whole language if we commit to a single choice at this point!
We shall therefore concentrate our efforts on forming a table whose entries are all the possible pushable productions. When this table happens to have at most one production in each position the grammar will be LL(1).
For the above grammar (1) this table looks like this:
              | NUM               | +   | *   | $
    expr      | { p_1, p_2, p_3 } | { } | { } | { }
Notice that the entry for (expr, NUM) allows any of the three productions to be pushed, and indeed each may actually be required. The $ stands for the case when there are no more input tokens.
However, now consider the grammar (3) above. The table for this grammar
looks like:
              | NUM     | +       | *       | $
    expr      | { k_1 } | { }     | { }     | { }
    rest_expr | { }     | { k_2 } | { }     | { k_3 }
    term      | { k_4 } | { }     | { }     | { }
    rest_term | { }     | { k_6 } | { k_5 } | { k_6 }
    factor    | { k_7 } | { }     | { }     | { }

(Note that "rest_term" must take the empty production k_6 on lookahead "+" as well as on $, since a term may be followed by a "+": parsing NUM + NUM requires firing k_6 when "+" is the next token.)
The second row, for "rest_expr", needs a little explanation: why could we not always fire k_3 (the empty production) when "+" is not the next symbol? The answer is as follows: the only symbol possible after "rest_expr" in a derivation is the end-of-input ($). So if the null rule k_3 is fired, the next symbol must be the end-of-input.
A grammar is said to be LL(1) precisely when its action table has at most one production in each entry.
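With such a table in hand, the pushdown machine becomes deterministic: the table tells it which production to push. A sketch of a table-driven parser for grammar (3) (the representation is our own; note the entry at (rest_term, +) must contain k_6, since a term may be followed by "+"):

```python
# Deterministic table-driven LL(1) parse for grammar (3).
# The action table maps (top-of-stack nonterminal, lookahead) to the
# unique production to push; "$" is the end-of-input marker.

BODIES = {
    "k_1": ["term", "rest_expr"],
    "k_2": ["+", "expr"],           "k_3": [],
    "k_4": ["factor", "rest_term"],
    "k_5": ["*", "term"],           "k_6": [],
    "k_7": ["NUM"],
}
NONTERMS = {"expr", "rest_expr", "term", "rest_term", "factor"}
TABLE = {
    ("expr", "NUM"): "k_1",
    ("rest_expr", "+"): "k_2",  ("rest_expr", "$"): "k_3",
    ("term", "NUM"): "k_4",
    ("rest_term", "*"): "k_5",  ("rest_term", "+"): "k_6",
    ("rest_term", "$"): "k_6",
    ("factor", "NUM"): "k_7",
}

def parse(tokens):
    """Return the list of productions fired, or None if the parse fails."""
    stack, inp, fired = ["expr"], list(tokens) + ["$"], []
    while stack:
        top, look = stack[0], inp[0]
        if top in NONTERMS:
            p = TABLE.get((top, look))
            if p is None:
                return None               # empty table entry: reject
            fired.append(p)
            stack = BODIES[p] + stack[1:]
        elif top == look:                 # matching terminal: pop both
            stack, inp = stack[1:], inp[1:]
        else:
            return None
    return fired if inp == ["$"] else None
```

The list of fired productions is a leftmost derivation of the input, which is where attributes would be attached.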
The next problem is to describe exactly how to create an action table from a grammar. Clearly it is crucial to be able to calculate the first terminals which can result from the firing of a rule: these form the first sets. However, there is also a complication which arises when a nonterminal is nullable (that is, can generate the empty string): in this case we need to know what can follow the nonterminal as well in order to calculate these tables. The latter are called follow sets. The notion of a first set generalizes to an arbitrary sequence of terminals and nonterminals.
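All three properties can be computed by fixed-point iteration over the productions. A sketch (the grammar is assumed to be a dict from nonterminal to list of bodies, symbols not in the dict being terminals; this representation is our own):

```python
# Nullable, first and follow sets by fixed-point iteration.
# "$" marks the end of input and seeds the follow set of the start symbol.

def analyze(grammar, start):
    nonterms = set(grammar)
    nullable = set()
    first = {n: set() for n in nonterms}
    follow = {n: set() for n in nonterms}
    follow[start].add("$")

    def first_of(seq):
        """First set of a symbol sequence, and whether it can vanish."""
        out = set()
        for s in seq:
            if s not in nonterms:         # terminal: it is the first token
                out.add(s)
                return out, False
            out |= first[s]
            if s not in nullable:
                return out, False
        return out, True                  # the whole sequence is nullable

    changed = True
    while changed:
        changed = False
        for n, bodies in grammar.items():
            for body in bodies:
                f, eps = first_of(body)
                if not f <= first[n] or (eps and n not in nullable):
                    first[n] |= f
                    if eps:
                        nullable.add(n)
                    changed = True
                for i, s in enumerate(body):
                    if s in nonterms:     # what can follow s in this body?
                        f2, eps2 = first_of(body[i + 1:])
                        new = f2 | (follow[n] if eps2 else set())
                        if not new <= follow[s]:
                            follow[s] |= new
                            changed = True
    return nullable, first, follow
```

On grammar (3) this yields, for instance, follow(rest_expr) = { $ } and follow(rest_term) = { +, $ }, matching the table above.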
With the means to calculate these properties of a grammar in hand, we may say exactly how to calculate the action table: a production p_i with pr(p_i) = n -> \a, that is with head n and body \a, is in the action set at (n,T) in case: