AUTHOR: Robin Cockett
DATE: Jan 17th 2002

Lecture 4: Introduction to parsing


Recursive descent parsing - add expressions:

Now we shall turn to the problem of producing a parse tree.
There are many methods (some of which we shall study in the next
sections) but one of the most elegant and simple to understand
is "recursive descent parsing". This was invented by Tony Hoare
(Oxford). his idea was to make the parsing program correspond
almost precisely to the grammar.

We show the process of developing a recursive descent
parser for the language presented by the grammar:

expr -> term rest P_1

rest -> ADD term rest P_2
| SUB term rest P_3
| \epsilon P_4

term -> NUM P_5

Before proceeding any further please recall that this is not the only
presentation of this grammar which is possible. Here is another
presentation which is actually conceptually simpler:

expr -> expr ADD term
| expr SUB term
| term

term -> NUM

Unfortunately, while this presentation is conceptually simpler, it is
ambiguous and so not amenable to the sort of implementation we wish to
discuss. Therefore I have chosen a grammar which is in a convenient
(but conceptually less clear!) form. The main topic in the next
section concerns how one transformations grammars into a
convenient form for parsing: we shall go into this in some
considerable detail as you will need to be able to do these
transformations. Basically they involve transforming the grammar
to eliminate left recursion and then checking that it is LL(1).
It should be mentioned that all these techniques start with the
assumption that the grammar is essentially unambiguous (precedence
parsing is the one exception).

When presented with a grammar one should always try to write down some
strings in the language to get a better idea of its intent. For this
grammar a typical string is:

1 + 42 - 6 + 10

where we assume the lexical analysis stage does the following
conversion into tokens:

lexeme =====> token
+ =====> ADD
- =====> SUB
42 =====> NUM

For this grammar it is amazingly simple to write a recursive descent
parser: here it is beside the grammar in Haskell:

expr -> term rest || expr L = rest (term L)
rest -> ADD term rest || rest (ADD:L) = rest (term L)
| SUB term rest || rest (SUB:L) = rest (term L)
| \epsilon || rest L = L
term -> NUM || term (NUM:L) = L

This is a direct translation into a recursive descent parser.
(Do not forget such a translation is not possible for an arbitrary
grammar but is for these special grammars, namely LL(k) grammars.)

Here is a trace of what this code does:



| |
| | term
| v
| [ADD,NUM(42),SUB,NUM(6),ADD,NUM(10)]
| | |
| | | rest_1
| | v
| | [NUM(42),SUB,NUM(6),ADD,NUM(10)]
| | | |
| | | | term
| | | v
| | | [SUB,NUM(6),ADD,NUM(10)]
| | | | |
| | | | v
| | | | [NUM(6),ADD,NUM(10)]
| | | | | |
| | | | | | term
| | | | | v
| | | | | [ADD,NUM(10)]
| | | | | | |
| | | | | | | rest_1
| | | | | | v
| | | | | | [NUM(10)]
| | | | | | | |
| | | | | | | | term
| | | | | | | v
| | | | | | | []
| | | | | | | ||
| | | | | | | || rest_3
v v v v v v v vv
[] <-- []<-[] <--- []<-[] <-- []<-[] <--- []

Here we have attached the original numbers to the NUM token only so
that one can follow the parse steps better: in practice (as we shall
see) it is actually useful if the lexical analysis stage passes back this
extra information. However, for this program we are not actually
doing this.

The list of tokens is only in the language if the program returns the
empty list.

(Note: it is also possible that this program fails to match an input
in which case an exception will be raised  ...and, of course, the parse will have

We may also wish to build a syntax tree in the attached file is a Haskell program to
do this ...

Clearly to make this useful one really needs a little program to tokenize the input.
Here is such a program which uses the Parsec combinators (actually with each token we pass
back its position ffor error recovery). Here is an example of a recursive descent program
which uses this little lexer.

PREVIEW: When does a recursive descent program work?

Unfortunately a recursive descent program does not work for all grammars.
Essentially they must be LL(k) - which is a fancy way of saying that
at each stage of the parse one must be able to determine the rule to
apply by looking ahead no more than k input tokens. We shall consider
more carefully the case when k=1. As has been mentioned several times
now: not every grammar is of this form.

There are two obvious conditions which are necessary:
(a) there must be no "left recursion"
(b) the "first sets" for the productions of each nonterminal must
be mutually disjoint.

What does no left recursion mean?

A grammar is "not left recursive" if for each nonterminal any leftmost 
series of expansions terminates. Thus, in our grammar we have:

expr rest_1 rest_2 rest_3 term
| | | | |
v v v v v
term rest ADD term rest SUB term rest \epsilon NUM
NUM rest

where we take a nonterminal act on it by a production and
then we keep going by repeatedly expanding the first nonterminal.
We stop when the first symbol is a terminal or the list produced
is empty (\epsilon).

Why do we want the grammar not to be left recursive? If the grammar
is left recursive our recursive descent parser could get stuck in a
non-terminating recursion.


If we simply take our grammar and "reverse" it we obtain
a left recursive grammar which cannot be used with a recursive
descent parser (and actually has the same language):

expr -> rest term P_1

rest -> rest term ADD P_2
| rest term SUB P_3
| \epsilon P_4

term -> NUM P_5

Consider the leftmost expansion:

rest -> rest term ADD -> rest term SUB term ADD -> ....

gives an infinite leftmost expansion. Just by looking at the next
symbol (or next k - for fixed k) we cannot determine how many times
we need to "pump" the rest production in order to allow for all the
operation tokens in the input string. This would result in an
infinite recursion.

What are first sets?

In the above expansions we note that each nonterminal expands to have
only certain terminals occurring as the first terminal in any derived
string. These can be read-off from the above expansions:

nonterminal first set
---------- ---------
expr {NUM}
rest* {ADD,SUB}
term {NUM}

I have starred the nonterminal 'rest' as it is "nullable" that is it can
produce the empty list.


Question: Can you calculate first sets for grammars which are left 

Answer: Yes! First sets can be calculated for all context free grammars.
Of course, you cannot calculate them using the method of leftmost
expansion above.

... see next section.

What do we want in order to be able to do a recursive descent parse?

At each nonterminal we must be able to choose which production
we are going to use thus the possible first sets associated with each
PRODUCTION of a nonterminal must be disjoint so that the next token
will determine the production chosen. Suppose

A --> A1 A2 ... An P_i

is a production then the first set of the production (determined by
the RHS of the production) is

first(A1) if A1 is not nullable
else first(A1) \union first(A2) if A2 is not nullable
else first(A1) \union first(A2) \union first(A3) if A2 is not nullable

Here "nullable" means that the nonterminal is capable of producing an
empty string of terminals.

Thus, one requirement (besides being non-left recursive) is that each
production for A must have a first set which is disjoint from the
first sets of all the other productions of A.

Furthermore, it is also clear that at most one production can be
nullable. For we should only use the null derivation when the next
token is not in the first set of any production.


Question: Do these conditions suffice to characterize grammars
which can be parsed by recursive descent and a lookahead of
one symbol?
Answer: NO! These are necessary conditions they are not sufficient.

Recall that the reason for these conditions is that when the parser
sits at the a nonterminal N it will know that the next token is, say,
T. Now T can occur in at most one first set (of the productions of A).
If it does not occur in the first set of a (non-nullable) production we
know we cannot use that production (as it never produces a T as the next
token). However, there is a slightly complication when a
nullable production is present.

Clearly we have to use the nullable production when the next symbol
does not occur in any of the first sets. However, what tells us
that we should not use the null production EVEN THOUGH the next symbol
does occur in one of the first set?

Consider the following example:

s -> a X
a -> X
| \epsilon

(The only strings recognized are X and XX.)

By looking one symbol ahead how do we distinguish what we do when the
first symbol is X? Do we use a -> \epsilon or a -> X? Clearly
we cannot decide this on a one token lookahead!

Thus this is not an LL(1) grammar! But it certainly satisfies the
two conditions we have listed.

To obtain a sufficient condition we also have to check one other
aspect, namely that the "follow set" of any nonterminal which
is nullable is disjoint from its first set. What this means will
be discussed in more detail shortly.

Here is some Haskell code using the Parsec combinators which inputs a file with and expression and outputs stack machine instructions.   The code illustrates: