transformations

CPSC411: COMPILER CONSTRUCTION I

Author: Robin Cockett Date: Jan 28th 2002 (modified 2014)

Lecture: 7 & 8

Title: Transforming grammars

Reading:

Louden: Chapter 4.2.3

Dragon: 4.3

These notes describe methods of transforming grammars. We are motivated by the desire to transform a grammar into an LL(1) grammar. If a grammar can be so transformed it allows access to the predictive top-down parsing techniques (i.e. either a table-driven pushdown automata or a recursive descent parser). The bad news is that given an arbitrary context free grammar it is undecidable whether there is an LL(1) (or LL(k)) grammar which will recognize the same language: thus, ultimately our objective is impossible!

In fact, this is not quite the problem we wish to solve! We are not simply looking for a completely arbitrary and unrelated grammar which by some remarkable fluke happens to recognize the same language. We require a transformation between the set of parse trees ... this because we need to transfer the semantics (or meaning) intended by the old grammar onto the new grammar. Thus, we really need a step by step transformation system which allows us to also keep track of how we have transformed the parse tree.

We may see the problem as follows: we are given a grammar G and we wish to transform it to a grammar G' while providing a translation T: G' -> G such that

                      INPUT
                     /     \
          parse     /       \   parse
                   /         \
                  /           \
                 \/           \/

                G' -----------> G
                        T

commutes, in the sense that parsing an input for G' and then translating the parse tree into one for G is the same as parsing straight into G. In fact, futhermore, it is desirable for the translation T to be given by a linear time "fold" (an attribute grammar). As we shall see we can do a number of useful transformation steps in the direction of obtaining an LL(1) grammar. These steps almost amount to a procedure for generating an equivalent LL(1) grammar (if one exists) -- the one aspect we shall not discuss is how to eliminate the follow set clashes.

The basic methods for transforming grammars which we shall look at are:

Elimination of left recursion (dually elimination of right recursion)
Left factoring (dually right factoring)
Nonterminal expansion.

These techniques are often sufficient to bring a grammar to heel! However, it should be noted that if the original grammar is ambiguous then, as these techniques "preserve meaning", the result will still be ambiguous and so necessarily will fail to be LL(1) or to belong to any other unambiguous class (such as LL(k) or LR(k)). Furthermore, if the language is simply is not LL(1) then these techniques will also fail. Thus, you should not expect them to deliver magically an LL(1) grammar!

Set against the fact that the transformations may not help are:

By trying to transform the grammar you may expose aspects of the grammar which were not apparent in its original form;
Many of the grammars you will meet in practice can be transformed to LL(1)! If they can be transformed then one has access to particularly simple parsing techniques with good error reporting.

I lay more stress on grammar transformation techniques than most texts.

Eliminating left recursion:

Eliminating left recursion is the most difficult transformation step. I shall present an approach that has the following advantages:

it allows one to carry out the transformation process by hand relatively easily
it is directly applicable to most grammars as it allows nullable productions
it tries to keep as many of the original features of the grammar intact as is possible and tries to keep the the size of the grammar under control.

Transformations highlight the difference between the presentation (that is the grammar) of a language and the language itself (the set of strings recognized). Recall many grammars can recognize the same language but only one language is recognized by a grammar: transformations certainly must preserve the language ... we also require that they allow one to recover the parse tree as well.

The algorithm I will describe works on any grammar which has no cycles and is null unambiguous. A grammar has a cycle if there is a nonterminal x with a nontrivial derivation x =+=> x. Grammars with cycles are quite pathological (and ambiguous): they are of little interest in parsing applications so this does not unduly limit the applicability of this algorithm. Importantly, there is a straightforward test of whether a grammar has cycles. Null ambiguous grammars are, similarly, quite pathological (and ambiguous) and null ambiguity can be easily determined.

Testing for cycles:
We may form a relation on the nonterminals of any grammar by:

x << y if and if x -> w y w' where w and w' are nullable.

The grammar has no cycles in case the transitive closure of this relation is anti-reflexive (i.e. it has no cycles).

Question: Can you explain why this works?

Testing for null ambiguity:

A grammar is null unambiguous if and only if each non-terminal has at most one nullable production. It is as easy as this! If a grammar is not null unambiguous it must be ambiguous so it is not of great interest in parsing applications. In particular, this means there is no chance it can be turned into an LL(1) grammar. It is possible, however, to make a grammar which recognizes the same language and is null unambiguous (this is essentially done using \epsilon separation see below) ... the equivalence at the level of parse trees, however, will be lost.

Removing left-recursion requires a number of steps which we now describe in some detail. The algorithm starts by isolating where the "problem areas" are in the grammar. This is done by partially ordering the nonterminals of the grammar and classifying the productions as either being "good" or "bad": the aim is then to focus on the bad productions and transform them into good productions.

A feature of this algorithm is that it mimics how one might approach the problem when you do it by hand. If you were doing it by hand you would inspect the grammar and choose an ordering on the nodes which causes the minimal amount of damage to the grammar. The algorithm does this by developing the partial order on the nodes as it is needed.

GOOD and ALMOST GOOD nonterminals

Suppose we have some partial order (see below for the definition of this), x < y, on the non-terminals of our grammar. We then say, with respect to this partial order, that:

A production r, where pr(r) = x -> \alpha, is good in case \alpha satisfies any of the following:

it is empty (i.e. x is immediately nullable )
it starts with a terminal T
it starts with a non-nullable nonterminal y with x < y in the partial order
it starts with a nullable terminal y with x < y , so that it is of the form x -> y \alpha', and the production x -> \alpha' (skip over y) is good.

A non-terminal x is good if all its associated productions x -> \alpha_1 | ... | \alpha_n are good. Once a non-terminal is good, in this sense, you need not change its productions agian while transforming to remove left recursion ...

OBSERVATION:

A grammar has no left-recursion if and only if each nonterminal (or equivalently if every production) is good with respect to some partial order on the nonterminals of the grammar.

Proof.

Suppose we have a derivation a =*=> a w then there must be be a series of one step rules

a => w_1' a_1 w_1
a_1 => w_2' a_2 w_2
....
a_n = a w_n
where each w_i' is nullable. But when all rules are good this means a < a_1 < ... < a_n so that a < a which cannot be.

Thus, when all nonterminals (or productions) are good this implies that the grammar is not left-recursive.
Conversely, suppose we have a non left-recursive grammar then there is a relation induced on the non-terminals by demanding that each production be good. So we will say x < y in case there is a production x -> w' y w where w' is nullable. Saying that the transitive reflexive closure of this is an anti-symmetric relation is an equivalent way of defining what it means to say that the grammar is not left-recursive!

The idea is then to transform the rules so that they all become good. Clearly this means that we need to modify the rules which fail to be good!

What is a partial order?

Given a set X, a partial order on X is a relation x =< y satisfying:

x =< x (the relation is reflexive)
x =< y and y =< z implies x =< z (the relation is transitive)
if x =< y and y =< x then x=y (the relation is anti-symmetric)

For example: the integers are partially ordered (e.g. -7 =< 102 ). In fact, in this case all pairs of integers are related (either forward or backward) by this relation so that it is called a total order ). Strings of characters may be (totally) ordered lexicographically: this is how telephone books work! Stings can also be ordered by the prefix relation: w_1 is a prefix of w_2 , that is w_1 =< w_2, in case there is a w_3 such that w_1 w_3 = w_2. This prefix ordering is clearly just a partial order. (Can you see why it is transitive.)

Most importantly one can create a partial order by specifying which elements are "better" than which others, x < y, (where x<y if and only if x =< y and x =/= y) and then taking the transitive reflexive closure. If you do this, however, you must be careful to check the relation is still anti-symmetric (i.e. that there are no cycles).

A production x -> y \alpha is improvable in case either

if x > y and y is good, or
x < y and y is nullable and x -> \alpha is improvable (so you can skip over nullable non-terminals)

This can be re-expressed directly as follows:
x -> \gamma y \alpha is impovable at y if x > y, \gamma is nullable, and each z in \gamma has x < z.

The improvement which we shall make is to expand the good nonterminal y by replacing the occurence of y by its body in the production.

A nonterminal x has an impovable immediate left recursion or is almost good if its productions set is of the form:

x -> x \alpha_1 | .. | x \alpha_r | \beta_1 | ... | \beta_n

where each x -> \beta_i is good (note: each \alpha_i must be non-null as the grammar can have no cycles).

Note that the immediate left recursion in almost good rules can be eliminated and when no \beta_j is nullable all the resulting rules by a judicious choice of partial order on the nonterminals -- including the new nonterminal -- can be made good. If a \beta_j is nullable we are forced to assume x < x' and this may make an introduced rule on x' bad.

A production x -> \beta x \alpha, where \beta is non-empty and nullable, is said to have a hidden immediate left recursion. Such rules must always be "squeezed" in order to expose the immediate left recursion. A rule is squeezed essentially by expanding the non-terminals which are hiding the left recursion (that is those in \beta): the aim is to make the hidden left recursion into an immediate left recursion. Unfortunately expanding these non-terminal when they are not good may not achieve this and this will force us to \epsilon-separate the non-terminal one wishes to expand (see more below) when it is bad. However, \epsilon-separation, should only be used as a last resort: it can cause an exponential increase in the size of a grammar!

Our strategy for removing left recursion, thus, is as follows:

Removing left recursion:

Try to produce a partial order which makes as many production rules good as possible;
Improve any improvable rules;
Remove the immediate left recursion from almost good non-terminals;
Squeeze the good non-terminals involved in a hidden immediate left recursion. As a last resort: squeeze bad nonterminals after \epsilon-separating them!

Repeat this until all rules are good ...

The following observation tells us that, if there is a bad production, there will always be something we can do!

OBSERVATION:

A reachable realizable grammar G with a total order on nonterminals either satisfies one of the following:

all productions good;

there is an improvable production;

there is an almost good non-terminal (that is an immediate left recursion);

there is a hidden immediate left recursion.

Proof.

Consider the least nonterminal: the only way one of its rules is not good or is not improvable is if there is a production x -> w' x w where w' is nullable. Which means it has a hidden left recursion.
Suppose now that x is the smallest nonterminal which is not good: that is all y with y<x have all their productions good then any production which is not good fails to be good because it is of the form

x -> w' y w

where w' is nullable and y =< x. If y < x then y must be good, thus, we would have found an improvable production. Suppose this does not happen then y = x, but this makes the production squeezable or an immediate left recursion!

NOTE: A grammar with no null-productions will never require squeezing! However, if only squeezing is left, then for every y in w' , in a production x -> w' x w with a hidden left recursion on x, we must have x <y as otherwise we would have a source of improvement. As we shall see shortly this can happen.

Our basic strategy is to transform improvable productions and the production sets of almost good nonterminals. The observation above assures us that we will always find such while the grammar is not left-recursive ... or there will be something to squeeze! Often one is lucky and there is no need to do any squeezing and we shall consider below a number of examples in which squeezing is not required.

First consider the very first step of finding a partial order on the grammar:

Example 1: Consider the grammar:

expr -> expr ADD term                             --P_1
            | expr SUB term                               --P_2
            | term                                                --P_3
term -> term MUL factor                           --P_4
             | term DIV factor                            --P_5
             | factor                                             --P_6
factor -> LPAR expr RPAR                      --P_7
               | NUM                                           --P_8

Whatever partial order we put on the nonterminals notice that P_7 and P_8 will be good so that factor is good. To make expr almost good we must suppose expr < term. Finally, to make term almost good we must assume that term < factor.

Note how we choose the partial order to suit our grammar .... and, indeed, the strategy always should be to choose a partial order which makes as much as possible good!

Example 2: Consider the grammar:

s -> A                                                          --R_1
      | a                                                           --R_2
a -> B                                                          --R_3
       | b                                                          --R_4
b -> s                                                           --R_5
       |                                                             --R_6

s -> A --R_1 \| a --R_2 a -> B --R_3 \| b --R_4 b -> s --R_5 \| --R_6

Here we can make s < a, a < b but now we have to leave R_5 as an improvable production. In fact, whichever way we choose the partial order, we will always have to leave one production as improvable (because there is a left recursion).

Example 3: Consider the grammar:

s -> s s s B -- S_1
| -- S_2

s -> s s s B -- S_1 \| -- S_2

Here the start symbol s is almost good.

Example 5: Consider the grammars:

s -> s s s
      |

Grammar (a)
s -> s
      |

Grammar (b)
s -> s B

Grammar (c)
s -> a b C
a -> A
       |
b -> a b C
       | B
Grammar (d)

s -> s s s \| Grammar (a)	s -> s \| Grammar (b)	s -> s B Grammar (c)	s -> a b C a -> A \| b -> a b C \| B Grammar (d)

Here in (a) the start symbol is almost good while in (b) it is not and in (c) it is almost good. Exercise: what is wrong with the first three grammars? Work through (d) as an exercise!

Transformations for removing left-recursion:

We now allow ourselves the following transformations which we can apply repeatedly (in any order) until we can do no more:

(A) Expansions of improvable non-terminals in improvable productions:

x -> \gamma y \alpha' y -> \beta_1 | ... | \beta_n
------------------------------------------------------------------------ (y good, \gamma nullable)
x -> \gamma \beta_1 \alpha' | ... | \gamma \beta_n \alpha'

(B) Immediate left-recursion removal for almost good nonterminal

x -> x \alpha_1'| ..... | x \alpha_n| \beta_1 | ..... | \beta_m
------------------------------------------------ ----------------------- (x -> \beta_j good)
x -> \beta_1 x' | ... | \beta_m x'
x' -> \alpha_1 x' | .. | \alpha_n x' | \epsilon

and we add the non-terminal x' to the partial order with x < x' whenever any of \beta_j are nullable. The \beta_1,...,\beta_m are called the base cases ... if there are no base cases then there will be no productions x -> \alpha_i x' and so x' will not be reachable (and tacitly we are assuming that all actions are reachable ...) so that the whole production set can be removed.

We apply (A) in preference to (B), so that we start by improving all our rules (enlarging the preorder to make as much good or as possible) until there is no choice but to transform an almost good nonterminal.

Only if these transformations (and enlargement of the partial order) do not make all the productions good do we resort to squeezing a rule with a hidden left recursion:

Squeezing is the most delicate operation! There are two possibilities:
(a) There is a non-terminal which must be squeezed which is already good. In this case go ahead and expand this non-terminal.
(b) If all the non-terminals to be squeezed are bad to ensure progress \epsilon-separate the production to be substituted.

Let us trace through what these transformations do to an example which requires squeezing but not null \epsilon-separation before we return to discuss this transformation in more detail:

Example 1:

a -> b C                    good            a < b
     | c D.                   good            a < c
b -> e a E               bad              b < e
     | c B.                   good            b < c
c -> A.                     good
e -> F e                    good
    | .                         good

By trying to partially order the non-terminals to make as much good as possible we may come up with the partial ordering above (note that e is nullable). There is just one rule which causes a problem. However, note that it is improvable as a is a good non-terminal. So we improve the bad production:

a -> b C                    good            a < b
     | c D.                   good          a < c
b -> e b C E             bad               b < e   hidden left recursion
     | e c D E              good            b < c
     | c B.                   good
c -> A.                     good
e -> F e                    good
    | .                         good

This improvement now exposes a hidden left recursion. We must squeeze the non-terminal e in the bad rule. As e is a good non-terminal we may expand it:

a -> b C                    good            a < b
     | c D.                   good          a < c
b -> F e b C E          good
   | b C E bad immediate left recursion
     | e c D E              good            b < e, b < c
     | c B.                   good
c -> A.                     good
e -> F e                    good
    | .                         good

Now we can remove the immediate left recursion.

a -> b C                    good            a < b
     | c D.                   good          a < c
b -> F e b C E b'     good
     | e c D E b'         good            b< e, b < c
     | c B b'.               good
b' -> C E b'    good
   |. good
c -> A.                     good
e -> F e                    good
    | .                         good

As all the rules are now good we have removed the left recursion.

Example 2: Consider the grammar


expr -> expr ADD term --P_1 \| expr SUB term --P_2 \| term --P_3 term -> term MUL factor --P_4 \| term DIV factor --P_5 \| factor --P_6 factor -> LPAR expr RPAR --P_7 \| NUM --P_8

where we take the ordering expr < term < factor we can expand the almost good nodes:


expr -> term expr' --P_1 expr' -> ADD term expr' --P_1' \| SUB term expr' --P_2' \| --P_3' term -> factor term' --P_4 term' -> MUL factor term' --P_4' \| DIV factor term' --P_5' \| --P_6' factor -> LPAR expr RPAR --P_7 \| NUM --P_8

There is no need to add any relations to the order in this case as now all the productions are good ... so we have removed left recursion.

Example 3: Consider the grammar:

s -> A                                                          --R_1
      | a                                                           --R_2
a -> B                                                          --R_3
       | b                                                          --R_4
b -> s                                                           --R_5
       |                                                             --R_6

s -> A --R_1 \| a --R_2 a -> B --R_3 \| b --R_4 b -> s --R_5 \| --R_6

This gives a good example of the role of the partial order. Remember we can choose the partial order and we shall try to do so in such a manner as to make as many of the productions as possible good (or almost good) after all we do wish to have all the productions good eventually.

Now rather than saying outright what the partial order is we actually develop it as we go. Thus, we may reason as follows: in order to make s -> a a good production we shall make s < a ; this also has the desirable effect of ensuring that s is a good non-terminal. Excellent! Next to ensure that a -> b is a good production we shall make a < b; this will also ensure that a is a good non-terminal. Marvelous!

Now it is perhaps tempting to think we can always continue in this vein making productions good by adding to the ordering. However, to see what goes WRONG consider the attempt to now make b -> s good. To do this have to demand that b < s ... however here we run into a problem for s is already less than b! So we cannot make b < s. This is where the partial order begins to play an important role for it is stopping us from having a cycle s -> a -> b -> s -> .... etc.

So now realizing that s < b, we conclude that b -> s is an improvable production. So we may improve it by expanding to the productions b -> a and b -> A. But now, while b -> A is a good, we notice that b -> a has the same problem as before, namely a is already less than b . Thus b -> a is improvable. Expanding this production gives b -> b and b -> B . This now is VERY bad news! The production b -> b is an immediate cycle and its presence after transformations means that our original grammar had cycles!! This means we can never turn this grammar into an LL(1) grammar let alone into one with no left-recursion!

Thus, our attempt to transform this grammar must be abandoned as hopeless. In retrospect, inspecting the grammar more carefully this really is no surprise as it is cyclic! But what is interesting is how this transformation procedure discovers this problem ... AND this is a little reminder that one should check for cycles before you begin ...

Example 4: Consider the grammar:

s -> s s s B -- S_1
| -- S_2

s -> s s s B -- S_1 \| -- S_2

Here the start symbol S is almost good so we transform to:

s -> s'                                                           -- S'_1
s'-> s s B s'                                                  -- S'_2
      |                                                               -- S'_3

s -> s' -- S'_1 s'-> s s B s' -- S'_2 \| -- S'_3

where we add s < s'. This makes s' -> s s B s' improvable to s' -> s' s B s' which makes s' almost good so we may expand to obtain:

s -> s'                                                           -- S''_3
s' -> s''                                                         -- S''_3
s'' -> s B s' s''                                              -- S''_3
        |                                                              -- S''_3

s -> s' -- S''_3 s' -> s'' -- S''_3 s'' -> s B s' s'' -- S''_3 \| -- S''_3

where we add s' < s'' . This makes s'' -> s B s' s'' improvable in two steps to s'' -> s'' B s' s'' whence making s'' almost good ... so we obtain:

s -> s'                                                             -- S'''_3
s' -> s''                                                           -- S'''_3
s'' -> s'''                                                        -- S'''_3
s''' -> B s' s'' s'''                                          -- S'''_3
          |                                                             -- S'''_3

s -> s' -- S'''_3 s' -> s'' -- S'''_3 s'' -> s''' -- S'''_3 s''' -> B s' s'' s''' -- S'''_3 \| -- S'''_3

where we add s'' < s'''. Now all our productions are all good so we have removed left recursion.

stmts -> stmts stmt \| stmt -> VAR ASSIGN expr SEMICOLON expr -> expr addop term \| term addop -> ADD \| SUB term -> term mulop factor \| factor mulop -> MUL \| DIV factor -> LPAR expt RPAR \| NUM

Example 5: Consider the following grammar:

s -> a b C
a -> A
|
b -> a b C
| B

s -> a b C a -> A \| b -> a b C \| B

Suppose we deliberately choose an inappropriate total ordering b < a < s of the non-terminals. This leaves a good. Notice that b -> a b C is not good as a is nullable. The production s -> a b C is not good either, however, it can be improved by expanding a:

s -> A b C
      | b C
a -> A
       |
b -> a b C
       | B

s -> A b C \| b C a -> A \| b -> a b C \| B

This introduces a new bad production s -> b C. However we can also improve the last rule and this gives:


s -> A b C \| b C a -> A \| b -> A b C \| b C \| B

This makes the last rule almost good and we can remove the immediate left recursion:


s -> A b C \| b C a -> A \| b -> A b C b' \| B b' b' -> C b' \|

This makes all the rules good except for the second rule which now can be improved:


s -> A b C \| A b C b' C \| B b' C a -> A \| b -> A b C b' \| B b' b' -> C b' \|

Now all rules are good so that we have removed the left recursion. However, do notice that the cost of choosing an inappropriate order on the non-terminal was significant as the resulting grammar is larger and less perspicuous than the grammar which results from simply eliminating the obvious left recursion!

Unfortunately there are grammars in which one has to squeeze a non-terminal, x, which is not already good, in order to make progress. In this case we are forced to resort to extreme measures. The ideas is that, in this case, we shall replace the non-terminal by a non-nullable version, x* , and by nulling it. This will cause the hidden recursion to be blocked in the first case and made more immediate in the second case. But the question is how do we get the non-nullable version x* of x? Our strategy, in an attempt to be minimalistic, is to let x* inherit all the non-nullable productions of x and an \epsilon-separated version of the (one) null production.

First we explain \epsilon separation:

\epsilon-separation (or getting rid of nullable productions):

As a last resort in squeezing it may be necessary to transform the productions reachable from the non-terminal one wants to squeeze into an \epsilon-separated form. How this is done is now described. Note this process only gives an equivalence of the parse trees when the grammar is null unambiguous.

A (reachable realizable) grammar is null unambiguous if for every nonterminal there at most one null production. A null ambiguous grammar is always ambiguous in the usual sense, and so is, from our perspective, pathological. It can never be transformed into an LL(1) grammar. Thus, we shall assume that we are working with a null unambiguous grammar in this section.

To remove null productions from a null unambiguous grammar one replaces the productions of every immediately nullable nonterminal by a production which produces either a non-nullable version of the nonterminal or the empty string:
x -> \beta_1 |...| \beta_n | \epsilon =====> x -> x* | \epsilon x* -> \beta_1 | ... | \beta_n
once this expansion is done one can expand every occurrence of the nonterminal with the production which produces a non-nullable version or the empty string. This will make the nullable version of the non-terminal unreachable unless it is actually the start nonterminal: remove these unreachable nonterminals. One then repeats this process for the nonterminals which have newly become immediately nullable until there are no immediately nullable productions (except for the start non-terminal).

From this process one obtains an equivalent grammar which has at most one non-nullable nonterminal (the start nonterminal).

The problem with this procedure is that it potentially expands the grammar exponentially (every occurrence of a nullable nonterminal introduces two productions)! This is why one want to avoid it if possible.

Example 6:
Consider the grammar

s -> a A | b.
a -> b B c | .
b -> C | .
c -> D |.

This is null unambiguous with every non-terminal nullable. We expand using

a -> a* | .    a* -> b B c.
b -> b* | .    b* -> C.
c -> c*| .     c* -> D.

and this gives from our original rule set the following rules:

s -> a*A | A | b*|.
a -> a* | .
a* -> b* B c* | b* B | B c* | B .
b -> b* | .
b* -> C .
c -> c* |.
c* -> D.

Now taking the reachable rules gives:
s -> a*A | A | b*|.
a* -> b* B c* | b* B | B c* | B.

b* -> C.
c* -> D.

Note: the start symbol is nullable ... so can be expanded to obtain:
          s -> s* |.
        s*-> a*A | A | b*.
          a* -> b* B c* | b* B | B c* | B.

b* -> C.
c* -> D.

So now let us consider an example which involves the use of \epsilon-separation. We shall illustrate the use of \epsilon-separation just of the null production in an attempt to not expand the size of the grammar.

Example 7: For the grammar

y -> x y A | .               hidden left-recursion:      add y<x   (must squeeze x)
x -> y x B | .               have y < x but y is not good so not improvable (must squeeze y)
---------------------------   squeeze x in rule 1
y -> x* y A | y A | .     almost good y < x*
x -> x* | .                     good x < x*
x* -> y x B.
---------------------------- remove immediate left-recursion rule 1
y -> x* y A y' | y'.          good    y < x*, y < y'
y' -> A y' | .                     good
x -> x* | .                        good    x < x*
x* -> y x B.                    bad
----------------------------- improve y in last rule
y -> x* y A y' | y'.          good    y < x*, y < y'
y' -> A y' | .                     good
x -> x* | .                        good    x < x*
x* -> x* y A y' x B         bad (immediate left recursion)
        | y' x B.                    bad x* < y' (improve x)
------------------------------ improve x in last rule
y -> x* y A y' | y'.          good    y < x*, y < y'
y' -> A y' | .                     good
x -> x* | .                        good    x < x*
x* -> x* y A y' x B         bad (immediate left recursion)
        | y' x* B                  bad (hidden recursion - so squeeze y' which is good)
        | y' B.                       good x* < y'
-----------------------------   squeeze y'
y -> x* y A y' | y'.          good    y < x*, y < y'
y' -> A y' | .                     good
x -> x* | .                        good    x < x*
x* -> x* y A y' x B         bad (immediate left recursion)
        | x* B                 bad (immediate left recursion)
        | A y' x* B               good
        | y' B.                       good x* < y'
-----------------------------   remove immediate left recursion
y -> x* y A y' | y'.          good    y < x*, y < y'
y' -> A y' | .                     good
x -> x* | .                        good    x < x*
x* -> A y' x* B x*'        good
        | y' B x*'.                 good   x* < y'
x*' -> y A y' x B x*'        good   x*' < y
        | B x*'                      good
        |.                              good
--------------------------------------------------------------------------------

Exercise: Consider the grammar below. Similary squeezing the first rule by expanding the first non-terminal will land us in trouble. Here is another case where we are forced to do \epsilon-separation:

x -> y x A | x y A' | .                x < y hidden left recursion
y -> x y B | y x B' | .                 hidden immediate call to x
--------------------------------------   \epsilon-separate y to expose hidden call to x in rule 1

Can you complete the transformation??!

Translation....

In this section I will give a brief description of the translation required between the parse trees. This allows the parse tree of a transformed grammar to be translated back into a parse tree for the original grammar in an unambiguous and efficient manner. This is transformation is "for the record" and you might like to skip the section unless you actually need it! The point is that it can be done quite easily.

The translation is indicated by placing

T[ ]

round the new parse tree. We have to show how each transformation gives a translation backward between the parse trees.

If we improve a -> b \alpha (production P) by expanding along b -> \beta (production Q) to get a new production b -> \beta \alpha (production R ) then the synthesis translation is simply:

T[R(\alpha,\beta) ] = P(Q(\alpha),\beta)

The translation for an expansion by an empty production is more complex (but still quite simple): if we improve a -> b \alpha (production P) by expanding along b -> \epsilon (production Q) we get a -> a' (production R) a' -> \alpha (production S). The translation is:

T[ R(x) ] = x (the identity)
T[S(\alpha)] => P(Q \alpha)

Now we turn to the translation required for the expansion which removes immediate left recursion. We have already discussed the idea when we looked at the translation from left associative form to right associative form. Here we are improving:

a -> a \alpha_1              --P'_1
       | ...
       | \beta_j                  --P_j

a -> \beta_1 a'            --Q_1
        | ...
a' -> \alpha_1 a'        --Q'_1
         | ...
         |                            --E

The translation is:

T[ Q_j(\beta_j,f)] = f(P_j(\beta_j))
T[Q'_j( \alpha_i,g) ] = \x -> g(P'_j(x, \alpha_i)))
T[E] = \x => x (the identity function)

where here I have used lambda calculus style expressions.

Termination ...

When there are no cyclic derivations

x =+=> x

then we can sketch the argument that this process must terminate with a non-left-recursive grammar.

The key steps in proving this are to realize that

when there are no cycles the improvement process must terminate.
when immediate recursion is eliminated (and a new nonterminal is introduced) one "minimal" left recursive loop is eliminated.

As there are only finitely such "minimal" loops (and we do not increase this number by improving transformations) this process terminates.

Left-factoring

Left-factoring is the following rewrite:

a -> \alpha \gamma_1 | ... | \alpha \gamma_r | \gamma_{r+1} | ... | \gamma_n
---------------------------------------------- --
a -> \alpha a' | \gamma_{r+1} | ... | \gamma_n
a' -> \gamma_1 | ... | \gamma_r

This has the benefit of removing the obvious first set clashes. However, there is no guarantee that it will remove all first set clashes. Note that this transformation is a right non-terminal expansion in reverse!

Note that an LL(1) grammar necessarily has no left recursion and is necessarily left-factored.

Exercise: what is the following grammar left factored?


expr -> term ADD expr \| term SUB expr \| term term -> factor MUL term \| factor DIV factor \| factor factor -> LPAR expr RPAR \| NUM

Leftmost nonterminal expansion

Given a nonterminal which occurs as the leftmost nonterminal in the body of a production:

x -> y \alpha

we shall say that the occurrence has a first set clash in case:

there is another production with the same head, x -> \gamma, such that first(y \alpha) is not disjoint from first(\gamma).

Notice that if there is a leftmost occurrence of a non-terminal with a first set clash then the grammar cannot be LL(1). Given such an occurrence we shall substitute the nonterminal to expose the first set clash:

x -> y \alpha y -> \beta_1 | ... | \beta_n
---------------------------------------------------------------
x -> \beta_1 \alpha | ... | \beta_n \alpha

First/Follow Set Clashes

Removing first set/follow set clashes is tricky and there is no guarantee that it can be done in general. A basic technique is to find sequences which are causing the clashes and to remove them from the grammar. Thus one is looking for sequences in a production

x -> \alpha y z \beta

in which y is nullable and first(y) and first(z) are not disjoint. To remove the first/follow set clash create a new non-terminal for the sequence, w=yz, and create productions for w which expands the non-terminal y so if y -> \gamma_1 |...| \gamma_r then replace the above with:

x -> \alpha w \beta
w -> \gamma_1 z |...|\gamma_n z

in this manner this occurrence of z may no longer contribute to the follow set of y. Of course you can also be unlucky and one of the \gamma_i can contain an occurrence of y before a nullable ending which will ensures first(z) must be in follow(y) again.

Eliminating nonterminals

It is also worth keeping the grammar clean. When a nonterminal x has a single production:

x -> \alpha

(that is there are no other productions with x as the head) and x does not occur in \alpha (if it did then x would be unrealizable) then the nonterminal x can be removed from the grammar (provided x is not the start nonterminal) by substituting the body of this production for each occurrence of x.

Notice that this may not be advantageous if \alpha is a long string! However, when the length of \alpha is at most one it seems worthwhile to perform the substitution to eliminate the x.

A special case of this is unit rules: this the case when \alpha is a singleton x -> y. Unit rules can always be eliminated without loss.

Transforming to LL(1)

In trying to turn a grammar into an LL(1) grammar you should try the following steps:

Remove left-recursion.
Expose first set clashes.
Left-factor the grammar.
Attempt to remove first/follow set clashes.
Return to step 2 after thinking about whether you want to continue!

Health warning:

Recall there is no guarantee this process will terminate. In particular the expansion of nonterminals could simply blow up the grammar making the rules bigger and bigger