CPSC411: COMPILER CONSTRUCTION IAuthor: Robin Cockett Date: Jan 28th 2002 (modified 2014)
Lecture: 7 & 8Title: Transforming grammars
Reading:
We may see the problem as follows: we are given a grammar G and we wish to transform it to a grammar G' while providing a translation T: G' -> G such that
INPUTcommutes, in the sense that parsing an input for G' and then translating the parse tree into one for G is the same as parsing straight into G. In fact, futhermore, it is desirable for the translation T to be given by a linear time "fold" (an attribute grammar). As we shall see we can do a number of useful transformation steps in the direction of obtaining an LL(1) grammar. These steps almost amount to a procedure for generating an equivalent LL(1) grammar (if one exists) -- the one aspect we shall not discuss is how to eliminate the follow set clashes.
/ \
parse / \ parse
/ \
/ \
\/ \/
G' -----------> G
T
The basic methods for transforming grammars which we shall look at are:
Eliminating left recursion:
x << y if and if x -> w y w' where w and w' are nullable.The grammar has no cycles in case the transitive closure of this relation is anti-reflexive (i.e. it has no cycles).
Question: Can you explain why this works?
Testing for null ambiguity:
A grammar is null unambiguous if
and
only if each non-terminal has at most one nullable production. It is as easy as this!
If a grammar is not null unambiguous it must be ambiguous so it is
not of great interest in parsing applications. In
particular, this means there is no chance it can be turned into an
LL(1) grammar. It is possible, however, to make a grammar which
recognizes the same language and is null unambiguous (this is
essentially done using \epsilon separation see below)
... the equivalence at the level of parse trees, however,
will be lost.
Removing left-recursion requires a number of steps which we now
describe in some detail. The algorithm starts by isolating
where the "problem areas" are in the grammar. This is
done by partially ordering the nonterminals of the grammar and
classifying the productions as either being "good" or "bad": the
aim is then to focus on the bad productions and transform them
into good productions.
A feature of this algorithm is that it mimics how one might
approach the problem when you do it by hand. If you were
doing it by hand you would inspect the grammar and choose an
ordering on the nodes which causes the minimal amount of damage to
the grammar. The algorithm does this by developing the
partial order on the nodes as it is needed.
A production r, where pr(r) = x -> \alpha, is good in case \alpha satisfies any of the following:
A non-terminal x is good if all its associated
productions x -> \alpha_1 | ... | \alpha_n are
good. Once a non-terminal is good, in this sense, you
need not change its productions agian while transforming to remove
left recursion ...
OBSERVATION:
A grammar has no left-recursion if and only if each nonterminal (or equivalently if every production) is good with respect to some partial order on the nonterminals of the grammar.
Proof.
Suppose we have a derivation a =*=> a w then there must be be a series of one step rules
a => w_1' a_1 w_1where each w_i' is nullable. But when all rules are good this means a < a_1 < ... < a_n so that a < a which cannot be.
a_1 => w_2' a_2 w_2
....
a_n = a w_n
Thus, when all nonterminals (or productions) are good this implies that the grammar is not left-recursive.The idea is then to transform the rules so that they all become good. Clearly this means that we need to modify the rules which fail to be good!Conversely, suppose we have a non left-recursive grammar then there is a relation induced on the non-terminals by demanding that each production be good. So we will say x < y in case there is a production x -> w' y w where w' is nullable. Saying that the transitive reflexive closure of this is an anti-symmetric relation is an equivalent way of defining what it means to say that the grammar is not left-recursive!
Most importantly one can create a partial order by specifying
which elements are "better" than which others, x < y,
(where x<y if and only if x =< y and x =/= y) and then
taking the transitive reflexive closure. If you do
this, however, you must be careful to check the relation is still
anti-symmetric (i.e. that there are no cycles).
x -> x \alpha_1 | .. | x \alpha_r | \beta_1 | ... | \beta_nwhere each x -> \beta_i is good (note: each \alpha_i must be non-null as the grammar can have no cycles).
OBSERVATION:
A reachable realizable grammar G with a total order on nonterminals either satisfies one of the following:
- all productions good;
- there is an improvable production;
- there is an almost good non-terminal (that is an immediate left recursion);
- there is a hidden immediate left recursion.
Proof.
Consider the least nonterminal: the only way one of its rules is not good or is not improvable is if there is a production x -> w' x w where w' is nullable. Which means it has a hidden left recursion.Suppose now that x is the smallest nonterminal which is not good: that is all y with y<x have all their productions good then any production which is not good fails to be good because it is of the form
x -> w' y w
where w' is nullable and y =< x. If y < x then y must be good, thus, we would have found an improvable production. Suppose this does not happen then y = x, but this makes the production squeezable or an immediate left recursion!NOTE: A grammar with no null-productions will never require squeezing! However, if only squeezing is left, then for every y in w' , in a production x -> w' x w with a hidden left recursion on x, we must have x <y as otherwise we would have a source of improvement. As we shall see shortly this can happen.
expr -> expr ADD
term
--P_1 | expr SUB term --P_2 | term --P_3 term -> term MUL factor --P_4 | term DIV factor --P_5 | factor --P_6 factor -> LPAR expr RPAR --P_7 | NUM --P_8 |
Whatever partial order we put on the nonterminals notice that P_7 and P_8 will be good so that factor is good. To make expr almost good we must suppose expr < term. Finally, to make term almost good we must assume that term < factor.
Note how we choose the partial order to suit our grammar .... and, indeed, the strategy always should be to choose a partial order which makes as much as possible good!
Example 2: Consider the
grammar:
s ->
A
--R_1
| a --R_2 a -> B --R_3 | b --R_4 b -> s --R_5 | --R_6 |
Here we can make s < a, a < b but now we have to leave R_5 as an improvable production. In fact, whichever way we choose the partial order, we will always have to leave one production as improvable (because there is a left recursion).
Example 3: Consider the
grammar:
s -> s s s
B
--
S_1 | -- S_2 |
Here the start symbol s is almost good.
Example 5: Consider the
grammars:
s -> s s s | Grammar (a) |
s -> s | Grammar (b) |
s -> s B Grammar (c) |
s -> a b C a -> A | b -> a b C | B Grammar (d) |
Here in (a) the start symbol is almost good while in (b) it is not and in (c) it is almost good. Exercise: what is wrong with the first three grammars? Work through (d) as an exercise!
x -> \gamma y \alpha' y -> \beta_1 | ... | \beta_n(B) Immediate left-recursion removal for almost good nonterminal
------------------------------------------------------------------------ (y good, \gamma nullable)
x -> \gamma \beta_1 \alpha' | ... | \gamma \beta_n \alpha'
x -> x \alpha_1'| ..... | x \alpha_n| \beta_1 | ..... | \beta_m
------------------------------------------------ ----------------------- (x -> \beta_j good)
x -> \beta_1 x' | ... | \beta_m x'
x' -> \alpha_1 x' | .. | \alpha_n x' | \epsilon
and we add the non-terminal x' to the partial order with x < x' whenever any of \beta_j are nullable. The \beta_1,...,\beta_m are called the base cases ... if there are no base cases then there will be no productions x -> \alpha_i x' and so x' will not be reachable (and tacitly we are assuming that all actions are reachable ...) so that the whole production set can be removed.
We apply (A) in preference to
(B), so that we start by improving all our rules (enlarging
the preorder to make as much good or as possible) until there is
no choice but to transform an almost good nonterminal.
Only if these transformations (and
enlargement of the partial order) do not make all the productions
good do we resort to squeezing a rule with a hidden left
recursion:
(C) Squeezing productions with
hidden immediate left-recursions:
Let us trace through what these transformations do to an example
which requires squeezing but not null \epsilon-separation before
we return to discuss this
transformation in more detail:
a -> b
C
good
a
< b
| c
D.
good
a
< c
b -> e a
E
bad
b
< e
| c
B.
good
b < c
c ->
A.
good
e -> F
e
good
|
.
good
By trying to partially order the non-terminals to make as much
good as possible we may come up with the partial ordering above
(note that e is nullable). There is just one rule which
causes a problem. However, note that it is improvable as a
is a good non-terminal. So we improve the bad
production:
a -> b
C
good
a
< b
| c
D.
good
a < c
b -> e b C
E
bad
b < e hidden left recursion
| e c D
E
good
b
< c
| c
B.
good
c ->
A.
good
e -> F
e
good
|
.
good
This improvement now exposes a hidden left recursion.
We must squeeze the non-terminal e in the bad rule. As
e is a good non-terminal we may expand it:
a -> b
C
good
a
< b
| c
D.
good
a < c
b -> F e b C
E good
| b C E
bad
immediate left recursion
| e c D
E
good
b
< e, b < c
| c
B.
good
c ->
A.
good
e -> F
e
good
|
.
good
Now we can remove the immediate left recursion.
a -> b
C
good
a
< b
| c
D.
good
a < c
b -> F e b C E b' good
| e c D E b'
good
b< e, b < c
| c B
b'.
good
b' -> C E b'
good
|.
good
c ->
A.
good
e -> F
e
good
|
.
good
As all the rules are now good we have removed the left recursion.
expr -> expr ADD
term
--P_1
| expr SUB term --P_2 | term --P_3 term -> term MUL factor --P_4 | term DIV factor --P_5 | factor --P_6 factor -> LPAR expr RPAR --P_7 | NUM --P_8 |
expr -> term
expr'
--P_1
expr' -> ADD term expr' --P_1' | SUB term expr' --P_2' | --P_3' term -> factor term' --P_4 term' -> MUL factor term' --P_4' | DIV factor term' --P_5' | --P_6' factor -> LPAR expr RPAR --P_7 | NUM --P_8 |
There is no need to add any relations to the order in this case
as now all the productions are good ... so we have removed left
recursion.
Example 3: Consider the
grammar:
s ->
A
--R_1
| a --R_2 a -> B --R_3 | b --R_4 b -> s --R_5 | --R_6 |
This gives a good example of the role of the partial order. Remember we can choose the partial order and we shall try to do so in such a manner as to make as many of the productions as possible good (or almost good) after all we do wish to have all the productions good eventually.
Now rather than saying outright what the partial order is we actually develop it as we go. Thus, we may reason as follows: in order to make s -> a a good production we shall make s < a ; this also has the desirable effect of ensuring that s is a good non-terminal. Excellent! Next to ensure that a -> b is a good production we shall make a < b; this will also ensure that a is a good non-terminal. Marvelous!
Now it is perhaps tempting to think we can always continue in this vein making productions good by adding to the ordering. However, to see what goes WRONG consider the attempt to now make b -> s good. To do this have to demand that b < s ... however here we run into a problem for s is already less than b! So we cannot make b < s. This is where the partial order begins to play an important role for it is stopping us from having a cycle s -> a -> b -> s -> .... etc.
So now realizing that s < b, we conclude that b -> s is an improvable production. So we may improve it by expanding to the productions b -> a and b -> A. But now, while b -> A is a good, we notice that b -> a has the same problem as before, namely a is already less than b . Thus b -> a is improvable. Expanding this production gives b -> b and b -> B . This now is VERY bad news! The production b -> b is an immediate cycle and its presence after transformations means that our original grammar had cycles!! This means we can never turn this grammar into an LL(1) grammar let alone into one with no left-recursion!
Thus, our attempt to transform this grammar must be abandoned as hopeless. In retrospect, inspecting the grammar more carefully this really is no surprise as it is cyclic! But what is interesting is how this transformation procedure discovers this problem ... AND this is a little reminder that one should check for cycles before you begin ...
Example 4: Consider the
grammar:
s -> s s s
B
--
S_1 | -- S_2 |
Here the start symbol S is almost good so we
transform to:
s ->
s'
--
S'_1 s'-> s s B s' -- S'_2 | -- S'_3 |
where we add s < s'. This makes s' -> s s B s'
improvable to s' -> s' s B s' which makes s'
almost good so we may expand to obtain:
s ->
s'
--
S''_3 s' -> s'' -- S''_3 s'' -> s B s' s'' -- S''_3 | -- S''_3 |
where we add s' < s'' . This makes s'' -> s B
s' s'' improvable in two steps to s'' -> s'' B s' s''
whence making s'' almost good ... so we obtain:
s ->
s'
--
S'''_3 s' -> s'' -- S'''_3 s'' -> s''' -- S'''_3 s''' -> B s' s'' s''' -- S'''_3 | -- S'''_3 |
where we add s'' < s'''. Now all our productions are
all good so we have removed left recursion.
Exercise: Check that you can
transform the grammar below to remove left recursion!
stmts -> stmts stmt | stmt -> VAR ASSIGN expr SEMICOLON expr -> expr addop term | term addop -> ADD | SUB term -> term mulop factor | factor mulop -> MUL | DIV factor -> LPAR expt RPAR | NUM |
s -> a b C a -> A | b -> a b C | B |
Suppose we deliberately choose an inappropriate total ordering b
< a < s of the non-terminals. This leaves a
good. Notice that b -> a b C is not good as a
is nullable. The production s -> a b C is not good
either, however, it can be improved by expanding a:
s -> A b C | b C a -> A | b -> a b C | B |
This introduces a new bad production s -> b C. However we can also improve
the last rule and this gives:
s -> A b C | b C a -> A | b -> A b C | b C | B |
This makes the last rule almost good and we can remove the
immediate left recursion:
s -> A b C | b C a -> A | b -> A b C b' | B b' b' -> C b' | |
This makes all the rules good except for the second rule which now can be improved:
s -> A b C | A b C b' C | B b' C a -> A | b -> A b C b' | B b' b' -> C b' | |
Now all rules are good so that we have removed the left
recursion. However, do notice that the cost of choosing an
inappropriate order on the non-terminal was significant as the
resulting grammar is larger and less perspicuous than the grammar
which results from simply eliminating the obvious left recursion!
Unfortunately there are grammars in which one has to squeeze
a non-terminal, x,
which is not already good, in order to make progress.
In this case we are forced to resort to extreme measures.
The ideas is that, in this case, we shall replace the non-terminal
by a non-nullable version, x*
, and by nulling it. This will cause the hidden recursion to
be blocked in the first case and made more immediate in the second
case. But the question is how do we get the
non-nullable version x* of x? Our strategy, in an attempt to be
minimalistic, is to let x* inherit all the non-nullable
productions of x and an
\epsilon-separated
version of the (one) null production.
First we explain \epsilon
separation:
The translation is indicated by placing
T[ ]round the new parse tree. We have to show how each transformation gives a translation backward between the parse trees.
If we improve a -> b \alpha (production P) by
expanding along b -> \beta (production Q) to
get a new production b -> \beta \alpha
(production R ) then the synthesis translation is simply:
T[R(\alpha,\beta) ] = P(Q(\alpha),\beta)The translation for an expansion by an empty production is more complex (but still quite simple): if we improve a -> b \alpha (production P) by expanding along b -> \epsilon (production Q) we get a -> a' (production R) a' -> \alpha (production S). The translation is:
T[ R(x) ] = x (the identity)Now we turn to the translation required for the expansion which removes immediate left recursion. We have already discussed the idea when we looked at the translation from left associative form to right associative form. Here we are improving:
T[S(\alpha)] => P(Q \alpha)
a -> a \alpha_1 --P'_1to
| ...
| \beta_j --P_j
a -> \beta_1 a' --Q_1The translation is:
| ...
a' -> \alpha_1 a' --Q'_1
| ...
| --E
T[ Q_j(\beta_j,f)] = f(P_j(\beta_j))where here I have used lambda calculus style expressions.
T[Q'_j( \alpha_i,g) ] = \x -> g(P'_j(x, \alpha_i)))
T[E] = \x => x (the identity function)
When there are no cyclic derivations
x =+=> xthen we can sketch the argument that this process must terminate with a non-left-recursive grammar.
The key steps in proving this are to realize that
a -> \alpha \gamma_1 | ... | \alpha \gamma_r | \gamma_{r+1} | ... | \gamma_n
---------------------------------------------- --
a -> \alpha a' | \gamma_{r+1} | ... | \gamma_n
a' -> \gamma_1 | ... | \gamma_r
This has the benefit of removing the obvious first set
clashes. However, there is no guarantee that it will remove
all first set clashes. Note that this transformation is a
right non-terminal expansion in reverse!
Note that an LL(1) grammar necessarily has no left recursion and is necessarily left-factored.
expr -> term ADD expr | term SUB expr | term term -> factor MUL term | factor DIV factor | factor factor -> LPAR expr RPAR | NUM |
Leftmost nonterminal expansion
Given a nonterminal which occurs as the leftmost nonterminal in
the body of a production:
x -> y \alpha
we shall say that the occurrence has a first set clash
in case:
there is another production with the same head, x -> \gamma, such that first(y \alpha) is not disjoint from first(\gamma).
Notice that if there is a leftmost occurrence of a non-terminal
with a first set clash then the grammar cannot be LL(1). Given
such an occurrence we shall substitute the nonterminal to expose
the first set clash:
x -> y \alpha
y -> \beta_1 | ... | \beta_n
---------------------------------------------------------------
x -> \beta_1 \alpha | ... | \beta_n
\alpha
Transforming to LL(1)
In trying to turn a grammar into an LL(1) grammar you should try the following steps:
Health
warning:
Recall there is no guarantee this process will terminate. In
particular the expansion of nonterminals could simply blow up the
grammar making the rules bigger and bigger