Author: Robin Cockett
Date: 6 Feb 2002
Reading:
Louden: CH 2, especially 2.2, 2.3, 2.4
Dragon book: CH 3, especially 3.3, 3.6 - 3.9
(but you should
read the whole chapter!)
Regular languages:
In the parsing stage of a compiler regular languages and their recognizing automata play an important role in (at least) two stages. This first is in the lexical analysis stage when the tokenizing of the input takes place. The second is in the bottom up parsing stage and the family of LR parsing algorithms which we are about to discuss which are the most commonly used tools for generating the front end.
It is important that you know the basic results on regular languages and automata: you should have covered most of this in cpsc313. The purpose of these notes is to provide a review of the salient points and in particular to remind you of the basic methods of transforming between the different representations of these languages and their recognizing automata.
What is a regular expression? We may specify a regular expression as follows (here we are actually specifying the syntax tree):
Given a set \Sigma of labels a regular expression is anything which can be formed using the following rules:
|
A regular expression represents a set of strings the membership rules
for regular expresssions are as follows:
---------- [x] \in x : reg |
-------------- [] \in \epsilon : reg |
L \in r1 : reg ------------(left union) L \in r1 | r2 : reg |
L \in r2 : reg ------------(Right union) L \in r1 | r2 : reg |
L1 \in r1: reg L2 \in r2 :reg --------------------(sequencing) L1 ++ L2 \in r1 r2 : reg |
|
----------(basis star) [] \in r* : reg |
L1 \in r : reg L2 \in r* : reg ---------------------(step star) L1 ++ L2 \in r* : reg |
where ++ stands for the concatonation of strings.
These rules say:
Notice that nothing is in (\empty b*c) and (\empty | ab)
has the same membership as ab.
Here is an example of an explicit membership proof using this system:
--------- ---------
[a] \in a
[b] \in b
------------------
-----------
[a,b] \in ab
[] \in (ab)*
-----------------------------
---------
[a,b] \in (ab)*
[c] \in c
-------------------------------
[a,b,c] \in (ab)*c
Clearly constructing such proofs is not the most efficient way of establishing membership in a regular expression!
A very basic reason you must understand regular expressions is that LEX uses them to describe patterns! A pattern is a class of lexemes and LEX uses a regular expression-like-syntax to specify the class of lexemes to be recognized. This is a reminder on how lexing programs are specified and on what they mean.
A lex program has three parts: the definition section, the rules section, and the user subroutine section. These parts are separated by the symbols "%%". In the definition section one can predefine useful patterns:
WHITE [ \t]
DIG [0-9]
NUM -?{DIG}+
ALHA [a-zA-Z]
ALPHANUM [a-zA-Z0-9]
ID {ALPHA}{ALPHANUM}*
Here things between square brackets denote a character class: inside one can have a list of single characters or ranges of characters. Once a definition is made one can use the definition by enclosing the call in curly brackets.
In the rule section one can associate actions with patterns (the actual syntax varies depending on the lexer you use and the language involved):
r1 {a1}
r2 {a2}
....
rn {a3}
and in the last section one can provide additional code including a main program.
The way a lexer works conceptually is as follows: it gobbles up the input until there is no pattern which matches the input. At that stage it keeps a pointer to the input eaten and the input still unrecognized and does some action depending on the input which it has gobbled up. The action it chooses is the action associated with the first pattern in the list which still recognizes the input. Thus if L is the input which matches some of the patterns such that the next character x makes L++[x] unrecognizable by any of the patterns then L is collected as a lexeme and the action associated with the first rule whose LHS still matches L is "fired".
This action sometimes involves calling LEX itself (e.g. to skip over white spaces) ... but often will involve the formation of a token which is passed to a calling routine.
The way a LEX program implements this is by building a finite deterministic automaton with actions attached to the states. It is important that you know how this is done as it is a step in the construction of a compiler. Be thankful, however, that you do not have to do it: it has already been done!
To describe how this is done we need to know how to convert from a regular expression into a deterministic finite automaton and back.
Exercise:
(1) Let \Sigma = { a, b, c, ...} then show that the following are regular expressions:
(2) Let \Sigma = { a, b, c, ...} then prove that the following
are membership relations of regular expressions:
(3) We could use a language to describe regular expressions:
reg -> label
| NULL
| LPAR reg reg_or_rest RPAR
| LPAR reg_seq_rest RPAR
| reg STAR.
reg_or_rest -> SLASH reg reg_or_rest
|.
reg_seq_rest -> reg reg_seq_rest
|.
label -> ALPHA.
where the tokens are translated as
"(" => LPAR
")" => RPAR
"*" => STAR
"|" => SLASH
[a-zA-Z] => ALPHA
"0" => NULL
Note: the regular expression "()" represents \epsilon while the empty set is denoted by "0".
Why is this different from the definition of regular expressions given above?
Is this grammar LL(1). If not can you transform it to be
LL(1)?
Translating between regular expression, NFAs, and
DFAs
It is a classical result of language theory (which you should have
covered in cpsc313) that the descriptive power of regular
languages, regular expressions, non-deterministic finite automata
(NFAs), and deterministic
finite automata (DFAs) are all the same. The proof that they are
the
same relies on a series of translations between the various formalisms.
These translations are constructive: that is they can be
translated into programs which actually do the transformations.
LEX and YACC
actually rely very heavily on these transformation methods and so it is
important that you know how they work.
The transformations actually go through two intermediate notions, here called regular-NFAs and \epsilon-NFAs which allow slightly more general notions of transition. These intermediate notions (especially the \epsilon-NFAs) are useful conceptual tools which we shall use shortly in our description of LR (bottom up) parsing.
This should be a review of thing you did in cpsc313.
A (non-deterministic) finite automaton, also know affectionately as an NFA, A = (\Sigma, S, t, s0, f), consists of
a
S1 ------> S2 === t(S1,a,S2)
A finite automaton is said to be determinstic, or
affectionately
as a DFA, in case for each label every state has at most one
transitions out of it with that label. That is:
An automaton recognizes a list of labels [a1,...,an] in case that list of labels can drive it from the start state to a final state. That is there is a sequence of transitions t(Si-1,ai,Si) such thatt(S,a,S1) and t(S,a,S2) implies S1 = S2
a1 a2 a3 anand Sn \in f.
s0 ----> S1 -----> S2 ----> ... ----> Sn
An \epsilon-NFA is an automaton which allows \epsilon-transitions as
well as those labelled by the labels themselves. An \epsilon-NFA
recognizes [a1,..,an] if there is a sequence of transitions as above
where in addition between any two labels one allows an arbitrary number
of \epsilons transitions (shown by wiggley arrows):
ai ai+1
------> Si ~~~~> Si' ~~~~> .... ~~~~> Si+1 ---->
A regular-NFA is a slightly different beast. It must have a start
state IN which has no transitions in into it and a (unique)
final state OUT which has no transitions out.
Furthermore, its
transitions are labelled by regular expressions (in the given label
set).
By amalgamating the transition between any two states using the union
we
can arrange that there is (at most) one transition between any two
states.
This we shall call a regular-NFA matrix:
s1 | si | sn | OUT | |||
IN | ||||||
s1 | ||||||
sj | r_ji | |||||
sn |
where r_ji is the regular expressions on the transition sj
---> si (of course this could be the empty regular expression, 0,
which would indicate that there was no transition).
A regular grammar is a context free grammar whose productions have one of two forms:
a -> B
c. where B is a
terminal and a,c are non-terminal
a -> B.
Letting the non-terminals be states there is an immediation translation of this notion into a NFA. The only difficulty is the necessity of adding a final state: to affect this we will also allow empty productions:
a -> .
usually these are not allowed in texts as empty productions are usually avoided as bad things: but there is no harm in them (honest)!
Conversely an NFA can be converted into a regular grammar where a transition
Abecomes a production
a -----> b
a -> Ab
and final states c have empty productions
c -> .
Clearly the language these two forms describe is then the same.
First we shall look at a method to translate any NFA into a regular expression. Our aim here is not to generate a particularly good regular expression but rather to show that we can generate one.
Our first step is to form a regular-NFA by adding a new start
state with an \epsilon transition from the old start state and adding a
new
final state with an \epsilon transition from the old final states and
by
amalgating multiple transitions
a1, ..,an between two states into a single regular
transition (a1|...|an). If there is no transition
between two states we shall mark the transition as being empty, 0.
Consider
_ B
A
/ \
^ s0
--------->
s1v
s0 = start, s1 = final
/ |
^
/
\ _
/ \
A
/ B
B
\
v
s2
then the matrix is
s0 | s1 | s2 | OUT | |
IN | \e | 0 | 0 | 0 |
s0 | B | A | 0 | 0 |
s1 | 0 | B | B | \e |
s2 | A | 0 | 0 | 0 |
where \e is the \epsilon transition
We now "remove" state sk by noting that each transition si to sj must be modified on removal of sk by
rij := rij | rik (rkk)* rkj
to account for the possibility that it may have made a transition via sk (this for the theoretically minded is a trace operation and is actually related to the usual notion of the trace of a matrix). You should satisfy yourself that this indeed will recognize the same set of strings as before.
Thus removing s1 from this matrix gives the new matrix:
s0 | s2 | OUT | |
IN | \e | 0 | 0 |
s0 | B | AB*B | AB* |
s2 | A | 0 | 0 |
we can now remove s2 to obtain
s0 | OUT | |
IN | \e | 0 |
s0 | (B|AB*BA) | AB* |
and lastly we can remove s0
OUT | |
IN | (B|AB*BA)*AB* |
Given a regular expression r we may regard it as a regular-NFA:
r
IN -------->
OUT
in order to turn this into an \epsilon-NFA we may perform a number
of rewrites:
r1 | r2 S0 ----------> S1 |
======> | r1 -----------> S0 S1 -----------> r2 |
r1 r2 S0 ----------> S1 |
======> | r1
r2 S0 ------> S ------> S1 |
r* S0 ----------> S1 |
======> | <~~~~~~ r S0~~~>S0'------>S1'~~~>S1 ~~~~~~~~~~~~~~~~~~> |
which at each step will reduce the complexity of the regular expressions until they are simple labels. In this manner we will obtain an \epsilon-NFA.
There is a standard translation from an NFA to a DFA which uses the subset construction: this sets the states to be subsets of the original states and the transitions to be
The new start state is the singleton {s0} where s0 is the old start state. U is a final state provided there is an s in U which is final.t(U ,A,U' ) <=> for every s' in U' there is an s in U with t(s,A,s').
Now when there are \epsilon transitions we have to modify this technique to take the \epsilon-closure of the subsets.
If U is a subset, the \epsilon-closure of U is all those states which can be reached by a series of zero or more \epsilon-transitions from a state of U. In other words U is \epsilon-closed in case whenever
and s is in U then necessarily s' is in U.s ~~~~~~~~> s'
Let e(U) denote the \epsilon-closure of U then the usual subset construction from an NFA must be modified
(Note: The empty set of states is actually an \epsilon closed subset. It acts as a sink for the transitions which are not allowed. When one minimizes the automaton one removes states from which a final state cannot be reached. Thus, in the process of minimization this state is removed: therefore one often removes the empty set from consideration.)
It is well known that there is for each regular language an essentially unique deterministic finite automaton which recognizes that language. The formation of the minimal automaton is also important in producing finite good automata for both LEX and YACC. Therefore, again you must be familiar with how one forms a minimal automaton.
There are two important aspects:
Dually you can mark all the realizable states and retain only the
marked states.
Coming up with the equivalence relation is a little tricky in general. Often one can amalgamate states in pairs. However, in general, one starts with a proposed equivalence relation E, where we write si =_E sj to mean these states are equivalent under E, then you can apply it in case:
To minimize the automata one simply extracts the reachable and
realizable states and then one amalgamates equivalent
states until there are no more equivalent states to be
almalgamated. The result is the minimal automaton.
Another common way to minimize an automaton is to construct the
maximal equivalence relation by refinement. This is done by
proposing equivalence relations which are coarser (i.e. amalgmate
more things) than the maximal equivalence and then refining
them. To this end it is useful to starts by
guessing that two states are equivalent provided they satisfy the
first two rules above: that is they have the same finishing and
transition out behavior. We can then refine this guess by
looking at whether the third rule works. If one state in an
equivalence class allows a transion on A they all must do so (by the
second rule). By the third rule, equivalent states should
transition by A into the same
equivalence class. If they do not we are forced to split (or
refine) this equivalence class to make this true. We can keep
going like this, testing transitions, until no more refinements are
forced on us.
s -> A a | B b. b -> B b | A a | C c. c -> C c |. |