CPSC411 -- Finite automata and Regular expressions
Lecture 9

Author: Robin Cockett
Date: 6 Feb 2002

Reading:
Louden: CH 2, especially 2.2, 2.3, 2.4
Dragon book: CH 3, especially 3.3, 3.6 - 3.9
(but you should read the whole chapter!)

Regular languages:

In the parsing stage of a compiler regular languages and their recognizing automata play an important role in (at least) two stages. This first is in the lexical analysis stage when the tokenizing of the input takes place. The second is in the bottom up parsing stage and the family of LR parsing algorithms which we are about to discuss which are the most commonly used tools for generating the front end.

It is important that you know the basic results on regular languages and automata: you should have covered most of this in cpsc313. The purpose of these notes is to provide a review of the salient points and in particular to remind you of the basic methods of transforming between the different representations of these languages and their recognizing automata.

Regular expressions:

What is a regular expression? We may specify a regular expression as follows (here we are actually specifying the syntax tree):

Given a set \Sigma of labels a regular expression is anything which can be formed using the following rules:

x \in \Sigma ---------- x : reg	--------- \epsilon : reg	--------- \empty : reg
r1 : reg r2 : reg ---------(union) r1 \| r2 : reg	r1 : reg r2 : reg ----------(seq) r1 r2 : reg	r : reg ----(star) r : reg*

These say:

Any label is itself a regular expression. So if \Sigma = { a, b, ...} then b would be a regular expression.
The empty string, \epsilon, is a regular expression.
The empty set is a regular expression. Note: we may have to parenthesize this operation to indicate precedence etc.
If r1 and r2 are regular expressions then r1 | r2 (the union) is a regular expression.
If r1 and r2 are regular expressions then r1 r2 (the sequence) is a regular expression.
If r is a regular expressions then r* (the Kleene star) is a regular expression.

A regular expression represents a set of strings the membership rules for regular expresssions are as follows:

---------- [x] \in x : reg	-------------- [] \in \epsilon : reg
L \in r1 : reg ------------(left union) L \in r1 \| r2 : reg	L \in r2 : reg ------------(Right union) L \in r1 \| r2 : reg
L1 \in r1: reg L2 \in r2 :reg --------------------(sequencing) L1 ++ L2 \in r1 r2 : reg
----------(basis star) [] \in r : reg*	L1 \in r : reg L2 \in r : reg* ---------------------(step star) L1 ++ L2 \in r : reg*

where ++ stands for the concatonation of strings.

These rules say:

The singleton string [x] is one of the strings described by the regular expression x.
The empty string [] is one of the strings described by the regular expression \epsilon (it is also the only such string).
If the string L1 is one of the strings described by the regular expression r1 then L1 is one of the strings described by the regularexpression (r1 | r2).
If the string L2 is one of the strings described by the regular expression r2 then L2 is one of the strings described by the regularexpression (r1 | r2).
If the string L1 is one of the strings described by the regular expression r1 and if the string L2 is one of the strings described by the regular expression r2 then the string L1 ++ L2 is one of the strings described by the regular expression r1 r2.
The empty string [] is always one of the strings described by a Kleene star of a regular expression.
If L1 is one of the strings described by the regular expression r and L2 is one of the strings described by r* then L1++L2 is also one of the strings described by r* .

Notice that nothing is in (\empty b*c) and (\empty | ab) has the same membership as ab.

Here is an example of an explicit membership proof using this system:

--------- ---------
   [a] \in a            [b] \in b
------------------           -----------
              [a,b] \in ab                          [] \in (ab)*
       -----------------------------        ---------
                                   [a,b] \in (ab)*                               [c] \in c
                               -------------------------------
                                                   [a,b,c] \in (ab)*c

Clearly constructing such proofs is not the most efficient way of establishing membership in a regular expression!

Regular expressions in LEX:

A very basic reason you must understand regular expressions is that LEX uses them to describe patterns! A pattern is a class of lexemes and LEX uses a regular expression-like-syntax to specify the class of lexemes to be recognized. This is a reminder on how lexing programs are specified and on what they mean.

A lex program has three parts: the definition section, the rules section, and the user subroutine section. These parts are separated by the symbols "%%". In the definition section one can predefine useful patterns:

WHITE     [ \t]
DIG       [0-9]
NUM       -?{DIG}+
ALHA      [a-zA-Z]
ALPHANUM [a-zA-Z0-9]
ID        {ALPHA}{ALPHANUM}*

Here things between square brackets denote a character class: inside one can have a list of single characters or ranges of characters. Once a definition is made one can use the definition by enclosing the call in curly brackets.

In the rule section one can associate actions with patterns (the actual syntax varies depending on the lexer you use and the language involved):

r1        {a1}
r2        {a2}
   ....
rn        {a3}

and in the last section one can provide additional code including a main program.

The way a lexer works conceptually is as follows: it gobbles up the input until there is no pattern which matches the input. At that stage it keeps a pointer to the input eaten and the input still unrecognized and does some action depending on the input which it has gobbled up. The action it chooses is the action associated with the first pattern in the list which still recognizes the input. Thus if L is the input which matches some of the patterns such that the next character x makes L++[x] unrecognizable by any of the patterns then L is collected as a lexeme and the action associated with the first rule whose LHS still matches L is "fired".

This action sometimes involves calling LEX itself (e.g. to skip over white spaces) ... but often will involve the formation of a token which is passed to a calling routine.

The way a LEX program implements this is by building a finite deterministic automaton with actions attached to the states. It is important that you know how this is done as it is a step in the construction of a compiler. Be thankful, however, that you do not have to do it: it has already been done!

To describe how this is done we need to know how to convert from a regular expression into a deterministic finite automaton and back.

Exercise:

(1) Let \Sigma = { a, b, c, ...} then show that the following are regular expressions:

b
(a|b)*c
((abc)*d | ab*c)
(\empty b*c)
(\empty | ab)
(\epsilon ab)
(\epsilon | ab)
a(a)* (this is often written a+).

(2) Let \Sigma = { a, b, c, ...} then prove that the following are membership relations of regular expressions:

[c] \in (a|b)*c
[a,b,c] \in (a|b)*c
[a,a,c] \in (a|b)*c
[a,c] \in ((abc)*d | ab*c)
[d] \in ((abc)*d | ab*c)
[a,b,c,a,b,c,d] \in ((abc)*d | ab*c)
[a,b,b,c] \in ((abc)*d | ab*c)
[a,b] \in (\epsilon | ab)
[] \in (\epsilon | ab)

(3) We could use a language to describe regular expressions:

where the tokens are translated as

"(" => LPAR
")" => RPAR
"*" => STAR
"|" => SLASH
[a-zA-Z] => ALPHA
"0" => NULL

Note: the regular expression "()" represents \epsilon while the empty set is denoted by "0".

Why is this different from the definition of regular expressions given above?

Is this grammar LL(1). If not can you transform it to be LL(1)?

Translating between regular expression, NFAs, and DFAs

It is a classical result of language theory (which you should have covered in cpsc313) that the descriptive power of regular languages, regular expressions, non-deterministic finite automata (NFAs), and deterministic finite automata (DFAs) are all the same. The proof that they are the same relies on a series of translations between the various formalisms.
These translations are constructive: that is they can be translated into programs which actually do the transformations. LEX and YACC actually rely very heavily on these transformation methods and so it is important that you know how they work.

The transformations actually go through two intermediate notions, here called regular-NFAs and \epsilon-NFAs which allow slightly more general notions of transition. These intermediate notions (especially the \epsilon-NFAs) are useful conceptual tools which we shall use shortly in our description of LR (bottom up) parsing.

This should be a review of thing you did in cpsc313.

Finite automata:

A (non-deterministic) finite automaton, also know affectionately as an NFA, A = (\Sigma, S, t, s0, f), consists of

\Sigma - a set of labels
S - set of states
t - a transition relation t \subset S * \Sigma * S
s0 - a start state s0 \in S
f - a set of final states f \subset S

where we may view the transition relation as giving labelled arrows:

a
S1 ------> S2 === t(S1,a,S2)

A finite automaton is said to be determinstic, or affectionately as a DFA, in case for each label every state has at most one transitions out of it with that label. That is:

t(S,a,S1) and t(S,a,S2) implies S1 = S2

An automaton recognizes a list of labels [a1,...,an] in case that list of labels can drive it from the start state to a final state. That is there is a sequence of transitions t(Si-1,ai,Si) such that

a1 a2 a3 an
s0 ----> S1 -----> S2 ----> ... ----> Sn

and Sn \in f.

An \epsilon-NFA is an automaton which allows \epsilon-transitions as well as those labelled by the labels themselves. An \epsilon-NFA recognizes [a1,..,an] if there is a sequence of transitions as above where in addition between any two labels one allows an arbitrary number of \epsilons transitions (shown by wiggley arrows):

ai ai+1
------> Si ~~~~> Si' ~~~~> .... ~~~~> Si+1 ---->

A regular-NFA is a slightly different beast. It must have a start state IN which has no transitions in into it and a (unique) final state OUT which has no transitions out. Furthermore, its transitions are labelled by regular expressions (in the given label set). By amalgamating the transition between any two states using the union we can arrange that there is (at most) one transition between any two states. This we shall call a regular-NFA matrix:

	s1	si	sn	OUT
IN
s1

sj		r_ji

sn

where r_ji is the regular expressions on the transition sj ---> si (of course this could be the empty regular expression, 0, which would indicate that there was no transition).

Regular grammars are descriptively equivalent to NFAs:

A regular grammar is a context free grammar whose productions have one of two forms:

a -> B c. where B is a terminal and a,c are non-terminal
a -> B.

Letting the non-terminals be states there is an immediation translation of this notion into a NFA. The only difficulty is the necessity of adding a final state: to affect this we will also allow empty productions:

a -> .

usually these are not allowed in texts as empty productions are usually avoided as bad things: but there is no harm in them (honest)!

Conversely an NFA can be converted into a regular grammar where a transition

A
a -----> b

becomes a production

a -> Ab

and final states c have empty productions

c -> .

Clearly the language these two forms describe is then the same.

Translation from NFAs to regular expressions:

First we shall look at a method to translate any NFA into a regular expression. Our aim here is not to generate a particularly good regular expression but rather to show that we can generate one.

Our first step is to form a regular-NFA by adding a new start state with an \epsilon transition from the old start state and adding a new final state with an \epsilon transition from the old final states and by amalgating multiple transitions
a1, ..,an between two states into a single regular transition (a1|...|an). If there is no transition between two states we shall mark the transition as being empty, 0.

Consider
                                                 _ B
                    A                       /    \
           ^ s0 ---------> s1v             s0 = start, s1 = final
          /     |    ^                      /
          \ _ /       \ A             / B
         B              \             v
                                s2

then the matrix is

	s0	s1	s2	OUT
IN	\e	0	0	0
s0	B	A	0	0
s1	0	B	B	\e
s2	A	0	0	0

where \e is the \epsilon transition

We now "remove" state sk by noting that each transition si to sj must be modified on removal of sk by

rij := rij | rik (rkk)* rkj

to account for the possibility that it may have made a transition via sk (this for the theoretically minded is a trace operation and is actually related to the usual notion of the trace of a matrix). You should satisfy yourself that this indeed will recognize the same set of strings as before.

Thus removing s1 from this matrix gives the new matrix:

	s0	s2	OUT
IN	\e	0	0
s0	B	AB*B	AB*
s2	A	0	0

we can now remove s2 to obtain

	s0	OUT
IN	\e	0
s0	(B\|AB*BA)	AB*

and lastly we can remove s0

	OUT
IN	(B\|ABBA)AB*

Translation from regular expression to \epsilon-NFA:

Given a regular expression r we may regard it as a regular-NFA:

r
IN --------> OUT

in order to turn this into an \epsilon-NFA we may perform a number of rewrites:

r1 \| r2 S0 ----------> S1	======>	r1 -----------> S0 S1 -----------> r2
r1 r2 S0 ----------> S1	======>	r1 r2 S0 ------> S ------> S1
r* S0 ----------> S1	======>	<~~~~~~ r S0~~~>S0'------>S1'~~~>S1 ~~~~~~~~~~~~~~~~~~>

which at each step will reduce the complexity of the regular expressions until they are simple labels. In this manner we will obtain an \epsilon-NFA.

Translation from an \epsilon-NFA to a DFA:

There is a standard translation from an NFA to a DFA which uses the subset construction: this sets the states to be subsets of the original states and the transitions to be

t(U ,A,U' ) <=> for every s' in U' there is an s in U with t(s,A,s').

The new start state is the singleton {s0} where s0 is the old start state. U is a final state provided there is an s in U which is final.

Now when there are \epsilon transitions we have to modify this technique to take the \epsilon-closure of the subsets.

If U is a subset, the \epsilon-closure of U is all those states which can be reached by a series of zero or more \epsilon-transitions from a state of U. In other words U is \epsilon-closed in case whenever

s ~~~~~~~~> s'

and s is in U then necessarily s' is in U.

Let e(U) denote the \epsilon-closure of U then the usual subset construction from an NFA must be modified

states = \epsilon-closed subsets
start state = e({s0}) where s0 was the original start state
final states = all \epsilon-closed subsets which contain an old final state
transition = t(U ,A, e(U') ) where s' in U' if and only if there is an s in U such that t(s,A, s').

The result of this construction is a DFA.

(Note: The empty set of states is actually an \epsilon closed subset. It acts as a sink for the transitions which are not allowed. When one minimizes the automaton one removes states from which a final state cannot be reached. Thus, in the process of minimization this state is removed: therefore one often removes the empty set from consideration.)

The minimal DFA

It is well known that there is for each regular language an essentially unique deterministic finite automaton which recognizes that language. The formation of the minimal automaton is also important in producing finite good automata for both LEX and YACC. Therefore, again you must be familiar with how one forms a minimal automaton.

There are two important aspects:

Reachable and realizable states: You do not need to retain any states which cannot be reached from the start state. Dually you do not need to retain any states which cannot reach a final state.
Equivalent sets of states: If s1,.., sn are states from which exactly the same inputs can be recognized then you do not need to distinguish these states and they can be amalgamated.

To determine the set of reachable states you can use a marking algorithm: succesively mark states reachable by a single transition from reachable states (start by marking the start state as reachable!). When everyreachable state has all its transitions to states already marked reachable you have finished the marking .... you should then retain only the marked states.

Dually you can mark all the realizable states and retain only the marked states.

Coming up with the equivalence relation is a little tricky in general. Often one can amalgamate states in pairs. However, in general, one starts with a proposed equivalence relation E, where we write si =_E sj to mean these states are equivalent under E, then you can apply it in case:

si =_E sj implies si is a final state if and only if sj is (all or none final)
si =_E sj implies for every label A and there is a state s such that t(si,A,s) if and only if there is a state s' such that t(sj,A,s') (same transitions out);
si =_E sj implies for every label A and if t(s1,A,s) and t(s2,A,s') then s =_E s' (all transitions are to equivalent states).

To minimize the automata one simply extracts the reachable and realizable states and then one amalgamates equivalent states until there are no more equivalent states to be almalgamated. The result is the minimal automaton.

Another common way to minimize an automaton is to construct the maximal equivalence relation by refinement. This is done by proposing equivalence relations which are coarser (i.e. amalgmate more things) than the maximal equivalence and then refining them. To this end it is useful to starts by guessing that two states are equivalent provided they satisfy the first two rules above: that is they have the same finishing and transition out behavior. We can then refine this guess by looking at whether the third rule works. If one state in an equivalence class allows a transion on A they all must do so (by the second rule). By the third rule, equivalent states should transition by A into the same equivalence class. If they do not we are forced to split (or refine) this equivalence class to make this true. We can keep going like this, testing transitions, until no more refinements are forced on us.

Exercises:

Prove that any regular language can be recognized by an LL(1) grammar.
Provide an LL(1) grammar whose language is not regular.
When is a regular grammar LL(1)?
Draw an NFA and convert it into a regular expression.
Is it possible to directly "reverse" the trace operations to provide an alternative way of translating from a regular expression to \epsilon-NFA?
What is the \epsilon-NFA generated by (B|AB*BA)*AB* ?
What is the DFA generated by (B|AB*BA)*AB* ?
Convert the regular grammar below to a regular expression.

s -> A a | B b.
b -> B b | A a | C c.
c -> C c |.