CPSC 411 Compiler construction I

Author: Robin Cockett
Date: 9 Jan. 2012

Assignment #1 and #2: Basic compiler for M- 

Due date #1: 27 January 2012
Due date #2: 24 Febuary 2012


Your task is to implement a compiler to "basic stack machine code" (described below) for the language "Minisculus" whose syntax is given by the grammar below.

These assignments have a series of steps: for the main part #2 you must write a recursive descent parser (i.e. you cannot use YACC yet!) for Minisculus which creates a syntax tree (which you must print).  You must then use this tree to produce basic stack machine code (which you should be able to run).  For part  #1of the assignment you must create a lexical analyzer using LEX to produce a list of tokens (which you should print as a space separated list of just token names).  The purpose of the assignment is to introduce you to scanning and recursive descent parsing.

It is important to attend the labs as they will be discussing this assignment and LEX: they will  also discuss the Minisculus language.



 Minisculus grammar:
prog -> stmt.
stmt -> IF expr THEN stmt ELSE stmt
            | WHILE expr DO stmt
            | READ ID
            | ID ASSIGN expr
            | PRINT expr
            | BEGIN stmtlist END.
stmtlist -> stmtlist stmt SEMICOLON
            |.
expr -> expr addop term
            | term.
addop -> ADD
            | SUB.
term -> term mulop factor
            | factor.
mulop -> MUL
            | DIV.
factor -> LPAR expr RPAR
            | ID
            | NUM
            | SUB NUM.



This grammar has left recursive nonterminals.  The grammar should be transformed to remove the left recursion and whence into and LL(1) grammar before being implemented by recursive descent parsing.



Where the tokens above stand for the following lexemes:

"if" => IF
"then" => THEN
"while" => WHILE
"do" => DO
"read" => READ
"else" => ELSE
"begin" => BEGIN
"end" => END
"print" => PRINT
{alpha}[{digit}{alpha}]* => ID (identifier)
{digit}+ => NUM (positive integer)
"+" => ADD
":=" => ASSIGN
"-" => SUB
"*" => MUL
"/" => DIV
"(" => LPAR
")" => RPAR
";"=> SEMICOLON


Minisculus comments:

Minisculus has two types of comments:
The two sorts of comment interact!  One-line comments should take precedence over the multi-line comments.  Thus, the eventual behaviour should be equivalent to first stripping out the one-line comments and then removing the (possibly nested) multi-line comments.  If you propose a different behavior please explain it and be sure to provide test cases.


EXAMPLES:

Basic tests for Lexer:
Here are some basic test files for your lexer and for code generation:
       test1, test 2, test3, test4, test5, test6, test9.

Using Alex to get multi-line comments:
Alex is a parser generator which produces Haskell code ....  Using the monad wrapper one can easily program basic multi-line comments using the startcodes.  However, to get nested multiline comments harder!    Here is one way to do it involving the use of a new wrapper (specially written for you :-) )!   To access the new wrapper (here)  "AlexWrapper-monadUser"  put it in a directory with "AlexTemplate" which you can copy from "/usr/local/lib/alex-2.0.".  Now when you call Alex on your file redirect Alex to look for this wrapper with the "-t" option (i.e. write "alex -t./mywrappers <file>.x").  Here is a very basic example of how to use the wrapper (due to Brett Giles) which accesses the user state (also some explicit instructions as to how to access the wrapper).   Here is another example which illustrates how to change the start codes depending on the user state.


Minisculus program:

Here is an example program:
/* This program calculates the
    factorial of the number which
    is read */

begin
    read x;                    % read a number .. 
    y := 1;
    while x do
        begin
            y :=  y * x;
            x :=  x - 1;
         end;
    print y;
end


Code generation:

Typical Minisculus programs fragments are:

begin
y:= 23;
x:= 13 + y;
print x;
end

begin
if y then x:= 10
       else x:= 1;
z:= z * x;
end

(where we assume here that the variables x, y, and z must have been initialized earlier: Note that for conditionals and while statements zero is false anything else is true). These fragments are translated into a stack machine code:

cPUSH 23
LOAD y
cPUSH 13
rPUSH y
OP2 +
LOAD x
rPUSH x
PRINT

rPUSH y
cJUMP   L1
cPUSH 10
LOAD x
JUMP   L2

L1:
cPUSH 1
LOAD x
L2:
rPUSH z
rPUSH x
OP2 *
LOAD z


where


 Here is an implementation of this stack machine in the shell: source this then source the file your M- compiler produces!
# ... aliases for Robin's stack machine.
# This file should be "sourced" prior to executing
# stack machine files.

set stack = ""
alias cPUSH            'set stack = (\!:1 $stack)'
alias rPUSH            'set stack = ($\!:1 $stack)'
alias sPUSH            '@ stack[1] = $stack[1] + 1 ; set stack[1] = $stack[$stack[1]]'
alias LOAD            'eval "set \!:1 = \$stack[1] ; shift stack"'
alias OP2                'eval "@ stack[2] = \$stack[2] \!:1 \$stack[1]"; shift stack'
alias cJUMP           'set tos = $stack[1]; shift stack; if ($tos == 0) goto \!:1'
alias JUMP             goto
alias PRINT            'echo $stack[1]; shift stack'
alias READ            'eval "set \!:1 = $< " '
 

Some test programs:
Here are some simple test programs: there are errors in some of them! 
Mtest1, Mtest2, Mtest3, Mtest4, Mtest5, Mtest6, Mtest7, Mtest8.
You should also invent at least two of your own test programs.



HINTS:   Step by step approach to the assignment ...
  1. Use LEX to write a parser.   Print the list of tokens!   This complete's part #1 of the assignment.
  2. Transform the grammar!  As it stand it is not LL(1) you need to remove the immediate left recursion from stmtlist, term, expr.  The resulting grammar is LL(1) but still does not have terminals starting every rule ...  this does mean that when you do a recursive descent parser you may have to guard the function to eat input corresponding to a production with a check that the input has the appropriate first set.  Alternately you can expand first non-terminals until terminals are exposed (mulop and addop are the problems here). 
  3. Write a really simple recursive descent parser (which does not build a syntax tree).   Here is a very simple example of this.  Here is an example of this which uses the PARSEC system to tokenize the input: note the datatype for the Tokens are contained in the this file  ... !IMPORTANT!: initially ignore the  PARSEC code but look at the datatype for tokens for expressions.  You need to create a token datatype for M- (see suggestions below).
  4. Now you need to modify this code to build the syntax tree.  
  5. You have now built the syntax tree but you really wanted stack machine code.  To do this I suggest you modify the show function of the syntax tree in order to output stack machine code: the only complication here is you need to pass a counter round to give you fresh label numbers when you want them.  
  6. Modify the program to take input from a file and put the result in a file.   Look at how this is done at the end of this file
  7. Now improve the error handling. 



Some more detailed notes for Haskell:

Building the syntax tree ...

First you have to decide what the syntax tree and token datatype will actually look like.  Decisions, decisions!
First the Tokens ....
               data  Token = T_Identifier String                   --- a token for identifiers
                                    | T_num   Int                              --- a token for numbers
                                    | T_while                                   --- a token to catch the keyword "while"
                                         ....
Next the syntax tree: what does it look like?
               data Stmt = If Exp Stmt Stmt                        -- datatype for statements  starting with an if ... then ... else statement
                                | Until Stmt  Exp
                                | Assign String Exp
                                | Block [Stmt]
                                | Print Expr
                                | Input Expr
               data Exp = Add Exp Exp  
                               | Mul Exp Exp
                               | Div Exp Exp
                               | Neg Exp
                               | Id String
                               | Num Integer

The parser is then a function:
                           minParser: [(Pos,Token)] -> Stmt
whoops  .... so you also pass back from the lexer where (a position) in the input list your token occurs:  this is for error handling to start with ignore this and use only the token.

Now for each nonterminal you need to write a function.   If things work out well you can simply choose the production by matching against the first nonterminal.  A Haskell code snippet is as follows:

     stmt:: [(
Pos,Token)] -> (Stmt,[(Pos,Token)])
         ...
     stmt ( (_,T_if) : rest) = (S_if e s1 s2,rest') where
                (e,rest1) = expr rest
                rest2 = eat_token T_then rest1
                (s1,rest3) = stmt rest2
                rest4 = eat_token T_else rest3
                (s2,rest') = stmt rest4
     stmt( (_,T_while):rest) = ....

so what does this do?   If the current input starts with the "if" token then eat it and pass what is left of the token list into the expression parser it produces an expression e and gives back the tokens it has not eaten in rest1.  The first token of rest1 must be "then" eaten it this leaves rest2, the first part of rest2 must be an expression so you pass this to the expression eater etc.

Now having written the code in this form  I can actually simplify it a bit by not having so many intermediate variables.  In fact there is more that I can do but this is quite enough to start with.

     stmt:: [(Pos,Token)] -> (Stmt,[(Pos,Token)])
         ...
     stmt ( (_,T_if) : rest) = (S_if e s1 s2,rest') where
                (e,rest1) = expr rest
                (s1,rest2) = stmt (
eat_token T_then rest1)
                (s2,rest') = stmt (
eat_token T_else rest2)
     stmt( (_,T_while):rest) = ....

An amazing feature of Haskell is that, in the "where" clause (the indented part), I can actually write the definitions in any order --  it does not matter: Haskell sorts out how to use the functions.

OK that is the relatively easy part: there are two aspects which are a bit more subtle. 

The role of first sets ...

First how do you implement a rule like ...

       term1 -> mulop factor term1
                    |.

The problem is that you cannot match on the first terminal directly.   So what do you do?  Well one trick is to use the guards to test the first set:

       term1 [] = ...
       term1 (t:ts) | first_mulop t =   ....
                         | otherwise = ....

where first_mulop returns True if t is a token in the first set of the first production (yes you have to write first_mulop!).    Another approach (which is simpler) is just to bite the bullet and do an explicit match agains all the possibilities in the first set (you are going to have to separate them anyway!):

       term1 [] = ...
       term1 ((_,T_mul):ts)  =   ....
       term1((_,T_div):ts) =  ....

Making sure you have what you need ...

Now there is also a further difficulty with this nonterminal as it appears that one wishes to build an expression of the form MUL ? e where you don't seem to know the first argument although you do know the second argument.   So lets look carefully at two ways of surmounting this problem.  I am going to simplify matters now by assuming that the pattern matching solution is used for the problem above on the first terminal:

A. Pass what you need in!  So now term1 expects another argument namely an expression produced by the code which calls it.   What does this look like?
           term1:: Exp -> [(Pos,Token)] -> (Exp,[(Pos,Token)])
           term1 e ((_,T_mul):ts) = (MUL e e',ts')  where
                     (e1,ts1) = factor ts
                     (e',ts') = term1 e1 ts1
           term1 e ts = (e,ts)
B. Build an incomplete term! 
This is more sophisticated ....  This way term1 does not expect another argument but it returns a function (think of this as an incomplete tree) which when it is eventually given the missing information will produce the right thing.   SO what does this look like?
           term1::
[(Pos,Token)] -> (Exp -> Exp,[(Pos,Token)])   
           term1 ((_,T_mul):ts) = (\x -> f(MUL x e'),ts')  where
                     (e', ts1) = factor ts
                     (f, ts') = term1 ts1
           term1 ts = (\x -> x,ts)
 
OK so notice the call to term1
(line 4) produces a function f which is applied to the factor found, e, multiplied by x.    The expression \x -> f(MUL x e') is a function (the incomplete tree) which given a value for x will evaluate its body f(MUL x e') with the value substituted for x (and in this way a complete tree is built).  Haskell is quite happy to pass functions around (in fact, it is normal to do so in a functional language).  Notice that the second phrase of this code returns the identity function  \x -> x (literally give me x and I will return x). 

I am going to rewrite this second) code another, completely equivalent, but slightly different way so you can see how to express this passing of function in Haskell in different ways:
           term1:: [(Pos,Token)] -> (Exp -> Exp,[(Pos,Token)])   
           term1 ((_,T_mul):ts) = (g,ts')  where
                     (e', ts1) = factor ts
                     (f, ts') = term1 ts1
                     g x =
f(MUL x e')                   --  g::  Exp -> Exp
           term1 ts = (g,ts)  where
                     g x = x
                                    --  g::  Exp -> Exp

Now, I have to 'fess up, in fact, the fiist version of term1 and the later versions do not do quite the same thing  as the latter code associates the multiplication to the left while the first associates to the right (not too bad as multiplication is associative!!).  In fact the latter code is more correct to the specification as (arguably) we intended to associate the multiplication to the left.  

This function which returns a higher order function Exp -> Exp is then used in term as follows:
            term::
[(Pos,Token)] -> (Exp,[(Pos,Token)])
            term ts = (f e,ts') where
                      (e,ts1) = factor ts
                      (f,ts') = term1 ts1
Notice the application f e in th returned tuple ... now there are no half built expressions left at this level as you have supplied the missing bit of the expression.

The backend: producing the stack code

Producing the stack code from the syntax tree is not too difficult the only thing you must remember to do is to pass a counter to ensure you get new labels.   Here is a  code snippet for generating a conditional:

               shower::  Int  ->  Stmt -> (Int,String)
               shower n (IF e s1 s2) = 
                       ( ( show e) ++"cJumP label"++(show n)++"\n"
                                         ++ code1
                                         ++"JUMP label"++(show (n+1))++"\n"
                                         ++"label"++(show n)++":\n"
                                         ++code2
                                         ++"label"++(show (n+1))++":\n"
                         , m)   where
                      (code1,n') = shower (n+2) s1
                      (code2,m) = shower n' s2
               .....
Notice this uses the same sorts of techniques as were used for generating the syntax tree.

Error handling ....

Right now if there is an error your parser would probably die in the most ungraceful fashion!!  What do you have to do to make this more graceful?   Essentially you must add a pattern to each nonterminal to catch the case when things go wrong ... for example

               factor ((_,(T_num n)):ts) = ....
               factor ((_,T_lpar):ts) = ....
                   ...
               factor ((pos,_):ts) = error "Error in Factor: expected (, and identifier, or a number"++(show pos)

Well you can play with the wording to make it look good ......