CPSC 411 Compiler construction I

Author: Robin Cockett
Date: 15 Jan. 2016

Assignment #1 and #2: A compiler for Minisculus to basic stack machine code

Due date #1: 31 January 2015
Due date #2: 21 February 2015


Your task is to implement a compiler to "basic stack machine code" (described below) for the language M- whose syntax is given by the grammar below.

These assignments have a series of steps: for the main part #2 you must write a recursive descent parser (i.e. you cannot use YACC yet!) for M- which creates a syntax tree (which you must print).  You must then use this tree to produce basic stack machine code (which you should be able to run).  For part  #1of the assignment you must create a lexical analyzer using LEX to produce a list of tokens (which you should print as a space separated list of just token names). 

The purpose of the assignment is to introduce you to scanning and recursive descent parsing.

It is important to attend the labs as they will be discussing this assignment: they will  also discuss the Minisculus language.



 Minisculus grammar:
prog -> stmt.
stmt -> IF expr THEN stmt ELSE stmt
            | WHILE expr DO stmt
            | INPUT ID
            | ID ASSIGN expr
            | WRITE expr
            | BEGIN stmtlist END.
stmtlist -> stmtlist stmt SEMICOLON
            |.
expr -> expr addop term
            | term.
addop -> ADD
            | SUB.
term -> term mulop factor
            | factor.
mulop -> MUL
            | DIV.
factor -> LPAR expr RPAR
            | ID
            | NUM
            | SUB NUM.



This grammar has left recursive nonterminals.  The grammar should be transformed to remove the left recursion and whence into a grammar suitable for recursive descent parsing before implementing it.



Where the tokens above stand for the following lexemes:

"if" => IF
"then" => THEN
"while" => WHILE
"do" => DO
"input" => INPUT
"else" => ELSE
"begin" => BEGIN
"end" => END
"write" => WRITE
{alpha}[{digit}{alpha}]* => ID (identifier)
{digit}+ => NUM (positive integer)
"+" => ADD
":=" => ASSIGN
"-" => SUB
"*" => MUL
"/" => DIV
"(" => LPAR
")" => RPAR
";"=> SEMICOLON


Minisculus comments:

Minisculus has two types of comments:
The two sorts of comment interact!  One-line comments should take precedence over the multi-line comments.  Thus, the eventual behavior should be equivalent to first stripping out the one-line comments and then removing the (possibly nested) multi-line comments. 


TEST FILES:

Basic tests for Lexer:
Here are some basic test files for your lexer and for code generation:
       test1, test 2, test3, test4, test5, test6, test9.









M- program:

Here is an example program (thanks to Hayder Casey for correction!):
/* This program calculates the
    factorial of the number which
    is input */

begin
    input x;                    % input a number .. 
    y := 1;
    while x do
        begin
            y :=  y * x;
            x :=  x - 1;
         end;
    write y;
end


Code generation:

Typical M- programs fragments are:

begin
y:= 23;
x:= 13 + y;
write x;
end

begin
if y then x:= 10
       else x:= 1;
z:= z * x;
end

(where we assume here that the variables x, y, and z must have been initialized earlier: Note that for conditionals and while statements zero is false anything else is true). These fragments are translated into a stack machine code:

cPUSH 23
LOAD y
cPUSH 13
rPUSH y
OP2 +
LOAD x
rPUSH x
PRINT

rPUSH y
cJUMP   L1
cPUSH 10
LOAD x
JUMP   L2

L1:
cPUSH 1
LOAD x
L2:
rPUSH z
rPUSH x
OP2 *
LOAD z


where


 Here is an implementation of this stack machine in the shell: source this then source the file your M- compiler produces!
# ... aliases for Robin's stack machine in csh.
# This file should be "sourced" prior to executing
# stack machine files.

set stack = ""
alias cPUSH            'set stack = (\!:1 $stack)'
alias rPUSH            'set stack = ($\!:1 $stack)'
alias sPUSH            '@ stack[1] = $stack[1] + 1 ; set stack[1] = $stack[$stack[1]]'
alias LOAD            'eval "set \!:1 = \$stack[1] ; shift stack"'
alias OP2                'eval "@ stack[2] = \$stack[2] \!:1 \$stack[1]"; shift stack'
alias cJUMP           'set tos = $stack[1]; shift stack; if ($tos == 0) goto \!:1'
alias JUMP             goto
alias PRINT            'echo $stack[1]; shift stack'
alias READ            'eval "set \!:1 = $< " '
 
Here is the csh implementation and here is a version for bash (which  uses calls to csh).

Some test programs:
Here are some simple test programs: there are errors in some of them! 
Mtest1, Mtest2, Mtest3, Mtest4, Mtest5.
You should also invent at least five of your own test programs.



HINTS:   Step by step approach to the assignment ...
  1. Use LEX to write a parser.   Print the list of tokens!   This completes part #1 of the assignment.
  2. Transform the grammar!  As it stand it is not a recursive descent grammar you need to remove left recursion (fortunately this just means removing some immediate left recursions from stmtlist, term, expr).  The resulting grammar is LL(1) but still is not in recursive descent form.  To avoid having multiple rules with no pattern it is necessary to expand the first nonterminal until a terminal is reached. Also some rules must be broken up as they have terminals which are not in the pattern. 
  3. Write a really simple recursive descent parser (which does not build a syntax tree).   Here is a very simple example of this.  Here is an example of this which uses the PARSEC system to tokenize the input: note the datatype for the Tokens are contained in the this file  ... !IMPORTANT!: initially ignore the  PARSEC code but look at the datatype for tokens for expressions.  You need to create a token datatype for M- (see suggestions below).
  4. Now you need to modify this code to build the syntax tree (see the definition below).  
  5. You have now built the syntax tree but you really wanted stack machine code.  To do this I suggest you modify the show function of the syntax tree in order to output stack machine code: the only complication here is you need to pass a counter round to give you fresh label numbers when you want them.  
  6. Modify the program to take input from a file and put the result in a file.   Look at how this is done at the end of this file
  7. Now improve the error handling. 



Some notes in Haskell:

THE FRONT END: building the syntax tree ...

First you have to decide what the syntax tree and token datatype will actually look like.  Decisions, decisions!
First the Tokens ....
               data  Token = T_Identifier String                   --- a token for identifiers
                                    | T_num   Int                              --- a token for numbers
                                    | T_while                                   --- a token to catch the keyword "while"
                                         ....
Next the syntax tree: what does it look like?
               data Stmt = If Exp Stmt Stmt                        -- datatype for statements  starting with an if ... then ... else statement
                                | While Exp Stmt
                                | Assign String Exp
                                | Block [Stmt]
                                | Print Expr
                                | Input Expr
               data Exp = Add Exp Exp  
                               | Mul Exp Exp
                               | Div Exp Exp
                               | Neg Exp
                               | Id String
                               | Num Integer

The parser is then a function:
                           minParser: [(Pos,Token)] -> Stmt
whoops  .... so you also pass back from the lexer where (a position) in the input list your token occurs:  this is for error handling ... to start with ignore this and use only the token.

Now for each nonterminal you need to write a function.   If things work out well and you have correctly transformed the grammar into a recursive descent grammar you can simply choose the production by matching against the first nonterminals (the pattern). 

Conditionals ..

First you should have transformed the grammar for conditionals into something like:

stmt -> IF expr THEN stmt ELSE stmt  
              ===>   
stmt -> IF expr thenpart
thenpart -> THEN stmt elsepart
elsepart -> ELSE stmt

The corresponding code then looks like:

     stmt:: [(
Pos,Token)] -> (Stmt,[(Pos,Token)])
         ...
     stmt ( (_,T_if) : rest) = (If e s1 s2,rest') where
                (e,rest1) = expr rest
                ((s1,s2),rest') = thenpart rest1
     thenpart ((_,T_then):rest) = ((s1,s2),rest') where
                (s1,rest1) = stmt rest
                (s2,rest') = elsepart rest1
     elsepart ((_,T_else):rest) = (s,rest') where
                (s,rest') = stmt rest

The code recursively pulls off the "if" extracts an expression then calls code to handle the "thenpart".  The "thenpart" has the keyword "then" matched and a statement is extracted and the "elsepart" is called ....

OK that is the relatively easy part: there are two aspects which are a bit more subtle. 

Exposing patterns ...

  What do with a production like ...

       term1 -> mulop factor term1
                    |.
   To get it into recursive descent form you need to substitute "mulop" to expose a pattern:

       term1 -> MUL factor term1
                    | DIV factor term1
                    |.

The one can translate:

       term1 ((_,T_mul):ts)  =  term1 (factor ts))
       term1((_,T_div):ts) = 
term1 (factor ts))
       term1 ts  = ts

then one has to add the construction of the syntax tree!

Making sure you have what you need ...

Now there is also a further difficulty with the nonterminal "term1" which surfaces when one wishes to build an expression of the form MUL.  The problem is that you don't seem to know the first argument although you do know the second argument!   So lets look carefully at two ways of surmounting this problem:

A. Pass what you need in!  So now term1 expects another argument namely an expression produced by the code which calls it.   What does this look like?
           term1:: Exp -> [(Pos,Token)] -> (Exp,[(Pos,Token)])
           term1 e ((_,T_mul):ts) = (MUL e e',ts')  where
                     (e1,ts1) = factor ts
                     (e',ts') = term1 e1 ts1
           term1 e ts = (e,ts)

B.
Build an incomplete term! 
This is more sophisticated ....  This way term1 does not expect another argument but it returns a function (think of this as an incomplete tree) which when it is eventually given the missing information will produce the right thing.   SO what does this look like?
           term1::
[(Pos,Token)] -> (Exp -> Exp,[(Pos,Token)])   
           term1 ((_,T_mul):ts) = (\x -> f(MUL x e'),ts')  where
                     (e', ts1) = factor ts
                     (f, ts') = term1 ts1
           term1 ts = (\x -> x,ts)
 
OK so notice the call to term1
(line 4) produces a function f which is applied to the factor found, e, multiplied by x.    The expression \x -> f(MUL x e') is a function (the incomplete tree) which given a value for x will evaluate its body f(MUL x e') with the value substituted for x (and in this way a complete tree is built).  Haskell is quite happy to pass functions around (in fact, it is normal to do so in a functional language).  Notice that the second phrase of this code returns the identity function  \x -> x (literally give me x and I will return x). 

I am going to rewrite this second code another, completely equivalent, but slightly different way so you can see how to express this passing of functions in Haskell in different ways:
           term1:: [(Pos,Token)] -> (Exp -> Exp,[(Pos,Token)])   
           term1 ((_,T_mul):ts) = (g,ts')  where
                     (e', ts1) = factor ts
                     (f, ts') = term1 ts1
                     g x =
f(MUL x e')                   --  g::  Exp -> Exp
           term1 ts = (g,ts)  where
                     g x = x
                                    --  g::  Exp -> Exp

Now, I have to 'fess up, in fact, the first version of term1 and the later versions do not do quite the same thing  as the latter codes associate the multiplication to the left while the first associates to the right (not too bad as multiplication is associative!!).  In fact the latter codes is more correct to the specification as (arguably) we intended to associate the multiplication to the left.  

This function which returns a higher order function Exp -> Exp is then used in term as follows:
            term::
[(Pos,Token)] -> (Exp,[(Pos,Token)])
            term ts = (f e,ts') where
                      (e,ts1) = factor ts
                      (f,ts') = term1 ts1
Notice the application f e in the returned tuple ... now there are no half built expressions left at this level as you have supplied the missing bit of the expression.

THE BACKEND: producing stack code

Producing the stack code from the syntax tree is not as difficult as it might at first seem: the only thing you must remember to do is to pass a counter to ensure you get new labels whenever you need them.   Here is a  code snippet for generating a conditional:

               shower::  Int  ->  Stmt -> (Int,String)
               shower n (IF e s1 s2) = 
                       ( ( show e) ++"cJumP label"++(show n)++"\n"
                                         ++ code1
                                         ++"JUMP label"++(show (n+1))++"\n"
                                         ++"label"++(show n)++":\n"
                                         ++code2
                                         ++"label"++(show (n+1))++":\n"
                         , m)   where
                      (code1,n') = shower (n+2) s1
                      (code2,m) = shower n' s2
               .....
Notice this uses the same sorts of techniques as were used for generating the syntax tree.

Error handling ....

Right now if there is an error your parser would probably die in the most ungraceful fashion!!  What do you have to do to make this more graceful?   Essentially you must add a pattern to each nonterminal to catch the case when things go wrong ... for example

               factor ((_,(T_num n)):ts) = ....
               factor ((_,T_lpar):ts) = ....
                   ...
               factor ((pos,_):ts) = error "Error in Factor: expected (, and identifier, or a number"++(show pos)

Well you can play with the wording to make it look good ......































































































































































rest' = eat_token T_then rest1
                (s1,rest3) = stmt rest2
                rest4 = eat_token T_else rest3
                (s2,rest') = stmt rest4
     stmt( (_,T_while):rest) = ....

so what does this do?   If the current input starts with the "if" token then eat it and pass what is left of the token list into the expression parser it produces an
expression e and gives back the tokens it has not eaten in rest1.  The first token of rest1 must be "then" eaten it this leaves rest2, the first part of rest2 must be
an expression so you pass this to the expression eater etc.

Now having written the code in this form  I can actually simplify it a bit by not having so  many intermediate variables.  In fact there is more that I can do but this is quite enough to start with.

     stmt:: [(Pos,Token)] -> (Stmt,[(Pos,Token)])
         ...
     stmt ( (_,T_if) : rest) = (S_if e s1 s2,rest') where
                (e,rest1) = expr rest
                (s1,rest2) = stmt (
eat_token T_then rest1)
                (s2,rest') = stmt (
eat_token T_else rest2)
     stmt( (_,T_while):rest) = ....

An amazing feature of Haskell is that, in the "where" clause (the indented part), I can actually write the definitions in any order --  it does not matter: Haskell sorts out how to use the functions.

OK that is the relatively easy part: there are two aspects which are a bit more subtle. 

The role of first sets ...

First how do you implement a rule like ...

       term1 -> mulop factor term1
                    |.

The problem is that you cannot match on the first terminal directly.   So what do you do?  Well one trick is to use the guards to test the first set:

       term1 [] = ...
       term1 (t:ts) | first_mulop t =   ....
                         | otherwise = ....

where first_mulop returns True if t is a token in the first set of the first production (yes you have to write first_mulop!).    Another approach (which is simpler) is just to bite the bullet and do an explicit match agains all the possibilities in the first set (you are going to have to separate them anyway!):

       term1 [] = ...
       term1 ((_,T_mul):ts)  =   ....
       term1((_,T_div):ts) =  ....

Making sure you have what you need ...

Now there is also a further difficulty with this nonterminal as it appears that one wishes to build an expression of the form MUL ? e where you don't seem to know the first argument although you do know the second argument.   So lets look carefully at two ways of surmounting this problem.  I am going to simplify matters now by assuming that the pattern matching solution is used for the problem above on the first terminal:

A. Pass what you need in!  So now term1 expects another argument namely an expression produced by the code which calls it.   What does this look like?
           term1:: Exp -> [(Pos,Token)] -> (Exp,[(Pos,Token)])
           term1 e ((_,T_mul):ts) = (MUL e e',ts')  where
                     (e1,ts1) = factor ts
                     (e',ts') = term1 e1 ts1
           term1 e ts = (e,ts)
B. Build an incomplete term! 
This is more sophisticated ....  This way term1 does not expect another argument but it returns a function (think of this as an incomplete tree) which when it is eventually given the missing information will produce the right thing.   SO what does this look like?
           term1::
[(Pos,Token)] -> (Exp -> Exp,[(Pos,Token)])   
           term1 ((_,T_mul):ts) = (\x -> f(MUL x e'),ts')  where
                     (e', ts1) = factor ts
                     (f, ts') = term1 ts1
           term1 ts = (\x -> x,ts)
 
OK so notice the call to term1
(line 4) produces a function f which is applied to the factor found, e, multiplied by x.    The expression \x -> f(MUL x e') is a function (the incomplete tree) which given a value for x will evaluate its body f(MUL x e') with the value substituted for x (and in this way a complete tree is built).  Haskell is quite happy to pass functions around (in fact, it is normal to do so in a functional language).  Notice that the second phrase of this code returns the identity function  \x -> x (literally give me x and I will return x). 

I am going to rewrite this second code another, completely equivalent, but slightly different way so you can see how to express this passing of function in Haskell in different ways:
           term1:: [(Pos,Token)] -> (Exp -> Exp,[(Pos,Token)])   
           term1 ((_,T_mul):ts) = (g,ts')  where
                     (e', ts1) = factor ts
                     (f, ts') = term1 ts1
                     g x =
f(MUL x e')                   --  g::  Exp -> Exp
           term1 ts = (g,ts)  where
                     g x = x
                                    --  g::  Exp -> Exp

Now, I have to 'fess up, in fact, the first version of term1 and the later versions do not do quite the same thing  as the latter codes associate the multiplication to the left while the first associates to the right (not too bad as multiplication is associative!!).  In fact the latter codes is more correct to the specification as (arguably) we intended to associate the multiplication to the left.  

This function which returns a higher order function Exp -> Exp is then used in term as follows:
            term::
[(Pos,Token)] -> (Exp,[(Pos,Token)])
            term ts = (f e,ts') where
                      (e,ts1) = factor ts
                      (f,ts') = term1 ts1
Notice the application f e in th returned tuple ... now there are no half built expressions left at this level as you have supplied the missing bit of the expression.

The backend: producing the stack code

Producing the stack code from the syntax tree is not too difficult the only thing you must remember to do is to pass a counter to ensure you get new labels.   Here is a  code snippet for generating a conditional:

               shower::  Int  ->  Stmt -> (Int,String)
               shower n (IF e s1 s2) = 
                       ( ( show e) ++"cJumP label"++(show n)++"\n"
                                         ++ code1
                                         ++"JUMP label"++(show (n+1))++"\n"
                                         ++"label"++(show n)++":\n"
                                         ++code2
                                         ++"label"++(show (n+1))++":\n"
                         , m)   where
                      (code1,n') = shower (n+2) s1
                      (code2,m) = shower n' s2
               .....
Notice this uses the same sorts of techniques as were used for generating the syntax tree.

Error handling ....

Right now if there is an error your parser would probably die in the most ungraceful fashion!!  What do you have to do to make this more graceful?   Essentially you must add a pattern to each nonterminal to catch the case when things go wrong ... for example

               factor ((_,(T_num n)):ts) = ....
               factor ((_,T_lpar):ts) = ....
                   ...
               factor ((pos,_):ts) = error "Error in Factor: expected (, and identifier, or a number"++(show pos)

Well you can play with the wording to make it look good ......