Author: Robin Cockett

Date: 6 Jan. 2017

Assignment #1 and #2: A compiler for Minisculus to basic stack machine code

Due date #1: 27 January 2017

Due date #2: 17 February 2017

Your task is to implement a compiler to "basic stack machine code" (described below) for the language M- whose syntax is given by the grammar below.

These assignments have a series of steps: for the main part #2 you must write a recursive descent parser (i.e. you cannot use YACC yet!) for M- which creates a syntax tree (which you must print). You must then use this tree to produce basic stack machine code (which you should be able to run). For part #1of the assignment you must create a lexical analyzer using LEX to produce a list of tokens (which you should print as a space separated list of just token names).

The purpose of the assignment is to introduce you to scanning and recursive descent parsing.

It is important to attend the labs as they will be discussing this assignment: they will also discuss the Minisculus language.

prog -> stmt.

stmt -> IF expr THEN stmt ELSE stmt

| WHILE expr DO stmt

| INPUT ID

| ID ASSIGN expr

| WRITE expr

| BEGIN stmtlist END.

stmtlist -> stmtlist stmt SEMICOLON

|.

expr -> expr addop term

| term.

addop -> ADD

| SUB.

term -> term mulop factor

| factor.

mulop -> MUL

| DIV.

factor -> LPAR expr RPAR

| ID

| NUM

| SUB NUM.

This grammar has left recursive nonterminals. The grammar should be transformed to remove the left recursion and whence into a grammar suitable for recursive descent parsing before implementing it.

Where the

"if" => IF

"then" => THEN

"while" => WHILE

"do" => DO

"input" => INPUT

"else" => ELSE

"begin" => BEGIN

"end" => END

"write" => WRITE

{alpha}[{digit}{alpha}]* => ID (identifier)

{digit}+ => NUM (positive integer)

"+" => ADD

":=" => ASSIGN

"-" => SUB

"*" => MUL

"/" => DIV

"(" => LPAR

")" => RPAR

";"=> SEMICOLON

Minisculus has two types of comments:

- (recursive) multi-line comments: /* comment */
- and one-line comments: % comment

test1, test 2, test3, test4, test5, test6, test9.

M- program:

Here is an example program (thanks to Hayder Casey for correction!):

/* This program calculates the

factorial of the number which

is input */

begin

input x; % input a number ..

y := 1;

while x do

begin

y := y * x;

x := x - 1;

end;

write y;

end

Typical M- programs fragments are:

beginy:= 23;end |
begin if y then x:= 10end |

(where we assume here that the variables x, y, and z must have been initialized earlier: Note that for

cPUSH 23 |
rPUSH yL1: cPUSH 1L2: rPUSH z |

where

- cPUSH k --- push constant k onto stack
- rPUSH r --- push contents of register r onto stack
- sPUSH --- replaces the top element of the stack by the element it indexes in the stack
- LOAD r --- pop the top of the stack and put the value in register r
- OPn?? --- perform the operation on the top n values of the stack replacing them by the result
- cJUMP L --- conditional goto L (a label) pops top of stack and if it is zero (false) it jumps to label
- JUMP L --- unconditional jump to label
- PRINT --- pops and prints the top element of the stack
- READ r --- reads a value into register r (actually it reads a line and uses the first value on the line ...)

Here is an implementation of this stack machine in the shell: source this then source the file your M- compiler produces!

# ... aliases for Robin's stack machine in csh.

# This file should be "sourced" prior to executing

# stack machine files.

set stack = ""

alias cPUSH 'set stack = (\!:1 $stack)'

alias rPUSH 'set stack = ($\!:1 $stack)'

alias sPUSH '@ stack[1] = $stack[1] + 1 ; set stack[1] = $stack[$stack[1]]'

alias LOAD 'eval "set \!:1 = \$stack[1] ; shift stack"'

alias OP2 'eval "@ stack[2] = \$stack[2] \!:1 \$stack[1]"; shift stack'

alias cJUMP 'set tos = $stack[1]; shift stack; if ($tos == 0) goto \!:1'

alias JUMP goto

alias PRINT 'echo $stack[1]; shift stack'

alias READ 'eval "set \!:1 = $< " '

Here is the csh implementation and here is a version for bash (which uses calls to csh).

Some test programs:

Here are some simple test programs: there are errors in some of them!

You should also invent at least five of your own test programs.

HINTS: Step by step approach to the assignment ...

- Use LEX to write a parser. Print the list of
tokens! This completes part #1 of the assignment.

- Transform the grammar! As it stand it is not a recursive
descent grammar you need to remove left recursion (fortunately
this just means removing some immediate left recursions from
stmtlist, term, expr). The resulting grammar is LL(1) but
still is not in recursive descent form. To avoid having
multiple rules with no pattern it is necessary to expand the
first nonterminal until a terminal is reached. Also some rules
must be broken up as they have terminals which are not in the
pattern.

- Write a really simple recursive descent parser (which does not build a syntax tree). Here is a very simple example of this. Here is an example of this which uses the PARSEC system to tokenize the input: note the datatype for the Tokens are contained in the this file ... !IMPORTANT!: initially ignore the PARSEC code but look at the datatype for tokens for expressions. You need to create a token datatype for M- (see suggestions below).
- Now you need to modify this code to build the syntax tree (see
the definition below).

- You have now built the syntax tree but you really wanted stack machine code. To do this I suggest you modify the show function of the syntax tree in order to output stack machine code: the only complication here is you need to pass a counter round to give you fresh label numbers when you want them.
- Modify the program to take input from a file and put the
result in a file. Look at how this is done at the
end of this file.

- Now improve the error handling.

Some notes in Haskell:

THE FRONT END: building the syntax tree ...

First you have to decide what the syntax tree and token datatype will actually look like. Decisions, decisions!

First the Tokens ....

data Token = T_Identifier String --- a token for identifiers

| T_num Int --- a token for numbers

| T_while --- a token to catch the keyword "while"

....

Next the syntax tree: what does it look like?

data Stmt = If Exp Stmt Stmt -- datatype for statements starting with an if ... then ... else statement

| While Exp Stmt

| Assign String Exp

| Block [Stmt]

| Print Expr

| Input Expr

data Exp = Add Exp Exp

| Mul Exp Exp

| Div Exp Exp

| Neg Exp

| Id String

| Num Integer

The parser is then a function:

minParser: [(Pos,Token)] -> Stmt

whoops .... so you also pass back from the lexer where (a position) in the input list your token occurs: this is for error handling ... to start with ignore this and use only the token.

Now for each nonterminal you need to write a function. If things work out well and you have correctly transformed the grammar into a recursive descent grammar you can simply choose the production by matching against the first nonterminals (the pattern).

First you should have transformed the grammar for conditionals into something like:

stmt -> IF expr THEN stmt ELSE stmt

===>

stmt -> IF expr thenpart

thenpart -> THEN stmt elsepart

elsepart -> ELSE stmt

The corresponding code then looks like:

stmt:: [(Pos,Token)] -> (Stmt,[(Pos,Token)])

...

stmt ( (_,T_if) : rest) = (If e s1 s2,rest') where

(e,rest1) = expr rest

((s1,s2),rest') = thenpart rest1

thenpart ((_,T_then):rest) = ((s1,s2),rest') where

(s1,rest1) = stmt rest

(s2,rest') = elsepart rest1

elsepart ((_,T_else):rest) = (s,rest') where

(s,rest') = stmt rest

The code recursively pulls off the "if" extracts an expression then calls code to handle the "thenpart". The "thenpart" has the keyword "then" matched and a statement is extracted and the "elsepart" is called ....

OK that is the relatively easy part: there are two aspects which are a bit more subtle.

Exposing patterns ...

What do with a production like ...

term1 -> mulop factor term1

|.

To get it into recursive descent form you need to substitute "mulop" to expose a pattern:

term1 -> MUL factor term1

| DIV factor term1

|.

The one can translate:

term1 ((_,T_mul):ts) = term1 (factor ts))

term1((_,T_div):ts) = term1 (factor ts))

term1 ts = ts

then one has to add the construction of the syntax tree!

Making sure you have what you need ...

Now there is also a further difficulty with the nonterminal "term1" which surfaces when one wishes to build an expression of the form MUL. The problem is that you don't seem to know the first argument although you do know the second argument! So lets look carefully at two ways of surmounting this problem:

A. Pass what you need in! So now term1 expects another argument namely an expression produced by the code which calls it. What does this look like?

term1:: Exp -> [(Pos,Token)] -> (Exp,[(Pos,Token)])

term1 e ((_,T_mul):ts) = (MUL e e',ts') where

(e1,ts1) = factor ts

(e',ts') = term1 e1 ts1

term1 e ts = (e,ts)

B. Build an incomplete term! This is more sophisticated .... This way term1 does not expect another argument but it returns a function (think of this as an incomplete tree) which when it is eventually given the missing information will produce the right thing. SO what does this look like?

term1:: [(Pos,Token)] -> (Exp -> Exp,[(Pos,Token)])

term1 ((_,T_mul):ts) = (\x -> f(MUL x e'),ts') where

(e',ts1) = factor ts

(f, ts') = term1 ts1

term1 ts = (\x -> x,ts)

OK so notice the call to term1 (line 4) produces a function f which is applied to the factor found, e, multiplied by x. The expression \x -> f(MUL x e') is a function (the incomplete tree) which given a value for x will evaluate its body f(MUL x e') with the value substituted for x (and in this way a complete tree is built). Haskell is quite happy to pass functions around (in fact, it is normal to do so in a functional language). Notice that the second phrase of this code returns the identity function \x -> x (literally give me x and I will return x).

I am going to rewrite this second code another, completely equivalent, but slightly different way so you can see how to express this passing of functions in Haskell in different ways:

term1:: [(Pos,Token)] -> (Exp -> Exp,[(Pos,Token)])

term1 ((_,T_mul):ts) = (g,ts') where

(e',ts1) = factor ts

(f, ts') = term1 ts1

g x = f(MUL x e') -- g:: Exp -> Exp

term1 ts = (g,ts) where

g x = x -- g:: Exp -> Exp

Now, I have to 'fess up, in fact, the first version of term1 and the later versions do not do quite the same thing as the latter codes associate the multiplication to the left while the first associates to the right (not too bad as multiplication is associative!!). In fact the latter codes is more correct to the specification as (arguably) we intended to associate the multiplication to the left.

This function which returns a higher order function Exp -> Exp is then used in term as follows:

term:: [(Pos,Token)] -> (Exp,[(Pos,Token)])

term ts = (f e,ts') where

(e,ts1) = factor ts

(f,ts') = term1 ts1

Notice the application f e in the returned tuple ... now there are no half built expressions left at this level as you have supplied the missing bit of the expression.

THE BACKEND: producing stack code

Producing the stack code from the syntax tree is not as difficult as it might at first seem: the only thing you must remember to do is to pass a counter to ensure you get new labels whenever you need them. Here is a code snippet for generating a conditional:

shower:: Int -> Stmt -> (Int,String)

shower n (IF e s1 s2) =

( ( show e) ++"cJumP label"++(show n)++"\n"

++ code1

++"JUMP label"++(show (n+1))++"\n"

++"label"++(show n)++":\n"

++code2

++"label"++(show (n+1))++":\n"

, m) where

(code1,n') = shower (n+2) s1

(code2,m) = shower n' s2

.....

Notice this uses the same sorts of techniques as were used for generating the syntax tree.

Error handling ....

Right now if there is an error your parser would probably die in the most ungraceful fashion!! What do you have to do to make this more graceful? Essentially you must add a pattern to each nonterminal to catch the case when things go wrong ... for example

factor ((_,(T_num n)):ts) = ....

factor ((_,T_lpar):ts) = ....

...

factor ((pos,_):ts) = error "Error in Factor: expected (, and identifier, or a number"++(show pos)

Well you can play with the wording to make it look good ......

rest' = eat_token T_then rest1

(s1,rest3) = stmt rest2

rest4 = eat_token T_else rest3

(s2,rest') = stmt rest4

stmt( (_,T_while):rest) = ....

so what does this do? If the current input starts with the "if" token then eat it and pass what is left of the token list into the expression parser it produces an

expression e and gives back the tokens it has not eaten in rest1. The first token of rest1 must be "then" eaten it this leaves rest2, the first part of rest2 must be

an expression so you pass this to the expression eater etc.

Now having written the code in this form I can actually simplify it a bit by not having so many intermediate variables. In fact there is more that I can do but this is quite enough to start with.

stmt:: [(Pos,Token)] -> (Stmt,[(Pos,Token)])

...

stmt ( (_,T_if) : rest) = (S_if e s1 s2,rest') where

(e,rest1) = expr rest

(s1,rest2) = stmt (eat_token T_then rest1)

(s2,rest') = stmt (eat_token T_else rest2)

stmt( (_,T_while):rest) = ....

An amazing feature of Haskell is that, in the "where" clause (the indented part), I can actually write the definitions in any order -- it does not matter: Haskell sorts out how to use the functions.

OK that is the relatively easy part: there are two aspects which are a bit more subtle.

The role of first sets ...

First how do you implement a rule like ...

term1 -> mulop factor term1

|.

The problem is that you cannot match on the first terminal directly. So what do you do? Well one trick is to use the guards to test the first set:

term1 [] = ...

term1 (t:ts) | first_mulop t = ....

| otherwise = ....

where first_mulop returns True if t is a token in the first set of the first production (yes you have to write first_mulop!). Another approach (which is simpler) is just to bite the bullet and do an explicit match agains all the possibilities in the first set (you are going to have to separate them anyway!):

term1 [] = ...

term1 ((_,T_mul):ts) = ....

term1((_,T_div):ts) = ....

Making sure you have what you need ...

Now there is also a further difficulty with this nonterminal as it appears that one wishes to build an expression of the form MUL ? e where you don't seem to know the first argument although you do know the second argument. So lets look carefully at two ways of surmounting this problem. I am going to simplify matters now by assuming that the pattern matching solution is used for the problem above on the first terminal:

A. Pass what you need in! So now term1 expects another argument namely an expression produced by the code which calls it. What does this look like?

term1:: Exp -> [(Pos,Token)] -> (Exp,[(Pos,Token)])

term1 e ((_,T_mul):ts) = (MUL e e',ts') where

(e1,ts1) = factor ts

(e',ts') = term1 e1 ts1

term1 e ts = (e,ts)

B. Build an incomplete term! This is more sophisticated .... This way term1 does not expect another argument but it returns a function (think of this as an incomplete tree) which when it is eventually given the missing information will produce the right thing. SO what does this look like?

term1:: [(Pos,Token)] -> (Exp -> Exp,[(Pos,Token)])

term1 ((_,T_mul):ts) = (\x -> f(MUL x e'),ts') where

(e', ts1) = factor ts

(f, ts') = term1 ts1

term1 ts = (\x -> x,ts)

OK so notice the call to term1 (line 4) produces a function f which is applied to the factor found, e, multiplied by x. The expression \x -> f(MUL x e') is a function (the incomplete tree) which given a value for x will evaluate its body f(MUL x e') with the value substituted for x (and in this way a complete tree is built). Haskell is quite happy to pass functions around (in fact, it is normal to do so in a functional language). Notice that the second phrase of this code returns the identity function \x -> x (literally give me x and I will return x).

I am going to rewrite this second code another, completely equivalent, but slightly different way so you can see how to express this passing of function in Haskell in different ways:

term1:: [(Pos,Token)] -> (Exp -> Exp,[(Pos,Token)])

term1 ((_,T_mul):ts) = (g,ts') where

(e', ts1) = factor ts

(f, ts') = term1 ts1

g x = f(MUL x e') -- g:: Exp -> Exp

term1 ts = (g,ts) where

g x = x -- g:: Exp -> Exp

Now, I have to 'fess up, in fact, the first version of term1 and the later versions do not do quite the same thing as the latter codes associate the multiplication to the left while the first associates to the right (not too bad as multiplication is associative!!). In fact the latter codes is more correct to the specification as (arguably) we intended to associate the multiplication to the left.

This function which returns a higher order function Exp -> Exp is then used in term as follows:

term:: [(Pos,Token)] -> (Exp,[(Pos,Token)])

term ts = (f e,ts') where

(e,ts1) = factor ts

(f,ts') = term1 ts1

Notice the application f e in th returned tuple ... now there are no half built expressions left at this level as you have supplied the missing bit of the expression.

The backend: producing the stack code

Producing the stack code from the syntax tree is not too difficult the only thing you must remember to do is to pass a counter to ensure you get new labels. Here is a code snippet for generating a conditional:

shower:: Int -> Stmt -> (Int,String)

shower n (IF e s1 s2) =

( ( show e) ++"cJumP label"++(show n)++"\n"

++ code1

++"JUMP label"++(show (n+1))++"\n"

++"label"++(show n)++":\n"

++code2

++"label"++(show (n+1))++":\n"

, m) where

(code1,n') = shower (n+2) s1

(code2,m) = shower n' s2

.....

Notice this uses the same sorts of techniques as were used for generating the syntax tree.

Error handling ....

Right now if there is an error your parser would probably die in the most ungraceful fashion!! What do you have to do to make this more graceful? Essentially you must add a pattern to each nonterminal to catch the case when things go wrong ... for example

factor ((_,(T_num n)):ts) = ....

factor ((_,T_lpar):ts) = ....

...

factor ((pos,_):ts) = error "Error in Factor: expected (, and identifier, or a number"++(show pos)

Well you can play with the wording to make it look good ......