CPSC411 -- Datatypes for M+
Author: Robin Cockett
Date: 8th March 2006
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Louden: 5.5
Dragon book: 5.2 (287-293)

The practice of compiler writing has changed greatly from the early
days of computing, when limited storage and speed were ever-present
constraints.  Under these constraints it was important to attempt to
compile programs in one pass.  This constraint had a knock-on effect
both on the design of programming languages and on what it was felt
could reasonably be attempted by a compiler.  It also led to a number
of interesting but, perhaps now, less relevant bits of theory which
were explicitly concerned with achieving compilation in one pass.
The notion of an L-attributed grammar is an example of this, as is
(more controversially) the notion that one should compute attributes
in place: the modern (but not universally accepted) approach is to
generate a new structure (as an attribute) from the old structure.

The more modern view tends to favor decoupling the parsing stage from
the code generation stage by inserting a series of steps between the
output of the parser and the actual generation of code.  Each of
these steps may actually generate a datatype (or data structure).
Not only is this "good software engineering practice", as it
facilitates modularity and debugging, but it also allows greater
flexibility in the program constructs a language can support.  An
example is the added flexibility in Java which allows the use of some
constructs before their declaration.  Another example, highly
relevant to this class, is the declaration structure in M+ which
allows the use of functions and variables defined at the same level!

The series of steps moving from one datatype to the next can usually
be optimized into one big step.  As this optimization is a mechanical
process, it makes sense to start by developing the program with the
small steps (to get it correct) and only later -- if speed really is
an issue -- return to optimize the code.


Syntax tree
-----------

The first generic step is to generate a syntax tree out of the
parsing stage.  This is a representation of the parsed input which is
suitable for the semantic analysis steps which follow.  The syntax
tree is usually not the parse tree itself: the parse tree often
carries too much detailed parsing information.  However, it must be
closely related to the parse tree, as semantic errors will be
detected using this structure and these errors must be displayed to
the programmer in a manner which is related to the program text.

Example (Additive expressions again!):

Consider again the expression grammar.

    Pexp:   expr -> term rest
    Prest1: rest -> ADD term rest
    Prest2:       | SUB term rest
    Prest3:       | \epsilon
    Pterm:  term -> NUM

The parse tree of "3+4-6" is

    Pexp(Pterm(3),Prest1(+,Pterm(4),Prest2(-,Pterm(6),Prest3())))

while an abstract syntax tree might be

    sub(add(num(3),num(4)),num(6))

These two representations are not equivalent!  In the parse tree we
associate to the right, while in the abstract syntax tree we
associate to the left.  Furthermore, the abstract syntax allows BOTH
associations, while the parse tree is specifically designed to
exclude that ambiguity.  Thus, syntax trees may be a more flexible
representation which could even allow for possibilities which cannot
occur as the result of a parsed input.  Of course, some of this
flexibility can be achieved using precedence relations in YACC to
disambiguate a deliberately ambiguous grammar.  In a sense this
facility in YACC is a recognition of the value of structures which do
not have all the detailed parsing information present.
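To see concretely how the two representations relate, here is a small
Haskell sketch (the datatypes and the function toAST below are
illustrative inventions, not part of the course code) which flattens
the right-recursive parse tree above into a left-associated syntax
tree:

    -- parse-tree shapes, mirroring the grammar above
    data Exp  = Exp Term Rest                     -- expr -> term rest
    data Rest = Add Term Rest | Sub Term Rest | Empty
    data Term = Num Int                           -- term -> NUM

    -- the abstract syntax tree
    data AST = ANum Int | AAdd AST AST | ASub AST AST
      deriving Show

    toAST :: Exp -> AST
    toAST (Exp t r) = rest (term t) r
      where
        term (Num n) = ANum n
        -- the accumulating parameter produces the left association
        rest acc Empty       = acc
        rest acc (Add t' r') = rest (AAdd acc (term t')) r'
        rest acc (Sub t' r') = rest (ASub acc (term t')) r'

For "3+4-6" this gives

    toAST (Exp (Num 3) (Add (Num 4) (Sub (Num 6) Empty)))
      ==>  ASub (AAdd (ANum 3) (ANum 4)) (ANum 6)

which is exactly the left-associated sub(add(num(3),num(4)),num(6)).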
Semantic checking to Intermediate representation
------------------------------------------------

The semantic analysis stage checks the parse tree for semantic errors
and produces an intermediate representation.  It is at this stage
that one sorts out the scope issues associated with variables.  One
also detects any type errors which occur: e.g. trying to assign an
integer to a boolean, or applying a function to the wrong number or
the wrong types of arguments ...  The type information can also be
used to disambiguate certain (ad hoc) polymorphic functions and to
insert coercions (e.g. the assignment of an integer to a real can be
replaced by first floating the integer and then making the
assignment).

Out of the semantic analysis stage (often) comes an intermediate
representation which is closer to the form of the machine code we
wish to generate but may still be abstract enough to manipulate.  For
a simple stack-based compilation (e.g. M+) this representation will,
for example, tell us the levels (distance from static definition) and
offsets of our variables and the levels and labels of the code of our
functions.  Thus, in this case, we may have as an intermediate
representation a tree in which all variable names have been replaced
by offset and level information and all function names by a level and
label (and, for built-in functions, special internal codes).

The intermediate code in a more complex compiler may still be some
distance from the actual code.  For example, if a three address code
is used and we are compiling to a RISC (Reduced Instruction Set
Computer) architecture, such as the SPARC, we may yet have to work
out a way of allocating registers.  Efficient use of registers on
these architectures (avoiding using main memory) is probably the
single biggest factor in optimizing the compilation.  On the other
hand, if we are compiling to a CISC (Complex Instruction Set
Computer) then we may have to face problems of instruction selection
-- as there may be many different ways (with different efficiencies)
of carrying out the overall desired computation.

By the time the semantic checking is complete, all the errors which
the compiler can detect -- that is, all the static checking of the
language constructs -- should have been detected.  Thus, the
intermediate code should have no errors and should not generate any
further error messages (on which the user is expected to act) on the
remaining passage into code (however long that process may be!).

The fact that the intermediate code need no longer bear a close
relationship to the original program means that at this stage we can
attempt major program transformations and optimizations.  Thus this
intermediate level is the appropriate stage at which to perform
significant optimizations.  Once the translation into assembler --
replete with its machine idiosyncrasies -- has been undertaken, much
of the structure of the program will have been lost.  Thus, major
optimizing transformations may be difficult to achieve after this
stage.  While there are certainly some very effective "peephole"
optimizations which can be performed at the assembler level to remove
unnecessary operations and to do some strength reductions (replacing
expensive operations with cheaper ones), these optimizations are (as
their name suggests) very local in nature and will not catch
optimizations which require major code restructuring.
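As an illustration of inserting coercions during type checking, here
is a minimal, self-contained Haskell sketch.  The types Ty, Expr and
Op and the function checkAdd are hypothetical inventions, cut down
for illustration (they are not the M+ datatypes, which appear later
in these notes):

    data Ty   = TyInt | TyReal  deriving (Eq, Show)
    data Expr = IVal Int | RVal Float | App Op [Expr]  deriving Show
    data Op   = AddI | AddF | ToReal  deriving Show
              -- ToReal is the coercion which floats an integer

    tyOf :: Expr -> Ty
    tyOf (IVal _)       = TyInt
    tyOf (RVal _)       = TyReal
    tyOf (App AddI _)   = TyInt
    tyOf (App AddF _)   = TyReal
    tyOf (App ToReal _) = TyReal

    -- choose integer or real addition, floating an integer argument
    -- when the other argument is a real
    checkAdd :: Expr -> Expr -> Expr
    checkAdd e1 e2 = case (tyOf e1, tyOf e2) of
      (TyInt , TyInt ) -> App AddI [e1, e2]
      (TyReal, TyReal) -> App AddF [e1, e2]
      (TyInt , TyReal) -> App AddF [App ToReal [e1], e2]
      (TyReal, TyInt ) -> App AddF [e1, App ToReal [e2]]

This is exactly the sort of resolution which later allows the
intermediate representation to distinguish integer addition from real
addition.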
Syntax tree for M+
------------------

Consider the language M+ defined by

============================================================
prog -> block.

block -> declarations program_body.

declarations -> declaration SEMICOLON declarations
              | .

declaration -> var_declaration
             | fun_declaration.

var_declaration -> VAR ID array_dimensions COLON type.

type -> INT | REAL | BOOL.

array_dimensions -> SLPAR expr SRPAR array_dimensions
                  | .

fun_declaration ->
    FUN ID param_list COLON type CLPAR fun_block CRPAR.

fun_block -> declarations fun_body.

param_list -> LPAR parameters RPAR.

parameters -> basic_declaration more_parameters
            | .

more_parameters -> COMMA basic_declaration more_parameters
                 | .

basic_declaration -> ID basic_array_dimensions COLON type.

basic_array_dimensions -> SLPAR SRPAR basic_array_dimensions
                        | .

program_body -> BEGIN prog_stmts END.

fun_body -> BEGIN prog_stmts RETURN expr SEMICOLON END.

prog_stmts -> prog_stmt SEMICOLON prog_stmts
            | .

prog_stmt -> IF expr THEN prog_stmt ELSE prog_stmt
           | WHILE expr DO prog_stmt
           | READ identifier
           | identifier ASSIGN expr
           | PRINT expr
           | CLPAR block CRPAR.

identifier -> ID array_dimensions.

expr -> expr OR bint_term
      | bint_term.

bint_term -> bint_term AND bint_factor
           | bint_factor.

bint_factor -> NOT bint_factor
             | int_expr compare_op int_expr
             | int_expr.

compare_op -> EQUAL | LT | GT | LE | GE.

int_expr -> int_expr addop int_term
          | int_term.

addop -> ADD | SUB.

int_term -> int_term mulop int_factor
          | int_factor.

mulop -> MUL | DIV.

int_factor -> LPAR expr RPAR
            | SIZE LPAR ID basic_array_dimensions RPAR
            | FLOAT LPAR expr RPAR
            | FLOOR LPAR expr RPAR
            | CEIL LPAR expr RPAR
            | ID modifier_list
            | IVAL
            | RVAL
            | BVAL
            | SUB int_factor.

modifier_list -> LPAR arguments RPAR
               | array_dimensions.

arguments -> expr more_arguments
           | .

more_arguments -> COMMA expr more_arguments
               | .
============================================================

The parse tree is horribly complex, but we may actually represent the
program in a considerably simpler structure.
Here is a suggested Haskell datatype for the syntax tree:

data M_prog = M_prog ([M_decl],[M_stmt])
data M_decl = M_var (String,[M_expr],M_type)
            | M_fun (String,[(String,Int,M_type)],M_type,[M_decl],[M_stmt])
data M_stmt = M_ass (String,[M_expr],M_expr)
            | M_while (M_expr,M_stmt)
            | M_cond (M_expr,M_stmt,M_stmt)
            | M_read (String,[M_expr])
            | M_print M_expr
            | M_return M_expr
            | M_block ([M_decl],[M_stmt])
data M_type = M_int | M_bool | M_real
data M_expr = M_ival Integer
            | M_rval Float
            | M_bval Bool
            | M_size (String,Int)
            | M_id (String,[M_expr])
            | M_app (M_operation,[M_expr])
data M_operation = M_fn String | M_add | M_mul | M_sub | M_div | M_neg
                 | M_lt | M_le | M_gt | M_ge | M_eq | M_not | M_and | M_or
                 | M_float | M_floor | M_ceil

Here is a small M+ program:

============================================================
var x[2]:int;
fun exp(b:int):int
{ var z:int;
  begin
    if b=0 then z:= 1
           else z:= x[1] * exp(b-1);
    return z;
  end};
begin
  read x[0];
  read x[1];
  print exp(x[0]);
end
============================================================

Here is its syntax tree:

M_prog ([M_var ("x",[M_ival 2],M_int)
        ,M_fun ("exp",[("b",0,M_int)],M_int
               ,[M_var ("z",[],M_int)]
               ,[M_cond (M_app (M_eq,[M_id ("b",[]),M_ival 0])
                        ,M_ass ("z",[],M_ival 1)
                        ,M_ass ("z",[]
                               ,M_app (M_mul
                                      ,[M_id ("x",[M_ival 1])
                                       ,M_app (M_fn "exp"
                                              ,[M_app (M_sub,[M_id ("b",[]),M_ival 1])])])))
                ,M_return (M_id ("z",[]))])]
       ,[M_read ("x",[M_ival 0])
        ,M_read ("x",[M_ival 1])
        ,M_print (M_app (M_fn "exp",[M_id ("x",[M_ival 0])]))])

Finally, let us briefly consider what the intermediate representation
will be for the stack machine.  As mentioned before, it will be
essentially the same except that each variable name will be replaced
by a pair of integers (the level from its static declaration and the
offset in the activation record) and each function call by the level
from its static declaration and the label used for the call.

The Haskell datatype for the intermediate representation looks like:

data I_prog  = IPROG    ([I_fbody],Int,[(Int,[I_expr])],[I_stmt])
-- a program node consists of
--   (a) the list of functions declared
--   (b) the number of local variables
--   (c) a list of array specifications (offset, list of dimensions)
--   (d) the body: a list of statements
data I_fbody = IFUN     (String,[I_fbody],Int,Int,[(Int,[I_expr])],[I_stmt])
-- a function node consists of
--   (a) the label given to the function
--   (b) the list of local functions declared
--   (c) the number of local variables
--   (d) the number of arguments
--   (e) a list of array specifications (offset, list of dimensions)
--   (f) the body: a list of statements
data I_stmt = IASS      (Int,Int,[I_expr],I_expr)
-- an assignment has arguments (level, offset, array indices, expression)
            | IWHILE    (I_expr,I_stmt)
            | ICOND     (I_expr,I_stmt,I_stmt)
            | IREAD_F   (Int,Int,[I_expr])
            | IREAD_I   (Int,Int,[I_expr])
            | IREAD_B   (Int,Int,[I_expr])
            | IPRINT_F  I_expr
            | IPRINT_I  I_expr
            | IPRINT_B  I_expr
            | IRETURN   I_expr
            | IBLOCK    ([I_fbody],Int,[(Int,[I_expr])],[I_stmt])
-- a block consists of
--   (a) a list of local functions
--   (b) the number of local variables declared
--   (c) a list of array declarations
--   (d) the body: a list of statements
data I_expr = IINT      Int
            | IREAL     Float
            | IBOOL     Bool
            | IID       (Int,Int,[I_expr])
-- an identifier is (level, offset, array indices)
            | IAPP      (I_opn,[I_expr])
            | ISIZE     (Int,Int,Int)
-- ISIZE is (level, offset, dimension): the level and offset identify
-- which array, and the last integer tells you which dimension you
-- want to look at!!
data I_opn = ICALL      (String,Int)
           | IADD_F | IMUL_F | ISUB_F | IDIV_F | INEG_F
           | ILT_F  | ILE_F  | IGT_F  | IGE_F  | IEQ_F   -- operations for floats
           | IADD   | IMUL   | ISUB   | IDIV   | INEG
           | ILT    | ILE    | IGT    | IGE    | IEQ
           | INOT   | IAND   | IOR
           | IFLOAT | ICEIL  | IFLOOR
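The (level, offset) pairs appearing in IID and IASS come from the
symbol table.  Here is a minimal sketch of the lookup; the
representation of the symbol table as a stack of scopes (and the
function lookupVar) is an assumption for illustration, not the
required one:

    type Scope    = [(String,Int)]  -- name |-> offset in the activation record
    type SymTable = [Scope]         -- innermost scope first

    -- level = number of static links to follow to reach the defining
    -- frame; offset = the position recorded in that frame's scope
    lookupVar :: String -> SymTable -> Maybe (Int,Int)
    lookupVar x = go 0
      where
        go _     []         = Nothing
        go level (sc:outer) = case lookup x sc of
            Just off -> Just (level,off)
            Nothing  -> go (level+1) outer

For instance, with the argument b (offset -4) and local z (offset 1)
in the scope of exp, and the array x (offset 1) in the outer program
scope:

    lookupVar "x" [[("b",-4),("z",1)], [("x",1)]]  ==>  Just (1,1)
    lookupVar "z" [[("b",-4),("z",1)], [("x",1)]]  ==>  Just (0,1)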
Notice that the declarations have gone (they are actually "used" in
the construction of the symbol tables as these are built for use in
the various parts of the tree): they are replaced by the bodies of
the functions.  The functions are supplied with a label (to which the
caller jumps); the body is then the same as a program block.  Notice
that the fact that there are arguments is now hidden in the code.

A program block indicates how many cells it is necessary to allocate
for local storage and also how to calculate the storage required for
all the arrays.  These numbers are needed in the generation of the
entry and return code sequences of the function/block.  It also, of
course, contains the representation of the statements in the block.
The call of a function requires the label and the number of levels,
while all variables are replaced by the level and offset calculated
from the symbol table.

Notice also that all operations are distinguished by the data that
they act on.  Thus, multiplication of reals/floats is distinguished
from multiplication of integers, and reading or printing integers is
distinguished from reading or printing booleans or reals/floats.  We
can do this because type checking has resolved the type of each
argument.

The intermediate representation is:

IPROG ([IFUN ("fn1"
             ,[]
             ,1                -- one local variable
             ,1                -- one argument
             ,[]               -- no arrays
             ,[ICOND (IAPP (IEQ,[IID (0,-4,[]),IINT 0])
                     ,IASS (0,1,[],IINT 1)
                     ,IASS (0,1,[]
                           ,IAPP (IMUL,[IID (1,1,[IINT 1])
                                       ,IAPP (ICALL ("fn1",1)
                                             ,[IAPP (ISUB,[IID (0,-4,[])
                                                          ,IINT 1])])])))
              ,IRETURN (IID (0,1,[]))
              ])]
      ,1                       -- one local variable which is an array
      ,[(1,[IINT 2])]          -- array spec
      ,[IREAD_I (0,1,[IINT 0])
       ,IREAD_I (0,1,[IINT 1])
       ,IPRINT_I (IAPP (ICALL ("fn1",0),[IID (0,1,[IINT 0])]))
       ])

From here it is relatively easy to generate the required stack
machine code :-)
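As a hint of that last step, here is a minimal Haskell sketch of
flattening an expression into stack code by a post-order traversal.
The instruction names (PUSH, ADD, ...) and the cut-down expression
type are invented for illustration; the actual target machine's
instruction set differs:

    data Instr = PUSH Int | ADD | MUL | SUB | NEG
      deriving Show

    -- a cut-down version of I_expr, integer operations only
    data IExpr = IConst Int | IApp IOpn [IExpr]
    data IOpn  = OAdd | OMul | OSub | ONeg

    -- post-order traversal: code for the arguments, then the operation
    exprCode :: IExpr -> [Instr]
    exprCode (IConst n)  = [PUSH n]
    exprCode (IApp f es) = concatMap exprCode es ++ [opCode f]
      where
        opCode OAdd = ADD
        opCode OMul = MUL
        opCode OSub = SUB
        opCode ONeg = NEG

For the running example "3+4-6":

    exprCode (IApp OSub [IApp OAdd [IConst 3, IConst 4], IConst 6])
      ==>  [PUSH 3, PUSH 4, ADD, PUSH 6, SUB]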