Answers for the cpsc510 midterm 2000 fall exam
==============================================
Author: Robin Cockett

[Answers for Qu 1 & 2 & 3]

==========================================================================
Qu 1.
==========================================================================

Describe the main stages of a compiler.  Here is a common scheme:

      | source string (high level programming language)
      v
   ------------------------                                     F
   | Lexical analysis     |------------>------------|           R
   ------------------------                         |           O
      | list of tokens                               |           N
      v                                              |           T
   ------------------------                         |
   | Grammatical analysis |                         |           E
   |      /Parsing        |------------>------------|           N
   ------------------------                         |           D
      | syntax tree                           errors |
      v                                              v
   ------------------------               ------------------
   | Semantic analysis    |--------->-----| Error reporter |---->-----
   ------------------------               ------------------
      | intermediate code
      v
   ------------------------                                     B
   | optimization         |                                     A
   ------------------------                                     C
      | intermediate code (optimized)                           K
      v
   ------------------------                                     E
   | code generation      |                                     N
   ------------------------                                     D
      | target language/assembler/machine code

(1.1) Lexical analysis: the lexemes of the language are extracted (using a
      regular automaton) and made into tokens.
      [Complexity: O(n), where n is the length of the input string.]

(1.2) Grammatical analysis: the tokens are parsed (according to a context
      free grammar) and a structure is built which may not be the actual
      parse tree (but should be fairly closely related) which we call the
      syntax tree.  This structure should be suitable for the next
      phase ...
      [Complexity: O(n), where n is the length of the list of tokens.]

(1.3) Semantic analysis: here type checking (which includes ensuring that
      functions and procedures have the correct number of arguments) and
      the differentiation of polymorphic symbols (using this type
      information) take place.  The output of this stage may be an
      intermediate code.
      [Complexity: O(n) - O(n log n), where n is the size of the syntax
      tree.  The log n factor is necessary as one must look up symbols in
      the symbol table (use of hash tables can reduce this to effectively
      O(n) again) AND one may have to do type unification to complete the
      type checking.]

(1.4) Optimization: most significant code optimization takes place at the
      level of the intermediate code.
      [Complexity: =< O(n log n).  The problems here are NP-hard, but to
      end up with a practical compiler one must keep the complexity to at
      most O(n log n).]

(1.5) Code generation: the intermediate code is turned into the target
      language.
      [Complexity: =< O(n log n).  The register allocation problem is
      equivalent to the graph coloring problem, which is NP-complete.
      Thus one is forced to use heuristic algorithms of low complexity.]

==========================================================================
Qu 2.
==========================================================================

The grammar you were asked to calculate the vital statistics for was:

   stmts       -> stmts label stmts stmt SEMICOLON |.
   label       -> LABEL COLON |.
   stmt        -> ID ASSIGN r_expr.
   r_expr      -> ID params | NUM.
   params      -> LPAR arglist RPAR |.
   arglist     -> r_expr arglist_end |.
   arglist_end -> COMMA r_expr arglist_end |.

All nonterminals are reachable and realizable.

The following nonterminals are nullable (a mechanical check is sketched
below):

   stmts label params arglist arglist_end.

The only endable nonterminal is: stmts.

The nonterminal "stmts" is immediately left recursive: this means the
grammar is not LL(1).
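
[Aside: the nullable set above can be checked mechanically.  The following
is a minimal sketch (in Python; not part of the original answer key, and
the grammar encoding and function name are just illustrative assumptions)
of the usual fixed-point computation: a nonterminal is nullable when some
right-hand side consists entirely of nullable symbols.

# The Qu 2 grammar, one list of symbols per production (the empty list is
# the empty production).
GRAMMAR = {
    "stmts":       [["stmts", "label", "stmts", "stmt", "SEMICOLON"], []],
    "label":       [["LABEL", "COLON"], []],
    "stmt":        [["ID", "ASSIGN", "r_expr"]],
    "r_expr":      [["ID", "params"], ["NUM"]],
    "params":      [["LPAR", "arglist", "RPAR"], []],
    "arglist":     [["r_expr", "arglist_end"], []],
    "arglist_end": [["COMMA", "r_expr", "arglist_end"], []],
}

def nullable_set(grammar):
    """Fixed point: a nonterminal is nullable if some production consists
    entirely of nullable symbols (the empty production trivially so)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            if nt not in nullable and \
               any(all(sym in nullable for sym in rhs) for rhs in productions):
                nullable.add(nt)
                changed = True
    return nullable

print(sorted(nullable_set(GRAMMAR)))
# -> ['arglist', 'arglist_end', 'label', 'params', 'stmts']

Terminals are never added to the set, so any production mentioning a
terminal can never become all-nullable, which is exactly the condition
applied by hand above.]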
That the grammar is not LL(1) should also be apparent from the calculation
of the first sets and follow sets:

   Nonterminal         first set         follow set
   ===========
   label       ====>  { LABEL }         { LABEL ID }
   stmt        ====>  { ID }            { SEMICOLON }
   r_expr      ====>  { ID NUM }        { COMMA RPAR SEMICOLON }
   params      ====>  { LPAR }          { COMMA RPAR SEMICOLON }
   arglist_end ====>  { COMMA }         { RPAR }
   stmts       ====>  { LABEL ID }      { LABEL ID }
   arglist     ====>  { ID NUM }        { RPAR }

To be LL(1) there are three conditions involving these statistics:

(1) Each nonterminal can have at most one nullable production.
    This is satisfied by this grammar!

(2) The first sets of the RHSs of any two distinct productions of the same
    nonterminal must be disjoint.  Notice that this is true (rather
    obviously) for all the nonterminals.

(3) For any nullable nonterminal the first set and the follow set must be
    disjoint.  Here we run into problems!  Only "stmt" and "r_expr" are
    non-nullable, so the remaining nonterminals must be checked: both
    "label" and "stmts" fail, as their first and follow sets overlap.

[If you got all this right you have 3/4 of the marks for the question.]

Now it is worth noting at this stage that the grammar with start state
"stmt" is LL(1), as all three of the above conditions are satisfied for
those nonterminals which can be reached from "stmt".  It is a fact that
every LL(1) grammar is LR(1), and this of course means that we now know
the answer to 3(d) below!

There is something else you might note about this grammar: it is rather
obviously ambiguous!  Look at the rule for "stmts": its right-hand side
starts with three nullable nonterminals, "stmts label stmts".  Now, if the
initial string matches just "stmts", there are two ways of parsing it
starting from this sentential form: one can expand the first "stmts" and
null the next two nonterminals, or null the first two nonterminals and
expand the second "stmts".  So this actually answers the last two parts of
this question!  You do not want an ambiguous grammar for a programming
language (the one exception is when you use precedence parsing - which
actually removes the ambiguity again).  Finally one notes: one cannot
remove ambiguity by transformation, and no ambiguous grammar is LL(1)!!!!

Let us actually transform the grammar to see what happens!!  We know
everything from "stmt" down is non-left-recursive, so really our concern
is the first two rules:

   stmts -> stmts label stmts stmt SEMICOLON |.
   label -> LABEL COLON |.

There is an immediate left recursion, so we need to remove that (no
implied ordering is needed):

   stmts  -> stmts+        % using the null non-left-recursive production
   stmts+ -> label stmts stmt SEMICOLON stmts+ | .
   label  -> LABEL COLON |.

Now we want to make the first production good, so we set stmts < stmts+.
However, examining the second rule, as "label" is nullable we now have an
improvable rule, so we substitute "label" into this rule.  As this is the
only place in which "label" is used, we may then throw that production
away:

   stmts  -> stmts+
   stmts+ -> LABEL COLON stmts stmt SEMICOLON stmts+
           | stmts stmt SEMICOLON stmts+
           | .

Again we can improve these productions to

   stmts  -> stmts+
   stmts+ -> LABEL COLON stmts stmt SEMICOLON stmts+
           | stmts+ stmt SEMICOLON stmts+
           | .

But now we have made the second (hidden) left recursive aspect apparent!
We remove left recursion again:

   stmts   -> stmts+.
   stmts+  -> LABEL COLON stmts stmt SEMICOLON stmts+ stmts++
            | stmts++.
   stmts++ -> stmt SEMICOLON stmts+ stmts++ |.

And now we are done!  Of course, as we have already observed, this grammar
is not LL(1); we may actually recalculate the vital statistics to verify
this (although I was not expecting you to do this) and see what goes
wrong.
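
[Aside: the immediate-left-recursion removal step used repeatedly above is
entirely mechanical.  Here is a small sketch (illustrative Python, not
from the original answers; the grammar representation and the helper name
are assumptions): productions A -> A alpha | beta become A -> beta A+ and
A+ -> alpha A+ | .

def remove_immediate_left_recursion(grammar, nt):
    """Split nt's productions into left-recursive ones (nt alpha) and the
    rest (beta), then rewrite:  nt -> beta nt+   and   nt+ -> alpha nt+ | ."""
    recursive = [rhs[1:] for rhs in grammar[nt] if rhs[:1] == [nt]]
    others    = [rhs     for rhs in grammar[nt] if rhs[:1] != [nt]]
    if not recursive:
        return grammar                      # nothing to do
    new_nt = nt + "+"
    out = dict(grammar)
    out[nt]     = [beta + [new_nt] for beta in others]
    out[new_nt] = [alpha + [new_nt] for alpha in recursive] + [[]]
    return out

# Applied to the first rule alone it reproduces the first step above:
g = {"stmts": [["stmts", "label", "stmts", "stmt", "SEMICOLON"], []]}
print(remove_immediate_left_recursion(g, "stmts"))
# -> {'stmts': [['stmts+']],
#     'stmts+': [['label', 'stmts', 'stmt', 'SEMICOLON', 'stmts+'], []]}
]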
The grammar with left recursion removed now looks like:

=================================================================
   stmts       -> stmts+.
   stmts+      -> LABEL COLON stmts stmt SEMICOLON stmts+ stmts++
                | stmts++.
   stmts++     -> stmt SEMICOLON stmts+ stmts++ |.
   stmt        -> ID ASSIGN r_expr.
   r_expr      -> ID params | NUM.
   params      -> LPAR arglist RPAR |.
   arglist     -> r_expr arglist_end |.
   arglist_end -> COMMA r_expr arglist_end |.
=================================================================

The nullable nonterminals are:

   stmts++ params arglist arglist_end stmts+ stmts.

   Nonterminal         first             follow
   ===========
   stmts+      ====>  { LABEL ID }      { ID }
   stmt        ====>  { ID }            { SEMICOLON }
   r_expr      ====>  { ID NUM }        { COMMA RPAR SEMICOLON }
   params      ====>  { LPAR }          { COMMA RPAR SEMICOLON }
   arglist_end ====>  { COMMA }         { RPAR }
   stmts       ====>  { LABEL ID }      { ID }
   stmts++     ====>  { ID }            { ID }
   arglist     ====>  { ID NUM }        { RPAR }

The three conditions for LL(1) now look like:

(1) satisfied;
(2) satisfied (only the productions for "stmts+" must be checked!);
(3) here "stmts+", "stmts++" and "stmts" all produce problems!

==========================================================================
Qu. 3
==========================================================================

The grammar in question is:

   stmt        -> l_expr ASSIGN r_expr.
   l_expr      -> ID | NUM.
   r_expr      -> ID params | NUM.
   params      -> LPAR arglist RPAR |.
   arglist     -> r_expr arglist_end |.
   arglist_end -> COMMA r_expr arglist_end |.

The follow sets and what is endable can be read off much as in the
previous question.

The endable nonterminals are: params stmt r_expr.

The follow sets are:

   Nonterminal         follow
   ===========
   arglist     ====>  { RPAR }
   arglist_end ====>  { RPAR }
   params      ====>  { COMMA RPAR }
   l_expr      ====>  { ASSIGN }
   r_expr      ====>  { COMMA RPAR }

The states of the LR(0) item DFA are (abbreviating params, arglist,
arglist_end, ASSIGN and COMMA to parms, args, args_end, ASS and COM):

   States:                                Transitions:
   =======                                ============
                                          NT         T
                                          ==         =
   (1)  S -> .stmt                        stmt       ID
        stmt -> .ID ASS r_expr
        ----------------------

   (2)  stmt -> ID.ASS r_expr                        ASS
        ---------------------

   (3)  stmt -> ID ASS.r_expr             r_expr     ID NUM
        r_expr -> .ID parms
        r_expr -> .NUM
        ---------------------

   (4)  r_expr -> ID.parms                parms      LPAR
        parms -> .LPAR args RPAR
        parms -> .
        ------------------------
        [shift/reduce conflict]   follow(parms) = { COM RPAR }

   (5)  parms -> LPAR.args RPAR           args       ID NUM
        args -> .r_expr args_end          r_expr
        args -> .
        r_expr -> .ID parms
        r_expr -> .NUM
        ------------------------
        [shift/reduce conflict]   follow(args) = { RPAR }

   (6)  args -> r_expr.args_end           args_end   COM
        args_end -> .COM r_expr args_end
        args_end -> .
        --------------------------------
        [shift/reduce conflict]   follow(args_end) = { RPAR }

   (7)  args_end -> COM.r_expr args_end   r_expr     ID NUM
        r_expr -> .ID parms
        r_expr -> .NUM
        -------------------------------

   (8)  args_end -> COM r_expr.args_end   args_end   COM
        args_end -> .COM r_expr args_end
        args_end -> .
        --------------------------------
        [shift/reduce conflict]   follow(args_end) = { RPAR }

   (9) through (16)  [the remaining states contain no further conflicts]

We can immediately see that this is not an LR(0) grammar, as there are
shift/reduce conflicts at states (4), (5), (6) and (8).

Is it an SLR(1) language?  In order for this to be so we must be able to
resolve the conflicts using the follow sets.  This can be done, as in each
of these states the completed item's follow set is disjoint from the
terminals on which the state can shift!

Every SLR(1) grammar is LALR(1).  As all the conflicts are shift/reduce
conflicts, they will already be resolved at the LALR(1) level if they are
resolved at the LR(1) level.
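
[Aside: the SLR(1) resolution argument can be made concrete with a small
check (illustrative Python only; the state data simply transcribes states
(4), (5), (6) and (8) above, written with the full token names): in each
conflicting state the reduce item's follow set must be disjoint from the
terminals on which the state can shift.

# Shift terminals and the follow set of the reduce item, per conflicting state.
CONFLICT_STATES = {
    4: {"shift": {"LPAR"},      "reduce_follow": {"COMMA", "RPAR"}},  # params -> .
    5: {"shift": {"ID", "NUM"}, "reduce_follow": {"RPAR"}},           # arglist -> .
    6: {"shift": {"COMMA"},     "reduce_follow": {"RPAR"}},           # arglist_end -> .
    8: {"shift": {"COMMA"},     "reduce_follow": {"RPAR"}},           # arglist_end -> .
}

for state, info in sorted(CONFLICT_STATES.items()):
    ok = info["shift"].isdisjoint(info["reduce_follow"])
    print("state %d: conflict resolved by SLR(1)? %s" % (state, ok))
# Every state prints True, so the grammar is SLR(1).
]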