CPSC411 -- The symbol table Author: Robin Cockett Date: March 6, 2003 (updated March 2012) %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Dragon book: Symbol Tables 7.6 (429-440) A symbol table is for storing the information associated with the various symbols used in the language. It is the main (inherited) attribute after the parse stage which is used in creating the intermediate representation. A symbol table can also be used in the lexical analysis stage to disambiguate tokens; this can make the parsers job considerably easier. However, the main use of the symbol table is in the semantic analysis phase whose aim is to generate the intermediate representation. There are three main functions a symbol table must implement: insert - inserting a newly introduced variable or function name into the table with its required attributes. lookup - looking up the attributes associated with a name. delete - removing items from the symbol table whose declarations are no longer visible. Basic symbol table ------------------ A basic implementation of a symbol table is as a linear list. Each item of the list is a pair: the name (a string) and an associated attribute (the type of the symbol and any other attributes one may need to know for code generation). This is a simple approach with a constant time insertion but a lookup (and possibly deletion) time in the worst case proportional to the number of elements in the list. More sophisticated implementations may use binary trees, AVL-trees, or (most usually) hash tables to improve the look up time. If you use a hash table ensure that the hashing function is suitable for the sorts of variable names which are used in programming languages (or else you will get an abnormal number of collisions). In addition it may be useful to have a header for the symbol table where attributes of the whole table can be stored (for example the current free offset, what it is the symbol table for (e.g. main, function, or block). Handling Scope -------------- When a variable is declared that delaration is only visible in the body associated with the declaration. Furthermore, a declaration in a subblock of a symbol with the same name as a symbol declared in an enclosing block will supercede the original meaning of the symbol in the enclosing block. Thus, the inner declaration punches a hole in the scope of the inner declaration. A symbol table should be able to handle these scope issues. In particular, as one leaves the scope of a definition one should be able to "delete" the symbols introduced in that inner scope easily. There are several approaches to implementing this, here we outline two: (a) Stacking values in the same symbol table: each name in the symbol table has a list of attribute values associated with it. When a new block is entered the attributes of the symbol defined are pushed on top of any existing attribute values, thus, hiding them. When the scope of a block is left it is necessary to remove all the declarations introduced in that block. This means that the introduced declarations of a block must be marked in some manner so that they can be removed. A simple scheme is to associate with each set of attributes (associated with a name) a level number. The deletion is then of a level. The disadvantage of this scheme is that one must then traverse the entire symbol table to delete a level. This is rather expensive! A more efficient alternative is to chain the introduced symbols together (by pointers) and then remove them by traversing this chain. Here the symbol table also needs a stack of starts for these chains as an extra structure. (b) Stacking symbol tables: the other approach is to simply open up a new symbol table when one enters a block. To find a symbol then involves seaching through the symbol tables to find the "highest" level in which the symbol appears. This is not very efficient ... on the otherhand deletion, the removal of a level of scope is efficient. The great advantage of (b) is that it is simple. Its (significant) disadvantage is that look up is slow. For a full blown compiler it may well be worth the extra effort to develop a symbol table of type (a) with an efficient deletion. Note, however, you will need to solve all the other issues of offset calculation etc. Example: Consider the following program fragment: var x:int; fun f(y:int, z:int):int { var x,a[y][z]:int; fun g(a[][]:int,b:int):int { var y:int; begin ... a[i][j]:= f(x,v); ... end } begin ... end }; var v:int; begin .... end; We illustrate what a symbol table of type (b) should look like when we are trying to generate code for the assignment [a:= f(x,v)]. This symbol table is built by first building the symbol table for main, next the symbol table for f and finally for g: [ Symbol_table(1,2,[ ("a",Var_attr(-5,M_int,0)) , ("b",Var_attr(-4,M_int,0)) , ("y",Var_attr(1,M_int,0))]) , Symbol_table(1,2,[ ("y",Var_attr(-5,M_int,0)) , ("z",Var_attr(-4,M_int,0)) , ("x",Var_attr(1,M_int,0)) , ("a",Var_attr(2,M_int,2)) , ("g",Fun_attr(code_label_g,[(M_int,0),(M_int,2)],M_int))]) , symbol_table(2,0,[ ("x",Var_attr(1,M_int,0)) , ("f",Fun_attr(code_label_f,[(M_int,0),(M_int,0)],M_int)) , ("v",Var_attr(2,M_int,0))]) ] Now when we look up "a" in this symbol table we find it immediately. Because it is an argument of the function it has a negative offset (it occurs before the frame pointer) and because it is an array we expect it to be passed as a pointer into the stack with the first two entries indicating the size of the two dimensions of the array. In this way with each symbol we may associate a level, with each variable an offset, and with each function a code label: name level type attribute ------------------------------------------------------------ "a" 0 array(int,2) -5 offset "f" 2 ([array(int,0),array(int,0)],int) code_label_f "x" 1 array(int,0) 1 offset "v" 2 array(int,0) 2 offset These attributes can then be used in the generation of code. Notice that each variable is viewed as a potential array ... it is an ordinary variable when the number of dimensions is zero (so for example "array(int,0)" may be read as "int"). Programming a symbol table in Haskell ===================================== While you will not be programming the m+ symbol table in Haskell it is quite instructive to see how it is done. One particularly nice feature of Haskell is its module system which allows one to specify interfaces before one writes the program. This allows one to provide a simple, although perhaps inefficient, implementation and later return to implement a more efficent symbol table which can be substituted without affecting the rest of the program. Below is an almost complete implementation of a symbol table for m+ (the label generating functions are missing)!! The two main operations we want to be able to perform on a symbol table are the insertion and lookup. The deletion will actually take care of itself in this simple design through the way that Haskell does garbage collection. Insertion: ---------- We first need to think about the form of the things we will wish to insert in the symbol table. The following datatype describes what we will be allowed to insert: data SYM_DESC = ARGUMENT (String,M_type,Int) | VARIABLE (String,M_type,Int) | FUNCTION (String,[(M_type,Int)],M_type) | DATATYPE String | CONSTRUCTOR (String, [M_type], String) M_type is the datatype of m+ types. This distinguishes five types of thing we may want to insert (1) an argument with its type and dimension (2) a (local) variable with its type and dimension (3) a (local) function with its argument types (which may be arrays) and output type (which cannot be an array) (4) a user defined type with its name (5) a constructor with its argument types and the user defined type (for which it is a constructor). in each case we insert a string with its attribute into the symbol table. However, the symbol table will treat the cases rather differently. For example in the case of a function insertion the symbol table must create a label (to which to jump when the function is invoked). In the case of an argument insertion the offset calculation is completely different from a local variable insertion. We will return to this when we discuss the implementation. The insertion function will have the following type: insert::Int -> ST -> SYM_DESC -> (Int,ST) where ST is the type of the symbol table. The first integer argument is the current function label number: this must be updated as we travese the synatx tree. If we try to insert an already locally declared symbol into the symbol-table then an error will be raised. This can be handled through Haskell's exceptions. Lookup: ------- When we do a look up we expect back certain information: again I will create a special datatype for the returned information so that I know exactly what information is to be passed back: datatype SYM_I_DESC = I_VARIABLE (Int,Int,M_type,Int) | I_FUNCTION (Int,String,[(M_type,Int)],M_type) | I_CONSTRUCTOR (Int,[M_type],String) | I_TYPE [String]; Here there are only two variants (1) The symbol is a variable I_VARIABLE(level, offset, type, dim) it was found at level in the symbol table and is at offset from the frame pointer of that activation record. The variable is of type and has dim dimensions. (2) The symbol is a function I_FUNCTION(level,label,argument_types,output_type) it was found at level , the label to which you must jump is