CPSC411 -- The symbol table

Author: Robin Cockett
Date: March 6, 2003 (updated March 2012)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Dragon book: Symbol Tables 7.6 (429-440)

A symbol table is for storing the information associated with the various 
symbols used in the language.  It is the main (inherited) attribute after 
the parse stage which is used in creating the intermediate representation.
A symbol table can also be used in the lexical analysis stage to 
disambiguate tokens; this can make the parsers job considerably 
easier.  However, the main use of the symbol table is in the semantic analysis 
phase whose aim is to generate the intermediate representation.

There are three main functions a symbol table must implement:

         insert  -  inserting a newly introduced variable or function 
                    name into the table with its required attributes.
         lookup  -  looking up the attributes associated with a name.
         delete  -  removing items from the symbol table whose declarations
                    are no longer visible.


Basic symbol table
------------------

A basic implementation of a symbol table is as a linear list.  Each item of 
the list is a pair: the name (a string) and an associated attribute (the 
type of the symbol and any other attributes one may need to know for code 
generation).

This is a simple approach with a constant time insertion but a lookup 
(and possibly deletion) time in the worst case proportional to the number 
of elements in the list.   

More sophisticated implementations may use binary trees, AVL-trees, or 
(most usually) hash tables to improve the look up time.  If you use a hash 
table ensure that the hashing function is suitable for the sorts of variable 
names which are used in programming languages (or else you will get an 
abnormal number of collisions).

In addition it may be useful to have a header for the symbol table where 
attributes of the whole table can be stored (for example the current free 
offset, what it is the symbol table for (e.g. main, function, or block).


Handling Scope
--------------

When a variable is declared that delaration is only visible in the 
body associated with the declaration.  Furthermore, a declaration 
in a subblock of a symbol with the same name as a symbol declared in an 
enclosing block will supercede the original meaning of the symbol in 
the enclosing block.  Thus, the inner declaration punches a hole in the 
scope of the inner declaration. A symbol table should be able to handle 
these scope issues.  In particular, as one leaves the scope of a 
definition one should be able to "delete" the symbols introduced in 
that inner scope easily.

There are several approaches to implementing this, here we outline two:

(a) Stacking values in the same symbol table:  each name in the symbol table 
has a list of attribute values associated with it.  When a new block is 
entered the attributes of the symbol defined are pushed on top of any 
existing attribute values, thus, hiding them.

When the scope of a block is left it is necessary to remove all the 
declarations introduced in that block.  This means that the introduced 
declarations of a block must be marked in some manner so that they can 
be removed.  A simple scheme is to associate with each set of attributes 
(associated with a name) a level number.  The deletion is then of a level.
The disadvantage of this scheme is that one must then traverse the entire 
symbol table to delete a level.  This is rather expensive!

A more efficient alternative is to chain the introduced symbols together 
(by pointers) and then remove them by traversing this chain.  Here the 
symbol table also needs a stack of starts for these chains as an extra 
structure.  

(b)  Stacking symbol tables:  the other approach is to simply open up
a new symbol table when one enters a block.  To find a symbol then involves 
seaching through the symbol tables to find the "highest" level in which 
the symbol appears.  This is not very efficient ... on the otherhand 
deletion, the removal of a level of scope is efficient.

The great advantage of (b) is that it is simple.  Its (significant) 
disadvantage is that look up is slow.  For a full blown compiler 
it may well be worth the extra effort to develop a symbol table of 
type (a) with an efficient deletion.  Note, however, you will need to 
solve all the other issues of offset calculation etc.


Example:  

Consider the following program fragment:

     var x:int;
     fun f(y:int, z:int):int
         { var x,a[y][z]:int;
           fun g(a[][]:int,b:int):int
               { var y:int;
                 begin
                    ...  a[i][j]:= f(x,v); ...
                 end
               }
            begin  ... end
         };
     var v:int;
     begin 
           ....
     end;


We illustrate what a symbol table of type (b) should look like when 
we are trying to generate code for the assignment [a:= f(x,v)].  This 
symbol table is built by first building the symbol table for main, 
next the symbol table for f and finally for g:

    [ Symbol_table(1,2,[ ("a",Var_attr(-5,M_int,0))
                       , ("b",Var_attr(-4,M_int,0))
                       , ("y",Var_attr(1,M_int,0))])
    , Symbol_table(1,2,[ ("y",Var_attr(-5,M_int,0))
                       , ("z",Var_attr(-4,M_int,0))
                       , ("x",Var_attr(1,M_int,0))
                       , ("a",Var_attr(2,M_int,2))
                       , ("g",Fun_attr(code_label_g,[(M_int,0),(M_int,2)],M_int))])
    , symbol_table(2,0,[ ("x",Var_attr(1,M_int,0))
                       , ("f",Fun_attr(code_label_f,[(M_int,0),(M_int,0)],M_int))
                       , ("v",Var_attr(2,M_int,0))])
    ]

Now when we look up "a" in this symbol table we find it immediately.  
Because it is an argument of the function it has a negative offset 
(it occurs before the frame pointer) and because it is an array we expect 
it to be passed as a pointer into the stack with the first two entries 
indicating the size of the two dimensions of the array.  In this way 
with each symbol we may associate a level, with each variable an offset, 
and with each function a code label:

              name     level    type                             attribute 
              ------------------------------------------------------------
              "a"       0      array(int,2)                            -5 offset
              "f"       2      ([array(int,0),array(int,0)],int)       code_label_f
              "x"       1      array(int,0)                            1 offset
              "v"       2      array(int,0)                            2 offset

These attributes can then be used in the generation of code.  Notice that each variable is viewed as a 
potential array ... it is an ordinary variable when the number of dimensions is zero  (so for example 
"array(int,0)" may be read as "int").


Programming a symbol table in Haskell
=====================================

While you will not be programming the m+ symbol table in Haskell it is quite 
instructive to see how it is done.  One particularly nice feature of 
Haskell is its module system which allows one to specify interfaces before one 
writes the program.  This allows one to provide a simple, although perhaps 
inefficient, implementation and later return to implement a more efficent 
symbol table which can be substituted without affecting the rest of the 
program.

Below is an almost complete implementation of a symbol table for m+ (the 
label generating functions are missing)!!

The two main operations we want to be able to perform on a symbol table 
are the insertion and lookup.  The deletion will actually take care of itself 
in this simple design through the way that Haskell does garbage collection.  

Insertion:
----------

We first need to think about the form of the things we will wish to insert 
in the symbol table.  The following datatype describes what we will be 
allowed to insert:

data SYM_DESC = ARGUMENT (String,M_type,Int)
              | VARIABLE (String,M_type,Int)
              | FUNCTION (String,[(M_type,Int)],M_type)
              | DATATYPE String 
              | CONSTRUCTOR (String, [M_type], String)


M_type is the datatype of m+ types.

This distinguishes five types of thing we may want to insert
      (1) an argument with its type and dimension
      (2) a (local) variable with its type and dimension
      (3) a (local) function with its argument types (which may be arrays) and 
          output type (which cannot be an array)
      (4) a user defined type with its name
      (5) a constructor with its argument types and the user defined 
          type (for which it is a constructor).
in each case we insert a string with its attribute into the symbol table. However, 
the symbol table will treat the cases rather differently.  For example in the case of a 
function insertion the symbol table must create a label (to which to jump when 
the function is invoked).  

In the case of an argument insertion the offset calculation is completely 
different from a local variable insertion.  We will return to this when 
we discuss the implementation.

The insertion function will have the following type:

    insert::Int -> ST -> SYM_DESC -> (Int,ST)

where ST is the type of the symbol table.  The first integer argument is the 
current function label number: this must be updated as we travese the synatx tree.
If we try to insert an already locally declared symbol into the symbol-table then an 
error will be raised.  This can be handled through Haskell's exceptions.  

Lookup:
-------

When we do a look up we expect back certain information: again I will create 
a special datatype for the returned information so that I know exactly what 
information is to be passed back:

datatype SYM_I_DESC = I_VARIABLE (Int,Int,M_type,Int)
                    | I_FUNCTION (Int,String,[(M_type,Int)],M_type)
                    | I_CONSTRUCTOR (Int,[M_type],String)
                    | I_TYPE [String];


Here there are only two variants 
         (1) The symbol is a variable 
                 I_VARIABLE(level, offset, type, dim)
             it was found at level <level> in the symbol table and is at 
             offset <offset> from the frame pointer of that activation record.
             The variable is of type <type> and has dim dimensions.
         (2) The symbol is a function
                 I_FUNCTION(level,label,argument_types,output_type)
             it was found at level <level>, the label to which you must jump 
             is <label>.  The types of its arguments are <argument_types> and the 
             type of its returned value is <output_type>.
         (3) The symbol is a constructor
                 I_CONSTRUCTOR(<constructor_number>,<argument_types>,<output_type>)
             with type <output_type> and arguments types <argument_types> and 
             constructor number <constructor_number>.
         (4) The symbol is a user defined type 
                 I_TYPE <constructor_names>
             with constructor names <constructor_names>,.

Again if anything goes wrong I will raise an exception!  This means that a 
retrieval is a function of the following type:

   lookup: ST -> String -> SYM_I_DESC

Scope change and the empty symbol table stack:
---------------------------------------------

We have to supply three further functions in the Haskell implementation:

   empty:: ST
   new_scope:: ScopeType -> ST -> ST
   return:: ST -> M_type
   
where ScopeType is:

data ScopeType = L_PROG | L_FUN M_Type | L_BLK | L_CASE

Here we are distinguishing the sorts of scopes we can have: this is necessary 
but can be achieved in various ways.   One thing to note is that it is useful 
to know the return type of a function: this is so one can correctly type the 
return statement.

The first  of these functions simply supplies an empty symbol table 
while the second starts a new symbol table "scope level".  The return 
function provides the return type of the current function (if it is 
not a function an error is generated).

In a C implemenentation you may need a further function to remove 
a scope level:

    remove_scope: ST -> ST


The signature of the symbol table
---------------------------------

Collecting this together we have an an abstract datatype which is a symbol 
table.  This datatype has the above mentioned functions which allow one 
to manipulate the contents.  In Haskell we may represent this as a module:

module SymbolTable (ST,empty,new_scope,insert,lookup,return)
where 
   import SymbolTypes  
   empty:: ST 
   new_scope:: ScopeType -> ST -> ST
   insert:: Int -> ST -> SYM_DESC -> (Int,ST)
   lookup:: ST -> string -> SYM_I_DESC
   return:: ST -> M_type


Here ST is the abstract datatype of the stack.  As things can go wrong we 
allow some exceptions to be raised.  Finally we need the functions 
we have described above to manipulate the symbol table.


The implementation:
-------------------

The last step is to implement this signature.  

(a) Internal and exported datatypes:
    -------------------------------

The implementation of the symbol table uses datatypes which are not 
exported to the outside world.  They might look like:

   data SYM_VALUE = Var_attr (Int,M_type,Int)
                  | Fun_attr (String,[(M_type,Int)],M_type)
                  | Con_attr (Int, [M_type], String)
                  | Typ_attr [String]


   data SYM_TABLE = Symbol_table (Int,Int,[(String,SYM_VALUE)])

   type ST = [SYM_TABLE]

The type of symbol table value (which is private) can be either the 
attributes of a function, a variable, a constructor, or a user defined type.  
For a variable we hold its offset (which can be negative) and its type and 
dimension.  For a function we hold its label, its argument types, and its 
output type.  For a constructor we hold its (internally assigned) constructor 
number, its argument types (no arrays) and the user defined type for which 
it is a constructor.

A symbol table level is essentially a list of these values paired with 
strings together with two integers to keep track of the positive and 
negative offsets.  The first integer is the number of local variables 
inserted into the table. The next local variable inserted must 
increment this value and take this as its offset.  The second integer is the 
number of arguments: the next argument inserted must increment this value 
and take as it negative offset three more than this incremented value.    
Note that the arguments must be inserted in reverse order into this symbol 
table.  Also remember that arguments occur BEFORE the frame pointer so 
will be accessed by negative offsets.

A symbol table is a list of symbol table levels and this (as an abstract 
type) is exported to the outside world.


(b) The  empty symbol table and adding a level:
    -------------------------------------------

These are the two simplest functions they are:

   empty = []

   new_scope:: ST -> ST 
   newscope s = (Symbol_table(0,0,[])):s

(b) Lookup:
    -------

The next simplest function is the lookup.  Here we look down through the 
levels until we find the symbol: 

   lookup:ST -> String -> SYM_I_DESC = 
   look_up s x = find 0 s where
      found level (Var_attr(offset,type,dim)) 
                    =  I_VARIABLE(level,offset,type,dim)
      found level (Fun_attr(label,arg_Type,type)) 
                    = I_FUNCTION(level,label,arg_Type,type)
      ...

      find_level ((str,v):rest)|x== str = Just v
                               |otherwise =  find_level rest
      find_level [] = Nothing

      find n [] = error ("Could not find "++ str)
      find n (Symbol_table(_,_,vs)::rest) = 
             (case find_level vs of 
	          Just v -> found n v
		  Nothing -> find (n+1) rest)


The subsiduary function "found" simply converts the data retrieved into the 
form required by the lookup from the internal form.  "find_level" searches 
a level for the symbol while "find" searches the levels in turn until the 
symbol is found raising an exception when it is not found.


(c)  Insertion:

When we insert information we must generate the correct offsets AND for 
function symbols a new label:


  insert:: Int -> ST -> SYM_DESC -> (Int,ST) = 
  insert n [] d =  error "Symbol table error: insertion before defining scope."
  insert n ((Symbol_table(nL,nA,sL)):rest) (ARGUMENT(str,t,dim)) 
           | (in_index_list str sL) = error
                ("Symbol table error: " ++ str ++"is already defined.")
           | otherwise = (n,Symbol_table(nL,nA+1
                             ,(str,Var_attr(~(nA+4),T,dim))::sL))
  insert n ((Symbol_table(nL,nA,sL)):rest) (VARIABLE (str,T,dim)) 
           | (in_index_list str sL) 
	       = error ("Symbol table error: "++ str ++"is already defined.")
           | otherwise = (n,Symbol_table(nL+1,nA
                             ,(str,Var_attr(nL+1,T,dim))::sL))
  insert n ((Symbol_table(nL,nA,sL)):rest) FUNCTION (str,Ts,T)
           | in_index_list str sL 
	       = error ("Symbol table error: "++str++"is already defined.")
           | otherwise = (n+1,(Symbol_table(nL,nA,(str,Fun_attr(getlabel n "fn",Ts,T)):sL)
                              ):rest)
  ....
	where in_index_list str [] = False
              in_index_list str ((x,_):xs)| str==x = True
                                          | otherwise = in_index_list str xs


Removing scope:
--------------

This simply removes a scope level and returns the number of local variables 
in that level together with the modified symbol table.


Problems with blocks!!!
-----------------------

How to handle blocks?

Probably the best way is to treat them is almost like function calls.  The main differences
are that nothing is returned and the program counter continues.  So on leaving a block 
one does need to clean up the activation record and to reset the frame pointer but no
arguments are returned nor is there a change to the code pointer.  The main thing 
that does happens is the deallocation of the activation record.