Author: Robin Cockett
Date: January 8, 2014

(Chapter 1: Louden & ASU)

Compiler as a translator:
      source language  --------| COMPILER |--->  target language

A compiler is a program which reads in a "program" in a source language and translates it into a program (often executable) in a target language.   We might expect that the program produced will be a "good translation" in some measurable sense ...

An interpreter (officially) is not a compiler.  An interpreter reads in a source program and produces the result of running or executing that program.  This involves actually executing the program in some fashion!  The technology required for compiling is also needed to produce an interpreter, as an interpreter usually has an internal abstract machine code which it can execute and into which source programs must be compiled.  Many modern programming language systems blur the traditional distinction between interpreter and compiler, as they support both!
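The distinction can be made concrete with a toy example.  Here is a sketch (in Python, over a hypothetical tuple-based AST) of the interpreter side: rather than translating the program, it directly produces the result of executing it.

```python
def interpret(expr, env):
    """Execute a (toy, tuple-based) expression AST directly and return its
    value; env maps identifiers to values.  No target program is produced --
    this direct execution is what makes it an interpreter."""
    kind = expr[0]
    if kind == "num":                  # literal
        return expr[1]
    if kind == "id":                   # variable lookup
        return env[expr[1]]
    left = interpret(expr[1], env)     # evaluate, rather than translate,
    right = interpret(expr[2], env)    # the subexpressions
    if kind == "add":
        return left + right
    if kind == "mul":
        return left * right
    raise ValueError("unknown construct: " + kind)
```

A compiler, by contrast, would walk the same tree and emit code to be run later.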


What should a compiler do?
  1. Preserve the meaning of the original source code
  2. Produce reasonably (space and time) efficient target code
  3. Perform its job quickly! (order of the size of the program or O(n log n)...)
Compiling is not everything!

A compiler is usually just one step in a number of steps which lead to the production of "absolute machine code" ... however it is perhaps the biggest step!
              ______________        _________        __________        ______
             |              |      |         |      |          |      |      |
        ->---|  preprocess  |-->---| compile |-->---| assemble |-->---| link |--->
             |______________|      |_________|      |__________|      |______|
      source            source              assembler          relocatable     absolute
                                                               object code     machine code


Overview of a compiler:

  String
       |
   ____|_______                                   ____
  | Lexical    |                                 |
  | Analysis   |                                 |
  | (Scanning) |                                 |  Syntax:
  |____________|                                 |
       |                                         |  O(|Prog|)
  Token list                                     |
   ____|_______                                  |
  | Grammatical|                                 |
  | Analysis   |                                 |
  | (Parsing)  |                                 |____
  |____________|
       |
  Syntax Tree (AST)
       |
   ________      ____|_______      _________      ____
  | Symbol  |   | Semantic   |    | Error   |    |  Semantics:
  | Table   |---| Analysis   |----| Handler |    |  O(|C| log |C|)
  |________|    |____________|    |_________|    |____
       |
  Intermediate representation (IR)
   ____|_______                                   ____
  | Optim-     |                                 |  Most of the problems
  | ization    |                                 |  here are NP-complete:
  |____________|                                 |  approximate algorithms
       |                                         |  are appropriate!
      (IR)                                       |
   ____|_______                                  |
  | Code       |                                 |
  | Generat'n  |                                 |
  |____________|                                 |____
       |
  Assembler

Let us follow an expression through a compiler:

           position := initial + rate * 60  

The first question one must ask is whether this is a valid expression
in the given source language.  For this we need
(a) A formal definition of the language
(b) An effective membership test
(c) A plan for how to handle failure (i.e. error recovery!)

Usually the syntax of a language is specified in two stages:

  1. a specification of the lexemes of the language (that is the substrings  of the input which will have special significance to the compiler) and  their correspondence to the tokens  ...
  2. a specification of the grammar of the language based on tokens (obtained by translation from the lexemes) which are the terminal  symbols of the language.  This is usually done by providing a context free grammar.

STEP 1:   Lexical analysis      (micro syntax  -- regular expressions)

Group characters together to form lexemes, which are translated into tokens: the terminals of the grammar used in the next step.

Lexeme        Token
"127"         INT(000000001111111)
"length"      ID("length")
"+"           ADD

This can be efficiently implemented using deterministic finite automata (DFAs).  These are often specified using "patterns" which are essentially regular expressions (we will eventually be using LEX).
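As a sketch of how patterns drive scanning, here is a minimal Python tokenizer (the token names follow the table above; a real scanner would be generated from such patterns by LEX rather than written by hand):

```python
import re

# Each entry pairs a token name with a regular expression pattern.
# The scanner repeatedly matches at the current position and emits
# the token whose pattern matched.
TOKEN_PATTERNS = [
    ("NUM",    r"[0-9]+"),
    ("ID",     r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("ASSIGN", r":="),
    ("ADD",    r"\+"),
    ("MUL",    r"\*"),
    ("SKIP",   r"[ \t]+"),   # whitespace: a lexeme, but not a token
]
MASTER = re.compile("|".join("(?P<%s>%s)" % p for p in TOKEN_PATTERNS))

def tokenize(source):
    """Translate a string of characters into a list of (token, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError("illegal character at position %d" % pos)
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

For example, tokenize("position := initial + rate * 60") yields the token list consumed by the parser in the next step.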

STEP 2:   Grammatical analysis    (macro syntax - context free)

The structure of the language is defined by a context free grammar, and it is determined whether the stream of tokens belongs to the language generated by that grammar.

             ---- example grammar for mini language ----
prog -> stmt

stmt ->    IF expr THEN stmt ELSE stmt
             |   ID ASSIGN expr
             |   PRINT expr
             |   BEGIN stmtlist END

stmtlist -> stmt morestmtlist

morestmtlist -> SEMICOLON stmtlist
             |   (empty)

expr -> term moreterms

term ->   factor morefactors
             |  SUB term

factor ->   LPAR expr RPAR
              |   ID
              |   NUM

morefactors -> MUL factor morefactors
              |  DIV factor morefactors
              |  (empty)

moreterms -> ADD term moreterms
             |   SUB term moreterms
             |   (empty)


Efficient parsers are built using pushdown automata (PDA): LL(n) grammars admit predictive parsing, while LR(n) grammars use shift-reduce parsing.  A very simple parsing technique is "recursive descent parsing", which requires an LL(n) grammar.
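As a sketch of recursive descent, here is a recognizer (Python, hypothetical names) for the expression part of the grammar above: each nonterminal becomes a procedure, and the token list produced by the scanner is consumed left to right.  It assumes the "more..." nonterminals may also derive the empty string, as is usual for grammars in this style.

```python
def parse_expr(tokens):
    """Recursive descent recognizer: True iff tokens derive expr exactly."""
    pos = [0]  # mutable cursor shared by the nested procedures

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def eat(tok):
        if peek() != tok:
            raise SyntaxError("expected %s, saw %s" % (tok, peek()))
        pos[0] += 1

    def expr():            # expr -> term moreterms
        term(); moreterms()

    def moreterms():       # moreterms -> ADD term ... | SUB term ... | (empty)
        while peek() in ("ADD", "SUB"):
            eat(peek()); term()

    def term():            # term -> SUB term | factor morefactors
        if peek() == "SUB":
            eat("SUB"); term()
        else:
            factor(); morefactors()

    def morefactors():     # morefactors -> MUL factor ... | DIV ... | (empty)
        while peek() in ("MUL", "DIV"):
            eat(peek()); factor()

    def factor():          # factor -> LPAR expr RPAR | ID | NUM
        if peek() == "LPAR":
            eat("LPAR"); expr(); eat("RPAR")
        elif peek() in ("ID", "NUM"):
            eat(peek())
        else:
            raise SyntaxError("unexpected token %s" % peek())

    try:
        expr()
        return pos[0] == len(tokens)   # all input must be consumed
    except SyntaxError:
        return False
```

Note how each procedure decides which production to use by looking only at the next token: this one-token lookahead is exactly the LL(1) property.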

STEP 3:   Semantic analysis            (meaning --- type checking!)

Consider again the assignment

        position := initial + rate * 60

Does this syntactically legal expression actually mean anything?  How do we tell?

Suppose that all the variables are declared as real numbers then the meaning of "+" is "real addition" which will be executed rather  differently  than if all the declared variables had been integers and its meaning had been "integer addition".  Furthermore, in the former case we will also have to turn the integer "60" into a real ...

The main component of semantic checking is checking that variables have been declared and type checking expressions.
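As a sketch, the declaration check and the int/real typing rule can be written over a toy tuple-based AST (Python; the function and AST shapes are hypothetical, for illustration only):

```python
def check(expr, symtab):
    """Return the type ('int' or 'real') of expr, where symtab maps declared
    variables to their types.  Undeclared variables are the main semantic
    error; mixed int/real operands force a coercion."""
    kind = expr[0]
    if kind == "num":
        return "int"                      # integer literal such as 60
    if kind == "id":
        if expr[1] not in symtab:
            raise TypeError("undeclared variable: " + expr[1])
        return symtab[expr[1]]
    if kind in ("add", "mul"):
        left = check(expr[1], symtab)
        right = check(expr[2], symtab)
        # if either side is real, the int side must be coerced (int2real)
        # and the operation becomes real addition/multiplication
        return "real" if "real" in (left, right) else "int"
    raise TypeError("unknown construct: " + kind)
```

On the running example, with position, initial, and rate declared real, the whole expression checks as real and the literal 60 is the operand needing an int2real coercion.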

At the end of this stage we know that we have a legal program.

STEP 4:   Intermediate code generation

Once the program is known to be legal, we generate an intermediate representation of the code.  This representation may be sufficiently close to machine code that it facilitates the translation to machine code, and yet sufficiently high level that it facilitates optimization.  A typical intermediate code is "three address code".  Each instruction in this code is a simple assignment consisting of an assigned variable, a binary or unary operation, and its arguments.  The above assignment might translate to:

                 temp1 := int2real(60)
                 temp2 := id3 * temp1
                 temp3 := id2 + temp2
                 id1 := temp3
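A sketch of how such a sequence might be generated from the syntax tree (Python; the helper names and the tuple instruction format are hypothetical, and the int2real coercion step is omitted to keep the sketch short):

```python
from itertools import count

def gen(expr, code, fresh):
    """Emit three address instructions for expr into code and return the
    name (program variable, literal, or temporary) holding its value."""
    kind = expr[0]
    if kind in ("num", "id"):
        return str(expr[1])
    left = gen(expr[1], code, fresh)     # each emitted instruction has one
    right = gen(expr[2], code, fresh)    # operation and at most two arguments
    temp = "temp%d" % next(fresh)
    code.append((temp, {"add": "+", "mul": "*"}[kind], left, right))
    return temp

def compile_assignment(target, expr):
    """Translate 'target := expr' into a list of three address instructions."""
    code, fresh = [], count(1)
    result = gen(expr, code, fresh)
    code.append((target, ":=", result, None))
    return code
```

Fresh temporaries are invented for every intermediate value; it is the optimizer's job (next step) to get rid of the redundant ones.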

Modern compilers often use higher level intermediate languages which have associated transformations and which can be reduced to this form.

STEP 5: Optimization of intermediate code

The intermediate representation is optimized using program transformation techniques.

Code optimization is a major topic in itself!  Some of it is covered in CPSC510, where you are required to build a complete compiler with optimizations.

 These transformations, applied to the above code, may have the following effect:

                temp1 := id3 * 60.0
                id1 := id2 + temp1
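Two classic transformations suffice for this effect: constant folding (the conversion int2real(60) is done at compile time, giving 60.0) and copy propagation (the copies through temporaries are eliminated).  A toy sketch in Python, over a hypothetical tuple instruction format (dest, op, arg1, arg2); it handles only this shape of code, and the surviving temporary names may differ from those above:

```python
def optimize(code):
    """Constant folding of int2real on literals, then copy propagation."""
    env, out = {}, []
    for dest, op, a, b in code:
        a, b = env.get(a, a), env.get(b, b)    # substitute known values
        if op == "int2real" and a.isdigit():
            env[dest] = a + ".0"               # fold conversion at compile time
        elif op == ":=":
            env[dest] = a                      # pure copy: emit nothing yet
        else:
            out.append([dest, op, a, b])
    # a copy x := t (x a program variable, t a temporary) is removed by
    # renaming the instruction that defined t so it assigns x directly
    for name, repl in env.items():
        if not name.startswith("temp"):
            for instr in out:
                if instr[0] == repl:
                    instr[0] = name
    return [tuple(i) for i in out]
```

On the four-instruction sequence above this yields the two-instruction form: a real multiplication by the folded constant 60.0, then an addition assigned directly to id1.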

STEP 6:  Code generation

Intermediate code is translated to a suitable assembler code.  Here is the SPARC assembler code for this!

        ld      [%fp-16],%f4
        ld      [%fp-12],%f3
        sethi   %hi(.L_cseg2),%l0
        or      %l0,%lo(.L_cseg2),%l0
        ld      [%l0+0],%f2
        fmuls   %f3,%f2,%f2
        fadds   %f4,%f2,%f2
        st      %f2,[%fp-8]

        **************   USEFUL INFO! **************

To obtain this I actually wrote a short C program

        #include <stdio.h>

        int main(void)
        {
            float position, rate, initial;

            initial = 12.0;
            rate = 3;
            position = initial + rate * 60;
            printf("%f %f %f \n", position, initial, rate);
            return 0;
        }

and then compiled it with the  -S  option.  This produces an assembler (.s) file for the program.  If you want to see how statements should be translated into assembler, this reverse engineering technique is invaluable .... when you get to the code generation stage (next course: CPSC510).

STEP 7:  Assembler optimizations

In generating assembler there are a number of important issues; these are beyond the scope of this course (see CPSC510).


In summary, the attributes of a good compiler:

  1.  Correctness (does it preserve meaning? Not as easy as it sounds, but very important!)
  2.  Compiles quickly (complexity of compiling a program O(n log n) ... remember bootstrapping!)
  3.  Output execution speed (how fast is the output code?)
  4.  Output footprint (how large is the code? how much memory does it use?)
  5.  Separate compilation (relocatable code, linking).
  6.  User friendly front-end: good error recovery.
  7.  Debugging facilities.
  8.  Cross language calls -- interface compatibilities.
  9.  Understandable and correct optimizations.